How to Diagnose Benchmark Validity With BenchBrowser

BenchBrowser is most useful when you treat retrieval as evidence gathering, then ask two separate questions: whether the evidence covers distinct facets, and whether those evaluation sets produce stable model rankings.


Example query: programming in Python (example sources shown: HumanEval, MMLU)

Recommended Workflow

Start with a concrete retrieval query, then use the two diagnosis pages to separate evidence coverage from model-ranking stability.

Retrieve evidence

Enter a broad capability or facet query. Inspect the returned samples and save the items that genuinely match the intended construct.

Compare facets

Open Content Validity Diagnosis and compare 2-4 facet queries. Low overlap means the facets are supported by more distinct evidence.

Compare rankings

Open Convergent Validity Diagnosis and compare benchmark rankings or retrieved slices. Stable rankings are stronger convergence evidence.

Inspect causes

Use shared samples, source benchmarks, rank movers, and rubric warnings to decide whether a validity risk is methodological or substantive.

Content Validity: Are Facets Covered by Distinct Evidence?

Use this workflow when you have multiple facets of one broad capability, such as Python coding, Java coding, debugging, and algorithmic reasoning.

Build 2-4 facets

Cached queries reuse prior retrievals. New queries run the retrieval pipeline from the diagnosis page.

Read overlap as risk

Example reading: 60% overlap risk

Signal of low content validity: high overlap suggests that multiple facets are supported by the same generic benchmark items.

Overlap is reported as risk, not proof. A 60% overlap means that 60% of the samples in the smaller facet set also appear in the other facet set.
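The overlap reading described above can be reproduced by hand. The sketch below assumes each facet is a set of sample IDs and that overlap is the shared count divided by the size of the smaller set; the function name `overlap_risk` and the sample IDs are hypothetical, not part of BenchBrowser.

```python
from itertools import combinations


def overlap_risk(facet_a, facet_b):
    """Percentage of the smaller facet set that also appears in the other set."""
    a, b = set(facet_a), set(facet_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # an empty facet has nothing to share
    return 100.0 * len(a & b) / smaller


# Hypothetical retrieved sample IDs for two facets.
facets = {
    "python coding": {"he-1", "he-2", "he-3", "he-4", "he-5"},
    "debugging": {"he-1", "he-2", "he-3", "mm-7", "mm-8", "bb-2", "bb-9"},
}

# Compare every pair of facets (the diagnosis page compares 2-4 at a time).
for (name_a, ids_a), (name_b, ids_b) in combinations(facets.items(), 2):
    print(f"{name_a} vs {name_b}: {overlap_risk(ids_a, ids_b):.0f}% overlap risk")
```

Here three of the five samples in the smaller facet are shared, so the pair reads as 60% overlap risk.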

Inspect source benchmarks

Example panel: HumanEval: 8 shared, MMLU: 4 shared, BBH: 2 shared

The panel lists benchmarks contributing the most shared evidence, and each facet column separates unique from shared samples.
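A rough way to reproduce the source-benchmark breakdown outside the tool is to count shared samples per benchmark. This sketch assumes each facet is a mapping from sample ID to source benchmark name; that data shape and the name `shared_by_benchmark` are illustrative assumptions.

```python
from collections import Counter


def shared_by_benchmark(facet_a, facet_b):
    """Count shared samples per source benchmark, most frequent first.

    Each facet is a dict mapping sample ID -> source benchmark name.
    """
    shared_ids = facet_a.keys() & facet_b.keys()
    return Counter(facet_a[sample_id] for sample_id in shared_ids).most_common()


# Hypothetical facets with three shared samples.
python_coding = {"he-1": "HumanEval", "he-2": "HumanEval", "mm-7": "MMLU", "he-9": "HumanEval"}
debugging = {"he-1": "HumanEval", "he-2": "HumanEval", "mm-7": "MMLU", "bb-4": "BBH"}

for benchmark, count in shared_by_benchmark(python_coding, debugging):
    print(f"{benchmark}: {count} shared")
```

A benchmark that dominates this count is a good first candidate when checking whether several facets rest on the same generic items.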

Convergent Validity: Do Operationalizations Rank Models Similarly?

Use this workflow after retrieval, or even before retrieval if you only want to compare existing benchmark ranking sets.

Existing benchmark comparison

Select 2-4 ranking sets. MMLU and Big-Bench tasks are grouped into collapsible sections to keep the catalog readable.

Retrieved slice comparison

Example slice columns: All retrieved, Score (%); HumanEval slice, Score (%). Slice types: manual slice, source-benchmark slice, existing benchmark.

Create slices from all retrieved samples, manual selections, or source benchmarks, then compare them against each other or against benchmark rankings.

Read rank-divergence risk

Example reading: 42.5% rank-divergence risk

Rank-divergence risk is derived from Kendall tau, so higher ranking agreement means lower risk: tau = 1 is 0% risk, tau = 0 is 50%, and tau = -1 is 100%.
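The three anchor points above are consistent with a linear mapping, risk = (1 - tau) / 2, expressed as a percentage. That formula is an assumption inferred from those points, not a documented BenchBrowser formula. Under it, a 42.5% reading would correspond to tau = 0.15. The sketch below computes Kendall tau-a (no tie correction) from two hypothetical score tables and applies the assumed mapping.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a over the models present in both score dicts."""
    models = sorted(scores_a.keys() & scores_b.keys())
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both sets order the pair the same way.
        direction = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def rank_divergence_risk(tau):
    """Assumed linear mapping: tau 1 -> 0%, tau 0 -> 50%, tau -1 -> 100%."""
    return 100.0 * (1.0 - tau) / 2.0


# Hypothetical model scores (%) on two evaluation sets.
all_retrieved = {"model-a": 80, "model-b": 70, "model-c": 60, "model-d": 50}
humaneval_slice = {"model-a": 75, "model-b": 78, "model-c": 55, "model-d": 40}

tau = kendall_tau(all_retrieved, humaneval_slice)
print(f"tau = {tau:.2f}, rank-divergence risk = {rank_divergence_risk(tau):.1f}%")
```

In this example only one of the six model pairs is ordered differently, so the two sets mostly agree and the risk stays low.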

Use top rank movers and top-choice changed badges to understand which models drive the diagnosis.
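To make the two signals concrete, the sketch below assumes a rank mover is a model whose rank position differs between two evaluation sets, and that the top choice changes when the best-ranked model differs. Both definitions, and the names `rank_movers` and `top_choice_changed`, are illustrative assumptions rather than the tool's exact logic.

```python
def ordered_models(scores, models):
    """Models sorted from best to worst score, ties broken by name."""
    return sorted(models, key=lambda m: (-scores[m], m))


def rank_movers(scores_a, scores_b, top_n=3):
    """Return (model, rank_in_a, rank_in_b) for the largest rank shifts."""
    shared = scores_a.keys() & scores_b.keys()
    ranks_a = {m: i for i, m in enumerate(ordered_models(scores_a, shared), start=1)}
    ranks_b = {m: i for i, m in enumerate(ordered_models(scores_b, shared), start=1)}
    shifts = [(m, ranks_a[m], ranks_b[m]) for m in shared if ranks_a[m] != ranks_b[m]]
    # Largest absolute shift first; model name keeps the order deterministic.
    shifts.sort(key=lambda s: (-abs(s[1] - s[2]), s[0]))
    return shifts[:top_n]


def top_choice_changed(scores_a, scores_b):
    """True when the best-ranked shared model differs between the two sets."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return False
    return ordered_models(scores_a, shared)[0] != ordered_models(scores_b, shared)[0]


# Hypothetical model scores (%) on two evaluation sets.
benchmark = {"model-a": 80, "model-b": 70, "model-c": 60, "model-d": 50}
retrieved = {"model-a": 75, "model-b": 78, "model-c": 55, "model-d": 40}

for model, rank_a, rank_b in rank_movers(benchmark, retrieved):
    print(f"{model}: rank {rank_a} -> {rank_b}")
print("top choice changed:", top_choice_changed(benchmark, retrieved))
```

Here model-a and model-b swap the top two positions, so both appear as movers and the top choice changes even though the overall agreement is high.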

What to Look at First

Question	Best page	First signal	Then inspect
Are my facets distinct?	Content Validity	Overlap risk percentage	Shared samples and top common benchmarks
Do benchmarks agree on model quality?	Convergent Validity	Kendall tau and rank-divergence risk	Top rank movers and top-choice changes
Is a retrieved set mixing rubrics?	Convergent Validity	Rubric-shift warning in the risk panel	Source-benchmark slices and sample-level metric badges