How to Diagnose Benchmark Validity With BenchBrowser
BenchBrowser is most useful when you treat retrieval as evidence gathering and then ask two separate questions: do the retrieved samples cover distinct facets, and do the resulting evaluation sets produce stable model rankings?
Recommended Workflow
Start with a concrete retrieval query, then use the two diagnosis pages to separate evidence coverage from model-ranking stability.
Retrieve evidence
Enter a broad capability or facet query. Inspect the returned samples and save the items that genuinely match the intended construct.
Compare facets
Open Content Validity Diagnosis and compare 2-4 facet queries. Low overlap between the retrieved sets means each facet is supported by its own evidence rather than by the same items.
Compare rankings
Open Convergent Validity Diagnosis and compare benchmark rankings or retrieved slices. Stable rankings are stronger convergence evidence.
Inspect causes
Use shared samples, source benchmarks, rank movers, and rubric warnings to decide whether a validity risk is methodological or substantive.
Content Validity: Are Facets Covered by Distinct Evidence?
Use this workflow when you have multiple facets of one broad capability, such as Python coding, Java coding, debugging, and algorithmic reasoning.
Build 2-4 facets
Enter one query per facet. Cached queries reuse prior retrievals; new queries run the retrieval pipeline directly from the diagnosis page.
Read overlap as risk
High overlap is a signal of low content validity: it suggests that multiple facets may be supported by the same generic benchmark items.
Overlap is reported as risk, not proof. A 60% overlap means that 60% of the samples in the smaller of the two facet sets also appear in the other set.
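The "share of the smaller facet set" reading corresponds to the standard overlap coefficient, |A ∩ B| / min(|A|, |B|). The sketch below illustrates that arithmetic only; whether BenchBrowser computes the figure exactly this way is an assumption, and the function name and sample IDs are hypothetical.

```python
def overlap_risk(facet_a: set[str], facet_b: set[str]) -> float:
    """Fraction of the smaller facet set that also appears in the other set."""
    smaller = min(len(facet_a), len(facet_b))
    if smaller == 0:
        return 0.0  # an empty facet has no evidence to share
    return len(facet_a & facet_b) / smaller


# Hypothetical sample IDs retrieved for two facets.
python_coding = {"s1", "s2", "s3", "s4", "s5"}
debugging = {"s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"}

risk = overlap_risk(python_coding, debugging)
print(f"{risk:.0%}")  # 3 of the 5 samples in the smaller set are shared
```

Dividing by the smaller set keeps a narrow facet from looking distinct merely because it is paired with a much larger one.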
Inspect source benchmarks
The panel lists benchmarks contributing the most shared evidence, and each facet column separates unique from shared samples.
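To reproduce this view outside the tool, one possible approach is to split each facet into unique and shared samples and count which benchmarks supply the shared ones. This is a minimal sketch under assumed data: the mapping from sample ID to source benchmark, the benchmark names, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical retrieved sets: sample ID -> source benchmark.
facet_a = {"s1": "bench_A", "s2": "bench_B", "s3": "bench_B"}
facet_b = {"s2": "bench_B", "s3": "bench_B", "s4": "bench_C"}


def split_facet(facet: dict[str, str], other: dict[str, str]):
    """Return (unique, shared) sample IDs for one facet column."""
    shared = sorted(set(facet) & set(other))
    unique = sorted(set(facet) - set(other))
    return unique, shared


def top_shared_sources(facet: dict[str, str], other: dict[str, str], n: int = 3):
    """Benchmarks contributing the most samples to the shared evidence."""
    shared = set(facet) & set(other)
    return Counter(facet[s] for s in shared).most_common(n)


unique_a, shared_ab = split_facet(facet_a, facet_b)
print(unique_a, shared_ab)
print(top_shared_sources(facet_a, facet_b))
```

If one benchmark dominates the shared samples, the overlap may reflect that benchmark's generic items rather than a real conceptual link between the facets.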
Convergent Validity: Do Operationalizations Rank Models Similarly?
Use this workflow after retrieval, or before it if you only want to compare existing benchmark ranking sets.
Existing benchmark comparison
Select 2-4 ranking sets. MMLU and BIG-bench tasks are grouped into collapsible sections to keep the catalog readable.
Retrieved slice comparison
Create slices from all retrieved samples, manual selections, or source benchmarks, then compare them against each other or against benchmark rankings.
Read rank-divergence risk
Rank-divergence risk is a rescaling of Kendall tau, the ranking-agreement statistic: tau = 1 maps to 0% risk, tau = 0 to 50%, and tau = -1 to 100%.
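The simplest mapping that passes through those three anchor points is linear: risk = (1 - tau) / 2. The sketch below assumes that linear form, which the text implies but does not state, and computes tau-a by hand for rankings without ties. The function names and model names are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Kendall tau-a over models present in both rankings (assumes no ties)."""
    models = sorted(set(rank_a) & set(rank_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1  # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1  # the rankings disagree on this pair
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_divergence_risk(tau: float) -> float:
    """Assumed linear rescaling: tau 1 -> 0.0, tau 0 -> 0.5, tau -1 -> 1.0."""
    return (1.0 - tau) / 2.0


# Hypothetical rankings (1 = best) from two evaluation sets.
rank_a = {"model_w": 1, "model_x": 2, "model_y": 3, "model_z": 4}
rank_b = {"model_w": 2, "model_x": 1, "model_y": 3, "model_z": 4}

tau = kendall_tau(rank_a, rank_b)
print(f"tau={tau:.2f}, risk={rank_divergence_risk(tau):.0%}")
```

For rankings with ties, `scipy.stats.kendalltau` computes the tie-aware tau-b variant and could replace the hand-rolled function.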
Use top rank movers and top-choice changed badges to understand which models drive the diagnosis.
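As a rough illustration of what these two signals capture, the sketch below lists the models whose rank changes most between two rankings and checks whether the top-ranked model differs. How BenchBrowser defines movers and badges internally is not specified here, so the helper names, tie-breaking rule, and data are assumptions.

```python
def rank_movers(rank_a: dict[str, int], rank_b: dict[str, int], n: int = 3):
    """Models with the largest absolute rank change from rank_a to rank_b."""
    shared = set(rank_a) & set(rank_b)
    moves = [(m, rank_b[m] - rank_a[m]) for m in shared]
    # Largest movement first; break ties alphabetically for stable output.
    moves.sort(key=lambda item: (-abs(item[1]), item[0]))
    return moves[:n]


def top_choice_changed(rank_a: dict[str, int], rank_b: dict[str, int]) -> bool:
    """True when the best-ranked model (rank 1 = best) differs."""
    return min(rank_a, key=rank_a.get) != min(rank_b, key=rank_b.get)


# Hypothetical rankings (1 = best) from two evaluation sets.
rank_a = {"model_w": 1, "model_x": 2, "model_y": 3, "model_z": 4}
rank_b = {"model_w": 3, "model_x": 1, "model_y": 2, "model_z": 4}

print(rank_movers(rank_a, rank_b, n=2))
print(top_choice_changed(rank_a, rank_b))
```

A positive delta means the model dropped in the second ranking. A change at the top matters most when the rankings are used to pick a single model.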
What to Look at First
| Question | Best page | First signal | Then inspect |
|---|---|---|---|
| Are my facets distinct? | Content Validity | Overlap risk percentage | Shared samples and top common benchmarks |
| Do benchmarks agree on model quality? | Convergent Validity | Kendall tau and rank-divergence risk | Top rank movers and top-choice changes |
| Is a retrieved set mixing rubrics? | Convergent Validity | Rubric-shift warning in the risk panel | Source-benchmark slices and sample-level metric badges |