What Makes KahneBench Different
KahneBench goes beyond simple question-and-answer bias testing. Here is what sets the methodology apart.
At a Glance
Anti-Contamination Testing
Tests use futuristic scenarios (Mars colonization, AI governance boards) and novel professions that are unlikely to appear in training data, forcing models to reason from first principles rather than pattern-match to memorized answers.
Control vs Treatment Design
Every test includes a control prompt (no bias trigger) and treatment prompts (with triggers). Bias is measured as deviation from the model's own baseline, not from an external 'correct' answer.
Four Testing Scales
Micro (isolated bias), Meso (interacting biases), Macro (sequential decisions), and Meta (self-correction). Real-world decisions involve all four, so the benchmark tests all four.
Context Sensitivity
Each bias is tested across expertise levels (novice to authority), formality (casual to academic), and stakes (low to critical). A model might resist anchoring as an 'expert' but fall for it as a 'beginner'.
Reasoning Variant Testing
Models with extended thinking (Claude Opus 4.6, GPT-5.2) are tested twice: once with minimal reasoning budget and once with full reasoning. This directly tests whether 'thinking harder' reduces bias.
Compound Bias Interactions
The Meso scale tests pairs of biases together (e.g., anchoring + availability) to see if they amplify or attenuate each other. Biases rarely occur in isolation in the real world.
Full Methodology
The sections below cover every methodological feature in detail. Each addresses a specific limitation of existing bias benchmarks.
Anti-Contamination Measures
Standard bias benchmarks use well-known scenarios like the Linda Problem or the Asian Disease Problem. Modern LLMs have likely memorized these and can 'pass' by pattern-matching rather than reasoning. KahneBench's NovelScenarioGenerator creates test cases using futuristic professions (quantum computing architect, space debris analyst) and novel contexts (Mars colonization, AI governance boards). The generate_contamination_resistant_batch() method produces entire test suites designed to fall outside the training distribution. An additional BLOOM generator uses an LLM to produce naturalistic scenarios, adding another layer of diversity.
Control vs Treatment Experimental Design
KahneBench follows the same within-subject experimental design used in cognitive psychology. Every test instance includes a control prompt (no bias trigger, establishing the model's rational baseline) and treatment prompts (with bias triggers at each intensity level). Bias is measured as the model's deviation from its own control-condition answer. This eliminates the problem of defining an external 'correct' answer and instead measures whether reasoning shifts when a bias trigger is introduced.
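The within-subject measurement can be sketched as follows. This is a minimal illustration, not the actual KahneBench scoring code; the function name and the normalized-deviation formula are assumptions chosen to show the idea of measuring bias against the model's own baseline.

```python
# Minimal sketch of within-subject bias measurement (hypothetical
# function, not the actual KahneBench API): bias is the deviation of
# treatment-condition answers from the model's own control answer.

def bias_deviation(control_answer: float, treatment_answers: list[float]) -> float:
    """Mean absolute deviation of treatment answers from the model's
    own control-condition answer, normalized by the control value."""
    if not treatment_answers:
        return 0.0
    deviations = [abs(t - control_answer) for t in treatment_answers]
    return sum(deviations) / (len(deviations) * max(abs(control_answer), 1e-9))

# Example: a model estimates 100 with no anchor present, but 140 and
# 150 when high anchors are introduced -> large relative deviation.
score = bias_deviation(100.0, [140.0, 150.0])
```

Because the baseline comes from the same model, no external 'correct' answer is needed; any nonzero score means the trigger itself shifted the reasoning.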
Multi-Scale Testing Architecture
Most benchmarks test biases in isolation. KahneBench uses four scales. The Micro scale tests single isolated biases with a control-vs-treatment comparison. The Meso scale tests multiple interacting biases in complex scenarios (e.g., anchoring + availability + overconfidence). The Macro scale tests bias persistence across sequential related decisions using DecisionChain structures, with bias-specific generators for anchoring, prospect theory, confirmation bias, and overconfidence. The Meta scale tests self-correction and debiasing capacity by providing explicit warnings or reasoning prompts.
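The four scales and a Macro-scale chain can be sketched like this. The enum values and the DecisionChain fields shown here are illustrative stand-ins, not the benchmark's real definitions.

```python
# Illustrative sketch of the four testing scales and a Macro-scale
# sequential-decision structure. Field names are assumptions.
from dataclasses import dataclass, field
from enum import Enum

class Scale(Enum):
    MICRO = "micro"   # single isolated bias, control vs treatment
    MESO = "meso"     # multiple interacting biases
    MACRO = "macro"   # sequential related decisions
    META = "meta"     # self-correction with explicit warnings

@dataclass
class DecisionChain:
    """Stand-in for a Macro-scale test: one bias probed across steps."""
    bias: str
    steps: list[str] = field(default_factory=list)

chain = DecisionChain(
    bias="anchoring",
    steps=["initial estimate", "revision after new data", "final commitment"],
)
```

The point of the chain structure is that a bias introduced at step one can contaminate every later decision, which single-shot tests cannot detect.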
Trigger Intensity with Inverse Weighting
Each bias is tested at four trigger intensities: Weak, Moderate, Strong, and Adversarial. The BMS metric applies inverse weighting: Weak triggers get 2.0x weight, Moderate gets 1.0x, Strong gets 0.67x, Adversarial gets 0.5x. The logic: a model that succumbs to a subtle, weak trigger is more biased than one that only fails under adversarial pressure. This measures susceptibility rather than raw failure rate.
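The inverse-weighting scheme can be made concrete with a short sketch. The weights come from the description above; the aggregation rule (a weighted mean of per-intensity failure rates) is an assumption for illustration.

```python
# Sketch of inverse-weighted bias aggregation for the BMS metric.
# Weights are from the documented scheme; combining them as a
# weighted mean of failure rates is an illustrative assumption.

INTENSITY_WEIGHTS = {"weak": 2.0, "moderate": 1.0, "strong": 0.67, "adversarial": 0.5}

def weighted_bias_score(failure_rates: dict[str, float]) -> float:
    """failure_rates maps intensity -> observed biased-response rate (0-1)."""
    total_weight = sum(INTENSITY_WEIGHTS[k] for k in failure_rates)
    weighted = sum(INTENSITY_WEIGHTS[k] * v for k, v in failure_rates.items())
    return weighted / total_weight

# A model that succumbs only to subtle, weak triggers scores worse
# (more susceptible) than one that fails only under adversarial pressure:
subtle = weighted_bias_score({"weak": 1.0, "moderate": 0.0, "strong": 0.0, "adversarial": 0.0})
robust = weighted_bias_score({"weak": 0.0, "moderate": 0.0, "strong": 0.0, "adversarial": 1.0})
```

Here `subtle` is roughly four times `robust`, capturing the susceptibility ordering the metric is designed for.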
Context Sensitivity Testing
The ContextSensitivityEvaluator tests how contextual framing affects bias susceptibility across three dimensions. Expertise level ranges from novice to authority. Formality ranges from casual to academic. Stakes range from low to critical. A model might resist a bias when framed as a 'financial analyst reviewing data' but fall for it when framed as a 'curious person browsing online.' Six preset configurations test the most informative combinations, with gradient methods available to isolate each dimension.
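The three-dimensional context grid can be sketched as below. The endpoint levels match the description; the intermediate levels, the full-grid helper, and the gradient helper are illustrative assumptions rather than the ContextSensitivityEvaluator's actual interface.

```python
# Sketch of the context-sensitivity grid. Endpoint levels follow the
# documentation; intermediate levels and helper names are assumptions.
from itertools import product

EXPERTISE = ["novice", "intermediate", "expert", "authority"]
FORMALITY = ["casual", "neutral", "formal", "academic"]
STAKES = ["low", "medium", "high", "critical"]

def full_grid() -> list[tuple[str, str, str]]:
    """Every combination of the three context dimensions."""
    return list(product(EXPERTISE, FORMALITY, STAKES))

def expertise_gradient(formality: str = "neutral", stakes: str = "medium"):
    """Hold two dimensions fixed to isolate the expertise axis."""
    return [(e, formality, stakes) for e in EXPERTISE]
```

The full grid has 64 cells, which is why the evaluator ships six preset configurations covering the most informative combinations instead of exhaustively testing every cell.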
Temporal and Conversational Testing
The TemporalEvaluator tests four temporal conditions: Immediate (System 1 dominant, instant response), Deliberative (with explicit reflection instructions), Persistent (bias stability across 5 sequential prompts), and Adaptive (pre/post feedback comparison). A separate ConversationalEvaluator tests bias in multi-turn dialogue, where biases may emerge, accumulate, or dissipate over the course of a conversation.
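The Persistent condition can be sketched as a loop over sequential prompts. Everything here is a hypothetical stand-in: `ask_model` substitutes for a real model call, and the substring check substitutes for real biased-response detection.

```python
# Sketch of the Persistent temporal condition: probe the same bias
# across five sequential prompts and measure how often the biased
# response recurs. `ask_model` and the detection rule are stand-ins.

def persistent_bias_rate(ask_model, prompt: str, turns: int = 5) -> float:
    biased = 0
    history: list[str] = []
    for _ in range(turns):
        reply = ask_model(prompt, history)
        history.append(reply)
        if "biased" in reply:  # stand-in for real biased-response detection
            biased += 1
    return biased / turns

# Toy model that repeats the biased answer on every turn:
rate = persistent_bias_rate(lambda prompt, history: "biased answer", "estimate X")
```

A rate near 1.0 indicates a sticky bias that persists across turns; a rate that decays toward 0 indicates the bias dissipates once the conversation accumulates context.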
Compound Bias Interactions
The CompoundTestGenerator creates Meso-scale scenarios where multiple biases interact. It supports specifying primary and secondary biases with an interaction type: amplifying (biases reinforce each other) or attenuating (one bias counteracts another). For example, anchoring can amplify overconfidence, while loss aversion can interact with framing effects. These compound tests reveal emergent bias behaviors that single-bias tests miss entirely.
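A compound-test specification might look like the sketch below. The class and field names are assumptions for illustration, not the CompoundTestGenerator's real signature.

```python
# Illustrative specification for a Meso-scale compound bias test.
# Class and field names are assumptions, not the real API.
from dataclasses import dataclass

VALID_INTERACTIONS = {"amplifying", "attenuating"}

@dataclass
class CompoundSpec:
    primary_bias: str
    secondary_bias: str
    interaction: str  # "amplifying" or "attenuating"

    def __post_init__(self) -> None:
        if self.interaction not in VALID_INTERACTIONS:
            raise ValueError(f"unknown interaction type: {self.interaction}")

# Anchoring amplifying overconfidence, per the example above:
spec = CompoundSpec("anchoring", "overconfidence", "amplifying")
```

Encoding the interaction type in the spec lets the generator build scenarios where the triggers are designed to reinforce, or to pull against, each other.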
Reasoning Variant Testing
Models with extended thinking capabilities (like Claude Opus 4.6's thinking budget or GPT-5.2's reasoning effort) are tested twice: once with minimal reasoning and once with full reasoning capacity. This directly tests the dual-process hypothesis for LLMs: does giving the model more 'System 2' capacity reduce its 'System 1' errors? This parallels cognitive psychology's manipulation of cognitive load in human experiments.
Per-Bias Theoretical Grounding
Each of the 69 bias definitions includes four theoretical grounding fields: the original K&T paper citation, the specific System 1 mechanism that causes the bias, the System 2 strategy that corrects it, and the classic experimental paradigm. This connects every test to validated psychological theory rather than ad-hoc classification, and it directly informs the test generation templates.
Human Baseline Rates from Research Literature
Each bias includes a human susceptibility rate derived from published research (e.g., conjunction fallacy = 85%, anchoring effect = 65%). These enable the Human Similarity Score (HSS), which categorizes model behavior as super-human (less biased than humans), human-like, or worse than human. Each category has different safety implications for deployment.
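The HSS categorization can be sketched as a simple comparison against the published baseline. The 10% tolerance band used here is an assumption for illustration; the real metric's thresholds may differ.

```python
# Sketch of Human Similarity Score categorization. The tolerance band
# is an illustrative assumption, not the benchmark's actual threshold.

def human_similarity_category(model_rate: float, human_rate: float,
                              tolerance: float = 0.10) -> str:
    """Classify a model's bias rate against a published human baseline."""
    if model_rate < human_rate - tolerance:
        return "super-human"       # less biased than humans
    if model_rate > human_rate + tolerance:
        return "worse-than-human"
    return "human-like"

# Conjunction fallacy: human baseline ~85%. A model biased on 40%
# of trials is categorized as super-human for this bias.
category = human_similarity_category(0.40, 0.85)
```

Each category carries different deployment implications: a worse-than-human bias is an immediate safety flag, while a human-like one may still matter in high-stakes automation.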
Six-Metric Cognitive Profile
Rather than a single pass/fail score, KahneBench produces six complementary metrics: BMS (magnitude), BCI (cross-domain consistency), BMP (self-correction capacity), HSS (human similarity), RCI (noise vs signal distinction via repeated trials), and CAS (metacognitive calibration). Together they answer not just 'is the model biased?' but 'how biased, how reliably, can it recover, and does it know?'
Cognitive Fingerprint Output
The output of a full evaluation is a Cognitive Fingerprint Report: a structured per-bias, per-metric breakdown that identifies most susceptible biases, most resistant biases, human-like biases, and AI-specific biases. This is a richer deliverable than a leaderboard rank, enabling targeted mitigation strategies for specific deployment contexts.
Prompt Variation and Robustness Testing
The variation module varies prompt wording across multiple dimensions, and the robustness module adds adversarial testing, including contrastive robustness tests. If a model's bias resistance depends on exact prompt phrasing, the result is fragile. These tests check whether results hold across paraphrases and reformulations.
Test Quality Assessment
An LLM judge assesses the quality of generated test cases, filtering out ambiguous or poorly constructed instances. Dataset diversity is validated using self-BLEU and ROUGE scores to ensure generated test cases are sufficiently distinct from each other, preventing the benchmark from just testing the same scenario with minor rephrasing.
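The diversity check can be illustrated with a simplified pairwise-overlap proxy. The real pipeline uses proper self-BLEU and ROUGE; this sketch only shows the underlying idea of flagging near-duplicate test cases.

```python
# Simplified stand-in for the self-BLEU diversity check: mean pairwise
# bigram overlap (Jaccard) between generated scenarios. The actual
# pipeline uses real self-BLEU and ROUGE; this is only the idea.

def bigrams(text: str) -> set[tuple[str, str]]:
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def mean_pairwise_overlap(cases: list[str]) -> float:
    pairs = [(a, b) for i, a in enumerate(cases) for b in cases[i + 1:]]
    if not pairs:
        return 0.0
    scores = []
    for a, b in pairs:
        ba, bb = bigrams(a), bigrams(b)
        union = ba | bb
        scores.append(len(ba & bb) / len(union) if union else 0.0)
    return sum(scores) / len(scores)

# Near-duplicate scenarios share most bigrams and score high overlap:
dup = mean_pairwise_overlap(["the anchor was high", "the anchor was low"])
```

A batch whose mean overlap exceeds some threshold would be rejected as minor rephrasings of the same scenario rather than genuinely distinct tests.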
Cross-Domain Variants
Each test instance supports cross-domain variants, allowing the same underlying bias test to be adapted across all five ecological domains efficiently. This ensures consistent test logic while allowing domain-specific framing, enabling apples-to-apples comparison for the Bias Consistency Index (BCI).
Hybrid Scoring Pipeline
Response scoring uses deterministic extraction first (pattern-matching for expected answer formats), with an LLM-judge fallback when extraction fails. This improves coverage across diverse response styles while keeping automation practical. Unknown or failed extraction rates are tracked explicitly so that weak data quality is visible in metric interpretation rather than hidden behind defaults.
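The deterministic-first, judge-fallback flow can be sketched as follows. The extraction pattern and the `llm_judge` callable are illustrative assumptions; KahneBench's actual answer formats and judge interface may differ.

```python
# Sketch of the hybrid scoring pipeline: try deterministic pattern
# extraction first, fall back to an LLM judge, and record failures
# explicitly instead of hiding them behind defaults.
import re

# Illustrative pattern for "Final answer: B"-style responses.
ANSWER_RE = re.compile(
    r"(?:final answer|answer)\s*[:=]\s*([A-Da-d]|\d+(?:\.\d+)?)",
    re.IGNORECASE,
)

def score_response(text: str, llm_judge=None) -> dict:
    match = ANSWER_RE.search(text)
    if match:
        return {"answer": match.group(1), "method": "deterministic"}
    if llm_judge is not None:
        return {"answer": llm_judge(text), "method": "llm_judge"}
    return {"answer": None, "method": "unknown"}  # tracked, not hidden

r1 = score_response("Reasoning about the anchor... Final answer: B")
r2 = score_response("It depends on the framing.", llm_judge=lambda t: "C")
```

Tracking the `"unknown"` method rate per run is what makes weak extraction coverage visible when interpreting the downstream metrics.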
Run Provenance and Reproducibility
Every evaluation export includes model configuration, judge settings, benchmark tier, instance manifests, and code/version context. This makes runs auditable and reproducible. Test instances can be exported to JSON and reimported for exact replication, and generators accept seed parameters for deterministic output.
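Seeded generation plus export can be sketched as below. The function names and manifest fields are illustrative assumptions; the sketch only demonstrates that a fixed seed and configuration yield a byte-identical export.

```python
# Sketch of seeded, reproducible generation with a JSON export.
# Function names and manifest fields are illustrative assumptions.
import json
import random

def generate_instances(seed: int, n: int = 3) -> list[dict]:
    """Deterministic toy generator: same seed -> same instances."""
    rng = random.Random(seed)
    return [{"id": i, "anchor": rng.randint(50, 150)} for i in range(n)]

def export_run(instances: list[dict], model_config: dict) -> str:
    """Serialize a run manifest with stable key ordering."""
    return json.dumps(
        {"model_config": model_config, "instances": instances},
        sort_keys=True,
    )

config = {"model": "example-model", "temperature": 0.0}
run_a = export_run(generate_instances(seed=42), config)
run_b = export_run(generate_instances(seed=42), config)
# Same seed and config -> byte-identical export, so a reimported
# manifest replays exactly the same test instances.
```

Stable key ordering in the export matters: it makes two runs comparable by string equality, which is the simplest possible reproducibility check.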
Transparent Methodological Limits
Human baseline rates are derived from published research literature and explicitly caveated. They represent aggregate population tendencies from specific experimental paradigms, not universal ground truth. The CAS metric flags insufficient confidence data rather than silently defaulting, and the RCI metric distinguishes noise-floor reliability from genuine consistency. KahneBench surfaces uncertainty rather than hiding it.