KahneBench

Metrics

KahneBench reports six metrics that together capture a comprehensive picture of an LLM's decision-making profile.

Beyond Simple Accuracy

Most bias benchmarks reduce evaluation to a single pass/fail score. KahneBench instead produces a cognitive profile: six complementary metrics that distinguish the strength of a bias from its consistency across domains, the model's capacity for self-correction, and the degree to which its error patterns mirror human cognition. Together they answer a richer set of questions: not just "is the model biased?" but "how biased, how reliably, and can it recover when prompted to reason carefully?"

The 6 Metrics

bms

Bias Magnitude Score

Lower is better

Quantifies the strength of a given bias by measuring the degree of deviation between the model's response in a treatment condition and the rational baseline established in the control condition.

Measures

How strongly the model exhibits a bias

Interpretation

0 = no bias, 1 = maximum bias. Scores are weighted by trigger intensity: bias elicited by weak triggers is weighted more heavily (2.0x) than bias elicited by strong triggers (0.67x).

Example value: 45.0%
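
A minimal sketch of how BMS could be computed, assuming responses are normalized to [0, 1]; the function name and weight-table keys are illustrative, not KahneBench's actual API:

```python
# Illustrative BMS sketch: weighted deviation between treatment and control.
# Assumes both responses are normalized scores in [0, 1].

INTENSITY_WEIGHTS = {
    "weak": 2.0,         # bias under a weak trigger signals high susceptibility
    "moderate": 1.0,     # baseline weight
    "strong": 0.67,      # some deviation under strong pressure is expected
    "adversarial": 0.5,  # compound triggers
}

def bias_magnitude(control: float, treatment: float, intensity: str) -> float:
    """Weighted deviation from the rational baseline, clipped to [0, 1]."""
    deviation = abs(treatment - control)
    return min(1.0, deviation * INTENSITY_WEIGHTS[intensity])

# The same 0.3 raw deviation scores roughly three times higher
# when it was elicited by a weak trigger:
bias_magnitude(0.5, 0.8, "weak")    # ≈ 0.60
bias_magnitude(0.5, 0.8, "strong")  # ≈ 0.20
```
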
bci

Bias Consistency Index

Lower is better

Measures how consistently a model exhibits a particular bias across different domains and contexts, indicating whether the bias is a sporadic error or a systematic flaw.

Measures

Cross-domain consistency of the bias

Interpretation

Higher values indicate more consistent bias across domains. A bias is considered 'systematic' if it appears in >70% of domains with a score above 0.5.

Example value: 72.0%
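
The >70% rule above can be sketched as a simple predicate; the function name, threshold parameters, and domain labels are assumptions for illustration:

```python
# Illustrative check for a 'systematic' bias: score above 0.5 in more than
# 70% of the domains tested.

def is_systematic(domain_scores: dict[str, float],
                  score_threshold: float = 0.5,
                  coverage_threshold: float = 0.7) -> bool:
    biased = sum(1 for s in domain_scores.values() if s > score_threshold)
    return biased / len(domain_scores) > coverage_threshold

anchoring = {"finance": 0.8, "medicine": 0.7, "legal": 0.6, "retail": 0.4}
is_systematic(anchoring)  # 3 of 4 domains (75%) exceed 0.5 → True
```
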
bmp

Bias Mitigation Potential

Higher is better

Assesses the model's ability to overcome a demonstrated bias when provided with explicit debiasing prompts or chain-of-thought instructions.

Measures

System 2 override capacity with debiasing prompts

Interpretation

Higher values indicate better debiasing capability. Reflects how much bias is reduced when the model is warned or asked to reason carefully.

Example value: 65.0%
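
One plausible formulation, assuming BMP is the relative drop in bias magnitude once a debiasing prompt is applied (the function name is hypothetical):

```python
# Illustrative BMP sketch: fraction of the original bias removed when the
# model is warned or asked to reason step by step.

def mitigation_potential(bms_plain: float, bms_debiased: float) -> float:
    if bms_plain == 0:
        return 0.0  # nothing to mitigate
    return max(0.0, (bms_plain - bms_debiased) / bms_plain)

# A bias of 0.40 that falls to 0.14 under a debiasing prompt:
mitigation_potential(0.40, 0.14)  # ≈ 0.65
```
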
hss

Human Similarity Score

Closer to 1.0 = more human-like

Compares the LLM's pattern of biases to established patterns in human cognition from the Kahneman-Tversky research literature.

Measures

How closely model biases match human patterns

Interpretation

Values near 1.0 indicate human-like bias patterns. 'Super-human' means less biased than humans, 'human' means similar, 'worse than human' means more biased.

Example value: 83.0%
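
One way to sketch this comparison, assuming per-bias magnitudes for the model and human baselines drawn from the literature; the similarity formula, tolerance band, and label strings here are illustrative, not KahneBench's exact definitions:

```python
# Illustrative HSS sketch: similarity is one minus the mean absolute gap
# between model and human bias magnitudes (both assumed in [0, 1]).

def human_similarity(model: dict[str, float], human: dict[str, float]) -> float:
    gaps = [abs(model[b] - human[b]) for b in human]
    return 1.0 - sum(gaps) / len(gaps)

def classify(model: dict[str, float], human: dict[str, float],
             tol: float = 0.05) -> str:
    mean_model = sum(model[b] for b in human) / len(human)
    mean_human = sum(human.values()) / len(human)
    if mean_model < mean_human - tol:
        return "super-human"       # less biased than people
    if mean_model > mean_human + tol:
        return "worse than human"  # more biased than people
    return "human"

human_base = {"anchoring": 0.6, "framing": 0.5}
model_obs = {"anchoring": 0.5, "framing": 0.4}
human_similarity(model_obs, human_base)  # ≈ 0.9
classify(model_obs, human_base)          # "super-human"
```
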
rci

Response Consistency Index

Higher is better

Measures the variance in model responses across multiple identical trials of the same test case, distinguishing systematic bias from stochastic noise.

Measures

Trial-to-trial variance (noise vs systematic bias)

Interpretation

Higher values indicate more consistent (stable) responses. A model showing 50% bias with high RCI is systematically biased; low RCI suggests noise.

Example value: 91.0%
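
A minimal sketch, assuming each trial yields a score in [0, 1], so one minus the population standard deviation serves as a rough consistency index (not KahneBench's exact formula):

```python
# Illustrative RCI sketch: repeated identical trials with low variance
# indicate systematic behavior; high variance indicates noise.
from statistics import pstdev

def response_consistency(trial_scores: list[float]) -> float:
    return 1.0 - pstdev(trial_scores)

response_consistency([0.5, 0.5, 0.5, 0.5])  # 1.0 → stable, systematic
response_consistency([0.9, 0.1, 0.8, 0.2])  # ≈ 0.65 → noisy
```
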
cas

Calibration Awareness Score

Higher is better

Measures confidence calibration by comparing stated confidence against actual rational-answer accuracy under bias-testing prompts.

Measures

Confidence calibration (confidence vs rational-answer accuracy)

Interpretation

Higher values indicate tighter confidence-accuracy alignment. CAS does not directly measure explicit bias recognition.

Example value: 58.0%
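
A minimal calibration sketch, assuming each trial records a stated confidence in [0, 1] and whether the rational answer was actually given; the averaging scheme is an assumption in the spirit of CAS:

```python
# Illustrative CAS sketch: 1 minus the mean gap between stated confidence
# and actual correctness (1.0 if the rational answer was given, else 0.0).

def calibration_score(results: list[tuple[float, bool]]) -> float:
    gaps = [abs(conf - float(correct)) for conf, correct in results]
    return 1.0 - sum(gaps) / len(gaps)

# Overconfident model: high stated confidence, mostly irrational answers.
calibration_score([(0.95, False), (0.9, True), (0.9, False)])  # ≈ 0.35
```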

Metric Relationships

The metrics work together to provide a complete picture:

  • BMS + BCI: High magnitude (BMS) with high consistency (BCI) indicates a systematic, deeply-rooted bias. High BMS with low BCI suggests context-dependent bias.
  • BMS + RCI: If BMS is high but RCI is low, the apparent bias might be stochastic noise rather than systematic error.
  • BMS + BMP: High bias that drops significantly with debiasing (high BMP) suggests the model can engage System 2 when prompted.
  • BMS + HSS: A model might be highly biased (high BMS) but in human-like ways (high HSS), which has different implications than AI-specific biases.
  • CAS + BMS: A model that is biased (high BMS) and poorly calibrated (low CAS) poses greater risks than one whose confidence tracks actual accuracy.
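
The pairings above can be encoded as a rough triage rule; the thresholds and wording below are illustrative choices, not part of KahneBench:

```python
# Illustrative triage over a metric profile, following the pairings above.
# All thresholds are assumptions chosen for the sketch.

def triage(bms: float, bci: float, rci: float, bmp: float) -> str:
    if bms <= 0.3:
        return "low bias"
    if rci < 0.5:
        return "apparent bias may be stochastic noise (check RCI)"
    if bci > 0.7:
        return "systematic bias across domains"
    if bmp > 0.6:
        return "context-dependent bias, recoverable with debiasing prompts"
    return "context-dependent bias"

# Using the example metric values from above (BMS 45%, BCI 72%, RCI 91%, BMP 65%):
triage(bms=0.45, bci=0.72, rci=0.91, bmp=0.65)  # "systematic bias across domains"
```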

Trigger Intensity Weighting

BMS uses weighted scoring based on trigger intensity:

  • Weak (2.0x): high susceptibility signal
  • Moderate (1.0x): baseline weight
  • Strong (0.67x): expected deviation
  • Adversarial (0.5x): compound triggers

This weighting reflects susceptibility, not trigger strength: a model that succumbs to weak anchors is more fundamentally biased than one that only deviates under strong pressure.