KahneBench

Getting Started

Set up KahneBench and run your first cognitive bias evaluation in minutes.

Prerequisites

  • Python 3.10 or later
  • uv package manager (recommended) or pip
  • API key for your LLM provider (OpenAI, Anthropic, Fireworks, xAI, or Google)

Installation

Install from GitHub (quickest)

# Using uv (recommended)
uv pip install "git+https://github.com/ryanhartman4/KahneBench.git"

# Using pip
pip install "git+https://github.com/ryanhartman4/KahneBench.git"

Local development

git clone https://github.com/ryanhartman4/KahneBench.git
cd KahneBench
uv sync              # install runtime dependencies
uv sync --group dev  # also install development dependencies (optional)

Quick Start: Basic Demo

Run the basic usage demo to see KahneBench in action with a mock LLM provider:

PYTHONPATH=src uv run python examples/basic_usage.py

This demonstrates the complete workflow:

  • Taxonomy exploration (69 biases across 16 categories)
  • Test case generation for specific biases
  • Compound (meso-scale) test generation for bias interactions
  • Evaluation execution with mock responses
  • Metrics calculation and cognitive fingerprint generation
  • Debiasing prompt generation

Supported Models

KahneBench supports evaluation across 10 frontier models from 5 providers:

Model               Provider   Model ID
Claude Opus 4.6     Anthropic  claude-opus-4-6
Claude Sonnet 4.5   Anthropic  claude-sonnet-4-5
Claude Haiku 4.5    Anthropic  claude-haiku-4-5
GPT-5.2             OpenAI     gpt-5.2-2025-12-11
GLM 4.7             Fireworks  accounts/fireworks/models/glm-4p7
MiniMax M2P1        Fireworks  accounts/fireworks/models/minimax-m2p1
Gemini 3 Pro        Google     gemini-3-pro-preview
DeepSeek V3.2       Fireworks  accounts/fireworks/models/deepseek-v3p2
Kimi K2.5           Fireworks  accounts/fireworks/models/kimi-k2p5
Grok 4.1 Fast       xAI        grok-4-1-fast-reasoning

Provider Setup

Set the environment variable for each provider you want to use:

export ANTHROPIC_API_KEY="sk-ant-..."    # Required for default CLI judge (claude-haiku-4-5)
export OPENAI_API_KEY="sk-..."           # OpenAI models (default CLI model: gpt-5)
export FIREWORKS_API_KEY="fw_..."        # Fireworks models (default CLI model: kimi-k2p5)
export GOOGLE_API_KEY="..."              # Google models (default CLI model: gemini-3-pro-preview)
export XAI_API_KEY="xai-..."             # xAI models (default CLI model: grok-4-1-fast-reasoning)
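Before a long evaluation run, it can help to confirm the relevant keys are actually set. The variable names below come from the table above; the helper itself is only an illustrative sketch, not part of KahneBench:

```python
import os

# Provider -> environment variable, as documented above.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
}

def missing_keys(providers):
    """Return the providers whose API key env var is unset or empty."""
    return [p for p in providers if not os.environ.get(PROVIDER_KEYS[p])]

# Prints whichever of these two providers lack a configured key.
print(missing_keys(["anthropic", "openai"]))
```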

Quick Evaluation Examples

Per-provider example commands are shown under "Run evaluation" in the CLI Commands section below.

CLI Options

  • --model, -m: Model ID (see Supported Models table above)
  • --provider, -p: Provider (anthropic, openai, fireworks, gemini, or xai)
  • --tier, -t: Benchmark tier - core (15 biases), extended (69 biases), or interaction (compound tests)
  • --domains, -d: Domains to test - individual, professional, social, temporal, risk
  • --trials, -n: Trials per condition (default: 3)
  • --output, -o: Output file prefix for results
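The options above compose into a single command line. As a sketch, a hypothetical helper (not part of KahneBench; the flag names are taken from the list above, and how the CLI parses multi-value flags is an assumption) that assembles one:

```python
# Illustrative only: build a `kahne-bench evaluate` argument list from the
# documented CLI options. This helper is not part of KahneBench itself.
def build_evaluate_cmd(model, provider, tier="core",
                       domains=("professional",), trials=3, output="run"):
    cmd = ["kahne-bench", "evaluate",
           "--model", model, "--provider", provider, "--tier", tier,
           "--domains", *domains,
           "--trials", str(trials), "--output", output]
    return cmd

print(" ".join(build_evaluate_cmd("claude-opus-4-6", "anthropic")))
```

The resulting list could be passed to subprocess.run rather than joined into a shell string.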

CLI Commands

Show framework info

kahne-bench info

List all 69 biases

kahne-bench list-biases

List categories or biases in a category

kahne-bench list-categories
kahne-bench list-categories anchoring

Get detailed bias information

kahne-bench describe anchoring_effect

Generate test cases

kahne-bench generate \
  --bias anchoring_effect loss_aversion \
  --domain professional individual \
  --instances 3 \
  --output test_cases.json

Generate compound (meso-scale) tests

kahne-bench generate-compound \
  --bias anchoring_effect \
  --domain professional \
  --output compound_tests.json

Run evaluation

# With mock provider (for testing)
kahne-bench evaluate \
  -i test_cases.json \
  -p mock

# With OpenAI
kahne-bench evaluate \
  -i test_cases.json \
  -p openai \
  -m gpt-5.2-2025-12-11 \
  --trials 3

# With Anthropic
kahne-bench evaluate \
  -i test_cases.json \
  -p anthropic \
  -m claude-opus-4-6

# With Fireworks (DeepSeek)
kahne-bench evaluate \
  -i test_cases.json \
  -p fireworks \
  -m accounts/fireworks/models/deepseek-v3p2

Assess test quality

kahne-bench assess-quality -i test_cases.json

Generate BLOOM scenarios

kahne-bench generate-bloom --bias anchoring_effect

Run conversational evaluation

kahne-bench evaluate-conversation \
  -i test_cases.json -p openai -m gpt-5

Generate report from fingerprint

kahne-bench report fingerprint.json

Note: Advanced evaluators are Python API only: TemporalEvaluator, ContextSensitivityEvaluator, MacroScaleGenerator, RobustnessTester, and ContrastiveRobustnessTester.

Output Files

After running an evaluation, you'll get two output files:

  • *_results.json - Raw evaluation results with all responses
  • *_fingerprint.json - Cognitive fingerprint with computed metrics (BMS, BCI, BMP, HSS, RCI, CAS)
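Both files are plain JSON and can be inspected with standard tooling. The sketch below assumes the fingerprint holds a metrics mapping keyed by the abbreviations listed above; the actual layout may differ, so it writes and reads its own sample file:

```python
import json
from pathlib import Path

# Assumed fingerprint layout: metric names (BMS, BCI, ...) are from the docs,
# but this JSON structure is a guess for illustration only.
sample = {"model": "mock", "metrics": {"BMS": 0.42, "BCI": 0.31}}
path = Path("demo_fingerprint.json")
path.write_text(json.dumps(sample))

fingerprint = json.loads(path.read_text())
for name, value in fingerprint["metrics"].items():
    print(f"{name}: {value:.2f}")
```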

Limitations

KahneBench is a research framework, not a validated psychometric instrument. Results are best used for relative comparisons between models, not absolute claims about human-like bias. Human baselines are literature-derived and may be outdated, metric weights are design choices (not empirically calibrated), and template-based prompts can misclassify or miss responses.

Note on Google Gemini: Google Gemini models were not included in the current benchmark results due to persistent rate limiting during evaluation runs. We plan to include Gemini in future benchmark rounds when API access is more reliable.

Next Steps