Getting Started
Set up KahneBench and run your first cognitive bias evaluation in minutes.
Prerequisites
- Python 3.10 or later
- uv package manager (recommended) or pip
- API key for your LLM provider (OpenAI, Anthropic, Fireworks, xAI, or Google)
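The Python requirement can be checked before installing. A minimal sketch using only the standard library (nothing KahneBench-specific):

```python
import sys

# KahneBench needs Python 3.10 or later; fail fast with a clear message.
def meets_requirement(min_version: tuple[int, int] = (3, 10)) -> bool:
    """Return True when the running interpreter is new enough."""
    return sys.version_info[:2] >= min_version

if __name__ == "__main__":
    if not meets_requirement():
        raise SystemExit("KahneBench requires Python 3.10 or later")
    print("Python version OK:", sys.version.split()[0])
```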
Installation
Install from GitHub (quickest)
```bash
# Using uv (recommended)
uv pip install "git+https://github.com/ryanhartman4/KahneBench.git"

# Using pip
pip install "git+https://github.com/ryanhartman4/KahneBench.git"
```
Local development
```bash
git clone https://github.com/ryanhartman4/KahneBench.git
cd KahneBench
uv sync

# Install development dependencies as well
uv sync --group dev
```
Quick Start: Basic Demo
Run the basic usage demo to see KahneBench in action with a mock LLM provider:
```bash
PYTHONPATH=src uv run python examples/basic_usage.py
```
This demonstrates the complete workflow:
- Taxonomy exploration (69 biases across 16 categories)
- Test case generation for specific biases
- Compound (meso-scale) test generation for bias interactions
- Evaluation execution with mock responses
- Metrics calculation and cognitive fingerprint generation
- Debiasing prompt generation
Supported Models
KahneBench supports evaluation across 10 frontier models from 5 providers:
| Model | Provider | Model ID |
|---|---|---|
| Claude Opus 4.6 | Anthropic | claude-opus-4-6 |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4-5 |
| Claude Haiku 4.5 | Anthropic | claude-haiku-4-5 |
| GPT-5.2 | OpenAI | gpt-5.2-2025-12-11 |
| GLM 4.7 | Fireworks | accounts/fireworks/models/glm-4p7 |
| MiniMax M2P1 | Fireworks | accounts/fireworks/models/minimax-m2p1 |
| Gemini 3 Pro | Google | gemini-3-pro-preview |
| DeepSeek V3.2 | Fireworks | accounts/fireworks/models/deepseek-v3p2 |
| Kimi K2.5 | Fireworks | accounts/fireworks/models/kimi-k2p5 |
| Grok 4.1 Fast | xAI | grok-4-1-fast-reasoning |
Provider Setup
Set the environment variable for each provider you want to use:
```bash
export ANTHROPIC_API_KEY="sk-ant-..."   # Required for default CLI judge (claude-haiku-4-5)
export OPENAI_API_KEY="sk-..."          # OpenAI models (default CLI model: gpt-5)
export FIREWORKS_API_KEY="fw_..."       # Fireworks models (default CLI model: kimi-k2p5)
export GOOGLE_API_KEY="..."             # Google models (default CLI model: gemini-3-pro-preview)
export XAI_API_KEY="xai-..."            # xAI models (default CLI model: grok-4-1-fast-reasoning)
```
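Before a run, it can be useful to see which providers are actually configured. The sketch below only inspects the environment; the provider-to-variable mapping is copied from the export list above, and nothing here calls KahneBench itself:

```python
import os

# Map each provider to the environment variable listed in the setup section.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
}

def configured_providers() -> list[str]:
    """Return the providers whose API key is present in the environment."""
    return [p for p, var in PROVIDER_KEYS.items() if os.environ.get(var)]

if __name__ == "__main__":
    print("Configured providers:", configured_providers() or "none")
```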
CLI Options
- `--model`, `-m`: Model ID (see Supported Models table above)
- `--provider`, `-p`: Provider (`anthropic`, `openai`, `fireworks`, `gemini`, or `xai`)
- `--tier`, `-t`: Benchmark tier - `core` (15 biases), `extended` (69 biases), or `interaction` (compound tests)
- `--domains`, `-d`: Domains to test - `individual`, `professional`, `social`, `temporal`, `risk`
- `--trials`, `-n`: Trials per condition (default: 3)
- `--output`, `-o`: Output file prefix for results
CLI Commands
Show framework info
```bash
kahne-bench info
```
List all 69 biases
```bash
kahne-bench list-biases
```
List categories or biases in a category
```bash
kahne-bench list-categories
kahne-bench list-categories anchoring
```
Get detailed bias information
```bash
kahne-bench describe anchoring_effect
```
Generate test cases
```bash
kahne-bench generate \
  --bias anchoring_effect loss_aversion \
  --domain professional individual \
  --instances 3 \
  --output test_cases.json
```
Generate compound (meso-scale) tests
```bash
kahne-bench generate-compound \
  --bias anchoring_effect \
  --domain professional \
  --output compound_tests.json
```
Run evaluation
```bash
# With mock provider (for testing)
kahne-bench evaluate \
  -i test_cases.json \
  -p mock

# With OpenAI
kahne-bench evaluate \
  -i test_cases.json \
  -p openai \
  -m gpt-5.2-2025-12-11 \
  --trials 3

# With Anthropic
kahne-bench evaluate \
  -i test_cases.json \
  -p anthropic \
  -m claude-opus-4-6

# With Fireworks (DeepSeek)
kahne-bench evaluate \
  -i test_cases.json \
  -p fireworks \
  -m accounts/fireworks/models/deepseek-v3p2
```
Assess test quality
```bash
kahne-bench assess-quality -i test_cases.json
```
Generate BLOOM scenarios
```bash
kahne-bench generate-bloom --bias anchoring_effect
```
Run conversational evaluation
```bash
kahne-bench evaluate-conversation \
  -i test_cases.json -p openai -m gpt-5
```
Generate report from fingerprint
```bash
kahne-bench report fingerprint.json
```
Note: Advanced evaluators are Python API only: TemporalEvaluator, ContextSensitivityEvaluator, MacroScaleGenerator, RobustnessTester, and ContrastiveRobustnessTester.
Output Files
After running an evaluation, you'll get two output files:
- `*_results.json` - Raw evaluation results with all responses
- `*_fingerprint.json` - Cognitive fingerprint with computed metrics (BMS, BCI, BMP, HSS, RCI, CAS)
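The fingerprint metrics can then be pulled into downstream scripts. A minimal sketch that assumes the six metric keys appear either at the top level of the JSON or under a nested `metrics` object; the real schema may differ, so adjust the lookup to your generated file:

```python
import json
from pathlib import Path

# Headline metrics named in the docs; the surrounding JSON schema is assumed.
METRIC_KEYS = ("BMS", "BCI", "BMP", "HSS", "RCI", "CAS")

def summarize_fingerprint(path: str) -> dict[str, float]:
    """Return whichever of the six headline metrics the fingerprint file
    contains, searching the top level and a nested 'metrics' object."""
    data = json.loads(Path(path).read_text())
    source = data.get("metrics", data) if isinstance(data, dict) else {}
    return {key: source[key] for key in METRIC_KEYS if key in source}
```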
Limitations
KahneBench is a research framework, not a validated psychometric instrument. Results are best used for relative comparisons between models, not absolute claims about human-like bias. Human baselines are literature-derived and may be outdated, metric weights are design choices (not empirically calibrated), and template-based prompts can misclassify or miss responses.
Note on Google Gemini: Google Gemini models were not included in the current benchmark results due to persistent rate limiting during evaluation runs. We plan to include Gemini in future benchmark rounds when API access is more reliable.
Next Steps
- → Learn about Dual-Process Theory
- → Explore the Bias Taxonomy (69 biases)
- → Understand the 6 Advanced Metrics
- → Learn about Ecological Domains
- → Try sample questions in the Question Explorer