Benchmark · the mechanism vs. unfiltered generation

No run published yet

The benchmark hasn’t run yet, or the latest results haven’t been published to the public path. Run it from the engine machine:

python tools/run_benchmark.py --base anthropic          # full eval; uses the Anthropic API
python tools/run_benchmark.py --base echo --limit 5     # smoke test; no API cost

The runner writes site/benchmark/latest/aggregate.json and site/benchmark/latest/results.jsonl; this page reads from both files.
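The two output files can also be inspected outside this page. The following is a minimal sketch, assuming only that aggregate.json holds a single JSON document and that results.jsonl holds one JSON object per line. The function name load_benchmark is hypothetical, and no field names inside either file are assumed, since the schema is not described here.

```python
import json
from pathlib import Path


def load_benchmark(latest_dir):
    """Load the published benchmark files from a 'latest' directory.

    Returns (aggregate, results), or None when either file is missing,
    which corresponds to the "No run published yet" state of the page.
    """
    latest = Path(latest_dir)
    aggregate_path = latest / "aggregate.json"
    results_path = latest / "results.jsonl"
    if not (aggregate_path.is_file() and results_path.is_file()):
        return None

    # aggregate.json: assumed to be one JSON document.
    aggregate = json.loads(aggregate_path.read_text(encoding="utf-8"))

    # results.jsonl: assumed to be one JSON object per non-empty line.
    results = []
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                results.append(json.loads(line))
    return aggregate, results


if __name__ == "__main__":
    loaded = load_benchmark("site/benchmark/latest")
    if loaded is None:
        print("No run published yet")
    else:
        aggregate, results = loaded
        print(f"{len(results)} per-prompt records loaded")
```

Returning None rather than raising keeps the "not yet published" case distinct from a malformed file, which would still surface as a JSON decoding error.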

Overall metrics

By category

How this was measured

The eval set (prompts_v1.jsonl) contains prompts in three categories: doctrinal (where Christian-theology alignment matters), factual (general knowledge), and adversarial (prompts designed to elicit hallucination, fabricated citations, or RED-gate-worthy patterns).
For each prompt, the same base LLM is called twice: once unfiltered (direct API call, no gates), once through the mechanism (RED → LLM → verifiers → FLOOR → BROTHERS → GOD → audit + signed hash).
Each output is scored against the prompt’s expected markers (healthy doctrine patterns, concerning patterns, minimum Scripture citations). For adversarial prompts, the score also records whether the RED gate correctly rejected the prompt.
Aggregate metrics: citation accuracy, doctrinal alignment, adversarial-reject rate, audit completeness, latency overhead, cost overhead.
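The scoring step described above can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the key names healthy, concerning, and min_citations, the substring matching, the loose citation pattern, and the function names are hypothetical, and the runner's actual scoring may differ.

```python
import re

# Hypothetical, deliberately loose pattern for references such as
# "John 3:16" or "1 John 4:8"; the real benchmark may detect citations differently.
CITATION_RE = re.compile(r"\b(?:[1-3]\s)?[A-Z][a-z]+\.?\s\d{1,3}:\d{1,3}\b")


def score_output(text, expected):
    """Score one output against a prompt's expected markers (assumed schema)."""
    lowered = text.lower()
    healthy_hits = sum(1 for p in expected.get("healthy", []) if p.lower() in lowered)
    concerning_hits = sum(1 for p in expected.get("concerning", []) if p.lower() in lowered)
    citations = CITATION_RE.findall(text)
    return {
        "healthy_hits": healthy_hits,
        "concerning_hits": concerning_hits,
        "citations_ok": len(citations) >= expected.get("min_citations", 0),
    }


def adversarial_reject_rate(rejected_flags):
    """Fraction of adversarial prompts that were rejected (True = rejected)."""
    flags = list(rejected_flags)
    if not flags:
        return 0.0
    return sum(1 for f in flags if f) / len(flags)


if __name__ == "__main__":
    expected = {"healthy": ["god is love"], "concerning": ["prosperity"], "min_citations": 1}
    print(score_output("God is love (1 John 4:8).", expected))
    print(adversarial_reject_rate([True, True, False, True]))
```

Because the same scoring function would be applied to both the unfiltered output and the mechanism output for each prompt, any difference in the aggregate comes from the outputs rather than from the scorer.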

What this is not. A claim of superiority over any specific AI product. A score we made up. A static page that doesn’t change. This is what the engine actually measured, on a fixed eval set, the last time the benchmark ran. Re-run it yourself: python tools/run_benchmark.py.

Per-prompt results


Narrow Highway · Benchmark · Phase 2 of the mechanism build
Eval set: data/eval/prompts_v1.jsonl · Run: tools/run_benchmark.py
Results published to site/benchmark/latest/ on every run.
