Overall metrics
By category
How this was measured
- The eval set (prompts_v1.jsonl) contains prompts in three categories: doctrinal (where Christian-theology alignment matters), factual (general knowledge), and adversarial (prompts designed to elicit hallucination, fabricated citations, or RED-gate-worthy patterns).
- For each prompt, the same base LLM is called twice: once unfiltered (direct API call, no gates), once through the mechanism (RED → LLM → verifiers → FLOOR → BROTHERS → GOD → audit + signed hash).
- Each output is scored against the prompt’s expected markers (healthy doctrine patterns, concerning patterns, minimum Scripture citations). For adversarial prompts, the score also records whether the RED gate correctly rejected the prompt.
- Aggregate metrics: citation accuracy, doctrinal alignment, adversarial-reject rate, audit completeness, latency overhead, cost overhead.
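As a rough illustration of how two of these aggregates could be derived from paired runs, here is a minimal sketch. It is not the code in tools/run_benchmark.py: the PairedResult record, its field names, and both formulas (reject rate as the fraction of adversarial prompts refused, latency overhead as the relative increase in mean latency) are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PairedResult:
    """One prompt run twice: unfiltered and through the mechanism.

    Hypothetical record layout; the real benchmark may store different fields.
    """
    category: str               # "doctrinal", "factual", or "adversarial"
    baseline_latency_s: float   # direct API call, no gates
    mechanism_latency_s: float  # full gated pipeline
    rejected_by_red: bool       # True if the RED gate refused the prompt


def adversarial_reject_rate(results):
    """Fraction of adversarial prompts that the RED gate rejected (assumed definition)."""
    adversarial = [r for r in results if r.category == "adversarial"]
    if not adversarial:
        return 0.0
    return sum(r.rejected_by_red for r in adversarial) / len(adversarial)


def latency_overhead(results):
    """Relative increase in mean latency, mechanism vs. baseline (assumed definition)."""
    base = mean(r.baseline_latency_s for r in results)
    mech = mean(r.mechanism_latency_s for r in results)
    return (mech - base) / base


# Made-up sample values, for illustration only.
sample = [
    PairedResult("doctrinal", 1.0, 1.5, False),
    PairedResult("factual", 1.0, 1.5, False),
    PairedResult("adversarial", 2.0, 3.0, True),
    PairedResult("adversarial", 2.0, 3.0, False),
]
reject_rate = adversarial_reject_rate(sample)
overhead = latency_overhead(sample)
```

Cost overhead could be computed the same way as latency overhead, with per-call cost in place of latency. The other aggregates depend on the marker-matching rules, which are not specified here.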
What this is not. A claim of superiority over any specific
AI product. A score we made up. A static page that doesn’t change. This is
what the engine actually measured, on a fixed eval set, the last time the
benchmark ran. Re-run it yourself: python tools/run_benchmark.py.