The dirty secret of LLM evaluation is that most eval harnesses measure the wrong thing. They test the model against a fixed golden dataset, catch only the failure modes someone has already found, and miss everything that actually happens in production.
What a production eval harness actually needs
Real eval infrastructure runs on live traffic traces, not curated datasets. It needs three layers: offline evals on a golden set (catch obvious regressions), shadow evals on sampled production traffic (catch distribution shift), and online evals that score live outputs against a rubric (catch the long tail).
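Here is a minimal sketch of the shadow layer. The Trace record, SHADOW_SAMPLE_RATE constant, run_shadow_evals helper, and the filename are illustrative, not from any particular framework; the point is that the shadow layer reuses whatever scorer the offline layer already runs.

shadow_eval.py

import random
from dataclasses import dataclass
from typing import Callable, Iterable

SHADOW_SAMPLE_RATE = 0.02  # score roughly 2% of live traffic

@dataclass
class Trace:
    context: str   # retrieved context the model saw in production
    response: str  # what the model actually returned

def run_shadow_evals(traces: Iterable[Trace],
                     scorer: Callable[[str, str], float]) -> float:
    # Reuse the same scorer as the offline golden-set run so the two
    # layers stay directly comparable; only the data source differs.
    sampled = [t for t in traces if random.random() < SHADOW_SAMPLE_RATE]
    if not sampled:
        return float("nan")
    scores = [scorer(t.response, t.context) for t in sampled]
    return sum(scores) / len(scores)

The scorer in eval.py below is one way to fill that scorer slot: a simple LLM-as-judge groundedness check.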
eval.py
import anthropic

client = anthropic.Anthropic()

def eval_groundedness(response: str, context: str) -> float:
    """LLM-as-judge: rate how grounded a response is in its context, 0 to 1."""
    result = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=10,
        messages=[{"role": "user", "content": f"Rate groundedness 0-1.\nContext: {context}\nResponse: {response}\nScore:"}],
    )
    # The prompt asks the judge for a bare number, so parse it directly.
    return float(result.content[0].text.strip())
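Wiring the two layers together is then a few lines. load_recent_traces is a placeholder for however you pull sampled production traffic (a logs table, a queue, a trace store):

traces = load_recent_traces(hours=24)  # hypothetical helper: fetch the last day's production traces
print(f"24h shadow groundedness: {run_shadow_evals(traces, eval_groundedness):.2f}")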