The dirty secret of LLM evaluation is that most eval harnesses measure the wrong thing. They test the model against a fixed golden dataset, catch only the failure modes someone has already found, and miss everything that actually happens in production.
What a production eval harness actually needs
Real eval infrastructure runs on live traffic traces, not curated datasets. It needs three layers: offline evals on a golden set (catch obvious regressions), shadow evals on sampled production traffic (catch distribution shift), and online evals that score live outputs against a rubric (catch the long tail).
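Here is a minimal sketch of the shadow layer. The Trace record, SHADOW_SAMPLE_RATE constant, run_shadow_evals helper, and the filename are illustrative, not from any particular framework; the point is that the shadow layer reuses whatever scorer the offline layer already runs.

shadow_eval.py

import random
from dataclasses import dataclass
from typing import Callable, Iterable

SHADOW_SAMPLE_RATE = 0.02  # score roughly 2% of live traffic

@dataclass
class Trace:
    context: str   # retrieved context the model saw in production
    response: str  # what the model actually returned

def run_shadow_evals(traces: Iterable[Trace],
                     scorer: Callable[[str, str], float]) -> float:
    # Reuse the same scorer as the offline golden-set run so the two
    # layers stay directly comparable; only the data source differs.
    sampled = [t for t in traces if random.random() < SHADOW_SAMPLE_RATE]
    if not sampled:
        return float("nan")
    scores = [scorer(t.response, t.context) for t in sampled]
    return sum(scores) / len(scores)

The scorer in eval.py below is one way to fill that scorer slot: a simple LLM-as-judge groundedness check.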
eval.py
import anthropic

client = anthropic.Anthropic()

def eval_groundedness(response: str, context: str) -> float:
    """LLM-as-judge: rate how grounded a response is in its context, 0 to 1."""
    result = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=10,
        messages=[{"role": "user", "content": f"Rate groundedness 0-1.\nContext: {context}\nResponse: {response}\nScore:"}],
    )
    # The prompt asks the judge for a bare number, so parse it directly.
    return float(result.content[0].text.strip())
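Wiring the two layers together is then a few lines. load_recent_traces is a placeholder for however you pull sampled production traffic (a logs table, a queue, a trace store):

traces = load_recent_traces(hours=24)  # hypothetical helper: fetch the last day's production traces
print(f"24h shadow groundedness: {run_shadow_evals(traces, eval_groundedness):.2f}")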