Hiring systems are the AI use case with the lowest tolerance for error. Candidates lose real opportunities; employers face real legal exposure. An eval harness for a hiring LLM is not optional — and it's not the same harness you'd run for a support chatbot.
Three kinds of error, all different
A hiring system can fail in at least three distinct ways:
- Capability error — the model scored a strong candidate as weak, or vice versa.
- Fairness error — the model systematically under- or over-scored candidates by a protected attribute (gender, region, school, age proxies).
- Instruction error — the model didn't follow the rubric, citing criteria it wasn't asked to weigh.
A single accuracy number hides all three. An honest eval reports each separately.
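What "reports each separately" can mean in practice is simply a report object with one field per failure mode. A minimal sketch, assuming scores on a 0–1 scale; the field names are illustrative, not a fixed schema:

```python
# Sketch of a per-release eval report that keeps the three error types
# separate rather than folding them into one accuracy number.
# Field names and semantics are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HiringEvalReport:
    capability_mae: float              # mean absolute error vs interviewer rubric scores
    fairness_max_slice_delta: float    # worst score gap across protected-attribute slices
    instruction_violation_rate: float  # share of outputs citing off-rubric criteria

    def summary(self) -> str:
        return (
            f"capability MAE={self.capability_mae:.3f}, "
            f"max fairness delta={self.fairness_max_slice_delta:.3f}, "
            f"instruction violations={self.instruction_violation_rate:.1%}"
        )
```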
The regression set
For every hiring system we build, the regression set contains:
- 50–100 ground-truth cases annotated by the client's own interviewers, with rubric-level scores (not just hire/no-hire).
- A paired-counterfactual set — 20–40 resumes where only one attribute changed (name swapped, school swapped, gender-coded pronouns swapped). The model's score should not shift meaningfully across the pair; the check is sketched after this list. See Bolukbasi et al., 2016 for the foundational work on measuring this kind of bias in embeddings.
- Adversarial prompts — candidate self-descriptions crafted to trigger false positives (buzzword stuffing, invented certifications). The model must flag these, not reward them.
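The paired-counterfactual check is mechanically simple. A minimal sketch, assuming each pair holds two resume texts that differ in exactly one attribute and a placeholder `score_resume` callable that returns a 0–1 score (both names are assumptions, not the real pipeline):

```python
# Sketch of the paired-counterfactual consistency check.
# `score_resume` is a placeholder for whatever calls the scoring model;
# each pair differs in exactly one attribute (name, school, gendered pronouns).
from typing import Callable, List, Tuple

MAX_PAIR_DRIFT = 0.05  # target from the fairness section: < 0.05 on a 0-1 score

def counterfactual_drift(
    pairs: List[Tuple[str, str]],
    score_resume: Callable[[str], float],
) -> List[float]:
    """Return the absolute score gap for each counterfactual pair."""
    return [abs(score_resume(a) - score_resume(b)) for a, b in pairs]

def counterfactual_check(
    pairs: List[Tuple[str, str]],
    score_resume: Callable[[str], float],
) -> bool:
    """True only if every pair stays under the drift target."""
    return all(d < MAX_PAIR_DRIFT for d in counterfactual_drift(pairs, score_resume))
```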
Fairness metrics we actually track
Not "the model is unbiased." That's a claim no one can honestly make. What we track:
- Demographic parity delta on scored outputs, per protected attribute slice.
- Equalised odds delta at the decision threshold — are false-positive and false-negative rates comparable across groups? See Hardt et al., 2016 for the formal definition.
- Counterfactual consistency — score drift on the paired set above. Target: < 0.05 on a 0–1 score.
All three get reported to the hiring panel. If any of them drifts past its threshold in production, the system stops scoring and falls back to human review.
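A sketch of how the first two deltas might be computed from an eval dataframe of scored candidates. The column names (`passed`, `hired_by_humans`) and the pandas layout are assumptions about how the eval data is stored, not the actual pipeline:

```python
# Group-level fairness deltas, computed per protected-attribute slice.
# `passed` = model decision at the threshold, `hired_by_humans` = ground-truth
# label from the annotated regression set; both column names are assumptions.
import pandas as pd

def demographic_parity_delta(df: pd.DataFrame, group_col: str,
                             decision_col: str = "passed") -> float:
    """Max minus min selection rate across groups in `group_col`."""
    rates = df.groupby(group_col)[decision_col].mean()
    return float(rates.max() - rates.min())

def equalised_odds_delta(df: pd.DataFrame, group_col: str,
                         decision_col: str = "passed",
                         label_col: str = "hired_by_humans") -> float:
    """Worst gap in true-positive or false-positive rate across groups
    (the Hardt et al., 2016 framing)."""
    tpr = df[df[label_col] == 1].groupby(group_col)[decision_col].mean()
    fpr = df[df[label_col] == 0].groupby(group_col)[decision_col].mean()
    return float(max(tpr.max() - tpr.min(), fpr.max() - fpr.min()))
```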
Model-as-judge, calibrated
We use an LLM to grade rubric adherence on open-ended candidate responses. Two non-negotiable rules, borrowed from the Constitutional AI and LLM-as-judge literature:
- The judge model is different from the scoring model. Same-model self-evaluation is reliably biased toward its own outputs.
- The judge prompt has a human calibration set — a batch where experienced interviewers scored the same responses. The judge's outputs get correlated against human scores quarterly; if correlation drops below 0.7, the judge prompt gets rewritten. The check is sketched after this list.
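A minimal sketch of that quarterly calibration check. The post doesn't name a correlation measure, so Spearman is assumed here because rubric scores are ordinal; the 0.7 floor comes from the rule above:

```python
# Quarterly judge-calibration check: correlate the judge model's rubric scores
# against the human calibration batch. Spearman is an assumption; the 0.7
# floor is the threshold named in the text.
from scipy.stats import spearmanr

JUDGE_MIN_CORRELATION = 0.7

def judge_is_calibrated(judge_scores: list[float], human_scores: list[float]) -> bool:
    """False means the judge prompt needs to be rewritten before the next cycle."""
    corr, _ = spearmanr(judge_scores, human_scores)
    return corr >= JUDGE_MIN_CORRELATION
```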
What goes into CI
Every prompt or model change runs:
- Full regression set pass/fail
- Counterfactual consistency deltas
- Fairness metric deltas vs last production release
- Hallucination check (did the model cite facts not in the input?)
Any metric regressing beyond threshold blocks the deploy. No exceptions for "small" prompt edits — small edits cause most of the real regressions.
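A sketch of the gate itself, assuming each eval run writes its metrics to a dict and the last production release's metrics are available for comparison. The metric names and threshold values below are illustrative, not the real config:

```python
# CI gate: compare this run's metrics against the last production release and
# block the deploy if anything regresses past its threshold.
# Metric names and threshold values are illustrative assumptions.
THRESHOLDS = {
    "regression_pass_rate": -0.02,     # pass rate may not drop by more than 2 points
    "counterfactual_drift": 0.01,      # mean pair drift may not rise by more than 0.01
    "demographic_parity_delta": 0.01,
    "equalised_odds_delta": 0.01,
    "hallucination_rate": 0.0,         # zero tolerance for facts not in the input
}

def blocked_metrics(current: dict[str, float], last_release: dict[str, float]) -> list[str]:
    """Return the metrics whose change vs the last release exceeds its threshold."""
    failures = []
    for name, limit in THRESHOLDS.items():
        delta = current[name] - last_release[name]
        # For pass rate the harmful direction is down; for everything else it's up.
        regressed = delta < limit if name == "regression_pass_rate" else delta > limit
        if regressed:
            failures.append(name)
    return failures

# In CI: a non-empty result from blocked_metrics(...) fails the build and blocks the deploy.
```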
The escalation path
Even with all this, the system is never the final decision-maker on a candidate. Every scored candidate has (see the record sketch after this list):
- A human-readable rationale trace
- The exact rubric criteria that drove the score
- A one-click "show me similar candidates who were hired / not hired" comparison
- A manual-override path that goes to a senior interviewer
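Concretely, the per-candidate record behind those four items might look something like the sketch below; every field name here is an assumption about how the trace is stored, not the actual schema:

```python
# Sketch of the per-candidate record behind the escalation path.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ScoredCandidate:
    candidate_id: str
    score: float                              # 0-1 rubric score from the model
    rationale: str                            # human-readable rationale trace
    rubric_criteria: list[str] = field(default_factory=list)  # criteria that drove the score
    similar_hired_ids: list[str] = field(default_factory=list)
    similar_not_hired_ids: list[str] = field(default_factory=list)
    override_requested: bool = False          # routes to a senior interviewer when True
```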
The system's job is to widen the funnel and surface signal. The hiring decision stays human — partly because that's the right call, partly because it's the only position that's legally defensible.
Hiring AI without a real eval harness is malpractice. Hiring AI with one is still hard — but at least it's honestly hard.
