Hiring systems are the AI use case with the lowest tolerance for error. Candidates lose real opportunities; employers face real legal exposure. An eval harness for a hiring LLM is not optional — and it's not the same harness you'd run for a support chatbot.
Three kinds of error, all different
A hiring system can fail in at least three distinct ways:
- Capability error — the model scored a strong candidate as weak, or vice versa.
- Fairness error — the model systematically under- or over-scored candidates by a protected attribute (gender, region, school, age proxies).
- Instruction error — the model didn't follow the rubric, citing criteria it wasn't asked to weigh.
A single accuracy number hides all three. An honest eval reports each separately.
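What "reports each separately" can mean in practice is simply a report object with one field per failure mode. A minimal sketch, assuming scores on a 0–1 scale; the field names are illustrative, not a fixed schema:

```python
# Sketch of a per-release eval report that keeps the three error types
# separate rather than folding them into one accuracy number.
# Field names and semantics are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HiringEvalReport:
    capability_mae: float              # mean absolute error vs interviewer rubric scores
    fairness_max_slice_delta: float    # worst score gap across protected-attribute slices
    instruction_violation_rate: float  # share of outputs citing off-rubric criteria

    def summary(self) -> str:
        return (
            f"capability MAE={self.capability_mae:.3f}, "
            f"max fairness delta={self.fairness_max_slice_delta:.3f}, "
            f"instruction violations={self.instruction_violation_rate:.1%}"
        )
```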
The regression set
For every hiring system we build, the regression set contains:
- 50–100 ground-truth cases annotated by the client's own interviewers, with rubric-level scores (not just hire/no-hire).
- A paired-counterfactual set — 20–40 resumes where only one attribute changed (name swapped, school swapped, gender-coded pronouns swapped). The model's score should not shift meaningfully across the pair; the check is sketched after this list. See Bolukbasi et al., 2016 for the foundational work on measuring this kind of bias in embeddings.
- Adversarial prompts — candidate self-descriptions crafted to trigger false positives (buzzword stuffing, invented certifications). The model must flag these, not reward them.
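The paired-counterfactual check is mechanically simple. A minimal sketch, assuming each pair holds two resume texts that differ in exactly one attribute and a placeholder `score_resume` callable that returns a 0–1 score (both names are assumptions, not the real pipeline):

```python
# Sketch of the paired-counterfactual consistency check.
# `score_resume` is a placeholder for whatever calls the scoring model;
# each pair differs in exactly one attribute (name, school, gendered pronouns).
from typing import Callable, List, Tuple

MAX_PAIR_DRIFT = 0.05  # target from the fairness section: < 0.05 on a 0-1 score

def counterfactual_drift(
    pairs: List[Tuple[str, str]],
    score_resume: Callable[[str], float],
) -> List[float]:
    """Return the absolute score gap for each counterfactual pair."""
    return [abs(score_resume(a) - score_resume(b)) for a, b in pairs]

def counterfactual_check(
    pairs: List[Tuple[str, str]],
    score_resume: Callable[[str], float],
) -> bool:
    """True only if every pair stays under the drift target."""
    return all(d < MAX_PAIR_DRIFT for d in counterfactual_drift(pairs, score_resume))
```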
Fairness metrics we actually track
Not "the model is unbiased." That's a claim no one can honestly make. What we track:
- Demographic parity delta on scored outputs, per protected attribute slice.
- Equalised odds delta at the decision threshold — are false-positive and false-negative rates comparable across groups? See Hardt et al., 2016 for the formal definition.
- Counterfactual consistency — score drift on the paired set above. Target: < 0.05 on a 0–1 score.
All three get reported to the hiring panel. If any of them drifts past its threshold in production, the system stops scoring and falls back to human review.
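A sketch of how the first two deltas might be computed from an eval dataframe of scored candidates. The column names (`passed`, `hired_by_humans`) and the pandas layout are assumptions about how the eval data is stored, not the actual pipeline:

```python
# Group-level fairness deltas, computed per protected-attribute slice.
# `passed` = model decision at the threshold, `hired_by_humans` = ground-truth
# label from the annotated regression set; both column names are assumptions.
import pandas as pd

def demographic_parity_delta(df: pd.DataFrame, group_col: str,
                             decision_col: str = "passed") -> float:
    """Max minus min selection rate across groups in `group_col`."""
    rates = df.groupby(group_col)[decision_col].mean()
    return float(rates.max() - rates.min())

def equalised_odds_delta(df: pd.DataFrame, group_col: str,
                         decision_col: str = "passed",
                         label_col: str = "hired_by_humans") -> float:
    """Worst gap in true-positive or false-positive rate across groups
    (the Hardt et al., 2016 framing)."""
    tpr = df[df[label_col] == 1].groupby(group_col)[decision_col].mean()
    fpr = df[df[label_col] == 0].groupby(group_col)[decision_col].mean()
    return float(max(tpr.max() - tpr.min(), fpr.max() - fpr.min()))
```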
Model-as-judge, calibrated
We use an LLM to grade rubric adherence on open-ended candidate responses. Two non-negotiable rules, borrowed from the Constitutional AI and LLM-as-judge literature:
- The judge model is different from the scoring model. Same-model self-evaluation is reliably biased toward its own outputs.
- The judge prompt has a human calibration set — a batch where experienced interviewers scored the same responses. The judge's outputs get correlated against human scores quarterly; if correlation drops below 0.7, the judge prompt gets rewritten. The check is sketched after this list.
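A minimal sketch of that quarterly calibration check. The post doesn't name a correlation measure, so Spearman is assumed here because rubric scores are ordinal; the 0.7 floor comes from the rule above:

```python
# Quarterly judge-calibration check: correlate the judge model's rubric scores
# against the human calibration batch. Spearman is an assumption; the 0.7
# floor is the threshold named in the text.
from scipy.stats import spearmanr

JUDGE_MIN_CORRELATION = 0.7

def judge_is_calibrated(judge_scores: list[float], human_scores: list[float]) -> bool:
    """False means the judge prompt needs to be rewritten before the next cycle."""
    corr, _ = spearmanr(judge_scores, human_scores)
    return corr >= JUDGE_MIN_CORRELATION
```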
What goes into CI
Every prompt or model change runs:
- Full regression set pass/fail
- Counterfactual consistency deltas
- Fairness metric deltas vs last production release
- Hallucination check (did the model cite facts not in the input?)
Any metric regressing beyond threshold blocks the deploy. No exceptions for "small" prompt edits — small edits cause most of the real regressions.
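A sketch of the gate itself, assuming each eval run writes its metrics to a dict and the last production release's metrics are available for comparison. The metric names and threshold values below are illustrative, not the real config:

```python
# CI gate: compare this run's metrics against the last production release and
# block the deploy if anything regresses past its threshold.
# Metric names and threshold values are illustrative assumptions.
THRESHOLDS = {
    "regression_pass_rate": -0.02,     # pass rate may not drop by more than 2 points
    "counterfactual_drift": 0.01,      # mean pair drift may not rise by more than 0.01
    "demographic_parity_delta": 0.01,
    "equalised_odds_delta": 0.01,
    "hallucination_rate": 0.0,         # zero tolerance for facts not in the input
}

def blocked_metrics(current: dict[str, float], last_release: dict[str, float]) -> list[str]:
    """Return the metrics whose change vs the last release exceeds its threshold."""
    failures = []
    for name, limit in THRESHOLDS.items():
        delta = current[name] - last_release[name]
        # For pass rate the harmful direction is down; for everything else it's up.
        regressed = delta < limit if name == "regression_pass_rate" else delta > limit
        if regressed:
            failures.append(name)
    return failures

# In CI: a non-empty result from blocked_metrics(...) fails the build and blocks the deploy.
```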
The escalation path
Even with all this, the system is never the final decision-maker on a candidate. Every scored candidate has (see the record sketch after this list):
- A human-readable rationale trace
- The exact rubric criteria that drove the score
- A one-click "show me similar candidates who were hired / not hired" comparison
- A manual-override path that goes to a senior interviewer
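Concretely, the per-candidate record behind those four items might look something like the sketch below; every field name here is an assumption about how the trace is stored, not the actual schema:

```python
# Sketch of the per-candidate record behind the escalation path.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ScoredCandidate:
    candidate_id: str
    score: float                              # 0-1 rubric score from the model
    rationale: str                            # human-readable rationale trace
    rubric_criteria: list[str] = field(default_factory=list)  # criteria that drove the score
    similar_hired_ids: list[str] = field(default_factory=list)
    similar_not_hired_ids: list[str] = field(default_factory=list)
    override_requested: bool = False          # routes to a senior interviewer when True
```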
The system's job is to widen the funnel and surface signal. The hiring decision stays human — partly because that's the right call, partly because it's the only position that's legally defensible.
Hiring AI without a real eval harness is malpractice. Hiring AI with one is still hard — but at least it's honestly hard.
