Operonn

Evals for LLM hiring pipelines

6 min read · EVALS · HIRING · FAIRNESS

Hiring systems are the AI use case with the lowest tolerance for error. Candidates lose real opportunities; employers face real legal exposure. An eval harness for a hiring LLM is not optional — and it's not the same harness you'd run for a support chatbot.

Three kinds of error, all different

A hiring system can fail in at least three distinct ways:

  1. Capability error — the model scored a strong candidate as weak, or vice versa.
  2. Fairness error — the model systematically under- or over-scored candidates by a protected attribute (gender, region, school, age proxies).
  3. Instruction error — the model didn't follow the rubric, citing criteria it wasn't asked to weigh.

A single accuracy number hides all three. An honest eval reports each separately.
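Keeping the three separate can be enforced structurally: use a report type that has no combined score at all. A minimal sketch — field names are illustrative, not a production schema:

```python
from dataclasses import dataclass

@dataclass
class HiringEvalReport:
    """One number per failure mode -- never collapsed into a single score.
    Field names are illustrative, not a production schema."""
    capability_mae: float       # mean abs. error vs interviewer rubric scores
    fairness_max_delta: float   # worst score drift across protected slices
    rubric_adherence: float     # share of rationales citing only rubric criteria

    def summary(self):
        return (f"capability_mae={self.capability_mae:.3f}  "
                f"fairness_max_delta={self.fairness_max_delta:.3f}  "
                f"rubric_adherence={self.rubric_adherence:.0%}")

report = HiringEvalReport(0.08, 0.03, 0.94)
```

A reviewer who wants one number has to compute it themselves — which is the point.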

The regression set

For every hiring system we build, the regression set contains:

  • 50–100 ground-truth cases annotated by the client's own interviewers, with rubric-level scores (not just hire/no-hire).
  • A paired-counterfactual set — 20–40 resumes where only one attribute changed (name swapped, school swapped, gender-coded pronouns swapped). The model's score should not shift meaningfully across the pair. See Bolukbasi et al., 2016 for the foundational work on measuring this kind of bias in embeddings.
  • Adversarial prompts — candidate self-descriptions crafted to trigger false positives (buzzword stuffing, invented certifications). The model must flag these, not reward them.
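The paired-counterfactual check itself is mechanically simple. A sketch, assuming a `score` callable mapping a resume to a 0–1 score; the toy scorer below is a stand-in for the real model and the names are illustrative:

```python
def counterfactual_drift(score, pairs):
    """Worst-case score drift across attribute-swapped resume pairs.

    `score` is any callable mapping a resume dict to a float in [0, 1];
    `pairs` is a list of (original, swapped) resumes that differ in
    exactly one attribute. Names here are illustrative, not a fixed API.
    """
    return max(abs(score(a) - score(b)) for a, b in pairs)

# Toy scorer that ignores the name field entirely, as the real model should.
toy_score = lambda r: min(len(r["skills"]) / 10, 1.0)

pairs = [
    ({"name": "Greg", "skills": ["python", "sql"]},
     {"name": "Lakisha", "skills": ["python", "sql"]}),
]
drift = counterfactual_drift(toy_score, pairs)
```

The real harness runs this over every pair on every prompt change and compares `drift` against the 0.05 target below.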

Fairness metrics we actually track

Not "the model is unbiased." That's a claim no one can honestly make. What we track:

  • Demographic parity delta on scored outputs, per protected attribute slice.
  • Equalised odds delta at the decision threshold — are false-positive and false-negative rates comparable across groups? See Hardt et al., 2016 for the formal definition.
  • Counterfactual consistency — score drift on the paired set above. Target: < 0.05 on a 0–1 score.
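The first two metrics are a few lines each. A plain-Python sketch — function names are ours for illustration, not a library API:

```python
def parity_delta(scores, groups):
    """Demographic parity delta: spread of mean model scores across slices."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    means = [sum(v) / len(v) for v in by_group.values()]
    return max(means) - min(means)

def equalized_odds_delta(preds, labels, groups):
    """Largest cross-group gap in false-positive or false-negative rate
    at the decision threshold (preds and labels are 0/1)."""
    tallies = {}  # group -> [fp, fn, negatives, positives]
    for p, y, g in zip(preds, labels, groups):
        t = tallies.setdefault(g, [0, 0, 0, 0])
        t[0] += int(p == 1 and y == 0)   # false positive
        t[1] += int(p == 0 and y == 1)   # false negative
        t[2] += int(y == 0)
        t[3] += int(y == 1)
    fprs = [t[0] / t[2] for t in tallies.values() if t[2]]
    fnrs = [t[1] / t[3] for t in tallies.values() if t[3]]
    gap = lambda rates: max(rates) - min(rates) if rates else 0.0
    return max(gap(fprs), gap(fnrs))

d = parity_delta([0.8, 0.6, 0.7, 0.5], ["a", "a", "b", "b"])  # 0.1 spread
```

In production these run per protected-attribute slice, which standard fairness tooling (e.g. fairlearn's metric functions) also supports out of the box.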

All three get reported to the hiring panel. If any drifts past its threshold in production, the system stops scoring and falls back to human review.

Model-as-judge, calibrated

We use an LLM to grade rubric adherence on open-ended candidate responses. Two non-negotiable rules, borrowed from the Constitutional AI and LLM-as-judge literature:

  1. The judge model is different from the scoring model. Same-model self-evaluation is reliably biased toward its own outputs.
  2. The judge prompt has a human calibration set — a batch where experienced interviewers scored the same responses. The judge's outputs get correlated against human scores quarterly; if correlation drops below 0.7, the judge prompt gets rewritten.

What goes into CI

Every prompt or model change runs:

  • Full regression set pass/fail
  • Counterfactual consistency deltas
  • Fairness metric deltas vs last production release
  • Hallucination check (did the model cite facts not in the input?)

Any metric regressing beyond threshold blocks the deploy. No exceptions for "small" prompt edits — small edits cause most of the real regressions.
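A CI gate along these lines can be a dozen lines. In this sketch every metric is framed so that higher is worse; the metric names and limits are illustrative, not our actual policy:

```python
# Each limit is the maximum allowed increase over the last production
# release. Metric names and numbers are illustrative, not actual policy.
THRESHOLDS = {
    "regression_fail_rate": 0.0,     # no new regression-set failures
    "counterfactual_drift": 0.0,     # no new drift on the paired set
    "fairness_parity_delta": 0.02,
    "hallucination_rate": 0.0,       # no new uncited facts
}

def gate(current, baseline):
    """Return the list of threshold violations; any entry blocks the deploy."""
    failures = []
    for metric, limit in THRESHOLDS.items():
        delta = current[metric] - baseline.get(metric, 0.0)
        if delta > limit:
            failures.append(f"{metric}: {delta:+.3f} exceeds +{limit:.3f}")
    return failures

baseline = {"regression_fail_rate": 0.02, "counterfactual_drift": 0.03,
            "fairness_parity_delta": 0.01, "hallucination_rate": 0.0}
current  = {"regression_fail_rate": 0.02, "counterfactual_drift": 0.03,
            "fairness_parity_delta": 0.04, "hallucination_rate": 0.0}
blocked = gate(current, baseline)   # one violation: parity delta rose past limit
```

The gate runs identically for a one-word prompt edit and a model swap — that symmetry is what makes the "no exceptions" rule enforceable.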

The escalation path

Even with all this, the system is never the final decision-maker on a candidate. Every scored candidate has:

  • A human-readable rationale trace
  • The exact rubric criteria that drove the score
  • A one-click "show me similar candidates who were hired / not hired" comparison
  • A manual-override path that goes to a senior interviewer

The system's job is to widen the funnel and surface signal. The hiring decision stays human — partly because that's the right call, partly because it's the only position that's legally defensible.

Hiring AI without a real eval harness is malpractice. Hiring AI with one is still hard — but at least it's honestly hard.


Working on something like this?

Most of our engagements start with one email.

hello@operonn.com