Score Calibration

Validate whether your AI scores are consistent — and learn how to spot, diagnose, and fix calibration issues before they affect hiring decisions.

Written by Miguel Olivares

Score Calibration is where you answer one question: "Can I trust these scores?" A score of 4.2 only means something if the rubric applies the same standard to every candidate, every assessment, over time. This screen gives you the tools to verify that — and to catch the moment it starts to drift.


What is score calibration?

When an AI agent scores a candidate, it evaluates the transcript against a rubric — a set of soft skill dimensions, each with criteria for what counts as a 0, 1, 2, 3, 4, or 5. Calibration means the rubric is applying those criteria consistently:

  • A candidate who gives a strong answer about problem-solving should score the same whether they're the first call of the day or the hundredth.

  • Two candidates who give equivalently detailed answers should receive similar scores, regardless of which assessment they went through.

When calibration breaks down, you get score compression (everyone scores 4–5), score drift (scores creep higher or lower over time), or inconsistent variance (some batches show wildly different scoring patterns).


The four KPI cards

  • Avg Std. Deviation: Your headline consistency number. Target: ≤ 0.7. Warning: 0.7–1.1. Poor: > 1.1.

  • Latest Trend: Compares the most recent wave's std. dev to the one before it. Negative = improved; positive = degraded.

  • Healthy Waves: Count of waves where std. dev was ≤ 0.7. More healthy waves = more reliable scoring history.

  • Poor Waves: Count of waves where std. dev exceeded 1.1. Even one or two is worth investigating.
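
The arithmetic behind these cards is simple enough to check by hand. Here is a minimal Python sketch using made-up per-wave values; the numbers are illustrative only, and the Hub's own wave windowing and rounding may differ.

    # Hypothetical per-wave standard deviations, oldest to newest.
    # In the Hub, these come from the Std. Deviation Over Time chart.
    wave_std_devs = [0.55, 0.62, 0.71, 1.23, 0.68]

    # Avg Std. Deviation: the headline consistency number.
    avg_std_dev = sum(wave_std_devs) / len(wave_std_devs)

    # Latest Trend: most recent wave minus the one before it.
    # Negative = improved; positive = degraded.
    latest_trend = wave_std_devs[-1] - wave_std_devs[-2]

    # Healthy Waves: std. dev at or below 0.7.
    healthy_waves = sum(1 for s in wave_std_devs if s <= 0.7)

    # Poor Waves: std. dev above 1.1.
    poor_waves = sum(1 for s in wave_std_devs if s > 1.1)

    print(f"Avg Std. Deviation: {avg_std_dev:.2f}")   # 0.76
    print(f"Latest Trend: {latest_trend:+.2f}")       # -0.55 (improved)
    print(f"Healthy Waves: {healthy_waves}, Poor Waves: {poor_waves}")  # 3 and 1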


Standard deviation, in plain language

Standard deviation sounds technical, but the concept is simple. Imagine you scored 100 candidates and the average was 3.5:

  • Std. dev of 0.5: Most candidates scored between 3.0 and 4.0. The rubric is consistent; it agrees with itself about what a "3.5-level" candidate looks like.

  • Std. dev of 1.0: Scores range from 2.5 to 4.5. There's meaningful disagreement in how candidates are being evaluated.

  • Std. dev of 1.5: Scores are everywhere from 2.0 to 5.0. The rubric isn't drawing consistent lines between performance levels.

A "wave" is a batch of candidates scored within a time window. The Hub groups scored candidates into waves automatically — you don't define them. If you scored 50 candidates on Monday and 50 on Tuesday, those are two waves. Comparing their std. devs tells you whether the rubric performed the same on both days.


Std. Deviation Over Time chart

This timeline shows daily or weekly std. dev values color-coded by health: green (≤ 0.7), yellow (0.7–1.1), red (> 1.1).
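
If it helps to see the banding spelled out, this short Python sketch maps a std. dev value to the chart's three color bands. How values exactly on a boundary (0.7 or 1.1) are bucketed is an assumption based on the thresholds above.

    def health_color(std_dev: float) -> str:
        """Map a wave's std. dev to the chart's color bands."""
        if std_dev <= 0.7:
            return "green"   # healthy
        if std_dev <= 1.1:
            return "yellow"  # warning
        return "red"         # poor

    # Hypothetical daily values.
    for value in [0.55, 0.82, 1.25]:
        print(value, "->", health_color(value))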

A sudden jump usually correlates with one of three causes:

  • A new assessment launched with an untested rubric.

  • An existing rubric or system prompt was recently changed.

  • A large batch of edge-case candidates came through — very short calls, non-English speakers, or technical failures.


Score distribution and compression

The score distribution chart shows how many candidates received each score level. A healthy distribution sits roughly bell-shaped around 3. If most scores cluster at 4–5, the rubric has lost its ability to differentiate — this is score compression.

For example: if 93% of candidates score 4 or 5, a candidate scoring 4.2 might be excellent — or might be average. You can't tell, because almost everyone gets a similar score.
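
One way to put a number on compression is to tally how many candidates land in the top band. The sketch below uses an invented distribution that reproduces the 93% example; the 80% cutoff in the check is illustrative, not an official product threshold.

    from collections import Counter

    # Hypothetical rounded overall scores for 100 candidates.
    # 93 of them land on 4 or 5, the compressed pattern described above.
    scores = [5] * 48 + [4] * 45 + [3] * 5 + [2] * 2

    counts = Counter(scores)
    share_at_top = (counts[4] + counts[5]) / len(scores)

    print(f"Scores at 4 or 5: {share_at_top:.0%}")  # 93%
    if share_at_top > 0.8:  # illustrative cutoff, not an official threshold
        print("Likely score compression: the rubric is not differentiating candidates.")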

Important: Score compression is a calibration issue, not a candidate quality issue. It doesn't mean all your candidates are performing equally — it means the scoring criteria are too lenient. Rubric adjustments are handled by the Talkpush team; do not attempt to change the configuration yourself.

What to check before contacting your representative

  1. Review the rubric language. If the criteria for a 4 or 5 are too easy to satisfy (e.g., "candidate answered the question"), nearly everyone will qualify. The criteria likely need tightening.

  2. Check the per-skill picture. Some skills may be well-calibrated while others are compressed. The Soft Skill Averages section below shows this — any skill where everyone scores 4.0+ with low deviation is likely compressed.

  3. Compare hired vs. not-hired. Go to Metrics → Outcome Analysis. If hired and not-hired candidates have similar score distributions, the score isn't adding value to hiring decisions.

  4. Share your findings. When you contact your Talkpush representative, include the affected assessment, time period, and any patterns you spotted. The faster the diagnosis, the faster the fix.


Soft Skill Averages

This section breaks down average score and std. deviation for each skill dimension. Use it to pinpoint which skills are driving calibration problems:

  • High average (4.0+), low std. dev: Over-scored. Criteria for this skill are too easy. The rubric needs tightening.

  • Low average, high std. dev: Ambiguous criteria. The rubric language may be unclear or open to interpretation.

  • Moderate average (2.5–3.5), moderate std. dev: Well-calibrated. The rubric is differentiating candidates effectively for this skill.

Sample size matters. A skill assessed on 312 candidates is far more reliable than one assessed on 1. Always check the "n assessed" count next to each skill before drawing conclusions.
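
As a rough diagnostic aid, the sketch below turns the patterns above into a Python classification function. The numeric cutoffs for "low" and "high" deviation and the minimum sample size of 30 are assumptions for illustration only; judge real skills against the patterns and the "n assessed" count shown in the Hub.

    def skill_pattern(avg: float, std_dev: float, n_assessed: int) -> str:
        """Rough mapping of the patterns above. Cutoffs are indicative, not official."""
        if n_assessed < 30:            # assumed minimum sample size
            return "sample too small to judge"
        if avg >= 4.0 and std_dev <= 0.5:
            return "over-scored: criteria too easy"
        if avg < 2.5 and std_dev > 1.1:
            return "ambiguous criteria: rubric unclear"
        if 2.5 <= avg <= 3.5 and 0.5 <= std_dev <= 1.1:
            return "well-calibrated"
        return "mixed signal: review individual reports"

    # Hypothetical skills: (name, average, std. dev, n assessed).
    for name, avg, sd, n in [
        ("Communication", 4.3, 0.3, 312),
        ("Problem-solving", 3.1, 0.8, 312),
        ("Adaptability", 3.4, 0.9, 1),
    ]:
        print(f"{name}: {skill_pattern(avg, sd, n)}")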


Common workflows

Monthly calibration review

  1. Open Score Calibration and set the time filter to "Last 30 days."

  2. Check Avg Std. Deviation. Is it ≤ 0.7? If yes, scoring is healthy.

  3. Look at the Std. Deviation Over Time chart. Any spikes into yellow or red?

  4. Check the Score Distribution. Is there compression (most scores at 4–5)?

  5. Review Soft Skill Averages. Any dimensions with suspiciously high averages or unusually large deviations?

  6. If you find issues, contact your Talkpush representative — fixes usually involve rubric or system prompt adjustments.

Responding to a "poor wave" alert

  1. Note the time period from the Std. Deviation Over Time chart.

  2. Go to Reports and filter to that date range.

  3. Open 3–5 candidate reports from that period and compare their per-dimension scores.

  4. Look for the pattern: Is one skill scored inconsistently? Are short calls getting inflated scores? Are certain question types producing unreliable evaluations?

  5. Contact your Talkpush representative with the affected time period, agent name, and which dimensions look inconsistent.

  6. Monitor the next wave to confirm the fix worked.


When to contact your Talkpush representative

  • Score compression is detected (most candidates clustering at the same score).

  • Standard deviation is rising across multiple waves.

  • A specific soft skill dimension has a suspiciously high average (4.0+) with very low deviation.

  • Poor waves are appearing that you cannot explain by candidate volume changes.


See also

For pipeline volume, completion analysis, and hiring outcome predictiveness, see Metrics: Volume, Outcomes, and Hiring Intelligence.

