
TalkScore: How is the score generated?

A transparent walkthrough of the methodology behind every AI interview

Written by Crismin Joy Lagamayo

Hiring decisions are too important to come out of a black box. When a recruiter, hiring manager, or compliance reviewer asks "why did this candidate score a 4?", the answer needs to be specific, defensible, and grounded in something the candidate actually said or did. This article explains, end to end, how TalkScore arrives at every score it produces.

If you read nothing else, read this: TalkScore is not a model that ingests a transcript and emits a mysterious number. It is a structured pipeline of behavioral rubrics, each one authored to evaluate a single skill, each one explainable from the transcript up. Everything below describes how that pipeline works in practice.


The short version

  1. A candidate completes an AI-led interview, conducted by a voice or web agent built for the specific role.

  2. After the interview, the full transcript is analyzed by a large language model running against a series of skill-specific rubrics defined for that role.

  3. For each skill, the model produces a numerical score on a 0–5 scale and a short qualitative analysis paragraph.

  4. The per-skill scores are aggregated into an overall TalkScore.

  5. Every score, rubric, and analysis is visible to the recruiter and the client, and can be reviewed, edited, and fed back into rubric refinement.
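
Before the long version, a minimal Python sketch of this pipeline may help fix the mental model. Everything here is illustrative: the function names, the data shapes, and the run_evaluator stand-in are invented for this article, not TalkScore's implementation.

```python
from statistics import mean

def run_evaluator(transcript: str, rubric: str) -> tuple[int, str]:
    """Stand-in for the per-skill LLM call: in the real pipeline this
    sends the full transcript plus one skill rubric to the scoring
    model and parses back a 0-5 score and a short analysis paragraph."""
    raise NotImplementedError  # placeholder; the model call lives here

def score_interview(transcript: str, rubrics: dict[str, str]) -> dict:
    """One independent evaluator pass per skill, then aggregation."""
    results = {}
    for skill, rubric in rubrics.items():
        score, analysis = run_evaluator(transcript, rubric)
        results[skill] = {"score": score, "analysis": analysis}
    overall = mean(r["score"] for r in results.values())  # default: plain mean
    return {"overall": round(overall, 1), "skills": results}
```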

The rest of this article is the long version.


What happens during the interview

Every TalkScore interview is conducted by an AI agent configured for the specific role. The agent has been given a structured prompt (sketched in code after this list) that defines:

  • The role context (job title, hiring company, role overview)

  • The verbatim opening sequence and closing sequence

  • The five to seven core interview questions, delivered word-for-word as written

  • Follow-up logic for shallow or vague answers

  • Conversational behavior rules (acknowledgments, tone, pacing, turn-taking)

  • Protocols for candidates who want to reschedule, decline, or skip a question

  • Topics the agent must not discuss (salary, protected characteristics, anything outside its scope)
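
To make the shape of that prompt concrete, here is a minimal sketch of the configuration as structured data. The field names are hypothetical, chosen to mirror the list above rather than TalkScore's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterviewAgentConfig:
    """Illustrative shape of a role-specific agent configuration."""
    job_title: str
    company: str
    role_overview: str
    opening_script: str            # delivered verbatim
    closing_script: str            # delivered verbatim
    core_questions: list[str]      # five to seven, asked word-for-word
    follow_up_rules: list[str]     # probes for shallow or vague answers
    behavior_rules: list[str]      # acknowledgments, tone, pacing, turn-taking
    protocols: list[str]           # reschedule, decline, skip a question
    forbidden_topics: list[str] = field(default_factory=lambda: [
        "salary", "protected characteristics"])
```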

The agent does not score the candidate during the interview. All evaluation is done after the call completes, against the full transcript, in a separate pass. This separation matters: the questions and the rubrics are independent, and the model isn't influenced by partial impressions formed mid-conversation.


How each skill is evaluated

Each role has a set of skills that matter for performance — typically between five and twelve, depending on role complexity. A customer service role might evaluate Active Listening, Customer Empathy, Problem Solving, Professionalism, and Attention to Detail. A collections role might emphasize Resilience, Composure under Pressure, and Persuasive Communication. A team-lead role might weigh Accountability, Drive, and Team Orientation more heavily.

For each skill, there is a dedicated evaluator: a self-contained prompt that runs against the transcript and produces two outputs.

Output 1: a numerical score, 0 to 5

Produced against an explicit rubric with anchor descriptions for every level. Here is the actual anchor scale for one skill — Humility, as used on a customer service role:

5 — Candidate gave clear examples of acknowledging a mistake, a knowledge gap, or a failure — honestly, without being defensive, and with a specific account of what they learned and how they changed. Credit was given to others where appropriate.

4 — Solid evidence of humility with minor gaps. Candidate acknowledged limitations or mistakes but may have softened the account slightly, or mentioned only learning outcomes without fully sitting with the difficulty.

3 — Mixed evidence. Candidate could name a weakness or mistake when directly asked, but framing was often defensive or included some self-justification. Humility is present but partial.

2 — Limited evidence. Candidate struggled to acknowledge limitations or reframed all difficult situations as learning wins rather than genuine mistakes. Little sign they held themselves accountable.

1 — Candidate showed strong resistance to acknowledging fault, consistently attributed difficulties to others, or gave answers that revealed an inflated self-assessment with no evidence to support it.

0 — No usable evidence of humility in the transcript.

Every skill on every role has anchors written to this level of behavioral specificity. The anchors describe what the candidate did or said — not how articulate they sounded saying it.
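
As an illustration of how an anchor scale like this can drive an evaluator, here is a sketch that expresses the Humility anchors (abbreviated) as data and assembles a scoring prompt from them. The helper and its wording are hypothetical; the real prompts are the ones visible in TalkScore Hub.

```python
# Abbreviated versions of the anchor descriptions shown above.
HUMILITY_ANCHORS = {
    5: "Clear examples of owning a mistake or gap, with specific learning.",
    4: "Solid evidence with minor gaps; account slightly softened.",
    3: "Mixed evidence; acknowledgment present but often defensive.",
    2: "Limited evidence; difficulties reframed as wins, little ownership.",
    1: "Strong resistance to fault; blame shifted; inflated self-assessment.",
    0: "No usable evidence of humility in the transcript.",
}

def build_rubric_prompt(skill: str, anchors: dict[int, str]) -> str:
    """Assemble a per-skill evaluator prompt from its anchor scale.
    Hypothetical helper, not TalkScore's actual prompt format."""
    lines = [f"Score the candidate's {skill} on a 0-5 scale:"]
    lines += [f"{level} - {text}"
              for level, text in sorted(anchors.items(), reverse=True)]
    lines.append("Evaluate what the candidate described doing, "
                 "not how they described it.")
    return "\n".join(lines)
```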

Output 2: an analysis paragraph

A short, recruiter-facing summary that explains the score in plain language, referencing specific moments in the transcript. Analysis paragraphs are constrained:

  • They may not exceed a defined length (typically two to three sentences).

  • They may not include generic filler ("the candidate was highly motivated") unless supported by a behavioral example.

  • They may not comment on the candidate's tone, accent, or fluency.

  • They may not invent information not present in the transcript.
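
These constraints are enforced in the evaluator prompt itself, but a rough post-hoc check is easy to sketch. The rules below (a three-sentence cap, a short list of forbidden style terms) are invented for illustration.

```python
import re

FORBIDDEN_TERMS = ("tone", "accent", "fluency")  # style commentary is out of scope

def check_analysis(paragraph: str, max_sentences: int = 3) -> list[str]:
    """Rough sanity checks on an analysis paragraph. Illustrative only:
    the real constraints live in the evaluator prompt."""
    problems = []
    sentences = [s for s in re.split(r"[.!?]+\s*", paragraph.strip()) if s]
    if len(sentences) > max_sentences:
        problems.append(f"too long: {len(sentences)} sentences")
    for term in FORBIDDEN_TERMS:
        if re.search(rf"\b{term}\b", paragraph, re.IGNORECASE):
            problems.append(f"comments on candidate {term}")
    return problems
```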

Together these two outputs serve different purposes. The numerical score makes candidates sortable, filterable, and comparable at scale. The analysis paragraph gives the recruiter the context they need to act on the score — to decide whether to advance the candidate, brief the next interviewer, or override the score with their own judgment.


What the model is explicitly told to ignore

A common concern with AI scoring is that the model will reward eloquent candidates and penalize nervous ones, or favor candidates whose communication style matches a particular cultural or class background. TalkScore's rubrics actively suppress these biases. Every skill rubric ends with an instruction along these lines:

Do not raise or lower the score based on specific words, adverbs, or phrases the candidate used. Evaluate based on what they actually described doing, deciding, or experiencing — not how they described it.

For skills where the bias trap is particularly obvious, the instruction is even more direct:

  • The Confidence rubric says: "Do not raise or lower the score based on how assertively or fluently the candidate spoke. Evaluate based on the behaviors and decisions described in their answers."

  • The Emotionality rubric says: "Do not raise or lower the score based on how calm or composed the candidate seemed during the interview itself. Evaluate based on the behaviors and responses they described in their examples."

The rubrics are written this way because we believe the construct being measured is behavior, not performance. A nervous candidate who describes a thoughtful response to a difficult customer should score as well as a polished candidate who describes the same response.


How the overall score is calculated

After every skill has been independently evaluated, the per-skill scores are aggregated into a single overall TalkScore. The default aggregation is the mean of the per-skill scores, presented on the same 0–5 scale. This is intentionally simple and intentionally inspectable: a recruiter who sees a TalkScore of 3.6 can drill into the constituent skills and see exactly which skills carried the score and which dragged it down.

For roles where some skills should count more than others, the aggregation can be weighted. Active Listening and Empathy might be weighted higher for a customer service role; Resilience and Composure might be weighted higher for a collections role. Weighting is configured per role and is visible in the rubric documentation. If you don't specify weights, all skills count equally — but for any role where weighting matters to your hiring outcomes, we'll help you define a weighting scheme that reflects what your operations team actually values.
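
Concretely, the aggregation is a plain or weighted mean on the same 0–5 scale. A minimal sketch, with skill names and weights invented for the example:

```python
def overall_talkscore(skill_scores: dict[str, float],
                      weights: dict[str, float] | None = None) -> float:
    """Aggregate per-skill scores (0-5) into one overall score.
    With no weights supplied, every skill counts equally."""
    w = weights or {skill: 1.0 for skill in skill_scores}
    total_weight = sum(w.get(skill, 1.0) for skill in skill_scores)
    weighted = sum(score * w.get(skill, 1.0)
                   for skill, score in skill_scores.items())
    return round(weighted / total_weight, 1)

# Example: a customer service role weighting Active Listening and Empathy higher.
scores = {"Active Listening": 4, "Customer Empathy": 5, "Problem Solving": 3,
          "Professionalism": 4, "Attention to Detail": 2}
print(overall_talkscore(scores))  # 3.6 (plain mean)
print(overall_talkscore(scores, {"Active Listening": 2,
                                 "Customer Empathy": 2}))  # 3.9 (weighted)
```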


Interview length: a deliberate trade-off

The most consequential design decision in any AI interview isn't which skills to score — it's how long the interview should be. There's a real trade-off here, and the right answer depends on the role, your candidate pool, and your hiring funnel. We'd rather make the trade-off explicit than pretend it doesn't exist.

The case for a longer interview

Every additional question gives the model more behavioral evidence per skill. A single answer about a past mistake is reasonable evidence for Humility; two or three independent stories give the scoring model substantially more to work with, and the resulting score is correspondingly more reliable. Skills that depend on a specific kind of evidence — say, Resourcefulness, which is best demonstrated through a story about an under-supported situation — particularly benefit from a question dedicated to surfacing that evidence. In an ideal world with infinite candidate patience, every skill would have its own dedicated question.

The case for a shorter interview

Candidates are not infinitely patient. A long interview is taxing — candidates fatigue, give shorter answers as the interview progresses, and start treating later questions as obstacles rather than opportunities. Completion rates fall: a 25-minute interview will see noticeably more mid-call drop-off than a 12-minute one, and the candidates who drop are disproportionately the strong ones who have other options. The candidate experience itself is a hiring signal — a respectful, well-paced interview is part of how you compete for talent. And there's a less obvious failure mode worth naming: fatigue degrades the quality of the data. A longer interview gets you more answers but not always better ones.

Where the happy medium sits

Most TalkScore interviews land between 10 and 15 minutes, with five to seven core questions, each capable of producing evidence for two or three skills at once. This isn't a coincidence — it's the band where evidence depth and candidate experience tend to balance for the roles we most commonly see. But it isn't a rule. Roles with deep skill requirements or low candidate volumes may justify longer interviews; high-volume frontline roles where speed and completion rate dominate may justify shorter ones.

This is a decision worth talking through, not defaulting on. Before launching a new role, our specialists work with your team to map the skills you want to measure against the interview design that will surface them most reliably without taxing your candidates more than necessary. If you have an existing interview that runs longer than it needs to, or shorter than it should, we can usually tell from the data — completion rates, per-skill score distributions, and "no usable evidence" rates all flag length-versus-coverage mismatches. Ask us; we've made this trade-off across hundreds of roles, and we have strong opinions.
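
As one example of those signals, the "no usable evidence" rate for a skill is just the share of interviews scoring 0 on it. A sketch, assuming per-interview skill-to-score mappings:

```python
def no_evidence_rate(interview_scores: list[dict[str, int]], skill: str) -> float:
    """Share of interviews in which a skill scored 0 for lack of usable
    evidence. A persistently high rate points to interview design (the
    questions aren't surfacing the skill), not a scoring problem."""
    zeros = sum(1 for scores in interview_scores if scores.get(skill) == 0)
    return zeros / len(interview_scores)
```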


Quality monitoring: how we know the rubric is working

A score is only as good as the rubric that produced it. TalkScore Hub includes a Score Calibration view that monitors scoring health continuously:

  • Standard deviation per scoring wave: Measures whether the rubric is differentiating candidates or compressing them all into a narrow band. A healthy rubric produces a moderate spread; a rubric where every candidate scores 4 or 5 is not actually informing your hiring decisions and is flagged for review.

  • Score distribution: Shows how candidates are distributed across the 0–5 scale. A heavily top-skewed distribution suggests the rubric criteria for the higher anchors are too easy to satisfy and need tightening.

  • Per-skill averages and ranges: Surfaces skills that are systematically over- or under-scored, which usually indicates rubric drift that needs adjustment.

  • Quality flags: Surface specific issues with individual interviews, including hallucinations, repetitions, technical errors, scoring inconsistencies, and bias indicators. Each flag is logged and reviewable.
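
A toy version of the first two monitors shows how mechanical these checks are. The thresholds below are invented for illustration, not TalkScore's actual settings:

```python
from statistics import pstdev

def calibration_flags(wave_scores: list[float],
                      min_spread: float = 0.5,
                      top_skew_cutoff: float = 0.7) -> list[str]:
    """Toy versions of the spread and distribution monitors."""
    flags = []
    if pstdev(wave_scores) < min_spread:
        flags.append("compressed band: rubric may not be differentiating candidates")
    top_share = sum(1 for s in wave_scores if s >= 4) / len(wave_scores)
    if top_share > top_skew_cutoff:
        flags.append("top-skewed: higher anchors may be too easy to satisfy")
    return flags
```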

These views are available to every TalkScore Hub client. When something looks off, we want you to be able to see it before we do, and we'll work with you to resolve it.


How feedback improves the rubrics

When a recruiter reviews a candidate report and disagrees with a score, they can submit a Score Opinion: a structured note that captures the disagreement, the skill in question, and the reasoning. Score Opinions accumulate in the Feedback tab of TalkScore Hub, and our scoring team reviews them on a rolling basis.

We are deliberate about how this feedback flows back into the rubrics. We do not automatically retrain the scoring model on individual disagreements, because doing so would mean one recruiter's idiosyncratic standard could pull the rubric for an entire team. Instead, when feedback patterns emerge — for example, when multiple reviewers consistently flag the same skill as over-scored on the same role — we revise the rubric, validate the revision against a held-out set of recent interviews, and deploy the change with full visibility. Every rubric revision is versioned, and we'll tell you what changed and why.
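
One way to picture that validation step: score the same held-out interviews under both rubric versions and compare the shift and the spread before deploying. The function names here are stand-ins, not a real API:

```python
from statistics import mean, pstdev

def validate_revision(held_out_transcripts: list[str],
                      score_old, score_new) -> dict:
    """Compare a revised rubric against the current one on a held-out
    set of recent interviews. score_old and score_new are stand-ins
    for running each rubric version through the evaluator."""
    old = [score_old(t) for t in held_out_transcripts]
    new = [score_new(t) for t in held_out_transcripts]
    return {
        "mean_shift": round(mean(n - o for n, o in zip(new, old)), 2),
        "spread_old": round(pstdev(old), 2),  # did the revision add variance?
        "spread_new": round(pstdev(new), 2),
    }
```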

This is a human-in-the-loop process by design. Your team's feedback shapes our rubrics, but only after we've verified the change will improve scoring rather than introduce new variance. For clients who want a faster or more automated feedback loop, we're actively building richer options — including outcome-based feedback that ties rubric performance back to post-hire performance and retention. If that's something you'd like to pilot, we'd love to talk.


What's configurable

TalkScore is built to be customized to your roles and your standards, not used as a one-size-fits-all template. The following are configurable per client:

  • Interview questions: We typically start from a baseline question set for the role type, then refine the questions in collaboration with your hiring managers.

  • Skills being scored: Add, remove, or rename skills. If you want to evaluate a skill we haven't defined a rubric for, we'll build it.

  • Rubric anchors: If your definition of a 5 on Empathy is different from our baseline, we'll rewrite the anchors to match yours.

  • Weighting between skills: Per-role weighting if some skills should count more than others.

  • Interview language: TalkScore supports English variants (UK, US, regional accents including South African, Indian, and West African) and is expanding to additional languages. Ask us what's currently available.

  • Agent voice and persona: Selected to fit the brand and the candidate experience you want to create.

  • Candidate-facing report: What gets shown to candidates after the interview, and what stays internal to your recruiting team.

  • CEFR language proficiency scoring: For roles where language ability is a core qualification, we can layer a CEFR-based language assessment (pronunciation, fluency, vocabulary, grammar, coherence) on top of the soft-skills scoring.


Frequently asked questions

Q: Which model do you use for scoring?

A: TalkScore uses leading large language models for transcript analysis. We continuously evaluate new models as they become available and migrate to the best available option for scoring quality. The current model is documented in your TalkScore Hub configuration.

Q: Is the model trained on my candidates' data?

A: No. Your candidates' transcripts are not used to train the underlying language model. They are used only to evaluate the candidate against your rubric.

Q: Can I see the exact prompt used to score my candidates?

A: Yes. The full rubric for every skill is available in TalkScore Hub under Assessments → Configuration. You can read it, edit it, and see exactly what instructions the model is given. This is intentional — we don't believe in hidden scoring criteria.

Q: What if I disagree with a score?

A: You can override any score directly in the candidate report. You can also submit a Score Opinion explaining the disagreement, which feeds into our rubric review process. Overriding a score on one candidate doesn't change the scoring for other candidates — that requires a rubric revision, which we'll handle in coordination with your team.

Q: How do you prevent bias?

A: Three ways:

  1. The rubrics explicitly instruct the model to evaluate behavior rather than communication style — to ignore eloquence, accent, vocabulary, and confidence-of-delivery in favor of the actual behaviors and decisions described.

  2. The rubrics anchor every score level in observable behavior, not in subjective impression.

  3. TalkScore Hub's quality monitoring includes bias detection flags that surface when scoring patterns look suspicious across demographic or linguistic groups.

Q: Can a candidate game the system?

A: Any structured assessment can be prepared for, and we think that's fine. Candidates who research the role, think through their experiences, and articulate them clearly are demonstrating exactly the behaviors most roles require. What the rubrics protect against is candidates who sound good without saying anything substantive — generic stories, platitudes, and answers that could have been delivered by anyone. The rubrics are specifically designed to reward concrete behavioral evidence and penalize fluency without substance.

Q: What happens if the transcript has no evidence for a skill?

A: The model is instructed to return a 0 when there is no usable evidence in the transcript. If you find this happens often for a particular skill, it usually means the interview isn't probing that skill effectively — which is an interview design issue, not a scoring issue. We'll work with you to adjust the questions.

Q: How long does scoring take after the interview?

A: Scoring typically completes within minutes of the call ending. The candidate report appears in TalkScore Hub as soon as scoring is complete.


Working with us

TalkScore is not a finished product that we hand to you and walk away from. It is a methodology we operate in partnership with your hiring team. The rubrics, the questions, the weighting, the feedback loop — all of it is meant to evolve as you learn what predicts success in your roles.

If something in this article describes a capability you'd like to use, or a customization you'd like to make, talk to your Talkpush account team. If we don't already do something you need, we will almost certainly build it. The list of things TalkScore can do today is shorter than the list of things it will do six months from now, and most of what's been added recently was added because a client asked for it.

For details on the different interview formats this methodology runs in, and how to choose the right one for your roles, see How much does TalkScore cost?

