The science behind AI scored
Hirevue developed AI scored using Natural Language Processing (NLP), a branch of computer science that gives computers the ability to understand text and spoken words much as humans do. AI scored uses NLP to replicate the way humans evaluate interview responses: it transcribes what the candidate says in response to an on-demand video interview question and then evaluates the transcript against the core competencies of the job. When evaluation is complete, AI scored provides an unbiased recommendation that recruiters and hiring managers can use when making their final evaluation of the candidate.
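The transcribe-then-score flow described above can be sketched in a few lines. This is a hedged illustration only: the function names, the keyword-matching scorer, and the competency lists are invented placeholders, not Hirevue's actual models or rubric.

```python
import string

# Hypothetical stand-in for a speech-to-text step; real systems would
# run an ASR model here. Input is already text in this toy example.
def transcribe(audio_response: str) -> str:
    return audio_response

# Toy competency scorer: fraction of competency keywords found in the
# transcript. Real NLP scoring models are far more sophisticated.
def score_competency(transcript: str, keywords: list[str]) -> float:
    words = {w.strip(string.punctuation) for w in transcript.lower().split()}
    hits = sum(1 for kw in keywords if kw in words)
    return hits / len(keywords)

# Invented example competencies and keywords for illustration.
competencies = {
    "teamwork": ["team", "collaborated", "together"],
    "problem_solving": ["solved", "analyzed", "approach"],
}

response = "I collaborated with my team and analyzed the problem together."
transcript = transcribe(response)
scores = {name: score_competency(transcript, kws)
          for name, kws in competencies.items()}
print(scores)
```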
AI scored development
Hirevue's trained industrial-organizational (I-O) psychologists created AI scored by evaluating tens of thousands of candidate interview responses with a structured scoring guide called Behaviorally Anchored Rating Scales (BARS). Using BARS to evaluate each response against a set of core competencies, the psychologists first rated every interview individually and then compared notes to reach a consensus.
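A BARS rubric ties each scale point to a concrete behavioral anchor. The sketch below, with invented anchors and ratings, shows the individual-then-consensus pattern: raters score independently, and responses where they diverge are flagged for a consensus discussion.

```python
# Illustrative BARS scale for one competency (anchors are made up,
# not Hirevue's actual rubric).
BARS_TEAMWORK = {
    1: "Describes working alone; no collaboration mentioned",
    3: "Mentions cooperating with others on shared tasks",
    5: "Describes coordinating actively, resolving conflict, sharing credit",
}

# Independent ratings from two hypothetical raters.
independent_ratings = {
    "response_001": {"rater_a": 3, "rater_b": 3},
    "response_002": {"rater_a": 2, "rater_b": 4},
}

# Flag responses where raters differ by more than one scale point,
# so they can be discussed to reach a consensus score.
needs_discussion = [rid for rid, r in independent_ratings.items()
                    if max(r.values()) - min(r.values()) > 1]
print(needs_discussion)
```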
This rigorous method for evaluating candidate responses was used to train AI models and develop scoring algorithms that could be applied to data the models had not yet encountered. The algorithms were then tested on held-out data to determine whether they could replicate the human raters' scores with a high degree of precision. When testing proved successful, AI scored was built into the Hirevue product and introduced as a way to provide recruiters with AI-powered benchmarks that could help them make fair, efficient, and effective recommendations.
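The train-then-validate workflow above can be sketched with made-up data: fit a simple model on expert-rated responses, then measure how closely it reproduces expert ratings on responses it has not seen. The single "content signal" feature and the linear model are illustrative assumptions, not Hirevue's method.

```python
# (feature, expert_rating) pairs; the feature is an invented 0-1
# content signal, the rating an expert BARS-style score.
train = [(0.1, 1), (0.3, 2), (0.5, 3), (0.7, 4), (0.9, 5)]
held_out = [(0.2, 1), (0.4, 2), (0.6, 4), (0.8, 4)]

def fit_linear(pairs):
    """Least-squares fit of rating = slope * feature + intercept."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    return slope, my - slope * mx

slope, intercept = fit_linear(train)

# Evaluate on held-out responses the model never saw during fitting.
errors = [abs(slope * x + intercept - y) for x, y in held_out]
mae = sum(errors) / len(errors)
print(f"mean absolute error on held-out responses: {mae:.2f}")
```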
Video: "The science of AI scored" with Chief Science Officer Mike Hudy (2:20)
Reliability of AI scored
Reliability refers to the degree to which a model's scores agree with those of trained human raters. Leading experts in the field of AI recommend that machine-generated and human ratings correlate at 0.60 or higher to support use in guiding selection decisions. The AI models used in AI scored average 0.77, far exceeding the standard recommended by experts. They achieve reliabilities this high because they are trained on tens of thousands of expert ratings and 1.5 million interview responses.
Because the models used in AI scored were fine-tuned on interview question responses, they can detect and understand subtler nuances in those responses. As a result, the scoring models used by AI scored can replicate subject matter expert ratings with much more precision than even the most powerful general-purpose language models.