Automated Evaluation of Free-Text-Answer-Based Questions

The vast majority of competitive exams require participants to solve objective-type questions: questions that ask for one or more correct answers to be selected from a given set of choices, or questions for which participants enter a numerical value. Evaluating tests based on objective-type questions is straightforward.

However, many exams, such as board exams, include questions with free-text answers. Evaluating free-text answers is still an open research problem, with some successful solutions targeting essay scoring. Developing a generic evaluator that can score free-text answers of various styles across different academic domains requires advanced NLP/NLU, and it is an area of interest to Embibe.

We can divide the problem into two sub-problems.

  1. Entity linking
  2. Semantic similarity

Entity Linking

In entity linking, we handle short forms/acronyms and aka (also known as) entities. For example, acronyms like:

"PMC": "pollen mother cell",
"MMC": "megaspore mother cell",
"PEN": "primary endosperm nucleus",
"PEC": "primary endosperm cell",
"LH": "luteinizing hormone",
"FSH": "follicle stimulating hormone"

And aka entities like:

"mushroom": "toadstool",
"germs": "microbes",
"bacteria": "microbes",
"yeast": "microbes",
"renewable": "inexhaustible",
"traits": "characteristics"

We can also map chemical formula entities to their names, for example:

"(NH4)(NO3)": "Ammonium nitrate",
"(NH4)2C2O4": "Ammonium oxalate",
"Ag2O": "Silver oxide",
"Ag2SO4": "Silver sulfate",
"Al(NO3)3": "Aluminium nitrate"

Using these mappings, we can normalize acronyms, synonyms, and chemical formulae in a student's answer and then match them against the actual answer.
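As a minimal sketch, this dictionary-based normalization could be applied to the student's answer before matching. The mapping tables and the `normalize` helper below are illustrative and reuse a few of the example entries above; real tables would be curated per academic domain and far more extensive.

```python
import re

# Illustrative mapping tables built from a few of the example entries above.
ACRONYMS = {
    "pmc": "pollen mother cell",
    "mmc": "megaspore mother cell",
    "lh": "luteinizing hormone",
}
ALIASES = {
    "germs": "microbes",
    "traits": "characteristics",
    "(nh4)(no3)": "ammonium nitrate",
}

def normalize(text: str) -> str:
    """Replace known acronyms and aliases with their canonical forms."""
    mapping = {**ACRONYMS, **ALIASES}
    tokens = re.findall(r"\S+", text.lower())
    return " ".join(mapping.get(token, token) for token in tokens)

print(normalize("The PMC shows the same traits"))
# -> "the pollen mother cell shows the same traits"
```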

Semantic Similarity

Two differently worded sentences can mean the same thing. We establish semantic similarity using domain-infused knowledge and language-model probabilities.

We can compute the embedding of the student’s answer and compare it with the embedding of the actual answer. If the cosine distance between them is below a chosen threshold, we consider the two answers similar and mark the student’s answer as correct.
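A minimal sketch of this decision rule, assuming the two embeddings are already available as NumPy vectors; the `is_correct` helper and the 0.3 threshold are illustrative, and the threshold would in practice be tuned on graded answers.

```python
import numpy as np

def is_correct(student_emb: np.ndarray, answer_emb: np.ndarray,
               threshold: float = 0.3) -> bool:
    """Mark the answer correct when the cosine distance is below the threshold."""
    cosine_similarity = np.dot(student_emb, answer_emb) / (
        np.linalg.norm(student_emb) * np.linalg.norm(answer_emb)
    )
    cosine_distance = 1.0 - cosine_similarity
    return cosine_distance < threshold
```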

We can use self-attention-based models like BERT[1] and RoBERTa[2] to obtain the embeddings of the student’s answer and the correct answer, and then compute the cosine distance between them to measure their similarity.
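One possible way to obtain such embeddings is with the Hugging Face transformers library, mean-pooling BERT’s token representations into a single sentence vector. The model name, pooling strategy, and example sentences below are assumptions for illustration; encoders fine-tuned specifically for sentence similarity typically work better.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool BERT's last hidden states into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

student = embed("The PMC divides by meiosis to produce microspores.")
reference = embed("The pollen mother cell undergoes meiotic division to form microspores.")
similarity = torch.nn.functional.cosine_similarity(student, reference, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

The resulting similarity (or the corresponding distance, 1 − similarity) can then be passed to a threshold check like the one sketched earlier.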

References:

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”