Approaches to the Question Discrimination Factor (QDF)

Tests are among the most widely used assessment techniques for measuring learners' performance against targeted learning outcomes. To identify students' learning gaps and boost their learning, tests must therefore be fair and effective. A test's ability to meet these goals aggregates how relevant each of its questions is, so the reliability of a test can be increased through item analysis, in which students' responses to each question (item) are used to evaluate test performance. One important method in item analysis is item discrimination, which refers to the power of a question to differentiate between learners. The Question Discrimination Factor (QDF) is an index that measures how well a question can differentiate between user cohorts: it captures how much more likely top scorers are to answer a question correctly than low scorers.
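One classical way to quantify this (a simplified index, not necessarily the exact formulation Embibe uses) is the upper-lower discrimination index: rank students by total score, take the top and bottom groups (commonly 27% each), and subtract the lower group's proportion of correct answers on the item from the upper group's. A minimal sketch in Python:

```python
import numpy as np

def upper_lower_discrimination(item_correct, total_scores, group_frac=0.27):
    """Classic upper-lower discrimination index D = p_upper - p_lower.

    item_correct : 1-D array of 0/1 responses to one question
    total_scores : 1-D array of each student's total test score
    """
    item_correct = np.asarray(item_correct, dtype=float)
    order = np.argsort(total_scores)          # students, ascending by total score
    k = max(1, int(len(order) * group_frac))  # size of each comparison group
    lower, upper = order[:k], order[-k:]
    return item_correct[upper].mean() - item_correct[lower].mean()
```

Values near 1 indicate a strongly discriminating question; values near 0, or negative values, indicate a question that fails to separate strong and weak students.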

To compute the QDF of questions, Embibe has used both a traditional statistical method (item point-biserial correlation) and deep learning-based methods. The item point-biserial correlation is the Pearson product-moment correlation between a student's score on the question and their total score: the larger the gap between the total scores of students who answered the question correctly and those who answered it incorrectly, the higher the QDF value. We also implemented the 2PL model from classical Item Response Theory using a deep neural network (DNN) architecture: given students' attempt data, we derive each question's difficulty level and discrimination factor from the weights of the trained DNN.
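As a rough illustration, here is a minimal sketch (not Embibe's production code) of the point-biserial computation; the 0/1 `responses` matrix and the item-exclusion step (the "corrected" point-biserial) are assumptions made for the example. For the 2PL model, the probability that a student of ability θ answers an item correctly is P(θ) = 1 / (1 + e^(−a(θ − b))), where a is the item's discrimination and b its difficulty.

```python
import numpy as np
from scipy.stats import pointbiserialr

def point_biserial_qdf(responses):
    """Point-biserial QDF for each item in a 0/1 response matrix.

    responses : (n_students, n_items) array of attempt outcomes.
    Each item's own score is excluded from the total (the "corrected"
    point-biserial) so an item does not correlate with itself.
    """
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)
    qdf = []
    for j in range(responses.shape[1]):
        rest = totals - responses[:, j]  # total score excluding item j
        r, _ = pointbiserialr(responses[:, j], rest)
        qdf.append(r)
    return np.array(qdf)

def p_correct_2pl(theta, a, b):
    """2PL IRT model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))
```

Table 1 below shows how the value of QDF varies with learners' question-attempt interactions.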

Question 1 (QDF = 0.11): Alcohols of low molecular weight are
  a. Soluble in all solvents (Correct Option)
  b. Soluble in water
  c. Insoluble in all solvents
  d. Soluble in water on heating

Question 2 (QDF = 0.80): Aspirin is also known as
  a. Acetyl salicylic acid (Correct Option)
  b. Methyl salicylic acid
  c. Acetyl salicylate
  d. Methyl salicylate
Table 1: Comparison of the distributions of total marks for students who answered correctly versus incorrectly, for a low-QDF question (Question 1) and a high-QDF question (Question 2).

Here, the x-axis represents the total marks scored, and the y-axis represents the normalized number of students. The yellow line denotes the distribution of total marks of students who got the question incorrect, and the blue line denotes the distribution for students who got it correct. In Question 1, the two distributions overlap heavily, while in Question 2 the overlap is much smaller; hence the QDF value is higher for Question 2 than for Question 1. The final QDF value is obtained by fine-tuning the output of the above methods with test-level parameters.
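The overlap picture described above can be reproduced with a short matplotlib sketch; the function below and its inputs are assumptions for illustration, not Embibe's plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_overlap(total_scores, item_correct, bins=20):
    """Plot normalized total-score distributions split by item outcome."""
    total_scores = np.asarray(total_scores)
    item_correct = np.asarray(item_correct, dtype=bool)
    plt.hist(total_scores[~item_correct], bins=bins, density=True,
             histtype="step", color="gold", label="incorrect")
    plt.hist(total_scores[item_correct], bins=bins, density=True,
             histtype="step", color="blue", label="correct")
    plt.xlabel("Total marks scored")
    plt.ylabel("Normalized number of students")
    plt.legend()
    plt.show()
```

The less the two histograms overlap, the higher the question's QDF.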

Embibe conducted a validation experiment to compare the performance of students on two different tests:

  1. Baseline Policy: Here, questions are selected from the ground truth database without any bias from discrimination factors, ensuring the expected distribution over difficulty levels and syllabus coverage.
  2. Discrimination Only Policy: Here, questions are selected from the ground truth dataset, ensuring syllabus coverage (at least one question from each chapter) while maximizing the overall discrimination factor of the selected questions at every difficulty level (a selection sketch follows this list).
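A greedy two-pass selection in this spirit might look as follows; this is a minimal sketch, not the experiment's actual algorithm, and the `Question` fields are assumed for illustration (difficulty-level stratification is omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class Question:
    qid: str
    chapter: str
    qdf: float  # discrimination factor

def discrimination_only_policy(pool, n_questions):
    """Greedy selection: cover every chapter, then maximize total QDF."""
    ranked = sorted(pool, key=lambda q: q.qdf, reverse=True)
    selected, covered = [], set()
    # Pass 1: take the highest-QDF question from each uncovered chapter.
    for q in ranked:
        if q.chapter not in covered:
            selected.append(q)
            covered.add(q.chapter)
    # Pass 2: fill the remaining slots with the highest-QDF leftovers.
    chosen = {q.qid for q in selected}
    for q in ranked:
        if len(selected) >= n_questions:
            break
        if q.qid not in chosen:
            selected.append(q)
            chosen.add(q.qid)
    return selected[:n_questions]
```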

For the experiment, a total of 312 students were selected to take a test containing 75 questions. Two statistical metrics were used to compare performance across the two tests:

  1. Evaluation using RMSE: Using the IRT model, we predict the probability of each student in the evaluation set answering each question correctly, and compute the student's inferred ability from their scores if they were to attempt the generated test paper. We also determine each student's ground-truth ability from the IRT model. Finally, we compute the root mean squared error (RMSE) between the ground-truth ability and the inferred ability to measure accuracy.
  2. Evaluation using Spearman's ρ: We rank students by the abilities obtained from the ground truth data and by those obtained from the generated test, and compute the rank correlation ρ between the two rankings (both metrics are sketched in code after this list).
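Both metrics reduce to a few lines of numpy/scipy; this is a minimal sketch assuming the two ability vectors are already available as arrays:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(ability_true, ability_inferred):
    """RMSE and Spearman rank correlation between two ability estimates."""
    ability_true = np.asarray(ability_true)
    ability_inferred = np.asarray(ability_inferred)
    rmse = np.sqrt(np.mean((ability_true - ability_inferred) ** 2))
    rho, _ = spearmanr(ability_true, ability_inferred)
    return rmse, rho
```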
Policy                        RMSE    Rank corr. ρ
Baseline Policy               0.844   0.59
Discrimination Only Policy    0.549   0.83
Table 2: Comparison of RMSE (between inferred ability and ground-truth ability) and rank correlation ρ for tests generated by the two policies.

We also found that the Discrimination Only Policy test gives a 24.8% better spread of scores (the score at the 95th percentile of students minus the score at the 5th percentile) than the Baseline Policy test; the spread metric is sketched below.
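The spread metric is a simple percentile difference; the sketch below and the `scores_*` arrays are assumptions for illustration:

```python
import numpy as np

def score_spread(scores):
    """Spread of scores: 95th percentile minus 5th percentile."""
    return np.percentile(scores, 95) - np.percentile(scores, 5)

# Relative improvement of the Discrimination Only Policy over the Baseline:
# (score_spread(scores_disc) - score_spread(scores_base)) / score_spread(scores_base)
```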

Hence, using high-QDF questions improves the quality of a test in terms of its power to differentiate among students under the same targeted learning goals. QDF is also leveraged to improve content quality: questions with negative QDF are identified and revised for relevance and clarity.

References

  • Soma Dhavala, Chirag Bhatia, Joy Bose, Keyur Faldu, Aditi Avasthi, “Auto Generation of Diagnostic Assessments and their Quality Evaluation,” EDM, July 2020.
  • Vincent LeBlanc, Michael A. A. Cox, “Interpretation of the point-biserial correlation coefficient in the context of a school examination,” The Quantitative Methods for Psychology, 13(1):46–56, January 2017.
  • Wim J. van der Linden, Ronald K. Hambleton (eds.), “Handbook of Modern Item Response Theory,” Springer, 1997.