Original Article

Evaluating Inter-Rater Reliability: Transitioning to a Single Rater for Marking Modified Essay Questions in Undergraduate Medical Education

Abstract

Modified Essay Questions (MEQs) are often included in high-stakes examinations to assess higher-order cognitive skills. Inadequate marking guides for MEQs can lead to inconsistencies in marking, so to safeguard the reliability of MEQs as a subjective assessment tool, candidates' responses are typically evaluated by two or more assessors. Previous studies have examined the impact of marker variance. The current study explores the feasibility of assigning a single assessor to mark students' MEQ responses, based on statistical evidence from the clinical phase of the MBBS program at a private medical school in Malaysia. A robust evaluation approach, based on the Discrepancy-Agreement Grading (DAG) System, was employed to determine whether to continue with two raters or shift to a single-rater scheme for MEQs. A low standard deviation was observed across all 11 pairs of scores, with non-significant t-statistics (P>0.05) in 2 pairs (18.18%) and significant t-statistics (P<0.05) in 9 pairs (81.81%). The Intraclass Correlation Coefficient (ICC) results were excellent, ranging from 0.815 to 0.997, all with P<0.001. In terms of practical effect size (Cohen's d), 1 pair (9.09%) showed a strong effect size (>0.8), 7 pairs (63.63%) a moderate effect size (0.5 to <0.8), and 3 pairs (27.27%) a weak effect size (0.2 to <0.5). The analysis suggests that it is feasible to have MEQ items marked by a single assessor without compromising the reliability of the MEQ as an assessment tool.
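The per-pair statistics reported above (paired t-tests, ICC, and Cohen's d for two raters scoring the same candidates) can be reproduced with standard tools. The sketch below is not the authors' code: the example scores, the number of candidates, the choice of ICC form (two-way, consistency, single measure, i.e., ICC(3,1)), and the paired-samples formulation of Cohen's d are all assumptions made for illustration only.

```python
# Illustrative inter-rater comparison for one pair of raters (hypothetical data).
import numpy as np
from scipy import stats

# Hypothetical MEQ scores for the same candidates from two raters
rater1 = np.array([12.0, 15.5, 9.0, 18.0, 14.5, 11.0, 16.5, 13.0])
rater2 = np.array([12.5, 15.0, 10.0, 17.5, 14.0, 11.5, 16.0, 13.5])

# Paired t-test: does the mean score differ between the two raters?
t_stat, p_value = stats.ttest_rel(rater1, rater2)

# Cohen's d for paired scores (mean difference / SD of the differences)
diff = rater1 - rater2
cohens_d = diff.mean() / diff.std(ddof=1)

# ICC(3,1): two-way model, consistency, single rater, from ANOVA mean squares
scores = np.column_stack([rater1, rater2])   # rows = candidates, cols = raters
n, k = scores.shape
grand_mean = scores.mean()
ss_rows = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_cols = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_total = ((scores - grand_mean) ** 2).sum()
ms_rows = ss_rows / (n - 1)
ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
icc_3_1 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

print(f"paired t = {t_stat:.3f}, P = {p_value:.3f}")
print(f"Cohen's d (paired) = {cohens_d:.2f}")
print(f"ICC(3,1) = {icc_3_1:.3f}")
```

In a study of this kind, such a comparison would be repeated for each of the 11 rater pairs and the resulting P values, ICCs, and effect sizes interpreted against the thresholds quoted in the abstract.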

Issue: Vol 62 No 2 (2024)
Section: Original Article(s)
DOI: https://doi.org/10.18502/acta.v62i2.17040
Keywords
Essay question; Decision making; Observer variation; Interobserver reliability; Scoring system

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
How to Cite
Hassan S, Ganeson M, Burud IAS. Evaluating Inter-Rater Reliability: Transitioning to a Single Rater for Marking Modified Essay Questions in Undergraduate Medical Education. Acta Med Iran. 2024;62(2):88-95.