Evaluating Inter-Rater Reliability: Transitioning to a Single Rater for Marking Modified Essay Questions in Undergraduate Medical Education

Acta Medica Iranica (ISSN 0044-6025), Tehran University of Medical Sciences
Vol. 62, No. 2 (2024-11-16), pp. 88-95

Shahid Hassan, School of Medicine, American University of Barbados, Bridgetown, Barbados
Malanashita Ganeson, Department of Family Medicine, Kuala Lumpur, Malaysia
Ismail Abdul Sattar Burud, Department of Surgery, School of Medicine, International Medical University, Kuala Lumpur, Malaysia

Dates: 2024-04-21; 2024-08-25

Modified Essay Questions (MEQs) are often included in high-stakes examinations to assess higher-order cognitive skills. Inadequate marking guides for MEQs can lead to inconsistent marking. To ensure the reliability of MEQs as a subjective assessment tool, candidates’ responses are typically evaluated by two or more assessors. Previous studies have examined the impact of marker variance. The current study explores whether a single assessor can be assigned to mark students' MEQ performances, based on statistical evidence from the clinical phase of the MBBS program at a private medical school in Malaysia. The Discrepancy-Agreement Grading (DAG) System was used to determine whether to continue with two raters or shift to a single-rater scheme for MEQs. A low standard deviation was observed across all 11 pairs of scores, with non-significant t-statistics (P>0.05) in 2 pairs (18.18%) and significant t-statistics (P<0.05) in 9 pairs (81.81%). The Intraclass Correlation Coefficient (ICC) results were excellent, ranging from 0.815 to 0.997, all with P<0.001. Regarding practical effect size (Cohen’s d), 1 pair (9.09%) was categorized as having a strong effect size (>0.8), 7 pairs (63.63%) as having a moderate effect size (0.5-<0.8), and 3 pairs (27.27%) as having a weak effect size (0.2-<0.5).
The data analysis suggests that marking MEQ items by a single assessor is feasible without negatively impacting the reliability of the MEQ as an assessment tool.

https://acta.tums.ac.ir/index.php/acta/article/view/11084
https://acta.tums.ac.ir/index.php/acta/article/download/11084/5894
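
The three statistics reported in the abstract for each pair of rater scores (paired t-statistic, ICC, and Cohen's d) can be illustrated with a minimal Python sketch. The abstract does not say which ICC model or which Cohen's d formula the authors used, so this sketch assumes a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)) and a pooled-standard-deviation Cohen's d. The function names and the scores are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired-samples t-statistic for two raters marking the same scripts."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def cohens_d(a, b):
    """Cohen's d using the pooled SD of the two raters (assumed variant)."""
    pooled_sd = math.sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(a) - mean(b)) / pooled_sd


def icc_2_1(a, b):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    Assumed model; computed from the two-way ANOVA mean squares.
    """
    a, b = list(a), list(b)
    n, k = len(a), 2                      # n scripts, k raters
    grand = mean(a + b)
    row_means = [(x + y) / 2 for x, y in zip(a, b)]
    col_means = [mean(a), mean(b)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for x in a + b)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-script mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical marks from two raters on six scripts (not study data).
rater_1 = [6, 7, 5, 8, 9, 4]
rater_2 = [7, 7, 6, 8, 10, 5]

print("t  =", round(paired_t(rater_1, rater_2), 3))
print("d  =", round(cohens_d(rater_1, rater_2), 3))
print("ICC=", round(icc_2_1(rater_1, rater_2), 3))
```

In this illustration the paired t-statistic can be sizeable and the ICC still high, because a small but consistent offset between raters shifts the mean difference without disturbing the ranking of scripts. This is one way a result pattern like the one in the abstract (mostly significant t-statistics alongside excellent ICC values) can arise.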