Edzil.la

Presentation: Can ChatGPT Grade Like a Human? Examining the Reliability and Validity of AI-Assisted Assessment in Academic Writing

Sat, Jun 13, 11:20-11:45 Asia/Tokyo

Recent research has explored the use of generative AI for writing assessment, yet evidence regarding its reliability and validity remains mixed. This study examines whether ChatGPT (GPT-4o) can function as an analytic assessment tool for source-based academic essays written by postgraduate research students. A dataset of 122 essays, originally scored by two experienced human raters, was reevaluated by ChatGPT using a standardized analytic rubric (e.g., Idea Presentation, Academic Style, Citation, and Mechanics) and a zero-shot prompting approach. Non-parametric analyses and descriptive statistics were used to examine score alignment, ranking patterns, and domain-specific differences. Results show that while human and AI scores occupied a similar overall range, ChatGPT consistently awarded higher scores and did not rank essays in ways that aligned with human judgment. Significant differences emerged across most rubric domains: human raters scored higher on idea presentation, whereas ChatGPT assigned higher scores for academic style and citation practices; no significant difference was found for mechanics. Repeated AI scoring demonstrated high internal consistency, with variability concentrated in meaning-dependent domains such as argument clarity and source integration. Overall, the findings indicate that generative AI shows promise for reliable form-focused assessment but remains limited in evaluating rhetorical and conceptual quality.
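The abstract does not say which non-parametric statistics were used. As a rough illustration of how ranking alignment between human and AI scores might be checked, the sketch below computes a Spearman rank correlation and the mean score gap in plain Python. The function names and the score lists are hypothetical, invented for illustration; they are not the study's method or data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on tie-aware ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical total scores for six essays (illustration only, not study data).
human = [12, 15, 9, 14, 11, 16]
ai = [15, 14, 15, 16, 16, 15]

rho = spearman_rho(human, ai)                      # ranking agreement
mean_gap = mean(a - h for a, h in zip(ai, human))  # positive: AI scores higher
print(f"Spearman rho = {rho:.2f}, mean AI-human gap = {mean_gap:.2f}")
```

A rho near zero alongside a positive mean gap would correspond to the pattern the abstract reports: scores in a similar range, but higher from the AI and ranked differently from the human raters.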

Jaeuk PARK
