A Fairness Evaluation of Automated Methods for Scoring Text Evidence Usage in Writing
Published in: Lecture Notes in Computer Science, Volume 12748, pages 255–267 (2021). doi: 10.1007/978-3-030-78292-4_21
Posted on RAND.org on January 12, 2022
Automated Essay Scoring (AES) can reliably grade essays at scale and reduce human effort in both classroom and commercial settings. There are currently three dominant supervised learning paradigms for building AES models: feature-based, neural, and hybrid. While feature-based models are more explainable, neural network models often outperform them in prediction accuracy. To create models that are both accurate and explainable, hybrid approaches that combine neural network and feature-based models are of increasing interest. We compare these three types of AES models along a different evaluation dimension, namely algorithmic fairness. We apply three definitions of AES fairness to an essay corpus in which different types of AES systems score upper elementary students' use of text evidence. Our results indicate that different AES models exhibit different types of biases, spanning students' gender, race, and socioeconomic status. We conclude with a step towards mitigating AES bias once it is detected.