Feb 26, 2020
Since 2017, experts at the RAND Corporation and the University of Pittsburgh have been conducting research to advance automated writing scoring and feedback systems for elementary school student writers. This brief summarizes some of the contributions of the work, which have been documented in greater detail in peer-reviewed articles.
Writing is an important skill. Specifically, analytic text-based writing—which focuses on analysis and interpretation of fiction or informational texts, using evidence from the source text to support claims and interpretations—is critical to academic success and college readiness (Wang et al., 2018). As such, it is emphasized in national and state English language arts and writing standards. Students' opportunities to practice and learn analytic text-based writing are limited, however. Elementary grades have not historically focused on analytic text-based writing, teachers report feeling underprepared to teach writing, and the time needed to assess student writing is burdensome. When students do write, they rarely receive substantive feedback and rarely engage in cycles of revision that require them to apply feedback to strengthen their work.
Automated essay scoring (AES) and automated writing evaluation (AWE) systems have potential benefits. AES refers to the use of computer programs to assign a score to a piece of writing; AWE systems provide feedback on one or more aspects of writing and may or may not also provide scores. AES and AWE systems can reduce teachers' time burden related to grading and providing feedback, respectively, enabling teachers to assign more such writing tasks and thus provide more opportunities for students to learn and practice the skills. Moreover, students benefit from the more immediate—and consistent—feedback that AWE systems can offer, which supports cycles of revision.
AES and AWE systems, however, are not without challenges. Four common critiques are as follows:

1. They tend to focus on surface-level aspects of writing rather than substantive dimensions, such as use of text evidence.
2. Their performance is rarely evaluated in terms of whether they adequately measure the targeted aspect of writing.
3. They are often positioned as replacing, rather than reinforcing, the teacher's role in the classroom.
4. The fairness of their underlying algorithms across student subgroups has received little attention.
With these issues in mind, the research team, composed of researchers at RAND and the University of Pittsburgh, developed eRevise, an AES and AWE system, and undertook studies to address these four challenges. Following a brief overview of eRevise and study context and methods, this brief presents four key findings from the research. Each finding addresses one of the four perils of AES and AWE systems that the research team identified.
The research team developed eRevise as a system for improving fifth- to seventh-grade students' skill in using text evidence. eRevise uses machine learning and natural language processing (NLP) techniques to predict evidence-use scores in students' writing on the Response to Text Assessment (RTA). The RTA is a formative writing task aligned with the Common Core State Standards (National Governors Association Center for Best Practices, Council of Chief State School Officers, 2010). Students read a nonfiction text and then write an essay in which they evaluate claims made by an author (see Correnti et al., 2012; and Correnti et al., 2013).
The AES score provided by eRevise is based on features of evidence use on typical grading rubrics that humans would use to assess such writing. The features are (1) number of pieces of evidence used; (2) specificity of the evidence provided; (3) concentration, an indication of whether students simply listed evidence or provided elaboration or explanation; and (4) word count. During development, the AES system was trained on more than 1,500 previously collected and manually scored essays. The researchers first used NLP to represent each essay in terms of the four rubric-based features. They then used machine learning techniques to learn a model for predicting an essay's score given only the feature representation as the input. They conducted tenfold cross-validation. This entails randomly dividing the corpus of essays into ten parts, using nine of those parts as a training set and the remaining part for testing, and repeating this ten times (Rahimi et al., 2014).
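The scoring pipeline described above can be sketched as follows. This is a minimal illustration, not eRevise's actual implementation: the synthetic data, feature construction, and choice of classifier are all assumptions made for the example, and only the four rubric-based feature names come from the brief (see Rahimi et al., 2014, for the real system).

```python
# Illustrative sketch: represent each essay by four rubric-based features,
# then evaluate a score-prediction model with tenfold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_essays = 500  # synthetic stand-in for the manually scored training corpus

# Hypothetical feature representation of each essay:
# [number of pieces of evidence, specificity, concentration, word count]
X = np.column_stack([
    rng.integers(0, 8, n_essays),      # number of pieces of evidence
    rng.uniform(0, 1, n_essays),       # specificity of the evidence
    rng.uniform(0, 1, n_essays),       # concentration (listing vs. elaboration)
    rng.integers(50, 600, n_essays),   # word count
])
# Synthetic 1-4 evidence-use scores loosely tied to the features.
y = 1 + (X[:, 0] > 2).astype(int) + (X[:, 1] > 0.5) + (X[:, 3] > 300)

# Tenfold cross-validation: split the corpus into ten parts, train on nine,
# test on the held-out tenth, and repeat ten times.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.2f}")
```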
As for the AWE component, eRevise uses the first two of the features identified above (number of pieces of evidence and specificity of evidence) to select the feedback messages to display to students. For example, students who did not provide enough pieces of evidence receive the following prompt: "Use more evidence from the article." See Figure 1 for a depiction of the architecture of eRevise and Figure 2 for a view of the student interface, with example feedback messages. For more details on eRevise, see Correnti et al., 2020, and Zhang et al., 2019.
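The feedback-selection logic can be sketched as a simple mapping from the two feature scores to messages. The thresholds, the second and third messages, and the function itself are hypothetical; only "Use more evidence from the article" is quoted from the brief.

```python
# Illustrative sketch of feature-based feedback selection (not eRevise's
# actual rules): feature scores below assumed thresholds trigger messages.
def select_feedback(num_evidence: int, specificity: float,
                    min_evidence: int = 3, min_specificity: float = 0.5) -> list[str]:
    """Map the two evidence-use feature scores to feedback messages."""
    messages = []
    if num_evidence < min_evidence:
        messages.append("Use more evidence from the article.")
    if specificity < min_specificity:
        messages.append("Provide more details for each piece of evidence.")
    if not messages:  # both features adequate: push toward elaboration
        messages.append("Explain how the evidence connects to your argument.")
    return messages

print(select_feedback(num_evidence=2, specificity=0.8))
```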
This brief summarizes findings from three studies evaluating eRevise. Study 1 established validity of the eRevise scores (Correnti et al., 2020). It involved 65 sixth-grade language arts classrooms located in 29 different schools in the same large urban district in Maryland. The study examined a total of 1,529 essays. Study 2 evaluated eRevise's performance as a formative feedback system for improving student writing. This study involved fifth- and sixth-grade language arts teachers throughout Louisiana. In the first year, seven teachers participated, and the researchers analyzed essays from 143 students (Zhang et al., 2019). In the second year, 16 teachers participated, and the researchers analyzed 266 sets of first- and second-draft essays (Correnti et al., 2022). Study 3 examined the fairness of the algorithm underlying eRevise using the subset of 735 essays from Study 1 that were scored by two human raters (i.e., those for which interrater reliability had been established; Litman et al., 2021).
In these studies, the research team examined various aspects of eRevise. They performed human scoring and annotation of a diverse sample of student essays to study the performance and fairness of the algorithms underlying eRevise. They used teacher surveys and writing tasks teachers assigned to students to construct measures of writing instruction that enabled them to examine relationships between students' writing and revision skills in the eRevise system and students' opportunities to practice such skills. Finally, the team surveyed students and interviewed teachers to understand their experiences with the system.
Accordingly, the researchers conducted various analytical procedures. These included calculating correlations to study relationships between students' evidence use score in eRevise and measures of student achievement (Correnti et al., 2020), as well as univariate and multivariate analyses using hierarchical linear models to examine the relationship between evidence-use scores and measures of classroom instructional quality (Correnti et al., 2020). To understand teachers' experiences with eRevise, the team qualitatively analyzed interview responses (Correnti et al., 2022). The evaluation of the extent to which eRevise improved student essays from first to second draft was observational; there was no comparison condition in which classes did not use eRevise or used another tool to revise their writing. This limits the claims that can be made about the effectiveness of eRevise.
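The simplest of these procedures, a correlation between evidence-use scores and an achievement measure, can be sketched on synthetic data. The data-generating assumptions here are purely illustrative; they are not the study's data or results.

```python
# Illustrative sketch of the correlational analysis: Pearson correlation
# between synthetic 1-4 evidence-use scores and a synthetic standardized
# achievement measure (constructed to be moderately related).
import numpy as np

rng = np.random.default_rng(1)
n = 200
achievement = rng.normal(0, 1, n)  # standardized test score (z-scaled)
evidence_use = np.clip(
    np.round(2.5 + 0.6 * achievement + rng.normal(0, 0.7, n)), 1, 4
)

r = np.corrcoef(evidence_use, achievement)[0, 1]
print(f"Pearson r = {r:.2f}")
```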
To address the critique that AES and AWE systems tend to focus on surface-level aspects of writing, the researchers developed eRevise to focus on evidence use, an important aspect of analytic text-based writing. Furthermore, rather than relying on a general or holistic score for evidence use, thus leaving the construct of evidence use opaque, eRevise uses a rubric-based system to ensure that the features of "good text-evidence use" (e.g., number of pieces of evidence provided, specificity of evidence) are well represented by the scoring algorithm. eRevise uses these feature scores to select the feedback messages that students see. Because this feedback reflects both the student's needs and the notion of strong evidence use, it addresses the critique that most AWE systems fail to support students' revision process.
Zhang et al., 2019, provides evidence suggesting that use of eRevise improved students' use of evidence at the feature level and that focusing on the level of features (i.e., using feature scores rather than overall score to select messages and assess students' revision) is desirable because it yields more information than when considering evidence use generally. First, Zhang et al., 2019, showed that, on a 1–4 rating scale, the mean score for overall evidence use improved from 2.62 to 2.72, a change that was not statistically significant. Notably, about 20 percent of students already had the maximum score of 4 on their first draft, and for the majority of students, the score did not change. In contrast, scores for the two features used to detect weaknesses in students' evidence use and to provide feedback—number of pieces of evidence used and specificity of the evidence provided—improved significantly from the first to second draft. And improvement (on specificity of evidence) was observable even for the students who already had the maximum overall score of 4 on the first draft (Zhang et al., 2019). The findings suggest the feasibility of assessing a substantive dimension of writing in a fine-grained way that provides information to support students' writing development.
In contrast with evaluations of most AES and AWE systems, which rarely assess whether the system adequately measures the targeted aspect of writing, the researchers gathered evidence about whether eRevise was assessing the evidence-use construct, as claimed. They examined associations between system-generated scores on students' use of text evidence and expected relevant measures of student achievement. Correnti et al., 2020, found a moderate correlation, both at the individual student level and the classroom level, between the eRevise evidence-use score and reading and (to a lesser extent) math achievement scores. This means that students (and classes) with higher scores on the state standardized test of reading (and math) tended to also score higher on the essay, as evaluated by eRevise. (The same associations held for human scores of the essays submitted through eRevise.) This aligns with expectations and helps support the idea that eRevise is measuring the skill that it is designed to measure. Moreover, a reasonable expectation is that students who have had more opportunities to engage in analytic text-based writing (e.g., those in classes where more such assignments are given and where instructional time is devoted to working on such assignments) would have a higher evidence-use score. Results confirmed this hypothesis, both when students' evidence use was rated by the AES system and when it was rated by a human. Meanwhile, evidence-use scores in written essays were not well correlated with a measure of general reading instruction, which also conforms to expectations. This finding adds support to the idea that eRevise is measuring a distinct writing construct.
A second investigation related to construct validity focused on whether the system is helping students improve as intended. This investigation goes beyond looking only at differences in scores from the first draft to the second draft. Scores could go up simply because students had more time to work on their essays. Or scores could change because the students made revisions that were not aligned with the feedback messages that the system provided. Neither of these scenarios provides evidence that the AWE system works as designed. The researchers examined whether improvements in student essays aligned with the evidence-use features targeted in the feedback messages that eRevise provided (Correnti et al., 2022). In general, the second-draft feature scores improved in expected ways. For example, students who received the message to add more pieces of evidence in fact did so—their score for the "number of pieces of evidence" feature increased. Another analysis involved examining what students reported as the one thing they learned from eRevise about using evidence in their writing. In general, students' responses to this open-ended question more closely resembled the feedback messages they received than the messages they did not receive. In other words, students who were asked to provide more explanation were more inclined to say that they learned "to explain how my evidence ties in with my argument" than to say they learned that they need "a lot of evidence."
Teachers participating in studies of eRevise reported that the system was feasible to implement and beneficial insofar as it saved them time, facilitating timely feedback to students and providing students the opportunity to engage in the writing process. Teachers also indicated that the system messages were aligned with their instruction on use of text evidence. However, two-thirds of teachers suggested that the system should be seen as reinforcing the teacher's role, not replacing it in the classroom. In fact, several conveyed that, to get the most out of the system, the teacher should interact with it and the students' writing (Correnti et al., 2022).
Researchers analyzed teacher reports of how they interacted with students during the use of eRevise (Correnti et al., 2022). More than a third of the teachers did not interact with students; they treated the writing task and use of the system as if they were practice for the standardized test. Other teachers responded to student questions (e.g., "Is this enough evidence?") by referring students back to the task and the system. A final group of teachers interacted substantively with student questions. They appeared to use eRevise as a teaching and learning opportunity. For example, teachers elaborated upon or reinforced the system's feedback message. Analysis showed that the extent to which teachers interacted with students during use of eRevise was associated with the extent to which students' evidence-use scores improved from the first draft to the second draft. In short, in classrooms where teachers provided substantive help to any student, the writing improvement score for the class was higher than in classrooms where teachers did not provide any help or only referred students back to the system. Moreover, the students who asked for substantive assistance and received it saw higher improvement scores. Thus, students in classrooms where teachers provided substantive support benefitted overall, but students who asked specific questions and received substantive support benefitted the most.
There are currently three prominent ways to build the underlying algorithms for automatically scoring student essays—feature-based models, neural network models, and a hybrid of the two. Developers typically choose a model based on the extent to which they prioritize prediction accuracy versus the ability to explain what feature(s) or construct(s) the model is assessing. There has been less focus on the algorithmic fairness of these different models. The researchers compared the fairness of these models, examining potential bias based on gender (male versus female), race (Black versus not), and socioeconomic status (as indicated by eligibility for free or reduced-price lunch) (Litman et al., 2021). The research team operationalized fairness in three ways, including overall score accuracy, which measures whether AES scores are equally accurate for all student subgroups. About 48 percent of the essays were written by male students, 68 percent by Black students, and 55 percent by students from low-income families.
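One of the fairness measures named above, overall score accuracy, can be illustrated by comparing how often system scores match human scores within each subgroup. The data below are synthetic, with an accuracy gap deliberately built in; the groups, agreement rates, and gap are assumptions for the example, not the study's findings.

```python
# Illustrative sketch of the "overall score accuracy" fairness check:
# compute system-vs-human agreement separately for each subgroup.
import numpy as np

rng = np.random.default_rng(2)
n = 735
human = rng.integers(1, 5, n)      # human rater scores on the 1-4 scale
group = rng.choice(["A", "B"], n)  # a hypothetical demographic split
# Synthetic system scores that agree with the human score about 80% of the
# time for group A and 70% for group B, building in an accuracy gap.
agree = np.where(group == "A", rng.random(n) < 0.8, rng.random(n) < 0.7)
off_by_one = np.where(human < 4, human + 1, human - 1)
system = np.where(agree, human, off_by_one)

acc = {g: float((system[group == g] == human[group == g]).mean())
       for g in ["A", "B"]}
print(acc)
```

A fairness audit would flag a model whose per-group accuracies differ more than chance would allow, which is the spirit of the comparison reported next.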
Results indicate that different models for scoring student writing are subject to different types of bias. Overall, based on the analysis, the hybrid model seems to be the fairest: it appears to exhibit bias only with respect to overall score accuracy for gender. Regardless of model and measure, the eRevise algorithm seems most biased with respect to gender. Male students tended to produce essays with significantly lower word counts. The team explored removing word count as a feature and found that doing so improved gender fairness but reduced score accuracy.
The work undertaken by the research team provides an argument for advancing the assessment of substantive dimensions of writing by designing AES and AWE systems that attend to features of writing that lead to actionable feedback to guide students' writing development, consider teachers' role in interacting with the student and the system, and are fair for all student subgroups.
Moving forward, more research is needed to build systems that can assess a wider range of important writing constructs—for example, quality of claims. AES and AWE systems must assess such dimensions to be useful to teachers and students (and researchers of writing) for monitoring students' writing skills toward college and career readiness (Correnti et al., 2020).
With respect to assessing the technical quality of AES and AWE systems, the research undertaken has pushed for more rigorous evaluations of an AWE system's performance. Specifically, developers and researchers need to move beyond human-computer reliability as the primary metric and toward ensuring that the system is assessing important writing features and helping students improve on them (Correnti et al., 2020).
It will take innovative thinking to design systems that deliberately invite and facilitate teacher interactions with students through feedback messages, error corrections, scores, or annotations. Beyond system-provided outputs for teachers (e.g., assessments of student progress) or teacher-accessible dashboards, which are primarily passive supports, developers can consider including discussion protocols that help teachers engage students in meaningful co-examination of system outputs. Teachers could, for example, elicit student understanding of the feedback messages and their revision plan. Developers should also consider ways in which teacher expertise and machine efficiency can intersect (Matsumura et al., forthcoming). For example, teachers may wish to provide additional feedback messages or customize messages given what they know about the student and their learning trajectory.
Finally, developers of AES/AWE systems should attend to algorithmic fairness. Although any bias is undesirable, researchers and developers need to consider the trade-offs between fairness and other important dimensions, such as reliability and explainability. The purpose of the AES or AWE system may matter; if AES is used for formative purposes, such as to provide feedback to improve teaching and learning, then transparency in how the score or feedback is derived is important. And any approaches to mitigating bias should also consider construct validity—whether removing or changing a feature risks underrepresenting the writing skill that the system is designed to assess and improve (Litman et al., 2021).