Distinctive Teacher Evaluation Programs Could Provide Lessons for Others
Measuring Teacher Effectiveness
Chicago teachers go on strike, in part due to concerns about "value-added" teacher evaluation systems. Education experts publish newspaper columns decrying new teacher evaluation schemes that allegedly overemphasize standardized tests, ignore other aspects of students' learning, undermine principals' judgment, and risk driving good teachers out of the profession.
Other experts and scholars, however, point out that relying solely on school administrators' subjective teacher evaluations, as has traditionally been done, has resulted in almost all teachers being rated as satisfactory, leading many to doubt the results. These critics assert that the time has come to rethink how teachers are evaluated and supported.
Such disagreements over teacher evaluation systems have been fueling heated debates about public education in the United States. Many states and districts are indeed retooling their teacher evaluation systems to incorporate student test scores — a goal encouraged by federal Race to the Top incentives and made possible by improved data systems and innovative statistical techniques known as value-added models. At the same time, using these models poses at least three major challenges: ensuring the reliability and validity of the estimates, accounting for teachers of subjects and grades that are not tested annually, and accounting for students who lack prior test scores or are enrolled only part of a school year.
Several states and districts are tackling these challenges head-on as they begin incorporating student test scores into their teacher evaluation systems. Using insights from the research literature and preliminary snapshots of some of these systems, in 2010 we distilled a set of considerations and suggestions for education policymakers striving to design systems that incorporate student performance data in a way that is accurate and fair.
AP IMAGES/THE COMMERCIAL APPEAL, JIM WEBER
Kipp Middle School seventh grader Mirrkel Ivory works on an assessment test in Memphis, Tennessee, on August 21, 2012. Students at Kipp take standardized tests twice a year to measure progress in core courses.
Reliability and Validity
One appeal of value-added models is that they account for the prior test scores of a teacher's students. A typical value-added method works like this: Mr. Jones teaches 6th-grade math. To estimate his contributions to students' test score performance, statisticians collect the 5th-grade test scores of all his students (and possibly their 3rd- and 4th-grade scores as well), along with information about their backgrounds (such as gender, socioeconomic status, and whether they were in special education or English language learner programs). On the basis of these data, the statisticians predict the 6th-grade math scores of each individual student. Maria and Kwan are students in Mr. Jones's class. Maria's score is seven points higher than predicted on the 6th-grade test; Kwan's is two points lower. The estimate of Mr. Jones's added value is the average of all differences between the actual and predicted scores of Maria, Kwan, and the rest of the class.
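The Mr. Jones example can be sketched in a few lines of code. This is a minimal illustration, not any state's actual model: the data are simulated, the covariates (a prior score and a socioeconomic index) are stand-ins for the richer background variables described above, and the prediction step uses ordinary least squares for simplicity.

```python
import numpy as np

# Hypothetical data for students across many classrooms (all numbers illustrative).
rng = np.random.default_rng(0)
n = 200
prior = rng.normal(50, 10, n)            # 5th-grade math scores
ses = rng.normal(0, 1, n)                # socioeconomic index
actual = 0.9 * prior + 2.0 * ses + rng.normal(0, 5, n)  # observed 6th-grade scores

# Step 1: predict each student's 6th-grade score from prior score and background.
X = np.column_stack([np.ones(n), prior, ses])
beta, *_ = np.linalg.lstsq(X, actual, rcond=None)
predicted = X @ beta

# Step 2: a teacher's value-added estimate is the average gap between the
# actual and predicted scores of his or her students. Suppose Mr. Jones's
# class is the first 25 students in the data set.
residuals = actual - predicted
jones_value_added = residuals[:25].mean()
print(round(float(jones_value_added), 2))
```

A positive average residual, like Maria's +7 outweighing Kwan's −2, suggests the class outperformed its predictions; real systems layer many refinements (multiple prior years, shrinkage, measurement-error corrections) on top of this basic logic.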
When using such measures to inform teacher evaluations, policymakers and principals need to maximize both the reliability of the student test scores and the validity of how they are used. Reliability refers to the consistency or precision of these measures in representing student achievement over time. Validity refers to the appropriateness of the inferences drawn from such measures — and of the purposes toward which those measures are applied.
An important threat to reliability is measurement error. It is nearly impossible for any single test to present a complete picture of students' knowledge about a particular domain, which is to say that all test scores are subject to measurement error. However, scores on tests that are poorly constructed, administered inconsistently across classrooms, or scored subjectively (as in the case of essays and open-response items) often have more measurement error and are therefore less reliable than those from well-constructed, consistently administered, and objectively scored assessments. As for validity, one possible threat is an undue focus on test preparation in lieu of teaching the underlying content. Other threats to validity come with shifts in the content on which students are tested as they progress from one year to the next.
Additional challenges arise when students lack prior test scores or are enrolled in a teacher's class for only part of a school year. It may be prudent to estimate a teacher's value added using only those students who were enrolled in the class for most or all of the year and who have prior test scores on record.
State and District Experiments
In 2010, we culled public data about five teacher evaluation systems that were incorporating or working toward incorporating student performance data. These systems represented some of the best-documented programs in development at that time. Two of them were state-level programs in Tennessee and Delaware. The other three were district-level programs in Denver, Colorado; Hillsborough County, Florida; and Washington, D.C. Because these programs were — and are — still undergoing refinement, it is too soon to declare them successes or failures; meanwhile, a variety of similar systems are now being developed across the country.
Nonetheless, the early approaches of the five programs we studied remain interesting because of the distinctive ways that they responded to the inherent challenges. Certain aspects of these programs could serve as illustrative models for others. At the very least, they suggest an array of possibilities for other states and districts to consider.
The Denver program combined teacher evaluations and incentives in four categories: knowledge and skills (including completion of professional development units), comprehensive professional evaluation (based on principal observations), market incentives (for teaching in hard-to-staff schools and subject areas), and student growth (including value-added and other approaches). The students were tested in math, reading, and writing in grades 3–10 and science in grades 5, 8, and 10. The value-added component was excluded for teachers in nontested subjects and grades.
The Hillsborough County initiative based 60 percent of a teacher's evaluation on classroom observations, with half of that based on a principal's observations and the other half based on observations by a trained mentor or peer evaluator. The other 40 percent was based on student achievement growth as measured by standardized state tests. In addition to administering statewide math, reading, writing, and science tests in grades 3–11, the county had developed more than 500 end-of-course exams available for a broad array of subjects — including foreign languages, art, music, career/technical education, and even physical education — that were not tested by state exams.
AP IMAGES/LYNNE SLADKY
Teacher Bev Campbell displays letters of the alphabet in her special education class at Amelia Earhart Elementary School in Hialeah, Florida, on April 3, 2012. Measuring the improvement of special education students is complicated.
In the statewide Tennessee system, 50 percent of a teacher's evaluation was based on principal observations, 35 percent on the teacher's value-added estimates from standardized state tests and end-of-course exams, and 15 percent on other tests of student performance, such as college entrance tests, advanced-placement tests, and end-of-year subject tests. The standardized state tests covered math, reading, science, and social studies in grades 3–8, while the end-of-course exams encompassed algebra, biology, chemistry, English, geometry, physics, and U.S. history.
In nearly a mirror image of the Tennessee system, the one in Washington, D.C., based 50 percent of a teacher's evaluation on his or her estimated value added to student achievement gains, 35 percent on five classroom observations per year, 10 percent on demonstrated commitment to the school community (including outreach to parents and collaboration with colleagues), and 5 percent on schoolwide student achievement growth. The city administered standardized tests for math and reading in grades 3–8, for science in grades 5 and 8, and for biology.
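Weighting schemes like D.C.'s reduce to a simple composite score. The sketch below uses the percentages described above; the component scores themselves are hypothetical, assuming each component is rated on a common 1-to-4 scale.

```python
# D.C.-style weights as described in the text (they sum to 1.0).
weights = {
    "value_added": 0.50,        # individual value-added estimate
    "observations": 0.35,       # five classroom observations per year
    "commitment": 0.10,         # commitment to the school community
    "schoolwide_growth": 0.05,  # schoolwide achievement growth
}

# Hypothetical component ratings for one teacher on a 1-4 scale.
scores = {
    "value_added": 3.2,
    "observations": 3.6,
    "commitment": 4.0,
    "schoolwide_growth": 2.8,
}

composite = sum(weights[k] * scores[k] for k in weights)
print(round(composite, 2))  # weighted overall evaluation score
```

The same arithmetic, with the weights rearranged, describes the Tennessee and Hillsborough County formulas as well.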
The Delaware program was in flux in 2010 but was nevertheless intriguing. The teacher evaluation formula assigned equal weights to planning and preparation, classroom environment, instruction, professional responsibilities, and student improvement. Rather than incorporating value-added estimates, the system that was in existence at the time focused on a teacher's ability to use assessment and accountability data to set annual goals for student performance and to measure student progress toward those goals. However, the state was working to incorporate teachers' value-added estimates into their evaluations for tested subjects, which were math, reading, writing, science, and social studies.
Lessons for Education Policymakers
For policymakers working to develop new teacher evaluation systems, we offer the following recommendations, which are gleaned from the research literature and informed by the challenges facing the state and local systems we profiled.
Create comprehensive evaluation systems that incorporate multiple measures of teacher effectiveness. The systems highlighted above attest to the importance of evaluating teachers along multiple dimensions. These include not only value-added estimates of student achievement growth but also observational evidence of teacher effectiveness in the classroom and evidence of their professional contributions to their schools. Moreover, the examples remind us that evidence of classroom effectiveness includes the ability to plan appropriate lessons, set goals for student learning, and demonstrate that students have met those goals. The core lesson here is that teacher effectiveness is multifaceted, and no single measure of that effectiveness — whether observational or based on student test scores — is impervious to error. A comprehensive system will therefore take multiple sources of evidence into account.
AP IMAGES/DAVID SNODGRESS, HERALD-TRIBUNE
Stella Royal, assistant principal of Indiana's Bloomfield Jr/Sr High School, observes a digital communications class as part of a teacher evaluation on October 17, 2012.
Attend to the technical properties of reliability and validity in student assessments, especially when the assessments are used in high-stakes contexts. The reliability of scores and the validity of inferences drawn from those scores depend on how the assessments are being used. For example, teachers ideally would not be responsible for grading their own students on measures that carry high stakes. Moreover, policymakers should train and evaluate the raters of open-ended assessments to promote high levels of agreement across raters. Policymakers should also promote the consistent use and administration of student assessments across classrooms, particularly in the case of nonstandard assessments, such as student writing prompts or portfolio assessments.
Promote consistency in whatever student performance measures teachers are allowed to choose. If teachers are permitted to choose the assessment for which they are held accountable, they should be given clear parameters about the choices available. Where possible, teachers should be guided toward standardized assessments for which there is documented evidence of usefulness. This is an instance where limiting the choices of measures will promote consistency across classrooms, resulting in measures of effectiveness that are comparable among teachers in the same subjects and grades.
Use multiple prior years of student achievement data when estimating the value added by a teacher; where possible, average the annual estimates across multiple years of teaching. In Tennessee, for example, the system has historically included up to five years of students' prior test scores in establishing a solid baseline to estimate the value added by each teacher. The system also estimates a teacher's single-year and three-year value added in each subject taught, as long as there are at least three years of data on record for that teacher. Using multiple years of student achievement data when estimating a teacher's value added increases the accuracy of those estimates. In addition, averaging the estimates across multiple years reduces the instability of the estimates.
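The stability gain from multi-year averaging is easy to demonstrate with a small simulation. This sketch assumes, purely for illustration, a teacher whose true value added is 2 score points and whose single-year estimates carry random noise with a standard deviation of 3 points; the three-year average of such estimates varies much less.

```python
import random
import statistics

random.seed(1)
TRUE_VA = 2.0   # hypothetical teacher's true value added (score points)
NOISE_SD = 3.0  # assumed noise in a single-year estimate

# Simulate 1,000 single-year estimates and 1,000 three-year averages.
single_year = [random.gauss(TRUE_VA, NOISE_SD) for _ in range(1000)]
three_year = [
    statistics.mean(random.gauss(TRUE_VA, NOISE_SD) for _ in range(3))
    for _ in range(1000)
]

# The three-year averages cluster much more tightly around the true value.
print(round(statistics.stdev(single_year), 2),
      round(statistics.stdev(three_year), 2))
```

In expectation, averaging three independent annual estimates shrinks the noise by a factor of √3, which is why systems like Tennessee's report a three-year estimate alongside the single-year one.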
Find ways to hold teachers accountable for students who are excluded from their value-added estimates. Students who have spent only part of the year in a teacher's classroom or who lack prior test scores cannot easily be included in the teacher's value-added estimates. In these cases, teachers should be encouraged to demonstrate student progress in other ways, such as on class tests or completed assignments. Having to show evidence of student progress on such measures may encourage teachers to attend to the needs of those whose test scores might otherwise be ignored.
Given the risks of incorporating student performance measures into teacher evaluation systems, experimentation is critical, and it is important for states and districts to learn from one another about what does and does not work. In the long term, it will also be important to examine how these kinds of evaluation systems that incorporate various rewards or sanctions affect the composition of the teacher workforce and disadvantaged students' access to good teachers. The hope is that by bringing teacher assessments into better alignment with teacher behavior and effectiveness, schools will have richer information with which to make personnel decisions, and teachers will have more-accurate information about how well their students are learning.