commentary

(The RAND Blog)

January 21, 2015

Reauthorizing ESEA: Four Recommendations to Make Testing Work

by Laura S. Hamilton, Brian M. Stecher, Grace Evans

Since the Elementary and Secondary Education Act (ESEA), better known in its most recent incarnation as No Child Left Behind (NCLB), expired over seven years ago, numerous attempts to reauthorize it have surfaced and withered. However, Sen. Lamar Alexander, the new Chairman of the Senate Health, Education, Labor and Pensions Committee, has made overhauling NCLB a priority, and his counterpart in the House of Representatives, Rep. John Kline, is eager to get to work as well. So, will 2015 finally be the year Congress is able to reauthorize ESEA? Success will depend on legislators clearing several hurdles, including decisions regarding teacher quality, school improvement, and charter schools. At the center of the current debate, and perhaps the largest obstacle, remains the issue of federal requirements for testing.

One of the major questions legislators are grappling with is whether to retain the requirement for annual testing in grades three through eight and one high school grade, or to give states the flexibility to develop alternative testing plans that might, for example, require testing in only a few grade levels. Decisions about testing requirements will influence students and educators at all levels of the system, and therefore need to be considered carefully.

Unfortunately, there is no easy answer about the ideal amount of student testing. Research suggests that testing adopted in response to NCLB has produced both positive and negative outcomes. RAND has found that the high-stakes nature of these tests is associated with score inflation; a narrowing of curriculum and instruction, which leads to the exclusion of important content; and extensive “gaming” as teachers have learned to predict test content and adjust their instruction in ways that raise test scores without necessarily improving student learning. Moreover, the focus on status—performance at a single point in time—rather than growth may have negatively influenced teachers' morale and support for state accountability systems, and it reduces the value of the information provided by accountability systems because status measures are heavily influenced by student background and other factors that are not under the control of schools.

At the same time, these requirements have provided families with accessible (if imperfect) information about school performance, focused many educators on the importance of raising student achievement, and illuminated achievement gaps between different racial/ethnic and socioeconomic groups, which raised a sense of urgency in schools around reducing these gaps. The actual amount of time spent on annual testing is not large relative to the amount of time spent learning, and the worst use of time is unnecessary targeted test preparation, which schools can control and reduce on their own. Total testing time can also be reduced by promoting coordination between state and local testing policies to avoid unnecessary duplication and to reduce lower-priority testing. This would not necessarily require changing NCLB's testing structure.

Reducing the amount of required testing might mitigate some of its negative effects, but it will also reduce the amount of information available to various stakeholders. In particular, lack of consecutive-grade testing will make it impossible to measure students' year-to-year growth in achievement, leading users of test results to rely on the less-informative status-based scores.

Complicating the issue is that scores on standardized achievement tests are used for a range of objectives, which may require different types and timing of assessments. For example, informing parents how well their child is doing each year requires annual testing of every child in every grade in every important subject. By contrast, monitoring national performance could be accomplished by testing a sample of students from a sample of schools in a sample of states in a sample of grade levels in a sample of subjects. Each specific use of assessment should be subject to a validity investigation to determine the appropriateness of the test for that particular use.

Regardless of which approach to testing is chosen, we believe the following conditions should be in place to maximize the likelihood of positive outcomes:

  1. Tests should be designed to address the full range of college- and career-ready standards that many states have adopted. Given the emphasis these standards place on higher-order thinking and problem solving, this means not relying exclusively on inexpensive multiple-choice assessments. Advances in technology, such as computer-scored open-response questions and simulations of scientific investigations, provide an opportunity to expand the range of test formats in a cost-effective way, but improvements in both hardware and software are needed before these formats can be deployed on a national level.

  2. Test content and the way questions are asked should be somewhat unpredictable from one year to the next. One of the contributors to score inflation and curricular narrowing is the repeated use of similar items over time because it allows educators to become familiar with specific ways of assessing content, and research suggests that many educators teach to these specific formats rather than focusing more broadly on the underlying content the items are intended to measure.

  3. Reporting requirements should encourage states and districts to report on multiple indicators of school success, not just test scores. Inclusion of non-test measures—such as graduation rates; rates of enrollment in college preparatory, Advanced Placement, or International Baccalaureate courses; and availability of extracurricular opportunities or special services—should be encouraged.

  4. The specific metrics that are tied to decisions about accountability should be carefully designed to prevent gaming, and the consequences attached to them should be designed to promote positive change. For instance, NCLB's emphasis on reporting whether a student's performance was above or below the “proficient” threshold led some educators to focus their efforts on the students who were just below that threshold as a means of maximizing their schools' performance on this metric, and to give less attention to other students. The validity and reliability of these metrics should be examined in the same way that the tests themselves are. Similarly, NCLB's formulaic penalties may have encouraged gaming, whereas different accountability policies could encourage continuous improvement.

Testing has the potential to provide a broad range of stakeholders with critical information about the state of education in the United States. But with different questions being asked by national and state policymakers, districts, schools, and families, it is clear that there is still much to be negotiated about the role of federally required testing. As they make those decisions, our four suggestions can help members of Congress overcome some of the challenges testing has faced in the past and support high-quality teaching and learning for all of America's students.


Laura Hamilton is a senior behavioral scientist, Brian Stecher is a senior social scientist, and Grace Evans is a legislative analyst at the nonprofit, nonpartisan RAND Corporation.

Commentary gives RAND researchers a platform to convey insights based on their professional expertise and often on their peer-reviewed research and analysis.