Text-Based Accountability Systems

Lessons of Kentucky's Experiment

by Daniel Koretz, Sheila Barron

Research Brief

In an effort to improve education, more and more states are adopting policies that make teachers accountable for student scores on statewide assessments. In the past, multiple-choice tests were often used for this purpose, but it is now widely argued that the result was both degraded instruction and inflated scores from inappropriate teaching to the test. Thus, in this decade, many states have turned to other forms of assessments that they hope will be "worth teaching to." Kentucky was among the first states to attract widespread attention by moving in this direction. As part of a comprehensive overhaul of its schools in the early 1990s, Kentucky put in place an assessment and accountability system that rewarded and sanctioned schools largely on the basis of changes in scores on complex, partially performance-based assessments. In the first four years of the program, scores showed steep gains, and the Kentucky Department of Education awarded approximately 50 million dollars to schools that showed large gains in scores.

The Kentucky program—called the Kentucky Instructional Results Information System, or KIRIS—uses a variety of testing techniques in an effort to guard against deleterious effects of teaching to the test. Besides traditional multiple-choice items, assessments have included "performance events" involving both group and individual activities, as well as open-response questions and writing and mathematics portfolios.

Despite enthusiasm for this new approach to assessment-based accountability systems, however, there has been little evidence until now indicating whether performance-based testing is less susceptible than traditional multiple-choice tests to the problem of inflated scores. A recent RAND report offers the first comprehensive evaluation of the effects of Kentucky's assessment program on achievement, examining the extent to which increases in scores indicate true gains in student learning. In The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS), authors Daniel Koretz and Sheila Barron suggest that increases in scores, particularly in the early years of the program, cannot be interpreted as revealing greater student mastery of the subjects tested.

During the period studied, KIRIS scores were reported in terms of four levels—Novice, Apprentice, Proficient, and Distinguished—that were arbitrarily assigned the scores 0, 40, 100, and 140, respectively. (To make trends comparable across subjects and tests, the RAND study converted these scores to a common scale and expressed changes in terms of fractions of a standard deviation.) Regardless of their starting point, at the end of 20 years all schools were expected to reach a mean score of 100—equivalent to all students performing at the Proficient level. To meet this target, a typical school would have had to show an improvement of 2 standard deviations over 20 years, or 0.1 per year; to receive cash awards, increases had to be larger.
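The target arithmetic above can be checked in a few lines of Python. The level scores and the 20-year horizon come from the brief; the 2-standard-deviation starting gap is the brief's description of a typical school.

```python
# KIRIS performance levels and their assigned scores (from the brief).
LEVEL_SCORES = {"Novice": 0, "Apprentice": 40, "Proficient": 100, "Distinguished": 140}

GOAL_MEAN = LEVEL_SCORES["Proficient"]  # every school must reach a mean of 100
YEARS = 20                              # time allowed to reach the goal

# A typical school started about 2 standard deviations below the goal.
starting_gap_in_sd = 2.0

annual_target_in_sd = starting_gap_in_sd / YEARS
print(f"Required gain: {annual_target_in_sd:.2f} standard deviations per year")
# prints: Required gain: 0.10 standard deviations per year
```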

By any standard, these targets are very high. In fact, they are unprecedented in large-scale educational interventions. A mean increase of 2 standard deviations would mean that in the year 2012, half of all students would have to exceed a level of performance that only about 2 percent of students exceeded in 1992. Nevertheless, many schools not only met these targets but exceeded them in the first years of the program. Fourth-grade reading scores increased by 1.4 standard deviations in four years. Gains in math and science were smaller—0.6 and 0.5, respectively—but if they represented real gains in learning, they would be roughly equivalent to erasing about half the difference between Japan (one of the highest-scoring countries) and the U.S. (which ranked 18th out of 25 nations with good samples) on the recent Third International Mathematics and Science Study (TIMSS) in the space of only three years. Such seemingly implausible gains called for validation.
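The "2 percent" figure follows from the shape of the normal distribution: a score 2 standard deviations above the mean falls near the 98th percentile. A quick check, assuming approximately normal scores (an assumption of ours, not a claim of the brief):

```python
from statistics import NormalDist

# Fraction of a standard normal distribution lying above mean + 2 SD.
frac_above = 1 - NormalDist().cdf(2.0)
print(f"{frac_above:.1%} of students exceed a score 2 SD above the mean")
# prints: 2.3% of students exceed a score 2 SD above the mean
```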

Evidence Suggests Inflation

The RAND study assessed the validity of these gains using both external and internal evidence. External evidence is provided by trends for comparable tests, particularly the National Assessment of Educational Progress (NAEP) and the American College Testing (ACT) college-admissions tests. Internal evidence derives from the KIRIS test itself, such as an analysis of performance on new test items compared to performance on reused test items. The study focused on fourth and eighth grade performance in math, reading, and science—subjects that offered the best external evidence—from 1992 to 1996.

If increases in KIRIS scores indicate improved mastery of a broadly defined subject, then those gains should be substantially reflected in scores on other tests. NAEP data are particularly important because the KIRIS assessment is designed to measure much of the same content and skills that NAEP measures. The discrepancy between KIRIS and NAEP scores in reading is unambiguous: while fourth-grade KIRIS scores increased by a remarkable 0.75 standard deviation in two years, NAEP scores remained unchanged. In mathematics, NAEP scores of Kentucky students did increase somewhat over four years, but the increase was comparable to national trends. Gains on KIRIS were 3.6 times as large as NAEP increases in the fourth grade and 4.1 times as large in the eighth grade.

The ACT is less valuable than NAEP as external evidence because its framework is less similar to that of KIRIS, it includes only multiple-choice items, and it is taken by an unrepresentative (but very large) sample of the state's high school students. Despite these differences, however, the Kentucky Department of Education has argued that correlations between the two tests provide evidence of the validity of KIRIS scores and that it is reasonable to assume that "increased learning that leads to improvement on one is likely to lead to improvement on the other" (KIRIS Accountability Cycle I Technical Manual, 1995, p. 14).

Between 1992 and 1995, the increased KIRIS scores of high school students in reading and math were not reflected in their ACT scores. As the figure shows, the contrast in performance on KIRIS and ACT math tests was striking: a difference of about 0.7 standard deviation. The difference in reading scores was about 0.4 standard deviation. In science, ACT scores increased slightly, by 0.1 standard deviation, but KIRIS scores increased about five times as much. Taken together, these trends suggest appreciable inflation of gains on KIRIS.

Figure: Changes in mathematics scores on KIRIS and ACT tests

Internal evidence reinforces this finding, particularly in math. Researchers noted a "sawtooth" pattern in which scores went up on reused items, then dropped when new items were introduced. The discrepancy in performance on new and reused items tended to be greater in schools that showed larger overall gains in math scores—a relationship consistent with score-inflating coaching focused on reused items. This relationship was absent, however, in reading: although a moderate sawtooth pattern existed, it was not notably stronger in schools that showed the greatest gains in reading.

Taken together, the evidence suggests that KIRIS scores have been inflated and are therefore not a meaningful indicator of increased learning. The size of the inflation cannot be determined precisely, but it appears to be appreciable. It must be emphasized, however, that the study focused on the first four years following introduction of the KIRIS system. One possible reason for rapid initial gains is that students become more familiar with the specific demands of the test. In that case, later scores may be more accurate than initial scores, but the gain in scores would nonetheless be misleadingly large.

Steps to Improve Validity

One of the study's main conclusions is that changes in the question format are not sufficient to solve the problem of teaching to the test. Indeed, the use of open-response formats may exacerbate the potential for inflated scores by reducing the number of items that can be administered in a given amount of time. Although it may not be possible to eliminate score inflation entirely, it may be possible to reduce its severity. The authors propose several steps:

  • Set realistic targets for improvement. If teachers are told they must make larger gains than they can accomplish by legitimate means, they will have a greater incentive to cut corners by teaching to the test.
  • Tie assessments to clear curricula. If teachers are confronted with a centralized testing program but given no accompanying curriculum, they will have a strong incentive to use the test as a surrogate curriculum framework—again increasing the possibility of inappropriate teaching to the test.
  • Design assessments to minimize inflation. Inflation may be lessened by sampling systematically from subject areas, for example, or it may be necessary to eliminate the reuse of test questions.
  • Monitor for potential inflation of gains. Monitoring could involve external audit testing, as this study did with the NAEP test, adaptations to the testing program itself, or both.
  • Credit other aspects of educational performance. Improvements to large-scale assessment programs may be insufficient. Teachers often appropriately focus on shorter-term outcomes, such as finding ways to get an unmotivated student interested in a subject or finding clearer ways to present a certain body of knowledge, whether it is likely to be tested or not. It may be necessary to broaden accountability systems to include not just test scores but shorter-term outcomes as well.
  • Discount early gains. Sponsors of new assessments should advise educators, the public, and the press to discount gains over the first few years of a new program while students are familiarizing themselves with the demands of the new test.

KIRIS has been the subject of intense controversy over the past year, and the Kentucky General Assembly has already enacted legislation that will change it considerably. The RAND study, however, has relevance well beyond the Kentucky debate. The assessment and accountability systems in many states are similar to Kentucky's: they tend to focus on aggregate gains, link those gains to rewards, rely on assessments administered in only a few grades, use performance-based testing to "make tests worth teaching to," and set goals for improvement without reference to actual performance distributions. This study should serve as a warning to such state and local programs that they are risking inflated scores and need to assess the validity of recorded gains.

This report is part of the RAND Corporation Research brief series. RAND research briefs present policy-oriented summaries of individual published, peer-reviewed documents or of a body of published work.

The RAND Corporation is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.