Test-Based Accountability Systems
The Kentucky program--called the Kentucky Instructional Results Information System, or KIRIS--uses a variety of testing techniques in an effort to guard against deleterious effects of teaching to the test. Besides traditional multiple-choice items, assessments have included "performance events" involving both group and individual activities, as well as open-response questions and writing and mathematics portfolios.
Despite enthusiasm for this new approach to assessment-based accountability systems, however, there has been little evidence until now indicating whether performance-based testing is more immune than traditional multiple-choice tests to the problem of inflated scores. A recent RAND report offers the first comprehensive evaluation of the effects of Kentucky's assessment program on achievement, examining the extent to which increases in scores indicate true gains in student learning. In The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS), authors Daniel Koretz and Sheila Barron suggest that increases in scores, particularly in the early years of the program, cannot be interpreted as revealing greater student mastery of the subjects tested.
Targets and Trends in KIRIS Scores
During the period studied, KIRIS scores were reported in terms of four levels--Novice, Apprentice, Proficient, and Distinguished--that were arbitrarily assigned the scores 0, 40, 100, and 140, respectively. (To make trends comparable across subjects and tests, the RAND study converted these scores to a common scale and expressed changes in terms of fractions of a standard deviation.) Regardless of their starting point, at the end of 20 years all schools were expected to reach a mean score of 100--equivalent to all students performing at the Proficient level. To meet this target, a typical school would have had to show an improvement of 2 standard deviations over 20 years, or 0.1 per year; to receive cash awards, increases had to be larger.
By any standard, these targets are very high. In fact, they are unprecedented in large-scale educational interventions. A mean increase of 2 standard deviations would mean that half of all students in the year 2012 would have to exceed the performance exceeded by only two percent of students in 1992. Nevertheless, many schools not only met these targets but exceeded them in the first years of the program. Fourth-grade reading scores increased by 1.4 standard deviations in four years. Gains in math and science were smaller--0.6 and 0.5, respectively--but if they represented real gains in learning, they would be roughly equivalent to erasing about half the difference between Japan (one of the highest scoring countries) and the U.S. (which ranked 18th out of 25 nations with good samples) on the recent Third International Mathematics and Science Study (TIMSS) in the space of only three years. Such seemingly implausible gains called for validation.
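The "two percent" figure can be checked with a short calculation. The sketch below is illustrative only and is not taken from the RAND report: it assumes scores are normally distributed and that the spread of scores stays the same while the mean rises. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def baseline_share_above_new_median(shift_sd):
    """Share of the baseline cohort scoring above the later cohort's median.

    Assumes (for illustration) normal scores with unchanged spread,
    and a mean that rises by `shift_sd` baseline standard deviations.
    """
    return 1.0 - normal_cdf(shift_sd)


# A 2 standard deviation increase in the mean: the later median sits where
# only about 2.3% of the baseline cohort scored.
share = baseline_share_above_new_median(2.0)
print(f"{share:.1%}")
```

Under these assumptions, the level reached by half of the later cohort was reached by only about 2.3 percent of the baseline cohort, consistent with the roughly two percent cited above.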
Evidence Suggests Inflation
The RAND study assessed the validity of these gains using both external and internal evidence. External evidence is provided by trends for comparable tests, particularly the National Assessment of Educational Progress (NAEP) and the American College Testing (ACT) college-admissions tests. Internal evidence derives from the KIRIS test itself, such as an analysis of performance on new test items compared to performance on reused test items. The study focused on fourth and eighth grade performance in math, reading, and science--subjects that offered the best external evidence--from 1992 to 1996.
If increases in KIRIS scores indicate improved mastery of a broadly defined subject, then those gains should be reflected substantially on other tests. NAEP data are particularly important because the KIRIS assessment is designed to measure much of the same content and skills that NAEP measures. The discrepancy between KIRIS and NAEP scores in reading is unambiguous: while fourth-grade KIRIS scores increased by a remarkable 0.75 standard deviation in two years, NAEP scores remained unchanged. In mathematics, NAEP scores of Kentucky students did increase somewhat over four years, but the increase was comparable to national trends. Gains on KIRIS were 3.6 times larger than NAEP increases in the fourth grade and 4.1 times larger in the eighth grade.
The ACT is less valuable than NAEP as external evidence because its framework is less similar to that of KIRIS, it includes only multiple-choice items, and it is taken by an unrepresentative (but very large) sample of the state's high school students. Despite these differences, however, the Kentucky Department of Education has argued that correlations between the two tests provide evidence of the validity of KIRIS scores and that it is reasonable to assume that "increased learning that leads to improvement on one is likely to lead to improvement on the other" (KIRIS Accountability Cycle I Technical Manual, 1995, p. 14).
Between 1992 and 1995, the increased KIRIS scores of high school students in reading and math were not reflected in their ACT scores. As the figure shows, the contrast in performance on KIRIS and ACT math tests was striking: a difference of about 0.7 standard deviation. The difference in reading scores was about 0.4 standard deviation. In science, ACT scores increased slightly, by 0.1 standard deviation, but KIRIS scores increased about five times as much. Taken together, these trends suggest appreciable inflation of gains on KIRIS.
Changes in mathematics scores on KIRIS and ACT tests
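The comparisons above rest on expressing each test's gain in standard deviation units and then setting the two gains side by side. The following is a minimal sketch of that idea, not the study's actual procedure. The function names and all raw scores are hypothetical; the raw numbers were chosen only so that the resulting gains (0.5 and 0.1) echo the science comparison described above.

```python
def standardized_gain(mean_before, mean_after, sd_baseline):
    """Change in mean score as a fraction of the baseline standard deviation."""
    if sd_baseline <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (mean_after - mean_before) / sd_baseline


def gain_ratio(program_gain, audit_gain):
    """How many times larger the program-test gain is than the audit-test gain.

    Returns None when the audit test shows no gain, since a ratio
    is not meaningful in that case.
    """
    if audit_gain <= 0:
        return None
    return program_gain / audit_gain


# Hypothetical raw means and standard deviations (NOT data from the study).
program = standardized_gain(mean_before=40.0, mean_after=50.0, sd_baseline=20.0)
audit = standardized_gain(mean_before=20.0, mean_after=20.5, sd_baseline=5.0)

print(f"program gain: {program:.2f} SD")
print(f"audit gain:   {audit:.2f} SD")
print(f"ratio:        {gain_ratio(program, audit):.1f}x")
```

A large ratio does not by itself measure inflation, but a program-test gain several times the size of the gain on a comparable audit test is the kind of pattern the study treats as a warning sign.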
Taken together, the evidence suggests that KIRIS scores have been inflated and are therefore not a meaningful indicator of increased learning. The size of the inflation cannot be determined precisely, but it appears to be appreciable. It must be emphasized, however, that the study focused on the first four years following introduction of the KIRIS system. One reason for rapid initial gains is that students are becoming more familiar with the specific demands of the test. In that case, later scores may be more accurate than initial scores, but the gain in scores would be misleadingly large nonetheless.
Steps to Improve Validity
One of the study's main conclusions is that changes in the question format are not sufficient to solve the problem of teaching to the test. Indeed, the use of open-response formats may exacerbate the potential for inflated scores by reducing the number of items that can be administered in a given amount of time. Although it may not be possible to eliminate score inflation entirely, it may be possible to reduce its severity. The authors propose several steps:
- Set realistic targets for improvement. If teachers are told they must make larger gains than they can accomplish by legitimate means, they will have a greater incentive to cut corners by teaching to the test.
- Tie assessments to clear curricula. If teachers are confronted with a centralized testing program but given no accompanying curriculum, they will have a strong incentive to use the test as a surrogate curriculum framework--again increasing the possibility of inappropriate teaching to the test.
- Design assessments to minimize inflation. Inflation may be lessened by sampling systematically from subject areas, for example, or it may be necessary to eliminate the reuse of test questions.
- Monitor for potential inflation of gains. Monitoring could involve external audit testing, as this study did with the NAEP test, adaptations to the testing program itself, or both.
- Credit other aspects of educational performance. Improvements to large-scale assessment programs may be insufficient. Teachers often appropriately focus on shorter-term outcomes, such as finding ways to get an unmotivated student interested in a subject or finding clearer ways to present a certain body of knowledge, whether it is likely to be tested or not. It may be necessary to broaden accountability systems to include not just test scores but shorter-term outcomes as well.
- Discount early gains. Sponsors of new assessments should advise educators, the public, and the press to discount gains over the first few years of a new program while students are familiarizing themselves with the demands of the new test.
RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. This research brief summarizes work done within RAND Education and documented in The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS) by Daniel M. Koretz and Sheila I. Barron, MR-1014-EDU, 1998, ISBN: 0-8330-2687-9, which is available from National Book Network (Telephone: 800-462-6420; FAX: 301-459-2118). Building on more than 25 years of research and evaluation work, RAND Education (formerly RAND's Institute on Education and Training) has as its mission the improvement of educational policy and practice in formal and informal settings from early childhood on. A profile of RAND Education, abstracts of its publications, and ordering information may be viewed on the World Wide Web.
RB-8017 (1999)
All rights reserved. Permission is given to duplicate this on-line document for personal use only, as long as it is unaltered and complete. Copies may not be duplicated for commercial purposes.
Published 1999 by RAND