Building on more than 25 years of research and evaluation work, RAND Education has as its mission the improvement of educational policy and practice in formal and informal settings from early childhood on.
Many policymakers want to use tests to assess how well schools fare when compared to schools in the same district or state or in the nation. This information could be used to motivate school personnel, to reward successful efforts, and to indicate where additional resources or changes in practices are most needed. In fact, one of the primary benefits of such an accountability system, according to its advocates, is that it would serve as an impetus for reform by mobilizing teachers, parents, and members of the local community to support educational improvement (Smith, Stevenson, & Li, 1998). Of course, there are risks that test scores also could be used for less benign purposes, such as penalizing low-performing schools.
In his 1997 State of the Union address, President Clinton argued the merits of developing a national test for fourth grade reading and eighth grade math. This set of examinations would be the first in this country to offer individual student scores on a common test for all students, and would provide students, parents, and teachers with information regarding how well students are performing in relation to their peers across the country. Presumably, this information could help parents and teachers identify students who are falling behind in important skills so that the necessary help could be provided (Clinton, 1997). This proposal was intended to "help ensure that all of America's children have the opportunity to achieve academic success in reading and mathematics" (Smith, Stevenson, & Li, 1998, p. 42).
All of these proposed uses assume that tests provide valid and cost-effective indicators of student proficiency. This paper reviews the salient characteristics of the current inventory of tests, analyzes two recent proposals for creating a national testing program, including Clinton's Voluntary National Test (VNT), and describes a new approach to both statewide and nationwide testing that RAND is currently examining.
Section II of this paper discusses the criteria typically used for evaluating large-scale testing programs. Those familiar with such criteria may skip that section and turn directly to Section III, where tests presently used for large-scale assessment programs are discussed in light of the evaluation criteria. Section IV describes two methods proposed for obtaining valid measures of individual student achievement that can be used to monitor pupil progress relative to national standards: (1) linking existing state and local tests to the National Assessment of Educational Progress (NAEP), and (2) the VNT. Section V presents a promising alternative method, one based on computerized adaptive tests (CATs) delivered over the Internet. We believe that this approach may offer a more valid and reliable way to measure what students learn. Finally, Section VI summarizes the advantages of our proposed approach and notes issues that need to be resolved regarding its implementation.
Additionally, if a test is used to make important decisions about individual students--such as whether they should be promoted or graduated--then students should be offered several opportunities to pass truly comparable versions of the test (National Research Council, Committee on Appropriate Test Use, 1998). Students' test results also should be reported shortly after test administration.
Table 1 lists and briefly defines the major criteria typically considered in judging the technical quality of a test. More extensive discussions of these criteria can be found in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985). Note that although our focus here (and thus the title of the table) is confined to tests used in large-scale assessment programs, many of these criteria are also relevant for other kinds of tests, such as those created by teachers for the purpose of assigning grades. Our list is not exhaustive, but it does include features that must be examined when evaluating testing programs.
Validity: Degree to which a test measures what it was intended to measure.
Reliability: Consistency and dependability of assessment results.
Equity: Fairness, lack of bias toward any particular group of examinees.
Costs and feasibility: Monetary and other (e.g., opportunity) costs of administering a test.
Test security and standardization: Degree to which inappropriate exposure to test questions is prevented and consistent administration conditions are maintained.
Mechanisms for calibration across administrations: Procedure for ensuring that scores obtained on different administrations of the test (including those using alternate forms) are comparable with one another; e.g., there is adjustment for any differences in difficulty across different versions (forms) of the test.
Use of relative or fixed standards: Way in which scores are interpreted--compared with performance of other examinees or with a prespecified standard such as a cut score.
Alignment with curriculum: For tests intended to measure outcomes related to a particular instructional program, the degree to which items reflect what was actually taught.
Consequences: Inferences and actions that result from the use of the test scores.
Timeliness of score reporting: Length of time between test administration and score reporting.
Motivation: Degree to which examinees are motivated to do their best.
Representativeness of examinee samples: Whether the students taking the test are similar to the entire population about which inferences are being made.
Public trust and acceptability
Legal defensibility
A critical component of an assessment program that uses different sets of items across administrations is a mechanism for calibrating, or what is technically known as equating, scores obtained on one administration to those obtained on other administrations. Such a mechanism is essential if the scores are to be used to monitor changes in student or school performance over time. This is especially the case in large-scale assessment programs, where test security, and thereby test validity, requires that questions asked on one administration are generally different from those asked on other administrations. These differences in test content may make questions asked on one occasion, as a group, somewhat harder than those asked on another. Equating is designed to adjust the scores to correct for these differences in difficulty.
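To make the adjustment concrete, the following is a minimal sketch of linear equating, one common equating method. It is offered only as an illustration: the text above does not say which method any particular program uses, the function name and score lists are hypothetical, and the sketch assumes that the groups taking the two forms are equivalent in ability, so that differences in the score distributions reflect differences in form difficulty.

```python
from statistics import mean, pstdev


def linear_equate(score_x, form_x_scores, form_y_scores):
    """Place a score earned on form X onto the scale of form Y.

    Assumption (not from the source): the two groups of examinees are
    equivalent in ability, so the forms differ only in difficulty.
    """
    mean_x, sd_x = mean(form_x_scores), pstdev(form_x_scores)
    mean_y, sd_y = mean(form_y_scores), pstdev(form_y_scores)
    # A score keeps the same standardized position on both forms:
    # (x - mean_x) / sd_x == (y - mean_y) / sd_y
    return mean_y + (sd_y / sd_x) * (score_x - mean_x)


# Hypothetical raw scores: form X turned out harder than form Y.
form_x = [10, 12, 14, 16, 18]
form_y = [14, 16, 18, 20, 22]

equated = linear_equate(14, form_x, form_y)
```

In this invented example, an average score on the harder form is reported as an average score on the easier form, which is the kind of correction for difficulty described above.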
Test scores can be interpreted in terms of relative or fixed standards. Relative, or norm-referenced, judgments involve examining how well a student performs in comparison to other students. For example, test scores are often used to gauge how well the students at a school perform relative to students in some national sample of students or to students at other schools in the same district or state. In contrast, fixed, or criterion-referenced, judgments involve examining the degree to which students have mastered a specific body of knowledge, skills, and abilities. These types of comparisons are used in standards-based testing programs that seek to provide information on numbers of students who have achieved a preset standard of performance. Both norm- and criterion-referenced interpretations assume that students take the same or truly comparable tests, that these tests are administered under the same conditions (e.g., time limits), and in the case of open-ended exams, that the standards used to evaluate answer quality are the same for all students.
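The difference between the two interpretations can be shown with a small sketch. The norm group, the cut score, and the function names below are hypothetical and chosen only for illustration; the point is that the same raw score can look adequate under a relative standard and inadequate under a fixed one.

```python
def percentile_rank(score, norm_group):
    """Norm-referenced view: percent of the norm group scoring below `score`."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)


def meets_standard(score, cut_score):
    """Criterion-referenced view: has the preset standard been reached?"""
    return score >= cut_score


# Hypothetical norm group and cut score (assumptions, not from the source).
norm_group = [35, 42, 48, 50, 55, 61, 67, 70, 74, 80]
cut_score = 65

student_score = 61
rank = percentile_rank(student_score, norm_group)
passed = meets_standard(student_score, cut_score)
```

Here the student outscores half of the norm group yet falls short of the cut score, so the two kinds of judgment lead to different conclusions about the same performance.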
Additional criteria for evaluating tests include the degree to which a test is aligned with the curriculum (which is particularly important for tests intended to measure effects of specific educational programs), consequences of test use, and timeliness of score reporting (an important consideration for scores intended to inform decisions about individual pupils). The level of student motivation also must be considered: When a testing situation has low stakes for students, they may not put forth their best effort.
The extent to which a sample of test-takers is representative of a larger population is an important criterion in evaluating tests used to make inferences about the performance of groups, such as students in a particular district or state. Finally, public acceptability and legal defensibility are necessary considerations in almost any large-scale assessment program.
The National Assessment of Educational Progress (NAEP) was established to monitor state and national achievement trends. The "state" portion of this program, which provides scores for individual states, has administered tests in one or two subjects every two years since 1990. As Table 2 shows, the grade levels and subjects tested vary on a yearly basis. For example, in 1996, State NAEP administered eighth-grade math and science tests and fourth-grade math tests; in 1998, State NAEP administered eighth-grade reading and writing tests and fourth-grade reading tests. The national portion of NAEP provides national scores and enables trend analysis. It also frequently administers tests in subjects other than those covered by State NAEP, such as history and social studies. Only State NAEP, however, provides the data needed to make comparisons among states.
Some tests have been devised specifically for making international comparisons. Two recent examples are the International Association for Evaluation of Educational Achievement Reading Literacy Study (Elley, 1993) and the Third International Math and Science Study (TIMSS; see Mullis et al., 1998). These efforts involve testing representative samples of students from participating countries, including the United States. Test results are used primarily for evaluating the relative standing of U.S. students compared with their counterparts in other countries. These testing programs are similar to NAEP in that scores are not reported for individuals, schools, districts, or states, which means that no real consequences are associated with good or poor performance for anyone involved in taking these tests.
Other large-scale standardized tests have been developed primarily for research purposes. The National Center for Education Statistics (NCES) has carried out a series of longitudinal studies, the most recent being the National Education Longitudinal Study of 1988 (NELS:88). These studies involve measuring student achievement in several subjects. In NELS:88, for example, tests were administered in four subjects to samples of students at grades 8, 10, and 12 (Owings, 1995). In addition, questionnaire data were collected from students, teachers, principals, and parents. The resulting database enables researchers to explore a wide variety of issues, such as relationships between test scores and various student characteristics and experiences. The longitudinal design of these studies allows researchers to track student achievement over time. However, scores are not given to students, parents, or schools.
All of the tests discussed thus far--NAEP, TIMSS, and NELS--are administered to nationally representative samples of students under highly controlled, standardized conditions. But again, the scores on these tests are not used to make decisions about individual students. The question of motivation thus arises, since students may not be motivated to perform their best when there are no personal stakes involved, and, consequently, their scores may not accurately reflect their true capabilities (O'Neil et al., 1997). The tests discussed next are used for selection purposes and therefore do involve high stakes for students.
At the national level, the most notable examples of tests that involve high stakes are the college entrance examinations, which include the Scholastic Assessment Test (SAT) I (general) and II (subject tests) and the American College Testing (ACT) program tests. Students electing to take these tests are generally motivated to perform well because their scores influence their options for post-secondary education. Aggregate scores on these exams are sometimes used to monitor the college-bound population and to make comparisons among states, districts, and schools. However, whereas the samples of students taking NAEP and the international tests described above are representative, the samples of students taking the SAT and ACT exams are neither locally nor nationally representative. Moreover, the degree of selection and the factors governing selection are likely to vary across regions. In some states, for example, nearly all college-bound students take the SAT. In other states, whose higher education systems use the ACT, the SAT-taking population may be largely limited to students planning to attend out-of-state schools. Consequently, SAT and ACT scores are not appropriate for measuring achievement trends or comparing states or districts.
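A small sketch may help show why differences in who takes a test can distort comparisons. It makes a deliberately extreme assumption that is not in the text above: in the low-participation state, only the highest-scoring students take the test. The score values, participation rates, and function name are all hypothetical.

```python
from statistics import mean


def mean_of_test_takers(scores, participation_rate):
    """Mean score when only the top-scoring fraction of students is tested.

    Assumption for illustration only: self-selection is perfectly related
    to achievement, which overstates what happens in practice.
    """
    n_takers = max(1, round(len(scores) * participation_rate))
    takers = sorted(scores, reverse=True)[:n_takers]
    return mean(takers)


# Two hypothetical states with identical underlying score distributions.
population = list(range(400, 601, 20))  # 400, 420, ..., 600

state_a_mean = mean_of_test_takers(population, 1.0)  # nearly everyone tested
state_b_mean = mean_of_test_takers(population, 0.3)  # a selective subset tested
```

Although the two invented states have identical student populations, the state with the more selective group of test-takers reports a higher average, which is why aggregate scores from self-selected samples are a poor basis for comparison.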
There are similar "selection bias" concerns with monitoring scores on the Armed Forces Qualification Test (AFQT), which is taken by men and women interested in enlisting in one of the U.S. military services.
The format of these assessments also varies. Some states use a commercially available, standardized multiple-choice test, whereas others develop their own tests or assessment methods, including portfolios or other open-ended response formats. A test's format often reflects a desire to evaluate the success of a reform effort. The Kentucky Instructional Results Information System (KIRIS), for example, was designed to monitor the effects of the Kentucky Education Reform Act. This objective led KIRIS to emphasize the kinds of open-ended tasks that were presumed to reflect reform-related instruction. The portfolios used in some states and districts also resulted from efforts to align assessment instruments with instruction and to evaluate students in an "authentic" way. Unfortunately, concerns about authenticity have often taken undue precedence over the technical quality of the assessments, resulting in tests that fail to provide valid or useful information (Klein, McCaffrey, Stecher, & Koretz, 1995; Koretz & Barron, 1998; Koretz, Stecher, Klein, & McCaffrey, 1994).
States sometimes change the nature of their tests from one year to the next. Kentucky eliminated multiple-choice items from KIRIS but then reintroduced them after an expert blue-ribbon panel found that such items were needed to ensure content coverage, score reliability, and stability of proficiency standards over time (Hambleton et al., 1995). California is now using a commercial multiple-choice test after pioneering the controversial California Learning Assessment System (CLAS) tests, which included both open-ended and multiple-choice items. Such changes in the measures used often reflect responses to political pressure, changes in curriculum, financial factors, or technical concerns. Whatever the reason, such changes make it difficult--if not impossible--to track achievement from one year to the next. However, as discussed below, when states use the same test for several years in a row, they are likely to encounter problems that threaten or even completely undermine the validity of the scores obtained. This situation led the National Research Council's Committee on Appropriate Test Use (1998) to recommend that a different test be used each year and that these tests have comparable content coverage and be equated to each other.
Because of the expense involved in hiring external test administrators, most states have classroom teachers administer the state tests to their own students. Administration typically involves teachers following a set of clearly specified steps and reading instructions to their students from a "script." These carefully designed procedures are intended to ensure that all students receive the same instructions across classrooms and schools. This standardization (including the amount of time students are given to respond) is essential if the scores of students from different schools and classrooms are to be validly interpreted.
Although it is reasonable to expect that most teachers will carry out this activity competently, test security and standardization may be compromised when teachers administer tests to students in their own schools. How teachers adhere to time limits and respond to student questions are just two of the many factors that have been found to vary across testing sites (see, e.g., Koretz, Mitchell, Barron, & Keith, 1996), and in some cases outright cheating has been reported (Stecklow, 1997). This variability in administration conditions compromises score comparability, and the resulting lack of public trust in the scores defeats the purpose of the assessment. Serious problems can also occur when test administration involves setting up or demonstrating equipment, or when students are given complex instructions. Another threat to consistency in administration conditions occurs when products (such as book reports) that are created in class or at home are used as part of the testing program. For example, in creating such products, some students may receive more teacher or parent assistance than others do, rendering comparisons among students inappropriate (Gearhart, Herman, Baker, & Whittaker, 1994).
We analyzed test data from several districts and states that have large-scale testing programs. We also gave other tests at these sites, using RAND-trained administrators, so that we could compare test scores under two quite different administration conditions. The results of this field experiment revealed that scores on the externally administered tests consistently showed the expected relationships among themselves and with various student characteristics, such as socioeconomic status. This was true regardless of format (e.g., multiple choice or open response). In contrast, several of the locally administered tests showed markedly different relationships with student characteristics and unexpectedly low correlations with the external measures. [1] This evidence suggests that the administration of the local assessments was compromised and that scores on statewide or district tests may not be trustworthy (see also Cannell, 1987 and 1989). The result of such breaches in test security and administration conditions is that student scores often reflect no more than how well those students can answer the particular questions asked on the test. That is, the scores cannot be used to make generalizations about how students are doing in the subject matter area (e.g., mathematics) being assessed (Mehrens & Kaminski, 1989).
One concern often voiced about the use of standardized tests is the limited content coverage of any single test (Jones, 1997; Stake, 1995). Because of the constraint of limited testing time, it is difficult to sample adequately from each content area, even when tests are designed to balance item content across a specified set of topics. NAEP solves this problem through matrix sampling; i.e., it administers a large number of items from each content area but requires any given student to answer only a small portion of the total set of items. This strategy permits more accurate inferences about typical student achievement than can be obtained when the same set of questions is administered to every student.
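The idea behind matrix sampling can be sketched as follows. This is a simplified illustration rather than NAEP's actual design: the pool size, the number of blocks, and the rotation rule are assumptions made for the example.

```python
def assign_blocks(item_ids, n_blocks, student_ids):
    """Split the item pool into blocks and rotate the blocks across students.

    Each student answers one block only, but the group as a whole
    answers every item in the pool.
    """
    blocks = [item_ids[i::n_blocks] for i in range(n_blocks)]
    return {
        student: blocks[k % n_blocks]
        for k, student in enumerate(student_ids)
    }


# Hypothetical pool of 60 items, 6 blocks, and 12 students.
items = list(range(60))
students = [f"s{k}" for k in range(12)]

assignment = assign_blocks(items, 6, students)
items_per_student = len(assignment["s0"])
items_covered = set().union(*assignment.values())
```

In this example each student answers 10 items, yet all 60 items are administered somewhere in the group. The cost of this breadth is that no single student takes enough items to support an individual score on the full pool, which is consistent with the earlier point that such programs do not report scores for individuals.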
The multiple-choice format also may limit the kinds of skills that can be assessed. Many states have supplemented or replaced multiple-choice items with open-ended items, such as essays, hands-on experiments, or portfolios. Although combining several formats can enhance coverage, the addition of open-ended items causes other problems. They take longer to administer than do multiple-choice items, so fewer questions can be asked in a given amount of testing time. Furthermore, they do not necessarily measure the kinds of skills that their developers intended or that some advocates claim they do (Hamilton, Nussbaum, & Snow, 1997). And the substantial cost of scoring such items also introduces trade-offs between finances and test reliability (Stecher & Klein, 1997; Wainer & Thissen, 1993). The result is that attempts to increase breadth by using multiple formats are often made at some expense and with little assurance that the measurement goals have been met.
Some states feel they provide a form of motivation when they create high stakes for schools and teachers by tying financial rewards and sanctions to student performance. It is often argued that such "accountability" systems ensure that all students receive high-quality instruction, and there is some evidence that teachers do alter their practices in desirable ways in response to external tests (Koretz, Stecher, Klein, & McCaffrey, 1994; Stecher, Barron, Kaganoff, & Goodwin, 1998). But in many cases, these systems lead only to artificial inflation of scores with no real evidence of improved learning (Koretz & Barron, 1998).
Whether stakes are attached directly to students' own scores varies from state to state. Some states make graduation, placement in an advanced track, or promotion to the next grade contingent upon a student's achieving a particular score, whereas other states attach virtually no consequences to student scores. Differences of these types contribute to variation in the degree to which students are likely to be motivated to perform well, which has been shown to affect scores. Wainer (1993) found that when students perceive tests as having no direct consequences, they may not put forth as much effort as they would when the stakes are higher. Wolf and Smith (1995) found that the performance of college students was roughly one-quarter of a standard deviation higher when a test counted for their grade than when it did not. O'Neil et al. (1997) also found that paying eighth graders for correct answers on a NAEP examination resulted in higher scores (but this did not happen with twelfth graders). Interpretations of most achievement test results rely on the assumption that students have given their best efforts. The results from these studies illustrate that this assumption may not always be correct (see also Brown & Walberg, 1993).
Policymakers and the education community have recently considered two major approaches to providing a common, national measure of achievement. One strategy is to allow states to continue administering their own tests, and then to link the results of those tests to NAEP. The second approach, advocated by President Clinton in his 1997 State of the Union address, is to create a national test. We discuss both of these options.
As discussed above, the significant differences among state testing programs preclude making any direct comparisons of results across states. However, NAEP scores provide a potentially useful benchmark for putting the results of different tests on a common scale. This approach is analogous to measuring the value of the currency of different countries by comparing the worth of each country's currency to that of the U.S. dollar. In short, the goal is to transform a student's score on the state test into a score on a NAEP-based scale so that parents, schools, and teachers will be able to see how well their students are doing relative to a national set of standards. In a recent survey of state assessment and curriculum directors and other users of NAEP results, participants expressed support for this type of linking (Levine, Rathbun, Selden, & Davis, 1998).
Linking is often used in other testing situations. Each time the SAT is administered, for example, new test forms are created and the resulting scores are placed on a scale that allows them to be compared with scores obtained on previous administrations. The process of rendering alternate forms of the same test comparable so that their scores can be used interchangeably is typically called equating, but can be thought of more generally as a special case of linking. Because equating requires that the tests measure the same constructs in an equally reliable way, the linking of a state test to NAEP would not be considered true equating. Linking can still be done, however, and can, at least in theory, yield useful results, albeit weaker ones than are obtained through a strict equating procedure.
Several studies have been conducted to investigate the feasibility of linking state assessment results to NAEP. The results of Ercikan (1997) and of Linn and Kiplinger (1995) point to the instability of the link between state tests and NAEP: The NAEP scores derived from the linking differ substantially from actual NAEP scores, particularly at the extreme ends of the score distribution. Waltman (1997) found that the use of linking functions to classify students into performance categories tends to result in low levels of agreement between the state test and NAEP. Problems have also been encountered in attempts to link international assessments to NAEP (see, e.g., Beaton & Gonzales, 1993; Pashley & Phillips, 1993).
Most recently, the National Research Council convened a panel of experts to investigate the feasibility of linking. The panel concluded that linking is not a viable option because of differences in content, format, difficulty, measurement precision, and administration conditions (particularly test uses and consequences) across states (National Research Council, Committee on Equivalency and Linkage of Educational Tests, 1998).
Plausible explanations exist for why efforts to link tests have not been successful. We discuss several of these explanations here, including differences in content and format, variations in representativeness of student samples, variations in administration conditions and student motivation, reuse of test forms at the state level, and reporting time. Most of these are also discussed in the report prepared by the National Research Council's Committee on Equivalency and Linkage of Educational Tests (1998). [2]
One of the most obvious problems with linking is that the majority of state tests differ from NAEP in content, format, or both. A study by Bond and Jaeger (1993), in which content experts classified items on three standardized tests into content categories, revealed wide disparities in the balance of items across categories in NAEP versus the standardized tests. However, a study by Linn and Kiplinger (1995) revealed that even when the NAEP scale used is aligned with the content of a state test, the resulting linking errors are of the same magnitude as those resulting when content is not considered. Thus, "alignment" alone does not explain why linking failed. Before defensible links can be constructed between NAEP and existing state tests, more research will be needed to understand how content and format differences affect linking.
The representativeness of student samples is another factor that may affect the quality of linking. For NAEP, schools are sampled to be roughly representative of the state's population of students, but this sample may not represent the population of students who actually take the state test. For example, states differ in their policies regarding the inclusion in the testing program of students who have limited English proficiency. Schools also differ in the efforts they make (such as offering make-up testing) to ensure full participation among their students. The especially high absentee rates at some schools on the day the test is administered raise concerns about the validity of school-level data.
Differences in administration conditions and student motivation may also affect linking. Bloxom, Pashley, Nicewander, and Yan (1995), for example, linked NAEP scores to scores on the Armed Services Vocational Aptitude Battery (ASVAB). Examination of the score distributions on the two tests suggested that examinees were more motivated to perform well on the ASVAB than on NAEP, which is consistent with the fact that only the ASVAB scores had consequences for these examinees. It is likely that similar problems would be encountered when scores on other high-stakes tests are linked to NAEP scores.
State results are influenced by the number of times a particular test form has been used. Unlike NAEP, many states administer the same form of a test for several years in a row. As discussed above, Linn, Graue, and Sanders (1990) showed that scores increase as a form is reused, particularly during the first few years. Thus, repeated use of some state tests is another potential source of error.
Finally, an additional administrative concern with the plan to link tests is the inevitable significant lag between test administration and reporting. Test papers must be collected, assembled, and prepared for scoring; scores must be assigned, verified, and converted to a usable data format; and reports must be generated. Because of the scope and complexity of NAEP, its results can take over a year to be released, and it is not given every year. Consequently, the linking of state results to NAEP would not occur until well after state results have been released, which could render the linking almost meaningless, at least from the public's perspective.
Given the diversity and flux among state testing programs, some educators and policymakers have argued that what is needed is a single, common test administered across states. As noted earlier, President Clinton proposed a national testing program that would produce individual student scores for every fourth grader in reading and for every eighth grader in math. The modifier "voluntary" was added to the name of this testing program after debate over the proposal began. The Department of Education has called Clinton's plan the Voluntary National Test (VNT) because states and districts would be allowed to choose whether to participate--federal law would not require participation. However, a state or district that chose to participate could require that all schools, teachers, and students in the specified grades participate.
Unlike state tests, many of which are chosen or developed to reflect state-developed standards, items on the VNT would be designed to correspond closely with NAEP. The goal of the VNT is to "provide students, parents, and teachers with meaningful scores to compare individual student performance to widely accepted national and international standards and to identify students and schools that need extra help" (U.S. Department of Education and National Science Foundation, 1998).
One of the most common criticisms of the VNT is that it places too many decisions about what to measure, and consequently what to teach, in the hands of the federal government. Advocates of the VNT argue that it would be aligned with the NAEP content frameworks, which presumably reflect broad consensus about what students should know and be able to do. However, state and local education agencies continue to set their own objectives, and the importance of local control over standards and assessments has been emphasized in several recent reports, including some by groups that have promoted national standards (e.g., National Council on Education Standards and Testing, 1992; National Research Council and National Council of Teachers of Mathematics, 1997). Of course, a national test need not preclude the use of state assessments as well, but critics fear that it could lead to narrowing of instruction and an overemphasis on preparing students for one test.
Even if agreement could be reached on what topics to assess, a single 90-minute test would probably not allow sufficient content coverage, especially if a significant portion of the testing time is devoted to a small number of open-ended items. Unlike NAEP, which uses matrix sampling and therefore can administer many more items than a student could reasonably be asked to take in a classroom period, the proposed VNT would be a single test given to all students. Hence, its coverage could not be as comprehensive as that of NAEP. Important topics would have to be omitted. Moreover, the VNT plan of testing only fourth graders in reading and eighth graders in math provides a very limited snapshot of student learning and neglects many important subjects and grade levels.
Another important issue raised by the VNT is test security. To provide valid measurement of student achievement, test forms must be kept secure. This is critically important given the consistent findings that high-stakes testing often leads to inappropriate coaching on the specific test items, resulting in scores that do not accurately reflect student proficiency (see, e.g., Koretz, Linn, Dunbar, & Shepard, 1991). Because a single form of the VNT would be given at different times across the nation and because the tests would be administered by local school personnel, there is a real risk that actual VNT items would be used in instruction or that other breaches of security (such as extensions of time limits and teacher assistance during the exam) would occur.
Finally, the costs associated with this type of large-scale testing are formidable. The cost of administering a commercially available, standardized multiple-choice test is typically between $4 and $6 per student in 1999 dollars. This figure includes the test booklet and answer sheets. There are additional charges for scoring and the reporting of results. The proposed VNT would also include open-ended items, which are likely to cost an additional $3 to $8 per pupil to score, primarily because scoring would have to be done by trained graders. An analysis of costs must also consider the time that teachers spend administering the test and the opportunity costs of having students spend classroom time taking it. Of course, any well-designed assessment program will have costs associated with it, but it is important to recognize these costs and then weigh them against whatever benefits are likely to accrue from the testing program.
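As a rough illustration of how these per-pupil figures combine, the sketch below adds the two ranges quoted above. The enrollment figure is hypothetical, and the totals omit the scoring and reporting charges for multiple-choice items, teacher time, and opportunity costs, because the text gives no dollar amounts for them.

```python
# Per-pupil cost ranges quoted in the text (1999 dollars).
MATERIALS_RANGE = (4.0, 6.0)   # test booklet and answer sheets
OPEN_ENDED_RANGE = (3.0, 8.0)  # scoring of open-ended items by trained graders


def per_pupil_range():
    """Return (low, high) per-pupil cost: materials plus open-ended scoring."""
    low = MATERIALS_RANGE[0] + OPEN_ENDED_RANGE[0]
    high = MATERIALS_RANGE[1] + OPEN_ENDED_RANGE[1]
    return low, high


def total_range(n_students):
    """Scale the per-pupil range to a tested population of n_students."""
    low, high = per_pupil_range()
    return low * n_students, high * n_students


if __name__ == "__main__":
    hypothetical_enrollment = 100_000  # assumption, not a figure from the text
    print(per_pupil_range())
    print(total_range(hypothetical_enrollment))
```

Under these assumptions the quantified costs alone come to between $7 and $14 per pupil, so the totals should be read as a lower bound on the full cost of the program.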
Computers are also becoming central in the administration of many large-scale testing programs. For example, several of the tests produced by the Educational Testing Service (ETS) and the American College Testing (ACT) program are administered via computer. The Graduate Record Examination (GRE) is probably the most familiar example. This new administration format involves more than simply displaying the paper-and-pencil versions on a computer screen. Tests such as the GRE use a computerized adaptive testing (CAT) technology in which the presentation of an item and the decision concerning when to stop testing depend on the examinee's performance on earlier items (Bunderson, Inouye, & Olsen, 1989).
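The adaptive logic described above, in which each item is chosen in light of earlier responses and testing stops once the score is precise enough, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm used by the GRE or any operational program: it assumes a one-parameter (Rasch) response model, a standard normal prior on ability, a "closest difficulty" selection rule, and a stopping rule based on a hypothetical standard-error target.

```python
import math
import random

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4.0 to 4.0


def p_correct(theta, b):
    """Rasch model: probability of a correct answer given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate(responses):
    """Posterior mean and SD of ability under a standard normal prior.

    responses is a list of (difficulty, correct) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.4, max_items=30):
    """Administer items adaptively; return (ability, standard error, items used).

    bank is a list of item difficulties; answer(b) returns True or False.
    """
    remaining = list(bank)
    responses = []
    theta, se = estimate(responses)
    while remaining and len(responses) < max_items and se > se_target:
        # Select the unused item whose difficulty is closest to the current estimate.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    rng = random.Random(0)
    bank = [rng.uniform(-3, 3) for _ in range(200)]  # hypothetical item bank
    true_theta = 1.0                                 # hypothetical examinee
    print(run_cat(bank, lambda b: rng.random() < p_correct(true_theta, b)))
```

Because each item is targeted near the examinee's current estimate, the test can stop as soon as the standard error reaches the target instead of running through a fixed-length form.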
Numerous studies comparing CATs with traditional paper-and-pencil tests in the same subjects have found that the two methods rank-order students in about the same way (see Mead & Drasgow, 1993, for a review), but that the computerized approach does so in far less testing time. Bunderson, Inouye, and Olsen (1989) demonstrated that the number of test items needed for a given level of precision can often be reduced by more than half with the computerized approach, thereby freeing up testing time for other educational activities.
These technological trends and the advantages offered by some forms of electronic testing led us to explore the feasibility of adopting a computerized system that could achieve the objectives sought after in pursuing a national test of achievement while avoiding the drawbacks of the linking and VNT strategies. There are, of course, other approaches that could be considered, but our evaluation of the alternatives suggests that Web-based computer adaptive testing is the most promising. Specifically, we propose that large-scale testing involve multiple-choice items presented in an adaptive mode, supplemented by open-ended and other types of items, all of which would be administered in schools via the Internet.
Currently, most CATs are administered on a stand-alone personal computer. To be used on a large scale, such as for a district, state, or national testing program, the system would have to be delivered into all U.S. schools in an efficient, secure, and cost-effective manner. A Web-based delivery system may be the best way to do this. Such a system could be upgraded and maintained at a central location without requiring complicated installation or modification at local school sites. In addition, data could be stored centrally, facilitating the development of systemwide norms and score summaries.
Another potential benefit is improved test security. Because each student within a classroom takes a different test (i.e., one tailored to his or her proficiency level), there is little risk of students being exposed to test items in advance or of teachers coaching their students on specific items. CATs are particularly useful for measuring growth over time. Students take different items on different occasions, so scores are not affected by practice. However, because all items are calibrated to the same scale, growth can be measured in a straightforward way.
CATs also offer efficient and inexpensive scoring. Scoring is done on-line, eliminating the need for packaging, shipping, and processing of answer sheets. Students could be given their results immediately after completing the tests. An Internet-based system would allow all records to be stored automatically at a central location, facilitating the production of score summaries. Teachers would have results in time to do something about students who are not progressing at the expected rate, and could incorporate this information into their instruction. Teachers could also use results for assigning grades, so that students would be motivated to do well.
Computerized assessments are especially appropriate for evaluating student progress in areas where computers are used frequently, such as writing. Russell and Haney (1997), for example, found that students accustomed to using computers in their classes performed better on a writing test when they could use computers rather than paper and pencil. Students using the computers wrote more and organized their compositions better. More generally, the computer can accommodate many more item types than a paper-and-pencil test can, such as tasks that involve using a mouse to move objects around on the screen or to draw graphics. Moreover, as instruction comes to depend more heavily on technology, assessment will need to follow suit in order to be appropriately aligned with curriculum.
An additional limitation to most existing CAT systems is that they rely solely on multiple-choice items. However, as discussed above, an Internet-based testing program could certainly incorporate other kinds of items, and improvements in technology for administering and scoring open-ended items will increasingly make such items cost-effective.
Unfortunately, these expenditures of human and financial resources often fail to provide the kinds of credible data needed to draw appropriate conclusions about individual student progress or the efficacy of educational programs and reforms. This failure occurs in part because the stakes tied to most statewide and district tests are usually for teachers, principals, and other district staff, rather than for students. For example, students' scores generally do not affect their grades, promotion, or graduation. Consequently, test-takers may not be motivated to do their best. Moreover, teachers and other school personnel may feel pressured to ensure their students receive high scores so as, for example, to avoid being responsible for students not being promoted to the next grade level. Pressures such as these have apparently contributed to widespread breaches in both test security and the standardization of test administration conditions, which together have inflated scores and undermined their validity. Scores that may indicate how well students can answer questions about a passage they have read before, rather than how well they can read, are not credible data.
There are significant differences in goals, standards, curriculum, and instructional practices across teachers, schools, districts, and states. Thus, a single test cannot be congruent ("aligned") with all of these differences. And there are other problems with these tests: The format of most of them limits the range of relevant skills and abilities that can be measured, there is often a long delay between when the test is given and when the results are provided to schools, and students experience frustration when asked questions that are much too easy or too difficult for their particular ability level.
These considerations and other factors led us to propose that most paper-and-pencil tests be replaced with computerized adaptive tests (CATs) administered over the Internet. The major advantages of a CAT system include the following:
[2] The NRC report was published a few months after the initial draft of this paper was published.
American Federation of Teachers (1997). Making Standards Matter 1997: An Annual Fifty-State Report on Efforts to Raise Academic Standards. Washington, DC: Author.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for Educational and Psychological Testing. Washington, DC: Author.
Angoff, W. H. (1971). "Scales, Norms, and Equivalent Scores." In R. L. Thorndike (ed.), Educational Measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Beaton, A. E., & Gonzalez, E. J. (1993). Comparing the NAEP Trial State Assessment Results with the IAEP International Results. Report prepared for the National Academy of Education Panel on the NAEP Trial State Assessment. Stanford, CA: National Academy of Education.
Bennett, R. E. (1998). Speculations on the Future of Large-Scale Educational Testing. Princeton, NJ: Educational Testing Service.
Bloxom, B., Pashley, P. J., Nicewander, W. A., & Yan, D. (1995). "Linking to a Large-Scale Assessment: An Empirical Evaluation." Journal of Educational and Behavioral Statistics, 20, 1-26.
Bond, L., & Jaeger, R. M. (1993). Judged Congruence Between Various State Assessment Tests in Mathematics and the 1990 National Assessment of Educational Progress Item Pool for Grade 8 Mathematics. Report prepared for the National Academy of Education Panel on the NAEP Trial State Assessment. Stanford, CA: National Academy of Education.
Brown, S. M., & Walberg, H. J. (1993). "Motivational Effects on Test Scores of Elementary Students." Journal of Educational Research, 86, 133-136.
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). "The Four Generations of Computerized Educational Measurement." In R. L. Linn (ed.), Educational Measurement (3rd ed., pp. 367-407). New York: Macmillan.
Cannell, J. J. (1987). Nationally Normed Elementary Achievement Testing in America's Public Schools: How All Fifty States Are Above the National Average. Daniels, WV: Friends for Education.
Cannell, J. J. (1989). How Public Educators Cheat on Standardized Achievement Tests. Albuquerque, NM: Friends for Education.
Clinton, W. J. (1997). State of the Union Address.
Collis, B. A., Knezek, G. A., Lai, K., Miyashita, K. T., Pelgrum, W. J., Plomp, T., & Sakamoto, T. (1996). Children and Computers in School. Mahwah, NJ: Erlbaum.
Education Week (January 1999). "Quality Counts '99."
Elley, W. B. (1993). International Report: The IEA Study of Literature: Achievement and Instruction in Thirty-Two School Systems. Oxford: Pergamon.
Ercikan, K. (1997). "Linking Statewide Tests to the National Assessment of Educational Progress: Accuracy of Combining Test Results Across States." Applied Measurement in Education, 10, 145-159.
Gearhart, M., Herman, J. L., Baker, E. L., & Whittaker, A. K. (1994). Whose Work Is It? A Question for the Validity of Large-Scale Portfolio Assessment. CSE Technical Report 363. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Hambleton, R. K., Jaeger, R. M., Koretz, D., Linn, R. L., Millman, J., & Phillips, S. E. (1995, June). Review of the Measurement Quality of the Kentucky Instructional Results Information System, 1991-1994. Report prepared for the Office of Educational Accountability, Kentucky General Assembly.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). "Interview Procedures for Validating Science Assessments." Applied Measurement in Education, 10, 181-200.
Jones, L. (1997). "National Tests and Education Reform: Are They Compatible?" ETS William H. Angoff Memorial Lecture. Princeton, NJ: Educational Testing Service.
Klein, S. P., McCaffrey, D. M., Stecher, B., & Koretz, D. (1995). "The Reliability of Mathematics Portfolio Scores: Lessons from the Vermont Experience." Applied Measurement in Education, 8, 243-260.
Kolen, M. J., & Brennan, R. L. (1995). Test Equating. New York: Springer.
Koretz, D., & Barron, S. I. (1998). The Validity of Gains on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, CA: RAND.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). "The Effects of High-Stakes Testing: Preliminary Evidence About Generalization Across Tests." In R. L. Linn (chair), The Effects of High Stakes Testing. Symposium presented at the annual meetings of the American Educational Research Association and the National Council on Measurement in Education, Chicago, April.
Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996). Final Report: Perceived Effects of the Maryland School Performance Assessment Program. CSE Technical Report 409. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). "The Vermont Portfolio Assessment Program: Findings and Implications." Educational Measurement: Issues and Practice, 13 (3), 5-16.
Levine, R., Rathbun, A., Selden, R., & Davis, A. (1998). NAEP's Constituents: What Do They Want? Report of the NAEP Constituents' Survey and Focus. NCES 98-521. Washington, DC: U.S. Department of Education.
Linn, R. L. (1993). "Linking Results of Distinct Assessments." Applied Measurement in Education, 6, 83-102.
Linn, R. L., Graue, M. E., & Sanders, N. M. (1990). "Comparing State and District Test Results to National Norms: The Validity of Claims That 'Everyone Is Above Average.'" Educational Measurement: Issues and Practice, 9, 5-14.
Linn, R. L., & Kiplinger, V. L. (1995). "Linking Statewide Tests to the National Assessment of Educational Progress: Stability of Results." Applied Measurement in Education, 8, 135-155.
Mead, A. D., & Drasgow, F. (1993). "Equivalence of Computerized and Paper-and-Pencil Cognitive Ability Tests: A Meta-analysis." Psychological Bulletin, 114, 449-458.
Mehrens, W. A., & Kaminski, J. (1989). "Methods for Improving Standardized Test Scores: Fruitful, Fruitless, or Fraudulent?" Educational Measurement: Issues and Practice, 8 (1), 14-22.
Mullis, I.V.S., Martin, M. O., Beaton, A. E., Gonzalez, E. J., Kelly, D. L., & Smith, T. A. (1998). Mathematics and Science Achievement in the Final Year of Secondary School: IEA's Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: TIMSS International Study Center.
National Center for Education Statistics (1998). Internet Access in Public Schools. Washington, DC: U.S. Department of Education.
National Council on Education Standards and Testing (1992). Raising Standards for American Education. Washington, DC: U.S. Government Printing Office.
National Research Council, Committee on Appropriate Test Use (1998). High Stakes: Tests for Tracking, Promotion, and Graduation. Washington, DC: National Academy Press.
National Research Council, Committee on Equivalency and Linkage of Educational Tests (1998). Uncommon Measures: Equivalence and Linkage Among Educational Tests. Washington, DC: National Academy Press.
National Research Council and National Council of Teachers of Mathematics (1997). Improving Student Learning in Mathematics and Science: The Role of National Standards in State Policy. Washington, DC: National Academy Press.
O'Neil, H. F., Sugrue, B., Abedi, J., Baker, E. L., & Golan, S. (1997). Final Report on Experimental Studies of Motivation and NAEP Test Performance. CSE Technical Report 427. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Owings, J. (1995). Psychometric Report for the NELS:88 Base Year Through Second Follow-Up. NCES 95382. Washington, DC: National Center for Education Statistics.
Pashley, P. J., & Phillips, G. W. (1993). Toward World-Class Standards: A Research Study Linking International and National Assessments. Princeton, NJ: Educational Testing Service.
Russell, M., & Haney, W. (1997). "Testing Writing on Computers: An Experiment Comparing Student Performance on Tests Conducted via Computer and via Paper-and-Pencil." Education Policy Analysis Archives, 5 (3); available at http://olam.ed.asu.edu/epaa/.
Smith, M. L. (1991). "Meanings of Test Preparation." American Educational Research Journal, 28, 521-542.
Smith, M. S., Stevenson, D. L., & Li, C. P. (1998). "Voluntary National Tests Would Improve Education." Educational Leadership, 55 (6), 42-44.
Stake, R. E. (1995). "The Invalidity of Standardized Testing for Measuring Mathematics Achievement." In Thomas A. Romberg (ed.), Reform in School Mathematics and Authentic Assessment (pp. 173-235). Albany, NY: SUNY Press.
Stecher, B. M., Barron, S., Kaganoff, T., & Goodwin, J. (1998). The Effects of Standards-Based Assessment on Classroom Practices: Results of the 1996-97 RAND Survey of Kentucky Teachers of Mathematics and Writing. CSE Technical Report 482. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Stecher, B. M., & Klein, S. P. (1997). "The Cost of Science Performance Assessments in Large-Scale Testing Programs." Educational Evaluation and Policy Analysis, 19, 1-14.
Stecklow, S. (1997, September 2). "Kentucky's Teachers Get Bonuses, but Some Are Caught Cheating." The Wall Street Journal, pp. A1, A5.
U.S. Department of Education (1997). The NAEP Guide: A Description of the Content and Methods of the 1997 and 1998 Assessments. NCES 97990. Washington, DC: National Center for Education Statistics.
U.S. Department of Education and National Science Foundation (1998). Action Strategy for Improving Achievement in Mathematics and Science.
Wainer, H. (1993). "Measurement Problems." Journal of Educational Measurement, 30, 1-21.
Wainer, H., & Thissen, D. (1993). "Combining Multiple-Choice and Constructed-Response Test Scores: Toward a Marxist Theory of Test Construction." Applied Measurement in Education, 6, 103-118.
Waltman, K. K. (1997). "Using Performance Standards to Link Statewide Achievement Results to NAEP." Journal of Educational Measurement, 34, 101-121.
Wenglinsky, H. (1998). Does It Compute? The Relationship Between Educational Technology and Student Achievement in Mathematics. ETS Policy Information Report. Princeton, NJ: Educational Testing Service.
Wolf, L. F., & Smith, J. K. (1995). "The Consequence of Consequence: Motivation, Anxiety, and Test Performance." Applied Measurement in Education, 8, 227-242.
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage or retrieval) without permission in writing from RAND.
Published April 1999 by RAND
RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND® is a registered trademark. RAND's publications do not necessarily reflect the opinions or policies of its research sponsors.