Collection of Innovative, High-Quality Data
Compared to the natural sciences, data used in the social sciences are usually of considerably lower quality. Survey data suffer from a number of well-known drawbacks, including imperfect samples and random or systematic errors in responses. In some instances, survey data can be combined with census or administrative data, but in many cases, these data are far from flawless as well. In addition, the collection of data is often a costly and tedious affair, requiring extensive planning and possibly taking years between the first design of a survey and the actual availability of the final data file. Researchers are often forced to use existing secondary data that may be less than ideal for the research question at hand.
Many of the most pressing scientific and policy-relevant demographic research issues cannot be studied adequately with existing data. Particularly for multidisciplinary research, one often needs the collection of data pertaining to very different life domains in one survey. Since most surveys tend to concentrate on only one or two domains, datasets suited for multidisciplinary studies are relatively rare. (There are obvious exceptions, with the PSID and HRS being prominent examples, along with many of the RAND surveys discussed below.) Furthermore, collecting survey data is costly, which limits the speed of innovation in data collection.
Progress in improving data quality and adequacy requires efforts on many fronts, including techniques to reduce refusal rates and increase rates of retention in longitudinal surveys, the reduction of item non-response or systematic biases, the use of new technology and the combination of information from different sources, and the collection of comparable data for different populations.
Contributions by PRC Staff
One hallmark of the RAND PRC has been the collection of innovative, high-quality micro-data in the United States and overseas. Additionally, several staff members have worked on various aspects of data quality and new technology.
- New Immigrant Survey
- Family Life Surveys
- Los Angeles Family and Neighborhood Survey
- Methods to Improve Data Quality
- New Technology
New Immigrant Survey
The New Immigrant Survey (NIS) is a collaborative effort of PRC researcher Smith, along with Guillermina Jasso at New York University, Doug Massey at Princeton University, and Mark Rosenzweig at Harvard University. The NIS will, for the first time, carry out a comprehensive multi-cohort longitudinal survey of new legal immigrants to the United States based on nationally representative samples of the administrative records compiled by the U.S. Immigration and Naturalization Service pertaining to immigrants newly admitted to permanent residence. The questionnaire will ascertain prospective and retrospective information on pre-immigration education, work, health, migration, marriage and fertility histories for newly arrived immigrants.
To assess the impact on the next generation, information will also be obtained about and from the children of the sampled immigrants, both the immigrant children they brought with them and the U.S. citizen children born to them in the United States. The NIS builds upon a pilot project (NIS-P) implemented in 1996, which evaluated the cost and feasibility of fielding the full survey. The pilot data have already yielded important scientific findings and demonstrated the feasibility of the full NIS (Jasso et al., 2000a). Fieldwork for the first round of the NIS began in 2003 and was completed in June 2004. The data are now being prepared for public release. There will be a biannual follow-up of the initial cohort of 11,000 immigrants for three additional years and then a two-year interview periodicity thereafter. The survey design anticipates the introduction of new cohorts over time.
Family Life Surveys
Beginning with the first Malaysian Family Life Survey (MFLS1, 1976–1977), RAND designed, fielded, and analyzed a series of multipurpose household surveys in developing countries, including a follow-up survey in Malaysia (MFLS2, 1988–1989), and new surveys in Indonesia (IFLS1, IFLS2, IFLS2+ and IFLS3 in 1993, 1997, 1998, and 2000, respectively), Guatemala (1995), and Bangladesh (1996). The resulting rich databases are unique in the context of developing countries, with detailed current and retrospective information at the individual, household, and community level on a range of demographic, social, health and economic topics not found in most other data sources. RAND PRC staff have assessed the quality of the retrospective data in these types of surveys and demonstrated both the strengths and limitations of these data (Beckett et al., 2000). By way of illustration, we provide somewhat more detailed information about the Indonesian Family Life Surveys (IFLS).
The first wave of the IFLS, fielded in 1993, was a multi-purpose survey of approximately 7,200 households in 13 provinces, representing more than 85 percent of the Indonesian population. For selected individuals in sampled households, detailed contemporaneous and retrospective information was collected on a wide array of family life topics, including health, health care utilization, health insurance, pregnancy and prenatal care, contraception, schooling, labor force participation, earnings, migration, living arrangements, social support networks, transfers, wealth, and consumption. The second survey wave, conducted in 1997, interviewed about 7,600 households, including IFLS1 households that moved to new locations, and new households formed when IFLS1 respondents moved out of their original household. Just under 95 percent of original IFLS1 households were successfully located and interviewed. New features of the household survey included physical health assessments conducted by a nurse, who measured each respondent's height, weight, blood pressure, pulse, lung capacity, and hemoglobin level and timed respondents as they rose from a sitting to a standing position. Respondents between the ages of seven and twenty-four were asked to complete tests of their cognitive skills in language and mathematics.
As IFLS2 was coming out of the field, Indonesia was in the midst of a major economic crisis. To provide data of use to both policymakers and researchers, the IFLS team raised resources to field a resurvey of a 25 percent subsample of the IFLS (IFLS2+) in late 1998. To minimize biases from attrition, overall follow-up and tracking of movers were taken very seriously, with the result that we were able to contact 98 percent of targeted IFLS2 households and 96 percent of targeted IFLS2 individuals. The breadth of the original instruments was retained so that data are available on a large array of behaviors and outcomes. Finally, fieldwork for IFLS3 was conducted in 2000, using a questionnaire similar to the one used in the earlier waves. Again, as the result of extensive tracking of migrants, 95 percent of IFLS1, IFLS2 and IFLS2+ households were reinterviewed. In total, IFLS3 included interviews with 10,440 households, including 125 (38 percent) of the IFLS1 households that could not be found or that refused an interview in 1997 and 1998. Community- and facility-level data collected as part of the IFLS break new ground in the depth and breadth of contextual data.
As part of the first IFLS in 1993 (IFLS1), extensive information was collected from community leaders and through visits to the schools and health facilities available to community members for each of the 321 communities in which we interviewed households. In follow-up interviews with sampled individuals and households in 1997 (IFLS2), 1998 (IFLS2+), and 2000 (IFLS3), data were again collected at the community level, including interviews with many of the original facilities, as well as samples of new facilities. Panel data on communities and facilities are particularly important in Indonesia, where the government has made substantial investments in developing and upgrading infrastructure. Various PRC researchers have been involved in one or more waves of IFLS. PRC affiliate researcher Strauss, who was PI on IFLS3, is leading IFLS4, which finishes fieldwork in May 2008, with public release of the data planned for spring 2009.
Los Angeles Family and Neighborhood Survey
The Los Angeles Family and Neighborhood Survey (L.A.FANS) is designed to answer key research and policy questions about family, peer, and neighborhood effects on child and youth development, the effects of welfare reform at the neighborhood level, and residential mobility and neighborhood change.
Los Angeles County, a diverse, geographically dispersed urban area, is viewed as a bellwether for future multi-ethnic megacities in the United States and overseas. L.A.FANS offers several features that are rare among longitudinal household surveys in the United States. First, L.A.FANS samples a sufficient number of families per neighborhood (50) and a sufficient number of neighborhoods (65) to permit analyses of family and neighborhood effects on child and family well-being. Second, after the first wave of data collection, L.A.FANS will reinterview sample members who remain in the neighborhood, continue to follow all adults and children in the sample even if they move out of the neighborhood, and interview a sample of new neighborhood entrants. Thus, L.A.FANS will combine the features of a panel study of children and families at the neighborhood level with a repeated cross-sectional sample of each sampled neighborhood, thereby permitting analyses of neighborhood dynamics and selective migration in and out of neighborhoods.
Third, at each wave, in addition to the extensive family- and individual-level data that will be collected, L.A.FANS will gather extensive community-level information through systematic observations by interviewers of the neighborhood physical and social environment as well as through administrative data from public- and private-sector sources. Although L.A.FANS is well designed to study children, the survey also provides a unique opportunity to analyze important demographic and socio-economic behaviors and outcomes among adults and the elderly, with a particular focus on the effects of neighborhood social and physical environments. Because of its unique design, L.A.FANS was cited in a National Academy of Sciences report by Singer and Ryff (2001:184) as an example of the type of high-priority study for testing emerging hypotheses about the social determinants of health. The first wave of L.A.FANS was fielded in 2000–2001, while the second wave will be fielded in 2005–2006. RAND adjunct staff member Pebley and PRC researcher Sastry lead the L.A.FANS project.
Grants to support the design and fieldwork for Wave 2 of L.A.FANS have been awarded by NICHD, NIA, and NIEHS. The NICHD grant covers tracking and reinterviewing of adult and child respondents from Wave 1, interviews with a sample of new respondents who moved into the sampled neighborhoods between Wave 1 and Wave 2, and neighborhood observations. The NIA grant covers the collection of extensive biomarker information and an expanded set of self-reported health measures on the adult panel respondents. The NIEHS grant covers the collection of biomarker data on stress and health for all children in the L.A.FANS sample.
Methods to Improve Data Quality
Some improvements in data quality are the result of seemingly simple changes, like a change in questionnaire organization. A prime example, applied in the HRS, is to combine the module dealing with net worth with the module dealing with income. The idea is that, for some income sources, data quality is enhanced if questions about assets and income are combined into a single question sequence. Hurd, Juster and Smith (2003) report that missing data rates on income from assets are roughly cut in half by asking for the income from an asset right after the question for ownership and quantity of the asset, while mean income representing the return on assets almost doubled. This doubling of income appears to be a quality gain, since it brings reported income from assets more closely in line with the national accounts. Such “fine-tuning” is a continuing process, which benefits substantially from the cumulative experience of researchers.
Other improvements may be more complex or have complicated side effects that require detailed analysis. An example that has gained considerable prominence in recent years is an attempt to increase item response rates by using bracketing. Some surveys like the PSID and the HRS use bracketing to reduce the harm to data quality from item nonresponse. The use of brackets dramatically improves item response rates, but it may also introduce systematic biases. With unfolding brackets, the entry point into a bracketing sequence probably acts as an anchor or reference point, affecting the estimated distribution of wealth, income, and other economic variables. The anchoring effect has been documented extensively, and in many cases its quantitative effect is substantial (Hurd, 1999a). Ongoing research by PRC researchers Hurd, Kapteyn, and Zissimopoulos (2001) aims to assess the size of the bias in particular applications and to find ways to adjust for the biases. Van Soest and Hurd (2003a,b) have developed tests and models for anchoring and yea-saying effects in bracket questions using experimental AHEAD data on consumption.
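The mechanics of an unfolding bracket sequence can be sketched as follows. This is only an illustration under stated assumptions: the function name, the threshold values, and the "step to the adjacent threshold" rule are hypothetical and are not taken from the PSID or HRS instruments. A respondent who answers truthfully ends up in the same bracket whatever the entry point; anchoring arises when the answers themselves shift with the first amount mentioned, which this sketch does not model.

```python
from typing import Callable, Optional, Sequence, Tuple


def unfolding_bracket(
    thresholds: Sequence[float],
    entry_index: int,
    says_more_than: Callable[[float], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Return the (lower, upper) bracket implied by a series of yes/no answers.

    thresholds: ascending amounts (hypothetical values).
    entry_index: index of the threshold asked first (the potential anchor).
    says_more_than: the respondent's answer to "Is it more than X?".
    None marks an open-ended bracket.
    """
    # Thresholds with index in [lo, hi) are still unresolved.
    lo, hi = 0, len(thresholds)
    i = entry_index
    while lo < hi:
        if says_more_than(thresholds[i]):
            lo = i + 1  # amount exceeds thresholds[i]: unfold upward
        else:
            hi = i      # amount is at most thresholds[i]: unfold downward
        if lo < hi:
            # Move to the adjacent unresolved threshold.
            i = lo if i < lo else hi - 1
    lower = thresholds[lo - 1] if lo > 0 else None
    upper = thresholds[hi] if hi < len(thresholds) else None
    return lower, upper


# Hypothetical thresholds and a truthful respondent holding 60,000.
thresholds = [5_000, 25_000, 100_000, 500_000]
truthful = lambda amount: 60_000 > amount
bracket_low_entry = unfolding_bracket(thresholds, 1, truthful)
bracket_high_entry = unfolding_bracket(thresholds, 3, truthful)
```

Comparing bracket distributions across randomly assigned entry points is one simple way to look for an anchoring effect in experimental data.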
Another innovation that RAND PRC researchers have been involved in deals with the use of subjective information (e.g., time preference or subjective probabilities). These concepts play a role in many dimensions of human decision-making. For example, in life cycle models of consumption behavior, the level and rate of change of consumption depend on the individual's time preference and perceived survival curve, which can be estimated from life tables and subjective survival to some particular age. It turns out that direct questions about the subjective probability of uncertain events yield valid information. In panel data, the subjective survival probabilities predict mortality outcomes in that those who report lower subjective survival probabilities die sooner than those who report higher probabilities. The probabilities evolve in the panel in response to new information. For example, the onset of cancer leads to a reduction in the subjective survival probability, as does the death of a parent at an early age (Hurd and McGarry, 2002).
In other research, Kapteyn and Teppa (2003) have used choices of respondents among different consumption paths to estimate subjective time preference rates, intertemporal substitution elasticities, and the strength of habit formation in consumption. Kapteyn and Teppa (2003) have used various self-reported risk aversion measures to explain portfolio choice of investors. Both of these papers extend earlier work by Barsky, Juster, Kimball, and Shapiro (1997). As a final example, Kapteyn, Smith, and Van Soest (2004) are conducting a string of experiments where respondents are shown “vignettes” of hypothetical individuals with symptoms of certain health conditions. Respondents are then asked the extent to which the hypothetical individuals are limited in the work they can do. These responses are used to correct self-reports of work-limiting disability. The authors find that at least half of observed differences in self-reported work disabilities between The Netherlands and the U.S. can be ascribed to differences in response scales. Based on the research of Kapteyn, Smith and Van Soest, work disability and general health vignettes have been introduced into SHARE, HRS, and ELSA. The approach is an extension of the work by King et al. (2003). Similarly, in L.A.FANS, vignettes to calibrate self-reported overall health status are planned for Wave 2 of the survey. These data are likely to shed light on reporting differences between natives and immigrants, English- and Spanish-speakers, and among Hispanics with different degrees of assimilation in the United States.
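To make the idea of correcting for response scales concrete, the sketch below recodes a self-rating relative to the same respondent's ratings of the vignettes, in the spirit of a simple nonparametric vignette recoding. It is not the model estimated by the authors cited above; the function name and the 1-to-5 rating scale are hypothetical, and the sketch assumes the respondent rates the vignettes in strictly increasing order of severity (ties and order violations are not handled).

```python
from typing import Sequence


def vignette_relative_rank(self_rating: int, vignette_ratings: Sequence[int]) -> int:
    """Place a self-rating on a scale defined by the respondent's own vignette ratings.

    vignette_ratings: the respondent's ratings of J vignettes, ordered from
    least to most severe; must be strictly increasing (simplifying assumption).
    Returns an integer from 1 (below the mildest vignette) to 2J + 1
    (above the most severe vignette); even values mean a tie with a vignette.
    """
    if any(a >= b for a, b in zip(vignette_ratings, vignette_ratings[1:])):
        raise ValueError("vignette ratings must be strictly increasing")
    rank = 1
    for z in vignette_ratings:
        if self_rating < z:
            return rank
        if self_rating == z:
            return rank + 1
        rank += 2
    return rank


# Two hypothetical respondents give the same self-rating (3 on a 1-5 scale)
# but rate the same vignette differently, so their recoded ranks differ.
rank_lenient = vignette_relative_rank(3, [2])  # sees the vignette as mild
rank_strict = vignette_relative_rank(3, [4])   # sees the vignette as severe
```

The recoded ranks, rather than the raw self-ratings, are then comparable across respondents who use the response scale differently.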
New Technology
Internet interviewing constitutes a prime example of a new technology gaining in prominence. In many ways, Internet interviewing can be seen as a combination (or an extension) of other kinds of interviewing (paper and pencil; computer-assisted personal interviewing, or CAPI; computer-assisted telephone interviewing, or CATI). At the same time, data collection via the Internet can offer a number of advantages over traditional methods. For example, it is less expensive, and it offers the possibility of graphical or animated presentation (e.g., display of probabilities through pie charts or exploding scales).
However, Internet access is far from universal at this point; thus, Internet interviewing will have to be used jointly with other approaches or for special target populations. RAND researchers (Hurd, Smith, Kapteyn) are involved in several studies addressing these issues. Kapteyn is the founder of the so-called CentERpanel, an Internet panel of 2,000 households in The Netherlands, which is representative of the Dutch population and which is used frequently for scientific experiments (see http://centerdata.uvt.nl/). Kapteyn and Hurd are PI and co-PI, respectively, of an R01 project, joint with the University of Michigan, to assess the possibilities for Internet interviewing in the HRS. The project comprises a multidisciplinary team of psychologists, survey statisticians, economists, health scientists, and epidemiologists. One of the main activities is to set up a mixed-mode panel (telephone and Internet) of households who will be interviewed twice a year. This setup allows for extensive experimentation with question formats, with the measurement of new concepts, with the study of mode effects on response rates and the quality of response, and with issues of selectivity. We expect to use the experience thus gained in other directions and in different applications. For instance, recently Kapteyn was awarded a grant for a “Roybal Center for Economic Decision Making,” which will extend the experimental capabilities of the Internet panel in semi-interactive decision experiments.
International comparisons. RAND has a long tradition in data collection in developing countries, as described above. However, it has less of a tradition of collecting data in developed countries that are comparable with existing or planned U.S. datasets. A number of RAND researchers are increasingly involved in international efforts to collect comparable data in a large number of different countries. In particular, Smith and Hurd are consultants for ELSA (English Longitudinal Study of Ageing) and SHARE (Survey of Health, Ageing and Retirement in Europe). Kapteyn is co-PI on the latter project. Van Soest leads the working group on data validation and on the development of the database. Both ELSA and SHARE are surveys similar to the HRS, in England and ten continental European countries, respectively. In addition to collecting extensive survey information, ELSA also collects medical information by having registered nurses visit the homes of all respondents.
The value of internationally comparable data (or, more generally, of international research; cf. Sastry, 2000) can hardly be overstated. For instance, the International Social Security Project of Gruber and Wise (see, for example, contributions by Kapteyn and de Vos, 1999, 2004) has produced convincing evidence on the strong incentives inherent in social security systems in different countries to retire at ever-earlier ages. Yet in later stages of that project (where micro-data are used for all participating countries), the lack of comparable micro-data across countries has made further progress difficult. In the aforementioned work of Kapteyn, Smith, and Van Soest (2004), the authors show vignettes to respondents in both the United States and The Netherlands. This gives them a direct handle on the effect of cultural norms on the extent to which health conditions justify reduced work effort.
Understanding the nature of existing data. It is also important to understand the nature of existing data. As a spinoff of Klerman's ongoing work on welfare and Medicaid in California, Klerman and Ringel (2003) match CPS data for 1990 to 2000 with the corresponding administrative data on welfare and Medicaid use (the Medi-Cal Eligibility Data System). They confirm earlier inferences of massive and increasing under-reporting of program participation in the CPS. With the matched data, they show a dose-response relation (the more months of program participation, the more likely is an affirmative response) and that under-reporting is not random. They then show the identifying assumptions needed to generate a multiply-imputed file based on the new information on response errors. Finally, they use this multiply-imputed file to generate corrected estimates of uninsurance (much lower than is implied by the uncorrected data) and program take-up (much higher, especially in populations near the edge of eligibility).
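The dose-response check described above can be sketched as a simple tabulation on matched records: for each number of months of participation in the administrative data, compute the share of survey respondents who report participation. This is a minimal sketch, not the authors' procedure; the function name and the record layout are hypothetical, and the records shown are made-up toy values, not CPS or Medi-Cal Eligibility Data System data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def report_rate_by_months(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Share of matched respondents who report participation, by months enrolled.

    records: (months_on_program_in_admin_data, reported_participation_in_survey)
    pairs for respondents whom the administrative data show as participants.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for months, reported in records:
        total[months] += 1
        yes[months] += int(reported)
    return {m: yes[m] / total[m] for m in sorted(total)}


# Made-up toy records for illustration only.
toy_records = [
    (1, False), (1, False), (1, False), (1, True),
    (12, True), (12, True), (12, True), (12, False),
]
rates = report_rate_by_months(toy_records)
```

A reporting rate that rises with months of participation is the pattern the text calls a dose-response relation; one minus the rate is the under-reporting share at each exposure level.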
In related work, Klerman and colleagues are currently exploring the nature of nonresponse in surveys. As part of RAND's Statewide Evaluation of CalWORKs, they conducted a household survey of current and recent welfare recipients, with particularly intensive phone and in-person contact attempts. The survey was based on a list sample generated from administrative data. The administrative data provide particularly rich information on the entire sample (both respondents and nonrespondents), including basic demographics, the history of welfare receipt, and the history of earnings. Contrary to suggestive findings for earlier surveys in California, they find that nonresponse is not highly differential across the history of program participation and earnings. This result is reassuring because it implies that relatively fewer resources need to be invested in tracking down nonrespondents.
Future Directions and Scientific Objectives. Future directions in this area will be designed to accomplish two key scientific objectives: (1) field and disseminate existing surveys (e.g., IFLS, L.A.FANS, NIS) using state-of-the-art methods; and (2) advance scientific knowledge about new methods for innovative, high-quality data collection.
In terms of these two objectives, the PRC staff implementing planned surveys continually aim to integrate relevant new advances in survey methodology into each successive wave of a survey, as well as other innovations in questionnaire content or other data collected to increase the scientific value of the data. For example, the IFLS surveys have added new objective health data with each successive wave and have implemented new and improved tracking techniques to reduce attrition between the survey rounds. As another example, Sastry and Pebley will seek ways to improve retention rates in the second wave of the L.A.FANS. A somewhat different example is a project undertaken by Kapteyn and Rohwedder in collaboration with Anders Klevmarken of the University of Uppsala in Sweden, which surveys respondents who are part of an administrative panel (called LINDA). This allows for validation of survey information to the extent that administrative information and survey information overlap, makes it possible to access a wealth of administrative data, and, by adding targeted survey information, greatly increases the scope of analysis. In the future, new RAND surveys will benefit from some of the methodological advances discussed above in questionnaire designs, imputation techniques, correction of anchoring effects, and so on. Other developments in survey methodology will be integrated as well. Finally, several ideas exist for starting new surveys consistent with the research directions that are currently foreseen within the PRC.
