Missing Data in the Second Longitudinal Study of Young People in England (LSYPE2)


The Department for Education commissioned RAND Europe to provide a statistically robust, unbiased, and consistent approach to account for missing Key Stage 2 data in the LSYPE2 cohort. The final report balances the need for practical solutions for analysts with the desire to exhaustively explore options for dealing with the missing data.

Background

The Second Longitudinal Study of Young People in England (LSYPE2) started in 2013 to understand young people's compulsory education, school-to-work transitions, careers and lives. Although the dataset is of rich academic interest, its key purpose is to provide a resource for evidence-based policy development.

A significant barrier to achieving this purpose is that LSYPE2 has incomplete data, owing to a boycott of Key Stage 2 (KS2) testing in 2010. In LSYPE2, KS2 data is missing for approximately 30% of the cohort. A further 7.5% have missing data because consent for linking survey findings to the National Pupil Database (NPD) was not obtained, and a further 4% for other reasons.

Goals

RAND Europe was commissioned by the Department for Education to provide a statistically robust, unbiased, and consistent approach to missing KS2 data in the LSYPE2 cohort. The project team will:

  • Identify the extent and nature of the missing KS2 data for the LSYPE2 cohort
  • Recommend and implement appropriate method(s) for dealing with this missing data – creating accessible dataset(s) with imputed missing KS2 values
  • Create simple and clear documentation of the approach for users from wide-ranging backgrounds.

Conclusions

  • Using multiple imputation (MI) and inverse probability weighting (IPW), we produced plausible values for KS2 scores (via MI) and analytical weights (via IPW) for pupils missing data due to the boycott. This gives analysts two options when deciding how to deal with missing data.
  • Comparing complete-case analysis with MI suggests that MI would be the more efficient approach, because its standard errors (SE) would be smaller. MI should therefore be used if statistical inference is the aim of a given analysis.
  • However, it is for analysts to decide whether the missing data arising from the boycott will cause difficulties regarding inferences and conclusions and to take appropriate steps to deal with these.
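
For analysts unfamiliar with the two options, the following is a minimal sketch of how each is typically applied, not the project's own code. It assumes an analyst has already obtained one estimate and standard error per imputed dataset (for MI), or an estimated probability of having observed KS2 data for each complete case (for IPW). The function names pool_rubin and ipw_mean and all numbers are hypothetical and purely illustrative.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one estimate per imputed dataset using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean(se ** 2 for se in std_errors)
    # Between-imputation variance: sample variance of the estimates (m - 1 divisor).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


def ipw_mean(observed_values, response_probs):
    """Weighted mean of complete cases, each weighted by 1 / P(observed)."""
    weights = [1 / p for p in response_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, observed_values))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Illustrative (made-up) estimates from three imputed datasets.
    estimate, se = pool_rubin([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
    print(f"MI pooled estimate: {estimate:.2f} (SE {se:.2f})")

    # Illustrative (made-up) scores and probabilities of being observed.
    print(f"IPW mean: {ipw_mean([4.0, 6.0], [0.5, 0.25]):.2f}")
```

The pooled standard error includes a between-imputation term, so it reflects the uncertainty introduced by imputing the missing scores as well as ordinary sampling variability. The weighted mean gives more influence to complete cases that resemble pupils whose data is missing.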

RAND Team Members

  • Ian Brunton-Smith
  • Anna Vignoles
  • Katie Saunders