Accounting for Misclassification in Electronic Health Records-derived Exposures Using Generalized Linear Finite Mixture Models

Published in: Health Services and Outcomes Research Methodology, Volume 17, Issue 2 (June 2017), pages 101–112. doi: 10.1007/s10742-016-0149-5

Posted on RAND.org on December 07, 2017

by Rebecca A. Hubbard, Eric A. Johnson, Jessica Chubak, Karen J. Wernli, Aruna Kamineni, Timothy Bogart, Carolyn M. Rutter

Access further information on this document at Health Services and Outcomes Research Methodology

This article was published outside of RAND. The full text of the article can be found at the link above.

Exposures derived from electronic health records (EHR) may be misclassified, leading to biased estimates of their association with outcomes of interest. An example of this problem arises in the context of cancer screening where test indication, the purpose for which a test was performed, is often unavailable. This poses a challenge to understanding the effectiveness of screening tests because estimates of screening test effectiveness are biased if some diagnostic tests are misclassified as screening. Prediction models have been developed for a variety of exposure variables that can be derived from EHR, but no previous research has investigated appropriate methods for obtaining unbiased association estimates using these predicted probabilities. The full likelihood incorporating information on both the predicted probability of exposure-class membership and the association between the exposure and outcome of interest can be expressed using a finite mixture model. When the regression model of interest is a generalized linear model (GLM), the expectation–maximization algorithm can be used to estimate the parameters using standard software for GLMs. Using simulation studies, we compared the bias and efficiency of this mixture model approach to alternative approaches including multiple imputation and dichotomization of the predicted probabilities to create a proxy for the missing predictor. The mixture model was the only approach that was unbiased across all scenarios investigated. Finally, we explored the performance of these alternatives in a study of colorectal cancer screening with colonoscopy. These findings have broad applicability in studies using EHR data where gold-standard exposures are unavailable and prediction models have been developed for estimating proxies.
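The mixture-model idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary outcome with a logistic link, a single unobserved binary exposure, no other covariates, and predicted probabilities `p` that are well calibrated. The function names (`weighted_logistic`, `em_mixture_logistic`) are hypothetical, and the weighted GLM fit is hand-rolled with Newton-Raphson so the example depends only on NumPy; a standard GLM routine that accepts observation weights could take its place in the M-step.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def weighted_logistic(X, y, w, beta, n_newton=10):
    """Weighted logistic regression by Newton-Raphson, warm-started at beta."""
    for _ in range(n_newton):
        mu = expit(X @ beta)
        grad = X.T @ (w * (y - mu))
        hess = X.T @ (X * (w * mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def em_mixture_logistic(y, p, max_iter=1000, tol=1e-8):
    """EM for a two-component mixture of logistic models.

    y : binary outcomes (0/1)
    p : predicted probability that each subject is truly exposed (assumed known)
    Returns (intercept, log odds ratio for the exposure).
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = y.size

    # Augmented data: every subject appears twice, once as exposed (x = 1)
    # and once as unexposed (x = 0); EM weights say how much each copy counts.
    x_aug = np.concatenate([np.ones(n), np.zeros(n)])
    X_aug = np.column_stack([np.ones(2 * n), x_aug])
    y_aug = np.concatenate([y, y])

    beta = np.zeros(2)
    w = p.copy()  # start from the prior exposure probabilities
    for _ in range(max_iter):
        # M-step: weighted GLM fit on the augmented data.
        weights = np.concatenate([w, 1.0 - w])
        beta_new = weighted_logistic(X_aug, y_aug, weights, beta)

        # E-step: posterior probability of exposure given the outcome.
        mu1 = expit(beta_new[0] + beta_new[1])
        mu0 = expit(beta_new[0])
        f1 = np.where(y == 1, mu1, 1.0 - mu1)
        f0 = np.where(y == 1, mu0, 1.0 - mu0)
        w = p * f1 / (p * f1 + (1.0 - p) * f0)

        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```

Calling `em_mixture_logistic(y, p)` on outcome and predicted-probability arrays of equal length returns the estimated intercept and exposure log odds ratio. Extending the sketch to additional covariates would mean adding columns to the augmented design matrix; that extension is not shown here.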

This report is part of the RAND Corporation external publication series. Many RAND studies are published in peer-reviewed scholarly journals, as chapters in commercial books, or as documents published by other organizations.

The RAND Corporation is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND's publications do not necessarily reflect the opinions of its research clients and sponsors.