Systematic Investigation of the Effects of Missing Data on Statistical Models for Networks
Published Aug 14, 2019
Theoretical perspectives and promising findings from social network analysis are an important influence on contemporary social and behavioral sciences, whose recent empirical and theoretical developments, in turn, have impacted network science. Network studies facilitate direct behavioral intervention by pinpointing how human relationships encourage or discourage attitudes, actions, and behaviors. Social network analysis provides important tools for identifying and understanding the social and contextual factors relevant to engagement in particular behaviors. By quantifying relational information and linking it to human behavior, many important quantitative methods, such as matrix algebra, graph theory and statistical analysis, can be applied to identify structural patterns in social networks and measure the association of those patterns with various behavioral outcomes.
Network sampling design and measurement strategies tend to correspond with study size. Medium to small studies often elicit network members via free recall using one or more name generators. Depending on the type of study, this may be followed by a battery of questions that the respondent answers about each network member and their relationship to that member, as well as an assessment of the interconnections among the named network members. The smallest studies can follow a similar pattern but often provide a roster of names for participants to choose from. Study size, however, should not be the sole factor influencing the choice of network study design. Clearly, one major concern in network research is sampling individual actors optimally while, at the same time, gathering relevant and adequate information on relational ties. Here, we consider the effects of study design variables on statistical model parameters.
Overall, we believe that the "take home message" from this study is that we are conservative modelers. In the context of network statistical models, this means that analyses with increasing levels of missingness would still identify key main effects but would miss secondary findings, such as the relationship between primary behaviors and other behavioral covariates. We are more likely to overlook an effect that is truly present than to find significance where there is none. This is particularly true for SIENA models. p* models/ERGMs, however, are prone to higher rates of both Type 1 and Type 2 errors. Not surprisingly, more missing data leads to a greater likelihood of Type 2 errors. Interestingly, in some cases, less missing data leads to a greater likelihood of Type 1 errors, simply because more data leads to more 'power' in a statistical sense and hence a greater frequency of rejecting the null when it's true.
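The link between missingness and Type 2 errors can be illustrated with a small Monte Carlo sketch. This is a toy illustration under stated assumptions, not the study's method: it does not fit a SIENA model or an ERGM, and it is not meant to reproduce the study's results. It assumes a directed network in which ties form independently, with a hypothetical homophily effect (tie probability `p_same` within a group and `p_diff` across groups), and it assumes that missingness arises from non-responding actors whose outgoing ties go unobserved. The function names, parameter values, and the two-proportion z-test are all illustrative choices.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05


def z_test_rejects(ties_same, dyads_same, ties_diff, dyads_diff):
    """Two-proportion z-test: does tie density differ for same-group dyads?"""
    if dyads_same == 0 or dyads_diff == 0:
        return False  # nothing observed, so the null cannot be rejected
    pooled = (ties_same + ties_diff) / (dyads_same + dyads_diff)
    se = math.sqrt(pooled * (1 - pooled) * (1 / dyads_same + 1 / dyads_diff))
    if se == 0:
        return False
    z = (ties_same / dyads_same - ties_diff / dyads_diff) / se
    return abs(z) > Z_CRIT


def rejection_rate(p_same, p_diff, missing_frac,
                   n_actors=20, n_reps=200, seed=1):
    """Share of simulated networks in which 'no homophily' is rejected.

    Hypothetical setup: two equal-sized groups, independent directed ties,
    and a fraction `missing_frac` of actors who do not respond.
    """
    rng = random.Random(seed)
    group = [i % 2 for i in range(n_actors)]
    n_missing = round(missing_frac * n_actors)
    rejections = 0
    for _ in range(n_reps):
        nonrespondents = set(rng.sample(range(n_actors), n_missing))
        ties = {True: 0, False: 0}
        dyads = {True: 0, False: 0}
        for i in range(n_actors):
            if i in nonrespondents:
                continue  # outgoing ties of a non-respondent are unobserved
            for j in range(n_actors):
                if i == j:
                    continue
                same = group[i] == group[j]
                dyads[same] += 1
                if rng.random() < (p_same if same else p_diff):
                    ties[same] += 1
        if z_test_rejects(ties[True], dyads[True], ties[False], dyads[False]):
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    for frac in (0.0, 0.2, 0.5, 0.8):
        power = rejection_rate(0.25, 0.15, frac)   # a true effect exists
        type1 = rejection_rate(0.20, 0.20, frac)   # no true effect
        print(f"missing={frac:.1f}  power={power:.2f}  type1_rate={type1:.2f}")
```

When a true effect exists, the rejection rate is the estimated power, so one minus that value is the estimated Type 2 error rate. In this simplified setting, the effect becomes harder to detect as more actors fail to respond, because fewer dyads are observed.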
The research described in this report was conducted by the Pardee RAND Graduate School.
This publication is part of the RAND working paper series. RAND working papers are intended to share researchers' latest findings and to solicit informal peer review. They have been approved for circulation by RAND but may not have been formally edited or peer reviewed.