Over the last decade, multiple imputation has rapidly become one of the most widely used methods for handling missing data. However, one of the big uncertainties about the practice of multiple imputation is how many imputed data sets are needed to get good results. In this post, I’ll summarize what I know about this issue. Bottom line: The old recommendation of three to five data sets is usually insufficient.
Background: As the name suggests, multiple imputation involves producing several imputed data sets, each with somewhat different imputed values for the missing data. The goal is for the imputed values to be random draws from the posterior predictive distribution of the missing data, given the observed data. After imputing several data sets, the analyst applies conventional estimation methods to each data set. Parameter estimates are then simply averaged across the several analyses. Standard errors are calculated using Rubin’s (1987) formula that combines variability within and between data sets.
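To make those combining rules concrete, here is a minimal sketch in Python. The function simply averages the point estimates and applies Rubin’s total-variance formula; the five estimates and squared standard errors fed to it at the bottom are made up purely for illustration.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    qbar = estimates.mean()            # pooled point estimate (simple average)
    ubar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = ubar + (1 + 1 / M) * b         # Rubin's total variance
    return qbar, np.sqrt(t)            # pooled estimate and its standard error

# Hypothetical results from M = 5 imputed data sets:
est, se = rubin_combine([1.02, 0.97, 1.10, 0.94, 1.05],
                        [0.040, 0.050, 0.040, 0.050, 0.045])
print(f"pooled estimate = {est:.3f}, pooled SE = {se:.3f}")
```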
Why do we need more than one imputed data set? Two reasons: First, with only a single data set, the parameter estimates will be highly inefficient. That is, they will have more sampling variability than necessary. Averaging results over several data sets can yield a major reduction in this variability. (This has an analog in psychometrics: multiple-item scales are better than single-item scales because they produce more reliable measurements). The second reason is that the variability of the estimates across the multiple data sets provides the necessary information to get estimates of the standard errors that accurately reflect the uncertainty about the missing values.
Both of these reasons, efficiency of point estimates and estimation of standard errors, have implications for the number of imputations. But the implications are rather different, and that explains why the consensus about the number of imputations has changed dramatically in recent years.
The early literature focused on efficiency, and the conclusion was that you could usually get by with three to five data sets. Schafer (1999) upped that number slightly when he stated that “Unless rates of missing information are unusually high, there tends to be little or no practical benefit to using more than five to ten imputations.” That conclusion was based on Rubin’s formula for relative efficiency: 1/(1+F/M), where F is the fraction of missing information and M is the number of imputations. Thus, even with 50% missing information, five imputed data sets would produce point estimates that were 91% as efficient as those based on an infinite number of imputations. Ten data sets would yield 95% efficiency.
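For readers who want to see where those percentages come from, here is the same relative-efficiency formula evaluated in a few lines of Python (purely illustrative):

```python
# Relative efficiency of M imputations, 1 / (1 + F/M), for 50% missing information.
def relative_efficiency(F, M):
    return 1 / (1 + F / M)

for M in (3, 5, 10, 20):
    print(f"M = {M:2d}: {relative_efficiency(0.5, M):.3f}")
# M = 5 gives about 0.91 and M = 10 about 0.95, matching the figures above.
```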
But what’s good enough for efficiency isn’t necessarily good enough for standard error estimates, confidence intervals, and p-values. One of the critical components of Rubin’s standard error formula for multiple imputation is the variance of each parameter estimate across the multiple data sets. But ask yourself this: How accurately can you estimate a variance with just three observations? Or even five or ten? With so few observations (data sets), it shouldn’t be surprising that standard error estimates (and, hence, p-values) can be very unstable. As many have noticed, if you repeat the whole imputation/estimation process, the p-values may look very different.
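To put a number on that instability, here is a quick, purely illustrative Python simulation (not part of the original argument): it repeatedly estimates a variance from only M draws of a standard normal variable and reports how widely those estimates range.

```python
import numpy as np

rng = np.random.default_rng(42)
for M in (3, 5, 10, 40):
    # 10,000 replications of "estimate a variance from only M observations"
    draws = rng.normal(0.0, 1.0, size=(10_000, M))
    est_var = draws.var(axis=1, ddof=1)     # true variance is 1.0
    lo, hi = np.percentile(est_var, [2.5, 97.5])
    print(f"M = {M:2d}: 95% of variance estimates fall in [{lo:.2f}, {hi:.2f}]")
# With M = 3 the estimates range over roughly 0.03 to 3.7; with M = 40 the
# interval is far narrower, which is why pooled SEs and p-values stabilize.
```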
So how many imputations do you need for accurate, stable p-values? More than ten, in many situations, especially if the fraction of missing information is high. Graham et al. (2007) approached the problem in terms of loss of power for hypothesis testing. Based on simulations (and a willingness to tolerate up to a 1 percent loss of power), they recommended 20 imputations for 10% to 30% missing information, and 40 imputations for 50% missing information. See their Table 5 for other scenarios.
Similar recommendations were proposed by Bodner (2008), who also relied on simulation evidence, and by White et al. (2011), who analytically derived an approximation to the Monte Carlo error of the p-value. Despite their different approaches, both sources agreed on the following simplified rule of thumb: the number of imputations should be similar to the percentage of cases that are incomplete. So if 27% of the cases in your data set have missing data on one or more variables in your model, you should generate about 30 imputed data sets.
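If you like, that rule of thumb takes only a few lines of code. The helper below and the tiny pandas data frame it is applied to are hypothetical, just to show the arithmetic of “percentage of incomplete cases, rounded up.”

```python
import numpy as np
import pandas as pd

def suggested_m(df, round_to=5):
    """Suggest a number of imputations ~ the percentage of incomplete cases."""
    pct_incomplete = 100 * df.isna().any(axis=1).mean()
    return int(np.ceil(pct_incomplete / round_to) * round_to)

# Hypothetical data: 2 of 4 cases have at least one missing value (50%).
df = pd.DataFrame({"x": [1, 2, None, 4], "y": [None, 1, 2, 3]})
print(suggested_m(df))   # -> 50 imputations under this rule of thumb
```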
Of course, getting more data sets requires more computing time. With large data sets and many variables in the imputation model, this can become burdensome. There’s an easy way to reduce computing time if you’re imputing with the popular MCMC method under the assumption of multivariate normality. Just lower the number of iterations between data sets. The default in SAS (PROC MI) and Stata (mi command) is 100 iterations between data sets. But my experience in examining autocorrelation diagnostics is that 100 is way more than enough in the vast majority of cases. I’m comfortable with 10 iterations between data sets, although I’d stick with at least 100 burn-in iterations before the first data set.
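If you want to check that for your own data, one software-agnostic way is to save a scalar summary of the imputation model at every MCMC iteration (for example, a parameter’s running value) and look at its autocorrelations. The sketch below is a rough illustration; the file name and the saved chain are hypothetical, and how you export such a chain depends on your software.

```python
import numpy as np

def lag_autocorr(chain, lag):
    """Sample autocorrelation of a 1-D chain at a given lag."""
    c = np.asarray(chain, dtype=float)
    c = c - c.mean()
    return float(np.dot(c[:-lag], c[lag:]) / np.dot(c, c))

chain = np.loadtxt("mcmc_chain.txt")   # hypothetical file of saved iterations
for lag in (1, 5, 10, 50, 100):
    print(f"lag {lag:3d}: {lag_autocorr(chain, lag):.3f}")
# If the autocorrelation is already near zero by lag 10, spacing imputed data
# sets 100 iterations apart buys little beyond extra computing time.
```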
You can learn more about multiple imputation in my book Missing Data or in my course of the same name.
REFERENCES
Bodner, Todd E. (2008) “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15: 651-675.
Graham, John W., Allison E. Olchowski and Tamika D. Gilreath (2007) “How many imputations are really needed? Some practical clarifications of multiple imputation theory.” Prevention Science 8: 206–213.
Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Schafer, Joseph L. (1999) “Multiple imputation: a primer.” Statistical Methods in Medical Research 8: 3-15.
White, Ian R., Patrick Royston and Angela M. Wood (2011) “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30: 377-399.
Comments
This is excellent advice. I had been using 20 multiply imputed datasets for a dataset where F=0.40, but I will now re-run the analyses and increase M to 40. Thank you!
We approached the problem of choosing the number of imputations from a different angle: why not use the full MCMC chain, since throwing most of the draws away seems wasteful? Of course, people would worry about the independence assumption in Rubin’s derivation of the combining rules, but heuristically we think that for very large m the variability in \bar{q}_\infty is likely to be small compared to b_m + \bar{u}_m, so applying the same combining rules to dependent draws (i.e., the full MCMC chain) is theoretically justifiable.
From our simulation studies, and from applying the MIX package to a real data set, we see a gain in precision (shorter confidence intervals and better coverage rates) from using dependent draws. This approach also eliminates the sometimes difficult task of obtaining independent draws, which speaks to your point about how accurately you can estimate a variance from just three, or even five or ten, observations. In fact, some practitioners find that m=10 and m=20 give different inferences, and that is because of b_m, the variance across imputed data sets.