In my July 2012 post, I argued that maximum likelihood (ML) has several advantages over multiple imputation (MI) for handling missing data:
- ML is simpler to implement (if you have the right software).
- Unlike multiple imputation, ML has no potential incompatibility between an imputation model and an analysis model.
- ML produces a deterministic result rather than a different result every time.
Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.
What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.
With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.
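To put a rough number on "pretty close to full efficiency," one common approximation is Rubin's (1987) relative efficiency for a point estimate, 1 / (1 + γ/m), where γ is the fraction of missing information and m is the number of imputations. This formula is not taken from the post; it is a standard textbook approximation, and the value γ = 0.5 below is purely illustrative. It covers point estimates only and says nothing about how many imputations are needed for stable standard errors.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Approximate efficiency (variance scale) of an MI point estimate
    based on m imputations, relative to infinitely many imputations.

    gamma is the fraction of missing information, between 0 and 1.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + gamma / m)


# Illustrative assumption: half of the information is missing.
for m in (3, 5, 10, 20, 100):
    print(f"m = {m:3d}  relative efficiency = {relative_efficiency(0.5, m):.4f}")
```

Under that assumption, 10 imputations already give about 95% efficiency for the point estimate, which is why the number of imputations matters much more for standard errors and confidence intervals than for the estimates themselves.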
So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a t-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?
Wrong! In a paper that will be published soon in Structural Equation Modeling, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.
What about confidence intervals and p-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a t-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this t-distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.
Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.
Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for non-linear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.
MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that the data are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.
If you want to learn more about both multiple imputation and maximum likelihood, check out my course on Missing Data.
I remember previously you offered an on-line course on missing data. Are you still offering the on-line course?
Yes, the course is still available at https://statisticalhorizons.com/missing-data-sas-online.
Hi Professor Allison,
I am confused about what exactly full information maximum likelihood (FIML) is. In your 2012 SAS Global Forum paper (page 5), you mentioned that the maximum likelihood approach handles missing data by summing “over all possible values” of missing variables in a joint distribution. Is this FIML? Or is that just a general way to handle missing data in the ML framework, and FIML is a ‘tool’ for estimation?
FIML is the name used in the structural equation modeling (SEM) literature for maximum likelihood handling of missing data. The term is somewhat misleading since “full information maximum likelihood” is often used in other contexts that have nothing to do with missing data. And people outside the SEM world just say maximum likelihood for missing values.
Hi Professor Allison, it is suggested in the description page for your seminar on Longitudinal Data Analysis Using Structural Equation Modeling that, in this case, in order for longitudinal data analysis to work, the number of individuals should be substantially larger than the number of time points. May I ask why this is the case?
When you estimate a panel data model with standard SEM packages, you have to analyze the data in the wide form: one record for each individual with all the variables at different times on that record. So, for example, if your model has 10 variables and there are 5 time points, there will actually be 50 variables on the record, and the covariance matrix would be 50 x 50. Unless you have more than 50 cases, that matrix will be singular and can’t be analyzed.
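The singularity problem is easy to check numerically. The sketch below assumes 10 variables measured at 5 time points (so 50 columns in wide form) and uses simulated standard normal data, which is purely illustrative and not drawn from any real panel. Because the sample covariance matrix is computed after subtracting the means, its rank can be at most the number of cases minus one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: 10 variables x 5 time points = 50 wide-form columns.
n_vars, n_times = 10, 5
p = n_vars * n_times


def cov_rank(n_cases: int) -> int:
    """Rank of the p x p sample covariance matrix from n_cases simulated records."""
    data = rng.standard_normal((n_cases, p))
    cov = np.cov(data, rowvar=False)  # columns are variables
    return int(np.linalg.matrix_rank(cov))


for n in (30, 50, 51, 200):
    rank = cov_rank(n)
    status = "full rank" if rank == p else "singular"
    print(f"cases = {n:3d}  rank = {rank:2d} of {p}  ({status})")
```

In this sketch the matrix only reaches full rank once there are 51 or more cases, and in practice you would want far more than that for stable estimates.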
Hello Prof. Allison,
After working on a series of simulations I found that ML performs better when the percentage of missing information (number of data gaps) is greater in a case with a single dependent having MI. I tested % of incompleteness from 0.10 to 0.40. Would you be able to explain why this is happening? Thanks in advance for any answer.
Was data missing on the dependent variable only? In that case ML (assuming MAR) is equivalent to complete case analysis. There is no benefit to imputation, which only increases the sampling variability of the estimator. And that variability should increase with the percentage of missing data.
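Here is a minimal simulation sketch of that point, under assumptions that are mine rather than the commenter's: a single predictor, about 40% of the dependent variable missing completely at random, and a deliberately simplified stochastic regression imputation that ignores uncertainty in the imputation parameters (so it is not a proper MI procedure). The helper names `ols` and `simple_mi_slope` are hypothetical.

```python
import numpy as np


def ols(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simple_mi_slope(x, y, observed, m, rng):
    """Average slope over m stochastically imputed data sets (simplified)."""
    b0, b1 = ols(x[observed], y[observed])
    resid = y[observed] - (b0 + b1 * x[observed])
    sigma = np.sqrt(resid @ resid / (observed.sum() - 2))
    n_mis = int((~observed).sum())
    slopes = []
    for _ in range(m):
        y_imp = y.copy()
        # Fill gaps with the predicted value plus a random residual draw.
        y_imp[~observed] = b0 + b1 * x[~observed] + rng.normal(0.0, sigma, n_mis)
        slopes.append(ols(x, y_imp)[1])
    return float(np.mean(slopes))


data_rng = np.random.default_rng(1)
n = 200
x = data_rng.standard_normal(n)
y = 1.0 + 0.5 * x + data_rng.standard_normal(n)
observed = data_rng.random(n) > 0.4          # roughly 40% of y is missing
y_with_gaps = np.where(observed, y, np.nan)

# Complete case analysis: one fixed answer for a given data set.
cc_slope = float(ols(x[observed], y_with_gaps[observed])[1])

# Imputation: the answer moves with the random seed.
mi_slopes = [
    simple_mi_slope(x, y_with_gaps, observed, m=5, rng=np.random.default_rng(seed))
    for seed in range(3)
]
print("complete case slope:", cc_slope)
print("imputation slopes:  ", mi_slopes)
```

The complete case slope is the same every time, while the imputation-based slopes scatter around it from seed to seed. That scatter is extra sampling variability that imputation adds when only the dependent variable has missing values.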