Updated August 4, 2025
I’ve long been an advocate of multiple imputation for handling missing data. For example, in my Missing Data seminar, I spend about two-thirds of the course on multiple imputation. The other third covers maximum likelihood (ML). Both methods are pretty good, especially when compared with more traditional methods like listwise deletion or conventional imputation. ML and multiple imputation make similar assumptions, and they have similar statistical properties.
The reason I spend more time talking about multiple imputation is not that I prefer it. On the contrary, I prefer to use maximum likelihood to handle missing data whenever possible. One reason is that ML is simpler, at least if you have the right software. Multiple imputation gets more class time precisely because it is more complicated: there are many different ways to do it, and many little things you have to keep track of and be careful about.
The other big problem with multiple imputation is that, to be effective, your imputation model has to be “congenial” with your analysis model. The two models don’t have to be identical, but they can’t have major inconsistencies. And there are lots of ways that they can be inconsistent. For example, if your analysis model has interactions, then your imputation model better have them as well. If your analysis model uses a transformed version of a variable, your imputation model should use the same transformation. That’s not an issue with ML because everything is done under a single model.
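For instance, here’s a minimal sketch of the “just another variable” approach with the mice package, using hypothetical variables y, x1, and x2: the product term is computed before imputation and then imputed like any other variable, which keeps the imputation model congenial with an analysis model that contains the interaction.
dat$x1x2 <- dat$x1 * dat$x2                 # compute the product before imputing
imp <- mice::mice(dat, m = 25, seed = 123)  # x1x2 is imputed like any other variable
fit <- with(imp, lm(y ~ x1 + x2 + x1x2))    # analysis model includes the interaction
summary(mice::pool(fit))                    # combine results with Rubin's rules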
One other attraction of ML is that it produces a deterministic result. By contrast, multiple imputation gives you a different result every time you run it because random draws are a crucial part of the process. You can reduce that variability as much as you want by imputing more data sets, but it’s not always easy to know how many data sets are enough.
The catch with ML is that you need special software to implement it. Fortunately, most major statistical packages now have methods for handling missing data by ML. For example, the default in most mixed modeling software (like PROC MIXED in SAS, xtmixed in Stata or lme4 in R) is to use ML to handle missing data on the response variable.
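For instance, here’s a minimal lme4 sketch, assuming a hypothetical long-format data frame called long with repeated measurements of y. Subjects with a missing response at some waves simply contribute their observed waves to the likelihood; nothing special is required.
library(lme4)
fit <- lmer(y ~ time + x + (1 | id), data = long)  # all observed waves contribute
summary(fit)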
For linear models with missing data on predictors, there are now many easy-to-use implementations of ML, under the rubric of structural equation modeling. Examples include PROC CALIS in SAS, the sem command in Stata, and the lavaan package in R. These packages typically refer to the method as full information maximum likelihood (FIML) to distinguish it from an earlier two-step method that didn’t produce accurate standard errors.
Here’s an example of how to do FIML for a linear regression using lavaan:
nlsy <- rio::import("https://statisticalhorizons.com/wp-content/uploads/2022/01/nlsymiss.dta")
nlsy.model <- "anti ~ self + pov + black + hispanic + childage + divorce +
gender + momage + momwork"
nlsy.fit <- lavaan::sem(nlsy.model, data=nlsy, missing="fiml.x")
summary(nlsy.fit)
The sem function does the analysis. The option missing=“fiml.x” is all you need to invoke maximum likelihood for handling the missing data; the .x variant extends FIML to missing values on the exogenous predictors. Without that option, 326 of the 528 observations would have been deleted because of missing data, and the estimates would just be ordinary least squares regression.
Here’s the output, which looks like what you’d usually see from a linear regression:
              Estimate  Std.Err  z-value  P(>|z|)
  anti ~
    self        -0.067    0.021   -3.193    0.001
    pov          0.646    0.161    4.006    0.000
    black        0.085    0.157    0.543    0.587
    hispanic    -0.324    0.166   -1.955    0.051
    childage    -0.004    0.100   -0.039    0.969
    divorce     -0.106    0.145   -0.730    0.465
    gender      -0.561    0.117   -4.815    0.000
    momage       0.021    0.028    0.740    0.459
    momwork      0.219    0.141    1.553    0.120
Note that the standard errors, z-values, and p-values are all fully adjusted for the substantial amounts of missing data in several of the variables.
Now here’s the code for the functionally equivalent analysis using multiple imputation with the jomo package.
nlsy <- rio::import("https://statisticalhorizons.com/wp-content/uploads/2022/01/nlsymiss.dta")
nlsy$race <- NULL             # drop race (not in the analysis model) so jomo won't impute it
outjomo <- jomo::jomo(nlsy)   # 5 imputed data sets by default
mi_list <- mitools::imputationList(split(outjomo, outjomo$Imputation)[-1])
                              # [-1] drops Imputation == 0, the original data
mi_results <- with(mi_list, lm(anti ~ self + pov + black + hispanic + childage +
   divorce + gender + momage + momwork))
knitr::kable(summary(mice::pool(mice::as.mira(mi_results))), digits = 3)
The jomo function produced 5 data sets, each with different imputed values. The imputationList function modified those data sets to get them into the right format. The with and lm functions estimated the linear model for each of the 5 data sets. And, finally, the pool function combined those 5 sets of results into a single set, using “Rubin’s rules.” Clearly, ML is the simpler way to go. And it produces estimates that are at least as good as the multiple imputation estimates, and probably a little better.
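For reference, Rubin’s rules work like this: the combined point estimate is simply the average of the estimates across the m data sets, and its total variance is T = W + (1 + 1/m)B, where W is the average of the squared standard errors within data sets and B is the variance of the estimates across data sets. The reported standard error is the square root of T.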
|term | estimate| std.error| statistic| df| p.value|
|:-----------|--------:|---------:|---------:|-------:|-------:|
|(Intercept) | 2.646| 1.233| 2.146| 519.147| 0.032|
|self | -0.065| 0.019| -3.424| 551.795| 0.001|
|pov | 0.625| 0.161| 3.876| 63.026| 0.000|
|black | 0.085| 0.180| 0.471| 26.171| 0.642|
|hispanic | -0.297| 0.172| -1.724| 77.321| 0.089|
|childage | -0.007| 0.101| -0.073| 546.919| 0.942|
|divorce | -0.106| 0.145| -0.731| 521.439| 0.465|
|gender | -0.558| 0.118| -4.739| 547.427| 0.000|
|momage | 0.018| 0.028| 0.639| 550.276| 0.523|
|momwork     |    0.195|     0.139|     1.399| 125.649|   0.164|
These results are very similar to what we got with maximum likelihood. But when I ran the jomo code again, I got different results:
|term | estimate| std.error| statistic| df| p.value|
|:-----------|--------:|---------:|---------:|-------:|-------:|
|(Intercept) | 2.403| 1.246| 1.927| 377.036| 0.055|
|self | -0.058| 0.021| -2.828| 122.468| 0.005|
|pov | 0.733| 0.173| 4.225| 32.748| 0.000|
|black | 0.091| 0.145| 0.631| 508.059| 0.528|
|hispanic | -0.294| 0.176| -1.677| 57.958| 0.099|
|childage | -0.013| 0.103| -0.131| 384.854| 0.896|
|divorce | -0.108| 0.146| -0.741| 398.662| 0.459|
|gender | -0.555| 0.116| -4.773| 566.980| 0.000|
|momage | 0.024| 0.030| 0.813| 203.204| 0.417|
|momwork | 0.165| 0.152| 1.089| 42.030| 0.283|
Note that the coefficient for pov has gone from .625 to .733. That’s a 17% increase! To get more stable estimates for this data set (which has a lot of missing data), you really need more like 25 data sets. You could get those by changing the jomo call to jomo(nlsy, nimp = 25). You can also force jomo to give you the same results every time by setting the same “seed” for the random number generator before each run, but there’s no reason to prefer the results from one seed to those from another.
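Here’s a minimal sketch combining those two suggestions (the seed value itself is arbitrary):
set.seed(1234)                           # fix the random number generator
outjomo <- jomo::jomo(nlsy, nimp = 25)   # 25 imputed data sets instead of 5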
Now suppose you want to estimate something more complicated than a linear regression model. Maybe it’s a logistic model for binary data or a negative binomial model for count data. Most structural equation modeling software can’t do FIML for those kinds of models. But Mplus can, and it’s by far the best software for handling missing data by FIML for generalized linear models. (There’s also a recently developed R package called mdmb that can do at least some of what Mplus can do, but it’s much more difficult to use.)
Here’s an example of using Mplus to estimate an ordered logit model. We’ll use the same data set as before, but we’ll treat the dependent variable, anti, as an ordered categorical variable. That variable actually has seven values (0, 1, …, 6) and is highly skewed to the right. So an ordered logit model is more appealing than a strictly linear model. Here’s the Mplus code:
DATA: FILE = c:\data\nlsymiss.dat;
VARIABLE:
NAMES = anti self pov black hispanic childage divorce gender momage momwork ;
MISSING = .; CATEGORICAL = anti;
ANALYSIS: ESTIMATOR=ML; INTEGRATION = MONTECARLO;
MODEL: anti ON self pov black hispanic childage divorce gender momage momwork;
self pov black hispanic childage divorce gender momage momwork;
The second line of the MODEL command lists the predictors by themselves, which brings their distributions into the likelihood so that ML can handle their missing values. And here are the results:
                 Estimate       S.E.  Est./S.E.    P-Value
 ANTI     ON
    SELF           -0.089      0.029     -3.037      0.002
    POV             0.755      0.216      3.495      0.000
    BLACK           0.148      0.208      0.709      0.478
    HISPANIC       -0.381      0.221     -1.721      0.085
    CHILDAGE        0.023      0.130      0.180      0.857
    DIVORCE        -0.134      0.190     -0.708      0.479
    GENDER         -0.696      0.155     -4.494      0.000
    MOMAGE          0.029      0.037      0.789      0.430
    MOMWORK         0.276      0.185      1.492      0.136
The p-values are very similar to what we got with the linear model using lavaan. But, of course, the coefficients are not directly comparable.
The downside of Mplus is that it’s not cheap: currently $595 for academics, $695 for commercial users, and $195 for validated students. But it’s an amazing piece of software that can do things that are not available anywhere else.
Statistical Horizons is celebrating 20 years of advancing research methods—and to thank our community, we’re offering a free, 3-hour seminar with our Founder and President, Paul Allison, one of the world’s leading experts on missing data. As part of our Distinguished Speaker Series, Dr. Allison presents Missing Data Then and Now. He will revisit his influential work on missing data methods, tracing key developments from the past two decades. You’ll get a clear, practical update on maximum likelihood, multiple imputation, and fully Bayesian approaches—what’s changed, what’s endured, and what today’s researchers need to know.
Comments
Very interesting article. Regrettably, SPSS and AMOS are not mentioned. AMOS is much cheaper than Mplus, and it visualizes the models that you set up, which is great.
Yes, Amos is cheap for students and faculty. But unlike Mplus, it can’t do FIML for categorical data. And Mplus can also produce great diagrams of your model. Finally, lavaan is even cheaper (free) and it can do more than Amos. Accessory packages can produce good diagrams.
Hi Dr. Allison,
I’ve read your post and the full article on ML vs. MI, which convinced me that FIML is simpler than, and at least as efficient as, MI, at least for my data. I’m a SAS user, and I found PROC CALIS with METHOD=FIML easy to work with. However, I still can’t figure out how to get confidence intervals and p-values for the path estimates.
Second, what if we have variables that we want to use in the path model but that have some missing data themselves? Can they still be used effectively, or would an MI method be more appropriate?
Thanks,
You can get confidence intervals by putting the option CL on the PROC statement. As for p-values, the most recent versions of CALIS report them. Older versions do not.
There should be no problem with including other variables that have missing data.
Hi Dr. Allison – I know ML has several advantages over MI, but I’ve decided that MI may be more appropriate for my data. Thus, my question relates to MI. I have several categorical predictors with missing values that I’ve recoded into dummy variables and my dependent variable is also dummy coded. My question is, do I need to incorporate all categorical predictors and their respective dummies in the imputation model (as auxiliary variables)? Or should I just include the predictor variables that I’ll use in the final analysis model (in this case logistic regression)? Thanks!
How many variables are you dealing with? How many will be in the model, and how many will be excluded?
In the original data set, 3 dummy predictor variables and 4 categorical predictor variables (with multiple levels each). I recoded the 4 categorical variables (for simplicity I’ll just refer to them as catvar 1, 2, 3, etc.) into multiple dummy variables each. For example: catvar 1 (recoded into 2 dummies); catvar 2 (recoded into 3 dummies); catvar 3 (recoded into 2 dummies); catvar 4 (recoded into 3 dummies).
This stuff is pretty exploratory, so at this stage in the analysis I’m running multiple models with different combinations of variables to check for robustness, etc. Thus, at a minimum the final model could include 7 dummy variables. At a maximum, the final model could have 13 variables, with all recoded dummies included. However, I will report all findings to be sure. I hope this makes sense.
OK, then I would just use all of them in the imputation. For the imputation, I would probably go with a fully conditional method (aka MICE), treating the categorical variables as single variables and imputing them by multinomial logistic regression.
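In R, a rough sketch of that approach with the mice package (variable names are hypothetical) would be:
dat$catvar1 <- factor(dat$catvar1)  # keep each categorical variable as a single factor
dat$catvar2 <- factor(dat$catvar2)
imp <- mice::mice(dat, m = 25)      # factors with 3+ levels default to "polyreg"
                                    # (multinomial logit); binary factors to "logreg"
completed <- mice::complete(imp, action = "all")  # then build the dummies from these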
Thanks for the feedback. Second question: in my final model I will have all binary predictors and a binary DV. Am I correct in assuming that MI is more appropriate than ML in this situation? I can’t seem to find enough consensus in the literature.
Yeah, I’d probably go the MI route.
Hi, Paul.
It seems to me that there are situations in which partial information is easier to incorporate through multiple imputation. Consider, for example, a multi-item scale that is “missing” because a single item was unobserved. With MI, one could impute the missing item score and use it along with the observed item scores to calculate the response of interest. Could ML make similarly appropriate and efficient use of the partially observed item scores?
In principle, yes ML could be used in this situation. You would have to postulate a latent variable with each item as an indicator of that variable. Standard SEM packages can do this, but the model could end up being rather complicated.
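For instance, a rough lavaan sketch, with hypothetical items item1 through item4 and covariates x1 and x2:
scale.model <- "f =~ item1 + item2 + item3 + item4   # latent scale measured by the items
                y ~ f + x1 + x2"                     # structural model of interest
scale.fit <- lavaan::sem(scale.model, data = dat, missing = "fiml")
summary(scale.fit)
FIML then uses whatever items each respondent actually answered.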
I wonder whether multiple imputation is a good solution to sample selection bias. Take the famous example of estimating gender wage inequalities that Heckman worked on. In that analysis, the fact that many women weren’t in the workforce led to underestimation of the wage gap. If we had a data set with many women not working, would it make sense to multiply impute income for all of them to get a better comparison against men?
It’s a possible solution, but not with conventional imputation software which assumes missing at random. And even if you had the right software, there’s no reason to think it would be any better than standard Heckman methods. The one attraction of MI in this situation is that it might be easier to do a sensitivity analysis to different forms or degrees of not missing at random.
One benefit of MI over ML that is worth mentioning is the ease of including “missingness”-related covariates in the imputation model to improve the plausibility of the MAR assumption. Including these auxiliary variables in the ML-estimated model is more of a challenge.
Valid point, but a lot of ML software makes it pretty easy. Mplus, for example, has an AUXILIARY option where you can list as many covariates as you like. Also, it’s more important that auxiliary variables be correlated with the VARIABLES that are missing, rather than their MISSINGNESS. The best auxiliaries are those that are correlated with both the variables and their missingness.
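If you’re working in lavaan, the semTools package automates the analogous “saturated correlates” approach. A rough sketch, assuming semTools’ auxiliary wrappers and two hypothetical auxiliary variables z1 and z2:
aux.fit <- semTools::sem.auxiliary(nlsy.model, data = nlsy, aux = c("z1", "z2"))
summary(aux.fit)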
When using mixed models with missing responses, if a variable is not in the substantive model but is predictive of missingness (and may also be predictive of the underlying values), isn’t it enough to correctly specify this auxiliary variable as a covariate in the model?
You first need to consider why the auxiliary variable is not in the substantive model. If it’s a consequence of y or an alternative measure of y, then including it as a covariate could bias the coefficients of all the other covariates. If it’s because it’s not associated with y, then including it as a covariate is unlikely to be very helpful. Even if potential auxiliary variables are predictive of missingness, they won’t be helpful in reducing either bias or standard errors unless they are also correlated with the variable that has missing data.
I see, so including auxiliary variables as covariates in the mixed model will only reduce bias if they are associated with both missingness and with the outcome. In the case of them being alternative measures of y, why would this bias other coefficients?
Because such a variable will be correlated with both the error term and the other predictors.
I concur with Paul, and would like to add the following. In my view, while multiple imputation is a great method for accounting for the uncertainty brought about by the presence of missing values, it does require a ‘proper’ imputation model. Developing such a model is in general a difficult task unless one is an expert in the substantive area of concern, or better yet has a panel of experts available. Further, while in principle the ‘imputer’ and the ‘analyst’ can be different persons, in reality different research questions may require different imputation models in the above sense, and then the separation between imputer and analyst may be a downside of the approach (if the completed/imputed data sets are treated as given for the analyst to use). I would also say that missing data per se presents an even more serious and difficult problem than that of having to choose between ML (with auxiliary variables, if available) and MI.