## Why Maximum Likelihood is Better Than Multiple Imputation

##### July 9, 2012 By Paul Allison

I’ve long been an advocate of multiple imputation for handling missing data. For example, in my two-day Missing Data seminar, I spend about two-thirds of the course on multiple imputation, using PROC MI in SAS and the **mi** command in Stata. The other third covers maximum likelihood (ML). Both methods are pretty good, especially when compared with more traditional methods like listwise deletion or conventional imputation. ML and multiple imputation make similar assumptions, and they have similar statistical properties.

The reason I spend more time talking about multiple imputation is *not* that I prefer it. On the contrary, I prefer to use maximum likelihood to handle missing data whenever possible. One reason is that ML is simpler, at least if you have the right software. That is exactly why multiple imputation gets more of my time: it takes longer to explain all the different ways to do it and all the little things you have to keep track of and be careful about.

The other big problem with multiple imputation is that, to be effective, your imputation model has to be “congenial” with your analysis model. The two models don’t have to be identical, but they can’t have major inconsistencies. And there are lots of ways that they can be inconsistent. For example, if your analysis model has interactions, then your imputation model had better have them as well. If your analysis model uses a transformed version of a variable, your imputation model should use the same transformation. That’s not an issue with ML because everything is done under a single model.
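To make the interaction case concrete, here is a small simulation sketch in plain Python. Everything in it is an assumption made up for illustration: the data-generating model (a true interaction of 2 between a continuous `x` and a binary group `g`), the 50% completely random missingness on `y`, and the helper names. The imputation step is also simplified, since it plugs in the estimated coefficients rather than drawing them from a posterior, so treat it as a rough demonstration rather than proper multiple imputation.

```python
import random
import statistics

def make_rows(n=4000, seed=0):
    """Simulate (x, g, y); about half of the y values are set to None at random."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        g = 1 if rng.random() < 0.5 else 0
        y = 1 + x + 0.5 * g + 2 * x * g + rng.gauss(0, 1)  # true interaction = 2
        rows.append((x, g, None if rng.random() < 0.5 else y))
    return rows

def slope(pairs):
    """OLS slope of y on x for a list of (x, y) pairs."""
    mx = statistics.fmean([x for x, _ in pairs])
    my = statistics.fmean([y for _, y in pairs])
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

def interaction(rows):
    """Analysis model: the x-by-g interaction equals slope(g=1) minus slope(g=0)."""
    s1 = slope([(x, y) for x, g, y in rows if g == 1])
    s0 = slope([(x, y) for x, g, y in rows if g == 0])
    return s1 - s0

def additive_fit(obs):
    """OLS of y on x and g with NO interaction; returns (a, bx, bg, residual SD)."""
    mx = statistics.fmean([x for x, _, _ in obs])
    mg = statistics.fmean([g for _, g, _ in obs])
    my = statistics.fmean([y for _, _, y in obs])
    sxx = sum((x - mx) ** 2 for x, _, _ in obs)
    sgg = sum((g - mg) ** 2 for _, g, _ in obs)
    sxg = sum((x - mx) * (g - mg) for x, g, _ in obs)
    sxy = sum((x - mx) * (y - my) for x, _, y in obs)
    sgy = sum((g - mg) * (y - my) for _, g, y in obs)
    det = sxx * sgg - sxg ** 2
    bx = (sxy * sgg - sgy * sxg) / det
    bg = (sgy * sxx - sxy * sxg) / det
    a = my - bx * mx - bg * mg
    rss = sum((y - a - bx * x - bg * g) ** 2 for x, g, y in obs)
    return a, bx, bg, (rss / (len(obs) - 3)) ** 0.5

def mi_interaction(rows, m=5, seed=1):
    """Impute y from the additive model m times, then average the interaction estimates."""
    rng = random.Random(seed)
    a, bx, bg, s = additive_fit([r for r in rows if r[2] is not None])
    estimates = []
    for _ in range(m):
        completed = [
            (x, g, y if y is not None else a + bx * x + bg * g + rng.gauss(0, s))
            for x, g, y in rows
        ]
        estimates.append(interaction(completed))
    return statistics.fmean(estimates)

rows = make_rows()
observed = [r for r in rows if r[2] is not None]
print("complete cases:", round(interaction(observed), 2))
print("imputed without the interaction:", round(mi_interaction(rows), 2))
```

Under these assumptions the complete-case estimate should sit near the true value of 2, while the estimate from data imputed without the interaction should be pulled roughly halfway toward zero, because half of the `y` values were generated by a model that forces a common slope in both groups.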

One other attraction of ML is that it produces a deterministic result. By contrast, multiple imputation gives you a different result every time you run it because random draws are a crucial part of the process. You can reduce that variability as much as you want by imputing more data sets, but it’s not always easy to know how many data sets are enough.
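Here is a minimal sketch of that contrast in plain Python. The setup is hypothetical: a fully observed `x`, a `y` that is missing more often when `x` is positive, and a target of estimating the mean of `y`. For this simple monotone pattern, the ML estimate of the mean under bivariate normality is the complete-case regression line evaluated at the mean of all the `x` values, so it can be computed in closed form. The imputation step is deliberately simplified (it uses the estimated regression parameters instead of drawing them from a posterior), and all function names are my own.

```python
import random
import statistics

def make_data(n=200, seed=0):
    """Simulate (x, y) pairs; y is set to None more often when x > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1 + 2 * x + rng.gauss(0, 1)
        p_missing = 0.6 if x > 0 else 0.2
        data.append((x, None if rng.random() < p_missing else y))
    return data

def fit_complete_cases(data):
    """OLS of y on x using complete cases; returns (intercept, slope, residual SD)."""
    obs = [(x, y) for x, y in data if y is not None]
    mx = statistics.fmean([x for x, _ in obs])
    my = statistics.fmean([y for _, y in obs])
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in obs)
    return a, b, (rss / (len(obs) - 2)) ** 0.5

def ml_mean_y(data):
    """ML estimate of the mean of y: no random numbers involved."""
    a, b, _ = fit_complete_cases(data)
    return a + b * statistics.fmean([x for x, _ in data])

def mi_mean_y(data, m, seed):
    """Average of the mean of y over m stochastically imputed data sets."""
    rng = random.Random(seed)
    a, b, s = fit_complete_cases(data)
    estimates = []
    for _ in range(m):
        completed = [y if y is not None else a + b * x + rng.gauss(0, s)
                     for x, y in data]
        estimates.append(statistics.fmean(completed))
    return statistics.fmean(estimates)

def spread_across_runs(data, m, runs=20):
    """SD of the MI estimate over repeated runs that differ only in the seed."""
    return statistics.stdev([mi_mean_y(data, m, seed) for seed in range(runs)])

data = make_data()
print("ML estimate:", ml_mean_y(data))
print("MI, 5 imputations, seed 1:", mi_mean_y(data, 5, seed=1))
print("MI, 5 imputations, seed 2:", mi_mean_y(data, 5, seed=2))
print("run-to-run SD with 2 imputations:", spread_across_runs(data, 2))
print("run-to-run SD with 50 imputations:", spread_across_runs(data, 50))
```

The ML line prints the same number on every run. The two MI lines differ from each other unless you fix the seed, and the run-to-run spread shrinks as the number of imputations grows.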

The catch with ML is that you need specially designed software to implement it. Fortunately, in recent years several major statistical packages have introduced methods for handling missing data by ML. For example, the default in most mixed modeling software (like PROC MIXED in SAS or the **xtmixed** command in Stata) is to use ML to handle missing data on the response variable. For linear models with missing data on *predictors*, there are now easy-to-use implementations of ML in both SAS (PROC CALIS) and Stata (the **sem** command). For logistic regression and Cox regression, the only commercial package that does ML for missing data is Mplus.

To get the whole story, you can download the paper that accompanied my recent keynote address at the 2012 SAS Global Forum.

I concur with Paul, and would like to add the following. In my view, while multiple imputation is a great method for ‘accounting’ for the uncertainty brought about by missing values, it does require a ‘proper’ imputation model. Developing such a model is, in general, a difficult task unless one is an expert in the substantive area of concern or, better yet, has a panel of experts available. Further, while in principle the ‘imputer’ and the ‘analyst’ can be different people, in practice different research questions may call for different imputation models. The separation of ‘imputer’ and ‘analyst’ can then be a downside of the approach, if the analyst has to treat the completed (imputed) data sets as given. I would also like to say that missing data per se presents an even more serious and difficult problem than that of having to choose between ML (with auxiliary variables, if available) and MI.

One benefit of MI over ML that is worth mentioning is the ease of including “missingness”-related covariates in the imputation model to improve the plausibility of the MAR assumption. Including these auxiliary variables in the ML-estimated model is more of a challenge.

Valid point, but a lot of ML software makes it pretty easy. Mplus, for example, has an AUXILIARY option where you can list as many covariates as you like. Also, it’s more important that auxiliary variables be correlated with the VARIABLES that are missing, rather than their MISSINGNESS. The best auxiliaries are those that are correlated with both the variables and their missingness.

When using mixed models with missing responses, if a variable is not in the substantive model but is predictive of missingness (and may also be predictive of the underlying values), isn’t it enough to correctly specify this auxiliary variable as a covariate in the model?

You first need to consider why the auxiliary variable is not in the substantive model. If it’s a consequence of y or an alternative measure of y, then including it as a covariate could bias the coefficients of all the other covariates. If it’s because it’s not associated with y, then including it as a covariate is unlikely to be very helpful. Even if potential auxiliary variables are predictive of missingness, they won’t be helpful in reducing either bias or standard errors unless they are also correlated with the variable that has missing data.

I see, so including auxiliary variables as covariates in the mixed model will only reduce bias if they are associated with both missingness and with the outcome. In the case of them being alternative measures of y, why would this bias other coefficients?

Because such a variable will be correlated with both the error term and the other predictors.
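A quick simulation can illustrate the point about alternative measures of y. This is only a sketch with made-up numbers: `y` depends on `x` with a true coefficient of 2, and `z` is a hypothetical noisy second measure of `y`. Because `z` contains the error term of `y` and is also correlated with `x`, adding it as a covariate changes the coefficient on `x`.

```python
import random
import statistics

def simulate(n=5000, seed=0):
    """Return (x, z, y) rows where z is y plus independent measurement noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1 + 2 * x + rng.gauss(0, 1)  # true coefficient on x is 2
        z = y + rng.gauss(0, 1)          # alternative measure of y
        rows.append((x, z, y))
    return rows

def centered_sums(rows):
    """Centered sums of squares and cross-products for x, z, and y."""
    mx = statistics.fmean([x for x, _, _ in rows])
    mz = statistics.fmean([z for _, z, _ in rows])
    my = statistics.fmean([y for _, _, y in rows])
    sxx = sum((x - mx) ** 2 for x, _, _ in rows)
    szz = sum((z - mz) ** 2 for _, z, _ in rows)
    sxz = sum((x - mx) * (z - mz) for x, z, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, _, y in rows)
    szy = sum((z - mz) * (y - my) for _, z, y in rows)
    return sxx, szz, sxz, sxy, szy

def coef_x_alone(rows):
    """Coefficient on x from regressing y on x only."""
    sxx, _, _, sxy, _ = centered_sums(rows)
    return sxy / sxx

def coef_x_given_z(rows):
    """Coefficient on x from regressing y on both x and z."""
    sxx, szz, sxz, sxy, szy = centered_sums(rows)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

rows = simulate()
print("y on x:", round(coef_x_alone(rows), 2))
print("y on x and z:", round(coef_x_given_z(rows), 2))
```

With these particular variances, the coefficient on `x` should drop from about 2 to about 1 once `z` is included. The size of the distortion depends entirely on the assumed noise levels, so the specific numbers should not be read as general.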

I wonder whether multiple imputation is a good solution to sample selection bias. Take the famous example of estimating gender wage inequalities that Heckman worked on. In that analysis, the fact that many women weren’t in the workforce led to underestimation of the wage gap. If we had a dataset with many women not working, would it make sense to multiply impute income for all of them to get a better comparison against men?

It’s a possible solution, but not with conventional imputation software which assumes missing at random. And even if you had the right software, there’s no reason to think it would be any better than standard Heckman methods. The one attraction of MI in this situation is that it might be easier to do a sensitivity analysis to different forms or degrees of not missing at random.

Hi, Paul.

It seems to me that there are situations in which partial information is easier to incorporate through multiple imputation. Consider, for example, a multi-item scale which is “missing” because a single item was unobserved. With MI, one could impute this missing item score and use it along with the observed item scores to calculate the response of interest. Could ML incorporate such appropriate or efficient use of the partially-observed item scores?

In principle, yes, ML could be used in this situation. You would have to postulate a latent variable with each item as an indicator of that variable. Standard SEM packages can do this, but the model could end up being rather complicated.

Hi Dr. Allison – I know ML has several advantages over MI, but I’ve decided that MI may be more appropriate for my data. Thus, my question relates to MI. I have several categorical predictors with missing values that I’ve recoded into dummy variables and my dependent variable is also dummy coded. My question is, do I need to incorporate all categorical predictors and their respective dummies in the imputation model (as auxiliary variables)? Or should I just include the predictor variables that I’ll use in the final analysis model (in this case logistic regression)? Thanks!

How many variables are you dealing with? How many will be in the model, and how many will be excluded?

In the original data set, 3 dummy predictor variables and 4 categorical predictor variables (with multiple items each). I recoded the 4 categorical variables (for simplicity I’ll just refer to them as catvar 1, 2, 3, etc.) into multiple dummy variables each. For example: catvar 1 (recoded into 2 dummies); catvar 2 (recoded into 3 dummies); catvar 3 (recoded into 2 dummies); catvar 4 (recoded into 3 dummies).

This stuff is pretty exploratory, so at this stage in the analysis I’m running multiple models with different combinations of variables to check for robustness, etc. Thus, at a minimum the final model could include 7 dummy variables. At a maximum, the final model could have 13 variables, with all recoded dummies included. However, I will report all findings to be sure. I hope this makes sense.

OK, then I would just use all of them in the imputation. For the imputation, I would probably go with a fully conditional method (aka MICE), treating the categorical variables as single variables and imputing them by multinomial logistic regression.

Thanks for the feedback. Second question: In my final model I will have all binary predictors and a binary DV. Am I correct in assuming that MI is more appropriate than ML in this situation? I can’t seem to find enough consensus in the literature.

Yeah, I’d probably go the MI route.

Hi Dr. Allison,

I’ve read your post and the full article on ML vs. MI, which convinced me that FIML is simpler than MI and at least as efficient, at least for my data. I’m a SAS user, and I found PROC CALIS with METHOD=FIML easy to use. However, I still can’t figure out how to get confidence intervals and p-values for the path estimates.

Second, what if the variables we want to use in the path model have some missing data themselves? Can they still be used effectively, or would an MI method be more appropriate?

Thanks,

You can get confidence intervals by putting the option CL on the PROC statement. As for p-values, the most recent versions of CALIS report them. Older versions do not.

There should be no problem with including other variables that have missing data.