When I teach my seminar on Missing Data, the most common question I get is “What can I do if my data are not missing at random?” My usual answer is “Not much,” followed by “but you can do a sensitivity analysis.” Everyone agrees that a sensitivity analysis is essential for investigating possible violations of the missing at random assumption. But, unfortunately, there’s little guidance in the literature on how to actually do one. And, until recently, no commercial software package had options for doing a sensitivity analysis.
I’m happy to report that PROC MI in SAS 9.4 has several options for doing a sensitivity analysis based on multiple imputation. I’ve recently had a chance to read the documentation and do a few test runs. A little later in this post, I’ll tell you what I’ve learned.
But first, some background. There are two widely-used “modern” methods for handling missing data: multiple imputation and maximum likelihood. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables. This assumption would be violated, for example, if people with high income were less likely to report their income.
The MAR assumption does allow missingness to depend on anything that you observe; it just can’t depend on things that you don’t observe. MAR is not a testable assumption. You may suspect that your data are not missing at random, but nothing in your data will tell you whether or not that’s the case.
It’s possible to do multiple imputation or maximum likelihood when data are missing not at random (MNAR), but to do that, you first need to specify a model for the missing data mechanism—that is, a model of how missingness depends on both observed and unobserved quantities. That raises three issues:
- For any data set, there are an infinite number of possible MNAR models.
- Nothing in the data will tell you which of those models is better than another.
- Results may depend heavily on which model you choose.
That’s a dangerous combination. And it’s why a sensitivity analysis is so important. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, then you can feel pretty confident that your conclusions would hold up even if the data are not missing at random. On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy.
Keep in mind that this is not a test. Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways.
There’s nothing very deep about this. The hard part is figuring out how to come up with a reasonable set of models. It’s particularly hard if you’re using maximum likelihood to handle the missing data. Elsewhere I’ve argued for the advantages of maximum likelihood over multiple imputation. But one attraction of multiple imputation is that it’s easier to do a decent sensitivity analysis.
That’s where the new options for PROC MI come in. I think they’re easiest to explain by way of an example. In my Missing Data seminar, I use an example data set called COLLEGE, which contains information on 1302 four-year colleges and universities in the U.S. The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier.
There are lots of missing data for the five predictor variables, but we’re going to focus on the 98 colleges that did not report their graduation rates. It’s plausible that colleges with low graduation rates would be less likely to report those rates in order to avoid adverse publicity. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower (unobserved) graduation rates than those colleges that report their graduation rates, controlling for other variables.
PROC MI allows us to build that supposition into the multiple imputation model. We can, for example, specify an imputation model that says that the imputed values of GRADRAT are only 80% of what they would be if the data were actually missing at random. Here’s the SAS code for doing that:
PROC MI DATA=MY.COLLEGE OUT=MIOUT;
VAR GRADRAT CSAT LENROLL STUFAC PRIVATE RMBRD ACT;
FCS;
MNAR ADJUST(GRADRAT / SCALE=.80);
RUN;
This program produces five completed data sets (the default number of imputations), stacked in a single output data set and distinguished by the _IMPUTATION_ variable, with missing values imputed by linear regression. For a sensitivity analysis, the essential ingredient here is the MNAR statement. The ADJUST option says to multiply the imputed values of GRADRAT by .80 at each step of the iterative process. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1.
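To get a feel for why varying SCALE matters, here is a minimal Python sketch of the sensitivity loop, not SAS and not PROC MI’s algorithm: it uses crude mean imputation as a stand-in for the real imputation step, on made-up data with a built-in MNAR mechanism, and simply tracks how the estimate moves as the assumed scale factor departs from 1.0 (pure MAR). All numbers and names are invented for illustration.

```python
import math
import random
import statistics

random.seed(42)

# Toy data: the true mean of y is about 50, but low values of y are
# more likely to be missing -- an MNAR mechanism we control here.
y = [random.gauss(50, 10) for _ in range(5000)]
missing = [random.random() < 1 / (1 + math.exp((v - 40) / 5)) for v in y]

observed = [v for v, m in zip(y, missing) if not m]
obs_mean = statistics.mean(observed)

estimates = {}
for scale in (1.0, 0.9, 0.8):
    # Crude stand-in for MAR imputation (mean imputation), followed by
    # the MNAR adjustment: multiply each imputed value by `scale`.
    completed = [obs_mean * scale if m else v for v, m in zip(y, missing)]
    estimates[scale] = statistics.mean(completed)

# The sensitivity question: how much do the estimates move as the
# assumed scale factor departs from 1.0?  If they barely move, MNAR of
# this form would not change your conclusions.
```

With low values over-represented among the missing, shrinking the imputed values pulls the estimated mean down, which is exactly the kind of movement a sensitivity analysis is meant to reveal.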
The MNAR statement only works if you specify the MONOTONE method or the FCS method, which is what I used here. FCS stands for fully conditional specification, and it’s equivalent to the chained equation or sequential regression method used in many other packages. The MNAR statement does not work if you use the default MCMC method. [It could probably be done for MCMC, but that would mess up the elegant computational algorithm. FCS is already a “messy” algorithm, so a little more mess is no big deal].
Instead of multiplying the imputed values by some constant, we could add or subtract a constant, for example,
MNAR ADJUST(GRADRAT / SHIFT = -20);
This would subtract 20 points from any imputed graduation rates. Again, to do a sensitivity analysis, you’d want to try out a range of different SHIFT values to see what effect that would have on your results.
The SHIFT and SCALE options can be combined. The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal or nominal), except that the changes are applied on the log-odds scale.
Another option allows you to restrict the adjustments to certain subsets of the data, e.g.,
MNAR ADJUST(GRADRAT / SHIFT = -20 ADJUSTOBS=(PRIVATE='1'));
This says to subtract 20 points from the imputed values of graduation rates, but only for private colleges, not for public colleges. If you use the ADJUSTOBS option, the subsetting variable (PRIVATE in this case) should be listed in a CLASS statement.
If you don’t use SAS, you can do adjustments like this using other multiple imputation software along with a little programming. You first produce data sets under the MAR assumption and then you modify imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.
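As a concrete example of that “little programming,” here is a Python sketch of the post-hoc version of the SHIFT and SCALE adjustments. It assumes you already have a completed data set from your MI software and a flag marking which values were imputed; the function name and data are invented for illustration.

```python
def adjust_imputations(values, was_missing, scale=1.0, shift=0.0):
    """Apply an MNAR scale/shift adjustment to imputed values only.

    Hypothetical helper, not part of any MI package.
    values: completed data (observed + imputed) for one variable
    was_missing: flags marking which entries were imputed
    """
    return [v * scale + shift if m else v
            for v, m in zip(values, was_missing)]

# Observed values stay put; only the imputed ones are adjusted.
completed = [60.0, 75.0, 80.0, 55.0]
was_missing = [False, True, False, True]
adjusted = adjust_imputations(completed, was_missing, scale=0.8)
# -> [60.0, 60.0, 80.0, 44.0]
```

Note the limitation described above: an adjustment applied after imputation is finished cannot feed back into the imputation of other variables, which is exactly what the per-iteration approach in PROC MI buys you.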
This particular way of doing a sensitivity analysis is based on something called pattern-mixture models for MNAR. You can read more about pattern-mixture models in Chapter 10 of the book Multiple Imputation and Its Application by James Carpenter and Michael Kenward.
Finally, it’s worth noting that the inclusion of appropriate auxiliary variables into the imputation model can go a long way toward reducing the likelihood of MNAR. The best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing. For more on auxiliary variables, see this recent paper by Tenko Raykov, one of Statistical Horizons’ instructors.