I thought I knew what it meant for data to be missing at random. After all, I’ve written a book titled Missing Data, and I’ve been teaching courses on missing data for more than 15 years. I really ought to know what missing at random means.
But now that I’m in the process of revising that book, I’ve come to the conclusion that missing at random (MAR) is more complicated than I thought. In fact, the MAR assumption has some peculiar features that make me wonder if it can ever be truly satisfied in common situations when more than one variable has missing data.
First, a little background. There are two modern methods for handling missing data that have achieved widespread popularity: maximum likelihood and multiple imputation. As implemented in most software packages, both of these methods depend on the assumption that the data are missing at random.
Here’s how I described the MAR assumption in my book:
Data on Y are said to be missing at random if the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables X and Y, where X always is observed and Y sometimes is missing. MAR means that
Pr(Y missing| Y, X) = Pr(Y missing| X).
In words, this expression means that the conditional probability of missing data on Y, given both Y and X, is equal to the probability of missing data on Y given X alone.
Suppose that Y is body weight and X is gender. Then if women are less likely to disclose their weight, Y can be MAR because the probability of missing Y depends on X, which is observed. But what if both men and women are less likely to disclose their weight if they’re overweight? Then values are not MAR, since the probability Y is missing depends on Y net of X.
That definition works fine when only one variable has missing data. But things get more complicated when two or more variables have missing data. I learned this from Example 1.13 in the classic book by Little and Rubin (2002). That example deals with one of the simplest cases, when there are just two variables, X and Y. Suppose that both of them have missing data, and the missingness falls into four patterns:
1. X and Y both observed.
2. X observed but Y missing.
3. X missing but Y observed.
4. X and Y both missing.
For each of these patterns, there is a corresponding probability of observing that pattern. The MAR assumption is about how those four probabilities depend on the values of X and Y. Specifically, MAR says that the probability of each pattern may depend on the variables observed in that pattern, but not on variables that are not observed (after conditioning on the observed variables).
Here’s how this plays out:
The probability of getting pattern 1 (both X and Y observed) may depend on the values of both X and Y. It doesn’t have to depend on them, but MAR allows for that dependence.
The probability of pattern 2 (X observed and Y missing) may depend on X, but it can’t depend on Y.
The probability of pattern 3 (X missing and Y observed) may depend on Y, but it can’t depend on X.
The probability of pattern 4 (both variables missing) can’t depend on either X or Y.
What’s important to stress here is that MAR is a statement about the probabilities of observing particular patterns of missingness. It’s not about the probabilities that individual variables are missing.
Here’s the thing that troubles me about this: except in special cases—like missing completely at random or monotone missing data—it is surprisingly difficult to generate simulated data that would satisfy all four of the above conditions. And my personal rule is that if I can’t simulate it, I don’t really understand it.
There are three approaches to simulation that would seem like natural starting points:
1. Two dichotomous logistic regression models, one in which the probability that Y is missing depends on X, and the other in which the probability that X is missing depends on Y. The problem with that approach is that, by implication, the probability that both variables are missing would depend on the values of both variables. And that’s disallowed by MAR.
2. A multinomial logit model for the four probabilities, with both X and Y as predictors. As noted by Robins and Gill (1997), however, this would violate the MAR conditions for patterns 2, 3, and 4.
3. Specify a separate model for each of the four patterns. That’s problematic because the four probabilities have to sum to 1, and the probability of pattern 4 has to be a constant q. Consequently, the three other probabilities have to sum to 1-q. It’s not obvious how to impose that constraint without violating other MAR conditions, especially when X and Y are continuous. I’ve done it for dichotomous X and Y, but that necessitated some trial and error in order to avoid violating the constraint.
So none of those approaches is satisfactory. However, when I presented this problem to Paul von Hippel, who is co-authoring the revision to Missing Data, he came up with the following algorithm for two variables:
1. Start with a sample of data on two variables X and Y, both fully observed.
2. Randomly divide the sample into three groups. The proportions in each group can be whatever you choose.
3. In Group 1, let the probability that Y is missing be a function of X. A logistic regression model would be a natural choice, but it could be a probit model or something else.
4. In Group 2, let the probability that X is missing be a function of Y. Again, there are many possibilities for this function.
5. In Group 3, set both X and Y to be missing.
That’s it. I’ve verified that it works with a simulation. However, I later learned that it’s been done before. It’s one example of class of algorithms for MAR proposed by Robins and Gill (1997) called Randomized Monotone Missingness models. This class of models can handle any number of variables with missing data. These models also allow the probability of falling into each of the subgroups to depend on other variables that are always observed.
So, yes, it’s possible to simulate MAR when two are more variables are missing, and that’s progress. But is it plausible that this algorithm corresponds to any real-world situations? I’m skeptical. The key question is, why would the mechanism producing missing data be different in different subgroups?
For those of us who regularly use multiple imputation or maximum likelihood to handle missing data, I think we are just going to have to live with the MAR assumption, despite its peculiarities. It has been shown to be the weakest assumption that still implies ignorability—that is, the ability to make inferences without having to model the missing data mechanism. So if you’re not satisfied with MAR, you’ll have to engage in a much more complicated modeling process, one that still involves untestable assumptions.
Before concluding, I should mention that there is one situation where MAR makes sense and can also be easily simulated. That’s the case of a longitudinal study with drop out as the only cause of missing data. In that case, MAR means that drop out can depend on anything that is observed before the drop out. But it can’t depend on anything that would have been observed after the drop out.
For some applications, you might still suspect that drop out depends on what would have been observed immediately after the drop out. But at least MAR in that setting is an assumption that is plausible, understandable, and algorithmically reproducible.
Little, Roderick J.A., and Donald B. Rubin (2002) Statistical Analysis with Missing Data. Wiley.
Robins, James M., and Richard D. Gill (1997) “Non-response models for the analysis of non-monotone ignorable missing data.” Statistics in Medicine 16: 39-56.