Allison PicWhen I teach my seminar on Missing Data, the most common question I get is “What can I do if my data are not missing at random?” My usual answer is “Not much,” followed by “but you can do a sensitivity analysis.” Everyone agrees that a sensitivity analysis is essential for investigating possible violations of the missing at random assumption. But, unfortunately, there’s little guidance in the literature on how to actually do one. And, until recently, no commercial software package had options for doing a sensitivity analysis.

I’m happy to report that PROC MI in SAS 9.4 has several options for doing a sensitivity analysis based on multiple imputation. I’ve recently had a chance to read the documentation and do a few test runs. A little later in this post, I’ll tell you what I’ve learned.

But first, some background. There are two widely-used “modern” methods for handling missing data: multiple imputation and maximum likelihood. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables. This assumption would be violated, for example, if people with high income were less likely to report their income.

The MAR assumption does allow missingness to depend on anything that you observe, it just can’t depend on things that you don’t observe. MAR is not a testable assumption. You may suspect that your data are not missing at random, but nothing in your data will tell you whether or not that’s the case.

It’s possible to do multiple imputation or maximum likelihood when data are missing not at random (MNAR), but to do that, you first need to specify a model for the missing data mechanism—that is, a model of how missingness depends on both observed and unobserved quantities. That raises three issues:

  • For any data set, there are an infinite number of possible MNAR models.
  • Nothing in the data will tell you which of those models is better than another.
  • Results may depend heavily on which model you choose.

That’s a dangerous combination. And it’s why a sensitivity analysis is so important. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, then you can feel pretty confident that, even if data are not missing at random, that would not compromise your conclusions. On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy.

Keep in mind that this is not a test. Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways.

There’s nothing very deep about this. The hard part is figuring out how to come up with a reasonable set of models. It’s particularly hard if you’re using maximum likelihood to handle the missing data. Elsewhere I’ve argued for the advantages of maximum likelihood over multiple imputation. But one attraction of multiple imputation is that it’s easier to do a decent sensitivity analysis.

That’s where the new options for PROC MI come in. I think they’re easiest to explain by way of an example. In my Missing Data seminar, I use an example data set called COLLEGE, which contains information on 1302 four-year colleges and universities in the U.S. The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier.

There are lots of missing data for the five predictor variables, but we’re going to focus on the 98 colleges that did not report their graduation rates. It’s plausible that colleges with low graduation rates would be less likely to report those rates in order to avoid adverse publicity. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower (unobserved) graduation rates than those colleges that report their graduation rates, controlling for other variables.

PROC MI allows us to build that supposition into the multiple imputation model. We can, for example, specify an imputation model that says that the imputed values of GRADRAT are only 80% of what they would be if the data were actually missing at random. Here’s the SAS code for doing that:

      FCS ;

This program produces five data sets, with missing data imputed by linear regression. For a sensitivity analysis, the essential ingredient here is the MNAR statement. The ADJUST option says to multiply the imputed values of GRADRAT by .80 at each step of the iterative process. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1.

The MNAR statement only works if you specify the MONOTONE method or the FCS method, which is what I used here. FCS stands for fully conditional specification, and it’s equivalent to the chained equation or sequential regression method used in many other packages. The MNAR statement does not work if you use the default MCMC method. [It could probably be done for MCMC, but that would mess up the elegant computational algorithm. FCS is already a “messy” algorithm, so a little more mess is no big deal].

 Instead of multiplying the imputed values by some constant, we could add or subtract a constant, for example,


This would subtract 20 points from any imputed graduation rates. Again, to do a sensitivity analysis, you’d want to try out a range of different SHIFT values to see what effect that would have on your results.

The SHIFT and SCALE options can be combined. The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal or nominal), except that the changes are applied on the log-odds scale.

Another option allows you to restrict the adjustments to certain subsets of the data, e.g.,


This says to subtract 20 points from the imputed values of graduation rates, but only for private colleges, not for public colleges. If you use the ADJUSTOBS option, the subsetting variable (PRIVATE in this case) should be listed in a CLASS statement.  

There are also other options, which you can read about here. An introductory article written by the guy who developed PROC MI, Yang Yuan, can be downloaded here.

If you don’t use SAS, you can do adjustments like this using other multiple imputation software along with a little programming. You first produce data sets under the MAR assumption and then you modify imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.

This particular way of doing a sensitivity analysis is based on something called pattern-mixture models for MNAR. You can read more about pattern-mixture models in Chapter 10 of the book Multiple Imputation and Its Application by James Carpenter and Michael Kenward.

Finally, it’s worth noting that the inclusion of appropriate auxiliary variables into the imputation model can go a long way toward reducing the likelihood of MNAR. The best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing. For more on auxiliary variables, see this recent paper by Tenko Raykov, one of Statistical Horizons’ instructors.

Allison PicFor several years now, I’ve been promoting something I called the “hybrid method” as a way of analyzing longitudinal and other forms of clustered data. My books Fixed Effects Regression Methods for Longitudinal Data Using SAS (2005) and Fixed Effects Regression Models (2009) both devoted quite a few pages to this methodology. However, recent research has led me to be a little more cautious about this approach, especially for logistic regression models.

Like other fixed effects methods, the hybrid method provides a way of controlling for all cluster-level covariates, whether observed or not. It is sometimes called the “poor man’s conditional likelihood” (Neuhaus and McCulloch 2006) or the “between-within (BW) method” (Sjölander et al. 2013). Actually, I like that second name so much that I’m going to start using it in all future teaching and writing.

First introduced by Neuhaus and Kalbfleisch (1998), the BW method embeds a fixed effects estimator within the framework of a random effects (mixed) model, thereby enabling one to reap several benefits of both approaches. It’s especially attractive for models like ordinal logistic regression or negative binomial regression where more standard fixed effects methods (like conditional likelihood) are not available.

I’ll review the BW method in a moment, but first let me tell you why I’m writing this post. Despite the many attractions of the method, I’ve long been bothered by the fact that there was no proof that it did what was claimed—that is, provide consistent (and, hence, approximately unbiased) estimates of the causal parameters of interest. For linear models, the BW method produces exactly the same results as the standard fixed effects method, so that’s reassuring. But for logistic regression, the BW estimates are not identical to those produced by conditional likelihood (the gold standard)—close, but no cigar.

I’m glad to report that there’s been some real progress on this front in recent years, but not all of it is encouraging. Goetgeluk and Vansteelandt (2008) proved that the BW method produces consistent estimates in the linear case—something that was already pretty apparent. However, they also showed that the BW method could yield biased estimates for some nonlinear models, including logistic and Poisson regression. Fortunately, they also observed that the biases were typically small in most practical settings.

Brumback et al. (2010) also showed by simulation that the BW method, when applied to binary logistic regression, could produce inconsistent estimates of the within-cluster causal parameters, although, again, the biases were small. But in a later paper, Brumback et al. (2012) presented a simulation in which the downward bias was as high as 45%.

So, does that mean that we should no longer use the BW method for logistic regression? I don’t think so. To explain that, I need to get a little more technical. Let Yij be a binary (0,1) dependent variable for the jth individual in the ith cluster. Let Xij be a column vector of predictor variables, again, for individual i in cluster j. (If you’re not comfortable with vectors, just think of X as a single variable). We postulate the model

     logit(Pr(Yij = 1| Xij, αi)) = μ + βXij + αi                                                                                     (1)

where β is a row vector of coefficients, and αi is an unobserved “effect” of being in cluster i. It can be thought of as representing the combined effects of all cluster-level variables that are not included in the model.

How can we consistently estimate β? Well, we could estimate a generalized linear mixed model, treating α as a normally distributed random variable with mean 0 and constant variance. This can be done in SAS with PROC GLIMMIX or in Stata with the melogit command. But standard implementations assume that α is independent of X, which means that we aren’t really controlling for α as a potential confounder. If α is actually correlated with X, you’ll get biased estimates of β using this method, and the bias could be severe.

A second approach is to use conditional likelihood, which is known to produce consistent estimates of β even if α is correlated with X. You can do this with the clogit command in Stata or with PROC LOGISTIC using the STRATA statement.

A third approach is the BW method, which says to decompose X into a within-cluster component and a between-cluster component. Letting Xi be a vector of cluster-specific means (sorry to put the bar underneath), we estimate the following generalized linear mixed model

         logit(Pr(Yij = 1| Xij, Xi, αi)) = μ + βW(XijXi) +  βBXi + αi                                                       (2)

using standard software like GLIMMIX or melogit.

It’s long been believed that the estimate of βW, because it depends only on within-cluster variation, controls for confounding with α and produces consistent estimates of β in equation (1). We now know that that’s not true, at least not in general. But there is one condition in which it is true. Specifically, it’s true when αi in (1) is a linear function of Xi, plus a random error that is independent of ij (Brumback et al. 2010). It’s reasonable to suppose, then, that the performance of the BW method for logistic regression depends, at least in part, on how closely this linearity assumption is satisfied. I’ll make use of that idea shortly.

Why not just do conditional likelihood, which we know will give consistent estimates under more general conditions? First, there are lots of non-linear models for which conditional likelihood is just not possible. Those models include binary probit, binary complementary log-log, cumulative (ordered) logit, and negative binomial regression.

Second, even for binary logit, there are many attractions of the BW method that conditional likelihood can’t provide: random slopes, a test for confounding, the ability to include cluster-level covariates, and the ability to allow for higher-levels of clustering.

As previously mentioned, all the available evidence suggests that the bias of the BW method is small in most situations. The simulation by Brumback et al. that yielded 45% downward bias was very extreme: the cluster-level effect was highly correlated with the x’s, it had an dominating effect on the outcome, and it depended on the maximum of the x’s across individuals, not their mean. Also, there were only two individuals per cluster, which tends to magnify any other sources of bias.

Here’s a modest proposal for how to reduce any biases that may occur. First, it’s easy to show that the following equation is the exact equivalent of model (2):

           logit(Pr(Yij = 1| Xij, Xi, αi)) = μ + βWXij +  (βBβW)Xi + αi                                                       (3)

This tells us that as long as the means are in the equation, you don’t have to express Xij as deviations from those means.

To check or allow for nonlinearity, I suggest adding polynomial functions of the means to equation (3), such as

          logit(Pr(Yij = 1| Xij, Xi, αi)) = μ + βWXij + θ1Xθ2Xi2  +θ3Xi3 + α                                          (4)

You could even include other cluster-level functions of the Xijs, such as their standard deviation. If the additional terms are not statistically significant, and if the estimate of βW doesn’t change much, then you can feel more confident that bias is not an issue.

To sum up, the BW/hybrid method is not quite as good as I once thought, at least not for non-linear models. But it’s certainly better than doing conventional random effects/mixed models, and usually comes pretty close to the mark. And, lastly, there may be ways to improve it.


Allison, PD (2005) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: The SAS Institute.

Allison, PD (2009) Fixed Effects Regression Models. Thousand Oaks, CA: Sage Publications.

Brumback BA, AB Dailey, LC Brumback, MD Livingston, Z He (2010) “Adjusting for confounding by cluster using generalized linear mixed models.” Statistics and Probability Letters 80:1650–1654.

Brumback BA, HW Zheng, and AB Dailey (2012) “Adjusting for confounding by neighborhood using generalized linear mixed models and complex survey data.” Statistics in Medicine 32: 1313–1324.

Goetgeluk, S and S Vansteelandt (2008) “Conditional generalized estimating equations for the analysis of clustered and longitudinal data.” Biometrics 64: 772-780.

Neuhaus, JM and JD Kalbfleisch (1998) “Between- and within-cluster covariate effects in the analysis of clustered data.” Biometrics 54: (2) 638-645.

Neuhaus, JM and CE McCulloch (2006) “Separating between- and within-cluster covariate effects by using conditional and partitioning methods.” Journal of the Royal Statistical Society Series B 68: 859–872.

Sjölander, A, P Lichtenstein, H Larsson and Y Pawitan (2013) “Between–within models for survival analysis.” Statistics in Medicine 32: 3067–3076