For the analysis of binary data, logistic regression dominates all other methods in both the social and biomedical sciences. It wasn’t always this way. In a 1934 article in Science, Charles Bliss proposed the probit function for analyzing binary data, and that method was later popularized in David Finney’s 1947 book Probit Analysis. For many years, probit was the method of choice in biological research.

In a 1944 article in the Journal of the American Statistical Association, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book Analysis of Binary Data. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite. 

So what is it about logistic regression that makes it so popular?  In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there is something special about the logit link that gives it a natural advantage over alternative link functions.

First, a brief review. For binary data, the goal is to model the probability p that one of two outcomes occurs. The logit function is log[p/(1-p)], which varies between -∞ and +∞ as p varies between 0 and 1. The logistic regression model says that

            log[p/(1-p)] = b0 + b1x1 + … + bkxk

or, equivalently,

            p = 1/(1 + exp{-(b0 + b1x1 + … + bkxk)})

Estimation of the b coefficients is usually accomplished by maximum likelihood.

In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)

What’s most important about the logit link is that it guarantees that p is bounded by 0 and 1, no matter what the b’s and the x’s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions.  But there are a few things that are special about logit:

1. If you exponentiate the coefficients, you get adjusted odds ratios. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.

2. With logit, you can do disproportionate stratified random sampling on the dependent variable without biasing the coefficients. For example, you could construct a sample that includes all of the events, and a 10% random sample of the non-events. This property is the justification for the widely-used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property. (A sketch of why this works is given just after point 4.)

3. You can do exact logistic regression. With conventional maximum likelihood estimation, the p-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables) exact p-values are obtained by enumeration of all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.

4. You can do conditional logistic regression. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function. (A sketch of how the conditioning works is also given below.)
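To see why point 2 works, here is a quick sketch, assuming that selection into the sample depends only on the outcome. Suppose the population model is logit(p) = b0 + b1x, and suppose events are sampled with probability s1 and non-events with probability s0. Among sampled cases,

            Pr(event | x, sampled) = s1p/[s1p + s0(1-p)]

and taking the logit of both sides gives

            logit[Pr(event | x, sampled)] = log(s1/s0) + b0 + b1x

So the disproportionate sampling shifts only the intercept, by log(s1/s0), and leaves the slope b1 untouched. With other link functions, outcome-dependent sampling distorts the entire regression function rather than a single parameter.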
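To see how the conditioning in point 4 works, write the model for individual j in cluster i as logit(pij) = ai + bxij, where ai is the cluster effect (notation introduced only for this sketch). Conditional on the total number of events in cluster i, the probability of the observed pattern of 0s and 1s is

            exp{b(sum of xij over the observed events)} / Σ exp{b(sum of xij over the events in an alternative pattern)}

where the sum in the denominator runs over all patterns with the same total number of events. Every term in both the numerator and denominator contains the same factor exp{ai × (number of events)}, so ai cancels. That cancellation happens only because the logit link puts ai inside an exponential that factors this way; with probit or complementary log-log, the cluster effects do not drop out.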

What is it about the logit link that makes it possible to do these useful variations of logistic regression?  The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important exponential family of distributions. In general form, the probability distribution for the exponential family can be written as

            f(x|b) = h(x)exp{T(x)’g(b)–A(b)}

In this formula, x is a vector of the data, b is a vector of parameters, and h, g, T, and A are known functions.  If g(b) = b, then b is said to be the natural parameter (or canonical parameter) of the distribution. T(x) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.

For the binomial distribution, this formula specializes to 

            f(x|p) = [N x] exp{x log[p/(1-p)] + N log(1-p)}

where N is the number of trials, x is the number of events and [N x] is the binomial coefficient.  We see immediately that log[p/(1-p)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.
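To see where that comes from, write out the binomial probability and rearrange:

            f(x|p) = [N x] p^x (1-p)^(N-x)

                    = [N x] exp{x log(p) + (N-x) log(1-p)}

                    = [N x] exp{x log[p/(1-p)] + N log(1-p)}

Matching this to the general form above, h(x) = [N x], the sufficient statistic is T(x) = x, the natural parameter is log[p/(1-p)], and A(p) = –N log(1-p).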

Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors. 

Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.  

There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables.  You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.

If you’d like to learn more about these and other methods for binary data, see my book Logistic Regression Using SAS: Theory and Application.  

Predictive mean matching (PMM) is an attractive way to do multiple imputation for missing data, especially for imputing quantitative variables that are not normally distributed. But, as I explain below, it’s also easy to do it the wrong way. 

Compared with standard methods based on linear regression and the normal distribution, PMM produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.

PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.

There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.

Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

  1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
  2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. Typically this would be a random draw from a multivariate normal distribution with mean b and the estimated covariance matrix of b (with an additional random draw for the residual variance). This step is necessary to produce sufficient variability in the imputed values, and is common to all “proper” methods for multiple imputation.
  3. Using b*, generate predicted values for x for all cases, both those with data missing on x and those with data present.
  4. For each case with missing x, identify a set of cases with observed x whose predicted values are close to the predicted value for the case with missing data.
  5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value.
  6. Repeat steps 2 through 5 for each completed data set. 

Unlike many methods of imputation, the purpose of the linear regression is not to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.

There are several variations to this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10. 

On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.

That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014). 

Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.

So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.
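For readers who want to see what this looks like in SAS, here is a minimal sketch. The data set name MYDATA, the fully observed variables Z1-Z3, and the variable X with missing data are all placeholders:

      proc mi data=mydata out=imputed nimpute=5 seed=37851;
         var z1 z2 z3 x;
         fcs regpmm(x / k=5);   /* predictive mean matching for x with 5 donors */
      run;

Changing K=5 to K=10 (or some other value) is all it takes to experiment with the size of the match set.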

The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But–as they say–more research is needed.


Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.

Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.

Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.

Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.

In my November and December posts, I extolled the virtues of SEM for estimating dynamic panel models. By combining fixed effects with lagged values of the predictor variables, I argued that this approach offers the best option for making causal inferences with non-experimental panel data. It controls for all time-invariant variables, whether observed or not, and it allows for the possibility of reciprocal causation. Simulation evidence strongly favors SEM over the more popular Arellano-Bond method for estimating these kinds of models.

Despite the potential for this method, I recently learned that it’s vulnerable to a very troubling kind of bias when the lag structure is misspecified. In the latest issue of Sociological Methods and Research, Stephen Vaisey and Andrew Miles showed by both simulation and formal proof that a positive contemporaneous effect will often show up as a negative effect when estimating a fixed effects model with a predictor that is lagged by one time unit. They concluded that, for most social science applications, “artifactual negative ‘effects’ will likely be the rule rather than the exception.”

Vaisey and Miles investigated this problem only for the case of three periods of data, no lagged effect of the dependent variable y on itself, and no true effect of y on x. In that case, maximum likelihood reduces to OLS regression using difference scores: y3 – y2 regressed on x2 – x1. They showed that the coefficient for x2 – x1 has an expected value that is exactly -.5 times the true coefficient.
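Here is a rough sketch of why that happens, under the simplifying assumptions that x has constant variance s2, is serially uncorrelated, and is unaffected by y, and that the true effect of x on y (call it b) is purely contemporaneous. Then

            y3 – y2 = b(x3 – x2) + (e3 – e2)

and the expected slope from regressing y3 – y2 on x2 – x1 is

            Cov(y3 – y2, x2 – x1) / Var(x2 – x1) = –b s2 / 2s2 = –b/2

The only covariance term that survives is –b Var(x2), because x2 appears in both difference scores with opposite signs. So a positive contemporaneous effect shows up as a negative “lagged” effect of half the size.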

My own simulations suggest that a sign reversal can also happen with four or more periods and a lagged dependent variable. And the effect of one variable on the other doesn’t have to be exactly contemporaneous. The reversal of sign can also occur if the correct lag is one week, but the estimated model specifies a lag of one year. Note that this artifact does not arise with random effects models. It’s specific to fixed effects models with lagged predictors. That should not be interpreted as an endorsement of random effects models, however, because they are much more prone to bias from omitted variables.

As noted by Vaisey and Miles, a 2011 article in the Journal of Quantitative Criminology may exemplify the problem of misspecified lags. Following my advice, Ousey, Wilcox and Fisher used the fixed effects SEM method to examine the relationship between victimization and offending. Numerous studies have found a positive, cross-sectional relationship between these variables: people who report being victims of crimes are also more likely to commit crimes. But Ousey et al. found negative effects of each variable on the other. Respondents who reported higher levels of offending in year t had lower levels of victimization in year t+1, after adjusting for fixed effects. And respondents with higher levels of victimization in year t had lower levels of offending in year t+1.

This surprising result could be real. But it could also occur if there is a positive effect of victimization on offending that is almost instantaneous rather than lagged by one year. And, finally, it could also occur if there is a positive, instantaneous effect of offending on victimization.

What can be done about this problem? Well, one implication is that more thought should go into the design of panel surveys. If you expect that changes in x will produce changes in y a month later, then collecting monthly data would be much better than collecting annual data. This could have the added advantage of reducing the total time for data collection, although it might also increase certain kinds of response bias.

What if your data have already been collected? Here’s a tentative recommendation that worked well in a few simulations. As a robustness check, estimate models that include both contemporaneous and lagged predictors. If a one-year lag is the correct specification, then the contemporaneous effect should be small and not statistically significant. If, on the other hand, the contemporaneous effect is large and significant, it should raise serious doubts about the validity of the method and the kinds of conclusions that can be drawn. It may be that the data are simply not suitable for separating the effect of x on y from the effect of y on x.

I tried this strategy on a subset of the data used by Ousey et al. to study victimization and offending. When both contemporaneous and lagged predictors were included, I found a strong positive effect of victimization on offending in the same year. The one-year lagged effect was negative but small and non-significant. The same thing happened in the reverse direction. Offending had a strong positive effect on victimization in the same year, but the lagged effect was negative and not significant. My take: these data don’t allow one to draw any firm conclusions about whether victimization affects offending or offending affects victimization. They certainly don’t provide a basis for claiming negative effects of each variable on the other.

Clearly this is a problem that needs a great deal more study. There is a substantial econometric literature on determining the number of lags needed for autoregressive models but, as far as I know, Vaisey and Miles are the first to identify this particular phenomenon.

By the way, Steve Vaisey teaches a highly-rated course for Statistical Horizons called Treatment Effects Analysis.  

This is a follow-up to last month’s post, in which I considered the use of panel data to answer questions about causal ordering: does x cause y or does y cause x? In the interim, I’ve done many more simulations to compare the two competing methods, Arellano-Bond and ML-SEM, and I’m going to report some key results here. If you want all the details, read my recent paper by clicking here. If you’d like to learn how to use these methods, check out my new seminar titled Longitudinal Data Analysis Using SEM.

Quick review: The basic approach is to assume a cross-lagged linear model, with y at time t affected by both x and y at time t-1, and x at time t also affected by both lagged variables. The equations are

 yit = b1xi(t-1) + b2yi(t-1) + ci + eit

xit = a1xi(t-1) + a2yi(t-1) + fi + dit

for i = 1,…, n, and t = 1,…, T.  

The terms ci and fi represent individual-specific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, eit and dit are assumed to represent pure random noise, independent of any variables measured at earlier time points. Additional exogenous variables could also be added to these equations.

Conventional estimation methods are biased because of the lagged dependent variable and because of the reciprocal relationship between the two variables. The most popular solution is the Arellano-Bond (A-B) method (or one of its cousins), but I have previously argued for the use of maximum likelihood (ML) as implemented in structural equation modeling (SEM) software.

Last month I presented very preliminary simulation results showing that ML-SEM had substantially lower mean-squared error (MSE) than A-B under a few conditions. Since then I’ve done simulations for 31 different sets of parameter values and data configurations. For each condition, I generated 1,000 samples, ran the two methods on each sample, and then calculated bias, mean squared error, and coverage for confidence intervals. Since the two equations are symmetrical, the focus is on the coefficients in the first equation, b1 for the effect of x on y, and b2 for the effect of y on itself.

The simulations for ML were done with PROC CALIS in SAS. I originally started with the sem command in Stata, but it had a lot of convergence problems for the smaller sample sizes. The A-B simulations were done in Stata with the xtabond command. I tried PROC PANEL in SAS, but couldn’t find any combination of options that produced approximately unbiased estimates.   

Here are some of the things I’ve learned:

Under every condition, ML showed little bias and quite accurate confidence interval coverage. That means that about 95% of the nominal 95% confidence intervals included the true value.

Except under “extreme” conditions, A-B also had little bias and reasonably accurate confidence interval coverage. 

However, compared with A-B, ML-SEM always showed less bias and smaller sampling variance. My standard of comparison is relative efficiency, which is the ratio of MSE for ML to MSE for A-B. (MSE is the sum of the sampling variance plus the squared bias.) Across 31 different conditions, relative efficiency of the two estimators ranged from .02 to .96, with a median of .50. To translate, if the relative efficiency is .50, you’d need twice as large a sample to get the same accuracy with A-B as with ML.  

Relative efficiency of the two estimators is strongly affected by the value of the parameter b2, the effect of yt-1 on yt.  As b2 gets close to 1, the A-B estimators for both b1 and b2 become badly biased (toward 0), and the sample variance increases, which is consistent with previous literature on the A-B estimator.  For ML, on the other hand, bias and variance are rather insensitive to the value of b2. Here are the numbers:


[Table: relative efficiency (MSE for ML / MSE for A-B) of the estimates of b1 and b2, at increasing values of b2]
Relative efficiency is strongly affected by the number of time points, but in the opposite direction for the two coefficients. Thus, relative efficiency for b1 increases almost linearly as the number of time points goes from 3 to 10.  But for b2, relative efficiency is highest at T=3, declines markedly for T=4 and T=5, and then remains stable. 


[Table: relative efficiency of the estimates of b1 and b2, by number of time points (T = 3 to 10)]
Relative efficiency is also strongly affected by the ratio of the variance of ci (the fixed effect) to the variance of eit (the pure random error). In the next table, I hold constant the variance of c and vary the standard deviation of e.


[Table: relative efficiency of the estimates of b1 and b2, as the standard deviation of e varies with the variance of c held constant]
Relative efficiency is not strongly affected by:

  • Sample size
  • The value of b1
  • The correlation between ci and fi, the two fixed-effects variables.

Because ML is based on the assumption of multivariate normality, one might suspect that A-B would do better than ML if the distributions were not normal. To check that out, I generated all the variables using a 2-df chi-square variable, which is highly skewed to the right.  ML still did great in this situation, and was still about twice as efficient as A-B.

In sum, ML-SEM outperforms A-B in every situation studied, by a very substantial margin.   

Does x cause y or does y cause x? Virtually everyone agrees that cross-sectional data are of no use in answering this question. The ideal, of course, would be to do two randomized experiments, one examining the effect of x on y, and the other focused on the reverse effect. Absent this, most social scientists would say that some kind of longitudinal data ought to do the trick. But what kinds of data are needed and how should they be analyzed?

In this post, I review some earlier work I’ve done on these questions, and I report new simulation results comparing the Arellano-Bond method with maximum likelihood (ML) using structural equation modeling (SEM) software. Arellano-Bond is hugely popular among economists, but not widely known in other disciplines. ML with SEM is a method that I’ve been advocating for almost 15 years (Allison 2000, 2005a, 2005b, 2009). Long story short: ML rules.

I focus on panel data in which we observe yit and xit for i = 1,…, n and t = 1,…, T. The proposed linear model allows for reciprocal, lagged effects of these two variables on each other:

yit = b1xi(t-1) + b2yi(t-1) + ci + eit

xit = a1xi(t-1) + a2yi(t-1) + fi + dit

The terms ci and fi represent individual-specific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, eit and dit are assumed to represent pure random noise, independent of any variables measured at earlier time points.

If all the assumptions are met, b1 can be interpreted as the causal effect of x on y, and a2 can be interpreted as the causal effect of y on x. This model can be elaborated in various ways to include, for example, other predictor variables, different lags, and coefficients that change over time.

Estimation of the model is not straightforward for reasons that are well known in the econometric literature. First, the presence of a lagged dependent variable as a predictor in each equation means that conventional fixed effects methods yield biased estimates of the coefficients under almost any condition. But even if the lagged dependent variables were excluded from the equations, the error term in each equation would still be correlated with all future values of both x and y. For example, e2 -> y2 -> x3.  So, again, conventional fixed effects will produce biased coefficients. 

Arellano and Bond (1991) solved these problems by using earlier lagged values of x and y as instrumental variables and by applying a generalized method of moments (GMM) estimator. Several software packages now implement this method, including SAS, Stata, LIMDEP, and the plm package for R. 

My solution to the problems has been to estimate each equation separately by ML using any SEM package (e.g., LISREL, Mplus, PROC CALIS in SAS, or sem in Stata).  Two “tricks” are necessary. Focusing on the first equation, fixed effects are accommodated by allowing c to be correlated with all measurements of x (as well as the initial measurement of y). Second, the error term e is allowed to be correlated with all future measurements of x. Analogous methods are used to estimate the second equation.  For details, see the SEM chapters in my 2005 and 2009 books.
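To make those two “tricks” concrete, the first equation is estimated as

yit = b1xi(t-1) + b2yi(t-1) + ci + eit,   t = 2, …, T

with the following covariance specification (just a restatement of the assumptions described above): ci is allowed to covary freely with xi1, …, xiT and with yi1; eit is allowed to covary freely with xis for s > t; and the covariances of eit with xis for s ≤ t are fixed at zero. The unrestricted covariances involving ci are what make this a fixed effects model; the unrestricted covariances between eit and later x’s are what allow x to be predetermined rather than strictly exogenous.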

In my 2005 paper, I presented simulation evidence that the ML-SEM method produces approximately unbiased estimates of the coefficients under a variety of conditions. For years, I’ve been promising to do a head-to-head comparison of ML with Arellano-Bond, but I’ve just now gotten around to doing it.

What I’m going to report here are some very preliminary but dramatic results. The model used to generate the data was one in which x has a positive effect on y, but y has a negative effect on x:

yit = .5xi(t-1) + .5yi(t-1) + ci + eit

xit = .5xi(t-1) – .5yi(t-1) + fi + dit

All variables have normal distributions, c has a positive correlation with x, f has a positive correlation with y, and c and f are positively correlated with each other. The baseline model had 5 time points (T=5), with sample sizes of 50, 100, 400, and 1600. Then, keeping the sample size at 400, I examined T=4 and T=10. For each condition I did 1000 replications.

I focus here on the coefficient for the effect of x on y in the first equation. For each condition, I calculated the mean squared error (MSE), which is the variance of the estimator plus its squared bias. There was little bias in either estimator, so the MSE primarily reflects sampling variance.

Here are the preliminary results:

[Table: mean squared error for the two estimators and their relative efficiency (MSE for ML / MSE for A-B), for N = 50, 100, 400, and 1600 with T=5, and for N=400 with T=4 and T=10]
The last column, relative efficiency, is the ratio of the MSE for ML to the MSE for A-B. With 5 time points, A-B is only about half as efficient as ML-SEM, for any sample size.  But the number of time points has a dramatic effect.  A-B is only 29% efficient for T=4 but 79% efficient for T=10. 

The next steps are to vary such things as the magnitudes of the coefficients, the variances of the error terms, and the correlations between c and f with each other and with the predictor variables.

Besides its efficiency advantage, the ML-SEM framework makes it easier than A-B to accomplish several things:

  • Handle missing data by FIML.
  • Relax various constraints, such as constant error variance or constant coefficients.
  • Construct a likelihood ratio test comparing fixed vs. random effects, the equivalent of the Hausman test which not infrequently breaks down.
  • Add an autoregressive structure to the time-specific error components.

Before concluding, I must mention that Hsiao et al. (2002) also did a simulation study to compare ML with a variety of other estimators for the panel model, including A-B. However, their approach to ML was very different than mine, and it has not been implemented in any commercial software packages. Hsiao et al. found that ML did better with respect to both bias and efficiency than any of the other estimators, under almost all conditions. Nevertheless, the differences between ML and A-B were much smaller than those reported here. 

If you’re reading this post, you should definitely read next month’s follow-up by clicking here.

To learn more about these and other methods for panel data, check out my seminars, Longitudinal Data Analysis Using SAS and Longitudinal Data Analysis Using Stata. Both will be offered in the spring of 2015.  Plus, I am offering a new, more advanced seminar titled Longitudinal Data Analysis Using SEM in Fort Myers, Florida, January 23-24. 


Allison, Paul D. (2000) “Inferring Causal Order from Panel Data.” Paper presented at the Ninth International Conference on Panel Data, June 22, Geneva, Switzerland.

Allison, Paul D. (2005a) “Causal Inference with Panel Data.” Paper presented at the Annual Meeting of the American Sociological Association, August, Philadelphia, PA.

Allison, Paul D. (2005b) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: SAS Institute.

Allison, Paul D. (2009) Fixed Effects Regression Models. Thousand Oaks, CA:  Sage Publications.

Arellano, M. and S. Bond (1991) “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations.” The Review of Economic Studies 58: 277-297.  

Hsiao, Cheng, M. Hashem Pesaran, and A. Kamil Tahmiscioglu (2002) “Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods.” Journal of Econometrics 109: 107-150.

When I teach my seminar on Missing Data, the most common question I get is “What can I do if my data are not missing at random?” My usual answer is “Not much,” followed by “but you can do a sensitivity analysis.” Everyone agrees that a sensitivity analysis is essential for investigating possible violations of the missing at random assumption. But, unfortunately, there’s little guidance in the literature on how to actually do one. And, until recently, no commercial software package had options for doing a sensitivity analysis.

I’m happy to report that PROC MI in SAS 9.4 has several options for doing a sensitivity analysis based on multiple imputation. I’ve recently had a chance to read the documentation and do a few test runs. A little later in this post, I’ll tell you what I’ve learned.

But first, some background. There are two widely-used “modern” methods for handling missing data: multiple imputation and maximum likelihood. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables. This assumption would be violated, for example, if people with high income were less likely to report their income.

The MAR assumption does allow missingness to depend on anything that you observe; it just can’t depend on things that you don’t observe. MAR is not a testable assumption. You may suspect that your data are not missing at random, but nothing in your data will tell you whether or not that’s the case.

It’s possible to do multiple imputation or maximum likelihood when data are missing not at random (MNAR), but to do that, you first need to specify a model for the missing data mechanism—that is, a model of how missingness depends on both observed and unobserved quantities. That raises three issues:

  • For any data set, there are an infinite number of possible MNAR models.
  • Nothing in the data will tell you which of those models is better than another.
  • Results may depend heavily on which model you choose.

That’s a dangerous combination. And it’s why a sensitivity analysis is so important. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, then you can feel pretty confident that, even if data are not missing at random, that would not compromise your conclusions. On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy.

Keep in mind that this is not a test. Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways.

There’s nothing very deep about this. The hard part is figuring out how to come up with a reasonable set of models. It’s particularly hard if you’re using maximum likelihood to handle the missing data. Elsewhere I’ve argued for the advantages of maximum likelihood over multiple imputation. But one attraction of multiple imputation is that it’s easier to do a decent sensitivity analysis.

That’s where the new options for PROC MI come in. I think they’re easiest to explain by way of an example. In my Missing Data seminar, I use an example data set called COLLEGE, which contains information on 1302 four-year colleges and universities in the U.S. The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier.

There are lots of missing data for the five predictor variables, but we’re going to focus on the 98 colleges that did not report their graduation rates. It’s plausible that colleges with low graduation rates would be less likely to report those rates in order to avoid adverse publicity. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower (unobserved) graduation rates than those colleges that report their graduation rates, controlling for other variables.

PROC MI allows us to build that supposition into the multiple imputation model. We can, for example, specify an imputation model that says that the imputed values of GRADRAT are only 80% of what they would be if the data were actually missing at random. The key ingredients are an FCS statement and an MNAR statement with the ADJUST option.
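Here is a minimal sketch of such a run. GRADRAT and the COLLEGE data set come from the example; the output data set name, the seed, and the placeholder variables X1-X5 (standing in for the actual predictors) are hypothetical:

      proc mi data=college out=miout nimpute=5 seed=45280;
         var x1-x5 gradrat;
         fcs;                                  /* fully conditional specification */
         mnar adjust(gradrat / scale=.8);      /* multiply imputed GRADRAT values by .80 */
      run;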

This program produces five data sets, with missing data imputed by linear regression. For a sensitivity analysis, the essential ingredient here is the MNAR statement. The ADJUST option says to multiply the imputed values of GRADRAT by .80 at each step of the iterative process. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1.

The MNAR statement only works if you specify the MONOTONE method or the FCS method, which is what I used here. FCS stands for fully conditional specification, and it’s equivalent to the chained equation or sequential regression method used in many other packages. The MNAR statement does not work if you use the default MCMC method. [It could probably be done for MCMC, but that would mess up the elegant computational algorithm. FCS is already a “messy” algorithm, so a little more mess is no big deal].

 Instead of multiplying the imputed values by some constant, we could add or subtract a constant, for example,
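Continuing the same sketch, the statement might look like this:

      mnar adjust(gradrat / shift=-20);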


This would subtract 20 points from any imputed graduation rates. Again, to do a sensitivity analysis, you’d want to try out a range of different SHIFT values to see what effect that would have on your results.

The SHIFT and SCALE options can be combined. The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal or nominal), except that the changes are applied on the log-odds scale.

Another option allows you to restrict the adjustments to certain subsets of the data, e.g.,
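Again as a sketch, and assuming that private colleges are coded PRIVATE=1 (that coding is an assumption):

      mnar adjust(gradrat / shift=-20 adjustobs=(private='1'));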


This says to subtract 20 points from the imputed values of graduation rates, but only for private colleges, not for public colleges. If you use the ADJUSTOBS option, the subsetting variable (PRIVATE in this case) should be listed in a CLASS statement.  

There are also other options, which you can read about here. An introductory article written by the guy who developed PROC MI, Yang Yuan, can be downloaded here.

If you don’t use SAS, you can do adjustments like this using other multiple imputation software along with a little programming. You first produce data sets under the MAR assumption and then you modify imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.

This particular way of doing a sensitivity analysis is based on something called pattern-mixture models for MNAR. You can read more about pattern-mixture models in Chapter 10 of the book Multiple Imputation and Its Application by James Carpenter and Michael Kenward.

Finally, it’s worth noting that the inclusion of appropriate auxiliary variables into the imputation model can go a long way toward reducing the likelihood of MNAR. The best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing. For more on auxiliary variables, see this recent paper by Tenko Raykov, one of Statistical Horizons’ instructors.

For several years now, I’ve been promoting something I called the “hybrid method” as a way of analyzing longitudinal and other forms of clustered data. My books Fixed Effects Regression Methods for Longitudinal Data Using SAS (2005) and Fixed Effects Regression Models (2009) both devoted quite a few pages to this methodology. However, recent research has led me to be a little more cautious about this approach, especially for logistic regression models.

Like other fixed effects methods, the hybrid method provides a way of controlling for all cluster-level covariates, whether observed or not. It is sometimes called the “poor man’s conditional likelihood” (Neuhaus and McCulloch 2006) or the “between-within (BW) method” (Sjölander et al. 2013). Actually, I like that second name so much that I’m going to start using it in all future teaching and writing.

First introduced by Neuhaus and Kalbfleisch (1998), the BW method embeds a fixed effects estimator within the framework of a random effects (mixed) model, thereby enabling one to reap several benefits of both approaches. It’s especially attractive for models like ordinal logistic regression or negative binomial regression where more standard fixed effects methods (like conditional likelihood) are not available.

I’ll review the BW method in a moment, but first let me tell you why I’m writing this post. Despite the many attractions of the method, I’ve long been bothered by the fact that there was no proof that it did what was claimed—that is, provide consistent (and, hence, approximately unbiased) estimates of the causal parameters of interest. For linear models, the BW method produces exactly the same results as the standard fixed effects method, so that’s reassuring. But for logistic regression, the BW estimates are not identical to those produced by conditional likelihood (the gold standard)—close, but no cigar.

I’m glad to report that there’s been some real progress on this front in recent years, but not all of it is encouraging. Goetgeluk and Vansteelandt (2008) proved that the BW method produces consistent estimates in the linear case—something that was already pretty apparent. However, they also showed that the BW method could yield biased estimates for some nonlinear models, including logistic and Poisson regression. Fortunately, they also observed that the biases were typically small in most practical settings.

Brumback et al. (2010) also showed by simulation that the BW method, when applied to binary logistic regression, could produce inconsistent estimates of the within-cluster causal parameters, although, again, the biases were small. But in a later paper, Brumback et al. (2012) presented a simulation in which the downward bias was as high as 45%.

So, does that mean that we should no longer use the BW method for logistic regression? I don’t think so. To explain that, I need to get a little more technical. Let Yij be a binary (0,1) dependent variable for the jth individual in the ith cluster. Let Xij be a column vector of predictor variables, again for individual j in cluster i. (If you’re not comfortable with vectors, just think of X as a single variable). We postulate the model

     logit(Pr(Yij = 1| Xij, αi)) = μ + βXij + αi                                                                                     (1)

where β is a row vector of coefficients, and αi is an unobserved “effect” of being in cluster i. It can be thought of as representing the combined effects of all cluster-level variables that are not included in the model.

How can we consistently estimate β? Well, we could estimate a generalized linear mixed model, treating α as a normally distributed random variable with mean 0 and constant variance. This can be done in SAS with PROC GLIMMIX or in Stata with the melogit command. But standard implementations assume that α is independent of X, which means that we aren’t really controlling for α as a potential confounder. If α is actually correlated with X, you’ll get biased estimates of β using this method, and the bias could be severe.

A second approach is to use conditional likelihood, which is known to produce consistent estimates of β even if α is correlated with X. You can do this with the clogit command in Stata or with PROC LOGISTIC using the STRATA statement.

A third approach is the BW method, which says to decompose X into a within-cluster component and a between-cluster component. Letting X̄i be the vector of cluster-specific means, we estimate the following generalized linear mixed model

         logit(Pr(Yij = 1| Xij, X̄i, αi)) = μ + βW(Xij – X̄i) + βBX̄i + αi                                                       (2)

using standard software like GLIMMIX or melogit.
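As an illustration, here is a sketch of how equation (2) could be fit with PROC GLIMMIX. The data set and variable names are hypothetical, and XDEV = Xij – X̄i and XMEAN = X̄i are assumed to have been computed beforehand (for example, with PROC MEANS and a DATA step):

      proc glimmix data=mydata method=quad;
         class cluster;
         model y(event='1') = xdev xmean / dist=binary link=logit solution;
         random intercept / subject=cluster;   /* the random cluster effect */
      run;

The coefficient on XDEV estimates βW and the coefficient on XMEAN estimates βB.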

It’s long been believed that the estimate of βW, because it depends only on within-cluster variation, controls for confounding with α and produces consistent estimates of β in equation (1). We now know that that’s not true, at least not in general. But there is one condition in which it is true. Specifically, it’s true when αi in (1) is a linear function of X̄i, plus a random error that is independent of the Xij (Brumback et al. 2010). It’s reasonable to suppose, then, that the performance of the BW method for logistic regression depends, at least in part, on how closely this linearity assumption is satisfied. I’ll make use of that idea shortly.

Why not just do conditional likelihood, which we know will give consistent estimates under more general conditions? First, there are lots of non-linear models for which conditional likelihood is just not possible. Those models include binary probit, binary complementary log-log, cumulative (ordered) logit, and negative binomial regression.

Second, even for binary logit, there are many attractions of the BW method that conditional likelihood can’t provide: random slopes, a test for confounding, the ability to include cluster-level covariates, and the ability to allow for higher-levels of clustering.

As previously mentioned, all the available evidence suggests that the bias of the BW method is small in most situations. The simulation by Brumback et al. that yielded 45% downward bias was very extreme: the cluster-level effect was highly correlated with the x’s, it had a dominating effect on the outcome, and it depended on the maximum of the x’s across individuals, not their mean. Also, there were only two individuals per cluster, which tends to magnify any other sources of bias.

Here’s a modest proposal for how to reduce any biases that may occur. First, it’s easy to show that the following equation is the exact equivalent of model (2):

           logit(Pr(Yij = 1| Xij, X̄i, αi)) = μ + βWXij + (βB – βW)X̄i + αi                                                       (3)

This tells us that as long as the means are in the equation, you don’t have to express Xij as deviations from those means.

To check or allow for nonlinearity, I suggest adding polynomial functions of the means to equation (3), such as

          logit(Pr(Yij = 1| Xij, X̄i, αi)) = μ + βWXij + θ1X̄i + θ2X̄i² + θ3X̄i³ + αi                                          (4)

You could even include other cluster-level functions of the Xijs, such as their standard deviation. If the additional terms are not statistically significant, and if the estimate of βW doesn’t change much, then you can feel more confident that bias is not an issue.

To sum up, the BW/hybrid method is not quite as good as I once thought, at least not for non-linear models. But it’s certainly better than doing conventional random effects/mixed models, and usually comes pretty close to the mark. And, lastly, there may be ways to improve it.


Allison, PD (2005) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: The SAS Institute.

Allison, PD (2009) Fixed Effects Regression Models. Thousand Oaks, CA: Sage Publications.

Brumback BA, AB Dailey, LC Brumback, MD Livingston, Z He (2010) “Adjusting for confounding by cluster using generalized linear mixed models.” Statistics and Probability Letters 80:1650–1654.

Brumback BA, HW Zheng, and AB Dailey (2012) “Adjusting for confounding by neighborhood using generalized linear mixed models and complex survey data.” Statistics in Medicine 32: 1313–1324.

Goetgeluk, S and S Vansteelandt (2008) “Conditional generalized estimating equations for the analysis of clustered and longitudinal data.” Biometrics 64: 772-780.

Neuhaus, JM and JD Kalbfleisch (1998) “Between- and within-cluster covariate effects in the analysis of clustered data.” Biometrics 54: (2) 638-645.

Neuhaus, JM and CE McCulloch (2006) “Separating between- and within-cluster covariate effects by using conditional and partitioning methods.” Journal of the Royal Statistical Society Series B 68: 859–872.

Sjölander, A, P Lichtenstein, H Larsson and Y Pawitan (2013) “Between–within models for survival analysis.” Statistics in Medicine 32: 3067–3076.


Let me tell you about my favorite new toy, the SAS® University Edition, which was just released on May 28. It’s essentially free SAS for anybody who wants it, and it has the potential to be a real game changer. SAS has long had a reputation for being one of the best statistical packages around, but also one of the most expensive. Last I checked, the starting price for a single-user license was around $10,000. Not surprisingly, virtually everyone who uses SAS gets their license through their employer or their university. 

So why is SAS now offering the core of its product line for free? For many years, SAS has made tons of money selling software to big companies, but its popularity among academics has been steadily waning. The decline in the academic market share has been especially steep in statistics departments where R has now become the preferred programming environment. This has created a serious problem for the SAS business model because the students of today are the business analysts of tomorrow. If they graduate with no experience using SAS, they will be far less likely to insist that their companies pay for a very costly software package. And the many companies that currently use SAS are finding it increasingly difficult to find new hires with SAS skills. 

SAS has made some previous attempts to solve this dilemma. Several years ago they released the SAS Learning Edition, which individuals could buy for around $100. But the functionality of that product was so limited that it was really only good for learning how to code in SAS. More recently, they introduced SAS On Demand which enabled academic users to access SAS via a web server. I tried using this system for a couple of courses, but I found it way too cumbersome, both for me and for my students.

With the University Edition (UE), SAS has finally produced a winner. Here are some things I like about it:

  • UE includes most of the SAS products that statistical analysts will need:  BASE, STAT, IML, and ACCESS.
  • It’s a completely local package and does NOT have to be connected to the Internet. 
  • UE can handle fairly large data sets (more on that later). 
  • When you sign on with an Internet connection, you are notified if an update is available. You can then update with the click of a button.
  • The browser-based interface, called SAS Studio, is a snap to learn and use.
  • SAS Studio will run in recent editions of all popular browsers, including Internet Explorer, Chrome, Safari and Firefox. 
  • UE can run on Macs, Windows, and Linux machines.
  • It runs smoothly and speedily, although not quite as fast as a regular installed version of SAS.
  • And did I mention that it’s absolutely free for anyone who wants it?

The license agreement states that UE can be used “solely for your own internal, non-commercial academic purposes.” As far as I can tell, there’s nothing to prevent someone in a business setting from downloading, installing, and running UE. But business users should bear in mind that the SAS Institute is known for zealously protecting its intellectual property.

You’re probably wondering, what’s the catch?  Well, there are a few things not to like, but they are relatively minor in my opinion:

  • UE only installs on 64-bit machines with at least 1 gig of memory.
  • UE doesn’t have SAS/ETS (econometrics & time series), SAS/OR (operations research) or SAS/QC (quality control). Most importantly, it doesn’t have SAS/GRAPH, although it does have ODS graphics. So you can’t use PROC GPLOT, but you can use PROC SGPLOT.
  • If you’re not connected to the Internet, it can take up to two minutes to start up, compared to only 10 seconds if you are connected. Weird, huh?
  • Installation can be a little tricky, so you need to follow all the instructions carefully.
  • It took me nearly two hours to download UE, but that was over a not-so-speedy WIFI connection. 

Now for a few details and suggestions. UE runs as a virtual machine, so you first need to download and install a free copy of Oracle’s VirtualBox software. (UE also runs with VMware Player or VMware Fusion, but those cost real money). After downloading UE, you open VirtualBox and then install UE as a virtual machine. With VirtualBox still open, you can start up UE by pointing your web browser to http://localhost:10080. For more details, check out the FAQs on the SAS support site.

I was warned by a SAS tech support person that UE may not work on “very large” data sets. But it worked fine with the biggest data set that I have, which has 414,000 cases, 674 variables, and takes up 888 MB on my computer.

If you want to use existing SAS data sets and programs, the most straightforward approach is to copy them into a dedicated folder for UE. Alternatively, you can create a folder shortcut to your existing data sets–but the process is a bit tricky. 

When I ran UE using a SAS data set that had been created by SAS 9.3 on a Windows machine, I got a warning in the Log window that the data set “is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.” I’m guessing that this happens because VirtualBox creates a Linux environment for UE to run in. And SAS data sets in Windows are not identical to SAS data sets in Linux.

In any case, this difference in file formats can really slow things down. When I ran a logistic regression with five predictors on the aforementioned data set, it took 47 seconds of real time and 38 seconds of CPU time. My solution was to use a DATA step to copy the old data set into a new data set (presumably in UE’s preferred format). When I re-ran the logistic regression on the new data set, execution improved dramatically: real time declined to 18 seconds and CPU time to 6.5 seconds. By comparison, when I ran the same regression on my standard installed version of SAS 9.3, the real time was 12 seconds and the CPU time was 2 seconds. So UE is definitely slower than “real” SAS, but the difference seems tolerable for most applications.
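In case it’s helpful, here is the kind of copy I mean. The libref, folder path, and data set names are placeholders (/folders/myfolders is a commonly used shared-folder location for UE):

      libname old '/folders/myfolders/sasdata';

      data work.mydata_ue;    /* re-creates the data set in the session's native format */
         set old.mydata;
      run;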

SAS Studio is the slick new interface for accessing SAS via a web browser. It’s designed not just for UE, but for any environment where users need to access SAS on a remote server. SAS Studio will be instantly familiar to anyone who has used the traditional SAS Display Manager with its editor window (now called Code), Log window, and Results window. As with PC SAS, you can have multiple program windows open in SAS Studio. But unlike PC SAS, each program window has its own Log and Results window. If you’re accustomed to using SAS on a PC, you can immediately start doing things the way you’ve always done them. However, there are lots of cool new features, most of which are easily learned by pointing and clicking on icons. For example, when you’re in the Results window, there are buttons that will save your output to an HTML file, a PDF file, or an RTF file.

Here’s a hint that you may find useful: by default, SAS Studio is in batch mode. That means that whenever you run a block of code, whatever is already in the Log and Results windows will get overwritten. If you want your results to accumulate, click on the “go interactive” icon in the Code window. You can also change your Preferences to start each session in interactive mode. The downside to the interactive mode is that temporary data sets and macros produced in one program window are not available to any other program window. 

If you plan to use UE a lot, it’s worth investing some time to learn the ins and outs of SAS Studio. A good introductory article (22 pages) can be found here.  Or click here for an 8-minute video tutorial. If total mastery is your thing, you can download the 300-page manual here.

So there you have it, free SAS in a (virtual) box. I would guess that at least 95% of the statistical analyses that I’ve done using SAS over the last 10 years could have been done with UE. That’s great news for potential users who don’t currently have access to SAS. But it must be a little scary for the SAS Institute. Will this free product cannibalize existing sales? Loss leaders are always risky, and it will be interesting to see how this plays out. Personally, I’m rooting for UE to be a big success, both for users and for SAS.

In the first chapter of my 1999 book Multiple Regression, I wrote

“There are two main uses of multiple regression: prediction and causal analysis. In a prediction study, the goal is to develop a formula for making predictions about the dependent variable, based on the observed values of the independent variables….In a causal analysis, the independent variables are regarded as causes of the dependent variable. The aim of the study is to determine whether a particular independent variable really affects the dependent variable, and to estimate the magnitude of that effect, if any.”

As in most regression textbooks, I then proceeded to devote the bulk of the book to issues related to causal inference—because that’s how most academic researchers use regression most of the time.

Outside of academia, however, regression (in all its forms) is primarily used for prediction. And with the rise of Big Data, predictive regression modeling has undergone explosive growth in the last decade. It’s important, then, to ask whether our current ways of teaching regression methods really meet the needs of those who primarily use those methods for developing predictive models.

Although regression can be used for both causal inference and prediction, there are some important differences in how the methodology is used, or should be used, in the two kinds of application. I’ve been thinking about these differences lately, and I’d like to share a few that strike me as particularly salient. I invite readers of this post to suggest others as well.

1. Omitted variables. For causal inference, a major goal is to get unbiased estimates of the regression coefficients. And for non-experimental data, the most important threat to that goal is omitted variable bias. In particular, we need to worry about variables that both affect the dependent variable and are correlated with the variables that are currently in the model. Omission of such variables can totally invalidate our conclusions.

With predictive modeling, however, omitted variable bias is much less of an issue. The goal is to get optimal predictions based on a linear combination of whatever variables are available. There is simply no sense in which we are trying to get optimal estimates of “true” coefficients. Omitted variables are a concern only insofar as we might be able to improve predictions by including variables that are not currently available. But that has nothing to do with bias of the coefficients.

2. R2. Everyone would rather have a big R2 than a small R2, but that criterion is more important in a predictive study. In a causal analysis, even with a low R2, you can do a good job of testing hypotheses about the effects of the variables of interest. That’s because, for parameter estimation and hypothesis testing, a low R2 can be counterbalanced by a large sample size.

For predictive modeling, on the other hand, maximization of R2 is crucial. Technically, the more important criterion is the standard error of prediction, which depends both on the R2 and the variance of y in the population. In any case, large sample sizes cannot compensate for models that are lacking in predictive power.
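
To make the tradeoff concrete (a standard identity, not something from the original post): ignoring degrees-of-freedom corrections and the uncertainty in the estimated coefficients, the residual standard deviation is approximately

            s_e = s_y * sqrt(1 - R2)

where s_y is the standard deviation of y. So raising R2 from .20 to .80 cuts the typical prediction error in half, and no amount of additional data will do that for you.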

3. Multicollinearity. In causal inference, multicollinearity is often a major concern. The problem is that when two or more variables are highly correlated, it can be very difficult to get reliable estimates of the coefficients for each one of them, controlling for the others. And since the goal is accurate coefficient estimates, this can be devastating.

In predictive studies, because we don’t care about the individual coefficients, we can tolerate a good deal more multicollinearity. Even if two variables are highly correlated, it can be worth including both of them if each one contributes significantly to the predictive power of the model.

4. Missing data. Over the last 30 years, there have been major developments in our ability to handle missing data, including methods such as multiple imputation, maximum likelihood, and inverse probability weighting. But all these advances have focused on parameter estimation and hypothesis testing. They have not addressed the special needs of those who do predictive modeling.

There are two main issues in predictive applications. First, the fact that a data value is missing may itself provide useful information for prediction. And second, it’s often the case that data are missing not only for the “training” sample, but also for new cases for which predictions are needed. It does no good to have optimal estimates of coefficients when you don’t have the corresponding x values by which to multiply them.

Both of these problems are addressed by the well-known “dummy variable adjustment” method, described in my book Missing Data, even though that method is known to produce biased parameter estimates. There may well be better methods, but the only article I’ve seen that seriously addresses these issues is a 1998 unpublished paper by Warren Sarle.
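
For readers who haven’t seen it, here is a bare-bones sketch of the dummy-variable adjustment method for a single predictor with missing values (the data set and variable names are hypothetical):

    data train2;
       set train;
       income_miss = missing(income);          /* indicator: 1 if income is missing, 0 otherwise */
       if income_miss then income_fill = 0;    /* fill in any constant; the indicator absorbs it */
       else income_fill = income;
    run;

    proc reg data=train2;
       model nkids = income_fill income_miss;  /* enter both the filled value and the indicator */
    run;

The same two derived variables can be computed for new cases at scoring time, which is exactly what makes the method attractive for prediction despite its well-documented bias for causal estimates.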

5. Measurement error. It’s well known that measurement error in predictors leads to bias in estimates of regression coefficients. Is this a problem for a predictive analysis? Well, it’s certainly true that poor measurement of predictors is likely to degrade their predictive power. So efforts to improve measurement could have a payoff. Most predictive modelers don’t have that luxury, however. They have to work with what they’ve got. And after-the-fact corrections for measurement error (e.g., via errors-in-variables models or structural equation models) will probably not help at all.

I’m sure this list of differences is not exhaustive. If you think of others, please add a comment. One could argue that, in the long run, a correct causal model is likely to be a better basis for prediction than one based on a linear combination of whatever variables happen to be available. It’s plausible that correct causal models would be more stable over time and across different populations, compared with ad hoc predictive models. But those who do predictive modeling can’t wait for the long run. They need predictions here and now, and they must do the best with what they have.

At the 1998 Annual Meeting of the American Political Science Association, Gary King and three co-authors presented a paper titled “Listwise deletion is evil: What to do about missing data in political science.” The paper was later published under a different title in the American Political Science Review, but the original title has stuck in my head ever since. Is listwise deletion really evil? How much harm has it caused? Should we avoid it at all costs?

King et al. claimed that data analyses in political science research typically lose about a third of their cases to listwise deletion of missing data (also known as complete case analysis). As a consequence, they argued, the increase in mean squared error is comparable to what you might expect from omitted variable bias. They then made the case that multiple imputation would be far superior in the vast majority of research projects.

Now I’ve been a proponent of multiple imputation for many years, although I’ve also argued that maximum likelihood (ML) methods for handling missing data may be even better. Nevertheless, I still believe that listwise deletion is not as bad as many people think, and even has advantages over ML and multiple imputation in some applications.

King et al. were obviously correct that listwise deletion can lead to massive losses of data, which can substantially increase the probability of Type II errors. But with the rise of “big data”, many researchers now find themselves in situations where statistical power is not a major issue. If listwise deletion reduces your sample size from a million to 500,000, loss of power is probably not going to keep you up at night.

In cases like this, the focus should shift to bias. Which method—listwise deletion or multiple imputation—is going to give you the least bias? It’s well known that listwise deletion does not introduce bias if the data are missing completely at random (MCAR). Under MCAR, listwise deletion is equivalent to simple random sampling, and we know that simple random sampling does not lead to bias.

But MCAR is a very strong assumption, and there are usually many reasons to suspect violations. For example, if men are less likely to report their income than women, then data on income are not MCAR. That would certainly lead to biased estimates of mean income for the whole population.

By contrast, standard methods for multiple imputation and ML are approximately unbiased under the much weaker assumption that data are missing at random (MAR). Under MAR, it’s OK if men are less likely to report their income than women, as long as the probability of reporting income does not depend on income itself (within each gender). Thus, MAR allows the probability of missingness to depend on observed variables, and that’s a major advantage over listwise deletion in reducing bias.

This is all well known. What most researchers don’t know is that listwise deletion may actually be less biased than multiple imputation or ML when data are missing on predictor variables in regression analysis. For example, suppose you’re doing a linear regression in which the dependent variable is number of children and one of the predictors is annual income. Suppose, further, that 30% of respondents did not report their income. Multiple imputation or ML, when done correctly, can produce approximately unbiased estimates of the regression coefficients when income data are MAR. But so will listwise deletion. And, remarkably, listwise deletion will produce unbiased estimates even if the data are not missing at random.

You can find a simple proof of this result in footnote 1 of my book Missing Data, although I was certainly not the first to prove it. What this means is that even if people with high income are less likely to report their income (a violation of MAR), that won’t lead to bias if you use listwise deletion. But it certainly could lead to bias if you use standard implementations of multiple imputation or ML.
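
If you’d rather see it than prove it, a quick simulation makes the point. In this sketch (my own toy example, not from the book), missingness on the predictor x depends on x itself, yet the slope estimated after listwise deletion comes out very close to the true value of 0.5:

    data sim;
       call streaminit(12345);
       do i = 1 to 100000;
          x = rand('normal');
          y = 1 + 0.5*x + rand('normal');
          if rand('uniform') < logistic(2*x) then x = .;   /* higher x, more likely missing (NMAR) */
          output;
       end;
    run;

    proc reg data=sim;
       model y = x;      /* PROC REG drops incomplete cases by default, i.e. listwise deletion */
    run;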

There are two important caveats here. First, the probability that predictors are missing cannot depend on the dependent variable. Thus, in our example, the probability that income is missing cannot depend on the number of children.

Second, this property of listwise deletion presumes that you are estimating a correctly specified regression model. In particular, if the regression coefficients are actually different for different subgroups but your model doesn’t allow for this, then listwise deletion can skew your results more toward one subgroup or the other. Consequently, your estimates won’t be unbiased estimates of the (misspecified) regression in the population.

The robustness of listwise deletion to “not missing at random” extends to any kind of regression, not just linear. For predicting number of children, for example, you might prefer a negative binomial regression to a linear regression, and the same result would apply.  
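
In SAS, for instance, switching models is just a matter of swapping procedures; incomplete cases are still dropped by default (the variable names are hypothetical):

    proc genmod data=mydata;
       model nkids = income / dist=negbin;   /* negative binomial regression with listwise deletion */
    run;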

For logistic regression, the result is even stronger. You can have missing data that are not missing at random on the dependent variable, and logistic regression using listwise deletion will still give approximately unbiased estimates of the regression coefficients (but not the intercept). For example, suppose the dependent variable is whether or not people graduate from college, and people who graduate are much more likely to report their graduation status than those who do not. That’s not a problem for listwise deletion, but it would definitely be a problem for multiple imputation or ML.

For this property of logistic regression, the caveat is that missingness on the dependent variable cannot depend on the predictor variables. Incidentally, the proof of this result is also the justification of the famous case-control method in epidemiology.
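
Here, too, a small simulation is instructive. In the sketch below (again my own toy example), graduates are far more likely than non-graduates to have an observed outcome, yet listwise deletion recovers a slope close to the true value of 0.8, while the intercept is shifted:

    data sim2;
       call streaminit(6789);
       do i = 1 to 200000;
          x = rand('normal');
          grad = rand('bernoulli', logistic(-1 + 0.8*x));
          reported = rand('bernoulli', 0.9*grad + 0.4*(1 - grad));   /* reporting depends only on grad */
          if reported then output;                                   /* keep complete cases only */
       end;
    run;

    proc logistic data=sim2 descending;
       model grad = x;      /* slope near 0.8; intercept absorbs the differential reporting */
    run;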

Here’s another little-known fact about listwise deletion. If data are missing only on the dependent variable and the data are missing at random, then listwise deletion is equivalent to maximum likelihood. And ML can’t be improved upon by multiple imputation, so you might as well just do listwise deletion. Multiple imputation would only add more sampling variation to the estimates.

So the upshot is this. If listwise deletion still leaves you with a large sample, you might reasonably prefer it over maximum likelihood or multiple imputation. At the least, you should think carefully about the relative advantages and disadvantages of these methods, and not dismiss listwise deletion out of hand.

Finally, if you compare listwise deletion with other traditional methods like pairwise deletion, dummy variable adjustment, or conventional imputation, there’s really no contest. The other methods either get the standard errors wrong, the parameter estimates wrong, or both. At a minimum, listwise deletion gives you “honest” standard errors that reflect the actual amount of information used. And it’s by far the easiest method to apply.