In his April 1 post, Paul Allison pointed out several attractive properties of the logistic regression model.  But he neglected to consider the merits of an older and simpler approach: just doing linear regression with a 1-0 dependent variable.  In both the social and health sciences, students are almost universally taught that when the outcome variable in a regression is dichotomous, they should use logistic instead of linear regression. Yet economists, though certainly aware of logistic regression, often use a linear model to model dichotomous outcomes.

Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.


Let’s start by comparing the two models explicitly. If the outcome Y is a dichotomy with values 1 and 0, define p = E(Y|X), which is just the probability that Y is 1, given some value of the regressors X. Then the linear and logistic probability models are:

p = a0 + a1X1 + a2X2 + … + akXk    (linear)

ln[p/(1-p)] = b0 + b1X1 + b2X2 + … + bkXk       (logistic)

The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1-p) is a linear function of the regressors.

The major advantage of the linear model is its interpretability. In the linear model, if a1 is (say) .05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.

The logistic model is less interpretable. In the logistic model, if b1 is .05, that means that a one-unit increase in X1 is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.

How intuitive are odds ratios?

Because the log odds scale is so hard to interpret, it is common to report logistic regression results as odds ratios. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:

p/(1-p) = d0 × d1^X1 × d2^X2 × … × dk^Xk

On the left side we have the odds and on the right side we have a product involving the odds ratios d1 = exp(b1), d2 = exp(b2), etc.

Odds ratios seem like they should be intuitive. If d1 = 2, for example, that means that a one-unit increase in X1 doubles the odds that Y is 1. That sounds like something we should understand.

But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1-p)—and so does the odds ratio.

Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a get-out-the-vote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.

If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.
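If you'd rather not do the arithmetic in your head, the odds-to-probability conversion is easy to script. Here is a minimal Python sketch (the function names are mine):

```python
def odds(p):
    """Convert a probability to odds, p/(1-p)."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability, o/(1+o)."""
    return o / (1 + o)

def after_doubling(p):
    """Probability after the odds are doubled; simplifies to 2p/(1+p)."""
    return prob(2 * odds(p))

print(round(after_doubling(0.40), 2))  # prints 0.57, not 0.8
```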

Here’s a table that shows what doubling the odds does to various initial probabilities:

Before doubling    After doubling
      10%               18%
      20%               33%
      30%               46%
      40%               57%
      50%               67%
      60%               75%
      70%               82%
      80%               89%
      90%               95%
It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If p is close to 0, then doubling the odds is approximately the same as doubling p. If p is close to 1, then doubling the odds is approximately the same as halving 1-p. But if p is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.

That’s why I’m not crazy about odds ratios.

How nonlinear is the logistic model?

The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).

For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability (cf. Long 1997).


It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.
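You can verify the near-linearity numerically. The following Python sketch (purely illustrative) compares the log odds to a straight line drawn through the endpoints of the interval [.20, .80]; on that interval the two never differ by more than about .08:

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

# Straight line through the endpoints of the interval [.20, .80].
lo, hi = 0.20, 0.80
slope = (logit(hi) - logit(lo)) / (hi - lo)
intercept = logit(lo) - slope * lo

for p in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    line = intercept + slope * p
    print(f"p = {p:.2f}   log odds = {logit(p):+.3f}   line = {line:+.3f}")
```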


When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model.

A rule of thumb

These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.

Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).

Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[p/(1-p)] are undefined when p is equal to 0 or 1. When p gets close to 0 or 1 logistic regression can suffer from complete separation, quasi-complete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic. 

Computation and estimation

Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderate-sized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]

The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroskedastic, with residual variance p(1-p), but the heteroskedasticity is minor if p is between .20 and .80, which is the only situation where I recommend using the linear probability model anyway. OLS estimates can be improved by using heteroskedasticity-consistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.
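To make the noniterative estimation concrete, here is a small simulation sketch in Python (the setup and numbers are my own, not from the post): when the true probabilities stay in the moderate range, the closed-form OLS line recovers the linear probability model's coefficients directly, with no iteration.

```python
import random

random.seed(1)

# Simulate a linear probability model: P(Y = 1 | x) = 0.30 + 0.40*x,
# with x uniform on [0, 1], so every probability lies in [.30, .70].
n = 20000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < 0.30 + 0.40 * xi else 0 for xi in x]

# Closed-form OLS for a single regressor -- no iteration required.
mx = sum(x) / n
my = sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sxy / sxx
intercept = my - slope * mx

print(f"intercept = {intercept:.3f} (true .30), slope = {slope:.3f} (true .40)")
```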

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science. 


Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43(1), 59–74. http://doi.org/10.1007/s11135-007-9077-3

King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163. http://doi.org/10.2307/25791637

Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.


[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.

[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.

When estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time t is what happened at time t-1. 

This can work well for some kinds of models, but not for mixed models, otherwise known as random effects models or multilevel models.  Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as dynamic panel data models.  Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward.  Let yit be the value of the dependent variable for individual i at time t.  Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector xit:

            yit = b0 + b1yi(t-1) + b2xit +  ui + eit

The random intercept ui represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the right-hand side.

That’s where the problem lies. Because the model applies to all time points, ui has a direct effect on yi(t-1).  But if ui affects yi(t-1), it can’t also be statistically independent of yi(t-1). The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).
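A quick simulation makes the violation visible. The Python sketch below (a stripped-down version of the model with no covariates, and parameter values of my own choosing) generates data from a random intercepts equation with a lagged dependent variable, then checks the correlation between the random intercept and the lagged outcome:

```python
import random

random.seed(2)

# Stripped-down random intercepts model with a lagged dependent variable:
#   y_it = b1 * y_i(t-1) + u_i + e_it
b1 = 0.5
n_people, n_times = 5000, 7

u_vals, ylag_vals = [], []
for i in range(n_people):
    u = random.gauss(0, 1)      # random intercept u_i
    y = random.gauss(0, 1)      # starting value of y
    for t in range(1, n_times):
        y_prev = y
        y = b1 * y_prev + u + random.gauss(0, 1)
        u_vals.append(u)
        ylag_vals.append(y_prev)

# Correlation between u_i and the lagged dependent variable. If the
# independence assumption held, this would be near zero; it is not.
n = len(u_vals)
mu = sum(u_vals) / n
my = sum(ylag_vals) / n
cov = sum((a - mu) * (b - my) for a, b in zip(u_vals, ylag_vals))
vu = sum((a - mu) ** 2 for a in u_vals)
vy = sum((b - my) ** 2 for b in ylag_vals)
corr = cov / (vu * vy) ** 0.5
print(f"corr(u, lagged y) = {corr:.2f}")   # clearly positive
```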

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: ed (years of education) and fem (1 for female, 0 for male). 

Here’s the Stata code:

use "https://statisticalhorizons.com/wp-content/uploads/wages.dta", clear
 xtset id t
 xtreg lwage L.lwage ed fem t


The xtset command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7.  The xtreg command fits a random-intercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors.  L.lwage specifies the one-year lag of lwage.

Here’s the output:

       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lwage |
         L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851
          ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724
         fem |    -.06705    .010187    -6.58   0.000    -.0870162   -.0470839
           t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811
       _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males.  Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages. 
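The "multiply by 100" rule is a first-order approximation; the exact percentage change implied by a coefficient b on a logged dependent variable is 100(exp(b) - 1). A short Python sketch (coefficient values taken from the outputs in this post, including the larger female coefficient from the second model below) shows the approximation is excellent for small coefficients and drifts for larger ones:

```python
import math

def pct_change(b):
    """Exact percentage change in y for a one-unit change in x
    when the dependent variable is log(y)."""
    return 100 * (math.exp(b) - 1)

for b in [0.0108, -0.0671, 0.0072, -0.2853]:
    print(f"b = {b:+.4f}: approx {100 * b:+.2f}%, exact {pct_change(b):+.2f}%")
```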

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983), which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2016) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply xtdpdqml to the wage data:

xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)

The re option specifies a random effects (random intercepts) model.  By default, the command includes the lag-1 dependent variable as a predictor.  The initval option sets the starting values for the four variance parameters that are part of the model.  Here is the output:

       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lwage |
         L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527
          ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666
         fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166
           t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378
       _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419

Results are markedly different from those produced above by xtreg.  The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Females now make about 25 percent less (calculated as 100[exp(-.285) - 1]) than males, compared to 6 percent less before. And the annual increase in wages is 5 percent instead of 1 percent.

So doing it right can make a big difference.  Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with xtdpdml, use

xtset id t
xtdpdml lwage, inv(ed fem) errorinv

The inv option is for time-invariant variables.  The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a 1-time unit lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.

Here is the output:

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
lwage2       |
      lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909
          ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486
         fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315

Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in Allison et al. (2017). As a matter of fact, the xtdpdml command is just a front-end to the sem command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself.  That’s why we wrote the command.  

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both xtdpdqml and xtdpdml can handle this situation also. 

If you’d like to learn more about dynamic panel data models, check out my 2-day course on Longitudinal Data Analysis Using SEM.


Allison, Paul D., Richard Williams and Enrique Moral-Benito (2017) “Maximum likelihood for cross-lagged panel models with fixed effects.” Socius 3: 1-17.

Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 1635-1659.

Kripfganz, S. (2016). “Quasi-maximum likelihood estimation of linear dynamic short-T panel-data models.” Stata Journal 16 (4), 1013–1038.

Rabe-Hesketh, Sophia, and Anders Skrondal  (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP. 

In my July 2012 post, I argued that maximum likelihood (ML) has several advantages over multiple imputation (MI) for handling missing data:

  • ML is simpler to implement (if you have the right software).
  • Unlike multiple imputation, ML has no potential incompatibility between an imputation model and an analysis model.
  • ML produces a deterministic result rather than a different result every time.

Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.

What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.

With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.

So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a t-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?

Wrong! In a paper that will be published soon in Structural Equation Modeling, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.

What about confidence intervals and p-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a t-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this t-distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.

Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.

Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for non-linear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.

MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that data are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.

If you want to learn more about both multiple imputation and maximum likelihood, check out my two-day course on Missing Data that will be offered this fall. Dates and location have not yet been set. 

For the analysis of binary data, logistic regression dominates all other methods in both the social and biomedical sciences. It wasn’t always this way. In a 1934 article in Science, Charles Bliss proposed the probit function for analyzing binary data, and that method was later popularized in David Finney’s 1947 book Probit Analysis. For many years, probit was the method of choice in biological research.

In a 1944 article in the Journal of the American Statistical Association, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book Analysis of Binary Data. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite. 

So what is it about logistic regression that makes it so popular?  In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there is something special about the logit link that gives it a natural advantage over alternative link functions.

First, a brief review. For binary data, the goal is to model the probability p that one of two outcomes occurs. The logit function is log[p/(1-p)], which varies between -∞ and +∞ as p varies between 0 and 1. The logistic regression model says that

            log[p/(1-p)] = b0 + b1x1 + … + bkxk

or, equivalently,

            p = 1/(1 + exp{-(b0 + b1x1 + … + bkxk)})

Estimation of the b coefficients is usually accomplished by maximum likelihood.
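The equivalence of the two forms is easy to confirm: the expression 1/(1 + exp(-z)) is simply the inverse of the logit function. A tiny Python sketch:

```python
import math

def logit(p):
    """Link function: log odds of p."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse link: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Round-tripping through the two forms recovers the original probability.
for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    assert abs(inv_logit(logit(p)) - p) < 1e-12
print("the two forms of the model agree")
```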

In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)

What’s most important about the logit link is that it guarantees that p is bounded by 0 and 1, no matter what the b’s and the x’s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions.  But there are a few things that are special about logit:

1. If you exponentiate the coefficients, you get adjusted odds ratios. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.

2. With logit, you can do disproportionate stratified random sampling on the dependent variable without biasing the coefficients. For example, you could construct a sample that includes all of the events, and a 10% random sample of the non-events. This property is the justification for the widely-used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property.

3. You can do exact logistic regression. With conventional maximum likelihood estimation, the p-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables) exact p-values are obtained by enumeration of all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.

4. You can do conditional logistic regression. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function.  

What is it about the logit link that makes it possible to do these useful variations of logistic regression?  The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important exponential family of distributions. In general form, the probability distribution for the exponential family can be written as

            f(x|b) = h(x) exp{T(x)′g(b) − A(b)}

In this formula, x is a vector of the data, b is a vector of parameters, and h, g, T, and A are known functions.  If g(b) = b, then b is said to be the natural parameter (or canonical parameter) of the distribution. T(x) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.

For the binomial distribution, this formula specializes to 

            f(x|p) = [N x] exp{x log[p/(1-p)] + N log(1-p)}

where N is the number of trials, x is the number of events and [N x] is the binomial coefficient.  We see immediately that log[p/(1-p)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.
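If you want to see the algebra in action, the following Python sketch (notation adapted from the formula above) checks that the exponential-family form, with the natural parameter log[p/(1-p)] multiplying the sufficient statistic x, reproduces the familiar binomial probabilities exactly:

```python
import math

def binom_pmf(x, N, p):
    """Binomial pmf in its familiar form."""
    return math.comb(N, x) * p**x * (1 - p)**(N - x)

def expfam_pmf(x, N, p):
    """Binomial pmf in exponential-family form, with the natural
    parameter log[p/(1-p)] multiplying the sufficient statistic x."""
    theta = math.log(p / (1 - p))          # natural parameter
    return math.comb(N, x) * math.exp(x * theta + N * math.log(1 - p))

for x in range(11):
    assert abs(binom_pmf(x, 10, 0.3) - expfam_pmf(x, 10, 0.3)) < 1e-12
print("exponential-family form matches the binomial pmf")
```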

Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors. 

Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.  

There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables.  You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.

If you’d like to learn more about these and other methods for binary data, see my book Logistic Regression Using SAS: Theory and Application.  

Predictive mean matching (PMM) is an attractive way to do multiple imputation for missing data, especially for imputing quantitative variables that are not normally distributed. But, as I explain below, it’s also easy to do it the wrong way. 

Compared with standard methods based on linear regression and the normal distribution, PMM produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.

PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.

There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.

Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

  1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
  2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. Typically this would be a random draw from a multivariate normal distribution with mean b and the estimated covariance matrix of b (with an additional random draw for the residual variance). This step is necessary to produce sufficient variability in the imputed values, and is common to all “proper” methods for multiple imputation.
  3. Using b*, generate predicted values for x for all cases, both those with data missing on x and those with data present.
  4. For each case with missing x, identify a set of cases with observed x whose predicted values are close to the predicted value for the case with missing data.
  5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value.
  6. Repeat steps 2 through 5 for each completed data set. 
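Steps 1 through 5 can be sketched in a few lines of Python. This is purely an illustrative sketch of the algorithm described above; the function name, the use of numpy, and the simple OLS-based posterior draw are my own, not code taken from any package’s PMM implementation.

```python
import numpy as np

def pmm_impute(x, Z, k=5, rng=None):
    """One predictive-mean-matching imputation of a single variable x
    (np.nan marks missing values) using complete covariates Z."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(x)
    Zd = np.column_stack([np.ones(len(x)), Z])          # add an intercept
    # Step 1: OLS regression of x on z for the complete cases
    b, *_ = np.linalg.lstsq(Zd[obs], x[obs], rcond=None)
    resid = x[obs] - Zd[obs] @ b
    df = obs.sum() - Zd.shape[1]
    # Step 2: draw b* from an approximate posterior of b, including
    # a random draw for the residual variance
    s2_star = resid @ resid / rng.chisquare(df)
    cov = s2_star * np.linalg.inv(Zd[obs].T @ Zd[obs])
    b_star = rng.multivariate_normal(b, cov)
    # Step 3: predicted values for all cases, missing and observed
    pred = Zd @ b_star
    x_imp = x.copy()
    obs_idx = np.flatnonzero(obs)
    for i in np.flatnonzero(~obs):
        # Step 4: the k observed cases with the closest predicted values
        donors = obs_idx[np.argsort(np.abs(pred[obs] - pred[i]))[:k]]
        # Step 5: borrow the observed value of one randomly chosen donor
        x_imp[i] = x[rng.choice(donors)]
    return x_imp
```

With k=5 this mirrors the SAS and MICE defaults discussed below; setting k=1 reproduces the problematic SPSS/Stata default.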

Unlike in many imputation methods, the linear regression here is not used to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.

There are several variations to this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10. 

On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.

That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014). 

Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.

So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.

The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But–as they say–more research is needed.


Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.

Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.

Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.

Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.

In my November and December posts, I extolled the virtues of SEM for estimating dynamic panel models. By combining fixed effects with lagged values of the predictor variables, I argued that this approach offers the best option for making causal inferences with non-experimental panel data. It controls for all time-invariant variables, whether observed or not, and it allows for the possibility of reciprocal causation. Simulation evidence strongly favors SEM over the more popular Arellano-Bond method for estimating these kinds of models.

Despite the potential for this method, I recently learned that it’s vulnerable to a very troubling kind of bias when the lag structure is misspecified. In the latest issue of Sociological Methods and Research, Stephen Vaisey and Andrew Miles showed by both simulation and formal proof that a positive contemporaneous effect will often show up as a negative effect when estimating a fixed effects model with a predictor that is lagged by one time unit. They concluded that, for most social science applications, “artifactual negative ‘effects’ will likely be the rule rather than the exception.”

Vaisey and Miles investigated this problem only for the case of three periods of data, no lagged effect of the dependent variable y on itself, and no true effect of y on x. In that case, maximum likelihood reduces to OLS regression using difference scores: regressing y3 – y2 on x2 – x1. They showed that the coefficient for x2 – x1 has an expected value that is exactly -.5 times the true coefficient.
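This result is easy to verify by simulation. The sketch below is my own illustration, assuming i.i.d. standard-normal x’s and errors: y depends on x only in the same period, yet the difference-score regression with lagged x recovers a coefficient near -.5 times the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n, b = 200_000, 1.0                      # b = true contemporaneous effect
x1, x2, x3 = rng.normal(size=(3, n))     # i.i.d. regressor across 3 waves
c = rng.normal(size=n)                   # fixed effect (differences out)
y2 = b * x2 + c + rng.normal(size=n)     # y depends on x in the SAME period
y3 = b * x3 + c + rng.normal(size=n)

dy, dx = y3 - y2, x2 - x1                # difference scores with LAGGED x
slope = np.cov(dy, dx)[0, 1] / np.var(dx)
print(round(slope, 2))                   # close to -0.5, i.e., -.5 times b
```

The sign flip comes from Cov(x3 – x2, x2 – x1) = –Var(x2): the lagged difference score is negatively correlated with the contemporaneous one.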

My own simulations suggest that a sign reversal can also happen with four or more periods and a lagged dependent variable. And the effect of one variable on the other doesn’t have to be exactly contemporaneous. The reversal of sign can also occur if the correct lag is one week, but the estimated model specifies a lag of one year. Note that this artifact does not arise with random effects models. It’s specific to fixed effects models with lagged predictors. That should not be interpreted as an endorsement of random effects models, however, because they are much more prone to bias from omitted variables.

As noted by Vaisey and Miles, a 2011 article in the Journal of Quantitative Criminology may exemplify the problem of misspecified lags. Following my advice, Ousey, Wilcox and Fisher used the fixed effects SEM method to examine the relationship between victimization and offending. Numerous studies have found a positive, cross-sectional relationship between these variables: people who report being victims of crimes are also more likely to commit crimes. But Ousey et al. found negative effects of each variable on the other. Respondents who reported higher levels of offending in year t had lower levels of victimization in year t+1, after adjusting for fixed effects. And respondents with higher levels of victimization in year t had lower levels of offending in year t+1.

This surprising result could be real. But it could also occur if there is a positive effect of victimization on offending that is almost instantaneous rather than lagged by one year. And, finally, it could also occur if there is a positive, instantaneous effect of offending on victimization.

What can be done about this problem? Well, one implication is that more thought should go into the design of panel surveys. If you expect that changes in x will produce changes in y a month later, then collecting monthly data would be much better than collecting annual data. This could have the added advantage of reducing the total time for data collection, although it might also increase certain kinds of response bias.

What if your data have already been collected? Here’s a tentative recommendation that worked well in a few simulations. As a robustness check, estimate models that include both contemporaneous and lagged predictors. If a one-year lag is the correct specification, then the contemporaneous effect should be small and not statistically significant. If, on the other hand, the contemporaneous effect is large and significant, it should raise serious doubts about the validity of the method and the kinds of conclusions that can be drawn. It may be that the data are simply not suitable for separating the effect of x on y from the effect of y on x.

I tried this strategy on a subset of the data used by Ousey et al. to study victimization and offending. When both contemporaneous and lagged predictors were included, I found a strong positive effect of victimization on offending in the same year. The one-year lagged effect was negative but small and non-significant. The same thing happened in the reverse direction. Offending had a strong positive effect on victimization in the same year, but the lagged effect was negative and not significant. My take: these data don’t allow one to draw any firm conclusions about whether victimization affects offending or offending affects victimization. They certainly don’t provide a basis for claiming negative effects of each variable on the other.

Clearly this is a problem that needs a great deal more study. There is a substantial econometric literature on determining the number of lags needed for autoregressive models but, as far as I know, Vaisey and Miles are the first to identify this particular phenomenon.

By the way, Steve Vaisey teaches a highly rated course for Statistical Horizons called Treatment Effects Analysis.  

This is a follow-up to last month’s post, in which I considered the use of panel data to answer questions about causal ordering: does x cause y or does y cause x? In the interim, I’ve done many more simulations to compare the two competing methods, Arellano-Bond and ML-SEM, and I’m going to report some key results here. If you want all the details, read my recent paper by clicking here. If you’d like to learn how to use these methods, check out my new seminar titled Longitudinal Data Analysis Using SEM.

Quick review: The basic approach is to assume a cross-lagged linear model, with y at time t affected by both x and y at time t-1, and x at time t also affected by both lagged variables. The equations are

 yit = b1xi(t-1) + b2yi(t-1) + ci + eit

xit = a1xi(t-1) + a2yi(t-1) + fi + dit

for i = 1,…, n, and t = 1,…, T.  

The terms ci and fi represent individual-specific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, eit and dit are assumed to represent pure random noise, independent of any variables measured at earlier time points. Additional exogenous variables could also be added to these equations.

Conventional estimation methods are biased because of the lagged dependent variable and because of the reciprocal relationship between the two variables. The most popular solution is the Arellano-Bond (A-B) method (or one of its cousins), but I have previously argued for the use of maximum likelihood (ML) as implemented in structural equation modeling (SEM) software.

Last month I presented very preliminary simulation results showing that ML-SEM had substantially lower mean-squared error (MSE) than A-B under a few conditions. Since then I’ve done simulations for 31 different sets of parameter values and data configurations. For each condition, I generated 1,000 samples, ran the two methods on each sample, and then calculated bias, mean squared error, and coverage for confidence intervals. Since the two equations are symmetrical, the focus is on the coefficients in the first equation, b1 for the effect of x on y, and b2 for the effect of y on itself.

The simulations for ML were done with PROC CALIS in SAS. I originally started with the sem command in Stata, but it had a lot of convergence problems for the smaller sample sizes. The A-B simulations were done in Stata with the xtabond command. I tried PROC PANEL in SAS, but couldn’t find any combination of options that produced approximately unbiased estimates.   

Here are some of the things I’ve learned:

Under every condition, ML showed little bias and quite accurate confidence interval coverage. That means that about 95% of the nominal 95% confidence intervals included the true value.

Except under “extreme” conditions, A-B also had little bias and reasonably accurate confidence interval coverage. 

However, compared with A-B, ML-SEM always showed less bias and smaller sampling variance. My standard of comparison is relative efficiency, which is the ratio of MSE for ML to MSE for A-B. (MSE is the sum of the sampling variance plus the squared bias.) Across 31 different conditions, relative efficiency of the two estimators ranged from .02 to .96, with a median of .50. To translate, if the relative efficiency is .50, you’d need twice as large a sample to get the same accuracy with A-B as with ML.  
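For reference, here is how these quantities are computed from a set of replicate estimates. The helper functions are my own, not code from the study itself:

```python
import numpy as np

def mse(estimates, truth):
    """Mean squared error = sampling variance + squared bias."""
    est = np.asarray(estimates, dtype=float)
    return est.var() + (est.mean() - truth) ** 2

def relative_efficiency(ml_estimates, ab_estimates, truth):
    """MSE(ML) / MSE(A-B): values below 1 favor ML."""
    return mse(ml_estimates, truth) / mse(ab_estimates, truth)
```

So a relative efficiency of .50 means MSE for ML is half that of A-B, which is the sense in which A-B would need roughly twice the sample size to match ML.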

Relative efficiency of the two estimators is strongly affected by the value of the parameter b2, the effect of yt-1 on yt. As b2 gets close to 1, the A-B estimators for both b1 and b2 become badly biased (toward 0), and the sampling variance increases, which is consistent with previous literature on the A-B estimator. For ML, on the other hand, bias and variance are rather insensitive to the value of b2. Here are the numbers:


[Table: relative efficiency for b1 and b2, by the value of b2]

Relative efficiency is strongly affected by the number of time points, but in the opposite direction for the two coefficients. Thus, relative efficiency for b1 increases almost linearly as the number of time points goes from 3 to 10.  But for b2, relative efficiency is highest at T=3, declines markedly for T=4 and T=5, and then remains stable. 


[Table: relative efficiency for b1 and b2, by the number of time points (T = 3 to 10)]

Relative efficiency is also strongly affected by the ratio of the variance of ci (the fixed effect) to the variance of eit (the pure random error). In the next table, I hold constant the variance of c and vary the standard deviation of e.


[Table: relative efficiency for b1 and b2, by the standard deviation of e]

Relative efficiency is not strongly affected by:

  • Sample size
  • The value of b1
  • The correlation between ci and fi, the two fixed-effects variables.

Because ML is based on the assumption of multivariate normality, one might suspect that A-B would do better than ML if the distributions were not normal. To check that out, I generated all the variables using a 2-df chi-square variable, which is highly skewed to the right.  ML still did great in this situation, and was still about twice as efficient as A-B.

In sum, ML-SEM outperforms A-B in every situation studied, by a very substantial margin.   

Does x cause y or does y cause x? Virtually everyone agrees that cross-sectional data are of no use in answering this question. The ideal, of course, would be to do two randomized experiments, one examining the effect of x on y, and the other focused on the reverse effect. Absent this, most social scientists would say that some kind of longitudinal data ought to do the trick. But what kinds of data are needed and how should they be analyzed?

In this post, I review some earlier work I’ve done on these questions, and I report new simulation results comparing the Arellano-Bond method with maximum likelihood (ML) using structural equation modeling (SEM) software. Arellano-Bond is hugely popular among economists, but not widely known in other disciplines. ML with SEM is a method that I’ve been advocating for almost 15 years (Allison 2000, 2005a, 2005b, 2009). Long story short: ML rules.

I focus on panel data in which we observe yit and xit for i = 1,…, n and t = 1,…, T. The proposed linear model allows for reciprocal, lagged effects of these two variables on each other:

yit = b1xi(t-1) + b2yi(t-1) + ci + eit

xit = a1xi(t-1) + a2yi(t-1) + fi + dit

The terms ci and fi represent individual-specific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, eit and dit are assumed to represent pure random noise, independent of any variables measured at earlier time points.

If all the assumptions are met, b1 can be interpreted as the causal effect of x on y, and a2 can be interpreted as the causal effect of y on x. This model can be elaborated in various ways to include, for example, other predictor variables, different lags, and coefficients that change over time.

Estimation of the model is not straightforward for reasons that are well known in the econometric literature. First, the presence of a lagged dependent variable as a predictor in each equation means that conventional fixed effects methods yield biased estimates of the coefficients under almost any condition. But even if the lagged dependent variables were excluded from the equations, the error term in each equation would still be correlated with all future values of both x and y. For example, e2 -> y2 -> x3.  So, again, conventional fixed effects will produce biased coefficients. 
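The first problem, bias from a lagged dependent variable under conventional fixed effects (sometimes called the Nickell bias), is easy to see in a small simulation. This is my own illustration with arbitrary parameter values, not the simulation design used later in the post:

```python
import numpy as np

rng = np.random.default_rng(7)
n, T, rho = 5000, 5, 0.5                  # short panel, true AR coefficient .5
c = rng.normal(size=n)                    # individual fixed effects
y = np.empty((n, T + 1))
y[:, 0] = c + rng.normal(size=n)          # arbitrary initial condition
for t in range(1, T + 1):
    y[:, t] = rho * y[:, t - 1] + c + rng.normal(size=n)

# Conventional fixed effects: demean current and lagged y within individuals,
# then regress the demeaned current value on the demeaned lag
cur, lag = y[:, 1:], y[:, :-1]
cur_d = cur - cur.mean(axis=1, keepdims=True)
lag_d = lag - lag.mean(axis=1, keepdims=True)
rho_hat = (cur_d * lag_d).sum() / (lag_d ** 2).sum()
print(rho_hat)   # substantially below the true value of .5
```

The within-transformation makes the demeaned lag correlated with the demeaned error, so the estimate is biased downward even with thousands of individuals.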

Arellano and Bond (1991) solved these problems by using earlier lagged values of x and y as instrumental variables and by applying a generalized method of moments (GMM) estimator.  Several software packages now implement this method, including SAS, Stata, LIMDEP, and the plm package for R. 

My solution to the problems has been to estimate each equation separately by ML using any SEM package (e.g., LISREL, Mplus, PROC CALIS in SAS, or sem in Stata).  Two “tricks” are necessary. Focusing on the first equation, fixed effects are accommodated by allowing c to be correlated with all measurements of x (as well as the initial measurement of y). Second, the error term e is allowed to be correlated with all future measurements of x. Analogous methods are used to estimate the second equation.  For details, see the SEM chapters in my 2005 and 2009 books.

In my 2005 paper, I presented simulation evidence that the ML-SEM method produces approximately unbiased estimates of the coefficients under a variety of conditions. For years, I’ve been promising to do a head-to-head comparison of ML with Arellano-Bond, but I’ve just now gotten around to doing it.

What I’m going to report here are some very preliminary but dramatic results. The model used to generate the data was one in which x has a positive effect on y, but y has a negative effect on x:

yit = .5xi(t-1) + .5yi(t-1) + ci + eit

xit = .5xi(t-1) – .5yi(t-1) + fi + dit

All variables have normal distributions, c has a positive correlation with x, f has a positive correlation with y, and c and f are positively correlated with each other. The baseline model had 5 time points (T=5), with sample sizes of 50, 100, 400, and 1600. Then, keeping the sample size at 400, I examined T = 4 and T = 10. For each condition I did 1000 replications.
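A sketch of this data-generating process in Python. The unit variances, the .5 correlation between c and f, and the form of the initial conditions are my own filler values; the post does not report all of them:

```python
import numpy as np

def gen_panel(n=400, T=5, rng=None):
    """Simulate the cross-lagged model: x -> y positive, y -> x negative."""
    rng = np.random.default_rng(rng)
    # correlated individual effects: c enters the y equation, f the x equation
    c, f = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
    x = np.empty((n, T))
    y = np.empty((n, T))
    x[:, 0] = 0.5 * c + rng.normal(size=n)   # makes x correlated with c
    y[:, 0] = 0.5 * f + rng.normal(size=n)   # makes y correlated with f
    for t in range(1, T):
        y[:, t] = 0.5 * x[:, t-1] + 0.5 * y[:, t-1] + c + rng.normal(size=n)
        x[:, t] = 0.5 * x[:, t-1] - 0.5 * y[:, t-1] + f + rng.normal(size=n)
    return x, y
```

Each replication of the simulation would generate a panel like this and then fit both estimators to it.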

I focus here on the coefficient for the effect of x on y in the first equation. For each condition, I calculated the mean squared error (MSE), which is the variance of the estimator plus its squared bias. There was little bias in either estimator, so the MSE primarily reflects sampling variance.

Here are the preliminary results:

Mean Squared Error for Two Estimators

[Table: MSE for the A-B and ML estimators and their relative efficiency; conditions include N = 50, 100, 400, and 1600 at T = 5, plus N = 400 at T = 4 and N = 400 at T = 10]

The last column, relative efficiency, is the ratio of the MSE for ML to the MSE for A-B. With 5 time points, A-B is only about half as efficient as ML-SEM, for any sample size.  But the number of time points has a dramatic effect.  A-B is only 29% efficient for T=4 but 79% efficient for T=10. 

The next steps are to vary such things as the magnitudes of the coefficients, the variances of the error terms, and the correlations between c and f with each other and with the predictor variables.

Besides its efficiency advantage, the ML-SEM framework makes it easier than A-B to accomplish several things:

  • Handle missing data by FIML.
  • Relax various constraints, such as constant error variance or constant coefficients.
  • Construct a likelihood ratio test comparing fixed vs. random effects, the equivalent of the Hausman test, which not infrequently breaks down.
  • Add an autoregressive structure to the time-specific error components.

Before concluding, I must mention that Hsiao et al. (2002) also did a simulation study to compare ML with a variety of other estimators for the panel model, including A-B. However, their approach to ML was very different from mine, and it has not been implemented in any commercial software packages. Hsiao et al. found that ML did better with respect to both bias and efficiency than any of the other estimators, under almost all conditions. Nevertheless, the differences between ML and A-B were much smaller than those reported here. 

If you’re reading this post, you should definitely read next month’s follow-up by clicking here.

To learn more about these and other methods for panel data, check out my seminars, Longitudinal Data Analysis Using SAS and Longitudinal Data Analysis Using Stata. Both will be offered in the spring of 2015.  Plus, I am offering a new, more advanced seminar titled Longitudinal Data Analysis Using SEM in Fort Myers, Florida, January 23-24. 


Allison, Paul D. (2000) “Inferring Causal Order from Panel Data.” Paper presented at the Ninth International Conference on Panel Data, June 22, Geneva, Switzerland.

Allison, Paul D. (2005a) “Causal Inference with Panel Data.” Paper presented at the Annual Meeting of the American Sociological Association, August, Philadelphia, PA.

Allison, Paul D. (2005b) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: SAS Institute.

Allison, Paul D. (2009) Fixed Effects Regression Models. Thousand Oaks, CA:  Sage Publications.

Arellano, M. and S. Bond (1991) “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations.” The Review of Economic Studies 58: 277-297.  

Hsiao, Cheng, M. Hashem Pesaran, and A. Kamil Tahmiscioglu (2002) “Maximum likelihood estimation of fixed effects dynamic panel data models covering short time periods.” Journal of Econometrics 109: 107-150.

When I teach my seminar on Missing Data, the most common question I get is “What can I do if my data are not missing at random?” My usual answer is “Not much,” followed by “but you can do a sensitivity analysis.” Everyone agrees that a sensitivity analysis is essential for investigating possible violations of the missing at random assumption. But, unfortunately, there’s little guidance in the literature on how to actually do one. And, until recently, no commercial software package had options for doing a sensitivity analysis.

I’m happy to report that PROC MI in SAS 9.4 has several options for doing a sensitivity analysis based on multiple imputation. I’ve recently had a chance to read the documentation and do a few test runs. A little later in this post, I’ll tell you what I’ve learned.

But first, some background. There are two widely-used “modern” methods for handling missing data: multiple imputation and maximum likelihood. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables. This assumption would be violated, for example, if people with high income were less likely to report their income.

The MAR assumption does allow missingness to depend on anything that you observe, it just can’t depend on things that you don’t observe. MAR is not a testable assumption. You may suspect that your data are not missing at random, but nothing in your data will tell you whether or not that’s the case.

It’s possible to do multiple imputation or maximum likelihood when data are missing not at random (MNAR), but to do that, you first need to specify a model for the missing data mechanism—that is, a model of how missingness depends on both observed and unobserved quantities. That raises three issues:

  • For any data set, there are an infinite number of possible MNAR models.
  • Nothing in the data will tell you which of those models is better than another.
  • Results may depend heavily on which model you choose.

That’s a dangerous combination. And it’s why a sensitivity analysis is so important. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, you can feel pretty confident that, even if the data are not missing at random, your conclusions would not be compromised. On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy.

Keep in mind that this is not a test. Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways.

There’s nothing very deep about this. The hard part is figuring out how to come up with a reasonable set of models. It’s particularly hard if you’re using maximum likelihood to handle the missing data. Elsewhere I’ve argued for the advantages of maximum likelihood over multiple imputation. But one attraction of multiple imputation is that it’s easier to do a decent sensitivity analysis.

That’s where the new options for PROC MI come in. I think they’re easiest to explain by way of an example. In my Missing Data seminar, I use an example data set called COLLEGE, which contains information on 1302 four-year colleges and universities in the U.S. The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier.

There are lots of missing data for the five predictor variables, but we’re going to focus on the 98 colleges that did not report their graduation rates. It’s plausible that colleges with low graduation rates would be less likely to report those rates in order to avoid adverse publicity. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower (unobserved) graduation rates than those colleges that report their graduation rates, controlling for other variables.

PROC MI allows us to build that supposition into the multiple imputation model. We can, for example, specify an imputation model that says that the imputed values of GRADRAT are only 80% of what they would be if the data were actually missing at random. Here’s the SAS code for doing that:

      PROC MI DATA=college OUT=miout NIMPUTE=5;
         FCS;
         VAR <predictor variables> GRADRAT;
         MNAR ADJUST(GRADRAT / SCALE=.8);
      RUN;

This program produces five data sets, with missing data imputed by linear regression. For a sensitivity analysis, the essential ingredient here is the MNAR statement. The ADJUST option says to multiply the imputed values of GRADRAT by .80 at each step of the iterative process. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1.

The MNAR statement only works if you specify the MONOTONE method or the FCS method, which is what I used here. FCS stands for fully conditional specification, and it’s equivalent to the chained equation or sequential regression method used in many other packages. The MNAR statement does not work if you use the default MCMC method. [It could probably be done for MCMC, but that would mess up the elegant computational algorithm. FCS is already a “messy” algorithm, so a little more mess is no big deal].

Instead of multiplying the imputed values by some constant, we could add or subtract a constant, for example,

      MNAR ADJUST(GRADRAT / SHIFT=-20);

This would subtract 20 points from any imputed graduation rates. Again, to do a sensitivity analysis, you’d want to try out a range of different SHIFT values to see what effect that would have on your results.

The SHIFT and SCALE options can be combined. The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal or nominal), except that the changes are applied on the log-odds scale.

Another option allows you to restrict the adjustments to certain subsets of the data, e.g.,

      MNAR ADJUST(GRADRAT / SHIFT=-20 ADJUSTOBS=(PRIVATE='1'));

This says to subtract 20 points from the imputed values of graduation rates, but only for private colleges, not for public colleges. If you use the ADJUSTOBS option, the subsetting variable (PRIVATE in this case) should be listed in a CLASS statement.  

There are also other options, which you can read about here. An introductory article written by the guy who developed PROC MI, Yang Yuan, can be downloaded here.

If you don’t use SAS, you can do adjustments like this using other multiple imputation software along with a little programming. You first produce data sets under the MAR assumption and then you modify imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.
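In code, the cruder post-hoc version amounts to something like the sketch below. The helper function is hypothetical, and, unlike PROC MI, it adjusts only once at the end rather than at each iteration of the chained equations:

```python
import numpy as np

def adjust_imputations(completed, was_missing, shift=0.0, scale=1.0):
    """Apply a SHIFT/SCALE MNAR adjustment to values that were imputed
    under MAR, leaving the originally observed values untouched."""
    out = np.asarray(completed, dtype=float).copy()
    out[was_missing] = scale * out[was_missing] + shift
    return out

# e.g., shrink imputed graduation rates to 80% of their MAR-based values:
# adjusted = adjust_imputations(completed, was_missing, scale=0.8)
```

You would apply this separately to each of the completed data sets, then rerun the analysis model and combine results in the usual way.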

This particular way of doing a sensitivity analysis is based on something called pattern-mixture models for MNAR. You can read more about pattern-mixture models in Chapter 10 of the book Multiple Imputation and Its Application by James Carpenter and Michael Kenward.

Finally, it’s worth noting that the inclusion of appropriate auxiliary variables into the imputation model can go a long way toward reducing the likelihood of MNAR. The best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing. For more on auxiliary variables, see this recent paper by Tenko Raykov, one of Statistical Horizons’ instructors.

For several years now, I’ve been promoting something I called the “hybrid method” as a way of analyzing longitudinal and other forms of clustered data. My books Fixed Effects Regression Methods for Longitudinal Data Using SAS (2005) and Fixed Effects Regression Models (2009) both devoted quite a few pages to this methodology. However, recent research has led me to be a little more cautious about this approach, especially for logistic regression models.

Like other fixed effects methods, the hybrid method provides a way of controlling for all cluster-level covariates, whether observed or not. It is sometimes called the “poor man’s conditional likelihood” (Neuhaus and McCulloch 2006) or the “between-within (BW) method” (Sjölander et al. 2013). Actually, I like that second name so much that I’m going to start using it in all future teaching and writing.

First introduced by Neuhaus and Kalbfleisch (1998), the BW method embeds a fixed effects estimator within the framework of a random effects (mixed) model, thereby enabling one to reap several benefits of both approaches. It’s especially attractive for models like ordinal logistic regression or negative binomial regression where more standard fixed effects methods (like conditional likelihood) are not available.

I’ll review the BW method in a moment, but first let me tell you why I’m writing this post. Despite the many attractions of the method, I’ve long been bothered by the fact that there was no proof that it did what was claimed—that is, provide consistent (and, hence, approximately unbiased) estimates of the causal parameters of interest. For linear models, the BW method produces exactly the same results as the standard fixed effects method, so that’s reassuring. But for logistic regression, the BW estimates are not identical to those produced by conditional likelihood (the gold standard)—close, but no cigar.

I’m glad to report that there’s been some real progress on this front in recent years, but not all of it is encouraging. Goetgeluk and Vansteelandt (2008) proved that the BW method produces consistent estimates in the linear case—something that was already pretty apparent. However, they also showed that the BW method could yield biased estimates for some nonlinear models, including logistic and Poisson regression. Fortunately, they also observed that the biases were typically small in most practical settings.

Brumback et al. (2010) also showed by simulation that the BW method, when applied to binary logistic regression, could produce inconsistent estimates of the within-cluster causal parameters, although, again, the biases were small. But in a later paper, Brumback et al. (2012) presented a simulation in which the downward bias was as high as 45%.

So, does that mean that we should no longer use the BW method for logistic regression? I don’t think so. To explain why, I need to get a little more technical. Let Yij be a binary (0,1) dependent variable for the jth individual in the ith cluster. Let Xij be a column vector of predictor variables, again for the jth individual in the ith cluster. (If you’re not comfortable with vectors, just think of X as a single variable.) We postulate the model

     logit(Pr(Yij = 1| Xij, αi)) = μ + βXij + αi                                                                                     (1)

where β is a row vector of coefficients, and αi is an unobserved “effect” of being in cluster i. It can be thought of as representing the combined effects of all cluster-level variables that are not included in the model.
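To make the confounding concrete, here is a minimal simulation sketch in Python (numpy only; the parameter values and variable names are my own illustrative choices, not from any of the cited papers). It generates data from model (1) while deliberately making αi depend on the cluster mean of X, so the unobserved cluster effect is correlated with the regressor:

```python
import numpy as np

rng = np.random.default_rng(42)
n_clusters, n_per = 200, 10

# One row per individual; 'cluster' indexes the cluster each row belongs to
cluster = np.repeat(np.arange(n_clusters), n_per)
x = rng.normal(size=n_clusters * n_per)

# Make alpha_i depend on the cluster mean of X, so the unobserved
# cluster effect is correlated with the regressor (confounding).
xbar = np.bincount(cluster, weights=x) / n_per        # cluster means of X
alpha = 1.5 * xbar + rng.normal(scale=0.5, size=n_clusters)

# Model (1): logit(Pr(Y=1)) = mu + beta*X + alpha_i, with mu = -0.5, beta = 1
eta = -0.5 + 1.0 * x + alpha[cluster]
p = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, p)

print(y.shape, round(y.mean(), 3))
```

Fitting a conventional random-intercept logistic model to data like these, with α treated as independent of X, is exactly the situation where the estimate of β can be badly biased.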

How can we consistently estimate β? Well, we could estimate a generalized linear mixed model, treating α as a normally distributed random variable with mean 0 and constant variance. This can be done in SAS with PROC GLIMMIX or in Stata with the melogit command. But standard implementations assume that α is independent of X, which means that we aren’t really controlling for α as a potential confounder. If α is actually correlated with X, you’ll get biased estimates of β using this method, and the bias could be severe.

A second approach is to use conditional likelihood, which is known to produce consistent estimates of β even if α is correlated with X. You can do this with the clogit command in Stata or with PROC LOGISTIC using the STRATA statement.

A third approach is the BW method, which decomposes X into a within-cluster component and a between-cluster component. Letting X̄i be the vector of cluster-specific means of the Xij, we estimate the following generalized linear mixed model

         logit(Pr(Yij = 1 | Xij, X̄i, αi)) = μ + βW(Xij − X̄i) + βB X̄i + αi                                                       (2)

using standard software like GLIMMIX or melogit.
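Whatever software fits the mixed model, the decomposition itself is just data preparation. Here is a minimal sketch with pandas (the column names are hypothetical) that builds the between and within components you would then pass to GLIMMIX, melogit, or any other mixed-model routine:

```python
import pandas as pd

# Toy clustered data: 'x' is the predictor, 'cluster' the grouping variable
df = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 2],
    "x":       [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
})

# Between component: cluster-specific mean of x, broadcast back to each row
df["x_between"] = df.groupby("cluster")["x"].transform("mean")

# Within component: deviation from the cluster mean
df["x_within"] = df["x"] - df["x_between"]

# The within deviations sum to zero inside every cluster by construction
print(df)
```

The same two lines generalize to any number of predictors; in Stata you would get the identical columns with `egen` and `generate` before calling melogit.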

It’s long been believed that the estimate of βW, because it depends only on within-cluster variation, controls for confounding with α and produces consistent estimates of β in equation (1). We now know that that’s not true, at least not in general. But there is one condition in which it is true. Specifically, it’s true when αi in (1) is a linear function of X̄i, plus a random error that is independent of Xij (Brumback et al. 2010). It’s reasonable to suppose, then, that the performance of the BW method for logistic regression depends, at least in part, on how closely this linearity assumption is satisfied. I’ll make use of that idea shortly.

Why not just do conditional likelihood, which we know will give consistent estimates under more general conditions? First, there are lots of non-linear models for which conditional likelihood is just not possible. Those models include binary probit, binary complementary log-log, cumulative (ordered) logit, and negative binomial regression.

Second, even for binary logit, there are many attractions of the BW method that conditional likelihood can’t provide: random slopes, a test for confounding, the ability to include cluster-level covariates, and the ability to allow for higher-levels of clustering.

As previously mentioned, all the available evidence suggests that the bias of the BW method is small in most situations. The simulation by Brumback et al. that yielded 45% downward bias was very extreme: the cluster-level effect was highly correlated with the x’s, it had a dominating effect on the outcome, and it depended on the maximum of the x’s across individuals, not their mean. Also, there were only two individuals per cluster, which tends to magnify any other sources of bias.

Here’s a modest proposal for how to reduce any biases that may occur. First, it’s easy to show that the following equation is the exact equivalent of model (2):

           logit(Pr(Yij = 1 | Xij, X̄i, αi)) = μ + βW Xij + (βB − βW)X̄i + αi                                                       (3)

This tells us that as long as the means are in the equation, you don’t have to express Xij as deviations from those means.
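In the linear case this equivalence is easy to verify numerically, because the two parameterizations span the same column space: the within slope is the same whether or not Xij is centered. A small numpy check (simulated data of my own devising; ordinary least squares stands in for the mixed model, which suffices for the algebraic point):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_per = 50, 4
cluster = np.repeat(np.arange(n_clusters), n_per)

x = rng.normal(size=n_clusters * n_per)
xbar = (np.bincount(cluster, weights=x) / n_per)[cluster]  # cluster mean, row-wise
y = 2.0 * x + 1.0 * xbar + rng.normal(size=x.size)

ones = np.ones_like(x)
# Parameterization (2): intercept, centered x, cluster mean
X2 = np.column_stack([ones, x - xbar, xbar])
# Parameterization (3): intercept, raw x, cluster mean
X3 = np.column_stack([ones, x, xbar])

b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
b3, *_ = np.linalg.lstsq(X3, y, rcond=None)

# The within-cluster slope agrees across parameterizations, and the
# coefficient on the mean in (3) is (between - within), as equation (3) says
print(b2[1], b3[1])
```

The fitted models are identical; only the coefficient on X̄i changes its meaning, from βB in (2) to βB − βW in (3).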

To check or allow for nonlinearity, I suggest adding polynomial functions of the means to equation (3), such as

          logit(Pr(Yij = 1 | Xij, X̄i, αi)) = μ + βW Xij + θ1X̄i + θ2X̄i² + θ3X̄i³ + αi                                          (4)

You could even include other cluster-level functions of the Xijs, such as their standard deviation. If the additional terms are not statistically significant, and if the estimate of βW doesn’t change much, then you can feel more confident that bias is not an issue.
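Concretely, this check amounts to adding a few more cluster-level columns to the design before refitting. A sketch continuing the pandas data-prep idea above (column names again hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "x":       [1.0, 3.0, 2.0, 6.0, 0.0, 4.0],
})

g = df.groupby("cluster")["x"]
df["x_mean"] = g.transform("mean")
df["x_mean_sq"] = df["x_mean"] ** 2          # quadratic term in equation (4)
df["x_mean_cu"] = df["x_mean"] ** 3          # cubic term in equation (4)
df["x_sd"] = g.transform("std")              # another cluster-level function of the X's

# Pass x, x_mean, x_mean_sq, x_mean_cu (and optionally x_sd) to the
# mixed-model routine; then compare the coefficient on x with and
# without the extra cluster-level terms.
print(df)
```

If the coefficient on x is stable and the added terms are not significant, that is the reassurance the text describes.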

To sum up, the BW/hybrid method is not quite as good as I once thought, at least not for non-linear models. But it’s certainly better than doing conventional random effects/mixed models, and usually comes pretty close to the mark. And, lastly, there may be ways to improve it.


Allison, PD (2005) Fixed Effects Regression Methods for Longitudinal Data Using SAS. Cary, NC: The SAS Institute.

Allison, PD (2009) Fixed Effects Regression Models. Thousand Oaks, CA: Sage Publications.

Brumback BA, AB Dailey, LC Brumback, MD Livingston, Z He (2010) “Adjusting for confounding by cluster using generalized linear mixed models.” Statistics and Probability Letters 80:1650–1654.

Brumback BA, HW Zheng, and AB Dailey (2012) “Adjusting for confounding by neighborhood using generalized linear mixed models and complex survey data.” Statistics in Medicine 32: 1313–1324.

Goetgeluk, S and S Vansteelandt (2008) “Conditional generalized estimating equations for the analysis of clustered and longitudinal data.” Biometrics 64: 772–780.

Neuhaus, JM and JD Kalbfleisch (1998) “Between- and within-cluster covariate effects in the analysis of clustered data.” Biometrics 54: 638–645.

Neuhaus, JM and CE McCulloch (2006) “Separating between- and within-cluster covariate effects by using conditional and partitioning methods.” Journal of the Royal Statistical Society Series B 68: 859–872.

Sjölander, A, P Lichtenstein, H Larsson and Y Pawitan (2013) “Between–within models for survival analysis.” Statistics in Medicine 32: 3067–3076.