In my last post, I explained several reasons why I prefer logistic regression over a linear probability model estimated by ordinary least squares, despite the fact that linear regression is often an excellent approximation and is more easily interpreted by many researchers.

I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the margins command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese). 

If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the margins command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.) 

I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)

There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The Stata code for estimating the model is

   webuse nhanes2f, clear
   logistic diabetes i.female age

The i. in front of black and female tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.

Here are the results:

    diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   2.050133   .2599694     5.66   0.000     1.598985     2.62857
    1.female |   1.167141   .1100592     1.64   0.101     .9701892    1.404074
         age |   1.061269    .003962    15.93   0.000     1.053532    1.069063
       _cons |   .0016525    .000392   -27.00   0.000     .0010381    .0026307


A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for non-blacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.

Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes.  Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:

   margins, dydx(black female age)

yielding the output

             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   .0400922   .0087055     4.61   0.000     .0230297    .0571547
    1.female |   .0067987   .0041282     1.65   0.100    -.0012924    .0148898
         age |   .0026287   .0001869    14.06   0.000     .0022623    .0029951


The dy/dx column gives the estimated change in the probability of diabetes for a 1-unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for non-blacks, on average.

How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: black=1 and black=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like age, the calculation of the AME is a little more complicated.

There’s another, more traditional, way to get marginal effects: for the variable black, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when black=1 and when black=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.

We can get the MEM in Stata with the command:

   margins, dydx(black female age) atmeans

with the result

             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   .0290993   .0066198     4.40   0.000     .0161246    .0420739
    1.female |   .0047259   .0028785     1.64   0.101    -.0009158    .0103677
         age |   .0018234   .0000877    20.80   0.000     .0016516    .0019953


Here the MEM for black is noticeably smaller than the AME for black. This is a consequence of the inherent non-linearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM). 

The non-linearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values. 

Here’s how to get the AME for black at four different ages:

   margins, dydx(black) at(age=(20 35 50 65)) 

1._at        : age             =          20
2._at        : age             =          35
3._at        : age             =          50
4._at        : age             =          65
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------      |
         _at |
          1  |   .0060899   .0016303     3.74   0.000     .0028946    .0092852
          2  |   .0144831   .0035013     4.14   0.000     .0076207    .0213455
          3  |   .0332459   .0074944     4.44   0.000      .018557    .0479347
          4  |   .0704719    .015273     4.61   0.000     .0405374    .1004065


We see that the AME of black at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should not be thought of as an interaction, however, because the model says that the effect on the log-odds (logit) is the same at all ages.

Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the log-odds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used

   predict yhat
   sum yhat


The predict command generates predicted probabilities and stores them in the variable yhat. The sum command produces descriptive statistics for yhat, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:

   twoway function y=ln(x/(1-x)), range(.005 .244) xtitle("Probability") ytitle("Log odds") 

Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:

   reg diabetes black female age, robust


I’ve requested robust standard errors to deal with any heteroskedasticity, but the t-statistics are about the same even without this option.

             |               Robust
    diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black |   .0383148    .008328     4.60   0.000     .0219902    .0546393
      female |   .0068601   .0041382     1.66   0.097    -.0012515    .0149716
         age |   .0021575   .0001214    17.76   0.000     .0019194    .0023955
       _cons |  -.0619671   .0049469   -12.53   0.000    -.0716639   -.0522703


The t-statistics here are pretty close to the z-statistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of black on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.  

The upshot is that combining logistic regression with the margins command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probability-based estimates vary under different conditions.

In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:

  • Within the range of .20 to .80, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
  • People understand changes in probabilities much better than they understand odds ratios.
  • OLS regression is much faster than linear regression.

I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.

Speed. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.

Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with k-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the Invariance section below.

Predicted probabilities. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.

In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.

Interpretability. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.

In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.

Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.

A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.

Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its margins command, which I will demonstrate in my next post.

Invariance. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.

Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.

This issue also arises for categorical predictors. Consider a dichotomous y and a single dichotomous predictor x. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for y on x.

One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Study (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.

There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The model also includes the interaction of black and age. The Stata code for estimating the model is

webuse nhanes2f, clear
logistic diabetes black female age black#c.age

with the following results:


    diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
       black |   3.318733   1.825189     2.18   0.029     1.129381    9.752231
      female |   1.165212   .1098623     1.62   0.105     .9686107    1.401718
         age |   1.063009   .0044723    14.52   0.000     1.054279    1.071811
 black#c.age |   .9918406   .0090871    -0.89   0.371     .9741892    1.009812
       _cons |   .0014978   .0003971   -24.53   0.000     .0008909    .0025183

Given the high p-value for the interaction (.371), there is clearly no evidence here that the effect of black varies with age. Now let’s estimate the same model as a linear regression:

reg diabetes black female age black#c.age

The new results are:

    diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black |  -.0215031   .0191527    -1.12   0.262    -.0590461    .0160399
      female |   .0069338    .004152     1.67   0.095    -.0012049    .0150725
         age |   .0020176   .0001276    15.82   0.000     .0017675    .0022676
 black#c.age |   .0012962   .0003883     3.34   0.001     .0005351    .0020573
       _cons |  -.0553238   .0068085    -8.13   0.000    -.0686697   -.0419779

Now we have strong evidence for an interaction, specifically that the effect of black is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase. 

Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.

Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability does not? Absolutely. But I believe that that’s a far less likely outcome.  I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.

In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.

In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the margins command in Stata. I’ll also provide links for how to do it in SAS and R. 

In July 2015 I pointed out some advantages of the linear probability model over the logistic model. The linear model is much easier to interpret, and the linear model runs much faster, which can be important if the data set is large or the model is complicated. In addition, the linear probability model often fits about as well as the logistic model, since over some ranges the probability p is almost linearly related to the log odds function ln(p(1-p)) that is used in logistic regression.

As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.

Figure 1. The relationship between probability and log odds over the range of probabilities from .2 to .8.

This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.

For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2nd grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.

Figure 2. The relationship between probability and log odds over the range of probabilities from .09 to .13.

The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.

I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are still situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).

I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.[1]

Figure 3. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.

On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.

To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.

The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:

twoway function y=ln(x/(1-x)), range(.2 .8) xtitle(“Probability”) ytitle(“Log odds”)

I plotted Figure 2 using the same command, only changing the range to (.09 .13).

In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.


Angrist, J. D., & Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion (1st ed.). Princeton University Press.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.

Pischke, J.-S. (2012, July 9). Probit better than LPM? Retrieved from

von Hippel, P.T. & Workman, J. (2016). From Kindergarten Through Second Grade, U.S. Children’s Obesity Prevalence Grows Only During Summer Vacations. Obesity Volume 24, Issue 11, pages 2296–2300.

[1] The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed n/(n-1) where n is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.

On April 21-22, 2017, I will be offering a seminar on causal mediation analysis in Philadelphia with Statistical Horizons. The course will cover very recent developments in this area.

Mediation is about the mechanisms or pathways by which some treatment or exposure affects an outcome. Questions about mediation arise with considerable frequency in the biomedical and social sciences. Various approaches have been used in the social sciences for decades, especially in psychology. However, the new literature on mediation from the perspective of causal inference has helped clarify the assumptions that are being made by these analyses.  It has also generalized the approach to a much broader class of settings.

The course will cover many of these new advances. We will also talk about the relationships between the newer methods and the approaches that have been used in the biomedical and social sciences.

I have published a book on this topic with Oxford University Press:

The course will cover a lot of the material in Chapters 1-8 of the book, and the book will also serve as a useful reference for the course. Both the book and the course are based on a longer course on mediation and interaction that I have taught at Harvard for several years. I have managed to distill the essential content of that course into material that can be covered in just two days, so the course will provide a fairly comprehensive overview in a relatively short period of time.

The course will also provide an introduction to concepts in causal inference and so might be a fairly accessible way into that literature. The prerequistes for the course are fairly minimal, however, so I’ll essentially just be assuming some familiarity with linear and logistic regression. I have worked hard to make the course as accessible as possible, and have now taught and refined the material numerous times.

The rough outline of the course and the broad topics we will cover are as follows:

  1. Concepts and Methods for Mediation
  2. Sensitivity Analysis
  3. Mediation with Time-to-Event Outcome
  4. Multiple Mediators
  5. Surrogate Outcomes
  6. A Unification of Mediation and Interaction

More specifically, the course will cover the relationship between traditional methods for mediation in the biomedical and the social sciences and new methods in causal inference. For dichotomous, continuous, and time-to-event outcomes, we will discuss when the standard approaches to mediation analysis are valid or not valid, and we’ll extend these to more complex settings.

The no-confounding assumptions needed for these techniques will be described. SAS, SPSS and Stata macros to implement these techniques will be explained and distributed to course participants. The use and implementation of sensitivity analysis techniques to assess the how sensitive conclusions are to violations of assumptions will be covered, along with extensions to multiple mediators.

If you have any questions about the course feel free to drop me a note ( and I would be happy to try to provide further information. In any case, I would of course be happy to have you there.

Vaisey PicWhen Paul Allison asked me if I wanted to teach a course in Bangladesh, my first reaction was confusion.  Other than knowing a few basic facts about the place, I had spent little time thinking about the country and none at all imagining myself going there. And here, suddenly, was an opportunity to spend a week in the country teaching Stata.

I love Stata and always jump at the chance to teach it. But I was honestly a bit terrified. I had never been outside North America and Western Europe and don’t really consider myself much of a traveler. What would Dhaka be like? I had no real reference point for trying to predict the experience. Even as I was boarding the flight from Boston to Dubai (the second leg of a 28-hour trip), there was a part of me still asking, “What have you gotten yourself into?”

Fortunately my love of teaching Stata was greater than my anxiety because the week I spent in Dhaka ended up being one of the most satisfying of my professional life.

Most of the students I taught were participants in the CDC’s Field Epidemiology Training Program (FETP). This worldwide program takes government doctors who’ve been doing clinical work and trains them to be field epidemiologists. All of the doctors had received basic biostatistics training before I arrived. So my primary role was to teach them how to use Stata to implement the techniques they’d already learned in class.

The most surprising aspect of this experience was how much we ended up focusing on data cleaning. In most of my career, I’ve used very well-behaved data sets like the General Social Survey. In Dhaka, most of the students were dealing with what initially seemed to me very disorganized spreadsheets. Their data had all the numbers stored as text, inconsistent capitalization, and in many cases were half “long” format and half “wide” format.

I found it amazingly satisfying to teach these doctors how to take messy, complicated epidemiological surveillance data and turn them into something they could really use. There were one or two students who had been unable to make any progress on their projects for months because of data management difficulties. I had never met data management problems that complicated before. One particular reshape had me pulling my hair out and took me almost two hours to figure out. But I have rarely—maybe never—been as satisfied in front of a computer screen as the moment I finally got it.

Over the five days of the course, we covered a great deal, from t-tests, to logistic regression, to complex survey analysis. I especially enjoyed teaching these eager, energetic young doctors because they knew exactly how they were going to use these skills to make a difference in their country.

Fortunately I also had a little time to experience Dhaka. I got to ride a rickshaw, take a boat on the Ganges, walk down Hindu Street, and see buildings built by the East India Company. Just walking down the streets of Dhaka was one of the most incredible experiences of my life. The sights and sounds were beyond anything I had ever experienced. As I packed up my laptop on the last day of class, with the sound of the call to prayer in the background, my thoughts were no longer “what have I gotten myself into?” but instead “is this really over?”

Boarding the flight to return home, I was so grateful that my love of Stata had taken me to such an unexpected place. I hope I made as much of an impact on my students as they and their country made on me.

Stephen Vaisey is an Associate Professor of Sociology at Duke University. For Statistical Horizons, he regularly teaches a seminar on Treatment Effects Analysis.

vonhippel_SH2010_48B1DF-TESTIn his April 1 post, Paul Allison pointed out several attractive properties of the logistic regression model.  But he neglected to consider the merits of an older and simpler approach: just doing linear regression with a 1-0 dependent variable.  In both the social and health sciences, students are almost universally taught that when the outcome variable in a regression is dichotomous, they should use logistic instead of linear regression. Yet economists, though certainly aware of logistic regression, often use a linear model to model dichotomous outcomes.

Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.


Let’s start by comparing the two models explicitly. If the outcome Y is a dichotomy with values 1 and 0, define p = E(Y|X), which is just the probability that Y is 1, given some value of the regressors X. Then the linear and logistic probability models are:

p = a+ a1X1a2X2 + … + akXk    (linear)

ln[p/(1-p)] = b+ b1X1b2X2 + … + bkXk       (logistic)

The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1-p) is a linear function of the regressors.

The major advantage of the linear model is its interpretability. In the linear model, if a1 is (say) .05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.

The logistic model is less interpretable. In the logistic model, if b1 is .05, that means that a one-unit increase in X1 is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.

How intuitive are odds ratios?

Because the log odds scale is so hard to interpret, it is common to report logistic regression results as odds ratios. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:

p/(1-p) = d0 × (d1)X1 × (d2)X2 × … × (dk)Xk

On the left side we have the odds and on the right side we have a product involving the odds ratios d1 = exp(b1), d2 = exp(b2), etc.

Odds ratios seem like they should be intuitive. If d1 = 2, for example, that means that a one-unit increase in Xdoubles the odds that Y is 1. That sounds like something we should understand.

But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1-p)—and so does the odds ratio.

Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a get-out-the-vote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.

If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.

Here’s a table that shows what doubling the odds does to various initial probabilities:

Before doubling

After doubling










































It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If p is close to 0, then doubling the odds is approximately the same as doubling p. If p is close to 1, then doubling the odds is approximately the same as halving 1-p. But if p is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.

That’s why I’m not crazy about odds ratios.

How nonlinear is the logistic model?

The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).

For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability  (cf. Long 1997).


It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.


When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model.

A rule of thumb

These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.

Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).

Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[p/(1-p)] are undefined when p is equal to 0 or 1. When p gets close to 0 or 1 logistic regression can suffer from complete separation, quasi-complete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic. 

Computation and estimation

Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderate-sized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]

The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroskedastic with residual variance p(1-p), but the heteroscedasticity is minor if p is between .20 and .80, which is the situation where I recommend using the linear probability model at all. OLS estimates can be improved by using heteroscedasticity-consistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science. 


Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43(1), 59–74.

King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163.

Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.


[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.

[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.

Allison PicWhen estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time t is what happened at time t-1. 

This can work well for some kinds of models, but not for mixed models, otherwise known as a random effects models or multilevel models.  Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as dynamic panel data models.  Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward.  Let yit be the value of the dependent variable for individual i at time t.  Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector xit:

            yit = b0 + b1yi(t-1) + b2xit +  ui + eit

The random intercept ui represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the right-hand side.

That’s where the problem lies. Because the model applies to all time points, i has a direct effect on yi(t-1).  But if i affects yi(t-1), it can’t also be statistically independent of yi(t-1). The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: ed (years of education) and fem (1 for female, 0 for male). 

Here’s the Stata code:

use “”, clear
xtset id t
xtreg lwage L.lwage ed fem t

The xtset command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7.  The xtreg command fits a random-intercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors.  L.lwage specifies the one-year lag of lwage

Here’s the output:

       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lwage |
         L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851
          ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724
         fem |    -.06705    .010187    -6.58   0.000    -.0870162   -.0470839
           t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811
       _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males.  Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages. 

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983) which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply xtdpdqml to the wage data:

xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)

The re option specifies a random effects (random intercepts) model.  By default, the command includes the lag-1 dependent variable as a predictor.  The initval option sets the starting values for the four variance parameters that are part of the model.  Here is the output:

       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lwage |
         L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527
          ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666
         fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166
           t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378
       _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419

Results are markedly different from those produced above by xtreg.  The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Blacks now make 8 percent less than non-blacks rather than 1 percent less. And females make 24 percent less (calculated as 100(exp(-.28)-1) than males compared to 6 percent less. The annual increase in wages is 5 percent instead of 1 percent.

So doing it right can make a big difference.  Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with xtdpdml, use

xtset id t
xtdpdml lwage, inv(ed fem blk) errorinv

The inv option is for time-invariant variables.  The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a 1-time unit lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.

Here is the output:

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
lwage2       |
      lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909
          ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486
         fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315

Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper.  As a matter of fact, the xtdpdml command is just a front-end to the sem command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself.  That’s why we wrote the command.  

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both xtdpdqml and xtdpdml can handle this situation also. 

If you’d like to learn more about dynamic panel data models, check out my 2-day course on Longitudinal Data Analysis Using SEM. It will be offered again October 16-17, 2015, in Los Angeles. 


Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 1635-1659.

Rabe-Hesketh, Sophia, and Anders Skrondal  (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP. 

Allison PicIn my July 2012 post, I argued that maximum likelihood (ML) has several advantages over multiple imputation (MI) for handling missing data:

  • ML is simpler to implement (if you have the right software).
  • Unlike multiple imputation, ML has no potential incompatibility between an imputation model and an analysis model.
  • ML produces a deterministic result rather than a different result every time.

Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.

What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.

With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.

So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a t-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?

Wrong! In a paper that will be published soon in Structural Equation Modeling, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.

What about confidence intervals and p-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a t-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this t-distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.

Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.

Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for non-linear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.

MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that data that are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.

If you want to learn more about both multiple imputation and maximum likelihood, check out my two-day course on Missing Data that will be offered this fall. Dates and location have not yet been set. 

Allison PicFor the analysis of binary data, logistic regression dominates all other methods in both the social and biomedical sciences. It wasn’t always this way. In a 1934 article in Science, Charles Bliss proposed the probit function for analyzing binary data, and that method was later popularized in David Finney’s 1947 book Probit Analysis. For many years, probit was the method of choice in biological research.

In a 1944 article in the Journal of the American Statistical Association, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book Analysis of Binary Data. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite. 

So what is it about logistic regression that makes it so popular?  In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there is something special about the logit link that gives it a natural advantage over alternative link functions.

First, a brief review. For binary data, the goal is to model the probability p that one of two outcomes occurs. The logit function is log[p/(1-p)], which varies between -∞ and +∞ as p varies between 0 and 1.The logistic regression model says that

            log[p/(1-p)] = b0 + b1x1 + … + bkxk

or, equivalently,

            p = 1/(1 + exp{-(b0 + b1x1 + … + bkxk)})

Estimation of the b coefficients is usually accomplished by maximum likelihood.

In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)

What’s most important about the logit link is that it guarantees that p is bounded by 0 and 1, no matter what the b’s and the x’s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions.  But there are a few things that are special about logit:

1. If you exponentiate the coefficients, you get adjusted odds ratios. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.

2. With logit, you can do disproportionate stratified random sampling on the dependent variable without biasing the coefficients. For example, you could construct a sample that includes all of the events, and a 10% random sample of the non-events. This property is the justification for the widely-used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property.

3. You can do exact logistic regression. With conventional maximum likelihood estimation, the p-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables) exact p-values are obtained by enumeration of all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.

4. You can do conditional logistic regression. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function.  

What is it about the logit link that makes it possible to do these useful variations of logistic regression?  The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important exponential family of distributions. In general form, the probability distribution for the exponential family can be written as

            f(x|b) = h(x)exp{T(x)’g(b)–A(b)}

In this formula, x is a vector of the data, b is a vector of parameters, and h, g, T, and A are known functions.  If g(b) = b, then b is said to be the natural parameter (or canonical parameter) of the distribution. T(x) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.

For the binomial distribution, this formula specializes to 

            f(x|p) = [N x]exp{x log[p/(1-p)] –N log(1-p)}

where N is the number of trials, x is the number of events and [N x] is the binomial coefficient.  We see immediately that log[p/(1-p)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.

Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors. 

Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.  

There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables.  You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.

If you’d like to learn more about these and other methods for binary data, see my book Logistic Regression Using SAS: Theory and Application.  

Allison PicPredictive mean matching (PMM) is an attractive way to do multiple imputation for missing data, especially for imputing quantitative variables that are not normally distributed. But, as I explain below, it’s also easy to do it the wrong way. 

Compared with standard methods based on linear regression and the normal distribution, PMM produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.

PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.

There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.

Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

  1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
  2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. Typically this would be a random draw from a multivariate normal distribution with mean b and the estimated covariance matrix of b (with an additional random draw for the residual variance). This step is necessary to produce sufficient variability in the imputed values, and is common to all “proper” methods for multiple imputation.
  3. Using b*, generate predicted values for x for all cases, both those with data missing on x and those with data present.
  4. For each case with missing x, identify a set of cases with observed x whose predicted values are close to the predicted value for the case with missing data.
  5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value.
  6. Repeat steps 2 through 5 for each completed data set. 

Unlike many methods of imputation, the purpose of the linear regression is not to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.

There are several variations to this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10. 

On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.

That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014). 

Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.

So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.

The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But–as they say–more research is needed.


Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.

Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.

Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.

Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.