In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:

  • Within the range of .20 to .80, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
  • People understand changes in probabilities much better than they understand odds ratios.
  • OLS regression is much faster than logistic regression.

I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.

Speed. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.

Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with k-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the Invariance section below.
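
As a rough illustration of when the speed difference becomes noticeable, you can time the two estimators on the same data. Here is a minimal sketch in Stata, using the timer command and the NHANES data analyzed later in this post; the actual timings will depend on your machine and model:

webuse nhanes2f, clear
timer clear
timer on 1
regress diabetes black female age     // linear probability model by OLS
timer off 1
timer on 2
logistic diabetes black female age    // logistic model by maximum likelihood
timer off 2
timer list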

Predicted probabilities. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.

In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.
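
If you are unsure whether out-of-bounds predictions are a problem in your own data, it is easy to check. Here is a minimal sketch in Stata, again using the NHANES example analyzed later in this post (and, for simplicity, a model without the interaction):

webuse nhanes2f, clear
regress diabetes black female age
predict phat, xb                      // linear predictions, not constrained to [0, 1]
count if phat < 0 | phat > 1          // number of cases with out-of-bounds predictions
summarize phat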

Interpretability. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.

In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.

Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.
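
Spelling out the arithmetic, using the standard relationship between odds and probability:

odds = p/(1 - p),   so   p = odds/(1 + odds)
odds = 2    gives  p = 2/3        odds = 4    gives  p = 4/5     (odds ratio = 2)
odds = 1/3  gives  p = 1/4        odds = 4/3  gives  p = 4/7     (odds ratio = 4)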

A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.
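
To make the voting example concrete:

p = .6  gives  odds = .6/.4 = 1.5
doubling the odds:    odds = 3,   p = 3/4 = .75
10 times the odds:    odds = 15,  p = 15/16 = .94 (approximately)

Doubling the probability, by contrast, would give 1.2, which is not a probability at all.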

Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its margins command, which I will demonstrate in my next post.
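
As a preview, the kind of call involved is roughly the following sketch (the full worked example will come in the next post):

logistic diabetes black female age
margins, dydx(*)          // average marginal effects on the probability scale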

Invariance. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.

Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.
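
To see why, note that under a logistic model the slope on the probability scale is not constant. If p is the probability of the event and b is the logistic coefficient for x, then

dp/dx = b * p * (1 - p)

which reaches its maximum of b/4 at p = .50 and shrinks toward zero as p approaches 0 or 1. A straight line fit by least squares therefore approximates a different portion of this curve depending on where the observed probabilities happen to fall.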

This issue also arises for categorical predictors. Consider a dichotomous y and a single dichotomous predictor x. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for y on x.
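
A small numerical illustration may help; the counts below are invented. Write the counts for x = 1 as a (with y = 1) and b (with y = 0), and the counts for x = 0 as c and d, so that the odds ratio is (a*d)/(b*c) and the difference in proportions is a/(a + b) - c/(c + d).

a = 20, b = 80, c = 10, d = 90:   odds ratio = (20*90)/(80*10) = 2.25,   difference = .20 - .10 = .10
triple the y = 1 column (a = 60, c = 30):   odds ratio = (60*90)/(80*30) = 2.25,   difference = .43 - .25 = .18

The odds ratio is unchanged, but the difference in proportions nearly doubles simply because the outcome has become more common.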

One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Survey (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.

There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The model also includes the interaction of black and age. The Stata code for estimating the model is

webuse nhanes2f, clear
logistic diabetes black female age black#c.age

with the following results:

------------------------------------------------------------------------------
    diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |   3.318733   1.825189     2.18   0.029     1.129381    9.752231
      female |   1.165212   .1098623     1.62   0.105     .9686107    1.401718
         age |   1.063009   .0044723    14.52   0.000     1.054279    1.071811
 black#c.age |   .9918406   .0090871    -0.89   0.371     .9741892    1.009812
       _cons |   .0014978   .0003971   -24.53   0.000     .0008909    .0025183
------------------------------------------------------------------------------

Given the high p-value for the interaction (.371), there is clearly no evidence here that the effect of black varies with age. Now let’s estimate the same model as a linear regression:

reg diabetes black female age black#c.age

The new results are:

------------------------------------------------------------------------------
    diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -.0215031   .0191527    -1.12   0.262    -.0590461    .0160399
      female |   .0069338    .004152     1.67   0.095    -.0012049    .0150725
         age |   .0020176   .0001276    15.82   0.000     .0017675    .0022676
 black#c.age |   .0012962   .0003883     3.34   0.001     .0005351    .0020573
       _cons |  -.0553238   .0068085    -8.13   0.000    -.0686697   -.0419779
------------------------------------------------------------------------------

Now we have strong evidence for an interaction, specifically that the effect of black is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase. 
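
For readers who want to check the arithmetic, the implied coefficient at a given age is the main effect of black plus age times the interaction coefficient, taken from the table above:

at age 20:   -.0215031 + (.0012962)(20) = .004 (approximately)
at age 74:   -.0215031 + (.0012962)(74) = .074 (approximately)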

Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.

Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability model does not? Absolutely. But I believe that’s a far less likely outcome. I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.

In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.

In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the margins command in Stata. I’ll also provide links for how to do it in SAS and R. 

In July 2015 I pointed out some advantages of the linear probability model over the logistic model. The linear model is much easier to interpret, and it runs much faster, which can be important if the data set is large or the model is complicated. In addition, the linear probability model often fits about as well as the logistic model, since over some ranges the probability p is almost linearly related to the log odds, ln(p/(1-p)), that is used in logistic regression.

As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.

Figure 1. The relationship between probability and log odds over the range of probabilities from .2 to .8.

This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.

For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2nd grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.
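
To give a sense of what that tradeoff looks like in Stata, here is a purely hypothetical sketch (the variable names obese, month, and childid are invented for illustration, not taken from our paper):

mixed obese month || childid:        // linear probability model with a random intercept, typically fast
melogit obese month || childid:      // logistic counterpart, often much slower to converge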

Figure 2. The relationship between probability and log odds over the range of probabilities from .09 to .13.

The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.

I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are still situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).
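
Here is a hedged sketch of that point in Stata, with an invented binary outcome y and categorical predictors a and b. In a saturated model with no empty cells, both fits reproduce the observed cell proportions, so their predicted probabilities agree:

regress y i.a##i.b          // saturated linear probability model
predict p_lpm
logit y i.a##i.b            // saturated logistic model
predict p_logit, pr
summarize p_lpm p_logit     // cell by cell, the two sets of predictions coincide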

I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.[1]

Figure 3. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.

On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.

To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.

The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:

twoway function y=ln(x/(1-x)), range(.2 .8) xtitle("Probability") ytitle("Log odds")

I plotted Figure 2 using the same command, only changing the range to (.09 .13).

In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.

References

Angrist, J. D., & Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion (1st ed.). Princeton University Press.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.

Pischke, J.-S. (2012, July 9). Probit better than LPM? Retrieved from http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/

von Hippel, P. T., & Workman, J. (2016). From Kindergarten Through Second Grade, U.S. Children’s Obesity Prevalence Grows Only During Summer Vacations. Obesity, 24(11), 2296–2300. http://onlinelibrary.wiley.com/doi/10.1002/oby.21613/full

[1] The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed (n-1)/n, where n is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.

On April 21-22, 2017, I will be offering a seminar on causal mediation analysis in Philadelphia with Statistical Horizons. The course will cover very recent developments in this area.

Mediation is about the mechanisms or pathways by which some treatment or exposure affects an outcome. Questions about mediation arise with considerable frequency in the biomedical and social sciences. Various approaches have been used in the social sciences for decades, especially in psychology. However, the new literature on mediation from the perspective of causal inference has helped clarify the assumptions that are being made by these analyses.  It has also generalized the approach to a much broader class of settings.

The course will cover many of these new advances. We will also talk about the relationships between the newer methods and the approaches that have been used in the biomedical and social sciences.

I have published a book on this topic, Explanation in Causal Inference: Methods for Mediation and Interaction, with Oxford University Press.

The course will cover a lot of the material in Chapters 1-8 of the book, and the book will also serve as a useful reference for the course. Both the book and the course are based on a longer course on mediation and interaction that I have taught at Harvard for several years. I have managed to distill the essential content of that course into material that can be covered in just two days, so the course will provide a fairly comprehensive overview in a relatively short period of time.

The course will also provide an introduction to concepts in causal inference and so might be a fairly accessible way into that literature. The prerequisites for the course are fairly minimal: I’ll essentially just be assuming some familiarity with linear and logistic regression. I have worked hard to make the course as accessible as possible, and have now taught and refined the material numerous times.

The rough outline of the course and the broad topics we will cover are as follows:

  1. Concepts and Methods for Mediation
  2. Sensitivity Analysis
  3. Mediation with Time-to-Event Outcome
  4. Multiple Mediators
  5. Surrogate Outcomes
  6. A Unification of Mediation and Interaction

More specifically, the course will cover the relationship between traditional methods for mediation in the biomedical and the social sciences and new methods in causal inference. For dichotomous, continuous, and time-to-event outcomes, we will discuss when the standard approaches to mediation analysis are valid or not valid, and we’ll extend these to more complex settings.

The no-confounding assumptions needed for these techniques will be described. SAS, SPSS and Stata macros to implement these techniques will be explained and distributed to course participants. The use and implementation of sensitivity analysis techniques to assess how sensitive conclusions are to violations of assumptions will be covered, along with extensions to multiple mediators.

If you have any questions about the course feel free to drop me a note (tvanderw@hsph.harvard.edu) and I would be happy to try to provide further information. In any case, I would of course be happy to have you there.