## In Defense of Logit – Part 1

##### March 28, 2017 By Paul Allison

In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:

- Within the range of .20 to .80, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
- People understand changes in probabilities much better than they understand odds ratios.
- OLS regression is much faster than logistic regression.

I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.

*Speed*. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.

Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with *k*-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the *Invariance* section below.

*Predicted probabilities*. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.

In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.
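To see how easily out-of-bounds predictions arise, here is a small simulation (in Python rather than Stata, with invented data and coefficients, purely for illustration): fitting a linear probability model by OLS to a binary outcome generated from a true logistic curve yields fitted values below 0 and above 1.

```python
import math
import random

random.seed(0)

# Hypothetical data (invented for illustration): one continuous predictor x,
# and a binary y generated from a true logistic model.
n = 1000
x = [random.uniform(-4, 4) for _ in range(n)]
p_true = [1 / (1 + math.exp(-(0.5 + 1.5 * xi))) for xi in x]
y = [1 if random.random() < pi else 0 for pi in p_true]

# Linear probability model: simple OLS of y on x
xbar = sum(x) / n
ybar = sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar
yhat = [intercept + slope * xi for xi in x]

# The fitted "probabilities" fall outside [0, 1] at both ends
print(min(yhat), max(yhat))
```

The logistic fit to the same data would, by construction, keep every predicted probability strictly between 0 and 1.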

*Interpretability*. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.

In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.

Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.
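The arithmetic in the cold example can be sketched in a few lines (Python here just for illustration; the numbers are the ones in the paragraph above):

```python
# Odds as "events per non-event": converting between odds and probability.
def odds_from_prob(p):
    return p / (1 - p)

def prob_from_odds(o):
    return o / (1 + o)

# Odds of 2: two people catch a cold for every one who does not,
# i.e., a probability of 2/3.
print(prob_from_odds(2))        # 0.666...

# Going from odds of 2 to odds of 4 is an odds ratio of 2.
print(odds_from_prob(2/3))      # 2.0

# Odds of 1/3 quadrupled is 4/3: 4 catch cold for every 3 who do not,
# a probability of 4/7.
print(prob_from_odds(4/3))      # 0.571...
```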

A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.

Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its **margins** command, which I will demonstrate in my next post.

*Invariance*. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.

Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.

This issue also arises for categorical predictors. Consider a dichotomous *y* and a single dichotomous predictor *x*. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for *y* on *x*.
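A quick numerical check of that invariance, with invented counts (again a Python sketch, not part of the original analysis): multiplying a column of the 2 x 2 table by a constant leaves the odds ratio unchanged but alters the difference in proportions.

```python
# 2 x 2 table of counts: rows are x = 1, 0; columns are y = 1, 0.
def odds_ratio(a, b, c, d):          # table [[a, b], [c, d]]
    return (a * d) / (b * c)

def prop_diff(a, b, c, d):           # P(y=1 | x=1) - P(y=1 | x=0)
    return a / (a + b) - c / (c + d)

a, b, c, d = 30, 70, 10, 90          # invented counts
print(odds_ratio(a, b, c, d), prop_diff(a, b, c, d))

# Triple the y = 0 column (more non-events in both groups):
# the odds ratio is identical, the difference in proportions is not.
print(odds_ratio(a, 3*b, c, 3*d), prop_diff(a, 3*b, c, 3*d))
```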

One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Survey (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.

There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with **diabetes** (coded 1 or 0) as the dependent variable. Predictors are **age** (in years) and two dummy (indicator) variables, **black** and **female**. The model also includes the interaction of **black** and **age**. The Stata code for estimating the model is

```
webuse nhanes2f, clear
logistic diabetes black female age black#c.age
```

with the following results:

```
------------------------------------------------------------------------------
    diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |   3.318733   1.825189     2.18   0.029     1.129381    9.752231
      female |   1.165212   .1098623     1.62   0.105     .9686107    1.401718
         age |   1.063009   .0044723    14.52   0.000     1.054279    1.071811
 black#c.age |   .9918406   .0090871    -0.89   0.371     .9741892    1.009812
       _cons |   .0014978   .0003971   -24.53   0.000     .0008909    .0025183
------------------------------------------------------------------------------
```

Given the high *p*-value for the interaction (.371), there is clearly no evidence here that the effect of **black** varies with **age**. Now let’s estimate the same model as a linear regression:

```
reg diabetes black female age black#c.age
```

The new results are:

```
------------------------------------------------------------------------------
    diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -.0215031   .0191527    -1.12   0.262    -.0590461    .0160399
      female |   .0069338    .004152     1.67   0.095    -.0012049    .0150725
         age |   .0020176   .0001276    15.82   0.000     .0017675    .0022676
 black#c.age |   .0012962   .0003883     3.34   0.001     .0005351    .0020573
       _cons |  -.0553238   .0068085    -8.13   0.000    -.0686697   -.0419779
------------------------------------------------------------------------------
```

Now we have strong evidence for an interaction, specifically that the effect of **black** is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for **black** increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase.
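Those implied coefficients come straight from the linear regression table: the effect of **black** at a given age is the **black** coefficient plus age times the interaction coefficient. A quick check of the arithmetic (Python, for illustration):

```python
# Coefficients from the linear regression output above
b_black = -0.0215031        # main effect of black
b_black_x_age = 0.0012962   # black # c.age interaction

def black_effect(age):
    """Implied effect of black on the probability of diabetes at a given age."""
    return b_black + b_black_x_age * age

print(round(black_effect(20), 3))   # 0.004 at age 20
print(round(black_effect(74), 3))   # 0.074 at age 74
```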

Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.
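This mechanism is easy to verify numerically (a Python sketch with an invented odds ratio of 2, not the NHANES estimates): holding the odds ratio fixed while shifting the baseline log-odds changes the difference in probabilities, growing as the baseline rate approaches .50.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Fix the effect at log(2), i.e., a constant odds ratio of 2,
# and vary the baseline log-odds a (e.g., as the overall rate rises with age).
effect = math.log(2)
for a in (-4.0, -2.0, 0.0):
    p0, p1 = sigmoid(a), sigmoid(a + effect)
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    # The probability difference p1 - p0 grows as a approaches 0
    # (the steepest point of the curve), yet the odds ratio stays 2.
    print(round(p1 - p0, 4), round(odds_ratio, 4))
```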

Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability model does not? Absolutely. But I believe that’s a far less likely outcome. I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.

In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.

In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the **margins** command in Stata. I’ll also provide links for how to do it in SAS and R.

---

Hi Paul:

I like your blog.

I would be interested to hear your thoughts on the arguments made in favor of the LPM because logistic regression coefficients are influenced by unobserved heterogeneity. You probably are familiar with this widely-cited article by Carina Mood (Eur Sociol Rev 2010 26 (1): 67-82) which suggests the LPM as a possible solution to avoid errors in interpreting comparisons of logistic regression coefficients across groups or across models.

As you might suspect, I am not a fan of Mood’s proposal. To me, that’s essentially saying that the model we really want to estimate has certain problems. So we’ll switch to a model that we KNOW is wrong because it doesn’t have those problems. As a purely descriptive device for comparing two groups, the standard logistic model is perfectly fine. It’s only when you’re trying to compare the underlying subject-specific coefficients across groups that problems arise. And if you’re really interested in those subject-specific coefficients, you surely don’t want to switch to the LPM. As I explained in my post, the LPM is much more vulnerable to “spurious” interactions than logistic.