Skip to content

In Defense of Logit – Part 2

Paul Allison
April 27, 2017

In my last post, I explained several reasons why I prefer logistic regression over a linear probability model estimated by ordinary least squares, despite the fact that linear regression is often an excellent approximation and is more easily interpreted by many researchers.

I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the margins command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese).

If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the margins command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.)


I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)

There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The Stata code for estimating the model is

   webuse nhanes2f, clear
   logistic diabetes i.female age

The i. in front of black and female tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.

Here are the results:

    diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   2.050133   .2599694     5.66   0.000     1.598985     2.62857
    1.female |   1.167141   .1100592     1.64   0.101     .9701892    1.404074
         age |   1.061269    .003962    15.93   0.000     1.053532    1.069063
       _cons |   .0016525    .000392   -27.00   0.000     .0010381    .0026307

A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for non-blacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.

Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes.  Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:

   margins, dydx(black female age)

yielding the output

             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   .0400922   .0087055     4.61   0.000     .0230297    .0571547
    1.female |   .0067987   .0041282     1.65   0.100    -.0012924    .0148898
         age |   .0026287   .0001869    14.06   0.000     .0022623    .0029951

The dy/dx column gives the estimated change in the probability of diabetes for a 1-unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for non-blacks, on average.

How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: black=1 and black=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like age, the calculation of the AME is a little more complicated.

There’s another, more traditional, way to get marginal effects: for the variable black, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when black=1 and when black=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.

We can get the MEM in Stata with the command:

   margins, dydx(black female age) atmeans

with the result

             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------- |   .0290993   .0066198     4.40   0.000     .0161246    .0420739
    1.female |   .0047259   .0028785     1.64   0.101    -.0009158    .0103677
         age |   .0018234   .0000877    20.80   0.000     .0016516    .0019953

Here the MEM for black is noticeably smaller than the AME for black. This is a consequence of the inherent non-linearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM).

The non-linearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values.

Here’s how to get the AME for black at four different ages:

   margins, dydx(black) at(age=(20 35 50 65)) 

1._at        : age             =          20
2._at        : age             =          35
3._at        : age             =          50
4._at        : age             =          65
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------      |
         _at |
          1  |   .0060899   .0016303     3.74   0.000     .0028946    .0092852
          2  |   .0144831   .0035013     4.14   0.000     .0076207    .0213455
          3  |   .0332459   .0074944     4.44   0.000      .018557    .0479347
          4  |   .0704719    .015273     4.61   0.000     .0405374    .1004065

We see that the AME of black at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should not be thought of as an interaction, however, because the model says that the effect on the log-odds (logit) is the same at all ages.

Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the log-odds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used

   predict yhat
   sum yhat

The predict command generates predicted probabilities and stores them in the variable yhat. The sum command produces descriptive statistics for yhat, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:

   twoway function y=ln(x/(1-x)), range(.005 .244) xtitle("Probability") ytitle("Log odds") 

Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:

   reg diabetes black female age, robust

I’ve requested robust standard errors to deal with any heteroskedasticity, but the t-statistics are about the same even without this option.

             |               Robust
    diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black |   .0383148    .008328     4.60   0.000     .0219902    .0546393
      female |   .0068601   .0041382     1.66   0.097    -.0012515    .0149716
         age |   .0021575   .0001214    17.76   0.000     .0019194    .0023955
       _cons |  -.0619671   .0049469   -12.53   0.000    -.0716639   -.0522703

The t-statistics here are pretty close to the z-statistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of black on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.

The upshot is that combining logistic regression with the margins command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probability-based estimates vary under different conditions.


Leave a Reply

Your email address will not be published. Required fields are marked *