## What’s So Special About Logit?

##### April 1, 2015 By Paul Allison

For the analysis of binary data, logistic regression dominates all other methods in both the social and biomedical sciences. It wasn’t always this way. In a 1934 article in *Science*, Charles Bliss proposed the probit function for analyzing binary data, and that method was later popularized in David Finney’s 1947 book *Probit Analysis*. For many years, probit was the method of choice in biological research.

In a 1944 article in the *Journal of the American Statistical Association*, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book *Analysis of Binary Data*. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite.

So what is it about logistic regression that makes it so popular? In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there *is* something special about the logit link that gives it a natural advantage over alternative link functions.

First, a brief review. For binary data, the goal is to model the probability *p* that one of two outcomes occurs. The logit function is log[*p*/(1-*p*)], which varies between -∞ and +∞ as *p* varies between 0 and 1. The logistic regression model says that

$$\log\left[\frac{p}{1-p}\right] = b_0 + b_1 x_1 + \dots + b_k x_k$$

or, equivalently,

$$p = \frac{1}{1 + \exp\{-(b_0 + b_1 x_1 + \dots + b_k x_k)\}}$$

Estimation of the *b* coefficients is usually accomplished by maximum likelihood.

In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)
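
To make the two equivalent forms of the model concrete, here is a minimal Python sketch of the logit and its inverse. The coefficient values `b0` and `b1` and the predictor values are hypothetical, chosen only for illustration; they do not come from any fitted model.

```python
import math


def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse of the logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients for a one-predictor model (illustration only).
b0, b1 = -1.5, 0.8

for x1 in (-3.0, 0.0, 2.0, 10.0):
    eta = b0 + b1 * x1   # linear function of the predictor
    p = inv_logit(eta)   # probability implied by the logistic model
    print(f"x1={x1:5.1f}  linear predictor={eta:6.2f}  p={p:.4f}")
```

Whatever value the linear predictor takes in this example, the implied probability stays strictly between 0 and 1, and applying `logit` to that probability recovers the linear predictor.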

What’s most important about the logit link is that it guarantees that *p* is bounded by 0 and 1, no matter what the *b’*s and the *x’*s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions. But there are a few things that are special about logit:

1. If you exponentiate the coefficients, you get **adjusted odds ratios**. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.

2. With logit, you can do **disproportionate stratified random sampling** on the dependent variable without biasing the slope coefficients (only the intercept is shifted). For example, you could construct a sample that includes *all* of the events, and a 10% random sample of the non-events. This property is the justification for the widely used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property.

3. You can do **exact logistic regression**. With conventional maximum likelihood estimation, the *p*-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables), exact *p*-values are obtained by enumerating all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.

4. You can do **conditional logistic regression**. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function.
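
The sampling property in item 2 can be checked numerically. The sketch below is not a model fit. It applies Bayes’ rule to get the event probability among sampled units and then measures how far that probability moves on the link scale. The sampling fractions (all events, 10% of non-events) follow the example in item 2; the grid of linear-predictor values and all function names are assumptions made for illustration. The probit comparison uses `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def sampled_prob(p, keep_events, keep_nonevents):
    """P(event | unit is retained in the sample), by Bayes' rule."""
    num = p * keep_events
    return num / (num + (1.0 - p) * keep_nonevents)


def link_shift(link, inv_link, eta, keep_events, keep_nonevents):
    """Change in link(p) caused by outcome-dependent sampling at linear predictor eta."""
    p_pop = inv_link(eta)
    p_smp = sampled_prob(p_pop, keep_events, keep_nonevents)
    return link(p_smp) - link(p_pop)


# Keep all events and a 10% random sample of non-events.
keep_events, keep_nonevents = 1.0, 0.10
etas = (-2.0, 0.0, 2.0)  # illustrative values of the linear predictor

logit_shifts = [
    link_shift(logit, inv_logit, e, keep_events, keep_nonevents) for e in etas
]
probit_shifts = [
    link_shift(std_normal.inv_cdf, std_normal.cdf, e, keep_events, keep_nonevents)
    for e in etas
]

print("logit shifts: ", [round(s, 4) for s in logit_shifts])
print("probit shifts:", [round(s, 4) for s in probit_shifts])
print("log(keep_events / keep_nonevents):", round(math.log(keep_events / keep_nonevents), 4))
```

On the logit scale the shift is the same constant, log(1.0/0.10), at every value of the linear predictor, so it is absorbed into the intercept and the slopes are untouched. On the probit scale the shift changes with the linear predictor, so the sampled data no longer follow a probit model with the original slopes.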

What is it about the logit link that makes it possible to do these useful variations of logistic regression? The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important **exponential family** of distributions. In general form, the probability distribution for the exponential family can be written as

$$f(x \mid b) = h(x)\exp\{T(x)'g(b) - A(b)\}$$

In this formula, *x* is a vector of the data, *b* is a vector of parameters, and *h*, *g*, *T*, and *A* are known functions. If *g*(*b*) = *b*, then *b* is said to be the **natural parameter** (or canonical parameter) of the distribution. *T*(*x*) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.

For the binomial distribution, this formula specializes to

$$f(x \mid p) = \binom{N}{x}\exp\left\{x\log\left[\frac{p}{1-p}\right] + N\log(1-p)\right\}$$

where *N* is the number of trials, *x* is the number of events and $\binom{N}{x}$ is the binomial coefficient. Matching terms with the general form, the term corresponding to *A* is -*N* log(1-*p*), which is why it enters the exponent with a plus sign. We see immediately that log[*p*/(1-*p*)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.
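
As a sanity check on the exponential-family form of the binomial, the following sketch compares it with the textbook probability mass function. It writes the log-partition term as a function of the natural parameter θ = log[*p*/(1-*p*)], namely *A*(θ) = *N* log(1 + exp(θ)), which equals -*N* log(1-*p*), so the exponent is *x*θ + *N* log(1-*p*). The function names and the values of `n` and `p` are assumptions chosen for illustration.

```python
import math


def binom_pmf(x, n, p):
    """Textbook binomial probability of x events in n trials."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)


def binom_pmf_expfam(x, n, p):
    """Same probability written in exponential-family form."""
    theta = math.log(p / (1.0 - p))                    # natural parameter: the logit
    log_partition = n * math.log1p(math.exp(theta))    # A(theta) = -n * log(1 - p)
    # h(x) = C(n, x); sufficient statistic T(x) = x multiplies theta directly.
    return math.comb(n, x) * math.exp(x * theta - log_partition)


n, p = 10, 0.3  # illustrative values
max_gap = max(
    abs(binom_pmf(x, n, p) - binom_pmf_expfam(x, n, p)) for x in range(n + 1)
)
print("largest absolute difference between the two forms:", max_gap)
```

The two functions agree up to floating-point rounding for every *x* from 0 to *N*, which confirms that the number of events enters the likelihood only through its product with the logit.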

Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors.

Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.

There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables. You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.

If you’d like to learn more about these and other methods for binary data, see my book *Logistic Regression Using SAS: Theory and Application*.

“Check out next month’s post for arguments in favor of linear models for binary outcomes.”

So, is this post still coming?

Yes, it’s coming in a few days.

We have a sample of 144 and used simple random sampling. Can we fit a probit model? What is the minimum sample size for probit regression?

How many cases do you have for the less frequent category of your dependent variable?

Dr. Allison,

Given the July 5, 2014 post by Paul von Hippel, how would you recommend we present the results of logistic regression analysis? In terms of the odds ratios given their intuitive ease of understanding, or convert the ratios to probabilities?

I prefer odds ratios.