Competing risks are common in the analysis of event time data. The classic example is death, with distinctions among different kinds of death: if you die of a heart attack, you can’t then die of cancer or suicide. But examples also abound in other fields. A marriage can end either by divorce or by the death of a spouse, but not both. If a loan is terminated by prepayment, it is no longer at risk of a default.
In this post, I argue that one of the more popular methods for regression analysis of competing risks—the analysis of subdistribution hazards, introduced by Fine and Gray (1999)—is not valid for causal inference. In fact, it can produce completely misleading results. Instead, I recommend the analysis of causespecific hazards, a longstanding and easily implemented method.
Let’s review that classic method first. The estimation of causespecific hazard functions (Kalbfleish and Prentice 2002) can be accomplished with standard methods for single kinds of events. You simply treat all competing events as though the individual were right censored at the time the competing event occurs. For example, if you want to study the effect of obesity on the risk of death due to heart disease, just estimate a Cox proportional hazards model in which all causes of death other than heart disease are treated as censoring. Or if you want to estimate the effect income on divorce, estimate a Cox model in which spousal death is treated as censoring.
By contrast, the method of Fine and Gray (1999) does not treat competing events in the same way as censored observations. Based on cumulative incidence functions, their method estimates a proportional hazards model for something they call the subdistribution hazard.
The definition of the subdistribution hazard is similar to that for a causespecific hazard, with one key difference: the causespecific hazard removes an individual from the risk set when any type of event occurs; the subdistribution hazard removes an individual from the risk set when an event of the focal type occurs or when the individual is truly censored. However, when a competing event occurs, the individual remains in the risk set. Fine and Gray acknowledge that this is “unnatural” because, in fact, those who experience competing events are no longer actually at risk of the focal event. But it’s necessary in order to get a model that correctly predicts cumulative incidence functions. (More on that later).
According to Google Scholar, the Fine and Gray paper has been cited more than 5,000 times. It is now widely available in most major software packages, including Stata (with the stcrreg command), SAS (with the EVENTCODE option in PROC PHREG) and R (with the ‘cmprsk’ package). In some fields, it has become the standard method for analyzing competing risks. In the minds of many researchers, it is the only proper way to analyze competing risks.
But there’s one big problem: the subdistribution method doesn’t isolate distinct causal effects on the competing risks. In fact, it confounds them in predictable and alarming ways. Specifically, any variable that increases the causespecific risk of event A will appear to decrease the subdistribution hazard for event B. Why? Because whenever a type A event occurs, it eliminates the possibility that a type B event will happen.
Here’s a simple simulation that demonstrates this phenomenon. I generated 10,000 observations on each of two event types, labeled A and B. Event times for type A were generated by a Weibull regression model (a parametric version of the proportional hazards model). The only predictor was variable X, which had a standard normal distribution.
Event times for type B were generated by an identical Weibull regression model in which the only predictor was variable Z, also standard normal. X and Z had a correlation of .50.
If event A occurred before event B, the event time for B was not observed. Similarly, if B occurred before A, the event time for A was not observed. Any event times greater than 8 were treated as right censored. SAS code for generating the data and performing the analysis is appended below.
In the resulting data set, there were 4350 type A events, 4277 type B events, and 1391 truly censored observations (neither A nor B was observed). Given the data generating process, any attempt to estimate causal parameters should find no effect of X on B and no effect of Z on A.
In Table 1 below, I show the results from applying the two different methods: causespecific Cox regression models with competing events treated as censoring; and subdistribution proportional hazards models. For type A, the causespecific estimate of .494 for the effect of X is close to the true value of .500 and highly significant. The coefficient of .008 for Z is close to the true value of 0 and far from statistically significant. Overall, pretty good performance.
The subdistribution estimates for type A, on the other hand, are clearly unsatisfactory. The coefficient for X appears to be biased downward by about 10%. The coefficient for Z (.216) is far from the true value of 0, and highly significant. Thus, the subdistribution estimates would lead one to conclude that an increase in Z reduced the risk of event A. What it actually did is reduce the likelihood that type A events would be observed because it increased the risk of event B.
The results for event B in the lower panel of Table 1 are the mirror image of those for event A. For both X and Z, the causespecific estimates are close to the true values. The subdistribution estimates are biased, and would lead to incorrect causal inferences.
Table 1. Results from Two Methods for Estimating Competing Risks Models, NonInformative Censoring.

CauseSpecific Hazards

Subdistribution Hazards

Type A

Estimate

S.E.

p

Estimate

S.E.

p

x

.490

.018

<.0001

.420

.018

<.0001

z

.009

.018

.601

.214

.017

<.0001

Type B







x

.021

.018

.248

.187

.017

<.0001

z

.488

.018

<.0001

.433

.018

<.001

So why would anyone seriously consider the subdistribution method? Well, there’s one big problem with the causespecific hazards approach. Virtually all methods based on causespecific hazards implicitly assume that censoring is noninformative. Roughly speaking, that means that if an individual is censored at a particular point in time, that fact tells you nothing about the individual’s risk of the event.
If the censoring times are determined by the study design (as when all event times beyond a certain calendar time are censored), that’s not usually an issue. But if censoring times are not under the control of the investigator, the censoring may be informative. And that can lead to bias.
If competing risks are treated as a form of censoring, then we need to be concerned about whether that censoring is informative or noninformative. How might competing events be informative? If a spouse dies, does that tell us anything about the risk that the couple would have divorced? Maybe, but probably not. Does the fact that a person dies of heart disease tell us anything about that person’s risk of dying of cancer? Maybe so. That would definitely be the case if cancer and heart disease had common risk factors, and those factors were not included in the regression model.
Unfortunately, there’s no way to test the assumption that censoring is noninformative (Tsiatis 1975). And even if there were, there’s no good method available for estimating causal effects when censoring is informative.
By contrast, the subdistribution hazard method does not explicitly assume that competing risks are noninformative. And that has been one of its major attractions. However, as I now show, the subdistribution method does no better (and actually somewhat worse) than the causespecific method when competing events are informative.
In the previous simulation, the assumption of noninformative censoring was satisfied by the data generating process. To model informative censoring, I added an unobserved common risk factor to the regression equations for the two event times. This was simply a normally distributed random variable with a standard deviation of 2. This “random intercept” induced a correlation of .28 between the uncensored event times for type A and type B. I then reestimated the models, with results shown in Table 2 below.
The message of Table 2 is this: Yes, informative censoring leads to causespecific estimates that are biased. And unlike in the previous table, the causespecific estimates might lead one to conclude, incorrectly, that Z affects the risk for event A and that X affects the risk for event B. But the subdistribution estimates are also biased. And, with one minor exception, the biases are worse for the subdistribution method than for the causespecific method.
Table 2. Results from Two Methods for Estimating Competing Risks Models, Informative Censoring.

CauseSpecific Hazards

Subdistribution Hazards

Type A

Estimate

S.E.

p

Estimate

S.E.

p

x

.347

.019

<.0001

.349

.019

<.0001

z

.067

.019

.0005

.185

.019

<.0001

Type B







x

.102

.019

<.0001

.212

.019

<.0001

z

.372

.019

<.0001

.368

.019

<.0001

To repeat my earlier question: Why would anyone ever seriously consider using the subdistribution method? To be fair to Fine and Gray, they never claimed that subdistribution regression would accurately estimate causal parameters. Instead, they introduced the method as a way to model the impact of covariates on the cumulative incidence functions. These functions are often preferable to KaplanMeier estimates of the survivor function because they do a better job of describing the empirical distribution of events rather than some hypothetical distribution that would apply only in the absence of competing risks.
Cumulative incidence functions are particularly useful for prediction. Suppose you have a cohort of newly diagnosed cancer patients. Based on the experience of earlier patients, you want to predict in five years what percentage will have died of cancer, what percentage will have died of other causes, and what percentage will still be alive. Cumulative incidence functions will give you that information. KaplanMeier survivor functions will not.
The Fine and Gray method provides a way to introduce covariate information into those predictions, potentially making them more accurate for individual patients. It’s important to note, however, that one can also calculate cumulative incidence functions based on causespecific hazard functions. Most commercial packages for Cox regression don’t have that capability, but there are downloadable SAS macros that will accomplish the task.
In sum, the subdistribution method may be useful for generating predicted probabilities that individuals will be in particular states at particular times. But it is not useful for estimating causal effects of covariates on the risks that different kinds of events will occur. For that task, the analysis of causespecific hazards is the way to go. Unfortunately, both methods are vulnerable to competing events that are informative for each other. The only effective way to deal with that problem is to estimate causespecific hazard models that include common risk factors as covariates.
SAS Code for Simulation
data finegraytest;
do i=1 to 10000;
*Generate a common risk factor. To include it in the model, change 0 to 1;
comrisk=0*rannor(0);
*Generate x and z, bivariate standard normal with r=.5;
x=rannor(0);
z=.5*x + sqrt(1.25)*rannor(0);
*Generate w with Weibull distribution depending on x;
logw=2 + .75*x + 1.5*log(ranexp(0)) + 2*comrisk;
w=exp(logw);
*Generate y with Weibull distribution depending on z;
logy=2 + .75*z + 1.5*log(ranexp(0)) + 2*comrisk;
y=exp(logy);
*Allow events to censor each other;
if y>w then do;
type=1;
t=w; end;
else if w>y then do;
type=2;
t=y; end;
*Censor all event times at 8;
if t>8 then do;
type=0;
t=8;
end;
output;
end;
run;
proc freq;
table type; run;
/*Estimate causespecific hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0 1)=x z; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0 2)=x z; run;
/*Estimate subdistributions hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=1; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=2; run;
In my courses and books on longitudinal data analysis, I spend a lot of time talking about the betweenwithin model for fixed effects. I used to call it the hybrid model, but others have convinced me that “betweenwithin” provides a more meaningful description.
Last week my longtime collaborator, Paula England, asked me a question about the betweenwithin model that stumped me at first. When I finally figured out the answer, I realized that it could potentially be important to anyone who uses the betweenwithin method. Hence, this post.
Here is Paula’s project, coauthored with Eman Abdelhadi. They have a large, crosssectional sample of adult women from approximately 50 countries. They are estimating a logistic regression model for a dichotomous dependent variable: whether or not a woman is employed. One of the independent variables is also dichotomous: whether or not a woman is a Muslim (coded 1 or 0). In order to control for all betweencountry differences, they estimate a betweenwithin model with the following characteristics:
 a randomintercepts model with countries as clusters.
 a “within” predictor that is the Muslim indicator (dummy variable) minus the country mean of that variable.
 a “between” predictor that is the country mean, i.e., the proportion Muslim.
 other control variables at both the person level and the country level.
The coefficient for the within predictor is very close to what you would get from a classic fixed effects estimator, as estimated by conditional logistic regression. (My 2014 post discusses why they are not identical). So it represents the effect of being Muslim on employment, controlling for all countrylevel variables, both observed and unobserved.
Here is the question. Can you interpret the coefficient for the between predictor (proportion Muslim) as the effect of living in a country that is more or less Muslim, regardless of whether a person is personally a Muslim? In other words, can you interpret the between coefficient as a contextual effect?
Surprisingly, the answer is no. By construction, the within variable is uncorrelated with the between variable. For that reason, the coefficient for proportion Muslim does not actually control for the personlevel effect of being Muslim.
The reason I never thought about this question before is that I primarily use the betweenwithin model for longitudinal data, with repeated measurements clustered within persons. In that setting, the coefficients for the between variables are usually uninteresting or misleading. But when applied to other kinds of clustering, the estimation and interpretation of “between” effects can be quite important.
Fortunately, there’s a simple solution to the problem: Estimate the model using the original Muslim indicator rather than the deviation from its countrylevel mean. This model is mathematically equivalent to the original betweenwithin model—the coefficients for the Muslim indicator and for all the control variables will be exactly the same. But the coefficient for the proportion Muslim will change, and in a predictable way. Here is the algebra:
In these equations p_{ij} is the probability of employment for person i in country j, x_{ij} is the Muslim indicator for person i in country j, and xbarj is the mean of the Muslim indicator in country j.
The first equation is the usual formulation of the betweenwithin model. The second equation is the algebra that gets us to the third equation. That equation is the new formulation of the model with x_{ij} and xbarj as predictors. Notice that the coefficient of xbarj in the third equation can be found by subtracting the coefficient for the within effect from the coefficient for the between effect in the first equation.
So you don’t actually have to reestimate the model to get the contextual effect. You can just take the difference in the coefficients in the standard betweenwithin model. Furthermore, testing whether those coefficients are the same or different (which is usually done to test fixed vs. random effects) is equivalent to testing for the existence of a contextual effect.
Keep in mind that although the coefficient for xbarj in the third equation can be interpreted as a contextual effect, it does not control for any unobservables. So it could be biased by the omission of countrylevel variables or personlevel variables that affect employment and are correlated with proportion Muslim. Note, also, that the fact that the personlevel variable is dichotomous in this example is completely incidental. The same logic would apply if x_{ij} were continuous.
I thought I knew what it meant for data to be missing at random. After all, I’ve written a book titled Missing Data, and I’ve been teaching courses on missing data for more than 15 years. I really ought to know what missing at random means.
But now that I’m in the process of revising that book, I’ve come to the conclusion that missing at random (MAR) is more complicated than I thought. In fact, the MAR assumption has some peculiar features that make me wonder if it can ever be truly satisfied in common situations when more than one variable has missing data.
First, a little background. There are two modern methods for handling missing data that have achieved widespread popularity: maximum likelihood and multiple imputation. As implemented in most software packages, both of these methods depend on the assumption that the data are missing at random.
Here’s how I described the MAR assumption in my book:
Data on Y are said to be missing at random if the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables X and Y, where X always is observed and Y sometimes is missing. MAR means that
Pr(Y missing Y, X) = Pr(Y missing X).
In words, this expression means that the conditional probability of missing data on Y, given both Y and X, is equal to the probability of missing data on Y given X alone.
Suppose that Y is body weight and X is gender. Then if women are less likely to disclose their weight, Y can be MAR because the probability of missing Y depends on X, which is observed. But what if both men and women are less likely to disclose their weight if they’re overweight? Then values are not MAR, since the probability Y is missing depends on Y net of X.
That definition works fine when only one variable has missing data. But things get more complicated when two or more variables have missing data. I learned this from Example 1.13 in the classic book by Little and Rubin (2002). That example deals with one of the simplest cases, when there are just two variables, X and Y. Suppose that both of them have missing data, and the missingness falls into four patterns:
1. X and Y both observed.
2. X observed but Y missing.
3. X missing but Y observed.
4. X and Y both missing.
For each of these patterns, there is a corresponding probability of observing that pattern. The MAR assumption is about how those four probabilities depend on the values of X and Y. Specifically, MAR says that the probability of each pattern may depend on the variables observed in that pattern, but not on variables that are not observed (after conditioning on the observed variables).
Here’s how this plays out:
The probability of getting pattern 1 (both X and Y observed) may depend on the values of both X and Y. It doesn’t have to depend on them, but MAR allows for that dependence.
The probability of pattern 2 (X observed and Y missing) may depend on X, but it can’t depend on Y.
The probability of pattern 3 (X missing and Y observed) may depend on Y, but it can’t depend on X.
The probability of pattern 4 (both variables missing) can’t depend on either X or Y.
What’s important to stress here is that MAR is a statement about the probabilities of observing particular patterns of missingness. It’s not about the probabilities that individual variables are missing.
Here’s the thing that troubles me about this: except in special cases—like missing completely at random or monotone missing data—it is surprisingly difficult to generate simulated data that would satisfy all four of the above conditions. And my personal rule is that if I can’t simulate it, I don’t really understand it.
There are three approaches to simulation that would seem like natural starting points:
1. Two dichotomous logistic regression models, one in which the probability that Y is missing depends on X, and the other in which the probability that X is missing depends on Y. The problem with that approach is that, by implication, the probability that both variables are missing would depend on the values of both variables. And that’s disallowed by MAR.
2. A multinomial logit model for the four probabilities, with both X and Y as predictors. As noted by Robins and Gill (1997), however, this would violate the MAR conditions for patterns 2, 3, and 4.
3. Specify a separate model for each of the four patterns. That’s problematic because the four probabilities have to sum to 1, and the probability of pattern 4 has to be a constant q. Consequently, the three other probabilities have to sum to 1q. It’s not obvious how to impose that constraint without violating other MAR conditions, especially when X and Y are continuous. I’ve done it for dichotomous X and Y, but that necessitated some trial and error in order to avoid violating the constraint.
So none of those approaches is satisfactory. However, when I presented this problem to Paul von Hippel, who is coauthoring the revision to Missing Data, he came up with the following algorithm for two variables:
1. Start with a sample of data on two variables X and Y, both fully observed.
2. Randomly divide the sample into three groups. The proportions in each group can be whatever you choose.
3. In Group 1, let the probability that Y is missing be a function of X. A logistic regression model would be a natural choice, but it could be a probit model or something else.
4. In Group 2, let the probability that X is missing be a function of Y. Again, there are many possibilities for this function.
5. In Group 3, set both X and Y to be missing.
That’s it. I’ve verified that it works with a simulation. However, I later learned that it’s been done before. It’s one example of class of algorithms for MAR proposed by Robins and Gill (1997) called Randomized Monotone Missingness models. This class of models can handle any number of variables with missing data. These models also allow the probability of falling into each of the subgroups to depend on other variables that are always observed.
So, yes, it’s possible to simulate MAR when two are more variables are missing, and that’s progress. But is it plausible that this algorithm corresponds to any realworld situations? I’m skeptical. The key question is, why would the mechanism producing missing data be different in different subgroups?
For those of us who regularly use multiple imputation or maximum likelihood to handle missing data, I think we are just going to have to live with the MAR assumption, despite its peculiarities. It has been shown to be the weakest assumption that still implies ignorability—that is, the ability to make inferences without having to model the missing data mechanism. So if you’re not satisfied with MAR, you’ll have to engage in a much more complicated modeling process, one that still involves untestable assumptions.
Before concluding, I should mention that there is one situation where MAR makes sense and can also be easily simulated. That’s the case of a longitudinal study with drop out as the only cause of missing data. In that case, MAR means that drop out can depend on anything that is observed before the drop out. But it can’t depend on anything that would have been observed after the drop out.
For some applications, you might still suspect that drop out depends on what would have been observed immediately after the drop out. But at least MAR in that setting is an assumption that is plausible, understandable, and algorithmically reproducible.
REFERENCES
Little, Roderick J.A., and Donald B. Rubin (2002) Statistical Analysis with Missing Data. Wiley.
Robins, James M., and Richard D. Gill (1997) “Non‐response models for the analysis of non‐monotone ignorable missing data.” Statistics in Medicine 16: 3956.
In my last post, I explained several reasons why I prefer logistic regression over a linear probability model estimated by ordinary least squares, despite the fact that linear regression is often an excellent approximation and is more easily interpreted by many researchers.
I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the margins command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese).
If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the margins command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.)
I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)
There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The Stata code for estimating the model is
webuse nhanes2f, clear
logistic diabetes i.black i.female age
The i. in front of black and female tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.
Here are the results:

diabetes  Odds Ratio Std. Err. z P>z [95% Conf. Interval]
+
1.black  2.050133 .2599694 5.66 0.000 1.598985 2.62857
1.female  1.167141 .1100592 1.64 0.101 .9701892 1.404074
age  1.061269 .003962 15.93 0.000 1.053532 1.069063
_cons  .0016525 .000392 27.00 0.000 .0010381 .0026307

A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for nonblacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.
Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes. Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:
margins, dydx(black female age)
yielding the output

 Deltamethod
 dy/dx Std. Err. z P>z [95% Conf. Interval]
+
1.black  .0400922 .0087055 4.61 0.000 .0230297 .0571547
1.female  .0067987 .0041282 1.65 0.100 .0012924 .0148898
age  .0026287 .0001869 14.06 0.000 .0022623 .0029951

The dy/dx column gives the estimated change in the probability of diabetes for a 1unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for nonblacks, on average.
How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: black=1 and black=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like age, the calculation of the AME is a little more complicated.
There’s another, more traditional, way to get marginal effects: for the variable black, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when black=1 and when black=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.
We can get the MEM in Stata with the command:
margins, dydx(black female age) atmeans
with the result

 Deltamethod
 dy/dx Std. Err. z P>z [95% Conf. Interval]
+
1.black  .0290993 .0066198 4.40 0.000 .0161246 .0420739
1.female  .0047259 .0028785 1.64 0.101 .0009158 .0103677
age  .0018234 .0000877 20.80 0.000 .0016516 .0019953

Here the MEM for black is noticeably smaller than the AME for black. This is a consequence of the inherent nonlinearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM).
The nonlinearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values.
Here’s how to get the AME for black at four different ages:
margins, dydx(black) at(age=(20 35 50 65))
1._at : age = 20
2._at : age = 35
3._at : age = 50
4._at : age = 65

 Deltamethod
 dy/dx Std. Err. z P>z [95% Conf. Interval]
+
1.black 
_at 
1  .0060899 .0016303 3.74 0.000 .0028946 .0092852
2  .0144831 .0035013 4.14 0.000 .0076207 .0213455
3  .0332459 .0074944 4.44 0.000 .018557 .0479347
4  .0704719 .015273 4.61 0.000 .0405374 .1004065

We see that the AME of black at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should not be thought of as an interaction, however, because the model says that the effect on the logodds (logit) is the same at all ages.
Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the logodds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used
predict yhat
sum yhat
The predict command generates predicted probabilities and stores them in the variable yhat. The sum command produces descriptive statistics for yhat, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:
twoway function y=ln(x/(1x)), range(.005 .244) xtitle("Probability") ytitle("Log odds")
Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:
reg diabetes black female age, robust
I’ve requested robust standard errors to deal with any heteroskedasticity, but the tstatistics are about the same even without this option.

 Robust
diabetes  Coef. Std. Err. t P>t [95% Conf. Interval]
+
black  .0383148 .008328 4.60 0.000 .0219902 .0546393
female  .0068601 .0041382 1.66 0.097 .0012515 .0149716
age  .0021575 .0001214 17.76 0.000 .0019194 .0023955
_cons  .0619671 .0049469 12.53 0.000 .0716639 .0522703

The tstatistics here are pretty close to the zstatistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of black on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.
The upshot is that combining logistic regression with the margins command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probabilitybased estimates vary under different conditions.
In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:
 Within the range of .20 to .80 for the predicted probabilities, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
 People understand changes in probabilities much better than they understand odds ratios.
 OLS regression is much faster than logistic regression.
I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other nonlinear alternatives, like probit or complementary loglog. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.
Speed. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.
Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with kfold crossvalidation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the Invariance section below.
Predicted probabilities. Even if you really dislike odds ratios, the logit model has a wellknown advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 10 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.
In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only inbounds predictions, the probabilities may be more accurately estimated with logistic.
Interpretability. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.
In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.
Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.
A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.
Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its margins command, which I will demonstrate in my next post.
Invariance. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an Sshaped curve—not necessarily logistic, but something like it.
Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.
This issue also arises for categorical predictors. Consider a dichotomous y and a single dichotomous predictor x. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for y on x.
One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Study (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.
There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The model also includes the interaction of black and age. The Stata code for estimating the model is
webuse nhanes2f, clear
logistic diabetes black female age black#c.age
with the following results:

diabetes  Odds Ratio Std. Err. z P>z [95% Conf. Interval]
+
black  3.318733 1.825189 2.18 0.029 1.129381 9.752231
female  1.165212 .1098623 1.62 0.105 .9686107 1.401718
age  1.063009 .0044723 14.52 0.000 1.054279 1.071811
black#c.age  .9918406 .0090871 0.89 0.371 .9741892 1.009812
_cons  .0014978 .0003971 24.53 0.000 .0008909 .0025183

Given the high pvalue for the interaction (.371), there is clearly no evidence here that the effect of black varies with age. Now let’s estimate the same model as a linear regression:
reg diabetes black female age black#c.age
The new results are:

diabetes  Coef. Std. Err. t P>t [95% Conf. Interval]
+
black  .0215031 .0191527 1.12 0.262 .0590461 .0160399
female  .0069338 .004152 1.67 0.095 .0012049 .0150725
age  .0020176 .0001276 15.82 0.000 .0017675 .0022676
black#c.age  .0012962 .0003883 3.34 0.001 .0005351 .0020573
_cons  .0553238 .0068085 8.13 0.000 .0686697 .0419779

Now we have strong evidence for an interaction, specifically that the effect of black is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase.
Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and nonblacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.
Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability does not? Absolutely. But I believe that that’s a far less likely outcome. I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.
In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.
In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the margins command in Stata. I’ll also provide links for how to do it in SAS and R.
In July 2015 I pointed out some advantages of the linear probability model over the logistic model. The linear model is much easier to interpret, and the linear model runs much faster, which can be important if the data set is large or the model is complicated. In addition, the linear probability model often fits about as well as the logistic model, since over some ranges the probability p is almost linearly related to the log odds function ln(p(1p)) that is used in logistic regression.
As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.
Figure 1. The relationship between probability and log odds over the range of probabilities from .2 to .8.
This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.
For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2^{nd} grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.
Figure 2. The relationship between probability and log odds over the range of probabilities from .09 to .13.
The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.
I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are still situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).
I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.[1]
Figure 3. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.
On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.
To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.
The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:
twoway function y=ln(x/(1x)), range(.2 .8) xtitle(“Probability”) ytitle(“Log odds”)
I plotted Figure 2 using the same command, only changing the range to (.09 .13).
In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.
Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.
References
Angrist, J. D., & Pischke, J.S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion (1st ed.). Princeton University Press.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.
Pischke, J.S. (2012, July 9). Probit better than LPM? Retrieved from http://www.mostlyharmlesseconometrics.com/2012/07/probitbetterthanlpm/
von Hippel, P.T. & Workman, J. (2016). From Kindergarten Through Second Grade, U.S. Children’s Obesity Prevalence Grows Only During Summer Vacations. Obesity Volume 24, Issue 11, pages 2296–2300. http://onlinelibrary.wiley.com/doi/10.1002/oby.21613/full
[1] The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed n/(n1) where n is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.
On April 2122, 2017, I will be offering a seminar on causal mediation analysis in Philadelphia with Statistical Horizons. The course will cover very recent developments in this area.
Mediation is about the mechanisms or pathways by which some treatment or exposure affects an outcome. Questions about mediation arise with considerable frequency in the biomedical and social sciences. Various approaches have been used in the social sciences for decades, especially in psychology. However, the new literature on mediation from the perspective of causal inference has helped clarify the assumptions that are being made by these analyses. It has also generalized the approach to a much broader class of settings.
The course will cover many of these new advances. We will also talk about the relationships between the newer methods and the approaches that have been used in the biomedical and social sciences.
I have published a book on this topic with Oxford University Press:
The course will cover a lot of the material in Chapters 18 of the book, and the book will also serve as a useful reference for the course. Both the book and the course are based on a longer course on mediation and interaction that I have taught at Harvard for several years. I have managed to distill the essential content of that course into material that can be covered in just two days, so the course will provide a fairly comprehensive overview in a relatively short period of time.
The course will also provide an introduction to concepts in causal inference and so might be a fairly accessible way into that literature. The prerequistes for the course are fairly minimal, however, so I’ll essentially just be assuming some familiarity with linear and logistic regression. I have worked hard to make the course as accessible as possible, and have now taught and refined the material numerous times.
The rough outline of the course and the broad topics we will cover are as follows:
 Concepts and Methods for Mediation
 Sensitivity Analysis
 Mediation with TimetoEvent Outcome
 Multiple Mediators
 Surrogate Outcomes
 A Unification of Mediation and Interaction
More specifically, the course will cover the relationship between traditional methods for mediation in the biomedical and the social sciences and new methods in causal inference. For dichotomous, continuous, and timetoevent outcomes, we will discuss when the standard approaches to mediation analysis are valid or not valid, and we’ll extend these to more complex settings.
The noconfounding assumptions needed for these techniques will be described. SAS, SPSS and Stata macros to implement these techniques will be explained and distributed to course participants. The use and implementation of sensitivity analysis techniques to assess the how sensitive conclusions are to violations of assumptions will be covered, along with extensions to multiple mediators.
If you have any questions about the course feel free to drop me a note (tvanderw@hsph.harvard.edu) and I would be happy to try to provide further information. In any case, I would of course be happy to have you there.
When Paul Allison asked me if I wanted to teach a course in Bangladesh, my first reaction was confusion. Other than knowing a few basic facts about the place, I had spent little time thinking about the country and none at all imagining myself going there. And here, suddenly, was an opportunity to spend a week in the country teaching Stata.
I love Stata and always jump at the chance to teach it. But I was honestly a bit terrified. I had never been outside North America and Western Europe and don’t really consider myself much of a traveler. What would Dhaka be like? I had no real reference point for trying to predict the experience. Even as I was boarding the flight from Boston to Dubai (the second leg of a 28hour trip), there was a part of me still asking, “What have you gotten yourself into?”
Fortunately my love of teaching Stata was greater than my anxiety because the week I spent in Dhaka ended up being one of the most satisfying of my professional life.
Most of the students I taught were participants in the CDC’s Field Epidemiology Training Program (FETP). This worldwide program takes government doctors who’ve been doing clinical work and trains them to be field epidemiologists. All of the doctors had received basic biostatistics training before I arrived. So my primary role was to teach them how to use Stata to implement the techniques they’d already learned in class.
The most surprising aspect of this experience was how much we ended up focusing on data cleaning. In most of my career, I’ve used very wellbehaved data sets like the General Social Survey. In Dhaka, most of the students were dealing with what initially seemed to me very disorganized spreadsheets. Their data had all the numbers stored as text, inconsistent capitalization, and in many cases were half “long” format and half “wide” format.
I found it amazingly satisfying to teach these doctors how to take messy, complicated epidemiological surveillance data and turn them into something they could really use. There were one or two students who had been unable to make any progress on their projects for months because of data management difficulties. I had never met data management problems that complicated before. One particular reshape had me pulling my hair out and took me almost two hours to figure out. But I have rarely—maybe never—been as satisfied in front of a computer screen as the moment I finally got it.
Over the five days of the course, we covered a great deal, from ttests, to logistic regression, to complex survey analysis. I especially enjoyed teaching these eager, energetic young doctors because they knew exactly how they were going to use these skills to make a difference in their country.
Fortunately I also had a little time to experience Dhaka. I got to ride a rickshaw, take a boat on the Ganges, walk down Hindu Street, and see buildings built by the East India Company. Just walking down the streets of Dhaka was one of the most incredible experiences of my life. The sights and sounds were beyond anything I had ever experienced. As I packed up my laptop on the last day of class, with the sound of the call to prayer in the background, my thoughts were no longer “what have I gotten myself into?” but instead “is this really over?”
Boarding the flight to return home, I was so grateful that my love of Stata had taken me to such an unexpected place. I hope I made as much of an impact on my students as they and their country made on me.
Stephen Vaisey is an Associate Professor of Sociology at Duke University. For Statistical Horizons, he regularly teaches a seminar on Treatment Effects Analysis.
In his April 1 post, Paul Allison pointed out several attractive properties of the logistic regression model. But he neglected to consider the merits of an older and simpler approach: just doing linear regression with a 10 dependent variable. In both the social and health sciences, students are almost universally taught that when the outcome variable in a regression is dichotomous, they should use logistic instead of linear regression. Yet economists, though certainly aware of logistic regression, often use a linear model to model dichotomous outcomes.
Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.
Interpretability
Let’s start by comparing the two models explicitly. If the outcome Y is a dichotomy with values 1 and 0, define p = E(YX), which is just the probability that Y is 1, given some value of the regressors X. Then the linear and logistic probability models are:
p = a_{0 }+ a_{1}X_{1} + a_{2}X_{2} + … + a_{k}X_{k} _{ }(linear)
ln[p/(1p)] = b_{0 }+ b_{1}X_{1} + b_{2}X_{2} + … + b_{k}X_{k} _{ }(logistic)
The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1p) is a linear function of the regressors.
The major advantage of the linear model is its interpretability. In the linear model, if a_{1} is (say) .05, that means that a oneunit increase in X_{1} is associated with a 5 percentage point increase in the probability that Y is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.
The logistic model is less interpretable. In the logistic model, if b_{1} is .05, that means that a oneunit increase in X_{1} is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.
How intuitive are odds ratios?
Because the log odds scale is so hard to interpret, it is common to report logistic regression results as odds ratios. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:
p/(1p) = d_{0} × (d_{1})^{X1}_{ }× (d_{2})^{X2} × … × (d_{k})^{Xk}
On the left side we have the odds and on the right side we have a product involving the odds ratios d_{1 }= exp(b_{1}), d_{2} = exp(b_{2}), etc.
Odds ratios seem like they should be intuitive. If d_{1} = 2, for example, that means that a oneunit increase in X_{1 }doubles the odds that Y is 1. That sounds like something we should understand.
But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1p)—and so does the odds ratio.
Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a getoutthevote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.
If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.
Here’s a table that shows what doubling the odds does to various initial probabilities:
Before doubling

After doubling

Probability

Odds

Odds

Probability

10%

0.11

0.22

18%

20%

0.25

0.50

33%

30%

0.43

0.86

46%

40%

0.67

1.33

57%

50%

1.00

2.00

67%

60%

1.50

3.00

75%

70%

2.33

4.67

82%

80%

4.00

8.00

89%

90%

9.00

18.0

95%

It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If p is close to 0, then doubling the odds is approximately the same as doubling p. If p is close to 1, then doubling the odds is approximately the same as halving 1p. But if p is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.
That’s why I’m not crazy about odds ratios.
How nonlinear is the logistic model?
The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).
For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability (cf. Long 1997).
It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.
When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those outofbounds predicted probabilities are the Achilles heel of the linear model.
A rule of thumb
These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.
Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).
Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[p/(1p)] are undefined when p is equal to 0 or 1. When p gets close to 0 or 1 logistic regression can suffer from complete separation, quasicomplete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic.
Computation and estimation
Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderatesized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]
The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroskedastic with residual variance p(1p), but the heteroscedasticity is minor if p is between .20 and .80, which is the situation where I recommend using the linear probability model at all. OLS estimates can be improved by using heteroscedasticityconsistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.
Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.
References
Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43(1), 59–74. http://doi.org/10.1007/s1113500790773
King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163. http://doi.org/10.2307/25791637
Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.
[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.
[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.
When estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time t is what happened at time t1.
This can work well for some kinds of models, but not for mixed models, otherwise known as a random effects models or multilevel models. Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.
In economics, models with lagged dependent variables are known as dynamic panel data models. Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.
The basic argument is pretty straightforward. Let y_{it} be the value of the dependent variable for individual i at time t. Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector x_{it}:
y_{it} = b_{0} + b_{1}y_{i}_{(t1)} + b_{2}x_{it} + u_{i} + e_{it}
The random intercept u_{i} represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the righthand side.
That’s where the problem lies. Because the model applies to all time points, u_{i} has a direct effect on y_{i}_{(t1)}. But if u_{i} affects y_{i}_{(t1)}, it can’t also be statistically independent of y_{i}_{(t1)}. The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).
Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.
Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: ed (years of education) and fem (1 for female, 0 for male).
Here’s the Stata code:
use “https://statisticalhorizons.com/wpcontent/uploads/wages.dta”, clear
xtset id t
xtreg lwage L.lwage ed fem t
The xtset command tells Stata that this is a “crosssection timeseries” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7. The xtreg command fits a randomintercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors. L.lwage specifies the oneyear lag of lwage.
Here’s the output:

lwage  Coef. Std. Err. z P>z [95% Conf. Interval]
+
lwage 
L1.  .8747517 .0085886 101.85 0.000 .8579183 .8915851
ed  .0108335 .0011933 9.08 0.000 .0084947 .0131724
fem  .06705 .010187 6.58 0.000 .0870162 .0470839
t  .0071965 .0019309 3.73 0.000 .0034119 .0109811
_cons  .7624068 .0491383 15.52 0.000 .6660974 .8587161
+
When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males. Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.
As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274278 in RabeHesketh and Skrondal (2012).
Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983) which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “crosssection timeseries dynamic panel data estimation by quasimaximum likelihood.”
Here’s how to apply xtdpdqml to the wage data:
xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)
The re option specifies a random effects (random intercepts) model. By default, the command includes the lag1 dependent variable as a predictor. The initval option sets the starting values for the four variance parameters that are part of the model. Here is the output:

lwage  Coef. Std. Err. z P>z [95% Conf. Interval]
+
lwage 
L1.  .4142827 .0230843 17.95 0.000 .3690383 .459527
ed  .0403258 .0031841 12.66 0.000 .0340851 .0465666
fem  .2852665 .0271688 10.50 0.000 .3385164 .2320166
t  .0533413 .0027533 19.37 0.000 .0479449 .0587378
_cons  3.25368 .1304816 24.94 0.000 2.99794 3.509419

Results are markedly different from those produced above by xtreg. The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Blacks now make 8 percent less than nonblacks rather than 1 percent less. And females make 24 percent less (calculated as 100(exp(.28)1) than males compared to 6 percent less. The annual increase in wages is 5 percent instead of 1 percent.
So doing it right can make a big difference. Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique MoralBenito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.
To estimate a model for the wage data with xtdpdml, use
xtset id t
xtdpdml lwage, inv(ed fem blk) errorinv
The inv option is for timeinvariant variables. The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a 1time unit lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.
Here is the output:

 Coef. Std. Err. z P>z [95% Conf. Interval]
+
lwage2 
lwage1  .4088803 .0229742 17.80 0.000 .3638517 .453909
ed  .0406719 .0032025 12.70 0.000 .0343951 .0469486
fem  .2878266 .027345 10.53 0.000 .3414218 .2342315

Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.
If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper. As a matter of fact, the xtdpdml command is just a frontend to the sem command in Stata. But it’s a lot more tedious and errorprone to set up the equations yourself. That’s why we wrote the command.
By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixedeffects models. You can’t put a lagged dependent variable on the righthand side. Both xtdpdqml and xtdpdml can handle this situation also.
If you’d like to learn more about dynamic panel data models, check out my 2day course on Longitudinal Data Analysis Using SEM. It will be offered again October 1617, 2015, in Los Angeles.
References
Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 16351659.
RabeHesketh, Sophia, and Anders Skrondal (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP.