When using multiple imputation, you may wonder how many imputations you need. A simple answer is that more imputations are better. As you add more imputations, your estimates get more precise, meaning they have smaller standard errors (SEs). And your estimates get more replicable, meaning they would not change too much if you imputed the data again.
So how many imputations are enough? An old rule of thumb was that 3 to 10 imputations typically suffice (Rubin 1987). But that advice only ensured the precision and replicability of point estimates. When the number of imputations is small, it is not uncommon to have point estimates that replicate well but SE estimates that do not.
In a recent paper (von Hippel 2018), for example, I estimated the mean body mass index (BMI) of first graders, from a sample where three quarters of BMI measurements were missing. With 5 imputations, the point estimate was 16.642 the first time I imputed the data, and 16.659 the second time. That’s good replicability; despite three quarters of the values being imputed, the two point estimates differed by just 0.1%.
But the SE estimate didn’t replicate as well; it was .023 the first time I imputed the data, and .026 the second time—a difference of 13%. Naturally, if the SE estimate isn’t replicable, related quantities like confidence intervals, t statistics, and p values won’t be replicable, either.
So you often need more imputations to get replicable SE estimates. But how many more? Read on.
A New Formula
I recently published a new formula (von Hippel 2018) that estimates how many imputations M you need for replicable SE estimates. The number of imputations is approximately
This formula depends on two quantities, FMI and CV(SE).
FMI is the fraction of missing information. The FMI is not the fraction of values that are missing; it is the fraction by which the squared SE would shrink if the data were complete. You don’t need to figure out the FMI. Standard MI software gives you an estimate.
CV() is a coefficient of variation, which you can think of as roughly the percentage by which you’d be willing to see the SE estimate change if the data were imputed again.
For example, if you have FMI=30 percent missing information, and you would accept the SE estimate changing by 10 percent if you imputed the data again, then you’ll only need M=5 or 6 imputations. But if you’d only accept the SE changing by 5%, then you’ll need M=19 imputations. (Naturally, the same formulas work if FMI and CV(SE) are expressed as proportions, .3 and .1, rather than percentages of 30 and 10.)
Notice that the number of imputations increases quadratically, with the square of FMI. This quadratic rule is better than an older rule M = 100 FMI, according to which the number of imputations should increase linearly with FMI (Bodner 2008; White et al. 2011).
Here’s a graph, adapted from von Hippel (2018), that fits the linear and quadratic rules to a simulation carried out by Bodner (2008), showing how many imputations are needed to achieve a goal similar to CV(SE)=.05. With that goal, the quadratic rule simplifies to M=1+200 FMI2, and the linear rule, as usual, is M=100 FMI.
Clearly the quadratic rule fits the simulation better than the linear rule. The rules agree when FMI=0.5, but when FMI is larger, the linear rule underestimates the number of imputations needed, and when FMI is smaller, the linear rule overestimates the number of imputations needed. For example, with 20% missing information, the linear rule says that you need 20 imputations, but the quadratic rule says you can make do with 9.
When the fraction of missing information gets above 70%, both rules underestimate the number of imputations needed for stable t-based confidence intervals. I suspect that happens because the degrees of freedom in the t statistic becomes unstable (von Hippel 2018). I am looking at that issue separately.
A Two Step Recipe
A limitation of the quadratic rule is that FMI is not known in advance. FMI has to be estimated, typically by multiple imputation. And the estimate of FMI itself can be unreliable unless the number of imputations is large (Harel 2007).
So it’s a circular problem. You need an estimate of FMI to decide how many imputations you need. But you can’t get an estimate of FMI until you impute the data. For that reason, I recommend a two-step recipe (von Hippel, 2018):
First, carry out a pilot analysis. Impute the data using a convenient number of imputations. (20 imputations is a reasonable default, if it doesn’t take too long.) Estimate the FMI by analyzing the imputed data.
Next, plug the estimated FMI into the formula above to figure out how many imputations you need to achieve a certain value of CV(SE). If you need more imputations than you had in the pilot, then add those imputations and analyze the data again.
There are two small wrinkles:
First, when you plug an estimate of FMI into the formula, you shouldn’t use a point estimate. Instead, you should use the upper bound of a 95% confidence interval for FMI. That way you take only a 2.5% risk of having too few imputations in your final analysis.
Second, in your analysis you may estimate several parameters, as in a multiple regression. In that case, you have to decide which SEs you want to be replicable. If you don’t have a strong opinion, the simplest thing is to focus on the SE of the parameter with the largest FMI.
The two-step recipe has been implemented in three popular data analysis packages.
In Stata, you can install my command how_many_imputations by typing ssc install how_many_. When there are multiple parameters, it uses the highest FMI.
In R, you can use Josh Errickson’s howManyImputations package, which is available via GitHub, here.
The two-stage procedure isn’t the only good way to find a suitable number of imputations. An alternative is to keep adding imputed datasets until the estimates converge, or change very little as new imputed datasets are added. This approach can be applied to any quantity you want to estimate—a point estimate, a standard error, a confidence interval, a p value. This approach, called iterative multiple imputation (Nassiri et al. 2018), has been implemented in the R package imi, which is available via GitHub, here.
Stata’s mi estimate command uses a jackknife procedure to estimate how much your standard error estimates and p values would change if the data were imputed again with the same number of imputations (Royston, Carlin, and White 2009). These jackknife estimates can give you some idea whether you need more imputations, but they can’t directly tell you how many imputations to add.
(An earlier, shorter version of this post appeared on missingdata.org in January 2018)
von Hippel, Paul T. (2018). “How many imputations do you need? A two-stage calculation using a quadratic rule.” Sociological Methods and Research, published online, behind a paywall. A free pre-publication version is available as an arXiv e-print.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prevention Science, 8(3), 206–213. https://doi.org/10.1007/s11121-007-0070-9
Harel, O. (2007) Inferences on missing information under multiple imputation and two-stage multiple imputation. Statistical Methodology, 4(1), 75-89.
Nassiri, Vahid, Geert Molenberghs, Geert Verbeke, and João Barbosa-Breda. (2019). “Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets.” The American Statistician, 17 pages online ahead of print. https://doi.org/10.1080/00031305.2018.1543615
Royston, Patrick, John B. Carlin, and Ian R. White. (2009). “Multiple Imputation of Missing Values: New Features for Mim.” The Stata Journal 9(2):252–64.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
When R first came out, around the year 2000, I was really excited. Here was a powerful, programmable statistical package that was free to anyone. I thought “This could revolutionize data analysis.” But when I gave it a test run, I quickly got discouraged. All the routine data management tasks seemed much harder in R than in SAS or Stata, my go-to statistical packages. And many of my favorite analytical tools (like Cox regression) that were readily available in commercial packages seemed to be missing from R. Plus, everything was formulated in terms of vectors and matrices, and I knew that wouldn’t fly with the sociology graduate students I was teaching. So I put R back on the shelf and let others play with this shiny new toy.
Fast forward 20 years and things have changed dramatically. It now seems that no matter what you want to do in statistics, there’s an R package for that. Slick interface apps like RStudio have streamlined most of the routine tasks of managing R itself. And meta-packages like the tidyverse have made data management and manipulation much easier and more straightforward.
Along with those changes, the community of R users has grown enormously. It’s hard to determine the exact number of R users, but recent estimates (by Oracle and by Revolution Analytics) put the worldwide user base at about 2 million. In a regularly updated blog post, Robert Muenchen has been tracking the popularity of data science software for several years, using a variety of metrics. One graph that I found particularly interesting was this one, which shows the number of mentions of various statistical packages in scholarly articles, based on a search of Google Scholar:
The most striking feature of this graph is the overall dominance of SPSS. However, SPSS peaked in 2009 and has since suffered a massive decline. SAS has also declined greatly since 2010, and it is now below R (the orange triangles) which has been steadily increasing in scholarly mentions since its inception.
The growth in R has two important consequences:
It’s now much easier to get answers to any kind of R question, either by a quick web search or by just asking someone who works nearby.
You can readily communicate your program code to hundreds of thousands of other researchers, without worrying about whether they have the same software installed.
It’s this second point that I want to stress here. I’ve come to the conclusion that R should be every data analyst’s second language. To me, it’s like English. No matter what your native language, it’s extremely useful to have English as a second language so that you can communicate easily with others who don’t share your native tongue. And so it is with R.
I’ve certainly found that to be true in my own work. Although I’m far from being a skilled R programmer, I’ve recently developed enough familiarity and competence with R that I can translate nearly all of my teaching examples and exercises into R. Currently, I try to provide R code for all the seminars I teach for Statistical Horizons. My two most recent publications have included R code. And I can now do a much better job of reading and reviewing papers that include R code.
R also enables me to do some things that I can’t do in SAS or Stata. For example, R can do multiple imputation for clustered data using either the jomo package or the mice package. And for linear structural equation modeling, I prefer the lavaan package over what’s in SAS or Stata (but Mplus is still better).
More and more people who come to Statistical Horizons seminars prefer to work in R, and they really appreciate having R code. Increasingly, our new instructors also prefer to teach their courses using R. Of course, that’s mainly a cohort phenomenon. Statistics departments overwhelmingly prefer to teach with R, at both the undergraduate and graduate levels, so there are thousands of newly-minted R users every year.
Don’t get me wrong, there are still lots of things I don’t like about R. There are so many user-contributed R packages that it’s hard to figure out which one will do what you want. And because there will often be many packages that will do you want, you then have the problem of figuring out which one is best. Documentation for many packages is sparse or poorly written. Even more bothersome is that some packages conflict with other packages because they have functions (or other elements) that share the same names. That can cause serious headaches.
Despite those problems, I am confident that R has a bright future. There are even newer competitors (like Python) that are favored in the data science/machine learning community. But R has reached such a critical mass that it will be very hard to stop.
How do you go about learning R? Well, there are lots of good materials on the web. But I think the best way is to take our seminar Statistics with R taught by Professor Andrew Miles. It’s designed for people who are reasonably proficient in statistics and already familiar with another package (like SPSS, SAS or Stata), but who just want to learn how to do those same things in R. In two days, you’ll get comfortable enough with R to do most of the statistical tasks that you are already doing in other packages. Andrew’s course also has lots of hands-on exercises, something that’s essential for deep learning of any new tool. You can get more information about his course here.
Standard methods for the analysis of panel data depend on an assumption of directional symmetry that most researchers don’t even think about. Specifically, these methods assume that if a one-unit increase in variable X produces a change of B units in variable Y, then a one-unit decrease in X will result in a change of –B units in Y.
Does that make sense? Probably not for most applications. For example, is it plausible that the increase in happiness when a person gets married is exactly matched by the decrease in happiness when a person gets divorced? Or that a $10K increase in income has the same effect on savings as a $10K decrease (in the opposite direction).
In this post, I’m going to show you how to relax that assumption. I’ll do it for the simplest situation where there are only two time points. But I’ve also written a more detailed paper covering the multi-period situation.
Here’s the example with two-period data. The data set has 581 children who were studied in 1990 and 1992 as part of the National Longitudinal Survey of Youth. I’ll use three variables that were measured at each of the two time points:
anti antisocial behavior, measured with a scale from 0 to 6. self self-esteem, measured with a scale ranging from 6 to 24. pov poverty status of family, coded 1 for family in poverty, otherwise 0.
The goal is to estimate the causal effects of self and pov on anti. I’ll focus on fixed effects methods (Allison 2005, 2009) because they are ideal for studying the effects of increases or decreases over time. They also have the remarkable ability to control for all time-invariant confounders.
For two-period data, there are several equivalent ways to estimate a fixed effects model. The difference score method is the one that’s the most straightforward for allowing directional asymmetry. It works like this: for each variable, subtract the time 1 value from the time 2 value to create a difference score. Then, just estimate an ordinary linear regression with the difference scores.
Here is Stata code for a standard symmetrical model:
Self-esteem has a highly significant negative effect on antisocial behavior. Specifically, for each 1-unit increase in self-esteem, antisocial behavior goes down by .039 units. But that also means that for each 1-unit decrease in self-esteem, antisocial behavior goes up by .039 units. Poverty has a positive (but non-significant) effect on self-esteem. Children who move into poverty have an estimated increase in antisocial behavior of .112. But children who move out of poverty have an estimated decrease antisocial behavior of .112.
How can we relax the constraint that these effects have to be the same in both directions? York and Light (2017) showed the way. What’s needed is to decompose the difference score for each predictor variable into its positive and negative components. Specifically, if D is a difference score for variable X, create a new variable D+ which equals D if D is greater than 0, otherwise 0. And create a second variable D– which equals –D if D is less than 0, otherwise 0.
What does this tell us? A 1-unit increase in self-esteem lowers antisocial behavior by .005 units (an effect that is far from statistically significant). A 1-unit decrease in self-esteem increases antisocial behavior by .074 units (highly significant). So it looks like decreases in self-esteem have a big effect, while increases have little impact. Note that the original estimate of -.039 was about midway between these two estimates.
Are the two effects significantly different? We can test that with the command
which yields a p-value of .10, not quite statistically significant.
Neither of the two poverty coefficients is statistically significant, although they are in the expected direction: moving into poverty increases antisocial behavior while moving out of poverty reduces it, but by about half the magnitude. These two effects are definitely not significantly different.
So that’s basically it for two-period data. When there are three or more periods, you have to create multiple records for each individual. Each record contains difference scores for adjacent periods. When estimating the regression model, you need to allow for negative correlations between adjacent records by using generalized least squares.
I discuss all these options in a paper that can be found here. That paper also presents a data generating model that justifies the asymmetric first difference method. The data generating model can be extended to allow for the estimation of asymmetric logistic regression models, which can’t be estimated with difference scores.
Allison, Paul D. Fixed effects regression models. Vol. 160. SAGE publications, 2009. Allison, Paul D. Fixed effects regression methods for longitudinal data using SAS. SAS Institute, 2005. York, Richard, and Ryan Light. “Directional asymmetry in sociological analyses.” Socius 3 (2017): 1-13.
/* The NLSY data set can be downloaded at statisticalhorizons.com/resources/data-sets */
proc reg data=nlsydiff;
model antidiff=selfdiff povdiff;
proc reg data=nlsydiff;
model antidiff=selfpos selfneg povpos povneg;
When I teach courses on structural equation modeling (SEM), I tell my students that any model with instrumental variables can be estimated in the SEM framework. Then I present a classic example of simultaneous causation in which X affects Y, and Y also affects X. Models like this can be estimated if each of the two variables also has an instrumental variable—a variable that affects it but not the other variable. Specification of the model is fairly straightforward in any SEM package.
However, there are lots of other uses of instrumental variables. My claim that they can all be estimated with SEM is more speculative than I usually care to admit. A couple weeks ago I got a question about instrumental variables in SEM that really had me stumped. What seemed like the right way to do it was giving the wrong answer—at least not the answer produced by all the standard econometric methods, like two-stage least squares (2SLS) or the generalized method of moments (GMM). When I finally hit on the solution, I figured that others might benefit from what I had learned. Hence, this blog post.
The question came from Steven Utke, an assistant professor of accounting at the University of Connecticut. Steve was trying to replicate a famous study by David Card (1995) that attempted to estimate the causal effect of education on wages by using proximity to college as an instrumental variable. Steve was able to replicate Card’s 2SLS analysis, but he couldn’t get an SEM model to produce similar results.
For didactic purposes, I’m going to greatly simplify Card’s analysis by excluding a lot of variables that are not essential for the example. I’ll use Stata for the analysis, but equivalent SAS code can be found in Addendum 2.
Let’s begin with a bivariate regression of the log of hourly wages on years of education, for a sample of 3,010 young men in 1976.
Education has a highly significant coefficient, whose magnitude can be interpreted as follows: each additional year of schooling is associated with about a 5% increase in wages.
Of course, the problem with treating this as a causal effect is that there are likely to be many omitted variables that affect both education and wages. We could control for those variables by measuring them and including them in the model (and Card did that for many variables). But there’s no way we can control for all the possible confounding variables, especially because some variables are difficult to measure (e.g., ability). We must therefore conclude that education is correlated with the error term in the regression (a form of endogeneity), and that our regression coefficient is therefore biased to an unknown degree.
Card proposed to solve this problem by introducing proximity to college as an instrumental variable. Specifically, nearc4 was a dummy (indicator) variable for whether or not the person was raised in a local labor market that included a four-year college. As with any instrumental variable, the crucial assumption was that nearc4 would affect years of schooling but would NOT directly affect lwage. Although there may be reasons to doubt that assumption (some discussed by Card), they are not relevant to our discussion here.
Using the ivregress command in Stata, I estimated the instrumental variable model by 2SLS. In case you’re not familiar with that venerable method, it amounts to this: (a) do an OLS regression of educ on nearc4, (b) calculate predicted values from that regression, and (c) regress lwage on those predicted values. The only tricky part is getting the standard errors right. Here are the results:
Remarkably, the coefficient for education in this regression is more than three times as large as in the bivariate regression and is still highly significant. According to this estimate, each additional year of schooling yields a 21% increase in wages (calculated as 100(exp(.188)-1), a very large effect by any reasonable standard. (Card got a coefficient of around .14 with other predictors in the model).
Now let’s try to do this with a structural equation model, using Stata’s sem command. As with all SEM software, the default is to do maximum likelihood estimation under the assumption of multivariate normality. The basic idea is to specify a model in which nearc4 affects educ, and educ affects lwage. But there can be no direct effect of nearc4 on lwage. Here’s a path diagram for the model:
This can’t be right, however, because the coefficient for educ is identical to what we got in our initial bivariate regression, with an almost identical standard error. Clearly, the inclusion of the regression of educ on nearc4 has accomplished nothing. Can this problem be fixed, and if so, how?
I struggled with this for about an hour before it hit me. The solution is to build into the model the very thing that we are trying to correct for with the instrumental variable. If there are omitted variables that are affecting both educ and lwage, they should produce a correlation between the error term for educ and the error term for lwage. That correlation needs to be included in the SEM model. Here’s a path diagram for the new model:
In Stata, it’s done like this:
. sem (lwage <-educ) (educ<-nearc4), cov(e.educ*e.lwage)
The cov option allows a covariance (and therefore a correlation) between the two error terms. Here are the new results:
Now the coefficient and standard error for educ are identical to what we saw earlier with 2SLS. Problem solved! (SEM won’t always give identical results to 2SLS, as I explain in Addendum 1).
Interestingly, the estimated correlation between the error terms (not shown in the table) was -.66. That means that the collective impact of omitted variables is to affect educ and lwage in opposite directions. Whatever raises lwage lowers educ, and vice versa.
It’s hard to imagine what variables would behave like this. And that difficulty raises questions about the plausibility of the model. (Card suggested that measurement error in education could produce a negative correlation, but the degree of error would have to be unreasonably large to produce the result we just saw.) In any case, that’s not really our concern here. It does point out, however, that one of the advantages of the SEM approach is that you get an estimate of the error correlation—something you typically don’t see with 2SLS.
What’s the lesson here? If you want to use SEM to estimate an instrumental variable model, it’s essential to think carefully about why you need instruments in the first place. You should make sure that the specified model reflects all your beliefs about the causal mechanism. In particular, if you suspect that error terms are correlated with observed variables, those correlations must be built into the model.
If you’d like to learn more about structural equation modeling, check out my SEM short courses, which come in 2-day and 5-day versions. You can find links to past and future offerings here. You might also be interested in Felix Elwert’s 2-day course on Instrumental Variables. His past and future offerings can be found here.
Card, David (1995) “Using geographic variation in college proximity to estimate the return to schooling.” Aspects of Labour Economics: Essays in Honour of John Vanderkamp. University of Toronto Press.
Addendum 1. Just-Identified Models
The reason 2SLS and SEM produce identical coefficient estimates for this example is that the model is just-identified. That is, the model imposes no restrictions on the variances and covariances of the variables. Other instrumental variable methods like limited information maximum likelihood or GMM also yield identical results in the just-identified situation. For models that are over-identified (for example, if there are two instrumental variables instead of just one), these methods will all yield somewhat different results.
Competing risks are common in the analysis of event time data. The classic example is death, with distinctions among different kinds of death: if you die of a heart attack, you can’t then die of cancer or suicide. But examples also abound in other fields. A marriage can end either by divorce or by the death of a spouse, but not both. If a loan is terminated by pre-payment, it is no longer at risk of a default.
In this post, I argue that one of the more popular methods for regression analysis of competing risks—the analysis of subdistribution hazards, introduced by Fine and Gray (1999)—is not valid for causal inference. In fact, it can produce completely misleading results. Instead, I recommend the analysis of cause-specific hazards, a long-standing and easily implemented method.
Let’s review that classic method first. The estimation of cause-specific hazard functions (Kalbfleish and Prentice 2002) can be accomplished with standard methods for single kinds of events. You simply treat all competing events as though the individual were right censored at the time the competing event occurs. For example, if you want to study the effect of obesity on the risk of death due to heart disease, just estimate a Cox proportional hazards model in which all causes of death other than heart disease are treated as censoring. Or if you want to estimate the effect income on divorce, estimate a Cox model in which spousal death is treated as censoring.
By contrast, the method of Fine and Gray (1999) does not treat competing events in the same way as censored observations. Based on cumulative incidence functions, their method estimates a proportional hazards model for something they call the subdistribution hazard.
The definition of the subdistribution hazard is similar to that for a cause-specific hazard, with one key difference: the cause-specific hazard removes an individual from the risk set when any type of event occurs; the subdistribution hazard removes an individual from the risk set when an event of the focal type occurs or when the individual is truly censored. However, when a competing event occurs, the individual remains in the risk set. Fine and Gray acknowledge that this is “unnatural” because, in fact, those who experience competing events are no longer actually at risk of the focal event. But it’s necessary in order to get a model that correctly predicts cumulative incidence functions. (More on that later).
According to Google Scholar, the Fine and Gray paper has been cited more than 5,000 times. It is now widely available in most major software packages, including Stata (with the stcrreg command), SAS (with the EVENTCODE option in PROC PHREG) and R (with the ‘cmprsk’ package). In some fields, it has become the standard method for analyzing competing risks. In the minds of many researchers, it is the only proper way to analyze competing risks.
But there’s one big problem: the subdistribution method doesn’t isolate distinct causal effects on the competing risks. In fact, it confounds them in predictable and alarming ways. Specifically, any variable that increases the cause-specific risk of event A will appear to decrease the subdistribution hazard for event B. Why? Because whenever a type A event occurs, it eliminates the possibility that a type B event will happen.
Here’s a simple simulation that demonstrates this phenomenon. I generated 10,000 observations on each of two event types, labeled A and B. Event times for type A were generated by a Weibull regression model (a parametric version of the proportional hazards model). The only predictor was variable X, which had a standard normal distribution.
Event times for type B were generated by an identical Weibull regression model in which the only predictor was variable Z, also standard normal. X and Z had a correlation of .50.
If event A occurred before event B, the event time for B was not observed. Similarly, if B occurred before A, the event time for A was not observed. Any event times greater than 8 were treated as right censored. SAS code for generating the data and performing the analysis is appended below.
In the resulting data set, there were 4350 type A events, 4277 type B events, and 1391 truly censored observations (neither A nor B was observed). Given the data generating process, any attempt to estimate causal parameters should find no effect of X on B and no effect of Z on A.
In Table 1 below, I show the results from applying the two different methods: cause-specific Cox regression models with competing events treated as censoring; and subdistribution proportional hazards models. For type A, the cause-specific estimate of .494 for the effect of X is close to the true value of .500 and highly significant. The coefficient of -.008 for Z is close to the true value of 0 and far from statistically significant. Overall, pretty good performance.
The subdistribution estimates for type A, on the other hand, are clearly unsatisfactory. The coefficient for X appears to be biased downward by about 10%. The coefficient for Z (-.216) is far from the true value of 0, and highly significant. Thus, the subdistribution estimates would lead one to conclude that an increase in Z reduced the risk of event A. What it actually did is reduce the likelihood that type A events would be observed because it increased the risk of event B.
The results for event B in the lower panel of Table 1 are the mirror image of those for event A. For both X and Z, the cause-specific estimates are close to the true values. The subdistribution estimates are biased, and would lead to incorrect causal inferences.
Table 1. Results from Two Methods for Estimating Competing Risks Models, Non-Informative Censoring.
So why would anyone seriously consider the subdistribution method? Well, there’s one big problem with the cause-specific hazards approach. Virtually all methods based on cause-specific hazards implicitly assume that censoring is non-informative. Roughly speaking, that means that if an individual is censored at a particular point in time, that fact tells you nothing about the individual’s risk of the event.
If the censoring times are determined by the study design (as when all event times beyond a certain calendar time are censored), that’s not usually an issue. But if censoring times are not under the control of the investigator, the censoring may be informative. And that can lead to bias.
If competing risks are treated as a form of censoring, then we need to be concerned about whether that censoring is informative or non-informative. How might competing events be informative? If a spouse dies, does that tell us anything about the risk that the couple would have divorced? Maybe, but probably not. Does the fact that a person dies of heart disease tell us anything about that person’s risk of dying of cancer? Maybe so. That would definitely be the case if cancer and heart disease had common risk factors, and those factors were not included in the regression model.
Unfortunately, there’s no way to test the assumption that censoring is non-informative (Tsiatis 1975). And even if there were, there’s no good method available for estimating causal effects when censoring is informative.
By contrast, the subdistribution hazard method does not explicitly assume that competing risks are non-informative. And that has been one of its major attractions. However, as I now show, the subdistribution method does no better (and actually somewhat worse) than the cause-specific method when competing events are informative.
In the previous simulation, the assumption of non-informative censoring was satisfied by the data generating process. To model informative censoring, I added an unobserved common risk factor to the regression equations for the two event times. This was simply a normally distributed random variable with a standard deviation of 2. This “random intercept” induced a correlation of .28 between the uncensored event times for type A and type B. I then re-estimated the models, with results shown in Table 2 below.
The message of Table 2 is this: Yes, informative censoring leads to cause-specific estimates that are biased. And unlike in the previous table, the cause-specific estimates might lead one to conclude, incorrectly, that Z affects the risk for event A and that Xaffects the risk for event B. But the subdistribution estimates are also biased. And, with one minor exception, the biases are worse for the subdistribution method than for the cause-specific method.
Table 2. Results from Two Methods for Estimating Competing Risks Models, Informative Censoring.
To repeat my earlier question: Why would anyone ever seriously consider using the subdistribution method? To be fair to Fine and Gray, they never claimed that subdistribution regression would accurately estimate causal parameters. Instead, they introduced the method as a way to model the impact of covariates on the cumulative incidence functions. These functions are often preferable to Kaplan-Meier estimates of the survivor function because they do a better job of describing the empirical distribution of events rather than some hypothetical distribution that would apply only in the absence of competing risks.
Cumulative incidence functions are particularly useful for prediction. Suppose you have a cohort of newly diagnosed cancer patients. Based on the experience of earlier patients, you want to predict in five years what percentage will have died of cancer, what percentage will have died of other causes, and what percentage will still be alive. Cumulative incidence functions will give you that information. Kaplan-Meier survivor functions will not.
The Fine and Gray method provides a way to introduce covariate information into those predictions, potentially making them more accurate for individual patients. It’s important to note, however, that one can also calculate cumulative incidence functions based on cause-specific hazard functions. Most commercial packages for Cox regression don’t have that capability, but there are downloadable SAS macros that will accomplish the task.
In sum, the subdistribution method may be useful for generating predicted probabilities that individuals will be in particular states at particular times. But it is not useful for estimating causal effects of covariates on the risks that different kinds of events will occur. For that task, the analysis of cause-specific hazards is the way to go. Unfortunately, both methods are vulnerable to competing events that are informative for each other. The only effective way to deal with that problem is to estimate cause-specific hazard models that include common risk factors as covariates.
SAS Code for Simulation
do i=1 to 10000;
*Generate a common risk factor. To include it in the model, change 0 to 1;
*Generate x and z, bivariate standard normal with r=.5;
z=.5*x + sqrt(1-.25)*rannor(0);
*Generate w with Weibull distribution depending on x;
logw=2 + -.75*x + 1.5*log(ranexp(0)) + 2*comrisk;
*Generate y with Weibull distribution depending on z;
logy=2 + -.75*z + 1.5*log(ranexp(0)) + 2*comrisk;
*Allow events to censor each other;
if y>w then do;
else if w>y then do;
*Censor all event times at 8;
if t>8 then do;
table type; run;
/*Estimate cause-specific hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0 2)=x z; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0 1)=x z; run;
/*Estimate subdistributions hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=1; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=2; run;
In my courses and books on longitudinal data analysis, I spend a lot of time talking about the between-within model for fixed effects. I used to call it the hybrid model, but others have convinced me that “between-within” provides a more meaningful description.
Last week my long-time collaborator, Paula England, asked me a question about the between-within model that stumped me at first. When I finally figured out the answer, I realized that it could potentially be important to anyone who uses the between-within method. Hence, this post.
Here is Paula’s project, co-authored with Eman Abdelhadi. They have a large, cross-sectional sample of adult women from approximately 50 countries. They are estimating a logistic regression model for a dichotomous dependent variable: whether or not a woman is employed. One of the independent variables is also dichotomous: whether or not a woman is a Muslim (coded 1 or 0). In order to control for all between-country differences, they estimate a between-within model with the following characteristics:
a random-intercepts model with countries as clusters.
a “within” predictor that is the Muslim indicator (dummy variable) minus the country mean of that variable.
a “between” predictor that is the country mean, i.e., the proportion Muslim.
other control variables at both the person level and the country level.
The coefficient for the within predictor is very close to what you would get from a classic fixed effects estimator, as estimated by conditional logistic regression. (My 2014 post discusses why they are not identical). So it represents the effect of being Muslim on employment, controlling for all country-level variables, both observed and unobserved.
Here is the question. Can you interpret the coefficient for the between predictor (proportion Muslim) as the effect of living in a country that is more or less Muslim, regardless of whether a person is personally a Muslim? In other words, can you interpret the between coefficient as a contextual effect?
Surprisingly, the answer is no. By construction, the within variable is uncorrelated with the between variable. For that reason, the coefficient for proportion Muslim does not actually control for the person-level effect of being Muslim.
The reason I never thought about this question before is that I primarily use the between-within model for longitudinal data, with repeated measurements clustered within persons. In that setting, the coefficients for the between variables are usually uninteresting or misleading. But when applied to other kinds of clustering, the estimation and interpretation of “between” effects can be quite important.
Fortunately, there’s a simple solution to the problem: Estimate the model using the original Muslim indicator rather than the deviation from its country-level mean. This model is mathematically equivalent to the original between-within model—the coefficients for the Muslim indicator and for all the control variables will be exactly the same. But the coefficient for the proportion Muslim will change, and in a predictable way. Here is the algebra:
In these equations pij is the probability of employment for person i in country j, xij is the Muslim indicator for person i in country j, and x-bar-j is the mean of the Muslim indicator in country j.
The first equation is the usual formulation of the between-within model. The second equation is the algebra that gets us to the third equation. That equation is the new formulation of the model with xij and x-bar-j as predictors. Notice that the coefficient of x-bar-j in the third equation can be found by subtracting the coefficient for the within effect from the coefficient for the between effect in the first equation.
So you don’t actually have to re-estimate the model to get the contextual effect. You can just take the difference in the coefficients in the standard between-within model. Furthermore, testing whether those coefficients are the same or different (which is usually done to test fixed vs. random effects) is equivalent to testing for the existence of a contextual effect.
Keep in mind that although the coefficient for x-bar-j in the third equation can be interpreted as a contextual effect, it does not control for any unobservables. So it could be biased by the omission of country-level variables or person-level variables that affect employment and are correlated with proportion Muslim. Note, also, that the fact that the person-level variable is dichotomous in this example is completely incidental. The same logic would apply if xij were continuous.
I thought I knew what it meant for data to be missing at random. After all, I’ve written a book titled Missing Data, and I’ve been teaching courses on missing data for more than 15 years. I really ought to know what missing at random means.
But now that I’m in the process of revising that book, I’ve come to the conclusion that missing at random (MAR) is more complicated than I thought. In fact, the MAR assumption has some peculiar features that make me wonder if it can ever be truly satisfied in common situations when more than one variable has missing data.
First, a little background. There are two modern methods for handling missing data that have achieved widespread popularity: maximum likelihood and multiple imputation. As implemented in most software packages, both of these methods depend on the assumption that the data are missing at random.
Here’s how I described the MAR assumption in my book:
Data on Y are said to be missing at random if the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables X and Y, where X always is observed and Y sometimes is missing. MAR means that
Pr(Y missing| Y, X) = Pr(Y missing| X).
In words, this expression means that the conditional probability of missing data on Y, given both Y and X, is equal to the probability of missing data on Y given X alone.
Suppose that Y is body weight and X is gender. Then if women are less likely to disclose their weight, Y can be MAR because the probability of missing Y depends on X, which is observed. But what if both men and women are less likely to disclose their weight if they’re overweight? Then values are not MAR, since the probability Y is missing depends on Y net of X.
That definition works fine when only one variable has missing data. But things get more complicated when two or more variables have missing data. I learned this from Example 1.13 in the classic book by Little and Rubin (2002). That example deals with one of the simplest cases, when there are just two variables, X and Y. Suppose that both of them have missing data, and the missingness falls into four patterns:
1. X and Y both observed.
2. X observed but Y missing.
3. X missing but Y observed.
4. X and Y both missing.
For each of these patterns, there is a corresponding probability of observing that pattern. The MAR assumption is about how those four probabilities depend on the values of X and Y. Specifically, MAR says that the probability of each pattern may depend on the variables observed in that pattern, but not on variables that are not observed (after conditioning on the observed variables).
Here’s how this plays out:
The probability of getting pattern 1 (both X and Y observed) may depend on the values of both X and Y. It doesn’t have to depend on them, but MAR allows for that dependence.
The probability of pattern 2 (X observed and Y missing) may depend on X, but it can’t depend on Y.
The probability of pattern 3 (X missing and Y observed) may depend on Y, but it can’t depend on X.
The probability of pattern 4 (both variables missing) can’t depend on either X or Y.
What’s important to stress here is that MAR is a statement about the probabilities of observing particular patterns of missingness. It’s not about the probabilities that individual variables are missing.
Here’s the thing that troubles me about this: except in special cases—like missing completely at random or monotone missing data—it is surprisingly difficult to generate simulated data that would satisfy all four of the above conditions. And my personal rule is that if I can’t simulate it, I don’t really understand it.
There are three approaches to simulation that would seem like natural starting points:
1. Two dichotomous logistic regression models, one in which the probability that Y is missing depends on X, and the other in which the probability that X is missing depends on Y. The problem with that approach is that, by implication, the probability that both variables are missing would depend on the values of both variables. And that’s disallowed by MAR.
2. A multinomial logit model for the four probabilities, with both X and Y as predictors. As noted by Robins and Gill (1997), however, this would violate the MAR conditions for patterns 2, 3, and 4.
3. Specify a separate model for each of the four patterns. That’s problematic because the four probabilities have to sum to 1, and the probability of pattern 4 has to be a constant q. Consequently, the three other probabilities have to sum to 1-q. It’s not obvious how to impose that constraint without violating other MAR conditions, especially when X and Y are continuous. I’ve done it for dichotomous X and Y, but that necessitated some trial and error in order to avoid violating the constraint.
So none of those approaches is satisfactory. However, when I presented this problem to Paul von Hippel, who is co-authoring the revision to Missing Data, he came up with the following algorithm for two variables:
1. Start with a sample of data on two variables X and Y, both fully observed.
2. Randomly divide the sample into three groups. The proportions in each group can be whatever you choose.
3. In Group 1, let the probability that Y is missing be a function of X. A logistic regression model would be a natural choice, but it could be a probit model or something else.
4. In Group 2, let the probability that X is missing be a function of Y. Again, there are many possibilities for this function.
5. In Group 3, set both X and Y to be missing.
That’s it. I’ve verified that it works with a simulation. However, I later learned that it’s been done before. It’s one example of class of algorithms for MAR proposed by Robins and Gill (1997) called Randomized Monotone Missingness models. This class of models can handle any number of variables with missing data. These models also allow the probability of falling into each of the subgroups to depend on other variables that are always observed.
So, yes, it’s possible to simulate MAR when two are more variables are missing, and that’s progress. But is it plausible that this algorithm corresponds to any real-world situations? I’m skeptical. The key question is, why would the mechanism producing missing data be different in different subgroups?
For those of us who regularly use multiple imputation or maximum likelihood to handle missing data, I think we are just going to have to live with the MAR assumption, despite its peculiarities. It has been shown to be the weakest assumption that still implies ignorability—that is, the ability to make inferences without having to model the missing data mechanism. So if you’re not satisfied with MAR, you’ll have to engage in a much more complicated modeling process, one that still involves untestable assumptions.
Before concluding, I should mention that there is one situation where MAR makes sense and can also be easily simulated. That’s the case of a longitudinal study with drop out as the only cause of missing data. In that case, MAR means that drop out can depend on anything that is observed before the drop out. But it can’t depend on anything that would have been observed after the drop out.
For some applications, you might still suspect that drop out depends on what would have been observed immediately after the drop out. But at least MAR in that setting is an assumption that is plausible, understandable, and algorithmically reproducible.
Little, Roderick J.A., and Donald B. Rubin (2002) Statistical Analysis with Missing Data. Wiley.
Robins, James M., and Richard D. Gill (1997) “Non‐response models for the analysis of non‐monotone ignorable missing data.” Statistics in Medicine 16: 39-56.
In my last post, I explained several reasons why I prefer logistic regression over a linear probability model estimated by ordinary least squares, despite the fact that linear regression is often an excellent approximation and is more easily interpreted by many researchers.
I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the margins command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese).
If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the margins command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.)
I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)
There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The Stata code for estimating the model is
webuse nhanes2f, clear
logistic diabetes i.black i.female age
The i. in front of black and female tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.
A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for non-blacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.
Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes. Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:
The dy/dx column gives the estimated change in the probability of diabetes for a 1-unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for non-blacks, on average.
How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: black=1 and black=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like age, the calculation of the AME is a little more complicated.
There’s another, more traditional, way to get marginal effects: for the variable black, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when black=1 and when black=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.
Here the MEM for black is noticeably smaller than the AME for black. This is a consequence of the inherent non-linearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM).
The non-linearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values.
Here’s how to get the AME for black at four different ages:
We see that the AME of black at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should not be thought of as an interaction, however, because the model says that the effect on the log-odds (logit) is the same at all ages.
Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the log-odds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used
The predict command generates predicted probabilities and stores them in the variable yhat. The sum command produces descriptive statistics for yhat, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:
twoway function y=ln(x/(1-x)), range(.005 .244) xtitle("Probability") ytitle("Log odds")
Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:
reg diabetes black female age, robust
I’ve requested robust standard errors to deal with any heteroskedasticity, but the t-statistics are about the same even without this option.
The t-statistics here are pretty close to the z-statistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of black on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.
The upshot is that combining logistic regression with the margins command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probability-based estimates vary under different conditions.
In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:
Within the range of .20 to .80 for the predicted probabilities, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
People understand changes in probabilities much better than they understand odds ratios.
OLS regression is much faster than logistic regression.
I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.
Speed. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.
Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with k-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the Invariance section below.
Predicted probabilities. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.
In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.
Interpretability. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.
In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.
Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.
A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.
Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its margins command, which I will demonstrate in my next post.
Invariance. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.
Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.
This issue also arises for categorical predictors. Consider a dichotomous y and a single dichotomous predictor x. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for y on x.
One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Study (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.
There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The model also includes the interaction of black and age. The Stata code for estimating the model is
webuse nhanes2f, clear logistic diabetes black female age black#c.age
Now we have strong evidence for an interaction, specifically that the effect of black is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase.
Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.
Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability does not? Absolutely. But I believe that that’s a far less likely outcome. I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.
In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.
In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the margins command in Stata. I’ll also provide links for how to do it in SAS and R.
In July 2015 I pointed out some advantages of the linear probability model over the logistic model. The linear model is much easier to interpret, and the linear model runs much faster, which can be important if the data set is large or the model is complicated. In addition, the linear probability model often fits about as well as the logistic model, since over some ranges the probability p is almost linearly related to the log odds function ln(p(1-p)) that is used in logistic regression.
As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.
Figure 1. The relationship between probability and log odds over the range of probabilities from .2 to .8.
This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.
For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2nd grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.
Figure 2. The relationship between probability and log odds over the range of probabilities from .09 to .13.
The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.
I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are still situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).
I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.
Figure 3. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.
On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.
To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.
The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:
twoway function y=ln(x/(1-x)), range(.2 .8) xtitle(“Probability”) ytitle(“Log odds”)
I plotted Figure 2 using the same command, only changing the range to (.09 .13).
In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.
Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.
Angrist, J. D., & Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion (1st ed.). Princeton University Press.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.
Pischke, J.-S. (2012, July 9). Probit better than LPM? Retrieved from http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/
 The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed n/(n-1) where n is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.