In two earlier posts on this blog (here and here), my colleague Paul von Hippel made a strong case for using OLS linear regression instead of logistic regression for binary dependent variables. When you do that, you are implicitly estimating what’s known as a linear probability model (LPM), which says that the probability of some event is a linear function of a set of predictors.

Von Hippel had four major arguments:

• OLS regression is much faster than logistic regression, and that can be a big advantage in many applications, particularly when the data are big, the model is complex, and the model needs to be varied or updated frequently.
• People understand changes in probabilities much better than they understand odds ratios.
• Within the range of .20 to .80 for the predicted probabilities, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, OLS regression may do well if the range is narrow.
• It’s not uncommon for logistic regression to break down completely because of “quasi-complete separation” in which maximum likelihood estimates simply don’t exist. This never happens with the LPM.

Although I agreed with all of von Hippel’s points, I responded with two more blog posts (here and here) in which I gave several reasons why logistic regression was still preferable to the LPM. One of them was that LPMs frequently yield predicted probabilities that are greater than 1 or less than 0, and that obviously can’t be right.

For applications that focus on the regression coefficients, out-of-bounds predictions may not be a big problem. But in many other settings, valid predicted probabilities are essential. These include inverse probability weighting, propensity score methods, and expected loss calculations. Many authors have pointed to invalid predicted probabilities as the principal disadvantage of LPMs (e.g., Westin 1973, Long 1997, Hellevik 2007, Wooldridge 2010, Greene 2017).

Well, I’m happy to tell you that there’s an easy and effective fix for that problem. And it’s been right under our noses ever since Fisher (1936) identified the key elements of a solution. Here’s a quick summary of how the method works:

• The coefficient estimates from OLS with a binary outcome can be transformed into maximum likelihood estimates of the parameters of a “linear discriminant model”.
• The linear discriminant model (LDM) implies a logistic regression model for the dependence of the outcome on the predictors.
• To get valid predicted probabilities, just plug the values of the predictors into that logistic regression model using the transformed parameter estimates.

In the remainder of this post, I’ll flesh out the details and then give two examples.

What is the linear discriminant model?

The linear discriminant model was first proposed by Fisher as a method for classifying units into one or the other of two categories, based on a linear function of the predictor variables. Let x be the vector of predictors, and let y have values of 1 or 0. The LDM assumes that, within each category of y, x has a multivariate normal distribution, with different mean vectors for each category, u1 and u0, but a covariance matrix S that is the same for both categories.

Fact 1. The model specifies the conditional distribution x given the value of y. However, using Bayes’ theorem, it can be re-expressed to give the conditional probability of each value of y, given a value of x. What’s remarkable is that this yields a logistic regression model. Specifically,

log[P(y=1|x)/Pr(y=0|x)] =  a + b’x,

where a and b are functions of the two mean vectors and the covariance matrix.

Fact 2. The other remarkable fact is that maximum likelihood estimates of a and b can be obtained (indirectly) by ordinary least squares regression of y on x. This was originally noted by Fisher but put into more usable form by Haggstrom (1983). To estimate b, the OLS slope coefficients must each be multiplied by K = N / RSS where N is the sample size and RSS is the residual sum of squares. K is typically substantially larger than 1.

The intercept a is obtained as follows. Let m be the sample mean of y and let c be the intercept in the OLS regression. Then, the ML estimate of a is

log[m/(1-m)] + K(c-.5) + .5[1/m – 1/(1-m)]

Getting predicted probabilities

These two facts suggest the following procedure for getting predicted probabilities from a linear probability model.

• Estimate the LPM by OLS.
• Transform the parameters as described in Fact 2.
• Generate predicted probabilities by using the logistic equation in Fact 1.

I call this the LDM method. To make it easy, I have written software tools for three different statistical packages. All tools are named predict_ldm:

A SAS macro available here.
A Stata ado file available here (co-authored with Richard Williams).
An R function shown below in Appendix 3 (co-authored with Stephen Vaisey).

Evaluating the results

The LDM method will absolutely give you predicted probabilities that are always within the (0,1) interval. Aside from that, are they any good? In a moment, I’ll present two examples that suggest that they are very good indeed. However, a major reason for concern is that the LDM assumes multivariate normality of the predictors, but there are very few applications in which all the predictors will meet that assumption, or even come close.

Several studies have suggested that the LDM is pretty robust to violations of the multivariate normality assumption (Press and Wilson 1978, Wilensky and Rossiter 1978, Chatla and Shmueli 2017), but none of those investigations was very systematic. Moreover, they focused on coefficients and test statistics, not on predicted probabilities.

I have applied the method to many data sets, and all but two have shown excellent performance of the LDM method. How did I evaluate the method? I first fit the linear model and applied the LDM method to get predicted probabilities. Then I fit a logistic model using the standard iterative ML method, which makes no assumptions about the distribution of predictors. I then compared the predicted probabilities from LDM and conventional logistic regression in several ways.

I figured that conventional logit should be the gold standard. We can’t expect the LDM method to do any better than conventional logit because both rely on the same underlying model for y, except that LDM makes additional assumptions about the predictor variables.

Example 1. Married women’s labor force participation

For this example, I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The binary dependent variable has a value of 1 if the woman was currently in the labor force (478 women), otherwise 0. Independent variables are number of children under the age of six, age, education (in years), and years of labor force experience, as well as the square of experience. In the Appendices, you’ll find code for this example (and the next) using Stata, SAS and R. The output shown here was produced by Stata.

The table below gives summary statistics for the three different methods for getting predicted values. yhat contains predicted values from the linear model, yhat_ldm contains predicted values based on the LDM method, and yhat_logit contains predicted values from a conventional logistic regression.

```    Variable |  Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------
yhat |  753    .5683931    .2517531  -.2782369   1.118993
yhat_ldm |  753    .5745898    .2605278   .0136687   .9676127
yhat_logit |  753    .5683931    .2548012   .0145444   .9651493```

The linear predicted values have a maximum of 1.12 and a minimum of -.28. In fact, 12 of the linear predicted values were greater than 1, and 14 were less than 0 (not shown in the table). By necessity, the predicted values for both LDM and logit were bounded by (0,1).

Next we’ll look at the correlations among the three sets of predicted values:

```             |     yhat yhat_ldm yhat_logit
-------------+-----------------------------
yhat |   1.0000
yhat_ldm |   0.9870   1.0000
yhat_logit |   0.9880   0.9994   1.0000

```

Although the linear predictions are highly correlated with the logistic predictions, the correlation between the LDM predictions and the logistic predictions is even higher. Here’s the scatter plot for linear vs. logistic:

Notice that as the logistic predictions get close to 1, the linear predictions get larger than the logistic predictions. And, symmetrically, as the logistic predictions get close to 0, the linear predictions get smaller than the logistic predictions. This should alert us to the possibility that, even if none of the linear predicted values is outside the (0, 1) interval, the values at the two extremes may be biased in predictable ways.

By contrast, the plot below for LDM vs. logistic is very well calibrated.

For this example, I’d say that very little would be lost in using the LDM predicted values instead of predicted values from conventional logistic regression.

Example 2. All-Cause mortality among lung cancer patients

After deleting cases with missing data, there were 1,029 patients in the data set. Death is the dependent variable, and 764 of the patients died. What’s different about this example is that all the predictors are categorical and, therefore, have a joint distribution that is nothing like a multivariate normal. The predictors are surgery (1 or 0), a randomized treatment (1 or 0), hospital (Mayo Clinic, Johns Hopkins, or Sloan Kettering), and cancer stage at diagnosis (1, 2, or 3). Both of the 3-category variables are represented by two dummy (indicator) variables. The fitted models included main effects of these variables but no interactions.

Again, I ran two regressions, one linear and one logistic. The linear predictions were transformed by the LDM method. As the next table shows, there were definitely cases with linear predicted values greater than 1. In fact, there were 96 such cases.

```    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
yhat |      1,029    .7424684    .2206033   .4116536    1.01567
yhat_ldm |      1,029    .7299414     .251858    .303894   .9679288
yhat_logit |      1,029    .7424684    .2247691   .3621077   .9674457```

Here are the correlations among the three versions of the predicted values.

```             |     yhat yhat_ldm yhat_logit
-------------+---------------------------
yhat |   1.0000
yhat_ldm |   0.9754   1.0000
yhat_logit |   0.9815   0.9908   1.0000```

Again, the LDM predictions do a better job than the linear predictions of matching the conventional logistic predictions.

Conclusion

The LDM method is a simple and effective way to transform predicted values from a linear probability model so that they do not fall outside the (0,1) interval. Although the method relies on strong assumptions about the distribution of the predictor variables, it nevertheless performed very well in 13 out of 15 data sets and models that I have examined, none of which had predictors that came close to multivariate normality.

What about the two exceptions? Well, I’m still studying them to get a better understanding of what went wrong, but I can tell you this: In one of them, the model included a 2-way interaction that was highly significant in the linear model but not significant in the conventional logistic model. In the other, there were two predictors that were highly collinear. I am also designing a simulation study to explore the conditions under which the LDM method does or does not work well.

Is this a perfect method? Certainly not. I still prefer to use conventional logistic regression wherever computationally practical. But even with today’s enormous computing power, there are still many applications where linear models are quite attractive. And if you’re going to use an LPM and you want valid predicted probabilities, the LDM method seems like a pretty good way to get them.

References

Chatla, S. B., & Shmueli, G. (2017). An extensive examination of regression models with a binary outcome variable. Journal of the Association for Information Systems18(4), 1.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics7(2), 179-188.

Greene, W.H. (2017) Econometric Analysis. Pearson.

Haggstrom, G. W. (1983). Logistic regression and discriminant analysis by ordinary least squares. Journal of Business & Economic Statistics, 1(3), 229-238.

Hellevik, Ottar (2009): Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity 43.1 59-74.

Long, J. S. (1997) Regression models for categorical and limited dependent variables. Sage Thousand Oaks.

Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765-799.

Press, S. J., & Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association73(364), 699-705.

Westin, R. B. (1974). Predictions from binary choice models. Journal of Econometrics2(1), 1-16.

Wilensky, G. R., & Rossiter, L. F. (1978). OLS and logit estimation in a physician location study. In Proceedings of the Social Statistics Section, American Statistical Association (pp. 260-265).

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT press.

Appendix 1: Stata code for the examples

Note: This program uses the predict_ldm command, which may be downloaded here.

``` /*Example 1. 753 married women. DV=1 if in the labor force, else 0.*/

reg inlf kidslt6 age educ exper expersq
predict_ldm
logit inlf kidslt6 age educ exper expersq
predict yhat_logit
sum yhat yhat_ldm yhat_logit
corr yhat yhat_ldm yhat_logit
scatter yhat yhat_logit, xtitle(Logistic Predicted Probabilities) ///
ytitle(Fitted Values from Linear Model)
scatter yhat_ldm yhat_logit, xtitle(Logistic Predicted Probabilities) ///
ytitle(LDM Predicted Probabilities)

/*Example 2. 1029 lung cancer patients. DV is 1 if patient died, else 0. */

reg dead operated treat i.instit i.stage
predict_ldm
logit dead operated treat i.instit i.stage
predict yhat_logit
sum yhat yhat_ldm yhat_logit
corr yhat yhat_ldm yhat_logit
scatter yhat yhat_logit
scatter yhat_ldm yhat_logit```

Appendix 2:  SAS code for the examples

Note: This program uses the predict_ldm macro, which can be downloaded here. The data sets may be dowloaded here and

```/*Example 1. 753 married women. DV=1 if in the labor force, else 0.*/

%predict_ldm (data=my.mroz, y=inlf, x=kidslt6 age educ exper expersq);
proc logistic data=predict_ldm desc;
model inlf=kidslt6 age educ exper expersq;
output out=c pred=yhat_logit;
run;
proc corr data=c;
var yhat yhat_ldm yhat_logit;
run;
proc gplot;
plot yhat*yhat_logit;
plot yhat_ldm*yhat_logit;
run;

/*Example 2. 1029 lung cancer patients. DV is 1 if patient died, else 0. */

data lung;
set my.lung;
hopkins=(instit=2);
mayo=(instit=1);
if stage=. then delete;
stage3 = (stage=3);
stage2 = (stage=2);
run;
x=operated treat hopkins mayo stage3 stage2);
proc logistic data=predict_ldm desc;
model dead=operated treat hopkins mayo stage3 stage2;
output out=c pred=yhat_logit;
run;
proc corr data=c;
var yhat yhat_ldm yhat_logit;
run;
proc gplot data=c;
plot yhat*yhat_logit;
plot yhat_ldm*yhat_logit;
run;```

Appendix 3:  R code for the examples

```# Function to convert linear predictions to the probability scale:
predict_ldm <- function(fit) {
c <- fit\$coefficients[1]
k <- nrow( fit\$model ) / deviance( fit )
m <- mean( fit\$model[,1] )
a <- log(m/(1-m) ) + k*( c-.5 ) + .5*(1/m - 1/(1-m) )
lin_preds <- predict(fit)
my_preds_logit <- k*(lin_preds-c) + a
my_preds <- 1/(1 + exp(-my_preds_logit))
names(my_preds) <- "ldm_preds"
return(my_preds)
}

# Example 1
library(haven)
library(psych)
mroz_fit <- lm(inlf ~ kidslt6 + age + educ + exper + expersq, data = mroz)
summary(mroz_fit)
lin_preds <- predict(mroz_fit)
ldm_preds <-predict_ldm(mroz_fit)
logit_fit <- glm(inlf ~ kidslt6 + age + educ + exper + expersq,
family = binomial, data = mroz)
summary(logit_fit)
logit_preds <- predict(logit_fit, type = "response")
three_preds <- cbind(lin_preds, ldm_preds, logit_preds)
describe(three_preds, fast=T)
cor(three_preds)
plot(logit_preds,lin_preds)
plot(logit_preds, ldm_preds)
# Example 2
lung\$i.stage <- as.factor(lung\$stage)
lung\$i.instit <- as.factor(lung\$instit)
lung_fit <- lm(dead ~ treat+operated+i.stage+i.instit, data = lung)
summary(lung_fit)
lin_preds <- predict(lung_fit)
ldm_preds <-predict_ldm(lung_fit)
family=binomial, data = lung)
summary(logit_fit)
logit_preds <- predict(logit_fit, type = "response")
three_preds <- cbind(lin_preds, ldm_preds, logit_preds)
describe(three_preds, fast=T)
cor(three_preds)
plot(logit_preds,lin_preds)
plot(logit_preds, ldm_preds)```

We are thrilled to introduce a new addition to the Statistical Horizons family — Code Horizons. This new initiative emerges from a simple realization. When we do our research, most of the things we actually do are not things we learned in graduate school. These include writing scripts to parse textual data, creating effective visualizations, making reproducible reports so that others can use our code and replicate our results, creating HTML5 slideshows and websites, and learning new languages like R or Python. More than ever, proficiency at “coding” (broadly defined) has become an essential element of research productivity in science and industry.

Personally, these are skills that I have had to learn largely on my own. Unfortunately, this usually involved a ton of painful trial and error, false starts, and dead ends. Wouldn’t it be great if that pain could be reduced, or even eliminated?

Our goal at Code Horizons is to help you get up to speed quickly on these vital skills so you can keep your focus where it belongs — doing your research.

We are just getting started, and we plan to offer many courses in the coming months and years. For now, we will be launching with two new courses — Python for Data Analysis and Workflow of Data Analysis (featuring both R and Stata). We will also continue to offer such essential courses as Data Visualization, Text as Data, and Statistics with R. Our aim is to help you learn the skills you need in an efficient and affordable way in order to rapidly increase your productivity.

As Director of Code Horizons, I am really excited about these new courses, and I will be attending them myself. Watch this space for more developments in the coming months!

When using multiple imputation, you may wonder how many imputations you need. A simple answer is that more imputations are better. As you add more imputations, your estimates get more precise, meaning they have smaller standard errors (SEs). And your estimates get more replicable, meaning they would not change too much if you imputed the data again.

There are limits, though. No matter how many imputations you use, multiple imputation estimates can never be more precise or replicable than maximum likelihood estimates. And beyond a certain number of imputations, any improvement in precision and replicability becomes negligible.

So how many imputations are enough? An old rule of thumb was that 3 to 10 imputations typically suffice (Rubin 1987). But that advice only ensured the precision and replicability of point estimates. When the number of imputations is small, it is not uncommon to have point estimates that replicate well but SE estimates that do not.

In a recent paper (von Hippel 2018), for example, I estimated the mean body mass index (BMI) of first graders, from a sample where three quarters of BMI measurements were missing. With 5 imputations, the point estimate was 16.642 the first time I imputed the data, and 16.659 the second time. That’s good replicability; despite three quarters of the values being imputed, the two point estimates differed by just 0.1%.

But the SE estimate didn’t replicate as well; it was .023 the first time I imputed the data, and .026 the second time—a difference of 13%. Naturally, if the SE estimate isn’t replicable, related quantities like confidence intervals, t statistics, and p values won’t be replicable, either.

So you often need more imputations to get replicable SE estimates. But how many more? Read on.

### A New Formula

I recently published a new formula (von Hippel 2018) that estimates how many imputations M you need for replicable SE estimates. The number of imputations is approximately

This formula depends on two quantities, FMI and CV(SE).

• FMI is the fraction of missing information. The FMI is not the fraction of values that are missing; it is the fraction by which the squared SE would shrink if the data were complete. You don’t need to figure out the FMI. Standard MI software gives you an estimate.
• CV() is a coefficient of variation, which you can think of as roughly the percentage by which you’d be willing to see the SE estimate change if the data were imputed again.

For example, if you have FMI=30 percent missing information, and you would accept the SE estimate changing by 10 percent if you imputed the data again, then you’ll only need M=5 or 6 imputations. But if you’d only accept the SE changing by 5%, then you’ll need M=19 imputations. (Naturally, the same formulas work if FMI and CV(SE) are expressed as proportions, .3 and .1, rather than percentages of 30 and 10.)

Notice that the number of imputations increases quadratically, with the square of FMI. This quadratic rule is better than an older rule M = 100 FMI, according to which the number of imputations should increase linearly with FMI (Bodner 2008; White et al. 2011).

Here’s a graph, adapted from von Hippel (2018), that fits the linear and quadratic rules to a simulation carried out by Bodner (2008), showing how many imputations are needed to achieve a goal similar to CV(SE)=.05. With that goal, the quadratic rule simplifies to M=1+200 FMI2, and the linear rule, as usual, is M=100 FMI.

Clearly the quadratic rule fits the simulation better than the linear rule. The rules agree when FMI=0.5, but when FMI is larger, the linear rule underestimates the number of imputations needed, and when FMI is smaller, the linear rule overestimates the number of imputations needed. For example, with 20% missing information, the linear rule says that you need 20 imputations, but the quadratic rule says you can make do with 9.

When the fraction of missing information gets above 70%, both rules underestimate the number of imputations needed for stable t-based confidence intervals. I suspect that happens because the degrees of freedom in the t statistic becomes unstable (von Hippel 2018). I am looking at that issue separately.

### A Two Step Recipe

A limitation of the quadratic rule is that FMI is not known in advance. FMI has to be estimated, typically by multiple imputation. And the estimate of FMI itself can be unreliable unless the number of imputations is large (Harel 2007).

So it’s a circular problem. You need an estimate of FMI to decide how many imputations you need. But you can’t get an estimate of FMI until you impute the data. For that reason, I recommend a two-step recipe (von Hippel, 2018):

1. First, carry out a pilot analysis. Impute the data using a convenient number of imputations. (20 imputations is a reasonable default, if it doesn’t take too long.) Estimate the FMI by analyzing the imputed data.
2. Next, plug the estimated FMI into the formula above to figure out how many imputations you need to achieve a certain value of CV(SE). If you need more imputations than you had in the pilot, then add those imputations and analyze the data again.

There are two small wrinkles:

First, when you plug an estimate of FMI into the formula, you shouldn’t use a point estimate. Instead, you should use the upper bound of a 95% confidence interval for FMI. That way you take only a 2.5% risk of having too few imputations in your final analysis.

Second, in your analysis you may estimate several parameters, as in a multiple regression. In that case, you have to decide which SEs you want to be replicable. If you don’t have a strong opinion, the simplest thing is to focus on the SE of the parameter with the largest FMI.

### Software

The two-step recipe has been implemented in three popular data analysis packages.

• In Stata, you can install my command how_many_imputations by typing ssc install how_many_.  When there are multiple parameters, it uses the highest FMI.
• In SAS, you can use my macro %mi_combine, which is available in this Google Drive folder.
• In R, you can use Josh Errickson’s howManyImputations package, which is available via GitHub, here.

#### Alternatives

The two-stage procedure isn’t the only good way to find a suitable number of imputations. An alternative is to keep adding imputed datasets until the estimates converge, or change very little as new imputed datasets are added. This approach can be applied to any quantity you want to estimate—a point estimate, a standard error, a confidence interval, a p value. This approach, called iterative multiple imputation (Nassiri et al. 2018), has been implemented in the R package imi, which is available via GitHub, here.

Stata’s mi estimate command uses a jackknife procedure to estimate how much your standard error estimates and p values would change if the data were imputed again with the same number of imputations (Royston, Carlin, and White 2009). These jackknife estimates can give you some idea whether you need more imputations, but they can’t directly tell you how many imputations to add.

(An earlier, shorter version of this post appeared on missingdata.org in January 2018)

### Key reference

von Hippel, Paul T. (2018). “How many imputations do you need? A two-stage calculation using a quadratic rule.” Sociological Methods and Research, published online, behind a paywall. A free pre-publication version is available as an arXiv e-print.

Bodner, T. E. (2008). What Improves with Increased Missing Data Imputations? Structural Equation Modeling, 15(4), 651–675. https://doi.org/10.1080/10705510802339072

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prevention Science, 8(3), 206–213. https://doi.org/10.1007/s11121-007-0070-9

Harel, O. (2007) Inferences on missing information under multiple imputation and two-stage multiple imputation. Statistical Methodology, 4(1), 75-89.

Nassiri, Vahid, Geert Molenberghs, Geert Verbeke, and João Barbosa-Breda. (2019). “Iterative Multiple Imputation: A Framework to Determine the Number of Imputed Datasets.” The American Statistician, 17 pages online ahead of print. https://doi.org/10.1080/00031305.2018.1543615

Royston, Patrick, John B. Carlin, and Ian R. White. (2009). “Multiple Imputation of Missing Values: New Features for Mim.” The Stata Journal 9(2):252–64.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067

When R first came out, around the year 2000, I was really excited.  Here was a powerful, programmable statistical package that was free to anyone. I thought “This could revolutionize data analysis.”  But when I gave it a test run, I quickly got discouraged. All the routine data management tasks seemed much harder in R than in SAS or Stata, my go-to statistical packages. And many of my favorite analytical tools (like Cox regression) that were readily available in commercial packages seemed to be missing from R. Plus, everything was formulated in terms of vectors and matrices, and I knew that wouldn’t fly with the sociology graduate students I was teaching. So I put R back on the shelf and let others play with this shiny new toy.

Fast forward 20 years and things have changed dramatically. It now seems that no matter what you want to do in statistics, there’s an R package for that. Slick interface apps like RStudio have streamlined most of the routine tasks of managing R itself. And meta-packages like the tidyverse have made data management and manipulation much easier and more straightforward.

Along with those changes, the community of R users has grown enormously. It’s hard to determine the exact number of R users, but recent estimates (by Oracle and by Revolution Analytics) put the worldwide user base at about 2 million. In a regularly updated blog post, Robert Muenchen has been tracking the popularity of data science software for several years, using a variety of metrics. One graph that I found particularly interesting was this one, which shows the number of mentions of various statistical packages in scholarly articles, based on a search of Google Scholar:

The most striking feature of this graph is the overall dominance of SPSS. However, SPSS peaked in 2009 and has since suffered a massive decline. SAS has also declined greatly since 2010, and it is now below R (the orange triangles) which has been steadily increasing in scholarly mentions since its inception.

The growth in R has two important consequences:

• It’s now much easier to get answers to any kind of R question, either by a quick web search or by just asking someone who works nearby.
• You can readily communicate your program code to hundreds of thousands of other researchers, without worrying about whether they have the same software installed.

It’s this second point that I want to stress here. I’ve come to the conclusion that R should be every data analyst’s second language. To me, it’s like English. No matter what your native language, it’s extremely useful to have English as a second language so that you can communicate easily with others who don’t share your native tongue. And so it is with R.

I’ve certainly found that to be true in my own work. Although I’m far from being a skilled R programmer, I’ve recently developed enough familiarity and competence with R that I can translate nearly all of my teaching examples and exercises into R. Currently, I try to provide R code for all the seminars I teach for Statistical Horizons. My two most recent publications have included R code. And I can now do a much better job of reading and reviewing papers that include R code.

R also enables me to do some things that I can’t do in SAS or Stata. For example, R can do multiple imputation for clustered data using either the jomo package or the mice package.  And for linear structural equation modeling, I prefer the lavaan package over what’s in SAS or Stata (but Mplus is still better).

More and more people who come to Statistical Horizons seminars prefer to work in R, and they really appreciate having R code. Increasingly, our new instructors also prefer to teach their courses using R. Of course, that’s mainly a cohort phenomenon. Statistics departments overwhelmingly prefer to teach with R, at both the undergraduate and graduate levels, so there are thousands of newly-minted R users every year.

Don’t get me wrong, there are still lots of things I don’t like about R. There are so many user-contributed R packages that it’s hard to figure out which one will do what you want. And because there will often be many packages that will do you want, you then have the problem of figuring out which one is best. Documentation for many packages is sparse or poorly written. Even more bothersome is that some packages conflict with other packages because they have functions (or other elements) that share the same names. That can cause serious headaches.

Despite those problems, I am confident that R has a bright future. There are even newer competitors (like Python) that are favored in the data science/machine learning community. But R has reached such a critical mass that it will be very hard to stop.

How do you go about learning R?  Well, there are lots of good materials on the web. But I think the best way is to take our seminar Statistics with R taught by Professor Andrew Miles. It’s designed for people who are reasonably proficient in statistics and already familiar with another package (like SPSS, SAS or Stata), but who just want to learn how to do those same things in R.  In two days, you’ll get comfortable enough with R to do most of the statistical tasks that you are already doing in other packages. Andrew’s course also has lots of hands-on exercises, something that’s essential for deep learning of any new tool. You can get more information about his course here.

Standard methods for the analysis of panel data depend on an assumption of directional symmetry that most researchers don’t even think about. Specifically, these methods assume that if a one-unit increase in variable X produces a change of B units in variable Y, then a one-unit decrease in X will result in a change of –B units in Y

Does that make sense? Probably not for most applications. For example, is it plausible that the increase in happiness when a person gets married is exactly matched by the decrease in happiness when a person gets divorced? Or that a \$10K increase in income has the same effect on savings as a \$10K decrease (in the opposite direction).

In this post, I’m going to show you how to relax that assumption. I’ll do it for the simplest situation where there are only two time points. But I’ve also written a more detailed paper covering the multi-period situation.

Here’s the example with two-period data. The data set has 581 children who were studied in 1990 and 1992 as part of the National Longitudinal Survey of Youth.  I’ll use three variables that were measured at each of the two time points:

anti     antisocial behavior, measured with a scale from 0 to 6.
self      self-esteem, measured with a scale ranging from 6 to 24.
pov      poverty status of family, coded 1 for family in poverty, otherwise 0.

The goal is to estimate the causal effects of self and pov on anti. I’ll focus on fixed effects methods (Allison 2005, 2009) because they are ideal for studying the effects of increases or decreases over time. They also have the remarkable ability to control for all time-invariant confounders.

For two-period data, there are several equivalent ways to estimate a fixed effects model. The difference score method is the one that’s the most straightforward for allowing directional asymmetry. It works like this: for each variable, subtract the time 1 value from the time 2 value to create a difference score. Then, just estimate an ordinary linear regression with the difference scores.

Here is Stata code for a standard symmetrical model:

```use nlsy.dta, clear
generate antidiff=anti92-anti90
generate selfdiff=self92-self90
generate povdiff=pov92-pov90
regress antidiff selfdiff povdiff```

You’ll find equivalent SAS code at the end of this post.

And here are the results:

```------------------------------------------------------------------------------
antidiff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
selfdiff |  -.0391292   .0136396    -2.87   0.004    -.0659185     -.01234
povdiff |   .1969039   .1326352     1.48   0.138    -.0636018    .4574096
_cons |   .0403031   .0533833     0.75   0.451    -.0645458    .1451521
------------------------------------------------------------------------------```

Self-esteem has a highly significant negative effect on antisocial behavior. Specifically, for each 1-unit increase in self-esteem, antisocial behavior goes down by .039 units. But that also means that for each 1-unit decrease in self-esteem, antisocial behavior goes up by .039 units. Poverty has a positive (but non-significant) effect on self-esteem. Children who move into poverty have an estimated increase in antisocial behavior of .112. But children who move out of poverty have an estimated decrease antisocial behavior of .112.

How can we relax the constraint that these effects have to be the same in both directions?  York and Light (2017) showed the way. What’s needed is to decompose the difference score for each predictor variable into its positive and negative components. Specifically, if D is a difference score for variable X, create a new variable D+ which equals D if D is greater than 0, otherwise 0. And create a second variable D which equals –D if D is less than 0, otherwise 0.

Here’s how to create these variables in Stata:

```generate selfpos=selfdiff*(selfdiff>0)
generate selfneg=-selfdiff*(selfdiff<0)
generate povpos=povdiff*(povdiff>0)
generate povneg=-povdiff*(povdiff<0)```

The inequalities in parentheses are logical expressions that have a value of 1 if the inequality is true and 0 if the inequality is false.

Now just regress antidiff on all four of these variables,

`regress antidiff selfpos selfneg povpos povneg`

which produces the following table:

```--------------------------------------------------------------------------
antidiff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------+----------------------------------------------------------------
selfpos |  -.0048386   .0251504    -0.19   0.848    -.0542362    .0445589
selfneg |   .0743077    .025658     2.90   0.004      .023913    .1247024
povpos |   .2502064   .2003789     1.25   0.212     -.143356    .6437688
povneg |   -.126328   .1923669    -0.66   0.512    -.5041541    .2514981
_cons |  -.0749517    .086383    -0.87   0.386    -.2446157    .0947123```
`--------------------------------------------------------------------------`

What does this tell us?  A 1-unit increase in self-esteem lowers antisocial behavior by .005 units (an effect that is far from statistically significant). A 1-unit decrease in self-esteem increases antisocial behavior by .074 units (highly significant). So it looks like decreases in self-esteem have a big effect, while increases have little impact. Note that the original estimate of -.039 was about midway between these two estimates.

Are the two effects significantly different? We can test that with the command

`test selfpos=-selfneg`

which yields a p-value of .10, not quite statistically significant.

Neither of the two poverty coefficients is statistically significant, although they are in the expected direction: moving into poverty increases antisocial behavior while moving out of poverty reduces it, but by about half the magnitude. These two effects are definitely not significantly different.

So that’s basically it for two-period data. When there are three or more periods, you have to create multiple records for each individual. Each record contains difference scores for adjacent periods. When estimating the regression model, you need to allow for negative correlations between adjacent records by using generalized least squares.

I discuss all these options in a paper that can be found here. That paper also presents a data generating model that justifies the asymmetric first difference method. The data generating model can be extended to allow for the estimation of asymmetric logistic regression models, which can’t be estimated with difference scores.

If you want to learn more about fixed effects methods, see my two books on this topic:  Fixed Effects Regression Models and Fixed Effects Regression Methods for Longitudinal Data Using SASOr take one of my 2-day seminars on longitudinal data analysis.

References:

Allison, Paul D. Fixed effects regression models. Vol. 160. SAGE publications, 2009.
Allison, Paul D. Fixed effects regression methods for longitudinal data using SAS. SAS Institute, 2005.
York, Richard, and Ryan Light. “Directional asymmetry in sociological analyses.” Socius 3 (2017): 1-13.

```
SAS Program

data nlsydiff;
set my.nlsy;
antidiff=anti2-anti1;
selfdiff=self2-self1;
povdiff=pov2-pov1;
proc reg data=nlsydiff;
model antidiff=selfdiff povdiff;
run;
data nlsydiff;
set nlsydiff;
selfpos=selfdiff*(selfdiff>0);
selfneg=-selfdiff*(selfdiff<0);
povpos=povdiff*(povdiff>0);
povneg=-povdiff*(povdiff<0);
proc reg data=nlsydiff;
model antidiff=selfpos selfneg povpos povneg;
test selfpos=-selfneg;
test povpos=-povneg;
run;```

When I teach courses on structural equation modeling (SEM), I tell my students that any model with instrumental variables can be estimated in the SEM framework. Then I present a classic example of simultaneous causation in which X affects Y, and Y also affects X. Models like this can be estimated if each of the two variables also has an instrumental variable—a variable that affects it but not the other variable. Specification of the model is fairly straightforward in any SEM package.

However, there are lots of other uses of instrumental variables. My claim that they can all be estimated with SEM is more speculative than I usually care to admit. A couple weeks ago I got a question about instrumental variables in SEM that really had me stumped. What seemed like the right way to do it was giving the wrong answer—at least not the answer produced by all the standard econometric methods, like two-stage least squares (2SLS) or the generalized method of moments (GMM). When I finally hit on the solution, I figured that others might benefit from what I had learned. Hence, this blog post.

The question came from Steven Utke, an assistant professor of accounting at the University of Connecticut. Steve was trying to replicate a famous study by David Card (1995) that attempted to estimate the causal effect of education on wages by using proximity to college as an instrumental variable. Steve was able to replicate Card’s 2SLS analysis, but he couldn’t get an SEM model to produce similar results.

For didactic purposes, I’m going to greatly simplify Card’s analysis by excluding a lot of variables that are not essential for the example. I’ll use Stata for the analysis, but equivalent SAS code can be found in Addendum 2.

Let’s begin with a bivariate regression of the log of hourly wages on years of education, for a sample of 3,010 young men in 1976.

```. use "https://dl.dropboxusercontent.com/s/bu8ze4db4bywwin/card", clear
. gen lwage = log(wage)
. regress lwage educ

------------------------------------------------------------------------------
lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   .0520942   .0028697    18.15   0.000     .0464674     .057721
_cons |   5.570882   .0388295   143.47   0.000     5.494747    5.647017
------------------------------------------------------------------------------```

Education has a highly significant coefficient, whose magnitude can be interpreted as follows: each additional year of schooling is associated with about a 5% increase in wages.

Of course, the problem with treating this as a causal effect is that there are likely to be many omitted variables that affect both education and wages. We could control for those variables by measuring them and including them in the model (and Card did that for many variables). But there’s no way we can control for all the possible confounding variables, especially because some variables are difficult to measure (e.g., ability). We must therefore conclude that education is correlated with the error term in the regression (a form of endogeneity), and that our regression coefficient is therefore biased to an unknown degree.

Card proposed to solve this problem by introducing proximity to college as an instrumental variable. Specifically, nearc4 was a dummy (indicator) variable for whether or not the person was raised in a local labor market that included a four-year college. As with any instrumental variable, the crucial assumption was that nearc4 would affect years of schooling but would NOT directly affect lwage. Although there may be reasons to doubt that assumption (some discussed by Card), they are not relevant to our discussion here.

Using the ivregress command in Stata, I estimated the instrumental variable model by 2SLS. In case you’re not familiar with that venerable method, it amounts to this: (a) do an OLS regression of educ on nearc4, (b) calculate predicted values from that regression, and (c) regress lwage on those predicted values. The only tricky part is getting the standard errors right. Here are the results:

```. ivregress 2sls lwage (educ=nearc4)

Instrumental variables (2SLS)
------------------------------------------------------------------------------
lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ |   .1880626   .0262826     7.16   0.000     .1365497    .2395756
_cons |   3.767472   .3487458    10.80   0.000     3.083942    4.451001
------------------------------------------------------------------------------

Instrumented:  educ
Instruments:   nearc4

```

Remarkably, the coefficient for education in this regression is more than three times as large as in the bivariate regression and is still highly significant. According to this estimate, each additional year of schooling yields a 21% increase in wages (calculated as 100(exp(.188)-1), a very large effect by any reasonable standard.  (Card got a coefficient of around .14 with other predictors in the model).

Now let’s try to do this with a structural equation model, using Stata’s sem command. As with all SEM software, the default is to do maximum likelihood estimation under the assumption of multivariate normality. The basic idea is to specify a model in which nearc4 affects educ, and educ affects lwage.  But there can be no direct effect of nearc4 on lwage.  Here’s a path diagram for the model:

Here’s the Stata code and the results:

```. sem (lwage <- educ) (educ <- nearc4)

------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
lwage      |
educ |   .0520942   .0028688    18.16   0.000     .0464716    .0577169
_cons |   5.570882   .0388166   143.52   0.000     5.494803    5.646961
-----------+----------------------------------------------------------------
educ       |
nearc4 |    .829019   .1036643     8.00   0.000     .6258406    1.032197
_cons |   12.69801   .0856132   148.32   0.000     12.53022    12.86581```

This can’t be right, however, because the coefficient for educ is identical to what we got in our initial bivariate regression, with an almost identical standard error. Clearly, the inclusion of the regression of educ on nearc4 has accomplished nothing. Can this problem be fixed, and if so, how?

I struggled with this for about an hour before it hit me. The solution is to build into the model the very thing that we are trying to correct for with the instrumental variable. If there are omitted variables that are affecting both educ and lwage, they should produce a correlation between the error term for educ and the error term for lwage. That correlation needs to be included in the SEM model. Here’s a path diagram for the new model:

In Stata, it’s done like this:

`. sem (lwage <-educ) (educ<-nearc4), cov(e.educ*e.lwage)`

The cov option allows a covariance (and therefore a correlation) between the two error terms. Here are the new results:

```------------------------------------------------------------------------------------
|      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
Structural         |
lwage            |
educ |   .1880626   .0262826     7.16   0.000     .1365497    .2395756
_cons |   3.767472   .3487458    10.80   0.000     3.083942    4.451001
-----------------+----------------------------------------------------------------
educ             |
nearc4 |    .829019   .1036643     8.00   0.000     .6258406    1.032197
_cons |   12.69801   .0856132   148.32   0.000     12.53022    12.86581
-------------------+----------------------------------------------------------------```

Now the coefficient and standard error for educ are identical to what we saw earlier with 2SLS. Problem solved! (SEM won’t always give identical results to 2SLS, as I explain in Addendum 1).

Interestingly, the estimated correlation between the error terms (not shown in the table) was -.66. That means that the collective impact of omitted variables is to affect educ and lwage in opposite directions. Whatever raises lwage lowers educ, and vice versa.

It’s hard to imagine what variables would behave like this. And that difficulty raises questions about the plausibility of the model. (Card suggested that measurement error in education could produce a negative correlation, but the degree of error would have to be unreasonably large to produce the result we just saw.) In any case, that’s not really our concern here. It does point out, however, that one of the advantages of the SEM approach is that you get an estimate of the error correlation—something you typically don’t see with 2SLS.

What’s the lesson here?  If you want to use SEM to estimate an instrumental variable model, it’s essential to think carefully about why you need instruments in the first place. You should make sure that the specified model reflects all your beliefs about the causal mechanism. In particular, if you suspect that error terms are correlated with observed variables, those correlations must be built into the model.

If you’d like to learn more about structural equation modeling, check out my SEM short courses, which come in 2-day and 5-day versions. You can find links to past and future offerings here. You might also be interested in Felix Elwert’s 2-day course on Instrumental Variables. His past and future offerings can be found here.

Reference:

Card, David (1995) “Using geographic variation in college proximity to estimate the return to schooling.” Aspects of Labour Economics: Essays in Honour of John Vanderkamp. University of Toronto Press.

The reason 2SLS and SEM produce identical coefficient estimates for this example is that the model is just-identified. That is, the model imposes no restrictions on the variances and covariances of the variables. Other instrumental variable methods like limited information maximum likelihood or GMM also yield identical results in the just-identified situation. For models that are over-identified (for example, if there are two instrumental variables instead of just one), these methods will all yield somewhat different results.

Go to http://davidcard.berkeley.edu/data_sets.html to get the data set as a text file along with a sas program for reading in the data.  Here is the code for the analysis:

```proc reg data=card;
model lwage76 = ed76;
proc syslin 2sls data=card;
endogenous ed76;
instruments nearc4;
model lwage76 = ed76;
proc calis data=card;
path
lwage76 <- ed76,
ed76 <- nearc4;
proc calis data=card;
path
lwage76 <- ed76,
ed76 <- nearc4,
ed76 <-> lwage76; *This specifies a correlation between the error terms;
run;```

Competing risks are common in the analysis of event time data. The classic example is death, with distinctions among different kinds of death: if you die of a heart attack, you can’t then die of cancer or suicide. But examples also abound in other fields. A marriage can end either by divorce or by the death of a spouse, but not both. If a loan is terminated by pre-payment, it is no longer at risk of a default.

In this post, I argue that one of the more popular methods for regression analysis of competing risks—the analysis of subdistribution hazards, introduced by Fine and Gray (1999)—is not valid for causal inference. In fact, it can produce completely misleading results. Instead, I recommend the analysis of cause-specific hazards, a long-standing and easily implemented method.

Let’s review that classic method first. The estimation of cause-specific hazard functions (Kalbfleish and Prentice 2002) can be accomplished with standard methods for single kinds of events. You simply treat all competing events as though the individual were right censored at the time the competing event occurs. For example, if you want to study the effect of obesity on the risk of death due to heart disease, just estimate a Cox proportional hazards model in which all causes of death other than heart disease are treated as censoring. Or if you want to estimate the effect income on divorce, estimate a Cox model in which spousal death is treated as censoring.

By contrast, the method of Fine and Gray (1999) does not treat competing events in the same way as censored observations. Based on cumulative incidence functions, their method estimates a proportional hazards model for something they call the subdistribution hazard.

The definition of the subdistribution hazard is similar to that for a cause-specific hazard, with one key difference: the cause-specific hazard removes an individual from the risk set when any type of event occurs; the subdistribution hazard removes an individual from the risk set when an event of the focal type occurs or when the individual is truly censored. However, when a competing event occurs, the individual remains in the risk set. Fine and Gray acknowledge that this is “unnatural” because, in fact, those who experience competing events are no longer actually at risk of the focal event. But it’s necessary in order to get a model that correctly predicts cumulative incidence functions. (More on that later).

According to Google Scholar, the Fine and Gray paper has been cited more than 5,000 times. It is now widely available in most major software packages, including Stata (with the stcrreg command), SAS (with the EVENTCODE option in PROC PHREG) and R (with the ‘cmprsk’ package). In some fields, it has become the standard method for analyzing competing risks. In the minds of many researchers, it is the only proper way to analyze competing risks.

But there’s one big problem: the subdistribution method doesn’t isolate distinct causal effects on the competing risks. In fact, it confounds them in predictable and alarming ways. Specifically, any variable that increases the cause-specific risk of event A will appear to decrease the subdistribution hazard for event B. Why? Because whenever a type A event occurs, it eliminates the possibility that a type B event will happen.

Here’s a simple simulation that demonstrates this phenomenon. I generated 10,000 observations on each of two event types, labeled A and B. Event times for type A were generated by a Weibull regression model (a parametric version of the proportional hazards model). The only predictor was variable X, which had a standard normal distribution.

Event times for type B were generated by an identical Weibull regression model in which the only predictor was variable Z, also standard normal. X and Z had a correlation of .50.

If event A occurred before event B, the event time for B was not observed. Similarly, if B occurred before A, the event time for A was not observed. Any event times greater than 8 were treated as right censored. SAS code for generating the data and performing the analysis is appended below.

In the resulting data set, there were 4350 type A events, 4277 type B events, and 1391 truly censored observations (neither A nor B was observed). Given the data generating process, any attempt to estimate causal parameters should find no effect of X on B and no effect of Z on A.

In Table 1 below, I show the results from applying the two different methods: cause-specific Cox regression models with competing events treated as censoring; and subdistribution proportional hazards models. For type A, the cause-specific estimate of .494 for the effect of X is close to the true value of .500 and highly significant. The coefficient of -.008 for Z is close to the true value of 0 and far from statistically significant. Overall, pretty good performance.

The subdistribution estimates for type A, on the other hand, are clearly unsatisfactory. The coefficient for X appears to be biased downward by about 10%. The coefficient for Z (-.216) is far from the true value of 0, and highly significant. Thus, the subdistribution estimates would lead one to conclude that an increase in Z reduced the risk of event A. What it actually did is reduce the likelihood that type A events would be observed because it increased the risk of event B.

The results for event B in the lower panel of Table 1 are the mirror image of those for event A. For both X and Z, the cause-specific estimates are close to the true values. The subdistribution estimates are biased, and would lead to incorrect causal inferences.

Table 1. Results from Two Methods for Estimating Competing Risks Models, Non-Informative Censoring.

 Cause-Specific Hazards Subdistribution Hazards Type A Estimate S.E. p Estimate S.E. p x .490 .018 <.0001 .420 .018 <.0001 z -.009 .018 .601 -.214 .017 <.0001 Type B x .021 .018 .248 -.187 .017 <.0001 z .488 .018 <.0001 .433 .018 <.001

So why would anyone seriously consider the subdistribution method? Well, there’s one big problem with the cause-specific hazards approach. Virtually all methods based on cause-specific hazards implicitly assume that censoring is non-informative. Roughly speaking, that means that if an individual is censored at a particular point in time, that fact tells you nothing about the individual’s risk of the event.

If the censoring times are determined by the study design (as when all event times beyond a certain calendar time are censored), that’s not usually an issue. But if censoring times are not under the control of the investigator, the censoring may be informative. And that can lead to bias.

If competing risks are treated as a form of censoring, then we need to be concerned about whether that censoring is informative or non-informative. How might competing events be informative? If a spouse dies, does that tell us anything about the risk that the couple would have divorced? Maybe, but probably not. Does the fact that a person dies of heart disease tell us anything about that person’s risk of dying of cancer? Maybe so. That would definitely be the case if cancer and heart disease had common risk factors, and those factors were not included in the regression model.

Unfortunately, there’s no way to test the assumption that censoring is non-informative (Tsiatis 1975). And even if there were, there’s no good method available for estimating causal effects when censoring is informative.

By contrast, the subdistribution hazard method does not explicitly assume that competing risks are non-informative. And that has been one of its major attractions. However, as I now show, the subdistribution method does no better (and actually somewhat worse) than the cause-specific method when competing events are informative.

In the previous simulation, the assumption of non-informative censoring was satisfied by the data generating process. To model informative censoring, I added an unobserved common risk factor to the regression equations for the two event times. This was simply a normally distributed random variable with a standard deviation of 2. This “random intercept” induced a correlation of .28 between the uncensored event times for type A and type B. I then re-estimated the models, with results shown in Table 2 below.

The message of Table 2 is this: Yes, informative censoring leads to cause-specific estimates that are biased. And unlike in the previous table, the cause-specific estimates might lead one to conclude, incorrectly, that Z affects the risk for event A and that X affects the risk for event B. But the subdistribution estimates are also biased. And, with one minor exception, the biases are worse for the subdistribution method than for the cause-specific method.

Table 2. Results from Two Methods for Estimating Competing Risks Models, Informative Censoring.

 Cause-Specific Hazards Subdistribution Hazards Type A Estimate S.E. p Estimate S.E. p x .347 .019 <.0001 .349 .019 <.0001 z -.067 .019 .0005 -.185 .019 <.0001 Type B x -.102 .019 <.0001 -.212 .019 <.0001 z .372 .019 <.0001 .368 .019 <.0001

To repeat my earlier question: Why would anyone ever seriously consider using the subdistribution method? To be fair to Fine and Gray, they never claimed that subdistribution regression would accurately estimate causal parameters. Instead, they introduced the method as a way to model the impact of covariates on the cumulative incidence functions. These functions are often preferable to Kaplan-Meier estimates of the survivor function because they do a better job of describing the empirical distribution of events rather than some hypothetical distribution that would apply only in the absence of competing risks.

Cumulative incidence functions are particularly useful for prediction. Suppose you have a cohort of newly diagnosed cancer patients. Based on the experience of earlier patients, you want to predict in five years what percentage will have died of cancer, what percentage will have died of other causes, and what percentage will still be alive. Cumulative incidence functions will give you that information. Kaplan-Meier survivor functions will not.

The Fine and Gray method provides a way to introduce covariate information into those predictions, potentially making them more accurate for individual patients. It’s important to note, however, that one can also calculate cumulative incidence functions based on cause-specific hazard functions. Most commercial packages for Cox regression don’t have that capability, but there are downloadable SAS macros that will accomplish the task.

In sum, the subdistribution method may be useful for generating predicted probabilities that individuals will be in particular states at particular times.  But it is not useful for estimating causal effects of covariates on the risks that different kinds of events will occur.  For that task, the analysis of cause-specific hazards is the way to go.  Unfortunately, both methods are vulnerable to competing events that are informative for each other. The only effective way to deal with that problem is to estimate cause-specific hazard models that include common risk factors as covariates.

SAS Code for Simulation

```data finegraytest;
do i=1 to 10000;
*Generate a common risk factor. To include it in the model, change 0 to 1;
comrisk=0*rannor(0);
*Generate x and z, bivariate standard normal with r=.5;
x=rannor(0);
z=.5*x + sqrt(1-.25)*rannor(0);
*Generate w with Weibull distribution depending on x;
logw=2 + -.75*x + 1.5*log(ranexp(0)) + 2*comrisk;
w=exp(logw);
*Generate y with Weibull distribution depending on z;
logy=2 + -.75*z + 1.5*log(ranexp(0)) + 2*comrisk;
y=exp(logy);
*Allow events to censor each other;
if y>w then do;
type=1;
t=w; end;
else if w>y then do;
type=2;
t=y; end;
*Censor all event times at 8;
if t>8 then do;
type=0;
t=8;
end;
output;
end;
run;

proc freq;
table type; run;

/*Estimate cause-specific hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0 2)=x z; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0 1)=x z; run;

/*Estimate subdistributions hazard regressions */
*Model for type1 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=1; run;
*Model for type2 event;
proc phreg data=finegraytest;
model t*type(0)=x z / eventcode=2; run; ```

In my courses and books on longitudinal data analysis, I spend a lot of time talking about the between-within model for fixed effects. I used to call it the hybrid model, but others have convinced me that “between-within” provides a more meaningful description.

Last week my long-time collaborator, Paula England, asked me a question about the between-within model that stumped me at first. When I finally figured out the answer, I realized that it could potentially be important to anyone who uses the between-within method. Hence, this post.

Here is Paula’s project, co-authored with Eman Abdelhadi. They have a large, cross-sectional sample of adult women from approximately 50 countries. They are estimating a logistic regression model for a dichotomous dependent variable: whether or not a woman is employed. One of the independent variables is also dichotomous: whether or not a woman is a Muslim (coded 1 or 0). In order to control for all between-country differences, they estimate a between-within model with the following characteristics:

• a random-intercepts model with countries as clusters.
• a “within” predictor that is the Muslim indicator (dummy variable) minus the country mean of that variable.
• a “between” predictor that is the country mean, i.e., the proportion Muslim.
• other control variables at both the person level and the country level.

The coefficient for the within predictor is very close to what you would get from a classic fixed effects estimator, as estimated by conditional logistic regression. (My 2014 post discusses why they are not identical). So it represents the effect of being Muslim on employment, controlling for all country-level variables, both observed and unobserved.

Here is the question. Can you interpret the coefficient for the between predictor (proportion Muslim) as the effect of living in a country that is more or less Muslim, regardless of whether a person is personally a Muslim? In other words, can you interpret the between coefficient as a contextual effect?

Surprisingly, the answer is no. By construction, the within variable is uncorrelated with the between variable. For that reason, the coefficient for proportion Muslim does not actually control for the person-level effect of being Muslim.

The reason I never thought about this question before is that I primarily use the between-within model for longitudinal data, with repeated measurements clustered within persons. In that setting, the coefficients for the between variables are usually uninteresting or misleading. But when applied to other kinds of clustering, the estimation and interpretation of “between” effects can be quite important.

Fortunately, there’s a simple solution to the problem: Estimate the model using the original Muslim indicator rather than the deviation from its country-level mean. This model is mathematically equivalent to the original between-within model—the coefficients for the Muslim indicator and for all the control variables will be exactly the same. But the coefficient for the proportion Muslim will change, and in a predictable way.  Here is the algebra:

In these equations pij is the probability of employment for person i in country j, xij is the Muslim indicator for person i in country j, and ­x-bar-j is the mean of the Muslim indicator in country j

The first equation is the usual formulation of the between-within model.  The second equation is the algebra that gets us to the third equation. That equation is the new formulation of the model with xij and ­x-bar-j as predictors. Notice that the coefficient of ­x-bar-j in the third equation can be found by subtracting the coefficient for the within effect from the coefficient for the between effect in the first equation.

So you don’t actually have to re-estimate the model to get the contextual effect. You can just take the difference in the coefficients in the standard between-within model.  Furthermore, testing whether those coefficients are the same or different (which is usually done to test fixed vs. random effects) is equivalent to testing for the existence of a contextual effect.

Keep in mind that although the coefficient for ­x-bar-j in the third equation can be interpreted as a contextual effect, it does not control for any unobservables. So it could be biased by the omission of country-level variables or person-level variables that affect employment and are correlated with proportion Muslim. Note, also, that the fact that the person-level variable is dichotomous in this example is completely incidental. The same logic would apply if xij were continuous.

I thought I knew what it meant for data to be missing at random. After all, I’ve written a book titled Missing Data, and I’ve been teaching courses on missing data for more than 15 years. I really ought to know what missing at random means.

But now that I’m in the process of revising that book, I’ve come to the conclusion that missing at random (MAR) is more complicated than I thought. In fact, the MAR assumption has some peculiar features that make me wonder if it can ever be truly satisfied in common situations when more than one variable has missing data.

First, a little background. There are two modern methods for handling missing data that have achieved widespread popularity: maximum likelihood and multiple imputation. As implemented in most software packages, both of these methods depend on the assumption that the data are missing at random.

Here’s how I described the MAR assumption in my book:

Data on Y are said to be missing at random if the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables X and Y, where X always is observed and Y sometimes is missing. MAR means that

Pr(Y missing| Y, X) = Pr(Y missing| X).

In words, this expression means that the conditional probability of missing data on Y, given both Y and X, is equal to the probability of missing data on Y given X alone.

Suppose that Y is body weight and X is gender. Then if women are less likely to disclose their weight, Y can be MAR because the probability of missing Y depends on X, which is observed. But what if both men and women are less likely to disclose their weight if they’re overweight? Then values are not MAR, since the probability Y is missing depends on Y net of X.

That definition works fine when only one variable has missing data. But things get more complicated when two or more variables have missing data. I learned this from Example 1.13 in the classic book by Little and Rubin (2002). That example deals with one of the simplest cases, when there are just two variables, X and Y.  Suppose that both of them have missing data, and the missingness falls into four patterns:

1. X and Y both observed.

2. X observed but Y missing.

3. X missing but Y observed.

4. X and Y both missing.

For each of these patterns, there is a corresponding probability of observing that pattern. The MAR assumption is about how those four probabilities depend on the values of X and Y. Specifically, MAR says that the probability of each pattern may depend on the variables observed in that pattern, but not on variables that are not observed (after conditioning on the observed variables).

Here’s how this plays out:

The probability of getting pattern 1 (both X and Y observed) may depend on the values of both X and Y. It doesn’t have to depend on them, but MAR allows for that dependence.

The probability of pattern 2 (X observed and Y missing) may depend on X, but it can’t depend on Y

The probability of pattern 3 (X missing and Y observed) may depend on Y, but it can’t depend on X.

The probability of pattern 4 (both variables missing) can’t depend on either X or Y

What’s important to stress here is that MAR is a statement about the probabilities of observing particular patterns of missingness. It’s not about the probabilities that individual variables are missing.

Here’s the thing that troubles me about this: except in special cases—like missing completely at random or monotone missing data—it is surprisingly difficult to generate simulated data that would satisfy all four of the above conditions. And my personal rule is that if I can’t simulate it, I don’t really understand it.

There are three approaches to simulation that would seem like natural starting points:

1. Two dichotomous logistic regression models, one in which the probability that Y is missing depends on X, and the other in which the probability that X is missing depends on Y. The problem with that approach is that, by implication, the probability that both variables are missing would depend on the values of both variables. And that’s disallowed by MAR.

2. A multinomial logit model for the four probabilities, with both X and Y as predictors. As noted by Robins and Gill (1997), however, this would violate the MAR conditions for patterns 2, 3, and 4.

3. Specify a separate model for each of the four patterns. That’s problematic because the four probabilities have to sum to 1, and the probability of pattern 4 has to be a constant q. Consequently, the three other probabilities have to sum to 1-q. It’s not obvious how to impose that constraint without violating other MAR conditions, especially when X and Y are continuous. I’ve done it for dichotomous X and Y, but that necessitated some trial and error in order to avoid violating the constraint.

So none of those approaches is satisfactory. However, when I presented this problem to Paul von Hippel, who is co-authoring the revision to Missing Data, he came up with the following algorithm for two variables:

1. Start with a sample of data on two variables X and Y, both fully observed.

2. Randomly divide the sample into three groups. The proportions in each group can be whatever you choose.

3. In Group 1, let the probability that Y is missing be a function of X. A logistic regression model would be a natural choice, but it could be a probit model or something else.

4. In Group 2, let the probability that X is missing be a function of Y. Again, there are many possibilities for this function.

5. In Group 3, set both X and Y to be missing.

That’s it. I’ve verified that it works with a simulation. However, I later learned that it’s been done before. It’s one example of class of algorithms for MAR proposed by Robins and Gill (1997) called Randomized Monotone Missingness models. This class of models can handle any number of variables with missing data. These models also allow the probability of falling into each of the subgroups to depend on other variables that are always observed.

So, yes, it’s possible to simulate MAR when two are more variables are missing, and that’s progress. But is it plausible that this algorithm corresponds to any real-world situations? I’m skeptical. The key question is, why would the mechanism producing missing data be different in different subgroups?

For those of us who regularly use multiple imputation or maximum likelihood to handle missing data, I think we are just going to have to live with the MAR assumption, despite its peculiarities. It has been shown to be the weakest assumption that still implies ignorability—that is, the ability to make inferences without having to model the missing data mechanism. So if you’re not satisfied with MAR, you’ll have to engage in a much more complicated modeling process, one that still involves untestable assumptions.

Before concluding, I should mention that there is one situation where MAR makes sense and can also be easily simulated. That’s the case of a longitudinal study with drop out as the only cause of missing data. In that case, MAR means that drop out can depend on anything that is observed before the drop out. But it can’t depend on anything that would have been observed after the drop out.

For some applications, you might still suspect that drop out depends on what would have been observed immediately after the drop out. But at least MAR in that setting is an assumption that is plausible, understandable, and algorithmically reproducible.

REFERENCES

Little, Roderick J.A., and Donald B. Rubin (2002) Statistical Analysis with Missing Data. Wiley.

Robins, James M., and Richard D. Gill (1997) “Non‐response models for the analysis of non‐monotone ignorable missing data.” Statistics in Medicine 16: 39-56.

In my last post, I explained several reasons why I prefer logistic regression over a linear probability model estimated by ordinary least squares, despite the fact that linear regression is often an excellent approximation and is more easily interpreted by many researchers.

I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the margins command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese).

If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the margins command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.)

I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)

There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The Stata code for estimating the model is

```   webuse nhanes2f, clear
logistic diabetes i.black i.female age

```

The i. in front of black and female tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.

Here are the results:

```------------------------------------------------------------------------------
diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.black |   2.050133   .2599694     5.66   0.000     1.598985     2.62857
1.female |   1.167141   .1100592     1.64   0.101     .9701892    1.404074
age |   1.061269    .003962    15.93   0.000     1.053532    1.069063
_cons |   .0016525    .000392   -27.00   0.000     .0010381    .0026307
------------------------------------------------------------------------------```

A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for non-blacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.

Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes.  Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:

```   margins, dydx(black female age)
```

yielding the output

```------------------------------------------------------------------------------
|            Delta-method
|      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.black |   .0400922   .0087055     4.61   0.000     .0230297    .0571547
1.female |   .0067987   .0041282     1.65   0.100    -.0012924    .0148898
age |   .0026287   .0001869    14.06   0.000     .0022623    .0029951
------------------------------------------------------------------------------```

The dy/dx column gives the estimated change in the probability of diabetes for a 1-unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for non-blacks, on average.

How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: black=1 and black=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like age, the calculation of the AME is a little more complicated.

There’s another, more traditional, way to get marginal effects: for the variable black, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when black=1 and when black=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.

We can get the MEM in Stata with the command:

```   margins, dydx(black female age) atmeans

```

with the result

```------------------------------------------------------------------------------
|            Delta-method
|      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.black |   .0290993   .0066198     4.40   0.000     .0161246    .0420739
1.female |   .0047259   .0028785     1.64   0.101    -.0009158    .0103677
age |   .0018234   .0000877    20.80   0.000     .0016516    .0019953
------------------------------------------------------------------------------```

Here the MEM for black is noticeably smaller than the AME for black. This is a consequence of the inherent non-linearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM).

The non-linearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values.

Here’s how to get the AME for black at four different ages:

```   margins, dydx(black) at(age=(20 35 50 65))

```
```1._at        : age             =          20
2._at        : age             =          35
3._at        : age             =          50
4._at        : age             =          65
------------------------------------------------------------------------------
|            Delta-method
|      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.black      |
_at |
1  |   .0060899   .0016303     3.74   0.000     .0028946    .0092852
2  |   .0144831   .0035013     4.14   0.000     .0076207    .0213455
3  |   .0332459   .0074944     4.44   0.000      .018557    .0479347
4  |   .0704719    .015273     4.61   0.000     .0405374    .1004065
------------------------------------------------------------------------------```

We see that the AME of black at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should not be thought of as an interaction, however, because the model says that the effect on the log-odds (logit) is the same at all ages.

Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the log-odds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used

```   predict yhat
sum yhat```

The predict command generates predicted probabilities and stores them in the variable yhat. The sum command produces descriptive statistics for yhat, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:

```   twoway function y=ln(x/(1-x)), range(.005 .244) xtitle("Probability") ytitle("Log odds")

```

Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:

`   reg diabetes black female age, robust`

I’ve requested robust standard errors to deal with any heteroskedasticity, but the t-statistics are about the same even without this option.

```------------------------------------------------------------------------------
|               Robust
diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
black |   .0383148    .008328     4.60   0.000     .0219902    .0546393
female |   .0068601   .0041382     1.66   0.097    -.0012515    .0149716
age |   .0021575   .0001214    17.76   0.000     .0019194    .0023955
_cons |  -.0619671   .0049469   -12.53   0.000    -.0716639   -.0522703
------------------------------------------------------------------------------```

The t-statistics here are pretty close to the z-statistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of black on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.

The upshot is that combining logistic regression with the margins command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probability-based estimates vary under different conditions.