Last week my long-time collaborator, Paula England, asked me a question about the between-within model that stumped me at first. When I finally figured out the answer, I realized that it could potentially be important to anyone who uses the between-within method. Hence, this post.

Here is Paula’s project, co-authored with Eman Abdelhadi. They have a large, cross-sectional sample of adult women from approximately 50 countries. They are estimating a logistic regression model for a dichotomous dependent variable: whether or not a woman is employed. One of the independent variables is also dichotomous: whether or not a woman is a Muslim (coded 1 or 0). In order to control for all between-country differences, they estimate a between-within model with the following characteristics:

- a random-intercepts model with countries as clusters.
- a “within” predictor that is the Muslim indicator (dummy variable) minus the country mean of that variable.
- a “between” predictor that is the country mean, i.e., the proportion Muslim.
- other control variables at both the person level and the country level.

The coefficient for the within predictor is very close to what you would get from a classic fixed effects estimator, as estimated by conditional logistic regression. (My 2014 post discusses why they are not identical). So it represents the effect of being Muslim on employment, controlling for all country-level variables, both observed and unobserved.

Here is the question. Can you interpret the coefficient for the between predictor (proportion Muslim) as the effect of living in a country that is more or less Muslim, regardless of whether a person is personally a Muslim? In other words, can you interpret the between coefficient as a contextual effect?

Surprisingly, the answer is no. By construction, the within variable is uncorrelated with the between variable. For that reason, the coefficient for proportion Muslim does not actually control for the person-level effect of being Muslim.

The reason I never thought about this question before is that I primarily use the between-within model for longitudinal data, with repeated measurements clustered within persons. In that setting, the coefficients for the between variables are usually uninteresting or misleading. But when applied to other kinds of clustering, the estimation and interpretation of “between” effects can be quite important.

Fortunately, there’s a simple solution to the problem: Estimate the model using the original Muslim indicator rather than the deviation from its country-level mean. This model is mathematically equivalent to the original between-within model—the coefficients for the Muslim indicator and for all the control variables will be exactly the same. But the coefficient for the proportion Muslim will change, and in a predictable way. Here is the algebra:

In these equations *p _{ij}* is the probability of employment for person

The first equation is the usual formulation of the between-within model. The second equation is the algebra that gets us to the third equation. That equation is the new formulation of the model with *x _{ij}* and

So you don’t actually have to re-estimate the model to get the contextual effect. You can just take the difference in the coefficients in the standard between-within model. Furthermore, testing whether those coefficients are the same or different (which is usually done to test fixed vs. random effects) is equivalent to testing for the existence of a contextual effect.

Keep in mind that although the coefficient for *x*-bar-*j* in the third equation *can* be interpreted as a contextual effect, it does not control for any unobservables. So it could be biased by the omission of country-level variables or person-level variables that affect employment and are correlated with proportion Muslim. Note, also, that the fact that the person-level variable is dichotomous in this example is completely incidental. The same logic would apply if *x _{ij}* were continuous.

But now that I’m in the process of revising that book, I’ve come to the conclusion that missing at random (MAR) is more complicated than I thought. In fact, the MAR assumption has some peculiar features that make me wonder if it can ever be truly satisfied in common situations when more than one variable has missing data.

First, a little background. There are two modern methods for handling missing data that have achieved widespread popularity: maximum likelihood and multiple imputation. As implemented in most software packages, both of these methods depend on the assumption that the data are missing at random.

Here’s how I described the MAR assumption in my book:

Data on *Y* are said to be missing at random if the probability of missing data on *Y* is unrelated to the value of *Y*, after controlling for other variables in the analysis. To express this more formally, suppose there are only two variables *X* and *Y*, where *X* always is observed and *Y* sometimes is missing. MAR means that

Pr(*Y* missing| *Y*, *X*) = Pr(*Y* missing| *X*).

In words, this expression means that the conditional probability of missing data on *Y*, given both *Y* and *X*, is equal to the probability of missing data on *Y* given *X* alone.

Suppose that *Y* is body weight and *X* is gender. Then if women are less likely to disclose their weight, *Y* can be MAR because the probability of missing *Y* depends on *X*, which is observed. But what if both men and women are less likely to disclose their weight *if they’re overweight*? Then values are not MAR, since the probability *Y* is missing depends on *Y* net of *X*.

That definition works fine when only one variable has missing data. But things get more complicated when two or more variables have missing data. I learned this from Example 1.13 in the classic book by Little and Rubin (2002). That example deals with one of the simplest cases, when there are just two variables, *X *and *Y*. Suppose that both of them have missing data, and the missingness falls into four patterns:

1.* X* and *Y* both observed.

2.* X* observed but *Y* missing.

3.* X* missing but *Y* observed.

4.* X* and *Y* both missing.

For each of these patterns, there is a corresponding probability of observing that pattern. The MAR assumption is about how those four probabilities depend on the values of *X* and *Y*. Specifically, MAR says that the probability of each pattern may depend on the variables observed in that pattern, but not on variables that are not observed (after conditioning on the observed variables).

Here’s how this plays out:

The probability of getting pattern 1 (both *X* and *Y* observed) may depend on the values of both *X* and *Y*. It doesn’t *have* to depend on them, but MAR allows for that dependence.

The probability of pattern 2 (*X* observed and *Y* missing) may depend on *X*, but it *can’t* depend on *Y*.

The probability of pattern 3 (*X *missing and *Y* observed) may depend on *Y*, but it can’t depend on *X*.

The probability of pattern 4 (both variables missing) can’t depend on either *X* or *Y*.

What’s important to stress here is that MAR is a statement about the probabilities of observing particular *patterns* of missingness. It’s not about the probabilities that individual variables are missing.

Here’s the thing that troubles me about this: except in special cases—like missing completely at random or monotone missing data—it is surprisingly difficult to generate simulated data that would satisfy all four of the above conditions. And my personal rule is that if I can’t simulate it, I don’t really understand it.

There are three approaches to simulation that would seem like natural starting points:

1. Two dichotomous logistic regression models, one in which the probability that *Y* is missing depends on *X*, and the other in which the probability that *X* is missing depends on *Y*. The problem with that approach is that, by implication, the probability that both variables are missing would depend on the values of both variables. And that’s disallowed by MAR.

2. A multinomial logit model for the four probabilities, with both *X* and *Y* as predictors. As noted by Robins and Gill (1997), however, this would violate the MAR conditions for patterns 2, 3, and 4.

3. Specify a separate model for each of the four patterns. That’s problematic because the four probabilities have to sum to 1, and the probability of pattern 4 has to be a constant *q*. Consequently, the three other probabilities have to sum to 1-*q*. It’s not obvious how to impose that constraint without violating other MAR conditions, especially when *X* and *Y* are continuous. I’ve done it for dichotomous *X* and *Y*, but that necessitated some trial and error in order to avoid violating the constraint.

So none of those approaches is satisfactory. However, when I presented this problem to Paul von Hippel, who is co-authoring the revision to *Missing Data*, he came up with the following algorithm for two variables:

1. Start with a sample of data on two variables *X* and *Y*, both fully observed.

2. Randomly divide the sample into three groups. The proportions in each group can be whatever you choose.

3. In Group 1, let the probability that *Y* is missing be a function of *X*. A logistic regression model would be a natural choice, but it could be a probit model or something else.

4. In Group 2, let the probability that *X* is missing be a function of *Y*. Again, there are many possibilities for this function.

5. In Group 3, set both *X* and *Y* to be missing.

That’s it. I’ve verified that it works with a simulation. However, I later learned that it’s been done before. It’s one example of class of algorithms for MAR proposed by Robins and Gill (1997) called Randomized Monotone Missingness models. This class of models can handle any number of variables with missing data. These models also allow the probability of falling into each of the subgroups to depend on other variables that are always observed.

So, yes, it’s possible to simulate MAR when two are more variables are missing, and that’s progress. But is it plausible that this algorithm corresponds to any real-world situations? I’m skeptical. The key question is, why would the mechanism producing missing data be different in different subgroups?

For those of us who regularly use multiple imputation or maximum likelihood to handle missing data, I think we are just going to have to live with the MAR assumption, despite its peculiarities. It has been shown to be the *weakest* assumption that still implies ignorability—that is, the ability to make inferences without having to model the missing data mechanism. So if you’re not satisfied with MAR, you’ll have to engage in a much more complicated modeling process, one that still involves untestable assumptions.

Before concluding, I should mention that there is one situation where MAR makes sense and can also be easily simulated. That’s the case of a longitudinal study with drop out as the only cause of missing data. In that case, MAR means that drop out can depend on anything that is observed before the drop out. But it can’t depend on anything that would have been observed after the drop out.

For some applications, you might still suspect that drop out depends on what would have been observed immediately after the drop out. But at least MAR in that setting is an assumption that is plausible, understandable, and algorithmically reproducible.

REFERENCES

Little, Roderick J.A., and Donald B. Rubin (2002) *Statistical Analysis with Missing Data*. Wiley.

Robins, James M., and Richard D. Gill (1997) “Non‐response models for the analysis of non‐monotone ignorable missing data.” *Statistics in Medicine* 16: 39-56.

I addressed the issue of interpretability by arguing that odds ratios are, in fact, reasonably interpretable. But even if you hate odds ratios, you can still get estimates from a logistic regression that have a probability interpretation. In this post, I show how easy it is to do just that using the **margins** command in Stata. (It’s even easier with the SPost13 package for Stata by Long & Freese).

If you want to do this sort of thing in R, there’s a package that just came out called margins. It’s modeled after the **margins** command in Stata but doesn’t have all its capabilities. For SAS users, there is an official SAS document on methods for calculating marginal effects. (This is reasonably easy for PROC QLIM but takes a discouraging amount of programming for PROC LOGISTIC.)

I’ll continue using the NHANES data set that I used in the last post. As previously noted, that data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session. (I am indebted to Richard Williams for this example. You can get much more detail from his PowerPoint presentation available here.)

There are 10,335 cases with complete data on the variables of interest. I first estimated a logistic regression model with **diabetes** (coded 1 or 0) as the dependent variable. Predictors are **age** (in years) and two dummy (indicator) variables, **black** and **female**. The Stata code for estimating the model is

webuse nhanes2f, clear logistic diabetes i.black i.female age

The **i.** in front of **black** and **female** tells Stata to treat these as categorical (factor) variables. Ordinarily, this wouldn’t be necessary for binary predictors. But it’s important for getting Stata to calculate marginal effects in an optimal way.

Here are the results:

------------------------------------------------------------------------------ diabetes | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.black | 2.050133 .2599694 5.66 0.000 1.598985 2.62857 1.female | 1.167141 .1100592 1.64 0.101 .9701892 1.404074 age | 1.061269 .003962 15.93 0.000 1.053532 1.069063 _cons | .0016525 .000392 -27.00 0.000 .0010381 .0026307 ------------------------------------------------------------------------------

A typical interpretation would be to say that the odds of diabetes is twice as large for blacks as for non-blacks, an “effect” that is highly significant. Females have a 17% higher odds of diabetes than males, but that’s not statistically significant. Finally, each additional year of age is associated with a 6% increase in the odds of diabetes, a highly significant effect.

Now suppose we want to estimate each variable’s effect on the probability rather than odds of diabetes. Here’s how to get what’s called the Average Marginal Effect (AME) for each of the predictors:

margins, dydx(black female age)

yielding the output

------------------------------------------------------------------------------ | Delta-method | dy/dx Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.black | .0400922 .0087055 4.61 0.000 .0230297 .0571547 1.female | .0067987 .0041282 1.65 0.100 -.0012924 .0148898 age | .0026287 .0001869 14.06 0.000 .0022623 .0029951 ------------------------------------------------------------------------------

The dy/dx column gives the estimated change in the probability of diabetes for a 1-unit increase in each predictor. So, for example, the predicted probability of diabetes for blacks is .04 higher than for non-blacks, on average.

How is this calculated? For each individual, the predicted probability is calculated in the usual way for logistic regression, but under two different conditions: **black**=1 and **black**=0. All other predictor variables are held at their observed values for that person. For each person, the difference between those two probabilities is calculated, and then averaged over all persons. For a quantitative variable like **age**, the calculation of the AME is a little more complicated.

There’s another, more traditional, way to get marginal effects: for the variable **black**, hold the other two predictors at their means. Then calculate the difference between the predicted probabilities when **black**=1 and when **black**=0. This is called the Marginal Effect at the Means (MEM). Note that calculation of the MEM requires only the model estimates, while the AME requires operating on the individual observations.

We can get the MEM in Stata with the command:

margins, dydx(black female age) atmeans

with the result

------------------------------------------------------------------------------ | Delta-method | dy/dx Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.black | .0290993 .0066198 4.40 0.000 .0161246 .0420739 1.female | .0047259 .0028785 1.64 0.101 -.0009158 .0103677 age | .0018234 .0000877 20.80 0.000 .0016516 .0019953 ------------------------------------------------------------------------------

Here the MEM for **black** is noticeably smaller than the AME for **black**. This is a consequence of the inherent non-linearity of the model, specifically, the fact that averaging on the probability scale (the AME) does not generally produce the same results as averaging on the logit scale (the MEM).

The non-linearity of the logistic model suggests that we should ideally be looking at marginal effects under different conditions. Williams (and others) refers to this as Marginal Effects at Representative Values.

Here’s how to get the AME for **black** at four different ages:

margins, dydx(black) at(age=(20 35 50 65))

1._at : age = 20 2._at : age = 35 3._at : age = 50 4._at : age = 65 ------------------------------------------------------------------------------ | Delta-method | dy/dx Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 1.black | _at | 1 | .0060899 .0016303 3.74 0.000 .0028946 .0092852 2 | .0144831 .0035013 4.14 0.000 .0076207 .0213455 3 | .0332459 .0074944 4.44 0.000 .018557 .0479347 4 | .0704719 .015273 4.61 0.000 .0405374 .1004065 ------------------------------------------------------------------------------

We see that the AME of **black** at age 20 is quite small (.006) but becomes rather large (.070) by age 65. This should *not* be thought of as an interaction, however, because the model says that the effect on the log-odds (logit) is the same at all ages.

Would it make sense to apply the linear probability model to these data? Von Hippel advises us to plot the log-odds versus the range of predicted probabilities and check whether the graph looks approximately linear. To get the predicted probabilities and their range, I used

predict yhat sum yhat

The **predict** command generates predicted probabilities and stores them in the variable **yhat**. The **sum** command produces descriptive statistics for **yhat**, including the minimum and maximum values. In this case, the minimum was .005 and the maximum was .244. I then used von Hippel’s suggested Stata command to produce the graph:

twoway function y=ln(x/(1-x)), range(.005 .244) xtitle("Probability") ytitle("Log odds")

Clearly, there is substantial departure from linearity. But let’s go ahead and estimate the linear model anyway:

reg diabetes black female age, robust

I’ve requested robust standard errors to deal with any heteroskedasticity, but the *t*-statistics are about the same even without this option.

------------------------------------------------------------------------------ | Robust diabetes | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- black | .0383148 .008328 4.60 0.000 .0219902 .0546393 female | .0068601 .0041382 1.66 0.097 -.0012515 .0149716 age | .0021575 .0001214 17.76 0.000 .0019194 .0023955 _cons | -.0619671 .0049469 -12.53 0.000 -.0716639 -.0522703 ------------------------------------------------------------------------------

The *t*-statistics here are pretty close to the *z*-statistics for the logistic regression, and the coefficients themselves are quite similar to the AMEs shown above. But this model will not allow you to see how the effect of **black** on the probability of diabetes differs by age. For that, you’d have to explicitly build the interaction of those variables into the model, as I did in my previous post.

The upshot is that combining logistic regression with the **margins** command gives you the best of both worlds. You get a model that is likely to be a more accurate description of the world than a linear regression model. It will always produce predicted probabilities within the allowable range, and its parameters will tend to be more stable over varying conditions. On the other hand, you can also get numerical estimates that are interpretable as probabilities. And you can see how those probability-based estimates vary under different conditions.

- Within the range of .20 to .80, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
- People understand changes in probabilities much better than they understand odds ratios.
- OLS regression is much faster than linear regression.

I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.

*Speed*. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.

Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with *k*-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the *Invariance* section below.

*Predicted probabilities*. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.

In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.

*Interpretability*. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.

In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.

Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.

A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.

Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its **margins** command, which I will demonstrate in my next post.

*Invariance*. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.

Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.

This issue also arises for categorical predictors. Consider a dichotomous *y* and a single dichotomous predictor *x*. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for *y* on *x*.

One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Study (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.

There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with **diabetes** (coded 1 or 0) as the dependent variable. Predictors are **age** (in years) and two dummy (indicator) variables, **black** and **female**. The model also includes the interaction of **black **and **age**. The Stata code for estimating the model is

webuse nhanes2f, clear

logistic diabetes black female age black#c.age

with the following results:

------------------------------------------------------------------------------ diabetes | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- black | 3.318733 1.825189 2.18 0.029 1.129381 9.752231 female | 1.165212 .1098623 1.62 0.105 .9686107 1.401718 age | 1.063009 .0044723 14.52 0.000 1.054279 1.071811 black#c.age | .9918406 .0090871 -0.89 0.371 .9741892 1.009812 _cons | .0014978 .0003971 -24.53 0.000 .0008909 .0025183 ------------------------------------------------------------------------------

Given the high *p*-value for the interaction (.371), there is clearly no evidence here that the effect of **black** varies with **age**. Now let’s estimate the same model as a linear regression:

reg diabetes black female age black#c.age

The new results are:

------------------------------------------------------------------------------ diabetes | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- black | -.0215031 .0191527 -1.12 0.262 -.0590461 .0160399 female | .0069338 .004152 1.67 0.095 -.0012049 .0150725 age | .0020176 .0001276 15.82 0.000 .0017675 .0022676 black#c.age | .0012962 .0003883 3.34 0.001 .0005351 .0020573 _cons | -.0553238 .0068085 -8.13 0.000 -.0686697 -.0419779 ------------------------------------------------------------------------------

Now we have strong evidence for an interaction, specifically that the effect of **black** is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase.

Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.

Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability does not? Absolutely. But I believe that that’s a far less likely outcome. I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.

In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.

In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the **margins** command in Stata. I’ll also provide links for how to do it in SAS and R.

As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.

__Figure 1__. The relationship between probability and log odds over the range of probabilities from .2 to .8.

This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.

For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2^{nd} grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.

__Figure 2__. The relationship between probability and log odds over the range of probabilities from .09 to .13.

The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.

I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are *still* situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).

I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.[1]

__Figure 3__. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.

On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.

To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.

The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:

**twoway function y=ln(x/(1-x)), range(.2 .8) xtitle(“Probability”) ytitle(“Log odds”)**

I plotted Figure 2 using the same command, only changing the range to (.09 .13).

In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.

*Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.*

Angrist, J. D., & Pischke, J.-S. (2008). *Mostly Harmless Econometrics: An Empiricist’s Companion* (1st ed.). Princeton University Press.

Long, J. S. (1997). *Regression Models for Categorical and Limited Dependent Variables* (1st ed.). Sage Publications, Inc.

Pischke, J.-S. (2012, July 9). Probit better than LPM? Retrieved from http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/

von Hippel, P.T. & Workman, J. (2016). From Kindergarten Through Second Grade, U.S. Children’s Obesity Prevalence Grows Only During Summer Vacations. Obesity Volume 24, Issue 11, pages 2296–2300. http://onlinelibrary.wiley.com/doi/10.1002/oby.21613/full

[1] The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed *n*/(*n*-1) where *n* is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.

Mediation is about the mechanisms or pathways by which some treatment or exposure affects an outcome. Questions about mediation arise with considerable frequency in the biomedical and social sciences. Various approaches have been used in the social sciences for decades, especially in psychology. However, the new literature on mediation from the perspective of causal inference has helped clarify the assumptions that are being made by these analyses. It has also generalized the approach to a much broader class of settings.

The course will cover many of these new advances. We will also talk about the relationships between the newer methods and the approaches that have been used in the biomedical and social sciences.

I have published a book on this topic with Oxford University Press:

The course will cover a lot of the material in Chapters 1-8 of the book, and the book will also serve as a useful reference for the course. Both the book and the course are based on a longer course on mediation and interaction that I have taught at Harvard for several years. I have managed to distill the essential content of that course into material that can be covered in just two days, so the course will provide a fairly comprehensive overview in a relatively short period of time.

The course will also provide an introduction to concepts in causal inference and so might be a fairly accessible way into that literature. The prerequistes for the course are fairly minimal, however, so I’ll essentially just be assuming some familiarity with linear and logistic regression. I have worked hard to make the course as accessible as possible, and have now taught and refined the material numerous times.

The rough outline of the course and the broad topics we will cover are as follows:

- Concepts and Methods for Mediation
- Sensitivity Analysis
- Mediation with Time-to-Event Outcome
- Multiple Mediators
- Surrogate Outcomes
- A Unification of Mediation and Interaction

More specifically, the course will cover the relationship between traditional methods for mediation in the biomedical and the social sciences and new methods in causal inference. For dichotomous, continuous, and time-to-event outcomes, we will discuss when the standard approaches to mediation analysis are valid or not valid, and we’ll extend these to more complex settings.

The no-confounding assumptions needed for these techniques will be described. SAS, SPSS and Stata macros to implement these techniques will be explained and distributed to course participants. The use and implementation of sensitivity analysis techniques to assess the how sensitive conclusions are to violations of assumptions will be covered, along with extensions to multiple mediators.

If you have any questions about the course feel free to drop me a note (tvanderw@hsph.harvard.edu) and I would be happy to try to provide further information. In any case, I would of course be happy to have you there.

]]>I love Stata and always jump at the chance to teach it. But I was honestly a bit terrified. I had never been outside North America and Western Europe and don’t really consider myself much of a traveler. What would Dhaka be like? I had no real reference point for trying to predict the experience. Even as I was boarding the flight from Boston to Dubai (the second leg of a 28-hour trip), there was a part of me still asking, “What have you gotten yourself into?”

Fortunately my love of teaching Stata was greater than my anxiety because the week I spent in Dhaka ended up being one of the most satisfying of my professional life.

Most of the students I taught were participants in the CDC’s Field Epidemiology Training Program (FETP). This worldwide program takes government doctors who’ve been doing clinical work and trains them to be field epidemiologists. All of the doctors had received basic biostatistics training before I arrived. So my primary role was to teach them how to use Stata to implement the techniques they’d already learned in class.

The most surprising aspect of this experience was how much we ended up focusing on data cleaning. In most of my career, I’ve used very well-behaved data sets like the General Social Survey. In Dhaka, most of the students were dealing with what initially seemed to me very disorganized spreadsheets. Their data had all the numbers stored as text, inconsistent capitalization, and in many cases were half “long” format and half “wide” format.

I found it amazingly satisfying to teach these doctors how to take messy, complicated epidemiological surveillance data and turn them into something they could really use. There were one or two students who had been unable to make any progress on their projects for months because of data management difficulties. I had never met data management problems that complicated before. One particular reshape had me pulling my hair out and took me almost two hours to figure out. But I have rarely—maybe never—been as satisfied in front of a computer screen as the moment I finally got it.

Over the five days of the course, we covered a great deal, from t-tests, to logistic regression, to complex survey analysis. I especially enjoyed teaching these eager, energetic young doctors because they knew *exactly* how they were going to use these skills to make a difference in their country.

Fortunately I also had a little time to experience Dhaka. I got to ride a rickshaw, take a boat on the Ganges, walk down Hindu Street, and see buildings built by the East India Company. Just walking down the streets of Dhaka was one of the most incredible experiences of my life. The sights and sounds were beyond anything I had ever experienced. As I packed up my laptop on the last day of class, with the sound of the call to prayer in the background, my thoughts were no longer “what have I gotten myself into?” but instead “is this *really* over?”

Boarding the flight to return home, I was so grateful that my love of Stata had taken me to such an unexpected place. I hope I made as much of an impact on my students as they and their country made on me.

*Stephen Vaisey is an Associate Professor of Sociology at Duke University. For Statistical Horizons, he regularly teaches a seminar on Treatment Effects Analysis.*

Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.

Let’s start by comparing the two models explicitly. If the outcome* Y* is a dichotomy with values 1 and 0, define *p* = *E*(*Y*|*X*), which is just the probability that *Y* is 1, given some value of the regressors *X*. Then the linear and logistic probability models are:

*p = a*_{0 }*+ a*_{1}*X*_{1} + *a*_{2}*X*_{2} + … + *a _{k}X_{k}*

ln[*p*/(1-*p*)] = *b*_{0 }*+ b*_{1}*X*_{1} + *b*_{2}*X*_{2} + … + *b _{k}X_{k}*

The linear model assumes that the probability *p* is a linear function of the regressors, while the logistic model assumes that the natural log of the odds *p*/(1-*p*) is a linear function of the regressors.

The major advantage of the linear model is its interpretability. In the linear model, if *a*_{1} is (say) .05, that means that a one-unit increase in *X*_{1} is associated with a 5 percentage point increase in the probability that *Y* is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.

The logistic model is less interpretable. In the logistic model, if *b*_{1} is .05, that means that a one-unit increase in *X*_{1} is associated with a .05 increase in the log odds that *Y* is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.

Because the log odds scale is so hard to interpret, it is common to report logistic regression results as *odds ratios*. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:

*p*/(1-*p*) = *d*_{0} × (*d*_{1})^{X1}_{ }× (*d*_{2})^{X2} × … × (*d _{k}*)

On the left side we have the odds and on the right side we have a product involving the odds ratios *d*_{1 }= exp(*b*_{1}), *d*_{2}* = *exp(*b*_{2}), etc.

Odds ratios seem like they should be intuitive. If *d*_{1} = 2, for example, that means that a one-unit increase in *X*_{1 }doubles the odds that *Y* is 1. That sounds like something we should understand.

But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1-p)—and so does the odds ratio.

Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a get-out-the-vote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.

If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.

Here’s a table that shows what doubling the odds does to various initial probabilities:

Before doubling |
After doubling |
||

Probability |
Odds |
Odds |
Probability |

10% |
0.11 |
0.22 |
18% |

20% |
0.25 |
0.50 |
33% |

30% |
0.43 |
0.86 |
46% |

40% |
0.67 |
1.33 |
57% |

50% |
1.00 |
2.00 |
67% |

60% |
1.50 |
3.00 |
75% |

70% |
2.33 |
4.67 |
82% |

80% |
4.00 |
8.00 |
89% |

90% |
9.00 |
18.0 |
95% |

It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If *p* is close to 0, then doubling the odds is approximately the same as doubling *p*. If *p* is close to 1, then doubling the odds is approximately the same as halving 1-*p*. But if *p* is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.

That’s why I’m not crazy about odds ratios.

The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).

For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of *X*, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability (cf. Long 1997).

It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.

When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model.

These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.

Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).

Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[*p*/(1-*p*)] are undefined when *p* is equal to 0 or 1. When *p* gets close to 0 or 1 logistic regression can suffer from complete separation, quasi-complete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic.

Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderate-sized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]

The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroskedastic with residual variance *p*(1-*p*), but the heteroscedasticity is minor if *p* is between .20 and .80, which is the situation where I recommend using the linear probability model at all. OLS estimates can be improved by using heteroscedasticity-consistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.

*Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science. *

Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. *Quality & Quantity*, *43*(1), 59–74. http://doi.org/10.1007/s11135-007-9077-3

King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. *Political Analysis*, *9*(2), 137–163. http://doi.org/10.2307/25791637

Long, J. S. (1997) *Regression Models for Categorical and Limited Dependent Variables* (1st ed.). Sage Publications, Inc.

[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.

[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.

]]>This can work well for some kinds of models, but not for mixed models, otherwise known as a random effects models or multilevel models. Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as *dynamic panel data* models. Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward. Let *y _{it}* be the value of the dependent variable for individual

*y _{it}* =

The random intercept *u _{i}* represents the combined effect on

That’s where the problem lies. Because the model applies to all time points, *u _{i}* has a direct effect on

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the **wages** data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the **xtreg** command, let’s first estimate a random intercepts model for **lwage **(log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: **ed** (years of education) and **fem** (1 for female, 0 for male).

Here’s the Stata code:

**use “https://statisticalhorizons.com/wp-content/uploads/wages.dta”, clear**** xtset id t**** xtreg lwage L.lwage ed fem t**

The **xtset** command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable **id** and a time variable **t **that ranges from 1 to 7. The **xtreg** command fits a random-intercepts model by default, with **lwage** as the dependent variable and the subsequent four variables as predictors. **L.lwage** specifies the one-year lag of **lwage**.

Here’s the output:

------------------------------------------------------------------------------

lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

lwage |

L1. | .8747517 .0085886 101.85 0.000 .8579183 .8915851

ed | .0108335 .0011933 9.08 0.000 .0084947 .0131724

fem | -.06705 .010187 -6.58 0.000 -.0870162 -.0470839

t | .0071965 .0019309 3.73 0.000 .0034119 .0109811

_cons | .7624068 .0491383 15.52 0.000 .6660974 .8587161

-------------+----------------------------------------------------------------

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males. Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983) which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called **xtdpdqml**. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply **xtdpdqml** to the wage data:

**xtset id t****xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)**

The **re** option specifies a random effects (random intercepts) model. By default, the command includes the lag-1 dependent variable as a predictor. The **initval** option sets the starting values for the four variance parameters that are part of the model. Here is the output:

------------------------------------------------------------------------------

lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

lwage |

L1. | .4142827 .0230843 17.95 0.000 .3690383 .459527

ed | .0403258 .0031841 12.66 0.000 .0340851 .0465666

fem | -.2852665 .0271688 -10.50 0.000 -.3385164 -.2320166

t | .0533413 .0027533 19.37 0.000 .0479449 .0587378

_cons | 3.25368 .1304816 24.94 0.000 2.99794 3.509419

------------------------------------------------------------------------------

Results are markedly different from those produced above by **xtreg**. The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Blacks now make 8 percent less than non-blacks rather than 1 percent less. And females make 24 percent less (calculated as 100(exp(-.28)-1) than males compared to 6 percent less. The annual increase in wages is 5 percent instead of 1 percent.

So doing it right can make a big difference. Unfortunately, **xtdpdqml** has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, **xtdpdml**, that removes many of these limitations. (Note that the only difference in the names for the two commands is the **q** in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with **xtdpdml**, use

**xtset id t**** xtdpdml lwage, inv(ed fem blk) errorinv**

The **inv** option is for time-invariant variables. The **errorinv** option forces the error variance to be the same at all points in time. Like **xtdpdqml**, this command automatically includes a 1-time unit lag of the dependent variable. Unlike **xtdpdqml**, **xtdpdml** can include longer lags and/or multiple lags.

Here is the output:

------------------------------------------------------------------------------

| Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

lwage2 |

lwage1 | .4088803 .0229742 17.80 0.000 .3638517 .453909

ed | .0406719 .0032025 12.70 0.000 .0343951 .0469486

fem | -.2878266 .027345 -10.53 0.000 -.3414218 -.2342315

------------------------------------------------------------------------------

Results are very similar to those for **xtdpdqml**. They are slightly different because **xtdpdml **always treats time as a categorical variable, but time was a quantitative variable in the earlier model for **xtdpdqml.**

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper. As a matter of fact, the **xtdpdml** command is just a front-end to the **sem** command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself. That’s why we wrote the command.

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both **xtdpdqml** and **xtdpdml **can handle this situation also.

If you’d like to learn more about dynamic panel data models, check out my 2-day course on Longitudinal Data Analysis Using SEM. It will be offered again October 16-17, 2015, in Los Angeles.

__References__

Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” *Econometrica* 51 (6): 1635-1659.

Rabe-Hesketh, Sophia, and Anders Skrondal (2012) *Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses*. Third Edition. StataCorp LP.

- ML is simpler to implement (if you have the right software).
- Unlike multiple imputation, ML has no potential incompatibility between an imputation model and an analysis model.
- ML produces a deterministic result rather than a different result every time.

Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.

What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.

With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.

So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a *t*-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?

Wrong! In a paper that will be published soon in *Structural Equation Modeling*, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.

What about confidence intervals and *p*-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a *t*-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this *t-*distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.

Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.

Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for non-linear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.

MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that data that are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.

If you want to learn more about both multiple imputation and maximum likelihood, check out my two-day course on Missing Data that will be offered this fall. Dates and location have not yet been set.

]]>