In a recent guest blog, Paul von Hippel extended his earlier argument that there are many situations in which a linear probability model (estimated via ordinary least squares) is preferable to a logistic regression model. In his two posts, von Hippel makes three major points:

• Within the range of .20 to .80, the linear probability model is an extremely close approximation to the logistic model. Even outside that range, if the range is narrow, the linear probability model may do well.
• People understand changes in probabilities much better than they understand odds ratios.
• OLS regression is much faster than linear regression.

I don’t disagree with any of these points. Nevertheless, I still prefer logistic regression in the vast majority of applications. In my April 2015 post, I discussed some of the features of logistic regression that make it more attractive than other non-linear alternatives, like probit or complementary log-log. But I didn’t compare logistic to the linear probability model. So here’s my take on von Hippel’s arguments, along with some additional reasons why I like logistic regression better.

Speed. Linear regression by least squares is, indeed, faster than maximum likelihood estimation of logistic regression. Given the capabilities of today’s computers, however, that difference is hardly noticeable for estimating a single binary logistic regression, even with a million or more observations. As von Hippel notes, the difference really starts to matter when you’re estimating a model with random effects, with fixed effects, or with spatial or longitudinal correlation.

Speed can also matter if you’re doing bootstrapping, or multiple imputation, or if you’re using some sort of intensive variable selection method on a large pool of variables, especially when combined with k-fold cross-validation. In those kinds of applications, preliminary work with linear regression can be very useful. One danger, however, is that linear regression may find interactions (or other nonlinearities) that wouldn’t be needed in a logistic model. See the Invariance section below.

Predicted probabilities. Even if you really dislike odds ratios, the logit model has a well-known advantage with respect to predicted probabilities. As von Hippel reminds us, when you estimate a linear regression with a 1-0 outcome, the predicted values can be greater than 1 or less than 0, which obviously implies that they cannot be interpreted as probabilities. This frequently happens, even when the overwhelming majority of cases have predicted probabilities in his recommended range of .20 to .80.

In many applications, this is not a problem because you are not really interested in those probabilities. But quite often, getting valid predictions of probabilities is crucially important. For example, if you want to give osteoporosis patients an estimate of their probability of hip fracture in the next five years, you won’t want to tell them it’s 1.05. And even if the linear probability model produces only in-bounds predictions, the probabilities may be more accurately estimated with logistic.

Interpretability. Von Hippel is undoubtedly correct when he says that, for most researchers, differences in probability are more intuitive than odds ratios. In part, however, that’s just because probabilities are what we are most accustomed to as a measure of the chance that something will happen.

In von Hippel’s examples, the “difficulty” comes in translating from odds to probabilities. But there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds.

Here’s how I think about the odds of, say, catching a cold in a given year. If the odds is 2, that means that 2 people catch a cold for every one person who does not. If the odds increases to 4, then 4 people catch a cold for every one who does not. That’s a doubling of the odds, i.e., an odds ratio of 2. On the other side of the spectrum, if the odds is 1/3 then one person catches cold for every three who do not. If we quadruple the odds, then 4 people catch cold for every 3 who do not. More generally, if the odds increases by a certain percentage, the expected number of individuals who have the event increases by that percentage, relative to the number who do not have the event.

A major attraction of the odds is that it facilitates multiplicative comparisons. That’s because the odds does not have an upper bound. If the probability that I will vote in the next presidential election is .6, there’s no way that your probability can be twice as great as mine. But your odds of voting can easily be 2, 4 or 10 times as great as mine.

Even if you strongly prefer probabilities, once you estimate a logistic regression model you can readily get effect estimates that are expressed in terms of probabilities. Stata makes this especially easy with its margins command, which I will demonstrate in my next post.

Invariance. When it comes down to it, my strongest reason for preferring the logistic model is that, for dichotomous outcomes, there are good reasons to expect that odds ratios will be more stable across time, space, and populations than coefficients from linear regression. Here’s why: for continuous predictors, we know that the linear probability model is unlikely to be a “true” description of the mechanism producing the dichotomous outcome. That’s because extrapolation of the linear model would yield probabilities greater than 1 or less than 0. The true relationship must be an S-shaped curve—not necessarily logistic, but something like it.

Because the linear probability model does not allow for curvature, the slope produced by linear least squares will depend on where the bulk of the data lie on the curve. You’ll get a smaller slope near 1 or 0 and a larger slope near .50. But, of course, overall rates of event occurrence can vary dramatically from one situation to another, even if the underlying mechanism remains the same.

This issue also arises for categorical predictors. Consider a dichotomous y and a single dichotomous predictor x. Their relationship can be completely described by a 2 x 2 table of frequency counts. It is well known that the odds ratio for that table is invariant to multiplication of any row or any column by a positive constant. Thus, the marginal distribution of either variable can change substantially without changing the odds ratio. That is not the case for the “difference between two proportions”, the equivalent of the OLS coefficient for y on x.

One consequence is that linear regression for a dichotomous outcome is likely to produce evidence for interactions that are not “real” or at least would not be needed in a logistic regression. Here’s an example using data from the National Health and Nutrition Examination Study (NHANES). The data set is publicly available on the Stata website and can be directly accessed from the Internet within a Stata session.

There are 10,335 cases with complete data on the variables of interest. I first estimate a logistic regression model with diabetes (coded 1 or 0) as the dependent variable. Predictors are age (in years) and two dummy (indicator) variables, black and female. The model also includes the interaction of black and age. The Stata code for estimating the model is

webuse nhanes2f, clear
logistic diabetes black female age black#c.age

with the following results:

```------------------------------------------------------------------------------

diabetes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
black |   3.318733   1.825189     2.18   0.029     1.129381    9.752231
female |   1.165212   .1098623     1.62   0.105     .9686107    1.401718
age |   1.063009   .0044723    14.52   0.000     1.054279    1.071811
black#c.age |   .9918406   .0090871    -0.89   0.371     .9741892    1.009812
_cons |   .0014978   .0003971   -24.53   0.000     .0008909    .0025183
------------------------------------------------------------------------------

```

Given the high p-value for the interaction (.371), there is clearly no evidence here that the effect of black varies with age. Now let’s estimate the same model as a linear regression:

reg diabetes black female age black#c.age

The new results are:

```------------------------------------------------------------------------------
diabetes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
black |  -.0215031   .0191527    -1.12   0.262    -.0590461    .0160399
female |   .0069338    .004152     1.67   0.095    -.0012049    .0150725
age |   .0020176   .0001276    15.82   0.000     .0017675    .0022676
black#c.age |   .0012962   .0003883     3.34   0.001     .0005351    .0020573
_cons |  -.0553238   .0068085    -8.13   0.000    -.0686697   -.0419779
------------------------------------------------------------------------------

```

Now we have strong evidence for an interaction, specifically that the effect of black is larger at higher ages. Lest you think this is just due to the large sample size, the implied coefficient for black increases from .004 at age 20 (the lowest age in the sample) to .074 at age 74 (the highest age). That’s a huge increase.

Why does this happen? Because the overall rate of diabetes increases markedly with age. When the overall rate is low, the difference in probabilities for blacks and non-blacks is small. As the overall rate gets nearer to .50—the steepest point on the logistic curve—the difference in probabilities becomes larger. But the odds ratio remains the same.

Could the reverse happen? Could we find examples where logistic regression finds interactions but the linear probability does not? Absolutely. But I believe that that’s a far less likely outcome.  I also believe that the substantive implications of the discrepancies between linear and logistic models may often be critical. It’s quite a different thing to say that “the diabetes disadvantage of being black increases substantially with age” versus “the diabetes disadvantage of being black is essentially the same at all ages.” At least for these data, I’ll go with the second statement.

In fairness to von Hippel, he would probably not recommend a linear model for this example. As I’ll show in my next post, the probabilities vary too widely for the linear model to be a good approximation. But the essential point is that logistic regression models may often be more parsimonious than linear regression models for dichotomous outcomes. And the quantitative estimates we get from logistic regression models are likely to be more stable under widely varying conditions.

In the next post, I’ll show how easy it is to get estimates from a logistic model that can be interpreted in terms of probabilities using the margins command in Stata. I’ll also provide links for how to do it in SAS and R.

In July 2015 I pointed out some advantages of the linear probability model over the logistic model. The linear model is much easier to interpret, and the linear model runs much faster, which can be important if the data set is large or the model is complicated. In addition, the linear probability model often fits about as well as the logistic model, since over some ranges the probability p is almost linearly related to the log odds function ln(p(1-p)) that is used in logistic regression.

As a rule of thumb I suggested that the linear probability model could be used whenever the range of modeled probabilities is between .20 and .80. Within that range the relationship between the probability and the log odds is almost linear.

Figure 1. The relationship between probability and log odds over the range of probabilities from .2 to .8.

This is reasonable advice, and similar advice has been given before (Long, 1997). But it doesn’t go far enough. There is a wider range of circumstances where the linear probability model is viable.

For example, in a new paper Joe Workman and I used a multilevel model to analyze obesity among US children in kindergarten through 2nd grade—an age range over which the probability of obesity increases from .09 to .13. Since these probabilities are less than .20, you might guess that we couldn’t use the linear probability model. But we could, and we did. The linear model ran very quickly, whereas the logistic model can be slow when used in a multilevel context. The linear model also gave us very interpretable results; for example, we could write that “during the summer, children’s probability of obesity increases by approximately 1 percentage point per month.” And we didn’t lose any sleep about model fit; the linear model fit practically as well as a logistic model, because over the range of probabilities from .09 to .13 the probability is almost linearly related to the log odds.

Figure 2. The relationship between probability and log odds over the range of probabilities from .09 to .13.

The basic insight is that the linear probability model can be used whenever the relationship between probability and log odds is approximately linear over the range of modeled probabilities. Probabilities between .2 and .8 are one range where approximate linearity holds—but it also holds for some narrow ranges of probabilities that are less than .2 or greater than .8.

I still haven’t gone far enough. When the relationship between probability and log odds is nonlinear, there are still situations where the linear probability model is viable. For example, if your regressors X are categorical variables, then you’re not really modeling a continuous probability function. Instead, you’re modeling the discrete probabilities associated with different categories of X, and this can be done about as well with a linear model as with a logistic model—particularly if your model includes interactions among the X variables (Angrist & Pischke, 2008, chapter 3; Pischke, 2012).

I don’t think that the linear probability model is always viable. I do use the logistic model sometimes. For example, looking at 30 years of data from the Belmont Stakes horse race, I found that the probability of an upset was strongly related to the number of horses that started the race. The more horses started, the more likely it became that one of them would upset the favorite. The relationship looked like this.[1]

Figure 3. The relationship between the number of horses starting the Belmont Stakes and the probability that the favorite will be upset.

On a probability scale the relationship is strongly nonlinear. It almost has to be, because the relationship is strong and the probabilities cover almost the full range from 0 to 1. A linear probability model can’t fit these data easily. When I tried a linear model, out of curiosity, I found that some of the modeled probabilities were out of bounds—greater than 1. I could improve the fit of the linear model by subjecting the X variable to some nonlinear transformation, but finding the right transformation isn’t trivial, and even if I found it the linear model’s ease of interpretation would be lost. It’s simpler to fit a logistic model which naturally keeps the probabilities in bounds.

To check whether your data are candidates for a linear probability model, then, a basic diagnostic is to plot the relationship between probability and log odds over the likely range of probabilities in your data. If the relationship is nearly linear, as in Figures 1 and 2, then a linear probability model will fit about as well as a logistic model, and the linear model will be faster and easier to interpret. But if the relationship is strongly nonlinear, as in Figure 3, then a linear model may fit poorly—unless your Xs are categorical.

The relationship between probability and log odds is easy to plot in a variety of software. In Stata, for example, I plotted the relationship in Figure 1 with a single command, as follows:

twoway function y=ln(x/(1-x)), range(.2 .8) xtitle(“Probability”) ytitle(“Log odds”)

I plotted Figure 2 using the same command, only changing the range to (.09 .13).

In some situations, the relationship between probability and log odds is slightly but not severely nonlinear. Then you face a tradeoff, and your choice of model will depend on your goals. If what you mainly want is a rough but clear summary of the relationships, you might be willing to tolerate a bit of misfit and use a linear model that runs quickly and gives coefficients that are easy to interpret. But if you really want to get the probabilities just right, then you might be willing to sacrifice runtime and interpretability for better probability estimates. For example, I’ve developed financial risk models that predict the probability that a transaction is fraudulent or a borrower will default. In that situation, the coefficients are not the focus. What you really want the model to do is assign accurate probabilities to individual transactions or borrowers, and the linear model often does that poorly over the range of probabilities that are typical for a risk model. Then the logistic model is a more natural choice, although other nonlinear models, such as neural networks or classification trees, are also used.

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.

### References

Angrist, J. D., & Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion (1st ed.). Princeton University Press.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.

Pischke, J.-S. (2012, July 9). Probit better than LPM? Retrieved from http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/

von Hippel, P.T. & Workman, J. (2016). From Kindergarten Through Second Grade, U.S. Children’s Obesity Prevalence Grows Only During Summer Vacations. Obesity Volume 24, Issue 11, pages 2296–2300. http://onlinelibrary.wiley.com/doi/10.1002/oby.21613/full

[1] The relationship looked a little different in my original article on the Belmont Stakes. In that article I constrained the modeled probabilities to reflect the assumption that the probability of an upset cannot exceed n/(n-1) where n is the number of starters. I also subjected the regressor to a reciprocal transformation, but this made less of a difference to the modeled probabilities.

On April 21-22, 2017, I will be offering a seminar on causal mediation analysis in Philadelphia with Statistical Horizons. The course will cover very recent developments in this area.

Mediation is about the mechanisms or pathways by which some treatment or exposure affects an outcome. Questions about mediation arise with considerable frequency in the biomedical and social sciences. Various approaches have been used in the social sciences for decades, especially in psychology. However, the new literature on mediation from the perspective of causal inference has helped clarify the assumptions that are being made by these analyses.  It has also generalized the approach to a much broader class of settings.

The course will cover many of these new advances. We will also talk about the relationships between the newer methods and the approaches that have been used in the biomedical and social sciences.

I have published a book on this topic with Oxford University Press:

The course will cover a lot of the material in Chapters 1-8 of the book, and the book will also serve as a useful reference for the course. Both the book and the course are based on a longer course on mediation and interaction that I have taught at Harvard for several years. I have managed to distill the essential content of that course into material that can be covered in just two days, so the course will provide a fairly comprehensive overview in a relatively short period of time.

The course will also provide an introduction to concepts in causal inference and so might be a fairly accessible way into that literature. The prerequistes for the course are fairly minimal, however, so I’ll essentially just be assuming some familiarity with linear and logistic regression. I have worked hard to make the course as accessible as possible, and have now taught and refined the material numerous times.

The rough outline of the course and the broad topics we will cover are as follows:

1. Concepts and Methods for Mediation
2. Sensitivity Analysis
3. Mediation with Time-to-Event Outcome
4. Multiple Mediators
5. Surrogate Outcomes
6. A Unification of Mediation and Interaction

More specifically, the course will cover the relationship between traditional methods for mediation in the biomedical and the social sciences and new methods in causal inference. For dichotomous, continuous, and time-to-event outcomes, we will discuss when the standard approaches to mediation analysis are valid or not valid, and we’ll extend these to more complex settings.

The no-confounding assumptions needed for these techniques will be described. SAS, SPSS and Stata macros to implement these techniques will be explained and distributed to course participants. The use and implementation of sensitivity analysis techniques to assess the how sensitive conclusions are to violations of assumptions will be covered, along with extensions to multiple mediators.

If you have any questions about the course feel free to drop me a note (tvanderw@hsph.harvard.edu) and I would be happy to try to provide further information. In any case, I would of course be happy to have you there.

When Paul Allison asked me if I wanted to teach a course in Bangladesh, my first reaction was confusion.  Other than knowing a few basic facts about the place, I had spent little time thinking about the country and none at all imagining myself going there. And here, suddenly, was an opportunity to spend a week in the country teaching Stata.

I love Stata and always jump at the chance to teach it. But I was honestly a bit terrified. I had never been outside North America and Western Europe and don’t really consider myself much of a traveler. What would Dhaka be like? I had no real reference point for trying to predict the experience. Even as I was boarding the flight from Boston to Dubai (the second leg of a 28-hour trip), there was a part of me still asking, “What have you gotten yourself into?”

Fortunately my love of teaching Stata was greater than my anxiety because the week I spent in Dhaka ended up being one of the most satisfying of my professional life.

Most of the students I taught were participants in the CDC’s Field Epidemiology Training Program (FETP). This worldwide program takes government doctors who’ve been doing clinical work and trains them to be field epidemiologists. All of the doctors had received basic biostatistics training before I arrived. So my primary role was to teach them how to use Stata to implement the techniques they’d already learned in class.

The most surprising aspect of this experience was how much we ended up focusing on data cleaning. In most of my career, I’ve used very well-behaved data sets like the General Social Survey. In Dhaka, most of the students were dealing with what initially seemed to me very disorganized spreadsheets. Their data had all the numbers stored as text, inconsistent capitalization, and in many cases were half “long” format and half “wide” format.

I found it amazingly satisfying to teach these doctors how to take messy, complicated epidemiological surveillance data and turn them into something they could really use. There were one or two students who had been unable to make any progress on their projects for months because of data management difficulties. I had never met data management problems that complicated before. One particular reshape had me pulling my hair out and took me almost two hours to figure out. But I have rarely—maybe never—been as satisfied in front of a computer screen as the moment I finally got it.

Over the five days of the course, we covered a great deal, from t-tests, to logistic regression, to complex survey analysis. I especially enjoyed teaching these eager, energetic young doctors because they knew exactly how they were going to use these skills to make a difference in their country.

Fortunately I also had a little time to experience Dhaka. I got to ride a rickshaw, take a boat on the Ganges, walk down Hindu Street, and see buildings built by the East India Company. Just walking down the streets of Dhaka was one of the most incredible experiences of my life. The sights and sounds were beyond anything I had ever experienced. As I packed up my laptop on the last day of class, with the sound of the call to prayer in the background, my thoughts were no longer “what have I gotten myself into?” but instead “is this really over?”

Boarding the flight to return home, I was so grateful that my love of Stata had taken me to such an unexpected place. I hope I made as much of an impact on my students as they and their country made on me.

Stephen Vaisey is an Associate Professor of Sociology at Duke University. For Statistical Horizons, he regularly teaches a seminar on Treatment Effects Analysis.

In his April 1 post, Paul Allison pointed out several attractive properties of the logistic regression model.  But he neglected to consider the merits of an older and simpler approach: just doing linear regression with a 1-0 dependent variable.  In both the social and health sciences, students are almost universally taught that when the outcome variable in a regression is dichotomous, they should use logistic instead of linear regression. Yet economists, though certainly aware of logistic regression, often use a linear model to model dichotomous outcomes.

Which probability model is better, the linear or the logistic? It depends. While there are situations where the linear model is clearly problematic, there are many common situations where the linear model is just fine, and even has advantages.

### Interpretability

Let’s start by comparing the two models explicitly. If the outcome Y is a dichotomy with values 1 and 0, define p = E(Y|X), which is just the probability that Y is 1, given some value of the regressors X. Then the linear and logistic probability models are:

p = a+ a1X1a2X2 + … + akXk    (linear)

ln[p/(1-p)] = b+ b1X1b2X2 + … + bkXk       (logistic)

The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1-p) is a linear function of the regressors.

The major advantage of the linear model is its interpretability. In the linear model, if a1 is (say) .05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese.

The logistic model is less interpretable. In the logistic model, if b1 is .05, that means that a one-unit increase in X1 is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds.

### How intuitive are odds ratios?

Because the log odds scale is so hard to interpret, it is common to report logistic regression results as odds ratios. To do this, we exponentiate both sides of the logistic regression equation and obtain a new equation that looks like this:

p/(1-p) = d0 × (d1)X1 × (d2)X2 × … × (dk)Xk

On the left side we have the odds and on the right side we have a product involving the odds ratios d1 = exp(b1), d2 = exp(b2), etc.

Odds ratios seem like they should be intuitive. If d1 = 2, for example, that means that a one-unit increase in Xdoubles the odds that Y is 1. That sounds like something we should understand.

But we don’t understand, really. We think we understand odds because in everyday speech we use the word “odds” in a vague and informal way. Journalists commonly use “odds” interchangeably with a variety of other words, such as “chance,” “risk,” “probability,” and “likelihood”—and academics are often just as sloppy when interpreting results. But in statistics these words aren’t synonyms. The word odds has a very specific meaning—p/(1-p)—and so does the odds ratio.

Still think you have an intuition for odds ratios? Let me ask you a question. Suppose a get-out-the-vote campaign can double your odds of voting. If your probability of voting was 40% before the campaign, what is it after? 80%? No, it’s 57%.

If you got that wrong, don’t feel bad. You’ve got a lot of company. And if you got it right, I bet you had to do some mental arithmetic[1], or even use a calculator, before answering. The need for arithmetic should tell you that odds ratios aren’t intuitive.

Here’s a table that shows what doubling the odds does to various initial probabilities:

 Before doubling After doubling Probability Odds Odds Probability 10% 0.11 0.22 18% 20% 0.25 0.50 33% 30% 0.43 0.86 46% 40% 0.67 1.33 57% 50% 1.00 2.00 67% 60% 1.50 3.00 75% 70% 2.33 4.67 82% 80% 4.00 8.00 89% 90% 9.00 18.0 95%

It isn’t simple. The closest I’ve come to developing an intuition for odds ratios is this: If p is close to 0, then doubling the odds is approximately the same as doubling p. If p is close to 1, then doubling the odds is approximately the same as halving 1-p. But if p is in the middle—not too close to 0 or 1—then I don’t really have much intuition and have to resort to arithmetic.

That’s why I’m not crazy about odds ratios.

### How nonlinear is the logistic model?

The logistic model is unavoidable if it fits the data much better than the linear model. And sometimes it does. But in many situations the linear model fits just as well, or almost as well, as the logistic model. In fact, in many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret (Hellevik 2007).

For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. And for that to be true, the relationship between the probability and the log odds must itself be nonlinear. But how nonlinear is the relationship between probability and log odds? If the probability is between .20 and .80, then the log odds are almost a linear function of the probability  (cf. Long 1997).

It’s only when you have a really wide range of probabilities—say .01 to .99—that the linear approximation totally breaks down.

When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model.

### A rule of thumb

These considerations suggest a rule of thumb. If the probabilities that you’re modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.

Both situations occur with some frequency. If you’re modeling the probability of voting, or of being overweight, then nearly all the modeled probabilities will be between .20 and .80, and a linear probability model should fit nicely and offer a straightforward interpretation. On the other hand, if you’re modeling the probability that a bank transaction is fraudulent—as I used to do—then the modeled probabilities typically range between .000001 and .20. In that situation, the linear model just isn’t viable, and you have to use a logistic model or another nonlinear model (such as a neural net).

Keep in mind that the logistic model has problems of its own when probabilities get extreme. The log odds ln[p/(1-p)] are undefined when p is equal to 0 or 1. When p gets close to 0 or 1 logistic regression can suffer from complete separation, quasi-complete separation, and rare events bias (King & Zeng, 2001). These problems are less likely to occur in large samples, but they occur frequently in small ones. Users should be aware of available remedies. See Paul Allison’s post on this topic.

### Computation and estimation

Interpretability is not the only advantage of the linear probability model. Another advantage is computing speed. Fitting a logistic model is inherently slower because the model is fit by an iterative process of maximum likelihood. The slowness of logistic regression isn’t noticeable if you are fitting a simple model to a small or moderate-sized dataset. But if you are fitting a very complicated model or a very large data set, logistic regression can be frustratingly slow.[2]

The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS). OLS ignores the fact that the linear probability model is heteroskedastic with residual variance p(1-p), but the heteroscedasticity is minor if p is between .20 and .80, which is the situation where I recommend using the linear probability model at all. OLS estimates can be improved by using heteroscedasticity-consistent standard errors or weighted least squares. In my experience these improvements make little difference, but they are quick and reassuring.

Paul von Hippel is an Associate Professor in the LBJ School of Public Affairs at the University of Texas, Austin, with affiliations and courtesy appointments in Sociology, Population Research, and Statistics and Data Science.

### References

Hellevik, O. (2007) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity, 43(1), 59–74. http://doi.org/10.1007/s11135-007-9077-3

King, G., & Zeng, L. (2001) Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163. http://doi.org/10.2307/25791637

Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables (1st ed.). Sage Publications, Inc.

[1] Here’s the mental arithmetic that I did. A probability of 40% is equivalent to odds of 2/3. Doubling those odds gives odds of 4/3. And odds of 4/3 are equivalent to a probability of 4/7, which in my head I figured was about 56%. When I wrote this footnote, though, I checked my mental arithmetic using Excel, which showed me that 4/7 is 57%.

[2] In current work, my colleagues and I are using a hierarchical, spatially correlated model to estimate the probability of obesity among 376,576 adults in approximately 2,400 US counties. The computational methods are demanding, and switching from a logistic to a linear probability model reduced our runtime from days to less than an hour.

When estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time t is what happened at time t-1.

This can work well for some kinds of models, but not for mixed models, otherwise known as a random effects models or multilevel models.  Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as dynamic panel data models.  Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward.  Let yit be the value of the dependent variable for individual i at time t.  Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector xit:

yit = b0 + b1yi(t-1) + b2xit +  ui + eit

The random intercept ui represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the right-hand side.

That’s where the problem lies. Because the model applies to all time points, i has a direct effect on yi(t-1).  But if i affects yi(t-1), it can’t also be statistically independent of yi(t-1). The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time: ed (years of education) and fem (1 for female, 0 for male).

Here’s the Stata code:

xtset id t
xtreg lwage L.lwage ed fem t

The xtset command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7.  The xtreg command fits a random-intercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors.  L.lwage specifies the one-year lag of lwage

Here’s the output:

`------------------------------------------------------------------------------`
`       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`       lwage |`
`         L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851`
`          ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724`
`         fem |    -.06705    .010187    -6.58   0.000    -.0870162   -.0470839`
`           t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811`
`       _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161`
`-------------+----------------------------------------------------------------`
` `

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males.  Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983) which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply xtdpdqml to the wage data:

xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)

The re option specifies a random effects (random intercepts) model.  By default, the command includes the lag-1 dependent variable as a predictor.  The initval option sets the starting values for the four variance parameters that are part of the model.  Here is the output:

`------------------------------------------------------------------------------`
`       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`       lwage |`
`         L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527`
`          ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666`
`         fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166`
`           t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378`
`       _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419`
`------------------------------------------------------------------------------`

Results are markedly different from those produced above by xtreg.  The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Blacks now make 8 percent less than non-blacks rather than 1 percent less. And females make 24 percent less (calculated as 100(exp(-.28)-1) than males compared to 6 percent less. The annual increase in wages is 5 percent instead of 1 percent.

So doing it right can make a big difference.  Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with xtdpdml, use

xtset id t
xtdpdml lwage, inv(ed fem blk) errorinv

The inv option is for time-invariant variables.  The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a 1-time unit lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.

Here is the output:

`------------------------------------------------------------------------------`
`             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`lwage2       |`
`      lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909`
`          ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486`
`         fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315`
`------------------------------------------------------------------------------`

Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper.  As a matter of fact, the xtdpdml command is just a front-end to the sem command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself.  That’s why we wrote the command.

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both xtdpdqml and xtdpdml can handle this situation also.

If you’d like to learn more about dynamic panel data models, check out my 2-day course on Longitudinal Data Analysis Using SEM. It will be offered again October 16-17, 2015, in Los Angeles.

References

Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 1635-1659.

Rabe-Hesketh, Sophia, and Anders Skrondal  (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP.

In my July 2012 post, I argued that maximum likelihood (ML) has several advantages over multiple imputation (MI) for handling missing data:

• ML is simpler to implement (if you have the right software).
• Unlike multiple imputation, ML has no potential incompatibility between an imputation model and an analysis model.
• ML produces a deterministic result rather than a different result every time.

Incidentally, the use of ML for handling missing data is often referred to as “full information maximum likelihood” or FIML.

What I didn’t mention in that 2012 post (but which I discussed in the paper on which it was based) is that ML is also asymptotically efficient. Roughly speaking, that means that in large samples, the standard errors of ML estimates are as small as possible—you can’t do any better with other methods.

With MI, on the other hand, the only way to get asymptotic efficiency is to do an infinite number of imputations, something that is clearly not possible. You can get pretty close to full efficiency for the parameter estimates with a relatively small number of imputations (say, 10), but efficient estimation of standard errors and confidence intervals typically requires a much larger number of imputations.

So for large samples, ML seems to have the clear advantage. But what about small samples? For ML, the problem is that statistical inference is based on large-sample approximations that may not be accurate in smaller samples. By contrast, statistical inference for MI is typically based on a t-distribution which adjusts for small sample size. That means that MI is better than ML when working with small samples, right?

Wrong! In a paper that will be published soon in Structural Equation Modeling, Paul von Hippel assesses the performance of ML and MI in small samples drawn from a bivariate normal distribution. He shows, analytically, that ML estimates have less bias than MI estimates. By simulation, he also shows that ML estimates have smaller sampling variance than MI estimates.

What about confidence intervals and p-values? To address that issue, von Hippel introduces a novel method for calculating degrees of freedom for a t-distribution that can be used with ML estimation in small samples. He demonstrates by simulation that confidence intervals based on this t-distribution have approximately the correct coverage and are narrower, on average, than the usual confidence intervals for MI.

Problem solved? Well, not quite. Von Hippel’s DF formula requires some computational work, and that will discourage some researchers. In principle, the method could easily be programmed into structural equation modeling packages and, hopefully, that will happen. Until it does, however, the method probably won’t be widely used.

Bottom line is that ML seems like the better way to go for handling missing data in both large and small samples. But there’s still a big niche for MI. ML requires a parametric model that can be estimated by maximizing the likelihood. And to do that, you usually need specialized software. Most structural equation modeling packages can do FIML for linear models, but not for non-linear models. As far as I know, Mplus is the only commercial package that can do FIML for logistic, Poisson, and Cox regression.

MI, on the other hand, can be readily applied to these and many other models, without the need for specialized software. Another attraction of MI is that you can easily do a sensitivity analysis for the possibility that data that are not missing at random. So if you really want to be a skilled handler of missing data, you need to be adept at both approaches.

If you want to learn more about both multiple imputation and maximum likelihood, check out my two-day course on Missing Data that will be offered this fall. Dates and location have not yet been set.

For the analysis of binary data, logistic regression dominates all other methods in both the social and biomedical sciences. It wasn’t always this way. In a 1934 article in Science, Charles Bliss proposed the probit function for analyzing binary data, and that method was later popularized in David Finney’s 1947 book Probit Analysis. For many years, probit was the method of choice in biological research.

In a 1944 article in the Journal of the American Statistical Association, Joseph Berkson introduced the logit model (aka logistic regression model) and argued for its superiority to the probit model. The logistic method really began to take off with the publication of David Cox’s 1970 book Analysis of Binary Data. Except for toxicology applications, probit has pretty much disappeared from the biomedical world. There are other options as well (like complementary log-log) but logistic regression is the overwhelming favorite.

So what is it about logistic regression that makes it so popular?  In this post, I’m going to detail several things about logistic regression that make it more attractive than its competitors. And I’m also going to explain why logistic regression has some of these properties. It turns out that there is something special about the logit link that gives it a natural advantage over alternative link functions.

First, a brief review. For binary data, the goal is to model the probability p that one of two outcomes occurs. The logit function is log[p/(1-p)], which varies between -∞ and +∞ as p varies between 0 and 1.The logistic regression model says that

log[p/(1-p)] = b0 + b1x1 + … + bkxk

or, equivalently,

p = 1/(1 + exp{-(b0 + b1x1 + … + bkxk)})

Estimation of the b coefficients is usually accomplished by maximum likelihood.

In this context, the logit function is called the link function because it “links” the probability to the linear function of the predictor variables. (In the probit model, the link function is the inverse of the cumulative distribution function of a standard normal variable.)

What’s most important about the logit link is that it guarantees that p is bounded by 0 and 1, no matter what the b’s and the x’s are. However, that property is hardly unique to the logit link. It’s also true for the probit link, the complementary log-log link, and an infinite number of other possible link functions.  But there are a few things that are special about logit:

1. If you exponentiate the coefficients, you get adjusted odds ratios. These have a remarkably intuitive interpretation, one that is even used in the popular media to convey the results of logistic regression to non-statisticians. Coefficients from probit regression are not nearly so interpretable.

2. With logit, you can do disproportionate stratified random sampling on the dependent variable without biasing the coefficients. For example, you could construct a sample that includes all of the events, and a 10% random sample of the non-events. This property is the justification for the widely-used case-control method in epidemiology. It’s also extremely useful when dealing with very large samples with rare events. No other link function has this property.

3. You can do exact logistic regression. With conventional maximum likelihood estimation, the p-values and confidence intervals are large-sample approximations. These approximations may not be very accurate when the number of events is small. With exact logistic regression (a generalization of Fisher’s exact test for contingency tables) exact p-values are obtained by enumeration of all possible data permutations that produce the same “sufficient statistics” (more below). Again, this is not possible with any other link function.

4. You can do conditional logistic regression. Suppose your binary data are clustered in some way, for example, repeated measurements clustered within persons or persons clustered within neighborhoods. Suppose, further, that you want to estimate a model that allows for between-cluster differences, but you don’t want to put any restrictions on the distribution of those differences or their relationship with predictor variables. In that case, conditional maximum likelihood estimation of the logistic model is the way to go. What it conditions on is the total number of events in each cluster. When you do that, the between-cluster effects cancel out of the likelihood function. This doesn’t work with probit or any other link function.

What is it about the logit link that makes it possible to do these useful variations of logistic regression?  The answer is somewhat esoteric, but I’ll do my best to explain it. For binary data, the most appropriate probability distribution is the binomial (or its special case, the Bernoulli distribution for single trial data). The binomial distribution happens to be a member of the very important exponential family of distributions. In general form, the probability distribution for the exponential family can be written as

f(x|b) = h(x)exp{T(x)’g(b)–A(b)}

In this formula, x is a vector of the data, b is a vector of parameters, and h, g, T, and A are known functions.  If g(b) = b, then b is said to be the natural parameter (or canonical parameter) of the distribution. T(x) is a vector of sufficient statistics of the data. These are summary statistics that contain all the available information in the data about the parameters.

For the binomial distribution, this formula specializes to

f(x|p) = [N x]exp{x log[p/(1-p)] –N log(1-p)}

where N is the number of trials, x is the number of events and [N x] is the binomial coefficient.  We see immediately that log[p/(1-p)] is the natural parameter of the binomial distribution. Because the natural parameter directly multiplies the sufficient statistic (in this case, the number of events), all sorts of mathematical operations become much easier and more straightforward. If you work with something other than the natural parameter, things are more difficult. That’s why the logit link has a special place among the infinite set of possible link functions.

Of course, mathematical convenience does not imply that the logit link is more likely to be a realistic representation of the real world than some other link. But in the absence of compelling reasons to use something else, you might as well go with a method that’s convenient and interpretable. It’s the same reason why we often prefer linear models with normally distributed errors.

Incidentally, many social scientists, especially economists, still prefer to use linear probability models for binary data for exactly these reasons: mathematical convenience and interpretability. Check out next month’s post for arguments in favor of linear models for binary outcomes.

There are also some situations in which probit has the mathematical advantage. For example, suppose you want to do a factor analysis of a set of binary variables.  You want a model in which the binary variables depend on one or more continuous latent variables. If the distributions of the binary variables are expressed as probit functions of the latent variables, then the multivariate normal distribution can be used as a basis for estimation. There is no comparable multivariate logistic distribution.

If you’d like to learn more about these and other methods for binary data, see my book Logistic Regression Using SAS: Theory and Application.

Predictive mean matching (PMM) is an attractive way to do multiple imputation for missing data, especially for imputing quantitative variables that are not normally distributed. But, as I explain below, it’s also easy to do it the wrong way.

Compared with standard methods based on linear regression and the normal distribution, PMM produces imputed values that are much more like real values. If the original variable is skewed, the imputed values will also be skewed. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like number of children), the imputed values will also be discrete. That’s because the imputed values are real values that are “borrowed” from individuals with real data.

PMM has been around for a long time (Rubin 1986, Little 1988), but only recently has it become widely available and practical to use. Originally, it could only be used in situations where a single variable had missing data or, more broadly, when the missing data pattern was monotonic. Now, however, the PMM method is embedded in many software packages that implement an approach to multiple imputation variously known as multiple imputation by chained equations (MICE), sequential generalized regression, or the fully conditional specification (FCS). It’s available in many statistical packages, including SAS, Stata, SPSS, and R, all of which allow you to use PMM for virtually any missing data pattern.

There are two major pitfalls to PMM, however. First, only a handful of studies have evaluated its performance, so it’s not clear how well it compares with alternative methods. Second, at least two statistical packages, SPSS and Stata, have implemented PMM with a default setting that actually invalidates the method. If you use either of those packages, you must override the default.

Before explaining that problem, I first need to provide a brief description of how PMM works. Suppose there is a single variable x that has some cases with missing data, and a set of variables z (with no missing data) that are used to impute x. Do the following:

1. For cases with no missing data, estimate a linear regression of x on z, producing a set of coefficients b.
2. Make a random draw from the “posterior predictive distribution” of b, producing a new set of coefficients b*. Typically this would be a random draw from a multivariate normal distribution with mean b and the estimated covariance matrix of b (with an additional random draw for the residual variance). This step is necessary to produce sufficient variability in the imputed values, and is common to all “proper” methods for multiple imputation.
3. Using b*, generate predicted values for x for all cases, both those with data missing on x and those with data present.
4. For each case with missing x, identify a set of cases with observed x whose predicted values are close to the predicted value for the case with missing data.
5. From among those close cases, randomly choose one and assign its observed value to substitute for the missing value.
6. Repeat steps 2 through 5 for each completed data set.

Unlike many methods of imputation, the purpose of the linear regression is not to actually generate imputed values. Rather, it serves to construct a metric for matching cases with missing data to similar cases with data present.

There are several variations to this method (Morris et al. 2014), but the most important issue to settle is how many cases (k) should be in each match set. The default in the SAS procedure MI and in the MICE package for R is k=5. That is, each case with missing data on x is matched to the 5 cases (with data present) that have the closest predicted values. One of the 5 is chosen at random and its x value is assigned to the case with missing data. Solas and the user-written ice command for Stata set the default at k=10.

On the other hand, for the SPSS missing values module and for the built-in mi command in Stata the default is k=1. That is, each case with missing data is matched to the single case whose predicted value is closest to the predicted value for the case with missing data. With only one matched case, there is no random draw at Step 5 in the scheme above.

That’s a serious error. With no random draw at Step 5, the only source of random variation in the imputation process is the random draw of regression coefficients in Step 2. That’s not nearly enough to produce proper imputations. As a result, estimated standard errors tend to be much too low, leading to inflated test statistics and confidence intervals that are much too narrow (Morris et al. 2014).

Why did SPSS and Stata get it so wrong? Well, I’m guessing that they relied on Don Rubin’s classic 1987 book Multiple Imputation for Nonresponse in Surveys. In his description of PMM (p. 168), he proposed matching to a single case. But later work makes it clear that this is not the way to go.

So, if not k=1, then how many? That’s not clear. Schenker and Taylor (1996) did simulations with k=3 and k=10. Differences in performance were small, but with k=3, there was less bias and more sampling variation. Based on their simulations, Morris et al. (2014) recommended k=10 for most situations. But a lot depends on sample size. With large samples, k=10 is probably the better choice. But with smaller samples, k=10 will probably include too many cases that are rather unlike the case to which they are matched. Personally, I’m reasonably happy with the k=5 default of SAS and MICE.

The other major drawback of PMM is that there’s no mathematical theory to justify it (which is also true of MICE methods more generally). We have to rely on Monte Carlo simulations, and no simulation can study all the possibilities. Results reported by Schenker and Taylor (1996) and Morris et al. (2014) are very encouraging, but hardly definitive. In brief, it appears that PMM does almost as well as parametric methods for a correctly specified model, and a little better than parametric methods in certain misspecified models. So the current consensus seems to be that this is an acceptable and potentially useful method. But–as they say–more research is needed.

REFERENCES

Little, Roderick J. A. (1988) “Missing-data adjustments in large surveys.” Journal of Business & Economic Statistics 6: 287-296.

Morris, Tim P., Ian R. White and Patrick Royston (2014) “Tuning multiple imputation by predictive mean matching and local residual draws.” BMC Medical Research Methodology 14: 75-87.

Rubin, Donald B. (1986) “Statistical matching using file concatenation with adjusted weights and multiple imputations.” Journal of Business & Economic Statistics 4: 87-94.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley.

Schenker, Nathaniel and Jeremy M.G. Taylor (1996) “Partially parametric techniques for multiple imputation.” Computational Statistics & Data Analysis 22: 425-446.

In my November and December posts, I extolled the virtues of SEM for estimating dynamic panel models. By combining fixed effects with lagged values of the predictor variables, I argued that this approach offers the best option for making causal inferences with non-experimental panel data. It controls for all time-invariant variables, whether observed or not, and it allows for the possibility of reciprocal causation. Simulation evidence strongly favors SEM over the more popular Arellano-Bond method for estimating these kinds of models.

Despite the potential for this method, I recently learned that it’s vulnerable to a very troubling kind of bias when the lag structure is misspecified. In the latest issue of Sociological Methods and Research, Stephen Vaisey and Andrew Miles showed by both simulation and formal proof that a positive contemporaneous effect will often show up as a negative effect when estimating a fixed effects model with a predictor that is lagged by one time unit. They concluded that, for most social science applications, “artifactual negative ‘effects’ will likely be the rule rather than the exception.”

Vaisey and Miles investigated this problem only for the case of three periods of data, no lagged effect of the dependent variable y on itself, and no true effect of y on x. In that case, maximum likelihood reduces to OLS regression using difference scores: y3y2 on x2x1. They showed that the coefficient for x2x1 has an expected value that is exactly -.5 times the true coefficient.

My own simulations suggest that a sign reversal can also happen with four or more periods and a lagged dependent variable. And the effect of one variable on the other doesn’t have to be exactly contemporaneous. The reversal of sign can also occur if the correct lag is one week, but the estimated model specifies a lag of one year. Note that this artifact does not arise with random effects models. It’s specific to fixed effects models with lagged predictors. That should not be interpreted as an endorsement of random effects models, however, because they are much more prone to bias from omitted variables.

As noted by Vaisey and Miles, a 2011 article in the Journal of Quantitative Criminology may exemplify the problem of misspecified lags. Following my advice, Ousey, Wilcox and Fisher used the fixed effects SEM method to examine the relationship between victimization and offending. Numerous studies have found a positive, cross-sectional relationship between these variables: people who report being victims of crimes are also more likely to commit crimes. But Ousey et al. found negative effects of each variable on the other. Respondents who reported higher levels of offending in year t had lower levels of victimization in year t+1, after adjusting for fixed effects. And respondents with higher levels of victimization in year t had lower levels of offending in year t+1.

This surprising result could be real. But it could also occur if there is a positive effect of victimization on offending that is almost instantaneous rather than lagged by one year. And, finally, it could also occur if there is a positive, instantaneous effect of offending on victimization.

What can be done about this problem? Well, one implication is that more thought should go into the design of panel surveys. If you expect that changes in x will produce changes in y a month later, then collecting monthly data would be much better than collecting annual data. This could have the added advantage of reducing the total time for data collection, although it might also increase certain kinds of response bias.

What if your data have already been collected? Here’s a tentative recommendation that worked well in a few simulations. As a robustness check, estimate models that include both contemporaneous and lagged predictors. If a one-year lag is the correct specification, then the contemporaneous effect should be small and not statistically significant. If, on the other hand, the contemporaneous effect is large and significant, it should raise serious doubts about validity of the method and the kinds of conclusions that can be drawn. It may be that the data are simply not suitable for separating the effect of x on y from the effect of y on x.

I tried this strategy on a subset of the data used by Ousey et al. to study victimization and offending. When both contemporaneous and lagged predictors were included, I found a strong positive effect of victimization on offending in the same year. The one-year lagged effect was negative but small and non-significant. The same thing happened in the reverse direction. Offending had a strong positive effect on victimization in the same year, but the lagged effect was negative and not significant. My take: these data don’t allow one to draw any firm conclusions about whether victimization affects offending or offending affects victimization. They certainly don’t provide a basis for claiming negative effects of each variable on the other.

Clearly this is a problem that needs a great deal more study. There is a substantial econometric literature on determining the number of lags needed for autoregressive models but, as far as I know, Vaisey and Miles are the first to identify this particular phenomenon.

By the way, Steve Vaisey teaches a highly-rated course for Statistical Horizons called Treatment Effects Analysis.