When estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time t is what happened at time t-1.

This can work well for some kinds of models, but not for mixed models, otherwise known as random effects models or multilevel models. Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as dynamic panel data models.  Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward.  Let yit be the value of the dependent variable for individual i at time t.  Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector xit:

yit = b0 + b1yi(t-1) + b2xit +  ui + eit

The random intercept ui represents the combined effect on y of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and independent of the other variables on the right-hand side.

That’s where the problem lies. Because the model applies to all time points, ui has a direct effect on yi(t-1). But if ui affects yi(t-1), it can’t also be statistically independent of yi(t-1). The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).
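The argument can be written out in a few lines. This is only a sketch, and it adds assumptions that are not stated above: the process is stationary with |b1| < 1, and both xit and eit are uncorrelated with ui. The symbol \sigma_u^2 (the variance of ui) is notation introduced here for illustration.

```latex
% Lag the model by one period: u_i appears in the equation for y_{i,t-1}.
y_{i,t-1} = b_0 + b_1 y_{i,t-2} + b_2 x_{i,t-1} + u_i + e_{i,t-1}

% Take the covariance of both sides with u_i
% (the x and e terms drop out under the stated assumptions).
\operatorname{Cov}(y_{i,t-1}, u_i) = b_1 \operatorname{Cov}(y_{i,t-2}, u_i) + \sigma_u^2

% Under stationarity the two covariances are equal, so solving gives
\operatorname{Cov}(y_{i,t-1}, u_i) = \frac{\sigma_u^2}{1 - b_1} \neq 0
```

So whenever the random intercept has any variance at all, it is correlated with the lagged dependent variable, and the correlation grows as b1 approaches 1.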

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the wages data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the xtreg command, let’s first estimate a random intercepts model for lwage (log of wage) with the dependent variable lagged by one year, along with two predictors that do not change over time, ed (years of education) and fem (1 for female, 0 for male), plus the time variable t.

Here’s the Stata code:

```
use "https://statisticalhorizons.com/wp-content/uploads/wages.dta", clear
xtset id t
xtreg lwage L.lwage ed fem t
```

The xtset command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable id and a time variable t that ranges from 1 to 7. The xtreg command fits a random-intercepts model by default, with lwage as the dependent variable and the subsequent four variables as predictors. L.lwage specifies the one-year lag of lwage.

Here’s the output:

`------------------------------------------------------------------------------`
`       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`       lwage |`
`         L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851`
`          ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724`
`         fem |    -.06705    .010187    -6.58   0.000    -.0870162   -.0470839`
`           t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811`
`       _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161`
`-------------+----------------------------------------------------------------`
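As an aside on the L. operator used in the code above, here is a minimal sketch that builds the same lag by hand. The variable name lwage_lag1 is hypothetical, and the sketch assumes the panel is balanced with no missing values of lwage, as the description of the data suggests.

```stata
* Sketch: build the one-year lag by hand to see what L.lwage does.
* lwage_lag1 is a hypothetical name chosen for this illustration.
use "https://statisticalhorizons.com/wp-content/uploads/wages.dta", clear
xtset id t

* Within each person, sorted by year, take the previous year's value.
bysort id (t): generate lwage_lag1 = lwage[_n-1]

* The first year for each person has no lag, so those rows are missing
* (one row per person if the panel is balanced).
count if missing(lwage_lag1)

* The hand-built lag should agree with L.lwage wherever it is defined.
assert lwage_lag1 == L.lwage if !missing(lwage_lag1)

* Same random-intercepts model as above, using the hand-built lag.
xtreg lwage lwage_lag1 ed fem t, re
```

Because each person's first year has no lagged value, the model is estimated on six years per person rather than seven.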

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males. Each passing year (the variable t) is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.
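To see how close the "multiply by 100" shortcut is to the exact percentage change, 100(exp(b)-1), the two can be compared directly. This sketch simply copies the coefficients from the xtreg table; nothing is re-estimated.

```stata
* Sketch: approximate vs. exact percentage effects for the xtreg estimates.
* Coefficients are typed in from the output table above.
display "ed : approx " 100*.0108335  "   exact " 100*(exp(.0108335) - 1)
display "fem: approx " 100*(-.06705) "   exact " 100*(exp(-.06705) - 1)
display "t  : approx " 100*.0071965  "   exact " 100*(exp(.0071965) - 1)
```

For coefficients this small the two versions nearly coincide (for fem the exact figure is roughly -6.5 percent). The gap widens as coefficients get larger, which is why the exact formula matters more for larger estimates.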

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983), which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2016) in a Stata command called xtdpdqml. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply xtdpdqml to the wage data:

xtset id t
xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)

The re option specifies a random effects (random intercepts) model.  By default, the command includes the lag-1 dependent variable as a predictor.  The initval option sets the starting values for the four variance parameters that are part of the model.  Here is the output:

`------------------------------------------------------------------------------`
`       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`       lwage |`
`         L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527`
`          ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666`
`         fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166`
`           t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378`
`       _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419`
`------------------------------------------------------------------------------`

Results are markedly different from those produced above by xtreg. The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. And females make 24 percent less than males (calculated as 100(exp(-.28)-1)), compared to 6 percent less. The annual increase in wages is 5 percent instead of about 0.7 percent.
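One way to put the two sets of estimates side by side is with Stata's estimates store and estimates table commands. This is a sketch, assuming xtdpdqml is already installed. The stored names re_gls and re_qml are hypothetical, and if xtdpdqml labels its coefficient equations differently from xtreg, the rows may not line up without further options.

```stata
* Sketch: compare the xtreg and xtdpdqml estimates in a single table.
* Assumes xtdpdqml is installed; re_gls and re_qml are made-up names.
use "https://statisticalhorizons.com/wp-content/uploads/wages.dta", clear
xtset id t

* Conventional random-intercepts model with a lagged dependent variable.
quietly xtreg lwage L.lwage ed fem t
estimates store re_gls

* Quasi-maximum likelihood random-effects model (lag included by default).
quietly xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)
estimates store re_qml

* Coefficients with standard errors underneath, one column per model.
estimates table re_gls re_qml, b(%9.4f) se(%9.4f)
```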

So doing it right can make a big difference.  Unfortunately, xtdpdqml has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, xtdpdml, that removes many of these limitations. (Note that the only difference in the names for the two commands is the q in the middle). It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with xtdpdml, use

xtset id t
xtdpdml lwage, inv(ed fem) errorinv

The inv option is for time-invariant variables. The errorinv option forces the error variance to be the same at all points in time. Like xtdpdqml, this command automatically includes a one-period lag of the dependent variable. Unlike xtdpdqml, xtdpdml can include longer lags and/or multiple lags.

Here is the output:

`------------------------------------------------------------------------------`
`             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]`
`-------------+----------------------------------------------------------------`
`lwage2       |`
`      lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909`
`          ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486`
`         fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315`
`------------------------------------------------------------------------------`

Results are very similar to those for xtdpdqml. They are slightly different because xtdpdml always treats time as a categorical variable, but time was a quantitative variable in the earlier model for xtdpdqml.

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in Allison et al. (2017). As a matter of fact, the xtdpdml command is just a front-end to the sem command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself. That’s why we wrote the command.
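The SEM formulation works with the data in wide form, one record per person with a separate wage variable for each year. Here is a minimal sketch of that first step only. It assumes that ed and fem are constant within person, as described earlier, and it keeps just the variables used in this post. Writing out the full set of sem equations is the tedious part and is not attempted here.

```stata
* Sketch: reshape the wage data from long to wide form for SEM software.
* Assumes ed and fem do not vary within id (reshape stops with an error otherwise).
use "https://statisticalhorizons.com/wp-content/uploads/wages.dta", clear
keep id t lwage ed fem

* One row per person; lwage becomes lwage1, lwage2, ..., lwage7.
reshape wide lwage, i(id) j(t)

* Check the resulting variables and the number of records.
describe
count
```

This layout appears to be why the xtdpdml output above is labeled lwage2 with lwage1 as a predictor: each year's wage is its own variable, regressed on the previous year's.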

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both xtdpdqml and xtdpdml can handle this situation also.

If you’d like to learn more about dynamic panel data models, check out my course on Longitudinal Data Analysis Using SEM.

References

Allison, Paul D., Richard Williams and Enrique Moral-Benito (2017) “Maximum likelihood for cross-lagged panel models with fixed effects.” Socius 3: 1-17.

Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” Econometrica 51 (6): 1635-1659.

Kripfganz, S. (2016) “Quasi-maximum likelihood estimation of linear dynamic short-T panel-data models.” Stata Journal 16 (4): 1013-1038.

Rabe-Hesketh, Sophia, and Anders Skrondal (2012) Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses. Third Edition. StataCorp LP.