When estimating regression models for longitudinal panel data, many researchers include a lagged value of the dependent variable as a predictor. It’s easy to understand why. In most situations, one of the best predictors of what happens at time *t* is what happened at time *t*-1.

This can work well for some kinds of models, but not for mixed models, otherwise known as random effects models or multilevel models. Nowadays, mixed modeling is probably the most popular approach to longitudinal data analysis. But including a lagged dependent variable in a mixed model usually leads to severe bias.

In economics, models with lagged dependent variables are known as *dynamic panel data* models. Economists have known for many years that lagged dependent variables can cause major estimation problems, but researchers in other disciplines are often unaware of these issues.

The basic argument is pretty straightforward. Let $y_{it}$ be the value of the dependent variable for individual $i$ at time $t$. Here’s a random intercepts model (the simplest mixed model) that includes a lagged value of the dependent variable, as well as a set of predictor variables represented by the vector $x_{it}$:

$$y_{it} = b_0 + b_1 y_{i(t-1)} + b_2 x_{it} + u_i + e_{it}$$

The random intercept $u_i$ represents the combined effect on $y$ of all unobserved variables that do not change over time. It is typically assumed to be normally distributed with a mean of 0, constant variance, and *independent of the other variables on the right-hand side*.

That’s where the problem lies. Because the model applies to all time points, $u_i$ has a direct effect on $y_{i(t-1)}$. But if $u_i$ affects $y_{i(t-1)}$, it can’t also be statistically independent of $y_{i(t-1)}$. The violation of this assumption can bias both the coefficient for the lagged dependent variable (usually too large) and the coefficients for other variables (usually too small).
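To see the dependence explicitly, here is a rough sketch of the algebra. It rests on assumptions that go beyond the text above: $x_{it}$ and $e_{it}$ are uncorrelated with $u_i$, the process is stationary with $|b_1| < 1$, and $\sigma_u^2$ (my notation) stands for the variance of $u_i$. Lagging the model by one period gives

$$y_{i(t-1)} = b_0 + b_1 y_{i(t-2)} + b_2 x_{i(t-1)} + u_i + e_{i(t-1)}$$

so that

$$\operatorname{Cov}\left(y_{i(t-1)}, u_i\right) = b_1 \operatorname{Cov}\left(y_{i(t-2)}, u_i\right) + \sigma_u^2 .$$

Under stationarity the two covariances are equal, and solving gives

$$\operatorname{Cov}\left(y_{i(t-1)}, u_i\right) = \frac{\sigma_u^2}{1 - b_1} .$$

Under these assumptions the covariance is zero only if $\sigma_u^2 = 0$, that is, only if there is no random intercept at all.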

Later I’ll discuss some solutions to this problem, but first let’s consider an example. I use the **wages** data set that is available on this website. It contains information on annual wages of 595 people for seven consecutive years. The data are in “long form”, so there’s a total of 4,165 records in the data set. I use Stata for the examples because there are good Stata commands for solving the problem.

Using the **xtreg** command, let’s first estimate a random intercepts model for **lwage** (log of wage) with the dependent variable lagged by one year, along with the time variable **t** and two predictors that do not change over time: **ed** (years of education) and **fem** (1 for female, 0 for male).

Here’s the Stata code:

**use "http://statisticalhorizons.com/wp-content/uploads/wages.dta", clear**

**xtset id t**

**xtreg lwage L.lwage ed fem t**

The **xtset** command tells Stata that this is a “cross-section time-series” data set with identification numbers for persons stored in the variable **id** and a time variable **t** that ranges from 1 to 7. The **xtreg** command fits a random-intercepts model by default, with **lwage** as the dependent variable and the subsequent four variables as predictors. **L.lwage** specifies the one-year lag of **lwage**.

Here’s the output:

    ------------------------------------------------------------------------------
           lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           lwage |
             L1. |   .8747517   .0085886   101.85   0.000     .8579183    .8915851
              ed |   .0108335   .0011933     9.08   0.000     .0084947    .0131724
             fem |    -.06705    .010187    -6.58   0.000    -.0870162   -.0470839
               t |   .0071965   .0019309     3.73   0.000     .0034119    .0109811
           _cons |   .7624068   .0491383    15.52   0.000     .6660974    .8587161
    -------------+----------------------------------------------------------------

When the dependent variable is logged and the coefficients are small, multiplying them by 100 gives approximate percentage changes in the dependent variable. So this model says that each additional year of schooling is associated with a 1 percent increase in wages and females make about 6 percent less than males. Each additional year is associated with about a 0.7 percent increase in wages. All these effects are dominated by the lagged effect of wages on itself, which amounts to approximately a 0.9 percent increase in this year’s wages for a 1 percent increase in last year’s wages.
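The multiply-by-100 rule comes from a series expansion, sketched here in my own notation, with $b$ standing for any coefficient in a model for a logged outcome. A one-unit increase in the predictor multiplies the outcome by $e^{b}$, so the exact percentage change is

$$100\left(e^{b} - 1\right) = 100\left(b + \frac{b^{2}}{2} + \frac{b^{3}}{6} + \cdots\right) \approx 100\,b \quad \text{when } |b| \text{ is small.}$$

For **ed**, with $b = 0.0108335$, the shortcut gives about 1.08 percent and the exact formula about 1.09 percent. For **fem**, with $b = -0.06705$, the shortcut gives $-6.7$ percent and the exact formula about $-6.5$ percent, so the gap widens as the coefficient grows.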

As I explained above, the lagged dependent variable gives us strong reasons to be skeptical of these estimates. Economists have developed a variety of methods for solving the problem, most of them relying on some form of instrumental variable (IV) analysis. For a discussion of how to implement IV methods for lagged dependent variables in Stata, see pp. 274-278 in Rabe-Hesketh and Skrondal (2012).

Personally, I prefer the maximum likelihood approach pioneered by Bhargava and Sargan (1983), which incorporates all the restrictions implied by the model in an optimally efficient way. Their method has recently been implemented by Kripfganz (2015) in a Stata command called **xtdpdqml**. This unwieldy set of letters stands for “cross-section time-series dynamic panel data estimation by quasi-maximum likelihood.”

Here’s how to apply **xtdpdqml** to the wage data:

**xtset id t**

**xtdpdqml lwage ed fem t, re initval(0.1 0.1 0.2 0.5)**

The **re** option specifies a random effects (random intercepts) model. By default, the command includes the lag-1 dependent variable as a predictor. The **initval** option sets the starting values for the four variance parameters that are part of the model. Here is the output:

    ------------------------------------------------------------------------------
           lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           lwage |
             L1. |   .4142827   .0230843    17.95   0.000     .3690383     .459527
              ed |   .0403258   .0031841    12.66   0.000     .0340851    .0465666
             fem |  -.2852665   .0271688   -10.50   0.000    -.3385164   -.2320166
               t |   .0533413   .0027533    19.37   0.000     .0479449    .0587378
           _cons |    3.25368   .1304816    24.94   0.000      2.99794    3.509419
    ------------------------------------------------------------------------------

Results are markedly different from those produced above by **xtreg**. The coefficient of the lagged dependent variable is greatly reduced, while the others show substantial increases in magnitude. An additional year of schooling now produces a 4 percent increase in wages rather than 1 percent. Females now make about 25 percent less than males (calculated as 100(exp(-.285)-1)), compared to 6 percent less. The annual increase in wages is about 5 percent instead of 0.7 percent.
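These percentages can be checked with Stata’s **display** command. What follows is only a sketch of the arithmetic, using the unrounded coefficients from the **xtdpdqml** output above. The exact formula 100(exp(b)-1) matters most for **fem**, whose coefficient is too large for the multiply-by-100 shortcut.

```stata
* Exact percentage change implied by a coefficient b when the outcome is logged:
* 100*(exp(b) - 1). Coefficients are copied from the xtdpdqml output above.
display 100*(exp(.0403258) - 1)     // ed:  roughly 4.1
display 100*(exp(-.2852665) - 1)    // fem: roughly -24.8
display 100*(exp(.0533413) - 1)     // t:   roughly 5.5
```

The trailing `//` comments are meant for a do-file. If you type the commands interactively, leave them off.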

So doing it right can make a big difference. Unfortunately, **xtdpdqml** has a lot of limitations. For example, it can’t handle missing data except by listwise deletion. With Richard Williams and Enrique Moral-Benito, I have been developing a new Stata command, **xtdpdml**, that removes many of these limitations. (Note that the only difference in the names of the two commands is the **q** in the middle.) It’s not quite ready for release, but we expect it out by the end of 2015.

To estimate a model for the wage data with **xtdpdml**, use

**xtset id t**

**xtdpdml lwage, inv(ed fem) errorinv**

The **inv** option is for time-invariant variables. The **errorinv** option forces the error variance to be the same at all points in time. Like **xtdpdqml**, this command automatically includes the dependent variable lagged by one time unit. Unlike **xtdpdqml**, **xtdpdml** can include longer lags and/or multiple lags.

Here is the output:

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          lwage2 |
          lwage1 |   .4088803   .0229742    17.80   0.000     .3638517     .453909
              ed |   .0406719   .0032025    12.70   0.000     .0343951    .0469486
             fem |  -.2878266    .027345   -10.53   0.000    -.3414218   -.2342315
    ------------------------------------------------------------------------------

Results are very similar to those for **xtdpdqml**. They are slightly different because **xtdpdml** always treats time as a categorical variable, but time was a quantitative variable in the earlier model for **xtdpdqml**.

If you’re not a Stata user, you can accomplish the same thing with any linear structural equation modeling software, as explained in my unpublished paper. As a matter of fact, the **xtdpdml** command is just a front-end to the **sem** command in Stata. But it’s a lot more tedious and error-prone to set up the equations yourself. That’s why we wrote the command.

By the way, although I’ve emphasized random effects models in this post, the same problem occurs in standard fixed-effects models. You can’t put a lagged dependent variable on the right-hand side. Both **xtdpdqml** and **xtdpdml** can handle this situation also.
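If you want to see the problem with data where the true answer is known, a small simulation is one way to do it. The do-file below is a minimal sketch, not part of the analysis above. All the settings are hypothetical choices of mine: 1,000 simulated persons, a true lag coefficient of 0.4, unit variances for the random intercept and the error, ten burn-in periods, and made-up variable names (**id**, **t**, **u**, **e**, **y**).

```stata
* Simulated dynamic panel with a KNOWN lag coefficient (all values hypothetical).
clear
set seed 2015
set obs 1000                        // 1,000 simulated persons
gen id = _n
gen u = rnormal()                   // random intercept u_i, constant within person
expand 17                           // 17 rows per person
bysort id: gen t = _n - 10          // t runs from -9 to 7
gen e = rnormal()                   // idiosyncratic error e_it
gen y = u + e if t == -9            // arbitrary starting value
* replace works through the observations in order, so y[_n-1] is already updated
bysort id (t): replace y = 0.4*y[_n-1] + u + e if t > -9   // true b1 = 0.4
keep if t >= 1                      // drop the burn-in periods, keep t = 1 to 7
xtset id t
xtreg y L.y, re                     // random intercepts with a lagged y
xtreg y L.y, fe                     // fixed effects with a lagged y
```

If the argument in this post is right, the random-effects estimate of the lag coefficient should typically land well above 0.4, and the fixed-effects estimate should typically land below it. The exact numbers will depend on the seed and on the settings chosen here.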

If you’d like to learn more about dynamic panel data models, check out my 2-day course on Longitudinal Data Analysis Using SEM. It will be offered again October 16-17, 2015, in Los Angeles.

__References__

Bhargava, A. and J. D. Sargan (1983) “Estimating dynamic random effects models from panel data covering short time periods.” *Econometrica* 51 (6): 1635-1659.

Rabe-Hesketh, Sophia, and Anders Skrondal (2012) *Multilevel and Longitudinal Modeling Using Stata. Volume 1: Continuous Responses*. Third Edition. StataCorp LP.