More on Causal Inference With Panel Data

This is a follow-up to last month’s post, in which I considered the use of panel data to answer questions about causal ordering: does x cause y or does y cause x? In the interim, I’ve done many more simulations to compare the two competing methods, Arellano-Bond and ML-SEM, and I’m going to report some key results here. If you want all the details, read my recent paper by clicking here. If you’d like to learn how to use these methods, check out my seminar titled Longitudinal Data Analysis Using SEM.

Quick review: The basic approach is to assume a cross-lagged linear model, with y at time t affected by both x and y at time t-1, and x at time t also affected by both lagged variables. The equations are

y_it = b₁x_i,t-1 + b₂y_i,t-1 + c_i + e_it

x_it = a₁x_i,t-1 + a₂y_i,t-1 + f_i + d_it

for i = 1,…, n, and t = 1,…, T.

The terms c_i and f_i represent individual-specific unobserved heterogeneity in both x and y. They are treated as “fixed effects”, thereby allowing one to control for all unchanging characteristics of the individuals, a key factor in arguing for a causal interpretation of the coefficients. Finally, e_it and d_it are assumed to represent pure random noise, independent of any variables measured at earlier time points. Additional exogenous variables could also be added to these equations.
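To make the model concrete, here is a minimal Python sketch that generates data from the two equations above. It is not the code used for the simulations reported in this post (those were run in SAS and Stata). The function name, the default parameter values, the 0.5 covariance between the fixed effects, and the burn-in period are all illustrative assumptions.

```python
import numpy as np

def simulate_cross_lagged(n=500, T=5, b1=0.25, b2=0.25, a1=0.25, a2=0.25,
                          sd_e=1.0, sd_d=1.0, burn_in=50, seed=0):
    """Generate x and y (each n-by-T) from the cross-lagged model.

    All defaults are hypothetical values chosen for illustration only.
    """
    rng = np.random.default_rng(seed)

    # Fixed effects c_i and f_i: unit variances, covariance 0.5 (assumed).
    fe_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    fe = rng.multivariate_normal([0.0, 0.0], fe_cov, size=n)
    c, f = fe[:, 0], fe[:, 1]

    x_prev = np.zeros(n)
    y_prev = np.zeros(n)
    xs, ys = [], []

    # The burn-in (an assumption) lets the process move away from its
    # arbitrary zero start before the T observed waves are recorded.
    for t in range(burn_in + T):
        e = rng.normal(0.0, sd_e, n)   # pure noise in the y equation
        d = rng.normal(0.0, sd_d, n)   # pure noise in the x equation
        y_new = b1 * x_prev + b2 * y_prev + c + e
        x_new = a1 * x_prev + a2 * y_prev + f + d
        x_prev, y_prev = x_new, y_new
        if t >= burn_in:
            xs.append(x_new)
            ys.append(y_new)

    return np.column_stack(xs), np.column_stack(ys)

x, y = simulate_cross_lagged()
print(x.shape, y.shape)
```

Note that a long burn-in only makes sense when the process is stable. With b₂ at or above 1 the series would grow without bound, so a sketch like this would need a short or zero burn-in for those cases.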

Conventional estimation methods are biased because of the lagged dependent variable and because of the reciprocal relationship between the two variables. The most popular solution is the Arellano-Bond (A-B) method (or one of its cousins), but I have previously argued for the use of maximum likelihood (ML) as implemented in structural equation modeling (SEM) software.
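One standard way to see the lagged-dependent-variable problem is to take first differences of the y equation. This is a sketch of the usual textbook argument under the assumptions stated earlier (e_it independent of everything measured at earlier times), not a derivation taken from the paper. Differencing removes c_i, but the differenced lag of y is still correlated with the differenced error, because y at time t-1 contains e at time t-1:

```latex
\begin{align*}
\Delta y_{it} &= b_1 \Delta x_{i,t-1} + b_2 \Delta y_{i,t-1} + \Delta e_{it},
  \qquad \Delta y_{it} = y_{it} - y_{i,t-1}, \\
\operatorname{Cov}(\Delta y_{i,t-1}, \Delta e_{it})
  &= \operatorname{Cov}(y_{i,t-1} - y_{i,t-2},\; e_{it} - e_{i,t-1})
   = -\operatorname{Var}(e_{i,t-1}) \neq 0 .
\end{align*}
```

By contrast, y at time t-2 and earlier is uncorrelated with the differenced error, which is why A-B uses lagged levels as instruments for the differenced equation.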

Last month I presented very preliminary simulation results showing that ML-SEM had substantially lower mean-squared error (MSE) than A-B under a few conditions. Since then I’ve done simulations for 31 different sets of parameter values and data configurations. For each condition, I generated 1,000 samples, ran the two methods on each sample, and then calculated bias, mean squared error, and coverage for confidence intervals. Since the two equations are symmetrical, the focus is on the coefficients in the first equation, b₁ for the effect of x on y, and b₂ for the effect of y on itself.

The simulations for ML were done with PROC CALIS in SAS. I originally started with the sem command in Stata, but it had a lot of convergence problems for the smaller sample sizes. The A-B simulations were done in Stata with the xtabond command. I tried PROC PANEL in SAS, but couldn’t find any combination of options that produced approximately unbiased estimates.

Here are some of the things I’ve learned:

Under every condition, ML showed little bias and quite accurate confidence interval coverage. That means that about 95% of the nominal 95% confidence intervals included the true value.
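For readers who want to compute coverage in their own simulations, here is a minimal Python sketch. It assumes Wald-type intervals (estimate plus or minus 1.96 standard errors), which is my assumption about how the intervals are formed. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def ci_coverage(estimates, std_errors, true_value, z=1.959964):
    """Share of replications whose Wald interval contains the true value."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return float(covered.mean())

# Toy check with made-up replications: estimates scattered around a true
# value of 0.25 with a standard deviation equal to the reported SE.
rng = np.random.default_rng(0)
fake_estimates = rng.normal(0.25, 0.1, size=1000)
fake_ses = np.full(1000, 0.1)
print(ci_coverage(fake_estimates, fake_ses, true_value=0.25))
```

When the estimator is unbiased and the standard errors are right, the printed proportion should be close to 0.95.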

Except under “extreme” conditions, A-B also had little bias and reasonably accurate confidence interval coverage.

However, compared with A-B, ML-SEM always showed less bias and smaller sampling variance. My standard of comparison is relative efficiency, which is the ratio of MSE for ML to MSE for A-B. (MSE is the sum of the sampling variance and the squared bias.) Across 31 different conditions, relative efficiency of the two estimators ranged from .02 to .96, with a median of .50. To translate, if the relative efficiency is .50, you’d need twice as large a sample to get the same accuracy with A-B as with ML.
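The calculation itself is simple. Here is a minimal Python sketch of it, using the definitions just given. The function names are hypothetical, and the sample-size multiplier rests on the assumption that MSE shrinks roughly in proportion to 1/n, which is only approximately true when bias is small.

```python
import numpy as np

def mse(estimates, true_value):
    """Sampling variance plus squared bias across replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value
    # ddof=0 so that variance + bias**2 equals the mean squared error exactly.
    variance = est.var()
    return variance + bias ** 2

def relative_efficiency(ml_estimates, ab_estimates, true_value):
    """Ratio of MSE for ML to MSE for A-B (values below 1 favor ML)."""
    return mse(ml_estimates, true_value) / mse(ab_estimates, true_value)

def sample_size_multiplier(rel_eff):
    """Approximate factor by which the A-B sample must grow to match ML.

    Assumes MSE is roughly proportional to 1/n.
    """
    return 1.0 / rel_eff

print(sample_size_multiplier(0.50))
```

By the same logic, a relative efficiency of .02 would correspond to needing roughly 50 times as many cases with A-B, although that extrapolation is rougher because A-B is badly biased in those conditions.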

Relative efficiency of the two estimators is strongly affected by the value of the parameter b₂, the effect of y at time t-1 on y at time t. As b₂ gets close to 1, the A-B estimators for both b₁ and b₂ become badly biased (toward 0), and their sampling variance increases, which is consistent with previous literature on the A-B estimator. For ML, on the other hand, bias and variance are rather insensitive to the value of b₂. Here are the numbers:

b₂	Rel Eff b₁	Rel Eff b₂
0	0.546207	0.8542228
0.25	0.509384	0.6652079
0.50	0.462959	0.5163349
0.75	0.202681	0.2357591
0.90	0.022177	0.0269079
1.0	0.058521	0.0820448
1.25	0.248683	0.4038526

Relative efficiency is strongly affected by the number of time points, but in the opposite direction for the two coefficients. Thus, relative efficiency for b₁ increases almost linearly as the number of time points goes from 3 to 10. But for b₂, relative efficiency is highest at T=3, declines markedly for T=4 and T=5, and then remains stable.

T	Rel Eff b₁	Rel Eff b₂
3	0.243653	0.9607868
4	0.398391	0.8189295
5	0.509384	0.6652079
7	0.696802	0.6444535
10	0.821288	0.6459828

Relative efficiency is also strongly affected by the ratio of the variance of c_i (the fixed effect) to the variance of e_it (the pure random error). In the next table, I hold constant the variance of c and vary the standard deviation of e.

SD(e)	Rel Eff b₁	Rel Eff b₂
0.25	0.234526	0.3879175
1.0	0.509384	0.6652079
1.5	0.551913	0.7790358
2	0.613148	0.7737681

Relative efficiency is not strongly affected by:

Sample size
The value of b₁
The correlation between c_i and f_i, the two fixed-effects variables.

Because ML is based on the assumption of multivariate normality, one might suspect that A-B would do better than ML if the distributions were not normal. To check that out, I generated all the variables using a 2-df chi-square variable, which is highly skewed to the right. ML still did great in this situation, and was still about twice as efficient as A-B.
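To give a sense of how non-normal a 2-df chi-square is, here is a minimal Python sketch that draws such a variable and rescales it to have mean 0 and a chosen standard deviation. The centering and rescaling are my assumptions for illustration, since the text above does not say how the chi-square draws were transformed, and the function name is hypothetical.

```python
import numpy as np

def skewed_noise(size, df=2, sd=1.0, rng=None):
    """Chi-square(df) draws, centered and scaled to mean 0 and SD `sd`."""
    rng = np.random.default_rng() if rng is None else rng
    raw = rng.chisquare(df, size=size)
    # A chi-square variable has mean df and variance 2*df.
    return sd * (raw - df) / np.sqrt(2 * df)

e = skewed_noise(200_000, rng=np.random.default_rng(0))
sample_skewness = np.mean((e - e.mean()) ** 3) / e.std() ** 3
print(e.mean(), e.std(), sample_skewness)
```

The theoretical skewness of a chi-square with 2 df is 2 (the same as an exponential distribution), compared with 0 for the normal, so this is a fairly demanding test of the normality assumption.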

In sum, ML-SEM outperforms A-B in every situation studied, by a very substantial margin.
