For several years now, I’ve been promoting something I called the “hybrid method” as a way of analyzing longitudinal and other forms of clustered data. My books *Fixed Effects Regression Methods for Longitudinal Data Using SAS *(2005) and *Fixed Effects Regression Models *(2009) both devoted quite a few pages to this methodology. However, recent research has led me to be a little more cautious about this approach, especially for logistic regression models.

Like other fixed effects methods, the hybrid method provides a way of controlling for *all* cluster-level covariates, whether observed or not. It is sometimes called the “poor man’s conditional likelihood” (Neuhaus and McCulloch 2006) or the “between-within (BW) method” (Sjölander et al. 2013). Actually, I like that second name so much that I’m going to start using it in all future teaching and writing.

First introduced by Neuhaus and Kalbfleisch (1998), the BW method embeds a fixed effects estimator within the framework of a random effects (mixed) model, thereby enabling one to reap several benefits of both approaches. It’s especially attractive for models like ordinal logistic regression or negative binomial regression where more standard fixed effects methods (like conditional likelihood) are not available.

LEARN MORE IN A SEMINAR WITH PAUL ALLISON

I’ll review the BW method in a moment, but first let me tell you why I’m writing this post. Despite the many attractions of the method, I’ve long been bothered by the fact that there was no proof that it did what was claimed—that is, provide consistent (and, hence, approximately unbiased) estimates of the causal parameters of interest. For linear models, the BW method produces exactly the same results as the standard fixed effects method, so that’s reassuring. But for logistic regression, the BW estimates are not identical to those produced by conditional likelihood (the gold standard)—close, but no cigar.

I’m glad to report that there’s been some real progress on this front in recent years, but not all of it is encouraging. Goetgeluk and Vansteelandt (2008) proved that the BW method produces consistent estimates in the linear case—something that was already pretty apparent. However, they also showed that the BW method could yield *biased* estimates for some nonlinear models, including logistic and Poisson regression. Fortunately, they also observed that the biases were typically small in most practical settings.

Brumback et al. (2010) also showed by simulation that the BW method, when applied to binary logistic regression, could produce inconsistent estimates of the within-cluster causal parameters, although, again, the biases were small. But in a later paper, Brumback et al. (2012) presented a simulation in which the downward bias was as high as 45%.

So, does that mean that we should no longer use the BW method for logistic regression? I don’t think so. To explain that, I need to get a little more technical. Let *Y _{ij }*be a binary (0,1) dependent variable for the

*j*th individual in the

*i*th cluster. Let

*X*be a column vector of predictor variables, again, for individual

_{ij }*i*in cluster

*j*. (If you’re not comfortable with vectors, just think of

*X*as a single variable). We postulate the model

logit(Pr(*Y _{ij}* = 1|

*X*)) =

_{ij}, α_{i}*μ*+

*βX*(1)

_{ij}+ α_{i}where *β* is a row vector of coefficients, and *α _{i}* is an unobserved “effect” of being in cluster

*i*. It can be thought of as representing the combined effects of all cluster-level variables that are not included in the model.

How can we consistently estimate *β*? Well, we could estimate a generalized linear mixed model, treating *α* as a normally distributed random variable with mean 0 and constant variance. This can be done in SAS with PROC GLIMMIX or in Stata with the **melogit** command. But standard implementations assume that *α* is independent of *X*, which means that we aren’t really *controlling* for *α* as a potential confounder. If *α* is actually correlated with *X*, you’ll get biased estimates of *β *using this method, and the bias could be severe.

A second approach is to use conditional likelihood, which is known to produce consistent estimates of *β* even if *α* is correlated with *X*. You can do this with the **clogit** command in Stata or with PROC LOGISTIC using the STRATA statement.

A third approach is the BW method, which says to decompose *X* into a within-cluster component and a between-cluster component. Letting *X _{i}* be a vector of cluster-specific means (sorry to put the bar underneath), we estimate the following generalized linear mixed model

logit(Pr(*Y _{ij}* = 1|

*X*,

_{ij}*X*)) =

_{i}, α_{i}*μ + β*(

_{W}*X*) +

_{ij}–*X*_{i}*β*

_{B}*+*

*X*_{i }*α*(2)

_{i}using standard software like GLIMMIX or **melogit**.

It’s long been believed that the estimate of *β _{W}*, because it depends only on within-cluster variation, controls for confounding with

*α*and produces consistent estimates of

*β*in equation (1). We now know that that’s not true, at least not in general. But there is one condition in which it

*is*true. Specifically, it’s true when

*α*in (1) is a linear function of

_{i}*X*, plus a random error that is independent of

_{i}*X*(Brumback et al. 2010)

_{ij}_{. }It’s reasonable to suppose, then, that the performance of the BW method for logistic regression depends, at least in part, on how closely this linearity assumption is satisfied. I’ll make use of that idea shortly.

Why not just do conditional likelihood, which we know will give consistent estimates under more general conditions? First, there are lots of non-linear models for which conditional likelihood is just not possible. Those models include binary probit, binary complementary log-log, cumulative (ordered) logit, and negative binomial regression.

Second, even for binary logit, there are many attractions of the BW method that conditional likelihood can’t provide: random slopes, a test for confounding, the ability to include cluster-level covariates, and the ability to allow for higher-levels of clustering.

As previously mentioned, all the available evidence suggests that the bias of the BW method is small in most situations. The simulation by Brumback et al. that yielded 45% downward bias was very extreme: the cluster-level effect was highly correlated with the *x*’s, it had an dominating effect on the outcome, and it depended on the *maximum* of the *x*’s across individuals, not their mean. Also, there were only two individuals per cluster, which tends to magnify any other sources of bias.

Here’s a modest proposal for how to reduce any biases that may occur. First, it’s easy to show that the following equation is the exact equivalent of model (2):

logit(Pr(*Y _{ij}* = 1|

*X*,

_{ij}*X*)) =

_{i}, α_{i}*μ + β*

_{W}*X*+ (

_{ij}*β*–

_{B}*β*)

_{W}*+*

*X*_{i }*α*(3)

_{i}This tells us that as long as the means are in the equation, you don’t have to express *X _{ij}* as deviations from those means.

To check or allow for nonlinearity, I suggest adding polynomial functions of the means to equation (3), such as

logit(Pr(*Y _{ij}* = 1|

*X*,

_{ij}*X*)) =

_{i}, α_{i}*μ + β*

_{W}*X*+

_{ij}*θ*

_{1}

*+*

*X*_{i }*θ*

_{2}

*X*_{i}^{2}+

*θ*

_{3}

*X*_{i}^{3}+

*α*(4)

_{i }You could even include other cluster-level functions of the *X _{ij}*s, such as their standard deviation. If the additional terms are not statistically significant, and if the estimate of

*β*doesn’t change much, then you can feel more confident that bias is not an issue.

_{W}To sum up, the BW/hybrid method is not quite as good as I once thought, at least not for non-linear models. But it’s certainly better than doing conventional random effects/mixed models, and usually comes pretty close to the mark. And, lastly, there may be ways to improve it.

**References**

Allison, PD (2005) *Fixed Effects Regression Methods for Longitudinal Data Using SAS. *Cary, NC: The SAS Institute.

Allison, PD (2009) *Fixed Effects Regression Models*. Thousand Oaks, CA: Sage Publications.

Brumback BA, AB Dailey, LC Brumback, MD Livingston, Z He (2010) “Adjusting for confounding by cluster using generalized linear mixed models.” *Statistics and Probability Letters* 80:1650–1654.

Brumback BA, HW Zheng, and AB Dailey (2012) “Adjusting for confounding by neighborhood using generalized linear mixed models and complex survey data.” *Statistics in Medicine *32: 1313–1324.

Goetgeluk, S and S Vansteelandt (2008) “Conditional generalized estimating equations for the analysis of clustered and longitudinal data.” *Biometrics* 64: 772-780.

Neuhaus, JM and JD Kalbfleisch (1998) “Between- and within-cluster covariate effects in the analysis of clustered data.” *Biometrics *54: (2) 638-645.

Neuhaus, JM and CE McCulloch (2006) “Separating between- and within-cluster covariate effects by using conditional and partitioning methods.” *Journal of the Royal Statistical Society Series B* 68: 859–872.

Sjölander, A, P Lichtenstein, H Larsson and Y Pawitan (2013) “Between–within models for survival analysis.” *Statistics in Medicine* 32: 3067–3076

## Comments

Thank you for an excellent post. In your Fixed Effects Regression Models book you indicate that the hybrid models in the context of survival outcomes did not work so well in some simulations you performed. What types of hybrid survival models did you evaluate? Was it similar to the models Sjölander et al. used in their paper, or was it different? Their model does seem to be somewhat atypical in how the between and within effects are decomposed (i.e., they simply use the group mean in the model and interpret that as the between effect).

Thanks,

Guy

Frankly, I don’t recall under what conditions the hybrid model didn’t work well for survival models. I haven’t had a chance to study the Sjolander paper carefully, but there’s no problem with just putting in the group mean, as I explain toward the end of the blog post. In that case, however, the coefficient for the group mean is the DIFFERENCE between the within and between effects.

It might not directly relate to your post. I read your SAGE green book on fixed effect model. I read the hybrid model as well from chapter 2. I understand that the trick is to make mean and sd of independent variables. But how if some of the independent variables are dummies (categorical variables) that vary within the panel?

Thank you

Dummy variables get treated just like any other variable–subtract the individual-specific mean to get a deviation score. Then include both the mean and deviation score in the model (but not the standard deviation). Despite this transformation, the interpretation of their coefficients is the same as if they were dummy variables.

Thank you Prof. Allison for your reply. I could apply the hybrid model on dummies now.

How about interaction effects? How to specify between-interaction effects and within-interaction effects? Thank you.

Interactions should be treated as just another variable. Let x and z be the two time-dependent variables. Before doing any transformations, create the product of x and z. Calculate the individual specific mean of xz–that’s your between component. Then subtract that mean from xz. That’s your within component.

What if you are interested in examining the interaction between a time-varying and a time-invariant predictor? Can you just include the cross-products in the model (e.g., deviationtimevarying x time-invariant meantimevarying x time-invariant)? Or do you similarly need to first create the product and then calculate the specific mean and deviation?

First create the product, then the means and deviations of that product.

I have a related question. What to do when you are interested in including time-varying dummy (i.e., xit) with its ex ante (i.e., xit-1) or ex post (i.e., xit+1) effect? The means of the variables highly coorelated each other and they might suffer from multicollinearity.

For any panel or time series data, multicollinearity is always a likely problem with multiple lags and leads. I don’t know of any general solution. But if you are just concerned about multicollinearity among the means, keep in mind that we are usually uninterested in the coefficients of the means in a hybrid (between-within) model. And multicollinearity only affects the variables that are collinear.

Hello Professor Allison,

We currently only have the between- and within-components of x and z in our models and want to specify an interaction for x and z as you explain above. In this case, will we also have to include the main effects for x and z (in addition to their within- and between- effects which we already have)? Or do we need to include the main effect for “xz”?

The standard approach is to treat xz as “just another variable”. That is, the between component is the individual-specific mean of xz, and the within component subtracts this mean from xz. However, there are some potential problems with this approach–see the article at https://doi.org/10.1177/0049124120914934.

Dr. Allison, Thank you very much for your great post and SAS book.

If “within” predictor X has four groups,should we create three dummy variables (X1,X2,X3) and then calculate cluster-specific mean(Mx1,Mx2,Mx3)?

In this way, the final model would have three terms related to X(X1,X2,X3) and three terms related to mean(Mx1,Mx2,Mx3). Am I right? Thank you so much.

Yes, that is correct.

Hi, Dr. Allison, I have read your book on hybrid model for ordinal logistic regression with stata. You mentioned ‘by calculating person-specific means for each of the predictor variables and then calculating deviations from those means. Both mean variables and deviation variables are included as predictors in the cumulative logit model’, while some of your independent variables are categorical variables, is it appropriate to calculate means and deviations for them? Looking forward to your reply. Thanks.

Olivia

Yes, treat them just like any other variable.

Dear Dr. Allison, Thank you for your excellent and very helpful post. I am exploring the association between a continuous variable “x” and time-to-event data “y”, utilizing survival analysis on siblings in order to estimate a quasi-causal effect between x and y. When treating siblings “as individuals” (but modeling a shared frailty for family cluster, where most clusters are of two individuals), there are significant linear (main effect for x) and non-linear (X^2) associations between x and y (survival time). I would like to test these linear and non-linear associations in a “between-within” survival model, where siblings are compared only to each other. I have two questions:

1. Is it correct to build the model as follows: calculate the family-specific mean of x (“Mx”) and then subtract Mx from each individual’s x (“Dx”). Create the squared term for (i.e. product of) x (i.e. x*x=“xsq”) then calculate the family-specific mean of xsq (“Mxsq”) and subtract Mxsq from each individual’s xsq (“Dxsq”). So, in addition to any covariates, the “between-within” survival model would include 4 total variables: Mx Dx Mxsq Dxsq? Have I understood this correctly?

2. If Dxsq (within-family non-linear effect) is non-significant and subsequently removed from the model for the sake of parsimony (but Mx Mxsq and Dx are significant and kept in the model), does Dx still have its normal ‘fixed effect’ interpretation in this model (i.e. Dx would indicate the within-family effect of x on survival, and it assumes that this effect is linear)? Would it be appropriate in this case to keep the “between” effect decomposed here into linear (Mx) and non-linear (Mxsq) effects, or would you recommend also removing the Mxsq term in this case? Thank you very much for your thoughts.

Dear Dr. Allison, thank you for this very helpful post. I am wondering how to model linear and non-linear (e.g. x^2) associations between x (continuous) and y (time-to-event data) in a BW survival analysis model, where siblings are nested within families. Would it be correct to: calculate the family-specific mean of x (“Mx”) and subtract Mx from each individual’s x (“Dx”). Create x squared (x*x = “xsq”), then, calculate family-specific mean of xsq (“Mxsq”) and subtract Mxsq from each individual’s xsq (“Dxsq”)? Thus, there would be four terms in the BW survival model (Mx Dx Mxsq Dxsq) plus covariates? If Mx, Mxsq, and Dx are significant but Dxsq is not, and Dxsq is subsequently removed from model, would it be appropriate to keep Mxsq in the model? Would Dx still be interpreted as the within-family linear effect of x on survival? Thank you for your insights.

1. Yes, your proposed method is correct.

2. Yes, if you delete Dxsq, Dx would still have the within family interpretation. And you can delete Dxsq but leave in Mxsq.

Finally, you might want to consider trying a conditional likelihood approach by simply running a Cox regression x and xsq but stratifying on family. I’m guessing that the between-within method has the same limitations for survival analysis as for logistic regression.

Dear Dr. Allison, Thank you very much for your reply. So, in general, if we get consistent/similar results for the Cox model stratified on family and for the ‘within-‘ term in the BW model, we can be less concerned about the potential for bias in the BW survival model? Thank you

Yes. And it’s also easier to do.

Dear Dr. Allison,

First I like to thank you for your great blog, and also for the Sage book on fixed effects. These were really helpful to make things clear about the between-within model. Reading this blog on problems with the BW model for logistic, raises a question. You said that in model (1) the standard implementations assume independence of the cluster effect a.i and the independent X.ij. However, I always thought that for linear random effect models, the random effects should also be independent of predictor variables. Therefore, I now do not understand why this would cause a problem only for logistic models. Thanks in advance,

Ben.

No, this is a problem for linear random effects models as well. I only mention that here as an introduction to other methods that could be used to solve this problem, like conditional likelihood and the BW method. We know that conditional likelihood yields consistent estimates in both linear and logistic models. The BW method does so in linear models. But for logistic models, it’s consistent only under certain conditions.