Do We Really Need Zero-Inflated Models?
August 7, 2012 By Paul Allison
For the analysis of count data, many statistical software packages now offer zero-inflated Poisson and zero-inflated negative binomial regression models. These models are designed to deal with situations where there is an “excessive” number of individuals with a count of 0. For example, in a study where the dependent variable is “number of times a student had an unexcused absence”, the vast majority of students may have a value of 0.
Zero-inflated models have become fairly popular in the research literature: a quick search of the Web of Science for the past five years found 499 articles with “zero inflated” in the title, abstract or keywords. But are such models really needed? Maybe not.
It’s certainly the case that the Poisson regression model often fits the data poorly, as indicated by a deviance or Pearson chi-square test. That’s because the Poisson model assumes that the conditional variance of the dependent variable is equal to the conditional mean. In most count data sets, the conditional variance is greater than the conditional mean, often much greater, a phenomenon known as overdispersion.
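A crude first screen for overdispersion is to compare the sample variance of the counts with the sample mean. The sketch below does this in plain Python for the marginal distribution only. The Poisson assumption concerns the conditional variance given the predictors, so this is a rough check, not a substitute for the deviance or Pearson tests. The function name and the example counts are made up for illustration.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Ratio of sample variance to sample mean (roughly 1 under a Poisson)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; the ratio is undefined")
    return variance(counts) / m

# Made-up absence counts for 12 students (illustration only, not real data).
absences = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 5, 9]
ratio = dispersion_index(absences)
print(f"variance / mean = {ratio:.2f}")
```

A ratio well above 1 suggests that a plain Poisson model is likely to fit poorly, but it does not by itself say which alternative model is appropriate.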
The zero-inflated Poisson (ZIP) model is one way to allow for overdispersion. This model assumes that the sample is a “mixture” of two sorts of individuals: one group whose counts are generated by the standard Poisson regression model, and another group (call them the absolute zero group) who have zero probability of a count greater than 0. Observed values of 0 could come from either group. Although not essential, the model is typically elaborated to include a logistic regression model predicting which group an individual belongs to.
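The mixture described above can be written out directly. The following is a minimal sketch in plain Python, assuming a single Poisson mean `mu` and a single mixing probability `pi` for the absolute zero group (no covariates). Both names and the numeric values are chosen here for illustration. It also shows why the mixture implies overdispersion: the ZIP variance exceeds the ZIP mean whenever `pi` is greater than 0.

```python
import math

def poisson_pmf(k, mu):
    """Standard Poisson probability of observing count k with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def zip_pmf(k, mu, pi):
    """Zero-inflated Poisson probability of count k.

    pi is the share of the 'absolute zero' group; mu is the Poisson mean
    for everyone else. Both names are notation chosen for this sketch.
    """
    if k == 0:
        # Zeros can come from either group.
        return pi + (1 - pi) * poisson_pmf(0, mu)
    # Positive counts can only come from the Poisson group.
    return (1 - pi) * poisson_pmf(k, mu)

mu, pi = 2.0, 0.3  # illustrative values, not estimates from any data
zip_mean = (1 - pi) * mu
zip_var = (1 - pi) * mu * (1 + pi * mu)  # larger than the mean when pi > 0
print(zip_pmf(0, mu, pi), zip_mean, zip_var)
```

In the elaborated model, `pi` would itself depend on predictors through a logistic regression, and `mu` on predictors through a log link.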
In cases of overdispersion, the ZIP model typically fits better than a standard Poisson model. But there’s another model that allows for overdispersion, and that’s the standard negative binomial regression model. In all data sets that I’ve examined, the negative binomial model fits much better than a ZIP model, as evaluated by AIC or BIC statistics. And it’s a much simpler model to estimate and interpret. So if the choice is between ZIP and negative binomial, I’d almost always choose the latter.
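For readers who want to make this comparison themselves, AIC and BIC can be computed from each model's maximized log-likelihood and parameter count. The sketch below uses the usual definitions. The log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not results from any real data set.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion; smaller is better."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; smaller is better."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical placeholder values: replace with output from your software.
fits = {
    "ZIP": {"loglik": -1510.0, "n_params": 6},
    "NB": {"loglik": -1495.0, "n_params": 4},
}
n_obs = 1000  # hypothetical sample size

for name, fit in fits.items():
    print(name, aic(fit["loglik"], fit["n_params"]),
          bic(fit["loglik"], fit["n_params"], n_obs))

best = min(fits, key=lambda name: aic(fits[name]["loglik"], fits[name]["n_params"]))
```

Because ZIP and negative binomial models are not nested in each other, information criteria of this kind are the natural basis for comparing them.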
But what about the zero-inflated negative binomial (ZINB) model? It’s certainly possible that a ZINB model could fit better than a conventional negative binomial regression model. But the latter is a special case of the former, so it’s easy to do a likelihood ratio test to compare them (by taking twice the positive difference in the log-likelihoods).* In my experience, the difference in fit is usually trivial.
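The likelihood ratio test can be sketched in a few lines of plain Python. This version assumes the ZINB model adds exactly one parameter to the negative binomial model (a constant zero-inflation probability), so the naive reference distribution is chi-square with 1 degree of freedom. The footnote notes that the chi-square distribution may need adjusting because the restriction is on the boundary. A commonly used adjustment in the one-parameter case is to halve the naive p-value, and that is what `p_boundary` does here. That choice, the function names, and the log-likelihood values are assumptions of this sketch.

```python
import math

def lr_statistic(loglik_restricted, loglik_full):
    """Twice the positive difference in log-likelihoods."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def lr_test_zinb_vs_nb(loglik_nb, loglik_zinb):
    """Return (statistic, naive p-value, boundary-adjusted p-value)."""
    stat = max(lr_statistic(loglik_nb, loglik_zinb), 0.0)
    p_naive = chi2_sf_1df(stat)
    # 50:50 mixture of a point mass at 0 and chi-square(1): halve the p-value.
    p_boundary = 0.5 * p_naive if stat > 0 else 1.0
    return stat, p_naive, p_boundary

# Hypothetical log-likelihoods, for illustration only.
stat, p_naive, p_boundary = lr_test_zinb_vs_nb(-1000.0, -998.0)
print(stat, p_naive, p_boundary)
```

If the zero-inflation part of the model includes predictors, more than one parameter is restricted and the appropriate reference distribution is more complicated than this one-parameter case.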
Of course, there are certainly situations where a zero-inflated model makes sense from the point of view of theory or common sense. For example, if the dependent variable is the number of children ever born to a sample of 50-year-old women, it is reasonable to suppose that some women are biologically sterile. For these women, no variation in the predictor variables (whatever they might be) could change the expected number of children.
So next time you’re thinking about fitting a zero-inflated regression model, first consider whether a conventional negative binomial model might be good enough. Having a lot of zeros doesn’t necessarily mean that you need a zero-inflated model.
You can read more about zero-inflated models in Chapter 9 of my book Logistic Regression Using SAS: Theory & Application. The second edition was published in April 2012.
*William Greene (Functional Form and Heterogeneity in Models for Count Data, 2007) claims that the models are not nested because “there is no parametric restriction on the [zero-inflated] model that produces the [non-inflated] model.” This is incorrect. A simple reparameterization of the ZINB model allows for such a restriction. So a likelihood ratio test is appropriate, although the chi-square distribution may need some adjustment because the restriction is on the boundary of the parameter space.
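One way to see the nesting is to treat the zero-inflation probability as a parameter in its own right rather than through a logit link. This is a sketch of the idea for the case of a constant zero-inflation probability. The symbols π, μ, α, γ, and f_NB are notation introduced here and may not match the reparameterization the footnote has in mind.

```latex
\Pr(Y = y) =
\begin{cases}
\pi + (1-\pi)\, f_{\mathrm{NB}}(0;\, \mu, \alpha), & y = 0, \\[4pt]
(1-\pi)\, f_{\mathrm{NB}}(y;\, \mu, \alpha), & y = 1, 2, \dots
\end{cases}
\qquad 0 \le \pi \le 1 .
```

Setting π = 0 gives back the ordinary negative binomial model. Under the usual logit parameterization, π = 1/(1 + e^{-γ}), the same restriction corresponds to γ tending to minus infinity, which no finite value of γ achieves. Written in terms of π, the restriction is a finite one, but it sits at the edge of the interval [0, 1], which is why the boundary issue arises.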