For the analysis of count data, many statistical software packages now offer zero-inflated Poisson and zero-inflated negative binomial regression models. These models are designed to deal with situations where there is an “excessive” number of individuals with a count of 0. For example, in a study where the dependent variable is “number of times a student had an unexcused absence”, the vast majority of students may have a value of 0.

Zero-inflated models have become fairly popular in the research literature: a quick search of the Web of Science for the past five years found 499 articles with “zero inflated” in the title, abstract or keywords. But are such models really needed? Maybe not.


It’s certainly the case that the Poisson regression model often fits the data poorly, as indicated by a deviance or Pearson chi-square test. That’s because the Poisson model assumes that the conditional variance of the dependent variable is equal to the conditional mean. In most count data sets, the conditional variance is greater than the conditional mean, often much greater, a phenomenon known as overdispersion.

The zero inflated Poisson (ZIP) model is one way to allow for overdispersion. This model assumes that the sample is a “mixture” of two sorts of individuals: one group whose counts are generated by the standard Poisson regression model, and another group (call them the absolute zero group) who have zero probability of a count greater than 0. Observed values of 0 could come from either group. Although not essential, the model is typically elaborated to include a logistic regression model predicting which group an individual belongs to.
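To make the mixture concrete, here is a minimal Python sketch of the ZIP probability function. The names `zip_pmf`, `pi` (the share of absolute zeros) and `lam` (the Poisson mean for everyone else) are illustrative and not taken from any package, and the parameter values are made up.

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) for a zero-inflated Poisson: a share `pi` of absolute zeros
    mixed with a Poisson(lam) group. All names here are illustrative."""
    # Poisson probability computed on the log scale to avoid overflow.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        # A zero can come from either group.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson

def zip_moments(pi, lam, kmax=100):
    """Mean and variance by direct summation over k = 0..kmax."""
    probs = [zip_pmf(k, pi, lam) for k in range(kmax + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

# Hypothetical values: 30% absolute zeros, Poisson mean of 2 for the rest.
mean, var = zip_moments(pi=0.3, lam=2.0)
print(zip_pmf(0, 0.3, 2.0), mean, var)
```

With these values the mean is (1 − pi)·lam = 1.4 and the variance is (1 − pi)·lam·(1 + pi·lam) = 2.24, so the mixture by itself produces a variance larger than the mean.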

In cases of overdispersion, the ZIP model typically fits better than a standard Poisson model. But there’s another model that allows for overdispersion, and that’s the standard negative binomial regression model. In all data sets that I’ve examined, the negative binomial model fits much better than a ZIP model, as evaluated by AIC or BIC statistics. And it’s a much simpler model to estimate and interpret. So if the choice is between ZIP and negative binomial, I’d almost always choose the latter.

But what about the zero-inflated negative binomial (ZINB) model? It’s certainly possible that a ZINB model could fit better than a conventional negative binomial regression model. But the latter is a special case of the former, so it’s easy to do a likelihood ratio test to compare them (by taking twice the positive difference in the log-likelihoods).* In my experience, the difference in fit is usually trivial.
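The likelihood ratio calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ZINB has an intercept-only inflation equation, so exactly one parameter is restricted; the log-likelihoods are hypothetical numbers; and the boundary adjustment shown (halving the chi-square p-value) is one common choice for a single boundary restriction, not necessarily the right one for richer models. The function name is made up.

```python
import math

def lr_test_one_restriction(ll_restricted, ll_full):
    """Likelihood ratio test when the restricted model fixes ONE parameter.
    Returns (statistic, naive p-value, boundary-adjusted p-value)."""
    # Twice the (non-negative) difference in log-likelihoods.
    stat = max(2.0 * (ll_full - ll_restricted), 0.0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_naive = math.erfc(math.sqrt(stat / 2.0))
    # Assumed boundary adjustment: 50:50 mixture of a point mass at 0 and
    # chi-square(1), which halves the p-value whenever the statistic is positive.
    p_boundary = 0.5 * p_naive if stat > 0 else 1.0
    return stat, p_naive, p_boundary

# Hypothetical log-likelihoods for a fitted NB and a fitted ZINB.
ll_nb, ll_zinb = -1502.3, -1501.1
stat, p_naive, p_boundary = lr_test_one_restriction(ll_nb, ll_zinb)
print(stat, p_naive, p_boundary)
```

If the inflation equation contains covariates, more than one parameter is restricted and the reference distribution is more complicated than this sketch assumes.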

Of course, there are certainly situations where a zero-inflated model makes sense from the point of view of theory or common sense. For example, if the dependent variable is number of children ever born to a sample of 50-year-old women, it is reasonable to suppose that some women are biologically sterile. For these women, no variation on the predictor variables (whatever they might be) could change the expected number of children.

So next time you’re thinking about fitting a zero-inflated regression model, first consider whether a conventional negative binomial model might be good enough. Having a lot of zeros doesn’t necessarily mean that you need a zero-inflated model.

You can read more about zero-inflated models in Chapter 9 of my book *Logistic Regression Using SAS: Theory & Application*. The second edition was published in April 2012.

*William Greene (*Functional Form and Heterogeneity in Models for Count Data*, 2007) claims that the models are not nested because “there is no parametric restriction on the [zero-inflated] model that produces the [non-inflated] model.” This is incorrect. A simple reparameterization of the ZINB model allows for such a restriction. So a likelihood ratio test is appropriate, although the chi-square distribution may need some adjustment because the restriction is on the boundary of the parameter space.

## Comments

What percentage of zeros is needed in the outcome to use a zero-inflated model?

The percentage of zeros is not relevant. A standard negative binomial model can handle a high percentage of zeros. What’s relevant is whether you have a theory that says that some substantial fraction of the zeros come from individuals who are “absolute zeros” and for whom the covariates have no effect on the propensity to experience events.

What a robust discussion! Allison, your blog was so engaging…Good job!

In one of your comments (June 17, 2015, at 9:09 am), you said “there’s nothing in the data that will tell you whether some zeros are structural, and some are sampling. That has to be decided on theoretical grounds.” I am curious to know whether you have, by chance, stumbled on how the ZIP can be used to identify the true zero and imputed zero counts.

Also, can you explain why the zero counts are modeled using logistic regression?

Thanks

Once you estimate a zero-inflated model, for each individual you can generate an estimated probability that the individual is a structural zero. But that’s only an estimate and, at best, it’s only a probability.

Why logistic regression? Well, that’s just a common and convenient way of modeling a binary outcome, in this case, whether or not an individual is a structural zero. But, in principle, it could be a probit model or something else entirely.
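As a rough illustration of that estimated probability for a ZIP model, the following Python sketch applies Bayes’ rule to the two-group mixture. The function name and arguments are hypothetical: `gamma_xb` stands for the linear predictor from the logistic part and `lam` for the fitted Poisson mean, both of which would come from whatever software fitted the model.

```python
import math

def prob_structural_zero(y, gamma_xb, lam):
    """Estimated probability that an individual is a structural zero,
    given the observed count y and fitted ZIP quantities (illustrative names)."""
    if y > 0:
        # A positive count can only come from the Poisson group.
        return 0.0
    # Logistic model for membership in the structural-zero group.
    pi = 1.0 / (1.0 + math.exp(-gamma_xb))
    # Bayes' rule: structural zeros as a share of all expected zeros.
    return pi / (pi + (1.0 - pi) * math.exp(-lam))

# Hypothetical fitted values for two individuals who both reported 0.
print(prob_structural_zero(0, gamma_xb=0.0, lam=2.0))
print(prob_structural_zero(0, gamma_xb=0.0, lam=0.1))
```

The smaller the fitted Poisson mean, the less an observed zero says about group membership, which is why the result is only ever a probability.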

Hello Paul,

I read all discussions here and I do appreciate your kindness to address all questions. I also learned a lot from others.

Currently I am building a predictive model using ZINB. The response variable is a revenue amount. To put it in count format, the SAS INT function was used. By its nature, 70% of the amounts are zero. Do you think the ZINB approach is reliable for this purpose?

Further, could a zero-inflated gamma model be an alternative, with some minor transformation of the revenue amount (0 + 0.1)?

Thanks a lot for your kindness.

Young

For advice on dealing with these kinds of situations, I recommend this book: Economic Evaluation in Clinical Trials (Handbooks in Health Economic Evaluation) 2nd Edition

I’m conducting a simulation study in which I’m trying to examine the fit of these models: Poisson, NB, ZIP, ZINB, HP, and HNB.

Surprisingly, I notice that when the true model is ZINB (the pseudo-population is ZINB), in the vast majority of the scenarios for proportion of zeros and overdispersion the NB provides a lower AIC than the ZINB. Furthermore, the hurdle NB provides the lowest AIC in basically every scenario. Can someone explain this to me? Am I making a mistake somewhere, or what do you think is the reason? We would assume that if the true model or pseudo-population follows a ZINB distribution, then fitting a ZINB to the data should give the lowest AIC. However, this is not the case in my study.

What happens with the BIC? It’s guaranteed to select the correct model with a sufficiently large sample. That’s not true of the AIC. On the other hand, BIC penalizes the additional parameters in the ZINB more than the AIC, so I wouldn’t expect the AIC to go for the more parsimonious NB model.

Dear Paul,

thanks for this useful article.

A reviewer asked me to test a ZIP model on my dependent variable (a binary variable with 85% of zero values) instead of my logit model. I am under the impression that this wouldn’t be correct, given the count nature of ZIP dependent variables, am I right?

Thanks!

You’re right, that would not be correct. In fact, it wouldn’t even work.

Thank you! Is there a model for binary variables that I could use instead to account for the high number of zero values?

See my other blog post–https://statisticalhorizons.com/logistic-regression-for-rare-events

Hi Paul.

Thanks for your invigorating discussion. I am currently working on a project in which I use survey data. In my project, I am trying to model the treatment delay behavior of persons who suffered an illness or injury. My dependent variable is ‘Treatment_delay’, which has a lot of zeros (roughly one third) among 35000 observations. This variable runs from 0 upward, where 0 means no delay. I am using demographic profiles and some health indicators (previous illness history, hospitalization records, transport cost for reaching the healthcare provider, etc.).

I am using Poisson and negative binomial regression to model this. I don’t see ‘no treatment delay’ (which means 0 days) as being caused by two separate processes, since only people who suffered an illness or injury in the last 30 days went to healthcare providers. That made me think I should not use ZIP or ZINB models.

I also want to categorize my dependent variable into 3 groups (less than a day (less negligence), 1-7 days (moderate negligence), more than 7 days (very negligent) before going to healthcare providers) so that I can use ordered logit or ordered probit.

I was wondering 1) whether I am right or wrong in my thinking process..2) whether ZIP or ZINB is required?

Thanks in Advance!

This isn’t really a count variable, so I probably wouldn’t go with Poisson or negative binomial. I prefer your suggestion to categorize the dependent variable and do ordered logit or probit.

I just noticed your blog post. Interestingly, in 2005 and 2007, I wrote two well-received (and cited) papers that described fundamental issues with the use of zero-inflated models. Some of which you already discussed in your blog. I put the link to the pre-print below each reference.

Lord, D., S.P. Washington, and J.N. Ivan (2005) Poisson, Poisson-Gamma and Zero Inflated Regression Models of Motor Vehicle Crashes: Balancing Statistical Fit and Theory. Accident Analysis & Prevention. Vol. 37, No. 1, pp. 35-46. (doi:10.1016/j.aap.2004.02.004)

https://ceprofs.civil.tamu.edu/dlord/Papers/Lord_et_al._AA&P_03225_March_24th.pdf

Lord, D., S.P. Washington, and J.N. Ivan (2007) Further Notes on the Application of Zero Inflated Models in Highway Safety. Accident Analysis & Prevention, Vol. 39, No. 1, pp. 53-57. (doi:10.1016/j.aap.2006.06.004)

https://ceprofs.civil.tamu.edu/dlord/Papers/Lord_et_al_2006_Zero-Inflated_Models.pdf

I also proposed a new model for analyzing datasets with a large proportion of zero responses.

Geedipally, S.R., D. Lord, S.S. Dhavala (2012) The Negative Binomial-Lindley Generalized Linear Model: Characteristics and Application using Crash Data. Accident Analysis & Prevention, Vol. 45, No. 2, pp. 258-265. (doi:10.1016/j.aap.2011.07.012)

https://ceprofs.civil.tamu.edu/dlord/Papers/Geedipally_et_al_NB-Lindley_GLM.pdf

Thanks, your arguments seem very consistent with my post.

Excellent discussion, Paul. I have similar concern to the previous post. I include unit and time fixed-effects in my testing of a government program on crime outcomes (I observe districts over time). The crime I observe is extremely rare, with some districts going many month-years without experiencing one single event; others however, experience many of them.

Question 1: Is there any benefit to modeling counts of crime events with only an intercept for the inflation component? I am generally not a fan of zero-inflated models since they are computationally difficult in applied work, especially with many fixed effects.

Question 2: Poisson models with counts of events over several months show evidence of overdispersion. However, when taking into account a longer time series (e.g., counts over 100 months), then the standard Poisson performs better (i.e., little overdispersion). Is observing differences of this sort (i.e., less dispersion with more data) a violation of Poisson assumptions, such that the rate is changing through time? Or, is it that I have more variation with a shorter time series, and so the conditional variance might be larger? Thoughts or similar experiences?

Q1: Maybe. First of all, I would rarely consider a ZIP model because a conventional NB model will almost always fit better. A ZINB model with just an intercept might be useful in some settings. However, consider what you are assuming–that there is a sub-group of districts whose latent crime rate is absolutely zero, and the covariates are unrelated to whether a district is in this subgroup or not.

Q2: I’ve more commonly observed the reverse pattern, that longer intervals tend to show more overdispersion. What you’re seeing suggests that there are many factors affecting the crime rate in a district that are time-specific, but that tend to average out over longer periods of time.

Thank you. I agree that this is a difficult assumption to make. Can any time-invariant factors go into the zero-inflation component if the ‘count’ component has a series of district fixed effects? I’m curious if an offset or population density can go into the zero component without it becoming too intractable. Any information is helpful.

I think so, but I haven’t tried it.

Hello,

I want to use zero-inflated models in one of my papers. But I have encountered difficulties, or at least doubts, about how to estimate this kind of model. I use Stata to estimate the ZIP and ZINB models. For the moment there is no command that takes the panel structure into account. There are “zip” and “zinb” commands in Stata, but I don’t think they take into account the panel structure of my data. For example, the Stata zip command is the following: “zip depvar indepvar, inflate (varlist)”.

The problem is that I want to take account of my panel structure because I need to introduce fixed effects.

Is it correct to write my command like this: “xi: zip depvar indepvar i.countryeffect, inflate (varlist i.countryeffect)”

I was wondering if you would have any recommendations for me on this. I have long been on stata forums but unfortunately I have not had a clear answer on this subject.

If you read my post, you’ll know that I’m not a huge fan of zip or zinb. But if you are determined to use this method, what you can do in Stata for panel data is (a) request clustered standard errors and (b) do fixed effects by including country-specific dummies, as you suggest. However, I wouldn’t put the country dummies in the inflate option. I think that will overfit the data. And the xi: prefix isn’t necessary in recent versions of Stata. So the command would look like this:

zip depvar indepvar i.countryeffect, inflate(varlist)

Hello dear Dr. ALLISON

Sorry if I am asking an irrelevant question.

I’m working on a set of highway accident data with overdispersion that contains a lot of zeros. I tried four goodness-of-fit measures (AIC, BIC, LL Chi2 and McFadden’s R2) to choose the best-fitting model (among NB, ZINB and ZIP) in each data set, but there is a problem: the chosen model is different for each measure. For example, AIC and BIC always tend to choose the NB or ZINB (NB most of the time), while LL Chi2 and McFadden’s R2 tend to choose ZIP most of the time. The Vuong test most of the time favors the zero-inflated model, and I’m confused about which model is best to choose!

I use STATA 15 software and I have 306 number of input samples for each data set; 9 independent and 1 dependent variable. The correlation between the Independent variables are checked but there are 3 exceptions (A little more than 0.2 Pearson correlation coefficient).

And in 2 sets of data, there is a convergence problem error when running the model.

It’s appreciated to have your comment.

Yours faithfully,

Mahyar

As I tried to make clear in my post, I generally disapprove of the use of zero-inflated models merely to deal with overdispersion and a lot of zeroes. Unless you have some theory or strong substantive knowledge that supports the assumptions of zero-inflated models, I would stick with the negative binomial.

Dr. Allison,

Thanks for so generously sharing your knowledge with us.

I am working on data on the number of questions asked by legislators in a particular setting. No legislator has a zero probability of having a count greater than zero. But 59% of legislators asked zero questions.

I was running a ZINB model with clustered standard errors (for parties). Several people suggested that I drop the clustered standard errors and use random effects because some of my groups (six) have relatively few observations. I use Stata and can run an NBREG with random effects but not a ZINB with random effects. But I was worried about including the random effects because I would have to move from a ZINB to an NBREG. After reading your post, it seems that this should not be such a problem despite the excess zeros, and it would be better because I could use the random intercepts. Do you agree that moving to an NBREG with random intercepts would be OK?

Thanks in advance for your reply.

If you only have six groups, that’s not enough for either clustered standard errors or random effects. I would do fixed effects via dummy variables for parties. I don’t see any obvious reason to prefer ZINB over NBREG. The fact that you have 59% with a 0 is not, in itself, any reason to prefer ZINB.

Dr. Allison,

Thanks for this great post. I’m working on a study to see if adolescents who have had a mental health visit prior to parental deployment see an increase in visits as their parents get deployed. We are considering using Proc Genmod with dist=negbin and GEE repeated measures analysis using Repeated child(parent). However over 70% of children have no further visits. Is it appropriate to use repeated measures when so many have zeros?

Thanks,

I don’t see any obvious problem here. But it’s not clear to me in what way the measures are repeated. Is it because there are multiple children per parent?

Thanks for the quick response! We are measuring the number of visits per child over deployment and non-deployment periods.

In that case, I think you should be OK. But you may want to consider a fixed effects negative binomial model, as described in my book Fixed Effects Regression Methods for Longitudinal Data Using SAS.

Very interesting post! I was brought to this page because I am trying to find the best approach for running multilevel models where the primary exposure of interest is a count variable with a lot of zeros and the dependent variable is a continuous variable. The analyses will be adjusted for potential confounders, and for the random effect of school (i.e. we recruited a stratified sample of children within schools). I thought about dichotomising my independent variable, but I would obviously lose a lot of information in doing so. I am not sure that a linear mixed model will provide accurate estimates for my independent variable. Any thoughts?

Thanks in advance!

There is no distributional assumption for the independent variable, so the post on zero-inflated models really doesn’t apply. The question is what is the appropriate functional form for the dependence of your dependent variable on the predictor. If you simply treat your exposure variable as a quantitative variable, then you are assuming that each 1-unit increment has the same effect. That may or may not be true. I’d try out different possibilities: quantitative, dichotomous, 5 categories, etc.

hi paul,

when can you say that the number of 0’s already exceeds the allowable number under the discrete distribution?

There’s no magic number. Even the Poisson distribution can allow for a very large fraction of zeros when the variance (and mean) are small. The negative binomial distribution can also allow for a large fraction of zeros when the variance is large.
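A quick way to see this is to compute the probability of a zero directly. The Python sketch below assumes the usual NB2 parameterization, in which the variance is mu + alpha·mu² and the zero probability is (1 + alpha·mu)^(−1/alpha); the function names and the values of `mu` and `alpha` are illustrative only.

```python
import math

def poisson_p0(mu):
    """P(Y = 0) for a Poisson distribution with mean mu."""
    return math.exp(-mu)

def negbin_p0(mu, alpha):
    """P(Y = 0) for an NB2 distribution with mean mu and
    variance mu + alpha * mu**2 (alpha > 0 is the dispersion)."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

# Illustrative values: a small-mean Poisson, and NB with growing dispersion.
for mu, alpha in [(0.2, 1.0), (2.0, 1.0), (2.0, 5.0)]:
    print(mu, alpha, round(poisson_p0(mu), 3), round(negbin_p0(mu, alpha), 3))
```

Holding the mean fixed, a larger `alpha` (more overdispersion) pushes the negative binomial zero probability well above the Poisson value, with no zero-inflation component involved.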

I am currently doing the thesis for my master’s degree in biostatistics.

The title of my thesis is “Fitting Poisson-normal and Poisson-gamma with random effect on oral health with zero inflation (DMF index)”.

I did my analysis in Stata, and in both cases (my case and yours) the results were inconclusive.

Which brings me to my shameless request: how did you do the analysis, and what software did you use?

I thank you in advance for your answer and hope for further collaboration with you.

I use either Stata or SAS.

Hi Paul,

I found your article really helpful! I am working with a dataset on sickness absence and sickness presenteeism. Most researchers modeling absence or presenteeism individually have used ZINB models, theorising that some structural zeros are due to employees having a no-absence or no-presenteeism rule, whilst sampling zeros are just due to respondents never having been ill.

In my research I am combining presenteeism and absence into one measure of ‘illness’ and thus cannot make this distinction (when you are ill you can only be either present or absent from work). Am I right, then, to use a negative binomial regression model without zero inflation (regardless of what the Vuong test says)? And do you know of any article or book I can cite as evidence of the need for a theory on the different zeros for zero inflation to be used? Chapter 9 of your book, maybe?

Well, it does seem that the rationale that others use for the ZINB wouldn’t apply in this case. I would probably just go with the NB. Sorry but I don’t have a citation to recommend.

I am working on a model with a count outcome and trying to figure out which has a better fit: negative binomial or zero-inflated negative binomial. (Poisson definitely doesn’t fit well due to overdispersion.) While the AIC is better for zero-inflated models, the BIC tends to point towards the regular negative binomial model. Can you help me understand this? Also, if theoretically the negative binomial model makes more sense (it wasn’t originally hypothesized that there is a separate process for ‘excessive zeros’), does it make sense to go with the negative binomial model despite the better fit of the zero-inflated model?

BIC puts more penalty on additional parameters, and the ZINB has more parameters. So it’s not surprising that NB does better on this measure. Sounds like the fit is pretty close for the two models. So why not go with the simpler model if there’s no theoretical reason to prefer the ZINB.
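The disagreement between the two criteria follows directly from their formulas, AIC = 2k − 2·logL and BIC = k·ln(n) − 2·logL. The Python sketch below uses entirely hypothetical log-likelihoods, parameter counts and sample size to show how the same pair of fits can be ranked differently.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: each parameter costs 2."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: each parameter costs ln(n)."""
    return k * math.log(n) - 2 * loglik

# Hypothetical fits: the ZINB gains 4 log-likelihood points for 3 extra parameters.
n = 500
nb = {"loglik": -1000.0, "k": 5}
zinb = {"loglik": -996.0, "k": 8}

aic_nb, aic_zinb = aic(**nb), aic(**zinb)
bic_nb, bic_zinb = bic(n=n, **nb), bic(n=n, **zinb)
print("AIC prefers", "ZINB" if aic_zinb < aic_nb else "NB")
print("BIC prefers", "ZINB" if bic_zinb < bic_nb else "NB")
```

Here the three extra parameters cost 6 under AIC but about 18.6 under BIC, so an improvement of 8 in −2·logL is enough to win on AIC and not on BIC.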

Hi all,

This is an interesting discussion – for those who are interested the following paper does a nice “introductory” review of several of the topics mentioned here, http://www.ncbi.nlm.nih.gov/pubmed/21854279

and demonstrates how these decisions can be guided strongly by theory, etc.

What does it mean when the BIC for ZINB = –Inf?

It probably means that the algorithm for maximizing the likelihood did not converge.

Hi Paul,

I have made some progress with proc glimmix in SAS. The code for my final model is presented below, model one was unconditional with no predictor, model two had socioeconomic status (SES) as a predictor, and the final model has SES and gender as predictors.

*question = question type, response = answers to the questions

Proc glimmix data=work.ses method=laplace noclprint;

Class question;

Model response=ses gender / dist=multinomial link=clogit solution cl oddsratio (diff=first label);

Random ses / subject=question type=vc solution cl;

Covtest / wald;

Run;

So far I have gotten suitable results: model two is a better fit than model one, and model three is a better fit than model two. In the final model, the fixed effects are significant for SES (p < .05) and gender (p < .001). So far everything has been self-taught, picking up information from different sources with no particular one that matches my need. I am therefore not 100% sure of my code (save for dist=mult and link=clogit). The major problem I am facing now, however, and have spent a considerable amount of time on, is trying to figure out how to get post-hoc tests for the gender effect on the different types of questions (like a pairwise comparison table for ANOVA). I have tried Lsmeans but it doesn’t work with multinomial data; I have tried slice and slicediff, as well as contrast (bycat and chisq), but keep getting errors.

Once again I am out of options and the study wouldn’t make much sense if I cannot pick out the particular question types that gender (and ideally SES) has an effect on.

Thanks in advance.

MacAulay

This code is not right. The RANDOM statement should be something like

RANDOM INTERCEPT / SUBJECT=person id;

question should be a predictor in the model. You can do your tests with CONTRAST statements.

Thanks.

This is to let everyone know that there is a free version of SAS available for non-commercial purposes. Follow the link below, if it is broken search for the page through google.

http://blogs.sas.com/content/sastraining/2014/06/18/free-sas-software-for-students/

Hi Paul,

Thank you for this post and for engaging with the commentators. I will greatly appreciate it if you can offer some advice on my data.

I am attempting to replicate and further a 3 (socio-economic status) x 6 (question type) study. The DV (question type) is measured with a 12 item questionnaire (6 categories containing 2 questions each). Participants in each category (i.e., two questions) can score between 0 and 2. In my study as well as the aforementioned study, most participants score 0 across all 6 categories. The data therefore does not satisfy the normality assumption for parametric tests as it is skewed to the right and the transformations I have tried did not work.

I don’t know how the authors got away with publishing the results arrived at from an ANOVA with this type of data as it is not mentioned in their methods. My study tests an extra variable ‘gender’ theorised to affect the relationship explored in the aforementioned study. That is, my study design is 2 (gender) x 3 (socio-economic status) x 6 (question type). An initial ANOVA gave all the predicted results but when I went back to explore the data I realised I had a huge normality problem which the authors must have also had. If their analysis is wrong I do not want to repeat it. Which statistical analysis do you think will be best to use in my situation?

Thanks in advance.

MacAulay

This sounds like a job for ordered logistic regression, also known as cumulative logit.

Hi Paul, One minor follow-up question.

SPSS’s ordinal regression dialog box only allows one DV at a time. Does this mean that I will have to repeat the analysis six times for my six DVs? If so will I have to use a p value of 0.05/6?

I have searched for answers to this question online and one or two statistics textbooks readily available but can’t find any answers. The only answers I have found are room for more than one IV (i.e., combinations of categorical and continuous IVs).

Thanks in advance.

Best regards

MacAulay

What you need is software that will allow you to do ordered logistic regression in a mixed-modeling framework (meologit in Stata or GLIMMIX in SAS). Or at the least, ordered logit software with clustered standard errors. Each subject would have 6 data records, and question type would be an independent variable.

Dear Paul, thank you for your post. I used the zero-inflated negative binomial model to fit my data with a lot of zeros. But after reading your post, I have some concerns, since my dependent variable is the amount of dollars the respondents were willing to pay for a specific policy option, and a “0” means that they are unwilling to pay anything for the option. Though I have a lot of zeros in my data (most of the respondents were unwilling to pay anything), I am not sure if I can make the assumption that there are two sorts of zeros. However, I tried the Vuong test to compare the ZINB model and the conventional negative binomial model, and found that the former is superior to the latter. Does that mean that it’s better to use the ZINB model even though I don’t have a theory of two kinds of zeros?

Thanks a lot in advance!

Well, a dollar amount is not a true count variable, so the underlying rationale for a negative binomial model (either conventional or ZINB) doesn’t really apply. That said, this model might be useful as an empirical approximation.

Thank you for an informative blog. Can I please call on your time to clarify an analysis that I have that I believe should follow a ZINB. I am unsure if I have it right and if the interpretations are correct.

We have data on CV related ultrasound testing in regions of varying size over a year. Many of these regions are very small and may not carry out any testing since there are no services available (no cardiologists) and some may carry out testing that has not been reported to us due to privacy reasons (also likely to be related to few cardiologists). We are using a ZINB with number of cardiologists as the predictor in the inflation-part of the model and we get what we believe to be sensible results: as number of cardiologists increase in a region the odds of a certain/structural zero decreases dramatically.

Can you verify that the interpretation of this part of the model is correct.

I assume that the negative binomial part of the model is interpreted the normal way i.e. that each factor influences the rate of testing carried out in each region (we have a log population offset).

Makes sense to me.

The negative binomial model is an alternative to the Poisson model, and it’s specifically useful when the sample variance exceeds the sample mean. Recall that in the Poisson model the mean and variance are equal.

A zero-inflated model is only applicable when we have two sources of zeros, namely structural and sampling, while hurdle models are suitable when we only have a single source, i.e., structural.

Regarding the data with 35% zeros: first compute the mean and variance of the data. If the mean and variance are equal, fit a Poisson model. If not, try a negative binomial model. When the NB doesn’t fit, check the characteristics of the zeros, in terms of structural and sampling, and then decide whether to fit a zero-inflated model or a hurdle model.

While I generally agree with your comment, you can’t just check the sample mean and variance to determine whether the NB is better than the Poisson. That’s because, in a Poisson regression model, the assumption of equality applies to the CONDITIONAL mean and variance, conditioning on the predictors. It’s quite possible that the overall sample variance could greatly exceed the sample mean even under a Poisson model. Also, there’s nothing in the data that will tell you whether some zeros are structural and some are sampling. That has to be decided on theoretical grounds.
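The distinction between conditional and overall variance can be illustrated with the law of total variance, Var(y) = E[Var(y|x)] + Var(E[y|x]). The Python sketch below assumes a single binary predictor that splits the sample into two equally sized groups, each exactly Poisson; the group means are made up and the function name is illustrative.

```python
def marginal_mean_var(group_means, weights):
    """Overall mean and variance of y when y is Poisson within each group.
    group_means: conditional means; weights: group shares summing to 1."""
    mean = sum(w * m for m, w in zip(group_means, weights))
    # Within groups the Poisson variance equals the mean, so the expected
    # conditional variance is just the overall mean.
    within = mean
    # Variation of the conditional means around the overall mean.
    between = sum(w * (m - mean) ** 2 for m, w in zip(group_means, weights))
    return mean, within + between

# Hypothetical groups with conditional means 1 and 9, half the sample each.
mean, var = marginal_mean_var([1.0, 9.0], [0.5, 0.5])
print(mean, var)
```

The overall mean is 5 but the overall variance is 21, even though the Poisson assumption holds exactly once the predictor is taken into account.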

Hello from Korea,

Many thanks for your post.

I counted how creative my research participants’ answers are. Most of answers were 0 because creativity is a rare phenomenon.

I tried to use ZIP, but it was a bit difficult to use in SPSS. (I tried to find a manual for Stata or SAS for ZIP in Korean, but I couldn’t.) So I googled many times and found your article, which helped me use the standard negative binomial regression model, since my data are overdispersed. Is there any article that I can refer to? I want to cite an article or book as evidence for my thesis. Is your book “Logistic Regression Using SAS: Theory & Application” appropriate to cite when I use a negative binomial model instead of a zero-inflated Poisson model?

Thank you in advance.

Yes, you can cite my book. The discussion is in Chapter 9.

Many thanks sir for this explanation

35% of my data includes zero values, do I need to apply zero-inflated negative binomial, or it is OK to use standard or random-parameter negative binomial?

Regards

Just because you have 35% zeros, that does not necessarily mean that you need a zero-inflated negative binomial. A standard NB may do just fine.

I think that since he has 35% zero values, he has to use ZINB. Why do you think it is not necessary to use ZINB?

Just because the fraction of zeroes is high, that doesn’t mean you need ZINB. NB can accommodate a large fraction.

Paul,

In this post you seem to recommend the standard negative binomial regression as a better way to deal with overdispersion. In another post, “Beware of Software for Fixed Effects Negative Binomial Regression” (June 8th, 2012), you argued that some software that uses the HHG method to do conditional likelihood for a fixed effects negative binomial regression model does not do a very good job. Then, if one uses such software, it may be wiser to use ZIP than negative binomial regression. Right?

Well, to the best of my knowledge, there’s no conditional likelihood for doing fixed effects with ZIP. So I don’t see any attraction for that method.

OK I see!! To sum up:

1) Standard Poisson model does not work because it cannot deal with overdispersion and zero excesses

2) Negative binomial model does not do appropriate conditional likelihood, at least for some software (SAS, STATA)

3) There is no conditional likelihood for ZIP

Then, it is kind of tough because there is no model that can appropriately deal with overdispersion and zero excesses. There is the pglm package in R, but there is not much information about how it deals with these two issues. Do you happen to know more about it?

A solution may be to do Poisson fixed effects with a quasi-maximum likelihood estimator (QMLE). This can be done in Stata. However, I read that QMLE can overcome overdispersion but does not do a great job with zero excesses. Any thoughts about QMLE?

I agree with your three points. But, as I suggested, the negative binomial model often does just fine at handling “excess” zeros. And you can do random effects NB with the menbreg command in Stata or the GLIMMIX procedure in SAS. For fixed effects, you can do unconditional ML or use the “hybrid” method described in my books on fixed effects. I don’t know much about pglm, and the documentation is very sparse. QMLE is basically MLE on the “wrong” model, and I don’t think that’s a good way to go in this case.

By the way, you said earlier that there’s no conditional likelihood for doing fixed effects with ZIP. What about PROC TCOUNTREG in SAS? Something like:

MODEL dependent= </DIST=ZIP ERRORCOMP=FIXED

Doesn’t it do ZIP fixed effects conditional likelihood?

I just tried that and got an error message saying that the errorcomp option was incompatible with the zeromodel statement. But I was using SAS 9.3. Maybe it works in 9.4.

Sir,

I work on crime data but I am facing an interesting problem. When I fit the count data models I find that the ZINB explains the problem better, but when I plot the expected dependent values, the Poisson distribution controlled for cluster heterogeneity fits better.

Does it have something to do with your debate?

Probably not.

This blog is going to be required reading for my students. If only they could have this type of discourse. Thanks.

ZI models may provide some explanation for the presence of zeros. I do not know if this is an advantage of ZI models. And many thanks for your nice blog.

Paul,

I have been researching ZIP and have come across differing suggestions of when it would be appropriate to use. The example below is on a tutorial page for when zero-inflated analysis would be appropriate. My guess is that you would say zero-inflated analysis is not appropriate in this example, as there is no subgroup of students who have a zero probability of a “days absent” count greater than 0. Thanks.

“School administrators study the attendance behavior of high school juniors over one semester at two schools. Attendance is measured by number of days of absent and is predicted by gender of the student and standardized test scores in math and language arts. Many students have no absences during the semester.”

I agree that this is not an obvious application for ZIP or ZINB. Surely all students have some non-zero risk of an absence, due to illness, injury, death in family, etc.

This discussion between you and Greene was a great exchange, and I gained a lot from reading it. I would love to see you guys coauthor a piece in (eg) Sociological Methods reviewing the main points of agreement and disagreement. It would be a great article!

Good idea, but I don’t think it’s going to happen.

Is there a simple criterion to guide a researcher on whether to use ZINB? For example, what proportion or percentage of the sample should the zeros constitute in order for one to use ZINB? Can it be done from such a point of view?

I’m not aware of any such criterion.

Are many zeros a statistical problem in logistic regression analysis (with response variable 0/1) as well?

No, although see my earlier blog on logistic regression with rare events.

Hi Paul. Thank you for your answer. I was wondering why you think that ZINB might not make sense? Also, by ‘dichotomize’, do you mean using only the cells with values > 0? The reason why I might need some zero cells is that this is a study of lemming habitat choice (as expressed by the response variable ‘number of winter nests in a cell’) as a function of some environmental explanatory variables (related to snow cover and vegetation characteristics). I thought, then, that in order to best uncover the relation between my explanatory variables and my response variable, cells with especially poor environmental conditions (and zero nests) ought also to be represented?

Regarding the second question, I simply meant to dichotomize into zero and not zero. By “make sense” I meant is it reasonable to suppose that there is some substantial fraction of cases that have 0 probability of making a nest regardless of the values of any covariates.

Yes, you are right that a large number of cells will be zero, not because of the covariates, but just by chance – and because there are not so many lemmings in the area to fill it out. I understand that it is these unexplained zeros that you say make ZINB pointless(?) I guess that they should have belonged to the group of ‘structural zeros’ (like sterile women in your example) for things to make sense – only they don’t, since these cells could easily have housed one or more nests. Could you elaborate a little bit on which approach and model you think might be better then? By ‘dichotomize into zero and not zero’, do you mean run the data strictly as presence-absence in a logistic regression manner? My first inclination is to make use of the counts, as I think they might add information to the analysis. Finally, I would like to say that your advice and help are very much appreciated. Being able to choose a meaningful and appropriate model for the data analysis above will allow me to move past a critical point and into the final stages of writing my master’s thesis on the topic. Thank you in advance. Best regards,

Hi Paul. Sorry, I just read your comment correctly now. What I wrote above still applies to the dataset, though. The answer to your question: ‘is it reasonable to suppose that there is some substantial fraction of cases that have 0 probability of making a nest regardless of the values of any covariates’ must be: No. There are no ‘sterile women’ in this dataset. The only reason why a large part of the cells count zero, regardless of values of covariates, is that there are so relatively few lemmings in the area that they cannot take up all of the space – even some of the attractive locations. I understand that it is the ZI and hurdle approaches that make the assumption of a fraction of observations bound to be 0 regardless of covariates. Since you say that the basic negative binomial regr. model (without ZI) can also handle many zeros – might that be the road to go down, then?

I’d say give it a try.

Thank you both for the interesting discussion. Can either of you tell me if a count dataset can contain such a large amount of zeros that none of the models mentioned in this blog – NB, ZIP, ZINB – are likely to work?! I have a count dataset that contains 126,733 cells out of which 125,524 count “0”. That is, 99.05% of my dataset has a count of zero. Is this a detrimental proportion, and should I instead do some random resampling of zero-cells in order to lower the number? Thank you in advance…

Well, ZINB should “work” in the sense of fitting the data. Not sure whether it really makes sense, however. In a case like this, I would be tempted to just dichotomize the outcome. I don’t see any advantage in sampling the zero cells.

Hi, Jakob! Why not try just dichotomizing (empty = “no” and >0 = “yes”, or white/black pixels) and then running a logit regression? Another way: aggregate to bigger non-empty cells and use a Poisson-like regression, or just wait until a lemming peak year 😉

What an intuitive discussion! The standard error estimates are often lower in the Poisson model than in the NB model, which increases the likelihood of incorrectly detecting a significant effect in the Poisson model. But fitting ZI models predicts the correct mean counts and probability of zeros. So I think ZINB is better than NB when there are excess zeros.

Thank you both for the interesting discussion. What do you think about two-component “hurdle” models (binomial + gamma, Poisson or NegBin)? It seems to me to be an easily interpretable and flexible tool!

I don’t know a lot about hurdle models, but they seem pretty similar to zero-inflated models. They could be useful in some situations, but may be more complex than needed.

IMHO, they look similar, but are easily interpretable and help to find some interesting effects, for example a different sign on the same predictor in the binomial and count parts of the model!

“In all data sets that I’ve examined, the negative binomial model fits much better than a ZIP model, as evaluated by AIC or BIC statistics. And it’s a much simpler model to estimate and interpret.” I get your second point in terms of a simpler model to estimate and interpret. But I question your first point. AIC and BIC are both based on the log likelihood. Negative Binomial and ZIP have different probability density functions and thus different expression for likelihoods. It’s my understanding that AIC and BIC are meaningless when comparing models without the same underlying likelihood form.

Good question, but I disagree. To compare likelihoods (or AICs or BICs), there’s no requirement that the “probability density functions” be the same. Only the data must be exactly the same. For example, for a given set of data, I can compute the likelihood under a normal distribution or a gamma distribution. The relative magnitudes of those likelihoods yield valid information about which distribution fits the data better.
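The same idea can be sketched with two count distributions instead of the normal and gamma. The example below is an illustration only: the counts are made up, and the two candidate families (Poisson and geometric, the latter being a negative binomial with its dispersion fixed at 1) were chosen because both have closed-form maximum likelihood estimates. The two likelihoods have different functional forms but are evaluated on exactly the same data, so their AICs can be compared directly.

```python
import math

# Hypothetical counts (made up for illustration): many zeros and a long right tail.
counts = [0] * 60 + [1] * 15 + [2] * 10 + [3] * 6 + [5] * 4 + [8] * 3 + [12] * 2


def poisson_loglik(y, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(v * math.log(lam) - lam - math.lgamma(v + 1) for v in y)


def geometric_loglik(y, p):
    """Log-likelihood under P(Y = v) = p * (1 - p)**v for v = 0, 1, 2, ..."""
    return sum(math.log(p) + v * math.log(1.0 - p) for v in y)


def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2 * n_params - 2 * loglik


ybar = sum(counts) / len(counts)
# Closed-form MLEs: lam = ybar for the Poisson, p = 1 / (1 + ybar) for the geometric.
aic_pois = aic(poisson_loglik(counts, ybar), 1)
aic_geom = aic(geometric_loglik(counts, 1.0 / (1.0 + ybar)), 1)
print("AIC Poisson:  ", round(aic_pois, 1))
print("AIC geometric:", round(aic_geom, 1))
```

For these made-up counts the geometric model has the lower AIC, which simply says that it describes this particular data set better than the Poisson does.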

Thank you both for the interesting discussion. I’ve been working on a random effects negative binomial model to explain crime occurrence across a spatial grid. The negative binomial model appears to fit quite well. That said, I’ve been thinking about whether there are two distinct data generating processes producing the zeros. One, crime hasn’t occurred, and two, crime occurred but has never been reported. Perhaps then the ZINB makes sense? I haven’t tried it yet…but will.

I think that it might be inappropriate to do as you describe – for two reasons: 1) The only reason why you came up with two possible classes of 0’s is that you know this is required for the ZI procedure, i.e. it is a post rationalization (also mentioned in the discussion). 2) You investigate where crime takes place – so a 0 because no one reported a crime is not a ‘real’ 0 – the crime did take place! For comparison, refer to the example from Paul: Both groups of women (sterile and those who just had no children) were ‘real’ 0’s – none of them had children!

1. I would not agree with you that the ZIP model is a nonstarter. In my experience, the ZINB model seems in many cases to be overspecified. There are two sources of heterogeneity embedded in the ZINB model, the possibly unneeded latent heterogeneity (discussed by Paul above) and the mixing of the latent classes. When the ZINB model fails to converge or otherwise behaves badly, it seems in many cases to be because the ZIP model is better suited for the modeling situation at hand.

* much of the rest of this discussion focuses on what I would call a functional form issue. Paul makes much of the idea of a researcher faced with an unspecified theory and a data set that contains a pile of zeros. At the risk of sounding dogmatic about it, I am going to stake my position on the situation in which the researcher has chosen to fit a zero inflated model (P or NB) because it is justified by the underlying theory. If the researcher has no such theory, but a data set that seems to be zero heavy, there really is no argument here. As I agreed earlier, there are many candidates for functional forms that might behave just as well as the ZI* models in terms of the fit measures that they prefer to use, such as AIC. (More on that below.)

2. See above. Just one point. Yes, the NB model is a continuous (gamma) mixture of Poissons. But the nature of the mixing process in that model is wholly different from the finite mixture aspect of the ZI models. Once again, this is an observation about theory. It does not help to justify the ZIP model or any of the suggested alternatives.

3. What I have in mind about fit measures is this. Many individuals (I have seen this in print) discuss the log likelihood, AIC or (even worse) pseudo R-squared in terms they generally intend to characterize the coefficient of determination in linear regression. I have even seen authors discuss sums of squares in Poisson or Probit models as they discuss AIC or Pseudo R squareds even though there are no sums of squares anywhere in the model or the estimator. These measures do not say anything about the correlation (or other correspondence) of the predictions from the model with the observed dependent variable. The difference between a “y-hat” and a “y-observed” appears nowhere in the likelihood function for an NB model, for example. But, it is possible to make such a comparison. If the analyst computes the predicted outcome from a ZINB model using the conditional mean function, then uses the correspondence of this predictor with the outcome, they can compute a conventional fit measure that squares more neatly with what people seem to have in mind by “fit measure.” As a general proposition, the ZINB model will outperform its uninflated counterpart by this measure.

4. I have no comment here. The buttons are there to push in modern software.

5. The problem of interpretation runs deeper than just figuring out what a beta means when a gamma that multiplies the same variable appears elsewhere in the same model. In these nonlinear models, neither the beta nor the gamma provides a useful measure of the association between the relevant X and the expected value of the dependent variable. It is incumbent on the researcher to make sense of the implications of the model coefficients. This usually involves establishing then estimating the partial effects. Partial effects in these models are nonlinear functions of all of the model parameters and all of the variables in the model – they are complicated. Modern software is built to help the researcher do this. This is a process of ongoing development such as the MARGINS command in Stata and nlogit’s PARTIALS command. None of this matters if the only purpose of the estimation is to report the signs and significance of estimated coefficients, but it has to be understood that in nonlinear contexts these are likely to be meaningless.

6. It is possible to “parameterize” the model so that P=b0/(1+b0) * exp(beta’x)/[1+exp(beta’x)], which is what is proposed. The problem that was there before remains. The “null hypothesis” is that b0=0, which is tricky to test, as Paul indicated. However, if b0=0, then there is no ZIP model. Or, maybe there is? If b0 is zero, how do you know that beta = 0? The problem of the chi-squared statistic when b0 is on the boundary of the parameter space is only the beginning. How many degrees of freedom does it have? If b0=0, then beta does not have to. Don Andrews published a string of papers in Econometrica on models in which model parameters are unidentified under the null hypothesis. This is a template case. The interested reader might refer to them. For better or worse, researchers have for a long time used the Vuong statistic to test for the Poisson or NB null against the zero inflation model. The narrower model usually loses this race.

To sum this up, it is difficult to see the virtue of the reparameterized model. The suggested test is invalid. (We don’t actually know what it is testing.) The null model is just the Poisson or NB model. The alternative is the zero inflated model, without the reparameterization.

The zero inflation model is a latent class model. It is proposed in a specific situation – when there are two kinds of zeros in the observed data. It is a two part model that has a specific behavioral interpretation (that is not particularly complicated, by the way). The preceding discussion is not about the model. It is about curve fitting. No, you don’t need the ZINB. There are other functions that can be fit to the data that will look like they “fit better” than the ZINB model. However, neither the log likelihood function nor the suggested AIC are useful “fit measures” – the fit of the model to the data in the sense in which it is usually considered is not an element of the fitting criterion. If you use the model to predict the outcome variable, then compare these predictions to the actual data, the ZINB model will fit so much better there will be no comparison.
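The prediction-based comparison described here can be sketched as follows. This is only an illustration of the mechanics: the zero-class probabilities, count-part means, and observed outcomes are hypothetical numbers, not output from a fitted model, and the squared correlation is just one possible way to measure the "correspondence" between predictions and outcomes.

```python
def zi_conditional_mean(pi_zero, mu):
    """E[y | x] for a zero-inflated count model: (1 - pi) * mu for each case."""
    return [(1.0 - p) * m for p, m in zip(pi_zero, mu)]


def squared_correlation(pred, obs):
    """Squared Pearson correlation between predictions and observed outcomes."""
    n = len(obs)
    mean_pred = sum(pred) / n
    mean_obs = sum(obs) / n
    cov = sum((a - mean_pred) * (b - mean_obs) for a, b in zip(pred, obs))
    var_pred = sum((a - mean_pred) ** 2 for a in pred)
    var_obs = sum((b - mean_obs) ** 2 for b in obs)
    return cov * cov / (var_pred * var_obs)


# Hypothetical fitted quantities for five cases (not from a real model).
pi_zero = [0.8, 0.6, 0.3, 0.1, 0.05]  # P(case belongs to the always-zero class)
mu = [1.0, 1.5, 2.0, 3.0, 4.0]        # mean of the count part
y = [0, 0, 2, 3, 5]                   # observed counts

pred = zi_conditional_mean(pi_zero, mu)
r2 = squared_correlation(pred, y)
print(pred, round(r2, 3))
```

Running the same calculation on the predicted means from an ordinary NB fit to the same data would give the side-by-side comparison the comment has in mind. The sketch itself makes no claim about which model would come out ahead.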

It is always intriguing when a commentator argues that a model is “difficult to fit.” Typing ZINB in Stata’s or nlogit’s command language is not harder than typing negbin. These models have existed for years as supported procedures in these programs. There is nothing difficult about fitting them. As for difficulty in interpreting the model, the ZINB model, as a two part model makes a great deal of sense. It is hard to see why it should be difficult to interpret.

The point above about the NB model being a parametric restriction on the ZINB model is incorrect. The reparameterization merely inflates the zero probability. But it loses the two part interpretation – the reparameterized model is not a zero inflated model in the latent class sense in which it is defined. The so-called reparameterized model is no longer a latent class model. It is true that the NB model can be tested as a restriction on the proposed model. But the proposed model is not equivalent to the original ZINB model – it is a different model. Once again, this is just curve fitting. There are numerous ways to blow up the zero probability, but these ways lose the theoretical interpretation of the zero inflated model.
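Point 5 above, that neither the beta nor the gamma alone measures the association between a predictor and the expected count, can be illustrated with a small sketch. It assumes a zero-inflated model with a log link for the count part and a logit link for the always-zero class, one predictor x, and hypothetical coefficients. The function names are invented for this example and do not correspond to any package.

```python
import math


def _parts(x, b0, b1, g0, g1):
    """Count-part mean mu(x) and always-zero probability pi(x)."""
    mu = math.exp(b0 + b1 * x)
    pi = 1.0 / (1.0 + math.exp(-(g0 + g1 * x)))
    return mu, pi


def zi_mean(x, b0, b1, g0, g1):
    """E[y | x] = (1 - pi(x)) * mu(x)."""
    mu, pi = _parts(x, b0, b1, g0, g1)
    return (1.0 - pi) * mu


def zi_partial_effect(x, b0, b1, g0, g1):
    """dE[y | x]/dx = (1 - pi) * mu * (b1 - pi * g1)."""
    mu, pi = _parts(x, b0, b1, g0, g1)
    return (1.0 - pi) * mu * (b1 - pi * g1)


# Hypothetical coefficients: x raises the count-part mean (b1 > 0)
# but also raises the chance of being in the always-zero class (g1 > 0).
coefs = (0.5, 0.3, 0.0, 1.0)  # b0, b1, g0, g1
for x in (-3.0, 0.0):
    print(x, round(zi_partial_effect(x, *coefs), 4))
```

With these made-up coefficients the partial effect is positive at x = -3 and negative at x = 0, even though the count-part coefficient is positive throughout. The sign and size of the effect depend on all the parameters and on where it is evaluated.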

I appreciate William Greene’s thoughtful consideration of some of the issues in my blog. Here are some responses:

1. ZIP model. Given that Greene didn’t mention the zero-inflated Poisson model, I’m guessing that he agrees with me that the ZIP model is a non-starter. It’s just too restrictive for the vast majority of applications.

2. Curve fitting vs. a behavioral model. It’s my strong impression that a great many researchers use zero-inflated models without any prior theory that would lead them to postulate a special class of individuals with an expected count of 0. They just know that they’ve got lots of zeros, and they’ve heard that that’s a problem. After learning more about the models, they may come up with a theory that would support the existence of a special class. But that was not part of their original research objective. My goal is simply to suggest that a zero-inflated model is not a necessity for dealing with what may seem like an excessive number of zeros.

As I mentioned toward the end of the blog, there are definitely situations where one might have strong theoretical reasons for postulating a two-class model. But even then, I think it’s worth comparing the fit of the ZINB model with that of the conventional NB model. The two-class hypothesis is just that — a hypothesis. And if the evidence for that hypothesis is weak, maybe it’s time to reconsider.

It’s also worth noting that the conventional NB model can itself be derived as a mixture model. Assume that each individual i has an event count that is generated by a Poisson regression model with expected frequency Ei. But then suppose that the expected frequency is multiplied by the random variable Ui to represent unobserved heterogeneity. If Ui has a gamma distribution (the mixing distribution), then the observed count variable will have a negative binomial distribution. The generalized gamma distribution is pretty flexible and allows for a large concentration of individuals near zero.
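The mixture derivation can be checked by simulation. The sketch below is a rough illustration under stated assumptions: a single expected frequency of 2 with no covariates, a gamma multiplier with mean 1 and variance alpha = 1, and a simple Poisson sampler (Knuth's multiplication method) written out because Python's standard library has none. The simulated counts should have mean mu, variance close to mu + alpha * mu^2, and a zero share close to the NB value (1 + alpha * mu)^(-1/alpha).

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_gamma_poisson(mu, alpha, n, seed=0):
    """Counts from a Poisson whose mean is mu times a gamma multiplier U with E[U] = 1, Var[U] = alpha."""
    rng = random.Random(seed)
    shape, scale = 1.0 / alpha, alpha
    return [poisson_draw(mu * rng.gammavariate(shape, scale), rng) for _ in range(n)]


mu, alpha = 2.0, 1.0  # illustrative values
draws = simulate_gamma_poisson(mu, alpha, 50_000)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
zero_share = draws.count(0) / len(draws)
nb_zero = (1.0 + alpha * mu) ** (-1.0 / alpha)

print("mean:", round(mean, 2), "target:", mu)
print("variance:", round(variance, 2), "target:", mu + alpha * mu ** 2)
print("share of zeros:", round(zero_share, 3), "NB value:", round(nb_zero, 3))
```

No separate class of "absolute zeros" is built into this simulation; the large share of zeros comes entirely from individuals whose gamma multiplier happens to be small.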

3. Fit criteria. I’m not sure what to make of Greene’s statement that “neither the log-likelihood nor the suggested AIC are useful ‘fit measures’—the fit of the model to the data in the sense in which it is usually considered is not an element of the fitting criterion.” Why should the fitting criterion (i.e., the log-likelihood) not be a key basis for comparing the fit of different models? If it’s not useful for comparing fit, why should it be used as a criterion for estimation? In any case, AIC and BIC are widely used to compare the relative merits of different models, and I don’t see any obvious reason why they shouldn’t be used to evaluate the zero-inflated models.

4. Fit difficulty. Greene is puzzled by any suggestion that zero-inflated models are “difficult to fit.” Those weren’t exactly my words, but I can stipulate that there are fewer keystrokes in ZINB than in NEGBIN. So in that sense, ZINB is actually easier. On the other hand, there is certainly more calculation required for the ZINB than for the NB. And if you’re dealing with “big data”, that could make a big difference. Furthermore, it’s not at all uncommon to run into fatal errors when trying to maximize the likelihood for the ZINB.

5. Interpretation difficulty. Why do I claim that the ZINB model is more difficult to interpret? Because you typically have twice as many coefficients to consider. And then you have to answer questions like “Why did variable X have a big effect on whether or not someone was in the absolute zero group, but not much of an effect on the expected number of events among those in the non-zero group? On the other hand, why did variable W have almost the same coefficients in each equation?” As in most analyses, one can usually come up with some after-the-fact explanations. But if the model doesn’t fit significantly better than a conventional NB with a single set of coefficients, maybe we’re just wasting our time trying to answer such questions.

6. Nesting of models. As I recall, Greene’s earlier claim that the NB model was not nested within the ZINB model was based on the observation that the only way you can get from the ZINB model to the NB model is by making the intercept in the logistic equation equal to minus infinity, and that’s not a valid restriction. But suppose you express the logistic part of the model as follows,

p/(1-p) = b0*exp(b1*x1 + … + bk*xk)

where b0 is just the exponentiated intercept in the original formulation. This is still a latent class model in its original sense. Now, if we set all the b’s=0, we get the conventional NB model. The issue of whether the models are nested is purely mathematical and has nothing to do with the interpretation of the models. If you get from one model to another simply by setting certain unknown parameters equal to fixed constants (or equal to each other), then they are nested.

As I mentioned in the blog, because b0 has a lower bound of zero, the restriction is “on the boundary of the parameter space.” It’s now widely recognized that, in such situations, the likelihood ratio statistic will not have a standard chi-square distribution. But, at least in principle, that can be adjusted for.
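Written out, the reparameterized zero-inflated model looks as follows. The notation is an assumption added for illustration: p_i is the probability that case i belongs to the always-zero class, and f_NB is the negative binomial probability function with mean mu_i and dispersion alpha.

```latex
\begin{aligned}
\frac{p_i}{1-p_i} &= b_0 \exp(b_1 x_{i1} + \dots + b_k x_{ik}), \qquad b_0 \ge 0,\\[4pt]
p_i &= \frac{b_0 \exp(b_1 x_{i1} + \dots + b_k x_{ik})}{1 + b_0 \exp(b_1 x_{i1} + \dots + b_k x_{ik})},\\[4pt]
\Pr(Y_i = 0) &= p_i + (1-p_i)\, f_{\mathrm{NB}}(0 \mid \mu_i, \alpha),\\[4pt]
\Pr(Y_i = y) &= (1-p_i)\, f_{\mathrm{NB}}(y \mid \mu_i, \alpha), \qquad y = 1, 2, \dots
\end{aligned}
```

Setting b_0 = 0 gives p_i = 0 for every case, so both lines collapse to the ordinary NB probabilities. At that point b_1, ..., b_k drop out of the likelihood entirely, which is why the restriction is both on the boundary of the parameter space and leaves those slopes unidentified under the null.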

W.r.t the difficulty of interpretation of ZI models, I think you can imagine there is some unknown (unobserved) explanatory variable which causes many zeros. The zero-inflated “sub-model” (I don’t know the correct term) is activated by this variable.

For computer researchers (of whom I am one) this casualness is often tolerated. But maybe in other fields things are different.

Thanks for this blog post. You make these statistical concepts easy to understand; I will certainly be on the lookout for your books.