## Logistic Regression for Rare Events

##### February 13, 2012 By Paul Allison

Prompted by a 2001 article by King and Zeng, many researchers worry about whether they can legitimately use conventional logistic regression for data in which events are rare. Although King and Zeng accurately described the problem and proposed an appropriate solution, there are still a lot of misconceptions about this issue.

The problem is not specifically the *rarity* of events, but rather the possibility of a small number of cases on the rarer of the two outcomes. If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.

There’s nothing wrong with the logistic *model* in such cases. The problem is that maximum likelihood estimation of the logistic model is well-known to suffer from small-sample bias. And the degree of bias is strongly dependent on the number of cases in the less frequent of the two categories. So even with a sample size of 100,000, if there are only 20 *events* in the sample, you may have substantial bias.

What’s the solution? King and Zeng proposed an alternative estimation method to reduce the bias. Their method is very similar to another method, known as penalized likelihood, that is more widely available in commercial software. Also called the Firth method, after its inventor, penalized likelihood is a general approach to reducing small-sample bias in maximum likelihood estimation. In the case of logistic regression, penalized likelihood also has the attraction of producing finite, consistent estimates of regression parameters when the maximum likelihood estimates do not even exist because of complete or quasi-complete separation.
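For readers who want to see the mechanics, here is a minimal sketch of Firth's penalized likelihood for logistic regression, written in Python with NumPy. It is an illustration under stated assumptions, not the implementation used by SAS or the R logistf package: the function name `firth_logistic`, the simple scoring loop, and the toy data are all hypothetical. The idea is to add a term based on the diagonal of the hat matrix to the usual score function, which keeps the estimates finite even under separation.

```python
import numpy as np

def firth_logistic(X, y, max_iter=200, tol=1e-10):
    """Firth-penalized logistic regression (illustrative sketch).

    X: (n, k) design matrix that already contains an intercept column.
    y: length-n array of 0/1 outcomes.
    Returns the vector of penalized-likelihood coefficient estimates.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
        w = p * (1.0 - p)                          # binomial variances
        info_inv = np.linalg.inv(X.T @ (X * w[:, None]))  # inverse Fisher information
        # Diagonal of the hat matrix: h_i = w_i * x_i' (X'WX)^-1 x_i
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: the usual score plus h_i * (0.5 - p_i)
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data with quasi-complete separation: no events at all when x = 0,
# so the ordinary ML estimate of the slope does not exist.
x = np.array([0] * 5 + [1] * 5)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])
X = np.column_stack([np.ones_like(x), x])
beta = firth_logistic(X, y)
print(beta)
```

In this saturated two-group example the penalized estimates are finite, and they coincide with what you would get by adding 0.5 to each cell of the 2 x 2 table before computing the log odds.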

Unlike exact logistic regression (another estimation method for small samples but one that can be very computationally intensive), penalized likelihood takes almost no additional computing time compared to conventional maximum likelihood. In fact, a case could be made for *always* using penalized likelihood rather than conventional maximum likelihood for logistic regression, regardless of the sample size. Does anyone have a counter-argument? If so, I’d like to hear it.

You can learn more about penalized likelihood in my seminar *Logistic Regression Using SAS*.

Reference:

Gary King and Langche Zeng. “Logistic Regression in Rare Events Data.” *Political Analysis* 9 (2001): 137-163.

I am thinking of using Poisson regression in cases where the event is rare, since p (the probability of success) is very small and n (the sample size) is large.

This has no advantage over logistic regression. There’s still small sample bias if the number of events is small. Better to use exact logistic regression (if computationally practical) or the Firth method.

Can you please explain further why you say Poisson regression has no advantage over logistic regression when we have rare events? Thanks.

When events are rare, the Poisson distribution provides a good approximation to the binomial distribution. But it’s still just an approximation, so it’s better to go with the binomial distribution, which is the basis for logistic regression.
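A quick numerical check makes the point concrete. The sketch below uses only the Python standard library; the helper names and the choice of n = 1000 and p = 0.02 are illustrative assumptions. It compares binomial probabilities with their Poisson approximation at the same mean.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact probability of k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.02  # illustrative values: 20 expected events
for k in (10, 20, 30):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```

The two columns are close but not identical, which is the sense in which the Poisson is only an approximation.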

Is this the case with PHREG as well? If you have 50 events for 2000 observations, would the FIRTH option be appropriate if your goal is to model not only the likelihood but also the median time to event?

The Firth method can be helpful in reducing small-sample bias in Cox regression, which can arise when the number of events is small. The Firth method can also be helpful with convergence failures in Cox regression, although these are less common than in logistic regression.

I am interested in determining which factors are significantly associated with an “outcome”, which is a binary variable in my sample. My sample size from a cross-sectional survey is 20,000 and the number of respondents with the “outcome” present is 70. Which method would be appropriate, multiple logistic or Poisson regression?

Thanks.

There is no reason to consider Poisson regression. For logistic regression, I would use the Firth method.

“Does anyone have a counter-argument? If so, I’d like to hear it.”

I usually default to using Firth’s method, but in some cases the true parameter really is infinite. If the response variable is presence of testicular cancer and one of the covariates is sex, for example. In that case, it’s obvious that sex should not be in the model, but in other cases it might not be so obvious, or the model might be getting fit as part of an automated process.

On a different note, I have read in Paul’s book that when there is a proportionality violation, creating time-varying covariates with the main predictor, and testing for its significance is both the diagnosis and the cure.

So, if the IV is significant after the IV*duration is also significant, then, are we ok to interpret the effect?

How does whether the event is rare or not affect the value of the above procedure?

Yes, if the IV*duration is significant, you can go ahead and interpret the “effect” which will vary with time. The rarity of the event reduces the power of this test.

I fully agree with Paul Allison. We have done extensive simulation studies with small samples, comparing the Firth method with ordinary maximum likelihood estimation. Regarding point estimates, the Firth method was always superior to ML. Furthermore, it turned out that confidence intervals based on the profile penalized likelihood were more reliable in terms of coverage probability than those based on standard errors. Profile penalized likelihood confidence intervals are available, e.g., in SAS/PROC LOGISTIC and in the R logistf package.

Hi,

I am a PhD student in biostatistics. I have a data set with approximately 26,000 cases, of which only 110 are events. I used the weighting method for rare events from Gary King’s article. My goal was to estimate ORs in a logistic regression. Unfortunately, the standard errors and confidence intervals are large, and there is little difference from the usual logistic regression. I don’t know why. What do you think? Can I use penalized likelihood?

My guess is that penalized likelihood will give you very similar results. 110 events is enough so that small sample bias is not likely to be a big factor–unless you have lots of predictors, say, more than 20. But the effective sample size here is a lot closer to 110 than it is to 26,000. So you may simply not have enough events to get reliable estimates of the odds ratios. There’s no technical fix for that.

Paul,

Please clarify this for me. I have a sample of 16,000 observations with equal numbers of goods and bads. Is this a good way to build the model, or should I reduce the bads?

Don’t reduce the bads. There would be nothing to gain in doing that, and you want to use all the data you have.

Hi Dr. Allison,

If the event I am analyzing is extremely rare (1 in 1000) but the available sample is large (5 million) such that there are 5000 events in the sample, would logistic regression be appropriate? There are about 15-20 independent variables that are of interest to us in understanding the event. If an even larger sample would be needed, how much larger should it be at a minimum?

If logistic regression is not suitable, what are our options to model such an event?

Thanks,

Adwait

Yes, logistic regression should be fine in this situation. Again, what matters is the number of the rarer event, not the proportion.

Hi Dr. Allison,

I have a small data set (100 patients), with only 25 events. Because the dataset is small, I am able to do an exact logistic regression. A few questions…

1. Is there a variable limit for inclusion in my model? Does the 10:1 rule that is often suggested still apply?

2. Is there a “number” below which conventional logistic regression is not recommended…i.e. 20?

Thanks and take care.

1. I’d probably be comfortable with the more “liberal” rule of thumb of 5 events per predictor. Thus, no more than 5 predictors in your regression.

2. No there’s no lower limit, but I would insist on exact logistic regression for accurate p-values.

Dr. Allison,

I benefited a lot from your explanation of Exact logistic regression and I read your reply on this comment that you would relax the criteria to only 5 events per predictor instead of 10. I am in this situation right now and I badly need your help. I will have to be able to defend that and I wanna know if there is evidence behind the relaxed 5 events per predictor rule with exact regression?

Thanks a lot.

Below are two references that you might find helpful. One argues for relaxing the 10 events per predictor rule, while the other claims that even more events may be needed. Both papers focus on conventional ML methods rather than exact logistic regression.

Vittinghoff, E. and C.E. McCulloch (2006) “Relaxing the rule of ten events per variable in logistic and Cox regression.” American Journal of Epidemiology 165: 710-718.

Courvoisier, D.S., C. Combescure, T. Agoritsas, A. Gayet-Ageron and T.V. Perneger (2011) “Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure.” Journal of Clinical Epidemiology 64: 993-1000.

Hello again,

I also wanted to confirm this from you, that if I have the gender as a predictor (male, female), this is considered as TWO and not one variables, right?

Thanks.

Gender is only one variable.

Thank you very much for your help. I guess I gave you a wrong example for my question. I wanted to know if a categorical variable has more than two levels, would it still be counted as one variable for the sake of the rule we are discussing?

Also, do we have to stick to the 5 events per predictor if we use Firth, or can we violate the rule completely, and if it is OK to violate it, do I have to mention a limitation about that?

Sorry for the many questions.

Thanks

What matters is the number of coefficients. So a categorical variable with 5 categories would have four coefficients. Although I’m not aware of any studies on the matter, my guess is that the same rule of thumb (of 5 or 10 events per coefficient) would apply to the Firth method. Keep in mind, however, that this is only the roughest rule of thumb. Its purpose is to ensure that the asymptotic approximations (consistency, efficiency, normality) aren’t too bad. But it is not sufficient to determine whether the study has adequate power to test the hypotheses of interest.
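The counting rule can be written down in a few lines. This is only a sketch of the rough rule of thumb discussed above; the function names and the example numbers are hypothetical, and the intercept is not counted.

```python
def count_coefficients(n_numeric, categorical_levels):
    """Number of slope coefficients in the model.

    n_numeric: number of numeric (or two-category) predictors, one coefficient each.
    categorical_levels: number of categories for each multi-category predictor;
        a predictor with k categories contributes k - 1 coefficients.
    """
    return n_numeric + sum(k - 1 for k in categorical_levels)

def events_per_coefficient(n_events, n_nonevents, n_coefficients):
    """Rule-of-thumb ratio, based on the rarer of the two outcomes."""
    return min(n_events, n_nonevents) / n_coefficients

# Example: 25 events among 100 patients, one numeric predictor
# plus one categorical predictor with 5 categories.
k = count_coefficients(n_numeric=1, categorical_levels=[5])
print(k, events_per_coefficient(25, 75, k))
```

A ratio of 5 or 10 is only a screen for whether the asymptotic approximations are plausible; it says nothing about power.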

Hi Dr. Allison,

You mention in your original post that if a sample has 100,000 cases with 2,000 events, you’re golden. My question is this: from that group of 100,000 cases with 2,000 or so events, what is the appropriate sample size for analysis? I am working with a population of about 100,000 cases with 4,500 events; I want to select a random sample from this, but don’t want the sample to be too small (want to ensure there are enough events in the analysis). A second follow up question – is it ok for my cutoff value in logistic regression to be so low (around 0.04 or so?)

Thanks so much for any help you can provide!

Joe

My question is, do you really need to sample? Nowadays, most software packages can easily handle 100,000 cases for logistic regression. If you definitely want to sample, I would take all 4500 cases with events. Then take a simple random sample of the non-events. The more the better, but at least 4500. This kind of disproportionate stratified sampling on the dependent variable is perfectly OK for logistic regression (see Ch. 3 of my book Logistic Regression Using SAS). And there’s no problem with only .04 of the original sample having events. As I said in the blog post, what matters is the number of events, not the proportion.

Dear Paul

I am using a data set of 86,000 observations to study business start-up. Most of the responses are dichotomous. The dependent variable is business start-up, which has a rate of 5%. I used logistic regression, and the results show that all 10 independent variables are highly significant. I tried rare events logistic regression and got the same result. People are complaining about the highly significant results and saying they may be biased. What would you suggest?

Regards

Given what you’ve told me, I think your critics are being unreasonable.

I am going to analyze a situation where there are 97 non-events and only 3 events… I will try rare-events logistic as well as Bayesian logistic…

With only three events, no technique is going to be very reliable. I would probably focus on exact logistic regression.

I am looking at comparing trends in prescription rates over time from a population health database. The events are in the range of 1500 per 100000 people +/- each of 5 years.

The Cochran-Armitage test for trend or logistic regression always seems to be significant even though the event rate is going from 1.65 to 1.53. Is there a better test I should be performing or is this just due to large population numbers yielding high power?

thank you,

It’s probably the high power.

Dear Dr. Allison,

I have unbalanced panel data on low birth weight kids. I am interested in evaluating the probability of hospital admissions (per 6 months) between 1 and 5 years of age. Birth weight categories are my main predictor variables of interest, but I would also like to account for their time-varying effects by interacting BW categories with age-period. The sample size of the cohort at age 1 is ~51,000, but it falls to 19,000 by age 5. Hospital admissions in the sample at years 1 and 5 are 2,246 and 127, respectively. Are there issues in using the logistic procedure with unbalanced panel data such as mine? Please provide your thoughts as they may apply to 1) pooled logistic regression using cluster-robust SEs and 2) a fixed/random effects panel approach. Many thanks in advance.

Best,

Vaidy

Regarding the unbalanced sample, a lot depends on why it’s unbalanced. If it’s simply because of the study design (as I suspect), I wouldn’t worry about it. But if it’s because of drop out, then you have to worry about the data not being missing completely at random. If that’s the case, maximum likelihood methods (like random effects models) have the advantage over simply using robust standard errors. Because FE models are also ML estimates, they should have good properties also.

Dr.Allison,

Thanks for your response. I guess I am saying I have two different issues here with my unbalanced panel: 1) the attrition issue that you rightly brought up; 2) I am concerned about the incidental parameters problem when using fixed/random effects logistic regression with heavily attrited data. I ran some probit models to predict attrition and it appears that attrition in my data is mostly random. Is the second issue, the incidental parameters problem, really of concern? Each panel in my data is composed of a minimum of two waves. Thanks.

First, it’s not possible to tell whether your attrition satisfies the missing at random condition. MAR requires that the probability of a datum being missing does not depend on the value of that datum. But if you don’t observe it, there’s no way to tell. Second, incidental parameters are not a problem if you estimate the fixed effects model by way of conditional likelihood.

Thanks for clarifying about the incidental parameters problem. I get your point about the criterion for MAR, that the missingness should not depend on the value of the datum. Key characteristics that could affect attrition are not observed in my data (e.g., SES, maternal characteristics, family income). If there is no way to determine MAR, will it be fine to use a weighting procedure based on the theory of selection on observables? For example, Fitzgerald and Moffitt (1998) developed an indirect method to test for attrition bias in panel data by using lagged outcomes to predict non-attrition. They call the lagged outcomes auxiliary variables. I ran probit regressions using different sets of lagged outcomes (such as lagged costs, hospitalization status, disability status, etc.) and none of the models predicted >10% of the variation in non-attrition. This essentially means that attrition is probably not affected by observables. But should I still weight my observations in the panel regressions using the predicted probabilities of non-attrition from the probit models?

Of course, I understand that this still does not address selection on unobservables [and hence your comment about I cannot say that data is missing at random].

Thanks,

Vaidy

MAR allows for selection on observables. And ML estimates of fixed and random effects automatically adjust for selection on observables, as long as those observables are among the variables in the model. So there’s no need to weight.

Dr. Allison,

I’m wondering your thoughts on this off-the-cuff idea: Say I have 1000 samples and only 50 cases. What if I sample 40 cases and 40 controls, and fit a logistic regression either with a small number of predictors or with some penalized regression. Then predict the other 10 cases with my coefficients, save the MSE, and repeat the sampling, many, many times (say, B). Then build up an estimate for the ‘true’ coefficients based on a weighted average of the B inverse MSEs and beta vectors. ok idea or hugely biased?

I don’t see what this buys you beyond what you get from just doing the single logistic regression on the sample of 1000 using the Firth method.

Hi Dr.Allison ,

In the case of rare event logistic regressions (sub 1%), would the pseudo R2 (Cox and Snell, etc.) be a reliable indicator of model fit, since its upper bound depends on the overall probability of occurrence of the event itself? Would a low R2 still represent a poor model? I’m assuming the confusion matrix may no longer be a great indicator of the model accuracy either…

Thanks

McFadden’s R2 is probably more useful in such situations than the Cox-Snell R2. But I doubt that either is very informative. I certainly wouldn’t reject a model in such situations just because the R2 is low.
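To see how strongly the Cox-Snell bound depends on the event proportion, recall that the Cox-Snell R2 is 1 − (L0/L1)^(2/n), which is largest when the fitted model predicts perfectly (L1 = 1). The sketch below computes that ceiling from the intercept-only likelihood; the function name is a hypothetical choice for illustration.

```python
def max_cox_snell_r2(p):
    """Upper bound of the Cox-Snell R2 when a proportion p of cases are events.

    The intercept-only likelihood is L0 = p^(n*p) * (1-p)^(n*(1-p)),
    so L0^(2/n) = (p^p * (1-p)^(1-p))^2 and the bound is 1 minus that.
    """
    null_likelihood_per_case = p**p * (1 - p) ** (1 - p)
    return 1 - null_likelihood_per_case**2

for p in (0.5, 0.1, 0.01, 0.001):
    print(p, max_cox_snell_r2(p))
```

With a 50-50 split the ceiling is 0.75, and it shrinks steadily as events become rarer, which is one reason a low Cox-Snell R2 says little about model quality in rare event data.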

Dear Dr. Allison,

I am analyzing the binary decisions of 500,000 individuals across two periods (so one million observations total). There were 2,500 successes in the first period, and 6,000 in the second. I estimate the effects of 20 predictors per period (40 total). For some reason, both logit and probit models give me null effects to variables that are significant under a linear probability model.

Any thoughts on why this might be the case? Thanks very much.

Good question. Maybe the LPM is reporting inaccurate standard errors. Try estimating the LPM with robust standard errors.

Thanks so much for the suggestion. I did use robust standard errors (the LPM requires it as it fails the homoskedasticity assumption by construction), and the variables are still significant under the LPM. I recall reading somewhere that the LPM and logit/probit may give different estimates when modeling rare events, but cannot find a reference supporting this or intuit myself why this might be the case.

It does seem plausible that results from LPM and logit would be most divergent when the overall proportion of cases is near 1 or 0, because that’s where there should be most discrepancy between a straight line and the logistic curve. I have another suggestion: check for multicollinearity in the variable(s) that are switching significance. Seemingly minor changes in specification can have major consequences when there is near-extreme collinearity.

Thanks, and so sorry for the late reply. I think you are right that collinearity may be responsible for the difference. In my analysis, I aim to find the incremental effect of several variables in the latter period (post-treatment) above and beyond effects in the earlier period (pre-treatment). Every variable thus enters my model twice, once alone and once interacted with a period indicator. The variables are, of course, very correlated with themselves interacted with the indicator. Thanks again!

Dr.Allison,

I appreciate your comments on this topic. I want to know: are there any articles about the influence of the events of independent variables? Thanks a lot.

Sorry, but I don’t know what you mean by “events of independent variables.”

Paul, I saw your post while searching for more information related to rare events logistic regressions. Thank you for the explanation, but why not zip regression?

Dear Dr Allison,

Is there a threshold that one should adhere to for an independent variable to be used for LR , in terms of ratio of two categories within the independent categorical variable. e.g. If I am trying to assess that in a sample size of 100 subjects, gender is a predictor of getting an infection (coded as 1), but 98 subjects are male and only 2 are females, will the results be reliable due to such disparity between the two categories within the independent categorical variables. [The event rate to variable ratio is set flexibly at 5].

thank you for your advice.

regards

John

With only 2 females, you will certainly not be able to get reliable estimates of sex differences. That should be reflected in the standard error for your sex coefficient.

Dear Dr. Allison,

I have a slightly different problem but maybe you have an idea. I use a multinomial logit model. One value of the dependent variable has 100 events, the other 4000 events. The sample size is 1,900,000. I am thinking the 100 events could be too few.

Thank you!

100 might be OK if you don’t have a large number of predictors. But don’t make this category the reference category.

Thank you,

I am using about ten predictors; would you consider this a low number in this case?

in general: is there an easy to implement way to deal with rare events in a multinomial logit model?

Should be OK.

Dear Dr. Allison,

I have a population of 810,000 cases with 500 events. I would like to use a logit model. I am using about 10 predictors. If I ran a logistic regression, would I get good estimates of the coefficients (especially for the constant term)?

Thank you!

I see no problem with this. You can judge the quality of the constant term estimate by its confidence interval.

I don’t understand, because I read in the article https://files.nyu.edu/mrg217/public/binaryresponse.pdf (page 38, discussing King and Zeng’s article) that “logit coefficients can lead us to underestimate the probability of an event even with sample sizes in the thousands when we have rare events data”. In fact, they explain that the constant term is affected (largely negative), but I think they also talk about biased coefficients (page 42).

Also, we can read a lot about prior correction for rare events in samples. I am wondering what the point of this correction is. Why should we use a sample rather than the whole available population if the estimates are biased in both cases?

As I said in my post, what matters for bias is not the rarity of events (in terms of a small proportion) but the number of events that are actually observed. If there is concern about bias, the Firth correction is very useful and readily available. I do not believe that undersampling the non-events is helpful in this situation.

Dr. Allison–

Thank you very much for this helpful post. I am analyzing survey data using SAS. I am looking at sexual violence and there are only 144 events. Although the overall sample is quite large (over 18,000), due to skip patterns in the survey, I am looking at a subpopulation of only sexually active males (the only ones in the survey asked the questions of interest). The standard errors for the overall sample look excellent, but when applying subpopulation analysis the standard errors are large. Do you have any suggestions to address this? I believe that I can’t use the Firth method in this case because I use SAS and it doesn’t seem to be available for PROC SURVEYLOGISTIC.

Thank you.

–Karen

How many events in your subpopulation? There may not be much you can do about this.

According to Stata Manual on the complementary log-log, “Typically, this model is used when the positive (or negative) outcome is rare” but there isn’t much explanation provided.

I tried looking up a few papers and textbooks about clog-log but most simply talk about the asymmetry property.

Can we use clog-log for rare event binary outcome? Which is preferred?

I’m not aware of any good reason to prefer complementary log-log over logit in rare event situations.

Dear Dr. Allison,

I have a sample with 5 events out of a total sample of 1500. Is it possible to perform logistic regression with this sample (I have 5 predictors)? Do you know if the Firth method is available in SPSS?

Thank you.

Not much you can do with just five events. Even a single predictor could be problematic. I’d go with exact logistic regression, not Firth. As far as I know, Firth is not available in SPSS.

(Correction2 – I sincerely apologize for my errors – the following is a correct and complete version of my question)

Dr. Allison,

I have a sample of 7108 with 96 events. I would like to utilize logistic regression and seem to be OK with standard errors. However, when analyzing standardized residuals for outliers, all but 5 of the 96 cases positive for the event have a SD>1.96. I have a few questions:

1) Is 96 events sufficient for logistic regression?

2) With 96 events, how many predictors would you recommend?

3) In that rare events analysis is really analysis of outliers, how do you deal with identifying outliers in such a case?

Thank you.

1. Yes, 96 events is sufficient.

2. I’d recommend no more than 10 predictors.

3. I don’t think standardized residuals are very informative in a case like this.

I have data set of about 60,000 observations with 750 event cases. I have 5 predictor variables. When I run the logistic regression I get all the predictors as significant. The Concordant pairs are about 80%. However, the over all model fit is not significant. Any suggestions to deal with this?

It’s rather surprising that all 5 predictors would be significant (at what level?) but the overall model fit is not significant. Send me your output.

Hi Dr. Allison,

You have mentioned that 2000 events out of 100,000 is a good sample for logistic regression, which is a 98%-2% split. I have always been advised that we should have an 80-20 or 70-30 split for logistic regression, and that if such a split is not there, then we should reduce the data. For example, we should keep the 2000 events, randomly select 8000 non-event observations, and run the model on 10,000 records in place of 100,000. Please advise.

There is absolutely no requirement that there be an 80-20 split or better. And deleting cases to achieve that split is a waste of data.

Dear Dr. Allison,

I have data on 41 patients with 6 events (= death). I am studying the prognostic value of a numerical diagnostic parameter (DP) for outcome (survival/death).

In a logistic regression of outcome versus DP, DP was significant. However, I would like to clarify whether this prognostic value is independent of age and 3 other dichotomous parameters (gender, disease, surgery). In a multiple logistic regression, DP was the only significant parameter out of these 5. But I was told the event/no-of-parameters ratio should be at least 5, and that therefore this result has no meaning. Is there any method that could help me come closer to an answer? Or is it simply not enough data (unfortunately, a small population is a common problem in clinical studies)? Thank you very much for any suggestion.

Bernhard

Try exact logistic regression, available in SAS, Stata, and some other packages. This is a conservative method, but it has no lower bound on the number of events. You may not have enough data to get reliable results, however.

I have a rare predictor (n=41) and a rare outcome. Any guidelines on how many events are needed for the predictor? (Or, the n in a given chi-square cell?)

Thanks so much!

Check the 2 x 2 table and compute expected frequencies under the independence hypothesis. If they are all > 5 (the traditional rule of thumb) you should be fine.
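The expected-frequency check is simple arithmetic: each expected count is the row total times the column total divided by the grand total. Here is a small sketch; the function name and the counts in the example table are made up for illustration.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 table: rows = predictor present/absent,
# columns = event/non-event.
observed = [[10, 31],
            [190, 769]]
expected = expected_frequencies(observed)
print(expected)
print(all(cell > 5 for row in expected for cell in row))
```

If any expected count falls below 5, that is a signal to consider exact methods rather than relying on the usual large-sample approximations.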

Dear Colleagues, sorry to interrupt your discussion but I need some help from experts.

I am a young cardiologist and I am studying the outcome in patients with coronary ectasia during acute myocardial infarction (a very rare condition). I have only 31 events (combined outcome for death, revascularization and myocardial infarction). After univariate analysis I selected 5 variables. Is it possible, in your opinion, to carry out a Cox regression analysis in this case? The EPV is only 31/5 = 6.2.

Thanks

It’s probably worth doing, but you need to be very cautious about statistical inference. Your p-values (and confidence intervals) are likely to be only rough approximations. A more conservative approach would be to do exact logistic regression.

Hi Dr. Allison,

I am working on a rare event model with response rates of only 0.13% (300 events in a data sample of 200,000). I was reading through your comments above and you have stressed that what matters is the number of the rarer event, not the proportion. Can we specify a “minimum number of events” that the data must have for modeling?

In my case, I am asking this as I do have an option of adding more data to increase the number of events(however the response rate will remain the same 0.13%). How many events will be sufficient?

Also, what would be the best strategy here: stratified sampling or the Firth method?

Thanks,

Saurabh

A frequently mentioned but very rough rule of thumb is that you should have at least 10 events for each parameter estimated. The Firth method is usually good. Stratified sampling (taking all events and a simple random sample of the non-events) is good for reducing computation time when you have an extremely large data set. In that method, you want as many non-events as you can manage.

Hi Paul! I’ve been reading this thread, and I have also encountered problems in modeling outcomes for rare events, occurring at 10% in the population we’re studying. One thing we did to capture the distinctive behavior was to draw equal samples of outcomes and non-outcomes, just to determine the behavior that predicts such outcomes. But when we ran the logistic model, we did not apply any weights to make the results representative of the population. Is this OK? I am really not that happy with the accuracy of the model: only 50% of those predicted to have the outcome actually had it. Is our problem just a function of the equal sampling proportion? And will the Firth method help to improve our model? Hope to get good insights and recommendations from you… Thanks!

Unless you’re working with very large data sets where computing time is an issue, there’s usually nothing to be gained by sampling to get equal fractions of events and non-events. And weighting such samples to match the population usually makes things worse by increasing the standard errors. As I tried to emphasize in the blog, what’s important is the NUMBER of rare events, not the fraction of rare events. If the number of rare events is substantial (relative to the number of predictors), the Firth method probably won’t help much.

Hi, thank you so much for your response. We’re working indeed with very large data. We need to sample to make computing time more efficient. I understand that what matters are the number of rare events and not the fraction, that’s why we made sure that we have a readable sample of the events. But I feel that the problem of accuracy of predicting the event is because of the equal number of events and non events used in the model. Is this valid? And yes, applying weights did no good. It made the model results even worse. For the model build for my base, should I just use random sampling of my entire population and just make sure that I have a readable base of my events?

When sampling rare events from a large data base, you get the best estimates by taking all of the events and a random sample of the non-events. The number of non-events should be at least equal to the number of events, but the more non-events you can afford to include, the better. When generating predicted probabilities, however, you should adjust for the disproportionate sampling. In SAS, this can be done using the PRIOREVENT option on the SCORE statement.
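For those not using SAS, the adjustment can be sketched by hand. The snippet below applies the standard intercept offset for sampling on the dependent variable (the "prior correction" discussed by King and Zeng): it shifts each predicted logit by the log of the ratio of sample odds to population odds of an event. The function name and the numbers are hypothetical, and I am not claiming this reproduces PRIOREVENT's internal computation line for line.

```python
from math import exp, log

def adjust_probability(p_sample, sample_event_frac, population_event_frac):
    """Convert a predicted probability from a model fit to a disproportionately
    sampled data set into a probability on the population scale.

    p_sample: predicted probability from the model fit to the sample.
    sample_event_frac: fraction of events in the analysis sample.
    population_event_frac: fraction of events in the full population.
    """
    logit = log(p_sample / (1 - p_sample))
    # Log of (sample odds of an event) / (population odds of an event)
    offset = log(((1 - population_event_frac) / population_event_frac)
                 * (sample_event_frac / (1 - sample_event_frac)))
    return 1 / (1 + exp(-(logit - offset)))

# Example: all events plus an equal number of non-events (50% events
# in the sample) drawn from a population where 1% of cases are events.
print(adjust_probability(0.5, 0.5, 0.01))
```

Only the intercept is affected by this kind of sampling, so the slope coefficients and odds ratios need no correction.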

Dr. Allison,

You mentioned “The number of non-events should be at least equal to the number of events” — is this a necessity for logistic regression? That is, the event rate has to be lower than 50%?

Certainly not. That was just a recommendation for those cases in which you want to do stratified sampling on the dependent variable. If the number of events is small, it wouldn’t be sensible to then sample fewer non-events than events. That would reduce statistical power unnecessarily.

Since it sounds like the bias relates to maximum likelihood estimation, would Bayesian MCMC estimation methods also be biased?

Good question, but I do not know the answer.

Is this a relevant article?

Mehta, Cyrus R., Nitin R. Patel, and Pralay Senchaudhuri. “Efficient Monte Carlo methods for conditional logistic regression.” Journal of The American Statistical Association 95, no. 449 (2000): 99-108.

This article is about computational methods for doing conditional logistic regression. It’s not really about rare events.

Can you use model fit statistics from SAS such as the AIC and -2 log likelihood to compare models when penalized likelihood estimation with the Firth method is used?

I believe that the answer is yes, although I haven’t seen any literature that specifically addresses this issue.

I know you’ve answered this many times above regarding logistic regression and discrete-time models — that if you have a huge number of observations, then it is best to take all of the events and a simple random sample of all of the non-events which is at least as large as the number of events. My question is: Does this advice apply also to continuous time models, specifically the Cox PH with time-varying covariates? I ask because I have a dataset with 2.8 million observations, 3,000 of which are events. Due to the many time-varying covariates and other fixed covariates (about 10 of each), we had to split the data into counting process format, so the 3,000 events have become 50,000 rows. Thus, our computing capabilities are such that taking a simple random sample from the non-events that is 15,000 (which become about 250,000 rows) and running these in PHREG with the events takes considerable computing time (it uses a combination of counting process format AND programming statements). Long story short, the question is – is 15,000 enough? And what corrections need to be made to the results when the model is based on a SRS of the non-events?

I think 15,000 is enough, but the methodology is more complex with Cox PH. There are two approaches: the nested case-control method and the case-cohort method. The nested case-control method requires a fairly complicated sampling design, but the analysis is (relatively) straightforward. Sampling is relatively easy with the case-cohort method, but the analysis is considerably more complex.

Thank you so much for the quick response! I really appreciate the guidance. I’ve just been doing some reading about both of these methods, and your concise summary of the advantages and disadvantages of each approach is absolutely right. I wanted to share, in case others are interested, two good and easy-to-understand articles on these sampling methodologies that I found: “Comparison of nested case-control and survival analysis methodologies for analysis of time-dependent exposure”, Vidal Essebag et al., and “Analysis of Case-Cohort Designs”, William E. Barlow et al.

“Does anyone have a counter-argument?”

In the 2008 paper “A weakly informative default prior distribution for logistic and other regression models” by Gelman, Jakulin, Pittau and Su, a different, fully Bayesian approach is proposed:

– shifting and scaling non-binary variables to have mean 0 and std dev 0.5

– placing a Cauchy prior distribution with center 0 and scale 2.5 on the coefficients.

Cross-validation on a corpus of 45 data sets showed superior performance. Surprisingly, the Jeffreys’ prior, i.e. the Firth method, performed poorly in the cross-validation. The second-order unbiasedness property of Jeffreys’ prior, while theoretically defensible, doesn’t make use of valuable prior information, notably that changes on the logistic scale are unlikely to be more than 5.

This paper focused on solving the common problem of infinite ML estimates when there is complete separation, not so much on rare events per se. The corpus of 45 data sets consists mostly of reasonably balanced data sets with Pr(y=1) between 0.13 and 0.79.

Yet the poor performance of the Jeffreys’ prior in the cross-validation is striking. Its mean logarithmic score is actually far worse than that of conventional MLE (using glm).
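For readers who want to see what the two ingredients listed in this comment look like, here is a rough Python sketch: rescaling a non-binary predictor to mean 0 and standard deviation 0.5, and evaluating the log density of a Cauchy prior with center 0 and scale 2.5. The function names are hypothetical, the use of the sample standard deviation is an assumption, and this is not the authors’ implementation or a full estimation routine.

```python
import math
import statistics

def rescale(values):
    """Shift a non-binary predictor to mean 0 and scale it to SD 0.5."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption of this sketch
    return [0.5 * (v - mean) / sd for v in values]

def cauchy_log_density(beta, center=0.0, scale=2.5):
    """Log density of a Cauchy(center, scale) prior at coefficient beta."""
    z = (beta - center) / scale
    return -math.log(math.pi * scale * (1.0 + z * z))

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical predictor values
x_scaled = rescale(x)
# The prior penalizes a coefficient of 5 only mildly relative to 0,
# reflecting the heavy tails of the Cauchy distribution.
penalty_gap = cauchy_log_density(0.0) - cauchy_log_density(5.0)
print(round(penalty_gap, 3))
```

In a penalized fit, the sum of these log densities over the coefficients would be added to the log likelihood before maximizing.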

I am in political science and wanted to use rare events logit in Stata, but it does not allow me to use fixed or random effects. After reading your work, I am not even sure my events are rare. Could you please let me know if I have a problem and how I might resolve it in Stata?

I have one sample with 7851 observations and 576 events. I have another sample with 6887 observations and 204 events.

I appreciate your advice.

Katherine

I don’t see any need to use rare event methods for these data.

I should have mentioned that I have 8 independent variables in my models.

Hi Dr. Allison,

Thanks for this post. I have been learning how to use logistic regression and your blog has been really helpful. I was wondering if we need to worry about the number of events in each category of a factor when using it as a predictor in the model. I’m asking this because I have a few factors set as independent variables in my model and some are highly unbalanced, which makes me worry that the number of events might be low in some of the categories (when size is low). For example, one variable has 4 categories and sizes range from 23 (15 events) to 61064! Total number of events is 45334 for a sample size of 83356. Thanks!

This is a legitimate concern. First of all, you wouldn’t want to use a category with a small number of cases as the reference category. Second, the standard errors of the coefficients for small categories will probably be high. These two considerations will apply to both linear and logistic regression. In addition, for logistic regression, the coefficients for small categories are more likely to suffer from small-sample bias. So if you’re really interested in those coefficients, you may want to consider the Firth method to reduce the bias.

Hi Dr. Allison,

When I have 20 events out of 1000 cases, can a resampling method like the bootstrap help to improve estimation? Thanks very much!

I strongly doubt it.

Dr. Allison, it is great to get your reply, thanks very much. Could you explain why the bootstrap can’t help when events are rare? Besides, if I have 700 responders out of 60,000 cases and there are 15 variables in the final model, but there were 500 variables in the original variable selection process, do you think the 700 events are enough? Thanks again!

What do you hope to accomplish by bootstrapping?

I want to increase the number of events by bootstrapping, so that there are enough events for parameter estimation.

Bootstrapping can’t achieve that. What it MIGHT be able to do is provide a more realistic assessment of the sampling distribution of your estimates than the usual asymptotic normal distribution.
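A small simulation illustrates why. The sketch below (plain Python, with a hypothetical data set of 20 events in 1000 cases) draws bootstrap resamples and counts the events in each one. On average a resample contains about as many events as the original data, so no information is gained.

```python
import random

def bootstrap_event_counts(y, n_reps=2000, seed=0):
    """Count events in each of n_reps bootstrap resamples of the outcome y."""
    rng = random.Random(seed)
    n = len(y)
    counts = []
    for _ in range(n_reps):
        resample = rng.choices(y, k=n)  # draw n cases with replacement
        counts.append(sum(resample))
    return counts

y = [1] * 20 + [0] * 980               # hypothetical: 20 events in 1000 cases
counts = bootstrap_event_counts(y)
mean_events = sum(counts) / len(counts)
print(f"mean events per resample: {mean_events:.1f}")
print(f"range across resamples: {min(counts)} to {max(counts)}")
```

The spread of the counts across resamples is what the bootstrap does offer: a picture of sampling variability, not additional events.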

Hi Dr. Allison,

I am working on natural resource management issues. In my project, ‘yes’ responses on my dependent variable are 80-85% while ‘no’ responses are 14-18%. Can I use a binary logit model here?

with Regards

S. Ray

Probably, but as I said in my post, what matters more is the number of “no”s, not the percentage.

Hi Paul!

I would be most grateful if you could help me with the following questions: 1) I have a logistic regression model with supposedly low power (65 events and ten independent variables). Several variables do however come out significant. Are these significance tests unreliable in any way?

And 2) do you know if it is possible to perform the penalized likelihood in SPSS?

They could be unreliable. In this case, I would try exact logistic regression. I don’t know if penalized likelihood is available in SPSS.

Dear Dr. Allison,

I am trying to build a logistic regression model for a dataset with 1.4 million records, with the rare event comprising 50,000 records. The number of variables is about 50, most of which are categorical variables with about 4 classes each on average. I wanted to check with you whether it is advisable to use the Firth method in this case.

Thank You

You’re probably OK with conventional ML, but check to see how many events there are in each category of each variable. If any of the numbers are small, say, less than 20, you may want to use Firth. And there’s little harm in doing so.
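The check on events per category is easy to automate. Here is a minimal Python sketch, assuming one categorical predictor and a 0/1 outcome held in parallel lists; the function name and the toy data are hypothetical, and the threshold of 20 is the rough figure mentioned above.

```python
from collections import Counter

def sparse_categories(categories, events, min_events=20):
    """Return the categories whose event count falls below min_events."""
    counts = Counter()
    for cat, y in zip(categories, events):
        counts[cat] += y  # y is 1 for an event, 0 otherwise
    return {cat: n for cat, n in counts.items() if n < min_events}

# Hypothetical data: category "a" has 25 events, category "b" has only 3.
categories = ["a"] * 30 + ["b"] * 30
events = [1] * 25 + [0] * 5 + [1] * 3 + [0] * 27
print(sparse_categories(categories, events))
```

Running this for each categorical predictor in turn flags the variables for which Firth estimation is most likely to matter.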

This is a nice discussion, but penalization is a much more general method than just the Firth bias correction, which is not always successful in producing sensible results. There are real examples in which the Firth method could be judged inferior (on both statistical and contextual grounds) to stronger penalization based on conjugate-logistic (log-F) priors. These general methods are easily implemented in any logistic-regression package by translating the penalty into prior data. For examples see Sullivan & Greenland (2013, Bayesian regression in SAS software. International Journal of Epidemiology, 42, 308-317). These methods have a frequentist justification in terms of MSE reduction (shrinkage), so they are not just for Bayesians; see the application to sparse data and comparison to Firth on p. 313.

Thanks for the suggestions.

Dr. Paul Allison, I am very thankful to you for your post and the discussion that followed, from which I have solved almost all of my problems except one. My event is out-migration, with 849 cases, which is 1.2% of the total sample (69,207). Regarding the small proportion, I think my data are in the comfort zone for applying logistic regression. But the dependent variable is highly skewed (skewness of 8.86). Does this pose any problems, and if so, how can I take care of them? Reducing the number of non-events by taking a random sample has been found helpful, but I wonder whether it affects the actual characteristics of the population concerned. Please clarify this for me. I use SPSS. Thanks.

The skewness is not a problem. And I see no advantage in reducing the number of non-events by taking a random sample.

Dear Dr. Paul Allison, I understand that we have to pay attention to small-sample bias for small categories. But I have continuous independent variables, and 50 events in 90,000 cases (all times 11 years). If I use, for example, 4 independent variables in a logit estimation, could I have problems with the interpretation of their estimated coefficients and their significance? Thanks

I’d probably want to go with the Firth method, using p-values based on profile likelihood. To get more confidence in the p-values, you might even want to try exact logistic regression. Although the overall sample size is pretty large for exact logistic, the small number of events may make it feasible.

Dear Dr. Paul Allison, I would like to know which kind of logistic regression analysis I should use if I have 1500 samples and only 30 positives. Should I use exact or Firth? What would be the advantage of using either of them in the analysis?

Firth has the advantage of reducing small sample bias in the parameter estimates. Exact is better at getting accurate p-values (although they tend to be conservative). In your case I would do both: Firth for the coefficients and exact for the p-values (and/or confidence limits).

I think this situation is most similar to my own but I’d like to check if possible. I have an experiment that has 1 independent variable with 3 levels, and a sample size of 30 in each condition. Condition 1 has 1 success/positive out of 30, Condition 2 has 4/30, and Condition 3 has 5/30. Can I rely on Firth or do I need both? (And is it acceptable to report coefficients from one but probabilities from another? I wouldn’t have guessed that would be OK.)

I don’t think you even need logistic regression here. You have a 3 x 2 table, and you can just do Fisher’s exact test, which is equivalent to doing exact logistic regression. I don’t think there’s much point in computing odds ratios, either, because they would have very wide confidence intervals. I’d just report the fraction of successes under each condition.

Dear Dr. Paul Allison,

In which cases can we use a 10% level of significance (p-value cutoff) instead of 5%? For instance, suppose you have nine independent variables, you run univariate logistic regressions, and you find that the p-values for three of the independent variables are below 10%. If you drop the variables with p-values above 10% (using the 10% level of significance) and use Firth to analyse your final model, you end up with significant values (P<0.05) for the three variables. Is it acceptable to do this analysis, and what would be the justification for using 10% as the cutoff value?

I don’t quite understand the question. Personally, I would never use .10 as a criterion for statistical significance.

Dr. Allison, this is an excellent post with continued discussion. I am currently in debate with contractors who have ruled out 62 events in a sample of 1500 as too small to analyse empirically. Is 62 on the cusp of simple logistic regression or would the Firth method still be advisable? Further, is there a rule of thumb table available which describes minimum number of events necessary relative to sample and number of independent variables? Many thanks. Becky

It may not be too small. One very rough rule of thumb is that there should be at least 10 cases on the less frequent category for each coefficient in the regression model. A more liberal rule of thumb is at least 5 cases. I would try both Firth regression and exact logistic regression.
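As a quick illustration of the arithmetic behind these rules of thumb, here is a small Python sketch. The function name is hypothetical, and the result should be read as a rough guide, not a hard limit.

```python
def max_coefficients(n_events, n_nonevents, cases_per_coefficient=10):
    """Largest number of coefficients supported by a cases-per-coefficient rule,
    where 'cases' means cases in the less frequent outcome category."""
    rarer = min(n_events, n_nonevents)
    return rarer // cases_per_coefficient

# 62 events in a sample of 1500, as in the question above.
print(max_coefficients(62, 1500 - 62))       # 10-per-coefficient rule -> 6
print(max_coefficients(62, 1500 - 62, 5))    # more liberal 5-per-coefficient rule -> 12
```

So with 62 events, the stricter rule allows about 6 coefficients and the liberal rule about 12.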

For a rare event example, say 20 events in 10,000 cases, may we add multiple copies of the events to the data (like 19 times the events, so that we can get 200 events)? Once we get the predicted probability, we just need to adjust the probability by the percentages (in this case 10/10000 -> 200/10200).

Or may we use a bootstrapping method to resample the data?

No, it’s definitely not appropriate to just duplicate your 20 events. And I can’t see any attraction to resampling. For p-values, I’d recommend exact logistic regression.

Exact logistic regression, rare events, and Firth method work well for binary outcomes. What would you suggest for rare continuous outcomes?

Say, I have multiple sources of income (20,000+ sources). Taken separately, each source throughout a year generates profit only on rare occasions. Each source could have 362 days of zero profit, and 3 days of positive profit. The number of profit days slightly vary from source to source.

I have collected daily profit values generated by each source into one data set. It looks like pooled cross sections. This profit is my dependent variable. Independent variables associated with it are also continuous variables.

Can you provide me any hints of which framework to use? (I tried tobit model that assumes left censoring.) Can I still use Firth or rare events?

Thanks.

Well, I’d probably treat this as a binary outcome rather than continuous: profit vs. no profit. Then, assuming that your predictors of interest are time-varying, I’d do conditional logistic regression, treating each income source as a different stratum. Although I can’t cite any theory, my intuition is that the rarity of the events would not be a serious problem in this situation.

Dr. Allison,

Hi. You may have already answered this from earlier threads, but is a sample size of 9000 with 85 events/occurrence considered a rare-event scenario? is logistic regression appropriate?

Many thanks.

Rob

Yes, it’s a rare event scenario, but conventional logistic regression may still be OK. If the number of predictors is no more than 8, you should be fine. But probably a good idea to verify your results with exact logistic regression and/or the Firth method.

Sorry, follow-up question… what’s the minimum acceptable c-stat… I usually hear .7, so if I get, say 0.67, should I consider a different modeling technique?

The c-stat is the functional equivalent of an R-squared. There is no minimum acceptable value. It all depends on what you are trying to accomplish. A different modeling technique is not necessarily going to do any better. If you want a higher c-stat, try getting better predictor variables.
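For those unsure what the c-statistic actually measures, here is a minimal Python sketch that computes it directly from its definition: the fraction of event/non-event pairs in which the event has the higher predicted probability, with ties counted as one half. The function names and the toy data are hypothetical. Somers' D, which some packages report instead, is related by D = 2c - 1.

```python
def c_statistic(y, p):
    """Concordance (c) statistic for 0/1 outcomes y and predicted probabilities p."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0   # event ranked above non-event
            elif pe == pn:
                concordant += 0.5   # ties count as half
    return concordant / (len(events) * len(nonevents))

def somers_d(c):
    """Somers' D expressed in terms of the c-statistic."""
    return 2.0 * c - 1.0

y = [1, 1, 0, 0]                    # hypothetical outcomes
p = [0.9, 0.4, 0.5, 0.1]            # hypothetical predicted probabilities
c = c_statistic(y, p)
print(c, somers_d(c))
```

This pairwise loop is fine for small data sets; for millions of records a rank-based formula would be much faster.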

Hello Mr.Allison,

I’m writing to you because I have a similar problem. I have unbalanced panel data with 23 time periods (the attrition is due to the loss of individuals over the periods). I would like to ask your opinion on 2 issues:

1. How can I do the regression, should I use the pooled data or panel data with FE/RE?

2. I also have a problem of rare events. For the pooled data I have almost 10000000 obs and only 45000 obs with the event=1 (0.45%). What do you think I should do in this case?

Thank you very much, I appreciate your help.

Stefan

1. I would do either fixed or random effects logistic regression.

2. With 45,000 events, you should be fine with conventional maximum likelihood methods.

First of all, thank you for your answers.

The problem is that when I do logistic regression for the pooled data I obtain a small Somers D (0.36) and my predicted probabilities are very small, even for the cases with event=1 (the probabilities are no bigger than 0.003). I don’t know what to do.

What do you think is the problem, and what can I do?

Thank you again.

Hello,

Paul,

I am currently doing my project for my MSc. I have a dataset with 2312 observations with only 29 observations. I want to perform a logistic regression analysis of association. Which method would you recommend?

I assume you mean 29 events. I’d probably use the Firth method to get parameter estimates. But I’d double-check the p-values and confidence intervals with conditional logistic regression. And I’d keep the number of predictors low–no more than 5, preferably fewer.

Dear Dr. Allison,

I have a small dataset (90 with 23 events) and have performed an exact logistic regression which leads to significant results.

I wanted to add an analysis of model fit and goodness-of-fit statistics like the AIC, the Hosmer-Lemeshow test or McFadden’s R2. After reading your book about logistic regression using SAS (second edition), my understanding is that all these calculations only make sense, or are only possible, if conventional logistic regression is used. Is my understanding correct? Are there other ways to check goodness of fit when using exact logistic regression? Thank you.

Standard measures of fit are not available for exact logistic regression. I am not aware of any other options.

Dr.Allison,

The article and comments here have been extremely helpful. I’m working on building a predictive model for bus breakdowns with logistic regression. I have 207960 records in total, with 1424 events in the data set. Based on your comments above, it seems I should have enough events to continue without oversampling. The only issue is that I’m also working with a large number of potential predictors, around 80, which relate to individual diagnostic codes that occur in the engine. I’m not suggesting that all of these variables will be in the final model, but is there a limit to the number of predictors I should be looking to include in the final model? Also, some of the predictors/diagnostic codes happen rarely as well. Is there any concern about having rare predictors in a model with rare events?

Thanks,

Tony

Well, a common rule of thumb is that you should have at least 10 events for each coefficient being estimated. Even with 80 predictors, you easily meet that criterion. However, the rarity of the predictor events is also relevant here. The Firth method could be helpful in reducing any small-sample bias of the estimators. For the test statistics, consider each 2 x 2 table of predictor vs. response. If the expected frequency (under the null hypothesis of independence) is at least 5 in all cells, you should be in good shape.
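The expected-frequency check for a 2 x 2 table is simple to compute by hand or in code. Here is a minimal Python sketch; the function names are hypothetical, and the table uses made-up counts for a diagnostic code that appears in 300 of the 207,960 records.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def all_expected_at_least(table, minimum=5.0):
    """True if every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(table) for e in row)

# Rows: code present / code absent. Columns: breakdown / no breakdown.
# Hypothetical split of 1424 events across a code seen in 300 records.
table = [[10, 290],
         [1414, 206246]]
expected = expected_counts(table)
print(round(expected[0][0], 2), all_expected_at_least(table))
```

Here the expected count for "code present and breakdown" is only about 2, so this hypothetical predictor would fail the check even though the overall number of events is large.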

Hello Dr. Allison,

The data I use are also characterized by very rare events (~0.5% positives). There are, however, enough positives (thousands), so it should hopefully be OK to employ logistic regression according to your guidelines.

My question comes from a somewhat different angle (which I hope is ok).

I have ~20 predictors which by themselves represent estimated probabilities. The issue is that the level of confidence in these probabilities/predictors may vary significantly. Given that these confidence levels could be estimated, I’m looking for a way to take these confidence levels into account as well, since the predictor’s true weight may significantly depend on its confidence.

One suggested option was to divide each predictor/feature into confidence based bins, so that for each case (example) only a single bin will get an actual (non zero) value. Similar to using “Dummy Variables” for category based predictors. Zero valued features seem to have no effect in the logistic regression formulas (I assume that features would need to be normalized to a 0 mean value)

Could this be a reasonable approach ?

Any other ideas (or alternative models) for incorporating the varying confidence levels of the given predictor values?

Thanks in advance for your time and attention

One alternative: if you can express your confidence in terms of a standard error or reliability, then you can adjust for the confidence by estimating a structural equation model (SEM). You would have to use a program like Mplus or the gsem command in Stata that allows SEM with logistic regression. BTW, if you do dummy variables, there is no need to normalize them to a zero mean.

Thank you so much for your response and advice.

Regarding the option of using dummy variables, here is what I find confusing:

– On the one hand, whenever a feature takes a value of 0, the learning of its weight does not seem to be affected (according to the gradient descent formula), or maybe I’m missing something.

– On the other hand, the features in my case represent probabilities (which are a sort of prediction of the target value). So if in a given example the feature assumes a 0 value (implying a prediction of 0) but the actual target value is 1 it should cause the feature weight to decrease (since, in this example, it’s as far as possible from the true value)

Another related question that I have:

In logistic regression the linear combination is supposed to represent the logit, log(p/(1-p)). In my case the features are themselves probabilities (actually a sort of “prediction” of the target value). So their linear combination seems more appropriate for representing the probability of the target value itself rather than its logit. Since p is typically very small, ~0.5% (implying that log(p/(1-p)) ≈ log(p)), would it be preferable to use the log of the features instead of the original feature values as input for the logistic regression model?

Again, thanks a lot for your advice.

Because there is an intercept (constant) in the model, a value of 0 on a feature is no different than any other value. You can add any constant to the feature, but that will not change the weight or the model’s predictions. It will change the intercept, however.

It’s possible that a log transformation of your feature may do better. Try it and see.
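The point about the intercept can be verified numerically. In the Python sketch below, with hypothetical coefficient values, adding a constant c to the feature and subtracting weight * c from the intercept leaves every predicted probability unchanged, while the weight itself stays the same.

```python
import math

def predict(intercept, weight, x):
    """Predicted probability from a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + weight * x)))

b0, b1 = -5.0, 2.0            # hypothetical intercept and weight
c = 3.0                       # constant added to the feature
xs = [0.0, 0.1, 0.5, 1.0]     # includes a feature value of exactly 0

original = [predict(b0, b1, x) for x in xs]
# Same weight, shifted feature, intercept adjusted by -b1 * c.
shifted = [predict(b0 - b1 * c, b1, x + c) for x in xs]

max_gap = max(abs(a - b) for a, b in zip(original, shifted))
print(max_gap < 1e-12)
```

So a value of 0 has no special status: it is just one point on the feature's scale, and the intercept absorbs any shift of that scale.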

I have a sample of 11,935 persons, of whom 944 made one or more visits to the emergency department during one year. Can I apply logistic regression safely to these data? (My colleague recommended a count data model like the ZINB model, because conventional logistic regression generates a problem of underestimated ORs due to excess zeros. But I think the event itself can sometimes be more important information than the number of events per patient.)

Yes, I think it’s quite safe to apply logistic regression to these data. You could try the ZINB model, but see my blog post on this topic. A conventional NB model may do just fine. Personally, I would probably just stick to logistic, unless I was trying to develop a predictive model for the number of visits.

Dr. Allison,

I highly appreciate you for the valuable advice. But I have one more question.

He (my colleague) wrote to me:

“Our data have too many zeros of which some may be ‘good’ zeros but others may be ‘bad’ zeros. Then, we should consider that the results of logistic regression underestimate the probability of event (emergency department visit).”

If he is correct, what should I do to minimize this possibility? (Your words ‘quite safe’ in your reply imply that he is wrong, I guess)

If he is wrong, why is he wrong?

Thank you for sparing your time for me.

I would ask your colleague what he means by “too many zeros”. Both logistic regression and standard negative binomial regression can easily allow for large fractions of zeros. I would also want to know what the difference is between “good zeros” and “bad zeros”. Zero-inflated models are most useful when there is strong reason to believe that some of the individuals could not have experienced the event no matter what the values of their predictor variables. In the case of emergency room visits, however, it seems to me that everyone has some non-zero risk of such an event.

Dr. Allison,

Thank you very much. We bought some books on statistics, including your books. Your advice stimulated us to study important statistical techniques. Thank you.

Dear Dr. Allison,

I need your expertise on selecting an appropriate method. I have 5 rare events (machine failures) out of 2000 observations.

Now, I need to predict when machine will be down based on the historical data, I have 5 columns

1) Error logs – which were generated by the machine (non-numeric)

2) Time stamp – when error message was generated

3) Severity – Severity of each error log (1-low, 2- Medium, 3- High)

4) Run time – No. of hours the machine ran till failure

5) Failed? – Yes/No

Thanks in advance for your help!

With just five events, you’re going to have a hard time estimating a model with any reliability. Exact logistic regression is essential. Are the error logs produced at various times BEFORE the failure, or only at the time of the failure? If the latter, then they are useless in a predictive model. Since you’ve got run time, I would advise some kind of survival analysis, probably a discrete time method so that you can use exact logistic regression.

Thanks, Dr. Allison. The error logs were produced at various times BEFORE the failure. Is there a minimum required number of events (or proportion of events) for estimating a model? In any case, I will try other methods as you advised (survival, Poisson model).

Well, if you had enough events, I’d advise doing a survival analysis with time dependent covariates. However, I really don’t think you have enough events to do anything useful. One rule of thumb is that you should have at least 5 events for each coefficient to be estimated.

Thank you so much for the post. I am working on data with only 0.45 percent “yes”s, and your posts were really helpful. The Firth method and the rare events logit produce the very same coefficients, as you explained in your post. The regular post-estimation commands such as mfx, however, do not give me the magnitudes of the effects that I would like to see after either method. I read all the posts in the blog, but could not find a clue.

Thank you for your help, Dr. Allison!

The mfx command in Stata has been superseded by the margins command. The firthlogit command is user written and thus may not support the post estimation use of the margins command. The problem with the exlogistic command is that it doesn’t estimate an intercept and thus cannot generate predicted values, at least not in the usual way.

Dear Dr. Allison,

I have 10 events in a sample with 46 observations (including the 10 events). I have run firthlogit in Stata, but I could not use the command fitstat to estimate r2. I would like to ask how I can estimate r2 with Stata? Is there any command?

Thanks in advance for your time and attention.

I recommend calculating Tjur’s R2 which is described in an earlier post. Here’s how to do it after firthlogit:

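```stata
* Fit the Firth penalized-likelihood logit
firthlogit y x1 x2
* Linear predictions (log-odds)
predict yhat
* Convert log-odds to probabilities
gen phat = 1/(1+exp(-yhat))
* Compare mean predicted probabilities for y=0 and y=1
ttest phat, by(y)
```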

The gen command converts log-odds predictions into probabilities. In the ttest output, what you’re looking for is the difference between the average predicted values. You’ll probably have to change the sign.
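For readers outside Stata, the same quantity can be computed in a few lines of Python. This is a sketch with a hypothetical function name and toy values; it assumes you already have predicted probabilities from whatever estimation method you used.

```python
def tjur_r2(y, phat):
    """Tjur's R2: mean predicted probability among events
    minus mean predicted probability among non-events."""
    p_events = [p for yi, p in zip(y, phat) if yi == 1]
    p_nonevents = [p for yi, p in zip(y, phat) if yi == 0]
    mean_events = sum(p_events) / len(p_events)
    mean_nonevents = sum(p_nonevents) / len(p_nonevents)
    return mean_events - mean_nonevents

y = [1, 1, 0, 0]                   # hypothetical outcomes
phat = [0.8, 0.6, 0.3, 0.1]        # hypothetical predicted probabilities
print(round(tjur_r2(y, phat), 3))
```

Because the events' mean is subtracted from first, the result already has the right sign, so no sign change is needed here.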

Dear Allison,

I recently did a study of bleeding complications after a procedure. A total of 185 patients were enrolled in this study and 500 procedures were performed. Only 16 events were finally observed. So what kind of method can I use to analyze the predictive factors for these events? I’ve tried logistic regression in SPSS; however, the reviewers said “The number of events is very low, which limits the robustness of the multivariable analysis with such a high number of variables.”

Thanks in advance for your help!

Do you really have 500 potential predictors? If so, you need to classify the procedures into a much smaller number. Then, here’s what I recommend: (1) Do forward inclusion stepwise logistic regression to reduce the predictors to no more than 3. Use a low p-value as your entry criterion, no more than .01. (2) Re-estimate the final model with Firth logit. (3) Verify the p-values with exact logistic regression.

Hi Paul,

In my case I have 14% (2.9 million) of the data with events. Is it fine if I go with MLE estimation?

Thanks!!!!

Yes

Dear Dr Allison,

I’m running some analysis about firms’ relations. I’ve got info on B-to-B relations (suppliers – customers) for almost all Belgian firms (let’s assume that I have all transactions – around 650,000 transactions after cleaning for missing values in explanatory variables) and I want to run a probit or a logit regression of the probability that two firms are connected (A supplies B) and I need to create the 0’s observations. What would be the optimal strategy, taking into account that I cannot create all potential transactions (19,249,758,792) ?

I’ve considered either selecting a random sample of suppliers (10% of the original sample) and a random sample of customers (same size) and considering all potential transactions between those two sub-samples, or considering all actual transactions and randomly selected non-transactions.

I’d go with the 2nd method: all transactions and a random sample of non-transactions. But with network data, you also need special methods to get the standard errors right. There’s an R function called netlogit (in the sna package) that can do this.
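When the full set of potential pairs is too large to enumerate, the non-transactions can be drawn by rejection sampling. Here is a minimal Python sketch, assuming every firm can be both a supplier and a customer and that transactions are stored as (supplier, customer) pairs; the function name and the toy data are hypothetical.

```python
import random

def sample_non_edges(nodes, edges, k, seed=0):
    """Draw k distinct ordered pairs (a, b), a != b, that are not observed edges."""
    observed = set(edges)
    n = len(nodes)
    available = n * (n - 1) - len({e for e in observed if e[0] != e[1]})
    if k > available:
        raise ValueError("k exceeds the number of available non-edges")
    rng = random.Random(seed)
    sampled = set()
    while len(sampled) < k:
        a, b = rng.choice(nodes), rng.choice(nodes)
        # Reject self-pairs and pairs that are actual transactions.
        if a != b and (a, b) not in observed:
            sampled.add((a, b))
    return sorted(sampled)

firms = list(range(10))                 # hypothetical firm identifiers
transactions = [(0, 1), (1, 2)]         # hypothetical observed transactions
non_transactions = sample_non_edges(firms, transactions, k=5)
print(non_transactions)
```

Because real transactions are a tiny fraction of all possible pairs, very few draws are rejected, so this scales to large networks. Predicted probabilities from the resulting model would still need to be adjusted for the sampling of non-events.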

Dear Dr. Allison,

I work in fundraising and have developed a logistic regression model to predict the likelihood of a constituent making a gift above a certain level. The first question my coworkers asked is what the time frame is for the predicted probability. In other words, if the model suggests John Smith has a 65% chance of making a gift, they want to know if that’s within the next 2 years, 5 years, or what. The predictor variables contain very little information about time, so I don’t think I have any basis to make this qualification.

The event we’re modeling is already pretty rare (~200 events at the highest gift level) so I’m concerned about dropping data, but the following approach has been suggested: If we want to say someone has a probability of giving within the next 3 years, we should rerun the model but restrict the data to events that happened within the last 3 years. Likewise, if we use events from only the last 2 years, then we’d be able to say someone has a probability of giving within the next 2 years.

Apart from losing data, I just don’t see the logic in this suggestion. Does this sound like a reasonable approach to you?

Any suggestions on other ways to handle the question of time would be much appreciated. It seems like what my coworkers want is a kind of survival analysis predicting the event of making a big gift, but I’ve never done that type of analysis, so that’s just a guess.

Thanks for your time,

DC

Ideally this would be a survival analysis using something like Cox regression. But the ad hoc suggestion is not unreasonable.

Dear Dr. Allison,

I am analyzing a rare event (about 60 in 15,000 cases) in a complex survey using Stata. I get good results (it seems) on the unweighted file using “firthlogit” but it is not implemented with svy: I need either another way to adjust for the complex survey design or an equivalent of firthlogit that can work with the svyset method.

Any suggestions?

Sorry, but I don’t have a good solution for Stata. Here’s what I’d do. Run the model unweighted using both firthlogit and logistic. If results are pretty close, then just use logistic with svyset. If you’re willing to use R, the logistf package allows for case weights (but not clustering). Same with PROC LOGISTIC in SAS.

This is a great resource, thanks so much for writing it. It answered a lot of my questions.

I am planning to use MLwiN for a multilevel logistic regression, with my outcome variable having 450 people in category 1 and around 3200 people in category 0.

My question is: MLwiN uses quasi-likelihood estimation methods as opposed to maximum likelihood methods. Do the warnings of bias stated in the article above still apply with this estimation technique, and if so, would it be smart to change the estimation method to penalized quasi-likelihood?

Thanks so much for any light you can shed on this issue.

First of all, I’m not a fan of quasi-likelihood for logistic regression. It’s well known to produce downwardly biased estimates unless the cluster sizes are large. As for rare events, I really don’t know how well quasi-likelihood does in that situation. My guess is that it would be prone to the same problems as regular ML. But with 450 events, you may be in good shape unless you’re estimating a lot of coefficients.

Hi Paul, I’m currently working on my thesis on classifying child labor, comparing the C5.0 decision tree algorithm with multivariate adaptive regression splines (MARS). I have imbalanced data on child labor (a total sample of 2,402, with 96% child labor and 4% not child labor) and 16 predictor variables.

Using a decision tree with imbalanced data is not much of a problem because there are many techniques for balancing data, but I’m very confused about MARS (MARS with a logit function). I have a few questions:

1. Could I just use MARS without balancing the data? Or

2. Could I use a sampling method (oversampling, undersampling, SMOTE) to balance the data? Or

3. Could you suggest some other methods? Thank you for the advice.

Sorry but I don’t know enough about MARS to answer this with any confidence. Does MARS actually require balancing? It’s hard to see how oversampling or undersampling could help in this situation.

Hi.

Thank you in advance for this fascinating discussion and for your assistance (if you reply, but if not I understand).

I have a model with 1125 cases. I have used binary logistic regression but have been told I do not take into account that 0/1 responses in the dependent variable are very unbalanced (8% vs 92%) and that the problem is that maximum likelihood estimation of the logistic model suffers from small-sample bias. And the degree of bias is strongly dependent on the number of cases in the less frequent of the two categories. It has been suggested that in order to correct any potential biases, I should utilise the penalised likelihood/Firth method/exact logistic regression.

Do you agree with this suggestion or is my unbalanced sample OK because there are enough observations in the smaller group?

Regards,

Kim

So, you’ve got about 90 cases on the less frequent category. A popular (but very rough) rule of thumb is that you should have about 10 cases (some say 5) for each coefficient to be estimated. That suggests that you could reasonably estimate a model with about 10 predictors. But I’d still advise using the Firth method just to be more confident. It’s readily available for SAS and Stata. Exact logistic regression is a useful method, but there can be a substantial loss of power along with a substantial increase in computing time.

Hello sir, I am also trying to model my binary response variable with 5 different independent variables. My dataset is imbalanced: my sample size is 2153, of which only 67 are of one kind and the rest are of the other kind. What would you suggest? Is it possible to model my data statistically even though it is imbalanced?

The problem is not lack of balance, but rather the small number of cases on the less frequent outcome. A very rough rule of thumb is that you should have at least 10 cases on the less frequent outcome for each coefficient that you want to estimate. So you may be OK. That rule of thumb is intended to ensure that the asymptotic approximations for p-values and confidence intervals are close enough. It doesn’t ensure that you have enough power to detect the effects of interest. I’d probably just run the model with conventional ML. Then corroborate the results with Firth logit or exact logistic regression.
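The rule-of-thumb arithmetic can be written out as a small Python helper. The function name is hypothetical, and the threshold of 10 simply restates the rough rule; it is not a formal test and says nothing about power.

```python
def events_per_coefficient(n_total, n_events, n_coefficients):
    """Cases on the less frequent outcome, divided by coefficients to be estimated."""
    rarer = min(n_events, n_total - n_events)
    return rarer / n_coefficients

# The question here: 2153 cases, 67 on the rarer outcome, 5 predictors.
ratio = events_per_coefficient(2153, 67, 5)
print(ratio)        # 13.4
print(ratio >= 10)  # True, so the rough rule is satisfied
```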

Hi Paul,

In my case, I want to use logistic regression to model fraud or no fraud with 5 predictors, but the problem is I have only 1 fraud out of 5,000 observations. Is it still possible to use logistic regression with Firth logit to model it? What is your suggestion for the best approach for this case?

Thank you so much,

Jeff Tang

I’m afraid you’re out of luck Jeff. With only 1 event, there’s no way you can do any kind of reliable statistical analysis.

That’s what I thought. Thank you, Paul.

By the way, what if I just convert the raw data from each predictor to a standard score (say 1-10) and then sum them up, in order to at least give me some idea of how likely each person is to commit fraud?

What do you think?

Thanks again,

Jeff

Problem is, how do you know these are the right predictors?

I see. I’ll figure it out. Suppose after I find the right predictors, do you think it’s a good idea to use the standard score for this very limited data? What’s your advice?

Thank you,

Jeff

Might be useful. But getting the right predictors is essential.

Hi Paul,

I am working on my master’s thesis and I’m having some difficulties with it.

It is about the relationship between socio-demographic and health-related variables and the chance of passing the first year of college. So my dependent variable is passing (=1) or failing (=0).

Now, I’m doing univariate logistic regressions to see which variables are significant, and so which ones I should include in my multivariate logistic regression analysis.

When I look at the Hosmer and Lemeshow test for the categorical predictors (e.g., gender, being on a diet or not), I get the following:

chi²: 0.000

df: 0

sign: .

Why is this? Is this due to the fact that there are only four groups possible?

(male passed, male failed, female passed, female failed)

Furthermore, I also have a predictor with 5 response options (once a week, twice a week, 3-4 times a week,…), and there my p-value is significant. What should I do when it is significant? I entered this variable as a continuous variable, but maybe this is not correct?

Also, is the Hosmer and Lemeshow test important in univariate logistic regressions, or is it only done in multivariate ones?

Thanks in advance,

a desperate master student

See my post on the Hosmer-Lemeshow statistic: http://statisticalhorizons.com/hosmer-lemeshow

Thanks for this nice post. When do you start thinking that it is not possible to perform a reliable statistical analysis? My problem is that I have around 40 events in a sample of 40000, and I also have around 10 covariates to explain the outcomes. What would you suggest? Do you rely on other implemented software? R?

Thanks in advance

Well, I think you have enough events to do some useful analysis. And I’d probably start with conventional logistic regression. But then I’d want to corroborate results using both the Firth method and exact logistic regression. Both of these methods are available in SAS, Stata (with the user-written command firthlogit) and R. Your final model would ideally have closer to 5 covariates rather than 10. And keep in mind that while you may have enough events to do a correct analysis, your power to test hypotheses of interest may be low.

Dear dr Allison,

Thank you for this clear explanation above.

We are studying an event with a low incidence (0.8:1000 up to 10:1000) in a large dataset (n=1,570,635).

In addition, we also performed conventional logistic regression analysis on the recurrence rate of this event in a linked dataset (n=260,000 for both time points). Roughly 30 out of 320 patients with a first event had a recurrent event compared to 184 in the remaining population (de novo event at the second timepoint of the study). We adjusted for a maximum of 5 variables in the multivariate analysis.

Was it correct to use conventional logistic regression or should we have used Firth or exact logistic regression analysis instead?

Thanks in advance,

Well, you’re certainly OK with conventional ML for the non-recurrent analysis. For the recurrent analysis, you might want to replicate with Firth regression (downsides minimal) or possibly exact logistic (less power, more computing time).

Thank you for your quick answer. I’ll have a look at performing a Firth regression in SAS on the recurrent analysis and see what different results are given.

We performed the logistic regression analysis with the Firth correction by adding / cl firth to our syntax. The odds ratio with conventional logistic regression was 83 (55-123); with Firth’s it is now 84 (56-124). Mainly the CI became a bit wider.

We may thus conclude from these results that the recurrence rate remains statistically significant, correct?

Thank you in advance,

Probably. But the CI based on the usual normal approximation may be inaccurate with the FIRTH method. Instead of CL, use PLRL which stands for profile likelihood risk limits.

Hello Dr. Allison,

Thank you for this posting it has been very helpful.

I have a sample of 170 observations which I have run a predictive model on. As the main focus of this study is exploring gender patterns I would like to build models stratified by gender leaving me with 76 women, 94 men. There are 50 events in the women and 59 in the men.

I found that with logistic regression my CIs are very wide for my ORs so have used firth logistic instead.

I am still finding I have wide CIs, the widest for any of the predictors in the women is 1.82-15.69 and for the men is 1.01-11.56.

However, I am finding variables in the model to be significant below 0.05, and even as low as 0.001. These variables make clinical and statistical sense. Is it still reasonable to present this model, noting that there are limitations in terms of sample size?

I have read however that wide CIs are common in firth, can you speak to this?

Are there any other suggestions you may have for modelling with such small sample size?

Thank you in advance!

If you use the Firth method, make sure that your CIs are based on the profile likelihood method rather than the usual normal approximation. The latter may be somewhat inaccurate. In any case, the fact that your CIs are wide is simply a consequence of the fact that your samples are relatively small, not the particular method that you are using. That said, there’s no reason not to present these results.

Thank you for your quick response!

In Stata the firth model output notes a penalized log likelihood rather than a log likelihood. I am assuming this penalty ensures the CIs are not based on a normal approximation. Is this correct, or is there something else I should be looking for in my output to identify the profile likelihood method is being used?

Thank you!

For confidence intervals, the firthlogit command uses the standard normal approximation rather than the profile likelihood method. However, you can get likelihood ratio tests that coefficients are 0 by using the set of commands shown in the example section of the help file for firthlogit.

Thank you for this suggestion, following the commands in the help section I have tested that the coefficients=0. The coefficients for the variables that are significant in the firth model do not = 0, while those that are not significant (my force in variables) do = 0, according to the Likelihood ratio test.

Despite doing this testing my CIs for this firth logistic regression model are still not based on the profile likelihood method and are being calculated using normal approximation. It seems from your previous post testing LRT coefficients may offer an alternative to presenting these CIs based on the PLM? Would you be able to clarify this?

Thank you very much!

The normal approximation CIs are probably OK in your case. They are most problematic when there is quasi-complete separation or something approaching that.

Both likelihood ratio tests and profile likelihood confidence intervals are based on the same principles. Thus, if the profile likelihood CI for the odds ratio does not include 1, the likelihood ratio test will be significant, and vice versa.
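The correspondence can be checked numerically. The Python sketch below is a toy illustration under stated assumptions: it uses ordinary (unpenalized) maximum likelihood, a single binary predictor, and a hypothetical 2x2 table, and all function names are made up for this example. The 95% profile likelihood limits are the values of the log odds ratio at which the likelihood ratio chi-square equals 3.84, so the interval excludes 0 (odds ratio of 1) exactly when the likelihood ratio test rejects at the .05 level.

```python
import math

CHI2_95_1DF = 3.841458820694124  # 95th percentile of chi-square with 1 df

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def loglik(alpha, beta, y0, n0, y1, n1):
    """Log-likelihood for y events out of n cases in groups x=0 and x=1."""
    p0, p1 = expit(alpha), expit(alpha + beta)
    return (y0 * math.log(p0) + (n0 - y0) * math.log(1 - p0)
            + y1 * math.log(p1) + (n1 - y1) * math.log(1 - p1))

def profile_loglik(beta, y0, n0, y1, n1):
    """Maximize the log-likelihood over the intercept for a fixed beta."""
    lo, hi = -30.0, 30.0
    for _ in range(200):  # the score for alpha is decreasing, so bisect on its sign
        mid = (lo + hi) / 2
        score = (y0 + y1) - n0 * expit(mid) - n1 * expit(mid + beta)
        lo, hi = (mid, hi) if score > 0 else (lo, mid)
    return loglik((lo + hi) / 2, beta, y0, n0, y1, n1)

def lr_statistic(beta0, y0, n0, y1, n1):
    """Likelihood ratio chi-square for H0: beta = beta0."""
    alpha_hat = math.log(y0 / (n0 - y0))
    beta_hat = math.log(y1 / (n1 - y1)) - alpha_hat
    ll_max = loglik(alpha_hat, beta_hat, y0, n0, y1, n1)
    return 2 * (ll_max - profile_loglik(beta0, y0, n0, y1, n1))

def profile_ci(y0, n0, y1, n1):
    """95% profile likelihood limits for beta (the log odds ratio)."""
    beta_hat = math.log(y1 / (n1 - y1)) - math.log(y0 / (n0 - y0))
    limits = []
    for direction in (-1, 1):
        inner, outer = beta_hat, beta_hat + direction * 10.0
        for _ in range(100):  # bisect for the point where the LR statistic hits 3.84
            mid = (inner + outer) / 2
            if lr_statistic(mid, y0, n0, y1, n1) < CHI2_95_1DF:
                inner = mid
            else:
                outer = mid
        limits.append((inner + outer) / 2)
    return tuple(limits)

# Hypothetical table: 5 events out of 50 when x=0, 15 events out of 50 when x=1.
lower, upper = profile_ci(5, 50, 15, 50)
lr = lr_statistic(0.0, 5, 50, 15, 50)
print(math.exp(lower) > 1, lr > CHI2_95_1DF)  # the two answers always agree
```

Wald intervals carry no such guarantee, which is one reason the two kinds of interval can disagree in small samples.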

Are there any suggested goodness-of-fit tests for Firth logistic regression? I see that Hosmer-Lemeshow is invalid when using the Firth method.

Also, are AIC values valid with Firth?

Thank you.

Well, as I’ve stated in other posts, I am not a fan of the Hosmer-Lemeshow test in any case. But I don’t see why it should be specially invalid for the Firth method. The goodness of fit tests that I discuss in my posts of 7 May 2014 and 9 April 2014 could be useful.

Hi Dr Allison,

I am doing a logistic regression on 19,100 cases with 18 predictors. Six of my predictors have rare events (the lowest are 217 of 19,100, 630 of 19,100, etc.). In the goodness-of-fit output, Pearson is 0 and Deviance is 1, which I know to be problematic.

Firstly, do you think this is likely to be due to the rare events? Secondly, is oversampling necessary? From reading your previous comments, it seems that although the predictors are proportionally unbalanced, there would be a sufficient number of events in each category.

Thanks for taking the time to reply to these comments.

Caroline

You should be fine with conventional ML. No oversampling is necessary. The discrepant results for Pearson and Deviance are simply a consequence of the fact that you are estimating the regression on individual-level data rather than grouped data. These statistics are worthless with individual-level data.

Dr. Allison –

I am performing a logistic regression with 20 predictors. There are 36,000 observations. The predictor of interest is a binary variable with only 84 events that align with the dependent variable. Is firth logistic regression the best method for me to use in this case?

Regards,

Amanda

That’s what I would use.

Dear Paul Allison,

Thanks for this insightful article. In my research, I have an unbalanced panel of merging and non-merging firms for about 20 years, and I am investigating driving factors of the probability of merging. Among the 5000 firms in the sample, only 640 of them experience a merger. It means the dependent variable has many zeros. Based on my readings from this article, firthlogit command in Stata is your choice. Is this true for an unbalanced panel data as well? Thanks for your time and consideration.

With that many mergers, standard logistic regression should be just fine. But if some firms contribute more than one merger, you should probably be doing a mixed model logistic regression using xtlogit or melogit.

I have a question about the recommended 5:1 ratio of events to predictors.

Is this ratio suggestion for the number of predictors you start with, or the number of predictors you ultimately find statistically significant for the final model?

p.s. fascinating discussion

Ideally it would be the number you start with. But that might be too onerous in some applications.

Dear Dr. Allison,

Stata’s firthlogit command does not allow for clustered standard errors. Does Firth logit automatically account for clustered observations?

I am fitting a discrete hazard model, so it feels strange not to specify clustered standard errors. In any case, firthlogit has produced results nearly identical to the results from logit and rare events logit models with clustered standard errors.

No, Firth logit does not correct for clustering. However, if you are fitting a discrete hazard with no more than one event per individual, there is no need to adjust for clustering. That may explain why all the results are so similar.
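For readers unfamiliar with the setup, a discrete-time hazard model is usually fit to person-period data: one record per individual per period at risk, with the outcome equal to 1 only in the period when the event occurs. A minimal Python sketch, with hypothetical field names (`id`, `period`, `event`):

```python
def person_period_rows(person_id, periods_observed, had_event):
    """Expand one individual into one record per period at risk."""
    rows = []
    for t in range(1, periods_observed + 1):
        # The event indicator is 1 only in the final observed period, and only
        # if the individual experienced the event rather than being censored.
        event = 1 if (had_event and t == periods_observed) else 0
        rows.append({"id": person_id, "period": t, "event": event})
    return rows

# Person 1 has the event in period 3; person 2 is censored after period 2.
data = person_period_rows(1, 3, True) + person_period_rows(2, 2, False)
for row in data:
    print(row)
```

With at most one event per individual, the likelihood for these records factors into a product of conditional probabilities, one per period at risk, which is why the multiple records per person can be treated as if they were independent observations.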

I have a question regarding the applicability of the firthlogit command for panel data in Stata:

How can I implement the penalized logistic regression for panel data? I understand that I can use the xtlogit commands for FE and RE, but how do I do this with the firthlogit command?

Thank you very much for your help!

Unfortunately, the firthlogit command does not have any options for dealing with panel data.

Thanks so much for this article. I am performing logistic regression for a sample size of 200 with only 8 events on SPSS. I believe SPSS does not offer exact logistic regression or the Firth method. The p value for my model is statistically significant (p<0.05) and one of my independent variables seems to contribute significantly to the model (p<0.05).

Without any independent variables, the model correctly classifies 96% of the cases; the model correctly classifies 98% of cases with the independent variables added. R^2 = 33%. I realize that the number of rare events is quite small, which you mentioned could be problematic. How meaningful do you believe the results are, and would you have any suggestions on improving the statistical work? Thank you!

With only eight events, I really think you should do exact logistic regression to get p-values that you can put some trust in. Lack of availability in SPSS is not an acceptable excuse.

Dear Dr. Allison,

Thank you so much for your article. I have a sample of 320 observations with 22 events. Is it suitable to proceed with conventional ML? Or would exact logistic regression be a better option? Do you know whether rare-event methods such as Firth or exact logistic regression can be implemented in EViews? Thank you.

You might be able to get by with conventional ML, depending on how many predictors you have. But in any case, I would verify p-values using exact logistic regression. Firth is probably better for coefficient estimates. I don’t know if these methods are available in EViews.

Hello Dr. Allison,

Thank you for all the information contained in this article and especially the comments following. My dataset has about 75,000 observations (parcels) with about 1,000 events (abandoned properties). I plan to begin with 20 predictors and use the penalized method because some of my predictor variables are also ‘rare’ (< 20 in some categories). My goal is to be able to use the model to predict future events of abandonment.

My major questions are about sampling. According to the comments above, the full dataset should be used so as not to lose good data, but if I use stratified sampling to get a 50/50 split, my coefficients will not be biased and my odds ratios will be unchanged. After trying both approaches, using the full dataset and multiple 50/50 datasets (all 1s and a random sample of 0s), I get quite different results, with the full dataset performing worse on all measures, specifically AIC and SC. In the classification table, with the full dataset I predict only about 10% of my abandoned properties, and with the 50/50 datasets I can predict about 90%. If I use the 50/50 model to try to predict future abandonment (with updated data), am I breaking principles of logistic regression? Thank you in advance for any insight.

You can’t compare AIC and SC across different data sets. Similarly, the percentage of correctly predicted events will not be comparable across the full and subsampled data sets. There’s certainly no reason to think that the model estimated from the subsampled data will be any better than the model estimated from the full data. Try using the model from the subsampled data to predict outcomes in the full data. I expect that it will do worse than the model estimated from the full data.

Dr. Allison,

Thank you for your helpful comments. As you suspected, the subsampled model did a much worse job of predicting the full data than the full-data model. It hugely over-predicted the 1s, resulting in false positives for almost every observation. Thank you!
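The over-prediction of 1s from a subsampled model has a simple mechanical source, which the following Python sketch illustrates. If all events are kept and each non-event is kept with probability f, the odds of an event in the subsample are inflated by a factor of 1/f. In a logit model this shifts only the intercept, by log(1/f), so adding log(f) to the linear predictor puts predictions back on the scale of the full data. The function name and the numbers (taken loosely from the question: 1,000 events, 74,000 non-events, 1,000 sampled non-events) are illustrative assumptions.

```python
import math

def full_data_probability(linear_predictor, sampling_fraction):
    """Convert a linear predictor from a model fit on all events plus a random
    fraction of non-events into a probability for the full population."""
    eta = linear_predictor + math.log(sampling_fraction)
    return 1.0 / (1.0 + math.exp(-eta))

f = 1000 / 74000  # fraction of non-events kept in a 50/50 subsample

# A parcel that the subsampled model scores at 50% (linear predictor 0) ...
uncorrected = 1.0 / (1.0 + math.exp(-0.0))
# ... has a corrected probability equal to the overall event rate, 1000/75000.
corrected = full_data_probability(0.0, f)
print(uncorrected, round(corrected, 4))  # 0.5 0.0133
```

This fixes the scale of the predictions but does not recover the information discarded with the unsampled non-events, so it gives no reason to prefer the subsampled model over the one fit to the full data.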

Dear Dr. Allison,

One query. I am looking at a data set with c. 1.4 million observations and c. 1000 events. One of the explanatory variables has many levels (over 40), and for some factor levels there are 0 events. In this case, would it be best to subset the dataset to include only those factor levels with a certain number of events (e.g., at least 20 or similar, which would leave 15-20 levels to be estimated)?

Any comments would be much appreciated.

(PS superb resource above)

If you try to estimate the model with the factor levels that have no events, the coefficients for those levels will not converge. However, the coefficients for the remaining levels are still OK, and they are exactly the same as if you had deleted the observations from factor levels with no events. A reasonable alternative is to use the Firth method, which will give you coefficients for the factor levels with no events.
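To see why the Firth method yields finite coefficients here, consider the special case of a model whose only predictor is a single categorical factor. In that case (and only that case, as far as this sketch is concerned) the Firth estimate of each level's log-odds has a closed form: add 0.5 to both the number of events and the number of non-events in that level. The Python below uses hypothetical counts; with additional covariates there is no such shortcut, and software such as logistf, firthlogit, or PROC LOGISTIC with the FIRTH option is needed.

```python
import math

def ml_log_odds(events, n):
    """Maximum likelihood log-odds for one factor level; None if it does not exist."""
    if events == 0 or events == n:
        return None  # the estimate diverges to minus or plus infinity
    return math.log(events / (n - events))

def firth_log_odds(events, n):
    """Firth (Jeffreys-prior penalized) log-odds for one level of a single factor."""
    return math.log((events + 0.5) / (n - events + 0.5))

# Hypothetical factor levels: (events, cases)
levels = {"A": (40, 50000), "B": (0, 20000), "C": (25, 30000)}
for name, (events, n) in levels.items():
    print(name, ml_log_odds(events, n), round(firth_log_odds(events, n), 3))
```

Level B has no events, so its ML estimate does not exist, while the penalized estimate is finite. For levels with a reasonable number of events, the two estimates are nearly identical.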

Hi Dr Allison.

I’m estimating the effect of police training on the likelihood of committing acts of use of force. I have monthly data on 2,900 police officers before and after treatment, and assignment to training is by alphabetical order of surname. Because of the structure of the data, I am estimating a difference-in-differences model. It should be noted that uses of force are rare events (five per month on average, and 148 events in the entire sample). I estimated the ITT by OLS and probit, and they give me similar coefficients. Would you suggest that I use another method, like the Firth method?

Thank you

Could you give some details on how you are estimating the DID model?