The Hosmer-Lemeshow (HL) test for logistic regression is widely used to answer the question “How well does my model fit the data?” But I’ve found it to be unsatisfactory for several reasons that I’ll explain in this post.
First, some background. Last month I wrote about several R² measures for logistic regression, which offer one approach to assessing model fit. R² is a measure of predictive power, that is, how well you can predict the dependent variable based on the independent variables. That may be an important concern, but it doesn’t really address the question of whether the model is consistent with the data.
By contrast, goodness-of-fit (GOF) tests help you decide whether your model is correctly specified. They produce a p-value—if it’s low (say, below .05), you reject the model. If it’s high, then your model passes the test.
In what ways might a model be misspecified? Well, the most important potential problems are interactions and nonlinearities. You can always produce a satisfactory fit by adding enough interactions and nonlinearities. But do you really need them? GOF tests are designed to answer that question. Another issue is whether the “link” function is correct. Is it logit, probit, complementary log-log, or something else entirely?
For both linear and logistic regression, it’s possible to have a low R² and still have a model that is correctly specified in every respect. And vice versa, you can have a very high R² and yet have a model that is grossly inconsistent with the data.
GOF tests are readily available for logistic regression when the data can be aggregated or grouped into unique “profiles”. Profiles are groups of cases that have exactly the same values on the predictors. Suppose, for example, that the model has just two predictor variables, sex (1=male, 0=female) and marital status (1=married, 0=unmarried). There are then four profiles: married males, unmarried males, married females and unmarried females, presumably with many cases in each profile.
Suppose we then fit a logistic regression model with the two predictors, sex and marital status (but not their interaction). For each profile, we can get an observed number of events and an expected number of events based on the model. There are two well-known statistics for comparing the observed number with the expected number: the deviance and Pearson’s chi-square.
The deviance is a likelihood ratio test of the fitted model versus a “saturated” model that perfectly fits the data. In our hypothetical example, a saturated model would include the interaction of sex and marital status. In that case, the deviance is testing the “no interaction” model as the null hypothesis, with the interaction model as the alternative. A low p-value suggests that the simpler model (without the interaction) should be rejected in favor of the more complex one (with the interaction). Pearson’s chi-square is an alternative method for testing the same hypothesis. It’s just the application of Pearson’s familiar formula for comparing observed with expected numbers of events (and non-events).
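As a rough sketch of the arithmetic behind these two statistics, the following Python function computes the deviance and Pearson’s chi-square from grouped profile data. The function name, the counts, and the fitted probabilities are all hypothetical placeholders; in a real analysis the fitted probabilities would come from the estimated model rather than being typed in by hand.

```python
import math

def grouped_gof(profiles):
    """Return (deviance, pearson) for grouped binary data.

    Each profile is a tuple (n, events, p_hat): number of cases, observed
    number of events, and the model's fitted event probability.
    """
    deviance = 0.0
    pearson = 0.0
    for n, events, p_hat in profiles:
        non_events = n - events
        exp_events = n * p_hat
        exp_non_events = n - exp_events
        # Deviance: 2 * sum of O * ln(O / E); a cell with O = 0 contributes 0.
        if events > 0:
            deviance += 2.0 * events * math.log(events / exp_events)
        if non_events > 0:
            deviance += 2.0 * non_events * math.log(non_events / exp_non_events)
        # Pearson: sum of (O - E)^2 / E over events and non-events.
        pearson += (events - exp_events) ** 2 / exp_events
        pearson += (non_events - exp_non_events) ** 2 / exp_non_events
    return deviance, pearson

# Hypothetical counts and fitted probabilities for the four profiles
# (illustrative placeholders, not maximum likelihood fitted values).
profiles = [
    (100, 62, 0.60),  # married males
    (100, 38, 0.40),  # unmarried males
    (100, 48, 0.50),  # married females
    (100, 32, 0.30),  # unmarried females
]
deviance, pearson = grouped_gof(profiles)
print(f"deviance = {deviance:.3f}, Pearson chi-square = {pearson:.3f}")
```

With four profiles and three estimated parameters (intercept, sex, marital status), either statistic would be referred to a chi-square distribution with 1 degree of freedom.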
Both of these statistics have good properties when the expected number of events in each profile is at least 5. But most contemporary applications of logistic regression use data that do not allow for aggregation into profiles because the model includes one or more continuous (or nearly continuous) predictors. When there is only one case per profile, both the deviance and Pearson chi-square have distributions that depart markedly from a true chi-square distribution, yielding p-values that may be wildly inaccurate.
What to do? Hosmer and Lemeshow (1980) proposed grouping cases together according to their predicted values from the logistic regression model. Specifically, the predicted values are arrayed from lowest to highest, and then separated into several groups of approximately equal size. Ten groups is the standard recommendation.
For each group, we calculate the observed number of events and non-events, as well as the expected number of events and non-events. The expected number of events is just the sum of the predicted probabilities over the individuals in the group. And the expected number of non-events is the group size minus the expected number of events.
Pearson’s chi-square is then applied to compare observed counts with expected counts. The number of degrees of freedom is the number of groups minus 2. As with the classic GOF tests, low p-values suggest rejection of the model.
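The steps just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm any particular package implements: the function name is made up, ties in the predicted values are split naively, there must be at least as many cases as groups, and every predicted probability must lie strictly between 0 and 1. The data in the usage example are simulated.

```python
import random

def hosmer_lemeshow(y, p, groups=10):
    """Return (chi_square, df) for the Hosmer-Lemeshow test.

    y: observed outcomes coded 0/1; p: predicted probabilities from the model.
    """
    pairs = sorted(zip(p, y))          # order cases by predicted probability
    n = len(pairs)
    chi_square = 0.0
    for g in range(groups):
        # Integer cut points give groups of approximately equal size.
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        obs_events = sum(yi for _, yi in chunk)
        exp_events = sum(pi for pi, _ in chunk)   # sum of predicted probabilities
        obs_non = size - obs_events
        exp_non = size - exp_events
        # Pearson contributions for events and non-events in this group.
        chi_square += (obs_events - exp_events) ** 2 / exp_events
        chi_square += (obs_non - exp_non) ** 2 / exp_non
    return chi_square, groups - 2

# Simulated example: outcomes are drawn from the stated probabilities,
# so the "model" is correct by construction.
random.seed(1)
p = [random.uniform(0.05, 0.95) for _ in range(500)]
y = [1 if random.random() < pi else 0 for pi in p]
chi_square, df = hosmer_lemeshow(y, p, groups=10)
print(f"HL chi-square = {chi_square:.2f} on {df} df")
```

Packages differ in how they place the cut points and handle tied predicted values, so this sketch will not necessarily reproduce their output exactly.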
It seems like a clever solution, but it turns out to have serious problems. The most troubling problem is that results can depend markedly on the number of groups, and there’s no theory to guide the choice of that number. This problem did not become apparent until software packages started allowing you to specify the number of groups, rather than just using 10.
Here’s an example using Stata with the famous Mroz data set that I used in last month’s post. The sample consists of 753 women, and the dependent variable is whether or not a woman is in the labor force. Here is the Stata code for producing the HL statistic based on 10 groups:
use http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta, clear
logistic inlf kidslt6 age educ huswage city exper
estat gof, group(10)
The estat gof command produces a chi-square of 15.52 with 8 df, yielding a p-value of .0499, which is just barely significant. This suggests that the model is not a satisfactory fit to the data, and that interactions and non-linearities are needed (or maybe a different link function). But if we specify 9 groups using the option group(9), the p-value rises to .11. And with group(11), the p-value is .64. Clearly, it’s not acceptable for the results to depend so greatly on such minor changes to a test characteristic that is completely arbitrary. Examples like this one are easy to come by.
But wait, there’s more. One would hope that adding a statistically significant interaction or non-linearity to a model would improve its fit, as judged by the HL test. But often that doesn’t happen. Suppose, for example, that we add the square of exper (labor force experience) to the model, allowing for non-linearity in the effect of experience. The squared term is highly significant (p=.002). But with 9 groups, the HL chi-square increases from 11.65 (p=.11) in the simpler model to 13.34 (p=.06) in the more complex model. That result suggests that we’d be better off with the model that excludes the squared term.
The reverse can also happen. Quite frequently, adding a non-significant interaction or non-linearity to a model will substantially improve the HL fit. For example, I added the interaction of educ and exper to the basic model above. The product term had a p-value of .68, clearly not statistically significant. But the HL chi-square (based on 10 groups) declined from 15.52 (p=.05) to 9.19 (p=.33). Again, unacceptable behavior.
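The p-values quoted in these examples can be checked from the reported chi-square values and their degrees of freedom (number of groups minus 2). Here is a small sketch that uses only Python’s standard library; the helper name chi2_sf is an assumption of this example, and it relies on the standard recurrence for the upper incomplete gamma function rather than on a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # start from df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # start from df = 1
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# (chi-square, df) pairs as reported in the text
reported = [
    (15.52, 8),   # basic model, 10 groups
    (11.65, 7),   # basic model, 9 groups
    (13.34, 7),   # with exper squared, 9 groups
    (9.19, 8),    # with educ*exper, 10 groups
]
for stat, df in reported:
    print(f"chi-square = {stat:5.2f}, df = {df}: p = {chi2_sf(stat, df):.4f}")
```

The chi-square for the 11-group case is not reported above, so its p-value of .64 cannot be recomputed this way.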
If the HL test is no good, then how can we assess the fit of the model? It turns out that there’s been quite a bit of recent work on this topic. In next month’s post, I’ll describe some of the newer approaches.
If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminar on Logistic Regression.
Hosmer D.W. and Lemeshow S. (1980) “A goodness-of-fit test for the multiple logistic regression model.” Communications in Statistics A10:1043-1069.
Very good explanation. I have seen this problem in my analyses too and could not find a “right” number of groups for the HL test…just because there isn’t one. Thanks.
The H-L test fails most of the time in very large datasets, which are commonly seen in the financial industry. Any better tests to deal with this situation would be very helpful.
See my reply to Matt Bogard below.
I’ve also seen several criticisms that the HL test is too sensitive to large sample sizes. I’m not sure of the validity of this criticism, but look forward to next month’s article; maybe the new approaches you are referring to will address this issue if it is valid.
JOURNAL OF PALLIATIVE MEDICINE
Volume 12, Number 2, 2009
“The Hosmer-Lemeshow test detected a statistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small”
SIZE MATTERS TO A MODEL’s FIT (comment in Crit Care Med. 2007: Sep 35(9):2213)
“Caution should be used in interpreting the calibration of predictive models developed using a smaller data set when applied to larger numbers of patients. A significant Hosmer-Lemeshow test does not necessarily mean that a predictive model is not useful or suspect. While decisions concerning a mortality model’s suitability should include the Hosmer-Lemeshow test, additional information needs to be taken into consideration. This includes the overall number of patients, the observed and predicted probabilities within each decile, and adjunct measures of model calibration.”
and from STATA LIST comments:
“It follows that with large sample sizes any discrepancy between the model and the data will be magnified, resulting in small p-values for a goodness of fit test.”
The large sample size issue is a potential problem with ANY goodness of fit test. With large sample sizes, even trivial departures from the model specification are likely to show up as statistically significant. Actually, simulation results suggest that the HL test has relatively LOW power for detecting certain kinds of model misspecification, especially interactions.
I look forward to the next post on this topic. I’m dealing with a CPS dataset with nearly 100,000 observations and find the H-L test to be significant, yet looking at the tables the counts in the expected/observed columns are very close, not different enough to warrant changes to a model that is theoretically very sound.
What are your thoughts on the link test (Stata linktest command)?
The link test in Stata is fairly crude, but serviceable.
I propose calculating the HL statistic on the “hold-out sample” rather than the “model development sample”. Assuming you have a lot of data, you can do a 75% development data set, and 25% hold-out data set.
If you don’t have enough data points for a hold-out data set, I recommend the BIC which penalizes for model complexity. http://en.wikipedia.org/wiki/Bayesian_information_criterion
Are you still planning a follow-up article on a good alternative to the HL test? I’d be very interested to read it.
This article was really helpful!
Very helpful. Thank you!
I have just read this post and I have found it really interesting. That is the reason I am looking forward to reading the post on a good alternative to this HL test (which, in fact, has driven me crazy these last three months). Where can I find the explanations of those good alternatives?
Thank you very much.
I’m working on it, but it’s taken longer than expected.
Is the le Cessie and Houwelingen test better?
Not familiar with this test.
You might be interested in this article from Hosmer & Lemeshow (and a couple of others), who critique the Hosmer-Lemeshow goodness-of-fit test and look at how it and others actually perform (I took away from it that none of them are that great)…
D. W. HOSMER, T. HOSMER, S. LE CESSIE, S. LEMESHOW (1997) A COMPARISON OF GOODNESS-OF-FIT TESTS FOR THE LOGISTIC REGRESSION MODEL Statistics in Medicine Volume 16, Issue 9, pages 965–980
A clearer explanation and a very helpful description of the HL test of GOF.
Have you published a paper on this particular finding? If so, would you please provide me with a link so I can refer to it in my work?
Sorry, no publication. But you can refer to my recent paper presented at the SAS Global Forum. Click here to see it.
I have run the Hosmer and Lemeshow (HL) test for goodness of fit. All the estimates are significant, but the significance value in the HL test is greater than 0.75. Is this correct, or what can be the solution?
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      2.764        8    .948
For the HL test, higher p-values are better. So 0.75 indicates that the model fits well.
I was planning on using the HL test of GOF for my analysis because I am using the svy command in Stata and I haven’t been able to find any other appropriate GOF stats. Are you able to recommend an alternative when using the svy command?
Sorry, I don’t have any recommendations for this situation.
This article might be of interest to you: Archer, K. J., & Lemeshow, S. (2006). Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal, 6(1), 97-105.
Or this one: Archer, K. J., Lemeshow, S., & Hosmer, D. W. (2007). Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design. Computational Statistics & Data Analysis, 51(9), 4450-4464. doi: 10.1016/j.csda.2006.07.006
very helpful for test of results of Lr. Need more learning!
I am applying binary logistic regression, and my independent variables are all nominal. In the GOF test, the H-L test is significant (less than 0.01), and the Nagelkerke R Square is 0.0439. I would like to know your suggestion for this situation. I am looking forward to hearing from you soon.
If your independent variables are all nominal, you should be able to use the deviance or Pearson chi-square to test the fit of the model. These are more trustworthy than the Hosmer-Lemeshow test. If these are significant, it would indicate a need for interactions among your predictors.
Thanks for the article. I am using the Hosmer-Lemeshow test to see if the observations are random variables whose distribution belongs to a given family of distributions. Do the observations have to be independent of each other? I am assuming that the observations (like defaults, non-defaults) are slightly correlated with each other. Does the test still do its job, or do I need to modify the test statistic? The denominator of the test looks like the variance of a binomial distribution. I am wondering whether I have to modify it with terms for a correlation factor.
Many thanks for sharing your idea.
In principle, the observations should be independent. But I haven’t seen any suggestions for how to modify the test if they are not independent.
There is something strange in R when one uses the package “LogisticDx” to run diagnostic tests for logistic regression models. It concerns the probabilities of the covariate patterns: I think they are not correct, and if they are, I would like to know how they are calculated. One would expect the number of y=1 cases observed in each covariate pattern to be approximately equal to the number of y=1 cases expected in that pattern, given the probability of the covariate pattern.
Please look into this point if you can and reply to my email.
Sorry but I am not familiar with this package.
Is it also true that the number of groups must be greater than the number of predictors+1 ? I have been told that this constraint is in HL’s original paper.
“In a 1980 paper Hosmer-Lemeshow showed by simulation that (provided p+1<g ) their test statistic approximately followed a chi-squared distribution on g−2 degrees of freedom…"
I don’t have access to the original HL paper, but nothing in their later work (including the 3rd edition of their textbook) says anything about this requirement.
Is it possible to have non-significant H&L test indicating the model fits the data, but actually have no significant predictors? I’ve run a logistic regression and found that none of my predictors are significant, yet the H&L test is still indicating a good fit.
I have quite a small sample size – can this affect H & L?
Yes, absolutely. This is more likely to happen when the sample is small, but it can also happen in large samples. Keep in mind that the H&L statistic is not testing whether the predictors affect the outcome. Rather, it’s testing whether there are non-linearities and interactions that are not well approximated by your model.
In HL test, the grouping criterion is the fitted probabilities of the responses. What is the logic behind using this criterion for grouping?
Because the fitted probabilities are deterministic functions of the predictors, similar fitted probabilities are indicative of “similar” values of the predictors, at least with respect to determination of the outcome.
Thank you for the great information and discussion on this topic. I am running a logistic model on a large data set consisting of several million observations. Not surprisingly, the HL test was highly significant. However, when I ran the same model on a smaller random sample of the same data set, the GOF (HL) test was not significant and everything else (including the ORs) remained unchanged. Could this be considered as evidence that the model was fine and that lack of fit when using the full data set has more to do with the test limitations rather than the model specification?
Possibly. But if you’ve read the post, you’ll know that I don’t trust the HL test even in smaller samples.
Dear Mr. Allison,
For my thesis I have performed a logistic regression. The Hosmer Lemeshow test is significant. Now I try to find out where the test goes wrong. I have used the contingency table to calculate the HL statistic, but I cannot find out in which decile the model predicts poorly. Do you have any recommendations as to how I could try to find the poorly predicting decile using a HL test?
With kind regards,
I know how to calculate a crude odds ratio in logistic regression, but how can I calculate an adjusted odds ratio?
If you exponentiate the coefficients from a logistic regression (i.e., calculate exp(b)), you get adjusted odds ratios.
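For instance, here is a minimal Python sketch of that exponentiation. The coefficient values are made up for illustration; they borrow variable names from the Mroz example but are not estimates from any model in this post.

```python
import math

# Hypothetical logistic regression coefficients (log-odds scale); the values
# are invented for illustration and are not estimates reported in the post.
coefficients = {"kidslt6": -1.40, "educ": 0.20, "exper": 0.12}

# exp(b) converts each coefficient into an odds ratio that is adjusted for
# the other predictors in the same model.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: adjusted odds ratio = {ratio:.3f}")
```

A ratio below 1 corresponds to a negative coefficient, and a ratio above 1 to a positive one.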
Very good explanation, thanks.
Thank you for the great information. We found it while looking into goodness-of-fit tests for svy: logistic.
We have analyzed our large-scale survey using the svy prefix in Stata. We can use estat gof to perform a goodness-of-fit test for this model.
Based on the data analysis, the final multiple logistic regression model gives the following:
Number of observations = 259885
F-adjusted test statistic = F(9,4409) = 3.07
Prob F = 0.001
. estat gof
Logistic model for malaria, goodness-of-fit test
F(9,4409) = 3.08
Prob F = 0.0011
The F statistic is significant at the 5% level, indicating that the model is not a good fit for these data?
Meanwhile, all variables remaining in the model have Pv = 0.001 and OR> 1.
Do you have any insight about our final model?
svylogitgof is not an official Stata command, and I’m really not sure what it’s doing. However, with a sample this large, almost any reasonably parsimonious model is going to show a significant goodness of fit statistic.
The issue of the number of groups created with the Hosmer-Lemeshow test notwithstanding, couldn’t you avoid the sample size issue by applying an effect size to the HL chi-square? For example, Cramer’s V could easily be calculated as V = sqrt(HL chi-square/(n*2)).
I don’t think it’s sensible to calculate an effect size for a goodness of fit test. And Cramer’s V definitely does not seem appropriate.
I was using the HL test in SPSS, but in the case of categorical responses, the observed and expected values are always identical, the chi-square is 0.00, df=0, and the p-value is empty. How does this happen?
I’m guessing that you are fitting a saturated model. That would happen, for example, if you had a single categorical predictor variable. Or more than one categorical predictor with all possible interactions.
Thanks for the informative article.
I have a model with 2 explanatory categorical variables (with 2 and 3 levels respectively).
But the H&L test result in SAS shows only 3 groups. What could be the reason for this?
I’m guessing that your problem stems from the fact that there are many ties on the predicted values.
can you help me please
Comment on the quality of fit of a logistic model corresponding to which the P-value of a Hosmer-Lemeshow test is equal to 0.0003. What is your expectation, if any, regarding the value of Nagelkerke R2 corresponding to this model?
The two statistics have nothing to do with each other.
I am using binary logistic regression. My dependent variable is binary (1 and 0), with only one independent variable, which is categorical (for example: calling, texting, music and gaming). When I run the analysis, it shows the following:
1. Only three categories (calling, texting, music); the gaming category is not shown.
2. Nagelkerke R2 = 0.054
3. In the Hosmer-Lemeshow test, the p-value is 1.000 (> 0.05), and the chi-square value in the same table is 0.000.
Is there any reason why I got a chi-square value of 0.000 in the HL test?
Thanks in advance
The chi-square is 0 because you are fitting what’s called a saturated model. If you put your data in the form of a contingency table, the model can perfectly reproduce the cell frequencies. Models will be saturated if all predictors are categorical and the model includes all possible interactions. Any model with a single categorical predictor will be saturated.
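To see why the chi-square comes out as 0, note that with a single categorical predictor the fitted probability for each category is simply that category’s observed proportion of events (assuming every category contains both events and non-events, so that the maximum likelihood estimates exist). The following Python sketch, with a made-up function name and invented data, shows that observed and expected counts then coincide.

```python
from collections import defaultdict

def saturated_pearson(categories, y):
    """Pearson chi-square when each category's fitted probability is
    its own observed proportion of events (the saturated fit)."""
    counts = defaultdict(lambda: [0, 0])     # category -> [cases, events]
    for c, yi in zip(categories, y):
        counts[c][0] += 1
        counts[c][1] += yi
    chi_square = 0.0
    for n, events in counts.values():
        p_hat = events / n                   # fitted probability for this category
        exp_events = n * p_hat
        exp_non = n - exp_events
        for obs, exp in ((events, exp_events), (n - events, exp_non)):
            if exp > 0:
                chi_square += (obs - exp) ** 2 / exp
    return chi_square

# Invented outcomes for four categories of a single predictor.
categories = ["calling"] * 4 + ["texting"] * 4 + ["music"] * 4 + ["gaming"] * 4
y = [1, 0, 0, 1,  1, 1, 0, 1,  0, 0, 1, 0,  1, 1, 1, 0]
chi_square = saturated_pearson(categories, y)
print(f"Pearson chi-square for the saturated fit: {chi_square:.6f}")
```

Because the expected counts reproduce the observed counts, there is nothing left over for a goodness-of-fit test to detect.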
Thank you very much for your clear and illustrative ways of presenting complex issues into simple.
I want to attend any training regarding your model.
Then take my course “Logistic Regression” which is being offered April 5-6, 2019, in Philadelphia.
Thanks, Prof. Allison, for your educative articles and explanations. This is very helpful to me as a beginner. I am looking forward to reading more from you.
Kindly help me with the interpretation of these post-estimation results after using a logistic model for data analysis.
number of observations = 23502
number of groups = 5
Hosmer-Lemeshow chi2(3) = 0.63
Prob > chi2 = 0.8905
Well, the p-value indicates that the model is consistent with the data. But the point of the post was that there are good reasons not to trust the H-L test.
Thank you for your information!
I have run a binary logistic regression with a very high HL test (p < 1.000), but the omnibus model fit test was not significant (p < .104). I am very conflicted as to which test I can trust. Is my model a good fit/ can it still be used?
These two tests are testing very different things.
- The omnibus model fit test is testing the null hypothesis that all coefficients are 0. It’s answering the question “Is this model better than nothing?”
- The HL test is testing for whether there are any missing interactions or nonlinearities in your model. It’s answering the question “Given the variables that I have, is there something better than this model?”
As in your case, it’s quite possible that the model has little predictive power (omnibus test), but no need for interactions or nonlinearities (HL test).