Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression
March 5, 2013 By Paul Allison
The Hosmer-Lemeshow (HL) test for logistic regression is widely used to answer the question “How well does my model fit the data?” But I’ve found it to be unsatisfactory for several reasons that I’ll explain in this post.
First, some background. Last month I wrote about several R² measures for logistic regression, which is one approach to assessing model fit. R² is a measure of predictive power, that is, how well you can predict the dependent variable based on the independent variables. That may be an important concern, but it doesn’t really address the question of whether the model is consistent with the data.
By contrast, goodness-of-fit (GOF) tests help you decide whether your model is correctly specified. They produce a p-value—if it’s low (say, below .05), you reject the model. If it’s high, then your model passes the test.
In what ways might a model be misspecified? Well, the most important potential problems are interactions and nonlinearities. You can always produce a satisfactory fit by adding enough interactions and nonlinearities. But do you really need them? GOF tests are designed to answer that question. Another issue is whether the “link” function is correct. Is it logit, probit, complementary log-log, or something else entirely?
For both linear and logistic regression, it’s possible to have a low R² and still have a model that is correctly specified in every respect. And vice versa, you can have a very high R² and yet have a model that is grossly inconsistent with the data.
GOF tests are readily available for logistic regression when the data can be aggregated or grouped into unique “profiles”. Profiles are groups of cases that have exactly the same values on the predictors. Suppose, for example, that the model has just two predictor variables, sex (1=male, 0=female) and marital status (1=married, 0=unmarried). There are then four profiles: married males, unmarried males, married females and unmarried females, presumably with many cases in each profile.
Suppose we then fit a logistic regression model with the two predictors, sex and marital status (but not their interaction). For each profile, we can get an observed number of events and an expected number of events based on the model. There are two well-known statistics for comparing the observed number with the expected number: the deviance and Pearson’s chi-square.
The deviance is a likelihood ratio test of the fitted model versus a “saturated” model that perfectly fits the data. In our hypothetical example, a saturated model would include the interaction of sex and marital status. In that case, the deviance is testing the “no interaction” model as the null hypothesis, with the interaction model as the alternative. A low p-value suggests that the simpler model (without the interaction) should be rejected in favor of the more complex one (with the interaction). Pearson’s chi-square is an alternative method for testing the same hypothesis. It’s just the application of Pearson’s familiar formula for comparing observed with expected numbers of events (and non-events).
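Here is a minimal sketch of that two-predictor example in Stata. It uses simulated data rather than a real data set, so the variable names (male, married, y), the sample size, and the coefficient values are all made-up assumptions chosen only for illustration.

```stata
* Simulated data: two binary predictors, so there are only four profiles.
* All names and coefficient values here are illustrative assumptions.
clear
set seed 2013
set obs 2000
generate male    = runiform() < 0.5
generate married = runiform() < 0.6
generate y       = runiform() < invlogit(-0.5 + 0.4*male + 0.8*married)

* Main-effects model (no interaction)
logit y male married
estimates store main

* Pearson chi-square over the four covariate patterns.
* Degrees of freedom: 4 profiles - 3 estimated parameters = 1.
estat gof

* Deviance: likelihood ratio test of the main-effects model against the
* saturated model, which here just adds the male-by-married interaction.
logit y i.male##i.married
estimates store saturated
lrtest main saturated
```

Because the data are simulated without an interaction, neither test should reject in most runs, but any single seed can produce a low p-value by chance.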
Both of these statistics have good properties when the expected number of events in each profile is at least 5. But most contemporary applications of logistic regression use data that do not allow for aggregation into profiles because the model includes one or more continuous (or nearly continuous) predictors. When there is only one case per profile, both the deviance and Pearson chi-square have distributions that depart markedly from a true chi-square distribution, yielding p-values that may be wildly inaccurate.
What to do? Hosmer and Lemeshow (1980) proposed grouping cases together according to their predicted values from the logistic regression model. Specifically, the predicted values are arrayed from lowest to highest, and then separated into several groups of approximately equal size. Ten groups is the standard recommendation.
For each group, we calculate the observed number of events and non-events, as well as the expected number of events and non-events. The expected number of events is just the sum of the predicted probabilities over the individuals in the group. And the expected number of non-events is the group size minus the expected number of events.
Pearson’s chi-square is then applied to compare observed counts with expected counts. The degrees of freedom is the number of groups minus 2. As with the classic GOF tests, low p-values suggest rejection of the model.
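The steps just described can be sketched by hand in Stata. This is only an illustration on simulated data with a single continuous predictor; the variable names, the seed, and the coefficients are assumptions, and the quantile grouping done by xtile may not match the grouping rule inside estat gof exactly at group boundaries, so the two statistics should be close but need not be identical.

```stata
* Simulated data with one continuous predictor (illustrative assumptions).
clear
set seed 12345
set obs 1000
generate x = rnormal()
generate y = runiform() < invlogit(-0.5 + x)

logit y x
* Built-in HL test, for comparison with the hand calculation below
estat gof, group(10)

* Step 1: predicted probabilities, sorted into 10 groups of about equal size
predict phat, pr
xtile grp = phat, nquantiles(10)

* Step 2: observed events, expected events (sum of predicted
* probabilities), and group size, one row per group
collapse (sum) observed=y expected=phat (count) n=y, by(grp)

* Step 3: Pearson chi-square over events and non-events.
* (O-E)^2/E + (O-E)^2/(n-E) simplifies to (O-E)^2 / (E*(1 - E/n)).
generate term = (observed - expected)^2 / (expected * (1 - expected/n))
egen hl = total(term)

* Step 4: refer to a chi-square with (groups - 2) = 8 df
display "HL chi-square = " hl[1] "   p = " chi2tail(8, hl[1])
```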
It seems like a clever solution, but it turns out to have serious problems. The most troubling problem is that results can depend markedly on the number of groups, and there’s no theory to guide the choice of that number. This problem did not become apparent until software packages started allowing you to specify the number of groups, rather than just using 10.
Here’s an example using Stata with the famous Mroz data set that I used in last month’s post. The sample consists of 753 women, and the dependent variable is whether or not a woman is in the labor force. Here is the Stata code for producing the HL statistic based on 10 groups:
use http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta, clear
logistic inlf kidslt6 age educ huswage city exper
estat gof, group(10)
The estat gof command produces a chi-square of 15.52 with 8 df, yielding a p-value of .0499, just barely significant. This suggests that the model is not a satisfactory fit to the data, and that interactions and non-linearities are needed (or maybe a different link function). But if we specify 9 groups using the option group(9), the p-value rises to .11. And with group(11), the p-value is .64. Clearly, it’s not acceptable for the results to depend so greatly on such minor changes to a test characteristic that is completely arbitrary. Examples like this one are easy to come by.
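One way to see this sensitivity at a glance is to loop over several group counts. The sketch below assumes the data URL given above is still reachable and relies on the results that estat gof leaves behind in r(chi2) and r(df); the range 8 to 12 is an arbitrary choice for illustration.

```stata
use http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta, clear
quietly logistic inlf kidslt6 age educ huswage city exper

* Rerun the HL test with 8 through 12 groups and print one line per run
forvalues g = 8/12 {
    quietly estat gof, group(`g')
    display "groups = `g'   chi2 = " %6.2f r(chi2) "   df = " r(df) ///
        "   p = " %5.3f chi2tail(r(df), r(chi2))
}
```

The line continuation (///) works in a do-file; when typing interactively, put the display command on a single line.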
But wait, there’s more. One would hope that adding a statistically significant interaction or non-linearity to a model would improve its fit, as judged by the HL test. But often that doesn’t happen. Suppose, for example, that we add the square of exper (labor force experience) to the model, allowing for non-linearity in the effect of experience. The squared term is highly significant (p=.002). But with 9 groups, the HL chi-square increases from 11.65 (p=.11) in the simpler model to 13.34 (p=.06) in the more complex model. That result suggests that we’d be better off with the model that excludes the squared term.
The reverse can also happen. Quite frequently, adding a non-significant interaction or non-linearity to a model will substantially improve the HL fit. For example, I added the interaction of educ and exper to the basic model above. The product term had a p-value of .68, clearly not statistically significant. But the HL chi-square (based on 10 groups) declined from 15.52 (p=.05) to 9.19 (p=.33). Again, unacceptable behavior.
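For anyone who wants to rerun these two comparisons, here is a sketch. It assumes the same data URL as above and uses Stata’s factor-variable notation (c.exper#c.exper and c.educ#c.exper) to build the squared and product terms, which may not be how the terms were constructed for the numbers reported here.

```stata
use http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta, clear

* Basic model: HL test with 9 and with 10 groups
quietly logistic inlf kidslt6 age educ huswage city exper
estat gof, group(9)
estat gof, group(10)

* Add the square of exper, then repeat the 9-group test
logistic inlf kidslt6 age educ huswage city exper c.exper#c.exper
estat gof, group(9)

* Add the educ-by-exper product instead, then repeat the 10-group test
logistic inlf kidslt6 age educ huswage city exper c.educ#c.exper
estat gof, group(10)
```

The two augmented models are fit without quietly so that the p-value of the added term shows up in the output next to the HL result.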
If the HL test is no good, then how can we assess the fit of the model? It turns out that there’s been quite a bit of recent work on this topic. In next month’s post, I’ll describe some of the newer approaches.
If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminars on Logistic Regression Using SAS or Logistic Regression Using Stata.
Hosmer D.W. and Lemeshow S. (1980) “A goodness-of-fit test for the multiple logistic regression model.” Communications in Statistics A10:1043-1069.