Alternatives to the Hosmer-Lemeshow Test

Paul Allison
April 9, 2014

In my post of March 2013, I pointed out some of the deficiencies of the Hosmer-Lemeshow test for goodness-of-fit (GOF) of logistic regression models. Most alarmingly, the p-values produced by the HL statistic can differ dramatically depending on the arbitrary choice of the number of groups.

What I didn’t say in that post is that p-values can also differ dramatically across different software packages, even when the number of groups is the same. That’s because there are slightly different algorithms for classifying individuals into different groups, and those apparently slight differences can have major consequences.

In the 34 years since the HL test was first proposed, numerous articles have commented on these and other problems, and many alternative tests have been proposed. There is almost universal agreement that HL is far from ideal and that better tests are available. But there appears to be little agreement on which of these is the best.


In that year-ago post, I promised to talk about some of these alternative GOF tests “in next month’s post”. But I found the literature so daunting that I kept putting it off, month after month. Finally, I decided to force myself to sort things out by agreeing to give a talk on the topic at the SAS Global Forum 2014, which was held two weeks ago in Washington, DC.

It worked! I now feel like I’ve got a pretty good handle on which tests are most useful for applied researchers. You can download the slides and the paper for my talk by clicking here.

The alternatives to the HL test generally fall into two categories: tests that do not require grouping of the data, and tests that group the data differently from HL. I don’t particularly care for the grouped tests, primarily because they usually require significant user input, and there is often uncertainty about the best way to do the grouping. Instead, I’ve restricted my attention to tests that can be directly applied to ungrouped data.

My Global Forum paper examines four ungrouped GOF tests: standardized Pearson, unweighted sum of squared residuals, Stukel’s test, and the information matrix test. In this post, I’ll take a look at just one of them, the standardized Pearson test. It’s familiar, relatively easy to compute, and has pretty good performance under a range of conditions.

Before going further, I should mention that I’ve particularly benefited from the work of Hosmer et al. (1997), Hosmer and Hjort (2002), and Kuss (2002). Also, in chapter 5 of the third edition of Applied Logistic Regression, Hosmer, Lemeshow and Sturdivant (2013) provide a very helpful overview of ungrouped GOF tests.

Pearson’s chi-square is one of two classic GOF tests for logistic regression, the other being the deviance. Most logistic regression packages can report these tests.  It’s well known, however, that when data are not grouped, neither of these tests has a sampling distribution that’s anywhere close to a chi-square distribution. In fact, that was the central motivation for the Hosmer-Lemeshow test.

But Pearson’s chi-square does have approximately a normal distribution in moderate- to large-sized samples, and this fact can be used to construct a useful GOF statistic. The trick is to get the correct mean and standard deviation of this statistic (under the null hypothesis that the fitted model is correct). Once you have those, you can construct a z-statistic by subtracting the mean and dividing by the standard deviation.

To a close approximation, the mean is n, the sample size. The standard deviation can be obtained by running an artificial linear regression and taking the root mean squared error from that regression. See chapter 5 of Hosmer, Lemeshow and Sturdivant (2013) for details on how to do this.
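For readers who want to see the mechanics, here is a minimal Python sketch of the calculation for ungrouped data. It is an illustration under stated assumptions, not the code used for the paper. The function names (fit_logistic, standardized_pearson) and the simulated data are hypothetical. The sketch centers the statistic at n minus the number of estimated coefficients, which is close to n. It takes the standard deviation as the square root of the weighted residual sum of squares from the artificial regression, which is how the Osius and Rojek procedure is usually written; verify that detail against chapter 5 before relying on it.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic fit by Newton-Raphson.

    X must already contain an intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        v = p * (1.0 - p)
        # Newton step: solve (X'VX) delta = X'(y - p)
        beta = beta + np.linalg.solve(X.T @ (X * v[:, None]), X.T @ (y - p))
    return beta


def standardized_pearson(X, y, beta):
    """Return (Pearson X2, z, two-sided p) for ungrouped binary data."""
    n, k = X.shape                      # k counts the intercept
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    v = p * (1.0 - p)                   # binomial variance of each case
    x2 = float(np.sum((y - p) ** 2 / v))

    # Artificial regression (assumed form): regress c on X with weights v.
    c = (1.0 - 2.0 * p) / v
    sw = np.sqrt(v)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], c * sw, rcond=None)
    rss = float(np.sum(v * (c - X @ coef) ** 2))

    # Subtract the approximate null mean, divide by the standard deviation.
    z = (x2 - (n - k)) / math.sqrt(rss)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return x2, z, p_two_sided


# Simulated example in which the fitted model is the true model.
rng = np.random.default_rng(2014)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 0.8, -0.6]))))
y = rng.binomial(1, true_p).astype(float)

beta_hat = fit_logistic(X, y)
x2, z, p_value = standardized_pearson(X, y, beta_hat)
print(f"Pearson X2 = {x2:.2f}, z = {z:.3f}, two-sided p = {p_value:.3f}")
```

The logistic fit is written out by hand only to keep the example self-contained; any package that returns fitted probabilities would serve the same purpose.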

Alternatively, SAS users can download a macro called GOFLOGIT written by Oliver Kuss (2001) that calculates this statistic (and others). Stata users can get a command called pearsonx2 written by Jeroen Weesie.

How good is the standardized Pearson at detecting model misspecification? Well, simulation evidence suggests that it is usually more powerful than the traditional HL test. Hosmer and Hjort (2002), for example, show that the standardized Pearson is consistently more powerful than HL for detecting quadratic departures from linearity, although the differences are not huge. For example, for a moderately large quadratic effect and n=500, the standardized Pearson rejected the linear model 86% of the time, compared with 80% for HL.

The differences were larger for detecting interactions.  For a moderate interaction and n=500, the standardized Pearson rejected the additive model 68% of the time compared with 40% for HL. For detecting an incorrect link function, there was no consistent pattern. Under some conditions HL did better, under others the standardized Pearson did better.

There’s one important issue with the standardized Pearson statistic that is easily overlooked. Intuitively, one would expect the Pearson chi-square to be larger for models that fit the data more poorly, and that would suggest a one-sided z-test. But Osius and Rojek (1992) argued strongly for a two-sided test, and Hosmer, Lemeshow and Sturdivant (2013) have apparently agreed. My own simulations also support this conclusion. Weesie’s Stata command squares the z-statistic to get a 1 df chi-square, and that’s equivalent to doing a two-sided z-test. However, the GOFLOGIT macro calculates a one-sided p-value for the z-statistic. So if you’re using that macro (with ungrouped data), I recommend calculating the two-sided p-value yourself.
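Converting to a two-sided p-value takes only a line or two. The Python sketch below shows the arithmetic with hypothetical helper names. It assumes that the one-sided p-value being converted is the upper-tail probability of the z-statistic; check the macro output to confirm which tail is reported.

```python
import math


def upper_tail_p(z):
    """One-sided (upper-tail) p-value for a standard normal z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_sided_p(z):
    """Two-sided p-value for z; identical to the p-value obtained by
    referring z squared to a chi-square distribution with 1 df."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_sided_from_upper_tail(p_upper):
    """Recover the two-sided p-value when only the upper-tail p is reported."""
    return 2.0 * min(p_upper, 1.0 - p_upper)


# A model that fits "too well" gives a negative z: the upper-tail p-value
# is close to 1, but the two-sided p-value is small.
z = -2.0
print(upper_tail_p(z), two_sided_p(z), two_sided_from_upper_tail(upper_tail_p(z)))
```

If the z-statistic itself is available, two_sided_p is the more direct route.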

In a later post, I’ll talk about some of the other ungrouped tests.  But (lesson learned) I’m not going to make any promises about how soon that will happen.

Hosmer, D.W. and N.L. Hjort (2002) “Goodness-of-fit processes for logistic regression: Simulation results.” Statistics in Medicine 21:2723–2738.

Hosmer, D.W., T. Hosmer, S. Le Cessie and S. Lemeshow (1997). “A comparison of goodness-of-fit tests for the logistic regression model.” Statistics in Medicine 16: 965

Hosmer, D.W. and S. Lemeshow (1980) “A goodness-of-fit test for the multiple logistic regression model.” Communications in Statistics A10:1043–1069.

Hosmer, D.W., S. Lemeshow and R.X. Sturdivant (2013) Applied Logistic Regression, 3rd Edition. New York: Wiley.

Kuss, O. (2001) “A SAS/IML macro for goodness-of-fit testing in logistic regression models with sparse data.” Paper 265-26 presented at the SAS User’s Group International 26.

Kuss, O. (2002) “Global goodness-of-fit tests in logistic regression with sparse data.” Statistics in Medicine 21:3789–3801.

Osius, G., and Rojek, D. (1992) “Normal goodness-of-fit tests for multinomial models with large degrees-of-freedom.” Journal of the American Statistical Association 87: 1145–1152.



  1. Should a variable that is not significant in the regression model be discarded? This might be off topic, but I still do not understand this.

  2. Hello Paul,

    I have Stata 12.1, and the pearsonx2 ado file is written for Stata version 13. Would it be possible for you to provide me with a pearsonx2 ado file for Stata 12.1?


    1. Sorry but this is not my program. Check with the author, Jeroen Weesie, at Utrecht University.
