
Another Goodness-of-Fit Test for Logistic Regression

Paul Allison
May 7, 2014

In my April post, I described a new method for testing the goodness of fit (GOF) of a logistic regression model without grouping the data. That method was based on the usual Pearson chi-square statistic applied to the ungrouped data. Although Pearson’s chi-square does not have a chi-square distribution when data are not grouped, it does have approximately a normal distribution (under the null hypothesis that the fitted model is correct). By subtracting the mean (which happens to be the sample size) and dividing by an appropriate standard deviation, you get a z-statistic that has pretty good properties—better than the Hosmer-Lemeshow test in simulation studies.

But there are other GOF tests for ungrouped data. One that deserves serious consideration is Stukel’s test, which is easily calculated with standard logistic regression software. Stukel (1988) proposed a generalization of the logistic regression model that has two additional parameters. These allow for departures from the logistic curve as it approaches either 1 or 0. Special cases of the model also include (approximately) the complementary log-log model and the probit model.


The logistic model can be tested against this more general model by a simple procedure. Let g_i be the linear predictor from the fitted model, that is, g_i = x_i b, where x_i is the vector of covariate values for individual i and b is the vector of estimated coefficients. Then create two new variables:

     za = g^2 if g>=0, otherwise za = 0
     zb = g^2 if g<0, otherwise zb = 0.

Add these two variables to the logistic regression model and test the null hypothesis that both of their coefficients are equal to 0. Stukel suggested a score test, but there’s no obvious reason to prefer that to a Wald test or a likelihood ratio test. Note that in many data sets, g is either never greater than 0 or never less than 0. In those cases, only one z variable is necessary.
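A quick way to see which case applies is to count the observations on each side of 0. The following is a minimal Stata sketch; the names y, x1 and x2 are hypothetical placeholders for a binary outcome and its predictors, not variables from any data set discussed here.

```stata
* Hypothetical names: y is a binary outcome, x1 and x2 are predictors.
logistic y x1 x2
predict g, xb

* Stata treats missing as larger than any number, so exclude it explicitly.
count if g >= 0 & !missing(g)
local npos = r(N)
count if g < 0
local nneg = r(N)

* If either count is 0, the corresponding z variable would be all zeros.
display "g >= 0: `npos'   g < 0: `nneg'"
```

If one of the two counts is 0, only the other z variable needs to be added to the model.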

Here’s an example of how to calculate a Wald version of Stukel’s test with Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable inlf is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors.

logistic inlf kidslt6 age educ huswage city exper
predict g, xb
gen za=(g>=0)*g^2
gen zb=(g<0)*g^2
logistic inlf kidslt6 age educ huswage city exper za zb
test za zb

This program produced a chi-square of .11 with 2 df and a p-value of .95. Clearly there is no evidence for misspecification. A likelihood ratio test comparing the two models produced almost exactly the same result.
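The likelihood ratio version can be sketched with Stata's estimates store and lrtest commands. This assumes the Mroz data are already in memory and that g, za and zb have not yet been created; the stored-estimate names base and stukel are arbitrary labels chosen for illustration.

```stata
* Assumes the Mroz data are in memory and g, za, zb do not yet exist.
* Fit the original model and save its estimates.
logistic inlf kidslt6 age educ huswage city exper
estimates store base

* Build the two Stukel variables from the linear predictor.
predict g, xb
gen za = (g >= 0) * g^2
gen zb = (g < 0) * g^2

* Fit the expanded model and save its estimates.
logistic inlf kidslt6 age educ huswage city exper za zb
estimates store stukel

* Likelihood ratio test of the original model against the expanded one.
lrtest base stukel
```

The test has 2 degrees of freedom as long as neither za nor zb is dropped from the expanded model.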

Here’s the equivalent SAS code:

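proc logistic data=my.mroz;
model inlf(desc) = kidslt6 age educ huswage city exper;
output out=a xbeta=g;
run;
/* Create the two Stukel variables from the linear predictor g */
data b;
set a;
za = (g>=0)*g**2;
zb = (g<0)*g**2;
run;
proc logistic data=b;
model inlf(desc) = kidslt6 age educ huswage city exper za zb;
test za=0, zb=0;
run;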

How well does the Stukel test stack up against alternatives? For detecting quadratic departures from linearity, simulation studies suggest that the Stukel test is a little less powerful than either the standardized Pearson test (mentioned above) or the traditional Hosmer-Lemeshow test (Hosmer et al. 1997). For detecting interactions, however, Stukel’s test is more powerful than the standardized Pearson (Allison 2014), which was previously shown to be more powerful than Hosmer-Lemeshow (Hosmer and Hjort 2002). Finally, Stukel is considerably more powerful than either of the other two at detecting departures from the logit link function (Hosmer et al. 1997).

So the Stukel test is definitely worth using, possibly in conjunction with the standardized Pearson test. It’s also worth noting the resemblance of the Stukel test to a misspecification test that is frequently recommended by econometricians. Ramsey (1969) proposed including the square (and possibly higher powers) of the predicted values in a regression, and testing for statistical significance. The Stukel test is different only insofar as it splits the squared predicted values into two separate components.
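Stata has a built-in check in a similar spirit. The sketch below assumes the Mroz data are in memory and uses the linktest command, which refits the model with the linear predictor (_hat) and its square (_hatsq) as the only predictors. It is a single-squared-term check, not the Stukel test, and is shown only to illustrate the resemblance.

```stata
* Assumes the Mroz data are in memory. logit fits the same model as
* logistic but reports coefficients rather than odds ratios.
logit inlf kidslt6 age educ huswage city exper

* Refit on the linear predictor (_hat) and its square (_hatsq).
* A significant _hatsq coefficient suggests misspecification.
linktest
```

Unlike the Stukel test, this uses one squared term for the whole range of the linear predictor rather than separate terms above and below 0.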


Allison, Paul D. (2014) “Measures of fit for logistic regression.” Paper 1485-2014 presented at the SAS Global Forum, Washington, DC.

Hosmer, D.W. and N.L. Hjort (2002) “Goodness-of-fit processes for logistic regression: Simulation results.” Statistics in Medicine 21: 2723–2738.

Hosmer, D.W., T. Hosmer, S. Le Cessie and S. Lemeshow (1997) “A comparison of goodness-of-fit tests for the logistic regression model.” Statistics in Medicine 16: 965–980.

Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765–799.

Ramsey, J.B. (1969) “Tests for specification errors in classical linear least squares regression analysis.” Journal of the Royal Statistical Society, Series B. 31: 350–371.

Stukel, T.A. (1988) “Generalized logistic models.” Journal of the American Statistical Association 83: 426–431.



  1. Dear Mr Allison,

    Does the Stukel GOF test work well for naturally grouped binomial data?



    1. Well, it’s certainly not intended for that purpose. I’d go with the classic Pearson chi-square or deviance.

  2. Hi Paul,

    Like your workshops. Anyway, what do you think about the use of the Box-Tidwell approach for assessing non-linearity? Do you recommend it? (The BT approach is described in Menard’s LR text [2010], on page 108, as well as other texts.)



    1. I think it can be a useful method. But it’s specific to a particular variable, rather than giving an overall goodness of fit test.

  3. Dear Mr Allison,

    I found that when grouping continuous variables (and using WOE coding) the AUC might increase. (The event rate / event probability is monotone in the variable, not U-shaped or anything else.) Apart from numerical issues in the calculation of the AUC, can there be other reasons for this phenomenon?
    Best regards,

    1. Even if the probability is monotone in the variable, moving to a categorical version of the variables can model departures from linearity that might be important in predicting the outcome.

  4. Dear Paul,

    Many thanks for this blog and the related paper. A couple of related questions:
    1) Are there versions of the Tjur R^2 and the 4 goodness of fit tests, recommended in your paper, for the multinomial and fractional logit algorithms?
    2) As you note in the paper, the p values in the goodness of fit tests would decrease with the number of observations. Could you suggest any rule of thumb or a characteristic dependence on the number of observations?

    Thanks and regards.


  5. This is an interesting approach. Like other GOF tests, I assume this is sensitive to sample size – so we could maybe look at the magnitude of the coefficients for za and zb and look for a substantive size. Any guidance for what that might look like?

    1. Yes, this is true of any statistical test. But I’m afraid I don’t have any guidance as to magnitudes of the coefficients.
