One of the most frequent questions I get about logistic regression is “How can I tell if my model fits the data?” There are two general approaches to answering this question. One is to get a measure of how well you can predict the dependent variable based on the independent variables. The other is to test whether the model needs to be more complex, specifically, whether it needs additional nonlinearities and interactions to satisfactorily represent the data.
In a later post, I’ll discuss the second approach to model fit, and I’ll explain why I don’t like the Hosmer-Lemeshow goodness-of-fit test. In this post, I’m going to focus on R2 measures of predictive power. Along the way, I’m going to retract one of my long-standing recommendations regarding these measures.
Unfortunately, there are many different ways to calculate an R2 for logistic regression, and no consensus on which one is best. Mittlbock and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others. The two methods that are most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989) along with its “corrected” version (see below). However, the Cox-Snell R2 (both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970).
Among the statistical packages that I’m familiar with, SAS and Statistica report the Cox-Snell measures. JMP and SYSTAT report both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden’s measure for multinomial and ordered logit.
For years, I’ve been recommending the Cox and Snell R2 over the McFadden R2, but I’ve recently concluded that that was a mistake. I now believe that McFadden’s R2 is a better choice. However, I’ve also learned about another R2 that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden R2. But I’m not going to make a definite recommendation until I get more experience with it.
Here are the details. Logistic regression is, of course, estimated by maximizing the likelihood function. Let L0 be the value of the likelihood function for a model with no predictors, and let LM be the likelihood for the model being estimated. McFadden’s R2 is defined as
R2McF = 1 – ln(LM) / ln(L0)
where ln(.) is the natural logarithm. The rationale for this formula is that ln(L0) plays a role analogous to the residual sum of squares in linear regression. Consequently, this formula corresponds to a proportional reduction in “error variance”. It’s sometimes referred to as a “pseudo” R2.
The Cox and Snell R2 is
R2C&S = 1 – (L0 / LM)2/n
where n is the sample size. The rationale for this formula is that, for normal-theory linear regression, it’s an identity. In other words, the usual R2 for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It’s appropriate, then, to describe this as a “generalized” R2 rather than a pseudo R2. By contrast, the McFadden R2 does not have the OLS R2 as a special case. I’ve always found this property of the Cox-Snell R2 to be very attractive, especially because the formula can be naturally extended to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.
It’s well known, however, that the big problem with the Cox-Snell R2 is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 – L02/n.This can be a lot less than 1.0, and it depends only on p, the marginal proportion of cases with events:
upper bound = 1 – [pp(1-p)(1-p)]2
This has a maximum of .75 when p=.5. By contrast, when p=.9 (or .1), the upper bound is only .48.
For those who want an R2 that behaves like a linear-model R2, this is deeply unsettling. There is a simple correction, and that is to divide R2C&S by its upper bound, which produces the R2 attributed to Nagelkerke (1991). But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original R2C&S. I also think that the values it typically produces are misleadingly high, especially compared with what you get from a linear probability model. (Some might view this as a feature, however).
So, with some reluctance, I’ve decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth’s (1985) eight criteria for a good R2. When the marginal proportion is around .5, the McFadden R2 tends to be a little smaller than the uncorrected Cox-Snell R2. When the marginal proportion is nearer to 0 or 1, the McFadden R2 tends to be larger.
But there’s another R2, recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to R2 definitions for linear models. It’s also easy to calculate.
The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!
The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur also showed that his R2 (which he called the coefficient of discrimination) is equal to the arithmetic mean of two R2 formulas based on squared residuals, and equal to the geometric mean of two other R2’s based on squared residuals.
Here’s an example of how to calculate Tjur’s statistic in Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable inlf is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors.
Here’s the code:
logistic inlf kidslt6 age educ huswage city exper
predict yhat if e(sample)
ttest yhat, by(inlf)
The predict command produces fitted values and stores them in a new variable called yhat. (The if e(sample) code prevents predicted values from being calculated for cases that may be excluded from the regression model). The ttest command is the easiest way to get the difference in the means of the predicted values for the two groups (but you can ignore the p-values). The mean predicted value for those in the labor force was .680, while the mean predicted value for those not in the labor force was .422. The difference of .258 is the Tjur R2. By comparison, the Cox-Snell R2 is .248 and the McFadden R2 is .208. The corrected Cox-Snell is .332.
Here’s the equivalent SAS code:
proc logistic data=my.mroz;
model inlf(desc) = kidslt6 age educ huswage city exper;
output out=a pred=yhat;
proc ttest data=a;
class inlf; var yhat; run;
One possible objection to the Tjur R2 is that, unlike Cox-Snell and McFadden, it’s not based on the quantity being maximized, namely, the likelihood function.* As a result, it’s possible that adding a variable to the model could reduce the Tjur R2. But Kvalseth (1985) argued that it’s actually preferable that R2 not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a classification tree method.
Another potential complaint is that the Tjur R2 cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward.
If you want to learn more about logistic regression, check out my book Logistic Regression Using SAS: Theory and Application, Second Edition (2012), or try my seminar on Logistic Regression.
* Conjecture: I suspect that the Tjur R2 is maximized when logistic regression coefficients are estimated by the linear discriminant function method. I encourage any interested readers to try to prove (or disprove) that. (For background on the relationship between discriminant analysis and logistic regression, see Press and Wilson (1984)).
Cragg, J.G. and R.S. Uhler (1970) “The demand for automobiles.” The Canadian Journal of Economics 3: 386-406.
Cox, D.R. and E.J. Snell (1989) Analysis of Binary Data. Second Edition. Chapman & Hall.
Kvalseth, T.O. (1985) “Cautionary note about R2.” The American Statistician: 39: 279-285.
McFadden, D. (1974) “Conditional logit analysis of qualitative choice behavior.” Pp. 105-142 in P. Zarembka (ed.), Frontiers in Econometrics. Academic Press.
Nagelkerke, N.J.D. (1991) “A note on a general definition of the coefficient of determination.” Biometrika 78: 691-692.
Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge University Press.
Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis.” The American Statistician 54: 17-24.
Mittlbock, M. and M. Schemper (1996) “Explained variation in logistic regression.” Statistics in Medicine 15: 1987-1997.
Mroz, T.A. (1987) “The sensitiviy of an empirical model of married women’s hours of work to economic and statistical assumptions.” Econometrica 55: 765-799.
Press, S.J. and S. Wilson (1978) “Choosing between logistic regression and discriminant analysis.” Journal of the American Statistical Association 73: 699-705.
Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.