## What’s the Best R-Squared for Logistic Regression?

##### February 13, 2013 By Paul Allison

One of the most frequent questions I get about logistic regression is “How can I tell if my model fits the data?” There are two general approaches to answering this question. One is to get a measure of how well you can predict the dependent variable based on the independent variables. The other is to test whether the model needs to be more complex, specifically, whether it needs additional nonlinearities and interactions to satisfactorily represent the data.

In a later post, I’ll discuss the second approach to model fit, and I’ll explain why I don’t like the Hosmer-Lemeshow goodness-of-fit test. In this post, I’m going to focus on *R*^{2 }measures of predictive power. Along the way, I’m going to retract one of my long-standing recommendations regarding these measures.

Unfortunately, there are many different ways to calculate an *R*^{2} for logistic regression, and no consensus on which one is best. Mittlbock and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others. The two methods that are most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989) along with its “corrected” version (see below). However, the Cox-Snell *R*^{2 }(both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970).

Among the statistical packages that I’m familiar with, SAS and Statistica report the Cox-Snell measures. JMP and SYSTAT report both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden’s measure for multinomial and ordered logit.

For years, I’ve been recommending the Cox and Snell *R*^{2 }over the McFadden *R*^{2}, but I’ve recently concluded that that was a mistake. I now believe that McFadden’s *R*^{2} is a better choice. However, I’ve also learned about another *R*^{2} that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden *R*^{2}. But I’m not going to make a definite recommendation until I get more experience with it.

Here are the details. Logistic regression is, of course, estimated by maximizing the likelihood function. Let *L*_{0} be the value of the likelihood function for a model with no predictors, and let *L _{M}* be the likelihood for the model being estimated. McFadden’s

*R*

^{2 }is defined as

* R*^{2}* _{McF}* = 1 – ln(

*L*

_{M}_{) }/ ln(

*L*

_{0})

where ln(.) is the natural logarithm. The rationale for this formula is that ln(*L*_{0}) plays a role analogous to the residual sum of squares in linear regression. Consequently, this formula corresponds to a proportional reduction in “error variance”. It’s sometimes referred to as a “pseudo” *R*^{2}.

The Cox and Snell *R*^{2} is

* R*^{2}* _{C&S}* = 1 – (

*L*

_{0 }/

*L*

*)*

_{M}^{2/n}

where *n* is the sample size. The rationale for this formula is that, for normal-theory linear regression, it’s an identity. In other words, the usual *R*^{2 }for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It’s appropriate, then, to describe this as a “generalized” *R*^{2 }rather than a pseudo *R*^{2}. By contrast, the McFadden *R*^{2 }does *not *have the OLS *R*^{2 }as a special case. I’ve always found this property of the Cox-Snell *R*^{2 }to be very attractive, especially because the formula can be naturally extended to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.

It’s well known, however, that the big problem with the Cox-Snell *R*^{2} is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 – *L*_{0}^{2/n}.This can be a lot less than 1.0, and it depends only on *p*, the marginal proportion of cases with events:

upper bound = 1 – [*p ^{p}*(1-

*p*)

^{(1-p)}]

^{2}

This has a maximum of .75 when *p*=.5. By contrast, when *p*=.9 (or .1), the upper bound is only .48.

For those who want an *R*^{2 }that behaves like a linear-model *R*^{2}, this is deeply unsettling. There is a simple correction, and that is to divide *R*^{2}* _{C&S}* by its upper bound, which produces the

*R*

^{2 }attributed to Nagelkerke (1991)

_{. }But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original

*R*

^{2}

*. I also think that the values it typically produces are misleadingly high, especially compared with what you get from a linear probability model. (Some might view this as a feature, however).*

_{C&S}So, with some reluctance, I’ve decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth’s (1985) eight criteria for a good *R*^{2}. When the marginal proportion is around .5, the McFadden *R*^{2} tends to be a little smaller than the uncorrected Cox-Snell *R*^{2}. When the marginal proportion is nearer to 0 or 1, the McFadden *R*^{2 }tends to be larger.

But there’s another *R*^{2}, recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to *R*^{2 }definitions for linear models. It’s also easy to calculate.

The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!

The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur also showed that his *R*^{2} (which he called the coefficient of discrimination) is equal to the arithmetic mean of two *R*^{2} formulas based on squared residuals, and equal to the geometric mean of two other *R*^{2}’s based on squared residuals.

Here’s an example of how to calculate Tjur’s statistic in Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable **inlf **is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors.

Here’s the code:

use “http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta“, clear

logistic inlf kidslt6 age educ huswage city exper

predict yhat if e(sample)

ttest yhat, by(inlf)

The **predict** command produces fitted values and stores them in a new variable called **yhat**. (The **if e(sample)** code prevents predicted values from being calculated for cases that may be excluded from the regression model). The **ttest** command is the easiest way to get the difference in the means of the predicted values for the two groups (but you can ignore the *p*-values). The mean predicted value for those in the labor force was .680, while the mean predicted value for those not in the labor force was .422. The difference of .258 is the Tjur *R*^{2}. By comparison, the Cox-Snell *R*^{2 }is .248 and the McFadden *R*^{2 }is .208. The corrected Cox-Snell is .332.

Here’s the equivalent SAS code:

proc logistic data=my.mroz;

model inlf(desc) = kidslt6 age educ huswage city exper;

output out=a pred=yhat;

proc ttest data=a;

class inlf; var yhat; run;

One possible objection to the Tjur *R*^{2} is that, unlike Cox-Snell and McFadden, it’s not based on the quantity being maximized, namely, the likelihood function.* As a result, it’s possible that adding a variable to the model could reduce the Tjur *R*^{2}. But Kvalseth (1985) argued that it’s actually preferable that *R*^{2} not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a classification tree method.

Another potential complaint is that the Tjur *R*^{2} cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward.

If you want to learn more about logistic regression, check out my book *Logistic Regression Using SAS: Theory and Application*, Second Edition (2012), or try my seminars on Logistic Regression Using SAS or Logistic Regression Using Stata.

* Conjecture: I suspect that the Tjur *R*^{2} is maximized when logistic regression coefficients are estimated by the linear discriminant function method. I encourage any interested readers to try to prove (or disprove) that. (For background on the relationship between discriminant analysis and logistic regression, see Press and Wilson (1984)).

**References**:

Cragg, J.G. and R.S. Uhler (1970) “The demand for automobiles.” *The Canadian Journal of Economics *3: 386-406.

Cox, D.R. and E.J. Snell (1989) *Analysis of Binary Data*. Second Edition. Chapman & Hall.

Kvalseth, T.O. (1985) “Cautionary note about R^{2}.” *The American Statistician*: 39: 279-285.

McFadden, D. (1974) “Conditional logit analysis of qualitative choice behavior.” Pp. 105-142 in P. Zarembka (ed.), *Frontiers in Econometrics*. Academic Press.

Nagelkerke, N.J.D. (1991) “A note on a general definition of the coefficient of determination.” *Biometrika *78: 691-692.

Maddala, G.S. (1983) *Limited Dependent and Qualitative Variables in Econometrics*. Cambridge University Press.

Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis.” *The American Statistician *54: 17-24.

Mittlbock, M. and M. Schemper (1996) “Explained variation in logistic regression.” *Statistics in Medicine* 15: 1987-1997.

Mroz, T.A. (1987) “The sensitiviy of an empirical model of married women’s hours of work to economic and statistical assumptions.” *Econometrica* 55: 765-799.

Press, S.J. and S. Wilson (1978) “Choosing between logistic regression and discriminant analysis.” *Journal of the American Statistical Association* 73: 699-705.

Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” *The American Statistician* 63: 366-372.

This morning I checked Paul Allison's Statistical Horizons blog and found a post on measures for logistic regression. It introduced me to Tjur's by way of an example, which I repackaged. [Click on the link above to see code].

Lucid advice, and useful…especially helpful for students transitioning from linear regression to logistic regresion.

Great post!

I believe you accidentally “flipped” the Cox & Snell R^2…

It should be [1 – (L0 / LM)]^2/n and not [1 – (LM / L0)]^2/n.

(that’s how it’s written in Nagelkerke’s paper).

Thanks! I’ve corrected the formula.

Very interesting. On a trivial point, I believe the Tjur stat is the absolute value of the difference, as the difference comes up negative, at least in this example.

Tue Tjur does not use the absolute value in his paper. If the model predicts worse than random, a negative measure of

ability to predictmakes sense to me.Interesting point. However, I don’t think it’s possible for a logistic regression model to produce a negative Tjur R2. Maybe some other method could.

Great post! I actually have a question about the model form of hazard analysis. I’ve been using the book “survival analysis using SAS”(very useful!) and it seems all the models in the book use a exponential form: h=exp(a+b*X1+c*X2), say h means hazard, and X1,X2 are the independent variables. I noticed though when I use a power form, say something like: h=X1^(a+b*X2), then changing the unit of X1 would change significance test result, and even AIC. I was wondering if you ever encountered this or what’s your suggestions on this. I apologize if this is not the appropriate place to ask this, but I’m really curious. Thanks!

I’m not surprised that changing the units of x1 would substantially change the results. Unless x1 has a coefficient, there is nothing to absorb changes in units. But why would you even consider a model like this?

Thanks for the reply! This model was used by a former graduate student in our lab 10 years ago. He recalled this power model as the most “efficient” and “easiest” form for his data (he couldn’t recall many details). What’s interesting is that when I run a power form and exponential form on our data, the former model seems to always come up with smaller AIC. The significant test results would be different depending on the covariates included. So we were wondering if there is a reason to prefer one to the other, or maybe it’s data specific?

Thanks!

Sorry but I don’t have any insights regarding this choice.

And a small correction about the models, the power form should be h=a*X1^(b+c*X2), is this “a” what you meant by “X1 has a coefficient”? But I think it didn’t absorb changes in units. A parallel exponential form we compare result with would be: h=exp(a+b*X1+c*X1*X2). Sorry about the confusion.

Your power-form hazard can be negative if X1 is. This is quite possible if, for example, X1 is mean standardized. This disqualifies it as a general hazard form.

I like this test postestimation for a regular binary logistic. However, it seems not to work when running a Firth logistic regression and produces values that are larger than 1. A quick check of the predicted values shows why as they predicted scores are not longer bounded by 1 after running these models.

Is there any way to get the equivalent of Tjur’s coefficient of discrimination after running a Firth logistic regression in Stata?

I think the problem is that when you use the firthlogit command in Stata, the predict command does not produce predicted probabilities. Instead, it gives the linear predictor or, equivalently, the log-odds. To get probabilities, you need to do the following:

firthlogit y x1 x2 x3

predict logodds

gen predprob=1/(1+exp(-logodds))

Then calculate the Tjur R2 using these predicted probabilites.

[…] Econometricians especially are fascinated to use some type of model explained type concept and introduce all kinds of pseudo-R-square concepts, implicating the % explained type information using loglikelihood values of no model vs. model. For a whole collection of references on various pseudo-R-square along with McFadden’s pseudo-Rsquare, see in the nice logical article, http://statisticalhorizons.com/r2logistic. by Paul Allison, the legendary logistic regression se… […]

Thanks for posting about Tjur’s R2 – how is Tjur commonly pronounced?

I believe it is pronounced “choor”

I am trying to use Tjur’s R2 after a Firth regression, but am getting very strange outputs; eg the mroz data above gives an R2 = 1.3.

This must be due to the penalized likelihood, but can it be adjusted by weighting, or?

What software are you using? As long as the predicted values are between 0 and 1, calculation of the Tjur R2 shouldn’t be a problem. The Firth method should produce predicted values between 0 and 1.

I understand that these pseudo R-squares should be interpreted NOT as a proportion of variance explained as in OLS multiple regression but rather (and this makes sense) as small, medium, or large effects. The problem is that I don’t see any general guidelines as to what values of a ‘pseudo’ R-square would constitute a ‘small’ ‘medium’ or ‘large’ effect. Obviously much depends on the data set but, do you have any general suggestions

These pseudo R-squares tend to to have values very similar to conventional R-squares.

Hi Paul, I have a logistic regression model for which i was looking at goodness of fit tests. The Hosmer and Lemeshow test is significant for my data as the number of rows is more than 10,000. The Nagerkerke’s R2 value for my model is about 0.32, but the percentage concordance(as reported in SAS) is 79%. ALso, in the classification table, percentage correctly classified by the model is 75%. I tested this out on a random sample and got 76% cases are correctly classified. Can you suggest some other measures by which I can validate my model and check its goodness of fit? Thanks, N

See my later blogs on alternatives to Hosmer-Lemeshow.

Hi, Paul. Thank you for crafting this very informative post. The Tjur method is indeed appealing and intuitive. I intend to report it in my future work.

I have a quick question for you regarding the reporting of pseudo R2s in discrete-time hazard analysis utilizing logistic regression. It seems that censoring necessarily provides incomplete information about the event of interest, therefore a pseudo R2 wouldn’t provide much information in the way of “fit”, certainly not to the degree that deviance, AIC, and BIC would.

So two questions: 1) am I on the right path in this understanding, and 2) can you recommend a reference that either supports or dismisses the use of pseudo R2s in the assessment of discrete-time hazard model fit?

I agree. Pseudo R2 is unrealistically small in this situation. But, unfortunately, I don’t have a reference for you.

Brilliant post!

I have a question regarding the value of McFadden’s R2. You write that:

“When the marginal proportion is nearer to 0 or 1, the McFadden R2 tends to be larger.”

Does that mean that for “rare events”, I should rather report a different R2? How much would the difference be? And is that reported in the literature or rather a observation based on experience? All the best, Moritz

The statement was based on personal observation, not on anything in the literature. I don’t think it implies that one should use a different R2.

I do not have access to Tjur’s paper but I can see the logic of his argument. If I understand it correctly, he is proposing to treat logistic goodness of fit as a special case of the classical biserial correlation problem (special in the sense that the continuous values are probabilities bounded between zero and one). The index of discrimination D is the difference between the mean probability for cases where the event has occurred (p1) and for cases where it has not (p0). This is the numerator of the point biserial correlation coefficient which is the Pearson product moment correlation coefficient applied to biserial data. The square of the Pearson coefficient (not the coefficient itself) can be interpreted as a PVE (proportion of variance explained) index. Most pseudo-R-squared statistics are defined as one minus the proportion of variance not explained which is the PVE. So it seems to me that to you would need to square p1 – p0 before you could regard it as a pseudo-R-squared type index comparable to McFadden, Nagelkerke, Effron etc. But even if the model fits well, p1 will be less than one and p0 will be greater than zero. The square of the difference therefore will be limited to values well short of one. Based on the simulations I have done, D-squared gives values much smaller than the other pseudo-R-squareds taken from the same data.

One way to remedy this is the atheoretical, ad-hoc one of rescaling D by dividing it by its maximum value, as Nagelkerke did for the Box-Snell statistic. For given distributions of the dichotomous and continuous variables, the maximum value of p1 occurs when the k largest probabilities are paired with the ones and the remaining (smallest) N – k are paired with the zeros. If you divide p1 – p0 by this maximum, you can square it without producing very small values. In fact I find that it produces values larger than other pseudo-R-squareds. For example, I simulated a data set with 100 observations five predictor variables. The event probabilities for the dichotomous variable were set equal to those predicted by the logistic model (i.e. model fit was as good as it could possibly be for this data set). The goodness of fit values I calculated were: Effron = 0.463, McFadden = 0.428, Nagelkerke = 0.501, D (raw) = 0.474, D (rescaled and squared) = 0.758.

Perhaps the conclusion is that there is no one best measure of goodness of fit for logistic regression. It depends on which aspect of the fit you are interested in and how you are going to interpret the result.

I’m not convinced that there’s any need to square Tjur’s R2. You should really read Tjur’s paper.

hy,

thanks a lot for the post.

there is no mention how to compute r-square of multiple imputed regression logistic.

what is the best way for it ?

thanks.

Compute the R-square (using the method of your choice) in each imputed data set. Then, simply average them across data sets.

Hi Paul,

I have 2 questions

1) Could you speak to the possible problems of using hosmer lemeshow test with a small sample size after firth logistic regression? (e.g.. 80)

I have found the following caution in a journal article:

With small sample sizes, the

Hosmer–Lemeshow test has low power and is unlikely todetect subtle deviations from the logistic model. Hosmer and

Lemeshow recommend sample sizes greater than 400

(Bewick, 2005)

2)

It seems Tjur’s R2 is suggested instead for firth with small sample size? Are there no limitations in using Tjur with a smaller sample size?

Thanks!

1.Power is definitely an issue with H-L test. But there are other more serious problems. See my other post, “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression”, http://www.statisticalhorizons.com/hosmer-lemeshow.

2. The Tjur R2 and the Firth method have completely different purposes. They are not comparable. But there’s no reason why Tjur could not be used with a small sample size.

By firth method you are referring to firth logistic regression?

I am using Tjurs R2 to assess model fit for models I built with firth logistic regression rather than using Hosmer Lemeshow (based on your noted limitations of HL).

After reading your post I will also use Tjurs R2 for the models I have built using logistic regression that have larger sample sizes.

Thank you.

Yes, I am referring to Firth logistic regression. Tjur R2 measures predictive power. H-L measures consistency of the model with the data. These are two different things and one has nothing to do with the other.

I am finding my Tjur R2 to be quite low for one of my models built with firth, (0.17). Is this common for Tjur R2 carried out on firth logistic models built with few variables? (4 variables for 80 observations) based on the availability of variables to make good predictions of the outcome?

I would have no reason to expect that Tjur R2 would be especially low with Firth logistic regression.