One of the most frequent questions I get about logistic regression is “How can I tell if my model fits the data?” There are two general approaches to answering this question. One is to get a measure of how well you can predict the dependent variable based on the independent variables. The other is to test whether the model needs to be more complex, specifically, whether it needs additional nonlinearities and interactions to satisfactorily represent the data.

In a later post, I’ll discuss the second approach to model fit, and I’ll explain why I don’t like the Hosmer-Lemeshow goodness-of-fit test. In this post, I’m going to focus on *R*^{2} measures of predictive power. Along the way, I’m going to retract one of my long-standing recommendations regarding these measures.

Unfortunately, there are many different ways to calculate an *R*^{2} for logistic regression, and no consensus on which one is best. Mittlbock and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others. The two methods that are most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989) along with its “corrected” version (see below). However, the Cox-Snell *R*^{2} (both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970).

Among the statistical packages that I’m familiar with, SAS and Statistica report the Cox-Snell measures. JMP and SYSTAT report both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden’s measure for multinomial and ordered logit.

For years, I’ve been recommending the Cox and Snell *R*^{2} over the McFadden *R*^{2}, but I’ve recently concluded that that was a mistake. I now believe that McFadden’s *R*^{2} is a better choice. However, I’ve also learned about another *R*^{2} that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden *R*^{2}. But I’m not going to make a definite recommendation until I get more experience with it.

Here are the details. Logistic regression is, of course, estimated by maximizing the likelihood function. Let *L*_{0} be the value of the likelihood function for a model with no predictors, and let *L*_{M} be the likelihood for the model being estimated. McFadden’s *R*^{2} is defined as

$$R^2_{\text{McF}} = 1 - \frac{\ln(L_M)}{\ln(L_0)}$$

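where ln(.) is the natural logarithm. The rationale for this formula is that ln(*L*_{0}) plays a role analogous to the total sum of squares in linear regression (that is, the residual sum of squares for a model with no predictors), while ln(*L*_{M}) is analogous to the residual sum of squares for the fitted model. Consequently, this formula corresponds to a proportional reduction in “error variance”. It’s sometimes referred to as a “pseudo” *R*^{2}.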

The Cox and Snell *R*^{2} is

$$R^2_{\text{C\&S}} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}$$

where *n* is the sample size. The rationale for this formula is that, for normal-theory linear regression, it’s an identity. In other words, the usual *R*^{2} for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It’s appropriate, then, to describe this as a “generalized” *R*^{2} rather than a pseudo *R*^{2}. By contrast, the McFadden *R*^{2} does *not* have the OLS *R*^{2} as a special case. I’ve always found this property of the Cox-Snell *R*^{2} to be very attractive, especially because the formula can be naturally extended to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.
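As a minimal sketch of the two definitions above, both measures can be computed from the two maximized log-likelihoods that most packages report. The sketch is in Python (which is not used elsewhere in this post), and the function names and the numbers in the example are hypothetical, chosen only for illustration. It works on the log scale because the likelihoods themselves are usually too small to represent directly.

```python
import math

def mcfadden_r2(ll_null, ll_model):
    """McFadden: 1 - ln(L_M) / ln(L_0), taking the log-likelihoods as inputs."""
    return 1.0 - ll_model / ll_null

def cox_snell_r2(ll_null, ll_model, n):
    """Cox-Snell: 1 - (L_0 / L_M)^(2/n), evaluated on the log scale."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

# Hypothetical log-likelihoods and sample size, for illustration only
# (these are not the values from the Mroz example).
ll_null, ll_model, n = -500.0, -400.0, 750
print(round(mcfadden_r2(ll_null, ll_model), 3))
print(round(cox_snell_r2(ll_null, ll_model, n), 3))
```

Both log-likelihoods are negative, so the McFadden ratio is positive and no larger than 1.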

It’s well known, however, that the big problem with the Cox-Snell *R*^{2} is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 – *L*_{0}^{2/n}. This can be a lot less than 1.0, and it depends only on *p*, the marginal proportion of cases with events:

$$\text{upper bound} = 1 - \left[p^{p}(1-p)^{(1-p)}\right]^{2}$$

This has a maximum of .75 when *p*=.5. By contrast, when *p*=.9 (or .1), the upper bound is only .48.
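The two figures just quoted can be reproduced directly from the upper-bound formula. The following is a small Python sketch; the function name is hypothetical.

```python
def cox_snell_upper_bound(p):
    """Upper bound 1 - [p^p (1-p)^(1-p)]^2 for an event proportion 0 < p < 1."""
    return 1.0 - (p ** p * (1.0 - p) ** (1.0 - p)) ** 2

# Evaluate the bound at the proportions mentioned in the text.
for p in (0.5, 0.9, 0.1):
    print(p, round(cox_snell_upper_bound(p), 2))
```

The bound is symmetric in *p* and 1 – *p*, which is why .9 and .1 give the same value.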

For those who want an *R*^{2} that behaves like a linear-model *R*^{2}, this is deeply unsettling. There is a simple correction, and that is to divide *R*^{2}_{C&S} by its upper bound, which produces the *R*^{2} attributed to Nagelkerke (1991). But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original *R*^{2}_{C&S}. I also think that the values it typically produces are misleadingly high, especially compared with what you get from a linear probability model. (Some might view this as a feature, however.)

So, with some reluctance, I’ve decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth’s (1985) eight criteria for a good *R*^{2}. When the marginal proportion is around .5, the McFadden *R*^{2} tends to be a little smaller than the uncorrected Cox-Snell *R*^{2}. When the marginal proportion is nearer to 0 or 1, the McFadden *R*^{2} tends to be larger.

But there’s another *R*^{2}, recently proposed by Tjur (2009), that I’m inclined to prefer over McFadden’s. It has a lot of intuitive appeal, its upper bound is 1.0, and it’s closely related to *R*^{2} definitions for linear models. It’s also easy to calculate.

The definition is very simple. For each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the difference between those two means. That’s it!

The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values. Tjur also showed that his *R*^{2} (which he called the coefficient of discrimination) is equal to the arithmetic mean of two *R*^{2} formulas based on squared residuals, and equal to the geometric mean of two other *R*^{2}’s based on squared residuals.
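To make the definition concrete, here is a minimal Python sketch of the calculation. The function name, the toy outcomes, and the toy predicted probabilities are all hypothetical; in practice the probabilities would come from whatever fitted model is being evaluated.

```python
def tjur_r2(y, p_hat):
    """Mean predicted probability among events minus the mean among non-events."""
    events = [p for yi, p in zip(y, p_hat) if yi == 1]
    non_events = [p for yi, p in zip(y, p_hat) if yi == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    return sum(events) / len(events) - sum(non_events) / len(non_events)

# Hypothetical outcomes (1 = event) and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 0, 0]
p_hat = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
print(round(tjur_r2(y, p_hat), 3))
```

Because the function only needs outcomes and predicted probabilities, it does not depend on how those probabilities were estimated.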

Here’s an example of how to calculate Tjur’s statistic in Stata. I used a well-known data set on labor force participation of 753 married women (Mroz 1987). The dependent variable **inlf** is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors.

Here’s the code:

use http://www.stata.com/data/jwooldridge/eacsap/mroz.dta, clear

logistic inlf kidslt6 age educ huswage city exper

predict yhat if e(sample)

ttest yhat, by(inlf)

The **predict** command produces fitted values and stores them in a new variable called **yhat**. (The **if e(sample)** code prevents predicted values from being calculated for cases that may be excluded from the regression model.) The **ttest** command is the easiest way to get the difference in the means of the predicted values for the two groups (but you can ignore the *p*-values). The mean predicted value for those in the labor force was .680, while the mean predicted value for those not in the labor force was .422. The difference of .258 is the Tjur *R*^{2}. By comparison, the Cox-Snell *R*^{2} is .248 and the McFadden *R*^{2} is .208. The corrected Cox-Snell is .332.
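The figures just reported can be cross-checked against one another with a little arithmetic. The Python sketch below uses hypothetical variable names and relies on one assumption worth stating: for a maximum likelihood logistic regression that includes an intercept, the mean of the fitted probabilities equals the observed proportion of events, so that proportion can be recovered from the two group means. Because the inputs are rounded to three decimals, the results should match the reported values only approximately.

```python
# Group means of yhat as reported in the text (rounded to three decimals).
mean_events, mean_non_events = 0.680, 0.422
tjur = mean_events - mean_non_events

# Assumption: the overall mean of yhat equals the event proportion p, so
# p * mean_events + (1 - p) * mean_non_events = p, which gives:
p = mean_non_events / (1.0 - tjur)

# Upper bound of the Cox-Snell measure: 1 - [p^p (1-p)^(1-p)]^2
upper_bound = 1.0 - (p ** p * (1.0 - p) ** (1.0 - p)) ** 2

cox_snell = 0.248                    # reported in the text
corrected = cox_snell / upper_bound  # the correction attributed to Nagelkerke

print(round(tjur, 3), round(p, 3), round(upper_bound, 3), round(corrected, 3))
```

With these rounded inputs, the corrected measure should come out within about .001 of the reported .332, which is the size of discrepancy that rounding would be expected to produce.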

Here’s the equivalent SAS code:

proc logistic data=my.mroz;

model inlf(desc) = kidslt6 age educ huswage city exper;

output out=a pred=yhat;

proc ttest data=a;

class inlf; var yhat; run;

One possible objection to the Tjur *R*^{2} is that, unlike Cox-Snell and McFadden, it’s not based on the quantity being maximized, namely, the likelihood function.* As a result, it’s possible that adding a variable to the model could reduce the Tjur *R*^{2}. But Kvalseth (1985) argued that it’s actually preferable that *R*^{2} not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a classification tree method.

Another potential complaint is that the Tjur *R*^{2} cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is straightforward.

If you want to learn more about logistic regression, check out my book *Logistic Regression Using SAS: Theory and Application*, Second Edition (2012), or try my seminar on Logistic Regression.

* Conjecture: I suspect that the Tjur *R*^{2} is maximized when logistic regression coefficients are estimated by the linear discriminant function method. I encourage any interested readers to try to prove (or disprove) that. (For background on the relationship between discriminant analysis and logistic regression, see Press and Wilson (1978).)

**References**:

Cragg, J.G. and R.S. Uhler (1970) “The demand for automobiles.” *The Canadian Journal of Economics *3: 386-406.

Cox, D.R. and E.J. Snell (1989) *Analysis of Binary Data*. Second Edition. Chapman & Hall.

Kvalseth, T.O. (1985) “Cautionary note about R^{2}.” *The American Statistician* 39: 279-285.

McFadden, D. (1974) “Conditional logit analysis of qualitative choice behavior.” Pp. 105-142 in P. Zarembka (ed.), *Frontiers in Econometrics*. Academic Press.

Nagelkerke, N.J.D. (1991) “A note on a general definition of the coefficient of determination.” *Biometrika *78: 691-692.

Maddala, G.S. (1983) *Limited Dependent and Qualitative Variables in Econometrics*. Cambridge University Press.

Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis.” *The American Statistician *54: 17-24.

Mittlbock, M. and M. Schemper (1996) “Explained variation in logistic regression.” *Statistics in Medicine* 15: 1987-1997.

Mroz, T.A. (1987) “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions.” *Econometrica* 55: 765-799.

Press, S.J. and S. Wilson (1978) “Choosing between logistic regression and discriminant analysis.” *Journal of the American Statistical Association* 73: 699-705.

Tjur, T. (2009) “Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination.” *The American Statistician* 63: 366-372.

## Comments

Dear Dr. Allison,

Your engaging and influential post with long lasting popularity (since 2013 until now) shows how difficult (if at all possible) it is to find the best R2 for logistic regression. Your move from Cox-Snell (R2CS) to McFadden (R2MF) and then to Tjur (R2Tjur) explains that. Of course, R2Tjur has a lot of intuitive appeal, and it is easy to calculate (plus some other useful properties, like independence of the base rate, for example). But is it enough to prefer R2Tjur over R2MF? R2MF also has a clear intuitive interpretation in terms of information theory: it’s equal to the ratio of information gain when adding new predictors to the quantity of the whole information contained in the data. Also, R2MF, adjusted R2MF and Akaike Information Criterion (AIC) are related to each other through simple linear formulas. But the more important property of R2MF is the following one. Hosmer and Lemeshow in their seminal book of 1989, p. 149, say about R2MF (aka R2L): “… Thus the quantity R2L is nothing more than an expression of the likelihood ratio test and, as such, is not a measure of goodness-of-fit”. Therefore, Hosmer and Lemeshow did not see R2L as a lawful R2 measure. Though, in the 3rd edition of the book the authors omitted this statement. Now, the whole statistical community considers R2MF as a typical ordinary R2 measure, which has a unique feature to be a statistical test for testing the Global Null Hypothesis: Beta = 0, and simultaneously a measure of predictive power. Thus, with such useful and unique properties it is unreasonable to ignore R2MF. So, what to do with these two candidates: R2Tjur and R2MF? The most reasonable decision is to keep and report both.

Also, we can add a third R2, this time from the class of R2 measures based on the sum of squares of the differences between observed binary outcomes and predicted probabilities. It could be, for example, the very popular Efron R-Squared, R2Efron (aka R2O or R2OLS). It permits direct comparison of logistic regression models with linear probability models, among other useful features. Working with this trio of complementary, not competitive, R2 measures, we can use all the advantages mentioned above. Of course, it is possible to work with other sets of R-Squareds. But I think that this trio is the best. What do you think?

Best regards,

Ernest Shtatland

You make some very good points. I really don’t have a strong preference.

Hi

I am new here and sorry if this is something everybody here knows already. I have run a logistic regression analysis (outcomes were whether they had a disease or not, factors were both numerical and non-numerical factors) and had the following comment from a reviewer. “What was the R2 for the models? You must inform what R2 was used for (e.g., McFadden, Nagelkerke, Tjur e Cox & Snell), please”

However, when I run the analysis in Stata with the logistic command, only a pseudo R2 comes out. Can anyone tell me how I can get a non-pseudo R2 and how I can tell which R2 was used? Thank you for your time.

Stata uses the McFadden R2. For logistic regression, all R2’s are pseudo.

Hi Paul –

What I find to be the most useful R-squared value when using Logistic Regression to classify 2 groups is based on the point-biserial correlation which is just the usual Pearson’s Correlation Coefficient when one of the two variables is dichotomous (0,1).

It has the form of a Z-Score times a function that is maximized when there’s an equal number of 0’s and 1’s:

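Rho = [(M1 – M0)/Sy] * Sqrt(p0*p1)

(Here M1 and M0 are the means of the predicted probabilities in the two groups, Sy is their overall standard deviation, and p0 and p1 are the proportions of 0’s and 1’s.)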

You just plot the predicted output probability vs the true group, and compute Pearson’s R-Squared.

I find it much more useful than the R-Square (U) value that is produced in JMP, and it has the added benefit you mentioned of not being the quantity that is optimized by MLE (and therefore provides an additional perspective on the predictive capability of the model).

Have you ever tested that as a figure of merit?

Good point. As I recall, Tjur mentions this R-square in his American Statistician article. I can’t recall what he didn’t like about it.

Hi Paul

Many thanks for comparing various types of R2 explicitly with strong arguments and references.

This has cleared most of my concepts about the selection of R2 for my analysis. However, I have a quick question that I used multiple logistic regression models to see the association of several demographic factors with various practices related to Antimicrobial Resistance. The models produced the following R2s Tjur; 0.136, 0.075, 0.038, 0.02 and 0.016, however, I am not sure if some of these models fit my data or the values are very low and I need to re-analyze the data using some other tests. The sample size in my study is 570.

Yes, the last three of these R2s are pretty low. But that doesn’t mean the models are “incorrect” in some way. It just means that the variables you have in the model don’t have much predictive power. I don’t think you need “other tests”, at least not based on what you’ve told me.

A very useful and simple explanation for non-statistician researchers.

Hello,

I run a logistic regression in SAS for a #event/#trials dependent variable. Of course, those values can be converted to proportions.

Is there a way to calculate Tjur’s R2 for this type of data (#event/#trials) or proportions?

Thank you for your time.

marcel

It can be done, but it would take a bit of programming to do it.

Thank you for your answer. I guess the default R-square output from SAS will do it for me.

I meant to say. “The default SAS R-square output will be enough for me”. In other words, I will use the default SAS R-square to validate my model.

Is there a way to calculate Tjur’s R2 for a count-based regression model based on a negative binomial distribution?

Not that I’m aware of.

Hi, I would be grateful if you could advise me how Tjur coefficient of discrimination might be generalized to evaluating discriminatory powers of multi-class predictive models.

I presume you are asking about more than 2 categories on the dependent variable. If there’s a way to calculate Tjur’s R2 in this situation, I haven’t heard about it.

I think that Tjur’s D can be generalized to multiple outcomes. My process is to find the predicted probability for each data point corresponding to the specific outcome of that data point, call it Ds. Take the mean of these. Take the mean of 1-Ds. Subtract the mean of 1-Ds from the mean of Ds. Generalized Tjur. So far, it appears to make sense, but I lack the statistician chops to explicitly derive and confirm it.

I tried this on the mroz data and got .272, but Tjur gives .258. Did I do something wrong? If it doesn’t work for two categories, it can’t be a generalization.

Tjur’s R2 also allows for easy decomposition by levels of a factor.

For example, if one were to model likelihood of a disease by levels of a specific microRNA in three brain regions, one can very easily calculate Tjur’s R2 for each individual brain region in the model (presuming sufficient samples for a valid model, etc.). This would produce separate R2 for each region. This happened to be borne out by pathology for the disease in different brain regions. The region with the least pathology over disease progression also had the lowest Tjur’s R2.

Dear Professor Allison!

Thank you for the useful information you have offered concerning the assessment of logistic regression models.

I have carried out 1 000 sample simulations from the register data population (N = 26 442), by using stratified SRS sampling. The strata are 18 Finnish provinces. The sample size (n=540) was 2 % of the population size.

I have estimated the proportion of sauna (I assume that you know what is sauna, a quite hot place) in Finnish apartments for sale. The population proportions vary between 68 % and 93 % between the provinces.

I have applied logistic regression (two continuous auxiliary variables size and age of apartment) to each sample (6 different allocation methods). I became interested in Tjur´s R2 measure, which is easy to compute.

The range of Tjur´s R2 in the 1 000 samples for every allocation was 0.15 – 0.45. What is an acceptable value for this measure?

There is no criterion for whether any R2 measure is “acceptable” or not. Bigger is better, but even models with very low R2 can be useful in some circumstances.

Dr. Allison, this is probably a basic question and I apologize for my ignorance if the answer is obvious. Why is it that one could not just simply compute R2 as proportion of variance explained using model predicted probabilities? Isn’t this just Efron’s R2?

As Tjur’s R2, it only uses model predicted probabilities and therefore it would be applicable even to types of models other than logistic (say machine learning).

Source:

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/

Yes, that’s another possibility. It’s equivalent to taking the squared correlation between predicted and observed values. One disadvantage, however, as pointed out in your reference, is that Efron’s R2 is not maximized by the maximum likelihood method of estimation.

Sir, what should be the range of “2 log likelihood”, “Cox and Snell R square” and “Nagelkerke R square”?

For my binary logistic model they come to 36.44, .225 and .301 respectively.

There is no “acceptable range” for these fit measures. The log-likelihood is only useful for comparing different models, not for evaluating a single model. In your case, I would say that .225 and .301 are not bad at all.

There must be an error in McFadden’s R2 formula. log(LM) is always greater than log(L0). Hence this formula will yield negative values. I think you need to flip them.

It’s weird that I find the same mistake in another post by another author.

A valuable source for different R2 definitions and discussion can be found in RMS book by Harrell

I understand your argument, but it is incorrect. It’s true that log(LM) > log(L0). But both are negative and log(LM) is closer to 0. The negative signs cancel out so that the ratio is less than or equal to 1.

Dr. Allison,

Regarding the Tjur R-Squared and William Greene comment:

This R-Squared measure was introduced independently by J.S. Cramer (1999) and T. Tjur (2009) in their publications in two journals with very similar titles: “The Statistician” and “The American Statistician”. Notations were of course different: D (Tjur) and λ (Cramer). The results of these publications are mutually complementary to some extent. For example, following Cramer’s approach it is more convenient to study relationships between λ and the base rate. It turns out that λ does not depend on the base rate. In this respect, λ is closer to McFadden R^2 than to any other traditional version of R^2. On the other hand, Tjur showed that D is equal to the arithmetic mean of two R^2-like quantities based on squared residuals. One of these quantities, R^2(res), is nothing but the well-known R-Squared used with different notations such as R^2(SS), R^2(O) etc.

So it would be fair to call this measure R^2(C&T) – “C” for Cramer and “T” for Tjur.

Best regards,

Ernest Shtatland

It should probably be called simply “the Cramer R^2” since Cramer clearly had priority.

Hi Dr. Allison,

Thank you for this very interesting and engaging post. It has renewed my old interest in R^2 measures for logistic regression. Following Menard (2000), we have joined in http://www.lexjansen.com/nesug/nesug02/st/st004.pdf the “McFadden camp”, and have shown that there exists a very simple functional relation between R^2 (C&S) and R^2 (McF):

R^2(C&S) = 1 – exp(-R^2(McF) * T).

Here T = – 2lnL(0) / n, and can be re-written as T = -2[ŷ*Lnŷ + (1 – ŷ)*Ln(1 – ŷ)]. The quantity ŷ is known as base rate or prevalence or marginal proportion of cases with events, and T can be interpreted as the double entropy of the Bernoulli distribution with probability ŷ. The formulas above can be seen as a theoretical justification of well-known empirical results: when the base rate ŷ is around .5, R^2(McF) tends to be a little smaller than R^2(C&S); when the base rate ŷ is nearer to 0 or 1, the McFadden R2 tends to be substantially larger than R^2(C&S). Also, these formulas explain directly why the maximal value of R^2(C&S) is 0.75, not 1.

It should be added that a comprehensive and almost exhaustive review of this topic can be found in Menard (2010): “Logistic regression: From introductory to advanced concepts and applications”, Sage University Paper Series, Chapter 3, pp. 48 – 62. It contains a very emphatic saying: “If you want R^2, why not use R^2”.

Best regards,

Ernest Shtatland

Thanks. Interesting results.

I have a question.

I ran a binary logistic regression in Stata 12 and i have to choose the model with a best fit for my study

My research led me to pseudo r2 but someone told me its value should be at least 0.5 so i can call it best fit.

However, the models i had only had pseudo r2 of 0.25-0.28.

I couldn’t understand why it didn’t increase whatever combination i tried.

So, here’s my questions:

1. Where could be the problem why my pseudo r2 is small? Could it be affected by the variation in the value of my independent variables?

2. Do I really have to have an r2 value greater than or equal to 0.5 so I can consider my model to have the best fit? (I really couldn’t find a source about this)

thank you very much

There is no requirement that the R2 be greater than 0.5. In fact, it’s uncommon to achieve this for binary regression models.

Dr. Allison,

Great article. In my paper I have to mention Nagelkerke R square, and disregard it, because the test is usually misleadingly high and overall not a good measure.

You describe this conclusion also; however, I have to name a scientific source for this conclusion.

Question: Where did you get this information, what is your source?

Greetings,

Martijn Kamminga

This is just based on my own experience with these measures.

Dr. Allison,

Thank you for this informative post. I would be very interested in your opinion on how the Tjur measure compares with another measure introduced recently by Zhang (2016). SAS Institute has coded an IML-based macro for computing Zhang’s r-square statistic. They describe it as follows: “The RsquareV macro provides the R2V statistic proposed by Zhang (2016) for use with any model based on a distribution with a well-defined variance function. This includes the class of generalized linear models and generalized additive models based on distributions such as the binomial for logistic models, Poisson, gamma, and others. It also includes models based on quasi-likelihood functions for which only the mean and variance functions are defined. A partial R2 is provided when comparing a full model to a nested, reduced model. Partial R can be obtained from this when the difference between the full and reduced model is a single parameter. A penalized R2 is also available adjusting for the additional parameters in the full model.” (http://support.sas.com/kb/60/162.html).

Zhang’s paper has been accepted for publication in The American Statistician. I looked the article up online and was interested to learn that one feature of his R2 is that it yields same value as the classical R2 for linear models.

Zhang’s method seems promising, but I have not yet had an opportunity to try it out or to study it carefully. I do like the fact that the conventional R2 is a special case.

Wonderful and helpful article, but it got me thinking: is there any advantage to using an R-squared measure in place of AUROC aside from simply wanting to have a reference to compare against linear regressions?

I don’t know of any reason to prefer one over the other. But I welcome comments from others on this issue.

[Earlier version messed up by inequality signs being interpreted as angle brackets]

Your conjecture about linear discrimination maximising the Tjur R^2 may be true if you restrict to sensible estimators, but it isn’t true more generally

Suppose (for simplicity, it isn’t essential) that the proportion of events is 50% and you fit a logistic regression model by linear discrimination to get fitted probabilities p_i for individual i (and the model fits well). Make a new predictor m_i by setting m_i=1 if p_i is greater than 0.5 and m_i=0 if p_i is less than 0.5. The Tjur R^2 for m will be larger than for p.

To see why, consider a saturated model with discrete predictors. You will have p(x) greater than 0.5 exactly when observations with X=x are more likely to be events than non-events. Increasing the predicted value to 1 for these observations will increase the contribution of the correctly-predicted observations to the Tjur statistic and decrease the contributions of the incorrectly-predicted ones by the same amount. Since there are more correctly-predicted ones, this is a net improvement. If you want the improved model to have logistic regression form, you can get as close as you like by using very large coefficients.

In the earlier paragraph, then, “fits well” means that a sufficiently high proportion of x values where the true mean is over 0.5 have fitted probabilities over 0.5.

Proving the conjecture would require finding the right definition of ‘sensible’ estimator.

Hi, this article helps me a lot.

May I ask a new question?

Could this R-square be used to evaluate the influence of missing data?

For example, I create an artificial dataset, and compare the R-squares of logistic models fitted to the original data and to the data with 10% missing.

Thanks!

I don’t think this would be a useful way to evaluate the influence of missing data.

Dear Mr Allison,

I have got a question regarding the Cox & Snell r² in regression analysis.

I know that Nagelkerke is usually misleadingly high. Does this also hold for Cox & Snell r², for example compared with the r² value in OLS?

Best regards,

Gustav Sebastian

When it comes to R-square measures, it’s hard to say what is misleading or not. But in my experience, the Cox-Snell measure tends to be similar to OLS R-square.

Paul. FYI, the Tjur (2009) measure advocated in this post is actually proposed by J.S. Cramer in JRSS D, 48, 1999, p. 88, equation 14, and was reported in the 5th edition (2002) of Greene EA, p. 684.

Best regards,

Bill Greene

Thanks Bill. I’ll note this in any future writings. And as I stated in the post, the “Cox-Snell” R-square was previously proposed by Cragg and Uhler in 1970.

Hi Prof Allison

Thanks for this post and the SAS paper, I found it most useful. I have some students & users for whom I find it important to spell the measures out explicitly, so I created this little %LET based code that will place the Tjur R-square at the end, labeled as such. I thought I’d share in case any readers find need of such functionality.

%LET Dataset = ;

%LET Event = ;

%LET Dependent_variable = ;

%LET Independent_variables = ;

%LET Class_Variables = ;

PROC LOGISTIC DATA=&Dataset;

CLASS &Class_Variables;

MODEL &Dependent_variable(event="&Event") = &Independent_variables / EXPB RSQUARE;

OUTPUT OUT=Predictions Pred=Predictions;

RUN;

/*Adding the Tjur (2009) R-square*/

PROC IML;

USE Predictions;

READ ALL VAR {&Dependent_variable} INTO DV;

CALL SYMPUT("Other",SETDIF(DV,{&Event}));

READ ALL VAR {Predictions} INTO Event WHERE(&Dependent_variable="&Event");

READ ALL VAR {Predictions} INTO Other WHERE(&Dependent_variable="&Other");

CLOSE Predictions;

Tjur_R2 = ABS(Mean(Event)-Mean(Other));

Print Tjur_R2;

QUIT;

Dear Sir, I found your article very helpful. Actually i applied Binomial Logistic regression and I am getting cox and snell R2= .709 and nagelkerke R2= .959. I am confused whether these values are good fit or not???

Kindly suggest me.

These values would generally be considered high.

Could you possibly help me understand the Cox-Snell equation? It seems like the exponent is penalizing large sample sizes by raising to a smaller power (thus decreasing the value you are subtracting from 1). If you have positive Log-lik values that makes sense, but my data is giving negative Log-lik values. So technically, my null model is smaller (-2753.4) than my global model (-2627.8), but this equation treats it as though the null has a larger value than the global since dividing a negative by a negative begets a positive value. Could you help me understand if I’m doing something wrong? Or why this is the case if not?

Thank you!!!

What’s your sample size?

Fantastic article. Loved it. Just wondering, for Tjur’s R sq, if yhat is asymmetric, wouldn’t you better off getting the median difference? Thanks.

Interesting idea. But it wouldn’t have some of the attractive properties described in Tjur’s article.

I am finding my Tjur R2 to be quite low for one of my models built with firth, (0.17). Is this common for Tjur R2 carried out on firth logistic models built with few variables? (4 variables for 80 observations) based on the availability of variables to make good predictions of the outcome?

I would have no reason to expect that Tjur R2 would be especially low with Firth logistic regression.

By firth method you are referring to firth logistic regression?

I am using Tjurs R2 to assess model fit for models I built with firth logistic regression rather than using Hosmer Lemeshow (based on your noted limitations of HL).

After reading your post I will also use Tjurs R2 for the models I have built using logistic regression that have larger sample sizes.

Thank you.

Yes, I am referring to Firth logistic regression. Tjur R2 measures predictive power. H-L measures consistency of the model with the data. These are two different things and one has nothing to do with the other.

Hi Paul,

I have 2 questions

1) Could you speak to the possible problems of using the Hosmer-Lemeshow test with a small sample size (e.g., 80) after Firth logistic regression?

I have found the following caution in a journal article:

With small sample sizes, the Hosmer–Lemeshow test has low power and is unlikely to detect subtle deviations from the logistic model. Hosmer and Lemeshow recommend sample sizes greater than 400 (Bewick, 2005)

2) It seems Tjur’s R2 is suggested instead for Firth with a small sample size? Are there no limitations in using Tjur with a smaller sample size?

Thanks!

1. Power is definitely an issue with the H-L test. But there are other more serious problems. See my other post, “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression”, https://statisticalhorizons.com/hosmer-lemeshow/.

2. The Tjur R2 and the Firth method have completely different purposes. They are not comparable. But there’s no reason why Tjur could not be used with a small sample size.

Hi,

Thanks a lot for the post.

There is no mention of how to compute the R-square for a logistic regression fitted to multiply imputed data.

What is the best way to do it?

Thanks.

Compute the R-square (using the method of your choice) in each imputed data set. Then, simply average them across data sets.
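As a rough illustration of that advice, here is a short Python sketch. The five R-square values are made up for the example, and `pooled_r2` is a hypothetical helper name. In practice each value would come from fitting the same model to one imputed data set.

```python
from statistics import mean

def pooled_r2(r2_by_imputation):
    """Average a pseudo R-square across imputed data sets."""
    if not r2_by_imputation:
        raise ValueError("need at least one imputed data set")
    return mean(r2_by_imputation)

# Hypothetical R-square values, one per imputed data set.
r2_values = [0.31, 0.29, 0.33, 0.30, 0.32]

pooled = pooled_r2(r2_values)
print(f"Pooled R2 across {len(r2_values)} imputations: {pooled:.3f}")
```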

I do not have access to Tjur’s paper but I can see the logic of his argument. If I understand it correctly, he is proposing to treat logistic goodness of fit as a special case of the classical biserial correlation problem (special in the sense that the continuous values are probabilities bounded between zero and one). The index of discrimination D is the difference between the mean probability for cases where the event has occurred (p1) and for cases where it has not (p0). This is the numerator of the point biserial correlation coefficient which is the Pearson product moment correlation coefficient applied to biserial data. The square of the Pearson coefficient (not the coefficient itself) can be interpreted as a PVE (proportion of variance explained) index. Most pseudo-R-squared statistics are defined as one minus the proportion of variance not explained which is the PVE. So it seems to me that you would need to square p1 – p0 before you could regard it as a pseudo-R-squared type index comparable to McFadden, Nagelkerke, Efron etc. But even if the model fits well, p1 will be less than one and p0 will be greater than zero. The square of the difference therefore will be limited to values well short of one. Based on the simulations I have done, D-squared gives values much smaller than the other pseudo-R-squareds taken from the same data.

One way to remedy this is the atheoretical, ad-hoc one of rescaling D by dividing it by its maximum value, as Nagelkerke did for the Cox-Snell statistic. For given distributions of the dichotomous and continuous variables, the maximum value of p1 occurs when the k largest probabilities are paired with the ones and the remaining (smallest) N – k are paired with the zeros. If you divide p1 – p0 by this maximum, you can square it without producing very small values. In fact I find that it produces values larger than other pseudo-R-squareds. For example, I simulated a data set with 100 observations and five predictor variables. The event probabilities for the dichotomous variable were set equal to those predicted by the logistic model (i.e. model fit was as good as it could possibly be for this data set). The goodness of fit values I calculated were: Efron = 0.463, McFadden = 0.428, Nagelkerke = 0.501, D (raw) = 0.474, D (rescaled and squared) = 0.758.

Perhaps the conclusion is that there is no one best measure of goodness of fit for logistic regression. It depends on which aspect of the fit you are interested in and how you are going to interpret the result.

I’m not convinced that there’s any need to square Tjur’s R2. You should really read Tjur’s paper.
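For readers who want to experiment with the rescaling proposed in the comment above, here is a minimal Python sketch of one reading of it. It takes the maximum of p1 – p0 to be the value obtained when the k largest predicted probabilities are paired with the k events. The function names and the six data points are hypothetical, and none of this comes from Tjur’s paper.

```python
def _mean(values):
    return sum(values) / len(values)

def discrimination(y, probs):
    """Tjur's D: mean predicted probability for events minus that for non-events."""
    events = [p for yi, p in zip(y, probs) if yi == 1]
    nonevents = [p for yi, p in zip(y, probs) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    return _mean(events) - _mean(nonevents)

def max_discrimination(y, probs):
    """Largest D attainable by re-pairing the same probabilities with the
    same number of events: the k largest probabilities go to the ones."""
    k = sum(1 for yi in y if yi == 1)
    if k == 0 or k == len(y):
        raise ValueError("need at least one event and one non-event")
    ranked = sorted(probs, reverse=True)
    return _mean(ranked[:k]) - _mean(ranked[k:])

def rescaled_squared_d(y, probs):
    """D divided by its maximum, then squared (the commenter's ad hoc index)."""
    d_max = max_discrimination(y, probs)
    if d_max == 0:
        raise ValueError("all predicted probabilities are equal")
    return (discrimination(y, probs) / d_max) ** 2

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 0, 1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]

d_raw = discrimination(y, probs)
d_max = max_discrimination(y, probs)
d_rescaled_sq = rescaled_squared_d(y, probs)
print(d_raw, d_max, d_rescaled_sq)
```

Because the maximum is computed from the same set of predicted probabilities, the rescaled index equals 1 whenever the events happen to receive the k largest predictions, regardless of how far apart the two groups’ means are.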

Brilliant post!

I have a question regarding the value of McFadden’s R2. You write that:

“When the marginal proportion is nearer to 0 or 1, the McFadden R2 tends to be larger.”

Does that mean that for “rare events”, I should rather report a different R2? How much would the difference be? And is that reported in the literature or rather an observation based on experience? All the best, Moritz

The statement was based on personal observation, not on anything in the literature. I don’t think it implies that one should use a different R2.

Hi, Paul. Thank you for crafting this very informative post. The Tjur method is indeed appealing and intuitive. I intend to report it in my future work.

I have a quick question for you regarding the reporting of pseudo R2s in discrete-time hazard analysis utilizing logistic regression. It seems that censoring necessarily provides incomplete information about the event of interest, therefore a pseudo R2 wouldn’t provide much information in the way of “fit”, certainly not to the degree that deviance, AIC, and BIC would.

So two questions: 1) am I on the right path in this understanding, and 2) can you recommend a reference that either supports or dismisses the use of pseudo R2s in the assessment of discrete-time hazard model fit?

I agree. Pseudo R2 is unrealistically small in this situation. But, unfortunately, I don’t have a reference for you.

Hi Paul, I have a logistic regression model for which I was looking at goodness-of-fit tests. The Hosmer and Lemeshow test is significant for my data as the number of rows is more than 10,000. The Nagelkerke R2 value for my model is about 0.32, but the percentage concordance (as reported in SAS) is 79%. Also, in the classification table, the percentage correctly classified by the model is 75%. I tested this out on a random sample and found that 76% of cases were correctly classified. Can you suggest some other measures by which I can validate my model and check its goodness of fit? Thanks, N

See my later blogs on alternatives to Hosmer-Lemeshow.

I understand that these pseudo R-squares should be interpreted NOT as a proportion of variance explained as in OLS multiple regression but rather (and this makes sense) as small, medium, or large effects. The problem is that I don’t see any general guidelines as to what values of a ‘pseudo’ R-square would constitute a ‘small’, ‘medium’, or ‘large’ effect. Obviously much depends on the data set, but do you have any general suggestions?

These pseudo R-squares tend to have values very similar to conventional R-squares.

I am trying to use Tjur’s R2 after a Firth regression, but am getting very strange outputs; e.g., the mroz data above gives an R2 = 1.3.

This must be due to the penalized likelihood, but can it be adjusted by weighting, or?

What software are you using? As long as the predicted values are between 0 and 1, calculation of the Tjur R2 shouldn’t be a problem. The Firth method should produce predicted values between 0 and 1.

Thanks for posting about Tjur’s R2 – how is Tjur commonly pronounced?

I believe it is pronounced “choor”

I like this postestimation test for a regular binary logistic regression. However, it seems not to work when running a Firth logistic regression, where it produces values that are larger than 1. A quick check of the predicted values shows why: the predicted scores are no longer bounded by 1 after running these models.

Is there any way to get the equivalent of Tjur’s coefficient of discrimination after running a Firth logistic regression in Stata?

I think the problem is that when you use the firthlogit command in Stata, the predict command does not produce predicted probabilities. Instead, it gives the linear predictor or, equivalently, the log-odds. To get probabilities, you need to do the following:

firthlogit y x1 x2 x3

predict logodds

gen predprob=1/(1+exp(-logodds))

Then calculate the Tjur R2 using these predicted probabilities.
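To see why feeding log-odds into the calculation can give a value above 1, here is a small Python sketch. It is only an illustration in a different language from the Stata commands above. The outcomes and linear predictors are made up, and the function names are hypothetical.

```python
import math

def inv_logit(log_odds):
    """Convert a log-odds value (linear predictor) to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def tjur_r2(y, scores):
    """Mean score among events minus mean score among non-events.

    This is Tjur's R2 only when the scores are predicted probabilities.
    """
    events = [s for yi, s in zip(y, scores) if yi == 1]
    nonevents = [s for yi, s in zip(y, scores) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    return sum(events) / len(events) - sum(nonevents) / len(nonevents)

# Hypothetical outcomes and linear predictors, for illustration only.
y = [1, 1, 0, 0, 1, 0]
log_odds = [2.0, 0.5, -1.0, -2.5, 3.0, 0.0]

# Correct: transform to probabilities first, then take the difference in means.
probs = [inv_logit(v) for v in log_odds]
r2_from_probs = tjur_r2(y, probs)

# The mistake described above: treating log-odds as if they were probabilities.
r2_from_log_odds = tjur_r2(y, log_odds)

print(f"From probabilities: {r2_from_probs:.3f}")
print(f"From log-odds (wrong): {r2_from_log_odds:.3f}")
```

Since log-odds are unbounded, the difference in their group means is unbounded too, which is consistent with the values above 1 reported in the comments.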

Great post! I actually have a question about the model form of hazard analysis. I’ve been using the book “Survival Analysis Using SAS” (very useful!) and it seems all the models in the book use an exponential form: h=exp(a+b*X1+c*X2), where h means hazard, and X1, X2 are the independent variables. I noticed though that when I use a power form, say something like h=X1^(a+b*X2), changing the unit of X1 changes the significance test results, and even the AIC. I was wondering if you have ever encountered this or what your suggestions are. I apologize if this is not the appropriate place to ask this, but I’m really curious. Thanks!

I’m not surprised that changing the units of x1 would substantially change the results. Unless x1 has a coefficient, there is nothing to absorb changes in units. But why would you even consider a model like this?

Thanks for the reply! This model was used by a former graduate student in our lab 10 years ago. He recalled this power model as the most “efficient” and “easiest” form for his data (he couldn’t recall many details). What’s interesting is that when I run a power form and an exponential form on our data, the former model seems to always come up with a smaller AIC. The significance test results differ depending on the covariates included. So we were wondering if there is a reason to prefer one to the other, or maybe it’s data specific?

Thanks!

Sorry but I don’t have any insights regarding this choice.

And a small correction about the models, the power form should be h=a*X1^(b+c*X2), is this “a” what you meant by “X1 has a coefficient”? But I think it didn’t absorb changes in units. A parallel exponential form we compare result with would be: h=exp(a+b*X1+c*X1*X2). Sorry about the confusion.

Your power-form hazard can be negative if X1 is. This is quite possible if, for example, X1 is mean standardized. This disqualifies it as a general hazard form.

Very interesting. On a trivial point, I believe the Tjur stat is the absolute value of the difference, as the difference comes up negative, at least in this example.

Tue Tjur does not use the absolute value in his paper. If the model predicts worse than random, a negative measure of ability to predict makes sense to me.

Interesting point. However, I don’t think it’s possible for a logistic regression model to produce a negative Tjur R2. Maybe some other method could.

Great post!

I believe you accidentally “flipped” the Cox & Snell R^2…

It should be 1 – (L0 / LM)^(2/n) and not 1 – (LM / L0)^(2/n).

(that’s how it’s written in Nagelkerke’s paper).

Thanks! I’ve corrected the formula.

Lucid advice, and useful…especially helpful for students transitioning from linear regression to logistic regression.