%macro predict_ldm (data=_last_, y=, x=, pred=yhat, out=predict_ldm); /********************************************************************* Version 1.0, 22 April 2020 MACRO PREDICT_LDM produces a data set containing predicted values for a linear probability model that are never less than 0 or greater than 1. It is an implementation of the method described in my blog post at https://statisticalhorizons.com/better_predicted_probabilities. The method is based on the assumption of a linear discriminant model (LDM) which implies a logistic regression model. The LDM was first described by R.A. Fisher in 1936. The macro does the following: 1. It uses PROC REG to estimate the linear regression of a binary Y variable on a set of X variables, and generates predicted values in the usual way. The default output from PROC REG is reported. 2. It then rescales the regression coefficients to correspond to a logistic regression model. 3. It generates predicted values based on a logistic regression models with the coefficients produced in step 2. These are written to a new data set that contains all the variables in the original data set, plus two new variables: --the usual linear predictions, which may be greater than 1 or less than 0. --the transformed LDM predictions, which are bounded by 0 and 1. The macro argument has five parameters, only two of which are required: DATA is the name of the data set to be analyzed. The default is the most recently created data set. Y is the dependent variable, which must have values of 0 or 1. X is a list of predictor variables. PRED is the name of the variable that will contain the predicted values. The default is YHAT. That will be the name for the usual linear predictions. The LDM predictions will be contained in a variable called YHAT_LDM. OUT is the name of the output data set. The default is PREDICT_LDM. Example of usage: %predict_ldm(data=my.data, y=var1, x=var2 var3 var4, out=newdata) Author: Paul D. Allison, Statistical Horizons LLC allison@statisticalhorizons.com *******************************************************************/ proc reg data=&data outest=a plots=none; model &y=&x; ods output fitstatistics=fit; ods output anova=anova; output out=_predict_ pred=&pred; quit; data &out; if _n_=1 then merge a anova(where=(source='Error') keep=ss source df) fit(where=(label1='Dependent Mean') rename=(cvalue1=ybar) keep=cvalue1 label1); n=df+1; k=n/ss; a = log(ybar/(1-ybar)) + k*(intercept-.5) + .5*(1/ybar-1/(1-ybar)); set _predict_; logodds=k*(&pred-intercept)+ a; &pred._ldm = 1/(1+exp(-logodds)); drop label1 source n k a df ss ybar logodds; run; %mend predict_ldm;