Let me tell you about my favorite new toy, the SAS® University Edition, which was just released on May 28. It’s essentially free SAS for anybody who wants it, and it has the potential to be a real game changer. SAS has long had a reputation for being one of the best statistical packages around, but also one of the most expensive. Last I checked, the starting price for a single-user license was around $10,000. Not surprisingly, virtually everyone who uses SAS gets their license through their employer or their university.

So why is SAS now offering the core of its product line for free? For many years, SAS has made tons of money selling software to big companies, but its popularity among academics has been steadily waning. The decline in the academic market share has been especially steep in statistics departments where R has now become the preferred programming environment. This has created a serious problem for the SAS business model because the students of today are the business analysts of tomorrow. If they graduate with no experience using SAS, they will be far less likely to insist that their companies pay for a very costly software package. And the many companies that currently use SAS are finding it increasingly difficult to find new hires with SAS skills. 

SAS has made some previous attempts to solve this dilemma. Several years ago they released the SAS Learning Edition, which individuals could buy for around $100. But the functionality of that product was so limited that it was really only good for learning how to code in SAS. More recently, they introduced SAS On Demand, which enabled academic users to access SAS via a web server. I tried using this system for a couple of courses, but I found it way too cumbersome, both for me and for my students.

With the University Edition (UE), SAS has finally produced a winner. Here are some things I like about it:

  • UE includes most of the SAS products that statistical analysts will need:  BASE, STAT, IML, and ACCESS.
  • It’s a completely local package and does NOT have to be connected to the Internet. 
  • UE can handle fairly large data sets (more on that later). 
  • When you sign on with an Internet connection, you are notified if an update is available. You can then update with the click of a button.
  • The browser-based interface, called SAS Studio, is a snap to learn and use.
  • SAS Studio will run in recent editions of all popular browsers, including Internet Explorer, Chrome, Safari and Firefox. 
  • UE can run on Macs, Windows, and Linux machines.
  • It runs smoothly and speedily, although not quite as fast as a regular installed version of SAS.
  • And did I mention that it’s absolutely free for anyone who wants it?

The license agreement states that UE can be used “solely for your own internal, non-commercial academic purposes.” As far as I can tell, there’s nothing to prevent someone in a business setting from downloading, installing, and running UE. But business users should bear in mind that the SAS Institute is known for zealously protecting its intellectual property.

You’re probably wondering, what’s the catch?  Well, there are a few things not to like, but they are relatively minor in my opinion:

  • UE only installs on 64-bit machines with at least 1 gig of memory.
  • UE doesn’t have SAS/ETS (econometrics & time series), SAS/OR (operations research) or SAS/QC (quality control). Most importantly, it doesn’t have SAS/GRAPH, although it does have ODS graphics. So you can’t use PROC GPLOT, but you can use PROC SGPLOT (see the short example after this list).
  • If you’re not connected to the Internet, it can take up to two minutes to start up, compared to only 10 seconds if you are connected. Weird, huh?
  • Installation can be a little tricky, so you need to follow all the instructions carefully.
  • It took me nearly two hours to download UE, but that was over a not-so-speedy Wi-Fi connection.
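
About that graphics limitation: if you haven’t yet made the switch from SAS/GRAPH, here’s a minimal SGPLOT sketch, using the sashelp.class data set that ships with SAS, that produces the kind of scatter plot you might otherwise have drawn with PROC GPLOT:

    proc sgplot data=sashelp.class;   /* sashelp.class ships with every SAS installation */
      scatter x=height y=weight;      /* scatter plot rendered with ODS graphics */
    run;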

Now for a few details and suggestions. UE runs as a virtual machine, so you first need to download and install a free copy of Oracle’s VirtualBox software. (UE also runs with VMware Player or VMware Fusion, but those cost real money). After downloading UE, you open VirtualBox and then install UE as a virtual machine. With VirtualBox still open, you can start up UE by pointing your web browser to http://localhost:10080. For more details, check out the FAQs on the SAS support site.

I was warned by a SAS tech support person that UE may not work on “very large” data sets. But it worked fine with the biggest data set that I have, which has 414,000 cases, 674 variables, and takes up 888 MB on my computer.

If you want to use existing SAS data sets and programs, the most straightforward approach is to copy them into a dedicated folder for UE. Alternatively, you can create a folder shortcut to your existing data sets–but the process is a bit tricky. 
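
For example, once you’ve set up a shared folder in VirtualBox (by default it appears inside the virtual machine as /folders/myfolders; treat that path as an assumption and adjust it to your own setup), a LIBNAME statement is all it takes to make existing data sets visible to UE. The library and folder names below are made up:

    libname mydata '/folders/myfolders/sasdata';   /* path assumes the default UE shared folder */

    proc contents data=mydata._all_ nods;          /* quick check that the library is visible */
    run;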

When I ran UE using a SAS data set that had been created by SAS 9.3 on a Windows machine, I got a warning in the Log window that the data set “is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.” I’m guessing that this happens because VirtualBox creates a Linux environment for UE to run in. And SAS data sets in Windows are not identical to SAS data sets in Linux.

In any case, this difference in file formats can really slow things down. When I ran a logistic regression with five predictors on the aforementioned data set, it took 47 seconds of real time and 38 seconds of CPU time. My solution was to use a DATA step to copy the old data set into a new data set (presumably in UE’s preferred format). When I re-ran the logistic regression on the new data set, execution improved dramatically: real time declined to 18 seconds and CPU time to 6.5 seconds. By comparison, when I ran the same regression on my standard installed version of SAS 9.3, the real time was 12 seconds and the CPU time was 2 seconds. So UE is definitely slower than “real” SAS, but the difference seems tolerable for most applications.
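
In case it’s useful, here’s roughly what that copy-and-rerun step looks like. The library, data set, and variable names are all made up; the point is just the one-line DATA step copy followed by the regression:

    libname mylib '/folders/myfolders/sasdata';   /* hypothetical location of the old data set */

    data mylib.bigfile_ue;      /* re-written by UE, so presumably in its native format */
      set mylib.bigfile;        /* original data set created by SAS 9.3 under Windows */
    run;

    proc logistic data=mylib.bigfile_ue descending;
      model y = x1 x2 x3 x4 x5; /* five hypothetical predictors */
    run;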

SAS Studio is the slick new interface for accessing SAS via a web browser. It’s designed not just for UE, but for any environment where users need to access SAS on a remote server. SAS Studio will be instantly familiar to anyone who has used the traditional SAS Display Manager with its editor window (now called Code), Log window, and Results window. As with PC SAS, you can have multiple program windows open in SAS Studio. But unlike PC SAS, each program window has its own Log and Results window. If you’re accustomed to using SAS on a PC, you can immediately start doing things the way you’ve always done them. However, there are lots of cool new features, most of which are easily learned by pointing and clicking on icons. For example, when you’re in the Results window, there are buttons that will save your output to an HTML file, a PDF file, or an RTF file.

Here’s a hint that you may find useful: by default, SAS Studio is in batch mode. That means that whenever you run a block of code, whatever is already in the Log and Results windows will get overwritten. If you want your results to accumulate, click on the “go interactive” icon in the Code window. You can also change your Preferences to start each session in interactive mode. The downside of interactive mode is that temporary data sets and macros produced in one program window are not available to any other program window.
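
To illustrate that last point (the data set name is made up): in interactive mode, a temporary data set created in one Code window simply won’t be found by another:

    /* Code window 1 (interactive mode) */
    data work.scratch;
      x = 1;
    run;

    /* Code window 2: in interactive mode, WORK.SCRATCH will not be found here */
    proc print data=work.scratch;
    run;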

If you plan to use UE a lot, it’s worth investing some time to learn the ins and outs of SAS Studio. A good introductory article (22 pages) can be found here.  Or click here for an 8-minute video tutorial. If total mastery is your thing, you can download the 300-page manual here.

So there you have it, free SAS in a (virtual) box. I would guess that at least 95% of the statistical analyses that I’ve done using SAS over the last 10 years could have been done with UE. That’s great news for potential users who don’t currently have access to SAS. But it must be a little scary for the SAS Institute. Will this free product cannibalize existing sales? Loss leaders are always risky, and it will be interesting to see how this plays out. Personally, I’m rooting for UE to be a big success, both for users and for SAS.

In the first chapter of my 1999 book Multiple Regression, I wrote:

“There are two main uses of multiple regression: prediction and causal analysis. In a prediction study, the goal is to develop a formula for making predictions about the dependent variable, based on the observed values of the independent variables….In a causal analysis, the independent variables are regarded as causes of the dependent variable. The aim of the study is to determine whether a particular independent variable really affects the dependent variable, and to estimate the magnitude of that effect, if any.”

As in most regression textbooks, I then proceeded to devote the bulk of the book to issues related to causal inference—because that’s how most academic researchers use regression most of the time.

Outside of academia, however, regression (in all its forms) is primarily used for prediction. And with the rise of Big Data, predictive regression modeling has undergone explosive growth in the last decade. It’s important, then, to ask whether our current ways of teaching regression methods really meet the needs of those who primarily use those methods for developing predictive models.

Although regression can be used both for causal inference and for prediction, there are some important differences in how the methodology is used, or should be used, in the two kinds of application. I’ve been thinking about these differences lately, and I’d like to share a few that strike me as particularly salient. I invite readers of this post to suggest others as well.

1. Omitted variables. For causal inference, a major goal is to get unbiased estimates of the regression coefficients. And for non-experimental data, the most important threat to that goal is omitted variable bias. In particular, we need to worry about variables that both affect the dependent variable and are correlated with the variables that are currently in the model. Omission of such variables can totally invalidate our conclusions.

With predictive modeling, however, omitted variable bias is much less of an issue. The goal is to get optimal predictions based on a linear combination of whatever variables are available. There is simply no sense in which we are trying to get optimal estimates of “true” coefficients. Omitted variables are a concern only insofar as we might be able to improve predictions by including variables that are not currently available. But that has nothing to do with bias of the coefficients.

2. R². Everyone would rather have a big R² than a small one, but that criterion matters much more in a predictive study. In a causal analysis, even with a low R², you can do a good job of testing hypotheses about the effects of the variables of interest. That’s because, for parameter estimation and hypothesis testing, a low R² can be counterbalanced by a large sample size.

For predictive modeling, on the other hand, maximizing R² is crucial. Technically, the more important criterion is the standard error of prediction, which depends on both the R² and the variance of y in the population. In any case, large sample sizes cannot compensate for models that lack predictive power.
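
To make that relationship concrete: under the usual linear model, and ignoring sampling error in the estimated coefficients, the standard deviation of the prediction errors is roughly

    σ_pred ≈ σ_y √(1 − R²)

so even a model with an R² of .75 leaves prediction errors whose standard deviation is half that of y itself.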

3. Multicollinearity. In causal inference, multicollinearity is often a major concern. The problem is that when two or more variables are highly correlated, it can be very difficult to get reliable estimates of the coefficients for each one of them, controlling for the others. And since the goal is accurate coefficient estimates, this can be devastating.

In predictive studies, because we don’t care about the individual coefficients, we can tolerate a good deal more multicollinearity. Even if two variables are highly correlated, it can be worth including both of them if each one contributes significantly to the predictive power of the model.
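
If you do want to see how much collinearity a predictive model is carrying, it’s easy to check in SAS; for example (the variable names are hypothetical):

    proc reg data=mydata;
      model y = x1 x2 x3 x4 x5 / vif tol;   /* variance inflation factors and tolerances */
    run;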

4. Missing data. Over the last 30 years, there have been major developments in our ability to handle missing data, including methods such as multiple imputation, maximum likelihood, and inverse probability weighting. But all these advances have focused on parameter estimation and hypothesis testing. They have not addressed the special needs of those who do predictive modeling.

There are two main issues in predictive applications. First, the fact that a data value is missing may itself provide useful information for prediction. And second, it’s often the case that data are missing not only for the “training” sample, but also for new cases for which predictions are needed. It does no good to have optimal estimates of coefficients when you don’t have the corresponding x values by which to multiply them.

Both of these problems are addressed by the well-known “dummy variable adjustment” method, described in my book Missing Data, even though that method is known to produce biased parameter estimates. There may well be better methods, but the only article I’ve seen that seriously addresses these issues is a 1998 unpublished paper by Warren Sarle.
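
For readers who haven’t seen it, here is a bare-bones sketch of dummy variable adjustment for a single predictor. The data set and variable names are made up, and the fill-in constant of 0 is arbitrary (the observed mean is another common choice):

    data train2;
      set train;                         /* hypothetical training data set */
      x1_miss = missing(x1);             /* 1 if x1 is missing, 0 otherwise */
      if missing(x1) then x1_fill = 0;   /* fill in an arbitrary constant */
      else x1_fill = x1;
    run;

    proc reg data=train2;
      model y = x1_fill x1_miss x2;      /* the missingness indicator enters as a predictor */
    run;

Applying the same recoding to new cases means predictions can still be generated when x1 is missing, which is exactly the situation described above.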

5. Measurement error. It’s well known that measurement error in predictors leads to bias in estimates of regression coefficients. Is this a problem for a predictive analysis? Well, it’s certainly true that poor measurement of predictors is likely to degrade their predictive power. So efforts to improve measurement could have a payoff. Most predictive modelers don’t have that luxury, however. They have to work with what they’ve got. And after-the-fact corrections for measurement error (e.g., via errors-in-variables models or structural equation models) will probably not help at all.

I’m sure this list of differences is not exhaustive. If you think of others, please add a comment. One could argue that, in the long run, a correct causal model is likely to be a better basis for prediction than one based on a linear combination of whatever variables happen to be available. It’s plausible that correct causal models would be more stable over time and across different populations, compared with ad hoc predictive models. But those who do predictive modeling can’t wait for the long run. They need predictions here and now, and they must do the best with what they have.