R Should Be Your Second Language (If It’s Not Already Your First)
April 19, 2019 By Paul Allison
When R first came out, around the year 2000, I was really excited. Here was a powerful, programmable statistical package that was free to anyone. I thought “This could revolutionize data analysis.” But when I gave it a test run, I quickly got discouraged. All the routine data management tasks seemed much harder in R than in SAS or Stata, my go-to statistical packages. And many of my favorite analytical tools (like Cox regression) that were readily available in commercial packages seemed to be missing from R. Plus, everything was formulated in terms of vectors and matrices, and I knew that wouldn’t fly with the sociology graduate students I was teaching. So I put R back on the shelf and let others play with this shiny new toy.
Fast forward 20 years and things have changed dramatically. It now seems that no matter what you want to do in statistics, there’s an R package for that. Slick interface apps like RStudio have streamlined most of the routine tasks of managing R itself. And meta-packages like the tidyverse have made data management and manipulation much easier and more straightforward.
Along with those changes, the community of R users has grown enormously. It’s hard to determine the exact number of R users, but recent estimates (by Oracle and by Revolution Analytics) put the worldwide user base at about 2 million. In a regularly updated blog post, Robert Muenchen has been tracking the popularity of data science software for several years, using a variety of metrics. One graph that I found particularly interesting was this one, which shows the number of mentions of various statistical packages in scholarly articles, based on a search of Google Scholar:
The most striking feature of this graph is the overall dominance of SPSS. However, SPSS peaked in 2009 and has since suffered a massive decline. SAS has also declined greatly since 2010, and it is now below R (the orange triangles) which has been steadily increasing in scholarly mentions since its inception.
The growth in R has two important consequences:
- It’s now much easier to get answers to any kind of R question, either by a quick web search or by just asking someone who works nearby.
- You can readily communicate your program code to hundreds of thousands of other researchers, without worrying about whether they have the same software installed.
It’s this second point that I want to stress here. I’ve come to the conclusion that R should be every data analyst’s second language. To me, it’s like English. No matter what your native language, it’s extremely useful to have English as a second language so that you can communicate easily with others who don’t share your native tongue. And so it is with R.
I’ve certainly found that to be true in my own work. Although I’m far from being a skilled R programmer, I’ve recently developed enough familiarity and competence with R that I can translate nearly all of my teaching examples and exercises into R. Currently, I try to provide R code for all the seminars I teach for Statistical Horizons. My two most recent publications have included R code. And I can now do a much better job of reading and reviewing papers that include R code.
R also enables me to do some things that I can’t do in SAS or Stata. For example, R can do multiple imputation for clustered data using either the jomo package or the mice package. And for linear structural equation modeling, I prefer the lavaan package over what’s in SAS or Stata (but Mplus is still better).
More and more people who come to Statistical Horizons seminars prefer to work in R, and they really appreciate having R code. Increasingly, our new instructors also prefer to teach their courses using R. Of course, that’s mainly a cohort phenomenon. Statistics departments overwhelmingly prefer to teach with R, at both the undergraduate and graduate levels, so there are thousands of newly-minted R users every year.
Don’t get me wrong, there are still lots of things I don’t like about R. There are so many user-contributed R packages that it’s hard to figure out which one will do what you want. And because there will often be many packages that will do you want, you then have the problem of figuring out which one is best. Documentation for many packages is sparse or poorly written. Even more bothersome is that some packages conflict with other packages because they have functions (or other elements) that share the same names. That can cause serious headaches.
Despite those problems, I am confident that R has a bright future. There are even newer competitors (like Python) that are favored in the data science/machine learning community. But R has reached such a critical mass that it will be very hard to stop.
How do you go about learning R? Well, there are lots of good materials on the web. But I think the best way is to take our seminar Statistics with R taught by Professor Andrew Miles. It’s designed for people who are reasonably proficient in statistics and already familiar with another package (like SPSS, SAS or Stata), but who just want to learn how to do those same things in R. In two days, you’ll get comfortable enough with R to do most of the statistical tasks that you are already doing in other packages. Andrew’s course also has lots of hands-on exercises, something that’s essential for deep learning of any new tool. You can get more information about his course here.