Text As Data

A 2-Day Seminar Taught by Christopher Bail, Ph.D.

The past decade has witnessed not only an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, but also the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems within industry and academia, text-based data are not easily collected or analyzed using conventional statistical methods. Fortunately, the widespread availability of text-based data coincides with major advances in the fields of computer science and natural language processing.

This hands-on course will provide participants with an overview of popular techniques for collecting and analyzing text-based data, including screen-scraping, application programming interfaces (APIs), topic modeling, and network-based text classifiers. Though the class will review basic programming techniques such as loops and functions, students with no familiarity with these techniques may wish to review them in advance. The majority of our time will be spent mastering the following R packages: rvest, twitteR, Rfacebook, lda, stm, LDAvis, and textnets, as well as a variety of functions in base R.


This seminar will use R for the empirical examples and exercises. To participate in the hands-on exercises, you are strongly encouraged to bring a laptop computer with the most recent version of R and RStudio installed. 

Basic knowledge of R is required for this course. There are a number of excellent introductory books on R, as well as a collection of online tutorials for people who are unfamiliar with R (e.g., https://www.tutorialspoint.com/r/).


This seminar is designed for researchers or practitioners who have no prior experience collecting or analyzing text data using automated methods but have basic familiarity with R (e.g., matrices, vectors, lists, and data frames), basic data processing skills (e.g., cleaning, merging, or reshaping data in R), and beginner-level programming knowledge (e.g., functions and loops). Though these basic subjects will be reviewed in the first part of the seminar, participants with no prior knowledge of them may find that the seminar moves too quickly.

Location, Format, and Materials

The class will meet from 9 am to 5 pm each day, with a 1-hour lunch break, at Temple University Center City, 1515 Market Street, Philadelphia, PA 19103.

Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer printouts, and many other useful features. This manual frees participants from the distracting task of note-taking.

Registration and Lodging

The fee of $995 includes all seminar materials. The early registration fee of $895 is available until April 30.

Refund Policy

If you cancel your registration at least two weeks before the course is scheduled to begin, you are entitled to a full refund (minus a processing fee of $50). 

Lodging Reservation Instructions 

A block of guest rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut Street, Philadelphia, PA, at a special rate of $159 per night. The hotel is about a 5-minute walk from the seminar location. To make a reservation, call 203-905-2100 during business hours and identify yourself using group code SH0530. For guaranteed rate and availability, you must reserve your room no later than Monday, April 30, 2018.

If you make reservations after the cut-off date, ask for the Statistical Horizons room rate (do not use the code) and the hotel will try to accommodate your request.


1. Introduction to Text as Data
     a. The text as data explosion
     b. A brief history of automated text analysis
     c. Advantages of text analysis
     d. Challenges of text analysis

2. Screen-scraping
     a. Legal and logistical issues
     b. HTML and XML
     c. Navigating HTML and XML
     d. Browser automation

3. Application Programming Interfaces
     a. What is an API?
     b. Making your first API call
     c. Rate limiting
     d. R packages for working with APIs
     e. Writing custom code for APIs

4. Basic Text Analysis
     a. Tokenization
     b. Character encoding
     c. GREP/regular expressions
     d. Dictionary-based methods

5. Topic Models
     a. A brief introduction to topic models
     b. Creating a corpus/prepping text
     c. The term-document matrix
     d. Latent Dirichlet allocation
     e. Structural topic models
     f. Topic model validation

6. Text as Networks
     a. Basics of network theory
     b. Bipartite affiliation networks
     c. Sentence parsing
     d. Building a text network