Text as Data

A 2-Day Seminar Taught by Brandon Stewart, Ph.D.

Never before in human history has so much information been so easy to access. The promise of this wealth of information is immense, but its sheer volume makes it difficult to summarize and interpret. However, a burgeoning array of algorithms and statistical methods is beginning to make analysis of this information possible. These new forms of data and new statistical techniques provide opportunities to observe behavior that was previously unobservable, to measure quantities of interest that were previously unmeasurable, and to test hypotheses that were previously impossible to test.

This seminar provides an overview of the field of “Text as Data” with an emphasis on the social sciences. The course is organized around the tasks in the research process: discovery, measurement, and inference. We will introduce methods from natural language processing and machine learning (such as clustering, topic modeling, and supervised classification) while demonstrating through applications how they can be incorporated into the research process. Implementation of the methods will be demonstrated using the R programming language.


This seminar will use R for the demonstrations. To participate in the hands-on exercises, you are strongly encouraged to bring a laptop computer with the most recent versions of R and RStudio installed. Basic knowledge of R is required to follow these demonstrations. This includes familiarity with matrices, vectors, lists, and data frames; basic data processing skills (e.g., cleaning, merging, or reshaping data in R); and beginner-level programming knowledge (e.g., functions and loops).

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

Who should attend? 

This course is primarily designed for researchers and practitioners who have limited or no prior experience collecting or analyzing text data using automated methods, but have basic familiarity with R.

While this seminar will include coding exercises to demonstrate how the tools can be used in practice, the majority of the time will be devoted to how to use these tools to make inferences in either a research or industry setting. For that reason, even those with no prior coding experience can enjoy and benefit from the course. The course can also benefit those who have prior experience with text tools but want a firmer foundation for how to think about text as evidence.

Many of the methods in contemporary text analysis involve statistical models. A basic statistical foundation (probability, linear regression) will help, but the course will always provide the core intuition for users who don’t have that background. (Just know that we might still ask you to look at some equations!)

Location, Format, and Materials

The class will meet from 9 am to 5 pm each day, with a 1-hour lunch break, at Temple University Center City, 1515 Market Street, Philadelphia, PA 19103.

Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer output, and many other useful features. This book frees participants from the distracting task of note taking.

Registration and Lodging

The fee of $995 includes all course materials. The early registration fee of $895 is available until May 11.

Refund Policy

If you cancel your registration at least two weeks before the course is scheduled to begin, you are entitled to a full refund (minus a processing fee of $50). 

Lodging Reservation Instructions 

A block of guest rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut Street, Philadelphia, PA at a special rate of $169 per night. The hotel is about a 5-minute walk from the seminar location. To make reservations, call 203-905-2100 during business hours and identify yourself using group code STH610, or click here. To guarantee the rate and availability, you must reserve your room no later than Monday, May 11, 2020.

If you need to make reservations after the cut-off date, you may call Club Quarters directly and ask for the “Statistical Horizons” rate (do not use the code or mention a room block) and they will try to accommodate your request.


1. Introductions and Principles
     a. What Text Methods Can Do
     b. Core Concepts and Principles
     c. Example Applications
2. Representing Text as Data
     a. A Basic Recipe
     b. The Multinomial Language Model
     c. The Vector Space Model
     d. Word Embeddings
3. Discovery
     a. Principles
     b. Clustering
     c. Topic Models
     d. Interpretation
4. Measurement
     a. Principles
     b. Dictionary Methods
     c. Scaling
     d. Supervised Classification
     e. Assessing Performance
     f. Repurposing Discovery
5. Causal Inference
     a. Text as Outcome
     b. Text as Treatment
     c. Text as Confounder
6. The Future
     a. Recent Advances in Natural Language Processing
     b. Connections to Other Types of Data