Text As Data

A 3-Day Remote Seminar Taught by
Brandon Stewart, Ph.D.

To see a sample of the course materials, click here.

This seminar is currently sold out. Email info@statisticalhorizons.com to be added to the waitlist.

Never before in human history has so much information been so easy to access. The promise of this wealth of information is immense, but because of its sheer volume it is difficult to summarize and interpret. However, a burgeoning array of algorithms and statistical methods is beginning to make analysis of this information possible. These new forms of data and new statistical techniques provide opportunities to observe behavior that was previously unobservable, to measure quantities of interest that were previously unmeasurable, and to test hypotheses that were previously impossible to test.

This seminar provides an overview of the field of “Text as Data” with an emphasis on the social sciences. The course is organized around the tasks in the research process: discovery, measurement, and inference. We will introduce methods from natural language processing and machine learning (such as clustering, topic modeling, and supervised classification) while demonstrating through applications how they can be incorporated into the research process. Implementation of the methods will be demonstrated using the R programming language.

Starting December 3, we are offering this seminar as a 3-day synchronous*, remote workshop for the first time. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. Additional sessions will be held on Thursday and Friday afternoons as “office hours”, where you can review the exercise results with the instructor and ask any questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for one week after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously.


This remote seminar is held via Zoom, a free video conferencing application. Instructions for joining a session via Zoom are available here. Before the seminar begins, you will receive an email with the meeting code and password you must use to join.

This seminar will use R for the demonstrations. To participate in the hands-on exercises, you are strongly encouraged to use a computer with the most recent versions of R and RStudio installed. Basic knowledge of R is required to follow these demonstrations. This includes familiarity with matrices, vectors, lists, and data frames; basic data processing skills (e.g., cleaning, merging, or reshaping data in R); and beginner-level programming knowledge (e.g., functions and loops).
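
As a rough self-check, the base R sketch below touches each of the skills listed above: a data frame, a function, a list, a vector, a matrix filled by a loop, and a merge. It is only an illustration, not part of the course materials. The toy documents, the `party` column, and the helper name `tokenize` are all made up for this example.

```r
# Hypothetical toy data: three short documents with an author label.
docs <- data.frame(
  id = 1:3,
  author = c("a", "b", "a"),
  text = c("Text as data", "Data about text", "More text"),
  stringsAsFactors = FALSE
)

# A function: lowercase a string and split it into word tokens.
tokenize <- function(x) {
  strsplit(tolower(x), "[^a-z]+")[[1]]
}

# A list: one character vector of tokens per document.
tokens <- lapply(docs$text, tokenize)

# A vector: the sorted vocabulary across all documents.
vocab <- sort(unique(unlist(tokens)))

# A matrix filled by a loop: word counts, one row per document.
counts <- matrix(0, nrow = nrow(docs), ncol = length(vocab),
                 dimnames = list(as.character(docs$id), vocab))
for (i in seq_along(tokens)) {
  tab <- table(tokens[[i]])
  counts[i, names(tab)] <- as.numeric(tab)
}

# Merging: attach a second (made-up) table by author.
meta <- data.frame(author = c("a", "b"), party = c("X", "Y"),
                   stringsAsFactors = FALSE)
merged <- merge(docs, meta, by = "author")

print(counts)
print(merged)
```

If each step here reads comfortably, you likely have the R background the demonstrations assume.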

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

WHO SHOULD REGISTER?

This course is primarily designed for researchers and practitioners who have limited or no prior experience collecting or analyzing text data using automated methods, but who have basic familiarity with R.

While this seminar will include coding exercises to demonstrate how the tools can be used in practice, the majority of the time will be devoted to how to use these tools to make inferences in either a research or industry setting. For that reason, even those with no prior coding experience can enjoy and benefit from the course. The course can also benefit those who have prior experience with text tools but want a firmer foundation for how to think about text as evidence.

Many of the methods in contemporary text analysis involve statistical models. A basic statistical foundation (probability, linear regression) will help, but the course will always provide the core intuition for users who don’t have that background. (Just know that we might still ask you to look at some equations!)


1. Introductions and Principles
     a. What Text Methods Can Do
     b. Core Concepts and Principles
     c. Example Applications
2. Representing Text as Data
     a. A Basic Recipe
     b. The Multinomial Language Model
     c. The Vector Space Model
     d. Word Embeddings
3. Discovery
     a. Principles
     b. Clustering
     c. Topic Models
     d. Interpretation
4. Measurement
     a. Principles
     b. Dictionary Methods
     c. Scaling
     d. Supervised Classification
     e. Assessing Performance
     f. Repurposing Discovery
5. Causal Inference
     a. Text as Outcome
     b. Text as Treatment
     c. Text as Confounder
6. The Future
     a. Recent Advances in Natural Language Processing
     b. Connections to Other Types of Data