Introduction to Text as Data: A Short Course

A 3-Day Livestream Seminar Taught by Amber Boydstun, Ph.D. and Cory Struthers, Ph.D.

Text is all around us: from archived court documents to this morning’s social media posts, from transcripts of political ads to terrorist manifestos. Text-as-data methods allow us to use this text to measure and discover phenomena that may be otherwise hard or impossible to represent quantitatively, such as ideological positions of court documents and emotional sentiment in manifestos.

There has never been a more exciting time to learn text-as-data methods. Digital advances have made available text content that even a few years ago would have been difficult to collect, and computational text-as-data methods have advanced just as fast. However, because there is now a vast amount of text data to explore and a dizzying array of accessible text-as-data tools to apply, understanding which methods are appropriate for which contexts is critically important.

This course will provide an introduction to text-as-data methods, including how they work, how they can be applied, and common pitfalls to avoid. We will focus on linking concepts to measurement through textual data. Topics covered include: manual content analysis; text collection and pre-processing; advanced keyword queries and frequencies; dictionary analysis (including sentiment analysis); text similarity and reuse; topic modeling; and supervised machine learning.

Starting January 25, we are offering this seminar as a 3-day synchronous*, livestream workshop held via the free video-conferencing software Zoom. Each day will consist of two lecture sessions which include hands-on exercises, separated by a 1-hour break. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

*We understand that finding time to participate in livestream courses can be difficult. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions. Live captions can be translated to a variety of languages including Spanish, Korean, and Italian. For more information, click here.

More Details About the Course Content

This seminar provides an intensive introduction to text-as-data methods, drawing on social science research and perspectives.

We will begin with an overview of text-as-data methods, highlighting the range of applications they make possible. We will ground this discussion in classic “manual content analysis” methods, which remain the gold standard for validating computational approaches.

Next, we will move on to an overview of how to pre-process a text dataset, known as a corpus (plural: corpora). Then we will examine core text-as-data techniques for which “off the shelf” code exists: advanced keyword queries and frequencies, dictionary methods (including sentiment analysis), text similarity and reuse, and topic modeling.

Along the way, we will discuss (but not cover in detail) more advanced text-as-data methods that require additional data and/or expertise but that also open up additional avenues of research.

Here are some of the things you will be able to do by the end of this course:

    • Develop a content analysis codebook.
    • Pre-process text for analysis.
    • Calculate frequencies of key words or phrases in a corpus.
    • Evaluate the sentiment of a corpus.
    • Apply dictionary methods to a corpus.
    • Identify topics in a corpus.
    • Have the foundational knowledge to learn more about advanced text analysis methods.
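
As a rough illustration of three of these skills (pre-processing, keyword frequencies, and dictionary-based sentiment), here is a minimal sketch. It is written in Python purely for brevity, not in the R used in the course, and the toy corpus, stopword list, and sentiment dictionary are all invented for illustration; none of them come from the course materials.

```python
import re
from collections import Counter

# Hypothetical toy corpus; not taken from the course materials.
corpus = [
    "The new policy is a great success and voters are happy.",
    "Critics call the policy a terrible failure.",
]

# Hypothetical mini sentiment dictionary and stopword list, for illustration only.
POSITIVE = {"great", "success", "happy"}
NEGATIVE = {"terrible", "failure"}
STOPWORDS = {"the", "is", "a", "and", "are"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_frequencies(docs):
    """Count tokens across all documents in the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(preprocess(doc))
    return counts


def sentiment_score(text):
    """Number of positive dictionary matches minus negative matches."""
    tokens = preprocess(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg


freqs = term_frequencies(corpus)
scores = [sentiment_score(doc) for doc in corpus]
print(freqs.most_common(3))
print(scores)
```

A real analysis would rely on an established, validated dictionary and a much larger corpus; the simple count-and-subtract score here is only one of several ways dictionary matches can be aggregated.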

Computing

This seminar will primarily use R and RStudio for in-class examples and exercises. R is a free, open-source programming language, and RStudio is a free development environment for working in R. Both should be installed before the course begins. A basic literacy in R is needed to get the most out of the course.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.

Who Should Register?

This course is designed for anyone who wants to apply text-as-data methods to newspapers, legislation, social media, meeting minutes, and other documents. No previous background in text-as-data or statistical methods is necessary. However, a working understanding of R is essential.

Outline

Day 1:

  • Introduction and overview
  • What is our goal? Defining latent variables of interest
  • The gold standard: Manual content analysis
  • Pre-processing text data
  • Approaches to measuring word frequencies

Day 2:

  • Understanding dictionary methods
  • Using established dictionaries; considerations for generating your own
  • Sentiment analysis
  • Topical dictionaries and related types

Day 3:

  • Text similarity and reuse
  • Different approaches to text similarity, including cosine similarity
  • Topic modeling and validation
  • Resources for pursuing more advanced topics

Reviews of Introduction to Text as Data

“I really enjoyed the open conversations.”
  Sukumar Ganapati, Florida International University

“I loved the balance between the theory, background, and methods. I enjoyed trying to then apply it.”
  Anandi Hira, Carnegie Mellon University

Seminar Information

Thursday, January 25 – Saturday, January 27, 2024

Daily Schedule: All sessions are held live via Zoom. All times are ET (New York time).

10:00am-12:30pm (convert to your local time)
1:30pm-3:30pm

Payment Information

The fee of $995 includes all course materials.

PayPal and all major credit cards are accepted.

Our Tax ID number is 26-4576270.