Machine Learning with Text Data

A 2-Day Seminar Taught by Christopher Bail, Ph.D.

If you are interested in registering for this seminar, please email info@statisticalhorizons.com. 


The past decade has witnessed not only an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, but also the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems in industry and academia, text-based data are not easily collected or analyzed using conventional statistical methods. Fortunately, the widespread availability of text-based data coincides with major advances in computer science and natural language processing.

This hands-on course will provide participants with an overview of popular techniques for collecting and analyzing text-based data, including screen-scraping, application programming interfaces (APIs), topic modeling, and network-based text classifiers. Though the class will review basic programming techniques such as loops and functions, participants with no familiarity with these techniques may wish to review them in advance. The majority of our time will be spent mastering the following R packages: rvest, twitteR, Rfacebook, lda, stm, LDAvis, and textnets, as well as a variety of functions in base R.


COMPUTING

This seminar will use R for the empirical examples and exercises. To participate in the hands-on exercises, you are strongly encouraged to bring a laptop computer with the most recent version of R and RStudio installed. 

Basic knowledge of R is required for this course. For people who are unfamiliar with R, there are a number of excellent introductory books on R as well as a collection of online tutorials (e.g., Tutorials Point).


WHO SHOULD ATTEND?

This seminar is designed for researchers or practitioners who have no prior experience collecting or analyzing text data using automated methods but who have basic familiarity with R (e.g., matrices, vectors, lists, and data frames), basic data processing skills (e.g., cleaning, merging, or reshaping data in R), and beginner-level programming knowledge (e.g., functions and loops). Though these basic subjects will be reviewed in the first part of the seminar, participants with no prior knowledge of them may find that the seminar moves too quickly.


Location, Format and Materials

The class will meet from 9 am to 5 pm each day, with a 1-hour lunch break, at Temple University Center City, 1515 Market Street, Philadelphia, PA 19103.

Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer output, and many other useful features. This manual frees participants from the distracting task of note-taking.


Registration and Lodging

The fee of $995.00 includes all seminar materials. The early registration fee of $895.00 is available until March 26.

Refund Policy

If you cancel your registration at least two weeks before the course is scheduled to begin, you are entitled to a full refund (minus a processing fee of $50). 

Lodging Reservation Instructions

Hotel information will be posted when available.


SEMINAR OUTLINE

1. Introduction to text as data
     a. The text as data explosion
     b. A brief history of automated text analysis
     c. Advantages of text analysis
     d. Challenges of text analysis

2. Screen-scraping
     a. Legal and logistical issues
     b. HTML and XML
     c. Navigating HTML and XML
     d. Browser automation

3. Application Programming Interfaces
     a. What is an API?
     b. Making your first API call
     c. Rate limiting
     d. R packages for working with APIs
     e. Writing custom code for APIs

4. Basic text analysis
     a. Tokenization
     b. Character encoding
     c. GREP/Regular Expressions
     d. Dictionary-based methods

5. Topic Models
     a. A brief introduction to topic models
     b. Creating a corpus/prepping text
     c. The Term-Document Matrix
     d. Latent Dirichlet Allocation
     e. Structural Topic Models
     f. Topic Model Validation

6. Text as Networks
     a. Basics of network theory
     b. Bipartite affiliation networks
     c. Sentence parsing
     d. Building a text network


Comments from Recent Participants

“The Text As Data course expanded my understanding of the most recent approaches and tools for collecting and processing unstructured text. The course is very well structured, and the instructor is fantastic!”
  Sinziana Dorobantu, New York University

“This course provided invaluable insights into where this discipline is heading and what skills will be needed as it matures.”
  Michael Siebel, Fors Marsh Group

“As a social scientist, I am encouraged by this experience to learn more about text analysis and apply the skills to my research.”
  Hui-Ming Deanna Wang, San Francisco State University

“One of the most daunting aspects of jumping into a new area of study, like natural language processing, is just getting the lay of the land – learning what has been done and what is possible. This course is an ideal first step, capitalizing well on just some foundational knowledge of R to really get new scholars up to speed.”
  Jeff Antsen, Temple University

“This introductory course teaches useful and foundational text mining/analysis techniques in R. It is a great course for researchers who try to analyze large volumes of data in an automated way. The instructor possesses expert knowledge in the field.”
  Ling Na, Incyte Corporation

“I found the course and instructor to be clear and user-friendly. I feel confident in my new background in these methods, so much so that I am wondering how best to apply them to current projects.”
  Katharine Bloeser, Hunter College

“A very useful course to know the state-of-the-art techniques in text mining.”
  Onook Oh, University of Colorado, Denver