Advanced Machine Learning

A 3-Day Remote Seminar Taught by
Ross Jacobucci, Ph.D.



Machine learning, including artificial intelligence, big data, supervised learning, and data science, has had an enormous impact on both academic research and industry. The development of innovative machine learning algorithms has been paired with the availability of large datasets, which in turn has facilitated the collection of even larger datasets, oftentimes containing novel data types (e.g., text).

While machine learning has become increasingly easy to apply in many programming languages, it also presents a number of challenges: specifically, how to interpret the relationships between variables, how to prevent overfitting, and how to deal with the inevitable issues that arise from collecting diverse data types.

This seminar builds on introductory material on machine learning, assuming a basic familiarity with the ideas behind regularization in regression, cross-validation, and decision trees. The first day covers state-of-the-art algorithms for prediction problems with a single outcome. The second day focuses on putting everything together, namely, how best to run all of these algorithms and properly compare their results. Finally, the third day discusses a host of algorithms, each developed for a different type of unsupervised learning task. Understanding how each algorithm works will be paired with material on how to apply the method in R with minimal coding.

Starting October 7, we are offering this seminar as a 3-day synchronous*, remote workshop for the first time. Each day will consist of a 4-hour live lecture held via the free video-conferencing software Zoom. You are encouraged to join the lecture live, but will have the opportunity to view the recorded session later in the day if you are unable to attend at the scheduled time.

Each lecture session will conclude with a hands-on exercise reviewing the content covered, to be completed on your own. Additional lab sessions will be held on Thursday and Friday afternoons, where you can review the exercise results with the instructor and ask any questions.

*We understand that scheduling is difficult during this unpredictable time. If you prefer, you may take all or part of the course asynchronously. The video recordings will be made available within 24 hours of each session and will be accessible for four weeks after the seminar, meaning that you will get all of the class content and discussions even if you cannot participate synchronously. 

Closed captioning is available for all live and recorded sessions.


COMPUTING

This seminar will use R for the empirical examples and exercises. To participate in the hands-on exercises, you are strongly encouraged to have a computer with R and RStudio installed. RStudio is a freely available interface for R. This seminar presumes at least some exposure to the R computing environment.
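
If you would like to verify your setup ahead of time, a minimal sketch along the following lines may help. The specific packages shown are illustrative guesses at commonly used machine learning packages in R, not an official requirements list for the seminar.

    # Confirm which version of R is installed (a recent version is recommended)
    R.version.string

    # Install a few packages commonly used for machine learning in R
    # (an illustrative list only, not the seminar's official requirements)
    install.packages(c("glmnet", "ranger", "gbm", "e1071", "caret"))

    library(glmnet)   # regularized regression
    library(ranger)   # random forests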

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.


WHO SHOULD REGISTER?

If you have an introductory knowledge of machine learning and want to learn more advanced concepts, this course is for you. The material in this course builds on the topics taught in Machine Learning, requiring at least familiarity with logistic regression, decision trees, and regularized regression, along with the concepts of cross-validation and bootstrapping. The course will briefly recap each of these topics, take a more advanced look at each area, and build toward a number of more complex methods. The seminar will integrate the methods and results from a number of research articles from the instructor's research on suicide that utilize machine learning.
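
As a rough gauge of the assumed background, you should be comfortable reading and running a short cross-validated regularized regression such as the sketch below. It uses the glmnet package and a built-in R dataset purely for illustration and is not taken from the seminar materials.

    library(glmnet)

    # Predict miles per gallon from the other variables in the built-in mtcars data
    x <- as.matrix(mtcars[, -1])
    y <- mtcars$mpg

    # 10-fold cross-validated lasso (alpha = 1)
    set.seed(123)
    cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

    cv_fit$lambda.min               # penalty value that minimizes cross-validated error
    coef(cv_fit, s = "lambda.min")  # coefficients at that penalty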


SEMINAR OUTLINE

Day 1: Advanced Prediction

  • Regularization review
  • Gradient boosting
  • Random forest (see the sketch after this list)
  • Support vector machines
  • SuperLearner
  • Intro to neural networks
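
To give a sense of the coding level on the first day, here is a minimal random forest sketch using the ranger package and a built-in dataset. It illustrates the kind of model covered; the package, data, and settings are our own choices, not code from the seminar.

    library(ranger)

    # Two-class subset of the built-in iris data
    iris2 <- iris[iris$Species != "setosa", ]
    iris2$Species <- droplevels(iris2$Species)

    # Probability random forest; the out-of-bag error gives a quick performance check
    set.seed(1)
    rf_fit <- ranger(Species ~ ., data = iris2, num.trees = 500, probability = TRUE)

    rf_fit$prediction.error   # out-of-bag Brier score for a probability forest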

Day 2: Assessing Prediction

  • Regression fit metrics
  • Classification fit metrics (see the sketch after this list)
  • Handling imbalanced data
  • Advanced cross-validation for comparing algorithms
  • Parallel and high-performance computing
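
As a preview of the second day, the sketch below computes two common classification fit metrics, a confusion matrix and the area under the ROC curve, on simulated data. The pROC package and all names here are illustrative assumptions rather than seminar code.

    library(pROC)

    # Simulated observed classes and predicted probabilities
    set.seed(2)
    obs  <- factor(sample(c("no", "yes"), 200, replace = TRUE))
    prob <- ifelse(obs == "yes", rnorm(200, 0.6, 0.2), rnorm(200, 0.4, 0.2))

    # Confusion matrix at a 0.5 threshold
    pred <- factor(ifelse(prob > 0.5, "yes", "no"), levels = levels(obs))
    table(Predicted = pred, Observed = obs)

    # Area under the ROC curve
    auc(roc(obs, prob, levels = c("no", "yes"), direction = "<"))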

Day 3: Unsupervised Learning

  • Brief cluster & mixtures review
  • Intro to text analysis (Latent Dirichlet Allocation & Sentiment), illustrated in the sketch after this list
  • Intro to social networks
  • Intro to neural networks for unsupervised learning
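
As a small taste of the Day 3 text-analysis material, the sketch below fits a Latent Dirichlet Allocation model with the topicmodels package on a news corpus bundled with that package. The package, dataset, and number of topics are illustrative assumptions, not seminar code.

    library(topicmodels)

    # Document-term matrix of Associated Press articles bundled with the package
    data("AssociatedPress", package = "topicmodels")

    # Fit a 5-topic LDA model to the first 50 documents (kept small for speed)
    lda_fit <- LDA(AssociatedPress[1:50, ], k = 5, control = list(seed = 3))

    terms(lda_fit, 5)   # the five most probable terms in each topic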