## Machine Learning: A Hands-On Introduction

A 2-Day Seminar Taught by Adi Tarca, Ph.D.

Classifying objects into predefined groups is a task that humans perform every day, and machines are getting better at it. Applications range from character and speech recognition to spam filtering, medical diagnosis and stock market prediction. Output prediction, or *supervised machine learning*, refers to the task of using the information embedded in a set of input variables $X_1, X_2, \ldots, X_p$ to make predictions about an outcome variable $Y$ (e.g. binary, multinomial, continuous) by learning from a training dataset consisting of observed input-output pairs (samples). The learning is carried out by a computer algorithm that uses the training dataset to estimate or tune its internal parameters with minimal input from the user.

An example of a machine learning application is fitting a logistic regression model that relates a binary outcome, the voting preference (Republican vs. Democrat) of a set of individuals (samples), to a set of features (e.g. age, gender and income), with the goal of predicting the voting preference of a new set of individuals. Although the methods of machine learning are often similar to those of statistical hypothesis testing, the goals and the pitfalls are quite different.
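The voting-preference example can be sketched in R roughly as follows. This is only an illustration on simulated data: the data frame `voters`, its column names, and the coefficients used to generate the outcome are all assumptions made up for the sketch, not material from the seminar.

```r
# Simulated data: every name and number below is hypothetical.
set.seed(1)
n <- 200
voters <- data.frame(
  age    = round(runif(n, 18, 90)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  income = round(rlnorm(n, meanlog = 10.8, sdlog = 0.5))
)

# Made-up relationship used only to generate a binary outcome.
lp <- -3 + 0.04 * voters$age + 0.5 * (voters$gender == "M") +
  0.00001 * voters$income
voters$party <- factor(
  ifelse(runif(n) < plogis(lp), "Republican", "Democrat"),
  levels = c("Democrat", "Republican")
)

# Fit the logistic regression. With a factor outcome, glm() models the
# probability of the second level ("Republican").
fit <- glm(party ~ age + gender + income,
           family = binomial(logit), data = voters)

# Predict the voting preference of new individuals.
new_voters <- data.frame(
  age    = c(25, 70),
  gender = factor(c("F", "M"), levels = c("F", "M")),
  income = c(40000, 90000)
)
p_rep <- predict(fit, newdata = new_voters, type = "response")
pred_party <- ifelse(p_rep > 0.5, "Republican", "Democrat")
```

Note that `predict()` on a `glm` object returns values on the log-odds scale unless `type = "response"` is given; the 0.5 cutoff used to turn probabilities into class labels is a conventional choice, not a requirement.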

Requiring no previous background in machine learning, this two-day seminar is a hands-on introduction to the field as well as to the R statistical environment that will be used to illustrate the machine learning concepts.

The course is organized as a sequence of lectures and practical sessions on both machine learning and the R statistical language. The machine learning issues we will focus on include:

- When, why and how to use the different types of prediction models, such as logistic regression, quadratic and linear discriminants, neural networks, support vector machines, decision trees and random forests, and k-nearest neighbor.
- How to select the best set of features for a particular type of model.
- How to assess the model performance in an unbiased way and avoid overfitting.
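To illustrate why unbiased performance assessment matters, here is a minimal sketch that compares the accuracy measured on the training data with a 5-fold cross-validated estimate. The data are pure noise generated for this example (an assumption of the sketch), so any classifier should do no better than about 50% on new samples; the helper `accuracy()` is a hypothetical function defined here.

```r
set.seed(42)
n <- 100   # number of samples
p <- 20    # number of features

# Pure-noise data: the features carry no information about y.
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
dat$y <- rbinom(n, 1, 0.5)

# Hypothetical helper: fraction of correct predictions at a 0.5 cutoff.
accuracy <- function(fit, newdata) {
  pred <- as.integer(predict(fit, newdata = newdata, type = "response") > 0.5)
  mean(pred == newdata$y)
}

# Apparent accuracy: the model is evaluated on the data used to fit it.
full_fit  <- glm(y ~ ., family = binomial, data = dat)
train_acc <- accuracy(full_fit, dat)

# 5-fold cross-validation: each fold is held out once for testing.
k    <- 5
fold <- sample(rep(1:k, length.out = n))
cv_acc <- sapply(1:k, function(i) {
  fit <- glm(y ~ ., family = binomial, data = dat[fold != i, ])
  accuracy(fit, dat[fold == i, ])
})
mean_cv_acc <- mean(cv_acc)
```

On data like these, `train_acc` will typically look clearly better than chance while `mean_cv_acc` stays near 0.5; the gap between the two is the optimism that overfitting introduces.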

The emphasis in the course will be on concepts and their practical use rather than on the mathematical theory behind them. At the end of the course the participants will be able to use R to (1) load a data set from a file, (2) prepare the data for analysis, (3) build and tune an appropriate classifier, (4) assess the future performance of the classifier on new data, and (5) apply the classifier on a new dataset to predict the outcome.
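As a rough idea of what these five steps can look like, the sketch below runs them on R's built-in `iris` data, written to a temporary CSV file so that the example is self-contained. The file, the choice of features, the hold-out size and the 0.5 cutoff are all assumptions of this sketch rather than the seminar's own material, and the tuning part of step (3) is omitted because plain logistic regression has no tuning parameter.

```r
# Set-up: create an example file (hypothetical stand-in for your own data).
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# (1) Load a data set from a file.
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# (2) Prepare the data: keep two classes, code a binary outcome,
#     and set aside 30 samples as a hold-out test set.
dat   <- dat[dat$Species != "setosa", ]
dat$Y <- as.integer(dat$Species == "virginica")
set.seed(7)
test_idx <- sample(nrow(dat), size = 30)
train    <- dat[-test_idx, ]
test     <- dat[test_idx, ]

# (3) Build a classifier on the training samples only.
model <- glm(Y ~ Sepal.Length + Sepal.Width,
             family = binomial(logit), data = train)

# (4) Assess performance on the held-out samples.
test_prob <- predict(model, newdata = test, type = "response")
test_acc  <- mean((test_prob > 0.5) == (test$Y == 1))

# (5) Apply the classifier to new data to predict the outcome.
new_flowers <- data.frame(Sepal.Length = c(5.8, 7.1),
                          Sepal.Width  = c(2.7, 3.0))
new_prob <- predict(model, newdata = new_flowers, type = "response")
new_pred <- ifelse(new_prob > 0.5, "virginica", "versicolor")
```

Keeping the test samples out of step (3) is what makes the accuracy in step (4) an honest estimate of how the classifier might perform on new data.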

### COMPUTING

This is a hands-on course. Participants should bring their own laptops with R (version 3.0.0) already installed; R is available for most operating systems at www.r-project.org. Additional R packages that are needed will be installed onsite. To make writing R commands easier, the freely available RStudio software (http://www.rstudio.com/) is recommended. A very limited number of participants may use our RStudio server instead, and will then need only a laptop with the Google Chrome browser.

### Who should attend?

This workshop is intended for researchers, clinicians, educators, statisticians, graduate students, and anyone with an interest in understanding or applying machine learning. The approach is primarily conceptual and focused on applications rather than mathematics, so participants with a limited statistical background should be able to understand most of the material. To benefit fully from the course, participants should be willing to write statements in the R language of roughly this complexity:

mydat = read.csv("mydata.csv")

mymodel = glm(Y ~ X1 + X2, family = binomial(logit), data = mydat)

mypredictions = predict(mymodel, mydat)

### Materials

Participants receive a bound manual containing detailed lecture notes (with equations and graphics) and many other useful features. This book frees participants from the distracting task of note taking.

### Registration and lodging

The fee of $895.00 includes all seminar materials.

**Lodging Reservation Instructions**

A block of guest rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut Street, Philadelphia, PA at a special rate of $137 per night. This location is about a 5-minute walk from the seminar location. To make a reservation, call 203-905-2100 during business hours and identify yourself using **group code STA117**. For guaranteed rate and availability, you must reserve your room no later than **September 19, 2013**.

### SEMINAR OUTLINE

Since the R environment is used to illustrate the machine learning concepts throughout this seminar, the outline below covers both the machine learning and the R components of the course:

**Machine learning**

1. Introduction to machine learning

- Supervised vs. unsupervised learning
- Examples of applications of supervised and unsupervised learning
- Pattern recognition cycle

2. Approaches to supervised learning:

Methods that estimate probability density functions explicitly:

- Parametric methods: Quadratic Discriminant Analysis, Linear Discriminant Analysis, Diagonal Linear Discriminant Analysis
- Non-parametric methods: Histogram method, k-Nearest Neighbor

Methods that estimate decision boundaries directly:

- Linear: Logistic discrimination, support vector machines
- Nonlinear: neural networks (multilayer perceptron)
- Decision trees and random forest

3. Performance assessment:

- Metrics: Accuracy, Area Under the Receiver Operating Characteristic (ROC) curve
- Methods: hold-out, leave-one-out cross-validation, N-fold cross-validation

4. Feature selection:

- Filter
- Wrapper
- Search algorithms: Best Individual N, Forward selection, Backward deletion, Combinatorial

5. Feature extraction: Principal component analysis (PCA)

**Introduction to R**

1. The R environment

2. Simple manipulations; numbers and vectors, factors

3. Arrays and matrices

4. Lists and data frames

5. Grouping, loops and conditional execution

6. Writing your own functions