Modern researchers increasingly find themselves facing a new paradigm where data are no longer scarce and expensive, but rather abundant and cheap. Both numbers of cases/instances and numbers of variables/features are exploding. This new reality raises important issues in effective data analysis.
Of course, the basic statistical objective, the discovery and quantitative description of simple structure, remains unchanged. But new possibilities for applying highly flexible methods (not practical in “small data” contexts) must be reconciled with the inherent sparsity of essentially any data set comprising a large number of features, and with the corresponding danger of overfitting and unwarranted generalization from the data in hand. Modern statistical machine learning methods rationally and effectively address these new realities.
This course first describes and explains the new context, formulates the issues that it raises, and points to cross-validation as a fundamental tool for matching method flexibility/complexity to data set information content in predictive problems. Then a variety of modern squared error loss prediction methods (modern regression methods) will be discussed, related to optimal prediction, and illustrated using standard R packages. These will include:
- smoothing methods
- shrinkage for linear prediction (ridge, lasso, and elastic net predictors)
- regression trees
- random forests
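As a rough illustration of how cross-validation can match method flexibility to the information in a data set, here is a minimal sketch. It is not course material: the course itself works in R, while this sketch uses plain Python, and the synthetic data, the function names (`knn_predict`, `cv_squared_error`, `choose_k`), and the choice of a one-dimensional k-nearest-neighbor predictor are all assumptions made for illustration. Small k gives a very flexible predictor, large k a very rigid one, and the cross-validated squared error is used to pick between them.

```python
import math
import random


def knn_predict(train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_squared_error(data, k, n_folds=5):
    """Estimate mean squared prediction error for k-NN by n_folds-fold cross-validation."""
    total = 0.0
    for fold in range(n_folds):
        # Cases whose index is congruent to `fold` are held out; the rest train.
        held_out = data[fold::n_folds]
        train = [pair for i, pair in enumerate(data) if i % n_folds != fold]
        total += sum((y - knn_predict(train, x, k)) ** 2 for x, y in held_out)
    return total / len(data)


def choose_k(data, candidates, n_folds=5):
    """Return the candidate k with the smallest cross-validated error, plus all errors."""
    errors = {k: cv_squared_error(data, k, n_folds) for k in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# Synthetic data (an assumption for illustration): y = sin(2*pi*x) + noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.random()
    data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)))

best_k, errors = choose_k(data, candidates=[1, 3, 5, 10, 25, 50])
for k, err in errors.items():
    print(f"k={k:2d}  CV mean squared error={err:.3f}")
print("selected k:", best_k)
```

The same pattern (hold out part of the data, fit on the rest, score on the held-out part, repeat) applies to choosing a ridge or lasso penalty or a tree size.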
Next a variety of modern classification methods will be introduced, related to optimal classification, and illustrated using standard R packages. These will include:
- linear methods for classification (linear discriminant analysis, logistic regression, support vector classifiers)
- kernel extensions of support vector classifiers
- classification trees
- AdaBoost, and
- other ensemble classifiers
Finally, we’ll discuss some methods of modern “unsupervised” statistical machine learning, where the object is not prediction of a particular response variable but rather discovery of relations among features or natural groupings of either cases or features. These will include principal components and clustering methods.
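To make "natural groupings of cases" concrete, here is a minimal sketch of one common clustering method, k-means (Lloyd's algorithm). It is an illustration only, not course material: the course works in R, while this sketch uses plain Python, and the synthetic two-group data and the function names (`squared_distance`, `k_means`) are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate between assigning cases and updating centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k cases chosen at random
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each case joins its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned cases.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return labels, centers


# Synthetic data (an assumption for illustration): two well-separated groups.
rng = random.Random(1)
group_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
group_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
labels, centers = k_means(group_a + group_b, k=2)
print("cluster sizes:", labels.count(0), labels.count(1))
print("centers:", centers)
```

No response variable appears anywhere in the sketch; the grouping is driven entirely by distances among the feature vectors.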
The course will consist of both lectures and hands-on R sessions.
Who should attend?
This course is designed for those who wish to make predictions of quantitative or qualitative responses from multiple inputs in modern data-rich contexts, or who need to search for internal patterns in large multivariate datasets. A sound understanding of ordinary multiple linear regression will be assumed, and an exposure to logistic regression will be helpful, as will some familiarity with matrices and matrix operations. Familiarity with basic probability concepts, including joint, marginal, and conditional distributions, expected value and variance, and normal and binomial distributions, will also be assumed.
Students will be expected to bring their own laptops with R and RStudio downloaded and ready for use in the classroom. A list of packages to install before the course will be provided. A rudimentary knowledge of the use of R (such as can be obtained working through any number of online introductions to the system) will be assumed.
Location, format, and materials
The class will meet at Temple University Center City, 1515 Market Street, Philadelphia, PA 19103, from 9 am to 5 pm each day, with a 1-hour lunch break.
Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer printout, and many other useful features. This book frees participants from the distracting task of note taking.
Registration and lodging
The fee of $995.00 includes all seminar materials.
Lodging Reservation Instructions
A block of guest rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut Street, Philadelphia, PA, at a special rate of $152 on May 12 and $177 on May 13. The hotel is about a 5-minute walk from the seminar location. To make a reservation, call 203-905-2100 during business hours and use group code STA512. The room block will expire when it is full or on April 12, 2016, whichever comes first.
Course outline
1. Generalities about Modern Statistical Learning
2. Generalities about Squared Error Loss (SEL) Prediction (Modern Regression)
- Form of the Best SEL Predictor/Near Neighbors
- “Variance/Bias” Tradeoff and Model Bias
- Overfitting and Holdout Strategies/Cross-Validation
3. SEL Prediction Methods
- Smoothing: Low-D and Additive Models
- Linear Prediction Methods
- Bootstrapping and Random Forests
- Other Ensemble Methods and Ideas
4. Generalities about Classification
- Form of the Best Classifier/Near Neighbors
5. Classification Methods
- Linear Classification Rules
- Linear Discriminant Analysis
- Support Vector Classifiers
- Extensions of Linear Methods
- Classification Trees and Random Forests
6. Introduction to Unsupervised Methods