Overview of machine/statistical learning

  • Machine Learning: offshoot of artificial intelligence
  • Statistical Learning: offshoot of statistics


James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R, www.StatLearning.com, Springer-Verlag, New York

Supervised or Unsupervised?

  • Supervised learning: models a response
  • Unsupervised learning: searches for patterns



"All models are wrong, but some are useful." - George Box, 1976 & 1978

Supervised Learning

Trains models of the general form:

\(y = f(x) + \varepsilon\)

  • \(y\) is a specific value of the response or outcome variable, \(Y\)
  • \(x\) is a specific value of the explanatory or predictor variable, \(X\)
  • \(\varepsilon\) describes the residual (leftover) random error inherent to the system
  • \(f(x)\) describes how \(Y\) and \(X\) are related


(We used to call \(Y\) and \(X\) dependent and independent variables, but statisticians no longer recommend this nomenclature.)

Inference vs Prediction

  • Inference: Is there an effect of \(X\) on \(Y\)?
  • Prediction: How can we best predict \(Y\) given \(X\)?


Machine/statistical learning emphasizes the latter.

Regression vs classification methods

  • Regression methods model continuous, numeric responses
  • Classification methods model categorical responses (or probabilities)

Regression: Linear Models

\(f(x)\) is a linear equation, and \(\varepsilon\) follows a normal distribution:

\[y = \beta_{0} + \beta_{1} x + \varepsilon\]
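As a minimal sketch of how the two coefficients can be estimated, the snippet below uses the standard closed-form least-squares formulas. The function name `fit_simple_linear` and the data are made up for illustration and do not come from ISLR.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: covariation of x and y divided by the variation in x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: forces the fitted line through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data lying close to y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
b0, b1 = fit_simple_linear(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The estimates land near the intercept and slope used to invent the data; the gap between them is the effect of \(\varepsilon\).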

Polynomial regression

Non-linear relationships can be modeled by including higher-order terms of \(x\).

\[y = \beta_{0} + \beta_{1} x + \beta_{2} x^2 + \dots + \beta_{z} x^z + \varepsilon\]
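The model is still linear in the \(\beta\) coefficients, so ordinary least squares applies once the powers of \(x\) are treated as separate columns. The sketch below solves the normal equations with a small hand-written Gaussian elimination. The helper names `solve` and `fit_polynomial` are hypothetical, and this approach is only reasonable for low degrees and small, well-conditioned data.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Return [b0, b1, ..., b_degree] minimising the squared error."""
    # Design matrix: one column per power of x, starting with x**0 = 1.
    design = [[x ** j for j in range(degree + 1)] for x in xs]
    size = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(size)]
    return solve(xtx, xty)


# Made-up data generated exactly from y = 1 + 2x + 3x^2 (no error term).
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coefficients])
```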

Polynomial regression

ISLR Fig 3.8

Multiple regression

Finally, the linear model can allow any number of predictor variables (features).

\[y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{k} x_{k} + \varepsilon\]

Multiple regression

ISLR Fig 3.8

Classification: Generalized Linear Models

Allow "link" functions, \(f(y)\), and non-normal error.

For example, a logistic regression model would be: \[\ln \left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1} x \text{ , where } p = \Pr(y = 1 \mid x) \text{ and } y \text{ is Binomial}\]
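The left-hand side of the logistic model is the log-odds of the outcome. A short sketch of moving between the log-odds scale and the probability scale follows; the coefficient values are hypothetical and chosen only to make the output easy to read.

```python
import math


def logit(p):
    """Link function: probability -> log-odds."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: log-odds -> probability, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(x, beta0, beta1):
    """Predicted probability for one value of the predictor."""
    return inverse_logit(beta0 + beta1 * x)


# Hypothetical coefficients, not estimated from any data set.
beta0, beta1 = -3.0, 1.5
for x in (0.0, 2.0, 4.0):
    print(x, round(predict_probability(x, beta0, beta1), 3))
```

The linear predictor can take any real value, but the predicted probability stays between 0 and 1, which is why the link is useful for categorical responses.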

GLM: Logistic Regression

Cross-Validation

  1. Fit models to training data (many different \(f(x)\))
  2. Evaluate models via testing data
  3. Pick \(f(x)\) that minimizes Test Error
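The three steps above can be sketched with k-fold cross-validation, where each fold takes a turn as the testing data. The two candidate models (a constant mean and a straight line), the function names, and the data are assumptions made for illustration.

```python
def fit_mean(train):
    """Candidate 1: ignore x and always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Candidate 2: least-squares straight line."""
    n = len(train)
    x_bar = sum(x for x, _ in train) / n
    y_bar = sum(y for _, y in train) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    s_xx = sum((x - x_bar) ** 2 for x, _ in train)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return lambda x: b0 + b1 * x


def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)


def k_fold_mse(fit, data, k):
    """Average test MSE over k folds."""
    fold_errors = []
    for i in range(k):
        test = data[i::k]  # every k-th observation forms the test fold
        train = [d for j, d in enumerate(data) if j % k != i]
        model = fit(train)                     # step 1: fit on training data
        fold_errors.append(mse(model, test))   # step 2: evaluate on testing data
    return sum(fold_errors) / k


# Made-up data: y = 1 + 2x plus a small alternating disturbance.
data = [(x, 1.0 + 2.0 * x + (0.3 if x % 2 else -0.3)) for x in range(12)]
candidates = {"mean only": fit_mean, "straight line": fit_line}
cv_errors = {name: k_fold_mse(fit, data, k=4) for name, fit in candidates.items()}
best = min(cv_errors, key=cv_errors.get)  # step 3: lowest test error wins
print(cv_errors, best)
```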

Cross-Validation: Simulation

ISLR Fig. 2.9

Feature Selection

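  • Best Subset Selection: fits every combination of features, so it finds the best subset, but the number of models grows as \(2^p\)
  • Stepwise Selection: adds or drops one feature at a time; more practical, but not guaranteed to find the best subset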
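To make the practicality gap concrete, the sketch below enumerates every subset of a small feature list and compares the number of models each strategy would fit. The feature names are placeholders, and the stepwise count assumes forward selection starting from the empty model.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every subset of the features, from the empty model to the full model."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def forward_stepwise_model_count(p):
    """Models fitted by forward stepwise selection: the empty model,
    then p candidates, then p - 1, and so on down to 1."""
    return 1 + p * (p + 1) // 2


features = ["x1", "x2", "x3"]  # placeholder feature names
subsets = list(all_subsets(features))
print(len(subsets), "models for best subset with 3 features")

for p in (3, 10, 20):
    print(p, 2 ** p, forward_stepwise_model_count(p))
```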

Shrinkage

  • Imposes a penalty, weighted by a tuning parameter \(\lambda\), on the size of the \(\beta\) estimates
  • "Ridge" regression shrinks extraneous variables' \(\beta\) towards 0, but never exactly to 0
  • "Lasso" regression can shrink extraneous variables' \(\beta\) exactly to 0
    • effectively removing them
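The difference between the two penalties is easiest to see in a simplified special case: one coefficient per observation, an identity design matrix, and no intercept. Under those assumptions (which rarely hold in practice) both estimators have closed forms in terms of the unpenalized estimate. The function names are hypothetical.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge in the special case: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso in the special case: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if beta_ols > threshold:
        return beta_ols - threshold
    if beta_ols < -threshold:
        return beta_ols + threshold
    return 0.0  # small coefficients are set exactly to zero


# Made-up unpenalized estimates: two large effects and one small one.
estimates = [2.0, -2.0, 0.3]
lam = 1.0
print([ridge_shrink(b, lam) for b in estimates])
print([lasso_shrink(b, lam) for b in estimates])
```

Ridge scales every coefficient by the same factor, while the lasso subtracts a fixed amount and zeroes out anything smaller than the threshold.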

Ridge Regression: Credit Balance

ISLR Fig 6.4

Lasso Regression: Credit Balance

ISLR Fig 6.6

Shrinkage: Pick \(\lambda\) with lowest MSE

ISLR Fig 6.5

Dimension Reduction

Example: Draw a 3D coffee mug on a 2D piece of paper

  • Rotates the cloud of points to account for the most variability
    • Principal Components
  • Keeps all features, but reduces the dimensions
  • PC Regression: models \(Y\) as a function of the PCs

Dimension Reduction: Sales

ISLR Fig 6.15

Dimension Reduction: Simulations

ISLR Fig 6.18

Non-linear Methods overview

  • Polynomial Regression
  • Regression Splines: piecewise low-degree polynomials
  • Local Regression: weighted linear fits to the points near each target value
  • Generalized Additive Models: different \(f(x)\) for each feature

Regression Splines: Wages

ISLR Fig 7.3

Local Regression: Simulation

ISLR Fig 7.9

GAM: Wages

ISLR Fig 7.11

Tree-Based Methods

  • Splits the predictor space into regions and predicts a single value within each region
  • Typically not as accurate as previous methods
  • …unless multiple trees are combined (bagging, random forests, boosting)
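A single split of one predictor, sometimes called a decision stump, shows the basic building block. The sketch below tries every midpoint between neighbouring values and keeps the one with the smallest residual sum of squares. The data are invented, and a full tree would apply the same search recursively within each region.

```python
def mean(values):
    return sum(values) / len(values)


def rss(values):
    """Residual sum of squares around the mean of the region."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on x that minimises the total RSS of the two regions."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        lo_x, hi_x = pairs[i - 1][0], pairs[i][0]
        if lo_x == hi_x:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = rss(left) + rss(right)
        if best is None or total < best["rss"]:
            best = {
                "threshold": (lo_x + hi_x) / 2,
                "left_mean": mean(left),    # prediction when x < threshold
                "right_mean": mean(right),  # prediction when x >= threshold
                "rss": total,
            }
    return best


# Made-up data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 5, 20, 21, 19]
split = best_split(xs, ys)
print(split)
```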

Tree-Based Baseball Salaries

ISLR Fig 8.2

Tree-Based Baseball Salaries

ISLR Fig 8.1

Support Vector Machines

  • Classification method
  • Splits the \(p\)-dimensional feature space with a \((p-1)\)-dimensional hyperplane that best separates the classes.

SVM: \(p=2\)

ISLR Fig 9.2

SVM: Maximal Margin Classifier

  • Maximizes the gap between the two classes
  • Requires perfect separation
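For a given separating hyperplane, the class of a point is the sign of a linear function, and the margin is the distance from the hyperplane to the closest point. The sketch below only evaluates a hyperplane that is supplied by hand; the weights, offset, and points are hypothetical, and finding the hyperplane that maximizes the margin is an optimization problem not shown here.

```python
import math


def decision_value(w, b, x):
    """Signed value of w . x + b; zero exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w, b, points):
    """Smallest perpendicular distance from any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 in p = 2 dimensions (a line).
w, b = (3.0, 4.0), -5.0
points = [(3.0, 4.0), (2.0, 1.0), (0.0, 0.0), (-1.0, -1.0)]
classes = [classify(w, b, x) for x in points]
gap = margin(w, b, points)
print(classes, gap)
```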

SVM: Maximal Margin Classifier

ISLR Fig 9.3

SVM: Support Vector Classifier

  • Relaxes perfect separation requirement
  • Tolerates some points inside the margin or on the wrong side of the boundary

SVM: Support Vector Classifier

ISLR Fig 9.7

Unsupervised learning

Search for patterns in data

  • Dimension reduction
  • Clustering

Principal Components Analysis

  • Rotates the \(p\)-dimensional cloud of points to find axes that account for the most variability.
  • The new axes are called Principal Components
  • Imagine "drawing" a 3D object in 2D
  • A small number of dimensions usually explains a large share of the variability
    • "dimension reduction"
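For two features the rotation can be written in closed form from the 2 x 2 covariance matrix, which makes a compact illustration. The sketch below returns the direction of the first principal component, the share of variability it explains, and the one-dimensional scores. The function name `pca_2d` and the data are assumptions, and the code expects at least two distinct points.

```python
import math


def pca_2d(points):
    """First principal component of 2-D data via the covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Eigenvalues of [[s_xx, s_xy], [s_xy, s_yy]] are the variances along the new axes.
    centre = (s_xx + s_yy) / 2.0
    spread = math.sqrt(((s_xx - s_yy) / 2.0) ** 2 + s_xy ** 2)
    var_pc1, var_pc2 = centre + spread, centre - spread

    # Angle of the axis with the largest variance.
    angle = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    direction = (math.cos(angle), math.sin(angle))

    # Scores: each centred point projected onto the first component.
    scores = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
              for x, y in points]
    return direction, var_pc1 / (var_pc1 + var_pc2), scores


# Made-up points scattered around the line y = x.
points = [(0.0, 0.2), (1.0, 0.9), (2.0, 2.1), (3.0, 2.8), (4.0, 4.1)]
direction, share, scores = pca_2d(points)
print(direction, round(share, 4))
```

Keeping only the scores replaces two correlated features with one new axis, which is the "drawing a 3D object in 2D" idea in one fewer dimension.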

Principal Components Analysis: Arrests

ISLR Fig 10.1

K-means clustering

Partitions data into \(K\) clusters

  1. Data are randomly assigned to 1 of \(K\) clusters
  2. Cluster centroids are computed.
  3. Data are reassigned to nearest centroid.
  4. Repeat 2 and 3 until centroids do not change.
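The four steps above can be sketched as follows. The function name `k_means`, the seed, and the data are made up for illustration. Re-seeding an empty cluster at a random observation is an added assumption that the steps do not mention, and the result of K-means can depend on the random starting assignment.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of a list of points (tuples)."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = None
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                # Assumption: an empty cluster restarts at a random observation.
                members = [rng.choice(points)]
            new_centroids.append(centroid(members))
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each observation to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
    return labels, centroids


# Made-up one-dimensional data with two well-separated groups;
# points are tuples, so higher dimensions work the same way.
points = [(1.0,), (1.2,), (0.8,), (9.0,), (9.2,), (8.8,)]
labels, centroids = k_means(points, k=2, seed=1)
print(labels, centroids)
```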

K-means clustering: Simulation

ISLR Fig 10.6