An Overview of Machine Learning

Overview of machine/statistical learning

Machine Learning: offshoot of artificial intelligence
Statistical Learning: offshoot of statistics

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R, www.StatLearning.com, Springer-Verlag, New York

Supervised or Unsupervised?

Supervised learning: models a response
Unsupervised learning: searches for patterns

"All models are wrong, but some are useful." - George Box, 1976 & 1978

Supervised Learning

Trains models along some form of:

\(y = f(x) + \varepsilon\)

\(y\) is a specific value of the response or outcome variable, \(Y\)
\(x\) is a specific value of the explanatory or predictor variable, \(X\)
\(\varepsilon\) describes the residual random error inherent to the system
\(f(x)\) describes how \(Y\) and \(X\) are related

(We used to call \(Y\) and \(X\) dependent and independent variables, but statisticians no longer recommend this nomenclature.)

Inference vs Prediction

Inference: Is there an effect of \(X\) on \(Y\)?
Prediction: How can we best predict \(Y\) given \(X\)?

Machine/statistical learning emphasizes the latter.

Regression vs classification methods

Regression methods model continuous, numeric responses
Classification methods model categorical responses (or probabilities)

Regression: Linear Models

\(f(x)\) is a linear equation, and \(\varepsilon\) follows a normal distribution:

\[y = \beta_{0} + \beta_{1} x + \varepsilon\]
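As a minimal sketch of how \(\beta_{0}\) and \(\beta_{1}\) might be estimated, the code below uses the standard least-squares formulas for a single predictor. The function name `fit_simple_linear` and the data are made up for illustration.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of beta_0 and beta_1 for y = beta_0 + beta_1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

# Made-up data lying close to y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
beta_0, beta_1 = fit_simple_linear(xs, ys)

# The residuals are the estimated values of epsilon for each observation
residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on any fit.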

Regression: Linear Models

Polynomial regression

Non-linear relationships can be modeled by including higher order terms.

\[y = \beta_{0} + \beta_{1} x + \beta_{2} x^2 + \dots + \beta_{z} x^z + \varepsilon\]

Polynomial regression

ISLR Fig 3.8

Multiple regression

Finally, the linear model can allow any number of predictor variables (features).

\[y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{k} x_{k} + \varepsilon\]
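One way to estimate the \(\beta\)s with several predictors is to solve the normal equations \(X^{T}X\beta = X^{T}y\). The sketch below does this in plain Python; the helper names `solve` and `fit_linear_model` and the noise-free data are assumptions made for illustration, and a real analysis would normally rely on a statistics library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_model(rows, ys):
    """Least-squares coefficients [beta_0, beta_1, ..., beta_k] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Made-up, noise-free data generated from y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
betas = fit_linear_model(rows, ys)

# Polynomial regression is the same fit with x and x^2 as the two columns
xs = [-2, -1, 0, 1, 2]
poly_betas = fit_linear_model([(x, x ** 2) for x in xs],
                              [3 - x + 0.5 * x ** 2 for x in xs])
```

Because the data contain no noise, the fit should recover the generating coefficients up to rounding error.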

Multiple regression

ISLR Fig 3.8

Classification: Generalized Linear Models

Allow "link" functions, \(f(y)\), and non-normal error.

For example, a logistic regression model would be: \[\ln \left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1} x \text{ , where } p = P(y = 1) \text{ and } y \text{ is Binomial}\]
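To make the logit link concrete, the sketch below fits \(\beta_{0}\) and \(\beta_{1}\) by plain gradient ascent on the Binomial log-likelihood. The function names, learning rate, step count, and data are all assumptions chosen for illustration; statistical software typically uses a faster algorithm (iteratively reweighted least squares).

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Estimate beta_0 and beta_1 by gradient ascent on the Binomial log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between each observed outcome and its current fitted probability
        errs = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += rate * sum(errs) / n
        b1 += rate * sum(e * x for e, x in zip(errs, xs)) / n
    return b0, b1

# Made-up data: the outcome (0 or 1) becomes more likely as x grows
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

# Fitted probability that y = 1 when x = 1
p_at_1 = sigmoid(b0 + b1 * 1.0)
```

Taking \(\ln(p/(1-p))\) of a fitted probability returns the linear predictor \(\beta_{0} + \beta_{1} x\), which is what the link function expresses.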

GLM: Logistic Regression

Cross-Validation

Fit models to training data (many different \(f(x)\))
Evaluate models via testing data
Pick \(f(x)\) that minimizes Test Error
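The three steps above can be sketched as a \(k\)-fold cross-validation that compares two candidate forms of \(f(x)\). The candidates (a constant and a straight line), the fold scheme, and the data are assumptions made for illustration.

```python
def fit_mean(xs, ys):
    """Candidate f(x) that ignores x and always predicts the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Candidate f(x) that is a least-squares straight line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def cv_mse(fit, xs, ys, k=5):
    """Average test MSE over k folds; fold i holds out every k-th point starting at i."""
    fold_errors = []
    for i in range(k):
        test_idx = set(range(i, len(xs), k))
        data = list(enumerate(zip(xs, ys)))
        train = [pt for j, pt in data if j not in test_idx]
        test = [pt for j, pt in data if j in test_idx]
        f = fit([x for x, _ in train], [y for _, y in train])  # fit on training data only
        fold_errors.append(sum((y - f(x)) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k

# Made-up data with a roughly linear trend
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

scores = {"mean only": cv_mse(fit_mean, xs, ys),
          "straight line": cv_mse(fit_line, xs, ys)}
best = min(scores, key=scores.get)  # candidate with the lowest test error
```

The key design point is that each model is scored only on points it never saw during fitting.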

Cross-Validation: Simulation

ISLR Fig. 2.9

Feature Selection

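Best Subset Selection: fits every combination of features (\(2^p\) models), so it is guaranteed to find the best-fitting subset, but is slow for large \(p\)
Stepwise Selection: adds or drops one feature at a time; more practical, but not guaranteed to find the best subset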
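A rough way to see why stepwise selection is more practical is to count the models each approach has to fit. The sketch below enumerates every subset for best subset selection and uses the count \(1 + p(p+1)/2\) for forward stepwise selection; the feature names are hypothetical.

```python
from itertools import combinations

def best_subset_candidates(features):
    """Every subset of the features, including the empty (intercept-only) model."""
    return [subset
            for k in range(len(features) + 1)
            for subset in combinations(features, k)]

def forward_stepwise_count(p):
    """Models fit by forward stepwise: the null model, then p, p-1, ..., 1 candidates."""
    return 1 + p * (p + 1) // 2

features = ["x1", "x2", "x3", "x4"]  # hypothetical feature names
n_best_subset = len(best_subset_candidates(features))  # 2**4 models
n_stepwise = forward_stepwise_count(len(features))     # 1 + 4*5/2 models
```

The gap widens quickly: with 20 features, best subset needs 1,048,576 fits while forward stepwise needs 211.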

Shrinkage

Imposes a penalty, scaled by the tuning parameter \(\lambda\), on the size of the \(\beta\) estimates
"Ridge" regression shrinks extraneous variables' \(\beta\) towards 0
"Lasso" regression shrinks extraneous variables' \(\beta\) to 0
- effectively removing them
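The difference between the two penalties is easiest to see in a simplified special case: one observation per coefficient, an identity design matrix, and no intercept. Under those assumptions ridge divides each least-squares estimate by \(1 + \lambda\), while the lasso moves it toward zero by \(\lambda/2\) and sets it to exactly zero if it is smaller than that. The function names and numbers below are illustrative only.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge estimate in the identity-design special case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso estimate in the identity-design special case: soft-thresholding at lam / 2."""
    if beta_ols > lam / 2:
        return beta_ols - lam / 2
    if beta_ols < -lam / 2:
        return beta_ols + lam / 2
    return 0.0

ols = [3.0, -1.5, 0.2]  # made-up least-squares estimates
lam = 1.0               # made-up penalty strength

ridge = [ridge_shrink(b, lam) for b in ols]  # every coefficient is halved
lasso = [lasso_shrink(b, lam) for b in ols]  # the smallest coefficient becomes 0
```

In this example the ridge estimate of the weakest coefficient is small but non-zero, while the lasso removes it from the model.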

Ridge Regression: Credit Balance

ISLR Fig 6.4

Lasso Regression: Credit Balance

ISLR Fig 6.6

Shrinkage: Pick \(\lambda\) with lowest MSE

ISLR Fig 6.5

Dimension Reduction

Example: Draw a 3D coffee mug on a 2D piece of paper

Rotates the cloud of points to account for the most variability
- Principal Components
Keeps all features, but reduces the dimensions
PC Regression: models \(Y\) as a function of the PCs

Dimension Reduction: Sales

ISLR Fig 6.15

Dimension Reduction: Simulations

ISLR Fig 6.18

Non-linear Methods overview

Polynomial Regression
Regression Splines: piecewise low-degree polynomials
Local Regression: linear models fit separately at each target point, weighting nearby observations most heavily
Generalized Additive Models: different \(f(x)\) for each feature
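As one illustration from this list, local regression can be sketched as a weighted straight-line fit around a target value `x0`. The `span` argument (fraction of points used), the tricube weighting, and the data are assumptions chosen for this example rather than settings taken from the slides.

```python
def local_linear_fit(x0, xs, ys, span=0.5):
    """Fitted value at x0 from a weighted straight-line fit to the nearest points."""
    k = max(2, int(round(span * len(xs))))
    nearest = sorted(zip(xs, ys), key=lambda pt: abs(pt[0] - x0))[:k]
    width = max(abs(x - x0) for x, _ in nearest) or 1.0
    # Tricube weights: close to 1 near x0, falling to 0 at the edge of the neighborhood
    w = [(1 - (abs(x - x0) / width) ** 3) ** 3 for x, _ in nearest]
    sw = sum(w)
    x_bar = sum(wi * x for wi, (x, _) in zip(w, nearest)) / sw
    y_bar = sum(wi * y for wi, (_, y) in zip(w, nearest)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, (x, _) in zip(w, nearest))
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, (x, y) in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_bar + slope * (x0 - x_bar)

xs = [float(i) for i in range(11)]
line_ys = [2 + 3 * x for x in xs]  # made-up straight-line data
curve_ys = [x ** 2 for x in xs]    # made-up curved data

fit_on_line = local_linear_fit(4.5, xs, line_ys)
fit_on_curve = local_linear_fit(5.0, xs, curve_ys)
```

On straight-line data the local fit reproduces the line; on curved data it follows the curve approximately, with the amount of smoothing controlled by `span`.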

Regression Splines: Wages

ISLR Fig 7.3

Local Regression: Simulation

ISLR Fig 7.9

GAM: Wages

ISLR Fig 7.11

Tree-Based Methods

Split the predictor space into regions
A single tree is usually not as accurate as the previous methods
…unless multiple trees are combined (bagging, random forests, boosting)
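The basic building block of a regression tree is a single split. The sketch below searches one predictor for the cut point that minimizes the total residual sum of squares (RSS) of the two resulting regions, each predicted by its own mean. The function name `best_split` and the data are made up for illustration; a full tree repeats this search within each region.

```python
def best_split(xs, ys):
    """Cut point on one predictor that minimizes the total RSS of the two regions."""
    def rss(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best_cut, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between identical x values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        total = rss([y for _, y in pairs[:i]]) + rss([y for _, y in pairs[i:]])
        if total < best_rss:
            best_cut, best_rss = cut, total
    return best_cut, best_rss

# Made-up data: the response jumps once x passes about 4.5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 5.2, 4.9, 5.1, 9.8, 10.1, 10.0, 9.9]
cut, rss_at_cut = best_split(xs, ys)
```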

Tree-Based Baseball Salaries

ISLR Fig 8.2

Tree-Based Baseball Salaries

ISLR Fig 8.1

Support Vector Machines

Classification method
Splits the \(p\)-dimensional feature space with a \((p-1)\)-dimensional hyperplane that best separates the classes.

SVM: \(p=2\)

ISLR Fig 9.2

SVM: Maximal Margin Classifier

Maximizes the gap between the two classes
Requires perfect separation
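To illustrate what "the gap" means, the sketch below computes the margin of a given hyperplane \(\beta_{0} + \beta \cdot x = 0\): the smallest distance from any training point to the hyperplane, counted as negative if the point is on the wrong side. It only compares two hand-picked candidate hyperplanes; it does not solve the optimization that finds the maximal margin hyperplane. All names and data are assumptions for illustration.

```python
import math

def signed_distance(beta_0, beta, x):
    """Signed perpendicular distance from point x to the hyperplane beta_0 + beta . x = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (beta_0 + sum(b * xi for b, xi in zip(beta, x))) / norm

def margin(beta_0, beta, points, labels):
    """Smallest label * signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(beta_0, beta, x) for x, y in zip(points, labels))

# Made-up, perfectly separable data with p = 2 features and labels -1 / +1
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]
labels = [-1, -1, 1, 1]

# Two hand-picked candidate hyperplanes (lines, since p = 2)
candidates = {"x1 + x2 = 5": (-5.0, (1.0, 1.0)),
              "x1 = 3": (-3.0, (1.0, 0.0))}
margins = {name: margin(b0, b, points, labels) for name, (b0, b) in candidates.items()}
widest = max(margins, key=margins.get)
```

Both candidates separate the classes, but the first leaves a wider gap, so a maximal margin classifier would prefer it over the second.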

SVM: Maximal Margin Classifier

ISLR Fig 9.3

SVM: Support Vector Classifier

Relaxes perfect separation requirement
Allows a limited number of points to fall inside the margin or on the wrong side

SVM: Support Vector Classifier

ISLR Fig 9.7

Unsupervised learning

Search for patterns in data

Dimension reduction
Clustering

Principal Components Analysis

Rotates the \(p\)-dimensional cloud of points to find axes that account for the most variability.
The new axes are called Principal Components
Imagine "drawing" a 3D object in 2D
A small number of dimensions usually explains a large amount of the variability
- "dimension reduction"
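For two features the rotation can be worked out in closed form from the sample covariance matrix, which makes a compact illustration. The sketch below returns the direction of the first principal component and the share of total variance it explains; the function name and data are made up, and real analyses with more than two features would use a linear algebra library.

```python
import math

def first_principal_component(points):
    """Direction of greatest variance for 2D points, and the share of variance it explains."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest = half_trace + root
    # Angle by which the axes are rotated to line up with the first component
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    return direction, largest / (sxx + syy)

# Made-up points scattered closely around the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
direction, share = first_principal_component(points)
```

Here a single new axis, pointing roughly along the diagonal, captures nearly all of the variability in the two original features.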

Principal Components Analysis: Arrests

ISLR Fig 10.1

K-means clustering

Partitions data into \(K\) clusters

1. Data are randomly assigned to 1 of \(K\) clusters.
2. Cluster centroids are computed.
3. Data are reassigned to the nearest centroid.
4. Repeat 2 and 3 until the centroids do not change.
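The four steps above can be sketched in plain Python as follows. The function name `k_means`, the fixed seed, and the data are assumptions for illustration. The initial random assignment here deals shuffled points out in turn so that no cluster starts empty, which is one of several reasonable choices.

```python
import random

def k_means(points, k, seed=0):
    """Cluster points (tuples of numbers) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: random assignment. Shuffling and dealing in turn keeps every cluster non-empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, idx in enumerate(order):
        labels[idx] = position % k

    centroids = [None] * k
    while True:
        # Step 2: compute each cluster's centroid (coordinate-wise mean)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 3: reassign each point to its nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once no assignment changes, so the centroids no longer move
        if new_labels == labels:
            return labels, centroids
        labels = new_labels

# Made-up data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
```

Which group receives label 0 depends on the random start, so results are best compared by group membership rather than by label value.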

K-means clustering: Simulation

ISLR Fig 10.6