HSS 611: Programming for HSS
Dec 9, 2025
What is machine learning?
Fundamental concepts
scikit-learn
Machine learning
Enables a machine to learn patterns or rules from data, rather than being explicitly programmed for every task
Algorithms and models
Algorithms: computational procedures that define how learning will occur (e.g., how to adjust parameters)
Models: the outcome of applying an algorithm to data — a learned representation characterized by parameters estimated from labeled or unlabeled data
Two fundamental approaches in machine learning
| | Supervised | Unsupervised |
|---|---|---|
| Objective | Trained on labeled data to learn a mapping from input to output | Find patterns or structure in data without labels |
| Outcome | Pre-defined categories | No pre-defined categories |
| Common tasks | Regression, classification | Clustering, dimensionality reduction |
| Model evaluation | Explicit metrics such as accuracy, precision, recall, or MSE | Often involves qualitative assessment |
Python is a very popular language for ML
Traditional machine learning: scikit-learn
Deep learning: PyTorch, TensorFlow & Keras
We will focus on supervised learning
Involves a training and a test set
Train a model using the training set
Test the performance of the model on the test set
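A minimal sketch of this workflow (synthetic data via make_regression, used here purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: 200 samples, 5 features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)          # train on the training set
print(model.score(X_test, y_test))   # evaluate (R^2) on the test set
```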
Parameters and hyperparameters
Parameter
Learned (estimated) from data (internal to the model)
E.g., regression weights/coefficients
Hyperparameters
Control the learning process (hence “hyper”)
Model structure (e.g., number of layers in a NN), optimization settings (e.g., learning rate in a NN), etc.
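To make the distinction concrete, a minimal scikit-learn sketch (the data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny illustrative dataset (assumed values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Hyperparameter: alpha (regularization strength) is set by us before learning
model = Ridge(alpha=1.0)

# Parameters: the coefficient(s) and intercept are estimated from the data by fit()
model.fit(X, y)
print(model.coef_, model.intercept_)
```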
Goal of supervised ML
We train a model on a data set where inputs and correct outputs are known
The model learns a pattern (mapping) from input to output
Then we give it new inputs without the answer and ask it to predict the output
Model performs well not only on training data but also on unseen (test) data
Two types of prediction depending on the task
Regression: continuous numbers (e.g., predicting house prices)
Classification: categories/labels (e.g., predicting if an email is spam)
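A compact sketch contrasting the two tasks (toy data assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])

# Regression target: continuous numbers
y_reg = np.array([1.5, 3.1, 4.4, 6.2, 7.4, 9.1])
print(LinearRegression().fit(X, y_reg).predict([[7]]))  # predicts a number

# Classification target: categories/labels
y_clf = np.array([0, 0, 0, 1, 1, 1])
print(LogisticRegression().fit(X, y_clf).predict([[7]]))  # predicts a label
```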
Bias
Systematic error from assumptions that are too simple; high bias leads to underfitting
Underfit
If a model fails to learn the underlying patterns in the data, it can lead to underfitting
This occurs when the model is too simple to capture the true structure or relationships in the data
The model will perform poorly on both the training set and the test set, showing high error everywhere because it has not learned enough signal
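One way to see underfitting in code (a sketch with synthetic quadratic data, not from the original slides): a straight line fit to a curved pattern scores poorly even on the data it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Quadratic data: a straight line is too simple to capture the pattern
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = (X ** 2).ravel()

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 near 0: high error even on the training data
```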
Variance
The degree to which the learned model changes with the particular training data it sees
High variance means low generalizability
Overfit
If a model learns the training data “too well”, it can lead to overfitting
This happens when the model mistakes noise for signal in the training data
The model will perform well on the training set but will not generalize to unseen data (i.e., the test set)
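Overfitting is easy to reproduce with an unconstrained decision tree on noisy synthetic data (a sketch, not from the original slides):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training data, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train))  # ~1.0 on the training set
print(tree.score(X_test, y_test))    # much lower on the test set
```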
Bias-variance trade-off
Validation
A validation set is used to fine-tune hyperparameters (e.g., choosing learning rate, number of tree splits, regularization strength)
It helps prevent overfitting to the training data, because performance is monitored on data the model was not trained on
Once hyperparameters are finalized, the validation set should not be used to report final performance
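A common pattern is two successive splits (a sketch; the data and split proportions are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# First set aside the final test set, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once, at the end
```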
Train, Test, Validation sets (Wikimedia Commons)
Cross-validation (scikit-learn)
A linear regression model (Wikimedia Commons)
A dataset to predict house prices in Ames, Iowa
Available through Kaggle
We’ll use it to apply some machine learning using scikit-learn
'SalePrice' column in the test set is withheld by Kaggle
The data set has a lot of features
Let’s use some of them to build a predictive model
We can select them using a list
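For example (a sketch: the file path is assumed, and the column names follow the Kaggle House Prices data):

```python
import pandas as pd

# Kaggle's labeled training file (path assumed)
ames = pd.read_csv("train.csv")

# Select a handful of features using a list
features = ["GrLivArea", "OverallQual", "YearBuilt", "TotalBsmtSF"]
X = ames[features]
y = ames["SalePrice"]
```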
In a machine learning context, coefficients and standard errors are secondary
Predictive performance is more important
scikit-learn does not produce standard errors, p-values, confidence intervals, etc.
See Shmueli (2010) for the differences between prediction and explanation
We can split the labeled data set into a training set and a “test” set
Estimate what the performance of the model is going to be
Adjust the model based on that (e.g., add parameters, regularization, etc.), as sketched below
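Continuing the Ames sketch above (X and y as just defined), a minimal split-fit-evaluate loop might look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y as built in the previous sketch
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(mean_squared_error(y_test, preds) ** 0.5)  # RMSE on the held-out split
```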
K-fold cross-validation is usually a better method to estimate performance
Not too sensitive to the randomness of a single split
Split the data into k folds (usually 5 or 10)
Create a LinearRegression() object and pass it to cross_val_score
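A sketch of that call (X and y as in the earlier sketches; cv=5 is an assumption):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: one R^2 score per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)
print(scores.mean())  # average across folds: a more stable performance estimate
```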
Cross-validation for hyperparameter tuning
Ridge regression
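A minimal sketch of ridge regression with its alpha hyperparameter tuned by cross-validation (GridSearchCV is one option; the alpha grid is an assumption, and X_train/X_test are as in the earlier split):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength alpha via 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best alpha found by CV
print(search.score(X_test, y_test))  # refit model evaluated on the test set
```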