Machine Learning

HSS 611: Programming for HSS

Taegyoon Kim

Dec 9, 2025

Agenda

  • What is machine learning?

    • Supervised vs. unsupervised learning
  • Fundamental concepts

    • Parameters and hyperparameters
    • Bias, variance, overfit
    • Train, test, (cross)validation
  • scikit-learn

    • Linear regression (house price prediction)

What is machine learning?

Machine learning

  • Enables a machine to learn patterns or rules from data, rather than being explicitly programmed for every task

  • Algorithms and models

    • Algorithms: computational procedures that define how learning will occur (e.g., how to adjust parameters)

    • Models: the outcome of applying an algorithm to data — a learned representation characterized by parameters estimated from labeled or unlabeled data

What is machine learning?

Two fundamental approaches in machine learning

| | Supervised | Unsupervised |
|---|---|---|
| Objective | Train on labeled data to learn a mapping from input to output | Find patterns or structures within data without labels |
| Outcome | Pre-defined categories or values | Not quite pre-defined |
| Common tasks | Regression, classification | Clustering, dimensionality reduction |
| Model evaluation | Explicit metrics such as accuracy, precision, recall, or MSE | Can involve qualitative assessment |

What is machine learning?

Python is a really popular language in ML

What is machine learning?

We will focus on supervised learning

  • Involves a training and a test set

  • Train a model using the training set

  • Test the performance of the model on the test set

Fundamental concepts

Parameters and hyperparameters

  • Parameters

    • Learned (estimated) from data (internal to the model)

    • E.g., regression weights/coefficients

  • Hyperparameters

    • Control the learning process and are set before training (thus “hyper”)

    • Model structures (e.g., number of layers in NN), optimization approaches (e.g., learning rate in NN), etc.
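
A minimal sketch of the distinction, assuming scikit-learn's Ridge regression as the example model and made-up synthetic data: `alpha` is a hyperparameter we choose before training, while `coef_` and `intercept_` are parameters estimated by `fit`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hyperparameter: set by us before training
model = Ridge(alpha=1.0)

# Parameters: estimated from the data during fitting
model.fit(X, y)

print("hyperparameter alpha:", model.alpha)
print("learned coefficients:", model.coef_)
print("learned intercept:", model.intercept_)
```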

Fundamental concepts

Goal of supervised ML

  • We train a model on a data set where inputs and correct outputs are known

  • The model learns a pattern (mapping) from input to output

  • Then we give it new inputs without the answer and ask it to predict the output

    • Prediction: applying what the model has learned to new data
  • The goal is a model that performs well not only on training data but also on unseen (test) data

Fundamental concepts

Two types of prediction depending on the task

  • Regression: continuous numbers (e.g., predicting house prices)

  • Classification: categories/labels (e.g., predicting if an email is spam)

Fundamental concepts

Bias

  • Bias arises when a model makes overly strong or simplistic assumptions about the data
  • The model oversimplifies, misses patterns, and underfits the data

Fundamental concepts

Underfit

  • If a model fails to learn the underlying patterns in the data, it underfits

  • This occurs when the model is too simple to capture the true structure or relationships in the data

    • E.g., linear model on complex/nonlinear data
  • The model will perform poorly on both the training set and the test set, showing high error everywhere because it has not learned enough signal
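
Underfitting can be sketched with a straight line fitted to clearly nonlinear data. The data below are synthetic (a noisy parabola, an assumption made purely for illustration); the point is that R-squared is low on both the training and the test portion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data: y = x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(scale=0.5, size=400)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature array

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight line is too simple for a parabola
line = LinearRegression().fit(X_train, y_train)
r2_train = line.score(X_train, y_train)
r2_test = line.score(X_test, y_test)

print("R-squared (train):", round(r2_train, 3))
print("R-squared (test): ", round(r2_test, 3))
```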

Fundamental concepts

Variance

  • The degree to which a model's predictions change when it is trained on different samples of data

  • High variance means low generalizability

Fundamental concepts

Overfit

  • If a model learns the training data “too well”, it overfits

  • This happens when the model mistakes noise for signal in the training data

    • E.g., overly complex model “memorizing” the data
  • The model will perform well on the training set but will not generalize to unseen data (i.e., the test set)
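
Overfitting can be sketched by fitting a very flexible model to a small sample. The example assumes synthetic data (a noisy sine curve) and a degree-15 polynomial, both chosen only for illustration; the training error comes out much lower than the error on fresh data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from a noisy sine curve (synthetic data)."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = sample(20)   # small training set
X_test, y_test = sample(200)    # unseen data

# Degree-15 polynomial: flexible enough to chase noise in 20 points
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))

print("MSE (train):", mse_train)
print("MSE (test): ", mse_test)
```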

Fundamental concepts

Bias-variance trade-off

Fundamental concepts

Validation

  • A validation set is used to tune hyperparameters (e.g., learning rate, number of tree splits, regularization strength)

  • It helps prevent overfitting to the training data, because performance is monitored on data the model was not trained on

  • Once hyperparameters are finalized, report final performance on the test set, not on the validation set
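
One common way to obtain the three sets is to call scikit-learn's `train_test_split` twice. The sketch below uses placeholder arrays and a 60/20/20 split; the proportions are an assumption for illustration, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 observations, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of all data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: hold out 25% of the remaining 80% as the validation set
# (0.25 * 0.8 = 0.2, i.e., 20% of the full data)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```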

Fundamental concepts

Train, Test, Validation sets (Wikimedia Commons)

Fundamental concepts

Cross-validation (scikit-learn)

Linear regression

  • Most basic ML method for continuous outcome variables

A linear regression model (Wikimedia Commons)

Ames housing data

A dataset to predict house prices in Ames, Iowa

  • Available through Kaggle

  • We’ll use it to apply some machine learning using scikit-learn

Load the Ames housing dataset

Inspect the training set

Inspect the test set

  • The 'SalePrice' column in the test set is withheld by Kaggle
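
A sketch of loading and inspecting the data with pandas. To keep the example runnable without downloading anything, it reads two tiny in-memory CSVs with made-up rows; with the real Kaggle files you would pass file paths to `pd.read_csv` instead (the file names `train.csv` and `test.csv` are an assumption). The column names follow the Kaggle Ames data.

```python
import io
import pandas as pd

# Stand-ins for the Kaggle files; the rows are made up for illustration.
# With real data: train = pd.read_csv("train.csv")  (assumed file name)
train_csv = io.StringIO(
    "Id,GrLivArea,OverallQual,YearBuilt,SalePrice\n"
    "1,1500,6,1995,180000\n"
    "2,1200,5,1970,140000\n"
    "3,2000,8,2005,260000\n"
)
test_csv = io.StringIO(
    "Id,GrLivArea,OverallQual,YearBuilt\n"
    "4,1700,7,2001\n"
    "5,1100,4,1965\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Inspect the training set
print(train.shape)
print(train.head())

# Inspect the test set: the outcome column is not there
print(test.head())
print("SalePrice" in test.columns)
```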

Select features

  • The data set has a lot of features

  • Let’s use some of them to build a predictive model

  • We can select them using a list

Slice
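
Selecting features with a list and slicing the data frame might look like the sketch below. The small data frame is made up, and the three chosen features are only an illustrative choice, not necessarily the ones used in the lecture.

```python
import pandas as pd

# Made-up rows standing in for the Ames training set
train = pd.DataFrame({
    "GrLivArea": [1500, 1200, 2000, 1800],
    "OverallQual": [6, 5, 8, 7],
    "YearBuilt": [1995, 1970, 2005, 2000],
    "Neighborhood": ["A", "B", "A", "C"],
    "SalePrice": [180000, 140000, 260000, 220000],
})

# A list of column names selects the features
features = ["GrLivArea", "OverallQual", "YearBuilt"]

X_train = train[features]      # DataFrame with the selected columns
y_train = train["SalePrice"]   # Series with the outcome

print(X_train.head())
print(y_train.head())
```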

Fit model

  • In a machine learning context, coefficients and standard errors are secondary

  • Predictive performance is more important

  • scikit-learn does not produce standard errors, p-values, confidence intervals, etc.

  • See Shmueli (2010) for the differences between prediction and explanation
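
Fitting the model could be sketched as follows. The data are synthetic (prices generated from two Ames-style columns plus noise), so the numbers mean nothing substantively. After `fit`, scikit-learn exposes the coefficients and intercept, but no standard errors or p-values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, n),
    "OverallQual": rng.integers(1, 11, n),
})
train["SalePrice"] = (
    100 * train["GrLivArea"]
    + 20000 * train["OverallQual"]
    + rng.normal(0, 10000, n)
)

features = ["GrLivArea", "OverallQual"]

# Create the model object, then estimate its parameters
model = LinearRegression()
model.fit(train[features], train["SalePrice"])

print(dict(zip(features, model.coef_)))
print(model.intercept_)
```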

Make predictions

  • Slice the test set in the same way

Add predictions to test set

  • Now that we have predictions, we can add them to the test set
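
A sketch of predicting for the test set and storing the result. The data are synthetic, and the column name `PredictedPrice` is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_features(n):
    """Synthetic stand-in for two Ames-style feature columns."""
    return pd.DataFrame({
        "GrLivArea": rng.uniform(800, 3000, n),
        "OverallQual": rng.integers(1, 11, n),
    })

train = make_features(200)
train["SalePrice"] = (
    100 * train["GrLivArea"]
    + 20000 * train["OverallQual"]
    + rng.normal(0, 10000, 200)
)
test = make_features(50)  # no SalePrice column, as in the Kaggle test set

features = ["GrLivArea", "OverallQual"]
model = LinearRegression().fit(train[features], train["SalePrice"])

# Slice the test set the same way, predict, and add a new column
test["PredictedPrice"] = model.predict(test[features])
print(test.head())
```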

Let’s do better

  • We can split the labeled data set into a training set and a held-out “test” (validation) set

  • Estimate how well the model will perform on unseen data

  • Adjust the model based on that (e.g., add parameters, regularization, etc.)

Create test set

  • Create a “test” (validation) set from the training set

  • Approximately 20% of the observations will go to the validation set

Train the model on the new training set

  • Train on the new (smaller) training set

  • Use the rest as the validation set; make predictions
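
The split, fit, and predict steps above can be sketched as follows, assuming `train_test_split` with `test_size=0.2` and synthetic data in place of the Ames training set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled training data
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, n),
    "OverallQual": rng.integers(1, 11, n),
})
data["SalePrice"] = (
    100 * data["GrLivArea"]
    + 20000 * data["OverallQual"]
    + rng.normal(0, 10000, n)
)

X = data[["GrLivArea", "OverallQual"]]
y = data["SalePrice"]

# About 20% of the observations go to the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the smaller training set, predict for the validation set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

print(len(X_train), len(X_val))
```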

Predict performance

  • MSE (mean squared error)

  • Because we know the actual values, we can estimate performance

Predict performance

  • R-squared
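
Both metrics are available in `sklearn.metrics`. The sketch below uses synthetic data and a held-out validation set, and also computes MSE by hand to show what the function does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: two features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

# MSE: average squared difference between actual and predicted values
mse = mean_squared_error(y_val, y_pred)
mse_by_hand = np.mean((y_val - y_pred) ** 2)

# R-squared: share of outcome variance accounted for by the predictions
r2 = r2_score(y_val, y_pred)

print("MSE:", mse)
print("R-squared:", r2)
```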

K-fold cross-validation

  • K-fold cross-validation is usually a better way to estimate performance

  • Less sensitive to the randomness of a single split

  • Split data into k folds (usually 5 or 10)

    • Each fold takes turns being the validation set
    • Each time, the remaining folds serve as the training set
    • Fit k models, each predicting for its validation fold
    • Estimate performance by averaging over all folds

K-fold cross-validation

  • Import cross_val_score and create a LinearRegression() object
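
A sketch of this step with 5 folds, assuming synthetic data in place of the Ames features. Note that scikit-learn reports MSE as a negated score (`neg_mean_squared_error`) so that higher is always better; flipping the sign recovers the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

model = LinearRegression()

# One R-squared value per fold
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# One (negated) MSE value per fold; flip the sign to get MSE
neg_mse_scores = cross_val_score(
    model, X, y, cv=5, scoring="neg_mean_squared_error"
)
mse_scores = -neg_mse_scores

print("R-squared per fold:", r2_scores)
print("Mean R-squared:", r2_scores.mean())
print("Mean MSE:", mse_scores.mean())
```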

Cross-validation for hyperparameter tuning

  • Ridge regression
    • A type of regularized linear regression
    • Penalizes large coefficients
    • Large coefficients often mean the model is overfitting (learning noise in the training data)
    • The penalty shrinks coefficients \(\rightarrow\) better generalization
  • Regularization strength (\(\alpha\))
    • The smaller \(\alpha\), the milder the penalty
    • \(\alpha\) = 0: no penalization (ordinary least squares)
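
Tuning \(\alpha\) with cross-validation can be sketched as a simple loop. The grid of candidate values, the synthetic data, and the standardization step are all assumptions made for illustration; the candidate with the lowest cross-validated MSE is selected.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only the first one carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate grid
cv_mse = {}

for alpha in alphas:
    # Standardize features so the penalty treats them on the same scale
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[alpha] = -scores.mean()

# Pick the alpha with the lowest average validation MSE
best_alpha = min(cv_mse, key=cv_mse.get)

for alpha, mse in cv_mse.items():
    print(f"alpha={alpha}: CV MSE={mse:.3f}")
print("best alpha:", best_alpha)
```

scikit-learn's `GridSearchCV` automates the same kind of search.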

K-fold cross-validation

Ridge regression

Resources