HSS 611: Programming for HSS
Dec 9, 2025
What is machine learning?
Fundamental concepts
scikit-learn
Machine learning
Enables a machine to learn patterns or rules from data, rather than being explicitly programmed for every task
Algorithms and models
Algorithms: computational procedures that define how learning will occur (e.g., how to adjust parameters)
Models: the outcome of applying an algorithm to data — a learned representation characterized by parameters estimated from labeled or unlabeled data
Two fundamental approaches in machine learning
| | Supervised | Unsupervised |
|---|---|---|
| Objective | Trained on labeled data to learn a mapping from input to output | Find patterns or structure in data without labels |
| Outcome | Pre-defined categories | No pre-defined categories |
| Common tasks | Regression, classification | Clustering, dimensionality reduction |
| Model evaluation | Explicit metrics such as accuracy, precision, recall, or MSE | Often involves qualitative assessment |
Python is a very popular language for ML
Traditional machine learning: scikit-learn
Deep learning: PyTorch, TensorFlow & Keras
We will focus on supervised learning
Involves a training and a test set
Train a model using the training set
Test the performance of the model on the test set
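A minimal sketch of this workflow (synthetic data via make_regression, used here purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: 200 samples, 5 features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)          # train on the training set
print(model.score(X_test, y_test))   # evaluate (R^2) on the test set
```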
Parameters and hyperparameters
Parameter
Learned (estimated) from data (internal to the model)
E.g., regression weights/coefficients
Hyperparameters
Control the learning process (hence “hyper”)
Model structure (e.g., number of layers in a NN), optimization settings (e.g., learning rate in a NN), etc.
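To make the distinction concrete, a minimal scikit-learn sketch (the data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny illustrative dataset (assumed values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Hyperparameter: alpha (regularization strength) is set by us before learning
model = Ridge(alpha=1.0)

# Parameters: the coefficient(s) and intercept are estimated from the data by fit()
model.fit(X, y)
print(model.coef_, model.intercept_)
```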
Goal of supervised ML
We train a model on a data set where inputs and correct outputs are known
The model learns a pattern (mapping) from input to output
Then we give it new inputs without the answer and ask it to predict the output
Model performs well not only on training data but also on unseen (test) data
Two types of prediction depending on the task
Regression: continuous numbers (e.g., predicting house prices)
Classification: categories/labels (e.g., predicting if an email is spam)
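A compact sketch contrasting the two tasks (toy data assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])

# Regression target: continuous numbers
y_reg = np.array([1.5, 3.1, 4.4, 6.2, 7.4, 9.1])
print(LinearRegression().fit(X, y_reg).predict([[7]]))  # predicts a number

# Classification target: categories/labels
y_clf = np.array([0, 0, 0, 1, 1, 1])
print(LogisticRegression().fit(X, y_clf).predict([[7]]))  # predicts a label
```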
Bias
Systematic error from assumptions that are too simple; high bias leads to underfitting
Underfit
If a model fails to learn the underlying patterns in the data, it can lead to underfitting
This occurs when the model is too simple to capture the true structure or relationships in the data
The model will perform poorly on both the training set and the test set, showing high error everywhere because it has not learned enough signal
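One way to see underfitting in code (a sketch with synthetic quadratic data, not from the original slides): a straight line fit to a curved pattern scores poorly even on the data it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Quadratic data: a straight line is too simple to capture the pattern
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = (X ** 2).ravel()

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2 near 0: high error even on the training data
```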
Variance
The degree to which the learned model changes with the particular training data it sees
High variance means low generalizability
Overfit
If a model learns the training data “too well”, it can lead to overfitting
This happens when the model mistakes noise for signal in the training data
The model will perform well on the training set but will not generalize to unseen data (i.e., the test set)
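Overfitting is easy to reproduce with an unconstrained decision tree on noisy synthetic data (a sketch, not from the original slides):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training data, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train))  # ~1.0 on the training set
print(tree.score(X_test, y_test))    # much lower on the test set
```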
Bias-variance trade-off
Validation
A validation set is used to fine-tune hyperparameters (e.g., choosing learning rate, number of tree splits, regularization strength)
It helps prevent overfitting to the training data, because performance is monitored on data the model was not trained on
Once hyperparameters are finalized, the validation set should not be used to report final performance
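A common pattern is two successive splits (a sketch; the data and split proportions are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# First set aside the final test set, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once, at the end
```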
Train, Test, Validation sets (Wikimedia Commons)
Cross-validation (scikit-learn)
A linear regression model (Wikimedia Commons)
A dataset to predict house prices in Ames, Iowa
Available through Kaggle
We’ll use it to apply some machine learning using scikit-learn
'SalePrice' column in the test set is withheld by Kaggle
The data set has a lot of features
Let’s use some of them to build a predictive model
We can select them using a list
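For example (a sketch: the file path is assumed, and the column names follow the Kaggle House Prices data):

```python
import pandas as pd

# Kaggle's labeled training file (path assumed)
ames = pd.read_csv("train.csv")

# Select a handful of features using a list
features = ["GrLivArea", "OverallQual", "YearBuilt", "TotalBsmtSF"]
X = ames[features]
y = ames["SalePrice"]
```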
In a machine learning context, coefficients and standard errors are secondary
Predictive performance is more important
scikit-learn does not produce standard errors, p-values, confidence intervals, etc.
See Shmueli (2010) for the differences between prediction and explanation
We can split the labeled data set into a training set and a “test” set
Estimate what the performance of the model is going to be
Adjust the model based on that (e.g., add parameters, regularization, etc.), as sketched below
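Continuing the Ames sketch above (X and y as just defined), a minimal split-fit-evaluate loop might look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y as built in the previous sketch
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(mean_squared_error(y_test, preds) ** 0.5)  # RMSE on the held-out split
```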
K-fold cross-validation is usually a better method to estimate performance
Not too sensitive to the randomness of a single split
Split the data into k folds (usually 5 or 10)
Create a LinearRegression() object and pass it to cross_val_score
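A sketch of that call (X and y as in the earlier sketches; cv=5 is an assumption):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: one R^2 score per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)
print(scores.mean())  # average across folds: a more stable performance estimate
```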
Cross-validation for hyperparameter tuning
Ridge regression
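A minimal sketch of ridge regression with its alpha hyperparameter tuned by cross-validation (GridSearchCV is one option; the alpha grid is an assumption, and X_train/X_test are as in the earlier split):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength alpha via 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # best alpha found by CV
print(search.score(X_test, y_test))  # refit model evaluated on the test set
```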