2026-5-11

Warm-up

  • Groups: Final project teams (if team members here)
  • Respond to worksheet questions on final projects

Today’s Class

  • Warm-up: final project prep
  • Data models vs. algorithmic models
  • Activity: categorization by hand
  • Unsupervised algorithms, supervised algorithms
  • Reflection on algorithms, society

Wednesday’s Class

  • Machine learning with tidymodels
  • Guest speaker from Recidiviz - add your questions on “Collaborations” tab

Office Hours

  • Office Hours: Monday 1:30-3:30pm (Tyler)
  • Tuesdays, 10:30am-12:00pm (Yao)

Learning Goals

  • Motivate algorithmic approaches to social science
  • Explore classification strategies
  • Understand data models vs. algorithms
  • Understand supervised vs. unsupervised algorithms
  • Discuss algorithmic bias
  • (Wednesday) Machine Learning with tidymodels

Algorithmic Approaches to Social Science

  • What is an example of a social process where something (\(x\)) leads to something else (\(y\))?
Source: Breiman, 2002

The Data Model Approach

  • In the data modeling approach, we would use a combination of features \(x\) to explain how \(y\) happens
  • Examples of data models?
Source: Breiman, 2002

The Algorithmic Approach

  • In the algorithmic approach, we may or may not care how \(y\) happens
  • We use a machine learning model to take features from \(x\) and predict \(y\)
  • Examples of algorithmic models?
Source: Breiman, 2002

Unsupervised Algorithmic Approaches

What is unsupervised machine learning?

  • Unsupervised machine learning means using an algorithm to classify data where we have no ground truth
  • In other words, we want to categorize our data, often on many characteristics at once, or something that is difficult to define

Activity: Classify Neighborhoods

  • Classify the following housing data into “neighborhoods”
  • Outline your process: how did you classify these houses?
  • Is there a mathematical formula or set of instructions that could be applied to a different area?

What is unsupervised machine learning?

  • In unsupervised machine learning, we have no “ground truth” answers
  • We look to the algorithm for classifications

K-means

  1. Pick k points (centroids) at random
  2. Assign all data points to closest centroid
  3. Re-calculate centroids, based on points in group
  4. Re-assign points to nearest centroid
  5. Repeat 3-4 until no points switch groups
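The five steps above can be sketched in a few lines of base R. This is only an illustration on made-up toy data: the function name simple_kmeans and its arguments are hypothetical, and in practice the built-in kmeans() function is the better choice.

```r
# A hand-rolled k-means for illustration (hypothetical helper, not a library function)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k data points at random as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Steps 2 and 4: squared distance from every point to every centroid,
    # then assign each point to its nearest centroid
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_assignment <- max.col(-dists, ties.method = "first")

    # Step 5: stop once no point switches groups
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 3: move each centroid to the mean of the points in its group
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Toy data (made up): two well-separated blobs of 20 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Because the starting centroids are random, different seeds can give different cluster labels (and, on messier data, different clusters).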

Supervised Algorithmic Approaches

What is supervised machine learning?

  • In supervised machine learning, we have “ground truth” data
  • We want to train an algorithm to make predictions on new data

Splitting Data

  • Recall: We might want to split our data into training and test sets
  • Question: Why do we split?

Cross-Validation

  • We might even want to split up the training sample to create better models
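As a rough sketch of the idea, the base R code below splits a training sample into 5 folds, fits a model on 4 of them, and measures error on the held-out fold, rotating through all 5. The built-in mtcars data and the linear model are stand-ins chosen only to keep the example self-contained; they are not part of this class's analysis.

```r
set.seed(42)
k <- 5

# Randomly assign each row of the built-in mtcars data to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: train on the other folds, then compute RMSE on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train    <- mtcars[fold_id != i, ]
  held_out <- mtcars[fold_id == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((held_out$mpg - predict(fit, newdata = held_out))^2))
})

fold_rmse        # one error estimate per fold
mean(fold_rmse)  # averaged cross-validation estimate
```

Averaging over folds gives an estimate of out-of-sample error without touching the test set.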

The Modeling Process

Decision Trees

  • A decision tree uses “split points” to make decisions based on certain variables
  • Let’s look at an example
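To make "split points" concrete, here is a minimal sketch of how a single split could be chosen for a numeric outcome: try each candidate threshold and keep the one that leaves the smallest sum of squared errors in the two resulting groups. The helper best_split and the toy numbers are hypothetical, and real tree software uses more refined versions of this search.

```r
# Find the threshold on x that best separates y into two groups
# (hypothetical helper for illustration)
best_split <- function(x, y) {
  # Candidate thresholds: midpoints between consecutive sorted unique x values
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2

  # Sum of squared errors if each group is predicted by its own mean
  sse <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })

  list(split = candidates[which.min(sse)], sse = min(sse))
}

# Toy data (made up): square footage and price in $1000s
sqft  <- c(800, 950, 1100, 1500, 1700, 2100)
price <- c(100, 110, 105, 200, 210, 205)

best_split(sqft, price)
```

A full tree repeats this search within each resulting group, across all available variables, until a stopping rule is reached.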

Predicting Neighborhoods

  • Let’s say we have the “true” neighborhood classifications
  • What are some features that might help us classify neighborhoods?
  • Can you think of any “decision points”?

What is a random forest?

Accuracy and Interpretability

  • As our modeling strategies get more complicated, what happens to interpretability?
  • Plot the following on the chart below: decision trees, random forests, linear regression, logistic regression, neural networks, and any other modeling strategies you can think of!

Accuracy and Interpretability

  • Maybe you have something like this? (not exact, will vary heavily by model specifications)
  • The takeaway: in efforts to increase predictive accuracy, models can become opaque/unclear

Accuracy and Interpretability

  • Recall: data models vs algorithmic models
  • What would it mean if important societal decisions were made by algorithms?

Algorithmic Bias

What is Algorithmic Bias?

  • We’ve seen how algorithms can make predictions by finding hidden patterns in data
  • Bias vs. variance tradeoff (statistical bias/variance)
  • This can also mean social bias: algorithms might favor certain individuals/groups/places

Credit Scores and Algorithmic Bias

  • Credit scoring is the “paradigmatic example of algorithmic governance” (Kiviat, 2019)
  • Big financial data is used to predict loan repayment based on obscure patterns
  • These scores dictate who can access loans, repayment rates, and more

Credit Scores and Algorithmic Bias

  • Example: Credit scores are biased according to individuals’ zip codes
  • Question: how would you make credit scores more fair?

Credit Scores and Algorithmic Bias

  • Question: how would you make credit scores more fair?
  • Use a data model? (Possibly more fair, but with lower predictive accuracy)
  • Remove zip code? (Other data might be correlated with zip code)
  • Set demographic parity (e.g. each zip code must have similar average scores)? This could make another variable more biased
  • Set equal opportunity (e.g. ensure hypothetical individuals with similar characteristics have similar scores across zip codes, all else being equal)?
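One way to check the demographic parity idea is to compare average scores across groups. The sketch below uses entirely made-up scores for two hypothetical zip codes, "A" and "B"; it only shows the arithmetic, not how real credit scores behave.

```r
# Made-up credit scores for two hypothetical zip codes
scores <- data.frame(
  zip   = c("A", "A", "A", "B", "B", "B"),
  score = c(700, 720, 680, 640, 660, 620)
)

# Average score within each zip code
group_means <- tapply(scores$score, scores$zip, mean)
group_means

# Demographic parity gap: difference between the highest and lowest group average
parity_gap <- max(group_means) - min(group_means)
parity_gap
```

A gap near zero would satisfy this particular definition of fairness, but it says nothing about whether individuals within each zip code are scored fairly.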

Credit Scores and Algorithmic Bias

  • Question: how would you make credit scores more fair?
  • This is not only a modeling question, it is also a social question!
  • Requires theory of fairness in society

Recap

  • Unsupervised algorithm: create categorizations when we don’t have any (e.g. k-means)
  • Supervised algorithm: make predictions when we have some ground-truth data (e.g. random forest)
  • Data models: explainable, interpretable, not necessarily the best predictors
  • Algorithmic approaches: accurate predictions, not necessarily explainable/interpretable
  • Algorithmic fairness: requires social theory and algorithmic knowledge

Writing prompts 2

  • Respond to the prompts on the back of the warm-up exercise.

K-Means Modeling in R

Load ames data

  • Explore the ames data with View()
  • Notice the Latitude and Longitude columns
  • How would we make it spatial?
library(sf)
library(tidymodels)
library(ggplot2)
library(tidyverse)
library(magrittr)

data(ames)

Explore ames data

  • st_as_sf!
# make ames data spatial
# (crs = 4326 assumes the coordinates are WGS 84 longitude/latitude)
ames_spatial <- ames %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)

Plot ames data

ggplot(ames_spatial) + 
  geom_sf()

K-Means Clustering

  • We can cluster along longitude/latitude axes
  • We will try 3 clusters first
# set a seed so the random starting centroids are reproducible
set.seed(123)
kclust <- kmeans(ames %>%
                   select(Longitude, Latitude), 
                 centers = 3)

Plot the clusters!

# add clusters
ames_spatial %<>% 
  mutate(cluster = as.character(kclust$cluster))

ggplot(ames_spatial) + 
  geom_sf(aes(col = cluster))

K-Means Clustering

  • What would it mean to add another variable?
kclust <- kmeans(ames %>%
                   select(Longitude, Latitude, Sale_Price), 
                 centers = 3)
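One thing to watch when adding Sale_Price: k-means uses distances, and sale prices are in dollars while coordinates are in degrees, so the unscaled price will dominate the clustering. A possible sketch of standardizing each variable first is below; the object names ames_scaled and kclust_scaled, the seed, and nstart = 10 are choices made for this example only.

```r
library(tidymodels)
data(ames)

set.seed(123)

# Put all three variables on a common scale (mean 0, standard deviation 1)
ames_scaled <- ames %>%
  select(Longitude, Latitude, Sale_Price) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# Cluster on the scaled variables; nstart tries several random starts
kclust_scaled <- kmeans(ames_scaled, centers = 3, nstart = 10)

table(kclust_scaled$cluster)
```

Whether to scale is a modeling decision: scaling treats a one-standard-deviation difference in price as equally important as a one-standard-deviation difference in location.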

Adjust your model

  • Try tuning the model by changing \(k\) and/or other kmeans settings
  • How close can you make it to the neighborhoods below?
Neighborhood identifiers in Ames data.
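A common heuristic for choosing \(k\) is to run k-means for several values and plot the total within-cluster sum of squares, looking for an "elbow" where adding clusters stops helping much. The sketch below is one way to do this; the range 1 to 10, the seed, and the object names are arbitrary choices for illustration.

```r
library(tidymodels)
data(ames)

coords <- ames %>% select(Longitude, Latitude)

set.seed(123)
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
tot_withinss <- sapply(k_values, function(k) {
  kmeans(coords, centers = k, nstart = 10)$tot.withinss
})

elbow <- tibble(k = k_values, tot_withinss = tot_withinss)

# Look for the point where the curve flattens out
ggplot(elbow, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point()
```

The elbow is only a guide; matching the actual neighborhood map may call for a different \(k\).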

Random Forests with tidymodels

Exploring Ames Sale Prices

# examine ames housing data
ggplot(ames, aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, col= "white")

Exploring Ames Sale Prices

  • What do you notice about the distribution of home prices?
  • What could this mean for our train/test split?

Splitting Our Data

  • By specifying strata = Sale_Price, we stratify the split so that the full range of sale prices (including the rare high prices) is represented in both the train and test data
library(tidymodels)

# split data
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

“Growing” a Decision Tree

# create a decision tree
tree_model <- 
  decision_tree(min_n = 2) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

# fit the tree (the outcome, Sale_Price, must not also appear as a predictor)
tree_fit <- 
  tree_model %>% 
  fit(Sale_Price ~ Longitude + Latitude, 
      data = ames_train)

Make Predictions!

  • Do you notice anything about the predictions?
ames_test_small <- ames_test %>% slice(1:10)

# combine predictions with our data
ames_test_small %>% 
  select(Sale_Price) %>% 
  bind_cols(predict(tree_fit, ames_test_small))
## # A tibble: 10 × 2
##    Sale_Price   .pred
##         <int>   <dbl>
##  1     213500 210904.
##  2     191500 210904.
##  3     189000 210904.
##  4     185000 210904.
##  5     141000 145929.
##  6     210000 210904.
##  7     146000 145929.
##  8     376162 313213.
##  9     320000 407439.
## 10     215200 210904.

“Growing” a Random Forest

# create a random forest
rf_model <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

# define the random forest workflow
rf_wflow <- 
  workflow() %>% 
  add_formula(
    Sale_Price ~ Gr_Liv_Area + Year_Built + Bldg_Type +  
      Latitude + Longitude) %>% 
  add_model(rf_model) 

# fit the random forest model
rf_fit <- rf_wflow %>% fit(data = ames_train)

Estimate Model Performance

estimate_perf <- function(model, dat) {
  # Capture the names of the `model` and `dat` objects
  cl <- match.call()
  obj_name <- as.character(cl$model)
  data_name <- as.character(cl$dat)
  data_name <- gsub("ames_", "", data_name)
  
  # Estimate these metrics:
  reg_metrics <- metric_set(rmse, rsq)
  # output our model
  output <- model %>%
    predict(dat) %>%
    bind_cols(dat %>% select(Sale_Price)) %>%
    reg_metrics(Sale_Price, .pred) %>%
    select(-.estimator) %>%
    mutate(object = obj_name, data = data_name)
  
  return(output)
}

A Note on Performance Metrics

  • RMSE (root mean squared error) is the square root of the average squared prediction error, in the units of the outcome (smaller = better)
  • \(R^2\) is the squared correlation between predictions and actual values (larger = better)
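To see what these two metrics compute, the sketch below calculates them by hand in base R: RMSE as the square root of the mean squared error, and \(R^2\) as the squared correlation between predictions and actual values (the definition used by the rsq metric in yardstick). The numbers are made up for illustration.

```r
# Made-up actual and predicted values (in $1000s)
actual    <- c(200, 150, 300, 250)
predicted <- c(210, 140, 280, 260)

# RMSE: square each error, average them, then take the square root
rmse_manual <- sqrt(mean((actual - predicted)^2))

# R-squared: squared correlation between actual and predicted values
rsq_manual <- cor(actual, predicted)^2

rmse_manual
rsq_manual
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.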

Estimate Performance

  • We can estimate performance within training data
# first examine tree performance
estimate_perf(tree_fit, ames_train)
## # A tibble: 2 × 4
##   .metric .estimate object   data 
##   <chr>       <dbl> <chr>    <chr>
## 1 rmse    50318.    tree_fit train
## 2 rsq         0.603 tree_fit train
# now examine random forest performance
estimate_perf(rf_fit, ames_train)
## # A tibble: 2 × 4
##   .metric .estimate object data 
##   <chr>       <dbl> <chr>  <chr>
## 1 rmse    15019.    rf_fit train
## 2 rsq         0.968 rf_fit train

Estimate Performance

  • We can estimate performance within training data
  • Which model performs better?
  • What do we do next?
# first examine tree performance
estimate_perf(tree_fit, ames_train)

# now examine random forest performance
estimate_perf(rf_fit, ames_train)

Evaluate Predictions on Test Data

  • We can use last_fit and collect_metrics to evaluate models on test data
  • What do we notice, comparing our test sample results to the training sample results?
# final rf model
final_rf_res <- last_fit(rf_wflow, ames_split)

# get model metrics
collect_metrics(final_rf_res)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config        
##   <chr>   <chr>          <dbl> <chr>          
## 1 rmse    standard   29146.    pre0_mod0_post0
## 2 rsq     standard       0.867 pre0_mod0_post0

Cross Validation

Re-Sample your data to better estimate model efficacy

  • Cross validation gives us a tool to re-sample within the training set
  • Why would this be advantageous? (Think about this for the problem set)
Cross validation. Source: Kuhn and Silge, 2023.
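A minimal sketch of cross-validation with tidymodels is below, using vfold_cv() to create folds within the training data and fit_resamples() to fit and evaluate a model on each fold. The choice of 5 folds, the seed, the simple decision tree, and the object names ames_folds, tree_wflow, and cv_res are assumptions made to keep the example small and quick to run.

```r
library(tidymodels)
data(ames)

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)

# Create 5 folds within the training data, stratified on the outcome
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

# A simple decision tree workflow to evaluate
tree_wflow <-
  workflow() %>%
  add_formula(Sale_Price ~ Longitude + Latitude) %>%
  add_model(
    decision_tree() %>%
      set_engine("rpart") %>%
      set_mode("regression")
  )

# Fit on each set of 4 folds and evaluate on the held-out fold
cv_res <- fit_resamples(tree_wflow, resamples = ames_folds)

# Metrics averaged across the folds
collect_metrics(cv_res)
```

The test set is never used here, so it stays available for a final, one-time evaluation with last_fit().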

Guest Speaker!