Intro to Computational Social Science: Week 7

2026-5-11

Warm-up

Groups: Final project teams (if your team members are here)
Respond to worksheet questions on final projects

Today’s Class

Warm-up: final project prep
Data models vs. algorithmic models
Activity: categorization by hand
Unsupervised algorithms, supervised algorithms
Reflection on algorithms, society

Wednesday’s Class

Machine learning with tidymodels
Guest speaker from Recidiviz - add your questions on “Collaborations” tab

Office Hours

Mondays, 1:30-3:30pm (Tyler)
Tuesdays, 10:30am-12:00pm (Yao)

Learning Goals

Motivate algorithmic approaches to social science
Explore classification strategies
Understand data models vs. algorithms
Understand supervised vs. unsupervised algorithms
Discuss algorithmic bias
(Wednesday) Machine Learning with tidymodels

Algorithmic Approaches to Social Science

The Data Model Approach

In the data modeling approach, we would use a combination of features \(x\) to explain how \(y\) happens
Examples of data models?

Source: Breiman, 2002

The Algorithmic Approach

In the algorithmic approach, we may or may not care how \(y\) happens
We use a machine learning model to take features from \(x\) and predict \(y\)
Examples of algorithmic models?

Source: Breiman, 2002
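
To make the contrast concrete, here is a minimal sketch, assuming the ames housing data that ships with tidymodels (used later in these slides). The object names lm_sketch and tree_sketch and the choice of predictors are illustrative assumptions, not part of the lecture's own code. The linear regression is read through its coefficients (a data model); the decision tree is used only for its predictions (an algorithmic model).

```r
library(tidymodels)  # attaches parsnip, broom, dplyr, and modeldata (which ships `ames`)

data(ames)

# Data model: a linear regression whose coefficients we interpret
lm_sketch <- 
  linear_reg() %>% 
  set_engine("lm") %>% 
  fit(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames)

# One row per term: the intercept plus a slope for each feature
tidy(lm_sketch)

# Algorithmic model: a decision tree that we use only to predict
tree_sketch <- 
  decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression") %>% 
  fit(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames)

# Predicted sale prices for the first five homes
predict(tree_sketch, new_data = ames %>% slice(1:5))
```

Both models see the same two features; what differs is whether we read off how \(y\) depends on \(x\) or only ask for a prediction of \(y\).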

Unsupervised Algorithmic Approaches

What is unsupervised machine learning?

Unsupervised machine learning means using an algorithm to classify data where we have no ground truth
In other words, we want to categorize our data, often on many characteristics at once or along a dimension that is difficult to define

Activity: Classify Neighborhoods

Classify the following housing data into “neighborhoods”
Outline your process: how did you classify these houses?
Is there a mathematical formula or set of instructions that could be applied to a different area?

What is unsupervised machine learning?

In unsupervised machine learning, we have no “ground truth” answers
We look to the algorithm for classifications

K-means

1. Pick k points (centroids) at random
2. Assign all data points to the closest centroid
3. Re-calculate centroids, based on the points in each group
4. Re-assign points to the nearest centroid
5. Repeat steps 3-4 until no points switch groups
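
The steps above can be sketched in base R. This is a teaching sketch on made-up toy data (the points, the seed, and the names pts, centroids, and groups are all assumptions), not the implementation behind R's kmeans(). It also does not handle the rare case where a group ends up empty.

```r
set.seed(1)  # arbitrary seed so the random starting centroids are repeatable

# Toy data: two hypothetical clumps of 20 points each, in two dimensions
pts <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
k <- 2

# Pick k data points at random to serve as the starting centroids
centroids <- pts[sample(nrow(pts), k), , drop = FALSE]
groups <- rep(0L, nrow(pts))  # no assignments yet

repeat {
  # Squared distance from every point to every centroid (one column per centroid)
  dists <- sapply(seq_len(k), function(j) {
    rowSums((pts - matrix(centroids[j, ], nrow(pts), ncol(pts), byrow = TRUE))^2)
  })

  # Assign (or re-assign) each point to its closest centroid
  new_groups <- max.col(-dists, ties.method = "first")

  # Stop once no point switches groups
  if (all(new_groups == groups)) break
  groups <- new_groups

  # Re-calculate each centroid as the mean of the points in its group
  for (j in seq_len(k)) {
    centroids[j, ] <- colMeans(pts[groups == j, , drop = FALSE])
  }
}

table(groups)  # how many points ended up in each group
centroids      # final centroid coordinates
```

Changing the seed changes the starting centroids, which is one reason k-means can return different groupings on different runs.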

Supervised Algorithmic Approaches

What is supervised machine learning?

In supervised machine learning, we have “ground truth” data
We want to train an algorithm to make predictions on new data

Splitting Data

Recall: We might want to split our data into training and test sets
Question: Why do we split?

Cross-Validation

We might even want to split up the training sample to create better models

The Modeling Process

Decision Trees

A decision tree uses “split points” to make decisions based on certain variables
Let’s look at an example

Predicting Neighborhoods

Let’s say we have the “true” neighborhood classifications
What are some features that might help us classify neighborhoods?
Can you think of any “decision points”?

What is a random forest?

Accuracy and Interpretability

As our modeling strategies get more complicated, what happens to interpretability?
Plot the following on the chart below: decision trees, random forests, linear regression, logistic regression, neural networks, and any other modeling strategies you can think of!

Accuracy and Interpretability

Maybe you have something like this? (not exact, will vary heavily by model specifications)
The takeaway: in efforts to increase predictive accuracy, models can become opaque/unclear

Accuracy and Interpretability

Recall: data models vs algorithmic models
What would it mean if important societal decisions were made by algorithms?

Algorithmic Bias

What is Algorithmic Bias?

We’ve seen how algorithms can make predictions by finding hidden patterns in data
"Bias" can mean statistical bias, as in the bias vs. variance tradeoff
It can also mean social bias: algorithms might favor certain individuals/groups/places

Credit Scores and Algorithmic Bias

Credit scoring is the “paradigmatic example of algorithmic governance” (Kiviat, 2019)
Big financial data is used to predict loan repayment based on obscure patterns
These scores dictate who can access loans, repayment rates, and more

Credit Scores and Algorithmic Bias

Example: Credit scores are biased according to individuals’ zip codes
Question: how would you make credit scores more fair?

Credit Scores and Algorithmic Bias

Question: how would you make credit scores more fair?
Use a data model? (Possibly more fair, but with lower predictive accuracy)
Remove zip code? (Other data might be correlated with zip code)
Set demographic parity (e.g. each zip code must have similar average scores)? This could make another variable more biased
Set equal opportunity (e.g. ensure hypothetical individuals with similar characteristics have similar scores across zip codes if all else equal)?
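
A small simulation can illustrate two of these options: removing zip code, and checking demographic parity. Everything here is hypothetical: the two zip codes, the income gap between them, and the scoring formula are invented for illustration and do not describe any real credit model.

```r
library(dplyr)

set.seed(42)  # arbitrary seed for the simulated data

# Hypothetical applicants in two hypothetical zip codes; average income differs by zip
n_applicants <- 1000
applicants <- tibble(
  zip    = sample(c("A", "B"), n_applicants, replace = TRUE),
  income = rnorm(n_applicants, mean = ifelse(zip == "A", 60, 45), sd = 10)
)

# A made-up score that never sees zip code, only income (plus noise)
applicants <- applicants %>% 
  mutate(score = 300 + 5 * income + rnorm(n_applicants, sd = 20))

# Demographic parity check: do the zip codes have similar average scores?
parity <- applicants %>% 
  group_by(zip) %>% 
  summarize(mean_score = mean(score), applicants = n())

parity
```

Even though zip code was "removed" from the score, the averages still differ by zip because income is correlated with it. Forcing the averages to match would require adjusting scores in some other way, which is where a theory of fairness comes in.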

Credit Scores and Algorithmic Bias

Question: how would you make credit scores more fair?
This is not only a modeling question, it is also a social question!
Requires theory of fairness in society

Recap

Unsupervised algorithm: create categorizations when we don’t have any (e.g. k-means)
Supervised algorithm: make predictions when we have some ground-truth data (e.g. random forest)
Data models: explainable, interpretable, not necessarily the best predictors
Algorithmic approaches: accurate predictions, not necessarily explainable/interpretable
Algorithmic fairness: requires social theory and algorithmic knowledge

Writing prompts 2

Respond to the prompts on the back of the warm-up exercise.

K-Means Modeling in R

Load ames data

Explore the ames data with View()
Notice the Latitude and Longitude columns
How would we make it spatial?

library(sf)
library(tidymodels)
library(ggplot2)
library(tidyverse)
library(magrittr)

data(ames)

Explore ames data

st_as_sf!

# make ames data spatial
# (crs = 4326 assumes the coordinates are WGS84 longitude/latitude)
ames_spatial <- ames %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)

Plot ames data

ggplot(ames_spatial) + 
  geom_sf()

K-Means Clustering

We can cluster along longitude/latitude axes
We will try 3 clusters first

kclust <- kmeans(ames %>%
                   select(Longitude, Latitude), 
                 centers = 3)

Plot the clusters!

# add clusters
ames_spatial %<>% 
  mutate(cluster = as.character(kclust$cluster))

ggplot(ames_spatial) + 
  geom_sf(aes(col = cluster))

K-Means Clustering

What would it mean to add another variable?

kclust <- kmeans(ames %>%
                   select(Longitude, Latitude, Sale_Price), 
                 centers = 3)
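
One thing to consider when adding a variable: kmeans() measures Euclidean distance, so a column in dollars (Sale_Price) will swamp columns in degrees (Longitude, Latitude). A minimal sketch of one common response, standardizing each column with scale() before clustering, follows. The seed, nstart = 10, and the names ames_scaled and kclust_scaled are assumptions for illustration.

```r
library(tidymodels)  # attaches dplyr and modeldata (which ships `ames`)

data(ames)
set.seed(7)  # arbitrary seed; kmeans starts from random centroids

# Standardize each column to mean 0 and standard deviation 1
ames_scaled <- ames %>% 
  select(Longitude, Latitude, Sale_Price) %>% 
  scale()

# nstart = 10 tries ten random starts and keeps the best one
kclust_scaled <- kmeans(ames_scaled, centers = 3, nstart = 10)

# Number of homes assigned to each cluster
table(kclust_scaled$cluster)
```

Whether to scale is a modeling choice: it decides how much a dollar of price should count relative to a degree of distance.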

Adjust your model

Try tuning the model by changing \(k\) and/or other kmeans settings
How close can you make it to the neighborhoods below?
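
One way to approach the tuning, sketched under assumptions (the range of k values, nstart = 10, and the names coords, within_ss, and kclust_10 are illustrative choices): fit kmeans() for several values of \(k\), compare the total within-cluster sum of squares, and cross-tabulate one clustering against the Neighborhood column in the ames data.

```r
library(tidymodels)  # attaches dplyr, tibble, and modeldata (which ships `ames`)

data(ames)
set.seed(7)  # arbitrary seed; kmeans starts from random centroids

coords <- ames %>% select(Longitude, Latitude)

# Total within-cluster sum of squares for k = 2, ..., 10
k_values <- 2:10
within_ss <- sapply(k_values, function(k) {
  kmeans(coords, centers = k, nstart = 10)$tot.withinss
})

tibble(k = k_values, within_ss = within_ss)

# Compare one clustering with the recorded neighborhoods
kclust_10 <- kmeans(coords, centers = 10, nstart = 10)
table(ames$Neighborhood, kclust_10$cluster)
```

In the cross-tabulation, a neighborhood whose homes fall mostly in one column is one that the clusters recover well.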

Neighborhood identifiers in Ames data.

Random Forests with tidymodels

Exploring Ames Sale Prices

# examine ames housing data
ggplot(ames, aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, col= "white")

Exploring Ames Sale Prices

What do you notice about the distribution of home prices?
What could this mean for our train/test split?

Splitting Our Data

By specifying strata = Sale_Price, we make sure the train and test sets have similar price distributions, so the few very high sale prices appear in both

library(tidymodels)

# split data
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)
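
A quick way to check what stratification did is to compare summary statistics of Sale_Price across the two sets. This is a sketch: the seed value is an arbitrary assumption (the slides do not set one), so exact numbers will differ from any output shown elsewhere.

```r
library(tidymodels)  # attaches rsample, dplyr, and modeldata (which ships `ames`)

data(ames)
set.seed(123)  # arbitrary seed so the split is repeatable

ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Stack the two sets, label each row, and compare the price distributions
bind_rows(train = ames_train, test = ames_test, .id = "set") %>% 
  group_by(set) %>% 
  summarize(
    homes        = n(),
    median_price = median(Sale_Price),
    p90_price    = quantile(Sale_Price, 0.9)
  )
```

If the median and 90th percentile are close across the two rows, both sets cover the expensive tail of the distribution.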

“Growing” a Decision Tree

# create a decision tree
tree_model <- 
  decision_tree(min_n = 2) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

# define the tree's model fit
# (the outcome Sale_Price belongs only on the left-hand side of the formula)
tree_fit <- 
  tree_model %>% 
  fit(Sale_Price ~ Longitude + Latitude, 
      data = ames_train)

Make Predictions!

Do you notice anything about the predictions?

ames_test_small <- ames_test %>% slice(1:10)

# combine predictions with our data
ames_test_small %>% 
  select(Sale_Price) %>% 
  bind_cols(predict(tree_fit, ames_test_small))

## # A tibble: 10 × 2
##    Sale_Price   .pred
##         <int>   <dbl>
##  1     213500 210904.
##  2     191500 210904.
##  3     189000 210904.
##  4     185000 210904.
##  5     141000 145929.
##  6     210000 210904.
##  7     146000 145929.
##  8     376162 313213.
##  9     320000 407439.
## 10     215200 210904.
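
The repeated values in .pred are a property of regression trees: every home that lands in the same leaf receives that leaf's average training price. The sketch below refits a comparable tree on its own split (the seed and the name tree_sketch are assumptions, so the numbers will not match the output above) and counts the distinct predictions.

```r
library(tidymodels)  # attaches parsnip, rsample, dplyr, and modeldata (which ships `ames`)

data(ames)
set.seed(123)  # arbitrary seed so the split is repeatable

ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

tree_sketch <- 
  decision_tree(min_n = 2) %>% 
  set_engine("rpart") %>% 
  set_mode("regression") %>% 
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)

# Print the underlying rpart object: each line is a split point or a leaf
extract_fit_engine(tree_sketch)

# Every test home falls into one leaf, so there are few distinct predictions
tree_preds <- predict(tree_sketch, ames_test)
n_distinct(tree_preds$.pred)
nrow(ames_test)
```

The number of distinct predictions equals the number of leaves reached by the test homes, which is far smaller than the number of homes.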

“Growing” a Random Forest

# create a random forest
rf_model <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

# define the random forest workflow
rf_wflow <- 
  workflow() %>% 
  add_formula(
    Sale_Price ~ Gr_Liv_Area + Year_Built + Bldg_Type +  
      Latitude + Longitude) %>% 
  add_model(rf_model) 

# fit the random forest model
rf_fit <- rf_wflow %>% fit(data = ames_train)

Estimate Model Performance

estimate_perf <- function(model, dat) {
  # Capture the names of the `model` and `dat` objects
  cl <- match.call()
  obj_name <- as.character(cl$model)
  data_name <- as.character(cl$dat)
  data_name <- gsub("ames_", "", data_name)
  
  # Estimate these metrics:
  reg_metrics <- metric_set(rmse, rsq)
  # output our model
  output <- model %>%
    predict(dat) %>%
    bind_cols(dat %>% select(Sale_Price)) %>%
    reg_metrics(Sale_Price, .pred) %>%
    select(-.estimator) %>%
    mutate(object = obj_name, data = data_name)
  
  return(output)
}

A Note on Performance Metrics

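RMSE (root mean squared error) is the square root of the average squared prediction error, in the same units as the outcome (smaller = better)
\(R^2\), as computed by rsq, is the squared correlation between predictions and actual values (larger = better)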
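
Both metrics can be computed by hand. The sketch below uses the ten sale prices and tree predictions printed on the "Make Predictions!" slide (predictions rounded to whole dollars as displayed) and compares the hand calculation with yardstick's rmse_vec() and rsq_vec(). The vector names are assumptions for illustration.

```r
library(yardstick)  # part of tidymodels; provides rmse_vec() and rsq_vec()

# Ten test-set sale prices and the tree's predictions, as displayed in the slides
truth <- c(213500, 191500, 189000, 185000, 141000,
           210000, 146000, 376162, 320000, 215200)
pred  <- c(210904, 210904, 210904, 210904, 145929,
           210904, 145929, 313213, 407439, 210904)

# RMSE: square the errors, average them, then take the square root
rmse_manual <- sqrt(mean((truth - pred)^2))

# R-squared as yardstick's rsq defines it: squared correlation
rsq_manual <- cor(truth, pred)^2

# The yardstick functions should agree with the hand calculations
c(manual = rmse_manual, yardstick = rmse_vec(truth, pred))
c(manual = rsq_manual,  yardstick = rsq_vec(truth, pred))
```

Because RMSE is in dollars, it can be read directly as a typical size of prediction error.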

Estimate Performance

We can estimate performance within training data

# first examine tree performance
estimate_perf(tree_fit, ames_train)

## # A tibble: 2 × 4
##   .metric .estimate object   data 
##   <chr>       <dbl> <chr>    <chr>
## 1 rmse    50318.    tree_fit train
## 2 rsq         0.603 tree_fit train

# now examine random forest performance
estimate_perf(rf_fit, ames_train)

## # A tibble: 2 × 4
##   .metric .estimate object data 
##   <chr>       <dbl> <chr>  <chr>
## 1 rmse    15019.    rf_fit train
## 2 rsq         0.968 rf_fit train

Estimate Performance

We can estimate performance within training data
Which model performs better?
What do we do next?

# first examine tree performance
estimate_perf(tree_fit, ames_train)

# now examine random forest performance
estimate_perf(rf_fit, ames_train)

Evaluate Predictions on Test Data

We can use last_fit and collect_metrics to evaluate models on test data
What do we notice, comparing our test sample results to the training sample results?

# final rf model
final_rf_res <- last_fit(rf_wflow, ames_split)

# get model metrics
collect_metrics(final_rf_res)

## # A tibble: 2 × 4
##   .metric .estimator .estimate .config        
##   <chr>   <chr>          <dbl> <chr>          
## 1 rmse    standard   29146.    pre0_mod0_post0
## 2 rsq     standard       0.867 pre0_mod0_post0

Cross Validation

Re-Sample your data to better estimate model efficacy

Cross validation gives us a tool to re-sample within the training set
Why would this be advantageous? (think about this for the problem set)

Cross validation. Source: Kuhn and Silge, 2023.
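
A minimal sketch of cross validation with tidymodels, under stated assumptions: five folds, an arbitrary seed, and a decision tree workflow (named tree_wflow here for illustration) so the example runs quickly. A random forest workflow could be passed to fit_resamples() in the same way.

```r
library(tidymodels)  # attaches rsample, parsnip, workflows, tune, and modeldata

data(ames)
set.seed(123)  # arbitrary seed so the split and folds are repeatable

ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)

# Split the training set into 5 folds; each fold takes a turn as the held-out set
ames_folds <- vfold_cv(ames_train, v = 5, strata = Sale_Price)

tree_wflow <- 
  workflow() %>% 
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built + Latitude + Longitude) %>% 
  add_model(
    decision_tree() %>% 
      set_engine("rpart") %>% 
      set_mode("regression")
  )

# Fit the workflow five times, each time evaluating on the held-out fold
cv_res <- fit_resamples(tree_wflow, resamples = ames_folds)

# Average RMSE and R-squared across folds, with standard errors
cv_metrics <- collect_metrics(cv_res)
cv_metrics
```

The test set is never touched here, so it stays available for a single final evaluation.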

Guest Speaker!