Intro to Computational Social Science: Week 4

2026-4-20

Groups!

##           group 1           group 2           group 3         group 4
## 1 Kang, Christine     Moore, Allana   Randall, Javion                
## 2     Ong, Alyssa Mahoney, Brigette       Qin, Celine Batson, Anthony
## 3   Leahy, Olivia      Myoung, Sein       Smith, Reid    Mendoza, Ava
## 4  Devir, Lindsey       Pham, Canon Wolfenstein, Luci   Pacheco, Alex
##           group 5
## 1    Barga, Jolie
## 2 Bell, Mary Rose
## 3  Knowles, Genny
## 4 Kang, Christine

Warm-up

Using the definition provided, what data could you gather to measure gentrification in Bay Area neighborhoods?
Consider both custom-made and ready-made sources

Today’s Class

Warm-up: neighborhood change
Gathering data in the 21st century
Activity: data modeling
Data models

Wednesday’s Class

Using APIs
Web Scraping
Data Models
Prediction

Office Hours

Office Hours: Fridays, 1:30pm-3:00pm (Tyler)
Tuesdays, 10:30am-12:00pm (Yao)

Miscellaneous

Week 4 optional readings
Final project proposal upcoming!
Topic/methods open

Gathering Data in the Twenty-first Century

Traditional Methods for Studying Neighborhood Change

What are traditional methods for studying neighborhood change?
Probabilistic surveys!
Census
Qualitative interviews/ethnographic observations
Question: Does computation change the way we do these methods?

Traditional Methods for Studying Neighborhood Change

What are traditional methods for studying neighborhood change?
Probabilistic surveys! (recruit respondents on social media)
Census (use R package to pull data)
Qualitative interviews/ethnographic observations (observe online behavior)
Question: Does computation change the way we do these methods?

New Methods for Studying Neighborhood Change

What are new methods for studying neighborhood change?
Wiki Surveys (flexible surveys with user input) allourideas.org
Ecological Momentary Assessments (survey people in real time)
Gamification (fun surveys)

New (and old) Methods for Studying Neighborhood Change

Use Application Programming Interfaces (APIs) to gather census data (tidycensus)
Open-access measures (Urban Displacement Project)
Wiki Surveys (flexible surveys with user input) allourideas.org
Gathering text from online sources (scraping)
Gathering images or data from online sources
Link surveys to gathered data

New (and old) Methods for Studying Neighborhood Change

Takeaway: many computational social science approaches blur the lines between new and old methods
Let’s look at some examples!

Gentrification

Much debate about occurrence and extent of gentrification
However: empirical evidence of neighborhood change is limited (surveys, census data)

Using Google Street View to Study Gentrification

Hwang’s solution: use Google Street View to look at neighborhood change
Combined these data with Census data
Compared these estimates to earlier Chicago gentrification estimates

Using Google Street View to Study Gentrification

In pairs: discuss whether a Google Street View approach would be effective for studying gentrification in the Bay Area

Incarceration in the US

The US incarceration system is large (more on this soon)
Little is known about how people re-enter society after incarceration

Using Cell Phone Surveys to Understand Re-Integration

Sugie’s solution: administer cell phones to study participants leaving prison
Conduct “Ecological Momentary Assessments” (daily surveys) to assess well-being, job search, and more

Using Cell Phone Surveys to Understand Re-Integration

In pairs: discuss how Sugie’s approach helps us learn more about human behaviors, and any potential challenges to gathering data this way

Data Modeling activity

Data Modeling

Plot some (or all) of the incarceration data
How would you predict what rates will be in 2030?

Data Modeling

Why Model Data?

As social scientists, we often want to go beyond descriptions of social processes
We may want to make explanations/generalizations or even predictions
In our incarceration data, write an example of:

A description
An explanation/generalization
A prediction

Why Model Data?

Description: In the U.S., between 1970 and 2010, the incarceration rate increased from 161 to 731 per 100,000 persons.
Explanation/Generalization: As time passes, modern societies increasingly rely on carceral solutions to social problems, and incarceration rates increase.
Prediction: In 2030, the U.S. incarceration rate will reach 1,000 per 100,000 persons.

Linear Models

Estimate the line that minimizes squared “residuals”

Linear Models

We could calculate the line of best fit from our incarceration data
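
As a rough illustration of what "minimizing squared residuals" means, the sketch below computes the least-squares slope and intercept by hand and compares them with lm(). The Year and Rate values are copied from the Wikipedia table scraped later in these slides; the object names (inc, b0, b1) are placeholders chosen for this example.

```r
# Year and Rate (per 100,000) copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# closed-form least-squares estimates for Rate = b0 + b1 * Year
b1 <- cov(inc$Year, inc$Rate) / var(inc$Year)
b0 <- mean(inc$Rate) - b1 * mean(inc$Year)

# residuals are observed values minus fitted values
resid_hand <- inc$Rate - (b0 + b1 * inc$Year)
rss_hand <- sum(resid_hand^2)

# the same line, estimated with lm()
fit <- lm(Rate ~ Year, data = inc)

# compare the two sets of estimates
c(intercept = b0, slope = b1)
coef(fit)
```

Any other line drawn through these points would have a larger sum of squared residuals than rss_hand.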

Non-Linear Models

Polynomials: quadratic, cubic, etc.
Tree-based models! Random forests
Neural networks
And more
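
A minimal sketch of the simplest option on this list, a quadratic polynomial, fit with base R and compared with a straight line on the same data. The data are copied from the Wikipedia table scraped later in these slides; the object names are placeholders for this example.

```r
# Year and Rate (per 100,000) copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# straight line versus quadratic polynomial in Year
fit_lin  <- lm(Rate ~ Year, data = inc)
fit_quad <- lm(Rate ~ poly(Year, 2), data = inc)

# in-sample residual sum of squares for each model
rss <- c(linear    = sum(resid(fit_lin)^2),
         quadratic = sum(resid(fit_quad)^2))
rss

# the quadratic model's extrapolation to 2030
pred_2030 <- predict(fit_quad, newdata = data.frame(Year = 2030))
pred_2030
```

The quadratic can never fit the observed data worse than the line, because the line is a special case of it. A better in-sample fit does not by itself mean a better prediction for 2030.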

Splitting Our Data

Data scientists often split their data into training and test sets
Goal: choose model that is likely to predict well out of sample
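
One way to sketch this idea in base R: hold out a few rows, fit candidate models on the rest, and compare their errors on the held-out rows. The seed, the test-set size of 5, and the helper name rmse are arbitrary choices for this example, and the data are copied from the Wikipedia table scraped later in these slides.

```r
# Year and Rate (per 100,000) copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

set.seed(4)  # arbitrary seed so the random split is reproducible

# randomly pick 5 of the 20 rows for the test set
test_rows <- sample(nrow(inc), size = 5)
train <- inc[-test_rows, ]
test  <- inc[test_rows, ]

# root mean squared error of a model's predictions on a data set
rmse <- function(model, data) {
  sqrt(mean((data$Rate - predict(model, newdata = data))^2))
}

# fit on the training rows only
fit_lin  <- lm(Rate ~ Year, data = train)
fit_quad <- lm(Rate ~ poly(Year, 2), data = train)

# compare error on rows the models have not seen
test_rmse <- c(linear = rmse(fit_lin, test), quadratic = rmse(fit_quad, test))
test_rmse
```

With only 20 rows the comparison will vary a lot from one random split to another, so it is worth rerunning with different seeds before drawing conclusions.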

Choosing the Right Model

What is the structure of our data? (numeric, character, binary, etc.)
What do we want to do (describe, explain, predict)?

Final Project Proposal!

Details/examples on the course site

Final Project Proposal

Writing Exercise: describe one idea for a final project
What data will you use? (traditional, new, mix)
What is the structure of your data (numeric, character, binary, etc.)?
What do you want to do (describe, explain, predict)?
What type of model will you use? (none, linear, non-linear)?

Today’s Class

APIs
Web Scraping
Modeling

APIs

What are APIs?

Application Programming Interfaces (APIs) allow us to selectively pull information from an online database to our own R environments
We’ve already used one of these!

Gathering Census Data with TidyCensus

tidycensus allows us to pull census data
To use it, we need an API key
You can create one at api.census.gov/data/key_signup.html

library(tidycensus)
library(tidyverse)

census_api_key("YOUR API KEY GOES HERE")

TidyCensus

Gather data through get_decennial or get_acs functions
How would we get help with these functions?

# gather median age data
age20 <- get_decennial(geography = "state", 
                       variables = "P13_001N", 
                       year = 2020,
                       sumfile = "dhc")

head(age20)

## # A tibble: 6 × 4
##   GEOID NAME                 variable value
##   <chr> <chr>                <chr>    <dbl>
## 1 09    Connecticut          P13_001N  41.1
## 2 10    Delaware             P13_001N  41.1
## 3 11    District of Columbia P13_001N  33.9
## 4 12    Florida              P13_001N  43  
## 5 13    Georgia              P13_001N  37.5
## 6 15    Hawaii               P13_001N  40.8

TidyCensus

To see possible variables, use the load_variables function

# 2020 dhc codebook
v20 <- load_variables(2020, "dhc")

# take a look 
head(v20)

Try Gathering Data!

Set key with census_api_key
Use load_variables to see census variables
Use get_decennial or get_acs to gather data
Bonus: try plotting with ggplot2!

census_api_key("YOUR API KEY GOES HERE")

# 2020 dhc codebook
v20 <- load_variables(2020, "dhc")

# gather median age data
age20 <- get_decennial(geography = "state", 
                       variables = "P13_001N", 
                       year = 2020,
                       sumfile = "dhc")

Try Looking at Data

We can use ggplot to plot our data

# multiracial by state
multi20 <- get_decennial(geography = "state", 
                       variables = "H10_009N", 
                       year = 2020,
                       sumfile = "dhc")


# plot data - how could you rewrite this?
multi20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + # note reorder
  geom_point()

Try Looking at Data

We could add another group with bind_rows

# Asian by state
asian20 <- get_decennial(geography = "state",
                         variables = "H10_006N",
                         year = 2020,
                         sumfile = "dhc")

# combine data with bind_rows
ma20 <- bind_rows(multi20,
                  asian20)
# plot
ma20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point(aes(col = variable))

Try Looking at Data

We can relabel data with %<>%, mutate, and recode

# %<>% comes from magrittr (it pipes and assigns the result back)
library(magrittr)

# relabel data
ma20 %<>%
  mutate(race = recode(variable, 
                       "H10_009N" = "Multi",
                       "H10_006N" = "Asian"))


ma20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point(aes(col = race))

Web Scraping

Scraping Wikipedia Data

First, identify page of interest
For example, Wikipedia on incarceration
Read it into R with read_html

library(rvest)

# read in html
us_inc <- read_html(
  "https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")

# take a look
us_inc

## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Scraping Wikipedia Data

Then, choose table to scrape
Extract the table with html_nodes

# extract nodes of interest
us_inc_table <- html_nodes(us_inc,
xpath = '//*[@id="mw-content-text"]/div[2]/div[5]/table')
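
An XPath copied from the browser depends on the exact position of the table on the page, so it can stop matching when the page is edited. A possible alternative, sketched here on a tiny stand-in page built with minimal_html, is to select tables by CSS class. The class name "wikitable" and the simplified markup are assumptions made for this example; check the real page's HTML before relying on them.

```r
library(rvest)

# tiny stand-in page; the real Wikipedia markup is more deeply nested
page <- minimal_html('
  <div id="mw-content-text">
    <table class="wikitable">
      <tr><th>Year</th><th>Count</th><th>Rate</th></tr>
      <tr><td>1940</td><td>264,834</td><td>201</td></tr>
      <tr><td>1950</td><td>264,620</td><td>176</td></tr>
    </table>
  </div>')

# XPath in the style copied from the browser inspector
by_xpath <- html_elements(page, xpath = '//*[@id="mw-content-text"]/table')

# CSS selector: every table with class "wikitable" (assumed class name)
by_css <- html_elements(page, "table.wikitable")

# html_table() returns a list with one data frame per matched table
tables <- html_table(by_css)
length(tables)
tables[[1]]
```

When a selector matches several tables, index the list (for example tables[[2]]) to pick the one you want.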

Let’s Try it!

Find Wikipedia page (try incarceration first, if finished, look for another)
Read with read_html
Pick a table
Right click, Inspect, Copy XPath
Extract with html_nodes

library(rvest)

# read in html
us_inc <- read_html(
"https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")

# extract nodes of interest
us_inc_table <- html_nodes(us_inc,
    xpath = '//*[@id="mw-content-text"]/div[2]/div[5]/table')

Clean Scraped Data

Use html_table to clean up table
Recall magrittr and dplyr pipes!

library(dplyr)
library(magrittr)

# convert object to table
us_inc_table %<>%
  html_table()

# take another look
us_inc_table %>%
  head()

## [[1]]
## # A tibble: 20 × 3
##     Year Count      Rate
##    <int> <chr>     <int>
##  1  1940 264,834     201
##  2  1950 264,620     176
##  3  1960 346,015     193
##  4  1970 328,020     161
##  5  1980 503,586     220
##  6  1985 744,208     311
##  7  1990 1,148,702   457
##  8  1995 1,585,586   592
##  9  2000 1,937,482   683
## 10  2002 2,033,022   703
## 11  2004 2,135,335   725
## 12  2006 2,258,792   752
## 13  2008 2,307,504   755
## 14  2010 2,270,142   731
## 15  2012 2,228,424   707
## 16  2014 2,217,947   693
## 17  2016 2,157,800   666
## 18  2018 2,102,400   642
## 19  2020 1,675,400   505
## 20  2021 1,767,200   531

Clean Scraped Data

Convert to dataframe with as.data.frame

# convert list to dataframe
us_inc_table %<>%
  as.data.frame()

# take another look
us_inc_table %>% 
  head()

##   Year   Count Rate
## 1 1940 264,834  201
## 2 1950 264,620  176
## 3 1960 346,015  193
## 4 1970 328,020  161
## 5 1980 503,586  220
## 6 1985 744,208  311

Clean Scraped Data

Use html_table to clean up table
Convert to dataframe with as.data.frame

# convert object to table
us_inc_table %<>%
  html_table()

# convert list to dataframe
us_inc_table %<>%
  as.data.frame()

Let’s Look at Our Data!

Recall plotting with ggplot
Set our data, aesthetics

library(ggplot2)

# plot incarceration data
ggplot(data = us_inc_table, aes(x = Year, y = Count))+
  geom_point()

Let’s Look at Our Data!

What is wrong?

Let’s Look at Our Data! (Again)

# plot incarceration data
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()
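
One likely reason the Count plot looks off: the printed table shows Count as <chr>, because the thousands separators keep it from being read as a number. A minimal sketch of one way to convert it, using three rows copied from the table; the object name counts and the new column name Count_num are placeholders for this example.

```r
# three rows copied from the scraped table; Count is text, not numeric
counts <- data.frame(
  Year  = c(1940, 1990, 2021),
  Count = c("264,834", "1,148,702", "1,767,200"),
  stringsAsFactors = FALSE
)

# confirm the column type
class(counts$Count)

# strip the commas, then convert to numeric
counts$Count_num <- as.numeric(gsub(",", "", counts$Count))

counts
```

The same two steps could be applied to us_inc_table$Count before plotting it.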

Models

Why model?

What are three ways that we could analyze our data?
Describe
Explain
Predict

Models - The Easy Way

We could try adding a line to our plot with geom_line
Is the line a model?

library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_line()

Models - The Easy Way

Recall: linear regression

Models - The Easy Way

Add a linear model with geom_smooth

# add linear model
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)

Models - The Easy Way

Try a non-linear model with geom_smooth
What type of model is this?
What if we also remove se=FALSE?

library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(se = FALSE)
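
When no method is given and there are fewer than 1,000 observations, geom_smooth falls back to a loess (local regression) smoother. A rough base R sketch of a comparable fit, using the default loess settings and the data copied from the scraped table, so the curve should be close to the one in the plot:

```r
# Year and Rate (per 100,000) copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# local regression with the default span and degree
fit_loess <- loess(Rate ~ Year, data = inc)

# smoothed values at the observed years
smoothed <- predict(fit_loess)
head(data.frame(Year = inc$Year, Rate = inc$Rate, smoothed = smoothed))

# with default settings, loess does not extrapolate beyond the observed years
pred_2030 <- predict(fit_loess, newdata = data.frame(Year = 2030))
pred_2030
```

This is one reason a smoother that describes the data well is not automatically useful for predicting 2030.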

Models - The Hard Way

ggplot and geom_smooth are great for visualization
However, to understand/modify our model we want to build it ourselves
We will try this with tidymodels

Tidymodels

First, install tidymodels (once), then load it

# install.packages("tidymodels")  # run once if not yet installed
library(tidymodels)

Tidymodels

We may want to split our data into training and test sets
Why do we do this?

Tidymodels

Split into training and test sets with initial_split

# split data
splits <- initial_split(us_inc_table)


df_train <- training(splits)

df_test <- testing(splits)

Try Splitting Your Data!

Split into training and test sets with initial_split
What proportion of your data are in training/test?
How would you change this proportion?

# split data
splits <- initial_split(us_inc_table)


df_train <- training(splits)

df_test <- testing(splits)

Fit the Model

In our linear model, we have \(Y_i = \beta_0 + \beta_1X_i\)
We want to calculate \(\beta_0\) and \(\beta_1\)
We will first specify that we want a linear model with linear_reg and set_engine("lm")
Notice the pipe (%>%) - how would we write this differently?

library(tidymodels)

# specify the model engine
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

Fit the Model

Now we can fit the model with fit(Y~X)

# fit the model
lm_form_fit <- lm_model %>% 
  fit(Rate ~ Year, data = df_train)

Try Fitting the Model

Try running it all! What do you make of the results?
What if you run it with a different set of data (e.g. the full data)?

# specify the model engine
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

# fit the model
lm_form_fit <- lm_model %>% 
  fit(Rate ~ Year, data = df_train)

# look at results
lm_form_fit

Try Fitting the Model

# look at results
lm_form_fit

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Rate ~ Year, data = data)
## 
## Coefficients:
## (Intercept)         Year  
##  -16996.867        8.803
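
The two coefficients define the fitted line Rate = intercept + slope * Year. A minimal sketch of what that line implies for 2030, using the rounded coefficients from the printed output (so the result is approximate); the function name predict_rate is a placeholder for this example.

```r
# rounded coefficients copied from the printed model output
b0 <- -16996.867  # intercept
b1 <- 8.803       # change in Rate for each additional year

# predicted incarceration rate (per 100,000) for a given year
predict_rate <- function(year) {
  b0 + b1 * year
}

predict_rate(2030)
```

This gives roughly 873 per 100,000 for 2030. Because the training set is drawn at random, refitting the model on a different split will give somewhat different coefficients and a different prediction.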

Make Predictions

We can make predictions for test data or out of sample data
How do these compare to the true values?

# examine predictions
predict(lm_form_fit, new_data = df_test)

## # A tibble: 5 × 1
##   .pred
##   <dbl>
## 1  257.
## 2  433.
## 3  750.
## 4  785.
## 5  794.

Make Predictions

We’ve essentially extended our linear model
How could we make better predictions?

Recap

APIs can be useful for gathering data from organizations
Scraping is another tool to pull information from online sources
What were the steps we took to model our data?
Split our data with initial_split
Specify the model with linear_reg (or other) and set_engine, then fit it with fit
Predict with predict

Final Project Proposals and PSets

Details/examples on the course site

Questions?
