2026-4-20

Groups!

##           group 1           group 2           group 3         group 4
## 1 Kang, Christine     Moore, Allana   Randall, Javion                
## 2     Ong, Alyssa Mahoney, Brigette       Qin, Celine Batson, Anthony
## 3   Leahy, Olivia      Myoung, Sein       Smith, Reid    Mendoza, Ava
## 4  Devir, Lindsey       Pham, Canon Wolfenstein, Luci   Pacheco, Alex
##           group 5
## 1    Barga, Jolie
## 2 Bell, Mary Rose
## 3  Knowles, Genny
## 4 Kang, Christine

Warm-up

  • Using the definition provided, what data could you gather or use to measure gentrification in Bay Area neighborhoods?
  • Consider custom-made and ready-made sources

Today’s Class

  • Warm-up: neighborhood change
  • Gathering data in the 21st century
  • Activity: data modeling
  • Data models

Wednesday’s Class

  • Using APIs
  • Web Scraping
  • Data Models
  • Prediction

Office Hours

  • Office Hours: Fridays, 1:30pm-3:00pm (Tyler)
  • Tuesdays, 10:30am-12:00pm (Yao)

Miscellaneous

  • Week 4 optional readings
  • Final project proposal upcoming!
  • Topic/methods open

Gathering Data in the Twenty-first Century

Traditional Methods for Studying Neighborhood Change

  • What are traditional methods for studying neighborhood change?
  • Probabilistic surveys!
  • Census
  • Qualitative interviews/ethnographic observations
  • Question: Does computation change the way we do these methods?

Traditional Methods for Studying Neighborhood Change

  • What are traditional methods for studying neighborhood change?
  • Probabilistic surveys! (recruit respondents on social media)
  • Census (use R package to pull data)
  • Qualitative interviews/ethnographic observations (observe online behavior)
  • Question: Does computation change the way we do these methods?

New Methods for Studying Neighborhood Change

  • What are new methods for studying neighborhood change?
  • Wiki Surveys (flexible surveys with user input; allourideas.org)
  • Ecological Momentary Assessments (survey people in real time)
  • Gamification (fun surveys)

New (and old) Methods for Studying Neighborhood Change

  • Use Application Programming Interfaces (APIs) to gather census data (tidycensus)
  • Open-access measures (Urban Displacement Project)
  • Wiki Surveys (flexible surveys with user input; allourideas.org)
  • Gathering text from online sources (scraping)
  • Gathering images or data from online sources
  • Link surveys to gathered data

New (and old) Methods for Studying Neighborhood Change

  • Takeaway: many computational social science approaches blur the lines between new and old methods
  • Let’s look at some examples!

Gentrification

  • Much debate about occurrence and extent of gentrification
  • However: empirical evidence of neighborhood change is limited (surveys, census data)

Using Google Street View to Study Gentrification

  • Hwang’s solution: use Google Street View to look at neighborhood change

Using Google Street View to Study Gentrification

  • Hwang’s solution: use Google Street View to look at neighborhood change
  • Combined these data with Census data
  • Compared these estimates to earlier Chicago gentrification estimates

Using Google Street View to Study Gentrification

  • In pairs: discuss whether a Google Street View approach would be effective for studying gentrification in the Bay Area

Incarceration in the US

  • The US incarceration system is large (more on this soon)
  • Little is known about how people re-enter society after incarceration

Using Cell Phone Surveys to Understand Re-Integration

  • Sugie’s solution: give cell phones to study participants leaving prison
  • Conduct “Ecological Momentary Assessments” (daily surveys) to assess well-being, job search, and more

Using Cell Phone Surveys to Understand Re-Integration

  • In pairs: discuss how Sugie’s approach helps us learn more about human behaviors, and any potential challenges to gathering data this way

Data Modeling activity

Data Modeling

  • Plot some (or all) of the incarceration data
  • How would you predict what rates will be in 2030?

Data Modeling

Why Model Data?

  • As social scientists, we often want to go beyond descriptions of social processes
  • We may want to make explanations/generalizations or even predictions
  • In our incarceration data, write an example of:
  1. A description
  2. An explanation/generalization
  3. A prediction

Why Model Data?

  1. Description: In the U.S., between 1970 and 2010, the incarceration rate increased from 161 to 731 (persons per 100,000).

Why Model Data?

  1. Description: In the U.S., between 1970 and 2010, the incarceration rate increased from 161 to 731 (persons per 100,000).
  2. Explanation/Generalization: As time passes, modern societies increasingly rely on carceral solutions to social problems, and incarceration rates increase.

Why Model Data?

  1. Description: In the U.S., between 1970 and 2010, the incarceration rate increased from 161 to 731 (persons per 100,000).
  2. Explanation/Generalization: As time passes, modern societies increasingly rely on carceral solutions to social problems, and incarceration rates increase.
  3. Prediction: In 2030, the U.S. incarceration rate will reach 1,000 (persons per 100,000).

Linear Models

  • Estimate the line that minimizes the sum of squared “residuals” (the vertical distances between each point and the line)

Linear Models

  • We could calculate line of best fit from our incarceration data
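As a minimal sketch of what minimizing squared residuals means, the code below computes the least-squares slope and intercept by hand and checks them against lm(). The Year and Rate values are copied from the Wikipedia incarceration table scraped later in these slides; the object names (inc, slope, intercept, rss) are illustrative choices, not part of the original materials.

```r
# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Closed-form least-squares estimates for a single predictor
x_bar <- mean(inc$Year)
y_bar <- mean(inc$Rate)
slope <- sum((inc$Year - x_bar) * (inc$Rate - y_bar)) / sum((inc$Year - x_bar)^2)
intercept <- y_bar - slope * x_bar

# Residuals (observed minus fitted) and the quantity the line minimizes
residuals_manual <- inc$Rate - (intercept + slope * inc$Year)
rss <- sum(residuals_manual^2)

# lm() should recover the same two coefficients
fit <- lm(Rate ~ Year, data = inc)
coef(fit)
c(intercept = intercept, slope = slope)
```

Any other line drawn through these points would give a larger value of rss.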

Non-Linear Models

  • Polynomials: quadratic, cubic, etc.
  • Tree-based models! Random forests
  • Neural networks
  • And more
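One hedged illustration of the polynomial option: the sketch below fits a straight line and a quadratic to the same incarceration rates (copied from the Wikipedia table scraped later in these slides) and compares how closely each follows the data. The names inc, fit_lin, and fit_quad are illustrative.

```r
# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Straight line versus a quadratic (degree-2 polynomial) in Year
fit_lin  <- lm(Rate ~ Year, data = inc)
fit_quad <- lm(Rate ~ poly(Year, 2), data = inc)

# In-sample R-squared for each model
r2 <- c(linear = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
r2

# Each model's extrapolation to 2030
pred_2030 <- c(linear = unname(predict(fit_lin, newdata = data.frame(Year = 2030))),
               quadratic = unname(predict(fit_quad, newdata = data.frame(Year = 2030))))
pred_2030
```

A more flexible model never fits the observed data worse, but that says little about how well it extrapolates to years outside the data.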

Splitting Our Data

  • Data scientists often split their data into training and test sets
  • Goal: choose model that is likely to predict well out of sample
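To make the train/test idea concrete, here is a base R sketch (the tidymodels version comes later in these slides). It holds out 5 of the 20 observations, fits two candidate models on the rest, and compares their error on the held-out rows. The seed, the 15/5 split, and the helper name rmse are assumptions made for illustration.

```r
# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Randomly assign 15 rows to training; the remaining 5 become the test set
set.seed(1)
train_rows <- sample(nrow(inc), size = 15)
train <- inc[train_rows, ]
test  <- inc[-train_rows, ]

# Root mean squared error: typical size of a prediction miss
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit both candidates on the training rows only
fit_lin  <- lm(Rate ~ Year, data = train)
fit_quad <- lm(Rate ~ poly(Year, 2), data = train)

# Score each model on rows it has never seen
rmse_lin  <- rmse(test$Rate, predict(fit_lin, newdata = test))
rmse_quad <- rmse(test$Rate, predict(fit_quad, newdata = test))
c(linear = rmse_lin, quadratic = rmse_quad)
```

The model with the lower test error is the better candidate for out-of-sample prediction, though with only five test rows the comparison is noisy.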

Choosing the Right Model

  • What is the structure of our data? (numeric, character, binary, etc.)
  • What do we want to do (describe, explain, predict)?

More on Modeling

  • This class is short!
  • Optional readings have more information on types of modeling

Final Project Proposal

  • Writing Exercise: describe one idea for a final project
  • What data will you use? (traditional, new, mix)
  • What is the structure of your data? (numeric, character, binary, etc.)
  • What do you want to do? (describe, explain, predict)
  • What type of model will you use? (none, linear, non-linear)

Today’s Class

  • APIs
  • Web Scraping
  • Modeling

APIs

What are APIs?

  • Application Programming Interfaces (APIs) allow us to selectively pull information from an online database to our own R environments
  • We’ve already used one of these!

Gathering Census Data with TidyCensus

library(tidycensus)
library(tidyverse)

census_api_key("YOUR API KEY GOES HERE")

TidyCensus

  • Gather data through get_decennial or get_acs functions
  • How would we get help with these functions?
# gather age data
age20 <- get_decennial(geography = "state", 
                       variables = "P13_001N", 
                       year = 2020,
                       sumfile = "dhc")

head(age20)
## # A tibble: 6 × 4
##   GEOID NAME                 variable value
##   <chr> <chr>                <chr>    <dbl>
## 1 09    Connecticut          P13_001N  41.1
## 2 10    Delaware             P13_001N  41.1
## 3 11    District of Columbia P13_001N  33.9
## 4 12    Florida              P13_001N  43  
## 5 13    Georgia              P13_001N  37.5
## 6 15    Hawaii               P13_001N  40.8

TidyCensus

  • To see possible variables, use the load_variables function
# 2020 dhc codebook
v20 <- load_variables(2020, "dhc")

# take a look 
head(v20)

Try Gathering Data!

  • Set key with census_api_key
  • Use load_variables to see census variables
  • Use get_decennial or get_acs to gather data
  • Bonus: try plotting with ggplot2!
census_api_key("YOUR API KEY GOES HERE")

# 2020 dhc codebook
v20 <- load_variables(2020, "dhc")

# gather age data
age20 <- get_decennial(geography = "state", 
                       variables = "P13_001N", 
                       year = 2020,
                       sumfile = "dhc")

Try Looking at Data

  • We can use ggplot to plot our data
# multiracial by state
multi20 <- get_decennial(geography = "state", 
                       variables = "H10_009N", 
                       year = 2020,
                       sumfile = "dhc")


# plot data - how could you rewrite this?
multi20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + # note reorder
  geom_point()
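One possible answer to the rewrite question, sketched without the pipe: pass the data frame to ggplot() directly. The small multi20_toy data frame uses made-up placeholder values so the example runs without a Census API key; with real data you would pass multi20 instead.

```r
library(ggplot2)

# Made-up placeholder values standing in for the multi20 tibble
multi20_toy <- data.frame(
  NAME  = c("State A", "State B", "State C"),
  value = c(300, 100, 200)
)

# Same plot as the piped version, with the data given as the first argument
p <- ggplot(data = multi20_toy,
            mapping = aes(x = value, y = reorder(NAME, value))) +
  geom_point()
p
```

Both forms build the same plot; the pipe simply supplies the data as the first argument to ggplot().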

Try Looking at Data

  • We could add another group with bind_rows
# Asian by state
asian20 <- get_decennial(geography = "state", 
                       variables = "H10_006N", 
                       year = 2020,
                       sumfile = "dhc")

# combine data with bind_rows
ma20 <- bind_rows(multi20,
                  asian20)
# plot
ma20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point(aes(col = variable))

Try Looking at Data

  • We can relabel data with %<>%, mutate, and recode
# relabel data (the %<>% assignment pipe comes from magrittr)
library(magrittr)

ma20 %<>%
  mutate(race = recode(variable, 
                       "H10_009N" = "Multi",
                       "H10_006N" = "Asian"))


ma20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point(aes(col = race))

Web Scraping

Scraping Wikipedia Data

library(rvest)

# read in html
us_inc <- read_html(
  "https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")

# take a look
us_inc
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Scraping Wikipedia Data

  • Then, choose table to scrape
  • Extract the table with html_nodes
# extract nodes of interest
us_inc_table <- html_nodes(us_inc,
xpath = '//*[@id="mw-content-text"]/div[2]/div[5]/table')

Let’s Try it!

  • Find a Wikipedia page (try incarceration first; if you finish, look for another)
  • Read with read_html
  • Pick a table
  • Right click, Inspect, Copy XPath
  • Extract with html_nodes
library(rvest)

# read in html
us_inc <- read_html(
"https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")

# extract nodes of interest
us_inc_table <- html_nodes(us_inc,
    xpath = '//*[@id="mw-content-text"]/div[2]/div[5]/table')
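A copied XPath like the one above can stop matching if the page layout changes. As a hedged alternative, the sketch below selects a table by CSS class instead. It runs offline on a tiny hand-written HTML snippet built with rvest's minimal_html(); the snippet's rows are copied from the incarceration table, and the class name wikitable is an assumption about how the target table is marked up. html_element() requires rvest 1.0 or later.

```r
library(rvest)

# A tiny stand-in page so the example runs without an internet connection
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Year</th><th>Count</th><th>Rate</th></tr>
    <tr><td>1940</td><td>264,834</td><td>201</td></tr>
    <tr><td>1950</td><td>264,620</td><td>176</td></tr>
  </table>')

# Select the first table with class "wikitable" and convert it to a tibble
toy_table <- page %>%
  html_element("table.wikitable") %>%
  html_table()

toy_table
```

On a real page with several matching tables, html_elements() (plural) returns all of them, and you would pick the one you need by position.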

Clean Scraped Data

  • Use html_table to clean up table
  • Recall magrittr and dplyr pipes!
library(dplyr)
library(magrittr)

# convert object to table
us_inc_table %<>%
  html_table()

# take another look
us_inc_table %>%
  head()
## [[1]]
## # A tibble: 20 × 3
##     Year Count      Rate
##    <int> <chr>     <int>
##  1  1940 264,834     201
##  2  1950 264,620     176
##  3  1960 346,015     193
##  4  1970 328,020     161
##  5  1980 503,586     220
##  6  1985 744,208     311
##  7  1990 1,148,702   457
##  8  1995 1,585,586   592
##  9  2000 1,937,482   683
## 10  2002 2,033,022   703
## 11  2004 2,135,335   725
## 12  2006 2,258,792   752
## 13  2008 2,307,504   755
## 14  2010 2,270,142   731
## 15  2012 2,228,424   707
## 16  2014 2,217,947   693
## 17  2016 2,157,800   666
## 18  2018 2,102,400   642
## 19  2020 1,675,400   505
## 20  2021 1,767,200   531

Clean Scraped Data

  • Convert to dataframe with as.data.frame
# convert list to dataframe
us_inc_table %<>%
  as.data.frame()

# take another look
us_inc_table %>% 
  head()
##   Year   Count Rate
## 1 1940 264,834  201
## 2 1950 264,620  176
## 3 1960 346,015  193
## 4 1970 328,020  161
## 5 1980 503,586  220
## 6 1985 744,208  311

Clean Scraped Data

  • Use html_table to clean up table
  • Convert to dataframe with as.data.frame
# convert object to table
us_inc_table %<>%
  html_table()

# convert list to dataframe
us_inc_table %<>%
  as.data.frame()

Let’s Look at Our Data!

  • Recall plotting with ggplot
  • Set our data, aesthetics
library(ggplot2)

# plot incarceration data
ggplot(data = us_inc_table, aes(x = Year, y = Count))+
  geom_point()

Let’s Look at Our Data!

  • What is wrong?

Let’s Look at Our Data! (Again)

# plot incarceration data
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()
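The Count column was scraped as text (for example "264,834"), which is why it does not behave like a number when plotted. A minimal sketch of one fix, assuming the only problem is the comma separators, is to strip the commas and convert. The three rows below are copied from the scraped table; the object name inc_counts is illustrative.

```r
# First three rows of the scraped table, with Count stored as text
inc_counts <- data.frame(
  Year  = c(1940, 1950, 1960),
  Count = c("264,834", "264,620", "346,015")
)

# Remove the thousands separators, then convert to numeric
inc_counts$Count <- as.numeric(gsub(",", "", inc_counts$Count))

str(inc_counts)
```

After this step, Count can be mapped to the y axis like any other numeric variable.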

Models

Why model?

  • What are three ways that we could analyze our data?
  • Describe
  • Explain
  • Predict

Models - The Easy Way

  • We could try adding a line to our plot with geom_line
  • Is the line a model?
library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_line()

Models - The Easy Way

  • Recall: linear regression

Models - The Easy Way

  • Add a linear model with geom_smooth
# add linear model
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)

Models - The Easy Way

  • Try a non-linear model with geom_smooth
  • What type of model is this?
  • What if we also remove se=FALSE?
library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(se = FALSE)
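With fewer than 1,000 observations, geom_smooth() defaults to a loess smoother (local regression). The sketch below fits the same kind of model directly with base R's loess(), using default settings as an assumption, so the smoothed values can be inspected as numbers. The data are copied from the scraped incarceration table.

```r
# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Local regression: each fitted value comes from a weighted fit to nearby points
fit_loess <- loess(Rate ~ Year, data = inc)

# Smoothed rate at each observed year
inc$smoothed <- predict(fit_loess)
head(inc)
```

Unlike the linear model, loess has no single slope to report, which makes it better suited to describing a trend than to explaining or extrapolating it.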

Models - The Hard Way

  • ggplot and geom_smooth are great for visualization
  • However, to understand/modify our model we want to build it ourselves
  • We will try this with tidymodels

Tidymodels

  • First, install tidymodels (once), then load it
# install.packages("tidymodels")  # run once if not yet installed
library(tidymodels)

Tidymodels

  • We may want to split our data into training and test sets
  • Why do we do this?

Tidymodels

  • Split into training and test sets with initial_split
# split data
splits <- initial_split(us_inc_table)


df_train <- training(splits)

df_test <- testing(splits)

Try Splitting Your Data!

  • Split into training and test sets with initial_split
  • What proportion of your data are in training/test?
  • How would you change this proportion?
# split data
splits <- initial_split(us_inc_table)


df_train <- training(splits)

df_test <- testing(splits)
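A sketch of one way to change the proportion: initial_split() takes a prop argument, which defaults to 3/4 of rows going to training. The example below sets it to 0.8 and fixes a seed so the random split is reproducible. It loads rsample (the tidymodels package that provides initial_split) on its own; the seed value and the name splits_80 are arbitrary choices, and the data are copied from the scraped incarceration table.

```r
library(rsample)

# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Fix the seed so the same rows land in training each time
set.seed(2026)

# Send 80% of rows to training instead of the default 75%
splits_80 <- initial_split(inc, prop = 0.8)

df_train <- training(splits_80)
df_test  <- testing(splits_80)

# Check how many rows ended up in each set
c(train = nrow(df_train), test = nrow(df_test))
```

Without set.seed(), each run produces a different split, so fitted coefficients and predictions will change from run to run.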

Fit the Model

  • In our linear model, we have \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\)
  • We want to estimate \(\beta_0\) and \(\beta_1\)
  • We will first specify that we want a linear model with linear_reg and set_engine("lm")
  • Notice the pipe (%>%) - how would we write this differently?
library(tidymodels)

# specify the model engine
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

Fit the Model

  • Now we can fit the model with fit(Y~X)
# fit the model
lm_form_fit <- lm_model %>% 
  fit(Rate ~ Year, data = df_train)

Try Fitting the Model

  • Try running it all! What do you make of the results?
  • What if you run it with a different set of data (e.g. the full data)?
# specify the model engine
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

# fit the model
lm_form_fit <- lm_model %>% 
  fit(Rate ~ Year, data = df_train)

# look at results
lm_form_fit

Try Fitting the Model

# look at results
lm_form_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Rate ~ Year, data = data)
## 
## Coefficients:
## (Intercept)         Year  
##  -16996.867        8.803

Make Predictions

  • We can make predictions for test data or out-of-sample data
  • How do these compare to the true values?
# examine predictions
predict(lm_form_fit, new_data = df_test)
## # A tibble: 5 × 1
##   .pred
##   <dbl>
## 1  257.
## 2  433.
## 3  750.
## 4  785.
## 5  794.
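To compare predictions with the true values, one option is to place them side by side and summarize the misses with a single number. The sketch below reruns the split, fit, and predict steps on the scraped rates, then uses bind_cols() and yardstick's rmse() (both attached by tidymodels). The seed and the object names results and test_rmse are illustrative assumptions.

```r
library(tidymodels)

# Incarceration rates per 100,000, copied from the scraped Wikipedia table
inc <- data.frame(
  Year = c(1940, 1950, 1960, 1970, 1980, 1985, 1990, 1995, 2000, 2002,
           2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2021),
  Rate = c(201, 176, 193, 161, 220, 311, 457, 592, 683, 703,
           725, 752, 755, 731, 707, 693, 666, 642, 505, 531)
)

# Split, then fit the linear model on the training rows
set.seed(2026)
splits <- initial_split(inc)
df_train <- training(splits)
df_test <- testing(splits)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Rate ~ Year, data = df_train)

# Put predictions next to the true test values
results <- predict(lm_fit, new_data = df_test) %>%
  bind_cols(df_test)
results

# Root mean squared error on the test set
test_rmse <- results %>%
  rmse(truth = Rate, estimate = .pred)
test_rmse
```

A smaller RMSE means predictions sit closer to the observed rates; swapping in a different model specification and rerunning gives a like-for-like comparison.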

Make Predictions

  • We’ve essentially extended our linear model
  • How could we make better predictions?

Recap

  • APIs can be useful for gathering data from organizations
  • Scraping is another tool to pull information from online sources
  • What were the steps we took to model our data?
  • Split our data with initial_split
  • Specify the model with linear_reg (or other) and set_engine, then fit it with fit
  • Predict with predict

Final Project Proposals and PSets