DATA 607 Final Project: Bicycles and Wealth in Chicago

Kyle Gilde

May 6, 2017

Introduction

One of the best parts of summer in a large city is getting to use the bike-share program.

Chicago is no different in this respect. However, Chicago’s Divvy bike-share program has been accused of disproportionately locating its stations in the whiter and more affluent parts of the city.

Divvy Membership Skews White and Wealthy, But Hopefully Not for Long (StreetsBlog, 9/10/15)

Report: In Chicago, Bike Amenities Correlate With Gentrification (StreetsBlog, 1/15/16)

Divvy expansion leaves some areas feeling like third wheel (Suntimes, 4/26/15)

This project will use publicly available data to attempt to confirm or disconfirm these accusations.

Research Question

Are the Divvy bike-share stations disproportionately located in Chicago’s wealthier zip codes?

Hypotheses

\(H_0: \beta_1 = 0\) There is no relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.

\(H_A: \beta_1 > 0\) There is a positive relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.

The Variables & Sources of Data

The Explanatory Variable: Sell Prices from Trulia.com

  • A zip code’s median sell price for single-family homes on Trulia.com will be used as a proxy to measure the wealth of that part of Chicago.

  • I scraped 4,500 sell prices from the JSON embedded in 150 browse pages.

Web Scraping

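#Packages used for scraping, parsing and summarizing
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_detect(), str_locate(), str_sub(), str_trim()
library(jsonlite) # fromJSON()
library(dplyr)    # %>%, transmute(), group_by(), summarise(), left_join()

#Inputs to loop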
base_url <- "https://www.trulia.com/for_sale/Chicago,IL/SINGLE-FAMILY_HOME_type/"
pages <- 150
trulia_file <- "Trulia_file.csv"
aggregate_df <- data.frame()
reg_ex1 <- "var appState = "
reg_ex2 <- ";\\n  var googleMapURL ="
my_samp <- seq(1, 3, by = .01)

#Loop to scrape pages (skipped when the cached CSV already exists)
if (!file.exists(trulia_file)){
  for (i in 1:pages){
    #pagination
    current_url <- ifelse(i == 1,
                          base_url,
                          paste0(base_url, i, "_p/")
                          )
    #get html    
    trulia_html <- current_url %>%
      read_html() %>%
      html_nodes("script") %>%
      html_text()
    
    #get json from html
    json_text <- trulia_html[str_detect(trulia_html, reg_ex1)]
    begin <- as.integer(str_locate(json_text, reg_ex1)[1, 2])
    ending <- as.integer(str_locate(json_text, reg_ex2)[1, 1]) - 1
    
    #parse the JSON
    json <- json_text %>%
      str_sub(begin, ending) %>%
      str_trim() %>%
      fromJSON()
  
    #store data in DF  
    current_df <- data.frame(iteration = i,
                            id = json$page$cards$id,
                            price = json$page$cards$price,
                            zip_code = json$page$cards$zip,
                            location = json$page$cards$footer$location)
  
    aggregate_df <- rbind(aggregate_df, current_df)
  
    #pause 1 to 3 seconds (chosen at random) between requests
    rand_delay <- sample(my_samp, 1)
    Sys.sleep(rand_delay)
  }
  write.csv(aggregate_df, file = trulia_file)
}  
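
The extraction step inside the loop can be illustrated in isolation. The sketch below uses base R on a made-up script string; the real Trulia page content is not reproduced here, and `script_text` and the card fields in it are assumptions for illustration only.

```r
# Toy stand-in for the text of one <script> node (hypothetical content).
script_text <- 'var appState = {"page":{"cards":[{"id":1,"price":"$250,000"}]}};\n  var googleMapURL = "x";'

# The JSON sits between these two literal markers.
start_marker <- "var appState = "
end_marker <- ";\n  var googleMapURL ="

# Locate both markers as fixed strings rather than regular expressions.
start_pos <- regexpr(start_marker, script_text, fixed = TRUE)
end_pos <- regexpr(end_marker, script_text, fixed = TRUE)

# The JSON starts right after the first marker and ends right before the second.
json_start <- as.integer(start_pos) + attr(start_pos, "match.length")
json_end <- as.integer(end_pos) - 1
json_string <- trimws(substr(script_text, json_start, json_end))
```

In the loop above, the equivalent string is what gets passed to fromJSON().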

Summarized the scraped data by zip code

trulia_data <- read.csv(trulia_file, stringsAsFactors = F)

trulia_df <- trulia_data %>%
  #strip "$", "+" and "," so that the price parses as an integer
  transmute(sell_price = as.integer(str_replace_all(price, "\\$|\\+|,", "")),
            zip_code = as.character(zip_code)) %>%
  na.omit() %>%
  group_by(zip_code) %>%
  summarise(median_sell_price = median(sell_price),
            n = n())

glimpse(trulia_df)
## Observations: 57
## Variables: 3
## $ zip_code          <chr> "60601", "60604", "60605", "60607", "60608",...
## $ median_sell_price <dbl> 23500.0, 699000.0, 441000.0, 650000.0, 26500...
## $ n                 <int> 1, 1, 5, 11, 30, 57, 43, 11, 34, 60, 185, 43...
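
To make the cleaning and summarizing steps concrete, here is a minimal base R sketch of the same idea on made-up listings. The prices and the "Contact for price" entry are invented for illustration and do not come from the scraped file.

```r
# Made-up listings for illustration; not values from the scraped file.
listings <- data.frame(
  price = c("$250,000", "$1,250,000+", "$99,900", "Contact for price"),
  zip_code = c("60601", "60601", "60605", "60605"),
  stringsAsFactors = FALSE
)

# Strip "$", "+" and ",", then convert; text that is not a number becomes NA.
listings$sell_price <- suppressWarnings(
  as.integer(gsub("\\$|\\+|,", "", listings$price))
)

# Drop unparseable rows, then compute the median price and listing count per zip code.
clean <- listings[!is.na(listings$sell_price), ]
median_by_zip <- tapply(clean$sell_price, clean$zip_code, median)
n_by_zip <- table(clean$zip_code)
```

Zip codes with very few listings, such as the single listing behind the 60601 median in the output above, give a noisy estimate of the median.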

The Response Variable: Divvy Station Docks from City of Chicago API

The two variables needed are totalDocks and postalCode. However, postalCode is empty for most stations (470 of 581).

table(divvy_data$postalCode == "")

FALSE  TRUE 
  111   470 

Fortunately, since the data set includes the longitude and latitude of each station, we can use ggmap's revgeocode() to obtain the addresses from the Google Maps API.

coordinates <- cbind(divvy_data$longitude, divvy_data$latitude)
divvy_file <- "DivvyAddresses.csv"

if (!file.exists(divvy_file)) {
    ## Code citation: http://stackoverflow.com/a/22919546
    ## Reverse-geocode each (longitude, latitude) pair into a street address
    address <- do.call(rbind, lapply(1:nrow(coordinates),
        function(i) revgeocode(coordinates[i, ])))
    write.csv(data.frame(address = address), file = divvy_file)
}


divvy <- cbind(divvy_data, read.csv(divvy_file))
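
The report does not show how the per-zip-code dock totals are derived from these addresses. One plausible route is sketched below in base R, under the assumption that each geocoded address contains the zip code right after the state abbreviation. The station rows, the address format and the name `divvy_df_sketch` are all hypothetical.

```r
# Made-up stations; the address format is an assumption about the geocoder output.
stations <- data.frame(
  totalDocks = c(15, 23, 19, 11),
  address = c("100 Example St, Chicago, IL 60601, USA",
              "200 Example Ave, Chicago, IL 60601, USA",
              "300 Example Blvd, Chicago, IL 60605, USA",
              "Unnamed Road, Chicago, IL, USA"),
  stringsAsFactors = FALSE
)

# Find a 5-digit zip code that directly follows "IL " (Perl-style lookbehind).
m <- regexpr("(?<=IL )[0-9]{5}", stations$address, perl = TRUE)
has_zip <- as.integer(m) > 0
stations$zip_code <- NA_character_
stations$zip_code[has_zip] <- regmatches(stations$address, m)

# Sum docks per zip code; rows without a zip code are dropped by aggregate().
divvy_df_sketch <- aggregate(totalDocks ~ zip_code, data = stations, FUN = sum)
names(divvy_df_sketch)[2] <- "docks"
```

Stations whose address lacks a zip code drop out of the totals in this sketch, so a real pipeline would want to count how many are lost.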

Combined the Data Sets using a Left Join

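The final data set contains 53 observations, one per zip code, each with the median sell price and the number of Divvy station docks. Because the left join keeps every zip code in the Trulia data, zip codes without a Divvy station are assigned 0 docks.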

divvy_trulia <- left_join(trulia_df, divvy_df, by = "zip_code") %>% 
  filter(str_detect(zip_code, "^606")) %>% #keep only zip codes that start with 606
  transmute(docks = ifelse(is.na(docks), 0, docks),
         median_sell_price_1000s = median_sell_price/1000,
         zip_code = zip_code)

glimpse(divvy_trulia)
## Observations: 53
## Variables: 3
## $ docks                   <dbl> 372, 124, 405, 457, 376, 330, 356, 317...
## $ median_sell_price_1000s <dbl> 23.5000, 699.0000, 441.0000, 650.0000,...
## $ zip_code                <chr> "60601", "60604", "60605", "60607", "6...
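
The effect of the left join and the NA replacement can be seen on a small example. This is a base R sketch with made-up values; merge() with all.x = TRUE stands in for left_join(), and zip code "60699" is a hypothetical zip code that has listings but no stations.

```r
# Made-up inputs for illustration only.
prices <- data.frame(zip_code = c("60601", "60605", "60699"),
                     median_sell_price = c(300000, 441000, 150000),
                     stringsAsFactors = FALSE)
docks <- data.frame(zip_code = c("60601", "60605", "60610"),
                    docks = c(38, 19, 27),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of `prices`, as a left join does.
joined <- merge(prices, docks, by = "zip_code", all.x = TRUE)

# Zip codes with no matching station come back as NA; treat them as 0 docks.
joined$docks[is.na(joined$docks)] <- 0

# Rescale the price to thousands of dollars, as in the model data.
joined$median_sell_price_1000s <- joined$median_sell_price / 1000
```

Note the asymmetry: a zip code with stations but no listings ("60610" here) is dropped, while a zip code with listings but no stations is kept with 0 docks.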

Linear Regression Model

Model Summary

## 
## Call:
## lm(formula = docks ~ median_sell_price_1000s, data = divvy_trulia)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -176.73 -103.32  -22.84   79.41  292.27 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             75.29288   24.61746   3.059  0.00354 ** 
## median_sell_price_1000s  0.18872    0.03302   5.716 5.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.2 on 51 degrees of freedom
## Multiple R-squared:  0.3905, Adjusted R-squared:  0.3785 
## F-statistic: 32.67 on 1 and 51 DF,  p-value: 5.694e-07

Scatter Plot using ggplot2

Model Interpretation

  • If we cast the model in terms of whole Divvy docks, for each \(\$5,299\) increase in the median sell price of single-family homes, the model expects an increase of \(1\) Divvy station dock for the zip code.

  • In this model, multiple \(R^2\) is \(0.3905\), which means that the model’s least-squares line accounts for approximately \(39\%\) of the variation in the number of Divvy station docks in a zip code.
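
The dollars-per-dock figure follows directly from the slope printed in the model summary, which is expressed in docks per \(\$1,000\) of median sell price. A short sketch of the arithmetic, using only that printed estimate:

```r
# Slope copied from the regression summary: docks per $1,000 of median sell price.
slope_per_1000 <- 0.18872

# Dollars of median sell price associated with one additional dock.
dollars_per_dock <- 1000 / slope_per_1000

# Expected difference in docks for a $100,000 difference in median sell price.
docks_per_100k <- slope_per_1000 * 100
```

With the printed slope this works out to roughly \(\$5,299\) per dock, or about 19 docks per \(\$100,000\).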

Model Diagnostics: Were the necessary conditions met?

While there appears to be linearity, the residuals are not normally distributed and they lack constant variability. We are also uncertain that the sample is representative.
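
Checks of this kind can be run roughly as follows. The sketch uses simulated data with deliberately non-constant variance, not the project data, and the Shapiro-Wilk test and the spread ratio are just one reasonable choice of checks, not necessarily the ones used for this report.

```r
set.seed(607)

# Simulated data whose noise grows with x; not the project data.
x <- runif(53, min = 20, max = 1500)
y <- 75 + 0.19 * x + rnorm(53, sd = 0.15 * x)
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test (a small p-value suggests non-normal residuals).
shapiro_result <- shapiro.test(res)

# Constant variability: compare residual spread in the upper and lower half of x.
spread_ratio <- sd(res[x > median(x)]) / sd(res[x <= median(x)])

# Visual checks: residuals vs. fitted values and a normal Q-Q plot.
# plot(fit, which = 1:2)
```

A spread ratio far from 1 or a fan shape in the residual plot points to non-constant variability.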

Conclusion

Since this was a one-sided hypothesis test and the estimated slope is positive, in the direction of \(H_A\), the p-value is half of the tiny value listed in the regression summary. This would lead us to reject the null hypothesis that there is no relationship between our variables in favor of the alternative hypothesis. However, this is a very tentative conclusion given that the data clearly violated two of the model’s necessary conditions.
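
The one-sided p-value can also be computed directly from the t statistic and residual degrees of freedom printed in the summary. This is a sketch that uses the rounded values shown there, so the result is approximate.

```r
# Values copied from the regression summary.
t_value <- 5.716
df_resid <- 51

# Two-sided p-value, the kind reported by summary() for an lm fit.
p_two_sided <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# One-sided p-value for H_A: slope > 0 (upper tail only).
p_one_sided <- pt(t_value, df = df_resid, lower.tail = FALSE)
```

Halving is only valid because the estimate falls on the side predicted by \(H_A\). Had the slope been negative, the one-sided p-value would have been close to 1.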