Introduction

One of the best parts of summer in a large city is getting to use the bike-share program. Borrowing bikes and riding them from station to station is a fun, convenient & healthy way to get around town.

Chicago is no different in this respect. However, since its inception in 2013, Chicago’s Divvy bike-share program has been a recurring target of the local press. As in the examples listed below, these reports accuse the program of disproportionately locating bike-share stations in more white and affluent parts of the city.

Divvy Membership Skews White and Wealthy, But Hopefully Not for Long (StreetsBlog, 9/10/15)

Report: In Chicago, Bike Amenities Correlate With Gentrication (StreetsBlog, 1/15/16)

Divvy expansion leaves some areas feeling like third wheel (Suntimes, 4/26/15)

This project will use publicly available data to attempt to confirm or disconfirm these accusations.

Research Question

Are the Divvy bike-share stations disproportionately located in Chicago’s wealthier zip codes?

Hypotheses

$H_0: B_1 = 0$ There is no relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.

$H_A: B_1 > 0$ There is a positive relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.

The Variables & Sources of Data

Required R Packages

              
prettydoc TRUE
knitr     TRUE
jsonlite  TRUE
ggmap     TRUE
tidyverse TRUE
stringr   TRUE
rvest     TRUE
psych     TRUE

The Explanatory Variable: Sell Prices from Trulia.com

There was very little publicly available data on wealth at the intra-city granularity. Consequently, a zip code’s median sell price for single-family homes on Trulia.com will be used as a proxy to measure the wealth of that part of Chicago.
After filtering to single-family homes for sale in Chicago, approximately 4,500 sell prices were scraped from the JSON present on 150 browse pages.
However, this data set is likely not a true representative sample of home values in each zip code. Properties that were bank-owned or in default could not be included in this study because their sell prices were not publicly listed.

Web Scraping

#Inputs to loop
base_url <- "https://www.trulia.com/for_sale/Chicago,IL/SINGLE-FAMILY_HOME_type/"
pages <- 150
trulia_file <- "Trulia_file.csv"
aggregate_df <- data.frame()
reg_ex1 <- "var appState = "
reg_ex2 <- ";\\n  var googleMapURL ="
my_samp <- seq(1, 3, by = .01)

#Loop to scrape pages
if (!trulia_file %in% list.files(getwd())){
  for (i in 1:pages){
    #pagination
    current_url <- ifelse(i == 1,
                          base_url,
                          paste0(base_url, i, "_p/")
                          )
    #get html    
    trulia_html <- current_url %>%
      read_html() %>%
      html_nodes("script") %>%
      html_text()
    
    #get json from html
    json_text <- trulia_html[str_detect(trulia_html, reg_ex1)]
    begin <- as.integer(str_locate(json_text, reg_ex1)[1, 2])
    ending <- as.integer(str_locate(json_text, reg_ex2)[1, 1]) - 1
    
    #parse the JSON
    json <- json_text %>%
      str_sub(begin, ending) %>%
      str_trim() %>%
      fromJSON()
  
    #store data in DF  
    current_df <- data.frame(iteration = i,
                            id = json$page$cards$id,
                            price = json$page$cards$price,
                            zip_code = json$page$cards$zip,
                            location = json$page$cards$footer$location)
  
    aggregate_df <- rbind(aggregate_df, current_df)
  
    #delay
    rand_delay <- sample(my_samp, 1, replace = T)
    Sys.sleep(rand_delay)
  }
  write.csv(aggregate_df, file = trulia_file)
}

Clean and summarize the scraped data by zip code

trulia_data <- read.csv(trulia_file, stringsAsFactors = F)

trulia_df <- trulia_data %>% transmute(sell_price = as.integer(str_replace_all(price, 
    "\\$|\\+|,", "")), zip_code = as.character(zip_code)) %>% na.omit() %>% 
    group_by(zip_code) %>% summarise(median_sell_price = median(sell_price), 
    n = n())

glimpse(trulia_df)

## Observations: 57
## Variables: 3
## $ zip_code          <chr> "60601", "60604", "60605", "60607", "60608",...
## $ median_sell_price <dbl> 23500.0, 699000.0, 441000.0, 650000.0, 26500...
## $ n                 <int> 1, 1, 5, 11, 30, 57, 43, 11, 34, 60, 185, 43...

The median sell prices are not normally distributed.

hist(trulia_df$median_sell_price, breaks = 10)

The Response Variable: Divvy Station Docks from City of Chicago API

Since stations can have differing numbers of docks, the number of Divvy station docks in each zip code will be used as the explanatory variable. The source of these data will be the City of Chicago’s Divvy Bicycle Stations API.

The Divvy API has 20 variables and 581 observations.

Each observation is a Divvy station, and each station has a certain number of docks for bikes.

feed <- "https://feeds.divvybikes.com/stations/stations.json"
if (!exists("divvy_data")) {
    divvy_data <- fromJSON(feed)$stationBeanList
}
glimpse(divvy_data)

## Observations: 581
## Variables: 20
## $ id                    <int> 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 14, 15,...
## $ stationName           <chr> "Michigan Ave & Balbo Ave", "Shedd Aquar...
## $ availableDocks        <int> 20, 21, 3, 17, 20, 18, 12, 5, 4, 15, 5, ...
## $ totalDocks            <int> 35, 31, 23, 23, 31, 19, 19, 11, 15, 27, ...
## $ latitude              <dbl> 41.87264, 41.86723, 41.85627, 41.87405, ...
## $ longitude             <dbl> -87.62398, -87.61536, -87.61335, -87.627...
## $ statusValue           <chr> "In Service", "In Service", "In Service"...
## $ statusKey             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ status                <chr> "IN_SERVICE", "IN_SERVICE", "IN_SERVICE"...
## $ availableBikes        <int> 13, 10, 20, 6, 11, 1, 6, 5, 10, 11, 10, ...
## $ stAddress1            <chr> "Michigan Ave & Balbo Ave", "Shedd Aquar...
## $ stAddress2            <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ city                  <chr> "Chicago", "Chicago", "Chicago", "Chicag...
## $ postalCode            <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ location              <chr> "", "", "", "620 S. State St.", "", "", ...
## $ altitude              <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ testStation           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ lastCommunicationTime <chr> "2017-05-14 17:54:58", "2017-05-14 17:54...
## $ landMark              <chr> "541", "544", "545", "030", "548", "534"...
## $ is_renting            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...

The 2 variables needed are the `totalDocks` and `postalCode`. However, the `postalCode` values are mostly missing.

table(divvy_data$postalCode == "")

FALSE  TRUE 
  111   470

Fortunately, since the data set includes the longitude & latitude of each station, we can use ggmap to obtain the addresses from the Google Maps API.

coordinates <- cbind(divvy_data$longitude, divvy_data$latitude)
divvy_file <- "DivvyAddresses.csv"

if (!divvy_file %in% list.files(getwd())) {
    ## Code citation: http://stackoverflow.com/a/22919546
    address <- do.call(rbind, lapply(1:nrow(coordinates), function(i) revgeocode(coordinates[i, 
        ])))
    write.csv(data.frame(address = address), file = divvy_file)
}


divvy <- cbind(divvy_data, read.csv(divvy_file))

Next, let’s transform the data so that we have the number of bicycle docks in each zip code.

One station not in service was removed.
ggmaps failed to return the zip code for one of the stations. This value was manually inserted.

##one of the address values returned by ggmap is missing the zip code
missing_zip <- "730-798 W 28th St, Chicago, IL, USA"

divvy_df <- divvy %>% 
  mutate(zip_code = str_trim(str_extract(divvy$address, " [\\d]{5}"))) %>% 
  mutate(zip_code = ifelse(address == missing_zip, "60616", zip_code)) %>% 
  filter(status == "IN_SERVICE") %>%
  select(zip_code, totalDocks) %>%
  group_by(zip_code) %>%
  summarise(docks = sum(totalDocks)) %>% 
  arrange(desc(docks))

kable(head(divvy_df))

zip_code	docks
60614	639
60607	457
60605	405
60616	397
60608	376
60613	375

Combining the Data Sets

Zip codes not starting with “606” are not in the city of Chicago proper and were removed.
The median_sell_price was converted to the median sell price in thousands of US dollars (median_sell_price_1000s).
After left-joining the response variable to the explanatory variable by zip code, the data set now contains 53 observations containing the median sell price and number of Divvy station docks by zip code.

divvy_trulia <- left_join(trulia_df, divvy_df, by = "zip_code") %>% 
  filter(str_detect(zip_code, "606")) %>% 
  transmute(docks = ifelse(is.na(docks), 0, docks),
         median_sell_price_1000s = median_sell_price/1000,
         zip_code = zip_code)

glimpse(divvy_trulia)

## Observations: 53
## Variables: 3
## $ docks                   <dbl> 364, 124, 405, 457, 376, 330, 356, 317...
## $ median_sell_price_1000s <dbl> 23.5000, 699.0000, 441.0000, 650.0000,...
## $ zip_code                <chr> "60601", "60604", "60605", "60607", "6...

Linear Regression Model

Let’s take our bivariate data and create a linear model, using the median sell price as the explanatory variable and the number of bicycle docks as the response variable.

Model Summary

bicycle_model <- lm(docks ~ median_sell_price_1000s, divvy_trulia)
summary(bicycle_model)

## 
## Call:
## lm(formula = docks ~ median_sell_price_1000s, data = divvy_trulia)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -176.29 -103.33  -23.01   79.29  283.97 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             75.62985   24.53774   3.082  0.00331 ** 
## median_sell_price_1000s  0.18728    0.03291   5.691 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.7 on 51 degrees of freedom
## Multiple R-squared:  0.3884, Adjusted R-squared:  0.3764 
## F-statistic: 32.39 on 1 and 51 DF,  p-value: 6.227e-07

(intercept <- coef(bicycle_model)[1])

## (Intercept) 
##    75.62985

(slope <- coef(bicycle_model)[2])

## median_sell_price_1000s 
##               0.1872814

Scatter Plot using ggplot2

a <- ggplot(bicycle_model, aes(median_sell_price_1000s, docks))
a + geom_point() + geom_abline(aes(intercept = intercept, slope = slope))

Model Interpretation

This linear model is expressed as $\widehat{DivvyDocks} = 75.81994 + 0.1870611*{MedianSellPrice}$
If we cast the model in terms of whole Divvy docks, for each additional $\$5,346$ increase in the median sell price of single-family homes, the model expects an increase of $1$ Divvy station dock for the zip code.
In this model, multiple $R^2$ is $0.3888$, which means that the model’s least-squares line accounts for approximately $39\%$ of the variation in the the number of Divvy station docks in a zip code.

Model Diagnostics

Let’s assess if this linear model is reliable.

Linearity: Do the variables have a linear relationship?

Yes, both the scatter plots of the variables and the residuals support a linear relationship.

plot(bicycle_model$residuals ~ divvy_trulia$median_sell_price_1000s)
abline(h = 0, lty = 3)

Nearly normal residuals: Are the model’s residuals distributed normally?

No, per the histogram and Q-Q plot, the residuals are not normally distributed.

hist(bicycle_model$residuals)

qqnorm(bicycle_model$residuals)
qqline(bicycle_model$residuals)

Homoscedasticity: Is there constant variability among the residuals?

Based on the scatter plot of the residuals shown above, there is not constant variability.

Independent observations: Are the data from a random sample and not from a time series?

As stated previously, we can’t confirm that this is a simple random sample.

Conclusion

Since this was a one-side hypothesis test, the p-value is half of the tiny value listed in the regression summary. This would lead us to reject the null hypothesis that there is no relationship between our variables in favor of the alternative hypothesis. However, this is a very tentative conclusion given that the data clearly violated 2 of the model’s necessary conditions.

DATA 607 Final Project: Bicycles and Wealth in Chicago

Kyle Gilde

May 6, 2017

Introduction

Research Question

Hypotheses

The Variables & Sources of Data

The Explanatory Variable: Sell Prices from Trulia.com

Web Scraping

Clean and summarize the scraped data by zip code

The Response Variable: Divvy Station Docks from City of Chicago API

The Divvy API has 20 variables and 581 observations.

The 2 variables needed are the `totalDocks` and `postalCode`. However, the `postalCode` values are mostly missing.

Fortunately, since the data set includes the longitude & latitude of each station, we can use ggmap to obtain the addresses from the Google Maps API.

Next, let’s transform the data so that we have the number of bicycle docks in each zip code.

Combining the Data Sets

Linear Regression Model

Model Summary

Scatter Plot using ggplot2

Model Interpretation

Model Diagnostics

Linearity: Do the variables have a linear relationship?

Nearly normal residuals: Are the model’s residuals distributed normally?

Homoscedasticity: Is there constant variability among the residuals?

Independent observations: Are the data from a random sample and not from a time series?

Conclusion

Introduction

Research Question

Hypotheses

The Variables & Sources of Data

The Explanatory Variable: Sell Prices from Trulia.com

Web Scraping

Clean and summarize the scraped data by zip code

The Response Variable: Divvy Station Docks from City of Chicago API

The Divvy API has 20 variables and 581 observations.

The 2 variables needed are the totalDocks and postalCode. However, the postalCode values are mostly missing.

Fortunately, since the data set includes the longitude & latitude of each station, we can use ggmap to obtain the addresses from the Google Maps API.

Next, let’s transform the data so that we have the number of bicycle docks in each zip code.

Combining the Data Sets

Linear Regression Model

Model Summary

Scatter Plot using ggplot2

Model Interpretation

Model Diagnostics

Linearity: Do the variables have a linear relationship?

Nearly normal residuals: Are the model’s residuals distributed normally?

Homoscedasticity: Is there constant variability among the residuals?

Independent observations: Are the data from a random sample and not from a time series?

Conclusion

The 2 variables needed are the `totalDocks` and `postalCode`. However, the `postalCode` values are mostly missing.