Introduction
One of the best parts of summer in a large city is getting to use the bike-share program.
Chicago is no different in this respect. However, Chicago’s Divvy bike-share program has been accused of disproportionately locating its stations in whiter, more affluent parts of the city.
Divvy Membership Skews White and Wealthy, But Hopefully Not for Long (StreetsBlog, 9/10/15)
Report: In Chicago, Bike Amenities Correlate With Gentrification (StreetsBlog, 1/15/16)
Divvy expansion leaves some areas feeling like third wheel (Suntimes, 4/26/15)
This project will use publicly available data to attempt to confirm or disconfirm these accusations.
Research Question
Are the Divvy bike-share stations disproportionately located in Chicago’s wealthier zip codes?
Hypotheses
\(H_0: B_1 = 0\) There is no relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.
\(H_A: B_1 > 0\) There is a positive relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.
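Both hypotheses refer to the slope of a simple linear regression fit across zip codes. The model is not written out explicitly here, so the notation below is an assumption: \(B_0\) is the intercept, \(B_1\) is the slope being tested, and \(\epsilon_i\) is the error term for zip code \(i\).

```latex
\[
\text{docks}_i = B_0 + B_1 \cdot \text{median sell price}_i + \epsilon_i
\]
```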
The Variables & Sources of Data
The Explanatory Variable: Sell Prices from Trulia.com
A zip code’s median sell price for single-family homes on Trulia.com will be used as a proxy to measure the wealth of that part of Chicago.
I scraped 4,500 sell prices from the JSON embedded in 150 browse pages.
Web Scraping
#Packages used in this section
library(rvest)    # read_html(), html_nodes(), html_text(); also provides %>%
library(stringr)  # str_detect(), str_locate(), str_sub(), str_trim()
library(jsonlite) # fromJSON()
library(dplyr)    # transmute(), group_by(), summarise(), glimpse()
#Inputs to loop
base_url <- "https://www.trulia.com/for_sale/Chicago,IL/SINGLE-FAMILY_HOME_type/"
pages <- 150
trulia_file <- "Trulia_file.csv"
aggregate_df <- data.frame()
reg_ex1 <- "var appState = "
reg_ex2 <- ";\\n var googleMapURL ="
my_samp <- seq(1, 3, by = .01)
#Loop to scrape pages (skipped if the results are already saved to disk)
if (!trulia_file %in% list.files(getwd())){
  for (i in 1:pages){
    #pagination: page 1 is the base URL, later pages append "<i>_p/"
    current_url <- ifelse(i == 1,
                          base_url,
                          paste0(base_url, i, "_p/")
    )
    #get html: text of every <script> node on the page
    trulia_html <- current_url %>%
      read_html() %>%
      html_nodes("script") %>%
      html_text()
    #get json from html: find the script holding appState and its bounds
    json_text <- trulia_html[str_detect(trulia_html, reg_ex1)]
    begin <- as.integer(str_locate(json_text, reg_ex1)[1, 2])
    ending <- as.integer(str_locate(json_text, reg_ex2)[1, 1]) - 1
    #parse the JSON
    json <- json_text %>%
      str_sub(begin, ending) %>%
      str_trim() %>%
      fromJSON()
    #store data in DF
    current_df <- data.frame(iteration = i,
                             id = json$page$cards$id,
                             price = json$page$cards$price,
                             zip_code = json$page$cards$zip,
                             location = json$page$cards$footer$location)
    aggregate_df <- rbind(aggregate_df, current_df)
    #delay: wait a random 1 to 3 seconds between requests
    rand_delay <- sample(my_samp, 1, replace = TRUE)
    Sys.sleep(rand_delay)
  }
  write.csv(aggregate_df, file = trulia_file)
}
Summarized the scraped data by zip code
trulia_data <- read.csv(trulia_file, stringsAsFactors = F)
trulia_df <- trulia_data %>% transmute(sell_price = as.integer(str_replace_all(price,
"\\$|\\+|,", "")), zip_code = as.character(zip_code)) %>% na.omit() %>%
group_by(zip_code) %>% summarise(median_sell_price = median(sell_price),
n = n())
glimpse(trulia_df)
## Observations: 57
## Variables: 3
## $ zip_code <chr> "60601", "60604", "60605", "60607", "60608",...
## $ median_sell_price <dbl> 23500.0, 699000.0, 441000.0, 650000.0, 26500...
## $ n <int> 1, 1, 5, 11, 30, 57, 43, 11, 34, 60, 185, 43...
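To make the cleaning step concrete, here is a small sketch of the same idea on toy data: strip the `$`, `+`, and `,` characters from the price strings, drop anything that does not parse, and take the median per zip code. The listings below are made up for illustration, and base R (`gsub()`, `aggregate()`) stands in for the stringr and dplyr calls used above so that the block runs without extra packages.

```r
# Toy listings: made-up prices and zip codes, not the scraped data
toy <- data.frame(
  price    = c("$250,000", "$1,200,000+", "$399,900", "Contact for price"),
  zip_code = c(60608, 60614, 60608, 60614),
  stringsAsFactors = FALSE
)

# Strip "$", "+", and "," so the remaining digits parse as integers;
# strings that are not numeric become NA (coercion warning suppressed)
toy$sell_price <- suppressWarnings(
  as.integer(gsub("\\$|\\+|,", "", toy$price))
)
toy$zip_code <- as.character(toy$zip_code)

# Drop unparseable rows, then take the median price per zip code
toy <- toy[!is.na(toy$sell_price), ]
medians <- aggregate(sell_price ~ zip_code, data = toy, FUN = median)
names(medians)[2] <- "median_sell_price"
print(medians)
```

Note that a zip code with a single listing, like 60614 in this toy example, gets a "median" that is just that one price, which is worth keeping in mind for the zip codes above where n is 1.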
The Response Variable: Divvy Station Docks from City of Chicago API
The two variables needed are totalDocks and postalCode. However, postalCode is blank for 470 of the 581 stations.
table(divvy_data$postalCode == "")
## FALSE  TRUE
##   111   470
Fortunately, since the data set includes the longitude and latitude of each station, we can use ggmap to obtain the addresses from the Google Maps API.
coordinates <- cbind(divvy_data$longitude, divvy_data$latitude)
divvy_file <- "DivvyAddresses.csv"
if (!divvy_file %in% list.files(getwd())) {
  ## Code citation: http://stackoverflow.com/a/22919546
  address <- do.call(rbind, lapply(1:nrow(coordinates),
                                   function(i) revgeocode(coordinates[i, ])))
  write.csv(data.frame(address = address), file = divvy_file)
}
divvy <- cbind(divvy_data, read.csv(divvy_file))
Combined the Data Sets using a Left Join
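The final data set contains 53 observations, one per zip code, each with the median sell price and the number of Divvy station docks. Because this is a left join from the Trulia data, zip codes with no Divvy stations are kept and assigned 0 docks.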
divvy_trulia <- left_join(trulia_df, divvy_df, by = "zip_code") %>%
filter(str_detect(zip_code, "606")) %>%
transmute(docks = ifelse(is.na(docks), 0, docks),
median_sell_price_1000s = median_sell_price/1000,
zip_code = zip_code)
glimpse(divvy_trulia)
## Observations: 53
## Variables: 3
## $ docks <dbl> 372, 124, 405, 457, 376, 330, 356, 317...
## $ median_sell_price_1000s <dbl> 23.5000, 699.0000, 441.0000, 650.0000,...
## $ zip_code <chr> "60601", "60604", "60605", "60607", "6...
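The join above relies on divvy_df, whose construction is not shown. The sketch below is one way it could be built, not the original code: it assumes each geocoded address contains a five-digit 606xx zip code and that totalDocks is summed per zip. The toy addresses, dock counts, and regular expression are all illustrative assumptions, and base R's `merge(..., all.x = TRUE)` stands in for `left_join()`.

```r
# Toy stand-in for the geocoded station table (hypothetical values)
divvy_toy <- data.frame(
  address = c("100 N State St, Chicago, IL 60602, USA",
              "200 S Wacker Dr, Chicago, IL 60606, USA",
              "300 N State St, Chicago, IL 60602, USA"),
  totalDocks = c(15, 23, 19),
  stringsAsFactors = FALSE
)

# Extract the first standalone "606xx" token from each address;
# addresses without one are left as NA (assumed address format)
m <- regexpr("\\b606[0-9]{2}\\b", divvy_toy$address, perl = TRUE)
divvy_toy$zip_code <- NA_character_
divvy_toy$zip_code[m > 0] <- regmatches(divvy_toy$address, m)

# Sum docks per zip code (rows with NA zip codes are dropped)
divvy_df_toy <- aggregate(totalDocks ~ zip_code, data = divvy_toy, FUN = sum)
names(divvy_df_toy)[2] <- "docks"

# Toy price table; 60629 has no stations in the toy data
prices_toy <- data.frame(zip_code = c("60602", "60606", "60629"),
                         median_sell_price = c(500000, 650000, 180000),
                         stringsAsFactors = FALSE)

# Left join: keep every zip in prices_toy, then treat missing docks as 0
joined <- merge(prices_toy, divvy_df_toy, by = "zip_code", all.x = TRUE)
joined$docks[is.na(joined$docks)] <- 0
print(joined)
```

Keeping the zero-dock zip codes matters for the research question: dropping them would remove exactly the areas that have no stations at all.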
Linear Regression Model
Model Summary
##
## Call:
## lm(formula = docks ~ median_sell_price_1000s, data = divvy_trulia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176.73 -103.32 -22.84 79.41 292.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.29288 24.61746 3.059 0.00354 **
## median_sell_price_1000s 0.18872 0.03302 5.716 5.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.2 on 51 degrees of freedom
## Multiple R-squared: 0.3905, Adjusted R-squared: 0.3785
## F-statistic: 32.67 on 1 and 51 DF, p-value: 5.694e-07
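The Call line in the summary shows the formula that was fit. As a sketch of how such a summary is produced, the block below fits the same formula to a toy data frame; the five rows are invented so that the arithmetic is easy to follow, and they are not the project's data.

```r
# Toy data frame with the same column names as divvy_trulia (made-up values)
toy_df <- data.frame(
  median_sell_price_1000s = c(100, 200, 300, 400, 500),
  docks                   = c(95, 110, 135, 150, 165)
)

# Ordinary least squares: docks as a linear function of median sell price
fit <- lm(docks ~ median_sell_price_1000s, data = toy_df)

# Coefficient table, residual standard error, R-squared, and F-statistic
print(summary(fit))

# The slope is the expected change in docks per $1,000 of median sell price
slope <- unname(coef(fit)["median_sell_price_1000s"])
intercept <- unname(coef(fit)["(Intercept)"])
```

With the real data, replacing toy_df with divvy_trulia should reproduce the output shown above.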
Scatter Plot using ggplot2
Model Interpretation
If we cast the model in terms of whole Divvy docks, the slope of \(0.18872\) docks per \(\$1,000\) means that for each additional \(\$5,299\) in the median sell price of single-family homes (\(1000 / 0.18872\)), the model expects an increase of \(1\) Divvy station dock for the zip code.
In this model, multiple \(R^2\) is \(0.3905\), which means that the model’s least-squares line accounts for approximately \(39\%\) of the variation in the number of Divvy station docks in a zip code.
Model Diagnostics: Were the necessary conditions met?
While the relationship appears to be linear, the residuals are not normally distributed and they lack constant variability. It is also uncertain whether the sample is representative.
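The checks behind these statements are not shown. The sketch below illustrates two common ways to examine the normality and constant-variability conditions numerically. It runs on simulated data whose noise grows with the predictor, and both the simulated values and the choice of checks (a Shapiro-Wilk test and an informal auxiliary regression of squared residuals on fitted values) are assumptions for illustration, not the diagnostics used in this project.

```r
# Simulated data with 53 points whose spread increases with x (illustrative only)
set.seed(1)
x <- seq(50, 800, length.out = 53)
y <- 75 + 0.19 * x + rnorm(53, sd = 0.2 * x)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (small p-values suggest non-normality)
sw <- shapiro.test(resid(fit))

# Constant variability: regress squared residuals on fitted values;
# a clearly nonzero slope suggests the spread changes with the fitted value
aux <- lm(I(resid(fit)^2) ~ fitted(fit))
aux_p <- summary(aux)$coefficients[2, 4]

print(c(shapiro_p = sw$p.value, aux_slope_p = aux_p))
```

In practice these numbers are usually read alongside a residuals-versus-fitted plot and a normal Q-Q plot rather than on their own.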
Conclusion
Since this was a one-sided hypothesis test and the estimated slope is positive, in the direction of \(H_A\), the p-value is half of the tiny value listed in the regression summary. This would lead us to reject the null hypothesis that there is no relationship between our variables in favor of the alternative hypothesis. However, this is a very tentative conclusion given that the data clearly violated two of the model’s necessary conditions.
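The halving step can be checked directly from the numbers in the regression summary. The sketch below takes the reported t value and residual degrees of freedom and computes the upper-tail probability of the t distribution, which is the p-value for a one-sided test in the positive direction. The inputs are the rounded values printed in the summary, so the result will match the halved p-value only approximately.

```r
# Values copied from the regression summary (rounded as printed)
t_value     <- 5.716
df_resid    <- 51
p_two_sided <- 5.69e-07

# Upper-tail probability of the t distribution:
# the one-sided p-value for the alternative that the slope is positive
p_one_sided <- pt(t_value, df = df_resid, lower.tail = FALSE)

# Compare with simply halving the reported two-sided p-value
print(c(from_t = p_one_sided, halved = p_two_sided / 2))
```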