Introduction
One of the best parts of summer in a large city is getting to use the bike-share program. Borrowing bikes and riding them from station to station is a fun, convenient & healthy way to get around town.
Chicago is no different in this respect. However, since its inception in 2013, Chicago’s Divvy bike-share program has been a recurring target of the local press. As in the examples listed below, these reports accuse the program of disproportionately locating bike-share stations in more white and affluent parts of the city.
Divvy Membership Skews White and Wealthy, But Hopefully Not for Long (StreetsBlog, 9/10/15)
Report: In Chicago, Bike Amenities Correlate With Gentrication (StreetsBlog, 1/15/16)
Divvy expansion leaves some areas feeling like third wheel (Suntimes, 4/26/15)
This project will use publicly available data to attempt to confirm or disconfirm these accusations.
Research Question
Are the Divvy bike-share stations disproportionately located in Chicago’s wealthier zip codes?
Hypotheses
\(H_0: B_1 = 0\) There is no relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.
\(H_A: B_1 > 0\) There is a positive relationship between the median sell price of homes and the number of Divvy station docks in Chicago zip codes.
The Variables & Sources of Data
Required R Packages
prettydoc TRUE
knitr TRUE
jsonlite TRUE
ggmap TRUE
tidyverse TRUE
stringr TRUE
rvest TRUE
psych TRUE
The Explanatory Variable: Sell Prices from Trulia.com
There was very little publicly available data on wealth at the intra-city granularity. Consequently, a zip code’s median sell price for single-family homes on Trulia.com will be used as a proxy to measure the wealth of that part of Chicago.
After filtering to single-family homes for sale in Chicago, approximately 4,500 sell prices were scraped from the JSON present on 150 browse pages.
However, this data set is likely not a true representative sample of home values in each zip code. Properties that were bank-owned or in default could not be included in this study because their sell prices were not publicly listed.
Web Scraping
#Inputs to loop
base_url <- "https://www.trulia.com/for_sale/Chicago,IL/SINGLE-FAMILY_HOME_type/"
pages <- 150
trulia_file <- "Trulia_file.csv"
aggregate_df <- data.frame()
reg_ex1 <- "var appState = "
reg_ex2 <- ";\\n var googleMapURL ="
my_samp <- seq(1, 3, by = .01)
#Loop to scrape pages
if (!trulia_file %in% list.files(getwd())){
for (i in 1:pages){
#pagination
current_url <- ifelse(i == 1,
base_url,
paste0(base_url, i, "_p/")
)
#get html
trulia_html <- current_url %>%
read_html() %>%
html_nodes("script") %>%
html_text()
#get json from html
json_text <- trulia_html[str_detect(trulia_html, reg_ex1)]
begin <- as.integer(str_locate(json_text, reg_ex1)[1, 2])
ending <- as.integer(str_locate(json_text, reg_ex2)[1, 1]) - 1
#parse the JSON
json <- json_text %>%
str_sub(begin, ending) %>%
str_trim() %>%
fromJSON()
#store data in DF
current_df <- data.frame(iteration = i,
id = json$page$cards$id,
price = json$page$cards$price,
zip_code = json$page$cards$zip,
location = json$page$cards$footer$location)
aggregate_df <- rbind(aggregate_df, current_df)
#delay
rand_delay <- sample(my_samp, 1, replace = T)
Sys.sleep(rand_delay)
}
write.csv(aggregate_df, file = trulia_file)
} Clean and summarize the scraped data by zip code
trulia_data <- read.csv(trulia_file, stringsAsFactors = F)
trulia_df <- trulia_data %>% transmute(sell_price = as.integer(str_replace_all(price,
"\\$|\\+|,", "")), zip_code = as.character(zip_code)) %>% na.omit() %>%
group_by(zip_code) %>% summarise(median_sell_price = median(sell_price),
n = n())
glimpse(trulia_df)## Observations: 57
## Variables: 3
## $ zip_code <chr> "60601", "60604", "60605", "60607", "60608",...
## $ median_sell_price <dbl> 23500.0, 699000.0, 441000.0, 650000.0, 26500...
## $ n <int> 1, 1, 5, 11, 30, 57, 43, 11, 34, 60, 185, 43...
The median sell prices are not normally distributed.
hist(trulia_df$median_sell_price, breaks = 10)The Response Variable: Divvy Station Docks from City of Chicago API
Since stations can have differing numbers of docks, the number of Divvy station docks in each zip code will be used as the explanatory variable. The source of these data will be the City of Chicago’s Divvy Bicycle Stations API.
The Divvy API has 20 variables and 581 observations.
Each observation is a Divvy station, and each station has a certain number of docks for bikes.
feed <- "https://feeds.divvybikes.com/stations/stations.json"
if (!exists("divvy_data")) {
divvy_data <- fromJSON(feed)$stationBeanList
}
glimpse(divvy_data)## Observations: 581
## Variables: 20
## $ id <int> 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 14, 15,...
## $ stationName <chr> "Michigan Ave & Balbo Ave", "Shedd Aquar...
## $ availableDocks <int> 20, 21, 3, 17, 20, 18, 12, 5, 4, 15, 5, ...
## $ totalDocks <int> 35, 31, 23, 23, 31, 19, 19, 11, 15, 27, ...
## $ latitude <dbl> 41.87264, 41.86723, 41.85627, 41.87405, ...
## $ longitude <dbl> -87.62398, -87.61536, -87.61335, -87.627...
## $ statusValue <chr> "In Service", "In Service", "In Service"...
## $ statusKey <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ status <chr> "IN_SERVICE", "IN_SERVICE", "IN_SERVICE"...
## $ availableBikes <int> 13, 10, 20, 6, 11, 1, 6, 5, 10, 11, 10, ...
## $ stAddress1 <chr> "Michigan Ave & Balbo Ave", "Shedd Aquar...
## $ stAddress2 <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ city <chr> "Chicago", "Chicago", "Chicago", "Chicag...
## $ postalCode <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ location <chr> "", "", "", "620 S. State St.", "", "", ...
## $ altitude <chr> "", "", "", "", "", "", "", "", "", "", ...
## $ testStation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ lastCommunicationTime <chr> "2017-05-14 17:54:58", "2017-05-14 17:54...
## $ landMark <chr> "541", "544", "545", "030", "548", "534"...
## $ is_renting <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
The 2 variables needed are the totalDocks and postalCode. However, the postalCode values are mostly missing.
table(divvy_data$postalCode == "")
FALSE TRUE
111 470 Fortunately, since the data set includes the longitude & latitude of each station, we can use ggmap to obtain the addresses from the Google Maps API.
coordinates <- cbind(divvy_data$longitude, divvy_data$latitude)
divvy_file <- "DivvyAddresses.csv"
if (!divvy_file %in% list.files(getwd())) {
## Code citation: http://stackoverflow.com/a/22919546
address <- do.call(rbind, lapply(1:nrow(coordinates), function(i) revgeocode(coordinates[i,
])))
write.csv(data.frame(address = address), file = divvy_file)
}
divvy <- cbind(divvy_data, read.csv(divvy_file))Next, let’s transform the data so that we have the number of bicycle docks in each zip code.
One station not in service was removed.
ggmaps failed to return the zip code for one of the stations. This value was manually inserted.
##one of the address values returned by ggmap is missing the zip code
missing_zip <- "730-798 W 28th St, Chicago, IL, USA"
divvy_df <- divvy %>%
mutate(zip_code = str_trim(str_extract(divvy$address, " [\\d]{5}"))) %>%
mutate(zip_code = ifelse(address == missing_zip, "60616", zip_code)) %>%
filter(status == "IN_SERVICE") %>%
select(zip_code, totalDocks) %>%
group_by(zip_code) %>%
summarise(docks = sum(totalDocks)) %>%
arrange(desc(docks))
kable(head(divvy_df))| zip_code | docks |
|---|---|
| 60614 | 639 |
| 60607 | 457 |
| 60605 | 405 |
| 60616 | 397 |
| 60608 | 376 |
| 60613 | 375 |
Combining the Data Sets
Zip codes not starting with “606” are not in the city of Chicago proper and were removed.
The
median_sell_pricewas converted to the median sell price in thousands of US dollars (median_sell_price_1000s).After left-joining the response variable to the explanatory variable by zip code, the data set now contains 53 observations containing the median sell price and number of Divvy station docks by zip code.
divvy_trulia <- left_join(trulia_df, divvy_df, by = "zip_code") %>%
filter(str_detect(zip_code, "606")) %>%
transmute(docks = ifelse(is.na(docks), 0, docks),
median_sell_price_1000s = median_sell_price/1000,
zip_code = zip_code)
glimpse(divvy_trulia)## Observations: 53
## Variables: 3
## $ docks <dbl> 364, 124, 405, 457, 376, 330, 356, 317...
## $ median_sell_price_1000s <dbl> 23.5000, 699.0000, 441.0000, 650.0000,...
## $ zip_code <chr> "60601", "60604", "60605", "60607", "6...
Linear Regression Model
Let’s take our bivariate data and create a linear model, using the median sell price as the explanatory variable and the number of bicycle docks as the response variable.
Model Summary
bicycle_model <- lm(docks ~ median_sell_price_1000s, divvy_trulia)
summary(bicycle_model)##
## Call:
## lm(formula = docks ~ median_sell_price_1000s, data = divvy_trulia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176.29 -103.33 -23.01 79.29 283.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.62985 24.53774 3.082 0.00331 **
## median_sell_price_1000s 0.18728 0.03291 5.691 6.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.7 on 51 degrees of freedom
## Multiple R-squared: 0.3884, Adjusted R-squared: 0.3764
## F-statistic: 32.39 on 1 and 51 DF, p-value: 6.227e-07
(intercept <- coef(bicycle_model)[1])## (Intercept)
## 75.62985
(slope <- coef(bicycle_model)[2])## median_sell_price_1000s
## 0.1872814
Scatter Plot using ggplot2
a <- ggplot(bicycle_model, aes(median_sell_price_1000s, docks))
a + geom_point() + geom_abline(aes(intercept = intercept, slope = slope))Model Interpretation
This linear model is expressed as \(\widehat{DivvyDocks} = 75.81994 + 0.1870611*{MedianSellPrice}\)
If we cast the model in terms of whole Divvy docks, for each additional \(\$5,346\) increase in the median sell price of single-family homes, the model expects an increase of \(1\) Divvy station dock for the zip code.
In this model, multiple \(R^2\) is \(0.3888\), which means that the model’s least-squares line accounts for approximately \(39\%\) of the variation in the the number of Divvy station docks in a zip code.
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
Yes, both the scatter plots of the variables and the residuals support a linear relationship.
plot(bicycle_model$residuals ~ divvy_trulia$median_sell_price_1000s)
abline(h = 0, lty = 3)Nearly normal residuals: Are the model’s residuals distributed normally?
No, per the histogram and Q-Q plot, the residuals are not normally distributed.
hist(bicycle_model$residuals)qqnorm(bicycle_model$residuals)
qqline(bicycle_model$residuals)Homoscedasticity: Is there constant variability among the residuals?
Based on the scatter plot of the residuals shown above, there is not constant variability.
Independent observations: Are the data from a random sample and not from a time series?
As stated previously, we can’t confirm that this is a simple random sample.
Conclusion
Since this was a one-side hypothesis test, the p-value is half of the tiny value listed in the regression summary. This would lead us to reject the null hypothesis that there is no relationship between our variables in favor of the alternative hypothesis. However, this is a very tentative conclusion given that the data clearly violated 2 of the model’s necessary conditions.