Citi Bike is the nation’s largest bike share program, with 20,000 bikes and over 1,300 stations across Manhattan, Brooklyn, Queens, the Bronx, and Jersey City. All trip data is publicly available. For the purposes of this assignment, I will use the BigQuery public dataset to explore the data (Motivate International, n.d.).
First I will import the general trip information.
In order to make the random sample reproducible, I followed the approach described in the following blog post (Lakshmanan 2016). Essentially, one computes a hash of a column of interest, which behaves like a uniformly distributed random number, partitions the table using the modulo operation, and keeps only a subset of the partitions. This ensures that the same sample is drawn on every run; since the query below keeps only the rows where the remainder of division by 10 is less than 2, it should capture roughly 20% of the population. Please note that a different random sample could instead be obtained on each run with: SELECT * FROM `bigquery-public-data.new_york_citibike.citibike_trips` TABLESAMPLE SYSTEM (10 PERCENT).
sampleTripsQuery <- "SELECT
*,
CONCAT(start_station_name, \" to \", end_station_name) as route,
EXTRACT(DAYOFWEEK from starttime) as startday,
EXTRACT(MONTH from starttime) as startmonth,
EXTRACT(YEAR from starttime) as startyear,
(EXTRACT(YEAR from starttime) - birth_year) as age
FROM
`bigquery-public-data.new_york.citibike_trips`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(starttime AS string )), 10)) < 2"
sampleTripsBQTable <- bq_project_query (projectid, sampleTripsQuery )
sampleTrips <- bq_table_download(sampleTripsBQTable, page_size = 100000)
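As a sanity check on the sampling scheme (a hedged sketch, reusing the projectid and bigrquery session from above), the realized sampling fraction can be compared against a count of the full table; given the MOD(..., 10) < 2 filter it should come out close to 0.2.
# Count the full table and compare with the number of sampled rows
fullCountQuery <- "SELECT COUNT(*) AS n FROM `bigquery-public-data.new_york.citibike_trips`"
fullCount <- bq_table_download(bq_project_query(projectid, fullCountQuery))
nrow(sampleTrips) / fullCount$n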
I will now import zip code information for each station via a spatial join with the bigquery-public-data.geo_us_boundaries.zip_codes public dataset.
allStationsWithZipQuery <- "SELECT
*
FROM
`bigquery-public-data.new_york.citibike_stations` AS bike_stations,
`bigquery-public-data.geo_us_boundaries.zip_codes` AS zip_codes
WHERE ST_Within(
ST_GeogPoint(bike_stations.longitude, bike_stations.latitude),
zip_codes.zip_code_geom
)
"
allStationsWithZipQTable <- bq_project_query (projectid, allStationsWithZipQuery )
allStationsWithZip <- bq_table_download(allStationsWithZipQTable, page_size = 100000)
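A quick check worth running here (a minimal sketch using the dplyr verbs already loaded for this analysis): each station should fall inside exactly one zip code polygon, so no station_id should appear more than once after the spatial join.
# Stations matched to more than one zip polygon (ideally an empty result)
allStationsWithZip %>%
count(station_id) %>%
filter(n > 1)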
It was easier to obtain demographic information about the areas around the stations directly from the US Census data (Petti 2020).
allStationsCensusDataQuery <- "SELECT * FROM
(SELECT *
EXCEPT (internal_point_geo, tract_geom)
FROM `bigquery-public-data.new_york_citibike.citibike_stations` AS bike_stations,
`bigquery-public-data.geo_census_tracts.us_census_tracts_national` AS us_census_tracts_national
WHERE ST_Within(
ST_GeogPoint(bike_stations.longitude, bike_stations.latitude),
us_census_tracts_national.tract_geom)
) AS table1, `bigquery-public-data.census_bureau_acs.censustract_2018_5yr` as ctract
WHERE table1.geo_id = ctract.geo_id"
allStationsCensusDataQTable <- bq_project_query (projectid, allStationsCensusDataQuery )
allStationsCensusData <- bq_table_download(allStationsCensusDataQTable, page_size = 100000)
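Before joining this onto the trips, a quick look at the demographic columns that the rest of the analysis relies on (a minimal sketch; these are the ACS column names referenced in the select() further down):
# Distribution of the ACS variables attached to each station
allStationsCensusData %>%
select(total_pop, white_pop, not_us_citizen_pop, median_income, poverty) %>%
summary()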
Using the NYC Neighborhoods data set, I can obtain the borough and neighborhood names from the zip codes (datHere 2020).
zipCodeInfo<-read.csv("https://data.beta.nyc/dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/7caac650-d082-4aea-9f9b-3681d568e8a5/download/nyc_zip_borough_neighborhoods_pop.csv") %>%
mutate(zip = as.character(zip)) %>%
select( zip, borough, neighborhood)
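One caveat worth checking (a small sketch): station zip codes that do not appear in this NYC-only file, such as the Jersey City zips, will be dropped by the join with the station table further down.
# Station zip codes with no match in the NYC neighborhood file
setdiff(allStationsWithZip$zip_code, zipCodeInfo$zip)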
dim(sampleTrips )
## [1] 6662911 20
sampleTrips %>%
head(20) %>%
reactable(wrap = F)
allStationsCensusData %>%
head(20) %>%
reactable(wrap=F)
zipCodeInfo %>%
reactable(wrap=T)
allStationsWithZip <- allStationsWithZip %>%
select( -ends_with("_geom")) %>%
right_join(zipCodeInfo,by = c("zip_code"="zip")) %>%
select( station_id,zip_code, borough ,neighborhood )
allStationData <-
allStationsCensusData %>%
left_join(allStationsWithZip, by = c("station_id"="station_id"))
sample <- sampleTrips %>%
select( tripduration , start_station_id, start_station_name , end_station_id,
end_station_name, usertype, birth_year, gender, route, startday,
startmonth, startyear , age) %>%
left_join(allStationData, by = c("start_station_id"="station_id") )
sample %>%
head(100) %>%
reactable(wrap=F)
sample <- sample %>%
select(tripduration, start_station_name, route, total_pop, white_pop, not_us_citizen_pop, median_income,poverty ,zip_code, borough ,neighborhood )
sample <- sample %>%
mutate(white_prop = white_pop/total_pop , not_us_citizen_prop = not_us_citizen_pop/total_pop, poverty_prop = poverty/total_pop ) %>%
na.omit() %>%
mutate(route = as.factor(route), start_station_name = as.factor(start_station_name), borough = as.factor(borough), neighborhood = as.factor(neighborhood))
sample %>%
head(100) %>%
reactable(wrap=F)
sample %>%
group_by(borough) %>%
summarise( numTrips = n() ) %>%
mutate( prop = round(numTrips / sum(numTrips),3) ) %>%
select( -numTrips) %>%
reactable()
allStationData %>%
select(name, borough) %>%
group_by(borough) %>%
summarise( numStations = n()) %>%
mutate(prop = round(numStations / sum(numStations),2) ) %>%
reactable()
sample %>%
mutate(majorityPoverty = ifelse(poverty_prop >= .5 , 1, 0)) %>%
group_by(majorityPoverty) %>%
summarise( numTrips = n()) %>%
mutate(prop = numTrips / sum(numTrips) ) %>%
select( -numTrips) %>%
reactable()
sample %>%
mutate(majorityWhite = ifelse(white_prop >= .5 , 1, 0)) %>%
group_by(majorityWhite) %>%
summarise( numTrips = n()) %>%
mutate(prop = round(numTrips / sum(numTrips),2) ) %>%
select(-numTrips) %>%
reactable()
routes <- sample %>%
group_by(route) %>%
summarise( numTrips = n() ) %>%
filter( numTrips > 500) %>%
# popularity score: shift by the minimum trip count and scale by the maximum
mutate( routePopScore = (numTrips - min(numTrips)) / max(numTrips) )
routes %>% reactable(wrap = T)
routes %>%
ggplot(aes(numTrips)) + geom_histogram(binwidth =50)
s2 <- sample %>%
left_join(routes %>% select(-numTrips), by = c("route" = "route")) %>%
na.omit()
model <- lm(routePopScore ~ poverty_prop + white_prop +total_pop + not_us_citizen_prop + borough + tripduration + median_income, data=s2 )
summary(model)
##
## Call:
## lm(formula = routePopScore ~ poverty_prop + white_prop + total_pop +
## not_us_citizen_prop + borough + tripduration + median_income,
## data = s2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12476 -0.08842 -0.04417 0.02982 0.48405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.348e-01 5.909e-03 22.813 <2e-16 ***
## poverty_prop -6.480e-01 1.622e-02 -39.937 <2e-16 ***
## white_prop 5.097e-02 5.943e-03 8.577 <2e-16 ***
## total_pop -1.296e-05 2.451e-07 -52.888 <2e-16 ***
## not_us_citizen_prop 3.049e-01 1.205e-02 25.309 <2e-16 ***
## boroughManhattan 1.147e-01 2.316e-03 49.508 <2e-16 ***
## tripduration 1.239e-06 7.872e-08 15.746 <2e-16 ***
## median_income -4.390e-07 1.701e-08 -25.804 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1485 on 63119 degrees of freedom
## Multiple R-squared: 0.1286, Adjusted R-squared: 0.1285
## F-statistic: 1330 on 7 and 63119 DF, p-value: < 2.2e-16
There is much more information that can be gleaned from these data sets. Based on the US Census data, one can get an idea of the demographics of the areas in which the stations are located, and check whether any of those characteristics are correlated with the popularity of the routes that start at a particular station. I would also like to see how weather impacts use of the program, compare ridership with taxi/Lyft data, and bring in safety data and traffic patterns.
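As a first pass at that correlation question, a minimal sketch using the s2 data frame already built for the model above:
# Pairwise correlations between route popularity and the numeric predictors
# used in the linear model
s2 %>%
select(routePopScore, poverty_prop, white_prop, not_us_citizen_prop,
total_pop, median_income, tripduration) %>%
cor() %>%
round(2)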