DATA 607 Final Project

Introduction

Citi Bike is the nation’s largest bike share program, with 20,000 bikes and over 1,300 stations across Manhattan, Brooklyn, Queens, the Bronx and Jersey City. All trip data is publicly available. For the purposed of this assignment I will be using the BigQuery Public data set to explore the data. (Motivate International, n.d.)

Import BigQuery Data

First I will import the general trip information.

A note on reproducability of the sample.

In order to make the random sample reproducible I followed instructions from the following blog post(Lakshmanan 2016). Essentially, one computes a hash of a column of interest, which should be similar to a random number, partitions the set using the modulo operation and selects a subset from that partition. This ensures that the same random sample is taken on every run, since we are only choosing where the modulus is less than 1, we should capture roughly 10% of the population. Please note, the following query can be run to obtain a random sample on each run SELECT * FROM 'bigquery-public-data.new_york_citibike.citibike_trips' TABLESAMPLE SYSTEM (10 PERCENT)".

sampleTripsQuery <- "SELECT 
                      *,
                      CONCAT(start_station_name, \" to \", end_station_name) as route,  
                      EXTRACT(DAYOFWEEK from starttime) as startday,
                      EXTRACT(MONTH from starttime) as startmonth, 
                      EXTRACT(YEAR from starttime) as startyear, 
                      (EXTRACT(YEAR from starttime) - birth_year) as age 
                    FROM  
                      `bigquery-public-data.new_york.citibike_trips`
                    WHERE ABS(MOD(FARM_FINGERPRINT(CAST(starttime AS string )), 10)) < 2"


sampleTripsBQTable <- bq_project_query (projectid, sampleTripsQuery )
sampleTrips <- bq_table_download(sampleTripsBQTable, page_size = 100000)

I will now import zip code information based on a spatial join with the data.geo_us_boundaries.zip_codes public dataset.

allStationsWithZipQuery <- "SELECT
    *   
FROM
    `bigquery-public-data.new_york.citibike_stations` AS bike_stations,
    `bigquery-public-data.geo_us_boundaries.zip_codes` AS zip_codes
WHERE ST_Within(
        ST_GeogPoint(bike_stations.longitude, bike_stations.latitude),
         zip_codes.zip_code_geom
         )
"

allStationsWithZipQTable <- bq_project_query (projectid, allStationsWithZipQuery )
allStationsWithZip <- bq_table_download(allStationsWithZipQTable, page_size = 100000)

Use Census Data Instead of Zip Codes

It was easier to use the US census data to obtain demographic information about the stations. (Petti 2020)

allStationsCensusDataQuery <- "SELECT * FROM
(SELECT * 
  EXCEPT (internal_point_geo, tract_geom)
FROM `bigquery-public-data.new_york_citibike.citibike_stations` AS bike_stations, 
  `bigquery-public-data.geo_census_tracts.us_census_tracts_national` AS us_census_tracts_national
WHERE ST_Within(
        ST_GeogPoint(bike_stations.longitude, bike_stations.latitude), 
        us_census_tracts_national.tract_geom)
) AS table1, `bigquery-public-data.census_bureau_acs.censustract_2018_5yr` as ctract
WHERE table1.geo_id = ctract.geo_id"

allStationsCensusDataQTable <- bq_project_query (projectid, allStationsCensusDataQuery )
allStationsCensusData <- bq_table_download(allStationsCensusDataQTable, page_size = 100000)

Obtaining Neighborhood data from CSV

Using the NYC Neighborhoods data set, I am able to obtain the neigborhood names from the zip codes. (datHere 2020).

zipCodeInfo<-read.csv("https://data.beta.nyc/dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/7caac650-d082-4aea-9f9b-3681d568e8a5/download/nyc_zip_borough_neighborhoods_pop.csv") %>% 
  mutate(zip = as.character(zip)) %>% 
  select( zip, borough, neighborhood)

Examining Raw Data

dim(sampleTrips )

## [1] 6662911      20

sampleTrips %>% 
  head(20) %>% 
  reactable(wrap = F)

allStationsCensusData %>% 
  head(20) %>% 
  reactable(wrap=F)

zipCodeInfo %>% 
  reactable(wrap=T)

Combine Data

allStationsWithZip <- allStationsWithZip %>% 
  select( -ends_with("_geom")) %>% 
  right_join(zipCodeInfo,by = c("zip_code"="zip")) %>% 
  select( station_id,zip_code, borough ,neighborhood )

allStationData <- 
  allStationsCensusData %>% 
  left_join(allStationsWithZip, by = c("station_id"="station_id"))

sample <- sampleTrips %>% 
  select( tripduration , start_station_id, start_station_name , end_station_id, 
          end_station_name, usertype, birth_year, gender, route, startday, 
          startmonth, startyear , age) %>% 
  left_join(allStationData, by = c("start_station_id"="station_id") )

Final Data Raw

sample %>% 
  head(100) %>% 
  reactable(wrap=F)

Tidying/Transforming Data

Truncate Unnecessary Columns

sample <- sample %>% 
  select(tripduration, start_station_name, route, total_pop, white_pop, not_us_citizen_pop, median_income,poverty ,zip_code, borough ,neighborhood  )

Convert Population Information to Proportions

sample <- sample %>% 
  mutate(white_prop = white_pop/total_pop , not_us_citizen_prop = not_us_citizen_pop/total_pop, poverty_prop = poverty/total_pop ) %>% 
  na.omit() %>% 
  mutate(route = as.factor(route), start_station_name = as.factor(start_station_name), borough = as.factor(borough), neighborhood = as.factor(neighborhood))

Preview Data

sample %>% 
  head(100) %>% 
  reactable(wrap=F)

Analysis

Number of Trips by Borough

sample %>% 
  group_by(borough) %>% 
  summarise( numTrips = n() , ) %>% 
  mutate( prop = round(numTrips / sum(numTrips),3) ) %>% 
  select( -numTrips) %>% 
  reactable()

allStationData %>%  
  select(name, borough) %>% 
  group_by(borough) %>% 
  summarise( numStations = n()) %>% 
  mutate(prop = round(numStations / sum(numStations),2) )  %>% 
  reactable()

Number of Trips by Poverty Level of Start Station

sample %>% 
  mutate(majorityPoverty = ifelse(poverty_prop >= .5 , 1, 0)) %>% 
  group_by(majorityPoverty) %>% 
  summarise( numTrips = n()) %>% 
  mutate(prop = numTrips / sum(numTrips) )  %>% 
  select( -numTrips) %>% 
  reactable()

Number of Trips by Ethnicity of Start Station

sample %>% 
  mutate(majorityWhite = ifelse(white_prop >= .5 , 1, 0)) %>% 
  group_by(majorityWhite) %>% 
  summarise( numTrips = n()) %>% 
  mutate(prop = round(numTrips / sum(numTrips),2) )  %>% 
  select(-numTrips) %>% 
  reactable()

Looking at Routes

routes <- sample %>% 
  group_by(route) %>% 
  summarise( numTrips = n() ) %>% 
  filter( numTrips > 500) %>% 
  mutate( routePopScore = (numTrips -min(numTrips) )/ max(numTrips) )
  
routes %>% reactable(wrap = T)

routes %>% 
  ggplot(aes(numTrips)) + geom_histogram(binwidth =50)

Linear Model To Predict Popularity

s2 <- sample %>% 
  left_join(( routes %>% select(-numTrips)) ,by=c("route"="route")) %>% 
  na.omit()

model <- lm(routePopScore ~ poverty_prop + white_prop +total_pop + not_us_citizen_prop + borough + tripduration + median_income, data=s2 )
summary(model)

## 
## Call:
## lm(formula = routePopScore ~ poverty_prop + white_prop + total_pop + 
##     not_us_citizen_prop + borough + tripduration + median_income, 
##     data = s2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12476 -0.08842 -0.04417  0.02982  0.48405 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.348e-01  5.909e-03  22.813   <2e-16 ***
## poverty_prop        -6.480e-01  1.622e-02 -39.937   <2e-16 ***
## white_prop           5.097e-02  5.943e-03   8.577   <2e-16 ***
## total_pop           -1.296e-05  2.451e-07 -52.888   <2e-16 ***
## not_us_citizen_prop  3.049e-01  1.205e-02  25.309   <2e-16 ***
## boroughManhattan     1.147e-01  2.316e-03  49.508   <2e-16 ***
## tripduration         1.239e-06  7.872e-08  15.746   <2e-16 ***
## median_income       -4.390e-07  1.701e-08 -25.804   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1485 on 63119 degrees of freedom
## Multiple R-squared:  0.1286, Adjusted R-squared:  0.1285 
## F-statistic:  1330 on 7 and 63119 DF,  p-value: < 2.2e-16

Conclusion

There is much more information that can be gleamed from the given data sets. Based on data that was obtained from the US census data set, one can get an idea of the demographics of the area’s that the stations are located in. Based on this we can see if any of this data is correlated to the popularity of routes that start at a particular station. I would also like to see how weather impacts the use of the program,compare to taxi/lyft data , safety data / traffic patterens, etc.

References

datHere. 2020. https://data.beta.nyc/en/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5.

Lakshmanan, Valliappa. 2016. “Repeatable Sampling of Data Sets in BigQuery for Machine Learning.” O’Reilly Media. O’Reilly Media, Inc. https://www.oreilly.com/content/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning/.

Motivate International, Inc. n.d. “Citi Bike System Data.” Citi Bike NYC. https://www.citibikenyc.com/system-data.

Petti, Bill. 2020. “Leveraging US Census Data in BigQuery.” Root Mean Squared Musings. https://billpetti.github.io/2020-10-19-using-bigquery-to-join-census-location-data-geocoding-sql/.