The exploratory analysis of the Citibike’s user data shows promising preliminary results. The amount of base stations and use has increased from 2019 to 2020. This increase in bicycle stations creates a more convenient user experience since the rider has a better chance of finding a station closer to their destination. The outbreak of COVID-19 increased bicycle use in 2020. This can be attributed to the bicycles are in open air and not in other cramped public transportation options. The demographics from 2019 to 2020 has changed slightly as we have seen an increase in customers versus subscribers. The most frequent ages has not changed. The rider is typically 25 to 35 years old.The ride duration are similar from 2019 and 2020. The most frequent ride time slightly increased in 2020. Even though the number of bike stations increased, the epicenters for starting a trip are relatively in the same area within the South Village in New York City. The peak days for bike rentals is between Tuesday and Wednesday during the week and Sunday on the weekends since those days consistently see high rental numbers. As we progress toward a long term strategy of the bike share program for Citibike, we recommend performing analyses on the price of up-keep and maintenance of the bicycles. We recommend this because of the impact of increasing the number of bike stations may spread resources of the company too thin. The addition of bike stations is convenient to the customer, but it may be inefficient for the company to collect bikes for maintenance or some locations that are not frequently visited are costly to rent compared to other locations with high traffic. We also recommend performing an analysis for the optimum periods to reduce the number of bicycles on the streets prior to the winter months. The winter months involve heavy use of salt to prevent ice. The salt corrodes certain metals and can deteriorate the bicycles. The company can reduce the amount of bicycles on the streets while maintaining an exceptional service to their customers. The collection of the bicycles could assist with maintenance periods as well.
Citibike offers bicycles for customers to rent and travel around New York City. Citibike launched in 2013 and expanded their inventory to over 15,000 bicycles with 1,000 locations. Like many businesses during 2020, Citibike experienced changes during the outbreak of COVID-19. The objective of this analysis is to investigate the amount of influence COVID-19 had on Citibike’s bicycle use. The data is from Citibike and can be found at https://ride.citibikenyc.com/system-data. The data includes ride duration, paths taken by riders, and the age and gender of the rider. The data analyzed is from January 2019 to January 2021.
Below are histograms and graphs depicting riders’ demographics and bicycle use throughout 2019, 2020, and the beginning of 2021.
library(dygraphs)
library(ggplot2)
library(heatmaply)
load("~/BUAN514.RData")
wk03_df_pre <- rbind(week3_2019_01,week3_2019_02,week3_2019_03,week3_2019_04,
week3_2019_05,week3_2019_06,week3_2019_07,week3_2019_08,
week3_2019_09,week3_2019_10,week3_2019_11,week3_2019_12,
week3_2020_01,week3_2020_02)
wk03_df_pos <- rbind(week3_2020_03,week3_2020_04,week3_2020_05,week3_2020_06,
week3_2020_07,week3_2020_08,week3_2020_09,week3_2020_10,
week3_2020_11,week3_2020_12,week3_2021_1)
wk03_df_pre$age <- 2021-wk03_df_pre$birth.year
wk03_df_pos$age <- 2021-wk03_df_pos$birth.year
Below are histograms for the type of user and the ages of the users for both pre and post outbreak of COVID-19. A customer is a buyer who does not suscribe to the monthly payments of Citibike. In both sets, we see similar trends. Most of the users are subscribers to Citibike and the majority of users are between 25 and 35. After the outbreak, the amount of people being a customer increased. There is a spike of users at the age of 52 and ages as high as 136 riding the bicycles. It is highly improbable the most frequent user is 52, but rather a majority of users are not truthful about their ages when subscribing to Citibike. If we look at the pre and post COVID-19 datasets for rider age, we note that there was an increase in users between the ages of 25 and 35. A majority of the riders are also male, but there was an increase in the amount of undisclosed genders after the outbreak.
ggplot(wk03_df_pre,aes(x=usertype))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of User Type Pre-COVID-19")+ylab("Frequency")+xlab("Type of User")
ggplot(wk03_df_pos,aes(x=usertype))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of User Type Post-COVID-19")+ylab("Frequency")+xlab("Type of User")
ggplot(wk03_df_pre,aes(x=age))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of Rider Ages Pre-COVID-19")+ylab("Frequency")+xlab("Age")+
scale_x_continuous(limits = c(17, 95))
ggplot(wk03_df_pos,aes(x=age))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of Rider Ages Post-COVID-19")+ylab("Frequency")+xlab("Age")+
scale_x_continuous(limits = c(17, 95))+scale_y_continuous(limits = c(0, 0.06))
ggplot(wk03_df_pre,aes(as.character(gender)))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of Rider Gender Pre-COVID-19")+ylab("Frequency")+xlab("Gender")+
scale_x_discrete(labels=c("0"="Unknown","1"="Male","2"="Female"))
ggplot(wk03_df_pos,aes(as.character(gender)))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue")+
ggtitle("Histogram of Rider Gender Post-COVID-19")+ylab("Frequency")+xlab("Gender")+
scale_x_discrete(labels=c("0"="Unknown","1"="Male","2"="Female"))
Below is a dygraph depicting the amount of rides per month and a histogram showing the distribution of ride times. From the dygraph, we note the spike in the number of rides in February 2020. This could be attributed to riders not wanting to use public transportation at the start of the pandemic. Following the spike was the lowest number of rides in March 2020 and that can be associated with New York’s lock down to flatten the curve for the pandemic. As the pandemic progressed past the spring months, the number of rides became greater than the pre-pandemic levels for the same months.
rides<-c(19676,18565,23600,33056,36135,39430,
43746,48711,49244,42253,30797,19728,
26020,1048575,17719,9268,25077,36921,
38680,43267,53833,30086,21275,11694,
21275)
df_dy<-ts(rides, start=c(2019, 1), end=c(2021, 1), frequency=12)
dygraph(df_dy) %>% dyRangeSelector()
As you can see with the distribution of ride times, the most frequent ride lengths were less than 5 minutes prior to the pandemic and slightly increased to between 5 and 10 minutes after the outbreak.
wk03_df_pre$tripduration_min<-wk03_df_pre$tripduration/60
ggplot(wk03_df_pre,aes(tripduration_min))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="orange")+
xlim(1,50)+ggtitle("Ride Duration Pre-COVID-19")+xlab("Duration in Minutes")+ylab("Frequency")
wk03_df_pos$tripduration_min<-wk03_df_pos$tripduration/60
ggplot(wk03_df_pos,aes(tripduration_min))+geom_bar(aes(y = (..count..)/sum(..count..)),fill="orange")+
xlim(1,50)+ggtitle("Ride Duration Post-COVID-19")+xlab("Duration in Minutes")+ylab("Frequency")
Below is a heatmap of starting locations for riders in 2019. The heatmap is interactive and hover of the cell to see what the starting location was and how many riders left from that location. The heatmap for 2020 is not displayed since the number of bike stations increased from 51 to 889. The increase in stations dispersed the traffic relatively evenly with a few spikes at locations close to Grove St in South Village.
Jan_19<-table(week3_2019_01$start.station.name)
Feb_19<-table(week3_2019_02$start.station.name)
Mar_19<-table(week3_2019_03$start.station.name)
Apr_19<-table(week3_2019_04$start.station.name)
May_19<-table(week3_2019_05$start.station.name)
Jun_19<-table(week3_2019_06$start.station.name)
Jul_19<-table(week3_2019_07$start.station.name)
Aug_19<-table(week3_2019_08$start.station.name)
Sep_19<-table(week3_2019_09$start.station.name)
Oct_19<-table(week3_2019_10$start.station.name)
Nov_19<-table(week3_2019_11$start.station.name)
Dec_19<-table(week3_2019_12$start.station.name)
Jan_20<-table(week3_2020_01$start.station.name)
Feb_20<-table(week3_2020_02$start.station.name)
Mar_20<-table(week3_2020_03$start.station.name)
Apr_20<-table(week3_2020_04$start.station.name)
May_20<-table(week3_2020_05$start.station.name)
Jun_20<-table(week3_2020_06$start.station.name)
Jul_20<-table(week3_2020_07$start.station.name)
Aug_20<-table(week3_2020_08$start.station.name)
Sep_20<-table(week3_2020_09$start.station.name)
Oct_20<-table(week3_2020_10$start.station.name)
Nov_20<-table(week3_2020_11$start.station.name)
Dec_20<-table(week3_2020_12$start.station.name)
Jan_21<-table(week3_2021_1$start.station.name)
strt_loc_pre<-as.data.frame(cbind(Jan_19,Feb_19,Mar_19,Apr_19,May_19,Jun_19,
Jul_19,Aug_19,Sep_19,Oct_19,Nov_19,Dec_19))
strt_loc_pos<-as.data.frame(cbind(Jan_20,Feb_20,Mar_20,Apr_20,May_20,Jun_20,
Jul_20,Aug_20,Sep_20,Oct_20,Nov_20,Dec_20))
heatmaply(strt_loc_pre,main = "Heatmap of Bicycle Starting Stations in 2019",xlab = "Month",dendrogram = "none",showticklabels = c(TRUE, FALSE), scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(
low = "blue",
high = "orange",
midpoint = 3000,
limits = c(0, 6000)))
Below is a dygraph showing how Grove Street Bike Station have varied from 2019 to 2020. Grove St was chosen since it was the most frequently visited station in 2019. As expected, Grove Street visits increased through the summer months, the dwindled in 2020. This is attributed to the increase in the amount of stations. In 2019, there were 51 stations. The amount increased to 889 stations at the start of 2020.
library(tidyverse)
df_grove1 <- strt_loc_pre %>% filter(row.names(strt_loc_pre) %in% c("Grove St PATH"))
df_grove2 <- strt_loc_pos %>% filter(row.names(strt_loc_pos) %in% c("Grove St & Broadway"))
df_grove <- as.data.frame(cbind(df_grove1,df_grove2))
rownames(df_grove)<-NULL
df1<-data.frame()
for (i in 1:24){
df1[i,1]=df_grove[1,i]
}
df_dgrove<-ts(df1, start=c(2019, 1), end=c(2020, 12), frequency=12)
dygraph(df_dgrove) %>% dyRangeSelector()
Below is a bump chart ranking each day of week for the summer of 2020. As you can see, the peak day each month changes. During the week, the peak days are Tuesday, Wednesday, and Thursday. This is based on consistenly having ranks between 1 and 3. On the weekends, Sunday sees a peak for bike rents.
library(lubridate)
week3_2020_05$date <- as.Date(week3_2020_05$starttime)
week3_2020_05$wday <- wday(week3_2020_05$date)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==1,"Sunday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==2,"Monday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==3,"Tuesday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==4,"Wednesday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==5,"Thursday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==6,"Friday",week3_2020_05$wday)
week3_2020_05$wday <- ifelse(week3_2020_05$wday==7,"Saturday",week3_2020_05$wday)
week3_2020_05$wday <- factor(week3_2020_05$wday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
week3_2020_06$date <- as.Date(week3_2020_06$starttime)
week3_2020_06$wday <- wday(week3_2020_06$date)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==1,"Sunday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==2,"Monday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==3,"Tuesday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==4,"Wednesday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==5,"Thursday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==6,"Friday",week3_2020_06$wday)
week3_2020_06$wday <- ifelse(week3_2020_06$wday==7,"Saturday",week3_2020_06$wday)
week3_2020_06$wday <- factor(week3_2020_06$wday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
week3_2020_07$date <- as.Date(week3_2020_07$starttime)
week3_2020_07$wday <- wday(week3_2020_07$date)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==1,"Sunday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==2,"Monday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==3,"Tuesday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==4,"Wednesday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==5,"Thursday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==6,"Friday",week3_2020_07$wday)
week3_2020_07$wday <- ifelse(week3_2020_07$wday==7,"Saturday",week3_2020_07$wday)
week3_2020_07$wday <- factor(week3_2020_07$wday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
week3_2020_08$date <- as.Date(week3_2020_08$starttime)
week3_2020_08$wday <- wday(week3_2020_08$date)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==1,"Sunday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==2,"Monday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==3,"Tuesday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==4,"Wednesday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==5,"Thursday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==6,"Friday",week3_2020_08$wday)
week3_2020_08$wday <- ifelse(week3_2020_08$wday==7,"Saturday",week3_2020_08$wday)
week3_2020_08$wday <- factor(week3_2020_08$wday,levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
May1_20<-table(week3_2020_05$wday)
Jun1_20<-table(week3_2020_06$wday)
Jul1_20<-table(week3_2020_07$wday)
Aug1_20<-table(week3_2020_08$wday)
day<-c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
mon<-c("day","May_20","Jun_20","Jul_20","Aug_20")
dow<-as.data.frame(cbind(May1_20,Jun1_20,Jul1_20,Aug1_20))
rownames(dow)<-NULL
df2<-data.frame()
for (i in 1:4){
for (j in 1:7){
df2[j,i]=dow[j,i]
}
}
df3<-cbind(day,df2)
colnames(df3)<-mon
df3$rankMay<-rank(-df3$May_20)
df3$rankJun<-rank(-df3$Jun_20)
df3$rankJul<-rank(-df3$Jul_20)
df3$rankAug<-rank(-df3$Aug_20)
ggplot(df3,aes(c(1:7)))+
geom_line(aes(y=rankMay,colour=mon[2]))+
geom_line(aes(y=rankJun,colour=mon[3]))+
geom_line(aes(y=rankJul,colour=mon[4]))+
geom_line(aes(y=rankAug,colour=mon[5]))+
xlab("Day of the Week")+ylab("Rank")+
scale_x_continuous(breaks = c(1:7),
labels = day)+
scale_y_continuous(trans = "reverse", breaks = c(1:7))+
ggtitle("Bump Chart for Peak Days in Each Month")+
scale_colour_manual("",
breaks = mon[2:5],
values = c("orange", "blue", "green","yellow"))
The exploratory analysis of the Citibike’s user data shows promising preliminary results. The amount of base stations and use has increased from 2019 to 2020. This increase in bicycle stations creates a more convenient user experience since the rider has a better chance of finding a station closer to their destination. The outbreak of COVID-19 increased bicycle use in 2020. This can be attributed to the bicycles are in open air and not in other cramped public transportation options. The demographics from 2019 to 2020 has changed slightly as we have seen an increase in customers versus subscribers. The most frequent ages has not changed. The rider is typically 25 to 35 years old.The ride duration are similar from 2019 and 2020. The most frequent ride time slightly increased in 2020. Even though the number of bike stations increased, the epicenters for starting a trip are relatively in the same area within the South Village in New York City. The peak days for bike rentals is between Tuesday and Wednesday during the week and Sunday on the weekends since those days consistently see high rental numbers.
As we progress toward a long term strategy of the bike share program for Citibike, we recommend performing analyses on the price of up-keep and maintenance of the bicycles. We recommend this because of the impact of increasing the number of bike stations may spread resources of the company too thin. The addition of bike stations is convenient to the customer, but it may be inefficient for the company to collect bikes for maintenance or some locations that are not frequently visited are costly to rent compared to other locations with high traffic. We also recommend performing an analysis for the optimum periods to reduce the number of bicycles on the streets prior to the winter months. The winter months involve heavy use of salt to prevent ice. The salt corrodes certain metals and can deteriorate the bicycles. The company can reduce the amount of bicycles on the streets while maintaining an exceptional service to their customers. The collection of the bicycles could assist with maintenance periods as well.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.7.9.2 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4
## [5] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.6
## [9] tidyverse_1.3.0 heatmaply_1.2.1 viridis_0.5.1 viridisLite_0.3.0
## [13] plotly_4.9.3 ggplot2_3.3.3 dygraphs_1.1.1.6
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 jsonlite_1.7.2 foreach_1.5.1 modelr_0.1.8
## [5] assertthat_0.2.1 highr_0.8 cellranger_1.1.0 yaml_2.2.1
## [9] pillar_1.4.7 backports_1.2.1 lattice_0.20-41 glue_1.4.2
## [13] digest_0.6.27 RColorBrewer_1.1-2 rvest_0.3.6 colorspace_2.0-0
## [17] htmltools_0.5.1.1 plyr_1.8.6 pkgconfig_2.0.3 broom_0.7.4
## [21] haven_2.3.1 scales_1.1.1 webshot_0.5.2 generics_0.1.0
## [25] farver_2.0.3 ellipsis_0.3.1 withr_2.4.1 lazyeval_0.2.2
## [29] cli_2.3.0 magrittr_2.0.1 crayon_1.4.0 readxl_1.3.1
## [33] evaluate_0.14 fs_1.5.0 xml2_1.3.2 xts_0.12.1
## [37] tools_4.0.3 registry_0.5-1 data.table_1.13.6 hms_1.0.0
## [41] lifecycle_0.2.0 munsell_0.5.0 reprex_1.0.0 compiler_4.0.3
## [45] rlang_0.4.10 grid_4.0.3 rstudioapi_0.13 iterators_1.0.13
## [49] htmlwidgets_1.5.3 crosstalk_1.1.1 labeling_0.4.2 rmarkdown_2.6
## [53] gtable_0.3.0 codetools_0.2-16 DBI_1.1.1 TSP_1.1-10
## [57] reshape2_1.4.4 R6_2.5.0 seriation_1.2-9 gridExtra_2.3
## [61] zoo_1.8-9 knitr_1.31 dendextend_1.14.0 stringi_1.5.3
## [65] Rcpp_1.0.6 vctrs_0.3.6 dbplyr_2.1.0 tidyselect_1.1.0
## [69] xfun_0.20