Citi Bike is New York City’s bike share system, and the largest in the nation. Citi Bike launched in May 2013 and now consists of 750 stations across Brooklyn, Manhattan, Queens, and Jersey City. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. They provide a low-cost, environmentally-friendly transportation alternative for New York City.
This project looks at the trips taken on Citibike in September 2017. I’ll compare the riding patterns of Citibike annual subscribers and day-pass users, including trip duration, start and stop stations, and busiest times for travel.
Citibike Annual Subscribers pay $163 a year for unlimited 45-minute rides.
Day-pass users, or “Customers”, have the option of purchasing 24-hour or 3-day passes with the service. Both offer unlimited 30-minute rides within the time period purchased.
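Loading the necessary packages for data analysis within R.
library(dplyr)     # data manipulation (count, group_by, summarize)
library(lubridate) # hour() for extracting the hour of a timestamp
library(ggplot2)   # bar chart, histogram, and heat graph
library(googleway) # Google Maps widgets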
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Load Citibike Data into R.
This data was obtained from the CitiBike website and downloaded as a CSV onto my computer.
citibikedata <- read.csv("/Users/ntlrsmllghn/Downloads/201709-citibike-tripdata.csv", stringsAsFactors = FALSE, header= TRUE)
head(citibikedata)
## tripduration starttime stoptime start.station.id
## 1 362 2017-09-01 00:00:17 2017-09-01 00:06:19 3331
## 2 188 2017-09-01 00:00:21 2017-09-01 00:03:30 3101
## 3 305 2017-09-01 00:00:25 2017-09-01 00:05:30 3140
## 4 223 2017-09-01 00:00:52 2017-09-01 00:04:36 236
## 5 758 2017-09-01 00:01:01 2017-09-01 00:13:40 3427
## 6 2089 2017-09-01 00:01:20 2017-09-01 00:36:09 3016
## start.station.name start.station.latitude start.station.longitude
## 1 Riverside Dr & W 104 St 40.80134 -73.97115
## 2 N 12 St & Bedford Ave 40.72080 -73.95485
## 3 1 Ave & E 78 St 40.77140 -73.95352
## 4 St Marks Pl & 2 Ave 40.72842 -73.98714
## 5 Lafayette St & Jersey St 40.72431 -73.99601
## 6 Kent Ave & N 7 St 40.72037 -73.96165
## end.station.id end.station.name end.station.latitude
## 1 3328 W 100 St & Manhattan Ave 40.79500
## 2 3100 Nassau Ave & Newell St 40.72481
## 3 3141 1 Ave & E 68 St 40.76501
## 4 473 Rivington St & Chrystie St 40.72110
## 5 3431 E 35 St & 3 Ave 40.74652
## 6 3358 Garfield Pl & 8 Ave 40.67120
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.96450 14530 Subscriber 1993 1
## 2 -73.94753 15475 Subscriber 1988 1
## 3 -73.95818 30346 Subscriber 1969 1
## 4 -73.99193 28056 Subscriber 1993 1
## 5 -73.97789 25413 Subscriber 1987 1
## 6 -73.97484 17584 Subscriber 1975 2
I requested an API key from Google Maps to access Google Bicycle Directions. I’ll use the R packages ‘googleway’ and ‘ggmap’ to visualize Citibike dock stations and various bicycle routes, as well as projected time for each trip. I referenced this website for help with the Google Maps API (https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html)
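# Read the API key from an environment variable rather than hard-coding it in the post
# (the variable name GOOGLE_MAPS_API_KEY is an assumption; use whatever name you set).
google_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")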
map <- google_map(key = google_key) %>%
add_bicycling()
Since I’m not using all of the information provided by Citibike (such as age, gender, and bikeid), I’ll subset the data to create a data frame useful for my analysis.
citibikedata <- citibikedata[c(1:11, 13)]
Now, I’d like to come up with a data frame that will map out the location of all the Citibike stations. This will be useful later in my analysis when I take a look at which stations are the most popular start and end stations for Citibike subscribers and customers.
startstation <- data.frame(citibikedata$start.station.latitude, citibikedata$start.station.longitude)
endstation <-data.frame(citibikedata$end.station.latitude, citibikedata$end.station.longitude)
names(startstation) <- c("latitude", "longitude")
names(endstation) <- c("latitude", "longitude")
stations <- data.frame(citibikedata$start.station.id, citibikedata$start.station.name, citibikedata$start.station.latitude, citibikedata$start.station.longitude)
names(stations) <- c("stationID", "stationname", "lat", "lon")
stations <- unique(stations)
I need to supply lat/lon columns to googleway myself, since all our data is for NYC and not Melbourne (the location used in the googleway examples), so I will use the lat/lon of the stations in the citibikedata set.
google_map(key = google_key, data = stations) %>%
add_markers(lat = "lat", lon = "lon", opacity = .5)
I’d like to analyze the trip duration, but the data is presented in seconds, so I’m going to add a new column with ‘tripduration’ divided by 60 to give me minutes.
citibikedata$tripmin <- (citibikedata$tripduration/60)
summary(citibikedata$tripmin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 6.53 10.98 16.62 18.93 36926.33
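The gap between the median (10.98) and the mean (16.62) in the summary above suggests that a few very long trips, such as the 36926-minute maximum, pull the mean upward. A minimal sketch of that effect on made-up durations (the values and the 24-hour cutoff are assumptions for illustration, not taken from the Citibike file):

```r
# Toy trip durations in seconds (made-up values); the last one mimics
# a bike that was not docked for several days.
tripduration <- c(362, 188, 305, 223, 758, 2089, 600000)

# Same conversion as in the post: seconds to minutes.
tripmin <- tripduration / 60

# A single extreme trip drags the mean far above the median.
trip_mean   <- mean(tripmin)
trip_median <- median(tripmin)

# Share of trips longer than a cutoff (24 hours here, an arbitrary choice).
long_share <- mean(tripmin > 24 * 60)

print(c(mean = trip_mean, median = trip_median, long_share = long_share))
```

For skewed data like this, the median is usually the more representative "typical" trip length.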
My final analysis will look at popular times of day for ‘subscribers’ and ‘customers’, so I will need to transform the start and stop times in this data.
Now that I’ve set up my data to help with the analysis, I’ll start my data exploration.
First, I’d like to figure out the percentage of rides taken by annual subscribers versus customers throughout the month of September 2017.
citibikedata$usertype <- as.factor(citibikedata$usertype)
count <- count(citibikedata, usertype)
count$Pct <- count$n / sum(count$n)
count
## # A tibble: 2 x 3
## usertype n Pct
## <fctr> <int> <dbl>
## 1 Customer 234989 0.1251207
## 2 Subscriber 1643109 0.8748793
ggplot(citibikedata) + geom_bar(aes(usertype),
fill = "blue",
width = 0.5)
One thing to note: this analysis counts the number of rides taken by subscribers vs. customers. It does not account for multiple trips by individual subscribers or customers.
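The ride shares computed above can also be obtained with base R's `table` and `prop.table`. This is a small sketch on a made-up vector of user types, not the real data:

```r
# Made-up user types for eight rides (illustration only).
usertype <- c("Subscriber", "Customer", "Subscriber", "Subscriber",
              "Subscriber", "Customer", "Subscriber", "Subscriber")

# Count rides per user type, then convert the counts to proportions.
counts <- table(usertype)
shares <- prop.table(counts)

# Percentages rounded to one decimal place.
print(round(100 * shares, 1))
```

As with the dplyr version, each row is a ride, so the proportions describe rides rather than distinct riders.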
Now, I’d like to figure out most popular start and end stations for subscribers and for day pass holders.
I’ll subset my ‘citibikedata’ into two separate data frames, one for each ‘usertype’: “Customer” and “Subscriber”.
subroutes <- subset(citibikedata, usertype == "Subscriber")
custroutes <- subset(citibikedata, usertype == "Customer")
With the data now separate, I’m going to create a new column in each data frame that will merge the ‘Start Station’ and the ‘End Station’.
subroutes$route <- paste(subroutes$start.station.name,subroutes$end.station.name,sep=" to ")
custroutes$routes <- paste(custroutes$start.station.name,custroutes$end.station.name,sep=" to ")
Now, I’ll use this new column to figure out the top 10 most frequent routes (start and end station pairs) for ‘subscribers’ and ‘customers’.
as.data.frame(sort(table(subroutes$route),decreasing=TRUE)[1:10])
## Var1 Freq
## 1 E 7 St & Avenue A to Cooper Square & E 7 St 755
## 2 W 21 St & 6 Ave to 9 Ave & W 22 St 482
## 3 Pershing Square North to Broadway & W 32 St 479
## 4 Pershing Square North to E 24 St & Park Ave S 473
## 5 S 4 St & Wythe Ave to N 6 St & Bedford Ave 465
## 6 Cooper Square & E 7 St to E 7 St & Avenue A 461
## 7 Richardson St & N Henry St to Graham Ave & Conselyea St 454
## 8 McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave 444
## 9 Vernon Blvd & 50 Ave to McGuinness Blvd & Eagle St 418
## 10 N 6 St & Bedford Ave to Wythe Ave & Metropolitan Ave 399
as.data.frame(sort(table(custroutes$routes),decreasing=TRUE)[1:10])
## Var1 Freq
## 1 Central Park S & 6 Ave to Central Park S & 6 Ave 637
## 2 Central Park S & 6 Ave to 5 Ave & E 88 St 629
## 3 Grand Army Plaza & Central Park S to Grand Army Plaza & Central Park S 500
## 4 Centre St & Chambers St to Centre St & Chambers St 448
## 5 Old Fulton St to Centre St & Chambers St 412
## 6 Centre St & Chambers St to Cadman Plaza E & Tillary St 409
## 7 Central Park S & 6 Ave to 5 Ave & E 73 St 405
## 8 12 Ave & W 40 St to Pier 40 - Hudson River Park 397
## 9 12 Ave & W 40 St to West St & Chambers St 390
## 10 Central Park S & 6 Ave to Central Park North & Adam Clayton Powell Blvd 364
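Several of the top customer routes start and end at the same station. The route counting above, plus a simple round-trip share, can be sketched on a toy data frame. The station names are made up, and the `round_trip` column is a hypothetical helper that does not appear in the original analysis:

```r
# Toy trips with made-up station names (illustration only).
trips <- data.frame(
  start.station.name = c("A St", "A St", "B St", "A St", "C St", "B St"),
  end.station.name   = c("B St", "B St", "A St", "A St", "C St", "A St"),
  stringsAsFactors = FALSE
)

# Build the route label the same way as in the post.
trips$route <- paste(trips$start.station.name, trips$end.station.name, sep = " to ")

# Frequency table of routes, most frequent first; keep the top 2.
route_counts <- sort(table(trips$route), decreasing = TRUE)
top_routes <- as.data.frame(route_counts[1:2], stringsAsFactors = FALSE)
names(top_routes) <- c("route", "trips")

# Hypothetical helper: flag trips that return to their start station.
trips$round_trip <- trips$start.station.name == trips$end.station.name
round_trip_share <- mean(trips$round_trip)

print(top_routes)
print(round_trip_share)
```

Running the same flag separately on `subroutes` and `custroutes` would be one way to check whether customers take round trips more often than subscribers.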
Next, I’ll plot the most popular start and end stations for Customers and Subscribers. First, I’ll build a frequency table of the most common start and end points.
suborigin <- subroutes %>% group_by(route, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
suborigin
## # A tibble: 10 x 4
## # Groups: route, start.station.latitude [10]
## route
## <chr>
## 1 E 7 St & Avenue A to Cooper Square & E 7 St
## 2 W 21 St & 6 Ave to 9 Ave & W 22 St
## 3 Pershing Square North to Broadway & W 32 St
## 4 Pershing Square North to E 24 St & Park Ave S
## 5 S 4 St & Wythe Ave to N 6 St & Bedford Ave
## 6 Cooper Square & E 7 St to E 7 St & Avenue A
## 7 Richardson St & N Henry St to Graham Ave & Conselyea St
## 8 McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave
## 9 Vernon Blvd & 50 Ave to McGuinness Blvd & Eagle St
## 10 N 6 St & Bedford Ave to Wythe Ave & Metropolitan Ave
## # ... with 3 more variables: start.station.latitude <dbl>,
## # start.station.longitude <dbl>, n <int>
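The grouping above counts routes and keeps the start coordinates of each route. If the goal is to rank individual start stations rather than routes, one option is to count trips per station. This is a sketch on made-up data that uses base R's `aggregate` instead of the dplyr pipeline; the coordinates and the helper column `n` are assumptions for illustration:

```r
# Toy trips with made-up stations and coordinates (illustration only).
trips <- data.frame(
  start.station.name      = c("A St", "A St", "B St", "A St", "C St", "B St"),
  start.station.latitude  = c(40.70, 40.70, 40.71, 40.70, 40.72, 40.71),
  start.station.longitude = c(-73.99, -73.99, -73.98, -73.99, -73.97, -73.98),
  stringsAsFactors = FALSE
)

# One row per trip, so summing a column of ones counts trips per station.
trips$n <- 1
station_counts <- aggregate(
  n ~ start.station.name + start.station.latitude + start.station.longitude,
  data = trips,
  FUN = sum
)

# Most popular start stations first; keep the top 2.
station_counts <- station_counts[order(-station_counts$n), ]
top_starts <- head(station_counts, 2)
print(top_starts)
```

The resulting latitude and longitude columns could then be passed to a map layer in the same way as `suborigin`.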
Subsetting the data to include just the latitude and longitude for plotting in ‘googleway’.
suborigin <- suborigin[c(2,3)]
names(suborigin) <- c("latitude", "longitude")
subend <- subroutes %>% group_by(route, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
subend <- subend[c(2,3)]
names(subend) <- c("latitude", "longitude")
subdf <- data.frame(from = c(suborigin),
to = c(subend))
subdf$start <- paste(subdf$from.latitude,subdf$from.longitude,sep=",")
subdf$end <- paste(subdf$to.latitude,subdf$to.longitude,sep=",")
Here is a heat map showing where ‘subscribers’ are starting their rides.
google_map(data = subdf, key = google_key) %>%
add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.005)
And the places where they are ending their rides.
google_map(data = subdf, key = google_key) %>%
add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.005)
Now, I’ll repeat this analysis for the ‘customers’.
custorigin <- custroutes %>% group_by(routes, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custorigin <- custorigin[c(2,3)]
names(custorigin) <- c("latitude", "longitude")
custend <- custroutes %>% group_by(routes, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custend <- custend[c(2,3)]
names(custend) <- c("latitude", "longitude")
custdf <- data.frame(from = c(custorigin),
to = c(custend))
Here is a heatmap showing the most popular start stations for ‘customers’.
google_map(data = custdf, key = google_key) %>%
add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.01)
Here is a heatmap showing the most popular end stations for ‘customers’.
google_map(data = custdf, key = google_key) %>%
add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.01)
Now, I’d like to compare the trip time for both ‘subscribers’ and ‘customers’.
Subscribers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 6.150 10.033 13.689 16.800 20357.400
Customers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 13.87 21.27 37.10 28.68 36926.33
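The code behind the two summaries is not included in the post. One way to produce per-group summaries like these, shown as a sketch on made-up trip times rather than the author's actual call, is `tapply`:

```r
# Made-up trip times in minutes for two user types (illustration only).
trips <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Subscriber",
               "Customer", "Customer", "Customer"),
  tripmin  = c(6, 10, 17, 14, 21, 29),
  stringsAsFactors = FALSE
)

# Five-number summary plus mean, computed separately for each user type.
duration_by_type <- tapply(trips$tripmin, trips$usertype, summary)
print(duration_by_type)

# Medians alone, which are less sensitive to extreme trips than means.
medians <- tapply(trips$tripmin, trips$usertype, median)
print(medians)
```

Applied to the real data, the equivalent call would use `citibikedata$tripmin` and `citibikedata$usertype`.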
boxplot(tripmin~usertype, data=citibikedata, main=toupper("Trip Time"), ylim=c(0,60), xlab="User Type", ylab="Time", col="blue")
Next, I would like to analyze the most popular times of day for citibike subscribers and customers.
citibikedata$timestamp <- strftime(citibikedata$starttime,"%Y-%m-%d %H:%M:%S")
citibikedata$ridehours <- hour(citibikedata$timestamp)
ggplot(citibikedata, aes(ridehours, fill=usertype, color=usertype)) + geom_histogram(
binwidth= 1,
position="identity",
alpha=0.5
)
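The hour extraction used for the histogram can be checked on a few timestamps. This sketch uses base R instead of lubridate; the example timestamps are made up in the file's format, and parsing with a fixed `tz = "UTC"` is an assumption made so that the printed hour always matches the hour written in the string:

```r
# Made-up start times in the same format as the 'starttime' column.
starttime <- c("2017-09-01 00:00:17", "2017-09-01 08:15:42", "2017-09-01 17:59:59")

# Parse the strings into date-times with an explicit format and time zone.
parsed <- as.POSIXct(starttime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the hour of day (0-23) as an integer.
ridehours <- as.integer(format(parsed, "%H"))
print(ridehours)

# Rides per hour, with all 24 hours present even when the count is zero.
rides_per_hour <- table(factor(ridehours, levels = 0:23))
print(rides_per_hour)
```

Keeping all 24 levels in the table means quiet hours show up as zeros rather than disappearing from the result.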
A heat graph to illustrate the most popular times of the day for Citibike usage.
heatmap <- transform(citibikedata, freq = ave(seq(nrow(citibikedata)), ridehours, FUN=length))
ggplot(heatmap, aes(x = ridehours,y=usertype, fill = freq)) +
viridis::scale_fill_viridis(name="Trip Hours",
option = 'C',
direction = 1,
na.value = "grey93") +
geom_tile(color = 'white', size = 0.1) +
scale_x_continuous(
expand = c(0, 0),
breaks = 0:23,
labels = sprintf("%02d", 0:23))
I had a hard time working with the googleway package. I would have liked to plot the routes that Subscribers and Customers were taking, plus look at the suggested time for the routes, but had to settle for the heatmaps of the popular stations.
Based on the analysis above, we can see that over 87% of Citibike rides taken in September 2017 were taken by Annual Subscribers.
The most frequent rides taken by Citibike Subscribers started and ended in Midtown, the East Village, and Williamsburg. For Customers, the most frequent rides started and ended near Central Park and the Brooklyn Bridge.
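The mean ride time for Annual Subscribers was about 13.7 minutes, while Customers averaged roughly 37.1 minutes on the bicycles. Both means are pulled up by a few very long trips; the median ride was about 10 minutes for Subscribers and 21 minutes for Customers.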
The most popular times of the day for Annual Subscribers were 8 AM and between 5 and 6 PM. The most popular times of the day for Customers were between 2 and 4 PM.
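Peak hours like these can be read from a table of ride counts per hour and user type. The following is a minimal sketch on made-up rides, not the September 2017 data; the `hourly` and `busiest` names are introduced here for illustration:

```r
# Made-up rides: user type and hour of day (illustration only).
rides <- data.frame(
  usertype  = c("Subscriber", "Subscriber", "Subscriber", "Subscriber",
                "Customer", "Customer", "Customer"),
  ridehours = c(8, 8, 17, 8, 14, 14, 15),
  stringsAsFactors = FALSE
)

# Cross-tabulate user type by hour, keeping all 24 hours as levels.
hourly <- as.data.frame(
  table(usertype  = rides$usertype,
        ridehours = factor(rides$ridehours, levels = 0:23)),
  responseName = "freq",
  stringsAsFactors = FALSE
)
hourly$ridehours <- as.integer(hourly$ridehours)

# Busiest hour for each user type.
busiest <- sapply(split(hourly, hourly$usertype),
                  function(d) d$ridehours[which.max(d$freq)])
print(busiest)
```

A table shaped like `hourly` (one row per user type and hour) could also be used as the data for a tile-based heat graph, so that each user type gets its own hourly counts.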
Todd W Schneider: http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
Heatmap Reference: https://austinwehrwein.com/data-visualization/heatmaps-with-divvy-data/
Google Maps Reference: https://developers.google.com/maps/documentation/javascript/examples/layer-bicycling