Citi Bike is New York City’s bike share system, and the largest in the nation. Citi Bike launched in May 2013 and now consists of 750 stations across Brooklyn, Manhattan, Queens, and Jersey City. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. They provide a low-cost, environmentally-friendly transportation alternative for New York City.
This project looks at the trips taken on Citibike in September 2017. I’ll compare the riding patterns of Citibike annual subscribers and day-pass users, including trip duration, start and stop stations, and busiest times for travel.
Citibike Annual Subscribers pay $163 a year for unlimited 45-minute rides.
Day-pass users, or “Customers”, have the option of purchasing 24-hour or 3-day passes with the service. Both offer unlimited 30-minute rides within the time period purchased.
Loading the necessary packages for data analysis within R.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
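The startup messages above come from attaching dplyr and lubridate, but the loading chunk itself is not shown. Here is a minimal sketch of what it might look like, assuming the package list is the one implied by the functions called later in this post (dplyr, lubridate, ggplot2, googleway). That list is an inference, not the original chunk.

```r
# Packages inferred from the functions used later in this post (an assumption;
# the original loading chunk is not shown).
pkgs <- c("dplyr", "lubridate", "ggplot2", "googleway")

# Check which packages are installed without attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach only the packages that are available.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Checking with requireNamespace() first means a missing package produces a readable message instead of stopping the script at the first library() call.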
Load Citibike Data into R.
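This data was downloaded as a CSV from the CitiBike website and saved to my Dropbox folder. I read the file from there and keep the first 500,000 rows. The file appears to be ordered by start time, so this sample covers the earliest trips of the month rather than a random draw from all of September.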
citibikedata <- read.csv("/Users/ntlrsmllghn/Dropbox/Data/Data 607/Data_607_Final_Project_files/201709-citibike-tripdata.csv", stringsAsFactors = FALSE, header= TRUE)
citibikedata <- citibikedata[1:500000, ]
head(citibikedata)
## tripduration starttime stoptime start.station.id
## 1 362 2017-09-01 00:00:17 2017-09-01 00:06:19 3331
## 2 188 2017-09-01 00:00:21 2017-09-01 00:03:30 3101
## 3 305 2017-09-01 00:00:25 2017-09-01 00:05:30 3140
## 4 223 2017-09-01 00:00:52 2017-09-01 00:04:36 236
## 5 758 2017-09-01 00:01:01 2017-09-01 00:13:40 3427
## 6 2089 2017-09-01 00:01:20 2017-09-01 00:36:09 3016
## start.station.name start.station.latitude start.station.longitude
## 1 Riverside Dr & W 104 St 40.80134 -73.97115
## 2 N 12 St & Bedford Ave 40.72080 -73.95485
## 3 1 Ave & E 78 St 40.77140 -73.95352
## 4 St Marks Pl & 2 Ave 40.72842 -73.98714
## 5 Lafayette St & Jersey St 40.72431 -73.99601
## 6 Kent Ave & N 7 St 40.72037 -73.96165
## end.station.id end.station.name end.station.latitude
## 1 3328 W 100 St & Manhattan Ave 40.79500
## 2 3100 Nassau Ave & Newell St 40.72481
## 3 3141 1 Ave & E 68 St 40.76501
## 4 473 Rivington St & Chrystie St 40.72110
## 5 3431 E 35 St & 3 Ave 40.74652
## 6 3358 Garfield Pl & 8 Ave 40.67120
## end.station.longitude bikeid usertype birth.year gender
## 1 -73.96450 14530 Subscriber 1993 1
## 2 -73.94753 15475 Subscriber 1988 1
## 3 -73.95818 30346 Subscriber 1969 1
## 4 -73.99193 28056 Subscriber 1993 1
## 5 -73.97789 25413 Subscriber 1987 1
## 6 -73.97484 17584 Subscriber 1975 2
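Reading the whole file and then slicing works, but read.csv() can also stop early through its nrows argument. Here is a small sketch on a stand-in CSV; the temporary file and its contents are made up for illustration and are not the Citibike file.

```r
# Write a small stand-in CSV (hypothetical data, not the Citibike file).
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    tripduration = c(362, 188, 305, 223, 758),
    usertype = c("Subscriber", "Subscriber", "Customer", "Subscriber", "Customer")
  ),
  tmp,
  row.names = FALSE
)

# nrows stops reading after the requested number of data rows.
first_rows <- read.csv(tmp, stringsAsFactors = FALSE, header = TRUE, nrows = 3)
nrow(first_rows)
```

For the real file, the same idea would be nrows = 500000, which avoids holding the full month in memory before subsetting.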
I requested an API key from Google Maps to access Google Bicycle Directions. I’ll use the R packages ‘googleway’ and ‘ggmap’ to visualize Citibike dock stations and various bicycle routes, as well as projected time for each trip. I referenced this website for help with the Google Maps API (https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html)
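#read the API key from an environment variable rather than hard-coding it (GOOGLE_MAPS_API_KEY is an assumed variable name)
google_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")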
map <- google_map(key = google_key) %>%
add_bicycling()
Since I’m not utilizing all of the information provided from Citibike (such as age, gender, and bikeid), I’ll subset the data to create a data frame useful for my analysis.
#subset necessary data
citibikedata <- citibikedata[c(1:11, 13)]
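Selecting by position depends on the column order in the CSV. A hedged alternative is to drop the unused columns by name, sketched here on a toy data frame that borrows a few column names from the head() output above. The toy values are only for illustration.

```r
# Toy data frame with a few of the columns shown in head() above.
trips <- data.frame(
  tripduration = c(362, 188),
  bikeid = c(14530, 15475),
  usertype = c("Subscriber", "Subscriber"),
  birth.year = c(1993, 1988),
  gender = c(1, 1)
)

# Drop the unused columns by name instead of by position.
drop_cols <- c("bikeid", "birth.year", "gender")
trips <- trips[, !(names(trips) %in% drop_cols), drop = FALSE]
names(trips)
```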
Now, I’d like to come up with a data frame that will map out the location of all the citibike stations. This will be useful later in my analysis when I take a look at which stations are most popular start and end stations for citibike subscribers and customers.
#build start/end coordinate tables and a lookup table of unique stations
startstation <- data.frame(citibikedata$start.station.latitude, citibikedata$start.station.longitude)
endstation <-data.frame(citibikedata$end.station.latitude, citibikedata$end.station.longitude)
names(startstation) <- c("latitude", "longitude")
names(endstation) <- c("latitude", "longitude")
stations <- data.frame(citibikedata$start.station.id, citibikedata$start.station.name, citibikedata$start.station.latitude, citibikedata$start.station.longitude)
names(stations) <- c("stationID", "stationname", "lat", "lon")
stations <- unique(stations)
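One caveat: the lookup above is built from start stations only, so a dock that appears only as a destination would be left out. Here is a minimal sketch of stacking both ends before de-duplicating, using toy rows. The first two trips reuse station names from the head() output, and the third trip is hypothetical.

```r
# Toy trips; the third row is made up to create an overlap between ends.
trips <- data.frame(
  start.station.id = c(3331, 3101, 3331),
  start.station.name = c("Riverside Dr & W 104 St", "N 12 St & Bedford Ave",
                         "Riverside Dr & W 104 St"),
  end.station.id = c(3328, 3100, 3101),
  end.station.name = c("W 100 St & Manhattan Ave", "Nassau Ave & Newell St",
                       "N 12 St & Bedford Ave"),
  stringsAsFactors = FALSE
)

# Put start and end stations into the same two-column shape.
starts <- data.frame(stationID = trips$start.station.id,
                     stationname = trips$start.station.name,
                     stringsAsFactors = FALSE)
ends <- data.frame(stationID = trips$end.station.id,
                   stationname = trips$end.station.name,
                   stringsAsFactors = FALSE)

# Stack them and keep one row per station.
all_stations <- unique(rbind(starts, ends))
nrow(all_stations)
```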
The Googleway examples I referenced use data for Melbourne, but all of our data is for NYC, so I need to supply my own coordinates. I will use the lat/lon of the stations in the citibikedata set.
#lat, lon for all citibike stations
google_map(key = google_key, data = stations) %>%
add_markers(lat = "lat", lon = "lon", opacity = .5)
I’d like to analyze the trip duration but the data is presented in seconds, so I’m going to add a new column of data with the ‘tripduration’ divided by 60 to give me minutes.
#summarize trip duration in minutes
citibikedata$tripmin <- (citibikedata$tripduration/60)
summary(citibikedata$tripmin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 6.55 11.17 17.99 19.57 36926.33
My final analysis will look at popular times of day for ‘subscribers’ and ‘customers’, so I will need to transform the start and stop times in this data.
Now that I’ve set up my data to help with the analysis, I’ll start my data exploration.
First, I’d like to figure out the share of rides taken by annual subscribers versus customers throughout the month of September 2017.
#percentages of users
citibikedata$usertype <- as.factor(citibikedata$usertype)
count <- count(citibikedata, usertype)
count$Pct <- count$n / sum(count$n)
count
## # A tibble: 2 x 3
## usertype n Pct
## <fctr> <int> <dbl>
## 1 Customer 76853 0.153706
## 2 Subscriber 423147 0.846294
#plot findings
ggplot(citibikedata) + geom_bar(aes(usertype),
fill = "blue",
width = 0.5)
One thing to note: this analysis counts the number of rides taken by subscribers vs. customers. It does not account for multiple trips by individual subscribers or customers.
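The same shares can be checked directly from the ride counts. This is a small sketch using base R’s prop.table(), with the two counts copied from the table above.

```r
# Ride counts copied from the count table above.
rides <- c(Customer = 76853, Subscriber = 423147)

# prop.table() divides each entry by the total.
share <- prop.table(rides)

# Express the shares as percentages with one decimal place.
round(100 * share, 1)
```

Subscribers account for roughly 85% of the rides in this sample and customers for roughly 15%.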
Now, I’d like to figure out most popular start and end stations for subscribers and for day pass holders.
I’ll subset my ‘citibikedata’ into two separate data frames, one for each ‘usertype’: “Customer” and “Subscriber”.
#subset routes based on user type
subroutes <- subset(citibikedata, usertype == "Subscriber")
custroutes <- subset(citibikedata, usertype == "Customer")
With the data now separate, I’m going to create a new column in each data frame that will merge the ‘Start Station’ and the ‘End Station’.
#merge the 'Start Station' and the 'End Station' -- subscriber and customer data
subroutes$route <- paste(subroutes$start.station.name,subroutes$end.station.name,sep=" to ")
custroutes$routes <- paste(custroutes$start.station.name,custroutes$end.station.name,sep=" to ")
Now, I’ll use this new column to figure out the top 10 most frequent routes (start and end station pairs) for ‘subscribers’ and ‘customers’.
#sort 10 most frequent routes -- subscriber data, then customer data
as.data.frame(sort(table(subroutes$route),decreasing=TRUE)[1:10])
as.data.frame(sort(table(custroutes$routes),decreasing=TRUE)[1:10])
## Var1 Freq
## 1 E 7 St & Avenue A to Cooper Square & E 7 St 194
## 2 McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave 126
## 3 W 21 St & 6 Ave to 9 Ave & W 22 St 122
## 4 E 102 St & 1 Ave to 2 Ave & E 96 St 121
## 5 Pier 40 - Hudson River Park to West St & Chambers St 119
## 6 Richardson St & N Henry St to Graham Ave & Conselyea St 119
## 7 Pershing Square North to E 24 St & Park Ave S 118
## 8 S 4 St & Wythe Ave to N 6 St & Bedford Ave 116
## 9 Soissons Landing to Soissons Landing 113
## 10 Yankee Ferry Terminal to Yankee Ferry Terminal 113
## Var1
## 1 Central Park S & 6 Ave to Central Park S & 6 Ave
## 2 Grand Army Plaza & Central Park S to Grand Army Plaza & Central Park S
## 3 Central Park S & 6 Ave to 5 Ave & E 88 St
## 4 Centre St & Chambers St to Centre St & Chambers St
## 5 12 Ave & W 40 St to Pier 40 - Hudson River Park
## 6 Central Park S & 6 Ave to 5 Ave & E 73 St
## 7 Central Park S & 6 Ave to Central Park North & Adam Clayton Powell Blvd
## 8 5 Ave & E 88 St to Central Park North & Adam Clayton Powell Blvd
## 9 Centre St & Chambers St to Cadman Plaza E & Tillary St
## 10 Grand Army Plaza & Central Park S to Central Park S & 6 Ave
## Freq
## 1 253
## 2 249
## 3 214
## 4 154
## 5 153
## 6 147
## 7 144
## 8 128
## 9 126
## 10 126
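The route-counting step can be tried on a handful of rows. This is a minimal sketch with placeholder station names "A", "B", and "C" (hypothetical, chosen so the counts are easy to verify by eye).

```r
# Toy trips with placeholder station names.
trips <- data.frame(
  start.station.name = c("A", "A", "A", "B", "B", "A"),
  end.station.name   = c("B", "B", "B", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# Build a route label from the start and end station names.
trips$route <- paste(trips$start.station.name, trips$end.station.name, sep = " to ")

# Count rides per route and sort from most to least frequent.
top_routes <- as.data.frame(sort(table(trips$route), decreasing = TRUE))
names(top_routes) <- c("route", "rides")
head(top_routes, 2)
```

Note that "A to B" and "B to A" are counted as different routes, which is also how the tables above treat the real stations.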
Next, I’ll plot the most popular start and end stations for Customers and Subscribers. I start with a frequency table of the most common routes together with their start coordinates.
#include lon/lat for popular routes for plotting -- subscriber data
suborigin <- subroutes %>% group_by(route, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
suborigin
## # A tibble: 10 x 4
## # Groups: route, start.station.latitude [10]
## route
## <chr>
## 1 E 7 St & Avenue A to Cooper Square & E 7 St
## 2 McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave
## 3 W 21 St & 6 Ave to 9 Ave & W 22 St
## 4 E 102 St & 1 Ave to 2 Ave & E 96 St
## 5 Pier 40 - Hudson River Park to West St & Chambers St
## 6 Richardson St & N Henry St to Graham Ave & Conselyea St
## 7 Pershing Square North to E 24 St & Park Ave S
## 8 S 4 St & Wythe Ave to N 6 St & Bedford Ave
## 9 Soissons Landing to Soissons Landing
## 10 Yankee Ferry Terminal to Yankee Ferry Terminal
## # ... with 3 more variables: start.station.latitude <dbl>,
## # start.station.longitude <dbl>, n <int>
Subsetting the data to include just the latitude and longitude for plotting in ‘googleway’.
#rename columns in dataframe -- subscriber data
suborigin <- suborigin[c(2,3)]
names(suborigin) <- c("latitude", "longitude")
#subset end lon/lat, rename columns -- subscriber data
subend <- subroutes %>% group_by(route, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
subend <- subend[c(2,3)]
names(subend) <- c("latitude", "longitude")
#create dataframe by merging lon/lat of start and end stations -- subscriber data
subdf <- data.frame(from = c(suborigin),
to = c(subend))
subdf$start <- paste(subdf$from.latitude,subdf$from.longitude,sep=",")
subdf$end <- paste(subdf$to.latitude,subdf$to.longitude,sep=",")
Here is a heat map showing where ‘subscribers’ are starting their rides.
#plot findings for start stations -- subscriber data
google_map(data = subdf, key = google_key) %>%
add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.005)
And the places that they are ending their rides.
#plot findings for end stations -- subscriber data
google_map(data = subdf, key = google_key) %>%
add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.005)
Now, I’ll repeat this analysis for the ‘customers’
#repeat above analysis for customer data
custorigin <- custroutes %>% group_by(routes, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custorigin <- custorigin[c(2,3)]
names(custorigin) <- c("latitude", "longitude")
custend <- custroutes %>% group_by(routes, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custend <- custend[c(2,3)]
names(custend) <- c("latitude", "longitude")
custdf <- data.frame(from = c(custorigin),
to = c(custend))
Here is a heatmap showing the most popular start stations for ‘customers’.
#plot start station for customers
google_map(data = custdf, key = google_key) %>%
add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.01)
Here is a heatmap showing the most popular end stations for ‘Customers’.
#plot end stations for customers
google_map(data = custdf, key = google_key) %>%
add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.01)
Now, I’d like to compare the trip time for both ‘subscribers’ and ‘customers’.
Subscribers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.017 6.067 9.933 13.947 16.783 20357.400
Customers:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.02 14.20 21.65 40.25 29.07 36926.33
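The code behind the two summaries is not shown. One way to get per-group statistics is tapply(), sketched below on made-up durations. This is an assumption about the approach, not the original chunk, and the toy values include one very long trip to show how it moves the mean.

```r
# Toy trips in seconds; the last customer trip is an extreme outlier.
trips <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Subscriber",
               "Customer", "Customer", "Customer"),
  tripduration = c(362, 188, 758, 1299, 1744, 2215500)
)
trips$tripmin <- trips$tripduration / 60

# Full six-number summary per user type.
by_type <- tapply(trips$tripmin, trips$usertype, summary)

# Median and mean per user type, to compare how each reacts to the outlier.
medians <- tapply(trips$tripmin, trips$usertype, median)
means <- tapply(trips$tripmin, trips$usertype, mean)
by_type
```

In the real summaries above, the maximum trip times (over 20,000 minutes) pull the means well above the medians in the same way, so the median is the safer number to quote as a typical ride.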
#box plot comparing ride time in minutes for subscribers and customers; y-axis capped at 60 minutes
boxplot(tripmin~usertype, data=citibikedata, main=toupper("Trip Time"), ylim=c(0,60), xlab="User Type", ylab="Time (minutes)", col="blue")
Next, I would like to visualize the most popular times of day for citibike subscribers and customers.
#plot to visualize popular times of day for citibike rides
citibikedata$timestamp <- strftime(citibikedata$starttime,"%Y-%m-%d %H:%M:%S")
citibikedata$ridehours <- hour(citibikedata$timestamp)
ggplot(citibikedata, aes(ridehours, fill=usertype, color=usertype)) + geom_histogram(
binwidth= 1,
position="identity",
alpha=0.5
)
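The hour extraction can also be done in base R without lubridate. Here is a small sketch on three sample timestamps in the same format as the starttime column. The first is taken from the head() output; the other two are made up. The timestamps are parsed as wall-clock times with a fixed time zone so the hour is not shifted by the machine’s local settings.

```r
# Sample timestamps in the starttime format (the last two are hypothetical).
starttime <- c("2017-09-01 00:00:17", "2017-09-01 08:15:00", "2017-09-01 17:45:30")

# Parse as date-times with a fixed time zone.
parsed <- as.POSIXct(starttime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the hour of day as an integer from 0 to 23.
ridehours <- as.integer(format(parsed, "%H"))
table(ridehours)
```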
Based on the analysis above, we can see that about 85% of the Citibike rides in this September 2017 sample were taken by Annual Subscribers.
The most frequent rides taken by Citibike Subscribers started and ended in Midtown, the East Village, and Williamsburg. For Customers, the most frequent rides started and ended near Central Park and the Brooklyn Bridge.
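The mean ride time for Annual Subscribers was around 14 minutes (median about 10 minutes), while Customers averaged roughly 40 minutes (median about 22 minutes). Both means are inflated by a small number of extremely long trips.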
The most popular times of the day for Annual Subscribers were 8 AM and between 5 and 6 PM. The most popular times of the day for Customers were between 2 and 4 PM.
Todd W Schneider: http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
Heatmap Reference: https://austinwehrwein.com/data-visualization/heatmaps-with-divvy-data/
Google Maps Reference: https://developers.google.com/maps/documentation/javascript/examples/layer-bicycling