Citi Bike is New York City’s bike share system, and the largest in the nation. Citi Bike launched in May 2013 and now consists of 750 stations across Brooklyn, Manhattan, Queens, and Jersey City. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. They provide a low-cost, environmentally friendly transportation alternative for New York City.

Data Science Workflow

Data Acquisition

Loading the necessary packages for data analysis within R.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date
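The chunk that loads the packages is not echoed in this output, only its messages. Here is a minimal sketch of what such a chunk might look like, assuming the packages are the ones named or used elsewhere in this write-up (dplyr, lubridate, ggplot2, googleway, ggmap). The names `pkgs`, `installed`, and `missing_pkgs` are illustrative and not from the original analysis.

```r
# Packages named or used in this analysis (an assumption about the hidden chunk).
pkgs <- c("dplyr", "lubridate", "ggplot2", "googleway", "ggmap")

# Check which packages are installed, so a missing one is reported clearly.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Attaching dplyr and lubridate this way is what produces masking messages like the ones shown above, unless they are suppressed as in this sketch.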

Load Citibike Data into R.

This data was obtained from the Citi Bike website and downloaded as a CSV file onto my computer. The CSV file is read from my Dropbox folder, and only the first 500,000 rows are kept.

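#read the September 2017 trip data from the downloaded CSV
citibikedata <- read.csv("/Users/ntlrsmllghn/Dropbox/Data/Data 607/Data_607_Final_Project_files/201709-citibike-tripdata.csv", stringsAsFactors = FALSE, header = TRUE)
#keep the first 500,000 rows (R indexes from 1, so the range starts at 1, not 0)
citibikedata <- citibikedata[1:500000, ]
head(citibikedata)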

##   tripduration           starttime            stoptime start.station.id
## 1          362 2017-09-01 00:00:17 2017-09-01 00:06:19             3331
## 2          188 2017-09-01 00:00:21 2017-09-01 00:03:30             3101
## 3          305 2017-09-01 00:00:25 2017-09-01 00:05:30             3140
## 4          223 2017-09-01 00:00:52 2017-09-01 00:04:36              236
## 5          758 2017-09-01 00:01:01 2017-09-01 00:13:40             3427
## 6         2089 2017-09-01 00:01:20 2017-09-01 00:36:09             3016
##         start.station.name start.station.latitude start.station.longitude
## 1  Riverside Dr & W 104 St               40.80134               -73.97115
## 2    N 12 St & Bedford Ave               40.72080               -73.95485
## 3          1 Ave & E 78 St               40.77140               -73.95352
## 4      St Marks Pl & 2 Ave               40.72842               -73.98714
## 5 Lafayette St & Jersey St               40.72431               -73.99601
## 6        Kent Ave & N 7 St               40.72037               -73.96165
##   end.station.id           end.station.name end.station.latitude
## 1           3328   W 100 St & Manhattan Ave             40.79500
## 2           3100     Nassau Ave & Newell St             40.72481
## 3           3141            1 Ave & E 68 St             40.76501
## 4            473 Rivington St & Chrystie St             40.72110
## 5           3431            E 35 St & 3 Ave             40.74652
## 6           3358        Garfield Pl & 8 Ave             40.67120
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.96450  14530 Subscriber       1993      1
## 2             -73.94753  15475 Subscriber       1988      1
## 3             -73.95818  30346 Subscriber       1969      1
## 4             -73.99193  28056 Subscriber       1993      1
## 5             -73.97789  25413 Subscriber       1987      1
## 6             -73.97484  17584 Subscriber       1975      2
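Reading the whole file and then trimming it works, but `read.csv` can also stop after a fixed number of rows through its `nrows` argument. The sketch below shows the idea on a small stand-in CSV written to a temporary file. The toy data and the names `tmp`, `toy`, and `first_rows` are made up for illustration.

```r
# Write a small stand-in CSV so the example is self-contained (toy data).
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  tripduration = c(362, 188, 305, 223, 758),
  usertype = c("Subscriber", "Subscriber", "Customer", "Subscriber", "Customer")
)
write.csv(toy, tmp, row.names = FALSE)

# nrows caps how many data rows are read; here 3 stands in for 500000.
first_rows <- read.csv(tmp, stringsAsFactors = FALSE, header = TRUE, nrows = 3)
nrow(first_rows)
```

For a large monthly trip file this avoids holding rows in memory that are discarded immediately afterwards.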

I requested an API key from Google Maps to access Google bicycling directions. I’ll use the R packages ‘googleway’ and ‘ggmap’ to visualize Citi Bike dock stations and various bicycle routes, as well as the projected time for each trip. I referenced the googleway vignette for help with the Google Maps API (https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html).

google_key <- "AIzaSyAtu9cW4c01ucOJPs9MZewbPAPKrJhTkbw"
map <- google_map(key = google_key) %>%
  add_bicycling()
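An API key written directly into a script ends up in every shared copy of the report. One common alternative is to read it from an environment variable. This is only a sketch: `GOOGLE_MAPS_API_KEY` is a hypothetical variable name and `get_google_key` is a helper invented here, not part of googleway.

```r
# Read the API key from an environment variable instead of hard-coding it.
# GOOGLE_MAPS_API_KEY is a hypothetical name; it could be set in ~/.Renviron.
get_google_key <- function(var = "GOOGLE_MAPS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}
```

With that in place, the key could be loaded with `google_key <- get_google_key()` before calling `google_map()`.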

Data Management

Since I’m not using all of the information provided by Citi Bike (such as birth year, gender, and bikeid), I’ll subset the data to create a data frame useful for my analysis.

#subset necessary data: keep columns 1-11 (trip and station fields) and 13 (usertype)
citibikedata <- citibikedata[c(1:11, 13)]

Now, I’d like to build a data frame that maps the location of all the Citi Bike stations. This will be useful later in my analysis when I look at which start and end stations are most popular for Citi Bike subscribers and customers.

#build lat/lon data frames for start and end stations, plus a table of unique stations
startstation <- data.frame(citibikedata$start.station.latitude, citibikedata$start.station.longitude)
endstation <-data.frame(citibikedata$end.station.latitude, citibikedata$end.station.longitude)
names(startstation) <- c("latitude", "longitude")
names(endstation) <- c("latitude", "longitude")
stations <- data.frame(citibikedata$start.station.id, citibikedata$start.station.name, citibikedata$start.station.latitude, citibikedata$start.station.longitude)
names(stations) <- c("stationID", "stationname", "lat", "lon")
stations <- unique(stations)
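The `stations` table above is built from start stations only, so a station that appears only as a trip's end point would be left out. A minimal sketch of one way to cover both, on toy data: the first two rows reuse values from the `head()` output, the third row is invented, and `trips`, `cols`, `starts`, `ends`, and `all_stations` are illustrative names.

```r
# Toy trips: rows 1-2 mirror the head() output above, row 3 is made up.
trips <- data.frame(
  start.station.id = c(3331, 3101, 3331),
  start.station.name = c("Riverside Dr & W 104 St", "N 12 St & Bedford Ave",
                         "Riverside Dr & W 104 St"),
  start.station.latitude = c(40.80134, 40.72080, 40.80134),
  start.station.longitude = c(-73.97115, -73.95485, -73.97115),
  end.station.id = c(3328, 3100, 3101),
  end.station.name = c("W 100 St & Manhattan Ave", "Nassau Ave & Newell St",
                       "N 12 St & Bedford Ave"),
  end.station.latitude = c(40.79500, 40.72481, 40.72080),
  end.station.longitude = c(-73.96450, -73.94753, -73.95485),
  stringsAsFactors = FALSE
)

# Give start and end columns the same names, stack them, and drop duplicates.
cols <- c("stationID", "stationname", "lat", "lon")
starts <- setNames(trips[, 1:4], cols)
ends <- setNames(trips[, 5:8], cols)
all_stations <- unique(rbind(starts, ends))
nrow(all_stations)
```

In a month of real trips nearly every station is likely to appear as a start station at least once, so the difference may be small in practice.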

By default, googleway centers its map on Melbourne. Since all of this data is for NYC, I’ll pass the lat/lon of the stations in the citibikedata set so the map shows the stations instead.

#lat, lon for all citibike stations
google_map(key = google_key, data = stations) %>%
 add_markers(lat = "lat", lon = "lon", opacity = .5)

I’d like to analyze the trip duration, but the data is presented in seconds, so I’m going to add a new column with ‘tripduration’ divided by 60 to give me minutes.

#summarize trip duration in minutes
citibikedata$tripmin <- (citibikedata$tripduration/60)
summary(citibikedata$tripmin)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.55    11.17    17.99    19.57 36926.33

My final analysis will look at popular times of day for ‘subscribers’ and ‘customers’, so I will need to transform the start and stop times in this data.

Data Exploration

Now that I’ve set up my data to help with the analysis, I’ll start my data exploration.

First, I’d like to figure out the percentage of rides taken by annual subscribers versus customers in the September 2017 data (the first 500,000 rows loaded above).

#percentages of users
citibikedata$usertype <- as.factor(citibikedata$usertype)
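#store the result under its own name so it does not shadow dplyr's count()
usercounts <- count(citibikedata, usertype)
usercounts$Pct <- usercounts$n / sum(usercounts$n)
usercounts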

## # A tibble: 2 x 3
##     usertype      n      Pct
##       <fctr>  <int>    <dbl>
## 1   Customer  76853 0.153706
## 2 Subscriber 423147 0.846294
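The same shares can be reproduced in base R with `table()` and `prop.table()`. The sketch below rebuilds a `usertype` vector from the two counts reported in the output above instead of loading the trip file. The names `usertype` and `shares` are illustrative.

```r
# Rebuild the user types from the counts reported above (76853 and 423147).
usertype <- c(rep("Customer", 76853), rep("Subscriber", 423147))

# table() counts each level; prop.table() turns counts into proportions.
shares <- prop.table(table(usertype))
round(100 * shares, 1)
```

This is a quick way to sanity-check the `Pct` column without dplyr.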

#plot findings
ggplot(citibikedata) + geom_bar(aes(usertype), 
                                fill = "blue", 
                                width = 0.5)

One thing to note: this analysis counts the number of rides taken by subscribers vs. customers. It does not account for multiple trips by individual subscribers or customers.

Now, I’d like to figure out most popular start and end stations for subscribers and for day pass holders.

I’ll subset my ‘citibikedata’ into two separate data frames, one for each of the ‘usertype’ values “Customer” and “Subscriber”.

#subset routes based on user type
subroutes <- subset(citibikedata, usertype == "Subscriber")
custroutes <- subset(citibikedata, usertype == "Customer")

With the data now separated, I’m going to create a new column in each data frame that merges the ‘Start Station’ and the ‘End Station’.

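#merge the 'Start Station' and the 'End Station' into one label -- subscriber and customer data
#(the new column is named 'route' for subscribers and 'routes' for customers)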
subroutes$route <- paste(subroutes$start.station.name,subroutes$end.station.name,sep=" to ")
custroutes$routes <- paste(custroutes$start.station.name,custroutes$end.station.name,sep=" to ")

Now, I’ll use this new column to figure out the 10 most frequent routes (start and end station pairs) for ‘subscribers’ and ‘customers’.

#sort 10 most frequent routes for subscribers -- subscriber data
as.data.frame(sort(table(subroutes$route),decreasing=TRUE)[1:10])

##                                                       Var1 Freq
## 1              E 7 St & Avenue A to Cooper Square & E 7 St  194
## 2       McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave  126
## 3                       W 21 St & 6 Ave to 9 Ave & W 22 St  122
## 4                      E 102 St & 1 Ave to 2 Ave & E 96 St  121
## 5     Pier 40 - Hudson River Park to West St & Chambers St  119
## 6  Richardson St & N Henry St to Graham Ave & Conselyea St  119
## 7            Pershing Square North to E 24 St & Park Ave S  118
## 8               S 4 St & Wythe Ave to N 6 St & Bedford Ave  116
## 9                     Soissons Landing to Soissons Landing  113
## 10          Yankee Ferry Terminal to Yankee Ferry Terminal  113

#sort 10 most frequent routes for customers -- customer data
as.data.frame(sort(table(custroutes$routes),decreasing=TRUE)[1:10])

##                                                                       Var1
## 1                         Central Park S & 6 Ave to Central Park S & 6 Ave
## 2   Grand Army Plaza & Central Park S to Grand Army Plaza & Central Park S
## 3                                Central Park S & 6 Ave to 5 Ave & E 88 St
## 4                       Centre St & Chambers St to Centre St & Chambers St
## 5                          12 Ave & W 40 St to Pier 40 - Hudson River Park
## 6                                Central Park S & 6 Ave to 5 Ave & E 73 St
## 7  Central Park S & 6 Ave to Central Park North & Adam Clayton Powell Blvd
## 8         5 Ave & E 88 St to Central Park North & Adam Clayton Powell Blvd
## 9                   Centre St & Chambers St to Cadman Plaza E & Tillary St
## 10             Grand Army Plaza & Central Park S to Central Park S & 6 Ave
##    Freq
## 1   253
## 2   249
## 3   214
## 4   154
## 5   153
## 6   147
## 7   144
## 8   128
## 9   126
## 10  126
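The route ranking above boils down to three steps: paste the start and end names into one label, count the labels with `table()`, and sort the counts in decreasing order. Here is a minimal sketch on made-up station names. `start`, `end`, `route`, and `top` are illustrative.

```r
# Toy start and end stations (made up for illustration).
start <- c("A", "A", "B", "A", "C")
end   <- c("B", "B", "C", "B", "C")

# One label per trip, e.g. "A to B".
route <- paste(start, end, sep = " to ")

# Count each label, sort from most to least frequent, and convert to a data frame.
top <- as.data.frame(sort(table(route), decreasing = TRUE),
                     stringsAsFactors = FALSE)
head(top, 2)
```

Round trips such as "C to C" count as their own route, which is why entries like "Soissons Landing to Soissons Landing" show up in the tables above.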

Next, I’ll plot the most popular start and end stations for customers and subscribers, starting with a frequency table of the most common start and end points.

#include lon/lat for popular routes for plotting -- subscriber data
suborigin <- subroutes %>% group_by(route, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
suborigin

## # A tibble: 10 x 4
## # Groups:   route, start.station.latitude [10]
##                                                      route
##                                                      <chr>
##  1             E 7 St & Avenue A to Cooper Square & E 7 St
##  2      McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave
##  3                      W 21 St & 6 Ave to 9 Ave & W 22 St
##  4                     E 102 St & 1 Ave to 2 Ave & E 96 St
##  5    Pier 40 - Hudson River Park to West St & Chambers St
##  6 Richardson St & N Henry St to Graham Ave & Conselyea St
##  7           Pershing Square North to E 24 St & Park Ave S
##  8              S 4 St & Wythe Ave to N 6 St & Bedford Ave
##  9                    Soissons Landing to Soissons Landing
## 10          Yankee Ferry Terminal to Yankee Ferry Terminal
## # ... with 3 more variables: start.station.latitude <dbl>,
## #   start.station.longitude <dbl>, n <int>

Subsetting the data to include just the latitude and longitude for plotting in ‘googleway’.

#keep only the start lat/lon columns and rename them -- subscriber data
suborigin <- suborigin[c(2,3)]
names(suborigin) <- c("latitude", "longitude")

#subset end lon/lat, rename columns -- subscriber data
subend <- subroutes %>% group_by(route, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
subend <- subend[c(2,3)]
names(subend) <- c("latitude", "longitude")

#create dataframe by merging lon/lat of start and end stations -- subscriber data
subdf <- data.frame(from = c(suborigin),
                 to = c(subend))
subdf$start <- paste(subdf$from.latitude,subdf$from.longitude,sep=",")
subdf$end <- paste(subdf$to.latitude,subdf$to.longitude,sep=",")
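The steps above rank the start and end coordinates in two separate top-10 tables and then place them side by side, which relies on both tables coming out in the same row order. A sketch of an alternative that keeps each route's start and end coordinates in one row from the beginning, using base R `aggregate()` on toy data. All names and values here are made up for illustration.

```r
# Toy trips with start (from) and end (to) coordinates, made up for illustration.
trips <- data.frame(
  route = c("A to B", "A to B", "C to D"),
  from.latitude = c(40.1, 40.1, 40.3),
  from.longitude = c(-73.1, -73.1, -73.3),
  to.latitude = c(40.2, 40.2, 40.4),
  to.longitude = c(-73.2, -73.2, -73.4),
  stringsAsFactors = FALSE
)
trips$n <- 1

# Count trips per route while carrying both coordinate pairs along.
routes <- aggregate(
  n ~ route + from.latitude + from.longitude + to.latitude + to.longitude,
  data = trips, FUN = sum
)

# Most frequent routes first, then keep the top 10.
routes <- routes[order(-routes$n), ]
top_routes <- head(routes, 10)

# "lat,lon" strings for the start and end of each route.
top_routes$start <- paste(top_routes$from.latitude, top_routes$from.longitude, sep = ",")
top_routes$end <- paste(top_routes$to.latitude, top_routes$to.longitude, sep = ",")
top_routes
```

Because each row carries its own start and end point, there is no need to assume that two separately sorted tables line up.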

Here is a heat map showing where ‘subscribers’ are starting their rides.

#plot findings for start stations -- subscriber data
google_map(data = subdf, key = google_key) %>%
  add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.005)

And here are the places where they are ending their rides.

#plot findings for end stations -- subscriber data
google_map(data = subdf, key = google_key) %>%
  add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.005)

Now, I’ll repeat this analysis for the ‘customers’.

#repeat above analysis for customer data
custorigin <- custroutes %>% group_by(routes, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custorigin <- custorigin[c(2,3)]
names(custorigin) <- c("latitude", "longitude")
custend <- custroutes %>% group_by(routes, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custend <- custend[c(2,3)]
names(custend) <- c("latitude", "longitude")

custdf <- data.frame(from = c(custorigin),
                 to = c(custend))

Here is a heatmap showing the most popular start stations for ‘customers’.

#plot start station for customers
google_map(data = custdf, key = google_key) %>%
  add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.01)

Here is a heatmap showing the most popular end stations for ‘customers’.

#plot end stations for customers
google_map(data = custdf, key = google_key) %>%
  add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.01)

Now, I’d like to compare the trip time for both ‘subscribers’ and ‘customers’.

Subscribers:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.017     6.067     9.933    13.947    16.783 20357.400

Customers:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02    14.20    21.65    40.25    29.07 36926.33
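The code that produced these two summaries is not shown. One plausible way to get a summary per user type in a single call is `tapply()` with `summary()`. This is a sketch on toy ride times, not the original code, and `rides` and `by_type` are illustrative names.

```r
# Toy ride times in minutes for each user type (made up for illustration).
rides <- data.frame(
  tripmin = c(5, 10, 15, 20, 30, 40),
  usertype = c("Subscriber", "Subscriber", "Subscriber",
               "Customer", "Customer", "Customer"),
  stringsAsFactors = FALSE
)

# Apply summary() to the ride times within each user type.
by_type <- tapply(rides$tripmin, rides$usertype, summary)
by_type[["Subscriber"]]
by_type[["Customer"]]
```

On the real data, `summary(subroutes$tripmin)` and `summary(custroutes$tripmin)` would give the same per-group figures, since both subsets were created after the `tripmin` column was added.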

#box plot comparing the distribution of ride times (in minutes) for subscribers and customers
boxplot(tripmin~usertype, data=citibikedata, main=toupper("Trip Time"), ylim=c(0,60), xlab="User Type", ylab="Time (minutes)", col="blue")

Next, I would like to visualize the most popular times of day for citibike subscribers and customers.

#plot to visualize popular times of day for citibike rides
#parse the start time into a date-time, then extract the hour of day (0-23)
citibikedata$timestamp <- ymd_hms(citibikedata$starttime)

citibikedata$ridehours <- hour(citibikedata$timestamp)

ggplot(citibikedata, aes(ridehours, fill=usertype, color=usertype)) + geom_histogram(
   binwidth= 1,
   position="identity",
   alpha=0.5
 )
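The hour-of-day step can also be done without lubridate. Here is a minimal base R sketch: the first timestamp is taken from the `head()` output above, the other two are invented, and the `America/New_York` time zone is an assumption about how the start times should be read.

```r
# Start times as character strings; the first is from the data, the rest are made up.
starttime <- c("2017-09-01 00:00:17", "2017-09-01 08:15:00", "2017-09-01 17:45:30")

# Parse to POSIXct with an explicit format and time zone (assumed to be New York).
parsed <- as.POSIXct(starttime, format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York")

# Extract the hour of day (0-23) as an integer and count rides per hour.
ridehours <- as.integer(format(parsed, "%H"))
table(ridehours)
```

Parsing with an explicit format makes malformed timestamps show up as `NA` instead of being silently misread.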

Data 607 Final Project

Natalie Mollaghan

12/4/2017
