Citi Bike is New York City’s bike share system, and the largest in the nation. Citi Bike launched in May 2013 and now consists of 750 stations across Brooklyn, Manhattan, Queens, and Jersey City. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. They provide a low-cost, environmentally-friendly transportation alternative for New York City.

Data Science Workflow

Data Acquisition

Loading the necessary packages for data analysis within R

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

Load Citibike Data into R.

This data was obtained from the CitiBike website and downloaded as a CSV onto my computer.

citibikedata <- read.csv("/Users/ntlrsmllghn/Downloads/201709-citibike-tripdata.csv", stringsAsFactors = FALSE, header= TRUE)
head(citibikedata)

##   tripduration           starttime            stoptime start.station.id
## 1          362 2017-09-01 00:00:17 2017-09-01 00:06:19             3331
## 2          188 2017-09-01 00:00:21 2017-09-01 00:03:30             3101
## 3          305 2017-09-01 00:00:25 2017-09-01 00:05:30             3140
## 4          223 2017-09-01 00:00:52 2017-09-01 00:04:36              236
## 5          758 2017-09-01 00:01:01 2017-09-01 00:13:40             3427
## 6         2089 2017-09-01 00:01:20 2017-09-01 00:36:09             3016
##         start.station.name start.station.latitude start.station.longitude
## 1  Riverside Dr & W 104 St               40.80134               -73.97115
## 2    N 12 St & Bedford Ave               40.72080               -73.95485
## 3          1 Ave & E 78 St               40.77140               -73.95352
## 4      St Marks Pl & 2 Ave               40.72842               -73.98714
## 5 Lafayette St & Jersey St               40.72431               -73.99601
## 6        Kent Ave & N 7 St               40.72037               -73.96165
##   end.station.id           end.station.name end.station.latitude
## 1           3328   W 100 St & Manhattan Ave             40.79500
## 2           3100     Nassau Ave & Newell St             40.72481
## 3           3141            1 Ave & E 68 St             40.76501
## 4            473 Rivington St & Chrystie St             40.72110
## 5           3431            E 35 St & 3 Ave             40.74652
## 6           3358        Garfield Pl & 8 Ave             40.67120
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.96450  14530 Subscriber       1993      1
## 2             -73.94753  15475 Subscriber       1988      1
## 3             -73.95818  30346 Subscriber       1969      1
## 4             -73.99193  28056 Subscriber       1993      1
## 5             -73.97789  25413 Subscriber       1987      1
## 6             -73.97484  17584 Subscriber       1975      2

I requested an API key from Google Maps to access Google Bicycle Directions. I’ll use the R packages ‘googleway’ and ‘ggmap’ to visualize Citibike dock stations and various bicycle routes, as well as the projected time for each trip. I referred to the googleway vignette for help with the Google Maps API (https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html).

google_key <- "AIzaSyAtu9cW4c01ucOJPs9MZewbPAPKrJhTkbw"
map <- google_map(key = google_key) %>%
  add_bicycling()
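A key written directly into a script is visible to anyone who reads the report. A minimal sketch of one alternative, assuming the key has been stored in an environment variable; the variable name GOOGLE_MAPS_API_KEY and the helper read_api_key() are hypothetical and not something googleway requires:

```r
# Read the key from an environment variable instead of writing it in the script.
# GOOGLE_MAPS_API_KEY is an assumed, illustrative variable name.
read_api_key <- function(var = "GOOGLE_MAPS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# Example: set a placeholder value for this session only, then read it back.
Sys.setenv(GOOGLE_MAPS_API_KEY = "my-placeholder-key")
google_key_from_env <- read_api_key()
```

The returned string could then be passed wherever the report uses google_key.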

Data Management

Since I’m not using all of the information Citibike provides (such as birth year, gender, and bikeid), I’ll subset the data to keep only the columns useful for my analysis.

citibikedata <- citibikedata[c(1:11, 13)]
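Selecting columns by position works as long as the CSV layout never changes. As a hedged alternative, the same subset can be expressed by column name. The sketch below uses a small toy data frame (only a few of the real columns, with values taken from the first rows shown earlier) rather than the full data set:

```r
# Toy data frame with some of the Citi Bike column names
trips <- data.frame(
  tripduration = c(362, 188),
  starttime = c("2017-09-01 00:00:17", "2017-09-01 00:00:21"),
  bikeid = c(14530, 15475),
  usertype = c("Subscriber", "Subscriber"),
  birth.year = c(1993, 1988),
  gender = c(1, 1),
  stringsAsFactors = FALSE
)

# Drop the unused columns by name rather than by position
drop_cols <- c("bikeid", "birth.year", "gender")
trips_small <- trips[, !(names(trips) %in% drop_cols), drop = FALSE]
names(trips_small)
```

Naming the dropped columns makes the intent readable and keeps the subset correct if columns are reordered.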

Now, I’d like to build a data frame that maps out the location of all the Citibike stations. This will be useful later in my analysis when I look at which stations are the most popular start and end stations for Citibike subscribers and customers.

startstation <- data.frame(citibikedata$start.station.latitude, citibikedata$start.station.longitude)
endstation <-data.frame(citibikedata$end.station.latitude, citibikedata$end.station.longitude)
names(startstation) <- c("latitude", "longitude")
names(endstation) <- c("latitude", "longitude")
stations <- data.frame(citibikedata$start.station.id, citibikedata$start.station.name, citibikedata$start.station.latitude, citibikedata$start.station.longitude)
names(stations) <- c("stationID", "stationname", "lat", "lon")
stations <- unique(stations)
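The stations data frame above is built from start stations only, so a station that appears only as a destination would be left out. A minimal sketch of one way to combine both ends of each trip, using toy data (the third trip is made up for illustration):

```r
# Toy trips; the first two rows mirror the sample output, the third is invented
trips <- data.frame(
  start.station.id = c(3331, 3101, 3331),
  start.station.name = c("Riverside Dr & W 104 St", "N 12 St & Bedford Ave",
                         "Riverside Dr & W 104 St"),
  start.station.latitude = c(40.80134, 40.72080, 40.80134),
  start.station.longitude = c(-73.97115, -73.95485, -73.97115),
  end.station.id = c(3328, 3100, 3101),
  end.station.name = c("W 100 St & Manhattan Ave", "Nassau Ave & Newell St",
                       "N 12 St & Bedford Ave"),
  end.station.latitude = c(40.79500, 40.72481, 40.72080),
  end.station.longitude = c(-73.96450, -73.94753, -73.95485),
  stringsAsFactors = FALSE
)

# Give start and end columns the same names so they can be stacked
station_cols <- c("stationID", "stationname", "lat", "lon")
starts <- setNames(trips[, 1:4], station_cols)
ends <- setNames(trips[, 5:8], station_cols)

# Stack both sets and keep the first row seen for each station ID
all_stations <- rbind(starts, ends)
all_stations <- all_stations[!duplicated(all_stations$stationID), ]
```

Deduplicating on the station ID alone also guards against the same station appearing with slightly different coordinates.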

The googleway examples are built around Melbourne data, but all of our data is for NYC, so I need to supply my own latitude and longitude columns. I will use the lat/lon of the stations in the citibikedata set.

google_map(key = google_key, data = stations) %>%
 add_markers(lat = "lat", lon = "lon", opacity = .5)

I’d like to analyze the trip duration but the data is presented in seconds, so I’m going to add a new column of data with the ‘tripduration’ divided by 60 to give me minutes.

citibikedata$tripmin <- (citibikedata$tripduration/60)
summary(citibikedata$tripmin)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.53    10.98    16.62    18.93 36926.33

My final analysis will look at popular times of day for ‘subscribers’ and ‘customers’, so I will need to transform the start and stop times in this data.
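As a rough illustration of that transformation, the sketch below parses timestamp strings in the CSV's format and pulls out the hour of day using base R only. The time zone is an assumption (the file does not state one), and the second and third timestamps are made up:

```r
# Timestamps in the same text format as the starttime column
starttime <- c("2017-09-01 00:00:17", "2017-09-01 08:15:00", "2017-09-01 17:45:30")

# Parse the text into date-times; tz is an assumption, not stated in the CSV
parsed <- as.POSIXct(starttime, format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York")

# Extract the hour of day (0-23) as an integer
ridehour <- as.integer(format(parsed, "%H"))
ridehour
```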

Data Exploration

Now that I’ve set up my data to help with the analysis, I’ll start my data exploration.

First, I’d like to figure out what share of Citibike rides in September 2017 were taken by annual subscribers versus customers.

citibikedata$usertype <- as.factor(citibikedata$usertype)
usertype_count <- count(citibikedata, usertype)
usertype_count$Pct <- usertype_count$n / sum(usertype_count$n)
usertype_count

## # A tibble: 2 x 3
##     usertype       n       Pct
##       <fctr>   <int>     <dbl>
## 1   Customer  234989 0.1251207
## 2 Subscriber 1643109 0.8748793
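The Pct column is each group's ride count divided by the total. The sketch below repeats that arithmetic with the counts from the table above, then shows a base R equivalent on a toy vector of user types (the toy vector is illustrative only):

```r
# Ride counts copied from the table above
rides <- c(Customer = 234989, Subscriber = 1643109)

# Share of all rides by user type
share <- rides / sum(rides)
round(share, 4)

# The same calculation from a raw vector of user types (toy example)
usertype <- c("Subscriber", "Subscriber", "Customer", "Subscriber")
toy_share <- prop.table(table(usertype))
toy_share
```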

ggplot(citibikedata) + geom_bar(aes(usertype), 
                                fill = "blue", 
                                width = 0.5)

One thing to note: this analysis counts the number of rides taken by subscribers vs. customers. It does not account for multiple trips by individual subscribers or customers.

Now, I’d like to figure out the most popular start and end stations for subscribers and for day pass holders.

I’ll subset my ‘citibikedata’ into two separate data frames, one for each ‘usertype’: “Customer” and “Subscriber”.

subroutes <- subset(citibikedata, usertype == "Subscriber")
custroutes <- subset(citibikedata, usertype == "Customer")

With the data now separate, I’m going to create a new column in each data frame that will merge the ‘Start Station’ and the ‘End Station’.

subroutes$route <- paste(subroutes$start.station.name,subroutes$end.station.name,sep=" to ")
custroutes$routes <- paste(custroutes$start.station.name,custroutes$end.station.name,sep=" to ")

Now, I’ll use this new column to figure out the 10 most frequent routes (start and end station pairs) for ‘subscribers’ and ‘customers’.

as.data.frame(sort(table(subroutes$route),decreasing=TRUE)[1:10])

##                                                       Var1 Freq
## 1              E 7 St & Avenue A to Cooper Square & E 7 St  755
## 2                       W 21 St & 6 Ave to 9 Ave & W 22 St  482
## 3              Pershing Square North to Broadway & W 32 St  479
## 4            Pershing Square North to E 24 St & Park Ave S  473
## 5               S 4 St & Wythe Ave to N 6 St & Bedford Ave  465
## 6              Cooper Square & E 7 St to E 7 St & Avenue A  461
## 7  Richardson St & N Henry St to Graham Ave & Conselyea St  454
## 8       McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave  444
## 9       Vernon Blvd & 50 Ave to McGuinness Blvd & Eagle St  418
## 10    N 6 St & Bedford Ave to Wythe Ave & Metropolitan Ave  399

as.data.frame(sort(table(custroutes$routes),decreasing=TRUE)[1:10])

##                                                                       Var1 Freq
## 1                         Central Park S & 6 Ave to Central Park S & 6 Ave  637
## 2                                Central Park S & 6 Ave to 5 Ave & E 88 St  629
## 3   Grand Army Plaza & Central Park S to Grand Army Plaza & Central Park S  500
## 4                       Centre St & Chambers St to Centre St & Chambers St  448
## 5                                 Old Fulton St to Centre St & Chambers St  412
## 6                   Centre St & Chambers St to Cadman Plaza E & Tillary St  409
## 7                                Central Park S & 6 Ave to 5 Ave & E 73 St  405
## 8                          12 Ave & W 40 St to Pier 40 - Hudson River Park  397
## 9                                12 Ave & W 40 St to West St & Chambers St  390
## 10 Central Park S & 6 Ave to Central Park North & Adam Clayton Powell Blvd  364

Next, I’ll plot the most popular start and end stations for customers and subscribers, starting with a frequency table of the most common start and end points.

suborigin <- subroutes %>% group_by(route, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
suborigin

## # A tibble: 10 x 4
## # Groups:   route, start.station.latitude [10]
##                                                      route
##                                                      <chr>
##  1             E 7 St & Avenue A to Cooper Square & E 7 St
##  2                      W 21 St & 6 Ave to 9 Ave & W 22 St
##  3             Pershing Square North to Broadway & W 32 St
##  4           Pershing Square North to E 24 St & Park Ave S
##  5              S 4 St & Wythe Ave to N 6 St & Bedford Ave
##  6             Cooper Square & E 7 St to E 7 St & Avenue A
##  7 Richardson St & N Henry St to Graham Ave & Conselyea St
##  8      McGuinness Blvd & Eagle St to Vernon Blvd & 50 Ave
##  9      Vernon Blvd & 50 Ave to McGuinness Blvd & Eagle St
## 10    N 6 St & Bedford Ave to Wythe Ave & Metropolitan Ave
## # ... with 3 more variables: start.station.latitude <dbl>,
## #   start.station.longitude <dbl>, n <int>

Subsetting the data to include just the latitude and longitude for plotting in ‘googleway’.

suborigin <- suborigin[c(2,3)]
names(suborigin) <- c("latitude", "longitude")

subend <- subroutes %>% group_by(route, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
subend <- subend[c(2,3)]
names(subend) <- c("latitude", "longitude")

subdf <- data.frame(from = c(suborigin),
                 to = c(subend))
subdf$start <- paste(subdf$from.latitude,subdf$from.longitude,sep=",")
subdf$end <- paste(subdf$to.latitude,subdf$to.longitude,sep=",")

Here is a heat map showing where ‘subscribers’ are starting their rides.

google_map(data = subdf, key = google_key) %>%
  add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.005)

And the places that they are ending their rides.

google_map(data = subdf, key = google_key) %>%
  add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.005)

Now, I’ll repeat this analysis for the ‘customers’.

custorigin <- custroutes %>% group_by(routes, start.station.latitude, start.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custorigin <- custorigin[c(2,3)]
names(custorigin) <- c("latitude", "longitude")

custend <- custroutes %>% group_by(routes, end.station.latitude, end.station.longitude) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(n=10)
custend <- custend[c(2,3)]
names(custend) <- c("latitude", "longitude")

custdf <- data.frame(from = c(custorigin),
                 to = c(custend))

Here is a heatmap showing the most popular start stations for ‘customers’.

google_map(data = custdf, key = google_key) %>%
  add_heatmap(lat = "from.latitude", lon = "from.longitude", option_radius = 0.01)

Here is a heatmap showing the most popular end stations for ‘Customers’.

google_map(data = custdf, key = google_key) %>%
  add_heatmap(lat = "to.latitude", lon = "to.longitude", option_radius = 0.01)

Now, I’d like to compare the trip time for both ‘subscribers’ and ‘customers’.

Subscribers:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.017     6.150    10.033    13.689    16.800 20357.400

Customers:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02    13.87    21.27    37.10    28.68 36926.33
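The code that produced the two summaries is not shown in the report. One way to get a per-group summary, sketched here on made-up trip durations, is to split the minutes by user type and summarize each piece:

```r
# Toy trip durations in minutes (made-up values)
trips <- data.frame(
  usertype = c("Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"),
  tripmin = c(6, 10, 17, 14, 29),
  stringsAsFactors = FALSE
)

# Split the durations by user type, then summarize each group
by_type <- split(trips$tripmin, trips$usertype)
summaries <- lapply(by_type, summary)
summaries$Subscriber
summaries$Customer

# A single statistic per group, here the median
medians <- sapply(by_type, median)
medians
```

The median is a useful comparison here because the maximum values in the real data are extreme enough to pull the mean upward.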

boxplot(tripmin~usertype, data=citibikedata, main=toupper("Trip Time"), ylim=c(0,60), xlab="User Type", ylab="Time (minutes)", col="blue")

Next, I would like to analyze the most popular times of day for citibike subscribers and customers.

citibikedata$timestamp <-  strftime(citibikedata$starttime,"%Y-%m-%d %H:%M:%S")

citibikedata$ridehours <- hour(citibikedata$timestamp)

ggplot(citibikedata, aes(ridehours, fill=usertype, color=usertype)) + geom_histogram(
   binwidth= 1,
   position="identity",
   alpha=0.5
 )

A heat map to illustrate the most popular times of day for Citibike usage.

heatmap <- transform(citibikedata, freq = ave(seq(nrow(citibikedata)), ridehours, FUN=length))
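Note that freq in the line above is counted per hour only, so in a plot with usertype on the y-axis both rows receive the same shading. If the goal is to compare user types, a possible variation is to count rides per hour and user type. A minimal sketch on toy data (all values made up):

```r
# Toy rides: hour of day and user type
trips <- data.frame(
  ridehours = c(8, 8, 8, 17, 17, 12),
  usertype = c("Subscriber", "Subscriber", "Customer",
               "Subscriber", "Customer", "Customer"),
  stringsAsFactors = FALSE
)

# Count rides within each hour / user type combination
trips$freq <- ave(seq_len(nrow(trips)), trips$ridehours, trips$usertype, FUN = length)

# Equivalent cross-tabulation: rows are user types, columns are hours
hour_by_type <- table(trips$usertype, trips$ridehours)
hour_by_type
```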

ggplot(heatmap, aes(x = ridehours, y = usertype, fill = freq)) +
    viridis::scale_fill_viridis(name="Trip Hours",
                       option = 'C',
                       direction = 1,
                       na.value = "grey93") +
    geom_tile(color = 'white', size = 0.1) +
    # One tick per hour of the day (0-23), zero-padded to two digits
    scale_x_continuous(
      expand = c(0, 0),
      breaks = 0:23,
      labels = sprintf("%02d", 0:23))

Data 607 Final Project

Natalie Mollaghan

12/4/2017
