DATA607 - Final Project

Section 1.0 - Introduction

NYC is at an inflection point: Due to COVID, many people are interested in trying biking in the city. If they enjoy it, New Yorkers could develop a life-long habit. That’s why it’s important to evaluate the Citi Bike program now.

Citi Bikes are primarily a fixed cost for their corporate owner Lyft. Therefore the more they can be used throughout the year, the more profitable the program will be. Improving the bike can be a win-win: More profit for the owning company, more biking in the city, perhaps creating a virtuous cycle. The benefits of more Citi Bike usage are multifold: Less environmental pollution; Safety in numbers – the more cyclists there are, the better/safer biking infrastructure becomes; Healthier citizens; Less space dedicated to car parking; and more.

I’m an avid NYC cyclist who occasionally uses Citi Bike when I don’t have access to my own bike. It’s my perception that the Citi Bike’s speed holds it back: The non-electric option seem too slow compared to normal bikes .My project will evaluate Citi Bike data for 2022 to determine if the non-electric Citi Bike’s speed could be improved to encourage more usage. I’ll compare trip data for the non-electric (“classic”) Citi Bike, the electric Citi Bike, and Google Maps’ bike estimates (based on the average bike) to test my hypothesis.

1.2 Research Questions and Hypothesis

While I may not have full data and resources to perform a cost-benefit analysis on whether making the non-electric Citi Bike faster would drive more ridership, I could explore the below questions to determine whether a cost-benefit analysis should be performed by Citi Bike’s owner Lyft.

Research Questions

How do electric and non-electric Citi Bike travel times compare to Google Maps’ bike time estimates?
Despite the higher price ($0.26 more/minute) does it appear that Citi Bike riders opt for the electric option when given the choice between the two?

If the answers to the above two questions are: 1. Non-electric Citi Bike travel times are much longer than both Google Maps’ estimates and the voltage-based option; 2. Citi Bike riders opt for the electric option by a large margin. Then I believe a cost-benefit analysis should be performed on whether making the non-electric bikes faster would increase profit.

Section 2.0 - Data

My two different data sources are:

Citi Bike data are available in CSV format per month from its website. I can read each month of 2022’s file using base R code and the utils package.
Google Maps bike trip time data can be calculated using the Google Maps API, which plugs into the gmapsdistance R package. If I provide the start and end coordinates for two separate points, I can calculate estimated trip time. The Citi Bike data have start and end coordinates for each observation (ride). However there are millions of observations, therefore I’ll compare bike travel time data for the most common trips, otherwise I’d run up a huge bill for the Google Maps API. In particular, I’ll focus on the most common, long trips (10 min+) so the trip speed discrepancy can be more apparent.

To help me clean and transform my data, I’ll use dplyr, lubridate, and tidyr, all from the Tidyverse. I won’t load the entire Tidyverse so I can maximize available memory and minimize overlapping function names.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyr)

2.1 Data Reading

Citibike Data

Reading in Citi Bike data from its website.

url_cb0122 <- "http://s3.amazonaws.com/tripdata/202201-citibike-tripdata.csv.zip"
download.file(url_cb0122,"202201-citibike-tripdata.csv.zip")
unzip("202201-citibike-tripdata.csv.zip")
cb_0122 <- read.csv("202201-citibike-tripdata.csv",sep=",",header=T)

url_cb0222 <- "http://s3.amazonaws.com/tripdata/202202-citibike-tripdata.csv.zip"
download.file(url_cb0222,"202202-citibike-tripdata.csv.zip")
unzip("202202-citibike-tripdata.csv.zip")
cb_0222 <- read.csv("202202-citibike-tripdata.csv",sep=",",header=T)

url_cb0322 <- "http://s3.amazonaws.com/tripdata/202203-citibike-tripdata.csv.zip"
download.file(url_cb0322,"202203-citibike-tripdata.csv.zip")
unzip("202203-citibike-tripdata.csv.zip")
cb_0322 <- read.csv("202203-citibike-tripdata.csv",sep=",",header=T)

url_cb0422 <- "http://s3.amazonaws.com/tripdata/202204-citibike-tripdata.csv.zip"
download.file(url_cb0422,"202204-citibike-tripdata.csv.zip")
unzip("202204-citibike-tripdata.csv.zip")
cb_0422 <- read.csv("202204-citibike-tripdata.csv",sep=",",header=T)

url_cb0522 <- "http://s3.amazonaws.com/tripdata/202205-citibike-tripdata.csv.zip"
download.file(url_cb0522,"202205-citibike-tripdata.csv.zip")
unzip("202205-citibike-tripdata.csv.zip")
cb_0522 <- read.csv("202205-citibike-tripdata.csv",sep=",",header=T)

# Note: June 2022 file download corrupted so I had to manually download and read locally.
# I understand this is not scalable; I tried for multiple hours to fix this one file without luck.
# For the sake of finishing this project on time, and because I felt I demonstrated
# how I would typically perform this download with the other files,
# I needed to take this one shortcut, apologies!
cb_0622 <- read.csv("C:\\Users\\rossboehme\\Downloads\\202206-citbike-tripdata.csv")

url_cb0722 <- "http://s3.amazonaws.com/tripdata/202207-citbike-tripdata.csv.zip"
download.file(url_cb0722,"202207-citbike-tripdata.csv.zip")
unzip("202207-citbike-tripdata.csv.zip")
cb_0722 <- read.csv("202207-citbike-tripdata.csv",sep=",",header=T)

url_cb0822 <- "https://s3.amazonaws.com/tripdata/202208-citibike-tripdata.csv.zip"
download.file(url_cb0822,"202208-citibike-tripdata.csv.zip")
unzip("202208-citibike-tripdata.csv.zip")
cb_0822 <- read.csv("202208-citibike-tripdata.csv",sep=",",header=T)

url_cb0922 <- "http://s3.amazonaws.com/tripdata/202209-citibike-tripdata.csv.zip"
download.file(url_cb0922,"202209-citibike-tripdata.csv.zip")
unzip("202209-citibike-tripdata.csv.zip")
cb_0922 <- read.csv("202209-citibike-tripdata.csv",sep=",",header=T)

url_cb1022 <- "https://s3.amazonaws.com/tripdata/202210-citibike-tripdata.csv.zip"
download.file(url_cb1022,"202210-citibike-tripdata.csv.zip")
unzip("202210-citibike-tripdata.csv.zip")
cb_1022 <- read.csv("202210-citibike-tripdata.csv",sep=",",header=T)

url_cb1122 <- "https://s3.amazonaws.com/tripdata/202211-citibike-tripdata.csv.zip"
download.file(url_cb1122,"202211-citibike-tripdata.csv.zip")
unzip("202211-citibike-tripdata.csv.zip")
cb_1122 <- read.csv("202211-citibike-tripdata.csv",sep=",",header=T)

url_cb1222 <- "https://s3.amazonaws.com/tripdata/202212-citibike-tripdata.csv.zip"
download.file(url_cb1222,"202212-citibike-tripdata.csv.zip")
unzip("202212-citibike-tripdata.csv.zip")
cb_1222 <- read.csv("202212-citibike-tripdata.csv",sep=",",header=T)

Binding the dataframes into one dataset. The full dataframe contains 30.7M rows. I’ll remove the monthly dataframes from my environment to save space and simplify my R Studio environment.

cb <- rbind(cb_0122,cb_0222,cb_0322,cb_0422,cb_0522,cb_0622,cb_0722,cb_0822,cb_0922,cb_1022,cb_1122,cb_1222)

rm(list=ls(pattern="22"))

My Citi Bike dataframes contain 13 columns. Definitions from Citi Bike site (owned by Lyft).

“ride_id” - Unique ride ID
“rideable_type” - Bike type: “electric”, “classic” (non-electric), or “docked” (accidental unlocking and re-locking)
“started_at” - Time and date ride started
“ended_at” - Time and date ride ended
“start_station_name” - Typically in the format of cross streets e.g. “7 Ave & Central Park South”
“start_station_id” - 6 digit float giving unique ID to start station
“end_station_name” - formatted the same as start station name
“end_station_id” - 6 digit float giving unique ID to end station
“start_lat” - Trip starting latitude at station
“start_lng” - Trip starting longitude at station
“end_lat” - Trip starting latitude at station
“end_lng” - Trip starting longitude at station
’member_casual” - Binary of whether trip was taken by Citi Bike subscriber “member” or non-member “casual”

Google Maps data

My Google Maps data is best added to my RMD as it’s simultaneously cleaned and transformed. Therefore I’ll save this for section 2.3. The only Google Maps data I’ll be bringing in will be trip time estimates based on starting and ending coordinates.

2.2 Data Cleaning

Citi Bike Cleaning

My Citi Bike data are relatively clean already. The only adjustments I’ll make are 1. Removing “docked” Citi Bike trips (docking and immediate re-docking due to bike issues) and 2. Cleaning the station names.

Removing “docked” trips

cb <- cb %>%
  filter(rideable_type != "docked_bike")

Removing forward slashes from the station names, and cleaning up missing spaces around ampersands. Example name changes:
- W 34 St &Blvd E” should be “W 34 St & Hudson Blvd E”
- “Forsyth St& Grand St” should be “Forsyth St & Grand St”

cb$start_station_name <- gsub("\\s*&\\s*", " & ", cb$start_station_name)
cb$end_station_name <- gsub("\\s*&\\s*", " & ", cb$end_station_name)

cb$start_station_name <- gsub("\\t", "", cb$start_station_name)
cb$end_station_name <- gsub("\\t", "", cb$end_station_name)

Google Maps Cleaning

My Google Maps data is best added to my RMD as it’s simultaneously cleaned and transformed. Therefore I’ll save this for section 2.3.

2.3 Data Transformation

Citibike Data and Google Maps Data Combined Transformation

I’ll assess the most common trips in cb which have a duration of 10 min+. I’ll use these as the basis of comparison for the speed of: a) Manual Citi Bike, b) Electric Citi Bike, c) Typical bike using Google Maps biking estimate as proxy. I could use a shorter trip length but a longer trip is a larger sample size which can better display the difference in speeds.

The most common 10+ min trips may actually cover short distances if bikers take indirect routes or are dawdling. Therefore I’ll check the most common routes using Google Maps to make sure they’re an adequately long distance before using them as a basis of comparison.

library(gmapsdistance)

## Warning: package 'gmapsdistance' was built under R version 4.2.3

#Calculating trip times
cb$trip_time <- difftime(cb$ended_at, cb$started_at,units='mins')

#Getting rid of "mins" string on trip_time vector
cb$trip_time <- gsub( " .*$", "", cb$trip_time)

#Combining coordinates from Citi Bike data so they can be plugged into Google Maps API
cb$start_lat <- paste0(cb$start_lat,"+")
cb$start_full_coord <- paste(cb$start_lat,cb$start_lng,sep="")
cb$end_lat <- paste0(cb$end_lat,"+")
cb$end_full_coord <- paste(cb$end_lat,cb$end_lng,sep="")

#Looking at only trips which might fit my 10 min+ criteria and which aren't accidental undocking/redocking
trip_count <- cb %>%
  filter(start_station_id != end_station_id
         ,trip_time >= 10)

library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

#Looking for the top trips
trip_count <- ddply(trip_count,c('start_station_name','start_full_coord','start_station_id','end_station_name','end_full_coord','end_station_id'),nrow)

#Detaching plyr so it doesn't affect my dplyr aggregate functions
detach("package:plyr", unload=TRUE)

top_trips <- trip_count %>%
  arrange(desc(V1)) %>%
  top_n(100)

## Selecting by V1

#Adding Google Maps estimates to top trips
google_time <- c()
for (i in 1:nrow(top_trips)){
google_time[i] <- gmapsdistance(origin = top_trips$start_full_coord[i],
                                            destination = top_trips$end_full_coord[i],
                                            mode = "bicycling",key=Sys.getenv("GOOGLE_API"))$Time
}

#Converting Google Maps estimates from seconds to minutes
trips_for_analysis <- top_trips %>%
  mutate(google_time_mins = round(google_time/60,2))

#Looking at only trips with a 10 min+ Google Maps bike trip estimate -- adequate length for comparison
trips_for_analysis <- trips_for_analysis %>%  
  filter(google_time_mins >= 10) %>%
  arrange(desc(google_time_mins))

This leaves me with 8 potential trips for analysis.

nrow(trips_for_analysis)

## [1] 8

I’ll now aggregate average trip times for these 8, grouped by electric and non-electric (“classic”) bikes, to compare to Google Maps’ estimates. First I need to prepare the data.

#Creating unique trip ID by combining start and end station IDs
trips_for_analysis <- transform(trips_for_analysis,start_end_id=paste0(start_station_id,end_station_id))
cb <- transform(cb,start_end_id=paste0(start_station_id,end_station_id))

#Filtering cb for only my top trips
cb_top_trips <- filter(cb,start_end_id %in% trips_for_analysis$start_end_id)
nrow(cb_top_trips)

## [1] 24976

#Calculating *actual* trip times for targeted trips
cb_top_trips$trip_time <- difftime(cb_top_trips$ended_at, cb_top_trips$started_at,units='mins')

#Getting rid of "mins" string on trip_time vector
cb_top_trips$trip_time <- gsub( " .*$", "", cb_top_trips$trip_time)

#Converting trip_time back to numeric
cb_top_trips <- cb_top_trips %>%
  mutate_at(c('trip_time'), as.numeric)

The “average” actual trip times may be skewed by outliers. The Google Maps bike time estimate assumes a direct trip. However, Citi Bike users may be on a leisurely bike ride for fun.

To remove outliers, I’ll perform the statistical analysis of leveraging upper and lower limits based on inter-quartile ranges (IQR). I’ll do so per each “top trip”/bike type combination, grouping by start_end_id (unique top trip) and rideable_type (electric_bike, classic_bike).

Overview of removing outliers formula below:

Find (Q1) and third (Q3) quartiles per group
Calculate Q1 – 1.5 * IQR to find lower limit and Q3 + 1.5 * IQR to find upper limit for outliers

library(dplyr)
cb_top_trips_lower_outliers <- cb_top_trips %>%
  group_by(rideable_type,start_end_id) %>%
  summarise(lower_lim = fivenum(trip_time)[2] - 1.5 * (fivenum(trip_time)[4] - fivenum(trip_time)[2]))

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

cb_top_trips_upper_outliers <- cb_top_trips %>%
  group_by(rideable_type,start_end_id) %>%
  summarise(upper_lim = fivenum(trip_time)[4] + 1.5 * (fivenum(trip_time)[4] - fivenum(trip_time)[2]))

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

cb_top_trips_outliers <- merge(cb_top_trips_lower_outliers,cb_top_trips_upper_outliers)

cb_top_trips_no_outliers <- cb_top_trips %>%
  left_join(cb_top_trips_outliers, by=c("rideable_type","start_end_id"))

#24,976 rows before outliers removed
nrow(cb_top_trips_no_outliers)

## [1] 24976

cb_top_trips_no_outliers <- cb_top_trips_no_outliers %>%
  filter(upper_lim > trip_time, trip_time > lower_lim)

#23,318 rows after outliers removed 
nrow(cb_top_trips_no_outliers)

## [1] 23318

Now that outliers have been accounted for, I’ll aggregate biking times per bike type and trip.

cb_top_trips_clean <- cb_top_trips_no_outliers %>%
  group_by(rideable_type,start_end_id) %>%
  summarise(avg_trip_time = mean(trip_time))

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

#Separating actual trip times into two separate dataframes, so they can be merged with trips_for_analysis
cb_top_electric <- cb_top_trips_clean %>%
  filter(rideable_type == 'electric_bike')

cb_top_electric <- subset(cb_top_electric, select = -c(rideable_type))
names(cb_top_electric) <- c('start_end_id','avg_trip_time_electric')
  
cb_top_classic <- cb_top_trips_clean %>%
  filter(rideable_type == 'classic_bike')

cb_top_classic <- subset(cb_top_classic, select = -c(rideable_type))
names(cb_top_classic) <- c('start_end_id','avg_trip_time_classic')

#Merging to compare with Google Maps estimates
trips_compare1 <- merge(trips_for_analysis,cb_top_classic)
trips_compare2 <- merge(trips_compare1,cb_top_electric)

#Cleaning up comparison df
trips_compared <- trips_compare2 %>%
  separate_wider_delim(start_full_coord, "-", names = c("start_lat", "start_lng")) %>%
  separate_wider_delim(end_full_coord, "-", names = c("end_lat", "end_lng")) %>%
  mutate(start_lng = paste0("-", start_lng),
         end_lng = paste0("-", start_lng),
         trip_route = paste0(start_station_name," to ",end_station_name)) %>%
  select(-c('V1'))

My final bike trip speed comparison df is 8 rows long, one for each “top trip” printed below. It includes average trip times (with outliers removed) for Google Maps’ estimate, the classic Citi Bike, and the electric Citi Bike. In addition it includes starting/ending station names and starting/ending coordinates.

trips_compared

## # A tibble: 8 × 13
##   start_end_id   start_station_name     start_lat     start_lng start_station_id
##   <chr>          <chr>                  <chr>         <chr>     <chr>           
## 1 5329.036157.04 West St & Chambers St  40.71754834+  -74.0132… 5329.03         
## 2 5329.036765.01 West St & Chambers St  40.71754834+  -74.0132… 5329.03         
## 3 6157.045184.08 10 Ave & W 14 St       40.741981599… -74.0083… 6157.04         
## 4 6157.045329.03 10 Ave & W 14 St       40.741981599… -74.0083… 6157.04         
## 5 6765.015329.03 12 Ave & W 40 St       40.76087502+  -74.0027… 6765.01         
## 6 6765.015696.03 12 Ave & W 40 St       40.76087502+  -74.0027… 6765.01         
## 7 6876.047323.09 Central Park S & 6 Ave 40.76590936+  -73.9763… 6876.04         
## 8 6876.047617.07 Central Park S & 6 Ave 40.76590936+  -73.9763… 6876.04         
## # ℹ 8 more variables: end_station_name <chr>, end_lat <chr>, end_lng <chr>,
## #   end_station_id <chr>, google_time_mins <dbl>, avg_trip_time_classic <dbl>,
## #   avg_trip_time_electric <dbl>, trip_route <chr>

Section 3.0 - Exploratory Data Analysis

Showing quantity of bike rides per month per bike type. Rides peaked in August at more than 3.5M rides, or more than 100K/day. Trips appeared to be correlated with the temperature as Jan and Feb, the coldest months, saw the fewest rides (between 1 to 1.25M). As an avid bike rider, my domain knowledge backs this up. Cold weather accentuates the wind effect created by biking, making me less likely to ride.

In addition, it appears that the number of electric bike trips as a % of total trips was highest in the coldest months, when there were the fewest riders. If electric bikes are more scarce, this could back up my hypothesis that riders opt for the voltage-based option over the classic option if both are available. I’ll explore this more fully in section 4 when I answer my research questions.

library(ggplot2)
cb_rides_per_month <- cb %>%
  group_by(rideable_type,lubridate::month(started_at,label=T)) %>%
  filter(rideable_type %in% c('classic_bike','electric_bike')) %>%
  count(rideable_type)

names(cb_rides_per_month) <- c('rideable_type','ride_month','rides')

# Stacked
ggplot(data=cb_rides_per_month, aes(fill=rideable_type, y=rides, x=ride_month)) + 
  geom_bar(position="stack", stat="identity") +
  xlab("Month") +
  ylab("Number of Rides")  +
  ggtitle("Citi Bike Rides Per Bike Type Per Month - 2022") + 
  guides(fill=guide_legend(title="Bike Type")) +
  scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) # millions

Showing type of rider per day of the week. Tuesday through Friday are the most popular days to ride, especially among members (subscription holders).

Citi Bike members account for roughly 3/4 of all trips.

cb_rider_type_per_DOW <- cb %>%
  group_by(member_casual,lubridate::wday(started_at,label=T,abbr=T)) %>%
  filter(rideable_type %in% c('classic_bike','electric_bike')) %>%
  count(member_casual)

names(cb_rider_type_per_DOW) <- c('rider_type','day_of_week','rides')

ggplot(data=cb_rider_type_per_DOW, aes(fill=rider_type, y=rides, x=day_of_week)) + 
  geom_bar(position="stack", stat="identity") +
  xlab("Day of Week") +
  ylab("Number of Rides")  +
  ggtitle("Citi Bike Rides Per Day of Week, Rider Type - 2022") + 
  guides(fill=guide_legend(title="Rider Type")) +
  scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6)) # millions

Section 4.0 - Research Question Analysis

To repeat my primary research questions:

How do electric and non-electric Citi Bike travel times compare to Google Maps’ bike time estimates?
Despite the higher price ($0.26 more/minute) does it appear that Citi Bike riders opt for the electric option when given the choice between the two?

I’ll answer the first question by assessing my trips_compared dataframe created in section 2.3.

#Pivoting longer to chart time comparisons
trip_compare_long <- trips_compared %>%
  select('google_time_mins','avg_trip_time_classic','avg_trip_time_electric','trip_route')

names(trip_compare_long) <- c('Google Estimate','Classic Citi Bike', 'Electric Citi Bike', 'trip_route')

trip_compare_long <-  trip_compare_long %>%
  pivot_longer(cols=c('Google Estimate','Classic Citi Bike','Electric Citi Bike'),
               names_to='trip_type',
               values_to='trip_time') %>%
  select(c('trip_route','trip_type','trip_time'))

#Wrapper for title
wrapper <- function(x, ...) 
{
  paste(strwrap(x, ...), collapse = "\n")
}

ggplot(trip_compare_long,                                      
       aes(x = trip_route,
           y = trip_time,
           fill = trip_type)) +
  geom_bar(stat = "identity",
           position = "dodge") +
  coord_flip() + 
  xlab("Trip Route") +
  ylab("Avg Trip Time (mins)")  +
  ggtitle(wrapper("Avg Trip Time by Biking Option: 8 Most Common Long NYC Routes",width=25)) +
  guides(fill=guide_legend(title="Biking Option"))

Per the above chart it appears that for four of the trips, both the classic and electric Citi Bike options take substantially longer (more than 1.3x) than the Google estimate. Upon further inspection of the station names using my NYC knowledge, 2 of these 4 trips involve a trip through Central Park (“Central Park S & 6 Ave to Central Park North & Adam…”, “Central Park S & 6 Ave to 5 Ave & E 87 St”) while the other two go along Hudson River Park (“10 Ave & W 14th St to West St & Liberty St”, “12 Ave & W 40 St to Pier 40…”).

This knowledge, combined with their difference in distribution to the other charts, makes me believe their average is skewed by leisurely journeys through their parks. Therefore, I will drop them from this model as I don’t believe they provide an accurate comparison.

#Dropping trips where the average duration for the electric Citi Bike was 1.3x+ longer than the Google Maps estimate #All of these involve trips through Central Park
trip_compare_final <- trips_compared %>%
  select('google_time_mins','avg_trip_time_classic','avg_trip_time_electric','trip_route') %>%
  filter(avg_trip_time_electric < (google_time_mins * 1.3))

names(trip_compare_final) <- c('Google Estimate','Classic Citi Bike', 'Electric Citi Bike', 'trip_route')

trip_compare_final <-  trip_compare_final %>%
  pivot_longer(cols=c('Google Estimate','Classic Citi Bike','Electric Citi Bike'),
               names_to='trip_type',
               values_to='trip_time') %>%
  select(c('trip_route','trip_type','trip_time'))

#Wrapper for title
wrapper <- function(x, ...) 
{
  paste(strwrap(x, ...), collapse = "\n")
}

ggplot(trip_compare_final,                                      
       aes(x = trip_route,
           y = trip_time,
           fill = trip_type)) +
  geom_bar(stat = "identity",
           position = "dodge") +
  coord_flip() + 
  xlab("Trip Route") +
  ylab("Avg Trip Time (mins)")  +
  ggtitle(wrapper("Avg Trip Time by Biking Option: 4 Most Common Long, Non-Leisurely NYC Routes",width=40)) + 
  guides(fill=guide_legend(title="Biking Option"))

The above chart displays that the Classic Citi Bike is by far the slowest option for these common trips. I quantify just how much slower it is in the below R chunk using statistics. In written terms:

On average, electric Citi Bike trips were 8% slower than Google Maps trip estimates for the 5 most common, non-leisurely NYC route
On average, classic Citi Bike trips were 34% slower than Google Maps trip estimates for the 5 most common, non-leisurely NYC routes
On average, Classic Citi Bike trips were 23% slower than electric Citi Bike trips for the 5 most common, non-leisurely NYC routes.

trip_compare_stats <- trips_compared %>%
  select('google_time_mins','avg_trip_time_classic','avg_trip_time_electric','trip_route') %>%
  filter(avg_trip_time_electric < (google_time_mins * 1.3)) %>%
  mutate(google_to_electric_ratio = avg_trip_time_electric / google_time_mins,
         google_to_classic_ratio = avg_trip_time_classic / google_time_mins,
         electric_to_classic_ratio = avg_trip_time_classic / avg_trip_time_electric)

trip_compare_stats %>% summarise(mean(google_to_electric_ratio))

## # A tibble: 1 × 1
##   `mean(google_to_electric_ratio)`
##                              <dbl>
## 1                             1.09

trip_compare_stats %>% summarise(mean(google_to_classic_ratio))

## # A tibble: 1 × 1
##   `mean(google_to_classic_ratio)`
##                             <dbl>
## 1                            1.34

trip_compare_stats %>% summarise(mean(electric_to_classic_ratio))

## # A tibble: 1 × 1
##   `mean(electric_to_classic_ratio)`
##                               <dbl>
## 1                              1.23

Now that I’ve established classic Citi Bikes as substantially slower than Google Maps time estimates and electric Citi Bikes, I’ll assess my second research question: Despite the higher price ($0.26 more/minute) does it appear that Citi Bike riders opt for the electric option when given the choice between the two?

The next chart shows electric vs. classic bikes as a % of total Citi Bike trips in 2022. In the earliest months of the year, Citi Bike trips took up a notably higher proportion than later in the year. That was likely because in April 2022, Citi Bike increased their electric fleet from 5.000 to 6,500 with the launch of new e-bike, increasing the total number of Citi Bikes from 24,500 to 26,000.

Interestingly, while absolute usage of electric e-bikes generally increased, as a % it actually decreased, meaning perhaps the voltage-based option reached interest saturation. That said, Lyft’s latest yearly report on Citi Bike claimed that even though e-bikes accounted for 1/5th of the fleet, they accounted for 1/3 of rides. The report also detailed how electric Citi Bikes were used three times more often per day in 2021 compared to “classics.” Therefore, my dataset may be skewed by a higher-than-usual number of electric bikes being out of service towards the end of the year. Overall, it still seems true that riders prefer the e-Citi Bike.

library(reshape)

## Warning: package 'reshape' was built under R version 4.2.3

## 
## Attaching package: 'reshape'

## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

## The following object is masked from 'package:lubridate':
## 
##     stamp

## The following object is masked from 'package:dplyr':
## 
##     rename

library(scales)
wide_rides <- cb_rides_per_month %>%
  pivot_wider(names_from=rideable_type,values_from=rides)

wide_rides <- data.frame(wide_rides)

wide_melt <- melt(wide_rides, id.vars = 'ride_month')

ggplot(wide_melt,aes(x = ride_month, y = value,fill = variable)) + 
  geom_bar(position = "fill",stat = "identity") + 
  scale_y_continuous(labels = scales::percent_format()) + 
  xlab("Month") +
  ylab("% of Rides")  +
  ggtitle("Citi Bike Rides Per Bike Type as % of All Rides - 2022") + 
  guides(fill=guide_legend(title="Rider Type"))

This chart shows bike type usage broken down by member (Citi Bike subscriber) vs. non-member. Members prefer the electric bike by a 3:1 margin. If members are more profitable than “casuals”, they should be a priority, and they may prefer the electric bike for its speed.

rider_bike_preference <- cb %>%
  group_by(member_casual,rideable_type) %>%
  filter(rideable_type %in% c('classic_bike','electric_bike')) %>%
  count(member_casual)

names(rider_bike_preference) <- c('rider_type','bike_type','rides')

ggplot(data=rider_bike_preference, aes(fill=rider_type, y=rides, x=bike_type)) + 
  geom_bar(position="stack", stat="identity") +
  xlab("Bike Type") +
  ylab("Number of Rides")  +
  ggtitle("Citi Bike Rides Per Bike and Rider Type - 2022") + 
  guides(fill=guide_legend(title="Rider Type")) +
  scale_y_continuous(labels = scales::label_number(suffix = " M", scale = 1e-6))

Based on the above analysis, the data suggest that:

Classic Citi bikes are much slower (-42%) than a typical bike (using Google Maps time estimates as a proxy). They’re also much slower than an electric Citi Bike (-24%).
When users have the option to choose between the electric and classic Citi Bikes, they typically choose the former, even though they’re more expensive to use. Citi Bike subscribers/members especially have this preference.

Therefore yes, a cost-benefit analysis should likely be performed on whether making the classic Citi bikes faster would increase profit. This analysis could involve:

Surveying Citi Bike users for if/why they prefer the electric bike option (is it the speed, the ease of pedaling, or something else?)
Assessing cost of replacing/upgrading classic bike fleet over time (as current manual bikes break). Example improvements could include reducing the classic bike’s weight (currently 45lbs, changing the frame (from a cruiser style to something more upright), and making the tires thinner.
Organizing a focus group of users who said they thought “classic” bike was too slow, and having them test new bike options.
Assessing how much the Citi Bike fare would need to be raised to fund an upgrade of the classic fleet.

A skeptic of my proposal may suggest that instead of making the classic Citi Bike faster, instead Lyft should replace all the classic Citi Bikes with electric bikes. But electric bikes are much more expensive to produce (2-3x on average), less reliable due to having more parts, are not as rigorous of a workout, and may be unsettling to those not experienced with pedal-assisted energy. Therefore, based on my findings in this analysis and my own experience, a faster non-electric Citi Bike would be a more profitable product long-term than the current non-electric option.

Section 5.0 - Conclusions

The classic, non-electric Citi Bike appears to suffer in popularity because it’s slower than the typical bike (using Google Maps estimates as a proxy) and the electric Citi Bike.
When analyzing 4 of the most common, non-leisurely Citi Bike routes in NYC for 2022, the classic Citi Bike was 34% slower than the Google Maps estimate, and 23% slower than the electric Citi bike.
While I was unable to find a trend in my 2022 data which proved that casual riders prefer the electric over the traditional Citi Bike, a) Lyft had its own data over a longer time horizon which said as much. Lyft claims that the typical electric bike is used 3x more often than the typical classic bike (in part because there are fewer electrics). In addition: b) In my data, Citi Bike members chose electric bikes over classic bikes at a 3 to 1 ratio. Members have more informed opinions as repeat customers and may be more profitable customers who Lyft should prioritize.
The classic bike could easily be made faster by: 1) Reducing the weight from its current 45 lbs; 2) Changing the frame; 3) Making the tires less wide.
Therefore Lyft should carry out a more comprehensive cost-benefit analysis to determine if the current classic Citi Bike model should be phased out in favor of a faster option. Lyft’s analysis could include cost estimates for changing the bike and whether increasing fairs could make up for those costs

5.1 - Limitations

2022 data may not be representative of all Citi Bike trips. The city’s infrastructure, such as bike lanes, are constantly changing and Google Maps may have outdated trip time estimates.
Lyft may prefer a slower Citi Bike because it’s less dangerous. However the fact that the company has increasingly introduced e-bikes to their fleet means speed is not a primary worry.

Works Cited

Section 2

Citi Bike data including definitions for each fields are available on their website
The Google Maps API for r was created by Michael Dorman. Documentation here.

Section 4

Bicycling.com: Total Number of Citi Bikes
Lyft: 2022 Citi Bike Multimodal report for statistics on the Citi Bike fleet
Consumer Reports: Electric bikes are much more expensive to produce and less reliable)
NY Times: Electric bikes are not as intensive of a workout
Gothamist: Citi Bike’s Weight