Scenario

Cyclistic is a fictional bike-sharing company in Chicago. The service was launched in 2016 and has grown to a fleet of 5,824 bikes that are geotracked and locked into a network of 692 stations around the city.
The team wants to understand how casual riders and annual members, the two different rider types, use their service differently and believe that the company’s future success depends on maximizing the number of annual memberships as they are the most profitable. From these insights, the team will design a new marketing campaign to convert casual riders into annual members.

The company wants to analyze data of the last twelve months, from February 2021 to January 2022, and provides the data under the following link.
* https://divvy-tripdata.s3.amazonaws.com/index.html

For my analysis, I’ll use R for data exploration, manipulation and cleaning and then visualize the data in Tableau.

Data prearation

Setting up my R environment

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(geosphere)

knitr::opts_knit$set(root.dir = "~/R/Practice/Case Study_Cyclists/Data")

Data upload & exploration

First, I’ll upload the monthly data sets into R and explored them using the View() and head() function.

data_2021_02 <- read.csv("202102-divvy-tripdata.csv")
data_2021_03 <- read.csv("202103-divvy-tripdata.csv")
data_2021_04 <- read.csv("202104-divvy-tripdata.csv")
data_2021_05 <- read.csv("202105-divvy-tripdata.csv")
data_2021_06 <- read.csv("202106-divvy-tripdata.csv")
data_2021_07 <- read.csv("202107-divvy-tripdata.csv")
data_2021_08 <- read.csv("202108-divvy-tripdata.csv")
data_2021_09 <- read.csv("202109-divvy-tripdata.csv")
data_2021_10 <- read.csv("202110-divvy-tripdata.csv")
data_2021_11 <- read.csv("202111-divvy-tripdata.csv")
data_2021_12 <- read.csv("202112-divvy-tripdata.csv")
data_2022_01 <- read.csv("202201-divvy-tripdata.csv")
colnames(data_2021_02)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Data consolidation

After I made sure all files have the same data structure, I’ll consolidate them into one data frame for data manipulation and cleaning.

full_data <- rbind(data_2021_02, data_2021_03, data_2021_04, data_2021_05, data_2021_06,
                   data_2021_07, data_2021_08, data_2021_09, data_2021_10, data_2021_11, 
                   data_2021_12,data_2022_01)

Data manipulation

Convert start and end time into date format

To be able to make calculations with the start and end time of the trips, we need to convert these two columns into date format.

full_data$started_at <- strptime(full_data$started_at, format = "%Y-%m-%d %H:%M:%S")
full_data$ended_at <- strptime(full_data$ended_at, format = "%Y-%m-%d %H:%M:%S")

Calculate trip length in minutes and convert into number

In the next step, we’ll calculate the trip length of each ride.

full_data$ride_length_min <- round(difftime(full_data$ended_at, full_data$started_at, units = 'mins'), 2) %>% 
  as.numeric()

Calculate trip distance in meters

I’ll also calculate the trip distance based on the start and end coordinates using the ‘geosphere’ package. To do so, I’ll use the ‘haversine’ method, which gives the shortest distance between two coordinates over the earth’s surface.

full_data <- full_data %>%  
  mutate(ride_distance_m = round(distHaversine(cbind(full_data$start_lng, full_data$start_lat), 
                                         cbind(full_data$end_lng, full_data$end_lat)), 2))

To make sure the results are accurate, I’ve spot checked some coordinates manually on this website.

Let’s take a look at the data set.

knitr::kable(head(full_data, n = 5))
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual ride_length_min ride_distance_m
89E7AA6C29227EFF classic_bike 2021-02-12 16:14:56 2021-02-12 16:21:43 Glenwood Ave & Touhy Ave 525 Sheridan Rd & Columbia Ave 660 42.01270 -87.66606 42.00458 -87.66141 member 6.78 982.20
0FEFDE2603568365 classic_bike 2021-02-14 17:52:38 2021-02-14 18:12:09 Glenwood Ave & Touhy Ave 525 Bosworth Ave & Howard St 16806 42.01270 -87.66606 42.01954 -87.66956 casual 19.52 814.32
E6159D746B2DBB91 electric_bike 2021-02-09 19:10:18 2021-02-09 19:19:10 Clark St & Lake St KA1503000012 State St & Randolph St TA1305000029 41.88579 -87.63110 41.88487 -87.62750 member 8.87 315.99
B32D3199F1C2E75B classic_bike 2021-02-02 17:49:41 2021-02-02 17:54:06 Wood St & Chicago Ave 637 Honore St & Division St TA1305000034 41.89563 -87.67207 41.90312 -87.67394 member 4.42 847.45
83E463F23575F4BF electric_bike 2021-02-23 15:07:23 2021-02-23 15:22:37 State St & 33rd St 13216 Emerald Ave & 31st St TA1309000055 41.83473 -87.62583 41.83816 -87.64512 member 15.23 1645.30

Data cleaning

I’ll remove lines where the start and end station is identical and the trip length is under 2 minutes assuming the customers unlocked a bike and then had to put it right back. Also, I’ll remove any lines with no start or end station.

full_data_clean <- full_data[full_data$ride_length_min > 2 & 
                                    full_data$start_station_id != full_data$end_station_id, ] %>% 
  filter(start_station_id != "") %>% 
  filter(end_station_id != "")

This removes 1,391,019 lines from the data set.

nrow(full_data)
## [1] 5601999
nrow(full_data_clean)
## [1] 4210980

Data analysis

First, let’s take a look at the number of rides per month and compare casual riders (in blue) to annual members (in orange).

Rides per Month We can see that the number of rides peaks in the summer months, from June until September. In most of the months, the number of rides from annual members is higher than the ones from casual riders (except of July). In the winter months, the number of rides drops sharply with some annual members but almost none casual riders.

This could mean that annual members do rely on the service as their main mean of transport, meaning they also have to use it during cold and moist weather conditions. On the other side, casual riders use it almost excludively during summer which could mean they tend to use it more for leisure and enjoyment.

This hypothesis is being supported by looking at the number of rides per weekday and hour.

Rides per Weekday The graph above shows the number of rides per weekday. You can see, that casual riders are using the service mostly on the weekend, whereas the number of rides from annual member is pretty steady throughout the week, with the lowest number of rides on Sundays.

Rides per Hour This graph shows the number of rides per hour. Both types of members, casual and annual, have the most number rides in the late afternoon/early evening hours from 4 to 6 PM. However, for annual members, it also shows an increased number of rides in the morning hours between 6 to 9 AM so times where most of the employees begin and end their workday.

Rides per Hour Another noticeable difference can be found in the monthly average trip duration of the two member types. The following graph shows that trips of casual riders tend to be much longer than the ones from annual members. The trip duration of annual members is pretty steady throughout the year with an average of 10 to 15 minutes per ride. In comparison, the trip duration of casual riders tend to be much higher on average and fluctuates much more throughout the year with a maximum average of almost 45 minutes in February, and a minimum average of 20 minutes in November.

Conclusion

It seems like annual members use the bike-sharing for purposes such as transportation and commuting. Even though the number of rides peaks in summer, there is also some usage in the winter months by that member group. The number of rides is pretty steady throughout the week with it’s low on Sunday. By looking at the number of rides per hour we can see that most number of rides are being started in the hours between 4 - 6 PM, but also a noticeable peak in the morning hours between 6 - 8 AM.
In contrast, casual riders tend to use the service mostly on weekends and in the summer months for leisure activities and enjoyment.

Furthermore, the average trip duration of casual riders tend to be much higher and fluctuates a lot with averages of almost 45 minutes in February and to approximately 20 minutes in November. The average trip duration of annual members is pretty steady throughout the year with an average between 10-15 minutes.

Recommendations

Flexibility of memberships

To decrease the border of entry, more flexible memberships could be offered by giving the option to subscribe on a monthly base. Alternatively, quarterly or half-year memberships could be offered with gradual discounts off the monthly membership to attract more customers. Same here, advertisements should be placed at and around most popular stations (see above).