Case Study - Cyclistic

Scenario

Cyclistic is a fictional bike-sharing company in Chicago. The service was launched in 2016 and has grown to a fleet of 5,824 bikes that are geotracked and locked into a network of 692 stations around the city.
The team wants to understand how casual riders and annual members, the two rider types, use the service differently. It believes that the company’s future success depends on maximizing the number of annual memberships, as these are the most profitable. From these insights, the team will design a new marketing campaign to convert casual riders into annual members.

The company wants to analyze the data from the last twelve months, February 2021 to January 2022, and provides it at the following link.
* https://divvy-tripdata.s3.amazonaws.com/index.html

For my analysis, I’ll use R for data exploration, manipulation and cleaning and then visualize the data in Tableau.

Data preparation

Setting up my R environment

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(geosphere)

knitr::opts_knit$set(root.dir = "~/R/Practice/Case Study_Cyclists/Data")

Data upload & exploration

First, I’ll upload the monthly data sets into R and explore them using the View() and head() functions.

data_2021_02 <- read.csv("202102-divvy-tripdata.csv")
data_2021_03 <- read.csv("202103-divvy-tripdata.csv")
data_2021_04 <- read.csv("202104-divvy-tripdata.csv")
data_2021_05 <- read.csv("202105-divvy-tripdata.csv")
data_2021_06 <- read.csv("202106-divvy-tripdata.csv")
data_2021_07 <- read.csv("202107-divvy-tripdata.csv")
data_2021_08 <- read.csv("202108-divvy-tripdata.csv")
data_2021_09 <- read.csv("202109-divvy-tripdata.csv")
data_2021_10 <- read.csv("202110-divvy-tripdata.csv")
data_2021_11 <- read.csv("202111-divvy-tripdata.csv")
data_2021_12 <- read.csv("202112-divvy-tripdata.csv")
data_2022_01 <- read.csv("202201-divvy-tripdata.csv")

colnames(data_2021_02)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Data consolidation

After making sure all files have the same data structure, I’ll consolidate them into one data frame for data manipulation and cleaning.

full_data <- rbind(data_2021_02, data_2021_03, data_2021_04, data_2021_05, data_2021_06,
                   data_2021_07, data_2021_08, data_2021_09, data_2021_10, data_2021_11, 
                   data_2021_12,data_2022_01)
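
The structure check and the consolidation can also be done programmatically. The following is a minimal sketch in base R, not the code used for this analysis; the two small data frames and their values are hypothetical stand-ins for the monthly files.

```r
# Hypothetical stand-ins for two monthly files with the same columns.
month_a <- data.frame(ride_id = c("A1", "A2"),
                      member_casual = c("member", "casual"))
month_b <- data.frame(ride_id = "B1",
                      member_casual = "member")

monthly <- list(month_a, month_b)

# TRUE only if every data frame has the same column names in the same order
# as the first one.
same_structure <- all(vapply(
  monthly,
  function(df) identical(colnames(df), colnames(monthly[[1]])),
  logical(1)
))
stopifnot(same_structure)

# Stack all data frames in the list in one call.
combined <- do.call(rbind, monthly)
nrow(combined)
```

Keeping the monthly data frames in a list means the same two steps work unchanged for twelve files.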

Data manipulation

Convert start and end time into date-time format

To be able to make calculations with the start and end times of the trips, we need to convert these two columns from text into a date-time format.

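# as.POSIXct() stores each timestamp as a single number, which is better suited
# to data frame columns than the list-based POSIXlt objects strptime() returns.
full_data$started_at <- as.POSIXct(full_data$started_at, format = "%Y-%m-%d %H:%M:%S")
full_data$ended_at <- as.POSIXct(full_data$ended_at, format = "%Y-%m-%d %H:%M:%S")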

Calculate trip length in minutes and convert it into a number

In the next step, we’ll calculate the trip length of each ride.

full_data$ride_length_min <- round(difftime(full_data$ended_at, full_data$started_at, units = 'mins'), 2) %>% 
  as.numeric()
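
As a worked example, here is the same calculation for a single ride, using the start and end times of the first ride in the data set. This is only a sketch: the time zone is fixed to "UTC" as an assumption to make the example reproducible, whereas the code above relies on the local default.

```r
# Start and end time of ride 89E7AA6C29227EFF.
started <- as.POSIXct("2021-02-12 16:14:56",
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ended <- as.POSIXct("2021-02-12 16:21:43",
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# 6 minutes 47 seconds, expressed in minutes and rounded to two decimals.
ride_length_min <- round(as.numeric(difftime(ended, started, units = "mins")), 2)
ride_length_min
```

The result, 6.78 minutes, matches the ride_length_min value shown for this ride in the data set.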

Calculate trip distance in meters

I’ll also calculate the trip distance based on the start and end coordinates using the ‘geosphere’ package. To do so, I’ll use the ‘haversine’ method, which gives the shortest distance between two coordinates over the earth’s surface.

full_data <- full_data %>%
  mutate(ride_distance_m = round(distHaversine(cbind(start_lng, start_lat),
                                               cbind(end_lng, end_lat)), 2))

To make sure the results are accurate, I’ve spot-checked some coordinates manually on this website.
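
A similar spot check can be done in R without the package. The sketch below writes out the haversine formula in base R; haversine_m is a hypothetical helper name, and the earth radius of 6,378,137 m is assumed to match the default used by distHaversine(). The coordinates are those of the first ride in the data set.

```r
# Great-circle distance in meters between two points given in degrees.
# r is the assumed earth radius in meters.
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding pushing a slightly above 1.
  2 * r * asin(sqrt(pmin(1, a)))
}

# Glenwood Ave & Touhy Ave to Sheridan Rd & Columbia Ave.
d <- haversine_m(-87.66606, 42.01270, -87.66141, 42.00458)
round(d, 2)
```

The result should be close to the 982.20 m stored in ride_distance_m for this ride; small differences are expected because the coordinates printed in the table are rounded to five decimals.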

Let’s take a look at the data set.

knitr::kable(head(full_data, n = 5))

ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual	ride_length_min	ride_distance_m
89E7AA6C29227EFF	classic_bike	2021-02-12 16:14:56	2021-02-12 16:21:43	Glenwood Ave & Touhy Ave	525	Sheridan Rd & Columbia Ave	660	42.01270	-87.66606	42.00458	-87.66141	member	6.78	982.20
0FEFDE2603568365	classic_bike	2021-02-14 17:52:38	2021-02-14 18:12:09	Glenwood Ave & Touhy Ave	525	Bosworth Ave & Howard St	16806	42.01270	-87.66606	42.01954	-87.66956	casual	19.52	814.32
E6159D746B2DBB91	electric_bike	2021-02-09 19:10:18	2021-02-09 19:19:10	Clark St & Lake St	KA1503000012	State St & Randolph St	TA1305000029	41.88579	-87.63110	41.88487	-87.62750	member	8.87	315.99
B32D3199F1C2E75B	classic_bike	2021-02-02 17:49:41	2021-02-02 17:54:06	Wood St & Chicago Ave	637	Honore St & Division St	TA1305000034	41.89563	-87.67207	41.90312	-87.67394	member	4.42	847.45
83E463F23575F4BF	electric_bike	2021-02-23 15:07:23	2021-02-23 15:22:37	State St & 33rd St	13216	Emerald Ave & 31st St	TA1309000055	41.83473	-87.62583	41.83816	-87.64512	member	15.23	1645.30

Data cleaning

I’ll keep only rides that lasted longer than 2 minutes and ended at a different station than they started from, assuming that the other rides are mostly cases where customers unlocked a bike and then had to put it right back. I’ll also remove any rows with no start or end station.

full_data_clean <- full_data %>%
  filter(ride_length_min > 2,
         start_station_id != end_station_id,
         start_station_id != "",
         end_station_id != "")

This removes 1,391,019 rows from the data set.

nrow(full_data)

## [1] 5601999

nrow(full_data_clean)

## [1] 4210980
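
To illustrate which rows the cleaning conditions drop, here is a minimal sketch on a small data frame in base R. The five rides and their values are hypothetical and chosen so that each condition removes exactly one of them.

```r
# Hypothetical rides: r2 returns to its start station, r3 is too short,
# r4 has no start station.
rides <- data.frame(
  ride_id          = c("r1", "r2", "r3", "r4", "r5"),
  start_station_id = c("525", "525", "637", "", "13216"),
  end_station_id   = c("660", "525", "TA1305000034", "660", "TA1309000055"),
  ride_length_min  = c(6.78, 30.00, 1.50, 12.00, 15.23),
  stringsAsFactors = FALSE
)

# A row is kept only if all four conditions hold.
keep <- rides$ride_length_min > 2 &
  rides$start_station_id != rides$end_station_id &
  rides$start_station_id != "" &
  rides$end_station_id != ""

rides_clean <- rides[keep, ]

# Number of rows removed by the cleaning step.
nrow(rides) - nrow(rides_clean)
```

Only r1 and r5 survive, so three of the five rows are removed.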

Data analysis

First, let’s take a look at the number of rides per month and compare casual riders (in blue) to annual members (in orange).

Rides per Month

We can see that the number of rides peaks in the summer months, from June until September. In most months, annual members take more rides than casual riders (except in July). In the winter months, the number of rides drops sharply, with some rides from annual members but almost none from casual riders.
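
The charts in this analysis were built in Tableau, but the underlying counts can also be reproduced in R. The following is a minimal sketch in base R; the five rides are hypothetical, and the time zone is fixed to "UTC" as an assumption.

```r
# Hypothetical rides in two different months.
rides <- data.frame(
  started_at = as.POSIXct(c("2021-02-12 16:14:56", "2021-02-14 17:52:38",
                            "2021-07-03 10:00:00", "2021-07-04 11:30:00",
                            "2021-07-10 09:15:00"), tz = "UTC"),
  member_casual = c("member", "casual", "casual", "casual", "member")
)

# Reduce each timestamp to its year and month, e.g. "2021-07".
rides$month <- format(rides$started_at, "%Y-%m")

# Cross-tabulate: one row per month, one column per rider type.
rides_per_month <- table(rides$month, rides$member_casual)
rides_per_month
```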

This could mean that annual members rely on the service as their main means of transport, so they also have to use it in cold and wet weather. Casual riders, on the other hand, use it almost exclusively during summer, which could mean they use it more for leisure and enjoyment.

This hypothesis is supported by the number of rides per weekday and hour.

The graph above shows the number of rides per weekday. You can see that casual riders use the service mostly on the weekend, whereas the number of rides from annual members is fairly steady throughout the week, with the lowest number of rides on Sundays.
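
The weekday and the hour of each ride can be derived directly from started_at. The sketch below does this in base R for the five rides shown in the table earlier; it is not the Tableau calculation used for the graphs, and the time zone is fixed to "UTC" as an assumption.

```r
# Start times and rider types of the first five rides in the data set.
started_at <- as.POSIXct(c("2021-02-12 16:14:56", "2021-02-14 17:52:38",
                           "2021-02-09 19:10:18", "2021-02-02 17:49:41",
                           "2021-02-23 15:07:23"), tz = "UTC")
member_casual <- c("member", "casual", "member", "member", "member")

# POSIXlt exposes the calendar parts: wday is 0 (Sunday) to 6 (Saturday),
# hour is 0 to 23. Both are independent of the system language.
lt <- as.POSIXlt(started_at, tz = "UTC")
weekday_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday")
weekday <- factor(weekday_names[lt$wday + 1], levels = weekday_names)
hour <- lt$hour

# Rides per weekday and rider type.
rides_by_weekday <- table(weekday, member_casual)
rides_by_weekday
```

Using the numeric wday field instead of weekdays() avoids weekday names that change with the locale.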

This graph shows the number of rides per hour. Both rider types, casual and annual, take the most rides in the late afternoon and early evening, from 4 to 6 PM. Annual members, however, also show an increased number of rides in the morning between 6 and 9 AM, so their peaks fall at the times when most employees begin and end their workday.

Rides per Hour

Another noticeable difference can be found in the monthly average trip duration of the two rider types. The following graph shows that trips of casual riders tend to be much longer than those of annual members. The trip duration of annual members is fairly steady throughout the year, with an average of 10 to 15 minutes per ride. In comparison, the trip duration of casual riders is much higher on average and fluctuates much more throughout the year, with a maximum average of almost 45 minutes in February and a minimum average of 20 minutes in November.
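
A monthly average like the one described here can be computed in R with aggregate(). The following is a minimal sketch with hypothetical values, not the figures from the graph.

```r
# Hypothetical rides with a month label, rider type and duration in minutes.
rides <- data.frame(
  month           = c("2021-02", "2021-02", "2021-02", "2021-11", "2021-11"),
  member_casual   = c("casual", "casual", "member", "casual", "member"),
  ride_length_min = c(40, 50, 12, 20, 11)
)

# Mean duration for every combination of month and rider type.
avg_duration <- aggregate(ride_length_min ~ month + member_casual,
                          data = rides, FUN = mean)
avg_duration
```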

Conclusion

It seems that annual members use the bike-sharing service for purposes such as transportation and commuting. Even though their number of rides peaks in summer, they also use the service to some extent in the winter months. Their number of rides is fairly steady throughout the week, with its low on Sunday. By looking at the number of rides per hour, we can see that most rides start between 4 and 6 PM, with another noticeable peak in the morning between 6 and 8 AM.
In contrast, casual riders tend to use the service mostly on weekends and in the summer months for leisure activities and enjoyment.

Furthermore, the average trip duration of casual riders is much higher and fluctuates a lot, from almost 45 minutes in February to approximately 20 minutes in November. The average trip duration of annual members is fairly steady throughout the year, between 10 and 15 minutes.

Recommendations

Placement of advertisements at most popular stations

Place advertisements in the areas around the most popular stations. The bright areas in the heat map below are the most popular pick-up spots. Note that this heat map only considers data from the month with the highest number of rides, July, since Tableau can’t handle the entire data set. In the second visualization, you can see the top 15 pick-up stations by name and how many rides were started at each.
These advertisements can be digital ads on social media apps, restricted to certain areas, but also physical billboards in pedestrian areas.
Since it seems that casual riders use the service as a way of staying active during their leisure time, the ads could highlight the benefits of riding a bike, such as physical health, mental health and even living a more sustainable life.

Pick-Up Spots in July
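
The ranking of pick-up stations for a single month can also be computed in R before the data goes into Tableau. The sketch below uses base R on hypothetical rides; the station names are taken from the data set, but the start times and counts are invented for illustration, and the time zone is fixed to "UTC" as an assumption.

```r
# Hypothetical rides; only the July rides should be counted.
rides <- data.frame(
  started_at = as.POSIXct(c("2021-07-01 08:00:00", "2021-07-02 09:00:00",
                            "2021-07-03 10:00:00", "2021-07-04 11:00:00",
                            "2021-07-05 12:00:00", "2021-02-12 16:14:56"),
                          tz = "UTC"),
  start_station_name = c("Clark St & Lake St", "Clark St & Lake St",
                         "Clark St & Lake St", "Wood St & Chicago Ave",
                         "Wood St & Chicago Ave", "Glenwood Ave & Touhy Ave"),
  stringsAsFactors = FALSE
)

# Keep only rides that started in July 2021.
july <- rides[format(rides$started_at, "%Y-%m") == "2021-07", ]

# Count rides per start station, most popular first, and keep up to 15.
station_counts <- sort(table(july$start_station_name), decreasing = TRUE)
top_stations <- head(station_counts, 15)
top_stations
```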

Flexibility of memberships

To lower the barrier to entry, more flexible memberships could be offered by giving the option to subscribe on a monthly basis. Alternatively, quarterly or half-year memberships could be offered at graduated discounts off the monthly price to attract more customers. As with the first recommendation, advertisements for these offers should be placed at and around the most popular stations (see above).