Cyclistic Bike-Share Analysis Case Study

Ilupeju Ayomide

2023-05-06


Capstone Project - Google Data Analytics Program

Undertaken this capstone project is part of the requirement to complete the Google Data Analytics Program.

Scenario

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

In this capstone project I assume the role of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

To this end, I am required to answer the question of how annual members and casual riders use Cyclistic bikes differently using the previous 12 months of Cyclistic trip data.

Analysis and Reporting

To help Cyclistic answer this question so as to be able to create marketing strategies to convert casual riders to annual members the following steps undertaken:

Data Collection and Organisation:

The data was sourced from the company’s database hosted on Amazon AWS.The data is categorized based on month. The database hosted on Amazon AWS is the primary source of data. The data are collected real time as the ride sharing is done. To collect the data, 12 months data was downloaded in csv format from the database and uploaded on R

Prepare R environment for upload of csv files

options(repos = list(CRAN="http://cran.rstudio.com/"))
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ayomide.ilupeju\AppData\Local\Temp\RtmpkVNQXT\downloaded_packages
library(tidyverse)

Load downloaded data on Rstudio console

Jan_22 <- read_csv("202201-divvy-tripdata.csv")
Feb_22 <- read_csv("202202-divvy-tripdata.csv")
Mar_22 <- read_csv("202203-divvy-tripdata.csv")
Apr_22 <- read_csv("202204-divvy-tripdata.csv")
May_22 <- read_csv("202205-divvy-tripdata.csv")
Jun_22 <- read_csv("202206-divvy-tripdata.csv")
Jul_22 <- read_csv("202207-divvy-tripdata.csv")
Aug_22 <- read_csv("202208-divvy-tripdata.csv")
Sep_22 <- read_csv("202209-divvy-publictripdata.csv")
Oct_22 <- read_csv("202210-divvy-tripdata.csv")
Nov_22 <- read_csv("202211-divvy-tripdata.csv")
Dec_22 <- read_csv("202212-divvy-tripdata.csv")

Merge monthly data into a single table and name new Table as Year_2022_Cyclistic

Year_2022_Cyclistic <- bind_rows(Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22,Jul_22, Aug_22, Sep_22, Oct_22, Nov_22, Dec_22)

Review of merged data to ensure correctness

install.packages("skimr")
## Error in download.file(url, destfile, method, mode = "wb", ...) : 
##   cannot open URL 'http://cran.rstudio.com/bin/windows/contrib/4.2/skimr_2.1.5.zip'
library(skimr)
skim_without_charts(Year_2022_Cyclistic)
Data summary
Name Year_2022_Cyclistic
Number of rows 5667717
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5667717 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 833064 0.85 7 64 0 1674 0
start_station_id 833064 0.85 3 44 0 1313 0
end_station_name 892742 0.84 9 64 0 1692 0
end_station_id 892742 0.84 3 44 0 1317 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5858 1 41.90 0.07 0.00 41.88 41.90 41.93 42.37
end_lng 5858 1 -87.65 0.11 -88.14 -87.66 -87.64 -87.63 0.00

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2022-01-01 00:00:05 2022-12-31 23:59:26 2022-07-22 15:03:59 4745862
ended_at 0 1 2022-01-01 00:01:48 2023-01-02 04:56:45 2022-07-22 15:24:44 4758633
Data Cleaning and Preparation for analysis:

Upon review of the merged data I noticed some issues with the data:

  • Empty cells in the data: To resolve this I eliminated the row with empty cells on the basis that there is no source to refer to help resolve the empty cells. Also assuming a 99.9% confidence level and 0.05% margin of error you will need a sample size of 842,806 from a population of 5,667,717. The no of rows left after the elimination of empty cells is far more than 842,806.

  • The started date and time were later than the ended data and time for some trips: To ensure consistency, the rows with this issue was also eliminated.

Eliminating rows with empty cells
Year_2022_Cyclistic_v1 <- Year_2022_Cyclistic[rowSums(is.na(Year_2022_Cyclistic)) == 0,]
Identifying and eliminating rows where the start date is later than the end date
Year_2022_Cyclistic_v2 <- Year_2022_Cyclistic_v1%>%
  mutate(ride_length_in_seconds = as.numeric(ended_at - started_at))%>%
  filter(ride_length_in_seconds > 0)
Counting number of rows left after data cleaning
tally(Year_2022_Cyclistic_v2)
## # A tibble: 1 × 1
##         n
##     <int>
## 1 4369052

The count is 4,369,052 which is far higher than the required sample of 842,806 to ensure a 99.9% confidence level and 0.05% margin of error.

Data Analysis:
Calculating the day of the week that each ride started.
Year_2022_Cyclistic_v3 <- Year_2022_Cyclistic_v2 %>%
  mutate(weekday_for_started_at = format(started_at, format = "%a"))

head(Year_2022_Cyclistic_v3)
## # A tibble: 6 × 15
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17
## 3 BD0F91DFF741C66D classic_bike  2022-01-25 04:53:40 2022-01-25 04:58:01
## 4 CBB80ED419105406 classic_bike  2022-01-04 00:18:04 2022-01-04 00:33:00
## 5 DDC963BFDDA51EEA classic_bike  2022-01-20 01:31:10 2022-01-20 01:37:12
## 6 A39C6F6CC0586C0B classic_bike  2022-01-11 18:48:09 2022-01-11 18:51:31
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length_in_seconds <dbl>, weekday_for_started_at <chr>
Calculating aggregate metrics for the types of rider.
Aggregate_Metrics <- Year_2022_Cyclistic_v3 %>%
  group_by(member_casual)%>%
  summarise(average_duration_hrs = (mean(ride_length_in_seconds)/60)/60, maximum_duration_hrs = (max(ride_length_in_seconds)/60)/60,
            minimum_duration_hrs = (min(ride_length_in_seconds)/60)/60, total_duration_hrs = (sum(ride_length_in_seconds)/60)/60, percent_of_rides = 100 * n() / nrow(Year_2022_Cyclistic_v3))

head(data.frame(Aggregate_Metrics))
##   member_casual average_duration_hrs maximum_duration_hrs minimum_duration_hrs
## 1        casual            0.3999084            572.56778         0.0002777778
## 2        member            0.2075397             24.88722         0.0002777778
##   total_duration_hrs percent_of_rides
## 1           703057.7         40.23864
## 2           541887.2         59.76136
Visualising Aggregate Metrics
  • Hours spent on Rides for each type of Rider in 2022
library(scales)
ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = total_duration_hrs, fill = "Red"))+labs(title = "Hours spent on Rides for each type of Rider", x = "Rider Type", y = "Hours spent on Rides" )+scale_y_continuous(labels = label_comma())

  • Average Duration for each Ride based on Rider Type
ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = average_duration_hrs, fill ="Red")) + labs(title = "Average Duration for each Ride based on Rider Type",x = "Rider Type", y = "Average hour per Ride")+scale_y_continuous(labels = label_comma())

Calculating the no of rides per type of rider
No_of_rides <- Year_2022_Cyclistic_v3 %>%
  group_by(member_casual) %>%
  tally()

head(No_of_rides)
## # A tibble: 2 × 2
##   member_casual       n
##   <chr>           <int>
## 1 casual        1758047
## 2 member        2611005

The analysis above shows that

  • More rides were taken by members than casual riders.

  • The total hours spent on trips for casual riders was more than the total hours spent on trips by members.

  • On an average, casual riders took longer trips than members.

Further analysis was carried out using Tableau for easy of visualization

The table Year_2022_Cyclistic_v3 was extracted from R in csv format and uploaded on Tableau.

Extracting data from R in csv format:

write.csv("Year_2022_1", file = "/Users/ayomide.ilupeju/OneDrive - Mainstream Energy Solutions Limited/2023/Data Analysis/Year_2022_Cyclistic_v3.csv")
Analysis of Rider Types using Tableau - Click on the caption

Conclusion

Based on analysis above:

  1. More rides were done by members than causal riders. However, the total hours spent on rides by casual riders was higher than the total hours spent by members.

  2. Only casual members used docked bikes.

  3. Majority of the rides in terms of number of rides and hours spent on rides for casual riders were done on Friday, Saturday and Sunday. Sunday being the highest on both basis.

Recommendation

To help Cyclistic convert casual riders into annual members based on the fact that the future success of the company depends on maximizing the number of annual memberships, Cyclistic should:

  1. Create marketing strategies that would motivate docked bike users to be members rather than casual riders as all docked bike users are casual riders.

  2. Develop a marketing promotion that would motivate individuals that ride bikes on Friday, Saturday and Sunday to be members rather than casual riders.

  3. Provide discounts and benefits for members who take longer trips. Longer trips were done mostly by casual riders. This could motivate them to be members rather than being a casual rider.

Reference

Data Source: Google Data Analytics program Capstone Course.