Cyclistic Bike-Share Analysis Case Study

Ilupeju Ayomide

2023-05-06

Capstone Project - Google Data Analytics Program

Undertaken this capstone project is part of the requirement to complete the Google Data Analytics Program.

Scenario

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

In this capstone project I assume the role of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

To this end, I am required to answer the question of how annual members and casual riders use Cyclistic bikes differently using the previous 12 months of Cyclistic trip data.

Analysis and Reporting

To help Cyclistic answer this question so as to be able to create marketing strategies to convert casual riders to annual members the following steps undertaken:

Data Collection and Organisation:

The data was sourced from the company’s database hosted on Amazon AWS.The data is categorized based on month. The database hosted on Amazon AWS is the primary source of data. The data are collected real time as the ride sharing is done. To collect the data, 12 months data was downloaded in csv format from the database and uploaded on R

Prepare R environment for upload of csv files

options(repos = list(CRAN="http://cran.rstudio.com/"))
install.packages("tidyverse")

## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ayomide.ilupeju\AppData\Local\Temp\RtmpkVNQXT\downloaded_packages

library(tidyverse)

Load downloaded data on Rstudio console

Jan_22 <- read_csv("202201-divvy-tripdata.csv")
Feb_22 <- read_csv("202202-divvy-tripdata.csv")
Mar_22 <- read_csv("202203-divvy-tripdata.csv")
Apr_22 <- read_csv("202204-divvy-tripdata.csv")
May_22 <- read_csv("202205-divvy-tripdata.csv")
Jun_22 <- read_csv("202206-divvy-tripdata.csv")
Jul_22 <- read_csv("202207-divvy-tripdata.csv")
Aug_22 <- read_csv("202208-divvy-tripdata.csv")
Sep_22 <- read_csv("202209-divvy-publictripdata.csv")
Oct_22 <- read_csv("202210-divvy-tripdata.csv")
Nov_22 <- read_csv("202211-divvy-tripdata.csv")
Dec_22 <- read_csv("202212-divvy-tripdata.csv")

Merge monthly data into a single table and name new Table as Year_2022_Cyclistic

Year_2022_Cyclistic <- bind_rows(Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22,Jul_22, Aug_22, Sep_22, Oct_22, Nov_22, Dec_22)

Review of merged data to ensure correctness

install.packages("skimr")

## Error in download.file(url, destfile, method, mode = "wb", ...) : 
##   cannot open URL 'http://cran.rstudio.com/bin/windows/contrib/4.2/skimr_2.1.5.zip'

library(skimr)
skim_without_charts(Year_2022_Cyclistic)

Data summary
Name	Year_2022_Cyclistic
Number of rows	5667717
Number of columns	13
_______________________
Column type frequency:
character	7
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5667717
rideable_type	0	1.00	11	13	3
start_station_name	833064	0.85	7	64	1674
start_station_id	833064	0.85	3	44	1313
end_station_name	892742	0.84	9	64	1692
end_station_id	892742	0.84	3	44	1317
member_casual	0	1.00	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	45.64
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-73.80
end_lat	5858	1	41.90	0.07	0.00	41.88	41.90	41.93	42.37
end_lng	5858	1	-87.65	0.11	-88.14	-87.66	-87.64	-87.63	0.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-01-01 00:00:05	2022-12-31 23:59:26	2022-07-22 15:03:59	4745862
ended_at	0	1	2022-01-01 00:01:48	2023-01-02 04:56:45	2022-07-22 15:24:44	4758633

Data Cleaning and Preparation for analysis:

Upon review of the merged data I noticed some issues with the data:

Empty cells in the data: To resolve this I eliminated the row with empty cells on the basis that there is no source to refer to help resolve the empty cells. Also assuming a 99.9% confidence level and 0.05% margin of error you will need a sample size of 842,806 from a population of 5,667,717. The no of rows left after the elimination of empty cells is far more than 842,806.
The started date and time were later than the ended data and time for some trips: To ensure consistency, the rows with this issue was also eliminated.

Eliminating rows with empty cells

Year_2022_Cyclistic_v1 <- Year_2022_Cyclistic[rowSums(is.na(Year_2022_Cyclistic)) == 0,]

Identifying and eliminating rows where the start date is later than the end date

Year_2022_Cyclistic_v2 <- Year_2022_Cyclistic_v1%>%
  mutate(ride_length_in_seconds = as.numeric(ended_at - started_at))%>%
  filter(ride_length_in_seconds > 0)

Counting number of rows left after data cleaning

tally(Year_2022_Cyclistic_v2)

## # A tibble: 1 × 1
##         n
##     <int>
## 1 4369052

The count is 4,369,052 which is far higher than the required sample of 842,806 to ensure a 99.9% confidence level and 0.05% margin of error.

Data Analysis:

Calculating the day of the week that each ride started.

Year_2022_Cyclistic_v3 <- Year_2022_Cyclistic_v2 %>%
  mutate(weekday_for_started_at = format(started_at, format = "%a"))

head(Year_2022_Cyclistic_v3)

## # A tibble: 6 × 15
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17
## 3 BD0F91DFF741C66D classic_bike  2022-01-25 04:53:40 2022-01-25 04:58:01
## 4 CBB80ED419105406 classic_bike  2022-01-04 00:18:04 2022-01-04 00:33:00
## 5 DDC963BFDDA51EEA classic_bike  2022-01-20 01:31:10 2022-01-20 01:37:12
## 6 A39C6F6CC0586C0B classic_bike  2022-01-11 18:48:09 2022-01-11 18:51:31
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length_in_seconds <dbl>, weekday_for_started_at <chr>

Calculating aggregate metrics for the types of rider.

Aggregate_Metrics <- Year_2022_Cyclistic_v3 %>%
  group_by(member_casual)%>%
  summarise(average_duration_hrs = (mean(ride_length_in_seconds)/60)/60, maximum_duration_hrs = (max(ride_length_in_seconds)/60)/60,
            minimum_duration_hrs = (min(ride_length_in_seconds)/60)/60, total_duration_hrs = (sum(ride_length_in_seconds)/60)/60, percent_of_rides = 100 * n() / nrow(Year_2022_Cyclistic_v3))

head(data.frame(Aggregate_Metrics))

##   member_casual average_duration_hrs maximum_duration_hrs minimum_duration_hrs
## 1        casual            0.3999084            572.56778         0.0002777778
## 2        member            0.2075397             24.88722         0.0002777778
##   total_duration_hrs percent_of_rides
## 1           703057.7         40.23864
## 2           541887.2         59.76136

Visualising Aggregate Metrics

Hours spent on Rides for each type of Rider in 2022

library(scales)
ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = total_duration_hrs, fill = "Red"))+labs(title = "Hours spent on Rides for each type of Rider", x = "Rider Type", y = "Hours spent on Rides" )+scale_y_continuous(labels = label_comma())

Average Duration for each Ride based on Rider Type

ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = average_duration_hrs, fill ="Red")) + labs(title = "Average Duration for each Ride based on Rider Type",x = "Rider Type", y = "Average hour per Ride")+scale_y_continuous(labels = label_comma())

Calculating the no of rides per type of rider

No_of_rides <- Year_2022_Cyclistic_v3 %>%
  group_by(member_casual) %>%
  tally()

head(No_of_rides)

## # A tibble: 2 × 2
##   member_casual       n
##   <chr>           <int>
## 1 casual        1758047
## 2 member        2611005

The analysis above shows that

More rides were taken by members than casual riders.
The total hours spent on trips for casual riders was more than the total hours spent on trips by members.
On an average, casual riders took longer trips than members.

Further analysis was carried out using Tableau for easy of visualization

The table Year_2022_Cyclistic_v3 was extracted from R in csv format and uploaded on Tableau.

Extracting data from R in csv format:

write.csv("Year_2022_1", file = "/Users/ayomide.ilupeju/OneDrive - Mainstream Energy Solutions Limited/2023/Data Analysis/Year_2022_Cyclistic_v3.csv")

Analysis of Rider Types using Tableau - Click on the caption

Conclusion

Based on analysis above:

More rides were done by members than causal riders. However, the total hours spent on rides by casual riders was higher than the total hours spent by members.
Only casual members used docked bikes.
Majority of the rides in terms of number of rides and hours spent on rides for casual riders were done on Friday, Saturday and Sunday. Sunday being the highest on both basis.

Recommendation

To help Cyclistic convert casual riders into annual members based on the fact that the future success of the company depends on maximizing the number of annual memberships, Cyclistic should:

Create marketing strategies that would motivate docked bike users to be members rather than casual riders as all docked bike users are casual riders.
Develop a marketing promotion that would motivate individuals that ride bikes on Friday, Saturday and Sunday to be members rather than casual riders.
Provide discounts and benefits for members who take longer trips. Longer trips were done mostly by casual riders. This could motivate them to be members rather than being a casual rider.

Reference

Data Source: Google Data Analytics program Capstone Course.