Undertaken this capstone project is part of the requirement to complete the Google Data Analytics Program.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
In this capstone project I assume the role of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.
To this end, I am required to answer the question of how annual members and casual riders use Cyclistic bikes differently using the previous 12 months of Cyclistic trip data.
To help Cyclistic answer this question so as to be able to create marketing strategies to convert casual riders to annual members the following steps undertaken:
The data was sourced from the company’s database hosted on Amazon AWS.The data is categorized based on month. The database hosted on Amazon AWS is the primary source of data. The data are collected real time as the ride sharing is done. To collect the data, 12 months data was downloaded in csv format from the database and uploaded on R
Prepare R environment for upload of csv files
options(repos = list(CRAN="http://cran.rstudio.com/"))
install.packages("tidyverse")## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ayomide.ilupeju\AppData\Local\Temp\RtmpkVNQXT\downloaded_packages
library(tidyverse)Load downloaded data on Rstudio console
Jan_22 <- read_csv("202201-divvy-tripdata.csv")
Feb_22 <- read_csv("202202-divvy-tripdata.csv")
Mar_22 <- read_csv("202203-divvy-tripdata.csv")
Apr_22 <- read_csv("202204-divvy-tripdata.csv")
May_22 <- read_csv("202205-divvy-tripdata.csv")
Jun_22 <- read_csv("202206-divvy-tripdata.csv")
Jul_22 <- read_csv("202207-divvy-tripdata.csv")
Aug_22 <- read_csv("202208-divvy-tripdata.csv")
Sep_22 <- read_csv("202209-divvy-publictripdata.csv")
Oct_22 <- read_csv("202210-divvy-tripdata.csv")
Nov_22 <- read_csv("202211-divvy-tripdata.csv")
Dec_22 <- read_csv("202212-divvy-tripdata.csv")Merge monthly data into a single table and name new Table as Year_2022_Cyclistic
Year_2022_Cyclistic <- bind_rows(Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22,Jul_22, Aug_22, Sep_22, Oct_22, Nov_22, Dec_22)Review of merged data to ensure correctness
install.packages("skimr")## Error in download.file(url, destfile, method, mode = "wb", ...) :
## cannot open URL 'http://cran.rstudio.com/bin/windows/contrib/4.2/skimr_2.1.5.zip'
library(skimr)
skim_without_charts(Year_2022_Cyclistic)| Name | Year_2022_Cyclistic |
| Number of rows | 5667717 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5667717 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 833064 | 0.85 | 7 | 64 | 0 | 1674 | 0 |
| start_station_id | 833064 | 0.85 | 3 | 44 | 0 | 1313 | 0 |
| end_station_name | 892742 | 0.84 | 9 | 64 | 0 | 1692 | 0 |
| end_station_id | 892742 | 0.84 | 3 | 44 | 0 | 1317 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5858 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 |
| end_lng | 5858 | 1 | -87.65 | 0.11 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2022-01-01 00:00:05 | 2022-12-31 23:59:26 | 2022-07-22 15:03:59 | 4745862 |
| ended_at | 0 | 1 | 2022-01-01 00:01:48 | 2023-01-02 04:56:45 | 2022-07-22 15:24:44 | 4758633 |
Upon review of the merged data I noticed some issues with the data:
Empty cells in the data: To resolve this I eliminated the row with empty cells on the basis that there is no source to refer to help resolve the empty cells. Also assuming a 99.9% confidence level and 0.05% margin of error you will need a sample size of 842,806 from a population of 5,667,717. The no of rows left after the elimination of empty cells is far more than 842,806.
The started date and time were later than the ended data and time for some trips: To ensure consistency, the rows with this issue was also eliminated.
Year_2022_Cyclistic_v1 <- Year_2022_Cyclistic[rowSums(is.na(Year_2022_Cyclistic)) == 0,]Year_2022_Cyclistic_v2 <- Year_2022_Cyclistic_v1%>%
mutate(ride_length_in_seconds = as.numeric(ended_at - started_at))%>%
filter(ride_length_in_seconds > 0)tally(Year_2022_Cyclistic_v2)## # A tibble: 1 × 1
## n
## <int>
## 1 4369052
The count is 4,369,052 which is far higher than the required sample of 842,806 to ensure a 99.9% confidence level and 0.05% margin of error.
Year_2022_Cyclistic_v3 <- Year_2022_Cyclistic_v2 %>%
mutate(weekday_for_started_at = format(started_at, format = "%a"))
head(Year_2022_Cyclistic_v3)## # A tibble: 6 × 15
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 C2F7DD78E82EC875 electric_bike 2022-01-13 11:59:47 2022-01-13 12:02:44
## 2 A6CF8980A652D272 electric_bike 2022-01-10 08:41:56 2022-01-10 08:46:17
## 3 BD0F91DFF741C66D classic_bike 2022-01-25 04:53:40 2022-01-25 04:58:01
## 4 CBB80ED419105406 classic_bike 2022-01-04 00:18:04 2022-01-04 00:33:00
## 5 DDC963BFDDA51EEA classic_bike 2022-01-20 01:31:10 2022-01-20 01:37:12
## 6 A39C6F6CC0586C0B classic_bike 2022-01-11 18:48:09 2022-01-11 18:51:31
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length_in_seconds <dbl>, weekday_for_started_at <chr>
Aggregate_Metrics <- Year_2022_Cyclistic_v3 %>%
group_by(member_casual)%>%
summarise(average_duration_hrs = (mean(ride_length_in_seconds)/60)/60, maximum_duration_hrs = (max(ride_length_in_seconds)/60)/60,
minimum_duration_hrs = (min(ride_length_in_seconds)/60)/60, total_duration_hrs = (sum(ride_length_in_seconds)/60)/60, percent_of_rides = 100 * n() / nrow(Year_2022_Cyclistic_v3))
head(data.frame(Aggregate_Metrics))## member_casual average_duration_hrs maximum_duration_hrs minimum_duration_hrs
## 1 casual 0.3999084 572.56778 0.0002777778
## 2 member 0.2075397 24.88722 0.0002777778
## total_duration_hrs percent_of_rides
## 1 703057.7 40.23864
## 2 541887.2 59.76136
library(scales)
ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = total_duration_hrs, fill = "Red"))+labs(title = "Hours spent on Rides for each type of Rider", x = "Rider Type", y = "Hours spent on Rides" )+scale_y_continuous(labels = label_comma())ggplot(Aggregate_Metrics)+geom_col(mapping = aes(x = member_casual, y = average_duration_hrs, fill ="Red")) + labs(title = "Average Duration for each Ride based on Rider Type",x = "Rider Type", y = "Average hour per Ride")+scale_y_continuous(labels = label_comma())No_of_rides <- Year_2022_Cyclistic_v3 %>%
group_by(member_casual) %>%
tally()
head(No_of_rides)## # A tibble: 2 × 2
## member_casual n
## <chr> <int>
## 1 casual 1758047
## 2 member 2611005
More rides were taken by members than casual riders.
The total hours spent on trips for casual riders was more than the total hours spent on trips by members.
On an average, casual riders took longer trips than members.
The table Year_2022_Cyclistic_v3 was extracted from R in csv format and uploaded on Tableau.
Extracting data from R in csv format:
write.csv("Year_2022_1", file = "/Users/ayomide.ilupeju/OneDrive - Mainstream Energy Solutions Limited/2023/Data Analysis/Year_2022_Cyclistic_v3.csv")Based on analysis above:
More rides were done by members than causal riders. However, the total hours spent on rides by casual riders was higher than the total hours spent by members.
Only casual members used docked bikes.
Majority of the rides in terms of number of rides and hours spent on rides for casual riders were done on Friday, Saturday and Sunday. Sunday being the highest on both basis.
To help Cyclistic convert casual riders into annual members based on the fact that the future success of the company depends on maximizing the number of annual memberships, Cyclistic should:
Create marketing strategies that would motivate docked bike users to be members rather than casual riders as all docked bike users are casual riders.
Develop a marketing promotion that would motivate individuals that ride bikes on Friday, Saturday and Sunday to be members rather than casual riders.
Provide discounts and benefits for members who take longer trips. Longer trips were done mostly by casual riders. This could motivate them to be members rather than being a casual rider.
Data Source: Google Data Analytics program Capstone Course.