This project is a Google Data Analytics Capstone Project on Coursera, designed to reinforce the analytical skills gained throughout the course duration. For this exercise, I utilize a publicly available data from Cyclistic, a fictional bike-share company based in the city of Chicago, launched in 2016. The company features 5800 bicycles and 692 docking station. They offer various types of bikes such as classic , electric, and docked bikes.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The finance analysts have concluded that annual members are much more profitable than casual riders.
I am the junior data analyst working in the marketing analyst team at Cyclistic. In order to answer the key business questions, uncover key insights, pattern and trends on the bike rental service usage and communicate the findings in an easy, engaging and persuasive way to the key stakeholder. i will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act
⦁ Lily Moreno: The director of marketing
⦁ Cyclistic marketing analytics team
⦁ Cyclistic executive team
⦁ How do annual members and casual riders use Cyclistic bikes differently?
⦁ Increase in revenue derived from the transition of casual riders into annual members subscriber. i aim to uncover distinctions in the utilization patterns of Cyclistic bikes between casual riders and annual members and how can Cyclistic use digital media to influence casual riders to become members?
⦁ Recommend a new marketing strategy geared towards the conversion of casual riders
This report made use of data collected in 12 months (August 2022 - July 2023) for detailed data analysis to complete this case study. To download the data, please use this link https://divvy-tripdata.s3.amazonaws.com/index.html. This data was made public by Motivate International Inc, under this license https://divvybikes.com/data-license-agreement. Due to data privacy issues, personal information has been removed or encrypted. i Ensure the “Divvy bikeshare” dataset integerity by checking if the data ROCCC.
Data Interigity(ROCCC)
Reliable: data is reliable and accurate being collected by a credible source using advanced data collection technology
Original: Primary data source, the company Cyclistic gathers this data first-hand.
Comprehensive: The dataset has more than 5 million complete data entries. With the available data we are able to compare the two groups ‘annual members & casual members’, but not the preference that lead to their choice on what product/services to opt for.
Current: Data is up-to-date as it includes data from the past 12 months.
Cited : Yes, The data is authorized under license.
The dataset consist of 13 columns and a total of 5723606 rows. Here’s an overview of the columns:
ride_id:— a unique ID per ride
rideable_type:- the type of bicycle used
started_at: the date and time at which bicycle was checked out
ended_at: the date and time at which bicycle was checked in
start_station_name: the name of the station at the start of the trip
start_station_id: a unique identifier for the start station
end_station_name: the name of the station at the end of the trip
end_station_id: a unique identifier for the end station
Start_lat: the latitude of the start station
start_lng: the longitude of the start station
end_lat: the latitude of the end station
end_lng: the longitude of the end station
member_casual: this field indicates whether the rider type is member or a casual
R programming is the data analysis tool of my choice for this project, and preferable to spreadsheet due to the raw amount of data in this dataset. R is an open-source programming language and free software environment for statistical computing and graphics. R has a wide range of tools available and can handle massive amounts of data with ease. I imported and load the data into RStudio.
# Setting up my r environment and Loading packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
library(readr)
library(skimr)
library(dplyr)
library(tidyr)
library(gt)
library(farver)
library(plotrix)
library(tinytex)
# setting directory
setwd("C:/Users/HP PC/Documents/Cousera/Capstone_Project/cyclist_divvy_dataset")
# assigning and importing dataset
aug_22 <- read.csv('202208-divvy-tripdata.csv')
sep_22 <- read.csv("202209-divvy-tripdata.csv")
oct_22 <- read.csv("202210-divvy-tripdata.csv")
nov_22 <- read.csv("202211-divvy-tripdata.csv")
dec_22 <- read.csv("202212-divvy-tripdata.csv")
jan_23 <- read.csv("202301-divvy-tripdata.csv")
feb_23 <- read.csv("202302-divvy-tripdata.csv")
mar_23 <- read.csv("202303-divvy-tripdata.csv")
apr_23 <- read.csv("202304-divvy-tripdata.csv")
may_23 <- read.csv("202305-divvy-tripdata.csv")
jun_23 <- read.csv("202306-divvy-tripdata.csv")
jul_23 <- read.csv("202307-divvy-tripdata.csv")
colnames(aug_22)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_22)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_22)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov_22)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_22)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(feb_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_23)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
i ensured the column are the same on each assigned data set variable which ascertained that all column name are consistent.
# merging all the data set
divvy_trips <- rbind(aug_22, sep_22, oct_22, nov_22, dec_22, jan_23, feb_23, mar_23, apr_23, may_23, jun_23, jul_23)
# broad overview of the dataset
skim_without_charts(divvy_trips)
| Name | divvy_trips |
| Number of rows | 5723606 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5723606 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4821003 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4833101 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 868772 | 1802 | 0 |
| start_station_id | 0 | 1 | 0 | 36 | 868904 | 1503 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 925008 | 1800 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 925149 | 1506 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.92 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 6102 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.18 |
| end_lng | 6102 | 1 | -87.65 | 0.12 | -88.16 | -87.66 | -87.64 | -87.63 | 0.00 |
str(divvy_trips)
## 'data.frame': 5723606 obs. of 13 variables:
## $ ride_id : chr "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2022-08-07 21:34:15" "2022-08-08 14:39:21" "2022-08-08 15:29:50" "2022-08-08 02:43:50" ...
## $ ended_at : chr "2022-08-07 21:41:46" "2022-08-08 14:53:23" "2022-08-08 15:40:34" "2022-08-08 02:58:53" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 42 42 41.8 ...
## $ end_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
Carrying Out analysis to identify obvious errors, understand patterns within the data, detect outliers, and find relation within variables.
#quick glace of the data set, showing the first 10 rows
head(divvy_trips, 10)
# checking for duplicates
nrow(divvy_trips)
## [1] 5723606
divvy_trips %>% distinct(ride_id, .keep_all = TRUE) %>% count()
#no duplicates found
divvy_trips1 <- divvy_trips %>% drop_na
# there are 6,102 missing values, this brings the total number of observation to 5717504
# converting 'started_at' and 'ended_at' to datetime
divvy_trips1$started_at <- as.POSIXct(divvy_trips1$started_at)
divvy_trips1$ended_at <- as.POSIXct(divvy_trips1$ended_at)
# 'started_ride_date' column
divvy_trips1$started_ride_date <- format(as.Date(divvy_trips1$started_at), "%D")
# 'started_ride_time' column
divvy_trips1$started_ride_time <- format(as.POSIXct(divvy_trips1$started_at), "%H:%M:%S")
# 'ended_date' column
divvy_trips1$ended_date <- format(as.Date(divvy_trips1$ended_at), "%D")
# 'ended_ride_time' column
divvy_trips1$ended_ride_time <- format(as.POSIXct(divvy_trips1$ended_at), "%H:%M:%S")
# 'ride_length' column
divvy_trips1$ride_length <- difftime(divvy_trips1$ended_at, divvy_trips1$started_at, units = "secs")
# 'start_hour'
divvy_trips1$start_hour <- format(as.POSIXct(divvy_trips1$started_at), "%H")
#Extracting date component for 'year', 'month', & 'weekday'
divvy_trips2 <- divvy_trips1 %>%
mutate(year = lubridate::year(started_at),
month = lubridate::month(started_at, label = TRUE), # month "abbr"
weekday = lubridate::wday(started_at, label = TRUE)) # day of week
# overview of dataset
glimpse(divvy_trips2) #glimpse of dataset
## Rows: 5,717,504
## Columns: 22
## $ ride_id <chr> "550CF7EFEAE0C618", "DAD198F405F9C5F5", "E6F2BC47B6…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2022-08-07 21:34:15, 2022-08-08 14:39:21, 2022-08-…
## $ ended_at <dttm> 2022-08-07 21:41:46, 2022-08-08 14:53:23, 2022-08-…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ start_lat <dbl> 41.93, 41.89, 41.97, 41.94, 41.85, 41.79, 41.89, 41…
## $ start_lng <dbl> -87.69, -87.64, -87.69, -87.65, -87.65, -87.72, -87…
## $ end_lat <dbl> 41.94, 41.92, 41.97, 41.97, 41.84, 41.82, 41.89, 41…
## $ end_lng <dbl> -87.72, -87.64, -87.66, -87.69, -87.66, -87.69, -87…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ started_ride_date <chr> "08/07/22", "08/08/22", "08/08/22", "08/08/22", "08…
## $ started_ride_time <chr> "21:34:15", "14:39:21", "15:29:50", "02:43:50", "20…
## $ ended_date <chr> "08/07/22", "08/08/22", "08/08/22", "08/08/22", "08…
## $ ended_ride_time <chr> "21:41:46", "14:53:23", "15:40:34", "02:58:53", "20…
## $ ride_length <drtn> 451 secs, 842 secs, 644 secs, 903 secs, 352 secs, …
## $ start_hour <chr> "21", "14", "15", "02", "20", "13", "14", "20", "21…
## $ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 202…
## $ month <ord> Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, A…
## $ weekday <ord> Sun, Mon, Mon, Mon, Sun, Mon, Mon, Sun, Sun, Sun, S…
# removing bad data and dealing with outliers
divvy_trips2 %>%
select(member_casual, ride_length) %>%
group_by(member_casual) %>%
dplyr::summarize(min_ride_length = min(ride_length), max_ride_length = max(ride_length)) %>%
ungroup()
The minimum and maximum ‘ride_length’ shows a negative values and a ‘ride_length’ more than 24 hours(86400secs) of any ride trips. This values are outliers and will be removed. A trip less than 60 sec may also be due to other factors such as faulty bikes and will also be removed.
divvy_trips3 <- divvy_trips2 %>%
filter(ride_length >= 60 & ride_length < 86400)
nrow(divvy_trips3)
## [1] 5565647
#brings the total number of observation to 5565647
# converting 'ride_length' to numeric before analyzing
is.factor(divvy_trips3$ride_length) #FLASE
## [1] FALSE
divvy_trips3$ride_length <- as.numeric(as.character(divvy_trips3$ride_length))
is.numeric(divvy_trips3$ride_length) #TRUE
## [1] TRUE
#ordering by 'started_at' in Ascending order
divvy_trips_clean <- arrange(divvy_trips3, started_at)
# Dataset Overview
divvy_trips_clean %>% head(10)
Using the summary function to determine the mean, median, min & max range of ride_length”
##statistical summary of trip duration
summary(divvy_trips_clean$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 344.0 592.0 931.5 1042.0 86255.0
#statistical summary of trip duration by member type
divvy_trips_clean %>%
group_by(member_casual) %>%
summarize(minmum_ride_length = min(ride_length),
maximum_ride_length = max(ride_length),
average_ride_length = mean(ride_length),
median_ride_length = median(ride_length))
#yearly percentage of casual Vs. member rides
divvy_trips_clean %>%
group_by(member_casual) %>%
summarise("numbers of rides" = n(), perc = n()/nrow(divvy_trips_clean) *100)
#Create data for the graph
x <- c(3456701, 2108946)
piepercent <- paste0(round(x/sum(x)*100, 1), "%")
labels <- c("member", "casual")
color <- c("#00BFC4","#F8766D")
# Plot the chart.
pie3D(x, labels = piepercent, explode =0.1, radius = 1,
main = "Proportion of Total Rides %", col = color)
legend(title = "Membership", "topright", c("member", "casual"),
cex = 0.5, fill = color)
The distribution chart indicates, members possessing (62.1%), and casual riders have (37.9%) of the total rides. members used rides (24.2%) more than casual riders during the 12 months period.
# Comparison of total rides n
divvy_trips_clean %>%
group_by(member_casual) %>%
ggplot(aes(x=member_casual, fill = member_casual))+
labs(title= "Membership Distribution", y= "Number of rides", x="casual vs member", fill = "Membership type") +
geom_bar(position='dodge')
An analysis of the rides frequency over a period of 12 months to gain precise insights that will guide my approach to encourage casual members to transition into full-fledged annual members.
#Calculating rides per start_hour & Average
divvy_trips_clean %>%
group_by(start_hour, member_casual) %>%
summarize(ride_numbers = n(), .groups = 'drop', Average_duration = mean(ride_length))
# Visualization of rides at early hours
divvy_trips_clean %>%
group_by(start_hour, member_casual) %>%
summarize(ride_numbers = n(), .groups = 'drop', Average_duration = mean(ride_length)) %>%
ggplot(aes(x=start_hour, y=ride_numbers, fill= member_casual)) +
geom_col(position = "dodge") +
labs(title = "Rides started per hour of the day", x="Hours", y="Number of started rides", fill="Membership")
#Calculating rides per week
divvy_trips_clean %>%
group_by(member_casual, weekday) %>%
summarize(Number_Rides = n(), .groups = 'drop', Average_rides = mean(ride_length)) %>%
arrange(weekday)
# Visualization of rides weekly
divvy_trips_clean %>%
group_by(member_casual, weekday) %>%
summarize(Number_Rides = n(), .groups = 'drop', Average_rides = mean(ride_length)) %>%
arrange(weekday) %>%
ggplot(aes(x=weekday, y=Number_Rides, fill=member_casual))+
geom_col(position='dodge')+
labs(title = "Total Rides Weekly: Subscriber Vs. Casual", x="Days", y="Total Ride Numbers", fill="Membership_Type")
# Visualization of average weekly rides
divvy_trips_clean %>%
group_by(member_casual, weekday) %>%
summarize(Number_Rides = n(), .groups = 'drop', Average_rides = mean(ride_length)) %>%
arrange(weekday) %>%
ggplot(aes(x=member_casual, y=Average_rides, fill=weekday))+
geom_bar(position='Dodge', stat='identity')+
labs(title = "Average Rides Weekly: Subscriber Vs. Casual", x="Days", y="Average_rides_length", fill="Weekday")
# Calculating rides per month
divvy_trips_mth <- divvy_trips_clean
divvy_trips_mth$month <- factor(divvy_trips_mth$month,levels = c( "Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"))
divvy_trips_mth %>%
group_by(member_casual, month) %>%
summarize(ride_numbers = n(), .groups = 'drop', Average_duration = mean(ride_length)) %>%
arrange(month)
# Visualization Monthly rides
divvy_trips_mth %>%
group_by(member_casual, month) %>%
summarize(ride_numbers = n(), .groups = 'drop', Average_duration = mean(ride_length)) %>%
arrange(month) %>%
ggplot(aes(x=month, y=ride_numbers, fill= member_casual)) +
theme(axis.text.x = element_text(angle = 45)) +
geom_col(width=0.5,position = position_dodge(width=0.5)) +
labs(title = "Monthly Trips: Casual Vs. Subscribed members", x="month", y="Number of Trips", fill="Membership")
#Average number of rides
divvy_trips_clean %>%
group_by(member_casual, month) %>%
summarize(ride_numbers = n(), .groups = 'drop', Average_duration = mean(ride_length)) %>%
arrange(month) %>%
ggplot(aes(x=month, y=Average_duration, fill= member_casual)) +
geom_col(position = 'dodge') +
labs(title = "Average ride length of members by month", x="month", y="Average Ride_length", fill="Membership")
divvy_trips_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(total=n(), .groups = 'drop', perc =round(n()/nrow(divvy_trips_clean)*100, 1)) %>%
arrange(rideable_type)
# Visualization of prefered bikes
divvy_trips_clean %>%
group_by(member_casual, rideable_type) %>%
summarise(total=n(), .groups = 'drop', perc =round(n()/nrow(divvy_trips_clean)*100, 1)) %>%
arrange(rideable_type) %>%
ggplot(aes(x=rideable_type, y= total, fill=member_casual))+
geom_col(position='dodge')+
labs(title = "Preferred Bikes")+
facet_wrap(~member_casual)+
geom_text(aes(label = paste(format(perc, nsmall=1), "%")), vjust= -0.2)
Note : All ride ids are unique so we cannot conclude if the same rider taken several rides. More rider data needed for further analysis
Additional data that could expand scope of analysis
Pricing details for members and casual riders - Based on this data, we might be to optimize cost structure for casual riders or provide discounts without affecting the profit margin.
Address neighborhood details of members to investigate if there are any location specific parameters that encourage membership.Way to determine a recurring bike user using payment information or any personal identification.