Cyclistic is a bike-share program that features more than 5,800 bicycles and 700 docking stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic’s marketing strategy has relied on building general awareness and appealing to broad customer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders; customers who purchase annual memberships are Cyclistic members.
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, as data analysts, our job is to find and analyze patterns and trends in Cyclistic’s historical bike trip data to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, we can create a new marketing strategy to convert casual riders into annual members.
Identify patterns and trends in Cyclistic’s historical bike trip data to understand how casual riders and annual members use Cyclistic bikes differently, so we can create a new marketing strategy to convert casual riders into annual members.
The dataset I use was acquired from Divvy Tripdata. The data has been made available to the public by Motivate International Inc. under this license. For this capstone project, I use data from October 2021 to September 2022 (12 months). I use R to combine and clean the dataset, which contains more than 5 million rows, far more than a spreadsheet can handle.
First, we need to load the libraries.
#Load the libraries
library(tidyverse)
library(lubridate)
library(summarytools)
library(data.table)
library(hms)
#Set the working directory
setwd('D:/DatSci & Analyst/Google Data Analytics/CAPSTONE Project')
Then, we combine the Cyclistic bike-share trip data.
#Combine the Cyclistic Bike Dataset from October 2021 to September 2022
filenames <- list.files(path='D:/DatSci & Analyst/Google Data Analytics/CAPSTONE Project/Dataset', full.names=TRUE)
#Read all csv files from filenames
cyclist<-rbindlist(lapply(filenames,fread))
Before we clean the data, we need summary statistics for all variables in the data frame. We use the dfSummary function from the summarytools library and set plain.ascii to FALSE for better printing. The summary table is useful for checking data types, validity, and missing data.
print(dfSummary(cyclist, plain.ascii = FALSE, graph.magnif = 0.75), method = 'render')
| No | Variable | Type | Distinct Values | Valid | Missing |
|----|----------|------|-----------------|-------|---------|
| 1 | ride_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 2 | rideable_type | character | | 5828235 (100.0%) | 0 (0.0%) |
| 3 | started_at | POSIXct, POSIXt | 4875181 | 5828235 (100.0%) | 0 (0.0%) |
| 4 | ended_at | POSIXct, POSIXt | 4884765 | 5828235 (100.0%) | 0 (0.0%) |
| 5 | start_station_name | character | | 5828235 (100.0%) | 0 (0.0%) |
| 6 | start_station_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 7 | end_station_name | character | | 5828235 (100.0%) | 0 (0.0%) |
| 8 | end_station_id | character | | 5828235 (100.0%) | 0 (0.0%) |
| 9 | start_lat | numeric | 601238 | 5828235 (100.0%) | 0 (0.0%) |
| 10 | start_lng | numeric | 569536 | 5828235 (100.0%) | 0 (0.0%) |
| 11 | end_lat | numeric | 217893 | 5822391 (99.9%) | 5844 (0.1%) |
| 12 | end_lng | numeric | 207375 | 5822391 (99.9%) | 5844 (0.1%) |
| 13 | member_casual | character | | 5828235 (100.0%) | 0 (0.0%) |

Generated by summarytools 1.0.1 (R version 4.2.1), 2022-10-19
We can see from the dfSummary table that there are 5844 NA values in end_lat and end_lng. We have to validate them first.
sum(is.na(cyclist))
## [1] 11688
colSums(is.na(cyclist))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 5844 5844
## member_casual
## 0
It is true that there are 5844 NA values in each of end_lat and end_lng. After validating the data, we remove the NA values and check for NA values again.
#Remove the rows with NA values in the end_lat and end_lng columns
cyclist <- na.omit(cyclist)
#Check the NA values again
sum(is.na(cyclist))
## [1] 0
We have successfully removed the NA values.
After removing the NA values, we remove duplicate rows from the dataset. First, we check the dataset’s total number of rows; then we remove the duplicates and check the total again.
#See total rows
dim(cyclist)
## [1] 5822391 13
#Remove duplicates (assign the result back so the removal persists)
cyclist <- cyclist %>% distinct()
#See total rows after removing duplicates
dim(cyclist)
## [1] 5822391 13
We see that no rows were removed, which means there is no duplicate data in the dataset.
We create date variables from the started_at and ended_at columns. We also create time-of-day columns.
#Create date variables
cyclist <- cyclist %>% mutate(start_year = year(started_at),
start_month = month(started_at),
start_day = weekdays(started_at),
start_hour = hour(started_at),
start_time_of_day = case_when(start_hour>= 5 & start_hour <=12 ~ "Morning",
start_hour>=13 & start_hour <=17 ~ "Afternoon",
start_hour>=18 & start_hour <=22 ~ "Evening",
start_hour>= 0 & start_hour <=4 | start_hour ==23 ~ "Night"),
end_year = year(ended_at),
end_month = month(ended_at),
end_day = weekdays(ended_at),
end_hour = hour(ended_at),
end_time_of_day = case_when(end_hour>= 5 & end_hour <=12 ~ "Morning",
end_hour>=13 & end_hour <=17 ~ "Afternoon",
end_hour>=18 & end_hour <=22 ~ "Evening",
end_hour>= 0 & end_hour <=4 | end_hour ==23 ~ "Night")
)
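As a quick, optional spot-check (an addition, not part of the original code), we can preview a few of the new columns to confirm the mapping looks right:
#Preview the new date variables
cyclist %>% select(started_at, start_day, start_hour, start_time_of_day) %>% head()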
We create another variable to show how long a person uses a Cyclistic bike.
#Create ride length
cyclist <- cyclist %>% mutate(ride_length_mins = difftime(ended_at, started_at, units="mins"))
Next, we check whether there are negative time values in the data.
#Check the negative time values count
cyclist %>% filter(ride_length_mins<0) %>% count()
## n
## 1: 108
We found 108 negative time values, so we subset them and inspect their start and end times.
negative_time <- cyclist %>% filter(ride_length_mins<0) %>% select(started_at, ended_at)
head(negative_time, 10)
## started_at ended_at
## 1: 2021-11-07 01:40:02 2021-11-07 01:05:46
## 2: 2021-11-07 01:52:53 2021-11-07 01:05:22
## 3: 2021-11-07 01:40:13 2021-11-07 01:00:29
## 4: 2021-11-07 01:34:03 2021-11-07 01:17:13
## 5: 2021-11-07 01:54:25 2021-11-07 01:03:44
## 6: 2021-11-07 01:54:04 2021-11-07 01:25:57
## 7: 2021-11-07 01:51:52 2021-11-07 01:22:53
## 8: 2021-11-07 01:54:12 2021-11-07 01:05:09
## 9: 2021-11-07 01:54:36 2021-11-07 01:03:11
## 10: 2021-11-07 01:51:21 2021-11-07 01:07:59
Negative time values occur when a ride’s end time is earlier than its start time (ended_at < started_at), which suggests these rows were input incorrectly (the start and end times are swapped). So we create a new calculation for those rows: difftime(started_at, ended_at).
#Recompute ride length, swapping the difference where the timestamps are reversed
cyclist <- cyclist %>% mutate(ride_length_mins = ifelse(ended_at > started_at,
                                                        difftime(ended_at, started_at, units='mins'),
                                                        difftime(started_at, ended_at, units='mins')))
Then, we check the count of negative time values again.
cyclist %>% filter(ride_length_mins<0) %>% count()
## n
## 1: 0
All the negative time values have been successfully converted to positive time values.
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population; in a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal (NIST/SEMATECH e-Handbook). We will use the interquartile range (IQR) method to remove the outliers. First, we need to create the upper and lower bounds.
#Removing outliers in the ride_length_mins column with the IQR method
Q1 <- quantile(cyclist$ride_length_mins, .25)
Q3 <- quantile(cyclist$ride_length_mins, .75)
iqr <- IQR(cyclist$ride_length_mins)  #named iqr to avoid masking stats::IQR
upper_bound <- Q3 + 1.5*iqr
lower_bound <- Q1 - 1.5*iqr
Then, we remove the outliers. Any values that fall outside the lower and upper bounds are considered outliers.
cyclist_clean <- cyclist %>% subset(ride_length_mins>lower_bound & ride_length_mins<upper_bound)
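As a sanity check (an addition, not in the original code), we can count how many rows the IQR filter removed:
#Count the rows dropped as outliers
nrow(cyclist) - nrow(cyclist_clean)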
Now that the dataset is clean, we are ready to visualize and analyze the data. The clean Cyclistic data will be exported as a CSV file and visualized in Tableau.
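A minimal export sketch using fwrite from data.table (the file name is a placeholder):
#Export the cleaned dataset for Tableau (file name is a placeholder)
fwrite(cyclist_clean, 'cyclist_clean.csv')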
First, we create a data frame containing the most-used start station coordinates, then data frames containing the most-used start station names for members and casual riders. These data frames will be exported as CSV files, and we will manually input the coordinates for the top 5 start station names. The dfSummary table showed blank station names, so we also need to exclude those.
#Create coordinate data start station
start_station_coord<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_lat, start_lng, start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count))
#Create top 5 start station for member
start_station_name_member<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="member")
top_5_start_station_name_member<-start_station_name_member %>% head(5)
#Create top 5 start station for casual
start_station_name_casual<-cyclist_clean %>%
filter(start_station_name != "") %>%
group_by(start_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="casual")
top_5_start_station_name_casual<-start_station_name_casual %>% head(5)
We apply the same method to the end stations.
#Create coordinate data end station
end_station_coord<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_lat, end_lng, end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count))
#Create top 5 end station for member
end_station_name_member<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="member")
top_5_end_station_name_member<-end_station_name_member %>% head(5)
#Create top 5 end station for casual
end_station_name_casual<-cyclist_clean %>%
filter(end_station_name != "") %>%
group_by(end_station_name, member_casual) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
filter(member_casual=="casual")
top_5_end_station_name_casual<-end_station_name_casual %>% head(5)
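The station data frames above are the ones exported as CSV files for Tableau; a minimal sketch using fwrite (the file names are placeholders):
#Export the station summaries for Tableau (file names are placeholders)
fwrite(start_station_coord, 'start_station_coord.csv')
fwrite(top_5_start_station_name_member, 'top_5_start_station_member.csv')
fwrite(top_5_start_station_name_casual, 'top_5_start_station_casual.csv')
fwrite(end_station_coord, 'end_station_coord.csv')
fwrite(top_5_end_station_name_member, 'top_5_end_station_member.csv')
fwrite(top_5_end_station_name_casual, 'top_5_end_station_casual.csv')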
Tableau was used to visualize the data, and now it’s time to analyze! We have to identify patterns and trends in the data visualizations and draw insights from them. The insights will help us solve the business task.
Let us take a look at the dashboards that have already been created in Tableau.
The main dashboard contains the daily and monthly charts of the number of rides and average ride length for casual and annual members. We also have a pie chart comparing annual members vs. casual riders and rideable type. The next dashboard, Maps, contains the top 5 start and end station names, including Google Street View. Last, the Final Report dashboard contains the business task, key takeaways, and business strategies to convert casual riders to annual members.
- Cyclistic has more annual members than casual riders (61.45% vs. 38.55%).
- Casual riders, on average, use Cyclistic bikes longer than annual members.
- Casual riders are more active on weekends than annual members, which suggests casual riders are more likely to ride for leisure.
- Annual members tend to use Cyclistic bikes on weekdays, which suggests they are more likely to ride to commute to work.
- Annual members tend to use Cyclistic bikes in the morning and afternoon, which indicates they are more likely to commute from home to the office (morning) and from the office back home (afternoon).
- The busiest (top 5) stations are in close proximity, suggesting that casual riders and annual members usually ride around the same area.
- Casual riders mainly use Cyclistic bikes from June to August.
- Annual members mainly use Cyclistic bikes from May to October.
- Bike-share usage by both casual riders and annual members drops significantly around January and February.
- Docked bikes are rarely used, and never by annual members.
- Casual riders prefer electric bikes, while annual members use electric and classic bikes almost evenly.
Here are some recommendations that can be implemented to convert casual riders to annual members:
- Offer regular membership discounts to casual riders, especially from June to August.
- Run promotions during the weekends to reach more casual riders.
- Provide and promote additional perks of a Cyclistic membership, such as member-only events and prizes.
- Run promotions at the top 5 stations that casual riders use most often.
- Create informative promotions and banners showing casual riders how cost-effective it is to use Cyclistic bikes as an annual member for commuting to work on weekdays.
- Increase the bike rental prices for casual riders during the weekends, especially for electric bikes.
-Muhammad Hafidz Roihan, 2022.