In 2016, Cyclistic launched a successful bike-share offering having a fleet of 5,824 bicycles that are tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.
Riders who have an annual subscription are called members while riders who are single-ride or full-day pass users are considered casual riders.
The director of marketing is looking to maximize the number of annual memberships as they are more profitable than single-ride or full-day passes. This strategy is believed to be the key to future growth.
Ask
Prepare
Process
Analyze
Share
Act
Ask
The questions that need to be answered are:
1)How do annual members and casual riders use Cyclistic bikes differently?
2)Why would casual riders buy Cyclistic annual memberships?
3)How can Cyclistic use digital media to influence casual riders to become members?
Prepare The dataset follows the ROCCC Analysis as described below:
Reliable - yes, not biased
Original - yes, can locate the original public data
Comprehensive - yes, not missing important information
Current - yes, updated monthly
Cited - yes
I will be using the public dataset located here. The data has been made available by Motivate International Inc. under this license.
Key Tasks Followed:
Downloaded data and copies have been stored on the computer.
I have downloaded the data for Jan 21-Feb 22 Period.
The data is in CSV (comma-separated values) format, and there are a total of 13 columns.
#install.packages("ggplot2")
#install.packages("dplyr")
#install.packages("tidyverse")
library(ggplot2)
library(dplyr)
library(tidyverse)
setwd("D:/Cyclist/data")
Jan <- read.csv("202101-divvy-tripdata.csv")
Feb <- read.csv("202102-divvy-tripdata.csv")
Mar <- read.csv("202103-divvy-tripdata.csv")
Apr <- read.csv("202104-divvy-tripdata.csv")
May <- read.csv("202105-divvy-tripdata.csv")
Jun <- read.csv("202106-divvy-tripdata.csv")
Jul <- read.csv("202107-divvy-tripdata.csv")
Aug <- read.csv("202108-divvy-tripdata.csv")
Sep <- read.csv("202109-divvy-tripdata.csv")
Oct <- read.csv("202110-divvy-tripdata.csv")
Nov <- read.csv("202111-divvy-tripdata.csv")
Dec <- read.csv("202112-divvy-tripdata.csv")
Jan22 <- read.csv("202201-divvy-tripdata.csv")
Feb22 <- read.csv("202202-divvy-tripdata.csv")
triprawdata <- rbind(Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Jan22,Feb22)
Process
colnames(triprawdata) #column names of data
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(triprawdata) #no. of rows
## [1] 5814442
ncol(triprawdata) #no. of columns
## [1] 13
sum(is.na(triprawdata)) # sum of NA's in data
## [1] 9868
# total NA values 9868 0.16% of total data
head(triprawdata) #see first 6 rows of data frame
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021/1/23 16:14 2021/1/23 16:24
## 2 DC88F20C2C55F27F electric_bike 2021/1/27 18:43 2021/1/27 18:47
## 3 EC45C94683FE3F27 electric_bike 2021/1/21 22:35 2021/1/21 22:37
## 4 4FA453A75AE377DB electric_bike 2021/1/7 13:31 2021/1/7 13:42
## 5 BE5E8EB4E7263A0B electric_bike 2021/1/23 2:24 2021/1/23 2:24
## 6 5D8969F88C773979 electric_bike 2021/1/9 14:24 2021/1/9 15:17
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
str(triprawdata) #See list of columns and data type
## 'data.frame': 5814442 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021/1/23 16:14" "2021/1/27 18:43" "2021/1/21 22:35" "2021/1/7 13:31" ...
## $ ended_at : chr "2021/1/23 16:24" "2021/1/27 18:47" "2021/1/21 22:37" "2021/1/7 13:42" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
summary(triprawdata) #Statistical summary of data
## ride_id rideable_type started_at ended_at
## Length:5814442 Length:5814442 Length:5814442 Length:5814442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5814442 Length:5814442 Length:5814442 Length:5814442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.39 Min. :-88.97
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.17 Max. :-87.49
## NA's :4934 NA's :4934
## member_casual
## Length:5814442
## Class :character
## Mode :character
##
##
##
##
# 4934 NA's each in end_lat and end_lng
triprawdata<- na.omit(triprawdata)
# Date format is inconsistent some dates are in YYYY-MM-DD format or MM/DD/YYYY
start<-as.POSIXlt(triprawdata$started_at, tz = "","%m/%d/%Y %H:%M")
start<- data.frame(start)
start<- (start[1:nrow(na.omit(start)),])
start<- data.frame(start)
start1<-as.POSIXlt(triprawdata$started_at, tz = "","%Y-%m-%d %H:%M")
start1<- data.frame(start1)
start1<- (start1[(nrow(start)+1):nrow(start1),])
start1<-data.frame(start1)
names(start)<- "start"
names(start1)<- "start"
start_time<- (rbind(start,start1))
triprawdata$started_at <- NULL
triprawdata['started_at']<- start_time
end<-(as.POSIXlt(triprawdata$ended_at, tz = "","%m/%d/%Y %H:%M"))
end<- data.frame(end)
end<- (end[1:nrow(na.omit(end)),])
end<-data.frame(end)
end1<-(as.POSIXlt(triprawdata$ended_at, tz = "","%Y-%m-%d %H:%M"))
end1<- data.frame(end1)
end1<- (end1[(nrow(end)+1):nrow(end1),])
end1<-data.frame(end1)
names(end)<- "end"
names(end1)<- "end"
end_time<- (rbind(end,end1))
triprawdata$ended_at<- NULL
triprawdata['ended_at']<- end_time
ride_length<- difftime(end_time[,], start_time[,], units = "mins")
ride_length<-data.frame(ride_length)
triprawdata <- cbind(triprawdata,ride_length)
# Some Ride lengths are negative,start time and end time are found interchanged
clean_data<- triprawdata%>% filter(ride_length>0)
sum(duplicated(clean_data$ride_id))
## [1] 0
weekday<- weekdays(clean_data$started_at)
weekday<- factor(weekday, levels =c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
weekday<- data.frame(weekday)
clean_data<- cbind(clean_data,weekday)
#write.csv(clean_data,"D:/Cyclist/clean_data.csv", row.names = FALSE)
for (i in 7:10) {
clean_data[,i]<- round(clean_data[,i],4)
clean_data[,i]<- as.character(clean_data[,i])
}
start_no_data<- clean_data%>% filter (start_station_name =="")%>% group_by(start_lat,start_lng,start_station_name,member_casual)%>% summarise(n=n())
# writing file to excel for further analysis
#write.csv(start_no_data,"start_no_data.csv",row.names = FALSE)
# filtering top data cordinates and assigning start station name
#filtering out in excel top 50 cordinates and finding their station name from our database clean_data
start_no_data<-read.csv("start_no_data.csv")
start_data<- clean_data%>% filter (start_station_name !="") %>%group_by(start_lat,start_lng,start_station_name,member_casual)%>% summarise(n=n())
# changing the names of 6 stations to their new names in excel
#DuSable Lake Shore Dr & Monroe St
#Lake Shore Dr & Monroe St
#Lake Shore Dr & Diversey Pkwy
#DuSable Lake Shore Dr & Diversey Pkwy
#DuSable Lake Shore Dr & North Blvd
#Lake Shore Dr & North Blvd
#Dusable Lake Shore Dr & Ohio St
#Lake Shore Dr & Ohio St
#Lake Shore Dr & Wellington Ave
#Dusable Lake Shore Dr & Wellington Ave
#Lake Shore Dr & Belmont Ave
#Dusable Lake Shore Dr & Belmont Ave
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & Diversey Pkwy"){
start_data[i,3]="DuSable Lake Shore Dr & Diversey Pkwy"
}
}
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & Monroe St"){
start_data[i,3]="DuSable Lake Shore Dr & Monroe St"
}
}
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & Ohio St"){
start_data[i,3]="Dusable Lake Shore Dr & Ohio St"
}
}
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & North Blvd"){
start_data[i,3]="DuSable Lake Shore Dr & North Blvd"
}
}
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & Wellington Ave"){
start_data[i,3]="Dusable Lake Shore Dr & Wellington Ave"
}
}
for ( i in 1:nrow(start_data)){
if (start_data[i,3]== "Lake Shore Dr & Belmont Ave"){
start_data[i,3]="DuSable Lake Shore Dr & Belmont Ave"
}
}
start_data <- data.frame(start_data)
start_data<- rbind(start_data,start_no_data)
###########start database created########
end_no_data <- clean_data%>% filter(end_station_name=="")%>% group_by(end_lat,end_lng,end_station_name,member_casual)%>% summarise(n=n())
end_no_data<- na.omit(end_no_data)
#write.csv(end_no_data,"end_no_data.csv",row.names = FALSE)
# filtering top data cordinates and assigning start station name
#filtering out in excel top 25 cordinates and finding their station name from our database
end_no_data<- read.csv("end_no_data.csv")
end_data_unclean <- clean_data%>% filter(end_station_name!="")%>% group_by(end_lat,end_lng,end_station_name,member_casual)%>% summarise(n=n())
#write.csv(end_data_unclean,"end_data_unclean.csv",row.names = FALSE)
# changing the names of 6 stations to their new names in excel
end_data_unclean<- read.csv("end_data_unclean.csv")
end_data<- rbind(end_data_unclean,end_no_data)
###########end database created########
Analyze
Key tasks
clean_data%>% group_by(member_casual)%>% summarise(n=n())%>%
mutate(percent = n*100/sum(n))
## # A tibble: 2 x 3
## member_casual n percent
## <chr> <int> <dbl>
## 1 casual 1922226 44.6
## 2 member 2389007 55.4
ggplot(data = clean_data,mapping= aes(x= member_casual)) +geom_bar() + labs(title="Member Vs Casual")
cat('average duration of rides is : ', mean(clean_data$ride_length), "mins")
## average duration of rides is : 19.94426 mins
cat('maximum duration of rides is : ', round(max(clean_data$ride_length)/60/24), "days")
## maximum duration of rides is : 37 days
bar<-clean_data%>% group_by(member_casual)%>% summarise(avg_ride_length=mean(ride_length))
bar
## # A tibble: 2 x 2
## member_casual avg_ride_length
## <chr> <drtn>
## 1 casual 28.49893 mins
## 2 member 13.06106 mins
ggplot(bar, aes(x = member_casual, y = avg_ride_length, fill =member_casual))+geom_col()+ scale_fill_manual(values = c("Purple", "pink"))
clean_data%>% group_by(rideable_type) %>% summarise(n=n())%>% mutate(percent = n*100/sum(n))
## # A tibble: 3 x 3
## rideable_type n percent
## <chr> <int> <dbl>
## 1 classic_bike 2435347 56.5
## 2 docked_bike 216447 5.02
## 3 electric_bike 1659439 38.5
ggplot(data = clean_data,mapping= aes(x= rideable_type,fill=rideable_type)) +geom_bar() + labs(title="Bike Types used")
rideable_type<-clean_data%>% group_by(rideable_type,member_casual) %>% summarise(n=n())%>%
mutate(percent = n*100/sum(n))
rideable_type
## # A tibble: 5 x 4
## # Groups: rideable_type [3]
## rideable_type member_casual n percent
## <chr> <chr> <int> <dbl>
## 1 classic_bike casual 952630 39.1
## 2 classic_bike member 1482717 60.9
## 3 docked_bike casual 216447 100
## 4 electric_bike casual 753149 45.4
## 5 electric_bike member 906290 54.6
ggplot(data = as.data.frame(rideable_type),mapping= aes(x= rideable_type, y = n, fill =member_casual)) +geom_bar(stat = 'identity') + labs(title="Bike Type with rider")
member_type<-clean_data%>% group_by(member_casual,rideable_type) %>% summarise(n=n())%>%
mutate(percent = n*100/sum(n))
member_type
## # A tibble: 5 x 4
## # Groups: member_casual [2]
## member_casual rideable_type n percent
## <chr> <chr> <int> <dbl>
## 1 casual classic_bike 952630 49.6
## 2 casual docked_bike 216447 11.3
## 3 casual electric_bike 753149 39.2
## 4 member classic_bike 1482717 62.1
## 5 member electric_bike 906290 37.9
ggplot(data = as.data.frame(member_type),mapping= aes(x= member_casual, y=n, fill =rideable_type)) +geom_bar(stat = 'identity') + labs(title="Choice of Bike by Riders")
clean_data$weekday<- factor(clean_data$weekday, levels= c("Monday", "Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))
weektable<- clean_data%>% group_by(weekday)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n))
weektable
## # A tibble: 7 x 3
## weekday n percent
## <fct> <int> <dbl>
## 1 Monday 557725 12.9
## 2 Tuesday 553005 12.8
## 3 Wednesday 577020 13.4
## 4 Thursday 585778 13.6
## 5 Friday 621732 14.4
## 6 Saturday 758507 17.6
## 7 Sunday 657466 15.3
ggplot(data = clean_data,mapping= aes(x= weekday)) +geom_bar() + labs(title="Usage per day")
ggplot(data = clean_data,mapping= aes(x= weekday, fill = member_casual)) +geom_bar() +facet_wrap(~member_casual)+theme(axis.text.x = element_text(angle = 60, hjust =1))
Visualization in Tableau
The following calculated fields are created for analysis and visualization:
Month - MONTH([Started At]) , Time of day (Hour) - DATEPART(‘hour’,[Started At])
knitr::include_graphics('/Cyclist/data/Riders Trend by Month.png')
knitr::include_graphics('/Cyclist/data/Riders Trend by Hour.png')
casual_riders<-clean_data%>% filter(member_casual == 'casual')%>%group_by(weekday,rideable_type)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n))
ggplot(data= casual_riders, mapping= aes(x= weekday, y=n, fill =rideable_type)) + geom_bar(stat = 'identity')
members<-clean_data%>% filter(member_casual == 'member')%>%group_by(weekday,rideable_type)%>% summarise(n=n())%>% mutate(percent = n*100/sum(n))
ggplot(data= members, mapping= aes(x= weekday, y=n, fill =rideable_type)) + geom_bar(stat = 'identity')
knitr::include_graphics('/Cyclist/data/Top 10 Starting Stations.png')
ss<- start_data %>%filter(start_station_name!= "")%>% group_by(start_station_name,member_casual) %>% summarise(n=sum(n)) %>% arrange(desc(n))
p<- ggplot(data =as.data.frame(ss), aes(x = start_station_name, y=n,fill=member_casual))+
geom_bar(stat = 'identity')+facet_wrap(~ member_casual, ncol =1)+theme(axis.text.x = element_blank(),legend.position="none")
dat_text <- data.frame(
label = c("Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St", "Millennium Park","Clark St & Elm St","Kingsbury St & Kinzie St","Wells St & Concord Ln" ),
member_casual = c('casual','casual','casual','member','member','member'),
x = c("Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St", "Millennium Park","Clark St & Elm St","Kingsbury St & Kinzie St","Wells St & Concord Ln"),
y = c(60000,37000,35000,30000,30000,30000),
hjust =c(0,0,1,0.5,0.5,0.5),
angle=c(0,0,0,0,0,0))
p + geom_text(data = dat_text, mapping = aes(x = x, y = y, label = label, hjust=hjust, angle=angle), size =2.5)
knitr::include_graphics('/Cyclist/data/Top 10 Ending Stations.png')
knitr::include_graphics('/Cyclist/data/Most Used Starting Stations.png')
knitr::include_graphics('/Cyclist/data/MOst Used Ending Stations.png')
Recommendations
To convert casual riders into annual members, the following marketing strategies can be implemented:
Offer occasional membership discounts to casual riders on summer and weekends.
Increase the renting price of the bikes for the weekend to target casual users into having a membership, especially for classic and electric bikes, since they are preferred more by casual users.
Put banners or special discount advertisements at Lake Shore Drive that would target casual users which might influence them to become members.
Thanks for Reading, Please provide your feedback.
Check Detailed Analysis for [Code] and Visualization.