This is a case study for the Google Data Analytics certificate. The analysis uses public data from Cyclistic Bike-Share covering June 2022 through May 2023. Cyclistic is a bike-share company in Chicago. The goal of this analysis is to make recommendations that maximize the number of annual memberships. The analysis proceeds through the phases “ask”, “prepare”, “process”, “analyze”, “share”, and “act” to answer the key business questions.
This section clarifies the basic requirements and business questions related to the analysis.
Lily Moreno: The director of marketing. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program.
Cyclistic marketing analytics team: a team of data analysts who are responsible for the development of campaigns and initiatives to promote the bike-share program.
Cyclistic executive team: the detail-oriented executive team will decide whether to approve the recommended marketing program.
More than 5800 bicycles
More than 600 docking stations
Type of bikes:
Reclining bikes
Hand tricycles
Cargo bikes
Type of riders:
Majority of riders use standard two-wheeled bikes.
8% of riders use assistive options.
Usage:
Riders are more likely to ride for leisure.
About 30% use the bikes to commute to work each day.
To start the analysis, Cyclistic’s historical trip data will be used to identify and analyze trends.
The historical trip data is stored and made available for analytical use by Motivate International Inc. under its license.
The data is organised monthly. For this analysis, 12 months of data are required, covering June 2022 through May 2023. The data set follows ROCC: Reliable, Original, Comprehensive, Current, and Cited. According to Divvy’s data license agreement, the data set is governed by licensing, privacy, security, and accessibility terms. (https://ride.divvybikes.com/data-license-agreement)
Please note that the following information is excluded due to data-privacy concerns: personally identifiable information, credit card numbers, the Cyclistic service area, and purchasing frequency.
Under the ROCC (Reliable, Original, Comprehensive, Current, Cited) standard, the trip data comes from a reliable and original source, is provided as CSV files, and is updated for the current period, so we consider the data credible for this analysis.
In the process phase of the data analysis, the following tasks are carried out.
Install the required packages for data cleaning, manipulation, visualization, and documentation.
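If the packages are not already installed, they can be installed first; a minimal sketch (the package list simply mirrors the libraries loaded below):
#Install once; skip any packages that are already installed.
install.packages(c("tidyverse", "lubridate", "ggplot2", "cowplot", "here", "janitor", "skimr", "dplyr"))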
Then we load the packages as follows.
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
library(cowplot) #helps visualize data
#For data cleaning, the following packages are loaded.
library(here)
library(janitor)
library(skimr)
library(dplyr)
getwd() #displays your working directory
## [1] "C:/Users/satos/Documents/project/case-1"
Load the 12 monthly data sets from CSV files with the read_csv() function.
df2206 <- read_csv("202206-divvy-tripdata.csv")
df2207 <- read_csv("202207-divvy-tripdata.csv")
df2208 <- read_csv("202208-divvy-tripdata.csv")
df2209 <- read_csv("202209-divvy-tripdata.csv")
df2210 <- read_csv("202210-divvy-tripdata.csv")
df2211 <- read_csv("202211-divvy-tripdata.csv")
df2212 <- read_csv("202212-divvy-tripdata.csv")
df2301 <- read_csv("202301-divvy-tripdata.csv")
df2302 <- read_csv("202302-divvy-tripdata.csv")
df2303 <- read_csv("202303-divvy-tripdata.csv")
df2304 <- read_csv("202304-divvy-tripdata.csv")
df2305 <- read_csv("202305-divvy-tripdata.csv")
Confirm that the column names are identical in every file.
colnames(df2206)
colnames(df2207)
colnames(df2208)
colnames(df2209)
colnames(df2210)
colnames(df2211)
colnames(df2212)
colnames(df2301)
colnames(df2302)
colnames(df2303)
colnames(df2304)
colnames(df2305)
############
#[1] "ride_id" "rideable_type" "started_at"
#[4] "ended_at" "start_station_name" "start_station_id"
#[7] "end_station_name" "end_station_id" "start_lat"
#[10] "start_lng" "end_lat" "end_lng" #
#[13] "member_casual"
###########
Accordingly, the column names are the same for all 12 files. In addition, the data does not include any personally identifiable information, such as birth year, gender, names, or credit card information.
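As an optional, more compact check, the same comparison can be done programmatically; a sketch assuming the 12 data frames loaded above:
#Gather the monthly data frames in a list and compare every set of column names with the first one.
monthly_dfs <- list(df2206, df2207, df2208, df2209, df2210, df2211, df2212, df2301, df2302, df2303, df2304, df2305)
all(sapply(monthly_dfs, function(df) identical(colnames(df), colnames(df2206)))) #TRUE when all files share the same columns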
str(df2206)
str(df2207)
str(df2208)
str(df2209)
str(df2210)
str(df2211)
str(df2212)
str(df2301)
str(df2302)
str(df2303)
str(df2304)
str(df2305)
###
# .. ride_id = col_character(),
# .. rideable_type = col_character(),
# .. started_at = col_datetime(format = ""),
# .. ended_at = col_datetime(format = ""),
# .. start_station_name = col_character(),
# .. start_station_id = col_character(),
# .. end_station_name = col_character(),
# .. end_station_id = col_character(),
# .. start_lat = col_double(),
# .. start_lng = col_double(),
# .. end_lat = col_double(),
# .. end_lng = col_double(),
# .. member_casual = col_character()
###
Accordingly, we confirmed that the data types of the columns are the same across all data sets. At the same time, NA values were found in some columns.
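The NA values can be located with a quick per-column count; a sketch, shown here for one month (the same call works for any of the data frames):
colSums(is.na(df2206)) #Number of NA values in each column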
To clean the data efficiently, for example by removing NA values and duplicates, we combine the 12 data sets into a single data frame as follows:
In this case, all files share the same column names, so we use the bind_rows() function instead of a combination of merge() and group_by().
all_trips<- bind_rows(df2206, df2207, df2208, df2209, df2210, df2211, df2212, df2301, df2302, df2303, df2304, df2305)
Check whether there are any duplicated rows.
sum(duplicated(all_trips)) # 0
Accordingly, there are zero duplicated rows in “all_trips”.
Then we drop the NA values as well.
all_trips <- all_trips %>%
distinct() %>%
drop_na()
The data can only be aggregated at the ride level, which is too granular. We add some additional columns, such as day, month, and year, that provide additional opportunities to aggregate the data.
all_trips$date<-as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month<-format(as.Date(all_trips$date),"%m")
all_trips$day<-format(as.Date(all_trips$date),"%d")
all_trips$year<-format(as.Date(all_trips$date),"%Y")
all_trips$day_of_week<-format(as.Date(all_trips$date),"%A")
all_trips$time <- format(all_trips$started_at, format = "%H:%M:%S")
We also add a calculated field for the length of each ride as the trip duration. We add “ride_length” in seconds to the entire data frame for consistency.
all_trips$ride_length<-difftime(all_trips$ended_at,all_trips$started_at)
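If an explicit unit is preferred, difftime() also accepts a units argument; a sketch that should give the same result here (the assumption being that seconds are the desired unit):
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs") #Elapsed time, explicitly in seconds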
Check the cleaned data with the following functions:
head(all_trips) #View the first 6 data rows
tail(all_trips) #View the last 6 data rows
nrow(all_trips) #Number of rows
#4494681
dim(all_trips) #dimension of data frame
# 4494681 13
n_unique(all_trips$member_casual) #Count the number of unique values.
# 2.
summary(all_trips) #summarise data
The vector type of each column is confirmed, and there is no duplicated data, according to the verification process in 3.3.2.
According to the results of summary() and head(), the class of “ride_length” must be transformed to numeric for calculation.
is.factor(all_trips$ride_length) #Check if it is the categorical data. [1] FALSE
all_trips$ride_length<-as.numeric(as.character(all_trips$ride_length)) #Transform data to numeric via character.
is.numeric(all_trips$ride_length) #Confirm if it is numeric data. [1] TRUE
Accordingly, the class of “ride_length” is now numeric.
Check the “ride_length” again.
summary(all_trips$ride_length)
We found negative values in the “ride_length” column. These values would confuse the analysis, and zero values carry no information, so we need to remove both.
First, we count the negative and zero values, then arrange “ride_length” in ascending order and inspect the results.
sum(all_trips$ride_length <=0) #Count numbers of negative and zero data.
#339 of the 4,494,681 rows are zero or negative (less than 0.01%).
#Inspect the other columns for the rows where "ride_length" is negative or zero.
all_trips %>%
arrange(ride_length)
There is no apparent pattern among the negative values in “ride_length”, so we drop the rows with negative or zero values using the subset() function.
#Filter out negative and zero values in the column of ride_length and create a new data frame.
all_trips1 <- subset(all_trips, ride_length > 0)
head(all_trips1)
sum(all_trips1$ride_length <=0) #Verify that no negative or zero values remain in the column "ride_length".
summary(all_trips1$ride_length)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.0 348.0 609.0 977.3 1090.0 1922127.0
Accordingly, we removed all negative, zero, and NA values from the “ride_length” column.
Through the steps in 3.1 to 3.4, we cleaned and manipulated the Divvy trip data from June 2022 through May 2023.
summary(all_trips1)
## ride_id rideable_type started_at
## Length:4494342 Length:4494342 Min. :2022-06-01 00:00:04.00
## Class :character Class :character 1st Qu.:2022-07-25 17:40:49.25
## Mode :character Mode :character Median :2022-09-21 14:53:43.50
## Mean :2022-10-26 20:47:49.33
## 3rd Qu.:2023-02-04 15:56:11.25
## Max. :2023-05-31 23:59:49.00
## ended_at start_station_name start_station_id
## Min. :2022-06-01 00:02:38.00 Length:4494342 Length:4494342
## 1st Qu.:2022-07-25 17:56:54.00 Class :character Class :character
## Median :2022-09-21 15:08:44.00 Mode :character Mode :character
## Mean :2022-10-26 21:04:06.61
## 3rd Qu.:2023-02-04 16:08:30.75
## Max. :2023-06-07 23:04:26.00
## end_station_name end_station_id start_lat start_lng
## Length:4494342 Length:4494342 Min. :41.65 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53
## end_lat end_lng member_casual date
## Min. : 0.00 Min. :-87.84 Length:4494342 Min. :2022-06-01
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-07-25
## Median :41.90 Median :-87.64 Mode :character Median :2022-09-21
## Mean :41.90 Mean :-87.65 Mean :2022-10-26
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2023-02-04
## Max. :42.06 Max. : 0.00 Max. :2023-05-31
## month day year day_of_week
## Length:4494342 Length:4494342 Length:4494342 Length:4494342
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## time ride_length
## Length:4494342 Min. : 1.0
## Class :character 1st Qu.: 348.0
## Mode :character Median : 609.0
## Mean : 977.3
## 3rd Qu.: 1090.0
## Max. :1922127.0
str(all_trips1)
## tibble [4,494,342 × 20] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:4494342] "B12AD6565494C368" "BAD4CB075003A605" "76DAD9FC95774B53" "47DE68ACCA138C13" ...
## $ rideable_type : chr [1:4494342] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:4494342], format: "2022-06-09 22:28:32" "2022-06-19 17:08:23" ...
## $ ended_at : POSIXct[1:4494342], format: "2022-06-09 22:52:17" "2022-06-19 17:08:25" ...
## $ start_station_name: chr [1:4494342] "California Ave & Milwaukee Ave" "California Ave & Milwaukee Ave" "Burnham Greenway & 105th St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:4494342] "13084" "13084" "20222" "637" ...
## $ end_station_name : chr [1:4494342] "California Ave & Milwaukee Ave" "California Ave & Milwaukee Ave" "Burnham Greenway & 105th St" "California Ave & Division St" ...
## $ end_station_id : chr [1:4494342] "13084" "13084" "20222" "13256" ...
## $ start_lat : num [1:4494342] 41.9 41.9 41.7 41.9 41.9 ...
## $ start_lng : num [1:4494342] -87.7 -87.7 -87.5 -87.7 -87.7 ...
## $ end_lat : num [1:4494342] 41.9 41.9 41.7 41.9 41.9 ...
## $ end_lng : num [1:4494342] -87.7 -87.7 -87.5 -87.7 -87.7 ...
## $ member_casual : chr [1:4494342] "casual" "casual" "casual" "casual" ...
## $ date : Date[1:4494342], format: "2022-06-09" "2022-06-19" ...
## $ month : chr [1:4494342] "06" "06" "06" "06" ...
## $ day : chr [1:4494342] "09" "19" "26" "27" ...
## $ year : chr [1:4494342] "2022" "2022" "2022" "2022" ...
## $ day_of_week : chr [1:4494342] "Thursday" "Sunday" "Sunday" "Monday" ...
## $ time : chr [1:4494342] "22:28:32" "17:08:23" "23:59:44" "11:40:53" ...
## $ ride_length : num [1:4494342] 1425 2 1542 563 2083 ...
In this section, we identify patterns, draw conclusions, and make predictions and recommendations in order to answer the following business questions:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
First, let’s look at the numbers and ratio of member and casual riders.
#Count numbers of member and casuals
table(all_trips1$member_casual)
#casual member
#1747757 2746585
#Calculate the ratio
member_casual_ratio <- all_trips1 %>%
group_by(member_casual) %>%
summarise(total = n()) %>% #Count the total rides for each rider type, "casual" and "member".
mutate(totals = sum(total)) %>% #Create a new column "totals" holding the grand total with sum().
group_by(member_casual) %>%
summarise(total_ratio = total / totals) %>% #Calculate each type's share of rides.
mutate(labels = scales::percent(total_ratio)) #Create a label column formatted as a percentage with scales::percent().
head(member_casual_ratio)
#Visualize as a pie chart.
member_casual_ratio %>%
ggplot(aes(x="",y=total_ratio, fill=member_casual)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start=0)+ #Convert the plot to polar coordinates
theme_minimal()+
theme(axis.title.x= element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
geom_text(aes(label = labels),
position = position_stack(vjust = 0.5, reverse = FALSE))+
labs(title="User distribution", fill = "User type")
Accordingly, members account for 61% of rides and casual riders for 39%. Since this is the most recent 12 months of data, the member share may already have grown as a result of earlier improvements.
We calculate the average, median, maximum, and minimum ride lengths.
mean(all_trips1$ride_length) #straight average(total ride length/rides)
## [1] 977.2823
median(all_trips1$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 609
max(all_trips1$ride_length) #longest ride
## [1] 1922127
min(all_trips1$ride_length) #shortest ride
## [1] 1
#Compare members and casual users
aggregate(all_trips1$ride_length~all_trips1$member_casual,FUN=mean)
## all_trips1$member_casual all_trips1$ride_length
## 1 casual 1363.2759
## 2 member 731.6599
aggregate(all_trips1$ride_length~all_trips1$member_casual,FUN=median)
## all_trips1$member_casual all_trips1$ride_length
## 1 casual 785
## 2 member 525
aggregate(all_trips1$ride_length~all_trips1$member_casual,FUN=max)
## all_trips1$member_casual all_trips1$ride_length
## 1 casual 1922127
## 2 member 89872
aggregate(all_trips1$ride_length~all_trips1$member_casual,FUN=min)
## all_trips1$member_casual all_trips1$ride_length
## 1 casual 1
## 2 member 1
See the average ride time by each month for members vs casual users.
aggregate(all_trips1$ride_length~all_trips1$member_casual+all_trips1$month,FUN=mean)
## all_trips1$member_casual all_trips1$month all_trips1$ride_length
## 1 casual 01 892.8232
## 2 member 01 600.2738
## 3 casual 02 1060.4027
## 4 member 02 625.3519
## 5 casual 03 1003.2793
## 6 member 03 610.2839
## 7 casual 04 1357.3867
## 8 member 04 693.3902
## 9 casual 05 1471.5850
## 10 member 05 761.6897
## 11 casual 06 1501.2135
## 12 member 06 821.0720
## 13 casual 07 1505.7013
## 14 member 07 810.2364
## 15 casual 08 1397.1418
## 16 member 08 786.4180
## 17 casual 09 1308.2871
## 18 member 09 757.3497
## 19 casual 10 1228.1667
## 20 member 10 700.5410
## 21 casual 11 1034.8040
## 22 member 11 649.6634
## 23 casual 12 890.5108
## 24 member 12 612.0680
See the average ride time by each day for members vs casual users.
aggregate(all_trips1$ride_length~all_trips1$member_casual+all_trips1$day_of_week,FUN=mean)
## all_trips1$member_casual all_trips1$day_of_week all_trips1$ride_length
## 1 casual Friday 1308.4576
## 2 member Friday 720.9742
## 3 casual Monday 1369.1382
## 4 member Monday 693.9649
## 5 casual Saturday 1526.4866
## 6 member Saturday 821.0278
## 7 casual Sunday 1570.1538
## 8 member Sunday 817.0922
## 9 casual Thursday 1210.4286
## 10 member Thursday 706.6981
## 11 casual Tuesday 1225.4594
## 12 member Tuesday 701.2078
## 13 casual Wednesday 1177.3245
## 14 member Wednesday 700.9543
Notice that the days of the week are out of order.
all_trips1$day_of_week<-ordered(all_trips1$day_of_week,levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
#Confirm
aggregate(all_trips1$ride_length~all_trips1$member_casual+all_trips1$day_of_week,FUN=mean)
## all_trips1$member_casual all_trips1$day_of_week all_trips1$ride_length
## 1 casual Sunday 1570.1538
## 2 member Sunday 817.0922
## 3 casual Monday 1369.1382
## 4 member Monday 693.9649
## 5 casual Tuesday 1225.4594
## 6 member Tuesday 701.2078
## 7 casual Wednesday 1177.3245
## 8 member Wednesday 700.9543
## 9 casual Thursday 1210.4286
## 10 member Thursday 706.6981
## 11 casual Friday 1308.4576
## 12 member Friday 720.9742
## 13 casual Saturday 1526.4866
## 14 member Saturday 821.0278
Analyze ridership data by rider type and month (case 1) and by rider type and weekday (case 2).
#creates month field using month()
all_trips1 %>%
mutate(month=month(started_at,label=TRUE))%>%
#groups by user type and month.
group_by(member_casual,month)%>%
#calculates the number of rides and average duration.
summarise(number_of_rides=n(), average_duration=mean(ride_length))%>%
#sorts data.
arrange(member_casual,month)
## # A tibble: 24 × 4
## # Groups: member_casual [2]
## member_casual month number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Jan 29618 893.
## 2 casual Feb 32774 1060.
## 3 casual Mar 46786 1003.
## 4 casual Apr 110526 1357.
## 5 casual May 177025 1472.
## 6 casual Jun 292053 1501.
## 7 casual Jul 311649 1506.
## 8 casual Aug 270074 1397.
## 9 casual Sep 220905 1308.
## 10 casual Oct 151312 1228.
## # ℹ 14 more rows
#creates weekday field using wday()
all_trips1 %>%
mutate(weekday = wday(started_at,label = TRUE))%>%
#groups by user type and weekday.
group_by(member_casual,weekday)%>%
#calculates the number of rides and average duration.
summarise(number_of_rides = n(), average_duration = mean(ride_length))%>%
#sorts data.
arrange(member_casual,weekday)
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 289531 1570.
## 2 casual Mon 194960 1369.
## 3 casual Tue 202311 1225.
## 4 casual Wed 218102 1177.
## 5 casual Thu 233749 1210.
## 6 casual Fri 258161 1308.
## 7 casual Sat 350943 1526.
## 8 member Sun 303812 817.
## 9 member Mon 374352 694.
## 10 member Tue 438773 701.
## 11 member Wed 456824 701.
## 12 member Thu 442091 707.
## 13 member Fri 387036 721.
## 14 member Sat 343697 821.
#First, convert the "time" column (H:M:S character strings) to a POSIXct vector.
all_trips1$time <- as.POSIXct(all_trips1$time, format = "%H:%M:%S")
#Extract the hour from the POSIXct object by hour() function
all_trips1 %>%
mutate(hour_day= hour(time)) %>%
#groups by user type and hour.
group_by(member_casual, hour_day)%>%
#calculates the number of rides and average duration.
summarise(number_of_rides = n(), average_duration = mean(ride_length))%>%
#sorts data.
arrange(member_casual,hour_day)
## # A tibble: 48 × 4
## # Groups: member_casual [2]
## member_casual hour_day number_of_rides average_duration
## <chr> <int> <int> <dbl>
## 1 casual 0 32053 1217.
## 2 casual 1 20813 1304.
## 3 casual 2 12246 1251.
## 4 casual 3 6763 1186.
## 5 casual 4 4515 1078.
## 6 casual 5 8707 923.
## 7 casual 6 23390 893.
## 8 casual 7 40346 877.
## 9 casual 8 55066 1001.
## 10 casual 9 56190 1350.
## # ℹ 38 more rows
all_trips1%>%
mutate(month=month(started_at,label=TRUE))%>%
group_by(member_casual,month)%>%
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,month)%>%
ggplot(mapping = aes(x=month,y=number_of_rides,fill=member_casual))+
geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Rider numbers through a year", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Month", y= "Numbers of riders")
Casual riders use the bike-share service more from May to October.
Casual riders use the service less from November to April.
Member riders use the service relatively more often from March to November.
Member riders use the service less from December to February.
During warm months, casual riders use the service more.
all_trips1%>%
mutate(month=month(started_at,label=TRUE))%>% #Use month() function for labeling readable variables.
group_by(member_casual,month)%>%
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,month)%>%
ggplot(aes(x=month,y=average_duration,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Average riding length through a year", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Month", y= "Riding length in seconds")
Casual users always ride longer than members.
Members’ ride time is relatively stable, at around 10 minutes (600 seconds).
all_trips1%>%
mutate(weekday=wday(started_at,label=TRUE))%>% #Create a new column "weekday" with the day name using the wday() function.
group_by(member_casual,weekday)%>% #Pick up two columns
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,weekday)%>%
ggplot(aes(x=weekday,y=number_of_rides,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Riding numbers through a week", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Weekday", y= "Riding numbers")
During the weekend, both rider types record a similar number of rides.
Casual riders use the service more often during the weekend, but less during weekdays.
Member riders use the service more frequently on weekdays than on the weekend.
Although casual riders do not ride as frequently as members on weekdays, around 200,000 rides are still recorded by casual riders on each day from Monday to Friday.
all_trips1%>%
mutate(weekday=wday(started_at,label=TRUE))%>%
group_by(member_casual,weekday)%>%
summarise(number_of_rides=n()
,average_duration=mean(ride_length))%>%
arrange(member_casual,weekday)%>%
ggplot(aes(x=weekday,y=average_duration,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Average riding length through a week", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Weekday", y= "Riding length in seconds")
Casual riders ride longer during the weekend, but they also ride for more than 15 minutes (900 seconds) on weekdays.
Member riders ride for around 10 minutes (600 seconds).
Among casual riders, some might use the bike-share service for commuting.
all_trips1%>%
mutate(hour_day = hour(time))%>% #Create a new column as "hour_day" and extract hour part by hour() function.
group_by(member_casual,hour_day)%>% #Pick up two columns
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,hour_day)%>%
ggplot(aes(x=hour_day,y=number_of_rides,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_x_continuous(breaks = seq(0, 23))+
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Riding numbers through a day", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Hour of day", y= "Riding numbers")
Casual riders use the bike-share service mainly in the late afternoon.
Member riders use the service at 7-8 in the morning and 15-19 in the afternoon and evening.
Casual riders also use the service in the morning, although at roughly 30% of the member volume.
all_trips1%>%
mutate(hour_day=hour(time))%>%
group_by(member_casual,hour_day)%>%
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,hour_day)%>%
ggplot(aes(x=hour_day,y=average_duration,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_x_continuous(breaks = seq(0, 23))+
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
labs(title = "Average riding length through a day", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Hour of day", y= "Riding length in seconds")
Casual riders ride longer than member riders throughout the day.
Casual riders especially use the service between 10:00 and 15:00.
Even around midnight, the service is used by both rider types.
all_trips1%>%
mutate(hour_day = hour(time))%>% #Create a new column as "hour_day" and extract hour part by hour() function.
mutate(weekday=wday(started_at,label=TRUE))%>%
group_by(member_casual,hour_day, weekday)%>% #Pick up two columns
summarise(number_of_rides=n(),average_duration=mean(ride_length))%>%
arrange(member_casual,hour_day, weekday)%>%
ggplot(aes(x=hour_day,y=number_of_rides,fill=member_casual))+geom_col(position="dodge")+
scale_y_continuous(labels = scales::number_format())+ #To display as numbers to avoid "1e+5".
scale_fill_manual(values = c("#66CDAA","#ffd480")) +
facet_grid(~weekday)+
labs(title = "Riding numbers through a day by day of week", subtitle = "Member vs Casual", caption = "Data collected by Divvy Data", x = "Hour of day", y= "Riding numbers", fill = "Rider type")
Member riders use the service mainly during the daytime on weekdays.
Casual riders use the service more than member riders on Saturday afternoons.
Although the number of casual riders is lower than that of members, a certain share of casual riders appear to use the service in the morning on weekdays.
Create a CSV file that others can visualize in Excel, Tableau, or other presentation software.
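A minimal export sketch, assuming we save the weekday summary used in the charts above (the file name is illustrative):
#Summarise rides by rider type and weekday, then write the result to a CSV file for use in Excel or Tableau.
counts_by_weekday <- all_trips1 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length))
write.csv(counts_by_weekday, file = "avg_ride_length_by_weekday.csv", row.names = FALSE)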
In this section, based on the business task and the analysis above, we describe our recommendations for a new Cyclistic marketing strategy.
To maximize the number of annual members of Cyclistic’s bike-share service, the following three recommendations will support a new marketing strategy.
Increase the number of stations in business, school, and shopping areas so that casual riders build a habit of using the service frequently.
Place digital promotional campaigns before April to convert casual riders into annual members; as the visualizations in 5.1 show, many casual riders use the service during the warm months.
Notify or suggest the annual plan to casual riders who use the service 10 days a month.
By implementing these recommendations, we hope Cyclistic achieves its business goal as soon as possible and contributes to an eco-friendly society.