How do annual members and casual members use Cyclistic bikes differently?
Identifying the main differences between members and casual riders can help us establish who Cyclistic customers are and provide key information that will help develope the business strategy and marketing campaign.
Design a marketing campaign to convert casual riders into annual members.
For this analysis, we have access to a primary data source downloaded from Cyclystic trip data. It is licensed under Motivate International Inc. This dataset consists of quantitative measurements collected from bike trackers, they do not gather personal data from users.
Spreadsheets where avoided because of the raw amount of data in this dataset. For this project, R Studio was the preferred tool used for this project. It has a wide range of tools available and can handle massive amounts of data with ease.
First we prepare the R Studio environment with the right tools by installing the required packages and libraries:
library(tidyverse) #helps wrangle data
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor) #helps clean data
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) #helps wrangle date attributes
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) #helps visualize data
library(dplyr) #Count distinct
library(scales) #For better numbering in plots
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Then we have to import the data to R Studio and check for consistency, we use ‘colnames’ to verify the that the column names match in each dataframe and ‘str’ to check for consistency in data type per column.
all_rides <- bind_rows(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)
colnames(all_rides)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
n_distinct(all_rides$ride_id) #Count distinct ride id, it should be the same number as rows.
## [1] 5733451
nrow(all_rides)
## [1] 5733451
dim(all_rides)
## [1] 5733451 13
head(all_rides)
## ride_id rideable_type started_at ended_at
## 1 46F8167220E4431F electric_bike 2021-12-07 15:06:07 2021-12-07 15:13:42
## 2 73A77762838B32FD electric_bike 2021-12-11 03:43:29 2021-12-11 04:10:23
## 3 4CF42452054F59C5 electric_bike 2021-12-15 23:10:28 2021-12-15 23:23:14
## 4 3278BA87BF698339 classic_bike 2021-12-26 16:16:10 2021-12-26 16:30:53
## 5 6FF54232576A3B73 electric_bike 2021-12-30 11:31:05 2021-12-30 11:51:21
## 6 93E8D79490E3AB11 classic_bike 2021-12-01 18:28:36 2021-12-01 18:38:03
## start_station_name start_station_id end_station_name
## 1 Laflin St & Cullerton St 13307 Morgan St & Polk St
## 2 LaSalle Dr & Huron St KP1705001026 Clarendon Ave & Leland Ave
## 3 Halsted St & North Branch St KA1504000117 Broadway & Barry Ave
## 4 Halsted St & North Branch St KA1504000117 LaSalle Dr & Huron St
## 5 Leavitt St & Chicago Ave 18058 Clark St & Drummond Pl
## 6 Wabash Ave & 16th St SL-012 Wells St & Polk St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 TA1307000130 41.85483 -87.66366 41.87197 -87.65097 member
## 2 TA1307000119 41.89441 -87.63233 41.96797 -87.65000 casual
## 3 13137 41.89936 -87.64852 41.93758 -87.64410 member
## 4 KP1705001026 41.89939 -87.64854 41.89488 -87.63233 member
## 5 TA1307000142 41.89558 -87.68202 41.93125 -87.64434 member
## 6 SL-011 41.86038 -87.62581 41.87260 -87.63350 member
str(all_rides)
## 'data.frame': 5733451 obs. of 13 variables:
## $ ride_id : chr "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
## $ ended_at : chr "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
## $ start_station_name: chr "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
summary(all_rides)
## ride_id rideable_type started_at ended_at
## Length:5733451 Length:5733451 Length:5733451 Length:5733451
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5733451 Length:5733451 Length:5733451 Length:5733451
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. : 0.00 Min. :-88.14
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.37 Max. : 0.00
## NA's :5874 NA's :5874
## member_casual
## Length:5733451
## Class :character
## Mode :character
##
##
##
##
n_distinct(all_rides$member_casual) #Count distinct values to check consistency.
## [1] 2
all_rides <- janitor::remove_empty(all_rides,which = c("cols")) #removes empty columns
all_rides <- janitor::remove_empty(all_rides,which = c("rows")) #removes empty rows
all_rides$started_at <- lubridate::ymd_hms(all_rides$started_at) #turns 'started_at' to ymd_hms
all_rides$ended_at <- lubridate::ymd_hms(all_rides$ended_at) #turns 'ended_at' to ymd_hms
all_rides$start_hour <- lubridate::hour(all_rides$started_at)
all_rides$end_hour <- lubridate::hour(all_rides$ended_at)
all_rides$day_of_week <- wday(all_rides$started_at)
all_rides$ride_length <- difftime(all_rides$ended_at,all_rides$started_at)
sum(all_rides$ride_length < 0, na.rm=TRUE) #counts number of rows < 0 in 'ride_length)
## [1] 100
all_rides_v2 <- all_rides[!(all_rides$ride_length<0),]
sum(all_rides_v2$ride_length < 0, na.rm=TRUE) #counts number of rows < 0 in 'ride_length)
## [1] 0
all_rides_v2 <- all_rides_v2 %>%
mutate(day_of_week = recode(day_of_week
,"1" = "Sunday"
,"2" = "Monday"
,"3" = "Tuesday"
,"4" = "Wednesday"
,"5" = "Thursday"
,"6" = "Friday"
,"7" = "Saturday"))
mean(all_rides_v2$ride_length) #straight average
## Time difference of 1165.316 secs
median(all_rides_v2$ride_length) #midpoint number
## Time difference of 618 secs
max(all_rides_v2$ride_length) #longest ride
## Time difference of 2483235 secs
min(all_rides_v2$ride_length) #shortest ride
## Time difference of 0 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = mean)
## all_rides_v2$member_casual all_rides_v2$ride_length
## 1 casual 1746.5364 secs
## 2 member 762.5375 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = median)
## all_rides_v2$member_casual all_rides_v2$ride_length
## 1 casual 783 secs
## 2 member 530 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = max)
## all_rides_v2$member_casual all_rides_v2$ride_length
## 1 casual 2483235 secs
## 2 member 93594 secs
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual, FUN = min)
## all_rides_v2$member_casual all_rides_v2$ride_length
## 1 casual 0 secs
## 2 member 0 secs
all_rides_v2$day_of_week <- ordered(all_rides_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_rides_v2$ride_length ~ all_rides_v2$member_casual + all_rides_v2$day_of_week, FUN = mean )
## all_rides_v2$member_casual all_rides_v2$day_of_week all_rides_v2$ride_length
## 1 casual Sunday 2045.0549 secs
## 2 member Sunday 842.8941 secs
## 3 casual Monday 1753.9351 secs
## 4 member Monday 736.4146 secs
## 5 casual Tuesday 1558.6146 secs
## 6 member Tuesday 728.8662 secs
## 7 casual Wednesday 1482.0886 secs
## 8 member Wednesday 723.9585 secs
## 9 casual Thursday 1534.3805 secs
## 10 member Thursday 737.9848 secs
## 11 casual Friday 1667.3108 secs
## 12 member Friday 750.0025 secs
## 13 casual Saturday 1951.9865 secs
## 14 member Saturday 849.0422 secs
all_rides_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
arrange(member_casual,weekday) %>%
ggplot(aes(x=weekday,y=number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")+
scale_y_continuous(labels = comma)+
ggtitle("Daily rides by user type and weekday:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_rides_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides=n(), average_duration = mean(ride_length)) %>%
arrange(member_casual,weekday) %>%
ggplot(aes(x=weekday,y=average_duration, fill = member_casual)) +
geom_col(position = "dodge")+
scale_y_continuous(labels = comma)+
ggtitle("Average ride duration by user type and weekday:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_rides_v2 %>%
group_by(member_casual, start_hour) %>%
summarise(number_of_rides=n()) %>%
ggplot(aes(x=start_hour,y=number_of_rides, color = member_casual)) +
geom_line()+
scale_y_continuous(labels = comma)+
ggtitle("Rides by time of day and user type:")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We first obtain the total amount of rides divided by annual member and casual rider, further subdivided by bicycle type (classic, electric and docked).
classic_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'classic_bike')
electric_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'electric_bike')
docked_member <- sum(all_rides_v2$member_casual == 'member' & all_rides_v2$rideable_type == 'docked_bike')
classic_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'classic_bike')
electric_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'electric_bike')
docked_casual <- sum(all_rides_v2$member_casual == 'casual' & all_rides_v2$rideable_type == 'docked_bike')
For example: does the bicycle type usage change over the week?
all_rides_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(rideable_type, weekday) %>%
summarise(number_of_rides=n()) %>%
ggplot(aes(x=weekday,y=number_of_rides, fill=rideable_type)) +
geom_col(position = "dodge")+
scale_y_continuous(labels = comma)+
ggtitle("Rides by weekday and bicycle type:")
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
Then we analyze for differences between annual members and casual riders with the types of bicycles they use:
all_rides_v2 %>%
group_by(rideable_type, member_casual, start_hour) %>%
summarise(number_of_rides=n()) %>%
ggplot(aes(x=start_hour,y=number_of_rides, color=member_casual, linetype=rideable_type)) +
geom_smooth()+
scale_y_continuous(labels = comma)+
ggtitle("Rides by time of day, user and bicycle type:")
## `summarise()` has grouped output by 'rideable_type', 'member_casual'. You can
## override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We interpreted the results from this analysis as an indicator that our casual riders use cyclistic for leisure, we also determined that our annual members mostly use it for commuting to work and back. We reccommend to further this investigation by surveying our annual members with a direct “yes or no” survey asking if they use cyclistic bikes for commuting to work. If that is not a possibility, then the marketing campaign should focus on inspiring casual riders to become annual members to use cyclistic for their commute to work. Reassuring their intent by offering a free trial, moneyback guarantee, or other tacticts.