Goal of the Marketing Director: Design marketing strategies aimed at converting casual riders into annual members
Three questions will guide the future marketing program:
This Case Study focuses on the first question.
“Analyzing the difference in usage patterns of casual riders and annual members with the aim to convert casual riders into annual members”
The data is on an AWS server where it is easily downloadable and named correctly. I have downloaded the previous 12 months data and stored it locally for the next steps in this process. It is organized by year and Fiscal Quarters. The data is reliable and original since it comes from the company.This Data is ROCCC.
Starting with the Process and installing the packages and libraries required for this job
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(readr)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Because of limited RAM Available on RStudio, here we are using only 6 Datasets, because adding more datasets results in RStudio Crash.
trip_jan <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_feb <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_mar <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_apr <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_may <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_june <- read_csv("D:/Manjeet/RStudio/Cyclistic_Bike_Data/202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip2021 <- rbind(trip_jan,trip_feb,trip_mar,trip_apr,trip_may,trip_june)
str(trip2021)
## spec_tbl_df [1,973,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:1973410] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:1973410] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:1973410], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:1973410], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:1973410] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:1973410] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:1973410] NA NA NA NA ...
## $ end_station_id : chr [1:1973410] NA NA NA NA ...
## $ start_lat : num [1:1973410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:1973410] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:1973410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:1973410] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:1973410] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
When we saw the structure of the file, we found that start and end date were in character format. We need to change that to date and time format.
trip2021$started_at = strptime(trip2021$started_at,"%Y-%m-%d %H:%M:%S")
trip2021$ended_at = strptime(trip2021$ended_at,"%Y-%m-%d %H:%M:%S")
str(trip2021)
## spec_tbl_df [1,973,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:1973410] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:1973410] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXlt[1:1973410], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXlt[1:1973410], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:1973410] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:1973410] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:1973410] NA NA NA NA ...
## $ end_station_id : chr [1:1973410] NA NA NA NA ...
## $ start_lat : num [1:1973410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:1973410] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:1973410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:1973410] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:1973410] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
trip2021<-mutate(trip2021,tripduration=difftime(ended_at,started_at, units = "secs"))
head(trip2021)
## # A tibble: 6 × 14
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 E19E6F1B8D4C4… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660
## 2 DC88F20C2C55F… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660
## 3 EC45C94683FE3… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660
## 4 4FA453A75AE37… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660
## 5 BE5E8EB4E7263… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660
## 6 5D8969F88C773… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660
## # … with 8 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, tripduration <drtn>, and abbreviated variable names
## # ¹rideable_type, ²start_station_name, ³start_station_id
## # ℹ Use `colnames()` to see all variable names
Here we’re filtering out ‘tripduaration’ greater than zero because is after Viewing Trip data we found a lot of values less than zero
trip2021 <- filter(trip2021,trip2021$tripduration>0)
Taking a Glimpse at data
glimpse(trip2021)
## Rows: 1,973,230
## Columns: 14
## $ ride_id <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
## $ ended_at <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
## $ start_station_id <chr> "17660", "17660", "17660", "17660", "17660", "17660…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Wood St & Augu…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "657", "13258",…
## $ start_lat <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
## $ start_lng <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696…
## $ end_lat <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4…
## $ end_lng <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
## $ tripduration <drtn> 625 secs, 244 secs, 80 secs, 702 secs, 43 secs, 32…
Removing values that are missing/blank
trip2021<-trip2021 %>%
na.omit()
Adding the individual columns for date, day, month, year, day of the week to ease the in-depth analysis
trip2021$date<-as.Date(trip2021$started_at)
trip2021$month<-format(as.Date(trip2021$started_at),"%m")
trip2021$day<-format(as.Date(trip2021$started_at),"%d")
trip2021$year<-format(as.Date(trip2021$started_at),"%y")
trip2021$day_of_week<-format(as.Date(trip2021$started_at),"%A")
The Data has been Prepared and Processed now ready for Descriptive Analysis
Here will be performing calculations on the cleaned, consistent data and identification of trends, patterns and relationships as well as statistical analysis by calculating mean, median, maximum and minimum on ride length column for both casual riders and members
trip2021 %>%
group_by(member_casual) %>% summarise(average_trip_duration=mean(tripduration),median_duration=median(tripduration),max_trip_duration=max(tripduration),min_trip_duration=min(tripduration))
## # A tibble: 2 × 5
## member_casual average_trip_duration median_duration max_trip_duration min_tr…¹
## <chr> <drtn> <drtn> <drtn> <drtn>
## 1 casual 2321.3509 secs 1116 secs 3356649 secs 1 secs
## 2 member 841.1266 secs 621 secs 89738 secs 1 secs
## # … with abbreviated variable name ¹min_trip_duration
trip2021 %>%
group_by(member_casual) %>%
summarise(ride_count=length(ride_id))
## # A tibble: 2 × 2
## member_casual ride_count
## <chr> <int>
## 1 casual 740369
## 2 member 949760
trip2021 %>%
group_by(member_casual,day_of_week) %>%
summarise(number_of_rides=n(),average_trip_duration=mean(tripduration),.groups = "drop")
## # A tibble: 14 × 4
## member_casual day_of_week number_of_rides average_trip_duration
## <chr> <chr> <int> <drtn>
## 1 casual Friday 98136 2260.3380 secs
## 2 casual Monday 84057 2187.2885 secs
## 3 casual Saturday 171318 2469.5763 secs
## 4 casual Sunday 155323 2657.3575 secs
## 5 casual Thursday 69671 1964.1777 secs
## 6 casual Tuesday 83191 2124.9971 secs
## 7 casual Wednesday 78673 2078.4810 secs
## 8 member Friday 134603 805.0850 secs
## 9 member Monday 131540 810.5589 secs
## 10 member Saturday 139237 934.6932 secs
## 11 member Sunday 124329 966.5500 secs
## 12 member Thursday 127236 779.0815 secs
## 13 member Tuesday 144874 806.9532 secs
## 14 member Wednesday 147941 794.4575 secs
breaks<-hour((hm("00:00", "6:00", "12:00", "18:00", "23:59")))
labels <- c("Night", "Morning", "Afternoon", "Evening")
trip2021$time_of_the_trip <-cut(x=hour(trip2021$started_at), breaks = breaks, labels = labels, include.lowest = "true")
trip2021 %>%
group_by(member_casual,time_of_the_trip) %>%
summarise(number_of_rides=n(),average_trip_duration=mean(tripduration),.groups = "drop")
## # A tibble: 8 × 4
## member_casual time_of_the_trip number_of_rides average_trip_duration
## <chr> <fct> <int> <drtn>
## 1 casual Night 46680 2760.9889 secs
## 2 casual Morning 162059 2100.8855 secs
## 3 casual Afternoon 367788 2303.0961 secs
## 4 casual Evening 163842 2455.1382 secs
## 5 member Night 55862 761.2808 secs
## 6 member Morning 280562 800.9260 secs
## 7 member Afternoon 450785 877.1041 secs
## 8 member Evening 162551 838.1800 secs
Now that we have finished creating visualizations, its time to act on our findings and proposing the Top 3 Recommendations based on our analysis.
Weekend Membership: As we found that most of the casual riders prefer riding on weekends more, thus a weekend membership can attract new casual riders as well as the existing ones and also the weekend membership benefits can be used to influence them for extended memberships.
Marketing and Promotional Campaigns: The busiest time of the year for Cyclistic is in the 3rd quarter of the year when rides are on its peak for both type of riders which is the best time for promotional activities and campaigns. Those can be conducted nearby riding hotspots. Classic bikes are used the most thus offerings can be created for those.
Discounts and Riding Competitions: Cyclistic can organize bike riding competitions with exciting prizes and can offer discounted yearly memberships to the participants.
Additional data like pricing details etc. could be used to expand our findings and scope of analysis but the provided data is sufficient to conclude our findings and accomplish the business task.