This capstone project is the final project in the Google Data Analytics Professional Certificate course. I was given the opportunity to use the skills and methods I had been taught by working on a case study project that would summarize my relevant knowledge, following the steps of the data analysis process: ask, prepare, process, analyze, share and act.
You are a junior data analyst working in the marketing analyst team at Cyclistic. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Cyclistic is a bike-share company based in Chicago that launched in 2016. They offer 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. Cyclistic offers 3 different plans, single-ride passes, full-day passes, and annual memberships. Riders who have an annual subscription are called members while riders who are single-ride or full-day pass users are considered casual riders. The main fleet consist of classic bikes, electric bikes and docked bikes.
Business Task To analyze user behaviors on how annual members and casual riders use Cyclistic bikes differently and make recommendations on how to convert casual riders into annual members.
The data has been made available by Motivate International Inc., the company which operates the City of Chicago’s Divvy bicycle sharing service. The license to use this public dataset can be found here. Datasets are available here. For this analysis I’m going to focus on 12 months period from January 2021 to December 2021.
#Remove data from the environment
rm(list=ls())
#Read library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(tidyr)
#Import data set, 12 months from January to December 2021
ds1 <- read.csv("202101-divvy-tripdata.csv")
ds2 <- read.csv("202102-divvy-tripdata.csv")
ds3 <- read.csv("202103-divvy-tripdata.csv")
ds4 <- read.csv("202104-divvy-tripdata.csv")
ds5 <- read.csv("202105-divvy-tripdata.csv")
ds6 <- read.csv("202106-divvy-tripdata.csv")
ds7 <- read.csv("202107-divvy-tripdata.csv")
ds8 <- read.csv("202108-divvy-tripdata.csv")
ds9 <- read.csv("202109-divvy-tripdata.csv")
ds10 <- read.csv("202110-divvy-tripdata.csv")
ds11 <- read.csv("202111-divvy-tripdata.csv")
ds12 <- read.csv("202112-divvy-tripdata.csv")
#Check the columns consistency so it can be combined
colnames(ds1)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds2)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds3)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds4)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds5)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds6)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds7)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds8)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds9)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds10)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds11)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(ds12)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
#Combine 12 files into one data set, get to know the data set
all_trips <- rbind(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)
head(all_trips)
## ride_id rideable_type started_at ended_at
## 1 E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44
## 2 DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12
## 3 EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14
## 4 4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55
## 5 BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45
## 6 5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54
## start_station_name start_station_id end_station_name end_station_id
## 1 California Ave & Cortez St 17660
## 2 California Ave & Cortez St 17660
## 3 California Ave & Cortez St 17660
## 4 California Ave & Cortez St 17660
## 5 California Ave & Cortez St 17660
## 6 California Ave & Cortez St 17660
## start_lat start_lng end_lat end_lng member_casual
## 1 41.90034 -87.69674 41.89 -87.72 member
## 2 41.90033 -87.69671 41.90 -87.69 member
## 3 41.90031 -87.69664 41.90 -87.70 member
## 4 41.90040 -87.69666 41.92 -87.69 member
## 5 41.90033 -87.69670 41.90 -87.70 casual
## 6 41.90041 -87.69676 41.94 -87.71 casual
#Remove monthly data to clear environment
remove(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)
#Convert started_at and ended_at char to datetime
all_trips[['started_at']] <- ymd_hms(all_trips[['started_at']])
all_trips[['ended_at']] <- ymd_hms(all_trips[['ended_at']])
str(all_trips)
## 'data.frame': 5595063 obs. of 13 variables:
## $ ride_id : chr "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct, format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
#New columns were add for days of the week, month, time and duration in minutes
all_trips$week <- format(as.Date(all_trips$started_at),'%a')
all_trips$month <- format(as.Date(all_trips$started_at),'%b_%y')
all_trips$time <- format(all_trips$started_at, format = "%H:%M")
all_trips$time <- as.POSIXct(all_trips$time, format = "%H:%M")
all_trips$trip_duration <- (as.double(difftime(all_trips$ended_at, all_trips$started_at)))/60
glimpse(all_trips)
## Rows: 5,595,063
## Columns: 17
## $ ride_id <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-01-23 16:14:19, 2021-01-27 18:43:08, 2021-01-…
## $ ended_at <dttm> 2021-01-23 16:24:44, 2021-01-27 18:47:12, 2021-01-…
## $ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
## $ start_station_id <chr> "17660", "17660", "17660", "17660", "17660", "17660…
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "", "Wood St & Augu…
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "", "657", "13258",…
## $ start_lat <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
## $ start_lng <dbl> -87.69674, -87.69671, -87.69664, -87.69666, -87.696…
## $ end_lat <dbl> 41.89000, 41.90000, 41.90000, 41.92000, 41.90000, 4…
## $ end_lng <dbl> -87.72000, -87.69000, -87.70000, -87.69000, -87.700…
## $ member_casual <chr> "member", "member", "member", "member", "casual", "…
## $ week <chr> "Sat", "Wed", "Thu", "Thu", "Sat", "Sat", "Mon", "T…
## $ month <chr> "Jan_21", "Jan_21", "Jan_21", "Jan_21", "Jan_21", "…
## $ time <dttm> 2022-09-20 16:14:00, 2022-09-20 18:43:00, 2022-09-…
## $ trip_duration <dbl> 10.4166667, 4.0666667, 1.3333333, 11.7000000, 0.716…
R programming language will be used for this analysis. It can perform a complete cleaning, analysis, visualizations and documentation on the same platform. I found null data on start and end station name columns, trips duration less than 0, NA’s as well. which consist in 9% of the total data set. For missing data since the data set is huge, I decided ignored it. The total of 4 columns and 1.006.877 rows were removed from the new data frame.
#Remove NA
all_trips_clean <- drop_na(all_trips)
dim(all_trips_clean)
## [1] 5590292 17
#Remove usefulness columns
all_trips_clean <- all_trips_clean %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
dim(all_trips_clean)
## [1] 5590292 13
#Check for trips duration less then 0
nrow(subset(all_trips_clean,trip_duration < 0))
## [1] 146
#Remove trips duration less then 0
all_trips_clean <- all_trips_clean[!(all_trips_clean$trip_duration < 0),]
dim(all_trips_clean)
## [1] 5590146 13
#Remove start_station_name and end_station_name blank results
all_trips_clean <- all_trips_clean %>%
filter(!(is.na(start_station_name) | start_station_name == "")) %>%
filter(!(is.na(end_station_name) | end_station_name == ""))
dim(all_trips_clean)
## [1] 4588186 13
In order to facilitate our analysis and arrive at more insightful conclusions, all the required information are now in a complete data frame ready to identify trends, relationships and farther explorations.
#Total rides by members type
all_trips_clean %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id))
## # A tibble: 2 × 2
## member_casual ride_count
## <chr> <int>
## 1 casual 2048335
## 2 member 2539851
#Rides duration by members type
setNames(aggregate(trip_duration ~ member_casual, all_trips_clean, sum),
c("customer_type", "trip_duration_mins"))
## customer_type trip_duration_mins
## 1 casual 66590608
## 2 member 33484772
#Total rides by bike type
table(all_trips_clean$rideable_type)
##
## classic_bike docked_bike electric_bike
## 3241904 312044 1034238
#Statistical summary of rides duration by customer type
all_trips_clean %>%
group_by(member_casual) %>%
summarise(mean_trip_duration = mean(trip_duration), median_trip_duration = median(trip_duration),
max_trip_duration = max(trip_duration), min_trip_duration = min(trip_duration))
## # A tibble: 2 × 5
## member_casual mean_trip_duration median_trip_duration max_trip_durat…¹ min_t…²
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 casual 32.5 16.6 55944. 0
## 2 member 13.2 9.72 1496. 0
## # … with abbreviated variable names ¹max_trip_duration, ²min_trip_duration
#Corrected order for days of the week and months
all_trips_clean$week <- ordered(all_trips_clean$week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
all_trips_clean$month <- ordered(all_trips_clean$month, levels=c("Jan_21", "Feb_21", "Mar_21", "Apr_21", "May_21", "Jun_21", "Jul_21", "Aug_21", "Sep_21", "Oct_21", "Nov_21", "Dec_21"))
To convert casual riders into annual members, the following marketing recommendations for Cyclistic bike share can be implemented:
Frequent rider program that get points for each ride then redeem into discount on sign up for annual membership, get points for feedback on the purpose of their ride also would offer further insights and could encourage customers to get membership. Offer targeted memberships deal to casual riders, sending discount coupon, notification, email reminders showing the price benefits of annual memberships which might influence them to become members. Considerer introducing a monthly pass, the more costumer use the monthly pass, more they feel comfortable getting an annual subscription.
I was presented with the task of analyzing historical data to determine how casual riders and annual members use Cyclistic bikes service differently. Due to data-privacy some limitations were found, rides identification weren’t allowed to be connect to personal information, trips weren’t grouped by users so data couldn’t provide qualitative information. I discovered some differences between casual riders and annual members but was unable to explain why they are different, more data collection are required to conduct a profound analysis and find the root cause of this question.
RDocumentation, RStudio, Github and Kaggle community.