This is my capstone project that I completed as part of the Google Data Analytics Professional Certificate.Through all stages of this project, I will act as a junior data analyst working for the marketing team at Cyclistic, a fictional bike-share company based out of Chicago. My task is to understand how each rider uses the platform and devise ways to attract more subscribing customers.
The steps of the data analysis process consist of five phase, those are ask, prepare, process, analyze, share and act. I will follow this guideline in my data analysis.
I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, I will attempt to find insights that will help my team design a new marketing strategy to convert casual riders into annual members.
In 2016, Cyclistic launched a successful bike-share oering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders ride, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
These three following questions will guide the future marketing program and needed to be answered:
I am using tidyverse packages, as well as some other packages that will need to be installed and admitted to the library every time the application is opened.
library(tidyverse)
library(stringr)
library(lubridate)
library(dplyr)
library(tidyr)
library(skimr)
library(readr)
2.1 Description of datasets
These datasets contain Cyclistic’s historical trip data in 2019 that can be download here. (Note: The datasets have a different name because Cyclistic is a fictional company). The data has been made available by Motivate Inc. under this license here. This is public data that can be used to explore how different customer types are using Cyclistic bikes. Note: The data privacy issues prohibit us from using riders’ personally identifiable information. This means that we wil not be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
2.2 Upload and assign quarters to dataIn this code snippet, I am (1) importing the values, (2) checking, and then (3) importing the variable names. Finally, we want to (4) merge all of these datasets into one.
q1 <- read.csv ("C:/Users/James/Desktop/Q1_2019.csv")
q2 <- read.csv ("C:/Users/James/Desktop/Q2_2019.csv")
q3 <- read.csv ("C:/Users/James/Desktop/Q3_2019.csv")
q4 <- read.csv ("C:/Users/James/Desktop/Q4_2019.csv")
#check column names
colnames(q1)
[1] "trip_id" "start_time" "end_time" "bikeid" "tripduration"
[6] "from_station_id" "from_station_name" "to_station_id" "to_station_name" "usertype"
[11] "gender" "birthyear"
colnames(q2)
[1] "X01...Rental.Details.Rental.ID" "X01...Rental.Details.Local.Start.Time"
[3] "X01...Rental.Details.Local.End.Time" "X01...Rental.Details.Bike.ID"
[5] "X01...Rental.Details.Duration.In.Seconds.Uncapped" "X03...Rental.Start.Station.ID"
[7] "X03...Rental.Start.Station.Name" "X02...Rental.End.Station.ID"
[9] "X02...Rental.End.Station.Name" "User.Type"
[11] "Member.Gender" "X05...Member.Details.Member.Birthday.Year"
colnames(q3)
[1] "trip_id" "start_time" "end_time" "bikeid" "tripduration"
[6] "from_station_id" "from_station_name" "to_station_id" "to_station_name" "usertype"
[11] "gender" "birthyear"
colnames(q4)
[1] "trip_id" "start_time" "end_time" "bikeid" "tripduration"
[6] "from_station_id" "from_station_name" "to_station_id" "to_station_name" "usertype"
[11] "gender" "birthyear"
#renaming q2 column names
colnames(q2) <- c("trip_id", "start_time", "end_time",
"bikeid", "tripduration", "from_station_id",
"from_station_name", "to_station_id", "to_station_name",
"usertype", "gender", "birthyear")
# merging the datasets
tripdata_2019 <- bind_rows(q1, q2, q3, q4)
2.3 Convert to correct distribution of data types
tripdata_2019 <- tripdata_2019 %>%
mutate(
trip_id = as.double(trip_id),
start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
end_time = as.POSIXct(end_time, format = "%Y-%m-%d %H:%M:%S"),
bikeid = as.double(bikeid),
tripduration = as.double(tripduration),
from_station_id = as.double(from_station_id),
from_station_name = as.character(from_station_name),
to_station_id = as.double(to_station_id),
to_station_name = as.character(to_station_name),
usertype = as.character(usertype),
gender = as.character(gender),
birthyear = as.double(birthyear)
)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `tripduration = as.double(tripduration)`.
Caused by warning:
! NAs introduced by coercion
glimpse(tripdata_2019)
Rows: 3,818,004
Columns: 12
$ trip_id <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 21742448, 21742449, 21742450, 2174245…
$ start_time <dttm> 2019-01-01 00:04:37, 2019-01-01 00:08:13, 2019-01-01 00:13:23, 2019-01-01 00:13:45, 20…
$ end_time <dttm> 2019-01-01 00:11:07, 2019-01-01 00:15:34, 2019-01-01 00:27:12, 2019-01-01 00:43:28, 20…
$ bikeid <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 3939, 6243, 6300, 3029, 84, 5019, …
$ tripduration <dbl> 390, 441, 829, NA, 364, 216, 177, 100, NA, 336, 886, 653, 601, 562, 906, 892, 407, NA, …
$ from_station_id <dbl> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 204, 90, 90, 289, 289, 152, 268, 35,…
$ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", "Racine Ave & 18th St", "California…
$ to_station_id <dbl> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 420, 255, 255, 324, 324, 166, 319, 3…
$ to_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Buren St (*)", "Western Ave & Fillmore …
$ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Su…
$ gender <chr> "Male", "Female", "Female", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "…
$ birthyear <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 1996, 1994, 1994, 1986, 1990, 198…
3.1 Cleaning column names
colnames(tripdata_2019) <- c("trip_id", "start_time", "end_time", "bike_id",
"trip_duration", "from_station_id", "from_station_name",
"to_station_id", "to_station_name", "user_type",
"gender", "birth_year")
3.2 Separating start_time column
tripdata_2019 <- tripdata_2019 %>%
separate(
start_time,
into = c("year", "month", "day"),
sep = "-",
remove = FALSE
) %>%
separate(
day,
into = c("day_of_week", "time_of_day"),
sep = " ",
remove = FALSE
)
3.3 Create the Quarter variable
tripdata_2019 <- tripdata_2019 %>%
mutate(quarter = quarter(start_time),
month = factor(month.name[as.numeric(month)], levels = month.name,
ordered = TRUE),
day_of_week = wday(as.Date(start_time),
label = TRUE, abbr = FALSE),
day = as.numeric(str_sub(day, 1, 2)),
time_of_day = hms(time_of_day),
)
3.4 Change Quarter variable to ordered factorial
tripdata_2019 <- tripdata_2019 %>%
mutate(
quarter = as.ordered(quarter))
glimpse(tripdata_2019)
Rows: 3,818,004
Columns: 18
$ trip_id <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 21742448, 21742449, 21742450, 2174245…
$ start_time <dttm> 2019-01-01 00:04:37, 2019-01-01 00:08:13, 2019-01-01 00:13:23, 2019-01-01 00:13:45, 20…
$ year <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019",…
$ month <ord> January, January, January, January, January, January, January, January, January, Januar…
$ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day_of_week <ord> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesda…
$ time_of_day <Period> 4M 37S, 8M 13S, 13M 23S, 13M 45S, 14M 52S, 15M 33S, 16M 6S, 18M 41S, 18M 43S, 19M 18…
$ end_time <dttm> 2019-01-01 00:11:07, 2019-01-01 00:15:34, 2019-01-01 00:27:12, 2019-01-01 00:43:28, 20…
$ bike_id <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 3939, 6243, 6300, 3029, 84, 5019, …
$ trip_duration <dbl> 390, 441, 829, NA, 364, 216, 177, 100, NA, 336, 886, 653, 601, 562, 906, 892, 407, NA, …
$ from_station_id <dbl> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 204, 90, 90, 289, 289, 152, 268, 35,…
$ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", "Racine Ave & 18th St", "California…
$ to_station_id <dbl> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 420, 255, 255, 324, 324, 166, 319, 3…
$ to_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Buren St (*)", "Western Ave & Fillmore …
$ user_type <chr> "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Subscriber", "Su…
$ gender <chr> "Male", "Female", "Female", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "…
$ birth_year <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 1996, 1994, 1994, 1986, 1990, 198…
$ quarter <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
3.5 Fix the trip duration variable and add stage of life variable
# Step 1: Mutate trip_duration
tripdata_2019 <- tripdata_2019 %>%
mutate(trip_duration = difftime(end_time, start_time))
# Step 2: Create stage_of_life
tripdata_2019 <- tripdata_2019 %>%
mutate(
stage_of_life = cut(
birth_year,
breaks = c(-Inf, 1960, 1980, 2000, 2006, Inf),
labels = c("senior adult", "middle age adult", "adult","teen","child"),
ordered_result = TRUE
)
)
3.6 Correct variable types to correct format
tripdata_2019 <- tripdata_2019 %>%
mutate(
trip_id = as.numeric(trip_id),
year = as.character(year),
quarter = as.factor(quarter), # Change to character
month = as.factor(month),
day = as.numeric(day),
day_of_week = as.factor(day_of_week),
time_of_day = as.period(hms(time_of_day), units = "mins"),
start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
end_time = as.POSIXct(end_time, format = "%Y-%m-%d %H:%M:%S"),
bike_id = as.numeric(bike_id),
trip_duration = as.difftime(trip_duration, units = "mins"),
from_station_id = as.numeric(from_station_id),
from_station_name = as.character(from_station_name),
to_station_id = as.character(to_station_id),
to_station_name = as.character(to_station_name),
user_type = as.character(user_type),
gender = as.character(gender),
birth_year = as.numeric(birth_year),
stage_of_life = as.factor(stage_of_life)
)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `time_of_day = as.period(hms(time_of_day), units = "mins")`.
Caused by warning in `.parse_hms()`:
! Some strings failed to parse, or all strings are NAs
3.7 Get rid of any trip duration that are less than 0 minutes
tripdata_2019 <- tripdata_2019 %>%
filter(trip_duration > 0)
3.8 Trimming down unnecessary columns for visualization preparation
tripdata_2019 <- tripdata_2019%>%
select(trip_id, year, quarter, month:time_of_day, start_time, end_time, bike_id:stage_of_life)
3.9 Skim the variables to make sure they are all appropriately categorized
skim(tripdata_2019, -time_of_day)
── Data Summary ────────────────────────
Values
Name tripdata_2019
Number of rows 3817991
Number of columns 19
_______________________
Column type frequency:
character 6
difftime 1
factor 4
numeric 5
POSIXct 2
________________________
Group variables None
3.10 Count all customers with a birth year before 1919
tripdata_2019 %>%
filter(birth_year < 1919) %>%
count()
3.11 Replace any birth year before 1919 with NA
tripdata_2019 <- tripdata_2019 %>%
mutate(birth_year = replace(birth_year, birth_year < 1919, NA))
3.12 Arranging the earliest birth years at the top.
tripdata_2019 %>%
arrange(birth_year) %>%
head()
4.1 Summarize trip duration data
tripdata_2019 %>%
summarize(
mean_trip_duration = mean(trip_duration),
median_trip_duration = median(trip_duration),
min_trip_duration = min(trip_duration),
max_trip_duration = max(trip_duration)
)
4.2 Summary of trip data by user type
tripdata_2019 %>%
group_by(user_type) %>%
summarize(
mean_trip_duration = mean(trip_duration),
median_trip_duration = median(trip_duration),
min_trip_duration = min(trip_duration),
max_trip_duration = max(trip_duration),
total_trips = n(),
) %>%
arrange(-total_trips)
4.3 Summary of rides for day of the week comparing user type
tripdata_2019 %>%
group_by(day_of_week, user_type) %>%
summarize(
mean_trip_duration = mean(trip_duration),
median_trip_duration = median(trip_duration),
max_trip_duration = max(trip_duration),
total_trips = n(),
) %>%
arrange(desc(total_trips))
`summarise()` has grouped output by 'day_of_week'. You can override using the `.groups` argument.
4.4 Analyzing day of the week ridership
smry_type_weekday <- tripdata_2019 %>%
group_by(user_type, day_of_week) %>%
summarize(
number_of_trips = n(),
average_duration = mean(trip_duration),
median_duration = median(trip_duration),
max_trip_duration = max(trip_duration)
) %>%
arrange(user_type, day_of_week)
`summarise()` has grouped output by 'user_type'. You can override using the `.groups` argument.
smry_type_weekday
4.5 Visualizing the summary data from above
smry_type_weekday %>%
ggplot(aes(day_of_week, number_of_trips, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Daily number of trips in 2019",
subtitle = "Based on user type: customer and subscriber",
x = "Day of week",
y = "Number of trips",
fill = "User type"
)
4.6 Visualizing average trip duration
smry_type_weekday %>%
ggplot(aes(day_of_week, average_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Daily average of trip duration in 2019",
subtitle = "Based on user type: customer and subscriber",
x = "Day of week",
y = "Average duration (min)",
fill = "User type"
)
4.7 Visualizing median trip duration by user type
smry_type_weekday %>%
ggplot(aes(day_of_week, median_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Daily median of trip duration in 2019",
subtitle = "Based on user type: customer and subscriber",
x = "Day of week",
y = "Median of trip duration (min)",
fill = "User type"
)
4.8 Analyzing ridership by user type each month
smry_type_month <- tripdata_2019 %>%
group_by(user_type, month) %>%
summarize(
number_of_trips = n(),
average_duration = mean(trip_duration),
median_duration = median(trip_duration)
) %>%
arrange(user_type)
`summarise()` has grouped output by 'user_type'. You can override using the `.groups` argument.
smry_type_month
4.9 Visualizing monthly number of trips by user type
smry_type_month %>%
ggplot(aes(month, number_of_trips, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "The monthly number of trips based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Month",
y = "Number of trips",
fill = "User type"
) +
coord_flip()
4.10 Visualizing the monthly median trip duration by user type
smry_type_month %>%
ggplot(aes(month, median_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "The monthly median duration based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Month",
y = "Median duration (min)",
fill = "User type"
) +
coord_flip()
4.11 Visualizing average trip duration through months comparing user type
smry_type_month %>%
ggplot(aes(month, average_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "The monthly average duration based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Month",
y = "Average duration (min)",
fill = "User type"
) +
coord_flip()
4.12 Detecting outliers with IQR method
# creating a quantile variable
quantile <- quantile(
tripdata_2019$trip_duration,
probs = c(.25, .75),
na.rm = FALSE
)
# looking for outliers
tripdata_2019 %>%
group_by(month) %>%
summarize(
iqr = IQR(trip_duration),
up_outliers = quantile[2] + 1.5 * iqr,
low_outliers = quantile[1] - 1.5 * iqr
) %>%
select(month, low_outliers, iqr, up_outliers)
# creating outliers variables for filtering purpose
iqr <- IQR(tripdata_2019$trip_duration)
up_outliers <- quantile[2] + 1.5 * iqr
low_outliers <- quantile[1] - 1.5 * iqr
4.13 This code will inform us how much of each month was removed
tripdata_2019 %>%
group_by(month) %>%
summarize(
total_n = n(),
outliers = sum(trip_duration > up_outliers),
percent = outliers/total_n
) %>%
arrange(desc(percent))
4.14 This code will remove the values we have identified as outliers
no_outliers_tripdata <- tripdata_2019 %>%
filter(trip_duration < up_outliers)
4.15 Visualizing monthly average without outliers
no_outliers_tripdata %>%
group_by(user_type, month) %>%
summarize(
number_of_trips = n(),
average_duration = mean(trip_duration),
median_duration = median(trip_duration)
) %>%
arrange(user_type) %>%
ggplot(aes(month, average_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Monthly average duration based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Month",
y = "Average duration (min)",
fill = "User type"
) +
coord_flip()
`summarise()` has grouped output by 'user_type'. You can override using the `.groups` argument.
4.16 User type vs. quarter
# creating summary table
smry_type_quarter <- no_outliers_tripdata %>%
group_by(user_type, quarter) %>%
summarize(
number_of_trips = n(),
average_duration = mean(trip_duration),
median_duration = median(trip_duration)
) %>%
arrange(user_type, quarter)
smry_type_quarter
# number of trips per quarter
ggplot(smry_type_quarter, aes(quarter, number_of_trips, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Quarterly total trips based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Quarter",
y = "Total trips",
fill = "User type"
)
# Average duration per quarter
ggplot(smry_type_quarter, aes(quarter, average_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Quarterly average duration based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Quarter",
y = "Average duration (min)",
fill = "User type"
)
ggplot(smry_type_quarter, aes(quarter, median_duration, fill = user_type)) +
geom_col(position = "dodge") +
labs(
title = "Quarterly median duration based on user type",
subtitle = "Based on user type: customer and subscriber",
x = "Quarter",
y = "Median duration (min)",
fill = "User type"
)