This Rmarkdown document reports my approach and thought process in solving the ‘Cyclistic Case Study’ from the Google Data Analytics Certificate. I will be following the six phases of data analysis that i have learned in the programme: Ask, Prepare, Process, Analyse, Share and Act to answer the main objective “Design marketing strategies aimed at converting casual riders into annual members.”
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles and a network of 692 stations across Chicago. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. As a junior data analyst, you are assigned to answer the question “How do annual members and casual riders use Cyclistic bikes differently?”
Obejctive: Design marketing strategies aimed at converting casual riders into annual members.
Goal: How do annual members and casual riders use Cyclistic bikes differently?
Key Stakeholders:
The previous 12 months of trip data was downloaded (Q1 of 2020 - Q2 of 2019) and stored as .csv files. The data was then assessed to see if there was any issues with bias or credibility.
Step 1: Install and load packages.
library(tidyverse) # Helps wrangle data
library(lubridate) # Helps wrangle data attributes
library(ggplot2) # Helps visualize data
Step 2: Collect Data
Loaded the datasets (.csv) files into Rstudio.
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
Step 3: Wrangle data and combine into a single file.
The column names were compared in each file. The names don’t need to be in the same order, but they do need to match perfectly before merging into one file
colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)
The column names aren’t consistent between all of the files. The columns should then be renamed in accordance with q1_2020 as this will be the supposed going-forward table design for Cyclistic.
(q4_2019 <- rename(q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
(q3_2019 <- rename(q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
(q2_2019 <- rename(q2_2019
,ride_id = "01 - Rental Details Rental ID"
,rideable_type = "01 - Rental Details Bike ID"
,started_at = "01 - Rental Details Local Start Time"
,ended_at = "01 - Rental Details Local End Time"
,start_station_name = "03 - Rental Start Station Name"
,start_station_id = "03 - Rental Start Station ID"
,end_station_name = "02 - Rental End Station Name"
,end_station_id = "02 - Rental End Station ID"
,member_casual = "User Type"))
The dataframes are then inspected to look for incongruencies
str(q1_2020)
str(q4_2019)
str(q3_2019)
str(q2_2019)
‘ride_id’ and ‘rideable_type’ must be converted to character in order for them to stack properly
q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
The individual quarters where then stacked into one big data frame
all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)
The lat, long, birthyear, and gender fields were removed as this data was dropped beginning in 2020
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
Step 1: Inspect the new table that has been created.
colnames(all_trips) #List of column names shown in table below
nrow(all_trips) #3879822 rows in data frame
dim(all_trips) #Dimensions of the data frame are 3879822 x 9
head(all_trips) #See the first 6 rows of data frame.
str(all_trips) #See list of columns and data types (numeric, character, etc)
summary(all_trips) #Statistical summary of data. Mainly for numerics
| Column Name | Format | Description |
|---|---|---|
| ride_id | character | unique ID fro each ride |
| started_at | POSIXct | start time of the ride |
| ended_at | POSIXct | end time of each ride |
| rideable_type | character | displays bike ID for data before Q1 2020 or bike type 2020 Q1 |
| start_station_id | number | id of the starting station |
| start_station_name | character | name of the starting station |
| end_station_id | number | id for the end station |
| end_station_name | character | name of the end station |
| member_casual | character | type of member |
Step 2: Fix problems identified in the new table
In the ‘member_casual’ column ‘subscriber’ will be replaces with ‘member’ and ‘customer’ with ‘casual’. First, the ‘members_casual’ column was filtered to see how many observations fall under each user type.
table(all_trips$member_casual)
| casual | customer | member | subscriber |
|---|---|---|---|
| 48480 | 857474 | 378407 | 2595461 |
The different values were then reassigned to the desired ones.
all_trips <- all_trips %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
The ‘member_casual’ column was then checked to make sure the correct amount of observations were reassigned.
table(all_trips$member_casual)
| casual | member |
|---|---|
| 905954 | 2973868 |
Columns were than added that list the date; month, day, and year of each ride as this will allow the ride data for each month, day, or year … to be aggregated. Before completing these operations only the ride level could be aggregated.
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
A ‘ride_length’ column was then added to calculate the ride length in seconds
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
The structure of the columns was then investigated and it was seen that ‘ride_length’ needed to be converted from factor to numeric so calculations can be run on the data.
is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
Step 3: Remove ‘bad’ data
The database includes entries where bikes were taken out of docks and checked for quality or ‘ride_length’ was negative. A new dataframe was then created (all_trips_v2) as data is being removed.
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
Step 1: Perform analysis on ‘ride_length’ by calculating the mean, median, maximum and minimum length using the ‘summary’ function. All values are in seconds.
summary(all_trips_v2$ride_length)
| Min. | Median | Mean | Max. |
|---|---|---|---|
| 1 | 712 | 1479 | 9387024 |
Step 2: The mean, median, maximum and minimum ride length was then compared between members and casual users. With the results shown in the table below.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
| User Type | Median | Mean | Max. | Min |
|---|---|---|---|---|
| casual | 1546 | 3552.75 | 9387024 | 2 |
| member | 589 | 850.07 | 9056634 | 1 |
Step 3: The average ride time by each day for members vs casual users was then calculated
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
The days of the week were seen to be out of order so they were rearranged by the following code:
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
The average ride time by each day for members vs casual users was then calculated again with the days in the correct order.
Step 4: The ridership data was then analysed by the user type and weekday together with the number of rides and ride duration.
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% #calculates the average duration
arrange(member_casual, weekday) #sorts
Step 5: A visualization to compare the number of rides versus weekday by rider type was then created.
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
ylab("Number of Rides") +
xlab("Weekday") +
guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_weekday.png")
This is the resulting plot:
Observations:
Step 6: A visualization to compare the average ride duration versus weekday by rider type was then created.
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") + labs(title = "Average Ride Duration: Casual vs Member") +
ylab("Average Ride Duration") +
xlab("Weekday") +
guides(fill=guide_legend(title="User Type"))
ggsave("average_duration_vs_weekday.png")
This is the resulting plot:
Observations:
Step 7: A visualization to compare the number of rides with time was then created to further investigate the assumption of members using the bikes as part of their commute.
all_trips_v2 %>%
mutate(time = hour(started_at)) %>%
group_by(member_casual, time) %>%
summarise(number_of_rides = n()) %>%
arrange(member_casual, time) %>%
ggplot(aes(x = factor(time), y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
scale_x_discrete(breaks = c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
labels = c('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23')) +
labs(title = "Number of Rides: Members vs Casual") +
xlab("Time (hours)") +
ylab(" Number of Rides") +
guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_time.png")
This is the resulting plot:
Observations:
Step 8: A visualization to compare the number of rides versus month by rider type was then created.
all_trips_v2 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
ylab("Number of Rides") +
xlab("Month") +
guides(fill=guide_legend(title="User Type"))
ggsave("number_of _rides_vs_month.png")
This is the resulting plot:
Observations:
Step 9: The start station data and number of rides was then exported to a .csv file for further analysis.
start_station_data <- all_trips_v2 %>%
group_by(start_station_name) %>%
summarise(number_of_rides = n())
write.csv(start_station_data, "C:/Users/'username'/Documents/start_station_data.csv")
#'username' is used instead of personal username.
Step 10: The start_station_data.csv file was then
imported into Google sheets. The latitude and longitude was found for
each station using the extension ‘Geocode’ by Awesome Table. A
visualization for the number of trips and starting station was then
created in Tableau:
Observations:
Key Findings
Recommendations Based on Key Findings