Cyclistic Case Study

Intoroduction

This Rmarkdown document reports my approach and thought process in solving the ‘Cyclistic Case Study’ from the Google Data Analytics Certificate. I will be following the six phases of data analysis that i have learned in the programme: Ask, Prepare, Process, Analyse, Share and Act to answer the main objective “Design marketing strategies aimed at converting casual riders into annual members.”

Case Study Summary

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles and a network of 692 stations across Chicago. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. As a junior data analyst, you are assigned to answer the question “How do annual members and casual riders use Cyclistic bikes differently?”

Ask

Obejctive: Design marketing strategies aimed at converting casual riders into annual members.

Goal: How do annual members and casual riders use Cyclistic bikes differently?

Key Stakeholders:

Lily Moreno, Marketing Director
Cyclistic Executive Team

Prepare

The previous 12 months of trip data was downloaded (Q1 of 2020 - Q2 of 2019) and stored as .csv files. The data was then assessed to see if there was any issues with bias or credibility.

Reliability - It is first party data and is unbiased and complete.
Original - The source of the data is first party.
Comprehensive - There is data on ‘Usertype’, and ‘Start time’ and ‘End time’ of trips as well as other fields to perform a sufficient analysis.
Current - The data is the latest 12 months available.
Cited - The data has been made available by Motivate International Inc. under this license.

Step 1: Install and load packages.

library(tidyverse) # Helps wrangle data
library(lubridate) # Helps wrangle data attributes
library(ggplot2) # Helps visualize data

Step 2: Collect Data

Loaded the datasets (.csv) files into Rstudio.

q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")

Step 3: Wrangle data and combine into a single file.

The column names were compared in each file. The names don’t need to be in the same order, but they do need to match perfectly before merging into one file

colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)

The column names aren’t consistent between all of the files. The columns should then be renamed in accordance with q1_2020 as this will be the supposed going-forward table design for Cyclistic.

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))

The dataframes are then inspected to look for incongruencies

str(q1_2020)
str(q4_2019)
str(q3_2019)
str(q2_2019)

‘ride_id’ and ‘rideable_type’ must be converted to character in order for them to stack properly

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))

The individual quarters where then stacked into one big data frame

all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)

The lat, long, birthyear, and gender fields were removed as this data was dropped beginning in 2020

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Process

Step 1: Inspect the new table that has been created.

colnames(all_trips)  #List of column names shown in table below
nrow(all_trips)  #3879822 rows in data frame
dim(all_trips)  #Dimensions of the data frame are 3879822 x 9
head(all_trips)  #See the first 6 rows of data frame.
str(all_trips)  #See list of columns and data types (numeric, character, etc)
summary(all_trips)  #Statistical summary of data. Mainly for numerics

Column Name	Format	Description
ride_id	character	unique ID fro each ride
started_at	POSIXct	start time of the ride
ended_at	POSIXct	end time of each ride
rideable_type	character	displays bike ID for data before Q1 2020 or bike type 2020 Q1
start_station_id	number	id of the starting station
start_station_name	character	name of the starting station
end_station_id	number	id for the end station
end_station_name	character	name of the end station
member_casual	character	type of member

Step 2: Fix problems identified in the new table

In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.
The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.
We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “tripduration” column. We will add “ride_length” to the entire dataframe for consistency.
There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

In the ‘member_casual’ column ‘subscriber’ will be replaces with ‘member’ and ‘customer’ with ‘casual’. First, the ‘members_casual’ column was filtered to see how many observations fall under each user type.

table(all_trips$member_casual)

casual	customer	member	subscriber
48480	857474	378407	2595461

The different values were then reassigned to the desired ones.

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                           ,"Subscriber" = "member"
                           ,"Customer" = "casual"))

The ‘member_casual’ column was then checked to make sure the correct amount of observations were reassigned.

table(all_trips$member_casual)

casual	member
905954	2973868

Columns were than added that list the date; month, day, and year of each ride as this will allow the ride data for each month, day, or year … to be aggregated. Before completing these operations only the ride level could be aggregated.

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

A ‘ride_length’ column was then added to calculate the ride length in seconds

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

The structure of the columns was then investigated and it was seen that ‘ride_length’ needed to be converted from factor to numeric so calculations can be run on the data.

is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

Step 3: Remove ‘bad’ data

The database includes entries where bikes were taken out of docks and checked for quality or ‘ride_length’ was negative. A new dataframe was then created (all_trips_v2) as data is being removed.

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Analysis

Step 1: Perform analysis on ‘ride_length’ by calculating the mean, median, maximum and minimum length using the ‘summary’ function. All values are in seconds.

summary(all_trips_v2$ride_length)

Min.	Median	Mean	Max.
1	712	1479	9387024

Step 2: The mean, median, maximum and minimum ride length was then compared between members and casual users. With the results shown in the table below.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

User Type	Median	Mean	Max.	Min
casual	1546	3552.75	9387024	2
member	589	850.07	9056634	1

Step 3: The average ride time by each day for members vs casual users was then calculated

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

The days of the week were seen to be out of order so they were rearranged by the following code:

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

The average ride time by each day for members vs casual users was then calculated again with the days in the correct order.

Step 4: The ridership data was then analysed by the user type and weekday together with the number of rides and ride duration.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%    #creates weekday field using wday()
  group_by(member_casual, weekday) %>%                    #groups by usertype and weekday
  summarise(number_of_rides = n()                         #calculates the number of rides and average duration 
  ,average_duration = mean(ride_length)) %>%              #calculates the average duration
  arrange(member_casual, weekday)                         #sorts

Step 5: A visualization to compare the number of rides versus weekday by rider type was then created.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
  ylab("Number of Rides") +
  xlab("Weekday") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_weekday.png")

This is the resulting plot: number_of_rides_vs_weekday

Observations:

It can be seen from the graph that members have a much larger number of rides over weekdays than casual riders, with a smaller difference over weekends. This could be because members use the bikes as part of their commute to work.
Casual users’ number of rides remains similar through each weekday with an increase on Friday. Then there is a large increase over the weekend, with Saturday having the most number of rides. This could be due to casual riders using the bikes for leisure activities.
A possible idea to convert casual users to members is to add a promotional offer on the use of bikes over weekends to the annual membership.

Step 6: A visualization to compare the average ride duration versus weekday by rider type was then created.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Average Ride Duration: Casual vs Member") +
  ylab("Average Ride Duration") +
  xlab("Weekday") +
  guides(fill=guide_legend(title="User Type"))
ggsave("average_duration_vs_weekday.png")

This is the resulting plot: average_ride_duration_vs_weekday Observations:

The graph shows that casual users have a much greater average ride duration than members. This could be due to members using their bikes for shorter distances in their commute, for example, riding to the nearest train station. The average ride duration remains very similar during weekdays for members. This suggests that members repeat the same journey during the week, reinforcing the idea of using them as part of their commute.

Step 7: A visualization to compare the number of rides with time was then created to further investigate the assumption of members using the bikes as part of their commute.

all_trips_v2 %>% 
  mutate(time = hour(started_at)) %>%
  group_by(member_casual, time) %>%
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual, time)  %>%
  ggplot(aes(x = factor(time), y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + 
  scale_x_discrete(breaks = c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
                   labels = c('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23')) +
  labs(title = "Number of Rides: Members vs Casual") +
  xlab("Time (hours)") +
  ylab(" Number of Rides") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_time.png")

This is the resulting plot: number_of_rides_vs_time

Observations:

Members travel the most at 7-8am and 4-6pm. This reinforces the idea that members like to use the bikes as part of their commute as these times coincide with rush hour.
Casual users have the lowest number of rides at 4am, which then increases 5pm and then proceeds to decrease again to 4am.
Targeting casual riders who display the same riding patterns as members my result in them converting to an annual membership.

Step 8: A visualization to compare the number of rides versus month by rider type was then created.

all_trips_v2 %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
  ylab("Number of Rides") +
  xlab("Month") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of _rides_vs_month.png")

This is the resulting plot: number_of_rides_vs_month

Observations:

Both members and casual users follow the same ride pattern from March to February, where the number of rides increases from March to August and then decreases from August to February.
The ride pattern differs outside of these months. The members’ number of rides continues to decreases until March, whereas the casual users’ decrease until February and then begin to increase again.
The largest amount of casual users ride the bikes in the summer months (June, July and August) so targeting them with advertisement for annual memberships in those months will result in more casual users converting to members.

Step 9: The start station data and number of rides was then exported to a .csv file for further analysis.

start_station_data <- all_trips_v2 %>%                  
                group_by(start_station_name) %>%                                  
                summarise(number_of_rides = n())

write.csv(start_station_data, "C:/Users/'username'/Documents/start_station_data.csv") 
#'username' is used instead of personal username.

Step 10: The start_station_data.csv file was then imported into Google sheets. The latitude and longitude was found for each station using the extension ‘Geocode’ by Awesome Table. A visualization for the number of trips and starting station was then created in Tableau:
Starting_station_and_number_of_rides

Observations:

The most popular stations are located in central Chicago and the coastal area.
Targeting advertisements near these locations may result in new customers signing up to annual memberships or casual users converting to an annual membership.

Key Findings

Casual riders prefer to use the bikes on the weekends. Members prefer weekdays, potential using them as part of their commute.
Casual riders have a much greater average ride duration than members.
Members use the bikes the most during rush hour, whereas casual riders use the bikes most in the afternoon.
Both members and casual riders have similar monthly ride patterns.

Recommendations Based on Key Findings

Consider implementing a weekend pass as part of the annual membership to attract casual riders.
Target casual riders who display similar ride patterns to members (short ride duration on weekdays during rush hour).
Target advertisement near the top locations so more riders will consider an annual membership.

Act

Create a survey to confirm findings on how members use their bikes.
Follow up on unanswered questions from the presentation.

Cyclistic Case Study

William Graham

2022-09-30

Intoroduction

Case Study Summary

Ask

Prepare

Process

Analysis

Act