Intoroduction

This Rmarkdown document reports my approach and thought process in solving the ‘Cyclistic Case Study’ from the Google Data Analytics Certificate. I will be following the six phases of data analysis that i have learned in the programme: Ask, Prepare, Process, Analyse, Share and Act to answer the main objective “Design marketing strategies aimed at converting casual riders into annual members.”

Case Study Summary

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles and a network of 692 stations across Chicago. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be key to future growth. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. As a junior data analyst, you are assigned to answer the question “How do annual members and casual riders use Cyclistic bikes differently?”

Ask

Obejctive: Design marketing strategies aimed at converting casual riders into annual members.

Goal: How do annual members and casual riders use Cyclistic bikes differently?

Key Stakeholders:

Prepare

The previous 12 months of trip data was downloaded (Q1 of 2020 - Q2 of 2019) and stored as .csv files. The data was then assessed to see if there was any issues with bias or credibility.

Step 1: Install and load packages.

library(tidyverse) # Helps wrangle data
library(lubridate) # Helps wrangle data attributes
library(ggplot2) # Helps visualize data 

Step 2: Collect Data

Loaded the datasets (.csv) files into Rstudio.

q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")

Step 3: Wrangle data and combine into a single file.

The column names were compared in each file. The names don’t need to be in the same order, but they do need to match perfectly before merging into one file

colnames(q3_2019)
colnames(q4_2019)
colnames(q2_2019)
colnames(q1_2020)

The column names aren’t consistent between all of the files. The columns should then be renamed in accordance with q1_2020 as this will be the supposed going-forward table design for Cyclistic.

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid 
                   ,started_at = start_time  
                   ,ended_at = end_time  
                   ,start_station_name = from_station_name 
                   ,start_station_id = from_station_id 
                   ,end_station_name = to_station_name 
                   ,end_station_id = to_station_id 
                   ,member_casual = usertype))

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))

The dataframes are then inspected to look for incongruencies

str(q1_2020)
str(q4_2019)
str(q3_2019)
str(q2_2019)

‘ride_id’ and ‘rideable_type’ must be converted to character in order for them to stack properly

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type)) 

The individual quarters where then stacked into one big data frame

all_trips <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)

The lat, long, birthyear, and gender fields were removed as this data was dropped beginning in 2020

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))

Process

Step 1: Inspect the new table that has been created.

colnames(all_trips)  #List of column names shown in table below
nrow(all_trips)  #3879822 rows in data frame
dim(all_trips)  #Dimensions of the data frame are 3879822 x 9
head(all_trips)  #See the first 6 rows of data frame.
str(all_trips)  #See list of columns and data types (numeric, character, etc)
summary(all_trips)  #Statistical summary of data. Mainly for numerics
Column Name Format Description
ride_id character unique ID fro each ride
started_at POSIXct start time of the ride
ended_at POSIXct end time of each ride
rideable_type character displays bike ID for data before Q1 2020 or bike type 2020 Q1
start_station_id number id of the starting station
start_station_name character name of the starting station
end_station_id number id for the end station
end_station_name character name of the end station
member_casual character type of member

Step 2: Fix problems identified in the new table

In the ‘member_casual’ column ‘subscriber’ will be replaces with ‘member’ and ‘customer’ with ‘casual’. First, the ‘members_casual’ column was filtered to see how many observations fall under each user type.

table(all_trips$member_casual)
casual customer member subscriber
48480 857474 378407 2595461

The different values were then reassigned to the desired ones.

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                           ,"Subscriber" = "member"
                           ,"Customer" = "casual"))

The ‘member_casual’ column was then checked to make sure the correct amount of observations were reassigned.

table(all_trips$member_casual)
casual member
905954 2973868

Columns were than added that list the date; month, day, and year of each ride as this will allow the ride data for each month, day, or year … to be aggregated. Before completing these operations only the ride level could be aggregated.

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

A ‘ride_length’ column was then added to calculate the ride length in seconds

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

The structure of the columns was then investigated and it was seen that ‘ride_length’ needed to be converted from factor to numeric so calculations can be run on the data.

is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

Step 3: Remove ‘bad’ data

The database includes entries where bikes were taken out of docks and checked for quality or ‘ride_length’ was negative. A new dataframe was then created (all_trips_v2) as data is being removed.

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Analysis

Step 1: Perform analysis on ‘ride_length’ by calculating the mean, median, maximum and minimum length using the ‘summary’ function. All values are in seconds.

summary(all_trips_v2$ride_length)
Min. Median Mean Max.
1 712 1479 9387024

Step 2: The mean, median, maximum and minimum ride length was then compared between members and casual users. With the results shown in the table below.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
User Type Median Mean Max. Min
casual 1546 3552.75 9387024 2
member 589 850.07 9056634 1

Step 3: The average ride time by each day for members vs casual users was then calculated

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

The days of the week were seen to be out of order so they were rearranged by the following code:

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

The average ride time by each day for members vs casual users was then calculated again with the days in the correct order.

Step 4: The ridership data was then analysed by the user type and weekday together with the number of rides and ride duration.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%    #creates weekday field using wday()
  group_by(member_casual, weekday) %>%                    #groups by usertype and weekday
  summarise(number_of_rides = n()                         #calculates the number of rides and average duration 
  ,average_duration = mean(ride_length)) %>%              #calculates the average duration
  arrange(member_casual, weekday)                         #sorts

Step 5: A visualization to compare the number of rides versus weekday by rider type was then created.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
  ylab("Number of Rides") +
  xlab("Weekday") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_weekday.png")

This is the resulting plot: number_of_rides_vs_weekday

Observations:

Step 6: A visualization to compare the average ride duration versus weekday by rider type was then created.

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Average Ride Duration: Casual vs Member") +
  ylab("Average Ride Duration") +
  xlab("Weekday") +
  guides(fill=guide_legend(title="User Type"))
ggsave("average_duration_vs_weekday.png")

This is the resulting plot: average_ride_duration_vs_weekday Observations:

Step 7: A visualization to compare the number of rides with time was then created to further investigate the assumption of members using the bikes as part of their commute.

all_trips_v2 %>% 
  mutate(time = hour(started_at)) %>%
  group_by(member_casual, time) %>%
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual, time)  %>%
  ggplot(aes(x = factor(time), y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + 
  scale_x_discrete(breaks = c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
                   labels = c('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23')) +
  labs(title = "Number of Rides: Members vs Casual") +
  xlab("Time (hours)") +
  ylab(" Number of Rides") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of_rides_vs_time.png")

This is the resulting plot: number_of_rides_vs_time

Observations:

Step 8: A visualization to compare the number of rides versus month by rider type was then created.

all_trips_v2 %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + labs(title = "Number of Rides: Casual vs Member") +
  ylab("Number of Rides") +
  xlab("Month") +
  guides(fill=guide_legend(title="User Type"))
ggsave("number_of _rides_vs_month.png") 

This is the resulting plot: number_of_rides_vs_month

Observations:

Step 9: The start station data and number of rides was then exported to a .csv file for further analysis.

start_station_data <- all_trips_v2 %>%                  
                group_by(start_station_name) %>%                                  
                summarise(number_of_rides = n())

write.csv(start_station_data, "C:/Users/'username'/Documents/start_station_data.csv") 
#'username' is used instead of personal username.

Step 10: The start_station_data.csv file was then imported into Google sheets. The latitude and longitude was found for each station using the extension ‘Geocode’ by Awesome Table. A visualization for the number of trips and starting station was then created in Tableau:
Starting_station_and_number_of_rides

Observations:

Key Findings

Recommendations Based on Key Findings

Share

The key findings and recommendations were all shown in a Google slides presentation

Act