Cyclistic Case Study

Introduction:

This markdown is used to document analysis steps of Cyclistic Case Study under the program Coursera’s Google Data Analytics Professional Certificate. It will follow the analysis steps introduced in the course, which is Ask - Prepare - Process - Analyze - Share - Act; however, the analysis in this document is reflective of my own thought process and conclusion. The goal here is to demonstrate what I would do in reality as Junior Data Analyst, the assumed role for this project.

Case Summary

Launched in 2016, Cyclistic is a bike-share program in Chicago featuring more than 5,800 bicycles and 600 docking stations. It has gained momentum of brand awareness through different marketing approaches, and the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Both Finance Analysts and Marketing Director agree that the key to future growth is convert casual riders into annual memberships. Moreno, Marketing Director, has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.

As a Junior Data Analyst, you were assigned the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You were also given a set of data from the last 12 months to identify key trends.

ASK

Below are the key deliverables of what I need to confirm and clarify with the Marketig Director.

Objective: Design marketing strategies aimed at converting casual riders into annual members.

Goals: Answer to How do annual members and casual riders use Cyclistic bikes differently?

Key Stakeholders / Audience:

Primary Business Owner:

Ms. Lily Moreno - Marketing Director
Executive Team

Secondary Business Owner:

Marketing Data Analytics Team

Outcome:

Report of key findings and recommendation
Report of data analysis approach

Key Timeline:

Briefing date: 2021/09/07
Download and prepare data: 2021/09/08
Process and analyze data: 2021/09/08 - 2021/09/10
Report of Preliminary Findings to Marketing Director: 2021/09/11
Update Preliminary Report and email to all stakeholders: 2021/09/12
Present Final Report: 2021/09/13

Scope:

12 months of Data from 2020/08/01 to 2021/07/31

Further Comments or Notes:

This is the only internal source of data available at the moment. There are no other resources or information available, except for public data sources.

PREPARE

Download: Download Link
Date Download: 2021/09/08
Date Range: 2020/08/01 - 2021/07/31
File Type: .CVS (Total: 12 .CSV files)

Is this a good data source?	Remarks
Reliable	The source is accurate (first-party data), unbiased and complete.
Original	The source is first-party and primary
Comprehensive	All the spreadsheets has the categorical data of “member_casual” to group and compare, in addition to other fields. Sufficient to make comparisons.
Current	It is from August 2020 to July 2021. The modified date (metadata) is within 1 - 15 days of the following month. Example: July 2021 spreadsheet is modified or updated on August 14th, 2021.
Cited	The data has been made available by Motivate International Inc. under this license.

I tried to open and examine these 12 .CSV files in Google Sheet, however, there are files that are too heavy to import. Therefore,I decided to use RStudio to prepare, process and analyze.

Step 1: Install and load packages

library(tidyverse)
library(skimr)
library(lubridate)
library(scales)

Step 2: Load .csv files into RStudio. If you cannot run these files, please download from the source link and rename as below.

tripdata_2107 <- read.csv("divvy_tripdata_202107.csv")
tripdata_2106 <- read.csv("divvy_tripdata_202106.csv")
tripdata_2105 <- read.csv("divvy_tripdata_202105.csv")
tripdata_2104 <- read.csv("divvy_tripdata_202104.csv")
tripdata_2103 <- read.csv("divvy_tripdata_202103.csv")
tripdata_2102 <- read.csv("divvy_tripdata_202102.csv")
tripdata_2101 <- read.csv("divvy_tripdata_202101.csv")
tripdata_2012 <- read.csv("divvy_tripdata_202012.csv")
tripdata_2011 <- read.csv("divvy_tripdata_202011.csv")
tripdata_2010 <- read.csv("divvy_tripdata_202010.csv")
tripdata_2009 <- read.csv("divvy_tripdata_202009.csv")
tripdata_2008 <- read.csv("divvy_tripdata_202008.csv")

Step 3: Check the .csv files

skim_without_charts(tripdata_2107)
skim_without_charts(tripdata_2106)
skim_without_charts(tripdata_2105)
skim_without_charts(tripdata_2104)
skim_without_charts(tripdata_2103)
skim_without_charts(tripdata_2102)
skim_without_charts(tripdata_2101)
skim_without_charts(tripdata_2012)
skim_without_charts(tripdata_2011)
skim_without_charts(tripdata_2010)
skim_without_charts(tripdata_2009)
skim_without_charts(tripdata_2008)

After a review of summary of all 12 files, there are a couple of quick notes:

There are a total of 13 fields (columns) in each file.
Format of start_station_id and end_station_id is not consistent among all of them.
started_at and ended_at are character, not date time type, might needs re-format for calculation.

Below are descriptions of all the fields.

Field Name	Current Format	Description	Remarks
ride_id	character	unique ID of a ride
rideable_type	character	classic, docked, electric bike
started_at	character	ride’s start date time	needs re-format
ended_at	character	ride’s end date time	needs re-format
start_station_name	character	name of start station	contains null
start_station_id	character	unique id of start station	contains null
end_station_name	character	name of end station	contains null
end_station_id	character	unique id of end station	contains null
member_casual	character	member, casual	required to answer the question
start_lat	numeric	geo tag of start station	out of scope
start_lng	numeric	geo tag of start station	out of scope
end_lat	numeric	geo tag of end station	out of scope
end_lng	numeric	geo tag of end station	out of scope

Step 4: Combine all 12 .csv into 1 table named tripdata for better process and analysis because all have the same columns, then double check the final table tripdata.

tripdata <- rbind(tripdata_2107, 
                  tripdata_2106, 
                  tripdata_2105, 
                  tripdata_2104,
                  tripdata_2103,
                  tripdata_2102,
                  tripdata_2101,
                  tripdata_2012,
                  tripdata_2011,
                  tripdata_2010,
                  tripdata_2009,
                  tripdata_2008)

skim_without_charts(tripdata)

From the summary, there are things to take notes of:

There are duplicates in ride_id column in the new file tripdata
There are both missing data in station id and station name columns, which cannot be refilled. They are missing in complete random => Keep only start_station_name and end_station_name for analysis, remove start_station_id and end_station_id

To remove duplicates of ride_id

checkdup_ride_id <- tripdata %>% ## This is to quickly identify duplicate ride_id
  group_by(ride_id) %>%
  summarize(count=n()) %>%
  filter(count > 1)

tripdata_dup_ride_id <- tripdata %>%  ## To examine observations where ride_id is duplicate
  semi_join(checkdup_ride_id, by = "ride_id")

View(tripdata_dup_ride_id)

Review of the duplicate ride_id obeservations, we learned that there were wrong input data in December 2020, where about 400 observations from November are repeated. However, these duplicate observations in December 2020 have ended_at date is earlier than started_at date. We can easily remove those observations.

PROCESS

Step 1: Format data type and add new fields for analysis

tripdata_clean <- tripdata %>% ## Pick relevant fields for analysis
  select(ride_id,                  
         rideable_type,
         started_at,
         ended_at,
         start_station_name,
         end_station_name,
         member_casual) %>%
  mutate(started_at = ymd_hms(started_at)) %>% ## Reformat to datetime
  mutate(ended_at = ymd_hms(ended_at)) %>% ## Reformat to datetime
  mutate(ride_length = ended_at - started_at) %>% ## Create new field
  mutate(weekday_started = weekdays(started_at, abbreviate = TRUE)) ## Create new field

skim_without_charts(tripdata_clean)

After examining the new tripdata_clean, there are things to take notes of:

ride_length is too spread and hard to give insights, especially there are cases ride_length is negative, which does not make sense. => I decided to remove those negative observations and group ride_length
I created a new field of month_started in the format of YYYY-MM for summary and plotting purpose.
I created a new field of time_started in the format of AM / PM for summary and plotting purpose.

tripdata_clean <- tripdata_clean %>%
  filter(ride_length > 0) %>%
  mutate(ride_length_cat = case_when(ride_length <= 300 ~ '1s - 5m',
                                     ride_length <= 600 ~ '5m - 10m',
                                     ride_length <= 900 ~ '10m - 15m',
                                     ride_length <= 1800 ~ '15m - 30m',
                                     ride_length <= 3600 ~ '30m - 1h',
                                     TRUE ~ 'More Than 1h')) %>%
  mutate(month_started = format(started_at,"%Y-%m")) ## For grouping by month and year 

tripdata_clean <- tripdata_clean %>%
  mutate(time_started = format(started_at, "%Y-%m-%d %I:%M %p")) %>%
  mutate(time_started_cat = ifelse(grepl("PM", time_started), "PM", "AM")) ## For grouping whether the ride starts in AM or PM

Step 2: Next is a verification step to see if I created the right category for ride_length

checkup_ride_length_cat <- tripdata_clean %>%  
  group_by(ride_length_cat) %>% 
  summarize(min=min(ride_length), 
            max=max(ride_length), 
            count=n()) %>%
  
skim_without_charts(tripdata_clean)

After examining the tripdata_clean, everything looks okay and ready to analyze.

ANALYSIS

Step 1: Quick summary and descriptive analysis

tripdata_clean %>%
  group_by(member_casual) %>%
  summarize(average_ride_length = mean(ride_length),
            count_of_ride = n())

no	member_casual	average_ride_length	count_of_ride

1	casual	2205.7636 secs	2099442
2	member	884.2268 secs	2623020

Keynote points:

Casual riders have longer average_ride_length.
Number of casual rides are less than that of members.

Step 2: Monthly trend of rides

## Create a dataframe to group by month
number_of_rides <- tripdata_clean %>%
  group_by(month_started, member_casual) %>%
  summarize(no_of_rides = n())

## Create a line chart. After the first run, I noticed casual rides rises above member rides in the month of Jun and Jul in 2021 which is odd. Therefore, I decided to investigate further for key events in Chicago, the hometown of Cyclistic. The major event I found was the re-opening of the city during Covid-19. I put this into the chart through annotate for contextual purpose. 

ggplot(number_of_rides) +
  geom_line(mapping = aes(x = month_started, y = no_of_rides, group = member_casual, 
                          color = member_casual)) +
  labs(title = "Number of Rides: Casual vs. Member", 
       x = "Month", y = "Number of Rides") + 
  guides(color = guide_legend(title = "Membership Type")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma) +
  annotate("rect", xmin = -Inf, xmax = "2021-01", ymin = -Inf, ymax = Inf,
           fill = "blue", alpha = 0.1) +
  annotate("text", label = "Phase 3 Reopen", x = "2020-10", y = 400000,size = 3) +
  annotate("rect", xmin = "2021-01", xmax = "2021-06", ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.1) +
  annotate("text", label = "Phase 4 Reopen", x = "2021-03", y = 400000, size = 3) +
  annotate("rect", xmin = "2021-06", xmax = Inf, ymin = -Inf, ymax = Inf,
           fill = "green", alpha = 0.1) +
  annotate("text", label = "Phase 5", x = "2021-07", y = 400000,size = 3)

ggsave("number_of_rides.png")

This is the chart saved.

Keynote points:

Overall, members had more rides than casual riders. However, since April 2021, casual rides seems to rise up faster and pass member rides in June and July 2021. Due to lack of information, we could not determine if it was because number of casual riders increased or casual riders made more rides per ride. The same question applied to members, we could not determine if members made less rides per member or number of member drops. Number of Unique Riders (Casual vs. Member) could give further context and information.
However, based on the context of Re-Opening of Phase 4 and 5, I felt that there could have been a rise in tourists / people moving to Chicago because there were less restrictions and more people vaccinated; and these people needed a way of transportation from place to place. Another theory was that if members decided to cancel during lockdown and came back to ride as casual riders.
Both follow the same trend of upward and downward. It seems like month or seasonal factors impacted both the same way. The only exception which needs further investigation was when casual rides rises much faster than members rides. Another comparison of year on year could see if this is abnormal or normal.

Step 3: Do members or casual prefer certain types of bikes?

ggplot(tripdata_clean) + 
  geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) + 
  facet_wrap(~member_casual) + 
  labs(title = "Types of Bikes: Casual vs. Member",
       x = "Types of Bikes", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Types of Bikes")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("types_of_bikes.png")

This is the chart saved.

Keynote points:

Members tend to use more classic bikes than the other two types. However, we did not know the information of how many bikes available for each type so we could not determine. A data point could be more descriptive of this assumption is number of rides per rider per type of bike. Moreover, a qualitative survey asking if members and casual riders prefer which type of bikes and whether the availability of the type is important to them.
There are no preferences for which type of bikes to use for casual riders.
This information could be used to determine key images of our digital advertisement, which bikes to show more in the ads.

Step 4: Which day of the week is preferred by casual riders or members?

## This step is used to ensure of the order of the weekdays for the chart.
tripdata_clean$weekday_started <- factor(tripdata_clean$weekday_started, 
                                         levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

ggplot(tripdata_clean) + 
  geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) + 
  facet_wrap(~member_casual) + 
  labs(title = "Days of Week: Casual vs. Member",
       x = "Days of Week", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Days of Week")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("days_of_week.png")

This is the chart saved.

Keynote points:

Members rides spread out the whole week which tends to go down a little bit in the weekends, and peaked in the middle of the week. Meanwhile, casual riders prefer weekends ride more. This could be further explained by the fact that they mostly used it for weekend leisure.
This information can help marketing director to optimize the campaign based on days of ads shown. Casual riders who ride during the week can be more likely to be converted into members than others, especially those who only ride during the weekend.
A further qualitative survey asking if riders use Cyclistic for which main purpose and when or which day if riders use the service can be assist to this key finding.

Step 5: Which time of the day is preferred by riders?

ggplot(tripdata_clean) +
  geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) +
  facet_grid(member_casual ~ time_started_cat) +
  labs(title = "Time of Ride: Casual vs. Member",
       x = "Time of Ride", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Days of Week")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("time_of_day.png")

This is the chart saved.

Keynote points:

Members made much more rides in AM than casual riders. However, in PM, members slightly rides more than casual with exception of the weekends.
Though this finding does not give much understanding into difference between the two types of riders, we can use this information for campaign optimazation.
Qualitative survey which asked if riders prefer AM or PM, or if they don’t use Cyclistic which other means of transportations do they use? People might opt for other public transport or Uber in the morning to go to work or to commute. But we understand that most users use Cyclistic for leisure (given in the case).

Step 6: How long of a ride is used by riders?

## This step is to ensure the orders of the category for the chart.
tripdata_clean$ride_length_cat <- factor(tripdata_clean$ride_length_cat, 
                                         levels = c ("1s - 5m","5m - 10m","10m - 15m","15m - 30m","30m - 1h","More Than 1h"))

ggplot(tripdata_clean) +
  geom_bar(mapping = aes(x = ride_length_cat, fill = ride_length_cat)) +
  facet_wrap(~member_casual) +
  labs(title = "Ride Length: Casual vs. Member",
       x = "Ride Length Group", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Ride Length Group")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("ride_length.png")

This is the chart saved.

Keynote points:

In line with the descriptive analysis, we learned that members riders make shorter rides than casual riders. To enhance the understanding of this finding, we can investigate further into number of rides per rider; to know if members like to make shorter riders but multiple rides within a certain time frame or if casual riders like to hold on to their bikes before returning.

To support further details, we can see clearly by percentage.

## This was created to support key finding, by showing percent of number of rides based on the ride length category, breakdown by members and casual.

checkup_ride_length_cat_member <- tripdata_clean %>% 
  filter(member_casual == "member") %>%
  group_by(ride_length_cat) %>% 
  summarize(count_rides=n()) %>%
  mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))

checkup_ride_length_cat_casual <- tripdata_clean %>% 
  filter(member_casual == "casual") %>%
  group_by(ride_length_cat) %>% 
  summarize(count_rides=n()) %>%
  mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))

write_excel_csv(checkup_ride_length_cat_casual, "checkup_casual.csv")
write_excel_csv(checkup_ride_length_cat_member, "checkup_member.csv")

Step 7: Which street (or station) is mostly where riders start their rides?

This analysis is just used to see whether there is a preferred route. After examining the station name, I decided to split it into Street Name, because most stations name are at crossover. However, if we can use station name, it can be helpful to decide if we want to run an additional offline campaigns at those stations.

For this task, I extracted the list using RStudio, but worked on it on Google Sheet using pivot table.

## This is to extract .csv file where I count number of rides start at a station, breakdown by member_casual.

route <- tripdata_clean %>%
  filter(!is.null(start_station_name), !is.na(start_station_name)) %>%
  group_by(start_station_name, member_casual) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

write_excel_csv(route, "route.csv")

Later on, I used Google Sheet to extract street name and compared the top 10 streets where most rides started casual vs. members.

Casual	SUM of count	Member	SUM of count
Clark St	145,718	Clark St	199,326
Michigan Ave	103,595	Wells St	96,375
Lake Shore Dr	94,755	Broadway	91,634
Wells St	72,010	Lake Shore Dr	70,030
Broadway	56,589	Michigan Ave	67,923
Wabash Ave	53,700	Halsted St	66,426
Streeter Dr	52,276	Sheffield Ave	62,951
Sheffield Ave	43,893	Damen Ave	60,137
Damen Ave	43,046	Wabash Ave	54,979
Halsted St	42,812	Larrabee St	51,393

Keynote points:

We don’t see much differences in 2 lists. It can be down to where there are availability of the bikes for use and whether users are around the area. This once again can be a useful information for campaign optimization. OR priority for geo tagging the campaign.

Step 8: Finalize all key findings

There are no significant differences in choices of bikes, or route between casual riders and members.
Casual riders rides more in the weekend where members ride evenly throughout the week and less in the weekends.
Casual riders tend to make longer riders (ave 36 minutes) than members (ave 15 minutes).

Step 9: Recommendations based on key findings

Target Member-Lookalike Casual Riders: Casual riders who make frequent short rides during weekdays (Mon - Thu)
Geo Target Top Street Names: We can first focus on riders around these top street names who are more likely to be customers.
Qualitative Survey and Research Unique Riders Behavior: to confirm key insights such as bike types or reasons for a ride to drive our key message and communications

ACT

Next actions include:

Follow up and address unanwered questions during Q&A
Propose Qualitative Research Chicago Riders
Work with Product Team and Data Scientist on how to track unique riders

This concludes the case study. For any recommendation and comments, please feel free to send to dannydo1910@gmail.com or connect in Linkedin at Danh Do’s LinkedIn

Cyclistic Case Study

Danh Do

9/13/2021

Introduction:

Case Summary

ASK

PREPARE

PROCESS

ANALYSIS

SHARE

ACT