Introduction:

This markdown is used to document analysis steps of Cyclistic Case Study under the program Coursera’s Google Data Analytics Professional Certificate. It will follow the analysis steps introduced in the course, which is Ask - Prepare - Process - Analyze - Share - Act; however, the analysis in this document is reflective of my own thought process and conclusion. The goal here is to demonstrate what I would do in reality as Junior Data Analyst, the assumed role for this project.

Case Summary

Launched in 2016, Cyclistic is a bike-share program in Chicago featuring more than 5,800 bicycles and 600 docking stations. It has gained momentum of brand awareness through different marketing approaches, and the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Both Finance Analysts and Marketing Director agree that the key to future growth is convert casual riders into annual memberships. Moreno, Marketing Director, has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.

As a Junior Data Analyst, you were assigned the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You were also given a set of data from the last 12 months to identify key trends.


ASK

Below are the key deliverables of what I need to confirm and clarify with the Marketig Director.

Objective: Design marketing strategies aimed at converting casual riders into annual members.

Goals: Answer to How do annual members and casual riders use Cyclistic bikes differently?

Key Stakeholders / Audience:

Primary Business Owner:

Secondary Business Owner:

Outcome:

Key Timeline:

Scope:

Further Comments or Notes:

PREPARE

Is this a good data source? Remarks
Reliable The source is accurate (first-party data), unbiased and complete.
Original The source is first-party and primary
Comprehensive All the spreadsheets has the categorical data of “member_casual” to group and compare, in addition to other fields. Sufficient to make comparisons.
Current It is from August 2020 to July 2021. The modified date (metadata) is within 1 - 15 days of the following month. Example: July 2021 spreadsheet is modified or updated on August 14th, 2021.
Cited The data has been made available by Motivate International Inc. under this license.

I tried to open and examine these 12 .CSV files in Google Sheet, however, there are files that are too heavy to import. Therefore,I decided to use RStudio to prepare, process and analyze.

Step 1: Install and load packages

library(tidyverse)
library(skimr)
library(lubridate)
library(scales)

Step 2: Load .csv files into RStudio. If you cannot run these files, please download from the source link and rename as below.

tripdata_2107 <- read.csv("divvy_tripdata_202107.csv")
tripdata_2106 <- read.csv("divvy_tripdata_202106.csv")
tripdata_2105 <- read.csv("divvy_tripdata_202105.csv")
tripdata_2104 <- read.csv("divvy_tripdata_202104.csv")
tripdata_2103 <- read.csv("divvy_tripdata_202103.csv")
tripdata_2102 <- read.csv("divvy_tripdata_202102.csv")
tripdata_2101 <- read.csv("divvy_tripdata_202101.csv")
tripdata_2012 <- read.csv("divvy_tripdata_202012.csv")
tripdata_2011 <- read.csv("divvy_tripdata_202011.csv")
tripdata_2010 <- read.csv("divvy_tripdata_202010.csv")
tripdata_2009 <- read.csv("divvy_tripdata_202009.csv")
tripdata_2008 <- read.csv("divvy_tripdata_202008.csv")

Step 3: Check the .csv files

skim_without_charts(tripdata_2107)
skim_without_charts(tripdata_2106)
skim_without_charts(tripdata_2105)
skim_without_charts(tripdata_2104)
skim_without_charts(tripdata_2103)
skim_without_charts(tripdata_2102)
skim_without_charts(tripdata_2101)
skim_without_charts(tripdata_2012)
skim_without_charts(tripdata_2011)
skim_without_charts(tripdata_2010)
skim_without_charts(tripdata_2009)
skim_without_charts(tripdata_2008)

After a review of summary of all 12 files, there are a couple of quick notes:

Below are descriptions of all the fields.

Field Name Current Format Description Remarks
ride_id character unique ID of a ride
rideable_type character classic, docked, electric bike
started_at character ride’s start date time needs re-format
ended_at character ride’s end date time needs re-format
start_station_name character name of start station contains null
start_station_id character unique id of start station contains null
end_station_name character name of end station contains null
end_station_id character unique id of end station contains null
member_casual character member, casual required to answer the question
start_lat numeric geo tag of start station out of scope
start_lng numeric geo tag of start station out of scope
end_lat numeric geo tag of end station out of scope
end_lng numeric geo tag of end station out of scope

Step 4: Combine all 12 .csv into 1 table named tripdata for better process and analysis because all have the same columns, then double check the final table tripdata.

tripdata <- rbind(tripdata_2107, 
                  tripdata_2106, 
                  tripdata_2105, 
                  tripdata_2104,
                  tripdata_2103,
                  tripdata_2102,
                  tripdata_2101,
                  tripdata_2012,
                  tripdata_2011,
                  tripdata_2010,
                  tripdata_2009,
                  tripdata_2008)

skim_without_charts(tripdata)

From the summary, there are things to take notes of:

To remove duplicates of ride_id

checkdup_ride_id <- tripdata %>% ## This is to quickly identify duplicate ride_id
  group_by(ride_id) %>%
  summarize(count=n()) %>%
  filter(count > 1)

tripdata_dup_ride_id <- tripdata %>%  ## To examine observations where ride_id is duplicate
  semi_join(checkdup_ride_id, by = "ride_id")

View(tripdata_dup_ride_id)

Review of the duplicate ride_id obeservations, we learned that there were wrong input data in December 2020, where about 400 observations from November are repeated. However, these duplicate observations in December 2020 have ended_at date is earlier than started_at date. We can easily remove those observations.

PROCESS

Step 1: Format data type and add new fields for analysis

tripdata_clean <- tripdata %>% ## Pick relevant fields for analysis
  select(ride_id,                  
         rideable_type,
         started_at,
         ended_at,
         start_station_name,
         end_station_name,
         member_casual) %>%
  mutate(started_at = ymd_hms(started_at)) %>% ## Reformat to datetime
  mutate(ended_at = ymd_hms(ended_at)) %>% ## Reformat to datetime
  mutate(ride_length = ended_at - started_at) %>% ## Create new field
  mutate(weekday_started = weekdays(started_at, abbreviate = TRUE)) ## Create new field

skim_without_charts(tripdata_clean)

After examining the new tripdata_clean, there are things to take notes of:

tripdata_clean <- tripdata_clean %>%
  filter(ride_length > 0) %>%
  mutate(ride_length_cat = case_when(ride_length <= 300 ~ '1s - 5m',
                                     ride_length <= 600 ~ '5m - 10m',
                                     ride_length <= 900 ~ '10m - 15m',
                                     ride_length <= 1800 ~ '15m - 30m',
                                     ride_length <= 3600 ~ '30m - 1h',
                                     TRUE ~ 'More Than 1h')) %>%
  mutate(month_started = format(started_at,"%Y-%m")) ## For grouping by month and year 

tripdata_clean <- tripdata_clean %>%
  mutate(time_started = format(started_at, "%Y-%m-%d %I:%M %p")) %>%
  mutate(time_started_cat = ifelse(grepl("PM", time_started), "PM", "AM")) ## For grouping whether the ride starts in AM or PM

Step 2: Next is a verification step to see if I created the right category for ride_length

checkup_ride_length_cat <- tripdata_clean %>%  
  group_by(ride_length_cat) %>% 
  summarize(min=min(ride_length), 
            max=max(ride_length), 
            count=n()) %>%
  
skim_without_charts(tripdata_clean)

After examining the tripdata_clean, everything looks okay and ready to analyze.

ANALYSIS

Step 1: Quick summary and descriptive analysis

tripdata_clean %>%
  group_by(member_casual) %>%
  summarize(average_ride_length = mean(ride_length),
            count_of_ride = n())
no member_casual average_ride_length count_of_ride
1 casual 2205.7636 secs 2099442
2 member 884.2268 secs 2623020

Keynote points:

Step 2: Monthly trend of rides

## Create a dataframe to group by month
number_of_rides <- tripdata_clean %>%
  group_by(month_started, member_casual) %>%
  summarize(no_of_rides = n())

## Create a line chart. After the first run, I noticed casual rides rises above member rides in the month of Jun and Jul in 2021 which is odd. Therefore, I decided to investigate further for key events in Chicago, the hometown of Cyclistic. The major event I found was the re-opening of the city during Covid-19. I put this into the chart through annotate for contextual purpose. 

ggplot(number_of_rides) +
  geom_line(mapping = aes(x = month_started, y = no_of_rides, group = member_casual, 
                          color = member_casual)) +
  labs(title = "Number of Rides: Casual vs. Member", 
       x = "Month", y = "Number of Rides") + 
  guides(color = guide_legend(title = "Membership Type")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma) +
  annotate("rect", xmin = -Inf, xmax = "2021-01", ymin = -Inf, ymax = Inf,
           fill = "blue", alpha = 0.1) +
  annotate("text", label = "Phase 3 Reopen", x = "2020-10", y = 400000,size = 3) +
  annotate("rect", xmin = "2021-01", xmax = "2021-06", ymin = -Inf, ymax = Inf,
           fill = "red", alpha = 0.1) +
  annotate("text", label = "Phase 4 Reopen", x = "2021-03", y = 400000, size = 3) +
  annotate("rect", xmin = "2021-06", xmax = Inf, ymin = -Inf, ymax = Inf,
           fill = "green", alpha = 0.1) +
  annotate("text", label = "Phase 5", x = "2021-07", y = 400000,size = 3)

ggsave("number_of_rides.png")

This is the chart saved.

Keynote points:

Step 3: Do members or casual prefer certain types of bikes?

ggplot(tripdata_clean) + 
  geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) + 
  facet_wrap(~member_casual) + 
  labs(title = "Types of Bikes: Casual vs. Member",
       x = "Types of Bikes", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Types of Bikes")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("types_of_bikes.png")

This is the chart saved.

Keynote points:

Step 4: Which day of the week is preferred by casual riders or members?

## This step is used to ensure of the order of the weekdays for the chart.
tripdata_clean$weekday_started <- factor(tripdata_clean$weekday_started, 
                                         levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

ggplot(tripdata_clean) + 
  geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) + 
  facet_wrap(~member_casual) + 
  labs(title = "Days of Week: Casual vs. Member",
       x = "Days of Week", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Days of Week")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("days_of_week.png")

This is the chart saved.

Keynote points:

Step 5: Which time of the day is preferred by riders?

ggplot(tripdata_clean) +
  geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) +
  facet_grid(member_casual ~ time_started_cat) +
  labs(title = "Time of Ride: Casual vs. Member",
       x = "Time of Ride", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Days of Week")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("time_of_day.png")

This is the chart saved.

Keynote points:

Step 6: How long of a ride is used by riders?

## This step is to ensure the orders of the category for the chart.
tripdata_clean$ride_length_cat <- factor(tripdata_clean$ride_length_cat, 
                                         levels = c ("1s - 5m","5m - 10m","10m - 15m","15m - 30m","30m - 1h","More Than 1h"))

ggplot(tripdata_clean) +
  geom_bar(mapping = aes(x = ride_length_cat, fill = ride_length_cat)) +
  facet_wrap(~member_casual) +
  labs(title = "Ride Length: Casual vs. Member",
       x = "Ride Length Group", y = "Number of Rides") + 
  guides(fill = guide_legend(title = "Ride Length Group")) +
  theme(axis.title = (element_text(size = 8)), 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        axis.text.y = element_text(size = 8),
        legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
  scale_y_continuous(labels = comma)

ggsave("ride_length.png")

This is the chart saved.

Keynote points:

To support further details, we can see clearly by percentage.

## This was created to support key finding, by showing percent of number of rides based on the ride length category, breakdown by members and casual.

checkup_ride_length_cat_member <- tripdata_clean %>% 
  filter(member_casual == "member") %>%
  group_by(ride_length_cat) %>% 
  summarize(count_rides=n()) %>%
  mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))

checkup_ride_length_cat_casual <- tripdata_clean %>% 
  filter(member_casual == "casual") %>%
  group_by(ride_length_cat) %>% 
  summarize(count_rides=n()) %>%
  mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))

write_excel_csv(checkup_ride_length_cat_casual, "checkup_casual.csv")
write_excel_csv(checkup_ride_length_cat_member, "checkup_member.csv")

Step 7: Which street (or station) is mostly where riders start their rides?

This analysis is just used to see whether there is a preferred route. After examining the station name, I decided to split it into Street Name, because most stations name are at crossover. However, if we can use station name, it can be helpful to decide if we want to run an additional offline campaigns at those stations.

For this task, I extracted the list using RStudio, but worked on it on Google Sheet using pivot table.

## This is to extract .csv file where I count number of rides start at a station, breakdown by member_casual.

route <- tripdata_clean %>%
  filter(!is.null(start_station_name), !is.na(start_station_name)) %>%
  group_by(start_station_name, member_casual) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

write_excel_csv(route, "route.csv")

Later on, I used Google Sheet to extract street name and compared the top 10 streets where most rides started casual vs. members.

Casual SUM of count Member SUM of count
Clark St 145,718 Clark St 199,326
Michigan Ave 103,595 Wells St 96,375
Lake Shore Dr 94,755 Broadway 91,634
Wells St 72,010 Lake Shore Dr 70,030
Broadway 56,589 Michigan Ave 67,923
Wabash Ave 53,700 Halsted St 66,426
Streeter Dr 52,276 Sheffield Ave 62,951
Sheffield Ave 43,893 Damen Ave 60,137
Damen Ave 43,046 Wabash Ave 54,979
Halsted St 42,812 Larrabee St 51,393

Keynote points:

Step 8: Finalize all key findings

Step 9: Recommendations based on key findings

SHARE

I chose to present all findings in Google Slide deck. Click here to view.

ACT

Next actions include:

This concludes the case study. For any recommendation and comments, please feel free to send to or connect in Linkedin at Danh Do’s LinkedIn