This markdown is used to document analysis steps of Cyclistic Case Study under the program Coursera’s Google Data Analytics Professional Certificate. It will follow the analysis steps introduced in the course, which is Ask - Prepare - Process - Analyze - Share - Act; however, the analysis in this document is reflective of my own thought process and conclusion. The goal here is to demonstrate what I would do in reality as Junior Data Analyst, the assumed role for this project.
Launched in 2016, Cyclistic is a bike-share program in Chicago featuring more than 5,800 bicycles and 600 docking stations. It has gained momentum of brand awareness through different marketing approaches, and the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Both Finance Analysts and Marketing Director agree that the key to future growth is convert casual riders into annual memberships. Moreno, Marketing Director, has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.
As a Junior Data Analyst, you were assigned the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You were also given a set of data from the last 12 months to identify key trends.
Below are the key deliverables of what I need to confirm and clarify with the Marketig Director.
Objective: Design marketing strategies aimed at converting casual riders into annual members.
Goals: Answer to How do annual members and casual riders use Cyclistic bikes differently?
Key Stakeholders / Audience:
Primary Business Owner:
Ms. Lily Moreno - Marketing Director
Executive Team
Secondary Business Owner:
Outcome:
Report of key findings and recommendation
Report of data analysis approach
Key Timeline:
Briefing date: 2021/09/07
Download and prepare data: 2021/09/08
Process and analyze data: 2021/09/08 - 2021/09/10
Report of Preliminary Findings to Marketing Director: 2021/09/11
Update Preliminary Report and email to all stakeholders: 2021/09/12
Present Final Report: 2021/09/13
Scope:
Further Comments or Notes:
| Is this a good data source? | Remarks |
|---|---|
| Reliable | The source is accurate (first-party data), unbiased and complete. |
| Original | The source is first-party and primary |
| Comprehensive | All the spreadsheets has the categorical data of “member_casual” to group and compare, in addition to other fields. Sufficient to make comparisons. |
| Current | It is from August 2020 to July 2021. The modified date (metadata) is within 1 - 15 days of the following month. Example: July 2021 spreadsheet is modified or updated on August 14th, 2021. |
| Cited | The data has been made available by Motivate International Inc. under this license. |
I tried to open and examine these 12 .CSV files in Google Sheet, however, there are files that are too heavy to import. Therefore,I decided to use RStudio to prepare, process and analyze.
Step 1: Install and load packages
library(tidyverse)
library(skimr)
library(lubridate)
library(scales)
Step 2: Load .csv files into RStudio. If you cannot run these files, please download from the source link and rename as below.
tripdata_2107 <- read.csv("divvy_tripdata_202107.csv")
tripdata_2106 <- read.csv("divvy_tripdata_202106.csv")
tripdata_2105 <- read.csv("divvy_tripdata_202105.csv")
tripdata_2104 <- read.csv("divvy_tripdata_202104.csv")
tripdata_2103 <- read.csv("divvy_tripdata_202103.csv")
tripdata_2102 <- read.csv("divvy_tripdata_202102.csv")
tripdata_2101 <- read.csv("divvy_tripdata_202101.csv")
tripdata_2012 <- read.csv("divvy_tripdata_202012.csv")
tripdata_2011 <- read.csv("divvy_tripdata_202011.csv")
tripdata_2010 <- read.csv("divvy_tripdata_202010.csv")
tripdata_2009 <- read.csv("divvy_tripdata_202009.csv")
tripdata_2008 <- read.csv("divvy_tripdata_202008.csv")
Step 3: Check the .csv files
skim_without_charts(tripdata_2107)
skim_without_charts(tripdata_2106)
skim_without_charts(tripdata_2105)
skim_without_charts(tripdata_2104)
skim_without_charts(tripdata_2103)
skim_without_charts(tripdata_2102)
skim_without_charts(tripdata_2101)
skim_without_charts(tripdata_2012)
skim_without_charts(tripdata_2011)
skim_without_charts(tripdata_2010)
skim_without_charts(tripdata_2009)
skim_without_charts(tripdata_2008)
After a review of summary of all 12 files, there are a couple of quick notes:
There are a total of 13 fields (columns) in each file.
Format of start_station_id and end_station_id is not consistent among all of them.
started_at and ended_at are character, not date time type, might needs re-format for calculation.
Below are descriptions of all the fields.
| Field Name | Current Format | Description | Remarks |
|---|---|---|---|
| ride_id | character | unique ID of a ride | |
| rideable_type | character | classic, docked, electric bike | |
| started_at | character | ride’s start date time | needs re-format |
| ended_at | character | ride’s end date time | needs re-format |
| start_station_name | character | name of start station | contains null |
| start_station_id | character | unique id of start station | contains null |
| end_station_name | character | name of end station | contains null |
| end_station_id | character | unique id of end station | contains null |
| member_casual | character | member, casual | required to answer the question |
| start_lat | numeric | geo tag of start station | out of scope |
| start_lng | numeric | geo tag of start station | out of scope |
| end_lat | numeric | geo tag of end station | out of scope |
| end_lng | numeric | geo tag of end station | out of scope |
Step 4: Combine all 12 .csv into 1 table named tripdata for better process and analysis because all have the same columns, then double check the final table tripdata.
tripdata <- rbind(tripdata_2107,
tripdata_2106,
tripdata_2105,
tripdata_2104,
tripdata_2103,
tripdata_2102,
tripdata_2101,
tripdata_2012,
tripdata_2011,
tripdata_2010,
tripdata_2009,
tripdata_2008)
skim_without_charts(tripdata)
From the summary, there are things to take notes of:
There are duplicates in ride_id column in the new file tripdata
There are both missing data in station id and station name columns, which cannot be refilled. They are missing in complete random => Keep only start_station_name and end_station_name for analysis, remove start_station_id and end_station_id
To remove duplicates of ride_id
checkdup_ride_id <- tripdata %>% ## This is to quickly identify duplicate ride_id
group_by(ride_id) %>%
summarize(count=n()) %>%
filter(count > 1)
tripdata_dup_ride_id <- tripdata %>% ## To examine observations where ride_id is duplicate
semi_join(checkdup_ride_id, by = "ride_id")
View(tripdata_dup_ride_id)
Review of the duplicate ride_id obeservations, we learned that there were wrong input data in December 2020, where about 400 observations from November are repeated. However, these duplicate observations in December 2020 have ended_at date is earlier than started_at date. We can easily remove those observations.
Step 1: Format data type and add new fields for analysis
tripdata_clean <- tripdata %>% ## Pick relevant fields for analysis
select(ride_id,
rideable_type,
started_at,
ended_at,
start_station_name,
end_station_name,
member_casual) %>%
mutate(started_at = ymd_hms(started_at)) %>% ## Reformat to datetime
mutate(ended_at = ymd_hms(ended_at)) %>% ## Reformat to datetime
mutate(ride_length = ended_at - started_at) %>% ## Create new field
mutate(weekday_started = weekdays(started_at, abbreviate = TRUE)) ## Create new field
skim_without_charts(tripdata_clean)
After examining the new tripdata_clean, there are things to take notes of:
ride_length is too spread and hard to give insights, especially there are cases ride_length is negative, which does not make sense. => I decided to remove those negative observations and group ride_length
I created a new field of month_started in the format of YYYY-MM for summary and plotting purpose.
I created a new field of time_started in the format of AM / PM for summary and plotting purpose.
tripdata_clean <- tripdata_clean %>%
filter(ride_length > 0) %>%
mutate(ride_length_cat = case_when(ride_length <= 300 ~ '1s - 5m',
ride_length <= 600 ~ '5m - 10m',
ride_length <= 900 ~ '10m - 15m',
ride_length <= 1800 ~ '15m - 30m',
ride_length <= 3600 ~ '30m - 1h',
TRUE ~ 'More Than 1h')) %>%
mutate(month_started = format(started_at,"%Y-%m")) ## For grouping by month and year
tripdata_clean <- tripdata_clean %>%
mutate(time_started = format(started_at, "%Y-%m-%d %I:%M %p")) %>%
mutate(time_started_cat = ifelse(grepl("PM", time_started), "PM", "AM")) ## For grouping whether the ride starts in AM or PM
Step 2: Next is a verification step to see if I created the right category for ride_length
checkup_ride_length_cat <- tripdata_clean %>%
group_by(ride_length_cat) %>%
summarize(min=min(ride_length),
max=max(ride_length),
count=n()) %>%
skim_without_charts(tripdata_clean)
After examining the tripdata_clean, everything looks okay and ready to analyze.
Step 1: Quick summary and descriptive analysis
tripdata_clean %>%
group_by(member_casual) %>%
summarize(average_ride_length = mean(ride_length),
count_of_ride = n())
| no | member_casual | average_ride_length | count_of_ride |
|---|---|---|---|
| 1 | casual | 2205.7636 secs | 2099442 |
| 2 | member | 884.2268 secs | 2623020 |
Keynote points:
Casual riders have longer average_ride_length.
Number of casual rides are less than that of members.
Step 2: Monthly trend of rides
## Create a dataframe to group by month
number_of_rides <- tripdata_clean %>%
group_by(month_started, member_casual) %>%
summarize(no_of_rides = n())
## Create a line chart. After the first run, I noticed casual rides rises above member rides in the month of Jun and Jul in 2021 which is odd. Therefore, I decided to investigate further for key events in Chicago, the hometown of Cyclistic. The major event I found was the re-opening of the city during Covid-19. I put this into the chart through annotate for contextual purpose.
ggplot(number_of_rides) +
geom_line(mapping = aes(x = month_started, y = no_of_rides, group = member_casual,
color = member_casual)) +
labs(title = "Number of Rides: Casual vs. Member",
x = "Month", y = "Number of Rides") +
guides(color = guide_legend(title = "Membership Type")) +
theme(axis.title = (element_text(size = 8)),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
scale_y_continuous(labels = comma) +
annotate("rect", xmin = -Inf, xmax = "2021-01", ymin = -Inf, ymax = Inf,
fill = "blue", alpha = 0.1) +
annotate("text", label = "Phase 3 Reopen", x = "2020-10", y = 400000,size = 3) +
annotate("rect", xmin = "2021-01", xmax = "2021-06", ymin = -Inf, ymax = Inf,
fill = "red", alpha = 0.1) +
annotate("text", label = "Phase 4 Reopen", x = "2021-03", y = 400000, size = 3) +
annotate("rect", xmin = "2021-06", xmax = Inf, ymin = -Inf, ymax = Inf,
fill = "green", alpha = 0.1) +
annotate("text", label = "Phase 5", x = "2021-07", y = 400000,size = 3)
ggsave("number_of_rides.png")
This is the chart saved.
Keynote points:
Overall, members had more rides than casual riders. However, since April 2021, casual rides seems to rise up faster and pass member rides in June and July 2021. Due to lack of information, we could not determine if it was because number of casual riders increased or casual riders made more rides per ride. The same question applied to members, we could not determine if members made less rides per member or number of member drops. Number of Unique Riders (Casual vs. Member) could give further context and information.
However, based on the context of Re-Opening of Phase 4 and 5, I felt that there could have been a rise in tourists / people moving to Chicago because there were less restrictions and more people vaccinated; and these people needed a way of transportation from place to place. Another theory was that if members decided to cancel during lockdown and came back to ride as casual riders.
Both follow the same trend of upward and downward. It seems like month or seasonal factors impacted both the same way. The only exception which needs further investigation was when casual rides rises much faster than members rides. Another comparison of year on year could see if this is abnormal or normal.
Step 3: Do members or casual prefer certain types of bikes?
ggplot(tripdata_clean) +
geom_bar(mapping = aes(x = rideable_type, fill = rideable_type)) +
facet_wrap(~member_casual) +
labs(title = "Types of Bikes: Casual vs. Member",
x = "Types of Bikes", y = "Number of Rides") +
guides(fill = guide_legend(title = "Types of Bikes")) +
theme(axis.title = (element_text(size = 8)),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
scale_y_continuous(labels = comma)
ggsave("types_of_bikes.png")
This is the chart saved.
Keynote points:
Members tend to use more classic bikes than the other two types. However, we did not know the information of how many bikes available for each type so we could not determine. A data point could be more descriptive of this assumption is number of rides per rider per type of bike. Moreover, a qualitative survey asking if members and casual riders prefer which type of bikes and whether the availability of the type is important to them.
There are no preferences for which type of bikes to use for casual riders.
This information could be used to determine key images of our digital advertisement, which bikes to show more in the ads.
Step 4: Which day of the week is preferred by casual riders or members?
## This step is used to ensure of the order of the weekdays for the chart.
tripdata_clean$weekday_started <- factor(tripdata_clean$weekday_started,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
ggplot(tripdata_clean) +
geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) +
facet_wrap(~member_casual) +
labs(title = "Days of Week: Casual vs. Member",
x = "Days of Week", y = "Number of Rides") +
guides(fill = guide_legend(title = "Days of Week")) +
theme(axis.title = (element_text(size = 8)),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
scale_y_continuous(labels = comma)
ggsave("days_of_week.png")
This is the chart saved.
Keynote points:
Members rides spread out the whole week which tends to go down a little bit in the weekends, and peaked in the middle of the week. Meanwhile, casual riders prefer weekends ride more. This could be further explained by the fact that they mostly used it for weekend leisure.
This information can help marketing director to optimize the campaign based on days of ads shown. Casual riders who ride during the week can be more likely to be converted into members than others, especially those who only ride during the weekend.
A further qualitative survey asking if riders use Cyclistic for which main purpose and when or which day if riders use the service can be assist to this key finding.
Step 5: Which time of the day is preferred by riders?
ggplot(tripdata_clean) +
geom_bar(mapping = aes(x = weekday_started, fill = weekday_started)) +
facet_grid(member_casual ~ time_started_cat) +
labs(title = "Time of Ride: Casual vs. Member",
x = "Time of Ride", y = "Number of Rides") +
guides(fill = guide_legend(title = "Days of Week")) +
theme(axis.title = (element_text(size = 8)),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
scale_y_continuous(labels = comma)
ggsave("time_of_day.png")
This is the chart saved.
Keynote points:
Members made much more rides in AM than casual riders. However, in PM, members slightly rides more than casual with exception of the weekends.
Though this finding does not give much understanding into difference between the two types of riders, we can use this information for campaign optimazation.
Qualitative survey which asked if riders prefer AM or PM, or if they don’t use Cyclistic which other means of transportations do they use? People might opt for other public transport or Uber in the morning to go to work or to commute. But we understand that most users use Cyclistic for leisure (given in the case).
Step 6: How long of a ride is used by riders?
## This step is to ensure the orders of the category for the chart.
tripdata_clean$ride_length_cat <- factor(tripdata_clean$ride_length_cat,
levels = c ("1s - 5m","5m - 10m","10m - 15m","15m - 30m","30m - 1h","More Than 1h"))
ggplot(tripdata_clean) +
geom_bar(mapping = aes(x = ride_length_cat, fill = ride_length_cat)) +
facet_wrap(~member_casual) +
labs(title = "Ride Length: Casual vs. Member",
x = "Ride Length Group", y = "Number of Rides") +
guides(fill = guide_legend(title = "Ride Length Group")) +
theme(axis.title = (element_text(size = 8)),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 8), legend.position = "bottom", legend.title = element_text(size = 8)) +
scale_y_continuous(labels = comma)
ggsave("ride_length.png")
This is the chart saved.
Keynote points:
To support further details, we can see clearly by percentage.
## This was created to support key finding, by showing percent of number of rides based on the ride length category, breakdown by members and casual.
checkup_ride_length_cat_member <- tripdata_clean %>%
filter(member_casual == "member") %>%
group_by(ride_length_cat) %>%
summarize(count_rides=n()) %>%
mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))
checkup_ride_length_cat_casual <- tripdata_clean %>%
filter(member_casual == "casual") %>%
group_by(ride_length_cat) %>%
summarize(count_rides=n()) %>%
mutate(percentage = paste0(round(count_rides/sum(count_rides)*100,2),"%"))
write_excel_csv(checkup_ride_length_cat_casual, "checkup_casual.csv")
write_excel_csv(checkup_ride_length_cat_member, "checkup_member.csv")
Step 7: Which street (or station) is mostly where riders start their rides?
This analysis is just used to see whether there is a preferred route. After examining the station name, I decided to split it into Street Name, because most stations name are at crossover. However, if we can use station name, it can be helpful to decide if we want to run an additional offline campaigns at those stations.
For this task, I extracted the list using RStudio, but worked on it on Google Sheet using pivot table.
## This is to extract .csv file where I count number of rides start at a station, breakdown by member_casual.
route <- tripdata_clean %>%
filter(!is.null(start_station_name), !is.na(start_station_name)) %>%
group_by(start_station_name, member_casual) %>%
summarize(count = n()) %>%
arrange(desc(count))
write_excel_csv(route, "route.csv")
Later on, I used Google Sheet to extract street name and compared the top 10 streets where most rides started casual vs. members.
| Casual | SUM of count | Member | SUM of count |
|---|---|---|---|
| Clark St | 145,718 | Clark St | 199,326 |
| Michigan Ave | 103,595 | Wells St | 96,375 |
| Lake Shore Dr | 94,755 | Broadway | 91,634 |
| Wells St | 72,010 | Lake Shore Dr | 70,030 |
| Broadway | 56,589 | Michigan Ave | 67,923 |
| Wabash Ave | 53,700 | Halsted St | 66,426 |
| Streeter Dr | 52,276 | Sheffield Ave | 62,951 |
| Sheffield Ave | 43,893 | Damen Ave | 60,137 |
| Damen Ave | 43,046 | Wabash Ave | 54,979 |
| Halsted St | 42,812 | Larrabee St | 51,393 |
Keynote points:
Step 8: Finalize all key findings
There are no significant differences in choices of bikes, or route between casual riders and members.
Casual riders rides more in the weekend where members ride evenly throughout the week and less in the weekends.
Casual riders tend to make longer riders (ave 36 minutes) than members (ave 15 minutes).
Step 9: Recommendations based on key findings
Target Member-Lookalike Casual Riders: Casual riders who make frequent short rides during weekdays (Mon - Thu)
Geo Target Top Street Names: We can first focus on riders around these top street names who are more likely to be customers.
Qualitative Survey and Research Unique Riders Behavior: to confirm key insights such as bike types or reasons for a ride to drive our key message and communications
Next actions include:
This concludes the case study. For any recommendation and comments, please feel free to send to dannydo1910@gmail.com or connect in Linkedin at Danh Do’s LinkedIn