This project is my attempt to solve the first case study from the Google Data Analytics Professional Certificate Capstone Projects on Coursera that you can access here. I really enjoy working on this project because not the challenge on this project is ranging from the size of the data provided, to the working on how to create a meaningful analysis to solve the main problem itself.
This project will be about analyzing users behavior on how they use the service of a bike-sharing company based in Chicago, called Cyclistic.
The structure of this project will follow the data analysis life cycle phases that is taught on the course. That phases include:
Ask: This part is all about identifying the business problem and find out how the analysis could help solve the business problem.
Prepare: The phase where we will find out about all the information regarding the data that will be used in this analysis. Where it’s stored, how it’s organized, whether there is any problem regarding the data, etc.
Process: This is where the data wrangling happens. This phase will focus on all the cleaning and transformation steps taken to clean and prepare the data for the next phase, the analyze phase.
Analyze: This is where all the analysis is going to happen. The phase where we’re going to find out how the behavior of Cyclistics users differ and what kind of insight we can gather from it.
Share: The main point from this part is the data visualization and what are the steps to visualize the data in compelling ways. Most of what is going to be in this phase will be covered in the previous phase. The reason is that showing the analysis result together with the visualization can helps the reader understand the result faster.
Act: This part will be focus on explaining every recommendation from the result.
On each step above, there are also going to be some guiding questions that have been provided to guide the students on what they should consider doing when working on this project. These guiding questions are helping me personally to build the habit of documenting my work.
If you have any question regarding this project, don’t hesitate to contact me through email here, or send me a message on LinkedIn here.
Cyclistic is a bike-sharing company that launched back in 2016. Since then, the program has grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, they believe that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, they believe there is a very good chance to convert casual riders into members. They notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
They have set a clear goal:
Design marketing strategies aimed at converting casual riders into annual members.
In order to do that, however, the marketing analyst team needs to better understand several things:
How do annual members and casual riders differ?
Why casual riders would buy a membership?
How digital media could affect their marketing tactics?
The team is interested in analyzing the Cyclistic historical bike trip data to identify trends. And for this project, I’m going to work to answer the first question, which is:
How do annual members and casual riders differ?
Questions
What is the problem you are trying to solve?
Understanding the different behavior of Cyclistic users, both casual and member users, on how they use Cyclistic service differently.
How can your insights drive business decisions?
By understanding the different behavior of each user group, we can help the Cyclistic team to create a clear targeted marketing campaign to convert casual users into Cyclistic members.
Questions
Where is your data located?
The data is coming from a real bike-sharing company that is also based in Chicago called Divvy. The data have been made available to be accessed publicly through this link.
How is the data organized?
The data is organized by monthly trip data in *.csv format that is compressed in a *.zip format for each month. Meaning, each *.csv file contains the historical trip data for a specific month. For this analysis, we’re going to need the historical trip data from the past 12 months, which means the trip data from August 2021 to September 2022 will be used.
Are there issues with bias or credibility in this data? Does your data ROCCC (Reliable, Original, Comprehensive, Current, Cited)?
Overall, the data is ROCCC! It’s first-party data that is owned by a real business and the data is constantly updated, so it’s already reliable, original, comprehensive, and current. The only missing piece is the information about what kind of information is contained in each data, or it’s missing the Cited part.
How are you addressing licensing, privacy, security, and accessibility?
The data is coming from a real company that is organized by the City of Chicago and maintained by them. The data can be accessed publicly, meaning there’s no need to perform specific agreements to use this data. The only thing to consider is the terms and conditions for using the data that is written here. That being said, the data can be used to be used for this analysis.
How did you verify the data integrity?
The data is first-party data from a real company, so it’s easy to trust that the data is complete, accurate, and consistent.
How does it help you answer your question?
The data already aligns well with the problem the business is currently facing, which is understanding how each group of users use their service differently.
Are there any problems with the data?
The lack of metadata made the data a little bit hard to understand at first. But the variables that exist in the data can be recognizable by its name.
For this part, we’re going to start to getting our hands on some code. I’m going to use R programming language to import, wrangle, and transform the data.
library(tidyverse) # helps wrangle data
library(janitor) # helps wrangle data
library(lubridate) # helps wrangle date attributes
library(kableExtra) # helps format table on output
library(patchwork) # additional library that help plotting
library(scales) # helps with chart display
theme_set(theme_bw()) # set default ggplot2 theme
ccp <- RColorBrewer::brewer.pal(8, "Set2")
ccp <- ccp[4:length(ccp)]
The data to support the analysis is coming from a real bike-sharing company that is also based in Chicago called Divvy. The data have been made available to be accessed publicly through this link.
The data is organized by monthly trip data in *.csv format that is compressed in a *.zip format for each month. Meaning, each *.csv file contains the historical trip data for a specific month. For this analysis, we’re going to need the historical trip data from the past 12 months, which means the trip data from September 2021 to August 2022 will be used.
To help with the analysis, let’s combine the data into a single
dataframe. To do this, dplyr::bind_rows function will be
used.
all_trips <-
list.files(path = '../data/',
pattern = "*.csv",
full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
We can check the how is our data for now using glimpse()
function.
glimpse(all_trips)
## Rows: 5,883,043
## Columns: 13
## $ ride_id <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32", "6EF7213790…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-09-28 16:07:10, 2021-09-28 14:24:51, 2021-09-…
## $ ended_at <dttm> 2021-09-28 16:09:54, 2021-09-28 14:40:05, 2021-09-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Clark St &…
## $ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "TA13070001…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.89000, 41.94000, 41.81000, 41.80000, 41.88000, 4…
## $ start_lng <dbl> -87.68000, -87.64000, -87.72000, -87.72000, -87.740…
## $ end_lat <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.88, 41.74, 41…
## $ end_lng <dbl> -87.67, -87.67, -87.72, -87.72, -87.71, -87.74, -87…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
Checking for Duplicates
check_duplicate <- all_trips %>%
distinct() # dplyr::distinct() to find duplicate
print(paste("Duplicated rows:",
nrow(all_trips) - nrow(check_duplicate)))
## [1] "Duplicated rows: 0"
We have no duplicate row in our data.
Renaming member_casual Column
Renaming the member_casual column into
user_type can help to understand the data better.
all_trips_v2 <- rename(all_trips, user_type = member_casual)
Changing Values in user_type Column
Still on the same column, let’s change the values in this column,
"casual" to "One-time User" and
"member" to "Subscriber".
all_trips_v2 <-
all_trips_v2 %>%
mutate(user_type =
recode(user_type,
"casual" = "One-time user",
"member" = "Subscriber"))
Renaming rideable_type Column
Let’s also change the column name of rideable_type into
bike_type.
all_trips_v2 <-
rename(all_trips_v2,
bike_type = rideable_type)
Changing Values in bike_type Column
Let’s see how the values in this column is displayed.
unique(all_trips_v2$bike_type)
## [1] "electric_bike" "classic_bike" "docked_bike"
Let’s rename the values to help with the visualization.
all_trips_v2 <-
all_trips_v2 %>%
mutate(bike_type =
recode(bike_type,
"electric_bike" = "Electric bike",
"classic_bike" = "Classic bike",
"docked_bike" = "Docked bike"))
unique(all_trips_v2$bike_type)
## [1] "Electric bike" "Classic bike" "Docked bike"
Adding Date/Time Related Columns
We can add a level of detail to the analysis by adding some columns that are related to the time each ride is taken. These columns include the month, day of the month, day of the week, and the hour each rides is taken.
all_trips_v2$date <-
as.Date(all_trips_v2$started_at) # the date of each rides
all_trips_v2$month <-
format(as.Date(all_trips_v2$date), "%b") # month of each rides
all_trips_v2$day_of_month <-
format(as.Date(all_trips_v2$date), "%d") # day of month of each rides
all_trips_v2$day_of_week <-
format(as.Date(all_trips_v2$date), "%a") # weekday of each rides
all_trips_v2$hour <-
format(as.POSIXct(all_trips_v2$started_at), "%H") # hour of day
Let’s check these new columns.
all_trips_v2 %>%
select(date, month, day_of_month, day_of_week, hour) %>%
glimpse()
## Rows: 5,883,043
## Columns: 5
## $ date <date> 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-…
## $ month <chr> "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "…
## $ day_of_month <chr> "28", "28", "28", "28", "28", "28", "28", "28", "27", "28…
## $ day_of_week <chr> "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "…
## $ hour <chr> "16", "14", "00", "14", "09", "01", "07", "11", "19", "11…
Adding Trip Duration Column
all_trips_v2$trip_duration <-
difftime(all_trips_v2$ended_at, all_trips_v2$started_at, units = "mins") %>%
as.numeric()
Before going to far, let’s check out this newly added column.
all_trips_v2 %>%
select(trip_duration) %>%
summary()
## trip_duration
## Min. : -137.42
## 1st Qu.: 6.05
## Median : 10.72
## Mean : 19.75
## 3rd Qu.: 19.33
## Max. :40705.02
Filter Out Trip Duration Under 0 Minute
all_trips_v3 <- filter(all_trips_v2, !trip_duration < 0)
Let’s see the cleaned result.
glimpse(all_trips_v3)
## Rows: 5,882,908
## Columns: 19
## $ ride_id <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32", "6EF7213790…
## $ bike_type <chr> "Electric bike", "Electric bike", "Electric bike", …
## $ started_at <dttm> 2021-09-28 16:07:10, 2021-09-28 14:24:51, 2021-09-…
## $ ended_at <dttm> 2021-09-28 16:09:54, 2021-09-28 14:40:05, 2021-09-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Clark St &…
## $ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "TA13070001…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.89000, 41.94000, 41.81000, 41.80000, 41.88000, 4…
## $ start_lng <dbl> -87.68000, -87.64000, -87.72000, -87.72000, -87.740…
## $ end_lat <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.88, 41.74, 41…
## $ end_lng <dbl> -87.67, -87.67, -87.72, -87.72, -87.71, -87.74, -87…
## $ user_type <chr> "One-time user", "One-time user", "One-time user", …
## $ date <date> 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-28, 20…
## $ month <chr> "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "S…
## $ day_of_month <chr> "28", "28", "28", "28", "28", "28", "28", "28", "27…
## $ day_of_week <chr> "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "T…
## $ hour <chr> "16", "14", "00", "14", "09", "01", "07", "11", "19…
## $ trip_duration <dbl> 2.7333333, 15.2333333, 3.6833333, 8.8166667, 10.533…
Questions
What tools are you choosing and why?
I’m going to use R to work on this project. There are two reasons for that, first I want to know how to perform analysis using R, and second I just want to implement the knowledge about this programming language from the course.
Have you ensured your data integrity?
By looking at the data, we can see that the data is containing the records of every trip ever taken using the Cyclistic bike for the last 12 months, meaning the data is accurate. Next, each column is consistent among each file that contains those records. This means we can ensure that the data integrity is preserved.
What steps have you taken to ensure that your data is clean?
Checking for duplicated trips can prevent the wrong result in the analysis. Also, check the data type of each column to make sure the value in each column is matched with the data type that represents them.
How can you verify that your data is clean and ready to analyze?
By using the steps above, and the transformation that has been taken, I can be sure that the data is clean and ready to analyze.
Have you documented your cleaning process so you can review and share those results?
Every step I took to clean the data area is documented above.
This is where we will perform the analysis on the data we have to answer the business problem stated above.
Let’s see the overview on how each user type, the one-time user and subscriber differ.
all_trips_v3 %>%
count(user_type) %>%
mutate(rides_num = n,
percentage = formattable::percent(n / sum(n))) %>% # format as percentage
select(user_type, rides_num, percentage) %>%
arrange(desc(rides_num)) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| user_type | rides_num | percentage |
|---|---|---|
| Subscriber | 3414493 | 58.04% |
| One-time user | 2468415 | 41.96% |
We can see that subscriber users ride more bike for the past 12-month, which are 58% or 17% higher than the one-time users.
Let’s add some level of detail to show how the number of trips differ per month.
all_trips_v3$month <-
ordered(
all_trips_v3$month,
levels = c(
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec"
)
)
| user_type | month | trips_num | percentage |
|---|---|---|---|
| One-time user | Jan | 18520 | 0.75% |
| One-time user | Feb | 21416 | 0.87% |
| One-time user | Mar | 89880 | 3.64% |
| One-time user | Apr | 126417 | 5.12% |
| One-time user | May | 280414 | 11.36% |
| One-time user | Jun | 369044 | 14.95% |
| One-time user | Jul | 406046 | 16.45% |
| One-time user | Aug | 358917 | 14.54% |
| One-time user | Sep | 363883 | 14.74% |
| One-time user | Oct | 257242 | 10.42% |
| One-time user | Nov | 106898 | 4.33% |
| One-time user | Dec | 69738 | 2.83% |
| Subscriber | Jan | 85250 | 2.50% |
| Subscriber | Feb | 94193 | 2.76% |
| Subscriber | Mar | 194160 | 5.69% |
| Subscriber | Apr | 244832 | 7.17% |
| Subscriber | May | 354443 | 10.38% |
| Subscriber | Jun | 400148 | 11.72% |
| Subscriber | Jul | 417426 | 12.23% |
| Subscriber | Aug | 427000 | 12.51% |
| Subscriber | Sep | 392228 | 11.49% |
| Subscriber | Oct | 373984 | 10.95% |
| Subscriber | Nov | 253027 | 7.41% |
| Subscriber | Dec | 177802 | 5.21% |
Let’s see how this data in visual form.
p_heatmap <- all_trips_v3 %>%
count(month, day_of_month) %>%
ggplot(aes(
x = day_of_month,
y = reorder(month, desc(month)),
fill = n
)) +
geom_tile(color = "white",
lwd = 0.8) +
coord_fixed() +
labs(title = "The Most Trips Taken For the Last 12 Months",
x = "Day of Month") +
guides(fill = guide_colourbar(
barwidth = 0.8,
barheight = 8,
title = "Trips"
)) +
theme(axis.title.y = element_blank(), # Remove axis title
panel.grid.major = element_blank()) + # Remove gridlines
scale_fill_distiller(palette = "YlGnBu", direction = 1)
patchwork::wrap_plots(p_heatmap)
From the chart above, May to October is the months with the highest number of trips taken by Cyclistic users. Next, let’s see the breakdown for each user type for each month.
all_trips_v3 %>%
ggplot(aes(
x = month, # Month data
fill = user_type # Color per user type
)) +
geom_bar(width = 0.7) + # Create stacked bar chart
labs(title = "Favorite Month to Ride Cyclistic's Bike?", # Chart title
subtitle = "The total number of trips taken for each month from the past 12-months\n(September 2021 - August 2022)",
y = "# of Trips") + # y-axis title
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 900000)) + # Set axis range
theme(
axis.title.x = element_blank(), # remove axis title
legend.title = element_blank(), # remove legend title
legend.position = "top", # legend position top
legend.justification = "left" # legend left justify
) +
scale_fill_brewer(palette = "Set2")
Here, we can see that the number of rides rose from January to July, and continuously decreased after that. The number of rides taken in June - September is the highest throughout the year, with more than 750 thousand bike rides per month, or more than 50% of the total bike rides are taken during these months. Besides that, May and October also have a quite high number of bike rides.
Looking at the individual user type can help us understand their behavior on a monthly level.
all_trips_v3 %>%
ggplot(aes(
x = month, # Month data
fill = user_type # Color per user type
)) +
geom_bar(width = 0.7) + # create a column bar chart
facet_wrap(~ user_type) +
labs(title = "Monthly Trips per User Type", # Chart title
subtitle = "The total number of trips taken by each user type for each month",
y = "# of Trips") + # y-axis title
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 450000)) + # Set axis range
theme(
axis.title.x = element_blank(), # remove axis title
legend.title = element_blank(), # remove legend title
legend.position = "blank" # remove legend
) +
scale_fill_brewer(palette = "Set2")
We can see conclude that:
It looks like the number of trips taken is correlated with the temperature. To prove this, let’s look how the monthly temperature in Chicago looks like.
# Relationship with the temperature
## Average monthly data from 1872 - 2022 for Chicago area
## https://www.weather.gov/wrh/climate?wfo=lot
month <- c("Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec")
mean_temp <-
c(24.2,
27.0,
36.9,
48.1,
58.7,
68.7,
74.0,
72.6,
65.5,
53.9,
40.3,
29.1)
chicago_monthly_temp <- data.frame(month, mean_temp)
chicago_monthly_temp$month <-
ordered(
chicago_monthly_temp$month,
levels = c(
"Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec"
)
)
chicago_monthly_temp %>%
ggplot(aes(month, mean_temp)) +
geom_bar(stat = 'identity', width = 0.7) +
labs(title = "Chicago Monthly Average Temperature (°F) from 1872 - 2022",
y = "°F") +
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
limits = c(0, 80)) + # set the axis limit
theme(axis.title.x = element_blank()) # Remove axis label
We can clearly see how the two variables are correlated.
all_trips_v3$day_of_week <-
ordered(all_trips_v3$day_of_week,
levels = c("Sun",
"Mon",
"Tue",
"Wed",
"Thu",
"Fri",
"Sat"))
First, let’s see the number of trips taken for each day.
all_trips_v3 %>%
count(day_of_week) %>%
mutate(percentage = formattable::percent(n / sum(n))) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| day_of_week | n | percentage |
|---|---|---|
| Sun | 841916 | 14.31% |
| Mon | 766998 | 13.04% |
| Tue | 814492 | 13.85% |
| Wed | 840256 | 14.28% |
| Thu | 836970 | 14.23% |
| Fri | 818204 | 13.91% |
| Sat | 964072 | 16.39% |
Visualizing the data can help to get the insight faster.
all_trips_v3 %>%
ggplot(aes(
x = day_of_week, # Month data
fill = user_type # Color per user type
)) +
geom_bar(width = 0.7) + # create a stacked bar chart
labs(title = "Favorite Day to Ride Cyclistic's Bike?", # Chart title
subtitle = "The total number of trips taken per day of week",
y = "# of Trips") + # y-axis title
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "M", scale = 1e-6),
limits = c(0, 1000000)) + # Set axis range
theme(
axis.title.x = element_blank(), # remove axis title
legend.title = element_blank(), # remove legend title
legend.position = "top", # legend position top
legend.justification = "left" # legend left justify
) +
scale_fill_brewer(palette = "Set2")
We can see that there are high number of bike rides on weekend, especially on Saturday. Other than that, the number of rides are quite consistent along the week.
Next, let’s see the breakdown for each user type.
all_trips_v3 %>%
group_by(user_type, day_of_week) %>%
summarise(rides_num = n()) %>%
ungroup() %>%
mutate(percentage =
formattable::percent(rides_num / sum(rides_num))) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| user_type | day_of_week | rides_num | percentage |
|---|---|---|---|
| One-time user | Sun | 437413 | 7.44% |
| One-time user | Mon | 292181 | 4.97% |
| One-time user | Tue | 278007 | 4.73% |
| One-time user | Wed | 293245 | 4.98% |
| One-time user | Thu | 311487 | 5.29% |
| One-time user | Fri | 346025 | 5.88% |
| One-time user | Sat | 510057 | 8.67% |
| Subscriber | Sun | 404503 | 6.88% |
| Subscriber | Mon | 474817 | 8.07% |
| Subscriber | Tue | 536485 | 9.12% |
| Subscriber | Wed | 547011 | 9.30% |
| Subscriber | Thu | 525483 | 8.93% |
| Subscriber | Fri | 472179 | 8.03% |
| Subscriber | Sat | 454015 | 7.72% |
The visualization looks like,
all_trips_v3 %>%
ggplot(aes(
x = day_of_week, # Month data
fill = user_type # Color per user type
)) +
geom_bar(width = 0.7) + # create a column bar chart
facet_wrap(~ user_type) +
labs(title = "Daily Trips per User Type", # Chart title
subtitle = "The total number of trips taken by each user type for each day of week",
y = "# of Trips") + # y-axis title
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 600000)) + # Set axis range
theme(
axis.title.x = element_blank(), # remove axis title
legend.title = element_blank(), # remove legend title
legend.position = "blank" # remove legend
) +
scale_fill_brewer(palette = "Set2")
From this visualization we can see that:
This analysis can give us an overview on the reason each user type use the bike on daily basis.
all_trips_v3 %>%
count(hour) %>%
mutate(percentage = formattable::percent(n / sum(n))) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| hour | n | percentage |
|---|---|---|
| 00 | 86288 | 1.47% |
| 01 | 55441 | 0.94% |
| 02 | 33665 | 0.57% |
| 03 | 20012 | 0.34% |
| 04 | 17829 | 0.30% |
| 05 | 47143 | 0.80% |
| 06 | 121431 | 2.06% |
| 07 | 228147 | 3.88% |
| 08 | 276640 | 4.70% |
| 09 | 226082 | 3.84% |
| 10 | 243570 | 4.14% |
| 11 | 301067 | 5.12% |
| 12 | 348968 | 5.93% |
| 13 | 353425 | 6.01% |
| 14 | 359080 | 6.10% |
| 15 | 410120 | 6.97% |
| 16 | 499449 | 8.49% |
| 17 | 586642 | 9.97% |
| 18 | 502530 | 8.54% |
| 19 | 372295 | 6.33% |
| 20 | 267081 | 4.54% |
| 21 | 216749 | 3.68% |
| 22 | 181260 | 3.08% |
| 23 | 127994 | 2.18% |
There are only small amount of data from the midnight to 5AM, or less than 2% of data, let’s filter them out from the visualization to make the result easier to view.
all_trips_v3 %>%
filter(as.numeric(hour) > 5) %>%
ggplot(aes(x = hour, # Hourly data
fill = user_type)) + # Color the bar
geom_bar(width = 0.7) + # Create stacked bar chart
labs(title = "At What Time of Day Cyclistic's Bike Mostly Used?", # chart title
subtitle = "The total number of trips taken for each time of day",
x = "Hour", # x-axis lable
y = "# of Trips") + # x-axis lable
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 650000)) + # Set axis range
theme(
legend.title = element_blank(), # remove legend title
legend.position = "top", # legend position top
legend.justification = "left" # legend left justify
) +
scale_fill_brewer(palette = "Set2") # RColorBrewer pallete
The gradual increase of bike rides after 08:00 until 17:00 could indicate the rides taken in this particular time are to commute to workplace/school. The peaks at 08:00 and 17:00 could also support that claim.
From the day-to-day analysis, we can see that the number of trips taken by the subscriber users is high during the week, while on the weekend there are more trips taken by the one-time users.
Let’s breakdown the hourly result from the previous analysis to show how each user type uses their bike on the weekday vs. on the weekend.
weekend <- c("Sat", "Sun")
all_trips_v3 %>%
mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
filter(as.numeric(hour) > 5) %>% # filter the data
count(user_type, week_type, hour) %>%
mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
ggplot(aes(hour, n_normal, fill = user_type)) +
geom_bar(stat = "identity", width = 0.7) +
facet_wrap( ~ week_type) +
labs(title = "Bike Trips on Weekday vs. Weekend",
subtitle = "The average number of trips taken on weekday vs. weekend for each time of day",
x = "Hour",
y = "# of Trips") +
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 100000)) +
theme(
legend.title = element_blank(), # remove legend title
legend.position = "top", # legend position top
legend.justification = "left", # legend left justify
) +
scale_fill_brewer(palette = "Set2") # RColorBrewer pallete
The pattern for the weekday showing the same result as the previous result on the number of trips per hour. On the other hand, the result for the weekend trips shows the number of trips is gradually increasing along the day and hit the peak at around 15:00.
We can look at each weekday and weekend result to better understand the different. For the weekday, we can look at the chart below to understand how each user type uses their bike.
all_trips_v3 %>%
mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
filter(as.numeric(hour) > 5 & week_type == "Weekday") %>% # filter the data
count(user_type, week_type, hour) %>%
mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
ggplot(aes(hour, n_normal, fill = user_type)) +
geom_bar(position = "dodge", stat = "identity", width = 0.7) +
facet_wrap( ~ user_type) +
labs(title = "Number of Trips on Weekday per User Type",
subtitle = "The average number of trips taken on weekday for each hour of day",
x = "Hour",
y = "# of Trips") +
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 60000)) +
theme(
legend.title = element_blank(), # remove legend title
legend.position = "blank", # legend position top
) +
scale_fill_brewer(palette = "Set2") # RColorBrewer palleted
For the weekday, subscriber users take more trips than the one-time users. Other than that, each user type have the same behavior. The peaks at 08:00 and 17:00 and the gradual increase of bike rides between those hours could possibly mean they use their bike to get to work/school. Data on the age group of each riders could help to clarify this statement.
For the weekend, we can look at the chart below to understand how each user type uses their bike.
all_trips_v3 %>%
mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
filter(as.numeric(hour) > 5 & week_type == "Weekend") %>% # filter the data
count(user_type, week_type, hour) %>%
mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
ggplot(aes(hour, n_normal, fill = user_type)) +
geom_bar(position = "dodge", stat = "identity", width = 0.7) +
facet_wrap( ~ user_type) +
labs(title = "Number of Trips on Weekend per User Type",
subtitle = "The average number of trips taken on weekend for each hour of day",
x = "Hour",
y = "# of Trips") +
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = unit_format(unit = "K", scale = 1e-3),
limits = c(0, 50000)) +
theme(
legend.title = element_blank(), # remove legend title
legend.position = "blank", # legend position top
) +
scale_fill_brewer(palette = "Set2") # RColorBrewer palleted
For the weekend, the behavior on how each user types use their bike are also the same. But not like on the weekday, the average number of rides taken by the one-time users are higher especially after 10:00. The high number of rides from both user types in middle of the day could indicate that people use their bike for sightseeing.
The high number of trips taken by the one-time users even on the weekday could support the claim that Cyclistic’s team made about casual users or the one-time users awareness on their bike-sharing service.
Let’s see the basic statistic for this variable.
summary(all_trips_v3$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.05 10.72 19.75 19.33 40705.02
This variable is in minutes, and the trips with the longest duration is 40705.02 minutes or about 28 days! The reasons is unknown for this case. But, visualizing this range will be a problem.
Let’s only include the data in the IQR range. Because the minimum value already 0, the next task is to find the 75% percentile of the data.
percentile <- quantile(all_trips_v3$trip_duration,
probs = 0.75) # find the 75% percentile
IQR <- IQR(all_trips_v3$trip_duration) # find the IQR
upper <- percentile[1] + 1.5*IQR # find the upper whisker
Let’s try to use the data up to the 95 percentile to analyze the distribution of this variable.
data_no_outliers <- all_trips_v3 %>%
filter(trip_duration < upper)
print(paste("Removed:",
nrow(all_trips_v3) - nrow(data_no_outliers), "rows"))
## [1] "Removed: 426698 rows"
Looking at our new trip duration summary.
summary(data_no_outliers$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.767 9.933 12.237 16.717 39.250
Let’s see how this new range looks like.
data_no_outliers %>%
ggplot(aes(x = trip_duration)) +
geom_density(fill = "#41b6e6", color = "#67C5ED") +
labs(title = "Trip Duration Variable Distribution",
x = "Trip Duration",
y = "Density") +
scale_y_continuous(labels = scales::percent, # set axis format
expand = c(0, 0)) # set start point to 0
Let’s see the distribution of the trip duration variable for each user type.
data_no_outliers %>%
group_by(user_type) %>%
summarize(
min = min(trip_duration), # minimum value
q1 = quantile(trip_duration, 0.25), # 25% percentile
median = median(trip_duration), # median value
mean = mean(trip_duration), # mean value
q3 = quantile(trip_duration, 0.75), # 75% percentila
max = max(trip_duration) # minimum value
) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| user_type | min | q1 | median | mean | q3 | max |
|---|---|---|---|---|---|---|
| One-time user | 0 | 7.216667 | 12.05 | 14.16656 | 19.60000 | 39.25 |
| Subscriber | 0 | 5.100000 | 8.70 | 10.99073 | 14.73333 | 39.25 |
Let’s visualize it to better understand the distribution.
data_no_outliers %>%
ggplot(aes(x = user_type, y = trip_duration, fill = user_type)) +
geom_boxplot(outlier.size = 1,
alpha = 0.8) +
labs(title = "Who Rides Their Bike Longer?",
subtitle = "The distribution of trip duration for each user type",
y = "Trip Duration (Minutes)") +
theme(axis.title.x = element_blank(), # Remove axis title
legend.position = "none" # Remove legend
) +
scale_fill_brewer(palette = "Set2") # RColorBrewer pallete
By looking at the previous result on how each user type use their bike on daily basis, this result should make sense. The high number of trips by subscriber users along the week could indicate that they use their bike to commute to work. Lower trip duration from the data above meaning getting to their destination fast is their priority.
Let’s analyze the popular bike pick-up location for each user type.
From the visual above, we can conclude several things:
all_trips_v3 %>%
count(bike_type, name = "trips_count") %>%
mutate(
percentage = formattable::percent(
trips_count / sum(trips_count))
) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| bike_type | trips_count | percentage |
|---|---|---|
| Classic bike | 2896604 | 49.24% |
| Docked bike | 207981 | 3.54% |
| Electric bike | 2778323 | 47.23% |
There are only 3.5% of docked bike usage from the past 12 month.
Let’s see the breakdown for each user type.
all_trips_v3 %>%
count(user_type, bike_type, name = "trips_count") %>%
mutate(
percentage = formattable::percent(
trips_count / sum(trips_count))
) %>%
kbl() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| user_type | bike_type | trips_count | percentage |
|---|---|---|---|
| One-time user | Classic bike | 1031520 | 17.53% |
| One-time user | Docked bike | 207981 | 3.54% |
| One-time user | Electric bike | 1228914 | 20.89% |
| Subscriber | Classic bike | 1865084 | 31.70% |
| Subscriber | Electric bike | 1549409 | 26.34% |
And only the one-time users are using the docked bike from the whole data. The reason behind this low usage is outside of this analysis. Because there only small amount of trips using this bike, we’ll filter it for the next analysis.
all_trips_v3 %>%
filter(bike_type != "Docked bike") %>%
count(user_type, bike_type, name = "trips_count") %>%
group_by(user_type) %>%
mutate(
percentage = formattable::percent(
trips_count / sum(trips_count))
) %>%
ggplot(aes(
x = bike_type, # bike type data
y = percentage, # % data
fill = bike_type # color per bike type
)) +
geom_bar(stat = "identity", # accepting y data
width = 0.7) + # create stacked bar chart
facet_wrap(~ user_type) + # create facet
labs(title = "Percentage of Trips by Bike Type Usage per User Type", # Chart title
y = "% of Trips") + # y-axis title
scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
labels = scales::percent,
limits = c(0, 0.6)) + # set axis range
theme(
axis.title.x = element_blank(), # remove axis title
legend.title = element_blank(), # remove legend title
legend.position = "None" # remove legend
) + scale_fill_manual(values = ccp)
From the chart above we can conclude that:
More than 50% of trips taken by one-time users are taken with the electric bike. On the other hand, more than 50% of trips taken by subscriber users are taken with the classic bike. This means electric bikes are more popular for one-time users than subscriber users.
The high usage of classic bike type from the subscriber users is caused by the fact that you will get unlimited 45-min classic bike rides included in the subscription package. Looking at the average trip duration by the subscriber users can also confirm these findings.
Questions
How should you organize your data to perform analysis on it?
Most of the analysis are done by aggregating the data based on the specific variables that wanted to observe, for example if we wanted to show how the month-to-month number of rides for each user type, we aggregate the data by month and show the number of rides for each month by each of the user type.
Has your data been properly formatted?
Yes. Formatting the data by sort and filter them to only show the data we need have been done above. Formatting the data can also helps us to understand them easily.
What surprises did you discover in the data?
There are some interesting discoveries I found in them:
Even though subscriber users ride more bikes than the one-time users, they only use the bike for shorter amount of time than the one-time users on average.
The different pick-up location for each user type can help to indicate how each user type use their bike. For subscriber users, there are more trips taken from the pick-up location in the downtown area. Meanwhile, more trips are taken from the pick-up location on the coastal area, or are that is close with tourist sites.
What trends or relationships did you find in the data?
Some trends or relationships I found in the data:
How will these insights help answer your business questions?
The insights I found can be used to identify how each user type, especially the one-time users of Cyclistic on how they use the bike and some strategies to perform to convert them to the subscriber users.
Questions
What is your final conclusion based on your analysis?
There are clear differences between how subscriber users and one-time users of a Cyclistic bike-sharing company use their bikes.
How could your team and business apply your insights?
Having the knowledge of understanding users’ behavior on how and why they use the bike can help the business on making decisions on strategies to target specific user types based on how they use the service.
What next steps would you or your stakeholders take based on your findings?
Some actions/recommendations from this analysis are:
Perform marketing campaign on high trips months, which is from May - October.
Add subscription variation to include 6-months period subscription.
Create a special promotion price for university students.
Is there additional data you could use to expand on your findings?
Probably the data about the age of each user can help to gain insights, like confirming how each user uses their bike, or is there any relationship between the number of rides and the duration of each trip for each age group for example. Knowing this could also help the business on creating a more specific targeted campaign that is unique for each age group.