Introduction

This project is my attempt to solve the first case study from the Google Data Analytics Professional Certificate Capstone Projects on Coursera that you can access here. I really enjoy working on this project because not the challenge on this project is ranging from the size of the data provided, to the working on how to create a meaningful analysis to solve the main problem itself.

This project will be about analyzing users behavior on how they use the service of a bike-sharing company based in Chicago, called Cyclistic.

The structure of this project will follow the data analysis life cycle phases that is taught on the course. That phases include:

  1. Ask: This part is all about identifying the business problem and find out how the analysis could help solve the business problem.

  2. Prepare: The phase where we will find out about all the information regarding the data that will be used in this analysis. Where it’s stored, how it’s organized, whether there is any problem regarding the data, etc.

  3. Process: This is where the data wrangling happens. This phase will focus on all the cleaning and transformation steps taken to clean and prepare the data for the next phase, the analyze phase.

  4. Analyze: This is where all the analysis is going to happen. The phase where we’re going to find out how the behavior of Cyclistics users differ and what kind of insight we can gather from it.

  5. Share: The main point from this part is the data visualization and what are the steps to visualize the data in compelling ways. Most of what is going to be in this phase will be covered in the previous phase. The reason is that showing the analysis result together with the visualization can helps the reader understand the result faster.

  6. Act: This part will be focus on explaining every recommendation from the result.

On each step above, there are also going to be some guiding questions that have been provided to guide the students on what they should consider doing when working on this project. These guiding questions are helping me personally to build the habit of documenting my work.

If you have any question regarding this project, don’t hesitate to contact me through email here, or send me a message on LinkedIn here.

Background

Cyclistic is a bike-sharing company that launched back in 2016. Since then, the program has grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Business Problem

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, they believe that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, they believe there is a very good chance to convert casual riders into members. They notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

They have set a clear goal:

Design marketing strategies aimed at converting casual riders into annual members.

In order to do that, however, the marketing analyst team needs to better understand several things:

The team is interested in analyzing the Cyclistic historical bike trip data to identify trends. And for this project, I’m going to work to answer the first question, which is:

How do annual members and casual riders differ?

Analysis

1. Ask Phase

Questions

What is the problem you are trying to solve?

Understanding the different behavior of Cyclistic users, both casual and member users, on how they use Cyclistic service differently.

How can your insights drive business decisions?

By understanding the different behavior of each user group, we can help the Cyclistic team to create a clear targeted marketing campaign to convert casual users into Cyclistic members.

2. Prepare Phase

Questions

Where is your data located?

The data is coming from a real bike-sharing company that is also based in Chicago called Divvy. The data have been made available to be accessed publicly through this link.

How is the data organized?

The data is organized by monthly trip data in *.csv format that is compressed in a *.zip format for each month. Meaning, each *.csv file contains the historical trip data for a specific month. For this analysis, we’re going to need the historical trip data from the past 12 months, which means the trip data from August 2021 to September 2022 will be used.

Are there issues with bias or credibility in this data? Does your data ROCCC (Reliable, Original, Comprehensive, Current, Cited)?

Overall, the data is ROCCC! It’s first-party data that is owned by a real business and the data is constantly updated, so it’s already reliable, original, comprehensive, and current. The only missing piece is the information about what kind of information is contained in each data, or it’s missing the Cited part.

How are you addressing licensing, privacy, security, and accessibility?

The data is coming from a real company that is organized by the City of Chicago and maintained by them. The data can be accessed publicly, meaning there’s no need to perform specific agreements to use this data. The only thing to consider is the terms and conditions for using the data that is written here. That being said, the data can be used to be used for this analysis.

How did you verify the data integrity?

The data is first-party data from a real company, so it’s easy to trust that the data is complete, accurate, and consistent.

How does it help you answer your question?

The data already aligns well with the problem the business is currently facing, which is understanding how each group of users use their service differently.

Are there any problems with the data?

The lack of metadata made the data a little bit hard to understand at first. But the variables that exist in the data can be recognizable by its name.

3. Process Phase

For this part, we’re going to start to getting our hands on some code. I’m going to use R programming language to import, wrangle, and transform the data.

library(tidyverse) # helps wrangle data
library(janitor) # helps wrangle data
library(lubridate) # helps wrangle date attributes
library(kableExtra) # helps format table on output
library(patchwork) # additional library that help plotting
library(scales) # helps with chart display
theme_set(theme_bw()) # set default ggplot2 theme
ccp <- RColorBrewer::brewer.pal(8, "Set2")
ccp <- ccp[4:length(ccp)]

3.1 Load Data

The data to support the analysis is coming from a real bike-sharing company that is also based in Chicago called Divvy. The data have been made available to be accessed publicly through this link.

The data is organized by monthly trip data in *.csv format that is compressed in a *.zip format for each month. Meaning, each *.csv file contains the historical trip data for a specific month. For this analysis, we’re going to need the historical trip data from the past 12 months, which means the trip data from September 2021 to August 2022 will be used.

To help with the analysis, let’s combine the data into a single dataframe. To do this, dplyr::bind_rows function will be used.

all_trips <-
  list.files(path = '../data/',
             pattern = "*.csv",
             full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows

We can check the how is our data for now using glimpse() function.

glimpse(all_trips)
## Rows: 5,883,043
## Columns: 13
## $ ride_id            <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32", "6EF7213790…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-09-28 16:07:10, 2021-09-28 14:24:51, 2021-09-…
## $ ended_at           <dttm> 2021-09-28 16:09:54, 2021-09-28 14:40:05, 2021-09-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Clark St &…
## $ start_station_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "TA13070001…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.89000, 41.94000, 41.81000, 41.80000, 41.88000, 4…
## $ start_lng          <dbl> -87.68000, -87.64000, -87.72000, -87.72000, -87.740…
## $ end_lat            <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.88, 41.74, 41…
## $ end_lng            <dbl> -87.67, -87.67, -87.72, -87.72, -87.71, -87.74, -87…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…

3.2 Data Wrangling and Transformation

Checking for Duplicates

check_duplicate <- all_trips %>%
  distinct() # dplyr::distinct() to find duplicate

print(paste("Duplicated rows:", 
            nrow(all_trips) - nrow(check_duplicate)))
## [1] "Duplicated rows: 0"

We have no duplicate row in our data.

Renaming member_casual Column

Renaming the member_casual column into user_type can help to understand the data better.

all_trips_v2 <- rename(all_trips, user_type = member_casual)

Changing Values in user_type Column

Still on the same column, let’s change the values in this column, "casual" to "One-time User" and "member" to "Subscriber".

all_trips_v2 <-
  all_trips_v2 %>% 
  mutate(user_type = 
           recode(user_type, 
                  "casual" = "One-time user", 
                  "member" = "Subscriber"))

Renaming rideable_type Column

Let’s also change the column name of rideable_type into bike_type.

all_trips_v2 <- 
  rename(all_trips_v2, 
         bike_type = rideable_type)

Changing Values in bike_type Column

Let’s see how the values in this column is displayed.

unique(all_trips_v2$bike_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"

Let’s rename the values to help with the visualization.

all_trips_v2 <-
  all_trips_v2 %>% 
  mutate(bike_type = 
           recode(bike_type, 
                  "electric_bike" = "Electric bike", 
                  "classic_bike" = "Classic bike",
                  "docked_bike" = "Docked bike"))
unique(all_trips_v2$bike_type)
## [1] "Electric bike" "Classic bike"  "Docked bike"

Adding Date/Time Related Columns

We can add a level of detail to the analysis by adding some columns that are related to the time each ride is taken. These columns include the month, day of the month, day of the week, and the hour each rides is taken.

all_trips_v2$date <-
  as.Date(all_trips_v2$started_at) # the date of each rides
all_trips_v2$month <-
  format(as.Date(all_trips_v2$date), "%b") # month of each rides
all_trips_v2$day_of_month <-
  format(as.Date(all_trips_v2$date), "%d") # day of month of each rides
all_trips_v2$day_of_week <-
  format(as.Date(all_trips_v2$date), "%a") # weekday of each rides
all_trips_v2$hour <-
  format(as.POSIXct(all_trips_v2$started_at), "%H") # hour of day

Let’s check these new columns.

all_trips_v2 %>%
  select(date, month, day_of_month, day_of_week, hour) %>%
  glimpse()
## Rows: 5,883,043
## Columns: 5
## $ date         <date> 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-…
## $ month        <chr> "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "…
## $ day_of_month <chr> "28", "28", "28", "28", "28", "28", "28", "28", "27", "28…
## $ day_of_week  <chr> "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "…
## $ hour         <chr> "16", "14", "00", "14", "09", "01", "07", "11", "19", "11…

Adding Trip Duration Column

all_trips_v2$trip_duration <-
  difftime(all_trips_v2$ended_at, all_trips_v2$started_at, units = "mins") %>%
  as.numeric()

Before going to far, let’s check out this newly added column.

all_trips_v2 %>%
  select(trip_duration) %>%
  summary()
##  trip_duration     
##  Min.   : -137.42  
##  1st Qu.:    6.05  
##  Median :   10.72  
##  Mean   :   19.75  
##  3rd Qu.:   19.33  
##  Max.   :40705.02

Filter Out Trip Duration Under 0 Minute

all_trips_v3 <- filter(all_trips_v2, !trip_duration < 0)

Let’s see the cleaned result.

glimpse(all_trips_v3)
## Rows: 5,882,908
## Columns: 19
## $ ride_id            <chr> "9DC7B962304CBFD8", "F930E2C6872D6B32", "6EF7213790…
## $ bike_type          <chr> "Electric bike", "Electric bike", "Electric bike", …
## $ started_at         <dttm> 2021-09-28 16:07:10, 2021-09-28 14:24:51, 2021-09-…
## $ ended_at           <dttm> 2021-09-28 16:09:54, 2021-09-28 14:40:05, 2021-09-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Clark St &…
## $ start_station_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "TA13070001…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.89000, 41.94000, 41.81000, 41.80000, 41.88000, 4…
## $ start_lng          <dbl> -87.68000, -87.64000, -87.72000, -87.72000, -87.740…
## $ end_lat            <dbl> 41.89, 41.98, 41.80, 41.81, 41.88, 41.88, 41.74, 41…
## $ end_lng            <dbl> -87.67, -87.67, -87.72, -87.72, -87.71, -87.74, -87…
## $ user_type          <chr> "One-time user", "One-time user", "One-time user", …
## $ date               <date> 2021-09-28, 2021-09-28, 2021-09-28, 2021-09-28, 20…
## $ month              <chr> "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "Sep", "S…
## $ day_of_month       <chr> "28", "28", "28", "28", "28", "28", "28", "28", "27…
## $ day_of_week        <chr> "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "Tue", "T…
## $ hour               <chr> "16", "14", "00", "14", "09", "01", "07", "11", "19…
## $ trip_duration      <dbl> 2.7333333, 15.2333333, 3.6833333, 8.8166667, 10.533…

Questions

What tools are you choosing and why?

I’m going to use R to work on this project. There are two reasons for that, first I want to know how to perform analysis using R, and second I just want to implement the knowledge about this programming language from the course.

Have you ensured your data integrity?

By looking at the data, we can see that the data is containing the records of every trip ever taken using the Cyclistic bike for the last 12 months, meaning the data is accurate. Next, each column is consistent among each file that contains those records. This means we can ensure that the data integrity is preserved.

What steps have you taken to ensure that your data is clean?

Checking for duplicated trips can prevent the wrong result in the analysis. Also, check the data type of each column to make sure the value in each column is matched with the data type that represents them.

How can you verify that your data is clean and ready to analyze?

By using the steps above, and the transformation that has been taken, I can be sure that the data is clean and ready to analyze.

Have you documented your cleaning process so you can review and share those results?

Every step I took to clean the data area is documented above.

4. Analyze Phase

This is where we will perform the analysis on the data we have to answer the business problem stated above.

Trips per User Type

Let’s see the overview on how each user type, the one-time user and subscriber differ.

all_trips_v3 %>%
  count(user_type) %>%
  mutate(rides_num = n, 
         percentage = formattable::percent(n / sum(n))) %>% # format as percentage
  select(user_type, rides_num, percentage) %>% 
  arrange(desc(rides_num)) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
user_type rides_num percentage
Subscriber 3414493 58.04%
One-time user 2468415 41.96%

We can see that subscriber users ride more bike for the past 12-month, which are 58% or 17% higher than the one-time users.

Trips per Months

Let’s add some level of detail to show how the number of trips differ per month.

all_trips_v3$month <-
  ordered(
    all_trips_v3$month,
    levels = c(
      "Jan",
      "Feb",
      "Mar",
      "Apr",
      "May",
      "Jun",
      "Jul",
      "Aug",
      "Sep",
      "Oct",
      "Nov",
      "Dec"
    )
  )
user_type month trips_num percentage
One-time user Jan 18520 0.75%
One-time user Feb 21416 0.87%
One-time user Mar 89880 3.64%
One-time user Apr 126417 5.12%
One-time user May 280414 11.36%
One-time user Jun 369044 14.95%
One-time user Jul 406046 16.45%
One-time user Aug 358917 14.54%
One-time user Sep 363883 14.74%
One-time user Oct 257242 10.42%
One-time user Nov 106898 4.33%
One-time user Dec 69738 2.83%
Subscriber Jan 85250 2.50%
Subscriber Feb 94193 2.76%
Subscriber Mar 194160 5.69%
Subscriber Apr 244832 7.17%
Subscriber May 354443 10.38%
Subscriber Jun 400148 11.72%
Subscriber Jul 417426 12.23%
Subscriber Aug 427000 12.51%
Subscriber Sep 392228 11.49%
Subscriber Oct 373984 10.95%
Subscriber Nov 253027 7.41%
Subscriber Dec 177802 5.21%

Let’s see how this data in visual form.

p_heatmap <- all_trips_v3 %>%
  count(month, day_of_month) %>%
  ggplot(aes(
    x = day_of_month,
    y = reorder(month, desc(month)),
    fill = n
  )) +
  geom_tile(color = "white",
            lwd = 0.8) +
  coord_fixed() +
  labs(title = "The Most Trips Taken For the Last 12 Months", 
       x = "Day of Month") +
  guides(fill = guide_colourbar(
    barwidth = 0.8,
    barheight = 8,
    title = "Trips"
  )) +
  theme(axis.title.y = element_blank(), # Remove axis title
        panel.grid.major = element_blank()) + # Remove gridlines
  scale_fill_distiller(palette = "YlGnBu", direction = 1)

patchwork::wrap_plots(p_heatmap)

From the chart above, May to October is the months with the highest number of trips taken by Cyclistic users. Next, let’s see the breakdown for each user type for each month.

all_trips_v3 %>%
  ggplot(aes(
    x = month, # Month data
    fill = user_type # Color per user type
  )) +
  geom_bar(width = 0.7) + # Create stacked bar chart
  labs(title = "Favorite Month to Ride Cyclistic's Bike?", # Chart title
       subtitle = "The total number of trips taken for each month from the past 12-months\n(September 2021 - August 2022)",
       y = "# of Trips") + # y-axis title
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 900000)) + # Set axis range
  theme(
    axis.title.x = element_blank(), # remove axis title
    legend.title = element_blank(), # remove legend title
    legend.position = "top", # legend position top
    legend.justification = "left" # legend left justify
  ) +
  scale_fill_brewer(palette = "Set2")

Here, we can see that the number of rides rose from January to July, and continuously decreased after that. The number of rides taken in June - September is the highest throughout the year, with more than 750 thousand bike rides per month, or more than 50% of the total bike rides are taken during these months. Besides that, May and October also have a quite high number of bike rides.

Looking at the individual user type can help us understand their behavior on a monthly level.

all_trips_v3 %>%
  ggplot(aes(
    x = month, # Month data
    fill = user_type # Color per user type
  )) +
  geom_bar(width = 0.7) +  # create a column bar chart
  facet_wrap(~ user_type) + 
  labs(title = "Monthly Trips per User Type", # Chart title
       subtitle = "The total number of trips taken by each user type for each month",
       y = "# of Trips") + # y-axis title
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 450000)) + # Set axis range
  theme(
    axis.title.x = element_blank(), # remove axis title
    legend.title = element_blank(), # remove legend title
    legend.position = "blank" # remove legend
  ) +
  scale_fill_brewer(palette = "Set2")

We can see conclude that:

  1. There are more rides by the subscriber users each month.
  2. Both user type has the same pattern in term of the number of bike ride they take each month.
  3. The number of rides by one-time users increasing significantly on May.

It looks like the number of trips taken is correlated with the temperature. To prove this, let’s look how the monthly temperature in Chicago looks like.

# Relationship with the temperature
## Average monthly data from 1872 - 2022 for Chicago area
## https://www.weather.gov/wrh/climate?wfo=lot

month <- c("Jan",
           "Feb",
           "Mar",
           "Apr",
           "May",
           "Jun",
           "Jul",
           "Aug",
           "Sep",
           "Oct",
           "Nov",
           "Dec")

mean_temp <-
  c(24.2,
    27.0,
    36.9,
    48.1,
    58.7,
    68.7,
    74.0,
    72.6,
    65.5,
    53.9,
    40.3,
    29.1)

chicago_monthly_temp <- data.frame(month, mean_temp)

chicago_monthly_temp$month <-
  ordered(
    chicago_monthly_temp$month,
    levels = c(
      "Jan",
      "Feb",
      "Mar",
      "Apr",
      "May",
      "Jun",
      "Jul",
      "Aug",
      "Sep",
      "Oct",
      "Nov",
      "Dec"
    )
  )
chicago_monthly_temp %>%
  ggplot(aes(month, mean_temp)) +
  geom_bar(stat = 'identity', width = 0.7) +
  labs(title = "Chicago Monthly Average Temperature (°F) from 1872 - 2022",
       y = "°F") +
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     limits = c(0, 80)) + # set the axis limit
  theme(axis.title.x = element_blank()) # Remove axis label

We can clearly see how the two variables are correlated.

Trips per Day

all_trips_v3$day_of_week <-
  ordered(all_trips_v3$day_of_week,
          levels = c("Sun",
                     "Mon",
                     "Tue",
                     "Wed",
                     "Thu",
                     "Fri",
                     "Sat"))

First, let’s see the number of trips taken for each day.

all_trips_v3 %>% 
  count(day_of_week) %>% 
  mutate(percentage = formattable::percent(n / sum(n))) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
day_of_week n percentage
Sun 841916 14.31%
Mon 766998 13.04%
Tue 814492 13.85%
Wed 840256 14.28%
Thu 836970 14.23%
Fri 818204 13.91%
Sat 964072 16.39%

Visualizing the data can help to get the insight faster.

all_trips_v3 %>%
  ggplot(aes(
    x = day_of_week, # Month data
    fill = user_type # Color per user type
  )) +
  geom_bar(width = 0.7) + # create a stacked bar chart 
  labs(title = "Favorite Day to Ride Cyclistic's Bike?", # Chart title
       subtitle = "The total number of trips taken per day of week",
       y = "# of Trips") + # y-axis title
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "M", scale = 1e-6),
                     limits = c(0, 1000000)) + # Set axis range
  theme(
    axis.title.x = element_blank(), # remove axis title
    legend.title = element_blank(), # remove legend title
    legend.position = "top", # legend position top
    legend.justification = "left" # legend left justify
  ) +
  scale_fill_brewer(palette = "Set2")

We can see that there are high number of bike rides on weekend, especially on Saturday. Other than that, the number of rides are quite consistent along the week.

Next, let’s see the breakdown for each user type.

all_trips_v3 %>% 
  group_by(user_type, day_of_week) %>% 
  summarise(rides_num = n()) %>%
  ungroup() %>% 
  mutate(percentage = 
           formattable::percent(rides_num / sum(rides_num))) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
user_type day_of_week rides_num percentage
One-time user Sun 437413 7.44%
One-time user Mon 292181 4.97%
One-time user Tue 278007 4.73%
One-time user Wed 293245 4.98%
One-time user Thu 311487 5.29%
One-time user Fri 346025 5.88%
One-time user Sat 510057 8.67%
Subscriber Sun 404503 6.88%
Subscriber Mon 474817 8.07%
Subscriber Tue 536485 9.12%
Subscriber Wed 547011 9.30%
Subscriber Thu 525483 8.93%
Subscriber Fri 472179 8.03%
Subscriber Sat 454015 7.72%

The visualization looks like,

all_trips_v3 %>%
  ggplot(aes(
    x = day_of_week, # Month data
    fill = user_type # Color per user type
  )) +
  geom_bar(width = 0.7) +  # create a column bar chart
  facet_wrap(~ user_type) + 
  labs(title = "Daily Trips per User Type", # Chart title
       subtitle = "The total number of trips taken by each user type for each day of week",
       y = "# of Trips") + # y-axis title
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 600000)) + # Set axis range
  theme(
    axis.title.x = element_blank(), # remove axis title
    legend.title = element_blank(), # remove legend title
    legend.position = "blank" # remove legend
  ) +
  scale_fill_brewer(palette = "Set2")

From this visualization we can see that:

  1. There are differences between each user type on how they use their bike on day-to-day basis. Along the week, more rides are taken by the subscriber users. Meanwhile, more rides are taken on the weekday by the one-time users.
  2. The high number of rides by subscriber users along the week could possibly mean they use their bike to commute to the workplace.
  3. The high number of rides by one-time users on the weekend could indicate that riders use their bike on this particular days for sightseeing.

Trips per Hour

This analysis can give us an overview on the reason each user type use the bike on daily basis.

all_trips_v3 %>%
  count(hour) %>% 
  mutate(percentage = formattable::percent(n / sum(n))) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
hour n percentage
00 86288 1.47%
01 55441 0.94%
02 33665 0.57%
03 20012 0.34%
04 17829 0.30%
05 47143 0.80%
06 121431 2.06%
07 228147 3.88%
08 276640 4.70%
09 226082 3.84%
10 243570 4.14%
11 301067 5.12%
12 348968 5.93%
13 353425 6.01%
14 359080 6.10%
15 410120 6.97%
16 499449 8.49%
17 586642 9.97%
18 502530 8.54%
19 372295 6.33%
20 267081 4.54%
21 216749 3.68%
22 181260 3.08%
23 127994 2.18%

There are only small amount of data from the midnight to 5AM, or less than 2% of data, let’s filter them out from the visualization to make the result easier to view.

all_trips_v3 %>%
  filter(as.numeric(hour) > 5) %>% 
  ggplot(aes(x = hour, # Hourly data
             fill = user_type)) + # Color the bar
  geom_bar(width = 0.7) + # Create stacked bar chart
  labs(title = "At What Time of Day Cyclistic's Bike Mostly Used?", # chart title
       subtitle = "The total number of trips taken for each time of day",
       x = "Hour", # x-axis lable
       y = "# of Trips") + # x-axis lable
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 650000)) + # Set axis range
  theme(
    legend.title = element_blank(), # remove legend title
    legend.position = "top", # legend position top
    legend.justification = "left" # legend left justify
  ) +
  scale_fill_brewer(palette = "Set2") # RColorBrewer pallete

The gradual increase of bike rides after 08:00 until 17:00 could indicate the rides taken in this particular time are to commute to workplace/school. The peaks at 08:00 and 17:00 could also support that claim.

Weekday vs. Weekend Analysis

From the day-to-day analysis, we can see that the number of trips taken by the subscriber users is high during the week, while on the weekend there are more trips taken by the one-time users.

Let’s breakdown the hourly result from the previous analysis to show how each user type uses their bike on the weekday vs. on the weekend.

weekend <- c("Sat", "Sun")

all_trips_v3 %>%
  mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
  filter(as.numeric(hour) > 5) %>% # filter the data
  count(user_type, week_type, hour) %>%
  mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
  ggplot(aes(hour, n_normal, fill = user_type)) +
  geom_bar(stat = "identity", width = 0.7) +
  facet_wrap( ~ week_type) +
  labs(title = "Bike Trips on Weekday vs. Weekend",
       subtitle = "The average number of trips taken on weekday vs. weekend for each time of day",
       x = "Hour", 
       y = "# of Trips") +
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 100000)) + 
  theme(
    legend.title = element_blank(), # remove legend title
    legend.position = "top", # legend position top
    legend.justification = "left", # legend left justify
  ) +
  scale_fill_brewer(palette = "Set2") # RColorBrewer pallete

The pattern for the weekday showing the same result as the previous result on the number of trips per hour. On the other hand, the result for the weekend trips shows the number of trips is gradually increasing along the day and hit the peak at around 15:00.

We can look at each weekday and weekend result to better understand the different. For the weekday, we can look at the chart below to understand how each user type uses their bike.

all_trips_v3 %>%
  mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
  filter(as.numeric(hour) > 5 & week_type == "Weekday") %>% # filter the data
  count(user_type, week_type, hour) %>%
  mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
  ggplot(aes(hour, n_normal, fill = user_type)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.7) +
  facet_wrap( ~ user_type) +
  labs(title = "Number of Trips on Weekday per User Type",
       subtitle = "The average number of trips taken on weekday for each hour of day",
       x = "Hour", 
       y = "# of Trips") +
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 60000)) + 
  theme(
    legend.title = element_blank(), # remove legend title
    legend.position = "blank", # legend position top
  ) +
  scale_fill_brewer(palette = "Set2") # RColorBrewer palleted

For the weekday, subscriber users take more trips than the one-time users. Other than that, each user type have the same behavior. The peaks at 08:00 and 17:00 and the gradual increase of bike rides between those hours could possibly mean they use their bike to get to work/school. Data on the age group of each riders could help to clarify this statement.

For the weekend, we can look at the chart below to understand how each user type uses their bike.

all_trips_v3 %>%
  mutate(week_type = ifelse(all_trips_v3$day_of_week %in% weekend, "Weekend", "Weekday")) %>%
  filter(as.numeric(hour) > 5 & week_type == "Weekend") %>% # filter the data
  count(user_type, week_type, hour) %>%
  mutate(n_normal = ifelse(week_type == "Weekend", n / 2, n / 5)) %>%
  ggplot(aes(hour, n_normal, fill = user_type)) +
  geom_bar(position = "dodge", stat = "identity", width = 0.7) +
  facet_wrap( ~ user_type) +
  labs(title = "Number of Trips on Weekend per User Type",
       subtitle = "The average number of trips taken on weekend for each hour of day",
       x = "Hour", 
       y = "# of Trips") +
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = unit_format(unit = "K", scale = 1e-3),
                     limits = c(0, 50000)) + 
  theme(
    legend.title = element_blank(), # remove legend title
    legend.position = "blank", # legend position top
  ) +
  scale_fill_brewer(palette = "Set2") # RColorBrewer palleted

For the weekend, the behavior on how each user types use their bike are also the same. But not like on the weekday, the average number of rides taken by the one-time users are higher especially after 10:00. The high number of rides from both user types in middle of the day could indicate that people use their bike for sightseeing.

The high number of trips taken by the one-time users even on the weekday could support the claim that Cyclistic’s team made about casual users or the one-time users awareness on their bike-sharing service.

Trip Duration Analysis

Let’s see the basic statistic for this variable.

summary(all_trips_v3$trip_duration)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     6.05    10.72    19.75    19.33 40705.02

This variable is in minutes, and the trips with the longest duration is 40705.02 minutes or about 28 days! The reasons is unknown for this case. But, visualizing this range will be a problem.

Let’s only include the data in the IQR range. Because the minimum value already 0, the next task is to find the 75% percentile of the data.

percentile <- quantile(all_trips_v3$trip_duration,
                       probs = 0.75) # find the 75% percentile


IQR <- IQR(all_trips_v3$trip_duration) # find the IQR
 
upper <- percentile[1] + 1.5*IQR # find the upper whisker

Let’s try to use the data up to the 95 percentile to analyze the distribution of this variable.

data_no_outliers <- all_trips_v3 %>%
  filter(trip_duration < upper)

print(paste("Removed:", 
            nrow(all_trips_v3) - nrow(data_no_outliers), "rows"))
## [1] "Removed: 426698 rows"

Looking at our new trip duration summary.

summary(data_no_outliers$trip_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.767   9.933  12.237  16.717  39.250

Let’s see how this new range looks like.

data_no_outliers %>% 
  ggplot(aes(x = trip_duration)) + 
  geom_density(fill = "#41b6e6", color = "#67C5ED") +
  labs(title = "Trip Duration Variable Distribution",
       x = "Trip Duration",
       y = "Density") +
  scale_y_continuous(labels = scales::percent, # set axis format
                     expand = c(0, 0)) # set start point to 0

Let’s see the distribution of the trip duration variable for each user type.

data_no_outliers %>%
  group_by(user_type) %>%
  summarize(
    min = min(trip_duration), # minimum value
    q1 = quantile(trip_duration, 0.25), # 25% percentile
    median = median(trip_duration), # median value
    mean = mean(trip_duration), # mean value
    q3 = quantile(trip_duration, 0.75), # 75% percentila
    max = max(trip_duration) # minimum value
  ) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
user_type min q1 median mean q3 max
One-time user 0 7.216667 12.05 14.16656 19.60000 39.25
Subscriber 0 5.100000 8.70 10.99073 14.73333 39.25

Let’s visualize it to better understand the distribution.

data_no_outliers %>% 
  ggplot(aes(x = user_type, y = trip_duration, fill = user_type)) +
  geom_boxplot(outlier.size = 1,
               alpha = 0.8) +
  labs(title = "Who Rides Their Bike Longer?",
       subtitle = "The distribution of trip duration for each user type",
       y = "Trip Duration (Minutes)") + 
  theme(axis.title.x = element_blank(), # Remove axis title
        legend.position = "none" # Remove legend
  )  +
  scale_fill_brewer(palette = "Set2") # RColorBrewer pallete

By looking at the previous result on how each user type use their bike on daily basis, this result should make sense. The high number of trips by subscriber users along the week could indicate that they use their bike to commute to work. Lower trip duration from the data above meaning getting to their destination fast is their priority.

Pick-up Location Analysis

Let’s analyze the popular bike pick-up location for each user type.

From the visual above, we can conclude several things:

  1. The top bike pick-up location for subscriber users are more distributed along each point. This can be seen by the size of each circle that has quite similar size with the other. Meanwhile, we can see several spots with circle that is larger than the other circles on the top pick-up locations for the one-time users.
  2. The top pick-location are concentrated mainly on the south and the central area of the city.
  3. The top pick-up location for one-time users are more concentrated towards the area where most of the city’s tourism attractions are located, which is on the central area. On the other hand, The top pick-up location for the subscriber users are more concentrated toward the western area of the city.
  4. On the subscriber users’ result, the several spots on the bottom are located near the University of Chicago. This means, using bike sharing is quite popular for the college students.

Bike Type Analysis

all_trips_v3 %>% 
  count(bike_type, name = "trips_count") %>% 
  mutate(
    percentage = formattable::percent(
      trips_count / sum(trips_count))
    ) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
bike_type trips_count percentage
Classic bike 2896604 49.24%
Docked bike 207981 3.54%
Electric bike 2778323 47.23%

There are only 3.5% of docked bike usage from the past 12 month.

Let’s see the breakdown for each user type.

all_trips_v3 %>% 
  count(user_type, bike_type, name = "trips_count") %>% 
  mutate(
    percentage = formattable::percent(
      trips_count / sum(trips_count))
    ) %>%
  kbl() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
user_type bike_type trips_count percentage
One-time user Classic bike 1031520 17.53%
One-time user Docked bike 207981 3.54%
One-time user Electric bike 1228914 20.89%
Subscriber Classic bike 1865084 31.70%
Subscriber Electric bike 1549409 26.34%

And only the one-time users are using the docked bike from the whole data. The reason behind this low usage is outside of this analysis. Because there only small amount of trips using this bike, we’ll filter it for the next analysis.

all_trips_v3 %>% 
  filter(bike_type != "Docked bike") %>%
  count(user_type, bike_type, name = "trips_count") %>% 
  group_by(user_type) %>% 
  mutate(
    percentage = formattable::percent(
      trips_count / sum(trips_count))
    ) %>% 
  ggplot(aes(
    x = bike_type, # bike type data
    y = percentage, # % data
    fill = bike_type # color per bike type
  )) +
  geom_bar(stat = "identity", # accepting y data
           width = 0.7) + # create stacked bar chart
  facet_wrap(~ user_type) + # create facet
  labs(title = "Percentage of Trips by Bike Type Usage per User Type", # Chart title
       y = "% of Trips") + # y-axis title
  scale_y_continuous(expand = c(0, 0), # set start point of bar to 0
                     labels = scales::percent,
                     limits = c(0, 0.6)) + # set axis range
  theme(
    axis.title.x = element_blank(), # remove axis title
    legend.title = element_blank(), # remove legend title
    legend.position = "None" # remove legend
  ) + scale_fill_manual(values = ccp)

From the chart above we can conclude that:

  1. More than 50% of trips taken by one-time users are taken with the electric bike. On the other hand, more than 50% of trips taken by subscriber users are taken with the classic bike. This means electric bikes are more popular for one-time users than subscriber users.

  2. The high usage of classic bike type from the subscriber users is caused by the fact that you will get unlimited 45-min classic bike rides included in the subscription package. Looking at the average trip duration by the subscriber users can also confirm these findings.

Questions

How should you organize your data to perform analysis on it?

Most of the analysis are done by aggregating the data based on the specific variables that wanted to observe, for example if we wanted to show how the month-to-month number of rides for each user type, we aggregate the data by month and show the number of rides for each month by each of the user type.

Has your data been properly formatted?

Yes. Formatting the data by sort and filter them to only show the data we need have been done above. Formatting the data can also helps us to understand them easily.

What surprises did you discover in the data?

There are some interesting discoveries I found in them:

  1. Even though subscriber users ride more bikes than the one-time users, they only use the bike for shorter amount of time than the one-time users on average.

  2. The different pick-up location for each user type can help to indicate how each user type use their bike. For subscriber users, there are more trips taken from the pick-up location in the downtown area. Meanwhile, more trips are taken from the pick-up location on the coastal area, or are that is close with tourist sites.

What trends or relationships did you find in the data?

Some trends or relationships I found in the data:

  1. Summer is the season with the highest number of bike trips. This high number of rides are also correlated with the level of temperature on the season.
  2. Subscriber users are more active on the entire week, whereas the one-time users are more active on the weekend.

How will these insights help answer your business questions?

The insights I found can be used to identify how each user type, especially the one-time users of Cyclistic on how they use the bike and some strategies to perform to convert them to the subscriber users.

5. Share

Questions

Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?

Yes, the data can clearly show how different those users type use the Cyclistic bike.

What story does your data tell?

The data show the differences between how the subscriber users and the one-time user differ on how they use Cylistic bike.

Some stories we can create from the data are:

  1. There are more rides by the subscribers than the one-time users in the data. Even though the subscriber users take more trips, their duration per trip is lower than the one-trip users. There are two possibilities for this, first because the subscriber uses their bike mostly to commute, so getting to their destination faster is their main choice.

  2. Subscriber users use their bikes to commute to work/school and one-time users use their bikes for recreational purposes. This thought could be supported by the fact that subscriber users are more active during the week and most of their trips started from a bike station that is located in the downtown area of the city or close to a university. Whereas the one-time users are more active on the weekday and their most popular bike pick-up locations are located in the central area of the city or close to the tourist attraction area.

How do your findings relate to your original question?

The findings can help the financial analyst team to get information about how each user type uses Cyclistic bikes differently.

Who is your audience? What is the best way to communicate with them?

The financial analyst team of Cyclistic is the audience of this analysis. The best way to communicate the findings is through a presentation that not focusing too much on the technical aspect of the analysis, rather than the outline of the analysis only. This way, understanding the result and getting the answer they need can be delivered faster.

Can data visualization help you share your findings?

Yes, the visualization from this analysis certainly can help the audiences to understand the data more easily.

Is your presentation accessible to your audience?

For now, this notebook is the only way people can access this analysis, There are no specific presentation files made yet. But, hosting this analysis online can help to make the result accessible.

6. Act

Questions

What is your final conclusion based on your analysis?

There are clear differences between how subscriber users and one-time users of a Cyclistic bike-sharing company use their bikes.

How could your team and business apply your insights?

Having the knowledge of understanding users’ behavior on how and why they use the bike can help the business on making decisions on strategies to target specific user types based on how they use the service.

What next steps would you or your stakeholders take based on your findings?

Some actions/recommendations from this analysis are:

  1. Perform marketing campaign on high trips months, which is from May - October.

  2. Add subscription variation to include 6-months period subscription.

  3. Create a special promotion price for university students.

Is there additional data you could use to expand on your findings?

Probably the data about the age of each user can help to gain insights, like confirming how each user uses their bike, or is there any relationship between the number of rides and the duration of each trip for each age group for example. Knowing this could also help the business on creating a more specific targeted campaign that is unique for each age group.