Cyclistic Bike-Share Analysis

Project background:

This case study was completed as part of the Google Data Analytics course.
For a business stakeholder perspective, a PowerPoint version (with speaker notes) is available at this LINK.
(Press CTRL + Click on the link to open in a new tab).

Setting up my environment & data:

library(tidyverse)
library(RColorBrewer)
library(scales)
library(here)

tripdata_summary_weather <- read_csv(here("processed_data/r", "tripdata_summary_weather.csv"))
tripdata_summary_daily <- read_csv(here("processed_data/r", "tripdata_summary_daily.csv"))
tripdata_time_of_day <- read_csv(here("processed_data/r", "tripdata_time_of_day.csv"))
tripdata_stations <- read_csv(here("processed_data/r", "tripdata_stations.csv"))
tripdata_summary <- read_csv(here("processed_data/r", "tripdata_summary.csv"))
station_names <- read_csv(here("processed_data/r", "station_names.csv"))

Business tasks:

How do annual members and casual riders use Cyclistic bikes differently?
Identify and compare bike trip trends in Chicago, Illinois to inform a marketing strategy that converts “Casual” riders into annual “Member” riders and increases revenue.

Presentation objectives:

Share an overview of historical bike trip data.
Examine the seasonal trends using historical data.
Discuss any potential areas for further exploration.

The data story:

Bike trips trend

The data covers April ’20 through August ’21 (17 months).
Member trips are generally higher, but Casual trips rose and overtook them starting in June ’21.
Overall, trips increase from April to August and trend down from September to February.

tripdata_summary %>%
  ggplot(aes(x = s_year_month, y = ride_count, color = member_casual)) +
  theme_grey(base_size = 12) +
  geom_smooth(se = FALSE, size = 1.1) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y") +
  scale_color_manual(values=c("purple","gold")) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month-Year", y = "Ride Count", color  = "Member Type") +
  theme(legend.position = "bottom")

head(tripdata_summary)

## # A tibble: 6 x 4
##   member_casual s_year_month mean_ride_length ride_count
##   <chr>         <date>                  <dbl>      <dbl>
## 1 casual        2020-04-01               72.5      23553
## 2 casual        2020-05-01               50.6      86635
## 3 casual        2020-06-01               51.3     154186
## 4 casual        2020-07-01               59.3     267977
## 5 casual        2020-08-01               44.5     280498
## 6 casual        2020-09-01               38.5     214102

Total trip count of riders

A total of 6,005,431 bike trips were recorded during the 17-month observation period.
There is a 22% difference in trips between Casual and Member riders.
Because the Casual category accounts for a high trip count, there is still a big opportunity for member conversion.

tripdata_summary %>% 
  group_by(member_casual) %>% 
  summarise(total_ride = sum(ride_count)) %>% 
  ggplot(aes(x = member_casual, y = total_ride, fill = member_casual)) +
  theme_grey(base_size = 12) +
  theme(axis.title.x=element_blank()) +
  geom_col(width = 0.50, show.legend = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values=c("purple", "gold")) +
  stat_summary(fun = "sum", aes(label = format(..y.., big.mark = ",")),
               geom = "text", vjust = 2, size = 3.5) +
  ylab("Ride Count") +
  scale_x_discrete(labels = c('Casual', 'Member'))
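
The difference between the two rider types can be expressed as a gap in their shares of all trips. The sketch below shows one way to compute it. It assumes the difference is measured in percentage points of total trips, which is only one possible reading of the figure above, and the totals are hypothetical placeholders rather than the case study's numbers.

```r
library(dplyr)

# Hypothetical totals for illustration only; not the case study's figures.
totals <- tibble(
  member_casual = c("casual", "member"),
  total_ride    = c(400, 600)
)

# Each rider type's share of all trips, in percent.
shares <- totals %>%
  mutate(share_pct = 100 * total_ride / sum(total_ride))

# Gap between the two groups, in percentage points of all trips.
gap_pct_points <- abs(diff(shares$share_pct))
gap_pct_points
```

With the real data, `totals` would come from the `group_by()` and `summarise()` step in the chunk above.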

Seasonal trips trend

In the hotter months, trips tend to rise. This may be related to increased outdoor leisure activities.
In the colder months, trips tend to fall. Travelling by bike means being exposed to rain and snow.
Generally, people avoid getting wet or risking an accident on a slippery road.

tripdata_summary_weather %>% 
  ggplot(aes(x = s_year_month, y = ride_count)) +
  theme_dark(base_size = 12) +
  geom_smooth(color = "gray", se = FALSE, size = 1) +
  geom_point(aes(color = avg_temp), size = 3) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y") +
  scale_y_continuous(labels = scales::comma) +
  scale_color_distiller(name = "Temp (C)", palette = "RdBu") +
  labs(x = "Month-Year", y = "Ride Count")

head(tripdata_summary_weather)

## # A tibble: 6 x 4
##   s_year_month avg_temp mean_rl ride_count
##   <date>          <dbl>   <dbl>      <dbl>
## 1 2020-04-01         10      36      84607
## 2 2020-05-01         18      33     199674
## 3 2020-06-01         23      33     341897
## 4 2020-07-01         26      38     548485
## 5 2020-08-01         25      30     604203
## 6 2020-09-01         19      25     497656
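
To put a rough number on the relationship between temperature and trips, a correlation can be computed. The sketch below uses only the six rows printed above, so it illustrates the calculation rather than summarizing the full 17 months.

```r
# First six monthly rows, copied from the head() output above.
weather_head <- data.frame(
  avg_temp   = c(10, 18, 23, 26, 25, 19),
  ride_count = c(84607, 199674, 341897, 548485, 604203, 497656)
)

# Pearson correlation between average temperature and monthly ride count.
temp_ride_cor <- cor(weather_head$avg_temp, weather_head$ride_count)
temp_ride_cor
```

Running the same call on the full `tripdata_summary_weather` table would give the figure for the whole observation period.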

Average ride length

Casual riders show a more obvious high-low-high pattern in time spent travelling.
Their ride length peaks in April ’20 and April ’21 and reaches a trough in December.
Longer bike trips may be related to leisure activities or one-off rides to avoid traffic jams.
A smoother pattern is observed for member riders.
Shorter bike trips may be related to commuting from home to work, which is more frequent and consistent.

tripdata_summary %>%
  ggplot(aes(x = s_year_month, y = mean_ride_length, color = member_casual)) +
  theme_grey(base_size = 12) +
  geom_smooth(se = FALSE, size = 1.1, show.legend = FALSE) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y") +
  scale_color_manual(values=c("purple","gold")) +
  ylim(0, 80) +
  labs(x = "Month-Year", y = "Average Ride Length (Min)")

Rides per day of week

For Casual riders, bike usage increases toward the weekend.
Outdoor leisure activities may be the main contributor to this trend.
Member rider trips are more consistent throughout the week.
Members tend to have a firm daily routine, whether for work, personal errands, leisure activities, or a combination of these.

tripdata_summary_daily %>%
  ggplot(aes(x = factor(day_of_week, levels = c('Monday', 'Tuesday', 'Wednesday',
                                               'Thursday', 'Friday', 'Saturday', 'Sunday')),
             y = ride_count, fill = member_casual)) +
  theme_grey(base_size = 12) +
  geom_col(width = 0.50, show.legend = FALSE) +
  facet_wrap(~member_casual) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values=c("purple", "gold")) +
  labs(x = "Day of Week", y = "Ride Count") +
  scale_x_discrete(labels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))

head(tripdata_summary_daily)

## # A tibble: 6 x 3
##   member_casual day_of_week ride_count
##   <chr>         <chr>            <dbl>
## 1 casual        Sunday          518410
## 2 casual        Monday          292885
## 3 casual        Tuesday         281572
## 4 casual        Wednesday       286462
## 5 casual        Thursday        302551
## 6 casual        Friday          391052

Rides per time of day

The overall bike usage trend is similar for both rider types.
Usage increases from morning until afternoon and decreases in the evening.
Peak usage is between 3 PM and 5 PM.
The differences are the following:
One outlier is that more Casual riders ride around midnight.
Otherwise, Member riders are the prevailing bike users over Casual riders.

tripdata_time_of_day %>%
  ggplot(aes(x = started_at_time, y = ride_count, color = member_casual)) +
  theme_grey(base_size = 12) +
  geom_smooth(se = FALSE, size = 1.1, show.legend = FALSE) +
  scale_y_continuous(labels = scales::comma, breaks=seq(0,140000,20000)) +
  scale_color_manual(values=c("purple", "gold")) +
  labs(x = "Time of Day", y = "Ride Count") +
  scale_x_time(breaks = scales::breaks_width("2 hours"),
               labels = label_time(format = "%H:%M", tz = "UTC"))

head(tripdata_time_of_day)

## # A tibble: 6 x 3
##   member_casual started_at_time ride_count
##   <chr>         <time>               <dbl>
## 1 casual        00:00                14507
## 2 casual        00:30                25044
## 3 casual        01:00                20887
## 4 casual        01:30                17087
## 5 casual        02:00                14243
## 6 casual        02:30                10399
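
The `started_at_time` column above is in 30-minute steps. A minimal sketch of how such bins could be built from raw trip timestamps follows. The `started_at` column name and the sample rows are assumptions for illustration, and the project's actual binning step may differ.

```r
library(dplyr)
library(lubridate)

# Hypothetical trips; started_at is an assumed name for the raw timestamp column.
trips <- tibble(
  member_casual = c("casual", "casual", "member", "member"),
  started_at    = ymd_hms(c("2021-06-05 00:10:00", "2021-06-05 00:25:00",
                            "2021-06-05 08:05:00", "2021-06-05 08:40:00"))
)

# Round each start time down to its 30-minute bin, then count rides per bin.
time_of_day <- trips %>%
  mutate(started_at_time = format(floor_date(started_at, "30 minutes"), "%H:%M")) %>%
  count(member_casual, started_at_time, name = "ride_count")

time_of_day
```

Here the bin is stored as a text label such as "00:00", while the table above holds a time type, so a conversion would be needed before plotting with `scale_x_time()`.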

Stations traveled

There are 763 bike stations in total around Chicago, Illinois.
About 9 bike stations are unused for pick-up and drop-off by both rider types.
We can re-optimize the bike allocation and transfer those bikes to stations with frequent rider usage to address any bike shortage.

station_names %>% 
  summarise(total_stations = n_distinct(start_station_name))

## # A tibble: 1 x 1
##   total_stations
##            <int>
## 1            763

tripdata_stations %>%
  ggplot(aes(x = factor(stations, levels = c('start_station', 'end_station')),
             y = stations_covered, fill = member_casual)) +
  theme_grey(base_size = 12) +
  theme(axis.title.x=element_blank()) +
  geom_col(width = 0.50, position = "dodge", show.legend = FALSE) +
  ylim(0, 800) +
  scale_fill_manual(values=c("purple", "gold")) +
  ylab("Stations Covered") +
  scale_x_discrete(labels = c('Start Station', 'End Station')) +
  geom_text(aes(label = stations_covered),
            position = position_dodge(width = 0.5), vjust = 2, size = 3.5)

head(station_names)

## # A tibble: 6 x 1
##   start_station_name                 
##   <chr>                              
## 1 Eckhart Park                       
## 2 Drake Ave & Fullerton Ave          
## 3 McClurg Ct & Erie St               
## 4 California Ave & Division St       
## 5 Rush St & Hubbard St               
## 6 Mies van der Rohe Way & Chicago Ave
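
One way to list unused stations is to compare the full station list against the stations that appear in any trip. The sketch below uses hypothetical station names and assumes "unused" means a station never appears as a pick-up or a drop-off point.

```r
# Hypothetical station lists for illustration only.
all_stations  <- c("Station A", "Station B", "Station C", "Station D")
used_as_start <- c("Station A", "Station B")
used_as_end   <- c("Station B", "Station C")

# Stations that never appear as a pick-up or a drop-off point.
unused_stations <- setdiff(all_stations, union(used_as_start, used_as_end))
unused_stations
```

With the real data, `all_stations` would be the distinct names in `station_names`, and the two "used" vectors would come from the trip records.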

Key takeaways:

Bike usage has a seasonal trend.
- We can prepare our marketing strategies during the fall and winter seasons.
- In the colder months, we can gather unused bikes for maintenance in preparation for peak season.
- We can also pilot-test and implement software upgrades during the off-peak season.
- At the start of peak season, we can deploy marketing agents to reach out to Casual riders for potential Member conversion.

Casual riders are more active as the weekend approaches.
- From a timing perspective, we can step up marketing agents' outreach on Friday, Saturday, and Sunday.

Optimize the bike allocation and review the station usage.
- Transfer bikes from low-usage stations to in-demand stations to maximize revenue.
- Review the schedule of bike transfers to stations.
- We can propose dropping low-usage stations to reduce cost.

Appendix:

Bike trips data provided by Motivate International Inc.
Chicago, Illinois weather data provided by Visual Crossing.

Data workflow summary:

Data cleaning

Merged 17 historical datasets into a single dataset.
Checked for stray whitespace that could cause records to be missed in the analysis.
Ensured that all variable names are consistently formatted.
Removed rows that have blank observations.
Removed rows with duplicate values in the unique variable “ride_id”.
Removed rows with zero or negative minutes in the calculated “ride_length” variable.
Removed rows for non-commercial station trips such as “HUBBARD ST BIKE CHECKING” and “WATSON TESTING – DIVVY”.
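
A minimal sketch of the cleaning steps, run on a few hypothetical rows. The `started_at` and `ended_at` column names are assumptions, the station names in the exclusion list are written with a plain hyphen, and the project's actual cleaning script may differ.

```r
library(dplyr)

# Hypothetical raw trips; started_at and ended_at are assumed column names.
raw_trips <- tibble(
  ride_id = c("A1", "A1", "B2", "C3", "D4", "E5"),
  start_station_name = c("Eckhart Park", "Eckhart Park", " Rush St & Hubbard St ",
                         "WATSON TESTING - DIVVY", NA, "McClurg Ct & Erie St"),
  started_at = as.POSIXct(rep("2021-06-01 08:00:00", 6), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-06-01 08:10:00", "2021-06-01 08:10:00",
                            "2021-06-01 08:20:00", "2021-06-01 08:15:00",
                            "2021-06-01 08:05:00", "2021-06-01 08:00:00"), tz = "UTC")
)

# Non-commercial stations named in the cleaning notes above.
test_stations <- c("HUBBARD ST BIKE CHECKING", "WATSON TESTING - DIVVY")

clean_trips <- raw_trips %>%
  mutate(
    start_station_name = trimws(start_station_name),   # strip stray whitespace
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
  ) %>%
  filter(!is.na(start_station_name), start_station_name != "") %>%  # drop blanks
  distinct(ride_id, .keep_all = TRUE) %>%                            # drop duplicate ride_id
  filter(ride_length > 0) %>%                                        # drop zero/negative lengths
  filter(!start_station_name %in% test_stations)                     # drop test stations

clean_trips
```

Only rides A1 and B2 survive in this toy example: the rest are removed as a duplicate, a test station, a blank station name, and a zero-length ride.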

Data manipulation

Created new variables to compartmentalize existing observations for grouping.
Created subsets of the clean data to narrow analysis and plotting.
Merged in the daily weather dataset for Chicago, Illinois.
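
A minimal sketch of these manipulation steps, assuming a `started_at` timestamp column and a daily weather table keyed by `date`. Both names and all sample values are hypothetical, while `day_of_week`, `s_year_month`, `mean_ride_length` and `ride_count` mirror the columns shown in the summaries above.

```r
library(dplyr)

# Hypothetical cleaned trips; started_at is an assumed column name.
trips <- tibble(
  member_casual = c("casual", "casual", "member", "casual"),
  started_at    = as.POSIXct(c("2021-06-05 10:00:00", "2021-06-06 11:00:00",
                               "2021-06-07 08:30:00", "2021-07-03 15:00:00"), tz = "UTC"),
  ride_length   = c(40, 50, 12, 55)
)

# Hypothetical daily weather table.
weather <- tibble(
  date     = as.Date(c("2021-06-05", "2021-06-06", "2021-06-07", "2021-07-03")),
  avg_temp = c(22, 23, 24, 27)
)

# New grouping variables, then the weather merge by calendar date.
trips_enriched <- trips %>%
  mutate(
    date         = as.Date(started_at),
    day_of_week  = weekdays(date),                       # locale-dependent day name
    s_year_month = as.Date(format(date, "%Y-%m-01"))     # first day of the month
  ) %>%
  left_join(weather, by = "date")

# A monthly subset per rider type, shaped like tripdata_summary.
monthly <- trips_enriched %>%
  group_by(member_casual, s_year_month) %>%
  summarise(mean_ride_length = mean(ride_length), ride_count = n(), .groups = "drop")

monthly
```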

Data archival for project closure

Purged initial raw data.
Retained the clean data and its subsets for future use.