1. Introduction

Background

This case study is the Google Data Analytics Certificate capstone project. I am acting as a junior data analyst at Cyclistic, a fictional bike-share company based in Chicago.

Cyclistic offers a fleet of bikes that can be unlocked from one station and returned to any other station in the network. Riders can purchase:

  • Single-ride passes (casual riders)
  • Full-day passes (casual riders)
  • Annual memberships (members)

The marketing director believes that converting casual riders into annual members is key to Cyclistic’s future growth. Three questions will guide the future marketing program:

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

I have been assigned the first question as my focus:

How do annual members and casual riders use Cyclistic bikes differently?

The insights from this analysis will inform a marketing strategy to convert casual riders into annual members.

Data Source

Data was sourced from Divvy trip data, made available by Motivate International Inc. under a public license. The dataset covers 12 months of ride data (February 2025 – January 2026), containing over 5.5 million records.


2. Prepare

Load Libraries

library(tidyverse)
library(janitor)
library(scales)

Load Data

Each monthly CSV file was loaded separately and combined into one dataframe.

m01 <- read_csv("../Cyclistic_Analysis/csv_data/202501-divvy-tripdata.csv")
m02 <- read_csv("../Cyclistic_Analysis/csv_data/202502-divvy-tripdata.csv")
m03 <- read_csv("../Cyclistic_Analysis/csv_data/202503-divvy-tripdata.csv")
m04 <- read_csv("../Cyclistic_Analysis/csv_data/202504-divvy-tripdata.csv")
m05 <- read_csv("../Cyclistic_Analysis/csv_data/202505-divvy-tripdata.csv")
m06 <- read_csv("../Cyclistic_Analysis/csv_data/202506-divvy-tripdata.csv")
m07 <- read_csv("../Cyclistic_Analysis/csv_data/202507-divvy-tripdata.csv")
m08 <- read_csv("../Cyclistic_Analysis/csv_data/202508-divvy-tripdata.csv")
m09 <- read_csv("../Cyclistic_Analysis/csv_data/202509-divvy-tripdata.csv")
m10 <- read_csv("../Cyclistic_Analysis/csv_data/202510-divvy-tripdata.csv")
m11 <- read_csv("../Cyclistic_Analysis/csv_data/202511-divvy-tripdata.csv")
m12 <- read_csv("../Cyclistic_Analysis/csv_data/202512-divvy-tripdata.csv")

# Combine into one dataframe
all_trips <- bind_rows(m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12)

Explore the Data

glimpse(all_trips)
## Rows: 5,552,092
## Columns: 13
## $ ride_id            <chr> "FD0EB1D32AF0D47E", "FB27405C3F8C824F", "6FAFA17094…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "electric_bike", "e…
## $ started_at         <dttm> 2026-01-31 09:13:09, 2026-01-15 14:25:42, 2026-01-…
## $ ended_at           <dttm> 2026-01-31 09:28:10, 2026-01-15 14:33:18, 2026-01-…
## $ start_station_name <chr> "Central St & Girard Ave", "Shore Dr & 55th St", "H…
## $ start_station_id   <chr> "CHI02042", "CHI00394", "CHI02087", "CHI00286", "CH…
## $ end_station_name   <chr> "Dodge Ave & Church St", "Woodlawn Ave & 55th St", …
## $ end_station_id     <chr> "CHI00741", "CHI00423", NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 42.06431, 41.79521, 41.93247, 41.89453, 41.88166, 4…
## $ start_lng          <dbl> -87.68615, -87.58071, -87.64242, -87.65341, -87.641…
## $ end_lat            <dbl> 42.04831, 41.79526, 41.94000, 41.83000, 41.89000, 4…
## $ end_lng            <dbl> -87.69822, -87.59647, -87.64000, -87.67000, -87.630…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…

The combined dataset contains 5,552,092 rows and 13 columns.

Check for Missing Values

colSums(is.na(all_trips))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##            1186374            1186374            1246827            1246827 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5670               5670 
##      member_casual 
##                  0

Key observations on missing data:

  • start_station_name and start_station_id: 1,186,374 missing — mostly electric bikes that do not require docking at a named station. These were retained as they represent valid rides.
  • end_station_name and end_station_id: 1,246,827 missing — same reason as above. Retained.
  • end_lat and end_lng: 5,670 missing — GPS did not record an end location. Since no map analysis was performed, these were retained.

3. Process

Add New Columns

New columns were added to support analysis:

all_trips <- all_trips %>%
  mutate(
    ride_length    = ended_at - started_at,
    ride_length_mins = as.numeric(ride_length) / 60,
    day_of_week    = wday(started_at, label = TRUE),
    month          = month(started_at, label = TRUE),
    year           = year(started_at)
  )

colnames(all_trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"      "ride_length"        "ride_length_mins"  
## [16] "day_of_week"        "month"              "year"

Inspect Ride Length

summary(all_trips$ride_length_mins)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -54.795    5.399    9.435   16.097   16.582 1574.900

The summary reveals negative ride lengths (minimum: -54.8 minutes), indicating data errors where ended_at was recorded before started_at. These are impossible in real life and must be removed.

Clean Data — Create all_trips_V2

Rides under 1 minute (including all negative values) were removed as they represent either data errors or false starts.

# Check how many rows will be removed
all_trips %>%
  filter(ride_length_mins < 1) %>%
  nrow()
## [1] 148401
# Create clean dataframe
all_trips_V2 <- all_trips %>%
  filter(ride_length_mins >= 1)

glimpse(all_trips_V2)
## Rows: 5,403,691
## Columns: 18
## $ ride_id            <chr> "FD0EB1D32AF0D47E", "FB27405C3F8C824F", "6FAFA17094…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "electric_bike", "e…
## $ started_at         <dttm> 2026-01-31 09:13:09, 2026-01-15 14:25:42, 2026-01-…
## $ ended_at           <dttm> 2026-01-31 09:28:10, 2026-01-15 14:33:18, 2026-01-…
## $ start_station_name <chr> "Central St & Girard Ave", "Shore Dr & 55th St", "H…
## $ start_station_id   <chr> "CHI02042", "CHI00394", "CHI02087", "CHI00286", "CH…
## $ end_station_name   <chr> "Dodge Ave & Church St", "Woodlawn Ave & 55th St", …
## $ end_station_id     <chr> "CHI00741", "CHI00423", NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 42.06431, 41.79521, 41.93247, 41.89453, 41.88166, 4…
## $ start_lng          <dbl> -87.68615, -87.58071, -87.64242, -87.65341, -87.641…
## $ end_lat            <dbl> 42.04831, 41.79526, 41.94000, 41.83000, 41.89000, 4…
## $ end_lng            <dbl> -87.69822, -87.59647, -87.64000, -87.67000, -87.630…
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "…
## $ ride_length        <drtn> 901.284 secs, 456.328 secs, 404.350 secs, 1850.186…
## $ ride_length_mins   <dbl> 15.021400, 7.605467, 6.739167, 30.836433, 78.447200…
## $ day_of_week        <ord> Sat, Thu, Tue, Mon, Sat, Tue, Wed, Fri, Fri, Thu, T…
## $ month              <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…
## $ year               <dbl> 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 202…

148,401 rides were removed, leaving 5,403,691 clean records for analysis.


4. Analyse

Average Ride Length by Rider Type

all_trips_V2 %>%
  group_by(member_casual) %>%
  summarise(average_ride_length = mean(ride_length_mins))
## # A tibble: 2 × 2
##   member_casual average_ride_length
##   <chr>                       <dbl>
## 1 casual                       23.5
## 2 member                       12.7

Casual riders average 23.5 minutes per ride compared to 12.7 minutes for members — almost twice as long.

Total Rides by Rider Type

all_trips_V2 %>%
  count(member_casual)
## # A tibble: 2 × 2
##   member_casual       n
##   <chr>           <int>
## 1 casual        1920972
## 2 member        3482719

Members take significantly more rides than casual riders despite shorter individual ride durations.

Total Rides by Day of Week

all_trips_V2 %>%
  count(member_casual, day_of_week) %>%
  print(n = 14)
## # A tibble: 14 × 3
##    member_casual day_of_week      n
##    <chr>         <ord>        <int>
##  1 casual        Sun         318480
##  2 casual        Mon         220280
##  3 casual        Tue         218800
##  4 casual        Wed         212183
##  5 casual        Thu         247670
##  6 casual        Fri         307148
##  7 casual        Sat         396411
##  8 member        Sun         374492
##  9 member        Mon         494937
## 10 member        Tue         558517
## 11 member        Wed         538722
## 12 member        Thu         560394
## 13 member        Fri         517005
## 14 member        Sat         438652

Average Ride Length by Day of Week

all_trips_V2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(average_ride_length = mean(ride_length_mins), .groups = "drop")
## # A tibble: 14 × 3
##    member_casual day_of_week average_ride_length
##    <chr>         <ord>                     <dbl>
##  1 casual        Sun                        27.3
##  2 casual        Mon                        23.2
##  3 casual        Tue                        20.6
##  4 casual        Wed                        19.4
##  5 casual        Thu                        20.7
##  6 casual        Fri                        23.5
##  7 casual        Sat                        26.4
##  8 member        Sun                        13.9
##  9 member        Mon                        12.2
## 10 member        Tue                        12.2
## 11 member        Wed                        12.1
## 12 member        Thu                        12.2
## 13 member        Fri                        12.7
## 14 member        Sat                        13.9

Rides by Month

all_trips_V2 %>%
  count(member_casual, month) %>%
  print(n = 24)
## # A tibble: 24 × 3
##    member_casual month      n
##    <chr>         <ord>  <int>
##  1 casual        Jan    23964
##  2 casual        Feb    27051
##  3 casual        Mar    83064
##  4 casual        Apr   105528
##  5 casual        May   176162
##  6 casual        Jun   279570
##  7 casual        Jul   309293
##  8 casual        Aug   324165
##  9 casual        Sept  255223
## 10 casual        Oct   214849
## 11 casual        Nov    94930
## 12 casual        Dec    27173
## 13 member        Jan   109969
## 14 member        Feb   122118
## 15 member        Mar   208500
## 16 member        Apr   257962
## 17 member        May   314091
## 18 member        Jun   379609
## 19 member        Jul   430535
## 20 member        Aug   443214
## 21 member        Sept  441046
## 22 member        Oct   414231
## 23 member        Nov   252033
## 24 member        Dec   109411

5. Share — Visualisations

Chart 1: Number of Rides by Day of Week

all_trips_V2 %>%
  mutate(day_of_week = factor(day_of_week, 
                              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
  count(member_casual, day_of_week) %>%
  ggplot(mapping = aes(x = day_of_week, y = n, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Number of Rides by Day of Week",
       subtitle = "Members vs Casual Riders",
       x = "Day of the Week",
       y = "Number of Rides",
       fill = "Rider Type") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key finding: Casual riders peak on Saturday (396,411 rides) and Sunday, while members peak on weekday mornings (Tuesday–Thursday). This strongly suggests members use bikes for commuting while casual riders prefer leisure weekend rides.


Chart 2: Number of Rides by Month

all_trips_V2 %>%
  count(member_casual, month) %>%
  ggplot(mapping = aes(x = month, y = n, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Number of Rides by Month",
       subtitle = "Members vs Casual Riders",
       x = "Month",
       y = "Number of Rides",
       fill = "Rider Type") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key finding: Both rider types peak in summer (July–August), but casual riders drop dramatically in winter (January: 23,964 rides) while members remain far more consistent year-round (January: 109,969 rides). This highlights the weather-dependent nature of casual riders.


Chart 3: Average Ride Length by Day of Week

all_trips_V2 %>%
  group_by(member_casual, day_of_week) %>%
  mutate(day_of_week = factor(day_of_week, 
                              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
  summarise(average_ride_length = mean(ride_length_mins)) %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average Ride Length by Day of Week",
       subtitle = "Members vs Casual Riders",
       x = "Day of the Week",
       y = "Average Ride Length (minutes)",
       fill = "Rider Type") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key finding: Casual riders take their longest rides on weekends (Sunday: 27.3 mins, Saturday: 26.4 mins), reinforcing the leisure-riding pattern. Members ride at a remarkably consistent 12–14 minutes every day, consistent with purposeful short commutes.


6. Act — Recommendations

Based on the analysis, the following three recommendations are made to help Cyclistic convert casual riders into annual members:


Recommendation 1: Launch Weekend Conversion Campaigns Starting Thursday

Analysis shows casual riders peak on Saturdays (396,411 rides) and Sundays (318,480 rides). Cyclistic should launch targeted digital and in-app marketing campaigns from Thursday onwards to capture casual riders during their highest-activity period. Ads should highlight the cost savings of annual membership for riders who already ride frequently on weekends, showing them that a membership quickly pays for itself.


Recommendation 2: Promote Membership for Weekday Convenience and Short Trips

Casual riders average 23.5 minutes per ride but their weekday ridership is significantly lower than members. Marketing should reframe membership not just as a leisure upgrade, but as a practical tool for spontaneous weekday trips — avoiding car use for short errands, unexpected commutes, or emergency situations. Messaging such as “No car? No problem — your membership covers you any time” could resonate with this segment.


Recommendation 3: Run Seasonal Pre-Summer Membership Promotions

Casual ridership surges in summer (June–August) and drops sharply in winter. Cyclistic should launch discounted membership offers in March and April, just before the peak season begins, to convert casual riders into members before they reach their highest riding frequency. A limited-time spring offer could create urgency and capture riders already planning their cycling season.