Cyclistic Bikeshare using R

Introduction

This is the report of the capstone project for my Google Data Analytics Professional Certificate program. I am using the R programming language and RStudio Desktop. Note that the free version of RStudio Cloud cannot handle the amount of data needed for this project.

Scenario

I am a junior data analyst on the marketing analysis team of Cyclistic, a bike-share company in Chicago. Note: Cyclistic is a fictional name, but the company is real and is called Divvy (https://divvybikes.com). The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual membership?
How can Cyclistic use digital media to influence casual riders to become annual members?

Lily Moreno, the director of marketing and my manager, has assigned me the first question to answer.

Prepare

I use Cyclistic’s monthly trip data, which is publicly available at https://divvy-tripdata.s3.amazonaws.com/index.html. According to Divvy (https://divvybikes.com/system-data), the data has been processed to remove trips taken by staff as they service and inspect the system, and any trips shorter than 60 seconds (potentially false starts or users trying to re-dock a bike to ensure it was secure).
To see the effect of seasons on rides, I use 12 months of data. To make the seasonality easier to follow, I use the calendar year from January to December 2024.
The data was downloaded to the RStudio working directory on my computer. To identify the working directory, I used the getwd() command. All trip data is in comma-delimited (.csv) format.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(patchwork)
library(sf)

## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE

library(leaflet)
library(viridis)

## Loading required package: viridisLite

jan24 <- read_csv("202401-divvy-tripdata.csv") # I made sure that these files were in my working directory

## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

feb24 <- read_csv("202402-divvy-tripdata.csv")

## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mar24 <- read_csv("202403-divvy-tripdata.csv")

## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

apr24 <- read_csv("202404-divvy-tripdata.csv")

## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may24 <- read_csv("202405-divvy-tripdata.csv")

## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jun24 <- read_csv("202406-divvy-tripdata.csv")

## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jul24 <- read_csv("202407-divvy-tripdata.csv")

## Rows: 748962 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aug24 <- read_csv("202408-divvy-tripdata.csv")

## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep24 <- read_csv("202409-divvy-tripdata.csv")

## Rows: 821276 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct24 <- read_csv("202410-divvy-tripdata.csv")

## Rows: 616281 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov24 <- read_csv("202411-divvy-tripdata.csv")

## Rows: 335075 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dec24 <- read_csv("202412-divvy-tripdata.csv")

## Rows: 178372 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I checked in the RStudio Environment pane that these files were actually loaded. Then, I merged them into a single data frame.

tripdata <- bind_rows(jan24, feb24, mar24, apr24, may24, jun24, jul24, aug24, sep24, oct24, nov24, dec24)
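The twelve separate read_csv() calls can also be collapsed into one. The sketch below is an alternative, not what was run for this report. It assumes readr 2.0 or later, which accepts a vector of file paths, and it writes two tiny made-up files to a temporary folder so that it runs on its own. The names demo_dir, tripdata_demo and source_file are my own choices for illustration.

```r
library(readr)

# Two tiny stand-in files in a temporary folder (made-up rows, not real
# Divvy data) so the example is self-contained.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(
  data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
  file.path(demo_dir, "202401-divvy-tripdata.csv")
)
write_csv(
  data.frame(ride_id = "B1", member_casual = "member"),
  file.path(demo_dir, "202402-divvy-tripdata.csv")
)

# Collect every monthly file by its naming pattern.
files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)

# Read all files in one call; `id` adds a column recording the source file of each row.
tripdata_demo <- read_csv(files, id = "source_file", show_col_types = FALSE)
```

With the real data, pointing list.files() at the working directory would pick up all twelve monthly files without naming each one.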

Process

Clean and prepare the data for analysis.

colnames(tripdata)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

str(tripdata)

## spc_tbl_ [5,860,568 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
##  $ rideable_type     : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
##  $ ended_at          : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
##  $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
##  $ start_station_id  : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
##  $ end_station_name  : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
##  $ end_station_id    : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
##  $ start_lat         : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5860568] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(tripdata)

##    ride_id          rideable_type        started_at                    
##  Length:5860568     Length:5860568     Min.   :2024-01-01 00:00:39.00  
##  Class :character   Class :character   1st Qu.:2024-05-20 19:47:53.00  
##  Mode  :character   Mode  :character   Median :2024-07-22 20:36:16.27  
##                                        Mean   :2024-07-17 07:55:47.61  
##                                        3rd Qu.:2024-09-17 20:14:22.56  
##                                        Max.   :2024-12-31 23:56:49.84  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2024-01-01 00:04:20.00   Length:5860568     Length:5860568    
##  1st Qu.:2024-05-20 20:07:54.75   Class :character   Class :character  
##  Median :2024-07-22 20:53:59.16   Mode  :character   Mode  :character  
##  Mean   :2024-07-17 08:13:06.54                                        
##  3rd Qu.:2024-09-17 20:27:46.02                                        
##  Max.   :2024-12-31 23:59:55.70                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5860568     Length:5860568     Min.   :41.64   Min.   :-87.91  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng        member_casual     
##  Min.   :16.06   Min.   :-144.05   Length:5860568    
##  1st Qu.:41.88   1st Qu.: -87.66   Class :character  
##  Median :41.90   Median : -87.64   Mode  :character  
##  Mean   :41.90   Mean   : -87.65                     
##  3rd Qu.:41.93   3rd Qu.: -87.63                     
##  Max.   :87.96   Max.   : 152.53                     
##  NA's   :7232    NA's   :7232

head(tripdata)

## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike  2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike  2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike  2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

colSums(is.na(tripdata))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##            1073951            1073951            1104653            1104653 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               7232               7232 
##      member_casual 
##                  0

I saw that there were 13 columns (variables): “ride_id”, “rideable_type” (electric_bike, classic_bike), “started_at”, “ended_at”, “start_station_name”, “start_station_id”, “end_station_name”, “end_station_id”, “start_lat” (latitude), “start_lng” (longitude), “end_lat”, “end_lng”, “member_casual” (member, casual); and there were over 5.8 million rows (rides). The minimum and maximum values of end_lat and end_lng were far outside Chicago, probably due to signal drift, station mislabeling, or technical glitches. Also, 7,232 rides had missing (NA) end_lat and end_lng values. In addition, over 1 million rides had missing (NA) station names and IDs (1,073,951 for start stations and 1,104,653 for end stations), likely because those rides began or ended away from a docking station, for example wherever a rider left the bike.

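To prevent map distortion, I kept only the rides whose start and end coordinates fall inside a bounding box around Chicago. Because filter() drops rows where a condition evaluates to NA, this step also removes the rides with missing end coordinates.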

tripdata_clean_lat_lng <- tripdata %>%
  filter(
    between(start_lat, 41.6, 42.1),
    between(start_lng, -88.0, -87.5),
    between(end_lat, 41.6, 42.1),
    between(end_lng, -88.0, -87.5)
  )
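To illustrate how the bounding-box filter treats missing coordinates, here is a minimal sketch on three made-up rides (the tibble rides_demo and its values are invented for illustration; only the bounds come from the code above). between() returns NA for an NA input, and filter() keeps only rows where every condition is TRUE.

```r
library(dplyr)

# Made-up rides: one inside the box, one far away, one with missing end coordinates.
rides_demo <- tibble(
  ride_id   = c("in_box", "far_away", "missing_end"),
  start_lat = c(41.90, 41.90, 41.90),
  start_lng = c(-87.60, -87.60, -87.60),
  end_lat   = c(41.90, 16.06, NA),
  end_lng   = c(-87.60, -144.05, NA)
)

kept <- rides_demo %>%
  filter(
    between(start_lat, 41.6, 42.1),
    between(start_lng, -88.0, -87.5),
    between(end_lat, 41.6, 42.1),
    between(end_lng, -88.0, -87.5)
  )

# Rows removed by the bounding box, including the row with NA coordinates.
n_removed <- nrow(rides_demo) - nrow(kept)
```

Comparing nrow() before and after the filter in the same way on the full data would show how many rides the cleaning step discards.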

For later analysis, I calculated the ride_lengths,

tripdata_clean_lat_lng <- tripdata_clean_lat_lng %>% 
  mutate(ride_length = as.numeric(ended_at - started_at, units = "mins"))

and summarized the ride_lengths:

summary(tripdata_clean_lat_lng$ride_length)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2748.317     5.539     9.700    15.483    17.207  1509.367

I saw that the min was a negative number, which is impossible. Therefore, I filtered out observations with negative ride_length as well as any observations with ride_length < 1 minute (potentially false starts or users trying to re-dock a bike to ensure it was secure).

tripdata_clean <- tripdata_clean_lat_lng %>%
  filter(ride_length >= 1)
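The ride-length calculation and the one-minute cutoff can be sketched on a few made-up timestamps (rides_demo and rides_kept are illustrative names, not objects from this report). The sketch uses difftime() with explicit units, which should give the same minutes as the subtraction used above.

```r
library(dplyr)

# Made-up rides: a normal ride, a ride that "ends" before it starts,
# and a 30-second false start.
rides_demo <- tibble(
  ride_id    = c("normal", "negative", "false_start"),
  started_at = as.POSIXct(
    c("2024-01-12 15:30:27", "2024-01-12 15:30:00", "2024-01-12 15:30:00"),
    tz = "UTC"
  ),
  ended_at   = as.POSIXct(
    c("2024-01-12 15:37:59", "2024-01-12 15:00:00", "2024-01-12 15:30:30"),
    tz = "UTC"
  )
)

# Ride length in minutes, with the unit stated explicitly.
rides_demo <- rides_demo %>%
  mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "mins")))

# A single condition removes both the negative and the sub-minute rides.
rides_kept <- rides_demo %>%
  filter(ride_length >= 1)
```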

Analyze

Compare annual members and casual riders

tripdata_clean %>% 
  group_by(member_casual) %>% 
  summarise(
    ride_count = n(),
    ride_percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
  )

## # A tibble: 2 × 3
##   member_casual ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           2080374            36.4
## 2 member           3641302            63.6

Members take about 1.75 times as many rides as casuals (3,641,302 vs. 2,080,374).

tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(
    average_ride_length = round(mean(ride_length), 2),
    median_length = round(median(ride_length), 2),
    max_ride_length = round(max(ride_length), 2),
    min_ride_length = round(min(ride_length), 2)
  )

## # A tibble: 2 × 5
##   member_casual average_ride_length median_length max_ride_length
##   <chr>                       <dbl>         <dbl>           <dbl>
## 1 casual                       21.8         12.4            1509.
## 2 member                       12.4          8.85           1500.
## # ℹ 1 more variable: min_ride_length <dbl>

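On average, casual rides are about 1.8 times as long as member rides (21.8 vs. 12.4 minutes). The gap is smaller for the median, where casual rides are about 1.4 times as long (12.4 vs. 8.85 minutes).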

All riders are charged an extra fee for each minute over 3 hours, so it is informative to look separately at rides with ride_length <= 180 minutes and rides with ride_length > 180 minutes.

filtered_data <- tripdata_clean %>%
  filter(ride_length <= 180)

ggplot(filtered_data, aes(x = ride_length, fill = member_casual)) +
  geom_histogram(binwidth = 2) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE) ) +
  labs(title = "Distribution of Ride Lengths (≤ 180 mins)",
       x = "Ride Length (minutes)",
       y = "Count") +
  theme_minimal() +
  facet_wrap(~member_casual, ncol = 2) +
  theme(legend.position = "none")

tripdata_clean %>%
  filter(ride_length > 180) %>%
  group_by(member_casual) %>%
  summarise(
    long_ride_count = n(),
    percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
  )

## # A tibble: 2 × 3
##   member_casual long_ride_count percentage
##   <chr>                   <int>      <dbl>
## 1 casual                  10009       0.17
## 2 member                   4251       0.07

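The percentages in the table are shares of all rides, not of each group’s own rides. Measured against each group’s own total, about 0.48% of casual rides (10,009 of 2,080,374) exceed 180 minutes, compared with about 0.12% of member rides (4,251 of 3,641,302), so casual riders are roughly four times as likely as members to take ultra-long rides. While both percentages are small, the absolute number of casual long rides is significant: over 10,000 instances.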
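The percentage column in the table above divides by all rides. To compare the two groups on their own terms, each count can instead be divided by that group’s total. A quick check using only the counts reported in the tables above (plain arithmetic, no new data; the variable names are mine):

```r
# Counts copied from the summary tables in this report.
long_rides  <- c(casual = 10009, member = 4251)
total_rides <- c(casual = 2080374, member = 3641302)

# Share of each group's own rides that exceed 180 minutes.
rate <- long_rides / total_rides
within_group_pct <- round(100 * rate, 2)

# How many times more likely a casual ride is to exceed 180 minutes.
casual_vs_member <- unname(rate["casual"] / rate["member"])
```

In the pipeline itself, the same result could be obtained by grouping first and computing sum(ride_length > 180) / n() inside summarise().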

Seasonality?

tripdata_clean <- tripdata_clean %>%
  mutate(
    month = format(as.Date(started_at), "%B"),
    month = factor(month, levels = c(
      "January", "February", "March", "April", "May", "June",
      "July", "August", "September", "October", "November", "December"
    ), ordered = TRUE)
  )

seasonality_summary <- tripdata_clean %>%
  group_by(member_casual, month) %>%
  summarise(
    number_of_rides = n(),
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, month)

# Plot 1
p1 <- seasonality_summary %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Number of Rides", x = "Month", y = "Number of Rides") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Plot 2
p2 <- seasonality_summary %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Average Ride Lengths", x = "Month", y = "Average Ride Length (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Combine them
combined_plot <- p1 / p2 +
  plot_annotation(
    title = "Seasonality Analysis: Number of Rides and Average Ride Lengths",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )
# Show the combined plot
combined_plot

Members ride more in every month. Casual riders have longer average ride lengths in every month.

Day of week effect?

tripdata_clean <- tripdata_clean %>% 
  mutate(
    day_of_week = format(as.Date(started_at), "%A"),
    day_of_week = factor(day_of_week, 
                         levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
                         ordered = TRUE)
  ) 
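As an alternative to format() plus a hand-written list of levels, lubridate (already loaded above) can return months and weekdays as ordered factors directly. This is a sketch of that option, not the code used for the plots. Like format(), the labels depend on the session locale.

```r
library(lubridate)

# Timestamp taken from the first row of the data shown earlier (12 January 2024).
x <- as.POSIXct("2024-01-12 15:30:27", tz = "UTC")

# Ordered factors with full names; week_start = 7 makes Sunday the first level.
month_label <- month(x, label = TRUE, abbr = FALSE)
day_label   <- wday(x, label = TRUE, abbr = FALSE, week_start = 7)

# Without labels the same calls return plain numbers.
month_number <- month(x)                 # 1 = January
day_number   <- wday(x, week_start = 7)  # 1 = Sunday, so a Friday is 6
```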

day_summary <- tripdata_clean %>%
  group_by(member_casual, day_of_week) %>%  
  summarise(
    number_of_rides = n(), 
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, day_of_week)

p1 <- day_summary %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
    geom_col(width=0.5, position = position_dodge(width=0.5)) +
    labs(title ="Total Rides", x = "Day of the Week", y = "Number of Rides") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

p2 <- day_summary %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Average Ride Lengths", x = "Day of the Week", y = "Average Ride Length (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

combined_plot <- p1 / p2 +
  plot_annotation(
    title = "Day-of-week Analysis: Number of Rides and Average Ride Lengths",
    theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
  )
# Show the combined plot
combined_plot

Members consistently take more rides than casuals every day of the week. Weekdays (Mon–Fri) show a steady, high volume of member rides, which is consistent with commuting behavior. Casual ridership peaks on weekends, suggesting that casuals tend to ride for leisure or recreation.

Casual riders consistently take longer rides than members on all days of the week. Casual ride lengths range from 22 to 26 minutes, peaking on Sunday and Saturday. Member ride lengths are shorter, around 12–14 minutes, and relatively stable across the week.

Hour of day effect?

tripdata_clean <- tripdata_clean %>% 
  mutate(start_hour = as.numeric(strftime(started_at, "%H")))
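One detail worth checking here is the time zone. By default, read_csv() parses timestamps that carry no offset as UTC, so the Chicago clock time ends up stored with a UTC label. strftime() formats in the time zone given by its tz argument, which defaults to the session’s time zone, whereas format() and lubridate’s hour() use the time zone stored on the object. The sketch below uses one made-up timestamp to show the difference; on a machine whose time zone is not UTC, the extracted hours could be shifted.

```r
library(lubridate)

# A ride recorded at 17:05 Chicago clock time, stored as 17:05 UTC
# (made-up timestamp for illustration).
x <- as.POSIXct("2024-01-12 17:05:00", tz = "UTC")

# These two use the time zone stored on the object and return the recorded hour.
hour_lubridate <- hour(x)
hour_format    <- as.numeric(format(x, "%H"))

# strftime() converts to its tz argument instead. Passing "America/Chicago"
# mimics what the default would do in a session set to Chicago time.
hour_chicago_session <- as.numeric(strftime(x, "%H", tz = "America/Chicago"))
```

If hour_chicago_session differs from the other two on a given machine, using hour(started_at) for start_hour would keep the recorded clock time.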

hour_summary <- tripdata_clean %>%
  group_by(member_casual, start_hour) %>%  
  summarise(
    number_of_rides = n(), 
    average_ride_length = round(mean(ride_length),2),
    .groups = "drop"
  ) %>%
  arrange(member_casual, start_hour)

hour_summary %>%
  ggplot(aes(x = start_hour, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 0:23) +
  labs(
    x = "Hour of the Day",
    y = "Number of Rides",
    title = "Cyclistic's Bike Demand by Hour of Day"
  ) +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

Members’ demand for bikes is higher than casuals’ during every hour of the day. For both groups, peak demand occurs just after noon, but members have a second peak in the early morning, showing a commuter-like pattern, whereas casuals’ demand distribution suggests leisure or tourism behavior.

ggplot(tripdata_clean, aes(x = start_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  facet_wrap(~ day_of_week) +
  labs(
    title = "Bike Demand by Hour and Day of Week",
    x = "Hour of the Day",
    y = "Number of Rides"
  ) +
  scale_x_continuous(breaks = 0:23) +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

The demand by hour for each day of the week clearly shows that members behave like commuters on weekdays, but on weekends they ride leisurely, like casuals.

Top start stations?

# Prepare casual and member data
top_casual <- tripdata_clean %>%
  filter(member_casual == "casual", !is.na(start_station_name), !is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_name, start_lat, start_lng) %>%
  summarise(start_count = n(), .groups = "drop") %>%
  slice_max(start_count, n = 25)

top_member <- tripdata_clean %>%
  filter(member_casual == "member", !is.na(start_station_name), !is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_name, start_lat, start_lng) %>%
  summarise(start_count = n(), .groups = "drop") %>%
  slice_max(start_count, n = 25)
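Grouping by station name together with latitude and longitude treats every distinct coordinate pair as its own group. If the coordinates recorded for a station vary slightly from ride to ride (an assumption worth checking in this data), one station’s rides would be split across several rows. A minimal sketch of an alternative, on made-up data, groups by name only and averages the coordinates:

```r
library(dplyr)

# Made-up rides: "Station A" appears with slightly different coordinates.
starts_demo <- tibble(
  start_station_name = c("Station A", "Station A", "Station A", "Station B", "Station B"),
  start_lat = c(41.9000, 41.9002, 41.8998, 41.8800, 41.8800),
  start_lng = c(-87.6300, -87.6301, -87.6299, -87.6200, -87.6200)
)

top_starts_demo <- starts_demo %>%
  group_by(start_station_name) %>%
  summarise(
    start_count = n(),              # all rides from the station, whatever the coordinates
    start_lat   = mean(start_lat),  # one representative point per station
    start_lng   = mean(start_lng),
    .groups = "drop"
  ) %>%
  slice_max(start_count, n = 1)
```

Here Station A is counted once with three rides instead of appearing as three separate one-ride groups.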

m1 <- leaflet(data = top_member) %>%    # interactive map
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = -87.63, lat = 41.88, zoom = 11) %>%  # set the initial center and zoom level of the map view
  addCircles(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~sqrt(start_count) * 2,  
    color = "blue",
    stroke = FALSE,   # no outline
    fillOpacity = 0.6,
    label = ~paste0(start_station_name, ": ", start_count, " rides")
  ) %>%
  addControl("<strong>Top 25 Start Stations for Members</strong>", position = "topright")
m1

Most of the top 25 starting stations for members are concentrated in downtown Chicago, with additional clusters near residential areas on the North and South Sides.

m2 <- leaflet(data = top_casual) %>%
  addProviderTiles("CartoDB.Positron") %>%
  setView(lng = -87.63, lat = 41.92, zoom = 12) %>%
  addCircles(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~sqrt(start_count)* 2,  # adjust multiplier to your data
    color = "red",
    stroke = FALSE,
    fillOpacity = 0.6,
    label = ~paste0(start_station_name, ": ", start_count, " rides")
  ) %>%
  addControl("<strong>Top 25 Start Stations for Casuals</strong>", position = "topright")
m2

The top 25 starting stations for casual users are heavily clustered along the lakefront and near downtown tourist areas, reflecting strong recreational and sightseeing usage.

Self-loops

Before showing the top routes on a map, I identify the self-loops (i.e., trips that start and end at the same station) and set them aside, because a route whose start and end points coincide cannot be drawn as a line on a map.

# STEP 1: Filter self-loops
self_loops <- tripdata_clean %>%
  filter(start_station_name == end_station_name)

# STEP 2: Count self-loop rides by user type
self_loop_summary <- self_loops %>%
  group_by(member_casual) %>%
  summarise(self_loop_rides = n(), .groups = "drop")

# STEP 3: Count total rides by user type
total_rides_summary <- tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(total_rides = n(), .groups = "drop")

# STEP 4: Join and compute percentage
summary_table <- left_join(self_loop_summary, total_rides_summary, by = "member_casual") %>%
  mutate(percent_self_loops = round(100 * self_loop_rides / total_rides, 2))

# STEP 5: Display table
summary_table

## # A tibble: 2 × 4
##   member_casual self_loop_rides total_rides percent_self_loops
##   <chr>                   <int>       <int>              <dbl>
## 1 casual                 130120     2080374               6.25
## 2 member                  65032     3641302               1.79

Self-loop rides represent only 1.79% of rides by members and 6.25% of rides by casual users. Because these shares are small and self-loops cannot be properly depicted on a route map, they are excluded from the top-route analysis that follows.

Top routes

top_routes <- tripdata_clean %>%
  filter(!is.na(start_station_name), !is.na(end_station_name),
         !is.na(start_lat), !is.na(start_lng),
         !is.na(end_lat), !is.na(end_lng),
         start_station_name != end_station_name) %>%   # drop self-loops before ranking so that 25 routes remain per group
  group_by(member_casual, start_station_name, start_lat, start_lng,
           end_station_name, end_lat, end_lng) %>%
  summarise(route_count = n(), .groups = "drop") %>%
  group_by(member_casual) %>%
  slice_max(order_by = route_count, n = 25, with_ties = FALSE)

top_casual_routes <- top_routes %>% filter(member_casual == "casual")
top_member_routes <- top_routes %>% filter(member_casual == "member")

# Step 1: Determine non-loop member routes
non_loops <- top_member_routes %>%
  filter(start_station_name != end_station_name) %>%
  mutate(route_id = row_number())

non_loops <- non_loops %>%    # jitter the coordinates because many routes overlap
  mutate(
    start_lat = start_lat + runif(n(), -0.0015, 0.0015),
    start_lng = start_lng + runif(n(), -0.0015, 0.0015),
    end_lat   = end_lat   + runif(n(), -0.0015, 0.0015),
    end_lng   = end_lng   + runif(n(), -0.0015, 0.0015)
  )
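Because runif() draws new random offsets on every run, the jittered lines will sit in slightly different places each time the report is knitted. Setting a seed beforehand would make the map reproducible. A minimal sketch (the seed value 2024 is arbitrary, and the variable names are mine):

```r
# Fix the random seed before drawing the jitter offsets.
set.seed(2024)
offsets_first <- runif(4, -0.0015, 0.0015)

# Resetting the same seed reproduces exactly the same offsets.
set.seed(2024)
offsets_second <- runif(4, -0.0015, 0.0015)

same_offsets <- identical(offsets_first, offsets_second)
```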

# Step 2: Straight lines for non-loops
non_loop_geoms <- pmap(                                      # Apply a function row-by-row to multiple coordinate columns
  list(non_loops$start_lng, non_loops$start_lat,             # List of start and end coordinates for each route
       non_loops$end_lng, non_loops$end_lat),
  ~ st_linestring(matrix(c(..1, ..2, ..3, ..4),              # Create a 2x2 matrix of coordinates for a line
                         ncol = 2, byrow = TRUE))            # Matrix: [start_lng, start_lat], [end_lng, end_lat]
)

non_loop_sf <- st_sf(                                        # Create a simple features (sf) object
  non_loops,                                                 # Use the original route data as attributes
  geometry = st_sfc(non_loop_geoms, crs = 4326)              # Add the LINESTRING geometries with WGS84 coordinate system
)

# Step 3: Color palette
color_pal <- colorBin(
  palette = "viridis",                                       # the color scheme
  domain  = non_loop_sf$route_count,                         # the values you're coloring by
  bins    = 6                                                # how many color bins to create
)

leaflet(non_loop_sf) %>%                                       # Initialize leaflet map with route data
  addProviderTiles("CartoDB.Positron") %>%                     # Add a clean basemap for urban context
  addLegend("bottomright",                                     # Add a color legend to explain route counts
            pal = color_pal,                                   # Use the previously defined viridis color palette
            values = ~route_count,                             # Base legend on the number of rides per route
            title = "Number of Rides") %>%                     # Legend title
  addPolylines(                                                # Add the route lines to the map
    color = ~color_pal(route_count),                           # Color each line by its ride count
    weight = 4,                                                # Line thickness
    opacity = 0.8,                                             # Line transparency
    label = ~paste0("Route ", route_id, ": ",                  # Tooltip: show route ID
                    start_station_name, " → ",                 # Start and end station
                    end_station_name, 
                    " (", route_count, " rides)")              # Number of rides in parentheses
  ) %>%
  setView(lng = -87.63, lat = 41.83, zoom = 11.5) %>%               # Set initial map view centered on Chicago
  addControl("<strong>Top 25 routes for Members</strong>", position = "topright")

The top 25 non-loop routes by members are concentrated in the South Side and Hyde Park areas, with a few high-traffic corridors near the West Loop and downtown.

# Step 1: Determine non-loop casual routes
non_loops <- top_casual_routes %>%
  filter(start_station_name != end_station_name) %>%
  mutate(route_id = row_number())

non_loops <- non_loops %>%    # Jitter the coordinates slightly (up to ±0.0015 degrees) because many routes overlap
  mutate(
    start_lat = start_lat + runif(n(), -0.0015, 0.0015),
    start_lng = start_lng + runif(n(), -0.0015, 0.0015),
    end_lat   = end_lat   + runif(n(), -0.0015, 0.0015),
    end_lng   = end_lng   + runif(n(), -0.0015, 0.0015)
  )

# Step 2: Straight lines for non-loops
non_loop_geoms <- pmap(
  list(non_loops$start_lng, non_loops$start_lat,
       non_loops$end_lng, non_loops$end_lat),
  ~ st_linestring(matrix(c(..1, ..2, ..3, ..4), ncol = 2, byrow = TRUE))
)

non_loop_sf <- st_sf(
  non_loops,
  geometry = st_sfc(non_loop_geoms, crs = 4326)
)

# Step 3: Color palette
color_pal <- colorBin(
  palette = "viridis",
  domain  = non_loop_sf$route_count,
  bins    = 6
)

# Step 4: Plot leaflet map
leaflet(non_loop_sf) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addLegend("bottomright",
            pal = color_pal,
            values = ~route_count,
            title = "Number of Rides") %>%
  addPolylines(
    color = ~color_pal(route_count),
    weight = 4,
    opacity = 0.8,
    label = ~paste0("Route ", route_id, ": ",
                    start_station_name, " → ",
                    end_station_name,
                    " (", route_count, " rides)")
  ) %>%
  setView(lng = -87.63, lat = 41.895, zoom = 12.5)  %>%
  addControl("<strong>Top 25 routes for Casuals</strong>", position = "topright")

The top 25 non-loop routes by casual users cluster heavily along Chicago’s lakefront and around major tourist attractions like Millennium Park and Shedd Aquarium.

Geographic Comparison of routes by user type?

Aspect	Member Users	Casual Users
Geographic Spread	Broad — includes South Side, West Loop, and Hyde Park	Concentrated — downtown core and lakefront areas
Route Diversity	High — varied commuting routes and wider city coverage	Moderate — leisure-focused loops with some redundancy
Tourist Influence	Low — rides appear utilitarian and commute-driven	High — strong presence near landmarks and attractions
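
One way to put numbers behind the "Geographic Spread" row is to measure how far the route endpoints lie from their common centre. The sketch below is a minimal illustration in base R, not the analysis used in this report. `haversine_km()` and `endpoint_spread_km()` are hypothetical helpers, and the two small data frames hold made-up coordinates that only reuse the column names from the route tables above (`start_lat`, `start_lng`, `end_lat`, `end_lng`).

```r
# Great-circle distance in km between two points (haversine formula).
haversine_km <- function(lat1, lng1, lat2, lng2) {
  r <- 6371                                  # mean Earth radius in km
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Mean distance (km) of all route endpoints from their common centroid.
endpoint_spread_km <- function(routes) {
  lat <- c(routes$start_lat, routes$end_lat)
  lng <- c(routes$start_lng, routes$end_lng)
  mean(haversine_km(lat, lng, mean(lat), mean(lng)))
}

# Made-up example routes (NOT real Cyclistic data).
toy_member <- data.frame(
  start_lat = c(41.79, 41.88, 41.95), start_lng = c(-87.60, -87.65, -87.65),
  end_lat   = c(41.80, 41.89, 41.94), end_lng   = c(-87.59, -87.64, -87.66)
)
toy_casual <- data.frame(
  start_lat = c(41.88, 41.89, 41.87), start_lng = c(-87.62, -87.61, -87.62),
  end_lat   = c(41.89, 41.88, 41.88), end_lng   = c(-87.61, -87.62, -87.61)
)

spread_member <- endpoint_spread_km(toy_member)
spread_casual <- endpoint_spread_km(toy_casual)
```

A larger value means the endpoints are more dispersed. Applied to the top-route table of each user type (ideally before jittering the coordinates), it would give a rough numeric check of the first row of the table.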

Act

Top 3 Marketing Actions to Convert Casual Users to Annual Members

1. Deploy Membership Promos in High-Casual Zones: Casual riders are heavily concentrated near lakefront and downtown attractions such as Millennium Park, Navy Pier, and DuSable Harbor. Action: Offer location-based promotions such as lower per-minute costs, priority bike access, or limited-time membership trials at these hotspots.
2. Educate Casual Riders on Membership Value: Many casual riders take longer rides and self-loops, which may indicate a lack of awareness of the cost savings that come with membership. Action: Use in-app nudges, post-ride emails, or ride summaries to highlight member perks such as unlimited short rides, free ride time, and faster checkout. Focus especially on repeat casual users and on those whose rides exceed 20 minutes.
3. Run Weekend-Focused Membership Campaigns: Casual usage peaks on weekends and in the afternoons, which matches recreational riding patterns. Action: Launch weekend conversion offers such as “Upgrade to Member Today & Ride Free All Weekend”. Partner with local attractions and festivals to offer bundled incentives (e.g., ride credits or discounts with ticket purchases) that reinforce the value of membership.
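
The targeting criteria in these actions (rides over 20 minutes, weekend rides) can be turned into simple flags on the trip data. The following is a minimal sketch under stated assumptions: the toy `trips` data frame is invented, its column names follow the public Divvy files (`started_at`, `ended_at`, `member_casual`), and the 20-minute threshold is taken from the second action. The public trip files do not appear to include a rider identifier, so this sketch works at the ride level and cannot single out repeat casual users.

```r
# Invented example trips (NOT real Cyclistic data); column names follow Divvy's files.
trips <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member"),
  started_at = as.POSIXct(c("2024-07-06 14:00:00", "2024-07-07 15:30:00",
                            "2024-07-09 08:00:00", "2024-07-08 08:05:00",
                            "2024-07-13 10:00:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2024-07-06 14:45:00", "2024-07-07 15:40:00",
                            "2024-07-09 08:25:00", "2024-07-08 08:15:00",
                            "2024-07-13 10:30:00"), tz = "UTC")
)

# Ride length in minutes and simple targeting flags.
trips$ride_min   <- as.numeric(difftime(trips$ended_at, trips$started_at, units = "mins"))
trips$long_ride  <- trips$ride_min > 20
# format(..., "%u") gives the ISO weekday number: 6 = Saturday, 7 = Sunday.
trips$is_weekend <- format(trips$started_at, "%u") %in% c("6", "7")

# Share of casual rides that meet each criterion.
casual <- trips[trips$member_casual == "casual", ]
share_long_casual    <- mean(casual$long_ride)
share_weekend_casual <- mean(casual$is_weekend)
```

Run on the full 2024 data, the two shares would indicate how many casual rides each campaign could plausibly reach.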

Cyclistic Bikeshare using R

Mehran Hojati

2025-05-20

Introduction

Scenario

Ask

Prepare

Process

Analyze

Seasonality?

Day of week effect?

Hour of day effect?

Top start stations?

Self-loops

Top routes

Geographic Comparison of routes by user type?

Act

Top 3 Marketing Actions to Convert Casual Users to annual Members
