Introduction
This is the report for the capstone project of my Google Data
Analytics Professional Certificate program. I use the R programming
language and RStudio Desktop. Note that the free version of RStudio
Cloud cannot handle the amount of data needed for this project.
Scenario
I am a junior data analyst working in the marketing analysis team of
Cyclistic, a bike-share company in Chicago. Note: Cyclistic is a fictional
name; the underlying company is real and is called Divvy https://divvybikes.com.
The director of marketing believes that the company’s future success
depends on maximizing the number of annual memberships. My team wants to
understand how casual riders and annual members use Cyclistic bikes
differently. From these insights, my team will design a new marketing
strategy to convert casual riders into annual members. But first,
Cyclistic executives must approve my recommendations, so my
recommendations must be backed up with compelling data insights and
professional data visualizations.
Ask
Three questions will guide the future marketing program:
- How do annual members and casual riders use Cyclistic bikes
differently?
- Why would casual riders buy a Cyclistic annual membership?
- How can Cyclistic use digital media to influence casual riders to
become annual members?
Lily Moreno, the director of marketing and my manager, has assigned me
the first question to answer.
Prepare
I use Cyclistic’s monthly trip data https://divvy-tripdata.s3.amazonaws.com/index.html which
is publicly available. According to Divvy https://divvybikes.com/system-data, the data has been
processed to remove trips that are taken by staff as they service and
inspect the system, and any trips that were below 60 seconds in length
(potentially false starts or users trying to re-dock a bike to ensure it
was secure).
To see the effect of seasons on rides, I use 12 months of data. To
make the seasonality easier to read, I use one calendar year, from
January to December 2024.
The data was downloaded to the RStudio working directory on my computer.
To identify the working directory, I used the getwd() command. All trip
data is in comma-delimited (.CSV) format.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(patchwork)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(leaflet)
library(viridis)
## Loading required package: viridisLite
jan24 <- read_csv("202401-divvy-tripdata.csv") # I made sure that these files were in my work directory
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb24 <- read_csv("202402-divvy-tripdata.csv")
## Rows: 223164 Columns: 13
mar24 <- read_csv("202403-divvy-tripdata.csv")
## Rows: 301687 Columns: 13
apr24 <- read_csv("202404-divvy-tripdata.csv")
## Rows: 415025 Columns: 13
may24 <- read_csv("202405-divvy-tripdata.csv")
## Rows: 609493 Columns: 13
jun24 <- read_csv("202406-divvy-tripdata.csv")
## Rows: 710721 Columns: 13
jul24 <- read_csv("202407-divvy-tripdata.csv")
## Rows: 748962 Columns: 13
aug24 <- read_csv("202408-divvy-tripdata.csv")
## Rows: 755639 Columns: 13
sep24 <- read_csv("202409-divvy-tripdata.csv")
## Rows: 821276 Columns: 13
oct24 <- read_csv("202410-divvy-tripdata.csv")
## Rows: 616281 Columns: 13
nov24 <- read_csv("202411-divvy-tripdata.csv")
## Rows: 335075 Columns: 13
dec24 <- read_csv("202412-divvy-tripdata.csv")
## Rows: 178372 Columns: 13
I checked in my RStudio Environment pane that these files were
actually loaded. Then, I merged them into a single data frame.
tripdata <- bind_rows(jan24, feb24, mar24, apr24, may24, jun24, jul24, aug24, sep24, oct24, nov24, dec24)
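The twelve near-identical read_csv() calls above can also be written as a loop over file names. The following is a minimal sketch of that idea, not the method used in this report: it writes two tiny stand-in files to a temporary folder (the folder name `divvy_demo` and the toy columns are hypothetical) so that it runs without the real downloads.

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical demo folder with two tiny stand-in files; the real project
# would point at the working directory holding the twelve Divvy CSVs.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)

months <- sprintf("2024%02d", 1:2)   # use 1:12 for the full year
files  <- file.path(demo_dir, paste0(months, "-divvy-tripdata.csv"))

# Write the stand-in files (three toy rides per month)
walk2(files, months, function(path, month_tag) {
  write_csv(
    tibble(
      ride_id       = paste0("ride_", month_tag, "_", 1:3),
      member_casual = c("member", "casual", "member")
    ),
    path
  )
})

# Read every monthly file and stack the results into one table
tripdata_demo <- files %>%
  map(function(path) read_csv(path, show_col_types = FALSE)) %>%
  bind_rows()

nrow(tripdata_demo)
```

With the real files, only `months` and the folder would change; setting `show_col_types = FALSE` also silences the repeated column-specification messages.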
Process
Clean and prepare the data for analysis.
colnames(tripdata)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(tripdata)
## spc_tbl_ [5,860,568 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
## $ rideable_type : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
## $ ended_at : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
## $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
## $ start_station_id : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
## $ end_station_name : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
## $ end_station_id : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
## $ start_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:5860568] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(tripdata)
## ride_id rideable_type started_at
## Length:5860568 Length:5860568 Min. :2024-01-01 00:00:39.00
## Class :character Class :character 1st Qu.:2024-05-20 19:47:53.00
## Mode :character Mode :character Median :2024-07-22 20:36:16.27
## Mean :2024-07-17 07:55:47.61
## 3rd Qu.:2024-09-17 20:14:22.56
## Max. :2024-12-31 23:56:49.84
##
## ended_at start_station_name start_station_id
## Min. :2024-01-01 00:04:20.00 Length:5860568 Length:5860568
## 1st Qu.:2024-05-20 20:07:54.75 Class :character Class :character
## Median :2024-07-22 20:53:59.16 Mode :character Mode :character
## Mean :2024-07-17 08:13:06.54
## 3rd Qu.:2024-09-17 20:27:46.02
## Max. :2024-12-31 23:59:55.70
##
## end_station_name end_station_id start_lat start_lng
## Length:5860568 Length:5860568 Min. :41.64 Min. :-87.91
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. :16.06 Min. :-144.05 Length:5860568
## 1st Qu.:41.88 1st Qu.: -87.66 Class :character
## Median :41.90 Median : -87.64 Mode :character
## Mean :41.90 Mean : -87.65
## 3rd Qu.:41.93 3rd Qu.: -87.63
## Max. :87.96 Max. : 152.53
## NA's :7232 NA's :7232
head(tripdata)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike 2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike 2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike 2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
colSums(is.na(tripdata))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 1073951 1073951 1104653 1104653
## start_lat start_lng end_lat end_lng
## 0 0 7232 7232
## member_casual
## 0
The data has 13 columns (variables): “ride_id”,
“rideable_type” (electric_bike, classic_bike), “started_at”, “ended_at”,
“start_station_name”, “start_station_id”, “end_station_name”,
“end_station_id”, “start_lat” (latitude), “start_lng” (longitude),
“end_lat”, “end_lng”, “member_casual” (member, casual); and there are
over 5.8 million rows (rides). The minimum and maximum of end_lat and
end_lng are far from Chicago, probably due to signal drift, station
mislabeling, or technical glitches. Also, 7,232 values of end_lat and
end_lng are missing (NA). In addition, over 1 million station names and
IDs are missing (NA), likely because those rides started or ended at
dockless locations or wherever the riders left their bikes.
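Raw NA counts are easier to judge as a share of all rows. The sketch below shows one way to get that share per column with `colMeans(is.na(...))`; the data frame `toy` is made up for illustration and is not the trip data.

```r
# Toy data frame standing in for tripdata (values are invented)
toy <- data.frame(
  ride_id            = c("a", "b", "c", "d"),
  start_station_name = c("X", NA, "Y", NA),
  end_lat            = c(41.9, 41.9, NA, 41.9)
)

# is.na() gives a logical matrix; the column mean of TRUE/FALSE is the
# fraction of missing values, shown here as a percentage
na_share <- round(100 * colMeans(is.na(toy)), 1)
na_share
```

Applied to the full data, the same call would report what percentage of rides lack a station name or end coordinates.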
I kept only the rides whose start and end coordinates fall within the
Chicago area, in order to prevent map distortion.
tripdata_clean_lat_lng <- tripdata %>%
filter(
between(start_lat, 41.6, 42.1),
between(start_lng, -88.0, -87.5),
between(end_lat, 41.6, 42.1),
between(end_lng, -88.0, -87.5)
)
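A side effect of this filter is worth knowing: `between()` returns NA for a missing coordinate, and `filter()` drops rows whose condition is NA, so rides with missing end coordinates are removed along with the out-of-range ones. The sketch below illustrates this on invented rows; the helper `in_chicago()` is a hypothetical name introduced only for this example.

```r
library(dplyr)

# Four invented rides: one valid, one with missing end coordinates,
# and two with end coordinates far from Chicago
rides <- tibble(
  ride_id   = c("ok", "missing_end", "far_lat", "far_lng"),
  start_lat = rep(41.90, 4),
  start_lng = rep(-87.63, 4),
  end_lat   = c(41.91, NA, 16.06, 41.91),
  end_lng   = c(-87.64, NA, -87.64, 152.53)
)

# Hypothetical helper: same bounding box as the filter above
in_chicago <- function(lat, lng) {
  between(lat, 41.6, 42.1) & between(lng, -88.0, -87.5)
}

checked <- rides %>%
  mutate(
    start_ok = in_chicago(start_lat, start_lng),
    end_ok   = in_chicago(end_lat, end_lng)
  )

# filter() keeps only rows where every condition is TRUE (NA is dropped)
kept <- checked %>% filter(start_ok, end_ok)

# Why rows were dropped: missing versus out of range
dropped_summary <- checked %>%
  summarise(
    missing_end  = sum(is.na(end_ok)),
    out_of_range = sum(!end_ok, na.rm = TRUE)
  )
dropped_summary
```

Counting the two reasons separately makes it clear how much data the bounding box removes for each cause.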
For later analysis, I calculated the ride_lengths,
tripdata_clean_lat_lng <- tripdata_clean_lat_lng %>%
mutate(ride_length = as.numeric(ended_at - started_at, units = "mins"))
and summarized the ride_lengths:
summary(tripdata_clean_lat_lng$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2748.317 5.539 9.700 15.483 17.207 1509.367
I saw that the min was a negative number, which is impossible.
Therefore, I filtered out observations with negative ride_length as well
as any observations with ride_length < 1 minute (potentially false
starts or users trying to re-dock a bike to ensure it was secure).
tripdata_clean <- tripdata_clean_lat_lng %>%
filter(ride_length >= 1)
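To make the ride_length calculation concrete, here is a small base R sketch on three invented rides. It uses `difftime()` with explicit units, which is equivalent to the subtraction used above; the timestamps are made up.

```r
# Three invented rides: a normal one, one that "ends" before it starts,
# and one that lasts only 30 seconds
started_at <- as.POSIXct(
  c("2024-01-01 10:00:00", "2024-01-01 11:00:00", "2024-01-01 12:00:00"),
  tz = "UTC"
)
ended_at <- as.POSIXct(
  c("2024-01-01 10:07:30", "2024-01-01 10:55:00", "2024-01-01 12:00:30"),
  tz = "UTC"
)

# Duration in minutes, as a plain number
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))
ride_length

# Same rule as above: keep rides of at least one minute
keep <- ride_length >= 1
sum(keep)
```

Stating the units explicitly matters because `difftime` otherwise picks a unit based on the size of the differences.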
Analyze
Compare annual members and casual riders
tripdata_clean %>%
group_by(member_casual) %>%
summarise(
ride_count = n(),
ride_percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
)
## # A tibble: 2 × 3
## member_casual ride_count ride_percentage
## <chr> <int> <dbl>
## 1 casual 2080374 36.4
## 2 member 3641302 63.6
Members take about 1.75 times as many rides as casual riders (63.6%
versus 36.4% of all rides).
tripdata_clean %>%
group_by(member_casual) %>%
summarise(
average_ride_length = round(mean(ride_length), 2),
median_length = round(median(ride_length), 2),
max_ride_length = round(max(ride_length), 2),
min_ride_length = round(min(ride_length), 2)
)
## # A tibble: 2 × 5
## member_casual average_ride_length median_length max_ride_length
## <chr> <dbl> <dbl> <dbl>
## 1 casual 21.8 12.4 1509.
## 2 member 12.4 8.85 1500.
## # ℹ 1 more variable: min_ride_length <dbl>
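On average, casual riders' rides are about 1.8 times as long as members'
rides (21.8 versus 12.4 minutes); the median casual ride is about 1.4
times as long (12.4 versus 8.85 minutes).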
All riders are charged an extra fee for each minute over 3 hours. So, it
is informative to analyze rides of at most 180 minutes separately from
rides longer than 180 minutes.
filtered_data <- tripdata_clean %>%
filter(ride_length <= 180)
ggplot(filtered_data, aes(x = ride_length, fill = member_casual)) +
geom_histogram(binwidth = 2) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE) ) +
labs(title = "Distribution of Ride Lengths (≤ 180 mins)",
x = "Ride Length (minutes)",
y = "Count") +
theme_minimal() +
facet_wrap(~member_casual, ncol = 2) +
theme(legend.position = "none")

tripdata_clean %>%
filter(ride_length > 180) %>%
group_by(member_casual) %>%
summarise(
long_ride_count = n(),
percentage = round((n() / nrow(tripdata_clean)) * 100, 2)
)
## # A tibble: 2 × 3
## member_casual long_ride_count percentage
## <chr> <int> <dbl>
## 1 casual 10009 0.17
## 2 member 4251 0.07
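The percentages above are shares of all rides. Relative to each group's
own total, about 0.48% of casual rides (10,009 of 2,080,374) exceed 180
minutes, compared with about 0.12% of member rides (4,251 of 3,641,302),
so casual riders are roughly four times as likely as members to take
ultra-long rides. While both shares are small, the absolute number of
casual long rides is substantial: over 10,000 rides.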
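The likelihood of a long ride within each rider group is the long-ride count divided by that group's own ride total, not by the total of all rides. A short sketch of that arithmetic, using the counts printed in the tables above (the column name `pct_within_group` is introduced here for illustration):

```r
# Counts copied from the summary tables in this report
long_rides <- data.frame(
  member_casual   = c("casual", "member"),
  long_ride_count = c(10009, 4251),
  total_rides     = c(2080374, 3641302)
)

# Share of each group's own rides that exceed 180 minutes
long_rides$pct_within_group <-
  round(100 * long_rides$long_ride_count / long_rides$total_rides, 2)

# How many times more likely a casual ride is to be a long ride
likelihood_ratio <-
  long_rides$pct_within_group[1] / long_rides$pct_within_group[2]

long_rides
likelihood_ratio
```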
Seasonality?
tripdata_clean <- tripdata_clean %>%
mutate(
month = format(as.Date(started_at), "%B"),
month = factor(month, levels = c(
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
), ordered = TRUE)
)
seasonality_summary <- tripdata_clean %>%
group_by(member_casual, month) %>%
summarise(
number_of_rides = n(),
average_ride_length = round(mean(ride_length),2),
.groups = "drop"
) %>%
arrange(member_casual, month)
# Plot 1
p1 <- seasonality_summary %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
labs(title = "Number of Rides", x = "Month", y = "Number of Rides") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
# Plot 2
p2 <- seasonality_summary %>%
ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
labs(title = "Average Ride Lengths", x = "Month", y = "Average Ride Length (minutes)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Combine them
combined_plot <- p1 / p2 +
plot_annotation(
title = "Seasonality Analysis: Number of Rides and Average Ride Lengths",
theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
)
# Show the combined plot
combined_plot

Members ride more in every month. Casual riders have longer average
ride lengths in every month.
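One caveat about the month factor: `format(x, "%B")` returns month names in the session's locale, so on a non-English system the names would not match the English levels and the factor would fill with NA. A minimal locale-independent sketch, assuming English labels are wanted, maps the numeric month onto R's built-in `month.name` constant (the timestamps are invented):

```r
# Invented timestamps standing in for started_at
started_at <- as.POSIXct(
  c("2024-01-15 08:00:00", "2024-07-04 12:00:00", "2024-12-31 23:00:00"),
  tz = "UTC"
)

# "%m" is the numeric month (01-12) and does not depend on the locale
month_number <- as.integer(format(started_at, "%m"))

# month.name is a base R constant: "January", ..., "December"
month <- factor(month.name[month_number], levels = month.name, ordered = TRUE)
month
```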
Day of week effect?
tripdata_clean <- tripdata_clean %>%
mutate(
day_of_week = format(as.Date(started_at), "%A"),
day_of_week = factor(day_of_week,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
ordered = TRUE)
)
day_summary <- tripdata_clean %>%
group_by(member_casual, day_of_week) %>%
summarise(
number_of_rides = n(),
average_ride_length = round(mean(ride_length),2),
.groups = "drop"
) %>%
arrange(member_casual, day_of_week)
p1 <- day_summary %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
labs(title ="Total Rides", x = "Day of the Week", y = "Number of Rides") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
p2 <- day_summary %>%
ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
labs(title = "Average Ride Lengths", x = "Day of the Week", y = "Average Ride Length (minutes)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
combined_plot <- p1 / p2 +
plot_annotation(
title = "Day-of-week Analysis: Number of Rides and Average Ride Lengths",
theme = theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
)
# Show the combined plot
combined_plot

Members consistently take more rides than casuals every day of the
week. Weekdays (Mon–Fri) show a steady, high volume of member rides —
consistent with commuting behavior. Casual ridership peaks on weekends,
suggesting that casuals tend to ride for leisure or recreation.
Casual riders consistently take longer rides than members on all days
of the week. Casual ride lengths range from 22 to 26 minutes, peaking on
Sunday and Saturday. Member ride lengths are shorter, around 12–14
minutes, and relatively stable across the week.
Hour of day effect?
tripdata_clean <- tripdata_clean %>%
mutate(start_hour = hour(started_at)) # lubridate::hour() reads the hour in the timestamp's own time zone
hour_summary <- tripdata_clean %>%
group_by(member_casual, start_hour) %>%
summarise(
number_of_rides = n(),
average_ride_length = round(mean(ride_length),2),
.groups = "drop"
) %>%
arrange(member_casual, start_hour)
hour_summary %>%
ggplot(aes(x = start_hour, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
scale_x_continuous(breaks = 0:23) +
labs(
x = "Hour of the Day",
y = "Number of Rides",
title = "Cyclistic's Bike Demand by Hour of Day"
) +
theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
Members' demand for bikes is higher than casual riders' demand during
every hour of the day. For both groups the peak demand occurs just after
noon, but members have another peak during early mornings, showing a
commuter-like pattern, whereas the casual riders' demand distribution
suggests leisure or tourism behavior.
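Hourly patterns like these depend on which time zone the hour is read in. readr parses timestamps that carry no UTC offset as UTC by default; the sketch below assumes (this is an assumption about the Divvy files, not something stated in the source) that started_at records Chicago wall-clock time. Under that assumption the hour should be read in the stored time zone and not converted, as this example with an invented timestamp shows:

```r
# An invented timestamp, stored the way read_csv() stores it by default
started_at <- as.POSIXct("2024-07-01 17:30:00", tz = "UTC")

# Hour as stored: matches the wall-clock value in the file
hour_as_stored <- as.integer(format(started_at, "%H", tz = "UTC"))

# Converting to Chicago time shifts the hour by the UTC offset
# (5 hours in July) and would move a 5 pm ride to noon
hour_shifted <- as.integer(format(started_at, "%H", tz = "America/Chicago"))

c(as_stored = hour_as_stored, shifted = hour_shifted)
```

If the peak hours in a chart look implausible, an unintended conversion of this kind is one thing to rule out.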
ggplot(tripdata_clean, aes(x = start_hour, fill = member_casual)) +
geom_bar(position = "dodge") +
facet_wrap(~ day_of_week) +
labs(
title = "Bike Demand by Hour and Day of Week",
x = "Hour of the Day",
y = "Number of Rides"
) +
scale_x_continuous(breaks = 0:23) +
theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
The demand by hour for each day of the week clearly shows that members
follow a commuter pattern on weekdays, but on weekends they ride
leisurely, like casual riders.
Top start stations?
# Prepare casual and member data
top_casual <- tripdata_clean %>%
filter(member_casual == "casual", !is.na(start_station_name), !is.na(start_lat), !is.na(start_lng)) %>%
group_by(start_station_name, start_lat, start_lng) %>%
summarise(start_count = n(), .groups = "drop") %>%
slice_max(start_count, n = 25)
top_member <- tripdata_clean %>%
filter(member_casual == "member", !is.na(start_station_name), !is.na(start_lat), !is.na(start_lng)) %>%
group_by(start_station_name, start_lat, start_lng) %>%
summarise(start_count = n(), .groups = "drop") %>%
slice_max(start_count, n = 25)
m1 <- leaflet(data = top_member) %>% # interactive map
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -87.63, lat = 41.88, zoom = 11) %>% # Sets the initial center and zoom level of the map view
addCircles(
lng = ~start_lng,
lat = ~start_lat,
radius = ~sqrt(start_count) * 2,
color = "blue",
stroke = FALSE, # no outline
fillOpacity = 0.6,
label = ~paste0(start_station_name, ": ", start_count, " rides")
) %>%
addControl("<strong>Top 25 Start Stations for Members</strong>", position = "topright")
m1
Most of the top 25 starting stations for members are concentrated in
downtown Chicago, with additional clusters near residential areas on the
North and South Sides.
m2 <- leaflet(data = top_casual) %>%
addProviderTiles("CartoDB.Positron") %>%
setView(lng = -87.63, lat = 41.92, zoom = 12) %>%
addCircles(
lng = ~start_lng,
lat = ~start_lat,
radius = ~sqrt(start_count)* 2, # adjust multiplier to your data
color = "red",
stroke = FALSE,
fillOpacity = 0.6,
label = ~paste0(start_station_name, ": ", start_count, " rides")
) %>%
addControl("<strong>Top 25 Start Stations for Casuals</strong>", position = "topright")
m2
The top 25 starting stations for casual users are heavily clustered
along the lakefront and near downtown tourist areas, reflecting strong
recreational and sightseeing usage.
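The station counts above group by station name together with latitude and longitude. If the recorded coordinates vary slightly between rides from the same station (an assumption that should be checked against the data), one station is split into several small groups and its count is understated. A minimal sketch of an alternative, on invented rides, groups by name only and averages the coordinates:

```r
library(dplyr)

# Invented rides: station "A" has three rides with slightly different
# recorded coordinates, station "B" has two identical ones
rides <- tibble(
  start_station_name = c("A", "A", "A", "B", "B", NA),
  start_lat = c(41.9000, 41.9002, 41.9001, 41.88, 41.88, 41.95),
  start_lng = c(-87.63, -87.63, -87.63, -87.62, -87.62, -87.65)
)

# Grouping by name AND coordinates splits "A" into three groups of one
fragmented <- rides %>%
  filter(!is.na(start_station_name)) %>%
  count(start_station_name, start_lat, start_lng, name = "start_count")

# Grouping by name only, with one averaged coordinate per station
per_station <- rides %>%
  filter(!is.na(start_station_name)) %>%
  group_by(start_station_name) %>%
  summarise(
    start_count = n(),
    start_lat   = mean(start_lat),
    start_lng   = mean(start_lng),
    .groups     = "drop"
  ) %>%
  slice_max(start_count, n = 1, with_ties = FALSE)

per_station
```

In the fragmented version, "B" looks busier than "A" even though "A" has more rides; the per-station version ranks them correctly.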
Self-loops
Before we can show the top routes on a map, we have to identify and
exclude the self-loops (i.e., trips that start and end at the same
station), because a self-loop cannot be drawn as a line between two
points on a map.
# STEP 1: Filter self-loops
self_loops <- tripdata_clean %>%
filter(start_station_name == end_station_name)
# STEP 2: Count self-loop rides by user type
self_loop_summary <- self_loops %>%
group_by(member_casual) %>%
summarise(self_loop_rides = n(), .groups = "drop")
# STEP 3: Count total rides by user type
total_rides_summary <- tripdata_clean %>%
group_by(member_casual) %>%
summarise(total_rides = n(), .groups = "drop")
# STEP 4: Join and compute percentage
summary_table <- left_join(self_loop_summary, total_rides_summary, by = "member_casual") %>%
mutate(percent_self_loops = round(100 * self_loop_rides / total_rides, 2))
# STEP 5: Display table
summary_table
## # A tibble: 2 × 4
## member_casual self_loop_rides total_rides percent_self_loops
## <chr> <int> <int> <dbl>
## 1 casual 130120 2080374 6.25
## 2 member 65032 3641302 1.79
Self-loop rides represent only 1.79% of rides by members and 6.25% of
rides by casual users. Because these shares are small and self-loops
cannot be properly depicted on a route map, they are excluded from the
following analysis of top routes.
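The five steps above can also be collapsed into a single grouped summary. The sketch below shows this on invented rides; it keeps the same denominator as the table (all rides in the group) and treats rides with a missing station name as not being self-loops, which is also what `filter()` does with NA comparisons.

```r
library(dplyr)

# Invented rides; NA marks a missing station name
rides <- tibble(
  member_casual      = rep(c("casual", "member"), each = 4),
  start_station_name = c("A", "A", "B", NA, "A", "B", "C", "C"),
  end_station_name   = c("A", "B", "B", "C", "B", "B", "A", NA)
)

self_loop_share <- rides %>%
  group_by(member_casual) %>%
  summarise(
    # comparisons involving NA yield NA, so na.rm = TRUE skips them
    self_loop_rides    = sum(start_station_name == end_station_name, na.rm = TRUE),
    total_rides        = n(),
    percent_self_loops = round(100 * self_loop_rides / total_rides, 2),
    .groups = "drop"
  )

self_loop_share
```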
Top routes
top_routes <- tripdata_clean %>%
filter(!is.na(start_station_name), !is.na(end_station_name),
!is.na(start_lat), !is.na(start_lng),
!is.na(end_lat), !is.na(end_lng)) %>%
group_by(member_casual, start_station_name, start_lat, start_lng,
end_station_name, end_lat, end_lng) %>%
summarise(route_count = n(), .groups = "drop") %>%
group_by(member_casual) %>%
slice_max(order_by = route_count, n = 25, with_ties = FALSE)
top_casual_routes <- top_routes %>% filter(member_casual == "casual")
top_member_routes <- top_routes %>% filter(member_casual == "member")
# Step 1: Determine non-loop member routes
non_loops <- top_member_routes %>%
filter(start_station_name != end_station_name) %>%
mutate(route_id = row_number())
non_loops <- non_loops %>% # need to jitter the coordinates bec. many routes overlap
mutate(
start_lat = start_lat + runif(n(), -0.0015, 0.0015),
start_lng = start_lng + runif(n(), -0.0015, 0.0015),
end_lat = end_lat + runif(n(), -0.0015, 0.0015),
end_lng = end_lng + runif(n(), -0.0015, 0.0015)
)
# Step 2: Straight lines for non-loops
non_loop_geoms <- pmap( # Apply a function row-by-row to multiple coordinate columns
list(non_loops$start_lng, non_loops$start_lat, # List of start and end coordinates for each route
non_loops$end_lng, non_loops$end_lat),
~ st_linestring(matrix(c(..1, ..2, ..3, ..4), # Create a 2x2 matrix of coordinates for a line
ncol = 2, byrow = TRUE)) # Matrix: [start_lng, start_lat], [end_lng, end_lat]
)
non_loop_sf <- st_sf( # Create a simple features (sf) object
non_loops, # Use the original route data as attributes
geometry = st_sfc(non_loop_geoms, crs = 4326) # Add the LINESTRING geometries with WGS84 coordinate system
)
# Step 3: Color palette
color_pal <- colorBin(
palette = "viridis", # the color scheme
domain = non_loop_sf$route_count, # the values you're coloring by
bins = 6 # how many color bins to create
)
leaflet(non_loop_sf) %>% # Initialize leaflet map with route data
addProviderTiles("CartoDB.Positron") %>% # Add a clean basemap for urban context
addLegend("bottomright", # Add a color legend to explain route counts
pal = color_pal, # Use the previously defined viridis color palette
values = ~route_count, # Base legend on the number of rides per route
title = "Number of Rides") %>% # Legend title
addPolylines( # Add the route lines to the map
color = ~color_pal(route_count), # Color each line by its ride count
weight = 4, # Line thickness
opacity = 0.8, # Line transparency
label = ~paste0("Route ", route_id, ": ", # Tooltip: show route ID
start_station_name, " → ", # Start and end station
end_station_name,
" (", route_count, " rides)") # Number of rides in parentheses
) %>%
setView(lng = -87.63, lat = 41.83, zoom = 11.5) %>% # Set initial map view centered on Chicago
addControl("<strong>Top 25 routes for Members</strong>", position = "topright")
Among the top 25 member routes, those that are not self-loops are
concentrated in the South Side and Hyde Park areas, with a few
high-traffic corridors near the West Loop and downtown.
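Because the jitter is drawn with `runif()`, the lines on the route maps shift a little each time the notebook is knit. A minimal sketch of making the jitter reproducible with `set.seed()` follows; the helper name `jitter_coord()`, the seed value, and the coordinates are hypothetical.

```r
# Hypothetical helper: add uniform noise of at most `amount` degrees
jitter_coord <- function(x, amount = 0.0015) {
  x + runif(length(x), -amount, amount)
}

original <- c(41.79, 41.88)   # invented latitudes

# Fixing the seed before drawing makes the random offsets repeatable
set.seed(2024)
first_run <- jitter_coord(original)

set.seed(2024)
second_run <- jitter_coord(original)

identical(first_run, second_run)
max(abs(first_run - original))   # never more than 0.0015
```

Setting the seed once at the top of the notebook would have the same effect for every jittered map.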
# Step 1: Determine non-loop casual routes
non_loops <- top_casual_routes %>%
filter(start_station_name != end_station_name) %>%
mutate(route_id = row_number())
non_loops <- non_loops %>% # need to jitter the coordinates bec. many routes overlap
mutate(
start_lat = start_lat + runif(n(), -0.0015, 0.0015),
start_lng = start_lng + runif(n(), -0.0015, 0.0015),
end_lat = end_lat + runif(n(), -0.0015, 0.0015),
end_lng = end_lng + runif(n(), -0.0015, 0.0015)
)
# Step 2: Straight lines for non-loops
non_loop_geoms <- pmap(
list(non_loops$start_lng, non_loops$start_lat,
non_loops$end_lng, non_loops$end_lat),
~ st_linestring(matrix(c(..1, ..2, ..3, ..4), ncol = 2, byrow = TRUE))
)
non_loop_sf <- st_sf(
non_loops,
geometry = st_sfc(non_loop_geoms, crs = 4326)
)
# Step 3: Color palette
color_pal <- colorBin(
palette = "viridis",
domain = non_loop_sf$route_count,
bins = 6
)
# Step 4: Plot leaflet map
leaflet(non_loop_sf) %>%
addProviderTiles("CartoDB.Positron") %>%
addLegend("bottomright",
pal = color_pal,
values = ~route_count,
title = "Number of Rides") %>%
addPolylines(
color = ~color_pal(route_count),
weight = 4,
opacity = 0.8,
label = ~paste0("Route ", route_id, ": ",
start_station_name, " → ",
end_station_name,
" (", route_count, " rides)")
) %>%
setView(lng = -87.63, lat = 41.895, zoom = 12.5) %>%
addControl("<strong>Top 25 routes for Casuals</strong>", position = "topright")
Among the top 25 casual routes, those that are not self-loops cluster
heavily along Chicago’s lakefront and around major tourist attractions
like Millennium Park and Shedd Aquarium.
Geographic Comparison of routes by user type?
| Aspect | Members | Casual riders |
|---|---|---|
| Geographic Spread | Broad: includes South Side, West Loop, and Hyde Park | Concentrated: downtown core and lakefront areas |
| Route Diversity | High: varied commuting routes and wider city coverage | Moderate: leisure-focused loops with some redundancy |
| Tourist Influence | Low: rides appear utilitarian and commute-driven | High: strong presence near landmarks and attractions |
Share
This phase would normally be delivered as a presentation; here, I share
the analysis as an R Markdown HTML notebook.
Main Insights and Conclusions
- User Type Strongly Influences Ride Behavior: Members tend to take
shorter, more frequent trips across a wider geographic area, consistent
with commuting or utilitarian use. Casual users take longer rides that
are highly concentrated near tourist attractions, indicating
recreational or sightseeing behavior.
- Spatial Mapping Deepens Understanding: Interactive route maps and
station visualizations clearly differentiate the behaviors of member and
casual users. Members dominate routes across the South Side, West Loop,
and Hyde Park, while casual riders are concentrated along the lakefront,
Millennium Park, and downtown areas.
Act
Top 3 Marketing Actions to Convert Casual Users to Annual Members
- Deploy Membership Promos in High-Casual Zones: Casual riders are
heavily concentrated near lakefront and downtown attractions such as
Millennium Park, Navy Pier, and DuSable Harbor. Action: Offer
location-based promotions like lower per-minute costs, priority bike
access, or limited-time membership trials at these hotspots.
- Educate Casual Riders on Membership Value: Many casual riders take
longer rides and self-loops, which may indicate a lack of awareness
about cost savings with membership. Action: Use in-app nudges, post-ride
emails, or ride summaries to highlight member perks such as unlimited
short rides, free ride time, and faster checkout. Focus especially on
repeat casual users or those exceeding 20-minute rides.
- Run Weekend-Focused Membership Campaigns: Casual usage peaks on
weekends and afternoons, aligning with recreational patterns. Action:
Launch weekend conversion offers such as “Upgrade to Member Today &
Ride Free All Weekend”. Partner with local attractions and festivals to
offer bundled incentives (e.g., ride credits or discounts with ticket
purchases) that reinforce the value of membership.