What follows is a case study that serves as the capstone project for the Google Data Analytics Certificate. You can find a presentation form of the business case via this link.
The capstone is from the perspective of a junior data analyst working on the marketing analyst team at Cyclistic (Divvy), a bike-share company in Chicago. The Director of Marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, a team is assembled to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations; compelling data insights and professional data visualizations are required.
The Director of Marketing has provided three questions to answer to guide future marketing efforts:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
The first question is my assignment, recast in the form …
The Business Task: Deliver an analysis that compares annual membership holders usage patterns to casual riders looking for indication in data that would aid the organization in increasing annual membership sales.
The capstone documentation refers to the organization as “Cyclistic.” At this point in the organization’s history (2025) they are known as Divvy - a play on “divvy it up”. Divvy is essentially a city of Chicago program that is run in partnership with Lyft.
Dizzy offers easy to step through bikes in both classic pedal-powered models as well as electric models. There are also electric scooters on offer. The service is accessed via a phone app that unlocks the bike for use. You can use the service in the following ways:
“Casual Rider”
Single Ride - entails a fee to unlock the bike with a per minute rental fee
Buy a Day Pass - 3 hours included after which the rental fee applies
“Annual Member”
The data sources provided in the certification program can be found here: https://divvy-tripdata.s3.amazonaws.com/index.html
The license agreement can be found here: https://ride.divvybikes.com/data-license-agreement
Per Divvy’s Data Page:
The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).
There is no information on whether all stations in Chicago are in the data, or a subset. If this was a subset, that might be a source of bias - this is an unknown.
Two types of files are provided:
one set by quarter of with the latest file from Q1 2019, and
monthly files that contain recent data.
The main content difference between the quarterly files compared to the monthly is that the quarterly contain gender and birth year for the riders as well as the individual bike ID, whereas the monthly do not. The monthly files provide everything else the quarterly files contain with the addition of bike type and the longitude and latitude of the stations.
Neither file type contains customer names or payment information. Therefore we have no privacy concerns.
This analysis uses the more recent monthly data, specifically a trailing 12 months (01/2025 to 02/2024) and therefore for most representative of current usage patterns.
#install.packages("rmarkdown")
#install.packages('tidyverse')
#install.packages('skimr')
suppressPackageStartupMessages({
library(rmarkdown)
library(tidyverse)
library(skimr)
})
# utility function for printing in pipes
pipe_message = function(.data, status) {message(status); .data}
Loading the data from 12 files representing Divvy activity from 01/2025 to 02/2024.
# Target the location of the files and define a condition to ensure only
# .csv files are loaded.
source_files_dir <- "trips/"
file_condition <- ".*\\.csv$"
file_list <- list.files(
path = source_files_dir,
pattern = file_condition,
full.names = TRUE
)
# Import and aggregate files into one data frame adding a column for source file
combined_df <- file_list %>%
set_names() %>%
map_df(~ read_csv(.x, show_col_types = FALSE), .id = "source_file")
# quick look at the data
skim_without_charts(combined_df)
| Name | combined_df |
| Number of rows | 5854384 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source_file | 0 | 1.00 | 32 | 32 | 0 | 12 | 0 |
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5854173 | 0 |
| rideable_type | 0 | 1.00 | 12 | 16 | 0 | 3 | 0 |
| start_station_name | 1077638 | 0.82 | 10 | 64 | 0 | 1806 | 0 |
| start_station_id | 1077638 | 0.82 | 3 | 35 | 0 | 1762 | 0 |
| end_station_name | 1107977 | 0.81 | 10 | 64 | 0 | 1815 | 0 |
| end_station_id | 1107977 | 0.81 | 3 | 35 | 0 | 1770 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.91 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 7005 | 1 | 41.90 | 0.06 | 16.06 | 41.88 | 41.90 | 41.93 | 87.96 |
| end_lng | 7005 | 1 | -87.65 | 0.11 | -144.05 | -87.66 | -87.64 | -87.63 | 152.53 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2024-02-01 00:01:50 | 2025-01-31 23:57:54 | 2024-07-28 13:25:34 | 5651085 |
| ended_at | 0 | 1 | 2024-02-01 00:08:07 | 2025-01-31 23:59:51 | 2024-07-28 13:46:33 | 5653589 |
The output to the console indicates that all files share the same column layout and load properly.
This is a large data set with 5,854,384 observations. There is a ride_id that seems to be the unique key, we will look into the field to confirm.
We are missing start and/or end station info in about 19% of rows. Divvy allows parking at their racks and public racks, at least for members. These are more likely cases wherein the user did not properly check-out the bike at a rack and cases where someone picked up the bike off-rack. We have latitude and longitude data for all trip starts, but missing a small number of end trip geo data.
I am holding on removing all NA cases until I have more of an understanding.
I will first copy the data to another data frame for manipulation keeping the state of the original data intact.
As determined below there are 422 duplicate ride_id
values in the data. Considering this is a very small percentage of the
observations, I am not too concerned. These duplicates have been
removed.
# create a version of the data frame for manipulation
modified_df <- combined_df
# determine if ride_id's are duplicated
modified_df %>%
group_by(ride_id) %>%
mutate(dupe = n()>1) %>%
filter(dupe == TRUE) %>%
arrange(ride_id) %>%
pipe_message("duplicate ride_id's before clean: ") %>%
nrow()
## duplicate ride_id's before clean:
## [1] 422
# remove the duplicate ride_id cases
modified_df <- modified_df %>%
distinct(ride_id, .keep_all = TRUE)
modified_df %>%
group_by(ride_id) %>%
mutate(dupe = n()>1) %>%
filter(dupe == TRUE) %>%
arrange(ride_id) %>%
pipe_message("duplicate ride_id's after: ") %>%
nrow()
## duplicate ride_id's after:
## [1] 0
Divvy indicates that any test rides are removed. That said, I am doing an admittedly simple test for demo stations. The results of this are negative.
# simple test
modified_df[grep("test", modified_df$start_station_name, ignore.case=TRUE)]
## # A tibble: 5,717,168 × 0
modified_df[grep("test", modified_df$end_station_name, ignore.case=TRUE)]
## # A tibble: 5,717,168 × 0
modified_df[grep("demo", modified_df$start_station_name, ignore.case=TRUE)]
## # A tibble: 5,717,168 × 0
modified_df[grep("demo", modified_df$end_station_name, ignore.case=TRUE)]
## # A tibble: 5,717,168 × 0
We now have 5,717,168 (down from 5,854,384) of what I am considering complete data. This data still contains cases where trips start or end outside the station network. During analysis there will be cause to filter these cases out to answer some questions.
Generally I will take the Who, What, Where, and How approach here. The “How” will be in my findings pertaining to how our two customer types, Casual Rider and Annual Members, use the Dizzy service.
First let’s get a basic understanding of the population of riders indirectly through trip data. We are limited by the fact that this data does not contain rider identifiers.
36% of trips taken are by Casual Riders; versus 64% of trips taken by Annual Members.
# The number of Annual member vs. Casual Customers
modified_df %>%
group_by(member_casual) %>%
summarize(percentage_trips= n()/nrow(modified_df)*100)
## # A tibble: 2 × 2
## member_casual percentage_trips
## <chr> <dbl>
## 1 casual 36.4
## 2 member 63.6
modified_df %>%
group_by(member_casual) %>%
summarise(total_count=n(), .groups = 'drop')
## # A tibble: 2 × 2
## member_casual total_count
## <chr> <int>
## 1 casual 2080099
## 2 member 3637069
modified_df %>%
group_by(member_casual) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=member_casual, y=number_of_rides, fill=member_casual),
position="dodge") +
labs(title="Number of Trip Segmented by Customer Type",
subtitle="Data from: 02/24 through 01/25",
x="Customer Type",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
“What” as in what do the customers ride.
The use of electric bikes and classic bikes are very similar at 50.5% and 47% respectively. Electric scooters are at 2.4%; we do not have any fleet numbers to inform us of how many scooters are available to give us context. The scooters might be unpopular, or in limited supply, or both.
Though there is a difference in scooter usage, which is a considerably small subset of usage overall, the usage of electric vs. classic vehicles is very similar for the customer types.
# distribution of vehicle types among trips
modified_df %>%
group_by(rideable_type) %>%
summarize(percentage= n()/nrow(modified_df)*100)
## # A tibble: 3 × 2
## rideable_type percentage
## <chr> <dbl>
## 1 classic_bike 47.0
## 2 electric_bike 50.5
## 3 electric_scooter 2.41
# vehicle usage by customer type
# creating a data frame to summarize vehicle usage both in overall counts
# and in percents within each customer type (for viz labeling)
veh_summary <- aggregate(modified_df$rideable_type,
by=list(modified_df$member_casual,
modified_df$rideable_type),
FUN=length)
names(veh_summary) <- c("customer_type","vehicle","number_of_trips")
mem_trip_total <- sum(
veh_summary[veh_summary$customer_type == "member", ]$number_of_trips)
cas_trip_total <- sum(
veh_summary[veh_summary$customer_type == "casual", ]$number_of_trips)
veh_summary$cust_type_percent <- ifelse(veh_summary$customer_type == "casual",
round(
(veh_summary$number_of_trips/cas_trip_total)*100,
digits = 2),
round(
(veh_summary$number_of_trips/mem_trip_total)*100,
digits = 2))
# making the month_year column factors and ordering them for you in data viz
ordered_months <- c("02/2024", "03/2024", "04/2024", "05/2024", "06/2024",
"07/2024", "08/2024", "09/2024", "10/2024", "11/2024",
"12/2024", "01/2025")
modified_df$month_year <- factor(modified_df$month_year,
levels = ordered_months)
ggplot(veh_summary, aes(x=vehicle, y=cust_type_percent, fill=vehicle)) +
geom_bar(stat = "identity") +
facet_wrap(~customer_type) +
geom_text(aes(label = paste0(cust_type_percent, "%"), y = cust_type_percent),
vjust = 1.4, size = 3, color = "white")+
labs(title="Percent Trips by Vehicles Used then Segmented by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="",
y="Percent of Trips") +
guides(fill=guide_legend(title="Vehicle Type")) +
scale_fill_brewer(palette="Set1")
When do customers take advantage of the service?
At the level of whether there is a seasonal aspect to usage - the answer is yes. Overall there is a peak in usage starting in May and then dropping in November - this is Chicago we are talking about here. Annual Members use the service to a greater deal throughout the year that of the Casual Riders. The Casual Riders increase in usage is more pronounced in the warmer months. Timing for promotions should consider this difference.
# Is there seasonality to service usage and difference between customer types
modified_df %>%
group_by(member_casual,month_year) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=month_year, y=number_of_rides, fill=member_casual),
position="dodge") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Seasonal Usage by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="Month",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
Differences between our two customer segments are more pronounced when it comes to which days of the week the service is used.
Though Annual Members use the service on the weekend, their peak usage is on classic workdays with Wednesday as the top day. For the Casual Riders, their usage picks up on Friday with their highest usage on Saturday followed by Sunday.
When looking at time of day for all days of the week, Casual Riders have a steady climb of usage from 7 AM to a peak at 5 PM then usage trails off into the evening. For the Annual Member there are two distinctive peaks at the beginning of the workday and toward the end of the workday.
# making the day_of_week column factors and ordering them for you in data viz
ordered_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday")
modified_df$day_of_week <- factor(modified_df$day_of_week,
levels = ordered_days)
# running plot for day of week segmented by customer type
modified_df %>%
group_by(member_casual,day_of_week) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=day_of_week, y=number_of_rides, fill=member_casual),
position="dodge") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Usage by Day of Week Segmented by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
caption="Based on trip start",
x="Day of Week",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
# running plot for time of day segmented by customer type
# making the day_of_week column factors and ordering them for you in data viz
ordered_hours <- c("12 AM", "01 AM", "02 AM", "03 AM", "04 AM", "05 AM", "06 AM",
"07 AM", "08 AM", "09 AM", "10 AM", "11 AM", "12 PM", "01 PM",
"02 PM", "03 PM", "04 PM", "05 PM", "06 PM", "07 PM", "08 PM",
"09 PM", "10 PM", "11 PM")
modified_df$time_of_day <- factor(modified_df$time_of_day,
levels = ordered_hours)
modified_df %>%
group_by(member_casual,time_of_day) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
position="dodge") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Trip Start by Time of Day Segmented by Customer Type - Full Week",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="Time of Day",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
Looking further at our time of usage between our two customer segments.
Separating the view on the classic work week vs the weekend, Annual Members appear to be using the service more during commuting times than Casual Riders. On the weekend usage between the two customer types is nearly identical.
# Looking at work week vs. weekend usage
modified_df %>%
filter(day_of_week %in% ordered_days[1:5]) %>%
group_by(member_casual,time_of_day) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
position="dodge") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Trip Start by Time of Day Segmented by Customer Type - Monday - Friday",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="Time of Day",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
modified_df %>%
filter(day_of_week %in% ordered_days[6:7]) %>%
group_by(member_casual,time_of_day) %>%
summarize(number_of_rides=n(), .groups="drop") %>%
ggplot() +
geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
position="dodge") +
theme(axis.text.x = element_text(angle = 45)) +
labs(title="Trip Start by Time of Day Segmented by Customer Type - Weekend",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="Time of Day",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
There are outliers at the top end of the trip duration values for cases such as keeping the bike overnight. Outliers greatly skew the view of trip times. These outliers will be removed filtering to the 95th percentile and below.
Overall Members take shorter trips than Casual cusotmers. The median trip time, with high outliers removed, is 11.1 minutes for Casual customers and 8.7 for Members with a broader spread of times. Casual customers have a higher third quartile at about 18 minutes compared to 15 minutes for Members. On the weekend the trip times go up a bit, but not significantly.
# median trip_time 95th percentile and below
modified_df %>%
filter(trip_time <= 42.4) %>%
group_by(member_casual) %>%
summarize(median_trip=median(trip_time), .groups="drop")
## # A tibble: 2 × 2
## member_casual median_trip
## <chr> <dbl>
## 1 casual 11.1
## 2 member 8.66
# plotting trip_time
modified_df %>%
filter(trip_time <= 42.4) %>%
ggplot() +
geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
labs(title="Trip Duration Segmented by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
caption="Outliers above 95th percentile removed",
x="",
y="Trip Duration (minutes)") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type")) +
coord_flip()
modified_df %>%
filter(trip_time <= 42.4 & day_of_week %in% ordered_days[1:5]) %>%
ggplot() +
geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
labs(title="Trip Duration Segmented by Customer Type - Monday through Friday",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
caption="Outliers above 95th percentile removed",
x="",
y="Trip Duration (minutes)") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type")) +
coord_flip()
modified_df %>%
filter(trip_time <= 42.4 & day_of_week %in% ordered_days[6:7]) %>%
ggplot() +
geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
labs(title="Trip Duration Segmented by Customer Type - Saturday and Sunday",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
caption="Outliers above 95th percentile removed",
x="",
y="Trip Duration (minutes)") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type")) +
coord_flip()
Where do customers user the service?
As mentioned earlier, the majority of trips start and end at designated stations for bike check-out and check-in. I will filter to trips starting and ending at stations since they by nature provide as the station name indication where they are in Chicago. We have 1,800 unique stations in the data. 722 of the stations are public racks expanding the Divvy network.
In transforming the data to model the routes taken,
start_station_name concatenated with
end_station_name we can extend our understanding of usage
patterns. There are 178,691 unique routes in the data:
177,441 unique two station routes make up the majority of the data, and
1,250 routes that start and end at the same station.
# filtering data
station_trips <- modified_df %>%
filter(!(is.na(start_station_id)|is.na(end_station_id)))
# how many stations are in the data?
station_trips %>%
select(start_station_name, end_station_name) %>%
pivot_longer(cols = everything(), values_to = "station") %>%
distinct(station) %>%
pipe_message("unique stations: ") %>%
nrow()
## unique stations:
## [1] 1800
# how many stations are in the data?
station_trips %>%
select(start_station_name, end_station_name) %>%
pivot_longer(cols = everything(), values_to = "station") %>%
distinct(station) %>%
filter(grepl("Public",station)) %>%
pipe_message("unique stations that are public racks: ") %>%
nrow()
## unique stations that are public racks:
## [1] 722
# creating an overall route summary
route_summary <- station_trips %>%
group_by(start_station_name, end_station_name) %>%
summarise(trip_count = n(), total_minutes = sum(trip_time), .groups = 'drop') %>%
arrange(desc(trip_count))
names(route_summary) <- c("origin","destination","trip_count", "total_minutes")
route_summary$route <- ifelse(route_summary$origin == route_summary$destination,
paste(route_summary$origin,"(start & end)"),
paste(route_summary$origin,
route_summary$destination, sep = " >> "))
# the number of routes
route_summary %>%
pipe_message("overall unique routes: ") %>%
nrow()
## overall unique routes:
## [1] 178691
# the number of routes wherein the customer returns to the same station they
# started from.
route_summary %>%
filter(grepl('& end)$', route)) %>%
pipe_message("routes with same start and end station: ") %>%
nrow()
## routes with same start and end station:
## [1] 1250
# the number of routes wherein the customer ends in a different station from
# where they started
route_summary %>%
filter(origin != destination) %>%
pipe_message("unique two station routes: ") %>%
nrow()
## unique two station routes:
## [1] 177441
Looking at the top 20 routes there is an indication that Casual Riders tend more to take “loop” rides - rides that start and end at the same station. Going further to aggregate ride behavior by number of trips shows that Casual Rider take far more loop rides than Annual Members
# segmenting be customer type
# creating an customer segmented route summary
customer_route_summary <- station_trips %>%
group_by(member_casual,start_station_name, start_lat, start_lng,
end_station_name,end_lat, end_lng) %>%
summarise(trip_count = n(), total_minutes = sum(trip_time), .groups = 'drop') %>%
arrange(desc(trip_count))
names(customer_route_summary) <- c("customer_type","origin","origin_lat",
"origin_lng","destination", "dest_lat",
"dest_lng","trip_count", "total_minutes")
# add a route column concatenating the start and end stations names
customer_route_summary$route <- ifelse(customer_route_summary$origin ==
customer_route_summary$destination,
paste(customer_route_summary$origin,"(start & end)"),
paste(customer_route_summary$origin,
customer_route_summary$destination, sep =
" >> "))
# add a column for route type for plotting
customer_route_summary$route_type <- ifelse(customer_route_summary$origin ==
customer_route_summary$destination,
"loop","end to end")
# plot the top 20 routes by customer segment
head(customer_route_summary,20) %>%
ggplot(aes(x = fct_reorder(route, trip_count), y = trip_count,
fill=customer_type)) +
geom_col() +
labs(title="Top 20 Routes by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
theme(legend.position = "none") +
coord_flip()
# plot number of trips taken per route type and segment by customer type
customer_route_summary %>%
group_by(customer_type,route_type) %>%
ggplot() +
geom_col(mapping=aes(x=route_type, y=trip_count, fill=customer_type),
position="dodge") +
labs(title="Trip by Route Type Segmented by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
x="Type of Route",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type"))
Since we have latitude and longitude data for stations, it would be good to look at which stations have the most activity by customer type. “Activity” here is defined as the number of times a trip starts or ends at a given station. Visualizing this requires overlaying the data on a map of the city. Though there are R packages for mapping, Tableau is a faster option to get to this goal.
In condensing the data to a list of unique stations with trip counts and geo data, I found that there were minor precision variations in latitude and longitude data for a given station. This finding required an additional step to in a sense normalize these values via calculating their mean for the station.
The code below shows this process. I provide a plot here to give a tasted of the result. (The Tableau results are to follow.)
# Reform data to express general activity to each station.
# In attempting this approach I found small variability in the geo data.
# I am normalizing the data buy applying median calc.
station_trips <- station_trips %>%
group_by(start_station_name) %>%
mutate(
median_start_lat = median(start_lat, na.rm = TRUE),
median_start_lng = median(start_lng, na.rm = TRUE)
) %>%
ungroup() %>%
group_by(end_station_name) %>%
mutate(
median_end_lat = median(end_lat, na.rm = TRUE),
median_end_lng = median(end_lng, na.rm = TRUE)
) %>%
ungroup()
# Combine start and end station data
station_activity <- bind_rows(
station_trips %>% select(member_casual, station = start_station_name,
lat = median_start_lat, lng = median_start_lng),
station_trips %>% select(member_casual, station = end_station_name,
lat = median_end_lat, lng = median_end_lng)
)
# Count trips for each station
station_activity_summary <- station_activity %>%
group_by(member_casual, station, lat, lng) %>%
summarise(total_trips = n(), .groups = "drop") %>%
arrange(desc(total_trips))
# write.csv() was then used to output the data for use in Tableau
# distribution of activity over stations
station_activity_summary %>%
summarize(Q3_durr=quantile(total_trips, probs = .75),
perc80_durr=quantile(total_trips, probs = .80),
perc85_durr=quantile(total_trips, probs = .85),
perc90_durr=quantile(total_trips, probs = .90),
perc95_durr=quantile(total_trips, probs = .95),
perc98_durr=quantile(total_trips, probs = .98),
perc99_durr=quantile(total_trips, probs = .99),
max_durr=max(total_trips))
## # A tibble: 1 × 8
## Q3_durr perc80_durr perc85_durr perc90_durr perc95_durr perc98_durr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 760. 2004. 4103. 7262. 12817. 20512.
## # ℹ 2 more variables: perc99_durr <dbl>, max_durr <int>
# plot the top 20 stations by activity segmented by customer
head(station_activity_summary,20) %>%
ggplot(aes(x = fct_reorder(station, total_trips), y = total_trips,
fill=member_casual)) +
geom_col() +
labs(title="Top 20 Stations for Activity by Customer Type",
subtitle=paste("Data from:", ordered_months[1], "through",
tail(ordered_months , 1)),
caption="Activity: the number of times a trip starts or ends at a given station",
x="",
y="Number of Trips") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
guides(fill=guide_legend(title="Customer Type")) +
coord_flip()
See the Map Visualization on Tableau Public
In looking at the activity at stations, mapped on Tableau Public showing the 95th percentile and above, a clear difference can be seen between the customer types. Casual Rider activity hugs the lake coast from Shed Aquarium north to Lincoln Park with some activity west into the Fulton River area. Annual Members frequent these same areas, but exhibit more activity from the Uptown area South- South West to the Little Italy District. Additionally, Annual Members are more active running North West up N. Milwaukee Avenue and have a cluster of activity to the South at the University of Chicago. It also appears that most service use around universities and colleges is by Annual Members.
Although the data does not provide direct information specific to the riders themselves, it does provide us with a view into how the two customer segments utilize the Divvy system from a trip perspective. (As a reminder; the data used in this analysis is from 02/24 through 01/25.)
36% of trips taken are by Casual Riders; versus 64% of trips taken by Annual Members. Although we do not know how many of each customer type we have in this data set, we can say that the bulk of utilization, trips taken, is by Annual Members.
Both customer segments slightly prefer electric bikes to classic bikes and exhibit very little use of electric scooters.
For both groups there is a seasonal aspect with most activity June through September. Also for both groups on weekends the peak time to start a trip from 8AM to 7PM.
There is some overlap between the two customer types in which stations they utilize the most in the city, but the differences are more interesting.
Although there is a seasonality in system utilization for both Annual Members and Casual Riders, the Annual Members show higher trip numbers during the colder months of the year. This difference is most pronounced November through March.
As mentioned above the time of day to start a ride is similar on the weekends, but these two groups otherwise diverge. Annual Member take more trips during the classic work week with time of day peaks during work commuting times. Casual Riders have the most rides Friday through Sunday with most trips starting from 11AM to 6PM with a peak at 4PM.
The type of rides taken by the two groups differ. Casual Rider trips tend more to start and end a ride at the same station - a “loop” of sorts. Whereas the Annual Rider trips tend more to be end to end going from one station and ending in another.
As mentioned above, there is some overlap which stations have the most activity. The difference in to regard is that Annual Member station utilization is spread across more of the city, with the Casual Riders centered more around major attractions along the lakefront and the city center. Annual Members trips also infer usage by students.
The data infers that Annual Members are using the service to commute and the Casual Riders for leisure.