What follows is a case study that serves as the capstone project for the Google Data Analytics Certificate. You can find a presentation form of the business case via this link.

Capstone Scenario

The capstone is from the perspective of a junior data analyst working on the marketing analyst team at Cyclistic (Divvy), a bike-share company in Chicago. The Director of Marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, a team is assembled to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations; compelling data insights and professional data visualizations are required.

The Ask

The Director of Marketing has provided three questions to answer to guide future marketing efforts:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

The first question is my assignment, recast in the form …

The Business Task: Deliver an analysis that compares annual membership holders usage patterns to casual riders looking for indication in data that would aid the organization in increasing annual membership sales.

About Divvy

The capstone documentation refers to the organization as “Cyclistic.” At this point in the organization’s history (2025) they are known as Divvy - a play on “divvy it up”. Divvy is essentially a city of Chicago program that is run in partnership with Lyft.

Dizzy offers easy to step through bikes in both classic pedal-powered models as well as electric models. There are also electric scooters on offer. The service is accessed via a phone app that unlocks the bike for use. You can use the service in the following ways:

“Casual Rider”
- Single Ride - entails a fee to unlock the bike with a per minute rental fee
- Buy a Day Pass - 3 hours included after which the rental fee applies
“Annual Member”
- Buy an Annual Membership - this grants an allotment of 45 free minutes after which the rental fee is applied, but there is no unlock fee.

The Data

Our Source

The data sources provided in the certification program can be found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The license agreement can be found here: https://ride.divvybikes.com/data-license-agreement

Per Divvy’s Data Page:

The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).

There is no information on whether all stations in Chicago are in the data, or a subset. If this was a subset, that might be a source of bias - this is an unknown.

Available Data Sources

Two types of files are provided:

one set by quarter of with the latest file from Q1 2019, and
monthly files that contain recent data.

The main content difference between the quarterly files compared to the monthly is that the quarterly contain gender and birth year for the riders as well as the individual bike ID, whereas the monthly do not. The monthly files provide everything else the quarterly files contain with the addition of bike type and the longitude and latitude of the stations.

Neither file type contains customer names or payment information. Therefore we have no privacy concerns.

This analysis uses the more recent monthly data, specifically a trailing 12 months (01/2025 to 02/2024) and therefore for most representative of current usage patterns.

The Column Layout

“ride_id”: ride id number
“rideable_type”: type of bike used (classic, electric, and electric scooter)
“started_at”: ride starting date and timestamp
“ended_at”: ride ending date and timestamp
“start_station_name”: starting station name
“start_station_id”: starting station id
“end_station_name”: ending station name
“end_station_id”: ending station id
“start_lat”: starting station latitude
“start_lng”: starting station longitude
“end_lat”: ending station latitude
“end_lng”: ending station longitude
“member_casual”: member (annual membership) or casual rider

Loading, Cleaning, and Initial Manipulation of the Data

#install.packages("rmarkdown")
#install.packages('tidyverse')
#install.packages('skimr')
suppressPackageStartupMessages({
  library(rmarkdown)
  library(tidyverse)
  library(skimr)
})

# utility function for printing in pipes
pipe_message = function(.data, status) {message(status); .data}

Loading the data from 12 files representing Divvy activity from 01/2025 to 02/2024.

# Target the location of the files and define a condition to ensure only 
# .csv files are loaded. 
source_files_dir <- "trips/"
file_condition <- ".*\\.csv$"

file_list <- list.files(
  path = source_files_dir,
  pattern = file_condition,
  full.names = TRUE
)

# Import and aggregate files into one data frame adding a column for source file
combined_df <- file_list %>%
  set_names() %>% 
  map_df(~ read_csv(.x, show_col_types = FALSE), .id = "source_file")

# quick look at the data
skim_without_charts(combined_df)

Data summary
Name	combined_df
Number of rows	5854384
Number of columns	14
_______________________
Column type frequency:
character	8
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
source_file	0	1.00	32	32	12
ride_id	0	1.00	16	16	5854173
rideable_type	0	1.00	12	16	3
start_station_name	1077638	0.82	10	64	1806
start_station_id	1077638	0.82	3	35	1762
end_station_name	1107977	0.81	10	64	1815
end_station_id	1107977	0.81	3	35	1770
member_casual	0	1.00	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.04	41.64	41.88	41.90	41.93	42.07
start_lng	0	1	-87.65	0.03	-87.91	-87.66	-87.64	-87.63	-87.52
end_lat	7005	1	41.90	0.06	16.06	41.88	41.90	41.93	87.96
end_lng	7005	1	-87.65	0.11	-144.05	-87.66	-87.64	-87.63	152.53

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2024-02-01 00:01:50	2025-01-31 23:57:54	2024-07-28 13:25:34	5651085
ended_at	0	1	2024-02-01 00:08:07	2025-01-31 23:59:51	2024-07-28 13:46:33	5653589

The Initial State of the Data

The output to the console indicates that all files share the same column layout and load properly.

This is a large data set with 5,854,384 observations. There is a ride_id that seems to be the unique key, we will look into the field to confirm.

We are missing start and/or end station info in about 19% of rows. Divvy allows parking at their racks and public racks, at least for members. These are more likely cases wherein the user did not properly check-out the bike at a rack and cases where someone picked up the bike off-rack. We have latitude and longitude data for all trip starts, but missing a small number of end trip geo data.

I am holding on removing all NA cases until I have more of an understanding.

The ride_id

I will first copy the data to another data frame for manipulation keeping the state of the original data intact.

As determined below there are 422 duplicate ride_id values in the data. Considering this is a very small percentage of the observations, I am not too concerned. These duplicates have been removed.

# create a version of the data frame for manipulation 
modified_df <- combined_df

# determine if ride_id's are duplicated
modified_df %>% 
  group_by(ride_id) %>% 
  mutate(dupe = n()>1) %>%
  filter(dupe == TRUE) %>%
  arrange(ride_id) %>%
  pipe_message("duplicate ride_id's before clean: ") %>%
  nrow()

## duplicate ride_id's before clean:

## [1] 422

# remove the duplicate ride_id cases
modified_df <- modified_df %>%
  distinct(ride_id, .keep_all = TRUE)

modified_df %>% 
  group_by(ride_id) %>% 
  mutate(dupe = n()>1) %>%
  filter(dupe == TRUE) %>%
  arrange(ride_id) %>%
  pipe_message("duplicate ride_id's after: ") %>%
  nrow()

## duplicate ride_id's after:

## [1] 0

Adding Additional Time Related Columns

We are provided with started_at and end_at fields that define each trip duration in POSIXct form. To better understand the data I am adding trip_time (minutes), day_of_week, month_year, and time_of_day columns calculated from the original POSIXct fields.

# To help judge the data and later understand usage - adding
# a trip timing related columns starting with duration
modified_df$trip_time <- 
  difftime(modified_df$ended_at, modified_df$started_at, units="mins")
modified_df$trip_time <- as.numeric(modified_df$trip_time)

# add a day of week column, based on trip start
modified_df$day_of_week <- format(as.Date(modified_df$started_at), "%A")

# add a month/year column
modified_df$month_year <- format(as.Date(modified_df$started_at), "%m/%Y")

# add time of day column
modified_df$time_of_day <- format(modified_df$started_at, "%I %p")

# move source_file to the end column position and take another look at the data
modified_df <- modified_df %>% 
  relocate(source_file, .after = time_of_day)

skim_without_charts(modified_df)

Data summary
Name	modified_df
Number of rows	5854173
Number of columns	18
_______________________
Column type frequency:
character	11
numeric	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5854173
rideable_type	0	1.00	12	16	3
start_station_name	1077615	0.82	10	64	1806
start_station_id	1077615	0.82	3	35	1762
end_station_name	1107897	0.81	10	64	1815
end_station_id	1107897	0.81	3	35	1770
member_casual	0	1.00	6	6	2
day_of_week	0	1.00	6	9	7
month_year	0	1.00	7	7	12
time_of_day	0	1.00	5	5	24
source_file	0	1.00	32	32	12

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.04	41.64	41.88	41.90	41.93	42.07
start_lng	0	1	-87.65	0.03	-87.91	-87.66	-87.64	-87.63	-87.52
end_lat	6965	1	41.90	0.06	16.06	41.88	41.90	41.93	87.96
end_lng	6965	1	-87.65	0.11	-144.05	-87.66	-87.64	-87.63	152.53
trip_time	0	1	17.21	60.49	-2748.32	5.53	9.70	17.23	1559.93

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2024-02-01 00:01:50	2025-01-31 23:57:54	2024-07-28 13:29:16	5650876
ended_at	0	1	2024-02-01 00:08:07	2025-01-31 23:59:51	2024-07-28 13:50:47	5653378

Cleaning the Data Based-On Trip Times

Looking at the skimr output above we have some noise in the signal in the new trip_time column. I the first quartile there are negative values, and in the last quartile we have a max of 1,560 minutes. Both these cases need to be addressed.

There as 207 cases of negative trim times. The start times precede the end times. I am considering this as faulty data and removing it. Divvy indicates that they drop data the that contains trips less than 60 seconds as false starts, I will make sure to apply that threshold here.

# quatifying the negative trip times
modified_df %>% 
  filter(trip_time < 0) %>%
  pipe_message("rides with negative trip times: ") %>%
  nrow()

## rides with negative trip times:

## [1] 207

# removing negative time cases
modified_df <- modified_df %>%
  filter(trip_time >= 1)

On the top end of the trip times there is a big jump from the beginning of last quartile to the max. To get a better view I look at this quartile more closely and inspect cases greater than 9 hours (540 minutes). This subset of the data is pretty small at less than 0.002% of the original sample.

This step determines that our spike in times is above the 99th percentile and about 68% of these case have no end station data. I was not able to determine online if Divvy timeout a ride after a threshold, but none-the-less I think is reasonable to trim these cases from the data as failed returns.

# taking another look at the trip times 
modified_df %>%
  summarize(Q3_durr=quantile(trip_time, probs = .75),
            perc80_durr=quantile(trip_time, probs = .80),
            perc85_durr=quantile(trip_time, probs = .85),
            perc90_durr=quantile(trip_time, probs = .90),
            perc95_durr=quantile(trip_time, probs = .95),
            perc98_durr=quantile(trip_time, probs = .98),
            perc99_durr=quantile(trip_time, probs = .99),
            max_durr=max(trip_time))

## # A tibble: 1 × 8
##   Q3_durr perc80_durr perc85_durr perc90_durr perc95_durr perc98_durr
##     <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1    17.5        20.3        24.2        30.3        43.2        70.2
## # ℹ 2 more variables: perc99_durr <dbl>, max_durr <dbl>

# FINDINGS - There is a considerable spike above the 95th percentile.

# looking at cases greater than 8 hours
modified_df %>% 
  filter(trip_time > 540) %>%
  skim_without_charts()

Data summary
Name	Piped data
Number of rows	10619
Number of columns	18
_______________________
Column type frequency:
character	11
numeric	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	10619
rideable_type	0	1.00	12	12	1
start_station_name	0	1.00	10	55	780
start_station_id	0	1.00	3	14	777
end_station_name	7178	0.32	10	42	562
end_station_id	7178	0.32	3	13	562
member_casual	0	1.00	6	6	2
day_of_week	0	1.00	6	9	7
month_year	0	1.00	7	7	12
time_of_day	0	1.00	5	5	24
source_file	0	1.00	32	32	12

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1.00	41.89	0.05	41.65	41.88	41.89	41.92	42.06
start_lng	0	1.00	-87.64	0.03	-87.82	-87.65	-87.63	-87.62	-87.53
end_lat	6965	0.34	41.90	0.05	41.65	41.88	41.89	41.93	42.06
end_lng	6965	0.34	-87.64	0.03	-87.81	-87.65	-87.64	-87.63	-87.53
trip_time	0	1.00	1323.88	297.94	540.05	1209.26	1499.67	1499.89	1559.93

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2024-02-01 03:19:53	2025-01-30 21:13:48	2024-07-12 10:55:14	10618
ended_at	0	1	2024-02-02 04:19:42	2025-01-31 22:13:39	2024-07-13 08:06:15	10383

# removing failed returns (no geo)
modified_df <- modified_df %>%
  filter(!(is.na(end_lat)|is.na(end_lng)))

Is There Any Demo Data?

Divvy indicates that any test rides are removed. That said, I am doing an admittedly simple test for demo stations. The results of this are negative.

# simple test
modified_df[grep("test", modified_df$start_station_name, ignore.case=TRUE)]

## # A tibble: 5,717,168 × 0

modified_df[grep("test", modified_df$end_station_name, ignore.case=TRUE)]

## # A tibble: 5,717,168 × 0

modified_df[grep("demo", modified_df$start_station_name, ignore.case=TRUE)]

## # A tibble: 5,717,168 × 0

modified_df[grep("demo", modified_df$end_station_name, ignore.case=TRUE)]

## # A tibble: 5,717,168 × 0

We now have 5,717,168 (down from 5,854,384) of what I am considering complete data. This data still contains cases where trips start or end outside the station network. During analysis there will be cause to filter these cases out to answer some questions.

The Analysis

Generally I will take the Who, What, Where, and How approach here. The “How” will be in my findings pertaining to how our two customer types, Casual Rider and Annual Members, use the Dizzy service.

Who

First let’s get a basic understanding of the population of riders indirectly through trip data. We are limited by the fact that this data does not contain rider identifiers.

36% of trips taken are by Casual Riders; versus 64% of trips taken by Annual Members.

# The number of Annual member vs. Casual Customers
modified_df %>% 
  group_by(member_casual) %>%
  summarize(percentage_trips= n()/nrow(modified_df)*100)

## # A tibble: 2 × 2
##   member_casual percentage_trips
##   <chr>                    <dbl>
## 1 casual                    36.4
## 2 member                    63.6

modified_df %>% 
  group_by(member_casual) %>%
  summarise(total_count=n(), .groups = 'drop')

## # A tibble: 2 × 2
##   member_casual total_count
##   <chr>               <int>
## 1 casual            2080099
## 2 member            3637069

modified_df %>%
  group_by(member_casual) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=member_casual, y=number_of_rides, fill=member_casual),
           position="dodge") +
  labs(title="Number of Trip Segmented by Customer Type",
       subtitle="Data from: 02/24 through 01/25",
       x="Customer Type",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

What?

“What” as in what do the customers ride.

The use of electric bikes and classic bikes are very similar at 50.5% and 47% respectively. Electric scooters are at 2.4%; we do not have any fleet numbers to inform us of how many scooters are available to give us context. The scooters might be unpopular, or in limited supply, or both.

Though there is a difference in scooter usage, which is a considerably small subset of usage overall, the usage of electric vs. classic vehicles is very similar for the customer types.

# distribution of vehicle types among trips
modified_df %>% 
  group_by(rideable_type) %>%
  summarize(percentage= n()/nrow(modified_df)*100)

## # A tibble: 3 × 2
##   rideable_type    percentage
##   <chr>                 <dbl>
## 1 classic_bike          47.0 
## 2 electric_bike         50.5 
## 3 electric_scooter       2.41

# vehicle usage by customer type
# creating a data frame to summarize vehicle usage both in overall counts
# and in percents within each customer type (for viz labeling)
veh_summary <- aggregate(modified_df$rideable_type,
                         by=list(modified_df$member_casual,
                                 modified_df$rideable_type),
                         FUN=length)
names(veh_summary) <- c("customer_type","vehicle","number_of_trips")
mem_trip_total <- sum(
  veh_summary[veh_summary$customer_type == "member", ]$number_of_trips)
cas_trip_total <- sum(
  veh_summary[veh_summary$customer_type == "casual", ]$number_of_trips)

veh_summary$cust_type_percent <- ifelse(veh_summary$customer_type == "casual",
                              round(
                                (veh_summary$number_of_trips/cas_trip_total)*100,
                                digits = 2),
                              round(
                                (veh_summary$number_of_trips/mem_trip_total)*100,
                                digits = 2))

# making the month_year column factors and ordering them for you in data viz
ordered_months <- c("02/2024", "03/2024", "04/2024", "05/2024", "06/2024",
                    "07/2024", "08/2024", "09/2024", "10/2024", "11/2024",
                    "12/2024", "01/2025")
modified_df$month_year <- factor(modified_df$month_year, 
                                 levels = ordered_months)

ggplot(veh_summary, aes(x=vehicle, y=cust_type_percent, fill=vehicle)) +
  geom_bar(stat = "identity") +
  facet_wrap(~customer_type) +
  geom_text(aes(label = paste0(cust_type_percent, "%"), y = cust_type_percent),
            vjust = 1.4, size = 3, color = "white")+
  labs(title="Percent Trips by Vehicles Used then Segmented by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="",
       y="Percent of Trips") +
  guides(fill=guide_legend(title="Vehicle Type")) +
  scale_fill_brewer(palette="Set1")

When?

When do customers take advantage of the service?

At the level of whether there is a seasonal aspect to usage - the answer is yes. Overall there is a peak in usage starting in May and then dropping in November - this is Chicago we are talking about here. Annual Members use the service to a greater deal throughout the year that of the Casual Riders. The Casual Riders increase in usage is more pronounced in the warmer months. Timing for promotions should consider this difference.

# Is there seasonality to service usage and difference between customer types
modified_df %>%
  group_by(member_casual,month_year) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=month_year, y=number_of_rides, fill=member_casual),
           position="dodge") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title="Seasonal Usage by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="Month",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

Differences between our two customer segments are more pronounced when it comes to which days of the week the service is used.

Though Annual Members use the service on the weekend, their peak usage is on classic workdays with Wednesday as the top day. For the Casual Riders, their usage picks up on Friday with their highest usage on Saturday followed by Sunday.

When looking at time of day for all days of the week, Casual Riders have a steady climb of usage from 7 AM to a peak at 5 PM then usage trails off into the evening. For the Annual Member there are two distinctive peaks at the beginning of the workday and toward the end of the workday.

# making the day_of_week column factors and ordering them for you in data viz
ordered_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                    "Saturday", "Sunday")
modified_df$day_of_week <- factor(modified_df$day_of_week, 
                                 levels = ordered_days)

# running plot for day of week segmented by customer type
modified_df %>%
  group_by(member_casual,day_of_week) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=day_of_week, y=number_of_rides, fill=member_casual),
           position="dodge") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title="Usage by Day of Week Segmented by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       caption="Based on trip start",
       x="Day of Week",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

# running plot for time of day segmented by customer type
# making the day_of_week column factors and ordering them for you in data viz
ordered_hours <- c("12 AM", "01 AM", "02 AM", "03 AM", "04 AM", "05 AM", "06 AM",
                  "07 AM", "08 AM", "09 AM", "10 AM", "11 AM", "12 PM", "01 PM",
                  "02 PM", "03 PM", "04 PM", "05 PM", "06 PM", "07 PM", "08 PM", 
                  "09 PM", "10 PM", "11 PM")
modified_df$time_of_day <- factor(modified_df$time_of_day, 
                                  levels = ordered_hours)

modified_df %>%
  group_by(member_casual,time_of_day) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
           position="dodge") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title="Trip Start by Time of Day Segmented by Customer Type - Full Week",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="Time of Day",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

Looking further at our time of usage between our two customer segments.

Separating the view on the classic work week vs the weekend, Annual Members appear to be using the service more during commuting times than Casual Riders. On the weekend usage between the two customer types is nearly identical.

# Looking at work week vs. weekend usage
modified_df %>%
  filter(day_of_week %in% ordered_days[1:5]) %>%
  group_by(member_casual,time_of_day) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
           position="dodge") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title="Trip Start by Time of Day Segmented by Customer Type - Monday - Friday",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="Time of Day",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

modified_df %>%
  filter(day_of_week %in% ordered_days[6:7]) %>%
  group_by(member_casual,time_of_day) %>%
  summarize(number_of_rides=n(), .groups="drop") %>% 
  ggplot() +
  geom_col(mapping=aes(x=time_of_day, y=number_of_rides, fill=member_casual),
           position="dodge") +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title="Trip Start by Time of Day Segmented by Customer Type - Weekend",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="Time of Day",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

How Long Do They Ride?

There are outliers at the top end of the trip duration values for cases such as keeping the bike overnight. Outliers greatly skew the view of trip times. These outliers will be removed filtering to the 95th percentile and below.

Overall Members take shorter trips than Casual cusotmers. The median trip time, with high outliers removed, is 11.1 minutes for Casual customers and 8.7 for Members with a broader spread of times. Casual customers have a higher third quartile at about 18 minutes compared to 15 minutes for Members. On the weekend the trip times go up a bit, but not significantly.

# median trip_time 95th percentile and below
modified_df %>%
  filter(trip_time <= 42.4) %>%
  group_by(member_casual) %>%
  summarize(median_trip=median(trip_time), .groups="drop")

## # A tibble: 2 × 2
##   member_casual median_trip
##   <chr>               <dbl>
## 1 casual              11.1 
## 2 member               8.66

# plotting trip_time
modified_df %>% 
  filter(trip_time <= 42.4) %>%
  ggplot() +
  geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
  labs(title="Trip Duration Segmented by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       caption="Outliers above 95th percentile removed",
       x="",
       y="Trip Duration (minutes)") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type")) +
  coord_flip()

modified_df %>% 
  filter(trip_time <= 42.4 & day_of_week %in% ordered_days[1:5]) %>%
  ggplot() +
  geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
  labs(title="Trip Duration Segmented by Customer Type  - Monday through Friday",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       caption="Outliers above 95th percentile removed",
       x="",
       y="Trip Duration (minutes)") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type")) +
  coord_flip()

modified_df %>% 
  filter(trip_time <= 42.4 & day_of_week %in% ordered_days[6:7]) %>%
  ggplot() +
  geom_boxplot(mapping=aes(x=member_casual, y=trip_time, fill=member_casual)) +
  labs(title="Trip Duration Segmented by Customer Type  - Saturday and Sunday",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       caption="Outliers above 95th percentile removed",
       x="",
       y="Trip Duration (minutes)") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type")) +
  coord_flip()

Where?

Where do customers user the service?

As mentioned earlier, the majority of trips start and end at designated stations for bike check-out and check-in. I will filter to trips starting and ending at stations since they by nature provide as the station name indication where they are in Chicago. We have 1,800 unique stations in the data. 722 of the stations are public racks expanding the Divvy network.

In transforming the data to model the routes taken, start_station_name concatenated with end_station_name we can extend our understanding of usage patterns. There are 178,691 unique routes in the data:

177,441 unique two station routes make up the majority of the data, and
1,250 routes that start and end at the same station.

# filtering data
station_trips <- modified_df %>%
  filter(!(is.na(start_station_id)|is.na(end_station_id)))

# how many stations are in the data?
station_trips %>%
  select(start_station_name, end_station_name) %>%
  pivot_longer(cols = everything(), values_to = "station") %>%
  distinct(station) %>%
  pipe_message("unique stations: ") %>%
  nrow()

## unique stations:

## [1] 1800

# how many stations are in the data?
station_trips %>%
  select(start_station_name, end_station_name) %>%
  pivot_longer(cols = everything(), values_to = "station") %>%
  distinct(station) %>%
  filter(grepl("Public",station)) %>%
  pipe_message("unique stations that are public racks: ") %>%
  nrow()

## unique stations that are public racks:

## [1] 722

# creating an overall route summary
route_summary <- station_trips %>%
  group_by(start_station_name, end_station_name) %>%
  summarise(trip_count = n(), total_minutes = sum(trip_time), .groups = 'drop') %>%
  arrange(desc(trip_count))

names(route_summary) <- c("origin","destination","trip_count", "total_minutes")
route_summary$route <- ifelse(route_summary$origin == route_summary$destination,
                             paste(route_summary$origin,"(start & end)"),
                             paste(route_summary$origin, 
                                   route_summary$destination, sep = " >> "))

# the number of routes
route_summary %>%
  pipe_message("overall unique routes: ") %>%
  nrow()

## overall unique routes:

## [1] 178691

# the number of routes wherein the customer returns to the same station they 
# started from. 
route_summary %>%
  filter(grepl('& end)$', route)) %>%
  pipe_message("routes with same start and end station: ") %>%
  nrow()

## routes with same start and end station:

## [1] 1250

# the number of routes wherein the customer ends in a different station from
# where they started 
route_summary %>%
  filter(origin != destination) %>%
  pipe_message("unique two station routes: ") %>%
  nrow()

## unique two station routes:

## [1] 177441

Looking at the top 20 routes there is an indication that Casual Riders tend more to take “loop” rides - rides that start and end at the same station. Going further to aggregate ride behavior by number of trips shows that Casual Rider take far more loop rides than Annual Members

# segmenting be customer type
# creating an customer segmented route summary
customer_route_summary <- station_trips %>%
  group_by(member_casual,start_station_name, start_lat, start_lng, 
           end_station_name,end_lat, end_lng) %>%
  summarise(trip_count = n(), total_minutes = sum(trip_time), .groups = 'drop') %>%
  arrange(desc(trip_count))

names(customer_route_summary) <- c("customer_type","origin","origin_lat",
                                  "origin_lng","destination", "dest_lat", 
                                  "dest_lng","trip_count", "total_minutes")

# add a route column concatenating the start and end stations names
customer_route_summary$route <- ifelse(customer_route_summary$origin == 
                                         customer_route_summary$destination,
                             paste(customer_route_summary$origin,"(start & end)"),
                             paste(customer_route_summary$origin, 
                                   customer_route_summary$destination, sep =
                                     " >> "))
# add a column for route type for plotting
customer_route_summary$route_type <- ifelse(customer_route_summary$origin == 
                                         customer_route_summary$destination,
                                      "loop","end to end")

# plot the top 20 routes by customer segment
head(customer_route_summary,20) %>%
  ggplot(aes(x = fct_reorder(route, trip_count), y = trip_count, 
             fill=customer_type)) +
  geom_col() +
  labs(title="Top 20 Routes by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  theme(legend.position = "none") +
  coord_flip()

# plot number of trips taken per route type and segment by customer type 
customer_route_summary %>%
  group_by(customer_type,route_type) %>%
  ggplot() +
  geom_col(mapping=aes(x=route_type, y=trip_count, fill=customer_type),
           position="dodge") +
  labs(title="Trip by Route Type Segmented by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       x="Type of Route",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type"))

Since we have latitude and longitude data for stations, it would be good to look at which stations have the most activity by customer type. “Activity” here is defined as the number of times a trip starts or ends at a given station. Visualizing this requires overlaying the data on a map of the city. Though there are R packages for mapping, Tableau is a faster option to get to this goal.

In condensing the data to a list of unique stations with trip counts and geo data, I found that there were minor precision variations in latitude and longitude data for a given station. This finding required an additional step to in a sense normalize these values via calculating their mean for the station.

The code below shows this process. I provide a plot here to give a tasted of the result. (The Tableau results are to follow.)

# Reform data to express general activity to each station.
# In attempting this approach I found small variability in the geo data.  
# I am normalizing the data buy applying median calc.
station_trips <- station_trips %>%
  group_by(start_station_name) %>%
  mutate(
    median_start_lat = median(start_lat, na.rm = TRUE),
    median_start_lng = median(start_lng, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  group_by(end_station_name) %>%
  mutate(
    median_end_lat = median(end_lat, na.rm = TRUE),
    median_end_lng = median(end_lng, na.rm = TRUE)
  ) %>%
  ungroup()

# Combine start and end station data
station_activity <- bind_rows(
  station_trips %>% select(member_casual, station = start_station_name, 
                           lat = median_start_lat, lng = median_start_lng),
  station_trips %>% select(member_casual, station = end_station_name, 
                           lat = median_end_lat, lng = median_end_lng)
)

# Count trips for each station
station_activity_summary <- station_activity %>%
  group_by(member_casual, station, lat, lng) %>%
  summarise(total_trips = n(), .groups = "drop") %>%
  arrange(desc(total_trips))

# write.csv() was then used to output the data for use in Tableau

# distribution of activity over stations
station_activity_summary %>% 
  summarize(Q3_durr=quantile(total_trips, probs = .75),
            perc80_durr=quantile(total_trips, probs = .80),
            perc85_durr=quantile(total_trips, probs = .85),
            perc90_durr=quantile(total_trips, probs = .90),
            perc95_durr=quantile(total_trips, probs = .95),
            perc98_durr=quantile(total_trips, probs = .98),
            perc99_durr=quantile(total_trips, probs = .99),
            max_durr=max(total_trips))

## # A tibble: 1 × 8
##   Q3_durr perc80_durr perc85_durr perc90_durr perc95_durr perc98_durr
##     <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1    760.       2004.       4103.       7262.      12817.      20512.
## # ℹ 2 more variables: perc99_durr <dbl>, max_durr <int>

# plot the top 20 stations by activity segmented by customer
head(station_activity_summary,20) %>%
  ggplot(aes(x = fct_reorder(station, total_trips), y = total_trips, 
             fill=member_casual)) +
  geom_col() +
  labs(title="Top 20 Stations for Activity by Customer Type",
       subtitle=paste("Data from:", ordered_months[1], "through",
                      tail(ordered_months , 1)),
       caption="Activity: the number of times a trip starts or ends at a given station",
       x="",
       y="Number of Trips") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  guides(fill=guide_legend(title="Customer Type")) +
  coord_flip()

See the Map Visualization on Tableau Public

In looking at the activity at stations, mapped on Tableau Public showing the 95th percentile and above, a clear difference can be seen between the customer types. Casual Rider activity hugs the lake coast from Shed Aquarium north to Lincoln Park with some activity west into the Fulton River area. Annual Members frequent these same areas, but exhibit more activity from the Uptown area South- South West to the Little Italy District. Additionally, Annual Members are more active running North West up N. Milwaukee Avenue and have a cluster of activity to the South at the University of Chicago. It also appears that most service use around universities and colleges is by Annual Members.

How Annual Members and Casual Riders Use the Service

Although the data does not provide direct information specific to the riders themselves, it does provide us with a view into how the two customer segments utilize the Divvy system from a trip perspective. (As a reminder; the data used in this analysis is from 02/24 through 01/25.)

36% of trips taken are by Casual Riders; versus 64% of trips taken by Annual Members. Although we do not know how many of each customer type we have in this data set, we can say that the bulk of utilization, trips taken, is by Annual Members.

Where There are Similarities

Both customer segments slightly prefer electric bikes to classic bikes and exhibit very little use of electric scooters.

For both groups there is a seasonal aspect with most activity June through September. Also for both groups on weekends the peak time to start a trip from 8AM to 7PM.

There is some overlap between the two customer types in which stations they utilize the most in the city, but the differences are more interesting.

Where There are Differences

Although there is a seasonality in system utilization for both Annual Members and Casual Riders, the Annual Members show higher trip numbers during the colder months of the year. This difference is most pronounced November through March.

As mentioned above the time of day to start a ride is similar on the weekends, but these two groups otherwise diverge. Annual Member take more trips during the classic work week with time of day peaks during work commuting times. Casual Riders have the most rides Friday through Sunday with most trips starting from 11AM to 6PM with a peak at 4PM.

The type of rides taken by the two groups differ. Casual Rider trips tend more to start and end a ride at the same station - a “loop” of sorts. Whereas the Annual Rider trips tend more to be end to end going from one station and ending in another.

As mentioned above, there is some overlap which stations have the most activity. The difference in to regard is that Annual Member station utilization is spread across more of the city, with the Casual Riders centered more around major attractions along the lakefront and the city center. Annual Members trips also infer usage by students.

The data infers that Annual Members are using the service to commute and the Casual Riders for leisure.

Recommendations

Use marketing research to learn more about our riders quantifying how many we have of each type as well as demographic information and usage preferences to guide outreach.
Center marketing around shifting Casual Rider to a membership highlighting using the service to commute to work or school.
Hold events at Milton Lee Olive, Maggie Daley, Butler Field, Millennium, and Lakefront Parks for membership drives.

Google Data Analytics Certificate Capstone

Greg Imhoff

2025-03-14

Capstone Scenario

The Ask

About Divvy

The Data

Our Source

Available Data Sources

The Column Layout

Loading, Cleaning, and Initial Manipulation of the Data

The Initial State of the Data

The ride_id

Is There Any Demo Data?

The Analysis

Who

What?

When?

How Long Do They Ride?

Where?

How Annual Members and Casual Riders Use the Service

Where There are Similarities

Where There are Differences

Recommendations