This report examines usage patterns of the Cyclistic bike-share program, comparing the behavior of annual members with that of casual riders. Using trip data from the first quarters of 2019 and 2020, it aims to identify key differences in how the two groups use the service.
The primary focus is ride duration, a critical metric for understanding user needs and optimizing service offerings. Identifying distinct usage patterns provides input for targeted marketing strategies and operational improvements.
This analysis uses Divvy trip data for Q1 2019 and Q1 2020, provided by Motivate International Inc. and publicly available under the Divvy Data License Agreement.
The data was imported using the read_csv() function from the readr package, which is loaded as part of the tidyverse.
# Load essential libraries for data manipulation and visualization
library(tidyverse)
library(conflicted)
conflict_prefer("filter", "dplyr")
conflict_prefer("lag", "dplyr")
# Load the datasets
q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
To ensure a robust and consistent analysis, the raw data underwent several crucial cleaning and transformation steps:
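Column Standardization: Renamed columns in the 2019 dataset to match the 2020 column names so that the two datasets can be stacked. Note that the 2019 bikeid column is mapped to rideable_type, so that field holds bike IDs for 2019 and bike types for 2020.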
Data Type Conversion: Ensured ride_length was in a numeric format for calculations and converted ride_id and rideable_type to character for proper merging.
Data Integration: Combined the 2019 and 2020 datasets into a unified all_trips dataframe using bind_rows().
Irrelevant Data Removal: Excluded columns not relevant to the analysis (latitude, longitude, birth year, gender, and the redundant tripduration from the 2019 data).
Date Feature Engineering: Created new columns for date, month, day, year, and day_of_week to enable temporal analysis.
Ride Duration Calculation: Computed a consistent ride_length in seconds for all trips using the difference between ended_at and started_at.
Outlier and Error Handling: Removed trips with negative ride_length values and those originating from the “HQ QR” station, likely representing data anomalies or quality checks rather than user rides.
The R code below details these data wrangling procedures:
# Standardize column names for consistency with the 2020 dataset
q1_2019 <- rename(q1_2019,
                  ride_id = trip_id,
                  rideable_type = bikeid,
                  started_at = start_time,
                  ended_at = end_time,
                  start_station_name = from_station_name,
                  start_station_id = from_station_id,
                  end_station_name = to_station_name,
                  end_station_id = to_station_id,
                  member_casual = usertype)
# Inspect the structure of the initial dataframes
str(q1_2019)
## spc_tbl_ [365,071 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : num [1:365071] 21742443 21742444 21742445 21742446 21742447 ...
## $ started_at : chr [1:365071] "2019-01-01 0:04:37" "2019-01-01 0:08:13" "2019-01-01 0:13:23" "2019-01-01 0:13:45" ...
## $ ended_at : chr [1:365071] "2019-01-01 0:11:07" "2019-01-01 0:15:34" "2019-01-01 0:27:12" "2019-01-01 0:43:28" ...
## $ rideable_type : num [1:365071] 2167 4386 1524 252 1170 ...
## $ tripduration : num [1:365071] 390 441 829 1783 364 ...
## $ start_station_id : num [1:365071] 199 44 15 123 173 98 98 211 150 268 ...
## $ start_station_name: chr [1:365071] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
## $ end_station_id : num [1:365071] 84 624 644 176 35 49 49 142 148 141 ...
## $ end_station_name : chr [1:365071] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
## $ member_casual : chr [1:365071] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr [1:365071] "Male" "Female" "Female" "Male" ...
## $ birthyear : chr [1:365071] "1989" "1990" "1994" "1993" ...
## $ ride_length : chr [1:365071] "0:06:30" "0:07:21" "0:13:49" "0:29:43" ...
## $ day_of_week : num [1:365071] 3 3 3 3 3 3 3 3 3 3 ...
## - attr(*, "spec")=
## .. cols(
## .. trip_id = col_double(),
## .. start_time = col_character(),
## .. end_time = col_character(),
## .. bikeid = col_double(),
## .. tripduration = col_number(),
## .. from_station_id = col_double(),
## .. from_station_name = col_character(),
## .. to_station_id = col_double(),
## .. to_station_name = col_character(),
## .. usertype = col_character(),
## .. gender = col_character(),
## .. birthyear = col_character(),
## .. ride_length = col_character(),
## .. day_of_week = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(q1_2020)
## spc_tbl_ [426,887 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:426887] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:426887] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : chr [1:426887] "2020-01-21 20:06:59" "2020-01-30 14:22:39" "2020-01-09 19:29:26" "2020-01-06 16:17:07" ...
## $ ended_at : chr [1:426887] "2020-01-21 20:14:30" "2020-01-30 14:26:22" "2020-01-09 19:32:17" "2020-01-06 16:25:56" ...
## $ start_station_name: chr [1:426887] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : num [1:426887] 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr [1:426887] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : num [1:426887] 326 318 117 24 212 96 212 212 96 100 ...
## $ start_lat : num [1:426887] 42 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:426887] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:426887] 42 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:426887] -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:426887] "member" "member" "member" "member" ...
## $ ride_length : 'hms' num [1:426887] 00:07:31 00:03:43 00:02:51 00:08:49 ...
## ..- attr(*, "units")= chr "secs"
## $ day_of_week : num [1:426887] 3 5 5 2 5 6 6 6 6 6 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. ride_length = col_time(format = ""),
## .. day_of_week = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Convert ride_length to numeric seconds (2019 stores it as "H:MM:SS" text; 2020 is already parsed as hms)
q1_2019 <- mutate(q1_2019, ride_length = as.numeric(hms::parse_hms(ride_length)))
q1_2020 <- mutate(q1_2020, ride_length = as.numeric(ride_length))
# Ensure ride_id and rideable_type are character for consistent stacking
q1_2019 <- mutate(q1_2019,
                  ride_id = as.character(ride_id),
                  rideable_type = as.character(rideable_type))
# Combine datasets from both quarters
all_trips <- bind_rows(q1_2019, q1_2020)
# Drop columns that are not used in the analysis or are not present in both years
all_trips <- select(all_trips, -c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, tripduration))
# --- Feature Engineering and Data Cleaning ---
# Convert started_at to Date for extracting temporal features
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(all_trips$date, "%m")
all_trips$day <- format(all_trips$date, "%d")
all_trips$year <- format(all_trips$date, "%Y")
all_trips$day_of_week <- format(all_trips$date, "%A")
# Recalculate ride length in seconds from the timestamps (overwrites the imported ride_length)
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
all_trips$ride_length <- as.numeric(all_trips$ride_length) # Ensure it's numeric
# Standardize rider type labels
table(all_trips$member_casual) # Inspect initial distribution
##
## casual Customer member Subscriber
## 48480 23163 378407 341906
all_trips <- mutate(all_trips,
                    member_casual = recode(member_casual,
                                           "Subscriber" = "member",
                                           "Customer" = "casual"))
table(all_trips$member_casual) # Verify changes
##
## casual member
## 71643 720313
# Remove invalid ride data (negative duration or HQ QR station)
all_trips_v2 <- filter(all_trips, !(start_station_name == "HQ QR" | ride_length < 0))
# --- Data Exploration (Optional but good for context) ---
# glimpse(all_trips_v2)
# summary(all_trips_v2)
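As a sanity check on the filter above, it can help to count how many rows each rule matches. The following is a minimal sketch on a small made-up data frame (`toy_trips` and its values are hypothetical, not taken from the Divvy data) and assumes only that dplyr is installed. One point worth knowing: `filter()` also drops rows where the condition evaluates to `NA`, so a trip with a missing station name is removed along with the HQ QR and negative-duration trips.

```r
library(dplyr)

# Hypothetical toy data standing in for all_trips (values are made up)
toy_trips <- tibble(
  start_station_name = c("Clark St & Elm St", "HQ QR", "State St & Randolph St", NA),
  ride_length        = c(390, 120, -15, 600)
)

# Same rule as in the report: drop HQ QR trips and negative durations
toy_clean <- dplyr::filter(toy_trips, !(start_station_name == "HQ QR" | ride_length < 0))

# Count what each rule matches, treating missing values explicitly
removed <- summarise(
  toy_trips,
  hq_qr       = sum(start_station_name == "HQ QR", na.rm = TRUE),
  negative    = sum(ride_length < 0, na.rm = TRUE),
  missing_stn = sum(is.na(start_station_name))
)

print(removed)
print(nrow(toy_clean))
```

Running the same counts on all_trips before filtering would show how much of the reduction comes from each rule and whether any rows are lost only because of missing values.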
This section examines the distribution of ride lengths (in seconds) for all cleaned trips to understand typical usage patterns. Key measures of central tendency and spread are presented below.
The mean ride duration across all riders was approximately 19.8 minutes (1,189 seconds). The median, which is less sensitive to outliers, was about 9 minutes (539 seconds), so the typical ride is considerably shorter than the mean suggests. The gap between the two indicates a right-skewed distribution driven by a relatively small number of very long rides.
Ride lengths ranged from a minimum of 1 second to a maximum of about 2,952.3 hours (10,628,422 seconds). This wide variability suggests diverse usage scenarios, and the extreme maximum may reflect data anomalies rather than genuine rides.
A detailed statistical summary of ride lengths is presented below:
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 331 539 1189 912 10628422
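The minute and hour figures quoted in this section follow from the summary values by simple unit conversion. The short sketch below reproduces that arithmetic in base R; the variable names are illustrative, and the inputs are the rounded values printed by summary().

```r
# Summary values in seconds, copied from the summary() output
mean_secs   <- 1189
median_secs <- 539
max_secs    <- 10628422

# Convert to the units used in the text
mean_minutes   <- round(mean_secs / 60, 1)
median_minutes <- round(median_secs / 60, 1)
max_hours      <- round(max_secs / 3600, 1)
max_days       <- round(max_secs / 86400)

cat(mean_minutes, median_minutes, max_hours, max_days, "\n")
```

Expressing the maximum in days as well as hours makes it easier to judge whether the longest recorded trip is plausible as a single ride.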
Comparing Ride Duration by Rider Type
To understand the distinct usage patterns of members and casual riders, the following table compares key ride duration statistics for each group:
| Rider Type | Average Ride (minutes) | Median Ride (minutes) | Max Ride (hours) | Min Ride (seconds) |
|---|---|---|---|---|
| Member | 13.3 | 8.5 | 1693.5 | 1 |
| Casual | 89.5 | 23.2 | 2952.3 | 2 |
# Descriptive statistics by rider type (in seconds)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = summary)
## all_trips_v2$member_casual all_trips_v2$ride_length.Min.
## 1 casual 2.000000e+00
## 2 member 1.000000e+00
## all_trips_v2$ride_length.1st Qu. all_trips_v2$ride_length.Median
## 1 7.800000e+02 1.393000e+03
## 2 3.170000e+02 5.080000e+02
## all_trips_v2$ride_length.Mean all_trips_v2$ride_length.3rd Qu.
## 1 5.370768e+03 2.359000e+03
## 2 7.951324e+02 8.230000e+02
## all_trips_v2$ride_length.Max.
## 1 1.062842e+07
## 2 6.096428e+06
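The aggregate() output reports every statistic in seconds and in scientific notation, which is hard to compare with the table. As an alternative sketch, the same per-group statistics could be computed with dplyr directly in minutes and hours. The data frame below is a made-up stand-in for all_trips_v2 (the durations are hypothetical), and the output column names are illustrative.

```r
library(dplyr)

# Hypothetical toy data (made-up durations in seconds), standing in for all_trips_v2
toy_trips <- tibble(
  member_casual = c("member", "member", "member", "casual", "casual", "casual"),
  ride_length   = c(300, 600, 900, 1200, 1800, 7200)
)

# One row per rider type, with each statistic in a readable unit
duration_by_type <- toy_trips %>%
  group_by(member_casual) %>%
  summarise(
    mean_minutes   = mean(ride_length) / 60,
    median_minutes = median(ride_length) / 60,
    max_hours      = max(ride_length) / 3600,
    min_seconds    = min(ride_length),
    .groups = "drop"
  )

print(duration_by_type)
```

Replacing toy_trips with all_trips_v2 should yield the figures in the comparison table, up to rounding.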
# Average ride time by rider type and day of the week (initial - unordered)
average_ride_by_day_unordered <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# print(average_ride_by_day_unordered) # Keep this commented out in the final report
# Order the days of the week for meaningful analysis and visualization
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# Average ride time by rider type and day of the week (ordered)
average_ride_by_day <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# print(average_ride_by_day) # Keep this commented out
# Summarize ridership data by type and weekday for visualization
rider_weekday_summary <- all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>% # derive the weekday label from the start time
  group_by(member_casual, weekday) %>%                 # group by rider type and weekday
  summarise(number_of_rides = n(),                     # number of rides per group
            average_duration = mean(ride_length),      # average duration in seconds
            .groups = "drop") %>%                      # return an ungrouped tibble
  arrange(member_casual, weekday)                      # sort by rider type, then weekday
The following visualizations illustrate the differences in riding patterns between members and casual riders across the week, focusing on the number of rides and the average ride duration.
# Visualize the number of rides by rider type and weekday
library(scales) # Load the scales package
ggplot(rider_weekday_summary, aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("casual" = "#f0fa65", "member" = "#3b92a5")) +
  labs(title = "Number of Rides by Rider Type and Weekday",
       x = "Day of Week",
       y = "Number of Rides",
       fill = "Rider Type") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) + # Center the title
  scale_y_continuous(labels = comma) # Format y-axis labels with commas
# Visualize the average ride duration by rider type and weekday
ggplot(rider_weekday_summary, aes(x = weekday, y = average_duration / 60, fill = member_casual)) + # Display in minutes for better readability
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("casual" = "#f0fa65", "member" = "#3b92a5")) +
  labs(title = "Average Ride Duration (Minutes) by Rider Type and Weekday",
       x = "Day of Week",
       y = "Average Ride Duration (Minutes)",
       fill = "Rider Type") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
The analysis reveals clear differences in how annual members and casual riders use the Cyclistic bike-share program:
Ride Duration: Casual riders consistently have longer average ride durations than members (a mean of 89.5 minutes versus 13.3 minutes overall), particularly on weekends. This suggests that casual riders may be using the service more for leisure activities or longer trips.
Weekday vs. Weekend Usage: Members show a more consistent number of rides throughout the week, which is likely indicative of commuting or regular daily activities. In contrast, casual ridership peaks on weekends, which is consistent with leisure use.
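One way to put a number on the weekday versus weekend contrast is the share of each group's rides that fall on Saturday or Sunday. The sketch below illustrates the calculation on a made-up table shaped like rider_weekday_summary; the ride counts are hypothetical and chosen only to make the arithmetic easy to follow, and the abbreviated English weekday labels are an assumption about how wday(label = TRUE) prints in the analyst's locale.

```r
library(dplyr)

# Hypothetical ride counts (made up), shaped like rider_weekday_summary
toy_summary <- tibble(
  member_casual   = rep(c("casual", "member"), each = 7),
  weekday         = rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), times = 2),
  number_of_rides = c(30, 10, 10, 10, 10, 10, 20,
                      50, 100, 110, 110, 100, 90, 40)
)

# Fraction of each rider type's rides that happen on Saturday or Sunday
weekend_share <- toy_summary %>%
  mutate(is_weekend = weekday %in% c("Sat", "Sun")) %>%
  group_by(member_casual) %>%
  summarise(
    weekend_share = sum(number_of_rides[is_weekend]) / sum(number_of_rides),
    .groups = "drop"
  )

print(weekend_share)
```

If rides were spread evenly across the week, the weekend share would be 2/7 (about 0.29), which gives a simple baseline for judging whether a group leans toward weekend or weekday use.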
These findings have several implications for Cyclistic’s strategies:
Targeted Marketing: Marketing campaigns can be tailored to each user group. For casual riders, highlighting weekend leisure opportunities and longer ride benefits could be effective. For members, emphasizing the convenience for weekday commutes and offering incentives for off-peak usage might be beneficial.
Operational Optimization: Understanding the peak usage times for each group can inform decisions regarding bike availability and station management. The higher weekend usage by casual riders may necessitate increased bike availability in popular leisure areas during those times.
# Aggregate data: average ride length by rider type and day of the week
counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
# Rename columns for clarity in the exported CSV
names(counts) <- c("Member_Casual", "Day_of_Week", "Average_Ride_Length_Seconds")
# Define a platform-independent file path
output_file <- file.path("~", "Documents", "avg_ride_length.csv")
# Export the aggregated data to a CSV file, without row names
write.csv(counts, file = output_file, row.names = FALSE)
# Provide user feedback on the successful export
cat("Aggregated data successfully exported to:", output_file, "\n")
## Aggregated data successfully exported to: ~/Documents/avg_ride_length.csv
# Optional: Display the first few rows of the exported data for verification
head(counts)
## Member_Casual Day_of_Week Average_Ride_Length_Seconds
## 1 casual Sunday 5059.1813
## 2 member Sunday 972.1609
## 3 casual Monday 4750.7627
## 4 member Monday 822.2134
## 5 casual Tuesday 4560.8190
## 6 member Tuesday 769.4416
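To confirm that an export like this can be read back unchanged, the file can be re-imported and compared with the in-memory table. The sketch below does this in base R with a two-row toy table (values copied from the first two rows of the output above) and writes to a temporary file rather than to ~/Documents; the names counts_toy, tmp_file, and round_trip are illustrative.

```r
# Toy version of the exported table (first two rows of the printed output)
counts_toy <- data.frame(
  Member_Casual = c("casual", "member"),
  Day_of_Week = c("Sunday", "Sunday"),
  Average_Ride_Length_Seconds = c(5059.1813, 972.1609)
)

# Write to a temporary file so nothing in ~/Documents is touched
tmp_file <- tempfile(fileext = ".csv")
write.csv(counts_toy, file = tmp_file, row.names = FALSE)

# Read the file back and compare it with the original table
round_trip <- read.csv(tmp_file, stringsAsFactors = FALSE)
print(all.equal(counts_toy, round_trip))
```

In the full analysis, Day_of_Week is an ordered factor, and a CSV file does not preserve factor levels or their order. After re-importing, the column comes back as plain text and would need to be converted again with ordered() before plotting.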