This project analyzes how casual riders and annual members use Cyclistic bikes differently, using historical bike-share data. The goal is to uncover behavior patterns to support a targeted marketing strategy focused on converting casual riders into long-term members.
You are a junior data analyst on Cyclistic’s marketing analytics team. Your manager, Lily Moreno, has tasked you with uncovering insights into rider behavior. Your findings will guide a strategic campaign aimed at increasing annual memberships—key to Cyclistic’s sustainable growth.
Cyclistic is a bike-share program in Chicago, founded in 2016. It offers a flexible pricing structure including single-ride, daily, and annual plans. The company has:
Annual members are more profitable than casual users, and converting casual riders is central to the marketing team’s new strategy.
The objective is to determine how annual members and casual riders use Cyclistic bikes differently. This analysis will identify trends in rider behavior—such as ride length, frequency, or preferred days—and clarify whether casual riders show habits that suggest they could become members. These findings will inform a targeted marketing campaign to increase annual memberships.
For this project, I’m working with two datasets, Divvy_Trips_2019_Q1 and Divvy_Trips_2020_Q1:
These files contain historical trip records for a bike-share program in Chicago, made available by Motivate International Inc. under a public license. Although Cyclistic is fictional, these datasets are representative and appropriate for this analysis. They include key attributes such as rider type, ride duration, start and end times, and station IDs—all crucial for examining usage behavior.
I created a dedicated folder named cyclistic_case_study on my device to house all related materials. Within it, I organized:
This structure helps me stay organized as I move through the steps of the analysis.
To ensure the data is suitable, I evaluated it using the ROCCC framework:
Reliable: The datasets come from a well-established data provider and have been widely used by other analysts.
Original: I’m working with raw trip-level data, not summaries.
Comprehensive: The data spans different seasons and contains a variety of variables useful for behavioral segmentation.
Current: Although the data is from 2019–2020, it reflects consistent trends that are still valuable for strategy planning.
Cited: Appropriate attribution is provided to Motivate International Inc.
Licensing, Privacy, and Accessibility
This data excludes personally identifiable information (PII), aligning with standard privacy practices. I’m focusing only on publicly available variables like user type and ride duration. All code and documentation are included in this RMarkdown file and rendered as an HTML report to ensure accessibility for stakeholders and collaborators.
To prepare the files:
# Install required packages (run once; skip these lines if the packages are already installed)
install.packages("tidyverse") # For data manipulation and visualization
install.packages("lubridate") # For date-time handling and calculations
install.packages("janitor") # For cleaning column names and tabulations
install.packages("here") # Helps manage file paths easily
install.packages("skimr") # For quick data overviews
install.packages("data.table") # For high-performance data operations if needed
install.packages("readr") # For reading CSV files (also installed with tidyverse)
install.packages("dplyr") # For data wrangling verbs (also installed with tidyverse)
# Load the installed packages into your R environment
library(tidyverse) # For data wrangling and visualization
library(lubridate) # For working with date and time formats
library(janitor) # For cleaning and standardizing column names
library(readr) # For read_csv()
library(dplyr) # For mutate(), filter(), group_by(), summarise()
library(scales) # For comma_format(), used in the plot axis labels
# Import Divvy Q1 2019 and Q1 2020 datasets
# Use exact file names from list.files()
# Kept for reference: reads and renames that use the longer original file names
# divvy_2019 <- read_csv("Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv")
# divvy_2020 <- read_csv("Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv")
# file.rename("Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv", "Divvy_Trips_2019_Q1.csv")
# file.rename("Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv", "Divvy_Trips_2020_Q1.csv")
divvy_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
divvy_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
# Get a quick snapshot of the data
glimpse(divvy_2019)
glimpse(divvy_2020)
The Divvy 2019 Q1 and Divvy 2020 Q1 datasets were loaded and assigned to divvy_2019 and divvy_2020, respectively. Column names and data structures were reviewed for compatibility. To ensure smooth merging, the columns of the divvy_2019 data frame were renamed to match those of divvy_2020:
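The renaming step described above can be sketched as follows. This is a minimal illustration, not the original chunk: it assumes the 2019 file uses the older Divvy column names (trip_id, start_time, end_time, bikeid, from_station_*, to_station_*, usertype), and the one-row tibble `divvy_2019_demo` is a hypothetical stand-in with made-up values. Check the real names against the glimpse() output before reusing it.

```r
library(dplyr)

# Hypothetical one-row stand-in for the 2019 file. The column names are an
# assumption about the older Divvy schema; the values are made up.
divvy_2019_demo <- tibble(
  trip_id           = 1001,
  start_time        = "2019-01-01 00:04:37",
  end_time          = "2019-01-01 00:11:07",
  bikeid            = 42,
  from_station_id   = 1,
  from_station_name = "Station A",
  to_station_id     = 2,
  to_station_name   = "Station B",
  usertype          = "Subscriber"
)

# rename(new_name = old_name) maps each 2019 column onto its 2020 equivalent.
divvy_2019_renamed <- divvy_2019_demo %>%
  rename(
    ride_id            = trip_id,
    rideable_type      = bikeid,
    started_at         = start_time,
    ended_at           = end_time,
    start_station_id   = from_station_id,
    start_station_name = from_station_name,
    end_station_id     = to_station_id,
    end_station_name   = to_station_name,
    member_casual      = usertype
  )

names(divvy_2019_renamed)
```

rename() keeps the column order and leaves any unlisted columns (such as gender or birthyear) untouched, so they can still be dropped later.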
Both datasets were merged into a unified data frame, all_trips, to enable consolidated analysis:
library(dplyr) # Enables %>%, mutate(), select(), rename()
library(lubridate) # Enables ymd_hms(); difftime() itself is base R
library(readr) # Enables read_csv()
library(janitor) # Enables clean_names(), remove_empty()
# Convert ride_id and rideable_type to character in both data frames
# so that bind_rows() can stack them without a type conflict
divvy_2019 <- divvy_2019 %>%
  mutate(
    ride_id = as.character(ride_id),
    rideable_type = as.character(rideable_type)
  )
divvy_2020 <- divvy_2020 %>%
  mutate(
    ride_id = as.character(ride_id),
    rideable_type = as.character(rideable_type)
  )
# Stack the two quarters into a single data frame
all_trips <- bind_rows(divvy_2019, divvy_2020)
Within the member_casual column, varied labels such as “Subscriber” and “Customer” were standardized into two groups—“member” and “casual”—to ensure consistency:
all_trips <- all_trips %>%
  mutate(member_casual = if_else(member_casual %in% c("Subscriber", "member"), "member", "casual"))
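A quick cross-tabulation can confirm how each original label is mapped. The sketch below uses a hypothetical set of labels (`labels_demo`) and writes the result to a separate, illustrative column `new_label` so that old and new values can be compared side by side. Note that any value outside the member set, including a missing value, falls into "casual".

```r
library(dplyr)

# Hypothetical labels covering both naming schemes, plus a missing value.
labels_demo <- tibble(
  member_casual = c("Subscriber", "Customer", "member", "casual", NA)
)

labels_recoded <- labels_demo %>%
  mutate(
    new_label = if_else(member_casual %in% c("Subscriber", "member"), "member", "casual")
  )

# One row per (original label, new label) pair, with its frequency.
count(labels_recoded, member_casual, new_label)
```

If missing rider types should stay missing rather than count as casual, they would need to be handled before this recode.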
Because the 2020 dataset has no built-in tripduration column, a new ride_length field was created by computing the time difference in seconds between ended_at and started_at. A second field, ride_length_mins, holds the same value converted to minutes:
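all_trips <- all_trips %>%
  mutate(
    started_at = ymd_hms(started_at),
    ended_at = ymd_hms(ended_at),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "secs")),
    ride_length_mins = round(ride_length / 60, 2),
    # Weekday name of the ride start, needed by the weekday summaries later on
    # (assumed definition: based on started_at)
    day_of_week = format(started_at, "%A")
  )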
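As a worked example of the duration calculation, the sketch below applies the same two steps to a single hypothetical trip that lasts 12 minutes and 30 seconds.

```r
library(lubridate)

# Hypothetical timestamps for a single trip.
started_at <- ymd_hms("2020-01-15 08:00:00")
ended_at   <- ymd_hms("2020-01-15 08:12:30")

# difftime() returns a "difftime" object; as.numeric() strips the units
# so the result can be used in ordinary arithmetic.
ride_length      <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_length_mins <- round(ride_length / 60, 2)

ride_length       # 750 seconds
ride_length_mins  # 12.5 minutes
```

Fixing `units = "secs"` matters because difftime() otherwise picks a unit automatically, which would make values inconsistent across rows of different lengths.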
Non-essential columns, such as latitude/longitude, birth year, gender, and trip duration, were removed to streamline the dataset:
all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, tripduration))
To maintain analytical integrity, records with negative durations, durations under one minute, or flagged station names like “HQ QR” were filtered out. A new version of the dataset—all_trips_v2—was created to preserve original data:
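all_trips_v2 <- all_trips %>%
  filter(
    ride_length >= 60, # Drops negative durations and rides under one minute
    start_station_name != "HQ QR" # Drops rides from the flagged "HQ QR" station
  )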
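One side effect of this filter is worth checking: filter() keeps only rows where every condition is TRUE, so rows with a missing start_station_name are dropped as well, because `NA != "HQ QR"` evaluates to NA. The sketch below shows this on a hypothetical five-row table (`trips_demo`) and gives a variant that keeps rows with a missing station name.

```r
library(dplyr)

# Hypothetical rows: one valid trip, one negative duration, one sub-minute
# trip, one flagged "HQ QR" trip, and one trip with a missing station name.
trips_demo <- tibble(
  ride_length        = c(600, -30, 45, 900, 300),
  start_station_name = c("Station A", "Station B", "Station C", "HQ QR", NA)
)

trips_demo_v2 <- trips_demo %>%
  filter(
    ride_length >= 60,
    start_station_name != "HQ QR"
  )

# Number of rows removed, including the row with the missing station name.
nrow(trips_demo) - nrow(trips_demo_v2)  # 4

# Variant that keeps rows whose station name is missing.
trips_demo_keep_na <- trips_demo %>%
  filter(
    ride_length >= 60,
    is.na(start_station_name) | start_station_name != "HQ QR"
  )
nrow(trips_demo_keep_na)  # 2
```

Comparing nrow() before and after the filter on the real data is a cheap way to see how many records the cleaning step removed.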
I used RStudio to run a step-by-step descriptive analysis of ride behavior. The goal is to understand how casual riders and annual members differ in ride length and temporal patterns, laying the groundwork for targeted marketing initiatives. Each section below explains the rationale and then gives the RMarkdown chunk used. The results of these chunks feed directly into the visualizations in the Share phase.
Objectives
Descriptive Statistics for Ride Length
Central tendency and spread of ride_length (in seconds) are measured across all trips. These metrics describe the typical ride and help identify outliers.
# Straightforward summary of ride_length
mean(all_trips_v2$ride_length)
median(all_trips_v2$ride_length)
max(all_trips_v2$ride_length)
min(all_trips_v2$ride_length)
# Condensed summary
summary(all_trips_v2$ride_length)
Comparing Ride Length Between Rider Types
Compare members vs. casual riders on mean, median, max, and min ride lengths. This reveals whether one group tends to take longer or shorter trips.
aggregate(ride_length ~ member_casual, data = all_trips_v2, FUN = mean)
aggregate(ride_length ~ member_casual, data = all_trips_v2, FUN = median)
aggregate(ride_length ~ member_casual, data = all_trips_v2, FUN = max)
aggregate(ride_length ~ member_casual, data = all_trips_v2, FUN = min)
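The four aggregate() calls can also be collected into a single table with dplyr. This is an alternative sketch on hypothetical ride lengths (`rides_demo`), not a replacement for the chunk above; the output column names are illustrative.

```r
library(dplyr)

# Hypothetical ride lengths in seconds.
rides_demo <- tibble(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(1200, 2400, 3600, 300, 600, 900)
)

# One row per rider type, one column per statistic.
length_summary <- rides_demo %>%
  group_by(member_casual) %>%
  summarise(
    mean_secs   = mean(ride_length),
    median_secs = median(ride_length),
    max_secs    = max(ride_length),
    min_secs    = min(ride_length),
    .groups = "drop"
  )

length_summary
```

Having all statistics side by side makes it easier to compare the two groups than reading four separate outputs.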
Average Ride Time by Day of Week
To uncover weekly patterns, we calculate the average ride length by member_casual and day_of_week. At this point, days may be alphabetically ordered rather than chronologically.
aggregate(ride_length ~ member_casual + day_of_week, data = all_trips_v2, FUN = mean)
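The alphabetical ordering mentioned above comes from day_of_week being stored as plain text. A minimal sketch of the fix, using a few hypothetical weekday values, is to convert the column to a factor with explicit levels:

```r
# Hypothetical weekday labels as they might appear in day_of_week.
day_of_week <- c("Monday", "Friday", "Sunday", "Wednesday")

# As plain text, sorting falls back to alphabetical order.
sort(day_of_week)

# With explicit levels, the factor sorts Sunday through Saturday.
weekday_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday")
day_of_week_ordered <- factor(day_of_week, levels = weekday_levels)
as.character(sort(day_of_week_ordered))
```

Once the column is a factor, aggregate(), group_by(), and ggplot2 all follow the level order instead of the alphabet.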
Analysis of Ridership by Rider Type and Weekday
# Total rides by type and weekday
rides_by_day <- all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(total_rides = n(), .groups = "drop")
# Average ride duration (in minutes)
duration_by_day <- all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(
    avg_duration_min = mean(ride_length) / 60,
    .groups = "drop"
  )
rides_by_month_type <- all_trips_v2 %>%
  mutate(month = month(started_at, label = TRUE, abbr = TRUE)) %>%
  group_by(month, member_casual) %>%
  summarise(total_rides = n(), .groups = "drop")
# Keep months in calendar order on the x-axis
rides_by_month_type$month <- factor(rides_by_month_type$month, levels = month.abb)
ggplot(rides_by_month_type, aes(month, total_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = total_rides), position = position_dodge(0.9), vjust = -0.5, size = 3, color = "black") +
  scale_fill_manual(name = "Rider Type", values = c(casual = "#1f98b4", member = "#FF6F61")) +
  # The scales:: prefix makes the label call work even if scales is not attached;
  # the upper expansion leaves room for the count labels above the bars
  scale_y_continuous(labels = scales::label_comma(), expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Total Rides per Rider Type", x = "Month", y = "Number of Rides",
       caption = 'Analysis of data from "Divvy_Trips_Q1_2019" and "Divvy_Trips_Q1_2020"') +
  theme_minimal() +
  theme(legend.position = "right", plot.margin = margin(10, 10, 20, 10),
        axis.text.x = element_text(angle = 0, hjust = 0.5))
Key insights:
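# Order weekdays chronologically so the line runs Sunday through Saturday
duration_by_day$day_of_week <- factor(duration_by_day$day_of_week,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
ggplot(duration_by_day, aes(x = day_of_week, y = avg_duration_min, color = member_casual, group = member_casual)) +
  geom_line(linewidth = 1) + # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_point(size = 2) +
  scale_color_manual(values = c(casual = "#1f78b4", member = "#33a02c")) +
  scale_y_continuous(limits = c(0, 150), breaks = seq(0, 150, by = 20), expand = c(0, 0)) +
  labs(title = "Average Ride Duration by Weekday and Rider Type", x = "Day of Week", y = "Average Duration (minutes)",
       color = "Rider Type", caption = 'Analysis of data from "Divvy_Trips_Q1_2019" and "Divvy_Trips_Q1_2020"') +
  theme_minimal()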
Key insights:
library(dplyr)
library(ggplot2)
# Order weekdays chronologically for the heatmap's x-axis
all_trips_v2$day_of_week <- factor(all_trips_v2$day_of_week,
  levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
rides_heatmap <- all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(total_rides = n(), .groups = "drop")
ggplot(rides_heatmap, aes(x = day_of_week, y = member_casual, fill = total_rides)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "#fee5d9", high = "#de2d26", name = "Number of Rides") +
  labs(title = "Heatmap of Total Rides by Weekday and Rider Type", x = "Day of Week", y = "Rider Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key Insights from the Heatmap:
Members outpace casual riders every day
• All “member” cells are deep red, showing consistently high ride volumes throughout the week.
• “Casual” cells remain in pale pink, indicating far lower usage.
Weekday peaks among members
• The darkest cell appears on Tuesday for members, pinpointing the busiest single day.
• High member demand persists Monday–Thursday, then tapers slightly into the weekend.
Leisure-focused casual usage
• Casual riders show a subtle uptick Friday–Sunday (Friday is still light, but Saturday and Sunday are slightly darker).
• This suggests casual users ride more on weekends, likely for recreation rather than commuting.
| Category | Casual Riders | Members |
|---|---|---|
| Ride Count | ~10–15 % of total rides | ~85–90 % of total rides |
| Average Ride Duration | 35–45 min avg, longest on weekends | 13–15 min avg, steady across all weekdays |
| Duration Peaks | Highest on Saturday; widest gap on Sunday | Small weekend uptick; peak duration on Tuesday |
| Weekly Usage Pattern | Light Mon–Thu; surge Fri–Sun | High Mon–Thu demand; tapering into the weekend |
| Monthly Ride Trend | Rapid growth in March (from ~12 k to ~40 k rides) | Stable high counts Jan–Mar with a March uplift |
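The ride-count shares in the first row of the table are simple proportions of the total. A minimal sketch of that calculation is shown below; the counts in `counts_demo` are hypothetical round numbers chosen for illustration, not the study's actual totals.

```r
library(dplyr)

# Hypothetical ride counts per rider type (not the study's actual totals).
counts_demo <- tibble(
  member_casual = c("casual", "member"),
  total_rides   = c(120, 880)
)

# Share of all rides, as a percentage rounded to one decimal place.
shares_demo <- counts_demo %>%
  mutate(share_pct = round(100 * total_rides / sum(total_rides), 1))

shares_demo
```

On the real data, the same mutate() can be applied to the output of `count(all_trips_v2, member_casual, name = "total_rides")`.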
Across these charts, casual riders form a smaller but fast-growing segment—favoring long, leisure-driven outings on weekends (especially in March)—whereas members conduct frequent, short trips peaking mid-week. Aligning marketing toward weekend casuals and reallocating bikes for commute versus leisure periods can unlock both operational efficiencies and membership growth.