This case study focuses on a bike-share company in Chicago called Cyclistic. As a junior data analyst on the marketing team, your goal is to support the development of a data-driven marketing strategy aimed at increasing the number of annual members.
The marketing director, Lily Moreno, believes that long-term success depends on converting casual riders into annual members. To do this, the team needs to understand how these two groups use the bike-share service differently.
Cyclistic operates a fleet of over 5,800 bikes and more than 600 docking stations. While most users ride traditional bikes, the company also offers inclusive options such as hand tricycles and cargo bikes.
In the past, Cyclistic focused on broad awareness through flexible pricing plans—single-ride, day-pass, and annual membership. However, financial analysts have found that annual members are significantly more profitable. That’s why the current strategy is to convert existing casual riders, who are already familiar with the service, into loyal members.
In this project, historical trip data will be analyzed to uncover trends and user behavior patterns. These insights will support the creation of targeted marketing strategies and help present clear, evidence-based recommendations to the Cyclistic executive team.
For this analysis, I used Cyclistic’s historical trip data to explore user behavior and identify differences between casual riders and annual members.
The data was downloaded from the official Divvy bike-share system, which is used as a fictional stand-in for Cyclistic in this case study. It is publicly available under an open data license from Motivate International Inc. and can be accessed here: https://divvy-tripdata.s3.amazonaws.com/index.html
The dataset includes rich trip-level information, such as ride start and end times, trip duration, station locations, bike types, and rider types. Due to privacy concerns, no personally identifiable information (PII) is included. This means I could not determine if casual riders live within the service area or whether they have made multiple purchases.
Since I am not using RStudio Cloud, I was able to download and work with the full set of data from the past 12 months, allowing for a more comprehensive analysis.
This dataset is the basis of my exploration and supports the development of insights that can guide Cyclistic’s marketing strategy.
To give an overview of the dataset, here are the first few rows of the raw trip data:
# Load tidyverse and data.table (if not already loaded)
library(tidyverse)
library(data.table)
library(dplyr)
library(knitr) # For kable()
library(gt)
library(kableExtra)
df <- fread("Divvy_Trips_2019_Q1.csv") %>% as_tibble()
# Show all columns for the first 5 rows
df %>%
slice_head(n = 5)%>%
kable()
col_types <- tibble(
Column = colnames(df),
Data_Type = sapply(df, class)
)
kable(col_types, caption = "Column Names and Their Data Types")
The process_csv function is designed to read a CSV file
containing bike trip data, process and clean the data, and then save the
processed data into a new CSV file. The function performs the following
steps:
input_file (the path
to the input CSV file) and output_file (the path where the
processed CSV will be saved).fread function
from the data.table package.start_time and end_time in the
dataset are assumed to contain date and time information as strings.
These columns are converted to POSIXct date-time format
using the as.POSIXct() function to enable time-based
operations.ride_length:
ride_length, is created by calculating
the difference between start_time and
end_time in seconds using the
difftime function. This represents the duration of
each bike ride.day_of_week Column:
day_of_week, is added to represent the
day of the week for each trip. The
weekdays() function is used to get the full name of the
weekday (e.g., “Monday”, “Tuesday”).week_map).fwrite function, which is a fast writing function from
the data.table package.# Load necessary libraries
library(data.table)
library(dplyr)
# Define the process_csv function
process_csv <- function(input_file, output_file) {
# Read the CSV file
df <- fread(input_file)
# Convert start_time and end_time to date-time format
df$start_time <- as.POSIXct(df$start_time, format = "%Y-%m-%d %H:%M:%S") # Start time
df$end_time <- as.POSIXct(df$end_time, format = "%Y-%m-%d %H:%M:%S") # End time
# Calculate ride_length: the difference between start_time and end_time (in seconds)
df$ride_length <- as.numeric(difftime(df$end_time, df$start_time, units = "secs"))
# Add day_of_week: the day of the week for each trip
df$day_of_week <- weekdays(df$start_time, abbreviate = FALSE) # Get full weekday name
week_map <- c("Sunday" = 1, "Monday" = 2, "Tuesday" = 3, "Wednesday" = 4,
"Thursday" = 5, "Friday" = 6, "Saturday" = 7)
df$day_of_week <- week_map[df$day_of_week]
# Save the processed data to a new CSV file
fwrite(df, output_file)
}
The dataset for the entire year was obtained by combining data from four quarters. However, for the second-quarter dataset, I had to rename the columns to match the column names used in the first quarter’s dataset to ensure consistency across all quarters.
To do this, I followed these steps:
Divvy_Trips_2019_Q2.csv) and the first-quarter dataset
(Divvy_Trips_2019_Q1.csv) using the fread
function from the data.table package.df) to match the column names in the first-quarter
dataset (df1).Divvy_Trips_2019_Q2_VX.csv).Here is the code I used:
df <- fread("Divvy_Trips_2019_Q2.csv")
df1 <- fread("Divvy_Trips_2019_Q1.csv")
colnames(df) <- colnames(df1)
fwrite(df,"Divvy_Trips_2019_Q2_VX.csv" )
To combine the data from all four quarters, I used the following approach:
dplyr for data
manipulation and readr for reading CSV files.Q1, Q2, Q3, and Q4)
that I had already pre-processed.read_csv() function from the
readr package to read each of the CSV files into a
list.bind_rows() from the dplyr package,
which efficiently appends the rows of all data frames into a single data
frame.head() and str() functions to
inspect the first few rows and the structure of the combined data.Divvy_Trips_2019_combined.csv).# Specify the file paths
file_paths <- c(
"Divvy_Trips_2019_Q1_V5.csv",
"Divvy_Trips_2019_Q2_V5.csv",
"Divvy_Trips_2019_Q3_V5.csv",
"Divvy_Trips_2019_Q4_V5.csv"
)
# Read the data files and store them in a list
data_list <- lapply(file_paths, read_csv)
# Combine all data frames into one large data frame
combined_data <- bind_rows(data_list)
# Inspect the combined data
print(head(combined_data))
print(str(combined_data))
# Save the combined data to a new CSV file
write_csv(combined_data, "Divvy_Trips_2019_combined.csv")