Introduction

This case study focuses on a bike-share company in Chicago called Cyclistic. As a junior data analyst on the marketing team, your goal is to support the development of a data-driven marketing strategy aimed at increasing the number of annual members.

The marketing director, Lily Moreno, believes that long-term success depends on converting casual riders into annual members. To do this, the team needs to understand how these two groups use the bike-share service differently.

Cyclistic operates a fleet of over 5,800 bikes and more than 600 docking stations. While most users ride traditional bikes, the company also offers inclusive options such as hand tricycles and cargo bikes.

In the past, Cyclistic focused on broad awareness through flexible pricing plans—single-ride, day-pass, and annual membership. However, financial analysts have found that annual members are significantly more profitable. That’s why the current strategy is to convert existing casual riders, who are already familiar with the service, into loyal members.

In this project, historical trip data will be analyzed to uncover trends and user behavior patterns. These insights will support the creation of targeted marketing strategies and help present clear, evidence-based recommendations to the Cyclistic executive team.

Data Overview

For this analysis, I used Cyclistic’s historical trip data to explore user behavior and identify differences between casual riders and annual members.

The data was downloaded from the official Divvy bike-share system, which is used as a fictional stand-in for Cyclistic in this case study. It is publicly available under an open data license from Motivate International Inc. and can be accessed here: https://divvy-tripdata.s3.amazonaws.com/index.html

The dataset includes rich trip-level information, such as ride start and end times, trip duration, station locations, bike types, and rider types. Due to privacy concerns, no personally identifiable information (PII) is included. This means I could not determine if casual riders live within the service area or whether they have made multiple purchases.

Since I am not using RStudio Cloud, I was able to download and work with the full set of data from the past 12 months, allowing for a more comprehensive analysis.

This dataset is the basis of my exploration and supports the development of insights that can guide Cyclistic’s marketing strategy.

Sample of Raw Data

To give an overview of the dataset, here are the first few rows of the raw trip data:

# Load tidyverse and data.table (if not already loaded)
library(tidyverse)
library(data.table)
library(dplyr)
library(knitr)  # For kable()
library(gt)

library(kableExtra)
df <- fread("Divvy_Trips_2019_Q1.csv") %>% as_tibble()
# Show all columns for the first 5 rows
df %>% 
  slice_head(n = 5)%>%
  kable()
Sample of Raw Data
Sample of Raw Data

Column Types Overview

col_types <- tibble(
  Column = colnames(df),
  Data_Type = sapply(df, class)
)

kable(col_types, caption = "Column Names and Their Data Types")
Dataypes
Dataypes

Process

Explanation of the process_csv Function

The process_csv function is designed to read a CSV file containing bike trip data, process and clean the data, and then save the processed data into a new CSV file. The function performs the following steps:

  1. Read Input CSV:
    • The function takes two arguments: input_file (the path to the input CSV file) and output_file (the path where the processed CSV will be saved).
    • The input CSV file is read using the fread function from the data.table package.
  2. Convert Date Columns to Date-Time Format:
    • The columns start_time and end_time in the dataset are assumed to contain date and time information as strings. These columns are converted to POSIXct date-time format using the as.POSIXct() function to enable time-based operations.
  3. Calculate ride_length:
    • A new column, ride_length, is created by calculating the difference between start_time and end_time in seconds using the difftime function. This represents the duration of each bike ride.
  4. Create day_of_week Column:
    • A new column, day_of_week, is added to represent the day of the week for each trip. The weekdays() function is used to get the full name of the weekday (e.g., “Monday”, “Tuesday”).
    • This column is then mapped to a numerical value (e.g., Monday = 2, Tuesday = 3, etc.) using a custom mapping (week_map).
  5. Save the Processed Data:
    • The cleaned and processed dataset is saved into a new CSV file using the fwrite function, which is a fast writing function from the data.table package.

Code Implementation

# Load necessary libraries
library(data.table)
library(dplyr)

# Define the process_csv function
process_csv <- function(input_file, output_file) {
  # Read the CSV file
  df <- fread(input_file)
  
  # Convert start_time and end_time to date-time format
  df$start_time <- as.POSIXct(df$start_time, format = "%Y-%m-%d %H:%M:%S")  # Start time
  df$end_time <- as.POSIXct(df$end_time, format = "%Y-%m-%d %H:%M:%S")      # End time
  
  # Calculate ride_length: the difference between start_time and end_time (in seconds)
  df$ride_length <- as.numeric(difftime(df$end_time, df$start_time, units = "secs"))
  
  # Add day_of_week: the day of the week for each trip
  df$day_of_week <- weekdays(df$start_time, abbreviate = FALSE)  # Get full weekday name
  week_map <- c("Sunday" = 1, "Monday" = 2, "Tuesday" = 3, "Wednesday" = 4, 
                "Thursday" = 5, "Friday" = 6, "Saturday" = 7)
  df$day_of_week <- week_map[df$day_of_week]
  
  # Save the processed data to a new CSV file
  fwrite(df, output_file)
}

Handling Column Name Changes for the Second Quarter Data

The dataset for the entire year was obtained by combining data from four quarters. However, for the second-quarter dataset, I had to rename the columns to match the column names used in the first quarter’s dataset to ensure consistency across all quarters.

To do this, I followed these steps:

  1. Read the Datasets:
    • I loaded the second-quarter dataset (Divvy_Trips_2019_Q2.csv) and the first-quarter dataset (Divvy_Trips_2019_Q1.csv) using the fread function from the data.table package.
  2. Align Column Names:
    • To ensure consistency, I renamed the columns in the second-quarter dataset (df) to match the column names in the first-quarter dataset (df1).
    • I achieved this by directly assigning the column names of the first-quarter dataset to the second-quarter dataset.
  3. Save the Updated Data:
    • After aligning the column names, I saved the updated second-quarter dataset as a new CSV file (Divvy_Trips_2019_Q2_VX.csv).

Here is the code I used:

df <- fread("Divvy_Trips_2019_Q2.csv")
df1 <- fread("Divvy_Trips_2019_Q1.csv")

colnames(df) <- colnames(df1) 

fwrite(df,"Divvy_Trips_2019_Q2_VX.csv" )

Combining the Quarterly Datasets

To combine the data from all four quarters, I used the following approach:

  1. Load Required Libraries:
    • I loaded the necessary packages: dplyr for data manipulation and readr for reading CSV files.
  2. Specify File Paths:
    • I specified the file paths for each of the quarterly datasets (Q1, Q2, Q3, and Q4) that I had already pre-processed.
  3. Read Data Files:
    • I used the read_csv() function from the readr package to read each of the CSV files into a list.
  4. Combine Data:
    • I combined all the individual data frames into one large data frame using bind_rows() from the dplyr package, which efficiently appends the rows of all data frames into a single data frame.
  5. Check the Combined Data:
    • I used head() and str() functions to inspect the first few rows and the structure of the combined data.
  6. Save the Combined Data:
    • Finally, I saved the combined data into a new CSV file (Divvy_Trips_2019_combined.csv).
# Specify the file paths
file_paths <- c(
  "Divvy_Trips_2019_Q1_V5.csv",
  "Divvy_Trips_2019_Q2_V5.csv",
  "Divvy_Trips_2019_Q3_V5.csv",
  "Divvy_Trips_2019_Q4_V5.csv"
)

# Read the data files and store them in a list
data_list <- lapply(file_paths, read_csv)

# Combine all data frames into one large data frame
combined_data <- bind_rows(data_list)

# Inspect the combined data
print(head(combined_data))
print(str(combined_data))

# Save the combined data to a new CSV file
write_csv(combined_data, "Divvy_Trips_2019_combined.csv")

Analyze and Share

Descriptive Statistics

We begin by calculating some basic statistics:

  • Mean Ride Length: Average duration of a ride in seconds.
  • Max Ride Length: Longest ride in the dataset.
  • Mode of Day of Week: Most frequent day when rides were taken.
mean_ride_length <- mean(df$ride_length, na.rm = TRUE)
max_ride_length <- max(df$ride_length, na.rm = TRUE)
mode_day_of_week <- as.numeric(names(sort(table(df$day_of_week), decreasing = TRUE)[1]))

summary_df <- data.frame(
  Metric = c("Mean Ride Length (s)", "Max Ride Length (s)", "Mode Day of Week"),
  Value = c(mean_ride_length, max_ride_length, mode_day_of_week)
)
summary_df  %>%
  gt() %>%
  tab_header(
    title = "Descriptive Statistics"
  )

Pivot Table 1: Average Ride Length by User Type

This table shows the average ride duration for each user type (e.g., members vs. casual users).

pivot_table_user_type <- df %>%
  group_by(usertype) %>%
  summarise(average_ride_length = mean(ride_length, na.rm = TRUE))

pivot_table_user_type %>%
  gt() %>%
  tab_header(
    title = "Average Ride Length by User Type (s)"

Visualization of Pivot Table 1

p1 <- ggplot(pivot_table_user_type, aes(x = usertype, y = average_ride_length, fill = usertype)) +
  geom_bar(stat = "identity") +
  labs(x = "User Type", y = "Average Ride Length (seconds)", title = "Average Ride Length by User Type") +
  theme_minimal() +
  theme(legend.position = "none")
p1

Pivot Table 2: Average Ride Length by Day of Week and User Type

This pivot table displays how ride duration varies by day of the week and user type.

pivot_table_day_of_week <- df %>%
  group_by(usertype, day_of_week) %>%
  summarise(average_ride_length = mean(ride_length, na.rm = TRUE))

pivot_table_day_of_week %>%
  gt() %>%
  tab_header(
    title = "Average Ride Length by Day of Week and User Type (s)"
  )

Visualization of Pivot Table 2

p2 <- ggplot(pivot_table_day_of_week, aes(x = factor(day_of_week), y = average_ride_length, fill = usertype)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Day of Week", y = "Average Ride Length (seconds)", title = "Average Ride Length by Day") +
  theme_minimal() +
  theme(legend.position = "none")
p2

Pivot Table 3: Number of Rides by Day of Week and User Type

This table summarizes the total number of rides taken per day of the week by user type.

pivot_table_ride_count <- df %>%
  group_by(usertype, day_of_week) %>%
  summarise(ride_count = n_distinct(trip_id))

pivot_table_ride_count  %>%
  gt() %>%
  tab_header(
    title = "Number of Rides by Day of Week and User Type"
)

Visualization of Pivot Table 3

p3 <- ggplot(pivot_table_ride_count, aes(x = factor(day_of_week), y = ride_count, fill = usertype)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Day of Week", y = "Number of Rides", title = "Total Rides by Day") +
  theme_minimal() +
  theme(legend.position = "none")
p3

Combined Visualization

Below is a combined layout of the three visualizations, providing a comprehensive overview of the ride patterns.

(p1 | p2) / p3 + 
  plot_annotation(title = "Bike Ride Analysis for the Whole Year", 
                  theme = theme(plot.title = element_text(hjust = 0.5)))
Combined layout of the three visualizations
Combined layout of the three visualizations

Conclusion

This report provides insights into:

  • How ride duration differs between user types
  • Patterns in ride activity across the days of the week
  • Overall ride frequency and usage trends

These findings could inform decisions about infrastructure planning, customer segmentation, and service optimization.