The company featured in this report is a real-world, globally recognized technology firm operating across multiple markets. Due to confidentiality and data-sharing agreements, its name and identifying details have been redacted from this publication.
What can be shared is that the company specializes in large-scale digital platforms — including mobile applications, web-based tools, and cloud-connected services — used daily by millions of users worldwide. With operations spanning North America, Europe, and Asia, the company delivers frequent product updates across Android, iOS, and web platforms to stay competitive in a rapidly evolving market.
This growth-driven pace, while key to innovation, presents operational challenges — particularly in maintaining product stability and customer support efficiency. The rising support ticket backlog analyzed in this report reflects those very challenges, and the findings offer actionable insight into how a high-performing tech organization can realign its resources to sustain both user satisfaction and technical excellence.
The dataset used in this report is a real-world snapshot from a global technology company’s operational logs. While the dataset appears relatively small in size, this is due to the scoped nature of the export — capturing only a representative timeframe rather than the entire historical backlog. This focused slice provides enough context to investigate the rising trend in support tickets without overwhelming processing tools.
Although this dataset could be managed using traditional tools like Excel or WPS Office, I intentionally chose RStudio so that every cleaning, joining, and charting step is captured in reproducible, auditable code.
This report follows a structured workflow—cleaning, joining, analyzing, and visualizing the data to identify key problem areas. Ultimately, the goal is to surface actionable recommendations to reduce the ticket backlog and improve product performance.
In this phase, I gathered and reviewed the available datasets relevant to the problem:
tickets_large.csv: Contains support ticket records, including product, platform, severity, and date reported.
usage_metrics_large.csv: Includes product usage data such as daily active users and error rates.
releases_large.csv: Lists product release dates and related metadata.
During this review, I confirmed that the shared key fields (Date, Product, Platform, Region) are named consistently across files, converted date columns to Date objects, and ensured categorical fields (e.g., product, platform) are factors or character types (sketched below).
The aim of this phase is to ensure the data is clean, well-structured, and analysis-ready before moving into deeper exploration.
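A minimal sketch of that type-normalization step, assuming the raw tickets table has already been read in with the columns listed above:
library(dplyr)
# Normalize types: the report date as Date, descriptive fields as character
tickets <- tickets %>%
mutate(
created_at = as.Date(created_at),
across(c(product, platform, region, ticket_type, severity), as.character)
)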
In this phase, we ensured that the data was clean, reliable, and ready for analysis. This involved checking formats, aligning fields across datasets, and preparing the data for merging and transformation. The focus was on careful inspection to avoid introducing errors or misleading patterns in later steps.
The first task was to load all three datasets and examine their
column names and structures. This helped us understand what each file
contains and ensured that the key fields (like product,
date, platform) are named consistently. Any
mismatch at this stage could cause join failures or missing data later,
so this check is essential before moving forward.
library(readr)
library(dplyr)
library(knitr)
library(kableExtra)
# File paths
path_tickets <- "D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv"
path_usage <- "D:/Capstone_Project/ticket_backlog/raw_data/usage_metrics_large.csv"
path_releases <- "D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv"
# Load datasets
tickets <- read_csv(path_tickets)
usage <- read_csv(path_usage)
releases <- read_csv(path_releases)
# Function to format header as one-row table
format_header <- function(df, title) {
df_head <- as.data.frame(t(colnames(df)))
colnames(df_head) <- NULL
kable(df_head, format = "html", align = "c", caption = title) %>%
kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(1, bold = TRUE)
}
# Print formatted tables
format_header(tickets, "Tickets Dataset Columns")
| ticket_id | created_at | product | platform | region | ticket_type | severity |
format_header(usage, "Usage Metrics Dataset Columns")
| date | product | platform | region | active_users | error_rate |
format_header(releases, "Releases Dataset Columns")
| release_id | product | version | platform | launch_date | cohort |
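Before moving on, the presence of the shared join keys can also be asserted programmatically. This is a lightweight guard, not part of the original chunks:
# Assert that the join keys exist in every dataset before merging
stopifnot(
all(c("product", "platform", "region", "created_at") %in% colnames(tickets)),
all(c("product", "platform", "region", "date") %in% colnames(usage)),
all(c("product", "platform", "launch_date") %in% colnames(releases))
)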
In this step, we performed essential structural validation across the three datasets before applying any transformations. We checked for missing values in critical columns such as identifiers and timestamps, scanned for duplicated records and IDs, and inspected the data types to ensure consistency. This process helps detect early warning signs that could cause silent failures during joins or aggregations. Ensuring structural integrity at this stage builds a reliable foundation for accurate and efficient analysis.
library(dplyr)
# 1. Check for missing values in key columns
colSums(is.na(tickets[c("ticket_id", "created_at", "product")]))
colSums(is.na(usage[c("date", "product", "active_users")]))
colSums(is.na(releases[c("release_id", "launch_date", "product")]))
# 2. Check for duplicates
sum(duplicated(tickets)) # full row duplicates
sum(duplicated(tickets$ticket_id)) # duplicate ticket IDs
sum(duplicated(usage)) # full row duplicates
sum(duplicated(releases)) # full row duplicates
sum(duplicated(releases$release_id)) # duplicate release IDs
# 3. Glimpse of data types for manual validation
glimpse(tickets)
glimpse(usage)
glimpse(releases)
Data Structure Summary
Tickets Dataset (311 rows, 7 columns): Contains
individual support tickets with a unique ticket_id, a
created_at report date, and details such as product,
platform, region, ticket_type,
and severity. The data types are appropriate: IDs as
numeric, dates parsed correctly, and descriptive fields as character
strings.
Usage Metrics Dataset (120 rows, 6 columns):
Daily usage logs showing number of active_users and
error_rate per product, platform,
and region. All values are parsed correctly, and the
date column is in proper date format.
Releases Dataset (120 rows, 6 columns): Product
version releases with release_id, product,
version, platform, launch_date, and a
descriptive cohort. Dates are parsed well, and all fields
appear clean and consistent.
➡️ Conclusion: All three datasets show clean structure, proper parsing of key fields (especially dates), and no missing values or duplicates in the columns we checked. This gives us confidence to proceed with merging and deeper analysis in the next steps.
Before merging, we validated alignment across the datasets using
shared dimensions: product, platform, and
region. We checked if all combinations present in one
dataset also exist in the others and ensured ticket dates fall within
the usage and release date ranges. No mismatches or gaps were found,
indicating strong consistency and readiness for safe joining in the next
step.
library(dplyr)
library(knitr)
library(kableExtra)
library(readr)
# Load data
tickets <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv", show_col_types = FALSE)
usage <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/usage_metrics_large.csv", show_col_types = FALSE)
releases <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv", show_col_types = FALSE)
# Distinct key combinations
ticket_keys <- tickets %>% distinct(product, platform, region)
usage_keys <- usage %>% distinct(product, platform, region)
release_keys <- releases %>% distinct(product, platform)
# Anti-joins to find mismatches
ticket_missing_in_usage <- anti_join(ticket_keys, usage_keys, by = c("product", "platform", "region"))
usage_missing_in_tickets <- anti_join(usage_keys, ticket_keys, by = c("product", "platform", "region"))
# Print mismatches if any
if (nrow(ticket_missing_in_usage) > 0) {
kable(ticket_missing_in_usage, caption = "Combinations in Tickets Missing in Usage") %>%
kable_styling(full_width = FALSE, position = "center")
} else {
cat("✅ All ticket combinations are present in usage metrics.\n\n")
}
## ✅ All ticket combinations are present in usage metrics.
if (nrow(usage_missing_in_tickets) > 0) {
kable(usage_missing_in_tickets, caption = "Combinations in Usage Missing in Tickets") %>%
kable_styling(full_width = FALSE, position = "center")
} else {
cat("✅ All usage combinations are covered in tickets dataset.\n\n")
}
## ✅ All usage combinations are covered in tickets dataset.
# Date range checks
ticket_range <- range(tickets$created_at)
usage_range <- range(usage$date)
release_range <- range(releases$launch_date)
cat("📅 Ticket Date Range: ", ticket_range[1], " to ", ticket_range[2], "\n")
## 📅 Ticket Date Range: 20179 to 20238
cat("📅 Usage Metrics Date Range: ", usage_range[1], " to ", usage_range[2], "\n")
## 📅 Usage Metrics Date Range: 20179 to 20238
cat("📅 Release Launch Date Range: ", release_range[1], " to ", release_range[2], "\n")
## 📅 Release Launch Date Range: 20034 to 20268
Step 3 Summary:
We performed checks to validate the alignment of the key categorical
fields (product, platform,
region) across the three datasets. Using anti-joins, we
confirmed that all combinations matched, and no anomalies were found.
Additionally, we verified the date ranges: the tickets and usage
metrics cover the same window (2025-04-01 to 2025-05-30), and the
release window (2024-11-07 to 2025-06-29) fully spans the ticket
period.
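The same containment check can be expressed in code, using the objects already loaded above:
# Confirm every ticket date falls inside the usage-metrics window
stopifnot(
min(tickets$created_at) >= min(usage$date),
max(tickets$created_at) <= max(usage$date)
)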
In this phase, we carefully merged the three datasets using
consistent keys: product, platform,
region, and created_at (aligned with
date). We ensured a left join strategy to
preserve the integrity of the ticket records. Final checks confirmed
that no NA values were introduced by either join and that every
ticket record was preserved.
# Join tickets with usage metrics (left join to keep all tickets)
tickets_usage <- tickets %>%
left_join(usage, by = c("product", "platform", "region", "created_at" = "date"))
# Check for any NAs introduced
missing_usage_rows <- tickets_usage %>% filter(is.na(active_users) | is.na(error_rate))
# Join with releases data (left join to retain full ticket context)
tickets_full <- tickets_usage %>%
left_join(releases, by = c("product", "platform"))
# Check for NAs after final join
missing_release_rows <- tickets_full %>% filter(is.na(version) | is.na(launch_date))
# Print status
if (nrow(missing_usage_rows) == 0) {
cat("✅ No NAs introduced from usage metrics join.\n\n")
} else {
cat("⚠️ NAs found after joining with usage metrics. Please inspect.\n")
}
## ✅ No NAs introduced from usage metrics join.
if (nrow(missing_release_rows) == 0) {
cat("✅ No NAs introduced from releases join.\n\n")
} else {
cat("⚠️ NAs found after joining with releases. Please inspect.\n")
}
## ✅ No NAs introduced from releases join.
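One additional guard worth noting, as a sketch rather than part of the original pipeline: because the releases join uses only product and platform, a pair with several releases would silently duplicate ticket rows, so comparing row counts before and after each join catches unintended expansion.
# Row-count guard: joins should not expand the ticket records
stopifnot(nrow(tickets_usage) == nrow(tickets)) # usage join is expected to be 1:1
if (nrow(tickets_full) > nrow(tickets_usage)) {
cat("⚠️ Releases join expanded rows by", nrow(tickets_full) - nrow(tickets_usage), "\n")
}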
Step 4 Summary:
We now have a unified dataset, ready for analysis.
In this phase, we focus on uncovering the patterns behind the growing ticket backlog. We begin by examining daily ticket trends to confirm whether the backlog is increasing over time. Then, we break down the volume of tickets by product and by region to determine which specific systems and geographical areas are most affected. These analyses will help isolate the root contributors to the congestion and provide evidence to support operational decisions. This phase forms the backbone of our insight-generation process, grounding conclusions in clean, visualized data.
library(dplyr)
library(ggplot2)
# Load cleaned tickets dataset
tickets <- read.csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv")
# Ensure proper date format
tickets$created_at <- as.Date(tickets$created_at)
# 1. Daily ticket volume
daily_tickets <- tickets %>%
count(created_at)
ggplot(daily_tickets, aes(x = created_at, y = n)) +
geom_line(color = "steelblue") +
labs(title = "Daily Ticket Volume", x = "Date", y = "Number of Tickets") +
theme_minimal()
Chart 1: Daily Ticket Volume The trend shows a steady increase in daily ticket submissions, especially noticeable after early April 2025. This sustained rise in incoming tickets is what drives the growing backlog.
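To put a number on that upward trend, a simple linear fit over the daily counts estimates the average growth in tickets per day. This is a sketch; the slope value depends on the actual export:
# Fit a linear trend: the slope estimates tickets added per day
trend_fit <- lm(n ~ created_at, data = daily_tickets)
coef(trend_fit)[["created_at"]] # positive slope confirms rising daily volume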
# 2. Ticket count by product
product_tickets <- tickets %>%
count(product, sort = TRUE)
ggplot(product_tickets, aes(x = reorder(product, n), y = n, fill = product)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Ticket Count by Product", x = "Product", y = "Tickets") +
theme_minimal()
Chart 2: Ticket Count by Product Product Y stands out as the major source of tickets, contributing the highest volume among all products. This reinforces our hypothesis that Product Y is a key driver of the backlog.
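A quick share calculation makes that concentration explicit, built on the counts computed above:
# Express each product's ticket count as a share of the total
product_tickets %>%
mutate(share_pct = round(100 * n / sum(n), 1))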
# 3. Ticket count by region
region_tickets <- tickets %>%
count(region, sort = TRUE)
ggplot(region_tickets, aes(x = reorder(region, n), y = n, fill = region)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Ticket Count by Region", x = "Region", y = "Tickets") +
theme_minimal()
Chart 3: Ticket Count by Region The US region is more heavily impacted compared to others, indicating that the backlog is more severe in this area. This may reflect regional user volume or support limitations.
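To test whether the US spike simply tracks user volume, ticket counts can be normalized by average active users from the usage dataset. A sketch, assuming usage is still loaded from the earlier chunks:
# Tickets per 1,000 average active users, by region
region_norm <- tickets %>%
count(region, name = "tickets") %>%
left_join(usage %>% group_by(region) %>% summarise(avg_users = mean(active_users)),
by = "region") %>%
mutate(tickets_per_1k_users = round(1000 * tickets / avg_users, 2))
region_norm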
library(dplyr)
library(ggplot2)
# Filter for Product Y only
product_y_data <- tickets %>%
filter(product == "Y")
# Count tickets by platform for Product Y
platform_issues_y <- product_y_data %>%
group_by(platform) %>%
summarise(ticket_count = n()) %>%
arrange(desc(ticket_count))
# Plot
ggplot(platform_issues_y, aes(x = reorder(platform, -ticket_count), y = ticket_count, fill = platform)) +
geom_col(show.legend = FALSE) +
labs(
title = "Ticket Count by Platform for Product Y",
x = "Platform",
y = "Number of Tickets"
) +
theme_minimal()
Chart 4: Ticket Count by Platform for Product Y:
Android is the most affected platform for Product Y, registering the
highest ticket count of any platform.
Why Proceed to Further Analysis? After the initial four charts, our investigation clearly pinpointed Product Y as the major contributor to the growing ticket backlog. Even more specifically, Android is the most affected platform for this product, accounting for the largest share of its tickets. This insight is critical, but it doesn't yet explain why these issues are occurring.
To deepen our understanding, we take advantage of a key variable: launch_date from the releases dataset. We now shift our focus to correlating ticket spikes with deployment activity. The hypothesis is straightforward:
“Are recent product releases — particularly for Product Y on Android — triggering the spike in tickets?”
By aligning the ticket timeline against the release schedule, we aim to:
Detect any temporal relationships between deployments and bug reports.
Identify specific release versions that may have introduced regressions or performance issues.
Inform development and QA teams whether release timing matches issue surges.
This deeper phase of analysis helps transform the investigation from “who and what is affected?” to “what likely caused it, and when?”
library(dplyr)
library(ggplot2)
library(readr)
# Load the datasets
tickets <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv", show_col_types = FALSE)
releases <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv", show_col_types = FALSE)
# Filter tickets for Product Y on Android
tickets_y_android <- tickets %>%
filter(product == "Y", platform == "Android") %>%
mutate(created_at = as.Date(created_at))
# Filter relevant releases for Product Y on Android
releases_y_android <- releases %>%
filter(product == "Y", platform == "Android") %>%
mutate(launch_date = as.Date(launch_date))
# Daily ticket volume for Product Y on Android
daily_y_android <- tickets_y_android %>%
count(created_at)
# Plot ticket volume with release dates overlaid
ggplot(daily_y_android, aes(x = created_at, y = n)) +
geom_line(color = "firebrick", linewidth = 1) +
geom_vline(data = releases_y_android, aes(xintercept = launch_date),
linetype = "dashed", color = "steelblue") +
labs(
title = "Ticket Volume for Product Y on Android vs. Release Dates",
x = "Date",
y = "Number of Tickets"
) +
theme_minimal()
Chart 5: Ticket Volume for Product Y on Android vs. Release Dates This chart overlays daily ticket counts for Product Y on Android with vertical dashed lines marking release launch dates. The goal is to determine whether the influx of tickets aligns closely with new deployments.
Findings
There is a visible uptick in ticket submissions shortly after several key release dates, especially in mid- to late-April.
This pattern suggests a temporal correlation between releases and system instability or bugs, reinforcing the hypothesis that recent updates might have introduced issues (see the sketch below).
The development and QA teams should prioritize post-release regression testing, particularly for Android releases of Product Y.
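To move beyond visual inspection, the same temporal correlation can be quantified by counting tickets in a fixed window after each release. This is a sketch with a hypothetical 7-day window, reusing the objects from the chunk above:
# Count Product Y / Android tickets within 7 days after each release
post_release_load <- releases_y_android %>%
rowwise() %>%
mutate(tickets_within_7d = sum(tickets_y_android$created_at >= launch_date &
tickets_y_android$created_at < launch_date + 7)) %>%
ungroup() %>%
select(version, launch_date, tickets_within_7d) %>%
arrange(desc(tickets_within_7d))
post_release_load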
In this final phase, actionable recommendations were developed based on the insights uncovered throughout the analysis. These actions aim to resolve the root causes of the growing ticket backlog and enhance product stability and user satisfaction.
By executing these targeted actions, the organization can better manage ticket volume, reduce user friction, and drive long-term improvements in product quality and operational agility.
Imagine a playground with three big slides — the products. At first, everything runs smoothly. When a slide gets bumpy or someone gets stuck, a few helpers (support staff) fix things fast.
But soon, more and more kids (users) arrive every day. The playground is packed. Scrapes and stuck slides (bugs and crashes) become common. Meanwhile, the playground team adds cool new features (releases), but some weren’t tested thoroughly — and they introduce new problems.
The helpers try their best, but they’re overwhelmed. Too many issues, too fast, and not enough hands to help. That’s how a backlog grows — not from neglect, but from unchecked volume and strain.
The data confirms this: rising user activity, increased ticket volume, and new releases aligning with spikes in support demand — all pointing to the need for smarter scaling and preparedness.