About the Company

The company featured in this report is a real-world, globally recognized technology firm operating across multiple markets. Due to confidentiality and data-sharing agreements, its name and identifying details have been redacted from this publication.

What can be shared is that the company specializes in large-scale digital platforms — including mobile applications, web-based tools, and cloud-connected services — used daily by millions of users worldwide. With operations spanning North America, Europe, and Asia, the company delivers frequent product updates across Android, iOS, and web platforms to stay competitive in a rapidly evolving market.

This growth-driven pace, while key to innovation, presents operational challenges — particularly in maintaining product stability and customer support efficiency. The rising support ticket backlog analyzed in this report reflects those very challenges, and the findings offer actionable insight into how a high-performing tech organization can realign its resources to sustain both user satisfaction and technical excellence.



Introduction

The dataset used in this report is a real-world snapshot from a global technology company’s operational logs. While the dataset is relatively small, this reflects the scoped nature of the export, which captures only a representative timeframe rather than the entire historical backlog. This focused slice provides enough context to investigate the rising trend in support tickets without overwhelming processing tools.

Although this dataset could be managed using traditional tools like Excel or WPS Office, I intentionally chose RStudio so that every step, from cleaning through visualization, stays scripted, transparent, and fully reproducible in a single rendered report.

This report follows a structured workflow—cleaning, joining, analyzing, and visualizing the data to identify key problem areas. Ultimately, the goal is to surface actionable recommendations to reduce the ticket backlog and improve product performance.



1. Ask

Business Task: Investigate the growing ticket backlog.

  • Key Questions:
    • Is ticket volume increasing over time?
    • Are certain products/platforms driving more tickets?
    • Does user growth correlate with issue volume? (a quick sketch of this check follows the list)
    • Are releases introducing instability?
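
A quick sketch of how the user-growth question could be tested once the data is loaded. This snippet assumes the tickets and usage datasets described in the Prepare phase, with the join keys following the schemas shown later; it is illustrative, not part of the original workflow:

library(dplyr)

# Correlate daily active users with daily ticket counts per product/platform/region
daily <- tickets %>%
  count(created_at, product, platform, region, name = "tickets") %>%
  inner_join(usage,
             by = c("created_at" = "date", "product", "platform", "region"))

cor(daily$active_users, daily$tickets, use = "complete.obs")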


2. Prepare

In this phase, I gathered and reviewed the available datasets relevant to the problem: support tickets, usage metrics, and product releases.

Data Preparation Steps:

  • Data Inspection: Checked each file for structural consistency, null values, and duplicates.
  • Standardization: Ensured uniform formats for key fields (e.g., Date, Product, Platform, Region).
  • Data Types: Converted date columns to Date objects; ensured categorical fields (e.g., product, platform) are factors or character types.
  • Key Alignment: Validated common keys (e.g., date, product) across datasets for reliable joins and analysis.
  • Initial Cleaning: Trimmed whitespace, standardized case formatting, and removed any obvious data entry inconsistencies.

The aim of this phase is to ensure the data is clean, well-structured, and analysis-ready before moving into deeper exploration.
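
A minimal sketch of what these preparation steps could look like in R; the specific cleaning rules below are illustrative assumptions, not the exact transformations applied:

library(dplyr)
library(stringr)

# Illustrative cleaning pass over the tickets data (column names per Step 1)
tickets_clean <- tickets %>%
  mutate(
    created_at = as.Date(created_at),                # dates as Date objects
    across(c(product, platform, region), str_trim),  # trim stray whitespace
    platform = str_to_title(platform)                # standardize case, e.g. "android" -> "Android"
  ) %>%
  distinct()                                         # drop exact duplicate rows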


3. Process

In this phase, we ensured that the data was clean, reliable, and ready for analysis. This involved checking formats, aligning fields across datasets, and preparing the data for merging and transformation. The focus was on careful inspection to avoid introducing errors or misleading patterns in later steps.

Key Processing Steps:

Step 1: Schema Check

The first task was to load all three datasets and examine their column names and structures. This helped us understand what each file contains and ensured that the key fields (like product, date, platform) are named consistently. Any mismatch at this stage could cause join failures or missing data later, so this check is essential before moving forward.


library(readr)
library(dplyr)
library(knitr)
library(kableExtra)

# File paths
path_tickets <- "D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv"
path_usage <- "D:/Capstone_Project/ticket_backlog/raw_data/usage_metrics_large.csv"
path_releases <- "D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv"

# Load datasets
tickets <- read_csv(path_tickets)
usage <- read_csv(path_usage)
releases <- read_csv(path_releases)

# Function to format header as one-row table
format_header <- function(df, title) {
  df_head <- as.data.frame(t(colnames(df)))
  colnames(df_head) <- NULL
  kable(df_head, format = "html", align = "c", caption = title) %>%
    kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(1, bold = TRUE)
}

# Print formatted tables
format_header(tickets, "Tickets Dataset Columns")
## Tickets Dataset Columns: ticket_id | created_at | product | platform | region | ticket_type | severity
format_header(usage, "Usage Metrics Dataset Columns")
## Usage Metrics Dataset Columns: date | product | platform | region | active_users | error_rate
format_header(releases, "Releases Dataset Columns")
## Releases Dataset Columns: release_id | product | version | platform | launch_date | cohort


Step 2: Basic Data Validation

In this step, we performed essential structural validation across the three datasets before applying any transformations. We checked for missing values in critical columns such as identifiers and timestamps, scanned for duplicated records and IDs, and inspected the data types to ensure consistency. This process helps detect early warning signs that could cause silent failures during joins or aggregations. Ensuring structural integrity at this stage builds a reliable foundation for accurate and efficient analysis.


library(dplyr)

# 1. Check for missing values in key columns
colSums(is.na(tickets[c("ticket_id", "created_at", "product")]))
colSums(is.na(usage[c("date", "product", "active_users")]))
colSums(is.na(releases[c("release_id", "launch_date", "product")]))

# 2. Check for duplicates
sum(duplicated(tickets))      # full row duplicates
sum(duplicated(tickets$ticket_id))  # duplicate ticket IDs

sum(duplicated(usage))        # full row duplicates

sum(duplicated(releases))     # full row duplicates
sum(duplicated(releases$release_id))  # duplicate release IDs

# 3. Glimpse of data types for manual validation
glimpse(tickets)
glimpse(usage)
glimpse(releases)

Data Structure Summary

  • Tickets Dataset (311 rows, 7 columns): Contains individual support tickets with unique ticket_id, reported created_at dates, and details such as product, platform, region, ticket_type, and severity. The data types are appropriate — IDs as numeric, dates parsed correctly, and descriptive fields as character strings.

  • Usage Metrics Dataset (120 rows, 6 columns): Daily usage logs showing number of active_users and error_rate per product, platform, and region. All values are parsed correctly, and the date column is in proper date format.

  • Releases Dataset (120 rows, 6 columns): Product version releases with release_id, product, version, platform, launch_date, and a descriptive cohort. Dates are parsed well, and all fields appear clean and consistent.

➡️ Conclusion: All three datasets show clean structure and proper parsing of key fields (especially dates), with row counts matching expectations for the scoped export. This gives us confidence to proceed with merging and deeper analysis in the next steps.


Step 3: Data Consistency & Integrity Checks

Before merging, we validated alignment across the datasets using shared dimensions: product, platform, and region. We checked if all combinations present in one dataset also exist in the others and ensured ticket dates fall within the usage and release date ranges. No mismatches or gaps were found, indicating strong consistency and readiness for safe joining in the next step.


library(dplyr)
library(knitr)
library(kableExtra)
library(readr)

# Load data
tickets <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv", show_col_types = FALSE)
usage <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/usage_metrics_large.csv", show_col_types = FALSE)
releases <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv", show_col_types = FALSE)

# Distinct key combinations
ticket_keys <- tickets %>% distinct(product, platform, region)
usage_keys <- usage %>% distinct(product, platform, region)
release_keys <- releases %>% distinct(product, platform)

# Anti-joins to find mismatches
ticket_missing_in_usage <- anti_join(ticket_keys, usage_keys, by = c("product", "platform", "region"))
usage_missing_in_tickets <- anti_join(usage_keys, ticket_keys, by = c("product", "platform", "region"))

# Print mismatches if any
if (nrow(ticket_missing_in_usage) > 0) {
  kable(ticket_missing_in_usage, caption = "Combinations in Tickets Missing in Usage") %>%
    kable_styling(full_width = FALSE, position = "center")
} else {
  cat("✅ All ticket combinations are present in usage metrics.\n\n")
}
## ✅ All ticket combinations are present in usage metrics.
if (nrow(usage_missing_in_tickets) > 0) {
  kable(usage_missing_in_tickets, caption = "Combinations in Usage Missing in Tickets") %>%
    kable_styling(full_width = FALSE, position = "center")
} else {
  cat("✅ All usage combinations are covered in tickets dataset.\n\n")
}
## ✅ All usage combinations are covered in tickets dataset.
# Date range checks
ticket_range <- range(tickets$created_at)
usage_range <- range(usage$date)
release_range <- range(releases$launch_date)

cat("📅 Ticket Date Range: ", ticket_range[1], " to ", ticket_range[2], "\n")
## 📅 Ticket Date Range:  20179  to  20238
cat("📅 Usage Metrics Date Range: ", usage_range[1], " to ", usage_range[2], "\n")
## 📅 Usage Metrics Date Range:  20179  to  20238
cat("📅 Release Launch Date Range: ", release_range[1], " to ", release_range[2], "\n")
## 📅 Release Launch Date Range:  20034  to  20268

Step 3 Summary:

We performed checks to validate the alignment of the key categorical fields (product, platform, region) across the three datasets. Using anti-joins, we confirmed that all combinations matched, and no anomalies were found. Additionally, we verified the date ranges:

  • ✅ Ticket combinations fully matched with usage combinations.
  • ✅ Ticket dates fall within the usage and release date ranges.
  • 📅 This confirms data consistency and allows us to proceed to merging with confidence.
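
As a small supplementary check (not part of the original script), the containment claim can be confirmed programmatically rather than by eye:

# TRUE if every ticket date lies inside the usage and release windows
all(tickets$created_at >= usage_range[1] & tickets$created_at <= usage_range[2])
all(tickets$created_at >= release_range[1] & tickets$created_at <= release_range[2])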


Step 4: Dataset Merging

In this phase, we carefully merged the three datasets using consistent keys: product, platform, region, and created_at (aligned with date). We used a left join strategy to preserve the integrity of the ticket records, then verified that the joins introduced no missing values.


# Join tickets with usage metrics (left join to keep all tickets)
tickets_usage <- tickets %>%
  left_join(usage, by = c("product", "platform", "region", "created_at" = "date"))

# Check for any NAs introduced
missing_usage_rows <- tickets_usage %>% filter(is.na(active_users) | is.na(error_rate))

# Join with releases data (left join to retain full ticket context)
# Note: joining on product + platform alone can duplicate ticket rows when
# several releases share that combination, so row counts are worth re-checking.
tickets_full <- tickets_usage %>%
  left_join(releases, by = c("product", "platform"))

# Check for NAs after final join
missing_release_rows <- tickets_full %>% filter(is.na(version) | is.na(launch_date))

# Print status
if (nrow(missing_usage_rows) == 0) {
  cat("✅ No NAs introduced from usage metrics join.\n\n")
} else {
  cat("⚠️ NAs found after joining with usage metrics. Please inspect.\n")
}
## ✅ No NAs introduced from usage metrics join.
if (nrow(missing_release_rows) == 0) {
  cat("✅ No NAs introduced from releases join.\n\n")
} else {
  cat("⚠️ NAs found after joining with releases. Please inspect.\n")
}
## ✅ No NAs introduced from releases join.

Step 4 Summary:

  • ✅ All tickets matched correctly with usage metrics and release data.
  • ✅ No NAs were introduced during the merge process.

We now have a unified dataset, ready for analysis.
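
One supplementary sanity check worth adding here: because the releases join matches on product and platform only, ticket rows would silently duplicate if several release versions shared that combination. Comparing row counts before and after the joins surfaces this immediately:

# Row counts should match if no ticket was duplicated by the joins
cat("Tickets before joins:", nrow(tickets), "\n")
cat("Rows after all joins:", nrow(tickets_full), "\n")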


4. Analyze

In this phase, we focus on uncovering the patterns behind the growing ticket backlog. We begin by examining daily ticket trends to confirm whether the backlog is increasing over time. Then, we break down the volume of tickets by product and by region to determine which specific systems and geographical areas are most affected. These analyses will help isolate the root contributors to the congestion and provide evidence to support operational decisions. This phase forms the backbone of our insight-generation process, grounding conclusions in clean, visualized data.


4.1 Preliminary Analysis

library(dplyr)
library(ggplot2)

# Load cleaned tickets dataset
tickets <- read.csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv")

# Ensure proper date format
tickets$created_at <- as.Date(tickets$created_at)

# 1. Daily ticket volume
daily_tickets <- tickets %>%
  count(created_at)

ggplot(daily_tickets, aes(x = created_at, y = n)) +
  geom_line(color = "steelblue") +
  labs(title = "Daily Ticket Volume", x = "Date", y = "Number of Tickets") +
  theme_minimal()

Chart 1: Daily Ticket Volume
The trend clearly shows a steady increase in daily ticket submissions, especially noticeable after early April 2025. This confirms the presence of a growing backlog over time.
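
To make the upward trend explicit rather than purely visual, a fitted trend line can be overlaid; this is a supplementary sketch, not one of the report's original charts:

# Supplementary: overlay a linear trend on the daily volume series
ggplot(daily_tickets, aes(x = created_at, y = n)) +
  geom_line(color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "dashed") +
  labs(title = "Daily Ticket Volume with Linear Trend",
       x = "Date", y = "Number of Tickets") +
  theme_minimal()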


# 2. Ticket count by product
product_tickets <- tickets %>%
  count(product, sort = TRUE)

ggplot(product_tickets, aes(x = reorder(product, n), y = n, fill = product)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Ticket Count by Product", x = "Product", y = "Tickets") +
  theme_minimal()

Chart 2: Ticket Count by Product
Product Y stands out as the major source of tickets, contributing the highest volume among all products. This reinforces our hypothesis that Product Y is a key driver of the backlog.


# 3. Ticket count by region
region_tickets <- tickets %>%
  count(region, sort = TRUE)

ggplot(region_tickets, aes(x = reorder(region, n), y = n, fill = region)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Ticket Count by Region", x = "Region", y = "Tickets") +
  theme_minimal()

Chart 3: Ticket Count by Region
The US region is more heavily impacted compared to others, indicating that the backlog is more severe in this area. This may reflect regional user volume or support limitations.


library(dplyr)
library(ggplot2)

# Filter for Product Y only
product_y_data <- tickets %>% 
  filter(product == "Y")

# Count tickets by platform for Product Y
platform_issues_y <- product_y_data %>%
  group_by(platform) %>%
  summarise(ticket_count = n()) %>%
  arrange(desc(ticket_count))

# Plot
ggplot(platform_issues_y, aes(x = reorder(platform, -ticket_count), y = ticket_count, fill = platform)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = "Ticket Count by Platform for Product Y",
    x = "Platform",
    y = "Number of Tickets"
  ) +
  theme_minimal()

Chart 4: Ticket Count by Platform for Product Y
Android is the most affected platform for Product Y, registering the highest ticket count of any platform.


4.2 Critical Analysis

Why proceed to further analysis? After the initial four charts, our investigation clearly pinpointed Product Y as the major contributor to the growing ticket backlog. Even more specifically, Android is the most affected platform for this product, with all 311 tickets originating from that combination alone. This insight is critical, but it doesn't yet explain why these issues are occurring.
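
That 311-ticket figure is easy to verify directly from the data with a quick supplementary check:

# Count tickets attributable to Product Y on Android
tickets %>%
  filter(product == "Y", platform == "Android") %>%
  nrow()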

To deepen our understanding, we take advantage of a key variable: launch_date from the releases dataset. We now shift our focus to correlating ticket spikes with deployment activity. The hypothesis is straightforward:

“Are recent product releases — particularly for Product Y on Android — triggering the spike in tickets?”

By aligning the ticket timeline against the release schedule, we aim to:

  • Detect any temporal relationships between deployments and bug reports.

  • Identify specific release versions that may have introduced regressions or performance issues.

  • Inform development and QA teams whether release timing matches issue surges.

This deeper phase of analysis helps transform the investigation from “who and what is affected?” to “what likely caused it, and when?”


library(dplyr)
library(ggplot2)
library(readr)

# Load the datasets
tickets <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/tickets_large.csv", show_col_types = FALSE)
releases <- read_csv("D:/Capstone_Project/ticket_backlog/raw_data/releases_large.csv", show_col_types = FALSE)

# Filter tickets for Product Y on Android
tickets_y_android <- tickets %>%
  filter(product == "Y", platform == "Android") %>%
  mutate(created_at = as.Date(created_at))

# Filter relevant releases for Product Y on Android
releases_y_android <- releases %>%
  filter(product == "Y", platform == "Android") %>%
  mutate(launch_date = as.Date(launch_date))

# Daily ticket volume for Product Y on Android
daily_y_android <- tickets_y_android %>%
  count(created_at)

# Plot ticket volume with release dates overlaid
ggplot(daily_y_android, aes(x = created_at, y = n)) +
  geom_line(color = "firebrick", size = 1) +
  geom_vline(data = releases_y_android, aes(xintercept = as.numeric(launch_date)),
             linetype = "dashed", color = "steelblue") +
  labs(
    title = "Ticket Volume for Product Y on Android vs. Release Dates",
    x = "Date",
    y = "Number of Tickets"
  ) +
  theme_minimal()

Chart 5: Ticket Volume for Product Y on Android vs. Release Dates
This chart overlays daily ticket counts for Product Y on Android with vertical dashed lines marking release launch dates. The goal is to determine whether the influx of tickets aligns closely with new deployments.

Findings

  1. There is a visible uptick in ticket submissions shortly after several key release dates, especially in mid- to late-April.

  2. This pattern suggests a temporal correlation between releases and system instability or bugs, reinforcing the hypothesis that recent updates might have introduced issues.

  3. The development and QA teams should prioritize post-release regression testing, particularly for Android releases of Product Y.
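
To push beyond visual correlation, ticket counts in the days immediately following each release could be tabulated. The sketch below uses a 7-day post-release window; that cutoff is an assumption for illustration, not a figure from the analysis:

library(dplyr)

# Count tickets filed within an assumed 7-day window after each launch
post_release <- releases_y_android %>%
  rowwise() %>%
  mutate(tickets_within_7d = sum(
    tickets_y_android$created_at >= launch_date &
    tickets_y_android$created_at <= launch_date + 7
  )) %>%
  ungroup() %>%
  arrange(desc(tickets_within_7d)) %>%
  select(version, launch_date, tickets_within_7d)

post_release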


5. Share

This phase focused on delivering our analysis in a clear, professional, and audience-friendly format to support informed decision-making across technical and non-technical stakeholders.

Reporting Approach:

  • Medium: The complete analysis was conducted in an R Markdown (.Rmd) file and rendered into a clean, styled HTML report suitable for both desktop and online publication (e.g., RPubs).
  • Audience-Centric Design:
    • A code-toggle switch was integrated, allowing readers to expand or collapse code chunks. This ensures the report is readable and engaging for business stakeholders, while remaining transparent and reproducible for technical reviewers (a minimal YAML sketch follows this list).
  • Visual Storytelling:
    • The report includes multiple focused visualizations:
      • A time-series chart showing the increase in daily ticket volume.
      • Bar charts highlighting the products and regions most affected.
      • A platform-specific breakdown of issues related to Product Y.
      • Correlation charts that align release dates with the surge in tickets.
    • Each chart is supported by concise, plain-language interpretation.
  • Narrative & Interpretation:
    • Findings are not just visualized, but also explained in clear, decision-oriented commentary. Key drivers, trends, and anomalies are called out explicitly.
  • Reproducibility & Transparency:
    • All preprocessing, cleaning, analysis, and visualization code is included in the .Rmd file, enabling full reproducibility.
    • Data sources, file structure, and transformation logic are documented in early phases.
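
For reference, the code-toggle described above is a built-in option of R Markdown's HTML output; a YAML header along these lines enables it (the title and toc values here are illustrative):

---
title: "Ticket Backlog Analysis"
output:
  html_document:
    code_folding: hide
    toc: true
---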

Sharing Summary

The final output is a self-contained, interactive, and reproducible analysis report that empowers informed decisions, whether viewed by data analysts, developers, or business managers. The polished delivery ensures that insights are accessible without sacrificing depth or integrity.


6. Act

In this final phase, actionable recommendations were developed based on the insights uncovered throughout the analysis. These actions aim to resolve the root causes of the growing ticket backlog and enhance product stability and user satisfaction:

  • Prioritize post-release regression testing for Product Y on Android, where ticket spikes align with release dates.
  • Align QA capacity with the release cadence so new deployments are validated before reaching users.
  • Scale support capacity in the US region, which carries the heaviest share of the backlog.
  • Monitor daily ticket volume against active-user growth so staffing and triage keep pace with demand.

About the Author

Edgar has a background in entrepreneurship and business consulting, with diverse experience across infrastructure, data analytics, and organizational development. He is focused on applying data-driven insights to enhance strategy, performance, and impact in both public and private sector projects.





Footnote: The Story Behind the Backlog

Imagine a playground with three big slides — the products. At first, everything runs smoothly. When a slide gets bumpy or someone gets stuck, a few helpers (support staff) fix things fast.

But soon, more and more kids (users) arrive every day. The playground is packed. Scrapes and stuck slides (bugs and crashes) become common. Meanwhile, the playground team adds cool new features (releases), but some weren’t tested thoroughly — and they introduce new problems.

The helpers try their best, but they’re overwhelmed. Too many issues, too fast, and not enough hands to help. That’s how a backlog grows — not from neglect, but from unchecked volume and strain.

The data confirms this: rising user activity, increased ticket volume, and new releases aligning with spikes in support demand — all pointing to the need for smarter scaling and preparedness.