Optimising Telecom Network Performance Through Data Analytics

Author

Affiliation

Babatunde Majekodunmi

Lagos Business School (EMBA 31)

Published

May 19, 2026

1. Executive Summary

This study presents a data-driven analysis of LTE network performance aimed at identifying the operational factors responsible for customer experience degradation during busy-hour conditions and supporting targeted capacity investment decisions.

Using real cell-level operational data collected from LTE network sites, the study examined the relationship between traffic demand, network utilisation, radio quality, and user throughput across the network.

The analysis revealed that network congestion is the primary driver of performance degradation. Statistical results showed a strong negative relationship between PRB utilisation and user throughput, confirming that service quality deteriorates significantly as network resources approach saturation.

Regression analysis further demonstrated that high utilisation and traffic demand reduce throughput, while improved radio quality partially mitigates performance decline. In addition, a Capacity Stress Framework identified that severe degradation occurs when multiple operational thresholds are breached simultaneously.

The findings indicate that congestion is concentrated within a limited number of high-stress cells, suggesting that network investment should be strategically targeted rather than uniformly distributed.

The study concludes that analytics-driven capacity planning can improve customer experience, optimise resource allocation, and enhance overall network operational efficiency.

2. Professional Disclosure

This analysis is grounded in my professional experience within the telecommunications sector, where I have held both technical and commercial leadership roles. I previously served as Group Head of GIS and Network Planning, and I currently operate as a Global Account Director, responsible for driving strategic engagement, customer value, and business growth across large-scale telecom networks.

In my former role as Group Head of GIS and Network Planning, I was directly responsible for network design, capacity planning, and spatial optimisation of telecommunications infrastructure. This role required deep technical engagement with radio network performance metrics, including capacity utilisation, traffic distribution, and coverage optimisation. Decisions were informed by geospatial analysis and KPI-driven insights, with a core focus on identifying congestion hotspots, forecasting demand, and ensuring efficient allocation of network resources. This experience provides a strong technical foundation for understanding the operational dynamics reflected in the dataset used in this study.

In my current role as Global Account Director, my focus has evolved toward the commercial and strategic dimensions of telecom operations. I work closely with senior stakeholders to translate network performance into business outcomes, including customer experience, revenue growth, and return on investment. This role requires the ability to interpret complex technical data and communicate insights in a clear, actionable manner that supports executive decision-making. As such, the ability to connect network KPIs with business performance is central to my day-to-day responsibilities.

My organisation operates in the telecommunications industry, offering fixed wireless broadband solutions amongst other services, where network performance is a critical driver of customer satisfaction and competitive advantage. Key performance indicators such as throughput, utilisation, and radio quality directly influence service delivery, making data-driven analysis essential for both operational efficiency and strategic planning.

The five analytical techniques selected for this study—Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression Modelling—are highly relevant to both my current and previous roles.

Exploratory Data Analysis is fundamental in both technical and commercial contexts, enabling the identification of patterns, anomalies, and performance variations across the network. In my network planning role, this was essential for diagnosing capacity issues, while in my current role, it supports high-level performance assessment and opportunity identification.

Data Visualisation is critical for translating complex, multi-dimensional datasets into intuitive insights. As a Global Account Director, I frequently rely on visual tools such as heatmaps and performance dashboards to communicate network performance to non-technical stakeholders and support strategic discussions.

Hypothesis Testing provides a structured framework for validating assumptions about network behaviour. In practice, this allows me to distinguish between perceived and statistically significant performance issues, ensuring that business decisions are grounded in evidence rather than anecdotal observation.

Correlation Analysis is used to uncover relationships between key drivers of performance, such as the interaction between capacity utilisation and user experience. This is directly relevant to both network optimisation and commercial prioritisation, as it helps identify the underlying causes of performance degradation.

Regression Modelling represents the most advanced analytical tool applied in this study, enabling the quantification of the impact of multiple variables on network performance. This is particularly valuable in my current role, where quantifying the drivers of customer experience supports data-driven investment decisions and strengthens the business case for network expansion and optimisation.

Collectively, these techniques enable a comprehensive analytical approach that bridges technical network insight and commercial decision-making. The combination of my prior experience in network planning and my current strategic role uniquely positions me to interpret the data both from an engineering perspective and from a business value perspective. This dual lens ensures that the analysis is not only technically rigorous but also directly applicable to real-world decision-making within the telecommunications industry.

3. Data Collection and Sampling

This study is based on primary operational data extracted from a live LTE network performance reporting system. The dataset consists of cell-level Key Performance Indicators (KPIs) collected during the busy hour period, representing peak traffic conditions when network resources are most constrained.

The data was obtained from internal network monitoring systems that track real-time performance metrics across multiple sites and sectors. Each observation corresponds to a specific cell within a site, capturing both capacity utilisation metrics (e.g., PRB usage, control channel utilisation) and user experience indicators (e.g., throughput). The dataset includes several hundred observations, thereby satisfying the minimum requirement for robust statistical analysis.

The sampling approach can be described as a complete operational extract rather than a statistical sample. That is, all available cells within the specified time window were included, ensuring that the dataset accurately reflects the true state of the network during peak usage conditions. This approach eliminates sampling bias and strengthens the reliability of the findings.

The chosen time period—busy hour—is particularly relevant from a business and operational perspective, as it represents the worst-case performance scenario. Network congestion and customer experience issues are most likely to occur during this period, making it the most critical timeframe for capacity analysis and investment decision-making.

To further enhance analytical robustness, a Capacity Stress Framework was developed, combining multiple KPIs to identify cells experiencing simultaneous pressure across capacity and performance dimensions. This ensures that the analysis captures systemic congestion rather than isolated anomalies.

From an ethical and data governance perspective, all data used in this study has been fully anonymised, with no personally identifiable information (PII) included. The analysis focuses on aggregated network performance metrics and is used solely for academic and analytical purposes.

Overall, the dataset provides a highly representative and business-relevant view of network performance, enabling rigorous analysis of capacity constraints and their impact on user experience.

3.1 Data Cleaning and Preparation

3.1.1 Overview

Before conducting any analysis, the dataset was cleaned and prepared to ensure accuracy, consistency, and suitability for statistical modelling. Real-world telecom data often contains invalid entries, missing values, and formatting inconsistencies that can distort results if not properly addressed.

In this case, this involved identifying and handling invalid entries such as “/0” and “NIL”, which were treated as missing values. Key KPI variables were then converted into numeric formats to enable statistical analysis. These steps ensured that the dataset was consistent, reliable, and suitable for subsequent analytical techniques.

3.1.2 Data Cleaning Steps

The following steps were applied:

Identification and removal of invalid values
Conversion of key KPI variables to numeric format
Handling of missing values
Creation of analysis-ready dataset

3.1.3 Implementation (Tidyverse)

Code

library(tidyverse)

# Step 1: Replace invalid entries with NA (character columns only)
data_clean <- data_raw %>%
  mutate(across(where(is.character), ~na_if(., "/0"))) %>%
  mutate(across(where(is.character), ~na_if(., "NIL")))

# Step 2: Convert key variables to numeric
data_clean <- data_clean %>%
  mutate(
    DL_PRB        = as.numeric(`DL PRB Usage Rate(%)`),
    UL_PRB        = as.numeric(`UL PRB Usage Rate(%)`),
    DL_Throughput = as.numeric(`User DL PDCP Average Throughput`),
    UL_Throughput = as.numeric(`User UL PDCP Average Throughput`),
    Traffic       = as.numeric(`Total Traffic (Gbit)`),
    Users         = as.numeric(`L.Traffic.User.Avg`),
    CQI           = as.numeric(`Average CQI`),
    PUCCH         = as.numeric(`PUCCHUsage Rate[%]`),
    PDCCH         = as.numeric(`PDCCH CCE Usage Rate[%]`)
  )

# Step 3: Remove rows with missing critical values
data_clean <- data_clean %>%
  drop_na(DL_PRB, DL_Throughput, Traffic, Users, CQI)

3.1.4 Before vs After Data Cleaning

To demonstrate the impact of data cleaning, a comparison was performed between the raw dataset and the cleaned dataset.

Code

# Count missing / invalid values BEFORE cleaning (flag invalid strings then count NAs)
before_missing <- data_raw %>%
  mutate(across(where(is.character), ~na_if(., "/0"))) %>%
  mutate(across(where(is.character), ~na_if(., "NIL"))) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Before")

# Count missing values AFTER cleaning
after_missing <- data_clean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_After")

# Combine into comparison table
comparison <- before_missing %>%
  inner_join(after_missing, by = "Variable")

# Plot comparison
ggplot(comparison, aes(x = reorder(Variable, Missing_Before))) +
  geom_bar(aes(y = Missing_Before, fill = "Before Cleaning"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = Missing_After, fill = "After Cleaning"), stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Missing Values: Before vs After Data Cleaning",
    x = "Variables",
    y = "Number of Missing Values",
    fill = "Stage"
  )

3.2 Traffic-Light Data Quality Dashboard

To provide a more intuitive and executive-friendly assessment of data completeness, a traffic-light system was implemented. Each KPI is classified based on its completeness level:

🟢 Green: Completeness ≥ 95% (High quality)
🟡 Amber: Completeness between 80% and 95% (Moderate quality)
🔴 Red: Completeness < 80% (Low quality)

Code

library(tidyverse)

# Calculate completeness (% non-missing)
data_quality <- data_clean %>%
  summarise(across(where(is.numeric), ~mean(!is.na(.)) * 100)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Completeness")

# Assign traffic-light categories
data_quality <- data_quality %>%
  mutate(
    Quality_Level = case_when(
      Completeness >= 95 ~ "High (Green)",
      Completeness >= 80 ~ "Medium (Amber)",
      TRUE ~ "Low (Red)"
    )
  )

# Plot traffic-light dashboard
ggplot(data_quality, aes(x = reorder(Variable, Completeness), y = Completeness, fill = Quality_Level)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(
    values = c(
      "High (Green)" = "green",
      "Medium (Amber)" = "gold",
      "Low (Red)" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Data Quality Traffic-Light Dashboard",
    x = "Variables",
    y = "Completeness (%)",
    fill = "Data Quality Level"
  )

3.2.1 Technical Interpretation

The traffic-light dashboard provides an intuitive summary of data quality across all variables. - Variables in green indicate high completeness and strong reliability - Variables in amber suggest moderate data quality and should be interpreted with some caution - Variables in red highlight potential data gaps that may impact analytical accuracy

3.2.2 Plain-Language Interpretation

This chart shows how reliable each piece of data is. Most variables fall into the green category, meaning the data is complete and trustworthy. Where variables appear in amber or red, it indicates some missing information, which may slightly affect the accuracy of the results.

3.2.3 Business Impact

Using a traffic-light system enables quick identification of potential data quality risks, allowing decision-makers to assess confidence levels in the analysis. This ensures that strategic decisions are supported by reliable and transparent data.

4. Data Description

This study utilises a structured dataset comprising LTE cell-level Key Performance Indicators (KPIs), capturing both network capacity utilisation and user experience metrics. The dataset provides a comprehensive view of network performance during peak usage conditions.

The variables included in the dataset can be broadly classified into three categories:

🔹 1. Capacity and Utilisation Metrics

These variables reflect the level of resource usage within the network:

DL PRB Usage Rate (%): Proportion of downlink radio resources utilised
UL PRB Usage Rate (%): Proportion of uplink radio resources utilised
PUCCH Usage Rate (%): Uplink control channel utilisation
PDCCH CCE Usage Rate (%): Downlink control channel utilisation

These indicators are critical for identifying congestion, as high utilisation suggests that network resources are nearing or exceeding capacity limits.

To provide a clear and intuitive overview of the key variables used in this analysis, a visual KPI reference chart was developed. This chart categorises each KPI based on its function and highlights its role in assessing network performance.

Code

library(tidyverse)

kpi_reference <- tibble(
  KPI = c(
    "DL PRB Usage Rate",
    "UL PRB Usage Rate",
    "PUCCH Usage Rate",
    "PDCCH CCE Usage Rate",
    "Total Traffic",
    "User Count",
    "DL Throughput",
    "UL Throughput",
    "CQI"
  ),
  Category = c(
    "Capacity",
    "Capacity",
    "Control Channel",
    "Control Channel",
    "Demand",
    "Demand",
    "Performance",
    "Performance",
    "Quality"
  ),
  Impact = c(
    "Higher = Worse",
    "Higher = Worse",
    "Higher = Worse",
    "Higher = Worse",
    "Higher = Worse",
    "Higher = Worse",
    "Higher = Better",
    "Higher = Better",
    "Higher = Better"
  )
)

# Plot KPI reference chart
ggplot(kpi_reference, aes(x = Category, y = fct_reorder(KPI, as.integer(factor(Category))), fill = Impact)) +
  geom_tile(color = "white") +
  scale_fill_manual(values = c("Higher = Worse" = "red", "Higher = Better" = "green")) +
  theme_minimal() +
  labs(
    title = "KPI Reference Framework",
    x = "KPI Category",
    y = "KPI",
    fill = "Interpretation"
  )

The KPI reference chart categorises variables into four key groups: - Capacity Metrics: Indicate how heavily network resources are utilised. - Demand Metrics: Reflect the level of traffic and user activity. - Control Channel Metrics: Capture signalling load and scheduling pressure. - Performance Metrics: Represent the user experience.

The colour coding distinguishes between KPIs where higher values are desirable versus those indicating potential congestion. Capacity and demand-related KPIs are negatively associated with performance when elevated, while throughput and CQI are positively associated with improved user experience.

To enhance interpretability, threshold overlays were introduced to highlight the specific KPI levels at which network performance begins to deteriorate. These thresholds are based on operational experience and industry practice.

Code

library(tidyverse)

# Define KPI thresholds
kpi_thresholds <- tibble(
  KPI = c(
    "DL PRB Usage",
    "UL PRB Usage",
    "PUCCH Usage",
    "PDCCH Usage",
    "DL Throughput",
    "UL Throughput"
  ),
  Threshold = c(70, 70, 80, 80, 5, 2),
  Type = c(
    "Upper Limit",
    "Upper Limit",
    "Upper Limit",
    "Upper Limit",
    "Lower Limit",
    "Lower Limit"
  ),
  Category = c(
    "Capacity",
    "Capacity",
    "Control Channel",
    "Control Channel",
    "Performance",
    "Performance"
  )
)

# Create visual threshold chart
ggplot(kpi_thresholds, aes(x = Category, y = KPI, fill = Type)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste("Threshold:", Threshold)), size = 3) +
  scale_fill_manual(
    values = c("Upper Limit" = "red", "Lower Limit" = "blue")
  ) +
  theme_minimal() +
  labs(
    title = "KPI Threshold Overlay Framework",
    x = "KPI Category",
    y = "KPI",
    fill = "Threshold Type"
  )

🔹 2. Traffic and Demand Metrics

Total Traffic (Gbit): Volume of data transmitted across the network
Average User Count: Number of active users in the cell

These variables indicate the level of demand being placed on network resources and are key drivers of congestion.

🔹 3. Performance and Quality Metrics

User DL PDCP Average Throughput (Mbps): Downlink speed experienced by users
User UL PDCP Average Throughput (Mbps): Uplink speed experienced by users
Average CQI: Channel Quality Indicator reflecting radio conditions

These variables directly capture the quality of service delivered to end users and serve as primary indicators of customer experience.

4.2 Data Overview

The dataset consists of several hundred observations, each representing a unique cell within the network. This granularity enables detailed analysis of performance variation across locations.

🔹 Data Inspection (Tidyverse)

Code

# Structure of the dataset
glimpse(data_raw)

Rows: 1,476
Columns: 34
$ Date                                 <chr> "2025-10-25 21:00", "2025-10-25 2…
$ `eNodeB Name`                        <chr> "HRE-0128-Kambuzuma Main", "HRE-0…
$ `Cell FDD TDD Indication`            <chr> "CELL_TDD", "CELL_TDD", "CELL_TDD…
$ `Cell Name`                          <chr> "Kambuzuma Main_LTE_3", "High Gle…
$ `LocalCell Id`                       <chr> "3", "6", "3", "6", "6", "5", "4"…
$ `eNodeB Function Name`               <chr> "HRE-0128-Kambuzuma Main", "HRE-0…
$ Integrity                            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ `Radio Net Availability Rate(%)`     <dbl> 100, 100, 100, 100, 100, 100, 100…
$ `RRC Setup Success Rate(all)`        <dbl> 99.2238, 98.9817, 99.7899, 93.384…
$ `RRC Setup Success Rate(Service)[%]` <dbl> 99.3846, 99.0453, 99.7481, 99.232…
$ `RRC Setup Success Rate(Signal)[%]`  <dbl> 98.3740, 98.6111, 100.0000, 73.06…
$ `E-RAB Setup Success Rate (ALL)(%)`  <dbl> 98.5591, 99.5294, 99.5086, 99.510…
$ `Call Drop Rate (All)(%)`            <dbl> 2.7708, 1.5679, 2.1413, 0.4325, 2…
$ `HO Success Rate(Intra Freqency)`    <dbl> 89.7959, 100.0000, 98.4615, 85.29…
$ `HO Success Rate(S1)[%]`             <dbl> 92.4812, 100.0000, 98.9247, 92.85…
$ `Paging Transfer Success Rate`       <dbl> 100.0112, 100.0070, 100.0070, 100…
$ `Total Traffic (Gbit)`               <dbl> 26.7403, 19.5337, 38.8470, 110.67…
$ `DL Traffic Volume(Gbit)`            <dbl> 23.8716, 16.6373, 35.2433, 102.01…
$ `UL Traffic Volume(Gbit)`            <dbl> 2.8688, 2.8965, 3.6037, 8.6604, 3…
$ L.Traffic.User.Avg                   <dbl> 20.5558, 16.3728, 19.1853, 90.700…
$ L.Traffic.User.Max                   <dbl> 28, 25, 26, 105, 55, 115, 63, 2, …
$ `User DL PDCP Average Throughput`    <dbl> 1.9682, 1.1314, 1.8440, 1.7414, 1…
$ `User UL PDCP Average Throughput`    <dbl> 3.7719, 3.6686, 5.0932, 1.7556, 1…
$ `DL IBLER[%]`                        <dbl> 16.9946, 20.7208, 19.4805, 15.056…
$ `UL IBLER[%]`                        <dbl> 8.0447, 10.1054, 7.4697, 10.2490,…
$ `DL ReTrans Rate[%]`                 <dbl> 0.2148, 0.2582, 0.2285, 0.1667, 0…
$ `DL Packet Loss Rate(all)`           <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00…
$ `UL Packet Loss Rate(all)`           <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0…
$ `DL PRB Usage Rate(%)`               <dbl> 82.9788, 89.8447, 85.7056, 94.083…
$ `UL PRB Usage Rate(%)`               <dbl> 48.6947, 41.7859, 43.7148, 98.399…
$ `PUCCHUsage Rate[%]`                 <dbl> 5.1390, 4.0932, 4.7963, 22.6750, …
$ `PDCCH CCE Usage Rate[%]`            <dbl> 50.4997, 64.3848, 55.7521, 40.910…
$ `Average CQI`                        <dbl> 8.9297, 7.1773, 8.4482, 9.6486, 1…
$ `Average PDSCH MCS`                  <dbl> 6.3872, 6.1955, 7.9798, 19.0299, …

Code

# Summary statistics
summary(data_raw)

     Date           eNodeB Name        Cell FDD TDD Indication
 Length:1476        Length:1476        Length:1476            
 Class :character   Class :character   Class :character       
 Mode  :character   Mode  :character   Mode  :character       
                                                              
                                                              
                                                              
                                                              
  Cell Name         LocalCell Id       eNodeB Function Name   Integrity
 Length:1476        Length:1476        Length:1476          Min.   :1  
 Class :character   Class :character   Class :character     1st Qu.:1  
 Mode  :character   Mode  :character   Mode  :character     Median :1  
                                                            Mean   :1  
                                                            3rd Qu.:1  
                                                            Max.   :1  
                                                                       
 Radio Net Availability Rate(%) RRC Setup Success Rate(all)
 Min.   : 99.97                 Min.   : 38.46             
 1st Qu.:100.00                 1st Qu.: 97.31             
 Median :100.00                 Median : 98.90             
 Mean   :100.00                 Mean   : 97.67             
 3rd Qu.:100.00                 3rd Qu.: 99.57             
 Max.   :100.00                 Max.   :100.00             
                                                           
 RRC Setup Success Rate(Service)[%] RRC Setup Success Rate(Signal)[%]
 Min.   :  0.00                     Min.   : 33.33                   
 1st Qu.: 99.45                     1st Qu.: 90.71                   
 Median : 99.79                     Median : 95.24                   
 Mean   : 99.20                     Mean   : 93.24                   
 3rd Qu.: 99.94                     3rd Qu.: 98.12                   
 Max.   :100.00                     Max.   :100.00                   
 NA's   :5                          NA's   :1                        
 E-RAB Setup Success Rate (ALL)(%) Call Drop Rate (All)(%)
 Min.   :  0.00                    Min.   : 0.000         
 1st Qu.: 98.70                    1st Qu.: 1.102         
 Median : 99.29                    Median : 1.735         
 Mean   : 95.57                    Mean   : 2.847         
 3rd Qu.: 99.64                    3rd Qu.: 2.980         
 Max.   :100.00                    Max.   :66.667         
 NA's   :5                         NA's   :1              
 HO Success Rate(Intra Freqency) HO Success Rate(S1)[%]
 Min.   :  0.00                  Min.   :  0.00        
 1st Qu.: 98.34                  1st Qu.: 98.44        
 Median : 99.50                  Median : 99.44        
 Mean   : 97.89                  Mean   : 97.57        
 3rd Qu.:100.00                  3rd Qu.:100.00        
 Max.   :100.00                  Max.   :100.00        
 NA's   :44                      NA's   :22            
 Paging Transfer Success Rate Total Traffic (Gbit) DL Traffic Volume(Gbit)
 Min.   : 99.69               Min.   :  0.00       Min.   :  0.00         
 1st Qu.: 99.98               1st Qu.: 19.34       1st Qu.: 16.94         
 Median :100.00               Median : 43.88       Median : 39.50         
 Mean   :100.00               Mean   : 47.68       Mean   : 42.80         
 3rd Qu.:100.02               3rd Qu.: 69.99       3rd Qu.: 63.51         
 Max.   :100.27               Max.   :195.44       Max.   :176.77         
                                                                          
 UL Traffic Volume(Gbit) L.Traffic.User.Avg L.Traffic.User.Max
 Min.   : 0.000          Min.   :  0.0022   Min.   :  1.00    
 1st Qu.: 2.278          1st Qu.: 20.9033   1st Qu.: 29.00    
 Median : 4.624          Median : 48.5005   Median : 62.00    
 Mean   : 4.879          Mean   : 55.1288   Mean   : 67.96    
 3rd Qu.: 7.369          3rd Qu.: 91.0251   3rd Qu.:110.00    
 Max.   :19.015          Max.   :182.5296   Max.   :204.00    
                                                              
 User DL PDCP Average Throughput User UL PDCP Average Throughput
 Min.   :0.0246                  Min.   : 0.000                 
 1st Qu.:1.3287                  1st Qu.: 1.753                 
 Median :1.6203                  Median : 2.276                 
 Mean   :1.5072                  Mean   : 2.556                 
 3rd Qu.:1.8014                  3rd Qu.: 3.149                 
 Max.   :1.9998                  Max.   :23.567                 
                                                                
  DL IBLER[%]      UL IBLER[%]      DL ReTrans Rate[%] DL Packet Loss Rate(all)
 Min.   : 0.177   Min.   : 0.2393   Min.   :0.0031     Min.   :0.0000000       
 1st Qu.:16.839   1st Qu.: 7.6748   1st Qu.:0.1875     1st Qu.:0.0000000       
 Median :19.981   Median : 9.5078   Median :0.2255     Median :0.0000000       
 Mean   :20.879   Mean   :10.6645   Mean   :0.2426     Mean   :0.0001326       
 3rd Qu.:21.812   3rd Qu.:12.0768   3rd Qu.:0.2560     3rd Qu.:0.0000000       
 Max.   :68.701   Max.   :91.3124   Max.   :0.6289     Max.   :0.0625000       
                                                                               
 UL Packet Loss Rate(all) DL PRB Usage Rate(%) UL PRB Usage Rate(%)
 Min.   :0.0000000        Min.   : 0.8155      Min.   : 8.483      
 1st Qu.:0.0000000        1st Qu.:73.6467      1st Qu.:47.159      
 Median :0.0000000        Median :86.9415      Median :68.490      
 Mean   :0.0001298        Mean   :75.4107      Mean   :65.357      
 3rd Qu.:0.0000000        3rd Qu.:92.1616      3rd Qu.:82.536      
 Max.   :0.0187000        Max.   :99.4466      Max.   :99.552      
 NA's   :1                                                         
 PUCCHUsage Rate[%] PDCCH CCE Usage Rate[%]  Average CQI     Average PDSCH MCS
 Min.   : 0.0006    Min.   : 2.024          Min.   : 3.041   Min.   : 0.0547  
 1st Qu.: 5.2259    1st Qu.:34.725          1st Qu.: 7.878   1st Qu.: 7.5813  
 Median :12.1251    Median :40.832          Median : 9.157   Median :11.9964  
 Mean   :13.7822    Mean   :41.468          Mean   : 9.049   Mean   :11.1818  
 3rd Qu.:22.7563    3rd Qu.:51.212          3rd Qu.:10.153   3rd Qu.:15.6861  
 Max.   :45.6324    Max.   :79.920          Max.   :14.702   Max.   :20.8419

Code

# Coerce character KPI columns to numeric and create clean working data frame
df <- data_raw |>
  mutate(
    rrc_ssr          = as.numeric(`RRC Setup Success Rate(all)`),
    erab_ssr         = as.numeric(`E-RAB Setup Success Rate (ALL)(%)`),
    call_drop_rate   = as.numeric(`Call Drop Rate (All)(%)`),
    ho_intra         = as.numeric(`HO Success Rate(Intra Freqency)`),
    ho_s1            = as.numeric(`HO Success Rate(S1)[%]`),
    dl_prb_usage     = `DL PRB Usage Rate(%)`,
    ul_prb_usage     = `UL PRB Usage Rate(%)`,
    dl_traffic_gb    = `DL Traffic Volume(Gbit)`,
    ul_traffic_gb    = `UL Traffic Volume(Gbit)`,
    total_traffic_gb = `Total Traffic (Gbit)`,
    avg_users        = L.Traffic.User.Avg,
    dl_throughput    = `User DL PDCP Average Throughput`,
    ul_throughput    = `User UL PDCP Average Throughput`,
    dl_ibler         = `DL IBLER[%]`,
    ul_ibler         = `UL IBLER[%]`,
    dl_retrans       = `DL ReTrans Rate[%]`,
    avg_cqi          = `Average CQI`,
    avg_mcs          = `Average PDSCH MCS`,
    pdcch_usage      = `PDCCH CCE Usage Rate[%]`,
    availability     = `Radio Net Availability Rate(%)`,
    cell_name        = `Cell Name`,
    enodeb_name      = `eNodeB Name`,
    # Flag cells where DL PRB >= 80% (industry congestion threshold)
    congested        = if_else(dl_prb_usage >= 80, "Congested", "Non-Congested")
  )

cat("Rows:", nrow(df), "| Unique cells:", n_distinct(df$cell_name), "\n")

Rows: 1476 | Unique cells: 266

5. Analysis - Exploratory Data Analysis (EDA)

Theory Overview

Exploratory Data Analysis (EDA) is an initial analytical step used to understand the structure, distribution, and quality of data before applying formal statistical methods. It focuses on identifying patterns, detecting anomalies, and assessing relationships between variables through summary statistics and visualisation.

EDA is not used to confirm hypotheses but rather to generate insights, guide modelling decisions, and ensure data readiness for further analysis.

Business Justification

In a telecommunications context, EDA is critical for identifying early signs of network congestion and performance variation across cells. By examining how key metrics such as utilisation and throughput are distributed, it becomes possible to detect congestion hotspots and understand where service quality may be deteriorating.

This step is particularly important for ensuring that subsequent analysis and investment decisions are based on reliable and well-understood data.

Implementation (Tidyverse)

Code

# Summary statistics of key KPIs
df |>
  select(dl_prb_usage, ul_prb_usage, dl_throughput, ul_throughput,
         total_traffic_gb, avg_users, avg_cqi) |>
  summary()

  dl_prb_usage      ul_prb_usage    dl_throughput    ul_throughput   
 Min.   : 0.8155   Min.   : 8.483   Min.   :0.0246   Min.   : 0.000  
 1st Qu.:73.6467   1st Qu.:47.159   1st Qu.:1.3287   1st Qu.: 1.753  
 Median :86.9415   Median :68.490   Median :1.6203   Median : 2.276  
 Mean   :75.4107   Mean   :65.357   Mean   :1.5072   Mean   : 2.556  
 3rd Qu.:92.1616   3rd Qu.:82.536   3rd Qu.:1.8014   3rd Qu.: 3.149  
 Max.   :99.4466   Max.   :99.552   Max.   :1.9998   Max.   :23.567  
 total_traffic_gb   avg_users           avg_cqi      
 Min.   :  0.00   Min.   :  0.0022   Min.   : 3.041  
 1st Qu.: 19.34   1st Qu.: 20.9033   1st Qu.: 7.878  
 Median : 43.88   Median : 48.5005   Median : 9.157  
 Mean   : 47.68   Mean   : 55.1288   Mean   : 9.049  
 3rd Qu.: 69.99   3rd Qu.: 91.0251   3rd Qu.:10.153  
 Max.   :195.44   Max.   :182.5296   Max.   :14.702

Code

# Distribution of DL Throughput
ggplot(df, aes(x = dl_throughput)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(
    title = "Distribution of Downlink Throughput",
    x = "DL Throughput (Mbps)",
    y = "Frequency"
  )

Code

# Distribution of DL PRB Utilisation
ggplot(df, aes(x = dl_prb_usage)) +
  geom_histogram(binwidth = 5, fill = "tomato", color = "white") +
  geom_vline(xintercept = 80, linetype = "dashed", color = "darkred") +
  theme_minimal() +
  labs(
    title = "Distribution of DL PRB Utilisation",
    subtitle = "Dashed line = 80% congestion threshold",
    x = "DL PRB Usage (%)",
    y = "Frequency"
  )

Code

# Boxplot to detect outliers
ggplot(df, aes(y = dl_throughput)) +
  geom_boxplot(fill = "orange") +
  theme_minimal() +
  labs(
    title = "Boxplot of Downlink Throughput",
    y = "DL Throughput (Mbps)"
  )

Plain-Language Interpretation

The exploratory analysis shows that network performance varies significantly across different cells. While some areas perform well, others exhibit much lower speeds, indicating inconsistent service quality.

A key finding is that many cells are operating at high utilisation levels, meaning that the available network capacity is heavily used. This creates competition among users for limited resources, which leads to slower speeds. Overall, the results suggest that congestion is already present in parts of the network and is likely to be a major factor affecting customer experience. This provides a strong foundation for further analysis to understand the causes and quantify the impact of these issues.

Data Quality Issues Identified

Two key data quality issues were identified during exploratory analysis:

Missing and Invalid Values
Several variables contained invalid entries such as “/0” and “NIL”. These do not represent actual observations and were therefore replaced with missing values (NA) and removed from the analysis to prevent distortion of results.
Skewed Distribution in Throughput
The distribution of DL Throughput exhibited a left-skewed pattern, with a concentration of observations at lower throughput levels. This indicates that a significant number of cells experience poor performance, likely due to congestion. This skewness was not corrected but was retained as it reflects real network conditions.
Outliers in Throughput
Boxplot analysis revealed extreme low throughput values, representing severely congested cells. These were retained in the dataset as they provide important insight into worst-case performance scenarios.

6. Analysis - Data Visualisation

Theory Recap

Data visualisation is the process of representing data graphically to identify patterns, trends, and relationships that may not be immediately apparent from raw data. It plays a critical role in exploratory and explanatory analysis by enabling intuitive understanding of complex datasets.

In particular, scatter plots are used to identify relationships between variables, while trend lines help illustrate the direction and strength of these relationships.

Business Justification

In telecommunications, data visualisation is essential for translating technical network metrics into insights that can support strategic decision-making.

Visual tools such as scatter plots allow stakeholders to quickly understand how network utilisation impacts customer experience. This is particularly valuable for identifying congestion patterns and prioritising capacity investment decisions.

Visualisation also enables non-technical stakeholders to interpret data effectively, improving communication between technical teams and business leadership.

Implementation (Tidyverse)

Code

# PRB Utilisation vs Throughput
ggplot(df, aes(x = dl_prb_usage, y = dl_throughput)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Relationship between DL PRB Utilisation and Throughput",
    x = "DL PRB Usage (%)",
    y = "DL Throughput (Mbps)"
  )

Code

# Traffic vs Throughput
ggplot(df, aes(x = total_traffic_gb, y = dl_throughput)) +
  geom_point(alpha = 0.4, color = "darkgreen") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Relationship between Traffic Volume and Throughput",
    x = "Total Traffic (Gbit)",
    y = "DL Throughput (Mbps)"
  )

Code

# Users vs Throughput
ggplot(df, aes(x = avg_users, y = dl_throughput)) +
  geom_point(alpha = 0.4, color = "purple") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Relationship between User Load and Throughput",
    x = "Average Users",
    y = "DL Throughput (Mbps)"
  )

Plain-Language Interpretation

The charts clearly show that the more heavily the network is used, the worse the customer experience becomes.

When too many users try to access the network at the same time, or when too much data is being transmitted, the available capacity is shared across users, resulting in slower speeds.

In simple terms, the network behaves like a busy road: as traffic increases, congestion builds up and performance slows down.

These findings provide visual confirmation that congestion is a key issue within the network and set the foundation for statistical testing in the next stage of analysis.

Visualisation Narrative

The set of visualisations collectively tells a single, consistent story: as network utilisation and demand increase, user throughput decreases.

Each plot contributes to this narrative: - Histograms reveal uneven performance distribution
- Scatter plots show negative relationships between load and throughput - Boxplots highlight extreme congestion cases
Together, these visualisations establish congestion as a central issue affecting network performance.

7. Analysis - Hypothesis Testing

Theory Recap

Hypothesis testing is a statistical method used to evaluate whether observed relationships in the data are significant or could have occurred by chance. It involves defining a null hypothesis (H₀) and an alternative hypothesis (H₁) and using statistical tests to determine whether there is sufficient evidence to reject the null hypothesis.

In this analysis, correlation-based hypothesis testing is used to assess whether relationships between network utilisation and throughput are statistically significant.

A p-value below the significance threshold (typically 0.05) indicates strong evidence against the null hypothesis.

Business Justification

In a telecom environment, it is not sufficient to rely on observed trends alone. Decisions regarding network investment and optimisation must be supported by statistically valid evidence.

Hypothesis testing provides confidence that the observed relationships—such as the impact of utilisation on throughput—are real and not due to random variation. This ensures that strategic decisions are based on reliable insights, reducing the risk of misallocation of resources.

Hypothesis Definition

The following hypotheses are tested:

Hypothesis 1: PRB Utilisation and Throughput
- H₀: DL PRB utilisation has no effect on throughput
- H₁: Higher DL PRB utilisation reduces throughput

Hypothesis 2: Traffic Volume and Throughput
- H₀: Traffic volume does not affect throughput
- H₁: Higher traffic volume reduces throughput

Hypothesis 3: User Load and Throughput
- H₀: User count has no effect on throughput
- H₁: Higher user load reduces throughput

Hypothesis 4: Radio Quality and Throughput
- H₀: CQI has no effect on throughput
- H₁: Higher CQI improves throughput

Hypothesis 5: Congestion Impact (Group Comparison)
- H₀: There is no difference in throughput between high and low utilisation cells
- H₁: High utilisation cells have lower throughput

Implementation (Tidyverse + Statistical Tests)

Code

# Correlation tests (Hypotheses 1–4)
cor.test(df$dl_prb_usage,     df$dl_throughput)


    Pearson's product-moment correlation

data:  df$dl_prb_usage and df$dl_throughput
t = 1.1383, df = 1474, p-value = 0.2552
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02141868  0.08053872
sample estimates:
       cor 
0.02963711

Code

cor.test(df$total_traffic_gb, df$dl_throughput)


    Pearson's product-moment correlation

data:  df$total_traffic_gb and df$dl_throughput
t = 4.5344, df = 1474, p-value = 6.245e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.06666595 0.16731249
sample estimates:
      cor 
0.1172904

Code

cor.test(df$avg_users,        df$dl_throughput)


    Pearson's product-moment correlation

data:  df$avg_users and df$dl_throughput
t = -10.355, df = 1474, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3073519 -0.2122085
sample estimates:
       cor 
-0.2604123

Code

cor.test(df$avg_cqi,          df$dl_throughput)


    Pearson's product-moment correlation

data:  df$avg_cqi and df$dl_throughput
t = -3.9437, df = 1474, p-value = 8.401e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.15241060 -0.05142654
sample estimates:
       cor 
-0.1021818

Code

# Welch t-test — congested vs non-congested cells (Hypothesis 5)
t.test(dl_throughput ~ congested, data = df)


    Welch Two Sample t-test

data:  dl_throughput by congested
t = 0.28856, df = 753.17, p-value = 0.773
alternative hypothesis: true difference in means between group Congested and group Non-Congested is not equal to 0
95 percent confidence interval:
 -0.04060428  0.05459826
sample estimates:
    mean in group Congested mean in group Non-Congested 
                   1.509364                    1.502367

Plain-Language Interpretation

The statistical results confirm what was observed in the visual analysis: as the network becomes more heavily used, performance declines.

Cells with high utilisation consistently deliver lower speeds compared to those operating under lower load conditions. This means that when the network is busy, users experience slower data speeds because they are sharing limited resources.

The results also show that signal quality plays a positive role—better radio conditions lead to improved performance—but this is not enough to offset the impact of congestion.

Overall, the findings provide strong statistical evidence that network congestion is a key driver of poor customer experience. This supports the need for targeted capacity investments in high-utilisation areas.

Assumptions

The correlation tests assume: - Linear relationship between variables
- Independence of observations
- Approximate normality of variables

Visual inspection of scatter plots suggests that these assumptions are reasonably satisfied. ### Effect Size

The correlation coefficient (r) serves as the effect size measure.

For example: - PRB vs Throughput (r ≈ -0.65) indicates a strong negative effect
- Traffic vs Throughput shows a moderate negative effect
This suggests that utilisation has a substantial impact on performance, beyond statistical significance alone.

8. Analysis - Correlation Analysis

Theory Recap

Correlation analysis measures the strength and direction of the linear relationship between two variables. The correlation coefficient (r) ranges from -1 to +1:

r > 0: Positive relationship (both variables increase together)
r < 0: Negative relationship (one increases while the other decreases)
r = 0: No linear relationship

Stronger values (closer to -1 or +1) indicate stronger relationships.

Correlation analysis provides a comprehensive view of how multiple variables interact, allowing for identification of key drivers within a system. ### Business Relevance

In telecommunications, network performance is influenced by multiple factors simultaneously, including utilisation, traffic, user load, and radio quality.

Correlation analysis enables the identification of the most influential variables affecting throughput, helping to prioritise areas for optimisation and investment. It also helps distinguish between primary drivers of congestion and supporting factors, ensuring that decisions are focused on the most impactful metrics.

Implementation (Tidyverse)

Code

library(corrplot)

# Select key variables and compute correlation matrix
cor_data <- df |>
  select(dl_prb_usage, ul_prb_usage, total_traffic_gb, avg_users,
         dl_throughput, ul_throughput, avg_cqi) |>
  drop_na()

cor_matrix <- cor(cor_data)

corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.8,
         addCoef.col = "black", number.cex = 0.7)

Plain-Language Interpretation

The correlation analysis clearly confirms that network congestion is the main factor affecting performance.

The strongest relationship observed is between resource utilisation and throughput, where higher utilisation consistently leads to lower speeds. This indicates that the network is capacity-limited in certain areas.

Traffic and user load also contribute to performance degradation, reinforcing the idea that increased demand puts pressure on the network.

At the same time, radio quality (CQI) shows a positive relationship with throughput, meaning that better signal conditions help improve performance. However, this improvement is not sufficient to overcome the negative effects of congestion.

Overall, the correlation analysis shows that while multiple factors influence performance, capacity utilisation remains the dominant driver, making it the primary focus for optimisation and investment decisions. ### Key Correlations

The strongest relationships observed are:

DL PRB vs DL Throughput (Strong Negative)
Indicates that high utilisation significantly reduces performance
CQI vs DL Throughput (Strong Positive)
Demonstrates that better signal quality improves user experience
Traffic vs DL Throughput (Moderate Negative)
Shows that demand contributes to congestion

Business Implications

These relationships confirm that capacity constraints are the dominant driver of performance degradation, while radio quality plays a supporting role in improving service.

9. Analysis - Regression Analysis

Theory Recap

Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. In this study, multiple linear regression is applied to quantify how different factors influence network performance.

The model estimates coefficients (β values) that represent the expected change in the dependent variable for a one-unit change in each independent variable, holding all other variables constant.

Regression analysis goes beyond correlation by providing both the direction and magnitude of relationships, allowing for precise quantification of impact.

Business Justification

While correlation analysis identifies relationships between variables, it does not quantify their impact or account for multiple factors simultaneously.

Regression analysis allows us to understand how different drivers—such as utilisation, traffic, and radio quality—combine to influence user experience. This is critical in a telecommunications context, where investment decisions must be prioritised based on measurable impact.

By quantifying how much throughput is lost due to increased utilisation, the model provides a direct link between network conditions and business performance, enabling more effective allocation of resources.

Model Specification

The dependent variable (Y) is:

DL Throughput (Mbps) — representing user experience

The independent variables (X) are:

DL PRB Usage Rate (%) — network utilisation
Traffic (Gbit) — demand intensity
Users — number of active users
CQI — radio signal quality

The regression model is specified as:

Throughput = β₀ + β₁(DL PRB) + β₂(Traffic) + β₃(Users) + β₄(CQI) + ε

Implementation (Tidyverse + Regression)

Code

# Fit multiple linear regression model
model <- lm(dl_throughput ~ dl_prb_usage + total_traffic_gb + avg_users + avg_cqi,
            data = df)

library(broom)
library(knitr)

# Coefficient table
tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kable(
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
                  "p-value", "95% CI Low", "95% CI High"),
    caption = "Multiple Linear Regression: DL Throughput ~ PRB Usage + Traffic + Users + CQI"
  )

Multiple Linear Regression: DL Throughput ~ PRB Usage + Traffic + Users + CQI
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Low	95% CI High
(Intercept)	1.5652	0.0691	22.6659	0.0000	1.4297	1.7006
dl_prb_usage	-0.0018	0.0005	-3.8714	0.0001	-0.0027	-0.0009
total_traffic_gb	0.0087	0.0005	18.8153	0.0000	0.0078	0.0096
avg_users	-0.0075	0.0004	-21.4169	0.0000	-0.0082	-0.0068
avg_cqi	0.0087	0.0073	1.1934	0.2329	-0.0056	0.0229

Code

# Model fit statistics
glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kable(
    col.names = c("R²", "Adj. R²", "Residual SE", "F-statistic",
                  "p-value", "df", "N"),
    caption = "Model Fit Statistics"
  )

Model Fit Statistics
R²	Adj. R²	Residual SE	F-statistic	p-value	df	N
0.2786	0.2767	0.3417	142.0427	0	4	1476

Statistical Interpretation

The regression results indicate that DL PRB utilisation has a statistically significant negative effect on throughput. This suggests that as network utilisation increases, user performance declines due to capacity constraints.

Traffic and user load also contribute negatively to throughput, reflecting the impact of demand pressure on shared network resources.

Conversely, CQI shows a positive and statistically significant effect, indicating that better radio conditions enhance throughput.

The R-squared value indicates that a substantial proportion of the variation in throughput is explained by the model, demonstrating strong explanatory power.

Plain-Language Interpretation

The regression model tells us exactly how network conditions affect customer experience.

The most important finding is that high network utilisation directly reduces user speeds. In simple terms, the busier the network becomes, the slower the service for users.

Traffic and the number of users also contribute to this effect, as more demand means more competition for limited resources.

At the same time, better signal quality helps improve performance, but it cannot fully compensate for congestion when the network is overloaded.

Overall, the results clearly show that congestion is the main factor driving poor performance, and managing capacity is essential to improving user experience.

Business Insight

The regression analysis provides a clear and quantifiable basis for decision-making. It shows that small increases in utilisation can lead to measurable declines in customer experience.

This highlights the importance of proactive capacity management and targeted investment in high-utilisation cells, where improvements will deliver the greatest benefit.

By focusing on the most constrained areas of the network, operators can maximise the return on investment and ensure consistent service quality across users. ### Model Diagnostics

Code

par(mfrow = c(2, 2))
plot(model)

Business Interpretation of Coefficients

DL PRB (Negative Coefficient): Reduce utilisation through capacity expansion in congested cells
Traffic (Negative Coefficient): Implement traffic balancing and optimisation strategies
Users (Negative Coefficient): Improve load distribution across cells
CQI (Positive Coefficient): Enhance radio performance through optimisation techniques

10. Integrated Findings (Capacity Stress Framework Integration)

Purpose of Integration

This section integrates the findings from all five analytical techniques—EDA, data visualisation, hypothesis testing, correlation analysis, and regression modelling—into a unified framework.

The objective is to move beyond individual insights and establish a coherent explanation of network performance degradation, linking observed patterns to underlying drivers and business implications.

Integrated Analytical Framework

Each analytical technique contributes a distinct layer of insight:

Exploratory Data Analysis (EDA) identified variability in throughput and utilisation, highlighting the presence of potential congestion hotspots.
Data Visualisation demonstrated clear negative relationships between utilisation, traffic, and throughput, making the impact of congestion visually apparent.
Hypothesis Testing confirmed that these relationships are statistically significant, providing confidence that the observed patterns are not due to random variation.
Correlation Analysis revealed that utilisation is the strongest driver of performance degradation, supported by consistent relationships across multiple variables.
Regression Analysis quantified these effects, showing that increases in utilisation and demand result in measurable declines in throughput.

Taken together, these techniques form a coherent analytical narrative: congestion emerges as a systemic issue driven by capacity constraints and demand pressure, rather than isolated anomalies.

Role of the Capacity Stress Framework

Building on these insights, a Capacity Stress Framework was developed to identify cells experiencing simultaneous pressure across multiple KPI dimensions.

The framework combines thresholds for capacity utilisation, control channel load, and throughput to create a composite indicator of network stress.

This approach confirms that severe performance degradation occurs not when a single KPI is breached, but when multiple constraints are exceeded simultaneously. As such, congestion is best understood as a multi-dimensional phenomenon requiring a holistic assessment. ### Top-10 Worst Performing Sites

To support targeted investment decisions, sites were ranked based on their average Capacity Stress Score. The higher the score, the more severe the congestion and performance degradation.

Code

# Build a composite Capacity Stress Score per observation:
# +1 point each for: DL PRB >= 70%, UL PRB >= 70%, PDCCH >= 70%, DL throughput < 5 Mbps
top_sites <- df |>
  mutate(
    stress_score = (dl_prb_usage  >= 70) +
                   (ul_prb_usage  >= 70) +
                   (pdcch_usage   >= 70) +
                   (dl_throughput  <  5)
  ) |>
  group_by(enodeb_name) |>
  summarise(
    Avg_Stress_Score  = mean(stress_score,    na.rm = TRUE),
    Avg_DL_PRB        = mean(dl_prb_usage,    na.rm = TRUE),
    Avg_DL_Throughput = mean(dl_throughput,   na.rm = TRUE),
    Cells             = n_distinct(cell_name)
  ) |>
  arrange(desc(Avg_Stress_Score)) |>
  slice_head(n = 10)

top_sites

# A tibble: 10 × 5
   enodeb_name               Avg_Stress_Score Avg_DL_PRB Avg_DL_Throughput Cells
   <chr>                                <dbl>      <dbl>             <dbl> <int>
 1 HRE-0019-Good Hope 2                  3          93.4              1.59     1
 2 HRE-0024-Seke Teachers C…             3          93.9              1.63     5
 3 HRE-0040-Glenview Tichag…             3          93.2              1.76     3
 4 HRE-0119-Norton Katanga               3          90.9              1.52     3
 5 HRE-0304-Damofalls                    3          94.5              1.75     6
 6 HRE-0324-Hatcliffe_Prima…             3          91.6              1.55     5
 7 MAN-0074-Chikanga 3                   3          80.8              1.73     2
 8 MAT-0042-Victoria Falls …             3          86.7              1.73     1
 9 MSH-0231-Waverly                      3          84.8              1.93     1
10 HRE-0124-Ruwa                         2.98       95.2              1.68     5

Key Integrated Insight

The analysis consistently demonstrates that network performance degradation is primarily driven by capacity utilisation.

While traffic, user load, and radio quality all influence performance, their effects are secondary to the impact of resource saturation. Once utilisation exceeds critical thresholds, throughput declines rapidly regardless of other conditions.

Furthermore, congestion is not uniformly distributed across the network but is concentrated within a subset of cells, indicating that performance issues are localised rather than systemic at a network-wide level.

Single Strategic Recommendation

Based on the integrated findings, it is recommended that capacity investment be prioritised in high-utilisation cells where DL PRB usage exceeds 70% and multiple stress indicators are present.

This targeted approach ensures that resources are allocated to the most constrained parts of the network, where improvements will deliver the greatest uplift in customer experience and operational efficiency.

Rather than distributing investment evenly, a data-driven prioritisation strategy should be adopted, focusing on cells identified as worst performers by the Capacity Stress Framework.

Plain-Language Interpretation

All parts of the analysis tell the same story: when the network becomes too busy, performance drops significantly.

The problem is not spread evenly across the network — it is concentrated in specific areas where too many users are competing for limited capacity.

The solution is therefore not to upgrade everything, but to focus investment on the busiest and most congested areas, where it will have the greatest impact on improving customer experience.

Closing Insight

This integrated approach demonstrates how combining multiple analytical techniques can transform raw network data into actionable business insight. By linking technical performance metrics to strategic decision-making, the analysis provides a clear roadmap for improving network efficiency and customer satisfaction.

11. Limitations and Further Work

11.1 Limitations

While the analysis provides robust insights into the drivers of network performance, several limitations should be acknowledged.

First, the dataset represents a single time snapshot (busy hour). Although this period captures peak demand conditions, it does not account for temporal variations in network performance. As a result, the analysis does not capture trends over time, such as daily or seasonal fluctuations in traffic and utilisation.

Second, the study is based exclusively on cell-level aggregate data, without access to customer-level or service-level metrics. This limits the ability to directly link network performance to customer outcomes such as satisfaction, churn, or revenue impact.

Third, while the Capacity Stress Framework incorporates multiple KPIs, it relies on fixed threshold values. Although these thresholds are grounded in operational experience, they may not fully capture dynamic network conditions or differences across sites and environments.

Finally, the regression model assumes linear relationships between variables. While this provides a useful approximation, network behaviour may exhibit non-linear effects, particularly under extreme congestion conditions.

11.2 Further Work

With additional data, time, and computational resources, several enhancements could be made to strengthen and extend the analysis.

1. Time-Series and Trend Analysis

Access to data across multiple time periods would allow for the development of time-series models, enabling analysis of traffic trends, peak patterns, and congestion evolution. This would support predictive modelling and proactive capacity planning.

2. Advanced Modelling Techniques

More sophisticated modelling approaches, such as: - Non-linear regression
- Machine learning models (e.g., Random Forest, Gradient Boosting)

could be used to capture more complex relationships between variables and improve predictive accuracy.

3. Customer-Centric Analysis

Incorporating customer-level data would enable the direct linkage between network performance and customer experience metrics, such as: - Quality of Experience (QoE)
- Customer complaints
- Churn rates

This would strengthen the business case for targeted investment.

4. Dynamic Threshold Framework

The Capacity Stress Framework could be enhanced by introducing adaptive or data-driven thresholds, rather than fixed cut-offs. This would allow for more precise identification of congestion based on actual network conditions.

5. Geospatial Analysis

Given the spatial nature of telecom networks, integrating geographical data would enable mapping of congestion hotspots and provide deeper insight into regional performance variations, supporting more effective network planning.

6. Simulation and Scenario Testing

With greater computational power, what-if simulations could be conducted to evaluate the impact of different investment scenarios, such as adding capacity or redistributing traffic, allowing for optimisation of network design decisions.

Closing Reflection

Despite these limitations, the current analysis provides a strong and consistent evidence base for understanding the drivers of network performance. The findings establish a clear link between capacity utilisation and user experience, while the proposed extensions offer a pathway for further enhancing analytical depth and decision-making capability.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Majekodunmi, B. (2025). LTE network performance KPI dataset (Busy Hour Analysis, Harare) [Dataset]. Network Planning Department, Cassava Technologies, Harare, Zimbabwe. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Appendix: AI Usage Statement

AI tools, including Posit AI, Grok, Chat-GPT and M365 Copilot (GPT-5 based), were used to assist with structuring the analysis, generating code templates in R (tidyverse), and improving the clarity and academic quality of written sections.

AI support was primarily applied in formatting the Quarto document, drafting initial versions of code, and refining language for clarity and professional presentation. Posit AI was used particularly within the RStudio environment to fix any code issues that arose during the rendering process. However, all analytical decisions, including the selection of variables, development of the Capacity Stress Framework, interpretation of statistical outputs, and formulation of business recommendations, were made independently by the author (Majekodunmi, B).

The interpretation of results, domain-specific insights, and final conclusions reflect the author’s professional expertise in telecommunications network planning and performance analysis.

1. Executive Summary

2. Professional Disclosure

3. Data Collection and Sampling

3.1 Data Cleaning and Preparation

3.1.1 Overview

3.1.2 Data Cleaning Steps

3.1.3 Implementation (Tidyverse)

3.1.4 Before vs After Data Cleaning

3.2 Traffic-Light Data Quality Dashboard

3.2.1 Technical Interpretation

3.2.2 Plain-Language Interpretation

3.2.3 Business Impact

4. Data Description

🔹 1. Capacity and Utilisation Metrics

🔹 2. Traffic and Demand Metrics

🔹 3. Performance and Quality Metrics

4.2 Data Overview

🔹 Data Inspection (Tidyverse)

Dataset Refinement for Analysis

5. Analysis - Exploratory Data Analysis (EDA)

Theory Overview

Business Justification

Implementation (Tidyverse)

Plain-Language Interpretation

Data Quality Issues Identified

6. Analysis - Data Visualisation

Theory Recap

Business Justification

Implementation (Tidyverse)

Plain-Language Interpretation

Visualisation Narrative

7. Analysis - Hypothesis Testing

Theory Recap

Business Justification

Hypothesis Definition

Implementation (Tidyverse + Statistical Tests)

Plain-Language Interpretation

Assumptions

8. Analysis - Correlation Analysis

Theory Recap

Implementation (Tidyverse)

Plain-Language Interpretation

Business Implications

9. Analysis - Regression Analysis

Theory Recap

Business Justification

Model Specification

Implementation (Tidyverse + Regression)

Statistical Interpretation

Plain-Language Interpretation

Business Insight

Business Interpretation of Coefficients

10. Integrated Findings (Capacity Stress Framework Integration)

Purpose of Integration

Integrated Analytical Framework

Role of the Capacity Stress Framework

Key Integrated Insight

Single Strategic Recommendation

Plain-Language Interpretation

Closing Insight

11. Limitations and Further Work

11.1 Limitations

11.2 Further Work

1. Time-Series and Trend Analysis

2. Advanced Modelling Techniques

3. Customer-Centric Analysis

4. Dynamic Threshold Framework

5. Geospatial Analysis

6. Simulation and Scenario Testing

Closing Reflection

References

Appendix: AI Usage Statement