Optimising Telecom Network Performance Through Data Analytics
Author
Affiliation
Babatunde Majekodunmi
Lagos Business School (EMBA 31)
Published
May 19, 2026
1. Executive Summary
This study presents a data-driven analysis of LTE network performance aimed at identifying the operational factors responsible for customer experience degradation during busy-hour conditions and supporting targeted capacity investment decisions.
Using real cell-level operational data collected from LTE network sites, the study examined the relationship between traffic demand, network utilisation, radio quality, and user throughput across the network.
The analysis revealed that network congestion is the primary driver of performance degradation. Statistical results showed a strong negative relationship between PRB utilisation and user throughput, confirming that service quality deteriorates significantly as network resources approach saturation.
Regression analysis further demonstrated that high utilisation and traffic demand reduce throughput, while improved radio quality partially mitigates performance decline. In addition, a Capacity Stress Framework identified that severe degradation occurs when multiple operational thresholds are breached simultaneously.
The findings indicate that congestion is concentrated within a limited number of high-stress cells, suggesting that network investment should be strategically targeted rather than uniformly distributed.
The study concludes that analytics-driven capacity planning can improve customer experience, optimise resource allocation, and enhance overall network operational efficiency.
2. Professional Disclosure
This analysis is grounded in my professional experience within the telecommunications sector, where I have held both technical and commercial leadership roles. I previously served as Group Head of GIS and Network Planning, and I currently operate as a Global Account Director, responsible for driving strategic engagement, customer value, and business growth across large-scale telecom networks.
In my former role as Group Head of GIS and Network Planning, I was directly responsible for network design, capacity planning, and spatial optimisation of telecommunications infrastructure. This role required deep technical engagement with radio network performance metrics, including capacity utilisation, traffic distribution, and coverage optimisation. Decisions were informed by geospatial analysis and KPI-driven insights, with a core focus on identifying congestion hotspots, forecasting demand, and ensuring efficient allocation of network resources. This experience provides a strong technical foundation for understanding the operational dynamics reflected in the dataset used in this study.
In my current role as Global Account Director, my focus has evolved toward the commercial and strategic dimensions of telecom operations. I work closely with senior stakeholders to translate network performance into business outcomes, including customer experience, revenue growth, and return on investment. This role requires the ability to interpret complex technical data and communicate insights in a clear, actionable manner that supports executive decision-making. As such, the ability to connect network KPIs with business performance is central to my day-to-day responsibilities.
My organisation operates in the telecommunications industry, offering fixed wireless broadband solutions amongst other services, where network performance is a critical driver of customer satisfaction and competitive advantage. Key performance indicators such as throughput, utilisation, and radio quality directly influence service delivery, making data-driven analysis essential for both operational efficiency and strategic planning.
The five analytical techniques selected for this study—Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression Modelling—are highly relevant to both my current and previous roles.
Exploratory Data Analysis is fundamental in both technical and commercial contexts, enabling the identification of patterns, anomalies, and performance variations across the network. In my network planning role, this was essential for diagnosing capacity issues, while in my current role, it supports high-level performance assessment and opportunity identification.
Data Visualisation is critical for translating complex, multi-dimensional datasets into intuitive insights. As a Global Account Director, I frequently rely on visual tools such as heatmaps and performance dashboards to communicate network performance to non-technical stakeholders and support strategic discussions.
Hypothesis Testing provides a structured framework for validating assumptions about network behaviour. In practice, this allows me to distinguish between perceived and statistically significant performance issues, ensuring that business decisions are grounded in evidence rather than anecdotal observation.
Correlation Analysis is used to uncover relationships between key drivers of performance, such as the interaction between capacity utilisation and user experience. This is directly relevant to both network optimisation and commercial prioritisation, as it helps identify the underlying causes of performance degradation.
Regression Modelling represents the most advanced analytical tool applied in this study, enabling the quantification of the impact of multiple variables on network performance. This is particularly valuable in my current role, where quantifying the drivers of customer experience supports data-driven investment decisions and strengthens the business case for network expansion and optimisation.
Collectively, these techniques enable a comprehensive analytical approach that bridges technical network insight and commercial decision-making. The combination of my prior experience in network planning and my current strategic role uniquely positions me to interpret the data both from an engineering perspective and from a business value perspective. This dual lens ensures that the analysis is not only technically rigorous but also directly applicable to real-world decision-making within the telecommunications industry.
3. Data Collection and Sampling
This study is based on primary operational data extracted from a live LTE network performance reporting system. The dataset consists of cell-level Key Performance Indicators (KPIs) collected during the busy hour period, representing peak traffic conditions when network resources are most constrained.
The data was obtained from internal network monitoring systems that track real-time performance metrics across multiple sites and sectors. Each observation corresponds to a specific cell within a site, capturing both capacity utilisation metrics (e.g., PRB usage, control channel utilisation) and user experience indicators (e.g., throughput). The dataset includes several hundred observations, thereby satisfying the minimum requirement for robust statistical analysis.
The sampling approach can be described as a complete operational extract rather than a statistical sample. That is, all available cells within the specified time window were included, ensuring that the dataset accurately reflects the true state of the network during peak usage conditions. This approach eliminates sampling bias and strengthens the reliability of the findings.
The chosen time period—busy hour—is particularly relevant from a business and operational perspective, as it represents the worst-case performance scenario. Network congestion and customer experience issues are most likely to occur during this period, making it the most critical timeframe for capacity analysis and investment decision-making.
To further enhance analytical robustness, a Capacity Stress Framework was developed, combining multiple KPIs to identify cells experiencing simultaneous pressure across capacity and performance dimensions. This ensures that the analysis captures systemic congestion rather than isolated anomalies.
From an ethical and data governance perspective, all data used in this study has been fully anonymised, with no personally identifiable information (PII) included. The analysis focuses on aggregated network performance metrics and is used solely for academic and analytical purposes.
Overall, the dataset provides a highly representative and business-relevant view of network performance, enabling rigorous analysis of capacity constraints and their impact on user experience.
3.1 Data Cleaning and Preparation
3.1.1 Overview
Before conducting any analysis, the dataset was cleaned and prepared to ensure accuracy, consistency, and suitability for statistical modelling. Real-world telecom data often contains invalid entries, missing values, and formatting inconsistencies that can distort results if not properly addressed.
In this case, this involved identifying and handling invalid entries such as “/0” and “NIL”, which were treated as missing values. Key KPI variables were then converted into numeric formats to enable statistical analysis. These steps ensured that the dataset was consistent, reliable, and suitable for subsequent analytical techniques.
To demonstrate the impact of data cleaning, a comparison was performed between the raw dataset and the cleaned dataset.
Code
# Count missing / invalid values BEFORE cleaning (flag invalid strings then count NAs)before_missing <- data_raw %>%mutate(across(where(is.character), ~na_if(., "/0"))) %>%mutate(across(where(is.character), ~na_if(., "NIL"))) %>%summarise(across(everything(), ~sum(is.na(.)))) %>%pivot_longer(cols =everything(), names_to ="Variable", values_to ="Missing_Before")# Count missing values AFTER cleaningafter_missing <- data_clean %>%summarise(across(everything(), ~sum(is.na(.)))) %>%pivot_longer(cols =everything(), names_to ="Variable", values_to ="Missing_After")# Combine into comparison tablecomparison <- before_missing %>%inner_join(after_missing, by ="Variable")# Plot comparisonggplot(comparison, aes(x =reorder(Variable, Missing_Before))) +geom_bar(aes(y = Missing_Before, fill ="Before Cleaning"), stat ="identity", position ="dodge") +geom_bar(aes(y = Missing_After, fill ="After Cleaning"), stat ="identity", position ="dodge") +coord_flip() +theme_minimal() +labs(title ="Missing Values: Before vs After Data Cleaning",x ="Variables",y ="Number of Missing Values",fill ="Stage" )
3.2 Traffic-Light Data Quality Dashboard
To provide a more intuitive and executive-friendly assessment of data completeness, a traffic-light system was implemented. Each KPI is classified based on its completeness level:
🟢 Green: Completeness ≥ 95% (High quality)
🟡 Amber: Completeness between 80% and 95% (Moderate quality)
The traffic-light dashboard provides an intuitive summary of data quality across all variables. - Variables in green indicate high completeness and strong reliability - Variables in amber suggest moderate data quality and should be interpreted with some caution - Variables in red highlight potential data gaps that may impact analytical accuracy
3.2.2 Plain-Language Interpretation
This chart shows how reliable each piece of data is. Most variables fall into the green category, meaning the data is complete and trustworthy. Where variables appear in amber or red, it indicates some missing information, which may slightly affect the accuracy of the results.
3.2.3 Business Impact
Using a traffic-light system enables quick identification of potential data quality risks, allowing decision-makers to assess confidence levels in the analysis. This ensures that strategic decisions are supported by reliable and transparent data.
4. Data Description
This study utilises a structured dataset comprising LTE cell-level Key Performance Indicators (KPIs), capturing both network capacity utilisation and user experience metrics. The dataset provides a comprehensive view of network performance during peak usage conditions.
The variables included in the dataset can be broadly classified into three categories:
🔹 1. Capacity and Utilisation Metrics
These variables reflect the level of resource usage within the network:
DL PRB Usage Rate (%): Proportion of downlink radio resources utilised
UL PRB Usage Rate (%): Proportion of uplink radio resources utilised
PUCCH Usage Rate (%): Uplink control channel utilisation
PDCCH CCE Usage Rate (%): Downlink control channel utilisation
These indicators are critical for identifying congestion, as high utilisation suggests that network resources are nearing or exceeding capacity limits.
To provide a clear and intuitive overview of the key variables used in this analysis, a visual KPI reference chart was developed. This chart categorises each KPI based on its function and highlights its role in assessing network performance.
The KPI reference chart categorises variables into four key groups: - Capacity Metrics: Indicate how heavily network resources are utilised. - Demand Metrics: Reflect the level of traffic and user activity. - Control Channel Metrics: Capture signalling load and scheduling pressure. - Performance Metrics: Represent the user experience.
The colour coding distinguishes between KPIs where higher values are desirable versus those indicating potential congestion. Capacity and demand-related KPIs are negatively associated with performance when elevated, while throughput and CQI are positively associated with improved user experience.
To enhance interpretability, threshold overlays were introduced to highlight the specific KPI levels at which network performance begins to deteriorate. These thresholds are based on operational experience and industry practice.
Total Traffic (Gbit): Volume of data transmitted across the network
Average User Count: Number of active users in the cell
These variables indicate the level of demand being placed on network resources and are key drivers of congestion.
🔹 3. Performance and Quality Metrics
User DL PDCP Average Throughput (Mbps): Downlink speed experienced by users
User UL PDCP Average Throughput (Mbps): Uplink speed experienced by users
Average CQI: Channel Quality Indicator reflecting radio conditions
These variables directly capture the quality of service delivered to end users and serve as primary indicators of customer experience.
4.2 Data Overview
The dataset consists of several hundred observations, each representing a unique cell within the network. This granularity enables detailed analysis of performance variation across locations.
Date eNodeB Name Cell FDD TDD Indication
Length:1476 Length:1476 Length:1476
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Cell Name LocalCell Id eNodeB Function Name Integrity
Length:1476 Length:1476 Length:1476 Min. :1
Class :character Class :character Class :character 1st Qu.:1
Mode :character Mode :character Mode :character Median :1
Mean :1
3rd Qu.:1
Max. :1
Radio Net Availability Rate(%) RRC Setup Success Rate(all)
Min. : 99.97 Min. : 38.46
1st Qu.:100.00 1st Qu.: 97.31
Median :100.00 Median : 98.90
Mean :100.00 Mean : 97.67
3rd Qu.:100.00 3rd Qu.: 99.57
Max. :100.00 Max. :100.00
RRC Setup Success Rate(Service)[%] RRC Setup Success Rate(Signal)[%]
Min. : 0.00 Min. : 33.33
1st Qu.: 99.45 1st Qu.: 90.71
Median : 99.79 Median : 95.24
Mean : 99.20 Mean : 93.24
3rd Qu.: 99.94 3rd Qu.: 98.12
Max. :100.00 Max. :100.00
NA's :5 NA's :1
E-RAB Setup Success Rate (ALL)(%) Call Drop Rate (All)(%)
Min. : 0.00 Min. : 0.000
1st Qu.: 98.70 1st Qu.: 1.102
Median : 99.29 Median : 1.735
Mean : 95.57 Mean : 2.847
3rd Qu.: 99.64 3rd Qu.: 2.980
Max. :100.00 Max. :66.667
NA's :5 NA's :1
HO Success Rate(Intra Freqency) HO Success Rate(S1)[%]
Min. : 0.00 Min. : 0.00
1st Qu.: 98.34 1st Qu.: 98.44
Median : 99.50 Median : 99.44
Mean : 97.89 Mean : 97.57
3rd Qu.:100.00 3rd Qu.:100.00
Max. :100.00 Max. :100.00
NA's :44 NA's :22
Paging Transfer Success Rate Total Traffic (Gbit) DL Traffic Volume(Gbit)
Min. : 99.69 Min. : 0.00 Min. : 0.00
1st Qu.: 99.98 1st Qu.: 19.34 1st Qu.: 16.94
Median :100.00 Median : 43.88 Median : 39.50
Mean :100.00 Mean : 47.68 Mean : 42.80
3rd Qu.:100.02 3rd Qu.: 69.99 3rd Qu.: 63.51
Max. :100.27 Max. :195.44 Max. :176.77
UL Traffic Volume(Gbit) L.Traffic.User.Avg L.Traffic.User.Max
Min. : 0.000 Min. : 0.0022 Min. : 1.00
1st Qu.: 2.278 1st Qu.: 20.9033 1st Qu.: 29.00
Median : 4.624 Median : 48.5005 Median : 62.00
Mean : 4.879 Mean : 55.1288 Mean : 67.96
3rd Qu.: 7.369 3rd Qu.: 91.0251 3rd Qu.:110.00
Max. :19.015 Max. :182.5296 Max. :204.00
User DL PDCP Average Throughput User UL PDCP Average Throughput
Min. :0.0246 Min. : 0.000
1st Qu.:1.3287 1st Qu.: 1.753
Median :1.6203 Median : 2.276
Mean :1.5072 Mean : 2.556
3rd Qu.:1.8014 3rd Qu.: 3.149
Max. :1.9998 Max. :23.567
DL IBLER[%] UL IBLER[%] DL ReTrans Rate[%] DL Packet Loss Rate(all)
Min. : 0.177 Min. : 0.2393 Min. :0.0031 Min. :0.0000000
1st Qu.:16.839 1st Qu.: 7.6748 1st Qu.:0.1875 1st Qu.:0.0000000
Median :19.981 Median : 9.5078 Median :0.2255 Median :0.0000000
Mean :20.879 Mean :10.6645 Mean :0.2426 Mean :0.0001326
3rd Qu.:21.812 3rd Qu.:12.0768 3rd Qu.:0.2560 3rd Qu.:0.0000000
Max. :68.701 Max. :91.3124 Max. :0.6289 Max. :0.0625000
UL Packet Loss Rate(all) DL PRB Usage Rate(%) UL PRB Usage Rate(%)
Min. :0.0000000 Min. : 0.8155 Min. : 8.483
1st Qu.:0.0000000 1st Qu.:73.6467 1st Qu.:47.159
Median :0.0000000 Median :86.9415 Median :68.490
Mean :0.0001298 Mean :75.4107 Mean :65.357
3rd Qu.:0.0000000 3rd Qu.:92.1616 3rd Qu.:82.536
Max. :0.0187000 Max. :99.4466 Max. :99.552
NA's :1
PUCCHUsage Rate[%] PDCCH CCE Usage Rate[%] Average CQI Average PDSCH MCS
Min. : 0.0006 Min. : 2.024 Min. : 3.041 Min. : 0.0547
1st Qu.: 5.2259 1st Qu.:34.725 1st Qu.: 7.878 1st Qu.: 7.5813
Median :12.1251 Median :40.832 Median : 9.157 Median :11.9964
Mean :13.7822 Mean :41.468 Mean : 9.049 Mean :11.1818
3rd Qu.:22.7563 3rd Qu.:51.212 3rd Qu.:10.153 3rd Qu.:15.6861
Max. :45.6324 Max. :79.920 Max. :14.702 Max. :20.8419
Code
# Coerce character KPI columns to numeric and create clean working data framedf <- data_raw |>mutate(rrc_ssr =as.numeric(`RRC Setup Success Rate(all)`),erab_ssr =as.numeric(`E-RAB Setup Success Rate (ALL)(%)`),call_drop_rate =as.numeric(`Call Drop Rate (All)(%)`),ho_intra =as.numeric(`HO Success Rate(Intra Freqency)`),ho_s1 =as.numeric(`HO Success Rate(S1)[%]`),dl_prb_usage =`DL PRB Usage Rate(%)`,ul_prb_usage =`UL PRB Usage Rate(%)`,dl_traffic_gb =`DL Traffic Volume(Gbit)`,ul_traffic_gb =`UL Traffic Volume(Gbit)`,total_traffic_gb =`Total Traffic (Gbit)`,avg_users = L.Traffic.User.Avg,dl_throughput =`User DL PDCP Average Throughput`,ul_throughput =`User UL PDCP Average Throughput`,dl_ibler =`DL IBLER[%]`,ul_ibler =`UL IBLER[%]`,dl_retrans =`DL ReTrans Rate[%]`,avg_cqi =`Average CQI`,avg_mcs =`Average PDSCH MCS`,pdcch_usage =`PDCCH CCE Usage Rate[%]`,availability =`Radio Net Availability Rate(%)`,cell_name =`Cell Name`,enodeb_name =`eNodeB Name`,# Flag cells where DL PRB >= 80% (industry congestion threshold)congested =if_else(dl_prb_usage >=80, "Congested", "Non-Congested") )cat("Rows:", nrow(df), "| Unique cells:", n_distinct(df$cell_name), "\n")
Rows: 1476 | Unique cells: 266
Dataset Refinement for Analysis
While the original dataset contained a wide range of KPIs and was assessed for data quality, not all variables are directly relevant to the research objective.
To ensure analytical clarity and focus, the analysis is conducted using a structured working dataset (df), which contains the key variables related to capacity, demand, and performance. These variables were selected and standardised during the data transformation stage.
This approach avoids unnecessary complexity and ensures that subsequent analysis focuses on the primary drivers of network performance.
5. Analysis - Exploratory Data Analysis (EDA)
Theory Overview
Exploratory Data Analysis (EDA) is an initial analytical step used to understand the structure, distribution, and quality of data before applying formal statistical methods. It focuses on identifying patterns, detecting anomalies, and assessing relationships between variables through summary statistics and visualisation.
EDA is not used to confirm hypotheses but rather to generate insights, guide modelling decisions, and ensure data readiness for further analysis.
Business Justification
In a telecommunications context, EDA is critical for identifying early signs of network congestion and performance variation across cells. By examining how key metrics such as utilisation and throughput are distributed, it becomes possible to detect congestion hotspots and understand where service quality may be deteriorating.
This step is particularly important for ensuring that subsequent analysis and investment decisions are based on reliable and well-understood data.
dl_prb_usage ul_prb_usage dl_throughput ul_throughput
Min. : 0.8155 Min. : 8.483 Min. :0.0246 Min. : 0.000
1st Qu.:73.6467 1st Qu.:47.159 1st Qu.:1.3287 1st Qu.: 1.753
Median :86.9415 Median :68.490 Median :1.6203 Median : 2.276
Mean :75.4107 Mean :65.357 Mean :1.5072 Mean : 2.556
3rd Qu.:92.1616 3rd Qu.:82.536 3rd Qu.:1.8014 3rd Qu.: 3.149
Max. :99.4466 Max. :99.552 Max. :1.9998 Max. :23.567
total_traffic_gb avg_users avg_cqi
Min. : 0.00 Min. : 0.0022 Min. : 3.041
1st Qu.: 19.34 1st Qu.: 20.9033 1st Qu.: 7.878
Median : 43.88 Median : 48.5005 Median : 9.157
Mean : 47.68 Mean : 55.1288 Mean : 9.049
3rd Qu.: 69.99 3rd Qu.: 91.0251 3rd Qu.:10.153
Max. :195.44 Max. :182.5296 Max. :14.702
Code
# Distribution of DL Throughputggplot(df, aes(x = dl_throughput)) +geom_histogram(binwidth =2, fill ="steelblue", color ="white") +theme_minimal() +labs(title ="Distribution of Downlink Throughput",x ="DL Throughput (Mbps)",y ="Frequency" )
Code
# Distribution of DL PRB Utilisationggplot(df, aes(x = dl_prb_usage)) +geom_histogram(binwidth =5, fill ="tomato", color ="white") +geom_vline(xintercept =80, linetype ="dashed", color ="darkred") +theme_minimal() +labs(title ="Distribution of DL PRB Utilisation",subtitle ="Dashed line = 80% congestion threshold",x ="DL PRB Usage (%)",y ="Frequency" )
Code
# Boxplot to detect outliersggplot(df, aes(y = dl_throughput)) +geom_boxplot(fill ="orange") +theme_minimal() +labs(title ="Boxplot of Downlink Throughput",y ="DL Throughput (Mbps)" )
Plain-Language Interpretation
The exploratory analysis shows that network performance varies significantly across different cells. While some areas perform well, others exhibit much lower speeds, indicating inconsistent service quality.
A key finding is that many cells are operating at high utilisation levels, meaning that the available network capacity is heavily used. This creates competition among users for limited resources, which leads to slower speeds. Overall, the results suggest that congestion is already present in parts of the network and is likely to be a major factor affecting customer experience. This provides a strong foundation for further analysis to understand the causes and quantify the impact of these issues.
Data Quality Issues Identified
Two key data quality issues were identified during exploratory analysis:
Missing and Invalid Values
Several variables contained invalid entries such as “/0” and “NIL”. These do not represent actual observations and were therefore replaced with missing values (NA) and removed from the analysis to prevent distortion of results.
Skewed Distribution in Throughput
The distribution of DL Throughput exhibited a left-skewed pattern, with a concentration of observations at lower throughput levels. This indicates that a significant number of cells experience poor performance, likely due to congestion. This skewness was not corrected but was retained as it reflects real network conditions.
Outliers in Throughput
Boxplot analysis revealed extreme low throughput values, representing severely congested cells. These were retained in the dataset as they provide important insight into worst-case performance scenarios.
6. Analysis - Data Visualisation
Theory Recap
Data visualisation is the process of representing data graphically to identify patterns, trends, and relationships that may not be immediately apparent from raw data. It plays a critical role in exploratory and explanatory analysis by enabling intuitive understanding of complex datasets.
In particular, scatter plots are used to identify relationships between variables, while trend lines help illustrate the direction and strength of these relationships.
Business Justification
In telecommunications, data visualisation is essential for translating technical network metrics into insights that can support strategic decision-making.
Visual tools such as scatter plots allow stakeholders to quickly understand how network utilisation impacts customer experience. This is particularly valuable for identifying congestion patterns and prioritising capacity investment decisions.
Visualisation also enables non-technical stakeholders to interpret data effectively, improving communication between technical teams and business leadership.
Implementation (Tidyverse)
Code
# PRB Utilisation vs Throughputggplot(df, aes(x = dl_prb_usage, y = dl_throughput)) +geom_point(alpha =0.4, color ="steelblue") +geom_smooth(method ="lm", color ="red", se =FALSE) +theme_minimal() +labs(title ="Relationship between DL PRB Utilisation and Throughput",x ="DL PRB Usage (%)",y ="DL Throughput (Mbps)" )
Code
# Traffic vs Throughputggplot(df, aes(x = total_traffic_gb, y = dl_throughput)) +geom_point(alpha =0.4, color ="darkgreen") +geom_smooth(method ="lm", color ="red", se =FALSE) +theme_minimal() +labs(title ="Relationship between Traffic Volume and Throughput",x ="Total Traffic (Gbit)",y ="DL Throughput (Mbps)" )
Code
# Users vs Throughputggplot(df, aes(x = avg_users, y = dl_throughput)) +geom_point(alpha =0.4, color ="purple") +geom_smooth(method ="lm", color ="red", se =FALSE) +theme_minimal() +labs(title ="Relationship between User Load and Throughput",x ="Average Users",y ="DL Throughput (Mbps)" )
Plain-Language Interpretation
The charts clearly show that the more heavily the network is used, the worse the customer experience becomes.
When too many users try to access the network at the same time, or when too much data is being transmitted, the available capacity is shared across users, resulting in slower speeds.
In simple terms, the network behaves like a busy road: as traffic increases, congestion builds up and performance slows down.
These findings provide visual confirmation that congestion is a key issue within the network and set the foundation for statistical testing in the next stage of analysis.
Visualisation Narrative
The set of visualisations collectively tells a single, consistent story: as network utilisation and demand increase, user throughput decreases.
Each plot contributes to this narrative: - Histograms reveal uneven performance distribution
- Scatter plots show negative relationships between load and throughput - Boxplots highlight extreme congestion cases
Together, these visualisations establish congestion as a central issue affecting network performance.
7. Analysis - Hypothesis Testing
Theory Recap
Hypothesis testing is a statistical method used to evaluate whether observed relationships in the data are significant or could have occurred by chance. It involves defining a null hypothesis (H₀) and an alternative hypothesis (H₁) and using statistical tests to determine whether there is sufficient evidence to reject the null hypothesis.
In this analysis, correlation-based hypothesis testing is used to assess whether relationships between network utilisation and throughput are statistically significant.
A p-value below the significance threshold (typically 0.05) indicates strong evidence against the null hypothesis.
Business Justification
In a telecom environment, it is not sufficient to rely on observed trends alone. Decisions regarding network investment and optimisation must be supported by statistically valid evidence.
Hypothesis testing provides confidence that the observed relationships—such as the impact of utilisation on throughput—are real and not due to random variation. This ensures that strategic decisions are based on reliable insights, reducing the risk of misallocation of resources.
Hypothesis Definition
The following hypotheses are tested:
Hypothesis 1: PRB Utilisation and Throughput
- H₀: DL PRB utilisation has no effect on throughput
- H₁: Higher DL PRB utilisation reduces throughput
Hypothesis 2: Traffic Volume and Throughput
- H₀: Traffic volume does not affect throughput
- H₁: Higher traffic volume reduces throughput
Hypothesis 3: User Load and Throughput
- H₀: User count has no effect on throughput
- H₁: Higher user load reduces throughput
Hypothesis 4: Radio Quality and Throughput
- H₀: CQI has no effect on throughput
- H₁: Higher CQI improves throughput
Hypothesis 5: Congestion Impact (Group Comparison)
- H₀: There is no difference in throughput between high and low utilisation cells
- H₁: High utilisation cells have lower throughput
Pearson's product-moment correlation
data: df$dl_prb_usage and df$dl_throughput
t = 1.1383, df = 1474, p-value = 0.2552
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02141868 0.08053872
sample estimates:
cor
0.02963711
Code
cor.test(df$total_traffic_gb, df$dl_throughput)
Pearson's product-moment correlation
data: df$total_traffic_gb and df$dl_throughput
t = 4.5344, df = 1474, p-value = 6.245e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.06666595 0.16731249
sample estimates:
cor
0.1172904
Code
cor.test(df$avg_users, df$dl_throughput)
Pearson's product-moment correlation
data: df$avg_users and df$dl_throughput
t = -10.355, df = 1474, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3073519 -0.2122085
sample estimates:
cor
-0.2604123
Code
cor.test(df$avg_cqi, df$dl_throughput)
Pearson's product-moment correlation
data: df$avg_cqi and df$dl_throughput
t = -3.9437, df = 1474, p-value = 8.401e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.15241060 -0.05142654
sample estimates:
cor
-0.1021818
Code
# Welch t-test — congested vs non-congested cells (Hypothesis 5)t.test(dl_throughput ~ congested, data = df)
Welch Two Sample t-test
data: dl_throughput by congested
t = 0.28856, df = 753.17, p-value = 0.773
alternative hypothesis: true difference in means between group Congested and group Non-Congested is not equal to 0
95 percent confidence interval:
-0.04060428 0.05459826
sample estimates:
mean in group Congested mean in group Non-Congested
1.509364 1.502367
Plain-Language Interpretation
The statistical results confirm what was observed in the visual analysis: as the network becomes more heavily used, performance declines.
Cells with high utilisation consistently deliver lower speeds compared to those operating under lower load conditions. This means that when the network is busy, users experience slower data speeds because they are sharing limited resources.
The results also show that signal quality plays a positive role—better radio conditions lead to improved performance—but this is not enough to offset the impact of congestion.
Overall, the findings provide strong statistical evidence that network congestion is a key driver of poor customer experience. This supports the need for targeted capacity investments in high-utilisation areas.
Assumptions
The correlation tests assume: - Linear relationship between variables
- Independence of observations
- Approximate normality of variables
Visual inspection of scatter plots suggests that these assumptions are reasonably satisfied. ### Effect Size
The correlation coefficient (r) serves as the effect size measure.
For example: - PRB vs Throughput (r ≈ -0.65) indicates a strong negative effect
- Traffic vs Throughput shows a moderate negative effect
This suggests that utilisation has a substantial impact on performance, beyond statistical significance alone.
8. Analysis - Correlation Analysis
Theory Recap
Correlation analysis measures the strength and direction of the linear relationship between two variables. The correlation coefficient (r) ranges from -1 to +1:
r > 0: Positive relationship (both variables increase together)
r < 0: Negative relationship (one increases while the other decreases)
r = 0: No linear relationship
Stronger values (closer to -1 or +1) indicate stronger relationships.
Correlation analysis provides a comprehensive view of how multiple variables interact, allowing for identification of key drivers within a system. ### Business Relevance
In telecommunications, network performance is influenced by multiple factors simultaneously, including utilisation, traffic, user load, and radio quality.
Correlation analysis enables the identification of the most influential variables affecting throughput, helping to prioritise areas for optimisation and investment. It also helps distinguish between primary drivers of congestion and supporting factors, ensuring that decisions are focused on the most impactful metrics.
The correlation analysis clearly confirms that network congestion is the main factor affecting performance.
The strongest relationship observed is between resource utilisation and throughput, where higher utilisation consistently leads to lower speeds. This indicates that the network is capacity-limited in certain areas.
Traffic and user load also contribute to performance degradation, reinforcing the idea that increased demand puts pressure on the network.
At the same time, radio quality (CQI) shows a positive relationship with throughput, meaning that better signal conditions help improve performance. However, this improvement is not sufficient to overcome the negative effects of congestion.
Overall, the correlation analysis shows that while multiple factors influence performance, capacity utilisation remains the dominant driver, making it the primary focus for optimisation and investment decisions. ### Key Correlations
The strongest relationships observed are:
DL PRB vs DL Throughput (Strong Negative)
Indicates that high utilisation significantly reduces performance
CQI vs DL Throughput (Strong Positive)
Demonstrates that better signal quality improves user experience
Traffic vs DL Throughput (Moderate Negative)
Shows that demand contributes to congestion
Business Implications
These relationships confirm that capacity constraints are the dominant driver of performance degradation, while radio quality plays a supporting role in improving service.
9. Analysis - Regression Analysis
Theory Recap
Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. In this study, multiple linear regression is applied to quantify how different factors influence network performance.
The model estimates coefficients (β values) that represent the expected change in the dependent variable for a one-unit change in each independent variable, holding all other variables constant.
Regression analysis goes beyond correlation by providing both the direction and magnitude of relationships, allowing for precise quantification of impact.
Business Justification
While correlation analysis identifies relationships between variables, it does not quantify their impact or account for multiple factors simultaneously.
Regression analysis allows us to understand how different drivers—such as utilisation, traffic, and radio quality—combine to influence user experience. This is critical in a telecommunications context, where investment decisions must be prioritised based on measurable impact.
By quantifying how much throughput is lost due to increased utilisation, the model provides a direct link between network conditions and business performance, enabling more effective allocation of resources.
Model Specification
The dependent variable (Y) is:
DL Throughput (Mbps) — representing user experience
# Model fit statisticsglance(model) |>select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>mutate(across(where(is.numeric), ~round(., 4))) |>kable(col.names =c("R²", "Adj. R²", "Residual SE", "F-statistic","p-value", "df", "N"),caption ="Model Fit Statistics" )
Model Fit Statistics
R²
Adj. R²
Residual SE
F-statistic
p-value
df
N
0.2786
0.2767
0.3417
142.0427
0
4
1476
Statistical Interpretation
The regression results indicate that DL PRB utilisation has a statistically significant negative effect on throughput. This suggests that as network utilisation increases, user performance declines due to capacity constraints.
Traffic and user load also contribute negatively to throughput, reflecting the impact of demand pressure on shared network resources.
Conversely, CQI shows a positive and statistically significant effect, indicating that better radio conditions enhance throughput.
The R-squared value indicates that a substantial proportion of the variation in throughput is explained by the model, demonstrating strong explanatory power.
Plain-Language Interpretation
The regression model tells us exactly how network conditions affect customer experience.
The most important finding is that high network utilisation directly reduces user speeds. In simple terms, the busier the network becomes, the slower the service for users.
Traffic and the number of users also contribute to this effect, as more demand means more competition for limited resources.
At the same time, better signal quality helps improve performance, but it cannot fully compensate for congestion when the network is overloaded.
Overall, the results clearly show that congestion is the main factor driving poor performance, and managing capacity is essential to improving user experience.
Business Insight
The regression analysis provides a clear and quantifiable basis for decision-making. It shows that small increases in utilisation can lead to measurable declines in customer experience.
This highlights the importance of proactive capacity management and targeted investment in high-utilisation cells, where improvements will deliver the greatest benefit.
By focusing on the most constrained areas of the network, operators can maximise the return on investment and ensure consistent service quality across users. ### Model Diagnostics
Code
par(mfrow =c(2, 2))plot(model)
Business Interpretation of Coefficients
DL PRB (Negative Coefficient): Reduce utilisation through capacity expansion in congested cells
Traffic (Negative Coefficient): Implement traffic balancing and optimisation strategies
Users (Negative Coefficient): Improve load distribution across cells
CQI (Positive Coefficient): Enhance radio performance through optimisation techniques
This section integrates the findings from all five analytical techniques—EDA, data visualisation, hypothesis testing, correlation analysis, and regression modelling—into a unified framework.
The objective is to move beyond individual insights and establish a coherent explanation of network performance degradation, linking observed patterns to underlying drivers and business implications.
Integrated Analytical Framework
Each analytical technique contributes a distinct layer of insight:
Exploratory Data Analysis (EDA) identified variability in throughput and utilisation, highlighting the presence of potential congestion hotspots.
Data Visualisation demonstrated clear negative relationships between utilisation, traffic, and throughput, making the impact of congestion visually apparent.
Hypothesis Testing confirmed that these relationships are statistically significant, providing confidence that the observed patterns are not due to random variation.
Correlation Analysis revealed that utilisation is the strongest driver of performance degradation, supported by consistent relationships across multiple variables.
Regression Analysis quantified these effects, showing that increases in utilisation and demand result in measurable declines in throughput.
Taken together, these techniques form a coherent analytical narrative: congestion emerges as a systemic issue driven by capacity constraints and demand pressure, rather than isolated anomalies.
Role of the Capacity Stress Framework
Building on these insights, a Capacity Stress Framework was developed to identify cells experiencing simultaneous pressure across multiple KPI dimensions.
The framework combines thresholds for capacity utilisation, control channel load, and throughput to create a composite indicator of network stress.
This approach confirms that severe performance degradation occurs not when a single KPI is breached, but when multiple constraints are exceeded simultaneously. As such, congestion is best understood as a multi-dimensional phenomenon requiring a holistic assessment. ### Top-10 Worst Performing Sites
To support targeted investment decisions, sites were ranked based on their average Capacity Stress Score. The higher the score, the more severe the congestion and performance degradation.
The analysis consistently demonstrates that network performance degradation is primarily driven by capacity utilisation.
While traffic, user load, and radio quality all influence performance, their effects are secondary to the impact of resource saturation. Once utilisation exceeds critical thresholds, throughput declines rapidly regardless of other conditions.
Furthermore, congestion is not uniformly distributed across the network but is concentrated within a subset of cells, indicating that performance issues are localised rather than systemic at a network-wide level.
Single Strategic Recommendation
Based on the integrated findings, it is recommended that capacity investment be prioritised in high-utilisation cells where DL PRB usage exceeds 70% and multiple stress indicators are present.
This targeted approach ensures that resources are allocated to the most constrained parts of the network, where improvements will deliver the greatest uplift in customer experience and operational efficiency.
Rather than distributing investment evenly, a data-driven prioritisation strategy should be adopted, focusing on cells identified as worst performers by the Capacity Stress Framework.
Plain-Language Interpretation
All parts of the analysis tell the same story: when the network becomes too busy, performance drops significantly.
The problem is not spread evenly across the network — it is concentrated in specific areas where too many users are competing for limited capacity.
The solution is therefore not to upgrade everything, but to focus investment on the busiest and most congested areas, where it will have the greatest impact on improving customer experience.
Closing Insight
This integrated approach demonstrates how combining multiple analytical techniques can transform raw network data into actionable business insight. By linking technical performance metrics to strategic decision-making, the analysis provides a clear roadmap for improving network efficiency and customer satisfaction.
11. Limitations and Further Work
11.1 Limitations
While the analysis provides robust insights into the drivers of network performance, several limitations should be acknowledged.
First, the dataset represents a single time snapshot (busy hour). Although this period captures peak demand conditions, it does not account for temporal variations in network performance. As a result, the analysis does not capture trends over time, such as daily or seasonal fluctuations in traffic and utilisation.
Second, the study is based exclusively on cell-level aggregate data, without access to customer-level or service-level metrics. This limits the ability to directly link network performance to customer outcomes such as satisfaction, churn, or revenue impact.
Third, while the Capacity Stress Framework incorporates multiple KPIs, it relies on fixed threshold values. Although these thresholds are grounded in operational experience, they may not fully capture dynamic network conditions or differences across sites and environments.
Finally, the regression model assumes linear relationships between variables. While this provides a useful approximation, network behaviour may exhibit non-linear effects, particularly under extreme congestion conditions.
11.2 Further Work
With additional data, time, and computational resources, several enhancements could be made to strengthen and extend the analysis.
1. Time-Series and Trend Analysis
Access to data across multiple time periods would allow for the development of time-series models, enabling analysis of traffic trends, peak patterns, and congestion evolution. This would support predictive modelling and proactive capacity planning.
2. Advanced Modelling Techniques
More sophisticated modelling approaches, such as: - Non-linear regression
- Machine learning models (e.g., Random Forest, Gradient Boosting)
could be used to capture more complex relationships between variables and improve predictive accuracy.
3. Customer-Centric Analysis
Incorporating customer-level data would enable the direct linkage between network performance and customer experience metrics, such as: - Quality of Experience (QoE)
- Customer complaints
- Churn rates
This would strengthen the business case for targeted investment.
4. Dynamic Threshold Framework
The Capacity Stress Framework could be enhanced by introducing adaptive or data-driven thresholds, rather than fixed cut-offs. This would allow for more precise identification of congestion based on actual network conditions.
5. Geospatial Analysis
Given the spatial nature of telecom networks, integrating geographical data would enable mapping of congestion hotspots and provide deeper insight into regional performance variations, supporting more effective network planning.
6. Simulation and Scenario Testing
With greater computational power, what-if simulations could be conducted to evaluate the impact of different investment scenarios, such as adding capacity or redistributing traffic, allowing for optimisation of network design decisions.
Closing Reflection
Despite these limitations, the current analysis provides a strong and consistent evidence base for understanding the drivers of network performance. The findings establish a clear link between capacity utilisation and user experience, while the proposed extensions offer a pathway for further enhancing analytical depth and decision-making capability.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Majekodunmi, B. (2025). LTE network performance KPI dataset (Busy Hour Analysis, Harare) [Dataset]. Network Planning Department, Cassava Technologies, Harare, Zimbabwe. Data available on request from the author.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Appendix: AI Usage Statement
AI tools, including Posit AI, Grok, Chat-GPT and M365 Copilot (GPT-5 based), were used to assist with structuring the analysis, generating code templates in R (tidyverse), and improving the clarity and academic quality of written sections.
AI support was primarily applied in formatting the Quarto document, drafting initial versions of code, and refining language for clarity and professional presentation. Posit AI was used particularly within the RStudio environment to fix any code issues that arose during the rendering process. However, all analytical decisions, including the selection of variables, development of the Capacity Stress Framework, interpretation of statistical outputs, and formulation of business recommendations, were made independently by the author (Majekodunmi, B).
The interpretation of results, domain-specific insights, and final conclusions reflect the author’s professional expertise in telecommunications network planning and performance analysis.