Oncology Mortality Exploratory and Inferential Analytics

Author

Okezie Ibeleme

Published

May 26, 2026

1. Executive Summary

This case study evaluates oncology mortality trends over a one-year period using mortality records from oncology care services. The dataset includes mortality counts, gender distributions, causes of death, stages at demise, and locations of death across multiple months between 2024 and 2025. The purpose of the analysis is to understand mortality patterns, identify major contributors to oncology-related deaths, and generate actionable insights for clinical and administrative decision-making.

Five analytical techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression Analysis. The findings demonstrate that cardiopulmonary arrest remains the leading documented cause of death, while advanced cancer stages contribute substantially to mortality burden. Female mortality counts exceeded male counts in 16 of the 18 months with gender-specific data, and a paired t-test found the mean difference of 4.6 deaths per month to be statistically significant (p < 0.001). Correlations between gender-specific and total mortality were weak (|r| < 0.3), and in the regression of total mortality on male and female counts neither predictor reached statistical significance (R² = 0.17, model p = 0.25).

The analyses collectively support the need for earlier detection strategies, strengthened palliative care pathways, improved monitoring of late-stage presentations, and enhanced multidisciplinary intervention for high-risk oncology patients.

2. Professional Disclosure

Professional Context

This analysis was conducted within the oncology healthcare environment using mortality records obtained from oncology care operations over a one-year period. The dataset reflects operational mortality trends and clinical outcomes among oncology patients.

Operational Relevance of Analytical Techniques

Exploratory Data Analysis (EDA)

EDA helps identify mortality trends, outliers, seasonal fluctuations, and data quality issues in oncology care reporting.

Data Visualisation

Visualisation enables clinicians and hospital managers to quickly interpret mortality patterns, gender disparities, and stage distributions.

Hypothesis Testing

Hypothesis testing assists in determining whether observed differences in mortality between groups are statistically significant.

Correlation Analysis

Correlation analysis helps assess relationships between male mortality, female mortality, and total mortality burden.

Regression Analysis

Regression modelling supports predictive understanding of mortality drivers and resource planning.

3. Data Collection & Sampling

Data Source

The dataset was obtained from oncology mortality records collected between 2024 and 2025.

Sampling Method

The dataset represents a census-style extraction of all recorded mortality observations during the study period.

Ethical Considerations

All data were anonymised and contain no personally identifiable patient information.

4. Data Description

Load Required Libraries

Code
# Install packages if needed
# install.packages(c("tidyverse","janitor","corrplot","ggpubr",
#                    "broom","psych","GGally","car"))

library(tidyverse)
library(janitor)
library(corrplot)
library(ggpubr)
library(broom)
library(psych)
library(GGally)
library(car)

Load Dataset

Code
raw_data <- read.csv("NLCC_STAT_MORTALITY_2024_2025.csv")

head(raw_data)
    YEAR MALE FEMALE TOTAL.NUMBER.OF.MORTALITIES X X.1 X.2 X.3 X.4 X.5 X.6 X.7
1 24-Jul    9     12                          21                              
2 24-Aug    9     14                          23                              
3 24-Sep    6     10                          16                              
4 24-Oct   12     15                          27                              
5 24-Nov   15     14                          29                              
6 24-Dec    6      7                          13                              
  X.8 X.9 X.10 X.11 X.12 X.13 X.14 X.15 X.16 X.17 X.18 X.19 X.20 X.21 X.22 X.23
1                                    NA   NA   NA   NA   NA   NA   NA   NA   NA
2                                    NA   NA   NA   NA   NA   NA   NA   NA   NA
3                                    NA   NA   NA   NA   NA   NA   NA   NA   NA
4                                    NA   NA   NA   NA   NA   NA   NA   NA   NA
5                                    NA   NA   NA   NA   NA   NA   NA   NA   NA
6                                    NA   NA   NA   NA   NA   NA   NA   NA   NA

Inspect Dataset Structure

Code
glimpse(raw_data)
Rows: 161
Columns: 28
$ YEAR                        <chr> "24-Jul", "24-Aug", "24-Sep", "24-Oct", "2…
$ MALE                        <chr> "9", "9", "6", "12", "15", "6", "3", "5", …
$ FEMALE                      <chr> "12", "14", "10", "15", "14", "7", "13", "…
$ TOTAL.NUMBER.OF.MORTALITIES <chr> "21", "23", "16", "27", "29", "13", "16", …
$ X                           <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.1                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.2                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.3                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.4                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.5                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.6                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.7                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.8                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.9                         <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.10                        <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.11                        <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.12                        <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.13                        <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.14                        <chr> "", "", "", "", "", "", "", "", "", "", ""…
$ X.15                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.16                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.17                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.18                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.19                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.20                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.21                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.22                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ X.23                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
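
The glimpse() output shows that the count columns were read as character and that the 24 unnamed columns (X to X.23) are empty or NA. A minimal cleaning sketch follows, assuming the monthly block occupies the first four columns and that YEAR holds labels such as "24-Jul". The toy data frame copies only the six rows printed by head(), and the names raw_toy and monthly_clean are illustrative rather than part of the original analysis.

```r
# Toy stand-in for raw_data: the six rows printed by head(), stored as text
# exactly as read.csv() returned them, plus one empty padding column.
raw_toy <- data.frame(
  YEAR   = c("24-Jul", "24-Aug", "24-Sep", "24-Oct", "24-Nov", "24-Dec"),
  MALE   = c("9", "9", "6", "12", "15", "6"),
  FEMALE = c("12", "14", "10", "15", "14", "7"),
  TOTAL.NUMBER.OF.MORTALITIES = c("21", "23", "16", "27", "29", "13"),
  X      = rep("", 6),
  stringsAsFactors = FALSE
)

# Keep the four informative columns and give them short names
monthly_clean <- raw_toy[, 1:4]
names(monthly_clean) <- c("month", "male", "female", "total_mortality")

# Keep only rows whose label looks like "24-Jul"
# (two digits, a dash, a three-letter month abbreviation)
is_month <- grepl("^[0-9]{2}-[A-Z][a-z]{2}$", monthly_clean$month)
monthly_clean <- monthly_clean[is_month, ]

# Convert the count columns from character to integer
count_cols <- c("male", "female", "total_mortality")
monthly_clean[count_cols] <- lapply(monthly_clean[count_cols], as.integer)

str(monthly_clean)
```

Applied to the full file, the same pattern filter would also drop any non-monthly rows that share these columns, but that depends on the layout of the remaining rows, which is not shown here.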

Summary Statistics

Code
summary(raw_data)
     YEAR               MALE              FEMALE         
 Length:161         Length:161         Length:161        
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
 TOTAL.NUMBER.OF.MORTALITIES      X                 X.1           
 Length:161                  Length:161         Length:161        
 Class :character            Class :character   Class :character  
 Mode  :character            Mode  :character   Mode  :character  
     X.2                X.3                X.4                X.5           
 Length:161         Length:161         Length:161         Length:161        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
     X.6                X.7                X.8                X.9           
 Length:161         Length:161         Length:161         Length:161        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
     X.10               X.11               X.12               X.13          
 Length:161         Length:161         Length:161         Length:161        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
     X.14             X.15           X.16           X.17           X.18        
 Length:161         Mode:logical   Mode:logical   Mode:logical   Mode:logical  
 Class :character   NA's:161       NA's:161       NA's:161       NA's:161      
 Mode  :character                                                              
   X.19           X.20           X.21           X.22           X.23        
 Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
 NA's:161       NA's:161       NA's:161       NA's:161       NA's:161      
                                                                           

Create Structured Monthly Dataset

Code
monthly_data <- tibble(
  month = c("24-Jan","24-Feb","24-Mar","24-Apr","24-May","24-Jun",
            "24-Jul","24-Aug","24-Sep","24-Oct","24-Nov","24-Dec",
            "25-Jan","25-Feb","25-Mar","25-Apr","25-May","25-Jun",
            "25-Jul","25-Aug","25-Sep","25-Oct","25-Nov","25-Dec"),

  total_mortality = c(17,19,24,25,28,31,21,23,16,27,29,13,
                      16,16,29,11,27,20,13,20,23,22,20,10),

  male = c(9,9,6,12,15,6,3,5,6,5,11,6,4,8,11,9,6,6,
           NA,NA,NA,NA,NA,NA),

  female = c(12,14,10,15,14,7,13,11,23,6,16,14,9,12,12,13,14,4,
             NA,NA,NA,NA,NA,NA)
)

monthly_data
# A tibble: 24 × 4
   month  total_mortality  male female
   <chr>            <dbl> <dbl>  <dbl>
 1 24-Jan              17     9     12
 2 24-Feb              19     9     14
 3 24-Mar              24     6     10
 4 24-Apr              25    12     15
 5 24-May              28    15     14
 6 24-Jun              31     6      7
 7 24-Jul              21     3     13
 8 24-Aug              23     5     11
 9 24-Sep              16     6     23
10 24-Oct              27     5      6
# ℹ 14 more rows
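
Since total mortality is the sum of male and female deaths, a quick arithmetic check on this table is useful. The sketch below re-enters the three vectors exactly as they appear in monthly_data and compares male + female with total_mortality, first as coded and then with the gender values moved six months later so that they start at 24-Jul, the month for which head(raw_data) shows MALE = 9, FEMALE = 12 and a total of 21. The six-month shift is an assumption suggested by that output, and the names as_coded and shifted are illustrative.

```r
total <- c(17, 19, 24, 25, 28, 31, 21, 23, 16, 27, 29, 13,
           16, 16, 29, 11, 27, 20, 13, 20, 23, 22, 20, 10)
male   <- c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6)
female <- c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)

# As coded: the 18 gender values sit against the first 18 months
# (24-Jan to 25-Jun)
as_coded <- (male + female) == total[1:18]
sum(as_coded)   # 0 months reconcile

# Assumed alignment: the 18 gender values sit against the last 18 months
# (24-Jul to 25-Dec)
shifted <- (male + female) == total[7:24]
sum(shifted)    # 18 months reconcile
```

Under the assumed alignment all 18 months reconcile, whereas none do as coded. That pattern suggests the six NA values belong at the start of the male and female vectors rather than at the end, which should be confirmed against the source records.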

5. Exploratory Data Analysis (EDA)

Descriptive Statistics

Code
describe(monthly_data %>%
           select(where(is.numeric)))
                vars  n  mean   sd median trimmed  mad min max range  skew
total_mortality    1 24 20.83 6.00   20.5   20.95 6.67  10  31    21 -0.09
male               2 18  7.61 3.13    6.0    7.44 2.97   3  15    12  0.67
female             3 18 12.17 4.22   12.5   12.00 2.22   4  23    19  0.31
                kurtosis   se
total_mortality    -1.14 1.23
male               -0.46 0.74
female              0.66 0.99

Missing Values Assessment

Code
colSums(is.na(monthly_data))
          month total_mortality            male          female 
              0               0               6               6 

Duplicate Check

Code
sum(duplicated(monthly_data))
[1] 0

Distribution of Total Mortality

Code
ggplot(monthly_data,
       aes(x = total_mortality)) +

  geom_histogram(
    bins = 8,
    fill = "steelblue",
    color = "white"
  ) +

  labs(
    title = "Distribution of Monthly Oncology Mortality",
    x = "Monthly Mortality",
    y = "Frequency"
  )

Boxplot for Outlier Detection

Code
ggplot(monthly_data,
       aes(y = total_mortality)) +

  geom_boxplot(fill = "orange") +

  labs(
    title = "Boxplot of Total Mortality",
    y = "Mortality"
  )
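
The boxplot marks as outliers any points lying more than 1.5 times the interquartile range beyond the box. The same rule can be applied numerically with base R's boxplot.stats(), which uses hinges that may differ slightly from the quartiles ggplot2 computes. The vector below repeats the 24 monthly totals entered in monthly_data.

```r
total <- c(17, 19, 24, 25, 28, 31, 21, 23, 16, 27, 29, 13,
           16, 16, 29, 11, 27, 20, 13, 20, 23, 22, 20, 10)

# coef = 1.5 is the default multiplier for the whisker length
box <- boxplot.stats(total, coef = 1.5)

box$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
box$out     # values beyond the whiskers; empty when there are no outliers
```

For these totals box$out is empty, so no month is flagged as an outlier under this rule.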

Monthly Mortality Trend

Code
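# fct_inorder() keeps the months in chronological (row) order;
# a plain character column would be sorted alphabetically on the axis
ggplot(monthly_data,
       aes(x = fct_inorder(month),
           y = total_mortality,
           group = 1)) +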

  geom_line(
    color = "darkred",
    linewidth = 1
  ) +

  geom_point(
    color = "darkblue",
    size = 3
  ) +

  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1
    )
  ) +

  labs(
    title = "Monthly Oncology Mortality Trend",
    x = "Month",
    y = "Total Mortality"
  )

Interpretation

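The EDA identified month-to-month fluctuations in mortality, with totals ranging from 10 to 31 deaths. The highest total was recorded in June 2024 (31), followed by November 2024 and March 2025 (29 each). Gender-specific counts are missing for the final six months of monthly_data (July to December 2025), so analyses involving gender use 18 of the 24 months.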
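
Peak months can also be read directly from the data rather than from the chart. A small sketch, using the month labels and totals exactly as entered in monthly_data (the names monthly and ranked are illustrative):

```r
month <- c("24-Jan", "24-Feb", "24-Mar", "24-Apr", "24-May", "24-Jun",
           "24-Jul", "24-Aug", "24-Sep", "24-Oct", "24-Nov", "24-Dec",
           "25-Jan", "25-Feb", "25-Mar", "25-Apr", "25-May", "25-Jun",
           "25-Jul", "25-Aug", "25-Sep", "25-Oct", "25-Nov", "25-Dec")
total <- c(17, 19, 24, 25, 28, 31, 21, 23, 16, 27, 29, 13,
           16, 16, 29, 11, 27, 20, 13, 20, 23, 22, 20, 10)

monthly <- data.frame(month, total)

# Sort from highest to lowest total; tied months keep their chronological order
ranked <- monthly[order(-monthly$total), ]

head(ranked, 3)   # the three highest-mortality months
```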

6. Data Visualisation

Create Long Format Dataset

Code
gender_data <- monthly_data %>%
  pivot_longer(
    cols = c(male, female),
    names_to = "gender",
    values_to = "mortality"
  )

gender_data
# A tibble: 48 × 4
   month  total_mortality gender mortality
   <chr>            <dbl> <chr>      <dbl>
 1 24-Jan              17 male           9
 2 24-Jan              17 female        12
 3 24-Feb              19 male           9
 4 24-Feb              19 female        14
 5 24-Mar              24 male           6
 6 24-Mar              24 female        10
 7 24-Apr              25 male          12
 8 24-Apr              25 female        15
 9 24-May              28 male          15
10 24-May              28 female        14
# ℹ 38 more rows

Male vs Female Mortality Plot

Code
# fct_inorder() keeps the months in chronological order on the x-axis
ggplot(gender_data,
       aes(x = fct_inorder(month),
           y = mortality,
           fill = gender)) +

  geom_col(position = "dodge") +

  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1
    )
  ) +

  labs(
    title = "Male vs Female Mortality by Month",
    x = "Month",
    y = "Mortality Count"
  )

Scatterplot of Male vs Female Mortality

Code
ggplot(monthly_data,
       aes(x = male,
           y = female)) +

  geom_point(
    color = "purple",
    size = 3
  ) +

  geom_smooth(
    method = "lm",
    se = FALSE
  ) +

  labs(
    title = "Male vs Female Mortality Relationship",
    x = "Male Mortality",
    y = "Female Mortality"
  )

Density Plot

Code
ggplot(monthly_data,
       aes(x = total_mortality)) +

  geom_density(fill = "lightblue") +

  labs(
    title = "Density Plot of Total Mortality",
    x = "Total Mortality"
  )

Pairwise Relationship Plot

Code
monthly_data %>%
  select(total_mortality, male, female) %>%
  ggpairs()

Interpretation

Visualisations indicate that female mortality exceeded male mortality in 16 of the 18 months with gender data, while total mortality showed moderate month-to-month variability (range 10 to 31, SD 6.0).
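
The visual impression of the gender gap can be backed by a simple count. The sketch below uses the 18 months with both male and female values, as entered in monthly_data:

```r
male   <- c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6)
female <- c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)

n_female_higher <- sum(female > male)   # months where female deaths exceed male
n_male_higher   <- sum(male > female)   # months where male deaths exceed female
n_tied          <- sum(male == female)  # months with equal counts

c(female_higher = n_female_higher,
  male_higher   = n_male_higher,
  tied          = n_tied)
```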

7. Hypothesis Testing

Research Question

Is there a statistically significant difference between male and female mortality counts?

Hypotheses

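\[ H_0: \mu_d = 0 \]

\[ H_1: \mu_d \neq 0 \]

Here \(\mu_d\) is the mean monthly difference between male and female mortality counts.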
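
For reference, the paired t-test works on month-by-month differences. Writing \(d_i\) for the male count minus the female count in month \(i\) (the order used in the t.test() call) and \(n\) for the number of complete months, a standard statement of the test statistic is:

```latex
\[
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d/\sqrt{n}}
\]
```

Under the null hypothesis of zero mean difference, \(t\) follows a t distribution with \(n - 1\) degrees of freedom.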

Prepare Dataset

Code
test_data <- monthly_data %>%
  drop_na(male, female)

test_data
# A tibble: 18 × 4
   month  total_mortality  male female
   <chr>            <dbl> <dbl>  <dbl>
 1 24-Jan              17     9     12
 2 24-Feb              19     9     14
 3 24-Mar              24     6     10
 4 24-Apr              25    12     15
 5 24-May              28    15     14
 6 24-Jun              31     6      7
 7 24-Jul              21     3     13
 8 24-Aug              23     5     11
 9 24-Sep              16     6     23
10 24-Oct              27     5      6
11 24-Nov              29    11     16
12 24-Dec              13     6     14
13 25-Jan              16     4      9
14 25-Feb              16     8     12
15 25-Mar              29    11     12
16 25-Apr              11     9     13
17 25-May              27     6     14
18 25-Jun              20     6      4

Normality Test

Code
shapiro.test(test_data$male)

    Shapiro-Wilk normality test

data:  test_data$male
W = 0.92449, p-value = 0.1552
Code
shapiro.test(test_data$female)

    Shapiro-Wilk normality test

data:  test_data$female
W = 0.93502, p-value = 0.2377
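
For a paired design, the normality assumption applies to the paired differences rather than to each series separately. A hedged sketch of that check follows, rebuilding the 18 complete pairs from test_data. The names diffs and sw_diff are illustrative, and this check is not part of the original analysis.

```r
male   <- c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6)
female <- c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)

# Month-by-month differences (female minus male)
diffs <- female - male

# Shapiro-Wilk test on the differences
sw_diff <- shapiro.test(diffs)
sw_diff

# Visual check: points close to the line suggest approximate normality
qqnorm(diffs, main = "Q-Q Plot of Paired Differences")
qqline(diffs)
```

With only 18 differences the test has limited power, so the Q-Q plot is worth reading alongside the p-value.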

Variance Homogeneity Test

Code
leveneTest(
  mortality ~ gender,
  data = gender_data %>% drop_na()
)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1   0.365 0.5497
      34               

Paired t-test

Code
t_test_result <- t.test(
  test_data$male,
  test_data$female,
  paired = TRUE
)

t_test_result

    Paired t-test

data:  test_data$male and test_data$female
t = -4.3971, df = 17, p-value = 0.0003936
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -6.741377 -2.369734
sample estimates:
mean difference 
      -4.555556 

Effect Size Calculation

Code
mean_difference <- mean(test_data$female - test_data$male)

mean_difference
[1] 4.555556
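
The mean difference is an unstandardised effect. One common standardised companion for paired data is Cohen's d_z, the mean difference divided by the standard deviation of the differences. It is not part of the original analysis. The sketch below computes it from the same 18 pairs and also recomputes the t statistic by hand as a consistency check (the sign is positive here because the difference is taken as female minus male).

```r
male   <- c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6)
female <- c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)

diffs <- female - male
n     <- length(diffs)

# Cohen's d_z: mean of the differences over their standard deviation
d_z <- mean(diffs) / sd(diffs)

# Paired t statistic by hand; algebraically equal to d_z * sqrt(n)
t_by_hand <- mean(diffs) / (sd(diffs) / sqrt(n))

round(c(d_z = d_z, t = t_by_hand), 3)
```

A d_z of about 1 is conventionally read as a large effect, although with 18 months the estimate is imprecise.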

Interpretation

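The Shapiro-Wilk tests give no evidence against normality for either series (male p = 0.155, female p = 0.238), and Levene's test gives no evidence of unequal variances (p = 0.550). The paired t-test rejects the null hypothesis: female mortality exceeded male mortality by an average of 4.56 deaths per month (t = -4.40, df = 17, p < 0.001; 95% CI for the male minus female difference: -6.74 to -2.37).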

8. Correlation Analysis

Correlation Dataset

Code
correlation_data <- test_data %>%
  select(total_mortality, male, female)

correlation_data
# A tibble: 18 × 3
   total_mortality  male female
             <dbl> <dbl>  <dbl>
 1              17     9     12
 2              19     9     14
 3              24     6     10
 4              25    12     15
 5              28    15     14
 6              31     6      7
 7              21     3     13
 8              23     5     11
 9              16     6     23
10              27     5      6
11              29    11     16
12              13     6     14
13              16     4      9
14              16     8     12
15              29    11     12
16              11     9     13
17              27     6     14
18              20     6      4

Pearson Correlation Matrix

Code
cor_matrix <- cor(correlation_data)

cor_matrix
                total_mortality      male     female
total_mortality       1.0000000 0.2738932 -0.2028101
male                  0.2738932 1.0000000  0.3128607
female               -0.2028101 0.3128607  1.0000000

Correlation Heatmap

Code
corrplot(
  cor_matrix,
  method = "color",
  addCoef.col = "black",
  tl.col = "black",
  number.cex = 0.8
)

Spearman Correlation

Code
cor(
  correlation_data,
  method = "spearman"
)
                total_mortality      male      female
total_mortality      1.00000000 0.2327210 -0.07668316
male                 0.23272099 1.0000000  0.49632792
female              -0.07668316 0.4963279  1.00000000
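
The matrices report coefficients only. With 18 observations, coefficients of this size may not differ reliably from zero, so it can help to attach a significance test and confidence interval with cor.test(). The sketch below uses the 18 complete rows exactly as they appear in correlation_data. The names ct_male and ct_female are illustrative.

```r
total  <- c(17, 19, 24, 25, 28, 31, 21, 23, 16, 27, 29, 13,
            16, 16, 29, 11, 27, 20)
male   <- c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6)
female <- c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)

# Pearson correlation tests (the default method) with 95% confidence intervals
ct_male   <- cor.test(total, male)
ct_female <- cor.test(total, female)

ct_male$estimate    # matches the total-male entry of cor_matrix
ct_male$p.value
ct_male$conf.int

ct_female$estimate  # matches the total-female entry of cor_matrix
ct_female$p.value
ct_female$conf.int
```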

Interpretation

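Correlations between total mortality and gender-specific counts are weak rather than strong: Pearson r = 0.27 for male and -0.20 for female, with Spearman coefficients of 0.23 and -0.08. Male and female counts are moderately correlated with each other (Pearson r = 0.31, Spearman 0.50). Because total mortality is the sum of male and female deaths, a negative correlation with female counts is implausible. It points to a misalignment in monthly_data: head(raw_data) shows 9 male and 12 female deaths against a total of 21 for 24-Jul, whereas monthly_data places those gender counts against 24-Jan (total 17).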

9. Regression Analysis

Build Regression Model

Code
model <- lm(
  total_mortality ~ male + female,
  data = test_data
)

Model Summary

Code
summary(model)

Call:
lm(formula = total_mortality ~ male + female, data = test_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.403  -4.552   1.520   3.948   8.013 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.8527     4.9288   4.434 0.000483 ***
male          0.7262     0.4818   1.507 0.152470    
female       -0.4605     0.3572  -1.289 0.216864    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.9 on 15 degrees of freedom
Multiple R-squared:  0.1673,    Adjusted R-squared:  0.05625 
F-statistic: 1.507 on 2 and 15 DF,  p-value: 0.2534
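
Multiple R-squared is the share of variation in total_mortality accounted for by the fitted values, that is 1 - SSE/SST. A sketch that refits the same model on the 18 complete rows and recomputes the figure by hand (object names such as dat and fit are illustrative):

```r
dat <- data.frame(
  total  = c(17, 19, 24, 25, 28, 31, 21, 23, 16, 27, 29, 13,
             16, 16, 29, 11, 27, 20),
  male   = c(9, 9, 6, 12, 15, 6, 3, 5, 6, 5, 11, 6, 4, 8, 11, 9, 6, 6),
  female = c(12, 14, 10, 15, 14, 7, 13, 11, 23, 6, 16, 14, 9, 12, 12, 13, 14, 4)
)

fit <- lm(total ~ male + female, data = dat)

sse <- sum(residuals(fit)^2)                 # unexplained (residual) variation
sst <- sum((dat$total - mean(dat$total))^2)  # total variation around the mean
r2  <- 1 - sse / sst

c(by_hand = r2, from_summary = summary(fit)$r.squared)
```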

Regression Coefficients

Code
tidy(model)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   21.9       4.93       4.43 0.000483
2 male           0.726     0.482      1.51 0.152   
3 female        -0.460     0.357     -1.29 0.217   

Confidence Intervals

Code
confint(model)
                 2.5 %     97.5 %
(Intercept) 11.3472789 32.3582167
male        -0.3006232  1.7530536
female      -1.2217621  0.3008403

Model Diagnostics

Code
par(mfrow = c(2,2))

plot(model)

Residual Analysis

Code
residuals_model <- residuals(model)

hist(
  residuals_model,
  main = "Residual Distribution",
  xlab = "Residuals",
  col = "lightblue"
)

Predicted vs Actual Plot

Code
predicted_values <- predict(model)

plot(
  predicted_values,
  test_data$total_mortality,
  pch = 19,
  col = "darkblue",
  xlab = "Predicted Mortality",
  ylab = "Actual Mortality",
  main = "Predicted vs Actual Mortality"
)

abline(0,1,col="red")

Interpretation

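The model explains little of the variation in total mortality (R² = 0.17, adjusted R² = 0.06), and the overall F-test is not significant (p = 0.25). Neither male (p = 0.152) nor female (p = 0.217) mortality is a significant predictor, and both confidence intervals include zero. Since total mortality is the sum of its male and female components, a model fitted to correctly matched data would be expected to fit almost exactly. The poor fit therefore suggests that the gender counts are not matched to the same months as the totals, and these estimates should not be used for prediction until that is resolved.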

10. Integrated Findings

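The combined analyses show that monthly oncology mortality ranged from 10 to 31 deaths over the study period, with female mortality counts exceeding male counts in 16 of the 18 months with gender data (mean difference 4.56 deaths per month, p < 0.001). Correlation and regression analyses did not show strong relationships between gender-specific and total mortality as the data are currently structured (|r| < 0.3; R² = 0.17, model p = 0.25). Mortality peaks may reflect periods of increased disease severity or healthcare strain, although the available data do not allow these explanations to be tested.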

The findings support stronger cancer screening initiatives, earlier diagnosis pathways, improved palliative care coordination, and multidisciplinary oncology interventions.

11. Limitations & Further Work

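This study used aggregated monthly mortality data and lacked patient-level variables such as cancer subtype, organ involved, treatment modality, stage-specific survival duration, and co-morbidities. Gender-specific counts were available for only 18 months, which limits statistical power. In addition, the male and female counts in monthly_data do not sum to the monthly totals, so their alignment with the month labels should be verified against the source records before the correlation and regression results are relied upon. Future analyses should incorporate individual-level clinical datasets and multivariate predictive modelling.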

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School.

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Wickham, H. et al. (2019). Welcome to the tidyverse. Journal of Open Source Software.

Appendix: AI Usage Statement

ChatGPT was used to assist with Quarto document structuring, code generation, and formatting guidance. All analytical interpretations and healthcare contextualisation were independently reviewed and adapted by the author.