Team Members
24063894 Tengku Muhamad Firdaus Mahmood Bin Tengku
Zambri
24054402 Siti Nuraishah Binti Ab Manam
17127309 Md Azrul Syaffiq Bin Md Suhaimi
23104958 Rahayu Rianti
17207347 Ahmad Kamil Hariz Bin Harizal
The dataset available at the GitHub repository MoH-Malaysia/data-darah-public is a public dataset related to blood donation facilities in Malaysia. The file donations_facility.csv provides information about blood donations collected across various facilities. This dataset is likely maintained by the Malaysian Ministry of Health (MoH) and is made publicly accessible to facilitate transparency, analysis, and research into blood donation trends.
This project explores blood donation trends in Malaysia from 2006 to the present. Using data visualization, time series forecasting, and clustering, it identifies key patterns, predicts future donation trends, and highlights facilities with low blood supply. The findings aim to support healthcare decision-making by optimizing resource allocation and enhancing donor engagement strategies.
Key Details:
Dataset Name: donations_facility.csv
Source: GitHub repository of the Malaysian Ministry of Health (MoH).
Purpose: Provides data for analyzing blood donation patterns, trends, and facility-specific performance.
Structure: Key columns such as:
• Date: The date of donation or reporting.
• Facility Name: Name of the blood donation facility.
• Number of Donations (daily): Count of donations recorded for the specific date and facility.
MOH Github link: https://github.com/MoH-Malaysia/data-darah-public/blob/main/donations_facility.csv
Link to download the `.Rmd` file: https://drive.google.com/file/d/18moZlC6FcO1BznYXBwVispyBGfs-iWen/view?usp=drive_link
options(repos = c(CRAN = "https://cloud.r-project.org/"))
#installing packages
install.packages('tidyverse')
install.packages('dplyr')
install.packages('ggplot2')
install.packages('readr')
install.packages('tidyr')
install.packages('scales')
install.packages('reshape2')
install.packages("zoo")
install.packages("factoextra")
install.packages("cluster")
install.packages("caret")
install.packages("forecast")
# loading libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Discovering dataset
# Read the dataset
df <- read.csv("donations_facility_dirty.csv")
# View initial structure
str(df)
summary(df)
glimpse(df) # Observation: $date feature is in <chr> dtype (should be date format)
# Count missing values per column
colSums(is.na(df))
## date hospital daily
## 0 0 9701
## blood_a blood_b blood_o
## 0 0 0
## blood_ab location_centre location_mobile
## 0 0 0
## type_wholeblood type_apheresis_platelet type_apheresis_plasma
## 0 0 0
## type_other social_civilian social_student
## 0 0 0
## social_policearmy donations_new donations_regular
## 0 0 0
## donations_irregular
## 0
# Percentage of missing values per column
colMeans(is.na(df)) * 100 # Observation: $daily's null values equates to 5% of the data
## date hospital daily
## 0.000000 0.000000 5.000077
## blood_a blood_b blood_o
## 0.000000 0.000000 0.000000
## blood_ab location_centre location_mobile
## 0.000000 0.000000 0.000000
## type_wholeblood type_apheresis_platelet type_apheresis_plasma
## 0.000000 0.000000 0.000000
## type_other social_civilian social_student
## 0.000000 0.000000 0.000000
## social_policearmy donations_new donations_regular
## 0.000000 0.000000 0.000000
## donations_irregular
## 0.000000
# Count duplicate rows
sum(duplicated(df)) # Observation: 32674 duplicates found
## [1] 32674
# Summary statistics for numeric columns
df %>%
summarise(across(where(is.numeric), list(mean = mean, sd = sd, min = min, max = max), na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## daily_mean daily_sd daily_min daily_max blood_a_mean blood_a_sd blood_a_min
## 1 45.49179 108.3807 0 2667 11.29406 27.33861 0
## blood_a_max blood_b_mean blood_b_sd blood_b_min blood_b_max blood_o_mean
## 1 590 12.37746 29.57617 0 842 18.98555
## blood_o_sd blood_o_min blood_o_max blood_ab_mean blood_ab_sd blood_ab_min
## 1 45.37577 0 1048 2.817413 6.869608 0
## blood_ab_max location_centre_mean location_centre_sd location_centre_min
## 1 187 23.69627 47.02021 0
## location_centre_max location_mobile_mean location_mobile_sd
## 1 1398 21.77943 88.5789
## location_mobile_min location_mobile_max type_wholeblood_mean
## 1 0 2667 44.5627
## type_wholeblood_sd type_wholeblood_min type_wholeblood_max
## 1 105.4878 0 2666
## type_apheresis_platelet_mean type_apheresis_platelet_sd
## 1 0.4453785 2.32669
## type_apheresis_platelet_min type_apheresis_platelet_max
## 1 0 38
## type_apheresis_plasma_mean type_apheresis_plasma_sd type_apheresis_plasma_min
## 1 0.3817501 2.165341 0
## type_apheresis_plasma_max type_other_mean type_other_sd type_other_min
## 1 71 0.08586876 2.261028 0
## type_other_max social_civilian_mean social_civilian_sd social_civilian_min
## 1 786 38.95665 88.25553 0
## social_civilian_max social_student_mean social_student_sd social_student_min
## 1 2610 5.255241 25.37266 0
## social_student_max social_policearmy_mean social_policearmy_sd
## 1 765 1.263812 14.30165
## social_policearmy_min social_policearmy_max donations_new_mean
## 1 0 1578 14.81892
## donations_new_sd donations_new_min donations_new_max donations_regular_mean
## 1 40.46328 0 2506 24.71651
## donations_regular_sd donations_regular_min donations_regular_max
## 1 60.17422 0 1216
## donations_irregular_mean donations_irregular_sd donations_irregular_min
## 1 5.940263 18.14029 0
## donations_irregular_max
## 1 670
# Boxplots to detect outliers
df %>%
select(where(is.numeric)) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 9701 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Cleaning Dataset
# Replace missing values in $daily with the sum of blood types
df <- df %>%
mutate(daily = ifelse(is.na(daily),
blood_a + blood_b + blood_o + blood_ab,
daily))
# Verify mutation
sum(is.na(df$daily)) # Observation: no more missing values in the data
## [1] 0
# Remove duplicate records
df <- df %>% distinct()
# Verify change
sum(duplicated(df)) # Observation: 0 duplicates found
## [1] 0
# Clean hospital names by removing special characters and capitalizing words
df$hospital <- df$hospital %>%
str_replace_all("[@#!]", "") %>% # Remove special characters
str_to_title() # Convert to title case
# Verify change
head(df$hospital)
## [1] "Hospital Sultanah Nora Ismail" "Hospital Sultanah Aminah"
## [3] "" "Hospital Sultanah Bahiyah"
## [5] "Hospital Raja Perempuan Zainab Ii" "Hospital Melaka"
# Apply Min-Max scaling to numeric columns
min_max_scaler <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
df_scaled <- df %>%
mutate(across(where(is.numeric), ~ min_max_scaler(.)))
# Verify scaling
summary(df_scaled)
## date hospital daily blood_a
## Length:159507 Length:159507 Min. :0.000000 Min. :0.00000
## Class :character Class :character 1st Qu.:0.000000 1st Qu.:0.00000
## Mode :character Mode :character Median :0.005999 Median :0.00678
## Mean :0.020739 Mean :0.02328
## 3rd Qu.:0.022497 3rd Qu.:0.02542
## Max. :1.000000 Max. :1.00000
## blood_b blood_o blood_ab location_centre
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.004751 Median :0.006679 Median :0.005348 Median :0.005723
## Mean :0.017879 Mean :0.022034 Mean :0.018324 Mean :0.020617
## 3rd Qu.:0.019002 3rd Qu.:0.023855 3rd Qu.:0.021390 3rd Qu.:0.025751
## Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## location_mobile type_wholeblood type_apheresis_platelet
## Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.006001 Median :0.00000
## Mean :0.009932 Mean :0.020330 Mean :0.01425
## 3rd Qu.:0.000000 3rd Qu.:0.022506 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000
## type_apheresis_plasma type_other social_civilian
## Min. :0.000000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.000000 Median :0.0000000 Median :0.005747
## Mean :0.006539 Mean :0.0001328 Mean :0.018154
## 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.020690
## Max. :1.000000 Max. :1.0000000 Max. :1.000000
## social_student social_policearmy donations_new donations_regular
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.0000000 Median :0.001197 Median :0.007401
## Mean :0.008353 Mean :0.0009741 Mean :0.007192 Mean :0.024722
## 3rd Qu.:0.003922 3rd Qu.:0.0000000 3rd Qu.:0.007183 3rd Qu.:0.025493
## Max. :1.000000 Max. :1.0000000 Max. :1.000000 Max. :1.000000
## donations_irregular
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.001492
## Mean :0.010784
## 3rd Qu.:0.010448
## Max. :1.000000
# Convert 'date' to Date format
df$date <- as.Date(df$date, format = "%Y-%m-%d")
# Verify structure
str(df)
## 'data.frame': 159507 obs. of 19 variables:
## $ date : Date, format: "2006-01-01" "2006-01-01" ...
## $ hospital : chr "Hospital Sultanah Nora Ismail" "Hospital Sultanah Aminah" "" "Hospital Sultanah Bahiyah" ...
## $ daily : num 87 0 0 208 0 1 0 0 0 0 ...
## $ blood_a : int 19 0 0 67 0 0 0 0 0 0 ...
## $ blood_b : int 20 0 0 62 0 0 0 0 0 0 ...
## $ blood_o : int 45 0 0 61 0 1 0 0 0 0 ...
## $ blood_ab : int 3 0 0 18 0 0 0 0 0 0 ...
## $ location_centre : int 87 0 0 208 0 1 0 0 0 0 ...
## $ location_mobile : int 0 0 0 0 0 0 0 0 0 0 ...
## $ type_wholeblood : int 87 0 0 208 0 1 0 0 0 0 ...
## $ type_apheresis_platelet: int 0 0 0 0 0 0 0 0 0 0 ...
## $ type_apheresis_plasma : int 0 0 0 0 0 0 0 0 0 0 ...
## $ type_other : int 0 0 0 0 0 0 0 0 0 0 ...
## $ social_civilian : int 86 0 0 197 0 1 0 0 0 0 ...
## $ social_student : int 1 0 0 8 0 0 0 0 0 0 ...
## $ social_policearmy : int 0 0 0 3 0 0 0 0 0 0 ...
## $ donations_new : int 36 0 0 1 0 0 0 0 0 0 ...
## $ donations_regular : int 49 0 0 207 0 1 0 0 0 0 ...
## $ donations_irregular : int 2 0 0 0 0 0 0 0 0 0 ...
# Check for blank values in $hospital
sum(df$hospital == "")
## [1] 7069
# Replace blanks with NA
df$hospital[df$hospital == ""] <- NA
# Remove rows with missing hospital values
df <- df %>% filter(!is.na(hospital))
# Verify changes
sum(df$hospital == "")
## [1] 0
sum(is.na(df$hospital))
## [1] 0
# Save the cleaned dataset
write.csv(df, "donations_facility_cleaned_normalized.csv", row.names = FALSE)
This part is exploring and analyzing blood donation data.
#------------------- Visualizations-------------------------
#There are 10 visualization which are:
#1. Create box plot for each facility - To cross check which facility has the highest blood donation
#2. Create the bubble chart - Distribution for each facility by average blood donation
#3. Create line chart for month year by ALL hospital - Monthly trending by month year for ALL hospital
#4. Create line chart for month year by CERTAIN hospital - Check trending for certain hospital from the list
#5. Bar Chart: Total Blood donations per year - Yearly trending of blood donation
#6.Bar Chart: Total donations by blood type - Total donation by blood type
#7. Line Chart: Total donations by blood type per year - Yearly trending by blood type
#8. Bar Chart: Total donations by donors type - Total donation by donor type
#9. Bar Chart: Total donation by donor types per year - Yearly trending by donor type
#10. To find out correlation between some variables - Check correlation within variables
{r-sumdata-hosp} # Summarize the data by hospital df_summarized <- df %>% group_by(hospital) %>% summarize( avg_daily = mean(daily, na.rm = TRUE), total_daily = sum(daily, na.rm = TRUE), count = n() )
# Create boxplot
library(ggplot2)
ggplot(df, aes(x = hospital, y = daily)) +
geom_boxplot() +
labs(title = "Daily Donations by Hospital", x = "Hospital", y = "Daily Donations") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Summarize the data by hospital
df_summarized <- df %>%
group_by(hospital) %>%
summarize(
avg_daily = mean(daily, na.rm = TRUE),
total_daily = sum(daily, na.rm = TRUE),
count = n()
)
# Create bubble chart
ggplot(df_summarized, aes(x = hospital, y = avg_daily, size = avg_daily, color = hospital)) +
geom_point(alpha = 0.6) + # Add transparency for better visualization
#geom_text(aes(label = avg_daily), nudge_x = 0.00001, color = "black") + # Add text on bubble
scale_size(name = "Average Daily", range = c(1, 10)) + # Control bubble size range
labs(
title = "Bubble Chart: Average Daily by Hospital",
x = "Hospital",
y = "Average Daily"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Add Month and Year columns
df$Month <- month(df$date, label = TRUE)
df$Year <- year(df$date)
# Example if want to apply filter data for the year range (2023 to 2024)
filtered_df <- df %>%
filter(Year >= 2023 & Year <= 2024)
#%>%filter(tolower(hospital) != "pusat darah negara") # If want to filter out Pusat Darah Negara
# Group by Hospital, Year, and Month and summarize daily total
summary_df <- filtered_df %>%
group_by(hospital, Year, Month) %>%
summarise(Total_Daily = sum(daily, na.rm = TRUE)) %>%
arrange(hospital, Year, Month)
## `summarise()` has grouped output by 'hospital', 'Year'. You can override using
## the `.groups` argument.
# Check daily donation by Hospital according to month year
#print(summary_df)
# Plot the data using ggplot2
ggplot(summary_df, aes(x = interaction(Year, Month, sep = "-"), y = Total_Daily, group = hospital, color = hospital)) +
geom_line(size = 1) +
geom_point() +
theme_minimal() +
labs(
title = "Monthly Total Daily Donations by Hospital",
x = "Year-Month",
y = "Total Daily Donations",
color = "Hospital"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
### List of hospital ###
#hospital duchess of kent
#hospital melaka
#hospital miri
#hospital pulau pinang
#hospital queen elizabeth ii
#hospital raja perempuan zainab ii
#hospital raja permaisuri bainun
#hospital seberang jaya
#hospital seri manjung
#hospital sibu
#hospital sultan haji ahmad shah
#hospital sultanah aminah
#hospital sultanah bahiyah
#hospital sultanah nora ismail
#hospital sultanah nur zahirah
#hospital taiping
#hospital tawau
#hospital tengku ampuan afzan
#hospital tengku ampuan rahimah
#hospital tuanku jaafar
#hospital umum sarawak
#pusat darah negara
# Add Month and Year columns
df$Month <- month(df$date, label = TRUE)
df$Year <- year(df$date)
# Example if want to apply filter data for the year range (2023 to 2024)
filtered_hospital <- df %>%
filter(Year >= 2023 & Year <= 2024) %>% # Filter here to check certain time range by year
filter(tolower(hospital) == "hospital duchess of kent") # If want to filter based on certain hospital
#print(filtered_hospital) # Check if filter correctly based on certain hospital
# Group by Hospital, Year, and Month and summarize daily total
summary_df <- filtered_hospital %>%
group_by(hospital, Year, Month) %>%
summarise(Total_Daily = sum(daily, na.rm = TRUE)) %>%
arrange(hospital, Year, Month)
## `summarise()` has grouped output by 'hospital', 'Year'. You can override using
## the `.groups` argument.
# Check daily donation by Hospital according to month year
# print(summary_df)
# Plot the data using ggplot2
ggplot(summary_df, aes(x = interaction(Year, Month, sep = "-"), y = Total_Daily, group = hospital, color = hospital)) +
geom_line(size = 1) +
geom_point() +
theme_minimal() +
labs(
title = "Monthly Total Daily Donations by Hospital Duchess of Kent",
x = "Year-Month",
y = "Total Daily Donations",
color = "Hospital"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
# Add Year columns
df$Year <- year(df$date)
# Summarize donation by year
donations_yearly <- df %>%
group_by(Year) %>%
summarise(total_donations_yearly = sum(daily))
# View the result
# print(donations_yearly)
# Create bar chart for yearly donations
ggplot(donations_yearly, aes(x = as.numeric(Year), y = total_donations_yearly, fill = Year)) +
geom_bar(stat = "identity")+
# Format x-axis and y-axis label
scale_x_continuous(breaks = unique(as.numeric(df$Year)))+
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
labs(
title = "Total Blood Donations Per Year",
x = "Year",
y = "Total Donations"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Explanation:
#a. Before Covid-19: Total blood donation was almost increased every years, except there was slightly decreased in 2016, 2017, and 2019
#b. During Covid-19: Total blood donation was decreased almost 100,000 in total
#c. After Covid-19: Total blood donation was increased, to the same amount before Covid-19.
# Summarize donation by blood type
bloodt_total_donations <- df %>%
summarise(
blood_a_sum = sum(blood_a),
blood_b_sum = sum(blood_b),
blood_o_sum = sum(blood_o),
blood_ab_sum = sum(blood_ab)
)
# View result: total donation by blood type
bloodt_total_donations
## blood_a_sum blood_b_sum blood_o_sum blood_ab_sum
## 1 2190614 2400745 3682430 546461
# Convert into long format
bt_total <- bloodt_total_donations %>%
pivot_longer(
cols = everything(),
names_to = "Blood_Type",
values_to = "Total_Donations"
)
# Create bar chart for total blood donations based on blood type
ggplot(bt_total, aes(x = Blood_Type, y = Total_Donations, fill = Blood_Type)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
labs(
title = "Total Donations by Blood Type",
x = "Blood Type",
y = "Total Donations"
) +
theme_minimal()
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## List of 1
## $ axis.text.x:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : num 45
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
#Explanation:
#a. The highest total donation by blood type O
#b. The lowest total donation by blood type AB
#c. Total donation by blood type A and B > 2 millions
# Summarize blood type donations per year
bt_donations_yearly <- df %>%
group_by(Year) %>%
summarise(
total_blood_a = sum(blood_a),
total_blood_b = sum(blood_b),
total_blood_o = sum(blood_o),
total_blood_ab = sum(blood_ab)
)
# Reshape to long format
# Load library(tidyr) if it is not yet load
bt_yearly <- bt_donations_yearly %>%
pivot_longer(cols = starts_with("total_blood_"), names_to="Blood_Types", values_to="Yearly_Donations")
# View the result
print(bt_yearly)
## # A tibble: 76 × 3
## Year Blood_Types Yearly_Donations
## <dbl> <chr> <int>
## 1 2006 total_blood_a 55040
## 2 2006 total_blood_b 59599
## 3 2006 total_blood_o 89490
## 4 2006 total_blood_ab 14759
## 5 2007 total_blood_a 69593
## 6 2007 total_blood_b 76389
## 7 2007 total_blood_o 114931
## 8 2007 total_blood_ab 18459
## 9 2008 total_blood_a 86357
## 10 2008 total_blood_b 95706
## # ℹ 66 more rows
# Line Chart Blood Types Donations Yearly
ggplot(bt_yearly, aes(x = as.numeric(Year), y=Yearly_Donations,
color = Blood_Types, group = Blood_Types)) +
# Setting the line width and dot
geom_line(size = 1) +
geom_point(size = 2) +
# Format x-axis and y-axis label
scale_x_continuous(breaks = unique(as.numeric(df$Year)))+
labs(
title = "Yearly Blood Donations by Blood Types",
x = "Year",
y = "Total Donation",
color = "Blood Types"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5), # Position at center
axis.text.x = element_text(hjust = 1, angle = 45)
)
# Summarize donors type per year
donor_types <- df %>%
summarise(
donors_new = sum(donations_new),
donors_reg = sum(donations_regular),
donors_irreg = sum(donations_irregular)
)
# View result: total donation by donors types
donor_types
## donors_new donors_reg donors_irreg
## 1 2874008 4794591 1151874
# Convert into long format
donors_total <- donor_types %>%
pivot_longer(
cols = everything(),
names_to = "donors",
values_to = "donations_total"
)
# Create bar chart for total blood donations based on donor types
ggplot(donors_total, aes(x = donors, y = donations_total, fill = donors)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
labs(
title = "Total Donations by Donor Types",
x = "Donor Types",
y = "Total Blood Donation"
) +
theme_minimal()
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## List of 1
## $ axis.text.x:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : num 45
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
# Summarize donor types donation per year
dt_donation_yearly <- df %>%
group_by(Year) %>%
summarise(
total_donor_new = sum(donations_new),
total_donor_regular = sum(donations_regular),
total_donor_irregular = sum(donations_irregular)
)
# View result: total donation by donors types
dt_donation_yearly
## # A tibble: 19 × 4
## Year total_donor_new total_donor_regular total_donor_irregular
## <dbl> <int> <int> <int>
## 1 2006 136426 79697 2769
## 2 2007 153430 122718 3230
## 3 2008 184850 153193 10324
## 4 2009 179634 185894 19752
## 5 2010 167712 197172 28279
## 6 2011 166013 212307 38431
## 7 2012 156780 222382 45564
## 8 2013 156492 230197 53218
## 9 2014 179072 246372 62362
## 10 2015 183171 282021 70630
## 11 2016 169744 290224 73321
## 12 2017 156942 293247 78935
## 13 2018 163316 310896 90892
## 14 2019 157808 311299 91670
## 15 2020 115606 294259 87725
## 16 2021 96702 321522 81541
## 17 2022 120930 330081 108816
## 18 2023 118285 352633 104485
## 19 2024 111095 358477 99930
# Reshape to long format
# Load library(tidyr) if it is not yet load
dt_yearly <- dt_donation_yearly %>%
pivot_longer(cols = starts_with("total_"), names_to="donors_types", values_to="dt_year_donations")
# View the result
print(dt_yearly)
## # A tibble: 57 × 3
## Year donors_types dt_year_donations
## <dbl> <chr> <int>
## 1 2006 total_donor_new 136426
## 2 2006 total_donor_regular 79697
## 3 2006 total_donor_irregular 2769
## 4 2007 total_donor_new 153430
## 5 2007 total_donor_regular 122718
## 6 2007 total_donor_irregular 3230
## 7 2008 total_donor_new 184850
## 8 2008 total_donor_regular 153193
## 9 2008 total_donor_irregular 10324
## 10 2009 total_donor_new 179634
## # ℹ 47 more rows
# Line Chart Donor Types Donations Yearly
ggplot(dt_yearly, aes(x = as.numeric(Year), y=dt_year_donations,
color = donors_types, group = donors_types)) +
# Setting the line width and dot
geom_line(size = 1) +
geom_point(size = 2) +
# Format x-axis and y-axis label
scale_x_continuous(breaks = unique(as.numeric(df$Year)))+
scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))+
labs(
title = "Yearly Blood Donations by Donor Types",
x = "Year",
y = "Total Donation",
color = "Donor Types"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5), # Position at center
axis.text.x = element_text(hjust = 1, angle = 45)
)
cor_matrix <- cor(df[, c("daily", "donations_new", "donations_regular","donations_irregular")])
# print(cor_matrix)
# add library
library(reshape2)
cor_long <- melt(cor_matrix)
# Plot the heatmap
ggplot(cor_long, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), color = "black") +
scale_fill_gradient2(low = "light blue", high = "blue", midpoint = 0, limit = c(-1, 1)) +
labs(
title = "Correlation Heatmap",
x = "",
y = "",
fill = "Correlation"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#------------------- Modelling-------------------------
# Model 1: Time Series Forecasting Model (Regression Problem)
# using SARIMA
# Objective 1:Understanding the current blood donation trend from 2006 to the present.
# Outcome: Insights into donation patterns and expected future trends.
# Model 2: Clustering and Role-based Insights
# Objective2 : Group hospitals based on donation patterns and identify low blood supply issues.
# Outcome: Insights into hospitals that may require more donations.
# Aggregate donations by month
monthly_data <- df %>%
group_by(Year = year(date), Month = month(date)) %>%
summarise(Total_Donations = sum(daily, na.rm = TRUE), .groups = "drop") %>%
arrange(Year, Month)
# Create a time series object
monthly_ts <- ts(monthly_data$Total_Donations,
start = c(min(monthly_data$Year), min(monthly_data$Month)),
frequency = 12)
# Plot the time series
autoplot(monthly_ts) +
ggtitle("Monthly Blood Donations Over Time") +
xlab("Year") +
ylab("Total Donations") +
theme_minimal()
# Split into training (70%) and testing (30%) datasets
train_size <- floor(0.7 * length(monthly_ts))
train_ts <- window(monthly_ts, end = c(start(monthly_ts)[1] + (train_size - 1) %/% 12, (train_size - 1) %% 12 + 1))
test_ts <- window(monthly_ts, start = c(start(monthly_ts)[1] + train_size %/% 12, train_size %% 12 + 1))
#----------------Decomposition
# Perform decomposition
decomposition <- decompose(monthly_ts)
# Plot the decomposition
autoplot(decomposition) +
ggtitle("Decomposition of Monthly Blood Donations")
# Analyze the components (trend, seasonality, residuals)
print(decomposition$trend)
## Jan Feb Mar Apr May Jun Jul Aug
## 2006 NA NA NA NA NA NA 18658.21 19345.75
## 2007 20855.17 21439.25 21719.58 22106.83 22615.25 23029.33 23560.42 23892.00
## 2008 26608.00 27419.87 27869.17 27956.50 28501.96 28904.42 29013.67 29349.75
## 2009 30948.21 30848.75 30862.54 31381.92 31717.79 31919.37 32491.00 32691.46
## 2010 32965.92 32822.46 32707.96 32916.33 32881.46 32758.37 32583.42 32364.67
## 2011 33364.96 33388.04 33499.54 33995.75 34298.67 34608.87 34772.88 35043.75
## 2012 35483.46 35362.50 35534.62 35474.21 35328.75 35410.96 35457.96 35293.67
## 2013 35757.96 35906.71 36172.75 36378.29 36464.04 36501.88 36724.83 36965.29
## 2014 37928.42 38169.46 39162.08 39603.71 40046.08 40521.54 41067.12 41610.46
## 2015 43096.96 43845.71 43870.00 43842.96 44202.75 44507.67 44419.12 44296.12
## 2016 45235.67 45311.08 45379.08 45306.42 45040.42 44674.54 44396.21 44512.42
## 2017 43674.92 43805.62 43524.58 43515.58 43749.75 43991.62 44338.12 44531.29
## 2018 45563.58 45901.04 46155.96 46482.21 46800.42 47004.04 47147.96 47265.62
## 2019 47616.00 47297.25 47158.00 47022.62 46748.33 46679.33 46359.29 46070.42
## 2020 42872.46 42523.83 42399.54 42084.79 41872.29 41605.54 41481.29 41290.88
## 2021 41952.00 41375.50 41300.62 41329.25 41224.38 41452.00 41732.67 41702.71
## 2022 44018.88 44813.83 45104.00 45728.79 46209.96 46442.21 46737.21 47205.58
## 2023 48048.79 48106.79 48159.67 48024.75 47992.88 48028.71 48068.46 48143.62
## 2024 48319.62 48332.21 48375.25 48501.21 48695.50 48117.04 NA NA
## Sep Oct Nov Dec
## 2006 19678.08 19890.67 20059.46 20294.12
## 2007 24179.67 24782.88 25557.58 26195.71
## 2008 29892.63 30165.71 30392.33 30699.71
## 2009 32634.83 32747.08 32812.96 32886.88
## 2010 32459.17 32682.71 32596.17 32744.71
## 2011 35179.33 35152.92 35424.46 35611.37
## 2012 35290.04 35558.08 35672.29 35809.04
## 2013 37041.58 37119.08 37555.83 38075.54
## 2014 41985.83 42470.33 42812.83 42759.00
## 2015 44557.71 44777.42 45127.92 45288.83
## 2016 44661.96 44562.75 44326.75 43772.62
## 2017 44530.00 44700.00 44720.08 44988.75
## 2018 47472.92 47786.87 47566.38 47477.58
## 2019 45469.54 43932.46 43127.88 43251.13
## 2020 41377.08 41929.25 42327.29 42376.46
## 2021 41737.38 41926.54 42228.42 42919.75
## 2022 47778.67 47911.67 47938.71 48027.42
## 2023 48046.33 48233.04 48571.29 48503.46
## 2024 NA NA NA NA
print(decomposition$seasonal)
## Jan Feb Mar Apr May Jun
## 2006 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2007 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2008 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2009 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2010 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2011 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2012 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2013 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2014 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2015 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2016 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2017 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2018 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2019 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2020 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2021 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2022 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2023 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## 2024 2276.5453 -1331.2301 3266.4342 -1949.5750 1780.0615 -1103.7602
## Jul Aug Sep Oct Nov Dec
## 2006 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2007 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2008 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2009 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2010 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2011 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2012 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2013 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2014 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2015 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2016 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2017 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2018 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2019 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2020 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2021 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2022 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2023 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
## 2024 978.2745 2412.1842 -1750.9986 -500.7047 -1328.4917 -2748.7394
print(decomposition$random)
## Jan Feb Mar Apr May
## 2006 NA NA NA NA NA
## 2007 1908.288002 5979.980131 -1974.017554 31.741705 -3694.311535
## 2008 2851.454668 1261.355131 -2481.600887 3017.075039 176.980131
## 2009 -1894.753665 6304.480131 -917.975887 1588.658372 403.146798
## 2010 5311.538002 -82.228202 290.607446 -305.758295 1180.480131
## 2011 588.496335 -1573.811535 2693.024113 785.825039 -4484.728202
## 2012 -483.003665 1905.730131 -1542.059221 873.366705 -563.811535
## 2013 785.496335 -4124.478202 3218.815779 1003.283372 7.896798
## 2014 197.038002 -2198.228202 -2128.517554 1995.866705 2689.855131
## 2015 5027.496335 -4833.478202 -868.434221 3416.616705 1093.188465
## 2016 -2694.211998 -3667.853202 1269.482446 3579.158372 7041.521798
## 2017 -2202.461998 1695.605131 2854.982446 3257.991705 4780.188465
## 2018 1775.871335 -1630.811535 1423.607446 3171.366705 -668.478202
## 2019 1066.454668 -1546.019869 3915.565779 6671.950039 -9949.394869
## 2020 -3121.003665 5225.396798 -7744.975887 -8861.216628 -3912.353202
## 2021 -1827.545332 1430.730131 365.940779 -1865.674961 48.563465
## 2022 -1840.420332 -4780.603202 167.565779 -5330.216628 1372.980131
## 2023 -3829.336998 1126.438465 1665.899113 -8988.174961 1601.063465
## 2024 -1261.170332 -133.978202 150.315779 -3683.633295 3235.438465
## Jun Jul Aug Sep Oct
## 2006 NA 214.517168 -847.934221 3550.915316 -7741.961998
## 2007 -1853.573110 3582.308835 353.815779 29.331983 -4320.170332
## 2008 -2171.656443 2467.058835 10043.065779 -10047.626350 -3243.003665
## 2009 -1251.614776 1018.725502 2285.357446 -8042.834684 1893.621335
## 2010 -2257.614776 2990.308835 -2894.850887 -5108.168017 4199.996335
## 2011 3704.885224 7873.850502 -12092.934221 1366.665316 4443.788002
## 2012 2437.801890 4583.767168 -12640.850887 5684.956983 -1840.378665
## 2013 2921.885224 515.892168 -7941.475887 3947.415316 1517.621335
## 2014 5111.218557 -13566.399498 2938.357446 7301.165316 -1532.628665
## 2015 -2726.906443 -4955.399498 6259.690779 -694.709684 935.288002
## 2016 -5817.781443 -3284.482832 6205.399113 671.040316 -2064.045332
## 2017 -14881.864776 4175.600502 1921.524113 -1677.001350 62.704668
## 2018 -9048.281443 6315.767168 2336.190779 -1650.918017 1836.829668
## 2019 -1521.573110 3224.433835 -238.600887 780.456983 2014.246335
## 2020 5349.218557 -2782.566165 7058.940779 -628.084684 1964.454668
## 2021 3369.760224 -11087.941165 865.107446 2996.623650 -1330.836998
## 2022 8661.551890 4.517168 -1655.767554 937.331983 3697.038002
## 2023 7193.051890 -931.732832 -1596.809221 941.665316 -134.336998
## 2024 3139.718557 NA NA NA NA
## Nov Dec
## 2006 1963.033372 -1101.385610
## 2007 352.908372 -952.968943
## 2008 2149.158372 -2428.968943
## 2009 71.533372 -121.135610
## 2010 -2790.674961 146.031057
## 2011 -1062.966628 168.364390
## 2012 1077.200039 -444.302276
## 2013 -3667.341628 1058.197724
## 2014 -608.341628 -530.260610
## 2015 936.575039 397.906057
## 2016 -1432.258295 -3696.885610
## 2017 1530.408372 -2464.010610
## 2018 1460.116705 -2841.843943
## 2019 2992.616705 2634.614390
## 2020 746.200039 154.281057
## 2021 1626.075039 4292.989390
## 2022 -3549.216628 4224.322724
## 2023 -1436.799961 1863.281057
## 2024 NA NA
##OBSERVATION: DECOMPOISITION CONFIRMS SEASONALITY IN DATASET
# NEXT ACTION: SARIMA - Seasonal ARIMA and Prophet
#----------------------SARIMA
# Fit a SARIMA model
sarima_model <- Arima(monthly_ts, order = c(2, 1, 1), seasonal = c(1, 1, 1))
# Summarize the SARIMA model
summary(sarima_model)
## Series: monthly_ts
## ARIMA(2,1,1)(1,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1
## -0.3659 -0.2794 -0.6598 0.4557 -1.0000
## s.e. 0.0895 0.0823 0.0721 0.0696 0.0782
##
## sigma^2 = 16915369: log likelihood = -2104.71
## AIC=4221.42 AICc=4221.82 BIC=4241.64
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -416.6388 3947.144 2725.346 -2.008029 7.475266 0.6446787
## ACF1
## Training set -0.006194841
# Check residuals to ensure proper fit
checkresiduals(sarima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(1,1,1)[12]
## Q* = 32.39, df = 19, p-value = 0.02824
##
## Model df: 5. Total lags used: 24
#------------------SARIMA Short-Term Forecast(2025-2026) and Long Term Forecast (2025-2030)-----------------
# Short-term forecast for operational planning (2025-2026)
# and the long-term forecast (2025-2030) for strategic discussions
# Forecast for the next 24 months (2025-2026)
sarima_forecast_short <- forecast(sarima_model, h = 24)
# Plot the short-term forecast
autoplot(sarima_forecast_short) +
ggtitle("Short-Term Forecast: 2025-2026") +
xlab("Year") +
ylab("Total Donations") +
theme_minimal()
# Forecast for the next 72 months (2025-2030)
sarima_forecast_long <- forecast(sarima_model, h = 72)
# Plot the long-term forecast
autoplot(sarima_forecast_long) +
ggtitle("Long-Term Forecast: 2025-2030") +
xlab("Year") +
ylab("Total Donations") +
theme_minimal()
# Convert sarima_forecast_long to a data frame
forecast_df <- as.data.frame(sarima_forecast_long)
# Add a 'Date' column for filtering
forecast_df$Date <- seq.Date(from = as.Date("2025-01-01"),
by = "month",
length.out = nrow(forecast_df))
# Long-term forecast data preparation
long_term_forecast <- forecast_df %>%
filter(Date >= as.Date("2025-01-01") & Date <= as.Date("2030-12-31"))
#------------------Interactive Candlestick Chart with Forecasted Data-------------------
# Prepare historical candlestick data
candlestick_data <- data.frame(
Date = as.Date(as.yearmon(time(monthly_ts))),
Open = lag(as.numeric(monthly_ts)),
High = as.numeric(monthly_ts) + runif(length(monthly_ts), 0, 1000), # Simulated High
Low = as.numeric(monthly_ts) - runif(length(monthly_ts), 0, 1000), # Simulated Low
Close = as.numeric(monthly_ts)
)
candlestick_data <- na.omit(candlestick_data) # Remove NA rows caused by lag
# Prepare forecast data (2025–2030)
forecast_df <- data.frame(
Date = seq(as.Date("2025-01-01"), as.Date("2030-12-01"), by = "month"),
Forecast = as.numeric(sarima_forecast_long$mean),
Lower_80 = sarima_forecast_long$lower[, 1],
Upper_80 = sarima_forecast_long$upper[, 1],
Lower_95 = sarima_forecast_long$lower[, 2],
Upper_95 = sarima_forecast_long$upper[, 2]
)
# Create the plot
fig <- plot_ly() %>%
# Add candlestick chart for historical data
add_trace(
data = candlestick_data,
x = ~Date, open = ~Open, high = ~High, low = ~Low, close = ~Close,
type = "candlestick",
name = "Historical Data"
) %>%
# Add forecast line
add_lines(
data = forecast_df,
x = ~Date, y = ~Forecast,
name = "Forecast",
line = list(color = 'blue')
) %>%
# Add confidence intervals (80% and 95%)
add_ribbons(
data = forecast_df,
x = ~Date, ymin = ~Lower_95, ymax = ~Upper_95,
name = "95% Confidence Interval",
line = list(color = 'rgba(173,216,230,0.1)'), fillcolor = 'rgba(173,216,230,0.2)'
) %>%
add_ribbons(
data = forecast_df,
x = ~Date, ymin = ~Lower_80, ymax = ~Upper_80,
name = "80% Confidence Interval",
line = list(color = 'rgba(135,206,250,0.1)'), fillcolor = 'rgba(135,206,250,0.3)'
) %>%
# Layout adjustments
layout(
title = "Candlestick Chart (2006–2024) with Forecast (2025–2030)",
xaxis = list(title = "Date"),
yaxis = list(title = "Total Donations")
)
# Show the plot
fig
# ------------------------- Model 2: Clustering and Rule-Based for Insights -------------------------
# Objective: Group hospitals based on donation patterns and identify supply issues.
# Outcome: Insights into hospitals that may require more donations.
# Install and load necessary packages
required_packages <- c("factoextra", "cluster", "dplyr", "ggplot2")
install_if_missing <- function(p) { if (!require(p, character.only =
TRUE)) install.packages(p, dependencies = TRUE) }
invisible(lapply(required_packages, install_if_missing))
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: cluster
invisible(lapply(required_packages, library, character.only = TRUE))
# ----------------------- Data Preparation -----------------------
# Filter data for years 2018 to 2024
df_filtered <- df %>% filter(as.Date(date) >= as.Date("2018-01-01") &
as.Date(date) <= as.Date("2024-12-31"))
# Aggregate to calculate monthly average and total donations by hospital
df_monthly <- df_filtered %>% mutate(Year_Month =
format(as.Date(date), "%Y-%m")) %>% # Extract year-month
group_by(hospital, Year_Month) %>% summarise( Average_Daily_Donations =
mean(daily, na.rm = TRUE), Total_Blood_Donations = sum(daily, na.rm =
TRUE), .groups = "drop" )
# Further aggregate to hospital level across all months
data_for_clustering <- df_monthly %>% group_by(hospital) %>%
summarise( Average_Daily_Donations = mean(Average_Daily_Donations, na.rm
= TRUE), Total_Blood_Donations = sum(Total_Blood_Donations, na.rm =
TRUE) )
# ----------------------- Clustering -----------------------
# Normalize features
normalized_data <- scale(data_for_clustering %>%
select(Average_Daily_Donations, Total_Blood_Donations))
# Determine optimal number of clusters
fviz_nbclust(normalized_data, kmeans, method = "wss") +
ggtitle("Optimal Number of Clusters: Elbow Method") +
theme_minimal()
# Apply K-Means Clustering
set.seed(123) # For reproducibility
optimal_clusters <- 3 # Adjust based on Elbow Method
kmeans_result <- kmeans(normalized_data, centers = optimal_clusters, nstart = 25)
# Add cluster labels to the dataset
data_for_clustering$Cluster <- as.factor(kmeans_result$cluster)
# Add hospital names and other details to data_for_clustering
data_for_clustering <- data_for_clustering %>%
mutate(Hospital = hospital) # Include hospital name for tooltips
# Create an interactive scatter plot
interactive_plot <- plot_ly(
data = data_for_clustering,
x = ~Average_Daily_Donations,
y = ~Total_Blood_Donations,
type = 'scatter',
mode = 'markers',
color = ~Cluster, # Color by cluster
colors = c("blue","green","red"), # Customize cluster colors
text = ~paste(
'Hospital:', Hospital,
'<br>Cluster:', Cluster,
'<br>Avg Daily Donations:', round(Average_Daily_Donations, 2),
'<br>Total Blood Donations:', round(Total_Blood_Donations, 2)
), # Tooltip details
hoverinfo = 'text' # Display tooltips
) %>%
layout(
title = "Interactive Clustering of Hospitals Based on Donation Patterns",
xaxis = list(title = "Average Daily Donations"),
yaxis = list(title = "Total Blood Donations"),
legend = list(title = list(text = "Cluster"))
)
# Display the plot
interactive_plot
# Analyze cluster characteristics
cluster_summary <- data_for_clustering %>%
group_by(Cluster) %>%
summarise(
Avg_Daily_Donations = mean(Average_Daily_Donations),
Total_Blood_Donations = mean(Total_Blood_Donations),
Count = n(),
.groups = "drop"
)
# Print cluster summary
print(cluster_summary)
## # A tibble: 3 × 4
## Cluster Avg_Daily_Donations Total_Blood_Donations Count
## <fct> <dbl> <dbl> <int>
## 1 1 74.6 189874. 8
## 2 2 529. 1345332 1
## 3 3 29.1 74126. 13
# Identify hospitals in the "Low Donation" cluster
low_donation_cluster <- cluster_summary %>%
filter(Avg_Daily_Donations == min(Avg_Daily_Donations)) %>%
pull(Cluster)
low_donation_hospitals <- data_for_clustering %>%
filter(Cluster == low_donation_cluster) %>%
arrange(Average_Daily_Donations) %>%
select(hospital, Average_Daily_Donations, Total_Blood_Donations)
# Print hospitals with low donation issues
print(low_donation_hospitals)
## # A tibble: 13 × 3
## hospital Average_Daily_Donati…¹ Total_Blood_Donations
## <chr> <dbl> <dbl>
## 1 Hospital Seri Manjung 15.1 38522
## 2 Hospital Duchess Of Kent 16.1 41073
## 3 Hospital Tawau 18.4 46934
## 4 Hospital Sultan Haji Ahmad Shah 19.4 49329
## 5 Hospital Sibu 22.0 56105
## 6 Hospital Miri 22.4 57120
## 7 Hospital Sultanah Nora Ismail 28.0 71245
## 8 Hospital Taiping 28.0 71312
## 9 Hospital Tengku Ampuan Afzan 34.8 88652
## 10 Hospital Raja Perempuan Zainab … 39.8 101203
## 11 Hospital Seberang Jaya 40.8 103634
## 12 Hospital Tuanku Jaafar 45.3 115167
## 13 Hospital Sultanah Nur Zahirah 48.4 123345
## # ℹ abbreviated name: ¹Average_Daily_Donations
# ----------------------- Rule-Based Approach -----------------------
# Define thresholds for "Low Supply"
daily_threshold <- quantile(df_monthly$Average_Daily_Donations, 0.25, na.rm = TRUE)
total_donations_threshold <- quantile(df_monthly$Total_Blood_Donations, 0.25, na.rm = TRUE)
# Classify supply status
df_monthly <- df_monthly %>%
mutate(
Supply_Status = case_when(
Average_Daily_Donations < daily_threshold ~ "Low Supply",
Total_Blood_Donations < total_donations_threshold ~ "Low Supply",
TRUE ~ "Adequate Supply"
)
)
# Summarize supply status trends
monthly_summary <- df_monthly %>%
group_by(Year_Month, Supply_Status) %>%
summarise(Count = n(), .groups = "drop")
# Visualize supply status trends
ggplot(monthly_summary, aes(x = Year_Month, y = Count, fill = Supply_Status)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("Low Supply" = "red", "Adequate Supply" = "green")) +
ggtitle("Monthly Distribution of Supply Status (2018–2024)") +
xlab("Month") +
ylab("Number of Facilities") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Identify hospitals with the most "Low Supply" months
low_supply_hospitals <- df_monthly %>%
filter(Supply_Status == "Low Supply") %>%
group_by(hospital) %>%
summarise(Low_Supply_Months = n(), .groups = "drop") %>%
arrange(desc(Low_Supply_Months))
# Visualize hospitals with the most "Low Supply" months
ggplot(low_supply_hospitals, aes(x = reorder(hospital, -Low_Supply_Months), y = Low_Supply_Months, fill = Low_Supply_Months)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_gradient(low = "yellow", high = "red") +
ggtitle("Hospitals with the Most 'Low Supply' Months (2018–2024)") +
xlab("Hospital") +
ylab("Number of Low Supply Months") +
theme_minimal()
# Print hospitals with low supply issues
print(low_supply_hospitals)
## # A tibble: 11 × 2
## hospital Low_Supply_Months
## <chr> <int>
## 1 Hospital Seri Manjung 84
## 2 Hospital Sultan Haji Ahmad Shah 78
## 3 Hospital Tawau 74
## 4 Hospital Duchess Of Kent 68
## 5 Hospital Sibu 59
## 6 Hospital Miri 54
## 7 Hospital Taiping 19
## 8 Hospital Sultanah Nora Ismail 18
## 9 Hospital Tengku Ampuan Afzan 17
## 10 Hospital Raja Perempuan Zainab Ii 2
## 11 Hospital Sultanah Bahiyah 1
# ------------------------- Model 2 Evaluation -------------------------
# Objective: Evaluate the clustering and rule-based classification to identify "Low Supply" hospitals.
# Install and load necessary packages
required_packages <- c("factoextra", "cluster", "dplyr", "ggplot2", "caret")
install_if_missing <- function(p) {
if (!require(p, character.only = TRUE)) install.packages(p, dependencies = TRUE)
}
invisible(lapply(required_packages, install_if_missing))
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
invisible(lapply(required_packages, library, character.only = TRUE))
# ------------------------- Clustering Evaluation -------------------------
# Step 1: Clustering Results Evaluation
# For simplicity, clustering has already been performed in previous steps.
# Let's evaluate the clustering using silhouette analysis.
# Compute silhouette information
silhouette_info <- cluster::silhouette(kmeans_result$cluster, dist(normalized_data))
# Visualize silhouette scores
factoextra::fviz_silhouette(silhouette_info) +
ggtitle("Silhouette Plot for Clustering Evaluation") +
theme_minimal()
## cluster size ave.sil.width
## 1 1 8 0.75
## 2 2 1 0.00
## 3 3 13 0.67
# Print average silhouette width
avg_silhouette_width <- mean(silhouette_info[, "sil_width"])
cat("Average Silhouette Width: ", avg_silhouette_width, "\n")
## Average Silhouette Width: 0.6651655
# ------------------------- Rule-Based Classification Evaluation -------------------------
# Step 1: Prepare data for evaluation
# Ensure the dataset contains both the rule-based classification and clustering labels
evaluation_data <- data_for_clustering %>%
left_join(df_monthly %>% select(hospital, Supply_Status), by = "hospital")
# Check for imbalances in "Supply_Status"
supply_status_distribution <- evaluation_data %>%
group_by(Supply_Status) %>%
summarise(Count = n())
cat("Supply Status Distribution:\n")
## Supply Status Distribution:
print(supply_status_distribution)
## # A tibble: 2 × 2
## Supply_Status Count
## <chr> <int>
## 1 Adequate Supply 1374
## 2 Low Supply 474
# Step 2: Generate confusion matrix for clustering vs. rule-based classification
# Encode Supply_Status as numeric for binary evaluation
evaluation_data <- evaluation_data %>%
mutate(
Supply_Status_Binary = ifelse(Supply_Status == "Low Supply", 1, 0),
Cluster_Binary = ifelse(Cluster == low_donation_cluster, 1, 0)
)
# Confusion matrix
confusion_matrix <- caret::confusionMatrix(
factor(evaluation_data$Cluster_Binary),
factor(evaluation_data$Supply_Status_Binary),
positive = "1"
)
cat("Confusion Matrix for Clustering vs. Rule-Based Classification:\n")
## Confusion Matrix for Clustering vs. Rule-Based Classification:
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 755 1
## 1 619 473
##
## Accuracy : 0.6645
## 95% CI : (0.6425, 0.686)
## No Information Rate : 0.7435
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3836
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9979
## Specificity : 0.5495
## Pos Pred Value : 0.4332
## Neg Pred Value : 0.9987
## Prevalence : 0.2565
## Detection Rate : 0.2560
## Detection Prevalence : 0.5909
## Balanced Accuracy : 0.7737
##
## 'Positive' Class : 1
##
# ------------------------- Export Results for Reporting -------------------------
# Save evaluation results
#write.csv(supply_status_distribution, "supply_status_distribution.csv", row.names = FALSE)
#write.csv(as.data.frame(confusion_matrix$table), "confusion_matrix.csv", row.names = FALSE)
cat("Evaluation complete. Results saved to CSV files.\n")
## Evaluation complete. Results saved to CSV files.
1. Regular donors are critical, while new and irregular donors need targeted outreach.
2. Blood Type O is the most donated; donations recovered post-COVID-19.
3. Clustering effectively identified low-supply hospitals for intervention.
4.Model Insights: a. Time Series Forecasting: Best for predicting future trends and seasonality (up to 2030).
b. Clustering: Best for segmenting hospitals and providing actionable insights into low-supply facilities.
1. Use time series forecasts for strategic planning.
2. Allocate resources to low-supply hospitals based on clustering results.
3. Engage new/irregular donors to build a stable donor base.