
WQD7004 Programming for Data Science
TOPIC:
Understanding COVID-19 Burden and Spread: A Clustering and Predictive
Modeling Approach for Asian Countries
Group Project
| Chadli Rayane | 24075296 |
| Farras Azelya Putri | S2007298 |
| Isma Adlin Binti Ismail | 24088157 |
| Wang Zheng | 24082308 |
| Mareeha Sultana Mohamed | 24071864 |
1. INTRODUCTION
The COVID-19 pandemic affected countries differently based on several
factors such as population and population density. Asia, with its
diverse cultures and population sizes, is an interesting case study for
analyzing the spread of the virus.
This project explores COVID-19 patterns across Asian countries
through data-driven analysis and machine learning models. We aim to group
countries by severity and predict viral spread trends to uncover
critical insights for pandemic preparedness across different time
periods.
Source: Our World in Data (OWID COVID-19 Dataset)
2. Project Questions:
Clustering: Which Asian countries suffered the most during the
pandemic, grouped by metrics such as mortality rates, case rates, and
overall severity of COVID-19 impact?
Regression: Can we create a model to analyze and predict the
propagation of the virus?
3. Data Understanding:
3.1. Dimension:
covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)
Dataset Description:
Size: 429,435 rows × 67 columns
Granularity: Daily records per country
Key Features: Includes indicators such as COVID-19 cases, deaths,
vaccinations, testing rates, and population statistics.
Format: CSV
3.2. Content:
iso_code
continent
location
date
total_cases
new_cases
new_cases_smoothed
total_deaths
new_deaths
new_deaths_smoothed
total_cases_per_million
new_cases_per_million
new_cases_smoothed_per_million
total_deaths_per_million
new_deaths_per_million
new_deaths_smoothed_per_million
reproduction_rate
icu_patients
icu_patients_per_million
hosp_patients
hosp_patients_per_million
weekly_icu_admissions
weekly_icu_admissions_per_million
weekly_hosp_admissions
weekly_hosp_admissions_per_million
total_tests
new_tests
total_tests_per_thousand
new_tests_per_thousand
new_tests_smoothed
new_tests_smoothed_per_thousand
positive_rate
tests_per_case
tests_units
total_vaccinations
people_vaccinated
people_fully_vaccinated
total_boosters
new_vaccinations
new_vaccinations_smoothed
total_vaccinations_per_hundred
people_vaccinated_per_hundred
people_fully_vaccinated_per_hundred
total_boosters_per_hundred
new_vaccinations_smoothed_per_million
new_people_vaccinated_smoothed
new_people_vaccinated_smoothed_per_hundred
stringency_index
population_density
median_age
aged_65_older
aged_70_older
gdp_per_capita
extreme_poverty
cardiovasc_death_rate
diabetes_prevalence
female_smokers
male_smokers
handwashing_facilities
hospital_beds_per_thousand
life_expectancy
human_development_index
population
excess_mortality_cumulative_absolute
excess_mortality_cumulative
excess_mortality
excess_mortality_cumulative_per_million
The dataset contains a comprehensive collection of COVID-19-related
metrics across different countries over time. The variables can be
grouped into the following categories:
- Identification and Geographic Information:
Country or region identifiers - "iso_code" "continent" "location", plus the "date" of each record
- COVID-19 Cases and Deaths:
Cumulative counts - "total_cases" "total_deaths"
Daily values - "new_cases" "new_deaths" (plus smoothed and per-million variants)
- Hospitalizations and ICU:
ICU occupancy - "icu_patients" "icu_patients_per_million"
Hospital occupancy - "hosp_patients" "hosp_patients_per_million"
Weekly admissions - "weekly_icu_admissions" "weekly_hosp_admissions" (plus per-million variants)
- Testing Data:
Cumulative and daily tests - "total_tests" "new_tests" (plus per-thousand and smoothed variants)
Test outcomes - "positive_rate" "tests_per_case" "tests_units"
- Vaccination Data:
Cumulative counts - "total_vaccinations" "people_vaccinated"
"people_fully_vaccinated" "total_boosters"
Daily values - "new_vaccinations" "new_vaccinations_smoothed"
Normalized - "total_vaccinations_per_hundred"
"people_vaccinated_per_hundred" "people_fully_vaccinated_per_hundred"
"total_boosters_per_hundred" "new_vaccinations_smoothed_per_million"
"new_people_vaccinated_smoothed"
"new_people_vaccinated_smoothed_per_hundred"
- Public Policy Measures:
A composite measure of government response strictness -
"stringency_index"
- Demographic and Socioeconomic Data:
Population and age structure - "population" "population_density"
"median_age" "aged_65_older" "aged_70_older"
Economic and development indicators - "gdp_per_capita"
"extreme_poverty" "life_expectancy" "human_development_index"
- Health and Risk Factors:
Pre-existing conditions - "cardiovasc_death_rate" "diabetes_prevalence"
Smoking prevalence - "female_smokers" "male_smokers"
Healthcare resources - "hospital_beds_per_thousand"
"handwashing_facilities"
- Excess Mortality Metrics:
- "excess_mortality" "excess_mortality_cumulative"
"excess_mortality_cumulative_absolute"
"excess_mortality_cumulative_per_million"
3.3. Structure:
Understanding the structure of the dataset is a crucial step in data
analysis. It allows us to identify the types of variables we are dealing
with and assess what kinds of transformations or conversions are
possible. The structure also helps in detecting missing values,
inconsistencies, or anomalies before performing EDA and modeling.
The command below displays the internal structure of the dataset:
str(covid_data)
## 'data.frame': 429435 obs. of 67 variables:
## $ iso_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : chr "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
## $ total_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ reproduction_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr "" "" "" "" ...
## $ total_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: num NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num 0 0 0 0 0 0 0 0 0 0 ...
## $ population_density : num 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num 597 597 597 597 597 ...
## $ diabetes_prevalence : num 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
## $ population : num 41128772 41128772 41128772 41128772 41128772 ...
## $ excess_mortality_cumulative_absolute : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : num NA NA NA NA NA NA NA NA NA NA ...
Key variables used in this project and their data types:
- location - character
- date - character
- new_cases - integer
- new_deaths - integer
- total_deaths - integer
- people_vaccinated - numeric
- people_fully_vaccinated - numeric
- total_boosters - numeric
- population - numeric
- total_vaccinations - numeric
- new_vaccinations - integer
- stringency_index (measure of government restrictions) - numeric
- icu_patients - integer
- hosp_patients - integer
- positive_rate - numeric
- reproduction_rate - numeric
3.4. Summary:
The dataset comprises 429,435 observations and
67 columns, representing detailed COVID-19
data from multiple countries over time. Each row corresponds to a
specific date and location (country or region), making the dataset a
structured longitudinal time series.
Data Types:
- Character variables (e.g., iso_code, continent, location, date):
represent identifiers and date information.
- Numeric variables: the majority of the dataset consists of integers
(int) and continuous numbers (num).
Data Coverage:
Includes data on COVID-19 cases, deaths, testing,
hospitalizations, ICU usage, vaccinations, and public health policy
responses.
Demographic and socioeconomic indicators such as median age, GDP per
capita, population density, and human development index are included to
support deeper analysis.
Also includes healthcare capacity and risk factors such as
smoking rates, cardiovascular death rate, handwashing facilities, and
hospital beds.
Excess mortality provides insight into the pandemic's broader
impact beyond reported COVID-19 deaths.
Missing Values:
- Several variables, especially those related to ICU, hospitalization,
testing, vaccination, and excess mortality, contain many NA values,
indicating inconsistent or unavailable reporting across regions and
time.

The dataset offers a rich foundation for analyzing how COVID-19
spread across the world, how it affected people, how countries responded,
and how healthcare systems differed between regions. It is suitable for
our project because it includes up-to-date data on vaccinations, cases,
and deaths, which we can use to study how effective vaccines were in
reducing infections and saving lives.
4. Data Preparation
main_data_frame<-read.csv('owid-covid-data.csv')
4.1. Function: drop_cols()
Explanation:
- Drops columns with too many missing values.
Key Steps:
- Checks if input is NULL; returns NULL if so.
- Computes a 50% missing data threshold (drop_at).
- Analyzes summary statistics to detect columns with excessive
NAs.
- Drops any column where missing values are ≥ 50% of total rows.
- Returns a dataframe that can be imputed without introducing too much
bias.
R Code:
drop_cols<-function(dataframe=NULL){
if(is.null(dataframe)){
return(NULL)
}
col_to_drop<-c()
drop_at<-0.5*nrow(dataframe)
summary_df<-data.frame(summary(dataframe))
summary_df<-summary_df %>% select(-Var1) %>%group_by(Var2)
summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
for (p in unique(colnames(dataframe)))
{
test<-summary_df[which(summary_df$Var2==p),]
if (is.na(test[7,2])){next}
if (as.numeric(str_split(test[7,2],':')[[1]][2])>=drop_at){
col_to_drop<-append(col_to_drop,p)
}
}
dataframe<-dataframe %>% select(-all_of(col_to_drop))
return(dataframe)
}
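A short illustrative usage (in the pipeline below, drop_cols() is actually called inside keep_asia() and drop_asia(); this standalone call is for demonstration only). The function relies on dplyr and stringr:
library(dplyr)
library(stringr)
# Drop every column where at least half the rows are NA.
cleaned_df <- drop_cols(main_data_frame)
dim(cleaned_df) # typically fewer than the original 67 columns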
4.2. Function: missed_val_df()
Explanation:
- Creates a formatted table summarizing missing values per variable.
Key Steps:
- Returns NULL if the input dataframe is NULL.
- Uses miss_var_summary() from naniar to count missing values.
- Formats the summary using the gt package for better
readability.
- Adds a custom title to the table.
- Prints the formatted table.
R Code:
missed_val_df<-function(dataframe=NULL,title=''){
if(is.null(dataframe)){return(NULL)
}else{
pretty_table <- miss_var_summary(dataframe) %>%
gt() %>%
tab_header(title = title)
print(pretty_table)
}
}
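An illustrative call, assuming naniar and gt are loaded:
library(naniar)
library(gt)
# Print a gt-formatted table of missing-value counts per variable.
missed_val_df(main_data_frame, title = "Missing values: main_data_frame")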
4.3. Function: fun()
Explanation:
- Converts logical (TRUE/FALSE) values to numeric (1/0).
Key Steps:
- Returns 1 if value is TRUE, 0 if FALSE.
- Handles NULL input by returning NULL.
- Supports matrix transformation for visualizations.
R Code:
fun<-function(data=NULL){
if(is.null(data)){return(NULL)
}else{
a<- ifelse(data==TRUE,1,0)
return(a)
}
}
4.4. Function: heat_map_missing()
Explanation:
- Generates a heatmap showing the locations of missing values in a
dataframe.
Key Steps:
- Converts missing values (NA) into binary format (1 = missing).
- Uses pheatmap() to create a visual heatmap.
- Uses red for missing values and white for non-missing.
- No row/column clustering.
- Returns NULL if the input dataframe is NULL.
R Code:
heat_map_missing<-function(data__frame=NULL){
if(is.null(data__frame)){
return(NULL)
}
pheatmap(
mat = sapply(as.data.frame(is.na(data__frame)),fun),
cluster_rows = FALSE,
cluster_cols = FALSE,
color = c("white", "red"),
breaks = c(-0.01, 0.5, 1.1),
main = "Missing Values Heatmap"
)
}
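An illustrative call (requires pheatmap; plotting a slice of rows keeps the heatmap readable on a dataset this large):
library(pheatmap)
# Red cells mark missing entries; white cells mark observed values.
heat_map_missing(main_data_frame[1:1000, ])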
4.5. Function: data_char_date()
Explanation:
- Converts a character string to a proper Date object.
Key Steps:
- Extracts numeric date components from a string.
- Handles two formats: yyyy-mm-dd or Month dd, yyyy.
- Converts using as.Date().
- Returns NULL if input is NULL
R Code:
data_char_date<-function(x=NULL){
if(is.null(x)){
return(NULL)}
dated<-str_extract_all(x,"\\d+")[[1]]
if (length(dated)>2){
x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
format="%Y-%m-%d")
}else{
x<-as.Date(x, format = "%B %d, %Y")
}
return(x)
}
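An illustrative call covering both input formats the function handles (the "Month dd, yyyy" branch assumes an English locale for %B):
library(stringr)
data_char_date("2020-01-05")        # "yyyy-mm-dd" branch -> Date "2020-01-05"
data_char_date("January 5, 2020")   # "Month dd, yyyy" branch -> Date "2020-01-05"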
4.6. Function: box_plot()
Explanation:
- Plots boxplots of a selected variable (peak) against binned versions
of key socioeconomic features.
Key Steps:
- Requires a dataframe and a target variable (e.g., new_cases).
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Creates a boxplot for each binned feature group vs. the peak
variable.
- Uses ggplot2 to plot and visualize distributions.
- Prints each boxplot automatically.
R Code:
box_plot<-function(data__frame=NULL,peak=NULL){
if(is.null(data__frame) | is.null(peak)){
return(NULL)
}
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data__frame %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
#.names = "{.col}_group"
#))
data_cut <- data__frame %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
for (p in paste0(cols_to_cut,'_group')){
plo<-ggplot(data_cut,
aes(x=!!sym(p),
y=!!sym(peak),
group=!!sym(p)
)
)+
geom_boxplot(color = 'blue',
fill = 'lightblue')+
labs(x =p,
y = peak,
title = paste(peak, "across", p))+theme_minimal()
print(plo)
}
}
4.7. Function: smoothing_bins()
Explanation:
- Imputes missing smoothed data by averaging values within weekly date
bins.
Key Steps:
- Looks for columns ending in “smoothed”.
- Creates 7-day bins (cut()) based on the date column.
- For each smoothed column:
- Groups data by date bin and location.
- Replaces the smoothed column with the weekly average of its base column.
- Returns the updated dataframe with smoothed values replaced.
R Code:
smoothing_bins<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_smoothed<-colnames(df)
names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
for (names_bins_collection in names_cols_smoothed ){
main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
bins<-paste0(main,"_smoothed")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
group_by(date_group,location) %>%
mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
ungroup()%>%
select(-all_of(c('date_group',bins))) %>%
rename(!!sym(bins) := new_col)
}
return(df)
}
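An illustrative call (assumes main_data_asia is the Asian subset returned by keep_asia(), with the date column already converted to Date via data_char_date()):
# Replace each *_smoothed column with weekly per-country means of its base column.
main_data_asia <- smoothing_bins(main_data_asia)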
4.8. Function: per_million_fix()
Explanation:
- Recalculates “per million” values using the actual population and
base metric.
Key Steps:
- Detects columns ending in _per_million.
- Recomputes values as: (original value / population)×1,000,000
- Replaces original _per_million columns with recalculated ones.
- Useful for ensuring consistent normalization per population.
R Code:
per_million_fix<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_per_million<-colnames(df)
names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]
names_cols_per_million
for (names_bins_collection in names_cols_per_million ){
main<-str_split(names_bins_collection,"_per_million")[[1]][1]
bins<-paste0(main,"_per_million")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(!!sym(new_col) :=((!!sym(main))/population)*1000000 )%>%
select(-all_of(bins))%>%
rename(!!sym(bins) := new_col)
}
return(df)
}
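An illustrative call, continuing the same assumed pipeline:
# Recompute every *_per_million column as (base metric / population) * 1e6.
main_data_asia <- per_million_fix(main_data_asia)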
4.9. Function: fix_totals()
Explanation:
- Recalculates total cases and deaths over time for each country from
daily new values since total cases and deaths cannot decrease over
time.
Key Steps:
- Skips the date 2020-01-04 (removes it due to mismatch between
different countries).
- Iterates through each country and processes data in date order.
- On the first day, sets totals to 0.
- For subsequent days, adds new_cases and new_deaths to running
totals.
- Removes duplicates and ensures cumulative consistency.
- Prints warning if there are duplicate dates per country.
- Returns cleaned dataframe with corrected total_cases and
total_deaths.
R Code:
fix_totals<-function(df_t=NULL){
if(is.null(df_t)){
return(NULL)
}
df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
countries <- unique(df_t$location)
df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
for(country in countries){
df_sub<-df_t[df_t$location==country,]
df_sub <- df_sub[order(df_sub$date), ]
first_day<-min(df_sub$date)
last_day<-max(df_sub$date)
for (i in seq(first_day,last_day,by=1)){
if(i==first_day){
df_sub$total_cases[df_sub$date == i] <- 0
df_sub$total_deaths[df_sub$date == i] <- 0
}else{
df_sub$total_cases[df_sub$date==(i)]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
df_sub$total_deaths[df_sub$date==(i)]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
}
}
print(paste(country,'--->',any(duplicated(df_sub$date)))
)
df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
}
return(df_t)
}
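An illustrative call, continuing the same assumed pipeline (it prints one "country ---> FALSE" line per country when no duplicate dates remain):
# Rebuild total_cases and total_deaths cumulatively from new_cases/new_deaths.
main_data_asia <- fix_totals(main_data_asia)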
4.10. Function: drop_asia()
Explanation:
- Removes Asian countries from the dataset and handles missing values
(use during testing only thus was not included in the main code).
Key Steps:
- Filters out rows where continent == ‘Asia’.
- Displays missing value summaries before and after column
removal.
- Uses drop_cols() to remove columns with too many missing values from
both the original and filtered datasets.
- Returns the non-Asian subset of the data.
R Code:
drop_asia<-function(main_data_frame){
main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print( missed_val_df(main_data_remaining,'missing main_data_remaining'))
main_data_remaining<-drop_cols(main_data_remaining)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))
return(main_data_remaining)
}
4.11. Function: keep_asia()
Explanation:
- Keeps only Asian countries in the dataset and handles missing values
(Area of interest).
Key Steps:
- Filters the dataset to include only rows where continent ==
‘Asia’.
- Displays missing value summaries before and after cleaning.
- Uses drop_cols() to remove columns with excessive missing
values.
- Returns the Asian subset of the data.
R Code:
keep_asia<-function(main_data_frame){
main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print(missed_val_df(main_data_asia,'missing main_data_asia'))
main_data_asia<-drop_cols(main_data_asia)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))
return(main_data_asia)
}
4.12. Function: imputate_first_part()
Explanation:
- Cleans and imputes missing values for the Asian dataset using
weighted median imputation by similarity (missing values are imputed
from similar rows, based on different features, and a weighted average
of the candidate imputations is taken).
Key Steps:
- Converts the date column into proper Date format using
data_char_date().
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Removes rows where gdp_per_capita_group is missing.
- Identifies columns with missing values and imputes them using
weighted_median_imputation().
- Drops binning columns (*_group) before returning.
- Returns the cleaned and imputed Asian dataset.
R Code:
imputate_first_part<-function(data_set_asia){
data_set_asia<-data_set_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date))))
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data_set_asia %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
# .names = "{.col}_group"
#))
data_cut <- data_set_asia %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
for(col_named in cols_to_imputate){
data_cut<-weighted_median_imputation(data_cut,col_named)
}
cols_original<-paste0(cols_to_cut,"_group")
data_set_asia_clean_except__bins<-data_cut %>%select(-all_of(cols_original))
#write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
return(data_set_asia_clean_except__bins)
}
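weighted_median_imputation() is called above, but its definition was not captured in this report. The sketch below is a hypothetical, simplified reconstruction, assuming it fills each missing value with the robust weighted median (reproduction_function(), defined next) of rows falling in the same population, GDP, and life-expectancy bins:
# Hypothetical reconstruction -- the original definition is not shown here.
# Similar rows are assumed to be those sharing the *_group bin columns
# created in imputate_first_part(); missing values in `col_name` are filled
# with the robust weighted median of their group.
weighted_median_imputation <- function(df, col_name) {
df %>%
group_by(population_group, gdp_per_capita_group, life_expectancy_group) %>%
mutate(!!sym(col_name) := ifelse(
is.na(!!sym(col_name)),
reproduction_function(!!sym(col_name)), # robust weighted median (Section 4.14)
!!sym(col_name)
)) %>%
ungroup()
}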
4.14. Function: reproduction_function()
Explanation:
- Computes a robust weighted median by iteratively removing outliers
and calculating median values from non-outliers.
Key Steps:
- Removes outliers using the IQR (Interquartile Range) method.
- For each iteration, calculates the median of non-outliers and
assigns a weight based on the proportion of non-outliers.
- Repeats on the remaining outliers until fewer than 4 values
remain.
- Returns the weighted sum of medians from all iterations.
R Code:
reproduction_function<-function(s){
median_data<-c()
weights<-c()
if (is.data.frame(s)) {
s <- s[[1]]
}
x <- s[!is.na(s)]
size_s<-length(x)
x<-as.numeric(x)
while(TRUE)
{
if(length(x)<4){break}
Q1<-quantile(x,0.25,na.rm=TRUE)
Q3<-quantile(x,0.75,na.rm=TRUE)
IQR<-Q3-Q1
lower_bound<-Q1-1.5*IQR
higher_bound<-Q3+1.5*IQR
non_outliers<-x[x>=lower_bound & x<=higher_bound]
outliers<-x[x<lower_bound | x>higher_bound]
a<-non_outliers
b<-length(non_outliers)/size_s
weights<-append(weights,b)
a<-median(a,na.rm=TRUE)
median_data<-append(median_data,a)
x<-outliers
}
median_data <- as.numeric(median_data)
weights <- as.numeric(weights)
return(sum(median_data*weights, na.rm = TRUE))
}
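A small worked example on made-up numbers: for the vector below, the first IQR pass keeps {1, 2, 2, 3, 3, 4} (median 2.5, weight 6/8 = 0.75) and flags {100, 120} as outliers; since fewer than 4 values then remain, the loop stops and the function returns 2.5 * 0.75 = 1.875.
# Robust weighted median of a vector containing extreme outliers.
reproduction_function(c(1, 2, 2, 3, 3, 4, 100, 120)) # 1.875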
5. Exploratory Data Analysis (EDA)
This report presents an exploratory data analysis of COVID-19
indicators in various Asian countries. The focus is on identifying
extreme cases, comparing Malaysia to others, and understanding
healthcare burden using clustering and visualizations.
5.1. Load Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggcorrplot)
library(plotly)
library(pheatmap)
library(ggtext)
5.2. Load Dataset
df <- read.csv('covid_data_set_asia_fixed_final.csv')
df <- df %>%
mutate(case_fatality_ratio = case_when(
total_cases != 0 ~ total_deaths / total_cases,
TRUE ~ 0
))
df$date <- as.Date(df$date)
df_last_day <- df %>%
group_by(location) %>%
filter(date == max(date)) %>%
ungroup() %>%
filter(total_cases > 0)
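The bar_plot() and line_plot() helpers used throughout Section 5.3 were not captured in this report. The sketches below are plausible reconstructions: bar_plot() as a sorted horizontal bar chart of one metric per country, and line_plot() (per the "Daily Extremes vs. Malaysia" heading) as a time series comparing Malaysia with the countries holding the highest and lowest final values of the metric. The exact aesthetics and comparison logic are assumptions.
# Hypothetical reconstructions -- the original helper definitions are not
# shown in this report.
# Horizontal bar chart of one metric per country, sorted by value.
bar_plot <- function(data, metric) {
ggplot(data, aes(x = reorder(location, .data[[metric]]),
y = .data[[metric]])) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(x = "Country", y = metric, title = paste(metric, "by country")) +
theme_minimal(base_size = 8)
}
# Daily time series for Malaysia plus the countries with the highest and
# lowest final value of the metric (assumed comparison logic).
line_plot <- function(data, metric) {
last_vals <- data %>%
group_by(location) %>%
filter(date == max(date)) %>%
ungroup()
extremes <- c(
last_vals$location[which.max(last_vals[[metric]])],
last_vals$location[which.min(last_vals[[metric]])]
)
data %>%
filter(location %in% c("Malaysia", extremes)) %>%
ggplot(aes(x = date, y = .data[[metric]], colour = location)) +
geom_line() +
labs(x = "Date", y = metric, title = paste(metric, "over time")) +
theme_minimal()
}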
5.3. Visualization
5.3.1 Plot : Total COVID-19 Cases
bar_plot(df_last_day, "total_cases")

5.3.2 Plot : Total COVID-19 Deaths
bar_plot(df_last_day, "total_deaths")

5.3.3 Plot : Deaths per Million
bar_plot(df_last_day, "total_deaths_per_million")

5.3.4 Plot : Cases per Million
bar_plot(df_last_day, "total_cases_per_million")

5.3.5 Plot : Healthcare Clustering (Heatmap)
df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
df_scaled <- scale(df_clust[, -ncol(df_clust)])
df_scaled <- as.data.frame(df_scaled)
rownames(df_scaled) <- df_clust$location
pheatmap(
df_scaled,
scale = "row",
cutree_rows = 3,
cutree_cols = 4,
fontsize_row = 5.5,
fontsize_col = 10,
angle_col = 45
)

5.3.6 Plot : Hospital Beds per 1,000 People
bar_plot(df_last_day, "hospital_beds_per_thousand")

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia
ggplotly(line_plot(df, "total_deaths_per_million"))
ggplotly(line_plot(df, "total_cases_per_million"))
ggplotly(line_plot(df, "hospital_beds_per_thousand"))
ggplotly(line_plot(df, "new_cases_per_million"))
ggplotly(line_plot(df, "new_deaths_per_million"))
ggplotly(line_plot(df, "case_fatality_ratio"))
5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)
countr <- c("China", "India", "Malaysia", "Japan")
df_selective <- df %>% filter(location %in% countr)
p <- ggplot(df_selective, aes(x = date, y = reproduction_rate, colour = location)) +
geom_line() +
labs(
title = "COVID-19 Reproduction Rate in Malaysia, China, India, and Japan Over Time",
x = "Date",
y = "Reproduction Rate"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)
It is interesting to see that the reproduction-rate curves of India,
Malaysia, and China are closely similar; one possible explanation is
that Malaysia is a popular destination for Chinese and Indian citizens.
5.3.10 Plot : Case Fatality Ratio by Country
bar_plot(df_last_day, "case_fatality_ratio")

5.4 Conclusion
This analysis shows how COVID-19 affected Asian countries differently
based on death rates, case counts, and healthcare resources. Malaysia’s
performance is compared with best- and worst-case countries across key
metrics. This provides a foundation for future modeling and regional
healthcare readiness assessment.
6.0 Data Model : Clustering
6.1. Load Libraries
library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)
library(tidyr)
6.2. Load Dataset
df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$case_fatality_ratio <- df$total_deaths/df$total_cases
# --- Define Time Points ---
df_first <- df %>%
group_by(location) %>%
filter(date == "2020-08-01") %>%
ungroup()
df_postvax <- df %>%
group_by(location) %>%
filter(date == "2021-03-01") %>%
ungroup()
df_latest <- df %>%
group_by(location) %>%
filter(date == "2024-08-04") %>%
ungroup()
6.3. First Date
# Feature Engineering
death_case_features <- df_first %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

The elbow plot is ambiguous, so we check the silhouette score to find
the optimal number of clusters k.
# Helper function to find the best k
find_best_k <- function(data) {
best_k <- NA
best_score <- -Inf
for (k in 2:5) {
set.seed(123)
km <- kmeans(data, centers = k, nstart = 25)
sil <- silhouette(km$cluster, dist(data))
score <- mean(sil[, 3])
if (score > best_score) {
best_score <- score
best_k <- k
}
}
return(list(k = best_k, score = best_score))
}
find_best_k(scaled_dc)
## $k
## [1] 3
##
## $score
## [1] 0.5107536
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 49.8 14.1 0.87
## 2 2 1690. 17.7 0.965
## 3 3 19108. 127. 0.838
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Yemen"
##
## Cluster 2:
## [1] "Afghanistan" "Azerbaijan" "Bangladesh"
## [4] "Bhutan" "Brunei" "Cambodia"
## [7] "China" "East Timor" "Georgia"
## [10] "India" "Indonesia" "Iraq"
## [13] "Israel" "Japan" "Jordan"
## [16] "Kazakhstan" "Laos" "Lebanon"
## [19] "Malaysia" "Maldives" "Mongolia"
## [22] "Myanmar" "Nepal" "Pakistan"
## [25] "Palestine" "Philippines" "Saudi Arabia"
## [28] "Singapore" "South Korea" "Sri Lanka"
## [31] "Tajikistan" "Thailand" "Turkey"
## [34] "United Arab Emirates" "Vietnam"
##
## Cluster 3:
## [1] "Armenia" "Bahrain" "Iran" "Kuwait" "Oman" "Qatar"
- Cluster 1 has a high case fatality ratio even though its total cases
and deaths are not high
- Cluster 2 has a high reproduction rate
- Cluster 3 has high total cases and total deaths
label_map <- data.frame(
Cluster = factor(1:3),
ClusterName = c(
"Cluster 1: High Fatality",
"Cluster 2: High Reproduction",
"Cluster 3: High Cases & Deaths"
)
)
# Join cluster labels to the data
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# Plot interactive choropleth
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:3,
ticktext = label_map$ClusterName
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2020-08-01)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60))
)
)
6.4. Post Vaccination
# Feature Engineering
death_case_features <- df_postvax %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 3
##
## $score
## [1] 0.4388714
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])
## [1] 0.4388714
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 67.5 18.8 1.71
## 2 2 49888. 515. 1.12
## 3 3 6473. 60.1 0.914
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Yemen"
##
## Cluster 2:
## [1] "Armenia" "Azerbaijan" "Bahrain" "Georgia" "Iran"
## [6] "Israel" "Jordan" "Kuwait" "Lebanon" "Oman"
## [11] "Palestine" "Qatar" "Turkey"
##
## Cluster 3:
## [1] "Afghanistan" "Bangladesh" "Bhutan"
## [4] "Brunei" "Cambodia" "China"
## [7] "East Timor" "India" "Indonesia"
## [10] "Iraq" "Japan" "Kazakhstan"
## [13] "Laos" "Malaysia" "Maldives"
## [16] "Mongolia" "Myanmar" "Nepal"
## [19] "Pakistan" "Philippines" "Saudi Arabia"
## [22] "Singapore" "South Korea" "Sri Lanka"
## [25] "Tajikistan" "Thailand" "United Arab Emirates"
## [28] "Uzbekistan" "Vietnam"
- Cluster 1: Low total cases and deaths, high reproduction rate and
case fatality ratio
- Cluster 2: High total cases and deaths
- Cluster 3: In the middle, with a low reproduction rate
# Short labels for the colorbar, full names for hover
label_map <- data.frame(
Cluster = factor(1:3),
ClusterLabel = c("High R+F", "High C+D", "Moderate"),
ClusterName = c(
"Cluster 1: Low Cases & Deaths, High Reproduction & Fatality",
"Cluster 2: High Cases & Deaths",
"Cluster 3: Moderate, Low Reproduction"
)
)
# Merge and prep data
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# Choropleth plot
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:3,
ticktext = label_map$ClusterLabel, # Use short labels here
thickness = 10,
len = 0.4
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2021-03-01)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60)),
domain = list(x = c(0, 0.95), y = c(0, 1))
),
margin = list(l = 0, r = 0, t = 30, b = 0)
)
6.5. Last Date
# Feature Engineering
death_case_features <- df_latest %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 2
##
## $score
## [1] 0.6776519
We will choose k = 2 because it has the highest silhouette score.
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 2, nstart = 25) # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])
## [1] 0.6776519
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 2 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 1071. 38.3 0.398
## 2 2 164030. 757. 0.965
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Tajikistan" "Yemen"
##
## Cluster 2:
## [1] "Afghanistan" "Armenia" "Azerbaijan"
## [4] "Bahrain" "Bangladesh" "Bhutan"
## [7] "Brunei" "Cambodia" "China"
## [10] "East Timor" "Georgia" "India"
## [13] "Indonesia" "Iran" "Iraq"
## [16] "Israel" "Japan" "Jordan"
## [19] "Kazakhstan" "Kuwait" "Kyrgyzstan"
## [22] "Laos" "Lebanon" "Malaysia"
## [25] "Maldives" "Mongolia" "Myanmar"
## [28] "Nepal" "Oman" "Pakistan"
## [31] "Palestine" "Philippines" "Qatar"
## [34] "Saudi Arabia" "Singapore" "South Korea"
## [37] "Sri Lanka" "Thailand" "Turkey"
## [40] "United Arab Emirates" "Uzbekistan" "Vietnam"
- Cluster 1: Low reproduction rate, higher case fatality ratio
- Cluster 2: High total cases, deaths, and reproduction rate
library(dplyr)
library(plotly)
# === Cluster Naming ===
label_map <- data.frame(
Cluster = factor(1:2),
ClusterLabel = c("Low R + High F", "High C + D + R"),
ClusterName = c(
"Cluster 1: Low Reproduction, High Fatality",
"Cluster 2: High Cases, Deaths, Reproduction"
)
)
# === Merge Labels and Prepare Hover Text ===
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# === Plot Choropleth ===
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:2,
ticktext = label_map$ClusterLabel,
thickness = 10,
len = 0.4
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2024-08-04)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60)),
domain = list(x = c(0, 0.95), y = c(0, 1))
),
margin = list(l = 0, r = 0, t = 30, b = 0)
)
7.0 Predictive Model
7.1. Introduction
In this section, we employ Support Vector Regression (SVR) to model
and predict the propagation of COVID-19. SVR is particularly suitable
for modeling non-linear relationships and works well in high-dimensional
spaces, making it ideal for epidemiological data like COVID-19
spread.
7.2. Objective
Model the number of total COVID-19 cases based on different
features such as population density.
Evaluate the model’s performance using metrics like Mean Squared
Error (MSE) or R² score.
Assess how well SVR captures the non-linear trends of virus
spread, and whether it can be used for short-term forecasting.
7.3. Load Libraries
library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)
library(tidyr)
library(e1071)
library(caret)
library(kernlab)
7.4. Load Dataset
df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$date <- as.Date(df$date)
# Calculate Case Fatality Ratio
df <- df %>%
mutate(case_fatality_ratio = case_when(
total_cases == 0 & total_deaths == 0 ~ 0,
TRUE ~ total_deaths / total_cases
))
# Filter for early pandemic data (before vaccination rollout)
df_first <- df %>%
filter(date <= as.Date("2020-12-01"))
7.5. Modeling
The tuned RBF-kernel SVR reported the following hyperparameters and
test-set performance:
sigma = 0.35813, C = 4
Test RMSE: 2461.042
Test R²: 0.9038621
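The training code itself was not captured in this report; below is a minimal sketch of how such a model could be fit with caret. The feature set, the 80/20 split, and the cross-validation setup are assumptions; only the target variable, the RBF kernel, and the reported sigma/C and test metrics come from the output above.
# Minimal sketch (assumptions noted above); libraries loaded in Section 7.3.
set.seed(123)
model_df <- df_first %>%
select(total_cases_per_million, population_density, median_age,
gdp_per_capita, stringency_index, reproduction_rate) %>%
drop_na()
# 80/20 train/test split (assumed).
train_idx  <- createDataPartition(model_df$total_cases_per_million,
p = 0.8, list = FALSE)
train_data <- model_df[train_idx, ]
test_data  <- model_df[-train_idx, ]
# SVR with RBF kernel, tuned over C by 5-fold cross-validation.
svr_model <- train(
total_cases_per_million ~ .,
data = train_data,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10
)
svr_model$bestTune   # sigma and C of the selected model
# Test-set evaluation: RMSE and R² (squared correlation).
pred <- predict(svr_model, newdata = test_data)
rmse <- sqrt(mean((pred - test_data$total_cases_per_million)^2))
r2   <- cor(pred, test_data$total_cases_per_million)^2
cat("Test RMSE:", rmse, " Test R²:", r2, "\n")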
7.6. Results
plot_df <- data.frame(
Actual = test_data$total_cases_per_million,
Predicted = pred
)
# Plot
ggplot(plot_df, aes(x = Actual, y = Predicted)) +
geom_point(color = "steelblue", size = 2, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "SVR Prediction vs Actual: Total Cases per Million",
x = "Actual Total Cases per Million",
y = "Predicted Total Cases per Million"
) +
theme_minimal(base_size = 14)

8.0 Summary
8.1. Key Findings
- Impact Variation Across Clusters:
- Early pandemic (2020): Countries varied significantly by fatality
rates, reproduction rates, and cases/deaths.
- Post-vaccination (2021): Clear differentiation emerged in severity
and reproduction rates, indicating varying effectiveness of pandemic
responses.
- Latest scenario (2024): Countries largely grouped into high
cases/deaths/reproduction vs. low reproduction/high fatality
clusters.
- SVR Model Effectiveness:
- The Support Vector Regression (SVR) model accurately captured
COVID-19 spread trends.
- High predictive performance indicated by:
- RMSE ≈ 2461
- R² ≈ 0.90
- Implications for Policy and Resource Allocation:
- Insights highlight the importance of rapid response and targeted
resource deployment.
- Clustering can guide proactive policy-making and healthcare
infrastructure investments.
- Predictive modeling can significantly aid short-term forecasting and
resource planning.
8.2. Methodology Summary
- Data Collection
- Acquired comprehensive COVID-19 dataset from Our World in Data
(OWID) covering daily metrics across Asian countries.
- Data Preparation
- Missing Values: Dropped columns exceeding 50% missing data
threshold.
- Imputation: Used weighted median approach by similarity to handle
remaining missing data.
- Data Transformation: Normalized values per million and smoothed
daily data.
- Exploratory Data Analysis (EDA)
- Visualized key metrics (total cases/deaths, healthcare resources,
etc.).
- Compared Malaysia with extremes across various metrics for
insights.
- Clustering Analysis
- Applied K-Means Clustering for different pandemic phases: Early
pandemic (2020), Post-vaccination (2021), Latest scenario (2024).
- Evaluated optimal clusters via silhouette scores.
- Predictive Modelling (SVR)
- Implemented Support Vector Regression (SVR) with RBF kernel to
predict COVID-19 spread.
- Evaluated model accuracy using RMSE and R² metrics.
- Interpretation & Recommendations
- Analysed cluster variations and predictive model outcomes.
- Provided insights on pandemic response effectiveness and healthcare
resource allocation.