
WQD7004 Programming for Data Science
TOPIC:
Understanding COVID-19 Burden and Spread: A Clustering and Predictive
Modeling Approach for Asian Countries
Group Project
| Chadli Rayane           | 24075296 |
| Farras Azelya Putri     | S2007298 |
| Isma Adlin Binti Ismail | 24088157 |
| Wang Zheng              | 24082308 |
| Mareeha Sultana Mohamed |          |
1. Sources: Our World in Data (OWID COVID-19 Dataset)
2. Project Questions:
Clustering: Which countries in Asia suffered the most during the
pandemic, based on metrics such as mortality rates, case rates, and
healthcare burden?
Regression: Can we build a model to analyze and predict the propagation
of the virus?
3. Data Understanding:
3.1. Dimension:
covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)
## [1] 429435 67
Number of rows: 429,435 (daily records per country)
Number of columns: 67 (e.g., cases, deaths, vaccinations, testing,
population)
File format: CSV
3.2. Content:
colnames(covid_data)
## [1] "iso_code"
## [2] "continent"
## [3] "location"
## [4] "date"
## [5] "total_cases"
## [6] "new_cases"
## [7] "new_cases_smoothed"
## [8] "total_deaths"
## [9] "new_deaths"
## [10] "new_deaths_smoothed"
## [11] "total_cases_per_million"
## [12] "new_cases_per_million"
## [13] "new_cases_smoothed_per_million"
## [14] "total_deaths_per_million"
## [15] "new_deaths_per_million"
## [16] "new_deaths_smoothed_per_million"
## [17] "reproduction_rate"
## [18] "icu_patients"
## [19] "icu_patients_per_million"
## [20] "hosp_patients"
## [21] "hosp_patients_per_million"
## [22] "weekly_icu_admissions"
## [23] "weekly_icu_admissions_per_million"
## [24] "weekly_hosp_admissions"
## [25] "weekly_hosp_admissions_per_million"
## [26] "total_tests"
## [27] "new_tests"
## [28] "total_tests_per_thousand"
## [29] "new_tests_per_thousand"
## [30] "new_tests_smoothed"
## [31] "new_tests_smoothed_per_thousand"
## [32] "positive_rate"
## [33] "tests_per_case"
## [34] "tests_units"
## [35] "total_vaccinations"
## [36] "people_vaccinated"
## [37] "people_fully_vaccinated"
## [38] "total_boosters"
## [39] "new_vaccinations"
## [40] "new_vaccinations_smoothed"
## [41] "total_vaccinations_per_hundred"
## [42] "people_vaccinated_per_hundred"
## [43] "people_fully_vaccinated_per_hundred"
## [44] "total_boosters_per_hundred"
## [45] "new_vaccinations_smoothed_per_million"
## [46] "new_people_vaccinated_smoothed"
## [47] "new_people_vaccinated_smoothed_per_hundred"
## [48] "stringency_index"
## [49] "population_density"
## [50] "median_age"
## [51] "aged_65_older"
## [52] "aged_70_older"
## [53] "gdp_per_capita"
## [54] "extreme_poverty"
## [55] "cardiovasc_death_rate"
## [56] "diabetes_prevalence"
## [57] "female_smokers"
## [58] "male_smokers"
## [59] "handwashing_facilities"
## [60] "hospital_beds_per_thousand"
## [61] "life_expectancy"
## [62] "human_development_index"
## [63] "population"
## [64] "excess_mortality_cumulative_absolute"
## [65] "excess_mortality_cumulative"
## [66] "excess_mortality"
## [67] "excess_mortality_cumulative_per_million"
The dataset contains a comprehensive collection of COVID-19-related
metrics across different countries over time. The variables can be
grouped into the following categories:
- Identification and Geographic Information:
Country or region identifiers - “iso_code” “continent” “location”,
plus the observation “date”
- COVID-19 Cases and Deaths:
Cumulative and daily counts - “total_cases” “new_cases” “total_deaths”
“new_deaths”, with smoothed and per-million variants
- Hospitalizations and ICU:
ICU occupancy - “icu_patients” “icu_patients_per_million”
Hospital occupancy - “hosp_patients” “hosp_patients_per_million”
Weekly admissions - “weekly_icu_admissions” “weekly_hosp_admissions”,
with per-million variants
- Testing Data:
“total_tests” “new_tests” “positive_rate” “tests_per_case”, with
per-thousand and smoothed variants
- Vaccination Data:
Cumulative counts - “total_vaccinations” “people_vaccinated”
“people_fully_vaccinated” “total_boosters”
Daily values - “new_vaccinations” “new_vaccinations_smoothed”
Normalized - “total_vaccinations_per_hundred”
“people_vaccinated_per_hundred” “people_fully_vaccinated_per_hundred”
“total_boosters_per_hundred” “new_vaccinations_smoothed_per_million”
“new_people_vaccinated_smoothed”
“new_people_vaccinated_smoothed_per_hundred”
- Public Policy Measures:
A composite measure of government response strictness -
“stringency_index”
- Demographic and Socioeconomic Data:
“population” “population_density” “median_age” “aged_65_older”
“aged_70_older” “gdp_per_capita” “extreme_poverty” “life_expectancy”
“human_development_index”
- Health and Risk Factors:
Pre-existing conditions - “cardiovasc_death_rate” “diabetes_prevalence”
Smoking prevalence - “female_smokers” “male_smokers”
Healthcare resources - “hospital_beds_per_thousand”
“handwashing_facilities”
- Excess Mortality Metrics:
“excess_mortality” “excess_mortality_cumulative”
“excess_mortality_cumulative_absolute”
“excess_mortality_cumulative_per_million”
3.3. Structure:
str(covid_data)
## 'data.frame': 429435 obs. of 67 variables:
## $ iso_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : chr "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
## $ total_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ reproduction_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr "" "" "" "" ...
## $ total_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: num NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num 0 0 0 0 0 0 0 0 0 0 ...
## $ population_density : num 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num 597 597 597 597 597 ...
## $ diabetes_prevalence : num 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
## $ population : num 41128772 41128772 41128772 41128772 41128772 ...
## $ excess_mortality_cumulative_absolute : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : num NA NA NA NA NA NA NA NA NA NA ...
Key variables and their types:
location - character
date - character
new_cases - integer
new_deaths - integer
total_deaths - integer
people_vaccinated - numeric
people_fully_vaccinated - numeric
total_boosters - numeric
population - numeric
total_vaccinations - numeric
new_vaccinations - integer
stringency_index (measure of government restrictions) - numeric
icu_patients - integer
hosp_patients - integer
positive_rate - numeric
reproduction_rate - numeric
3.4. Summary:
The dataset comprises 429,435 observations and 67 columns, representing
detailed COVID-19 data from multiple countries over time. Each row
corresponds to a specific date and location (country or region), making
the dataset structured as a longitudinal time series.
Data Types:
- Character variables (e.g., iso_code, continent, location, date):
represent identifiers and date information.
- Numeric variables: the majority of the dataset consists of integers
(int) and continuous numbers (num).
Data Coverage:
Includes data on COVID-19 cases, deaths, testing, hospitalizations,
ICU usage, vaccinations, and public health policy responses.
Demographic and socioeconomic indicators such as median age, GDP per
capita, population density, and human development index are included to
support deeper analysis.
Also includes healthcare capacity and risk factors such as smoking
rates, cardiovascular death rate, handwashing facilities, and hospital
beds.
Excess mortality provides insight into the pandemic’s broader impact
beyond reported COVID-19 deaths.
Missing Values:
- Several variables, especially those related to ICU, hospitalization,
testing, vaccination, and excess mortality, contain many NA values,
indicating inconsistent or unavailable reporting across regions and
time.
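A quick per-column missingness check illustrates this (a minimal base-R
sketch, assuming covid_data from Section 3.1):
na_share <- colMeans(is.na(covid_data))      # share of NAs per column
head(sort(na_share, decreasing = TRUE), 10)  # ten most incomplete columns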
The dataset offers a rich foundation for analyzing how COVID-19 spread
across the world, how it affected people, how countries responded, and
how healthcare systems differed between regions. It is suitable for our
project as it includes up-to-date data on vaccinations, cases, and
deaths, which we can use to study how effective the vaccines were in
reducing infections and saving lives.
4. Data Preparation
# Packages used throughout data preparation: dplyr/stringr for wrangling,
# naniar and gt for missing-value tables, pheatmap and ggplot2 for plots.
library(dplyr)
library(stringr)
library(naniar)
library(gt)
library(pheatmap)
library(ggplot2)

main_data_frame<-read.csv('owid-covid-data.csv')
4.1. Function: drop_cols()
Explanation:
- Drops columns with too many missing values.
Key Steps:
- Checks if input is NULL; returns NULL if so.
- Computes a 50% missing data threshold (drop_at).
- Analyzes summary statistics to detect columns with excessive
NAs.
- Drops any column where missing values are ≥ 50% of total rows.
- Returns the cleaned dataframe.
R Code:
drop_cols<-function(dataframe=NULL){
if(is.null(dataframe)){
return(NULL)
}
col_to_drop<-c()
drop_at<-0.5*nrow(dataframe)  # 50% missingness threshold
# summary() reports an "NA's :n" entry per column; reshape it for lookup
summary_df<-data.frame(summary(dataframe))
summary_df<-summary_df %>% select(-Var1) %>% group_by(Var2)
summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
for (p in unique(colnames(dataframe)))
{
test<-summary_df[which(summary_df$Var2==p),]
if (is.na(test[7,2])){next}  # no "NA's" row, so nothing to drop
# parse the count out of "NA's :n" and compare to the threshold
if (as.numeric(str_split(test[7,2],':')[[1]][2])>=drop_at){
col_to_drop<-append(col_to_drop,p)
}
}
dataframe<-dataframe %>% select(-all_of(col_to_drop))
return(dataframe)
}
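A quick usage sketch (the resulting column count depends on the dataset
version):
covid_reduced <- drop_cols(covid_data)  # drops columns with >= 50% NAs
dim(covid_reduced)                      # expect fewer than the original 67 columns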
4.2. Function: missed_val_df()
Explanation:
- Creates a formatted table summarizing missing values per column.
Key Steps:
- Returns NULL if the input dataframe is NULL.
- Uses miss_var_summary() from naniar to count missing values per
variable.
- Formats the summary using the gt package for better readability.
- Adds a custom title to the table via tab_header().
- Prints the formatted table (print() also returns it invisibly).
R Code:
missed_val_df<-function(dataframe=NULL,title=''){
if(is.null(dataframe)){return(NULL)
}else{
pretty_table <- miss_var_summary(dataframe) %>%
gt() %>%
tab_header(title = title)
print(pretty_table)
}
}
4.3. Function: fun()
Explanation:
- Converts logical (TRUE/FALSE) values to numeric (1/0).
Key Steps:
- Returns 1 if value is TRUE, 0 if FALSE.
- Handles NULL input by returning NULL.
- Applied column-wise (via sapply()) to the logical is.na() matrix when
building the missing-value heatmap.
R Code:
fun<-function(data=NULL){
if(is.null(data)){return(NULL)
}else{
a<- ifelse(data==TRUE,1,0)
return(a)
}
}
4.4. Function: heat_map_missing()
Explanation:
- Generates a heatmap showing the locations of missing values in a
dataframe.
Key Steps:
- Converts missing values (NA) into binary format (1 = missing).
- Uses pheatmap() to create a visual heatmap.
- Uses red for missing values and white for non-missing.
- No row/column clustering.
- Returns NULL if the input dataframe is NULL.
R Code:
heat_map_missing<-function(data__frame=NULL){
if(is.null(data__frame)){
return(NULL)
}
pheatmap(
mat = sapply(as.data.frame(is.na(data__frame)),fun),
cluster_rows = FALSE,
cluster_cols = FALSE,
color = c("white", "red"),
breaks = c(-0.01, 0.5, 1.1),
main = "Missing Values Heatmap"
)
}
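With 429,435 rows the full heatmap would not render legibly; a usage
sketch on a subset (the row range is illustrative):
heat_map_missing(covid_data[1:500, ])  # missingness pattern, first 500 rows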
4.5. Function: data_char_date()
Explanation:
- Converts a character string to a proper Date object.
Key Steps:
- Extracts numeric date components from a string.
- Handles two formats: yyyy-mm-dd or Month dd, yyyy.
- Converts using as.Date().
- Returns NULL if the input is NULL.
R Code:
data_char_date<-function(x=NULL){
if(is.null(x)){
return(NULL)}
dated<-str_extract_all(x,"\\d+")[[1]]
if (length(dated)>2){
x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
format="%Y-%m-%d")
}else{
x<-as.Date(x, format = "%B %d, %Y")
}
return(x)
}
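A quick check of both supported formats (the second call assumes an
English locale for month names):
data_char_date("2020-01-05")      # yyyy-mm-dd branch -> Date "2020-01-05"
data_char_date("March 11, 2020")  # "%B %d, %Y" branch -> Date "2020-03-11"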
4.6. Function: box_plot()
Explanation:
- Plots boxplots of a selected variable (peak) against binned versions
of key socioeconomic features.
Key Steps:
- Requires a dataframe and a target variable (e.g., new_cases).
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Creates a boxplot for each binned feature group vs. the peak
variable.
- Uses ggplot2 to plot and visualize distributions.
- Prints each boxplot automatically.
R Code:
box_plot<-function(data__frame=NULL,peak=NULL){
if(is.null(data__frame) | is.null(peak)){
return(NULL)
}
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data__frame %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
#.names = "{.col}_group"
#))
data_cut <- data__frame %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
for (p in paste0(cols_to_cut,'_group')){
plo<-ggplot(data_cut,
aes(x=!!sym(p),
y=!!sym(peak),
group=!!sym(p)
)
)+
geom_boxplot(color = 'blue',
fill = 'lightblue')+
labs(x = p,
y = peak,
title = paste(peak, "across", p)) + theme_minimal()
print(plo)
}
}
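A usage sketch against the Asian subset produced in Section 4.11 (the
object name follows that section):
box_plot(main_data_asia, peak = "new_cases")  # boxplots across population, GDP, and life-expectancy bins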
4.7. Function: smoothing_bins()
Explanation:
- Imputes missing smoothed data by averaging values within weekly date
bins.
Key Steps:
- Looks for columns ending in “smoothed”.
- Creates 7-day bins (cut()) based on the date column.
- For each smoothed column: groups data by 7-day bin and location, then
replaces the column with the weekly mean of the corresponding raw
column.
- Returns the updated dataframe with smoothed values replaced.
R Code:
smoothing_bins<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_smoothed<-colnames(df)
names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
for (names_bins_collection in names_cols_smoothed ){
main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
bins<-paste0(main,"_smoothed")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
group_by(date_group,location) %>%
mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
ungroup()%>%
select(-all_of(c('date_group',bins))) %>%
rename(!!sym(bins) := new_col)
}
return(df)
}
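The weekly binning mechanism, shown in isolation on synthetic dates (a
minimal sketch):
dates  <- as.Date("2020-01-01") + 0:20                     # three weeks of daily dates
breaks <- seq(min(dates), max(dates) + 7, by = "7 days")   # same break scheme as smoothing_bins()
table(cut(dates, breaks = breaks, include.lowest = TRUE))  # dates grouped into 7-day bins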
4.8. Function: per_million_fix()
Explanation:
- Recalculates “per million” values using the actual population and
base metric.
Key Steps:
- Detects columns ending in _per_million.
- Recomputes values as: (original value / population)×1,000,000
- Replaces original _per_million columns with recalculated ones.
- Useful for ensuring consistent normalization per population.
R Code:
per_million_fix<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_per_million<-colnames(df)
names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]
for (names_bins_collection in names_cols_per_million ){
main<-str_split(names_bins_collection,"_per_million")[[1]][1]
bins<-paste0(main,"_per_million")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(!!sym(new_col) :=((!!sym(main))/population)*1000000 )%>%
select(-all_of(bins))%>%
rename(!!sym(bins) := new_col)
}
return(df)
}
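The recalculation itself is plain arithmetic; for example:
# e.g., a country reporting 1,234 cases with a population of 33,000,000:
(1234 / 33e6) * 1e6  # ~= 37.4 cases per million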
4.9. Function: fix_totals()
Explanation:
- Recalculates total cases and deaths over time for each country from
daily new values.
Key Steps:
- Removes the date 2020-01-04 and drops duplicate (location, date)
rows.
- Iterates through each country and processes its data in date order.
- On the first day, sets totals to 0.
- On each subsequent day, sets the total to the previous day's total
plus the previous day's new_cases / new_deaths (i.e., totals as of the
start of each day).
- Prints, per country, whether any duplicate dates remain.
- Returns cleaned dataframe with corrected total_cases and
total_deaths.
R Code:
fix_totals<-function(df_t=NULL){
if(is.null(df_t)){
return(NULL)
}
df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
countries <- unique(df_t$location)
df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
for(country in countries){
df_sub<-df_t[df_t$location==country,]
df_sub <- df_sub[order(df_sub$date), ]
first_day<-min(df_sub$date)
last_day<-max(df_sub$date)
for (i in seq(first_day,last_day,by=1)){
if(i==first_day){
df_sub$total_cases[df_sub$date == i] <- 0
df_sub$total_deaths[df_sub$date == i] <- 0
}else{
df_sub$total_cases[df_sub$date==(i)]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
df_sub$total_deaths[df_sub$date==(i)]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
}
}
print(paste(country, '--->', any(duplicated(df_sub$date))))
df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
}
return(df_t)
}
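As a cross-check on the loop, the running totals can also be built in
vectorized form (a sketch; unlike the loop above, cumsum() here includes
the current day's new counts instead of lagging by one day):
totals_check <- df_t %>%
  arrange(location, date) %>%
  group_by(location) %>%
  mutate(total_cases_check  = cumsum(coalesce(new_cases, 0L)),
         total_deaths_check = cumsum(coalesce(new_deaths, 0L))) %>%
  ungroup()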
4.10. Function: drop_asia()
Explanation:
- Removes Asian countries from the dataset and handles missing
values.
Key Steps:
- Filters out rows where continent == ‘Asia’.
- Displays missing value summaries before and after column
removal.
- Uses drop_cols() to remove columns with too many missing values from
both the original and filtered datasets.
- Returns the non-Asian subset of the data.
R Code:
drop_asia<-function(main_data_frame){
main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print( missed_val_df(main_data_remaining,'missing main_data_remaining'))
main_data_remaining<-drop_cols(main_data_remaining)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))
return(main_data_remaining)
}
4.11. Function: keep_asia()
Explanation:
- Keeps only Asian countries in the dataset and handles missing
values.
Key Steps:
- Filters the dataset to include only rows where continent ==
‘Asia’.
- Displays missing value summaries before and after cleaning.
- Uses drop_cols() to remove columns with excessive missing
values.
- Returns the Asian subset of the data.
R Code:
keep_asia<-function(main_data_frame){
main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print(missed_val_df(main_data_asia,'missing main_data_asia'))
main_data_asia<-drop_cols(main_data_asia)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))
return(main_data_asia)
}
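Typical usage, splitting the raw frame into the two working subsets
(object names follow later sections):
main_data_asia      <- keep_asia(main_data_frame)  # Asian countries only
main_data_remaining <- drop_asia(main_data_frame)  # all other continents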
4.12. Function: imputate_first_part()
Explanation:
- Cleans and imputes missing values for the Asian dataset using
weighted median imputation.
Key Steps:
- Converts the date column into proper Date format using
data_char_date().
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Removes rows where gdp_per_capita_group is missing.
- Identifies columns with missing values and imputes them using
weighted_median_imputation().
- Drops binning columns (*_group) before returning.
- Returns the cleaned and imputed Asian dataset.
R Code:
imputate_first_part<-function(data_set_asia){
# Use the function argument (not the global main_data_asia); the explicit
# origin makes as.Date() on the unlisted numeric values portable across R versions.
data_set_asia<-data_set_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date)),origin="1970-01-01"))
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data_set_asia %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
# .names = "{.col}_group"
#))
data_cut <- data_set_asia %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
for(col_named in cols_to_imputate){
data_cut<-weighted_median_imputation(data_cut,col_named)
}
cols_original<-paste0(cols_to_cut,"_group")
data_set_asia_clean_except__bins<-data_cut %>%select(-all_of(cols_original))
#write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
return(data_set_asia_clean_except__bins)
}
4.14. Function: reproduction_function()
Explanation:
- Computes a robust weighted median by iteratively removing outliers
and calculating median values from non-outliers.
Key Steps:
- Removes outliers using the IQR (Interquartile Range) method.
- For each iteration, calculates the median of non-outliers and
assigns a weight based on the proportion of non-outliers.
- Repeats on the remaining outliers until fewer than 4 values
remain.
- Returns the weighted sum of medians from all iterations.
R Code:
reproduction_function<-function(s){
median_data<-c()
weights<-c()
if (is.data.frame(s)) {
s <- s[[1]]
}
x <- s[!is.na(s)]
size_s<-length(x)
x<-as.numeric(x)
while(TRUE)
{
if(length(x)<4){break}
Q1<-quantile(x,0.25,na.rm=TRUE)
Q3<-quantile(x,0.75,na.rm=TRUE)
IQR<-Q3-Q1
lower_bound<-Q1-1.5*IQR
higher_bound<-Q3+1.5*IQR
non_outliers<-x[x>=lower_bound & x<=higher_bound]
outliers<-x[x<lower_bound | x>higher_bound]
a<-non_outliers
b<-length(non_outliers)/size_s
weights<-append(weights,b)
a<-median(a,na.rm=TRUE)
median_data<-append(median_data,a)
x<-outliers
}
median_data <- as.numeric(median_data)
weights <- as.numeric(weights)
return(sum(median_data*weights, na.rm = TRUE))
}
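Behaviour on synthetic data (a sketch; the exact value depends on the
random draw):
set.seed(42)
s <- c(rnorm(100, mean = 10, sd = 1), 50, 60, 70)  # inlier bulk plus three outliers
reproduction_function(s)  # close to 10: the inlier median, down-weighted by ~100/103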
5. Exploratory Data Analysis (EDA)
This report presents an exploratory data analysis of COVID-19
indicators in various Asian countries. The focus is on identifying
extreme cases, comparing Malaysia to others, and understanding
healthcare burden using clustering and visualizations.
5.1. Load Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggcorrplot)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(pheatmap)
library(ggtext)
5.2. Load Dataset
df <- read.csv('covid_data_set_asia_fixed_final.csv')
df <- df %>%
mutate(case_fatality_ratio = case_when(
total_cases != 0 ~ total_deaths / total_cases,
TRUE ~ 0
))
df$date <- as.Date(df$date)
df_last_day <- df %>%
group_by(location) %>%
filter(date == max(date)) %>%
ungroup() %>%
filter(total_cases > 0)
5.3. Visualization
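The bar_plot() helper called in the subsections below is not defined in
this extract; a hypothetical reconstruction consistent with how it is
called (the top-20 cut-off and styling are assumptions):
bar_plot <- function(df, metric) {
  df %>%
    arrange(desc(.data[[metric]])) %>%
    slice_head(n = 20) %>%  # assumption: show the top 20 countries
    ggplot(aes(x = reorder(location, .data[[metric]]), y = .data[[metric]])) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(x = NULL, y = metric, title = paste("Top Asian countries by", metric)) +
    theme_minimal()
}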
5.3.1 Plot : Total COVID-19 Cases
bar_plot(df_last_day, "total_cases")

5.3.2 Plot : Total COVID-19 Deaths
bar_plot(df_last_day, "total_deaths")

5.3.3 Plot : Deaths per Million
bar_plot(df_last_day, "total_deaths_per_million")

5.3.4 Plot : Cases per Million
bar_plot(df_last_day, "total_cases_per_million")

5.3.5 Plot : Healthcare Clustering (Heatmap)
df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
df_scaled <- scale(df_clust[, -ncol(df_clust)])
df_scaled <- as.data.frame(df_scaled)
rownames(df_scaled) <- df_clust$location
pheatmap(
df_scaled,
scale = "row",
cutree_rows = 3,
cutree_cols = 4,
fontsize_row = 5.5,
fontsize_col = 10,
angle_col = 45
)

5.3.6 Plot : Hospital Beds per 1,000 People
bar_plot(df_last_day, "hospital_beds_per_thousand")

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia
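line_plot() is likewise not shown in this extract; a plausible sketch
matching the section title, tracing the highest and lowest countries
for the metric alongside Malaysia (the selection logic is an assumption;
the deprecation warning below suggests the original used the older size
aesthetic):
line_plot <- function(df, metric) {
  latest <- df %>%
    group_by(location) %>%
    filter(date == max(date)) %>%
    ungroup()
  hi <- latest$location[which.max(latest[[metric]])]  # country at the high extreme
  lo <- latest$location[which.min(latest[[metric]])]  # country at the low extreme
  df %>%
    filter(location %in% c(hi, lo, "Malaysia")) %>%
    ggplot(aes(x = date, y = .data[[metric]], colour = location)) +
    geom_line(linewidth = 0.6) +  # original likely used size =, per the warning below
    labs(title = paste(metric, "- extremes vs. Malaysia"), x = "Date", y = metric) +
    theme_minimal()
}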
ggplotly(line_plot(df, "total_deaths"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(line_plot(df, "total_cases"))
ggplotly(line_plot(df, "hospital_beds_per_thousand"))
ggplotly(line_plot(df, "new_cases"))
ggplotly(line_plot(df, "new_deaths"))