
WQD7004 Programming for Data Science
TOPIC:
Understanding COVID-19 Burden and Spread: A Clustering and Predictive
Modeling Approach for Asian Countries
Group Project
| Chadli Rayane | 24075296 |
| Farras Azelya Putri | S2007298 |
| Isma Adlin Binti Ismail | 24088157 |
| Wang Zheng | 24082308 |
| Mareeha Sultana Mohamed | 24071864 |
1. INTRODUCTION
The COVID-19 pandemic affected countries differently based on several
factors such as population and population density. Asia, with its
diverse cultures and population sizes, is an interesting case study for
analyzing the spread of the virus.
This project explores COVID-19 patterns across Asian countries
through data-driven analysis and machine learning models. We aim to group
countries by severity and predict viral spread trends to uncover
critical insights for pandemic preparedness across different time
periods.
Source: Our World in Data (OWID COVID-19 Dataset)
2. Project Questions:
Clustering: Which Asian countries suffered the most during the
pandemic, grouped by metrics such as mortality rates, case rates, and
overall severity of COVID-19 impact?
Regression: Can we create a model to analyze and predict the
propagation of the virus?
3. Data Understanding:
3.1. Dimension:
covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)
Dataset Description:
Size: 429,435 rows × 67 columns
Granularity: Daily records per country
Key Features: Includes indicators such as COVID-19 cases, deaths,
vaccinations, testing rates, and population statistics.
Format: CSV
3.2. Content:
iso_code
continent
location
date
total_cases
new_cases
new_cases_smoothed
total_deaths
new_deaths
new_deaths_smoothed
total_cases_per_million
new_cases_per_million
new_cases_smoothed_per_million
total_deaths_per_million
new_deaths_per_million
new_deaths_smoothed_per_million
reproduction_rate
icu_patients
icu_patients_per_million
hosp_patients
hosp_patients_per_million
weekly_icu_admissions
weekly_icu_admissions_per_million
weekly_hosp_admissions
weekly_hosp_admissions_per_million
total_tests
new_tests
total_tests_per_thousand
new_tests_per_thousand
new_tests_smoothed
new_tests_smoothed_per_thousand
positive_rate
tests_per_case
tests_units
total_vaccinations
people_vaccinated
people_fully_vaccinated
total_boosters
new_vaccinations
new_vaccinations_smoothed
total_vaccinations_per_hundred
people_vaccinated_per_hundred
people_fully_vaccinated_per_hundred
total_boosters_per_hundred
new_vaccinations_smoothed_per_million
new_people_vaccinated_smoothed
new_people_vaccinated_smoothed_per_hundred
stringency_index
population_density
median_age
aged_65_older
aged_70_older
gdp_per_capita
extreme_poverty
cardiovasc_death_rate
diabetes_prevalence
female_smokers
male_smokers
handwashing_facilities
hospital_beds_per_thousand
life_expectancy
human_development_index
population
excess_mortality_cumulative_absolute
excess_mortality_cumulative
excess_mortality
excess_mortality_cumulative_per_million
The dataset contains a comprehensive collection of COVID-19-related
metrics across different countries over time. The variables can be
grouped into the following categories:
- Identification and Geographic Information:
Country or region identifiers - "iso_code" "continent" "location", plus the "date" of each record
- COVID-19 Cases and Deaths:
Cumulative counts - "total_cases" "total_deaths"
Daily values - "new_cases" "new_deaths" (plus smoothed and per-million variants)
- Hospitalizations and ICU:
ICU occupancy - "icu_patients" "icu_patients_per_million"
Hospital occupancy - "hosp_patients" "hosp_patients_per_million"
Weekly admissions - "weekly_icu_admissions" "weekly_hosp_admissions" (plus per-million variants)
- Testing Data:
Cumulative and daily tests - "total_tests" "new_tests" (plus per-thousand and smoothed variants)
Test outcomes - "positive_rate" "tests_per_case" "tests_units"
- Vaccination Data:
Cumulative counts - "total_vaccinations" "people_vaccinated"
"people_fully_vaccinated" "total_boosters"
Daily values - "new_vaccinations" "new_vaccinations_smoothed"
Normalized - "total_vaccinations_per_hundred"
"people_vaccinated_per_hundred" "people_fully_vaccinated_per_hundred"
"total_boosters_per_hundred" "new_vaccinations_smoothed_per_million"
"new_people_vaccinated_smoothed"
"new_people_vaccinated_smoothed_per_hundred"
- Public Policy Measures:
A composite measure of government response strictness -
"stringency_index"
- Demographic and Socioeconomic Data:
Population and age structure - "population" "population_density"
"median_age" "aged_65_older" "aged_70_older"
Economic and development indicators - "gdp_per_capita"
"extreme_poverty" "life_expectancy" "human_development_index"
- Health and Risk Factors:
Pre-existing conditions - "cardiovasc_death_rate" "diabetes_prevalence"
Smoking prevalence - "female_smokers" "male_smokers"
Healthcare resources - "hospital_beds_per_thousand"
"handwashing_facilities"
- Excess Mortality Metrics:
- "excess_mortality" "excess_mortality_cumulative"
"excess_mortality_cumulative_absolute"
"excess_mortality_cumulative_per_million"
3.3. Structure:
Understanding the structure of the dataset is a crucial step in data
analysis. It allows us to identify the types of variables we are dealing
with and assess what kinds of transformations or conversions are
possible. The structure also helps in detecting missing values,
inconsistencies, or anomalies before performing EDA and modeling.
The command below displays the internal structure of the dataset:
str(covid_data)
## 'data.frame': 429435 obs. of 67 variables:
## $ iso_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : chr "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
## $ total_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ total_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0 ...
## $ new_deaths_smoothed_per_million : num NA NA NA NA NA 0 0 0 0 0 ...
## $ reproduction_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : chr "" "" "" "" ...
## $ total_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: num NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num 0 0 0 0 0 0 0 0 0 0 ...
## $ population_density : num 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num 597 597 597 597 597 ...
## $ diabetes_prevalence : num 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
## $ population : num 41128772 41128772 41128772 41128772 41128772 ...
## $ excess_mortality_cumulative_absolute : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : num NA NA NA NA NA NA NA NA NA NA ...
Key variables used in this project and their data types:
- location - character
- date - character
- new_cases - integer
- new_deaths - integer
- total_deaths - integer
- people_vaccinated - numeric
- people_fully_vaccinated - numeric
- total_boosters - numeric
- population - numeric
- total_vaccinations - numeric
- new_vaccinations - integer
- stringency_index (measure of government restrictions) - numeric
- icu_patients - integer
- hosp_patients - integer
- positive_rate - numeric
- reproduction_rate - numeric
3.4. Summary:
The dataset comprises 429,435 observations and
67 columns, representing detailed COVID-19
data from multiple countries over time. Each row corresponds to a
specific date and location (country or region), making the dataset a
structured longitudinal time series.
Data Types:
- Character variables (e.g., iso_code, continent, location, date):
represent identifiers and date information.
- Numeric variables: the majority of the dataset consists of integers
(int) and continuous numbers (num).
Data Coverage:
Includes data on COVID-19 cases, deaths, testing,
hospitalizations, ICU usage, vaccinations, and public health policy
responses.
Demographic and socioeconomic indicators such as median age, GDP per
capita, population density, and human development index are included to
support deeper analysis.
Also includes healthcare capacity and risk factors such as
smoking rates, cardiovascular death rate, handwashing facilities, and
hospital beds.
Excess mortality provides insight into the pandemic's broader
impact beyond reported COVID-19 deaths.
Missing Values:
- Several variables, especially those related to ICU, hospitalization,
testing, vaccination, and excess mortality, contain many NA values,
indicating inconsistent or unavailable reporting across regions and
time.

The dataset offers a rich foundation for analyzing how COVID-19
spread across the world, how it affected people, how countries responded,
and how healthcare systems differed between regions. It is suitable for
our project because it includes up-to-date data on vaccinations, cases,
and deaths, which we can use to study how effective vaccines were in
reducing infections and saving lives.
4. Data Preparation
main_data_frame<-read.csv('owid-covid-data.csv')
4.1. Function: drop_cols()
Explanation:
- Drops columns with too many missing values.
Key Steps:
- Checks if input is NULL; returns NULL if so.
- Computes a 50% missing data threshold (drop_at).
- Analyzes summary statistics to detect columns with excessive
NAs.
- Drops any column where missing values are ≥ 50% of total rows.
- Returns a dataframe that can be imputed without introducing too much
bias.
R Code:
drop_cols<-function(dataframe=NULL){
if(is.null(dataframe)){
return(NULL)
}
col_to_drop<-c()
drop_at<-0.5*nrow(dataframe)
summary_df<-data.frame(summary(dataframe))
summary_df<-summary_df %>% select(-Var1) %>%group_by(Var2)
summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
for (p in unique(colnames(dataframe)))
{
test<-summary_df[which(summary_df$Var2==p),]
if (is.na(test[7,2])){next}
if (as.numeric(str_split(test[7,2],':')[[1]][2])>=drop_at){
col_to_drop<-append(col_to_drop,p)
}
}
dataframe<-dataframe %>% select(-all_of(col_to_drop))
return(dataframe)
}
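A short illustrative usage (in the pipeline below, drop_cols() is actually called inside keep_asia() and drop_asia(); this standalone call is for demonstration only). The function relies on dplyr and stringr:
library(dplyr)
library(stringr)
# Drop every column where at least half the rows are NA.
cleaned_df <- drop_cols(main_data_frame)
dim(cleaned_df) # typically fewer than the original 67 columns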
4.2. Function: missed_val_df()
Explanation:
- Creates a formatted table summarizing missing values per variable.
Key Steps:
- Returns NULL if the input dataframe is NULL.
- Uses miss_var_summary() from naniar to count missing values.
- Formats the summary using the gt package for better
readability.
- Adds a custom title to the table.
- Prints the formatted table.
R Code:
missed_val_df<-function(dataframe=NULL,title=''){
if(is.null(dataframe)){return(NULL)
}else{
pretty_table <- miss_var_summary(dataframe) %>%
gt() %>%
tab_header(title = title)
print(pretty_table)
}
}
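An illustrative call, assuming naniar and gt are loaded:
library(naniar)
library(gt)
# Print a gt-formatted table of missing-value counts per variable.
missed_val_df(main_data_frame, title = "Missing values: main_data_frame")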
4.3. Function: fun()
Explanation:
- Converts logical (TRUE/FALSE) values to numeric (1/0).
Key Steps:
- Returns 1 if value is TRUE, 0 if FALSE.
- Handles NULL input by returning NULL.
- Supports matrix transformation for visualizations.
R Code:
fun<-function(data=NULL){
if(is.null(data)){return(NULL)
}else{
a<- ifelse(data==TRUE,1,0)
return(a)
}
}
4.4. Function: heat_map_missing()
Explanation:
- Generates a heatmap showing the locations of missing values in a
dataframe.
Key Steps:
- Converts missing values (NA) into binary format (1 = missing).
- Uses pheatmap() to create a visual heatmap.
- Uses red for missing values and white for non-missing.
- No row/column clustering.
- Returns NULL if the input dataframe is NULL.
R Code:
heat_map_missing<-function(data__frame=NULL){
if(is.null(data__frame)){
return(NULL)
}
pheatmap(
mat = sapply(as.data.frame(is.na(data__frame)),fun),
cluster_rows = FALSE,
cluster_cols = FALSE,
color = c("white", "red"),
breaks = c(-0.01, 0.5, 1.1),
main = "Missing Values Heatmap"
)
}
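An illustrative call (requires pheatmap; plotting a slice of rows keeps the heatmap readable on a dataset this large):
library(pheatmap)
# Red cells mark missing entries; white cells mark observed values.
heat_map_missing(main_data_frame[1:1000, ])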
4.5. Function: data_char_date()
Explanation:
- Converts a character string to a proper Date object.
Key Steps:
- Extracts numeric date components from a string.
- Handles two formats: yyyy-mm-dd or Month dd, yyyy.
- Converts using as.Date().
- Returns NULL if input is NULL
R Code:
data_char_date<-function(x=NULL){
if(is.null(x)){
return(NULL)}
dated<-str_extract_all(x,"\\d+")[[1]]
if (length(dated)>2){
x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
format="%Y-%m-%d")
}else{
x<-as.Date(x, format = "%B %d, %Y")
}
return(x)
}
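An illustrative call covering both input formats the function handles (the "Month dd, yyyy" branch assumes an English locale for %B):
library(stringr)
data_char_date("2020-01-05")        # "yyyy-mm-dd" branch -> Date "2020-01-05"
data_char_date("January 5, 2020")   # "Month dd, yyyy" branch -> Date "2020-01-05"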
4.6. Function: box_plot()
Explanation:
- Plots boxplots of a selected variable (peak) against binned versions
of key socioeconomic features.
Key Steps:
- Requires a dataframe and a target variable (e.g., new_cases).
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Creates a boxplot for each binned feature group vs. the peak
variable.
- Uses ggplot2 to plot and visualize distributions.
- Prints each boxplot automatically.
R Code:
box_plot<-function(data__frame=NULL,peak=NULL){
if(is.null(data__frame) | is.null(peak)){
return(NULL)
}
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data__frame %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
#.names = "{.col}_group"
#))
data_cut <- data__frame %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
for (p in paste0(cols_to_cut,'_group')){
plo<-ggplot(data_cut,
aes(x=!!sym(p),
y=!!sym(peak),
group=!!sym(p)
)
)+
geom_boxplot(color = 'blue',
fill = 'lightblue')+
labs(x =p,
y = peak,
title = paste(peak, "across", p))+theme_minimal()
print(plo)
}
}
4.7. Function: smoothing_bins()
Explanation:
- Imputes missing smoothed data by averaging values within weekly date
bins.
Key Steps:
- Looks for columns ending in “smoothed”.
- Creates 7-day bins (cut()) based on the date column.
- For each smoothed column:
- Groups data by date bin and location.
- Replaces the smoothed column with the weekly average of its base column.
- Returns the updated dataframe with smoothed values replaced.
R Code:
smoothing_bins<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_smoothed<-colnames(df)
names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
for (names_bins_collection in names_cols_smoothed ){
main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
bins<-paste0(main,"_smoothed")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
group_by(date_group,location) %>%
mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
ungroup()%>%
select(-all_of(c('date_group',bins))) %>%
rename(!!sym(bins) := new_col)
}
return(df)
}
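An illustrative call (assumes main_data_asia is the Asian subset returned by keep_asia(), with the date column already converted to Date via data_char_date()):
# Replace each *_smoothed column with weekly per-country means of its base column.
main_data_asia <- smoothing_bins(main_data_asia)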
4.8. Function: per_million_fix()
Explanation:
- Recalculates “per million” values using the actual population and
base metric.
Key Steps:
- Detects columns ending in _per_million.
- Recomputes values as: (original value / population)×1,000,000
- Replaces original _per_million columns with recalculated ones.
- Useful for ensuring consistent normalization per population.
R Code:
per_million_fix<-function(df=NULL){
if(is.null(df)){
return(NULL)
}
names_cols_per_million<-colnames(df)
names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]
names_cols_per_million
for (names_bins_collection in names_cols_per_million ){
main<-str_split(names_bins_collection,"_per_million")[[1]][1]
bins<-paste0(main,"_per_million")
new_col<-paste0(bins,"_imputed")
df <- df %>%
mutate(!!sym(new_col) :=((!!sym(main))/population)*1000000 )%>%
select(-all_of(bins))%>%
rename(!!sym(bins) := new_col)
}
return(df)
}
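An illustrative call, continuing the same assumed pipeline:
# Recompute every *_per_million column as (base metric / population) * 1e6.
main_data_asia <- per_million_fix(main_data_asia)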
4.9. Function: fix_totals()
Explanation:
- Recalculates total cases and deaths over time for each country from
daily new values since total cases and deaths cannot decrease over
time.
Key Steps:
- Skips the date 2020-01-04 (removes it due to mismatch between
different countries).
- Iterates through each country and processes data in date order.
- On the first day, sets totals to 0.
- For subsequent days, adds new_cases and new_deaths to running
totals.
- Removes duplicates and ensures cumulative consistency.
- Prints warning if there are duplicate dates per country.
- Returns cleaned dataframe with corrected total_cases and
total_deaths.
R Code:
fix_totals<-function(df_t=NULL){
if(is.null(df_t)){
return(NULL)
}
df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
countries <- unique(df_t$location)
df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
for(country in countries){
df_sub<-df_t[df_t$location==country,]
df_sub <- df_sub[order(df_sub$date), ]
first_day<-min(df_sub$date)
last_day<-max(df_sub$date)
for (i in seq(first_day,last_day,by=1)){
if(i==first_day){
df_sub$total_cases[df_sub$date == i] <- 0
df_sub$total_deaths[df_sub$date == i] <- 0
}else{
df_sub$total_cases[df_sub$date==(i)]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
df_sub$total_deaths[df_sub$date==(i)]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
}
}
print(paste(country,'--->',any(duplicated(df_sub$date)))
)
df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
}
return(df_t)
}
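An illustrative call, continuing the same assumed pipeline (it prints one "country ---> FALSE" line per country when no duplicate dates remain):
# Rebuild total_cases and total_deaths cumulatively from new_cases/new_deaths.
main_data_asia <- fix_totals(main_data_asia)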
4.10. Function: drop_asia()
Explanation:
- Removes Asian countries from the dataset and handles missing values
(use during testing only thus was not included in the main code).
Key Steps:
- Filters out rows where continent == ‘Asia’.
- Displays missing value summaries before and after column
removal.
- Uses drop_cols() to remove columns with too many missing values from
both the original and filtered datasets.
- Returns the non-Asian subset of the data.
R Code:
drop_asia<-function(main_data_frame){
main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print( missed_val_df(main_data_remaining,'missing main_data_remaining'))
main_data_remaining<-drop_cols(main_data_remaining)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))
return(main_data_remaining)
}
4.11. Function: keep_asia()
Explanation:
- Keeps only Asian countries in the dataset and handles missing values
(Area of interest).
Key Steps:
- Filters the dataset to include only rows where continent ==
‘Asia’.
- Displays missing value summaries before and after cleaning.
- Uses drop_cols() to remove columns with excessive missing
values.
- Returns the Asian subset of the data.
R Code:
keep_asia<-function(main_data_frame){
main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]
print(missed_val_df(main_data_frame,'missing main_data_frame'))
print(missed_val_df(main_data_asia,'missing main_data_asia'))
main_data_asia<-drop_cols(main_data_asia)
main_data_frame<-drop_cols(main_data_frame)
print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))
return(main_data_asia)
}
4.12. Function: imputate_first_part()
Explanation:
- Cleans and imputes missing values for the Asian dataset using
weighted median imputation by similarity (missing values are imputed
from similar rows, based on different features, and a weighted average
of the candidate imputations is taken).
Key Steps:
- Converts the date column into proper Date format using
data_char_date().
- Bins population, gdp_per_capita, and life_expectancy into 4
log-scaled groups.
- Removes rows where gdp_per_capita_group is missing.
- Identifies columns with missing values and imputes them using
weighted_median_imputation().
- Drops binning columns (*_group) before returning.
- Returns the cleaned and imputed Asian dataset.
R Code:
imputate_first_part<-function(data_set_asia){
data_set_asia<-data_set_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date))))
cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
#data_cut <- data_set_asia %>%
# mutate(across(
# all_of(cols_to_cut),
# ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
# .names = "{.col}_group"
#))
data_cut <- data_set_asia %>%
mutate(across(
all_of(cols_to_cut),
~ cut(log10(.x + 1), breaks = 4, labels = FALSE), # +1 avoids log(0)
.names = "{.col}_group"
))
data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
for(col_named in cols_to_imputate){
data_cut<-weighted_median_imputation(data_cut,col_named)
}
cols_original<-paste0(cols_to_cut,"_group")
data_set_asia_clean_except__bins<-data_cut %>%select(-all_of(cols_original))
#write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
return(data_set_asia_clean_except__bins)
}
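weighted_median_imputation() is called above, but its definition was not captured in this report. The sketch below is a hypothetical, simplified reconstruction, assuming it fills each missing value with the robust weighted median (reproduction_function(), defined next) of rows falling in the same population, GDP, and life-expectancy bins:
# Hypothetical reconstruction -- the original definition is not shown here.
# Similar rows are assumed to be those sharing the *_group bin columns
# created in imputate_first_part(); missing values in `col_name` are filled
# with the robust weighted median of their group.
weighted_median_imputation <- function(df, col_name) {
df %>%
group_by(population_group, gdp_per_capita_group, life_expectancy_group) %>%
mutate(!!sym(col_name) := ifelse(
is.na(!!sym(col_name)),
reproduction_function(!!sym(col_name)), # robust weighted median (Section 4.14)
!!sym(col_name)
)) %>%
ungroup()
}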
4.14. Function: reproduction_function()
Explanation:
- Computes a robust weighted median by iteratively removing outliers
and calculating median values from non-outliers.
Key Steps:
- Removes outliers using the IQR (Interquartile Range) method.
- For each iteration, calculates the median of non-outliers and
assigns a weight based on the proportion of non-outliers.
- Repeats on the remaining outliers until fewer than 4 values
remain.
- Returns the weighted sum of medians from all iterations.
R Code:
reproduction_function<-function(s){
median_data<-c()
weights<-c()
if (is.data.frame(s)) {
s <- s[[1]]
}
x <- s[!is.na(s)]
size_s<-length(x)
x<-as.numeric(x)
while(TRUE)
{
if(length(x)<4){break}
Q1<-quantile(x,0.25,na.rm=TRUE)
Q3<-quantile(x,0.75,na.rm=TRUE)
IQR<-Q3-Q1
lower_bound<-Q1-1.5*IQR
higher_bound<-Q3+1.5*IQR
non_outliers<-x[x>=lower_bound & x<=higher_bound]
outliers<-x[x<lower_bound | x>higher_bound]
a<-non_outliers
b<-length(non_outliers)/size_s
weights<-append(weights,b)
a<-median(a,na.rm=TRUE)
median_data<-append(median_data,a)
x<-outliers
}
median_data <- as.numeric(median_data)
weights <- as.numeric(weights)
return(sum(median_data*weights, na.rm = TRUE))
}
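A small worked example on made-up numbers: for the vector below, the first IQR pass keeps {1, 2, 2, 3, 3, 4} (median 2.5, weight 6/8 = 0.75) and flags {100, 120} as outliers; since fewer than 4 values then remain, the loop stops and the function returns 2.5 * 0.75 = 1.875.
# Robust weighted median of a vector containing extreme outliers.
reproduction_function(c(1, 2, 2, 3, 3, 4, 100, 120)) # 1.875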
5. Exploratory Data Analysis (EDA)
This report presents an exploratory data analysis of COVID-19
indicators in various Asian countries. The focus is on identifying
extreme cases, comparing Malaysia to others, and understanding
healthcare burden using clustering and visualizations.
5.1. Load Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggcorrplot)
library(plotly)
library(pheatmap)
library(ggtext)
5.2. Load Dataset
df <- read.csv('covid_data_set_asia_fixed_final.csv')
df <- df %>%
mutate(case_fatality_ratio = case_when(
total_cases != 0 ~ total_deaths / total_cases,
TRUE ~ 0
))
df$date <- as.Date(df$date)
df_last_day <- df %>%
group_by(location) %>%
filter(date == max(date)) %>%
ungroup() %>%
filter(total_cases > 0)
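The bar_plot() and line_plot() helpers used throughout Section 5.3 were not captured in this report. The sketches below are plausible reconstructions: bar_plot() as a sorted horizontal bar chart of one metric per country, and line_plot() (per the "Daily Extremes vs. Malaysia" heading) as a time series comparing Malaysia with the countries holding the highest and lowest final values of the metric. The exact aesthetics and comparison logic are assumptions.
# Hypothetical reconstructions -- the original helper definitions are not
# shown in this report.
# Horizontal bar chart of one metric per country, sorted by value.
bar_plot <- function(data, metric) {
ggplot(data, aes(x = reorder(location, .data[[metric]]),
y = .data[[metric]])) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(x = "Country", y = metric, title = paste(metric, "by country")) +
theme_minimal(base_size = 8)
}
# Daily time series for Malaysia plus the countries with the highest and
# lowest final value of the metric (assumed comparison logic).
line_plot <- function(data, metric) {
last_vals <- data %>%
group_by(location) %>%
filter(date == max(date)) %>%
ungroup()
extremes <- c(
last_vals$location[which.max(last_vals[[metric]])],
last_vals$location[which.min(last_vals[[metric]])]
)
data %>%
filter(location %in% c("Malaysia", extremes)) %>%
ggplot(aes(x = date, y = .data[[metric]], colour = location)) +
geom_line() +
labs(x = "Date", y = metric, title = paste(metric, "over time")) +
theme_minimal()
}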
5.3. Visualization
5.3.1 Plot : Total COVID-19 Cases
bar_plot(df_last_day, "total_cases")

5.3.2 Plot : Total COVID-19 Deaths
bar_plot(df_last_day, "total_deaths")

5.3.3 Plot : Deaths per Million
bar_plot(df_last_day, "total_deaths_per_million")

5.3.4 Plot : Cases per Million
bar_plot(df_last_day, "total_cases_per_million")

5.3.5 Plot : Healthcare Clustering (Heatmap)
df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
df_scaled <- scale(df_clust[, -ncol(df_clust)])
df_scaled <- as.data.frame(df_scaled)
rownames(df_scaled) <- df_clust$location
pheatmap(
df_scaled,
scale = "row",
cutree_rows = 3,
cutree_cols = 4,
fontsize_row = 5.5,
fontsize_col = 10,
angle_col = 45
)

5.3.6 Plot : Hospital Beds per 1,000 People
bar_plot(df_last_day, "hospital_beds_per_thousand")

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia
ggplotly(line_plot(df, "total_deaths_per_million"))
ggplotly(line_plot(df, "total_cases_per_million"))
ggplotly(line_plot(df, "hospital_beds_per_thousand"))
ggplotly(line_plot(df, "new_cases_per_million"))
ggplotly(line_plot(df, "new_deaths_per_million"))
ggplotly(line_plot(df, "case_fatality_ratio"))
5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)
countr <- c("China", "India", "Malaysia", "Japan")
df_selective <- df %>% filter(location %in% countr)
p <- ggplot(df_selective, aes(x = date, y = reproduction_rate, colour = location)) +
geom_line() +
labs(
title = "COVID-19 Reproduction Rate in Malaysia, China, India, and Japan Over Time",
x = "Date",
y = "Reproduction Rate"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)
It is interesting to see that the reproduction-rate curves of India,
Malaysia, and China are closely similar; one possible explanation is
that Malaysia is a popular destination for Chinese and Indian citizens.
5.3.10 Plot : Case Fatality Ratio by Country
bar_plot(df_last_day, "case_fatality_ratio")

5.4 Conclusion
This analysis shows how COVID-19 affected Asian countries differently
based on death rates, case counts, and healthcare resources. Malaysia’s
performance is compared with best- and worst-case countries across key
metrics. This provides a foundation for future modeling and regional
healthcare readiness assessment.
6.0 Data Model : Clustering
6.1. Load Libraries
library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)
library(tidyr)
6.2. Load Dataset
df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$case_fatality_ratio <- df$total_deaths/df$total_cases
# --- Define Time Points ---
df_first <- df %>%
group_by(location) %>%
filter(date == "2020-08-01") %>%
ungroup()
df_postvax <- df %>%
group_by(location) %>%
filter(date == "2021-03-01") %>%
ungroup()
df_latest <- df %>%
group_by(location) %>%
filter(date == "2024-08-04") %>%
ungroup()
6.3. First Date
# Feature Engineering
death_case_features <- df_first %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

The elbow plot is ambiguous, so we check the silhouette score to find
the optimal number of clusters k.
# Helper function to find the best k
find_best_k <- function(data) {
best_k <- NA
best_score <- -Inf
for (k in 2:5) {
set.seed(123)
km <- kmeans(data, centers = k, nstart = 25)
sil <- silhouette(km$cluster, dist(data))
score <- mean(sil[, 3])
if (score > best_score) {
best_score <- score
best_k <- k
}
}
return(list(k = best_k, score = best_score))
}
find_best_k(scaled_dc)
## $k
## [1] 3
##
## $score
## [1] 0.5107536
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 49.8 14.1 0.87
## 2 2 1690. 17.7 0.965
## 3 3 19108. 127. 0.838
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Yemen"
##
## Cluster 2:
## [1] "Afghanistan" "Azerbaijan" "Bangladesh"
## [4] "Bhutan" "Brunei" "Cambodia"
## [7] "China" "East Timor" "Georgia"
## [10] "India" "Indonesia" "Iraq"
## [13] "Israel" "Japan" "Jordan"
## [16] "Kazakhstan" "Laos" "Lebanon"
## [19] "Malaysia" "Maldives" "Mongolia"
## [22] "Myanmar" "Nepal" "Pakistan"
## [25] "Palestine" "Philippines" "Saudi Arabia"
## [28] "Singapore" "South Korea" "Sri Lanka"
## [31] "Tajikistan" "Thailand" "Turkey"
## [34] "United Arab Emirates" "Vietnam"
##
## Cluster 3:
## [1] "Armenia" "Bahrain" "Iran" "Kuwait" "Oman" "Qatar"
- Cluster 1 has a high case fatality ratio even though its total cases
and deaths are not high
- Cluster 2 has a high reproduction rate
- Cluster 3 has high total cases and total deaths
label_map <- data.frame(
Cluster = factor(1:3),
ClusterName = c(
"Cluster 1: High Fatality",
"Cluster 2: High Reproduction",
"Cluster 3: High Cases & Deaths"
)
)
# Join cluster labels to the data
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# Plot interactive choropleth
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:3,
ticktext = label_map$ClusterName
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2020-08-01)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60))
)
)
6.4. Post Vaccination
# Feature Engineering
death_case_features <- df_postvax %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 3
##
## $score
## [1] 0.4388714
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])
## [1] 0.4388714
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 67.5 18.8 1.71
## 2 2 49888. 515. 1.12
## 3 3 6473. 60.1 0.914
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Yemen"
##
## Cluster 2:
## [1] "Armenia" "Azerbaijan" "Bahrain" "Georgia" "Iran"
## [6] "Israel" "Jordan" "Kuwait" "Lebanon" "Oman"
## [11] "Palestine" "Qatar" "Turkey"
##
## Cluster 3:
## [1] "Afghanistan" "Bangladesh" "Bhutan"
## [4] "Brunei" "Cambodia" "China"
## [7] "East Timor" "India" "Indonesia"
## [10] "Iraq" "Japan" "Kazakhstan"
## [13] "Laos" "Malaysia" "Maldives"
## [16] "Mongolia" "Myanmar" "Nepal"
## [19] "Pakistan" "Philippines" "Saudi Arabia"
## [22] "Singapore" "South Korea" "Sri Lanka"
## [25] "Tajikistan" "Thailand" "United Arab Emirates"
## [28] "Uzbekistan" "Vietnam"
- Cluster 1: Low total cases and deaths, high reproduction rate and
case fatality ratio
- Cluster 2: High total cases and deaths
- Cluster 3: In the middle, with a low reproduction rate
# Short labels for the colorbar, full names for hover
label_map <- data.frame(
Cluster = factor(1:3),
ClusterLabel = c("High R+F", "High C+D", "Moderate"),
ClusterName = c(
"Cluster 1: Low Cases & Deaths, High Reproduction & Fatality",
"Cluster 2: High Cases & Deaths",
"Cluster 3: Moderate, Low Reproduction"
)
)
# Merge and prep data
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# Choropleth plot
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:3,
ticktext = label_map$ClusterLabel, # Use short labels here
thickness = 10,
len = 0.4
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2021-03-01)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60)),
domain = list(x = c(0, 0.95), y = c(0, 1))
),
margin = list(l = 0, r = 0, t = 30, b = 0)
)
6.5. Last Date
# Feature Engineering
death_case_features <- df_latest %>%
select(location,
total_cases_per_million,
total_deaths_per_million,
reproduction_rate,
case_fatality_ratio) %>%
drop_na()
locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 2
##
## $score
## [1] 0.6776519
We will choose k = 2 because it has the highest silhouette score.
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 2, nstart = 25) # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])
## [1] 0.6776519
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
mutate(Cluster = factor(km_dc$cluster))
# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
group_by(Cluster) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 2 × 5
## Cluster total_cases_per_million total_deaths_per_million reproduction_rate
## <fct> <dbl> <dbl> <dbl>
## 1 1 1071. 38.3 0.398
## 2 2 164030. 757. 0.965
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE,
geom = "point", palette = "jco") +
labs(title = "Asian Countries Clustered by COVID Impact") +
theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
cat(paste0("Cluster ", i, ":\n"))
print(countries_by_cluster[[i]])
cat("\n")
}
## Cluster 1:
## [1] "Tajikistan" "Yemen"
##
## Cluster 2:
## [1] "Afghanistan" "Armenia" "Azerbaijan"
## [4] "Bahrain" "Bangladesh" "Bhutan"
## [7] "Brunei" "Cambodia" "China"
## [10] "East Timor" "Georgia" "India"
## [13] "Indonesia" "Iran" "Iraq"
## [16] "Israel" "Japan" "Jordan"
## [19] "Kazakhstan" "Kuwait" "Kyrgyzstan"
## [22] "Laos" "Lebanon" "Malaysia"
## [25] "Maldives" "Mongolia" "Myanmar"
## [28] "Nepal" "Oman" "Pakistan"
## [31] "Palestine" "Philippines" "Qatar"
## [34] "Saudi Arabia" "Singapore" "South Korea"
## [37] "Sri Lanka" "Thailand" "Turkey"
## [40] "United Arab Emirates" "Uzbekistan" "Vietnam"
- Cluster 1: Low reproduction rate, higher case fatality ratio
- Cluster 2: High total cases, deaths, and reproduction rate
library(dplyr)
library(plotly)
# === Cluster Naming ===
label_map <- data.frame(
Cluster = factor(1:2),
ClusterLabel = c("Low R + High F", "High C + D + R"),
ClusterName = c(
"Cluster 1: Low Reproduction, High Fatality",
"Cluster 2: High Cases, Deaths, Reproduction"
)
)
# === Merge Labels and Prepare Hover Text ===
clustered_dc <- clustered_dc %>%
left_join(label_map, by = "Cluster") %>%
mutate(
zvalue = as.numeric(Cluster),
HoverText = paste0(location, "<br>", ClusterName)
)
# === Plot Choropleth ===
plot_geo(clustered_dc) %>%
add_trace(
z = ~zvalue,
text = ~HoverText,
locations = ~location,
locationmode = "country names",
type = "choropleth",
colorscale = "Bluered",
showscale = TRUE,
colorbar = list(
title = "Cluster",
tickvals = 1:2,
ticktext = label_map$ClusterLabel,
thickness = 10,
len = 0.4
)
) %>%
layout(
title = list(text = "COVID-19 Cluster Map in Asia (2024-08-04)", y = 0.95),
geo = list(
showframe = FALSE,
showcoastlines = TRUE,
projection = list(type = 'equirectangular'),
lonaxis = list(range = c(25, 150)),
lataxis = list(range = c(-10, 60)),
domain = list(x = c(0, 0.95), y = c(0, 1))
),
margin = list(l = 0, r = 0, t = 30, b = 0)
)
7.0 Predictive Model
7.1. Introduction
In this section, we employ Support Vector Regression (SVR) to model
and predict the propagation of COVID-19. SVR is particularly suitable
for modeling non-linear relationships and works well in high-dimensional
spaces, making it ideal for epidemiological data like COVID-19
spread.
7.2. Objective
Model the number of total COVID-19 cases based on different
features such as population density.
Evaluate the model’s performance using metrics like Mean Squared
Error (MSE) or R² score.
Assess how well SVR captures the non-linear trends of virus
spread, and whether it can be used for short-term forecasting.
7.3. Load Libraries
library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)
library(tidyr)
library(e1071)
library(caret)
library(kernlab)
7.4. Load Dataset
df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$date <- as.Date(df$date)
# Calculate Case Fatality Ratio
df <- df %>%
mutate(case_fatality_ratio = case_when(
total_cases == 0 & total_deaths == 0 ~ 0,
TRUE ~ total_deaths / total_cases
))
# Filter for early pandemic data (before vaccination rollout)
df_first <- df %>%
filter(date <= as.Date("2020-12-01"))
7.5. Modeling
The tuned RBF-kernel SVR reported the following hyperparameters and
test-set performance:
sigma = 0.35813, C = 4
Test RMSE: 2461.042
Test R²: 0.9038621
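The training code itself was not captured in this report; below is a minimal sketch of how such a model could be fit with caret. The feature set, the 80/20 split, and the cross-validation setup are assumptions; only the target variable, the RBF kernel, and the reported sigma/C and test metrics come from the output above.
# Minimal sketch (assumptions noted above); libraries loaded in Section 7.3.
set.seed(123)
model_df <- df_first %>%
select(total_cases_per_million, population_density, median_age,
gdp_per_capita, stringency_index, reproduction_rate) %>%
drop_na()
# 80/20 train/test split (assumed).
train_idx  <- createDataPartition(model_df$total_cases_per_million,
p = 0.8, list = FALSE)
train_data <- model_df[train_idx, ]
test_data  <- model_df[-train_idx, ]
# SVR with RBF kernel, tuned over C by 5-fold cross-validation.
svr_model <- train(
total_cases_per_million ~ .,
data = train_data,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 5),
tuneLength = 10
)
svr_model$bestTune   # sigma and C of the selected model
# Test-set evaluation: RMSE and R² (squared correlation).
pred <- predict(svr_model, newdata = test_data)
rmse <- sqrt(mean((pred - test_data$total_cases_per_million)^2))
r2   <- cor(pred, test_data$total_cases_per_million)^2
cat("Test RMSE:", rmse, " Test R²:", r2, "\n")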
7.6. Results
plot_df <- data.frame(
Actual = test_data$total_cases_per_million,
Predicted = pred
)
# Plot
ggplot(plot_df, aes(x = Actual, y = Predicted)) +
geom_point(color = "steelblue", size = 2, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "SVR Prediction vs Actual: Total Cases per Million",
x = "Actual Total Cases per Million",
y = "Predicted Total Cases per Million"
) +
theme_minimal(base_size = 14)

8.0 Summary
8.1. Key Findings
- Impact Variation Across Clusters:
- Early pandemic (2020): Countries varied significantly by fatality
rates, reproduction rates, and cases/deaths.
- Post-vaccination (2021): Clear differentiation emerged in severity
and reproduction rates, indicating varying effectiveness of pandemic
responses.
- Latest scenario (2024): Countries largely grouped into high
cases/deaths/reproduction vs. low reproduction/high fatality
clusters.
- SVR Model Effectiveness:
- The Support Vector Regression (SVR) model accurately captured
COVID-19 spread trends.
- High predictive performance indicated by:
- RMSE ≈ 2461
- R² ≈ 0.90
- Implications for Policy and Resource Allocation:
- Insights highlight the importance of rapid response and targeted
resource deployment.
- Clustering can guide proactive policy-making and healthcare
infrastructure investments.
- Predictive modeling can significantly aid short-term forecasting and
resource planning.
8.2. Methodology Summary
- Data Collection
- Acquired comprehensive COVID-19 dataset from Our World in Data
(OWID) covering daily metrics across Asian countries.
- Data Preparation
- Missing Values: Dropped columns exceeding 50% missing data
threshold.
- Imputation: Used weighted median approach by similarity to handle
remaining missing data.
- Data Transformation: Normalized values per million and smoothed
daily data.
- Exploratory Data Analysis (EDA)
- Visualized key metrics (total cases/deaths, healthcare resources,
etc.).
- Compared Malaysia with extremes across various metrics for
insights.
- Clustering Analysis
- Applied K-Means Clustering for different pandemic phases: Early
pandemic (2020), Post-vaccination (2021), Latest scenario (2024).
- Evaluated optimal clusters via silhouette scores.
- Predictive Modelling (SVR)
- Implemented Support Vector Regression (SVR) with RBF kernel to
predict COVID-19 spread.
- Evaluated model accuracy using RMSE and R² metrics.
- Interpretation & Recommendations
- Analysed cluster variations and predictive model outcomes.
- Provided insights on pandemic response effectiveness and healthcare
resource allocation.