WQD7004 Programming for Data Science

TOPIC:

Understanding COVID-19 Burden and Spread: A Clustering and Predictive Modeling Approach for Asian Countries

Group Project

NAME STUDENT ID
Chadli Rayane 24075296
Farras Azelya Putri S2007298
Isma Adlin Binti Ismail 24088157
Wang Zheng 24082308
Mareeha Sultana Mohamed 24071864

1. INTRODUCTION

The COVID-19 pandemic affected countries differently based on several factors, such as population size and population density. Asia, with its cultural diversity and wide range of population sizes, is an interesting case study for analyzing the spread of the virus.

This project explores COVID-19 patterns across Asian countries through data-driven analysis and machine learning models. We aim to group countries by severity at different time periods and to predict viral spread trends, uncovering insights for pandemic preparedness.

Source: Our World in Data (OWID COVID-19 Dataset)

2. Project Questions:

  1. Clustering: Which groups of Asian countries suffered the most during the pandemic, based on metrics such as mortality rates, case rates, and overall severity of COVID-19 impact?

  2. Regression: Can we build a model to analyze and predict the propagation of the virus?

3. Data Understanding:

3.1. Dimension:

covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)
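The printed dimensions (consistent with the str() output in Section 3.3) are:

## [1] 429435     67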

Dataset Description:

3.2. Content:

  • iso_code
  • continent
  • location
  • date
  • total_cases
  • new_cases
  • new_cases_smoothed
  • total_deaths
  • new_deaths
  • new_deaths_smoothed
  • total_cases_per_million
  • new_cases_per_million
  • new_cases_smoothed_per_million
  • total_deaths_per_million
  • new_deaths_per_million
  • new_deaths_smoothed_per_million
  • reproduction_rate
  • icu_patients
  • icu_patients_per_million
  • hosp_patients
  • hosp_patients_per_million
  • weekly_icu_admissions
  • weekly_icu_admissions_per_million
  • weekly_hosp_admissions
  • weekly_hosp_admissions_per_million
  • total_tests
  • new_tests
  • total_tests_per_thousand
  • new_tests_per_thousand
  • new_tests_smoothed
  • new_tests_smoothed_per_thousand
  • positive_rate
  • tests_per_case
  • tests_units
  • total_vaccinations
  • people_vaccinated
  • people_fully_vaccinated
  • total_boosters
  • new_vaccinations
  • new_vaccinations_smoothed
  • total_vaccinations_per_hundred
  • people_vaccinated_per_hundred
  • people_fully_vaccinated_per_hundred
  • total_boosters_per_hundred
  • new_vaccinations_smoothed_per_million
  • new_people_vaccinated_smoothed
  • new_people_vaccinated_smoothed_per_hundred
  • stringency_index
  • population_density
  • median_age
  • aged_65_older
  • aged_70_older
  • gdp_per_capita
  • extreme_poverty
  • cardiovasc_death_rate
  • diabetes_prevalence
  • female_smokers
  • male_smokers
  • handwashing_facilities
  • hospital_beds_per_thousand
  • life_expectancy
  • human_development_index
  • population
  • excess_mortality_cumulative_absolute
  • excess_mortality_cumulative
  • excess_mortality
  • excess_mortality_cumulative_per_million

The dataset contains a comprehensive collection of COVID-19-related metrics across different countries over time. The variables can be grouped into the following categories:

    1. Identification and Geographic Information:
      • Country or region identifiers - “iso_code” “continent” “location”
    2. COVID-19 Cases and Deaths:
      • Total and daily new confirmed cases- “total_cases” “new_cases” “new_cases_smoothed”

      • Total and daily new confirmed deaths- “total_deaths” “new_deaths” “new_deaths_smoothed”

    3. Hospitalizations and ICU:
      • ICU occupancy- “icu_patients” “icu_patients_per_million”

      • Hospital occupancy- “hosp_patients” “hosp_patients_per_million”

• Weekly Admissions- “weekly_icu_admissions” “weekly_hosp_admissions”

    4. Testing Data:
      • Testing statistics- “total_tests” “new_tests” “new_tests_smoothed”

      • Normalized metrics- “total_tests_per_thousand” “positive_rate” “tests_per_case” “tests_units”

    5. Vaccination Data:
      • “total_vaccinations” “people_vaccinated” “people_fully_vaccinated” “total_boosters”

      • Daily Values- “new_vaccinations” “new_vaccinations_smoothed”

      • Normalized- “total_vaccinations_per_hundred” “people_vaccinated_per_hundred” “people_fully_vaccinated_per_hundred” “total_boosters_per_hundred” “new_vaccinations_smoothed_per_million” “new_people_vaccinated_smoothed” “new_people_vaccinated_smoothed_per_hundred”

    6. Public Policy Measures:
      • A composite measure of government response strictness- “stringency_index”
    7. Demographic and Socioeconomic Data:
      • “population” “population_density” “median_age” “aged_65_older” “aged_70_older”

      • “gdp_per_capita” “extreme_poverty” “human_development_index”

    8. Health and Risk Factors:
• “cardiovasc_death_rate” “diabetes_prevalence”

      • Smoking prevalence- “female_smokers” “male_smokers”

      • Healthcare resources- “hospital_beds_per_thousand” “handwashing_facilities”

    9. Excess Mortality Metrics:
      • “excess_mortality” “excess_mortality_cumulative” “excess_mortality_cumulative_absolute” “excess_mortality_cumulative_per_million”

    3.3. Structure:

Understanding the structure of the dataset is a crucial step in data analysis. It allows us to identify the types of variables we are dealing with and to assess what kinds of transformations or conversions are possible. The structure also helps in detecting missing values, inconsistencies, and anomalies before performing EDA and modeling.

    The command below displays the internal structure of the dataset:

    str(covid_data)
    ## 'data.frame':    429435 obs. of  67 variables:
    ##  $ iso_code                                  : chr  "AFG" "AFG" "AFG" "AFG" ...
    ##  $ continent                                 : chr  "Asia" "Asia" "Asia" "Asia" ...
    ##  $ location                                  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
    ##  $ date                                      : chr  "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
    ##  $ total_cases                               : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_cases                                 : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_cases_smoothed                        : num  NA NA NA NA NA 0 0 0 0 0 ...
    ##  $ total_deaths                              : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_deaths                                : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_deaths_smoothed                       : num  NA NA NA NA NA 0 0 0 0 0 ...
    ##  $ total_cases_per_million                   : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_cases_per_million                     : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_cases_smoothed_per_million            : num  NA NA NA NA NA 0 0 0 0 0 ...
    ##  $ total_deaths_per_million                  : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_deaths_per_million                    : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ new_deaths_smoothed_per_million           : num  NA NA NA NA NA 0 0 0 0 0 ...
    ##  $ reproduction_rate                         : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ icu_patients                              : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ icu_patients_per_million                  : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ hosp_patients                             : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ hosp_patients_per_million                 : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ weekly_icu_admissions                     : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ weekly_icu_admissions_per_million         : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ weekly_hosp_admissions                    : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ weekly_hosp_admissions_per_million        : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ total_tests                               : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_tests                                 : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ total_tests_per_thousand                  : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_tests_per_thousand                    : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_tests_smoothed                        : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_tests_smoothed_per_thousand           : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ positive_rate                             : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ tests_per_case                            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ tests_units                               : chr  "" "" "" "" ...
    ##  $ total_vaccinations                        : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ people_vaccinated                         : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ people_fully_vaccinated                   : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ total_boosters                            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_vaccinations                          : int  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_vaccinations_smoothed                 : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ total_vaccinations_per_hundred            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ people_vaccinated_per_hundred             : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ people_fully_vaccinated_per_hundred       : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ total_boosters_per_hundred                : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_vaccinations_smoothed_per_million     : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_people_vaccinated_smoothed            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ new_people_vaccinated_smoothed_per_hundred: num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ stringency_index                          : num  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ population_density                        : num  54.4 54.4 54.4 54.4 54.4 ...
    ##  $ median_age                                : num  18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
    ##  $ aged_65_older                             : num  2.58 2.58 2.58 2.58 2.58 ...
    ##  $ aged_70_older                             : num  1.34 1.34 1.34 1.34 1.34 ...
    ##  $ gdp_per_capita                            : num  1804 1804 1804 1804 1804 ...
    ##  $ extreme_poverty                           : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ cardiovasc_death_rate                     : num  597 597 597 597 597 ...
    ##  $ diabetes_prevalence                       : num  9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
    ##  $ female_smokers                            : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ male_smokers                              : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ handwashing_facilities                    : num  37.7 37.7 37.7 37.7 37.7 ...
    ##  $ hospital_beds_per_thousand                : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
    ##  $ life_expectancy                           : num  64.8 64.8 64.8 64.8 64.8 ...
    ##  $ human_development_index                   : num  0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
    ##  $ population                                : num  41128772 41128772 41128772 41128772 41128772 ...
    ##  $ excess_mortality_cumulative_absolute      : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ excess_mortality_cumulative               : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ excess_mortality                          : num  NA NA NA NA NA NA NA NA NA NA ...
    ##  $ excess_mortality_cumulative_per_million   : num  NA NA NA NA NA NA NA NA NA NA ...

    3.4. Summary:

The dataset comprises 429,435 observations and 67 columns, representing detailed COVID-19 data from multiple countries over time. Each row corresponds to a specific date and location (country or region), making the dataset a structured longitudinal time series.

The dataset offers a rich foundation for analyzing how COVID-19 spread across the world, how it affected people, how countries responded, and how healthcare systems differed between regions. It is suitable for our project as it includes up-to-date data on vaccinations, cases, and deaths, which we can use to study how effective the vaccines were in reducing infections and saving lives.

    4. Data Preparation

    main_data_frame<-read.csv('owid-covid-data.csv')
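The package dependencies of the helper functions below are not listed in this extract; presumably something like the following was attached first (inferred from the calls the functions make):

library(dplyr)     # select, mutate, group_by, joins
library(stringr)   # str_trim, str_split, str_extract_all, str_ends
library(naniar)    # miss_var_summary
library(gt)        # formatted missing-value tables
library(pheatmap)  # missing-value heatmap
library(ggplot2)   # box plots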

    4.1. Function: drop_cols()

Explanation: drop_cols() removes every column whose missing-value count reaches 50% of the rows, so that sparsely populated variables do not distort later imputation and modeling.

Key Steps: parse the summary() output to read each column's NA count, compare it against the 50% threshold, collect the offending column names, and drop them with select().

    R Code:

drop_cols<-function(dataframe=NULL){
  if(is.null(dataframe)){
    return(NULL)
  }
  col_to_drop<-c()
  # Drop any column whose NA count reaches 50% of the rows
  drop_at<-0.5*nrow(dataframe)
  # Parse summary() output; the "NA's :n" entry, when present, is the 7th slot
  summary_df<-data.frame(summary(dataframe))
  summary_df<-summary_df %>% select(-Var1) %>% group_by(Var2)
  summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
  for (p in unique(colnames(dataframe)))
  {
    test<-summary_df[which(summary_df$Var2==p),]
    na_entry<-as.character(test[[2]][7])   # e.g. "NA's   :123"
    if (is.na(na_entry)){next}             # no NA entry -> keep the column
    if (as.numeric(str_split(na_entry,':')[[1]][2])>=drop_at){
      col_to_drop<-append(col_to_drop,p)
    }
  }
  dataframe<-dataframe %>% select(-all_of(col_to_drop))
  return(dataframe)
}

    4.2. Function: missed_val_df()

Explanation: missed_val_df() prints a formatted table of missing-value counts and percentages per variable, combining miss_var_summary() from naniar with a gt table.

Key Steps: validate the input, build the missing-value summary, wrap it in a gt table with the supplied title, and print it.

    R Code:

    missed_val_df<-function(dataframe=NULL,title=''){
      if(is.null(dataframe)){return(NULL)
      }else{
        pretty_table <- miss_var_summary(dataframe) %>%
          gt() %>%
          tab_header(title = title)
        print(pretty_table)
      }
    }

    4.3. Function: fun()

Explanation: fun() converts logical values into numeric 0/1; it is used to turn the TRUE/FALSE missingness matrix into numbers that pheatmap can colour.

Key Steps: validate the input, then map TRUE to 1 and FALSE to 0 with ifelse().

    R Code:

    fun<-function(data=NULL){
      if(is.null(data)){return(NULL)
      }else{
        a<- ifelse(data==TRUE,1,0)
        return(a)
      }
    }

    4.4. Function: heat_map_missing()

Explanation: heat_map_missing() visualizes where values are missing across the whole dataset, drawing missing cells in red and observed cells in white.

Key Steps: build the 0/1 missingness matrix from is.na() via fun(), then render it with pheatmap() with row and column clustering disabled.

    R Code:

    heat_map_missing<-function(data__frame=NULL){
      if(is.null(data__frame)){
        return(NULL)
      }
      pheatmap(
        mat = sapply(as.data.frame(is.na(data__frame)),fun),
        cluster_rows = FALSE,
        cluster_cols = FALSE,
        color = c("white", "red"),
        breaks = c(-0.01, 0.5, 1.1),
        main = "Missing Values Heatmap"
      )
    }

    4.5. Function: data_char_date()

Explanation: data_char_date() parses a character date that may arrive in either of two formats: a numeric "YYYY-MM-DD"-style string or a written-out "Month DD, YYYY" string.

Key Steps: extract the numeric chunks with str_extract_all(); if three or more are found, rebuild and parse a "%Y-%m-%d" date, otherwise fall back to the "%B %d, %Y" format.

    R Code:

    data_char_date<-function(x=NULL){
      if(is.null(x)){
        return(NULL)}
      dated<-str_extract_all(x,"\\d+")[[1]]
      if (length(dated)>2){
        x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
                   format="%Y-%m-%d")
      }else{
        x<-as.Date(x, format = "%B %d, %Y")
      }
      return(x)
    }
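A quick illustration of the two formats handled (hypothetical inputs, assuming an English locale for month names):

data_char_date("2021-03-05")      # three numeric chunks -> parsed as "%Y-%m-%d"
data_char_date("March 5, 2021")   # two numeric chunks  -> parsed as "%B %d, %Y"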

    4.6. Function: box_plot()

Explanation: box_plot() bins population, gdp_per_capita, and life_expectancy into four log10-scaled groups and draws a boxplot of the chosen variable (peak) across each set of bins, to show how its distribution varies with country characteristics.

Key Steps: cut each column into four bins on the log10 scale (adding 1 to avoid log(0)), then loop over the bin columns and print a boxplot of peak for each.

    R Code:

box_plot<-function(data__frame=NULL,peak=NULL){
  if(is.null(data__frame) | is.null(peak)){
    return(NULL)
  }
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  # An earlier version binned on the raw scale; log10 bins handle the skew better
  data_cut <- data__frame %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))
  for (p in paste0(cols_to_cut,'_group')){
    plo<-ggplot(data_cut,
           aes(x=!!sym(p),
               y=!!sym(peak),
               group=!!sym(p)
               )
           )+
        geom_boxplot(color = 'blue',
                     fill = 'lightblue')+
        labs(x = p,
             y = peak,
             title = paste(peak, "across", p, "bins"))+theme_minimal()
    print(plo)
  }
}

    4.7. Function: smoothing_bins()

Explanation: smoothing_bins() re-imputes every *_smoothed column as the mean of its base column within 7-day windows, computed separately for each location.

Key Steps: find the columns ending in "smoothed", build weekly date breaks, group by week and location, compute the group mean of the base column, and swap it in under the original smoothed column name.

    R Code:

smoothing_bins<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  # Columns ending in "smoothed" are rebuilt from their base columns
  names_cols_smoothed<-colnames(df)
  names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
  # Weekly windows spanning the full date range
  breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
  for (names_bins_collection in names_cols_smoothed){
    main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
    bins<-paste0(main,"_smoothed")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
      group_by(date_group,location) %>%
      mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
      ungroup() %>%
      select(-all_of(c('date_group',bins))) %>%
      rename(!!sym(bins) := new_col)
  }
  return(df)
}

    4.8. Function: per_million_fix()

Explanation: per_million_fix() recomputes every *_per_million column directly from its base count and the country's population, replacing any inconsistent values.

Key Steps: find the columns ending in "million", derive each base column name, compute (base / population) * 1,000,000, and substitute it for the original per-million column.

    R Code:

per_million_fix<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  # Columns ending in "million" are recomputed from their base counts
  names_cols_per_million<-colnames(df)
  names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]

  for (names_bins_collection in names_cols_per_million){
    main<-str_split(names_bins_collection,"_per_million")[[1]][1]
    bins<-paste0(main,"_per_million")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(!!sym(new_col) := ((!!sym(main))/population)*1000000) %>%
      select(-all_of(bins)) %>%
      rename(!!sym(bins) := new_col)
  }
  return(df)
}

    4.9. Function: fix_totals()

Explanation: fix_totals() rebuilds total_cases and total_deaths for each country as running sums of the daily new_cases and new_deaths, so the cumulative totals are internally consistent with the daily counts.

Key Steps: drop the spurious 2020-01-04 rows and any duplicated (location, date) pairs, then for each country walk the dates in order, setting the first day's totals to zero and each later day's total to the previous total plus the previous day's new count.

    R Code:

fix_totals<-function(df_t=NULL){
  if(is.null(df_t)){
    return(NULL)
  }
  # Drop the spurious first date and any duplicated (location, date) rows
  df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
  countries <- unique(df_t$location)
  df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
  for(country in countries){
    df_sub<-df_t[df_t$location==country,]
    df_sub <- df_sub[order(df_sub$date), ]
    first_day<-min(df_sub$date)
    last_day<-max(df_sub$date)

    # Rebuild the running totals from the daily new counts
    for (i in seq(first_day,last_day,by=1)){
      if(i==first_day){
        df_sub$total_cases[df_sub$date == i] <- 0
        df_sub$total_deaths[df_sub$date == i] <- 0
      }else{
        df_sub$total_cases[df_sub$date==i]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
        df_sub$total_deaths[df_sub$date==i]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
      }
    }
    # Sanity check: FALSE means no duplicated dates remain for this country
    print(paste(country,'--->',any(duplicated(df_sub$date))))
    df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
    df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
  }
  return(df_t)
}

    4.10. Function: drop_asia()

Explanation: drop_asia() extracts the non-Asian rows and drops their high-missingness columns, printing missing-value summaries before and after for comparison.

Key Steps: filter out continent == 'Asia', report missingness, apply drop_cols() to both the full and remaining data, report missingness again, and return the non-Asian subset.

    R Code:

drop_asia<-function(main_data_frame){
  # Keep only the non-Asian rows for comparison
  main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]

  # Missingness before dropping sparse columns
  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print(missed_val_df(main_data_remaining,'missing main_data_remaining'))

  main_data_remaining<-drop_cols(main_data_remaining)
  main_data_frame<-drop_cols(main_data_frame)

  # Missingness after dropping sparse columns
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))

  return(main_data_remaining)
}

    4.11. Function: keep_asia()

Explanation: keep_asia() is the counterpart of drop_asia(): it keeps only the Asian rows and drops their high-missingness columns, again printing missing-value summaries before and after.

Key Steps: filter continent == 'Asia', report missingness, apply drop_cols() to both data frames, report missingness again, and return the Asian subset.

    R Code:

keep_asia<-function(main_data_frame){
  # Keep only the Asian rows
  main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]

  # Missingness before dropping sparse columns
  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print(missed_val_df(main_data_asia,'missing main_data_asia'))

  main_data_asia<-drop_cols(main_data_asia)
  main_data_frame<-drop_cols(main_data_frame)

  # Missingness after dropping sparse columns
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))

  return(main_data_asia)
}

    4.12. Function: imputate_first_part()

Explanation: imputate_first_part() runs the first imputation pass on the Asian subset: it parses the dates, bins countries by population, GDP per capita, and life expectancy, and fills every column that still has missing values using weighted_median_imputation().

Key Steps: convert date strings with data_char_date(), create four log10-scaled bins per grouping column, drop rows with no GDP bin, impute each incomplete column, and remove the helper bin columns before returning.

    R Code:

imputate_first_part<-function(data_set_asia){
  # Parse the date strings (both formats handled by data_char_date)
  data_set_asia<-data_set_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date))))
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")

  # An earlier version binned on the raw scale; log10 bins handle the skew better
  data_cut <- data_set_asia %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))

  data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
  # Impute every column that still contains missing values
  cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
  for(col_named in cols_to_imputate){
    data_cut<-weighted_median_imputation(data_cut,col_named)
  }
  # Drop the helper bin columns before returning
  cols_original<-paste0(cols_to_cut,"_group")
  data_set_asia_clean_except__bins<-data_cut %>% select(-all_of(cols_original))
  #write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
  return(data_set_asia_clean_except__bins)
}

    4.13. Function: weighted_median_imputation()

Explanation: weighted_median_imputation() fills missing values of a column with a robust estimate: within each (bin, location) group it computes an outlier-weighted median via reproduction_function(), then averages the three bin-based estimates.

Key Steps: for each of the three bin columns, compute group medians, join them back, and impute missing entries; average the three imputed candidates per row; and substitute the average wherever the original value is NA.

    R Code:

weighted_median_imputation<-function(data__frame=NULL,col1=NULL){
  original_columns<-colnames(data__frame)
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  cols_to_cut<-paste0(cols_to_cut,'_group')

  if(is.null(col1) || is.null(data__frame)){
    return(NULL)
  }

  median_full<-tibble(original=data__frame[[col1]])
  data_temp<-data__frame[,c(col1,cols_to_cut,"location")]

  # One imputation candidate per bin column, computed within (bin, location) groups
  for(p in cols_to_cut){
    group_stats <- data_temp %>%
      group_by(.data[[p]],location) %>%
      summarise(group_median = reproduction_function(.data[[col1]]), .groups = "drop")

    # Merge back with original data and fill the gaps
    data_new <- data_temp %>%
      left_join(group_stats, by = c(p,"location")) %>%
      mutate(imputed = if_else(is.na(.data[[col1]]), group_median, .data[[col1]])) %>%
      select(imputed)
    named <- paste0(p, "_median")
    median_full[[named]] <- data_new$imputed
  }

  # Average the three bin-based candidates (equal weights)
  median_full[,'average']<-apply(median_full[,paste0(cols_to_cut,"_median")],MARGIN = 1,sum)
  median_full[,'average']<-median_full[,'average']/length(cols_to_cut)

  named <- paste0(col1, "_imputated")
  data__frame[,named]<-median_full[,'average']

  # Substitute the average wherever the original value is missing
  data__frame <- data__frame %>%
    mutate(!!col1 := if_else(
      is.na(!!sym(col1)),
      !!sym(named),
      !!sym(col1)
    ))

  return(data__frame[,original_columns])
}

    4.14. Function: reproduction_function()

    Explanation:

    Key Steps:

    R Code:

reproduction_function<-function(s){
  median_data<-c()
  weights<-c()

  if (is.data.frame(s)) {
    s <- s[[1]]
  }

  x <- s[!is.na(s)]
  size_s<-length(x)
  x<-as.numeric(x)

  # Peel off outlier layers: weight each layer's median by its share of the sample
  while(TRUE)
  {
    if(length(x)<4){break}

    Q1<-quantile(x,0.25,na.rm=TRUE)
    Q3<-quantile(x,0.75,na.rm=TRUE)
    IQR<-Q3-Q1
    lower_bound<-Q1-1.5*IQR
    higher_bound<-Q3+1.5*IQR

    non_outliers<-x[x>=lower_bound & x<=higher_bound]
    outliers<-x[x<lower_bound | x>higher_bound]

    weights<-append(weights,length(non_outliers)/size_s)
    median_data<-append(median_data,median(non_outliers,na.rm=TRUE))

    # Continue on the outliers alone
    x<-outliers
  }
  median_data <- as.numeric(median_data)
  weights <- as.numeric(weights)
  return(sum(median_data*weights, na.rm = TRUE))
}
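The calls that chain these helpers together do not appear in this extract. Assuming the cleaned file loaded in Section 5 (covid_data_set_asia_fixed_final.csv) is their end product, the pipeline presumably resembled the following sketch (the ordering is inferred from the function definitions, not taken from the source):

# Hypothetical end-to-end cleaning pipeline (order inferred)
main_data_asia <- keep_asia(main_data_frame)           # Asian rows, sparse columns dropped
heat_map_missing(main_data_asia)                       # inspect the remaining missingness
data_set_asia <- imputate_first_part(main_data_asia)   # parse dates + weighted-median imputation
data_set_asia <- smoothing_bins(data_set_asia)         # rebuild the *_smoothed columns
data_set_asia <- per_million_fix(data_set_asia)        # recompute the *_per_million columns
data_set_asia <- fix_totals(data_set_asia)             # rebuild the cumulative totals
write.csv(data_set_asia, 'covid_data_set_asia_fixed_final.csv', row.names = FALSE)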

    5. Exploratory Data Analysis (EDA)

This section presents an exploratory data analysis of COVID-19 indicators across Asian countries. The focus is on identifying extreme cases, comparing Malaysia with other countries, and understanding healthcare burden through clustering and visualization.

    5.1. Load Libraries

    library(ggplot2)
    library(dplyr)
    library(tidyr)
    library(ggcorrplot)
    library(plotly)
    library(pheatmap)
    library(ggtext)

    5.2. Load Dataset

    df <- read.csv('covid_data_set_asia_fixed_final.csv')
    
    df <- df %>%
      mutate(case_fatality_ratio = case_when(
        total_cases != 0 ~ total_deaths / total_cases,
        TRUE ~ 0
      ))
    
    df$date <- as.Date(df$date)
    
    df_last_day <- df %>%
      group_by(location) %>%
      filter(date == max(date)) %>%
      ungroup() %>%
      filter(total_cases > 0)

    5.3. Visualization

    5.3.1 Plot : Total COVID-19 Cases
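The bar_plot() helper invoked throughout this section is not defined in this extract, and the call for this first plot is missing. A minimal sketch, assuming the helper draws one horizontal bar per country sorted by the chosen metric (with Malaysia highlighted, given the section's focus on comparing Malaysia to others), might be:

# Hypothetical helper: one bar per country for the chosen column,
# sorted so the extremes are easy to spot; Malaysia highlighted
bar_plot <- function(df, col) {
  ggplot(df, aes(x = reorder(location, .data[[col]]), y = .data[[col]],
                 fill = location == "Malaysia")) +
    geom_col(show.legend = FALSE) +
    scale_fill_manual(values = c("TRUE" = "red", "FALSE" = "steelblue")) +
    coord_flip() +
    labs(x = "Country", y = col, title = paste(col, "by country")) +
    theme_minimal(base_size = 8)
}

bar_plot(df_last_day, "total_cases")  # presumably the missing call for this subsection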

    5.3.2 Plot : Total COVID-19 Deaths

    bar_plot(df_last_day, "total_deaths")

    5.3.3 Plot : Deaths per Million

    bar_plot(df_last_day, "total_deaths_per_million")

    5.3.4 Plot : Cases per Million

    bar_plot(df_last_day, "total_cases_per_million")

    5.3.5 Plot : Healthcare Clustering (Heatmap)

    df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
    df_scaled <- scale(df_clust[, -ncol(df_clust)])
    df_scaled <- as.data.frame(df_scaled)
    rownames(df_scaled) <- df_clust$location
    
    pheatmap(
      df_scaled,
      scale = "row",
      cutree_rows = 3,
      cutree_cols = 4,
      fontsize_row = 5.5,
      fontsize_col = 10,
      angle_col = 45
    )

    5.3.6 Plot : Hospital Beds per 1,000 People

    bar_plot(df_last_day, "hospital_beds_per_thousand")

    5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia
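The line_plot() helper is likewise not defined in this extract. Given the subsection title, a plausible sketch, assuming it traces the per-date maximum and minimum country values alongside Malaysia's curve, is:

# Hypothetical helper: for each date, plot the highest and lowest country
# values of the chosen column next to Malaysia's own curve
line_plot <- function(df, col) {
  extremes <- df %>%
    group_by(date) %>%
    summarise(Max = max(.data[[col]], na.rm = TRUE),
              Min = min(.data[[col]], na.rm = TRUE),
              .groups = "drop")
  malaysia <- df %>%
    filter(location == "Malaysia") %>%
    select(date, Malaysia = all_of(col))
  plot_df <- extremes %>%
    left_join(malaysia, by = "date") %>%
    pivot_longer(-date, names_to = "Series", values_to = "Value")
  ggplot(plot_df, aes(x = date, y = Value, colour = Series)) +
    geom_line() +
    labs(title = paste(col, ": daily extremes vs. Malaysia"), x = "Date", y = col) +
    theme_minimal()
}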

    ggplotly(line_plot(df, "total_deaths_per_million"))
    ggplotly(line_plot(df, "total_cases_per_million"))
    ggplotly(line_plot(df, "hospital_beds_per_thousand"))
    ggplotly(line_plot(df, "new_cases_per_million"))
    ggplotly(line_plot(df, "new_deaths_per_million"))
    ggplotly(line_plot(df, "case_fatality_ratio"))

    5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)

    countr <- c("China", "India", "Malaysia", "Japan")
    df_selective <- df %>% filter(location %in% countr)
    
    p <- ggplot(df_selective, aes(x = date, y = reproduction_rate, colour = location)) +
      geom_line() +
      labs(
        title = "COVID-19 Reproduction Rate in Malaysia, China, India, and Japan Over Time",
        x = "Date",
        y = "Reproduction Rate"
      ) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
    
    ggplotly(p)

It is interesting that the reproduction-rate curves of India, Malaysia, and China track each other closely, perhaps because Malaysia is a popular destination for Chinese and Indian citizens.

5.3.10 Plot : Case Fatality Ratio by Country

    bar_plot(df_last_day, "case_fatality_ratio")

    5.4 Conclusion

This analysis shows how COVID-19 affected Asian countries differently in terms of death rates, case counts, and healthcare resources. Malaysia’s performance is compared with the best- and worst-case countries across key metrics. This provides a foundation for the modeling in the following sections and for regional healthcare readiness assessment.

    6.0 Data Model : Clustering

    6.1. Load Libraries

    library(readr)
    library(dplyr)
    library(cluster)
    library(ggplot2)
    library(factoextra)
    library(tidyr)

    6.2. Load Dataset

df <- read.csv("covid_data_set_asia_fixed_final.csv")
# Guard against division by zero (consistent with Sections 5 and 7)
df$case_fatality_ratio <- ifelse(df$total_cases == 0, 0, df$total_deaths / df$total_cases)
    
    # --- Define Time Points ---
    df_first <- df %>%
      group_by(location) %>%
      filter(date == "2020-08-01") %>%
      ungroup()
    df_postvax <- df %>%
      group_by(location) %>%
      filter(date == "2021-03-01") %>%
      ungroup()
    df_latest <- df %>%
      group_by(location) %>%
      filter(date == "2024-08-04") %>%
      ungroup()

    6.3. First Date

    # Feature Engineering
    death_case_features <- df_first %>%
      select(location,
             total_cases_per_million,
             total_deaths_per_million,
             reproduction_rate,
             case_fatality_ratio) %>%
      drop_na()
    
    locations_dc <- death_case_features$location
    data_dc <- death_case_features %>% select(-location)
    scaled_dc <- scale(data_dc)
# Determine the number of clusters using the WSS elbow method
    fviz_nbclust(scaled_dc, kmeans, method = "wss")

The elbow plot is ambiguous, so we check the silhouette score to find the optimal number of clusters k.

    # Helper function to find the best k
    find_best_k <- function(data) {
      best_k <- NA
      best_score <- -Inf
      
      for (k in 2:5) {
        set.seed(123)
        km <- kmeans(data, centers = k, nstart = 25)
        sil <- silhouette(km$cluster, dist(data))
        score <- mean(sil[, 3])
        
        if (score > best_score) {
          best_score <- score
          best_k <- k
        }
      }
      
      return(list(k = best_k, score = best_score))
    }
    
    find_best_k(scaled_dc)
    ## $k
    ## [1] 3
    ## 
    ## $score
    ## [1] 0.5107536
    set.seed(123)
    km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k
    
    # Adding new column 'Cluster' to our table
    clustered_dc <- death_case_features %>%
      mutate(Cluster = factor(km_dc$cluster))
    
    # Summary of each cluster using mean
    cluster_summary_dc <- clustered_dc %>%
      group_by(Cluster) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
    print(cluster_summary_dc)
    ## # A tibble: 3 × 5
    ##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
    ##   <fct>                     <dbl>                    <dbl>             <dbl>
    ## 1 1                          49.8                     14.1             0.87 
    ## 2 2                        1690.                      17.7             0.965
    ## 3 3                       19108.                     127.              0.838
    ## # ℹ 1 more variable: case_fatality_ratio <dbl>
    # Visualize the clusters
    fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
                 geom = "point", palette = "jco") +
      labs(title = "Asian Countries Clustered by COVID Impact") +
      theme_minimal()

    # Split countries by cluster into a list
    countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
    
    # Print countries in each cluster
    for (i in seq_along(countries_by_cluster)) {
      cat(paste0("Cluster ", i, ":\n"))
      print(countries_by_cluster[[i]])
      cat("\n")
    }
    ## Cluster 1:
    ## [1] "Yemen"
    ## 
    ## Cluster 2:
    ##  [1] "Afghanistan"          "Azerbaijan"           "Bangladesh"          
    ##  [4] "Bhutan"               "Brunei"               "Cambodia"            
    ##  [7] "China"                "East Timor"           "Georgia"             
    ## [10] "India"                "Indonesia"            "Iraq"                
    ## [13] "Israel"               "Japan"                "Jordan"              
    ## [16] "Kazakhstan"           "Laos"                 "Lebanon"             
    ## [19] "Malaysia"             "Maldives"             "Mongolia"            
    ## [22] "Myanmar"              "Nepal"                "Pakistan"            
    ## [25] "Palestine"            "Philippines"          "Saudi Arabia"        
    ## [28] "Singapore"            "South Korea"          "Sri Lanka"           
    ## [31] "Tajikistan"           "Thailand"             "Turkey"              
    ## [34] "United Arab Emirates" "Vietnam"             
    ## 
    ## Cluster 3:
    ## [1] "Armenia" "Bahrain" "Iran"    "Kuwait"  "Oman"    "Qatar"
1. Cluster 1 has a high case fatality ratio even though its total cases and deaths are not high
2. Cluster 2 has a high reproduction rate
3. Cluster 3 has high total cases and total deaths
    label_map <- data.frame(
      Cluster = factor(1:3),
      ClusterName = c(
        "Cluster 1: High Fatality",
        "Cluster 2: High Reproduction",
        "Cluster 3: High Cases & Deaths"
      )
    )
    
    # Join cluster labels to the data
    clustered_dc <- clustered_dc %>%
      left_join(label_map, by = "Cluster") %>%
      mutate(
        zvalue = as.numeric(Cluster),
        HoverText = paste0(location, "<br>", ClusterName)
      )
    
    # Plot interactive choropleth
    plot_geo(clustered_dc) %>%
      add_trace(
        z = ~zvalue,
        text = ~HoverText,
        locations = ~location,
        locationmode = "country names",
        type = "choropleth",
        colorscale = "Bluered",
        showscale = TRUE,
        colorbar = list(
          title = "Cluster",
          tickvals = 1:3,
          ticktext = label_map$ClusterName
        )
      ) %>%
      layout(
        title = list(text = "COVID-19 Cluster Map in Asia (2020-08-01)", y = 0.95),
        geo = list(
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'equirectangular'),
          lonaxis = list(range = c(25, 150)),
          lataxis = list(range = c(-10, 60))
        )
      )

6.4. Post Vaccination

    # Feature Engineering
    death_case_features <- df_postvax %>%
      select(location,
             total_cases_per_million,
             total_deaths_per_million,
             reproduction_rate,
             case_fatality_ratio) %>%
      drop_na()
    
    locations_dc <- death_case_features$location
    data_dc <- death_case_features %>% select(-location)
    scaled_dc <- scale(data_dc)
# Determine the number of clusters using the WSS elbow method
    fviz_nbclust(scaled_dc, kmeans, method = "wss")

    find_best_k(scaled_dc)
    ## $k
    ## [1] 3
    ## 
    ## $score
    ## [1] 0.4388714
    set.seed(123)
    km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25)  # centers = best k
    sil <- silhouette(km_dc$cluster, dist(scaled_dc))
    mean(sil[, 3]) 
    ## [1] 0.4388714
    # Adding new column 'Cluster' to our table
    clustered_dc <- death_case_features %>%
      mutate(Cluster = factor(km_dc$cluster))
    
    # Summary of each cluster using mean
    cluster_summary_dc <- clustered_dc %>%
      group_by(Cluster) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
    print(cluster_summary_dc)
    ## # A tibble: 3 × 5
    ##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
    ##   <fct>                     <dbl>                    <dbl>             <dbl>
    ## 1 1                          67.5                     18.8             1.71 
    ## 2 2                       49888.                     515.              1.12 
    ## 3 3                        6473.                      60.1             0.914
    ## # ℹ 1 more variable: case_fatality_ratio <dbl>
    # Visualize the clusters
    fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
                 geom = "point", palette = "jco") +
      labs(title = "Asian Countries Clustered by COVID Impact") +
      theme_minimal()

    # Split countries by cluster into a list
    countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
    
    # Print countries in each cluster
    for (i in seq_along(countries_by_cluster)) {
      cat(paste0("Cluster ", i, ":\n"))
      print(countries_by_cluster[[i]])
      cat("\n")
    }
    ## Cluster 1:
    ## [1] "Yemen"
    ## 
    ## Cluster 2:
    ##  [1] "Armenia"    "Azerbaijan" "Bahrain"    "Georgia"    "Iran"      
    ##  [6] "Israel"     "Jordan"     "Kuwait"     "Lebanon"    "Oman"      
    ## [11] "Palestine"  "Qatar"      "Turkey"    
    ## 
    ## Cluster 3:
    ##  [1] "Afghanistan"          "Bangladesh"           "Bhutan"              
    ##  [4] "Brunei"               "Cambodia"             "China"               
    ##  [7] "East Timor"           "India"                "Indonesia"           
    ## [10] "Iraq"                 "Japan"                "Kazakhstan"          
    ## [13] "Laos"                 "Malaysia"             "Maldives"            
    ## [16] "Mongolia"             "Myanmar"              "Nepal"               
    ## [19] "Pakistan"             "Philippines"          "Saudi Arabia"        
    ## [22] "Singapore"            "South Korea"          "Sri Lanka"           
    ## [25] "Tajikistan"           "Thailand"             "United Arab Emirates"
    ## [28] "Uzbekistan"           "Vietnam"
1. Cluster 1: Low total cases and deaths, high reproduction rate and case fatality ratio
2. Cluster 2: High total cases and deaths
3. Cluster 3: Intermediate, with a low reproduction rate
    # Short labels for the colorbar, full names for hover
    label_map <- data.frame(
      Cluster = factor(1:3),
      ClusterLabel = c("High R+F", "High C+D", "Moderate"),
      ClusterName = c(
        "Cluster 1: Low Cases & Deaths, High Reproduction & Fatality",
        "Cluster 2: High Cases & Deaths",
        "Cluster 3: Moderate, Low Reproduction"
      )
    )
    
    # Merge and prep data
    clustered_dc <- clustered_dc %>%
      left_join(label_map, by = "Cluster") %>%
      mutate(
        zvalue = as.numeric(Cluster),
        HoverText = paste0(location, "<br>", ClusterName)
      )
    
    # Choropleth plot
    plot_geo(clustered_dc) %>%
      add_trace(
        z = ~zvalue,
        text = ~HoverText,
        locations = ~location,
        locationmode = "country names",
        type = "choropleth",
        colorscale = "Bluered",
        showscale = TRUE,
        colorbar = list(
          title = "Cluster",
          tickvals = 1:3,
          ticktext = label_map$ClusterLabel,  # Use short labels here
          thickness = 10,
          len = 0.4
        )
      ) %>%
      layout(
        title = list(text = "COVID-19 Cluster Map in Asia (2021-03-01)", y = 0.95),
        geo = list(
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'equirectangular'),
          lonaxis = list(range = c(25, 150)),
          lataxis = list(range = c(-10, 60)),
          domain = list(x = c(0, 0.95), y = c(0, 1))
        ),
        margin = list(l = 0, r = 0, t = 30, b = 0)
      )

6.5. Last Date

    # Feature Engineering
    death_case_features <- df_latest %>%
      select(location,
             total_cases_per_million,
             total_deaths_per_million,
             reproduction_rate,
             case_fatality_ratio) %>%
      drop_na()
    
    locations_dc <- death_case_features$location
    data_dc <- death_case_features %>% select(-location)
    scaled_dc <- scale(data_dc)
# Determine the number of clusters using the WSS elbow method
    fviz_nbclust(scaled_dc, kmeans, method = "wss")

    find_best_k(scaled_dc)
    ## $k
    ## [1] 2
    ## 
    ## $score
    ## [1] 0.6776519

    We will choose k = 2 because it has the highest silhouette score.

    set.seed(123)
    km_dc <- kmeans(scaled_dc, centers = 2, nstart = 25)  # centers = best k
    sil <- silhouette(km_dc$cluster, dist(scaled_dc))
    mean(sil[, 3])
    ## [1] 0.6776519
    # Adding new column 'Cluster' to our table
    clustered_dc <- death_case_features %>%
      mutate(Cluster = factor(km_dc$cluster))
    
    # Summary of each cluster using mean
    cluster_summary_dc <- clustered_dc %>%
      group_by(Cluster) %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
    print(cluster_summary_dc)
    ## # A tibble: 2 × 5
    ##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
    ##   <fct>                     <dbl>                    <dbl>             <dbl>
    ## 1 1                         1071.                     38.3             0.398
    ## 2 2                       164030.                    757.              0.965
    ## # ℹ 1 more variable: case_fatality_ratio <dbl>
    # Visualize the clusters
    fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
                 geom = "point", palette = "jco") +
      labs(title = "Asian Countries Clustered by COVID Impact") +
      theme_minimal()

    # Split countries by cluster into a list
    countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)
    
    # Print countries in each cluster
    for (i in seq_along(countries_by_cluster)) {
      cat(paste0("Cluster ", i, ":\n"))
      print(countries_by_cluster[[i]])
      cat("\n")
    }
    ## Cluster 1:
    ## [1] "Tajikistan" "Yemen"     
    ## 
    ## Cluster 2:
    ##  [1] "Afghanistan"          "Armenia"              "Azerbaijan"          
    ##  [4] "Bahrain"              "Bangladesh"           "Bhutan"              
    ##  [7] "Brunei"               "Cambodia"             "China"               
    ## [10] "East Timor"           "Georgia"              "India"               
    ## [13] "Indonesia"            "Iran"                 "Iraq"                
    ## [16] "Israel"               "Japan"                "Jordan"              
    ## [19] "Kazakhstan"           "Kuwait"               "Kyrgyzstan"          
    ## [22] "Laos"                 "Lebanon"              "Malaysia"            
    ## [25] "Maldives"             "Mongolia"             "Myanmar"             
    ## [28] "Nepal"                "Oman"                 "Pakistan"            
    ## [31] "Palestine"            "Philippines"          "Qatar"               
    ## [34] "Saudi Arabia"         "Singapore"            "South Korea"         
    ## [37] "Sri Lanka"            "Thailand"             "Turkey"              
    ## [40] "United Arab Emirates" "Uzbekistan"           "Vietnam"
1. Cluster 1: Low reproduction rate, higher case fatality ratio
2. Cluster 2: High total cases, deaths, and reproduction rate
    library(dplyr)
    library(plotly)
    
    # === Cluster Naming ===
    label_map <- data.frame(
      Cluster = factor(1:2),
      ClusterLabel = c("Low R + High F", "High C + D + R"),
      ClusterName = c(
        "Cluster 1: Low Reproduction, High Fatality",
        "Cluster 2: High Cases, Deaths, Reproduction"
      )
    )
    
    # === Merge Labels and Prepare Hover Text ===
    clustered_dc <- clustered_dc %>%
      left_join(label_map, by = "Cluster") %>%
      mutate(
        zvalue = as.numeric(Cluster),
        HoverText = paste0(location, "<br>", ClusterName)
      )
    
    # === Plot Choropleth ===
    plot_geo(clustered_dc) %>%
      add_trace(
        z = ~zvalue,
        text = ~HoverText,
        locations = ~location,
        locationmode = "country names",
        type = "choropleth",
        colorscale = "Bluered",
        showscale = TRUE,
        colorbar = list(
          title = "Cluster",
          tickvals = 1:2,
          ticktext = label_map$ClusterLabel,
          thickness = 10,
          len = 0.4
        )
      ) %>%
      layout(
        title = list(text = "COVID-19 Cluster Map in Asia (2024-08-04)", y = 0.95),
        geo = list(
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'equirectangular'),
          lonaxis = list(range = c(25, 150)),
          lataxis = list(range = c(-10, 60)),
          domain = list(x = c(0, 0.95), y = c(0, 1))
        ),
        margin = list(l = 0, r = 0, t = 30, b = 0)
      )

    7.0 Predictive Model

    7.1. Introduction

    In this section, we employ Support Vector Regression (SVR) to model and predict the propagation of COVID-19. SVR is particularly suitable for modeling non-linear relationships and works well in high-dimensional spaces, making it ideal for epidemiological data like COVID-19 spread.

7.2. Objective

The objective is to build an SVR model that predicts total_cases_per_million for Asian countries during the early phase of the pandemic (before the vaccination rollout), and to evaluate how well it generalizes to held-out data.
    7.3. Load Libraries

    library(readr)
    library(dplyr)
    library(cluster)
    library(ggplot2)
    library(factoextra)
    library(tidyr)
    library(e1071)
    library(caret)
    library(kernlab)

    7.4. Load Dataset

    df <- read.csv("covid_data_set_asia_fixed_final.csv")
    df$date <- as.Date(df$date)
    
    # Calculate Case Fatality Ratio
    df <- df %>%
      mutate(case_fatality_ratio = case_when(
        total_cases == 0 & total_deaths == 0 ~ 0,
        TRUE ~ total_deaths / total_cases
      ))
    
    # Filter for early pandemic data (before vaccination rollout)
    df_first <- df %>%
      filter(date <= as.Date("2020-12-01"))

    7.5. Modeling
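The model-fitting code does not appear in this extract; only the tuning output below survives. A minimal sketch consistent with that output and with the test_data and pred objects used in Section 7.6, assuming caret's svmRadial interface and a hypothetical predictor set, might look like:

# Hypothetical reconstruction: the predictor set and split are assumptions,
# not taken from the source
set.seed(123)

model_df <- df_first %>%
  select(total_cases_per_million, total_deaths_per_million,
         reproduction_rate, population_density, gdp_per_capita) %>%
  drop_na()

# 80/20 train-test split
train_idx <- createDataPartition(model_df$total_cases_per_million, p = 0.8, list = FALSE)
train_data <- model_df[train_idx, ]
test_data  <- model_df[-train_idx, ]

# Tune a radial-kernel SVR (kernlab via caret) with 5-fold cross-validation
svr_fit <- train(
  total_cases_per_million ~ .,
  data = train_data,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 5),
  tuneLength = 5
)
svr_fit$bestTune   # best sigma and C selected by cross-validation

# Evaluate on the held-out test set
pred <- predict(svr_fit, newdata = test_data)
rmse <- sqrt(mean((test_data$total_cases_per_million - pred)^2))
r2   <- cor(test_data$total_cases_per_million, pred)^2
cat("Test RMSE:", rmse, " Test R2:", r2, "\n")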

Cross-validation selected the following radial-kernel parameters, and the tuned model was then evaluated on the held-out test set:

##     sigma C
## 5 0.35813 4

## Test RMSE: 2461.042
## Test R²: 0.9038621

    7.6. Results

    plot_df <- data.frame(
      Actual = test_data$total_cases_per_million,
      Predicted = pred
    )
    
    # Plot
    ggplot(plot_df, aes(x = Actual, y = Predicted)) +
      geom_point(color = "steelblue", size = 2, alpha = 0.7) +
      geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
      labs(
        title = "SVR Prediction vs Actual: Total Cases per Million",
        x = "Actual Total Cases per Million",
        y = "Predicted Total Cases per Million"
      ) +
      theme_minimal(base_size = 14)

    8.0 Summary

    8.1. Key Findings

1. Impact Variation Across Clusters: at each of the three time points, k-means separated Asian countries into distinct severity profiles, ranging from low-case but high-fatality outliers such as Yemen to groups with high cases and deaths per million.
2. SVR Model Effectiveness: the radial-kernel SVR predicted total cases per million on held-out early-pandemic data with a test R² of about 0.90 (RMSE ≈ 2461), indicating it captures the spread dynamics well.
3. Implications for Policy and Resource Allocation: cluster membership and predicted spread highlight where the burden was heaviest, which can guide pandemic preparedness and healthcare resource planning across the region.

    8.2. Methodology Summary

1. Data Collection
2. Data Preparation
3. Exploratory Data Analysis (EDA)
4. Clustering Analysis
5. Predictive Modelling (SVR)
6. Interpretation & Recommendations