WQD7004 Programming for Data Science

TOPIC:

Understanding COVID-19 Burden and Spread: A Clustering and Predictive Modeling Approach for Asian Countries

Group Project

NAME STUDENT ID
Chadli Rayane 24075296
Farras Azelya Putri S2007298
Isma Adlin Binti Ismail 24088157
Wang Zheng 24082308
Mareeha sultana mohamed

1. Sources : Our World in Data (OWID COVID-19 Dataset)

2. Project Questions:

  1. Classification: Group the countries in Asia suffered the most during the pandemic based on metrics such as mortality rates, case rates, and healthcare burden?

  2. Regression:Create a model to analyze and predict the propagation of a virus?

3. Data Understanding:

3.1. Dimension:

covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)
## [1] 429435     67

3.2. Content:

colnames(covid_data)
##  [1] "iso_code"                                  
##  [2] "continent"                                 
##  [3] "location"                                  
##  [4] "date"                                      
##  [5] "total_cases"                               
##  [6] "new_cases"                                 
##  [7] "new_cases_smoothed"                        
##  [8] "total_deaths"                              
##  [9] "new_deaths"                                
## [10] "new_deaths_smoothed"                       
## [11] "total_cases_per_million"                   
## [12] "new_cases_per_million"                     
## [13] "new_cases_smoothed_per_million"            
## [14] "total_deaths_per_million"                  
## [15] "new_deaths_per_million"                    
## [16] "new_deaths_smoothed_per_million"           
## [17] "reproduction_rate"                         
## [18] "icu_patients"                              
## [19] "icu_patients_per_million"                  
## [20] "hosp_patients"                             
## [21] "hosp_patients_per_million"                 
## [22] "weekly_icu_admissions"                     
## [23] "weekly_icu_admissions_per_million"         
## [24] "weekly_hosp_admissions"                    
## [25] "weekly_hosp_admissions_per_million"        
## [26] "total_tests"                               
## [27] "new_tests"                                 
## [28] "total_tests_per_thousand"                  
## [29] "new_tests_per_thousand"                    
## [30] "new_tests_smoothed"                        
## [31] "new_tests_smoothed_per_thousand"           
## [32] "positive_rate"                             
## [33] "tests_per_case"                            
## [34] "tests_units"                               
## [35] "total_vaccinations"                        
## [36] "people_vaccinated"                         
## [37] "people_fully_vaccinated"                   
## [38] "total_boosters"                            
## [39] "new_vaccinations"                          
## [40] "new_vaccinations_smoothed"                 
## [41] "total_vaccinations_per_hundred"            
## [42] "people_vaccinated_per_hundred"             
## [43] "people_fully_vaccinated_per_hundred"       
## [44] "total_boosters_per_hundred"                
## [45] "new_vaccinations_smoothed_per_million"     
## [46] "new_people_vaccinated_smoothed"            
## [47] "new_people_vaccinated_smoothed_per_hundred"
## [48] "stringency_index"                          
## [49] "population_density"                        
## [50] "median_age"                                
## [51] "aged_65_older"                             
## [52] "aged_70_older"                             
## [53] "gdp_per_capita"                            
## [54] "extreme_poverty"                           
## [55] "cardiovasc_death_rate"                     
## [56] "diabetes_prevalence"                       
## [57] "female_smokers"                            
## [58] "male_smokers"                              
## [59] "handwashing_facilities"                    
## [60] "hospital_beds_per_thousand"                
## [61] "life_expectancy"                           
## [62] "human_development_index"                   
## [63] "population"                                
## [64] "excess_mortality_cumulative_absolute"      
## [65] "excess_mortality_cumulative"               
## [66] "excess_mortality"                          
## [67] "excess_mortality_cumulative_per_million"

The dataset contains a comprehensive collection of COVID-19 – related metrics across different countries over time. The variables can be grouped into the following categories:

  1. Identification and Geographic Information:
    • Country or region identifiers - “iso_code” “continent” “location”
  2. COVID-19 Cases and Deaths:
    • Total and daily new confirmed cases- “total_cases” “new_cases” “new_cases_smoothed”

    • Total and daily new confirmed deaths- “total_deaths” “new_deaths” “new_deaths_smoothed”

  3. Hospitalizations and ICU:
    • ICU occupancy- “icu_patients” “icu_patients_per_million”

    • Hospital occupancy- “hosp_patients” “hosp_patients_per_million”

    • Weekly Admissions- “weekly_icu_admissions” weekly_hosp_admissions”

  4. Testing Data:
    • Testing statistics- “total_tests” “new_tests” “new_tests_smoothed”

    • Normalized metrics- “total_tests_per_thousand” “positive_rate” “tests_per_case” “tests_units”

  5. Vaccination Data:
    • “total_vaccinations” “people_vaccinated” “people_fully_vaccinated” “total_boosters”

    • Daily Values- “new_vaccinations” “new_vaccinations_smoothed”

    • Normalized- “total_vaccinations_per_hundred” “people_vaccinated_per_hundred” “people_fully_vaccinated_per_hundred” “total_boosters_per_hundred” “new_vaccinations_smoothed_per_million” “new_people_vaccinated_smoothed” “new_people_vaccinated_smoothed_per_hundred”

  6. Public Policy Measures:
    • A composite measure of government response strictness- “stringency_index”
  7. Demographic and Socioeconomic Data:
    • “population” “population_density” “median_age” “aged_65_older” “aged_70_older”

    • “gdp_per_capita” “extreme_poverty” “human_development_index”

  8. Health and Risk Factors:
    • “cardiovasc_death_rate”“diabetes_prevalence”

    • Smoking prevalence- “female_smokers” “male_smokers”

    • Healthcare resources- “hospital_beds_per_thousand” “handwashing_facilities”

  9. Excess Mortality Metrics:
    • “excess_mortality” “excess_mortality_cumulative” “excess_mortality_cumulative_absolute” “excess_mortality_cumulative_per_million”

3.3. Structure:

str(covid_data)
## 'data.frame':    429435 obs. of  67 variables:
##  $ iso_code                                  : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ continent                                 : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location                                  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                                      : chr  "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
##  $ total_cases                               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases                                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed                        : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_deaths                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths                                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed                       : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_cases_per_million                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_per_million                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed_per_million            : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_deaths_per_million                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_per_million                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed_per_million           : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ reproduction_rate                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients                              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients_per_million                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients                             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients_per_million                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions_per_million         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions_per_million        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests                               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests                                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_per_thousand                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed_per_thousand           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ positive_rate                             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_per_case                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_units                               : chr  "" "" "" "" ...
##  $ total_vaccinations                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_vaccinations_per_hundred            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated_per_hundred             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated_per_hundred       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters_per_hundred                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed_per_million     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed_per_hundred: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stringency_index                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ population_density                        : num  54.4 54.4 54.4 54.4 54.4 ...
##  $ median_age                                : num  18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
##  $ aged_65_older                             : num  2.58 2.58 2.58 2.58 2.58 ...
##  $ aged_70_older                             : num  1.34 1.34 1.34 1.34 1.34 ...
##  $ gdp_per_capita                            : num  1804 1804 1804 1804 1804 ...
##  $ extreme_poverty                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cardiovasc_death_rate                     : num  597 597 597 597 597 ...
##  $ diabetes_prevalence                       : num  9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
##  $ female_smokers                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ male_smokers                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ handwashing_facilities                    : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ hospital_beds_per_thousand                : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ life_expectancy                           : num  64.8 64.8 64.8 64.8 64.8 ...
##  $ human_development_index                   : num  0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
##  $ population                                : num  41128772 41128772 41128772 41128772 41128772 ...
##  $ excess_mortality_cumulative_absolute      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative_per_million   : num  NA NA NA NA NA NA NA NA NA NA ...

3.4. Summary:

The dataset comprises 429,435 observations and 67 columns. The dataset representing detailed COVID-19 data from multiple countries over time. Each row corresponds to a specific date and location (country and region) making the dataset structured as a longitudinal time series.

The dataset offers a rich foundation to help analyze how COVID-19 spread across the world, how it affected people, how countries responded and the healthcare systems differed between regions. It is suitable for our project as it includes up-to-date data on vaccinations,cases and deaths,which we can use to study how effective the vaccines were in reducing infections and saving lives.

4. Data Preparation

main_data_frame<-read.csv('owid-covid-data.csv')

4.1. Function: drop_cols()

Explanation:

Key Steps:

R Code:

drop_cols<-function(dataframe=NULL){
  if(is.null(dataframe)){
    return(NULL)
  }
  col_to_drop<-c()
  drop_at<-0.5*nrow(dataframe)
  summary_df<-data.frame(summary(dataframe))
  summary_df<-summary_df %>% select(-Var1) %>%group_by(Var2)
  summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
  for (p in unique(colnames(dataframe)))
  {
    test<-summary_df[which(summary_df$Var2==p),]
    if (is.na(test[7,2])){next}
    if (as.numeric(str_split(test[7,2],':')[[1]][2])>=drop_at){
      col_to_drop<-append(col_to_drop,p)
    }
  }
  dataframe<-dataframe %>% select(-all_of(col_to_drop))
  return(dataframe)
}

4.2. Function: missed_val_df()

Explanation:

Key Steps:

R Code:

missed_val_df<-function(dataframe=NULL,title=''){
  if(is.null(dataframe)){return(NULL)
  }else{
    pretty_table <- miss_var_summary(dataframe) %>%
      gt() %>%
      tab_header(title = title)
    print(pretty_table)
  }
}

4.3. Function: fun()

Explanation:

Key Steps:

R Code:

fun<-function(data=NULL){
  if(is.null(data)){return(NULL)
  }else{
    a<- ifelse(data==TRUE,1,0)
    return(a)
  }
}

4.4. Function: heat_map_missing()

Explanation:

Key Steps:

R Code:

heat_map_missing<-function(data__frame=NULL){
  if(is.null(data__frame)){
    return(NULL)
  }
  pheatmap(
    mat = sapply(as.data.frame(is.na(data__frame)),fun),
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    color = c("white", "red"),
    breaks = c(-0.01, 0.5, 1.1),
    main = "Missing Values Heatmap"
  )
}

4.5. Function: data_char_date()

Explanation:

Key Steps:

R Code:

data_char_date<-function(x=NULL){
  if(is.null(x)){
    return(NULL)}
  dated<-str_extract_all(x,"\\d+")[[1]]
  if (length(dated)>2){
    x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
               format="%Y-%m-%d")
  }else{
    x<-as.Date(x, format = "%B %d, %Y")
  }
  return(x)
}

4.6. Function: box_plot()

Explanation:

Key Steps:

R Code:

box_plot<-function(data__frame=NULL,peak=NULL){
  if(is.null(data__frame) | is.null(peak)){
    return(NULL)
  }
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  #data_cut <- data__frame %>%
   # mutate(across(
    #  all_of(cols_to_cut),
     # ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
      #.names = "{.col}_group"
    #))
  ata_cut <- data__frame %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))
  for (p in paste0(cols_to_cut,'_group')){
  plo<-ggplot(data_cut,
         aes(x=!!sym(p),
             y=!!sym(peak),
             group=!!sym(p)
             )
         )+
      geom_boxplot(color = 'blue',
                   fill = 'lightblue')+
      labs(x =p,
           y = peak,
           title = paste(peak," Across Population Bins"))+theme_minimal()
    print(plo)
    }
}

4.7. Function: smoothing_bins()

Explanation:

Key Steps:

R Code:

smoothing_bins<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  names_cols_smoothed<-colnames(df)
  names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
  breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
  for (names_bins_collection in names_cols_smoothed ){
    main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
    bins<-paste0(main,"_smoothed")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
      group_by(date_group,location) %>%
      mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
      ungroup()%>%
      select(-all_of(c('date_group',bins))) %>%
      rename(!!sym(bins) := new_col)
    }
  return(df)
  }

4.8. Function: per_million_fix()

Explanation:

Key Steps:

R Code:

per_million_fix<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  names_cols_per_million<-colnames(df)
  names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]
  names_cols_per_million
  
  
  for (names_bins_collection in names_cols_per_million ){
    main<-str_split(names_bins_collection,"_per_million")[[1]][1]
    bins<-paste0(main,"_per_million")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(!!sym(new_col) :=((!!sym(main))/population)*1000000 )%>%
      select(-all_of(bins))%>%
      rename(!!sym(bins) := new_col)
  }
  return(df)
}

4.9. Function: fix_totals()

Explanation:

Key Steps:

R Code:

fix_totals<-function(df_t=NULL){
  if(is.null(df_t)){
  return(NULL)
  }
  df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
  countries <- unique(df_t$location)
  df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
  for(country in countries){
    df_sub<-df_t[df_t$location==country,]
    df_sub <- df_sub[order(df_sub$date), ]
    first_day<-min(df_sub$date)
    last_day<-max(df_sub$date)
    
    for (i in seq(first_day,last_day,by=1)){
    if(i==first_day){
      df_sub$total_cases[df_sub$date == i] <- 0
      df_sub$total_deaths[df_sub$date == i] <- 0
    }else{
    df_sub$total_cases[df_sub$date==(i)]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
    df_sub$total_deaths[df_sub$date==(i)]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
    }
      }
    print(paste(country,'--->',any(duplicated(df_sub$date)))
          )
    df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
    df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
    
  }
  return(df_t)
  }

4.10. Function: drop_asia()

Explanation:

Key Steps:

R Code:

drop_asia<-function(main_data_frame){
  
  main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]
  
  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print( missed_val_df(main_data_remaining,'missing main_data_remaining'))
  
  
  main_data_remaining<-drop_cols(main_data_remaining)
  main_data_frame<-drop_cols(main_data_frame)
  
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))
  
  return(main_data_remaining)
  
}

4.11. Function: keep_asia()

Explanation:

Key Steps:

R Code:

keep_asia<-function(main_data_frame){
  
  main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]

  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print(missed_val_df(main_data_asia,'missing main_data_asia'))
  
  main_data_asia<-drop_cols(main_data_asia)
  main_data_frame<-drop_cols(main_data_frame)
  
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))
  return(main_data_asia)
  
}

4.12. Function: imputate_first_part()

Explanation:

Key Steps:

R Code:

imputate_first_part<-function(data_set_asia){
  data_set_asia<-main_data_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date))))
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  
  #data_cut <- data_set_asia %>%
  # mutate(across(
  #   all_of(cols_to_cut),
  #  ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
  # .names = "{.col}_group"
  #))
  
  data_cut <- data_set_asia %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))
  
  data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
  cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
  for(col_named in cols_to_imputate){
    data_cut<-weighted_median_imputation(data_cut,col_named)
  }
  cols_original<-paste0(cols_to_cut,"_group")
  data_set_asia_clean_except__bins<-data_cut %>%select(-all_of(cols_original))
  #write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
  return(data_set_asia_clean_except__bins)
}

4.13. Function: weighted_median_imputation()

Explanation:

Key Steps:

R Code:

weighted_median_imputation<-function(data__frame=NULL,col1=NULL){
  original_columns<-colnames(data__frame)
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  cols_to_cut<-paste0(cols_to_cut,'_group')
  
  
  if(is.null(col1) || is.null(data__frame)){
    return(NULL)}
  
  
  median_full<-data_frame(data__frame[,col1])
  data_temp<-data__frame[,c(col1,cols_to_cut,"location")]
  
  
  for(p in cols_to_cut){
    
    group_stats <- data_temp %>%
      group_by(.data[[p]],location) %>%
      summarise(group_median = reproduction_function(.data[[col1]]), .groups = "drop")
    
    # Merge back with original data
    data_new <- data_temp %>%
      left_join(group_stats, by = c(p,"location")) %>%
      mutate(imputed = if_else(is.na(.data[[col1]]), group_median, .data[[col1]])) %>%
      select(imputed)
    named <- paste0(p, "_median")
    median_full[[named]] <- data_new$imputed
  }
  
  
  imputed_cols <- paste0(cols_to_cut, "_median")
  median_full$averages <- rowMeans(median_full[, imputed_cols], na.rm = TRUE)
  
  
  median_full[,'average']<-apply(median_full[,paste0(cols_to_cut,"_median")],MARGIN = 1,sum)
  median_full[,'average']<-median_full[,'average']/length(cols_to_cut)
  
  
  named <- paste0(col1, "_imputated")
  
  data__frame[,named]<-median_full[,'average']
  
  
  data__frame <- data__frame %>%
    mutate(!!col1 := if_else(
      is.na(!!sym(col1)),
      !!sym(named),
      !!sym(col1)
    ))
  
  return(data__frame[,original_columns])
}

4.14. Function: reproduction_function()

Explanation:

Key Steps:

R Code:

reproduction_function<-function(s){
  
  median_data<-c()
  weights<-c()
  
  if (is.data.frame(s)) {
    s <- s[[1]] 
  }
  
  x <- s[!is.na(s)]
  size_s<-length(x)
  x<-as.numeric(x)
  while(TRUE)
    {
  if(length(x)<4){break}
  
    
  Q1<-quantile(x,0.25,na.rm=TRUE)
  Q3<-quantile(x,0.75,na.rm=TRUE)
  IQR<-Q3-Q1
  lower_bound<-Q1-1.5*IQR
  higher_bound<-Q3+1.5*IQR
  
  non_outliers<-x[x>=lower_bound & x<=higher_bound]
  outliers<-x[x<lower_bound | x>higher_bound]
  
  
  a<-non_outliers
  b<-length(non_outliers)/size_s
  
  weights<-append(weights,b)
  a<-median(a,na.rm=TRUE)
  median_data<-append(median_data,a)
  
  x<-outliers
  
  }
  median_data <- as.numeric(median_data)
  weights <- as.numeric(weights)
  return(sum(median_data*weights, na.rm = TRUE))
}

5. Exploratory Data Analysis (EDA)

This report presents an exploratory data analysis of COVID-19 indicators in various Asian countries. The focus is on identifying extreme cases, comparing Malaysia to others, and understanding healthcare burden using clustering and visualizations.

5.1. Load Libraries

library(ggplot2)
library(dplyr)
library(tidyr)
library(ggcorrplot)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(pheatmap)
library(ggtext)

5.2. Load Dataset

df <- read.csv('covid_data_set_asia_fixed_final.csv')

df <- df %>%
  mutate(case_fatality_ratio = case_when(
    total_cases != 0 ~ total_deaths / total_cases,
    TRUE ~ 0
  ))

df$date <- as.Date(df$date)

df_last_day <- df %>%
  group_by(location) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  filter(total_cases > 0)

5.3. Visualization

5.3.1 Plot : Total COVID-19 Cases

5.3.2 Plot : Total COVID-19 Deaths

bar_plot(df_last_day, "total_deaths")

5.3.3 Plot : Deaths per Million

bar_plot(df_last_day, "total_deaths_per_million")

5.3.4 Plot : Cases per Million

bar_plot(df_last_day, "total_cases_per_million")

5.3.5 Plot : Healthcare Clustering (Heatmap)

df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
df_scaled <- scale(df_clust[, -ncol(df_clust)])
df_scaled <- as.data.frame(df_scaled)
rownames(df_scaled) <- df_clust$location

pheatmap(
  df_scaled,
  scale = "row",
  cutree_rows = 3,
  cutree_cols = 4,
  fontsize_row = 5.5,
  fontsize_col = 10,
  angle_col = 45
)

5.3.6 Plot : Hospital Beds per 1,000 People

bar_plot(df_last_day, "hospital_beds_per_thousand")

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia

ggplotly(line_plot(df, "total_deaths"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(line_plot(df, "total_cases"))
ggplotly(line_plot(df, "hospital_beds_per_thousand"))
ggplotly(line_plot(df, "new_cases"))
ggplotly(line_plot(df, "new_deaths"))
ggplotly(line_plot(df, "case_fatality_ratio"))

5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)

countr <- c("China", "India", "Malaysia", "Japan")
df_selective <- df %>% filter(location %in% countr)

p <- ggplot(df_selective, aes(x = date, y = reproduction_rate, colour = location)) +
  geom_line() +
  labs(
    title = "COVID-19 Reproduction Rate in Malaysia, China, India, and Japan Over Time",
    x = "Date",
    y = "Reproduction Rate"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplotly(p)

5.3.10 Plot : Case Fatality Ratio by Country

bar_plot(df_last_day, "case_fatality_ratio")

5.4 Conclusion

This analysis shows how COVID-19 affected Asian countries differently based on death rates, case counts, and healthcare resources. Malaysia’s performance is compared with best- and worst-case countries across key metrics. This provides a foundation for future modeling and regional healthcare readiness assessment.

6.0 Data Model : Clustering

6.1. Load Libraries

library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(tidyr)

6.2. Load Dataset

df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$case_fatality_ratio <- df$total_deaths/df$total_cases

# --- Define Time Points ---
df_first <- df %>%
  group_by(location) %>%
  filter(date == "2020-08-01") %>%
  ungroup()
df_postvax <- df %>%
  group_by(location) %>%
  filter(date == "2021-03-01") %>%
  ungroup()
df_latest <- df %>%
  group_by(location) %>%
  filter(date == "2024-08-04") %>%
  ungroup()

6.3. First Date

# Feature Engineering
death_case_features <- df_first %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

The plot is ambiguous, we will check the silhoutte score for the optimal number of k

# Helper function to find the best k
find_best_k <- function(data) {
  best_k <- NA
  best_score <- -Inf
  
  for (k in 2:5) {
    set.seed(123)
    km <- kmeans(data, centers = k, nstart = 25)
    sil <- silhouette(km$cluster, dist(data))
    score <- mean(sil[, 3])
    
    if (score > best_score) {
      best_score <- score
      best_k <- k
    }
  }
  
  return(list(k = best_k, score = best_score))
}

find_best_k(scaled_dc)
## $k
## [1] 3
## 
## $score
## [1] 0.5107536
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k

# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `Cluster = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                          49.8                     14.1             0.87 
## 2 2                        1690.                      17.7             0.965
## 3 3                       19108.                     127.              0.838
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}
## Cluster 1:
## [1] "Yemen"
## 
## Cluster 2:
##  [1] "Afghanistan"          "Azerbaijan"           "Bangladesh"          
##  [4] "Bhutan"               "Brunei"               "Cambodia"            
##  [7] "China"                "East Timor"           "Georgia"             
## [10] "India"                "Indonesia"            "Iraq"                
## [13] "Israel"               "Japan"                "Jordan"              
## [16] "Kazakhstan"           "Laos"                 "Lebanon"             
## [19] "Malaysia"             "Maldives"             "Mongolia"            
## [22] "Myanmar"              "Nepal"                "Pakistan"            
## [25] "Palestine"            "Philippines"          "Saudi Arabia"        
## [28] "Singapore"            "South Korea"          "Sri Lanka"           
## [31] "Tajikistan"           "Thailand"             "Turkey"              
## [34] "United Arab Emirates" "Vietnam"             
## 
## Cluster 3:
## [1] "Armenia" "Bahrain" "Iran"    "Kuwait"  "Oman"    "Qatar"
  1. Cluster 1 has high case fatality ratio even though the total cases and death are not high
  2. Cluster 2 has high reproduction rate
  3. Cluster 3 has high total cases and total death

6.3. Post Vaccination

# Feature Engineering
death_case_features <- df_postvax %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 3
## 
## $score
## [1] 0.4388714
set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25)  # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3]) 
## [1] 0.4388714
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 3 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                          67.5                     18.8             1.71 
## 2 2                       49888.                     515.              1.12 
## 3 3                        6473.                      60.1             0.914
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}
## Cluster 1:
## [1] "Yemen"
## 
## Cluster 2:
##  [1] "Armenia"    "Azerbaijan" "Bahrain"    "Georgia"    "Iran"      
##  [6] "Israel"     "Jordan"     "Kuwait"     "Lebanon"    "Oman"      
## [11] "Palestine"  "Qatar"      "Turkey"    
## 
## Cluster 3:
##  [1] "Afghanistan"          "Bangladesh"           "Bhutan"              
##  [4] "Brunei"               "Cambodia"             "China"               
##  [7] "East Timor"           "India"                "Indonesia"           
## [10] "Iraq"                 "Japan"                "Kazakhstan"          
## [13] "Laos"                 "Malaysia"             "Maldives"            
## [16] "Mongolia"             "Myanmar"              "Nepal"               
## [19] "Pakistan"             "Philippines"          "Saudi Arabia"        
## [22] "Singapore"            "South Korea"          "Sri Lanka"           
## [25] "Tajikistan"           "Thailand"             "United Arab Emirates"
## [28] "Uzbekistan"           "Vietnam"
  1. Cluster 1: Low total cases and death, high reproduction rate and case fatality ratio
  2. Cluster 2: High total cases and death
  3. Cluster 3: In the middle, with low reproduction rate

6.4. Last Date

# Feature Engineering
death_case_features <- df_latest %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)
# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)
## $k
## [1] 2
## 
## $score
## [1] 0.6776519

We will choose k = 2 because it has the highest silhouette score.

set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 2, nstart = 25)  # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])
## [1] 0.6776519
# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)
## # A tibble: 2 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                         1071.                     38.3             0.398
## 2 2                       164030.                    757.              0.965
## # ℹ 1 more variable: case_fatality_ratio <dbl>
# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}
## Cluster 1:
## [1] "Tajikistan" "Yemen"     
## 
## Cluster 2:
##  [1] "Afghanistan"          "Armenia"              "Azerbaijan"          
##  [4] "Bahrain"              "Bangladesh"           "Bhutan"              
##  [7] "Brunei"               "Cambodia"             "China"               
## [10] "East Timor"           "Georgia"              "India"               
## [13] "Indonesia"            "Iran"                 "Iraq"                
## [16] "Israel"               "Japan"                "Jordan"              
## [19] "Kazakhstan"           "Kuwait"               "Kyrgyzstan"          
## [22] "Laos"                 "Lebanon"              "Malaysia"            
## [25] "Maldives"             "Mongolia"             "Myanmar"             
## [28] "Nepal"                "Oman"                 "Pakistan"            
## [31] "Palestine"            "Philippines"          "Qatar"               
## [34] "Saudi Arabia"         "Singapore"            "South Korea"         
## [37] "Sri Lanka"            "Thailand"             "Turkey"              
## [40] "United Arab Emirates" "Uzbekistan"           "Vietnam"
  1. Cluster 1: Low reproduction rate, higher case fatality ratio
  2. Cluster 2: High total cases, death, reproduction rate