WQD7004_Group1.knit

WQD7004 Programming for Data Science

TOPIC:

Understanding COVID-19 Burden and Spread: A Clustering and Predictive Modeling Approach for Asian Countries

Group Project

NAME	STUDENT ID
Chadli Rayane	24075296
Farras Azelya Putri	S2007298
Isma Adlin Binti Ismail	24088157
Wang Zheng	24082308
Mareeha sultana mohamed

1. Sources : Our World in Data (OWID COVID-19 Dataset)

Link : https://github.com/owid/covid-19-data/blob/master/public/data/README.md

2. Project Questions:

Classification: Group the countries in Asia suffered the most during the pandemic based on metrics such as mortality rates, case rates, and healthcare burden?
Regression:Create a model to analyze and predict the propagation of a virus?

3. Data Understanding:

3.1. Dimension:

covid_data <- read.csv("owid-covid-data.csv")
dim(covid_data)

## [1] 429435     67

No: of rows= 429,435 (daily records per country)
No: of columns= 67 (eg: cases, deaths, vaccinations, testing, population)
File format: CSV

3.2. Content:

colnames(covid_data)

##  [1] "iso_code"                                  
##  [2] "continent"                                 
##  [3] "location"                                  
##  [4] "date"                                      
##  [5] "total_cases"                               
##  [6] "new_cases"                                 
##  [7] "new_cases_smoothed"                        
##  [8] "total_deaths"                              
##  [9] "new_deaths"                                
## [10] "new_deaths_smoothed"                       
## [11] "total_cases_per_million"                   
## [12] "new_cases_per_million"                     
## [13] "new_cases_smoothed_per_million"            
## [14] "total_deaths_per_million"                  
## [15] "new_deaths_per_million"                    
## [16] "new_deaths_smoothed_per_million"           
## [17] "reproduction_rate"                         
## [18] "icu_patients"                              
## [19] "icu_patients_per_million"                  
## [20] "hosp_patients"                             
## [21] "hosp_patients_per_million"                 
## [22] "weekly_icu_admissions"                     
## [23] "weekly_icu_admissions_per_million"         
## [24] "weekly_hosp_admissions"                    
## [25] "weekly_hosp_admissions_per_million"        
## [26] "total_tests"                               
## [27] "new_tests"                                 
## [28] "total_tests_per_thousand"                  
## [29] "new_tests_per_thousand"                    
## [30] "new_tests_smoothed"                        
## [31] "new_tests_smoothed_per_thousand"           
## [32] "positive_rate"                             
## [33] "tests_per_case"                            
## [34] "tests_units"                               
## [35] "total_vaccinations"                        
## [36] "people_vaccinated"                         
## [37] "people_fully_vaccinated"                   
## [38] "total_boosters"                            
## [39] "new_vaccinations"                          
## [40] "new_vaccinations_smoothed"                 
## [41] "total_vaccinations_per_hundred"            
## [42] "people_vaccinated_per_hundred"             
## [43] "people_fully_vaccinated_per_hundred"       
## [44] "total_boosters_per_hundred"                
## [45] "new_vaccinations_smoothed_per_million"     
## [46] "new_people_vaccinated_smoothed"            
## [47] "new_people_vaccinated_smoothed_per_hundred"
## [48] "stringency_index"                          
## [49] "population_density"                        
## [50] "median_age"                                
## [51] "aged_65_older"                             
## [52] "aged_70_older"                             
## [53] "gdp_per_capita"                            
## [54] "extreme_poverty"                           
## [55] "cardiovasc_death_rate"                     
## [56] "diabetes_prevalence"                       
## [57] "female_smokers"                            
## [58] "male_smokers"                              
## [59] "handwashing_facilities"                    
## [60] "hospital_beds_per_thousand"                
## [61] "life_expectancy"                           
## [62] "human_development_index"                   
## [63] "population"                                
## [64] "excess_mortality_cumulative_absolute"      
## [65] "excess_mortality_cumulative"               
## [66] "excess_mortality"                          
## [67] "excess_mortality_cumulative_per_million"

The dataset contains a comprehensive collection of COVID-19 – related metrics across different countries over time. The variables can be grouped into the following categories:

Identification and Geographic Information:
- Country or region identifiers - “iso_code” “continent” “location”
COVID-19 Cases and Deaths:
- Total and daily new confirmed cases- “total_cases” “new_cases” “new_cases_smoothed”
- Total and daily new confirmed deaths- “total_deaths” “new_deaths” “new_deaths_smoothed”
Hospitalizations and ICU:
- ICU occupancy- “icu_patients” “icu_patients_per_million”
- Hospital occupancy- “hosp_patients” “hosp_patients_per_million”
- Weekly Admissions- “weekly_icu_admissions” weekly_hosp_admissions”
Testing Data:
- Testing statistics- “total_tests” “new_tests” “new_tests_smoothed”
- Normalized metrics- “total_tests_per_thousand” “positive_rate” “tests_per_case” “tests_units”
Vaccination Data:
- “total_vaccinations” “people_vaccinated” “people_fully_vaccinated” “total_boosters”
- Daily Values- “new_vaccinations” “new_vaccinations_smoothed”
- Normalized- “total_vaccinations_per_hundred” “people_vaccinated_per_hundred” “people_fully_vaccinated_per_hundred” “total_boosters_per_hundred” “new_vaccinations_smoothed_per_million” “new_people_vaccinated_smoothed” “new_people_vaccinated_smoothed_per_hundred”
Public Policy Measures:
- A composite measure of government response strictness- “stringency_index”
Demographic and Socioeconomic Data:
- “population” “population_density” “median_age” “aged_65_older” “aged_70_older”
- “gdp_per_capita” “extreme_poverty” “human_development_index”
Health and Risk Factors:
- “cardiovasc_death_rate”“diabetes_prevalence”
- Smoking prevalence- “female_smokers” “male_smokers”
- Healthcare resources- “hospital_beds_per_thousand” “handwashing_facilities”
Excess Mortality Metrics:
- “excess_mortality” “excess_mortality_cumulative” “excess_mortality_cumulative_absolute” “excess_mortality_cumulative_per_million”

3.3. Structure:

str(covid_data)

## 'data.frame':    429435 obs. of  67 variables:
##  $ iso_code                                  : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ continent                                 : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location                                  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                                      : chr  "2020-01-05" "2020-01-06" "2020-01-07" "2020-01-08" ...
##  $ total_cases                               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases                                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed                        : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_deaths                              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths                                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed                       : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_cases_per_million                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_per_million                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed_per_million            : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ total_deaths_per_million                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_per_million                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ new_deaths_smoothed_per_million           : num  NA NA NA NA NA 0 0 0 0 0 ...
##  $ reproduction_rate                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients                              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients_per_million                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients                             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients_per_million                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions_per_million         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions_per_million        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests                               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests                                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_per_thousand                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed_per_thousand           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ positive_rate                             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_per_case                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_units                               : chr  "" "" "" "" ...
##  $ total_vaccinations                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_vaccinations_per_hundred            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated_per_hundred             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated_per_hundred       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters_per_hundred                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed_per_million     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed_per_hundred: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stringency_index                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ population_density                        : num  54.4 54.4 54.4 54.4 54.4 ...
##  $ median_age                                : num  18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
##  $ aged_65_older                             : num  2.58 2.58 2.58 2.58 2.58 ...
##  $ aged_70_older                             : num  1.34 1.34 1.34 1.34 1.34 ...
##  $ gdp_per_capita                            : num  1804 1804 1804 1804 1804 ...
##  $ extreme_poverty                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cardiovasc_death_rate                     : num  597 597 597 597 597 ...
##  $ diabetes_prevalence                       : num  9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
##  $ female_smokers                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ male_smokers                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ handwashing_facilities                    : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ hospital_beds_per_thousand                : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ life_expectancy                           : num  64.8 64.8 64.8 64.8 64.8 ...
##  $ human_development_index                   : num  0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
##  $ population                                : num  41128772 41128772 41128772 41128772 41128772 ...
##  $ excess_mortality_cumulative_absolute      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative_per_million   : num  NA NA NA NA NA NA NA NA NA NA ...

Location - character
date - character
new_cases - integer
new_deaths - integer
total_deaths - integer
people_vaccinated - numeric
people_fully_vaccinated - numeric
total_boosters - numeric
population - numeric
total_vaccinations - numeric
new_vaccinations - integer
stringency_index (Measure of government restrictions) - numeric
icu_patients - integer
hosp_patients - integer
positive_rate - numeric
reproduction_rate - numeric

3.4. Summary:

The dataset comprises 429,435 observations and 67 columns. The dataset representing detailed COVID-19 data from multiple countries over time. Each row corresponds to a specific date and location (country and region) making the dataset structured as a longitudinal time series.

Data Types:
- Character variables (Eg: iso_code,continent,location,date) :Represents identifiers and date information.
- Numeric variables: Majority of the dataset consists of integers (int) and continuous numbers (num).
Data Coverage:
- Includes data on COVID-19 cases, deaths, testing, hospitalizations, ICU usage, vaccinations, and public health policy responses.
- Demographic and socioeconomic indicators like median age, GDP per capita, population density, and human development index are included to conduct/support deeper analytic insights.
- Also includes healthcare capacity and risk factors such as smoking rates, cardiovascular death rate, handwashing facilities, and hospital beds.
- Excess mortality provides insight into the pandemic’s broader impact beyond the reported COVID-19 deaths.
Missing Values:
- Several variables, especially those related to ICU, hospitalization, testing, vaccination, and excess mortality, contain many NA values, indicating inconsistent or unavailable reporting across regions and time.

The dataset offers a rich foundation to help analyze how COVID-19 spread across the world, how it affected people, how countries responded and the healthcare systems differed between regions. It is suitable for our project as it includes up-to-date data on vaccinations,cases and deaths,which we can use to study how effective the vaccines were in reducing infections and saving lives.

4. Data Preparation

main_data_frame<-read.csv('owid-covid-data.csv')

4.1. Function: `drop_cols()`

Explanation:

Drops columns with too many missing values.

Key Steps:

Checks if input is NULL; returns NULL if so.
Computes a 50% missing data threshold (drop_at).
Analyzes summary statistics to detect columns with excessive NAs.
Drops any column where missing values are ≥ 50% of total rows.
Returns the cleaned dataframe.

R Code:

drop_cols<-function(dataframe=NULL){
  if(is.null(dataframe)){
    return(NULL)
  }
  col_to_drop<-c()
  drop_at<-0.5*nrow(dataframe)
  summary_df<-data.frame(summary(dataframe))
  summary_df<-summary_df %>% select(-Var1) %>%group_by(Var2)
  summary_df$Var2<-summary_df$Var2 %>% sapply(str_trim)
  for (p in unique(colnames(dataframe)))
  {
    test<-summary_df[which(summary_df$Var2==p),]
    if (is.na(test[7,2])){next}
    if (as.numeric(str_split(test[7,2],':')[[1]][2])>=drop_at){
      col_to_drop<-append(col_to_drop,p)
    }
  }
  dataframe<-dataframe %>% select(-all_of(col_to_drop))
  return(dataframe)
}

4.2. Function: `missed_val_df()`

Explanation:

Creates and saves a visual table showing missing values.

Key Steps:

Returns NULL if input dataframe is NULL.
Uses miss_var_summary() from naniar to count missing values.
Formats the summary using the gt package for better readability.
Adds a custom title to the table.
Saves the output as an image (.png) with the title in the filename.
Returns the formatted table.

R Code:

missed_val_df<-function(dataframe=NULL,title=''){
  if(is.null(dataframe)){return(NULL)
  }else{
    pretty_table <- miss_var_summary(dataframe) %>%
      gt() %>%
      tab_header(title = title)
    print(pretty_table)
  }
}

4.3. Function: `fun()`

Explanation:

Converts logical (TRUE/FALSE) values to numeric (1/0).

Key Steps:

Returns 1 if value is TRUE, 0 if FALSE.
Handles NULL input by returning NULL.
Supports matrix transformation for visualizations.

R Code:

fun<-function(data=NULL){
  if(is.null(data)){return(NULL)
  }else{
    a<- ifelse(data==TRUE,1,0)
    return(a)
  }
}

4.4. Function: `heat_map_missing()`

Explanation:

Generates a heatmap showing the locations of missing values in a dataframe.

Key Steps:

Converts missing values (NA) into binary format (1 = missing).
Uses pheatmap() to create a visual heatmap.
Uses red for missing values and white for non-missing.
No row/column clustering.
Returns NULL if the input dataframe is NULL.

R Code:

heat_map_missing<-function(data__frame=NULL){
  if(is.null(data__frame)){
    return(NULL)
  }
  pheatmap(
    mat = sapply(as.data.frame(is.na(data__frame)),fun),
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    color = c("white", "red"),
    breaks = c(-0.01, 0.5, 1.1),
    main = "Missing Values Heatmap"
  )
}

4.5. Function: `data_char_date()`

Explanation:

Converts a character string to a proper Date object.

Key Steps:

Extracts numeric date components from a string.
Handles two formats: yyyy-mm-dd or Month dd, yyyy.
Converts using as.Date().
Returns NULL if input is NULL

R Code:

data_char_date<-function(x=NULL){
  if(is.null(x)){
    return(NULL)}
  dated<-str_extract_all(x,"\\d+")[[1]]
  if (length(dated)>2){
    x<-as.Date(paste0(dated[1],'-',dated[2],'-',dated[3]),
               format="%Y-%m-%d")
  }else{
    x<-as.Date(x, format = "%B %d, %Y")
  }
  return(x)
}

4.6. Function: `box_plot()`

Explanation:

Plots boxplots of a selected variable (peak) against binned versions of key socioeconomic features.

Key Steps:

Requires a dataframe and a target variable (e.g., new_cases).
Bins population, gdp_per_capita, and life_expectancy into 4 log-scaled groups.
Creates a boxplot for each binned feature group vs. the peak variable.
Uses ggplot2 to plot and visualize distributions.
Prints each boxplot automatically.

R Code:

box_plot<-function(data__frame=NULL,peak=NULL){
  if(is.null(data__frame) | is.null(peak)){
    return(NULL)
  }
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  #data_cut <- data__frame %>%
   # mutate(across(
    #  all_of(cols_to_cut),
     # ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
      #.names = "{.col}_group"
    #))
  ata_cut <- data__frame %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))
  for (p in paste0(cols_to_cut,'_group')){
  plo<-ggplot(data_cut,
         aes(x=!!sym(p),
             y=!!sym(peak),
             group=!!sym(p)
             )
         )+
      geom_boxplot(color = 'blue',
                   fill = 'lightblue')+
      labs(x =p,
           y = peak,
           title = paste(peak," Across Population Bins"))+theme_minimal()
    print(plo)
    }
}

4.7. Function: `smoothing_bins()`

Explanation:

Imputes missing smoothed data by averaging values within weekly date bins.

Key Steps:

Looks for columns ending in “smoothed”.
Creates 7-day bins (cut()) based on the date column.
For each smoothed column: -Groups data by date bin and location. -Replaces the smoothed column with its imputed weekly average.
Returns the updated dataframe with smoothed values replaced.

R Code:

smoothing_bins<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  names_cols_smoothed<-colnames(df)
  names_cols_smoothed<-names_cols_smoothed[str_ends(names_cols_smoothed,"smoothed")]
  breaks <- seq(min(df$date, na.rm = TRUE), max(df$date, na.rm = TRUE) + 7, by = "7 days")
  for (names_bins_collection in names_cols_smoothed ){
    main<-str_split(names_bins_collection,"_smoothed")[[1]][1]
    bins<-paste0(main,"_smoothed")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(date_group = cut(date, breaks = breaks, include.lowest = TRUE)) %>%
      group_by(date_group,location) %>%
      mutate(!!sym(new_col) := mean(!!sym(main), na.rm = TRUE)) %>%
      ungroup()%>%
      select(-all_of(c('date_group',bins))) %>%
      rename(!!sym(bins) := new_col)
    }
  return(df)
  }

4.8. Function: `per_million_fix()`

Explanation:

Recalculates “per million” values using the actual population and base metric.

Key Steps:

Detects columns ending in _per_million.
Recomputes values as: (original value / population)×1,000,000
Replaces original _per_million columns with recalculated ones.
Useful for ensuring consistent normalization per population.

R Code:

per_million_fix<-function(df=NULL){
  if(is.null(df)){
    return(NULL)
  }
  names_cols_per_million<-colnames(df)
  names_cols_per_million<-names_cols_per_million[str_ends(names_cols_per_million,"million")]
  names_cols_per_million
  
  
  for (names_bins_collection in names_cols_per_million ){
    main<-str_split(names_bins_collection,"_per_million")[[1]][1]
    bins<-paste0(main,"_per_million")
    new_col<-paste0(bins,"_imputed")
    df <- df %>%
      mutate(!!sym(new_col) :=((!!sym(main))/population)*1000000 )%>%
      select(-all_of(bins))%>%
      rename(!!sym(bins) := new_col)
  }
  return(df)
}

4.9. Function: `fix_totals()`

Explanation:

Recalculates total cases and deaths over time for each country from daily new values.

Key Steps:

Skips the date 2020-01-04 (removes it).
Iterates through each country and processes data in date order.
On the first day, sets totals to 0.
For subsequent days, adds new_cases and new_deaths to running totals.
Removes duplicates and ensures cumulative consistency.
Prints warning if there are duplicate dates per country.
Returns cleaned dataframe with corrected total_cases and total_deaths.

R Code:

fix_totals<-function(df_t=NULL){
  if(is.null(df_t)){
  return(NULL)
  }
  df_t <- df_t[df_t$date != as.Date("2020-01-04"), ]
  countries <- unique(df_t$location)
  df_t <- df_t[!duplicated(df_t[, c("location", "date")]), ]
  for(country in countries){
    df_sub<-df_t[df_t$location==country,]
    df_sub <- df_sub[order(df_sub$date), ]
    first_day<-min(df_sub$date)
    last_day<-max(df_sub$date)
    
    for (i in seq(first_day,last_day,by=1)){
    if(i==first_day){
      df_sub$total_cases[df_sub$date == i] <- 0
      df_sub$total_deaths[df_sub$date == i] <- 0
    }else{
    df_sub$total_cases[df_sub$date==(i)]<-df_sub$total_cases[df_sub$date==(i-1)]+df_sub$new_cases[df_sub$date==(i-1)]
    df_sub$total_deaths[df_sub$date==(i)]<-df_sub$total_deaths[df_sub$date==(i-1)]+df_sub$new_deaths[df_sub$date==(i-1)]
    }
      }
    print(paste(country,'--->',any(duplicated(df_sub$date)))
          )
    df_t$total_cases[which(df_t$location==country)]<-df_sub$total_cases
    df_t$total_deaths[which(df_t$location==country)]<-df_sub$total_deaths
    
  }
  return(df_t)
  }

4.10. Function: `drop_asia()`

Explanation:

Removes Asian countries from the dataset and handles missing values.

Key Steps:

Filters out rows where continent == ‘Asia’.
Displays missing value summaries before and after column removal.
Uses drop_cols() to remove columns with too many missing values from both the original and filtered datasets.
Returns the non-Asian subset of the data.

R Code:

drop_asia<-function(main_data_frame){
  
  main_data_remaining<-main_data_frame[which(main_data_frame$continent!='Asia'),]
  
  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print( missed_val_df(main_data_remaining,'missing main_data_remaining'))
  
  
  main_data_remaining<-drop_cols(main_data_remaining)
  main_data_frame<-drop_cols(main_data_frame)
  
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_remaining, "Missing in main_data_remaining (cleaned)"))
  
  return(main_data_remaining)
  
}

4.11. Function: `keep_asia()`

Explanation:

Keeps only Asian countries in the dataset and handles missing values.

Key Steps:

Filters the dataset to include only rows where continent == ‘Asia’.
Displays missing value summaries before and after cleaning.
Uses drop_cols() to remove columns with excessive missing values.
Returns the Asian subset of the data.

R Code:

keep_asia<-function(main_data_frame){
  
  main_data_asia<-main_data_frame[which(main_data_frame$continent=='Asia'),]

  print(missed_val_df(main_data_frame,'missing main_data_frame'))
  print(missed_val_df(main_data_asia,'missing main_data_asia'))
  
  main_data_asia<-drop_cols(main_data_asia)
  main_data_frame<-drop_cols(main_data_frame)
  
  print(missed_val_df(main_data_frame, "Missing in main_data_frame (cleaned)"))
  print(missed_val_df(main_data_asia, "Missing in main_data_asia (cleaned)"))
  return(main_data_asia)
  
}

4.12. Function: `imputate_first_part()`

Explanation:

Cleans and imputes missing values for the Asian dataset using weighted median imputation.

Key Steps:

Converts the date column into proper Date format using data_char_date().
Bins population, gdp_per_capita, and life_expectancy into 4 log-scaled groups.
Removes rows where gdp_per_capita_group is missing.
Identifies columns with missing values and imputes them using weighted_median_imputation().
Drops binning columns (*_group) before returning.
Returns the cleaned and imputed Asian dataset.

R Code:

imputate_first_part<-function(data_set_asia){
  data_set_asia<-main_data_asia %>% mutate(date=as.Date(unlist(lapply(date,data_char_date))))
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  
  #data_cut <- data_set_asia %>%
  # mutate(across(
  #   all_of(cols_to_cut),
  #  ~ cut(.x, breaks = 4, labels = FALSE, include.lowest = TRUE),
  # .names = "{.col}_group"
  #))
  
  data_cut <- data_set_asia %>%
    mutate(across(
      all_of(cols_to_cut),
      ~ cut(log10(.x + 1), breaks = 4, labels = FALSE),  # +1 avoids log(0)
      .names = "{.col}_group"
    ))
  
  data_cut<-data_cut %>% filter(!is.na(.data$gdp_per_capita_group))
  cols_to_imputate<-names(data_cut)[sapply(data_cut, function(x) any(is.na(x)))]
  for(col_named in cols_to_imputate){
    data_cut<-weighted_median_imputation(data_cut,col_named)
  }
  cols_original<-paste0(cols_to_cut,"_group")
  data_set_asia_clean_except__bins<-data_cut %>%select(-all_of(cols_original))
  #write.csv(data_set_asia_clean_except__bins,'data_set_asia_clean_except__bins_new.csv')
  return(data_set_asia_clean_except__bins)
}

4.13. Function: `weighted_median_imputation()`

Explanation:

Imputes missing values in a specified column using a weighted median approach based on grouped statistics.

Key Steps:

Bins data by population_group, gdp_per_capita_group, and life_expectancy_group.
For each bin and location, computes a weighted median using reproduction_function().
Replaces NA values in col1 with the average of the imputed values from the three groupings.
Returns the original dataframe with imputed values filled in.

R Code:

weighted_median_imputation<-function(data__frame=NULL,col1=NULL){
  original_columns<-colnames(data__frame)
  cols_to_cut <- c("population", "gdp_per_capita", "life_expectancy")
  cols_to_cut<-paste0(cols_to_cut,'_group')
  
  
  if(is.null(col1) || is.null(data__frame)){
    return(NULL)}
  
  
  median_full<-data_frame(data__frame[,col1])
  data_temp<-data__frame[,c(col1,cols_to_cut,"location")]
  
  
  for(p in cols_to_cut){
    
    group_stats <- data_temp %>%
      group_by(.data[[p]],location) %>%
      summarise(group_median = reproduction_function(.data[[col1]]), .groups = "drop")
    
    # Merge back with original data
    data_new <- data_temp %>%
      left_join(group_stats, by = c(p,"location")) %>%
      mutate(imputed = if_else(is.na(.data[[col1]]), group_median, .data[[col1]])) %>%
      select(imputed)
    named <- paste0(p, "_median")
    median_full[[named]] <- data_new$imputed
  }
  
  
  imputed_cols <- paste0(cols_to_cut, "_median")
  median_full$averages <- rowMeans(median_full[, imputed_cols], na.rm = TRUE)
  
  
  median_full[,'average']<-apply(median_full[,paste0(cols_to_cut,"_median")],MARGIN = 1,sum)
  median_full[,'average']<-median_full[,'average']/length(cols_to_cut)
  
  
  named <- paste0(col1, "_imputated")
  
  data__frame[,named]<-median_full[,'average']
  
  
  data__frame <- data__frame %>%
    mutate(!!col1 := if_else(
      is.na(!!sym(col1)),
      !!sym(named),
      !!sym(col1)
    ))
  
  return(data__frame[,original_columns])
}

4.14. Function: `reproduction_function()`

Explanation:

Computes a robust weighted median by iteratively removing outliers and calculating median values from non-outliers.

Key Steps:

Removes outliers using the IQR (Interquartile Range) method.
For each iteration, calculates the median of non-outliers and assigns a weight based on the proportion of non-outliers.
Repeats on the remaining outliers until fewer than 4 values remain.
Returns the weighted sum of medians from all iterations.

R Code:

reproduction_function<-function(s){
  
  median_data<-c()
  weights<-c()
  
  if (is.data.frame(s)) {
    s <- s[[1]] 
  }
  
  x <- s[!is.na(s)]
  size_s<-length(x)
  x<-as.numeric(x)
  while(TRUE)
    {
  if(length(x)<4){break}
  
    
  Q1<-quantile(x,0.25,na.rm=TRUE)
  Q3<-quantile(x,0.75,na.rm=TRUE)
  IQR<-Q3-Q1
  lower_bound<-Q1-1.5*IQR
  higher_bound<-Q3+1.5*IQR
  
  non_outliers<-x[x>=lower_bound & x<=higher_bound]
  outliers<-x[x<lower_bound | x>higher_bound]
  
  
  a<-non_outliers
  b<-length(non_outliers)/size_s
  
  weights<-append(weights,b)
  a<-median(a,na.rm=TRUE)
  median_data<-append(median_data,a)
  
  x<-outliers
  
  }
  median_data <- as.numeric(median_data)
  weights <- as.numeric(weights)
  return(sum(median_data*weights, na.rm = TRUE))
}

5. Exploratory Data Analysis (EDA)

This report presents an exploratory data analysis of COVID-19 indicators in various Asian countries. The focus is on identifying extreme cases, comparing Malaysia to others, and understanding healthcare burden using clustering and visualizations.

5.1. Load Libraries

library(ggplot2)
library(dplyr)
library(tidyr)
library(ggcorrplot)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(pheatmap)
library(ggtext)

5.2. Load Dataset

df <- read.csv('covid_data_set_asia_fixed_final.csv')

df <- df %>%
  mutate(case_fatality_ratio = case_when(
    total_cases != 0 ~ total_deaths / total_cases,
    TRUE ~ 0
  ))

df$date <- as.Date(df$date)

df_last_day <- df %>%
  group_by(location) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  filter(total_cases > 0)

5.3. Visualization

5.3.1 Plot : Total COVID-19 Cases

5.3.2 Plot : Total COVID-19 Deaths

bar_plot(df_last_day, "total_deaths")

5.3.3 Plot : Deaths per Million

bar_plot(df_last_day, "total_deaths_per_million")

5.3.4 Plot : Cases per Million

bar_plot(df_last_day, "total_cases_per_million")

5.3.5 Plot : Healthcare Clustering (Heatmap)

df_clust <- df_last_day[, c("population_density", "median_age", "gdp_per_capita", "hospital_beds_per_thousand", "location")]
df_scaled <- scale(df_clust[, -ncol(df_clust)])
df_scaled <- as.data.frame(df_scaled)
rownames(df_scaled) <- df_clust$location

pheatmap(
  df_scaled,
  scale = "row",
  cutree_rows = 3,
  cutree_cols = 4,
  fontsize_row = 5.5,
  fontsize_col = 10,
  angle_col = 45
)

5.3.6 Plot : Hospital Beds per 1,000 People

bar_plot(df_last_day, "hospital_beds_per_thousand")

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia

ggplotly(line_plot(df, "total_deaths"))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplotly(line_plot(df, "total_cases"))

ggplotly(line_plot(df, "hospital_beds_per_thousand"))

ggplotly(line_plot(df, "new_cases"))

ggplotly(line_plot(df, "new_deaths"))

ggplotly(line_plot(df, "case_fatality_ratio"))

5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)

countr <- c("China", "India", "Malaysia", "Japan")
df_selective <- df %>% filter(location %in% countr)

p <- ggplot(df_selective, aes(x = date, y = reproduction_rate, colour = location)) +
  geom_line() +
  labs(
    title = "COVID-19 Reproduction Rate in Malaysia, China, India, and Japan Over Time",
    x = "Date",
    y = "Reproduction Rate"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplotly(p)

5.3.10 Plot : Case Fatality Ratio by Country

bar_plot(df_last_day, "case_fatality_ratio")

5.4 Conclusion

This analysis shows how COVID-19 affected Asian countries differently based on death rates, case counts, and healthcare resources. Malaysia’s performance is compared with best- and worst-case countries across key metrics. This provides a foundation for future modeling and regional healthcare readiness assessment.

6.0 Data Model : Clustering

6.1. Load Libraries

library(readr)
library(dplyr)
library(cluster)
library(ggplot2)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(tidyr)

6.2. Load Dataset

df <- read.csv("covid_data_set_asia_fixed_final.csv")
df$case_fatality_ratio <- df$total_deaths/df$total_cases

# --- Define Time Points ---
df_first <- df %>%
  group_by(location) %>%
  filter(date == "2020-08-01") %>%
  ungroup()
df_postvax <- df %>%
  group_by(location) %>%
  filter(date == "2021-03-01") %>%
  ungroup()
df_latest <- df %>%
  group_by(location) %>%
  filter(date == "2024-08-04") %>%
  ungroup()

6.3. First Date

# Feature Engineering
death_case_features <- df_first %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)

# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

The plot is ambiguous, we will check the silhoutte score for the optimal number of k

# Helper function to find the best k
find_best_k <- function(data) {
  best_k <- NA
  best_score <- -Inf
  
  for (k in 2:5) {
    set.seed(123)
    km <- kmeans(data, centers = k, nstart = 25)
    sil <- silhouette(km$cluster, dist(data))
    score <- mean(sil[, 3])
    
    if (score > best_score) {
      best_score <- score
      best_k <- k
    }
  }
  
  return(list(k = best_k, score = best_score))
}

find_best_k(scaled_dc)

## $k
## [1] 3
## 
## $score
## [1] 0.5107536

set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25) # centers = best k

# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `Cluster = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

print(cluster_summary_dc)

## # A tibble: 3 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                          49.8                     14.1             0.87 
## 2 2                        1690.                      17.7             0.965
## 3 3                       19108.                     127.              0.838
## # ℹ 1 more variable: case_fatality_ratio <dbl>

# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}

## Cluster 1:
## [1] "Yemen"
## 
## Cluster 2:
##  [1] "Afghanistan"          "Azerbaijan"           "Bangladesh"          
##  [4] "Bhutan"               "Brunei"               "Cambodia"            
##  [7] "China"                "East Timor"           "Georgia"             
## [10] "India"                "Indonesia"            "Iraq"                
## [13] "Israel"               "Japan"                "Jordan"              
## [16] "Kazakhstan"           "Laos"                 "Lebanon"             
## [19] "Malaysia"             "Maldives"             "Mongolia"            
## [22] "Myanmar"              "Nepal"                "Pakistan"            
## [25] "Palestine"            "Philippines"          "Saudi Arabia"        
## [28] "Singapore"            "South Korea"          "Sri Lanka"           
## [31] "Tajikistan"           "Thailand"             "Turkey"              
## [34] "United Arab Emirates" "Vietnam"             
## 
## Cluster 3:
## [1] "Armenia" "Bahrain" "Iran"    "Kuwait"  "Oman"    "Qatar"

Cluster 1 has high case fatality ratio even though the total cases and death are not high
Cluster 2 has high reproduction rate
Cluster 3 has high total cases and total death

6.3. Post Vaccination

# Feature Engineering
death_case_features <- df_postvax %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)

# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)

## $k
## [1] 3
## 
## $score
## [1] 0.4388714

set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 3, nstart = 25)  # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])

## [1] 0.4388714

# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)

## # A tibble: 3 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                          67.5                     18.8             1.71 
## 2 2                       49888.                     515.              1.12 
## 3 3                        6473.                      60.1             0.914
## # ℹ 1 more variable: case_fatality_ratio <dbl>

# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}

## Cluster 1:
## [1] "Yemen"
## 
## Cluster 2:
##  [1] "Armenia"    "Azerbaijan" "Bahrain"    "Georgia"    "Iran"      
##  [6] "Israel"     "Jordan"     "Kuwait"     "Lebanon"    "Oman"      
## [11] "Palestine"  "Qatar"      "Turkey"    
## 
## Cluster 3:
##  [1] "Afghanistan"          "Bangladesh"           "Bhutan"              
##  [4] "Brunei"               "Cambodia"             "China"               
##  [7] "East Timor"           "India"                "Indonesia"           
## [10] "Iraq"                 "Japan"                "Kazakhstan"          
## [13] "Laos"                 "Malaysia"             "Maldives"            
## [16] "Mongolia"             "Myanmar"              "Nepal"               
## [19] "Pakistan"             "Philippines"          "Saudi Arabia"        
## [22] "Singapore"            "South Korea"          "Sri Lanka"           
## [25] "Tajikistan"           "Thailand"             "United Arab Emirates"
## [28] "Uzbekistan"           "Vietnam"

Cluster 1: Low total cases and death, high reproduction rate and case fatality ratio
Cluster 2: High total cases and death
Cluster 3: In the middle, with low reproduction rate

6.4. Last Date

# Feature Engineering
death_case_features <- df_latest %>%
  select(location,
         total_cases_per_million,
         total_deaths_per_million,
         reproduction_rate,
         case_fatality_ratio) %>%
  drop_na()

locations_dc <- death_case_features$location
data_dc <- death_case_features %>% select(-location)
scaled_dc <- scale(data_dc)

# Determine number of cluster using WSS Elbow method
fviz_nbclust(scaled_dc, kmeans, method = "wss")

find_best_k(scaled_dc)

## $k
## [1] 2
## 
## $score
## [1] 0.6776519

We will choose k = 2 because it has the highest silhouette score.

set.seed(123)
km_dc <- kmeans(scaled_dc, centers = 2, nstart = 25)  # centers = best k
sil <- silhouette(km_dc$cluster, dist(scaled_dc))
mean(sil[, 3])

## [1] 0.6776519

# Adding new column 'Cluster' to our table
clustered_dc <- death_case_features %>%
  mutate(Cluster = factor(km_dc$cluster))

# Summary of each cluster using mean
cluster_summary_dc <- clustered_dc %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))
print(cluster_summary_dc)

## # A tibble: 2 × 5
##   Cluster total_cases_per_million total_deaths_per_million reproduction_rate
##   <fct>                     <dbl>                    <dbl>             <dbl>
## 1 1                         1071.                     38.3             0.398
## 2 2                       164030.                    757.              0.965
## # ℹ 1 more variable: case_fatality_ratio <dbl>

# Visualize the clusters
fviz_cluster(km_dc, data = scaled_dc, label = FALSE, 
             geom = "point", palette = "jco") +
  labs(title = "Asian Countries Clustered by COVID Impact") +
  theme_minimal()

# Split countries by cluster into a list
countries_by_cluster <- split(clustered_dc$location, clustered_dc$Cluster)

# Print countries in each cluster
for (i in seq_along(countries_by_cluster)) {
  cat(paste0("Cluster ", i, ":\n"))
  print(countries_by_cluster[[i]])
  cat("\n")
}

## Cluster 1:
## [1] "Tajikistan" "Yemen"     
## 
## Cluster 2:
##  [1] "Afghanistan"          "Armenia"              "Azerbaijan"          
##  [4] "Bahrain"              "Bangladesh"           "Bhutan"              
##  [7] "Brunei"               "Cambodia"             "China"               
## [10] "East Timor"           "Georgia"              "India"               
## [13] "Indonesia"            "Iran"                 "Iraq"                
## [16] "Israel"               "Japan"                "Jordan"              
## [19] "Kazakhstan"           "Kuwait"               "Kyrgyzstan"          
## [22] "Laos"                 "Lebanon"              "Malaysia"            
## [25] "Maldives"             "Mongolia"             "Myanmar"             
## [28] "Nepal"                "Oman"                 "Pakistan"            
## [31] "Palestine"            "Philippines"          "Qatar"               
## [34] "Saudi Arabia"         "Singapore"            "South Korea"         
## [37] "Sri Lanka"            "Thailand"             "Turkey"              
## [40] "United Arab Emirates" "Uzbekistan"           "Vietnam"

Cluster 1: Low reproduction rate, higher case fatality ratio
Cluster 2: High total cases, death, reproduction rate

WQD7004 Programming for Data Science

TOPIC:

Understanding COVID-19 Burden and Spread: A Clustering and Predictive Modeling Approach for Asian Countries

Group Project

1. Sources : Our World in Data (OWID COVID-19 Dataset)

2. Project Questions:

3. Data Understanding:

3.1. Dimension:

3.2. Content:

3.3. Structure:

3.4. Summary:

4. Data Preparation

4.1. Function: drop_cols()

4.2. Function: missed_val_df()

4.3. Function: fun()

4.4. Function: heat_map_missing()

4.5. Function: data_char_date()

4.6. Function: box_plot()

4.7. Function: smoothing_bins()

4.8. Function: per_million_fix()

4.9. Function: fix_totals()

4.10. Function: drop_asia()

4.11. Function: keep_asia()

4.12. Function: imputate_first_part()

4.13. Function: weighted_median_imputation()

4.14. Function: reproduction_function()

5. Exploratory Data Analysis (EDA)

5.1. Load Libraries

5.2. Load Dataset

5.3. Visualization

5.3.1 Plot : Total COVID-19 Cases

5.3.2 Plot : Total COVID-19 Deaths

5.3.3 Plot : Deaths per Million

5.3.4 Plot : Cases per Million

5.3.5 Plot : Healthcare Clustering (Heatmap)

5.3.6 Plot : Hospital Beds per 1,000 People

5.3.8 Plot : Line Plots – Daily Extremes vs. Malaysia

5.3.9 Plot : Reproduction Rate (Malaysia vs Neighbors)

5.3.10 Plot : Case Fatality Ratio by Country

5.4 Conclusion

6.0 Data Model : Clustering

6.1. Load Libraries

6.2. Load Dataset

6.3. First Date

6.3. Post Vaccination

6.4. Last Date

4.1. Function: `drop_cols()`

4.2. Function: `missed_val_df()`

4.3. Function: `fun()`

4.4. Function: `heat_map_missing()`

4.5. Function: `data_char_date()`

4.6. Function: `box_plot()`

4.7. Function: `smoothing_bins()`

4.8. Function: `per_million_fix()`

4.9. Function: `fix_totals()`

4.10. Function: `drop_asia()`

4.11. Function: `keep_asia()`

4.12. Function: `imputate_first_part()`

4.13. Function: `weighted_median_imputation()`

4.14. Function: `reproduction_function()`