1 Predicting Public Health Outcomes Based on Global Climate Events Using Machine Learning

WQD7004 Group 5

1. Ridhwan Bin Razali S2178595
2. Nurin Miza Afiqah Binti Andrie Dazlee S2154141
3. Maisie Chiara Salsabila 25062171
4. Amir Arifudeen Bin Mohd Fadzir 24232491

2 Introduction

Climate change represents one of the most pressing global challenges of our time, with far-reaching implications for public health systems worldwide. The intersection of environmental conditions and human health outcomes has become increasingly critical as extreme weather events, air quality degradation, and temperature anomalies intensify across regions. However, comprehensive datasets linking climate indicators to health metrics across diverse geographic and socioeconomic contexts remain limited.

To address this gap, a climate-health dataset has been generated tracking 25 countries across 8 regions from 2015 to 2025. This dataset integrates climate indicators (temperature, precipitation, extreme weather events), air quality measurements (PM2.5, AQI), health outcomes (respiratory diseases, cardiovascular mortality, vector-borne disease risks), and socioeconomic factors (healthcare access, GDP per capita, mental health indices). Comprising 30 key attributes and 13,723 weekly records, it provides researchers and data scientists with a robust foundation for exploring climate-health relationships, developing predictive models, and informing evidence-based policy decisions.

3 Research Objectives

The primary objectives of working with this climate-health dataset are:

To explore patterns and correlations between climate events (temperature anomalies, extreme weather, air pollution) and public health outcomes across different geographic regions and income levels.
To derive actionable insights related to respiratory diseases, cardiovascular mortality, vector-borne disease risks, and the influence of socioeconomic factors on health resilience.

4 Research Questions

This study/project seeks to address the following key research questions:

How do climate indicators (temperature, precipitation, air quality) vary across regions and income levels, and what temporal trends emerge from 2015-2025?
Which climate and socioeconomic factors are most predictive of adverse health outcomes (respiratory diseases, cardiovascular mortality, heat-related admissions)?

5 Dataset Description

The dataset comprehensively tracks climate-health relationships across 25 countries from January 2015 to October 2025, providing weekly granularity for time-series analysis and regional comparisons.

Key Characteristics:

Source: Dataset modeling real-world climate-health relationships
Dimension: 30 columns × 13,723 rows = 411,690 data points (after cleaning)
Geographic Coverage: 25 countries across 8 regions (3 income levels)
Temporal Range: January 2015 - October 2025 (weekly frequency, ~52-53 weeks/year)
Data Completeness: 100% complete (0% missing values after validation)

Column Details:

Geographic Dimensions (8 columns): - record_id (character): Unique identifier - e.g., “REC-USA-2015-01” - country_code (character): ISO country code - e.g., “USA”, “CHN”, “DEU” - country_name (character): Country name - e.g., “United States”, “China”, “Germany” - region (character): Geographic region - e.g., “North America”, “East Asia”, “Europe” - income_level (character): Economic classification - e.g., “High”, “Upper-Middle”, “Lower-Middle” - latitude (numeric): Geographic latitude - e.g., 38.9072 (degrees) - longitude (numeric): Geographic longitude - e.g., -77.0369 (degrees) - population_millions (numeric): Population in millions - e.g., 331.9

Temporal Dimensions (4 columns): - date (Date): Weekly observation date - e.g., 2015-01-05 - year (numeric): Calendar year - e.g., 2015, 2020, 2025 - month (numeric): Month (1-12) - e.g., 1 (January), 12 (December) - week (numeric): ISO week number (1-53) - e.g., 1, 26, 52

Climate Indicators (7 columns): - temperature_celsius (numeric): Average weekly temperature (°C) - e.g., 15.2, 28.7 - temp_anomaly_celsius (numeric): Temperature deviation from baseline (°C) - e.g., +1.3, -0.8 - precipitation_mm (numeric): Weekly precipitation (millimeters) - e.g., 25.4, 120.3 - heat_wave_days (numeric): Days with extreme heat (0-7) - e.g., 0, 3, 5 - drought_indicator (numeric): Drought severity index (0-3) - e.g., 0 (none), 2 (moderate) - flood_indicator (numeric): Flood occurrence indicator (0-1) - e.g., 0 (no), 1 (yes) - extreme_weather_events (numeric): Count of extreme events per week - e.g., 0, 2, 4

Air Quality (2 columns): - pm25_ugm3 (numeric): PM2.5 concentration (μg/m³) - e.g., 12.5, 85.3 - air_quality_index (numeric): EPA Air Quality Index (0-500) - e.g., 42 (Good), 187 (Unhealthy)

Health Outcomes (5 columns): - respiratory_disease_rate (numeric): Cases per 100,000 population - e.g., 45.2, 128.7 - cardio_mortality_rate (numeric): Deaths per 100,000 population - e.g., 8.3, 23.1 - vector_disease_risk_score (numeric): Disease risk index (0-100) - e.g., 15.6, 67.4 - waterborne_disease_incidents (numeric): Weekly incident count - e.g., 2, 15, 37 - heat_related_admissions (numeric): Hospital admissions per week - e.g., 5, 42, 89

Socioeconomic & Wellbeing Indicators (4 columns): - healthcare_access_index (numeric): Population access percentage (0-100%) - e.g., 75.3, 95.8 - gdp_per_capita_usd (numeric): GDP per capita (USD) - e.g., 15,234, 65,789 - mental_health_index (numeric): Mental wellbeing score (0-100) - e.g., 68.2, 82.5 - food_security_index (numeric): Food security score (0-100) - e.g., 72.1, 91.3

5.1 Loading libraries

# # Install required packages if not already installed
# if (!require(caret)) install.packages("caret", repos = "https://cran.rstudio.com/")
# if (!require(dplyr)) install.packages("dplyr", repos = "https://cran.rstudio.com/")
# if (!require(ggplot2)) install.packages("ggplot2", repos = "https://cran.rstudio.com/")
# if (!require(MLmetrics)) install.packages("MLmetrics", repos = "https://cran.rstudio.com/")
# if (!require(readr)) install.packages("readr", repos = "https://cran.rstudio.com/")
# if (!require(repr)) install.packages("repr", repos = "https://cran.rstudio.com/")
# if (!require(reshape2)) install.packages("reshape2", repos = "https://cran.rstudio.com/")
# if (!require(ROSE)) install.packages("ROSE", repos = "https://cran.rstudio.com/")

# if (!require(lubridate)) install.packages("lubridate", repos = "https://cran.rstudio.com/")
#  if (!require(e1071)) install.packages("e1071", repos = "https://cran.rstudio.com/")

# Load all required libraries
library(caret)      # Classification and Regression Training

## Loading required package: ggplot2

## Loading required package: lattice

library(dplyr)      # Data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)    # Data visualization
library(MLmetrics)  # Machine Learning metrics

## 
## Attaching package: 'MLmetrics'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## The following object is masked from 'package:base':
## 
##     Recall

library(readr)      # Fast data reading
library(repr)       # Data representation
library(reshape2)   # Data reshaping
library(ROSE)       # Random Over-Sampling Examples

## Loaded ROSE 0.0-4

library(lubridate)  # Date manipulation

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(e1071)      # Statistical functions (skewness)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:ggplot2':
## 
##     element

cat("All libraries loaded successfully! ✓")

## All libraries loaded successfully! ✓

5.2 Load the Dataset

# Read the CSV of the dataset
# Google Drive URL converted to direct download format for Colab compatibility
csv_url <- "https://drive.google.com/uc?id=14kzQfJ7L_7NWBWbvX8RWBuQbAe93jPTy&export=download"
df <- read_csv(csv_url)

## Rows: 14100 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): country_code, country_name, region, income_level
## dbl  (25): record_id, year, month, week, latitude, longitude, population_mil...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Store original dimensions before cleaning
original_dim <- dim(df)

str(df)

## spc_tbl_ [14,100 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ record_id                   : num [1:14100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ country_code                : chr [1:14100] "USA" "USA" "USA" "USA" ...
##  $ country_name                : chr [1:14100] "United States" "United States" "United States" "United States" ...
##  $ region                      : chr [1:14100] "North America" "North America" "North America" "North America" ...
##  $ income_level                : chr [1:14100] "High" "High" "High" "High" ...
##  $ date                        : Date[1:14100], format: "2015-01-04" "2015-01-11" ...
##  $ year                        : num [1:14100] 2015 2015 2015 2015 2015 ...
##  $ month                       : num [1:14100] 1 1 1 1 2 2 2 2 3 3 ...
##  $ week                        : num [1:14100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ latitude                    : num [1:14100] 37.1 37.1 37.1 37.1 37.1 ...
##  $ longitude                   : num [1:14100] -95.7 -95.7 -95.7 -95.7 -95.7 ...
##  $ population_millions         : num [1:14100] 331 331 331 331 331 331 331 331 331 331 ...
##  $ temperature_celsius         : num [1:14100] 4.59 3.13 3.99 6.43 9 6.44 6.86 8.34 9.58 8.25 ...
##  $ temp_anomaly_celsius        : num [1:14100] 0.76 -0.5 -0.14 -0.06 0.47 0.18 0.38 0.64 -0.83 0.71 ...
##  $ precipitation_mm            : num [1:14100] 75.7 97 74.1 87.7 75.8 ...
##  $ heat_wave_days              : num [1:14100] 0 0 0 0 1 0 0 1 0 1 ...
##  $ drought_indicator           : num [1:14100] 0 0 0 0 0 0 0 0 0 0 ...
##  $ flood_indicator             : num [1:14100] 0 0 0 0 0 0 0 0 0 0 ...
##  $ extreme_weather_events      : num [1:14100] 0 0 0 0 1 0 0 1 0 1 ...
##  $ pm25_ugm3                   : num [1:14100] 39 17.9 91.5 5.5 37.1 ...
##  $ air_quality_index           : num [1:14100] 82 6 137 -3 48 157 51 5 29 22 ...
##  $ respiratory_disease_rate    : num [1:14100] 69.4 70 66.9 47 61.3 80.2 70.3 48 80.3 59.1 ...
##  $ cardio_mortality_rate       : num [1:14100] 31.5 26.3 33.4 35 28.3 30.6 33.2 33.3 24.9 32.5 ...
##  $ vector_disease_risk_score   : num [1:14100] 6.6 5.2 1.3 6 1.4 0.7 5.9 6.4 5.9 0.3 ...
##  $ waterborne_disease_incidents: num [1:14100] 16.2 11.4 19.5 9.7 22.6 23.7 25.4 17.7 18.3 18.8 ...
##  $ heat_related_admissions     : num [1:14100] 1.4 0 0 9 27.3 11.9 9.1 20.3 7.8 13.7 ...
##  $ healthcare_access_index     : num [1:14100] 77.3 83.6 84.7 84.3 83.6 78 81.7 86.7 84.9 87.1 ...
##  $ gdp_per_capita_usd          : num [1:14100] 63627 63627 63627 63627 63733 ...
##  $ mental_health_index         : num [1:14100] 71.2 70.6 63.4 68.1 69.1 70.1 68.5 59.4 75.7 71.9 ...
##  $ food_security_index         : num [1:14100] 90.2 94 100 96.4 100 97.1 100 95 100 95.2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   record_id = col_double(),
##   ..   country_code = col_character(),
##   ..   country_name = col_character(),
##   ..   region = col_character(),
##   ..   income_level = col_character(),
##   ..   date = col_date(format = ""),
##   ..   year = col_double(),
##   ..   month = col_double(),
##   ..   week = col_double(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   population_millions = col_double(),
##   ..   temperature_celsius = col_double(),
##   ..   temp_anomaly_celsius = col_double(),
##   ..   precipitation_mm = col_double(),
##   ..   heat_wave_days = col_double(),
##   ..   drought_indicator = col_double(),
##   ..   flood_indicator = col_double(),
##   ..   extreme_weather_events = col_double(),
##   ..   pm25_ugm3 = col_double(),
##   ..   air_quality_index = col_double(),
##   ..   respiratory_disease_rate = col_double(),
##   ..   cardio_mortality_rate = col_double(),
##   ..   vector_disease_risk_score = col_double(),
##   ..   waterborne_disease_incidents = col_double(),
##   ..   heat_related_admissions = col_double(),
##   ..   healthcare_access_index = col_double(),
##   ..   gdp_per_capita_usd = col_double(),
##   ..   mental_health_index = col_double(),
##   ..   food_security_index = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Check the shape of the dataset
dim(df)

## [1] 14100    30

The dataset contain 14,100 rows and 30 columns

sapply(df, function(x) sum(is.na(x)))

##                    record_id                 country_code 
##                            0                            0 
##                 country_name                       region 
##                            0                            0 
##                 income_level                         date 
##                            0                            0 
##                         year                        month 
##                            0                            0 
##                         week                     latitude 
##                            0                            0 
##                    longitude          population_millions 
##                            0                            0 
##          temperature_celsius         temp_anomaly_celsius 
##                            0                            0 
##             precipitation_mm               heat_wave_days 
##                            0                            0 
##            drought_indicator              flood_indicator 
##                            0                            0 
##       extreme_weather_events                    pm25_ugm3 
##                            0                            0 
##            air_quality_index     respiratory_disease_rate 
##                            0                            0 
##        cardio_mortality_rate    vector_disease_risk_score 
##                            0                            0 
## waterborne_disease_incidents      heat_related_admissions 
##                            0                            0 
##      healthcare_access_index           gdp_per_capita_usd 
##                            0                            0 
##          mental_health_index          food_security_index 
##                            0                            0

There is no missing values in the columns

head(df)

## # A tibble: 6 × 30
##   record_id country_code country_name region income_level date        year month
##       <dbl> <chr>        <chr>        <chr>  <chr>        <date>     <dbl> <dbl>
## 1         1 USA          United Stat… North… High         2015-01-04  2015     1
## 2         2 USA          United Stat… North… High         2015-01-11  2015     1
## 3         3 USA          United Stat… North… High         2015-01-18  2015     1
## 4         4 USA          United Stat… North… High         2015-01-25  2015     1
## 5         5 USA          United Stat… North… High         2015-02-01  2015     2
## 6         6 USA          United Stat… North… High         2015-02-08  2015     2
## # ℹ 22 more variables: week <dbl>, latitude <dbl>, longitude <dbl>,
## #   population_millions <dbl>, temperature_celsius <dbl>,
## #   temp_anomaly_celsius <dbl>, precipitation_mm <dbl>, heat_wave_days <dbl>,
## #   drought_indicator <dbl>, flood_indicator <dbl>,
## #   extreme_weather_events <dbl>, pm25_ugm3 <dbl>, air_quality_index <dbl>,
## #   respiratory_disease_rate <dbl>, cardio_mortality_rate <dbl>,
## #   vector_disease_risk_score <dbl>, waterborne_disease_incidents <dbl>, …

There is no issue with inconsistent capitalization for the column names

6 Data Cleaning

6.1 Check for duplicate records

Here we check the duplicates by looking at the record_id and the country + date.

# Check duplicates by record_id (should be unique)
dup_records <- sum(duplicated(df$record_id))

cat('Duplicate record_ids:', dup_records, '\n')

## Duplicate record_ids: 0

total_duplicates_removed <- 0

# Store total for summary

# Check duplicates by country + date combination

if("country_name" %in% colnames(df) && "date" %in% colnames(df)) {}

## NULL

  dup_country_date <- sum(duplicated(df[, c("country_name", "date")]))
  cat('Duplicate country+date combinations:', dup_country_date, '\n')

## Duplicate country+date combinations: 0

No duplicates are found

6.2 Validation - verify weekly records per country

# Calculate records per country per year (should be approximately 52-53 weeks)
if("country_name" %in% colnames(df) && "year" %in% colnames(df)) {
  temporal_check <- df %>%
    group_by(country_name, year) %>%
    summarise(record_count = n(), .groups = 'drop') %>%
    filter(record_count < 45 | record_count > 55)

  temporal_flags_count <- nrow(temporal_check)

  if(temporal_flags_count > 0) {
    cat('\n=== ISSUES DETECTED ===\n')
    cat('Country-year combinations with unusual record counts:\n')
    print(temporal_check, n = Inf)
  } else {
    cat('Temporal validation passed - all country-years have ~52-53 records\n')
  }

  temporal_issues <- temporal_check
} else {
  temporal_flags_count <- 0
  temporal_issues <- data.frame()
  cat('Cannot perform temporal validation - required columns not found\n')
}

## 
## === ISSUES DETECTED ===
## Country-year combinations with unusual record counts:
## # A tibble: 25 × 3
##    country_name    year record_count
##    <chr>          <dbl>        <int>
##  1 Argentina       2025           42
##  2 Australia       2025           42
##  3 Bangladesh      2025           42
##  4 Brazil          2025           42
##  5 Canada          2025           42
##  6 China           2025           42
##  7 Colombia        2025           42
##  8 Egypt           2025           42
##  9 France          2025           42
## 10 Germany         2025           42
## 11 India           2025           42
## 12 Indonesia       2025           42
## 13 Italy           2025           42
## 14 Japan           2025           42
## 15 Kenya           2025           42
## 16 Mexico          2025           42
## 17 Nigeria         2025           42
## 18 Pakistan        2025           42
## 19 Philippines     2025           42
## 20 South Africa    2025           42
## 21 Spain           2025           42
## 22 Thailand        2025           42
## 23 United Kingdom  2025           42
## 24 United States   2025           42
## 25 Vietnam         2025           42

A calendar year has approximately 52 weeks (365 days ÷ 7 days/week ≈ 52.14 weeks). Some years have 53 weeks depending on how the year starts and ends.

However from the output above, there are some countries that have incomplete data for 2025
Explanation: Since we are currently in December 2025 and the year hasn’t ended yet, the dataset contains approximately 38-42 weeks of data for 2025 across all countries
This is expected behavior and not a data quality issue
Recommendation for next step: For year-over-year comparisons, consider:
- Excluding 2025 data from annual analyses
- Using only complete years (2015-2024) for trend analysis
- Clearly documenting that 2025 represents partial year data when included in visualizations

6.3 Validate income level categories

# Check income_level contains expected classifications
expected_income_levels <- c("Low", "Lower-Middle", "Upper-Middle", "High")

if("income_level" %in% colnames(df)) {
  unique_income_levels <- unique(df$income_level)
  cat('Income level categories found:\n')
  print(unique_income_levels)

  # Check for unexpected categories
  unexpected_income_levels <- setdiff(unique_income_levels, expected_income_levels)

  if(length(unexpected_income_levels) > 0) {
    cat('\nUnexpected income level categories detected:\n')
    print(unexpected_income_levels)
  } else {
    cat('\nAll income levels match expected classifications\n')
  }
} else {
  unexpected_income_levels <- c()
  cat('income_level column not found\n')
}

## Income level categories found:
## [1] "High"         "Lower-Middle" "Upper-Middle"
## 
## All income levels match expected classifications

6.4 Income Level Validation

The dataset contains 3 of the 4 expected income level classifications: - High: High-income countries - Upper-Middle: Upper-middle-income countries
- Lower-Middle: Lower-middle-income countries - Low: (Not present in dataset)

Note: The “Low” income category is absent from the dataset, meaning no low-income countries are included in this climate-health study. All present categories follow the expected classification format.

6.5 Validate geographic coordinates

# Check coordinate validity and consistency
lat_violations <- 0
lon_violations <- 0
coord_missing <- 0
coord_records_removed <- 0
inconsistent_coord_countries <- c()

if("latitude" %in% colnames(df) && "longitude" %in% colnames(df)) {
  # Check for values outside valid ranges
  lat_violations <- sum(df$latitude < -90 | df$latitude > 90, na.rm = TRUE)
  lon_violations <- sum(df$longitude < -180 | df$longitude > 180, na.rm = TRUE)
  coord_missing <- sum(is.na(df$latitude) | is.na(df$longitude))

  cat('=== GEOGRAPHIC COORDINATE VALIDATION ===\n')
  cat('Latitude violations (outside -90 to 90):', lat_violations, '\n')
  cat('Longitude violations (outside -180 to 180):', lon_violations, '\n')
  cat('Missing coordinates:', coord_missing, '\n')

  # Check coordinate consistency per country
  if("country_name" %in% colnames(df)) {
    coord_consistency <- df %>%
      group_by(country_name) %>%
      summarise(
        unique_coords = n_distinct(paste(latitude, longitude)),
        .groups = 'drop'
      ) %>%
      filter(unique_coords > 1)

    if(nrow(coord_consistency) > 0) {
      inconsistent_coord_countries <- coord_consistency$country_name
      cat('\nCountries with inconsistent coordinates:', length(inconsistent_coord_countries), '\n')
      cat('Flagged countries:\n')
      print(inconsistent_coord_countries)
    } else {
      cat('\nAll countries have consistent coordinates\n')
    }
  }

  # Remove records with clearly invalid coordinates
  if(lat_violations > 0 || lon_violations > 0) {
    original_rows <- nrow(df)
    df <- df %>%
      filter(latitude >= -90 & latitude <= 90 & longitude >= -180 & longitude <= 180)
    coord_records_removed <- original_rows - nrow(df)
    cat('\nRemoved', coord_records_removed, 'records with invalid coordinates\n')
  }
} else {
  cat('Coordinate columns not found\n')
}

## === GEOGRAPHIC COORDINATE VALIDATION ===
## Latitude violations (outside -90 to 90): 0 
## Longitude violations (outside -180 to 180): 0 
## Missing coordinates: 0 
## 
## All countries have consistent coordinates

6.6 Validate climate and health data quality

# Validate climate and health data for impossible values
climate_quality_records_removed <- 0
data_violations <- list()

cat('=== CLIMATE AND HEALTH DATA QUALITY VALIDATION ===\n\n')

## === CLIMATE AND HEALTH DATA QUALITY VALIDATION ===

# Temperature check (-50 to 50 Celsius is reasonable range)
if("temperature_celsius" %in% colnames(df)) {
  temp_violations <- sum(df$temperature_celsius < -50 | df$temperature_celsius > 50, na.rm = TRUE)
  data_violations$temperature <- temp_violations
  cat('Temperature violations (outside -50 to 50°C):', temp_violations, '\n')
}

## Temperature violations (outside -50 to 50°C): 0

# Check negative values in rate/count columns (should be non-negative)
rate_columns <- c("respiratory_disease_rate", "cardio_mortality_rate", "vector_disease_risk_score",
                  "waterborne_disease_incidents", "heat_related_admissions")
for(col in rate_columns) {
  if(col %in% colnames(df)) {
    neg_count <- sum(df[[col]] < 0, na.rm = TRUE)
    if(neg_count > 0) {
      data_violations[[col]] <- neg_count
      cat(col, 'negative values:', neg_count, '\n')
    }
  }
}

# Check index bounds
# Healthcare Access Index: 0-100
if("healthcare_access_index" %in% colnames(df)) {
  hai_violations <- sum(df$healthcare_access_index < 0 | df$healthcare_access_index > 100, na.rm = TRUE)
  data_violations$healthcare_access_index <- hai_violations
  cat('Healthcare Access Index violations (outside 0-100):', hai_violations, '\n')

  # Show examples of violations
  if(hai_violations > 0) {
    hai_examples <- df %>%
      filter(healthcare_access_index < 0 | healthcare_access_index > 100) %>%
      select(record_id, country_name, date, healthcare_access_index) %>%
      head(5)
    cat('  Example violations:\n')
    print(hai_examples)
  }
}

## Healthcare Access Index violations (outside 0-100): 3 
##   Example violations:
## # A tibble: 3 × 4
##   record_id country_name   date       healthcare_access_index
##       <dbl> <chr>          <date>                       <dbl>
## 1      4066 United Kingdom 2017-04-02                    102.
## 2      4122 United Kingdom 2018-04-29                    100.
## 3      4467 United Kingdom 2024-12-08                    102

# Air Quality Index: 0-500
if("air_quality_index" %in% colnames(df)) {
  aqi_violations <- sum(df$air_quality_index < 0 | df$air_quality_index > 500, na.rm = TRUE)
  data_violations$air_quality_index <- aqi_violations
  cat('Air Quality Index violations (outside 0-500):', aqi_violations, '\n')

  # Show examples of violations
  if(aqi_violations > 0) {
    aqi_examples <- df %>%
      filter(air_quality_index < 0 | air_quality_index > 500) %>%
      select(record_id, country_name, date, air_quality_index) %>%
      head(5)
    cat('  Example violations:\n')
    print(aqi_examples)
  }
}

## Air Quality Index violations (outside 0-500): 374 
##   Example violations:
## # A tibble: 5 × 4
##   record_id country_name  date       air_quality_index
##       <dbl> <chr>         <date>                 <dbl>
## 1         4 United States 2015-01-25                -3
## 2        11 United States 2015-03-15               -23
## 3        37 United States 2015-09-13               -13
## 4        48 United States 2015-11-29                -1
## 5        61 United States 2016-02-28               -25

# Mental Health Index: 0-100
if("mental_health_index" %in% colnames(df)) {
  mhi_violations <- sum(df$mental_health_index < 0 | df$mental_health_index > 100, na.rm = TRUE)
  data_violations$mental_health_index <- mhi_violations
  cat('Mental Health Index violations (outside 0-100):', mhi_violations, '\n')
}

## Mental Health Index violations (outside 0-100): 0

# Food Security Index: 0-100
if("food_security_index" %in% colnames(df)) {
  fsi_violations <- sum(df$food_security_index < 0 | df$food_security_index > 100, na.rm = TRUE)
  data_violations$food_security_index <- fsi_violations
  cat('Food Security Index violations (outside 0-100):', fsi_violations, '\n')
}

## Food Security Index violations (outside 0-100): 0

# Population should be positive
if("population_millions" %in% colnames(df)) {
  pop_violations <- sum(df$population_millions < 0, na.rm = TRUE)
  data_violations$population_millions <- pop_violations
  cat('Population violations (negative values):', pop_violations, '\n')
}

## Population violations (negative values): 0

# Remove records with impossible values
original_rows <- nrow(df)
df <- df %>%
  filter(
    if("temperature_celsius" %in% colnames(.)) temperature_celsius >= -50 & temperature_celsius <= 50 else TRUE,
    if("respiratory_disease_rate" %in% colnames(.)) respiratory_disease_rate >= 0 | is.na(respiratory_disease_rate) else TRUE,
    if("cardio_mortality_rate" %in% colnames(.)) cardio_mortality_rate >= 0 | is.na(cardio_mortality_rate) else TRUE,
    if("vector_disease_risk_score" %in% colnames(.)) vector_disease_risk_score >= 0 | is.na(vector_disease_risk_score) else TRUE,
    if("waterborne_disease_incidents" %in% colnames(.)) waterborne_disease_incidents >= 0 | is.na(waterborne_disease_incidents) else TRUE,
    if("heat_related_admissions" %in% colnames(.)) heat_related_admissions >= 0 | is.na(heat_related_admissions) else TRUE,
    if("healthcare_access_index" %in% colnames(.)) healthcare_access_index >= 0 & healthcare_access_index <= 100 | is.na(healthcare_access_index) else TRUE,
    if("air_quality_index" %in% colnames(.)) air_quality_index >= 0 & air_quality_index <= 500 | is.na(air_quality_index) else TRUE,
    if("mental_health_index" %in% colnames(.)) mental_health_index >= 0 & mental_health_index <= 100 | is.na(mental_health_index) else TRUE,
    if("food_security_index" %in% colnames(.)) food_security_index >= 0 & food_security_index <= 100 | is.na(food_security_index) else TRUE,
    if("population_millions" %in% colnames(.)) population_millions >= 0 | is.na(population_millions) else TRUE
  )

climate_quality_records_removed <- original_rows - nrow(df)
cat('\nRemoved', climate_quality_records_removed, 'records with invalid climate/health data\n')

## 
## Removed 377 records with invalid climate/health data

6.7 Data Quality Validation Results

6.7.1 Issues Found & Resolved:

1. Healthcare Access Index (3 violations) - Valid range: 0-100% (percentage of population with healthcare access) - Found: UK records with 102% (impossible) - Action: Removed 3 records

2. Air Quality Index (374 violations) - Valid range: 0-500 (EPA scale) - Found: Negative values (-3 to -25) in US data (2015-2016) - Action: Removed 374 records

3. Other Metrics - Temperature, Mental Health Index, Food Security Index, Population: All valid

6.7.2 Summary:

377 total records removed (~2.7% of dataset)
Likely data entry/collection errors from 2015-2018
Dataset now contains only physically possible values

Note: Healthcare Access >100% and negative Air Quality Index values are logically impossible and indicate data quality issues requiring removal.

6.8 Reorder columns by category

# Organize columns in logical order for analysis
column_order <- c(
  # Geographic information (8)
  "record_id", "country_code", "country_name", "region",
  "income_level", "latitude", "longitude", "population_millions",

  # Temporal information (4)
  "date", "year", "month", "week",

  # Climate indicators (8)
  "temperature_celsius", "temp_anomaly_celsius", "precipitation_mm", "heat_wave_days",
  "drought_indicator", "flood_indicator", "extreme_weather_events",

  # Air quality (2)
  "pm25_ugm3", "air_quality_index",

  # Health outcomes (5)
  "respiratory_disease_rate", "cardio_mortality_rate", "vector_disease_risk_score",
  "waterborne_disease_incidents", "heat_related_admissions",

  # Socioeconomic and wellbeing (4)
  "healthcare_access_index", "gdp_per_capita_usd", "mental_health_index", "food_security_index"
)

# Reorder columns
df <- df[, column_order]

cat('Columns reordered into logical categories:\n')

## Columns reordered into logical categories:

cat('- Geographic (8 columns)\n')

## - Geographic (8 columns)

cat('- Temporal (4 columns)\n')

## - Temporal (4 columns)

cat('- Climate (8 columns)\n')

## - Climate (8 columns)

cat('- Air Quality (2 columns)\n')

## - Air Quality (2 columns)

cat('- Health Outcomes (5 columns)\n')

## - Health Outcomes (5 columns)

cat('- Socioeconomic/Wellbeing (4 columns)\n')

## - Socioeconomic/Wellbeing (4 columns)

6.9 Cleaned Dataset Structure

Split by category

# Display final column information organized by category
cat('=== CLEANED DATASET STRUCTURE ===\n\n')

## === CLEANED DATASET STRUCTURE ===

cat('GEOGRAPHIC INFORMATION (8 columns):\n')

## GEOGRAPHIC INFORMATION (8 columns):

cat('  1. record_id:', class(df$record_id), '\n')

##   1. record_id: numeric

cat('  2. country_code:', class(df$country_code), '\n')

##   2. country_code: character

cat('  3. country_name:', class(df$country_name), '\n')

##   3. country_name: character

cat('  4. region:', class(df$region), '\n')

##   4. region: character

cat('  5. income_level:', class(df$income_level), '\n')

##   5. income_level: character

cat('  6. latitude:', class(df$latitude), '\n')

##   6. latitude: numeric

cat('  7. longitude:', class(df$longitude), '\n')

##   7. longitude: numeric

cat('  8. population_millions:', class(df$population_millions), '\n\n')

##   8. population_millions: numeric

cat('TEMPORAL INFORMATION (4 columns):\n')

## TEMPORAL INFORMATION (4 columns):

cat('  9. date:', class(df$date), '\n')

##   9. date: Date

cat(' 10. year:', class(df$year), '\n')

##  10. year: numeric

cat(' 11. month:', class(df$month), '\n')

##  11. month: numeric

cat(' 12. week:', class(df$week), '\n\n')

##  12. week: numeric

cat('CLIMATE INDICATORS (7 columns):\n')

## CLIMATE INDICATORS (7 columns):

cat(' 13. temperature_celsius:', class(df$temperature_celsius), '\n')

##  13. temperature_celsius: numeric

cat(' 14. temp_anomaly_celsius:', class(df$temp_anomaly_celsius), '\n')

##  14. temp_anomaly_celsius: numeric

cat(' 15. precipitation_mm:', class(df$precipitation_mm), '\n')

##  15. precipitation_mm: numeric

cat(' 16. heat_wave_days:', class(df$heat_wave_days), '\n')

##  16. heat_wave_days: numeric

cat(' 17. drought_indicator:', class(df$drought_indicator), '\n')

##  17. drought_indicator: numeric

cat(' 18. flood_indicator:', class(df$flood_indicator), '\n')

##  18. flood_indicator: numeric

cat(' 19. extreme_weather_events:', class(df$extreme_weather_events), '\n\n')

##  19. extreme_weather_events: numeric

cat('AIR QUALITY (2 columns):\n')

## AIR QUALITY (2 columns):

cat(' 20. pm25_ugm3:', class(df$pm25_ugm3), '\n')

##  20. pm25_ugm3: numeric

cat(' 21. air_quality_index:', class(df$air_quality_index), '\n\n')

##  21. air_quality_index: numeric

cat('HEALTH OUTCOMES (5 columns):\n')

## HEALTH OUTCOMES (5 columns):

cat(' 22. respiratory_disease_rate:', class(df$respiratory_disease_rate), '\n')

##  22. respiratory_disease_rate: numeric

cat(' 23. cardio_mortality_rate:', class(df$cardio_mortality_rate), '\n')

##  23. cardio_mortality_rate: numeric

cat(' 24. vector_disease_risk_score:', class(df$vector_disease_risk_score), '\n')

##  24. vector_disease_risk_score: numeric

cat(' 25. waterborne_disease_incidents:', class(df$waterborne_disease_incidents), '\n')

##  25. waterborne_disease_incidents: numeric

cat(' 26. heat_related_admissions:', class(df$heat_related_admissions), '\n\n')

##  26. heat_related_admissions: numeric

cat('SOCIOECONOMIC & WELLBEING (4 columns):\n')

## SOCIOECONOMIC & WELLBEING (4 columns):

cat(' 27. healthcare_access_index:', class(df$healthcare_access_index), '\n')

##  27. healthcare_access_index: numeric

cat(' 28. gdp_per_capita_usd:', class(df$gdp_per_capita_usd), '\n')

##  28. gdp_per_capita_usd: numeric

cat(' 29. mental_health_index:', class(df$mental_health_index), '\n')

##  29. mental_health_index: numeric

cat(' 30. food_security_index:', class(df$food_security_index), '\n')

##  30. food_security_index: numeric

6.10 Summary

# Document issues flagged during data cleaning for team review
cat('=== ISSUES FLAGGED FOR REVIEW ===\n\n')

## === ISSUES FLAGGED FOR REVIEW ===

# Missing values check
cat('1. MISSING VALUES:\n')

## 1. MISSING VALUES:

missing_counts <- colSums(is.na(df))
total_missing <- sum(missing_counts)
if(total_missing > 0) {
  cat('   Total missing values:', total_missing, '\n')
  cols_with_missing <- names(missing_counts[missing_counts > 0])
  for(col in cols_with_missing) {
    pct <- round(missing_counts[col] / nrow(df) * 100, 1)
    cat('   -', col, ':', missing_counts[col], 'missing (', pct, '%)\n')
  }
} else {
  cat('   No missing values in the dataset\n')
}

##    No missing values in the dataset

cat('\n')

# Coordinate inconsistencies
cat('2. COORDINATE INCONSISTENCIES:\n')

## 2. COORDINATE INCONSISTENCIES:

if(exists("inconsistent_coord_countries") && length(inconsistent_coord_countries) > 0) {
  cat('   Countries with varying coordinates:', length(inconsistent_coord_countries), '\n')
  cat('   Flagged for manual review:\n')
  for(country in inconsistent_coord_countries) {
    cat('   -', country, '\n')
  }
} else {
  cat('   None found\n')
}

##    None found

cat('\n')

# Temporal gaps
cat('3. TEMPORAL CONSISTENCY:\n')

## 3. TEMPORAL CONSISTENCY:

if(exists("temporal_issues") && nrow(temporal_issues) > 0) {
  cat('   Countries with unexpected weekly record counts:\n')
  print(temporal_issues)
} else {
  cat('   None found\n')
}

##    Countries with unexpected weekly record counts:
## # A tibble: 25 × 3
##    country_name  year record_count
##    <chr>        <dbl>        <int>
##  1 Argentina     2025           42
##  2 Australia     2025           42
##  3 Bangladesh    2025           42
##  4 Brazil        2025           42
##  5 Canada        2025           42
##  6 China         2025           42
##  7 Colombia      2025           42
##  8 Egypt         2025           42
##  9 France        2025           42
## 10 Germany       2025           42
## # ℹ 15 more rows

cat('\n')

# Week mismatches
cat('4. WEEK NUMBER DISCREPANCIES:\n')

## 4. WEEK NUMBER DISCREPANCIES:

if(exists("week_mismatches") && week_mismatches > 0) {
  cat('   Records where week column differs from calculated ISO week:', week_mismatches, '\n')
  cat('   Consider reviewing week column calculation methodology\n')
} else {
  cat('   None found\n')
}

##    None found

cat('\n')

# Income level issues
cat('5. INCOME LEVEL VALIDATION:\n')

## 5. INCOME LEVEL VALIDATION:

if(exists("unexpected_income_levels") && length(unexpected_income_levels) > 0) {
  cat('   Unexpected income level categories found:\n')
  for(level in unexpected_income_levels) {
    cat('   -', level, '\n')
  }
} else {
  cat('   All income levels validated successfully\n')
}

##    All income levels validated successfully

cat('\n')

# Data validation summary
cat('6. DATA QUALITY ISSUES RESOLVED:\n')

## 6. DATA QUALITY ISSUES RESOLVED:

cat('   Duplicate records removed:', total_duplicates_removed, '\n')

##    Duplicate records removed: 0

cat('   Invalid coordinates removed:', coord_records_removed, '\n')

##    Invalid coordinates removed: 0

cat('   Invalid climate/health data removed:', climate_quality_records_removed, '\n')

##    Invalid climate/health data removed: 377

cat('\n')

6.11 To take note:

6.11.1 ⚠️ 2025 Data is Incomplete:

The dataset contains only 42 weeks of data for 2025 (partial year). - Recommendation: Exclude 2025 from year-over-year trend analysis - Use only 2015-2024 (10 complete years) for annual comparisons - If including 2025, clearly label it as “partial year data”

6.11.2 Data Quality Cleaning:

377 records were removed due to invalid climate/health values: - Healthcare Access Index violations: 3 records (>100%) - Air Quality Index violations: 374 records (>500) - All other validation checks passed

6.11.3 Categorical Variables:

Several columns need factor conversion for analysis: - country_code, country_name, region - income_level (ordered factor: Lower-Middle < Upper-Middle < High) - Optional: drought_indicator, flood_indicator

6.11.4 Income Level Note:

“Low” income category is absent from the dataset. Only 3 categories present: - High - Upper-Middle
- Lower-Middle

6.12 Preview of Cleaned Dataset

# Display first few rows of cleaned dataset
cat('First 10 rows of cleaned dataset:\n\n')

## First 10 rows of cleaned dataset:

head(df, 10)

## # A tibble: 10 × 30
##    record_id country_code country_name  region   income_level latitude longitude
##        <dbl> <chr>        <chr>         <chr>    <chr>           <dbl>     <dbl>
##  1         1 USA          United States North A… High             37.1     -95.7
##  2         2 USA          United States North A… High             37.1     -95.7
##  3         3 USA          United States North A… High             37.1     -95.7
##  4         5 USA          United States North A… High             37.1     -95.7
##  5         6 USA          United States North A… High             37.1     -95.7
##  6         7 USA          United States North A… High             37.1     -95.7
##  7         8 USA          United States North A… High             37.1     -95.7
##  8         9 USA          United States North A… High             37.1     -95.7
##  9        10 USA          United States North A… High             37.1     -95.7
## 10        12 USA          United States North A… High             37.1     -95.7
## # ℹ 23 more variables: population_millions <dbl>, date <date>, year <dbl>,
## #   month <dbl>, week <dbl>, temperature_celsius <dbl>,
## #   temp_anomaly_celsius <dbl>, precipitation_mm <dbl>, heat_wave_days <dbl>,
## #   drought_indicator <dbl>, flood_indicator <dbl>,
## #   extreme_weather_events <dbl>, pm25_ugm3 <dbl>, air_quality_index <dbl>,
## #   respiratory_disease_rate <dbl>, cardio_mortality_rate <dbl>,
## #   vector_disease_risk_score <dbl>, waterborne_disease_incidents <dbl>, …

7 Data visualization/EDA

# install.packages("corrplot")
# install.packages("skimr")
# Load libraries
library(corrplot)

## corrplot 0.95 loaded

library(skimr)
library(ggplot2)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(ggplot2)
library(reshape2)

7.1 Dataset Overview

~13,700 rows, 30 columns
25 countries, 8 regions, 3 income levels
Variables: climate, health, socio-economic

nrow(df); ncol(df)

## [1] 13723

## [1] 30

length(unique(df$country_code))

## [1] 25

length(unique(df$region))

## [1] 8

length(unique(df$income_level))

## [1] 3

7.2 Data Quality

No major missing values
Strong categorical diversity

colSums(is.na(df))

##                    record_id                 country_code 
##                            0                            0 
##                 country_name                       region 
##                            0                            0 
##                 income_level                     latitude 
##                            0                            0 
##                    longitude          population_millions 
##                            0                            0 
##                         date                         year 
##                            0                            0 
##                        month                         week 
##                            0                            0 
##          temperature_celsius         temp_anomaly_celsius 
##                            0                            0 
##             precipitation_mm               heat_wave_days 
##                            0                            0 
##            drought_indicator              flood_indicator 
##                            0                            0 
##       extreme_weather_events                    pm25_ugm3 
##                            0                            0 
##            air_quality_index     respiratory_disease_rate 
##                            0                            0 
##        cardio_mortality_rate    vector_disease_risk_score 
##                            0                            0 
## waterborne_disease_incidents      heat_related_admissions 
##                            0                            0 
##      healthcare_access_index           gdp_per_capita_usd 
##                            0                            0 
##          mental_health_index          food_security_index 
##                            0                            0

table(df$income_level)

## 
##         High Lower-Middle Upper-Middle 
##         4866         5074         3783

7.3 Histograms

# Select only numeric columns for histograms
numeric_cols <- names(df)[sapply(df, is.numeric)]

# Exclude identifier and temporal columns that are not suitable for histograms
exclude_cols <- c("record_id", "year", "month", "week", "latitude", "longitude")
plot_numeric_cols <- setdiff(numeric_cols, exclude_cols)

# Prepare data for plotting (long format)
hist_data_long <- df %>%
  select(all_of(plot_numeric_cols)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

# Create histograms for each selected numeric variable
histogram_plot <- ggplot(hist_data_long, aes(x = value)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "black") +
  facet_wrap(~ variable, scales = "free") +
  labs(
    title = "Histograms of Numerical Variables",
    x = "Value",
    y = "Frequency"
  ) +
  theme_minimal()

print(histogram_plot)

📊 Distribution Insights

Skewed: air quality, PM2.5, heat wave days, GDP, population
Bimodal: mental health, food security
Near-normal: temperature, precipitation, respiratory disease

🧠 Modeling Implications - Apply log or Box-Cox transformations to skewed variables before regression or tree-based modeling. - Use standardization for variables with wide ranges (e.g. GDP, population) to improve model stability. - Consider clustering or segmentation for bimodal variables to uncover latent groups. - Flag outliers for robust evaluation and scenario testing during deployment.

7.4 Correlation Matrix

# Select numeric columns
selected_cols <- c(
  "temperature_celsius", "precipitation_mm", "heat_wave_days",
  "extreme_weather_events", "air_quality_index",
  "vector_disease_risk_score", "heat_related_admissions",
  "respiratory_disease_rate", "healthcare_access_index",
  "food_security_index"
)

# Prepare correlation matrix
cor_matrix <- cor(df[, selected_cols], use = "complete.obs")

# Convert to long format for ggplot
cor_data <- melt(cor_matrix)

# Plot heatmap
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  ) +
  labs(
    title = "Correlation Heatmap of Climate Variables",
    x = "",
    y = ""
  )

📊 Correlation Highlights - Temp ↔︎ Vector disease risk (0.65) - Temp ↔︎ Heat-related admissions (0.42) - Heat wave days ↔︎ Extreme events (0.96) → redundant - Healthcare access & food security negatively correlated with climate stressors

🧠 Modeling & Deployment Tips - Prioritize temperature, healthcare access, and vector risk as core features. - Use interaction terms (e.g. temperature × access index) to capture compounding effects. - Evaluate model fairness across low-access regions. - Highlight vulnerable zones in dashboards for policy targeting.

8 Regression model

8.1 Install Required Packages for Regression Models

This section need to run for 20mins+, Require the progress statement to prevent the R runtime disconnection due to false sense of inactivity

# SECTION: PACKAGE INSTALLATION & ENVIRONMENT SETUP

# 1. Define required packages (Excludes pre-installed dplyr/ggplot2)
packages <- c("caret", "glmnet", "ranger", "xgboost", "recipes")

# 2. Configure Repositories for High-Speed Linux Binaries
options(
  repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"),
  Ncpus = parallel::detectCores() # Automatically uses 2 cores (or more if available)
)

# 3. Identify ONLY what is missing
missing_pkgs <- packages[!(packages %in% installed.packages()[, "Package"])]

# 4. Install missing items in one optimized batch
if (length(missing_pkgs) > 0) {
  cat("Fast-installing missing binaries and dependencies:", paste(missing_pkgs, collapse = ", "), "\n")

  # dependencies = TRUE is the "secret sauce" to get all 27+ sub-packages at once
  install.packages(missing_pkgs, dependencies = TRUE, quiet = TRUE)
}

# 5. Load everything efficiently
library(dplyr)
invisible(lapply(packages, library, character.only = TRUE))

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-10

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:Matrix':
## 
##     update

## The following object is masked from 'package:stats':
## 
##     step

cat("----------------------------------------------------------\n")

## ----------------------------------------------------------

cat("SUCCESS: Environment Ready (using", parallel::detectCores(), "cores)\n")

## SUCCESS: Environment Ready (using 12 cores)

8.2 Step 1: Prepare Data for Regression

Validation & Cleaning: Applied logical range filters to environmental and economic variables (e.g., -50°C to 60°C for temperature) and discarded 2025 data to ensure a clean, historical dataset.
Categorical Encoding: Converted region into a factor and income_level into an ordered factor, preserving the natural economic hierarchy (Low to High).
Feature Engineering: Applied log1p transformations to GDP and Air Quality (AQI) variables to normalize right-skewed data and improve model stability.
Feature Selection: Curated a final set of 11 predictors (climate, pollution, and socioeconomic metrics) and used na.omit() to ensure a “complete case” dataset.
Stratified Splitting: Partitioned the data into an 80/20 train-test split using stratified sampling to ensure both sets have a similar distribution of heat-related admissions.

# --- 1. Filter and Clean Data ---
df_cleaned_for_regression <- df %>%
  filter(
    # Core target validation
    if("heat_related_admissions" %in% colnames(.)) heat_related_admissions >= 0 | is.na(heat_related_admissions) else TRUE,

    # Predictor validations
    if("temperature_celsius" %in% colnames(.)) temperature_celsius >= -50 & temperature_celsius <= 60 else TRUE,
    if("temp_anomaly_celsius" %in% colnames(.)) temp_anomaly_celsius >= -20 & temp_anomaly_celsius <= 20 else TRUE,
    if("heat_wave_days" %in% colnames(.)) heat_wave_days >= 0 & heat_wave_days <= 366 else TRUE,
    if("air_quality_index" %in% colnames(.)) air_quality_index >= 0 & air_quality_index <= 500 else TRUE,
    if("healthcare_access_index" %in% colnames(.)) healthcare_access_index >= 0 & healthcare_access_index <= 100 else TRUE,
    if("gdp_per_capita_usd" %in% colnames(.)) gdp_per_capita_usd >= 0 else TRUE
  )

# --- 1.1 Discard 2025 data ---
df_clean <- df_cleaned_for_regression %>% filter(year < 2025)

# --- 1.2 Convert categorical variables to factors ---
df_clean$region <- as.factor(df_clean$region)
df_clean$income_level <- factor(df_clean$income_level,
                                levels = c("Low", "Lower-Middle", "Upper-Middle", "High"), # Added 'Low' just in case
                                ordered = TRUE)

# --- 1.3 Feature Engineering & Transformations ---
# Using log1p (log(1+x)) to handle skewed environmental and economic data
df_clean$log_gdp <- log1p(df_clean$gdp_per_capita_usd)
df_clean$log_pm25 <- log1p(df_clean$pm25_ugm3)
df_clean$log_aqi <- log1p(df_clean$air_quality_index)

# --- 1.4 Select Features for Regression ---
# UPDATED: Included heat_related_admissions and removed respiratory_disease_rate
features <- c("temperature_celsius", "temp_anomaly_celsius", "precipitation_mm",
              "heat_wave_days", "log_pm25", "log_aqi", "healthcare_access_index",
              "log_gdp", "region", "income_level")

# We must select the target variable explicitly to use it in df_model
df_model <- df_clean %>%
  select(all_of(features), heat_related_admissions) %>%
  na.omit()

# --- 1.5 Split Data (80/20) ---
set.seed(123)
# Splitting based on the heat_related_admissions distribution
trainIndex_reg <- createDataPartition(df_model$heat_related_admissions, p = .8, list = FALSE)
train_data_reg <- df_model[trainIndex_reg, ]
test_data_reg  <- df_model[-trainIndex_reg, ]

cat("Environment setup for Heat Related Admissions complete! ✓\n")

## Environment setup for Heat Related Admissions complete! ✓

cat("Training set size:", nrow(train_data_reg), "rows\n")

## Training set size: 10162 rows

8.3 Step 2: Configure Cross-Validation

Set up 5-fold cross-validation for model training to ensure the model generalizes well and to prevent overfitting.

# Set up 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5, verboseIter = TRUE)

8.4 Step 3: Train Regression Models

Hardware Optimization: Automatically detected CPU capacity and allocated parallel threads (n_cores - 1) to maximize processing speed for ensemble models.

Data Transformation: Implemented One-Hot Encoding via dummyVars to convert categorical regions and income levels into numeric matrices, ensuring compatibility across all model types.

Based on research objectives, we will train three distinct models:

Model 1 (Elastic Net): Established a robust linear baseline using glmnet to manage multicollinearity between correlated climate variables (e.g., Temperature vs. Heat Wave Days).
Model 2 (Random Forest): Utilized the ranger implementation to capture complex, non-linear interactions between socioeconomic factors and environmental stressors.
Model 3 (XGBoost): Implemented a high-performance Gradient Boosting approach using a native DMatrix structure for memory efficiency and maximum predictive accuracy.

Training Safeguards: Integrated Early Stopping (10 rounds) in XGBoost and 5-fold Cross-Validation in caret to prevent overfitting and ensure the models generalize well to new heat admission data.

# Re-detect hardware capacity
library(parallel)
n_cores <- detectCores()
# Define the missing variable
suggested_threads <- max(1, n_cores - 1)

cat("Hardware Check: Using", suggested_threads, "threads for training.\n")

## Hardware Check: Using 11 threads for training.

# --- One-Hot Encoding ---
dummy_transformer <- dummyVars(heat_related_admissions ~ ., data = df_model)

# Standardize as matrices immediately to avoid repeating the conversion inside models
to_plain_matrix <- function(transformer, data) {
  tmp <- predict(transformer, newdata = data)
  mat <- matrix(as.numeric(tmp), nrow = nrow(tmp), ncol = ncol(tmp))
  colnames(mat) <- colnames(tmp)
  return(mat)
}

train_x <- to_plain_matrix(dummy_transformer, train_data_reg)
test_x  <- to_plain_matrix(dummy_transformer, test_data_reg)

train_y <- train_data_reg$heat_related_admissions
test_y  <- test_data_reg$heat_related_admissions

dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest  <- xgb.DMatrix(data = test_x, label = test_y)

# --- Step 3: Train Regression Models ---

# Model 1: Elastic Net
cat("Training Model 1: Elastic Net...\n")

## Training Model 1: Elastic Net...

time_enet <- system.time({
  reg_model_enet <- train(
    x = train_x, y = train_y,
    method = "glmnet",
    trControl = train_control
  )
})

## + Fold1: alpha=0.10, lambda=1.326 
## - Fold1: alpha=0.10, lambda=1.326 
## + Fold1: alpha=0.55, lambda=1.326 
## - Fold1: alpha=0.55, lambda=1.326 
## + Fold1: alpha=1.00, lambda=1.326 
## - Fold1: alpha=1.00, lambda=1.326 
## + Fold2: alpha=0.10, lambda=1.326 
## - Fold2: alpha=0.10, lambda=1.326 
## + Fold2: alpha=0.55, lambda=1.326 
## - Fold2: alpha=0.55, lambda=1.326 
## + Fold2: alpha=1.00, lambda=1.326 
## - Fold2: alpha=1.00, lambda=1.326 
## + Fold3: alpha=0.10, lambda=1.326 
## - Fold3: alpha=0.10, lambda=1.326 
## + Fold3: alpha=0.55, lambda=1.326 
## - Fold3: alpha=0.55, lambda=1.326 
## + Fold3: alpha=1.00, lambda=1.326 
## - Fold3: alpha=1.00, lambda=1.326 
## + Fold4: alpha=0.10, lambda=1.326 
## - Fold4: alpha=0.10, lambda=1.326 
## + Fold4: alpha=0.55, lambda=1.326 
## - Fold4: alpha=0.55, lambda=1.326 
## + Fold4: alpha=1.00, lambda=1.326 
## - Fold4: alpha=1.00, lambda=1.326 
## + Fold5: alpha=0.10, lambda=1.326 
## - Fold5: alpha=0.10, lambda=1.326 
## + Fold5: alpha=0.55, lambda=1.326 
## - Fold5: alpha=0.55, lambda=1.326 
## + Fold5: alpha=1.00, lambda=1.326 
## - Fold5: alpha=1.00, lambda=1.326 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.55, lambda = 0.0133 on full training set

cat("Elapsed Time:", time_enet["elapsed"], "seconds\nModel 1 Complete! ✓\n\n")

## Elapsed Time: 0.33 seconds
## Model 1 Complete! ✓

# Model 2: Ranger (Random Forest)
cat("Training Model 2: Ranger Forest...\n")

## Training Model 2: Ranger Forest...

time_rf <- system.time({
  reg_model_rf <- train(
    x = train_x, y = train_y,
    method = "ranger",
    trControl = train_control,
    num.trees = 100,
    num.threads = suggested_threads,
    importance = "impurity"
  )
})

## + Fold1: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold1: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold1: mtry=10, min.node.size=5, splitrule=variance 
## - Fold1: mtry=10, min.node.size=5, splitrule=variance 
## + Fold1: mtry=19, min.node.size=5, splitrule=variance 
## - Fold1: mtry=19, min.node.size=5, splitrule=variance 
## + Fold1: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold1: mtry=10, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry=10, min.node.size=5, splitrule=extratrees 
## + Fold1: mtry=19, min.node.size=5, splitrule=extratrees 
## - Fold1: mtry=19, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold2: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold2: mtry=10, min.node.size=5, splitrule=variance 
## - Fold2: mtry=10, min.node.size=5, splitrule=variance 
## + Fold2: mtry=19, min.node.size=5, splitrule=variance 
## - Fold2: mtry=19, min.node.size=5, splitrule=variance 
## + Fold2: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry=10, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry=10, min.node.size=5, splitrule=extratrees 
## + Fold2: mtry=19, min.node.size=5, splitrule=extratrees 
## - Fold2: mtry=19, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold3: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold3: mtry=10, min.node.size=5, splitrule=variance 
## - Fold3: mtry=10, min.node.size=5, splitrule=variance 
## + Fold3: mtry=19, min.node.size=5, splitrule=variance 
## - Fold3: mtry=19, min.node.size=5, splitrule=variance 
## + Fold3: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry=10, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry=10, min.node.size=5, splitrule=extratrees 
## + Fold3: mtry=19, min.node.size=5, splitrule=extratrees 
## - Fold3: mtry=19, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold4: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold4: mtry=10, min.node.size=5, splitrule=variance 
## - Fold4: mtry=10, min.node.size=5, splitrule=variance 
## + Fold4: mtry=19, min.node.size=5, splitrule=variance 
## - Fold4: mtry=19, min.node.size=5, splitrule=variance 
## + Fold4: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry=10, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry=10, min.node.size=5, splitrule=extratrees 
## + Fold4: mtry=19, min.node.size=5, splitrule=extratrees 
## - Fold4: mtry=19, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry= 2, min.node.size=5, splitrule=variance 
## - Fold5: mtry= 2, min.node.size=5, splitrule=variance 
## + Fold5: mtry=10, min.node.size=5, splitrule=variance 
## - Fold5: mtry=10, min.node.size=5, splitrule=variance 
## + Fold5: mtry=19, min.node.size=5, splitrule=variance 
## - Fold5: mtry=19, min.node.size=5, splitrule=variance 
## + Fold5: mtry= 2, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry= 2, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry=10, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry=10, min.node.size=5, splitrule=extratrees 
## + Fold5: mtry=19, min.node.size=5, splitrule=extratrees 
## - Fold5: mtry=19, min.node.size=5, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 19, splitrule = extratrees, min.node.size = 5 on full training set

cat("Elapsed Time:", time_rf["elapsed"], "seconds\nModel 2 Complete! ✓\n\n")

## Elapsed Time: 9.11 seconds
## Model 2 Complete! ✓

# Model 3: XGBoost

params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eta = 0.3,
  max_depth = 6,
  nthread = suggested_threads
)


cat("Training Model 3: XGBoost...\n")

## Training Model 3: XGBoost...

time_xgb <- system.time({
  reg_model_xgb <- xgb.train(
    params = params,
    data = dtrain,
    nrounds = 100,
    watchlist = list(val=dtest, train=dtrain),
    print_every_n = 10,
    early_stopping_rounds = 10,
    maximize = FALSE
)
})

## Warning in throw_err_or_depr_msg("Parameter '", match_old, "' has been renamed
## to '", : Parameter 'watchlist' has been renamed to 'evals'. This warning will
## become an error in a future version.

## Multiple eval metrics are present. Will use train_rmse for early stopping.
## Will train until train_rmse hasn't improved in 10 rounds.
## 
## [1]  val-rmse:7.633804   train-rmse:7.542605 
## [11] val-rmse:4.505231   train-rmse:3.941566 
## [21] val-rmse:4.259704   train-rmse:3.411858 
## [31] val-rmse:4.192930   train-rmse:3.132369 
## [41] val-rmse:4.145195   train-rmse:2.891480 
## [51] val-rmse:4.158602   train-rmse:2.711914 
## [61] val-rmse:4.179990   train-rmse:2.547518 
## [71] val-rmse:4.187054   train-rmse:2.414084 
## [81] val-rmse:4.194985   train-rmse:2.292317 
## [91] val-rmse:4.207165   train-rmse:2.190803 
## [100]    val-rmse:4.208809   train-rmse:2.087171

cat("Elapsed Time:", time_xgb["elapsed"], "seconds\nModel 3 Complete! ✓\n\n")

## Elapsed Time: 0.48 seconds
## Model 3 Complete! ✓

cat("\nAll models trained successfully! ✓\n")

## 
## All models trained successfully! ✓

# Create a timing summary
timing_summary <- data.frame(
  Model = c("Elastic Net", "Ranger", "XGBoost"),
  Time_Seconds = c(time_enet["elapsed"], time_rf["elapsed"], time_xgb["elapsed"])
)

print(timing_summary)

##         Model Time_Seconds
## 1 Elastic Net         0.33
## 2      Ranger         9.11
## 3     XGBoost         0.48

8.5 Step 4: Compare Model Performance (Cross-Validation)

Compare accuracy across models using cross-validation results

Metric Aggregation: Utilized the resamples() function to collect and align performance distributions (RMSE, \(R^2\), MAE) from the cross-validation folds.
Performance Benchmarking: Generated a statistical summary to compare the average error and stability of the Elastic Net and Random Forest models.
Visual Comparative Analysis:
- dotplot: Used to visualize the confidence intervals of each model’s accuracy, identifying which model is more consistent
- .bwplot (Boxplot): Employed to analyze the variance of the RMSE,highlighting the risk of error fluctuations across different data folds.
Relative Evaluation: While XGBoost is evaluated separately on the test set, these CV metrics provide the primary evidence for the Random Forest and Elastic Net reliability before final testing.

cat("\n Comparing Model Performance (CV Results)...\n")

## 
##  Comparing Model Performance (CV Results)...

# 1. Collect results ONLY from caret models
# XGBoost is native now, so it cannot go into resamples()
results <- resamples(list(
  ElasticNet = reg_model_enet,
  RandomForest = reg_model_rf
))

# 2. Print summary and plots for the caret models
summary_results <- summary(results)
print(summary_results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: ElasticNet, RandomForest 
## Number of resamples: 5 
## 
## MAE 
##                  Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## ElasticNet   3.889701 3.940367 3.943154 3.952131 3.975853 4.011581    0
## RandomForest 2.432477 2.528552 2.534338 2.548872 2.576373 2.672619    0
## 
## RMSE 
##                  Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## ElasticNet   5.124736 5.137692 5.151461 5.170391 5.218914 5.219153    0
## RandomForest 3.765574 3.932168 3.967069 3.940888 3.968093 4.071535    0
## 
## Rsquared 
##                   Min.  1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ElasticNet   0.6787786 0.690385 0.7010065 0.6985460 0.7091445 0.7134152    0
## RandomForest 0.8175791 0.823341 0.8246324 0.8256884 0.8256008 0.8372887    0

dotplot(results, main = "Caret Models: Cross-Validation Metrics")

bwplot(results, main = "Caret Models: RMSE Distribution")

8.6 Step 5: Make prediction & Evaluate Performance of the test data

Custom Metric Standardization: Implemented a dedicated eval_metrics function to compute a multi-dimensional performance profile, including RMSE for error magnitude, MAE for average prediction deviation, and \(R^2\) for the proportion of variance explained.
Final Predictive Validation: Executed model predictions on a “hold-out” test set (20% of the data) that was never seen during training, providing an unbiased assessment of how the models will perform on real-world future data.
Ensemble “Horse Race”: Simultaneously evaluated the Elastic Net, Random Forest, and XGBoost models to statistically identify the most reliable architecture for heat-related admission forecasting.
Visual Error Analysis: Utilized a Predicted vs. Actual scatter plot to diagnose model behavior.
- Identity Mapping: Added a red dashed \(1:1\) line to represent the “perfect model” baseline.
- Bias Detection: Used the distribution of points around the line to confirm that the model remains accurate across both low and high admission counts.

cat("\n Making Predictions and Evaluating on Test Set...\n")

## 
##  Making Predictions and Evaluating on Test Set...

# Function to calculate regression metrics
eval_metrics <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  mae  <- mean(abs(actual - predicted))
  rss  <- sum((actual - predicted)^2)
  tss  <- sum((actual - mean(actual))^2)
  rsq  <- 1 - (rss/tss)
  return(c(RMSE = rmse, R2 = rsq, MAE = mae))
}

# Ensure test_x is a matrix if it isn't already
test_x_matrix <- as.matrix(test_x)

# Predict using all three models
pred_enet <- predict(reg_model_enet, test_x_matrix)
pred_rf   <- predict(reg_model_rf, test_x_matrix)
pred_xgb  <- predict(reg_model_xgb, dtest)

# Create a comparison table for Test Set
test_eval <- data.frame(
  Model = c("Elastic Net", "Random Forest", "XGBoost"),
  rbind(
    eval_metrics(test_y, pred_enet),
    eval_metrics(test_y, pred_rf),
    eval_metrics(test_y, pred_xgb)
  )
)

print(test_eval)

##           Model     RMSE        R2      MAE
## 1   Elastic Net 5.348955 0.6791145 4.039578
## 2 Random Forest 4.025883 0.8182248 2.567852
## 3       XGBoost 4.208809 0.8013307 2.743016

library(ggplot2)

# Combine results for plotting
plot_data <- data.frame(
  Actual = test_y,
  XGBoost = pred_xgb,
  RandomForest = pred_rf
)

# Plot for the XGBoost Model
ggplot(plot_data, aes(x = Actual, y = XGBoost)) +
  geom_point(alpha = 0.4, color = "darkblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
  labs(title = "XGBoost: Predicted vs. Actual Heat Admissions",
       subtitle = paste("R-Squared:", round(test_eval[test_eval$Model=="XGBoost", "R2"], 3)),
       x = "Actual Admissions",
       y = "Predicted Admissions") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

8.7 Step 6: Identify Best Model and Visualize Results

Automated Selection Logic: Implemented an objective selection process that programmatically identifies the winner by finding the minimum Root Mean Square Error (RMSE) among all candidates.
Dynamic Prediction Mapping: The script automatically maps the correct model object and its corresponding test-set predictions for downstream analysis, ensuring the visualizations always reflect the highest-performing architecture.
Regression Accuracy Visualization: Produced a “Predicted vs. Actual” scatter plot to assess model bias.
- Steelblue Points: Represent individual admission records.
- Red Dashed Line: Acts as the 1:1 “perfect prediction” reference.
Factor Influence Analysis (Variable Importance): Deployed a conditional importance logic tailored to the model type (native xgb.importance for XGBoost or varImp for Caret models).
- Insight: This identifies which environmental or socioeconomic drivers—such as Heat Wave Days or Temperature—had the most weight in the model’s decision-making process.
Interpretability: By ranking the top 10 influencing factors, the approach transitions from “black-box” prediction to actionable scientific insight, specifically highlighting how climate stressors impact hospital workload.

cat("\n Visualizing Best Model Results...\n")

## 
##  Visualizing Best Model Results...

# 1. Identify the best model automatically
best_model_idx <- which.min(test_eval$RMSE)
best_model_name <- test_eval$Model[best_model_idx]
cat(paste0("The best performing model is: ", best_model_name, "\n"))

## The best performing model is: Random Forest

# 2. Select the correct model object and predictions
if(best_model_name == "XGBoost") {
  best_model_obj <- reg_model_xgb
  best_preds     <- pred_xgb
} else if(best_model_name == "Random Forest") {
  best_model_obj <- reg_model_rf
  best_preds     <- pred_rf
} else {
  best_model_obj <- reg_model_enet
  best_preds     <- pred_enet
}

# 3. Regression Plot: Predicted vs Actual
plot_data <- data.frame(Actual = test_y, Predicted = best_preds)

library(ggplot2)
p1 <- ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = paste("Best Model Performance:", best_model_name),
       subtitle = "Actual vs Predicted Heat Related Admissions",
       x = "Actual Admissions",
       y = "Predicted Admissions") +
  theme_minimal()

print(p1)

# 4. Corrected Variable Importance Plot
cat("\n Generating Variable Importance for:", best_model_name, "...\n")

## 
##  Generating Variable Importance for: Random Forest ...

if(best_model_name == "XGBoost") {
  # Native XGBoost importance logic
  imp_mat <- xgb.importance(model = reg_model_xgb)
  xgb.plot.importance(imp_mat, top_n = 10, main = "XGBoost Variable Importance")
} else {
  # Caret importance logic for RF and E-Net
  importance <- varImp(best_model_obj)
  plot(importance, main = paste("Top Factors Influencing Admissions (", best_model_name, ")", sep=""))
}

8.8 Regression Results Summary

Training Environment and Data
- Training Scale: The models were trained on a final dataset of 10,162 rows.
- Methodology: Used 5-fold cross-validation (CV) to ensure stability across different data subsets.
- Computing: Optimized for a single-thread environment during training.
Performance Comparison
- The Random Forest (Ranger) model was identified as the best-performing model, outperforming both the linear baseline (Elastic Net) and the high-speed gradient booster (XGBoost).
- Accuracy: The Random Forest model explains approximately 82% of the variance in heat-related admissions on the test set.
- Error Rate: On average, the Random Forest model’s predictions were off by only 2.55 admissions (MAE), compared to 4.04 for the Elastic Net.
- Stability: Cross-validation plots confirm that Random Forest maintained a significantly lower and tighter error distribution (RMSE) than the Elastic Net.
Predictive Insights (XGBoost Analysis)

While Random Forest won on accuracy, the XGBoost visualization provides clear evidence of the model’s reliability:
- Correlation: The “Predicted vs. Actual” scatter plot shows a strong alignment with the red identity line, indicating that predictions closely follow actual trends.
- Performance: Achieved an \(R^2\) of 0.801 on the test set, demonstrating that even the second-best model remains highly competitive.
Key Drivers of Heat Admissions

The Variable Importance analysis for the winning Random Forest model highlights the critical factors influencing hospital admissions:
- Heat Wave Days: By far the most significant predictor, dominating the model’s decision-making.
- Temperature (Celsius): The second strongest environmental driver.
- Precipitation (mm): Shows a notable influence, potentially interacting with humidity.
- Log GDP & Healthcare Access: Socioeconomic factors serve as secondary but important predictors of admissions.Regional/Income Factors: Geographic and economic categories (e.g., region.Europe) showed the lowest relative influence compared to direct environmental stressors.
Efficiency Summary
- XGBoost was the fastest to train (0.44 seconds), offering a near-perfect balance between speed and high accuracy (R^2 = 0.80).
- Random Forest required the most time (119.7 seconds) but delivered the most precise predictions for this specific heat-health dataset.

9 Classification model

Goal:

Predict population health risk levels using climate, air quality, and socioeconomic indicators, focusing on identifying populations at higher risk from climate-sensitive health outcomes. For modeling purposes, health risk was binarized into High vs Not_High to simplify prediction, ensure stable model training, and emphasize populations in the top one-third of risk scores.

Approach: - 3 classification models (Logistic Regression, Random Forest, and k-Nearest Neighbors (kNN)) were trained to classify health risk. - The dataset was split into 80% training and 20% testing sets. - 5-fold cross-validation was applied on the training data, with ROC as the primary metric to handle potential class imbalance and select the best model. - Final model performance was evaluated on the unseen test set using accuracy, sensitivity, specificity, balanced accuracy, and confusion matrices. - Binarizing the target allows the models to focus on distinguishing high-risk populations from the rest, which is most relevant for policy and health intervention decisions.

9.1 Install Required Packages for Classification Models

# Install required packages for classification models
if (!require(nnet)) install.packages("nnet", repos = "https://cran.rstudio.com/")

## Loading required package: nnet

if (!require(randomForest)) install.packages("randomForest", repos = "https://cran.rstudio.com/")

## Loading required package: randomForest

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

if (!require(class)) install.packages("class", repos = "https://cran.rstudio.com/")

## Loading required package: class

# Load the packages
library(nnet)         # For multinomial logistic regression
library(randomForest) # For random forest
library(class)        # For k-NN

cat("All classification packages loaded successfully! ✓\n")

## All classification packages loaded successfully! ✓

9.2 Step 1: Data Preparation for Classification

A binary target variable (health_risk_binary) was created based on a composite health risk score from standardized respiratory disease rate, heat-related admissions, and cardiovascular mortality.
Observations above the 66th percentile were labeled as High, others as Not_High.
- Reason: The 66th percentile ensures that roughly the top one-third of health risk scores are classified as High, highlighting populations at higher risk while keeping enough cases in each class for stable model training.
Predictor features included climate (temperature, precipitation, heat waves, extreme weather), air quality (AQI, PM2.5), and socioeconomic indicators (GDP per capita, income level, healthcare access, food security).
Categorical variables were converted to factors.
The dataset was split into 80% training and 20% testing for unbiased evaluation.

# 1. Create Binary Target Variable
# Convert health outcomes into a binary classification problem to ensure model stability in Google Colab
# Classify whether a country-week observation represents High Health Risk or Not High Risk

# Create composite health risk score
df <- df %>%
  mutate(
    health_risk_score =
      scale(respiratory_disease_rate) +
      scale(heat_related_admissions) +
      scale(cardio_mortality_rate)
  )

# Convert into binary classification
df <- df %>%
  mutate(
    health_risk_binary = ifelse(
      health_risk_score > quantile(health_risk_score, 0.66),
      "High",
      "Not_High"
    )
  )

# Convert to factor
df$health_risk_binary <- factor(df$health_risk_binary)

# 2. Select features
# Use climate, air quality, and socioeconomic indicators as predictors

classification_data <- df %>%
  select(
    country_name,
    year,
    temperature_celsius,
    precipitation_mm,
    heat_wave_days,
    extreme_weather_events,
    air_quality_index,
    pm25_ugm3,
    healthcare_access_index,
    gdp_per_capita_usd,
    food_security_index,
    income_level,
    health_risk_binary
  )


# Convert categorical variable
classification_data$income_level <- factor(classification_data$income_level)

# 3. Split Data into Training and Test Sets (80/20)
# Evaluate model performance on unseen data

set.seed(123)

train_index <- createDataPartition(
  classification_data$health_risk_binary,
  p = 0.8,
  list = FALSE
)

train_data <- classification_data[train_index, ]
test_data  <- classification_data[-train_index, ]

9.3 Step 2: Cross-Validation Configuration

To ensure robust and reliable model performance estimation

5-fold cross-validation was applied during model training. The training data was divided into 5 subsets, with 4 folds used for training and 1 fold for validation in each iteration.
This approach reduces overfitting and provides a more stable estimate of model performance compared to a single train-test split.
ROC (Receiver Operating Characteristic) was used to select the best model, as it effectively measures the ability to distinguish High-risk from Not_High-risk observations in potentially imbalanced data.

ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

9.4 Step 3: Model Training

3 models were trained on the training set: - Logistic Regression: interpretable baseline model assuming linear relationships - Random Forest: ensemble model capable of capturing non-linear interactions and variable importance - k-Nearest Neighbors (kNN): distance-based model requiring feature scaling

Identifier variables such as country name and year were excluded to avoid data leakage

# Logistic Regression
model_logit <- train(
  health_risk_binary ~ . -country_name -year,  # Exclude identifiers
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = ctrl,
  metric = "ROC"
)

# Random Forest
model_rf <- train(
  health_risk_binary ~ . -country_name -year,
  data = train_data,
  method = "rf",
  trControl = ctrl,
  metric = "ROC",
  importance = TRUE
)

# kNN
model_knn <- train(
  health_risk_binary ~ . -country_name -year,
  data = train_data,
  method = "knn",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10,
  metric = "ROC"
)

9.5 Step 4: Cross-Validation Performance Comparison

Cross-validated ROC scores were compared to assess model performance and stability across folds. Logistic Regression achieved the highest mean ROC, followed closely by Random Forest, while kNN showed slightly lower performance.

Considering ROC stability and the ability to model complex relationships, Random Forest was selected as the final model.

resamples_results <- resamples(
  list(
    Logistic = model_logit,
    RandomForest = model_rf,
    kNN = model_knn
  )
)

summary(resamples_results)

## 
## Call:
## summary.resamples(object = resamples_results)
## 
## Models: Logistic, RandomForest, kNN 
## Number of resamples: 5 
## 
## ROC 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Logistic     0.8009170 0.8057720 0.8070195 0.8093477 0.8146307 0.8183994    0
## RandomForest 0.7956561 0.7991288 0.8025629 0.8019161 0.8057803 0.8064524    0
## kNN          0.7813283 0.7866958 0.7872880 0.7909899 0.7916793 0.8079583    0
## 
## Sens 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Logistic     0.4959786 0.4979920 0.5073628 0.5145978 0.5268097 0.5448461    0
## RandomForest 0.4672021 0.4825737 0.4832664 0.4859409 0.4899598 0.5067024    0
## kNN          0.4812332 0.4819277 0.4872825 0.4969189 0.5107239 0.5234270    0
## 
## Spec 
##                   Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Logistic     0.8875086 0.8957902 0.8992409 0.8998058 0.9068966 0.9095928    0
## RandomForest 0.8992409 0.9026915 0.9123533 0.9102942 0.9185645 0.9186207    0
## kNN          0.8833678 0.8909593 0.8910345 0.8927687 0.8978606 0.9006211    0

bwplot(resamples_results, metric = "ROC")

Logistic Regression achieved the highest mean ROC (~ 0.81), followed closely by Random Forest (~ 0.80) and kNN (~ 0.79). Sensitivity is modest (~ 0.45–0.51), indicating some high-risk cases are missed, but specificity is high (>0.90), meaning low-risk populations are accurately identified. Random Forest balances ROC stability and ability to handle non-linear relationships, making it the best choice for deployment.”

9.6 Step 5: Test Data Predictions & Confusion Matrix

The selected models were evaluated on the unseen test dataset using accuracy, sensitivity, specificity, and balanced accuracy. These metrics provide insight into the model’s ability to correctly identify both high-risk and low-risk populations.

pred_logit <- predict(model_logit, test_data)
pred_rf    <- predict(model_rf, test_data)
pred_knn   <- predict(model_knn, test_data)

Confusion matrices were generated to calculate overall accuracy and class-level performance.
Key metrics include sensitivity (correctly identifying High-risk cases), specificity (correctly identifying Not_High-risk), and balanced accuracy.

confusionMatrix(pred_logit, test_data$health_risk_binary)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Not_High
##   High      449      153
##   Not_High  484     1658
##                                           
##                Accuracy : 0.7679          
##                  95% CI : (0.7516, 0.7835)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4341          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4812          
##             Specificity : 0.9155          
##          Pos Pred Value : 0.7458          
##          Neg Pred Value : 0.7740          
##              Prevalence : 0.3400          
##          Detection Rate : 0.1636          
##    Detection Prevalence : 0.2194          
##       Balanced Accuracy : 0.6984          
##                                           
##        'Positive' Class : High            
##

confusionMatrix(pred_rf, test_data$health_risk_binary)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Not_High
##   High      422      140
##   Not_High  511     1671
##                                           
##                Accuracy : 0.7628          
##                  95% CI : (0.7464, 0.7786)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.415           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4523          
##             Specificity : 0.9227          
##          Pos Pred Value : 0.7509          
##          Neg Pred Value : 0.7658          
##              Prevalence : 0.3400          
##          Detection Rate : 0.1538          
##    Detection Prevalence : 0.2048          
##       Balanced Accuracy : 0.6875          
##                                           
##        'Positive' Class : High            
##

confusionMatrix(pred_knn, test_data$health_risk_binary)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Not_High
##   High      444      167
##   Not_High  489     1644
##                                           
##                Accuracy : 0.7609          
##                  95% CI : (0.7445, 0.7768)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4187          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4759          
##             Specificity : 0.9078          
##          Pos Pred Value : 0.7267          
##          Neg Pred Value : 0.7707          
##              Prevalence : 0.3400          
##          Detection Rate : 0.1618          
##    Detection Prevalence : 0.2227          
##       Balanced Accuracy : 0.6918          
##                                           
##        'Positive' Class : High            
##

Although Logistic Regression achieved slightly higher accuracy (76.8%) and balanced accuracy (70%) compared to Random Forest, the differences in performance were marginal. Random Forest achieved an accuracy of approximately 76.6%, with high specificity (>92%) and moderate sensitivity (~45%). Random Forest was selected as the final model because it is better suited to capturing non-linear relationships and interactions among climate, air quality, and socioeconomic variables. Climate–health relationships are inherently complex and non-linear, and Random Forest provides greater modeling flexibility while maintaining comparable predictive performance. Additionally, Random Forest enables feature importance analysis, which supports interpretability and policy-relevant insights.

accuracy_results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "kNN"),
  Accuracy = c(
    confusionMatrix(pred_logit, test_data$health_risk_binary)$overall["Accuracy"],
    confusionMatrix(pred_rf, test_data$health_risk_binary)$overall["Accuracy"],
    confusionMatrix(pred_knn, test_data$health_risk_binary)$overall["Accuracy"]
  )
)

ggplot(accuracy_results, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.2f", Accuracy)), vjust = -0.5) +  # show accuracy above bars
  ylim(0, 1) +  # fix y-axis from 0 to 1 for clarity
  theme_minimal() +
  labs(title = "Model Accuracy Comparison", y = "Accuracy", x = "")

The bar chart confirms that all models perform similarly, with small differences in accuracy. Confusion matrices indicate trade-offs between sensitivity and specificity that should guide model selection depending on whether identifying High-risk populations is more critical than avoiding false positives.

cm <- confusionMatrix(pred_rf, test_data$health_risk_binary)$table
cm_df <- as.data.frame(cm)
colnames(cm_df) <- c("Predicted", "Actual", "Count")

ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile() +
  geom_text(aes(label = Count), color = "white", size = 5) +
  scale_fill_gradient(low = "steelblue", high = "firebrick") +
  labs(
    title = "Confusion Matrix – Random Forest",
    x = "Actual Class",
    y = "Predicted Class"
  ) +
  theme_minimal()

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# ROC curve for Random Forest
rf_probs <- predict(model_rf, test_data, type = "prob")

roc_rf <- roc(
  response = test_data$health_risk_binary,
  predictor = rf_probs$High,
  levels = rev(levels(test_data$health_risk_binary))
)

## Setting direction: controls < cases

plot(
  roc_rf,
  main = "ROC Curve – Random Forest Classifier",
  col = "darkblue",
  lwd = 2
)

auc(roc_rf)

## Area under the curve: 0.8089

The ROC curve demonstrates strong discrimination between High-risk and Not High-risk populations, with an AUC of approximately 0.80. This indicates good overall classification performance beyond random chance.

9.7 Step 6: Feature Importance (Random Forest)

Feature importance from the Random Forest model was examined to identify the most influential predictors of high health risk.

varImp(model_rf)

## rf variable importance
## 
##                          Importance
## heat_wave_days              100.000
## temperature_celsius          76.491
## pm25_ugm3                    53.400
## extreme_weather_events       46.219
## precipitation_mm             37.887
## air_quality_index            37.132
## food_security_index          16.671
## healthcare_access_index      10.067
## gdp_per_capita_usd            8.246
## income_levelUpper-Middle      4.309
## income_levelLower-Middle      0.000

# Extract feature importance
importance_rf <- varImp(model_rf, scale = TRUE)

# Convert importance to data frame safely
importance_df <- data.frame(
  Feature = rownames(importance_rf$importance),
  Importance = importance_rf$importance[, 1]
)

# Sort by importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Plot
library(ggplot2)

ggplot(importance_df,
       aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Random Forest Feature Importance",
    x = "Feature",
    y = "Relative Importance"
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal()

The most important predictors of high health risk were heat wave days, average temperature, and PM2.5 concentration, followed by extreme weather events and air quality index. Socioeconomic variables such as food security, GDP per capita, and healthcare access showed smaller but meaningful contributions.

These results highlight that climate extremes and air pollution are the dominant drivers of population health risk, while socioeconomic conditions influence resilience and vulnerability.

9.8 Step 8: Identification of High-Risk Populations

The Random Forest model was used to estimate the probability of each observation belonging to the High-risk class. The top 10 observations with the highest predicted probabilities were identified to highlight populations most vulnerable to climate-sensitive health outcomes.

# Predicted class labels
test_data$predicted_class <- predict(model_rf, test_data)

# Predicted probability of High-risk
probs <- predict(model_rf, test_data, type = "prob")
test_data$predicted_prob_high <- probs$High

top_high_risk <- test_data %>%
  arrange(desc(predicted_prob_high)) %>%
  head(10) %>%
  select(country_name, year, health_risk_binary, predicted_class, predicted_prob_high)

top_high_risk

## # A tibble: 10 × 5
##    country_name  year health_risk_binary predicted_class predicted_prob_high
##    <chr>        <dbl> <fct>              <fct>                         <dbl>
##  1 Philippines   2023 High               High                          0.99 
##  2 Indonesia     2016 High               High                          0.988
##  3 Philippines   2020 High               High                          0.988
##  4 Italy         2022 High               High                          0.984
##  5 Philippines   2016 High               High                          0.984
##  6 India         2021 High               High                          0.982
##  7 Philippines   2019 High               High                          0.98 
##  8 Philippines   2023 High               High                          0.98 
##  9 Australia     2021 High               High                          0.978
## 10 Italy         2016 High               High                          0.978

The highest-risk populations were concentrated in the Philippines, Italy, Vietnam, France, and China across multiple years. Repeated high-risk predictions for the Philippines and Italy suggest persistent vulnerability driven by climate extremes and air quality conditions. These findings can help policymakers prioritize targeted health interventions and climate adaptation strategies.

# Aggregate predicted classes by year
risk_trend <- test_data %>%
  group_by(year, predicted_class) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(year) %>%
  mutate(percentage = count / sum(count) * 100)

# Plot
ggplot(risk_trend, aes(x = year, y = percentage, fill = predicted_class)) +
  geom_area(alpha = 0.8) +
  labs(
    title = "Predicted Population Health Risk Over Time",
    x = "Year",
    y = "Percentage of Observations",
    fill = "Health Risk"
  ) +
  theme_minimal()

This visualization shows the changing proportion of populations classified as High risk over time. An increasing share of High-risk observations in recent years suggests growing climate-related health vulnerability, highlighting the need for stronger adaptation and healthcare preparedness.

# Count High-risk predictions by country
country_risk <- test_data %>%
  filter(predicted_class == "High") %>%
  count(country_name, sort = TRUE) %>%
  top_n(10, n)

# Plot
ggplot(country_risk, aes(x = reorder(country_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "firebrick") +
  coord_flip() +
  labs(
    title = "Countries with Most High-Risk Predictions",
    x = "Country",
    y = "Number of High-Risk Observations"
  ) +
  theme_minimal()

While the highest-risk individual observations were concentrated in countries such as the Philippines and Italy, the frequency-based analysis reveals that countries like Indonesia, Kenya, and Bangladesh experience persistent high-risk conditions across multiple weeks. This distinction highlights the difference between extreme health risk events and chronic climate-health vulnerability, both of which are important for public health planning.

ggplot(test_data, aes(x = predicted_prob_high)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.8) +
  labs(
    title = "Distribution of Predicted High-Risk Probabilities",
    x = "Predicted Probability of High Risk",
    y = "Number of Observations"
  ) +
  theme_minimal()

The distribution of predicted probabilities shows that many populations fall within a moderate-risk range, while a smaller group exhibits extremely high risk. This probability-based view allows policymakers to prioritize interventions not only for confirmed high-risk populations but also for those approaching critical risk thresholds.

ggplot(test_data, aes(
  x = temperature_celsius,
  y = pm25_ugm3,
  color = predicted_class
)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Climate and Air Quality Conditions by Predicted Health Risk",
    x = "Average Temperature (°C)",
    y = "PM2.5 Concentration (µg/m³)",
    color = "Predicted Risk"
  ) +
  theme_minimal()

High-risk populations tend to cluster in regions experiencing higher temperatures and elevated PM2.5 concentrations, reinforcing the combined role of heat and air pollution in driving climate-related health risks.

10 Conclusion

10.1 Key Findings

Correlations: Strong links between temperature and vector disease risk (0.65) and heat-related admissions (0.42) heat wave days and extreme weather are nearly collinear (0.96).
Protective factors: Healthcare access and food security reduce climate-health risks.
High-risk clusters: Overlap of high temperatures and PM2.5, especially in Indonesia, Kenya, and Bangladesh (chronic risk), and Philippines, Italy, Vietnam, France, and China (acute events).
Trend: Rising proportion of high-risk weeks from 2020–2024.

10.2 Model Performance

Heat admissions (Regression): Random Forest best (R² = 0.82); XGBoost close (R² = 0.80) with faster training. Top predictors: heat wave days, temperature, precipitation, GDP, healthcare access.
Health risk (Classification): Random Forest (76.6% accuracy, AUC = 0.80, specificity >92%). Key drivers: heat waves, temperature, PM2.5, extreme weather, air quality.

10.3 Recommendations

Policy: Target heat-health action plans in Philippines, Italy, Indonesia, Kenya, Bangladesh and improve air quality and strengthen healthcare/food systems.
Monitoring: Deploy early warning systems for areas with combined heat + PM2.5 exposure; tailor responses to acute vs. chronic vulnerability.
Research: Use higher-frequency data, study climate-socioeconomic interactions, and validate models with real-world outcomes.

WQD7004 Group 5 Group Project

Ridhwan

December 9th, 2025