Ridhwan
December 9th, 2025
WQD7004 Group 5
1. Ridhwan Bin Razali S2178595
2. Nurin Miza Afiqah Binti Andrie Dazlee S2154141
3. Maisie Chiara Salsabila 25062171
4. Amir Arifudeen Bin Mohd Fadzir 24232491
Climate change represents one of the most pressing global challenges of our time, with far-reaching implications for public health systems worldwide. The intersection of environmental conditions and human health outcomes has become increasingly critical as extreme weather events, air quality degradation, and temperature anomalies intensify across regions. However, comprehensive datasets linking climate indicators to health metrics across diverse geographic and socioeconomic contexts remain limited.
To address this gap, a climate-health dataset has been generated tracking 25 countries across 8 regions from 2015 to 2025. This dataset integrates climate indicators (temperature, precipitation, extreme weather events), air quality measurements (PM2.5, AQI), health outcomes (respiratory diseases, cardiovascular mortality, vector-borne disease risks), and socioeconomic factors (healthcare access, GDP per capita, mental health indices). Comprising 30 key attributes and 13,723 weekly records, it provides researchers and data scientists with a robust foundation for exploring climate-health relationships, developing predictive models, and informing evidence-based policy decisions.
The primary objectives of working with this climate-health dataset are:
This study/project seeks to address the following key research questions:
The dataset comprehensively tracks climate-health relationships across 25 countries from January 2015 to October 2025, providing weekly granularity for time-series analysis and regional comparisons.
Key Characteristics:
Column Details:
Geographic Dimensions (8 columns): -
record_id (character): Unique identifier - e.g.,
“REC-USA-2015-01” - country_code (character): ISO country
code - e.g., “USA”, “CHN”, “DEU” - country_name
(character): Country name - e.g., “United States”, “China”, “Germany” -
region (character): Geographic region - e.g., “North
America”, “East Asia”, “Europe” - income_level (character):
Economic classification - e.g., “High”, “Upper-Middle”, “Lower-Middle” -
latitude (numeric): Geographic latitude - e.g., 38.9072
(degrees) - longitude (numeric): Geographic longitude -
e.g., -77.0369 (degrees) - population_millions (numeric):
Population in millions - e.g., 331.9
Temporal Dimensions (4 columns): - date
(Date): Weekly observation date - e.g., 2015-01-05 - year
(numeric): Calendar year - e.g., 2015, 2020, 2025 - month
(numeric): Month (1-12) - e.g., 1 (January), 12 (December) -
week (numeric): ISO week number (1-53) - e.g., 1, 26,
52
Climate Indicators (7 columns): -
temperature_celsius (numeric): Average weekly temperature
(°C) - e.g., 15.2, 28.7 - temp_anomaly_celsius (numeric):
Temperature deviation from baseline (°C) - e.g., +1.3, -0.8 -
precipitation_mm (numeric): Weekly precipitation
(millimeters) - e.g., 25.4, 120.3 - heat_wave_days
(numeric): Days with extreme heat (0-7) - e.g., 0, 3, 5 -
drought_indicator (numeric): Drought severity index (0-3) -
e.g., 0 (none), 2 (moderate) - flood_indicator (numeric):
Flood occurrence indicator (0-1) - e.g., 0 (no), 1 (yes) -
extreme_weather_events (numeric): Count of extreme events
per week - e.g., 0, 2, 4
Air Quality (2 columns): - pm25_ugm3
(numeric): PM2.5 concentration (μg/m³) - e.g., 12.5, 85.3 -
air_quality_index (numeric): EPA Air Quality Index (0-500)
- e.g., 42 (Good), 187 (Unhealthy)
Health Outcomes (5 columns): -
respiratory_disease_rate (numeric): Cases per 100,000
population - e.g., 45.2, 128.7 - cardio_mortality_rate
(numeric): Deaths per 100,000 population - e.g., 8.3, 23.1 -
vector_disease_risk_score (numeric): Disease risk index
(0-100) - e.g., 15.6, 67.4 - waterborne_disease_incidents
(numeric): Weekly incident count - e.g., 2, 15, 37 -
heat_related_admissions (numeric): Hospital admissions per
week - e.g., 5, 42, 89
Socioeconomic & Wellbeing Indicators (4
columns): - healthcare_access_index (numeric):
Population access percentage (0-100%) - e.g., 75.3, 95.8 -
gdp_per_capita_usd (numeric): GDP per capita (USD) - e.g.,
15,234, 65,789 - mental_health_index (numeric): Mental
wellbeing score (0-100) - e.g., 68.2, 82.5 -
food_security_index (numeric): Food security score (0-100)
- e.g., 72.1, 91.3
# # Install required packages if not already installed
# if (!require(caret)) install.packages("caret", repos = "https://cran.rstudio.com/")
# if (!require(dplyr)) install.packages("dplyr", repos = "https://cran.rstudio.com/")
# if (!require(ggplot2)) install.packages("ggplot2", repos = "https://cran.rstudio.com/")
# if (!require(MLmetrics)) install.packages("MLmetrics", repos = "https://cran.rstudio.com/")
# if (!require(readr)) install.packages("readr", repos = "https://cran.rstudio.com/")
# if (!require(repr)) install.packages("repr", repos = "https://cran.rstudio.com/")
# if (!require(reshape2)) install.packages("reshape2", repos = "https://cran.rstudio.com/")
# if (!require(ROSE)) install.packages("ROSE", repos = "https://cran.rstudio.com/")
# if (!require(lubridate)) install.packages("lubridate", repos = "https://cran.rstudio.com/")
# if (!require(e1071)) install.packages("e1071", repos = "https://cran.rstudio.com/")
# Load all required libraries
library(caret) # Classification and Regression Training
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr) # Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # Data visualization
library(MLmetrics) # Machine Learning metrics
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(readr) # Fast data reading
library(repr) # Data representation
library(reshape2) # Data reshaping
library(ROSE) # Random Over-Sampling Examples
## Loaded ROSE 0.0-4
library(lubridate) # Date manipulation
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(e1071) # Statistical functions (skewness)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
##
## element
cat("All libraries loaded successfully! ✓")
## All libraries loaded successfully! ✓
# Read the CSV of the dataset
# Google Drive URL converted to direct download format for Colab compatibility
csv_url <- "https://drive.google.com/uc?id=14kzQfJ7L_7NWBWbvX8RWBuQbAe93jPTy&export=download"
df <- read_csv(csv_url)
## Rows: 14100 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): country_code, country_name, region, income_level
## dbl (25): record_id, year, month, week, latitude, longitude, population_mil...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Store original dimensions before cleaning
original_dim <- dim(df)
str(df)
## spc_tbl_ [14,100 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ record_id : num [1:14100] 1 2 3 4 5 6 7 8 9 10 ...
## $ country_code : chr [1:14100] "USA" "USA" "USA" "USA" ...
## $ country_name : chr [1:14100] "United States" "United States" "United States" "United States" ...
## $ region : chr [1:14100] "North America" "North America" "North America" "North America" ...
## $ income_level : chr [1:14100] "High" "High" "High" "High" ...
## $ date : Date[1:14100], format: "2015-01-04" "2015-01-11" ...
## $ year : num [1:14100] 2015 2015 2015 2015 2015 ...
## $ month : num [1:14100] 1 1 1 1 2 2 2 2 3 3 ...
## $ week : num [1:14100] 1 2 3 4 5 6 7 8 9 10 ...
## $ latitude : num [1:14100] 37.1 37.1 37.1 37.1 37.1 ...
## $ longitude : num [1:14100] -95.7 -95.7 -95.7 -95.7 -95.7 ...
## $ population_millions : num [1:14100] 331 331 331 331 331 331 331 331 331 331 ...
## $ temperature_celsius : num [1:14100] 4.59 3.13 3.99 6.43 9 6.44 6.86 8.34 9.58 8.25 ...
## $ temp_anomaly_celsius : num [1:14100] 0.76 -0.5 -0.14 -0.06 0.47 0.18 0.38 0.64 -0.83 0.71 ...
## $ precipitation_mm : num [1:14100] 75.7 97 74.1 87.7 75.8 ...
## $ heat_wave_days : num [1:14100] 0 0 0 0 1 0 0 1 0 1 ...
## $ drought_indicator : num [1:14100] 0 0 0 0 0 0 0 0 0 0 ...
## $ flood_indicator : num [1:14100] 0 0 0 0 0 0 0 0 0 0 ...
## $ extreme_weather_events : num [1:14100] 0 0 0 0 1 0 0 1 0 1 ...
## $ pm25_ugm3 : num [1:14100] 39 17.9 91.5 5.5 37.1 ...
## $ air_quality_index : num [1:14100] 82 6 137 -3 48 157 51 5 29 22 ...
## $ respiratory_disease_rate : num [1:14100] 69.4 70 66.9 47 61.3 80.2 70.3 48 80.3 59.1 ...
## $ cardio_mortality_rate : num [1:14100] 31.5 26.3 33.4 35 28.3 30.6 33.2 33.3 24.9 32.5 ...
## $ vector_disease_risk_score : num [1:14100] 6.6 5.2 1.3 6 1.4 0.7 5.9 6.4 5.9 0.3 ...
## $ waterborne_disease_incidents: num [1:14100] 16.2 11.4 19.5 9.7 22.6 23.7 25.4 17.7 18.3 18.8 ...
## $ heat_related_admissions : num [1:14100] 1.4 0 0 9 27.3 11.9 9.1 20.3 7.8 13.7 ...
## $ healthcare_access_index : num [1:14100] 77.3 83.6 84.7 84.3 83.6 78 81.7 86.7 84.9 87.1 ...
## $ gdp_per_capita_usd : num [1:14100] 63627 63627 63627 63627 63733 ...
## $ mental_health_index : num [1:14100] 71.2 70.6 63.4 68.1 69.1 70.1 68.5 59.4 75.7 71.9 ...
## $ food_security_index : num [1:14100] 90.2 94 100 96.4 100 97.1 100 95 100 95.2 ...
## - attr(*, "spec")=
## .. cols(
## .. record_id = col_double(),
## .. country_code = col_character(),
## .. country_name = col_character(),
## .. region = col_character(),
## .. income_level = col_character(),
## .. date = col_date(format = ""),
## .. year = col_double(),
## .. month = col_double(),
## .. week = col_double(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. population_millions = col_double(),
## .. temperature_celsius = col_double(),
## .. temp_anomaly_celsius = col_double(),
## .. precipitation_mm = col_double(),
## .. heat_wave_days = col_double(),
## .. drought_indicator = col_double(),
## .. flood_indicator = col_double(),
## .. extreme_weather_events = col_double(),
## .. pm25_ugm3 = col_double(),
## .. air_quality_index = col_double(),
## .. respiratory_disease_rate = col_double(),
## .. cardio_mortality_rate = col_double(),
## .. vector_disease_risk_score = col_double(),
## .. waterborne_disease_incidents = col_double(),
## .. heat_related_admissions = col_double(),
## .. healthcare_access_index = col_double(),
## .. gdp_per_capita_usd = col_double(),
## .. mental_health_index = col_double(),
## .. food_security_index = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Check the shape of the dataset
dim(df)
## [1] 14100 30
sapply(df, function(x) sum(is.na(x)))
## record_id country_code
## 0 0
## country_name region
## 0 0
## income_level date
## 0 0
## year month
## 0 0
## week latitude
## 0 0
## longitude population_millions
## 0 0
## temperature_celsius temp_anomaly_celsius
## 0 0
## precipitation_mm heat_wave_days
## 0 0
## drought_indicator flood_indicator
## 0 0
## extreme_weather_events pm25_ugm3
## 0 0
## air_quality_index respiratory_disease_rate
## 0 0
## cardio_mortality_rate vector_disease_risk_score
## 0 0
## waterborne_disease_incidents heat_related_admissions
## 0 0
## healthcare_access_index gdp_per_capita_usd
## 0 0
## mental_health_index food_security_index
## 0 0
head(df)
## # A tibble: 6 × 30
## record_id country_code country_name region income_level date year month
## <dbl> <chr> <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 1 USA United Stat… North… High 2015-01-04 2015 1
## 2 2 USA United Stat… North… High 2015-01-11 2015 1
## 3 3 USA United Stat… North… High 2015-01-18 2015 1
## 4 4 USA United Stat… North… High 2015-01-25 2015 1
## 5 5 USA United Stat… North… High 2015-02-01 2015 2
## 6 6 USA United Stat… North… High 2015-02-08 2015 2
## # ℹ 22 more variables: week <dbl>, latitude <dbl>, longitude <dbl>,
## # population_millions <dbl>, temperature_celsius <dbl>,
## # temp_anomaly_celsius <dbl>, precipitation_mm <dbl>, heat_wave_days <dbl>,
## # drought_indicator <dbl>, flood_indicator <dbl>,
## # extreme_weather_events <dbl>, pm25_ugm3 <dbl>, air_quality_index <dbl>,
## # respiratory_disease_rate <dbl>, cardio_mortality_rate <dbl>,
## # vector_disease_risk_score <dbl>, waterborne_disease_incidents <dbl>, …
Here we check the duplicates by looking at the record_id and the country + date.
# Check duplicates by record_id (should be unique)
dup_records <- sum(duplicated(df$record_id))
cat('Duplicate record_ids:', dup_records, '\n')
## Duplicate record_ids: 0
total_duplicates_removed <- 0
# Store total for summary
# Check duplicates by country + date combination
if("country_name" %in% colnames(df) && "date" %in% colnames(df)) {}
## NULL
dup_country_date <- sum(duplicated(df[, c("country_name", "date")]))
cat('Duplicate country+date combinations:', dup_country_date, '\n')
## Duplicate country+date combinations: 0
No duplicates are found
# Calculate records per country per year (should be approximately 52-53 weeks)
if("country_name" %in% colnames(df) && "year" %in% colnames(df)) {
temporal_check <- df %>%
group_by(country_name, year) %>%
summarise(record_count = n(), .groups = 'drop') %>%
filter(record_count < 45 | record_count > 55)
temporal_flags_count <- nrow(temporal_check)
if(temporal_flags_count > 0) {
cat('\n=== ISSUES DETECTED ===\n')
cat('Country-year combinations with unusual record counts:\n')
print(temporal_check, n = Inf)
} else {
cat('Temporal validation passed - all country-years have ~52-53 records\n')
}
temporal_issues <- temporal_check
} else {
temporal_flags_count <- 0
temporal_issues <- data.frame()
cat('Cannot perform temporal validation - required columns not found\n')
}
##
## === ISSUES DETECTED ===
## Country-year combinations with unusual record counts:
## # A tibble: 25 × 3
## country_name year record_count
## <chr> <dbl> <int>
## 1 Argentina 2025 42
## 2 Australia 2025 42
## 3 Bangladesh 2025 42
## 4 Brazil 2025 42
## 5 Canada 2025 42
## 6 China 2025 42
## 7 Colombia 2025 42
## 8 Egypt 2025 42
## 9 France 2025 42
## 10 Germany 2025 42
## 11 India 2025 42
## 12 Indonesia 2025 42
## 13 Italy 2025 42
## 14 Japan 2025 42
## 15 Kenya 2025 42
## 16 Mexico 2025 42
## 17 Nigeria 2025 42
## 18 Pakistan 2025 42
## 19 Philippines 2025 42
## 20 South Africa 2025 42
## 21 Spain 2025 42
## 22 Thailand 2025 42
## 23 United Kingdom 2025 42
## 24 United States 2025 42
## 25 Vietnam 2025 42
A calendar year has approximately 52 weeks (365 days ÷ 7 days/week ≈ 52.14 weeks). Some years have 53 weeks depending on how the year starts and ends.
# Check income_level contains expected classifications
expected_income_levels <- c("Low", "Lower-Middle", "Upper-Middle", "High")
if("income_level" %in% colnames(df)) {
unique_income_levels <- unique(df$income_level)
cat('Income level categories found:\n')
print(unique_income_levels)
# Check for unexpected categories
unexpected_income_levels <- setdiff(unique_income_levels, expected_income_levels)
if(length(unexpected_income_levels) > 0) {
cat('\nUnexpected income level categories detected:\n')
print(unexpected_income_levels)
} else {
cat('\nAll income levels match expected classifications\n')
}
} else {
unexpected_income_levels <- c()
cat('income_level column not found\n')
}
## Income level categories found:
## [1] "High" "Lower-Middle" "Upper-Middle"
##
## All income levels match expected classifications
The dataset contains 3 of the 4 expected income level
classifications: - High: High-income countries -
Upper-Middle: Upper-middle-income countries
- Lower-Middle: Lower-middle-income countries -
Low: (Not present in dataset)
Note: The “Low” income category is absent from the dataset, meaning no low-income countries are included in this climate-health study. All present categories follow the expected classification format.
# Check coordinate validity and consistency
lat_violations <- 0
lon_violations <- 0
coord_missing <- 0
coord_records_removed <- 0
inconsistent_coord_countries <- c()
if("latitude" %in% colnames(df) && "longitude" %in% colnames(df)) {
# Check for values outside valid ranges
lat_violations <- sum(df$latitude < -90 | df$latitude > 90, na.rm = TRUE)
lon_violations <- sum(df$longitude < -180 | df$longitude > 180, na.rm = TRUE)
coord_missing <- sum(is.na(df$latitude) | is.na(df$longitude))
cat('=== GEOGRAPHIC COORDINATE VALIDATION ===\n')
cat('Latitude violations (outside -90 to 90):', lat_violations, '\n')
cat('Longitude violations (outside -180 to 180):', lon_violations, '\n')
cat('Missing coordinates:', coord_missing, '\n')
# Check coordinate consistency per country
if("country_name" %in% colnames(df)) {
coord_consistency <- df %>%
group_by(country_name) %>%
summarise(
unique_coords = n_distinct(paste(latitude, longitude)),
.groups = 'drop'
) %>%
filter(unique_coords > 1)
if(nrow(coord_consistency) > 0) {
inconsistent_coord_countries <- coord_consistency$country_name
cat('\nCountries with inconsistent coordinates:', length(inconsistent_coord_countries), '\n')
cat('Flagged countries:\n')
print(inconsistent_coord_countries)
} else {
cat('\nAll countries have consistent coordinates\n')
}
}
# Remove records with clearly invalid coordinates
if(lat_violations > 0 || lon_violations > 0) {
original_rows <- nrow(df)
df <- df %>%
filter(latitude >= -90 & latitude <= 90 & longitude >= -180 & longitude <= 180)
coord_records_removed <- original_rows - nrow(df)
cat('\nRemoved', coord_records_removed, 'records with invalid coordinates\n')
}
} else {
cat('Coordinate columns not found\n')
}
## === GEOGRAPHIC COORDINATE VALIDATION ===
## Latitude violations (outside -90 to 90): 0
## Longitude violations (outside -180 to 180): 0
## Missing coordinates: 0
##
## All countries have consistent coordinates
# Validate climate and health data for impossible values
climate_quality_records_removed <- 0
data_violations <- list()
cat('=== CLIMATE AND HEALTH DATA QUALITY VALIDATION ===\n\n')
## === CLIMATE AND HEALTH DATA QUALITY VALIDATION ===
# Temperature check (-50 to 50 Celsius is reasonable range)
if("temperature_celsius" %in% colnames(df)) {
temp_violations <- sum(df$temperature_celsius < -50 | df$temperature_celsius > 50, na.rm = TRUE)
data_violations$temperature <- temp_violations
cat('Temperature violations (outside -50 to 50°C):', temp_violations, '\n')
}
## Temperature violations (outside -50 to 50°C): 0
# Check negative values in rate/count columns (should be non-negative)
rate_columns <- c("respiratory_disease_rate", "cardio_mortality_rate", "vector_disease_risk_score",
"waterborne_disease_incidents", "heat_related_admissions")
for(col in rate_columns) {
if(col %in% colnames(df)) {
neg_count <- sum(df[[col]] < 0, na.rm = TRUE)
if(neg_count > 0) {
data_violations[[col]] <- neg_count
cat(col, 'negative values:', neg_count, '\n')
}
}
}
# Check index bounds
# Healthcare Access Index: 0-100
if("healthcare_access_index" %in% colnames(df)) {
hai_violations <- sum(df$healthcare_access_index < 0 | df$healthcare_access_index > 100, na.rm = TRUE)
data_violations$healthcare_access_index <- hai_violations
cat('Healthcare Access Index violations (outside 0-100):', hai_violations, '\n')
# Show examples of violations
if(hai_violations > 0) {
hai_examples <- df %>%
filter(healthcare_access_index < 0 | healthcare_access_index > 100) %>%
select(record_id, country_name, date, healthcare_access_index) %>%
head(5)
cat(' Example violations:\n')
print(hai_examples)
}
}
## Healthcare Access Index violations (outside 0-100): 3
## Example violations:
## # A tibble: 3 × 4
## record_id country_name date healthcare_access_index
## <dbl> <chr> <date> <dbl>
## 1 4066 United Kingdom 2017-04-02 102.
## 2 4122 United Kingdom 2018-04-29 100.
## 3 4467 United Kingdom 2024-12-08 102
# Air Quality Index: 0-500
if("air_quality_index" %in% colnames(df)) {
aqi_violations <- sum(df$air_quality_index < 0 | df$air_quality_index > 500, na.rm = TRUE)
data_violations$air_quality_index <- aqi_violations
cat('Air Quality Index violations (outside 0-500):', aqi_violations, '\n')
# Show examples of violations
if(aqi_violations > 0) {
aqi_examples <- df %>%
filter(air_quality_index < 0 | air_quality_index > 500) %>%
select(record_id, country_name, date, air_quality_index) %>%
head(5)
cat(' Example violations:\n')
print(aqi_examples)
}
}
## Air Quality Index violations (outside 0-500): 374
## Example violations:
## # A tibble: 5 × 4
## record_id country_name date air_quality_index
## <dbl> <chr> <date> <dbl>
## 1 4 United States 2015-01-25 -3
## 2 11 United States 2015-03-15 -23
## 3 37 United States 2015-09-13 -13
## 4 48 United States 2015-11-29 -1
## 5 61 United States 2016-02-28 -25
# Mental Health Index: 0-100
if("mental_health_index" %in% colnames(df)) {
mhi_violations <- sum(df$mental_health_index < 0 | df$mental_health_index > 100, na.rm = TRUE)
data_violations$mental_health_index <- mhi_violations
cat('Mental Health Index violations (outside 0-100):', mhi_violations, '\n')
}
## Mental Health Index violations (outside 0-100): 0
# Food Security Index: 0-100
if("food_security_index" %in% colnames(df)) {
fsi_violations <- sum(df$food_security_index < 0 | df$food_security_index > 100, na.rm = TRUE)
data_violations$food_security_index <- fsi_violations
cat('Food Security Index violations (outside 0-100):', fsi_violations, '\n')
}
## Food Security Index violations (outside 0-100): 0
# Population should be positive
if("population_millions" %in% colnames(df)) {
pop_violations <- sum(df$population_millions < 0, na.rm = TRUE)
data_violations$population_millions <- pop_violations
cat('Population violations (negative values):', pop_violations, '\n')
}
## Population violations (negative values): 0
# Remove records with impossible values
original_rows <- nrow(df)
df <- df %>%
filter(
if("temperature_celsius" %in% colnames(.)) temperature_celsius >= -50 & temperature_celsius <= 50 else TRUE,
if("respiratory_disease_rate" %in% colnames(.)) respiratory_disease_rate >= 0 | is.na(respiratory_disease_rate) else TRUE,
if("cardio_mortality_rate" %in% colnames(.)) cardio_mortality_rate >= 0 | is.na(cardio_mortality_rate) else TRUE,
if("vector_disease_risk_score" %in% colnames(.)) vector_disease_risk_score >= 0 | is.na(vector_disease_risk_score) else TRUE,
if("waterborne_disease_incidents" %in% colnames(.)) waterborne_disease_incidents >= 0 | is.na(waterborne_disease_incidents) else TRUE,
if("heat_related_admissions" %in% colnames(.)) heat_related_admissions >= 0 | is.na(heat_related_admissions) else TRUE,
if("healthcare_access_index" %in% colnames(.)) healthcare_access_index >= 0 & healthcare_access_index <= 100 | is.na(healthcare_access_index) else TRUE,
if("air_quality_index" %in% colnames(.)) air_quality_index >= 0 & air_quality_index <= 500 | is.na(air_quality_index) else TRUE,
if("mental_health_index" %in% colnames(.)) mental_health_index >= 0 & mental_health_index <= 100 | is.na(mental_health_index) else TRUE,
if("food_security_index" %in% colnames(.)) food_security_index >= 0 & food_security_index <= 100 | is.na(food_security_index) else TRUE,
if("population_millions" %in% colnames(.)) population_millions >= 0 | is.na(population_millions) else TRUE
)
climate_quality_records_removed <- original_rows - nrow(df)
cat('\nRemoved', climate_quality_records_removed, 'records with invalid climate/health data\n')
##
## Removed 377 records with invalid climate/health data
1. Healthcare Access Index (3 violations) - Valid range: 0-100% (percentage of population with healthcare access) - Found: UK records with 102% (impossible) - Action: Removed 3 records
2. Air Quality Index (374 violations) - Valid range: 0-500 (EPA scale) - Found: Negative values (-3 to -25) in US data (2015-2016) - Action: Removed 374 records
3. Other Metrics - Temperature, Mental Health Index, Food Security Index, Population: All valid
Note: Healthcare Access >100% and negative Air Quality Index values are logically impossible and indicate data quality issues requiring removal.
# Organize columns in logical order for analysis
column_order <- c(
# Geographic information (8)
"record_id", "country_code", "country_name", "region",
"income_level", "latitude", "longitude", "population_millions",
# Temporal information (4)
"date", "year", "month", "week",
# Climate indicators (8)
"temperature_celsius", "temp_anomaly_celsius", "precipitation_mm", "heat_wave_days",
"drought_indicator", "flood_indicator", "extreme_weather_events",
# Air quality (2)
"pm25_ugm3", "air_quality_index",
# Health outcomes (5)
"respiratory_disease_rate", "cardio_mortality_rate", "vector_disease_risk_score",
"waterborne_disease_incidents", "heat_related_admissions",
# Socioeconomic and wellbeing (4)
"healthcare_access_index", "gdp_per_capita_usd", "mental_health_index", "food_security_index"
)
# Reorder columns
df <- df[, column_order]
cat('Columns reordered into logical categories:\n')
## Columns reordered into logical categories:
cat('- Geographic (8 columns)\n')
## - Geographic (8 columns)
cat('- Temporal (4 columns)\n')
## - Temporal (4 columns)
cat('- Climate (8 columns)\n')
## - Climate (8 columns)
cat('- Air Quality (2 columns)\n')
## - Air Quality (2 columns)
cat('- Health Outcomes (5 columns)\n')
## - Health Outcomes (5 columns)
cat('- Socioeconomic/Wellbeing (4 columns)\n')
## - Socioeconomic/Wellbeing (4 columns)
Split by category
# Display final column information organized by category
cat('=== CLEANED DATASET STRUCTURE ===\n\n')
## === CLEANED DATASET STRUCTURE ===
cat('GEOGRAPHIC INFORMATION (8 columns):\n')
## GEOGRAPHIC INFORMATION (8 columns):
cat(' 1. record_id:', class(df$record_id), '\n')
## 1. record_id: numeric
cat(' 2. country_code:', class(df$country_code), '\n')
## 2. country_code: character
cat(' 3. country_name:', class(df$country_name), '\n')
## 3. country_name: character
cat(' 4. region:', class(df$region), '\n')
## 4. region: character
cat(' 5. income_level:', class(df$income_level), '\n')
## 5. income_level: character
cat(' 6. latitude:', class(df$latitude), '\n')
## 6. latitude: numeric
cat(' 7. longitude:', class(df$longitude), '\n')
## 7. longitude: numeric
cat(' 8. population_millions:', class(df$population_millions), '\n\n')
## 8. population_millions: numeric
cat('TEMPORAL INFORMATION (4 columns):\n')
## TEMPORAL INFORMATION (4 columns):
cat(' 9. date:', class(df$date), '\n')
## 9. date: Date
cat(' 10. year:', class(df$year), '\n')
## 10. year: numeric
cat(' 11. month:', class(df$month), '\n')
## 11. month: numeric
cat(' 12. week:', class(df$week), '\n\n')
## 12. week: numeric
cat('CLIMATE INDICATORS (7 columns):\n')
## CLIMATE INDICATORS (7 columns):
cat(' 13. temperature_celsius:', class(df$temperature_celsius), '\n')
## 13. temperature_celsius: numeric
cat(' 14. temp_anomaly_celsius:', class(df$temp_anomaly_celsius), '\n')
## 14. temp_anomaly_celsius: numeric
cat(' 15. precipitation_mm:', class(df$precipitation_mm), '\n')
## 15. precipitation_mm: numeric
cat(' 16. heat_wave_days:', class(df$heat_wave_days), '\n')
## 16. heat_wave_days: numeric
cat(' 17. drought_indicator:', class(df$drought_indicator), '\n')
## 17. drought_indicator: numeric
cat(' 18. flood_indicator:', class(df$flood_indicator), '\n')
## 18. flood_indicator: numeric
cat(' 19. extreme_weather_events:', class(df$extreme_weather_events), '\n\n')
## 19. extreme_weather_events: numeric
cat('AIR QUALITY (2 columns):\n')
## AIR QUALITY (2 columns):
cat(' 20. pm25_ugm3:', class(df$pm25_ugm3), '\n')
## 20. pm25_ugm3: numeric
cat(' 21. air_quality_index:', class(df$air_quality_index), '\n\n')
## 21. air_quality_index: numeric
cat('HEALTH OUTCOMES (5 columns):\n')
## HEALTH OUTCOMES (5 columns):
cat(' 22. respiratory_disease_rate:', class(df$respiratory_disease_rate), '\n')
## 22. respiratory_disease_rate: numeric
cat(' 23. cardio_mortality_rate:', class(df$cardio_mortality_rate), '\n')
## 23. cardio_mortality_rate: numeric
cat(' 24. vector_disease_risk_score:', class(df$vector_disease_risk_score), '\n')
## 24. vector_disease_risk_score: numeric
cat(' 25. waterborne_disease_incidents:', class(df$waterborne_disease_incidents), '\n')
## 25. waterborne_disease_incidents: numeric
cat(' 26. heat_related_admissions:', class(df$heat_related_admissions), '\n\n')
## 26. heat_related_admissions: numeric
cat('SOCIOECONOMIC & WELLBEING (4 columns):\n')
## SOCIOECONOMIC & WELLBEING (4 columns):
cat(' 27. healthcare_access_index:', class(df$healthcare_access_index), '\n')
## 27. healthcare_access_index: numeric
cat(' 28. gdp_per_capita_usd:', class(df$gdp_per_capita_usd), '\n')
## 28. gdp_per_capita_usd: numeric
cat(' 29. mental_health_index:', class(df$mental_health_index), '\n')
## 29. mental_health_index: numeric
cat(' 30. food_security_index:', class(df$food_security_index), '\n')
## 30. food_security_index: numeric
# Document issues flagged during data cleaning for team review
cat('=== ISSUES FLAGGED FOR REVIEW ===\n\n')
## === ISSUES FLAGGED FOR REVIEW ===
# Missing values check
cat('1. MISSING VALUES:\n')
## 1. MISSING VALUES:
missing_counts <- colSums(is.na(df))
total_missing <- sum(missing_counts)
if(total_missing > 0) {
cat(' Total missing values:', total_missing, '\n')
cols_with_missing <- names(missing_counts[missing_counts > 0])
for(col in cols_with_missing) {
pct <- round(missing_counts[col] / nrow(df) * 100, 1)
cat(' -', col, ':', missing_counts[col], 'missing (', pct, '%)\n')
}
} else {
cat(' No missing values in the dataset\n')
}
## No missing values in the dataset
cat('\n')
# Coordinate inconsistencies
cat('2. COORDINATE INCONSISTENCIES:\n')
## 2. COORDINATE INCONSISTENCIES:
if(exists("inconsistent_coord_countries") && length(inconsistent_coord_countries) > 0) {
cat(' Countries with varying coordinates:', length(inconsistent_coord_countries), '\n')
cat(' Flagged for manual review:\n')
for(country in inconsistent_coord_countries) {
cat(' -', country, '\n')
}
} else {
cat(' None found\n')
}
## None found
cat('\n')
# Temporal gaps
cat('3. TEMPORAL CONSISTENCY:\n')
## 3. TEMPORAL CONSISTENCY:
if(exists("temporal_issues") && nrow(temporal_issues) > 0) {
cat(' Countries with unexpected weekly record counts:\n')
print(temporal_issues)
} else {
cat(' None found\n')
}
## Countries with unexpected weekly record counts:
## # A tibble: 25 × 3
## country_name year record_count
## <chr> <dbl> <int>
## 1 Argentina 2025 42
## 2 Australia 2025 42
## 3 Bangladesh 2025 42
## 4 Brazil 2025 42
## 5 Canada 2025 42
## 6 China 2025 42
## 7 Colombia 2025 42
## 8 Egypt 2025 42
## 9 France 2025 42
## 10 Germany 2025 42
## # ℹ 15 more rows
cat('\n')
# Week mismatches
cat('4. WEEK NUMBER DISCREPANCIES:\n')
## 4. WEEK NUMBER DISCREPANCIES:
if(exists("week_mismatches") && week_mismatches > 0) {
cat(' Records where week column differs from calculated ISO week:', week_mismatches, '\n')
cat(' Consider reviewing week column calculation methodology\n')
} else {
cat(' None found\n')
}
## None found
cat('\n')
# Income level issues
cat('5. INCOME LEVEL VALIDATION:\n')
## 5. INCOME LEVEL VALIDATION:
if(exists("unexpected_income_levels") && length(unexpected_income_levels) > 0) {
cat(' Unexpected income level categories found:\n')
for(level in unexpected_income_levels) {
cat(' -', level, '\n')
}
} else {
cat(' All income levels validated successfully\n')
}
## All income levels validated successfully
cat('\n')
# Data validation summary
cat('6. DATA QUALITY ISSUES RESOLVED:\n')
## 6. DATA QUALITY ISSUES RESOLVED:
cat(' Duplicate records removed:', total_duplicates_removed, '\n')
## Duplicate records removed: 0
cat(' Invalid coordinates removed:', coord_records_removed, '\n')
## Invalid coordinates removed: 0
cat(' Invalid climate/health data removed:', climate_quality_records_removed, '\n')
## Invalid climate/health data removed: 377
cat('\n')
The dataset contains only 42 weeks of data for 2025 (partial year). - Recommendation: Exclude 2025 from year-over-year trend analysis - Use only 2015-2024 (10 complete years) for annual comparisons - If including 2025, clearly label it as “partial year data”
377 records were removed due to invalid climate/health values: - Healthcare Access Index violations: 3 records (>100%) - Air Quality Index violations: 374 records (>500) - All other validation checks passed
Several columns need factor conversion for analysis: -
country_code, country_name,
region - income_level (ordered factor:
Lower-Middle < Upper-Middle < High) - Optional:
drought_indicator, flood_indicator
“Low” income category is absent from the dataset. Only 3 categories
present: - High - Upper-Middle
- Lower-Middle
# Display first few rows of cleaned dataset
cat('First 10 rows of cleaned dataset:\n\n')
## First 10 rows of cleaned dataset:
head(df, 10)
## # A tibble: 10 × 30
## record_id country_code country_name region income_level latitude longitude
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 USA United States North A… High 37.1 -95.7
## 2 2 USA United States North A… High 37.1 -95.7
## 3 3 USA United States North A… High 37.1 -95.7
## 4 5 USA United States North A… High 37.1 -95.7
## 5 6 USA United States North A… High 37.1 -95.7
## 6 7 USA United States North A… High 37.1 -95.7
## 7 8 USA United States North A… High 37.1 -95.7
## 8 9 USA United States North A… High 37.1 -95.7
## 9 10 USA United States North A… High 37.1 -95.7
## 10 12 USA United States North A… High 37.1 -95.7
## # ℹ 23 more variables: population_millions <dbl>, date <date>, year <dbl>,
## # month <dbl>, week <dbl>, temperature_celsius <dbl>,
## # temp_anomaly_celsius <dbl>, precipitation_mm <dbl>, heat_wave_days <dbl>,
## # drought_indicator <dbl>, flood_indicator <dbl>,
## # extreme_weather_events <dbl>, pm25_ugm3 <dbl>, air_quality_index <dbl>,
## # respiratory_disease_rate <dbl>, cardio_mortality_rate <dbl>,
## # vector_disease_risk_score <dbl>, waterborne_disease_incidents <dbl>, …
# install.packages("corrplot")
# install.packages("skimr")
# Load libraries
library(corrplot)
## corrplot 0.95 loaded
library(skimr)
library(ggplot2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(ggplot2)
library(reshape2)
nrow(df); ncol(df)
## [1] 13723
## [1] 30
length(unique(df$country_code))
## [1] 25
length(unique(df$region))
## [1] 8
length(unique(df$income_level))
## [1] 3
colSums(is.na(df))
## record_id country_code
## 0 0
## country_name region
## 0 0
## income_level latitude
## 0 0
## longitude population_millions
## 0 0
## date year
## 0 0
## month week
## 0 0
## temperature_celsius temp_anomaly_celsius
## 0 0
## precipitation_mm heat_wave_days
## 0 0
## drought_indicator flood_indicator
## 0 0
## extreme_weather_events pm25_ugm3
## 0 0
## air_quality_index respiratory_disease_rate
## 0 0
## cardio_mortality_rate vector_disease_risk_score
## 0 0
## waterborne_disease_incidents heat_related_admissions
## 0 0
## healthcare_access_index gdp_per_capita_usd
## 0 0
## mental_health_index food_security_index
## 0 0
table(df$income_level)
##
## High Lower-Middle Upper-Middle
## 4866 5074 3783
# Select only numeric columns for histograms
numeric_cols <- names(df)[sapply(df, is.numeric)]
# Exclude identifier and temporal columns that are not suitable for histograms
exclude_cols <- c("record_id", "year", "month", "week", "latitude", "longitude")
plot_numeric_cols <- setdiff(numeric_cols, exclude_cols)
# Prepare data for plotting (long format)
hist_data_long <- df %>%
select(all_of(plot_numeric_cols)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value")
# Create histograms for each selected numeric variable
histogram_plot <- ggplot(hist_data_long, aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "steelblue", color = "black") +
facet_wrap(~ variable, scales = "free") +
labs(
title = "Histograms of Numerical Variables",
x = "Value",
y = "Frequency"
) +
theme_minimal()
print(histogram_plot)
📊 Distribution Insights
🧠 Modeling Implications - Apply log or Box-Cox transformations to skewed variables before regression or tree-based modeling. - Use standardization for variables with wide ranges (e.g. GDP, population) to improve model stability. - Consider clustering or segmentation for bimodal variables to uncover latent groups. - Flag outliers for robust evaluation and scenario testing during deployment.
# Select numeric columns
selected_cols <- c(
"temperature_celsius", "precipitation_mm", "heat_wave_days",
"extreme_weather_events", "air_quality_index",
"vector_disease_risk_score", "heat_related_admissions",
"respiratory_disease_rate", "healthcare_access_index",
"food_security_index"
)
# Prepare correlation matrix
cor_matrix <- cor(df[, selected_cols], use = "complete.obs")
# Convert to long format for ggplot
cor_data <- melt(cor_matrix)
# Plot heatmap
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, color = "black"),
axis.text.y = element_text(color = "black")
) +
labs(
title = "Correlation Heatmap of Climate Variables",
x = "",
y = ""
)
📊 Correlation Highlights - Temp ↔︎ Vector disease risk (0.65) - Temp ↔︎ Heat-related admissions (0.42) - Heat wave days ↔︎ Extreme events (0.96) → redundant - Healthcare access & food security negatively correlated with climate stressors
🧠 Modeling & Deployment Tips - Prioritize temperature, healthcare access, and vector risk as core features. - Use interaction terms (e.g. temperature × access index) to capture compounding effects. - Evaluate model fairness across low-access regions. - Highlight vulnerable zones in dashboards for policy targeting.
This section need to run for 20mins+, Require the progress statement to prevent the R runtime disconnection due to false sense of inactivity
# SECTION: PACKAGE INSTALLATION & ENVIRONMENT SETUP
# 1. Define required packages (Excludes pre-installed dplyr/ggplot2)
packages <- c("caret", "glmnet", "ranger", "xgboost", "recipes")
# 2. Configure Repositories for High-Speed Linux Binaries
options(
repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"),
Ncpus = parallel::detectCores() # Automatically uses 2 cores (or more if available)
)
# 3. Identify ONLY what is missing
missing_pkgs <- packages[!(packages %in% installed.packages()[, "Package"])]
# 4. Install missing items in one optimized batch
if (length(missing_pkgs) > 0) {
cat("Fast-installing missing binaries and dependencies:", paste(missing_pkgs, collapse = ", "), "\n")
# dependencies = TRUE is the "secret sauce" to get all 27+ sub-packages at once
install.packages(missing_pkgs, dependencies = TRUE, quiet = TRUE)
}
# 5. Load everything efficiently
library(dplyr)
invisible(lapply(packages, library, character.only = TRUE))
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-10
##
## Attaching package: 'recipes'
## The following object is masked from 'package:Matrix':
##
## update
## The following object is masked from 'package:stats':
##
## step
cat("----------------------------------------------------------\n")
## ----------------------------------------------------------
cat("SUCCESS: Environment Ready (using", parallel::detectCores(), "cores)\n")
## SUCCESS: Environment Ready (using 12 cores)
Validation & Cleaning: Applied logical range filters to environmental and economic variables (e.g., -50°C to 60°C for temperature) and discarded 2025 data to ensure a clean, historical dataset.
Categorical Encoding: Converted region into a factor and income_level into an ordered factor, preserving the natural economic hierarchy (Low to High).
Feature Engineering: Applied log1p transformations to GDP and Air Quality (AQI) variables to normalize right-skewed data and improve model stability.
Feature Selection: Curated a final set of 11 predictors (climate, pollution, and socioeconomic metrics) and used na.omit() to ensure a “complete case” dataset.
Stratified Splitting: Partitioned the data into an 80/20 train-test split using stratified sampling to ensure both sets have a similar distribution of heat-related admissions.
# --- 1. Filter and Clean Data ---
df_cleaned_for_regression <- df %>%
filter(
# Core target validation
if("heat_related_admissions" %in% colnames(.)) heat_related_admissions >= 0 | is.na(heat_related_admissions) else TRUE,
# Predictor validations
if("temperature_celsius" %in% colnames(.)) temperature_celsius >= -50 & temperature_celsius <= 60 else TRUE,
if("temp_anomaly_celsius" %in% colnames(.)) temp_anomaly_celsius >= -20 & temp_anomaly_celsius <= 20 else TRUE,
if("heat_wave_days" %in% colnames(.)) heat_wave_days >= 0 & heat_wave_days <= 366 else TRUE,
if("air_quality_index" %in% colnames(.)) air_quality_index >= 0 & air_quality_index <= 500 else TRUE,
if("healthcare_access_index" %in% colnames(.)) healthcare_access_index >= 0 & healthcare_access_index <= 100 else TRUE,
if("gdp_per_capita_usd" %in% colnames(.)) gdp_per_capita_usd >= 0 else TRUE
)
# --- 1.1 Discard 2025 data ---
df_clean <- df_cleaned_for_regression %>% filter(year < 2025)
# --- 1.2 Convert categorical variables to factors ---
df_clean$region <- as.factor(df_clean$region)
df_clean$income_level <- factor(df_clean$income_level,
levels = c("Low", "Lower-Middle", "Upper-Middle", "High"), # Added 'Low' just in case
ordered = TRUE)
# --- 1.3 Feature Engineering & Transformations ---
# Using log1p (log(1+x)) to handle skewed environmental and economic data
df_clean$log_gdp <- log1p(df_clean$gdp_per_capita_usd)
df_clean$log_pm25 <- log1p(df_clean$pm25_ugm3)
df_clean$log_aqi <- log1p(df_clean$air_quality_index)
# --- 1.4 Select Features for Regression ---
# UPDATED: Included heat_related_admissions and removed respiratory_disease_rate
features <- c("temperature_celsius", "temp_anomaly_celsius", "precipitation_mm",
"heat_wave_days", "log_pm25", "log_aqi", "healthcare_access_index",
"log_gdp", "region", "income_level")
# We must select the target variable explicitly to use it in df_model
df_model <- df_clean %>%
select(all_of(features), heat_related_admissions) %>%
na.omit()
# --- 1.5 Split Data (80/20) ---
set.seed(123)
# Splitting based on the heat_related_admissions distribution
trainIndex_reg <- createDataPartition(df_model$heat_related_admissions, p = .8, list = FALSE)
train_data_reg <- df_model[trainIndex_reg, ]
test_data_reg <- df_model[-trainIndex_reg, ]
cat("Environment setup for Heat Related Admissions complete! ✓\n")
## Environment setup for Heat Related Admissions complete! ✓
cat("Training set size:", nrow(train_data_reg), "rows\n")
## Training set size: 10162 rows
Set up 5-fold cross-validation for model training to ensure the model generalizes well and to prevent overfitting.
# Set up 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
Hardware Optimization: Automatically detected CPU capacity and allocated parallel threads (n_cores - 1) to maximize processing speed for ensemble models.
Data Transformation: Implemented One-Hot Encoding via dummyVars to convert categorical regions and income levels into numeric matrices, ensuring compatibility across all model types.
Based on research objectives, we will train three distinct models:
Model 1 (Elastic Net): Established a robust linear baseline using glmnet to manage multicollinearity between correlated climate variables (e.g., Temperature vs. Heat Wave Days).
Model 2 (Random Forest): Utilized the ranger implementation to capture complex, non-linear interactions between socioeconomic factors and environmental stressors.
Model 3 (XGBoost): Implemented a high-performance Gradient Boosting approach using a native DMatrix structure for memory efficiency and maximum predictive accuracy.
Training Safeguards: Integrated Early Stopping (10 rounds) in XGBoost and 5-fold Cross-Validation in caret to prevent overfitting and ensure the models generalize well to new heat admission data.
# Re-detect hardware capacity
library(parallel)
n_cores <- detectCores()
# Define the missing variable
suggested_threads <- max(1, n_cores - 1)
cat("Hardware Check: Using", suggested_threads, "threads for training.\n")
## Hardware Check: Using 11 threads for training.
# --- One-Hot Encoding ---
dummy_transformer <- dummyVars(heat_related_admissions ~ ., data = df_model)
# Standardize as matrices immediately to avoid repeating the conversion inside models
to_plain_matrix <- function(transformer, data) {
tmp <- predict(transformer, newdata = data)
mat <- matrix(as.numeric(tmp), nrow = nrow(tmp), ncol = ncol(tmp))
colnames(mat) <- colnames(tmp)
return(mat)
}
train_x <- to_plain_matrix(dummy_transformer, train_data_reg)
test_x <- to_plain_matrix(dummy_transformer, test_data_reg)
train_y <- train_data_reg$heat_related_admissions
test_y <- test_data_reg$heat_related_admissions
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest <- xgb.DMatrix(data = test_x, label = test_y)
# --- Step 3: Train Regression Models ---
# Model 1: Elastic Net
cat("Training Model 1: Elastic Net...\n")
## Training Model 1: Elastic Net...
time_enet <- system.time({
reg_model_enet <- train(
x = train_x, y = train_y,
method = "glmnet",
trControl = train_control
)
})
## + Fold1: alpha=0.10, lambda=1.326
## - Fold1: alpha=0.10, lambda=1.326
## + Fold1: alpha=0.55, lambda=1.326
## - Fold1: alpha=0.55, lambda=1.326
## + Fold1: alpha=1.00, lambda=1.326
## - Fold1: alpha=1.00, lambda=1.326
## + Fold2: alpha=0.10, lambda=1.326
## - Fold2: alpha=0.10, lambda=1.326
## + Fold2: alpha=0.55, lambda=1.326
## - Fold2: alpha=0.55, lambda=1.326
## + Fold2: alpha=1.00, lambda=1.326
## - Fold2: alpha=1.00, lambda=1.326
## + Fold3: alpha=0.10, lambda=1.326
## - Fold3: alpha=0.10, lambda=1.326
## + Fold3: alpha=0.55, lambda=1.326
## - Fold3: alpha=0.55, lambda=1.326
## + Fold3: alpha=1.00, lambda=1.326
## - Fold3: alpha=1.00, lambda=1.326
## + Fold4: alpha=0.10, lambda=1.326
## - Fold4: alpha=0.10, lambda=1.326
## + Fold4: alpha=0.55, lambda=1.326
## - Fold4: alpha=0.55, lambda=1.326
## + Fold4: alpha=1.00, lambda=1.326
## - Fold4: alpha=1.00, lambda=1.326
## + Fold5: alpha=0.10, lambda=1.326
## - Fold5: alpha=0.10, lambda=1.326
## + Fold5: alpha=0.55, lambda=1.326
## - Fold5: alpha=0.55, lambda=1.326
## + Fold5: alpha=1.00, lambda=1.326
## - Fold5: alpha=1.00, lambda=1.326
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.55, lambda = 0.0133 on full training set
cat("Elapsed Time:", time_enet["elapsed"], "seconds\nModel 1 Complete! ✓\n\n")
## Elapsed Time: 0.33 seconds
## Model 1 Complete! ✓
# Model 2: Ranger (Random Forest)
cat("Training Model 2: Ranger Forest...\n")
## Training Model 2: Ranger Forest...
time_rf <- system.time({
reg_model_rf <- train(
x = train_x, y = train_y,
method = "ranger",
trControl = train_control,
num.trees = 100,
num.threads = suggested_threads,
importance = "impurity"
)
})
## + Fold1: mtry= 2, min.node.size=5, splitrule=variance
## - Fold1: mtry= 2, min.node.size=5, splitrule=variance
## + Fold1: mtry=10, min.node.size=5, splitrule=variance
## - Fold1: mtry=10, min.node.size=5, splitrule=variance
## + Fold1: mtry=19, min.node.size=5, splitrule=variance
## - Fold1: mtry=19, min.node.size=5, splitrule=variance
## + Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold1: mtry=10, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=10, min.node.size=5, splitrule=extratrees
## + Fold1: mtry=19, min.node.size=5, splitrule=extratrees
## - Fold1: mtry=19, min.node.size=5, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=5, splitrule=variance
## - Fold2: mtry= 2, min.node.size=5, splitrule=variance
## + Fold2: mtry=10, min.node.size=5, splitrule=variance
## - Fold2: mtry=10, min.node.size=5, splitrule=variance
## + Fold2: mtry=19, min.node.size=5, splitrule=variance
## - Fold2: mtry=19, min.node.size=5, splitrule=variance
## + Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=10, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=10, min.node.size=5, splitrule=extratrees
## + Fold2: mtry=19, min.node.size=5, splitrule=extratrees
## - Fold2: mtry=19, min.node.size=5, splitrule=extratrees
## + Fold3: mtry= 2, min.node.size=5, splitrule=variance
## - Fold3: mtry= 2, min.node.size=5, splitrule=variance
## + Fold3: mtry=10, min.node.size=5, splitrule=variance
## - Fold3: mtry=10, min.node.size=5, splitrule=variance
## + Fold3: mtry=19, min.node.size=5, splitrule=variance
## - Fold3: mtry=19, min.node.size=5, splitrule=variance
## + Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold3: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=10, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=10, min.node.size=5, splitrule=extratrees
## + Fold3: mtry=19, min.node.size=5, splitrule=extratrees
## - Fold3: mtry=19, min.node.size=5, splitrule=extratrees
## + Fold4: mtry= 2, min.node.size=5, splitrule=variance
## - Fold4: mtry= 2, min.node.size=5, splitrule=variance
## + Fold4: mtry=10, min.node.size=5, splitrule=variance
## - Fold4: mtry=10, min.node.size=5, splitrule=variance
## + Fold4: mtry=19, min.node.size=5, splitrule=variance
## - Fold4: mtry=19, min.node.size=5, splitrule=variance
## + Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold4: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=10, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=10, min.node.size=5, splitrule=extratrees
## + Fold4: mtry=19, min.node.size=5, splitrule=extratrees
## - Fold4: mtry=19, min.node.size=5, splitrule=extratrees
## + Fold5: mtry= 2, min.node.size=5, splitrule=variance
## - Fold5: mtry= 2, min.node.size=5, splitrule=variance
## + Fold5: mtry=10, min.node.size=5, splitrule=variance
## - Fold5: mtry=10, min.node.size=5, splitrule=variance
## + Fold5: mtry=19, min.node.size=5, splitrule=variance
## - Fold5: mtry=19, min.node.size=5, splitrule=variance
## + Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## - Fold5: mtry= 2, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=10, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=10, min.node.size=5, splitrule=extratrees
## + Fold5: mtry=19, min.node.size=5, splitrule=extratrees
## - Fold5: mtry=19, min.node.size=5, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 19, splitrule = extratrees, min.node.size = 5 on full training set
cat("Elapsed Time:", time_rf["elapsed"], "seconds\nModel 2 Complete! ✓\n\n")
## Elapsed Time: 9.11 seconds
## Model 2 Complete! ✓
# Model 3: XGBoost
params <- list(
booster = "gbtree",
objective = "reg:squarederror",
eta = 0.3,
max_depth = 6,
nthread = suggested_threads
)
cat("Training Model 3: XGBoost...\n")
## Training Model 3: XGBoost...
time_xgb <- system.time({
reg_model_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(val=dtest, train=dtrain),
print_every_n = 10,
early_stopping_rounds = 10,
maximize = FALSE
)
})
## Warning in throw_err_or_depr_msg("Parameter '", match_old, "' has been renamed
## to '", : Parameter 'watchlist' has been renamed to 'evals'. This warning will
## become an error in a future version.
## Multiple eval metrics are present. Will use train_rmse for early stopping.
## Will train until train_rmse hasn't improved in 10 rounds.
##
## [1] val-rmse:7.633804 train-rmse:7.542605
## [11] val-rmse:4.505231 train-rmse:3.941566
## [21] val-rmse:4.259704 train-rmse:3.411858
## [31] val-rmse:4.192930 train-rmse:3.132369
## [41] val-rmse:4.145195 train-rmse:2.891480
## [51] val-rmse:4.158602 train-rmse:2.711914
## [61] val-rmse:4.179990 train-rmse:2.547518
## [71] val-rmse:4.187054 train-rmse:2.414084
## [81] val-rmse:4.194985 train-rmse:2.292317
## [91] val-rmse:4.207165 train-rmse:2.190803
## [100] val-rmse:4.208809 train-rmse:2.087171
cat("Elapsed Time:", time_xgb["elapsed"], "seconds\nModel 3 Complete! ✓\n\n")
## Elapsed Time: 0.48 seconds
## Model 3 Complete! ✓
cat("\nAll models trained successfully! ✓\n")
##
## All models trained successfully! ✓
# Create a timing summary
timing_summary <- data.frame(
Model = c("Elastic Net", "Ranger", "XGBoost"),
Time_Seconds = c(time_enet["elapsed"], time_rf["elapsed"], time_xgb["elapsed"])
)
print(timing_summary)
## Model Time_Seconds
## 1 Elastic Net 0.33
## 2 Ranger 9.11
## 3 XGBoost 0.48
Compare accuracy across models using cross-validation results
Metric Aggregation: Utilized the resamples() function to collect and align performance distributions (RMSE, \(R^2\), MAE) from the cross-validation folds.
Performance Benchmarking: Generated a statistical summary to compare the average error and stability of the Elastic Net and Random Forest models.
Visual Comparative Analysis:
Relative Evaluation: While XGBoost is evaluated separately on the test set, these CV metrics provide the primary evidence for the Random Forest and Elastic Net reliability before final testing.
cat("\n Comparing Model Performance (CV Results)...\n")
##
## Comparing Model Performance (CV Results)...
# 1. Collect results ONLY from caret models
# XGBoost is native now, so it cannot go into resamples()
results <- resamples(list(
ElasticNet = reg_model_enet,
RandomForest = reg_model_rf
))
# 2. Print summary and plots for the caret models
summary_results <- summary(results)
print(summary_results)
##
## Call:
## summary.resamples(object = results)
##
## Models: ElasticNet, RandomForest
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ElasticNet 3.889701 3.940367 3.943154 3.952131 3.975853 4.011581 0
## RandomForest 2.432477 2.528552 2.534338 2.548872 2.576373 2.672619 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ElasticNet 5.124736 5.137692 5.151461 5.170391 5.218914 5.219153 0
## RandomForest 3.765574 3.932168 3.967069 3.940888 3.968093 4.071535 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ElasticNet 0.6787786 0.690385 0.7010065 0.6985460 0.7091445 0.7134152 0
## RandomForest 0.8175791 0.823341 0.8246324 0.8256884 0.8256008 0.8372887 0
dotplot(results, main = "Caret Models: Cross-Validation Metrics")
bwplot(results, main = "Caret Models: RMSE Distribution")
Custom Metric Standardization: Implemented a dedicated eval_metrics function to compute a multi-dimensional performance profile, including RMSE for error magnitude, MAE for average prediction deviation, and \(R^2\) for the proportion of variance explained.
Final Predictive Validation: Executed model predictions on a “hold-out” test set (20% of the data) that was never seen during training, providing an unbiased assessment of how the models will perform on real-world future data.
Ensemble “Horse Race”: Simultaneously evaluated the Elastic Net, Random Forest, and XGBoost models to statistically identify the most reliable architecture for heat-related admission forecasting.
Visual Error Analysis: Utilized a Predicted vs. Actual scatter plot to diagnose model behavior.
Identity Mapping: Added a red dashed \(1:1\) line to represent the “perfect model” baseline.
Bias Detection: Used the distribution of points around the line to confirm that the model remains accurate across both low and high admission counts.
cat("\n Making Predictions and Evaluating on Test Set...\n")
##
## Making Predictions and Evaluating on Test Set...
# Function to calculate regression metrics
eval_metrics <- function(actual, predicted) {
rmse <- sqrt(mean((actual - predicted)^2))
mae <- mean(abs(actual - predicted))
rss <- sum((actual - predicted)^2)
tss <- sum((actual - mean(actual))^2)
rsq <- 1 - (rss/tss)
return(c(RMSE = rmse, R2 = rsq, MAE = mae))
}
# Ensure test_x is a matrix if it isn't already
test_x_matrix <- as.matrix(test_x)
# Predict using all three models
pred_enet <- predict(reg_model_enet, test_x_matrix)
pred_rf <- predict(reg_model_rf, test_x_matrix)
pred_xgb <- predict(reg_model_xgb, dtest)
# Create a comparison table for Test Set
test_eval <- data.frame(
Model = c("Elastic Net", "Random Forest", "XGBoost"),
rbind(
eval_metrics(test_y, pred_enet),
eval_metrics(test_y, pred_rf),
eval_metrics(test_y, pred_xgb)
)
)
print(test_eval)
## Model RMSE R2 MAE
## 1 Elastic Net 5.348955 0.6791145 4.039578
## 2 Random Forest 4.025883 0.8182248 2.567852
## 3 XGBoost 4.208809 0.8013307 2.743016
library(ggplot2)
# Combine results for plotting
plot_data <- data.frame(
Actual = test_y,
XGBoost = pred_xgb,
RandomForest = pred_rf
)
# Plot for the XGBoost Model
ggplot(plot_data, aes(x = Actual, y = XGBoost)) +
geom_point(alpha = 0.4, color = "darkblue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
labs(title = "XGBoost: Predicted vs. Actual Heat Admissions",
subtitle = paste("R-Squared:", round(test_eval[test_eval$Model=="XGBoost", "R2"], 3)),
x = "Actual Admissions",
y = "Predicted Admissions") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Automated Selection Logic: Implemented an objective selection process that programmatically identifies the winner by finding the minimum Root Mean Square Error (RMSE) among all candidates.
Dynamic Prediction Mapping: The script automatically maps the correct model object and its corresponding test-set predictions for downstream analysis, ensuring the visualizations always reflect the highest-performing architecture.
Regression Accuracy Visualization: Produced a “Predicted vs. Actual” scatter plot to assess model bias.
Steelblue Points: Represent individual admission records.
Red Dashed Line: Acts as the 1:1 “perfect prediction” reference.
Factor Influence Analysis (Variable Importance): Deployed a conditional importance logic tailored to the model type (native xgb.importance for XGBoost or varImp for Caret models).
Interpretability: By ranking the top 10 influencing factors, the approach transitions from “black-box” prediction to actionable scientific insight, specifically highlighting how climate stressors impact hospital workload.
cat("\n Visualizing Best Model Results...\n")
##
## Visualizing Best Model Results...
# 1. Identify the best model automatically
best_model_idx <- which.min(test_eval$RMSE)
best_model_name <- test_eval$Model[best_model_idx]
cat(paste0("The best performing model is: ", best_model_name, "\n"))
## The best performing model is: Random Forest
# 2. Select the correct model object and predictions
if(best_model_name == "XGBoost") {
best_model_obj <- reg_model_xgb
best_preds <- pred_xgb
} else if(best_model_name == "Random Forest") {
best_model_obj <- reg_model_rf
best_preds <- pred_rf
} else {
best_model_obj <- reg_model_enet
best_preds <- pred_enet
}
# 3. Regression Plot: Predicted vs Actual
plot_data <- data.frame(Actual = test_y, Predicted = best_preds)
library(ggplot2)
p1 <- ggplot(plot_data, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = paste("Best Model Performance:", best_model_name),
subtitle = "Actual vs Predicted Heat Related Admissions",
x = "Actual Admissions",
y = "Predicted Admissions") +
theme_minimal()
print(p1)
# 4. Corrected Variable Importance Plot
cat("\n Generating Variable Importance for:", best_model_name, "...\n")
##
## Generating Variable Importance for: Random Forest ...
if(best_model_name == "XGBoost") {
# Native XGBoost importance logic
imp_mat <- xgb.importance(model = reg_model_xgb)
xgb.plot.importance(imp_mat, top_n = 10, main = "XGBoost Variable Importance")
} else {
# Caret importance logic for RF and E-Net
importance <- varImp(best_model_obj)
plot(importance, main = paste("Top Factors Influencing Admissions (", best_model_name, ")", sep=""))
}
Training Environment and Data
Training Scale: The models were trained on a final dataset of 10,162 rows.
Methodology: Used 5-fold cross-validation (CV) to ensure stability across different data subsets.
Computing: Optimized for a single-thread environment during training.
Performance Comparison
Predictive Insights (XGBoost Analysis)
While Random Forest won on accuracy, the XGBoost visualization provides clear evidence of the model’s reliability:
Key Drivers of Heat Admissions
The Variable Importance analysis for the winning Random Forest model highlights the critical factors influencing hospital admissions:
Efficiency Summary
Goal:
Predict population health risk levels using climate, air quality, and socioeconomic indicators, focusing on identifying populations at higher risk from climate-sensitive health outcomes. For modeling purposes, health risk was binarized into High vs Not_High to simplify prediction, ensure stable model training, and emphasize populations in the top one-third of risk scores.
Approach: - 3 classification models (Logistic Regression, Random Forest, and k-Nearest Neighbors (kNN)) were trained to classify health risk. - The dataset was split into 80% training and 20% testing sets. - 5-fold cross-validation was applied on the training data, with ROC as the primary metric to handle potential class imbalance and select the best model. - Final model performance was evaluated on the unseen test set using accuracy, sensitivity, specificity, balanced accuracy, and confusion matrices. - Binarizing the target allows the models to focus on distinguishing high-risk populations from the rest, which is most relevant for policy and health intervention decisions.
# Install required packages for classification models
if (!require(nnet)) install.packages("nnet", repos = "https://cran.rstudio.com/")
## Loading required package: nnet
if (!require(randomForest)) install.packages("randomForest", repos = "https://cran.rstudio.com/")
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
if (!require(class)) install.packages("class", repos = "https://cran.rstudio.com/")
## Loading required package: class
# Load the packages
library(nnet) # For multinomial logistic regression
library(randomForest) # For random forest
library(class) # For k-NN
cat("All classification packages loaded successfully! ✓\n")
## All classification packages loaded successfully! ✓
# 1. Create Binary Target Variable
# Convert health outcomes into a binary classification problem to ensure model stability in Google Colab
# Classify whether a country-week observation represents High Health Risk or Not High Risk
# Create composite health risk score
df <- df %>%
mutate(
health_risk_score =
scale(respiratory_disease_rate) +
scale(heat_related_admissions) +
scale(cardio_mortality_rate)
)
# Convert into binary classification
df <- df %>%
mutate(
health_risk_binary = ifelse(
health_risk_score > quantile(health_risk_score, 0.66),
"High",
"Not_High"
)
)
# Convert to factor
df$health_risk_binary <- factor(df$health_risk_binary)
# 2. Select features
# Use climate, air quality, and socioeconomic indicators as predictors
classification_data <- df %>%
select(
country_name,
year,
temperature_celsius,
precipitation_mm,
heat_wave_days,
extreme_weather_events,
air_quality_index,
pm25_ugm3,
healthcare_access_index,
gdp_per_capita_usd,
food_security_index,
income_level,
health_risk_binary
)
# Convert categorical variable
classification_data$income_level <- factor(classification_data$income_level)
# 3. Split Data into Training and Test Sets (80/20)
# Evaluate model performance on unseen data
set.seed(123)
train_index <- createDataPartition(
classification_data$health_risk_binary,
p = 0.8,
list = FALSE
)
train_data <- classification_data[train_index, ]
test_data <- classification_data[-train_index, ]
To ensure robust and reliable model performance estimation
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
3 models were trained on the training set: - Logistic Regression: interpretable baseline model assuming linear relationships - Random Forest: ensemble model capable of capturing non-linear interactions and variable importance - k-Nearest Neighbors (kNN): distance-based model requiring feature scaling
Identifier variables such as country name and year were excluded to avoid data leakage
# Logistic Regression
model_logit <- train(
health_risk_binary ~ . -country_name -year, # Exclude identifiers
data = train_data,
method = "glm",
family = "binomial",
trControl = ctrl,
metric = "ROC"
)
# Random Forest
model_rf <- train(
health_risk_binary ~ . -country_name -year,
data = train_data,
method = "rf",
trControl = ctrl,
metric = "ROC",
importance = TRUE
)
# kNN
model_knn <- train(
health_risk_binary ~ . -country_name -year,
data = train_data,
method = "knn",
trControl = ctrl,
preProcess = c("center", "scale"),
tuneLength = 10,
metric = "ROC"
)
Cross-validated ROC scores were compared to assess model performance and stability across folds. Logistic Regression achieved the highest mean ROC, followed closely by Random Forest, while kNN showed slightly lower performance.
Considering ROC stability and the ability to model complex relationships, Random Forest was selected as the final model.
resamples_results <- resamples(
list(
Logistic = model_logit,
RandomForest = model_rf,
kNN = model_knn
)
)
summary(resamples_results)
##
## Call:
## summary.resamples(object = resamples_results)
##
## Models: Logistic, RandomForest, kNN
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Logistic 0.8009170 0.8057720 0.8070195 0.8093477 0.8146307 0.8183994 0
## RandomForest 0.7956561 0.7991288 0.8025629 0.8019161 0.8057803 0.8064524 0
## kNN 0.7813283 0.7866958 0.7872880 0.7909899 0.7916793 0.8079583 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Logistic 0.4959786 0.4979920 0.5073628 0.5145978 0.5268097 0.5448461 0
## RandomForest 0.4672021 0.4825737 0.4832664 0.4859409 0.4899598 0.5067024 0
## kNN 0.4812332 0.4819277 0.4872825 0.4969189 0.5107239 0.5234270 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Logistic 0.8875086 0.8957902 0.8992409 0.8998058 0.9068966 0.9095928 0
## RandomForest 0.8992409 0.9026915 0.9123533 0.9102942 0.9185645 0.9186207 0
## kNN 0.8833678 0.8909593 0.8910345 0.8927687 0.8978606 0.9006211 0
bwplot(resamples_results, metric = "ROC")
Logistic Regression achieved the highest mean ROC (~ 0.81), followed closely by Random Forest (~ 0.80) and kNN (~ 0.79). Sensitivity is modest (~ 0.45–0.51), indicating some high-risk cases are missed, but specificity is high (>0.90), meaning low-risk populations are accurately identified. Random Forest balances ROC stability and ability to handle non-linear relationships, making it the best choice for deployment.”
The selected models were evaluated on the unseen test dataset using accuracy, sensitivity, specificity, and balanced accuracy. These metrics provide insight into the model’s ability to correctly identify both high-risk and low-risk populations.
pred_logit <- predict(model_logit, test_data)
pred_rf <- predict(model_rf, test_data)
pred_knn <- predict(model_knn, test_data)
confusionMatrix(pred_logit, test_data$health_risk_binary)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Not_High
## High 449 153
## Not_High 484 1658
##
## Accuracy : 0.7679
## 95% CI : (0.7516, 0.7835)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4341
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4812
## Specificity : 0.9155
## Pos Pred Value : 0.7458
## Neg Pred Value : 0.7740
## Prevalence : 0.3400
## Detection Rate : 0.1636
## Detection Prevalence : 0.2194
## Balanced Accuracy : 0.6984
##
## 'Positive' Class : High
##
confusionMatrix(pred_rf, test_data$health_risk_binary)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Not_High
## High 422 140
## Not_High 511 1671
##
## Accuracy : 0.7628
## 95% CI : (0.7464, 0.7786)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.415
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4523
## Specificity : 0.9227
## Pos Pred Value : 0.7509
## Neg Pred Value : 0.7658
## Prevalence : 0.3400
## Detection Rate : 0.1538
## Detection Prevalence : 0.2048
## Balanced Accuracy : 0.6875
##
## 'Positive' Class : High
##
confusionMatrix(pred_knn, test_data$health_risk_binary)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Not_High
## High 444 167
## Not_High 489 1644
##
## Accuracy : 0.7609
## 95% CI : (0.7445, 0.7768)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4187
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4759
## Specificity : 0.9078
## Pos Pred Value : 0.7267
## Neg Pred Value : 0.7707
## Prevalence : 0.3400
## Detection Rate : 0.1618
## Detection Prevalence : 0.2227
## Balanced Accuracy : 0.6918
##
## 'Positive' Class : High
##
Although Logistic Regression achieved slightly higher accuracy (76.8%) and balanced accuracy (70%) compared to Random Forest, the differences in performance were marginal. Random Forest achieved an accuracy of approximately 76.6%, with high specificity (>92%) and moderate sensitivity (~45%). Random Forest was selected as the final model because it is better suited to capturing non-linear relationships and interactions among climate, air quality, and socioeconomic variables. Climate–health relationships are inherently complex and non-linear, and Random Forest provides greater modeling flexibility while maintaining comparable predictive performance. Additionally, Random Forest enables feature importance analysis, which supports interpretability and policy-relevant insights.
accuracy_results <- data.frame(
Model = c("Logistic Regression", "Random Forest", "kNN"),
Accuracy = c(
confusionMatrix(pred_logit, test_data$health_risk_binary)$overall["Accuracy"],
confusionMatrix(pred_rf, test_data$health_risk_binary)$overall["Accuracy"],
confusionMatrix(pred_knn, test_data$health_risk_binary)$overall["Accuracy"]
)
)
ggplot(accuracy_results, aes(x = Model, y = Accuracy, fill = Model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.2f", Accuracy)), vjust = -0.5) + # show accuracy above bars
ylim(0, 1) + # fix y-axis from 0 to 1 for clarity
theme_minimal() +
labs(title = "Model Accuracy Comparison", y = "Accuracy", x = "")
The bar chart confirms that all models perform similarly, with small differences in accuracy. Confusion matrices indicate trade-offs between sensitivity and specificity that should guide model selection depending on whether identifying High-risk populations is more critical than avoiding false positives.
cm <- confusionMatrix(pred_rf, test_data$health_risk_binary)$table
cm_df <- as.data.frame(cm)
colnames(cm_df) <- c("Predicted", "Actual", "Count")
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Count)) +
geom_tile() +
geom_text(aes(label = Count), color = "white", size = 5) +
scale_fill_gradient(low = "steelblue", high = "firebrick") +
labs(
title = "Confusion Matrix – Random Forest",
x = "Actual Class",
y = "Predicted Class"
) +
theme_minimal()
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# ROC curve for Random Forest
rf_probs <- predict(model_rf, test_data, type = "prob")
roc_rf <- roc(
response = test_data$health_risk_binary,
predictor = rf_probs$High,
levels = rev(levels(test_data$health_risk_binary))
)
## Setting direction: controls < cases
plot(
roc_rf,
main = "ROC Curve – Random Forest Classifier",
col = "darkblue",
lwd = 2
)
auc(roc_rf)
## Area under the curve: 0.8089
The ROC curve demonstrates strong discrimination between High-risk and Not High-risk populations, with an AUC of approximately 0.80. This indicates good overall classification performance beyond random chance.
Feature importance from the Random Forest model was examined to identify the most influential predictors of high health risk.
varImp(model_rf)
## rf variable importance
##
## Importance
## heat_wave_days 100.000
## temperature_celsius 76.491
## pm25_ugm3 53.400
## extreme_weather_events 46.219
## precipitation_mm 37.887
## air_quality_index 37.132
## food_security_index 16.671
## healthcare_access_index 10.067
## gdp_per_capita_usd 8.246
## income_levelUpper-Middle 4.309
## income_levelLower-Middle 0.000
# Extract feature importance
importance_rf <- varImp(model_rf, scale = TRUE)
# Convert importance to data frame safely
importance_df <- data.frame(
Feature = rownames(importance_rf$importance),
Importance = importance_rf$importance[, 1]
)
# Sort by importance
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]
# Plot
library(ggplot2)
ggplot(importance_df,
aes(x = reorder(Feature, Importance), y = Importance, fill = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = "Random Forest Feature Importance",
x = "Feature",
y = "Relative Importance"
) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
theme_minimal()
The most important predictors of high health risk were heat wave days, average temperature, and PM2.5 concentration, followed by extreme weather events and air quality index. Socioeconomic variables such as food security, GDP per capita, and healthcare access showed smaller but meaningful contributions.
These results highlight that climate extremes and air pollution are the dominant drivers of population health risk, while socioeconomic conditions influence resilience and vulnerability.
The Random Forest model was used to estimate the probability of each observation belonging to the High-risk class. The top 10 observations with the highest predicted probabilities were identified to highlight populations most vulnerable to climate-sensitive health outcomes.
# Predicted class labels
test_data$predicted_class <- predict(model_rf, test_data)
# Predicted probability of High-risk
probs <- predict(model_rf, test_data, type = "prob")
test_data$predicted_prob_high <- probs$High
top_high_risk <- test_data %>%
arrange(desc(predicted_prob_high)) %>%
head(10) %>%
select(country_name, year, health_risk_binary, predicted_class, predicted_prob_high)
top_high_risk
## # A tibble: 10 × 5
## country_name year health_risk_binary predicted_class predicted_prob_high
## <chr> <dbl> <fct> <fct> <dbl>
## 1 Philippines 2023 High High 0.99
## 2 Indonesia 2016 High High 0.988
## 3 Philippines 2020 High High 0.988
## 4 Italy 2022 High High 0.984
## 5 Philippines 2016 High High 0.984
## 6 India 2021 High High 0.982
## 7 Philippines 2019 High High 0.98
## 8 Philippines 2023 High High 0.98
## 9 Australia 2021 High High 0.978
## 10 Italy 2016 High High 0.978
The highest-risk populations were concentrated in the Philippines, Italy, Vietnam, France, and China across multiple years. Repeated high-risk predictions for the Philippines and Italy suggest persistent vulnerability driven by climate extremes and air quality conditions. These findings can help policymakers prioritize targeted health interventions and climate adaptation strategies.
# Aggregate predicted classes by year
risk_trend <- test_data %>%
group_by(year, predicted_class) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(year) %>%
mutate(percentage = count / sum(count) * 100)
# Plot
ggplot(risk_trend, aes(x = year, y = percentage, fill = predicted_class)) +
geom_area(alpha = 0.8) +
labs(
title = "Predicted Population Health Risk Over Time",
x = "Year",
y = "Percentage of Observations",
fill = "Health Risk"
) +
theme_minimal()
This visualization shows the changing proportion of populations classified as High risk over time. An increasing share of High-risk observations in recent years suggests growing climate-related health vulnerability, highlighting the need for stronger adaptation and healthcare preparedness.
# Count High-risk predictions by country
country_risk <- test_data %>%
filter(predicted_class == "High") %>%
count(country_name, sort = TRUE) %>%
top_n(10, n)
# Plot
ggplot(country_risk, aes(x = reorder(country_name, n), y = n)) +
geom_bar(stat = "identity", fill = "firebrick") +
coord_flip() +
labs(
title = "Countries with Most High-Risk Predictions",
x = "Country",
y = "Number of High-Risk Observations"
) +
theme_minimal()
While the highest-risk individual observations were concentrated in countries such as the Philippines and Italy, the frequency-based analysis reveals that countries like Indonesia, Kenya, and Bangladesh experience persistent high-risk conditions across multiple weeks. This distinction highlights the difference between extreme health risk events and chronic climate-health vulnerability, both of which are important for public health planning.
ggplot(test_data, aes(x = predicted_prob_high)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.8) +
labs(
title = "Distribution of Predicted High-Risk Probabilities",
x = "Predicted Probability of High Risk",
y = "Number of Observations"
) +
theme_minimal()
The distribution of predicted probabilities shows that many populations fall within a moderate-risk range, while a smaller group exhibits extremely high risk. This probability-based view allows policymakers to prioritize interventions not only for confirmed high-risk populations but also for those approaching critical risk thresholds.
ggplot(test_data, aes(
x = temperature_celsius,
y = pm25_ugm3,
color = predicted_class
)) +
geom_point(alpha = 0.6) +
labs(
title = "Climate and Air Quality Conditions by Predicted Health Risk",
x = "Average Temperature (°C)",
y = "PM2.5 Concentration (µg/m³)",
color = "Predicted Risk"
) +
theme_minimal()
High-risk populations tend to cluster in regions experiencing higher temperatures and elevated PM2.5 concentrations, reinforcing the combined role of heat and air pollution in driving climate-related health risks.