Dengue fever remains a significant global health challenge, with outbreaks causing substantial morbidity and mortality, particularly in tropical and subtropical regions. Understanding the factors influencing dengue outbreaks is critical for effective prevention and control strategies. This report examines the Dengue dataset, which includes features such as climatic variables, vegetation indices, and temporal data, alongside target variables representing the number of dengue cases (total_cases) in San Juan (SJ) and Iquitos (IQ). These features are essential for analyzing the factors influencing dengue outbreaks and building predictive models.
The analysis involved a structured methodology beginning with data cleaning, where missing values and irrelevant features were addressed to ensure dataset integrity. This was followed by feature engineering, where additional features were derived to enhance the predictive power of the models. Next, exploratory data analysis (EDA) was conducted to analyze patterns and distributions, providing valuable insights into the dataset and potential predictors of outcomes. In the modeling phase, multiple machine learning models, including Random Forest (RF) and XGBoost, were implemented, with a particular focus on handling class imbalances. Finally, the evaluation phase assessed model performance using Mean Absolute Error (MAE) to measure predictive accuracy.
City: Indicates the location of the data, either San Juan (SJ) or Iquitos (IQ).
Year: The year in which the data was recorded.
Week of the Year: The specific week within the year when the data was recorded.
Station Maximum Temperature (Kelvin): Maximum temperature recorded at the meteorological station.
Station Minimum Temperature (Kelvin): Minimum temperature recorded at the meteorological station.
Station Average Temperature (Kelvin): Mean temperature recorded at the meteorological station.
Precipitation Amount (mm): Total rainfall during the specified period.
NDVI Variables (Normalized Difference Vegetation Index):
NDVI_NE: NDVI for the northeast region.
NDVI_NW: NDVI for the northwest region.
NDVI_SE: NDVI for the southeast region.
NDVI_SW: NDVI for the southwest region.
Reanalysis Data:
Reanalysis Maximum Air Temperature (Kelvin): Highest air temperature observed during the period.
Reanalysis Minimum Air Temperature (Kelvin): Lowest air temperature observed during the period.
Reanalysis Dew Point Temperature (Kelvin): Dew point temperature, indicative of atmospheric moisture.
Reanalysis Precipitation Amount (mm): Modeled estimate of precipitation based on reanalysis data.
Sea-Level Pressure (millibars): Pressure measured at sea level.
Relative Humidity (%): Measure of moisture content in the air.
Wind Speed (m/s):
Station Average Wind Speed: Average wind speed recorded at the station.
Reanalysis Wind Speed: Modeled estimate of wind speed based on reanalysis data.
Total Cases: The target variable representing the number of dengue cases reported.
MAE is a common metric used to evaluate the accuracy of regression models. It measures the average magnitude of errors between the predicted and actual values, without considering their direction (it treats all errors as positive).
Characteristics:
MAE is always non-negative and a lower value indicates better model performance
MAE tells us the average amount by which the model’s predictions deviate from the actual values.
Since it is measured in the same units as the target variable, MAE is directly interprebale in terms of the variable being predicted.
The required libraries were loaded, and the datasets were
successfully imported. The dengue_labels_train dataset
contains the target variable, while dengue_features_train
provides the predictors. Both datasets were checked for dimensions and
basic structure, ensuring alignment between features and labels.
# Install all needed libraries if it is not present
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(dplyr)) install.packages("dplyr")
if(!require(caret)) install.packages("caret")
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
if(!require(xgboost)) install.packages("xgboost")
## Loading required package: xgboost
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
if(!require(randomForest)) install.packages("randomForest")
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
if(!require(kableExtra)) install.packages("kableExtra")
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
# Loading all needed libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(caret)
library(xgboost)
library(randomForest)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:randomForest':
##
## combine
##
## The following object is masked from 'package:dplyr':
##
## combine
library(ggcorrplot)
library(knitr)
library(kableExtra)
## Loading the dataset
dengue_labels_train <- read.csv("dengue_labels_train.csv")
dengue_features_train <- read.csv("dengue_features_train.csv")
The dataset consists of two main files: dengue_labels_train and dengue_features_train.
dengue_labels_train: This file contains 1,456 rows and 4 columns, including the city, year, week of the year, and the total reported dengue cases. This dataset serves as the target variable for predictive analysis.
dengue_features_train: This file contains 1,456 rows and 24 columns, detailing various environmental and climatic features like temperature, precipitation, and vegetation indices (NDVI) for the same time periods and locations. These features will be used to predict dengue cases.
# Check dimensions
dengue_labels_summary <- data.frame("Length" = nrow(dengue_labels_train), "Columns" = ncol(dengue_labels_train))
print(dengue_labels_summary)
## Length Columns
## 1 1456 4
dengue_features_summary <- data.frame("Length" = nrow(dengue_features_train), "Columns" = ncol(dengue_features_train))
print(dengue_features_summary)
## Length Columns
## 1 1456 24
These datasets provide a combination of climatic, temporal, and geographical data that makes this suitable for regression or time-series analysis.
head(dengue_features_train) %>% kable(caption="Dengue Features Train")%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_k | reanalysis_avg_temp_k | reanalysis_dew_point_temp_k | reanalysis_max_air_temp_k | reanalysis_min_air_temp_k | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sj | 1990 | 18 | 1990-04-30 | 0.1226000 | 0.1037250 | 0.1984833 | 0.1776167 | 12.42 | 297.5729 | 297.7429 | 292.4143 | 299.8 | 295.9 | 32.00 | 73.36571 | 12.42 | 14.01286 | 2.628571 | 25.44286 | 6.900000 | 29.4 | 20.0 | 16.0 |
| sj | 1990 | 19 | 1990-05-07 | 0.1699000 | 0.1421750 | 0.1623571 | 0.1554857 | 22.82 | 298.2114 | 298.4429 | 293.9514 | 300.9 | 296.4 | 17.94 | 77.36857 | 22.82 | 15.37286 | 2.371429 | 26.71429 | 6.371429 | 31.7 | 22.2 | 8.6 |
| sj | 1990 | 20 | 1990-05-14 | 0.0322500 | 0.1729667 | 0.1572000 | 0.1708429 | 34.54 | 298.7814 | 298.8786 | 295.4343 | 300.5 | 297.3 | 26.10 | 82.05286 | 34.54 | 16.84857 | 2.300000 | 26.71429 | 6.485714 | 32.2 | 22.8 | 41.4 |
| sj | 1990 | 21 | 1990-05-21 | 0.1286333 | 0.2450667 | 0.2275571 | 0.2358857 | 15.36 | 298.9871 | 299.2286 | 295.3100 | 301.4 | 297.0 | 13.90 | 80.33714 | 15.36 | 16.67286 | 2.428571 | 27.47143 | 6.771429 | 33.3 | 23.3 | 4.0 |
| sj | 1990 | 22 | 1990-05-28 | 0.1962000 | 0.2622000 | 0.2512000 | 0.2473400 | 7.52 | 299.5186 | 299.6643 | 295.8214 | 301.9 | 297.5 | 12.20 | 80.46000 | 7.52 | 17.21000 | 3.014286 | 28.94286 | 9.371429 | 35.0 | 23.9 | 5.8 |
| sj | 1990 | 23 | 1990-06-04 | NA | 0.1748500 | 0.2543143 | 0.1817429 | 9.58 | 299.6300 | 299.7643 | 295.8514 | 302.4 | 298.1 | 26.49 | 79.89143 | 9.58 | 17.21286 | 2.100000 | 28.11429 | 6.942857 | 34.4 | 23.9 | 39.1 |
head(dengue_labels_train) %>% kable(caption="dengue labels train")%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | total_cases |
|---|---|---|---|
| sj | 1990 | 18 | 4 |
| sj | 1990 | 19 | 5 |
| sj | 1990 | 20 | 4 |
| sj | 1990 | 21 | 3 |
| sj | 1990 | 22 | 6 |
| sj | 1990 | 23 | 2 |
This bar chart shows the total dengue cases by year from 1990 to 2010. Certain years, such as those with sharp peaks in 1995 and 1998, indicate major outbreaks. With no consistent upward or downward trend, this variability points to potential environmental or seasonal factors.
# Group by year and aggregate total cases
year_max_cases <- dengue_labels_train %>%
group_by(year) %>%
summarise(total_cases = sum(total_cases, na.rm = TRUE))
# Create the bar plot
ggplot(year_max_cases, aes(x = as.factor(year), y = total_cases)) +
geom_bar(stat = "identity", fill = "purple", color = "black") +
labs(title = "Total Dengue Cases by Year", x = "Year", y = "Total Cases") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 50, hjust = 1))
The bar chart illustrates the total dengue cases across two cities, “iq” (Iquitos) and “sj” (San Juan). It is evident that San Juan has significantly more dengue cases compared to Iquitos, with totals exceeding 30,000 cases in San Juan versus a substantially lower number in Iquitos. This stark disparity suggests that San Juan might face higher risk factors such as climate, population density, or public health conditions conducive to dengue transmission.
# Group by city and aggregate total cases
city_affected <- dengue_labels_train %>%
group_by(city) %>%
summarise(total_cases = sum(total_cases, na.rm = TRUE))
# Create the bar plot
ggplot(city_affected, aes(x = city, y = total_cases)) +
geom_bar(stat = "identity", fill = "lightblue", color = "black") +
labs(title = "Total Cases of Dengue by City", x = "City", y = "Total Cases of Dengue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 2)) +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
The table highlights the number of missing values for specific
columns in the dataset. The column ndvi_ne has the highest
count of missing values with 194, followed by ndvi_nw with
52, and both ndvi_se and ndvi_sw with 22
missing values each.
# Missing values by column
missing_values_by_column <- dengue_features_train %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "MissingCount") %>%
filter(MissingCount > 0)
head(missing_values_by_column) %>% kable(caption="Missing values by column")%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Column | MissingCount |
|---|---|
| ndvi_ne | 194 |
| ndvi_nw | 52 |
| ndvi_se | 22 |
| ndvi_sw | 22 |
| precipitation_amt_mm | 13 |
| reanalysis_air_temp_k | 10 |
# Rows with missing values
rows_with_missing <- dengue_features_train %>%
filter(if_any(everything(), is.na)) %>%
summarise(Count = n())
print(rows_with_missing)
## Count
## 1 257
The missing values were replaced with the mean of the respective
columns using the mutate and across functions
in R. By imputing the missing values with the column-wise mean, the
dataset’s completeness and suitability for analysis are improved without
introducing significant bias. The table shows that no missing values
remain.
# Fill the missing values with mean
dengue_features_train <- dengue_features_train %>%
mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# This shows that the missing values are resolved
head(dengue_features_train) %>% kable(caption="Dengue Features Train") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_k | reanalysis_avg_temp_k | reanalysis_dew_point_temp_k | reanalysis_max_air_temp_k | reanalysis_min_air_temp_k | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sj | 1990 | 18 | 1990-04-30 | 0.1226000 | 0.1037250 | 0.1984833 | 0.1776167 | 12.42 | 297.5729 | 297.7429 | 292.4143 | 299.8 | 295.9 | 32.00 | 73.36571 | 12.42 | 14.01286 | 2.628571 | 25.44286 | 6.900000 | 29.4 | 20.0 | 16.0 |
| sj | 1990 | 19 | 1990-05-07 | 0.1699000 | 0.1421750 | 0.1623571 | 0.1554857 | 22.82 | 298.2114 | 298.4429 | 293.9514 | 300.9 | 296.4 | 17.94 | 77.36857 | 22.82 | 15.37286 | 2.371429 | 26.71429 | 6.371429 | 31.7 | 22.2 | 8.6 |
| sj | 1990 | 20 | 1990-05-14 | 0.0322500 | 0.1729667 | 0.1572000 | 0.1708429 | 34.54 | 298.7814 | 298.8786 | 295.4343 | 300.5 | 297.3 | 26.10 | 82.05286 | 34.54 | 16.84857 | 2.300000 | 26.71429 | 6.485714 | 32.2 | 22.8 | 41.4 |
| sj | 1990 | 21 | 1990-05-21 | 0.1286333 | 0.2450667 | 0.2275571 | 0.2358857 | 15.36 | 298.9871 | 299.2286 | 295.3100 | 301.4 | 297.0 | 13.90 | 80.33714 | 15.36 | 16.67286 | 2.428571 | 27.47143 | 6.771429 | 33.3 | 23.3 | 4.0 |
| sj | 1990 | 22 | 1990-05-28 | 0.1962000 | 0.2622000 | 0.2512000 | 0.2473400 | 7.52 | 299.5186 | 299.6643 | 295.8214 | 301.9 | 297.5 | 12.20 | 80.46000 | 7.52 | 17.21000 | 3.014286 | 28.94286 | 9.371429 | 35.0 | 23.9 | 5.8 |
| sj | 1990 | 23 | 1990-06-04 | 0.1422935 | 0.1748500 | 0.2543143 | 0.1817429 | 9.58 | 299.6300 | 299.7643 | 295.8514 | 302.4 | 298.1 | 26.49 | 79.89143 | 9.58 | 17.21286 | 2.100000 | 28.11429 | 6.942857 | 34.4 | 23.9 | 39.1 |
Temperature features were converted from Kelvin to Celsius for better interpretability. These variables are crucial for understanding climatic influences on dengue outbreaks. The updated column names also improve clarity.
# Temperature
# Finding columns with "_temp_k"
temperature_kelvin_columns <- grep("_temp_k", names(dengue_features_train), value = TRUE)
print(temperature_kelvin_columns)
## [1] "reanalysis_air_temp_k" "reanalysis_avg_temp_k"
## [3] "reanalysis_dew_point_temp_k" "reanalysis_max_air_temp_k"
## [5] "reanalysis_min_air_temp_k"
# Convert Kelvin to Celsius for the identified columns
dengue_features_train[temperature_kelvin_columns] <- dengue_features_train[temperature_kelvin_columns] - 273.15
# Rename columns by replacing '_temp_k' with '_temp_c'
names(dengue_features_train) <- gsub("_temp_k", "_temp_c", names(dengue_features_train))
# View updated column names
print(names(dengue_features_train))
## [1] "city"
## [2] "year"
## [3] "weekofyear"
## [4] "week_start_date"
## [5] "ndvi_ne"
## [6] "ndvi_nw"
## [7] "ndvi_se"
## [8] "ndvi_sw"
## [9] "precipitation_amt_mm"
## [10] "reanalysis_air_temp_c"
## [11] "reanalysis_avg_temp_c"
## [12] "reanalysis_dew_point_temp_c"
## [13] "reanalysis_max_air_temp_c"
## [14] "reanalysis_min_air_temp_c"
## [15] "reanalysis_precip_amt_kg_per_m2"
## [16] "reanalysis_relative_humidity_percent"
## [17] "reanalysis_sat_precip_amt_mm"
## [18] "reanalysis_specific_humidity_g_per_kg"
## [19] "reanalysis_tdtr_k"
## [20] "station_avg_temp_c"
## [21] "station_diur_temp_rng_c"
## [22] "station_max_temp_c"
## [23] "station_min_temp_c"
## [24] "station_precip_mm"
Rounding improves data consistency and readability, especially when dealing with continuous variables. This step is essential for visualizations, summaries, and reports, where overly precise values may be unnecessary.
# Round all numerical columns to 3 decimal places
dengue_features_train <- dengue_features_train %>%
mutate(across(where(is.numeric), ~ round(., 3)))
# Display the first few rows of the dataset
head(dengue_features_train) %>% kable(caption="Dengue Features Train") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sj | 1990 | 18 | 1990-04-30 | 0.123 | 0.104 | 0.198 | 0.178 | 12.42 | 24.423 | 24.593 | 19.264 | 26.65 | 22.75 | 32.00 | 73.366 | 12.42 | 14.013 | 2.629 | 25.443 | 6.900 | 29.4 | 20.0 | 16.0 |
| sj | 1990 | 19 | 1990-05-07 | 0.170 | 0.142 | 0.162 | 0.155 | 22.82 | 25.061 | 25.293 | 20.801 | 27.75 | 23.25 | 17.94 | 77.369 | 22.82 | 15.373 | 2.371 | 26.714 | 6.371 | 31.7 | 22.2 | 8.6 |
| sj | 1990 | 20 | 1990-05-14 | 0.032 | 0.173 | 0.157 | 0.171 | 34.54 | 25.631 | 25.729 | 22.284 | 27.35 | 24.15 | 26.10 | 82.053 | 34.54 | 16.849 | 2.300 | 26.714 | 6.486 | 32.2 | 22.8 | 41.4 |
| sj | 1990 | 21 | 1990-05-21 | 0.129 | 0.245 | 0.228 | 0.236 | 15.36 | 25.837 | 26.079 | 22.160 | 28.25 | 23.85 | 13.90 | 80.337 | 15.36 | 16.673 | 2.429 | 27.471 | 6.771 | 33.3 | 23.3 | 4.0 |
| sj | 1990 | 22 | 1990-05-28 | 0.196 | 0.262 | 0.251 | 0.247 | 7.52 | 26.369 | 26.514 | 22.671 | 28.75 | 24.35 | 12.20 | 80.460 | 7.52 | 17.210 | 3.014 | 28.943 | 9.371 | 35.0 | 23.9 | 5.8 |
| sj | 1990 | 23 | 1990-06-04 | 0.142 | 0.175 | 0.254 | 0.182 | 9.58 | 26.480 | 26.614 | 22.701 | 29.25 | 24.95 | 26.49 | 79.891 | 9.58 | 17.213 | 2.100 | 28.114 | 6.943 | 34.4 | 23.9 | 39.1 |
These columns represent the average of maximum and minimum temperatures derived from two different sources: station recordings and reanalysis data. These new variables can be useful for identifying correlations between temperature trends and dengue case patterns.
avg_station_max_min: Average of station_max_temp_c and station_min_temp_c.
avg_analysis_max_min: Average of reanalysis_max_air_temp_c and reanalysis_min_air_temp_c.
#Average (Max and min)
# Create a new column 'avg_station_max_min'
dengue_features_train <- dengue_features_train %>%
mutate(avg_station_max_min = (station_max_temp_c + station_min_temp_c) / 2)
# Display the first few values of the new column
head(dengue_features_train$avg_station_max_min)
## [1] 24.70 26.95 27.50 28.30 29.45 29.15
# Create a new column 'avg_analysis_max_min'
dengue_features_train <- dengue_features_train %>%
mutate(avg_analysis_max_min = (reanalysis_max_air_temp_c + reanalysis_min_air_temp_c) / 2)
# Display the first few values of the new column
head(dengue_features_train$avg_analysis_max_min)
## [1] 24.70 25.50 25.75 26.05 26.55 27.10
Adding the total_cases column from the dengue_labels_train dataset to the dengue_features_train dataset ensures that the target variable is available within the both datasets. This streamlines analysis and model training processes.
# Add the 'total_cases' column from dengue_labels to dengue_features
dengue_features_train$total_cases <- dengue_labels_train$total_cases
# View the first few rows to verify
head(dengue_features_train) %>% kable(caption="Dengue Features Train") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sj | 1990 | 18 | 1990-04-30 | 0.123 | 0.104 | 0.198 | 0.178 | 12.42 | 24.423 | 24.593 | 19.264 | 26.65 | 22.75 | 32.00 | 73.366 | 12.42 | 14.013 | 2.629 | 25.443 | 6.900 | 29.4 | 20.0 | 16.0 | 24.70 | 24.70 | 4 |
| sj | 1990 | 19 | 1990-05-07 | 0.170 | 0.142 | 0.162 | 0.155 | 22.82 | 25.061 | 25.293 | 20.801 | 27.75 | 23.25 | 17.94 | 77.369 | 22.82 | 15.373 | 2.371 | 26.714 | 6.371 | 31.7 | 22.2 | 8.6 | 26.95 | 25.50 | 5 |
| sj | 1990 | 20 | 1990-05-14 | 0.032 | 0.173 | 0.157 | 0.171 | 34.54 | 25.631 | 25.729 | 22.284 | 27.35 | 24.15 | 26.10 | 82.053 | 34.54 | 16.849 | 2.300 | 26.714 | 6.486 | 32.2 | 22.8 | 41.4 | 27.50 | 25.75 | 4 |
| sj | 1990 | 21 | 1990-05-21 | 0.129 | 0.245 | 0.228 | 0.236 | 15.36 | 25.837 | 26.079 | 22.160 | 28.25 | 23.85 | 13.90 | 80.337 | 15.36 | 16.673 | 2.429 | 27.471 | 6.771 | 33.3 | 23.3 | 4.0 | 28.30 | 26.05 | 3 |
| sj | 1990 | 22 | 1990-05-28 | 0.196 | 0.262 | 0.251 | 0.247 | 7.52 | 26.369 | 26.514 | 22.671 | 28.75 | 24.35 | 12.20 | 80.460 | 7.52 | 17.210 | 3.014 | 28.943 | 9.371 | 35.0 | 23.9 | 5.8 | 29.45 | 26.55 | 6 |
| sj | 1990 | 23 | 1990-06-04 | 0.142 | 0.175 | 0.254 | 0.182 | 9.58 | 26.480 | 26.614 | 22.701 | 29.25 | 24.95 | 26.49 | 79.891 | 9.58 | 17.213 | 2.100 | 28.114 | 6.943 | 34.4 | 23.9 | 39.1 | 29.15 | 27.10 | 2 |
summary_table <- as.list(summary(dengue_features_train$total_cases))
summary_df <- data.frame(
Statistic = names(summary_table),
Value = as.numeric(summary_table)
)
# Summary of Total Cases
head(summary_df) %>% kable(caption="Summary of Total Cases") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Statistic | Value |
|---|---|
| Min. | 0.00000 |
| 1st Qu. | 5.00000 |
| Median | 12.00000 |
| Mean | 24.67514 |
| 3rd Qu. | 28.00000 |
| Max. | 461.00000 |
The correlation matrix revealed relationships between various numerical features in the dengue dataset. Strong correlations were observed between certain temperature features and dengue cases, whereas NDVI values showed weaker relationships. This analysis helps identify the features most likely to improve model performance.
# CORRELATION MATRIX
# Select only numerical columns
numeric_columns <- dengue_features_train %>%
select(where(is.numeric))
# Create the correlation matrix
correlation_matrix <- cor(numeric_columns, use = "complete.obs")
# Print the correlation matrix
head(correlation_matrix) %>% kable(caption="Correlation Matrix") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| year | weekofyear | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | 1.0000000 | -0.0716490 | 0.2060195 | 0.1404616 | 0.2316619 | 0.2747632 | 0.2038558 | -0.1394323 | 0.0846750 | 0.1317295 | 0.4784084 | -0.3907229 | 0.1320744 | 0.3349531 | 0.2038558 | 0.1609286 | 0.4885040 | 0.0464154 | 0.3688217 | 0.2254525 | -0.2134673 | 0.2178466 | 0.0391647 | 0.2049803 | -0.3068055 |
| weekofyear | -0.0716490 | 1.0000000 | 0.0493891 | 0.0477166 | 0.1197071 | 0.0677387 | 0.1166278 | 0.4213454 | 0.4588829 | 0.3453326 | 0.2356121 | 0.1778180 | 0.0721991 | -0.0150331 | 0.1166278 | 0.3471847 | 0.0771254 | 0.3611733 | 0.0227651 | 0.2293976 | 0.2887670 | 0.0652627 | 0.3374806 | 0.4581123 | 0.2164519 |
| ndvi_ne | 0.2060195 | 0.0493891 | 1.0000000 | 0.8069591 | 0.5864105 | 0.6441809 | 0.1901420 | -0.3196147 | -0.0351797 | 0.0518173 | 0.6046773 | -0.5944221 | 0.1888210 | 0.4399884 | 0.1901420 | 0.0876688 | 0.6470540 | 0.1665617 | 0.6163515 | 0.4516151 | -0.2938920 | 0.2301899 | 0.1571090 | 0.1620626 | -0.1574710 |
| ndvi_nw | 0.1404616 | 0.0477166 | 0.8069591 | 1.0000000 | 0.5513386 | 0.6473126 | 0.1888090 | -0.3124682 | -0.0303539 | 0.0623737 | 0.6002513 | -0.5815113 | 0.1888959 | 0.4441694 | 0.1888090 | 0.0991279 | 0.6388308 | 0.1972800 | 0.6321415 | 0.4813762 | -0.2968213 | 0.2171099 | 0.1771318 | 0.1691337 | -0.1368746 |
| ndvi_se | 0.2316619 | 0.1197071 | 0.5864105 | 0.5513386 | 1.0000000 | 0.8211246 | 0.0742664 | -0.1568604 | 0.0512258 | 0.0028531 | 0.4664288 | -0.4063497 | 0.0346166 | 0.2147492 | 0.0742664 | 0.0287796 | 0.4841222 | 0.0872994 | 0.4399119 | 0.2884974 | -0.2497454 | 0.1296173 | 0.0638904 | 0.1753348 | -0.1174623 |
| ndvi_sw | 0.2747632 | 0.0677387 | 0.6441809 | 0.6473126 | 0.8211246 | 1.0000000 | 0.1234938 | -0.2268030 | 0.0226081 | 0.0208516 | 0.5444165 | -0.4922419 | 0.1012813 | 0.3104171 | 0.1234938 | 0.0534429 | 0.5671607 | 0.1182781 | 0.5331198 | 0.3739951 | -0.2919345 | 0.1565436 | 0.1015449 | 0.1873356 | -0.1373575 |
# Visualize the correlation matrix
library(ggcorrplot)
# Plot the correlation matrix
ggcorrplot(correlation_matrix,
method = "square",
type = "lower",
lab = TRUE,
tl.cex = 8,
tl.srt = 45,
lab_size = 0, # Too many labels block the visual
colors = c("blue", "white", "red")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1), # Tilt x-axis text
axis.text.y = element_text(size = 8) # Adjust y-axis text size
)
Highly Correlated Variables:
ndvi_ne and ndvi_nw: 0.81 (Strong
positive correlation).
ndvi_se and ndvi_sw: 0.82 (Strong
positive correlation).
ndvi_sw and ndvi_ne: 0.64 (Moderate
positive correlation).
Weak or Negative Correlations:
station_min_temp_c shows weak or negative
correlations with most variables.
Example: station_min_temp_c and ndvi_sw: -0.29 (Weak
negative correlation).
Temperature and Precipitation:
station_max_temp_c is moderately correlated
with ndvi_ne (0.45)
and ndvi_nw (0.48), suggesting that
vegetation and temperature may share some relationship.
precipitation_amt_mm shows very weak correlations
with other variables, indicating it might have less direct impact on
these NDVI values.
# Boxplot of total_cases by city
ggplot(dengue_features_train, aes(x = city, y = total_cases)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(
title = "Distribution of Total Cases by City",
x = "City",
y = "Total Cases"
) +
theme_minimal()
# Assuming 'x' is a vector of column names for NDVI features
x <- c("ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw")
# Create individual scatter plots
plot1 <- ggplot(dengue_features_train, aes_string(x = x[1], y = "total_cases")) +
geom_point(color = "red") +
labs(title = paste("Total Cases vs", x[1]), x = x[1], y = "Total Cases") +
theme_minimal()
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot2 <- ggplot(dengue_features_train, aes_string(x = x[2], y = "total_cases")) +
geom_point(color = "green") +
labs(title = paste("Total Cases vs", x[2]), x = x[2], y = "Total Cases") +
theme_minimal()
plot3 <- ggplot(dengue_features_train, aes_string(x = x[3], y = "total_cases")) +
geom_point(color = "blue") +
labs(title = paste("Total Cases vs", x[3]), x = x[3], y = "Total Cases") +
theme_minimal()
plot4 <- ggplot(dengue_features_train, aes_string(x = x[4], y = "total_cases")) +
geom_point(color = "#B16150") +
labs(title = paste("Total Cases vs", x[4]), x = x[4], y = "Total Cases") +
theme_minimal()
# Arrange the plots in a 2x2 grid
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
The scatter plots show the relationships between different NDVI
indices (plot1: ndvi_ne, plot2:
ndvi_nw, plot3: ndvi_se, plot4:
ndvi_sw) and the total dengue cases.
Centralized Clustering:
Vertical Spread (Total Cases):
Symmetry:
# total cases vs precipitation
# Assuming 'x' is a vector containing column names
x <- c("ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw", "precipitation_amt_mm")
# Create a scatter plot
ggplot(dengue_features_train, aes_string(x = x[5], y = "total_cases")) +
geom_point(color = "orange") +
labs(title = paste("Total Cases vs", x[5]),
x = x[5],
y = "Total Cases") +
theme_minimal() +
theme(legend.position = "none")
High Density Near Zero Precipitation:
Decrease in Density with Higher Precipitation:
Extreme Cases:
Lack of a Clear Trend:
# Select numerical variables
numeric_vars <- dengue_features_train %>%
select(ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw, station_max_temp_c, station_min_temp_c, precipitation_amt_mm)
# Compute correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# Visualize the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, method = "square", lab = TRUE)
Strong Positive Correlations Among NDVI Features:
ndvi_ne, ndvi_nw, ndvi_se, and
ndvi_sw) exhibit strong positive correlations with one
another (above 0.8). This indicates that vegetation indices across
different regions tend to move together, likely due to shared
environmental conditions.Moderate Correlation Between Station Maximum Temperature and NDVI:
station_max_temp_c and NDVI variables. This suggests that
higher maximum temperatures are associated with greener vegetation,
which aligns with the idea that warmer climates often foster vegetation
growth.Weak Relationships with Precipitation and Minimum Temperature:
precipitation_amt_mm) and minimum
temperature (station_min_temp_c) show weak or negative
correlations with most variables. This implies that these factors are
less directly linked to vegetation or maximum temperature patterns in
this dataset.Implications for Modeling:
The dengue cases of San Juan sj and Iquitos iq are not dependent, so we need to split them into different data frames for individual analysis.
This process ensures that the feature sets for the two cities are tailored to their respective data, which improves model accuracy and interpretability. Handling collinearity and removing irrelevant columns help models better capture patterns without being influenced by redundant or non-informative data.
# FEATURE ENGINEERING
# Split the data by city
X_sj <- dengue_features_train[dengue_features_train$city == 'sj', ]
X_iq <- dengue_features_train[dengue_features_train$city == 'iq', ]
# Drop specific columns
X_sj <- X_sj[, !(names(X_sj) %in% c('city', 'weekofyear', 'week_start_date'))]
X_iq <- X_iq[, !(names(X_iq) %in% c('city', 'weekofyear', 'week_start_date'))]
# One-hot encode the 'year' column
one_hot_sj <- model.matrix(~ year - 1, data = X_sj)
X_sj <- X_sj[, !names(X_sj) %in% 'year']
X_sj <- cbind(X_sj, one_hot_sj)
one_hot_iq <- model.matrix(~ year - 1, data = X_iq)
X_iq <- X_iq[, !names(X_iq) %in% 'year']
X_iq <- cbind(X_iq, one_hot_iq)
# View the transformed datasets
head(X_sj) %>% kable(caption="X San Juan") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.123 | 0.104 | 0.198 | 0.178 | 12.42 | 24.423 | 24.593 | 19.264 | 26.65 | 22.75 | 32.00 | 73.366 | 12.42 | 14.013 | 2.629 | 25.443 | 6.900 | 29.4 | 20.0 | 16.0 | 24.70 | 24.70 | 4 | 1990 |
| 0.170 | 0.142 | 0.162 | 0.155 | 22.82 | 25.061 | 25.293 | 20.801 | 27.75 | 23.25 | 17.94 | 77.369 | 22.82 | 15.373 | 2.371 | 26.714 | 6.371 | 31.7 | 22.2 | 8.6 | 26.95 | 25.50 | 5 | 1990 |
| 0.032 | 0.173 | 0.157 | 0.171 | 34.54 | 25.631 | 25.729 | 22.284 | 27.35 | 24.15 | 26.10 | 82.053 | 34.54 | 16.849 | 2.300 | 26.714 | 6.486 | 32.2 | 22.8 | 41.4 | 27.50 | 25.75 | 4 | 1990 |
| 0.129 | 0.245 | 0.228 | 0.236 | 15.36 | 25.837 | 26.079 | 22.160 | 28.25 | 23.85 | 13.90 | 80.337 | 15.36 | 16.673 | 2.429 | 27.471 | 6.771 | 33.3 | 23.3 | 4.0 | 28.30 | 26.05 | 3 | 1990 |
| 0.196 | 0.262 | 0.251 | 0.247 | 7.52 | 26.369 | 26.514 | 22.671 | 28.75 | 24.35 | 12.20 | 80.460 | 7.52 | 17.210 | 3.014 | 28.943 | 9.371 | 35.0 | 23.9 | 5.8 | 29.45 | 26.55 | 6 | 1990 |
| 0.142 | 0.175 | 0.254 | 0.182 | 9.58 | 26.480 | 26.614 | 22.701 | 29.25 | 24.95 | 26.49 | 79.891 | 9.58 | 17.213 | 2.100 | 28.114 | 6.943 | 34.4 | 23.9 | 39.1 | 29.15 | 27.10 | 2 | 1990 |
head(X_iq) %>% kable(caption="X Iquitos") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 0.193 | 0.132 | 0.341 | 0.247 | 25.41 | 23.590 | 25.300 | 22.034 | 34.15 | 19.95 | 43.19 | 92.419 | 25.41 | 16.651 | 8.929 | 26.400 | 10.775 | 32.5 | 20.7 | 3.0 | 26.60 | 27.05 | 0 | 2000 |
| 938 | 0.217 | 0.276 | 0.289 | 0.242 | 60.61 | 23.484 | 25.279 | 22.209 | 33.45 | 17.95 | 46.00 | 93.581 | 60.61 | 16.863 | 10.314 | 26.900 | 11.567 | 34.0 | 20.8 | 55.6 | 27.40 | 25.70 | 0 | 2000 |
| 939 | 0.177 | 0.173 | 0.204 | 0.128 | 55.52 | 23.266 | 24.243 | 22.473 | 31.35 | 19.45 | 64.77 | 95.849 | 55.52 | 17.120 | 7.386 | 26.800 | 11.467 | 33.0 | 20.7 | 38.1 | 26.85 | 25.40 | 0 | 2000 |
| 940 | 0.228 | 0.145 | 0.254 | 0.200 | 5.60 | 22.207 | 23.079 | 19.647 | 30.45 | 15.45 | 23.96 | 87.234 | 5.60 | 14.431 | 9.114 | 25.767 | 10.533 | 31.5 | 14.7 | 30.0 | 23.10 | 22.95 | 0 | 2000 |
| 941 | 0.329 | 0.322 | 0.254 | 0.361 | 62.76 | 23.283 | 24.486 | 20.807 | 33.85 | 18.35 | 31.80 | 88.161 | 62.76 | 15.444 | 9.500 | 26.600 | 11.480 | 33.3 | 19.1 | 4.0 | 26.20 | 26.10 | 0 | 2000 |
| 942 | 0.206 | 0.191 | 0.232 | 0.255 | 16.24 | 24.041 | 25.136 | 18.576 | 33.05 | 15.35 | 1.00 | 74.729 | 16.24 | 13.421 | 13.771 | 25.340 | 10.940 | 32.0 | 17.0 | 11.5 | 24.50 | 24.20 | 0 | 2000 |
# Select columns related to years (assuming years are the last columns in reverse order)
years <- rev(names(X_sj)[24:length(names(X_sj))])
# Subset and rearrange the dataset
X_sj <- cbind(X_sj[, years], X_sj[, 1:24])
# Display the first few rows
head(X_sj) %>% kable(caption="X San Juan") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| X_sj[, years] | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1990 | 0.123 | 0.104 | 0.198 | 0.178 | 12.42 | 24.423 | 24.593 | 19.264 | 26.65 | 22.75 | 32.00 | 73.366 | 12.42 | 14.013 | 2.629 | 25.443 | 6.900 | 29.4 | 20.0 | 16.0 | 24.70 | 24.70 | 4 | 1990 |
| 1990 | 0.170 | 0.142 | 0.162 | 0.155 | 22.82 | 25.061 | 25.293 | 20.801 | 27.75 | 23.25 | 17.94 | 77.369 | 22.82 | 15.373 | 2.371 | 26.714 | 6.371 | 31.7 | 22.2 | 8.6 | 26.95 | 25.50 | 5 | 1990 |
| 1990 | 0.032 | 0.173 | 0.157 | 0.171 | 34.54 | 25.631 | 25.729 | 22.284 | 27.35 | 24.15 | 26.10 | 82.053 | 34.54 | 16.849 | 2.300 | 26.714 | 6.486 | 32.2 | 22.8 | 41.4 | 27.50 | 25.75 | 4 | 1990 |
| 1990 | 0.129 | 0.245 | 0.228 | 0.236 | 15.36 | 25.837 | 26.079 | 22.160 | 28.25 | 23.85 | 13.90 | 80.337 | 15.36 | 16.673 | 2.429 | 27.471 | 6.771 | 33.3 | 23.3 | 4.0 | 28.30 | 26.05 | 3 | 1990 |
| 1990 | 0.196 | 0.262 | 0.251 | 0.247 | 7.52 | 26.369 | 26.514 | 22.671 | 28.75 | 24.35 | 12.20 | 80.460 | 7.52 | 17.210 | 3.014 | 28.943 | 9.371 | 35.0 | 23.9 | 5.8 | 29.45 | 26.55 | 6 | 1990 |
| 1990 | 0.142 | 0.175 | 0.254 | 0.182 | 9.58 | 26.480 | 26.614 | 22.701 | 29.25 | 24.95 | 26.49 | 79.891 | 9.58 | 17.213 | 2.100 | 28.114 | 6.943 | 34.4 | 23.9 | 39.1 | 29.15 | 27.10 | 2 | 1990 |
# Select columns related to years (assuming years are the last columns in reverse order)
years <- rev(names(X_iq)[24:length(names(X_iq))])
# Subset and rearrange the dataset
X_iq <- cbind(X_iq[, years], X_iq[, 1:24])
# Display the first few rows
head(X_iq) %>% kable(caption="X Iquitos") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| X_iq[, years] | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 2000 | 0.193 | 0.132 | 0.341 | 0.247 | 25.41 | 23.590 | 25.300 | 22.034 | 34.15 | 19.95 | 43.19 | 92.419 | 25.41 | 16.651 | 8.929 | 26.400 | 10.775 | 32.5 | 20.7 | 3.0 | 26.60 | 27.05 | 0 | 2000 |
| 938 | 2000 | 0.217 | 0.276 | 0.289 | 0.242 | 60.61 | 23.484 | 25.279 | 22.209 | 33.45 | 17.95 | 46.00 | 93.581 | 60.61 | 16.863 | 10.314 | 26.900 | 11.567 | 34.0 | 20.8 | 55.6 | 27.40 | 25.70 | 0 | 2000 |
| 939 | 2000 | 0.177 | 0.173 | 0.204 | 0.128 | 55.52 | 23.266 | 24.243 | 22.473 | 31.35 | 19.45 | 64.77 | 95.849 | 55.52 | 17.120 | 7.386 | 26.800 | 11.467 | 33.0 | 20.7 | 38.1 | 26.85 | 25.40 | 0 | 2000 |
| 940 | 2000 | 0.228 | 0.145 | 0.254 | 0.200 | 5.60 | 22.207 | 23.079 | 19.647 | 30.45 | 15.45 | 23.96 | 87.234 | 5.60 | 14.431 | 9.114 | 25.767 | 10.533 | 31.5 | 14.7 | 30.0 | 23.10 | 22.95 | 0 | 2000 |
| 941 | 2000 | 0.329 | 0.322 | 0.254 | 0.361 | 62.76 | 23.283 | 24.486 | 20.807 | 33.85 | 18.35 | 31.80 | 88.161 | 62.76 | 15.444 | 9.500 | 26.600 | 11.480 | 33.3 | 19.1 | 4.0 | 26.20 | 26.10 | 0 | 2000 |
| 942 | 2000 | 0.206 | 0.191 | 0.232 | 0.255 | 16.24 | 24.041 | 25.136 | 18.576 | 33.05 | 15.35 | 1.00 | 74.729 | 16.24 | 13.421 | 13.771 | 25.340 | 10.940 | 32.0 | 17.0 | 11.5 | 24.50 | 24.20 | 0 | 2000 |
The removal of the first year in each dataset (1990 for San Juan and 2000 for Iquitos) help avoid the dummy variable trap and avoid redundancy in one-hot encoded columns while retaining all relevant information for predictive modeling.
# Removing dummy variable trap (dropping specific year columns)
X_sj <- X_sj[, !(names(X_sj) %in% "1990")]
X_iq <- X_iq[, !(names(X_iq) %in% "2000")]
# Convert the data frames to arrays (matrices in R)
x_sj_arr <- as.matrix(X_sj)
x_iq_arr <- as.matrix(X_iq)
A test ratio of 20% is chosen, meaning 80% of the data will be used for training, and 20% will be reserved for validation or testing. The createDataPartition function ensures a stratified split, preserving the distribution of the target variable (total_cases).
# Define train-test split ratio
test_ratio <- 0.2
# Create train-test splits for SJ data
set.seed(23) # For reproducibility
sj_indices <- createDataPartition(x_sj_arr[, ncol(x_sj_arr)], p = 1 - test_ratio, list = FALSE)
X_sj_train <- x_sj_arr[sj_indices, -ncol(x_sj_arr)]
y_sj_train <- x_sj_arr[sj_indices, ncol(x_sj_arr)]
X_sj_test <- x_sj_arr[-sj_indices, -ncol(x_sj_arr)]
y_sj_test <- x_sj_arr[-sj_indices, ncol(x_sj_arr)]
# Create train-test splits for IQ data
iq_indices <- createDataPartition(x_iq_arr[, ncol(x_iq_arr)], p = 1 - test_ratio, list = FALSE)
X_iq_train <- x_iq_arr[iq_indices, -ncol(x_iq_arr)]
y_iq_train <- x_iq_arr[iq_indices, ncol(x_iq_arr)]
X_iq_test <- x_iq_arr[-iq_indices, -ncol(x_iq_arr)]
y_iq_test <- x_iq_arr[-iq_indices, ncol(x_iq_arr)]
Data standardization ensures all features have a mean of 0 and a standard deviation of 1. This is especially important for models sensitive to feature scales, such as k-NN and SVM.
# Feature Scaling for SJ
sj_scaler <- preProcess(X_sj_train, method = c("center", "scale"))
# Apply scaling to training and test sets
X_sj_train <- predict(sj_scaler, X_sj_train)
X_sj_test <- predict(sj_scaler, X_sj_test)
# Feature Scaling for IQ
iq_scaler <- preProcess(X_iq_train, method = c("center", "scale"))
# Apply scaling to training and test sets
X_iq_train <- predict(iq_scaler, X_iq_train)
X_iq_test <- predict(iq_scaler, X_iq_test)
# Verify normalization
head(dengue_features_train) %>% kable(caption="Dengue Features Train") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| city | year | weekofyear | week_start_date | ndvi_ne | ndvi_nw | ndvi_se | ndvi_sw | precipitation_amt_mm | reanalysis_air_temp_c | reanalysis_avg_temp_c | reanalysis_dew_point_temp_c | reanalysis_max_air_temp_c | reanalysis_min_air_temp_c | reanalysis_precip_amt_kg_per_m2 | reanalysis_relative_humidity_percent | reanalysis_sat_precip_amt_mm | reanalysis_specific_humidity_g_per_kg | reanalysis_tdtr_k | station_avg_temp_c | station_diur_temp_rng_c | station_max_temp_c | station_min_temp_c | station_precip_mm | avg_station_max_min | avg_analysis_max_min | total_cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sj | 1990 | 18 | 1990-04-30 | 0.123 | 0.104 | 0.198 | 0.178 | 12.42 | 24.423 | 24.593 | 19.264 | 26.65 | 22.75 | 32.00 | 73.366 | 12.42 | 14.013 | 2.629 | 25.443 | 6.900 | 29.4 | 20.0 | 16.0 | 24.70 | 24.70 | 4 |
| sj | 1990 | 19 | 1990-05-07 | 0.170 | 0.142 | 0.162 | 0.155 | 22.82 | 25.061 | 25.293 | 20.801 | 27.75 | 23.25 | 17.94 | 77.369 | 22.82 | 15.373 | 2.371 | 26.714 | 6.371 | 31.7 | 22.2 | 8.6 | 26.95 | 25.50 | 5 |
| sj | 1990 | 20 | 1990-05-14 | 0.032 | 0.173 | 0.157 | 0.171 | 34.54 | 25.631 | 25.729 | 22.284 | 27.35 | 24.15 | 26.10 | 82.053 | 34.54 | 16.849 | 2.300 | 26.714 | 6.486 | 32.2 | 22.8 | 41.4 | 27.50 | 25.75 | 4 |
| sj | 1990 | 21 | 1990-05-21 | 0.129 | 0.245 | 0.228 | 0.236 | 15.36 | 25.837 | 26.079 | 22.160 | 28.25 | 23.85 | 13.90 | 80.337 | 15.36 | 16.673 | 2.429 | 27.471 | 6.771 | 33.3 | 23.3 | 4.0 | 28.30 | 26.05 | 3 |
| sj | 1990 | 22 | 1990-05-28 | 0.196 | 0.262 | 0.251 | 0.247 | 7.52 | 26.369 | 26.514 | 22.671 | 28.75 | 24.35 | 12.20 | 80.460 | 7.52 | 17.210 | 3.014 | 28.943 | 9.371 | 35.0 | 23.9 | 5.8 | 29.45 | 26.55 | 6 |
| sj | 1990 | 23 | 1990-06-04 | 0.142 | 0.175 | 0.254 | 0.182 | 9.58 | 26.480 | 26.614 | 22.701 | 29.25 | 24.95 | 26.49 | 79.891 | 9.58 | 17.213 | 2.100 | 28.114 | 6.943 | 34.4 | 23.9 | 39.1 | 29.15 | 27.10 | 2 |
summary_df <- as.data.frame(t(summary(dengue_features_train)))
head(summary_df) %>% kable(caption="Summary Statistics of Dengue Features Dataset") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F,
position = "center")
| Var1 | Var2 | Freq |
|---|---|---|
| city | Length:1456 | |
| year | Min. :1990 | |
| weekofyear | Min. : 1.00 | |
| week_start_date | Length:1456 | |
| ndvi_ne | Min. :-0.40600 | |
| ndvi_nw | Min. :-0.4560 |
The scatter plots represent the relationship between individual features (1 through 24) and the target variable (year, serving as a proxy for the dengue cases timeline). Key observations include:
Feature 1 shows a strong linear relationship with the target, indicating its potential importance in modeling trends over time.
Other features, such as Feature 21 and Feature 6, display a moderate spread with noticeable clustering, suggesting some correlation with the target variable.
Many features, including Feature 3, Feature 13, and others, exhibit a more uniform or noisy distribution, indicating weak or no correlation with the target variable.
Features like Feature 19 and Feature 20 show distinct patterns (e.g., band-like structures), which might suggest discrete or categorical influences.
These plots are valuable for identifying which features may be most relevant in predicting dengue case trends and guiding feature selection or transformation during the modeling process.
#Scatter plot
# Loop through all features and create individual plots
par(mfrow = c(2, 2)) # Set up a 2x2 grid for multiple plots
for (i in 1:ncol(X_sj_train)) {
plot(X_sj_train[, i], y_sj_train, pch = 19, col = "green",
xlab = paste("Feature", i), ylab = "Target", main = paste("Feature", i, "vs Target"))
}
par(mfrow = c(1, 1)) # Reset plotting layout
Hyperparameter Tuning: Grid search with
cross-validation was employed to optimize critical hyperparameters such
as the number of variables tried at each split (mtry), the
splitting rule (e.g., “variance”), and the minimum node size. The model
was tuned for minimizing Mean Absolute Error (MAE).
Model Training: The Random Forest model was trained on the San Juan dataset using the training set. The model leverages ensemble learning to build multiple decision trees and combines their predictions for robust results.
Prediction: Using the optimal hyperparameters, predictions were made on the test set, estimating dengue cases based on the trained Random Forest model.
Performance Metrics (MAE): The model’s accuracy was evaluated using Mean Absolute Error, which measures the average magnitude of prediction errors, providing insights into the model’s prediction reliability.
# Define for Random Forest
# Define training control with cross-validation
train_control <- trainControl(
method = "cv",
number = 5
)
# Define the hyperparameter grid
tune_grid <- expand.grid(
mtry = c(2, 4, 6, 8),
splitrule = c("variance"),
min.node.size = c(1, 5, 10)
)
# Random Forest for SJ
# Train the Random Forest model using grid search with cross-validation
sj_rf <- train(
x = X_sj_train,
y = y_sj_train,
method = "ranger",
trControl = train_control,
tuneGrid = tune_grid
)
# Display the best hyperparameters
print(sj_rf$bestTune)
## mtry splitrule min.node.size
## 10 8 variance 1
# Predict on the test set
sj_rf_pred <- predict(sj_rf, X_sj_test)
# Calculate Mean Absolute Error (MAE)
mae_sj_rf <- mean(abs(sj_rf_pred - y_sj_test))
cat("MAE:", mae_sj_rf, "\n")
## MAE: 0.437086
### Random Forest for Iq
# Random Forest for IQ
# Train the Random Forest model using grid search with cross-validation
iq_rf <- train(
x = X_iq_train,
y = y_iq_train,
method = "ranger",
trControl = train_control,
tuneGrid = tune_grid
)
# Display the best hyperparameters
print(iq_rf$bestTune)
## mtry splitrule min.node.size
## 11 8 variance 5
# Predict on the test set
iq_rf_pred <- predict(iq_rf, X_iq_test)
# Calculate Mean Absolute Error (MAE)
mae_iq_rf <- mean(abs(iq_rf_pred - y_iq_test))
cat("MAE:", mae_iq_rf, "\n")
## MAE: 0.3705987
Hyperparameter Tuning: Grid search with cross-validation was used to optimize key parameters such as the number of neighbors (k), algorithm (e.g., “ball_tree,” “kd_tree,” “brute”), and leaf size. The model was tuned for minimizing Mean Absolute Error (MAE).
Model Training: The k-Nearest Neighbors (k-NN) model was trained on the San Juan dataset, utilizing the training set to identify dengue case patterns.
Prediction: After selecting the best hyperparameters, the trained k-NN model was applied to the test set to predict dengue cases.
Performance Metrics (MAE): The best hyperparameters and corresponding MAE were extracted and evaluated, demonstrating the model’s ability to make accurate predictions.
# k-NN Model
# Define & Tune Hyper-parameters for k-NN
knn_grid <- expand.grid(
k = c(3, 5, 7), # Number of neighbors to try
algorithm = c("ball_tree", "kd_tree", "brute"), # Algorithms (manually incorporate logic later if needed)
leaf_size = c(25) # Leaf size (applicable only for certain algorithms)
)
# Train the k-NN model with cross-validation for SJ
sj_knn_grid_cv <- train(
X_sj_train, y_sj_train,
method = "knn",
trControl = trainControl(
method = "cv", # Cross-validation
number = 5, # Number of folds
summaryFunction = defaultSummary # Include metrics like MAE
),
# tuneGrid = knn_grid,
metric = "MAE" # Optimize for Mean Absolute Error
)
# Extract best score (MAE)
best_score <- min(sj_knn_grid_cv$results$MAE)
# Extract best parameters
best_params <- sj_knn_grid_cv$bestTune
# Print results
print(best_score)
## [1] 1.715535
print(best_params)
## k
## 1 5
# Predict on the test set
y_sj_pred <- predict(sj_knn_grid_cv$finalModel, X_sj_test)
# Calculate Mean Absolute Error (MAE)
mae_sj_knn <- mean(abs(y_sj_test - y_sj_pred))
# Print MAE
print(mae_sj_knn)
## [1] 1.755735
# Overview of the Model: k-NN for SJ
sj_knn_grid_cv
## k-Nearest Neighbors
##
## 750 samples
## 24 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 600, 601, 600, 599, 600
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 2.146043 0.8401278 1.715535
## 7 2.142910 0.8474323 1.730857
## 9 2.174091 0.8477397 1.740108
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Train the k-NN model with cross-validation for IQ
iq_knn_grid_cv <- train(
X_iq_train, y_iq_train,
method = "knn",
trControl = trainControl(
method = "cv", # Cross-validation
number = 5, # Number of folds
summaryFunction = defaultSummary # Include metrics like MAE
),
tuneGrid = expand.grid(k = c(3, 5, 7)), # Only tuning over k
metric = "MAE" # Optimize for Mean Absolute Error
)
# Extract best score (MAE)
best_score <- min(iq_knn_grid_cv$results$MAE)
# Extract best parameters
best_params <- iq_knn_grid_cv$bestTune
# Print results
print(best_score)
## [1] 1.253834
print(best_params)
## k
## 1 3
# Predict on the test set
y_iq_pred <- predict(iq_knn_grid_cv$finalModel, X_iq_test)
# Calculate Mean Absolute Error (MAE)
mae_iq_knn <- mean(abs(y_iq_test - y_iq_pred))
# Print MAE
print(mae_iq_knn)
## [1] 1.346278
iq_knn_grid_cv
## k-Nearest Neighbors
##
## 417 samples
## 24 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 333, 334, 333, 334, 334
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 3 1.620303 0.6891023 1.253834
## 5 1.579458 0.7175550 1.280266
## 7 1.532337 0.7533292 1.259128
##
## MAE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
Data Preparation: The training and test datasets
for the San Juan region were converted into matrix format. These
matrices were further transformed into XGBoost’s DMatrix
format, which optimizes computation during training and
prediction.
Hyperparameter Tuning: Training was conducted
with parameters such as learning rate (eta = 0.1), maximum
tree depth (max_depth = 6), subsampling ratios for rows
(subsample = 0.8) and columns
(colsample_bytree = 0.8), all tailored to optimize
regression performance.
Model Training: The XGBoost model was trained using 100 boosting rounds with a watchlist that monitored both training and test datasets to prevent overfitting. Early stopping was applied to halt training if the test performance did not improve for 10 consecutive rounds.
Prediction: The trained XGBoost model was used to predict dengue cases on the test set.
Performance Metrics (MAE): Model accuracy was assessed using Mean Absolute Error, capturing the model’s ability to predict dengue cases accurately and effectively.
# XGBOOST MODEL
# Convert the data to matrices
sj_dtrain <- as.matrix(X_sj_train)
sj_dtest <- as.matrix(X_sj_test)
iq_dtrain <- as.matrix(X_iq_train)
iq_dtest <- as.matrix(X_iq_test)
# Define training parameters for xgboost
xgb_params <- list(
objective = "reg:squarederror", # Regression task
eta = 0.1, # Learning rate
max_depth = 6, # Maximum tree depth
subsample = 0.8, # Subsample ratio of training data
colsample_bytree = 0.8 # Subsample ratio of columns
)
# Convert training data to DMatrix format
dtrain_sj <- xgb.DMatrix(data = sj_dtrain, label = y_sj_train)
dtest_sj <- xgb.DMatrix(data = sj_dtest, label = y_sj_test)
# Train XGBoost model for San Juan
xgb_sj <- xgb.train(
params = xgb_params,
data = dtrain_sj,
nrounds = 100,
watchlist = list(train = dtrain_sj, test = dtest_sj),
early_stopping_rounds = 10,
print_every_n = 10
)
## [1] train-rmse:1798.816410 test-rmse:1798.945350
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:628.380934 test-rmse:628.510180
## [21] train-rmse:219.571847 test-rmse:219.701947
## [31] train-rmse:76.824680 test-rmse:76.928735
## [41] train-rmse:26.995268 test-rmse:27.043721
## [51] train-rmse:9.507405 test-rmse:9.517873
## [61] train-rmse:3.361861 test-rmse:3.351093
## [71] train-rmse:1.202664 test-rmse:1.200162
## [81] train-rmse:0.438678 test-rmse:0.438054
## [91] train-rmse:0.167807 test-rmse:0.171728
## [100] train-rmse:0.077309 test-rmse:0.087671
# Train XGBoost model for Iquitos
dtrain_iq <- xgb.DMatrix(data = iq_dtrain, label = y_iq_train)
dtest_iq <- xgb.DMatrix(data = iq_dtest, label = y_iq_test)
xgb_iq <- xgb.train(
params = xgb_params,
data = dtrain_iq,
nrounds = 100,
watchlist = list(train = dtrain_iq, test = dtest_iq),
early_stopping_rounds = 10,
print_every_n = 10
)
## [1] train-rmse:1804.673681 test-rmse:1804.770725
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:631.355482 test-rmse:631.452872
## [21] train-rmse:220.873041 test-rmse:220.971422
## [31] train-rmse:77.322517 test-rmse:77.423685
## [41] train-rmse:27.154651 test-rmse:27.270354
## [51] train-rmse:9.579263 test-rmse:9.664059
## [61] train-rmse:3.392090 test-rmse:3.464586
## [71] train-rmse:1.218045 test-rmse:1.270635
## [81] train-rmse:0.447854 test-rmse:0.475615
## [91] train-rmse:0.170830 test-rmse:0.191916
## [100] train-rmse:0.072782 test-rmse:0.088841
# Predict for San Juan
xgb_sj_pred <- predict(xgb_sj, sj_dtest)
mae_sj_xgb <- mean(abs(xgb_sj_pred - y_sj_test))
cat("XGBoost MAE (San Juan):", mae_sj_xgb, "\n")
## XGBoost MAE (San Juan): 0.07022915
# Predict for Iquitos
xgb_iq_pred <- predict(xgb_iq, iq_dtest)
mae_iq_xgb <- mean(abs(xgb_iq_pred - y_iq_test))
cat("XGBoost MAE (Iquitos):", mae_iq_xgb, "\n")
## XGBoost MAE (Iquitos): 0.07082567
# COMPARE RESULTS
# Create a data frame to store the test results for both models and datasets
test_results <- data.frame(
Model = c("Random Forest", "Random Forest", "k-NN", "k-NN", "XGBoost", "XGBoost"),
Dataset = c("San Juan (SJ)", "Iquitos (IQ)", "San Juan (SJ)", "Iquitos (IQ)", "San Juan (SJ)", "Iquitos (IQ)"),
MAE = c(mae_sj_rf, mae_iq_rf, mae_sj_knn, mae_iq_knn, mae_sj_xgb, mae_iq_xgb)
)
# Print the results
print("Test Results Comparison:")
## [1] "Test Results Comparison:"
print(test_results)
## Model Dataset MAE
## 1 Random Forest San Juan (SJ) 0.43708602
## 2 Random Forest Iquitos (IQ) 0.37059871
## 3 k-NN San Juan (SJ) 1.75573477
## 4 k-NN Iquitos (IQ) 1.34627832
## 5 XGBoost San Juan (SJ) 0.07022915
## 6 XGBoost Iquitos (IQ) 0.07082567
# Create a bar plot to compare MAE across models and datasets
ggplot(test_results, aes(x = Dataset, y = MAE, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(title = "Model Comparison: MAE on Test Data",
x = "Dataset",
y = "Mean Absolute Error (MAE)",
fill = "Model") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
XGBoost Performance:
XGBoost achieved the lowest MAE for both datasets:
San Juan (SJ): 0.0702
Iquitos (IQ): 0.0708
This indicates that XGBoost is the most accurate model for predicting dengue cases in this context.
Random Forest Performance:
Random Forest performed reasonably well, particularly for Iquitos (IQ):
San Juan (SJ): 0.4371
Iquitos (IQ): 0.3706
San Juan (SJ) is significantly higher than XGBoost but still much better than k-NN.
k-Nearest Neighbors (k-NN) Performance:
k-NN showed the highest MAE values, indicating the poorest performance among the models:
San Juan (SJ): 1.7557
Iquitos (IQ): 1.3463
This suggests that k-NN might not be suitable for this problem, possibly due to the complexity of the dataset or the nature of the relationships between features and target variables.
Comparison Between Cities:
For all models, predictions for the Iquitos dataset had lower MAE values compared to San Juan.
This may indicate that the Iquitos dataset has more consistent patterns or less noise, making it easier to model accurately.
XGBoost outperformed the other models significantly and is the preferred choice for both San Juan and Iquitos datasets, especially in terms of minimizing prediction errors. However, further evaluation, such as hyperparameter tuning or feature selection, might help improve the performance of the other models.
# Add scatter plots for actual vs predicted values
# Random Forest: San Juan
rf_sj_comparison <- data.frame(Actual = y_sj_test, Predicted = sj_rf_pred)
ggplot(rf_sj_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
labs(title = "Random Forest: Actual vs Predicted (SJ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
# Random Forest: Iquitos
rf_iq_comparison <- data.frame(Actual = y_iq_test, Predicted = iq_rf_pred)
ggplot(rf_iq_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "green") +
labs(title = "Random Forest: Actual vs Predicted (IQ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
The blue (SJ) and green (IQ) plots show that RF generally captures the trend of actual dengue cases well for both cities
There is a consistent alignment between predicted and actual values, suggesting that RF is effective but may have slight deviations in some cases.
# k-NN: San Juan
knn_sj_comparison <- data.frame(Actual = y_sj_test, Predicted = y_sj_pred)
ggplot(knn_sj_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "red") +
labs(title = "k-NN: Actual vs Predicted (SJ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
# k-NN: Iquitos
knn_iq_comparison <- data.frame(Actual = y_iq_test, Predicted = y_iq_pred)
ggplot(knn_iq_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "purple") +
labs(title = "k-NN: Actual vs Predicted (IQ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
The red (SJ) and purple (IQ) plots demonstrate k-NN’s predictions. While it follows the overall trends, the scatter appears less tightly clustered around the diagonal compared to Random Forest or XGBoost, indicating relatively poorer accuracy and higher variability.
# XGBoost: San Juan
xgb_sj_comparison <- data.frame(Actual = y_sj_test, Predicted = y_sj_pred)
ggplot(xgb_sj_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "red") +
labs(title = "XGBoost: Actual vs Predicted (SJ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
# XGBoost: Iquitos
xgb_iq_comparison <- data.frame(Actual = y_iq_test, Predicted = y_iq_pred)
ggplot(xgb_iq_comparison, aes(x = Actual, y = Predicted)) +
geom_point(color = "purple") +
labs(title = "XGBoost: Actual vs Predicted (IQ)",
x = "Actual Total Cases",
y = "Predicted Total Cases") +
theme_minimal()
XGBoost (red for SJ and purple for IQ) shows the most robust performance. Predictions are closely aligned with actual values, and the scatter points are tightly clustered along the diagonal. This indicates that XGBoost provides the most accurate and consistent results among the models.
City-Specific Accuracy: While both models perform well, the performance differences (e.g., MAE values) indicate that San Juan and Iquitos require tailored considerations in feature representation or data characteristics.
Model Superiority: XGBoost outperforms the other models in both cities based on tighter clustering around the diagonal, reinforcing its robustness for dengue case prediction.
In this study, various machine learning models, including Random Forest, k-Nearest Neighbors (k-NN), and XGBoost, were evaluated for predicting dengue cases in San Juan (SJ) and Iquitos (IQ). The results show that XGBoost consistently outperformed other models, achieving the lowest Mean Absolute Error (MAE) and demonstrating strong alignment between actual and predicted values in both cities.
The analysis highlights that:
San Juan (SJ) exhibited greater prediction accuracy across models, reflecting the robustness of features or dataset quality.
Iquitos (IQ) predictions, while effective, showed slightly higher variability, indicating potential differences in data patterns or external factors influencing model performance.
XGBoost’s superior predictive capability and its ability to generalize across datasets make it the most reliable model for future dengue case forecasting.
These predictive models can provide early warnings, helping governments and healthcare agencies allocate resources efficiently and target high-risk areas.
The implementation of accurate forecasting systems can reduce healthcare costs and improve patient outcomes by enabling timely interventions.
The dataset may not fully capture real-world complexities, as it lacks certain features such as socio-economic factors and environmental variables that could enhance model predictions.
Predictive models require continuous updates and monitoring to remain accurate as dengue patterns evolve due to climate change and urbanization.
Enhanced Feature Engineering: Incorporate additional features such as population density, economic indicators, and more granular climate data to refine predictions.
Real-Time Monitoring Systems: Deploy these models into real-time operational systems for live monitoring of dengue trends.
Intervention Analysis: Expand the scope to evaluate how interventions (e.g., fumigation campaigns) impact case predictions and optimize public health strategies.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Cover, T., & Hart, P. (1967). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://doi.org/10.1109/TIT.1967.1053964
World Health Organization (WHO). (2022). Dengue and severe dengue. Retrieved from https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue
Roy, S. K., & Bhattacharjee, S. (2020). Dengue virus: Epidemiology, biology, and disease aetiology. Canadian Journal of Microbiology. https://cdnsciencepub.com/doi/pdf/10.1139/cjm-2020-0572
Kaggle. (n.d.). Dengue Disease Prediction Dataset. Retrieved from https://www.kaggle.com/datasets/ahmadpk/dengue-dataset