Governments today warn the public about the weather and the air using two separate numbers: a temperature / heat index on one side, and an Air Quality Index (AQI) on the other. The human body, however, does not feel them separately. When it is very hot, the heart works harder and we breathe faster, so inhaling fine particles (PM2.5) and ground-level ozone becomes far more dangerous. A study of ~1.5 million deaths in California found that extreme heat or PM2.5 alone raised mortality only slightly, but the two together pushed it up by around 21%. Standard isolated indices miss this “synergy”.
Our project builds a single health-weighted index — the Acute Risk Score (ARS) — that captures this combined danger, and then forecasts, one day in advance, whether tomorrow will be a “High Health-Risk Day” for a city. This document is the full reproducible analysis: data understanding, cleaning, feature engineering, EDA, modelling, and evaluation.
Why “one day in advance” matters. The ARS is computed from a day’s own temperature, PM2.5 and ozone. If we used today’s values to predict today’s risk label, the model would just be re-deriving our own formula — that is data leakage, not prediction. So we frame the task on the time axis: use a window of past days to predict the next day. This is the “time-span” design that runs through the whole notebook.
# Robust path resolution so the notebook knits from a few common locations.
csv_candidates <- c(
"GlobalWeatherRepository.csv",
"../GlobalWeatherRepository.csv",
"C:/Users/cyanc/OneDrive/Attachments/RealDocs/Malay/UM/WQD7004 Programming of DS/gp/GlobalWeatherRepository.csv"
)
csv_path <- csv_candidates[file.exists(csv_candidates)][1]
stopifnot(!is.na(csv_path))
dt <- fread(csv_path)
dim(dt)## [1] 78526 41
The Global Weather Repository (Kaggle) aggregates daily weather and air-quality readings for cities worldwide. Each row is one city at one update time.
dt[, dt_ts := ymd_hm(last_updated)]
dt[, date := as.IDate(dt_ts)]
cat("Rows:", nrow(dt),
"| Cities:", uniqueN(dt$location_name),
"| Countries:", uniqueN(dt$country), "\n")## Rows: 78526 | Cities: 248 | Countries: 210
cat("Date range:", as.character(min(dt$date)), "->", as.character(max(dt$date)),
"| Unique dates:", uniqueN(dt$date), "\n")## Date range: 2024-05-16 -> 2025-06-24 | Unique dates: 404
A quick numeric summary of the variables we care about:
focus <- c("temperature_celsius","humidity","pressure_mb","wind_kph","cloud",
"air_quality_PM2.5","air_quality_Ozone","air_quality_PM10",
"air_quality_Carbon_Monoxide","air_quality_Sulphur_dioxide")
summary(dt[, ..focus])## temperature_celsius humidity pressure_mb wind_kph
## Min. :-24.90 Min. : 2.00 Min. : 947 Min. : 3.60
## 1st Qu.: 17.20 1st Qu.: 46.00 1st Qu.:1010 1st Qu.: 6.50
## Median : 24.70 Median : 69.00 Median :1013 Median : 11.20
## Mean : 22.36 Mean : 63.59 Mean :1014 Mean : 13.36
## 3rd Qu.: 28.30 3rd Qu.: 83.00 3rd Qu.:1018 3rd Qu.: 18.40
## Max. : 49.20 Max. :100.00 Max. :3006 Max. :2963.20
## cloud air_quality_PM2.5 air_quality_Ozone air_quality_PM10
## Min. : 0.00 Min. : 0.168 Min. : 0.00 Min. :-1848.15
## 1st Qu.: 2.00 1st Qu.: 6.660 1st Qu.: 42.00 1st Qu.: 9.99
## Median : 29.00 Median : 14.700 Median : 61.00 Median : 21.83
## Mean : 39.63 Mean : 26.208 Mean : 64.08 Mean : 54.57
## 3rd Qu.: 75.00 3rd Qu.: 30.525 3rd Qu.: 82.00 3rd Qu.: 47.36
## Max. :100.00 Max. :1614.100 Max. :480.70 Max. : 6037.29
## air_quality_Carbon_Monoxide air_quality_Sulphur_dioxide
## Min. :-9999.0 Min. :-9999.000
## 1st Qu.: 233.7 1st Qu.: 0.800
## Median : 327.4 Median : 2.405
## Mean : 521.5 Mean : 11.365
## 3rd Qu.: 507.4 3rd Qu.: 9.100
## Max. :38879.4 Max. : 521.330
Two things jump out immediately and tell us what cleaning is needed:
-9999, PM10 as low as -1848). These are sensor
error sentinels, not real concentrations.The “time-span” design only works if each city is observed on (almost) every day. Let’s check the spacing between consecutive observations per city.
gap_summary <- dt[order(location_name, date),
.(n_days = uniqueN(date),
span_start = min(date), span_end = max(date)),
by = location_name]
# fraction of consecutive observations that are exactly 1 day apart
one_day <- dt[order(location_name, date),
.(g = as.integer(diff(sort(unique(date))))), by = location_name]
cat(sprintf("Median observations per city: %d\n", median(gap_summary$n_days)))## Median observations per city: 401
cat(sprintf("Share of consecutive gaps == 1 day: %.1f%%\n",
100 * mean(one_day$g == 1, na.rm = TRUE)))## Share of consecutive gaps == 1 day: 99.3%
## Max gap between observations: 6 days
So this is effectively a clean daily panel: most cities have ~400 consecutive days, and almost every consecutive pair is exactly one day apart. That is exactly what we need to build lag features and forecast the next day.
n0 <- nrow(dt)
# 1. Drop rows missing any variable we model on.
model_vars <- c("temperature_celsius","humidity","pressure_mb","wind_kph","cloud",
"uv_index","precip_mm","air_quality_PM2.5","air_quality_Ozone",
"air_quality_Nitrogen_dioxide","air_quality_Sulphur_dioxide",
"air_quality_PM10","air_quality_Carbon_Monoxide")
dt <- dt[complete.cases(dt[, ..model_vars])]
# 2. Remove physically impossible readings.
aq_conc <- c("air_quality_PM2.5","air_quality_Ozone","air_quality_PM10",
"air_quality_Carbon_Monoxide","air_quality_Nitrogen_dioxide",
"air_quality_Sulphur_dioxide")
for (c in aq_conc) dt <- dt[get(c) >= 0]
dt <- dt[humidity >= 0 & humidity <= 100] # guard against e.g. 150% humidity
# 3. Keep one record per city-day (latest timestamp).
setorder(dt, location_name, date, dt_ts)
dt <- dt[, .SD[.N], by = .(location_name, date)]
cat(sprintf("Removed %d rows in cleaning (%.2f%%); final %d city-day records.\n",
n0 - nrow(dt), 100*(n0 - nrow(dt))/n0, nrow(dt)))## Removed 246 rows in cleaning (0.31%); final 78280 city-day records.
The ARS adds the baseline risk from PM2.5 and ozone (each scaled by a health limit), adds a penalty for heat above 32 °C, and then adds synergy terms that only switch on when heat and pollution occur together — this is the multiplicative danger the project is about.
calc_ars <- function(d) {
S_pm25 <- 35.0; S_o3 <- 100.0; T_thr <- 32.0 # health limits + heat threshold
w_pm25 <- 1.0; w_o3 <- 1.0; w_T <- 1.2 # baseline weights
lam_s <- 1.5; lam_o <- 1.3 # synergy coefficients
tp <- pmax(0, d$temperature_celsius - T_thr) # heat penalty f(T)
(w_pm25 * d$air_quality_PM2.5 / S_pm25) +
(w_o3 * d$air_quality_Ozone / S_o3) +
(w_T * tp) +
(lam_s * d$air_quality_PM2.5 / S_pm25 * tp) + # PM2.5 x heat synergy
(lam_o * d$air_quality_Ozone / S_o3 * tp) # ozone x heat synergy
}
dt[, acute_risk_score := calc_ars(dt)]We call a day “High-Risk” if its ARS is in the top 15% (the 85th percentile). Crucially, we fix that percentile cut-off using only the training period, so no information from the future test period leaks into the label.
cutoff <- as.IDate("2025-03-25") # last ~3 months are held out as "the future"
risk_thr <- as.numeric(quantile(dt[date < cutoff, acute_risk_score], 0.85, na.rm = TRUE))
dt[, high_risk := as.integer(acute_risk_score >= risk_thr)]
cat(sprintf("High-risk threshold (85th pct of TRAIN ARS): %.3f\n", risk_thr))## High-risk threshold (85th pct of TRAIN ARS): 2.583
## Overall positive rate: 15.2%
For each city-day t, we build features from days ≤ t only, and set the target to the High-Risk label of day t+1. We only keep a target when t+1 is genuinely the next calendar day.
setorder(dt, location_name, date)
roll_mean <- function(x, n) frollmean(x, n, align = "right", na.rm = TRUE)
roll_max <- function(x, n) frollapply(x, n, max, align = "right")
dt[, `:=`(
next_date = shift(date, 1, type = "lead"),
target = shift(high_risk, 1, type = "lead")
), by = location_name]
dt[, valid_target := as.integer(!is.na(next_date) & (next_date - date) == 1L)]
dt[, `:=`(
temp_lag1 = shift(temperature_celsius,1), temp_lag2 = shift(temperature_celsius,2), temp_lag3 = shift(temperature_celsius,3),
pm25_lag1 = shift(air_quality_PM2.5,1), pm25_lag2 = shift(air_quality_PM2.5,2), pm25_lag3 = shift(air_quality_PM2.5,3),
o3_lag1 = shift(air_quality_Ozone,1), o3_lag2 = shift(air_quality_Ozone,2), o3_lag3 = shift(air_quality_Ozone,3),
ars_lag1 = shift(acute_risk_score,1), ars_lag2 = shift(acute_risk_score,2), ars_lag3 = shift(acute_risk_score,3),
hr_lag1 = shift(high_risk,1),
temp_roll3 = roll_mean(temperature_celsius,3), temp_roll7 = roll_mean(temperature_celsius,7),
temp_max7 = roll_max(temperature_celsius,7),
pm25_roll3 = roll_mean(air_quality_PM2.5,3), pm25_roll7 = roll_mean(air_quality_PM2.5,7),
o3_roll3 = roll_mean(air_quality_Ozone,3), o3_roll7 = roll_mean(air_quality_Ozone,7),
ars_roll3 = roll_mean(acute_risk_score,3), ars_roll7 = roll_mean(acute_risk_score,7)
), by = location_name]
dt[, `:=`(
temp_change1 = temperature_celsius - temp_lag1,
pm25_change1 = air_quality_PM2.5 - pm25_lag1,
ars_change1 = acute_risk_score - ars_lag1,
month = month(date),
doy = yday(date),
doy_sin = sin(2*pi*yday(date)/365.25),
doy_cos = cos(2*pi*yday(date)/365.25)
)]ggplot(dt, aes(acute_risk_score)) +
geom_histogram(bins = 60, fill = "firebrick", colour = "white", alpha = .85) +
geom_vline(xintercept = risk_thr, linetype = "dashed", linewidth = 1) +
annotate("text", x = risk_thr, y = Inf, vjust = 2, hjust = -0.05,
label = sprintf("85th pct = %.2f", risk_thr)) +
labs(title = "Distribution of Acute Risk Score",
subtitle = "Heavily right-skewed: most days are safe, a long tail of dangerous days",
x = "Acute Risk Score", y = "Count")The score is strongly right-skewed. Most days are safe (the tall bar near zero), but a long tail stretches to the right — those rare, dangerous days are exactly what we want to predict. The dashed line is the 85th-percentile cut-off.
cols <- c("temperature_celsius","humidity","air_quality_PM2.5","air_quality_Ozone",
"air_quality_PM10","air_quality_Nitrogen_dioxide",
"air_quality_Sulphur_dioxide","acute_risk_score")
M <- cor(dt[, ..cols], use = "complete.obs")
corrplot(M, method = "color", type = "upper", addCoef.col = "black",
tl.col = "black", tl.srt = 45, number.cex = 0.7,
mar = c(0,0,1,0), title = "Correlation Heatmap")Note the positive correlation between temperature and ozone — heat speeds up the photochemical reactions that create ground-level ozone, so the two hazards tend to arrive together. That is the physical basis of the synergy we model.
set.seed(1)
samp <- dt[sample(.N, min(15000, .N))] # sample for a readable scatter
ggplot(samp, aes(temperature_celsius, acute_risk_score, colour = air_quality_PM2.5)) +
geom_point(alpha = .5, size = .9) +
scale_colour_viridis_c(option = "D", trans = "sqrt", name = "PM2.5") +
geom_vline(xintercept = 32, colour = "red", linetype = "dashed") +
annotate("text", x = 32, y = max(samp$acute_risk_score)*.95,
label = "Heatwave threshold 32°C", hjust = -0.05, colour = "red") +
labs(title = "Acute Risk Score vs Temperature",
subtitle = "Below 32°C the score is flat; above it the score explodes, worst when PM2.5 is high",
x = "Temperature (°C)", y = "Acute Risk Score")Below 32 °C the score is essentially flat. Past the red line it shoots up in a “hockey-stick” shape, and the brightest (high-PM2.5) points climb fastest — high heat plus high pollution is where the real danger lives.
ggplot(dt[!is.na(high_risk)], aes(factor(high_risk), fill = factor(high_risk))) +
geom_bar() +
scale_fill_manual(values = c("0"="skyblue","1"="salmon"), guide = "none") +
scale_x_discrete(labels = c("0"="Normal","1"="High-Risk")) +
labs(title = "High Health-Risk Days are the minority class (~15%)",
x = NULL, y = "Count")Because we forecast the next day, seasonal rhythm matters. High-risk days are not spread evenly through the year.
mon <- dt[, .(rate = mean(high_risk)), by = month][order(month)]
ggplot(mon, aes(factor(month), rate)) +
geom_col(fill = "#d95f0e") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Share of High-Risk days by month (all cities pooled)",
x = "Month", y = "% High-Risk")Zooming into the city with the most high-risk days shows the autocorrelation the forecaster exploits: risky days cluster together rather than appearing at random.
top_city <- dt[high_risk == 1, .N, by = location_name][order(-N)][1, location_name]
city <- dt[location_name == top_city][order(date)]
ggplot(city, aes(date, acute_risk_score)) +
geom_line(colour = "grey60") +
geom_point(data = city[high_risk == 1], colour = "firebrick", size = 1) +
geom_hline(yintercept = risk_thr, linetype = "dashed") +
labs(title = sprintf("Acute Risk Score over time — %s", top_city),
subtitle = "Red points = High-Risk days; note how they cluster into spells",
x = NULL, y = "Acute Risk Score")city_mean <- dt[, .(mean_ars = mean(acute_risk_score), lat = first(latitude),
lon = first(longitude)), by = location_name]
ggplot(city_mean, aes(lon, lat, colour = mean_ars)) +
geom_point(alpha = .8, size = 1.6) +
scale_colour_viridis_c(option = "B", name = "Mean ARS") +
labs(title = "Cities by location, coloured by average risk",
subtitle = "Sensor coverage is uneven — a source of spatial bias to keep in mind",
x = "Longitude", y = "Latitude")We assemble the modelling table (rows with a valid next-day target and complete features), then split by time: train on everything before the cut-off, test on the most recent ~3 months that the model has never seen.
feature_cols <- c(
"temperature_celsius","humidity","pressure_mb","wind_kph","cloud","uv_index","precip_mm",
"air_quality_PM2.5","air_quality_Ozone","air_quality_Nitrogen_dioxide",
"air_quality_Sulphur_dioxide","air_quality_PM10","air_quality_Carbon_Monoxide",
"acute_risk_score","high_risk",
"temp_lag1","temp_lag2","temp_lag3","pm25_lag1","pm25_lag2","pm25_lag3",
"o3_lag1","o3_lag2","o3_lag3","ars_lag1","ars_lag2","ars_lag3","hr_lag1",
"temp_roll3","temp_roll7","temp_max7","pm25_roll3","pm25_roll7",
"o3_roll3","o3_roll7","ars_roll3","ars_roll7",
"temp_change1","pm25_change1","ars_change1",
"latitude","longitude","month","doy","doy_sin","doy_cos"
)
model_dt <- dt[valid_target == 1]
model_dt <- model_dt[complete.cases(model_dt[, ..feature_cols]) & !is.na(target)]
train <- model_dt[next_date < cutoff]
test <- model_dt[next_date >= cutoff]
cat(sprintf("Train: %d rows (%.1f%% positive)\nTest: %d rows (%.1f%% positive)\n",
nrow(train), 100*mean(train$target), nrow(test), 100*mean(test$target)))## Train: 58358 rows (15.1% positive)
## Test: 17896 rows (15.7% positive)
X_train <- as.matrix(train[, ..feature_cols]); y_train <- train$target
X_test <- as.matrix(test[, ..feature_cols]); y_test <- test$targetWe compare four approaches, from a sanity-check baseline up to gradient boosting:
eval_model <- function(name, prob, y) {
auc <- as.numeric(pROC::auc(pROC::roc(y, prob, quiet = TRUE)))
pr <- PRROC::pr.curve(scores.class0 = prob[y==1], scores.class1 = prob[y==0])$auc.integral
pred <- as.integer(prob >= 0.5)
tp<-sum(pred==1&y==1); fp<-sum(pred==1&y==0); fn<-sum(pred==0&y==1); tn<-sum(pred==0&y==0)
prec<-ifelse(tp+fp>0,tp/(tp+fp),0); rec<-ifelse(tp+fn>0,tp/(tp+fn),0)
f1<-ifelse(prec+rec>0,2*prec*rec/(prec+rec),0)
data.table(Model=name, ROC_AUC=auc, PR_AUC=pr, Precision=prec, Recall=rec, F1=f1,
TP=tp, FP=fp, FN=fn, TN=tn)
}# (a) Naive persistence: "tomorrow looks like today"
res_persist <- eval_model("Persistence (baseline)", as.numeric(test$high_risk), y_test)# (b) Logistic regression
glm_fit <- glm(target ~ ., data = as.data.frame(train[, c(feature_cols,"target"), with=FALSE]),
family = binomial())
prob_glm <- predict(glm_fit, as.data.frame(test[, ..feature_cols]), type = "response")
res_glm <- eval_model("Logistic Regression", prob_glm, y_test)# (c) Random forest
rf_fit <- ranger(x = as.data.frame(X_train), y = factor(y_train),
probability = TRUE, num.trees = 300, importance = "impurity",
seed = 7001, num.threads = 0)
prob_rf <- predict(rf_fit, as.data.frame(X_test))$predictions[, "1"]
res_rf <- eval_model("Random Forest", prob_rf, y_test)# (d) XGBoost (main model). scale_pos_weight handles the 15% class imbalance and
# deliberately biases toward recall — for an early-warning system a missed
# dangerous day is worse than a false alarm.
spw <- sum(y_train==0) / sum(y_train==1)
dtrain <- xgb.DMatrix(X_train, label = y_train)
dtest <- xgb.DMatrix(X_test, label = y_test)
params <- list(objective="binary:logistic", eval_metric="aucpr",
eta=0.05, max_depth=5, subsample=0.8, colsample_bytree=0.8,
min_child_weight=5, scale_pos_weight=spw)
xgb_fit <- xgb.train(params, dtrain, nrounds=600,
evals=list(train=dtrain, test=dtest),
early_stopping_rounds=40, verbose=0)
prob_xgb <- predict(xgb_fit, dtest)
res_xgb <- eval_model("XGBoost", prob_xgb, y_test)results <- rbindlist(list(res_persist, res_glm, res_rf, res_xgb))
knitr::kable(results[, .(Model, ROC_AUC, PR_AUC, Precision, Recall, F1)],
digits = 3, caption = "Performance on the held-out future test set")| Model | ROC_AUC | PR_AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Persistence (baseline) | 0.884 | 0.705 | 0.804 | 0.803 | 0.804 |
| Logistic Regression | 0.971 | 0.890 | 0.842 | 0.782 | 0.811 |
| Random Forest | 0.977 | 0.901 | 0.851 | 0.775 | 0.811 |
| XGBoost | 0.978 | 0.904 | 0.685 | 0.922 | 0.786 |
Every learned model beats the naive persistence baseline. The jump in PR-AUC (from ~0.71 to ~0.90) is the important one — on an imbalanced problem like this, PR-AUC is a fairer measure than accuracy, and it shows the models add real forecasting value rather than just echoing today’s label.
roc_glm <- pROC::roc(y_test, prob_glm, quiet=TRUE)
roc_rf <- pROC::roc(y_test, prob_rf, quiet=TRUE)
roc_xgb <- pROC::roc(y_test, prob_xgb, quiet=TRUE)
plot(roc_xgb, col="#d7301f", lwd=2.5, main="ROC curves (test set)")
plot(roc_rf, col="#2c7fb8", lwd=2, add=TRUE)
plot(roc_glm, col="#41ab5d", lwd=2, add=TRUE)
legend("bottomright", bty="n",
legend=c(sprintf("XGBoost (%.3f)", res_xgb$ROC_AUC),
sprintf("Random Forest (%.3f)", res_rf$ROC_AUC),
sprintf("Logistic (%.3f)", res_glm$ROC_AUC)),
col=c("#d7301f","#2c7fb8","#41ab5d"), lwd=2)cm <- data.table(
Actual = factor(c("Normal","Normal","High-Risk","High-Risk"), levels=c("Normal","High-Risk")),
Predicted = factor(c("Normal","High-Risk","Normal","High-Risk"), levels=c("Normal","High-Risk")),
N = c(res_xgb$TN, res_xgb$FP, res_xgb$FN, res_xgb$TP))
ggplot(cm, aes(Predicted, Actual, fill = N)) +
geom_tile() + geom_text(aes(label = N), size = 6) +
scale_fill_gradient(low = "#f0f0f0", high = "#3182bd", guide = "none") +
labs(title = "XGBoost confusion matrix (test, threshold 0.5)")imp <- xgb.importance(model = xgb_fit)
ggplot(head(imp, 15), aes(reorder(Feature, Gain), Gain)) +
geom_col(fill = "#d95f0e") + coord_flip() +
labs(title = "XGBoost feature importance (gain)", x = NULL, y = "Gain")shap <- predict(xgb_fit, dtest, predcontrib = TRUE)
shap <- shap[, colnames(shap) != "BIAS", drop = FALSE]
shap_imp <- data.table(Feature = colnames(shap), mean_abs_shap = colMeans(abs(shap)))[order(-mean_abs_shap)]
knitr::kable(head(shap_imp, 10), digits = 4,
caption = "Mean |SHAP| contribution — average push each feature gives the prediction")| Feature | mean_abs_shap |
|---|---|
| ars_roll7 | 1.1295 |
| acute_risk_score | 1.1259 |
| temperature_celsius | 0.2851 |
| latitude | 0.1866 |
| ars_roll3 | 0.1784 |
| cloud | 0.1266 |
| temp_max7 | 0.1196 |
| air_quality_Carbon_Monoxide | 0.1047 |
| uv_index | 0.1000 |
| air_quality_PM10 | 0.0899 |
The picture is consistent and honest: the strongest signals are the recent risk regime (today’s ARS and its 3- and 7-day rolling averages), with temperature and PM2.5 lead-indicators adding a smaller lift. In plain terms, high-risk conditions persist for several days, and the model sharpens that persistence into a calibrated next-day probability.
A test AUC of 0.98 partly reflects an easy task: telling persistently hot, polluted cities apart from cool, clean ones. Two fairer checks are the AUC computed within each city, and the recall split into persistence days (already high-risk today) versus onset days (normal today, dangerous tomorrow).
te <- copy(test); te[, `:=`(prob = prob_xgb, pred = as.integer(prob_xgb >= 0.5))]
# within-city AUC, only for cities that actually have both classes
wc <- te[, if (uniqueN(target) == 2 && .N >= 20)
.(auc = as.numeric(pROC::auc(pROC::roc(target, prob, quiet = TRUE)))), by = location_name]
pooled <- as.numeric(pROC::auc(pROC::roc(te$target, te$prob, quiet = TRUE)))
cat(sprintf("Pooled AUC: %.3f\n", pooled))## Pooled AUC: 0.978
## Cities that actually vary: 90 of 195
cat(sprintf("Within-city AUC: median %.3f IQR [%.3f, %.3f]\n",
median(wc$auc), quantile(wc$auc, .25), quantile(wc$auc, .75)))## Within-city AUC: median 0.820 IQR [0.730, 0.906]
# onset vs persistence recall
hi <- te[target == 1]; persist <- hi[high_risk == 1]; onset <- hi[high_risk == 0]
cat(sprintf("Persistence events: %d (%.0f%%), recall %.3f\n",
nrow(persist), 100*nrow(persist)/nrow(hi), mean(persist$pred)))## Persistence events: 2252 (80%), recall 0.997
cat(sprintf("Onset events: %d (%.0f%%), recall %.3f\n",
nrow(onset), 100*nrow(onset)/nrow(hi), mean(onset$pred)))## Onset events: 551 (20%), recall 0.615
ggplot(wc, aes(auc)) +
geom_histogram(bins = 25, fill = "#2c7fb8", alpha = .85) +
geom_vline(xintercept = median(wc$auc), linetype = "dashed") +
geom_vline(xintercept = pooled, colour = "#d7301f") +
labs(title = "Within-city AUC (cities that actually have High-Risk days)",
x = "AUC for one city", y = "Number of cities")So the model is near-perfect on continuing spells but only moderate on brand-new onsets. That onset gap, not the headline AUC, is the real limitation here.
The ARS weights and thresholds are design choices, so we re-run the whole pipeline under different settings. If the conclusions barely move, the score is robust rather than fragile to the numbers we picked by hand.
RAWCOLS <- c("temperature_celsius","humidity","pressure_mb","wind_kph","cloud","uv_index","precip_mm",
"air_quality_PM2.5","air_quality_Ozone","air_quality_Nitrogen_dioxide",
"air_quality_Sulphur_dioxide","air_quality_PM10","air_quality_Carbon_Monoxide")
base_dt <- dt[, c("location_name","date","latitude","longitude", RAWCOLS), with = FALSE]
run_setting <- function(syn = 1.0, pct = 0.85, t_thr = 32.0) {
d <- copy(base_dt)
tp <- pmax(0, d$temperature_celsius - t_thr)
d[, acute_risk_score := air_quality_PM2.5/35 + air_quality_Ozone/100 + 1.2*tp +
syn*1.5*(air_quality_PM2.5/35)*tp + syn*1.3*(air_quality_Ozone/100)*tp]
thr2 <- as.numeric(quantile(d[date < cutoff, acute_risk_score], pct))
d[, high_risk := as.integer(acute_risk_score >= thr2)]
setorder(d, location_name, date)
d[, `:=`(next_date = shift(date, 1, type = "lead"),
target = shift(high_risk, 1, type = "lead")), by = location_name]
d[, valid_target := as.integer(!is.na(next_date) & (next_date - date) == 1L)]
d[, `:=`(
temp_lag1=shift(temperature_celsius,1), temp_lag2=shift(temperature_celsius,2), temp_lag3=shift(temperature_celsius,3),
pm25_lag1=shift(air_quality_PM2.5,1), pm25_lag2=shift(air_quality_PM2.5,2), pm25_lag3=shift(air_quality_PM2.5,3),
o3_lag1=shift(air_quality_Ozone,1), o3_lag2=shift(air_quality_Ozone,2), o3_lag3=shift(air_quality_Ozone,3),
ars_lag1=shift(acute_risk_score,1), ars_lag2=shift(acute_risk_score,2), ars_lag3=shift(acute_risk_score,3),
hr_lag1=shift(high_risk,1),
temp_roll3=frollmean(temperature_celsius,3,align="right",na.rm=TRUE), temp_roll7=frollmean(temperature_celsius,7,align="right",na.rm=TRUE),
temp_max7=frollapply(temperature_celsius,7,max,align="right"),
pm25_roll3=frollmean(air_quality_PM2.5,3,align="right",na.rm=TRUE), pm25_roll7=frollmean(air_quality_PM2.5,7,align="right",na.rm=TRUE),
o3_roll3=frollmean(air_quality_Ozone,3,align="right",na.rm=TRUE), o3_roll7=frollmean(air_quality_Ozone,7,align="right",na.rm=TRUE),
ars_roll3=frollmean(acute_risk_score,3,align="right",na.rm=TRUE), ars_roll7=frollmean(acute_risk_score,7,align="right",na.rm=TRUE)
), by = location_name]
d[, `:=`(temp_change1=temperature_celsius-temp_lag1, pm25_change1=air_quality_PM2.5-pm25_lag1,
ars_change1=acute_risk_score-ars_lag1, month=month(date), doy=yday(date),
doy_sin=sin(2*pi*yday(date)/365.25), doy_cos=cos(2*pi*yday(date)/365.25))]
md2 <- d[valid_target == 1]
md2 <- md2[complete.cases(md2[, ..feature_cols]) & !is.na(target)]
tr <- md2[next_date < cutoff]; tev <- md2[next_date >= cutoff]
spw <- sum(tr$target == 0) / sum(tr$target == 1)
bst <- xgb.train(list(objective="binary:logistic", eval_metric="aucpr", eta=0.05, max_depth=5,
subsample=0.8, colsample_bytree=0.8, min_child_weight=5, scale_pos_weight=spw),
xgb.DMatrix(as.matrix(tr[, ..feature_cols]), label = tr$target), nrounds = 400, verbose = 0)
p <- predict(bst, xgb.DMatrix(as.matrix(tev[, ..feature_cols])))
data.table(pos_rate = mean(d$high_risk),
roc_auc = as.numeric(pROC::auc(pROC::roc(tev$target, p, quiet = TRUE))),
pr_auc = PRROC::pr.curve(scores.class0 = p[tev$target==1], scores.class1 = p[tev$target==0])$auc.integral)
}
settings <- list("Baseline"=list(), "Threshold 80th"=list(pct=0.80), "Threshold 90th"=list(pct=0.90),
"Synergy x0.5"=list(syn=0.5), "Synergy x2.0"=list(syn=2.0))
sens <- rbindlist(lapply(names(settings), function(nm) {
r <- do.call(run_setting, settings[[nm]]); r[, Setting := nm]; r }))
setcolorder(sens, "Setting")
knitr::kable(sens, digits = 3,
caption = "Sensitivity of the model to the ARS design choices")| Setting | pos_rate | roc_auc | pr_auc |
|---|---|---|---|
| Baseline | 0.152 | 0.977 | 0.903 |
| Threshold 80th | 0.200 | 0.971 | 0.907 |
| Threshold 90th | 0.103 | 0.983 | 0.894 |
| Synergy x0.5 | 0.151 | 0.977 | 0.903 |
| Synergy x2.0 | 0.152 | 0.977 | 0.903 |
ROC-AUC stays around 0.97 to 0.98 and PR-AUC around 0.89 to 0.91 across every setting, and halving or doubling the synergy barely changes anything. The result is robust to the parts of the score we set by hand, which is the answer to the obvious question of where the weights come from.
This final step saves the trained model and scored predictions so the
Streamlit early-warning dashboard can serve the same
model. R’s xgboost writes a cross-platform JSON booster
that Python’s xgboost can load directly.
EXPORT_ARTIFACTS <- TRUE
if (EXPORT_ARTIFACTS) {
out <- "artifacts"; dir.create(out, showWarnings = FALSE, recursive = TRUE)
xgb.save(xgb_fit, file.path(out, "xgb_model.json"))
jsonlite::write_json(list(feature_names = feature_cols, risk_threshold = risk_thr,
decision_threshold = 0.5, cutoff_date = as.character(cutoff)),
file.path(out, "model_meta.json"), auto_unbox = TRUE, pretty = TRUE)
fwrite(results, file.path(out, "metrics.csv"))
fwrite(imp, file.path(out, "feature_importance.csv"))
scored <- test[, .(location_name, country, latitude, longitude, date, next_date,
today_temp = temperature_celsius, today_pm25 = air_quality_PM2.5,
today_o3 = air_quality_Ozone, today_ars = acute_risk_score,
today_high_risk = high_risk, actual_tomorrow = target)]
scored[, `:=`(pred_prob = prob_xgb, pred_label = as.integer(prob_xgb >= 0.5))]
fwrite(scored, file.path(out, "scored_predictions.csv"))
cat("Artifacts written to ./artifacts\n")
}## Artifacts written to ./artifacts
## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] PRROC_1.4 rlang_1.1.7 pROC_1.19.0.1 ranger_0.18.0
## [5] xgboost_3.2.1.1 corrplot_0.95 ggplot2_4.0.2 lubridate_1.9.5
## [9] data.table_1.18.4
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-4 gtable_0.3.6 jsonlite_2.0.0 dplyr_1.2.0
## [5] compiler_4.5.3 Rcpp_1.1.1 tidyselect_1.2.1 jquerylib_0.1.4
## [9] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.51
## [17] tibble_3.3.1 bslib_0.10.0 pillar_1.11.1 RColorBrewer_1.1-3
## [21] cachem_1.1.0 xfun_0.57 sass_0.4.10 S7_0.2.1
## [25] viridisLite_0.4.3 timechange_0.4.0 cli_3.6.5 withr_3.0.2
## [29] magrittr_2.0.4 digest_0.6.39 grid_4.5.3 lifecycle_1.0.5
## [33] vctrs_0.7.2 evaluate_1.0.5 glue_1.8.0 farver_2.1.2
## [37] rmarkdown_2.30 tools_4.5.3 pkgconfig_2.0.3 htmltools_0.5.9