Statistical models provide a mathematical relationship between hazards and their explanatory factors (e.g., rainfall, slope, land use, population). These models assume data patterns can be described by probability distributions and estimated from historical data.
Regression models are foundational tools for quantifying relationships between environmental variables and predicting outcomes such as flood risk, pollution levels, or temperature anomalies. Different types of regression are suited to different data types and distributions.
Linear regression models the relationship between a set of predictors and a continuous outcome. It assumes linearity and normally distributed residuals.
\[Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_nX_n+\epsilon \] where: \(Y\) is a continuous response variable, \(X_i\) are the predictor variables, \(\beta_i\) are the coefficients, and \(\epsilon\) is the error term (assumed to be normally distributed)
Let’s use R to model ozone concentration (ppb) as a function of solar radiation, wind speed, and temperature using the built-in airquality dataset. Ozone is a key indicator of air quality and a common target in environmental hazard modeling. In the airquality dataset, it represents daily ozone concentration in parts per billion (ppb) measured in New York during the summer of 1973. Understanding what influences ozone levels—such as solar radiation, wind speed, and temperature—can help us assess pollution risks and guide mitigation strategies. Let’s explore how different regression models can be used to analyze and predict ozone levels.
The model is therefore:
\[Ozone=\beta_0+\beta_1* Solar.R+\beta_2*Wind+\beta_3*Temp+\epsilon\] This model assumes a linear relationship between ozone concentration and predictors like solar radiation, wind speed, and temperature.
The following code loads the dataset, removes rows with missing values, and fits a linear model. The summary() function reveals the estimated coefficients, their statistical significance, and the overall model fit. For example, a positive coefficient for Solar.R suggests that higher solar radiation is associated with increased ozone levels—consistent with the chemistry of ozone formation.
data(airquality)
airquality <- na.omit(airquality) # Remove missing values
model_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(model_lm)
##
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
The output of summary(model_lm) includes:
Coefficients Table
This shows the estimated regression coefficient for each predictor. Each coefficient comes with a standard error, a t value, and a p-value (Pr(>|t|)) indicating whether the predictor is statistically significant.
Model Fit Statistics
These include the residual standard error, the multiple and adjusted R-squared, and the F-statistic with its p-value, which together summarize the overall model fit.
Brief Interpretation:
The multiple linear regression model was built to predict Ozone concentration using three predictors: Solar Radiation (Solar.R), Wind speed, and Temperature (Temp) from the airquality dataset. After removing missing values, the model explains approximately 61% of the variability in Ozone levels (Adjusted R² = 0.5948), indicating a reasonably strong fit.
The analysis reveals that Temperature is the most influential factor, showing a strong positive relationship with Ozone. For every 1°F increase in temperature, Ozone levels rise by about 1.65 units, and this effect is highly significant (p < 0.001). Wind speed, on the other hand, has a strong negative effect: an increase of 1 mph in wind speed reduces Ozone by approximately 3.34 units, also highly significant (p < 0.001). Solar Radiation contributes positively but modestly; each unit increase in Solar.R adds about 0.06 units to Ozone, and this effect is statistically significant (p ≈ 0.012).
Overall, the model suggests that hot, sunny, and calm conditions tend to produce higher Ozone levels, while windy conditions help disperse Ozone, reducing its concentration. The F-statistic (54.83, p < 2.2e-16) confirms that the model is highly significant overall.
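Before moving on, it is worth checking the assumptions stated earlier (linearity, normally distributed residuals). A minimal diagnostic sketch using base R only; plot() on an lm object produces the standard four diagnostic panels:
# Hedged sketch: residual diagnostics for the fitted linear model
par(mfrow = c(2, 2))
plot(model_lm)                      # residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
shapiro.test(residuals(model_lm))   # formal check of residual normality
A roughly flat residuals-vs-fitted panel and a near-straight Q-Q line support the linear-model assumptions; strong curvature or heavy tails would argue for a transformation or a different model.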
Logistic regression is used when the dependent variable is categorical, typically binary (e.g., “High Ozone” vs. “Low Ozone”). Instead of predicting actual Ozone concentration, we predict the probability that Ozone exceeds a certain threshold.
For a binary outcome \(Y\in\{0,1\}\): \[\mathrm{logit}(p)=\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1\,Solar.R+\beta_2\,Wind+\beta_3\,Temp \] where: \[p=\frac{1}{1+e^{-(\beta_0+\beta_1\,Solar.R+\beta_2\,Wind+\beta_3\,Temp)}}\]
To perform logistic regression using R:
# Load dataset
data(airquality)
airquality <- na.omit(airquality)
# Create binary outcome
airquality$HighOzone <- ifelse(airquality$Ozone > 80, 1, 0)
# Fit logistic regression
model_logit <- glm(HighOzone ~ Solar.R + Wind + Temp,
family = binomial(link = "logit"),
data = airquality)
summary(model_logit)
##
## Call:
## glm(formula = HighOzone ~ Solar.R + Wind + Temp, family = binomial(link = "logit"),
## data = airquality)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.18556 -0.31794 -0.08178 -0.02388 1.98574
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.541024 6.803148 -2.725 0.00642 **
## Solar.R 0.006571 0.007128 0.922 0.35657
## Wind -0.427943 0.168137 -2.545 0.01092 *
## Temp 0.223966 0.073804 3.035 0.00241 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 91.556 on 110 degrees of freedom
## Residual deviance: 44.828 on 107 degrees of freedom
## AIC: 52.828
##
## Number of Fisher Scoring iterations: 7
# Odds ratios
exp(coef(model_logit))
## (Intercept) Solar.R Wind Temp
## 8.866158e-09 1.006593e+00 6.518487e-01 1.251028e+00
✅ Interpretation
Based on the output above, Temperature is the strongest predictor of exceeding the 80 ppb threshold (p ≈ 0.002): its odds ratio of about 1.25 means each additional °F multiplies the odds of a high-ozone day by roughly 1.25. Wind is protective, with an odds ratio of about 0.65 (p ≈ 0.011), so each additional mph of wind cuts the odds by roughly 35%. Solar radiation is not statistically significant in this model (p ≈ 0.357). The drop in deviance from 91.56 (null) to 44.83 (residual) indicates that the predictors substantially improve the fit.
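To make these odds ratios concrete, we can score a hypothetical day. The predictor values below are illustrative, not taken from the dataset:
# Hedged sketch: predicted probability of a high-ozone day for illustrative conditions
new_day <- data.frame(Solar.R = 250, Wind = 5, Temp = 90)
predict(model_logit, newdata = new_day, type = "response")
A hot, sunny, calm day like this should yield a predicted probability well above that of a cool, windy day, consistent with the signs of the coefficients.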
The ROC curve (Receiver Operating Characteristic) for a logistic regression model is a graphical tool to evaluate how well the model distinguishes between two classes—in this case, High Ozone (1) vs. Not High Ozone (0).
This code loads the pROC package, predicts probabilities from the logistic regression model, computes the ROC curve using roc(), and plots it with AUC displayed in the title. The diagonal gray line represents random guessing (AUC = 0.5). A curve above this line indicates the model performs better than random. The AUC value quantifies classification performance: closer to 1 means excellent discrimination.
# load pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Predict probabilities from logistic regression model
pred <- predict(model_logit, type = "response")
# Compute ROC curve
roc_obj <- roc(airquality$HighOzone, pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve with AUC
plot(roc_obj, col = "blue", lwd = 2,
main = paste("ROC Curve (AUC =", round(auc(roc_obj), 2), ")"))
abline(a = 0, b = 1, lty = 2, col = "gray")
# Find optimal cutoff using Youden's Index
opt_coords <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "youden")
opt_coords
## threshold sensitivity specificity
## 1 0.1372684 1 0.8526316
✅ ROC Curve Analysis
The ROC curve rises steeply toward the upper-left corner, and the AUC printed in the title is well above 0.5, indicating that the model discriminates strongly between high- and low-ozone days.
✅ Optimal Cutoff (Youden’s Index)
From the output, the Youden-optimal threshold is about 0.137: classifying a day as “High Ozone” whenever its predicted probability exceeds 0.137 yields a sensitivity of 1.00 (every high-ozone day is caught) and a specificity of about 0.85 (roughly 15% of ordinary days are falsely flagged).
✅ Implications
Such a low cutoff favors sensitivity over specificity, which is appropriate when missing a high-ozone day (a false negative) is costlier than issuing an unnecessary alert.
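To see what the Youden cutoff implies in practice, here is a small sketch building the confusion matrix at that threshold, reusing pred and opt_coords from above:
# Hedged sketch: classification table at the Youden-optimal cutoff
cutoff <- opt_coords$threshold            # about 0.137
pred_class <- ifelse(pred >= cutoff, 1, 0)
table(Predicted = pred_class, Actual = airquality$HighOzone)
The table makes the sensitivity/specificity trade-off visible: with this cutoff no high-ozone day is missed, at the cost of some false alarms.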
Poisson regression is a type of Generalized Linear Model (GLM) used when the response variable is a count (non-negative integers), such as the number of disease cases, accidents, or environmental events. It assumes that:
Events occur independently of one another.
The mean of the count equals its variance (equidispersion).
The log of the expected count is a linear function of the predictors.
The model relates the expected count \(\lambda\) to predictors using a log link function: \[\ln(\lambda)=\beta_0+\beta_1 X_1+\beta_2 X_2+\dots+\beta_k X_k \] where: \(\lambda\) is the expected count of events (e.g., number of cases), \(X_i\) are the predictor variables, and \(\beta_i\) are the coefficients estimated from the data.
Exponentiating both sides: \[\lambda=e^{\beta_0+\beta_1 X_1+\beta_2 X_2+\dots+\beta_k X_k} \] Interpretation of coefficients:
Each \(\beta_i\) represents the change in the log of expected counts for a one-unit increase in \(X_i\).
\(e^{\beta_i}\) gives the multiplicative effect on expected counts: a value above 1 means the expected count increases (e.g., \(e^{0.04}\approx 1.041\), about a 4.1% increase per unit), while a value below 1 means it decreases.
Let us now try performing Poisson Regression in R:
# -----------------------------
# Poisson Regression on Health-Related Data
# -----------------------------
# Outcome: Number of hospital visits
# Predictors: PM2.5 exposure, age, smoking status
# -----------------------------
# 1. Load libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# 2. Simulate realistic health data with true effects
set.seed(2025)
n <- 300
# Create predictors
pm25 <- runif(n, 10, 60) # PM2.5 levels (µg/m³)
age <- rnorm(n, 45, 12) # Age in years
smoker <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.6, 0.4)))
# Define true relationship:
# log(lambda) = β0 + β1*pm25 + β2*age + β3*(smokerYes)
# PM2.5 and smoking will increase hospital visits
beta0 <- 0.3
beta1 <- 0.04 # each µg/m³ increase in PM2.5 increases expected visits
beta2 <- 0.01 # older individuals tend to visit more
beta3 <- 0.7 # smokers have higher rate of visits
# Compute expected rate (lambda)
lambda <- exp(beta0 + beta1*pm25 + beta2*age + ifelse(smoker == "Yes", beta3, 0))
# Simulate count outcome
visits <- rpois(n, lambda = lambda)
# Combine into data frame
health_data <- data.frame(visits, pm25, age, smoker)
# 3. Fit Poisson regression
poisson_model <- glm(visits ~ pm25 + age + smoker,
data = health_data,
family = poisson(link = "log"))
# 4. Display model summary
summary(poisson_model)
##
## Call:
## glm(formula = visits ~ pm25 + age + smoker, family = poisson(link = "log"),
## data = health_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0094 -0.7239 -0.1111 0.6382 2.7771
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.344688 0.082800 4.163 3.14e-05 ***
## pm25 0.038360 0.001150 33.357 < 2e-16 ***
## age 0.010252 0.001332 7.698 1.38e-14 ***
## smokerYes 0.752765 0.030683 24.534 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2126.05 on 299 degrees of freedom
## Residual deviance: 292.06 on 296 degrees of freedom
## AIC: 1565.5
##
## Number of Fisher Scoring iterations: 4
# 5. Exponentiate coefficients to interpret as rate ratios (IRR)
exp(coef(poisson_model))
## (Intercept) pm25 age smokerYes
## 1.411550 1.039106 1.010305 2.122862
# 6. Check overdispersion
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) / poisson_model$df.residual
cat("\nDispersion statistic:", round(dispersion, 2), "\n")##
## Dispersion statistic: 0.96
# 7. Visualization
health_data$pred <- predict(poisson_model, type = "response")
ggplot(health_data, aes(x = pm25, y = pred, color = smoker)) +
geom_line(linewidth = 1) +
labs(title = "Predicted Hospital Visits by PM2.5 and Smoking Status",
x = "PM2.5 Concentration (µg/m³)",
y = "Predicted Number of Hospital Visits") +
theme_minimal()
✅ Understanding the Coding
The R script demonstrates how to perform a Poisson regression analysis on health-related data, focusing on modeling the number of hospital visits as a function of air pollution exposure (PM2.5), age, and smoking status. Poisson regression is appropriate for count data, where the dependent variable represents the number of events occurring in a given period. This analysis is often used in public health and epidemiology to study how exposure to environmental or behavioral risk factors affects the frequency of disease-related outcomes.
To begin, the necessary R packages are loaded. The dplyr package allows for efficient data manipulation, while ggplot2 is used to create clear and informative visualizations of model results. Although base R can perform regression modeling on its own, these packages improve the workflow and clarity of the outputs.
A synthetic dataset is then generated to illustrate the modeling process. The data consist of 300 individuals, each with three predictor variables: PM2.5 concentration, age, and smoking status. The PM2.5 variable, measured in micrograms per cubic meter (µg/m³), represents the level of fine particulate matter in the air. The age variable follows a normal distribution centered around 45 years old, while smoking status is a categorical variable that classifies individuals as either smokers or non-smokers. The random number seed is set using set.seed(2025) to ensure that the simulated data remain consistent every time the code is executed.
After defining the predictors, the code specifies a true underlying relationship between the predictors and the outcome variable — the number of hospital visits. In Poisson regression, the natural logarithm of the expected count (denoted by λ, or lambda) is modeled as a linear combination of the predictors. This relationship can be expressed as:
\[\log (\lambda)=\beta_0+\beta_1(PM2.5)+\beta_2(Age)+\beta_3(Smoker) \] In the simulation, the model assumes that higher PM2.5 levels and smoking status increase the expected number of hospital visits, while age has a smaller but still positive effect. Once the simulated data are assembled into a data frame, a Poisson regression model is fitted using the glm() function with a Poisson family and a log link. This function estimates the regression coefficients that best describe the relationship between the predictors and the outcome variable. The model summary provides estimates of each coefficient, standard errors, z-values, and p-values. Predictors with p-values less than 0.05 are considered statistically significant, meaning they have a meaningful association with the number of hospital visits.
To make the coefficients easier to interpret, they are exponentiated using the exp() function. This transformation converts the log-scale coefficients into Incidence Rate Ratios (IRRs). The IRR expresses how the expected count changes with a one-unit increase in the predictor. For example, an IRR of 1.05 for PM2.5 means that for each additional µg/m³ of PM2.5, the expected number of hospital visits increases by 5%. Similarly, an IRR of 2.0 for smoking indicates that smokers are expected to have twice as many hospital visits as non-smokers.
The script also includes a diagnostic step to check for overdispersion, a common issue in count data where the variance exceeds the mean. The dispersion statistic is calculated using the Pearson residuals. A value close to one suggests that the Poisson model fits the data well, whereas a value substantially greater than one indicates overdispersion, suggesting the need for a quasi-Poisson or negative binomial model.
Finally, the code generates a visualization of the predicted number of hospital visits across PM2.5 levels, grouped by smoking status. Using ggplot2, the plot shows that as PM2.5 concentration increases, the predicted hospital visits also rise. The curve for smokers lies consistently above that of non-smokers, illustrating that smokers are expected to experience more hospital visits across all pollution levels. This visualization effectively communicates the model’s findings and highlights the combined effects of environmental and behavioral risk factors.
✅ Interpretation
The Poisson regression analysis aimed to determine how air pollution exposure (PM2.5), age, and smoking status influence the number of hospital visits. The results show that all three predictors are statistically significant at the 0.001 level, as evidenced by their very small p-values. This indicates that PM2.5 concentration, age, and smoking status each have a meaningful effect on the expected number of hospital visits. The estimated coefficients, when exponentiated to obtain incidence rate ratios (IRRs), provide a clearer interpretation of their effects on hospital visit counts.
The coefficient for PM2.5 is 0.03836, corresponding to an incidence rate ratio of 1.039. This means that for every 1 µg/m³ increase in PM2.5 concentration, the expected number of hospital visits increases by approximately 3.9%, holding age and smoking status constant. The coefficient for age is 0.01025, which translates to an IRR of 1.010, indicating that for each additional year of age, the expected number of hospital visits increases by about 1.0%, assuming other factors remain constant. Smoking status also shows a strong effect, with a coefficient of 0.7528 and an IRR of 2.123. This suggests that smokers have more than twice the expected rate of hospital visits compared to non-smokers, after adjusting for age and air pollution exposure. The intercept of 0.3447 exponentiates to about 1.41, which is the baseline expected number of hospital visits when PM2.5 and age are zero and the individual is a non-smoker (an extrapolation, since zero PM2.5 and age lie outside the observed data).
In terms of model fit, the residual deviance (292.06 on 296 degrees of freedom) is substantially lower than the null deviance (2126.05 on 299 degrees of freedom), indicating that the inclusion of the predictor variables greatly improves the model’s explanatory power. The dispersion parameter is approximately equal to one, suggesting that the Poisson regression assumptions hold and that there is no significant overdispersion in the data. The overall model is therefore statistically adequate and well-fitted to the simulated health dataset.
Overall, the findings indicate that higher air pollution exposure, older age, and smoking behavior all contribute significantly to the increased frequency of hospital visits. Specifically, hospital visits tend to rise with increasing PM2.5 concentration and age, and are considerably higher among smokers. These results are consistent with public health evidence linking both environmental and behavioral factors to adverse health outcomes. The analysis highlights the importance of addressing air quality management and smoking cessation as key components of strategies to reduce hospital admissions and improve population health.
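To report uncertainty alongside these IRRs, the coefficient confidence intervals can be exponentiated as well; a minimal sketch using Wald-type intervals:
# Hedged sketch: 95% Wald confidence intervals on the IRR scale
exp(confint.default(poisson_model))
An IRR interval that excludes 1 corresponds to a predictor that is significant at the 5% level.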
A dispersion statistic (also called dispersion parameter or ratio) of 0.96 indicates that the variance of the data is very close to the mean, suggesting no significant overdispersion or underdispersion.
In the context of a Poisson regression model, this value (≈1) means that the Poisson assumption of equal mean and variance holds, and there is no need to use a quasi-Poisson or negative binomial model.
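Beyond the Pearson-residual ratio computed above, a formal test of overdispersion is also available; a hedged sketch assuming the AER package (loaded later in this chapter) is installed:
# Hedged sketch: score test of the Poisson equidispersion assumption
library(AER)
dispersiontest(poisson_model, alternative = "greater")
A non-significant result here would be consistent with the dispersion statistic of 0.96, supporting the plain Poisson model.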
Negative Binomial regression is an extension of Poisson regression for modeling count data when there is overdispersion (variance > mean).
Poisson assumption:
\[\text{Var}(Y) = \text{E}(Y)\] If this is violated (common in real data), Poisson underestimates variability → inflated Type I errors.
Negative Binomial assumption: \[\text{Var}(Y) = \mu + \alpha \mu^2\] Here, \(\alpha\) is an extra dispersion parameter that accounts for overdispersion.
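The extra dispersion term is easy to see numerically. A tiny sketch comparing the Poisson and Negative Binomial variance at the same mean, noting that in the parameterization used by MASS::glm.nb the variance is \(\mu + \mu^2/\theta\), i.e., \(\alpha = 1/\theta\); the values below are illustrative:
# Hedged sketch: variance implied by Poisson vs Negative Binomial at the same mean
mu <- 10; theta <- 1.42                       # theta value illustrative
c(poisson_var = mu, nb_var = mu + mu^2 / theta)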
The Model for NB Regression is given by: \[Y_i \sim \mathrm{NB}(\mu_i,\theta)\]
with \[ \log(\mu_i)=\beta_0+\beta_1X_{1i}+\dots+\beta_kX_{ki}\] ✅ Why use NB instead of Poisson?
When counts are overdispersed, the Poisson model understates standard errors, making predictors look more significant than they are. The Negative Binomial model estimates the extra dispersion parameter \(\theta\), yielding valid standard errors and a better fit to heavy-tailed count data.
#load packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(AER)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(ggplot2)
library(dplyr)
# -------------------------------
# 1. Simulate environmental health data
# -------------------------------
set.seed(123)
n <- 120 # 120 local observations (e.g., barangays or cities)
PM25 <- runif(n, 5, 80) # PM2.5 concentration (μg/m³)
temperature <- runif(n, 20, 35) # Average daily temperature (°C)
population_density <- runif(n, 500, 5000) # people per km²
# Generate expected count (mean structure)
lambda <- exp(0.5 + 0.03 * PM25 - 0.02 * temperature + 0.0002 * population_density)
# Generate counts with overdispersion (respiratory illness cases)
resp_illness <- rnbinom(n, mu = lambda, size = 1.5)
# Combine into a dataframe
env_data <- data.frame(resp_illness, PM25, temperature, population_density)
# View first few rows
head(env_data)
## resp_illness PM25 temperature population_density
## 1 0 26.568314 29.71840 2960.718
## 2 2 64.122885 24.79731 3480.429
## 3 1 35.673269 24.61580 1272.643
## 4 28 71.226305 23.29651 3348.749
## 5 15 75.535046 25.54233 1903.414
## 6 4 8.416737 34.76329 3760.495
# -------------------------------
# 2. Fit Negative Binomial Regression Model
# -------------------------------
nb_model <- glm.nb(resp_illness ~ PM25 + temperature + population_density, data = env_data)
# Model summary
summary(nb_model)
##
## Call:
## glm.nb(formula = resp_illness ~ PM25 + temperature + population_density,
## data = env_data, init.theta = 1.421581897, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5654 -1.0776 -0.2427 0.4702 2.5748
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2453691 0.6831448 1.823 0.0683 .
## PM25 0.0276490 0.0040886 6.762 1.36e-11 ***
## temperature -0.0367009 0.0221787 -1.655 0.0980 .
## population_density 0.0001332 0.0000643 2.071 0.0384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.4216) family taken to be 1)
##
## Null deviance: 190.89 on 119 degrees of freedom
## Residual deviance: 136.38 on 116 degrees of freedom
## AIC: 699.42
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.422
## Std. Err.: 0.241
##
## 2 x log-likelihood: -689.416
# -------------------------------
# 3. Check for overdispersion
# -------------------------------
dispersion_stat <- sum(residuals(nb_model, type = "pearson")^2) / nb_model$df.residual
cat("Dispersion statistic:", round(dispersion_stat, 2), "\n")## Dispersion statistic: 0.97
# -------------------------------
# 4. Compare with Poisson Regression
# -------------------------------
poisson_model <- glm(resp_illness ~ PM25 + temperature + population_density,
data = env_data, family = "poisson")
library(lmtest)
lrtest(poisson_model, nb_model)
## Likelihood ratio test
##
## Model 1: resp_illness ~ PM25 + temperature + population_density
## Model 2: resp_illness ~ PM25 + temperature + population_density
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -503.47
## 2 5 -344.71 1 317.52 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -------------------------------
# 5. Predicted values and visualization
# -------------------------------
env_data <- env_data %>%
mutate(predicted = predict(nb_model, type = "response"))
ggplot(env_data, aes(x = PM25, y = resp_illness)) +
geom_point(alpha = 0.6) +
geom_line(aes(y = predicted), color = "darkgreen", linewidth = 1.2) +
labs(
title = "Negative Binomial Regression: PM2.5 and Respiratory Illness Cases",
x = "PM2.5 Concentration (μg/m³)",
y = "Number of Respiratory Illness Cases"
) +
theme_minimal()
✅ Interpretation
The Negative Binomial Regression model was used to examine the relationship between the number of respiratory illness cases and selected environmental factors, including PM2.5 concentration, temperature, and population density. The model was fitted using a log link function, and the estimated dispersion parameter (θ = 1.422) indicated the presence of moderate overdispersion in the data, justifying the choice of the Negative Binomial model over the standard Poisson regression.
For every 1 µg/m³ increase in PM2.5, the expected number of respiratory illness cases increases by about 2.8%, holding other factors constant (computed from exp(0.0276) = 1.028).
For every 1°C increase in temperature, the expected number of respiratory illness cases decreases by about 3.6%, though the effect is not statistically significant (p = 0.098) (computed from exp(-0.0367) = 0.964).
For every additional person per km², the expected number of respiratory illness cases increases by about 0.013%, holding other factors constant (computed from exp(0.000132) = 1.00013).
The coefficients in the Negative Binomial model can be interpreted in terms of rate ratios by exponentiating the parameter estimates. The exponentiated coefficients (IRRs) indicate the multiplicative change in the expected number of respiratory illness cases for each one-unit increase in the predictor variable. Specifically, each 1 µg/m³ increase in PM2.5 concentration is associated with a 2.8% rise in the expected number of respiratory illness cases, holding temperature and population density constant. Conversely, each 1°C increase in temperature is associated with an estimated 3.6% reduction in the expected number of cases, although this effect was not statistically significant at the 0.05 level. Meanwhile, higher population density corresponds to a small but significant increase in respiratory illness cases, with an IRR of approximately 1.00013 per unit increase in population per square kilometer.
✅ Overdispersion
The dispersion statistic of 0.97 suggests that the variance of the data is approximately equal to its mean, indicating that there is no substantial overdispersion or underdispersion in the model. In other words, the Negative Binomial regression adequately accounts for the variability in the count data, and the assumption of equidispersion typically required in Poisson regression is not severely violated. A dispersion statistic close to 1 generally signifies a well-fitting model and supports the appropriateness of the chosen distribution.
To further evaluate whether the Negative Binomial model provides a better fit than the simpler Poisson model, a likelihood ratio test was conducted. The test compared the Poisson regression model (Model 1) and the Negative Binomial regression model (Model 2), both containing the same predictors: PM2.5 concentration, temperature, and population density. The results yielded a chi-square value of 317.52 with 1 degree of freedom and a p-value less than 0.001, indicating a highly significant difference in model fit.
This result means that the Negative Binomial regression model fits the data significantly better than the Poisson model. The significant improvement in model fit suggests that the Negative Binomial model’s inclusion of an additional dispersion parameter successfully captures extra variability that the Poisson model cannot. Therefore, despite the dispersion statistic being close to 1, the Negative Binomial regression remains the more appropriate choice for this dataset, providing a more robust and flexible framework for analyzing count data with potential overdispersion.
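Information criteria tell the same story as the likelihood ratio test; a quick check:
# Hedged sketch: compare the Poisson and Negative Binomial fits by AIC (lower is better)
AIC(poisson_model, nb_model)
The Negative Binomial model's extra dispersion parameter costs one degree of freedom but buys a much higher log-likelihood, so its AIC is substantially lower.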
In the world of statistics, most models are designed to describe the average behavior of a system. But what happens when we need to understand the extremes—those rare, high-impact events like devastating floods, intense typhoons, or record-breaking heatwaves? This is where Extreme Value Theory (EVT) comes in.
EVT provides a framework for modeling the behavior of the tails of a distribution—where the most extreme values lie. These are the events that, while rare, can have catastrophic consequences. By focusing on these extremes, EVT helps engineers, scientists, and policymakers make informed decisions in the face of uncertainty.
Let’s explore two widely used EVT approaches: the Block Maxima method and the Peaks Over Threshold (POT) method.
The Block Maxima method involves dividing a time series into fixed intervals—typically years—and extracting the maximum value from each block. These maxima are then modeled using the Generalized Extreme Value (GEV) distribution.
The Generalized Extreme Value (GEV) distribution models block maxima and has the following cumulative distribution function (CDF): \[G(x) = \exp \left\{ - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\}, \quad \text{for } 1 + \xi \left( \frac{x - \mu}{\sigma} \right) > 0\] where \(\mu\) is the location parameter, \(\sigma > 0\) is the scale parameter, and \(\xi\) is the shape parameter governing tail behavior.
Return Level Formula: For a return period \(T\), the return level \(x_T\) is: \[x_T = \mu + \frac{\sigma}{\xi} \left[ \left( -\log \left( 1 - \frac{1}{T} \right) \right)^{-\xi} - 1 \right]\]
✅ Why use it: This approach is simple and intuitive. It’s especially useful when data is naturally grouped into periods (e.g., annual maximum rainfall).
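The return-level formula above is simple enough to implement directly; a minimal sketch (gev_rl is our own helper name), with the Gumbel limit \(\xi \to 0\) handled separately:
# Hedged sketch: T-year return level from GEV parameters
gev_rl <- function(mu, sigma, xi, T) {
  yp <- -log(1 - 1 / T)                                # -log of non-exceedance probability
  if (abs(xi) < 1e-8) mu - sigma * log(yp) else mu + (sigma / xi) * (yp^(-xi) - 1)
}
gev_rl(82.879, 9.675, -0.0294, 100)                    # about 124.5, matching qgev() below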
We simulated 30 years of daily rainfall data. For each year, we extracted the maximum daily rainfall and fitted a GEV distribution to these 30 values.
✅ GEV Parameters: the fit below estimates a location of about 82.88, a scale of about 9.68, and a shape of about -0.03 (a near-Gumbel, slightly bounded upper tail).
Using this model, we estimated the 100-year return level, the rainfall amount expected to be exceeded on average once every 100 years:
Estimated 100-year return level (GEV): 124.51 mm
This value is critical for designing infrastructure like flood control systems, ensuring they can withstand rare but severe rainfall events.
We will demonstrate the use of this method using R. In this approach, we aim to model the maximum rainfall observed in each year. This is useful when we want to estimate the intensity of rare events like the 100-year rainfall, which is critical for infrastructure planning.
# Load the required package
library(evd)
##
## Attaching package: 'evd'
## The following object is masked from 'package:car':
##
## subsets
# Simulate 30 years of daily rainfall data
set.seed(42)
rainfall <- rgamma(365 * 30, shape = 2, scale = 10)
# Create a time series with dates
dates <- seq.Date(from = as.Date("1990-01-01"), by = "day", length.out = length(rainfall))
rain_data <- data.frame(Date = dates, Rainfall = rainfall)
# Extract annual maxima
rain_data$Year <- format(rain_data$Date, "%Y")
annual_max <- aggregate(Rainfall ~ Year, data = rain_data, max)
# Fit a GEV distribution to the annual maxima
gev_fit <- fgev(annual_max$Rainfall)
# Check the estimated parameters
print(gev_fit)
##
## Call: fgev(x = annual_max$Rainfall)
## Deviance: 229.865
##
## Estimates
## loc scale shape
## 82.87868 9.67521 -0.02941
##
## Standard Errors
## loc scale shape
## 1.9823 1.4202 0.1308
##
## Optimization Information
## Convergence: successful
## Function Evaluations: 15
## Gradient Evaluations: 8
# Extract GEV parameters
params <- gev_fit$estimate
loc <- params["loc"]
scale <- params["scale"]
shape <- params["shape"]
# Estimate the 100-year return level using qgev()
return_period <- 100
prob <- 1 - 1 / return_period
gev_return_level <- qgev(prob, loc = loc, scale = scale, shape = shape)
# Display the result
cat("Estimated 100-year return level (GEV):", gev_return_level, "mm\n")## Estimated 100-year return level (GEV): 124.5071 mm
✅ Interpretation:
The analysis using the Block Maxima approach and the Generalized Extreme Value (GEV) distribution shows that the 100-year return level for daily rainfall is approximately 124.5 mm. This means that, statistically, a daily rainfall of about 124.5 mm or more has a 1% chance of occurring in any given year.
✅ What Does This Mean for Policy?
Flood Control Infrastructure Design: Drainage systems, culverts, and floodways in Davao del Sur should be designed to handle at least 124 mm of rainfall in one day. Current infrastructure standards should be reviewed to ensure they meet or exceed this threshold.
The POT approach takes a different route. Instead of using only one value per year, it considers all values that exceed a high threshold. These exceedances are modeled using the Generalized Pareto Distribution (GPD), which is well-suited for modeling the tail of a distribution.
The Generalized Pareto Distribution (GPD) models exceedances over a threshold \(u\). Its CDF is: \[H(y) = 1 - \left( 1 + \xi \frac{y}{\beta} \right)^{-1/\xi}, \quad \text{for } y > 0, \; 1 + \xi \frac{y}{\beta} > 0\] where \(y = x - u\) is the exceedance above the threshold, \(\beta > 0\) is the scale parameter, and \(\xi\) is the shape parameter.
Return Level Formula: For a return period \(T\) (in years), the return level \(x_T\) is: \[x_T = u + \frac{\beta}{\xi} \left[ \left( \frac{T \cdot \lambda_u}{n} \right)^{\xi} - 1 \right]\] where \(u\) is the threshold, \(\lambda_u\) is the total number of exceedances, and \(n\) is the number of years of record (so \(\lambda_u / n\) is the average number of exceedances per year).
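The formula translates directly into R; a minimal sketch (pot_rl is our own helper name) following the definitions above:
# Hedged sketch: T-year return level from GPD parameters
# u = threshold, beta = scale, xi = shape,
# lambda_u = total exceedances, n = years of record
pot_rl <- function(u, beta, xi, lambda_u, n, T) {
  u + (beta / xi) * ((T * lambda_u / n)^xi - 1)
}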
Let us now try to use R to perform a Peaks Over Threshold analysis:
The Peaks Over Threshold (POT) method is a powerful technique in Extreme Value Theory for modeling rare events. Unlike the Block Maxima approach, which uses only one extreme value per block (e.g., per year), POT leverages all observations that exceed a high threshold, making it more data-efficient and often more accurate for estimating tail behavior.
We use the evir package because it provides functions for fitting the Generalized Pareto Distribution (GPD), which is the cornerstone of the POT approach.
Here, we simulate 30 years of daily rainfall using a gamma distribution, which is commonly used for rainfall modeling due to its skewed nature.
We choose the 95th percentile as our threshold. This means we only consider the top 5% of rainfall values—those that represent extreme events.
The gpd() function fits a Generalized Pareto Distribution to the exceedances (values above the threshold). This model estimates two key parameters: the shape \(\xi\), which controls tail heaviness, and the scale \(\beta\), which sets the spread of the exceedances.
We extract the estimated shape and scale parameters for use in return level calculations.
Here’s what happens next:
# Load the required package
library(evir)
##
## Attaching package: 'evir'
## The following objects are masked from 'package:evd':
##
## dgev, dgpd, pgev, pgpd, qgev, qgpd, rgev, rgpd
## The following object is masked from 'package:ggplot2':
##
## qplot
# Simulate 30 years of daily rainfall data
set.seed(42)
rainfall <- rgamma(365 * 30, shape = 2, scale = 10)
# Set a high threshold (e.g., 95th percentile)
threshold <- quantile(rainfall, 0.95)
# Fit a Generalized Pareto Distribution (GPD) to exceedances
gpd_fit <- gpd(rainfall, threshold)
# Display fitted parameters
print(gpd_fit)
## $n
## [1] 10950
##
## $data
## [1] 49.90563 48.07191 50.68208 50.65502 50.32085 69.39245 54.53628
## [8] 65.04813 59.33251 52.18695 73.31518 51.62654 49.68325 55.02293
## [15] 50.41828 80.61509 49.11433 49.48407 52.19094 59.88456 49.27391
## [22] 71.47710 50.51012 48.39283 49.26226 48.52232 48.00536 62.04664
## [29] 88.35144 53.69513 49.18208 64.59729 48.14041 50.16108 59.88161
## [36] 49.51229 50.97162 97.46599 48.04895 56.29928 57.23982 56.88218
## [43] 49.56769 48.14958 52.08306 52.67508 65.62312 85.09912 77.68062
## [50] 64.58845 62.81671 57.43431 80.70038 74.03711 53.01155 70.94464
## [57] 55.47996 60.34507 71.65542 51.97645 53.01872 76.73281 73.51216
## [64] 58.48059 71.91544 59.34209 54.94885 61.42294 50.99258 49.53936
## [71] 47.96985 64.56054 58.43218 89.07708 54.68868 50.28323 65.57784
## [78] 49.91317 68.49850 56.53979 69.97545 54.12988 55.27103 54.23011
## [85] 57.62203 51.18808 71.59163 70.90420 51.87155 51.11373 48.03498
## [92] 50.69159 55.75716 75.01878 80.41250 65.84386 60.10305 72.48566
## [99] 49.54129 56.20696 53.88073 63.64015 58.05894 50.63552 58.92297
## [106] 49.52519 48.97038 48.40739 55.22696 48.89669 55.70808 58.70252
## [113] 76.84796 53.40573 87.38360 65.82558 48.97476 60.93122 83.35521
## [120] 53.75968 48.20487 52.19061 58.13393 59.44614 62.09655 54.84729
## [127] 56.05368 81.16338 78.11718 85.94818 73.94028 48.68340 65.39453
## [134] 53.70451 52.39075 74.49039 54.70566 50.04651 61.96452 86.49718
## [141] 52.09283 49.87677 64.42065 57.39794 83.86975 120.91872 50.97881
## [148] 64.68988 48.75597 56.45648 59.29773 48.46026 59.76278 66.70911
## [155] 56.53383 60.56631 50.63353 70.49610 57.78943 47.91353 57.78856
## [162] 73.30457 49.33362 66.13731 50.84662 50.13851 83.90018 48.38500
## [169] 81.37688 58.13633 56.34522 49.98024 50.06545 50.94306 74.86237
## [176] 49.40702 53.69579 58.23855 53.70114 70.52585 71.53504 52.60123
## [183] 75.85093 65.99190 56.95711 49.58698 58.71730 58.83906 53.08885
## [190] 51.16453 52.22948 76.33864 50.63748 52.91385 52.11756 67.85752
## [197] 49.43335 48.66188 62.95259 63.70572 50.41483 65.08772 56.50637
## [204] 48.12237 48.89345 48.61047 71.22566 48.76542 77.79313 51.06386
## [211] 56.30182 62.74116 55.72260 55.21685 79.52915 52.63002 52.26367
## [218] 62.95628 73.87048 50.74412 56.05197 51.15858 50.66498 48.41490
## [225] 51.62363 52.17084 49.29745 66.27650 60.12550 63.22969 54.77121
## [232] 49.44144 48.90044 67.66717 52.10177 58.12309 50.01990 52.33009
## [239] 67.66255 60.10318 54.70491 56.15465 52.39644 61.21527 48.04684
## [246] 49.12383 64.68280 49.64919 61.14759 74.46350 52.88013 51.51791
## [253] 53.82059 49.07676 59.13943 55.74759 51.76338 48.63269 70.71132
## [260] 73.07760 67.91434 63.04289 48.38532 59.67391 92.73758 61.48822
## [267] 79.36168 57.69450 52.41186 51.67504 59.65309 57.13914 56.93148
## [274] 74.30402 60.91831 53.22317 53.74254 48.95208 57.06636 48.19102
## [281] 61.01394 55.53258 70.14278 48.22120 53.35772 66.75480 48.16007
## [288] 61.32364 93.20988 49.93024 49.44344 75.19806 61.54687 67.15725
## [295] 48.33950 59.76925 59.14192 67.86486 53.39994 50.35839 74.48765
## [302] 72.91309 80.86434 80.53483 50.11210 54.68550 53.65331 57.79060
## [309] 50.26466 53.33234 56.11656 53.76778 59.65750 48.91719 74.87250
## [316] 64.22434 47.99994 52.72858 67.63085 68.05643 75.72558 70.12555
## [323] 48.14672 71.03827 60.05755 61.07221 69.55701 56.09649 56.35918
## [330] 70.94517 50.34499 53.75234 50.20805 50.23179 62.85184 59.14933
## [337] 49.07186 64.49973 52.17065 64.51880 65.16378 90.14549 57.42012
## [344] 51.95806 49.35999 62.92704 51.93877 61.01608 69.58749 70.20678
## [351] 54.14014 59.43998 59.94099 53.89683 60.21209 49.03657 74.95240
## [358] 50.40778 65.22536 78.93285 49.24245 66.12289 63.77510 63.47978
## [365] 86.84906 71.05365 65.65553 64.48246 56.81851 54.89870 101.25277
## [372] 59.32963 56.53586 64.13864 49.13938 117.64628 50.01250 56.62276
## [379] 93.40521 63.17446 49.78645 82.10898 63.84403 60.64903 89.82522
## [386] 64.50939 65.23278 65.55105 47.98169 55.20108 57.76089 50.85459
## [393] 49.06821 50.93817 65.41133 53.73924 49.72285 54.22026 48.46533
## [400] 59.40339 48.38141 53.15646 67.79176 89.79930 56.82187 51.30550
## [407] 49.38870 62.95368 50.40218 58.11662 105.30328 54.98337 52.64390
## [414] 60.77934 53.62388 52.11112 64.20793 53.57967 55.82376 57.01110
## [421] 92.04829 71.24850 92.97462 88.28282 50.27653 47.92197 80.13976
## [428] 61.22704 75.85520 48.41583 48.53230 62.57858 79.59596 57.98792
## [435] 68.95434 52.17111 55.83189 48.51234 67.45345 61.67315 47.95072
## [442] 50.13816 51.91321 47.97767 57.09205 53.34882 49.96307 49.04737
## [449] 50.14298 73.37632 61.09370 51.06557 81.84370 51.36023 94.41652
## [456] 50.31829 66.83626 54.50446 50.51746 52.04310 48.75853 73.20820
## [463] 71.10532 61.99987 53.48109 57.86714 51.00934 82.94507 50.37888
## [470] 74.37179 50.74321 49.96438 56.09602 65.79587 48.81579 79.24329
## [477] 48.23554 51.51282 73.29775 68.26998 48.48120 94.63052 56.89108
## [484] 48.75480 63.49172 52.08003 56.34847 52.83204 71.35828 61.65076
## [491] 63.26775 49.97338 56.94619 53.11206 55.91253 49.18189 52.07002
## [498] 79.78450 50.82507 73.18272 55.35117 59.33456 73.72666 50.32131
## [505] 52.43354 70.90389 54.09617 56.04411 48.61600 54.49545 48.96794
## [512] 48.53082 54.84244 51.62104 66.37059 63.09079 53.10245 48.88523
## [519] 55.94653 48.08286 62.59760 49.37120 95.02002 84.91988 59.91324
## [526] 65.90787 68.39823 54.01620 58.98157 54.70457 55.42213 52.32153
## [533] 55.65265 84.94166 50.50493 50.88869 53.85233 61.29537 64.55861
## [540] 57.81184 80.84435 59.52917 62.94419 51.39014 48.82550 100.49777
## [547] 49.02316 78.21955
##
## $threshold
## 95%
## 47.90876
##
## $p.less.thresh
## [1] 0.9499543
##
## $n.exceed
## [1] 548
##
## $method
## [1] "ml"
##
## $par.ests
## xi beta
## -0.03367919 12.57196431
##
## $par.ses
## xi beta
## 0.04725699 0.80072651
##
## $varcov
## [,1] [,2]
## [1,] 0.002233223 -0.02882877
## [2,] -0.028828768 0.64116294
##
## $information
## [1] "observed"
##
## $converged
## [1] 0
##
## $nllh.final
## [1] 1916.843
##
## attr(,"class")
## [1] "gpd"
# Extract GPD parameters
xi <- gpd_fit$par.ests["xi"] # Shape parameter
beta <- gpd_fit$par.ests["beta"] # Scale parameter
# Estimate the 100-year return level
n <- length(rainfall) # Total number of observations
m <- sum(rainfall > threshold) # Number of exceedances
return_period <- 100
prob <- 1 - 1 / return_period
# Compute return level using qgpd()
gpd_return_level <- threshold + qgpd(prob, xi = xi, beta = beta) * (n / m)
# Display the result
cat("Estimated 100-year return level (POT/GPD):", gpd_return_level, "mm\n")## Estimated 100-year return level (POT/GPD): 1119.523 mm
✅ This means: taken at face value, the code reports a 100-year return level of about 1119.5 mm, a single-day rainfall with a 1% chance of being exceeded in any given year.
✅ Why Is It So High? The computation above multiplies the GPD quantile by \(n/m\) instead of folding the exceedance rate into the return-level formula, which grossly inflates the estimate. Re-computing with the standard POT formula (see the sketch below) gives a value in the same range as the GEV-based estimate.
✅ Policy Interpretation
If this estimate were based on real data, a return level an order of magnitude above the block-maxima estimate should trigger a review of the threshold choice and the return-level computation before any design decision. Extreme-value outputs feed directly into infrastructure standards, so implausible values must be reconciled, not taken at face value.
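A minimal re-computation under the standard POT formula, reusing the fitted parameters xi and beta extracted above (the remaining variable names are ours); here the exceedance rate enters the expected number of exceedances rather than multiplying the quantile:
# Hedged sketch: standard POT return level with the fitted GPD parameters
u_thr   <- gpd_fit$threshold   # about 47.91 mm (95th percentile)
m_exc   <- gpd_fit$n.exceed    # 548 exceedances
n_years <- 30                  # length of the simulated record
T_yr    <- 100
rl_fixed <- u_thr + (beta / xi) * ((T_yr * m_exc / n_years)^xi - 1)
cat("Re-computed 100-year return level (POT/GPD):", rl_fixed, "mm\n")
On this simulated series the corrected value comes out near 131 mm, in the same range as the GEV-based estimate of 124.5 mm, as expected for two methods applied to the same data.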
Environmental risks—such as floods, droughts, and heatwaves—are often driven by patterns in climate and weather data. Time series analysis helps us understand these patterns and predict future conditions, enabling proactive policy and disaster preparedness.
Idea: Today’s rainfall depends on previous days. Formula:
\[X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + \varepsilon_t\]
Use: Short-term rainfall prediction.
Idea: Current value depends on past shocks (errors). Formula:
\[X_t = \mu + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t\] Use: Smoothing noise in temperature or pollution data.
Combines Autoregressive (AR) and Moving Average (MA) components: \[X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\] where \(\phi_i\) are the AR coefficients, \(\theta_j\) are the MA coefficients, and \(\varepsilon_t\) is white-noise error.
Adds Integration (I) for non-stationary data:
ARIMA(p,d,q):
\[\Delta^d X_t = \phi_1 \Delta^d X_{t-1} + \dots + \phi_p \Delta^d X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\] where \(\Delta^d\) denotes differencing the series \(d\) times to remove trend and achieve stationarity.
Adds Seasonality: \[\text{SARIMA}(p,d,q)(P,D,Q)_s\] Full form: \[\Phi(B^s) \phi(B) \Delta^d \Delta_s^D X_t = \Theta(B^s) \theta(B) \varepsilon_t\] where \(B\) is the backshift operator, \(s\) is the seasonal period (12 for monthly data), \(\phi, \theta\) are the non-seasonal AR/MA polynomials, and \(\Phi, \Theta\) are their seasonal counterparts.
Generate Time Series Data
# Example: Monthly PM2.5 concentration data (simulated)
set.seed(123)
pm25 <- ts(rnorm(120, mean = 35, sd = 10), frequency = 12, start = c(2015, 1))
plot(pm25, main = "Simulated PM2.5 Data", ylab = "PM2.5 (µg/m³)", col = "steelblue")
AR (Autoregressive) Model
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Fit an AR(1) model
ar_model <- arima(pm25, order = c(1, 0, 0))
coeftest(ar_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.011740 0.091589 0.1282 0.898
## intercept 35.152816 0.822706 42.7283 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_ar <- forecast(ar_model, h = 12)
# Plot the forecast
plot(forecast_ar, main = "AR(1) Forecast for PM2.5", ylab = "PM2.5 Concentration", xlab = "Time")
MA (Moving Average) Model
# Fit an MA(1) model
ma_model <- arima(pm25, order = c(0, 0, 1))
coeftest(ma_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 0.013969 0.100130 0.1395 0.889
## intercept 35.152543 0.824390 42.6407 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_ma <- forecast(ma_model, h = 12)
# Plot the forecast
plot(forecast_ma, main = "MA(1) Forecast for PM2.5", ylab = "PM2.5 Concentration", xlab = "Time")
ARMA (Autoregressive Moving Average)
# Fit an ARMA(1,1) model
arma_model <- arima(pm25, order = c(1, 0, 1))
coeftest(arma_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.871914 0.081788 -10.661 < 2.2e-16 ***
## ma1 0.968136 0.048873 19.809 < 2.2e-16 ***
## intercept 35.165608 0.838127 41.957 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_arma <- forecast(arma_model, h = 12)
plot(forecast_arma, main = "ARIMA(1,0,1) Forecast for PM2.5")
ARIMA (Autoregressive Integrated Moving Average)
# Fit an ARIMA(1,1,1) model
arima_model <- arima(pm25, order = c(1, 1, 1))
coeftest(arima_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.020303 0.092379 0.2198 0.826
## ma1 -0.999999 0.030107 -33.2151 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_arima <- forecast(arima_model, h = 12)
plot(forecast_arima, main = "ARIMA(1,1,1) Forecast for PM2.5")
SARIMA (Seasonal ARIMA)
# Fit a Seasonal ARIMA (1,1,1)(1,1,1)[12]
sarima_model <- arima(pm25, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))
coeftest(sarima_model)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.056504 0.098230 0.5752 0.5651459
## ma1 -0.999988 0.134896 -7.4130 1.234e-13 ***
## sar1 -0.172166 0.123349 -1.3958 0.1627860
## sma1 -0.867714 0.225778 -3.8432 0.0001214 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast 12 months ahead
forecast_sarima <- forecast(sarima_model, h = 12)
plot(forecast_sarima, main = "SARIMA(1,1,1)(1,1,1)[12] Forecast for PM2.5")
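Choosing the orders (p,d,q)(P,D,Q) by hand, as above, is instructive, but in practice order selection is usually automated. A hedged sketch using auto.arima() from the forecast package already loaded:
# Hedged sketch: automatic ARIMA order selection by AICc search
auto_model <- auto.arima(pm25, seasonal = TRUE)
summary(auto_model)
plot(forecast(auto_model, h = 12), main = "auto.arima() Forecast for PM2.5")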