Statistical models provide a mathematical relationship between hazards and their explanatory factors (e.g., rainfall, slope, land use, population). These models assume data patterns can be described by probability distributions and estimated from historical data.

Regression Models

Regression models are foundational tools for quantifying relationships between environmental variables and predicting outcomes such as flood risk, pollution levels, or temperature anomalies. Different types of regression are suited to different data types and distributions.

Linear Regression

Linear regression models the relationship between one or more predictors and a continuous outcome. It assumes a linear relationship and normally distributed residuals.

\[Y=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n+\epsilon \] where: \(Y\) is a continuous response variable, \(X_i\) are the predictor variables, \(\beta_i\) are the coefficients, and \(\epsilon\) is the error term (assumed to be normally distributed)

Let’s use R to model ozone concentration (ppb) as a function of solar radiation, wind, and temperature using the built-in airquality dataset. Ozone is a key indicator of air quality and a common target in environmental hazard modeling. In the airquality dataset, it represents daily ozone concentration in parts per billion (ppb) measured in New York during the summer of 1973. Understanding what influences ozone levels, such as solar radiation, wind speed, and temperature, can help us assess pollution risks and guide mitigation strategies. Let’s explore how different regression models can be used to analyze and predict ozone levels.

The model is therefore:

\[Ozone=\beta_0+\beta_1 \cdot Solar.R+\beta_2 \cdot Wind+\beta_3 \cdot Temp+\epsilon\] This model assumes a linear relationship between ozone concentration and predictors like solar radiation, wind speed, and temperature.

The following code loads the dataset, removes rows with missing values, and fits a linear model. The summary() function reveals the estimated coefficients, their statistical significance, and the overall model fit. For example, a positive coefficient for Solar.R suggests that higher solar radiation is associated with increased ozone levels—consistent with the chemistry of ozone formation.

data(airquality)
airquality <- na.omit(airquality)  # Remove missing values
model_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(model_lm)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

The output of summary(model_lm) will include:

Coefficients Table

This shows the estimated regression coefficients for each predictor:

  1. Intercept: Baseline ozone level when all predictors are zero.
  2. Solar.R: Change in ozone per unit increase in solar radiation.
  3. Wind: Change in ozone per unit increase in wind speed.
  4. Temp: Change in ozone per unit increase in temperature.

Each coefficient comes with:

  1. Estimate: The actual value of the coefficient.
  2. Std. Error: Standard error of the estimate.
  3. t value and Pr(>|t|): Used to test if the coefficient is significantly different from zero.

Model Fit Statistics

  1. Residual standard error: How much the observed values deviate from the fitted values.
  2. Multiple R-squared: Proportion of variance in Ozone explained by the model.
  3. Adjusted R-squared: Adjusted for the number of predictors.
  4. F-statistic: Tests if the model as a whole is significant.

Brief Interpretation:

The multiple linear regression model was built to predict Ozone concentration using three predictors: Solar Radiation (Solar.R), Wind speed, and Temperature (Temp) from the airquality dataset. After removing missing values, the model explains approximately 61% of the variability in Ozone levels (Multiple R² = 0.6059; Adjusted R² = 0.5948), indicating a reasonably strong fit.

The analysis reveals that Temperature is the most influential factor, showing a strong positive relationship with Ozone. For every 1°F increase in temperature, Ozone levels rise by about 1.65 units, and this effect is highly significant (p < 0.001). Wind speed, on the other hand, has a strong negative effect: an increase of 1 mph in wind speed reduces Ozone by approximately 3.34 units, also highly significant (p < 0.001). Solar Radiation contributes positively but modestly; each unit increase in Solar.R adds about 0.06 units to Ozone, and this effect is statistically significant (p ≈ 0.012).

Overall, the model suggests that hot, sunny, and calm conditions tend to produce higher Ozone levels, while windy conditions help disperse Ozone, reducing its concentration. The F-statistic (54.83, p < 2.2e-16) confirms that the model is highly significant overall.
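
Before acting on these estimates, it is worth checking the model assumptions. Below is a minimal diagnostic sketch reusing the fitted model_lm from above: plot() on an lm object produces the standard residual diagnostics, and confint() gives 95% confidence intervals for the coefficients.

# Residual diagnostics: residuals vs. fitted, Q-Q plot, scale-location, leverage
par(mfrow = c(2, 2))
plot(model_lm)
par(mfrow = c(1, 1))

# 95% confidence intervals for the regression coefficients
confint(model_lm)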

Logistic Regression

Logistic regression is used when the dependent variable is categorical, typically binary (e.g., “High Ozone” vs. “Low Ozone”). Instead of predicting actual Ozone concentration, we predict the probability that Ozone exceeds a certain threshold.

For a binary outcome \(Y\in\{0,1\}\): \[\text{logit}(p)=\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1 \cdot Solar.R+\beta_2 \cdot Wind+\beta_3 \cdot Temp \] where: \[p=\frac{1}{1+e^{-(\beta_0+\beta_1 \cdot Solar.R+\beta_2 \cdot Wind+\beta_3 \cdot Temp)}}\]

To perform logistic regression using R:

  • We start by loading the airquality dataset and removing missing values.
  • A new variable HighOzone is created to classify days as high ozone (1) if Ozone > 80, otherwise 0.
  • The glm() function fits a logistic regression model with predictors Solar Radiation, Wind, and Temperature.
  • family = binomial(link = "logit") specifies a logistic link for binary outcomes.
  • summary(model_logit) displays coefficients, significance, and model fit.
  • exp(coef(model_logit)) converts log-odds coefficients into odds ratios, making interpretation easier.
# Load dataset
data(airquality)
airquality <- na.omit(airquality)

# Create binary outcome
airquality$HighOzone <- ifelse(airquality$Ozone > 80, 1, 0)

# Fit logistic regression
model_logit <- glm(HighOzone ~ Solar.R + Wind + Temp,
                   family = binomial(link = "logit"),
                   data = airquality)
summary(model_logit)
## 
## Call:
## glm(formula = HighOzone ~ Solar.R + Wind + Temp, family = binomial(link = "logit"), 
##     data = airquality)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.18556  -0.31794  -0.08178  -0.02388   1.98574  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -18.541024   6.803148  -2.725  0.00642 **
## Solar.R       0.006571   0.007128   0.922  0.35657   
## Wind         -0.427943   0.168137  -2.545  0.01092 * 
## Temp          0.223966   0.073804   3.035  0.00241 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 91.556  on 110  degrees of freedom
## Residual deviance: 44.828  on 107  degrees of freedom
## AIC: 52.828
## 
## Number of Fisher Scoring iterations: 7
# Odds ratios
exp(coef(model_logit))
##  (Intercept)      Solar.R         Wind         Temp 
## 8.866158e-09 1.006593e+00 6.518487e-01 1.251028e+00

✅ Narrative Explanation

  • Intercept (8.87e-09): This is extremely small, meaning the baseline odds of high ozone when all predictors are zero is almost zero (not meaningful physically, but part of the model).
  • Solar.R (1.0066): For each unit increase in Solar Radiation, the odds of high ozone increase by about 0.66% (since OR ≈ 1.0066).
  • Wind (0.6518): For each 1 mph increase in wind speed, the odds of high ozone decrease by about 35% (OR < 1 indicates a negative effect).
  • Temp (1.2510): For each 1°F increase in temperature, the odds of high ozone increase by about 25% (OR > 1 indicates a positive effect).

✅ Interpretation

  • Temperature has the strongest positive effect on the likelihood of high ozone.
  • Wind has a strong negative effect, reducing the odds significantly.
  • Solar Radiation has a small positive effect.
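
Before turning to the ROC curve, a quick confusion matrix at the default 0.5 cutoff gives a first look at classification accuracy. A minimal sketch reusing model_logit (the ROC analysis below refines the cutoff choice):

# Predicted probabilities of high ozone for each observation
pred_prob <- predict(model_logit, type = "response")

# Classify at the default 0.5 cutoff and cross-tabulate against the truth
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)
table(Predicted = pred_class, Actual = airquality$HighOzone)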

The ROC curve (Receiver Operating Characteristic) for a logistic regression model is a graphical tool to evaluate how well the model distinguishes between two classes—in this case, High Ozone (1) vs. Not High Ozone (0).

This code loads the pROC package, predicts probabilities from the logistic regression model, computes the ROC curve using roc(), and plots it with AUC displayed in the title. The diagonal gray line represents random guessing (AUC = 0.5). A curve above this line indicates the model performs better than random. The AUC value quantifies classification performance: closer to 1 means excellent discrimination.

# load pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Predict probabilities from logistic regression model
pred <- predict(model_logit, type = "response")

# Compute ROC curve
roc_obj <- roc(airquality$HighOzone, pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve with AUC
plot(roc_obj, col = "blue", lwd = 2,
     main = paste("ROC Curve (AUC =", round(auc(roc_obj), 2), ")"))
abline(a = 0, b = 1, lty = 2, col = "gray")

# Find optimal cutoff using Youden's Index
opt_coords <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "youden")
opt_coords
##   threshold sensitivity specificity
## 1 0.1372684           1   0.8526316

✅ ROC Curve Analysis

  • AUC = 0.95: This indicates excellent model performance. The logistic regression model is highly effective at distinguishing between high ozone days and non-high ozone days.
  • The ROC curve (blue line) is far above the diagonal gray line, which represents random guessing (AUC = 0.5). This confirms the model is much better than chance.

✅ Optimal Cutoff (Youden’s Index) From the output:

  • Threshold = 0.137: Instead of the default 0.5, the best cutoff for classification is 0.137. This means if the predicted probability of high ozone is ≥ 0.137, classify as “High Ozone.”
  • Sensitivity = 1 (100%): The model correctly identifies all actual high ozone cases (no false negatives).
  • Specificity = 0.85 (85%): The model correctly identifies 85% of non-high ozone cases (some false positives remain).

✅ Implications

  • Using the optimal threshold improves overall classification performance compared to the default 0.5.
  • This threshold prioritizes catching all high ozone days (perfect sensitivity), which is useful for environmental alerts, even if it means a few false alarms.
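
As an illustration, the Youden-optimal cutoff can be applied directly to the fitted probabilities; a short sketch reusing pred and opt_coords from above:

# Reclassify using the optimal threshold instead of the default 0.5
pred_class_opt <- ifelse(pred >= opt_coords$threshold, 1, 0)
table(Predicted = pred_class_opt, Actual = airquality$HighOzone)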

Poisson Regression

Poisson regression is a type of Generalized Linear Model (GLM) used when the response variable is a count (non-negative integers), such as the number of disease cases, accidents, or environmental events. It assumes:

  • The response variable \(Y\) follows a Poisson distribution: \(P(Y=y)=\frac{e^{-\lambda}\lambda^y}{y!}\) for \(y=0,1,2,...\)
  • The mean and variance of \(Y\) are equal to \(\lambda\).

The model relates the expected count \(\lambda\) to predictors using a log link function: \[\ln (\lambda)=\beta_0+\beta_1 X_1+\beta_2 X_2+...+\beta_k X_k \] where: \(\lambda\) is the expected count of events (e.g., number of cases), \(X_i\) are the predictor variables, and \(\beta_i\) are the coefficients estimated from the data

Exponentiating both sides: \[\lambda=e^{\beta_0+\beta_1 X_1+\beta_2 X_2+...+\beta_k X_k} \] Interpretation of coefficients:

  • Each \(\beta_i\) represents the change in the log of expected counts for a one-unit increase in \(X_i\).

  • \(e^{\beta_i}\) gives the multiplicative effect on expected counts:

    • If \(e^{\beta_i} = 1.5\), then a one-unit increase in \(X_i\) increases expected counts by 50%.

Let us now try performing Poisson regression in R:

  • The code below simulates a health dataset of 300 individuals in which the response is a count, the number of hospital visits, and the predictors are PM2.5 exposure, age, and smoking status, with known ("true") coefficients built into the simulation.
  • The glm() function fits a Poisson regression model where the response variable is visits and the predictors are pm25, age, and smoker.
  • The argument family = poisson(link = "log") specifies that the model assumes a Poisson distribution for the counts and uses a log link function, meaning the model predicts the log of expected counts.
  • The summary(poisson_model) command displays the estimated coefficients, their significance, and model fit statistics.
  • exp(coef(poisson_model)) exponentiates the coefficients to interpret them as multiplicative effects (incidence rate ratios) on the expected number of visits: values greater than 1 indicate an increase in expected visits, while values less than 1 indicate a decrease.
  • Finally, a dispersion statistic is computed from the Pearson residuals to check for overdispersion, and the fitted rates are plotted against PM2.5 by smoking status.
# -----------------------------
# Poisson Regression on Health-Related Data
# -----------------------------
# Outcome: Number of hospital visits
# Predictors: PM2.5 exposure, age, smoking status
# -----------------------------

# 1. Load libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# 2. Simulate realistic health data with true effects
set.seed(2025)
n <- 300

# Create predictors
pm25 <- runif(n, 10, 60)                     # PM2.5 levels (µg/m³)
age <- rnorm(n, 45, 12)                      # Age in years
smoker <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.6, 0.4)))

# Define true relationship:
# log(lambda) = β0 + β1*pm25 + β2*age + β3*(smokerYes)
# PM2.5 and smoking will increase hospital visits
beta0 <- 0.3
beta1 <- 0.04      # each µg/m³ increase in PM2.5 increases expected visits
beta2 <- 0.01      # older individuals tend to visit more
beta3 <- 0.7       # smokers have higher rate of visits

# Compute expected rate (lambda)
lambda <- exp(beta0 + beta1*pm25 + beta2*age + ifelse(smoker == "Yes", beta3, 0))

# Simulate count outcome
visits <- rpois(n, lambda = lambda)

# Combine into data frame
health_data <- data.frame(visits, pm25, age, smoker)

# 3. Fit Poisson regression
poisson_model <- glm(visits ~ pm25 + age + smoker,
                     data = health_data,
                     family = poisson(link = "log"))

# 4. Display model summary
summary(poisson_model)
## 
## Call:
## glm(formula = visits ~ pm25 + age + smoker, family = poisson(link = "log"), 
##     data = health_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0094  -0.7239  -0.1111   0.6382   2.7771  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.344688   0.082800   4.163 3.14e-05 ***
## pm25        0.038360   0.001150  33.357  < 2e-16 ***
## age         0.010252   0.001332   7.698 1.38e-14 ***
## smokerYes   0.752765   0.030683  24.534  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2126.05  on 299  degrees of freedom
## Residual deviance:  292.06  on 296  degrees of freedom
## AIC: 1565.5
## 
## Number of Fisher Scoring iterations: 4
# 5. Exponentiate coefficients to interpret as rate ratios (IRR)
exp(coef(poisson_model))
## (Intercept)        pm25         age   smokerYes 
##    1.411550    1.039106    1.010305    2.122862
# 6. Check overdispersion
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) / poisson_model$df.residual
cat("\nDispersion statistic:", round(dispersion, 2), "\n")
## 
## Dispersion statistic: 0.96
# 7. Visualization
health_data$pred <- predict(poisson_model, type = "response")

ggplot(health_data, aes(x = pm25, y = pred, color = smoker)) +
  geom_line(linewidth = 1) +
  labs(title = "Predicted Hospital Visits by PM2.5 and Smoking Status",
       x = "PM2.5 Concentration (µg/m³)",
       y = "Predicted Number of Hospital Visits") +
  theme_minimal()

✅ Understanding the Coding

The R script demonstrates how to perform a Poisson regression analysis on health-related data, focusing on modeling the number of hospital visits as a function of air pollution exposure (PM2.5), age, and smoking status. Poisson regression is appropriate for count data, where the dependent variable represents the number of events occurring in a given period. This analysis is often used in public health and epidemiology to study how exposure to environmental or behavioral risk factors affects the frequency of disease-related outcomes.

To begin, the necessary R packages are loaded. The dplyr package allows for efficient data manipulation, while ggplot2 is used to create clear and informative visualizations of model results. Although base R can perform regression modeling on its own, these packages improve the workflow and clarity of the outputs.

A synthetic dataset is then generated to illustrate the modeling process. The data consist of 300 individuals, each with three predictor variables: PM2.5 concentration, age, and smoking status. The PM2.5 variable, measured in micrograms per cubic meter (µg/m³), represents the level of fine particulate matter in the air. The age variable follows a normal distribution centered around 45 years old, while smoking status is a categorical variable that classifies individuals as either smokers or non-smokers. The random number seed is set using set.seed(2025) to ensure that the simulated data remain consistent every time the code is executed.

After defining the predictors, the code specifies a true underlying relationship between the predictors and the outcome variable — the number of hospital visits. In Poisson regression, the natural logarithm of the expected count (denoted by λ, or lambda) is modeled as a linear combination of the predictors. This relationship can be expressed as:

\[\log (\lambda)=\beta_0+\beta_1(PM2.5)+\beta_2(Age)+\beta_3(Smoker) \] In the simulation, the model assumes that higher PM2.5 levels and smoking status increase the expected number of hospital visits, while age has a smaller but still positive effect. Once the simulated data are assembled into a data frame, a Poisson regression model is fitted using the glm() function with a Poisson family and a log link. This function estimates the regression coefficients that best describe the relationship between the predictors and the outcome variable. The model summary provides estimates of each coefficient, standard errors, z-values, and p-values. Predictors with p-values less than 0.05 are considered statistically significant, meaning they have a meaningful association with the number of hospital visits.

To make the coefficients easier to interpret, they are exponentiated using the exp() function. This transformation converts the log-scale coefficients into Incidence Rate Ratios (IRRs). The IRR expresses how the expected count changes with a one-unit increase in the predictor. For example, an IRR of 1.05 for PM2.5 means that for each additional µg/m³ of PM2.5, the expected number of hospital visits increases by 5%. Similarly, an IRR of 2.0 for smoking indicates that smokers are expected to have twice as many hospital visits as non-smokers.
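
The same transformation applies to interval estimates. A short sketch: exponentiating the profile-likelihood confidence limits from confint() yields 95% confidence intervals for the IRRs.

# 95% confidence intervals for the incidence rate ratios
exp(confint(poisson_model))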

The script also includes a diagnostic step to check for overdispersion, a common issue in count data where the variance exceeds the mean. The dispersion statistic is calculated using the Pearson residuals. A value close to one suggests that the Poisson model fits the data well, whereas a value substantially greater than one indicates overdispersion, suggesting the need for a quasi-Poisson or negative binomial model.

Finally, the code generates a visualization of the predicted number of hospital visits across PM2.5 levels, grouped by smoking status. Using ggplot2, the plot shows that as PM2.5 concentration increases, the predicted hospital visits also rise. The curve for smokers lies consistently above that of non-smokers, illustrating that smokers are expected to experience more hospital visits across all pollution levels. This visualization effectively communicates the model’s findings and highlights the combined effects of environmental and behavioral risk factors.

✅ Interpretation

The Poisson regression analysis aimed to determine how air pollution exposure (PM2.5), age, and smoking status influence the number of hospital visits. The results show that all three predictors are statistically significant at the 0.001 level, as evidenced by their very small p-values. This indicates that PM2.5 concentration, age, and smoking status each have a meaningful effect on the expected number of hospital visits. The estimated coefficients, when exponentiated to obtain incidence rate ratios (IRRs), provide a clearer interpretation of their effects on hospital visit counts.

The coefficient for PM2.5 is 0.03836, corresponding to an incidence rate ratio of 1.039. This means that for every 1 µg/m³ increase in PM2.5 concentration, the expected number of hospital visits increases by approximately 3.9%, holding age and smoking status constant. The coefficient for age is 0.01025, which translates to an IRR of 1.010, indicating that for each additional year of age, the expected number of hospital visits increases by about 1.0%, assuming other factors remain constant. Smoking status also shows a strong effect, with a coefficient of 0.7528 and an IRR of 2.123. This suggests that smokers have more than twice the expected rate of hospital visits compared to non-smokers, after adjusting for age and air pollution exposure. The intercept of 0.3447, corresponding to an IRR of 1.4116, represents the baseline expected number of hospital visits when PM2.5 and age are zero and the individual is a non-smoker.

In terms of model fit, the residual deviance (292.06 on 296 degrees of freedom) is substantially lower than the null deviance (2126.05 on 299 degrees of freedom), indicating that the inclusion of the predictor variables greatly improves the model’s explanatory power. The dispersion parameter is approximately equal to one, suggesting that the Poisson regression assumptions hold and that there is no significant overdispersion in the data. The overall model is therefore statistically adequate and well-fitted to the simulated health dataset.
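
One additional check worth knowing: because the residual deviance of a well-fitting Poisson model approximately follows a chi-square distribution on its residual degrees of freedom, a formal goodness-of-fit test takes one line. A sketch using the fitted model:

# Deviance goodness-of-fit test; a small p-value would indicate lack of fit
pchisq(deviance(poisson_model), df.residual(poisson_model), lower.tail = FALSE)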

Overall, the findings indicate that higher air pollution exposure, older age, and smoking behavior all contribute significantly to the increased frequency of hospital visits. Specifically, hospital visits tend to rise with increasing PM2.5 concentration and age, and are considerably higher among smokers. These results are consistent with public health evidence linking both environmental and behavioral factors to adverse health outcomes. The analysis highlights the importance of addressing air quality management and smoking cessation as key components of strategies to reduce hospital admissions and improve population health.

A dispersion statistic (also called dispersion parameter or ratio) of 0.96 indicates that the variance of the data is very close to the mean, suggesting no significant overdispersion or underdispersion.

In the context of a Poisson regression model, this value (≈1) means that the Poisson assumption of equal mean and variance holds, and there is no need to use a quasi-Poisson or negative binomial model.
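
Had the dispersion statistic been well above 1, a common remedy is a quasi-Poisson fit, which keeps the same mean structure but estimates the dispersion instead of fixing it at 1, inflating the standard errors accordingly. A minimal sketch on the same simulated data:

# Quasi-Poisson: identical coefficients, standard errors scaled by the dispersion
qp_model <- glm(visits ~ pm25 + age + smoker,
                data = health_data,
                family = quasipoisson(link = "log"))
summary(qp_model)$dispersion  # estimated dispersion parameter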

Negative Binomial Regression

Negative Binomial regression is an extension of Poisson regression for modeling count data when there is overdispersion (variance > mean).

Poisson assumption:

\[\text{Var}(Y) = \text{E}(Y)\] If this is violated (common in real data), Poisson underestimates variability → inflated Type I errors.

Negative Binomial assumption: \[\text{Var}(Y) = \mu + \alpha \mu^2\] Here, \(\alpha\) is an extra dispersion parameter that accounts for overdispersion.

The model for NB regression is given by: \[Y_i \sim \text{NB}(\mu_i,\theta)\]

with \[\log(\mu_i)=\beta_0+\beta_1 X_{1i}+\dots+\beta_k X_{ki}\]

✅ Why use NB instead of Poisson?

  • Handles extra variability in counts.
  • Produces more accurate standard errors and confidence intervals.
  • Common in epidemiology, ecology, and health studies.
#load packages
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(ggplot2)
library(dplyr)

# -------------------------------
# 1. Simulate environmental health data
# -------------------------------
set.seed(123)

n <- 120  # 120 local observations (e.g., barangays or cities)
PM25 <- runif(n, 5, 80)  # PM2.5 concentration (μg/m³)
temperature <- runif(n, 20, 35)  # Average daily temperature (°C)
population_density <- runif(n, 500, 5000)  # people per km²

# Generate expected count (mean structure)
lambda <- exp(0.5 + 0.03 * PM25 - 0.02 * temperature + 0.0002 * population_density)

# Generate counts with overdispersion (respiratory illness cases)
resp_illness <- rnbinom(n, mu = lambda, size = 1.5)

# Combine into a dataframe
env_data <- data.frame(resp_illness, PM25, temperature, population_density)

# View first few rows
head(env_data)
##   resp_illness      PM25 temperature population_density
## 1            0 26.568314    29.71840           2960.718
## 2            2 64.122885    24.79731           3480.429
## 3            1 35.673269    24.61580           1272.643
## 4           28 71.226305    23.29651           3348.749
## 5           15 75.535046    25.54233           1903.414
## 6            4  8.416737    34.76329           3760.495
# -------------------------------
# 2. Fit Negative Binomial Regression Model
# -------------------------------
nb_model <- glm.nb(resp_illness ~ PM25 + temperature + population_density, data = env_data)

# Model summary
summary(nb_model)
## 
## Call:
## glm.nb(formula = resp_illness ~ PM25 + temperature + population_density, 
##     data = env_data, init.theta = 1.421581897, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5654  -1.0776  -0.2427   0.4702   2.5748  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.2453691  0.6831448   1.823   0.0683 .  
## PM25                0.0276490  0.0040886   6.762 1.36e-11 ***
## temperature        -0.0367009  0.0221787  -1.655   0.0980 .  
## population_density  0.0001332  0.0000643   2.071   0.0384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.4216) family taken to be 1)
## 
##     Null deviance: 190.89  on 119  degrees of freedom
## Residual deviance: 136.38  on 116  degrees of freedom
## AIC: 699.42
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.422 
##           Std. Err.:  0.241 
## 
##  2 x log-likelihood:  -689.416
# -------------------------------
# 3. Check for overdispersion
# -------------------------------
dispersion_stat <- sum(residuals(nb_model, type = "pearson")^2) / nb_model$df.residual
cat("Dispersion statistic:", round(dispersion_stat, 2), "\n")
## Dispersion statistic: 0.97
# -------------------------------
# 4. Compare with Poisson Regression
# -------------------------------
poisson_model <- glm(resp_illness ~ PM25 + temperature + population_density,
                     data = env_data, family = "poisson")

library(lmtest)
lrtest(poisson_model, nb_model)
## Likelihood ratio test
## 
## Model 1: resp_illness ~ PM25 + temperature + population_density
## Model 2: resp_illness ~ PM25 + temperature + population_density
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   4 -503.47                         
## 2   5 -344.71  1 317.52  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -------------------------------
# 5. Predicted values and visualization
# -------------------------------
env_data <- env_data %>%
  mutate(predicted = predict(nb_model, type = "response"))

ggplot(env_data, aes(x = PM25, y = resp_illness)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(y = predicted), color = "darkgreen", linewidth = 1.2) +
  labs(
    title = "Negative Binomial Regression: PM2.5 and Respiratory Illness Cases",
    x = "PM2.5 Concentration (μg/m³)",
    y = "Number of Respiratory Illness Cases"
  ) +
  theme_minimal()

✅ Interpretation

The Negative Binomial Regression model was used to examine the relationship between the number of respiratory illness cases and selected environmental factors, including PM2.5 concentration, temperature, and population density. The model was fitted using a log link function, and the estimated dispersion parameter (θ = 1.422; under the NB parameterization Var(Y) = μ + μ²/θ, smaller values of θ imply more overdispersion) indicated the presence of moderate overdispersion in the data, justifying the choice of the Negative Binomial model over the standard Poisson regression.

For every 1 µg/m³ increase in PM2.5, the expected number of respiratory illness cases increases by about 2.8%, holding other factors constant (computed from exp(0.0276) = 1.028).

For every 1°C increase in temperature, the expected number of respiratory illness cases decreases by about 3.6%, though the effect is not statistically significant (p = 0.098; computed from exp(-0.0367) = 0.964).

For every additional person per km², the expected number of respiratory illness cases increases by about 0.013%, holding other factors constant (computed from exp(0.000133) = 1.00013).

The coefficients in the Negative Binomial model can be interpreted in terms of rate ratios by exponentiating the parameter estimates. The exponentiated coefficients (IRRs) indicate the multiplicative change in the expected number of respiratory illness cases for each one-unit increase in the predictor variable. Specifically, each 1 µg/m³ increase in PM2.5 concentration is associated with a 2.8% rise in the expected number of respiratory illness cases, holding temperature and population density constant. Conversely, each 1°C increase in temperature is associated with an estimated 3.6% reduction in the expected number of cases, although this effect was not statistically significant at the 0.05 level. Meanwhile, higher population density corresponds to a small but significant increase in respiratory illness cases, with an IRR of approximately 1.00013 per unit increase in population per square kilometer.

✅ Overdispersion

The dispersion statistic of 0.97, computed from the Pearson residuals of the fitted Negative Binomial model, is approximately 1, indicating that once the extra dispersion parameter is included, no substantial residual over- or underdispersion remains. In other words, the Negative Binomial regression adequately accounts for the variability in the count data. A dispersion statistic close to 1 generally signifies a well-fitting model and supports the appropriateness of the chosen distribution.

To further evaluate whether the Negative Binomial model provides a better fit than the simpler Poisson model, a likelihood ratio test was conducted. The test compared the Poisson regression model (Model 1) and the Negative Binomial regression model (Model 2), both containing the same predictors: PM2.5 concentration, temperature, and population density. The results yielded a chi-square value of 317.52 with 1 degree of freedom and a p-value less than 0.001, indicating a highly significant difference in model fit.

This result means that the Negative Binomial regression model fits the data significantly better than the Poisson model. The significant improvement in model fit suggests that the Negative Binomial model’s inclusion of an additional dispersion parameter successfully captures extra variability that the Poisson model cannot. Therefore, despite the dispersion statistic being close to 1, the Negative Binomial regression remains the more appropriate choice for this dataset, providing a more robust and flexible framework for analyzing count data with potential overdispersion.
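
As a complementary check, the AER package (already loaded above) provides a formal score test of the Poisson equidispersion assumption; a short sketch applied to the Poisson fit:

# Test H0: equidispersion against the alternative of overdispersion
dispersiontest(poisson_model)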

Extreme Value Theory

In the world of statistics, most models are designed to describe the average behavior of a system. But what happens when we need to understand the extremes—those rare, high-impact events like devastating floods, intense typhoons, or record-breaking heatwaves? This is where Extreme Value Theory (EVT) comes in.

EVT provides a framework for modeling the behavior of the tails of a distribution—where the most extreme values lie. These are the events that, while rare, can have catastrophic consequences. By focusing on these extremes, EVT helps engineers, scientists, and policymakers make informed decisions in the face of uncertainty.

Let’s explore two widely used EVT approaches: the Block Maxima method and the Peaks Over Threshold (POT) method.

Block Maxima Approach

The Block Maxima method involves dividing a time series into fixed intervals—typically years—and extracting the maximum value from each block. These maxima are then modeled using the Generalized Extreme Value (GEV) distribution.

The Generalized Extreme Value (GEV) distribution models block maxima and has the following cumulative distribution function (CDF): \[G(x) = \exp \left\{ - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\}, \quad \text{for } 1 + \xi \left( \frac{x - \mu}{\sigma} \right) > 0\] Where:

  • \(\mu\) = location parameter
  • \(\sigma > 0\) = scale parameter
  • \(\xi\) = shape parameter (controls tail behavior)

Return Level Formula: For a return period \(T\), the return level \(x_T\) is: \[x_T = \mu + \frac{\sigma}{\xi} \left[ \left( -\log \left( 1 - \frac{1}{T} \right) \right)^{-\xi} - 1 \right]\]

✅ Why use it: This approach is simple and intuitive. It’s especially useful when data is naturally grouped into periods (e.g., annual maximum rainfall).

We simulate 30 years of daily rainfall data. For each year, we extract the maximum daily rainfall and fit a GEV distribution to these 30 values.

✅ GEV Parameters (from the fit below):

  • Shape: -0.0294
  • Location: 82.88 mm
  • Scale: 9.68 mm

Using this model, we estimate the 100-year return level—the rainfall amount expected to be exceeded once every 100 years:

Estimated 100-year return level (GEV): 124.51 mm

This value is critical for designing infrastructure like flood control systems, ensuring they can withstand rare but severe rainfall events.

We will demonstrate the use of this method using R. In this approach, we aim to model the maximum rainfall observed in each year. This is useful when we want to estimate the intensity of rare events like the 100-year rainfall, which is critical for infrastructure planning.

  • We begin by loading the evd package in R, which provides tools for fitting the Generalized Extreme Value (GEV) distribution. This distribution is designed to model the maximum values observed over fixed time blocks—typically years.
  • Next, we simulate 30 years of daily rainfall data using a gamma distribution, which is commonly used to model rainfall due to its skewed nature.
  • We then create a time series by assigning a date to each rainfall value, allowing us to group the data by year.
  • To apply the block maxima method, we extract the maximum daily rainfall for each year. With these annual maxima, we fit a GEV distribution.
  • The model estimates three parameters: location, scale, and shape, which describe the center, spread, and tail behavior of the distribution, respectively.
  • Finally, we estimate the 100-year return level—the rainfall amount expected to be exceeded once every 100 years:
# Load the required package
library(evd)
## 
## Attaching package: 'evd'
## The following object is masked from 'package:car':
## 
##     subsets
# Simulate 30 years of daily rainfall data
set.seed(42)
rainfall <- rgamma(365 * 30, shape = 2, scale = 10)

# Create a time series with dates
dates <- seq.Date(from = as.Date("1990-01-01"), by = "day", length.out = length(rainfall))
rain_data <- data.frame(Date = dates, Rainfall = rainfall)

# Extract annual maxima
rain_data$Year <- format(rain_data$Date, "%Y")
annual_max <- aggregate(Rainfall ~ Year, data = rain_data, max)

# Fit a GEV distribution to the annual maxima
gev_fit <- fgev(annual_max$Rainfall)

# Check the estimated parameters
print(gev_fit)
## 
## Call: fgev(x = annual_max$Rainfall) 
## Deviance: 229.865 
## 
## Estimates
##      loc     scale     shape  
## 82.87868   9.67521  -0.02941  
## 
## Standard Errors
##    loc   scale   shape  
## 1.9823  1.4202  0.1308  
## 
## Optimization Information
##   Convergence: successful 
##   Function Evaluations: 15 
##   Gradient Evaluations: 8
# Extract GEV parameters
params <- gev_fit$estimate
loc <- params["loc"]
scale <- params["scale"]
shape <- params["shape"]

# Estimate the 100-year return level using qgev()
return_period <- 100
prob <- 1 - 1 / return_period
gev_return_level <- qgev(prob, loc = loc, scale = scale, shape = shape)

# Display the result
cat("Estimated 100-year return level (GEV):", gev_return_level, "mm\n")
## Estimated 100-year return level (GEV): 124.5071 mm

✅ Interpretation :

  • Based on the fitted GEV model, a daily rainfall of about 124.5 mm is expected to be exceeded once every 100 years.
  • This is a critical design value for flood control systems, drainage, and infrastructure planning.

The analysis using the Block Maxima approach and the Generalized Extreme Value (GEV) distribution shows that the 100-year return level for daily rainfall is approximately 124.5 mm. This means that, statistically, a rainfall event of about 124.5 mm or more in a single day has a 1% chance of occurring in any given year.

✅ What Does This Mean for Policy?

Flood Control Infrastructure Design: Drainage systems, culverts, and floodways in Davao del Sur should be designed to handle at least 124 mm of rainfall in one day. Current infrastructure standards should be reviewed to ensure they meet or exceed this threshold.
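
The same fitted GEV can be queried for other design horizons. A short sketch reusing the loc, scale, and shape values extracted from gev_fit above:

# Return levels for several return periods from the fitted GEV
periods <- c(10, 25, 50, 100)
levels_mm <- sapply(periods, function(T) qgev(1 - 1/T, loc = loc, scale = scale, shape = shape))
setNames(round(levels_mm, 1), paste0(periods, "-yr"))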

Peaks Over Threshold

The POT approach takes a different route. Instead of using only one value per year, it considers all values that exceed a high threshold. These exceedances are modeled using the Generalized Pareto Distribution (GPD), which is well-suited for modeling the tail of a distribution.

The Generalized Pareto Distribution (GPD) models exceedances over a threshold \(u\). Its CDF is: \[H(y) = 1 - \left( 1 + \xi \frac{y}{\beta} \right)^{-1/\xi}, \quad \text{for } y > 0, \; 1 + \xi \frac{y}{\beta} > 0\] Where:

  • \(y = x - u\) (excess over threshold)
  • \(\beta > 0\) = scale parameter
  • \(\xi\) = shape parameter

Return Level Formula: For a return period \(T\) in years, with \(n_y\) observations per year, the return level \(x_T\) is: \[x_T = u + \frac{\beta}{\xi} \left[ \left( T \, n_y \, \frac{m}{n} \right)^{\xi} - 1 \right]\] Where:

  • \(u\) = threshold
  • \(m\) = number of exceedances
  • \(n\) = total number of observations
  • \(m/n\) = probability of exceeding the threshold

Let us try to use R to carry out a peaks-over-threshold analysis:

  • The Peaks Over Threshold (POT) method is a powerful technique in Extreme Value Theory for modeling rare events. Unlike the Block Maxima approach, which uses only one extreme value per block (e.g., per year), POT leverages all observations that exceed a high threshold, making it more data-efficient and often more accurate for estimating tail behavior.

  • We use the evir package because it provides functions for fitting the Generalized Pareto Distribution (GPD), which is the cornerstone of the POT approach.

  • Here, we simulate 30 years of daily rainfall using a gamma distribution, which is commonly used for rainfall modeling due to its skewed nature.

  • We choose the 95th percentile as our threshold. This means we only consider the top 5% of rainfall values—those that represent extreme events.

  • The gpd() function fits a Generalized Pareto Distribution to the exceedances (values above the threshold). This model estimates two key parameters:

    • xi (shape): Determines tail heaviness. Positive xi means heavy tails (extreme events can be very large).
    • beta (scale): Controls the spread of exceedances.
  • We extract the estimated shape and scale parameters for use in return level calculations.

  • Here’s what happens next:

    • The fitted parameters, together with the exceedance probability (m / n) and the number of observations per year, are plugged into the return-level formula above via qgpd().
    • The result is the daily rainfall amount expected to be exceeded, on average, once every 100 years.
# Load the required package
library(evir)
## 
## Attaching package: 'evir'
## The following objects are masked from 'package:evd':
## 
##     dgev, dgpd, pgev, pgpd, qgev, qgpd, rgev, rgpd
## The following object is masked from 'package:ggplot2':
## 
##     qplot
# Simulate 30 years of daily rainfall data
set.seed(42)
rainfall <- rgamma(365 * 30, shape = 2, scale = 10)

# Set a high threshold (e.g., 95th percentile)
threshold <- quantile(rainfall, 0.95)

# Fit a Generalized Pareto Distribution (GPD) to exceedances
gpd_fit <- gpd(rainfall, threshold)

# Display fitted parameters
print(gpd_fit)
## $n
## [1] 10950
## 
## $data
##   [1]  49.90563  48.07191  50.68208  50.65502  50.32085  69.39245  54.53628
##   [8]  65.04813  59.33251  52.18695  73.31518  51.62654  49.68325  55.02293
##  ... (548 exceedances in total; remaining values omitted for brevity)
## 
## $threshold
##      95% 
## 47.90876 
## 
## $p.less.thresh
## [1] 0.9499543
## 
## $n.exceed
## [1] 548
## 
## $method
## [1] "ml"
## 
## $par.ests
##          xi        beta 
## -0.03367919 12.57196431 
## 
## $par.ses
##         xi       beta 
## 0.04725699 0.80072651 
## 
## $varcov
##              [,1]        [,2]
## [1,]  0.002233223 -0.02882877
## [2,] -0.028828768  0.64116294
## 
## $information
## [1] "observed"
## 
## $converged
## [1] 0
## 
## $nllh.final
## [1] 1916.843
## 
## attr(,"class")
## [1] "gpd"
# Extract GPD parameters
xi <- gpd_fit$par.ests["xi"]     # Shape parameter
beta <- gpd_fit$par.ests["beta"] # Scale parameter

# Estimate the 100-year return level
n <- length(rainfall)            # Total number of observations
m <- sum(rainfall > threshold)   # Number of exceedances
n_per_year <- 365                # Daily data
return_period <- 100
zeta <- m / n                    # Probability of exceeding the threshold

# A T-year event is exceeded with probability 1/(T * n_per_year) per observation;
# dividing by zeta converts this into a conditional probability within the GPD,
# and qgpd() returns the corresponding excess over the threshold
gpd_return_level <- threshold +
  qgpd(1 - 1 / (return_period * n_per_year * zeta), xi = xi, beta = beta)

# Display the result
cat("Estimated 100-year return level (POT/GPD):", round(gpd_return_level, 2), "mm\n")
## Estimated 100-year return level (POT/GPD): 131.33 mm

✅ This means:

  • Based on the POT approach, a daily rainfall of about 131 mm is expected to be exceeded, on average, once every 100 years (a 1% chance in any given year).
  • This is close to the estimate from the Block Maxima approach (≈124.5 mm), which is reassuring: both methods model the same tail, but POT uses all 548 exceedances rather than just 30 annual maxima, so it typically yields more precise estimates.

✅ Practical Notes

  • The POT method uses all values above a threshold, which makes it more data-efficient than block maxima.
  • The exceedance probability (m / n) and the number of observations per year both enter the return-level calculation; omitting either can distort the estimate dramatically.
  • The approach is sensitive to threshold choice and to the parameter estimates (xi and beta). A positive or near-zero xi indicates a heavy tail, allowing for very large extremes.
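
Because POT results depend on the threshold, a quick sensitivity check is good practice. Below is a sketch refitting the GPD at several threshold quantiles and recomputing the 100-year return level with the formula used above:

# Sensitivity of the 100-year return level to the threshold choice
for (q in c(0.90, 0.95, 0.99)) {
  u <- quantile(rainfall, q)
  fit <- gpd(rainfall, u)
  xi_q   <- fit$par.ests["xi"]
  beta_q <- fit$par.ests["beta"]
  zeta_q <- fit$n.exceed / length(rainfall)
  rl <- u + (beta_q / xi_q) * ((100 * 365 * zeta_q)^xi_q - 1)
  cat("Threshold quantile:", q, "-> 100-yr return level:", round(rl, 1), "mm\n")
}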

✅ Policy Interpretation

If this estimate were based on real data:

  • Infrastructure Design: Flood control systems, drainage networks, and dams must be designed for extreme scenarios beyond typical historical observations.
  • Disaster Preparedness: Emergency response plans should include contingencies for catastrophic rainfall events.
  • Climate Adaptation: Incorporate EVT-based projections into local climate resilience strategies, as climate change may increase the frequency of such extremes.
  • Urban Planning: Avoid development in flood-prone areas and invest in green infrastructure to manage runoff.

Time Series Analysis

Environmental risks—such as floods, droughts, and heatwaves—are often driven by patterns in climate and weather data. Time series analysis helps us understand these patterns and predict future conditions, enabling proactive policy and disaster preparedness.

Autoregressive (AR) Model

Idea: Today’s rainfall depends on previous days. Formula:

\[X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + \varepsilon_t\]

Use: Short-term rainfall prediction.

Moving Average (MA) Model

Idea: Current value depends on past shocks (errors). Formula:

\[X_t = \mu + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t\] Use: Smoothing noise in temperature or pollution data.

ARMA models

Combines Autoregressive (AR) and Moving Average (MA) components: \[X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\] Where:

  • \(p\) = number of AR terms
  • \(q\) = number of MA terms
  • \(\varepsilon_t\) = white noise

ARIMA Model

Adds Integration (I) for non-stationary data:

ARIMA(p,d,q):

\[\quad \Delta^d X_t = \phi_1 \Delta^d X_{t-1} + \dots + \phi_p \Delta^d X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\] Where:

  • \(d\) = number of differences applied to make the series stationary
  • \(\Delta^d X_t\)= differenced series

SARIMA Model

Adds Seasonality: \[\text{SARIMA}(p,d,q)(P,D,Q)_s\] Full form: \[\Phi(B^s) \phi(B) \Delta^d \Delta_s^D X_t = \Theta(B^s) \theta(B) \varepsilon_t\] Where:

  • \(B\) = backshift operator
  • \(s\) = seasonal period (e.g., 12 for monthly data)
  • \(\phi(B), \theta(B)\) = non-seasonal AR and MA polynomials
  • \(\Phi(B^s), \Theta(B^s)\) = seasonal AR and MA polynomials
  • \(\Delta^d\) = non-seasonal differencing
  • \(\Delta_s^D\) = seasonal differencing

Applications

Generate Time Series Data

# Example: Monthly PM2.5 concentration data (simulated)
set.seed(123)
pm25 <- ts(rnorm(120, mean = 35, sd = 10), frequency = 12, start = c(2015, 1))
plot(pm25, main = "Simulated PM2.5 Data", ylab = "PM2.5 (µg/m³)", col = "steelblue")
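
Before fitting, it helps to inspect the autocorrelation structure: the ACF pattern suggests a plausible MA order and the PACF an AR order. A quick sketch with base R:

# Sample ACF and PACF of the simulated series
par(mfrow = c(1, 2))
acf(pm25, main = "ACF of PM2.5")
pacf(pm25, main = "PACF of PM2.5")
par(mfrow = c(1, 1))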

AR (Autoregressive) Model

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Fit an AR(1) model
ar_model <- arima(pm25, order = c(1, 0, 0))
coeftest(ar_model)
## 
## z test of coefficients:
## 
##            Estimate Std. Error z value Pr(>|z|)    
## ar1        0.011740   0.091589  0.1282    0.898    
## intercept 35.152816   0.822706 42.7283   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_ar <- forecast(ar_model, h = 12)

# Plot the forecast
plot(forecast_ar, main = "AR(1) Forecast for PM2.5", ylab = "PM2.5 Concentration", xlab = "Time")

MA (Moving Average) Model

# Fit an MA(1) model
ma_model <- arima(pm25, order = c(0, 0, 1))
coeftest(ma_model)
## 
## z test of coefficients:
## 
##            Estimate Std. Error z value Pr(>|z|)    
## ma1        0.013969   0.100130  0.1395    0.889    
## intercept 35.152543   0.824390 42.6407   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_ma <- forecast(ma_model, h = 12)

# Plot the forecast
plot(forecast_ma, main = "MA(1) Forecast for PM2.5", ylab = "PM2.5 Concentration", xlab = "Time")

ARMA (Autoregressive Moving Average)

# Fit an ARMA(1,1) model
arma_model <- arima(pm25, order = c(1, 0, 1))
coeftest(arma_model)
## 
## z test of coefficients:
## 
##            Estimate Std. Error z value  Pr(>|z|)    
## ar1       -0.871914   0.081788 -10.661 < 2.2e-16 ***
## ma1        0.968136   0.048873  19.809 < 2.2e-16 ***
## intercept 35.165608   0.838127  41.957 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_arma <- forecast(arma_model, h = 12)
plot(forecast_arma, main = "ARIMA(1,0,1) Forecast for PM2.5")

ARIMA (Autoregressive Integrated Moving Average)

# Fit an ARIMA(1,1,1) model
arima_model <- arima(pm25, order = c(1, 1, 1))
coeftest(arima_model)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1  0.020303   0.092379   0.2198    0.826    
## ma1 -0.999999   0.030107 -33.2151   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast next 12 months
forecast_arima <- forecast(arima_model, h = 12)
plot(forecast_arima, main = "ARIMA(1,1,1) Forecast for PM2.5")

SARIMA (Seasonal ARIMA)

# Fit a Seasonal ARIMA (1,1,1)(1,1,1)[12]
sarima_model <- arima(pm25, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12))
coeftest(sarima_model)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1   0.056504   0.098230  0.5752 0.5651459    
## ma1  -0.999988   0.134896 -7.4130 1.234e-13 ***
## sar1 -0.172166   0.123349 -1.3958 0.1627860    
## sma1 -0.867714   0.225778 -3.8432 0.0001214 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Forecast 12 months ahead
forecast_sarima <- forecast(sarima_model, h = 12)
plot(forecast_sarima, main = "SARIMA(1,1,1)(1,1,1)[12] Forecast for PM2.5")
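
To choose among these candidate models, one can compare information criteria or let the forecast package search the order space automatically. A short sketch (note that models with different amounts of differencing are fitted to transformed series, so their AICs are not directly comparable):

# Compare the stationary fits by AIC (lower is better)
AIC(ar_model, ma_model, arma_model)

# Automatic order selection over (p,d,q)(P,D,Q) via forecast::auto.arima
auto_model <- auto.arima(pm25)
summary(auto_model)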