In this assignment, we use the weather disasters dataset to model whether two disaster types occur in a given year:
Each of these outcomes is converted to a binary variable:
1 = the event occurred at least once in that year
(count > 0)0 = the event did not occur (count = 0)Because the outcomes are yes/no, we use binomial logistic regression with the logit link. Logistic regression models the probability of an event occurring (a value between 0 and 1).
We test four possible predictor (X) structures for each outcome:
delta_tempyeardelta_temp + year (additive effects)delta_temp * year (interaction; includes both main
effects + interaction term)Finally, we choose the best model for drought and wildfire using:
anova(..., test="Chisq") to check whether adding
predictors significantly improves model fit.# Y: drought, wildfire (modeled as binomial with logit link)
# X candidates: delta_temp, year, delta_temp + year, delta_temp * year
# Model selection: AIC + Deviance (Likelihood Ratio Tests)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(knitr)
weather <- read.csv("NOAAGISSWD.csv")
names(weather) <- make.names(names(weather))
if ("delta.Temp" %in% names(weather)) {
weather <- weather %>% rename(delta_temp = delta.Temp)
} else if ("delta.temp" %in% names(weather)) {
weather <- weather %>% rename(delta_temp = delta.temp)
} else if ("delta_temp" %in% names(weather)) {
# already good
} else {
stop("Couldn't find a delta temp column. Check names(weather).")
}
required_cols <- c("year", "count_drought", "count_wildfire", "delta_temp")
missing <- setdiff(required_cols, names(weather))
if (length(missing) > 0) stop(paste("Missing columns:", paste(missing, collapse = ", ")))
weather <- weather %>%
mutate(
drought = ifelse(count_drought > 0, 1, 0),
wildfire = ifelse(count_wildfire > 0, 1, 0)
)
cat("\nOutcome checks (should be 0/1):\n")
##
## Outcome checks (should be 0/1):
print(table(weather$drought, useNA = "ifany"))
##
## 0 1
## 13 33
print(table(weather$wildfire, useNA = "ifany"))
##
## 0 1
## 22 24
We read the dataset and clean column names so R can use them safely.
We ensure the temperature-change variable is named
delta_temp.
We convert counts (count_drought,
count_wildfire) into binary outcomes needed for logistic
regression.
For each outcome (drought and wildfire), we fit these four models:
Model 1: event ~ delta_temp
Model 2: event ~ year
Model 3: event ~ delta_temp + year
Model 4: event ~ delta_temp * year
The interaction model delta_temp * year expands to:
delta_temp + year + delta_temp:yearSo it includes all predictors plus the interaction term.
# Helper: fit the 4 requested models for a given Y
fit_candidate_models <- function(df, y_name) {
f1 <- as.formula(paste0(y_name, " ~ delta_temp"))
f2 <- as.formula(paste0(y_name, " ~ year"))
f3 <- as.formula(paste0(y_name, " ~ delta_temp + year"))
f4 <- as.formula(paste0(y_name, " ~ delta_temp * year"))
m1 <- glm(f1, data = df, family = binomial(link = "logit"))
m2 <- glm(f2, data = df, family = binomial(link = "logit"))
m3 <- glm(f3, data = df, family = binomial(link = "logit"))
m4 <- glm(f4, data = df, family = binomial(link = "logit"))
list(
delta_temp = m1,
year = m2,
additive = m3,
interaction = m4
)
}
AIC rewards better fit but penalizes unnecessary complexity.
Lower AIC = better model (in this set of candidates).
We compare nested models using:
This helps us answers questions like:
Does adding year improve a
delta_temp model?
Does adding delta_temp improve a
year model?
Does adding the interaction term improve the additive model?
Important note:
The models ~ delta_temp and ~ year are
not nested in each other, so we do not compare those
two directly using LRT.
summarize_model_set <- function(model_list, label = "Outcome") {
# AIC table
aic_tbl <- tibble(
model = names(model_list),
AIC = sapply(model_list, AIC),
deviance = sapply(model_list, deviance),
df_resid = sapply(model_list, df.residual)
) %>% arrange(AIC)
cat("\n====================================================\n")
cat("AIC + Deviance Summary for:", label, "\n")
cat("====================================================\n")
print(kable(aic_tbl, digits = 3))
# Deviance (LRT) comparisons for nested models
cat("\n--- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---\n")
lrt_1v3 <- anova(model_list$delta_temp, model_list$additive, test = "Chisq")
lrt_2v3 <- anova(model_list$year, model_list$additive, test = "Chisq")
lrt_3v4 <- anova(model_list$additive, model_list$interaction, test = "Chisq")
cat("\nCompare delta_temp vs additive (does adding year help beyond delta_temp?)\n")
print(lrt_1v3)
cat("\nCompare year vs additive (does adding delta_temp help beyond year?)\n")
print(lrt_2v3)
cat("\nCompare additive vs interaction (does interaction help?)\n")
print(lrt_3v4)
# Choose best model:
# - Primary: lowest AIC
# - Secondary: if interaction is best-by-AIC but not significant by LRT, choose additive for simplicity
best_by_aic <- aic_tbl$model[1]
p_int <- tryCatch({
as.numeric(lrt_3v4$`Pr(>Chi)`[2])
}, error = function(e) NA_real_)
chosen <- best_by_aic
if (!is.na(p_int) && best_by_aic == "interaction" && p_int >= 0.05) {
chosen <- "additive"
}
cat("\n--- Best model choice for", label, "---\n")
cat("Best-by-AIC:", best_by_aic, "\n")
if (!is.na(p_int)) cat("Interaction LRT p-value (additive vs interaction):", signif(p_int, 4), "\n")
cat("Chosen (AIC + deviance/parsimony):", chosen, "\n")
list(
aic_table = aic_tbl,
lrt_1v3 = lrt_1v3,
lrt_2v3 = lrt_2v3,
lrt_3v4 = lrt_3v4,
chosen_name = chosen,
chosen_model = model_list[[chosen]]
)
}
models_drought <- fit_candidate_models(weather, "drought")
results_drought <- summarize_model_set(models_drought, label = "DROUGHT")
##
## ====================================================
## AIC + Deviance Summary for: DROUGHT
## ====================================================
##
##
## |model | AIC| deviance| df_resid|
## |:-----------|------:|--------:|--------:|
## |year | 51.410| 47.410| 44|
## |delta_temp | 51.641| 47.641| 44|
## |additive | 53.247| 47.247| 43|
## |interaction | 55.085| 47.085| 42|
##
## --- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---
##
## Compare delta_temp vs additive (does adding year help beyond delta_temp?)
## Analysis of Deviance Table
##
## Model 1: drought ~ delta_temp
## Model 2: drought ~ delta_temp + year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 44 47.641
## 2 43 47.247 1 0.3946 0.5299
##
## Compare year vs additive (does adding delta_temp help beyond year?)
## Analysis of Deviance Table
##
## Model 1: drought ~ year
## Model 2: drought ~ delta_temp + year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 44 47.410
## 2 43 47.247 1 0.163 0.6864
##
## Compare additive vs interaction (does interaction help?)
## Analysis of Deviance Table
##
## Model 1: drought ~ delta_temp + year
## Model 2: drought ~ delta_temp * year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 43 47.247
## 2 42 47.085 1 0.16145 0.6878
##
## --- Best model choice for DROUGHT ---
## Best-by-AIC: year
## Interaction LRT p-value (additive vs interaction): 0.6878
## Chosen (AIC + deviance/parsimony): year
From the printed AIC table:
The lowest AIC for drought is the
year model (AIC = 51.410).
Adding delta_temp to year does
not meaningfully improve fit:
year vs additive is large
(0.6864), so the extra term is not helpful.Adding the interaction also does not help:
additive vs interaction is
large (0.6878).Conclusion for drought:
drought ~ yearmodels_wildfire <- fit_candidate_models(weather, "wildfire")
results_wildfire <- summarize_model_set(models_wildfire, label = "WILDFIRE")
##
## ====================================================
## AIC + Deviance Summary for: WILDFIRE
## ====================================================
##
##
## |model | AIC| deviance| df_resid|
## |:-----------|------:|--------:|--------:|
## |year | 48.014| 44.014| 44|
## |delta_temp | 49.163| 45.163| 44|
## |additive | 49.859| 43.859| 43|
## |interaction | 51.806| 43.806| 42|
##
## --- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---
##
## Compare delta_temp vs additive (does adding year help beyond delta_temp?)
## Analysis of Deviance Table
##
## Model 1: wildfire ~ delta_temp
## Model 2: wildfire ~ delta_temp + year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 44 45.163
## 2 43 43.859 1 1.304 0.2535
##
## Compare year vs additive (does adding delta_temp help beyond year?)
## Analysis of Deviance Table
##
## Model 1: wildfire ~ year
## Model 2: wildfire ~ delta_temp + year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 44 44.014
## 2 43 43.859 1 0.15505 0.6938
##
## Compare additive vs interaction (does interaction help?)
## Analysis of Deviance Table
##
## Model 1: wildfire ~ delta_temp + year
## Model 2: wildfire ~ delta_temp * year
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 43 43.859
## 2 42 43.806 1 0.053576 0.817
##
## --- Best model choice for WILDFIRE ---
## Best-by-AIC: year
## Interaction LRT p-value (additive vs interaction): 0.817
## Chosen (AIC + deviance/parsimony): year
From the printed AIC table:
The lowest AIC for wildfire is also the
year model (AIC = 48.014).
Adding delta_temp does not significantly improve fit
beyond year:
year vs additive is large
(0.6938).Adding the interaction also does NOT help:
additive vs interaction is
large (0.817).Conclusion for wildfire:
wildfire ~ yearcat("\nDrought chosen model:", results_drought$chosen_name, "\n")
##
## Drought chosen model: year
print(summary(results_drought$chosen_model))
##
## Call:
## glm(formula = f2, family = binomial(link = "logit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -147.51135 60.48512 -2.439 0.0147 *
## year 0.07423 0.03028 2.451 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54.777 on 45 degrees of freedom
## Residual deviance: 47.410 on 44 degrees of freedom
## AIC: 51.41
##
## Number of Fisher Scoring iterations: 4
cat("\nWildfire chosen model:", results_wildfire$chosen_name, "\n")
##
## Wildfire chosen model: year
print(summary(results_wildfire$chosen_model))
##
## Call:
## glm(formula = f2, family = binomial(link = "logit"), data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -249.88051 70.99611 -3.52 0.000432 ***
## year 0.12485 0.03547 3.52 0.000431 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63.683 on 45 degrees of freedom
## Residual deviance: 44.014 on 44 degrees of freedom
## AIC: 48.014
##
## Number of Fisher Scoring iterations: 4
In both chosen models, the coefficient for year is
positive and statistically significant.
That means:
In simpler terms: drought and wildfire occurrences become more likely over time in this dataset.
odds_ratio_table <- function(model) {
broom::tidy(model, conf.int = TRUE) %>%
mutate(
odds_ratio = exp(estimate),
conf.low.OR = exp(conf.low),
conf.high.OR = exp(conf.high)
) %>%
select(term, estimate, std.error, statistic, p.value,
odds_ratio, conf.low.OR, conf.high.OR)
}
cat("\n--- Odds Ratios: Drought chosen model ---\n")
##
## --- Odds Ratios: Drought chosen model ---
print(kable(odds_ratio_table(results_drought$chosen_model), digits = 4))
##
##
## |term | estimate| std.error| statistic| p.value| odds_ratio| conf.low.OR| conf.high.OR|
## |:-----------|---------:|---------:|---------:|-------:|----------:|-----------:|------------:|
## |(Intercept) | -147.5113| 60.4851| -2.4388| 0.0147| 0.0000| 0.0000| 0.0000|
## |year | 0.0742| 0.0303| 2.4513| 0.0142| 1.0771| 1.0197| 1.1509|
cat("\n--- Odds Ratios: Wildfire chosen model ---\n")
##
## --- Odds Ratios: Wildfire chosen model ---
print(kable(odds_ratio_table(results_wildfire$chosen_model), digits = 4))
##
##
## |term | estimate| std.error| statistic| p.value| odds_ratio| conf.low.OR| conf.high.OR|
## |:-----------|---------:|---------:|---------:|-------:|----------:|-----------:|------------:|
## |(Intercept) | -249.8805| 70.9961| -3.5196| 4e-04| 0.000| 0.0000| 0.0000|
## |year | 0.1249| 0.0355| 3.5203| 4e-04| 1.133| 1.0653| 1.2279|
Drought: year odds ratio ≈ 1.077
Each additional year multiplies the odds of drought occurrence(at least once) by ~1.077 (about a 7.7% increase in odds per year).
This suggests a statistically significant upward trend in drought occurrence over time.
Wildfire: year odds ratio ≈ 1.133
Each additional year multiplies the odds of wildfire occurrence( at least once) by ~1.133 (about a 13.3% increase in odds per year).
This indicates a strong and statistically significant increase in wildfire occurrence over time, with a larger annual growth rate than drought.
These statements are “odds” not “probability,” but they do indicate an increasing trend.
The plot shows the model-predicted probability as
delta_temp changes, holding year constant at the median.
(Note: for a year-only chosen model, the predicted line
will be flat with respect to delta_temp, because delta_temp
is not in the chosen model.)
plot_pred <- function(df, model, y_label) {
med_year <- median(df$year, na.rm = TRUE)
cat("\n", y_label, "- median year used:", med_year, "\n")
newdat <- data.frame(
delta_temp = seq(min(df$delta_temp, na.rm = TRUE),
max(df$delta_temp, na.rm = TRUE),
length.out = 200),
year = med_year
)
newdat$pred <- predict(model, newdata = newdat, type = "response")
plot(newdat$delta_temp, newdat$pred,
xlab = "delta_temp", ylab = "Predicted probability",
main = paste("Predicted", y_label, "(year fixed at median)"),
type = "l")
}
plot_pred(weather, results_drought$chosen_model, "Drought")
##
## Drought - median year used: 2002.5
plot_pred(weather, results_wildfire$chosen_model, "Wildfire")
##
## Wildfire - median year used: 2002.5
At the median year in the dataset, the model predicts a 0.7561 (75.6%) probability that a drought occurs. For wildfire at the same median year, the predicted probability is 0.5349 (53.5%), meaning drought is predicted to be more likely than wildfire.
# Predicted probability for drought vs wildfire at the MEDIAN year
med_year <- median(weather$year, na.rm = TRUE)
# newdata must include 'year' (and delta_temp can be included; it won't matter for year-only models)
new_med <- data.frame(
year = med_year,
delta_temp = median(weather$delta_temp, na.rm = TRUE)
)
p_drought <- predict(results_drought$chosen_model, newdata = new_med, type = "response")
p_wildfire <- predict(results_wildfire$chosen_model, newdata = new_med, type = "response")
cat("Median year:", med_year, "\n")
## Median year: 2002.5
cat("Predicted P(Drought=1 | year=median): ", round(p_drought, 4), "\n")
## Predicted P(Drought=1 | year=median): 0.7561
cat("Predicted P(Wildfire=1 | year=median):", round(p_wildfire, 4), "\n")
## Predicted P(Wildfire=1 | year=median): 0.5349
Overall takeaway:
Both drought and wildfire show clear increasing temporal trends.
Year is a significant predictor (BEST MODEL) for both outcomes, while delta_temp does not provide additional explanatory power once time is accounted for.
The interaction between temperature anomaly and year is also not significant.
These results suggest that long-term temporal trends explain increases in drought and wildfire occurrence more strongly than annual temperature anomaly alone.