1) Introduction

In this assignment, we use the weather disasters dataset to model whether two disaster types occur in a given year:

Drought
Wildfire

Each of these outcomes is converted to a binary variable:

1 = the event occurred at least once in that year (count > 0)
0 = the event did not occur (count = 0)

Because the outcomes are yes/no, we use binomial logistic regression with the logit link. Logistic regression models the probability of an event occurring (a value between 0 and 1).

We test four possible predictor (X) structures for each outcome:

delta_temp
year
delta_temp + year (additive effects)
delta_temp * year (interaction; includes both main effects + interaction term)

Finally, we choose the best model for drought and wildfire using:

AIC (Akaike Information Criterion): lower AIC indicates a better balance of fit and simplicity.
Deviance difference tests (Likelihood Ratio Tests) using anova(..., test="Chisq") to check whether adding predictors significantly improves model fit.

2) Load and Prepare the Data

# Y: drought, wildfire (modeled as binomial with logit link)
# X candidates: delta_temp, year, delta_temp + year, delta_temp * year
# Model selection: AIC + Deviance (Likelihood Ratio Tests)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(broom)
library(knitr)

weather <- read.csv("NOAAGISSWD.csv")

Clean column names

names(weather) <- make.names(names(weather))

Standardize delta temp column name

if ("delta.Temp" %in% names(weather)) {
  weather <- weather %>% rename(delta_temp = delta.Temp)
} else if ("delta.temp" %in% names(weather)) {
  weather <- weather %>% rename(delta_temp = delta.temp)
} else if ("delta_temp" %in% names(weather)) {
  # already good
} else {
  stop("Couldn't find a delta temp column. Check names(weather).")
}

Confirm required columns exist

required_cols <- c("year", "count_drought", "count_wildfire", "delta_temp")
missing <- setdiff(required_cols, names(weather))
if (length(missing) > 0) stop(paste("Missing columns:", paste(missing, collapse = ", ")))

Create BINOMIAL outcomes (0/1)

weather <- weather %>%
  mutate(
    drought  = ifelse(count_drought > 0, 1, 0),
    wildfire = ifelse(count_wildfire > 0, 1, 0)
  )

Sanity checks

cat("\nOutcome checks (should be 0/1):\n")

## 
## Outcome checks (should be 0/1):

print(table(weather$drought, useNA = "ifany"))

## 
##  0  1 
## 13 33

print(table(weather$wildfire, useNA = "ifany"))

## 
##  0  1 
## 22 24

We read the dataset and clean column names so R can use them safely.
We ensure the temperature-change variable is named delta_temp.
We convert counts (count_drought, count_wildfire) into binary outcomes needed for logistic regression.

3) Fit the Candidate Models

For each outcome (drought and wildfire), we fit these four models:

Model 1: event ~ delta_temp
Model 2: event ~ year
Model 3: event ~ delta_temp + year
Model 4: event ~ delta_temp * year

The interaction model delta_temp * year expands to:

delta_temp + year + delta_temp:year

So it includes all predictors plus the interaction term.

# Helper: fit the 4 requested models for a given Y
fit_candidate_models <- function(df, y_name) {
  f1 <- as.formula(paste0(y_name, " ~ delta_temp"))
  f2 <- as.formula(paste0(y_name, " ~ year"))
  f3 <- as.formula(paste0(y_name, " ~ delta_temp + year"))
  f4 <- as.formula(paste0(y_name, " ~ delta_temp * year"))

  m1 <- glm(f1, data = df, family = binomial(link = "logit"))
  m2 <- glm(f2, data = df, family = binomial(link = "logit"))
  m3 <- glm(f3, data = df, family = binomial(link = "logit"))
  m4 <- glm(f4, data = df, family = binomial(link = "logit"))

  list(
    delta_temp  = m1,
    year        = m2,
    additive    = m3,
    interaction = m4
  )
}

4) Model Selection Tools: AIC and Deviance Differences

4.1 AIC (Akaike Information Criterion)

AIC rewards better fit but penalizes unnecessary complexity.
Lower AIC = better model (in this set of candidates).

4.2 Deviance difference tests (Likelihood Ratio Tests)

We compare nested models using:

anova(model_small, model_big, test = “Chisq”)

This helps us answers questions like:

Does adding year improve a delta_temp model?
Does adding delta_temp improve a year model?
Does adding the interaction term improve the additive model?

Important note:

The models ~ delta_temp and ~ year are not nested in each other, so we do not compare those two directly using LRT.

summarize_model_set <- function(model_list, label = "Outcome") {

  # AIC table
  aic_tbl <- tibble(
    model    = names(model_list),
    AIC      = sapply(model_list, AIC),
    deviance = sapply(model_list, deviance),
    df_resid = sapply(model_list, df.residual)
  ) %>% arrange(AIC)

  cat("\n====================================================\n")
  cat("AIC + Deviance Summary for:", label, "\n")
  cat("====================================================\n")
  print(kable(aic_tbl, digits = 3))

  # Deviance (LRT) comparisons for nested models
  cat("\n--- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---\n")

  lrt_1v3 <- anova(model_list$delta_temp, model_list$additive, test = "Chisq")
  lrt_2v3 <- anova(model_list$year, model_list$additive, test = "Chisq")
  lrt_3v4 <- anova(model_list$additive, model_list$interaction, test = "Chisq")

  cat("\nCompare delta_temp vs additive (does adding year help beyond delta_temp?)\n")
  print(lrt_1v3)

  cat("\nCompare year vs additive (does adding delta_temp help beyond year?)\n")
  print(lrt_2v3)

  cat("\nCompare additive vs interaction (does interaction help?)\n")
  print(lrt_3v4)

  # Choose best model:
  # - Primary: lowest AIC
  # - Secondary: if interaction is best-by-AIC but not significant by LRT, choose additive for simplicity
  best_by_aic <- aic_tbl$model[1]

  p_int <- tryCatch({
    as.numeric(lrt_3v4$`Pr(>Chi)`[2])
  }, error = function(e) NA_real_)

  chosen <- best_by_aic
  if (!is.na(p_int) && best_by_aic == "interaction" && p_int >= 0.05) {
    chosen <- "additive"
  }

  cat("\n--- Best model choice for", label, "---\n")
  cat("Best-by-AIC:", best_by_aic, "\n")
  if (!is.na(p_int)) cat("Interaction LRT p-value (additive vs interaction):", signif(p_int, 4), "\n")
  cat("Chosen (AIC + deviance/parsimony):", chosen, "\n")

  list(
    aic_table    = aic_tbl,
    lrt_1v3      = lrt_1v3,
    lrt_2v3      = lrt_2v3,
    lrt_3v4      = lrt_3v4,
    chosen_name  = chosen,
    chosen_model = model_list[[chosen]]
  )
}

5) Results: Drought Models

models_drought  <- fit_candidate_models(weather, "drought")
results_drought <- summarize_model_set(models_drought, label = "DROUGHT")

## 
## ====================================================
## AIC + Deviance Summary for: DROUGHT 
## ====================================================
## 
## 
## |model       |    AIC| deviance| df_resid|
## |:-----------|------:|--------:|--------:|
## |year        | 51.410|   47.410|       44|
## |delta_temp  | 51.641|   47.641|       44|
## |additive    | 53.247|   47.247|       43|
## |interaction | 55.085|   47.085|       42|
## 
## --- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---
## 
## Compare delta_temp vs additive (does adding year help beyond delta_temp?)
## Analysis of Deviance Table
## 
## Model 1: drought ~ delta_temp
## Model 2: drought ~ delta_temp + year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        44     47.641                     
## 2        43     47.247  1   0.3946   0.5299
## 
## Compare year vs additive (does adding delta_temp help beyond year?)
## Analysis of Deviance Table
## 
## Model 1: drought ~ year
## Model 2: drought ~ delta_temp + year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        44     47.410                     
## 2        43     47.247  1    0.163   0.6864
## 
## Compare additive vs interaction (does interaction help?)
## Analysis of Deviance Table
## 
## Model 1: drought ~ delta_temp + year
## Model 2: drought ~ delta_temp * year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        43     47.247                     
## 2        42     47.085  1  0.16145   0.6878
## 
## --- Best model choice for DROUGHT ---
## Best-by-AIC: year 
## Interaction LRT p-value (additive vs interaction): 0.6878 
## Chosen (AIC + deviance/parsimony): year

Interpretation:

From the printed AIC table:

The lowest AIC for drought is the year model (AIC = 51.410).
Adding delta_temp to year does not meaningfully improve fit:
- LRT p-value for year vs additive is large (0.6864), so the extra term is not helpful.
Adding the interaction also does not help:
- LRT p-value for additive vs interaction is large (0.6878).
Conclusion for drought:
- The best model is: drought ~ year

6) Results: Wildfire Models

models_wildfire  <- fit_candidate_models(weather, "wildfire")
results_wildfire <- summarize_model_set(models_wildfire, label = "WILDFIRE")

## 
## ====================================================
## AIC + Deviance Summary for: WILDFIRE 
## ====================================================
## 
## 
## |model       |    AIC| deviance| df_resid|
## |:-----------|------:|--------:|--------:|
## |year        | 48.014|   44.014|       44|
## |delta_temp  | 49.163|   45.163|       44|
## |additive    | 49.859|   43.859|       43|
## |interaction | 51.806|   43.806|       42|
## 
## --- Deviance (LRT) comparisons (anova(..., test='Chisq')) ---
## 
## Compare delta_temp vs additive (does adding year help beyond delta_temp?)
## Analysis of Deviance Table
## 
## Model 1: wildfire ~ delta_temp
## Model 2: wildfire ~ delta_temp + year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        44     45.163                     
## 2        43     43.859  1    1.304   0.2535
## 
## Compare year vs additive (does adding delta_temp help beyond year?)
## Analysis of Deviance Table
## 
## Model 1: wildfire ~ year
## Model 2: wildfire ~ delta_temp + year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        44     44.014                     
## 2        43     43.859  1  0.15505   0.6938
## 
## Compare additive vs interaction (does interaction help?)
## Analysis of Deviance Table
## 
## Model 1: wildfire ~ delta_temp + year
## Model 2: wildfire ~ delta_temp * year
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        43     43.859                     
## 2        42     43.806  1 0.053576    0.817
## 
## --- Best model choice for WILDFIRE ---
## Best-by-AIC: year 
## Interaction LRT p-value (additive vs interaction): 0.817 
## Chosen (AIC + deviance/parsimony): year

Interpretation:

From the printed AIC table:

The lowest AIC for wildfire is also the year model (AIC = 48.014).
Adding delta_temp does not significantly improve fit beyond year:
- LRT p-value for year vs additive is large (0.6938).
Adding the interaction also does NOT help:
- LRT p-value for additive vs interaction is large (0.817).
Conclusion for wildfire:
- The best model is: wildfire ~ year

7) Final Chosen Model Summaries

cat("\nDrought chosen model:", results_drought$chosen_name, "\n")

## 
## Drought chosen model: year

print(summary(results_drought$chosen_model))

## 
## Call:
## glm(formula = f2, family = binomial(link = "logit"), data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -147.51135   60.48512  -2.439   0.0147 *
## year           0.07423    0.03028   2.451   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.777  on 45  degrees of freedom
## Residual deviance: 47.410  on 44  degrees of freedom
## AIC: 51.41
## 
## Number of Fisher Scoring iterations: 4

cat("\nWildfire chosen model:", results_wildfire$chosen_name, "\n")

## 
## Wildfire chosen model: year

print(summary(results_wildfire$chosen_model))

## 
## Call:
## glm(formula = f2, family = binomial(link = "logit"), data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -249.88051   70.99611   -3.52 0.000432 ***
## year           0.12485    0.03547    3.52 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 63.683  on 45  degrees of freedom
## Residual deviance: 44.014  on 44  degrees of freedom
## AIC: 48.014
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

In both chosen models, the coefficient for year is positive and statistically significant.

That means:

As year increases, the log-odds of the event occurring increases.

In simpler terms: drought and wildfire occurrences become more likely over time in this dataset.

8) Odds Ratios

odds_ratio_table <- function(model) {
  broom::tidy(model, conf.int = TRUE) %>%
    mutate(
      odds_ratio   = exp(estimate),
      conf.low.OR  = exp(conf.low),
      conf.high.OR = exp(conf.high)
    ) %>%
    select(term, estimate, std.error, statistic, p.value,
           odds_ratio, conf.low.OR, conf.high.OR)
}

cat("\n--- Odds Ratios: Drought chosen model ---\n")

## 
## --- Odds Ratios: Drought chosen model ---

print(kable(odds_ratio_table(results_drought$chosen_model), digits = 4))

## 
## 
## |term        |  estimate| std.error| statistic| p.value| odds_ratio| conf.low.OR| conf.high.OR|
## |:-----------|---------:|---------:|---------:|-------:|----------:|-----------:|------------:|
## |(Intercept) | -147.5113|   60.4851|   -2.4388|  0.0147|     0.0000|      0.0000|       0.0000|
## |year        |    0.0742|    0.0303|    2.4513|  0.0142|     1.0771|      1.0197|       1.1509|

cat("\n--- Odds Ratios: Wildfire chosen model ---\n")

## 
## --- Odds Ratios: Wildfire chosen model ---

print(kable(odds_ratio_table(results_wildfire$chosen_model), digits = 4))

## 
## 
## |term        |  estimate| std.error| statistic| p.value| odds_ratio| conf.low.OR| conf.high.OR|
## |:-----------|---------:|---------:|---------:|-------:|----------:|-----------:|------------:|
## |(Intercept) | -249.8805|   70.9961|   -3.5196|   4e-04|      0.000|      0.0000|       0.0000|
## |year        |    0.1249|    0.0355|    3.5203|   4e-04|      1.133|      1.0653|       1.2279|

Interpretation:

Drought: year odds ratio ≈ 1.077
- Each additional year multiplies the odds of drought occurrence(at least once) by ~1.077 (about a 7.7% increase in odds per year).
- This suggests a statistically significant upward trend in drought occurrence over time.
Wildfire: year odds ratio ≈ 1.133
- Each additional year multiplies the odds of wildfire occurrence( at least once) by ~1.133 (about a 13.3% increase in odds per year).
- This indicates a strong and statistically significant increase in wildfire occurrence over time, with a larger annual growth rate than drought.

These statements are “odds” not “probability,” but they do indicate an increasing trend.

9) Predicted Probability Plot

The plot shows the model-predicted probability as delta_temp changes, holding year constant at the median. (Note: for a year-only chosen model, the predicted line will be flat with respect to delta_temp, because delta_temp is not in the chosen model.)

plot_pred <- function(df, model, y_label) {
  med_year <- median(df$year, na.rm = TRUE)
  
  cat("\n", y_label, "- median year used:", med_year, "\n")
  
  newdat <- data.frame(
    delta_temp = seq(min(df$delta_temp, na.rm = TRUE),
                     max(df$delta_temp, na.rm = TRUE),
                     length.out = 200),
    year = med_year
  )
  
  newdat$pred <- predict(model, newdata = newdat, type = "response")
  
  plot(newdat$delta_temp, newdat$pred,
       xlab = "delta_temp", ylab = "Predicted probability",
       main = paste("Predicted", y_label, "(year fixed at median)"),
       type = "l")
}
plot_pred(weather, results_drought$chosen_model, "Drought")

## 
##  Drought - median year used: 2002.5

plot_pred(weather, results_wildfire$chosen_model, "Wildfire")

## 
##  Wildfire - median year used: 2002.5

At the median year in the dataset, the model predicts a 0.7561 (75.6%) probability that a drought occurs. For wildfire at the same median year, the predicted probability is 0.5349 (53.5%), meaning drought is predicted to be more likely than wildfire.

# Predicted probability for drought vs wildfire at the MEDIAN year

med_year <- median(weather$year, na.rm = TRUE)

# newdata must include 'year' (and delta_temp can be included; it won't matter for year-only models)
new_med <- data.frame(
  year = med_year,
  delta_temp = median(weather$delta_temp, na.rm = TRUE)
)

p_drought  <- predict(results_drought$chosen_model,  newdata = new_med, type = "response")
p_wildfire <- predict(results_wildfire$chosen_model, newdata = new_med, type = "response")

cat("Median year:", med_year, "\n")

## Median year: 2002.5

cat("Predicted P(Drought=1 | year=median): ", round(p_drought, 4), "\n")

## Predicted P(Drought=1 | year=median):  0.7561

cat("Predicted P(Wildfire=1 | year=median):", round(p_wildfire, 4), "\n")

## Predicted P(Wildfire=1 | year=median): 0.5349

Overall takeaway:
- Both drought and wildfire show clear increasing temporal trends.
- Year is a significant predictor (BEST MODEL) for both outcomes, while delta_temp does not provide additional explanatory power once time is accounted for.
- The interaction between temperature anomaly and year is also not significant.
- These results suggest that long-term temporal trends explain increases in drought and wildfire occurrence more strongly than annual temperature anomaly alone.

Assignment 3 - Weather Disasters

Romika Jani & Tsing Lee

1) Introduction

2) Load and Prepare the Data

Clean column names

Standardize delta temp column name

Confirm required columns exist

Create BINOMIAL outcomes (0/1)

Sanity checks

3) Fit the Candidate Models

4) Model Selection Tools: AIC and Deviance Differences

4.1 AIC (Akaike Information Criterion)

4.2 Deviance difference tests (Likelihood Ratio Tests)

5) Results: Drought Models

Interpretation:

6) Results: Wildfire Models

Interpretation:

7) Final Chosen Model Summaries

Interpretation:

8) Odds Ratios

Interpretation:

9) Predicted Probability Plot