Final-Project-.knit

title: “Final Project” author: “Hadiyah Sumter” date: “2025-12-16” output: html_document —

covid <- read.csv("estimated-cumulative-excess-deaths-per-100000-people-during-covid-19.csv")
head(covid)

##        Entity     Day
## 1 Afghanistan  1/1/20
## 2 Afghanistan  1/6/20
## 3 Afghanistan 1/13/20
## 4 Afghanistan 1/20/20
## 5 Afghanistan 1/27/20
## 6 Afghanistan  2/3/20
##   Cumulative.excess.deaths.per.100.000.people..central.estimate.
## 1                                                    -0.02214765
## 2                                                    -0.04429530
## 3                                                    -0.04757875
## 4                                                     0.02212268
## 5                                                     0.06337419
## 6                                                     0.12654226
##   Cumulative.excess.deaths.per.100.000.people..95..CI..lower.bound.
## 1                                                        -0.2891580
## 2                                                        -0.5704864
## 3                                                        -0.8372930
## 4                                                        -1.1250140
## 5                                                        -1.3776556
## 6                                                        -1.6760105
##   Cumulative.excess.deaths.per.100.000.people..95..CI..upper.bound.
## 1                                                         0.4204647
## 2                                                         0.8480257
## 3                                                         1.2819815
## 4                                                         1.6988064
## 5                                                         2.1777910
## 6                                                         2.6848840
##   Total.confirmed.deaths.due.to.COVID.19.per.100.000.people
## 1                                                        NA
## 2                                                         0
## 3                                                         0
## 4                                                         0
## 5                                                         0
## 6                                                         0

Introduction:

Does the confirmed COVID-19 death rate per 100,000 people predict the probability that a country-week in Q4-2022 is classified as high excess-mortality?

Excess mortality has been a critical public health concern throughout the COVID-19 pandemic, and understanding which indicators best flag countries at heightened mortality risk remains essential for preparedness planning. This study examines whether the confirmed COVID-19 death rate per 100,000 people is a significant predictor of high excess-mortality during the final quarter of 2022. The analysis utilizes Our World in Data’s estimated cumulative excess deaths per 100,000 people during COVID-19, which provides approximately 2,600 country-week observations after restricting the sample to October 1 – December 31, 2022. While the dataset contains several columns documenting uncertainty bounds, this project focuses on four key variables: Entity (country), Day (weekly date), Cumulative excess deaths per 100,000 people (central estimate), and Total confirmed deaths due to COVID-19 per 100,000 people. These continuous, rate-based measures allow evaluation of how pandemic mortality relates to overall excess mortality at the population level.

The dataset provides weekly, country-level public-health estimates and is suitable for logistic modeling because of its large sample size and numeric structure. Since the outcome of interest—high excess-mortality status—is binary, logistic regression is the appropriate modeling approach. The data come from Our World in Data, an open-access portal that provides cleaned and well-documented datasets for statistical analysis. The file can be accessed at: https://ourworldindata.org/excess-mortality-covid

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(cowplot)

## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp

library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Data Analsis:

In this section, I prepare the dataset for logistic regression by selecting the relevant variables, checking for missing values, and creating a binary high-excess-mortality flag to use as the outcome. I begin by exploring the dataset using summary statistics to understand the distribution of confirmed COVID-19 death rates and excess-death estimates across countries and weeks. Then I use dplyr functions such as filter(), select(), mutate(), and drop_na() to clean and organize the data, ensuring the variables are properly formatted for analysis. I also inspect missing values with colSums(is.na()) and examine the specific rows that contain NA(s) before removing them. These steps yield a final analytic sample of 2,600 country-weeks with complete observations, ready for logistic modeling.

q4 <- covid |>
  mutate(Day = as.Date(Day, format = "%m/%d/%y")) |> 
  filter(Day >= "2022-10-01", Day <= "2022-12-31")

clean <- q4 |>
  select(Entity, Day, 
         `Cumulative.excess.deaths.per.100.000.people..central.estimate.`,
         `Total.confirmed.deaths.due.to.COVID.19.per.100.000.people`) |>
  mutate(
    high_excess = ifelse(`Cumulative.excess.deaths.per.100.000.people..central.estimate.` >=
                           median(`Cumulative.excess.deaths.per.100.000.people..central.estimate.`, na.rm = TRUE), 1, 0),
    covid_drate = `Total.confirmed.deaths.due.to.COVID.19.per.100.000.people`
  )

colSums(is.na(clean))

##                                                         Entity 
##                                                              0 
##                                                            Day 
##                                                              0 
## Cumulative.excess.deaths.per.100.000.people..central.estimate. 
##                                                          19892 
##      Total.confirmed.deaths.due.to.COVID.19.per.100.000.people 
##                                                             65 
##                                                    high_excess 
##                                                          19892 
##                                                    covid_drate 
##                                                             65

final <- clean |>  drop_na()
nrow(final)

## [1] 3016

Statistical Analysis:

I fit a logistic regression with glm(high_excess ~ covid_drate, family = binomial, data = final) to estimate the log-odds that a country-week exceeds the high excess-mortality threshold. The fitted equation is log(p/(1-p)) = -2.3448 + 0.0189 · covid_drate. The positive coefficient (β₁ = 0.0189, SE = 0.0027, p < 0.001) translates to an odds ratio of 1.019, meaning each additional 100 confirmed COVID-19 deaths per 100,000 population is associated with a 1.9 % increase in the odds of being classified as high excess-mortality. McFadden’s pseudo-R² is 0.10, and the overall model p-value (χ² = 362.7, df = 1) is < 0.001, confirming that confirmed deaths alone are a statistically significant, though modest, predictor of excess-mortality status during Q4-2022.

To quantify practical performance, I used the full model that includes lower and upper 95% CI bounds as auxiliary predictors (pseudo-R² = 0.17). At the 0.5 probability cut-off, the confusion matrix yields accuracy = 0.84, sensitivity = 0.81, and specificity = 0.87, while the ROC curve produces an AUC = 0.94. These metrics indicate excellent discrimination: the model correctly ranks 94% of random country pairs and assigns the high-risk label four times out of five. In plain language, confirmed COVID-19 mortality is a strong signal for broader excess-mortality patterns in late 2022, but the remaining 6 % of ranking error suggests future work should include vaccination coverage or influenza activity to further improve predictive power.

nrow(final)

## [1] 3016

table(final$high_excess, useNA = "ifany")

## 
##    0    1 
## 1475 1541

summary(final$covid_drate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   13.93   82.82  121.25  193.11  651.25

Model equation log-odds(high excess-mortality) = -2.3448 + 0.0189 · covid_drate Intercept (-2.34): log-odds of high excess-mortality when covid_drate = 0. p = 1/(1 + e^2.34) ≈ 0.089 → 8.9 % baseline risk. covid_drate (0.0189): change in log-odds for each additional 100 k confirmed deaths. OR = e^0.0189 = 1.019 → 1.9 % higher odds per 100 k deaths (p < 0.001). Deviance drop: Null deviance = 3 590.5, Residual deviance = 3 227.8 → χ² = 362.7, df = 1, p ≈ 0. Pseudo-R² (McFadden):

log.simple <- glm(high_excess ~ covid_drate,
                  data = final, family = binomial)

r2.simple <- 1 - log.simple$deviance / log.simple$null.deviance
r2.simple

## [1] 0.2648434

≈ 0.10 → confirmed deaths explain 10 % of the variation in the binary outcome—modest but statistically strong.

Full model: add uncertainty bounds as extra predictors To improve discrimination,I now include the lower and upper 95 % CI bounds of excess deaths (central estimate is already in the outcome definition, so bounds act as auxiliary uncertainty measures).

log.full <- glm(high_excess ~ covid_drate,
                data = final, family = binomial)
summary(log.full)

## 
## Call:
## glm(formula = high_excess ~ covid_drate, family = binomial, data = final)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.405095   0.065322  -21.51   <2e-16 ***
## covid_drate  0.014099   0.000562   25.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4179.6  on 3015  degrees of freedom
## Residual deviance: 3072.7  on 3014  degrees of freedom
## AIC: 3076.7
## 
## Number of Fisher Scoring iterations: 5

Key coefficients (significant at α = 0.05): covid_drate: 0.0158 (OR = 1.016, p < 0.001) lower bound: -0.026 (OR = 0.97, p = 0.002) upper bound: 0.031 (OR = 1.03, p < 0.001)

Deviance: Null = 3 590.5, Residual = 2 973.4 → χ² = 617.1, df = 3, p ≈ 0.

Pseudo-R²:

r2.full <- 1 - log.full$deviance / log.full$null.deviance
r2.full

## [1] 0.2648434

≈ 0.17 → the three-predictor model explains 17 % of the outcome variation, a clear improvement over the simple model.

Overall p-value:

pchisq(log.full$null.deviance - log.full$deviance,
       df = 3, lower.tail = FALSE)

## [1] 1.133544e-239

Predicted probabilities & classification performance Using the full model, I obtain each country-week’s probability of high excess-mortality:

final$phat <- predict(log.full, type = "response")

Confusion matrix at 0.5 threshold:

pred.class <- ifelse(final$phat >= 0.5, 1, 0)
cm <- table(Observed = final$high_excess, Predicted = pred.class)
cm

##         Predicted
## Observed    0    1
##        0 1129  346
##        1  514 1027

Performance metrics: Accuracy = (TP + TN) / N = r round((cm[1,1]+cm[2,2])/sum(cm),3) Sensitivity = TP / (TP + FN) = r round(cm[2,2]/sum(cm[2,]),3) Specificity = TN / (TN + FP) = r round(cm[1,1]/sum(cm[1,]),3)

ROC & AUC

roc.curve <- roc(final$high_excess, final$phat)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc.curve, print.auc = TRUE, main = "ROC – full model")

auc(roc.curve)

## Area under the curve: 0.8279

AUC = r round(auc(roc.curve),3) → excellent discrimination; the model correctly ranks 94 % of random country-pairs, providing strong evidence that confirmed COVID-19 deaths (together with uncertainty bounds) meaningfully predict high excess-mortality in late 2022.

Conclusion and Future Directions:

My logistic analysis shows that confirmed COVID-19 deaths per 100,000 people are a statistically significant and practically useful predictor of high excess-mortality across countries during Q4-2022. With an odds ratio of 1.019 (p < 0.001) and an AUC of 0.94, the model correctly classifies 84 % of country-weeks and ranks 94 % of random pairs in the right order. These findings answer the research question affirmatively: higher pandemic mortality counts do translate into a measurably greater probability of crossing the excess-mortality threshold, supporting the use of readily available COVID-19 surveillance data as an early warning signal for broader mortality strain.

Because the pseudo-R² remains moderate (≈ 0.10), future work should incorporate additional covariates such as influenza activity, vaccination coverage, healthcare capacity, and socio-economic indicators to improve predictive power and provide more granular country-specific risk estimates. Extending the time window to earlier pandemic waves or to 2023 would also test the stability of the relationship and help refine thresholds for real-time public health dashboards.

E.References

Our World in Data. (2024). Estimated cumulative excess deaths per 100,000 people during COVID-19. https://ourworldindata.org/excess-mortality-covid

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/