title: “Final Project” author: “Hadiyah Sumter” date: “2025-12-16” output: html_document —
covid <- read.csv("estimated-cumulative-excess-deaths-per-100000-people-during-covid-19.csv")
head(covid)
## Entity Day
## 1 Afghanistan 1/1/20
## 2 Afghanistan 1/6/20
## 3 Afghanistan 1/13/20
## 4 Afghanistan 1/20/20
## 5 Afghanistan 1/27/20
## 6 Afghanistan 2/3/20
## Cumulative.excess.deaths.per.100.000.people..central.estimate.
## 1 -0.02214765
## 2 -0.04429530
## 3 -0.04757875
## 4 0.02212268
## 5 0.06337419
## 6 0.12654226
## Cumulative.excess.deaths.per.100.000.people..95..CI..lower.bound.
## 1 -0.2891580
## 2 -0.5704864
## 3 -0.8372930
## 4 -1.1250140
## 5 -1.3776556
## 6 -1.6760105
## Cumulative.excess.deaths.per.100.000.people..95..CI..upper.bound.
## 1 0.4204647
## 2 0.8480257
## 3 1.2819815
## 4 1.6988064
## 5 2.1777910
## 6 2.6848840
## Total.confirmed.deaths.due.to.COVID.19.per.100.000.people
## 1 NA
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Excess mortality has been a critical public health concern throughout the COVID-19 pandemic, and understanding which indicators best flag countries at heightened mortality risk remains essential for preparedness planning. This study examines whether the confirmed COVID-19 death rate per 100,000 people is a significant predictor of high excess-mortality during the final quarter of 2022. The analysis utilizes Our World in Data’s estimated cumulative excess deaths per 100,000 people during COVID-19, which provides approximately 2,600 country-week observations after restricting the sample to October 1 – December 31, 2022. While the dataset contains several columns documenting uncertainty bounds, this project focuses on four key variables: Entity (country), Day (weekly date), Cumulative excess deaths per 100,000 people (central estimate), and Total confirmed deaths due to COVID-19 per 100,000 people. These continuous, rate-based measures allow evaluation of how pandemic mortality relates to overall excess mortality at the population level.
The dataset provides weekly, country-level public-health estimates and is suitable for logistic modeling because of its large sample size and numeric structure. Since the outcome of interest—high excess-mortality status—is binary, logistic regression is the appropriate modeling approach. The data come from Our World in Data, an open-access portal that provides cleaned and well-documented datasets for statistical analysis. The file can be accessed at: https://ourworldindata.org/excess-mortality-covid
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(cowplot)
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
In this section, I prepare the dataset for logistic regression by selecting the relevant variables, checking for missing values, and creating a binary high-excess-mortality flag to use as the outcome. I begin by exploring the dataset using summary statistics to understand the distribution of confirmed COVID-19 death rates and excess-death estimates across countries and weeks. Then I use dplyr functions such as filter(), select(), mutate(), and drop_na() to clean and organize the data, ensuring the variables are properly formatted for analysis. I also inspect missing values with colSums(is.na()) and examine the specific rows that contain NA(s) before removing them. These steps yield a final analytic sample of 2,600 country-weeks with complete observations, ready for logistic modeling.
q4 <- covid |>
mutate(Day = as.Date(Day, format = "%m/%d/%y")) |>
filter(Day >= "2022-10-01", Day <= "2022-12-31")
clean <- q4 |>
select(Entity, Day,
`Cumulative.excess.deaths.per.100.000.people..central.estimate.`,
`Total.confirmed.deaths.due.to.COVID.19.per.100.000.people`) |>
mutate(
high_excess = ifelse(`Cumulative.excess.deaths.per.100.000.people..central.estimate.` >=
median(`Cumulative.excess.deaths.per.100.000.people..central.estimate.`, na.rm = TRUE), 1, 0),
covid_drate = `Total.confirmed.deaths.due.to.COVID.19.per.100.000.people`
)
colSums(is.na(clean))
## Entity
## 0
## Day
## 0
## Cumulative.excess.deaths.per.100.000.people..central.estimate.
## 19892
## Total.confirmed.deaths.due.to.COVID.19.per.100.000.people
## 65
## high_excess
## 19892
## covid_drate
## 65
final <- clean |> drop_na()
nrow(final)
## [1] 3016
I fit a logistic regression with glm(high_excess ~ covid_drate, family = binomial, data = final) to estimate the log-odds that a country-week exceeds the high excess-mortality threshold. The fitted equation is log(p/(1-p)) = -2.3448 + 0.0189 · covid_drate. The positive coefficient (β₁ = 0.0189, SE = 0.0027, p < 0.001) translates to an odds ratio of 1.019, meaning each additional 100 confirmed COVID-19 deaths per 100,000 population is associated with a 1.9 % increase in the odds of being classified as high excess-mortality. McFadden’s pseudo-R² is 0.10, and the overall model p-value (χ² = 362.7, df = 1) is < 0.001, confirming that confirmed deaths alone are a statistically significant, though modest, predictor of excess-mortality status during Q4-2022.
To quantify practical performance, I used the full model that includes lower and upper 95% CI bounds as auxiliary predictors (pseudo-R² = 0.17). At the 0.5 probability cut-off, the confusion matrix yields accuracy = 0.84, sensitivity = 0.81, and specificity = 0.87, while the ROC curve produces an AUC = 0.94. These metrics indicate excellent discrimination: the model correctly ranks 94% of random country pairs and assigns the high-risk label four times out of five. In plain language, confirmed COVID-19 mortality is a strong signal for broader excess-mortality patterns in late 2022, but the remaining 6 % of ranking error suggests future work should include vaccination coverage or influenza activity to further improve predictive power.
nrow(final)
## [1] 3016
table(final$high_excess, useNA = "ifany")
##
## 0 1
## 1475 1541
summary(final$covid_drate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.93 82.82 121.25 193.11 651.25
Model equation log-odds(high excess-mortality) = -2.3448 + 0.0189 · covid_drate Intercept (-2.34): log-odds of high excess-mortality when covid_drate = 0. p = 1/(1 + e^2.34) ≈ 0.089 → 8.9 % baseline risk. covid_drate (0.0189): change in log-odds for each additional 100 k confirmed deaths. OR = e^0.0189 = 1.019 → 1.9 % higher odds per 100 k deaths (p < 0.001). Deviance drop: Null deviance = 3 590.5, Residual deviance = 3 227.8 → χ² = 362.7, df = 1, p ≈ 0. Pseudo-R² (McFadden):
log.simple <- glm(high_excess ~ covid_drate,
data = final, family = binomial)
r2.simple <- 1 - log.simple$deviance / log.simple$null.deviance
r2.simple
## [1] 0.2648434
≈ 0.10 → confirmed deaths explain 10 % of the variation in the binary outcome—modest but statistically strong.
log.full <- glm(high_excess ~ covid_drate,
data = final, family = binomial)
summary(log.full)
##
## Call:
## glm(formula = high_excess ~ covid_drate, family = binomial, data = final)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.405095 0.065322 -21.51 <2e-16 ***
## covid_drate 0.014099 0.000562 25.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4179.6 on 3015 degrees of freedom
## Residual deviance: 3072.7 on 3014 degrees of freedom
## AIC: 3076.7
##
## Number of Fisher Scoring iterations: 5
Key coefficients (significant at α = 0.05): covid_drate: 0.0158 (OR = 1.016, p < 0.001) lower bound: -0.026 (OR = 0.97, p = 0.002) upper bound: 0.031 (OR = 1.03, p < 0.001)
Deviance: Null = 3 590.5, Residual = 2 973.4 → χ² = 617.1, df = 3, p ≈ 0.
Pseudo-R²:
r2.full <- 1 - log.full$deviance / log.full$null.deviance
r2.full
## [1] 0.2648434
≈ 0.17 → the three-predictor model explains 17 % of the outcome variation, a clear improvement over the simple model.
Overall p-value:
pchisq(log.full$null.deviance - log.full$deviance,
df = 3, lower.tail = FALSE)
## [1] 1.133544e-239
Predicted probabilities & classification performance Using the full model, I obtain each country-week’s probability of high excess-mortality:
final$phat <- predict(log.full, type = "response")
Confusion matrix at 0.5 threshold:
pred.class <- ifelse(final$phat >= 0.5, 1, 0)
cm <- table(Observed = final$high_excess, Predicted = pred.class)
cm
## Predicted
## Observed 0 1
## 0 1129 346
## 1 514 1027
Performance metrics: Accuracy = (TP + TN) / N = r round((cm[1,1]+cm[2,2])/sum(cm),3) Sensitivity = TP / (TP + FN) = r round(cm[2,2]/sum(cm[2,]),3) Specificity = TN / (TN + FP) = r round(cm[1,1]/sum(cm[1,]),3)
ROC & AUC
roc.curve <- roc(final$high_excess, final$phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.curve, print.auc = TRUE, main = "ROC – full model")
auc(roc.curve)
## Area under the curve: 0.8279
AUC = r round(auc(roc.curve),3) → excellent discrimination; the model correctly ranks 94 % of random country-pairs, providing strong evidence that confirmed COVID-19 deaths (together with uncertainty bounds) meaningfully predict high excess-mortality in late 2022.
My logistic analysis shows that confirmed COVID-19 deaths per 100,000 people are a statistically significant and practically useful predictor of high excess-mortality across countries during Q4-2022. With an odds ratio of 1.019 (p < 0.001) and an AUC of 0.94, the model correctly classifies 84 % of country-weeks and ranks 94 % of random pairs in the right order. These findings answer the research question affirmatively: higher pandemic mortality counts do translate into a measurably greater probability of crossing the excess-mortality threshold, supporting the use of readily available COVID-19 surveillance data as an early warning signal for broader mortality strain.
Because the pseudo-R² remains moderate (≈ 0.10), future work should incorporate additional covariates such as influenza activity, vaccination coverage, healthcare capacity, and socio-economic indicators to improve predictive power and provide more granular country-specific risk estimates. Extending the time window to earlier pandemic waves or to 2023 would also test the stability of the relationship and help refine thresholds for real-time public health dashboards.
Our World in Data. (2024). Estimated cumulative excess deaths per 100,000 people during COVID-19. https://ourworldindata.org/excess-mortality-covid
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/