This is the Final Project for the course DTSA 5301: Data Science as a Field. I am demonstrating my ability to complete all steps in the data science process by creating a reproducible report on the COVID-19 data set from the Johns Hopkins GitHub repository.
How do COVID-19 case and death patterns differ between Florida and New York throughout the pandemic? Can we predict state-level death rates based on case rates? Which factors beyond case counts influence mortality outcomes?
This dataset contains time series data for COVID-19 confirmed cases and deaths in the United States, reported at the county level. The data comes from the Johns Hopkins University Center for Systems Science and Engineering and is aggregated from state and local health departments. It was updated daily around 23:59 UTC; the series used in this report runs from 2020-01-22 to 2023-03-09.
Source: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
library(tidyverse)
library(lubridate)
url_in <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
file_names <- c("time_series_covid19_confirmed_US.csv",
"time_series_covid19_deaths_US.csv")
urls <- str_c(url_in, file_names)
US_cases <- read_csv(urls[1])
US_deaths <- read_csv(urls[2])
US_cases <- US_cases %>%
pivot_longer(cols = -(UID:Combined_Key),
names_to = "date",
values_to = "cases") %>%
select(Admin2:cases) %>%
mutate(date = mdy(date)) %>%
select(-c(Lat, Long_))
US_deaths <- US_deaths %>%
pivot_longer(cols = -(UID:Population),
names_to = "date",
values_to = "deaths") %>%
select(Admin2:deaths) %>%
mutate(date = mdy(date)) %>%
select(-c(Lat, Long_))
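# Join cases and deaths on their shared identifier columns, named explicitly
US <- US_cases %>%
  full_join(US_deaths,
            by = c("Admin2", "Province_State", "Country_Region",
                   "Combined_Key", "date"))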
summary(US)
## Admin2 Province_State Country_Region Combined_Key
## Length:3819906 Length:3819906 Length:3819906 Length:3819906
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## date cases Population deaths
## Min. :2020-01-22 Min. : -3073 Min. : 0 Min. : -82.0
## 1st Qu.:2020-11-02 1st Qu.: 330 1st Qu.: 9917 1st Qu.: 4.0
## Median :2021-08-15 Median : 2272 Median : 24892 Median : 37.0
## Mean :2021-08-15 Mean : 14088 Mean : 99604 Mean : 186.9
## 3rd Qu.:2022-05-28 3rd Qu.: 8159 3rd Qu.: 64979 3rd Qu.: 122.0
## Max. :2023-03-09 Max. :3710586 Max. :10039107 Max. :35545.0
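The summary above shows negative minimums for `cases` and `deaths` and a minimum `Population` of 0, which cumulative counts and real populations should not produce. A minimal sketch of how such rows could be flagged before aggregating, using a small made-up table in place of `US` (the rows and values are hypothetical; only the column names follow the report):

```r
library(dplyr)

# Hypothetical stand-in for the joined US table; all values are invented.
toy <- tibble(
  Combined_Key = c("County A", "Unassigned, State X", "Out of State X", "County B"),
  cases        = c(120, -15, 30, 0),
  deaths       = c(3, -1, 0, 0),
  Population   = c(50000, 0, 0, 12000)
)

# Flag rows with impossible cumulative counts or no recorded population.
suspect <- toy %>%
  filter(cases < 0 | deaths < 0 | Population == 0)

print(suspect)

# Count each kind of problem separately.
toy %>%
  summarize(negative_cases  = sum(cases < 0),
            negative_deaths = sum(deaths < 0),
            zero_population = sum(Population == 0))
```

Whether to drop, correct, or keep such rows is a judgment call. The report's per capita step later filters on `Population > 0` at the state level, which handles the zero-population entries there.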
US_by_state <- US %>%
group_by(Province_State, Country_Region, date) %>%
summarize(cases = sum(cases),
deaths = sum(deaths),
Population = sum(Population)) %>%
select(Province_State, Country_Region, date, cases, deaths, Population) %>%
ungroup()
US_by_state <- US_by_state %>%
group_by(Province_State) %>%
mutate(new_cases = cases - lag(cases),
new_deaths = deaths - lag(deaths)) %>%
ungroup()
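The `lag()` differencing above turns cumulative totals into daily counts. A small sketch with invented numbers shows the two edge cases that the later filters (`new_cases >= 0`, `new_deaths >= 0`) deal with: the first day has no previous value, and a downward revision of a cumulative total yields a negative daily count.

```r
library(dplyr)

# Invented cumulative counts for one state; day 4 is revised downward.
toy <- tibble(
  date  = as.Date("2020-03-01") + 0:4,
  cases = c(10, 25, 40, 35, 60)
)

toy <- toy %>%
  mutate(new_cases = cases - lag(cases))

# First value is NA (no previous day), then 15, 15, -5, 25.
toy$new_cases

# A comparison with NA evaluates to NA, so filter() drops the first day
# along with the negative revision.
kept <- toy %>% filter(new_cases >= 0)
nrow(kept)
```

Filtering removes the negative days from plots but leaves the cumulative totals unchanged, so the sum of the remaining daily counts no longer matches the final cumulative figure.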
US_totals <- US_by_state %>%
group_by(Country_Region, date) %>%
summarize(cases = sum(cases),
deaths = sum(deaths),
Population = sum(Population)) %>%
mutate(new_cases = cases - lag(cases),
new_deaths = deaths - lag(deaths)) %>%
ungroup()
US_totals %>%
filter(cases > 0) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = cases, color = "Cases"), linewidth = 1) +
geom_line(aes(y = deaths * 100, color = "Deaths x100"), linewidth = 1) +
scale_y_log10(labels = scales::comma) +
scale_color_manual(values = c("Cases" = "steelblue", "Deaths x100" = "darkred")) +
theme_minimal() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "COVID-19 in the United States",
subtitle = "Cumulative Cases and Deaths on Log Scale",
x = "Date",
y = "Count (Log Scale)",
color = "")
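Cumulative cases and deaths follow similar trajectories on the log scale, with deaths lagging behind cases: both rise steeply in early 2020 and flatten afterward. Because the curves are cumulative, the later waves of 2020-2023 appear only as changes in slope; the daily counts plotted below show them more clearly.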
FL_NY <- US_by_state %>%
filter(Province_State %in% c("Florida", "New York"),
cases > 0,
new_cases >= 0)
ggplot(FL_NY, aes(x = date, y = new_cases)) +
geom_bar(stat = "identity", aes(fill = Province_State)) +
facet_wrap(~Province_State, ncol = 1) +
scale_fill_manual(values = c("Florida" = "#FF6B35", "New York" = "#004E89")) +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "COVID-19 New Daily Cases: Florida vs New York",
subtitle = "Comparing pandemic trajectories in two major population centers",
x = "Date",
y = "New Daily Cases")
New York experienced a massive initial surge in early 2020 when the virus first hit the United States. Florida showed more distributed waves throughout 2020-2022, with particularly large surges during summer months. New York’s early peak was much sharper and more concentrated than Florida’s subsequent waves.
FL_NY_filtered <- FL_NY %>%
filter(new_deaths >= 0, new_deaths < 1000)
ggplot(FL_NY_filtered, aes(x = date, y = new_deaths, color = Province_State)) +
geom_line(linewidth = 1, alpha = 0.8) +
scale_color_manual(values = c("Florida" = "#FF6B35", "New York" = "#004E89")) +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "COVID-19 New Daily Deaths: Florida vs New York",
subtitle = "Death patterns reflect case surges with a 2-3 week delay",
x = "Date",
y = "New Daily Deaths",
color = "State")
Death patterns follow case surges with approximately a two to three week delay. New York shows dramatically higher peak deaths in spring 2020 during the initial outbreak. Florida’s death rates remained more consistent across multiple waves, though with significant spikes during the Delta and Omicron variants. New York’s early death toll reflects the strain on healthcare systems before treatment protocols were established.
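The two to three week delay is read off the plots by eye. One way it could be estimated is with a cross-correlation between daily cases and daily deaths. The sketch below uses synthetic series with a 14-day shift built in (the data, the 1.5% scaling, and the variable names are assumptions for illustration), not the Florida or New York series:

```r
set.seed(1)

n     <- 300
shift <- 14  # delay built into the synthetic data

# Synthetic daily new cases: noise around a baseline, generated with
# `shift` extra days so deaths can be derived from earlier cases.
cases_full <- 1000 + 200 * rnorm(n + shift)

cases  <- cases_full[(shift + 1):(n + shift)]  # cases on days 1..n
deaths <- 0.015 * cases_full[1:n]              # deaths track cases from `shift` days earlier

# ccf(x, y) at lag k estimates cor(x[t + k], y[t]), so a peak at a
# positive lag means deaths trail cases by that many days.
cc       <- ccf(deaths, cases, lag.max = 30, plot = FALSE)
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

On this synthetic input the peak falls at the 14-day shift that was built in. Real daily counts are strongly autocorrelated and carry weekly reporting cycles, so the peak would be broader there and smoothing (for example a 7-day average) before correlating is probably sensible.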
Can we predict deaths per 100,000 people based on cases per 100,000 people across all US states and territories?
state_totals <- US_by_state %>%
group_by(Province_State) %>%
summarize(total_cases = max(cases),
total_deaths = max(deaths),
Population = max(Population)) %>%
filter(Population > 0) %>%
mutate(cases_per_100k = (total_cases / Population) * 100000,
deaths_per_100k = (total_deaths / Population) * 100000) %>%
filter(!Province_State %in% c("American Samoa", "Diamond Princess",
"Grand Princess", "Guam",
"Northern Mariana Islands", "Virgin Islands"))
state_totals %>%
arrange(desc(deaths_per_100k)) %>%
select(Province_State, cases_per_100k, deaths_per_100k) %>%
head(10)
## # A tibble: 10 × 3
## Province_State cases_per_100k deaths_per_100k
## <chr> <dbl> <dbl>
## 1 Arizona 33571. 455.
## 2 Oklahoma 32624. 454.
## 3 Mississippi 33290. 449.
## 4 West Virginia 35865. 444.
## 5 New Mexico 31997. 432.
## 6 Arkansas 33365. 431.
## 7 Alabama 33540. 429.
## 8 Tennessee 36829. 428.
## 9 Michigan 30682. 423.
## 10 Kentucky 38465. 406.
model <- lm(deaths_per_100k ~ cases_per_100k, data = state_totals)
summary(model)
##
## Call:
## lm(formula = deaths_per_100k ~ cases_per_100k, data = state_totals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -217.81 -65.97 14.49 54.51 114.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.007861 81.857999 0.599 0.55208
## cases_per_100k 0.008896 0.002592 3.433 0.00121 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.11 on 50 degrees of freedom
## Multiple R-squared: 0.1907, Adjusted R-squared: 0.1745
## F-statistic: 11.78 on 1 and 50 DF, p-value: 0.001209
The linear regression model reveals a statistically significant but modest positive relationship between case rates and death rates across states. The R-squared value of 0.1907 indicates that approximately 19% of the variation in death rates can be explained by case rates alone. The slope is statistically significant, with a p-value of 0.0012.
The coefficient of 0.008896 means that for every 1,000 additional cases per 100,000 people, we expect approximately 8.9 additional deaths per 100,000 people. This corresponds to a marginal rate of roughly 0.9 deaths per 100 reported cases across states.
However, the substantial unexplained variation (81%) indicates that other factors beyond infection rates significantly influence mortality outcomes. These likely include healthcare system capacity, population age demographics, comorbidity prevalence, vaccination rates, and timing of when states were hit relative to treatment advancement.
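Because the figures quoted in the text come straight from `summary(model)`, they can be pulled out of the fitted object instead of being typed by hand, which keeps the narrative and plot labels in sync with the output. A minimal sketch on made-up data (the data frame `toy` and its values are hypothetical; the accessors are base R):

```r
# Made-up data standing in for state_totals.
toy <- data.frame(
  cases_per_100k  = c(25000, 28000, 30000, 32000, 34000, 36000, 38000, 41000),
  deaths_per_100k = c(260, 330, 280, 390, 300, 410, 350, 420)
)

fit <- lm(deaths_per_100k ~ cases_per_100k, data = toy)
s   <- summary(fit)

r2      <- s$r.squared
slope   <- coef(s)["cases_per_100k", "Estimate"]
p_slope <- coef(s)["cases_per_100k", "Pr(>|t|)"]

# Build a label from the fitted values rather than hard-coding it.
label <- sprintf("R-squared = %.4f, p-value = %.4g", r2, p_slope)
print(label)

# Expected change in deaths per 100k for 1,000 more cases per 100k.
slope * 1000
```

The same `label` string could be passed to `labs(subtitle = ...)` so that a plot annotation cannot drift from the model it describes.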
state_totals <- state_totals %>%
mutate(predicted_deaths = predict(model, state_totals),
residuals = deaths_per_100k - predicted_deaths)
ggplot(state_totals, aes(x = cases_per_100k, y = deaths_per_100k)) +
geom_point(aes(color = abs(residuals)), size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "darkblue", linewidth = 1.5) +
scale_color_gradient(low = "lightblue", high = "red",
name = "Prediction\nError\n(abs value)") +
theme_minimal() +
labs(title = "Linear Model: COVID-19 Deaths vs Cases per 100,000 Population",
subtitle = "R-squared = 0.1907, p-value = 0.0012",
x = "Cases per 100,000 People",
y = "Deaths per 100,000 People")
The scatter plot shows the linear relationship between case rates and death rates. Points colored in red represent states where the model’s predictions were least accurate. The confidence band (shaded region) is narrowest near the average case rate and widens toward both ends of the range, where fewer states constrain the fitted line.
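For a simple linear regression, the confidence band around the fitted line is narrowest at the mean of the predictor and widens symmetrically toward both ends. A short sketch with synthetic data (all values are invented) shows this using `predict()` with `interval = "confidence"`:

```r
set.seed(7)

x   <- 1:21
y   <- 5 + 2 * x + rnorm(length(x), sd = 3)
fit <- lm(y ~ x)

# Confidence intervals for the mean response at the low end of x,
# at the mean of x (11), and at the high end.
new_x <- data.frame(x = c(1, 11, 21))
ci    <- predict(fit, newdata = new_x, interval = "confidence")

width <- ci[, "upr"] - ci[, "lwr"]
names(width) <- c("low", "mean", "high")
width
```

The interval at the mean of `x` is the narrowest of the three, and the two ends are equally wide because the `x` values are symmetric around 11.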
outliers <- state_totals %>%
arrange(desc(abs(residuals))) %>%
select(Province_State, cases_per_100k, deaths_per_100k,
predicted_deaths, residuals) %>%
head(8)
print(outliers)
## # A tibble: 8 × 5
## Province_State cases_per_100k deaths_per_100k predicted_deaths residuals
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska 41519. 201. 418. -218.
## 2 Utah 34010. 165. 352. -186.
## 3 Hawaii 26882. 130. 288. -158.
## 4 Puerto Rico 29334. 155. 310. -155.
## 5 Vermont 24458. 149. 267. -118.
## 6 Oklahoma 32624. 454. 339. 115.
## 7 Arizona 33571. 455. 348. 107.
## 8 Mississippi 33290. 449. 345. 104.
States with large positive residuals experienced more deaths than predicted by their case counts. This could reflect older populations, overwhelmed healthcare systems, or being hit early before effective treatments were developed. Oklahoma, Arizona, and Mississippi show particularly high mortality relative to their case rates.
States with large negative residuals had fewer deaths than expected. These states may have younger populations, better healthcare infrastructure, higher vaccination rates, or more aggressive early interventions. Alaska and Utah show notably lower mortality than their case rates would predict.
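As a worked check, the residuals in the table above can be reproduced from the coefficients printed in the model summary (intercept 49.007861, slope 0.008896) and each state's rounded rates. Small gaps relative to the printed table come from rounding in the displayed values.

```r
intercept <- 49.007861
slope     <- 0.008896

# Rounded values copied from the printed outlier table.
states <- data.frame(
  state           = c("Alaska", "Utah", "Mississippi"),
  cases_per_100k  = c(41519, 34010, 33290),
  deaths_per_100k = c(201, 165, 449)
)

states$predicted <- intercept + slope * states$cases_per_100k
states$residual  <- states$deaths_per_100k - states$predicted

# Residual expressed as a percentage of the predicted value.
states$pct_of_predicted <- 100 * states$residual / states$predicted

print(states)
```

With these inputs Alaska's prediction comes out near 418 and its residual near -217, in line with the table. Mississippi sits about 30% above its predicted death rate and Utah about 53% below.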
fl_ny_summary <- state_totals %>%
filter(Province_State %in% c("Florida", "New York")) %>%
select(Province_State, Population, total_cases, total_deaths,
cases_per_100k, deaths_per_100k) %>%
mutate(mortality_rate = (total_deaths / total_cases) * 100,
mortality_rate = round(mortality_rate, 2))
print(fl_ny_summary)
## # A tibble: 2 × 7
## Province_State Population total_cases total_deaths cases_per_100k
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Florida 21477737 7574590 86850 35267.
## 2 New York 19453561 6794738 77157 34928.
## # ℹ 2 more variables: deaths_per_100k <dbl>, mortality_rate <dbl>
fl_ny_long <- fl_ny_summary %>%
select(Province_State, cases_per_100k, deaths_per_100k) %>%
pivot_longer(cols = c(cases_per_100k, deaths_per_100k),
names_to = "metric",
values_to = "rate") %>%
mutate(metric = ifelse(metric == "cases_per_100k",
"Cases per 100k",
"Deaths per 100k"))
ggplot(fl_ny_long, aes(x = Province_State, y = rate, fill = metric)) +
geom_col(position = "dodge", width = 0.7) +
geom_text(aes(label = scales::comma(round(rate, 0))),
position = position_dodge(width = 0.7),
vjust = -0.5, size = 4) +
scale_fill_manual(values = c("Cases per 100k" = "#56B4E9",
"Deaths per 100k" = "#E69F00")) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "COVID-19 Impact: Florida vs New York (Per Capita)",
subtitle = "Population-adjusted comparison of pandemic outcomes",
x = "",
y = "Rate per 100,000 People",
fill = "")
When adjusted for population size, both states show similar overall case rates per capita, with Florida at 35,267 and New York at 34,928 cases per 100,000 people. Cumulative death rates are also close: about 404 deaths per 100,000 in Florida compared to about 397 per 100,000 in New York.
The cumulative totals therefore hide the main difference between the two states, which is timing. New York’s deaths were concentrated in spring 2020, when hospitals were overwhelmed and before effective treatment protocols were established, while Florida’s accumulated over several later waves. The overall case fatality rate in New York is 1.14% compared to Florida’s 1.15%, so the two states ended the period with nearly identical cumulative outcomes despite very different trajectories.
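The per capita and case fatality figures can be recomputed from the population, case, and death totals printed in the summary table above, which is a quick way to check any number quoted in the text:

```r
# Totals copied from the printed fl_ny_summary table.
fl_ny <- data.frame(
  state        = c("Florida", "New York"),
  population   = c(21477737, 19453561),
  total_cases  = c(7574590, 6794738),
  total_deaths = c(86850, 77157)
)

fl_ny$cases_per_100k  <- fl_ny$total_cases  / fl_ny$population * 1e5
fl_ny$deaths_per_100k <- fl_ny$total_deaths / fl_ny$population * 1e5

# Case fatality rate: deaths as a percentage of reported cases.
fl_ny$cfr_pct <- fl_ny$total_deaths / fl_ny$total_cases * 100

print(fl_ny)
```

These totals give roughly 35,267 and 34,928 cases per 100,000, 404 and 397 deaths per 100,000, and case fatality rates of 1.15% and 1.14% for Florida and New York respectively.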
This analysis of Johns Hopkins COVID-19 data reveals three critical findings about the pandemic’s impact across the United States.
First, case rates are a statistically significant but weak predictor of death rates at the state level, with an R-squared of 0.19. States with more reported infections tended to record more deaths, but 81% of the variation in death rates remains unexplained by cases alone, which suggests that healthcare capacity, demographics, and policy interventions substantially influence outcomes.
Second, Florida and New York experienced vastly different pandemic trajectories despite similar final per capita case and death rates. New York’s concentrated early outbreak produced a sharp mortality peak in spring 2020, while Florida faced multiple distributed waves. Their cumulative case fatality rates ended nearly equal (1.14% and 1.15%), which illustrates how cumulative totals can mask differences in timing relative to medical advancement and healthcare system capacity.
Third, substantial variation exists across states even after controlling for case rates. States like Mississippi show about 30% higher mortality than predicted, while states like Utah show about 53% lower mortality. These outliers highlight the complex interplay of age demographics, comorbidities, healthcare access, and public health responses in determining pandemic outcomes.
The data suggest that while infection control matters, factors other than reported case rates account for most of the state-level variation in deaths. Future pandemic responses must account for healthcare system preparedness and demographic factors beyond just reducing transmission.
Testing availability varied dramatically across states and over time. Early in the pandemic, limited testing capacity meant many cases went undetected, artificially lowering recorded case counts while death counts remained more consistent. States with robust testing programs may appear to have higher case rates simply due to better detection, not actually higher infection prevalence. This asymmetry in case detection versus death detection can distort the true case-fatality relationship.
States use inconsistent criteria for attributing deaths to COVID-19. Some states count any death where COVID-19 was present, while others only count deaths directly caused by COVID-19. Similarly, the timing of death reporting varies, with some states reporting daily and others batching reports weekly. These inconsistencies make state-to-state comparisons inherently imperfect and may create artificial outliers in the regression analysis.
This analysis uses cumulative totals without adequately accounting for when states were affected. Treatment protocols improved dramatically throughout the pandemic. States hit early like New York faced much higher mortality rates before treatments like dexamethasone and remdesivir came into use. Later-affected states benefited from established protocols, which makes their outcomes hard to compare directly with those of early states. The linear model cannot capture this time-dependent improvement in care quality.
States differ substantially in age distributions, obesity rates, diabetes prevalence, and other comorbidities that strongly influence COVID-19 mortality. Florida has one of the oldest populations in the US, while Utah has the youngest. These demographic differences are entirely unaccounted for in the simple linear model, yet they may explain more variance in death rates than case rates themselves. The model attributes differences to case rates that may actually stem from underlying population health.
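To illustrate the point about omitted demographics, the sketch below adds one covariate to the simple model and compares adjusted R-squared. Everything here is synthetic: `median_age`, the coefficients used to generate the data, and the noise level are assumptions chosen for illustration, and the report's dataset contains no such column.

```r
set.seed(42)

n <- 40
toy <- data.frame(
  cases_per_100k = runif(n, 25000, 40000),
  median_age     = runif(n, 31, 45)      # hypothetical covariate
)

# Synthetic outcome that depends on both case rate and age, plus noise.
toy$deaths_per_100k <- 50 + 0.008 * toy$cases_per_100k +
  12 * (toy$median_age - 38) + rnorm(n, sd = 20)

m_cases <- lm(deaths_per_100k ~ cases_per_100k, data = toy)
m_both  <- lm(deaths_per_100k ~ cases_per_100k + median_age, data = toy)

adj_r2 <- c(cases_only = summary(m_cases)$adj.r.squared,
            with_age   = summary(m_both)$adj.r.squared)
adj_r2
```

In this constructed example the model with the age covariate explains more of the variance by design. With real data, the covariate would have to be joined in from an external source such as census estimates, and the size of any improvement is an open question.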
My choice to compare Florida and New York specifically reflects awareness of their political differences and high-profile governors during the pandemic. This selection was not random but chosen because these states represent opposing policy approaches. This introduces confirmation bias risk, as I may unconsciously interpret results to support narratives about policy effectiveness that I encountered in media coverage. A truly objective analysis would use random state selection or analyze all states equally.
I attempted to mitigate these biases by focusing on statistical relationships rather than causal claims about policy effectiveness. I presented per capita rates to account for population size differences and fit the regression on all states rather than a hand-picked subset. I acknowledged model limitations and avoided attributing causation to correlation. However, these biases remain inherent limitations that cannot be fully eliminated with this dataset alone.
Johns Hopkins University CSSE COVID-19 Data Repository: https://github.com/CSSEGISandData/COVID-19
R for Data Science by Hadley Wickham: https://r4ds.had.co.nz/
Tidyverse Documentation: https://www.tidyverse.org/
COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series