Introduction

Project Purpose

This is the Final Project for the course DTSA 5301: Data Science as a Field. I am demonstrating my ability to complete all steps in the data science process by creating a reproducible report on the COVID-19 data set from the Johns Hopkins GitHub repository.

Question of Interest

How do COVID-19 case and death patterns differ between Florida and New York throughout the pandemic? Can we predict state-level death rates based on case rates? Which factors beyond case counts influence mortality outcomes?

Data Description

This dataset contains time series data for COVID-19 confirmed cases and deaths in the United States, reported at the county level. The data comes from the Johns Hopkins University Center for Systems Science and Engineering and is aggregated from state and local health departments. The repository was updated daily while data collection was active; the series used here ends on 2023-03-09.

Source: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

Import Libraries

library(tidyverse)
library(lubridate)

Import Data

url_in <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"

file_names <- c("time_series_covid19_confirmed_US.csv",
                "time_series_covid19_deaths_US.csv")

urls <- str_c(url_in, file_names)

US_cases <- read_csv(urls[1])
US_deaths <- read_csv(urls[2])

Tidy and Transform Data

Transform US Cases Data

US_cases <- US_cases %>%
  pivot_longer(cols = -(UID:Combined_Key),
               names_to = "date",
               values_to = "cases") %>%
  select(Admin2:cases) %>%
  mutate(date = mdy(date)) %>%
  select(-c(Lat, Long_))

Transform US Deaths Data

US_deaths <- US_deaths %>%
  pivot_longer(cols = -(UID:Population),
               names_to = "date",
               values_to = "deaths") %>%
  select(Admin2:deaths) %>%
  mutate(date = mdy(date)) %>%
  select(-c(Lat, Long_))

Join Cases and Deaths

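# Join on the shared identifier columns explicitly rather than relying on inferred keys
US <- US_cases %>%
  full_join(US_deaths, by = c("Admin2", "Province_State", "Country_Region",
                              "Combined_Key", "date"))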

summary(US)
##     Admin2          Province_State     Country_Region     Combined_Key      
##  Length:3819906     Length:3819906     Length:3819906     Length:3819906    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       date                cases           Population           deaths       
##  Min.   :2020-01-22   Min.   :  -3073   Min.   :       0   Min.   :  -82.0  
##  1st Qu.:2020-11-02   1st Qu.:    330   1st Qu.:    9917   1st Qu.:    4.0  
##  Median :2021-08-15   Median :   2272   Median :   24892   Median :   37.0  
##  Mean   :2021-08-15   Mean   :  14088   Mean   :   99604   Mean   :  186.9  
##  3rd Qu.:2022-05-28   3rd Qu.:   8159   3rd Qu.:   64979   3rd Qu.:  122.0  
##  Max.   :2023-03-09   Max.   :3710586   Max.   :10039107   Max.   :35545.0
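
The summary above shows negative minimums for cases and deaths and a minimum Population of 0, so a few quick checks before aggregating may be worthwhile. The following is a minimal sketch on a made-up three-row table; the `toy` name and all of its values are invented for illustration and do not come from the repository.

```r
library(dplyr)

# Toy stand-in for the joined `US` table; every value here is invented.
toy <- tibble(
  Combined_Key = c("A County", "B County", "Cruise Ship"),
  cases        = c(120, -5, 40),
  deaths       = c(3, 0, -1),
  Population   = c(50000, 12000, 0)
)

# Count rows that would distort sums or per capita rates.
checks <- toy %>%
  summarize(
    negative_cases  = sum(cases < 0, na.rm = TRUE),
    negative_deaths = sum(deaths < 0, na.rm = TRUE),
    zero_population = sum(Population == 0, na.rm = TRUE)
  )

print(checks)
```

Running the same `summarize()` call on `US` would show how many county-level rows are affected before they are summed into state totals.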

Aggregate by State

US_by_state <- US %>%
  group_by(Province_State, Country_Region, date) %>%
  summarize(cases = sum(cases), 
            deaths = sum(deaths), 
            Population = sum(Population)) %>%
  select(Province_State, Country_Region, date, cases, deaths, Population) %>%
  ungroup()

US_by_state <- US_by_state %>%
  group_by(Province_State) %>%
  mutate(new_cases = cases - lag(cases),
         new_deaths = deaths - lag(deaths)) %>%
  ungroup()

US_totals <- US_by_state %>%
  group_by(Country_Region, date) %>%
  summarize(cases = sum(cases),
            deaths = sum(deaths),
            Population = sum(Population)) %>%
  mutate(new_cases = cases - lag(cases),
         new_deaths = deaths - lag(deaths)) %>%
  ungroup()
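
The `cases - lag(cases)` step turns cumulative counts into daily counts. A small sketch on an invented series (the `toy_state` table is hypothetical) shows two side effects: the first row has no previous value and becomes NA, and a downward revision in the cumulative count produces a negative daily value.

```r
library(dplyr)

# Invented cumulative counts for one hypothetical state; day 4 is revised downward.
toy_state <- tibble(
  date  = as.Date("2020-03-01") + 0:4,
  cases = c(10, 25, 40, 38, 60)
)

# Daily count = today's cumulative total minus yesterday's.
toy_state <- toy_state %>%
  arrange(date) %>%
  mutate(new_cases = cases - lag(cases))

print(toy_state)
```

The later plots filter on `new_cases >= 0`, which removes both the NA row and any negative rows.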

Visualization 1: US Total Cases and Deaths Over Time

US_totals %>%
  filter(cases > 0) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = cases, color = "Cases"), linewidth = 1) +
  geom_line(aes(y = deaths * 100, color = "Deaths x100"), linewidth = 1) +
  scale_y_log10(labels = scales::comma) +
  scale_color_manual(values = c("Cases" = "steelblue", "Deaths x100" = "darkred")) +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "COVID-19 in the United States",
       subtitle = "Cumulative Cases and Deaths on Log Scale",
       x = "Date",
       y = "Count (Log Scale)",
       color = "")

Cumulative cases and deaths follow similar trajectories, with deaths lagging behind cases. On the log scale, both curves climb steeply in early 2020 and then flatten, and successive waves of infection between 2020 and 2023 appear as renewed increases in slope.

Visualization 2: Florida vs New York New Cases Comparison

FL_NY <- US_by_state %>%
  filter(Province_State %in% c("Florida", "New York"),
         cases > 0,
         new_cases >= 0)

ggplot(FL_NY, aes(x = date, y = new_cases)) +
  geom_bar(stat = "identity", aes(fill = Province_State)) +
  facet_wrap(~Province_State, ncol = 1) +
  scale_fill_manual(values = c("Florida" = "#FF6B35", "New York" = "#004E89")) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "COVID-19 New Daily Cases: Florida vs New York",
       subtitle = "Comparing pandemic trajectories in two major population centers",
       x = "Date",
       y = "New Daily Cases")

New York experienced a massive initial surge in early 2020 when the virus first hit the United States. Florida showed more distributed waves throughout 2020-2022, with particularly large surges during summer months. New York’s early peak was much sharper and more concentrated than Florida’s subsequent waves.

Visualization 3: Florida vs New York Deaths Over Time

FL_NY_filtered <- FL_NY %>%
  filter(new_deaths >= 0, new_deaths < 1000)

ggplot(FL_NY_filtered, aes(x = date, y = new_deaths, color = Province_State)) +
  geom_line(linewidth = 1, alpha = 0.8) +
  scale_color_manual(values = c("Florida" = "#FF6B35", "New York" = "#004E89")) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "COVID-19 New Daily Deaths: Florida vs New York",
       subtitle = "Death patterns reflect case surges with a 2-3 week delay",
       x = "Date",
       y = "New Daily Deaths",
       color = "State")

Death patterns appear to follow case surges with a delay of roughly two to three weeks, judging from the plots. New York shows dramatically higher peak deaths in spring 2020 during the initial outbreak. Florida’s daily deaths were more evenly spread across multiple waves, though with significant spikes during the Delta and Omicron waves. New York’s early death toll likely reflects the strain on healthcare systems before treatment protocols were established.
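
The delay between cases and deaths is read off the plots rather than computed. One possible check is sketched below on synthetic data: deaths are constructed as 2% of cases shifted 14 days later, and the code searches for the lag that maximizes the correlation. The series, the 2% ratio, and the 14-day shift are all invented; for the real data one would substitute a single state's `new_cases` and `new_deaths`, probably after smoothing, which is a step this report does not perform.

```r
# Synthetic daily series: one smooth wave of cases, with deaths built as
# 2% of cases shifted 14 days later. Both numbers are invented.
n      <- 200
days   <- seq_len(n)
cases  <- 1000 * exp(-((days - 80)^2) / (2 * 15^2))
shift  <- 14
deaths <- c(rep(0, shift), 0.02 * cases[1:(n - shift)])

# Correlate cases on day t with deaths on day t + k for each candidate lag k.
lags <- 0:30
lag_cor <- sapply(lags, function(k) {
  cor(cases[1:(n - k)], deaths[(1 + k):n])
})

# The lag with the highest correlation is the estimated delay in days.
best_lag <- lags[which.max(lag_cor)]
print(best_lag)
```

On this synthetic input the search recovers the 14-day shift that was built in; real reporting data is noisier and may not give a single clear peak.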

Analysis: Linear Regression Model

Model Objective

Can we predict deaths per 100,000 people based on cases per 100,000 people across the 50 states, the District of Columbia, and Puerto Rico? The smaller territories and the two cruise ships are excluded below.

state_totals <- US_by_state %>%
  group_by(Province_State) %>%
  summarize(total_cases = max(cases),
            total_deaths = max(deaths),
            Population = max(Population)) %>%
  filter(Population > 0) %>%
  mutate(cases_per_100k = (total_cases / Population) * 100000,
         deaths_per_100k = (total_deaths / Population) * 100000) %>%
  filter(!Province_State %in% c("American Samoa", "Diamond Princess", 
                                 "Grand Princess", "Guam", 
                                 "Northern Mariana Islands", "Virgin Islands"))

state_totals %>%
  arrange(desc(deaths_per_100k)) %>%
  select(Province_State, cases_per_100k, deaths_per_100k) %>%
  head(10)
## # A tibble: 10 × 3
##    Province_State cases_per_100k deaths_per_100k
##    <chr>                   <dbl>           <dbl>
##  1 Arizona                33571.            455.
##  2 Oklahoma               32624.            454.
##  3 Mississippi            33290.            449.
##  4 West Virginia          35865.            444.
##  5 New Mexico             31997.            432.
##  6 Arkansas               33365.            431.
##  7 Alabama                33540.            429.
##  8 Tennessee              36829.            428.
##  9 Michigan               30682.            423.
## 10 Kentucky               38465.            406.

Build Linear Model

model <- lm(deaths_per_100k ~ cases_per_100k, data = state_totals)
summary(model)
## 
## Call:
## lm(formula = deaths_per_100k ~ cases_per_100k, data = state_totals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -217.81  -65.97   14.49   54.51  114.96 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    49.007861  81.857999   0.599  0.55208   
## cases_per_100k  0.008896   0.002592   3.433  0.00121 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.11 on 50 degrees of freedom
## Multiple R-squared:  0.1907, Adjusted R-squared:  0.1745 
## F-statistic: 11.78 on 1 and 50 DF,  p-value: 0.001209

Model Interpretation

The linear regression model shows a positive but modest relationship between case rates and death rates across states. The R-squared value of 0.1907 indicates that approximately 19% of the variation in death rates can be explained by case rates alone. The slope is statistically significant, with a p-value of 0.0012, while the intercept is not distinguishable from zero (p = 0.55).

The coefficient of 0.008896 means that for every 1,000 additional cases per 100,000 people, we expect approximately 8.9 additional deaths per 100,000 people. This corresponds to a marginal rate of roughly 0.9 deaths per 100 additional reported cases across states.

However, the substantial unexplained variation (about 81%) indicates that factors beyond reported infection rates account for most of the differences in mortality outcomes. These likely include healthcare system capacity, population age demographics, comorbidity prevalence, vaccination rates, and the timing of each state’s outbreaks relative to advances in treatment.
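
One way to keep the narrative figures tied to the fitted model is to pull them from the model object instead of typing them by hand. The sketch below uses five invented data points (the `toy_rates` table is hypothetical); the same accessor calls would apply to `model` from the block above.

```r
# Invented per-100k rates for five hypothetical states.
toy_rates <- data.frame(
  cases_per_100k  = c(10000, 20000, 30000, 40000, 50000),
  deaths_per_100k = c(105, 195, 310, 390, 505)
)

fit <- lm(deaths_per_100k ~ cases_per_100k, data = toy_rates)
fit_summary <- summary(fit)

# Pull the quantities quoted in the interpretation directly from the fit.
slope     <- coef(fit)[["cases_per_100k"]]
r_squared <- fit_summary$r.squared
p_value   <- fit_summary$coefficients["cases_per_100k", "Pr(>|t|)"]

# Deaths per 100k associated with 1,000 more cases per 100k.
per_1000_cases <- slope * 1000
# Share of variance the model leaves unexplained.
unexplained <- 1 - r_squared

cat(sprintf("slope = %.5f, R-squared = %.3f, p = %.4g\n",
            slope, r_squared, p_value))
```

In an R Markdown report these values could be inserted with inline code so that the text and the printed summary always agree.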

Visualize Model Predictions

state_totals <- state_totals %>%
  mutate(predicted_deaths = predict(model, state_totals),
         residuals = deaths_per_100k - predicted_deaths)

ggplot(state_totals, aes(x = cases_per_100k, y = deaths_per_100k)) +
  geom_point(aes(color = abs(residuals)), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "darkblue", linewidth = 1.5) +
  scale_color_gradient(low = "lightblue", high = "red", 
                       name = "Prediction\nError\n(abs value)") +
  theme_minimal() +
  labs(title = "Linear Model: COVID-19 Deaths vs Cases per 100,000 Population",
       subtitle = "R-squared = 0.1907, p-value = 0.0012",
       x = "Cases per 100,000 People",
       y = "Deaths per 100,000 People")

The scatter plot shows the linear relationship between case rates and death rates. Points colored in red represent states where the model’s predictions were least accurate. The confidence band (shaded region) is narrowest near the average case rate and widens toward both ends of the range, where fewer states constrain the fitted line.
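
For a simple linear fit, the width of the confidence band depends on how far a point lies from the mean of the predictor. A minimal sketch with simulated data (the `toy` table and its coefficients are invented) compares the interval width at the low end, the mean, and the high end of the case rate range.

```r
# Invented data: 30 hypothetical states with a noisy linear relationship.
set.seed(1)
toy <- data.frame(cases_per_100k = seq(20000, 40000, length.out = 30))
toy$deaths_per_100k <- 50 + 0.009 * toy$cases_per_100k + rnorm(30, sd = 60)

fit <- lm(deaths_per_100k ~ cases_per_100k, data = toy)

# Confidence interval for the fitted mean at the minimum, mean, and maximum of x.
grid <- data.frame(cases_per_100k = c(min(toy$cases_per_100k),
                                      mean(toy$cases_per_100k),
                                      max(toy$cases_per_100k)))
ci <- predict(fit, newdata = grid, interval = "confidence")
ci_width <- ci[, "upr"] - ci[, "lwr"]

print(round(ci_width, 1))
```

The middle value is the smallest of the three, which is the same shape that `geom_smooth(method = "lm", se = TRUE)` draws.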

Identify Outlier States

outliers <- state_totals %>%
  arrange(desc(abs(residuals))) %>%
  select(Province_State, cases_per_100k, deaths_per_100k, 
         predicted_deaths, residuals) %>%
  head(8)

print(outliers)
## # A tibble: 8 × 5
##   Province_State cases_per_100k deaths_per_100k predicted_deaths residuals
##   <chr>                   <dbl>           <dbl>            <dbl>     <dbl>
## 1 Alaska                 41519.            201.             418.     -218.
## 2 Utah                   34010.            165.             352.     -186.
## 3 Hawaii                 26882.            130.             288.     -158.
## 4 Puerto Rico            29334.            155.             310.     -155.
## 5 Vermont                24458.            149.             267.     -118.
## 6 Oklahoma               32624.            454.             339.      115.
## 7 Arizona                33571.            455.             348.      107.
## 8 Mississippi            33290.            449.             345.      104.

States with large positive residuals experienced more deaths than predicted by their case rates. This could reflect older populations, overwhelmed healthcare systems, or being hit early before effective treatments were developed. Oklahoma, Arizona, and Mississippi show particularly high mortality relative to their case rates.

States with large negative residuals had fewer deaths than expected. These states may have younger populations, better healthcare infrastructure, higher vaccination rates, or more aggressive early interventions. Alaska and Utah show notably lower mortality than their case rates would predict.
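
Residuals are easier to compare when expressed relative to the predicted value. The sketch below copies four rows of rounded values from the printed outlier table; the `outlier_rows` and `pct_vs_predicted` names are introduced here for illustration only.

```r
library(dplyr)

# Rounded values copied from the printed outlier table above.
outlier_rows <- tibble(
  Province_State   = c("Alaska", "Utah", "Oklahoma", "Mississippi"),
  predicted_deaths = c(418, 352, 339, 345),
  residuals        = c(-218, -186, 115, 104)
)

# Residual as a percentage of the model's prediction.
outlier_rows <- outlier_rows %>%
  mutate(pct_vs_predicted = round(100 * residuals / predicted_deaths, 1))

print(outlier_rows)
```

Because the inputs are already rounded in the printed table, the percentages are approximate; applying the same `mutate()` to `outliers` would use the unrounded values.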

Additional Analysis: Florida vs New York Direct Comparison

Per Capita Statistics

fl_ny_summary <- state_totals %>%
  filter(Province_State %in% c("Florida", "New York")) %>%
  select(Province_State, Population, total_cases, total_deaths, 
         cases_per_100k, deaths_per_100k) %>%
  mutate(mortality_rate = (total_deaths / total_cases) * 100,
         mortality_rate = round(mortality_rate, 2))

print(fl_ny_summary)
## # A tibble: 2 × 7
##   Province_State Population total_cases total_deaths cases_per_100k
##   <chr>               <dbl>       <dbl>        <dbl>          <dbl>
## 1 Florida          21477737     7574590        86850         35267.
## 2 New York         19453561     6794738        77157         34928.
## # ℹ 2 more variables: deaths_per_100k <dbl>, mortality_rate <dbl>
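
The printed tibble truncates `deaths_per_100k` and `mortality_rate`. Both can be recomputed from the totals shown above, as in this sketch; the `fl_ny_totals` and `fl_ny_rates` names are introduced here for illustration.

```r
library(dplyr)

# Totals copied from the printed fl_ny_summary output above.
fl_ny_totals <- tibble(
  Province_State = c("Florida", "New York"),
  Population     = c(21477737, 19453561),
  total_cases    = c(7574590, 6794738),
  total_deaths   = c(86850, 77157)
)

# Recompute the two columns that the console output cut off.
fl_ny_rates <- fl_ny_totals %>%
  mutate(deaths_per_100k = total_deaths / Population * 100000,
         mortality_rate  = round(total_deaths / total_cases * 100, 2))

# width = Inf asks tibble to print every column instead of truncating.
print(fl_ny_rates, width = Inf)
```

Calling `print(fl_ny_summary, width = Inf)` in the report itself would display all seven columns directly.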

Visualization: Per Capita Comparison

fl_ny_long <- fl_ny_summary %>%
  select(Province_State, cases_per_100k, deaths_per_100k) %>%
  pivot_longer(cols = c(cases_per_100k, deaths_per_100k),
               names_to = "metric",
               values_to = "rate") %>%
  mutate(metric = ifelse(metric == "cases_per_100k", 
                         "Cases per 100k", 
                         "Deaths per 100k"))

ggplot(fl_ny_long, aes(x = Province_State, y = rate, fill = metric)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = scales::comma(round(rate, 0))), 
            position = position_dodge(width = 0.7), 
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Cases per 100k" = "#56B4E9", 
                                "Deaths per 100k" = "#E69F00")) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "COVID-19 Impact: Florida vs New York (Per Capita)",
       subtitle = "Population-adjusted comparison of pandemic outcomes",
       x = "",
       y = "Rate per 100,000 People",
       fill = "")

When adjusted for population size, the two states end the period with similar cumulative case rates: 35,267 cases per 100,000 people in Florida and 34,928 in New York. Death rates computed from the totals in the table above are also close, at roughly 404 deaths per 100,000 in Florida and 397 per 100,000 in New York.

The cumulative case fatality rates are nearly identical as well, about 1.15% in Florida and 1.14% in New York. The difference between the two states therefore lies in timing rather than in end-of-period totals: New York’s daily deaths peaked in spring 2020, when hospitals were overwhelmed and effective treatment protocols were not yet established, while Florida’s deaths accumulated across later waves. Cumulative rates alone cannot show how timing and healthcare system preparedness affected outcomes.

Conclusion

This analysis of Johns Hopkins COVID-19 data reveals three critical findings about the pandemic’s impact across the United States.

First, case rates are a statistically significant but weak predictor of death rates at the state level, with an R-squared of 0.19 (p = 0.0012). The positive association is consistent with the view that lower transmission means fewer deaths, but about 81% of the variation in death rates remains unexplained by cases alone, which suggests that healthcare capacity, demographics, and policy interventions substantially influence outcomes.

Second, Florida and New York experienced very different pandemic trajectories yet ended with similar cumulative per capita case rates, death rates, and case fatality rates (about 1.1% in both states). New York’s deaths peaked sharply during its early outbreak, while Florida’s were spread across multiple later waves. Cumulative totals mask this difference in timing relative to medical advancement and healthcare system capacity.

Third, substantial variation exists across states even after accounting for case rates. Mississippi’s death rate is about 30% above the model’s prediction, while Utah’s is about 53% below it. These outliers highlight the complex interplay of age demographics, comorbidities, healthcare access, and public health responses in determining pandemic outcomes.

The data suggest that while infection control matters, factors this model does not measure, such as healthcare system preparedness and demographics, may be at least as important in preventing deaths. Future pandemic responses should account for these factors beyond just reducing transmission.

Sources of Bias

Data Collection Bias

Testing availability varied dramatically across states and over time. Early in the pandemic, limited testing capacity meant many cases went undetected, artificially lowering recorded case counts while death counts remained more consistent. States with robust testing programs may appear to have higher case rates simply due to better detection, not actually higher infection prevalence. This asymmetry in case detection versus death detection can distort the true case-fatality relationship.

Reporting Standards Bias

States use inconsistent criteria for attributing deaths to COVID-19. Some states count any death where COVID-19 was present, while others only count deaths directly caused by COVID-19. Similarly, the timing of death reporting varies, with some states reporting daily and others batching reports weekly. These inconsistencies make state-to-state comparisons inherently imperfect and may create artificial outliers in the regression analysis.

Temporal Bias

This analysis uses cumulative totals without adequately accounting for when states were affected. Treatment protocols improved dramatically throughout the pandemic. States hit early, like New York, faced much higher mortality rates before treatments such as dexamethasone and remdesivir came into use. Later-affected states benefited from established protocols, which makes their outcomes hard to compare with those of early states. The linear model cannot capture this time-dependent improvement in care quality.
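
One way to expose this timing effect would be to split cumulative totals into periods and compute a fatality ratio for each. The sketch below uses two hypothetical states with invented year-end totals; they finish with identical cumulative cases and deaths but differ sharply by year. The `toy_cum` table and the yearly cut points are assumptions, not part of this report's analysis.

```r
library(dplyr)
library(lubridate)

# Invented cumulative counts for two hypothetical states at year-end dates.
toy_cum <- tibble(
  Province_State = rep(c("Early State", "Late State"), each = 3),
  date   = rep(ymd(c("2020-12-31", "2021-12-31", "2022-12-31")), times = 2),
  cases  = c(100000, 250000, 600000, 40000, 300000, 600000),
  deaths = c(4000, 6500, 8000, 600, 4500, 8000)
)

# Convert cumulative totals to within-year counts, then compute a yearly ratio.
by_year <- toy_cum %>%
  group_by(Province_State) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(year_cases  = cases - lag(cases, default = 0),
         year_deaths = deaths - lag(deaths, default = 0),
         cfr_pct     = round(100 * year_deaths / year_cases, 2),
         yr          = year(date)) %>%
  ungroup() %>%
  select(Province_State, yr, year_cases, year_deaths, cfr_pct)

print(by_year)
```

Both invented states have the same cumulative ratio (8,000 deaths per 600,000 cases), yet the early state's first-year ratio is far higher, which is the pattern a cumulative-only model cannot see.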

Demographic Confounding

States differ substantially in age distributions, obesity rates, diabetes prevalence, and other comorbidities that strongly influence COVID-19 mortality. Florida has one of the oldest populations in the US, while Utah has the youngest. These demographic differences are entirely unaccounted for in the simple linear model, yet they may explain more variance in death rates than case rates themselves. The model attributes differences to case rates that may actually stem from underlying population health.
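
A sketch of how a demographic covariate could be added to the model follows. The data are simulated, and `median_age` is a hypothetical column that does not exist in the Johns Hopkins files; it would have to be joined from an outside source. The coefficients used to generate the data are invented, so the output says nothing about real states.

```r
# Simulated data for 50 hypothetical states; median_age and its effect are invented.
set.seed(42)
n <- 50
sim <- data.frame(
  cases_per_100k = runif(n, 25000, 40000),
  median_age     = runif(n, 31, 45)
)
sim$deaths_per_100k <- 0.006 * sim$cases_per_100k +
  12 * sim$median_age - 350 + rnorm(n, sd = 30)

# Compare the cases-only model with one that also includes the covariate.
fit_cases <- lm(deaths_per_100k ~ cases_per_100k, data = sim)
fit_both  <- lm(deaths_per_100k ~ cases_per_100k + median_age, data = sim)

r2 <- c(cases_only = summary(fit_cases)$r.squared,
        with_age   = summary(fit_both)$r.squared)
print(round(r2, 3))
```

Adding a predictor never lowers R-squared in a nested linear model, so a comparison on real data would be better judged with adjusted R-squared or an F-test via `anova(fit_cases, fit_both)`.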

Political and Social Factors

COVID-19 became intensely politicized, affecting both public health behavior and potentially data reporting. States with different political leanings implemented vastly different policies on masking, distancing, and vaccination. These policy differences are confounded with many other state characteristics, making it impossible to isolate their effects. Additionally, political pressure may have influenced how deaths were classified or reported in some jurisdictions.

Personal Selection Bias

My choice to compare Florida and New York specifically reflects awareness of their political differences and high-profile governors during the pandemic. This selection was not random but chosen because these states represent opposing policy approaches. This introduces confirmation bias risk, as I may unconsciously interpret results to support narratives about policy effectiveness that I encountered in media coverage. A truly objective analysis would use random state selection or analyze all states equally.

Mitigation Efforts

I attempted to mitigate these biases by focusing on statistical relationships rather than causal claims about policy effectiveness. I presented per capita rates to account for population size differences and used the entire US dataset rather than cherry-picking states. I acknowledged model limitations and avoided attributing causation to correlation. However, these biases remain inherent limitations that cannot be fully eliminated with this dataset alone.

Resources

Johns Hopkins University CSSE COVID-19 Data Repository: https://github.com/CSSEGISandData/COVID-19

R for Data Science by Hadley Wickham: https://r4ds.had.co.nz/

Tidyverse Documentation: https://www.tidyverse.org/

COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series