library(tidyverse)
library(dotenv)
library(RSocrata)
library(scales)
library(usmap)
library(lubridate)

Overivew

Center for Disease Control and Prevention (CDC) reports aggregate counts of COVID-19 death and case number daily. Data is based on most recent numbers reported by states, territories and other jurisdictions. This depends on timely and accurate reporting, therefore some values are missing.

There are currently 60 public health jurisdictions reporting COVID-19 cases, including the 50 states, District of Columbia, NYC, American Samoa, Guam, the Commonwealth of the Northern Mariana Islands, Puerto Rico and the US Virgin Islands. Additionally three independent countries in compacts of free association with the United States, Federated States of Micronesia, Republic of the Marshall Islands, and Republic of Palu.

# load up hidden api key
load_dot_env()

# import dataset to R
df <- read.socrata("https://data.cdc.gov/resource/9mfq-cb36.json",
             app_token = Sys.getenv("SOCRATA_API"))

# save df to data folder
write_csv(df, paste("./data/us_covid_cases_and_deaths_",Sys.Date(),".csv", sep = ""))
# load data saved
df <- read_csv(paste("./data/us_covid_cases_and_deaths_",Sys.Date(),".csv", sep=""))
head(df)
## # A tibble: 6 × 15
##   submission_date     state tot_cases conf_cases prob_cases new_case pnew_case
##   <dttm>              <chr>     <dbl>      <dbl>      <dbl>    <dbl>     <dbl>
## 1 2021-12-01 05:00:00 ND       163565     135705      27860      589       220
## 2 2020-08-17 04:00:00 MD       100715         NA         NA      503         0
## 3 2021-07-20 04:00:00 MD       464491         NA         NA      155         0
## 4 2020-05-13 04:00:00 VT          855         NA         NA        2         0
## 5 2021-01-21 05:00:00 NC       706315     632991      73324     7113      1040
## 6 2021-03-11 05:00:00 WA       346788         NA         NA      674        89
## # … with 8 more variables: tot_death <dbl>, new_death <dbl>, pnew_death <dbl>,
## #   created_at <dttm>, consent_cases <chr>, consent_deaths <chr>,
## #   conf_death <dbl>, prob_death <dbl>

Data dictionary

There are 15 columns in this data set, you can see the titles, description and type in the data dictionary below:

Column Name Description Type
submission_date Date of counts Date & Time
state Jurisdiction Plain Text
tot_cases Total number of cases Number
conf_cases Total confirmed cases Number
prob_cases Total probable cases Number
new_case Number of new cases Number
pnew_case Number of new probable cases Number
tot_death Total number of deaths Number
conf_death Total number of confirmed deaths Number
prob_death Total number of probable deaths Number
new_death Number of new deaths Number
pnew_death Number of new probable deaths Number
created_at Date and time record was created Date & Time
consent_cases If Agree, then confirmed and probable cases are included.
Not Agree, then only total cases are included.
Plain Text
consent_deaths If Agree, then confirmed and probable deaths are included.
If Not Agree, then only total deaths are included.
Plain Text

Data Questions

Questions of interest are:

1. Find the state with the highest and lowest deaths

2. Compare the death rate both before and after the vaccine was released

Question 1

State with highest and lowest deaths

Let’s first look at this from a pure count perspective

df_tot_deaths <- df %>% 
  select(submission_date, state, tot_cases, tot_death) %>% 
  group_by(state) %>% 
  filter(submission_date == max(submission_date)) %>% 
  ungroup()

# preview
head(df_tot_deaths)
## # A tibble: 6 × 4
##   submission_date     state tot_cases tot_death
##   <dttm>              <chr>     <dbl>     <dbl>
## 1 2022-03-11 05:00:00 NH       300185      2415
## 2 2022-03-11 05:00:00 NV       688054      9922
## 3 2022-03-11 05:00:00 CT       729697     10648
## 4 2022-03-11 05:00:00 AL      1289351     18890
## 5 2022-03-11 05:00:00 MO      1403268     19355
## 6 2022-03-11 05:00:00 VT       105475       578

Since NYC and NYS are reported separately, let’s isolate these values. May need to add them together later to get a total and create a new row.

df_tot_deaths[df_tot_deaths$state == 'NY' | df_tot_deaths$state == 'NYC',]
## # A tibble: 2 × 4
##   submission_date     state tot_cases tot_death
##   <dttm>              <chr>     <dbl>     <dbl>
## 1 2022-03-11 05:00:00 NYC     2285511     39927
## 2 2022-03-11 05:00:00 NY      2642783     27102

All States

df_tot_deaths %>%
  arrange(desc(tot_death)) %>% 
  ggplot(aes(x = reorder(state, tot_death), y = tot_death)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x = "States", y = "Total Deaths", subtitle = paste("COVID-19 Total Death Count by state as of ", Sys.Date(), sep = ""))

California has the highest count of COVID Cases. Let’s zoom in and look at the top 10.

Top 10

df_tot_deaths %>% 
  slice_max(tot_death, n = 10) %>% 
  ggplot(aes(reorder(state, tot_death), tot_death, fill = tot_death)) +
    geom_bar(stat = 'identity') +
    scale_fill_gradient2(high = muted("blue")) +
    coord_flip() +
    theme(legend.position = 'none') +
    labs(x = "States", y = "Total Deaths", subtitle = paste("Top 10 COVID-19 Total Death Count states as of ", Sys.Date(), sep = ""))

Here we can see the top 10:

  • California
  • Texas
  • Florida
  • Pennsylvania
  • NYC
  • Ohio
  • Illinois
  • Georgia
  • Michigan
  • New Jersey

Now let’s look at the bottom 10.

Bottom 10

df_tot_deaths %>% 
  slice_min(tot_death, n = 10) %>% 
    ggplot(aes(reorder(state, -tot_death), tot_death, fill = tot_death)) +
    geom_bar(stat = 'identity') +
    scale_fill_gradient2(high = muted("blue")) +
    coord_flip() +
    labs(x = "States", y = "Total Deaths", subtitle = paste("Bottom 10 COVID-19 Total Death Count states as of ", Sys.Date(), sep = "")) +
    guides(fill = guide_legend(title = "Total Deaths"))

Death rate by state

Now lets create a death_rate column and look at top_n that way.

df_tot_deaths <- df_tot_deaths %>% 
  mutate(death_rate = tot_death / tot_cases)

Let’s view top_n by rate.

Top 10 death rate

df_tot_deaths %>% 
  slice_max(death_rate, n = 10) %>% 
  ggplot(aes(reorder(state, death_rate), death_rate, fill = death_rate)) +
    geom_bar(stat = 'identity') +
    scale_fill_gradient2(high = muted("blue"), mid = "red") +
    coord_flip() +
    theme(legend.position = 'none') +
    labs(x = "States", y = "Death Rate", subtitle = paste("Top 10 COVID-19 Total Death Rate states as of ", Sys.Date(), sep = ""))

When we compare via rate, we can see California isn’t even in the top 10. NYC takes the cake by almost a full 2%.

Bottom 10 death rate

df_tot_deaths %>% 
  slice_min(death_rate, n = 10) %>% 
    ggplot(aes(reorder(state, -death_rate), death_rate, fill = death_rate)) +
    geom_bar(stat = 'identity') +
    scale_fill_gradient2(high = muted("blue"), low = "white") +
    coord_flip() +
    labs(x = "States", y = "Death Rate", subtitle = paste("Bottom 10 COVID-19 Total Death Rate states as of ", Sys.Date(), sep = "")) +
    guides(fill = guide_legend(title = "Death Rate"))

Simple US map

Now let’s create a simple US map to visualize count and rate by state. Built in package usmaps will only have the 52 states but is fine for our purposes.

Have to sum NY and NYC and create new row.

# sum total cases
NYS_tot_cases <- sum(df_tot_deaths$tot_cases[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])

# sum total deaths
NYS_tot_deaths <- sum(df_tot_deaths$tot_death[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])

# sum death rate
NYS_death_rate <- sum(df_tot_deaths$death_rate[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])/2

#Latest date
dateLatest = as.POSIXct(max(df_tot_deaths$submission_date), origin = "1970-01-01", tz = "UTC")

# add new york summed row
df_tot_deaths[nrow(df_tot_deaths) + 1,] <- list(dateLatest, "NYS", NYS_tot_cases, NYS_tot_deaths, NYS_death_rate)

# remove separate ny and nyc
df_tot_deaths_ny <- subset(df_tot_deaths, state != "NYC" & state !="NY")

# rename NYS to NY
df_tot_deaths_ny <-  within(df_tot_deaths_ny, state[state == "NYS"] <- "NY")

Join df_tot_deaths_ny with statepop

df_tot_deaths_ny <-  left_join(statepop, df_tot_deaths_ny, by = c("abbr" = "state"))

Plot simple US map.

plot_usmap(data = df_tot_deaths_ny, values = "tot_death", color = "blue") +
  scale_fill_continuous(low="white", high="red", name = "Total Deaths",
                        label = scales::comma) +
  labs(subtitle = paste("COVID-19 Total Death Count by state as of ", Sys.Date(), sep = "")) +
  theme(legend.position = c(0.9,0.1))

Clearly California, Texas, Florida and New York have the most cases.

Let’s look at this simple map via rate.

plot_usmap(data = df_tot_deaths_ny, values = "death_rate", color = "blue") +
  scale_fill_continuous(low="white", high="red", name = "Total Death Rate",
                        label = scales::comma) +
  labs(subtitle = paste("COVID-19 Total Death Rate by state as of ", Sys.Date(), sep = "")) +
  theme(legend.position = c(0.9,0.1))

When we combine NY with NYC, we get a 0.013 death rate, therefore it does not have the highest rate. Here we can see Pennsylvania, Georga and Michigan have the darker red color and therefore higher rate. However you can visually see the difference between the count and rate US maps. Clearly they paint a different picture.

Question 2

Compare death rate before and after vaccine being released

According to FDA.gov, the First COVID-19 Vaccine was authorized on Dec 11th, 2020. This was for emergency use only, however for simplicity sake we will use this as a place marker for this analysis.

Additionally, we will use what the World Health Organization defines as Infection fatality ratio (IFR) as our death rate.

This is calculated by:

\[\text{Infection fatality ratio (IFR, in %) = }\frac{\text {Number of deaths from disease}}{\text{Number of infected individuals}} \text{x 100}\]

Let’s create two data frames, before and after the vaccination was released.

Before Vax data frame

df_before_vax <- df %>% 
  select(submission_date, state, tot_death, tot_cases) %>% 
  filter(submission_date <= '2020-12-11') %>%
  group_by(state) %>% 
  filter(submission_date == max(submission_date)) %>% 
  mutate(death_rate_before = tot_death / tot_cases) %>% 
  arrange(desc(death_rate_before))

# check for NaN
df_before_vax[is.na(df_before_vax$death_rate_before),]
## # A tibble: 2 × 5
## # Groups:   state [2]
##   submission_date     state tot_death tot_cases death_rate_before
##   <dttm>              <chr>     <dbl>     <dbl>             <dbl>
## 1 2020-12-11 05:00:00 PW            0         0               NaN
## 2 2020-12-11 05:00:00 FSM           0         0               NaN
# preview df
head(df_before_vax)
## # A tibble: 6 × 5
## # Groups:   state [6]
##   submission_date     state tot_death tot_cases death_rate_before
##   <dttm>              <chr>     <dbl>     <dbl>             <dbl>
## 1 2020-12-11 05:00:00 NYC       24280    361074            0.0672
## 2 2020-12-11 05:00:00 NJ        17662    428581            0.0412
## 3 2020-12-11 05:00:00 MA        11386    280436            0.0406
## 4 2020-12-11 05:00:00 CT         5363    146761            0.0365
## 5 2020-12-11 05:00:00 DC          709     24357            0.0291
## 6 2020-12-11 05:00:00 MS         4833    180770            0.0267

Missing values are simply divide by zero error, no need to account for them using na.rm = T will work.

After Vax data frame

df_after_vax <- df %>% 
  select(submission_date, state, tot_death, tot_cases) %>% 
  filter(submission_date > '2020-12-11') %>%
  group_by(state) %>% 
  filter(submission_date == max(submission_date)) %>% 
  mutate(death_rate_after = tot_death / tot_cases) %>% 
  arrange(desc(death_rate_after))

# check for NaN
df_after_vax[is.na(df_after_vax$death_rate_after),]
## # A tibble: 0 × 5
## # Groups:   state [0]
## # … with 5 variables: submission_date <dttm>, state <chr>, tot_death <dbl>,
## #   tot_cases <dbl>, death_rate_after <dbl>
# preview df
head(df_after_vax)
## # A tibble: 6 × 5
## # Groups:   state [6]
##   submission_date     state tot_death tot_cases death_rate_after
##   <dttm>              <chr>     <dbl>     <dbl>            <dbl>
## 1 2022-03-11 05:00:00 NYC       39927   2285511           0.0175
## 2 2022-03-11 05:00:00 PA        43808   2768240           0.0158
## 3 2022-03-11 05:00:00 MS        12255    792560           0.0155
## 4 2022-03-11 05:00:00 NJ        33087   2180216           0.0152
## 5 2022-03-11 05:00:00 MI        35188   2371788           0.0148
## 6 2022-03-11 05:00:00 AL        18890   1289351           0.0147

No missing values in df_after_vax

Let’s create an overall data frame to get a complete picutre of how the release of the vaccine effected death rates.

Overall data frame

df_death_rate <- df_before_vax %>% 
  left_join(df_after_vax, by = "state") %>% 
  select(state, death_rate_before, death_rate_after)

head(df_death_rate)
## # A tibble: 6 × 3
## # Groups:   state [6]
##   state death_rate_before death_rate_after
##   <chr>             <dbl>            <dbl>
## 1 NYC              0.0672          0.0175 
## 2 NJ               0.0412          0.0152 
## 3 MA               0.0406          0.0141 
## 4 CT               0.0365          0.0146 
## 5 DC               0.0291          0.00981
## 6 MS               0.0267          0.0155

What is the total death rate before and after vaccine?

# death rate before
before_vax <- sum(df_death_rate$death_rate_before, na.rm = T)/nrow(df_death_rate)

# death rate after
after_vax <- sum(df_death_rate$death_rate_after, na.rm = T)/nrow(df_death_rate)

# create new df
df_death_rate_sum <- data.frame(vax_status = c("Before Vax", "After Vax"), 
                                death_rate = c(before_vax, after_vax))

df_death_rate_sum
##   vax_status death_rate
## 1 Before Vax 0.01704147
## 2  After Vax 0.01049090

Let’s visualize this table.

ggplot(df_death_rate_sum, aes(x=death_rate, y= vax_status, fill = vax_status)) +
  geom_bar(stat = 'identity') +
  labs(x = "Death Rate", y = "", subtitle = paste("Overall COVID-19 US Death Rate as of ", Sys.Date(),sep = "")) +
  theme(legend.position = "none")

Conclusion

Question 1

Which states had the most and least deaths?

  • While California had the most deaths, the death rate (total cases / total deaths) was highest in NYC
  • Least deaths in the continental United States was Vermont with ~600, they also were in the bottom 10 for lowest death rate. Utah was first in lowest death rate with ~0.0045.

Question 2

What was the death rate before and after the vaccine was released?

  • If we use December 11th 2020 as the after vaccination date, the aggregate death rate for the entire data set is 0.016 before the vaccine was released, and 0.010 after. So we see a pretty significant decrease in total death rate after the vaccine was released.