library(tidyverse)
library(dotenv)
library(RSocrata)
library(scales)
library(usmap)
library(lubridate)
Center for Disease Control and Prevention (CDC) reports aggregate counts of COVID-19 death and case number daily. Data is based on most recent numbers reported by states, territories and other jurisdictions. This depends on timely and accurate reporting, therefore some values are missing.
There are currently 60 public health jurisdictions reporting COVID-19 cases, including the 50 states, District of Columbia, NYC, American Samoa, Guam, the Commonwealth of the Northern Mariana Islands, Puerto Rico and the US Virgin Islands. Additionally three independent countries in compacts of free association with the United States, Federated States of Micronesia, Republic of the Marshall Islands, and Republic of Palu.
# load up hidden api key
load_dot_env()
# import dataset to R
df <- read.socrata("https://data.cdc.gov/resource/9mfq-cb36.json",
app_token = Sys.getenv("SOCRATA_API"))
# save df to data folder
write_csv(df, paste("./data/us_covid_cases_and_deaths_",Sys.Date(),".csv", sep = ""))
# load data saved
df <- read_csv(paste("./data/us_covid_cases_and_deaths_",Sys.Date(),".csv", sep=""))
head(df)
## # A tibble: 6 × 15
## submission_date state tot_cases conf_cases prob_cases new_case pnew_case
## <dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-12-01 05:00:00 ND 163565 135705 27860 589 220
## 2 2020-08-17 04:00:00 MD 100715 NA NA 503 0
## 3 2021-07-20 04:00:00 MD 464491 NA NA 155 0
## 4 2020-05-13 04:00:00 VT 855 NA NA 2 0
## 5 2021-01-21 05:00:00 NC 706315 632991 73324 7113 1040
## 6 2021-03-11 05:00:00 WA 346788 NA NA 674 89
## # … with 8 more variables: tot_death <dbl>, new_death <dbl>, pnew_death <dbl>,
## # created_at <dttm>, consent_cases <chr>, consent_deaths <chr>,
## # conf_death <dbl>, prob_death <dbl>
There are 15 columns in this data set, you can see the titles, description and type in the data dictionary below:
Column Name | Description | Type |
---|---|---|
submission_date | Date of counts | Date & Time |
state | Jurisdiction | Plain Text |
tot_cases | Total number of cases | Number |
conf_cases | Total confirmed cases | Number |
prob_cases | Total probable cases | Number |
new_case | Number of new cases | Number |
pnew_case | Number of new probable cases | Number |
tot_death | Total number of deaths | Number |
conf_death | Total number of confirmed deaths | Number |
prob_death | Total number of probable deaths | Number |
new_death | Number of new deaths | Number |
pnew_death | Number of new probable deaths | Number |
created_at | Date and time record was created | Date & Time |
consent_cases | If Agree, then confirmed and probable cases are included. Not Agree, then only total cases are included. |
Plain Text |
consent_deaths | If Agree, then confirmed and probable deaths are included. If Not Agree, then only total deaths are included. |
Plain Text |
Questions of interest are:
1. Find the state with the highest and lowest deaths
2. Compare the death rate both before and after the vaccine was released
Let’s first look at this from a pure count perspective
df_tot_deaths <- df %>%
select(submission_date, state, tot_cases, tot_death) %>%
group_by(state) %>%
filter(submission_date == max(submission_date)) %>%
ungroup()
# preview
head(df_tot_deaths)
## # A tibble: 6 × 4
## submission_date state tot_cases tot_death
## <dttm> <chr> <dbl> <dbl>
## 1 2022-03-11 05:00:00 NH 300185 2415
## 2 2022-03-11 05:00:00 NV 688054 9922
## 3 2022-03-11 05:00:00 CT 729697 10648
## 4 2022-03-11 05:00:00 AL 1289351 18890
## 5 2022-03-11 05:00:00 MO 1403268 19355
## 6 2022-03-11 05:00:00 VT 105475 578
Since NYC
and NYS
are reported separately, let’s isolate these values. May need to add them together later to get a total and create a new row.
df_tot_deaths[df_tot_deaths$state == 'NY' | df_tot_deaths$state == 'NYC',]
## # A tibble: 2 × 4
## submission_date state tot_cases tot_death
## <dttm> <chr> <dbl> <dbl>
## 1 2022-03-11 05:00:00 NYC 2285511 39927
## 2 2022-03-11 05:00:00 NY 2642783 27102
df_tot_deaths %>%
arrange(desc(tot_death)) %>%
ggplot(aes(x = reorder(state, tot_death), y = tot_death)) +
geom_bar(stat = 'identity') +
coord_flip() +
theme(legend.position = "none") +
labs(x = "States", y = "Total Deaths", subtitle = paste("COVID-19 Total Death Count by state as of ", Sys.Date(), sep = ""))
California
has the highest count of COVID Cases. Let’s zoom in and look at the top 10.
df_tot_deaths %>%
slice_max(tot_death, n = 10) %>%
ggplot(aes(reorder(state, tot_death), tot_death, fill = tot_death)) +
geom_bar(stat = 'identity') +
scale_fill_gradient2(high = muted("blue")) +
coord_flip() +
theme(legend.position = 'none') +
labs(x = "States", y = "Total Deaths", subtitle = paste("Top 10 COVID-19 Total Death Count states as of ", Sys.Date(), sep = ""))
Here we can see the top 10:
Now let’s look at the bottom 10.
df_tot_deaths %>%
slice_min(tot_death, n = 10) %>%
ggplot(aes(reorder(state, -tot_death), tot_death, fill = tot_death)) +
geom_bar(stat = 'identity') +
scale_fill_gradient2(high = muted("blue")) +
coord_flip() +
labs(x = "States", y = "Total Deaths", subtitle = paste("Bottom 10 COVID-19 Total Death Count states as of ", Sys.Date(), sep = "")) +
guides(fill = guide_legend(title = "Total Deaths"))
Now lets create a death_rate
column and look at top_n that way.
df_tot_deaths <- df_tot_deaths %>%
mutate(death_rate = tot_death / tot_cases)
Let’s view top_n by rate.
df_tot_deaths %>%
slice_max(death_rate, n = 10) %>%
ggplot(aes(reorder(state, death_rate), death_rate, fill = death_rate)) +
geom_bar(stat = 'identity') +
scale_fill_gradient2(high = muted("blue"), mid = "red") +
coord_flip() +
theme(legend.position = 'none') +
labs(x = "States", y = "Death Rate", subtitle = paste("Top 10 COVID-19 Total Death Rate states as of ", Sys.Date(), sep = ""))
When we compare via rate, we can see California isn’t even in the top 10. NYC takes the cake by almost a full 2%.
df_tot_deaths %>%
slice_min(death_rate, n = 10) %>%
ggplot(aes(reorder(state, -death_rate), death_rate, fill = death_rate)) +
geom_bar(stat = 'identity') +
scale_fill_gradient2(high = muted("blue"), low = "white") +
coord_flip() +
labs(x = "States", y = "Death Rate", subtitle = paste("Bottom 10 COVID-19 Total Death Rate states as of ", Sys.Date(), sep = "")) +
guides(fill = guide_legend(title = "Death Rate"))
Now let’s create a simple US map to visualize count and rate by state. Built in package usmaps
will only have the 52 states but is fine for our purposes.
Have to sum NY
and NYC
and create new row.
# sum total cases
NYS_tot_cases <- sum(df_tot_deaths$tot_cases[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])
# sum total deaths
NYS_tot_deaths <- sum(df_tot_deaths$tot_death[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])
# sum death rate
NYS_death_rate <- sum(df_tot_deaths$death_rate[df_tot_deaths$state == "NY" | df_tot_deaths$state == "NYC"])/2
#Latest date
dateLatest = as.POSIXct(max(df_tot_deaths$submission_date), origin = "1970-01-01", tz = "UTC")
# add new york summed row
df_tot_deaths[nrow(df_tot_deaths) + 1,] <- list(dateLatest, "NYS", NYS_tot_cases, NYS_tot_deaths, NYS_death_rate)
# remove separate ny and nyc
df_tot_deaths_ny <- subset(df_tot_deaths, state != "NYC" & state !="NY")
# rename NYS to NY
df_tot_deaths_ny <- within(df_tot_deaths_ny, state[state == "NYS"] <- "NY")
Join df_tot_deaths_ny
with statepop
df_tot_deaths_ny <- left_join(statepop, df_tot_deaths_ny, by = c("abbr" = "state"))
Plot simple US map.
plot_usmap(data = df_tot_deaths_ny, values = "tot_death", color = "blue") +
scale_fill_continuous(low="white", high="red", name = "Total Deaths",
label = scales::comma) +
labs(subtitle = paste("COVID-19 Total Death Count by state as of ", Sys.Date(), sep = "")) +
theme(legend.position = c(0.9,0.1))
Clearly California, Texas, Florida and New York have the most cases.
Let’s look at this simple map via rate.
plot_usmap(data = df_tot_deaths_ny, values = "death_rate", color = "blue") +
scale_fill_continuous(low="white", high="red", name = "Total Death Rate",
label = scales::comma) +
labs(subtitle = paste("COVID-19 Total Death Rate by state as of ", Sys.Date(), sep = "")) +
theme(legend.position = c(0.9,0.1))
When we combine NY
with NYC
, we get a 0.013 death rate, therefore it does not have the highest rate. Here we can see Pennsylvania, Georga and Michigan have the darker red color and therefore higher rate. However you can visually see the difference between the count and rate US maps. Clearly they paint a different picture.
According to FDA.gov, the First COVID-19 Vaccine was authorized on Dec 11th, 2020. This was for emergency use only, however for simplicity sake we will use this as a place marker for this analysis.
Additionally, we will use what the World Health Organization defines as Infection fatality ratio (IFR) as our death rate.
This is calculated by:
\[\text{Infection fatality ratio (IFR, in %) = }\frac{\text {Number of deaths from disease}}{\text{Number of infected individuals}} \text{x 100}\]
Let’s create two data frames, before and after the vaccination was released.
df_before_vax <- df %>%
select(submission_date, state, tot_death, tot_cases) %>%
filter(submission_date <= '2020-12-11') %>%
group_by(state) %>%
filter(submission_date == max(submission_date)) %>%
mutate(death_rate_before = tot_death / tot_cases) %>%
arrange(desc(death_rate_before))
# check for NaN
df_before_vax[is.na(df_before_vax$death_rate_before),]
## # A tibble: 2 × 5
## # Groups: state [2]
## submission_date state tot_death tot_cases death_rate_before
## <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2020-12-11 05:00:00 PW 0 0 NaN
## 2 2020-12-11 05:00:00 FSM 0 0 NaN
# preview df
head(df_before_vax)
## # A tibble: 6 × 5
## # Groups: state [6]
## submission_date state tot_death tot_cases death_rate_before
## <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2020-12-11 05:00:00 NYC 24280 361074 0.0672
## 2 2020-12-11 05:00:00 NJ 17662 428581 0.0412
## 3 2020-12-11 05:00:00 MA 11386 280436 0.0406
## 4 2020-12-11 05:00:00 CT 5363 146761 0.0365
## 5 2020-12-11 05:00:00 DC 709 24357 0.0291
## 6 2020-12-11 05:00:00 MS 4833 180770 0.0267
Missing values are simply divide by zero error, no need to account for them using na.rm = T will work.
df_after_vax <- df %>%
select(submission_date, state, tot_death, tot_cases) %>%
filter(submission_date > '2020-12-11') %>%
group_by(state) %>%
filter(submission_date == max(submission_date)) %>%
mutate(death_rate_after = tot_death / tot_cases) %>%
arrange(desc(death_rate_after))
# check for NaN
df_after_vax[is.na(df_after_vax$death_rate_after),]
## # A tibble: 0 × 5
## # Groups: state [0]
## # … with 5 variables: submission_date <dttm>, state <chr>, tot_death <dbl>,
## # tot_cases <dbl>, death_rate_after <dbl>
# preview df
head(df_after_vax)
## # A tibble: 6 × 5
## # Groups: state [6]
## submission_date state tot_death tot_cases death_rate_after
## <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2022-03-11 05:00:00 NYC 39927 2285511 0.0175
## 2 2022-03-11 05:00:00 PA 43808 2768240 0.0158
## 3 2022-03-11 05:00:00 MS 12255 792560 0.0155
## 4 2022-03-11 05:00:00 NJ 33087 2180216 0.0152
## 5 2022-03-11 05:00:00 MI 35188 2371788 0.0148
## 6 2022-03-11 05:00:00 AL 18890 1289351 0.0147
No missing values in df_after_vax
Let’s create an overall data frame to get a complete picutre of how the release of the vaccine effected death rates.
df_death_rate <- df_before_vax %>%
left_join(df_after_vax, by = "state") %>%
select(state, death_rate_before, death_rate_after)
head(df_death_rate)
## # A tibble: 6 × 3
## # Groups: state [6]
## state death_rate_before death_rate_after
## <chr> <dbl> <dbl>
## 1 NYC 0.0672 0.0175
## 2 NJ 0.0412 0.0152
## 3 MA 0.0406 0.0141
## 4 CT 0.0365 0.0146
## 5 DC 0.0291 0.00981
## 6 MS 0.0267 0.0155
What is the total death rate before and after vaccine?
# death rate before
before_vax <- sum(df_death_rate$death_rate_before, na.rm = T)/nrow(df_death_rate)
# death rate after
after_vax <- sum(df_death_rate$death_rate_after, na.rm = T)/nrow(df_death_rate)
# create new df
df_death_rate_sum <- data.frame(vax_status = c("Before Vax", "After Vax"),
death_rate = c(before_vax, after_vax))
df_death_rate_sum
## vax_status death_rate
## 1 Before Vax 0.01704147
## 2 After Vax 0.01049090
Let’s visualize this table.
ggplot(df_death_rate_sum, aes(x=death_rate, y= vax_status, fill = vax_status)) +
geom_bar(stat = 'identity') +
labs(x = "Death Rate", y = "", subtitle = paste("Overall COVID-19 US Death Rate as of ", Sys.Date(),sep = "")) +
theme(legend.position = "none")
Which states had the most and least deaths?
What was the death rate before and after the vaccine was released?