The Ins and Outs of the Pandemic

Introduction

The Covid-19 pandemic hit the United States in the beginning of 2020, but the virus was still very unknown. As time went on, the CDC was able to collect more data that gave insight to the ins and outs of the virus. This analysis will look at a large amount of reported cases and the information along with each case to identify trends for Covid-19 cases. The visualizations will look further into how Covid-19 impacts certain demographics, how the virus reacts to warmer and colder months, and how the virus reacts to lockdown measures. You will see that this analysis is relatively consistent with the possible news that has been relayed about Covid-19 and its severity.

Dataset

This data is provided from https://www.cdc.gov/. I pulled this dataset form kaggle.com. The data in this public dataset includes United States Covid-19 demographic information, exposure history, disease severity indicators and outcomes, clinical data, laboratory test results, and whether individuals have prior medical conditions. The data in each row relates to one specific CDC reported case. Each row includes the date reported which is followed by the characteristics relating to that specific case. The dataset I visualized is comprised of 1,048,575 reported cases in the United States from January 2020 to November 2020. While this isn’t the exact number of cases through this time period, the amount of data is sufficient enough to draw conclusions and visualize the impact of Covid-19 in the United States through this time period.

## Classes 'data.table' and 'data.frame':   1048575 obs. of  11 variables:
##  $ cdc_report_dt                : chr  "11/10/20" "11/14/20" "11/19/20" "11/14/20" ...
##  $ pos_spec_dt                  : chr  "11/10/20" "11/10/20" "11/10/20" "11/10/20" ...
##  $ onset_dt                     : chr  "" "11/10/20" "11/9/20" "" ...
##  $ current_status               : chr  "Laboratory-confirmed case" "Laboratory-confirmed case" "Laboratory-confirmed case" "Laboratory-confirmed case" ...
##  $ sex                          : chr  "Male" "Male" "Male" "Male" ...
##  $ age_group                    : chr  "10 - 19 Years" "10 - 19 Years" "10 - 19 Years" "10 - 19 Years" ...
##  $ Race and ethnicity (combined): chr  "Black, Non-Hispanic" "Black, Non-Hispanic" "Black, Non-Hispanic" "Black, Non-Hispanic" ...
##  $ hosp_yn                      : chr  "No" "No" "No" "Missing" ...
##  $ icu_yn                       : chr  "Unknown" "No" "No" "Missing" ...
##  $ death_yn                     : chr  "No" "No" "No" "No" ...
##  $ medcond_yn                   : chr  "No" "No" "No" "Missing" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Findings

Cases by Month

library(data.table)
file <- "COVID-19_Case_Surveillance_Public_Use_Data.csv"
covid_df <- fread(file)
library(lubridate)
library(ggplot2)
library(dplyr)
library(scales)
covid_df$cdc_report_dt <- mdy(covid_df$cdc_report_dt)
covid_df$pos_spec_dt <- mdy(covid_df$pos_spec_dt)
covid_df$onset_dt <- mdy(covid_df$onset_dt)
covid_df$month <- month(ymd(covid_df$cdc_report_dt))
covid_df$monthname <- months(covid_df$cdc_report_dt, abbreviate = TRUE)
p1 <- ggplot(covid_df, aes(x=month)) +
  geom_histogram(bins = 11, color = "darkblue", fill="lightblue") +
  labs(title = "Histogram of Reported Cases by Month", x= "Month Number", y = "Count of Cases") +
  scale_y_continuous(labels= comma) +
  stat_bin(binwidth = 1, geom= 'text', color= 'black', aes(label=scales::comma(..count..)), size = 6, vjust=-0.4) +
  theme(text = element_text(size = 16), plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 14), axis.title = element_text(size = 14, face = "bold"))
x_axis_labels <- min(covid_df$month):max(covid_df$month)
p1 <- p1 + scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
p1

From the histogram above, it’s obvious that Covid-19 cases have been rising since January 2020. As the United States entered the summer months, reported Covid-19 cases took a large jump. In may, the CDC reported 46,414 cases, and that number jumped to 118,756 in June and 161,590 in July. These increasing numbers through these months can be attributed to the reopening of restaurants, gyms, salons, etc. The number of cases reported began to drop leading into the fall months; however, these numbers spiked from September through November. From the graph, the peak month is November with 225,064 cases reported. The spike in the fall months can be attributed to the return of in-person classes for school, athletics, and the colder weather. The constant increase in cases throughout 2020 gives light to how contagious and easily transmissible Covid-19 is.

Top 10 Days of Reported Cases

top_report_dates <- count(covid_df, cdc_report_dt)

top_report_dates <- top_report_dates[order(top_report_dates$n, decreasing = TRUE)]



p7 <- ggplot(top_report_dates[1:10,], aes(x = reorder(cdc_report_dt, -n), y = n)) +
  geom_bar(color = "black", fill = "red", stat = "identity") +
  geom_text(size = 6, vjust = -0.4, aes(label=comma(n))) +
  labs(title = "Number of Covid-19 Cases Reported by Date (Top 10)", x = "Report Date", y = "Number of Covid-19 Cases Reported") +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 14), axis.title = element_text(size = 14, face = "bold"))
p7

The graph above identifies the days from January 2020 to November 2020 that reported the most Covid-19 cases. The greatest single day of reported Covid-19 cases was June 24, 2020 with 24,388. Again, this day could be linked to the reopening of indoor dining and traveling during the beginning of the summer. Large states like California, Texas, and Arizona struggled with the pandemic at this time, so they most likely contributed a substantial amount to this day total. The second highest day of reported cases was September, 5 2020 with 20,765. This large amount of reported cases can be attributed to college students returning to campus, children returning to in-person classes, and people returning to work. The majority of days represented in the bar chart fall in the months of October and November. This could be due to the colder weather, Halloween gatherings, and traveling for Thanksgiving. It can be assumed that people receive results several days after taking their test; therefore, reported cases do not reflect the day people test positive.

Cases and Hospitalizations by Race

library(DescTools)
library(dplyr)



library(ggplot2)
library(lubridate)
library(scales)
library(ggthemes)
library(RColorBrewer)
covid_df$race <- covid_df$`Race and ethnicity (combined)`

race_df <- covid_df %>%
  filter(race %in% c("White, Non-Hispanic", "Hispanic/Latino", 
                     "Black, Non-Hispanic", "Multiple/Other, Non-Hispanic", 
                     "Asian, Non-Hispanic", "American Indian/Alaska Native, Non-Hispanic", "Native Hawaiian/Other Pacific Islander, Non-Hispanic"), 
         hosp_yn %in% c("Yes")) %>%
  select(hosp_yn, race) %>%
  group_by(race, hosp_yn) %>%
  summarise(n = length(race), .groups = 'keep') %>%
  data.frame()
p2 <- ggplot(race_df, aes(x = reorder(race, -n), y = n)) +
  geom_bar(color= 'black', fill = 'blue', stat = 'identity') +
  coord_flip() +
  labs(title = "Number of Hospitalizations by Race", x = "Race", y = "Number of Hospitalizations") +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 14), axis.title = element_text(size = 14, face = "bold"))
p2

top_race <- count(covid_df, race)
top_race <- top_race[order(-n), ]
top_race <- covid_df[covid_df$race %in% c("White, Non-Hispanic", "Hispanic/Latino", 
                        "Black, Non-Hispanic", "Multiple/Other, Non-Hispanic", 
                        "Asian, Non-Hispanic", "American Indian/Alaska Native, Non-Hispanic", "Native Hawaiian/Other Pacific Islander, Non-Hispanic")]
race_reports <- count(top_race, race)


ggplot(race_reports, aes(x = reorder(race, -n), y = n)) +
  geom_bar(color= 'black', fill = 'blue', stat = 'identity') +
  coord_flip() +
  labs(title = "Number of Cases Reported by Race", x = "Race", y = "Number of Cases") +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 14), axis.title = element_text(size = 14, face = "bold"))

The two graphs above show the number of hospitalizations and number of cases reported by race/ethnicity. On the bottom graph, it’s apparent that white people had the most cases reported with hispanics and blacks right behind. While hispanics and blacks have significantly less reported cases, their hospitalizations are more than whites. This shows that hispanics and blacks are more likely to be hospitalized from Covid-19 than whites are based off the two charts. With more hospitalization and less cases, the chances are higher for a black or hispanic male/female to be hospitalized due to Covid-19 than a white male/female.

Which day of the week had the most cases reported?

covid_df$day <- weekdays(ymd(covid_df$cdc_report_dt))

months_df <- covid_df %>%
  filter(monthname %in% c('Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov')) %>%
  select(monthname, day) %>%
  group_by(monthname, day) %>%
  summarise(n = length(day), .groups = 'keep') %>%
  data.frame()


mymonths <- c( 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov')
month_order <- factor(months_df$monthname, level = mymonths)
mydays <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
day_order <- factor(months_df$day, levels = mydays)
p4 <- ggplot(months_df, aes(x = day_order, y = n, fill = month_order)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  theme_light() +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 16), axis.title = element_text(size = 16, face = "bold"), legend.title = element_text(size = 16), legend.text = element_text(size = 16), strip.text = element_text(size = 16)) +
  scale_y_continuous(labels = comma) +
  labs(title = 'Multiple Bar Charts - Total Reported Cases by Month and Day',
       x = 'Days of the week',
       y = 'Case Count',
       fill = 'Month') +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~monthname, ncol = 2, nrow = 4)
p4

The graph above identifies the amount of cases reported for each day of the week by month. For the months of June, September, and November, there were a noticeable amount of reported cases on Wednesdays. These months happen to be months where Covid-19 cases were high. Since it takes a couple days for symptoms to arise and results to come back, it would make sense for results to be reported in the middle of the week for Covid-19 spread on the weekends. It’s possible that social gatherings could be held on the weekends with results coming back midweek. In October, reported cased increased throughout the week with Friday having the largest amount of Covid-19 cases.

Hospitalization by Month

hospital_df <- covid_df %>%
  filter(hosp_yn %in% c('Yes')) %>%
  select(hosp_yn, monthname) %>%
  group_by(hosp_yn, monthname) %>%
  summarise(n = length(hosp_yn), .groups = 'keep')

library(ggrepel)

my_months <- factor(hospital_df$monthname, levels = month.abb)

p5 <- ggplot(hospital_df, aes(x = my_months, y = n)) +
  geom_line(aes(group=1), color = 'black', size=1) +
  geom_point(shape=21, size=4, color='red', fill = 'white') +
  labs(x = "Months", y = "Number of Hospitalizations", title = "Number of Hospitalizations by Month") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(limits = month.abb) +
  theme_light() +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5), axis.text = element_text(size = 14), axis.title = element_text(size = 14, face = "bold")) +
  geom_label_repel(aes(label=n),
                   box.padding = 1,
                   point.padding = 1,
                   size = 6,
                   color = "red",
                   segment.color = "darkblue")
p5

The graph above shows the number of hospitalizations by month. This graph does not include missing or unknown information. The number of hospitalizations increased from the start of the pandemic, and there is a massive spike in hospitalizations for the month of June. The hospitalizations can be attributed to the reopening of businesses, bars, and restaurants. After June, hospitalizations dropped. This could be due to the United States handling the pandemic better as time goes on and more is discovered about the virus. The peak in June is significant because the amount of cases reported in June is lower than the following months, but the number of hospitalizations is the highest out of all the months.

Cases by Race and Month

race_month_df <- covid_df %>%
  filter(race %in% c("White, Non-Hispanic", "Hispanic/Latino", 
                "Black, Non-Hispanic", "Multiple/Other, Non-Hispanic", 
                "Asian, Non-Hispanic", "American Indian/Alaska Native, Non-Hispanic", "Native Hawaiian/Other Pacific Islander, Non-Hispanic")) %>%
  select(race, monthname) %>%
  group_by(race, monthname) %>%
  summarise(n = length(monthname), .groups = 'keep') %>%
  data.frame()


mymonths <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov')
month_order <- factor(race_month_df$monthname, level = mymonths)

breaks <- c(seq(0, max(race_month_df$n), by = 10000))

p6 <- ggplot(race_month_df, aes(x = month_order, y = race, fill = n)) +
  geom_tile(color = 'black') +
  geom_text(size = 5, aes(label=comma(n))) +
  coord_equal(ratio = 1) +
  labs(title = "Heatmap: CDC Reported Cases By Month and Race",
       x = "Month",
       y = "Race",
       fill = "Cases Reported") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5), axis.text = element_text(size = 16), axis.title = element_text(size = 16, face = "bold"), legend.title = element_text(size = 16), legend.text = element_text(size = 16)) +
  scale_fill_continuous(low="white", high="red", breaks = breaks) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(color = "black")))
p6

The heatmap above shows the number of cases reported by month for each race. The number of reported cases for whites increased steadily, for the most part, each month. Hispanics had their most reported cases in June with 28,469. That number dropped to 22,877 where it stayed around this number through the rest of the months. Blacks had their most reported cases in July with 17,108. This number dropped in the following months staying at around 13,000 cases per month.

Conclusion

After visualizing this data, it’s obvious that Covid-19 cases were increasing throughout the year as it is a highly contagious virus. However, the graphs indicate that Covid-19 was more prominent in the fall months where the weather is colder. There has been rumor that warmer weather prevents more spread of the virus, and the visualization can support that rumor. With the majority of the United States being white, the number of cases for whites outnumbered the other races; however, whites did not see as many hospitalization from Covid-19 as blacks and hispanics. It can be concluded that Covid-19 is more severe in certain demographics. Hospitalizations dropped after a large spike in June which is intriguing since the number of cases did not drop after June. At this point, the pandemic was far along enough for information to be understood which leads me to believe that the health system in the United States was able to handle the pandemic better which led to less hospitalizations. We can conclude that cases were being reported in the middle of the week is a result of contraction on the weekends because test results take a few days to come back.