Provisional Death Counts for Influenza, Pneumonia, and Covid-19

The number of deaths reported are the total number of deaths recieved and coded as of the data anlaysis. The are 17 columns with approximately 17,600 rows of data. This is a big dataset! The columns in this data set include the day of analysis, starting and ending week periods by data, the year, the week number, the group, indicator, jurisdiction, age group, covid-19 deaths, total deaths, pneumonia deaths, influenza deaths, pneumonia or influenza and pneumonia, influenza or covid-19 deaths.

#Load Libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(ggplot2)
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#set working direcory and read_csv data
setwd("C:/Users/Haley/Desktop")
covid_data <- read_csv("covid_data.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   `Data As Of` = col_character(),
##   `Start Week` = col_character(),
##   `End Week` = col_character(),
##   MMWRyear = col_double(),
##   MMWRweek = col_double(),
##   `Week Ending Date` = col_character(),
##   Group = col_character(),
##   Indicator = col_character(),
##   Jurisdiction = col_character(),
##   `Age Group` = col_character(),
##   `COVID-19 Deaths` = col_double(),
##   `Total Deaths` = col_double(),
##   `Pneumonia Deaths` = col_double(),
##   `Influenza Deaths` = col_double(),
##   `Pneumonia or Influenza` = col_double(),
##   `Pneumonia, Influenza, or COVID-19 Deaths` = col_double(),
##   Footnote = col_character()
## )
#print(covid_data)
#make all variables lowercase and add _ between spaces
names(covid_data) <- tolower(names(covid_data))
names(covid_data) <- gsub(" ", "_", names(covid_data))
#change two columns names to make them easier to work with
names(covid_data)[11] <- "covid_deaths"
names(covid_data)[16] <- "PIC_deaths"
names(covid_data)
##  [1] "data_as_of"             "start_week"             "end_week"              
##  [4] "mmwryear"               "mmwrweek"               "week_ending_date"      
##  [7] "group"                  "indicator"              "jurisdiction"          
## [10] "age_group"              "covid_deaths"           "total_deaths"          
## [13] "pneumonia_deaths"       "influenza_deaths"       "pneumonia_or_influenza"
## [16] "PIC_deaths"             "footnote"
#ensure the covid_data passes any na data, can not omit becuase there are a lot of missing blocks of data
covid_data <- na.pass(covid_data)

#build covid dataframe that will include the variables for the visuals 
covid_df <- data.frame(covid_data)
covid_df <- covid_df %>% select(mmwrweek, jurisdiction, age_group, covid_deaths, total_deaths,pneumonia_deaths, influenza_deaths, PIC_deaths)

#print(covid_df)
names(covid_df)
## [1] "mmwrweek"         "jurisdiction"     "age_group"        "covid_deaths"    
## [5] "total_deaths"     "pneumonia_deaths" "influenza_deaths" "PIC_deaths"
#List the data in descending order of covid_deaths to see which jurisdictions have higher recorded covid deaths 
#covid_df[order(-covid_df$covid_deaths),]
#Filter only data for the united states
plot1 <- covid_df %>% filter(jurisdiction == c("United States")) %>% filter(age_group == "0-17 years" | age_group == " All Ages" | age_group == "18-64 years" | age_group == "65 years and over") %>%
        group_by(mmwrweek, age_group) %>%
        ggplot(mapping = aes(x = mmwrweek, y = total_deaths, color = age_group)) + 
        geom_line(size = .5) + labs(subtitle = "Starting Data: 12/29/2019 until Ending Data:04/25/2021")+ xlab("Week") + ylab("Total Deaths") + ggtitle("Total Deaths Recorded per each Analysis Week") + theme_minimal() 
plot1

Plot 1

The week variable is determined by the established start and ending dates. Starting in December of 2019 was when this data was first being accumulated. Marking out 7 days establishes this first week as week 1. There is a multitude of data collected for each given week but it is a great variable to represent the longevity of this national pandemic. In fact, it was suprising to see this data was recording starting December 29, 2019. I am extremely concerned that data analysts were already counting the numbers as the public was just starting to hear about the virus that would change our lives in an instant. Considering if the masks mandates, shutdowns, and other health protocols would have began back in January of 2020, the amount of lives that could have been saved seems like it could have made this past year and a half a lot more different. I would also like to see where the 3 scriggles in between 10 and 20 weeks for the 0-17 years old is coming from. Otherwise, it is evident that this virus came for the 65 and older population with the 18 - 64 years old coming closely behind.

visual1 <- covid_df %>% filter(age_group == "0-17 years" | age_group == " All Ages" | age_group == "18-64 years" | age_group == "65 years and over")%>% filter(jurisdiction == "United States")

visual1<- plot_ly(covid_df, x = ~age_group, y = ~total_deaths, type = "bar", name = "total_deaths") %>% add_trace(y = ~influenza_deaths, name = "influenza_deaths")  %>% add_trace(y = ~covid_deaths, name = "covid_deaths")%>% add_trace(y = ~pneumonia_deaths, name = "pneumonia_deaths") %>% add_trace(y = ~PIC_deaths, name = "pneumonia, influenza, or covid_deaths") %>% layout(title = "The More Deadly Virus" , yaxis = list(title = "Death Count", zeroline = TRUE), xaxis = list(title = "Age Group"))

visual1
## Warning: Ignoring 1817 observations
## Warning: Ignoring 4096 observations
## Warning: Ignoring 2554 observations
## Warning: Ignoring 2992 observations
## Warning: Ignoring 2790 observations

Visual 2

I chose to analyze the age groups to see whether influenza, covid, or pneumonia were more likely the culprit for the extremely high death count with in the past year and half. It is extremely interesting to see that the covid deaths are the highest our of the comparison between influenza, pneumonia and covid. The third column, the combined disease variable, I decided to include becuase it accounts for those who had been affected by more than one disease and passed away as a result of being infected.

#create new dataframe that will use the top states based on the highest recorded covid deaths 
new_df <- covid_df %>% filter(age_group == "0-17 years" | age_group == " All Ages" | age_group == "18-64 years" | age_group == "65 years and over")%>% filter(jurisdiction == "New York City"| jurisdiction == "California"| jurisdiction ==  "Texas" | jurisdiction == "New Jersey" | jurisdiction == "Ohio" | jurisdiction ==  "Pennsylvania") %>% group_by (age_group, jurisdiction, mmwrweek)

new_df
## # A tibble: 1,260 x 8
## # Groups:   age_group, jurisdiction, mmwrweek [954]
##    mmwrweek jurisdiction  age_group   covid_deaths total_deaths pneumonia_deaths
##       <dbl> <chr>         <chr>              <dbl>        <dbl>            <dbl>
##  1        1 California    0-17 years             0           60               NA
##  2        1 California    18-64 years            0         1404               81
##  3        1 California    65 years a~            0         4401              382
##  4        1 New Jersey    0-17 years             0           13                0
##  5        1 New Jersey    18-64 years            0          398               15
##  6        1 New Jersey    65 years a~            0         1230               61
##  7        1 New York City 0-17 years             0           15                0
##  8        1 New York City 18-64 years            0          271               13
##  9        1 New York City 65 years a~            0          831               72
## 10        1 Ohio          0-17 years             0           45               NA
## # ... with 1,250 more rows, and 2 more variables: influenza_deaths <dbl>,
## #   PIC_deaths <dbl>
visual2 <- plot_ly(new_df, x = ~mmwrweek, y =~total_deaths , text = ~jurisdiction, type = "scatter", mode = "markers", size = ~covid_deaths, color = ~age_group, colors = "Paired", marker = list(opacity = .5))

visual2<- visual2 %>%layout(title = "Deaths per Week including Jurisdiction & Age group", xaxis = list(showgrid = FALSE), yaxis = list(showgrid = FALSE))

visual2
## Warning: Ignoring 61 observations
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

##Visual 3

My goal was to create a bubble chart one way or another and I am really intrigued with how this one turned out. What was most shocking to me was the parabolic shape for the 65 year and over group, with deaths all coming from the state of California. Since California is so heavily populated, the high numbers makes sense; however, the dip from high to low to high is what makes this plot so interesting.

library(treemap)
visual3 <- treemap(new_df,index= "jurisdiction", vSize= "total_deaths",vColor= "covid_deaths", type="value", palette="RdYlBu")

##Visual 4

I wanted to see the comparison in number of covid deaths relative to the top 6 states that I had identified by looking at the descending data in the beginning. I was also extremely interested that Texas was higher than New York becuase I expected the extremely populated city would have been higher. However, I also could see how the lack of masks and restrictions in a state like Texas may have caused these higher death numbers.

visual5 <- ggplot(new_df, aes(x = mmwrweek, y = total_deaths)) + geom_point(base_size = 8) + geom_smooth(color = "red") + ggtitle ("Total Deaths recorded each week") + xlab("Week") + ylab("Total Deaths") + labs(subtitle = "Using Linear Regression to analyze data")
## Warning: Ignoring unknown parameters: base_size
visual5
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 61 rows containing non-finite values (stat_smooth).
## Warning: Removed 61 rows containing missing values (geom_point).

Visual 5

The slight curvature of this line is extremely telling about the data even though there is the outlier data coming from the state of California. What I find most interesting is how fairly straight this linear regression line is even though there is that extreme curve at the top of the graph. I really like how this scatter plot also closely ressembles the bubble plot because it is the same type of representation; however, this is strictly to show the variable correlations where as the bubble chart is to show the jurisdictions.

Summary

The topic of this data is Covid-19! An extremely relavent dataset that included variables such as influenza deaths, pneumonia deaths, covid-19 deaths, week counts, jurisdiction records, and more variables to accumulate an extensive dataset on the virus. This data set uses a mix of categorical and numerical variables which I found on data.cdc.gov (https://data.cdc.gov/NCHS/Provisional-Death-Counts-for-Influenza-Pneumonia-a/ynw2-4viq). At first, I wasn’t sure how I wanted to best analyze this data but I know I needed to lower in on some specific aspects of this dataset in order to create specific and detailed visuals. When first looking at this dataset, it was extremely big and had a lot of missing columns. When I first tried to just omit the NA data, I did not accomplish much becuase over half of the data was chopt from the dataset when doing this. I decided to simply pass the NA values in the dataset which allowed me to keep the numerous columns and rows of data for my visuals. Then, I used the lowercase() asset to relabel all of my columns in lowercase, as well as getting rid of labal spaces with an "_“. This made it easier to use the names of the columns when coding the visuals. Similarly, there were a few issues trying to read covid-19_deaths, so I completely renamed that variable to covid_deaths and the final row including all of the diseases to PIC_deaths(using the first letter of each disease). I chose this topic because I really wanted to conduct a meaningful analysis. I wanted to dive into a topic that I could educate myself on that is prevalent in my day to day life right now. With the vaccine coming out, (Go Vacciners!) I want to emaphasize to those that believe Covid is a”hoax," that is is infact a disease that has been the direct killer to many elders. It is terrible that data has been collected about this disease before the public even heard anything about it. I remember March being the month where I finally learned of what was happening, little did I know that the word had already been spread in government and higher up departments. Learning how important data is in life is what has inspired me to look into this covid-19 dataset and analyze how truly deadly this virus has been to the United States and the states within this country.

Here’s a list of other CDC datasets on Covid-19: https://www.cdc.gov/nchs/covid19/covid-19-mortality-data-files.htm