COVID 19 Analysis

Part 1 - Basic Exploration of US Data

The New York Times (the Times) has aggregated reported COVID-19 data from state and local governments and health departments since 2020 and provides public access through a repository on GitHub. One of the data sets provided by the Times is county-level data for cumulative cases and deaths each day. This will be your primary data set for the first two parts of your analysis.

County-level COVID data from 2020, 2021, and 2022 has been imported below. Each row of data reports the cumulative number of cases and deaths for a specific county each day. A FIPS code, a standard geographic identifier, is also provided which you will use in Part 2 to construct a map visualization at the county level for a state.

Additionally, county-level population estimates reported by the US Census Bureau has been imported as well. You will use these estimates to caluclate statistics per 100,000 people.

# Import New York Times COVID-19 data
# Import Population Estimates from US Census Bureau

us_counties_2020 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2020.csv")

## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   county = col_character(),
##   state = col_character(),
##   fips = col_character(),
##   cases = col_double(),
##   deaths = col_double()
## )

us_counties_2021 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2021.csv")

## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   county = col_character(),
##   state = col_character(),
##   fips = col_character(),
##   cases = col_double(),
##   deaths = col_double()
## )

us_counties_2022 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2022.csv")

## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   county = col_character(),
##   state = col_character(),
##   fips = col_character(),
##   cases = col_double(),
##   deaths = col_double()
## )

us_population_estimates <- read_csv("fips_population_estimates.csv")

## Parsed with column specification:
## cols(
##   STNAME = col_character(),
##   CTYNAME = col_character(),
##   fips = col_double(),
##   STATE = col_double(),
##   COUNTY = col_double(),
##   Year = col_double(),
##   Estimate = col_double()
## )

Question 1

Your first task is to combine and tidy the 2020, 2021, and 2022 COVID data sets and find the total deaths and cases for each day since March 15, 2020 (2020-03-15). The data sets provided from the NY Times also includes statistics from Puerto Rico, a US territory. You may remove these observations from the data as they will not be needed for your analysis. Once you have tidied the data, find the total COVID-19 cases and deaths since March 15, 2020. Write a sentence or two after the code block communicating your results. Use inline code to include the max_date, us_total_cases, and us_total_deaths variables. To write inline code use r.

# Combine and tidy the 2020, 2021, and 2022 COVID data sets.
# Hint: Review the rbind() documentation to combine the three data sets.

# First we need to check that the 3 files have the same columns and data types before we can rbind them.

str(us_counties_2020)

## tibble [884,737 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date  : Date[1:884737], format: "2020-01-21" "2020-01-22" ...
##  $ county: chr [1:884737] "Snohomish" "Snohomish" "Snohomish" "Cook" ...
##  $ state : chr [1:884737] "Washington" "Washington" "Washington" "Illinois" ...
##  $ fips  : chr [1:884737] "53061" "53061" "53061" "17031" ...
##  $ cases : num [1:884737] 1 1 1 1 1 1 1 1 1 1 ...
##  $ deaths: num [1:884737] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   county = col_character(),
##   ..   state = col_character(),
##   ..   fips = col_character(),
##   ..   cases = col_double(),
##   ..   deaths = col_double()
##   .. )

str(us_counties_2021)

## tibble [1,185,373 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date  : Date[1:1185373], format: "2021-01-01" "2021-01-01" ...
##  $ county: chr [1:1185373] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ state : chr [1:1185373] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ fips  : chr [1:1185373] "01001" "01003" "01005" "01007" ...
##  $ cases : num [1:1185373] 4239 13823 1517 1854 4693 ...
##  $ deaths: num [1:1185373] 50 169 33 46 63 22 45 157 63 22 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   county = col_character(),
##   ..   state = col_character(),
##   ..   fips = col_character(),
##   ..   cases = col_double(),
##   ..   deaths = col_double()
##   .. )

str(us_counties_2022)

## tibble [1,188,042 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date  : Date[1:1188042], format: "2022-01-01" "2022-01-01" ...
##  $ county: chr [1:1188042] "Autauga" "Baldwin" "Barbour" "Bibb" ...
##  $ state : chr [1:1188042] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ fips  : chr [1:1188042] "01001" "01003" "01005" "01007" ...
##  $ cases : num [1:1188042] 11018 39911 3860 4533 11256 ...
##  $ deaths: num [1:1188042] 160 593 81 95 198 46 102 532 147 65 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   county = col_character(),
##   ..   state = col_character(),
##   ..   fips = col_character(),
##   ..   cases = col_double(),
##   ..   deaths = col_double()
##   .. )

We do confirm that the 3 datasets has the same structure so now we can proceed with merging the datasets into one.

We will also run a check that the total number of rows of the new dataset equal the sum of the number for rows for the 3 files to confirm the rbind process was successful.

us_counties_total <- rbind(us_counties_2020, us_counties_2021, us_counties_2022)
nrow(us_counties_total) == nrow(us_counties_2020) + nrow(us_counties_2021) + nrow(us_counties_2022)

## [1] TRUE

Now we will need to check if the dataset is clean and tidy. The first step is to check that there is no duplicate rows.

sum(duplicated(us_counties_total))

## [1] 0

We will check the range of date for the data that has been collected.

range(us_counties_total$date)

## [1] "2020-01-21" "2022-12-31"

We need to check the unique counties included in the dataset and count it as well. we can use unique() command but it will show a lot of counties.

The total number of unique counties included as below is 1856 while the current total number of counties as per the resources found online was 3,143. We need to understand why the data for other counties was not included.

n_distinct(us_counties_total$county)

## [1] 1932

Next we need to count unique states in USA.

n_distinct(us_counties_total$state)

## [1] 56

The total is showing 55 state while it should be 50, which means there might be a mistake in the name of some states or there is other classification like unknown, we will need a list of the sorted unique states.

sort(unique(us_counties_total$state))

##  [1] "Alabama"                  "Alaska"                  
##  [3] "American Samoa"           "Arizona"                 
##  [5] "Arkansas"                 "California"              
##  [7] "Colorado"                 "Connecticut"             
##  [9] "Delaware"                 "District of Columbia"    
## [11] "Florida"                  "Georgia"                 
## [13] "Guam"                     "Hawaii"                  
## [15] "Idaho"                    "Illinois"                
## [17] "Indiana"                  "Iowa"                    
## [19] "Kansas"                   "Kentucky"                
## [21] "Louisiana"                "Maine"                   
## [23] "Maryland"                 "Massachusetts"           
## [25] "Michigan"                 "Minnesota"               
## [27] "Mississippi"              "Missouri"                
## [29] "Montana"                  "Nebraska"                
## [31] "Nevada"                   "New Hampshire"           
## [33] "New Jersey"               "New Mexico"              
## [35] "New York"                 "North Carolina"          
## [37] "North Dakota"             "Northern Mariana Islands"
## [39] "Ohio"                     "Oklahoma"                
## [41] "Oregon"                   "Pennsylvania"            
## [43] "Puerto Rico"              "Rhode Island"            
## [45] "South Carolina"           "South Dakota"            
## [47] "Tennessee"                "Texas"                   
## [49] "Utah"                     "Vermont"                 
## [51] "Virgin Islands"           "Virginia"                
## [53] "Washington"               "West Virginia"           
## [55] "Wisconsin"                "Wyoming"

There is no misspelling in state names but there is 5 states as following which are not included in the official list of states.

American Samoa, District of Columbia, Guam, Northern Mariana Islands & Virgin Islands.

Now we will check the range of values for deaths and cases.

range(us_counties_total$deaths)

## [1] NA NA

range(us_counties_total$cases)

## [1]       0 3632440

Now we will Check if there is any NA values.

filter(us_counties_total, is.na(deaths))

## # A tibble: 75,701 x 6
##    date       county       state       fips  cases deaths
##    <date>     <chr>        <chr>       <chr> <dbl>  <dbl>
##  1 2020-05-05 Adjuntas     Puerto Rico 72001     3     NA
##  2 2020-05-05 Aguada       Puerto Rico 72003     7     NA
##  3 2020-05-05 Aguadilla    Puerto Rico 72005    11     NA
##  4 2020-05-05 Aguas Buenas Puerto Rico 72007    22     NA
##  5 2020-05-05 Aibonito     Puerto Rico 72009    13     NA
##  6 2020-05-05 Anasco       Puerto Rico 72011     5     NA
##  7 2020-05-05 Arecibo      Puerto Rico 72013    43     NA
##  8 2020-05-05 Arroyo       Puerto Rico 72015     5     NA
##  9 2020-05-05 Barceloneta  Puerto Rico 72017     3     NA
## 10 2020-05-05 Barranquitas Puerto Rico 72019    15     NA
## # … with 75,691 more rows

filter(us_counties_total, is.na(cases))

## # A tibble: 0 x 6
## # … with 6 variables: date <date>, county <chr>, state <chr>, fips <chr>,
## #   cases <dbl>, deaths <dbl>

Now we will calculate the required details.

max_date <- max(us_counties_total$deaths)
us_total_cases <- max(us_counties_total$cases)
us_total_deaths <- max(us_counties_total$deaths)
cbind(max_date, us_total_cases, us_total_deaths)

##      max_date us_total_cases us_total_deaths
## [1,]       NA        3632440              NA

Now after we have completed cleaning the dataset, we can follow the steps for completing the required analysis.

Now we will calculate the total deaths and cases per day across USA.

# Your output should look similar to the following tibble:
##
A tibble: 657 x 3
# date total_deaths total_cases
# <date> <dbl> <dbl>
# 1 2020-03-15 68 3595
# 2 2020-03-16 91 4502
# 3 2020-03-17 117 5901
# 4 2020-03-18 162 8345
# 5 2020-03-19 212 12387
# 6 2020-03-20 277 17998
# 7 2020-03-21 359 24507
# 8 2020-03-22 457 33050
# 9 2020-03-23 577 43474
# 10 2020-03-24 783 53899
# ... with 647 more rows
#

sum_us_counties_total <- us_counties_total %>%
  filter(date >= "2020-03-15") %>%
  group_by(date) %>%
  summarise(total_deaths = sum(deaths), total_cases = sum(cases)) %>%
  select(date, total_deaths, total_cases)

## `summarise()` ungrouping output (override with `.groups` argument)

head(sum_us_counties_total)

## # A tibble: 6 x 3
##   date       total_deaths total_cases
##   <date>            <dbl>       <dbl>
## 1 2020-03-15           68        3600
## 2 2020-03-16           91        4507
## 3 2020-03-17          117        5906
## 4 2020-03-18          162        8350
## 5 2020-03-19          212       12393
## 6 2020-03-20          277       18012

– Communicate your methodology, results, and interpretation here –

The total number of new deaths and new cases has increased on daily basis from 2020-03-15 tell 2022-12-31

Question 2

Create a visualization for the total number of deaths and cases in the US since March 15, 2020. Before you create your visualization, review the types of plots you can create using the ggplot2 library and think about which plots would be effective in communicating your results. After you have created your visualization, write a few sentences describing your visualization. How could the plot be interpreted? Could it be misleading?

# Create a visualization for the total number of US cases and deaths since March 15, 2020.

pivot_sum_us_counties_total <- sum_us_counties_total %>% pivot_longer(!date)
ggplot(pivot_sum_us_counties_total, aes(x = date, y = value)) +
  geom_line(aes(color = name)) +
  labs(title = "Compare Total Cases and Deaths over time", y = "Total")

## Warning: Removed 971 row(s) containing missing values (geom_path).

First we tried to plot two lines side by side visualization but since the number of cases is much higher than the number of deaths, the visualization will be misleading, it is showing the total number of cases increase over time but it looks like total number of deaths is steady over time which is not correct.

ggplot(pivot_sum_us_counties_total, aes(x = date, y = value, fill = name)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare Total Cases and Deaths over time", y = "Total")

## Warning: Removed 971 rows containing missing values (geom_col).

Next when we try to plot two columns side by side, we still have the same issue that the number of cases is much higher than the death numbers, we can hardly see that the total number of deaths is increasing over time.

ggplot(sum_us_counties_total, aes(x = date, y = total_cases, fill = total_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare Total Cases and Deaths over time", y = "Total")

– Communicate your methodology, results, and interpretation here –

We can plot a column chart with fill of the second variable, it shows that both variables increase over time but it takes time for people to find the total number for each variable. This will be more clear than the first two plots, but again sine the value of deaths and cases vary significantly having one visulization that include both variables will be hard.

Question 3

While it is important to know the total deaths and cases throughout the COVID-19 pandemic, it is also important for local and state health officials to know the the number of new cases and deaths each day to understand how rapidly the virus is spreading. Using the table you created in Question 1, calculate the number of new deaths and cases each day and a seven-day average of new deaths and cases. Once you have organized your data, find the days that saw the largest number of new cases and deaths. Write a sentence or two after the code block communicating your results.

# Create a new table, based on the table from Question 1, and calculate the number of new deaths and cases each day and a seven day average of new deaths and cases.
#
# Hint: Look at the documentation for lag() when computing the number of new deaths and cases and the seven-day averages.
#
#
## YOUR CODE HERE ##

# I will use the original dataset in order calculate the new cases and new deaths in 2020-03-15 as the we need to include the date from 2020-03-14 to calculate the new cases and deaths on 2020-03-15

new_us_counties_total <- us_counties_total %>%
  filter(date >= "2020-03-14") %>%
  group_by(date) %>%
  summarise(total_deaths = sum(deaths), total_cases = sum(cases)) %>%
  select(date, total_deaths, total_cases) %>%
  mutate(new_deaths = total_deaths - lag(total_deaths), new_cases = total_cases - lag(total_cases))

## `summarise()` ungrouping output (override with `.groups` argument)

# Remove first row
new_us_counties_total <- new_us_counties_total[-1, ]


head(new_us_counties_total)

## # A tibble: 6 x 5
##   date       total_deaths total_cases new_deaths new_cases
##   <date>            <dbl>       <dbl>      <dbl>     <dbl>
## 1 2020-03-15           68        3600          8       702
## 2 2020-03-16           91        4507         23       907
## 3 2020-03-17          117        5906         26      1399
## 4 2020-03-18          162        8350         45      2444
## 5 2020-03-19          212       12393         50      4043
## 6 2020-03-20          277       18012         65      5619

Find the date with maximum new_cases and new_deaths

new_us_counties_total %>%
  filter(new_deaths == max(new_deaths))

## # A tibble: 0 x 5
## # … with 5 variables: date <date>, total_deaths <dbl>, total_cases <dbl>,
## #   new_deaths <dbl>, new_cases <dbl>

new_us_counties_total %>%
  filter(new_cases == max(new_cases))

## # A tibble: 1 x 5
##   date       total_deaths total_cases new_deaths new_cases
##   <date>            <dbl>       <dbl>      <dbl>     <dbl>
## 1 2022-01-10           NA    61600909         NA   1433977

We can add day of the week and month column in order to aggregate the results per day or month to see how the cases relates to the day of the week or month.

new_us_counties_total <- mutate(new_us_counties_total, day_of_week = wday(date, label = TRUE, abbr = FALSE))

new_us_counties_total$day_of_week <- factor(new_us_counties_total$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

new_us_counties_total <- mutate(new_us_counties_total, month = month.abb[month(ymd(new_us_counties_total$date))])

new_us_counties_total$month <- factor(new_us_counties_total$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
head(new_us_counties_total)

## # A tibble: 6 x 7
##   date       total_deaths total_cases new_deaths new_cases day_of_week month
##   <date>            <dbl>       <dbl>      <dbl>     <dbl> <ord>       <fct>
## 1 2020-03-15           68        3600          8       702 Sunday      Mar  
## 2 2020-03-16           91        4507         23       907 Monday      Mar  
## 3 2020-03-17          117        5906         26      1399 Tuesday     Mar  
## 4 2020-03-18          162        8350         45      2444 Wednesday   Mar  
## 5 2020-03-19          212       12393         50      4043 Thursday    Mar  
## 6 2020-03-20          277       18012         65      5619 Friday      Mar

Calculate 7 days average new cases and new deaths.

#  Your output should look similar to the following tibble: 
#  
#  date
#  total_deaths     >  the cumulative number of deaths up to and including the associated date
#  total_cases      >  the cumulative number of cases up to and including the associated date
#  delta_deaths_1   >  the number of new deaths since the previous day
#  delta_cases_1    >  the number of new cases since the previous day
#  delta_deaths_7   >  the average number of deaths in a seven-day period
#  delta_cases_7    >  the average number of cases in a seven-day period
#==
#  A tibble: 813 x 7
#    date           total_deaths   total_cases   delta_deaths_1    delta_cases_1  delta_deaths_7  delta_cases_7
#    <date>            <dbl>          <dbl>         <dbl>               <dbl>        <dbl>            <dbl>
#  1 2020-03-15           68          3600            0                     0         NA                NA 
#  2 2020-03-16           91          4507           23                   907         NA                NA 
#  3 2020-03-17          117          5906           26                  1399         NA                NA 
#  4 2020-03-18          162          8350           45                  2444         NA                NA 
#  5 2020-03-19          212         12393           50                  4043         NA                NA 
#  6 2020-03-20          277         18012           65                  5619         NA                NA 
#  7 2020-03-21          360         24528           83                  6516         NA                NA 
#  8 2020-03-22          458         33073           98                  8545       55.7              4210.
#  9 2020-03-23          579         43505          121                 10432       69.7              5571.
# 10 2020-03-24          785         53938          206                 10433       95.4              6862.
# ... with 803 more rows

new_us_counties_total <- new_us_counties_total %>%
  mutate(avg_week_deaths = (new_deaths) + lead(new_deaths) + lead(new_deaths, 2) + lead(new_deaths, 3) + lead(new_deaths, 4) + lead(new_deaths, 5) + lead(new_deaths, 6) / 7, avg_week_cases = (new_cases + lead(new_cases) + lead(new_cases, 2) + lead(new_cases, 3) + lead(new_cases, 4) + lead(new_cases, 5) + lead(new_cases, 6) / 7))

# show a tibble with date, new deaths, new cases, avg weekly new deaths and avg weekly new cases

head(select(new_us_counties_total, date, new_deaths, new_cases, avg_week_deaths, avg_week_cases))

## # A tibble: 6 x 5
##   date       new_deaths new_cases avg_week_deaths avg_week_cases
##   <date>          <dbl>     <dbl>           <dbl>          <dbl>
## 1 2020-03-15          8       702            229.         16045.
## 2 2020-03-16         23       907            306          22149.
## 3 2020-03-17         26      1399            384.         30056.
## 4 2020-03-18         45      2444            491.         39089.
## 5 2020-03-19         50      4043            661.         47679.
## 6 2020-03-20         65      5619            885.         58607.

Find the week with the highest new cases and the week with the highest new death cases.

new_us_counties_total %>%
  filter(avg_week_deaths == max(avg_week_deaths, na.rm = TRUE))

## # A tibble: 1 x 9
##   date       total_deaths total_cases new_deaths new_cases day_of_week month
##   <date>            <dbl>       <dbl>      <dbl>     <dbl> <ord>       <fct>
## 1 2020-04-13        26613      584018       1764     25769 Monday      Apr  
## # … with 2 more variables: avg_week_deaths <dbl>, avg_week_cases <dbl>

new_us_counties_total %>%
  filter(avg_week_cases == max(avg_week_cases, na.rm = TRUE))

## # A tibble: 1 x 9
##   date       total_deaths total_cases new_deaths new_cases day_of_week month
##   <date>            <dbl>       <dbl>      <dbl>     <dbl> <ord>       <fct>
## 1 2022-01-10           NA    61600909         NA   1433977 Monday      Jan  
## # … with 2 more variables: avg_week_deaths <dbl>, avg_week_cases <dbl>

The week with the highest average weekly new deaths was 2021-01-06 to 2021-01-12 while the single day with highest new deaths was later that year on 2022-11-11.

The week with the highest average weekly new cases was 2022-01-10 to 2022-01-16 will the single day with highest new cases was in the same week on 2022-01-10.

We need to know why the day with the highest new death count was not in the same week with the highest average weekly death.

We will filter the date to see the the daily death counts and compare it between the two periods.

new_us_counties_total %>%
  filter(between(date, as.Date("2021-01-06"), as.Date("2021-01-12"))) %>%
  select(date, new_deaths, avg_week_deaths)

## # A tibble: 7 x 3
##   date       new_deaths avg_week_deaths
##   <date>          <dbl>           <dbl>
## 1 2021-01-06         NA              NA
## 2 2021-01-07         NA              NA
## 3 2021-01-08         NA              NA
## 4 2021-01-09         NA              NA
## 5 2021-01-10         NA              NA
## 6 2021-01-11         NA              NA
## 7 2021-01-12         NA              NA

new_us_counties_total %>%
  filter(between(date, as.Date("2022-11-5"), as.Date("2022-11-17"))) %>%
  select(date, new_deaths, avg_week_deaths)

## # A tibble: 13 x 3
##    date       new_deaths avg_week_deaths
##    <date>          <dbl>           <dbl>
##  1 2022-11-05         NA              NA
##  2 2022-11-06         NA              NA
##  3 2022-11-07         NA              NA
##  4 2022-11-08         NA              NA
##  5 2022-11-09         NA              NA
##  6 2022-11-10         NA              NA
##  7 2022-11-11         NA              NA
##  8 2022-11-12         NA              NA
##  9 2022-11-13         NA              NA
## 10 2022-11-14         NA              NA
## 11 2022-11-15         NA              NA
## 12 2022-11-16         NA              NA
## 13 2022-11-17         NA              NA

We can see that the daily deaths in the week with the highest average new deaths has a very near high values, while toward the end of the year the few days before 2022-11-11 and the few days after 2022-11-11 had a very minimum new death count and there was a huge spike in the new death count on 2022-11-11. So we need to know what happened exactly on that day.

All the resources found online suggests that the number of daily deaths during November 2022 was around 400+ deaths daily and there has been no spike on new death counts on 2022-11-11.

While if we calculate the average new daily deaths during November in our dataset we will find that the average daily deaths will be 720,5. This might indicate an issue in the dataset or an issue with reporting the numbers on that specific day.

new_us_counties_total %>%
  filter(between(date, as.Date("2022-11-01"), as.Date("2022-11-30"))) %>%
  summarise(avg_daily_death = mean(new_deaths))

## # A tibble: 1 x 1
##   avg_daily_death
##             <dbl>
## 1              NA

Let us visualize the new cases and new deaths, trying to visualize the data over 3 years period will make the visualization crowded and therefore it will be hard to spot the difference, therefore we will make 3 visualization for each year 2020, 2021 & 2022.

new_us_counties_total %>%
  filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
  ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare New Cases and Deaths over time - 2020", y = "Total Cases")

new_us_counties_total %>%
  filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
  ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare New cases and Deaths over time - 2021", y = "Total Cases")

new_us_counties_total %>%
  filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
  ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare New Cases and Deaths over time - 2022", y = "Total Cases")

A better way of doing it is by aggregating the results on monthly basis and then try to plot the new case and new deaths again.

new_us_counties_total %>%
  filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
  group_by(month) %>%
  summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths Monthly - 2020", y = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

new_us_counties_total %>%
  filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
  group_by(month) %>%
  summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths Monthly - 2021", y = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

new_us_counties_total %>%
  filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
  group_by(month) %>%
  summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths Monthly - 2022", y = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

We can also make a visualization for the day of the week

new_us_counties_total %>%
  filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
  group_by(day_of_week) %>%
  summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths per Day of Week - 2020", y = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

new_us_counties_total %>%
  filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
  group_by(day_of_week) %>%
  summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths per Day of Week - 2021", y_lab = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

new_us_counties_total %>%
  filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
  group_by(day_of_week) %>%
  summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
  ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
  geom_col(position = "Dodge") +
  labs(title = "Compare NEW Cases and Deaths per Day of Week - 2022", y_lab = "Total Cases")

## `summarise()` ungrouping output (override with `.groups` argument)

we can see that constantly over the 3 years, Saturday and Sunday had the lowest new cases count, however Wednesday has been always with the highest death count.

– Communicate your methodology, results, and interpretation here –

We have now a list of new deaths new cases with weekly averages new cases and deaths.

Question 4

#  Your output should look similar to the following tibble: 
#  
#  date
#  total_deaths     >  the cumulative number of deaths up to and including the associated date
#  total_cases      >  the cumulative number of cases up to and including the associated date
#  delta_deaths_1   >  the number of new deaths since the previous day
#  delta_cases_1    >  the number of new cases since the previous day
#  delta_deaths_7   >  the average number of deaths in a seven-day period
#  delta_cases_7    >  the average number of cases in a seven-day period
#==
#  A tibble: 657 x 7
#       date         total_deaths   total_cases   delta_deaths_1   delta_cases_1 delta_deaths_7 delta_cases_7
#      <date>            <dbl>         <dbl>          <dbl>           <dbl>           <dbl>         <dbl>
#   1 2020-03-15         0.0205        1.08                0              0              NA           NA   
#   2 2020-03-16         0.0275        1.36          0.00694          0.274              NA           NA   
#   3 2020-03-17         0.0353        1.78          0.00784          0.422              NA           NA   
#   4 2020-03-18         0.0489        2.52           0.0136          0.737              NA           NA   
#   5 2020-03-19         0.0640        3.74           0.0151           1.22              NA           NA   
#   6 2020-03-20         0.0836        5.43           0.0196           1.69              NA           NA   
#   7 2020-03-21         0.108         7.39           0.0247           1.96              NA           NA   
#   8 2020-03-22         0.138         9.97           0.0296           2.58          0.0168         1.27
#   9 2020-03-23         0.174         13.1           0.0362           3.14          0.0209         1.68
#  10 2020-03-24         0.236         16.3           0.0621           3.14          0.0287         2.07

# Create a new table, based on the table from Question 3, and calculate the number of new deaths and cases per 100,000 people each day and a seven day average of new deaths and cases per 100,000 people.

# Hint: To calculate per 100,000 people, first tidy the population estimates data and calculate the US population in 2020 and 2021. Then, you will need to divide  each statistic by the estimated population and then multiply by 100,000.
#
# Hint: look at the help documentation for grepl() and case_when() to divide the averages by the US population for each year.
# For example, take the simple tibble, t_new:
#
#     x     y
#   <int> <chr>
#     1     a
#     2     b
#     3     a
#     4     b
#     5     a
#     6     b
#
#
# To add a column, z, that is dependent on the value in y, you could:
#
# t_new %>%
#   mutate(z = case_when(grepl("a", y) ~ "not b",
#                        grepl("b", y) ~ "not a"))
#

## YOUR CODE HERE ##

# First we need to check and tidy us_population_estimates if needed

str(us_population_estimates)

## tibble [6,286 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ STNAME  : chr [1:6286] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ CTYNAME : chr [1:6286] "Autauga County" "Autauga County" "Baldwin County" "Baldwin County" ...
##  $ fips    : num [1:6286] 1001 1001 1003 1003 1005 ...
##  $ STATE   : num [1:6286] 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY  : num [1:6286] 1 1 3 3 5 5 7 7 9 9 ...
##  $ Year    : num [1:6286] 2020 2021 2020 2021 2020 ...
##  $ Estimate: num [1:6286] 58877 59095 233140 239294 25180 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   STNAME = col_character(),
##   ..   CTYNAME = col_character(),
##   ..   fips = col_double(),
##   ..   STATE = col_double(),
##   ..   COUNTY = col_double(),
##   ..   Year = col_double(),
##   ..   Estimate = col_double()
##   .. )

sum(duplicated(us_population_estimates))

## [1] 0

range(us_population_estimates$Year)

## [1] 2020 2021

range(us_population_estimates$Estimate)

## [1]      57 9989165

n_distinct(us_population_estimates$STNAME)

## [1] 51

n_distinct(us_population_estimates$CTYNAME)

## [1] 1878

Everything looks fine, here we can see that the number of states is 51 and not 55 as per the first dataset.

# Calculate the total number of population estimate for 2020, 2021
year_us_population_estimates <- us_population_estimates %>%
  group_by(Year) %>%
  summarise(year_population = sum(Estimate))

## `summarise()` ungrouping output (override with `.groups` argument)

year_us_population_estimates

## # A tibble: 2 x 2
##    Year year_population
##   <dbl>           <dbl>
## 1  2020       331501080
## 2  2021       331893745

# let us get a tibble of the data we need from new_us_countries dataset
p_new_us_counties_total <- new_us_counties_total %>%
  select(date, new_cases, new_deaths, avg_week_cases, avg_week_deaths) %>%
  # add new column called year
  mutate(Year = year(date)) %>%
  # Now we can easily add the population total by doing left join
  left_join(year_us_population_estimates) %>%
  # since there is no estimates for population in 2022, we will exclude the data for 2022 from our records
  filter(Year != 2022) %>%
  # Calculate the required fields
  mutate(p_new_cases = (new_cases / year_population) * 100000, p_new_deaths = (new_deaths / year_population) * 100000, p_avg_week_cases = (avg_week_cases / year_population) * 100000, p_avg_week_deaths = (avg_week_deaths / year_population) * 100000)

## Joining, by = "Year"

p_new_us_counties_total

## # A tibble: 657 x 11
##    date       new_cases new_deaths avg_week_cases avg_week_deaths  Year
##    <date>         <dbl>      <dbl>          <dbl>           <dbl> <dbl>
##  1 2020-03-15       702          8         16045.            229.  2020
##  2 2020-03-16       907         23         22149.            306   2020
##  3 2020-03-17      1399         26         30056.            384.  2020
##  4 2020-03-18      2444         45         39089.            491.  2020
##  5 2020-03-19      4043         50         47679.            661.  2020
##  6 2020-03-20      5619         65         58607.            885.  2020
##  7 2020-03-21      6516         83         70034.           1135.  2020
##  8 2020-03-22      8545         98         81381.           1480.  2020
##  9 2020-03-23     10432        121         93539.           1893.  2020
## 10 2020-03-24     10433        206        102048            2220.  2020
## # … with 647 more rows, and 5 more variables: year_population <dbl>,
## #   p_new_cases <dbl>, p_new_deaths <dbl>, p_avg_week_cases <dbl>,
## #   p_avg_week_deaths <dbl>

– Communicate your methodology, results, and interpretation here –

We have merged the two datasets and perform join to one datasets to calculate new deaths and cases and average weekly new deaths and cases.

COVID 19 Analysis

Ariel Beera

27 - May - 2025

Required Packages

Part 1 - Basic Exploration of US Data

Question 1

Question 2

Question 3

Question 4

Question 5