The New York Times (the Times) has aggregated reported COVID-19 data from state and local governments and health departments since 2020 and provides public access through a repository on GitHub. One of the data sets provided by the Times is county-level data for cumulative cases and deaths each day. This will be your primary data set for the first two parts of your analysis.
County-level COVID data from 2020, 2021, and 2022 has been imported below. Each row of data reports the cumulative number of cases and deaths for a specific county each day. A FIPS code, a standard geographic identifier, is also provided which you will use in Part 2 to construct a map visualization at the county level for a state.
Additionally, county-level population estimates reported by the US Census Bureau has been imported as well. You will use these estimates to caluclate statistics per 100,000 people.
# Import New York Times COVID-19 data
# Import Population Estimates from US Census Bureau
us_counties_2020 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2020.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## county = col_character(),
## state = col_character(),
## fips = col_character(),
## cases = col_double(),
## deaths = col_double()
## )
us_counties_2021 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2021.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## county = col_character(),
## state = col_character(),
## fips = col_character(),
## cases = col_double(),
## deaths = col_double()
## )
us_counties_2022 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2022.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## county = col_character(),
## state = col_character(),
## fips = col_character(),
## cases = col_double(),
## deaths = col_double()
## )
us_population_estimates <- read_csv("fips_population_estimates.csv")
## Parsed with column specification:
## cols(
## STNAME = col_character(),
## CTYNAME = col_character(),
## fips = col_double(),
## STATE = col_double(),
## COUNTY = col_double(),
## Year = col_double(),
## Estimate = col_double()
## )
Your first task is to combine and tidy the 2020, 2021, and 2022 COVID data sets and find the total deaths and cases for each day since March 15, 2020 (2020-03-15). The data sets provided from the NY Times also includes statistics from Puerto Rico, a US territory. You may remove these observations from the data as they will not be needed for your analysis. Once you have tidied the data, find the total COVID-19 cases and deaths since March 15, 2020. Write a sentence or two after the code block communicating your results. Use inline code to include the max_date, us_total_cases, and us_total_deaths variables. To write inline code use r.
# Combine and tidy the 2020, 2021, and 2022 COVID data sets.
# Hint: Review the rbind() documentation to combine the three data sets.
# First we need to check that the 3 files have the same columns and data types before we can rbind them.
str(us_counties_2020)
## tibble [884,737 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:884737], format: "2020-01-21" "2020-01-22" ...
## $ county: chr [1:884737] "Snohomish" "Snohomish" "Snohomish" "Cook" ...
## $ state : chr [1:884737] "Washington" "Washington" "Washington" "Illinois" ...
## $ fips : chr [1:884737] "53061" "53061" "53061" "17031" ...
## $ cases : num [1:884737] 1 1 1 1 1 1 1 1 1 1 ...
## $ deaths: num [1:884737] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. county = col_character(),
## .. state = col_character(),
## .. fips = col_character(),
## .. cases = col_double(),
## .. deaths = col_double()
## .. )
str(us_counties_2021)
## tibble [1,185,373 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:1185373], format: "2021-01-01" "2021-01-01" ...
## $ county: chr [1:1185373] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ state : chr [1:1185373] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ fips : chr [1:1185373] "01001" "01003" "01005" "01007" ...
## $ cases : num [1:1185373] 4239 13823 1517 1854 4693 ...
## $ deaths: num [1:1185373] 50 169 33 46 63 22 45 157 63 22 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. county = col_character(),
## .. state = col_character(),
## .. fips = col_character(),
## .. cases = col_double(),
## .. deaths = col_double()
## .. )
str(us_counties_2022)
## tibble [1,188,042 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:1188042], format: "2022-01-01" "2022-01-01" ...
## $ county: chr [1:1188042] "Autauga" "Baldwin" "Barbour" "Bibb" ...
## $ state : chr [1:1188042] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ fips : chr [1:1188042] "01001" "01003" "01005" "01007" ...
## $ cases : num [1:1188042] 11018 39911 3860 4533 11256 ...
## $ deaths: num [1:1188042] 160 593 81 95 198 46 102 532 147 65 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. county = col_character(),
## .. state = col_character(),
## .. fips = col_character(),
## .. cases = col_double(),
## .. deaths = col_double()
## .. )
We do confirm that the 3 datasets has the same structure so now we can proceed with merging the datasets into one.
We will also run a check that the total number of rows of the new dataset equal the sum of the number for rows for the 3 files to confirm the rbind process was successful.
us_counties_total <- rbind(us_counties_2020, us_counties_2021, us_counties_2022)
nrow(us_counties_total) == nrow(us_counties_2020) + nrow(us_counties_2021) + nrow(us_counties_2022)
## [1] TRUE
Now we will need to check if the dataset is clean and tidy. The first step is to check that there is no duplicate rows.
sum(duplicated(us_counties_total))
## [1] 0
We will check the range of date for the data that has been collected.
range(us_counties_total$date)
## [1] "2020-01-21" "2022-12-31"
We need to check the unique counties included in the dataset and count it as well. we can use unique() command but it will show a lot of counties.
The total number of unique counties included as below is 1856 while the current total number of counties as per the resources found online was 3,143. We need to understand why the data for other counties was not included.
n_distinct(us_counties_total$county)
## [1] 1932
Next we need to count unique states in USA.
n_distinct(us_counties_total$state)
## [1] 56
The total is showing 55 state while it should be 50, which means there might be a mistake in the name of some states or there is other classification like unknown, we will need a list of the sorted unique states.
sort(unique(us_counties_total$state))
## [1] "Alabama" "Alaska"
## [3] "American Samoa" "Arizona"
## [5] "Arkansas" "California"
## [7] "Colorado" "Connecticut"
## [9] "Delaware" "District of Columbia"
## [11] "Florida" "Georgia"
## [13] "Guam" "Hawaii"
## [15] "Idaho" "Illinois"
## [17] "Indiana" "Iowa"
## [19] "Kansas" "Kentucky"
## [21] "Louisiana" "Maine"
## [23] "Maryland" "Massachusetts"
## [25] "Michigan" "Minnesota"
## [27] "Mississippi" "Missouri"
## [29] "Montana" "Nebraska"
## [31] "Nevada" "New Hampshire"
## [33] "New Jersey" "New Mexico"
## [35] "New York" "North Carolina"
## [37] "North Dakota" "Northern Mariana Islands"
## [39] "Ohio" "Oklahoma"
## [41] "Oregon" "Pennsylvania"
## [43] "Puerto Rico" "Rhode Island"
## [45] "South Carolina" "South Dakota"
## [47] "Tennessee" "Texas"
## [49] "Utah" "Vermont"
## [51] "Virgin Islands" "Virginia"
## [53] "Washington" "West Virginia"
## [55] "Wisconsin" "Wyoming"
There is no misspelling in state names but there is 5 states as following which are not included in the official list of states.
American Samoa, District of Columbia, Guam, Northern Mariana Islands & Virgin Islands.
Now we will check the range of values for deaths and cases.
range(us_counties_total$deaths)
## [1] NA NA
range(us_counties_total$cases)
## [1] 0 3632440
Now we will Check if there is any NA values.
filter(us_counties_total, is.na(deaths))
## # A tibble: 75,701 x 6
## date county state fips cases deaths
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-05-05 Adjuntas Puerto Rico 72001 3 NA
## 2 2020-05-05 Aguada Puerto Rico 72003 7 NA
## 3 2020-05-05 Aguadilla Puerto Rico 72005 11 NA
## 4 2020-05-05 Aguas Buenas Puerto Rico 72007 22 NA
## 5 2020-05-05 Aibonito Puerto Rico 72009 13 NA
## 6 2020-05-05 Anasco Puerto Rico 72011 5 NA
## 7 2020-05-05 Arecibo Puerto Rico 72013 43 NA
## 8 2020-05-05 Arroyo Puerto Rico 72015 5 NA
## 9 2020-05-05 Barceloneta Puerto Rico 72017 3 NA
## 10 2020-05-05 Barranquitas Puerto Rico 72019 15 NA
## # … with 75,691 more rows
filter(us_counties_total, is.na(cases))
## # A tibble: 0 x 6
## # … with 6 variables: date <date>, county <chr>, state <chr>, fips <chr>,
## # cases <dbl>, deaths <dbl>
Now we will calculate the required details.
max_date <- max(us_counties_total$deaths)
us_total_cases <- max(us_counties_total$cases)
us_total_deaths <- max(us_counties_total$deaths)
cbind(max_date, us_total_cases, us_total_deaths)
## max_date us_total_cases us_total_deaths
## [1,] NA 3632440 NA
Now after we have completed cleaning the dataset, we can follow the steps for completing the required analysis.
Now we will calculate the total deaths and cases per day across USA.
# Your output should look similar to the following tibble:
##
A tibble: 657 x 3
# date total_deaths total_cases
# <date> <dbl> <dbl>
# 1 2020-03-15 68 3595
# 2 2020-03-16 91 4502
# 3 2020-03-17 117 5901
# 4 2020-03-18 162 8345
# 5 2020-03-19 212 12387
# 6 2020-03-20 277 17998
# 7 2020-03-21 359 24507
# 8 2020-03-22 457 33050
# 9 2020-03-23 577 43474
# 10 2020-03-24 783 53899
# ... with 647 more rows
#
sum_us_counties_total <- us_counties_total %>%
filter(date >= "2020-03-15") %>%
group_by(date) %>%
summarise(total_deaths = sum(deaths), total_cases = sum(cases)) %>%
select(date, total_deaths, total_cases)
## `summarise()` ungrouping output (override with `.groups` argument)
head(sum_us_counties_total)
## # A tibble: 6 x 3
## date total_deaths total_cases
## <date> <dbl> <dbl>
## 1 2020-03-15 68 3600
## 2 2020-03-16 91 4507
## 3 2020-03-17 117 5906
## 4 2020-03-18 162 8350
## 5 2020-03-19 212 12393
## 6 2020-03-20 277 18012
– Communicate your methodology, results, and interpretation here –
The total number of new deaths and new cases has increased on daily basis from 2020-03-15 tell 2022-12-31
Create a visualization for the total number of deaths and cases in the US since March 15, 2020. Before you create your visualization, review the types of plots you can create using the ggplot2 library and think about which plots would be effective in communicating your results. After you have created your visualization, write a few sentences describing your visualization. How could the plot be interpreted? Could it be misleading?
# Create a visualization for the total number of US cases and deaths since March 15, 2020.
pivot_sum_us_counties_total <- sum_us_counties_total %>% pivot_longer(!date)
ggplot(pivot_sum_us_counties_total, aes(x = date, y = value)) +
geom_line(aes(color = name)) +
labs(title = "Compare Total Cases and Deaths over time", y = "Total")
## Warning: Removed 971 row(s) containing missing values (geom_path).
First we tried to plot two lines side by side visualization but since the number of cases is much higher than the number of deaths, the visualization will be misleading, it is showing the total number of cases increase over time but it looks like total number of deaths is steady over time which is not correct.
ggplot(pivot_sum_us_counties_total, aes(x = date, y = value, fill = name)) +
geom_col(position = "Dodge") +
labs(title = "Compare Total Cases and Deaths over time", y = "Total")
## Warning: Removed 971 rows containing missing values (geom_col).
Next when we try to plot two columns side by side, we still have the same issue that the number of cases is much higher than the death numbers, we can hardly see that the total number of deaths is increasing over time.
ggplot(sum_us_counties_total, aes(x = date, y = total_cases, fill = total_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare Total Cases and Deaths over time", y = "Total")
– Communicate your methodology, results, and interpretation here –
We can plot a column chart with fill of the second variable, it shows that both variables increase over time but it takes time for people to find the total number for each variable. This will be more clear than the first two plots, but again sine the value of deaths and cases vary significantly having one visulization that include both variables will be hard.
While it is important to know the total deaths and cases throughout the COVID-19 pandemic, it is also important for local and state health officials to know the the number of new cases and deaths each day to understand how rapidly the virus is spreading. Using the table you created in Question 1, calculate the number of new deaths and cases each day and a seven-day average of new deaths and cases. Once you have organized your data, find the days that saw the largest number of new cases and deaths. Write a sentence or two after the code block communicating your results.
# Create a new table, based on the table from Question 1, and calculate the number of new deaths and cases each day and a seven day average of new deaths and cases.
#
# Hint: Look at the documentation for lag() when computing the number of new deaths and cases and the seven-day averages.
#
#
## YOUR CODE HERE ##
# I will use the original dataset in order calculate the new cases and new deaths in 2020-03-15 as the we need to include the date from 2020-03-14 to calculate the new cases and deaths on 2020-03-15
new_us_counties_total <- us_counties_total %>%
filter(date >= "2020-03-14") %>%
group_by(date) %>%
summarise(total_deaths = sum(deaths), total_cases = sum(cases)) %>%
select(date, total_deaths, total_cases) %>%
mutate(new_deaths = total_deaths - lag(total_deaths), new_cases = total_cases - lag(total_cases))
## `summarise()` ungrouping output (override with `.groups` argument)
# Remove first row
new_us_counties_total <- new_us_counties_total[-1, ]
head(new_us_counties_total)
## # A tibble: 6 x 5
## date total_deaths total_cases new_deaths new_cases
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-15 68 3600 8 702
## 2 2020-03-16 91 4507 23 907
## 3 2020-03-17 117 5906 26 1399
## 4 2020-03-18 162 8350 45 2444
## 5 2020-03-19 212 12393 50 4043
## 6 2020-03-20 277 18012 65 5619
Find the date with maximum new_cases and new_deaths
new_us_counties_total %>%
filter(new_deaths == max(new_deaths))
## # A tibble: 0 x 5
## # … with 5 variables: date <date>, total_deaths <dbl>, total_cases <dbl>,
## # new_deaths <dbl>, new_cases <dbl>
new_us_counties_total %>%
filter(new_cases == max(new_cases))
## # A tibble: 1 x 5
## date total_deaths total_cases new_deaths new_cases
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2022-01-10 NA 61600909 NA 1433977
We can add day of the week and month column in order to aggregate the results per day or month to see how the cases relates to the day of the week or month.
new_us_counties_total <- mutate(new_us_counties_total, day_of_week = wday(date, label = TRUE, abbr = FALSE))
new_us_counties_total$day_of_week <- factor(new_us_counties_total$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
new_us_counties_total <- mutate(new_us_counties_total, month = month.abb[month(ymd(new_us_counties_total$date))])
new_us_counties_total$month <- factor(new_us_counties_total$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
head(new_us_counties_total)
## # A tibble: 6 x 7
## date total_deaths total_cases new_deaths new_cases day_of_week month
## <date> <dbl> <dbl> <dbl> <dbl> <ord> <fct>
## 1 2020-03-15 68 3600 8 702 Sunday Mar
## 2 2020-03-16 91 4507 23 907 Monday Mar
## 3 2020-03-17 117 5906 26 1399 Tuesday Mar
## 4 2020-03-18 162 8350 45 2444 Wednesday Mar
## 5 2020-03-19 212 12393 50 4043 Thursday Mar
## 6 2020-03-20 277 18012 65 5619 Friday Mar
Calculate 7 days average new cases and new deaths.
# Your output should look similar to the following tibble:
#
# date
# total_deaths > the cumulative number of deaths up to and including the associated date
# total_cases > the cumulative number of cases up to and including the associated date
# delta_deaths_1 > the number of new deaths since the previous day
# delta_cases_1 > the number of new cases since the previous day
# delta_deaths_7 > the average number of deaths in a seven-day period
# delta_cases_7 > the average number of cases in a seven-day period
#==
# A tibble: 813 x 7
# date total_deaths total_cases delta_deaths_1 delta_cases_1 delta_deaths_7 delta_cases_7
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2020-03-15 68 3600 0 0 NA NA
# 2 2020-03-16 91 4507 23 907 NA NA
# 3 2020-03-17 117 5906 26 1399 NA NA
# 4 2020-03-18 162 8350 45 2444 NA NA
# 5 2020-03-19 212 12393 50 4043 NA NA
# 6 2020-03-20 277 18012 65 5619 NA NA
# 7 2020-03-21 360 24528 83 6516 NA NA
# 8 2020-03-22 458 33073 98 8545 55.7 4210.
# 9 2020-03-23 579 43505 121 10432 69.7 5571.
# 10 2020-03-24 785 53938 206 10433 95.4 6862.
# ... with 803 more rows
new_us_counties_total <- new_us_counties_total %>%
mutate(avg_week_deaths = (new_deaths) + lead(new_deaths) + lead(new_deaths, 2) + lead(new_deaths, 3) + lead(new_deaths, 4) + lead(new_deaths, 5) + lead(new_deaths, 6) / 7, avg_week_cases = (new_cases + lead(new_cases) + lead(new_cases, 2) + lead(new_cases, 3) + lead(new_cases, 4) + lead(new_cases, 5) + lead(new_cases, 6) / 7))
# show a tibble with date, new deaths, new cases, avg weekly new deaths and avg weekly new cases
head(select(new_us_counties_total, date, new_deaths, new_cases, avg_week_deaths, avg_week_cases))
## # A tibble: 6 x 5
## date new_deaths new_cases avg_week_deaths avg_week_cases
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-15 8 702 229. 16045.
## 2 2020-03-16 23 907 306 22149.
## 3 2020-03-17 26 1399 384. 30056.
## 4 2020-03-18 45 2444 491. 39089.
## 5 2020-03-19 50 4043 661. 47679.
## 6 2020-03-20 65 5619 885. 58607.
Find the week with the highest new cases and the week with the highest new death cases.
new_us_counties_total %>%
filter(avg_week_deaths == max(avg_week_deaths, na.rm = TRUE))
## # A tibble: 1 x 9
## date total_deaths total_cases new_deaths new_cases day_of_week month
## <date> <dbl> <dbl> <dbl> <dbl> <ord> <fct>
## 1 2020-04-13 26613 584018 1764 25769 Monday Apr
## # … with 2 more variables: avg_week_deaths <dbl>, avg_week_cases <dbl>
new_us_counties_total %>%
filter(avg_week_cases == max(avg_week_cases, na.rm = TRUE))
## # A tibble: 1 x 9
## date total_deaths total_cases new_deaths new_cases day_of_week month
## <date> <dbl> <dbl> <dbl> <dbl> <ord> <fct>
## 1 2022-01-10 NA 61600909 NA 1433977 Monday Jan
## # … with 2 more variables: avg_week_deaths <dbl>, avg_week_cases <dbl>
The week with the highest average weekly new deaths was 2021-01-06 to 2021-01-12 while the single day with highest new deaths was later that year on 2022-11-11.
The week with the highest average weekly new cases was 2022-01-10 to 2022-01-16 will the single day with highest new cases was in the same week on 2022-01-10.
We need to know why the day with the highest new death count was not in the same week with the highest average weekly death.
We will filter the date to see the the daily death counts and compare it between the two periods.
new_us_counties_total %>%
filter(between(date, as.Date("2021-01-06"), as.Date("2021-01-12"))) %>%
select(date, new_deaths, avg_week_deaths)
## # A tibble: 7 x 3
## date new_deaths avg_week_deaths
## <date> <dbl> <dbl>
## 1 2021-01-06 NA NA
## 2 2021-01-07 NA NA
## 3 2021-01-08 NA NA
## 4 2021-01-09 NA NA
## 5 2021-01-10 NA NA
## 6 2021-01-11 NA NA
## 7 2021-01-12 NA NA
new_us_counties_total %>%
filter(between(date, as.Date("2022-11-5"), as.Date("2022-11-17"))) %>%
select(date, new_deaths, avg_week_deaths)
## # A tibble: 13 x 3
## date new_deaths avg_week_deaths
## <date> <dbl> <dbl>
## 1 2022-11-05 NA NA
## 2 2022-11-06 NA NA
## 3 2022-11-07 NA NA
## 4 2022-11-08 NA NA
## 5 2022-11-09 NA NA
## 6 2022-11-10 NA NA
## 7 2022-11-11 NA NA
## 8 2022-11-12 NA NA
## 9 2022-11-13 NA NA
## 10 2022-11-14 NA NA
## 11 2022-11-15 NA NA
## 12 2022-11-16 NA NA
## 13 2022-11-17 NA NA
We can see that the daily deaths in the week with the highest average new deaths has a very near high values, while toward the end of the year the few days before 2022-11-11 and the few days after 2022-11-11 had a very minimum new death count and there was a huge spike in the new death count on 2022-11-11. So we need to know what happened exactly on that day.
All the resources found online suggests that the number of daily deaths during November 2022 was around 400+ deaths daily and there has been no spike on new death counts on 2022-11-11.
While if we calculate the average new daily deaths during November in our dataset we will find that the average daily deaths will be 720,5. This might indicate an issue in the dataset or an issue with reporting the numbers on that specific day.
new_us_counties_total %>%
filter(between(date, as.Date("2022-11-01"), as.Date("2022-11-30"))) %>%
summarise(avg_daily_death = mean(new_deaths))
## # A tibble: 1 x 1
## avg_daily_death
## <dbl>
## 1 NA
Let us visualize the new cases and new deaths, trying to visualize the data over 3 years period will make the visualization crowded and therefore it will be hard to spot the difference, therefore we will make 3 visualization for each year 2020, 2021 & 2022.
new_us_counties_total %>%
filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare New Cases and Deaths over time - 2020", y = "Total Cases")
new_us_counties_total %>%
filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare New cases and Deaths over time - 2021", y = "Total Cases")
new_us_counties_total %>%
filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
ggplot(aes(x = date, y = new_cases, fill = new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare New Cases and Deaths over time - 2022", y = "Total Cases")
A better way of doing it is by aggregating the results on monthly basis and then try to plot the new case and new deaths again.
new_us_counties_total %>%
filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
group_by(month) %>%
summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths Monthly - 2020", y = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
new_us_counties_total %>%
filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
group_by(month) %>%
summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths Monthly - 2021", y = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
new_us_counties_total %>%
filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
group_by(month) %>%
summarise(m_new_cases = sum(new_cases), m_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = month, y = m_new_cases, fill = m_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths Monthly - 2022", y = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
We can also make a visualization for the day of the week
new_us_counties_total %>%
filter(between(date, as.Date("2020-03-15"), as.Date("2020-12-31"))) %>%
group_by(day_of_week) %>%
summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths per Day of Week - 2020", y = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
new_us_counties_total %>%
filter(between(date, as.Date("2021-01-01"), as.Date("2021-12-31"))) %>%
group_by(day_of_week) %>%
summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths per Day of Week - 2021", y_lab = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
new_us_counties_total %>%
filter(between(date, as.Date("2022-01-01"), as.Date("2022-12-31"))) %>%
group_by(day_of_week) %>%
summarise(w_new_cases = sum(new_cases), w_new_deaths = sum(new_deaths)) %>%
ggplot(aes(x = day_of_week, y = w_new_cases, fill = w_new_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare NEW Cases and Deaths per Day of Week - 2022", y_lab = "Total Cases")
## `summarise()` ungrouping output (override with `.groups` argument)
we can see that constantly over the 3 years, Saturday and Sunday had the lowest new cases count, however Wednesday has been always with the highest death count.
– Communicate your methodology, results, and interpretation here –
We have now a list of new deaths new cases with weekly averages new cases and deaths.
# Your output should look similar to the following tibble:
#
# date
# total_deaths > the cumulative number of deaths up to and including the associated date
# total_cases > the cumulative number of cases up to and including the associated date
# delta_deaths_1 > the number of new deaths since the previous day
# delta_cases_1 > the number of new cases since the previous day
# delta_deaths_7 > the average number of deaths in a seven-day period
# delta_cases_7 > the average number of cases in a seven-day period
#==
# A tibble: 657 x 7
# date total_deaths total_cases delta_deaths_1 delta_cases_1 delta_deaths_7 delta_cases_7
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2020-03-15 0.0205 1.08 0 0 NA NA
# 2 2020-03-16 0.0275 1.36 0.00694 0.274 NA NA
# 3 2020-03-17 0.0353 1.78 0.00784 0.422 NA NA
# 4 2020-03-18 0.0489 2.52 0.0136 0.737 NA NA
# 5 2020-03-19 0.0640 3.74 0.0151 1.22 NA NA
# 6 2020-03-20 0.0836 5.43 0.0196 1.69 NA NA
# 7 2020-03-21 0.108 7.39 0.0247 1.96 NA NA
# 8 2020-03-22 0.138 9.97 0.0296 2.58 0.0168 1.27
# 9 2020-03-23 0.174 13.1 0.0362 3.14 0.0209 1.68
# 10 2020-03-24 0.236 16.3 0.0621 3.14 0.0287 2.07
# Create a new table, based on the table from Question 3, and calculate the number of new deaths and cases per 100,000 people each day and a seven day average of new deaths and cases per 100,000 people.
# Hint: To calculate per 100,000 people, first tidy the population estimates data and calculate the US population in 2020 and 2021. Then, you will need to divide each statistic by the estimated population and then multiply by 100,000.
#
# Hint: look at the help documentation for grepl() and case_when() to divide the averages by the US population for each year.
# For example, take the simple tibble, t_new:
#
# x y
# <int> <chr>
# 1 a
# 2 b
# 3 a
# 4 b
# 5 a
# 6 b
#
#
# To add a column, z, that is dependent on the value in y, you could:
#
# t_new %>%
# mutate(z = case_when(grepl("a", y) ~ "not b",
# grepl("b", y) ~ "not a"))
#
## YOUR CODE HERE ##
# First we need to check and tidy us_population_estimates if needed
str(us_population_estimates)
## tibble [6,286 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ STNAME : chr [1:6286] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ CTYNAME : chr [1:6286] "Autauga County" "Autauga County" "Baldwin County" "Baldwin County" ...
## $ fips : num [1:6286] 1001 1001 1003 1003 1005 ...
## $ STATE : num [1:6286] 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY : num [1:6286] 1 1 3 3 5 5 7 7 9 9 ...
## $ Year : num [1:6286] 2020 2021 2020 2021 2020 ...
## $ Estimate: num [1:6286] 58877 59095 233140 239294 25180 ...
## - attr(*, "spec")=
## .. cols(
## .. STNAME = col_character(),
## .. CTYNAME = col_character(),
## .. fips = col_double(),
## .. STATE = col_double(),
## .. COUNTY = col_double(),
## .. Year = col_double(),
## .. Estimate = col_double()
## .. )
sum(duplicated(us_population_estimates))
## [1] 0
range(us_population_estimates$Year)
## [1] 2020 2021
range(us_population_estimates$Estimate)
## [1] 57 9989165
n_distinct(us_population_estimates$STNAME)
## [1] 51
n_distinct(us_population_estimates$CTYNAME)
## [1] 1878
Everything looks fine, here we can see that the number of states is 51 and not 55 as per the first dataset.
# Calculate the total number of population estimate for 2020, 2021
year_us_population_estimates <- us_population_estimates %>%
group_by(Year) %>%
summarise(year_population = sum(Estimate))
## `summarise()` ungrouping output (override with `.groups` argument)
year_us_population_estimates
## # A tibble: 2 x 2
## Year year_population
## <dbl> <dbl>
## 1 2020 331501080
## 2 2021 331893745
# let us get a tibble of the data we need from new_us_countries dataset
p_new_us_counties_total <- new_us_counties_total %>%
select(date, new_cases, new_deaths, avg_week_cases, avg_week_deaths) %>%
# add new column called year
mutate(Year = year(date)) %>%
# Now we can easily add the population total by doing left join
left_join(year_us_population_estimates) %>%
# since there is no estimates for population in 2022, we will exclude the data for 2022 from our records
filter(Year != 2022) %>%
# Calculate the required fields
mutate(p_new_cases = (new_cases / year_population) * 100000, p_new_deaths = (new_deaths / year_population) * 100000, p_avg_week_cases = (avg_week_cases / year_population) * 100000, p_avg_week_deaths = (avg_week_deaths / year_population) * 100000)
## Joining, by = "Year"
p_new_us_counties_total
## # A tibble: 657 x 11
## date new_cases new_deaths avg_week_cases avg_week_deaths Year
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-03-15 702 8 16045. 229. 2020
## 2 2020-03-16 907 23 22149. 306 2020
## 3 2020-03-17 1399 26 30056. 384. 2020
## 4 2020-03-18 2444 45 39089. 491. 2020
## 5 2020-03-19 4043 50 47679. 661. 2020
## 6 2020-03-20 5619 65 58607. 885. 2020
## 7 2020-03-21 6516 83 70034. 1135. 2020
## 8 2020-03-22 8545 98 81381. 1480. 2020
## 9 2020-03-23 10432 121 93539. 1893. 2020
## 10 2020-03-24 10433 206 102048 2220. 2020
## # … with 647 more rows, and 5 more variables: year_population <dbl>,
## # p_new_cases <dbl>, p_new_deaths <dbl>, p_avg_week_cases <dbl>,
## # p_avg_week_deaths <dbl>
– Communicate your methodology, results, and interpretation here –
We have merged the two datasets and perform join to one datasets to calculate new deaths and cases and average weekly new deaths and cases.
# Create a visualization to compare the seven-day average cases and deaths per 100,000 people.
p_new_us_counties_total %>%
filter(between(date, as.Date("2020-03-15"), as.Date("2021-12-31"))) %>%
ggplot(aes(x = date, y = p_avg_week_cases, fill = p_avg_week_deaths)) +
geom_col(position = "Dodge") +
labs(title = "Compare Seven-day Average Cases and Deaths", subtitle = "per 100,000 population", y = "Total")
– Communicate your methodology, results, and interpretation here –
The visualization for the average weekly new cases and new deaths per 100,000 population will add no additional information as we are comparing the states nation wide, however this will be helpful in next part when we compare US state comparison, then average weekly states per 100,000 population for that state will give additional insights on the most and least affected states.