Note: This assignment is due by 11:59pm on Sunday, March 20. You may consult with other students while completing this assignment, but the work you turn in should be your own, in your own words.
This assignment is meant to give you practice in using some of the main features of the tidyr and dplyr packages, part of R’s tidyverse. For this, we will use some highly topical data: COVID-19 case data for the United States. The instructions below may look quite long, but don’t panic! They are meant to gently guide you through a number of operations that a typical analyst may undertake when working with this data. You are not assumed to know anything in particular about epidemiology or public health. Each step has checks where you can verify that you’ve done things correctly, so you will not be working blind!
There are a number of sources for COVID-19 data; we will use the data from Johns Hopkins University here. Go to the following website, which serves as JHU’s public repository:
https://github.com/CSSEGISandData/COVID-19
Navigate to the folder “csse_covid_19_data”, and then into the folder “csse_covid_19_time_series”. Download the file named “time_series_covid19_confirmed_US.csv”. You may need to right-click or use “Save As…” to properly save this data as a .csv file.
library(readr)
covid_raw <- read_csv("/Users/Claire/Downloads/time_series_covid19_confirmed_US.csv")
## Rows: 3342 Columns: 799
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (793): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(covid_raw)
## [1] 3342
ncol(covid_raw)
## [1] 799
| Old Name | New Name |
|---|---|
| Province_State | state |
| Admin2 | county |
| (All of the date columns) | (Keep all of the original names) |
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.8
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.2.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
covid <- select(covid_raw, state = Province_State, county = Admin2, contains("/"))
nrow(covid)
## [1] 3342
ncol(covid)
## [1] 790
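As a sanity check on the column count: `select()` keeps only the columns it is given and renames any written as `new = old`, so the result should hold the 2 renamed columns plus the 788 date columns (799 original columns minus the 11 non-date ones). A minimal sketch on a made-up table follows; the tibble `toy_raw` and its values are hypothetical, and only the column names mirror the JHU file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for covid_raw: three non-date columns, two date columns
toy_raw <- tibble(
  UID            = 1,
  Admin2         = "A",
  Province_State = "Minnesota",
  `1/22/20`      = 0,
  `1/23/20`      = 1
)

# Rename two columns, keep every column whose name contains "/"
toy <- select(toy_raw, state = Province_State, county = Admin2, contains("/"))
names(toy)

# The same count for the real data: 2 renamed columns + (799 - 11) date columns
2 + (799 - 11)
```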
covid <- filter(covid, state == "Minnesota")
nrow(covid)
## [1] 89
ncol(covid)
## [1] 790
covid_long <- pivot_longer(covid, cols = contains("/"), names_to = "date", values_to = "c_cases")
## # A tibble: 10 × 4
## state county date c_cases
## <chr> <chr> <chr> <dbl>
## 1 Minnesota Aitkin 1/22/20 0
## 2 Minnesota Aitkin 1/23/20 0
## 3 Minnesota Aitkin 1/24/20 0
## 4 Minnesota Aitkin 1/25/20 0
## 5 Minnesota Aitkin 1/26/20 0
## 6 Minnesota Aitkin 1/27/20 0
## 7 Minnesota Aitkin 1/28/20 0
## 8 Minnesota Aitkin 1/29/20 0
## 9 Minnesota Aitkin 1/30/20 0
## 10 Minnesota Aitkin 1/31/20 0
nrow(covid_long)
## [1] 70132
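The row count after pivoting can be confirmed by arithmetic: each of the 89 Minnesota rows contributes one long row per date column, and the wide table had 790 - 2 = 788 date columns. The sketch below shows the same relationship on a small made-up table; the tibble `toy` and its values are hypothetical, not taken from the JHU data.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: 2 counties, 3 date columns (values are made up)
toy <- tibble(
  state     = c("Minnesota", "Minnesota"),
  county    = c("A", "B"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 0),
  `1/24/20` = c(3, 2)
)

toy_long <- pivot_longer(toy, cols = contains("/"),
                         names_to = "date", values_to = "c_cases")

# One long row per (county, date) pair
nrow(toy_long) == nrow(toy) * (ncol(toy) - 2)

# The same arithmetic for the real data: 89 rows x 788 date columns
89 * (790 - 2)
```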
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
covid_long <- mutate(covid_long, date = mdy(date))
## # A tibble: 10 × 4
## state county date c_cases
## <chr> <chr> <date> <dbl>
## 1 Minnesota Aitkin 2020-01-22 0
## 2 Minnesota Aitkin 2020-01-23 0
## 3 Minnesota Aitkin 2020-01-24 0
## 4 Minnesota Aitkin 2020-01-25 0
## 5 Minnesota Aitkin 2020-01-26 0
## 6 Minnesota Aitkin 2020-01-27 0
## 7 Minnesota Aitkin 2020-01-28 0
## 8 Minnesota Aitkin 2020-01-29 0
## 9 Minnesota Aitkin 2020-01-30 0
## 10 Minnesota Aitkin 2020-01-31 0
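The conversion matters because the original `date` values are character strings, which cannot be compared chronologically. A small sketch of what `mdy()` does with strings in the same month/day/two-digit-year form as the JHU column names; the vector `raw_dates` is made up for illustration.

```r
library(lubridate)

# Hypothetical date strings in the JHU column-name format
raw_dates <- c("1/22/20", "12/31/21", "2/15/22")

# mdy() parses month/day/year strings into Date objects
parsed <- mdy(raw_dates)
class(parsed)

# Once parsed, the values can be compared against a cutoff date
parsed <= as.Date("2022-02-15")
```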
covid_long <- filter(covid_long, date <= as.Date("2022-02-15"))
nrow(covid_long)
## [1] 67284
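This row count can also be verified independently: the data start on 2020-01-22 (the first date column), so the filter keeps one row per county for every day from then through 2022-02-15. A short sketch of that arithmetic, assuming all 89 rows have a value for every date:

```r
# Count the calendar days from the first date column through the cutoff, inclusive
first_day <- as.Date("2020-01-22")
last_day  <- as.Date("2022-02-15")
n_days    <- length(seq(first_day, last_day, by = "day"))
n_days

# Expected number of rows: one per county row per day
n_days * 89
```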
covid_long <- group_by(covid_long, county)
covid_long <- arrange(covid_long, date)
## # A tibble: 10 × 4
## # Groups: county [10]
## state county date c_cases
## <chr> <chr> <date> <dbl>
## 1 Minnesota Aitkin 2020-01-22 0
## 2 Minnesota Anoka 2020-01-22 0
## 3 Minnesota Becker 2020-01-22 0
## 4 Minnesota Beltrami 2020-01-22 0
## 5 Minnesota Benton 2020-01-22 0
## 6 Minnesota Big Stone 2020-01-22 0
## 7 Minnesota Blue Earth 2020-01-22 0
## 8 Minnesota Brown 2020-01-22 0
## 9 Minnesota Carlton 2020-01-22 0
## 10 Minnesota Carver 2020-01-22 0
covid_long <- mutate(covid_long, cases = c_cases - lag(c_cases, n = 1L, default = 0))
## # A tibble: 10 × 5
## # Groups: county [10]
## state county date c_cases cases
## <chr> <chr> <date> <dbl> <dbl>
## 1 Minnesota Aitkin 2020-01-22 0 0
## 2 Minnesota Anoka 2020-01-22 0 0
## 3 Minnesota Becker 2020-01-22 0 0
## 4 Minnesota Beltrami 2020-01-22 0 0
## 5 Minnesota Benton 2020-01-22 0 0
## 6 Minnesota Big Stone 2020-01-22 0 0
## 7 Minnesota Blue Earth 2020-01-22 0 0
## 8 Minnesota Brown 2020-01-22 0 0
## 9 Minnesota Carlton 2020-01-22 0 0
## 10 Minnesota Carver 2020-01-22 0 0
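Because the first ten rows are all zeros, they do not show whether the daily counts are right. The sketch below repeats the same steps on a small made-up table to illustrate why the grouping matters: within a group, `lag()` looks at the previous row of the same county, whereas without grouping it looks at whatever row happens to come before. The tibble `toy_long` and its values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical cumulative counts for two counties over three days
toy_long <- tibble(
  county  = c("A", "A", "A", "B", "B", "B"),
  date    = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                      "2020-03-01", "2020-03-02", "2020-03-03")),
  c_cases = c(1, 4, 6, 0, 2, 2)
)

# Same steps as above: group by county, sort by date, difference the cumulative counts
toy_daily <- group_by(toy_long, county)
toy_daily <- arrange(toy_daily, date)
toy_daily <- mutate(toy_daily, cases = c_cases - lag(c_cases, n = 1L, default = 0))
toy_daily

# For comparison only: dropping the grouping subtracts one county's count from another's
toy_wrong <- mutate(ungroup(toy_daily), cases = c_cases - lag(c_cases, n = 1L, default = 0))
toy_wrong$cases
```

In the grouped result, county A's daily counts are 1, 3, 2 and county B's are 0, 2, 0; the ungrouped version produces negative values, which is a quick sign that something has gone wrong.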
# Sum the daily counts (cases), not the cumulative counts (c_cases), to get each county's total
covid_last <- summarize(covid_long, total_cases = sum(cases))
nrow(covid_last)
## [1] 89
ncol(covid_last)
## [1] 2
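One way to check a county total is to remember that the daily counts must add back up to the final cumulative count, while adding up the cumulative counts themselves counts earlier cases again on every later day. A minimal sketch on made-up numbers; the tibble `toy_long` and the column names `final_cumulative` and `summed_cumulative` are hypothetical and exist only for this comparison.

```r
library(dplyr)
library(tibble)

# Hypothetical cumulative counts, already in date order within each county
toy_long <- tibble(
  county  = c("A", "A", "A", "B", "B", "B"),
  c_cases = c(1, 4, 6, 0, 2, 2)
)
toy_long <- group_by(toy_long, county)
toy_long <- mutate(toy_long, cases = c_cases - lag(c_cases, n = 1L, default = 0))

toy_last <- summarize(
  toy_long,
  total_cases       = sum(cases),     # sum of daily counts
  final_cumulative  = last(c_cases),  # last cumulative value; should match total_cases
  summed_cumulative = sum(c_cases)    # too large: earlier cases are counted repeatedly
)
toy_last
```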
load("/Users/Claire/Downloads/2020 Minnesota County Populations/mnpops.RData")
## # A tibble: 10 × 2
## county pop2020
## <chr> <dbl>
## 1 Hennepin 1281565
## 2 Ramsey 552352
## 3 Dakota 439882
## 4 Anoka 363887
## 5 Washington 267568
## 6 St. Louis 200231
## 7 Olmsted 162847
## 8 Stearns 158292
## 9 Scott 150928
## 10 Wright 141337
covid_last <- inner_join(covid_last, mnpops)
## Joining, by = "county"
covid_last <- mutate(covid_last, per100k = total_cases / pop2020 * 100000)
| county | total_cases | pop2020 | per100k |
|---|---|---|---|
| Benton | 13618 | 41379 | 32910.41 |
| Stearns | 49548 | 158292 | 31301.65 |
| Nobles | 6841 | 22290 | 30690.89 |
| Kandiyohi | 13029 | 43732 | 29792.83 |
| Mahnomen | 1596 | 5411 | 29495.47 |
| Clay | 18941 | 65318 | 28998.13 |
| Mower | 11577 | 40029 | 28921.53 |
| Wadena | 4045 | 14065 | 28759.33 |
| Freeborn | 8764 | 30895 | 28367.05 |
| Waseca | 5354 | 18968 | 28226.49 |
top10 <- slice_max(covid_last, per100k, n = 10)
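The rate and the ranking can be spot-checked by hand. The sketch below rebuilds three rows from the table above (the case totals and populations are copied from it; the tibble name `check` is made up), recomputes the rate, and confirms that `slice_max()` returns rows in descending order of `per100k`.

```r
library(dplyr)
library(tibble)

# Three rows copied from the table above, deliberately out of order
check <- tibble(
  county      = c("Stearns", "Benton", "Nobles"),
  total_cases = c(49548, 13618, 6841),
  pop2020     = c(158292, 41379, 22290)
)

# Cases per 100,000 residents
check <- mutate(check, per100k = total_cases / pop2020 * 100000)
check

# slice_max() keeps the n rows with the largest per100k, largest first
top2 <- slice_max(check, per100k, n = 2)
top2$county
```

If your Benton value does not come out near 32910, the most likely culprits are the county totals or the join with `mnpops`.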
What to turn in:
A simple text file, with a .R suffix, containing the commands you used for each of the parts. You do not have to include any output. Mark each part with a comment line, which begins with #. For example, if Part X asked you to add 1 and 1, and store the result in a variable called “a”, you might put this in the file:
# Part X
a <- 1 + 1
To further help with grading, please put your name in a comment line at the top of the file.
As always, let us know if you have questions or problems with this assignment. We hope you find it interesting, and that you learn one or two new things!
Dave, Divya, and Eunice…