R Markdown

Note: This assignment is due by 11:59pm on Sunday, March 20. You may consult with other students while completing this assignment, but the work you turn in should be your own, in your own words.

This assignment is meant to give you practice in using some of the main features of the tidyr and dplyr packages, part of R’s tidyverse. For this, we will use some highly topical data: COVID-19 case data for the United States. The instructions below may look quite long, but don’t panic! They are meant to gently guide you through a number of operations that a typical analyst may undertake when working with this data. You are not assumed to know anything in particular about epidemiology or public health. Each step has checks where you can verify that you’ve done things correctly, so you will not be working blind!

There are a number of sources for COVID-19 data; we will use the data from Johns Hopkins University here. Go to the following website, which serves as JHU’s public repository:

[https://github.com/CSSEGISandData/COVID-19] (Links to an external site.)

Navigate to the folder “csse_covid_19_data”, and then into the folder “csse_covid_19_time_series”. Download the file named “time_series_covid19_confirmed_US.csv”. You may need to right-click or use “Save As…” to properly save this data as a .csv file.

library(readr)
  1. Import this data into R, and name the resulting data frame “covid_raw”. You may use either base R or readr commands for this. It’s also fine to use the “Import Dataset” wizard in RStudio, but be sure to copy the commands used when preparing your assignment submission. It isn’t necessary to do any extra conversions of the data at this stage, as we will do them instead using tidyr and dplyr functions. As a check, your dataset should have 3342 rows, and over 700 columns, though the latter number will increase by one each day. As of Sunday, February 20, 2022, there were 771 columns.
covid_raw <- read_csv("/Users/Claire/Downloads/time_series_covid19_confirmed_US.csv")
## Rows: 3342 Columns: 799
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (793): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow (covid_raw)
## [1] 3342
ncol(covid_raw)
## [1] 799
  1. If you look at this data using View() in RStudio, you will note that it has many columns, including some that we will not need. Using the select() function in dplyr, select the following columns, reordering and renaming them according to the table below. Save this transformation to a new data frame called “covid”. It should still have 3342 rows, but 9 fewer columns than the raw data. Note: there are a number of ways to select all of the date columns instead of typing them all out individually. You are encouraged to look at the help file for the select() function to see all the options. We want you to explore and experiment!
Old Name New Name
Province_State state
Admin2 county
(All of the date columns) (Keep all of the original names)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.8
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.2.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
covid <- select(covid_raw, state= "Province_State", county= "Admin2", contains("/") )
nrow(covid)
## [1] 3342
ncol(covid)
## [1] 790
  1. As you can probably guess, this data file contains case data for every county in every state and territory in the US. To make things simpler, let’s just work with the data for Minnesota. Using filter(), limit the data to just the state of Minnesota. For this, which column should be examined, and what should the value be? That will give an indication of how to write the command. You can either overwrite the old covid data frame, or create a new one. In any case, the filtered data frame should contain 89 rows, and around 770+ columns. (Those of you who know Minnesota might wonder why there are 89 rows, since Minnesota has 87 counties. This is because there are two special categories of “Out of MN” and “Unassigned”. We won’t work with these categories explicitly in this homework.)
covid <- filter(covid, state== "Minnesota")
nrow(covid)
## [1] 89
ncol(covid)
## [1] 790
  1. We now have a data frame with the state, county, and date columns. As you probably know, this data is in wide format, which is mostly arranged for spreadsheets and human viewing. However, it is not in the best format for analysis with R, and needs to be turned into something longer. Using pivot_longer(), transform this data frame. County and state should be preserved (we won’t really need the state column at this point), but the dates should now all be in one column, called “date”, and the numbers should be in a column called “c_cases” (cumulative cases; more on this below). Save this new data frame as “covid_long”. The resulting data frame should have just 4 columns, but over 67,000(!) rows.
covid_long <- pivot_longer(covid, names_to = "date", values_to = "c_cases", cols = contains("/"))
## # A tibble: 10 × 4
##    state     county date    c_cases
##    <chr>     <chr>  <chr>     <dbl>
##  1 Minnesota Aitkin 1/22/20       0
##  2 Minnesota Aitkin 1/23/20       0
##  3 Minnesota Aitkin 1/24/20       0
##  4 Minnesota Aitkin 1/25/20       0
##  5 Minnesota Aitkin 1/26/20       0
##  6 Minnesota Aitkin 1/27/20       0
##  7 Minnesota Aitkin 1/28/20       0
##  8 Minnesota Aitkin 1/29/20       0
##  9 Minnesota Aitkin 1/30/20       0
## 10 Minnesota Aitkin 1/31/20       0
nrow(covid_long)
## [1] 70132
  1. The dates were read in as strings, so R does not have a sense that they are actual calendar dates. Using either the base R as.Date() function, or an appropriate function from the lubridate package, transform the date column from a string type to the Date type. Hint: dates were discussed in the Week 6 lab, as were some of the conversion functions. You can just save this transformation to the covid_long data frame, instead of making a new one.
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
covid_long <- mutate(covid_long, date=mdy(date))
## # A tibble: 10 × 4
##    state     county date       c_cases
##    <chr>     <chr>  <date>       <dbl>
##  1 Minnesota Aitkin 2020-01-22       0
##  2 Minnesota Aitkin 2020-01-23       0
##  3 Minnesota Aitkin 2020-01-24       0
##  4 Minnesota Aitkin 2020-01-25       0
##  5 Minnesota Aitkin 2020-01-26       0
##  6 Minnesota Aitkin 2020-01-27       0
##  7 Minnesota Aitkin 2020-01-28       0
##  8 Minnesota Aitkin 2020-01-29       0
##  9 Minnesota Aitkin 2020-01-30       0
## 10 Minnesota Aitkin 2020-01-31       0
  1. Using dplyr’s filter() command, limit the data to the days up to and including February 15, 2022. Again, you can just save this to covid_long. The resulting data frame should have 4 columns, and 67,284 rows. (This is done to make it easier to run checks on subsequent data, since everyone should now have the same data frame, regardless of which day they downloaded it.)
covid_long <- filter(covid_long, date <= as.Date("2022-02-15"))
nrow(covid_long)
## [1] 67284
  1. Using the group_by() command, group the data in covid_long by county. This will ensure that the following commands are done separately within each county. While this command will not result in an easily visible change, if you are using RStudio, clicking on the blue disclosure triangle in the Environment tab should show a lot more information than just the columns, including a mention of groups.
covid_long <- group_by(covid_long, county)
  1. While not strictly necessary in this case, as pivot_longer() should preserve chronological order, use the arrange() command to sort all the data by date. If you view the data at this point, View() should list all the counties in Minnesota, in alphabetical order, for January 22, 2020, and so on. (It’s good practice to do this for date data.) This should be an ascending sort, so that the earliest dates are at the top.
covid_long <- arrange(covid_long, date)
## # A tibble: 10 × 4
## # Groups:   county [10]
##    state     county     date       c_cases
##    <chr>     <chr>      <date>       <dbl>
##  1 Minnesota Aitkin     2020-01-22       0
##  2 Minnesota Anoka      2020-01-22       0
##  3 Minnesota Becker     2020-01-22       0
##  4 Minnesota Beltrami   2020-01-22       0
##  5 Minnesota Benton     2020-01-22       0
##  6 Minnesota Big Stone  2020-01-22       0
##  7 Minnesota Blue Earth 2020-01-22       0
##  8 Minnesota Brown      2020-01-22       0
##  9 Minnesota Carlton    2020-01-22       0
## 10 Minnesota Carver     2020-01-22       0
  1. The data in c_cases shows the number of cumulative (i.e., total) cases for each county, up to and including that date. While this can be useful, we might also want to see the number of new cases each day. Using the mutate() command, create a new column called “cases” which contains the number of new cases for each date. See the help file for dplyr’s helper function lag(), which is an easy way to specify that the current cases cell should be the current c_cases cell minus the previous c_cases cell. Since using this function by itself will give NA for the very first date, use the option “default=0” within the lag() function to prevent this. (It will thus assume that the date of January 21, 2020 had 0 cases.) This should add a column to the data frame. As a check, try looking up a random county on a random date, and verify that cases does indeed equal c_cases for that date, minus c_cases for the previous date.
covid_long <- mutate(covid_long, cases = c_cases - lag(c_cases, n=1L, default=0))
## # A tibble: 10 × 5
## # Groups:   county [10]
##    state     county     date       c_cases cases
##    <chr>     <chr>      <date>       <dbl> <dbl>
##  1 Minnesota Aitkin     2020-01-22       0     0
##  2 Minnesota Anoka      2020-01-22       0     0
##  3 Minnesota Becker     2020-01-22       0     0
##  4 Minnesota Beltrami   2020-01-22       0     0
##  5 Minnesota Benton     2020-01-22       0     0
##  6 Minnesota Big Stone  2020-01-22       0     0
##  7 Minnesota Blue Earth 2020-01-22       0     0
##  8 Minnesota Brown      2020-01-22       0     0
##  9 Minnesota Carlton    2020-01-22       0     0
## 10 Minnesota Carver     2020-01-22       0     0
  1. Using dplyr’s summarize() command, collapse this data so that only the final, cumulative number of cases for each county is shown. This can be done in either of two ways: by obtaining the last value of c_cases in each group, or by summing all of the cases across all dates; the numbers should be the same. Save this new data frame as “covid_last”. As a check, this data frame should have 2 columns and 89 rows. (As an additional check, examine Hennepin County, Minnesota; cumulative cases on February 15, 2022 should be 290,386.)
covid_last <- summarize (covid_long, total_cases=sum(c_cases))
nrow(covid_last)
## [1] 89
ncol(covid_last)
## [1] 2
  1. Download the zip file in this link Download the zip file in this link , which contains an .RData file. Using load(), load this file into R. It should add a single data frame called “mnpops” to your environment. Try viewing the data frame, which should show the 2020 Census populations for each of the counties in Minnesota. (I obtained this data from the U.S. Census Bureau web site, and turned it into a data frame.)
load("/Users/Claire/Downloads/2020 Minnesota County Populations/mnpops.RData")
## # A tibble: 10 × 2
##    county     pop2020
##    <chr>        <dbl>
##  1 Hennepin   1281565
##  2 Ramsey      552352
##  3 Dakota      439882
##  4 Anoka       363887
##  5 Washington  267568
##  6 St. Louis   200231
##  7 Olmsted     162847
##  8 Stearns     158292
##  9 Scott       150928
## 10 Wright      141337
  1. Using inner_join(), add the population data to covid_last. Hint: use “county” as the key to merge on. This new version of covid_last should have 3 columns and 87 rows. (Since the special “counties” were not in the census data, they were not carried over in the join.)
covid_last <- inner_join(covid_last, mnpops)
## Joining, by = "county"
  1. Using mutate(), add a new column to covid_last called “per100k”, which lists the cumulative incidence of cases per 100,000 people. You can obtain this by dividing total by pop2020 (that county’s 2020 population), then multiplying by 100,000. These standardized numbers make it easier to compare different counties, which have very different population sizes.
covid_last <- mutate(covid_last, per100k = total_cases / pop2020 * 100000)
  1. Finally, using dplyr’s slice_max(), obtain the top 10 counties, in terms of cases per 100,000 people. Name this final data frame covid_top10. This data frame should have 4 columns and 10 observations, and should contain the following, which could be typical data found in dashboards or web sites tracking the pandemic.
county total pop2020 per100k
Benton 13618 41379 32910.41
Stearns 49548 158292 31301.65
Nobles 6841 22290 30690.89
Kandiyohi 13029 43732 29792.83
Mahnomen 1596 5411 29495.47
Clay 18941 65318 28998.13
Mower 11577 40029 28921.53
Wadena 4045 14065 28759.33
Freeborn 8764 30895 28367.05
Waseca 5354 18968 28226.49
top10 <- slice_max(covid_last, per100k, n=10)

What to turn in:

A simple text file, with a .R suffix, containing the commands you used for each of the parts. You do not have to include any output. Mark each part with a comment line, which begins with #. For example, if Part X asked you to add 1 and 1, and store the result in a variable called “a”, you might put this in the file:

# Part X
a <- 1 + 1

To further help with grading, please put your name in a comment line at the top of the file.

As always, let us know if you have questions or problems with this assignment. We hope you find it interesting, and that you learn one or two new things!

Dave, Divya, and Eunice…