Demonstrating tidyr for Data Cleaning and Reshaping in R

First, we need to download the tidyr package to be able to use the commands pivot_wider, pivot_longer, unite, and separate.

library(tidyr)

Now we need to install the package dplyr as this will be used throughout the example to select specific columns.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Next, we need to download the covid data, which is in a repository on Github, this means that we need to install the package remotes allows for the installation of R packages from sources other then CRAN. In this case the source is GitHub, and we will be taking directly from the kjhealy/covdata repository. Note: It may ask you to update the package in that case update all, which is number 1

remotes::install_github("kjhealy/covdata")
## Skipping install of 'covdata' from a github remote, the SHA1 (0626c652) has not changed since last install.
##   Use `force = TRUE` to force installation
library(covdata)
## 
## Attaching package: 'covdata'
## The following object is masked from 'package:datasets':
## 
##     uspop
covid_sample=cdc_deaths_by_state 
print(n=51, covid_sample)
## # A tibble: 53 × 10
##    data_as_of state              start_week end_week   covid_deaths total_deaths
##    <date>     <chr>              <date>     <date>            <int>        <int>
##  1 2020-04-30 Alabama            2020-02-01 2020-04-25          153        12427
##  2 2020-04-30 Alaska             2020-02-01 2020-04-25           NA          856
##  3 2020-04-30 Arizona            2020-02-01 2020-04-25          194        15859
##  4 2020-04-30 Arkansas           2020-02-01 2020-04-25           26         7774
##  5 2020-04-30 California         2020-02-01 2020-04-25         1111        69341
##  6 2020-04-30 Colorado           2020-02-01 2020-04-25          511        10721
##  7 2020-04-30 Connecticut        2020-02-01 2020-04-25           16         1277
##  8 2020-04-30 Delaware           2020-02-01 2020-04-25           53         2034
##  9 2020-04-30 District of Colum… 2020-02-01 2020-04-25           58         1452
## 10 2020-04-30 Florida            2020-02-01 2020-04-25          784        54148
## 11 2020-04-30 Georgia            2020-02-01 2020-04-25          460        19739
## 12 2020-04-30 Hawaii             2020-02-01 2020-04-25           NA         2830
## 13 2020-04-30 Idaho              2020-02-01 2020-04-25           44         3553
## 14 2020-04-30 Illinois           2020-02-01 2020-04-25          977        28696
## 15 2020-04-30 Indiana            2020-02-01 2020-04-25          493        16502
## 16 2020-04-30 Iowa               2020-02-01 2020-04-25           78         7447
## 17 2020-04-30 Kansas             2020-02-01 2020-04-25           82         6559
## 18 2020-04-30 Kentucky           2020-02-01 2020-04-25          105        10794
## 19 2020-04-30 Louisiana          2020-02-01 2020-04-25          779        11174
## 20 2020-04-30 Maine              2020-02-01 2020-04-25           39         3813
## 21 2020-04-30 Maryland           2020-02-01 2020-04-25          631        13661
## 22 2020-04-30 Massachusetts      2020-02-01 2020-04-25         1872        17627
## 23 2020-04-30 Michigan           2020-02-01 2020-04-25         1746        26798
## 24 2020-04-30 Minnesota          2020-02-01 2020-04-25          169        11351
## 25 2020-04-30 Mississippi        2020-02-01 2020-04-25          186         8064
## 26 2020-04-30 Missouri           2020-02-01 2020-04-25          181        15085
## 27 2020-04-30 Montana            2020-02-01 2020-04-25           NA         2248
## 28 2020-04-30 Nebraska           2020-02-01 2020-04-25           21         3971
## 29 2020-04-30 Nevada             2020-02-01 2020-04-25          134         6431
## 30 2020-04-30 New Hampshire      2020-02-01 2020-04-25           54         3193
## 31 2020-04-30 New Jersey         2020-02-01 2020-04-25         4147        25480
## 32 2020-04-30 New Mexico         2020-02-01 2020-04-25           36         4113
## 33 2020-04-30 New York           2020-02-01 2020-04-25         4800        32000
## 34 2020-04-30 New York City      2020-02-01 2020-04-25        10978        30711
## 35 2020-04-30 North Carolina     2020-02-01 2020-04-25            0        10727
## 36 2020-04-30 North Dakota       2020-02-01 2020-04-25           NA         1546
## 37 2020-04-30 Ohio               2020-02-01 2020-04-25          171        26464
## 38 2020-04-30 Oklahoma           2020-02-01 2020-04-25          114         8762
## 39 2020-04-30 Oregon             2020-02-01 2020-04-25           70         8392
## 40 2020-04-30 Pennsylvania       2020-02-01 2020-04-25         1388        28073
## 41 2020-04-30 Rhode Island       2020-02-01 2020-04-25           55         2235
## 42 2020-04-30 South Carolina     2020-02-01 2020-04-25          150        12807
## 43 2020-04-30 South Dakota       2020-02-01 2020-04-25           NA         1898
## 44 2020-04-30 Tennessee          2020-02-01 2020-04-25          127        18292
## 45 2020-04-30 Texas              2020-02-01 2020-04-25          347        48574
## 46 2020-04-30 Utah               2020-02-01 2020-04-25           28         4675
## 47 2020-04-30 Vermont            2020-02-01 2020-04-25           42         1570
## 48 2020-04-30 Virginia           2020-02-01 2020-04-25          326        17624
## 49 2020-04-30 Washington         2020-02-01 2020-04-25          527        14139
## 50 2020-04-30 West Virginia      2020-02-01 2020-04-25           11         5022
## 51 2020-04-30 Wisconsin          2020-02-01 2020-04-25          209        13718
## # ℹ 2 more rows
## # ℹ 4 more variables: percent_expected_deaths <dbl>, pneumonia_deaths <int>,
## #   pneumonia_and_covid_deaths <int>, all_influenza_deaths_j09_j11 <int>

Now we need to enter back into the tidyr package to use the pivot_longer commend. In this command we will be used to combine the columns covid_deaths and total_deaths and name the new columns death_type and deaths. This will allow for more straight forward labeling of covid vs total deaths.

library(tidyr)
df_long =pivot_longer(covid_sample, cols = c(covid_deaths, total_deaths), names_to = "death_type", values_to = "deaths")
head(df_long)
## # A tibble: 6 × 10
##   data_as_of state start_week end_week   percent_expected_dea…¹ pneumonia_deaths
##   <date>     <chr> <date>     <date>                      <dbl>            <int>
## 1 2020-04-30 Alab… 2020-02-01 2020-04-25                   0.91              769
## 2 2020-04-30 Alab… 2020-02-01 2020-04-25                   0.91              769
## 3 2020-04-30 Alas… 2020-02-01 2020-04-25                   0.79               41
## 4 2020-04-30 Alas… 2020-02-01 2020-04-25                   0.79               41
## 5 2020-04-30 Ariz… 2020-02-01 2020-04-25                   1.01             1115
## 6 2020-04-30 Ariz… 2020-02-01 2020-04-25                   1.01             1115
## # ℹ abbreviated name: ¹​percent_expected_deaths
## # ℹ 4 more variables: pneumonia_and_covid_deaths <int>,
## #   all_influenza_deaths_j09_j11 <int>, death_type <chr>, deaths <int>

Now we are going to use the command separate to separate the start_week date into year, month, and day. We first tell the code to use the df_long dataset that we just created, then state the column we wish to separate (start_week) then we label based on the order that we know the date is arranged currently within the start_week column. We know that currently these are separated in the dataset by a “-” within the start_week column, therefore we use the command sep=“-”.

df_separated = separate( df_long, col = start_week, into = c("year", "month", "day"), sep = "-", convert = TRUE)
head(df_separated)
## # A tibble: 6 × 12
##   data_as_of state    year month   day end_week   percent_expected_deaths
##   <date>     <chr>   <int> <int> <int> <date>                       <dbl>
## 1 2020-04-30 Alabama  2020     2     1 2020-04-25                    0.91
## 2 2020-04-30 Alabama  2020     2     1 2020-04-25                    0.91
## 3 2020-04-30 Alaska   2020     2     1 2020-04-25                    0.79
## 4 2020-04-30 Alaska   2020     2     1 2020-04-25                    0.79
## 5 2020-04-30 Arizona  2020     2     1 2020-04-25                    1.01
## 6 2020-04-30 Arizona  2020     2     1 2020-04-25                    1.01
## # ℹ 5 more variables: pneumonia_deaths <int>, pneumonia_and_covid_deaths <int>,
## #   all_influenza_deaths_j09_j11 <int>, death_type <chr>, deaths <int>

Next, we are going to use the unite command to combine the data in the state and the data_as_of information together and separated by a “_“. This will allow us to see the data as of date with the state name.

df_united = unite(df_separated, "state_date", state, data_as_of, sep = "_")
head(df_united)
## # A tibble: 6 × 11
##   state_date          year month   day end_week   percent_expected_deaths
##   <chr>              <int> <int> <int> <date>                       <dbl>
## 1 Alabama_2020-04-30  2020     2     1 2020-04-25                    0.91
## 2 Alabama_2020-04-30  2020     2     1 2020-04-25                    0.91
## 3 Alaska_2020-04-30   2020     2     1 2020-04-25                    0.79
## 4 Alaska_2020-04-30   2020     2     1 2020-04-25                    0.79
## 5 Arizona_2020-04-30  2020     2     1 2020-04-25                    1.01
## 6 Arizona_2020-04-30  2020     2     1 2020-04-25                    1.01
## # ℹ 5 more variables: pneumonia_deaths <int>, pneumonia_and_covid_deaths <int>,
## #   all_influenza_deaths_j09_j11 <int>, death_type <chr>, deaths <int>

Now we are going to use the pivot_wider command to separate out the total deaths from the covid deaths. To do this we first want to select a smaller dataset using the dplyr package, the dataset we want to select is state_date, death_type, and deaths. After selecting this data using the dplyr package we next use the pivot_wider command. In this case we are having the death_type moved into separate columns with counts, the names of the columns will be based off their labeling in the death_type column and the counts will be from the death column. This ensures that only one variable is within one column.

df_wide = dplyr::select(df_united, state_date, death_type, deaths)
df_wide= pivot_wider(df_wide, names_from = death_type, values_from = deaths)
head(df_wide)
## # A tibble: 6 × 3
##   state_date            covid_deaths total_deaths
##   <chr>                        <int>        <int>
## 1 Alabama_2020-04-30             153        12427
## 2 Alaska_2020-04-30               NA          856
## 3 Arizona_2020-04-30             194        15859
## 4 Arkansas_2020-04-30             26         7774
## 5 California_2020-04-30         1111        69341
## 6 Colorado_2020-04-30            511        10721

To graph this now tidy dataset we will use ggplot2, which we first need to download, and then will label based on the information of this dataset, we want the x-axis to be the total deaths while the y axis is covid deaths.

library(ggplot2)
ggplot(df_wide, aes(x = total_deaths, y = covid_deaths, label = state_date)) + geom_point(color = "steelblue", size = 4) + geom_text(vjust = -0.5, size = 3) + labs(title = "COVID vs. Total Deaths (April 2020)", x = "Total Deaths", y = "COVID-19 Deaths")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

We see that this combined data with the state_date makes the plot hard to read so let’s remove the date and have only the state name be plotted.

library(dplyr)
df_wide = select(df_separated, state, death_type, deaths) 
library(tidyr)
df_wide=pivot_wider(df_wide, names_from = death_type, values_from = deaths)
head(df_wide)
## # A tibble: 6 × 3
##   state      covid_deaths total_deaths
##   <chr>             <int>        <int>
## 1 Alabama             153        12427
## 2 Alaska               NA          856
## 3 Arizona             194        15859
## 4 Arkansas             26         7774
## 5 California         1111        69341
## 6 Colorado            511        10721
ggplot(df_wide, aes(x = total_deaths, y = covid_deaths, label = state)) + geom_point(color = "steelblue", size = 4) + geom_text(vjust = -0.5, size = 3) + labs(title = "COVID vs. Total Deaths (April 2020)", x = "Total Deaths", y = "COVID-19 Deaths")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

We have now completed our example on how to use tidyr to tidy up some data and plot it.