Demonstrating tidyr for Data Cleaning and Reshaping in R
First, we need to download the tidyr package to be able to use the
commands pivot_wider, pivot_longer, unite, and separate.
library(tidyr)
Now we need to install the package dplyr as this will be used
throughout the example to select specific columns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, we need to download the covid data, which is in a repository
on Github, this means that we need to install the package remotes allows
for the installation of R packages from sources other then CRAN. In this
case the source is GitHub, and we will be taking directly from the
kjhealy/covdata repository. Note: It may ask you to update the
package in that case update all, which is number 1
remotes::install_github("kjhealy/covdata")
## Skipping install of 'covdata' from a github remote, the SHA1 (0626c652) has not changed since last install.
## Use `force = TRUE` to force installation
library(covdata)
##
## Attaching package: 'covdata'
## The following object is masked from 'package:datasets':
##
## uspop
covid_sample=cdc_deaths_by_state
print(n=51, covid_sample)
## # A tibble: 53 × 10
## data_as_of state start_week end_week covid_deaths total_deaths
## <date> <chr> <date> <date> <int> <int>
## 1 2020-04-30 Alabama 2020-02-01 2020-04-25 153 12427
## 2 2020-04-30 Alaska 2020-02-01 2020-04-25 NA 856
## 3 2020-04-30 Arizona 2020-02-01 2020-04-25 194 15859
## 4 2020-04-30 Arkansas 2020-02-01 2020-04-25 26 7774
## 5 2020-04-30 California 2020-02-01 2020-04-25 1111 69341
## 6 2020-04-30 Colorado 2020-02-01 2020-04-25 511 10721
## 7 2020-04-30 Connecticut 2020-02-01 2020-04-25 16 1277
## 8 2020-04-30 Delaware 2020-02-01 2020-04-25 53 2034
## 9 2020-04-30 District of Colum… 2020-02-01 2020-04-25 58 1452
## 10 2020-04-30 Florida 2020-02-01 2020-04-25 784 54148
## 11 2020-04-30 Georgia 2020-02-01 2020-04-25 460 19739
## 12 2020-04-30 Hawaii 2020-02-01 2020-04-25 NA 2830
## 13 2020-04-30 Idaho 2020-02-01 2020-04-25 44 3553
## 14 2020-04-30 Illinois 2020-02-01 2020-04-25 977 28696
## 15 2020-04-30 Indiana 2020-02-01 2020-04-25 493 16502
## 16 2020-04-30 Iowa 2020-02-01 2020-04-25 78 7447
## 17 2020-04-30 Kansas 2020-02-01 2020-04-25 82 6559
## 18 2020-04-30 Kentucky 2020-02-01 2020-04-25 105 10794
## 19 2020-04-30 Louisiana 2020-02-01 2020-04-25 779 11174
## 20 2020-04-30 Maine 2020-02-01 2020-04-25 39 3813
## 21 2020-04-30 Maryland 2020-02-01 2020-04-25 631 13661
## 22 2020-04-30 Massachusetts 2020-02-01 2020-04-25 1872 17627
## 23 2020-04-30 Michigan 2020-02-01 2020-04-25 1746 26798
## 24 2020-04-30 Minnesota 2020-02-01 2020-04-25 169 11351
## 25 2020-04-30 Mississippi 2020-02-01 2020-04-25 186 8064
## 26 2020-04-30 Missouri 2020-02-01 2020-04-25 181 15085
## 27 2020-04-30 Montana 2020-02-01 2020-04-25 NA 2248
## 28 2020-04-30 Nebraska 2020-02-01 2020-04-25 21 3971
## 29 2020-04-30 Nevada 2020-02-01 2020-04-25 134 6431
## 30 2020-04-30 New Hampshire 2020-02-01 2020-04-25 54 3193
## 31 2020-04-30 New Jersey 2020-02-01 2020-04-25 4147 25480
## 32 2020-04-30 New Mexico 2020-02-01 2020-04-25 36 4113
## 33 2020-04-30 New York 2020-02-01 2020-04-25 4800 32000
## 34 2020-04-30 New York City 2020-02-01 2020-04-25 10978 30711
## 35 2020-04-30 North Carolina 2020-02-01 2020-04-25 0 10727
## 36 2020-04-30 North Dakota 2020-02-01 2020-04-25 NA 1546
## 37 2020-04-30 Ohio 2020-02-01 2020-04-25 171 26464
## 38 2020-04-30 Oklahoma 2020-02-01 2020-04-25 114 8762
## 39 2020-04-30 Oregon 2020-02-01 2020-04-25 70 8392
## 40 2020-04-30 Pennsylvania 2020-02-01 2020-04-25 1388 28073
## 41 2020-04-30 Rhode Island 2020-02-01 2020-04-25 55 2235
## 42 2020-04-30 South Carolina 2020-02-01 2020-04-25 150 12807
## 43 2020-04-30 South Dakota 2020-02-01 2020-04-25 NA 1898
## 44 2020-04-30 Tennessee 2020-02-01 2020-04-25 127 18292
## 45 2020-04-30 Texas 2020-02-01 2020-04-25 347 48574
## 46 2020-04-30 Utah 2020-02-01 2020-04-25 28 4675
## 47 2020-04-30 Vermont 2020-02-01 2020-04-25 42 1570
## 48 2020-04-30 Virginia 2020-02-01 2020-04-25 326 17624
## 49 2020-04-30 Washington 2020-02-01 2020-04-25 527 14139
## 50 2020-04-30 West Virginia 2020-02-01 2020-04-25 11 5022
## 51 2020-04-30 Wisconsin 2020-02-01 2020-04-25 209 13718
## # ℹ 2 more rows
## # ℹ 4 more variables: percent_expected_deaths <dbl>, pneumonia_deaths <int>,
## # pneumonia_and_covid_deaths <int>, all_influenza_deaths_j09_j11 <int>
Now we need to enter back into the tidyr package to use the
pivot_longer commend. In this command we will be used to combine the
columns covid_deaths and total_deaths and name the new columns
death_type and deaths. This will allow for more straight forward
labeling of covid vs total deaths.
library(tidyr)
df_long =pivot_longer(covid_sample, cols = c(covid_deaths, total_deaths), names_to = "death_type", values_to = "deaths")
head(df_long)
## # A tibble: 6 × 10
## data_as_of state start_week end_week percent_expected_dea…¹ pneumonia_deaths
## <date> <chr> <date> <date> <dbl> <int>
## 1 2020-04-30 Alab… 2020-02-01 2020-04-25 0.91 769
## 2 2020-04-30 Alab… 2020-02-01 2020-04-25 0.91 769
## 3 2020-04-30 Alas… 2020-02-01 2020-04-25 0.79 41
## 4 2020-04-30 Alas… 2020-02-01 2020-04-25 0.79 41
## 5 2020-04-30 Ariz… 2020-02-01 2020-04-25 1.01 1115
## 6 2020-04-30 Ariz… 2020-02-01 2020-04-25 1.01 1115
## # ℹ abbreviated name: ¹percent_expected_deaths
## # ℹ 4 more variables: pneumonia_and_covid_deaths <int>,
## # all_influenza_deaths_j09_j11 <int>, death_type <chr>, deaths <int>
Now we are going to use the command separate to separate the
start_week date into year, month, and day. We first tell the code to use
the df_long dataset that we just created, then state the column we wish
to separate (start_week) then we label based on the order that we know
the date is arranged currently within the start_week column. We know
that currently these are separated in the dataset by a “-” within the
start_week column, therefore we use the command sep=“-”.
df_separated = separate( df_long, col = start_week, into = c("year", "month", "day"), sep = "-", convert = TRUE)
head(df_separated)
## # A tibble: 6 × 12
## data_as_of state year month day end_week percent_expected_deaths
## <date> <chr> <int> <int> <int> <date> <dbl>
## 1 2020-04-30 Alabama 2020 2 1 2020-04-25 0.91
## 2 2020-04-30 Alabama 2020 2 1 2020-04-25 0.91
## 3 2020-04-30 Alaska 2020 2 1 2020-04-25 0.79
## 4 2020-04-30 Alaska 2020 2 1 2020-04-25 0.79
## 5 2020-04-30 Arizona 2020 2 1 2020-04-25 1.01
## 6 2020-04-30 Arizona 2020 2 1 2020-04-25 1.01
## # ℹ 5 more variables: pneumonia_deaths <int>, pneumonia_and_covid_deaths <int>,
## # all_influenza_deaths_j09_j11 <int>, death_type <chr>, deaths <int>
Now we are going to use the pivot_wider command to separate out the
total deaths from the covid deaths. To do this we first want to select a
smaller dataset using the dplyr package, the dataset we want to select
is state_date, death_type, and deaths. After selecting this data using
the dplyr package we next use the pivot_wider command. In this case we
are having the death_type moved into separate columns with counts, the
names of the columns will be based off their labeling in the death_type
column and the counts will be from the death column. This ensures that
only one variable is within one column.
df_wide = dplyr::select(df_united, state_date, death_type, deaths)
df_wide= pivot_wider(df_wide, names_from = death_type, values_from = deaths)
head(df_wide)
## # A tibble: 6 × 3
## state_date covid_deaths total_deaths
## <chr> <int> <int>
## 1 Alabama_2020-04-30 153 12427
## 2 Alaska_2020-04-30 NA 856
## 3 Arizona_2020-04-30 194 15859
## 4 Arkansas_2020-04-30 26 7774
## 5 California_2020-04-30 1111 69341
## 6 Colorado_2020-04-30 511 10721
We see that this combined data with the state_date makes the plot
hard to read so let’s remove the date and have only the state name be
plotted.
library(dplyr)
df_wide = select(df_separated, state, death_type, deaths)
library(tidyr)
df_wide=pivot_wider(df_wide, names_from = death_type, values_from = deaths)
head(df_wide)
## # A tibble: 6 × 3
## state covid_deaths total_deaths
## <chr> <int> <int>
## 1 Alabama 153 12427
## 2 Alaska NA 856
## 3 Arizona 194 15859
## 4 Arkansas 26 7774
## 5 California 1111 69341
## 6 Colorado 511 10721
ggplot(df_wide, aes(x = total_deaths, y = covid_deaths, label = state)) + geom_point(color = "steelblue", size = 4) + geom_text(vjust = -0.5, size = 3) + labs(title = "COVID vs. Total Deaths (April 2020)", x = "Total Deaths", y = "COVID-19 Deaths")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

We have now completed our example on how to use tidyr to tidy up
some data and plot it.