Use table1, table2, table3, table4a, table4b and table5 datasets which are under {tidyr} package. As you can see, each dataset shows the same values of four variables, country, year, population, and cases, but each dataset organises the values in a different way. Click on the tables below to use them for completing the task. The functions mentioned in step 3 use one of these tables to explain the functionality of {tidyr} functions.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
When column names are values instead of variables, you need to transform data from wide to long format. We do this with the pivot_longer() function. Previously, the function that performed this was gather(), so you may also see that in legacy code in your workplace.
To illustrate this function, let’s have a look at the data given in the table4a below:
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To tidy a dataset like this, you need to pivot those columns into a new pair of variables using pivot_longer() function. To describe that operation you need three parameters:
The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
As values are given as column names, we need to tell the function the name of the new variable that will contain those column names. The argument used is names_to because we’re telling it what to call the column where we’ll list the current column names… In this case, the argument is year, as we are sending the column names 1999 and 2000 to a new year variable.
The name of the variable whose values are spread over the cells. Because we’ll be bringing those values together in a single column, we can say what that column is named with the argument values_to because we’re telling it what to call the column to which we are sending the values. This table is listing case numbers, so we shall use values_to = "cases"
Use the below code:
table4a %>%
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Note that in the R code below, you used the pipe (%>%) operator to take the data first, then use the pivot_longer() function. The {tidyr} package functions can also be used along with the pipe operator %>% which is developed by Stefan Milton Bache in the R package {magrittr}. Remember that the functions in {tidyr} can be used either with or without the pipe operator.
When multiple variables are stored in rows, the pivot_wider() function generates columns from rows. In other words, it transforms data from long to wide format. The pivot_wider() function is the opposite of pivot_longer() function.
Let’s look at table2 and assume that you are required to turn long formatted data into wide formatted data by generating columns from cases.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
To tidy this up, you first analyse the representation in a similar way to pivot_longer(). This time, however, you only need two parameters:
The column that contains variable names, which is indicated by the names_from argument. Here, it’s type.
The column that contains values that will form multiple variables, the value column, which is indicated by the values_from argument. Here, it’s count.
Once you’ve figured that out, you can use pivot_wider():
table2 %>%
pivot_wider(names_from = "type",
values_from = "count")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Now, cases and population are separate variables given in columns, therefore, generating a new variable from these two variables is super easy! Let’s calculate the Tuberculosis rate (rate = cases/population) using:
rate <- pivot_wider(data = table2, names_from = "type", values_from = "count")$cases /
pivot_wider(data = table2, names_from = "type", values_from = "count")$population
rate
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488
In a Module 3, we will learn the mutate() function, which will make this even easier:
table2 %>%
pivot_wider(names_from = "type",
values_from = "count") %>%
mutate(rate = cases / population)
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.0000373
## 2 Afghanistan 2000 2666 20595360 0.000129
## 3 Brazil 1999 37737 172006362 0.000219
## 4 Brazil 2000 80488 174504898 0.000461
## 5 China 1999 212258 1272915272 0.000167
## 6 China 2000 213766 1280428583 0.000167
The separate() function is used when multiple variables are stored in one column and you want to split them according to the separator character. Take table3 for example:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate column contains both cases and population variables, and you need to split it into two variables. Use the below code:
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
unite() is the inverse of separate() function. One can use it to combine multiple columns into a single column.
Now let’s look at the table5:
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
In this data, assume that you want to combine the century and year variables into one variable called new_year. You can use unite() for this purpose using following code:
table5 %>%
unite(new_year, century, year)
## # A tibble: 6 x 3
## country new_year rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583