2.5.1 Start using tidyr

Practice with the tidyr package

Use table1, table2, table3, table4a, table4b and table5 datasets which are under {tidyr} package. As you can see, each dataset shows the same values of four variables, country, year, population, and cases, but each dataset organises the values in a different way. Click on the tables below to use them for completing the task. The functions mentioned in step 3 use one of these tables to explain the functionality of {tidyr} functions.

table1

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4b

## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Learn and play with {tidyr} functions

pivot_longer() function

When column names are values instead of variables, you need to transform data from wide to long format. We do this with the pivot_longer() function. Previously, the function that performed this was gather(), so you may also see that in legacy code in your workplace.

To illustrate this function, let’s have a look at the data given in the table4a below:

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

To tidy a dataset like this, you need to pivot those columns into a new pair of variables using pivot_longer() function. To describe that operation you need three parameters:

The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
As values are given as column names, we need to tell the function the name of the new variable that will contain those column names. The argument used is names_to because we’re telling it what to call the column where we’ll list the current column names… In this case, the argument is year, as we are sending the column names 1999 and 2000 to a new year variable.
The name of the variable whose values are spread over the cells. Because we’ll be bringing those values together in a single column, we can say what that column is named with the argument values_to because we’re telling it what to call the column to which we are sending the values. This table is listing case numbers, so we shall use values_to = "cases"

Use the below code:

table4a %>% 
  pivot_longer(cols = c(`1999`, `2000`), 
               names_to = "year", 
               values_to = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Note that in the R code below, you used the pipe (%>%) operator to take the data first, then use the pivot_longer() function. The {tidyr} package functions can also be used along with the pipe operator %>% which is developed by Stefan Milton Bache in the R package {magrittr}. Remember that the functions in {tidyr} can be used either with or without the pipe operator.

pivot_wider() function

When multiple variables are stored in rows, the pivot_wider() function generates columns from rows. In other words, it transforms data from long to wide format. The pivot_wider() function is the opposite of pivot_longer() function.

Let’s look at table2 and assume that you are required to turn long formatted data into wide formatted data by generating columns from cases.

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

To tidy this up, you first analyse the representation in a similar way to pivot_longer(). This time, however, you only need two parameters:

The column that contains variable names, which is indicated by the names_from argument. Here, it’s type.
The column that contains values that will form multiple variables, the value column, which is indicated by the values_from argument. Here, it’s count.

Once you’ve figured that out, you can use pivot_wider():

table2 %>% 
  pivot_wider(names_from = "type", 
              values_from = "count")

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Now, cases and population are separate variables given in columns, therefore, generating a new variable from these two variables is super easy! Let’s calculate the Tuberculosis rate (rate = cases/population) using:

rate <- pivot_wider(data = table2, names_from = "type", values_from = "count")$cases / 
  pivot_wider(data = table2, names_from = "type", values_from = "count")$population 

rate

## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488

In a Module 3, we will learn the mutate() function, which will make this even easier:

table2 %>% 
  pivot_wider(names_from = "type", 
              values_from = "count") %>% 
  mutate(rate = cases / population)

## # A tibble: 6 x 5
##   country      year  cases population      rate
##   <chr>       <int>  <int>      <int>     <dbl>
## 1 Afghanistan  1999    745   19987071 0.0000373
## 2 Afghanistan  2000   2666   20595360 0.000129 
## 3 Brazil       1999  37737  172006362 0.000219 
## 4 Brazil       2000  80488  174504898 0.000461 
## 5 China        1999 212258 1272915272 0.000167 
## 6 China        2000 213766 1280428583 0.000167

separate() function

The separate() function is used when multiple variables are stored in one column and you want to split them according to the separator character. Take table3 for example:

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The rate column contains both cases and population variables, and you need to split it into two variables. Use the below code:

table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

unite() function

unite() is the inverse of separate() function. One can use it to combine multiple columns into a single column.

Now let’s look at the table5:

table5

## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

In this data, assume that you want to combine the century and year variables into one variable called new_year. You can use unite() for this purpose using following code:

table5 %>%
  unite(new_year, century, year)

## # A tibble: 6 x 3
##   country     new_year rate             
##   <chr>       <chr>    <chr>            
## 1 Afghanistan 19_99    745/19987071     
## 2 Afghanistan 20_00    2666/20595360    
## 3 Brazil      19_99    37737/172006362  
## 4 Brazil      20_00    80488/174504898  
## 5 China       19_99    212258/1272915272
## 6 China       20_00    213766/1280428583