Tidyverse - Pivot

Pivot Vignette

This vignette will cover the pivot_longer and pivot_wider functions. These functions are an alternative to gather() and spread() while incorporating features from other packages.

The gather and spread functions are used to “tidy” data, rearranging data that should be in rows in the columns or vice versa. Gather() moves columns into rows while the spread() function moves rows into columns.

These functions are great, however, the syntax is not easy to remember and takes multiple tries to get it right. The functions pivot_longer() and pivot_wider() are meant to replace gather() and spread(), in a more intuitive way.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
knitr::opts_chunk$set(echo = TRUE)

The dataset used to demonstrate pivot functions is the drug-use-by-age dataset from 538.

Load Data

drug_df_0 =  read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   age = col_character(),
##   `cocaine-frequency` = col_character(),
##   `crack-frequency` = col_character(),
##   `heroin-frequency` = col_character(),
##   `inhalant-frequency` = col_character(),
##   `oxycontin-frequency` = col_character(),
##   `meth-frequency` = col_character()
## )

## See spec(...) for full column specifications.

Glimpse the Data

glimpse(drug_df_0)

## Observations: 17
## Variables: 28
## $ age                       <chr> "12", "13", "14", "15", "16", "17", "18",...
## $ n                         <dbl> 2798, 2757, 2792, 2956, 3058, 3038, 2469,...
## $ `alcohol-use`             <dbl> 3.9, 8.5, 18.1, 29.2, 40.1, 49.3, 58.7, 6...
## $ `alcohol-frequency`       <dbl> 3, 6, 5, 6, 10, 13, 24, 36, 48, 52, 52, 5...
## $ `marijuana-use`           <dbl> 1.1, 3.4, 8.7, 14.5, 22.5, 28.0, 33.7, 33...
## $ `marijuana-frequency`     <dbl> 4, 15, 24, 25, 30, 36, 52, 60, 60, 52, 52...
## $ `cocaine-use`             <dbl> 0.1, 0.1, 0.1, 0.5, 1.0, 2.0, 3.2, 4.1, 4...
## $ `cocaine-frequency`       <chr> "5.0", "1.0", "5.5", "4.0", "7.0", "5.0",...
## $ `crack-use`               <dbl> 0.0, 0.0, 0.0, 0.1, 0.0, 0.1, 0.4, 0.5, 0...
## $ `crack-frequency`         <chr> "-", "3.0", "-", "9.5", "1.0", "21.0", "1...
## $ `heroin-use`              <dbl> 0.1, 0.0, 0.1, 0.2, 0.1, 0.1, 0.4, 0.5, 0...
## $ `heroin-frequency`        <chr> "35.5", "-", "2.0", "1.0", "66.5", "64.0"...
## $ `hallucinogen-use`        <dbl> 0.2, 0.6, 1.6, 2.1, 3.4, 4.8, 7.0, 8.6, 7...
## $ `hallucinogen-frequency`  <dbl> 52, 6, 3, 4, 3, 3, 4, 3, 2, 4, 3, 2, 3, 2...
## $ `inhalant-use`            <dbl> 1.6, 2.5, 2.6, 2.5, 3.0, 2.0, 1.8, 1.4, 1...
## $ `inhalant-frequency`      <chr> "19.0", "12.0", "5.0", "5.5", "3.0", "4.0...
## $ `pain-releiver-use`       <dbl> 2.0, 2.4, 3.9, 5.5, 6.2, 8.5, 9.2, 9.4, 1...
## $ `pain-releiver-frequency` <dbl> 36, 14, 12, 10, 7, 9, 12, 12, 10, 15, 15,...
## $ `oxycontin-use`           <dbl> 0.1, 0.1, 0.4, 0.8, 1.1, 1.4, 1.7, 1.5, 1...
## $ `oxycontin-frequency`     <chr> "24.5", "41.0", "4.5", "3.0", "4.0", "6.0...
## $ `tranquilizer-use`        <dbl> 0.2, 0.3, 0.9, 2.0, 2.4, 3.5, 4.9, 4.2, 5...
## $ `tranquilizer-frequency`  <dbl> 52.0, 25.5, 5.0, 4.5, 11.0, 7.0, 12.0, 4....
## $ `stimulant-use`           <dbl> 0.2, 0.3, 0.8, 1.5, 1.8, 2.8, 3.0, 3.3, 4...
## $ `stimulant-frequency`     <dbl> 2.0, 4.0, 12.0, 6.0, 9.5, 9.0, 8.0, 6.0, ...
## $ `meth-use`                <dbl> 0.0, 0.1, 0.1, 0.3, 0.3, 0.6, 0.5, 0.4, 0...
## $ `meth-frequency`          <chr> "-", "5.0", "24.0", "10.5", "36.0", "48.0...
## $ `sedative-use`            <dbl> 0.2, 0.1, 0.2, 0.4, 0.2, 0.5, 0.4, 0.3, 0...
## $ `sedative-frequency`      <dbl> 13.0, 19.0, 16.5, 30.0, 3.0, 6.5, 10.0, 6...

Now, lets introduce the pivot_longer() function. As the name impllies, it makes datasets longer by increasing the number of rows and decreasing the number of columns. Let’s imagine we only care about total drug users per category and not frequency. In this dataset, frequency is a character type and we would need to convert to dbl in order to use the function.

Pivot_longer()

#Let's clean the data and get rid of the unnecessary columns
drug_df <- drug_df_0[,-c(4,6,8,10,12,14,16,18,20,22,24,26,28)]

drug_df_morphin_time <- drug_df %>% pivot_longer(-c(age, n), names_to = "Drugs", values_to = "User_%" )

drug_df_morphin_time

## # A tibble: 221 x 4
##    age       n Drugs             `User_%`
##    <chr> <dbl> <chr>                <dbl>
##  1 12     2798 alcohol-use            3.9
##  2 12     2798 marijuana-use          1.1
##  3 12     2798 cocaine-use            0.1
##  4 12     2798 crack-use              0  
##  5 12     2798 heroin-use             0.1
##  6 12     2798 hallucinogen-use       0.2
##  7 12     2798 inhalant-use           1.6
##  8 12     2798 pain-releiver-use      2  
##  9 12     2798 oxycontin-use          0.1
## 10 12     2798 tranquilizer-use       0.2
## # ... with 211 more rows

In the above example, I chose to create a new data frame but that was a personal preference not mandatory for the use of the pivot_longer function.

The first step is to pass the dataset you wish to change.
If you wish to exlude any column from being turned into rows, use “-”. So, in my case, “-c(age,n)” means use all columns EXCEPT age and n.
names_to takes gives the name to the column containing all the old column names that are now rows.
values_to is the name given to the column containing all the values those old columns contained.

Pivot_wider()

To illustrate the pivot_wider function, we are going to revert the dataset back to wide.

drug_df_widen <- drug_df_morphin_time %>% pivot_wider(names_from = `Drugs`, values_from = `User_%` )
drug_df_widen

## # A tibble: 17 x 15
##    age       n `alcohol-use` `marijuana-use` `cocaine-use` `crack-use`
##    <chr> <dbl>         <dbl>           <dbl>         <dbl>       <dbl>
##  1 12     2798           3.9             1.1           0.1         0  
##  2 13     2757           8.5             3.4           0.1         0  
##  3 14     2792          18.1             8.7           0.1         0  
##  4 15     2956          29.2            14.5           0.5         0.1
##  5 16     3058          40.1            22.5           1           0  
##  6 17     3038          49.3            28             2           0.1
##  7 18     2469          58.7            33.7           3.2         0.4
##  8 19     2223          64.6            33.4           4.1         0.5
##  9 20     2271          69.7            34             4.9         0.6
## 10 21     2354          83.2            33             4.8         0.5
## 11 22-23  4707          84.2            28.4           4.5         0.5
## 12 24-25  4591          83.1            24.9           4           0.5
## 13 26-29  2628          80.7            20.8           3.2         0.4
## 14 30-34  2864          77.5            16.4           2.1         0.5
## 15 35-49  7391          75              10.4           1.5         0.5
## 16 50-64  3923          67.2             7.3           0.9         0.4
## 17 65+    2448          49.3             1.2           0           0  
## # ... with 9 more variables: `heroin-use` <dbl>, `hallucinogen-use` <dbl>,
## #   `inhalant-use` <dbl>, `pain-releiver-use` <dbl>, `oxycontin-use` <dbl>,
## #   `tranquilizer-use` <dbl>, `stimulant-use` <dbl>, `meth-use` <dbl>,
## #   `sedative-use` <dbl>

The new dataset is the original dataset.

In this case, we do not wan to exclude columns.
The column names are going to be generated from the values in the names_from column
The values that are going to populate the new rows are originating from the values_from column.

There are more complicated uses of pivot_longer() and pivot_wider(). Further exploration of the pivot functions can be found here:

https://tidyr.tidyverse.org/articles/pivot.html

But even at it’s most basic use, the pivot functions are intuitive and easy to use.