The Idea

The tidyr package is part of the Tidyverse group of libraries. As such, it is constantly maintained and the data and analysis that comes from this package can be trusted. It was created for the sole purpose of simplifying the process of creating tidy data. You can find a cheetsheat for the tidyr package here or on the tidyverse website.

The tidr website states "The goal of tidyr is to help you create tidy data. Tidy data is data where:

Every column is variable.
Every row is an observation.
Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis."

Of course, the first step with tidyr is to load the library (or alternatively, you could load the tidyverse library) and examining some data (USArrests data set). However, to get a more complete view of what the tidyr commands actually do, we simply take a subset of the data (rows 1, 11, 21, 31). The codebook for the data can be found here.

library(tidyr)
library(dplyr)
df <- USArrests[c(1, 11, 21, 31),]
head(df)

##               Murder Assault UrbanPop Rape
## Alabama         13.2     236       58 21.2
## Hawaii           5.3      46       83 20.2
## Massachusetts    4.4     149       85 16.3
## New Mexico      11.4     285       70 32.1

In the examples that follow, we would like the state to be a variable rather than simply a row name. Thus, we will add a column to the table (using the cbind() command).

df <- cbind(State=rownames(df), df)
df

##                       State Murder Assault UrbanPop Rape
## Alabama             Alabama   13.2     236       58 21.2
## Hawaii               Hawaii    5.3      46       83 20.2
## Massachusetts Massachusetts    4.4     149       85 16.3
## New Mexico       New Mexico   11.4     285       70 32.1

There are four fundamental functions in tidyr.

Tool	Function
gather()	makes “wide” data longer
spread()	makes “long” data wider
separate()	splits a single column into multiple columns
unite()	combines multiple columns into a single column

The Pipe %>% (from dplyr page)

While the pipe operator is not technically a requirement for the tidyverse or dplyr, it certainly increases the ease with which we make use of the tools.

At a basic level, the pipe forwards a value or the result of an expression into the next function. We can then omit the data call in the function. For example: \[\text{filter(data, variable == value)}\] can also be written as \[\text{data %>% filter(variable == value)}.\] While that does not seem to be a substantial change or even an upgrade at all, more complex strings of code will become increasingly simple with this tool. More on this later.

gather()

We begin by looking once again at our dataset (USArrests - a built in dataset). For the sake of easy example, we have taken only four rows of the USArrests data set (rows 1, 11, 21 and 31) and added a column called State that denotes the state as a variable rather than simply a row name.

df

##                       State Murder Assault UrbanPop Rape
## Alabama             Alabama   13.2     236       58 21.2
## Hawaii               Hawaii    5.3      46       83 20.2
## Massachusetts Massachusetts    4.4     149       85 16.3
## New Mexico       New Mexico   11.4     285       70 32.1

The function, gather(), collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one. It’s an alternative of melt() function [in reshape2 package].

The gather() function takes in 4 values (one fewer with the pipe): data, key, value and a vector specifying which columns need to be gathered \[gather(data, key, value, ...).\]. We would like to create a table that simply has a state, a type of infraction and the number of arrests for that infraction (per 100,000 people).

df %>% gather(key="Infraction",
               value="Arrests")

##    Infraction       Arrests
## 1       State       Alabama
## 2       State        Hawaii
## 3       State Massachusetts
## 4       State    New Mexico
## 5      Murder          13.2
## 6      Murder           5.3
## 7      Murder           4.4
## 8      Murder          11.4
## 9     Assault           236
## 10    Assault            46
## 11    Assault           149
## 12    Assault           285
## 13   UrbanPop            58
## 14   UrbanPop            83
## 15   UrbanPop            85
## 16   UrbanPop            70
## 17       Rape          21.2
## 18       Rape          20.2
## 19       Rape          16.3
## 20       Rape          32.1

We can see that our State and UrbanPop columns get messed up - this will happen from time to time. What we forgot to do is take into account that State and UrbanPop were not types of arrests and so need to be ignored. We do that by removing them after we add our value data to the function.

df %>% gather(key="Infraction",
               value="Arrests",
              -c(State, UrbanPop))

##            State UrbanPop Infraction Arrests
## 1        Alabama       58     Murder    13.2
## 2         Hawaii       83     Murder     5.3
## 3  Massachusetts       85     Murder     4.4
## 4     New Mexico       70     Murder    11.4
## 5        Alabama       58    Assault   236.0
## 6         Hawaii       83    Assault    46.0
## 7  Massachusetts       85    Assault   149.0
## 8     New Mexico       70    Assault   285.0
## 9        Alabama       58       Rape    21.2
## 10        Hawaii       83       Rape    20.2
## 11 Massachusetts       85       Rape    16.3
## 12    New Mexico       70       Rape    32.1

We notice that we have collapsed the Murder, Assult and Rape columns into a single column while the State and UrbanPop column have been duplicated (as in rows 1, 5 and 9, for example). If we want to omit one of the columns (UrbanPop, for example), we simply remove the column when we view (or save a new data frame without that column).

df2 <- df %>% 
        gather(key="Infraction",
               value="Arrests",
              -c(State, UrbanPop))
df2 <- df2[,c(1,3,4)]
head(df2)

##           State Infraction Arrests
## 1       Alabama     Murder    13.2
## 2        Hawaii     Murder     5.3
## 3 Massachusetts     Murder     4.4
## 4    New Mexico     Murder    11.4
## 5       Alabama    Assault   236.0
## 6        Hawaii    Assault    46.0

spread()

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].

The spread() function takes in 3 values (one fewer with the pipe): data, key and value \[spread(data, key, value).\] We would like to take df2 and make it look like the original dataset. Note that we have taken out the UrbanPop column, so it will not be identical.

df2 %>% spread(key="Infraction",
               value="Arrests")

##           State Assault Murder Rape
## 1       Alabama     236   13.2 21.2
## 2        Hawaii      46    5.3 20.2
## 3 Massachusetts     149    4.4 16.3
## 4    New Mexico     285   11.4 32.1

Notice that we spread the key out wide and inputted the Arrests value in the table.

unite()

The function unite() takes multiple columns and paste them together into one. The unite() function takes data, col and sep as inputs \[unite(data, col, ..., sep).\] The col input is the new name of the column to add and the sep is the separator used between values.

Suppose we wanted to create a column called Murder_Assault which took the Murder number and attached to it the Assult number separated by a _. (I don’t know why we would ever want to do this, but there are clear reasons why one might want to have such a function - dates, for example). We would accomplish this task the following way:

df %>% unite(col="Murder_Assault",
             Murder, Assault,
             sep="_")

##                       State Murder_Assault UrbanPop Rape
## Alabama             Alabama       13.2_236       58 21.2
## Hawaii               Hawaii         5.3_46       83 20.2
## Massachusetts Massachusetts        4.4_149       85 16.3
## New Mexico       New Mexico       11.4_285       70 32.1

As you can see, we have removed the Murder column and the Assault column and replaced them with a merged column Murder_Assault.

separate()

The function seperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns. The inputs for the unite() function are data, col, into and sep, where col is the column names, into is the character vector specifying the names of the new variables to be created and sep details the separator between columns \[separate(data, col, into, sep).\] Note that if the separator is a character, it is interpreted as a regular expression and if it is numeric, it is interpreted as the position of the split.

Let us create the Murder_Assault column from before and save the dataframe as df3.

df3 <- df %>% 
        unite(col="Murder_Assault",
              Murder, Assault,
              sep="_")

We will now attempt to separate the Murder_Assault column into 2 separate columns.

df3 %>% separate(
        col = "Murder_Assault",
        into = c("Murder", "Assault"),
        sep = "_")

##                       State Murder Assault UrbanPop Rape
## Alabama             Alabama   13.2     236       58 21.2
## Hawaii               Hawaii    5.3      46       83 20.2
## Massachusetts Massachusetts    4.4     149       85 16.3
## New Mexico       New Mexico   11.4     285       70 32.1

Multiple Operations

Of course, we can use the pipe to chain multiple operations together. If we want to create a table that has, for example, data that resembles Alabama_Murder_13.2 etc. giving the details about an attribute of the state of Alabama in one line, we would do so in the following way.

df %>% gather(key = "Attribute",
              value = "Value",
              Murder:Rape) %>%
        unite(col = "State_Attribute",
              State, Attribute, Value,
              sep="_")

##              State_Attribute
## 1        Alabama_Murder_13.2
## 2          Hawaii_Murder_5.3
## 3   Massachusetts_Murder_4.4
## 4     New Mexico_Murder_11.4
## 5        Alabama_Assault_236
## 6          Hawaii_Assault_46
## 7  Massachusetts_Assault_149
## 8     New Mexico_Assault_285
## 9        Alabama_UrbanPop_58
## 10        Hawaii_UrbanPop_83
## 11 Massachusetts_UrbanPop_85
## 12    New Mexico_UrbanPop_70
## 13         Alabama_Rape_21.2
## 14          Hawaii_Rape_20.2
## 15   Massachusetts_Rape_16.3
## 16      New Mexico_Rape_32.1

Citations

Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.

Dealing with Missing Values - UC Business Analytics R Programming Guide. Accessed April 19, 2021. Available here.

“Reshaping Your Data with Tidyr · UC Business Analytics R Programming Guide.” Accessed April 20, 2021. Available here.

“Tidy Messy Data.” Accessed April 20, 2021. Available here.

“USArrests Function - RDocumentation.” Accessed April 20, 2021. Available here.

Wickham, Hadley and RStudio. Tidyr: Tidy Messy Data (version 1.1.3), 2021. Available here.

Tidying Data with tidyr

OC Data Science