The tidyr package is part of the Tidyverse group of libraries. As such, it is constantly maintained and the data and analysis that comes from this package can be trusted. It was created for the sole purpose of simplifying the process of creating tidy data. You can find a cheetsheat for the tidyr package here or on the tidyverse website.
The tidr website states "The goal of tidyr is to help you create tidy data. Tidy data is data where:
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis."
Of course, the first step with tidyr is to load the library (or alternatively, you could load the tidyverse library) and examining some data (USArrests data set). However, to get a more complete view of what the tidyr commands actually do, we simply take a subset of the data (rows 1, 11, 21, 31). The codebook for the data can be found here.
library(tidyr)
library(dplyr)
df <- USArrests[c(1, 11, 21, 31),]
head(df)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Hawaii 5.3 46 83 20.2
## Massachusetts 4.4 149 85 16.3
## New Mexico 11.4 285 70 32.1
In the examples that follow, we would like the state to be a variable rather than simply a row name. Thus, we will add a column to the table (using the cbind() command).
df <- cbind(State=rownames(df), df)
df
## State Murder Assault UrbanPop Rape
## Alabama Alabama 13.2 236 58 21.2
## Hawaii Hawaii 5.3 46 83 20.2
## Massachusetts Massachusetts 4.4 149 85 16.3
## New Mexico New Mexico 11.4 285 70 32.1
There are four fundamental functions in tidyr.
| Tool | Function |
|---|---|
| gather() | makes “wide” data longer |
| spread() | makes “long” data wider |
| separate() | splits a single column into multiple columns |
| unite() | combines multiple columns into a single column |
While the pipe operator is not technically a requirement for the tidyverse or dplyr, it certainly increases the ease with which we make use of the tools.
At a basic level, the pipe forwards a value or the result of an expression into the next function. We can then omit the data call in the function. For example: \[\text{filter(data, variable == value)}\] can also be written as \[\text{data %>% filter(variable == value)}.\] While that does not seem to be a substantial change or even an upgrade at all, more complex strings of code will become increasingly simple with this tool. More on this later.
We begin by looking once again at our dataset (USArrests - a built in dataset). For the sake of easy example, we have taken only four rows of the USArrests data set (rows 1, 11, 21 and 31) and added a column called State that denotes the state as a variable rather than simply a row name.
df
## State Murder Assault UrbanPop Rape
## Alabama Alabama 13.2 236 58 21.2
## Hawaii Hawaii 5.3 46 83 20.2
## Massachusetts Massachusetts 4.4 149 85 16.3
## New Mexico New Mexico 11.4 285 70 32.1
The function, gather(), collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one. It’s an alternative of melt() function [in reshape2 package].
The gather() function takes in 4 values (one fewer with the pipe): data, key, value and a vector specifying which columns need to be gathered \[gather(data, key, value, ...).\]. We would like to create a table that simply has a state, a type of infraction and the number of arrests for that infraction (per 100,000 people).
df %>% gather(key="Infraction",
value="Arrests")
## Infraction Arrests
## 1 State Alabama
## 2 State Hawaii
## 3 State Massachusetts
## 4 State New Mexico
## 5 Murder 13.2
## 6 Murder 5.3
## 7 Murder 4.4
## 8 Murder 11.4
## 9 Assault 236
## 10 Assault 46
## 11 Assault 149
## 12 Assault 285
## 13 UrbanPop 58
## 14 UrbanPop 83
## 15 UrbanPop 85
## 16 UrbanPop 70
## 17 Rape 21.2
## 18 Rape 20.2
## 19 Rape 16.3
## 20 Rape 32.1
We can see that our State and UrbanPop columns get messed up - this will happen from time to time. What we forgot to do is take into account that State and UrbanPop were not types of arrests and so need to be ignored. We do that by removing them after we add our value data to the function.
df %>% gather(key="Infraction",
value="Arrests",
-c(State, UrbanPop))
## State UrbanPop Infraction Arrests
## 1 Alabama 58 Murder 13.2
## 2 Hawaii 83 Murder 5.3
## 3 Massachusetts 85 Murder 4.4
## 4 New Mexico 70 Murder 11.4
## 5 Alabama 58 Assault 236.0
## 6 Hawaii 83 Assault 46.0
## 7 Massachusetts 85 Assault 149.0
## 8 New Mexico 70 Assault 285.0
## 9 Alabama 58 Rape 21.2
## 10 Hawaii 83 Rape 20.2
## 11 Massachusetts 85 Rape 16.3
## 12 New Mexico 70 Rape 32.1
We notice that we have collapsed the Murder, Assult and Rape columns into a single column while the State and UrbanPop column have been duplicated (as in rows 1, 5 and 9, for example). If we want to omit one of the columns (UrbanPop, for example), we simply remove the column when we view (or save a new data frame without that column).
df2 <- df %>%
gather(key="Infraction",
value="Arrests",
-c(State, UrbanPop))
df2 <- df2[,c(1,3,4)]
head(df2)
## State Infraction Arrests
## 1 Alabama Murder 13.2
## 2 Hawaii Murder 5.3
## 3 Massachusetts Murder 4.4
## 4 New Mexico Murder 11.4
## 5 Alabama Assault 236.0
## 6 Hawaii Assault 46.0
The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].
The spread() function takes in 3 values (one fewer with the pipe): data, key and value \[spread(data, key, value).\] We would like to take df2 and make it look like the original dataset. Note that we have taken out the UrbanPop column, so it will not be identical.
df2 %>% spread(key="Infraction",
value="Arrests")
## State Assault Murder Rape
## 1 Alabama 236 13.2 21.2
## 2 Hawaii 46 5.3 20.2
## 3 Massachusetts 149 4.4 16.3
## 4 New Mexico 285 11.4 32.1
Notice that we spread the key out wide and inputted the Arrests value in the table.
The function unite() takes multiple columns and paste them together into one. The unite() function takes data, col and sep as inputs \[unite(data, col, ..., sep).\] The col input is the new name of the column to add and the sep is the separator used between values.
Suppose we wanted to create a column called Murder_Assault which took the Murder number and attached to it the Assult number separated by a _. (I don’t know why we would ever want to do this, but there are clear reasons why one might want to have such a function - dates, for example). We would accomplish this task the following way:
df %>% unite(col="Murder_Assault",
Murder, Assault,
sep="_")
## State Murder_Assault UrbanPop Rape
## Alabama Alabama 13.2_236 58 21.2
## Hawaii Hawaii 5.3_46 83 20.2
## Massachusetts Massachusetts 4.4_149 85 16.3
## New Mexico New Mexico 11.4_285 70 32.1
As you can see, we have removed the Murder column and the Assault column and replaced them with a merged column Murder_Assault.
The function seperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns. The inputs for the unite() function are data, col, into and sep, where col is the column names, into is the character vector specifying the names of the new variables to be created and sep details the separator between columns \[separate(data, col, into, sep).\] Note that if the separator is a character, it is interpreted as a regular expression and if it is numeric, it is interpreted as the position of the split.
Let us create the Murder_Assault column from before and save the dataframe as df3.
df3 <- df %>%
unite(col="Murder_Assault",
Murder, Assault,
sep="_")
We will now attempt to separate the Murder_Assault column into 2 separate columns.
df3 %>% separate(
col = "Murder_Assault",
into = c("Murder", "Assault"),
sep = "_")
## State Murder Assault UrbanPop Rape
## Alabama Alabama 13.2 236 58 21.2
## Hawaii Hawaii 5.3 46 83 20.2
## Massachusetts Massachusetts 4.4 149 85 16.3
## New Mexico New Mexico 11.4 285 70 32.1
Of course, we can use the pipe to chain multiple operations together. If we want to create a table that has, for example, data that resembles Alabama_Murder_13.2 etc. giving the details about an attribute of the state of Alabama in one line, we would do so in the following way.
df %>% gather(key = "Attribute",
value = "Value",
Murder:Rape) %>%
unite(col = "State_Attribute",
State, Attribute, Value,
sep="_")
## State_Attribute
## 1 Alabama_Murder_13.2
## 2 Hawaii_Murder_5.3
## 3 Massachusetts_Murder_4.4
## 4 New Mexico_Murder_11.4
## 5 Alabama_Assault_236
## 6 Hawaii_Assault_46
## 7 Massachusetts_Assault_149
## 8 New Mexico_Assault_285
## 9 Alabama_UrbanPop_58
## 10 Hawaii_UrbanPop_83
## 11 Massachusetts_UrbanPop_85
## 12 New Mexico_UrbanPop_70
## 13 Alabama_Rape_21.2
## 14 Hawaii_Rape_20.2
## 15 Massachusetts_Rape_16.3
## 16 New Mexico_Rape_32.1
Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.
Dealing with Missing Values - UC Business Analytics R Programming Guide. Accessed April 19, 2021. Available here.
“Reshaping Your Data with Tidyr · UC Business Analytics R Programming Guide.” Accessed April 20, 2021. Available here.
“Tidy Messy Data.” Accessed April 20, 2021. Available here.
“USArrests Function - RDocumentation.” Accessed April 20, 2021. Available here.
Wickham, Hadley and RStudio. Tidyr: Tidy Messy Data (version 1.1.3), 2021. Available here.