At the end of the lesson, the students are expected to:
Here, you’ll learn modern conventions for preparing and reshaping data in order to facilitate analyses in R. The process is called tidying the data.
A data set is called tidy when:
each column represents a variable and
each row represents an observation
The opposite of tidy is messy data, which corresponds to any other arrangement of the data.
Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.
The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It’s particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.
The tidyr package, provides four functions to help you change the layout of your data set:
gather(): gather (collapse) columns into rows
spread(): spread rows into columns
separate(): separate one column into multiple
unite(): unite multiple columns into one
** Example data sets**
We’ll use the R built-in USArrests data sets. We start by subsetting a small data set, which will be used in the next sections as an example data set:
Row names are states, so let’s use the function cbind() to add a column named “state” in the data. This will make the data tidy and the analysis easier.
The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one. It’s an alternative of melt() function [in reshape2 package].
Figure 5
1.) Simplified format:
gather(data, key, value, ...)
data: A data frame
key, value: Names of key and value columns to create in output
…: Specification of columns to gather. Allowed values are:
variable names
if you want to select all variables between a and e, use a:e
if you want to exclude a column name y use -y for more options, see: dplyr::select()
2.) Examples of usage:
Gather all columns except the column state
Note that, all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).
The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].
Figure 6
1.) Simplified format
spread(data, key, value)
2.) Examples of usage:
Spread “my_data2” to turn back to the original data:
The function unite() takes multiple columns and paste them together into one.
Figure 7
1.) Simplified format:
unite(data, col, ..., sep = "_")
2.) Examples of usage:
The R code below uses the data set “my_data” and unites the columns Murder and Assault
The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.
Figure 8
1.) Simplified format:
separate(data, col, into, sep = "[^[:alnum:]]+")
data: A data frame
col: Unquoted column names
into: Character vector specifying the names of new variables to be created.
sep: Separator between columns:
If character, is interpreted as a regular expression.
If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.
2.) Examples of usage:
Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault:
In this course, you will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:
## -- Attaching packages ------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v dplyr 1.0.2
## v tibble 3.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts --------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
You will learn how to use the following functions:
Figure 9
We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Select column by position
Select columns by names: Sepal.Length and Petal.Length
Select all columns from Sepal.Length to Petal.Length
There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.
# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))Note that, to remove a column from a data frame, prepend its name by minus -.
Removing Sepal.Length and Petal.Length columns:
Removing all columns from Sepal.Length to Petal.Length:
Removing all columns whose name starts with “Petal”:
Note that, if you want to drop columns by position, the syntax is as follow.
Figure 10
One-column based criteria: Extract rows where Sepal.Length > 7:
Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:
Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = “versicolor”:
Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = “versicolor” or Species = “virginica”):
Or, equivalently, use this shortcut (%in% operator):
This section presents 3 functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.
These functions replicate the logical criteria over all variables or a selection of variables.
Create a new demo data set from my_data by removing the grouping column “Species”:
We start by creating a data frame with missing values. In R NA (Not Available) is used to represent missing values:
# Create a data frame with missing data
friends_data <- data_frame(
name = c("A", "B", "C", "D"),
age = c(27, 25, 29, 26),
height = c(180, NA, NA, 169),
married = c("yes", "yes", "no", "no")
)## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Extract rows where height is NA:
Exclude (drop) rows where height is NA:
In the R code above, !is.na() means that “we don’t want” NAs.
It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.
Select the top 5 rows ordered by Sepal.Length
Group by the column Species and select the top 5 of each group ordered by Sepal.Length:
We have introduce how to filter a data frame rows using the dplyr package:
The R function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.
Given the following vector:
To find the position of duplicate elements in x, use this:
## [1] FALSE TRUE FALSE FALSE TRUE FALSE
Extract duplicate elements:
## [1] 1 4
If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:
## [1] 1 4 5 6
Following this way, you can remove duplicate rows from a data frame based on a column values, as follow:
! is a logical negation. !duplicated() means that we don’t want duplicate rows.
Given the following vector:
You can extract unique elements as follow:
## [1] 1 4 5 6
It’s also possible to apply unique() on a data frame, for removing duplicated rows as follow:
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().
Remove duplicate rows based on all columns:
Remove duplicate rows based on certain columns (variables):
# Remove duplicated rows based on
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)The option .kep_all is used to keep all variables in the data.
In this chapter, we describe key functions for identifying and removing duplicate data:
To rename the column Sepal.Length to sepal_length, the procedure is as follow:
1.) Get column names using the function names() or colnames() 2.) Change column names where name = Sepal.Length
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# Rename column where names is "Sepal.Length"
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
my_dataIt’s also possible to rename by index in names vector as follow:
Figure 11
Add new columns (sepal_by_petal_*) by preserving existing ones:
Add new columns (sepal_by_petal_*) by dropping existing ones:
my_data %>%
transmute(
sepal_by_petal_l = Sepal.Length/Petal.Length,
sepal_by_petal_w = Sepal.Width/Petal.Width
)We start by creating a demo data set, my_data2, which contains only numeric columns. To do so, we’ll remove the column Species as follow:
The functions mutate_all() / transmute_all(), mutate_at() / transmute_at() and mutate_if() / transmute_if() can be used to modify multiple columns at once.
The simplified formats are as follow:
# Mutate variants
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ...)
# Transmute variants
transmute_all(.tbl, .funs, ...)
transmute_if(.tbl, .predicate, .funs, ...)
transmute_at(.tbl, .vars, .funs, ...)
In the following sections, we’ll present only the variants of mutate(). The transmute() variants can be used similarly.
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Note that, the dot “.” represents any variables
Note that, the output variable name now includes the function name.
Note that, the output variable name now includes the function name.
mutate_if(): transform specific columns selected by a predicate function.
mutate_if() is particularly useful for transforming variables from one type to another.
Round all numeric variables:
This article describe how to add new variable columns into a data frame using the dplyr functions: mutate(), transmute() and variants.