Lesson Objectives

At the end of the lesson, the students are expected to:

  • Learn how prepare data for analysis, and convert or reshape data using different functions
  • Learn how to manipulate data in R

Reshaping Data using tidyr

Here, you’ll learn modern conventions for preparing and reshaping data in order to facilitate analyses in R. The process is called tidying the data.

What is a tidy data set?

A data set is called tidy when:

  • each column represents a variable and

  • each row represents an observation

The opposite of tidy is messy data, which corresponds to any other arrangement of the data.

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It’s particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.

Reshaping data using tidyr package

The tidyr package, provides four functions to help you change the layout of your data set:

  • gather(): gather (collapse) columns into rows

  • spread(): spread rows into columns

  • separate(): separate one column into multiple

  • unite(): unite multiple columns into one

Installing and loading tidyr

# Installing
# install.packages("tidyr")
# Loading
library("tidyr")

** Example data sets**

We’ll use the R built-in USArrests data sets. We start by subsetting a small data set, which will be used in the next sections as an example data set:

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data

Row names are states, so let’s use the function cbind() to add a column named “state” in the data. This will make the data tidy and the analysis easier.

my_data <- cbind(state = rownames(my_data), my_data)
my_data

gather():

The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one. It’s an alternative of melt() function [in reshape2 package].

Figure 5

1.) Simplified format:

gather(data, key, value, ...)
  • data: A data frame

  • key, value: Names of key and value columns to create in output

  • …: Specification of columns to gather. Allowed values are:

  • variable names

  • if you want to select all variables between a and e, use a:e

  • if you want to exclude a column name y use -y for more options, see: dplyr::select()

2.) Examples of usage:

Gather all columns except the column state

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2

Note that, all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).

spread():

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative of the function cast() [in reshape2 package].

Figure 6

1.) Simplified format

spread(data, key, value)
  • data: A data frame
  • key: The (unquoted) name of the column whose values will be used as column headings.
  • value:The (unquoted) names of the column whose values will populate the cells.

2.) Examples of usage:

Spread “my_data2” to turn back to the original data:

my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
my_data3

unite()

The function unite() takes multiple columns and paste them together into one.

Figure 7

1.) Simplified format:

unite(data, col, ..., sep = "_")
  • data: A data frame
  • col: The new (unquoted) name of column to add.
  • sep: Separator to use between values

2.) Examples of usage:

The R code below uses the data set “my_data” and unites the columns Murder and Assault

my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4

separate()

The function sperate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.

Figure 8

1.) Simplified format:

separate(data, col, into, sep = "[^[:alnum:]]+")
  • data: A data frame

  • col: Unquoted column names

  • into: Character vector specifying the names of new variables to be created.

  • sep: Separator between columns:

  • If character, is interpreted as a regular expression.

  • If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.

2.) Examples of usage:

Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault:

separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

Manipulate

In this course, you will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:

  • filtering and ordering rows,
  • renaming and adding columns,
  • computing summary statistics
library("tidyverse")
## -- Attaching packages ------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v dplyr   1.0.2
## v tibble  3.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts --------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Select Data Frame Columns

You will learn how to use the following functions:

  • pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
  • select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
  • select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
  • Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names

Figure 9

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data

Extract column values as a vector

my_data %>% pull(Species)
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Extract columns as a data table**

Select column by position

  • Select columns 1 to 3:
my_data %>% select(1:3)
  • Select column 1 and 3 but not 2:
my_data %>% select(1, 3)

Select columns by names

Select columns by names: Sepal.Length and Petal.Length

my_data %>% select(Sepal.Length, Petal.Length)

Select all columns from Sepal.Length to Petal.Length

my_data %>% select(Sepal.Length:Petal.Length)

There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.

# Select column whose name starts with "Petal"
my_data %>% select(starts_with("Petal"))
# Select column whose name ends with "Width"
my_data %>% select(ends_with("Width"))
# Select columns whose names contains "etal"
my_data %>% select(contains("etal"))
# Select columns whose name matches a regular expression
my_data %>% select(matches(".t."))
# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))

Select column based on a condition

my_data %>% select_if(is.numeric)

Remove columns

Note that, to remove a column from a data frame, prepend its name by minus -.

Removing Sepal.Length and Petal.Length columns:

my_data %>% select(-Sepal.Length, -Petal.Length)

Removing all columns from Sepal.Length to Petal.Length:

my_data %>% select(-(Sepal.Length:Petal.Length))

Removing all columns whose name starts with “Petal”:

my_data %>% select(-starts_with("Petal"))

Note that, if you want to drop columns by position, the syntax is as follow.

# Drop column 1
my_data %>% select(-1)
# Drop columns 1 to 3
my_data %>% select(-(1:3))
# Drop columns 1 and 3 but not 2
my_data %>% select(-1, -3)

Subset Data Frame Rows

Figure 10

Extract rows by position

  • Key R function: slice() [dplyr package]
my_data %>% slice(1:6)

Filter rows by logical criteria

  • Key R function: filter() [dplyr package]. Used to filter rows that meet some logical criteria.

One-column based criteria: Extract rows where Sepal.Length > 7:

my_data %>% filter(Sepal.Length > 7)

Multiple-column based criteria: Extract rows where Sepal.Length > 6.7 and Sepal.Width ≤ 3:

my_data %>% filter(Sepal.Length > 6.7, Sepal.Width <= 3)

Test for equality (==): Extract rows where Sepal.Length > 6.5 and Species = “versicolor”:

my_data %>% filter(Sepal.Length > 6.7, Species == "versicolor")

Using OR operator (|): Extract rows where Sepal.Length > 6.5 and (Species = “versicolor” or Species = “virginica”):

my_data %>% filter(
  Sepal.Length > 6.7, 
  Species == "versicolor" | Species == "virginica"
  )

Or, equivalently, use this shortcut (%in% operator):

my_data %>% filter(
  Sepal.Length > 6.7, 
  Species %in% c("versicolor", "virginica" )
  )

Filter rows within a selection of variables

This section presents 3 functions - filter_all(), filter_if() and filter_at() - to filter rows within a selection of variables.

These functions replicate the logical criteria over all variables or a selection of variables.

Create a new demo data set from my_data by removing the grouping column “Species”:

my_data2 <- my_data %>% select(-Species)
  • Select rows where all variables are greater than 2.4:
my_data2 %>% filter_all(all_vars(.> 2.4))
  • Select rows when any of the variables are greater than 2.4:
my_data2 %>% filter_all(any_vars(.> 2.4))
  • Vary the selection of columns on which to apply the filtering criteria. filter_at() takes a vars() specification. The following R code apply the filtering criteria on the columns Sepal.Length and Sepal.Width:
my_data2 %>% filter_at(vars(starts_with("Sepal")), any_vars(. > 2.4))

Remove missing values

We start by creating a data frame with missing values. In R NA (Not Available) is used to represent missing values:

# Create a data frame with missing data
friends_data <- data_frame(
  name = c("A", "B", "C", "D"),
  age = c(27, 25, 29, 26),
  height = c(180, NA, NA, 169),
  married = c("yes", "yes", "no", "no")
)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Print
friends_data

Extract rows where height is NA:

friends_data %>% filter(is.na(height))

Exclude (drop) rows where height is NA:

friends_data %>% filter(!is.na(height))

In the R code above, !is.na() means that “we don’t want” NAs.

Select random rows from a data frame

It’s possible to select either n random rows with the function sample_n() or a random fraction of rows with sample_frac(). We first use the function set.seed() to initiate random number generator engine. This important for users to reproduce the analysis.

set.seed(1234)
# Extract 5 random rows without replacement
my_data %>% sample_n(5, replace = FALSE)
# Extract 5% of rows, randomly without replacement
my_data %>% sample_frac(0.05, replace = FALSE)

Select top n rows ordered by a variable

Select the top 5 rows ordered by Sepal.Length

my_data %>% top_n(5, Sepal.Length)

Group by the column Species and select the top 5 of each group ordered by Sepal.Length:

my_data %>% 
  group_by(Species) %>%
  top_n(5, Sepal.Length)

Summary

We have introduce how to filter a data frame rows using the dplyr package:

  • Filter rows by logical criteria: my_data %>% filter(Sepal.Length >7)
  • Select n random rows: my_data %>% sample_n(10)
  • Select a random fraction of rows: my_data %>% sample_frac(10)
  • Select top n rows by values: my_data %>% top_n(10, Sepal.Length)

Identify and Remove Duplicate Data

Find and drop duplicate elements

The R function duplicated() returns a logical vector where TRUE specifies which elements of a vector or data frame are duplicates.

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

To find the position of duplicate elements in x, use this:

duplicated(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Extract duplicate elements:

x[duplicated(x)]
## [1] 1 4

If you want to remove duplicated elements, use !duplicated(), where ! is a logical negation:

x[!duplicated(x)]
## [1] 1 4 5 6

Following this way, you can remove duplicate rows from a data frame based on a column values, as follow:

# Remove duplicates based on Sepal.Width columns
my_data[!duplicated(my_data$Sepal.Width), ]

! is a logical negation. !duplicated() means that we don’t want duplicate rows.

Extract unique elements

Given the following vector:

x <- c(1, 1, 4, 5, 4, 6)

You can extract unique elements as follow:

unique(x)
## [1] 1 4 5 6

It’s also possible to apply unique() on a data frame, for removing duplicated rows as follow:

unique(my_data)

Remove duplicate rows in a data frame

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

Remove duplicate rows based on all columns:

my_data %>% distinct()

Remove duplicate rows based on certain columns (variables):

# Remove duplicated rows based on Sepal.Length
my_data %>% distinct(Sepal.Length, .keep_all = TRUE)
# Remove duplicated rows based on 
# Sepal.Length and Petal.Width
my_data %>% distinct(Sepal.Length, Petal.Width, .keep_all = TRUE)

The option .kep_all is used to keep all variables in the data.

Summary

In this chapter, we describe key functions for identifying and removing duplicate data:

  • Remove duplicate rows based on one or more column values: my_data %>% dplyr::distinct(Sepal.Length)
  • R base function to extract unique elements from vectors and data frames: unique(my_data)
  • R base function to determine duplicate elements: duplicated(my_data)

Renaming columns

To rename the column Sepal.Length to sepal_length, the procedure is as follow:

1.) Get column names using the function names() or colnames() 2.) Change column names where name = Sepal.Length

# get column names
colnames(my_data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
# Rename column where names is "Sepal.Length"
names(my_data)[names(my_data) == "Sepal.Length"] <- "sepal_length"
names(my_data)[names(my_data) == "Sepal.Width"] <- "sepal_width"
my_data

It’s also possible to rename by index in names vector as follow:

names(my_data)[1] <- "sepal_length"
names(my_data)[2] <- "sepal_width"

Compute and Add new Variables to a Data Frame

Figure 11

mutate: Add new variables by preserving existing ones

Add new columns (sepal_by_petal_*) by preserving existing ones:

my_data <- as_tibble(iris)
my_data %>% mutate(sepal_by_petal_l=Sepal.Length/Petal.Length)

transmute: Make new variables by dropping existing ones

Add new columns (sepal_by_petal_*) by dropping existing ones:

my_data %>%
  transmute(
    sepal_by_petal_l = Sepal.Length/Petal.Length,
    sepal_by_petal_w = Sepal.Width/Petal.Width
    )

Modify multiple columns at once

We start by creating a demo data set, my_data2, which contains only numeric columns. To do so, we’ll remove the column Species as follow:

my_data2 <- my_data %>%
  select(-Species)

The functions mutate_all() / transmute_all(), mutate_at() / transmute_at() and mutate_if() / transmute_if() can be used to modify multiple columns at once.

The simplified formats are as follow:

# Mutate variants
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ...)

# Transmute variants
transmute_all(.tbl, .funs, ...)
transmute_if(.tbl, .predicate, .funs, ...)
transmute_at(.tbl, .vars, .funs, ...)
  • .tbl: a tbl data frame
  • .funs: List of function calls generated by funs(), or a character vector of function names, or simply a function.
  • …: Additional arguments for the function calls in .funs.
  • .predicate: A predicate function to be applied to the columns or a logical vector. The variables for which .predicate is or returns TRUE are selected.

In the following sections, we’ll present only the variants of mutate(). The transmute() variants can be used similarly.

Transform all column values

  • Divide all columns value by 2.54:
my_data2 %>%
  mutate_all(funs(./2.54))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Note that, the dot “.” represents any variables

  • Function names will be appended to column names if .funs has names or multiple inputs:
my_data2 %>%mutate_all(funs(cm = ./2.54))

Note that, the output variable name now includes the function name.

  • Function names will be appended to column names if .funs has names or multiple inputs:
my_data2 %>%
  mutate_all(funs(cm = ./2.54))

Note that, the output variable name now includes the function name.

Transform specific columns

  • mutate_at(): transform specific columns selected by names:
my_data2 %>%
  mutate_at(
    c("Sepal.Length", "Petal.Width"),
    funs(cm = ./2.54)
    )
  • mutate_if(): transform specific columns selected by a predicate function.

    mutate_if() is particularly useful for transforming variables from one type to another.

my_data %>% mutate_if(is.factor, as.character)

Round all numeric variables:

my_data %>% mutate_if(is.numeric, round, digits = 0)

Summary

This article describe how to add new variable columns into a data frame using the dplyr functions: mutate(), transmute() and variants.

  • mutate(iris, sepal = 2*Sepal.Length): Computes and appends new variable(s).
  • transmute(iris, sepal = 2*Sepal.Length): Makes new variable(s) and drops existing ones.