Basic Data Manipulation with dplyr

Introduction

Data is usually available or stored in a relational database in a two-dimensional format, where each row is known as a record and each column is known as field. Fields are also known as variables. For example, we can have a two-dimensional patient data, where each row stores information for each patient and different columns store information regarding different characteristics like age, sex, height, weight etc. It looks like the following:

Patient Index	AGE	SEX	HEIGHT	WEIGHT
1	69	M	160	55
2	58	F	175	65
3	47	M	148	78
4	35	F	150	66

In R, data can also be stored in a two-dimensional structure which are known as Data Frame. A data frame can contains different types of variables or fields like numeric variable, character variable (factors), date-time variable (known as time-stamp in relational database concept) etc., this makes it handy to strore a real life data.

Data manipulation involves different tasks like filtering, merging, selecting some particular variables, creating new variables as a function of existing variables etc.

We will use Fisher’s iris data set to do some manipulation work with dplyr library developed by Hadley Wickham.

An alternative but handy Data Frame format

dplyr provides an efficient function tbl_df() which converts a regular dataframe object into a more good looking dataframe, called local dataframe object, which provides additional information like dimension of the actual dataframe, variable type of each field etc. It prints only that number of columns that fits your concole screen and only first 10 rows. It is particularly efficient when you have to get very basic idea about the data. It also allows you to avoid the use of additional functions like head() or tail().

Here is the example code:

rm(list = ls(all = TRUE))
# install.packages('dplyr', dependencies = TRUE) # install it if you don't have
library(dplyr, quietly = TRUE)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(iris) # load the iris data
iris2 <- tbl_df(iris)
print(iris2)

## Source: local data frame [150 x 5]
## 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## ..          ...         ...          ...         ...     ...

Data Manipulation

Now we have the data in suitable format and we are ready to manipulate it. The major verbs related to data manipulation using dplyr are filter, select, arrange, mutate, summarise, group_by etc. Let’s start one by one with the iris data set.

Data Filtering

Data filtering actually means to extract some of the records which are satisfying some given criteria. For the patient data, filtering may be to find all the patients who are male or whose heights are more than 160cm. For iris data, it may be to identify all the flowers which are setosa in species.

It’s generic form is filter(local_df, criteria).

iris2 <- tbl_df(iris)
filter(iris2, Species == 'setosa') # Species == 'setosa' is the criteria

## Source: local data frame [50 x 5]
## 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## ..          ...         ...          ...         ...     ...

Notice the dimension of the filtered data. It is of \(50 \times 5\) now.

Selecting some columns or fields

Sometimes it is necessary to extract some particular fields (but all the rows) from the entire data and then to work with them. For example, we may need the Petal.length field in the iris data to study it’s distribution.

It’s generic form is select(local_df, field_name_1, field_name_2, ...).

select(iris2, Petal.Length, Sepal.Width)

## Source: local data frame [150 x 2]
## 
##    Petal.Length Sepal.Width
##           (dbl)       (dbl)
## 1           1.4         3.5
## 2           1.4         3.0
## 3           1.3         3.2
## 4           1.5         3.1
## 5           1.4         3.6
## 6           1.7         3.9
## 7           1.4         3.4
## 8           1.5         3.4
## 9           1.4         2.9
## 10          1.5         3.1
## ..          ...         ...

Rearranging the rows by variables

The concept is very easy to grasp. Choose one (or more) variable(s) and rearrange the values of this variable(s) by increasing order such that it can change the actual order of the rows of the entire data frame.

It’s generic form is arrange(local_df, field_name_1, field_name_2, ...)

arrange(iris2, Sepal.Length, Petal.Width)

## Source: local data frame [150 x 5]
## 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
## 1           4.3         3.0          1.1         0.1  setosa
## 2           4.4         2.9          1.4         0.2  setosa
## 3           4.4         3.0          1.3         0.2  setosa
## 4           4.4         3.2          1.3         0.2  setosa
## 5           4.5         2.3          1.3         0.3  setosa
## 6           4.6         3.1          1.5         0.2  setosa
## 7           4.6         3.6          1.0         0.2  setosa
## 8           4.6         3.2          1.4         0.2  setosa
## 9           4.6         3.4          1.4         0.3  setosa
## 10          4.7         3.2          1.3         0.2  setosa
## ..          ...         ...          ...         ...     ...

Note that, if you specify a character variable or factor, then arrange() will rearrange all the values by alphabetical order.

Note again, to rearrange the values in descending order, we have to use desc() method as the following:

arrange(iris2, desc(Sepal.Length))

## Source: local data frame [150 x 5]
## 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##           (dbl)       (dbl)        (dbl)       (dbl)    (fctr)
## 1           7.9         3.8          6.4         2.0 virginica
## 2           7.7         3.8          6.7         2.2 virginica
## 3           7.7         2.6          6.9         2.3 virginica
## 4           7.7         2.8          6.7         2.0 virginica
## 5           7.7         3.0          6.1         2.3 virginica
## 6           7.6         3.0          6.6         2.1 virginica
## 7           7.4         2.8          6.1         1.9 virginica
## 8           7.3         2.9          6.3         1.8 virginica
## 9           7.2         3.6          6.1         2.5 virginica
## 10          7.2         3.2          6.0         1.8 virginica
## ..          ...         ...          ...         ...       ...

Notice the Species variable now, it starts with 'virginica'.

Create new variables

Sometimes we may need to compute new variables independently or as a function of existing variable(s). Thease are handled by mutate() function.

It’s generic form is mutate(local_df, newvar = function(oldvar_1, oldvar_2,...)).

mutate(iris2, log.Sepal.length = log(Sepal.Length))

## Source: local data frame [150 x 6]
## 
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## ..          ...         ...          ...         ...     ...
## Variables not shown: log.Sepal.length (dbl)

It creates the new variable as a function of the existing one, but doesn’t store it permanently in the previous dataframe. To store it permanently, save the dataframe as a new dataframe.

Summarising variables

Summarising is actually a dimension reduction process. It takes a vector and returns a scalar i.e. reduction from multi dimension to single dimension, sometimes these are also called data compression. Summarisation process involves computation of mean, sum, standard deviation etc. of a variable.

It’s generic form is summarise(local_df, mean(var_name))

summarise(iris2, mean(Petal.Length))

## Source: local data frame [1 x 1]
## 
##   mean(Petal.Length)
##                (dbl)
## 1              3.758

Computing Summary Statistics according to factor levels

Sometimes it is very obvious to compute summary statistics of some variables according to different levels of another categorical variable. For example, in iris data we may need to compute standard deviation of Petal.Length for each level of the Species. These can be handled by group_by(). First we have to convert a local dataframe into a grouped data frame using some categorical variable. Then we have to use summarise() to compute the summary statistics.

It’s generic form is group_by(local_df, categorical_var)

# create a grouped local dataframe
by_species <- group_by(iris2, Species)
summarise(by_species, sd(Petal.Length))

## Source: local data frame [3 x 2]
## 
##      Species sd(Petal.Length)
##       (fctr)            (dbl)
## 1     setosa        0.1736640
## 2 versicolor        0.4699110
## 3  virginica        0.5518947

alternatively, we can do it using tapply() also as follows:

# tapply(summarising_var, grouping_var, function)
tapply(iris2$Petal.Length, iris2$Species, sd)

##     setosa versicolor  virginica 
##  0.1736640  0.4699110  0.5518947

Chaining or Pipelining

Chaining or Pipelining a process by which several steps in data manipulation can be joined in one R statement. It enhances the readability of the code and also makes it smaller. The pipelining operator is %>%.

It’s generic form is

step_1 %>% step_2 %>% step_3 %>% ... %>% step_n

Let me first define the steps of our problem with iris dataset.

Filter the rows of iris data where Petal.Length > 2.
Convert the dataframe into a grouped dataframe using a grouping variable.
Compute the means of each numeric variables.

Let me first do it in usual way:

iris2.new <- filter(iris2, Petal.Length > 2)
by_spec <- group_by(iris2.new, Species)
summarise(by_spec, mean(Sepal.Width))

## Source: local data frame [2 x 2]
## 
##      Species mean(Sepal.Width)
##       (fctr)             (dbl)
## 1 versicolor             2.770
## 2  virginica             2.974

Note that petal length is less than 2 in setosa species, so records related to setosa species has been discarded while filtering.

Now we do the same using chaining as follows:

iris2 %>%
  filter(Petal.Length > 2) %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Width))

## Source: local data frame [2 x 2]
## 
##      Species mean(Sepal.Width)
##       (fctr)             (dbl)
## 1 versicolor             2.770
## 2  virginica             2.974

Another very small but interesting example of pipelining is computation of Euclidean distance of two vectors (you can find it in help file also),

x <- 1:10; y <- 3:12

(x - y) ** 2 %>%
  sum() %>%
  sqrt()

## [1] 6.324555