1. Introduction

This is an R Markdown Notebook that illustrates the main features of the dplyr package.

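dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: mutate() adds new variables that are functions of existing variables, select() picks variables based on their names, filter() picks cases based on their values, summarise() reduces multiple values down to a single summary, and arrange() changes the ordering of the rows.

These all combine naturally with group_by(), which allows you to perform any operation “by group”.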

2. Setup

First, we need to install the dplyr package. Make sure you install it from the console rather than from a chunk in this notebook.

# DO NOT RUN THIS CHUNK FROM HERE
#install.packages("dplyr")

Now, we load the library:

# Load the dplyr library
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
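
The startup message above says that some base R functions are now masked by dplyr functions with the same name. As a minimal sketch (the object names setosa and moving_avg are only illustrative), the masked versions stay reachable through their namespace prefix:

```r
library(dplyr)

# The bare name filter() now refers to dplyr::filter(), which subsets rows
setosa <- filter(iris, Species == "setosa")

# The masked stats::filter() is still available with an explicit prefix.
# It applies a linear filter to a numeric series; here, a 3-point moving average.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
```

Writing dplyr::filter() explicitly works as well, which can help when a script loads several packages with clashing names.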

3. The Iris dataset

To illustrate these features we need some example data. Let’s use the famous iris dataset.

The Iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total.

[Figure: illustration of the three Iris species.]

Let’s import the data into our notebook:

# Load the Iris dataset
data(iris)

What is the class of the iris object?

class(iris)
## [1] "data.frame"

How can we find out its size?

dim(iris)
## [1] 150   5

The above result means that iris is composed of 150 rows and 5 columns.
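
If you only need one of the two dimensions, or want a compact overview of the column types, base R offers a few alternatives to dim(). A small sketch (the variable names n_rows and n_cols are only illustrative):

```r
# Number of rows and number of columns, separately
n_rows <- nrow(iris)
n_cols <- ncol(iris)

# Compact overview: one line per column with its type and first values
str(iris)
```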

Let’s have a look at its head:

head(iris)

Now, have a look at the last rows:

tail(iris)

What are the summary statistics of iris?

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

3. The dplyr functionalities

The filter() function picks cases based on their values. Let’s pick the rows corresponding to the setosa species:

# Filtering Rows
(iris_filtered <- iris %>%
  filter(Species == "setosa"))
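
filter() also accepts several conditions at once. The sketch below is an illustration only (the object names and the threshold of 5 are arbitrary choices): conditions separated by commas must all hold, and %in% matches any of several values.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
iris_setosa_long <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5)

# %in% keeps the rows whose value matches any element of the vector
iris_two_species <- iris %>%
  filter(Species %in% c("setosa", "virginica"))
```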

The select() function picks variables based on their names. Let’s pick the columns Sepal.Length and Sepal.Width:

# Selecting Columns
(iris_selected <- iris %>%
  select(Sepal.Length, Sepal.Width))
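
Columns do not have to be listed one by one. As a hedged sketch (the object names are only illustrative), select() also understands helper functions such as starts_with() and a minus sign to drop columns:

```r
library(dplyr)

# Keep every column whose name starts with "Petal"
iris_petals <- iris %>%
  select(starts_with("Petal"))

# Keep everything except the Species column
iris_no_species <- iris %>%
  select(-Species)
```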

Now, let’s sort the rows by the values of a particular column using the arrange() function:

# Arranging (Sorting) Rows
(iris_sorted <- iris %>%
  arrange(Sepal.Length))

Note that the rows are sorted in ascending order of the column we named in the code.

Is it possible to sort rows in descending order? Let’s see.

# Arranging (Sorting) Rows in Descending Order
(iris_sorted <- iris %>%
  arrange(desc(Sepal.Length)))
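
arrange() can take more than one column, in which case later columns break ties in the earlier ones. A minimal sketch (iris_sorted_multi is an illustrative name): sort by species first and, within each species, by sepal length from largest to smallest.

```r
library(dplyr)

# Sort by Species (ascending), then by Sepal.Length (descending) within each species
iris_sorted_multi <- iris %>%
  arrange(Species, desc(Sepal.Length))

head(iris_sorted_multi)
```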

The mutate() function creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

# Mutating (Creating/Modifying Columns)
(iris_mutated <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width))

Make sure you check the values of the column Petal.Ratio in the table above.
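
As noted above, mutate() can also overwrite or drop columns. The following sketch is only an illustration (the rescaling from centimetres to millimetres and the name iris_modified are arbitrary choices): it modifies Sepal.Length in place and deletes Species by setting it to NULL.

```r
library(dplyr)

iris_modified <- iris %>%
  mutate(
    Sepal.Length = Sepal.Length * 10,  # same name as an existing column: overwrites it
    Species = NULL                     # NULL removes the column
  )

names(iris_modified)
```

The original iris data frame is left untouched; only the new object reflects the changes.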

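The group_by() function groups the rows by one or more variables, and summarise() then reduces each group to a single row of summary values:

# Grouping and Summarizing Data
(iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Avg.Sepal.Length = mean(Sepal.Length),
    Min.Petal.Width = min(Petal.Width),
    Max.Petal.Length = max(Petal.Length)
  ))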
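
A grouped summary can mix any functions that return a single value per group. As a small sketch (the column names Count and Median.Petal.Length are only illustrative), n() counts the rows in each group:

```r
library(dplyr)

iris_counts <- iris %>%
  group_by(Species) %>%
  summarise(
    Count = n(),                                # number of rows in the group
    Median.Petal.Length = median(Petal.Length)  # any single-value summary works here
  )

iris_counts
```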

4. The pipe

All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr.

x %>% f(y) turns into f(x, y) so the result from one step is then “piped” into the next step.
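
To make this rule concrete, the sketch below (with the illustrative names piped and direct) calls head() once through the pipe and once directly, then checks that both give the same result:

```r
library(dplyr)

# iris %>% head(3) is rewritten as head(iris, 3)
piped  <- iris %>% head(3)
direct <- head(iris, 3)

# Both calls should return the same three rows
identical(piped, direct)
```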

You can use the pipe to rewrite multiple operations so that they read left-to-right, top-to-bottom (reading the pipe operator as “then”).

The summarise() function computes a summary for each group. This means that it starts from group_keys(), adding the summary variables to the right-hand side.
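
A short sketch of this behaviour, under the assumption that we group by Species (the names by_species, keys and summary_tbl are only illustrative): group_keys() returns one row per group, and the summarised table holds those same keys followed by the new summary column.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species)

# One row per group, with the grouping column only
keys <- group_keys(by_species)

# The same keys, with the summary column added on the right
summary_tbl <- by_species %>%
  summarise(Avg.Petal.Width = mean(Petal.Width))

keys
summary_tbl
```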

Let’s use the pipe operator in order to summarise the iris dataset.

# Chaining Operations
(iris_result <- iris %>%
  filter(Sepal.Length > 5.0) %>%
  group_by(Species) %>%
  summarise(Avg.Petal.Length = mean(Petal.Length)))
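
For comparison, here is a sketch of the same computation written without the pipe, first with intermediate objects and then with nested calls (all object names are only illustrative). Both are expected to give the same table as the chained version, but the nested form has to be read from the inside out.

```r
library(dplyr)

# Option 1: save each intermediate result
long_sepals <- filter(iris, Sepal.Length > 5.0)
long_sepals_by_species <- group_by(long_sepals, Species)
result_steps <- summarise(long_sepals_by_species,
                          Avg.Petal.Length = mean(Petal.Length))

# Option 2: nest the calls; the innermost call runs first
result_nested <- summarise(
  group_by(filter(iris, Sepal.Length > 5.0), Species),
  Avg.Petal.Length = mean(Petal.Length)
)
```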

5. Homework

Write a chunk that summarizes the iris dataset based on the Sepal.Width column.

# Write your code here
#
#

6. The environment

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Bogota
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.2
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.2      cli_3.6.1        knitr_1.43       rlang_1.1.1     
##  [5] xfun_0.39        generics_0.1.3   jsonlite_1.8.5   glue_1.6.2      
##  [9] htmltools_0.5.5  sass_0.4.6       fansi_1.0.4      rmarkdown_2.22  
## [13] evaluate_0.21    jquerylib_0.1.4  tibble_3.2.1     fastmap_1.1.1   
## [17] yaml_2.3.7       lifecycle_1.0.3  compiler_4.3.0   pkgconfig_2.0.3 
## [21] rstudioapi_0.14  digest_0.6.31    R6_2.5.1         tidyselect_1.2.0
## [25] utf8_1.2.3       pillar_1.9.0     magrittr_2.0.3   bslib_0.4.2     
## [29] withr_2.5.0      tools_4.3.0      cachem_1.0.8