This is an R Markdown Notebook that illustrates the main features of the dplyr package.
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: mutate(), select(), filter(), summarise(), and arrange().
These all combine naturally with group_by(), which allows you to perform any operation “by group”.
First, we need to install the dplyr package. Make sure you install it from the console rather than from this notebook.
# DO NOT RUN THIS CHUNK FROM HERE
#install.packages("dplyr")
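If you are not sure whether dplyr is already installed, a small check like the one below can tell you without installing anything. This is only a sketch: the variable name dplyr_available is made up for illustration, and the message text is an assumption about what you might want to print.

```r
# Check whether dplyr is already installed, without installing anything
dplyr_available <- requireNamespace("dplyr", quietly = TRUE)

if (dplyr_available) {
  message("dplyr ", as.character(packageVersion("dplyr")), " is already installed.")
} else {
  message("dplyr is missing: run install.packages(\"dplyr\") from the console.")
}
```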
Now, we load the library:
# Load the dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
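The messages above say that dplyr's filter() and lag() now hide the functions of the same name in the stats package. The masked versions are still reachable with the package::function syntax. The following sketch shows both filter() functions side by side; the moving-average weights are an arbitrary choice made only for illustration.

```r
library(dplyr)

# dplyr's filter(): keep the rows that satisfy a condition
setosa_rows <- dplyr::filter(iris, Species == "setosa")
nrow(setosa_rows)

# stats' filter(): linear filtering of a numeric series
# (a 3-point moving average, chosen only for illustration)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```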
To illustrate these features, we need some example data. Let’s use the famous iris dataset.
The iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
This is an illustration of the three Iris species:
Let’s import the data into our notebook:
# Load the Iris dataset
data(iris)
What is the class of the iris object?
class(iris)
## [1] "data.frame"
How can we find out its size?
dim(iris)
## [1] 150 5
The above result means that iris is composed of 150 rows and 5 columns.
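The two numbers returned by dim() can also be obtained one at a time, together with the column names. A minimal sketch using base R only (the variable names are made up for illustration):

```r
# Number of rows and columns, retrieved separately
n_rows <- nrow(iris)
n_cols <- ncol(iris)

# Names of the columns
col_names <- names(iris)

n_rows
n_cols
col_names
```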
Let’s have a look at its head:
head(iris)
Now, have a look at the last rows:
tail(iris)
What are the summary statistics of iris?
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##
##
##
The filter() function picks cases based on their values. Let’s pick the rows corresponding to the setosa species:
# Filtering Rows
(iris_filtered <- iris %>%
  filter(Species == "setosa"))
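filter() also accepts several conditions at once: conditions separated by commas must all hold, and %in% matches against a set of values. The sketch below assumes nothing beyond dplyr and the iris data; the threshold of 5 and the object names are arbitrary choices for illustration.

```r
library(dplyr)

# Rows that are setosa AND have a sepal longer than 5
iris_setosa_long <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5)

# Rows whose species is one of two values
iris_two_species <- iris %>%
  filter(Species %in% c("setosa", "virginica"))

head(iris_setosa_long)
nrow(iris_two_species)
```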
The select() function picks variables based on their names. Let’s pick the columns Sepal.Length and Sepal.Width:
# Selecting Columns
(iris_selected <- iris %>%
  select(Sepal.Length, Sepal.Width))
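Columns do not have to be listed one by one. As a small sketch, select() can pick columns by a name prefix with starts_with(), or drop a column with a minus sign (the object names are made up for illustration):

```r
library(dplyr)

# Keep only the columns whose names start with "Petal"
iris_petals <- iris %>%
  select(starts_with("Petal"))

# Keep every column except Species
iris_no_species <- iris %>%
  select(-Species)

names(iris_petals)
names(iris_no_species)
```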
The arrange() function changes the order of the rows. Let’s sort the rows by the Sepal.Length column:
# Arranging (Sorting) Rows
(iris_sorted <- iris %>%
  arrange(Sepal.Length))
Note that the rows are sorted in ascending order of the column we named in the code.
Is it possible to sort rows in descending order? Let’s see.
# Arranging (Sorting) Rows in Descending Order
(iris_sorted <- iris %>%
  arrange(desc(Sepal.Length)))
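arrange() can also sort by more than one column: later columns break ties in the earlier ones. The following sketch sorts by species first and, within each species, by decreasing sepal length (the object name iris_sorted_multi is made up for illustration):

```r
library(dplyr)

# Sort by Species, then by Sepal.Length from largest to smallest
iris_sorted_multi <- iris %>%
  arrange(Species, desc(Sepal.Length))

head(iris_sorted_multi, 3)
```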
The mutate() function creates new columns that are functions of existing variables. It can also modify columns (if the name is the same as an existing column) and delete them (by setting their value to NULL).
# Mutating (Creating/Modifying Columns)
(iris_mutated <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width))
Check the values of the Petal.Ratio column in the table above.
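The other two uses of mutate(), modifying and deleting columns, can be sketched in the same way. The conversion factor below is only an illustrative choice (the iris measurements are recorded in centimetres, so multiplying by 10 gives millimetres), and the object name iris_modified is made up.

```r
library(dplyr)

iris_modified <- iris %>%
  mutate(
    Sepal.Length = Sepal.Length * 10,  # same name: overwrites the existing column
    Petal.Width = NULL                 # NULL: removes the column
  )

names(iris_modified)
head(iris_modified)
```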
# Grouping Data and Summarizing Data
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Avg.Sepal.Length = mean(Sepal.Length),
    Min.Petal.Width = min(Petal.Width),
    Max.Petal.Length = max(Petal.Length)
  )
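Unlike the earlier chunks, this one is not wrapped in parentheses, so the result is stored without being printed. A minimal sketch of a similar summary that is printed explicitly and also counts the rows in each group with n() (the object name iris_counts is made up for illustration):

```r
library(dplyr)

iris_counts <- iris %>%
  group_by(Species) %>%
  summarise(
    n = n(),                               # number of rows in each group
    Avg.Sepal.Length = mean(Sepal.Length)  # mean sepal length per group
  )

# Print the stored result
iris_counts
```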
All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr.
x %>% f(y) turns into f(x, y) so the result from one step is then “piped” into the next step.
You can use the pipe to rewrite multiple operations so that you can read them left-to-right, top-to-bottom (reading the pipe operator as “then”).
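To make the difference concrete, here is the same sequence of steps written twice: once with nested function calls, which must be read from the inside out, and once with the pipe. This is only a sketch; the particular steps and the object names nested and piped are arbitrary choices for illustration.

```r
library(dplyr)

# Nested calls: read from the innermost function outwards
nested <- head(arrange(filter(iris, Species == "setosa"), Sepal.Length), 3)

# Piped calls: take iris, then filter, then arrange, then keep 3 rows
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(Sepal.Length) %>%
  head(3)

# Both versions should give the same data frame
identical(nested, piped)
```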
The summarise() function computes a summary for each group. The result starts from the group keys (the values returned by group_keys()) and adds the summary variables to the right-hand side.
Let’s use the pipe operator in order to summarise the iris dataset.
# Chaining Operations
(iris_result <- iris %>%
  filter(Sepal.Length > 5.0) %>%
  group_by(Species) %>%
  summarise(Avg.Petal.Length = mean(Petal.Length)))
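The group keys that summarise() starts from can be inspected directly with group_keys(). The sketch below compares them with the first column of a summary; the object names keys and result are made up for illustration.

```r
library(dplyr)

# The keys: one row per group, one column per grouping variable
keys <- iris %>%
  group_by(Species) %>%
  group_keys()

# A summary of the same groups: keys on the left, summary column on the right
result <- iris %>%
  group_by(Species) %>%
  summarise(Avg.Petal.Length = mean(Petal.Length))

keys
result
```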
Now it is your turn: write a chunk that summarises the iris dataset based on the Sepal.Width column.
# Write your code here
#
#
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Bogota
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.2 cli_3.6.1 knitr_1.43 rlang_1.1.1
## [5] xfun_0.39 generics_0.1.3 jsonlite_1.8.5 glue_1.6.2
## [9] htmltools_0.5.5 sass_0.4.6 fansi_1.0.4 rmarkdown_2.22
## [13] evaluate_0.21 jquerylib_0.1.4 tibble_3.2.1 fastmap_1.1.1
## [17] yaml_2.3.7 lifecycle_1.0.3 compiler_4.3.0 pkgconfig_2.0.3
## [21] rstudioapi_0.14 digest_0.6.31 R6_2.5.1 tidyselect_1.2.0
## [25] utf8_1.2.3 pillar_1.9.0 magrittr_2.0.3 bslib_0.4.2
## [29] withr_2.5.0 tools_4.3.0 cachem_1.0.8