A deep dive into purrr functions for filtering lists

The purrr package is a set of functions that facilitate iterating over R objects. It is part of the tidyverse, a set of R packages designed to make data science ‘faster, easier and more fun.’ This vignette demonstrates several purrr functions used to filter lists: pluck(), head_while(), tail_while(), compact(), keep() and discard().

Global data on shark attacks from Kaggle will be used for demonstration.

Shark attack data set

A .csv of the dataset has been uploaded to be accessed here:

## [1] 25723    24
##  [1] "Case.Number"            "Date"                   "Year"                  
##  [4] "Type"                   "Country"                "Area"                  
##  [7] "Location"               "Activity"               "Name"                  
## [10] "Sex"                    "Age"                    "Injury"                
## [13] "Fatal..Y.N."            "Time"                   "Species"               
## [16] "Investigator.or.Source" "pdf"                    "href.formula"          
## [19] "href"                   "Case.Number.1"          "Case.Number.2"         
## [22] "original.order"         "X"                      "X.1"
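
The chunk that loaded the file is not shown above, so the following is only a minimal sketch of that step. It writes a tiny stand-in CSV to a temporary file (the rows and the object name `attacks` are made up for illustration) and reads it back with `read.csv()`. It also shows why the column names above contain dots: by default `read.csv()` rewrites headers into syntactically valid R names.

```r
# Toy stand-in for the real file; every value below is made up.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c('"Case Number","Date","Fatal (Y/N)"',
    '"toy-1","31-Oct-2003","N"',
    '"toy-2","25-Jun-2018","Y"'),
  tmp
)

# check.names = TRUE (the default) turns "Fatal (Y/N)" into "Fatal..Y.N."
attacks <- read.csv(tmp, stringsAsFactors = FALSE)

dim(attacks)    # number of rows, then number of columns
names(attacks)  # the sanitised column names
```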


The necessary libraries:


pluck()

used to select elements by name or by index number
Suppose we are interested in a particular shark attack incident, such as Bethany Hamilton’s dramatic story, and we would like to return a specific feature, for example the ‘Date’:

## [1] "31-Oct-2003"
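
As a rough illustration of this kind of call (not the original chunk), the sketch below builds a small hypothetical record as a plain list and selects from it by name and by position. The object `incident` and its layout are assumptions; only the name and date come from the text.

```r
library(purrr)

# Hypothetical single-incident record, stored as a named list.
incident <- list(Name = "Bethany Hamilton", Date = "31-Oct-2003")

by_name  <- pluck(incident, "Date")  # select by name
by_index <- pluck(incident, 2)       # select by position

# A missing element yields NULL unless .default is supplied.
no_age <- pluck(incident, "Age", .default = NA)

by_name
```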


That was simple enough! But pluck() becomes increasingly useful when deeper indexing into a data structure is necessary. Suppose we reformat the ‘Date’ feature and split the string on the delimiter ‘-’:

Here we use pluck() to return the string corresponding to the month of Bethany Hamilton’s incident.

## [1] "Oct"
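
One way the split-then-pluck step might look is sketched below on a hypothetical one-record list; the original code works on the full data set and may differ. The point is that pluck() takes a whole path of names and positions in one call.

```r
library(purrr)

# Hypothetical record; the real data holds one such date per row.
incident <- list(Name = "Bethany Hamilton", Date = "31-Oct-2003")

# strsplit() returns a list with one character vector per input string,
# so Date becomes a list nested inside the record.
incident$Date <- strsplit(incident$Date, "-")

# Path: the "Date" element, its first entry, then the second piece.
month <- pluck(incident, "Date", 1, 2)

# The same lookup with chained base R indexing, for comparison.
same_as_base <- identical(month, incident[["Date"]][[1]][[2]])
month
```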


We will now use purrr list filtering functions to find the frequency of shark attacks by the month in which they occur.

The previous call to pluck() worked. However, further inspection of the ‘Date’ feature reveals that the data is rather dirty. The length of the list elements in ‘Date’ is variable:

## [1] 3 1 2 4 5 0
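
A check like this can be sketched with `map_int()`, which applies a function to each element and returns an integer vector. The toy dates below are invented to cover a few of the shapes seen in the output; note that splitting an empty string gives a zero-length vector, which is where a length of 0 comes from.

```r
library(purrr)

# Made-up date strings: well formed, partial, year only, and empty.
dates <- c("25-Jun-2018", "Jun-2018", "1960", "")
parts <- strsplit(dates, "-")

# Number of pieces each date was split into.
piece_counts <- map_int(parts, length)
unique(piece_counts)
```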


Something is fishy! We will explore the data with purrr list filtering functions.

head_while()

returns the top elements down to the point where an element fails a logical test

## [[1]]
## [1] "25"   "Jun"  "2018"
## 
## [[2]]
## [1] "18"   "Jun"  "2018"
## 
## [[3]]
## [1] "09"   "Jun"  "2018"
## 
## [[4]]
## [1] "08"   "Jun"  "2018"
## 
## [[5]]
## [1] "04"   "Jun"  "2018"
## 
## [[6]]
## [1] "03"   "Jun"  "2018"
## 
## [[7]]
## [1] "03"   "Jun"  "2018"
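
The predicate that produced the output above is not shown, so the following is a sketch with an assumed test (three pieces, with "Jun" as the month) on toy data. It illustrates the key behaviour: head_while() stops at the first failure, even if later elements would pass.

```r
library(purrr)

# Toy split dates; the third one is malformed.
parts <- list(
  c("25", "Jun", "2018"),
  c("18", "Jun", "2018"),
  c("May", "2018"),
  c("03", "Jun", "2018")
)

# Assumed predicate: a complete date that falls in June.
is_june <- function(d) length(d) == 3 && d[2] == "Jun"

# Only the leading run of passing elements is returned.
june_run <- head_while(parts, is_june)
length(june_run)
```

The fourth element passes the test but is not returned, because the run was already broken by the third.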

tail_while()

returns the bottom elements up to the point where an element fails a logical test

## [1] 19421
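
A count like the one above could be obtained along these lines. This is a sketch on toy data, assuming that "empty" means a zero-length vector: tail_while() walks up from the bottom and stops at the first element that is not empty.

```r
library(purrr)

# Toy split dates with one empty element in the middle and two at the end.
parts <- list(
  c("25", "Jun", "2018"),
  character(0),
  "1960",
  character(0),
  character(0)
)

# Trailing run of empty elements only; the one in position 2 is not reached.
trailing_empty <- tail_while(parts, function(d) length(d) == 0)
length(trailing_empty)
```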

The above output shows that there are over 19K empty elements taking up space at the bottom of this set! Now to remove them…

compact()

used to drop empty elements. Here it removes all the dates where no value was entered.

## the compact() function eliminated 19421 empty elements
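
A minimal sketch of this step on toy data follows; the message format mirrors the output above, but the object names are assumptions. Unlike tail_while(), compact() removes every empty element, wherever it sits in the list.

```r
library(purrr)

# Toy split dates with empty elements in the middle and at the end.
parts <- list(
  c("25", "Jun", "2018"),
  character(0),
  "1960",
  character(0),
  character(0)
)

# Drop all NULL or zero-length elements.
cleaned <- compact(parts)

cat("the compact() function eliminated",
    length(parts) - length(cleaned), "empty elements\n")
```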

Properly formatted dates result in a list element of length 3 (c("DD", "MMM", "YYYY")). Here we use the purrr function keep() to select all the cases (rows) where the length == 3.

keep()

used to select elements that pass a logical test

## [1] 3
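
The length == 3 filter described above can be sketched as follows on toy data, with a `map_int()` check afterwards to confirm that only one length remains.

```r
library(purrr)

# Toy split dates of mixed quality.
parts <- list(
  c("25", "Jun", "2018"),
  "1960",
  c("Jun", "2018"),
  c("31", "Oct", "2003")
)

# Keep only elements with exactly three pieces (day, month, year).
well_formed <- keep(parts, function(d) length(d) == 3)

# Every remaining element should now have length 3.
unique(map_int(well_formed, length))
```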


Alternatively, discard() can be used to remove values based on a criterion…

discard()

used to drop elements that pass a logical test, keeping only those that fail it

## [1] 3 1 4 5
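
The predicate used for the output above is not shown, so the sketch below uses an assumed one on toy data. It illustrates how the two functions relate: discard() with a test gives the same result as keep() with the opposite test.

```r
library(purrr)

# Toy split dates of mixed quality.
parts <- list(
  c("25", "Jun", "2018"),
  "1960",
  c("Jun", "2018"),
  c("31", "Oct", "2003")
)

# Assumed predicate: flag anything that is not a three-piece date.
is_malformed <- function(d) length(d) != 3

# discard() removes the flagged elements ...
kept_by_discard <- discard(parts, is_malformed)

# ... which matches keep() with the negated test.
kept_by_keep <- keep(parts, function(d) !is_malformed(d))
identical(kept_by_discard, kept_by_keep)
```

Which of the two to use is mostly a matter of which predicate reads more naturally.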


We will now plot the cleaned date data.

## 
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Sep 
## 420 553 415 356 493 621 475 379 358 377 520
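
The counting and plotting code is not shown, so this is one possible sketch using toy data and base R graphics. It assumes the month is the second piece of each cleaned date, and that the reference line is the mean of the monthly counts.

```r
library(purrr)

# Toy cleaned dates, each already split into day, month, year.
parts <- list(
  c("25", "Jun", "2018"),
  c("18", "Jun", "2018"),
  c("31", "Oct", "2003"),
  c("04", "Jul", "2018")
)

# An integer passed to map_chr() extracts that position from each element.
months <- map_chr(parts, 2)

# Tabulate attacks per month; table() sorts the names alphabetically.
month_counts <- table(months)
month_counts

# Bar plot with a red line at the mean count (assumed meaning of the line).
barplot(month_counts, xlab = "Month", ylab = "Attacks")
abline(h = mean(month_counts), col = "red")
```

Because table() orders the months alphabetically, converting `months` to a factor with calendar-ordered levels first would give a chronological x-axis.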


The bar plot above shows the total shark attacks per month, with a red horizontal line indicating the average number of attacks per month. Shark attacks are much more frequent in the summer months, with another surge in January (presumably corresponding to an increase in vacationers).

We got to sink our teeth into several purrr functions for filtering lists. Hopefully this will help your future data wrangling go swimmingly!