Tidyverse: the purrr(fect) list

a deep dive into purrr functions for filtering lists

The purrr package is a set of functions that facilitates iterating over R objects. It is part of the tidyverse, a set of R packages designed to make data science ‘faster, easier and more fun.’ This vignette demonstrates several purrr functions used to filter lists: pluck(), head_while(), tail_while(), compact(), keep() and discard().

Global data on shark attacks from Kaggle will be used for demonstration.

shark attack data set

A .csv of the dataset has been uploaded to be accessed here:

#github raw url
attacksURL <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/attacks.csv'
#read the url into an R data.frame
attacks_df <- read.csv( attacksURL, stringsAsFactors = FALSE )
#show the dimensions...
dim( attacks_df )

## [1] 25723    24

#...and features of the data.frame
colnames( attacks_df )

##  [1] "Case.Number"            "Date"                   "Year"                  
##  [4] "Type"                   "Country"                "Area"                  
##  [7] "Location"               "Activity"               "Name"                  
## [10] "Sex"                    "Age"                    "Injury"                
## [13] "Fatal..Y.N."            "Time"                   "Species"               
## [16] "Investigator.or.Source" "pdf"                    "href.formula"          
## [19] "href"                   "Case.Number.1"          "Case.Number.2"         
## [22] "original.order"         "X"                      "X.1"

the necessary libraries:

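library( tidyverse ) #attaches purrr, dplyr and ggplot2, among others
library( dplyr )     #already attached by tidyverse; listed for clarity
library( ggplot2 )   #already attached by tidyverse; listed for clarity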

pluck()

used to select elements by name or by index number
Suppose we are interested in a particular shark attack incident, such as Bethany Hamilton’s dramatic story, and we would like to return a specific feature, for example the ‘Date’:

#index row number for Bethany's record
idx <- which( attacks_df$Name == 'Bethany Hamilton')
#pluck the corresponding 'Date' element by name...
bHam_date <- pluck( attacks_df, 'Date', idx )
#...or by the feature's column index number
#bHam_date <- pluck( attacks_df, 2, idx )
bHam_date

## [1] "31-Oct-2003"

That was simple enough! However, pluck() becomes increasingly useful in situations where deeper indexing into a data structure is necessary. Suppose we reformat the ‘Date’ feature and split each string on the delimiter ‘-’:

#split the date strings; strsplit() already returns a list of character vectors,
#which is stored here as a list column of the data.frame
attacks_df$DateSplit <- strsplit( attacks_df$Date, '-' )

Here we use pluck() to return the string corresponding to the month of Bethany Hamilton’s incident.

pluck( attacks_df, 'DateSplit', idx, 2 )

## [1] "Oct"
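The same deep indexing can be tried on a toy structure, independent of the shark data. The sketch below uses a small made-up nested list (`incident` and its field names are hypothetical) to show that names and positions can be mixed, and that a missing path returns NULL unless a fallback is supplied through pluck()'s `.default` argument.

```r
library(purrr)

# hypothetical nested record, invented purely for illustration
incident <- list(
  name = 'surfer A',
  date = list(day = '31', month = 'Oct', year = '2003')
)

# walk down the structure by name...
month_by_name <- pluck(incident, 'date', 'month')
# ...or by position (2nd element, then its 2nd element)
month_by_pos <- pluck(incident, 2, 2)

# a path that does not exist returns NULL instead of raising an error
missing_val <- pluck(incident, 'date', 'hour')
# .default supplies a fallback value for a missing path
fallback_val <- pluck(incident, 'date', 'hour', .default = NA_character_)
```

Returning NULL for a missing path is convenient when records are ragged, which turns out to be the case for the ‘Date’ feature.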

We will now use purrr list filtering functions to find the frequency of shark attacks by the month in which the incidents occurred.

The previous call to pluck() worked; however, further inspection of the ‘Date’ feature reveals that the data is rather dirty. The number of pieces each ‘Date’ string splits into is variable:

#work with the split date strings as a list of character vectors:
attackDates <- strsplit( attacks_df$Date, '-' )
#return the unique lengths of the split 'Date' strings
unique(unlist(map(attackDates, length)))

## [1] 3 1 2 4 5 0
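For reference, the length check above can also be written without unlist(). The sketch below uses a few made-up date strings (not taken from the dataset) and compares purrr's map_int(), which returns an integer vector directly, with base R's lengths().

```r
library(purrr)

# made-up date strings covering a few of the malformed shapes
dates <- c('25-Jun-2018', '', '1990', 'Jan-2017')
parts <- strsplit(dates, '-')

# map_int() returns an integer vector, so no unlist() is needed
n_map <- map_int(parts, length)
# base R's lengths() gives the same result
n_base <- lengths(parts)

unique_lengths <- unique(n_map)
```

Note that the empty string splits into a zero-length vector, which is where the length of 0 comes from.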

Something is fishy! We will explore the data with purrr list filtering functions.

head_while()

returns the top elements down to the point where an element fails a logical test

head_while(attackDates, function(x) x[2]=='Jun')

## [[1]]
## [1] "25"   "Jun"  "2018"
## 
## [[2]]
## [1] "18"   "Jun"  "2018"
## 
## [[3]]
## [1] "09"   "Jun"  "2018"
## 
## [[4]]
## [1] "08"   "Jun"  "2018"
## 
## [[5]]
## [1] "04"   "Jun"  "2018"
## 
## [[6]]
## [1] "03"   "Jun"  "2018"
## 
## [[7]]
## [1] "03"   "Jun"  "2018"
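It is worth stressing that head_while() stops at the first failure and never looks further down the list. The toy example below (with invented dates) contrasts it with keep(), which tests every element.

```r
library(purrr)

# invented dates: the run of 'Jun' entries is broken by one 'May' entry
dates <- list(
  c('25', 'Jun', '2018'),
  c('18', 'Jun', '2018'),
  c('30', 'May', '2018'),
  c('09', 'Jun', '2018')
)

is_june <- function(x) x[2] == 'Jun'

# only the leading run of June dates is returned
june_run <- head_while(dates, is_june)
# keep() tests every element, so it also finds the last June date
june_all <- keep(dates, is_june)
```

Here `june_run` holds two elements while `june_all` holds three, so head_while() is best suited to data that is already ordered.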

tail_while()

returns the bottom elements up to the point where an element fails a logical test

length(tail_while(attackDates, function(x) length(x)==0))

## [1] 19421
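A small sketch with invented values shows the same behaviour from the other end of the list: tail_while() only counts the unbroken run of matching elements at the bottom, so an empty element higher up is not included.

```r
library(purrr)

# invented list: one empty element in the middle, two at the bottom
dates <- list(
  c('25', 'Jun', '2018'),
  character(0),
  '1990',
  character(0),
  character(0)
)

# only the trailing run of empty elements is returned
trailing_empty <- tail_while(dates, function(x) length(x) == 0)
n_trailing <- length(trailing_empty)
```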

The output above shows that over 19K empty elements are taking up space at the bottom of this set! Now to remove them…

compact()

used to drop empty elements. Here it removes all the dates where no value was entered.

oldLength <- length( attackDates )
#apply the compact() function to the date data:
attackDates <- compact( attackDates )
newLength <- length( attackDates )
difLength <- oldLength - newLength
cat('the compact() function eliminated', difLength, 'empty elements')

## the compact() function eliminated 19421 empty elements
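What counts as ‘empty’ deserves a quick check. In the sketch below (toy values, not from the dataset), compact() drops NULL and zero-length elements but keeps an empty string and an NA, since both have length 1.

```r
library(purrr)

# toy list mixing genuinely empty elements with length-1 placeholders
x <- list('1990', character(0), NULL, '', NA, list())

# NULL, character(0) and list() are dropped; '' and NA survive
cleaned <- compact(x)
n_cleaned <- length(cleaned)
```

So if missing dates had been stored as NA rather than as zero-length vectors, a predicate passed to discard() would be needed instead.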

Properly formatted dates result in a list element of length 3 (c(‘DD’,‘MMM’,‘YYYY’)). Here we use the purrr function keep() to select all the cases (rows) where the length == 3.

keep()

used to select elements that pass a logical test

attackDates_3 <- keep(attackDates, function(x) length(x)==3 )
unique( unlist(map(attackDates_3, length) ) )

## [1] 3
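The predicate does not have to be a full function definition. As a minimal sketch on invented dates, the example below shows that purrr's formula shorthand (`~` with `.x` as the element) gives the same result as the explicit function used above.

```r
library(purrr)

# invented dates with a mix of lengths
dates <- list(
  c('25', 'Jun', '2018'),
  c('Jan', '2017'),
  '1990',
  c('04', 'Jun', '2018')
)

# explicit function predicate
full_fn <- keep(dates, function(x) length(x) == 3)
# formula shorthand: .x stands for the current element
full_formula <- keep(dates, ~ length(.x) == 3)

same_result <- identical(full_fn, full_formula)
```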

Alternatively, discard() can be used to remove values based on a criterion…

discard()

used to drop elements that pass a logical test, keeping only those that fail it

#remove all elements with a list length of 2
attackDates_no2 <-  discard(attackDates, function(x) length(x)==2 ) 
unique( unlist(map(attackDates_no2, length) ) )

## [1] 3 1 4 5
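Because keep() and discard() use the same predicate in opposite ways, together they split a list into two parts with nothing lost. The sketch below checks this on a few invented dates; `is_pair` is a hypothetical helper name.

```r
library(purrr)

# invented dates with a mix of lengths
dates <- list(
  c('25', 'Jun', '2018'),
  c('Jan', '2017'),
  '1990',
  c('04', 'Jun', '2018')
)

# hypothetical helper: TRUE for elements with exactly two pieces
is_pair <- function(x) length(x) == 2

pairs <- keep(dates, is_pair)         # elements that pass the test
not_pairs <- discard(dates, is_pair)  # elements that fail the test

# the two parts account for every element of the original list
n_total <- length(pairs) + length(not_pairs)
```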

We will now plot the cleaned date data:

#NOTE: 'Oct' is absent from this vector, so October incidents are excluded
#from the table and plot below; add 'Oct' to include them.
month_strs <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                'Jul', 'Aug', 'Sep', 'Nov', 'Dec')
#select the 2nd element of each date, unlist the results, 
#keep only those strings that occur in the 'month_strs' vector & 
#format the results as a table.
attack_bymonth <- table(keep(unlist(lapply( attackDates_3, '[[',2 )), 
                             function(x) x %in% month_strs ))
attack_bymonth

## 
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Sep 
## 420 553 415 356 493 621 475 379 358 377 520
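The table above is sorted alphabetically and depends on a hand-typed vector of month names. One possible alternative, sketched here on invented values rather than the real data, is to use base R's built-in `month.abb` constant as factor levels, so that all twelve months appear in calendar order even when a count is zero.

```r
# invented month strings, including one entry that is not a month
months <- c('Jun', 'Jan', 'Oct', 'Jun', 'Reported', 'Jul')

# keep only valid abbreviations, using base R's month.abb constant
valid <- months[months %in% month.abb]

# factor levels fix both the ordering and the set of months reported
by_month <- table(factor(valid, levels = month.abb))
```

With this approach the table names follow `month.abb`, and months with no incidents show a count of 0 instead of disappearing.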

#now to visualize.....
plotdata <- data.frame( names = names( attack_bymonth ),
                        values = as.vector( attack_bymonth ))
months_bp <- ggplot(plotdata, aes(x=names, y=values)) + 
    geom_bar(stat = "identity") +
    scale_x_discrete( limits = month_strs ) +
    geom_hline(yintercept = mean( plotdata$values ), color = 'red' ) +
    ggtitle( 'Number of Shark Attacks per Month' ) +
    xlab( 'Month' ) +
    ylab( 'count' )
months_bp

The bar plot above shows the total shark attacks per month, with a red horizontal line indicating the average number of attacks per month across the months shown. We can see that shark attacks are most frequent in the summer months, peaking in July (621), with another surge in January (493), presumably corresponding to increased numbers of vacationers. October does not appear because ‘Oct’ is not included in month_strs.

We got to sink our teeth into several purrr functions for filtering lists. Hopefully this will help your future data wrangling go swimmingly!