A Deep Dive into purrr Functions for Filtering Lists
The purrr package is a set of functions that facilitates iterating over R objects. It is part of the tidyverse, a set of R packages designed to make data science ‘faster, easier and more fun.’ This vignette demonstrates several purrr functions for filtering lists: pluck(), head_while(), tail_while(), compact(), keep() and discard().
Global data on shark attacks from Kaggle will be used for demonstration.
pluck()
used to select elements by name or by index number
Suppose we are interested in a particular shark attack incident, such as Bethany Hamilton’s dramatic story, and we would like to return a specific feature, for example the ‘Date’:
library(purrr)
#index row number for Bethany's record
idx <- which( attacks_df$Name == 'Bethany Hamilton' )
#pluck the corresponding 'Date' element by name...
bHam_date <- pluck( attacks_df, 'Date', idx )
#...or by the column's index number
#bHam_date <- pluck( attacks_df, 2, idx )
bHam_date
## [1] "31-Oct-2003"
That was simple enough! But pluck() becomes increasingly useful in situations where deeper indexing into a data structure is necessary. Suppose we reformat the ‘Date’ feature and split the string by the delimiter ‘-’:
#split the date strings and return them as lists in a new column of the data.frame
attacks_df$DateSplit <- lapply(strsplit( attacks_df$Date, '-' ),'[')
Here we use pluck() to return the string corresponding to the month of Bethany Hamilton’s incident.
## [1] "Oct"
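The call that produced the output above is not shown. A minimal sketch of what such a call might look like follows, using a small made-up data frame; `toy_df`, `toy_idx`, `toy_month` and the second record are hypothetical stand-ins, not the Kaggle data.

```r
library(purrr)

# hypothetical stand-in for attacks_df; the second record is made up
toy_df <- data.frame(
  Name = c("Bethany Hamilton", "Jane Doe"),
  Date = c("31-Oct-2003", "25-Jun-2018"),
  stringsAsFactors = FALSE
)

# split each date string into a character vector of its parts (a list column)
toy_df$DateSplit <- strsplit(toy_df$Date, "-")

# row index of the record of interest
toy_idx <- which(toy_df$Name == "Bethany Hamilton")

# pluck() applies its accessors left to right:
# column 'DateSplit' -> element toy_idx -> 2nd string (the month)
toy_month <- pluck(toy_df, "DateSplit", toy_idx, 2)
toy_month
```

Each extra argument to pluck() goes one level deeper, so the single call replaces a chain such as `toy_df[["DateSplit"]][[toy_idx]][[2]]`.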
We will now use purrr list filtering functions to find the frequency of shark attacks by the month in which they occur.
The previous call to pluck() worked; however, further inspection of the ‘Date’ feature reveals that the data is rather dirty. The length of the split ‘Date’ elements is variable:
#work with the date strings as a list of character vectors:
attackDates <- lapply(strsplit( attacks_df$Date, '-' ),'[')
#return the unique lengths of the vectors of strings in the 'Date' feature
unique(unlist(map(attackDates, length)))
## [1] 3 1 2 4 5 0
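As an aside, the same length check can be written more compactly. The sketch below assumes a small made-up list (`toy_dates` is hypothetical) and compares the map()/unlist() approach with purrr's typed `map_int()` and base R's `lengths()`.

```r
library(purrr)

# hypothetical miniature of attackDates with elements of varying length
toy_dates <- list(
  c("25", "Jun", "2018"),
  c("Jun", "2018"),
  character(0),
  c("1900")
)

# the approach used in the text: map() returns a list, unlist() flattens it
len_map <- unique(unlist(map(toy_dates, length)))

# map_int() returns an integer vector directly
len_int <- unique(map_int(toy_dates, length))

# base R equivalent
len_base <- unique(lengths(toy_dates))

len_int
```

All three give the same answer; `map_int()` has the added benefit of raising an error if any result is not a single integer.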
Something is fishy! We will explore the data with purrr list filtering functions.
head_while()
returns the top elements down to the point where an element fails a logical test
## [[1]]
## [1] "25" "Jun" "2018"
##
## [[2]]
## [1] "18" "Jun" "2018"
##
## [[3]]
## [1] "09" "Jun" "2018"
##
## [[4]]
## [1] "08" "Jun" "2018"
##
## [[5]]
## [1] "04" "Jun" "2018"
##
## [[6]]
## [1] "03" "Jun" "2018"
##
## [[7]]
## [1] "03" "Jun" "2018"
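The code behind the output above is not shown. A plausible sketch, assuming the test was "length equals 3", is given below on a made-up list; `toy_dates` and `leading_ok` are hypothetical names.

```r
library(purrr)

# hypothetical miniature of attackDates: two well-formed dates,
# then a malformed one, then another well-formed one
toy_dates <- list(
  c("25", "Jun", "2018"),
  c("18", "Jun", "2018"),
  c("Reported", "08", "Jan", "2017"),
  c("04", "Jun", "2018")
)

# take elements from the top until the first one that is not of length 3
leading_ok <- head_while(toy_dates, function(x) length(x) == 3)
length(leading_ok)
```

Note that the last element passes the test but is not returned: head_while() stops at the first failure rather than filtering the whole list.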
tail_while()
returns the bottom elements up to the point where an element fails a logical test
## [1] 19421
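The call that produced this count is not shown either. One way it might be obtained, assuming the test was "length equals 0", is sketched here on a made-up list (`toy_dates`, `trailing_empty` and `n_trailing` are hypothetical names).

```r
library(purrr)

# hypothetical miniature of attackDates; character(0) is what
# strsplit() returns for an empty string
toy_dates <- list(
  c("25", "Jun", "2018"),
  character(0),
  c("18", "Jun", "2018"),
  character(0),
  character(0)
)

# take elements from the bottom until the first one that is not empty
trailing_empty <- tail_while(toy_dates, function(x) length(x) == 0)
n_trailing <- length(trailing_empty)
n_trailing
```

Only the unbroken run of empty elements at the end is counted; the empty element in the middle of the list is ignored.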
The above output shows that there are over 19K empty elements taking up space at the bottom of this set! Now to remove them…
compact()
used to drop empty elements. Here we remove all the dates where no value was entered.
oldLength <- length( attackDates )
#apply the compact() function to the date data:
attackDates <- compact( attackDates )
newLength <- length( attackDates )
difLength <- oldLength - newLength
cat('the compact() function eliminated', difLength, 'empty elements')
## the compact() function eliminated 19421 empty elements
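To make clear what compact() counts as "empty", here is a small sketch on a made-up list (`toy_list` and `toy_compact` are hypothetical names). It drops elements that are NULL or have length zero, but not missing values.

```r
library(purrr)

toy_list <- list(
  c("25", "Jun", "2018"),
  character(0),    # zero-length vector: dropped
  NULL,            # NULL: dropped
  list(),          # zero-length list: dropped
  NA_character_    # length 1, so it is kept
)

toy_compact <- compact(toy_list)
length(toy_compact)
```

An NA survives because it has length 1, so missing dates would need a separate filter such as discard() with `is.na`.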
Properly formatted dates result in a list length of 3 (c(‘DD’,‘MMM’,‘YYYY’)). Here we use the purrr function keep() to select all the cases (rows) where the length == 3.
keep()
used to select elements that pass a logical test
attackDates_3 <- keep(attackDates, function(x) length(x)==3 )
unique( unlist(map(attackDates_3, length) ) )
## [1] 3
Alternatively, discard() can be used to remove values based on a criterion…
discard()
used to drop elements that pass a logical test, keeping those that fail
#remove all elements with a list length of 2
attackDates_no2 <- discard(attackDates, function(x) length(x)==2 )
unique( unlist(map(attackDates_no2, length) ) )
## [1] 3 1 4 5
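keep() and discard() are complements: with the same predicate, together they partition the list. The sketch below illustrates this on a made-up list (`toy_dates` and `is_len3` are hypothetical names) and also shows purrr's formula shorthand for the predicate.

```r
library(purrr)

# hypothetical miniature of attackDates after compact()
toy_dates <- list(
  c("25", "Jun", "2018"),
  c("Jun", "2018"),
  c("1900"),
  c("Reported", "08", "Jan", "2017")
)

# predicate: TRUE for well-formed (DD, MMM, YYYY) dates
is_len3 <- function(x) length(x) == 3

kept    <- keep(toy_dates, is_len3)     # elements that pass the test
dropped <- discard(toy_dates, is_len3)  # elements that fail the test

# the same keep() call written with the formula shorthand
kept_formula <- keep(toy_dates, ~ length(.x) == 3)

# every element lands in exactly one of the two results
length(kept) + length(dropped) == length(toy_dates)
```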
We will now plot the cleaned date data:
#NOTE: 'Oct' is absent from this vector, so October is dropped from the
#table and plot below; add 'Oct' after 'Sep' to include it
month_strs <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Nov', 'Dec')
#select the 2nd element of each date, unlist them,
#keep only those strings that occur in the 'month_strs' vector &
#format the results as a table.
attack_bymonth <- table(keep(unlist(lapply( attackDates_3, '[[',2 )),
                             function(x) x %in% month_strs ))
attack_bymonth
##
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Sep
## 420 553 415 356 493 621 475 379 358 377 520
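The table above is sorted alphabetically and has only eleven columns, because the hand-typed month vector has no 'Oct' entry. As a possible alternative (not what the original code does), base R's built-in `month.abb` constant can supply all twelve abbreviations in calendar order. The sketch uses made-up month strings; `toy_months`, `valid` and `toy_counts` are hypothetical names.

```r
# made-up strings standing in for the 2nd element of each split date
toy_months <- c("Jun", "Oct", "Jan", "Jun", "Summer", "Oct", "Dec")

# month.abb is base R's built-in vector "Jan", "Feb", ..., "Dec"
valid <- toy_months[toy_months %in% month.abb]

# a factor with calendar-ordered levels makes table() report
# all 12 months in order, including months with zero counts
toy_counts <- table(factor(valid, levels = month.abb))
toy_counts
```

Because the levels are fixed up front, the resulting table can be plotted without a separate step to reorder the x axis.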
#now to visualize.....
library(ggplot2)
plotdata <- data.frame( names = names( attack_bymonth ),
                        values = as.vector( attack_bymonth ))
months_bp <- ggplot(plotdata, aes(x=names, y=values)) +
  geom_bar(stat = "identity") +
  scale_x_discrete( limits = month_strs ) +
  geom_hline(yintercept = mean( plotdata$values ), color = 'red' ) +
  ggtitle( 'Number of Shark Attacks per Month' ) +
  xlab( 'Month' ) +
  ylab( 'Count' )
months_bp
The bar plot above shows the total shark attacks per month, with a red horizontal line indicating the average number of attacks per month. We can see that shark attacks are more frequent in the summer months, with another surge in January (presumably corresponding to increased vacationers).
We got to sink our teeth into several purrr functions for filtering lists. Hopefully this will help your future data wrangling go swimmingly!