In this assignment, I was asked to explore a tidyverse package and show an example of how it works. Here, I explore the purrr and broom packages and the features they provide.
purrr is kind of like dplyr for lists. It helps you repeatedly apply functions. Compared with base R's apply family, purrr makes the API consistent, encourages type specificity, and provides some nice shortcuts and speed-ups. Let's explore some basic usage and then dive into a more complex use case.
library(purrr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
map(1:4, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
# using a formula with a tilde
map(1:4, ~ sqrt(2*.))
## [[1]]
## [1] 1.414214
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 2.44949
##
## [[4]]
## [1] 2.828427
map(1:4, ~ log(3, base = .))
## [[1]]
## [1] Inf
##
## [[2]]
## [1] 1.584963
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 0.7924813
map_dbl(1:4, ~ log(3, base = .))
## [1] Inf 1.5849625 1.0000000 0.7924813
map_dbl returns a double vector instead of a list, and throws an error if any output isn't of the expected type (which is a good thing!).
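The type checking described above can be seen in a small sketch. The inputs are made up for illustration, and tryCatch() is used only so that the script keeps running after the deliberate failure.

```r
library(purrr)

# Typed variants return atomic vectors instead of lists.
roots  <- map_dbl(1:4, sqrt)                 # double vector
labels <- map_chr(1:4, ~ paste0("n", .x))    # character vector

# Asking for doubles while the function returns characters is an error,
# not a silent coercion. The handler just records that it happened.
result <- tryCatch(
  map_dbl(1:4, ~ letters[.x]),
  error = function(e) "error caught"
)

roots
labels
result
```

Failing early like this tends to surface a wrong assumption at the call site rather than several steps later in a pipeline.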
map2 is like mapply: it applies a function over two lists in parallel. pmap (called map_n in early purrr releases) generalizes this to any number of lists.
fwd = 1:10
bck = 10:1
map2_dbl(fwd, bck, `^`)
## [1] 1 512 6561 16384 15625 7776 2401 512 81 10
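For more than two inputs, current purrr releases provide pmap and its typed variants (early releases called this map_n). The sketch below is illustrative only: the list `args` and its element names are assumptions made up for the example.

```r
library(purrr)

# Three parallel inputs, supplied together as one list.
# The names a, b, c are hypothetical and chosen to match the function below.
args <- list(a = 1:3, b = 4:6, c = 7:9)

# With a named list, pmap matches elements to the function's argument names.
out <- pmap_dbl(args, function(a, b, c) a * b + c)
out
```

Naming the list elements makes the call robust to the order in which the inputs are listed.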
map_if tests each element with a predicate function; where the predicate is true it applies the second function, and where it is false it returns the original element.
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
tibble(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>% map_if(is.numeric, ~ .^2)
## $ints
## [1] 1 4 9 16 25
##
## $lets
## [1] "a" "b" "c" "d" "e"
##
## $sqrts
## [1] 1 2 3 4 5
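The same behaviour applies to a plain list, not only to data frame columns. A minimal sketch, using a made-up list named `mixed`:

```r
library(purrr)

# Hypothetical input: two numeric elements and one character element.
mixed <- list(a = 1:3, b = "text", c = c(2.5, 4))

# Numeric elements are doubled; the character element is returned unchanged.
doubled <- map_if(mixed, is.numeric, ~ .x * 2)
doubled
```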
Let’s see if we can really make this purrr… Fit a linear model of winpercent on every combination of two predictors in the dataset and see which two predict best. We rank the pairs by root mean square error (RMSE); a lower RMSE means better predictions.
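Two building blocks of this plan can be sketched in base R before touching the real data: generating every pair of predictors, and scoring predictions by RMSE. The predictor names below are a made-up subset, and `rmse` is a hypothetical helper written for this illustration, not a function from purrr or broom.

```r
# Hypothetical subset of predictor names, for illustration only.
predictors <- c("chocolate", "fruity", "caramel")

# combn() applies paste() to each pair, giving one formula right-hand side per pair.
pairs <- combn(predictors, 2, paste, collapse = " + ")

# Hypothetical helper: root mean square error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

pairs
rmse(c(1, 2, 3), c(1, 2, 5))
```

With n predictors this produces n * (n - 1) / 2 pairs, so the number of models grows quadratically with the number of columns.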
library(readr)
candy_data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
test <- candy_data[1:19 , ]
head(candy_data)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755
augment is a function from the broom package that adds information about individual observations, such as fitted values or influence measures, to a dataset.
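As a small sketch of what augment returns, the example below fits a model on part of the built-in mtcars data and scores the remaining rows. The dataset, the row split, and the formula are assumptions chosen for illustration and are unrelated to the candy data.

```r
library(broom)

# Fit on the first 20 rows of mtcars (an arbitrary split for illustration).
fit <- lm(mpg ~ wt + hp, data = mtcars[1:20, ])

# Score the held-out rows: augment() returns newdata with a .fitted column added.
scored <- augment(fit, newdata = mtcars[21:32, ])

head(scored[, c("mpg", "wt", "hp", ".fitted")])
```

Because the fitted values sit next to the observed response, error metrics can be computed afterwards with ordinary dplyr verbs.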
library(broom)
train = sample(nrow(candy_data), floor(nrow(candy_data) * .67))
testdata <- candy_data[1:56 , ]
head(testdata)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755
setdiff(names(candy_data), "winpercent") %>%
  combn(2, paste, collapse = " + ") %>%
  structure(., names = .) %>%
  map(~ formula(paste("winpercent ~ ", .x))) %>%
  map(lm, data = candy_data[train, ]) %>%
  map_df(augment, newdata = testdata, .id = "predictors") %>%
  group_by(predictors) %>%
  summarize(rmse = sqrt(mean((winpercent - .fitted)^2))) %>%
  arrange(rmse)
## # A tibble: 66 x 2
##    predictors                    rmse
##    <chr>                        <dbl>
##  1 chocolate + sugarpercent      9.53
##  2 chocolate + peanutyalmondy    9.89
##  3 chocolate + pluribus          9.90
##  4 chocolate + pricepercent      9.93
##  5 chocolate + fruity            9.93
##  6 chocolate + crispedricewafer  9.94
##  7 chocolate + caramel           9.97
##  8 chocolate + hard             10.0
##  9 chocolate + bar              10.2
## 10 chocolate + nougat           10.3
## # ... with 56 more rows