Tidyverse assignment.

In this assignment, I was asked to explore a tidyverse package and show an example of how it works. Here, I explore the ‘purrr’ and ‘broom’ packages and look at the features they provide.

purrr and broom packages

purrr is kind of like dplyr for lists. It helps you apply functions repeatedly. Compared with base R's apply family, purrr offers a consistent API, encourages type specificity, and provides some handy shortcuts and speedups. Let's explore some basic usage and then work up to a more complex use case.

library(purrr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
map(1:4,  sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
# using a formula with a tilde; the dot stands for the current element
map(1:4,  ~ sqrt(2*.))
## [[1]]
## [1] 1.414214
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 2.44949
## 
## [[4]]
## [1] 2.828427
map(1:4,  ~ log(3, base = .))
## [[1]]
## [1] Inf
## 
## [[2]]
## [1] 1.584963
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 0.7924813

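The typed variants (map_dbl, map_chr, map_int, map_lgl) fix the output type: instead of a list, they return an atomic vector of the named type.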

map_dbl(1:4,  ~ log(3, base = .))
## [1]       Inf 1.5849625 1.0000000 0.7924813
# And it throws an error if any output isn't of the expected type (which is a good thing!).
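
To make the type check concrete, here is a minimal sketch. The names `roots` and `failed` are only illustrative, and `tryCatch()` is used so that the expected error becomes a flag instead of stopping the script.

```r
library(purrr)

# map_dbl() succeeds when every result is a single number
roots <- map_dbl(1:4, sqrt)

# as.character() returns strings, so map_dbl() should refuse them;
# tryCatch() converts the resulting error into TRUE
failed <- tryCatch(
  {
    map_dbl(1:4, as.character)
    FALSE
  },
  error = function(e) TRUE
)
failed
```

Using map_chr() instead would be the way to ask for character output on purpose.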

map2 is like mapply: it applies a function over two lists in parallel. pmap generalizes this to any number of lists, which are passed together as a single list of inputs.

fwd = 1:10
bck = 10:1
map2_dbl(fwd, bck, `^`)
##  [1]     1   512  6561 16384 15625  7776  2401   512    81    10
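
For more than two inputs, a small sketch with pmap_dbl might look like the following. The vectors `a`, `b`, `c` and the arithmetic are made up for illustration; the only purrr call assumed is `pmap_dbl()`, which takes a list of inputs and returns a double vector.

```r
library(purrr)

# three parallel inputs (illustrative values)
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(7, 8, 9)

# the function receives one element from each input, matched by position
totals <- pmap_dbl(list(a, b, c), function(x, y, z) x + y * z)
totals
```

Each result combines the elements at the same position, so the first value is 1 + 4 * 7.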

map_if applies a predicate function to each element; where it returns TRUE, the second function is applied, and where it returns FALSE, the original element is returned unchanged.

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
tibble(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>% map_if(is.numeric, ~ .^2) 
## $ints
## [1]  1  4  9 16 25
## 
## $lets
## [1] "a" "b" "c" "d" "e"
## 
## $sqrts
## [1] 1 2 3 4 5

Let’s see if we can really make this purrr. The goal is to fit a linear model of winpercent on every combination of two predictors in the dataset and see which pair predicts best. The best pair is the one with the lowest RMSE (root mean square error).
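
As a quick illustration of the metric, RMSE is the square root of the mean squared difference between actual and predicted values. The helper name `rmse` and the numbers below are hypothetical and not part of the analysis.

```r
# hypothetical helper: root mean square error of a set of predictions
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# errors are 1, 0 and -2, so the mean squared error is 5/3
example_rmse <- rmse(c(3, 5, 7), c(2, 5, 9))
example_rmse
```

A lower value means the predictions sit closer to the observed values on average.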

# read.csv() is base R, so no extra package is needed to load the data
candy_data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

test <- candy_data[1:19 , ]
head(candy_data)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755

‘augment’ is a function from the broom package that adds per-observation information to a dataset, such as fitted values or influence measures.
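
A minimal sketch of augment on its own, assuming the built-in mtcars dataset and an arbitrary split into the first 20 and last 12 rows (neither is part of the candy analysis):

```r
library(broom)

# fit on the first 20 rows of mtcars, then score the remaining 12 rows
fit <- lm(mpg ~ wt, data = mtcars[1:20, ])
scored <- augment(fit, newdata = mtcars[21:32, ])

# the original columns are kept and a .fitted column is appended
head(scored[, c("mpg", "wt", ".fitted")])
```

Because the predictions come back as an ordinary column next to the observed values, they can be summarized with the usual dplyr verbs.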

library(broom)
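# randomly pick 67% of the row indices for training (no seed is set, so the split changes between runs)
train = sample(nrow(candy_data), floor(nrow(candy_data) * .67))

# evaluation rows: the first 56 rows, which can overlap with the sampled training rows
testdata <- candy_data[1:56 , ]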
head(testdata)
##   competitorname chocolate fruity caramel peanutyalmondy nougat
## 1      100 Grand         1      0       1              0      0
## 2   3 Musketeers         1      0       0              0      1
## 3       One dime         0      0       0              0      0
## 4    One quarter         0      0       0              0      0
## 5      Air Heads         0      1       0              0      0
## 6     Almond Joy         1      0       0              1      0
##   crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
## 1                1    0   1        0        0.732        0.860   66.97173
## 2                0    0   1        0        0.604        0.511   67.60294
## 3                0    0   0        0        0.011        0.116   32.26109
## 4                0    0   0        0        0.011        0.511   46.11650
## 5                0    0   0        0        0.906        0.511   52.34146
## 6                0    0   1        0        0.465        0.767   50.34755
setdiff(names(candy_data), "winpercent") %>%
  combn(2, paste, collapse = " + ") %>%
  structure(., names = .) %>%
  map(~ formula(paste("winpercent ~ ", .x))) %>%
  map(lm, data = candy_data[train, ]) %>%
  map_df(augment, newdata = testdata, .id = "predictors") %>%
  group_by(predictors) %>% summarize(rmse = sqrt(mean((winpercent - .fitted)^2))) %>%
  arrange(rmse)
## # A tibble: 66 x 2
##    predictors                    rmse
##    <chr>                        <dbl>
##  1 chocolate + sugarpercent      9.53
##  2 chocolate + peanutyalmondy    9.89
##  3 chocolate + pluribus          9.90
##  4 chocolate + pricepercent      9.93
##  5 chocolate + fruity            9.93
##  6 chocolate + crispedricewafer  9.94
##  7 chocolate + caramel           9.97
##  8 chocolate + hard             10.0 
##  9 chocolate + bar              10.2 
## 10 chocolate + nougat           10.3 
## # ... with 56 more rows
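
The first three steps of the pipeline above are the least obvious, so here is a sketch of them in isolation. It assumes a shortened, hand-picked predictor vector so that the intermediate objects stay small; the names `predictors`, `pairs` and `formulas` are illustrative.

```r
library(purrr)

# a shortened predictor list (the real pipeline uses every column except winpercent)
predictors <- c("chocolate", "fruity", "caramel")

# combn() applies paste() to each pair, giving one string per pair
pairs <- combn(predictors, 2, paste, collapse = " + ")

# name each element after itself so the label survives the later map() calls
pairs <- structure(pairs, names = pairs)

# turn each string into a model formula
formulas <- map(pairs, ~ formula(paste("winpercent ~", .x)))
names(formulas)
```

Those names are what map_df later writes into the predictors column through its .id argument, which is how each RMSE can be traced back to its pair of predictors.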