email: jc3181 AT columbia DOT edu
This is the first quick overview in a short series I’m planning on some of the cool chaining possibilities that the magrittr package provides. It is mainly written as a self-reference guide for the future. I’m not sure my code is necessarily the neatest or prettiest, but perhaps others will find use in what’s below, so I thought I’d share it. This brief tutorial focuses on the %$% pipe. It’s great that this new pipe does really cool stuff, because I’ve accidentally typed it a million times before; now it does something (as long as it’s put in the right place)!
First, install engsoccerdata if you have not already, making sure the devtools package is loaded first. I’m going to use the engsoccerdata2 dataset in this package as my sample data. It contains every single match in professional English soccer’s top four divisions from 1888-2014.
library(devtools)
install_github('jalapic/engsoccerdata')
Load required packages.
library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(magrittr)
Please refer to the GitHub page of engsoccerdata for more information on the engsoccerdata2 dataset. For quick reference, here is a sample of the data format; it’s pretty self-explanatory.
tail(engsoccerdata2)
## Date Season home visitor FT hgoal vgoal
## 188055 2013-09-28 2013 York City Portsmouth 4-2 4 2
## 188056 2013-11-30 2013 York City Rochdale 0-0 0 0
## 188057 2013-10-29 2013 York City Scunthorpe United 4-1 4 1
## 188058 2014-02-22 2013 York City Southend United 0-0 0 0
## 188059 2014-03-25 2013 York City Torquay United 1-0 1 0
## 188060 2014-03-15 2013 York City Wycombe Wanderers 2-0 2 0
## division tier totgoal goaldif result
## 188055 4 4 6 2 H
## 188056 4 4 0 0 D
## 188057 4 4 5 3 H
## 188058 4 4 0 0 D
## 188059 4 4 1 1 H
## 188060 4 4 2 2 H
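The %$% operator exposes the columns of the dataframe on its left-hand side, by name, to the expression on its right-hand side, so it can be used to apply functions to newly manipulated data.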
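As a minimal sketch of what that means, here is %$% next to base R's with(). It uses R's built-in mtcars dataset rather than engsoccerdata2, which is an assumption made purely so the snippet runs on its own.

```r
library(magrittr)

# %$% exposes the columns of mtcars (mpg, wt) by name to the right-hand side.
# mtcars is a stand-in dataset here, not part of the soccer example.
piped <- mtcars %$% cor(mpg, wt)

# The same calculation written with base R's with()
with_base <- with(mtcars, cor(mpg, wt))

# Both approaches should give the same correlation
all.equal(piped, with_base)
```

The two forms do the same job; %$% simply lets the step sit at the end of an existing chain.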
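We can plot dataframes more quickly by referring to variables newly created within the current chain. In the example below, the steps are as follows:
take the engsoccerdata2 dataframe
filter to keep only top-tier matches, excluding the 1939 season
group by Season
summarize to get the goals per game (gpg) in each season
use the %$% operator to chain the just completed data manipulation to the plotting that is about to follow
use the ggplot function, including . to refer to the currently chained dataframe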
engsoccerdata2 %>%
  filter(tier == 1 & Season != 1939) %>%
  group_by(Season) %>%
  summarize(gpg = sum(totgoal) / length(totgoal)) %$%
  ggplot(., aes(Season, gpg)) + geom_point() + geom_line() + theme_bw()
Another use of %$% is to manipulate a dataframe and then apply a simple function, for instance to get the correlation between home and away goals in the 2013/14 season. In the second snippet below, the [[1]][[1]] extracts the t-statistic as a plain number: the first [[1]] takes the first element of the list returned by the cor.test function (the test statistic), and the second [[1]] drops its name.
engsoccerdata2 %>%
  filter(tier == 1 & Season == 2013) %$%
  cor.test(hgoal, vgoal)
##
## Pearson's product-moment correlation
##
## data: hgoal and vgoal
## t = -2.0078, df = 378, p-value = 0.04538
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.201245136 -0.002143343
## sample estimates:
## cor
## -0.102723
What if we wanted to return the “t” value from the correlation test for that season?
engsoccerdata2 %>%
  filter(tier == 1 & Season == 2013) %$%
  cor.test(hgoal, vgoal)[[1]][[1]]
## [1] -2.007785
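Indexing by position works, but it can be hard to read later. A possible alternative is to pull the statistic out by name, since the object returned by cor.test stores it in an element called statistic. The sketch below again uses the built-in mtcars dataset as a stand-in (an assumption, so that it runs without engsoccerdata).

```r
library(magrittr)

# Run the correlation test on two columns of the stand-in dataset
res <- mtcars %$% cor.test(mpg, wt)

# Positional indexing, as used in the chain above
by_index <- res[[1]][[1]]

# The same value extracted by name; unname() drops the "t" label
by_name <- unname(res$statistic)

identical(by_index, by_name)
```

Either form can go at the end of a %$% chain; the named version just makes it more obvious which number is being returned.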
Let’s get this number for all seasons! We can do this by putting our data into a list using split and %$%. We chain the split function to the end of our chain. Again, the . represents our newly piped dataframe, and %$% lets split refer to Season directly. We then pipe lapply after it; inside lapply, the expression starting with . is turned by magrittr into a function that is applied to each season’s dataframe. I’ve added the head(10) just to return the first 10 values rather than every year.
engsoccerdata2 %>%
  filter(tier == 1) %$%
  split(., Season) %>%
  lapply(. %$% cor.test(hgoal, vgoal)[[1]][[1]]) %>%
  head(10)
## $`1888`
## [1] -1.499024
##
## $`1889`
## [1] -3.415882
##
## $`1890`
## [1] -1.836765
##
## $`1891`
## [1] -2.190877
##
## $`1892`
## [1] -1.630546
##
## $`1893`
## [1] -1.645597
##
## $`1894`
## [1] -2.400865
##
## $`1895`
## [1] 0.008835615
##
## $`1896`
## [1] -0.1785999
##
## $`1897`
## [1] 0.2466131
That, for me, is officially an #rstats ‘wow’ moment.
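As a hedged variation on the split-and-apply chain above, the sketch below swaps lapply for vapply so the result comes back as a named numeric vector instead of a list, and writes the per-group step as an explicit function rather than a chain starting with a dot. It splits the built-in mtcars dataset by cyl, which is an assumption standing in for splitting engsoccerdata2 by Season.

```r
library(magrittr)

# Split the stand-in dataset by a grouping column, then compute the
# t-statistic of a correlation test within each group.
t_by_group <- mtcars %$%
  split(., cyl) %>%
  vapply(
    function(d) d %$% cor.test(mpg, wt)$statistic[[1]],  # one number per group
    numeric(1)                                           # declared return type
  )

# A numeric vector named by group, here the three cyl values
t_by_group
```

A vector like this is often easier to plot or sort than a list, and vapply will raise an error if any group returns something other than a single number.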