email: jc3181 AT columbia DOT edu

 

Quick tutorial of magrittr’s cool chaining operations

Here is the first quick overview in a short series that I’m planning to write on some of the cool chaining possibilities that the magrittr package provides. This is mainly written as a self-reference guide for the future. I’m not sure my code is necessarily the neatest or prettiest, but perhaps others will find use in what’s below so I thought I’d share it. This brief tutorial focuses on the %$% pipe. It’s great that this new pipe does really cool stuff, because I’ve accidentally typed it a million times before - now it does something (as long as it’s put in the right place)!

 

Installing and loading required packages

First, install engsoccerdata if you have not already; you need the devtools package loaded to do so. I’m going to use the engsoccerdata2 dataset in this package as my sample data. It contains every single match in professional English soccer’s top four divisions from 1888-2014.

library(devtools)
install_github("jalapic/engsoccerdata")  # the "user/repo" string already names the account

 

Load required packages.

library(engsoccerdata)
library(dplyr)
library(ggplot2)
library(magrittr)

 

Please refer to the github page of engsoccerdata for more information on the engsoccerdata2 dataset, but just for quick reference, here is a sample of the data format - it’s pretty self-explanatory.

tail(engsoccerdata2)
##              Date Season      home           visitor  FT hgoal vgoal
## 188055 2013-09-28   2013 York City        Portsmouth 4-2     4     2
## 188056 2013-11-30   2013 York City          Rochdale 0-0     0     0
## 188057 2013-10-29   2013 York City Scunthorpe United 4-1     4     1
## 188058 2014-02-22   2013 York City   Southend United 0-0     0     0
## 188059 2014-03-25   2013 York City    Torquay United 1-0     1     0
## 188060 2014-03-15   2013 York City Wycombe Wanderers 2-0     2     0
##        division tier totgoal goaldif result
## 188055        4    4       6       2      H
## 188056        4    4       0       0      D
## 188057        4    4       5       3      H
## 188058        4    4       0       0      D
## 188059        4    4       1       1      H
## 188060        4    4       2       2      H

 

%$%

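The %$% operator (magrittr calls it the exposition pipe) exposes the columns of the dataframe on its left-hand side by name to the expression on its right-hand side, much like with() does. This makes it handy for applying functions that expect vectors, rather than a dataframe, to newly manipulated data.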
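To make that concrete, here is a minimal sketch on R’s built-in mtcars data (my choice of example data, not part of the soccer dataset). It compares %$% with the base R with() call it resembles.

```r
library(magrittr)

# Base R: with() makes the columns of mtcars visible to the expression
r_with <- with(mtcars, cor(mpg, wt))

# magrittr: %$% does the same for whatever is on its left-hand side
r_pipe <- mtcars %$% cor(mpg, wt)

# Both routes should give the same correlation
all.equal(r_with, r_pipe)
```

A plain %>% would not help here: it would pass the whole dataframe as the first argument of cor(), and mpg and wt would not be found.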

 

Adding a plot

We can plot dataframes more quickly by referring to variables newly created within the current chain. In the example below, the steps are as follows:

  • take original dataframe engsoccerdata2
  • filter out the incomplete 1939 season data, and only keep data from the top division
  • group all the observations by the grouping variable ‘Season’
  • create a summary variable of ‘goals-per-game’ called ‘gpg’ which is the average number of goals scored per game per ‘Season’
  • use the %$% operator to chain the just completed data manipulation to the plotting that is about to follow
  • use a ggplot function, including . to refer to the currently chained dataframe

 

engsoccerdata2 %>%
  filter(tier==1 & Season!=1939) %>%
  group_by(Season) %>%
  summarize(gpg = sum(totgoal)/length(totgoal)) %$%
  ggplot(., aes(Season, gpg)) + geom_point() + geom_line() + theme_bw()
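If you don’t have engsoccerdata installed, the data-manipulation part of this chain can be tried on a tiny made-up dataframe. The toy data below is hypothetical (only the column names match engsoccerdata2), and I’ve used mean(), which should give the same result as sum(totgoal)/length(totgoal).

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data; only the column names mirror engsoccerdata2
toy <- data.frame(
  Season  = c(2012, 2012, 2013, 2013, 2013),
  tier    = c(1, 1, 1, 1, 2),
  totgoal = c(2, 4, 1, 3, 9)
)

# Keep the top tier, then average goals per game within each season
gpg_by_season <- toy %>%
  filter(tier == 1) %>%
  group_by(Season) %>%
  summarize(gpg = mean(totgoal))

# %$% exposes the new column by name to the next expression
max_gpg <- gpg_by_season %$% max(gpg)
```

Here max_gpg is computed from the gpg column that only exists inside the chain, which is the same trick the plot relies on.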

 

Adding a function

Another use of %$% is to manipulate a dataframe and then apply a simple function to its columns, for instance to get the correlation between home and away goals in the 2013/14 season. In the second example below, [[1]][[1]] extracts the t-statistic: cor.test returns a list whose first element is the test statistic, and the second [[1]] pulls out the bare number.

engsoccerdata2 %>%
  filter(tier==1 & Season==2013) %$%
  cor.test(hgoal,vgoal)
## 
##  Pearson's product-moment correlation
## 
## data:  hgoal and vgoal
## t = -2.0078, df = 378, p-value = 0.04538
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.201245136 -0.002143343
## sample estimates:
##       cor 
## -0.102723

 

What if we wanted to return just the “t” value from the correlation test for that season?

engsoccerdata2 %>%
  filter(tier==1 & Season==2013) %$%
  cor.test(hgoal,vgoal)[[1]][[1]]   
## [1] -2.007785
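As a quick check of what [[1]][[1]] is picking out, here is a small sketch using the built-in mtcars data (again my own example data). It shows that the first element of the cor.test result is named statistic, so indexing by name is an alternative that may read more clearly.

```r
# cor.test() returns a list-like "htest" object
res <- cor.test(mtcars$mpg, mtcars$wt)

# The first element of that list is the test statistic
first_name <- names(res)[1]

# Positional indexing, as used in this post
t_by_position <- res[[1]][[1]]

# Indexing by name; unname() drops the "t" label on the value
t_by_name <- unname(res$statistic)

all.equal(t_by_position, t_by_name)
```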

 

Let’s get this number for all Seasons! We can do this by putting our data into a list of per-season dataframes using split, chained with %$%. Again the . represents our newly piped dataframe. We then pipe lapply after it. Inside lapply, the expression starting with . is turned by magrittr into a function of one argument, so it gets applied to each season’s dataframe in turn. I’ve added the head(10) just to return the first 10 values and to not return every year.

engsoccerdata2 %>%
  filter(tier==1) %$%
  split(., Season) %>%
  lapply(. %$% cor.test(hgoal,vgoal)[[1]][[1]]) %>%
  head(10)
## $`1888`
## [1] -1.499024
## 
## $`1889`
## [1] -3.415882
## 
## $`1890`
## [1] -1.836765
## 
## $`1891`
## [1] -2.190877
## 
## $`1892`
## [1] -1.630546
## 
## $`1893`
## [1] -1.645597
## 
## $`1894`
## [1] -2.400865
## 
## $`1895`
## [1] 0.008835615
## 
## $`1896`
## [1] -0.1785999
## 
## $`1897`
## [1] 0.2466131
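The split-then-lapply pattern can be tried without the soccer data too. This is a sketch on the built-in mtcars data, split by number of cylinders (my stand-in for Season), with an explicit anonymous function alongside to show what I understand the . %$% shorthand to be doing.

```r
library(magrittr)

# Shorthand: an expression starting with . becomes a one-argument function
t_by_cyl <- mtcars %>%
  split(.$cyl) %>%
  lapply(. %$% cor.test(mpg, wt)[[1]][[1]])

# Spelled out in base R: apply an explicit function to each sub-dataframe
t_explicit <- lapply(
  split(mtcars, mtcars$cyl),
  function(d) with(d, cor.test(mpg, wt)[[1]][[1]])
)

# Both versions should return the same named list, one entry per group
all.equal(t_by_cyl, t_explicit)
```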

 

That, for me, is officially an #rstats ‘wow’ moment.

 

 

Putting it all together

We can go further… we can put these results from a list into a neat dataframe and plot them. The logical steps are as follows:

  • take dataframe engsoccerdata2
  • filter to keep only the top tier games
  • split the data into separate dataframes by factor ‘Season’ (chained by %$%)
  • then apply the cor.test function to all these dataframes that are now in a list. This is done using lapply (chained by %>%) and cor.test (chained by %$%) inside lapply
  • store the t-statistic of each of these tests (indexed by [[1]][[1]])
  • unlist the data
  • make the data a matrix and then a dataframe (I find this the easiest way of making sure our unlisted data do not behave like a list when inside a dataframe)
  • then add a new column for the ‘Season’ which is stored as the rownames.
  • then make a ggplot (chained by %$%)
  • and add a lot of different pieces to the ggplot graph to make it look a little prettier.

 

engsoccerdata2 %>%
  filter(tier==1) %$%
  split(., Season) %>%
  lapply(. %$% cor.test(hgoal,vgoal)[[1]][[1]]) %>%
  unlist(.) %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(Season = rownames(.)) %$%
  ggplot(., aes(Season, V1)) + 
  geom_point(size=3) + 
  theme_bw() + 
  ylab("t statistic") +
  stat_smooth(lwd=1, color="red", aes(group = 1)) + 
  scale_x_discrete(breaks=seq(1885, 2015, 5)) +
  theme(axis.text.x = element_text(angle=90, hjust=1))
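The unlist, matrix, dataframe, rownames steps are just one way of getting from a named vector to two columns. Here is a base R sketch of an alternative, using the first three t-statistics from the output earlier in this post. The column name t_stat is my own, and storing Season as a number is an assumption about what you want (it would let ggplot treat the x-axis as continuous rather than discrete).

```r
# First three per-season t-statistics, copied from the lapply output above,
# as they would look after unlist(): a named numeric vector
t_stats <- c(`1888` = -1.499024, `1889` = -3.415882, `1890` = -1.836765)

# Build the dataframe directly from the names and the values
t_df <- data.frame(
  Season = as.numeric(names(t_stats)),  # names are character, so convert
  t_stat = unname(t_stats)              # drop the names from the values
)

t_df
```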

 

This is just the tip of the iceberg! There is so much more that can be done with %$% as well as its friends %<>% and %T>% - I’ll try to write something on these guys soon.