Lesson 4

Piping

Often we had multiple lines of code that created temporary tibbles, each one modified several times. Luckily, there is an operator that lets us chain functions together: the pipe, written %>%. It forwards a value, or the result of an expression, into the next function call. Don’t try to run the code below (func_1() and the others are placeholders); just try to understand it.

data1 <- func_1(data)

data2 <- func_2(data1)

data3 <- func_3(data2)

We can simplify this code using the pipe operator:

data3 <- data %>% func_1() %>% func_2() %>% func_3()

The tibble gets “piped” into the first function, which pipes its output to the next function, and so on. You can think of this as your analysis “pipeline.” The sequence of analysis flows naturally from left-to-right and puts the emphasis on the actions being carried out by us (i.e. the functions) and the final output rather than a bunch of temporary tibbles that may not be of much interest.
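
For a version that can actually be run, here is a minimal sketch that uses a small made-up vector in place of the placeholder functions. It assumes the dplyr package is installed (dplyr makes %>% available); the names values, nested, and piped are invented for illustration.

```r
library(dplyr)  # makes the %>% operator available

values <- c(1, 4, 9)

# Nested calls read inside-out: sqrt() runs first, then sum()
nested <- sum(sqrt(values))

# The piped version reads left to right and gives the same answer
piped <- values %>% sqrt() %>% sum()

piped                     # 6
identical(nested, piped)  # TRUE
```

The value on the left of each %>% becomes the first argument of the function on its right, which is why sqrt() and sum() can be written with empty parentheses.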

Let’s start working with some data. I can’t tell you why, but there is a whole downloadable library of baseball statistics dating from 1871 onwards, known as the Lahman Baseball Database. Let’s install it and load it along with the tidyverse.

install.packages("Lahman")

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.1

## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.1

## Warning: package 'tibble' was built under R version 3.5.1

## Warning: package 'tidyr' was built under R version 3.5.1

## Warning: package 'readr' was built under R version 3.5.1

## Warning: package 'purrr' was built under R version 3.5.1

## Warning: package 'dplyr' was built under R version 3.5.1

## Warning: package 'stringr' was built under R version 3.5.1

## Warning: package 'forcats' was built under R version 3.5.1

## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(Lahman)

## Warning: package 'Lahman' was built under R version 3.5.1

We are going to focus on pitching today, so let’s load the data into a tibble under the name Pitching.

Pitching <- as_tibble(Pitching)

For this lesson we will focus on ERA, and only on pitchers who have pitched at least 175 innings. ERA (earned run average) is the mean number of earned runs a pitcher gives up per nine innings pitched (i.e. the traditional length of a game). It is computed by dividing the number of earned runs allowed by the number of innings pitched and multiplying by nine. The Lahman pitching dataset does not have innings pitched (IP). Instead, it has a column called IPouts, the number of outs pitched, where IPouts = 3 × IP.
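
As a quick check of the arithmetic, here is a small worked example. The season line is made up for illustration (it is not a real pitcher), and the variable names are hypothetical.

```r
# Hypothetical season line: 50 earned runs allowed over 600 outs recorded
earned_runs <- 50
ip_outs <- 600

# Three outs make one inning, so divide outs by 3
innings_pitched <- ip_outs / 3

# ERA = earned runs allowed per nine innings pitched
era <- 9 * earned_runs / innings_pitched

innings_pitched  # 200
era              # 2.25
```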

We will create a new tibble called pitching which contains all players who pitched at least 175 innings and played in either the AL or the NL, and which keeps only the columns corresponding to the player, year, league, team, innings pitched, and ERA.

pitching <- 
   Pitching %>%
   mutate(IP = IPouts/3) %>% 
   filter(lgID %in% c('AL', 'NL') & IP >= 175) %>%
   select(playerID, yearID, lgID, teamID, IP, ERA)

## Warning: package 'bindrcpp' was built under R version 3.5.1

pitching

## # A tibble: 7,387 x 6
##    playerID  yearID lgID  teamID    IP   ERA
##    <chr>      <int> <fct> <fct>  <dbl> <dbl>
##  1 bondto01    1876 NL    HAR     408   1.68
##  2 bordejo01   1876 NL    BSN     218.  2.89
##  3 bradlge01   1876 NL    SL3     573   1.23
##  4 cummica01   1876 NL    HAR     216   1.67
##  5 deando01    1876 NL    CN1     263.  3.73
##  6 devliji01   1876 NL    LS1     622   1.56
##  7 fishech01   1876 NL    CN1     229.  3.02
##  8 knighlo01   1876 NL    PHN     282   2.62
##  9 mannija01   1876 NL    BSN     197.  2.14
## 10 mathebo01   1876 NL    NY3     516   2.86
## # ... with 7,377 more rows

Make sure you understand what has happened here, especially with the pipe operator.
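
One way to check your understanding is to write the same steps with and without the pipe. The sketch below does this on a tiny made-up tibble (the player IDs and numbers are invented, and toy, piped, step1, step2, and stepwise are hypothetical names), assuming dplyr is installed.

```r
library(dplyr)

# Made-up rows that reuse the column names from the lesson
toy <- tibble(
  playerID = c("a01", "b01", "c01", "d01"),
  lgID     = c("AL", "NL", "AA", "NL"),
  IPouts   = c(600, 450, 700, 540),
  ERA      = c(3.10, 2.50, 2.90, 4.00)
)

# Piped version, mirroring the steps in the lesson
piped <- toy %>%
  mutate(IP = IPouts / 3) %>%
  filter(lgID %in% c("AL", "NL") & IP >= 175) %>%
  select(playerID, lgID, IP, ERA)

# The same steps written with temporary tibbles
step1 <- mutate(toy, IP = IPouts / 3)
step2 <- filter(step1, lgID %in% c("AL", "NL") & IP >= 175)
stepwise <- select(step2, playerID, lgID, IP, ERA)

piped
identical(piped, stepwise)
```

Only a01 and d01 survive the filter: b01 has 150 innings, and c01 is in neither the AL nor the NL.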

We now have 7,387 rows that fit the conditions we specified (at least 175 innings in the AL or NL). Each row is a season rather than a player, so the same pitcher can appear more than once. We can plot the ERA by year to see the overall trend.

ggplot(data = pitching) + 
    ylim(0, 9) + 
    geom_point(mapping = aes(x = yearID, y = ERA), size = 0.7)

Let’s now arrange the data by ERA to see who had the lowest ERAs and what time periods they came from.

arrange(pitching, ERA)

## # A tibble: 7,387 x 6
##    playerID  yearID lgID  teamID    IP   ERA
##    <chr>      <int> <fct> <fct>  <dbl> <dbl>
##  1 leonadu01   1914 AL    BOS     225.  0.96
##  2 brownmo01   1906 NL    CHN     277.  1.04
##  3 gibsobo01   1968 NL    SLN     305.  1.12
##  4 mathech01   1909 NL    NY1     275.  1.14
##  5 johnswa01   1913 AL    WS1     346   1.14
##  6 pfiesja01   1907 NL    CHN     195   1.15
##  7 jossad01    1908 AL    CLE     325   1.16
##  8 lundgca01   1907 NL    CHN     207   1.17
##  9 alexape01   1915 NL    PHI     376.  1.22
## 10 bradlge01   1876 NL    SL3     573   1.23
## # ... with 7,377 more rows

The lowest ERAs belong to Dutch Leonard and the three-fingered Mordecai Brown. To put into perspective just how good these ERAs are, we can standardize the data using Z-Scores. A Z-Score represents how many standard deviations a value is from the average. Base R does have scale() for this, but it returns a matrix rather than a plain vector, so it is just as easy to write our own function.

standardize <- function(e) {
   mu <- mean(e, na.rm = TRUE)
   sigma <- sd(e, na.rm = TRUE)
   return((e - mu) / sigma)
}

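The na.rm = TRUE argument tells mean() and sd() to drop any missing values (NA) before calculating. Without it, a single NA in the column would make the mean and standard deviation NA as well. Nothing is calculated yet; the function only runs once we pass it a column from pitching.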
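
To see what the helper does on numbers small enough to check by hand, here is a short sketch. The function is repeated so that the block runs on its own, and the input vectors are made up.

```r
# Same helper as in the lesson, repeated so this block runs on its own
standardize <- function(e) {
  mu <- mean(e, na.rm = TRUE)
  sigma <- sd(e, na.rm = TRUE)
  (e - mu) / sigma
}

# Mean is 4 and standard deviation is 2, so the Z-Scores are -1, 0, 1
z <- standardize(c(2, 4, 6))
z

# Without na.rm = TRUE a single NA makes the whole mean NA
mean(c(2, NA, 6))                # NA
mean(c(2, NA, 6), na.rm = TRUE)  # 4

# With the helper, the NA stays NA but the other values still get a score
with_na <- standardize(c(2, NA, 6))
with_na
```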

pitching <-
   pitching %>%
   mutate(zERA = standardize(ERA))
pitching %>% arrange(zERA)

## # A tibble: 7,387 x 7
##    playerID  yearID lgID  teamID    IP   ERA  zERA
##    <chr>      <int> <fct> <fct>  <dbl> <dbl> <dbl>
##  1 leonadu01   1914 AL    BOS     225.  0.96 -3.01
##  2 brownmo01   1906 NL    CHN     277.  1.04 -2.92
##  3 gibsobo01   1968 NL    SLN     305.  1.12 -2.82
##  4 mathech01   1909 NL    NY1     275.  1.14 -2.80
##  5 johnswa01   1913 AL    WS1     346   1.14 -2.80
##  6 pfiesja01   1907 NL    CHN     195   1.15 -2.79
##  7 jossad01    1908 AL    CLE     325   1.16 -2.78
##  8 lundgca01   1907 NL    CHN     207   1.17 -2.77
##  9 alexape01   1915 NL    PHI     376.  1.22 -2.71
## 10 bradlge01   1876 NL    SL3     573   1.23 -2.69
## # ... with 7,377 more rows

There is a flaw here: right now we compare all seasons as if they were equal to each other, but in reality they are not. Baseball in the early 1900s was nothing like baseball in the modern era. For that reason we need to standardize each season separately. To do this, we have to compute the mean and standard deviation of ERAs within each season.

A lot of the time in data analysis, instead of performing a calculation on the entire data set, you will want to first split the data into smaller subsets, apply the same calculation to every subset, and then combine the results. Luckily, the group_by() function lets us do exactly this. We can use it to start getting Z-Scores for every year separately.

pitching <- 
   pitching %>% 
   group_by(yearID)
pitching

## # A tibble: 7,387 x 7
## # Groups:   yearID [141]
##    playerID  yearID lgID  teamID    IP   ERA   zERA
##    <chr>      <int> <fct> <fct>  <dbl> <dbl>  <dbl>
##  1 bondto01    1876 NL    HAR     408   1.68 -2.16 
##  2 bordejo01   1876 NL    BSN     218.  2.89 -0.727
##  3 bradlge01   1876 NL    SL3     573   1.23 -2.69 
##  4 cummica01   1876 NL    HAR     216   1.67 -2.17 
##  5 deando01    1876 NL    CN1     263.  3.73  0.269
##  6 devliji01   1876 NL    LS1     622   1.56 -2.30 
##  7 fishech01   1876 NL    CN1     229.  3.02 -0.573
##  8 knighlo01   1876 NL    PHN     282   2.62 -1.05 
##  9 mannija01   1876 NL    BSN     197.  2.14 -1.62 
## 10 mathebo01   1876 NL    NY3     516   2.86 -0.762
## # ... with 7,377 more rows

Take note that the rows themselves are unchanged; group_by() only records the grouping, here 141 groups for the 141 different seasons.

pitching_summarize <- 
   pitching %>% 
   summarize(mean = mean(ERA), sd = sd(ERA))
pitching_summarize

## # A tibble: 141 x 3
##    yearID  mean    sd
##     <int> <dbl> <dbl>
##  1   1876  2.42 0.877
##  2   1877  2.52 0.461
##  3   1878  2.02 0.366
##  4   1879  2.58 0.545
##  5   1880  2.33 0.710
##  6   1881  2.70 0.498
##  7   1882  2.76 0.514
##  8   1883  2.79 0.771
##  9   1884  2.94 0.941
## 10   1885  2.87 0.964
## # ... with 131 more rows

summarize() returns one row per year, rather than a single row for the whole dataset, because we used the group_by() function earlier.
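
The difference between a grouped and an ungrouped summary is easy to see on a toy table. This is a minimal sketch with made-up numbers, assuming dplyr is installed; toy, overall, and by_year are hypothetical names.

```r
library(dplyr)

# Four made-up ERA values spread over two seasons
toy <- tibble(
  yearID = c(2000, 2000, 2001, 2001),
  ERA    = c(2.0, 4.0, 3.0, 5.0)
)

# Ungrouped: one row summarizing all four values
overall <- toy %>%
  summarize(mean = mean(ERA), sd = sd(ERA))

# Grouped: split by year, summarize each piece, combine the results
by_year <- toy %>%
  group_by(yearID) %>%
  summarize(mean = mean(ERA), sd = sd(ERA))

overall   # 1 row, mean 3.5
by_year   # 2 rows, means 3 and 4
```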

Now let’s get the Z-Score by each individual season and see how the data changes.

pitching <-
   pitching %>%
   mutate(zERA_year = standardize(ERA))
pitching %>% arrange(zERA_year)

## # A tibble: 7,387 x 8
## # Groups:   yearID [141]
##    playerID  yearID lgID  teamID    IP   ERA  zERA zERA_year
##    <chr>      <int> <fct> <fct>  <dbl> <dbl> <dbl>     <dbl>
##  1 leonadu01   1914 AL    BOS     225.  0.96 -3.01     -3.31
##  2 martipe02   2000 AL    BOS     217   1.74 -2.09     -3.25
##  3 luquedo01   1923 NL    CIN     322   1.93 -1.86     -3.08
##  4 martipe02   1999 AL    BOS     213.  2.07 -1.70     -2.81
##  5 gibsobo01   1968 NL    SLN     305.  1.12 -2.82     -2.80
##  6 maddugr01   1995 NL    ATL     210.  1.63 -2.22     -2.78
##  7 goodedw01   1985 NL    NYN     277.  1.53 -2.34     -2.76
##  8 guidrro01   1978 AL    NYA     274.  1.74 -2.09     -2.69
##  9 grovele01   1930 AL    PHA     291   2.54 -1.14     -2.67
## 10 piercbi02   1955 AL    CHA     206.  1.97 -1.82     -2.66
## # ... with 7,377 more rows

Dutch Leonard still has one of the best pitching seasons ever, but now two of Pedro Martinez’s Cy Young-winning seasons move into the top 5 because of how far ahead he was of everyone else pitching in those years.
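
The reason the rankings change is that mutate() on a grouped tibble runs the calculation inside each group. The sketch below shows this on made-up numbers (the two "eras" and all names are invented), assuming dplyr is installed.

```r
library(dplyr)

# Same helper as in the lesson, repeated so this block runs on its own
standardize <- function(e) {
  mu <- mean(e, na.rm = TRUE)
  sigma <- sd(e, na.rm = TRUE)
  (e - mu) / sigma
}

# Made-up data: a low-scoring season and a high-scoring season
toy <- tibble(
  yearID = c(1910, 1910, 1910, 2000, 2000, 2000),
  ERA    = c(1.5, 2.0, 2.5, 3.0, 4.0, 5.0)
)

scored <- toy %>%
  mutate(z_all = standardize(ERA)) %>%    # compared with all six rows
  group_by(yearID) %>%
  mutate(z_year = standardize(ERA)) %>%   # compared within each year only
  ungroup()

scored
```

In z_all the 1910 pitchers look better simply because scoring was lower that year. In z_year the best pitcher of each season gets the same score of -1, even though their raw ERAs are 1.5 and 3.0.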

If we want to go back to getting standard deviations and mean for the full dataset and not the specific years, we can just use the ungroup() function.

pitching <- 
   pitching %>%
   ungroup()
pitching %>% summarize(mean = mean(ERA), sd = sd(ERA))

## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1  3.50 0.844
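
If you lose track of whether a tibble is still grouped, dplyr can tell you. This is a small sketch on a made-up tibble using is_grouped_df() and group_vars(); the names toy, grouped, and flat are hypothetical.

```r
library(dplyr)

toy <- tibble(yearID = c(2000, 2000, 2001), ERA = c(2.0, 4.0, 3.0))

grouped <- toy %>% group_by(yearID)
is_grouped_df(grouped)  # TRUE
group_vars(grouped)     # "yearID"

flat <- grouped %>% ungroup()
is_grouped_df(flat)     # FALSE

# After ungroup(), summarize() is back to one row for the whole table
flat_mean <- flat %>% summarize(mean = mean(ERA))
flat_mean
```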

We have only grouped by a single variable so far. It can be useful to group by multiple variables at once. For our situation, we can standardize ERA not only within each year but also within each league, the American League and the National League. (There are differences between them, such as the designated hitter, which in these seasons existed only in the AL and makes pitching harder.) Let’s standardize under these conditions.

pitching <-
   pitching %>%
   group_by(yearID, lgID) %>%
   mutate(zERA_year_lg = standardize(ERA))
pitching %>% arrange(zERA_year_lg)

## # A tibble: 7,387 x 9
## # Groups:   yearID, lgID [257]
##    playerID  yearID lgID  teamID    IP   ERA  zERA zERA_year zERA_year_lg
##    <chr>      <int> <fct> <fct>  <dbl> <dbl> <dbl>     <dbl>        <dbl>
##  1 martipe02   2000 AL    BOS     217   1.74 -2.09     -3.25        -3.73
##  2 martipe02   1999 AL    BOS     213.  2.07 -1.70     -2.81        -3.34
##  3 leonadu01   1914 AL    BOS     225.  0.96 -3.01     -3.31        -2.99
##  4 guidrro01   1978 AL    NYA     274.  1.74 -2.09     -2.69        -2.79
##  5 santajo01   2004 AL    MIN     228   2.61 -1.06     -2.00        -2.71
##  6 vanceda01   1930 NL    BRO     259.  2.61 -1.06     -2.57        -2.70
##  7 greinza01   2009 AL    KCA     229.  2.16 -1.59     -2.34        -2.70
##  8 luquedo01   1923 NL    CIN     322   1.93 -1.86     -3.08        -2.68
##  9 alexape01   1915 NL    PHI     376.  1.22 -2.71     -2.26        -2.68
## 10 goodedw01   1985 NL    NYN     277.  1.53 -2.34     -2.76        -2.68
## # ... with 7,377 more rows

Pedro Martinez played in the American League after the designated hitter was introduced in 1973, which makes his numbers even more impressive. By this measure, his 1999 and 2000 seasons were even better than Dutch Leonard’s 1914.
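
When grouping by two variables, a group is created for every combination that actually appears in the data. The sketch below illustrates this on made-up rows, assuming dplyr is installed; toy and by_year_lg are hypothetical names.

```r
library(dplyr)

# Made-up rows: 2000 has both leagues, 2001 has only the AL
toy <- tibble(
  yearID = c(2000, 2000, 2000, 2000, 2001, 2001),
  lgID   = c("AL", "AL", "NL", "NL", "AL", "AL"),
  ERA    = c(3.0, 5.0, 2.0, 4.0, 3.5, 4.5)
)

toy %>% group_by(yearID) %>% n_groups()        # 2
toy %>% group_by(yearID, lgID) %>% n_groups()  # 3

# One mean per year-and-league combination
by_year_lg <- toy %>%
  group_by(yearID, lgID) %>%
  summarize(mean = mean(ERA)) %>%
  ungroup()

by_year_lg
```

Only combinations present in the data become groups, which is consistent with the 257 groups in the output above rather than 2 × 141 = 282: the earliest seasons in the table have NL rows only.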

We can also add the average ERA for each season to the plot, using the pitching_summarize tibble we computed earlier. geom_line() draws the yearly averages as a line, and we reduce the dot size so that the line is easier to see.

ggplot(pitching) +
   geom_point(aes(x = yearID, y = ERA), size = 0.3) +
   ylim(0, 9) + 
   geom_line(data = pitching_summarize, mapping = aes(x = yearID, y = mean), col = 'dark blue')

Ankith Kodali