Often we end up with multiple lines that create temporary tibbles, each of which we modify several times. Luckily, there is an operator that lets us connect multiple functions together, known as the pipe and written %>%. This operator forwards a value, or the result of an expression, into the next function call as its first argument. Don’t try to run this code, just try to understand it.
data1 <- func_1(data)
data2 <- func_2(data1)
data3 <- func_3(data2)
We can simplify this code using the pipe operator:
data_final <- data %>% func_1() %>% func_2() %>% func_3()
The tibble gets “piped” into the first function, which pipes its output to the next function, and so on. You can think of this as your analysis “pipeline.” The sequence of analysis flows naturally from left-to-right and puts the emphasis on the actions being carried out by us (i.e. the functions) and the final output rather than a bunch of temporary tibbles that may not be of much interest.
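As a runnable illustration of the same idea, here is a minimal sketch on a made-up tibble. The names toy, step1, step2, and piped are hypothetical and are not part of the lesson's data. The only point is that the piped version gives the same result as the step-by-step version.

```r
library(dplyr)

# Hypothetical toy data, not part of the Lahman database
toy <- tibble(x = c(4, 9, 16, 25))

# Step by step, with temporary tibbles
step1 <- mutate(toy, root = sqrt(x))
step2 <- filter(step1, root > 2)

# The same two steps written as one pipeline
piped <- toy %>%
  mutate(root = sqrt(x)) %>%
  filter(root > 2)

# Both approaches keep the same values
identical(step2$root, piped$root)
```

The pipeline reads in the order the steps happen, and no intermediate names are needed.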
Let’s start working with some data. I can’t tell you why, but there is a whole downloadable library of baseball data dating from 1871 onwards, known as the Lahman Baseball Database. Let’s install it and load it along with the tidyverse.
install.packages("Lahman")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'tibble' was built under R version 3.5.1
## Warning: package 'tidyr' was built under R version 3.5.1
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'stringr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(Lahman)
## Warning: package 'Lahman' was built under R version 3.5.1
We are going to focus on pitching today, so let’s load the data into a tibble under the name Pitching.
Pitching <- as_tibble(Pitching)
For this lesson we will focus on ERA, and only on pitchers who pitched at least 175 innings. Earned Run Average (ERA) is the mean number of earned runs a pitcher gives up per nine innings pitched (nine innings being the traditional length of a game). It is calculated by dividing the number of earned runs allowed by the number of innings pitched and multiplying by nine. The Lahman pitching dataset does not have innings pitched (IP). Instead, it has a column called “IPouts”, which is the number of outs pitched, so that IPouts = 3 × IP.
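To make the arithmetic concrete, here is a small sketch with made-up numbers. The values of earned_runs and ip_outs are hypothetical and are not taken from the database.

```r
# Hypothetical season totals for one pitcher
earned_runs <- 50
ip_outs <- 600   # outs recorded, in the same units as the IPouts column

# Three outs make one inning
innings_pitched <- ip_outs / 3

# ERA: earned runs allowed per nine innings
era <- 9 * earned_runs / innings_pitched

innings_pitched  # 200
era              # 2.25
```

A pitcher with these totals would clear the 175-inning cutoff used in this lesson.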
We will create a new tibble called pitching which contains all players who pitched at least 175 innings and played in either the AL or the NL, and which keeps only the columns for the player, year, league, team, innings pitched, and ERA.
pitching <-
Pitching %>%
mutate(IP = IPouts/3) %>%
filter(lgID %in% c('AL', 'NL') & IP >= 175) %>%
select(playerID, yearID, lgID, teamID, IP, ERA)
## Warning: package 'bindrcpp' was built under R version 3.5.1
pitching
## # A tibble: 7,387 x 6
## playerID yearID lgID teamID IP ERA
## <chr> <int> <fct> <fct> <dbl> <dbl>
## 1 bondto01 1876 NL HAR 408 1.68
## 2 bordejo01 1876 NL BSN 218. 2.89
## 3 bradlge01 1876 NL SL3 573 1.23
## 4 cummica01 1876 NL HAR 216 1.67
## 5 deando01 1876 NL CN1 263. 3.73
## 6 devliji01 1876 NL LS1 622 1.56
## 7 fishech01 1876 NL CN1 229. 3.02
## 8 knighlo01 1876 NL PHN 282 2.62
## 9 mannija01 1876 NL BSN 197. 2.14
## 10 mathebo01 1876 NL NY3 516 2.86
## # ... with 7,377 more rows
Make sure you understand what has happened here, especially with the pipe operator.
We now have 7,387 pitcher seasons that fit the conditions we specified (at least 175 innings in the AL or NL). A pitcher appears once for each season in which he qualified. We can plot the ERA by year to see the overall trend.
ggplot(data = pitching) +
ylim(0, 9) +
geom_point(mapping = aes(x = yearID, y = ERA), size = 0.7)
Let’s now arrange the data by ERA to see who had the lowest ERAs and what time periods they came from.
arrange(pitching, ERA)
## # A tibble: 7,387 x 6
## playerID yearID lgID teamID IP ERA
## <chr> <int> <fct> <fct> <dbl> <dbl>
## 1 leonadu01 1914 AL BOS 225. 0.96
## 2 brownmo01 1906 NL CHN 277. 1.04
## 3 gibsobo01 1968 NL SLN 305. 1.12
## 4 mathech01 1909 NL NY1 275. 1.14
## 5 johnswa01 1913 AL WS1 346 1.14
## 6 pfiesja01 1907 NL CHN 195 1.15
## 7 jossad01 1908 AL CLE 325 1.16
## 8 lundgca01 1907 NL CHN 207 1.17
## 9 alexape01 1915 NL PHI 376. 1.22
## 10 bradlge01 1876 NL SL3 573 1.23
## # ... with 7,377 more rows
The lowest ERAs belong to Dutch Leonard and the three-fingered Mordecai Brown. To put into perspective just how good these ERAs are, we can use Z-Scores to standardize the data. A Z-Score represents how many standard deviations a value is from the average. Base R does offer scale(), but it returns a matrix, so it is simpler to write a small function ourselves.
standardize <- function(e) {
  mu <- mean(e, na.rm = TRUE)
  sigma <- sd(e, na.rm = TRUE)
  return((e - mu) / sigma)
}
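Setting na.rm = TRUE tells mean() and sd() to drop any missing values (NA) before computing. Without it, a single NA in the column would make both results NA. Nothing is calculated yet: the function only runs once we apply it to the ERA column of pitching.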
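The effect of na.rm and of the function itself is easy to check on a tiny vector. This is a minimal sketch: the vector x is hypothetical, and standardize() is repeated here only so the block runs on its own.

```r
# Same definition as in the lesson, repeated so this block is self-contained
standardize <- function(e) {
  mu <- mean(e, na.rm = TRUE)
  sigma <- sd(e, na.rm = TRUE)
  (e - mu) / sigma
}

# Hypothetical values; the last one is missing
x <- c(1, 2, 3, NA)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2, the missing value is dropped first
standardize(x)         # -1  0  1 NA
```

The non-missing values have mean 2 and standard deviation 1, so their Z-Scores are simply their distance from 2. The missing value stays missing.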
pitching <-
pitching %>%
mutate(zERA = standardize(ERA))
pitching %>% arrange(zERA)
## # A tibble: 7,387 x 7
## playerID yearID lgID teamID IP ERA zERA
## <chr> <int> <fct> <fct> <dbl> <dbl> <dbl>
## 1 leonadu01 1914 AL BOS 225. 0.96 -3.01
## 2 brownmo01 1906 NL CHN 277. 1.04 -2.92
## 3 gibsobo01 1968 NL SLN 305. 1.12 -2.82
## 4 mathech01 1909 NL NY1 275. 1.14 -2.80
## 5 johnswa01 1913 AL WS1 346 1.14 -2.80
## 6 pfiesja01 1907 NL CHN 195 1.15 -2.79
## 7 jossad01 1908 AL CLE 325 1.16 -2.78
## 8 lundgca01 1907 NL CHN 207 1.17 -2.77
## 9 alexape01 1915 NL PHI 376. 1.22 -2.71
## 10 bradlge01 1876 NL SL3 573 1.23 -2.69
## # ... with 7,377 more rows
There is a flaw here: right now we compare all seasons as if they were equal to each other, but in reality they are not. Baseball in the early 1900s was nothing like baseball in the modern era. For that reason we need to standardize each season separately. To do this, we have to compute the mean and standard deviation of ERAs within each season.
Often in data analysis, instead of performing a calculation on the entire data set, you will want to first split the data into smaller subsets, apply the same calculation to every subset, and then combine the results. Luckily, the group_by() function lets us do exactly this. We can use it to start computing Z-Scores for every year separately.
pitching <-
pitching %>%
group_by(yearID)
pitching
## # A tibble: 7,387 x 7
## # Groups: yearID [141]
## playerID yearID lgID teamID IP ERA zERA
## <chr> <int> <fct> <fct> <dbl> <dbl> <dbl>
## 1 bondto01 1876 NL HAR 408 1.68 -2.16
## 2 bordejo01 1876 NL BSN 218. 2.89 -0.727
## 3 bradlge01 1876 NL SL3 573 1.23 -2.69
## 4 cummica01 1876 NL HAR 216 1.67 -2.17
## 5 deando01 1876 NL CN1 263. 3.73 0.269
## 6 devliji01 1876 NL LS1 622 1.56 -2.30
## 7 fishech01 1876 NL CN1 229. 3.02 -0.573
## 8 knighlo01 1876 NL PHN 282 2.62 -1.05
## 9 mannija01 1876 NL BSN 197. 2.14 -1.62
## 10 mathebo01 1876 NL NY3 516 2.86 -0.762
## # ... with 7,377 more rows
Take note of how the data was organized into groups by year with 141 groups for the 141 different seasons.
pitching_summarize <-
pitching %>%
summarize(mean = mean(ERA), sd = sd(ERA))
pitching_summarize
## # A tibble: 141 x 3
## yearID mean sd
## <int> <dbl> <dbl>
## 1 1876 2.42 0.877
## 2 1877 2.52 0.461
## 3 1878 2.02 0.366
## 4 1879 2.58 0.545
## 5 1880 2.33 0.710
## 6 1881 2.70 0.498
## 7 1882 2.76 0.514
## 8 1883 2.79 0.771
## 9 1884 2.94 0.941
## 10 1885 2.87 0.964
## # ... with 131 more rows
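The means and standard deviations are reported per year, rather than as a single overall value, because we used the group_by() function earlier.
Now let’s get the Z-Score within each individual season and see how the data changes. Because pitching is still grouped by yearID, mutate() applies standardize() separately within each season.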
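The difference grouping makes to summarize() can be seen on a toy tibble. This is a minimal sketch with made-up values. The names toy, overall, and by_year are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two seasons with a few ERAs each
toy <- tibble(
  yearID = c(2000, 2000, 2000, 2001, 2001),
  ERA    = c(2, 3, 4, 3, 5)
)

# Without grouping: one row for the whole tibble
overall <- toy %>%
  summarize(mean = mean(ERA), sd = sd(ERA))

# With grouping: one row per season
by_year <- toy %>%
  group_by(yearID) %>%
  summarize(mean = mean(ERA), sd = sd(ERA))

overall
by_year
```

Here overall has a single row with mean 3.4, while by_year has one row per season with means 3 and 4.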
pitching <-
pitching %>%
mutate(zERA_year = standardize(ERA))
pitching %>% arrange(zERA_year)
## # A tibble: 7,387 x 8
## # Groups: yearID [141]
## playerID yearID lgID teamID IP ERA zERA zERA_year
## <chr> <int> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 leonadu01 1914 AL BOS 225. 0.96 -3.01 -3.31
## 2 martipe02 2000 AL BOS 217 1.74 -2.09 -3.25
## 3 luquedo01 1923 NL CIN 322 1.93 -1.86 -3.08
## 4 martipe02 1999 AL BOS 213. 2.07 -1.70 -2.81
## 5 gibsobo01 1968 NL SLN 305. 1.12 -2.82 -2.80
## 6 maddugr01 1995 NL ATL 210. 1.63 -2.22 -2.78
## 7 goodedw01 1985 NL NYN 277. 1.53 -2.34 -2.76
## 8 guidrro01 1978 AL NYA 274. 1.74 -2.09 -2.69
## 9 grovele01 1930 AL PHA 291 2.54 -1.14 -2.67
## 10 piercbi02 1955 AL CHA 206. 1.97 -1.82 -2.66
## # ... with 7,377 more rows
Dutch Leonard still has the best season by this measure, but now two of Pedro Martinez’s Cy Young-winning seasons move into the top 5 because of how far ahead he was of everyone else in the period he played in.
If we want to go back to computing the mean and standard deviation for the full dataset rather than for specific years, we can just use the ungroup() function.
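To see why grouping changes the ranking, here is a minimal sketch on made-up data. The tibble toy and the columns z_all and z_year are hypothetical, and standardize() is repeated only so the block runs on its own.

```r
library(dplyr)

# Same definition as in the lesson
standardize <- function(e) {
  mu <- mean(e, na.rm = TRUE)
  sigma <- sd(e, na.rm = TRUE)
  (e - mu) / sigma
}

# Hypothetical toy data: a low-scoring season and a high-scoring season
toy <- tibble(
  yearID = c(1908, 1908, 1908, 2000, 2000, 2000),
  ERA    = c(1, 2, 3, 3, 4, 5)
)

toy_z <- toy %>%
  mutate(z_all = standardize(ERA)) %>%   # compared with every row
  group_by(yearID) %>%
  mutate(z_year = standardize(ERA)) %>%  # compared within the season only
  ungroup()

toy_z
```

An ERA of 3 appears in both seasons. Against all rows it is exactly average (z_all is 0 both times), but within its own season it is the worst mark in 1908 (z_year is 1) and the best in 2000 (z_year is -1).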
pitching <-
pitching %>%
ungroup()
pitching %>% summarize(mean = mean(ERA), sd = sd(ERA))
## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 3.50 0.844
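Whether a tibble is currently grouped can be checked directly with dplyr's group_vars(). This is a minimal sketch on made-up data. The names toy, grouped, and flat are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  yearID = c(2000, 2000, 2001, 2001),
  ERA    = c(2, 4, 3, 5)
)

grouped <- toy %>% group_by(yearID)
group_vars(grouped)  # "yearID"

flat <- grouped %>% ungroup()
group_vars(flat)     # character(0), no grouping left

# summarize() returns a single row again
flat_summary <- flat %>% summarize(mean = mean(ERA))
flat_summary
```

Grouping persists until it is removed, so a quick check like this can help avoid surprises when a later mutate() or summarize() behaves differently than expected.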
We have only grouped by a single variable so far. It can be useful to group by several variables at once. In our case, we can standardize ERA not only within each year but also within each league, the American League (AL) and the National League (NL). There are differences between them, such as the designated hitter, which in this data is used only in the AL and makes pitching harder. Let’s standardize under these conditions.
pitching <-
pitching %>%
group_by(yearID, lgID) %>%
mutate(zERA_year_lg = standardize(ERA))
pitching %>% arrange(zERA_year_lg)
## # A tibble: 7,387 x 9
## # Groups: yearID, lgID [257]
## playerID yearID lgID teamID IP ERA zERA zERA_year zERA_year_lg
## <chr> <int> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 martipe02 2000 AL BOS 217 1.74 -2.09 -3.25 -3.73
## 2 martipe02 1999 AL BOS 213. 2.07 -1.70 -2.81 -3.34
## 3 leonadu01 1914 AL BOS 225. 0.96 -3.01 -3.31 -2.99
## 4 guidrro01 1978 AL NYA 274. 1.74 -2.09 -2.69 -2.79
## 5 santajo01 2004 AL MIN 228 2.61 -1.06 -2.00 -2.71
## 6 vanceda01 1930 NL BRO 259. 2.61 -1.06 -2.57 -2.70
## 7 greinza01 2009 AL KCA 229. 2.16 -1.59 -2.34 -2.70
## 8 luquedo01 1923 NL CIN 322 1.93 -1.86 -3.08 -2.68
## 9 alexape01 1915 NL PHI 376. 1.22 -2.71 -2.26 -2.68
## 10 goodedw01 1985 NL NYN 277. 1.53 -2.34 -2.76 -2.68
## # ... with 7,377 more rows
Pedro Martinez played in the American League after the designated hitter was established in 1973, which makes his numbers even more impressive. By this measure, his 2000 and 1999 seasons both rank ahead of Dutch Leonard’s 1914.
We can also add the average for each season to the plot, using the pitching_summarize tibble we computed earlier. We use geom_line() to draw the yearly mean ERA as a line, and we reduce the dot size so that the line is easier to see.
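Grouping by two variables creates one group per combination that occurs in the data. This is a minimal sketch on made-up data. The names toy, by_year_lg, and means are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: the second season has AL rows only
toy <- tibble(
  yearID = c(2000, 2000, 2000, 2000, 2001, 2001),
  lgID   = c("AL", "AL", "NL", "NL", "AL", "AL"),
  ERA    = c(3, 5, 2, 4, 4, 6)
)

by_year_lg <- toy %>% group_by(yearID, lgID)

# Three groups: (2000, AL), (2000, NL), (2001, AL)
n_groups(by_year_lg)

# One mean per year-and-league combination
means <- by_year_lg %>%
  summarize(mean = mean(ERA)) %>%
  ungroup()

means
```

In this toy data there is no (2001, NL) group because no such rows exist. A similar effect would explain why the lesson's output shows 257 groups rather than two for every one of the 141 seasons.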
ggplot(pitching) +
geom_point(aes(x = yearID, y = ERA), size = 0.3) +
ylim(0, 9) +
geom_line(data = pitching_summarize, mapping = aes(x = yearID, y = mean), col = 'dark blue')