To access R and R Studio which are installed on the Saint Ann’s server you can go to: http://rstudio.saintannsny.org:8787/ and log in with your Saint Ann’s email address.
Today we’ll be learning to filter, arrange, select and mutate data using the dplyr package in R. If you find yourself wanting to review this material or find out more, I encourage you to read Hadley Wickham’s online textbook R for Data Science (http://r4ds.had.co.nz/). This lab practices the material in 5.1 through 5.5.
R has many many packages that provide functions and data sets. The Lahman package provides baseball card level data (as opposed to play-by-play level data or pitch level data) data going back to 1884. To use a package, you first need to load it using the library() command.
library(Lahman)
Note that, for this line to work, the package needs to be installed. So, if you were using your own R installation at home, you would need to first use the packages tab (lower right) to install the package.
Let’s start by loading Batting data and then looking at the first six rows in the data set (using the head function) and the last six rows (using the tail function):
data(Batting)
head(Batting)
tail(Batting)
How many rows are there in the Batting data (Hint: look at the results of tail(Batting).)?
You might be able to recognize some of the 2015 players in the tail results. The playerID’s have the first five letters of the player’s last name and the first two letters of their first name. You might also notice that Ben Zobrist, who played for two teams last year (the Oakland A’s and the Kansas City Royals) has two lines of data for 2015.
The dplyr package allows you to explore data frames like this Batting table. You can load it the same way that you loaded the Lahman package.
library(dplyr)
Let’s start by finding members of the 30/30 (30 home runs and 30 steals) and 40/40 clubs:
filter(Batting, HR >=30, SB>=30)
filter(Batting, HR >=40, SB>=40)
Or, how about seasons in the 21st century with 40 or more homeruns and more walks than strikeouts?
filter(Batting, HR >=40, BB>SO, yearID>=2000)
We can also filter by non-numeric columns and use the same logical statements we used last week. Here are players on the Mets or Yankees who hit at least 40 homeruns:
filter(Batting, teamID=="NYN"|teamID=="NYA", HR >=40)
Questions:
Only 4 players have ever had seasons with 40 homeruns, 40 doubles and 10 triples (two of them did it twice). Who are they? You can list them by their playerID’s if you don’t recognize their names from their ID’s.
Since 1931 only one player with at least 300 at bats has hit for a batting average (hits per at bat) of .400 or higher. Who is it and how many hits did he have?
Only two players have ever had seasons with more than 100 RBI’s in which they didn’t hit a home run. Who are they?
How many players have hit 40 home runs for the Royals (KCA)?
Try the following to get batting lines sorted by numbers of home runs.
arrange(Batting, HR)
arrange(Batting, desc(HR))
These commands both return too much data and are unweildy but we can combine arrange with filter by creating a new data set using the filter command and then arranging it:
Batting.40plus.HR <- filter(Batting, HR>=60)
arrange(Batting.40plus.HR, desc(HR))
or, perhaps more simply, by nesting one function inside of the other:
arrange(filter(Batting, HR>=60), desc(HR))
In fact, there’s an even more elegant way to do this (and much more) by using a piping operator, %>%:
Batting %>% filter(HR>=60) %>% arrange(desc(HR))
As you can see above, with the piping operator you start with the name of the data set, then pipe to the first operation and then pipe to the second operation and then so on.
You can generate a list of all players with 300 at bats or more, who have hit .360 or better since 1980 and order them by batting average:
Batting %>% filter(AB>=300, yearID>=1980, H/AB>=.360) %>% arrange(desc(H/AB))
Questions
Find all players who have stolen more than 80 bases or more and order them in descending order of stolen bases.
Finding all players with 425 or more total bases in a season and order them in descending order of total bases. Be careful, singles aren’t listed!
If you want the top 10 and bottom 10 batters by batting average (300 at bats or more) from 1900 through 2015 you can do the following. Notice that we use 3 (!!!) piping operators in each lines because we’re performing three operations: filtering, grabbing the top 10 and then, finally, arranging the data.
Batting %>% filter(AB>=300, yearID>=1900) %>% top_n(10, H/AB) %>% arrange(desc(H/AB))
Batting %>% filter(AB>=300, yearID>=1900) %>% top_n(10, desc(H/AB)) %>% arrange(H/AB)
Why does the second line return more than 10 batting lines?
Problems:
If you only want to see some of the columns from the data set, select can do the trick. Here’s the top 10 batting averages since 1900 with only the player, year, team, hits and at bats shown. This code is getting long so I’ve split it into multiple lines and to make it easier to read, I’ve split it immediately following the piping operators. This code can be copied and pasted, either into a script or into the console, and run all at once.
Batting %>%
filter(AB>=300, yearID>=1900) %>%
top_n(10, H/AB) %>%
arrange(desc(H/AB)) %>%
select (playerID, yearID, teamID, H, AB)
Mutate allows you to create new columns. We’ve been sorting players by batting average and listing the top average hitters so it might be useful to actually calculate batting averages! We can do that by adding a new column “BA” that is calculated as hits divided by at bats (pay careful attention to the changes from the code above):
Batting %>%
filter(AB>=300, yearID>=1900) %>%
mutate(BA = H/AB) %>%
top_n(10, BA) %>%
arrange(desc(BA)) %>%
select (playerID, yearID, teamID, H, AB, BA)
Since batting average is usually rounded to the nearest .001 we might want to use the round() function:
Batting %>%
filter(AB>=300, yearID>=1900) %>%
mutate(BA = round(H/AB,3)) %>%
top_n(10, BA) %>%
arrange(desc(BA)) %>%
select (playerID, yearID, teamID, H, AB, BA)
First load the pitching data as follows:
data(Pitching)
Next, find the top 10 lowest ERAs (earned run averages) since 1950 for pitchers with a minimum of 162 innings pitched. If you are (understandably) unsure of how to calculate ERA, use your favorite search engine. Note that the pitching data provides IPouts rather than innings pitched (IP) and that there 3 outs to an inning. See if you can write code that returns the 10 best ERA’s since 1950 in order and shows the player, year, team, IPouts and ERA.