This walk-through provides a brief introduction to the dplyr and tidyr packages in R that allow us to reshape our data for visualization or analysis. We will analyse Major League Baseball data from the Lahaman package. For convenience, the “base R” equivalents to the dplyr and tidy functions at bottom of each code block.
library(Lahman)
library(dplyr)
library(tidyr)
library(magrittr)
R can load data from rational and unstructured databases as well as just about any file format one could imagine. For our use case, we will load an R data package called Lahman. This package load Sean Lahman’s baseball database, which is an annual aggregation of statistics for all major league players from 1871 to present.
library(Lahman)
# Load the Batting table from the Lahman database.
batting <- Lahman::Batting
# Take a look at the first few rows of the data set.
head(Batting, 3)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
GOAL: Add full player names from the master data frame to the batting data frame.
The dplyr package has several “join” functions, many of which are similar, in concept, to SQL joins. Full descriptions of dplyr joins can be found here. For our use, a left join is the most appropriate.
# Get Player names and Teams names from other tables
teams <- Lahman::Teams
master <- Lahman::Master
library(dplyr)
# Subset master table to only playerID and player names since that's all we need.
master <- select(master, playerID, nameFirst, nameLast)
# Left join master and batting to get the player names in master
battingPlayer <- left_join(batting, master, by = "playerID")
# Base R equivalent:
# master <- master[, c("playerID", "nameFirst", "nameLast")]
# battingPlayer <- merge(x = master, y = batting, by = "playerID", all.x = TRUE)
We can use the same strategy as above to add full team names to the data set. Note the use of the rename() function to rename the ambiguous column “name” to “teamName.”
teamsYr <- select(teams, teamID, yearID, name)
# Rename the team name column to avoid confusion
teamsYr <- rename(teamsYr, "teamName" = name)
batting <- left_join(battingPlayer, teamsYr, by = c("teamID", "yearID"))
# Base R equivalent:
# teamsYr <- teams[, c("teams", "teamID", "yearID", "name")]
# batting <- merge(x = battingPlayer, y = teamsYr, by = c("tea,ID", "yearID"), all.x = TRUE)
head(batting, 3)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## SO IBB HBP SH SF GIDP nameFirst nameLast teamName
## 1 0 NA NA NA NA NA Frank Abercrombie Troy Haymakers
## 2 0 NA NA NA NA NA Bob Addy Rockford Forest Citys
## 3 5 NA NA NA NA NA Art Allison Cleveland Forest Citys
GOAL: To take our joined data set and parse it to only years after 1919 and only players who had more than 199 at bats per season. The reasoning here is to eliminate the “dead ball era” while removing temporary minor leaguers and modern National League pitchers.
The tidyr package is a compliment to dplyr and provides additional data manipulation functions such as “spread” or “gather” data frames. In our case, we’re using the unite() function to concatenate the players’ first and last names.
# Player name and team name appear next to playerID and teamID.
batting <- subset(batting, yearID >= 1955 & AB >= 200)
# Concatenate player names with unite() function from tidyr.
library(tidyr)
batting <- unite(batting, playerName, nameFirst, nameLast, sep=" ")
# Base R equivalent:
# batting <- subset(batting, yearID >= 1920 & AB >= 200)
# batting <- transform(batting, playerName = paste0(nameFirst, " ", nameLast))
# batting <- batting[, c("playerID", "playerName", "yearID", "stint", "teamID", "teamName", "lgID", "G", "AB", "R", "H", "X2B", "X3B", "HR", "BB", "HBP", "SF", "SH")]
head(batting, 3)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB
## 37448 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3
## 37451 abramca01 1955 1 BAL AL 118 309 56 75 12 3 6 32 2
## 37454 adcocjo01 1955 1 ML1 NL 84 288 40 76 14 0 15 45 0
## CS BB SO IBB HBP SH SF GIDP playerName teamName
## 37448 1 49 61 5 3 7 4 20 Hank Aaron Milwaukee Braves
## 37451 8 89 69 2 3 3 3 6 Cal Abrams Baltimore Orioles
## 37454 2 31 44 3 2 2 1 13 Joe Adcock Milwaukee Braves
Pipes aren’t native to R, many packages make use of the magrittr package, which adds a %>% operator as a pipe. The code block below accomplishes everything we have done with the Lahman data set above with fewer lines of code. Note, that only functions can be piped with the magrittr package.
batting <- Lahman::Batting
batting <- left_join(batting, master, by = "playerID") %>%
subset(yearID >= 1955 & AB >= 200) %>%
left_join(teams[,c("teamID", "yearID", "name")], by = c("teamID", "yearID")) %>%
unite(playerName, nameFirst, nameLast, sep=" ") %>% rename(teamName = name)
# Base R equivalent:
# batting <- Lahmand::Batting
# batting <- merge(x = master, y = batting, by = "playerID", all.x = TRUE)
# batting <- subset(batting, yearID >= 1955 & AB >= 200)
# batting <- transform(batting, name = paste0(nameFirst, " ", nameLast))
# batting <- merge(x = master, y = batting, by = c("tea,ID", "yearID"), all.x = TRUE)
# colnames(batting) <- c("playerID", "yearID", "stint", "teamID", "lgID", "G", "AB", "R",
# "H", "X2B", "X3B", "HR", "RBI", "SB", "CS", "BB", "SO", "IBB", "HBP", "SH", "SF", "GIDP", "playerName",
# "teamName")
head(batting, 3)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1
## 2 abramca01 1955 1 BAL AL 118 309 56 75 12 3 6 32 2 8
## 3 adcocjo01 1955 1 ML1 NL 84 288 40 76 14 0 15 45 0 2
## BB SO IBB HBP SH SF GIDP playerName teamName
## 1 49 61 5 3 7 4 20 Hank Aaron Milwaukee Braves
## 2 89 69 2 3 3 3 6 Cal Abrams Baltimore Orioles
## 3 31 44 3 2 2 1 13 Joe Adcock Milwaukee Braves
The forward pipe operator is exported by dplyr, tidyr, and several over R packages. However, the magrittr package, which is the basis for the pipe, provides additional operations. For example, with the %<>% operator, we can make a data frame the source and the target of a transformation. For example, the following statements accomplish the same task.
batting <- subset(batting, yearID >= 1900)
# ...is the same as
batting %<>% subset(yearID >= 1900)
GOAL: Add two new columns to our data frame BA and OPS, which calculate batting average and on-base percentage plus slugging percentage. Both calculations can be done as one line of code.
# This becomes a one-liner with the use of the mutate function and a pipe.
batting %<>% mutate(BA = H/AB)
# Multimple calculations can be passed to one mutae command.
batting %<>% mutate(BA = H/AB, OBS = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF+SH)+((H+2*X2B+3*X3B+4*HR)/AB))
# Base R equivalent:
# batting$BA <- batting$H / batting$AB
# batting$PA <- batting$AB + batting$BB + batting$HBP + batting$SF + batting$SH
head(batting, 3)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1
## 2 abramca01 1955 1 BAL AL 118 309 56 75 12 3 6 32 2 8
## 3 adcocjo01 1955 1 ML1 NL 84 288 40 76 14 0 15 45 0 2
## BB SO IBB HBP SH SF GIDP playerName teamName BA
## 1 49 61 5 3 7 4 20 Hank Aaron Milwaukee Braves 0.3139535
## 2 89 69 2 3 3 3 6 Cal Abrams Baltimore Orioles 0.2427184
## 3 31 44 3 2 2 1 13 Joe Adcock Milwaukee Braves 0.2638889
## OBS
## 1 1.0282937
## 2 0.8403874
## 3 0.9119521
GOAL: The batting data frame is grouped by player and year. Find the total career home runs for every player in the data frame.
Like joins, grouping and aggregating adhere to concepts found in SQL. The data frame is grouped by player and year, so we’ll use the group_by() function to group by playerID only and then the summarise() function to find the sum of HR grouped by player.
If we did it right, we should see Barry Bonds and Hank Aaron at the top of the list.
batting <- Lahman::Batting
# Grouped calculations
moonShot <- batting %>% group_by(playerID) %>%
summarise(careerHR = sum(HR)) %>% arrange(desc(careerHR))
# Base R equivalent:
# moonShot <- batting
# moonShot <- aggregate(moonShot$HR, by = list(moonShot$playerID), FUN=sum)
# colnames(moonShot) <- c("playerID", "careerHR")
# moonShot <- moonShot[order(-moonShot$careerHR),]
head(moonShot, 3)
## # A tibble: 3 × 2
## playerID careerHR
## <chr> <int>
## 1 bondsba01 762
## 2 aaronha01 755
## 3 ruthba01 714
The tidyr package picks up where dplyr ends. Some of the most useful functions of the package are spread() and gather(), which allows us to recast our data frame in “wide” or “long” formats. The ability to switch data from wide to long formats can be helpful when preforming multiple analysis on a single data set.
Set the batting table “wide” with a column for each year using the spread() function. This would be particularly helpful if plotting or visualizing the data.
batting <- Lahman::Batting
BattingLong <- batting %>% filter(yearID >= 2000 & AB >=200) %>% group_by(playerID, yearID) %>%
summarize(HR = sum(HR))
BattingWide <- spread(BattingLong, key=yearID, value=HR)
# names <- BattingLong$yearID
# BattingWide <- as.data.frame(t(BattingLong[,-2]))
# colnames(BattingWide) <- names
# BattingWide$playerID <- factor(row.names(BattingWide))
head(BattingWide, 3)
## Source: local data frame [3 x 17]
## Groups: playerID [3]
##
## playerID `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 abbotje01 3 NA NA NA NA NA NA NA NA
## 2 abercre01 NA NA NA NA NA NA 5 NA NA
## 3 abernbr01 NA 5 2 NA NA NA NA NA NA
## # ... with 7 more variables: `2009` <int>, `2010` <int>, `2011` <int>,
## # `2012` <int>, `2013` <int>, `2014` <int>, `2015` <int>
Use the gather() function to take the same data and cast it to “long” with two columns; a key and a value.
BattingGathered <- gather(batting)
head(BattingGathered, 10)
## key value
## 1 playerID abercda01
## 2 playerID addybo01
## 3 playerID allisar01
## 4 playerID allisdo01
## 5 playerID ansonca01
## 6 playerID armstbo01
## 7 playerID barkeal01
## 8 playerID barnero01
## 9 playerID barrebi01
## 10 playerID barrofr01
The default visualization in the R language is the plot() function from base R. However, the ggplot2 package has become somewhat of a defacto standard.
GOAL: Subset the Batting table from the Lahman data base to the total number of Runs scored by the entire league, grouped by year, and perform an analysis of how runs scored may be related to batting average.
batting <- Lahman::Batting
# Subset for players who had at least one at-bat, group by year, aggregate, and calculate a league batting average.
batting %<>% subset(AB >= 200 & yearID >= 1920) %>% group_by(yearID, playerID) %>%
summarise(H = sum(H), AB = sum(AB), R = sum(R)) %>% mutate(BA = H/AB)
The ggplot2 package can has tons of options, which can be overwhelming at first. However, it is quite simple in basic usage. Note: The + operator in this package acts the same as the standard pipe operator %>% from the magrittr package.
library(ggplot2)
ggplot(batting, aes(BA, R)) +
geom_point(aes(colour = yearID))
We can see by the plot that there seems to be a relationship between batting average and runs scored.
G = Games played
AB = At bats
R = Runs
H = Hits
X2B = Doubles
X3B = Triples
HR = Home Runs
BB = Walk / base on balls
HBP = Hit by pitch
SF = Sacrifice Fly
SH = Sacrifice Hit
Stint = How many times a player was a member of a team in one season. Example, if Stint=2, the player was traded to another team, then re-acquired by his first team in the same season.