Tidyverse in R: Tools for Data Manipulation

Introduction

This walk-through provides a brief introduction to the dplyr and tidyr packages in R that allow us to reshape our data for visualization or analysis. We will analyse Major League Baseball data from the Lahaman package. For convenience, the “base R” equivalents to the dplyr and tidy functions at bottom of each code block.

Load the libraries we will need.

library(Lahman)
library(dplyr)
library(tidyr)
library(magrittr)

Loading Data

R can load data from rational and unstructured databases as well as just about any file format one could imagine. For our use case, we will load an R data package called Lahman. This package load Sean Lahman’s baseball database, which is an annual aggregation of statistics for all major league players from 1871 to present.

library(Lahman)
# Load the Batting table from the Lahman database.
batting <- Lahman::Batting
# Take a look at the first few rows of the data set.
head(Batting, 3)

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA

Manipulating Data with dplyr

GOAL: Add full player names from the master data frame to the batting data frame.

The dplyr package has several “join” functions, many of which are similar, in concept, to SQL joins. Full descriptions of dplyr joins can be found here. For our use, a left join is the most appropriate.

# Get Player names and Teams names from other tables
teams <- Lahman::Teams
master <- Lahman::Master

library(dplyr)
# Subset master table to only playerID and player names since that's all we need.
master <- select(master, playerID, nameFirst, nameLast)

# Left join master and batting to get the player names in master
battingPlayer <- left_join(batting, master, by = "playerID")

# Base R equivalent:
# master <- master[, c("playerID", "nameFirst", "nameLast")]
# battingPlayer <- merge(x = master, y = batting, by = "playerID", all.x = TRUE)

We can use the same strategy as above to add full team names to the data set. Note the use of the rename() function to rename the ambiguous column “name” to “teamName.”

teamsYr <- select(teams, teamID, yearID, name)
# Rename the team name column to avoid confusion
teamsYr <- rename(teamsYr, "teamName" = name)
batting <- left_join(battingPlayer, teamsYr, by = c("teamID", "yearID"))

# Base R equivalent:
# teamsYr <- teams[, c("teams", "teamID", "yearID", "name")]
# batting <- merge(x = battingPlayer, y = teamsYr, by = c("tea,ID", "yearID"), all.x = TRUE)

head(batting, 3)

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
##   SO IBB HBP SH SF GIDP nameFirst    nameLast               teamName
## 1  0  NA  NA NA NA   NA     Frank Abercrombie         Troy Haymakers
## 2  0  NA  NA NA NA   NA       Bob        Addy  Rockford Forest Citys
## 3  5  NA  NA NA NA   NA       Art     Allison Cleveland Forest Citys

Subseting in dplyr

GOAL: To take our joined data set and parse it to only years after 1919 and only players who had more than 199 at bats per season. The reasoning here is to eliminate the “dead ball era” while removing temporary minor leaguers and modern National League pitchers.

The tidyr package is a compliment to dplyr and provides additional data manipulation functions such as “spread” or “gather” data frames. In our case, we’re using the unite() function to concatenate the players’ first and last names.

# Player name and team name appear next to playerID and teamID.
batting <- subset(batting, yearID >= 1955 & AB >= 200)

# Concatenate player names with unite() function from tidyr.
library(tidyr)
batting <- unite(batting, playerName, nameFirst, nameLast, sep=" ")

# Base R equivalent:
# batting <- subset(batting, yearID >= 1920 & AB >= 200)
# batting <- transform(batting, playerName = paste0(nameFirst, " ", nameLast))
# batting <- batting[, c("playerID", "playerName", "yearID", "stint", "teamID", "teamName", "lgID", "G", "AB", "R", "H", "X2B", "X3B", "HR", "BB", "HBP", "SF", "SH")]

head(batting, 3)

##        playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB
## 37448 aaronha01   1955     1    ML1   NL 153 602 105 189  37   9 27 106  3
## 37451 abramca01   1955     1    BAL   AL 118 309  56  75  12   3  6  32  2
## 37454 adcocjo01   1955     1    ML1   NL  84 288  40  76  14   0 15  45  0
##       CS BB SO IBB HBP SH SF GIDP playerName          teamName
## 37448  1 49 61   5   3  7  4   20 Hank Aaron  Milwaukee Braves
## 37451  8 89 69   2   3  3  3    6 Cal Abrams Baltimore Orioles
## 37454  2 31 44   3   2  2  1   13 Joe Adcock  Milwaukee Braves

Using pipes in R

Pipes aren’t native to R, many packages make use of the magrittr package, which adds a %>% operator as a pipe. The code block below accomplishes everything we have done with the Lahman data set above with fewer lines of code. Note, that only functions can be piped with the magrittr package.

batting <- Lahman::Batting
batting <- left_join(batting, master, by = "playerID") %>%
    subset(yearID >= 1955 & AB >= 200) %>%
    left_join(teams[,c("teamID", "yearID", "name")], by = c("teamID", "yearID")) %>%
    unite(playerName, nameFirst, nameLast, sep=" ") %>% rename(teamName = name)

# Base R equivalent:
# batting <- Lahmand::Batting
# batting <- merge(x = master, y = batting, by = "playerID", all.x = TRUE)
# batting <- subset(batting, yearID >= 1955 & AB >= 200)
# batting <- transform(batting, name = paste0(nameFirst, " ", nameLast))
# batting <- merge(x = master, y = batting, by = c("tea,ID", "yearID"), all.x = TRUE)
# colnames(batting) <- c("playerID", "yearID", "stint", "teamID", "lgID", "G", "AB", "R",
# "H", "X2B", "X3B", "HR", "RBI", "SB", "CS", "BB", "SO", "IBB", "HBP", "SH", "SF", "GIDP", "playerName",
# "teamName")

head(batting, 3)

##    playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1 aaronha01   1955     1    ML1   NL 153 602 105 189  37   9 27 106  3  1
## 2 abramca01   1955     1    BAL   AL 118 309  56  75  12   3  6  32  2  8
## 3 adcocjo01   1955     1    ML1   NL  84 288  40  76  14   0 15  45  0  2
##   BB SO IBB HBP SH SF GIDP playerName          teamName
## 1 49 61   5   3  7  4   20 Hank Aaron  Milwaukee Braves
## 2 89 69   2   3  3  3    6 Cal Abrams Baltimore Orioles
## 3 31 44   3   2  2  1   13 Joe Adcock  Milwaukee Braves

The forward pipe operator is exported by dplyr, tidyr, and several over R packages. However, the magrittr package, which is the basis for the pipe, provides additional operations. For example, with the %<>% operator, we can make a data frame the source and the target of a transformation. For example, the following statements accomplish the same task.

batting <- subset(batting, yearID >= 1900)

# ...is the same as
batting %<>% subset(yearID >= 1900)

Calculations and Mutation

GOAL: Add two new columns to our data frame BA and OPS, which calculate batting average and on-base percentage plus slugging percentage. Both calculations can be done as one line of code.

# This becomes a one-liner with the use of the mutate function and a pipe.
batting %<>% mutate(BA = H/AB)

# Multimple calculations can be passed to one mutae command.
batting %<>% mutate(BA = H/AB, OBS = (H+BB+IBB+HBP)/(AB+BB+IBB+HBP+SF+SH)+((H+2*X2B+3*X3B+4*HR)/AB))

# Base R equivalent:
# batting$BA <- batting$H / batting$AB
# batting$PA <- batting$AB + batting$BB + batting$HBP + batting$SF + batting$SH

head(batting, 3)

##    playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS
## 1 aaronha01   1955     1    ML1   NL 153 602 105 189  37   9 27 106  3  1
## 2 abramca01   1955     1    BAL   AL 118 309  56  75  12   3  6  32  2  8
## 3 adcocjo01   1955     1    ML1   NL  84 288  40  76  14   0 15  45  0  2
##   BB SO IBB HBP SH SF GIDP playerName          teamName        BA
## 1 49 61   5   3  7  4   20 Hank Aaron  Milwaukee Braves 0.3139535
## 2 89 69   2   3  3  3    6 Cal Abrams Baltimore Orioles 0.2427184
## 3 31 44   3   2  2  1   13 Joe Adcock  Milwaukee Braves 0.2638889
##         OBS
## 1 1.0282937
## 2 0.8403874
## 3 0.9119521

Grouping and Aggregating

GOAL: The batting data frame is grouped by player and year. Find the total career home runs for every player in the data frame.

Like joins, grouping and aggregating adhere to concepts found in SQL. The data frame is grouped by player and year, so we’ll use the group_by() function to group by playerID only and then the summarise() function to find the sum of HR grouped by player.

If we did it right, we should see Barry Bonds and Hank Aaron at the top of the list.

batting <- Lahman::Batting
# Grouped calculations
moonShot <- batting %>% group_by(playerID) %>% 
    summarise(careerHR = sum(HR)) %>% arrange(desc(careerHR))

# Base R equivalent:
# moonShot <- batting
# moonShot <- aggregate(moonShot$HR, by = list(moonShot$playerID), FUN=sum)
# colnames(moonShot) <- c("playerID", "careerHR")
# moonShot <- moonShot[order(-moonShot$careerHR),]

head(moonShot, 3)

## # A tibble: 3 × 2
##    playerID careerHR
##       <chr>    <int>
## 1 bondsba01      762
## 2 aaronha01      755
## 3  ruthba01      714

Spreading Data with tidyr

The tidyr package picks up where dplyr ends. Some of the most useful functions of the package are spread() and gather(), which allows us to recast our data frame in “wide” or “long” formats. The ability to switch data from wide to long formats can be helpful when preforming multiple analysis on a single data set.

Set the batting table “wide” with a column for each year using the spread() function. This would be particularly helpful if plotting or visualizing the data.

batting <- Lahman::Batting

BattingLong <- batting %>% filter(yearID >= 2000 & AB >=200) %>% group_by(playerID, yearID) %>% 
    summarize(HR = sum(HR))

BattingWide <- spread(BattingLong, key=yearID, value=HR)

# names <- BattingLong$yearID
# BattingWide <- as.data.frame(t(BattingLong[,-2]))
# colnames(BattingWide) <- names
# BattingWide$playerID <- factor(row.names(BattingWide))

head(BattingWide, 3)

## Source: local data frame [3 x 17]
## Groups: playerID [3]
## 
##    playerID `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##       <chr>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1 abbotje01      3     NA     NA     NA     NA     NA     NA     NA     NA
## 2 abercre01     NA     NA     NA     NA     NA     NA      5     NA     NA
## 3 abernbr01     NA      5      2     NA     NA     NA     NA     NA     NA
## # ... with 7 more variables: `2009` <int>, `2010` <int>, `2011` <int>,
## #   `2012` <int>, `2013` <int>, `2014` <int>, `2015` <int>

Use the gather() function to take the same data and cast it to “long” with two columns; a key and a value.

BattingGathered <- gather(batting)
head(BattingGathered, 10)

##         key     value
## 1  playerID abercda01
## 2  playerID  addybo01
## 3  playerID allisar01
## 4  playerID allisdo01
## 5  playerID ansonca01
## 6  playerID armstbo01
## 7  playerID barkeal01
## 8  playerID barnero01
## 9  playerID barrebi01
## 10 playerID barrofr01

Visualization

The default visualization in the R language is the plot() function from base R. However, the ggplot2 package has become somewhat of a defacto standard.

GOAL: Subset the Batting table from the Lahman data base to the total number of Runs scored by the entire league, grouped by year, and perform an analysis of how runs scored may be related to batting average.

batting <- Lahman::Batting

# Subset for players who had at least one at-bat, group by year, aggregate, and calculate a league batting average.
batting %<>% subset(AB >= 200 & yearID >= 1920) %>% group_by(yearID, playerID) %>% 
    summarise(H = sum(H), AB = sum(AB), R = sum(R)) %>% mutate(BA = H/AB)

The ggplot2 package can has tons of options, which can be overwhelming at first. However, it is quite simple in basic usage. Note: The + operator in this package acts the same as the standard pipe operator %>% from the magrittr package.

library(ggplot2)
ggplot(batting, aes(BA, R)) + 
    geom_point(aes(colour = yearID))

We can see by the plot that there seems to be a relationship between batting average and runs scored.

Data Dictionary

Column definitions from the Lahman Database:

G = Games played
AB = At bats
R = Runs
H = Hits
X2B = Doubles
X3B = Triples
HR = Home Runs
BB = Walk / base on balls
HBP = Hit by pitch
SF = Sacrifice Fly
SH = Sacrifice Hit
Stint = How many times a player was a member of a team in one season. Example, if Stint=2, the player was traded to another team, then re-acquired by his first team in the same season.