Data scientists spend most of their time cleaning and reshaping their data. In order to find the truths hidden within the data you will often need to rearrange it. tidyr is a package that can you do this. More specifically, tidyr can take “long” data and make it “wide” or take “wide” dat and make it “long”.

Packages and Data

Let’s load some packages we’ll need and three of the data sets in the Lahman package:

library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyr)

data(HallOfFame)
data(Batting)
data(Pitching)
data(Master)

Spread: From Long to Wide

The spread* function takes a long data set and makes it wide. Let’s take a look at what that means using Hall of Fame data. The “HallOfFame” data set has data on Hall of Fame voting

View(HallOfFame)

Let’s look only at votes for players by the BBWAA (Baseball Writers’ Association of America) since the year 2000:

HOFlong <- HallOfFame %>% 
  filter(category=="Player", votedBy=="BBWAA", yearID>=2000) %>% 
  select(playerID, yearID, votes)

View(HOFlong)

To take this long data and make it wide, we can use the spread function. Here’s the key is going to be the column that we’ll split a series of columns names and value is the data that will fill those columns.

HOFwide <- spread(HOFlong, key=yearID, value=votes)
View(HOFwide)

Do you see what happenned?

Let’s try this with batting data. First, let’s take a look at what we’re starting with:

View(Batting)

Next, let’s get seasonal HR totals for every player since 2010 (the original data set is split by team for players who switch teams within a season).

BattingLong <- Batting %>% filter(yearID >= 2010) %>% group_by(playerID, yearID) %>% summarize(HR = sum(HR))
View(BattingLong)

Now, let’s spread the data:

BattingWide <- spread(BattingLong, key=yearID, value=HR)
View(BattingWide)

Finally, here are the correlations between HR’s in pairs of seasons (rounded to the nearest two digits):

round(cor(BattingWide[,-1], use="pairwise.complete"),2)

Q1: What do you notice about the correlations in home runs between pairs of seasons?

Challenge #1: Create a wide table with strikeout numbers for pitchers from 2011 through 2015 and find the correlations between pairs of seasons.

Next, let’s do the same thing we did with homeruns but with on base percentage (OBP). We’ll limit this to players with at least 400 plate appearances (AB + BB + HBP +SF) and start by calculating the OBP for every qualifying season since 2010. Then we’ll make the “wide” table again and calculate correlations.

BattingLong <- Batting %>% filter(yearID >= 2010, AB + BB + HBP + SF>=400) %>% group_by(playerID, yearID) %>% 
  summarize(OBP = sum(H + BB + HBP)/sum(AB + BB + HBP + SF))
View(BattingLong)

BattingWide <- spread(BattingLong, key=yearID, value=OBP)
View(BattingWide)

round(cor(BattingWide[,-1], use="pairwise.complete"),2)

Q2: How do the correlations in OBP compare to the correlations in HR?

Challenge #2: Find year to year correlations in ERA (earned runs per 9 innings pitched) for pitchers. You should place a lower limit on the number of innings pitched (just as we place a lower limint on plate appearances when looking at OBP).

Gather: From Wide to Long

Let’s start by making a table that has four dates for every player, their birth dates, the dates of their debuts and final games and their deaths.

View(Master)
MasterWide <- Master %>% 
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame, deathDate) %>%
  mutate(birthDate=as.character(birthDate), deathDate=as.character(deathDate))
View(MasterWide)

dim(MasterWide)

Notice that the final line above tells us the dimensions of this table (18,846 x 7) Now, let’s see what gather can do. We’re going to combine all of the dates into one column (“date”) and create another column (“dateType”) to tell us what kind of data we’re looking at:

MasterLong <- gather(MasterWide, key=dateType, value=date, c(birthDate, deathDate, debut, finalGame))
View(MasterLong)

dim(MasterLong)

In some case, arranging the data like this can make it easier to graph:

MasterLong <- MasterLong %>% mutate(date=as.Date(date))

ggplot(MasterLong, aes(date, fill=dateType))+geom_density(alpha=0.1)

Challenge #3:

Using gather, make a long table of pitching stats with a column for statistic_type that has values of W, L, G, GS, CG, SHO and SV and a value column with the value of that statistics for each pitcher.