The Data

People often keep and pass along data in .csv (comma separated values) files. I have emailed you three .csv files which you should upload into your RStudio directory (look for the upload button in the pane in the lower-righthand corner of your screen).

You can then read them in as follows:

player_info <- read.csv('player_info.csv')
player_stats <- read.csv('player_stats.csv')
teams <- read.csv('teams.csv')

Now take a look at these three (small) tables:

player_info
player_stats
teams

Each table has a different kind of information. If would be useful to be able to match them up (or JOIN them) but the rows don’t match up readily. In fact, some values that we’d like to match up may be missing from other tables. This is a fairly typical predicament except that we’ll usually be working with much larger tables. In fact, in many cases our tables will be so large that matching values up by hand will be entirely unfeasable.

Q1: What column could we use to match up values in player_stats with values in player_info? Q2: What column could we use to match up values in player_stats with values in teams? Q3: Is there a way to match up values in teams with values in player_info? If not, why not?

The dplyr package includes many different types of joins. The best way to understand them may be to see what they do. First, we need to load the dplyr package.

library(dplyr)

inner_join(player_info, player_stats)

left_join(player_info, player_stats)
left_join(player_stats, player_info)

anti_join(player_info, player_stats)
anti_join(player_stats, player_info)

full_join(player_info, player_stats)

If you’re now wondering what you’re looking at, here’s a Venn diagram that summarizes the different joins. .

The Venn diagrams show that inner_join only returns the intersection of the two tables – meaning rows that show up in both tables whereas full_join returns the union of the two tables – any rows that show up in either table.

Also note that we relied on dplyr’s ability to figure out what column we wanted to use to join the data. It was able to do this because there was only one choice. In cases where dplyr may not be able to read our minds, we can tell it what column to join by:

full_join(player_info, player_stats, by="playerid")

Q4: Try joining player_stats with teams in such a way that you get one line of output for every player stat line and that where team names are missing, your output displays NA in their place.

Getting Serious

Okay, now you’re ready the the real test. Let’s get back to our Lahman baseball database

library(Lahman)
data(Batting)
data(Master)
data(HallOfFame)

We can use joins to add names to some of the tables we’ve made in the past. For instance we previously found all members of the 40/40 club (40 steals and 40 home runs in a season) as follows:

Batting %>% filter(SB >=40 & HR>= 40)

Now, we can add names to these lines

left_join(Batting %>% filter(SB >=40 & HR>= 40), Master)

We might want to clean this up a little bit and only select the rows that we’re interested in:

left_join(Batting %>% filter(SB >=40 & HR>= 40), Master) %>% select(nameFirst, nameLast, yearID, HR, SB)

Ah! Now we’re getting somewhere. Let’s find the 20 players with the most career home runs:

Batting %>% group_by(playerID) %>% summarize(HR = sum(HR)) %>% top_n(20, HR)

Now suppose, I’m interested I’d like to find out which of these players did and didn’t make the Hall of Fame. I’m going to split this into a few lines to make things a bit simpler:

top20_hr_hitters <- Batting %>% group_by(playerID) %>% summarize(HR = sum(HR)) %>% top_n(20, HR)

hofers <- HallOfFame %>% filter(inducted=="Y" & category=="Player")

inner_join(top20_hr_hitters, hofers) %>% select(playerID, HR)
anti_join(top20_hr_hitters, hofers)
left_join(top20_hr_hitters, hofers) %>% select(playerID, HR, inducted)

I might also be interested in how old players were when they had the top 20 HR seasons:

top20_hr_seasons <- Batting %>% top_n(20, HR)


left_join(top20_hr_seasons, Master) %>% mutate(Age = yearID-birthYear) %>% 
  select(nameFirst, nameLast, HR, Age) %>% arrange(desc(HR))

I could even be interested in the age at which batters have hit the highest batting average:

left_join(Batting, Master) %>% mutate(Age = yearID-birthYear) %>% 
  group_by(Age) %>% summarize(at.bats = sum(AB, na.rm=TRUE), 
                              BA = sum(H, na.rm=TRUE)/sum(AB, na.rm=TRUE)) %>%
        filter(at.bats > 10000)

… or the age at which players have hit the most home runs:

left_join(Batting, Master) %>% mutate(Age = yearID-birthYear) %>% 
  group_by(Age) %>% summarize(HR = sum(HR, na.rm=TRUE)) %>% top_n(10, HR) %>% arrange(desc(HR))

Q5: At what age have players stolen the most bases?

Q6: At what age do players have the most walks?

Q7: Find the number of home runs by birth country and by birth city. What city has produced the most home runs?

Q8: Find the ratio of triples to doubles for left-handed and right-handed batters. Who hits more triples?

Q9: Find the average height and weight for the 20 hitters with the most career home runs.

Q10: Find the average height and weight for the 20 hitters with the most career stolen bases.

Time Permitting:

Try other joins to answer your own questions.

Introduction to Joins

Sports Data Science

The Data

Getting Serious

Time Permitting: