You may want to start by looking over our Joins Lab.
You may also need to do something which we didn’t do in our Introduction to Joins lab - namely joining by multiple columns simultaneously. For instance, to join Batting and Fielding tables by playerID, season and stint simultaneously we can do:
left_join(Fielding, Batting, by=c("playerID", "yearID", "stint"))
In this lab, instead of giving you code to try out, I’m simply going to give you a few challenges that will require using joins.
Once again, we’ll be using the Lahman database, so you’ll want to load the packages you’ll need and the data:
library(Lahman)
library(dplyr)
data(Batting)
data(Master)
data(Pitching)
data(Fielding)
The Master table lists birth cities and there are 38 players born in Brooklyn. Find the total number of games played for each of these players in their career and rank them from most to least games played.
Specifically, I want you to find the top 10 home run seasons for players who played at least 50 games that season at shortstop. The Fielding table has a column POS for position and shortstop is listed as SS. You may well want to start by making a table that includes shortstop seasons.
You are interested in finding batting averages for every season but only want to include hitters in your average. In other words, you need to filter out pitchers before getting the average. The Batting table doesn’t tell you anything about whether these players were pitchers but you could use information from either the Fielding table or the Pitching table. In fact, it might be a good idea to try to do this in a couple of different ways and check whether you are getting the same answers.
Use the Pitching and PitchingPost tables to compare runs allowed per 9 innings in the regular season to the post season for every season since 1980. Note that you’ll need to laod the PitchingPost table to do this:
data(PitchingPost)
Is the post-season or the regular season higher scoring? How does the quality of the pitching and quality of the hitting compare between regular and post-season… or how would you go about answering this question using the data at hand?