Exploratory data analysis and baseball

Executive Summary

In any professional sports, how well the teams spend their money means more than the difference between a championship and a flop. It’s no different with baseball, the sport that introduces the concepts of professionalism and moneyball.

For those who are not used to the term, moneyball is used to describes baseball operations in which a team endeavors to analyze the market for baseball players and buy who is undervalued and sell who is overvalued. Unlike a common misconception, it is not about on-base percentage (a measure of how often a batter reaches base for any reason other than a fielding error, fielder’s choice, dropped/uncaught third strike, fielder’s obstruction, or catcher’s interference), but to explore methods of rating players.

It is most commonly used to refer to the strategy used by the front office of the 2002 Oakland Athletics, with approximately US$44 million in salary, were competitive with larger market teams such as the New York Yankees, who spent over US$125 million in payroll that same season. It derives its name from the 2003 book from Michael Lewis about the team’s analytical, evidence-based, sabermetric approach. Suffice to say that there is also a 2011 motion picture of the same name, based on the book, starring Brad Pitt and Jonah Hill, for which the term became mainstream.

The data

I will be using data from two very useful databases on baseball teams, players and seasons. One is curated by Sean Lahman, available at http://www.seanlahman.com/baseball-archive/statistics/. The other, is from the nutshell package, which contains data sets used as examples in the book “R in a Nutshell” by Joseph Adler. More information about the package is available at https://cran.r-project.org/web/packages/nutshell/index.html.

The reason for pick two different datasets instead of one is because I wanted to perform the analysis in different sources. The decision proved right for account of speed and practicality too. The Lahman data set uses data on pitching, hitting and fielding performance and other tables from 1871 through 2015. As we can see, is thoroughly and updated. The Nutshell’s on the other hand, is better designed for learning approaches (at least in my opinion) and comprises statistical data from 2000 - 2008 for every Major League Baseball team.

For those who are not familiar with baseball, a few points of explanation are important:

Major League Baseball is a professional baseball league, where teams pay players to play baseball (I know it sounds silly and redundant, but I have to be sure everybody knows what we are talking about here).
The goal of each team is to win as many games out of a 162 game season as possible. This allows a ticket to the post season and a chance to play at the World Series, where the champion is defined.
Teams win games by scoring more runs than their adversary. A run is computed when a player advances around first, second and third base and returns safely to home plate (in other words, do a round around the infield).
In principle, better players are expensive, so teams that want good players need to spend more money.
Teams that spend the most, frequently won the most (not always but so often that is fair to consider it a case of cause and effect).

Analysis

Using Lahman data set

The idea is to create a new measurement unit we can plot across time that summarizes how efficient each team is in their spending. I’ll do this by calculating the mean of a team payroll and the mean of its win percentage. I also decided to restrict this analysis to 2014 and 2015.

Codes to load the data:

Subset for years 2014 and 2015:

data_salaries <- subset(Salaries, yearID == 2014, select = c(yearID, teamID, lgID, playerID, salary))
data_salaries2 <- subset(Salaries, yearID == 2015, select = c(yearID, teamID, lgID, playerID, salary))
data_salaries <- rbind(data_salaries, data_salaries2)

data_teams1 <- subset(Teams, yearID == 2014, select = c(yearID, teamID, lgID, G, W, L))
data_teams2 <- subset(Teams, yearID == 2015, select = c(yearID, teamID, lgID, G, W, L))
data_teams <- rbind(data_teams1, data_teams2)

Filtering for 2014 and 2015:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

total_payroll <- data_salaries %>%
  group_by(yearID, teamID, lgID) %>%
  summarise(salary = sum(as.numeric(salary))) %>%
  group_by(yearID, teamID, lgID) %>%
  arrange(yearID)

total_payroll <- total_payroll[-c(9, 32), ]

Creating a dataframe with the assembled data:

moneyball <- data.frame(yearID = total_payroll$yearID, teamID = total_payroll$teamID,
                        lgID = total_payroll$lgID, salary = total_payroll$salary,
                        G = data_teams$G, W = data_teams$W, L = data_teams$L)

Calculating the number of wins per dollar and win percentage:

moneyball$WperDollar <- (moneyball$W/moneyball$salary)

moneyball$win_percentage <- (moneyball$W/moneyball$G)

To set the final data set:

avgStatsPerYear <- moneyball %>%
  group_by(teamID) %>%
  summarise(averagePayinYears = mean(salary),
            averageWininYears = mean(win_percentage))

To check, I will test the New York Yankees Team Spending Efficiency against the other teams. I encourage you to pick any team you want if you intend to reproduce this paper.

library(ggplot2)
library(digest)

avgStatsPerYear %>%
  ggplot(aes(x=averagePayinYears, y=averageWininYears)) +
  geom_point(aes(colour=ifelse(teamID=="NYA", 'NY Yankees', "Other Teams"))) +
  xlab("Average Team Payroll") +
  ylab("Average Winning Percentage") +
  ggtitle("NY Yankees Spending Efficiency 2014-2015") +
  geom_smooth(method = 'lm') +
  labs(colour="Team") +
  theme(text = element_text(),
        axis.text = element_text(angle = 90, vjust = 1))

Using nutshell package

The idea is use this data set to applies a linear model which predicts the number of runs scored by a team and provide the prediction confidence interval based on the variables below.

Singles
Doubles
Triples
Home Runs
Walks
Hit by Pitch
Sacrifice Flies
Stolen Bases
Caught Stealing

I intend to see how this particular model goes by plotting some diagnostic plots for Linear Regression Analysis.

Codes to load the data:

## Loading required package: nutshell.bbdb

## Loading required package: nutshell.audioscrobbler

It’s a very simple linear model but performs very well predicting runs, with a 95% confidence interval.

runs.mdl <- lm(
  formula=runs~singles+doubles+triples+homeruns+
    walks+hitbypitch+sacrificeflies+
    stolenbases+caughtstealing,
  data=team.batting.00to08)

Residuals Vs Fitted

The residual data of a linear regression model is the difference between the observed data of the dependent variable y and the fitted values y. If we find equally spread residuals around a horizontal line without distinct patterns, it’s a good indication that we don’t have non-linear relationships.

plot(runs.mdl, 1)

Normal Q-Q

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.

plot(runs.mdl, 2)

Scale-Location

It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how is checked the assumption of equal variance (homoscedasticity). If there’s a horizontal line with equally (randomly) spread points, it means the scale-location is good.

plot(runs.mdl, 3)

Residuals Vs Leverage

This plot helps us to find influential cases. We watch out for outlying values at the upper right corner or at the lower right corner. Those spots are the places where cases can be influential against a regression line. When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores), the cases are influential to the regression results, so the regression results will be altered if we exclude those cases.

plot(runs.mdl, 5)

An application

One of the reasons that I chose the nutshell data set is because it is used as a case study from the book “R in a Nutshell” by Joseph Adler. Inspired by this case, I developed a simple app to predicts the number of runs scored by a team based on the model we just saw. For those curious to see it, a demo for the app can be found at: https://marcelotibau.shinyapps.io/baseball-prediction/

I invite you to follow me on twitter @marcelo_tibau or check my site at tibau.org