How many runs is a home run worth? Is it worth twice as much as a double? We can use linear regression to estimate the answer to these questions.

The Data

We’ll look at the number of runs each team scored in a season and try to predict it from the team’s singles, doubles, triples, home runs and walks. The data we need is in the Teams data.frame in the Lahman package.

library(Lahman); library(dplyr); library(ggplot2)

data(Teams)
head(Teams)

Filtering and Mutating

Before we get started, there are two things we may want to do with our data set. First, so that we’re comparing teams that have played the same number of games, let’s only go back as far as the 1961 season and eliminate team seasons in which the team played some number of games other than 162. Second, to calculate the number of singles, we can subtract doubles, triples and home runs from the number of hits.

teams <- Teams %>% filter(G==162 & yearID >= 1961)
teams <- teams %>% mutate(X1B = H - X2B - X3B - HR)

Using Linear Models like “group_by”

Before we dive into our main objective, it’s worth noting that we can use linear regression to calculate group averages. The first line of code below estimates each team’s average run scoring relative to the first team alphabetically (Anaheim): the intercept is Anaheim’s average and every other coefficient is a difference from it. The second line drops the intercept with “+0”, so each coefficient is that team’s average scoring.

lm(R ~ teamID, data=teams)


lm(R ~ teamID +0, data=teams)
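
To see why this works, here is a minimal sketch on a made-up data frame (the team labels and run totals are invented for illustration and are not Lahman data). It uses base R’s tapply as a stand-in for a group_by-style average and compares the result to the regression coefficients.

```r
# Toy data, invented for illustration: three teams, two seasons each
toy <- data.frame(
  teamID = factor(c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC")),
  R      = c(700, 720, 650, 670, 800, 780)
)

# Group averages computed directly
group_means <- tapply(toy$R, toy$teamID, mean)

# Without an intercept, each coefficient is that team's average
coef_no_intercept <- coef(lm(R ~ teamID + 0, data = toy))

# With an intercept, the intercept is the first team's average and the
# other coefficients are differences from that first team
coef_with_intercept <- coef(lm(R ~ teamID, data = toy))

print(group_means)
print(coef_no_intercept)
print(coef_with_intercept)
```

The two parameterizations describe the same fitted values; only the reference point for the coefficients changes.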

Likewise, we can use linear regression to calculate the average scoring by season, by league, or even for each league season. Note that unless we use “as.factor()” to tell it otherwise, R will treat yearID as a number.

lm(R ~ as.factor(yearID) +0, data=teams)

lm(R ~ lgID +0, data=teams)

lm(R ~ lgID:as.factor(yearID)+0 , data=teams)

Predicting Run Scoring

Now let’s move on to our main objective: predicting run scoring.

lm(R ~ X1B+X2B+X3B+HR+BB, data=teams)

What does this model tell us about the relative value of singles, doubles, triples and home runs? Is that what you would expect? According to this model, how much is a triple worth? Is this value plausible?

Pay careful attention to what happens to the coefficient of triples when we add stolen bases and caught stealing to the model:

lm(R ~ X1B+X2B+X3B+HR+BB+SB+CS, data=teams)

Why might this happen?

Making Predictions

Let’s calculate how many runs each team “should” have scored according to the model. We’ll call this expected.runs. Let’s also calculate the residuals: the runs actually scored minus the runs expected, so a positive residual means a team scored more than the model predicted.

m <- lm(R ~ X1B+X2B+X3B+HR+BB, data=teams)
teams$expected.runs <- predict(m, teams)
teams$residuals <- residuals(m)
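
As a quick check on the sign convention, the sketch below fits a model to a small made-up data set (the x and y values are invented for illustration) and confirms that residuals() returns actual minus expected.

```r
# Made-up data, for illustration only
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

# Expected values from the model, then actual minus expected by hand
expected      <- predict(fit, toy)
resid_by_hand <- toy$y - expected

# Compare the hand calculation with what residuals() returns
print(all.equal(unname(resid_by_hand), unname(residuals(fit))))
```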

We can use ggplot to visually compare our expectations to the actual runs scored. We’ll color our data points by season.

ggplot(teams, aes(expected.runs, R, color=as.factor(yearID)))+geom_point()

We can also calculate the correlation between our model’s predictions and the actual run scoring.

teams %>% summarize(cor(expected.runs, R))
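
For a least-squares model that includes an intercept, the square of this correlation equals the model’s R-squared. The sketch below illustrates that on simulated data; the variables x1, x2 and y and their coefficients are arbitrary assumptions, not baseball quantities.

```r
set.seed(1)

# Simulated predictors and response, chosen arbitrarily for illustration
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)

# Correlation between fitted and actual values, and the reported R-squared
r         <- cor(fitted(fit), toy$y)
r_squared <- summary(fit)$r.squared

print(c(cor_squared = r^2, r_squared = r_squared))
```

The same relationship should hold for the run-scoring model m above, since it also includes an intercept.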

We can use dplyr to find the 10 teams that most outperformed and the 10 teams that most underperformed our model’s expectations:

teams %>% top_n(10, residuals)

teams %>% top_n(10, desc(residuals))

Checking residuals is typically a great way to find things that your model overlooked, although in this case the identities of the teams that outscored and underscored our predictions tell me very little.

Linear Weights

It may be easier to understand the value of baseball events if we compare them to the natural alternative: an out. Let’s estimate outs as at bats that weren’t hits and then add outs to our model and remove the intercept term.

teams <- teams %>% mutate(outs = AB-H)
m <- lm(R ~ X1B+X2B+X3B+HR+BB+outs+0, data=teams)
coef(m)

If a single is worth 0.526 runs but an out is worth -0.106 runs, then a single is actually worth 0.526 + 0.106 = 0.632 runs more than an out.

We can calculate the values of singles, doubles, triples, home runs and walks relative to outs by doing:

coef(m)[1:5]-coef(m)[6]

According to this measure, how many times more valuable is a home run than a single?

The Cost of a Strikeout

Baseball people often talk about the value of putting the ball in play and moving runners over: making productive outs, in other words. Strikeouts are non-productive outs, so we might ask how much worse a strikeout is than an out in play. We can attempt to answer this question by adding strikeouts to our regression model.

lm(R ~ X1B+X2B+X3B+HR+BB+outs+SO+0, data=teams)
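
One point that helps when reading this output: outs (AB - H) already counts strikeouts, so the outs coefficient describes an out in play and the SO coefficient is the additional cost of a strikeout on top of that. The sketch below illustrates this with simulated, noise-free data; the column names hits and outs_inplay and the run values 0.50, -0.10 and -0.15 are assumptions made up for the demonstration, not estimates from the Lahman data.

```r
set.seed(42)
n <- 200

# Simulated team-season counts (ranges are arbitrary)
sim <- data.frame(
  hits        = round(runif(n, 1300, 1600)),
  outs_inplay = round(runif(n, 2900, 3300)),
  SO          = round(runif(n, 800, 1400))
)

# Assumed "true" run values: an out in play costs 0.10 runs,
# a strikeout costs 0.15 runs
sim$R <- 0.50 * sim$hits - 0.10 * sim$outs_inplay - 0.15 * sim$SO

# As in the lab, outs counts every out, strikeouts included
sim$outs <- sim$outs_inplay + sim$SO

est <- coef(lm(R ~ hits + outs + SO + 0, data = sim))
print(round(est, 3))
```

In this simulation the SO coefficient recovers -0.05, the gap between a strikeout (-0.15) and an out in play (-0.10), rather than the full value of a strikeout.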

According to our model, how much worse is a strikeout than an out in play?

Assignment

Using code from our last lab, determine which performs better out-of-sample, a model that includes strikeouts as a predictor or one that does not.