library(Lahman)
library(tidyverse)
library(broom)

Problem 1 - Money Ball

Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis about the Oakland Athletics baseball team in 2002 and its general manager, the person tasked with building the team, Billy Beane. During Billy Bean’s tenure as general manager, ownership cut the budget drastically leaving the general manager with one of the lowest payrolls in baseball. Money Ball tells the story of how Billy Bean used analysts to find inefficiencies in the market. Specifically, his team used data science to find low cost players that the data predicted would help the team win.

Statistics have been used in baseball since its beginnings. Note that Lahman (a library containing an extensive baseball database) goes back to the 19th century. Batting average, for example, has been used to summarize a batter’s success for decades. Other statistics such as home runs (HR), runs batted in (RBI) and stolen bases have been reported and players rewarded for high numbers. However, until Bill James introduced sabermetrics, careful analyses had not been done to determine if these statistics actually help a team win. To simplify the exercise we will focus on scoring runs and ignore pitching and fielding.

Problem 1A

Here, we will use the Lahman library. You can see tables that are available when you load this package by typing:

?Lahman

Use the data in the Teams table to explore the relationship between stolen bases (SB) and runs per game in 1999. Make a plot, fit a regression line, and report the coefficients. If you take the coefficient at face value, how many more runs per game does a team score for every extra SB per game?

To correctly account for the number of times a baserunner safely reaches home plate (denoted as R) and the number of times a player has stolen a base (denoted as SB), these numbers need to be divided by the number of games in which a player rhas appeared (denoted as G).

linefit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R.per.game = R / G, SB.per.game = SB / G) %>%
  lm(R.per.game ~ SB.per.game, data = .)

Teams %>%
  filter(yearID == 1999) %>%
  mutate(R.per.game = R/G, SB.per.game = SB/G) %>%
  ggplot(aes(SB.per.game, R.per.game)) +
  theme_classic() + 
  geom_point() + 
  geom_abline(intercept = linefit$coefficients[1], 
              slope = linefit$coefficients[2]) +
  geom_smooth(method = "lm") + 
  ggtitle("The Relationship between stolen bases and runs per game in 1999") +
  labs(x = "Stolen bases", y = "Scoring runs") + 
  theme(plot.title = element_text(hjust = 0.5))

cat("For every unit of extra stolen base per game, the mean number of runs per game increase by", paste0(sprintf("%.5s", linefit$coefficients[2],".")))
## For every unit of extra stolen base per game, the mean number of runs per game increase by 0.429

Based on the model fit, there is a positive relationship between scoring runs and stealing bases. The linear equation of the model can be expressed as: \(E(Y) = 4.7818 + 0.4294X\)

Problem 1B

In Problem 1A we observed a positive relationship between scoring runs and stealing bases. However, the estimated slope coefficient is a random variable. There is chance involved in scoring a run. So how do we know if this observed relationship was not just chance variability?

To examine the variability of this random variable we will consider each year to be a new independent outcome. Use the lm and do functions to fit a linear model to each year since 1961 (when they started playing 162 games per year). Hint: use the function tidy in broom to process the regression in each group so that it can be recombined (see here for examples).

Using this approach, what is your estimate of the slope random variable’s standard error? Is the distribution of the random variable well approximated by a normal distribution? If so, use this to provide a 95% confidence interval for our effect of stolen bases on runs per game. Do you think stolen bases help score runs?

recombined <-
  Teams %>% 
  filter(yearID >= 1961) %>%
  mutate(R.per.game=R/G, SB.per.game=SB/G) %>%
  group_by(yearID) %>%
  do(tidy(lm(R.per.game ~ SB.per.game, data =.))) %>%
  filter(term == "SB.per.game")
  
cat("Using this approach, the estimate of the slope of the slope random variable's standard error is",sd(recombined$estimate))
## Using this approach, the estimate of the slope of the slope random variable's standard error is 0.4065476

Once can assess the residuals by looking at the normal Q-Q plot.

qqnorm(recombined$estimate) # Q-Q plot - normality assessment
qqline(recombined$estimate) # with the line

The points in the Q-Q plot fall on the y-x line, indicating a normality.

Histogram is another useful plot to assess normality.

hist(recombined$estimate)

Ths histogram shows an approximately normal distribution as it is relatively symmetric and centered around 0. Although the histogram does not show a perfert normal distribution, this plot shows that the distribution is still relatively normal given the small sample size (56). It would be helpful to have more data to confirm if the middle portion of the histogram is tall and dense while both tails are small.

Since the distribution of the random variable is well approximately by a normal distribution, we can proceed with our steps and calculate the 95% confidence interval for our effect of stolen bases on runs per game. Given \(E(\bar{X})=\mu\) and \(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\), we can build the 95% confidence interval with the boundaries where \(\hat{\mu} + 1.96*\frac{\sigma}{\sqrt{n}}\) is the upper boundary and \(\hat{\mu} - 1.96*\frac{\sigma}{\sqrt{n}}\) is the lower boundary.

# 95% CI for the effect of stolen bases on runs 
lower <- mean(recombined$estimate) - 1.96*sd(recombined$estimate)
upper <- mean(recombined$estimate) + 1.96*sd(recombined$estimate)

cat("The 95% confdience interval for the effect of stolen bases on runs per game is", paste0("(",sprintf("%.5s", lower)),"-",paste0(sprintf("%.5s",upper),")"))
## The 95% confdience interval for the effect of stolen bases on runs per game is (-0.76 - 0.830)

The 95% confidence interval for our effects of stolen bases on runs per game contains 0, which indicates an insignificance effect on the response.

mean(recombined$estimate) + c(-1,1) * qnorm(0.975) * sd(recombined$estimate)
## [1] -0.7632109  0.8304263

We can double-check that we got the right confidence-interval using the qnorm function in R.

Problem 1C

Even if we didn’t have several years to examine the distribution of our estimate, there is a version of the CLT that applies to regression. It turns out that with a large enough sample size, in this case the number of teams, we can construct a confidence interval. Use the function tidy to report a confidence interval for the effect of SB on runs based exclusively on the 1999 data. What are your thoughts now on the effectiveness of recruiting players that can steal bases?

# CI from tidy
tidyCI <- tidy(linefit, conf.int = TRUE)

tidyCI %>%
  filter(term == "SB.per.game")
## # A tibble: 1 x 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 SB.per.game    0.429     0.431     0.997   0.327   -0.453      1.31

Our confidence interval is wider with a larger sample size. Again, the confidence interval contains 0 as it did from 1(b). Stolen bases per game does not seem to be significant in determining runs sccored per game.

# Regular method of generating CI
regularCLT <- summary(linefit)

cat("The 95% confdience interval for the effect of stolen bases on runs per game is", paste0("(",sprintf("%.5s", regularCLT$coefficient[2,1] +c(-1)*qnorm(0.975)*regularCLT$coefficient[2,2])),"-",paste0(sprintf("%.5s",regularCLT$coefficient[2,1] + c(1)*qnorm(0.975)*regularCLT$coefficient[2,2]),")"))
## The 95% confdience interval for the effect of stolen bases on runs per game is (-0.41 - 1.273)

We get the same confidence interval using the regular method of generating the confidence interval as we did using the tidy function.

Problem 1D

Back in 2002 (the year of the money ball story described above), bases on balls (BB) did not receive as much attention as other statistics. Repeat the above analysis we performed in 1C for BB per game in 1999. Do BB have a larger effect on runs than SB?

BBFit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R.per.game = R/G, BB.per.game = BB/G) %>%
  lm(R.per.game ~ BB.per.game, data = .)

# CI from tidy
BBtidyCI <- tidy(BBFit, conf.int = TRUE)

BBtidyCI %>%
  filter(term == "BB.per.game")
## # A tibble: 1 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 BB.per.game    0.547     0.137      3.99 0.000429    0.266     0.827

BB have a larger effect on runs than SB because the confidence interval for BB does not contain 0, which indicates that the relationship between BB per game and runs per game is significant whereas stolen bases per game has an insignificant relationship. Both the upper boundary and lower boundary are positive values; therefore, for every additional walk a team receives per game, they are increasing their mean number of runs per game.

BBFit <- Teams %>%
  filter(yearID == 1999) %>%
  mutate(R.per.game = R/G, BB.per.game = BB/G) %>%
  lm(R.per.game ~ BB.per.game, data = .)

# CI from tidy
BBtidyCI <- tidy(BBFit, conf.int = TRUE)

BBtidyCI %>%
  filter(term == "BB.per.game")
## # A tibble: 1 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 BB.per.game    0.547     0.137      3.99 0.000429    0.266     0.827
# Regular method of generating CI
regularBBCLT <- summary(BBFit)
regularBBCLT$coefficient[2,1] + c(-1,1)*qnorm(0.975)*regularBBCLT$coefficient[2,2]
## [1] 0.2782708 0.8150928
# Regular method of generating CI
regularBBCLT <- summary(BBFit)
regularBBCLT$coefficient[2,1] + c(-1,1)*qnorm(0.975)*regularBBCLT$coefficient[2,2]
## [1] 0.2782708 0.8150928

We get the same confidence interval using the regular method of generating the confidence interval as we did using the tidy function.

Problem 1E

Association is not causation. It turns out that HR hitters also obtain many BB. We know for a fact that HRs cause runs because, by definition, they produce at least one. We can see this by simply plotting these two statistics for all players with more than 500 plate appearances (BB+AB):

Batting %>%
  filter(yearID >= 1961 & BB+AB > 500 & !is.na(HR) & !is.na(BB)) %>% 
  mutate(HR = factor(pmin(HR, 40))) %>%
  ggplot(aes(HR, BB)) +
  geom_boxplot()

So, is the relationship we saw above for BB and runs due to teams having more HRs also having more BBs? One way we can explore this is by keeping HR fixed and examining the relationship within the strata. For example, if we look only at teams with 150 HRs, do more BBs produce more runs?

We can’t perform this analysis on a single year, because there are not enough teams to obtain strata with more than one or two teams. Instead we will combine all data across years since 1961.

Group data by the number of HRs and perform a regression analysis in each stratum to determine the effect of BB per game on runs per game. Use 10th, 20th, … quantiles to split the data into 10 groups. Hint: use the function cut and quantile to create the strata. Does the relationship between BB and runs seem linear within each strata?

quantile <- Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R.per.game = R/G, BB.per.game = BB/G, HR.per.game = HR/G) %>% 
  mutate(group = cut(HR.per.game, quantile(HR.per.game, prob = seq(0, 1, .1)), 
                     include.lowest = TRUE))
BB <- 
  quantile %>%
  group_by(group) %>%
  do(tidy(lm(R ~ BB.per.game, data = .))) %>%
  filter(term == "BB.per.game")

## graph for each stratum
quantile %>% 
  ggplot(aes(BB.per.game, R)) +
  theme_classic() +
  geom_point(color = "grey") + 
  geom_smooth(method = "lm") + 
  facet_wrap(~group) +
  labs(x = "BB per game", y = "R per game") + 
  ggtitle("Regression analysis in each stratum") + 
  theme(plot.title = element_text(hjust = 0.5))

Based on the plot shown above, there appears to be linear relationships between walks and runs within each strata.

Problem 1F

In problem 1E, we saw that the effect of BB on runs appears to be about the same in each strata. The relationship between HR and R is also, not surprisingly, linear:

Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, HR = HR / G) %>%
  ggplot(aes(HR, R)) +
  geom_point()

These two combined implies that a sensible linear model says:

\[ \mbox{Runs} = \beta_0 + \beta_{BB} \mbox{BB} + \beta_{HR}{HR} + \varepsilon \]

In this model, we adjust for HRs by including it as linear term. Note that we have already shown data that support this model. In general, simply fitting such a model does not necessarily adjust for a possible confounder. The model must also be approximately correct.

We can fit this model like this:

fit <- Teams %>%
  filter(yearID >= 1961) %>% 
  mutate(R = R / G, BB = BB / G, HR = HR / G) %>%
  lm(R ~ BB + HR, data = .)

Note that the summary shows a very strong HR effect but also a decent BB effect. Now, what happens if we include singles (H-X2B-X3B-HR), extra bases (doubles plus triples, X2B + X3B), and HRs per game in our model? What does the model say about which of these characteristics should receive more weight?

Also, fit the model to each year independently to check for consistency from year to year. Does the model appear consistent over time?

model <- 
  Teams %>% 
  filter(yearID>=1961) %>%  
  mutate(R.per.game = R/G, BB.per.game = BB/G, singles = (H-X2B-X3B-HR)/G,
         XB.per.game =(X2B+X3B)/G, HR.per.game=HR/G)

modelfit <- lm(R.per.game ~ BB.per.game+singles+XB.per.game+HR.per.game, data = model)

Teams %>%
  filter(yearID >= 1961) %>%
  group_by(yearID) %>%
  mutate(R.per.game = R/G, BB.per.game = BB/G, singles = (H-X2B-X3B-HR)/G,
         XB.per.game = (X2B+X3B)/G, HR.per.game = HR/G) %>%
  do(tidy(lm(R.per.game ~ BB.per.game+singles+XB.per.game+HR.per.game, data = .))) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(yearID, estimate, group = term, col = term)) +
  theme_classic() + 
  geom_line() + 
  geom_point() +
  ggtitle("Model fit for each year independently from year to year") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  labs(x = "Year", y = "Estimate")

Based on this graph, the model estimates seem to fluctuate over time. However, there is an interesting pattern to it. These estimates fluctuate over a relatively small range of values - in most circumstances, no more than 0.5 in the total change of each estimate. Although the plot seems to suggest that the estimates have not been consistent over time, one could also argue that the estimates of the model have remained relatively consistent due to the small changes observed over time. Looking at the graph over the entire time period, the estimates remain within small distinct range except for a few years. However, if one simply looks at this graph, the estimates do not seem very consistent over time.