18.4 Exercises
We have shown how BB and singles have similar predictive power for scoring runs. Another way to compare the usefulness of these baseball metrics is by assessing how stable they are across the years. Since we have to pick players based on their previous performances, we will prefer metrics that are more stable. In these exercises, we will compare the stability of singles and BBs.
1. Before we get started, we want to generate two tables: one for 2002 and another for the average of the 1999-2001 seasons. We want to define per plate appearance statistics. Here is how we create the 2002 table, keeping only players with at least 100 plate appearances. Now compute a similar table but with rates computed over 1999-2001.
library(Lahman)
library(dplyr)
library(magrittr)
library(ggplot2)
# 2002 per-plate-appearance rates (pa is approximated as AB + BB)
dat <- Batting |>
  filter(yearID == 2002) |>
  mutate(pa = AB + BB, singles = (H - X2B - X3B - HR) / pa, bb = BB / pa) |>
  filter(pa >= 100) |>
  select(playerID, singles, bb)
# 1999-2001: per-season rates, then averaged within each player
dat2 <- Batting %>%
  filter(yearID %in% 1999:2001) %>%
  mutate(pa = AB + BB, singles = (H - X2B - X3B - HR) / pa, bb = BB / pa) %>%
  filter(pa >= 100) %>%
  group_by(playerID) %>%
  summarize(mean_singles = mean(singles), mean_bb = mean(bb))
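One caveat, offered as a sketch and not as the intended solution: the Batting table has one row per player, season, and stint, so a player who changed teams mid-season contributes several rows, and row-by-row rates treat each stint as its own season. An alternative is to pool the counts per player before dividing. The toy table below uses made-up numbers and hypothetical player IDs in place of Batting, and the 100 plate appearance cutoff is applied to the three-season total, which is a different choice from the per-row filter above.

```r
library(dplyr)

# Toy stand-in for Lahman::Batting (made-up numbers, hypothetical IDs).
# Player a01 has two rows in 2000, mimicking two stints in one season.
toy <- tibble(
  playerID = c("a01", "a01", "a01", "b01", "b01"),
  yearID   = c(1999, 2000, 2000, 1999, 2001),
  AB  = c(400, 150, 250, 300, 500),
  BB  = c(50, 10, 40, 20, 60),
  H   = c(120, 40, 70, 80, 150),
  X2B = c(20, 5, 10, 15, 30),
  X3B = c(2, 0, 1, 1, 3),
  HR  = c(10, 3, 9, 4, 17)
)

# Sum the counts per player first, then compute the rates once.
pooled <- toy %>%
  filter(yearID %in% 1999:2001) %>%
  group_by(playerID) %>%
  summarize(
    pa = sum(AB + BB),
    mean_singles = sum(H - X2B - X3B - HR) / pa,
    mean_bb = sum(BB) / pa
  ) %>%
  filter(pa >= 100) %>%
  select(playerID, mean_singles, mean_bb)

pooled
```

Pooling weights each season by its plate appearances, whereas averaging per-season rates weights every season (or stint) equally; which one is preferable depends on the question being asked.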
2. You can use the inner_join function to combine the 2002 data and the 1999-2001 averages in the same table. Compute the correlation between 2002 and the previous seasons for singles and BB:
dat3<-inner_join(dat, dat2, by="playerID")
cors<-cor(dat3$singles, dat3$mean_singles)
corb<-cor(dat3$bb, dat3$mean_bb)
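To make the effect of inner_join concrete, here is a minimal sketch on two toy tables with made-up rates and hypothetical player IDs. Only players present in both tables survive the join, so anyone without qualifying 1999-2001 data (or without qualifying 2002 data) drops out before the correlation is computed.

```r
library(dplyr)

# Toy versions of the 2002 table and the 1999-2001 averages (made-up values).
cur  <- tibble(playerID = c("a01", "b01", "c01"), bb = c(0.10, 0.08, 0.12))
prev <- tibble(playerID = c("a01", "b01", "d01"), mean_bb = c(0.11, 0.07, 0.09))

# c01 has no earlier seasons and d01 has no 2002 row, so both are dropped.
both <- inner_join(cur, prev, by = "playerID")
both
```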
3. Note that the correlation is higher for BB. To quickly get an idea of the uncertainty associated with this correlation estimate, we will fit a linear model and compute confidence intervals for the slope coefficient. However, first make scatterplots to confirm that fitting a linear model is appropriate.
dat3 %>% ggplot(aes(singles, mean_singles)) + geom_point()
dat3 %>% ggplot(aes(bb, mean_bb)) + geom_point()
Both scatterplots look approximately bivariate normal, so fitting a linear model is appropriate.
4. Now fit a linear model for each metric and use the confint function to compare the estimates.
singles<-lm(singles~mean_singles, dat3)
singles
##
## Call:
## lm(formula = singles ~ mean_singles, data = dat3)
##
## Coefficients:
## (Intercept)  mean_singles
##     0.06206       0.58813
confint(singles)
##                   2.5 %     97.5 %
## (Intercept)  0.04747792 0.07664646
## mean_singles 0.49943734 0.67683074
bb<-lm(bb~mean_bb, dat3)
bb
##
## Call:
## lm(formula = bb ~ mean_bb, data = dat3)
##
## Coefficients:
## (Intercept)      mean_bb
##     0.01548      0.82905
confint(bb)
##                   2.5 %     97.5 %
## (Intercept) 0.007789552 0.02317953
## mean_bb     0.748916885 0.90918171
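The slope interval for BB (about 0.75 to 0.91) lies entirely above the one for singles (about 0.50 to 0.68). As a rough guide to why the slope is a sensible stand-in for the correlation: in simple linear regression the slope equals the correlation times sd(y) / sd(x). The sketch below checks that identity and extracts the slope interval from confint. It uses simulated data, not the baseball tables, and the variable names and parameter values are illustrative assumptions.

```r
set.seed(1)

# Simulated "previous average" (x) and "current season" (y) rates.
# These numbers are made up for illustration only.
x <- rnorm(200, mean = 0.10, sd = 0.03)
y <- 0.02 + 0.8 * x + rnorm(200, sd = 0.02)

fit <- lm(y ~ x)

# Slope as reported by lm, and the same slope rebuilt from the correlation.
slope <- unname(coef(fit)["x"])
slope_from_cor <- cor(x, y) * sd(y) / sd(x)

# 95% confidence interval for the slope only.
ci <- confint(fit)["x", ]

c(slope = slope, slope_from_cor = slope_from_cor)
ci
```

When the two variables have similar spread, as per plate appearance rates from adjacent periods often do, the slope and the correlation are close, so comparing slope intervals gives a quick sense of the uncertainty in the correlations.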