In my previous post I did a preliminary analysis of how a hitter’s skill might change over the course of their career. Most of that analysis, however, centered on removing the survivorship bias introduced by all the poor-hitting youngsters who get mere stints in the majors. These players tend not to play past their late twenties, and their low averages depress rates in the early- to mid-20s age cohorts.

In this post and the next few, I’d like to explore how a hitter’s skill changes throughout their career, relative to their career average. First and foremost, I want to exclude players who haven’t had enough career plate appearances for their career statistic to stabilize. To set these bounds on a stat-by-stat basis, I’ll use Russell Carleton’s findings in his It’s a Small Sample Size After All analysis as a guide. With OBP, this bound is around 460 plate appearances. (I’ll look at other stats, such as walk rate, in my next posts.)

library(Lahman)
library(plyr)
library(dplyr)
library(ggplot2)
  
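# birth year for each player (newer Lahman releases name this table People)
master = Lahman::Master %>%
  select(playerID,birthYear)
# batting lines; battingStats() adds PA among other derived columns
bstats <- battingStats() %>%
  select(playerID, yearID,H,BB,HBP,PA)
batting=left_join(bstats,master,by="playerID") %>%
  mutate(
    age=yearID-birthYear       # approximate age: season year minus birth year
    , obp_num = H+BB+HBP       # times on base: the OBP numerator
    ) %>%
  arrange(playerID,age) %>%
  filter(!is.na(PA) & !is.na(age)) %>%
  group_by(playerID) %>%
  filter(n()>=2) %>%           # keep players with at least two rows
  mutate(
    OBP = obp_num / PA         # OBP for the row, using PA as the denominator
  )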

cOBP = batting %>%
  group_by(playerID) %>%
  summarise(
    cobp_num = sum(obp_num)    # numerator for career OBP
    ,cPA = sum(PA)             # career plate appearances
  ) %>%
  mutate(
    cobp = cobp_num / cPA   # career OBP
  ) %>%
  filter(!is.na(cobp) & cPA >= 460)
head(cOBP)  # let's just take a look at it
## Source: local data frame [6 x 4]
## 
##    playerID cobp_num   cPA   cobp
## 1 aaronha01     5205 13940 0.3734
## 2 aaronto01      302  1045 0.2890
## 3 abbated01     1094  3459 0.3163
## 4 abbeych01      682  1959 0.3481
## 5 abbotfr01      134   560 0.2393
## 6 abbotje01      198   649 0.3051

Now I’ll use these players’ career on-base averages to project how many times each player would have gotten on base in each of their seasons. Then I’ll sum this expected on-base count and group by age, to show how the on-base average at each age differs from the career on-base average of the players who played at that age. This may sound like an odd approach, but it’s the only way to create the Year vs. Career variances I need later on, when I group by age. Otherwise, I’d be left with the bad math of taking an average of the variances, which would weight a 50-PA season the same as a 600-PA season. That is, I need to retain counting statistics, rather than create an average, until later.
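To make the "bad math" concrete, here is a minimal sketch with two hypothetical players at the same age. The numbers are made up for illustration and are not taken from the Lahman data; the column names simply mirror the ones used in this post.

```r
# Two hypothetical player-seasons at the same age (made-up numbers, not Lahman data)
toy <- data.frame(
  PA      = c(50, 600),       # plate appearances that season
  obp_num = c(10, 222),       # times on base that season
  cobp    = c(0.300, 0.350)   # each player's career OBP
)

# Naive approach: average the per-season differences, ignoring playing time
toy$OBP <- toy$obp_num / toy$PA
naive <- mean(toy$OBP - toy$cobp)

# Count-based approach: sum observed and expected on-base counts, then divide
toy$eOBP_num <- toy$PA * toy$cobp
pooled <- sum(toy$obp_num) / sum(toy$PA) - sum(toy$eOBP_num) / sum(toy$PA)

naive    # negative: dragged down by the 50-PA season
pooled   # positive: dominated by the 600-PA season
```

In this toy case the two approaches do not even agree on the sign, which is why the counts are carried through the grouping step and only divided at the end.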

# join career on-base back onto the full batting data frame
battingOBP = join(batting,cOBP) %>%
  mutate(
    # the expected on-base count: plate appearances times career OBP
    eOBP_num = PA*cobp
  ) %>%
  group_by(playerID) %>%
  filter(!is.na(cPA)) %>%
  mutate(
    obp_var = OBP - cobp    # variance of observed OBP from career OBP
  ) %>%
  filter(!is.na(obp_var))

# table of on-base by age, with expected vs observed variance
cOBP_var = battingOBP %>%
  group_by(age) %>%
  summarize(
     oOBP = (sum(obp_num) / sum(PA))     # observed OBP
    ,eOBP = (sum(eOBP_num) / sum(PA))    # expected OBP
    # note: the standard error is calculated from obp_var, computed in the previous join
    ,se=sd(obp_var)/sqrt(length(OBP))    # standard error of the mean: sd / sqrt(n)
    ,obs=length(OBP)
  ) %>%
  mutate(
    vOBP = oOBP - eOBP      # variance between observed and expected OBP
  ) %>%
  filter(age >=20 & age <= 38)

What we’ve got here is, at each age, the average OBP; plus the expected OBP based on each player’s career OBP; and the variance between these at each age. Let’s plot.

p0=ggplot(cOBP_var,aes(x=age,y=vOBP,ymin=vOBP-se,ymax=vOBP+se)) +
  stat_smooth(span=2,level=.9,method="loess") +
  geom_point(aes(size=obs)) +
  scale_size_area() +
  labs(x="Age",y="OBP var",title="Age OBP Variance from Career OBP")+
  theme(legend.position="bottom")
p0

[Figure: Age OBP Variance from Career OBP (x-axis: Age; y-axis: OBP var)]

While the curve is clear, the range of this variance is small: at worst, hitters are getting on base less than .030 below their career OBP. At the high end, when a player peaks, they are getting on base around .005 above their career average. Nevertheless, the trend is obvious: hitters peak in OBP between the ages of 27 and 29, and overall a player gets on base at a higher rate than their career average between 25 and 32 years old.
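Those age ranges can also be read off the table rather than the plot. The helper below is a rough sketch, not part of the original analysis: `ages_above_career` is a hypothetical name, and it assumes a data frame with numeric `age` and `vOBP` columns shaped like `cOBP_var`.

```r
# Hypothetical helper (not part of the original analysis).
# Assumes `tbl` has numeric columns `age` and `vOBP`, like cOBP_var.
ages_above_career <- function(tbl) {
  # ages where observed OBP exceeds expected OBP (rows with NA are dropped)
  above <- tbl$age[which(tbl$vOBP > 0)]
  list(
    above_career = sort(above),
    # age with the largest observed-minus-expected gap
    peak_age     = tbl$age[which.max(tbl$vOBP)]
  )
}
```

Calling `ages_above_career(cOBP_var)` would list the ages with a positive `vOBP` and the single age where it is largest. Because it works on the raw per-age points rather than the loess curve, the peak it reports may be noisier than what the smoothed plot suggests.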

Is it the body’s biological age that puts hitters at their best between 27 and 29? Possibly. Is it their mental age that allows them to peak, then their physical age that brings the decline? Also possible. In any case, I have a strong hunch that there are factors that contribute to the steep slope between 20 and ~25, and that there may be separate factors contributing to the decline after age 30; or even factors that cause both the pre-25 incline to be as steep, and the post-30 decline to be as shallow, as they are. These intuitions are mere conjecture, to be explored in later posts.