Walk and Strike-Out Rates Across a Baseball Career

It seems that On-Base Average peaks around 27 to 29 years old and tapers off after that. But how do other batting statistics trend with age and across a player’s career? In this post, I’ll take a look at walk rate and strike-out rate, both by age and relative to a player’s career rate.

To minimize error, I’ll again use the guidelines set by Russell Carleton to exclude players who haven’t had enough career plate appearances for their career statistic to stabilize. Walk rate stabilizes (is half error) at 120 plate appearances; K-rate is half error at 60 plate appearances. In contrast to my similar analysis of OBP, in this case I will also only include seasons with enough plate appearances for the statistic to have stabilized. That is, when calculating a player’s performance at age N vs. their career average, I will only use seasons with at least 60 or 120 plate appearances, for K-rate and walk rate, respectively.

Given this, let’s take a look at what the walk- and K-rates are for players at each age between 20 and 40.

library(Lahman)
library(plyr)
library(dplyr)
library(ggplot2)

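# birth years (newer Lahman releases name this table People)
master = Lahman::Master %>%
  select(playerID,birthYear)
# season batting lines; battingStats() adds the PA column
bstats <- battingStats() %>%
  select(playerID, yearID,SO,BB,PA)
batting=left_join(bstats,master,by="playerID") %>%
  mutate(
    age=yearID-birthYear   # season year minus birth year
  ) %>%
  arrange(playerID,age) %>%
  filter(!is.na(age) & !is.na(PA) & !is.na(BB) & !is.na(SO) & PA>1) %>%
  mutate(
     SOPA = SO/PA
    ,BBPA = BB/PA
  ) %>%
  group_by(playerID) %>%
  filter(n()>=2)   # keep players with at least two rows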

SOage = subset(batting,subset=(PA>=60 & age<=40 & age>=20 & !is.na(SO))) %>%
  group_by(age) %>%
  summarise(
    rate = sum(SO)/sum(PA)
    ,se=sqrt(var(SOPA)/length(PA))
    ,obs=length(PA)
    ,metric = "SO-Rate"
  )

BBage = subset(batting,subset=(PA>=120 & age<=40 & age>=20 & !is.na(BB))) %>%
  group_by(age) %>%
  summarise(
    rate = sum(BB)/sum(PA)
    ,se=sqrt(var(BBPA)/length(PA))
    ,obs=length(PA)
    ,metric = "BB-Rate"
  )

SOBBrate <- rbind(SOage,BBage)

p0=ggplot(SOBBrate,aes(x=age,y=rate,ymin=rate-se,ymax=rate+se,group=metric,color=metric)) +
  stat_smooth(span=10,level=.9,method="loess") +
  geom_point(aes(size=obs)) +
  scale_size_area() +
  labs(x="Age",y="Rate")+
  theme(legend.position="bottom")
p0

Figure: walk and strike-out rates per plate appearance by age, 20 to 40 (x-axis: Age, y-axis: Rate)

What do these lines mean? They are the rates of walks and strike-outs per plate appearance for the whole-number age cohorts 20 through 40, with age calculated as the season year minus the player’s birth year. This does mean that some players may have plate appearances, or whole seasons, before their calendar birthday and get counted in the age cohort ahead of their true age, but I don’t think this is a fatal flaw in the approach here. A player’s season is only counted if they had enough plate appearances in that season to have a reliable rate, and a single player can represent an observation at each age.
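If the birthday issue ever did matter, one possible refinement (not used for the graphs in this post) would be to take a player’s age as of June 30 of the season, which is a common baseball convention. The sketch below is only an illustration: `season_age` is a hypothetical helper, and it assumes a `birthMonth` column like the one in Lahman’s player table.

```r
# Hypothetical helper (not part of the analysis above):
# age as of June 30 of the season, assuming birthMonth is 1-12.
season_age <- function(yearID, birthYear, birthMonth) {
  # Players born in July or later have not yet had their birthday by June 30
  yearID - birthYear - as.integer(birthMonth >= 7)
}

season_age(2014, 1990, 3)   # born in March
season_age(2014, 1990, 9)   # born in September
```

A player born in March stays in the same cohort as before, while a player born in September moves down one year.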

There is inherent survival bias in these lines. The apparent slopes of each rate come in part because a player who strikes out too much or walks too infrequently might not be asked to come back for another season. It is nevertheless a good way of answering one specific kind of question. That is, if I pick a 24 year old at random from the history of baseball, how would I expect him to perform? What if I need to pick any player and I want to minimize strike-out rate? Pick a player in their late 20s. What if I need a player who draws walks? Pick a veteran.
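To make the survival bias concrete, here is a toy example with invented numbers (not Lahman data). Two hitters each have a strike-out rate that never changes, but the high-strike-out hitter is out of the league after age 25. The cohort rate is computed the same way as in the graph, as total strike-outs over total plate appearances.

```r
# Toy data, invented for illustration: A always strikes out 10% of the time,
# B always strikes out 30% of the time and does not play at age 26.
toy <- data.frame(
  player = c("A", "A", "A", "B", "B"),
  age    = c(24, 25, 26, 24, 25),
  PA     = c(500, 500, 500, 500, 500),
  SO     = c(50, 50, 50, 150, 150)
)

# Cohort rate by age: sum(SO) / sum(PA)
cohort <- aggregate(cbind(SO, PA) ~ age, data = toy, FUN = sum)
cohort$rate <- cohort$SO / cohort$PA
cohort
```

The cohort rate is 20% at ages 24 and 25 and drops to 10% at age 26, even though neither hitter improved at all.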

What the survival bias is hiding is how players get better or worse at walking or avoiding strike-outs as they age. The next graph will present exactly this. That is, at each age, how did the player perform versus their career rate?

# make career rates
cSOBB = batting %>%
  group_by(playerID) %>%
  summarise(
     cPA = sum(PA)
    ,SORate = sum(SO)/sum(PA)
    ,BBRate = sum(BB)/sum(PA)
  ) %>%
  filter(!is.na(cPA) & cPA >= 60)

batSO = join(batting,cSOBB) %>%
  mutate(
    eSO = PA*SORate
  ) %>%
  group_by(playerID) %>%
  filter(PA>=60) %>%   # use the same PA requirements as above
  mutate(
    SOvar = SO - eSO
  ) %>%
  filter(!is.na(SOvar)) %>%
  mutate(
    SOPAvar = SOvar/PA
  )
# make season-to-career variances
SOvar = batSO %>%
  group_by(age) %>%
  summarise(
    rate_variance = sum(SOvar)/sum(PA)
    ,se=sqrt(var(SOPAvar)/length(PA))
    ,obs=length(PA)
    ,metric = "SO-Rate"
  ) %>%
  filter(age>=20 & age<=40)

batBB = join(batting,cSOBB) %>%
  mutate(
    eBB = PA*BBRate
  ) %>%
  group_by(playerID) %>%
  filter(PA>=120) %>%  # use the same PA requirements as above
  mutate(
    BBvar = BB - eBB
  ) %>%
  filter(!is.na(BBvar)) %>%
  mutate(
    BBPAvar = BBvar/PA
  )

BBvar = batBB %>%
  group_by(age) %>%
  summarise(
    rate_variance = sum(BBvar)/sum(PA)
    ,se=sqrt(var(BBPAvar)/length(PA))
    ,obs=length(PA)
    ,metric = "BB-Rate"
  ) %>%
  filter(age>=20 & age<=40)
# bind the two rate-variances
SOBBvar <- rbind(SOvar,BBvar)

p1=ggplot(SOBBvar,aes(x=age,y=rate_variance,ymin=rate_variance-se,ymax=rate_variance+se,group=metric,color=metric)) +
  stat_smooth(span=10,level=.9,method="loess") +
  geom_point(aes(size=obs)) +
  scale_size_area() +
  labs(x="Age",y="Age Rate vs. Career Rate")+
  theme(legend.position="bottom")
p1

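Figure: walk and strike-out rates at each age relative to career rate, ages 20 to 40 (x-axis: Age, y-axis: Age Rate vs. Career Rate)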

These are nearly mirror-image trends from the age-cohort graph, but not quite. There is a bias here as well, but it attenuates the trends rather than amplifying them. Suppose a player played only two seasons and was terrible in each. Two things happen. One, as in the first graph, they will not be represented at the older ages because they weren’t good enough to stay at the major league level. Two, they pull the apparent season-vs-career difference toward zero, because they were equally bad in both years and so barely deviated from their own career rate. This hides some of the effect of aging on how these rates change across a career. Nevertheless, this is not fatal to interpretation, and the trends are still clear.
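A toy example, again with invented numbers, shows the attenuation. The player labeled "short" has two identical seasons, so every one of their seasons sits exactly at their career rate. The player labeled "long" strikes out more at 22 than over their career. Pooling the two at age 22, the same way the code above does (summed strike-outs above expectation over summed plate appearances), shrinks the deviation.

```r
# Toy data, invented for illustration
toy <- data.frame(
  player = c("short", "short", "long", "long", "long"),
  age    = c(22, 23, 22, 27, 33),
  PA     = c(400, 400, 400, 400, 400),
  SO     = c(120, 120, 100, 60, 80)
)

# Career strike-out rate per player, repeated on each of their season rows
toy$career <- ave(toy$SO, toy$player, FUN = sum) /
  ave(toy$PA, toy$player, FUN = sum)

# Strike-outs above what the career rate would predict for that season
toy$excess <- toy$SO - toy$PA * toy$career

# Pooled deviation by age: sum(excess) / sum(PA)
by_age <- aggregate(cbind(excess, PA) ~ age, data = toy, FUN = sum)
by_age$dev <- by_age$excess / by_age$PA
by_age
```

On their own, the "long" player is 5 points of strike-out rate above their career rate at age 22. With the "short" player pooled in, the age-22 deviation is only 2.5 points.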

Hitters seem to get rapidly better at avoiding strike-outs, and their strike-out rate reaches a valley in the late twenties. After this, they slowly get worse and worse at avoiding strike-outs. On the other hand, hitters get consistently better at drawing walks, and this plateaus just after 30. The plateau doesn’t mean they got worse; it just means they stopped getting better at walking.

As with previous posts, I again think that there may be two things going on, each on a different side of a player’s prime of 27 to 29 years of age. On the young side, the trends look like increasing maturity, added year over year, plate appearance by plate appearance. They get more patient, swing at fewer bad pitches, and simultaneously start drawing more walks and striking out less. But over the hill, I don’t think they become less mature, at least not mentally. Rather, maybe they are still learning, perhaps a little slower than they used to, but they are also just getting physically old. That is, what benefit they get from their experience in the majors is tempered after thirty, when their bodies get old, their reaction times worsen, and their speed to first base drops off. Who knows. At this point, it’s still all conjecture. I’ll explore this more in later posts.

Nikolaus P. Schuetz

Wednesday, November 11, 2014