This analysis comes in response to an article published in Decision Science News, Getting Old in Baseball, where the author’s approach misconstrues the true trajectory of hitting skill across a baseball career.
Here’s a graph from the Getting Old in Baseball post, where they “take the set of players who played in the majors for at least two years and look at the mean batting average at every age.” (To reproduce it, see the code from the original post.)
It concludes, “pro baseball players have their highest averages just over age 30.” Looking at the graph, it would appear this is the case - that batting average tends to increase until about age 30, where it plateaus until about age 35.
But is that what the graph really means? Can it be possible that baseball players have their highest averages just over thirty? What is this graph really capturing?
Intuitively, it should be clear that much of what’s captured in the graph is pure survival bias. Some baseball players play longer than others for the singular reason that some baseball players are better than others. There’s a difference between having a stint in the majors and having a career in the majors; worse players have these stints in their twenties, which drags down the average for that age range. While the area of each dot is intended to represent the number of observations, a basic histogram will more clearly indicate just how many more hitters there are in their twenties than in their thirties.
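# packages used throughout; `master` (playerID plus birthYear) is assumed to come from the original post's code
library(Lahman)
library(dplyr)
library(ggplot2)
# here I grab more hitting stats for later use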
bstats <- battingStats() %>%
select(playerID, yearID,AB,H,BB,HBP,SF,SO,BA,PA)
batting=left_join(bstats,master) %>%
mutate(age=yearID-birthYear) %>%
arrange(playerID,age) %>%
filter(!is.na(BA) & !is.na(age)) %>%
group_by(playerID) %>%
filter(length(playerID)>=2)
# overwrite NA values with 0 so the OBP calculation does not return NA
batting$SF[is.na(batting$SF)] <- 0
batting$HBP[is.na(batting$HBP)] <- 0
batting$OBP <- (batting$H+batting$BB+batting$HBP)/(batting$AB+batting$BB+batting$HBP+batting$SF)
hist(batting$age, breaks = length(levels(as.factor(batting$age))), xlim = c(18,45)
, main = "Hitter age mix", xlab = "Age", ylab = "Num")
Why are there fewer 27-year-olds than 25-year-olds? Are 27-year-olds worse than 25-year-olds? On the contrary. They are better, and if you aren’t performing by 27, you are (or should be) on your way out. The underperforming players in their twenties who fail to deliver on the promise of their youth are not represented in the averages of thirty-year-olds, because they don’t make it to 30.
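To make the survival argument concrete, here is a minimal simulation sketch in base R. Everything in it is hypothetical: the skill distribution, the debut age of 22, and the retention rule are assumptions chosen for illustration, not estimates from the Lahman data. Each simulated player has a fixed true skill, so nobody improves with age, yet the mean by age still climbs because weaker hitters drop out first.

```r
set.seed(42)

n_players <- 5000

# Assumption: true skill is fixed for life, so no individual gets better with age.
skill <- rnorm(n_players, mean = 0.250, sd = 0.020)

# Hypothetical retention rule: everyone debuts at 22 and better hitters last longer.
debut_age  <- 22
career_len <- pmax(1, round(6 + (skill - 0.250) * 200 + rnorm(n_players, sd = 2)))

# One row per player-season.
seasons <- data.frame(
  player = rep(seq_len(n_players), times = career_len),
  age    = debut_age + sequence(career_len) - 1
)

# Observed season average = fixed skill plus sampling noise.
seasons$ba <- rnorm(nrow(seasons), mean = skill[seasons$player], sd = 0.015)

# Mean average and head count at each age, as in the original graph.
mean_ba_by_age <- tapply(seasons$ba, seasons$age, mean)
n_by_age       <- table(seasons$age)

round(mean_ba_by_age, 3)
n_by_age
```

Under these assumptions, the rising curve comes entirely from who is left in the sample at each age, which is the pattern the histogram hints at.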
Additionally, one would think that both being a worse-hitting player and a younger player would lead to seasons with fewer plate appearances. (Younger, if only because rookies are often “eased in” to play, having preliminary seasons with fewer plate appearances before becoming an every-day player.) The data bear this relationship out:
plot_data_paXage = batting %>%
group_by(age) %>%
summarise(
mu=mean(PA),
se=sqrt(var(PA)/length(PA)),
obs=length(PA)
) %>%
filter(age>=20 & age <=42)
p_paXage=ggplot(plot_data_paXage,aes(x=age,y=mu,ymin=mu-se,ymax=mu+se)) +
stat_smooth(span=10,level=.9,method="loess") +
geom_point(aes(size=obs)) +
scale_size_area() +
labs(x="Age",y="Mean Plate Appearances")+
theme(legend.position="bottom")
p_paXage
Why is the mean number of plate appearances for twenty-year-olds so low to begin with, and so slow to increase? For the simple reason that this is where players earn their careers. This is what makes this graph one of my favorites of all that I’ve made with this data so far, because it helps answer what may otherwise be a slippery question: What does it mean to play beyond a stint and have a career in the majors?
If you’re playing at the age of thirty, I think you’ve made it. You can talk about having a career in the majors without raising any eyebrows, at least not mine. (Consider the counterfactual: if a player had been unable to secure a contract after the age of 27, for example, did they really “make it”?)
And worse-hitting players have fewer plate appearances, right? (Forgive me as I use on-base percentage rather than batting average.)
plot_data_paXobp = batting %>%
group_by(PA) %>%
summarise(
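mu=mean(PA), # every row in a group shares one PA value, so mu is simply that PA
se=sqrt(var(PA)/length(PA)), # hence 0 (NA for one-row groups); it is not drawn in the plot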
obp = (sum(H+BB+HBP)/sum(AB+BB+HBP+SF)),
obs=length(PA)
) %>%
filter(PA > 5)
p_paXobp=ggplot(plot_data_paXobp,aes(x=obp,y=mu,ymin=mu-se,ymax=mu+se)) +
geom_point(aes(size=obs)) +
scale_size_area() +
ylim(-50,800) +
labs(x="On Base %",y="Plate Appearances")+
theme(legend.position="bottom")
p_paXobp
Note that this graph looks as clean as it does because I grouped the data on plate appearances, an integer, rather than trying to group on the floating-point value of OBP. The graph gives a feel for the question: if you are this good an OBP hitter, how many plate appearances would we expect you to earn?
The answer appears to be that unless you can put together a season with about a .300 OBP, you shouldn’t be playing much. After that, a hitter earns playing time fairly steadily as OBP increases. There is some interesting stuff going on after your OBP reaches .325, but I’ll save that for another day. For now, this should further illustrate how much the mean OBP, grouped by age, will depress the apparent hitting performance of younger players and players with fewer at-bats (often these are one and the same).
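The point about grouping on plate appearances rather than on OBP can be sketched with simulated seasons. The data below are made up, and for simplicity the sketch treats PA as the OBP denominator, which is an assumption (the real denominator is AB + BB + HBP + SF). It only shows why an integer key gives a manageable number of well-populated groups while raw OBP values are nearly all distinct.

```r
set.seed(1)

n_seasons <- 2000

# Hypothetical seasons: integer plate appearances and times on base.
pa      <- sample(50:600, n_seasons, replace = TRUE)
on_base <- rbinom(n_seasons, size = pa, prob = 0.32)

# Simplified per-season OBP (assumes PA is the denominator).
obp <- on_base / pa

# Grouping keys: nearly every OBP value is unique, while PA repeats often.
n_obp_groups <- length(unique(obp))
n_pa_groups  <- length(unique(pa))

# Pooled OBP within each PA group: total times on base over total chances.
pooled_obp <- tapply(on_base, pa, sum) / tapply(pa, pa, sum)

c(obp_groups = n_obp_groups, pa_groups = n_pa_groups)
head(pooled_obp)
```

Pooling within each PA value trades a little resolution on the OBP axis for points that each summarize several seasons.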
Let’s take a look at what the original graph would look like if we exclude both players who don’t deserve solid playing time (they shouldn’t “make it”) and players who will never have a “career” in baseball. That is, exclude hitters with career OBPs under .300 and players who never played into their thirties.
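How a career OBP is computed matters for a cutoff like .300. The sketch below uses two made-up seasons for one hypothetical player (not real data) to compare an unweighted mean of season OBPs with a pooled career OBP, where times on base and chances are summed before dividing.

```r
# Two made-up seasons for one hypothetical player: a short call-up and a full year.
seasons <- data.frame(
  H   = c(5, 160),
  BB  = c(1, 60),
  HBP = c(0, 5),
  AB  = c(28, 540),
  SF  = c(0, 5)
)

on_base <- with(seasons, H + BB + HBP)        # OBP numerator per season
chances <- with(seasons, AB + BB + HBP + SF)  # OBP denominator per season

season_obp      <- on_base / chances
mean_of_seasons <- mean(season_obp)              # treats both seasons equally
career_obp      <- sum(on_base) / sum(chances)   # weights seasons by playing time

round(c(mean_of_seasons = mean_of_seasons, career_obp = career_obp), 3)
```

In this made-up case the unweighted mean falls under .300 while the pooled figure lands above it, so the pooled version keeps a short, poor call-up from disqualifying an otherwise solid hitter.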
# make career stats
batters = batting %>%
group_by(playerID) %>%
summarise(cPA = sum(PA)
, obp_num = sum(H,BB,HBP)
, obp_den = sum(AB,BB,HBP,SF)
, years = max(age) - min(age) + 1
, max_age=max(age)) %>%
mutate(cOBP = obp_num / obp_den)
# now filter the batting data down to career OBP >= .300 & played at age 30 or later
batting_careers=left_join(batting,batters) %>%
arrange(playerID,age) %>%
group_by(playerID) %>%
mutate(yearsLeft=min_rank(desc(age))-1) %>%
filter(max_age>=30 & cOBP>=.3) %>%
mutate(LastYear=(yearsLeft==0))
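To see what the `yearsLeft` and `LastYear` columns hold, here is a small base-R sketch on a made-up table. The `toy` data frame and its two players are hypothetical, and `rank(-a, ties.method = "min") - 1` is used as a base-R stand-in for `min_rank(desc(age)) - 1`.

```r
# Hypothetical player-seasons; player "b" has no season at age 31.
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  age      = c(29, 30, 31, 30, 32)
)

# Rank ages from oldest to youngest within each player, then subtract 1,
# so the final season gets 0, the one before it 1, and so on.
toy$yearsLeft <- ave(toy$age, toy$playerID,
                     FUN = function(a) rank(-a, ties.method = "min") - 1)

# A season is the player's last when no later seasons remain.
toy$LastYear <- toy$yearsLeft == 0

# Oldest age reached, the quantity the age-30 filter is applied to.
toy$max_age <- ave(toy$age, toy$playerID, FUN = max)

toy
```

Note that `yearsLeft` counts remaining rows rather than calendar years: player "b" shows one year left at age 30 even though the final season comes two years later.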
From here, I simply follow the same procedure as above to create the graphs, except I stick with OBP.
Also note that because all players played until at least thirty years old, the “Last” trendline will begin at age 30.
###Split by people in their last year of play or not
plot_dataC = batting_careers %>%
group_by(age,LastYear=!(yearsLeft>0)) %>%
summarise(
mu=mean(OBP),
se=sqrt(var(OBP)/length(OBP)),
obs=length(OBP)
) %>%
filter(age>=20 & age <=42)
plot_dataC$se=with(plot_dataC,ifelse(LastYear,0,se))
plot_dataC$CareerYear=with(plot_dataC,ifelse(LastYear,"Last","Not Last"))
###Compute regardless of last year or not
plot_dataD = batting_careers %>%
group_by(age) %>%
summarise(
LastYear=NA,
mu=mean(OBP),
se=sqrt(var(OBP)/length(OBP)),
obs=length(OBP)
) %>%
filter(age>=20 & age <=42)
plot_dataD$CareerYear="Combined"
plot_data1=rbind(plot_dataC,plot_dataD)
plot_data1$CareerYear=factor(plot_data1$CareerYear,levels=c("Not Last","Combined","Last"))
p1=ggplot(plot_data1,aes(x=age,y=mu,ymin=mu-se,ymax=mu+se,group=CareerYear,color=CareerYear)) +
stat_smooth(span=10,level=.9,method="loess") +
geom_point(aes(size=obs)) +
scale_size_area() +
labs(x="Age",y="Mean On-Base Percentage")+
theme(legend.position="bottom")
p1
This graph tells a different story. This set of “career” players gets continually better with each season and peaks in their mid to late twenties. After this, if they are going to stick around, their OBP seems relatively stable.
If this is their last year, however, it’s because they really tanked. In fact, they are typically not getting on base at the .300 mark that earned them a spot in this subset in the first place. Yet on the whole, you can’t expect these players to continue their peak performance. Well-performing players in their thirties become scarcer with each passing year.