The Project

I’ve always considered myself a datanerd, and a baseball fan, but I’ve done very little to put those two things together. Several years ago I picked up the book “Analyzing Baseball Data with R” and started working through it, but between work and other life events I didn’t get very far. But this is a year of learning new things, and bringing skills up so I’m back at it!

In pursuit of this, I want to dive in and do some projects to try and get into the world of baseball analysis. To that end, I asked my friend Murph what his list of top 10 starting pitchers was so that I could try and do some work figuring out a way to determine who should be at the top of that list. This isn’t necessarily an attempt to find the absolute best pitcher ever, but to find the order of the list provided. Here’s the list of pitchers in question, as originally given to me in text message form:

The Data

Most of the data here comes from the wonderful Lahman database. I’ve used the pitchers table to get career stats for the top 10 pitchers as per Murph. I grabbed a helpful function from the Analyzing Baseball Data with R book for getting the player birthyear so that we can eventually look at some stats for each player’s age trajectory. I also compiled some career statistics into a .csv directly from Baseball Reference. That will be used for a lot of the ranking metrics to come. Here’s how we ended up with the current main dataset.

data("Pitching")
data(PitchingPost)
data("People")
career <- read_csv(here("raw", "murph_pitchers_career.csv"))
## Rows: 10 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): player, active
## dbl (15): fip, era, seasons, ip, so, bb, so9, bb9, whip, era_plus, war, all_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
get_birthyear <- function(Name) {
  Names <- unlist(strsplit(Name, " "))
  People %>%
    filter(nameFirst == Names[1],
           nameLast == Names[2]) %>%
    mutate(birthYear = ifelse(birthMonth >= 7,
                              birthYear + 1, birthYear),
           Player = paste(nameFirst, nameLast)) %>%
    select(playerID, Player, birthYear)
}

PlayerInfo <- bind_rows(get_birthyear("Pedro Martinez"),
                        get_birthyear("Greg Maddux"),
                        get_birthyear("Christy Matthewson"),
                        get_birthyear("Randy Johnson"),
                        get_birthyear("Walter Johnson"),
                        get_birthyear("Sandy Koufax"),
                        get_birthyear("Clayton Kershaw"),
                        get_birthyear("Bob Gibson"),
                        get_birthyear("Tom Seaver"),
                        get_birthyear("Max Scherzer")) %>%
  filter(!playerID %in% c("johnsra03", "johnsra04", "martipez03", "gibsobo02"))

Pitching %>%
  inner_join(PlayerInfo, by = 'playerID') %>%
  mutate(Age = yearID - birthYear) %>%
  select(Player, Age, W, SO, BB, ERA, CG, SHO, IPouts) %>%
  mutate(IP = round((IPouts/3), digits = 1)) %>%
  group_by(Player) -> murph_pitchers

Ranking System

Evaluating pitchers is a difficult task. One that I’m far from an expert in. But I’ve devised a system for this project that I hope will make some sense to you reading. In this project, I will assign points to players in 5 categories: 1)Longevity, 2)Strikeout Potential, 3)Control, 4)Performance vs Peers, and 5)Damage Limiting. In assigning these points, we’ll look at the statistical categories of innings pitched, k/9, bb/9, career ERA+, and career FIP. Within each category, players will be ranked 1-10 with the top player receiving 10 points, the second player receiving 9 points, etc. Doing this for all categories will provide the player who excelled the most over all 5 categories and will provide us an objectively best pitcher.

Analysis

1. Longevity

The first category we’ll look at is pitcher longevity. Lots of pitchers have come up and been great for a season or two, but part of being THE BEST is being able to do it over time. For this, the metric I’ve chosen is innings pitched. Let’s see who comes out on top.

First points on the board for Walter Johnson. He pitched substantially more than any other player on this list. Definitely someone you’d want to have on your roster. You know you’re gonna get reliable length out of a pitcher like this.

cumip <- murph_pitchers %>%
  mutate(CIP = cumsum(IP)) %>%
  filter(Player %in% c("Walter Johnson", "Greg Maddux", "Max Scherzer", "Clayton Kershaw"))

ggplot(cumip, aes(x = Age, y = CIP, linetype = Player)) +
  geom_line(aes(color = Player))

If we look at the career trajectories of the top two players in innings pitched (Johnson and Maddux) versus the two active players on the list (Kershaw and Scherzer) we can see a couple things. First, Scherzer is unlikely to reach the top of this list since he started later than the others. To accumulate something on the order of 5000 innings, you’ve got to play a looooong time. Secondly, we can see that Kershaw was on Maddux pace for the first part of his career. Sadly, though, injuries have knocked him off track so he too is unlikely to reach the top tier of this list.

2. Strikeout Potential

The next thing we want to examine is how good a pitcher is at getting strikeouts. This is a great skill for a pitcher to have since he can get outs without the ball ever entering play. Here we’ll look at strikeouts per nine innings as a means of not giving extra advantage to pitchers who played longer and simply accumulated more stats.

Fangraphs gives a helpful rule of thumb to interpret K9. Anything at 10 or above is excellent. They give 7.7 as average, and below 6 as poor. Max Scherzer and Randy Johnson are neck and neck here, but Mad Max edges him out and gets the points. I find it interesting that Walter Johnson, who pitched the most innings by far, is near the bottom of this list. It really shows the difference in the focus of different eras of baseball.

3. Control

The flipside of strikeout potential, to me, is control. Lots of guys have top notch stuff or high end heat, but they can’t put it where they want it. Those players don’t seen to stick around long enough to enter the conversation for THE BEST. Here, I want to look at BB/9 much like we looked at K/9 above. This will give a more digestible rate to compare.

Strikeouts are great, but if you walk too many batters will end up doing damage at some point. Christy Matthewson takes the cake for the least walks per nine innings. Fangraphs rule of thumb is that 1.5 BB9 is excellent, and 1.9 is great. Average is 2.9 and above 3.2 is below average. For fun, let’s look at the ratio of K/9 to BB/9 for all these guys and see what that tells us.

4. Performance vs Peers

In the next category we’ll look at how good a pitcher was compared to his contemporaries. You can hardly be THE BEST if you aren’t significantly above average in the seasons you play. To examine this, we’ll look at ERA+ which is a normalized statistic. An ERA+ of 100 is league average, and any point above 100 is one percentage point better than average. All of these pitchers were well above league average for most of the seasons they played, but let’s see who was the top of the heap.

Clayton Kershaw is at the top of this list which really shows how dominant peak Kershaw was. As he is still active and is unlikely to have anymore seasons near his peak performance, it is possible that he slips down this list by the time he retires. We’ll have to check back in and see!

5. Limiting Damage

The final stat we’ll look at in our assessment of THE BEST pitcher (as per Murph) is Fielding Independent Pitching (FIP). The more standard ERA might be more straightforward, but FIP theoretically takes out a little of the variation that is not dependent on the pitcher himself. I will note, that with this group the difference in career FIP vs ERA is very small.

Just for fun, let’s see if ERA changes the order any

Christy Matthewson and Walter Johnson top the list for both FIP and ERA, with Kershaw and Koufax swapping in spots 3 and 4.

Final Assessment

So where do we land at the end of this experiment? Well, here is the list in order of total points:

  1. Walter Johnson - 37
  2. Christy Matthewson - 36
  3. Clayton Kershaw - 33
  4. Pedro Martinez - 32
  5. Greg Maddux - 26
  6. Max Scherzer - 26
  7. Randy Johnson - 24
  8. Tom Seaver - 21
  9. Bob Gibson - 20
  10. Sandy Koufax - 20

There you have it! The best pitcher of the best pitcher (in the list Murph provided) is Walter Johnson by and edge! There seem to be two distinct tiers here with one group above 30 points and one group decently below. Clayton Kershaw being in the this close to the top as an active player is very surprising to me. I assumed he would be further down, with the retired players taking the top spots.

Overall this was a very fun project to complete. I’m looking forward to the season getting going and trying to do more projects in this vein. I’d love any feedback for improving the analysis and reporting for future ideas!