Introduction

All of the data used in this project is from the 2019 MLB season. I wanted to use the most recent and relevant data available but elected to use 2019 data over 2020 data as the 2020 season was shortened due to COVID-19. Most of the data used in this project was collected using the scrape_statcast_savant_pitcher_all() function from the baseballr package we learned about during Bootcamp. For parts of this project, data was also collected from Fangraphs. I will quickly outline the manipulation I had to do to these data sets to make them usable for my project.


Data Collected from baseballr

This data needed little manipulation, but collecting it for the entire 2019 MLB season was a bit of a hassle. I scraped it in small batches, usually about a week of data at a time, and then used a series of rbind() calls to combine the batches into one large data set.
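The weekly collection could be sketched roughly as follows. This is an illustration, not the code used for the project: fetch_week() is a hypothetical stand-in that returns fake rows so the example runs offline, where a real version would call scrape_statcast_savant_pitcher_all() with a start and end date. The season dates are assumed as well. Collecting the batches in a list and binding once with do.call(rbind, ...) avoids writing out a long chain of rbind() calls by hand.

```r
# Hypothetical stand-in for the scraper; a real version would call
# baseballr's scrape_statcast_savant_pitcher_all() for the same window.
# It returns two fake rows so this sketch runs without a network connection.
fetch_week <- function(start_date, end_date) {
  data.frame(
    game_date = c(start_date, end_date),
    events = c("single", NA),
    stringsAsFactors = FALSE
  )
}

# Assumed 2019 regular-season range, split into one-week windows.
season_start <- as.Date("2019-03-28")
season_end <- as.Date("2019-09-29")
week_starts <- seq(season_start, season_end, by = "7 days")
week_ends <- week_starts + 6
week_ends[week_ends > season_end] <- season_end  # clip the final window

# Scrape each window, then stack every batch with a single rbind().
weekly <- lapply(seq_along(week_starts), function(i) {
  fetch_week(week_starts[i], week_ends[i])
})
season <- do.call(rbind, weekly)
```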

After that collection, the only manipulation required for that data was very similar to what we did during Bootcamp. For the statistics I was creating and calculating, I mostly just needed to filter out observations where the events variable took NA values. I will further outline the things I calculated in the Methods / Findings section of my project.
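As a small illustration of that filtering step, here is a base R sketch on made-up rows. In Statcast pitch-level data the events column is only filled in on the pitch that ends a plate appearance, so dropping the NA rows leaves one row per outcome. The column values below are toy data.

```r
# Toy pitch-level rows: pitches that do not end a plate appearance
# have no event recorded, so events is NA for them.
pitches <- data.frame(
  pitch_type = c("FF", "SL", "FF", "CH"),
  events = c(NA, "strikeout", NA, "single"),
  stringsAsFactors = FALSE
)

# Keep only the rows where an event was recorded.
outcomes <- pitches[!is.na(pitches$events), ]
```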


Data Collected from Fangraphs

I collected data from Fangraphs on three separate occasions; looking back, I could have pulled everything in a single step. The three sets gave me each pitcher’s innings pitched, WHIP, and earned runs allowed, and I used all three in creating a new statistic for part of my project. The only “manipulation” this data needed was joining it with the data I scraped using baseballr. For this, I used the inner_join() function from the dplyr package to ensure all statistics were assigned to the correct player.
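The join can be sketched as below. The project used dplyr’s inner_join(); this example uses base R’s merge(), which performs the same inner join by default, only to keep the sketch free of package dependencies. The key column player_name and all values are assumptions made for illustration.

```r
# Made-up per-pitcher totals from the scraped data.
statcast_totals <- data.frame(
  player_name = c("Pitcher A", "Pitcher B", "Pitcher C"),
  bases = c(250, 230, 90),
  stringsAsFactors = FALSE
)

# Made-up Fangraphs-style table with innings pitched, WHIP, and earned runs.
fangraphs <- data.frame(
  player_name = c("Pitcher A", "Pitcher B", "Pitcher D"),
  IP = c(200.0, 185.1, 60.2),
  WHIP = c(1.05, 1.12, 1.40),
  ER = c(70, 72, 35),
  stringsAsFactors = FALSE
)

# Inner join: only pitchers present in both tables are kept.
joined <- merge(statcast_totals, fangraphs, by = "player_name")
```

An inner join silently drops any pitcher whose name does not match exactly in both sources, so comparing row counts before and after the join is a cheap sanity check.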


Motivation

In doing this project, there were three things I wanted to do. Firstly, I wanted to demonstrate my knowledge of baseball and my understanding of its intricacies. Secondly, I wanted to demonstrate my proficiency in using R and RStudio. Lastly and most importantly, I wanted to showcase my ability to combine those two things in a way that would prove beneficial to the UNC Baseball Analytics team and the UNC Baseball team itself.

In considering what things I wanted to explore with this project, I decided I wanted to focus on my strengths while still taking a holistic approach. In the many years I played baseball, my greatest strength and asset to the teams I played on was my pitching ability. Because of this, most of this project focuses on pitching statistics and ways I believe they can be most effectively understood and utilized. Looking also at the offensive side of the ball, I sought to analyze situations in which hitters perform worse or better. Without going into detail, the things I decided to explore are a new pitching statistic (BPIP), pitcher/hitter handedness matchups, and situational hitting.

Along with the things I was able to analyze, there are a couple of other things that I did not have the data (to my knowledge) to examine but would consider exploring in the future. I strongly believe in the power of great defense, so both are defense-related.

The first is something my high school team used that we called “Heartbreaker” plays. This was a scoring system we kept track of to find out who took away the most hits or big plays from the other team. This consisted of diving plays, pickoff plays, throwing runners out on the bases, or anything we considered to be a big play for our team. This scoring system was largely subjective and would be hard to keep track of statistically, but it always incentivized us to go the extra mile to do something to help the team. We would often extend the scoring opportunities into practice as well to ensure everyone was giving it their all. The incentive for us was getting your dinner paid for if you were the team leader for that week, but the exact incentive itself doesn’t matter much, just the fact that it existed.

The second thing I would like to examine, and one I am sure would be possible, is how pitchers do in the at-bats immediately after a runner reaches base for whatever reason. One thing I always prided myself on as a pitcher was my ability to shrug off mistakes or hits, and I believe this kind of statistic would be a great way of understanding how pitchers react when things don’t go their way (including errors and baserunners allowed through no fault of the pitcher).

Above all, I wanted this project to be a product of the hard work I have put in over the years understanding baseball, statistics, programming, and the intersection of the three. I have been a big UNC Baseball fan for many years, and I am very thankful to have an opportunity to potentially help a team I’ve admired and cheered for in the past.


Methods / Findings

BPIP

The first thing I would like to take a look at is my creation of what I call the BPIP statistic. Modeled after a pitcher’s WHIP, BPIP stands for bases per inning pitched and is designed to reflect how many bases a pitcher is giving up in an average inning rather than just walks and hits. A WHIP around one is often considered to be very good, but this could be attained by a pitcher who would hypothetically give up a solo home run every single inning. Similarly, a pitcher who gave up only a walk and a single every inning (allowing no runs) would have a WHIP of two, which would not be considered very impressive. These are of course only hypothetical situations, but BPIP would be a much better indicator of that pitcher’s success in both situations.
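To make the two hypothetical pitchers concrete, here is a short sketch of both formulas over an assumed nine-inning outing. The helper names whip() and bpip() are made up for this illustration.

```r
# WHIP: walks plus hits per inning pitched.
whip <- function(walks, hits, innings) (walks + hits) / innings

# BPIP: total bases allowed per inning pitched.
bpip <- function(bases, innings) bases / innings

innings <- 9

# Pitcher X allows one solo home run (four bases) every inning.
x_whip <- whip(walks = 0, hits = 9, innings = innings)
x_bpip <- bpip(bases = 9 * 4, innings = innings)

# Pitcher Y allows one walk and one single (one base each) every inning.
y_whip <- whip(walks = 9, hits = 9, innings = innings)
y_bpip <- bpip(bases = 9 * 1 + 9 * 1, innings = innings)
```

Pitcher X ends up with a WHIP of 1 but a BPIP of 4, while Pitcher Y has a WHIP of 2 and a BPIP of 2, so the two statistics rank these pitchers in opposite orders.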

At this point, you have probably figured out the formulation of the BPIP statistic for yourself. Using data from the 2019 MLB season (scraped using baseballr), I created a new variable bases that reflected how many bases a pitcher gave up in each at-bat. This variable only considered walks, hit-by-pitches, singles, doubles, triples, and home runs. I elected to not consider anything else as I wanted the statistic to be a reflection of bases allowed by the pitcher himself (although, it is worth noting that the difference between a single and double, for example, can often be defense-dependent). I then calculated the total number of bases allowed on the season for each pitcher and combined the data with data collected from Fangraphs containing information on each pitcher’s innings pitched, along with their WHIPs for comparison. Once I had that together, I was able to create the BPIP statistic. First, let’s take a look at the starting pitchers with the ten best WHIPs in 2019.

Top 10 starting pitchers in 2019 based on WHIP
Player Name        Total Bases  IP     BPIP  WHIP
Justin Verlander   266          223.0  1.19  0.80
Gerrit Cole        258          212.1  1.22  0.89
Jack Flaherty      224          196.1  1.14  0.97
Jacob deGrom       216          204.0  1.06  0.97
Zack Greinke       262          208.2  1.26  0.98
Max Scherzer       236          172.1  1.37  1.03
Clayton Kershaw    219          178.1  1.23  1.04
Stephen Strasburg  262          209.0  1.25  1.04
Walker Buehler     239          182.1  1.31  1.04
Shane Bieber       301          214.1  1.41  1.05
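The construction of the bases variable and of BPIP described above can be sketched as follows. This is a minimal base R illustration on made-up rows, not the project’s code. The event labels follow Statcast’s naming, and the pitchers and innings totals are hypothetical. One detail worth checking in any implementation: Fangraphs writes innings with .1 and .2 meaning one and two outs (thirds of an inning), so the sketch converts IP to true innings before dividing. Treating 212.1 as a plain decimal shifts BPIP slightly.

```r
# Toy plate-appearance outcomes; pitcher names and rows are made up.
pa <- data.frame(
  player_name = c("Pitcher A", "Pitcher A", "Pitcher A",
                  "Pitcher B", "Pitcher B"),
  events = c("single", "strikeout", "home_run", "walk", "double"),
  stringsAsFactors = FALSE
)

# Bases charged to the pitcher for each outcome; anything else counts as 0.
base_values <- c(walk = 1, hit_by_pitch = 1, single = 1,
                 double = 2, triple = 3, home_run = 4)
pa$bases <- unname(base_values[pa$events])
pa$bases[is.na(pa$bases)] <- 0

# Total bases allowed per pitcher.
totals <- aggregate(bases ~ player_name, data = pa, FUN = sum)

# Convert box-score innings (.1 = one out, .2 = two outs) to true innings.
ip_to_innings <- function(ip) {
  whole <- floor(ip)
  outs <- round((ip - whole) * 10)
  whole + outs / 3
}

# Hypothetical innings totals in Fangraphs-style notation.
innings <- data.frame(
  player_name = c("Pitcher A", "Pitcher B"),
  IP = c(4.1, 2.2),
  stringsAsFactors = FALSE
)

# Join the two tables and compute bases per inning pitched.
bpip <- merge(totals, innings, by = "player_name")
bpip$BPIP <- bpip$bases / ip_to_innings(bpip$IP)
```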

Unsurprisingly, these are some extremely well-known pitchers who are considered to be some of the best in the league. We can see, however, that each of their BPIPs is higher than their WHIP, some by more than others. Let’s take a look at the starting pitchers with the ten best BPIPs in 2019 instead of WHIPs, and see how the list compares.

Top 10 starting pitchers in 2019 based on BPIP
Player Name        Total Bases  IP     BPIP  WHIP
Jacob deGrom       216          204.0  1.06  0.97
Jack Flaherty      224          196.1  1.14  0.97
Luis Castillo      216          190.2  1.14  1.14
Mike Soroka        202          174.2  1.16  1.11
Justin Verlander   266          223.0  1.19  0.80
Gerrit Cole        258          212.1  1.22  0.89
Charlie Morton     238          194.2  1.23  1.08
Clayton Kershaw    219          178.1  1.23  1.04
Stephen Strasburg  262          209.0  1.25  1.04
Zack Greinke       262          208.2  1.26  0.98

The lists are pretty similar. Among the top ten starting pitchers based on WHIP, seven of them are also in the top ten based on the BPIP metric. Pretty interesting, but this doesn’t tell us anything yet. BPIP doesn’t serve any purpose unless we can use it to evaluate or predict a pitcher’s overall effectiveness. To do this, I first looked at how many total bases these pitchers gave up to visualize how BPIP is more accurate than WHIP in this aspect.

These graphs are pretty similar, but we can see there is far less spread in the graph for BPIP than there is for WHIP. The data points follow a linear pattern, which is reflected in the regression line added to its graph. Looking at the graph for WHIP, the data also seem to follow a linear pattern, but its observations are not packed so tightly along the added regression line; there is far more spread and unpredictability. It makes complete sense that a statistic created to reflect total bases allowed (BPIP) would do so more effectively than one (WHIP) created to reflect total baserunners allowed.

An alternate approach is using BPIP and WHIP to look at how many earned runs a pitcher gives up. This is the real reasoning behind creating the statistic, to better understand a pitcher’s overall effectiveness. Other metrics such as ERA could also be used for this, and I would easily be able to reproduce this for those other metrics, but I chose to use earned runs for this specific part. Here are scatterplots similar to the two I just showed, but this time showing the relationship with earned runs instead of total bases.

We see a similar pattern here to the one we saw in the first set of graphs, but it is less obvious here. The data points do seem to stay closer to the regression line in the graph for BPIP, but not as much as we saw earlier. To statistically confirm this, I calculated the correlation coefficients between BPIP/WHIP and earned runs. This confirmed my hypothesis that BPIP has a stronger relationship with earned runs than WHIP.
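The comparison of correlation coefficients could be sketched as below. The numbers are made up purely so the code runs; they are not 2019 results and say nothing about which statistic actually correlates better. The sketch also fits the one-predictor regression behind the line on the scatterplot, since its R-squared is simply the squared correlation.

```r
# Made-up values for six pitchers; these are not 2019 results.
pitchers <- data.frame(
  BPIP = c(1.05, 1.20, 1.30, 1.45, 1.60, 1.75),
  WHIP = c(0.95, 1.10, 1.00, 1.30, 1.20, 1.40),
  ER   = c(50, 62, 66, 80, 85, 97)
)

# Pearson correlation of each rate statistic with earned runs.
cor_bpip <- cor(pitchers$BPIP, pitchers$ER)
cor_whip <- cor(pitchers$WHIP, pitchers$ER)

# Same relationship seen through the regression line: for a linear model
# with one predictor, R-squared equals the squared correlation.
fit_bpip <- lm(ER ~ BPIP, data = pitchers)
r2_bpip <- summary(fit_bpip)$r.squared
```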

Of course, these findings do not mean that BPIP is the best predictor of a pitcher’s overall success. However, I do believe that it is a more useful metric than WHIP and that it could become a great tool for understanding which pitchers perform the best and put the team in the best position to win.


Handedness of Pitchers and Hitters

It’s no secret that hitters tend to perform better against pitchers with opposite handedness, but I wanted to take a further look into this and analyze performance for all combinations of pitcher/hitter handedness along with accounting for pitch type. All data used in this section was scraped using baseballr, and the only data manipulation needed was omitting NA events to calculate batting averages.

I chose to look at three metrics (though two of them go hand-in-hand) to analyze this: batting average, average pitch release speed, and average launch speed. The code behind this consisted of a lot of group_by() and summarize() calls. Without going into too much detail about the code, I created an output that shows batting average, average release speed, and average launch speed for each pitcher/batter matchup and pitch type. For pitch type, I chose to make it a binary variable that takes the value Fastball or Offspeed depending on the pitch’s average release speed. Here is the table.

Matchup     Pitch Type  Occurrences  Batting Average  Avg. Release Speed (mph)  Avg. Launch Speed (mph)
RHP v. RHB  Fastball    37387        0.267            93.603                    90.143
RHP v. RHB  Offspeed    27266        0.220            85.742                    86.679
RHP v. LHB  Fastball    31146        0.279            93.659                    90.547
RHP v. LHB  Offspeed    22688        0.226            85.931                    87.031
LHP v. RHB  Fastball    17734        0.276            92.062                    90.734
LHP v. RHB  Offspeed    14808        0.242            84.201                    87.113
LHP v. LHB  Fastball    7829         0.268            91.956                    88.794
LHP v. LHB  Offspeed    5200         0.213            84.203                    85.772
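A grouped summary like the one in this table could be built along the following lines. This is a base R sketch on toy rows rather than the group_by()/summarize() code used for the project. The column names p_throws, stand, pitch_type, release_speed, launch_speed, and events follow Statcast’s naming. The 90 mph cutoff used to label a pitch type as Fastball is an assumption made for this example, and every row is treated as an at-bat, whereas a fuller version would drop walks, hit-by-pitches, and sacrifices from the denominator.

```r
# Toy rows, one per plate-appearance outcome; all values are made up.
pitches <- data.frame(
  p_throws = c("R", "R", "R", "R", "L", "L"),
  stand = c("R", "R", "R", "L", "R", "R"),
  pitch_type = c("FF", "FF", "SL", "FF", "CH", "CH"),
  release_speed = c(95, 93, 86, 94, 84, 83),
  launch_speed = c(98, NA, 82, 91, 88, 80),  # NA: no ball in play
  events = c("single", "strikeout", "field_out",
             "field_out", "double", "field_out"),
  stringsAsFactors = FALSE
)

# Label each pitch type by its average release speed (assumed 90 mph cutoff).
type_speed <- tapply(pitches$release_speed, pitches$pitch_type, mean)
fast_types <- names(type_speed)[type_speed >= 90]
pitches$pitch_group <- ifelse(pitches$pitch_type %in% fast_types,
                              "Fastball", "Offspeed")

# Matchup label and hit indicator.
pitches$matchup <- paste0(pitches$p_throws, "HP v. ", pitches$stand, "HB")
pitches$is_hit <- pitches$events %in%
  c("single", "double", "triple", "home_run")

# Group means; na.rm = TRUE skips missing launch speeds within a group.
groups <- list(matchup = pitches$matchup, pitch_group = pitches$pitch_group)
matchup_summary <- aggregate(
  pitches[, c("is_hit", "release_speed", "launch_speed")],
  by = groups, FUN = mean, na.rm = TRUE
)
names(matchup_summary)[names(matchup_summary) == "is_hit"] <- "batting_avg"

# Add the number of rows behind each group.
counts <- aggregate(list(occurrences = pitches$is_hit),
                    by = groups, FUN = length)
matchup_summary <- merge(matchup_summary, counts,
                         by = c("matchup", "pitch_group"))
```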

The table contains a lot of information, but I’ll quickly touch on a couple of observations I made from it. The first thing I noticed is that when arranged by batting average, the batting averages for all offspeed pitches are lower than those for all fastballs. Looking then within each of the two pitch groups, batting averages were lower when the handedness of the pitcher and hitter was the same. Here is the table arranged by increasing batting average.

Matchup     Pitch Type  Occurrences  Batting Average  Avg. Release Speed (mph)  Avg. Launch Speed (mph)
LHP v. LHB  Offspeed    5200         0.213            84.203                    85.772
RHP v. RHB  Offspeed    27266        0.220            85.742                    86.679
RHP v. LHB  Offspeed    22688        0.226            85.931                    87.031
LHP v. RHB  Offspeed    14808        0.242            84.201                    87.113
RHP v. RHB  Fastball    37387        0.267            93.603                    90.143
LHP v. LHB  Fastball    7829         0.268            91.956                    88.794
LHP v. RHB  Fastball    17734        0.276            92.062                    90.734
RHP v. LHB  Fastball    31146        0.279            93.659                    90.547

I then wanted to take a look at how launch speeds vary for different release speeds. Unsurprisingly, when arranging by both release and launch speeds, the offspeed pitches are slowest and the fastballs are fastest. I created a new variable here, Smash, the ratio of launch speed to release speed, to determine the efficiency of contact. Here is a table showing the new variable, with the table arranged by increasing Smash.

Matchup     Pitch Type  Occurrences  Batting Average  Avg. Release Speed (mph)  Avg. Launch Speed (mph)  Smash
RHP v. RHB  Fastball    37387        0.267            93.603                    90.143                   0.963
LHP v. LHB  Fastball    7829         0.268            91.956                    88.794                   0.966
RHP v. LHB  Fastball    31146        0.279            93.659                    90.547                   0.967
LHP v. RHB  Fastball    17734        0.276            92.062                    90.734                   0.986
RHP v. RHB  Offspeed    27266        0.220            85.742                    86.679                   1.011
RHP v. LHB  Offspeed    22688        0.226            85.931                    87.031                   1.013
LHP v. LHB  Offspeed    5200         0.213            84.203                    85.772                   1.019
LHP v. RHB  Offspeed    14808        0.242            84.201                    87.113                   1.035
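The Smash column can be reproduced directly from the two speed columns in this table. The sketch below takes the averages as printed and assumes Smash is the ratio of the average launch speed to the average release speed, which matches the published values to three decimals.

```r
# Average speeds copied from the table above.
smash_tbl <- data.frame(
  matchup = c("RHP v. RHB", "LHP v. LHB", "RHP v. LHB", "LHP v. RHB",
              "RHP v. RHB", "RHP v. LHB", "LHP v. LHB", "LHP v. RHB"),
  pitch_type = rep(c("Fastball", "Offspeed"), each = 4),
  release = c(93.603, 91.956, 93.659, 92.062,
              85.742, 85.931, 84.203, 84.201),
  launch = c(90.143, 88.794, 90.547, 90.734,
             86.679, 87.031, 85.772, 87.113),
  stringsAsFactors = FALSE
)

# Smash: launch speed relative to release speed.
smash_tbl$smash <- smash_tbl$launch / smash_tbl$release

# Arrange by increasing Smash, as in the table.
smash_tbl <- smash_tbl[order(smash_tbl$smash), ]
```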

The observation I made here is that for each pitch type, the smash factor is lowest (least efficient contact) for right-handed batters against right-handed pitchers and highest for right-handed batters against left-handed pitchers. Looking at the previous few tables, it is clear to me that hitters generally perform worse (by multiple metrics) against pitchers with the same handedness and also perform worse (based on batting average) against offspeed pitches. This information could be leveraged on both offense and defense and could be combined with other statistics to improve its effectiveness.


Situational Hitting

The last thing I wanted to look at was situational hitting – how hitters perform at the times it seems to matter most. This can be extremely valuable to understand; knowing how players perform in these situations can be a determining factor in late-game decisions, and is often the difference between winning and losing a game.

I decided to look at three groups of what I call critical situations: runners on base, number of strikes/outs, and the inning of the game. The data used in this section was scraped using baseballr and was very simple to use, omitting NA events to calculate batting averages after filtering to meet certain criteria based on the critical situation I was looking at. First, let’s take a look at how hitters performed with runners on base.

Situation                    Batting Average  Occurrences
Bases Empty                  0.244            105475
Runners on Base              0.256            78335
Runners in Scoring Position  0.252            44262

Looking at this, hitters performed noticeably better with runners on base than with the bases empty. I suspect that a couple of reasons for this might be the inability to shift defenders along with pitchers often having to switch from the windup to the stretch. Hitters performed slightly worse with runners in scoring position (0.252) than with runners on base overall (0.256), though still better than with the bases empty, which may be a result of hitters feeling a little more pressure in those situations. Next, let’s look at how hitters did with two strikes or two outs.

Situation              Batting Average  Occurrences
Less Than Two Strikes  0.340            84940
Two Strikes            0.172            98870
Less Than Two Outs     0.252            124488
Two Outs               0.244            59322

Based on this table, hitters performed far worse with two strikes than with fewer than two strikes, and also worse with two outs than with fewer than two outs, though the drop with two outs is not nearly as large as the drop with two strikes. I expected to see a definite difference in batting average with two strikes but wasn’t expecting it to be roughly half of the average with fewer than two strikes (0.172 versus 0.340). For pitchers, this certainly shows that getting two strikes in the count is extremely important and gives pitchers the upper hand. For hitters, this suggests that the best opportunities to get a hit may be earlier in the at-bat. Finally, let’s see how hitters performed earlier in games and later in games.

Situation            Batting Average  Occurrences
Before 7th Inning    0.253            123314
7th Inning or Later  0.241            60496
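The three groups of critical situations could be flagged and summarized roughly as follows. This is a base R sketch on toy rows, not the project’s code. The column names on_1b, on_2b, on_3b, strikes, outs_when_up, and inning follow Statcast’s naming, where the on_* columns hold a runner’s ID and are NA when the base is empty. As a simplification, every row is treated as an at-bat.

```r
# Toy plate-appearance rows; all values are made up.
pa <- data.frame(
  events = c("single", "strikeout", "field_out",
             "home_run", "field_out", "double"),
  on_1b = c(NA, 101, NA, NA, 102, NA),
  on_2b = c(NA, NA, 103, NA, NA, NA),
  on_3b = c(NA, NA, NA, NA, NA, NA),
  strikes = c(0, 2, 1, 2, 2, 1),
  outs_when_up = c(0, 1, 2, 2, 0, 1),
  inning = c(1, 3, 7, 9, 5, 8),
  stringsAsFactors = FALSE
)
pa$is_hit <- pa$events %in% c("single", "double", "triple", "home_run")

# Situation flags.
pa$runners_on <- !is.na(pa$on_1b) | !is.na(pa$on_2b) | !is.na(pa$on_3b)
pa$risp <- !is.na(pa$on_2b) | !is.na(pa$on_3b)
pa$two_strikes <- pa$strikes == 2
pa$two_outs <- pa$outs_when_up == 2
pa$late <- pa$inning >= 7

# Batting average and sample size for the rows where a flag is TRUE.
situation_avg <- function(flag) {
  c(avg = mean(pa$is_hit[flag]), n = sum(flag))
}

results <- rbind(
  "Bases Empty" = situation_avg(!pa$runners_on),
  "Runners on Base" = situation_avg(pa$runners_on),
  "Runners in Scoring Position" = situation_avg(pa$risp),
  "Two Strikes" = situation_avg(pa$two_strikes),
  "Two Outs" = situation_avg(pa$two_outs),
  "7th Inning or Later" = situation_avg(pa$late)
)
```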

In the final three innings of the game (extra innings as well), hitters’ batting averages were lower than those in the first six innings of the game. A probable reason for this is that in those late-game situations, relief pitchers have likely entered the game and are often harder to hit than starting pitchers given their shorter outings.

All of these averages are telling of when hitters perform their best and worst, and can certainly be leveraged to give a team the upper hand. Extending these statistics to individual players and seeing how they compare to averages is a great way of understanding which players are best in certain situations, along with determining which players might have that clutch factor that is difficult to quantify.
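Extending a split to individual players might look like the sketch below, which compares each hitter’s average with runners in scoring position against the overall average for that situation. The data and the column names batter, is_hit, and risp are made up for illustration. Keeping the sample size next to each average matters here, since individual splits are often built on very few at-bats.

```r
# Toy at-bat rows for two hypothetical hitters.
pa <- data.frame(
  batter = c("Hitter A", "Hitter A", "Hitter A", "Hitter A",
             "Hitter B", "Hitter B", "Hitter B"),
  is_hit = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  risp = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Restrict to the situation of interest.
risp_pa <- pa[pa$risp, ]

# Overall average in this situation, then each hitter's own average.
league_avg <- mean(risp_pa$is_hit)
player_avg <- tapply(risp_pa$is_hit, risp_pa$batter, mean)
player_n <- tapply(risp_pa$is_hit, risp_pa$batter, length)

comparison <- data.frame(
  batter = names(player_avg),
  avg = as.numeric(player_avg),
  n = as.integer(player_n),
  diff_from_league = as.numeric(player_avg) - league_avg,
  stringsAsFactors = FALSE
)
```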


Conclusion

With this project, I believe I have analyzed data that is relevant to the UNC Baseball team and believe I could be a valuable asset to the team. The new statistic I created, BPIP, could certainly be calculated for players on the UNC Baseball team to learn more about which pitchers perform the best and put the team in situations to succeed. Similarly, this could be leveraged to better understand pitchers on the other team as well as to create a lineup whose strengths align with the other team’s weaknesses.

While the data I analyzed was MLB data, my analysis of pitcher/hitter handedness could also be valuable for the UNC Baseball team in creating a lineup that will perform well against the other team. Not only could this information be useful in creating lineups to start the game, but it could also be extremely valuable in late-game situations where either team is making a pitching change or making substitutions. Just as with BPIP, these metrics could be easily reproduced for the players on UNC’s team (and for the other team’s players as well) to better understand how they specifically perform in certain matchups. The same goes for the situational hitting statistics; extending these to individual players on both teams will give insight into which players are going to help UNC win more games.

UNC Baseball is an extremely successful program and will continue to be so. As in most, if not all, sports, the objective is to win games and consistently outperform opponents. Analytics are extremely valuable in getting teams prepared for games and can make practice sessions and game-specific preparations far more beneficial. With this project, I believe I have demonstrated my baseball and statistical knowledge, along with my ability to combine the two in a way that will help the UNC Baseball team. I have always believed that fulfilling my full potential means using my strengths and abilities in a way that will benefit others, and I believe this role is a perfect opportunity for me to do exactly that.

I am very thankful to have participated in the Bootcamp and to be given the chance to complete this project. I hope you enjoyed it, and I thank you for taking the time to look through it.