1 Abstract

Hitting+ is a metric that measures the true talent level of an MLB hitter through Statcast data, batted ball classification data, spray chart distribution, sprint speed interactions with specific types of batted balls, and plate discipline data. The metric is composed of two underlying metrics, Quality of Contact+ and Plate Discipline+, both of which are created through multiple regression models. The final metric stems from a final regression model that features both QOC+ and Plate Discipline+, as well as data on a player’s quantity of contact, also known as Expected Balls in Play%. Overall, the purpose of Hitting+ is to utilize a hitter’s underlying peripheral data to encapsulate their true ability at the plate.

2 Introduction

How do we truly know how good a hitter is? Do we look at their past performance? Do we analyze their expected statistics? Each of these approaches will likely yield solid results, but there are still gaps that have yet to be filled when it comes to hitter evaluation. I have yet to see a free, publicly-available metric that accurately measures the talent level of a hitter. One common metric that is cited in analytics circles is wRC+, which measures a hitter’s run-creation performance relative to the rest of the league. wRC+ is a nearly perfect measure of past performance at the plate, but hitters do not always perform at or near their true talent level. wRC+ has a year-to-year correlation of approximately 0.40, indicating that it is a rather unstable metric. In other words, a larger-than-desired portion of the metric is determined by randomness. Expected stats, on the other hand, remove the defense from the equation and only focus on the quality of the batted ball. Although I am in favor of isolating the individual in player evaluation, expected stats come with a glaring flaw: they do not account for the shift. It is easy, and likely fair, to argue that a 400-foot fly out is simply bad luck. However, when a pull-heavy hitter like Joey Gallo lines out to the second basemen who is camping in short right field, that is not bad luck. That is the result of being an very shiftable hitter.

This is where Hitting+ comes into play. Hitting+ is not a production stat, but rather a measure of a hitter’s true talent level. Similar to expected stats, the premise is to isolate what the hitter can control. However, unlike expected stats, a hitter’s ability to spray the ball to all fields is taken into account. Take Carlos Santana, for instance. Santana frequently underperforms his expected stats. Is he unlucky? No. Santana is a pull-heavy lefty who tends to hit a ton of ground balls. In other words, he is extremely shiftable. This is not bad luck. This is a flaw in his hitting ability. On the flip side, Whit Merrifield is a hitter who may scare some away due to his inability to barrel the ball at a high clip. Although a high Barrel% is a desirable trait, Merrifield posted a remarkably balanced spray-chart profile last year: 35.6% Pull%, 32.3% Cent%, 32.1% Oppo%. Unlike Santana, Merrifield tends to outperform his expected stats. Hitting+ appropriately credits hitters who are hard to shift, and discredits those who are rather shiftable.

Hitting+ is composed of two different metrics: Quality of Contact+ and Plate Discipline+. Aside from spray chart data, QOC+ focuses on batted ball classification data, primarily from Statcast. Statcast categorizes every batted ball into six categories: Barrel, Solid Contact, Flare/Burner, Under, Topped, and Poor/Weak. QOC+ also factors in Hard Hit%, Average Exit Velocity, and more traditional classifications such as FB%, LD%, and GB%. I decided to create two simple metrics that measure a hitter’s ability to spray the ball to all fields. One of the metrics is called Spray Score, and it is simply the standard deviation between Pull%, Cent%, and Oppo%. The lower the Spray Score, the higher the hitter’s ability to spray the ball to all fields. The other is called Shiftability, which is the absolute value of Pull% - Oppo%. However, it is important to note that not all hitters are negatively affected by pull-heavy tendencies. A pulled ground ball is the least beneficial outcome, but a pulled fly ball is among the most beneficial outcomes. Similarly, a Trea Turner ground ball and Albert Pujols ground ball are not the same, nor should they be treated as such. Sprint Speed matters a lot on certain types of ground balls Hence, I included interactions into the model. The following interactions were included in the model: SpeedGB%, SpeedLD%, SpeedTopped%, Pull%FB%, Cent%FB%, Oppo%FB%, ShiftabilityGB%, and ShiftabilityLD%.

The second metric, PD+, involves a much more simple regression model. When evaluating a hitter’s plate discipline, it is vital to analyze their swing rates, whiff rates, and contact rates on pitches in and out of the strike zone. Hence, the model includes z-swing%, z-whiff%, z-contact%, o-swing%, o-whiff%, and o-contact%. I also opted to include zone%, as this could shed light on how pitchers attack a certain hitter. It is not a surprise that Javier Baez, a hitter who notoriously chases pitches at a high rate, faced the second-lowest percentage of pitches in the strike zone. I also included overall whiff rate and overall swing rate. Using these nine variables, I discovered each player’s xBB% and xK%, which I later combined to form PD+.

3 Methods & Analysis

3.1 Creating QOC+

#import data
qoc <- read.csv("Hitting+ - QOC Data.csv")

#create the train and test groups
set.seed(2)
library(caTools)
split <- sample.split(qoc, SplitRatio=0.7)
train <- subset(qoc, split=="TRUE")
test <- subset(qoc, split=="FALSE")

#create the QOC+ model
qoc_model <- lm(wOBACON ~ EV + Barrel. + SolidContact. + Flare. + PoorlyUnder. + PoorlyTopped. + PoorlyWeak. + HardHit. + Pull. + Cent. + Oppo. + SprayScore + Shiftability + GB. + FB. + LD. + PU. + SpdGB + SpdLD + SpdPT + PullFB + CentFB + OppoFB + ShiftGB + ShiftLD, data=train)

#print the QOC+ model
summary(qoc_model)

## 
## Call:
## lm(formula = wOBACON ~ EV + Barrel. + SolidContact. + Flare. + 
##     PoorlyUnder. + PoorlyTopped. + PoorlyWeak. + HardHit. + Pull. + 
##     Cent. + Oppo. + SprayScore + Shiftability + GB. + FB. + LD. + 
##     PU. + SpdGB + SpdLD + SpdPT + PullFB + CentFB + OppoFB + 
##     ShiftGB + ShiftLD, data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.144694 -0.021797  0.000456  0.021984  0.104848 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.350595   1.667632  -0.210  0.83350    
## EV             0.003208   0.001058   3.031  0.00247 ** 
## Barrel.        0.961164   0.065204  14.741  < 2e-16 ***
## SolidContact.  0.149751   0.069850   2.144  0.03216 *  
## Flare.         0.159022   0.052598   3.023  0.00253 ** 
## PoorlyUnder.  -0.307275   0.059493  -5.165 2.64e-07 ***
## PoorlyTopped.  0.486327   0.502634   0.968  0.33338    
## PoorlyWeak.   -0.105008   0.069129  -1.519  0.12891    
## HardHit.      -0.081256   0.030433  -2.670  0.00765 ** 
## Pull.          0.584116   1.112943   0.525  0.59975    
## Cent.          0.552988   1.111569   0.497  0.61890    
## Oppo.          0.626391   1.112209   0.563  0.57336    
## SprayScore    -0.138261   0.064545  -2.142  0.03230 *  
## Shiftability   0.167325   0.175292   0.955  0.33992    
## GB.           -0.910297   1.281431  -0.710  0.47755    
## FB.            3.109838   5.078126   0.612  0.54034    
## LD.           -0.317046   1.244921  -0.255  0.79900    
## PU.           -0.035029   1.223238  -0.029  0.97716    
## SpdGB          0.028292   0.014471   1.955  0.05071 .  
## SpdLD          0.011205   0.008218   1.363  0.17289    
## SpdPT         -0.023074   0.018268  -1.263  0.20670    
## PullFB        -3.252068   4.912697  -0.662  0.50806    
## CentFB        -3.080128   4.895933  -0.629  0.52934    
## OppoFB        -3.458932   4.911171  -0.704  0.48133    
## ShiftGB       -0.256214   0.229460  -1.117  0.26430    
## ShiftLD        0.032072   0.283994   0.113  0.91010    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03541 on 2066 degrees of freedom
## Multiple R-squared:  0.6638, Adjusted R-squared:  0.6597 
## F-statistic: 163.1 on 25 and 2066 DF,  p-value: < 2.2e-16

#create the metric
true_wOBACON <- predict(qoc_model, test)

#plot the results
plot(test$wOBACON, type= "l", lty=1.75, col="red",
     main="Figure I | Comparing wOBACON to True wOBACON",
     xlab="Index",
     ylab="wOBACON vs. True wOBACON",
     sub="red = wOBACON | blue = True wOBACON")
lines(true_wOBACON, type="l", col="blue")

#find standard deviation of each metric
sd(true_wOBACON)

## [1] 0.05130187

sd(test$wOBACON)

## [1] 0.06447698

#put metric in dataset
qoc_values <- predict(qoc_model, qoc)
qoc$qoc <- qoc_values

#create subsets by year
qoc2021 <- subset(qoc, year=="2021")
qoc2020 <- subset(qoc, year=="2020")
qoc2019 <- subset(qoc, year=="2019")
qoc2018 <- subset(qoc, year=="2018")
qoc2017 <- subset(qoc, year=="2017")
qoc2016 <- subset(qoc, year=="2016")
qoc2015 <- subset(qoc, year=="2015")

#adjust by league average (put it on the + scale)
qoc2021$QOC. <- (qoc2021$qoc*100)/mean(qoc2021$qoc)
qoc2020$QOC. <- (qoc2020$qoc*100)/mean(qoc2020$qoc)
qoc2019$QOC. <- (qoc2019$qoc*100)/mean(qoc2019$qoc)
qoc2018$QOC. <- (qoc2018$qoc*100)/mean(qoc2018$qoc)
qoc2017$QOC. <- (qoc2017$qoc*100)/mean(qoc2017$qoc)
qoc2016$QOC. <- (qoc2016$qoc*100)/mean(qoc2016$qoc)
qoc2015$QOC. <- (qoc2015$qoc*100)/mean(qoc2015$qoc)

#bind the subsets
overall_qoc_data <- rbind(qoc2021, qoc2020, qoc2019, qoc2018, qoc2017, qoc2016, qoc2015)

#merge the subsets by year n and year n+1
total1 <- merge(qoc2015, qoc2016, by=c("last_name","first_name"))
total2 <- merge(qoc2016, qoc2017, by=c("last_name","first_name"))
total3 <- merge(qoc2017, qoc2018, by=c("last_name","first_name"))
total4 <- merge(qoc2018, qoc2019, by=c("last_name","first_name"))

#bind by year n and year n+1
total <- rbind(total1, total2, total3, total4)

#find the year to year correlation of QOC+
cor(total$QOC..x, total$QOC..y)

## [1] 0.6823355

The summary of the QOC+ model indicates that the most influential metrics in determining QOC+ were Barrel%, EV, Hard Hit%, and other batted ball categorization data. This is unsurprising, given the known predictiveness of these metrics. However, it is worth noting that Spray Score also heavily influences the model. Moreso, some of the interaction variables, particularly speed interactions and Shiftability*GB%, are heavily factored into the model. The Adjusted R-Squared sits at a strong 0.66, and the 25th and 75th percentile of the model are nearly equidistant from the median, indicating that the distribution is standard. Figure I indicates that the model’s version of wOBACON is a lot more regressed than actual wOBACON. This makes sense, considering one of the primary functions of a regression model is to detect unsustainable outliers and regress them towards the mean. The standard deviation of the model’s version of wOBACON is smaller than actual wOBACON, which supports this theory. Lastly, the year-to-year correlation of QOC+ is 0.682, which indicates that the metric is rather sticky.

3.2 Creating xK% and xBB%

#import data
pd <- read.csv("Hitting+ - PD Data-2.csv")

#create the train and test groups
split2 <- sample.split(pd, SplitRatio=0.7)
train2 <- subset(pd, split=="TRUE")
test2 <- subset(pd, split=="FALSE")

#create the K model
k_model <- lm(K. ~ z.swing. + z.whiff. + o.swing. + o.whiff. + o.contact. + z.contact. + zone. + whiff. + swing., data=train2)

#print the K model
summary(k_model)

## 
## Call:
## lm(formula = K. ~ z.swing. + z.whiff. + o.swing. + o.whiff. + 
##     o.contact. + z.contact. + zone. + whiff. + swing., data = train2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.084324 -0.016951 -0.000648  0.015831  0.137697 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.61405    0.09732   6.310 3.41e-10 ***
## z.swing.    -0.58113    0.14893  -3.902 9.84e-05 ***
## z.whiff.    -0.14832    0.13305  -1.115  0.26510    
## o.swing.    -0.38529    0.16162  -2.384  0.01721 *  
## o.whiff.    -0.27531    0.08062  -3.415  0.00065 ***
## o.contact.  -0.29721    0.06978  -4.259 2.14e-05 ***
## z.contact.  -0.16147    0.09421  -1.714  0.08670 .  
## zone.       -0.07056    0.12073  -0.584  0.55900    
## whiff.       0.91590    0.13270   6.902 6.77e-12 ***
## swing.       0.76542    0.30680   2.495  0.01268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02606 on 2082 degrees of freedom
## Multiple R-squared:  0.8364, Adjusted R-squared:  0.8357 
## F-statistic:  1183 on 9 and 2082 DF,  p-value: < 2.2e-16

#create the BB model
bb_model <- lm(BB. ~ z.swing. + z.whiff. + o.swing. + o.whiff. + o.contact. + z.contact. + zone. + whiff. + swing., data=train2)

#print the BB model
summary(bb_model)

## 
## Call:
## lm(formula = BB. ~ z.swing. + z.whiff. + o.swing. + o.whiff. + 
##     o.contact. + z.contact. + zone. + whiff. + swing., data = train2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061231 -0.010798 -0.000342  0.010478  0.078519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.28833    0.06502   4.434 9.71e-06 ***
## z.swing.     0.44681    0.09950   4.490 7.50e-06 ***
## z.whiff.     0.36743    0.08890   4.133 3.72e-05 ***
## o.swing.     0.19222    0.10798   1.780  0.07521 .  
## o.whiff.     0.13930    0.05386   2.586  0.00977 ** 
## o.contact.   0.02728    0.04662   0.585  0.55853    
## z.contact.  -0.02621    0.06295  -0.416  0.67721    
## zone.       -0.16234    0.08066  -2.013  0.04429 *  
## whiff.      -0.42661    0.08866  -4.812 1.60e-06 ***
## swing.      -1.04035    0.20499  -5.075 4.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01741 on 2082 degrees of freedom
## Multiple R-squared:  0.7025, Adjusted R-squared:  0.7013 
## F-statistic: 546.4 on 9 and 2082 DF,  p-value: < 2.2e-16

#create the metrics
true_k <- predict(k_model, test2)
true_bb <- predict(bb_model, test2)

#plot the results of the K model
plot(test2$K., type= "l", lty=1.75, col="red",
     main="Figure II | Comparing K% to True K%",
     xlab="Index",
     ylab="K% vs. xK%",
     sub="red = K% | blue = True K%")
lines(true_k, type="l", col="blue")

#plot the results of the BB model
plot(test2$BB., type= "l", lty=1.75, col="red",
     main="Figure III | Comparing BB% to True BB%",
     xlab="Index",
     ylab="BB% vs. xBB%",
     sub="red = BB% | blue = True BB%")
lines(true_bb, type="l", col="blue")

#find the standard deviations of xK%, K%, xBB%, BB%
sd(true_k)

## [1] 0.05760262

sd(test2$K.)

## [1] 0.06476365

sd(true_bb)

## [1] 0.02705823

sd(test2$BB.)

## [1] 0.0329447

For the model that derives xK%, it is unsurprising to see that whiff% is the most indicative of K%. After all, hitters who whiff at a high rate will likely have high K rates. z-swing%, o-whiff%, and o-contact% are also very influential on the model. For xBB%, the swing% and whiff% variables are the strongest. Hitters who swing less often walk more. To take it a step further, the zone% coefficient indicates that hitters who face more balls often walk more. Elite hitters are pitched around and thrown fewer strikes, which leads to more walks. Both models possess very high Adjusted R-Squared values and standard distributions. Figures II and III indicate the same as Figure I: the regression models regress unsustainable outliers. While Yasmani Grandal does possess strong plate discipline, it is unreasonable to expect him to maintain his 23.2% BB%. His 2021 xBB% 15.4%, which is likely closer to his true talent level.

3.3 Creating PD+

#put metrics in dataset
k_values <- predict(k_model, pd)
pd$xK. <- k_values
bb_values <- predict(bb_model, pd)
pd$xBB. <- bb_values

#create the train and test groups
split3 <- sample.split(pd, SplitRatio=0.7)
train3 <- subset(pd, split=="TRUE")
test3 <- subset(pd, split=="FALSE")

#create the model
pd_model <- lm(wOBA ~ xK. + xBB., data =train3)

#print the model
summary(pd_model)

## 
## Call:
## lm(formula = wOBA ~ xK. + xBB., data = train3)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146878 -0.025770  0.001388  0.025692  0.126177 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29564    0.00399  74.092  < 2e-16 ***
## xK.         -0.12124    0.01514  -8.008 1.91e-15 ***
## xBB.         0.55258    0.03334  16.576  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03982 on 2089 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1236 
## F-statistic: 148.4 on 2 and 2089 DF,  p-value: < 2.2e-16

#put metrics in dataset
pd_values <- predict(pd_model, pd)
pd$pd <- pd_values

#create subsets by year
pd2021 <- subset(pd, year=="2021")
pd2020 <- subset(pd, year=="2020")
pd2019 <- subset(pd, year=="2019")
pd2018 <- subset(pd, year=="2018")
pd2017 <- subset(pd, year=="2017")
pd2016 <- subset(pd, year=="2016")
pd2015 <- subset(pd, year=="2015")

#adjust by league average (put it on the + scale)
pd2021$PD. <- (pd2021$pd*100)/mean(pd2021$pd)
pd2020$PD. <- (pd2020$pd*100)/mean(pd2020$pd)
pd2019$PD. <- (pd2019$pd*100)/mean(pd2019$pd)
pd2018$PD. <- (pd2018$pd*100)/mean(pd2018$pd)
pd2017$PD. <- (pd2017$pd*100)/mean(pd2017$pd)
pd2016$PD. <- (pd2016$pd*100)/mean(pd2016$pd)
pd2015$PD. <- (pd2015$pd*100)/mean(pd2015$pd)

#bind the subsets
overall_pd_data <- rbind(pd2021, pd2020, pd2019, pd2018, pd2017, pd2016, pd2015)

#merge the subsets by year n and year n+1
total5 <- merge(pd2015, pd2016, by=c("last_name","first_name"))
total6 <- merge(pd2016, pd2017, by=c("last_name","first_name"))
total7 <- merge(pd2017, pd2018, by=c("last_name","first_name"))
total8 <- merge(pd2018, pd2019, by=c("last_name","first_name"))

#bind the subsets by year n and year n+1
total2.0 <- rbind(total5, total6, total7, total8)

#find the year to year correlations of PD+, xK%, and xBB%
cor(total2.0$PD..x, total2.0$PD..y)

## [1] 0.725555

cor(total2.0$xK..x, total2.0$xK..y)

## [1] 0.8203607

cor(total2.0$xBB..x, total2.0$xBB..y)

## [1] 0.7460486

The PD+ model is composed of xBB% and xK%. Considering there are only two variables, it is unsurprising to see that both are heavily influential on the model. xBB% happens to be more influential, indicating that a high BB% is more important than a low K%. Although the Adjusted R-Squared of the model appears to be very low, it is worth noting that this is expected. wOBA, which is the y-variable in the model, is a influenced by several factors that are not strikeouts and walks. The purpose of using wOBA was merely to derive the coefficients for xK% and xBB% to combine the two into PD+. The year-to-year correlations of PD+, xK%, xBB% are all very high. PD+’s correlation sits at 0.726; xK%’s sits at 0.82, and xBB%’s sits at 0.746. All of these correlations are high enough to indicate that each metric is very sticky year-to-year. They are all strong indicators of a hitter’s true talent level in each attribute.

3.4 Creating Hitting+

#merge QOC and PD datasets
hitting <- merge(overall_qoc_data, overall_pd_data, by=c("last_name","first_name","year"))

#create xK% + xBB% variable
hitting$xK.xBB. <- (hitting$xK. + hitting$xBB.)

#create xBIP% variable
hitting$xBIP. <- 1 - hitting$xK.xBB.

#create PD+ * (xK% + xBB%) variable
hitting$PD.xK.xBB <- hitting$PD.*hitting$xK.xBB.

#create QOC+ * xBIP% variable
hitting$QOC.xBIP <- hitting$QOC.*hitting$xBIP.

#create the train and test groups
split4 <- sample.split(hitting, SplitRatio=0.7)
train4 <- subset(hitting, split=="TRUE")
test4 <- subset(hitting, split=="FALSE")

#create the Hitting+ model
hitting_model <- lm(wOBA ~ QOC. + PD. + PD.xK.xBB + QOC.xBIP, data=train4)

#print the Hitting+ model
summary(hitting_model)

## 
## Call:
## lm(formula = wOBA ~ QOC. + PD. + PD.xK.xBB + QOC.xBIP, data = train4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115662 -0.017090 -0.000331  0.018418  0.090393 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.245e-01  1.319e-02 -17.014  < 2e-16 ***
## QOC.        -3.333e-05  3.840e-04  -0.087   0.9308    
## PD.          2.361e-03  2.156e-04  10.950  < 2e-16 ***
## PD.xK.xBB    1.397e-03  5.715e-04   2.445   0.0146 *  
## QOC.xBIP     3.826e-03  5.664e-04   6.754 1.86e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02712 on 2087 degrees of freedom
## Multiple R-squared:  0.6063, Adjusted R-squared:  0.6056 
## F-statistic: 803.5 on 4 and 2087 DF,  p-value: < 2.2e-16

#create the metric
true_wOBA <- predict(hitting_model, test4)

#plot the results
plot(test4$wOBA, type= "l", lty=1.75, col="red",
     main="Figure IV | Comparing wOBA to True wOBA",
     xlab="Index",
     ylab="wOBA vs. True wOBA",
     sub="red = wOBA | blue = True wOBA")
lines(true_wOBA, type="l", col="blue")

#put metric in dataset
hitting_values <- predict(hitting_model, hitting)
hitting$hitting <- hitting_values

#create subsets by year
hitting2021 <- subset(hitting, year=="2021")
hitting2020 <- subset(hitting, year=="2020")
hitting2019 <- subset(hitting, year=="2019")
hitting2018 <- subset(hitting, year=="2018")
hitting2017 <- subset(hitting, year=="2017")
hitting2016 <- subset(hitting, year=="2016")
hitting2015 <- subset(hitting, year=="2015")

#adjust by league average (put it on the + scale)
hitting2021$Hitting. <- (hitting2021$hitting*100)/mean(hitting2021$hitting)
hitting2020$Hitting. <- (hitting2020$hitting*100)/mean(hitting2020$hitting)
hitting2019$Hitting. <- (hitting2019$hitting*100)/mean(hitting2019$hitting)
hitting2018$Hitting. <- (hitting2018$hitting*100)/mean(hitting2018$hitting)
hitting2017$Hitting. <- (hitting2017$hitting*100)/mean(hitting2017$hitting)
hitting2016$Hitting. <- (hitting2016$hitting*100)/mean(hitting2016$hitting)
hitting2015$Hitting. <- (hitting2015$hitting*100)/mean(hitting2015$hitting)

#bind the subsets
hitting._data <- rbind(hitting2021, hitting2020, hitting2019, hitting2018, hitting2017, hitting2016, hitting2015)

#select certain columns
final_data <- hitting._data[ , c("last_name", "first_name", "year", "Hitting.", "QOC.", "PD.", "xK.", "xBB.", "xBIP.")]

#merge the subsets by year n and year n+1
total9 <- merge(hitting2015, hitting2016, by=c("last_name","first_name"))
total10 <- merge(hitting2016, hitting2017, by=c("last_name","first_name"))
total11 <- merge(hitting2017, hitting2018, by=c("last_name","first_name"))
total12 <- merge(hitting2018, hitting2019, by=c("last_name","first_name"))

#bind the subsets by year n and year n+1
total3.0 <- rbind(total9, total10, total11, total12)

#find the year to year correlation of Hitting+
cor(total3.0$Hitting..x, total3.0$Hitting..y)

## [1] 0.6218479

Although QOC+ and PD+ are the basis of Hitting+, the metric needed an added element: quantity of contact. On a per batted ball basis, Joey Gallo’s quality of contact ranks at or near the top of the league every year. However, throughout his career, fewer than 50% of his plate appearances result in a batted ball. This piece of information must be taken into account. Hence, I added two interaction variables. The first can be thought of as quality of contact * quantity of contact, or QOC+*xBIP%. The second is the product of PD+ and xK% + xBB%. It is worth noting that the first interaction is far more influential on the output of the model than QOC+. Although the year-to-year correlation of Hitting+, which lies at 0.622, is lower than some of its base metrics, it is still a rather strong year-to-year correlation.

3.5 Creating a Weighted Average of Hitting+

#create percentiles for Hitting+, QOC+, and PD+
final_data$Hitting.percentile <- ecdf(final_data$Hitting.)(final_data$Hitting.)
final_data$QOC.percentile <- ecdf(final_data$QOC.)(final_data$QOC.)
final_data$PD.percentile <- ecdf(final_data$PD.)(final_data$PD.)

#import data
averages <- read.csv("Hitting+ - Leaderboard-2.csv")

#subset the data by year
data2015 <- subset(averages, year=="2015")
data2016 <- subset(averages, year=="2016")
data2017 <- subset(averages, year=="2017")
data2018 <- subset(averages, year=="2018")
data2019 <- subset(averages, year=="2019")
data2020 <- subset(averages, year=="2020")
data2021 <- subset(averages, year=="2021")

#create four-year data
a <- merge(data2015, data2016, by=c("last_name", "first_name"))
b <- merge(data2017, data2018, by=c("last_name", "first_name"))
c <- merge(a, b, by=c("last_name", "first_name"))
four_year_data <- merge(c, data2019, by=c("last_name", "first_name"))

#create four-year model
four_year_model <- lm(Hitting. ~ Hitting..x.x + Hitting..y.x + Hitting..x.y + Hitting..y.y, data=four_year_data)

#print four-year model
summary(four_year_model)

## 
## Call:
## lm(formula = Hitting. ~ Hitting..x.x + Hitting..y.x + Hitting..x.y + 
##     Hitting..y.y, data = four_year_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8115  -5.8745  -0.6309   5.7627  20.7041 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.08717    7.35261   0.692  0.48990    
## Hitting..x.x  0.13578    0.08674   1.565  0.11923    
## Hitting..y.x  0.14358    0.09214   1.558  0.12090    
## Hitting..x.y  0.25475    0.09473   2.689  0.00784 ** 
## Hitting..y.y  0.39314    0.08010   4.908 2.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.944 on 180 degrees of freedom
## Multiple R-squared:  0.4906, Adjusted R-squared:  0.4793 
## F-statistic: 43.34 on 4 and 180 DF,  p-value: < 2.2e-16

#merge 2018-2021 data
d <- merge(data2018, data2019, by=c("last_name", "first_name"))
e <- merge(data2020, data2021, by=c("last_name", "first_name"))
avg <- merge(d, e, by=c("last_name", "first_name"))

#adjust by number of PAs in each year
avg$wPA2021 <- 0.4239849 * avg$PA.y.y
avg$wPA2020 <- 0.2747371 * avg$PA.x.y
avg$wPA2019 <- 0.154845 * avg$PA.y.x
avg$wPA2018 <- 0.146433 * avg$PA.x.x

#create weights
avg$weight2021 <- avg$wPA2021 / (avg$wPA2021 + avg$wPA2020 + avg$wPA2019 + avg$wPA2018)
avg$weight2020 <- avg$wPA2020 / (avg$wPA2021 + avg$wPA2020 + avg$wPA2019 + avg$wPA2018)
avg$weight2019 <- avg$wPA2019 / (avg$wPA2021 + avg$wPA2020 + avg$wPA2019 + avg$wPA2018)
avg$weight2018 <- avg$wPA2018 / (avg$wPA2021 + avg$wPA2020 + avg$wPA2019 + avg$wPA2018)

#find weighted average
avg$Weighted_Average <- (avg$Hitting..x.x*avg$weight2018) + (avg$Hitting..y.x*avg$weight2019) + (avg$Hitting..x.y*avg$weight2020) + (avg$Hitting..y.y*avg$weight2021)

#merge 2019-2021 data
f <- merge(data2019, data2020, by=c("last_name", "first_name"))
avg2 <- merge(f, data2021, by=c("last_name", "first_name"))

#adjust by number of PAs in each year
avg2$wPA2021 <- 0.4239849 * avg2$PA
avg2$wPA2020 <- 0.2747371 * avg2$PA.y
avg2$wPA2019 <- 0.154845 * avg2$PA.x

#create weights
avg2$weight2021 <- avg2$wPA2021 / (avg2$wPA2021 + avg2$wPA2020 + avg2$wPA2019)
avg2$weight2020 <- avg2$wPA2020 / (avg2$wPA2021 + avg2$wPA2020 + avg2$wPA2019)
avg2$weight2019 <- avg2$wPA2019 / (avg2$wPA2021 + avg2$wPA2020 + avg2$wPA2019)

#find weighted average
avg2$Weighted_Average <- (avg2$Hitting..x*avg2$weight2019) + (avg2$Hitting..y*avg2$weight2020) + (avg2$Hitting.*avg2$weight2021)

#merge 2020-2021 data
avg3 <- merge(data2020, data2021, by=c("last_name", "first_name"))

#adjust by number of PAs in each year
avg3$wPA2021 <- 0.4239849 * avg3$PA.y
avg3$wPA2020 <- 0.2747371 * avg3$PA.x

#create weights
avg3$weight2021 <- avg3$wPA2021 / (avg3$wPA2021 + avg3$wPA2020)
avg3$weight2020 <- avg3$wPA2020 / (avg3$wPA2021 + avg3$wPA2020)

#find weighted average
avg3$Weighted_Average <- (avg3$Hitting..x*avg3$weight2020) + (avg3$Hitting..y*avg3$weight2021)

#import weighted_percentiles data
weighted_percentiles <- read.csv("percentiles.csv")

#create the weighted percentiles
weighted_percentiles$Hitting.percentile <- ecdf(weighted_percentiles$Weighted_Average)(weighted_percentiles$Weighted_Average)

#print the top 6 players in weighted Hitting+
head(weighted_percentiles)

##   X last_name first_name year.x.x Weighted_Average Hitting.percentile
## 1 1     Trout       Mike     2018         136.9495          1.0000000
## 2 2 Acuna Jr.     Ronald     2018         128.7473          0.9978261
## 3 3 Tatis Jr.   Fernando     2019         126.5463          0.9956522
## 4 4     Judge      Aaron     2018         125.5328          0.9934783
## 5 5   Freeman    Freddie     2018         125.3530          0.9913043
## 6 6    Harper      Bryce     2018         123.5704          0.9891304

The weighted average is composed of two weights: recency and number of plate appearances in a given season. In order to determine the weights for recency, I created a dataset that consisted of every player who recorded at least 100 PAs in each season from 2015-2019. I wanted to use their 2015-2018 Hitting+ values to predict their 2019 value. The model gave me those weights. Instead of using the traditional linear model formula, I chose to disregard the intercept, and rescale each coefficient so that the sum of the four equals 1.000. Unsurprisingly, 2018 was the most predictive of 2019, followed by 2017, then 2016, then 2015. If 2019 is year n, then the 2018 weight can be considered the n-1 weight, the 2017 weight can be considered the n-2 weight, and so on. For a player like Fernando Tatis Jr., who did not play in an n-4 season if 2022 is n, then the n-4 weight can be disregarded and the other weights can be rescaled to sum to 1.000. If I am trying to calculate weighted averages of the past four seasons, there is a glaring issue: the shortened 2020 season. Hence, I must adjust for sample size in each season. The calculation to do this is rather rudimentary. I multiplied a player’s PAs in each year by the respective recency weight for that year, and rescaled the coefficients to add up to 1.000. There were a few manual adjustments I had to make for players like Salvador Perez who played in 2017, 2018, 2020, and 2021, but not 2019. However, most players fit a more traditional mold.

In my opinion, a player’s Weighted Hitting+ is the most indicative statistic with regards to their true talent level. Even though it is pretty sustainable year-to-year, it is still possible to have outliers. Mike Zunino posted a 42nd percentile Hitting+ in 2018 and a 14th percentile Hitting+ in 2019 before exploding in 2021 with a 92nd percentile Hitting+. Did he make substantial changes to his offensive game that led to increased success? The answer is likely yes. However, based on this information, I do not think it is fair to automatically assert that he is a 92nd percentile hitter. I believe that his 64th percentile Weighted Hitting+ is a lot closer to his true talent level. The top six in Weighted Hitting+, as listed above, appears to be rather standard. Mike Trout, Ronald Acuna Jr., Fernando Tatis Jr., Aaron Judge, Freddie Freeman, and Bryce Harper are all rightfully considered among the game’s best hitters. Although the leaderboard generally resembles the common opinion, there are a few surprises. Both Connor Joe and Alex Kirilloff rank in the top 20 of Weighted Hitting+. I guarantee not many people consider them to be elite hitters.

3.6 Comparing Hitting+ to wRC+

#import the ggplot2 and plotly packages
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

#import wRC+ data
wRC <- read.csv("Hitting+ - Sheet21-2.csv")

#put 2021 wRC+ values into percentiles
wRC$wRC.percentile <- ecdf(wRC$wRC.)(wRC$wRC.)

dataset <- wRC

#create interactive scatterplot of 2021 Hitting+ vs. 2021 wRC+
fig <- plot_ly(
  wRC, x = ~Hitting..Percentile, y = ~wRC.percentile,
  text = ~paste("Name: ", first_name, last_name, "; QOC+ Percentile: ", QOC..Percentile, '; PD+ Percentile:', PD..Percentile),
  color = ~PA) %>%
  layout(title= "Figure V 
         Interactive Scatterplot Comparing Hitting+ and wRC+ in 2021")
fig

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

#find percentile - Hitting+ percentile
wRC$Differential <- wRC$Hitting..Percentile - wRC$wRC.percentile

#import data.table package
library(data.table)

#print differentials
new_wRC <- wRC[c(1,2,3,5,14,15)]
new_wRC <- new_wRC[order(wRC$Differential, decreasing = TRUE), ]  
new_wRC <- data.table(new_wRC, key = "Differential")
new_wRC <- new_wRC[ , head(.SD, 3), by = Differential]
new_wRC

##      Differential  last_name first_name  PA Hitting..Percentile wRC.percentile
##   1:   -0.5948898       Kemp       Tony 397               0.282      0.8768898
##   2:   -0.5495054      Lopez      Nicky 565               0.133      0.6825054
##   3:   -0.5487019      Myers        Wil 500               0.177      0.7257019
##   4:   -0.5397430      Mejia  Francisco 277               0.173      0.7127430
##   5:   -0.5069762    Espinal   Santiago 246               0.290      0.7969762
##  ---                                                                          
## 459:    0.5377797       Bohm       Alec 417               0.784      0.2462203
## 460:    0.5589892      Marsh    Brandon 260               0.924      0.3650108
## 461:    0.6259352 Higashioka       Kyle 211               0.816      0.1900648
## 462:    0.6668942  Carpenter       Matt 249               0.844      0.1771058
## 463:    0.6679762     Senzel       Nick 124               0.871      0.2030238

Figure V is an interactive scatterplot that allows us to compare a player’s 2021 Hitting+ to their 2021 wRC+. A higher wRC+ percentile than Hitting+ percentile would tell us that the player likely outperformed their true talent level. For the most part, this indicates that regression is on the horizon. The opposite indicates that a player could be due for positive regression, as their true talent level is higher than their past production. Figure V is color-coded based on number of PAs in 2021. It is unsurprising that several of the points that strongly deviate from the figurative x=y line are darker, since they are likely skewed by a small sample size. Nonetheless, let’s analyze some of the players who were “lucky” or “unlucky”, for lack of better words.

The five biggest over-performers last year were, in order, Tony Kemp, Nicky Lopez, Wil Myers, Francisco Mejia, and Santiago Espinal. All of these hitters were in the 68th percentile or greater of wRC+, yet none fell above the 29th percentile of Hitting+. Why do these massive differentials exist? Kemp’s case is a rare one, considering he actually was in the top 10% of PD+. The issue was his 5th percentile QOC+, especially considering his xBIP% of 75%. Lopez is a less extreme version of the Kemp. His plate discipline was truly solid in 2021, but his inability to barrel a ball to save his life tremendously weighs down his value. This would not be such a huge issue, if 77% of his plate appearances didn’t result in a batted ball. Kemp and Lopez may not strike out often, but their batted balls are not that much more valuable than a strikeout. Myers’ profile is not as lopsided as Kemp and Lopez, as he fell in the pedestrian 39th percentile of both QOC+ and PD+. Mejia is simply a worse version of Myers. He did not demonstrate any level of success in either QOC+ or PD+. Lastly comes Espinal, who simply did not show any competency in his batted ball data. A 1.6% Barrel% won’t cut it at the big-league level.

Now, let’s focus on the five who may see an uptick in production next year: Nick Senzel, Matt Carpenter, Kyle Higashioka, Brandon Marsh, and Alec Bohm. Some may be ready to label Senzel a bust, but I believe that could be a mistake. Senzel was above average last year in terms of QOC+, and in the top 25% of PD+. The main reason one should be excited about Senzel is because of his low strikeout/solid contact combination. He may never post Joey Gallo numbers at the plate, but 52nd percentile QOC+ combined with a 79.3% xBIP% could be a lethal combination. Carpenter’s main strength is his ability to walk, as he posted a 14.7% xBB% last season. Some may be frightened by his pull-heavy tendencies, but his near-50% FB% actually makes them beneficial. He may never return to an All-Star level, but he should see an uptick from his 70 wRC+. Higashioka is not just Gerrit Cole’s personal catcher, but rather a solid bat as well. The main figure that demonstrates this is his 15.6% Barrel% in 2021. Granted, he tends to hit fly balls to center field, but he should still see plenty of positive regression in the future. The last two, Brandon Marsh and Alec Bohm, are the two I am most excited about. Marsh’s plate discipline is a work in progress, but his 98th percentile QOC+ automatically boosts him into the upper echelon of hitters. Bohm was one of the biggest disappointments in terms of production last year, but his peripheral data was relatively solid. Marsh and Bohm succeed in a similar way: line drives. Marsh may hit more on a per-batted-ball basis, but Bohm puts the ball in play more often. I expect both to be future All-Stars.

3.7 Determining the Predictiveness of Hitting+

#import data
predictiveness <- read.csv("Hitting+ - 2017-3.csv")

#find correlation between Weighted Hitting+ and n+1 wRC+
cor(predictiveness$Weighted_Average, predictiveness$wRC.)

## [1] 0.5179405

#import data
wRC.correlation <- read.csv("Hitting+ - wRC+-2.csv")

#find year-to-year correlation of wRC+
cor(wRC.correlation$X2018, wRC.correlation$X2019)

## [1] 0.4222068

The correlations above tell us that Weighted Hitting+ is a much better predictor of n+1 wRC+ than n wRC+ by a large margin. The 0.422 year-to-year correlation of wRC+ is underwhelming, to say the least. wRC+ is a strong measure of past production, but it is not a great player evaluation metric. If I want to know how productive Juan Soto was at the plate in 2021, then wRC+ is a near-perfect tool. If I want to know how good of a hitter Juan Soto is, I would reference Weighted Hitting+. For reference, the predictiveness of Steamer, which is commonly cited as the best publicly-available projection system, is approximately 0.55. For what it is worth, the point of Weighted Hitting+ is not necessarily to be predictive, but rather to estimate a player’s true talent level at the plate. It just so happens that true talent is a large determinant in on-field production.

4 Conclusion & Next Steps

I believe that sabermetrics can serve three purposes. First, they can measure a player’s past production. Second, they can shed light on how good a player truly is. Third, they can predict how good a player will be in the future. The first category consists of metrics such as wRC+, OPS, and wOBA. If I want to know how productive a given hitter was in a given year, these are the metrics I would cite. The third category is where projection systems, such as Steamer, come into play. If I want to know how well a given hitter will perform in 2022, projection systems are a strong tool. The second category, however, is where Hitting+ falls. It does not measure how productive a hitter was, and it is not designed to predict the future, but rather to determine how good a hitter is. Ironically, Hitting+ is a solid predictor of future production. Nonetheless, this is not the primary goal of the metric. The goal is to estimate true talent level.

In the future, I hope to use Hitting+ to create a full-blown projection system. In order to do so, I will need to derive an aging-curve adjustment. A 36-year-old with a Weighted Hitting+ in the 95th percentile should not be projected the same way as a 26-year-old with the same figure. I also think that park factors need to be looked at differently. Last season, Marcus Semien and Teoscar Hernandez played in the same ballparks. However, a singular park adjustment coefficient does not accurately represent how each player benefited from the park. Semien is a pull-heavy fly ball hitter who took advantage of the shallow left-field fence. Hernandez, on the other hand, is also pull-heavy, but tends to hit a ton of line drives. Did the two hitters play in the same ballparks last season? Yes. Were they equal beneficiaries of the ballparks that they played in? No.

Lastly, I plan on creating a ShinyApp platform to make Hitting+ more accessible and interactive. At the moment, Hitting+ leaderboards can be accessed using the link below. Down the road, I believe a ShinyApp will allow for people to easily access the metric and play around with leaderboards. I am currently developing a Twitter Bot that will automatically tweet out a player’s Hitting+ upon command. For instance if someone tweeted out “Mike Trout, Hitting+, 2021” the Twitter Bot will respond with Mike Trout’s Hitting+ in 2021. The main goal is to make Hitting+ easily accessible.