Task/Problem Description

We are to assess whether the count (the number of balls and strikes on the batter) affects the probability that a pitch is called a strike.
The following variables are in our dataset.
release_speed: Speed of the ball when it leaves the pitcher's hand
pitch_type: The type of pitch thrown by the pitcher
plate_x: The horizontal coordinate of the ball as it passes over the plate
plate_z: The vertical coordinate of the ball as it passes over the plate
strikes: The number of strikes the batter has during the current pitch
balls: The number of balls the batter has during the current pitch
home_score: Runs (score) of the home team at the time of the pitch
away_score: Runs (score) of the away team at the time of the pitch
pitcher: A unique id that represents the pitcher throwing the pitch

Location and Strike Zone

Given the nature of the strike zone in baseball, there are certain regions in which the probability of a called strike is overwhelmingly high. Likewise, if the ball is far outside the strike zone, the probability of a called strike is overwhelmingly low. Thus, we will focus our analysis on what we will define as a ‘fringe region’.

A fringe region should have the following characteristics:
1.) The proportion of pitches in the fringe zone that are called strikes should be approximately 0.50
2.) The proportion of pitches interior to the fringe zone that are called strikes should be > 0.90
3.) The proportion of pitches exterior to the fringe zone that are called strikes should be < 0.05
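
The three characteristics can be read as a simple check on three zone-level proportions. The sketch below is illustrative only: `check_fringe` is a hypothetical helper, the tolerance of 0.05 around 0.50 is an assumption, and the fringe and exterior inputs are made-up values (only 0.945 appears in this report).

```r
# Hypothetical helper: tests three zone-level strike proportions against the
# fringe-region characteristics. The tolerance around 0.50 is an assumption.
check_fringe <- function(p_fringe, p_interior, p_exterior, tol = 0.05) {
  c(
    fringe_near_half = abs(p_fringe - 0.50) <= tol,  # characteristic 1
    interior_high    = p_interior > 0.90,            # characteristic 2
    exterior_low     = p_exterior < 0.05             # characteristic 3
  )
}

# Illustrative inputs: 0.945 is the interior proportion reported in this
# document; the fringe (0.48) and exterior (0.02) values are made up.
result <- check_fringe(p_fringe = 0.48, p_interior = 0.945, p_exterior = 0.02)
print(result)
print(all(result))
```

A candidate rectangle would be accepted only when all three checks pass.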

By focusing our analysis on this fringe zone, we allow our classification model to determine whether any variables besides location are important without their being overshadowed by the locations of obvious strikes and obvious balls. See Appendix B for the construction of this fringe zone and its accompanying R code.

Figure 1


In Figure 1 above we see the entire location data and our feasible strike zone, fringe zone, and non-strike zone.

Figure 2


In Figure 2 above we have created a fringe region that has characteristic (1.). In particular, the proportion of pitches thrown in this region that are called strikes is close to 0.50. An important observation is that while the proportion is close to 0.50, there is still a clear pattern as to where the majority of the strikes vs balls were called. In particular, as we move toward the outside edges of the fringe region, the proportion of pitches called strikes decreases. This suggests that location, even within the fringe zone, will still have a significant impact on the probability of a called strike.

Let’s check that our fringe region has characteristics (2.) and (3.) in Figure 3 below.

Figure 3


As we can see, an overwhelming proportion of pitches in the interior of the fringe region are called strikes (0.945), while an overwhelming proportion of pitches in the exterior of the fringe region are called non-strikes.

As stated earlier, our goal is to discover whether the ball and strike count at a given pitch affects the probability of the pitch being called a strike. The interior and exterior regions would only cloud our analysis, as any model would be overwhelmed by the location data and fail to uncover other variables that may contribute to this probability.

From here on our analysis will only consist of the fringe zone data.

Situational Pitches

There are 12 ball and strike combinations that precede either a walk or a strikeout. We define 4 categories from these 12 combinations in an attempt to see whether they have an immediate impact on the probability of a strike being called. Counts are written as (balls, strikes).

Combination 1: “Pitcher Advantage”
0 or 1 balls, 2 strikes: (0,2), (1,2)

Combination 2: “Batter Advantage”
3 balls, 0 or 1 strikes: (3,0), (3,1)

Combination 3: “Play On”
All remaining counts: (0,0), (0,1), (1,0), (1,1), (2,0), (2,1), (2,2)

Combination 4: “Crunch Time”
3 balls and 2 strikes

##   Pitcher Adv Batter Adv   Play On Crunch Time
## 1   0.3182957  0.5220884 0.4698368    0.372549
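
As a rough sketch of how such a table could be produced, the code below maps each (balls, strikes) count to one of the four categories and averages a called-strike indicator within each category. The helper `count_category`, the `toy` data frame, and its `called_strike` column are hypothetical and made up for illustration; they are not the data or code behind the numbers above.

```r
# Hypothetical helper: maps a (balls, strikes) count to one of the four categories.
count_category <- function(balls, strikes) {
  ifelse(balls == 3 & strikes == 2, "Crunch Time",
  ifelse(strikes == 2 & balls <= 1, "Pitcher Adv",
  ifelse(balls == 3 & strikes <= 1, "Batter Adv", "Play On")))
}

# Enumerate all 12 possible counts and confirm how they split across categories.
counts <- expand.grid(balls = 0:3, strikes = 0:2)
counts$category <- count_category(counts$balls, counts$strikes)
tab <- table(counts$category)
print(tab)

# Made-up toy pitches (1 = called strike, 0 = ball), for illustration only.
toy <- data.frame(
  balls         = c(0, 1, 3, 3, 2, 3),
  strikes       = c(2, 2, 0, 1, 2, 2),
  called_strike = c(0, 1, 1, 0, 1, 0)
)
toy$category <- count_category(toy$balls, toy$strikes)

# Proportion of called strikes within each category.
print(tapply(toy$called_strike, toy$category, mean))
```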

It would appear that these four categories have a possible effect on the probability of a strike being called. In particular, in pitcher-advantage counts a pitch in the fringe zone is less likely to be called a strike (0.318, versus 0.522 in batter-advantage counts). We will investigate this notion further with a random forest and a gradient boosting machine and allow the models to decide the importance of the variables at hand.

As we saw in the earlier analysis of the fringe region, it is possible that the location variables will still significantly impact the probability due to the pattern of strikes and balls within the fringe region (recall: ‘inner fringe region’ had high proportion of strikes while ‘outer fringe region’ had low proportion of strikes).

Random Forest

Figure 4


As we can see in the above variable importance plot, our constructed categories of strike-ball combination have little importance when predicting the probability of a pitch being called strike or not. In both measures of variable importance it does not make the top 3, and in one of them it is actually the least important variable.

Thus, we will adjust our model. We will drop categories and instead include the raw strikes and balls variables.

library(randomForest)
exclude_columns <- c("X", "categories")
rf2 <- randomForest(y.response~.,data=newdat[!names(newdat) %in% exclude_columns],mtry=3, importance=TRUE, ntree=100)

##                  %IncMSE IncNodePurity
## pitch_type     5.2067366      49.11118
## release_speed  5.2998998     136.65263
## pitcher        3.0708136     135.16901
## balls          4.3550054      37.60387
## strikes       10.4361016      30.41261
## plate_x       41.8645788     273.35113
## plate_z       65.3520557     342.57448
## home_score     0.5969622      57.51555
## away_score     2.6710537      63.20423

Again, we see that strikes and balls are of little importance, relative to location, to the probability of a pitch in the fringe region being called a strike. The random forest model overwhelmingly suggests that plate_x and plate_z are the most influential predictors.

We will analyze this more formally with cross-validation. We will examine the cross-validated log likelihood of the random forest model with and without our additional categories variable, the raw ball and strike counts, and the location variables. See Appendix A for full code and details.

##     ll.full  ll.noCat ll.noBallStrike ll.noLocation
## 1 -2853.863 -2844.255       -2884.324     -3667.945
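
For reference, the quantity computed on each held-out fold in Appendix A can be written as follows, where \(y_i \in \{0, 1\}\) indicates a called strike and \(\hat{p}_i\) is the predicted strike probability (the symbols are notation introduced here; the 0.001 offset is taken from the Appendix A code):

\[
\ell = \sum_{i} \left[ y_i \log\left(\hat{p}_i + 0.001\right) + (1 - y_i) \log\left(1 - \hat{p}_i - 0.001\right) \right]
\]

Values closer to zero indicate predicted probabilities that agree better with the observed calls.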

We note two things here.
1.) The log likelihood of the full model with all predictors present (including our categories column, which groups the ball-strike combination into four levels) is comparable to that of the model with categories removed and to that of the model with the raw balls and strikes removed. The largest gap among those three is about 1.4% (-2884.324 versus -2844.255). In other words, including or excluding the ball-strike categories or the raw counts changes our log likelihood by less than 2%. This is further evidence that the ball-strike data does not significantly impact the probability of the pitch being called a strike.
2.) When we remove the location data, our model performs markedly worse: its log likelihood is roughly 27% to 29% larger in magnitude than those of the models that include location as a predictor. Again, this is further evidence that location has a significant impact on the probability of a strike being called, while the ball-strike combination does not.
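
The relative differences discussed in these two points can be recomputed directly from the four reported log likelihoods. This is only a small arithmetic sketch; the variable names are illustrative, and the choice of denominator (the best model for the spread, each location-including model for the drop) is an assumption about how to express "percent difference".

```r
# Cross-validated log likelihoods as reported in the table above.
ll <- c(full = -2853.863, noCat = -2844.255,
        noBallStrike = -2884.324, noLocation = -3667.945)

# The three models that keep the location variables.
with_location <- ll[c("full", "noCat", "noBallStrike")]

# Largest gap among those three, relative to the best (least negative) of them.
spread_pct <- 100 * (max(with_location) - min(with_location)) / abs(max(with_location))

# How much worse the no-location model is, relative to each of the other three.
drop_pct <- 100 * (with_location - ll[["noLocation"]]) / abs(with_location)

print(round(spread_pct, 1))
print(round(drop_pct, 1))
```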

Gradient Boosting Machines

gmb.fit <- gbm(formula = y.response~., data=boostdat[!names(boostdat) %in% exclude_columns], distribution='gaussian', n.trees=500, interaction.depth = 1, shrinkage=0.01)
varImp(gmb.fit, numTrees=500)
##                 Overall
## pitch_type       0.0000
## release_speed    0.0000
## pitcher          0.0000
## balls            0.0000
## strikes        191.0088
## plate_x       2366.2348
## plate_z       4302.3664
## home_score       0.0000
## away_score       0.0000

From the variable importance of the gradient boosting machine model, we again see that plate_x and plate_z are the important predictors. strikes makes a small contribution and every other variable has zero importance, so none of them significantly impacts the probability of a pitch in the fringe zone being called a strike.

Conclusion

While initial analysis of the strike-ball combinations suggested a possible influence on a pitch being called a strike (or not), it is clear from our classification models that even in the fringe region, plate_x and, to an even greater degree, plate_z have the most significant impact on this probability. This agrees with our initial observation of the pattern of strikes and balls within the fringe region by location.

Thus we can conclude that the umpires are making calls with little to no regard for the current strike-ball combination. It appears the MLB doesn’t have an umpire problem. Now let’s just check on the individual teams and those dubious play signals.

Appendix A: Cross-Validation of Random Forest Models to Determine Importance of Variables Balls, Strikes, and Location

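library(caret)         # createFolds
library(randomForest)

cv.rf.full = rep(NA, 10)
cv.rf.noCat = rep(NA, 10)
cv.rf.noBS = rep(NA, 10)
cv.rf.noLocation = rep(NA, 10)

# Create the folds once, outside the loop, so the 10 test sets partition the data
folds = createFolds(1:length(boostdat$y.response), k=10)

for(j in 1:10) {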
  train.data = boostdat[-folds[[j]],]
  test.data = boostdat[folds[[j]],]
  
  # Full model: all predictors, including categories, balls, and strikes
  exclude_columns <- c("X")
  rf.full <- randomForest(y.response~.,data=train.data[!names(boostdat) %in% exclude_columns],mtry=3, importance=TRUE, ntree=100)
  prob.rf.full = predict(rf.full, type='prob', newdata=test.data[!names(boostdat) %in% exclude_columns])[,2] #probability of 1 is in second column
  
  # Model without the four-level categories variable
  exclude_columns <- c("X", "categories")
  rf.noCat <- randomForest(y.response~.,data=train.data[!names(boostdat) %in% exclude_columns],mtry=3, importance=TRUE, ntree=100)
  prob.rf.noCat = predict(rf.noCat, type='prob', newdata=test.data[!names(boostdat) %in% exclude_columns])[,2]
  
  # Model without the raw balls and strikes counts
  exclude_columns <- c("X", "balls", "strikes")
  rf.noBS <- randomForest(y.response~.,data=train.data[!names(boostdat) %in% exclude_columns],mtry=3, importance=TRUE, ntree=100)
  prob.rf.noBS = predict(rf.noBS, type='prob', newdata=test.data[!names(boostdat) %in% exclude_columns])[,2]
  
  # Model without the location variables
  exclude_columns <- c("X", 'plate_x', 'plate_z')
  rf.noLocation <- randomForest(y.response~.,data=train.data[!names(boostdat) %in% exclude_columns],mtry=3, importance=TRUE, ntree=100)
  prob.rf.noLocation = predict(rf.noLocation, type='prob', newdata=test.data[!names(boostdat) %in% exclude_columns])[,2]
  
  y.test = as.numeric(test.data$y.response) - 1
  
  ll.rf.full = sum(y.test*log(prob.rf.full+0.001) + (1-y.test)*log(1-prob.rf.full - 0.001))
  #cv.rf.full = cv.rf.full + ll.rf.full
  cv.rf.full[j] = ll.rf.full
  
  ll.rf.noCat = sum(y.test*log(prob.rf.noCat + 0.001) + (1-y.test)*log(1-prob.rf.noCat-0.001))
  #cv.rf.noCat = cv.rf.noCat + ll.rf.noCat
  cv.rf.noCat[j] = ll.rf.noCat
  
  ll.rf.noBS = sum(y.test*log(prob.rf.noBS+0.001) + (1-y.test)*log(1-prob.rf.noBS-0.001))
  #cv.rf.noBS = cv.rf.noBS + ll.rf.noBS
  cv.rf.noBS[j] = ll.rf.noBS
  
  ll.rf.noLocation= sum(y.test*log(prob.rf.noLocation+0.001) + (1-y.test)*log(1-prob.rf.noLocation-0.001))
  cv.rf.noLocation[j] = ll.rf.noLocation
}

cv.rf.full.val = sum(cv.rf.full[is.finite(cv.rf.full)])
cv.rf.noCat.val = sum(cv.rf.noCat[is.finite(cv.rf.noCat)])
cv.rf.noBS.val = sum(cv.rf.noBS[is.finite(cv.rf.noBS)])
cv.rf.noLocation.val = sum(cv.rf.noLocation[is.finite(cv.rf.noLocation)])

a = data.frame(ll.full = cv.rf.full.val, ll.noCat =  cv.rf.noCat.val, ll.noBallStrike = cv.rf.noBS.val, ll.noLocation = cv.rf.noLocation.val)
a
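
One detail of the log likelihood above is worth illustrating: with the 0.001 offset, a predicted probability of exactly 1 makes `log(1 - p - 0.001)` undefined, so the fold total becomes NaN and is dropped by the `is.finite` filter. The sketch below shows this on made-up values and contrasts it with clipping the probabilities to a small interval. The clipping helper `clipped_loglik` is a hypothetical alternative swapped in for comparison; it is not the method used to produce the numbers in this report.

```r
# Hypothetical alternative: Bernoulli log likelihood with predicted
# probabilities clipped to [eps, 1 - eps] instead of shifted by an offset.
clipped_loglik <- function(y, p, eps = 0.001) {
  p <- pmin(pmax(p, eps), 1 - eps)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Made-up outcomes and predicted probabilities, including exact 0 and 1.
y <- c(1, 0, 1, 0)
p <- c(1.0, 0.0, 0.5, 1.0)

# Offset form used in the appendix: log(1 - 1 - 0.001) is NaN, so the sum is NaN.
offset_ll <- suppressWarnings(
  sum(y * log(p + 0.001) + (1 - y) * log(1 - p - 0.001))
)
print(is.finite(offset_ll))

# Clipped form stays finite for the same inputs.
clipped_ll <- clipped_loglik(y, p)
print(clipped_ll)
```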

Appendix B: Determining the Fringe Zone

To construct a fringe zone we first found the means of plate_x and plate_z, respectively, over the called strikes. These were our ‘centers’. While neither the data nor the strike zone is necessarily symmetric about its center, this gave us a good approximation of where to start.

From there we can expand a rectangle around this center by increasing its base and height incrementally. Mathematically, we seek \(\epsilon_x\) and \(\epsilon_z\) such that the proportion of pitches in the rectangle (extending \(\epsilon_x\) horizontally and \(\epsilon_z\) vertically on either side of the center) that are called strikes is approximately 0.95.
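
Stated as a formula, with \((c_x, c_z)\) denoting the center (notation introduced here for illustration), the rectangle and the target condition are:

\[
R(\epsilon_x, \epsilon_z) = \left\{ (x, z) : |x - c_x| < \epsilon_x,\ |z - c_z| < \epsilon_z \right\},
\qquad
\frac{\#\{\text{called strikes in } R\}}{\#\{\text{called strikes in } R\} + \#\{\text{balls in } R\}} \approx 0.95
\]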

See R code below for these first steps.

strike.data = data[description =='called_strike',]

strike.x = mean(strike.data$plate_x, na.rm = TRUE)
strike.z = mean(strike.data$plate_z, na.rm = TRUE)

findStrikeZone <- function(epsilons) {
  epsilonX = epsilons[1]
  epsilonZ = epsilons[2]
  rectangle = data[plate_x > (strike.x - epsilonX) & plate_x < (strike.x + epsilonX) & plate_z > (strike.z - epsilonZ) & plate_z < (strike.z + epsilonZ), ]
  strikes= nrow(rectangle[rectangle$description=='called_strike', ])
  balls= nrow(rectangle[rectangle$description == 'ball', ])
  total.pitches = strikes + balls
  strike.prop = strikes/total.pitches
  return(strike.prop)
}


x.seq = seq(0.01, 2.5, length.out = 60)
z.seq = seq(0.01, 2.5, length.out = 60)
gridxz = expand.grid(x.seq, z.seq)
props = apply(gridxz, 1, findStrikeZone)
gridxz = cbind(gridxz, props)
library(plot3D)
scatter3D(gridxz[,1], gridxz[,2], gridxz[,3])

As we can see from the plot, there is a feasible region of \(\epsilon_x\) and \(\epsilon_z\) that will produce the desired proportion of approximately 0.95.

feasible.values = gridxz[which(props >= 0.94 & props <= 0.96), ]

From here we adjusted the feasible values until we obtained a fringe region with the desired characteristics (see the Location and Strike Zone section at the beginning of the report for the three desired characteristics).

One thing to note is that the strike zone did not appear to be symmetric along the z-axis. In particular, the strike zone appeared to “stretch” vertically, so that it was larger above the “center” than below it. Thus, when creating our strike zone and fringe region, we took that into consideration. See the R code below for this subtle difference in our rectangle values.

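# Inner rectangle (strike zone): left, right, bottom, top
irl = strike.x - 0.75
irr = strike.x + 0.75 #0.769
irb = strike.z - 0.66 #0.685
irt = strike.z + 0.66+0.2 #0.685+0.2

# Outer rectangle: the fringe zone lies between the inner and outer rectangles
orl = irl - 0.35
orr = irr + 0.35
orb = irb - 0.35
ort = irt + 0.35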
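
Given the two rectangles, each pitch can be labeled as interior, fringe, or exterior. The following is a minimal sketch under stated assumptions: the center values `strike.x = 0` and `strike.z = 2.5` are placeholders (the actual means are not reported here), and `inside` and `classify_zone` are hypothetical helpers rather than the code used for the figures.

```r
# Placeholder centers; the report's actual means are not shown, so these are assumed.
strike.x <- 0.0
strike.z <- 2.5

# Inner and outer rectangles built with the same offsets as the code above.
inner <- list(l = strike.x - 0.75, r = strike.x + 0.75,
              b = strike.z - 0.66, t = strike.z + 0.66 + 0.2)
outer <- list(l = inner$l - 0.35, r = inner$r + 0.35,
              b = inner$b - 0.35, t = inner$t + 0.35)

# Hypothetical helper: TRUE where (x, z) lies strictly inside the rectangle.
inside <- function(x, z, rect) {
  x > rect$l & x < rect$r & z > rect$b & z < rect$t
}

# Hypothetical helper: interior if inside the inner rectangle, fringe if only
# inside the outer rectangle, exterior otherwise.
classify_zone <- function(x, z) {
  ifelse(inside(x, z, inner), "interior",
         ifelse(inside(x, z, outer), "fringe", "exterior"))
}

# Made-up pitch locations for illustration.
zones <- classify_zone(x = c(0, 0.9, 2, 0, 0), z = c(2.5, 2.5, 2.5, 3.5, 1.4))
print(zones)
```

Restricting the data to rows labeled "fringe" would then give the subset used for the modeling sections.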