Statistical Models For MLB Pitching and Hitting Metrics (2019)

Data Science II (STAT 301-2)

Data

Glimpse of 2019 Dataset (Only First 25 Games):

Codebook (defines baseball and dataset-specific terms):

Download codebook.txt


The dataset above focuses on pitch-level data, including pertinent information about the trajectory of each pitch. It is joined with the matching at-bat and game records for each pitch; I merged the tables with a join as part of the data cleaning process, which ensures a complete picture of each game situation.
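A minimal sketch of that cleaning step (the file names and key columns here are assumptions based on the layout of the Kaggle dataset, where pitches link to at-bats via ab_id and at-bats link to games via g_id):

```r
library(dplyr)
library(readr)

# Read the three raw tables from Kaggle
pitches <- read_csv("pitches.csv")
atbats  <- read_csv("atbats.csv")
games   <- read_csv("games.csv")

# Attach the at-bat and game context to every pitch
pitch_data <- pitches %>%
  left_join(atbats, by = "ab_id") %>%
  left_join(games,  by = "g_id")
```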

Each observation represents a single pitch, while the columns describe the pitch itself and its result at that point in time. This includes, but is not limited to, the game situation (score, number of outs, etc.), pitch type, pitch speed, batter’s count, and the result of the pitch. The data table above gives a quick glimpse at the first 25 MLB games recorded for the 2019 season and every pitch thrown in those games. The full dataset contains 2,408 distinct games for the 2019 regular season; I am not using all of it because of processing time and size constraints, but 25 games covering all 30 MLB teams should provide enough variability to produce a decently accurate model. The data was scraped from the official site MLB.com but was imported from Kaggle for simplicity (https://www.kaggle.com/pschale/mlb-pitch-data-20152018?select=pitches.csv).

The main reason I chose this data is that I am a current baseball player and enthusiast, and I find great value in using predictive analysis to anticipate player decisions, especially pitching strategies. In fact, I am very interested in pursuing a career in sports data analysis or sports technology, so completing this report with MLB data seemed like a great way to start building experience.

Research Question

My fundamental research question is the following:

1. What are the most probable types of pitches (primarily break angles and speeds) to be thrown deep in the count? Early in the count? More generally, what are the most likely pitches given a specific game situation (accounting for factors such as the score of the game, inning, etc.)?

The main idea behind this research question is ultimately “guessing” a given pitch as if you were in the place of an MLB hitter. This is a critical factor in hitting consistency and overall success, because anticipating where and what type of pitch will be thrown gives the hitter an edge over the competition. Moreover, this report could be extended with further analysis of specific pitchers currently in the MLB. I will not go into that much depth here, but assessing the tendencies of specific pitchers is valuable when a team knows whom it is facing ahead of game time.

Exploratory Data Analysis

Distribution of Outcome Variables

Using the full dataset described above, I chose to explore the distribution of multiple predicted outcome variables: pitch_type, break_angle, start_speed, px, and pz (see codebook for further clarification). First, I perform a quick skim of the data and note any potential problems (such as missingness).
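The summary below was produced with skimr; a minimal sketch, assuming the joined 25-game data frame is named pitch_data (the column selection mirrors the 19 variables summarized below):

```r
library(skimr)

# Skim the 19 variables used in this report to check types and missingness
pitch_data %>%
  select(stand, p_throws, event, code, type, pitch_type,   # character
         inning, outs, pitch_num, count_dif, score_dif,    # factor
         ab_id, g_id, batter_id, pitcher_id,               # ID numbers
         px, pz, start_speed, break_angle) %>%             # pitch metrics
  skim()
```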

Data summary
Name Piped data
Number of rows 7159
Number of columns 19
_______________________
Column type frequency:
character 6
factor 5
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
stand 0 1 1 1 0 2 0
p_throws 0 1 1 1 0 2 0
event 0 1 4 26 0 24 0
code 0 1 1 2 0 14 0
type 0 1 1 2 0 14 0
pitch_type 23 1 2 2 0 10 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
inning 0 1 FALSE 13 6: 893, 4: 814, 7: 804, 3: 793
outs 0 1 FALSE 3 0: 2568, 2: 2411, 1: 2180
pitch_num 0 1 FALSE 13 1: 1854, 2: 1637, 3: 1372, 4: 1037
count_dif 0 1 FALSE 6 0: 3428, -1: 1908, -2: 1025, -3: 417
score_dif 0 1 FALSE 24 0: 2192, -1: 681, 1: 673, -4: 495

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
ab_id 0 1 2.019001e+09 535.60 2.01900e+09 2.01900e+09 2.019001e+09 2.019001e+09 2.019002e+09
g_id 0 1 2.019000e+08 7.24 2.01900e+08 2.01900e+08 2.019000e+08 2.019000e+08 2.019000e+08
batter_id 0 1 5.577128e+05 67667.51 4.05395e+05 5.02481e+05 5.719700e+05 6.064660e+05 6.709500e+05
pitcher_id 0 1 5.610370e+05 65620.89 4.07845e+05 5.18516e+05 5.719450e+05 6.072310e+05 6.736330e+05
px 23 1 1.000000e-02 0.86 -3.61000e+00 -5.70000e-01 2.000000e-02 6.000000e-01 3.310000e+00
pz 23 1 2.240000e+00 0.96 -1.48000e+00 1.62000e+00 2.240000e+00 2.860000e+00 6.350000e+00
start_speed 23 1 8.817000e+01 6.26 5.06000e+01 8.43000e+01 8.930000e+01 9.300000e+01 1.007000e+02
break_angle 23 1 2.146000e+01 13.08 0.00000e+00 9.60000e+00 2.160000e+01 3.240000e+01 6.000000e+01

We notice here that 23 observations are missing data for all of our outcome variables, likely because the metrics were not recorded for some pitches. Since these observations do not provide much useful information, we can drop them from the main dataset.
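A minimal sketch of that cleanup, assuming tidyr:

```r
library(tidyr)

# Drop pitches whose tracking metrics were never recorded
pitch_data <- pitch_data %>%
  drop_na(pitch_type, px, pz, start_speed, break_angle)
```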

Also, it will be useful to see the distribution of the continuous outcome variables:
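The histograms below were built along these lines (a sketch, assuming ggplot2; the binwidth is illustrative):

```r
library(ggplot2)

# Distribution of pitch speed; the same pattern applies to
# break_angle, px, and pz
ggplot(pitch_data, aes(x = start_speed)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Pitch speed (mph)", y = "Count")
```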

(Histograms of the continuous outcome variables: start_speed, break_angle, px, and pz.)

Furthermore, I found the distribution of the discrete outcome variable pitch_type:

The last bar plot tells us that the fastball is the most common pitch type, as expected. It is also no surprise that most pitches end up close to the middle of the strike zone, as seen in the px and pz histograms. More interesting is that the most common pitch speeds are around 90-95 mph and the most common break angles are below 10 degrees (close to a straight fastball, but not quite).

Q-Q Plots

Q-Q plots are primarily used as a graphical tool to assess whether a set of data came from a theoretical distribution such as the normal. If we assume an outcome variable is normally distributed, a Q-Q plot can check that assumption:
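Each plot below can be drawn with a few lines of ggplot2 (a sketch; break_angle shown, the other outcomes are analogous):

```r
# Compare sample quantiles of break_angle to theoretical normal quantiles
ggplot(pitch_data, aes(sample = break_angle)) +
  stat_qq() +
  stat_qq_line()
```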

Break Angles


Pitch Speed


Pitch Location X (outside, middle, or inside)


Pitch Location Z (high, middle, or low)

The Q-Q plots tell us that px and pz are approximately normally distributed, while start_speed and break_angle are not.

Correlation Between Variables

3-0 COUNT

2-0 COUNT, 3-1 COUNT

1-0 COUNT, 2-1 COUNT, 3-2 COUNT

EVEN COUNT

0-1 COUNT, 1-2 COUNT

0-2 COUNT

In summary, the above plots show slight relationships between the at-bat count, inning number, and score differential and the pitch type and metrics. To explore these relationships in more depth, I decided to generate the best model possible for predicting the break angle and speed of a pitch given the inning, batter’s count differential, score differential, and number of outs. This is addressed in the next section.

Fitting Different Models

To assess the data accurately and generate an effective machine learning model, it is important to compare two different models and see which performs best. In this case, I chose a random forest model and a k-nearest neighbors model, because both are known to perform well on data similar to mine and generally outperform simple linear models on complex datasets.

The first step was to separate the data into training and testing sets using stratified sampling (an 80/20 split, to be exact). I did this for both outcome variables we are assessing, start_speed and break_angle. The values below show the dimensions of the training and test sets, respectively:
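A sketch of the split with rsample (the seed is an arbitrary choice for reproducibility):

```r
library(tidymodels)

set.seed(301)

# 80/20 split, stratified on the outcome; repeated with
# strata = break_angle for the second outcome
speed_split <- initial_split(pitch_data, prop = 0.8, strata = start_speed)
speed_train <- training(speed_split)
speed_test  <- testing(speed_split)
```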

Pitch Speed Training & Testing Set

## [1] 5712   19
## [1] 1424   19

Break Angle Training & Testing Set

## [1] 5711   19
## [1] 1425   19

I then set up the recipe for both outcome variables (using `prep()` and `bake()`); the resulting data tables are shown below for pitch speed and break angle, respectively:
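A sketch of the recipe for the pitch-speed outcome (the break-angle recipe is identical except for the left-hand side):

```r
# Predict pitch speed from the four game-situation variables
speed_recipe <- recipe(start_speed ~ count_dif + outs + inning + score_dif,
                       data = speed_train)

# prep() estimates the preprocessing; bake(new_data = NULL)
# returns the processed training data shown below
speed_recipe %>%
  prep() %>%
  bake(new_data = NULL)
```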

## # A tibble: 5,712 x 5
##    count_dif outs  inning score_dif start_speed
##    <fct>     <fct> <fct>  <fct>           <dbl>
##  1 0         0     1      0                88.8
##  2 0         1     1      0                89.9
##  3 0         1     1      0                85.7
##  4 1         1     1      0                85.4
##  5 0         1     1      0                84.6
##  6 -1        1     1      0                90.9
##  7 0         2     1      0                89  
##  8 -1        2     1      0                84.4
##  9 -3        2     1      0                89.2
## 10 -3        2     1      0                86.4
## # ... with 5,702 more rows
## # A tibble: 5,711 x 5
##    count_dif outs  inning score_dif break_angle
##    <fct>     <fct> <fct>  <fct>           <dbl>
##  1 0         0     1      0                22.8
##  2 0         1     1      0                22.8
##  3 0         1     1      0                 9.6
##  4 1         1     1      0                24  
##  5 0         1     1      0                26.4
##  6 -1        1     1      0                27.6
##  7 0         2     1      0                34.8
##  8 -1        2     1      0                22.8
##  9 -2        2     1      0                25.2
## 10 -3        2     1      0                19.2
## # ... with 5,701 more rows

I then utilized V-fold cross-validation with 10 folds, repeated 5 times, to fold the training data.
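A sketch of the folds, assuming the training sets from above (angle_train comes from the analogous break-angle split):

```r
# 10-fold cross-validation, repeated 5 times, for each outcome
speed_folds <- vfold_cv(speed_train, v = 10, repeats = 5)
angle_folds <- vfold_cv(angle_train, v = 10, repeats = 5)
```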

## #  10-fold cross-validation repeated 5 times 
## # A tibble: 50 x 3
##    splits             id      id2   
##    <list>             <chr>   <chr> 
##  1 <split [5.1K/572]> Repeat1 Fold01
##  2 <split [5.1K/572]> Repeat1 Fold02
##  3 <split [5.1K/571]> Repeat1 Fold03
##  4 <split [5.1K/571]> Repeat1 Fold04
##  5 <split [5.1K/571]> Repeat1 Fold05
##  6 <split [5.1K/571]> Repeat1 Fold06
##  7 <split [5.1K/571]> Repeat1 Fold07
##  8 <split [5.1K/571]> Repeat1 Fold08
##  9 <split [5.1K/571]> Repeat1 Fold09
## 10 <split [5.1K/571]> Repeat1 Fold10
## # ... with 40 more rows
## #  10-fold cross-validation repeated 5 times 
## # A tibble: 50 x 3
##    splits             id      id2   
##    <list>             <chr>   <chr> 
##  1 <split [5.1K/572]> Repeat1 Fold01
##  2 <split [5.1K/571]> Repeat1 Fold02
##  3 <split [5.1K/571]> Repeat1 Fold03
##  4 <split [5.1K/571]> Repeat1 Fold04
##  5 <split [5.1K/571]> Repeat1 Fold05
##  6 <split [5.1K/571]> Repeat1 Fold06
##  7 <split [5.1K/571]> Repeat1 Fold07
##  8 <split [5.1K/571]> Repeat1 Fold08
##  9 <split [5.1K/571]> Repeat1 Fold09
## 10 <split [5.1K/571]> Repeat1 Fold10
## # ... with 40 more rows

Furthermore, we can set up both models and flag the parameters that will be tuned later in the report. This is helpful organizationally, and tuning those parameters will help maximize the accuracy of the chosen model.
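A sketch of the two specifications (the engines are taken from the printouts below; the impurity importance setting matches the fitted ranger calls later in the report):

```r
# Random forest with tuned mtry and minimal node size
rf_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# K-nearest neighbors with a tuned number of neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
```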

## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Computational engine: ranger
## K-Nearest Neighbor Model Specification (regression)
## 
## Main Arguments:
##   neighbors = tune()
## 
## Computational engine: kknn

Assessment of Both Models

The random forest tuning plots show how the minimal node size and the number of randomly selected predictors (mtry) affect the RMSE, the value used to evaluate how well a model predicts.

The k-nearest neighbors tuning plots, for both break angle and pitch speed, show the effect of the number of neighbors on the RMSE.
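The tuning itself pairs each model with the recipe in a workflow and evaluates a regular grid over the repeated CV folds. A sketch for pitch speed, where the grid ranges are assumptions chosen to match the 15 random forest and 5 KNN candidates below:

```r
rf_workflow <- workflow() %>%
  add_recipe(speed_recipe) %>%
  add_model(rf_model)

# 3 mtry values x 5 min_n values = 15 random forest candidates
rf_tuned <- tune_grid(
  rf_workflow,
  resamples = speed_folds,
  grid = grid_regular(mtry(range = c(2, 4)),
                      min_n(range = c(2, 40)),
                      levels = c(3, 5))
)

# 5 candidate neighbor counts for KNN
knn_tuned <- tune_grid(
  workflow() %>% add_recipe(speed_recipe) %>% add_model(knn_model),
  resamples = speed_folds,
  grid = grid_regular(neighbors(range = c(1, 15)), levels = 5)
)
```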

Tuning results for pitch speed (outcome variable)

## # A tibble: 20 x 10
##    model_type  mtry min_n .metric .estimator  mean     n std_err .config
##    <chr>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
##  1 rf             2    21 rmse    standard    5.95    50  0.0224 Prepro~
##  2 rf             2    30 rmse    standard    5.95    50  0.0220 Prepro~
##  3 rf             2    40 rmse    standard    5.96    50  0.0222 Prepro~
##  4 rf             2    11 rmse    standard    5.96    50  0.0233 Prepro~
##  5 rf             3    40 rmse    standard    5.97    50  0.0228 Prepro~
##  6 rf             4    40 rmse    standard    5.97    50  0.0231 Prepro~
##  7 rf             3    30 rmse    standard    5.98    50  0.0230 Prepro~
##  8 rf             4    30 rmse    standard    5.99    50  0.0233 Prepro~
##  9 rf             2     2 rmse    standard    5.99    50  0.0246 Prepro~
## 10 rf             3    21 rmse    standard    5.99    50  0.0232 Prepro~
## 11 rf             4    21 rmse    standard    6.01    50  0.0234 Prepro~
## 12 rf             3    11 rmse    standard    6.04    50  0.0245 Prepro~
## 13 rf             4    11 rmse    standard    6.08    50  0.0253 Prepro~
## 14 rf             3     2 rmse    standard    6.16    50  0.0276 Prepro~
## 15 rf             4     2 rmse    standard    6.24    50  0.0293 Prepro~
## 16 knn           NA    NA rmse    standard    6.28    50  0.0241 Prepro~
## 17 knn           NA    NA rmse    standard    6.34    50  0.0246 Prepro~
## 18 knn           NA    NA rmse    standard    6.44    50  0.0252 Prepro~
## 19 knn           NA    NA rmse    standard    6.80    50  0.0305 Prepro~
## 20 knn           NA    NA rmse    standard    8.05    50  0.0391 Prepro~
## # ... with 1 more variable: neighbors <int>

Tuning results for break angle (outcome variable)

## # A tibble: 20 x 10
##    model_type  mtry min_n .metric .estimator  mean     n std_err .config
##    <chr>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
##  1 rf             2    40 rmse    standard    12.9    50  0.0405 Prepro~
##  2 rf             2    30 rmse    standard    12.9    50  0.0407 Prepro~
##  3 rf             3    40 rmse    standard    12.9    50  0.0419 Prepro~
##  4 rf             4    40 rmse    standard    13.0    50  0.0417 Prepro~
##  5 rf             2    21 rmse    standard    13.0    50  0.0414 Prepro~
##  6 rf             3    30 rmse    standard    13.0    50  0.0417 Prepro~
##  7 rf             4    30 rmse    standard    13.0    50  0.0418 Prepro~
##  8 rf             2    11 rmse    standard    13.0    50  0.0413 Prepro~
##  9 rf             3    21 rmse    standard    13.1    50  0.0422 Prepro~
## 10 rf             4    21 rmse    standard    13.1    50  0.0425 Prepro~
## 11 rf             2     2 rmse    standard    13.1    50  0.0417 Prepro~
## 12 rf             3    11 rmse    standard    13.2    50  0.0426 Prepro~
## 13 rf             4    11 rmse    standard    13.4    50  0.0426 Prepro~
## 14 knn           NA    NA rmse    standard    13.5    50  0.0435 Prepro~
## 15 rf             3     2 rmse    standard    13.5    50  0.0452 Prepro~
## 16 knn           NA    NA rmse    standard    13.7    50  0.0439 Prepro~
## 17 rf             4     2 rmse    standard    13.7    50  0.0468 Prepro~
## 18 knn           NA    NA rmse    standard    13.9    50  0.0483 Prepro~
## 19 knn           NA    NA rmse    standard    14.8    50  0.0564 Prepro~
## 20 knn           NA    NA rmse    standard    17.5    50  0.0638 Prepro~
## # ... with 1 more variable: neighbors <int>

We can find the lowest RMSE score for each model and compare them. RMSE is the square root of the mean of the squared errors over the given dataset, and it is an excellent performance measure for numerical predictions: the lower the RMSE, the better.
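Formally, for observed values $y_i$ and predictions $\hat{y}_i$ over $n$ pitches:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$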

Tuned Random Forest model for pitching speed

## # A tibble: 1 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2    21 rmse    standard    5.95    50  0.0224 Preprocessor1_Model07

Tuned Random Forest model for break angle

## # A tibble: 1 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     2    40 rmse    standard    12.9    50  0.0405 Preprocessor1_Model13

Tuned k-nearest neighbor model for pitching speed

## # A tibble: 1 x 7
##   neighbors .metric .estimator  mean     n std_err .config             
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 rmse    standard    6.28    50  0.0241 Preprocessor1_Model5

Tuned k-nearest neighbor model for break angle

## # A tibble: 1 x 7
##   neighbors .metric .estimator  mean     n std_err .config             
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1        15 rmse    standard    13.5    50  0.0435 Preprocessor1_Model5
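The trained workflows printed below come from finalizing the random forest with its best cross-validated parameters and refitting on the full training set; a sketch for pitch speed (break angle is analogous):

```r
# Plug the best CV parameters into the workflow and refit
best_rf <- select_best(rf_tuned, metric = "rmse")

final_rf <- rf_workflow %>%
  finalize_workflow(best_rf) %>%
  fit(data = speed_train)
```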
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
## 
## -- Model -----------------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), min.node.size = min_rows(~21L, x), importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      5712 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 21 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       34.85219 
## R squared (OOB):                  0.1211724
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
## 
## -- Model -----------------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), min.node.size = min_rows(~40L, x), importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      5711 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 40 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       167.0845 
## R squared (OOB):                  0.02129066

After finalizing the workflows, I found that the random forest model produced the best (lowest) RMSE on the training data. More specifically, 500 trees were used with an mtry value of 2, with a target node size of 21 for pitch speed and 40 for break angle. We can thus use this model to produce final estimated RMSE values for both the speed of the pitch and the break angle by evaluating it on the testing sets created earlier in the report.
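A sketch of that test-set evaluation for pitch speed (break angle is analogous):

```r
# Predict on the held-out test set and compute RMSE with yardstick
final_rf %>%
  predict(new_data = speed_test) %>%
  bind_cols(speed_test) %>%
  rmse(truth = start_speed, estimate = .pred)
```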

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.89
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        12.8

Model Performance Evaluation

The test RMSE values of roughly 5.89 (pitch speed) and 12.8 (break angle) are less than ideal relative to the scales of the outcomes. Since RMSE is on the same scale as the dependent variable, it may have been beneficial to log-transform or otherwise rescale the outcome variables to make the scores easier to interpret and compare. Nevertheless, the random forest actually performed slightly better on the test set than in cross-validation (about 0.06 lower RMSE for pitch speed and 0.1 lower for break angle), which surprised me considering I expected a little bit of overfitting.

Debrief

To improve the performance of my model, it would have been useful to extract data indicating the number of runners on base during each pitch of the 2019 season. Runners on base put more pressure on the pitcher to record an out and so could affect the speed and break angle of the pitch. A further benefit of the random forest model for my dataset is that it handles missing data well and fits nonlinear relationships well.

Some new research questions that would arise from the conclusion of my report are the following:

  1. What are the likely outcomes of these pitches? Are they outs or hits?

  2. In a more generalized sense, what is the most likely outcome of an entire game based on the sequences or number of different pitches thrown?