Statistical Models For MLB Pitching and Hitting Metrics (2019)
Data Science II (STAT 301-2)
Data
Glimpse of 2019 Dataset (Only First 25 Games):
Codebook (defines baseball and dataset-specific terms):
Download codebook.txt

The dataset above emphasizes pitch-level data, which includes pertinent information about the trajectory of each pitch. It is joined together with the matching at-bat and game records for each pitch; I used join functions to merge the datasets as part of the data cleaning process. This ensures a complete picture of the various game situations.
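A minimal sketch of that join step, assuming the Kaggle file names (pitches.csv, atbats.csv, games.csv) and the shared keys ab_id and g_id; the actual cleaning code may differ:

```r
library(tidyverse)

# Read the three Kaggle tables (file names assumed from the dataset page)
pitches <- read_csv("pitches.csv")
atbats  <- read_csv("atbats.csv")
games   <- read_csv("games.csv")

# Attach at-bat context to each pitch, then game context to each at-bat
pitch_data <- pitches %>%
  left_join(atbats, by = "ab_id") %>%
  left_join(games, by = "g_id")
```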
Each observation represents a specific pitch, while the column variables describe characteristics of the pitch itself and its result at that point in time. This includes, but is not limited to, the game situation (score, number of outs, etc.), pitch type, pitch speed, batter's count, and result of the pitch. The data table above displays a quick glimpse at the first 25 MLB games recorded for the 2019 season and all pitches thrown in each of those games. There are 2,408 distinct games in the entire dataset for the 2019 regular season; I am not using all of them because of processing time and size constraints, but 25 games covering all 30 MLB teams should provide enough variability to produce a decently accurate model. The data was scraped from the official site MLB.com but was imported from Kaggle for simplicity (https://www.kaggle.com/pschale/mlb-pitch-data-20152018?select=pitches.csv).
The main reason I chose this data is that I am a current baseball player and enthusiast, and I find great value in using predictive analysis to anticipate player decisions, especially pitching strategies. In fact, I am very interested in pursuing a career in sports data analysis or sports technology, so I thought a great way to start building experience would be to complete this report using MLB data.
Research Question
My fundamental research question is the following:
1. What are the most probable types of pitches (primarily break angles and speeds) to be thrown deep in the count? Early in the count? To rephrase more generally, what are the most likely pitches given a specific game situation (accounting for factors such as the score of the game, the inning, etc.)?
The main idea behind this research question is ultimately "guessing" a given pitch as if you were in the place of an MLB hitter. This is a critical factor in hitting consistency and overall success, because anticipating where and what type of pitch will be thrown gives the hitter an edge over the competition. Moreover, we could extend this report with further analysis of specific pitchers currently in the MLB. I will not go into that much depth here, but assessing the tendencies of specific pitchers is valuable when a team knows which pitcher it is facing ahead of game time.
Exploratory Data Analysis
Distribution of Outcome Variables
Using the full dataset described above, I chose to explore the distribution of multiple predicted outcome variables: pitch_type, break_angle, start_speed, px, and pz (see codebook for further clarification). First, I perform a quick skim of the data and note any potential problems (such as missingness).
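A sketch of that skim, assuming the joined data frame pitch_data from the earlier sketch (with the game-situation variables already recoded as factors):

```r
library(skimr)

# Summarize every column: type, missingness, and distribution statistics
pitch_data %>%
  skim()
```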
| | |
|---|---|
| Name | Piped data |
| Number of rows | 7159 |
| Number of columns | 19 |
| Column type frequency: character | 6 |
| Column type frequency: factor | 5 |
| Column type frequency: numeric | 8 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| stand | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| p_throws | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| event | 0 | 1 | 4 | 26 | 0 | 24 | 0 |
| code | 0 | 1 | 1 | 2 | 0 | 14 | 0 |
| type | 0 | 1 | 1 | 2 | 0 | 14 | 0 |
| pitch_type | 23 | 1 | 2 | 2 | 0 | 10 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| inning | 0 | 1 | FALSE | 13 | 6: 893, 4: 814, 7: 804, 3: 793 |
| outs | 0 | 1 | FALSE | 3 | 0: 2568, 2: 2411, 1: 2180 |
| pitch_num | 0 | 1 | FALSE | 13 | 1: 1854, 2: 1637, 3: 1372, 4: 1037 |
| count_dif | 0 | 1 | FALSE | 6 | 0: 3428, -1: 1908, -2: 1025, -3: 417 |
| score_dif | 0 | 1 | FALSE | 24 | 0: 2192, -1: 681, 1: 673, -4: 495 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| ab_id | 0 | 1 | 2.019001e+09 | 535.60 | 2.01900e+09 | 2.01900e+09 | 2.019001e+09 | 2.019001e+09 | 2.019002e+09 |
| g_id | 0 | 1 | 2.019000e+08 | 7.24 | 2.01900e+08 | 2.01900e+08 | 2.019000e+08 | 2.019000e+08 | 2.019000e+08 |
| batter_id | 0 | 1 | 5.577128e+05 | 67667.51 | 4.05395e+05 | 5.02481e+05 | 5.719700e+05 | 6.064660e+05 | 6.709500e+05 |
| pitcher_id | 0 | 1 | 5.610370e+05 | 65620.89 | 4.07845e+05 | 5.18516e+05 | 5.719450e+05 | 6.072310e+05 | 6.736330e+05 |
| px | 23 | 1 | 1.000000e-02 | 0.86 | -3.61000e+00 | -5.70000e-01 | 2.000000e-02 | 6.000000e-01 | 3.310000e+00 |
| pz | 23 | 1 | 2.240000e+00 | 0.96 | -1.48000e+00 | 1.62000e+00 | 2.240000e+00 | 2.860000e+00 | 6.350000e+00 |
| start_speed | 23 | 1 | 8.817000e+01 | 6.26 | 5.06000e+01 | 8.43000e+01 | 8.930000e+01 | 9.300000e+01 | 1.007000e+02 |
| break_angle | 23 | 1 | 2.146000e+01 | 13.08 | 0.00000e+00 | 9.60000e+00 | 2.160000e+01 | 3.240000e+01 | 6.000000e+01 |
We notice here that 23 observations are missing data for all of our outcome variables (pitch_type, px, pz, start_speed, and break_angle). This is likely due to a failure to record the metrics for some pitches. In this case, we can drop these observations from the main dataset since they do not provide much useful information.
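Dropping those rows is straightforward; a sketch using tidyr's drop_na() on the outcome columns:

```r
# Remove the 23 pitches with unrecorded trajectory metrics
pitch_data <- pitch_data %>%
  drop_na(pitch_type, px, pz, start_speed, break_angle)
```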
Also, it will be useful to see the distribution of the continuous outcome variables:
[Histograms of the continuous outcome variables: start_speed, break_angle, px, and pz]
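A sketch of how these histograms could be produced with ggplot2, assuming the cleaned pitch_data from above:

```r
# Reshape the four continuous outcomes to long form and facet the histograms
pitch_data %>%
  pivot_longer(c(start_speed, break_angle, px, pz),
               names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ metric, scales = "free")
```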
Furthermore, I found the distribution of the discrete outcome variable pitch_type:
The bar plot tells us that the fastball is the most common type of pitch (as expected). It is also no surprise that most pitches end up close to the middle of the strike zone, as seen in the px and pz histograms. What is more interesting is that the most common pitch speeds are around 90-95 mph and the most common break angles are below 10 degrees (close to a straight fastball, but not quite).
Q-Q Plots
Q-Q plots are primarily utilized as a graphical tool to help us assess whether a set of data came from a theoretical distribution, most commonly the normal distribution. If we assume an outcome variable is normally distributed, a Q-Q plot can check this assumption:
Break Angles
Pitch Speed
Pitch Location X (outside, middle, or inside)
Pitch Location Z (high, middle, or low)
The Q-Q plots tell us that both px and pz are approximately normally distributed, while start_speed and break_angle are not.
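For reference, one of these Q-Q plots could be drawn as follows (a ggplot2 sketch; the other three outcomes are analogous):

```r
# Q-Q plot of pitch speed against a theoretical normal distribution;
# systematic departure from the line indicates non-normality
ggplot(pitch_data, aes(sample = start_speed)) +
  stat_qq() +
  stat_qq_line()
```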
Correlation Between Variables
[Correlation plots of the pitch metrics faceted by batter's count: 3-0; 2-0 and 3-1; 1-0, 2-1, and 3-2; even counts; 0-1 and 1-2; 0-2]
In summary, the plots above show a slight relationship between the at-bat count, inning number, and score differential on one side and the pitch type and pitch metrics on the other. To explore this relationship more in depth, I decided to generate the best model possible for predicting the break angle and speed of a pitch given the inning, batter's count differential, score differential, and number of outs. This is addressed in the next section.
Fitting Different Models
To accurately assess the data and generate an effective machine learning model, we should compare two different models and see which one performs better. In this case, I chose a random forest model and a k-nearest neighbors model. I chose these two because they are known to be accurate on data similar to mine and generally outperform simple linear models on complex datasets.
The first step was to separate the data into training and testing sets using stratified sampling (an 80/20 split, to be exact). I did this for both outcome variables we are assessing, start_speed and break_angle. The values below show the dimensions of the training and test sets, respectively:
Pitch Speed Training & Testing Set
## [1] 5712 19
## [1] 1424 19
Break Angle Training & Testing Set
## [1] 5711 19
## [1] 1425 19
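A sketch of the split using rsample's initial_split(), stratifying on each outcome (the seed is an arbitrary choice; the row counts correspond to the dimensions printed above):

```r
library(tidymodels)
set.seed(301)  # arbitrary seed for reproducibility

# 80/20 split stratified on pitch speed
speed_split <- initial_split(pitch_data, prop = 0.8, strata = start_speed)
speed_train <- training(speed_split)
speed_test  <- testing(speed_split)

# 80/20 split stratified on break angle
angle_split <- initial_split(pitch_data, prop = 0.8, strata = break_angle)
angle_train <- training(angle_split)
angle_test  <- testing(angle_split)
```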
I then set up the recipe for both outcome variables (using `prep()` and `bake()`), and the resulting data tables are shown below for pitch speed and break angle, respectively:
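A sketch of the pitch-speed recipe; the break-angle recipe is identical apart from the outcome:

```r
# Predict pitch speed from the four game-situation factors
speed_recipe <- recipe(
  start_speed ~ count_dif + outs + inning + score_dif,
  data = speed_train
)

# prep() estimates the recipe; bake(new_data = NULL) returns the
# processed training data, as displayed below
speed_recipe %>%
  prep() %>%
  bake(new_data = NULL)
```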
## # A tibble: 5,712 x 5
## count_dif outs inning score_dif start_speed
## <fct> <fct> <fct> <fct> <dbl>
## 1 0 0 1 0 88.8
## 2 0 1 1 0 89.9
## 3 0 1 1 0 85.7
## 4 1 1 1 0 85.4
## 5 0 1 1 0 84.6
## 6 -1 1 1 0 90.9
## 7 0 2 1 0 89
## 8 -1 2 1 0 84.4
## 9 -3 2 1 0 89.2
## 10 -3 2 1 0 86.4
## # ... with 5,702 more rows
## # A tibble: 5,711 x 5
## count_dif outs inning score_dif break_angle
## <fct> <fct> <fct> <fct> <dbl>
## 1 0 0 1 0 22.8
## 2 0 1 1 0 22.8
## 3 0 1 1 0 9.6
## 4 1 1 1 0 24
## 5 0 1 1 0 26.4
## 6 -1 1 1 0 27.6
## 7 0 2 1 0 34.8
## 8 -1 2 1 0 22.8
## 9 -2 2 1 0 25.2
## 10 -3 2 1 0 19.2
## # ... with 5,701 more rows
I then used V-fold cross-validation, with 10 folds repeated 5 times, to resample the training data.
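A sketch with rsample's vfold_cv():

```r
# 10-fold cross-validation, repeated 5 times, for each outcome
speed_folds <- vfold_cv(speed_train, v = 10, repeats = 5)
angle_folds <- vfold_cv(angle_train, v = 10, repeats = 5)
```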
## # 10-fold cross-validation repeated 5 times
## # A tibble: 50 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [5.1K/572]> Repeat1 Fold01
## 2 <split [5.1K/572]> Repeat1 Fold02
## 3 <split [5.1K/571]> Repeat1 Fold03
## 4 <split [5.1K/571]> Repeat1 Fold04
## 5 <split [5.1K/571]> Repeat1 Fold05
## 6 <split [5.1K/571]> Repeat1 Fold06
## 7 <split [5.1K/571]> Repeat1 Fold07
## 8 <split [5.1K/571]> Repeat1 Fold08
## 9 <split [5.1K/571]> Repeat1 Fold09
## 10 <split [5.1K/571]> Repeat1 Fold10
## # ... with 40 more rows
## # 10-fold cross-validation repeated 5 times
## # A tibble: 50 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [5.1K/572]> Repeat1 Fold01
## 2 <split [5.1K/571]> Repeat1 Fold02
## 3 <split [5.1K/571]> Repeat1 Fold03
## 4 <split [5.1K/571]> Repeat1 Fold04
## 5 <split [5.1K/571]> Repeat1 Fold05
## 6 <split [5.1K/571]> Repeat1 Fold06
## 7 <split [5.1K/571]> Repeat1 Fold07
## 8 <split [5.1K/571]> Repeat1 Fold08
## 9 <split [5.1K/571]> Repeat1 Fold09
## 10 <split [5.1K/571]> Repeat1 Fold10
## # ... with 40 more rows
Furthermore, we can set up both models with tuning placeholders that will be adjusted later in the report. This helps organizationally, and tuning the parameters will help maximize the accuracy of the chosen model.
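A parsnip sketch of the two specifications, with tune() marking the parameters to be tuned (the ranger importance argument matches the fitted call shown later):

```r
# Random forest: tune the number of predictors sampled at each split (mtry)
# and the minimal node size (min_n)
rf_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# k-nearest neighbors: tune the number of neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
```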
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Computational engine: ranger
## K-Nearest Neighbor Model Specification (regression)
##
## Main Arguments:
## neighbors = tune()
##
## Computational engine: kknn
Assessment of Both Models
The random forest tuning plots show how the minimal node size and the number of predictors (mtry) affect the RMSE, the value used to evaluate how good a model is at prediction.
The k-nearest neighbors tuning plots, for both break angle and pitch speed, show the effect of the number of neighbors on the RMSE.
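A sketch of the tuning step for the pitch-speed random forest; the grid ranges are assumptions inferred from the results tables below, and the other model/outcome combinations are analogous:

```r
# Bundle the model and recipe into a workflow
rf_speed_wflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(speed_recipe)

# Regular grid: mtry over 2-4 (3 levels), min_n over 2-40 (5 levels)
rf_grid <- grid_regular(
  mtry(range = c(2, 4)),
  min_n(range = c(2, 40)),
  levels = c(3, 5)
)

# Evaluate every combination across the 50 resamples
rf_speed_tuned <- rf_speed_wflow %>%
  tune_grid(resamples = speed_folds, grid = rf_grid)
```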
Tuning results for pitch speed (outcome variable)
## # A tibble: 20 x 10
## model_type mtry min_n .metric .estimator mean n std_err .config
## <chr> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rf 2 21 rmse standard 5.95 50 0.0224 Prepro~
## 2 rf 2 30 rmse standard 5.95 50 0.0220 Prepro~
## 3 rf 2 40 rmse standard 5.96 50 0.0222 Prepro~
## 4 rf 2 11 rmse standard 5.96 50 0.0233 Prepro~
## 5 rf 3 40 rmse standard 5.97 50 0.0228 Prepro~
## 6 rf 4 40 rmse standard 5.97 50 0.0231 Prepro~
## 7 rf 3 30 rmse standard 5.98 50 0.0230 Prepro~
## 8 rf 4 30 rmse standard 5.99 50 0.0233 Prepro~
## 9 rf 2 2 rmse standard 5.99 50 0.0246 Prepro~
## 10 rf 3 21 rmse standard 5.99 50 0.0232 Prepro~
## 11 rf 4 21 rmse standard 6.01 50 0.0234 Prepro~
## 12 rf 3 11 rmse standard 6.04 50 0.0245 Prepro~
## 13 rf 4 11 rmse standard 6.08 50 0.0253 Prepro~
## 14 rf 3 2 rmse standard 6.16 50 0.0276 Prepro~
## 15 rf 4 2 rmse standard 6.24 50 0.0293 Prepro~
## 16 knn NA NA rmse standard 6.28 50 0.0241 Prepro~
## 17 knn NA NA rmse standard 6.34 50 0.0246 Prepro~
## 18 knn NA NA rmse standard 6.44 50 0.0252 Prepro~
## 19 knn NA NA rmse standard 6.80 50 0.0305 Prepro~
## 20 knn NA NA rmse standard 8.05 50 0.0391 Prepro~
## # ... with 1 more variable: neighbors <int>
Tuning results for break angle (outcome variable)
## # A tibble: 20 x 10
## model_type mtry min_n .metric .estimator mean n std_err .config
## <chr> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rf 2 40 rmse standard 12.9 50 0.0405 Prepro~
## 2 rf 2 30 rmse standard 12.9 50 0.0407 Prepro~
## 3 rf 3 40 rmse standard 12.9 50 0.0419 Prepro~
## 4 rf 4 40 rmse standard 13.0 50 0.0417 Prepro~
## 5 rf 2 21 rmse standard 13.0 50 0.0414 Prepro~
## 6 rf 3 30 rmse standard 13.0 50 0.0417 Prepro~
## 7 rf 4 30 rmse standard 13.0 50 0.0418 Prepro~
## 8 rf 2 11 rmse standard 13.0 50 0.0413 Prepro~
## 9 rf 3 21 rmse standard 13.1 50 0.0422 Prepro~
## 10 rf 4 21 rmse standard 13.1 50 0.0425 Prepro~
## 11 rf 2 2 rmse standard 13.1 50 0.0417 Prepro~
## 12 rf 3 11 rmse standard 13.2 50 0.0426 Prepro~
## 13 rf 4 11 rmse standard 13.4 50 0.0426 Prepro~
## 14 knn NA NA rmse standard 13.5 50 0.0435 Prepro~
## 15 rf 3 2 rmse standard 13.5 50 0.0452 Prepro~
## 16 knn NA NA rmse standard 13.7 50 0.0439 Prepro~
## 17 rf 4 2 rmse standard 13.7 50 0.0468 Prepro~
## 18 knn NA NA rmse standard 13.9 50 0.0483 Prepro~
## 19 knn NA NA rmse standard 14.8 50 0.0564 Prepro~
## 20 knn NA NA rmse standard 17.5 50 0.0638 Prepro~
## # ... with 1 more variable: neighbors <int>
We can find the lowest RMSE score for each model and compare them. RMSE is the square root of the mean of the squared errors over the given dataset and is an excellent performance measure for numerical predictions. The lower the RMSE, the better.
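In symbols, for observed values $y_i$ and predictions $\hat{y}_i$ over $n$ pitches:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$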
Tuned Random Forest model for pitching speed
## # A tibble: 1 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 21 rmse standard 5.95 50 0.0224 Preprocessor1_Model07
Tuned Random Forest model for break angle
## # A tibble: 1 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 40 rmse standard 12.9 50 0.0405 Preprocessor1_Model13
Tuned k-nearest neighbor model for pitching speed
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 rmse standard 6.28 50 0.0241 Preprocessor1_Model5
Tuned k-nearest neighbor model for break angle
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 rmse standard 13.5 50 0.0435 Preprocessor1_Model5
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
##
## -- Model -----------------------------------------------------------------------
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), min.node.size = min_rows(~21L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 500
## Sample size: 5712
## Number of independent variables: 4
## Mtry: 2
## Target node size: 21
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 34.85219
## R squared (OOB): 0.1211724
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
##
## -- Model -----------------------------------------------------------------------
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), min.node.size = min_rows(~40L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 500
## Sample size: 5711
## Number of independent variables: 4
## Mtry: 2
## Target node size: 40
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 167.0845
## R squared (OOB): 0.02129066
After finalizing the workflow, I found that the random forest model produced the lowest RMSE on the training data. More specifically, 500 trees were used with an mtry value of 2, with a target node size of 21 for pitch speed and 40 for break angle. We can thus use this model to produce the best estimated RMSE values for both the speed of the pitch and the break angle. The finalized model is then evaluated on the testing sets I created earlier in the report.
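A sketch of that final step for pitch speed, assuming the tuning results object from earlier: select_best() picks the lowest-RMSE configuration, finalize_workflow() plugs it into the workflow, and last_fit() refits on the full training set and scores the held-out test set:

```r
# Take the best random forest configuration by RMSE
best_rf <- select_best(rf_speed_tuned, metric = "rmse")

# Finalize, refit on all training data, and evaluate on the test set
rf_final_fit <- rf_speed_wflow %>%
  finalize_workflow(best_rf) %>%
  last_fit(speed_split)

collect_metrics(rf_final_fit)  # test-set RMSE shown below
```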
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.89
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 12.8
Model Performance Evaluation
The test RMSE values of 5.89 (pitch speed) and 12.8 (break angle) are less than ideal: each is roughly the same size as the standard deviation of its outcome (6.26 mph and 13.08 degrees in the skim above), meaning the model explains only a small share of the variance, consistent with the OOB R-squared values of 0.12 and 0.02. Transforming or standardizing the outcome variables might have helped, since the scale of the RMSE depends on the range of the dependent variable being measured. Nevertheless, the random forest model actually performed slightly better on the test set than in cross-validation (5.89 vs. 5.95 for pitch speed and 12.8 vs. 12.9 for break angle), which surprised me, as I expected a little bit of overfitting.
Debrief
To improve the performance of my model, it would have been useful to incorporate data indicating the number of runners on base during each pitch of the 2019 season, since runners on base put more pressure on the pitcher to get an out and could therefore affect the speed and break angle of the pitch. Further benefits of the random forest model for my dataset are that it handles missing data well and fits nonlinear relationships well.
Some new research questions that would arise from the conclusion of my report are the following:
What are the likely outcomes of these pitches? Are they outs or hits?
In a more generalized sense, what is the most likely outcome of an entire game based on the sequences or number of different pitches thrown?