Abstract

Baseball is a popular sport in the United States of America. Analytics on baseball games can suggest ways to improve a team's performance [2]. This report shows that pitch type can be predicted from parameters associated with the pitch. Input data was collected at the University of Illinois, Urbana-Champaign by scraping from various sources. An extreme gradient boosting (XGBoost) algorithm was chosen for predicting pitch type. The model achieved an accuracy of 85.9% with per-class F1 scores of at least 0.69.


Introduction

Can machine learning algorithms correctly predict baseball pitch type from numerical inputs? Baseball is one of the most popular games played in the United States of America. It is also one of the oldest, having developed over 150 years ago. Given its popularity, the sport receives a great deal of attention from both players and spectators [1].

While most baseball fans can easily tell the pitch type by watching how the ball is released and travels to the batter, predicting pitch type automatically would significantly improve the ability to perform baseball analytics [2]. This report presents an analysis of baseball data prepared for the course STAT 432 at the University of Illinois, Urbana-Champaign and recommends an XGBoost-based model to predict baseball pitch type from a given set of inputs.


Methods

This is a multi-class classification problem. Accuracy and the per-class F1 score are used as the metrics for evaluating model performance. Definitions of these terms can be found on Wikipedia. The F1 score is the harmonic mean of precision and recall, so it penalizes a model that achieves either low precision or low recall. A model with better precision and recall combined has a better F1 score.
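As a minimal sketch of this penalty, the snippet below compares two cases with the same arithmetic mean of precision and recall. The function name `f1_score` and the input values are illustrative assumptions and are not taken from the analysis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not results from this report).
balanced = f1_score(0.5, 0.5)  # precision and recall are equal
lopsided = f1_score(0.9, 0.1)  # same arithmetic mean, but very low recall

print(f"balanced: {balanced:.2f}, lopsided: {lopsided:.2f}")
```

The lopsided case scores far lower than the balanced one even though both average to 0.5, which is the behavior that motivates using F1 alongside accuracy.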

Data

Data for the analysis was collected at the University of Illinois, Urbana-Champaign using the baseballr package written by Bill Petti. This package allows easy querying of Statcast and PITCHf/x data as provided by Baseball Savant.

The response variable is the pitch type. The predictor variables include effective speed, release spin rate, velocity of the pitch, and others. Information about these variables can be found in the data documentation.

The relationship among features can be visualized using a correlation plot. The plot below shows that some of the variables are highly correlated with each other. Highly correlated variables carry redundant information, so they are removed; this helps reduce test error as well as the computational resources required. The following patterns can be observed from the plot.

  • pfx_z is highly negatively correlated with release_pos_y.
  • vx0 is highly correlated with release_pos_x.
  • vy0 is highly correlated with release_pos_y.
  • release_pos_y is highly correlated with release_extension.
  • pfx_x and ax are highly correlated as well.

After looking at the correlation plot, the features pfx_z, vx0, vy0, release_extension, and pfx_x are removed from the training dataset. The correlation plot below is generated after removing these features.
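The features above were chosen by inspecting the plot. As a hedged sketch of how a similar filter could be automated, the snippet below greedily drops any column whose absolute Pearson correlation with an already kept column reaches a threshold. The 0.9 threshold, the greedy rule, the helper names, and the toy data are all assumptions for illustration and are not part of the original analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already kept column is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical, not from the pitch dataset): "b" is roughly 2 * "a".
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_correlated(toy)
print(kept)  # "b" is dropped because it is almost perfectly correlated with "a"
```

Which member of a correlated pair survives depends on column order in this sketch, so a manual review of the plot, as done in this report, remains a reasonable way to decide which feature to keep.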

The effect of the features release_speed, release_spin_rate, release_pos_x, release_pos_y, release_pos_z, and ax was studied. The values of release speed, release spin rate, and acceleration of the pitch in the x direction (ax) differ across pitch types.

Modeling

The dataset is split into training and testing sets. The training dataset is further split into estimation and validation sets. For cross-validation, 5-fold cross-validation is used. Models are built using the algorithms below.

  1. Decision Trees
  2. Extreme Gradient Boosting Machines

Other algorithms such as k-nearest neighbors could also be used, but because the dataset is very large they would be computationally heavy. Hence, they were not tried.
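The splitting scheme described in this section can be sketched as follows. This is an illustration under stated assumptions: the 80/20 train/test fraction, the seed, the function names, and the 100-row dataset are hypothetical, since the report does not state the split sizes.

```python
import random


def train_test_split(indices, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(indices, k=5):
    """Yield (estimation, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        estimation = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield estimation, validation


# Hypothetical dataset of 100 rows, referenced by row index.
train, test = train_test_split(range(100))
splits = list(k_fold(train, k=5))
print(len(train), len(test), len(splits))
```

Each fold serves once as the validation set while the remaining four folds form the estimation set; the test set is held out of this loop entirely.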


Results

The decision tree classifier achieved an accuracy of 0.6447 with a complexity parameter of cp = 0.04634. The classifier could not classify most of the pitch types correctly.

The extreme gradient boosting model achieved an accuracy of 0.859, which is better than the test accuracy obtained by the decision tree model. The table below reports the per-class performance of the XGBoost model on the test set.

XGBoost Model Performance Results

Pitch type  Sensitivity  Specificity  Precision  F1
CH          0.90         0.99         0.93       0.92
CU          0.85         0.99         0.86       0.85
FC          0.77         0.98         0.68       0.72
FF          0.93         0.97         0.96       0.94
FS          0.80         1.00         0.66       0.72
FT          0.71         0.97         0.70       0.70
KC          0.78         0.99         0.62       0.69
SI          0.75         0.97         0.66       0.70
SL          0.84         0.98         0.89       0.87

The model achieves an F1 score of approximately 0.7 or higher for every class; the lowest is 0.69 for KC.
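For readers unfamiliar with how such per-class columns are derived, the sketch below computes sensitivity, specificity, precision, and F1 in a one-vs-rest fashion from a confusion matrix. The function name, the row/column convention, and the three-class counts are hypothetical; they are not the confusion matrix from this analysis.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest metrics from a confusion matrix.

    Assumed convention: confusion[i][j] counts observations whose true
    class is labels[i] and whose predicted class is labels[j].
    Assumes every class is both present and predicted at least once.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted other
        fp = sum(row[i] for row in confusion) - tp  # predicted i, true class other
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)                # also called recall
        specificity = tn / (tn + fp)
        precision = tp / (tp + fp)
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
        metrics[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
            "f1": f1,
        }
    return metrics


# Hypothetical counts for three pitch types (illustration only).
labels = ["CH", "CU", "FF"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
results = per_class_metrics(confusion, labels)
for label in labels:
    print(label, {name: round(value, 2) for name, value in results[label].items()})
```

Specificity tends to be high for every class in a multi-class setting because the "rest" group is large, which is why the F1 column is the more informative one for comparing classes.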


Discussion

The extreme gradient boosting algorithm achieved the highest accuracy, 85.9%, with an F1 score of approximately 0.7 or higher for every class. Hence, this model is chosen as the final model for predicting pitch type.


References

  1. Cork G., (Feb 24, 2015) Major League Baseball still leads the NBA when it comes to popularity, Business Insider, https://www.businessinsider.com/major-league-baseball-nba-popularity-2015-2
  2. Braham W., Brendan H., (Feb 21, 2019). Changing the Game: How Data Analytics Is Upending Baseball, wharton.upenn.edu https://knowledge.wharton.upenn.edu/article/analytics-in-baseball/
  3. Dalpiaz D., (Oct 28, 2020). R for Statistical Learning. https://daviddalpiaz.github.io/r4sl/

Appendix

Below is the confusion matrix of the decision tree classifier's predictions on the test set.