The data I chose contains MLB pitch-level data for every pitch thrown from 2015 through 2018. It is broken into four separate data sets: one for at-bat details, one for general game information, one for pitch-level details, and one for player details.
The following table shows the dimensions of each table:

| Table | Variables | Observations |
|---|---|---|
| atbats | 11 | 740,389 |
| games | 19 | 9,699 |
| pitches | 39 | 2,852,973 |
| players | 3 | 2,218 |
For this project I chose to do some analytics on a specific pitcher, Brent Suter. In the real world, pitchers and coaches could both benefit from this data by learning how they can improve. For example, knowing where pitches are placed around the strike zone can improve strategy.
In the future, another good idea would be to use this data to train machine learning models that predict which pitch is coming next. If a coach could signal to the batter which pitch he thinks is coming, the batter could be better prepared to hit the ball.
Overall, there are many ways that baseball players and coaches can benefit from the insights gained from analyzing this data.
Click here to view the original source of data.
The following packages are required for this project:
library(dplyr) # Used as a fast, consistent tool for working with data frame like objects
library(data.table) # Used for a fast and friendly way to load the datasets
library(tidyr) # Used specifically for data tidying
library(stringr) # Used for string manipulation, such as splitting a string
library(ggplot2) # Used for creating elegant data visualizations
library(knitr) # Used for dynamic report generation
library(kableExtra) # Additional features for knitr 'kable' function
To begin, I have loaded the data files into R using the ‘fread’ function.
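As a minimal sketch of this step, the snippet below writes a tiny CSV to a temporary file and reads it back with `fread`. The temporary file and its two rows are stand-ins for illustration only; the real project would pass the paths of the downloaded CSV files, which are not given in this report.

```r
library(data.table)

# Hypothetical stand-in for one of the downloaded CSV files
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,first_name,last_name",
             "452657,Jon,Lester",
             "425794,Adam,Wainwright"), tmp)

# fread() detects the separator and the column types automatically
players_toy <- fread(tmp)

# Rows and columns that were read
dim(players_toy)
```

The same call would be repeated once per file to load the four tables.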
Next, I got a feel for the data by viewing the first 5 rows of each table.
| ab_id | batter_id | event | g_id | inning | o | p_score | p_throws | pitcher_id | stand | top |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.015e+09 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE |
| 2.015e+09 | 518792 | Double | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE |
| 2.015e+09 | 407812 | Single | 201500001 | 1 | 1 | 0 | L | 452657 | R | TRUE |
| 2.015e+09 | 425509 | Strikeout | 201500001 | 1 | 2 | 0 | L | 452657 | R | TRUE |
| 2.015e+09 | 571431 | Strikeout | 201500001 | 1 | 3 | 0 | L | 452657 | L | TRUE |
| attendance | away_final_score | away_team | date | elapsed_time | g_id | home_final_score | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | weather | wind | delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35055 | 3 | sln | 2015-04-05 | 184 | 201500001 | 0 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 degrees, clear | 7 mph, In from CF | 0 |
| 45909 | 1 | ana | 2015-04-06 | 153 | 201500002 | 4 | sea | 1:12 PM | Ron Kulpa | Brian Knight | Vic Carapazza | Larry Vanover | Safeco Field | 54 degrees, cloudy | 1 mph, Varies | 0 |
| 36969 | 2 | atl | 2015-04-06 | 156 | 201500003 | 1 | mia | 4:22 PM | Laz Diaz | Chris Guccione | Cory Blaser | Jeff Nelson | Marlins Park | 80 degrees, partly cloudy | 16 mph, In from CF | 16 |
| 31042 | 6 | bal | 2015-04-06 | 181 | 201500004 | 2 | tba | 3:12 PM | Ed Hickox | Paul Nauert | Mike Estabrook | Dana DeMuth | Tropicana Field | 72 degrees, dome | 0 mph, None | 0 |
| 45549 | 8 | bos | 2015-04-06 | 181 | 201500005 | 0 | phi | 3:08 PM | Phil Cuzzi | Tony Randazzo | Will Little | Gerry Davis | Citizens Bank Park | 71 degrees, partly cloudy | 11 mph, Out to RF | 0 |
| ab_id | ax | ay | az | b_count | b_score | break_angle | break_length | break_y | code | end_speed | nasty | on_1b | on_2b | on_3b | outs | pfx_x | pfx_z | pitch_num | pitch_type | px | pz | s_count | spin_dir | spin_rate | start_speed | sz_bot | sz_top | type | type_confidence | vx0 | vy0 | vz0 | x | x0 | y | y0 | z0 | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.015e+09 | 7.665 | 34.685 | -11.960 | 0 | 0 | -25.0 | 3.2 | 23.7 | C | 84.1 | 55 | FALSE | FALSE | FALSE | 0 | 4.16 | 10.93 | 1 | FF | 0.416 | 2.963 | 0 | 159.235 | 2305.052 | 92.9 | 1.72 | 3.56 | S | 2 | -6.409 | -136.065 | -3.995 | 101.1400 | 2.280 | 158.7800 | 50 | 5.302 | 3 |
| 2.015e+09 | 12.043 | 34.225 | -10.085 | 0 | 0 | -40.7 | 3.4 | 23.7 | S | 84.1 | 31 | FALSE | FALSE | FALSE | 0 | 6.57 | 12.00 | 2 | FF | -0.191 | 2.347 | 1 | 151.402 | 2689.935 | 92.8 | 1.72 | 3.56 | S | 2 | -8.411 | -135.690 | -5.980 | 124.2800 | 2.119 | 175.4100 | 50 | 5.307 | 5 |
| 2.015e+09 | 14.368 | 35.276 | -11.560 | 0 | 0 | -43.7 | 3.7 | 23.7 | F | 85.2 | 49 | FALSE | FALSE | FALSE | 0 | 7.61 | 10.88 | 3 | FF | -0.518 | 3.284 | 2 | 145.125 | 2647.972 | 94.1 | 1.72 | 3.56 | S | 2 | -9.802 | -137.668 | -3.337 | 136.7400 | 2.127 | 150.1100 | 50 | 5.313 | 1 |
| 2.015e+09 | 2.104 | 28.354 | -20.540 | 0 | 0 | -1.3 | 5.0 | 23.8 | B | 84.0 | 41 | FALSE | FALSE | FALSE | 0 | 1.17 | 6.45 | 4 | FF | -0.641 | 1.221 | 2 | 169.751 | 1289.590 | 91.0 | 1.74 | 3.35 | B | 2 | -8.071 | -133.005 | -6.567 | 109.6856 | 2.279 | 187.4635 | 50 | 5.210 | 13 |
| 2.015e+09 | -10.280 | 21.774 | -34.111 | 1 | 0 | 18.4 | 12.0 | 23.8 | B | 69.6 | 18 | FALSE | FALSE | FALSE | 0 | -8.43 | -1.65 | 5 | CU | -1.821 | 2.083 | 2 | 280.671 | 1374.569 | 75.4 | 1.72 | 3.56 | B | 2 | -6.309 | -110.409 | 0.325 | 146.5275 | 2.179 | 177.2428 | 50 | 5.557 | 13 |
| id | first_name | last_name |
|---|---|---|
| 452657 | Jon | Lester |
| 425794 | Adam | Wainwright |
| 457435 | Phil | Coke |
| 435400 | Jason | Motte |
| 519166 | Neil | Ramirez |
It is also a good idea to check for duplicates and missing values. In this case, there were no duplicates, but there were some missing values that needed to be taken care of.
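The duplicate check itself is not shown in this report. One way it could be done in base R is sketched here on a made-up data frame (`df` and its values are illustrative, not taken from the data):

```r
# Toy data frame (made up for illustration) with one repeated row
df <- data.frame(id = c(1, 2, 2, 3),
                 event = c("Single", "Double", "Double", "Walk"))

# Number of rows that repeat an earlier row
n_dupes <- sum(duplicated(df))
n_dupes

# Keep only the first occurrence of each row
df_unique <- df[!duplicated(df), ]
nrow(df_unique)
```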
# Count missing values per column in 'pitches'
colSums(is.na(pitches))
## ab_id ax ay az
## 0 14189 14189 14189
## b_count b_score break_angle break_length
## 0 0 14189 14189
## break_y code end_speed nasty
## 14189 0 14114 14189
## on_1b on_2b on_3b outs
## 0 0 0 0
## pfx_x pfx_z pitch_num pitch_type
## 14142 14142 0 0
## px pz s_count spin_dir
## 14189 14189 0 14189
## spin_rate start_speed sz_bot sz_top
## 14189 14114 2083 2083
## type type_confidence vx0 vy0
## 0 14189 14189 14189
## vz0 x x0 y
## 14189 0 14189 0
## y0 z0 zone
## 14189 14189 14189
# Drop every row that has at least one missing value
pitches <- na.omit(pitches)
The doubleheader games also contain some bad data. For some doubleheaders, the first game shows an attendance of 0 or 1. These rows were removed.
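A minimal sketch of that filter, assuming it keeps only games whose recorded attendance is above 1. The `games_toy` data frame is made up for illustration:

```r
# Toy version of the games table (values made up for illustration)
games_toy <- data.frame(g_id = c(201500001, 201500002, 201500003, 201500004),
                        attendance = c(35055, 0, 1, 45909))

# Keep only games whose recorded attendance is above 1
games_clean <- games_toy[games_toy$attendance > 1, ]
games_clean
```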
I converted several variables to factors, including event, away team, home team, venue, code, pitch type, and play type. I also split two columns in two: weather became temperature and forecast, and wind became wind speed and wind direction.
| temp | forecast | wind_speed | wind_dir |
|---|---|---|---|
| 44 | clear | 7 | In from CF |
| 54 | cloudy | 1 | Varies |
| 80 | partly | 16 | In from CF |
| 72 | dome | 0 | None |
| 71 | partly | 11 | Out to RF |
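The split could be sketched with `tidyr::separate` as below. This is an assumption about the approach, not the code used for this report. Note that the table above shows `partly` where the raw value was `partly cloudy`; the sketch splits on the full ` degrees, ` delimiter, so it keeps both words. The `games_toy` data frame copies two raw values from the games table shown earlier.

```r
library(tidyr)

# Two raw weather/wind strings copied from the games table above
games_toy <- data.frame(
  weather = c("44 degrees, clear", "80 degrees, partly cloudy"),
  wind    = c("7 mph, In from CF", "16 mph, In from CF"),
  stringsAsFactors = FALSE
)

# Split each column at its literal delimiter; convert = TRUE turns
# the numeric pieces into integers
games_split <- separate(games_toy, weather, into = c("temp", "forecast"),
                        sep = " degrees, ", convert = TRUE)
games_split <- separate(games_split, wind, into = c("wind_speed", "wind_dir"),
                        sep = " mph, ", convert = TRUE)
games_split
```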
Additionally, I created boxplots to look for outliers. There was one variable, delay (in minutes), that had an outlier of 1860 minutes.
The following is a look at the specific outlier observation(s), which I removed from the data set.
| | attendance | away_final_score | away_team | date | elapsed_time | g_id | home_final_score | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | temp | forecast | wind_speed | wind_dir | delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9608 | 36508 | 6 | mil | 2018-09-24 | 229 | 201802339 | 4 | sln | 7:16 PM | Will Little | Ted Barrett | Mark Carlson | Lance Barksdale | Busch Stadium | 78 | cloudy | 5 | Out to LF | 1860 |
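The boxplot check can also be done numerically. The sketch below uses base R's `boxplot.stats`, which returns the points a boxplot would draw beyond its whiskers. The delay values are made up, apart from 1860, which mirrors the outlier above. On the real data this rule may flag more rows than the single extreme one, so it is worth inspecting the flagged rows before dropping anything.

```r
# Toy delay values in minutes (made up, except 1860)
games_toy <- data.frame(g_id = 1:8,
                        delay = c(0, 0, 16, 0, 5, 0, 12, 1860))

# Values lying beyond the boxplot whiskers
flagged <- boxplot.stats(games_toy$delay)$out
flagged

# Inspect the flagged rows, then drop them
games_toy[games_toy$delay %in% flagged, ]
games_clean <- games_toy[!games_toy$delay %in% flagged, ]
```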
Lastly, I decided to combine all four tables into one table.
The following table shows a summary (first 5 rows) of the “all” data set, which includes ~3 million records and 39 variables.
| ab_id | b_count | b_score | on_1b | on_2b | on_3b | outs | pitch_num | pitch_type | s_count | batter_id | event | g_id | inning | o | p_score | p_throws | pitcher_id | stand | top | p_first_name | p_last_name | b_first_name | b_last_name | attendance | away_team | date | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | temp | forecast | wind_speed | wind_dir | delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.015e+09 | 0 | 0 | FALSE | FALSE | FALSE | 0 | 1 | FF | 0 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE | Jon | Lester | Matt | Carpenter | 35055 | sln | 2015-04-05 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 | clear | 7 | In from CF | 0 |
| 2.015e+09 | 0 | 0 | FALSE | FALSE | FALSE | 0 | 2 | FF | 1 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE | Jon | Lester | Matt | Carpenter | 35055 | sln | 2015-04-05 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 | clear | 7 | In from CF | 0 |
| 2.015e+09 | 0 | 0 | FALSE | FALSE | FALSE | 0 | 3 | FF | 2 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE | Jon | Lester | Matt | Carpenter | 35055 | sln | 2015-04-05 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 | clear | 7 | In from CF | 0 |
| 2.015e+09 | 0 | 0 | FALSE | FALSE | FALSE | 0 | 4 | FF | 2 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE | Jon | Lester | Matt | Carpenter | 35055 | sln | 2015-04-05 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 | clear | 7 | In from CF | 0 |
| 2.015e+09 | 1 | 0 | FALSE | FALSE | FALSE | 0 | 5 | CU | 2 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE | Jon | Lester | Matt | Carpenter | 35055 | sln | 2015-04-05 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 | clear | 7 | In from CF | 0 |
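One way the four tables could be combined is sketched below with base R's `merge`, assuming the join keys described in the data dictionary (`ab_id`, `g_id`, and the player `id`). The toy tables are heavily cut down and their values are illustrative; the player names come from the rows shown above.

```r
# Cut-down toy versions of the four tables
pitches_toy <- data.frame(ab_id = c(1, 1, 2), pitch_num = c(1, 2, 1),
                          pitch_type = c("FF", "CU", "FF"))
atbats_toy  <- data.frame(ab_id = c(1, 2), g_id = c(10, 10),
                          pitcher_id = c(452657, 452657),
                          batter_id = c(572761, 572761))
games_toy   <- data.frame(g_id = 10, venue_name = "Wrigley Field")
players_toy <- data.frame(id = c(452657, 572761),
                          first_name = c("Jon", "Matt"),
                          last_name = c("Lester", "Carpenter"))

# Pitch -> at-bat -> game
all_toy <- merge(pitches_toy, atbats_toy, by = "ab_id")
all_toy <- merge(all_toy, games_toy, by = "g_id")

# Attach player names twice: once for the pitcher, once for the batter
pitcher_names <- setNames(players_toy, c("pitcher_id", "p_first_name", "p_last_name"))
batter_names  <- setNames(players_toy, c("batter_id", "b_first_name", "b_last_name"))
all_toy <- merge(all_toy, pitcher_names, by = "pitcher_id")
all_toy <- merge(all_toy, batter_names, by = "batter_id")
all_toy
```

`merge` performs an inner join by default, so pitches without a matching at-bat, game, or player would be dropped.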
For the exploratory data analysis, I am focusing on a specific pitcher, Brent Suter. First, I created a pie chart to show his most common pitch types.
From the graph above, we can see that Brent sticks to six different pitch types and mostly throws Four-seam Fastballs. Changeups and Sliders are his next most popular choices. The frequency plot below shows how many of each type he throws specifically on the first pitch of each at-bat.
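The numbers behind both charts can be computed with base R's `table`. The sketch below uses a made-up pitch log (`suter_toy`), not Brent's actual pitches, so the shares it prints are illustrative only.

```r
# Toy pitch log for one pitcher (values made up for illustration)
suter_toy <- data.frame(
  pitch_num  = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
  pitch_type = c("FF", "CH", "FF", "FF", "SL", "CH", "FF", "FF", "SL", "FF")
)

# Share of each pitch type across all pitches (the pie chart's slices)
pitch_share <- prop.table(table(suter_toy$pitch_type))
round(pitch_share, 2)

# Counts restricted to the first pitch of each at-bat
first_pitch <- table(suter_toy$pitch_type[suter_toy$pitch_num == 1])
first_pitch
```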
The following graph shows where each of Brent's pitches crosses home plate, color-coded by pitch type. X = 0 means the ball went right down the middle of the plate and Z = 0 means the ball hit the ground.
I also created a similar plot, this time color-coded by pitch outcome (Strike, Ball, or In Play). Again, X = 0 means the ball went right down the middle of the plate and Z = 0 means the ball hit the ground.
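A location plot of this kind could be built with `ggplot2` roughly as follows. This is a sketch on made-up coordinates (`loc_toy`), using the `px`, `pz`, and `pitch_type` columns described in the data dictionary; the reference lines and labels are choices made here for illustration.

```r
library(ggplot2)

# Toy pitch locations (made up); px is horizontal position, pz is height
loc_toy <- data.frame(
  px = c(-0.5, 0.2, 0.9, -1.2, 0.0, 0.4),
  pz = c(2.1, 3.0, 1.4, 2.6, 0.3, 2.4),
  pitch_type = c("FF", "FF", "CH", "SL", "CH", "FF")
)

p <- ggplot(loc_toy, aes(x = px, y = pz, color = pitch_type)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +  # middle of the plate
  geom_hline(yintercept = 0) +                       # ground level
  labs(x = "px (horizontal location)", y = "pz (height)", color = "Pitch type")
p
```

Mapping `color` to the `type` column instead of `pitch_type` would give the outcome-colored version.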
In conclusion, this data can be very beneficial to baseball players and coaches. They can gain insights, such as where pitches land based on the type of pitch thrown, in order to improve strategy for when a batter is facing a specific pitcher.
Using the 2015-2018 MLB data, I created several plots of Brent's pitching. They show that he throws six pitch types, most often the Four-seam Fastball.
I did struggle with creating predictive models. Even after only using 10% of the data, I was still running into memory issues. In the future, I think it would be great to use this data to create machine learning predictive models.
ab_id - at-bat ID (first 4 digits are year)
batter_id - player ID of the batter (player names found in player_names.csv)
event - description of the result of the at-bat
g_id - game ID (first 4 digits are year)
inning - inning number
o - number of outs after this at-bat
p_score - score for the pitcher’s team
p_throws - which hand pitcher throws with (single character, R or L)
pitcher_id - player ID of the pitcher (player names found in player_names.csv)
stand - which side batter hits on (single character, R or L)
top - True if it’s the top of the inning / False if it’s the bottom
attendance - number of fans who attended (NOTE: for first game of doubleheaders, value is often erroneously 1 or 0)
away_final_score - final score for the visiting team
away_team - three-letter abbreviation for away team; third letter sometimes indicates league (National vs. American)
date - date of game
elapsed_time - length of game in minutes
g_id - game ID
home_final_score - final score for the home team
home_team - three-letter abbreviation for home team; third letter sometimes indicates league (National vs. American)
start_time - start time of game
umpire_1B - first and last name of the umpire at first base
umpire_2B - first and last name of the umpire at second base
umpire_3B - first and last name of the umpire at third base
umpire_HP - first and last name of the umpire at home plate
venue_name - name of stadium
weather - description of weather
wind - description of wind
delay - length of delay before game in minutes
ab_id - at-bat ID
ax
ay
az
b_count - balls in the current count
b_score - score for the batter’s team
break_angle
break_length
break_y
code - records the result of the pitch (See A2)
end_speed - speed of the pitch when it reaches the plate
nasty
on_1b - True if there’s a runner on first, False if empty
on_2b - True if there’s a runner on second, False if empty
on_3b - True if there’s a runner on third, False if empty
outs - number of outs (before pitch is thrown)
pfx_x
pfx_z
pitch_num - pitch number (of at-bat)
pitch_type - type of pitch (See A3)
px - x-location as pitch crosses the plate (X=0 means right down the middle)
pz - z-location as pitch crosses the plate (Z=0 means the ground)
s_count - strikes in the current count
spin_dir - direction in which pitch is spinning, measured in degrees
spin_rate - the pitch’s spin rate, measured in RPM
start_speed - speed of the pitch just as it’s thrown
sz_bot
sz_top
type - simplified code: S (strike) B (ball) or X (in play)
type_confidence - confidence in pitch_type classification (unsure what 2 means)
vx0
vy0
vz0
x
x0
y
y0
z0
zone
id - player ID (matches with batter_id and pitcher_id)
first_name - first name
last_name - last name
B - Ball
*B - Ball in dirt
S - Swinging Strike
C - Called Strike
F - Foul
T - Foul Tip
L - Foul Bunt
I - Intentional Ball
W - Swinging Strike (Blocked)
M - Missed Bunt
P - Pitchout
Q - Swinging pitchout
R - Foul pitchout
Values that only occur on last pitch of at-bat:
X - In play, out(s)
D - In play, no out
E - In play, runs
H - Hit by pitch
Note: all codes, except for H, come directly from the XML files. All at-bats with code H were given no code in the XMLs.
CH - Changeup
CU - Curveball
EP - Eephus
FC - Cutter
FF - Four-seam Fastball
FO - Pitchout (also PO)
FS - Splitter
FT - Two-seam Fastball
IN - Intentional ball
KC - Knuckle curve
KN - Knuckleball
PO - Pitchout (also FO)
SC - Screwball
SI - Sinker
SL - Slider
UN - Unknown