The dataset I chose contains pitch-level MLB data for every pitch thrown from 2015 through 2018. It can be used to predict which pitch is coming next.
For this project, I will begin by cleaning and exploring the data. Then I will fit several predictive models and compare them to determine which one is best at predicting the next pitch.
Knowing which pitch is coming gives the batter a better chance of getting a hit instead of taking a strike. In the end, the batting team benefits by scoring more runs and, hopefully, winning the game.
Click here to view the original source of data.
library(dplyr) # Used as a fast, consistent tool for working with data frame like objects
library(data.table) # Used for a fast and friendly way to load the datasets
library(tidyr) # Used specifically for data tidying
library(stringr) # Used for string manipulation, such as splitting a string
library(ggplot2) # Used for creating elegant data visualizations
library(knitr) # Used for dynamic report generation
library(kableExtra) # Additional features for knitr 'kable' function
To begin, I have loaded the data files into R using the ‘fread’ function.
atbats <- fread("mlb_pitch/atbats.csv")
games <- fread("mlb_pitch/games.csv")
pitches <- fread("mlb_pitch/pitches.csv")
players <- fread("mlb_pitch/player_names.csv")
Next, I got a feel for the data by viewing the first 5 rows of each table.
kable(atbats[1:5,], format = "html", caption = "At Bats Data") %>%
kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10)
| ab_id | batter_id | event | g_id | inning | o | p_score | p_throws | pitcher_id | stand | top |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.015e+09 | 572761 | Groundout | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE |
| 2.015e+09 | 518792 | Double | 201500001 | 1 | 1 | 0 | L | 452657 | L | TRUE |
| 2.015e+09 | 407812 | Single | 201500001 | 1 | 1 | 0 | L | 452657 | R | TRUE |
| 2.015e+09 | 425509 | Strikeout | 201500001 | 1 | 2 | 0 | L | 452657 | R | TRUE |
| 2.015e+09 | 571431 | Strikeout | 201500001 | 1 | 3 | 0 | L | 452657 | L | TRUE |
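The ab_id values in the table above print as 2.015e+09, most likely because the column is stored as a double and is formatted with R's default number of significant digits. The following is a minimal sketch, using made-up IDs rather than the real table, of one way to show the full identifier and to pull out the season from its first four digits:

```r
# Made-up at-bat IDs in the same 10-digit style as ab_id (illustrative values)
ab_id <- c(2015000001, 2015000002, 2017012775)

# Default formatting, for comparison with the table above
default_fmt <- format(ab_id[1])

# Forcing fixed notation keeps every digit visible
full_fmt <- format(ab_id, scientific = FALSE)

# The first four digits are the season, per the data dictionary
season <- as.integer(substr(full_fmt, 1, 4))

print(default_fmt)
print(full_fmt)
print(season)
```

Converting the column to character with this kind of call before passing it to kable would be one option; it only changes the display, not the values used for joins.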
kable(games[1:5,], format = "html", caption = "Games Data") %>%
kable_styling(bootstrap_options = "striped", font_size = 10) %>%
column_spec(4, width_min = ".7in") %>%
column_spec(10:16, width_min = "1.5in") %>%
scroll_box(width = "1000px")
| attendance | away_final_score | away_team | date | elapsed_time | g_id | home_final_score | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | weather | wind | delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35055 | 3 | sln | 2015-04-05 | 184 | 201500001 | 0 | chn | 7:17 PM | Mark Wegner | Marty Foster | Mike Muchlinski | Mike Winters | Wrigley Field | 44 degrees, clear | 7 mph, In from CF | 0 |
| 45909 | 1 | ana | 2015-04-06 | 153 | 201500002 | 4 | sea | 1:12 PM | Ron Kulpa | Brian Knight | Vic Carapazza | Larry Vanover | Safeco Field | 54 degrees, cloudy | 1 mph, Varies | 0 |
| 36969 | 2 | atl | 2015-04-06 | 156 | 201500003 | 1 | mia | 4:22 PM | Laz Diaz | Chris Guccione | Cory Blaser | Jeff Nelson | Marlins Park | 80 degrees, partly cloudy | 16 mph, In from CF | 16 |
| 31042 | 6 | bal | 2015-04-06 | 181 | 201500004 | 2 | tba | 3:12 PM | Ed Hickox | Paul Nauert | Mike Estabrook | Dana DeMuth | Tropicana Field | 72 degrees, dome | 0 mph, None | 0 |
| 45549 | 8 | bos | 2015-04-06 | 181 | 201500005 | 0 | phi | 3:08 PM | Phil Cuzzi | Tony Randazzo | Will Little | Gerry Davis | Citizens Bank Park | 71 degrees, partly cloudy | 11 mph, Out to RF | 0 |
kable(pitches[1:5,], format = "html", caption = "Pitches Data") %>%
kable_styling(bootstrap_options = "striped", font_size = 10) %>%
scroll_box(width = "1000px")
| ab_id | ax | ay | az | b_count | b_score | break_angle | break_length | break_y | code | end_speed | nasty | on_1b | on_2b | on_3b | outs | pfx_x | pfx_z | pitch_num | pitch_type | px | pz | s_count | spin_dir | spin_rate | start_speed | sz_bot | sz_top | type | type_confidence | vx0 | vy0 | vz0 | x | x0 | y | y0 | z0 | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.015e+09 | 7.665 | 34.685 | -11.960 | 0 | 0 | -25.0 | 3.2 | 23.7 | C | 84.1 | 55 | FALSE | FALSE | FALSE | 0 | 4.16 | 10.93 | 1 | FF | 0.416 | 2.963 | 0 | 159.235 | 2305.052 | 92.9 | 1.72 | 3.56 | S | 2 | -6.409 | -136.065 | -3.995 | 101.1400 | 2.280 | 158.7800 | 50 | 5.302 | 3 |
| 2.015e+09 | 12.043 | 34.225 | -10.085 | 0 | 0 | -40.7 | 3.4 | 23.7 | S | 84.1 | 31 | FALSE | FALSE | FALSE | 0 | 6.57 | 12.00 | 2 | FF | -0.191 | 2.347 | 1 | 151.402 | 2689.935 | 92.8 | 1.72 | 3.56 | S | 2 | -8.411 | -135.690 | -5.980 | 124.2800 | 2.119 | 175.4100 | 50 | 5.307 | 5 |
| 2.015e+09 | 14.368 | 35.276 | -11.560 | 0 | 0 | -43.7 | 3.7 | 23.7 | F | 85.2 | 49 | FALSE | FALSE | FALSE | 0 | 7.61 | 10.88 | 3 | FF | -0.518 | 3.284 | 2 | 145.125 | 2647.972 | 94.1 | 1.72 | 3.56 | S | 2 | -9.802 | -137.668 | -3.337 | 136.7400 | 2.127 | 150.1100 | 50 | 5.313 | 1 |
| 2.015e+09 | 2.104 | 28.354 | -20.540 | 0 | 0 | -1.3 | 5.0 | 23.8 | B | 84.0 | 41 | FALSE | FALSE | FALSE | 0 | 1.17 | 6.45 | 4 | FF | -0.641 | 1.221 | 2 | 169.751 | 1289.590 | 91.0 | 1.74 | 3.35 | B | 2 | -8.071 | -133.005 | -6.567 | 109.6856 | 2.279 | 187.4635 | 50 | 5.210 | 13 |
| 2.015e+09 | -10.280 | 21.774 | -34.111 | 1 | 0 | 18.4 | 12.0 | 23.8 | B | 69.6 | 18 | FALSE | FALSE | FALSE | 0 | -8.43 | -1.65 | 5 | CU | -1.821 | 2.083 | 2 | 280.671 | 1374.569 | 75.4 | 1.72 | 3.56 | B | 2 | -6.309 | -110.409 | 0.325 | 146.5275 | 2.179 | 177.2428 | 50 | 5.557 | 13 |
kable(players[1:5,], format = "html", caption = "Players Data") %>%
kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10)
| id | first_name | last_name |
|---|---|---|
| 452657 | Jon | Lester |
| 425794 | Adam | Wainwright |
| 457435 | Phil | Coke |
| 435400 | Jason | Motte |
| 519166 | Neil | Ramirez |
It is also a good idea to check for duplicates and missing values. In this case, there were no duplicates, but the pitches table had missing values that needed to be handled.
# Check duplicates
anyDuplicated(atbats)
anyDuplicated(games)
anyDuplicated(pitches)
anyDuplicated(players)
# Check missing values
colSums(is.na(atbats))
colSums(is.na(games))
colSums(is.na(players))
# Count missing values in 'pitches', then remove the incomplete rows
colSums(is.na(pitches))
## ab_id ax ay az
## 0 14189 14189 14189
## b_count b_score break_angle break_length
## 0 0 14189 14189
## break_y code end_speed nasty
## 14189 0 14114 14189
## on_1b on_2b on_3b outs
## 0 0 0 0
## pfx_x pfx_z pitch_num pitch_type
## 14142 14142 0 0
## px pz s_count spin_dir
## 14189 14189 0 14189
## spin_rate start_speed sz_bot sz_top
## 14189 14114 2083 2083
## type type_confidence vx0 vy0
## 0 14189 14189 14189
## vz0 x x0 y
## 14189 0 14189 0
## y0 z0 zone
## 14189 14189 14189
pitches <- na.omit(pitches)
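Because na.omit drops every row that has at least one missing value, it can be worth confirming how many rows the cleanup removes. The sketch below uses a small made-up data frame as a stand-in for the pitches table; the column values are illustrative only.

```r
# Toy stand-in for the pitches table (values are made up)
toy <- data.frame(
  ab_id       = c(1, 1, 2, 2, 3),
  start_speed = c(92.9, 92.8, NA, 91.0, 75.4),
  spin_rate   = c(2305, 2690, NA, NA, 1375)
)

# anyDuplicated() returns 0 when no row repeats an earlier row
n_dup <- anyDuplicated(toy)

# Count of missing values per column
na_by_col <- colSums(is.na(toy))

# na.omit() drops every row that has at least one NA
toy_complete <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_complete)
share_dropped <- rows_dropped / nrow(toy)

print(na_by_col)
print(rows_dropped)
print(share_dropped)
```

The same three quantities computed on the real table would show whether the removed rows are a small enough share to ignore.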
There is also some bad data for doubleheaders. For some of them, the first game shows an attendance of 0 or 1. I am going to remove these rows.
games <- filter(games, !attendance %in% c(0, 1)) # %in% checks each value against both 0 and 1
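One thing to watch when excluding several values at once: comparing a column to a vector with != recycles the vector element by element instead of testing membership. The sketch below, on made-up attendance figures, contrasts that with %in%.

```r
# Made-up attendance figures, including the erroneous 0 and 1 values
attendance <- c(1, 0, 35055, 0, 1, 45909)

# `!=` against a length-2 vector recycles it: odd positions are compared
# with 0 and even positions with 1, so none of the bad values is caught here
recycled_keep <- attendance != c(0, 1)

# `%in%` tests each element against the whole set of unwanted values
set_keep <- !(attendance %in% c(0, 1))

print(attendance[recycled_keep])
print(attendance[set_keep])
```

With this ordering the recycled comparison keeps all six values, while the membership test keeps only the two real attendance figures.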
As the last data preparation step, I created factor versions of several variables (event, away team, home team, venue, code, pitch type, and play type) and split two columns into two columns each: weather into temperature and forecast, and wind into wind speed and wind direction.
# Convert to factors
atbats$eventF <- as.factor(atbats$event)
games$away_teamF <- as.factor(games$away_team)
games$home_teamF <- as.factor(games$home_team)
games$venue_nameF <- as.factor(games$venue_name)
pitches$codeF <- as.factor(pitches$code)
pitches$pitch_typeF <- as.factor(pitches$pitch_type)
pitches$typeF <- as.factor(pitches$type)
# Convert date as date
games$date <- as.Date(games$date)
# Split weather into temp and forecast
games <- games %>%
separate(weather, c("temp", "forecast"), sep = "\\b\\s\\b")
games <- games %>%
separate(forecast, c("to.rm", "forecast"), sep = " ")
games <- games[,-16]
games$forecastF <- as.factor(games$forecast)
games$temp <- as.numeric(games$temp)
# Split wind into windspeed and wind direction
games <- games %>%
separate(wind, c("wind_speed", "wind_dir"), sep = ",")
games$wind_dir <- str_trim(games$wind_dir, side = "left")
games <- games %>%
separate(wind_speed, c("wind_speed", "to.rm"), sep = " ")
games <- games[,-18]
index <- which(games$wind_dir == "none")
games$wind_dir[index] <- "None" # 4 records were spelled "none" and need to be changed to "None"
games$wind_dirF <- as.factor(games$wind_dir)
games$wind_speed <- as.numeric(games$wind_speed)
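The two-step split above appears to keep only the first word of a multi-word forecast (for example, "partly cloudy" shows up as "partly" in the final table). As an alternative sketch, not the method used in this report, a single regular expression per column can capture the number and the full description. The strings below follow the format seen in the games table, and the pattern names are my own.

```r
# Example strings in the same format as the weather and wind columns
weather <- c("44 degrees, clear", "80 degrees, partly cloudy", "72 degrees, dome")
wind    <- c("7 mph, In from CF", "16 mph, In from CF", "0 mph, none")

# Capture the leading number and everything after the comma
weather_pattern <- "^([0-9]+) degrees, (.*)$"
temp     <- as.numeric(sub(weather_pattern, "\\1", weather))
forecast <- sub(weather_pattern, "\\2", weather)

wind_pattern <- "^([0-9]+) mph, (.*)$"
wind_speed <- as.numeric(sub(wind_pattern, "\\1", wind))
wind_dir   <- sub(wind_pattern, "\\2", wind)

# Normalise the lower-case spelling, as the report does for "none"
wind_dir[wind_dir == "none"] <- "None"

print(data.frame(temp, forecast, wind_speed, wind_dir))
```

This assumes every value matches the pattern; strings that do not match are returned unchanged by sub, so they would need a separate check on the real data.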
As part of the exploratory data analysis, I created boxplots to look for outliers. There was one variable, delay (in minutes), that had an outlier.
# Check for outliers
boxplot(games$attendance)
boxplot(games$away_final_score)
boxplot(games$elapsed_time)
boxplot(games$home_final_score)
boxplot(games$temp)
boxplot(games$wind_speed)
boxplot(pitches$end_speed)
boxplot(pitches$pitch_num)
boxplot(pitches$spin_rate)
boxplot(pitches$start_speed)
boxplot(games$delay)
The following code looks at the specific outlier observation(s).
# Cook's distance >> Outlier Observation = 9608
model <- lm(delay ~ ., data = games)
plot(model, 4)
kable(games[9608,1:19], format = "html", caption = "Games Outlier") %>%
kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10) %>%
column_spec(5, width_min = ".7in") %>%
column_spec(11:15, width_min = "1in") %>%
scroll_box(width = "1000px")
| | attendance | away_final_score | away_team | date | elapsed_time | g_id | home_final_score | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | temp | forecast | wind_speed | wind_dir | delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9608 | 36508 | 6 | mil | 2018-09-24 | 229 | 201802339 | 4 | sln | 7:16 PM | Will Little | Ted Barrett | Mark Carlson | Lance Barksdale | Busch Stadium | 78 | cloudy | 5 | Out to LF | 1860 |
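To illustrate how Cook's distance singles out an observation like this one, here is a minimal sketch on simulated data with one planted extreme value. It uses a single predictor rather than the full model above, and the data are not from the games table.

```r
# Simulated data with one planted extreme response (not the real games table)
set.seed(1)
toy <- data.frame(elapsed_time = round(rnorm(50, mean = 180, sd = 20)))
toy$delay <- round(0.02 * toy$elapsed_time + rnorm(50, sd = 2))
toy$delay[37] <- 1860  # planted outlier, echoing the extreme delay value above

fit <- lm(delay ~ elapsed_time, data = toy)

# cooks.distance() returns one influence value per observation
cd <- cooks.distance(fit)
most_influential <- unname(which.max(cd))

print(most_influential)
print(toy[most_influential, ])
```

Reading the row index off which.max gives the same information as the labelled point in plot(model, 4), without having to read it from the figure.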
As the last exploration step, I plotted the position of the ball as it crosses home plate for a single pitcher, Brent Suter. Points are color coded by pitch result (the type variable: ball, strike, or in play). X = 0 means the ball went right down the middle of the plate, and Z = 0 means the ball hit the ground.
# Plot pitches on home plate for pitcher, Brent Suter
bs_pitches <- filter(atbats, pitcher_id == 608718)
bs <-
bs_pitches %>%
left_join(pitches, by = "ab_id")
ggplot(bs, aes(x = px, y = pz, color = typeF)) +
geom_point() +
ggtitle("Pitches on home plate for Brent Suter") +
scale_color_manual(name = "Type",
labels = c("Ball",
"Strike",
"In Play"),
values = c("B" = "#f97970",
"S" = "#9bf970",
"X" = "#7cd5ff"))
First, I combined all four tables into one table. From there I split the data into 80% training and 20% testing.
# Combine all data
all <-
atbats %>%
left_join(players, by = c("pitcher_id" = "id"))
names(all)[13:14] <- c("p_first_name", "p_last_name") # specify pitcher names
all <-
all %>%
left_join(players, by = c("batter_id" = "id"))
names(all)[15:16] <- c("b_first_name", "b_last_name") # specify batter names
all <-
pitches %>%
left_join(all, by = c("ab_id" = "ab_id"))
all <-
all %>%
left_join(games, by = "g_id")
# Split the data into training and testing datasets
set.seed(4188135)
index1 <- sample(nrow(all),nrow(all)*0.80)
mlb.train <- all[index1,]
mlb.test <- all[-index1,]
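A quick sanity check after a random split is to confirm that the two sets are disjoint, that together they cover every row, and that the class shares look similar. The sketch below repeats the same sampling pattern on a made-up table; the pitch type proportions are illustrative assumptions.

```r
# Toy stand-in for the combined table (made-up pitch types)
set.seed(4188135)
toy <- data.frame(
  id         = 1:1000,
  pitch_type = sample(c("FF", "SL", "CU", "CH"), 1000, replace = TRUE,
                      prob = c(0.5, 0.2, 0.15, 0.15))
)

# Same pattern as above: sample 80% of the row indices without replacement
train_idx <- sample(nrow(toy), nrow(toy) * 0.80)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two sets should be disjoint and together cover every row
no_overlap <- length(intersect(toy_train$id, toy_test$id)) == 0
covers_all <- nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Class shares should be broadly similar in both sets
print(round(prop.table(table(toy_train$pitch_type)), 2))
print(round(prop.table(table(toy_test$pitch_type)), 2))
```

If the shares differed noticeably for rare pitch types, a stratified split would be one alternative to consider.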
The following table shows a summary (first 5 rows) of the final training data set, which includes ~2.3 million records and 80 variables.
kable(mlb.train[1:5,], format = "html", caption = "MLB Final Data Set") %>%
kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10) %>%
scroll_box(width = "1000px")
| | ab_id | ax | ay | az | b_count | b_score | break_angle | break_length | break_y | code | end_speed | nasty | on_1b | on_2b | on_3b | outs | pfx_x | pfx_z | pitch_num | pitch_type | px | pz | s_count | spin_dir | spin_rate | start_speed | sz_bot | sz_top | type | type_confidence | vx0 | vy0 | vz0 | x | x0 | y | y0 | z0 | zone | codeF | pitch_typeF | typeF | batter_id | event | g_id | inning | o | p_score | p_throws | pitcher_id | stand | top | eventF | p_first_name | p_last_name | b_first_name | b_last_name | attendance | away_final_score | away_team | date | elapsed_time | home_final_score | home_team | start_time | umpire_1B | umpire_2B | umpire_3B | umpire_HP | venue_name | temp | forecast | wind_speed | wind_dir | delay | away_teamF | home_teamF | venue_nameF | forecastF | wind_dirF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1464899 | 2017012775 | -20.787041 | 31.30529 | -21.33352 | 2 | 1 | 40.8 | 6.2 | 23.8 | C | 87.3 | 53 | TRUE | FALSE | FALSE | 1 | -10.6718377 | 5.565410 | 4 | FT | 0.4216569 | 1.785570 | 1 | 242.457 | 2451.217 | 95.7 | 1.600000 | 3.560000 | S | 2.00 | 6.285478 | -138.7556 | -5.171628 | 100.93 | -0.4876320 | 190.57 | 50 | 5.096618 | 9 | C | FT | S | 516770 | Pop Out | 201700169 | 5 | 2 | 0 | R | 593372 | R | FALSE | Pop Out | Carlos | Martinez | Starlin | Castro | 43031 | 2 | sln | 2017-04-15 | 185 | 3 | nya | 1:09 PM | Jeff Kellogg | Tim Timmons | James Hoye | Will Little | Yankee Stadium | 58 | cloudy | 10 | R to L | 0 | sln | nya | Yankee Stadium | cloudy | R to L |
| 2718076 | 2018150880 | -10.334074 | 30.10481 | -12.20729 | 2 | 6 | 32.8 | 3.6 | 23.8 | F | 87.0 | 47 | FALSE | FALSE | FALSE | 1 | -5.3514130 | 10.339615 | 5 | FF | 0.4122097 | 1.682453 | 2 | 207.363 | 2364.323 | 95.1 | 1.916983 | 3.848036 | S | 2.00 | 5.372219 | -137.9558 | -9.699538 | 101.29 | -0.8632145 | 193.35 | 50 | 6.062150 | 14 | F | FF | S | 641355 | Strikeout | 201801974 | 5 | 2 | 2 | R | 572750 | L | TRUE | Strikeout | Eddie | Butler | Cody | Bellinger | 30123 | 8 | lan | 2018-08-28 | 199 | 4 | tex | 7:09 PM | Sean Barber | Ted Barrett | Will Little | Lance Barksdale | Globe Life Park in Arlington | 95 | clear | 16 | In from CF | 0 | lan | tex | Globe Life Park in Arlington | clear | In from CF |
| 2687377 | 2018142976 | -1.091764 | 20.82640 | -39.24054 | 0 | 5 | 0.7 | 11.6 | 23.9 | B | 73.9 | 57 | FALSE | FALSE | FALSE | 1 | -0.7945376 | -5.142681 | 1 | CU | 1.3365320 | 2.262686 | 0 | 351.219 | 892.602 | 79.8 | 1.702146 | 3.648827 | B | 2.00 | 3.936309 | -116.2180 | -1.030206 | 66.00 | -0.2724331 | 177.66 | 50 | 6.423290 | 14 | B | CU | B | 607345 | Groundout | 201801870 | 5 | 2 | 3 | R | 543118 | R | TRUE | Groundout | Oliver | Drake | Kevan | Smith | 23431 | 8 | cha | 2018-08-20 | 204 | 5 | min | 6:43 PM | Manny Gonzalez | Laz Diaz | Jeff Nelson | Nick Mahrley | Target Field | 77 | cloudy | 10 | In from LF | 33 | cha | min | Target Field | cloudy | In from LF |
| 1054010 | 2016091610 | 3.070000 | 24.07000 | -30.25000 | 1 | 0 | -3.1 | 7.8 | 23.9 | B | 80.3 | 59 | FALSE | FALSE | FALSE | 1 | 1.9000000 | 1.140000 | 3 | SL | -1.2700000 | 1.480000 | 1 | 122.076 | 416.968 | 86.2 | 1.680000 | 3.560000 | B | 0.66 | -8.080000 | -126.0600 | -4.010000 | 165.49 | 1.7200000 | 198.79 | 50 | 5.520000 | 13 | B | SL | B | 408045 | Walk | 201601200 | 4 | 1 | 0 | L | 527048 | L | FALSE | Walk | Martin | Perez | Joe | Mauer | 25530 | 3 | tex | 2016-07-01 | 184 | 2 | min | 7:10 PM | Lance Barrett | Dan Iassogna | Dale Scott | Bob Davidson | Target Field | 73 | partly | 3 | Varies | 0 | tex | min | Target Field | partly | Varies |
| 931710 | 2016059721 | 5.186000 | 30.71000 | -24.12700 | 3 | 0 | -13.7 | 6.4 | 23.7 | F | 78.3 | 45 | FALSE | FALSE | TRUE | 2 | 3.2600000 | 5.010000 | 6 | SL | 0.5970000 | 3.304000 | 2 | 147.201 | 1096.882 | 86.4 | 1.540000 | 3.450000 | S | 2.00 | 3.087000 | -126.6210 | -0.347000 | 94.24 | -1.0700000 | 149.57 | 50 | 5.408000 | 3 | F | SL | S | 607680 | Groundout | 201600786 | 2 | 3 | 0 | R | 547888 | R | FALSE | Groundout | Masahiro | Tanaka | Kevin | Pillar | 39512 | 0 | nya | 2016-06-01 | 177 | 7 | tor | 7:07 PM | Jim Reynolds | Scott Barry | CB Bucknor | Fieldin Culbreth | Rogers Centre | 62 | clear | 14 | R to L | 0 | nya | tor | Rogers Centre | clear | R to L |
What I don’t know right now:
I will still need to remove some columns so that the data is easier to work with for modeling and analysis. The plan is to use a variable selection method to choose only the most important variables. From there, I can run more advanced machine learning techniques in order to predict what type of pitch will be thrown next. I will also be able to create more meaningful plots and tables. Below is the outline for the remainder of the project.
[Placeholder]
[Placeholder]
[Placeholder]
[Placeholder]
ab_id - at-bat ID (first 4 digits are year)
batter_id - player ID of the batter (player names found in player_names.csv)
event - description of the result of the at-bat
g_id - game ID (first 4 digits are year)
inning - inning number
o - number of outs after this at-bat
p_score - score for the pitcher’s team
p_throws - which hand pitcher throws with (single character, R or L)
pitcher_id - player ID of the pitcher (player names found in player_names.csv)
stand - which side batter hits on (single character, R or L)
top - True if it’s the top of the inning / False if it’s the bottom
attendance - number of fans who attended (NOTE: for first game of doubleheaders, value is often erroneously 1 or 0)
away_final_score - final score for the visiting team
away_team - three-letter abbreviation for away team; third letter sometimes indicates league (National vs. American)
date - date of game
elapsed_time - length of game in minutes
g_id - game ID
home_final_score - final score for the home team
home_team - three-letter abbreviation for home team; third letter sometimes indicates league (National vs. American)
start_time - start time of game
umpire_1B - first and last name of the umpire at first base
umpire_2B - first and last name of the umpire at second base
umpire_3B - first and last name of the umpire at third base
umpire_HP - first and last name of the umpire at home plate
venue_name - name of stadium
weather - description of weather
wind - description of wind
delay - length of delay before game in minutes
ab_id - at-bat ID
ax
ay
az
b_count - balls in the current count
b_score - score for the batter’s team
break_angle
break_length
break_y
code - records the result of the pitch (See A2)
end_speed - speed of the pitch when it reaches the plate
nasty
on_1b - True if there’s a runner on first, False if empty
on_2b - True if there’s a runner on second, False if empty
on_3b - True if there’s a runner on third, False if empty
outs - number of outs (before pitch is thrown)
pfx_x
pfx_z
pitch_num - pitch number (of at-bat)
pitch_type - type of pitch (See A3)
px - x-location as pitch crosses the plate (X=0 means right down the middle)
pz - z-location as pitch crosses the plate (Z=0 means the ground)
s_count - strikes in the current count
spin_dir - direction in which pitch is spinning, measured in degrees
spin_rate - the pitch’s spin rate, measured in RPM
start_speed - speed of the pitch just as it’s thrown
sz_bot
sz_top
type - simplified code: S (strike) B (ball) or X (in play)
type_confidence - confidence in pitch_type classification (unsure what 2 means)
vx0
vy0
vz0
x
x0
y
y0
z0
zone
id - player ID (matches with batter_id and pitcher_id)
first_name - first name
last_name - last name
B - Ball
*B - Ball in dirt
S - Swinging Strike
C - Called Strike
F - Foul
T - Foul Tip
L - Foul Bunt
I - Intentional Ball
W - Swinging Strike (Blocked)
M - Missed Bunt
P - Pitchout
Q - Swinging pitchout
R - Foul pitchout
Values that only occur on last pitch of at-bat:
X - In play, out(s)
D - In play, no out
E - In play, runs
H - Hit by pitch
Note: all codes, except for H, come directly from the XML files. All at-bats with code H were given no code in the XMLs.
CH - Changeup
CU - Curveball
EP - Eephus
FC - Cutter
FF - Four-seam Fastball
FO - Pitchout (also PO)
FS - Splitter
FT - Two-seam Fastball
IN - Intentional ball
KC - Knuckle curve
KN - Knuckleball
PO - Pitchout (also FO)
SC - Screwball
SI - Sinker
SL - Slider
UN - Unknown
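For plots and tables it can help to show the full pitch names instead of the two-letter codes. One simple way, sketched below, is a named character vector built from the list above and indexed by code. The sample codes are illustrative, and the object names are my own.

```r
# Pitch type codes and names copied from the list above
pitch_names <- c(
  CH = "Changeup", CU = "Curveball", EP = "Eephus", FC = "Cutter",
  FF = "Four-seam Fastball", FO = "Pitchout", FS = "Splitter",
  FT = "Two-seam Fastball", IN = "Intentional ball", KC = "Knuckle curve",
  KN = "Knuckleball", PO = "Pitchout", SC = "Screwball", SI = "Sinker",
  SL = "Slider", UN = "Unknown"
)

# Codes as they might appear in the pitch_type column (illustrative sample)
codes <- c("FF", "CU", "PO", "FO", "SL")

# Indexing the named vector by code returns the readable label
labels <- unname(pitch_names[codes])

print(labels)
```

Because FO and PO both map to "Pitchout", labelling this way also merges the two codes into a single category in any summary built on the labels.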