Introduction

The data I chose contains MLB pitch-level data for every pitch thrown from 2015 through 2018. The data can be used to predict which pitch might be coming next.

For this project, I will begin by cleaning and exploring the data. Then I will run different analytic models and compare them to determine which model is best at predicting which pitch comes next.

Being able to predict which pitch is coming next should help the batter get a hit rather than take a strike. In the end, the batting team benefits from the results by hopefully scoring more runs and winning the game.

Click here to view the original source of data.

Packages Required

library(dplyr)        # Used as a fast, consistent tool for working with data frame like objects
library(data.table)   # Used for a fast and friendly way to load the datasets
library(tidyr)        # Used specifically for data tidying
library(stringr)      # Used for string manipulation, such as splitting a string
library(ggplot2)      # Used for creating elegant data visualizations
library(knitr)        # Used for dynamic report generation
library(kableExtra)   # Additional features for knitr 'kable' function

Data Preparation

To begin, I have loaded the data files into R using the ‘fread’ function.

atbats <- fread("mlb_pitch/atbats.csv")
games <- fread("mlb_pitch/games.csv")
pitches <- fread("mlb_pitch/pitches.csv")
players <- fread("mlb_pitch/player_names.csv")

Next, I got a feel for the data by viewing the first 5 rows of each table.

kable(atbats[1:5,], format = "html", caption = "At Bats Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10)
At Bats Data
ab_id batter_id event g_id inning o p_score p_throws pitcher_id stand top
2.015e+09 572761 Groundout 201500001 1 1 0 L 452657 L TRUE
2.015e+09 518792 Double 201500001 1 1 0 L 452657 L TRUE
2.015e+09 407812 Single 201500001 1 1 0 L 452657 R TRUE
2.015e+09 425509 Strikeout 201500001 1 2 0 L 452657 R TRUE
2.015e+09 571431 Strikeout 201500001 1 3 0 L 452657 L TRUE
kable(games[1:5,], format = "html", caption = "Games Data") %>%
  kable_styling(bootstrap_options = "striped", font_size = 10) %>%
  column_spec(4, width_min = ".7in") %>%
  column_spec(10:16, width_min = "1.5in") %>%
  scroll_box(width = "1000px")
Games Data
attendance away_final_score away_team date elapsed_time g_id home_final_score home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name weather wind delay
35055 3 sln 2015-04-05 184 201500001 0 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 degrees, clear 7 mph, In from CF 0
45909 1 ana 2015-04-06 153 201500002 4 sea 1:12 PM Ron Kulpa Brian Knight Vic Carapazza Larry Vanover Safeco Field 54 degrees, cloudy 1 mph, Varies 0
36969 2 atl 2015-04-06 156 201500003 1 mia 4:22 PM Laz Diaz Chris Guccione Cory Blaser Jeff Nelson Marlins Park 80 degrees, partly cloudy 16 mph, In from CF 16
31042 6 bal 2015-04-06 181 201500004 2 tba 3:12 PM Ed Hickox Paul Nauert Mike Estabrook Dana DeMuth Tropicana Field 72 degrees, dome 0 mph, None 0
45549 8 bos 2015-04-06 181 201500005 0 phi 3:08 PM Phil Cuzzi Tony Randazzo Will Little Gerry Davis Citizens Bank Park 71 degrees, partly cloudy 11 mph, Out to RF 0
kable(pitches[1:5,], format = "html", caption = "Pitches Data") %>%
  kable_styling(bootstrap_options = "striped", font_size = 10) %>%
  scroll_box(width = "1000px")
Pitches Data
ab_id ax ay az b_count b_score break_angle break_length break_y code end_speed nasty on_1b on_2b on_3b outs pfx_x pfx_z pitch_num pitch_type px pz s_count spin_dir spin_rate start_speed sz_bot sz_top type type_confidence vx0 vy0 vz0 x x0 y y0 z0 zone
2.015e+09 7.665 34.685 -11.960 0 0 -25.0 3.2 23.7 C 84.1 55 FALSE FALSE FALSE 0 4.16 10.93 1 FF 0.416 2.963 0 159.235 2305.052 92.9 1.72 3.56 S 2 -6.409 -136.065 -3.995 101.1400 2.280 158.7800 50 5.302 3
2.015e+09 12.043 34.225 -10.085 0 0 -40.7 3.4 23.7 S 84.1 31 FALSE FALSE FALSE 0 6.57 12.00 2 FF -0.191 2.347 1 151.402 2689.935 92.8 1.72 3.56 S 2 -8.411 -135.690 -5.980 124.2800 2.119 175.4100 50 5.307 5
2.015e+09 14.368 35.276 -11.560 0 0 -43.7 3.7 23.7 F 85.2 49 FALSE FALSE FALSE 0 7.61 10.88 3 FF -0.518 3.284 2 145.125 2647.972 94.1 1.72 3.56 S 2 -9.802 -137.668 -3.337 136.7400 2.127 150.1100 50 5.313 1
2.015e+09 2.104 28.354 -20.540 0 0 -1.3 5.0 23.8 B 84.0 41 FALSE FALSE FALSE 0 1.17 6.45 4 FF -0.641 1.221 2 169.751 1289.590 91.0 1.74 3.35 B 2 -8.071 -133.005 -6.567 109.6856 2.279 187.4635 50 5.210 13
2.015e+09 -10.280 21.774 -34.111 1 0 18.4 12.0 23.8 B 69.6 18 FALSE FALSE FALSE 0 -8.43 -1.65 5 CU -1.821 2.083 2 280.671 1374.569 75.4 1.72 3.56 B 2 -6.309 -110.409 0.325 146.5275 2.179 177.2428 50 5.557 13
kable(players[1:5,], format = "html", caption = "Players Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10)
Players Data
id first_name last_name
452657 Jon Lester
425794 Adam Wainwright
457435 Phil Coke
435400 Jason Motte
519166 Neil Ramirez

It is also a good idea to check for duplicates and missing values. In this case there were no duplicates, but the pitches table had some missing values that needed to be removed.

# Check duplicates
anyDuplicated(atbats)
anyDuplicated(games)
anyDuplicated(pitches)
anyDuplicated(players)
# Check missing values
colSums(is.na(atbats))
colSums(is.na(games))
colSums(is.na(players))
# Check missing values in 'pitches' (rows with missing values are removed after the counts)
colSums(is.na(pitches))
##           ab_id              ax              ay              az 
##               0           14189           14189           14189 
##         b_count         b_score     break_angle    break_length 
##               0               0           14189           14189 
##         break_y            code       end_speed           nasty 
##           14189               0           14114           14189 
##           on_1b           on_2b           on_3b            outs 
##               0               0               0               0 
##           pfx_x           pfx_z       pitch_num      pitch_type 
##           14142           14142               0               0 
##              px              pz         s_count        spin_dir 
##           14189           14189               0           14189 
##       spin_rate     start_speed          sz_bot          sz_top 
##           14189           14114            2083            2083 
##            type type_confidence             vx0             vy0 
##               0           14189           14189           14189 
##             vz0               x              x0               y 
##           14189               0           14189               0 
##              y0              z0            zone 
##           14189           14189           14189
pitches <- na.omit(pitches)
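
Before dropping rows, it can help to see how many rows na.omit() will actually remove. The following is a minimal sketch on a made-up data frame (the values in `toy` are hypothetical and are not taken from the MLB data); it uses the same base R checks as above.

```r
# Toy data frame (hypothetical values) with one duplicated row and some NAs
toy <- data.frame(
  ab_id       = c(1, 2, 2, 3, 4),
  px          = c(0.4, NA, NA, -0.2, 1.1),
  start_speed = c(92.9, 88.0, 88.0, NA, 75.4)
)

# anyDuplicated() returns 0 when no row is repeated,
# otherwise the index of the first repeated row
first_dup <- anyDuplicated(toy)

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

# Rows with no missing values, and the share of rows na.omit() would drop
n_complete    <- sum(complete.cases(toy))
share_dropped <- 1 - n_complete / nrow(toy)

# Keep only the complete rows
toy_clean <- na.omit(toy)
```

If the share of dropped rows is small, as the counts above suggest it is for the pitches table, removing them is a reasonable first step.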

There is also some bad data when it comes to double header games. For some of the double headers, the first game shows an attendance of 0 or 1. I am going to remove these rows.

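# Use %in% so every row is tested against both 0 and 1
# (attendance != c(0,1) would recycle the vector and compare rows alternately to 0 and 1)
games <- filter(games, !(attendance %in% c(0, 1)))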
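
When excluding more than one value, a comparison against a vector and a set-membership test do not behave the same way in R. This small sketch uses made-up attendance figures (not the real games table) to show the difference.

```r
# Hypothetical attendance values, including the bad values 0 and 1
attendance <- c(0, 1, 35055, 0, 1, 45909)

# Element-wise comparison: c(0, 1) is recycled to 0, 1, 0, 1, 0, 1,
# so each row is only compared with ONE of the two values
keep_recycled <- attendance != c(0, 1)

# Set membership: each row is tested against both 0 and 1
keep_in <- !(attendance %in% c(0, 1))

attendance[keep_recycled]   # still contains a 0 and a 1
attendance[keep_in]         # only the valid attendance figures remain
```

With the recycled comparison, the fourth and fifth values slip through because they happen to line up with the other element of c(0, 1).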

Lastly for data preparation, I created factor versions of several variables (event, away team, home team, venue, code, pitch type, and play type) and split two columns into separate columns: weather into temperature and forecast, and wind into wind speed and wind direction.

# Convert to factors 
atbats$eventF <- as.factor(atbats$event)
games$away_teamF <- as.factor(games$away_team) 
games$home_teamF <- as.factor(games$home_team)
games$venue_nameF <- as.factor(games$venue_name)
pitches$codeF <- as.factor(pitches$code)
pitches$pitch_typeF <- as.factor(pitches$pitch_type)
pitches$typeF <- as.factor(pitches$type)

# Convert date as date
games$date <- as.Date(games$date)

# Split weather into temp and forecast
games <- games %>%
  separate(weather, c("temp", "forecast"), sep = "\\b\\s\\b")
games <- games %>%
  separate(forecast, c("to.rm", "forecast"), sep = " ")
games$to.rm <- NULL   # drop the leftover "degrees," piece by name instead of by position

games$forecastF <- as.factor(games$forecast)
games$temp <- as.numeric(games$temp)

# Split wind into windspeed and wind direction
games <- games %>%
  separate(wind, c("wind_speed", "wind_dir"), sep = ",")

games$wind_dir <- str_trim(games$wind_dir, side = "left")

games <- games %>%
  separate(wind_speed, c("wind_speed", "to.rm"), sep = " ")
games$to.rm <- NULL   # drop the leftover "mph" piece by name instead of by position

index <- which(games$wind_dir == "none") 
games$wind_dir[index] <- "None"              # 4 records were spelled "none" and need to be changed to "None"

games$wind_dirF <- as.factor(games$wind_dir)
games$wind_speed <- as.numeric(games$wind_speed)
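
Because the splits above break on spaces, a two-word forecast such as "partly cloudy" is reduced to "partly". As a possible alternative, here is a sketch in base R that keeps the full forecast text. It assumes every value follows the patterns seen in the sample rows ("<number> degrees, <forecast>" and "<number> mph, <direction>"); the example vectors are copied from those rows.

```r
# Example strings in the same format as the weather and wind columns
weather <- c("44 degrees, clear", "80 degrees, partly cloudy", "72 degrees, dome")
wind    <- c("7 mph, In from CF", "0 mph, None", "16 mph, In from CF")

# Temperature: everything before " degrees"
temp <- as.numeric(sub(" degrees.*$", "", weather))

# Forecast: everything after "degrees, " (multi-word forecasts stay intact)
forecast <- sub("^.*degrees, ", "", weather)

# Wind speed: everything before " mph"
wind_speed <- as.numeric(sub(" mph.*$", "", wind))

# Wind direction: everything after "mph, "
wind_dir <- sub("^.*mph, ", "", wind)
```

This version would change the forecast values shown in the tables in this report (for example "partly" would become "partly cloudy"), so it is only a suggestion for a later revision.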

Exploratory Data Analysis

As part of the exploratory data analysis, I created boxplots to look for outliers. There was one variable, delay (in minutes), that had an outlier.

# Check for outliers
boxplot(games$attendance)
boxplot(games$away_final_score)
boxplot(games$elapsed_time)
boxplot(games$home_final_score)
boxplot(games$temp)
boxplot(games$wind_speed)
boxplot(pitches$end_speed)
boxplot(pitches$pitch_num)
boxplot(pitches$spin_rate)
boxplot(pitches$start_speed)
boxplot(games$delay)

The following code looks at the specific outlier observation(s).

# Cook's distance >> Outlier Observation = 9608
model <- lm(delay ~ ., data = games)
plot(model, 4)

kable(games[9608,1:19], format = "html", caption = "Games Outlier") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10) %>%
  column_spec(5, width_min = ".7in") %>%
  column_spec(11:15, width_min = "1in") %>%
  scroll_box(width = "1000px")
Games Outlier
attendance away_final_score away_team date elapsed_time g_id home_final_score home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name temp forecast wind_speed wind_dir delay
9608 36508 6 mil 2018-09-24 229 201802339 4 sln 7:16 PM Will Little Ted Barrett Mark Carlson Lance Barksdale Busch Stadium 78 cloudy 5 Out to LF 1860
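
The row number of the most influential observation can also be pulled out directly instead of being read off the plot. The sketch below uses a small made-up data set (the `toy` values are hypothetical, with one extreme delay planted in row 17) and a single predictor to keep the model simple.

```r
# Toy data: 50 games with small delays, built without random numbers
toy <- data.frame(
  elapsed_time = seq(150, 248, by = 2),
  delay        = rep(c(0, 5, 10, 0, 0), times = 10)
)

# Plant one extreme delay, similar in size to the outlier found above
toy$delay[17] <- 1860

# Fit a simple linear model and compute Cook's distance for every row
fit   <- lm(delay ~ elapsed_time, data = toy)
cooks <- cooks.distance(fit)

# Row with the largest Cook's distance
worst <- which.max(cooks)
toy[worst, ]
```

The same two calls, cooks.distance() and which.max(), could be applied to the fitted games model to look up the outlier row by code.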

Lastly for data exploration, I created a plot to show the position of the ball as it crosses home plate for a single pitcher, Brent Suter. It is color coded by the result type of each pitch (Ball, Strike, or In Play). X = 0 means the ball went right down the middle of the plate and Z = 0 means the ball hit the ground.

# Plot pitches on home plate for pitcher, Brent Suter
bs_pitches <- filter(atbats, pitcher_id == 608718)
bs <-
  bs_pitches %>%
  left_join(pitches, by = "ab_id")
ggplot(bs, aes(x = px, y = pz, color = typeF)) + 
  geom_point() +
  ggtitle("Pitches on home plate for Brent Suter") +
  scale_color_manual(name = "Type",
                    labels = c("Ball", 
                               "Strike", 
                               "In Play"),
                    values = c("B" = "#f97970", 
                               "S" = "#9bf970", 
                               "X" = "#7cd5ff"))

Training vs Testing Data

First, I combined all four tables into one table. From there I split the data into 80% training and 20% testing.

# Combine all data
all <-
  atbats %>%
  left_join(players, by = c("pitcher_id" = "id"))
names(all)[13:14] <- c("p_first_name", "p_last_name") # specify pitcher names

all <- 
  all %>%
  left_join(players, by = c("batter_id" = "id"))
names(all)[15:16] <- c("b_first_name", "b_last_name") # specify batter names

all <-
  pitches %>%
  left_join(all, by = c("ab_id" = "ab_id"))

all <-
  all %>%
  left_join(games, by = "g_id")

# Split the data into training and testing datasets

set.seed(4188135)
index1 <- sample(nrow(all),nrow(all)*0.80)
mlb.train <- all[index1,]
mlb.test <- all[-index1,]
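
A quick check after splitting is to confirm that the two sets do not overlap and that the class mix looks similar in both. This is a minimal sketch on a made-up data frame (the `toy` table and its pitch types are hypothetical); the same checks could be run on mlb.train and mlb.test.

```r
set.seed(4188135)

# Toy data: 1000 rows with a made-up pitch type column
toy <- data.frame(
  id         = 1:1000,
  pitch_type = sample(c("FF", "SL", "CU"), 1000, replace = TRUE)
)

# 80% of the row numbers go to training, the rest to testing
idx   <- sample(nrow(toy), floor(0.80 * nrow(toy)))
train <- toy[idx, ]
test  <- toy[-idx, ]

# No row should appear in both sets
overlap <- length(intersect(train$id, test$id))

# Share of each pitch type in the two sets
train_mix <- round(prop.table(table(train$pitch_type)), 3)
test_mix  <- round(prop.table(table(test$pitch_type)), 3)
```

If the shares in train_mix and test_mix differ a lot for rare pitch types, a stratified split may be worth considering later.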

The following table shows a summary (first 5 rows) of the final training data set, which includes ~2.3 million records and 80 variables.

kable(mlb.train[1:5,], format = "html", caption = "MLB Final Data Set") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, font_size = 10) %>%
  scroll_box(width = "1000px")
MLB Final Data Set
ab_id ax ay az b_count b_score break_angle break_length break_y code end_speed nasty on_1b on_2b on_3b outs pfx_x pfx_z pitch_num pitch_type px pz s_count spin_dir spin_rate start_speed sz_bot sz_top type type_confidence vx0 vy0 vz0 x x0 y y0 z0 zone codeF pitch_typeF typeF batter_id event g_id inning o p_score p_throws pitcher_id stand top eventF p_first_name p_last_name b_first_name b_last_name attendance away_final_score away_team date elapsed_time home_final_score home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name temp forecast wind_speed wind_dir delay away_teamF home_teamF venue_nameF forecastF wind_dirF
1464899 2017012775 -20.787041 31.30529 -21.33352 2 1 40.8 6.2 23.8 C 87.3 53 TRUE FALSE FALSE 1 -10.6718377 5.565410 4 FT 0.4216569 1.785570 1 242.457 2451.217 95.7 1.600000 3.560000 S 2.00 6.285478 -138.7556 -5.171628 100.93 -0.4876320 190.57 50 5.096618 9 C FT S 516770 Pop Out 201700169 5 2 0 R 593372 R FALSE Pop Out Carlos Martinez Starlin Castro 43031 2 sln 2017-04-15 185 3 nya 1:09 PM Jeff Kellogg Tim Timmons James Hoye Will Little Yankee Stadium 58 cloudy 10 R to L 0 sln nya Yankee Stadium cloudy R to L
2718076 2018150880 -10.334074 30.10481 -12.20729 2 6 32.8 3.6 23.8 F 87.0 47 FALSE FALSE FALSE 1 -5.3514130 10.339615 5 FF 0.4122097 1.682453 2 207.363 2364.323 95.1 1.916983 3.848036 S 2.00 5.372219 -137.9558 -9.699538 101.29 -0.8632145 193.35 50 6.062150 14 F FF S 641355 Strikeout 201801974 5 2 2 R 572750 L TRUE Strikeout Eddie Butler Cody Bellinger 30123 8 lan 2018-08-28 199 4 tex 7:09 PM Sean Barber Ted Barrett Will Little Lance Barksdale Globe Life Park in Arlington 95 clear 16 In from CF 0 lan tex Globe Life Park in Arlington clear In from CF
2687377 2018142976 -1.091764 20.82640 -39.24054 0 5 0.7 11.6 23.9 B 73.9 57 FALSE FALSE FALSE 1 -0.7945376 -5.142681 1 CU 1.3365320 2.262686 0 351.219 892.602 79.8 1.702146 3.648827 B 2.00 3.936309 -116.2180 -1.030206 66.00 -0.2724331 177.66 50 6.423290 14 B CU B 607345 Groundout 201801870 5 2 3 R 543118 R TRUE Groundout Oliver Drake Kevan Smith 23431 8 cha 2018-08-20 204 5 min 6:43 PM Manny Gonzalez Laz Diaz Jeff Nelson Nick Mahrley Target Field 77 cloudy 10 In from LF 33 cha min Target Field cloudy In from LF
1054010 2016091610 3.070000 24.07000 -30.25000 1 0 -3.1 7.8 23.9 B 80.3 59 FALSE FALSE FALSE 1 1.9000000 1.140000 3 SL -1.2700000 1.480000 1 122.076 416.968 86.2 1.680000 3.560000 B 0.66 -8.080000 -126.0600 -4.010000 165.49 1.7200000 198.79 50 5.520000 13 B SL B 408045 Walk 201601200 4 1 0 L 527048 L FALSE Walk Martin Perez Joe Mauer 25530 3 tex 2016-07-01 184 2 min 7:10 PM Lance Barrett Dan Iassogna Dale Scott Bob Davidson Target Field 73 partly 3 Varies 0 tex min Target Field partly Varies
931710 2016059721 5.186000 30.71000 -24.12700 3 0 -13.7 6.4 23.7 F 78.3 45 FALSE FALSE TRUE 2 3.2600000 5.010000 6 SL 0.5970000 3.304000 2 147.201 1096.882 86.4 1.540000 3.450000 S 2.00 3.087000 -126.6210 -0.347000 94.24 -1.0700000 149.57 50 5.408000 3 F SL S 607680 Groundout 201600786 2 3 0 R 547888 R FALSE Groundout Masahiro Tanaka Kevin Pillar 39512 0 nya 2016-06-01 177 7 tor 7:07 PM Jim Reynolds Scott Barry CB Bucknor Fieldin Culbreth Rogers Centre 62 clear 14 R to L 0 nya tor Rogers Centre clear R to L

What I don’t know right now:

I will still need to remove some columns so that the data is easier to work with for modeling and analysis. The plan is to use a variable selection method to choose only the most important variables. From there, I can run more advanced machine learning techniques in order to predict what type of pitch will be thrown next. I will also be able to create more meaningful plots and tables. Below is the outline for the remainder of the project.

Preliminary Models

[Placeholder]

Variable Selection

[Placeholder]

Advanced Models

[Placeholder]

Conclusion

[Placeholder]


Appendix

A1. Data

atbats data

ab_id - at-bat ID (first 4 digits are year)

batter_id - player ID of the batter (player names found in player_names.csv)

event - description of the result of the at-bat

g_id - game ID (first 4 digits are year)

inning - inning number

o - number of outs after this at-bat

p_score - score for the pitcher’s team

p_throws - which hand pitcher throws with (single character, R or L)

pitcher_id - player ID of the pitcher (player names found in player_names.csv)

stand - which side batter hits on (single character, R or L)

top - True if it’s the top of the inning / False if it’s the bottom
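
Since the first four digits of ab_id and g_id are the year, the season can be recovered with integer division. This is a small sketch using ID values that appear in the tables above; it assumes ab_id always has 10 digits and g_id always has 9 digits, as in those rows.

```r
# IDs copied from the sample rows shown earlier
ab_id <- c(2017012775, 2018150880, 2016091610)
g_id  <- c(201700169, 201801974, 201601200)

# Drop the last 6 digits of a 10-digit at-bat ID
ab_year <- ab_id %/% 1e6

# Drop the last 5 digits of a 9-digit game ID
g_year <- g_id %/% 1e5

# Print the IDs in full rather than in scientific notation (e.g. 2.015e+09)
format(ab_id, scientific = FALSE)
```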

games data

attendance - number of fans who attended (NOTE: for first game of doubleheaders, value is often erroneously 1 or 0)

away_final_score - final score for the visiting team

away_team - three-letter abbreviation for away team; third letter sometimes indicates league (National vs. American)

date - date of game

elapsed_time - length of game in minutes

g_id - game ID

home_final_score - final score for the home team

home_team - three-letter abbreviation for home team; third letter sometimes indicates league (National vs. American)

start_time - start time of game

umpire_1B - first and last name of the umpire at first base

umpire_2B - first and last name of the umpire at second base

umpire_3B - first and last name of the umpire at third base

umpire_HP - first and last name of the umpire at home plate

venue_name - name of stadium

weather - description of weather

wind - description of wind

delay - length of delay before game in minutes

pitches data

ab_id - at-bat ID

ax

ay

az

b_count - balls in the current count

b_score - score for the batter’s team

break_angle

break_length

break_y

code - records the result of the pitch (See A2)

end_speed - speed of the pitch when it reaches the plate

nasty

on_1b - True if there’s a runner on first, False if empty

on_2b - True if there’s a runner on second, False if empty

on_3b - True if there’s a runner on third, False if empty

outs - number of outs (before pitch is thrown)

pfx_x

pfx_z

pitch_num - pitch number (of at-bat)

pitch_type - type of pitch (See A3)

px - x-location as pitch crosses the plate (X=0 means right down the middle)

pz - z-location as pitch crosses the plate (Z=0 means the ground)

s_count - strikes in the current count

spin_dir - direction in which pitch is spinning, measured in degrees

spin_rate - the pitch’s spin rate, measured in RPM

start_speed - speed of the pitch just as it’s thrown

sz_bot

sz_top

type - simplified code: S (strike) B (ball) or X (in play)

type_confidence - confidence in pitch_type classification (unsure what 2 means)

vx0

vy0

vz0

x

x0

y

y0

z0

zone

players data

id - player ID (matches with batter_id and pitcher_id)

first_name - first name

last_name - last name

A2. Pitch Result Codes

B - Ball

*B - Ball in dirt

S - Swinging Strike

C - Called Strike

F - Foul

T - Foul Tip

L - Foul Bunt

I - Intentional Ball

W - Swinging Strike (Blocked)

M - Missed Bunt

P - Pitchout

Q - Swinging pitchout

R - Foul pitchout

Values that only occur on last pitch of at-bat:

X - In play, out(s)

D - In play, no out

E - In play, runs

H - Hit by pitch

Note: all codes, except for H, come directly from the XML files. All at-bats with code H were given no code in the XMLs.

A3. Pitch Types

CH - Changeup

CU - Curveball

EP - Eephus

FC - Cutter

FF - Four-seam Fastball

FO - Pitchout (also PO)

FS - Splitter

FT - Two-seam Fastball

IN - Intentional ball

KC - Knuckle curve

KN - Knuckleball

PO - Pitchout (also FO)

SC - Screwball

SI - Sinker

SL - Slider

UN - Unknown