Introduction

The data I chose contains MLB pitch-level data for every pitch thrown from 2015 through 2018. It is broken into four separate data sets: one for at-bat details, one for general game information, one for pitch-level details, and one for player details.

The following table shows the dimensions of each table:
Table Dimensions
Table     Variables   Observations
atbats    11          740,389
games     19          9,699
pitches   39          2,852,973
players   3           2,218

For this project, I chose to analyze a specific pitcher, Brent Suter. In the real world, pitchers and coaches could both benefit from this data by learning where they can improve. For example, knowing where pitches are placed around the strike zone can improve strategy.

Further in the future, another good idea would be to use this data to train machine learning models that predict which pitch is coming next. If a coach could signal to the batter which pitch he expects, the batter would be better prepared to hit the ball.

Overall, baseball players and coaches can use this data in several ways, from studying pitch placement to anticipating pitch selection.

Click here to view the original source of data.

Packages Required

The following packages are required for this project:

library(dplyr)        # Used as a fast, consistent tool for working with data frame like objects
library(data.table)   # Used for a fast and friendly way to load the datasets
library(tidyr)        # Used specifically for data tidying
library(stringr)      # Used for string manipulation, such as splitting a string
library(ggplot2)      # Used for creating elegant data visualizations
library(knitr)        # Used for dynamic report generation
library(kableExtra)   # Additional features for knitr 'kable' function

Data Preparation

To begin, I loaded the data files into R using the ‘fread’ function from the data.table package.
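A minimal sketch of the loading step. Only player_names.csv is named in the appendix; the temporary file below is a stand-in written on the spot so the example runs without the real data, and the other file paths would have to be filled in by the reader.

```r
library(data.table)

# Write a tiny stand-in for player_names.csv so the example runs on its own;
# in the project, `path` would point at the real file instead.
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,first_name,last_name",
    "452657,Jon,Lester",
    "425794,Adam,Wainwright"),
  path
)

# fread() detects the separator and the column types automatically
players <- fread(path)

dim(players)       # number of rows and columns
head(players, 5)   # first rows (at most 5)
```

The same call, repeated once per file, would load the other three tables.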

Next, I got a feel for the data by viewing the first 5 rows of each table.

At Bats Data
ab_id batter_id event g_id inning o p_score p_throws pitcher_id stand top
2.015e+09 572761 Groundout 201500001 1 1 0 L 452657 L TRUE
2.015e+09 518792 Double 201500001 1 1 0 L 452657 L TRUE
2.015e+09 407812 Single 201500001 1 1 0 L 452657 R TRUE
2.015e+09 425509 Strikeout 201500001 1 2 0 L 452657 R TRUE
2.015e+09 571431 Strikeout 201500001 1 3 0 L 452657 L TRUE
Games Data
attendance away_final_score away_team date elapsed_time g_id home_final_score home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name weather wind delay
35055 3 sln 2015-04-05 184 201500001 0 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 degrees, clear 7 mph, In from CF 0
45909 1 ana 2015-04-06 153 201500002 4 sea 1:12 PM Ron Kulpa Brian Knight Vic Carapazza Larry Vanover Safeco Field 54 degrees, cloudy 1 mph, Varies 0
36969 2 atl 2015-04-06 156 201500003 1 mia 4:22 PM Laz Diaz Chris Guccione Cory Blaser Jeff Nelson Marlins Park 80 degrees, partly cloudy 16 mph, In from CF 16
31042 6 bal 2015-04-06 181 201500004 2 tba 3:12 PM Ed Hickox Paul Nauert Mike Estabrook Dana DeMuth Tropicana Field 72 degrees, dome 0 mph, None 0
45549 8 bos 2015-04-06 181 201500005 0 phi 3:08 PM Phil Cuzzi Tony Randazzo Will Little Gerry Davis Citizens Bank Park 71 degrees, partly cloudy 11 mph, Out to RF 0
Pitches Data
ab_id ax ay az b_count b_score break_angle break_length break_y code end_speed nasty on_1b on_2b on_3b outs pfx_x pfx_z pitch_num pitch_type px pz s_count spin_dir spin_rate start_speed sz_bot sz_top type type_confidence vx0 vy0 vz0 x x0 y y0 z0 zone
2.015e+09 7.665 34.685 -11.960 0 0 -25.0 3.2 23.7 C 84.1 55 FALSE FALSE FALSE 0 4.16 10.93 1 FF 0.416 2.963 0 159.235 2305.052 92.9 1.72 3.56 S 2 -6.409 -136.065 -3.995 101.1400 2.280 158.7800 50 5.302 3
2.015e+09 12.043 34.225 -10.085 0 0 -40.7 3.4 23.7 S 84.1 31 FALSE FALSE FALSE 0 6.57 12.00 2 FF -0.191 2.347 1 151.402 2689.935 92.8 1.72 3.56 S 2 -8.411 -135.690 -5.980 124.2800 2.119 175.4100 50 5.307 5
2.015e+09 14.368 35.276 -11.560 0 0 -43.7 3.7 23.7 F 85.2 49 FALSE FALSE FALSE 0 7.61 10.88 3 FF -0.518 3.284 2 145.125 2647.972 94.1 1.72 3.56 S 2 -9.802 -137.668 -3.337 136.7400 2.127 150.1100 50 5.313 1
2.015e+09 2.104 28.354 -20.540 0 0 -1.3 5.0 23.8 B 84.0 41 FALSE FALSE FALSE 0 1.17 6.45 4 FF -0.641 1.221 2 169.751 1289.590 91.0 1.74 3.35 B 2 -8.071 -133.005 -6.567 109.6856 2.279 187.4635 50 5.210 13
2.015e+09 -10.280 21.774 -34.111 1 0 18.4 12.0 23.8 B 69.6 18 FALSE FALSE FALSE 0 -8.43 -1.65 5 CU -1.821 2.083 2 280.671 1374.569 75.4 1.72 3.56 B 2 -6.309 -110.409 0.325 146.5275 2.179 177.2428 50 5.557 13
Players Data
id first_name last_name
452657 Jon Lester
425794 Adam Wainwright
457435 Phil Coke
435400 Jason Motte
519166 Neil Ramirez

It is also a good idea to check for duplicates and missing values. In this case, there were no duplicates; however, some missing values needed to be taken care of.
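One way the duplicate check could look, sketched in base R on three rows of the players table shown above. The exact check used in the project is not shown, so this is an assumption about the approach rather than the original code.

```r
# Three rows copied from the players table shown above
players <- data.frame(
  id         = c(452657, 425794, 457435),
  first_name = c("Jon", "Adam", "Phil"),
  last_name  = c("Lester", "Wainwright", "Coke")
)

# duplicated() flags every row that exactly repeats an earlier row,
# so the sum is the number of duplicate rows
n_dupes <- sum(duplicated(players))

# If any duplicates turned up, unique() would keep one copy of each row
players <- unique(players)
```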

# Count the missing values in each column of 'pitches'
colSums(is.na(pitches))
##           ab_id              ax              ay              az 
##               0           14189           14189           14189 
##         b_count         b_score     break_angle    break_length 
##               0               0           14189           14189 
##         break_y            code       end_speed           nasty 
##           14189               0           14114           14189 
##           on_1b           on_2b           on_3b            outs 
##               0               0               0               0 
##           pfx_x           pfx_z       pitch_num      pitch_type 
##           14142           14142               0               0 
##              px              pz         s_count        spin_dir 
##           14189           14189               0           14189 
##       spin_rate     start_speed          sz_bot          sz_top 
##           14189           14114            2083            2083 
##            type type_confidence             vx0             vy0 
##               0           14189           14189           14189 
##             vz0               x              x0               y 
##           14189               0           14189               0 
##              y0              z0            zone 
##           14189           14189           14189
# Drop every row that has a missing value in any column
pitches <- na.omit(pitches)

There is also some bad data when it comes to doubleheader games. For some of the doubleheaders, the first game shows an attendance of 0 or 1. These rows were removed.
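A small sketch of that filter with dplyr. The game IDs and the two bad attendance values are made up for illustration, and the original filtering code is not shown, so this is one plausible form of it.

```r
library(dplyr)

# Toy games table: hypothetical IDs, with made-up bad attendance values (0 and 1)
games <- data.frame(
  g_id       = c(1, 2, 3),
  attendance = c(35055, 0, 1)
)

# Keep only games with a plausible attendance figure
games <- games %>%
  filter(attendance > 1)
```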

I converted some variables to factors, including event, away team, home team, venue, code, pitch type, and play type, and split a few columns into two separate columns (weather split into temperature and forecast, and wind split into wind speed and wind direction).
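A sketch of the conversion and the column splits using tidyr's separate(), with two weather and wind strings taken from the games table above. The separator patterns are my assumption; splitting on " degrees, " and " mph, " rather than on whitespace keeps multi-word values such as "partly cloudy" and "In from CF" in one piece.

```r
library(dplyr)
library(tidyr)

# Two rows with values taken from the games table shown earlier
games <- data.frame(
  home_team = c("chn", "mia"),
  weather   = c("44 degrees, clear", "80 degrees, partly cloudy"),
  wind      = c("7 mph, In from CF", "16 mph, In from CF"),
  stringsAsFactors = FALSE
)

games <- games %>%
  # "44 degrees, clear" -> temp = 44, forecast = "clear"
  separate(weather, into = c("temp", "forecast"),
           sep = " degrees, ", convert = TRUE) %>%
  # "7 mph, In from CF" -> wind_speed = 7, wind_dir = "In from CF"
  separate(wind, into = c("wind_speed", "wind_dir"),
           sep = " mph, ", convert = TRUE) %>%
  # Categorical columns become factors
  mutate(home_team = as.factor(home_team))
```

Setting convert = TRUE turns the numeric pieces (temp and wind_speed) into integers instead of leaving them as text.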

Updated Weather and Wind Columns
temp   forecast   wind_speed   wind_dir
44     clear      7            In from CF
54     cloudy     1            Varies
80     partly     16           In from CF
72     dome       0            None
71     partly     11           Out to RF

Additionally, I created boxplots to look for outliers. There was one variable, delay (in minutes), that had an outlier of 1860 minutes.

The following is a look at the specific outlier observation(s), which I removed from the data set.

Games Outlier
attendance away_final_score away_team date elapsed_time g_id home_final_score home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name temp forecast wind_speed wind_dir delay
9608 36508 6 mil 2018-09-24 229 201802339 4 sln 7:16 PM Will Little Ted Barrett Mark Carlson Lance Barksdale Busch Stadium 78 cloudy 5 Out to LF 1860
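A sketch of how the outlier could be located and dropped, using three delay values that appear in the tables above. The one-day cutoff (max_delay) is an assumption chosen for illustration, not a threshold taken from the original analysis.

```r
library(dplyr)
library(ggplot2)

# Game IDs and delays taken from rows shown in this report
games <- data.frame(
  g_id  = c(201500001, 201500003, 201802339),
  delay = c(0, 16, 1860)
)

# Boxplot of the delay column; print(delay_boxplot) draws it
delay_boxplot <- ggplot(games, aes(y = delay)) +
  geom_boxplot() +
  labs(y = "Delay (minutes)")

# Assumed cutoff: a delay longer than a full day is treated as bad data
max_delay <- 24 * 60

outlier_rows <- games %>% filter(delay > max_delay)   # rows to inspect
games        <- games %>% filter(delay <= max_delay)  # rows to keep
```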

Combine into One Table

Lastly, I decided to combine all four tables into one table.

The following table shows a summary (first 5 rows) of the “all” data set, which includes ~3 million records and 39 variables.

MLB All Data Set
ab_id b_count b_score on_1b on_2b on_3b outs pitch_num pitch_type s_count batter_id event g_id inning o p_score p_throws pitcher_id stand top p_first_name p_last_name b_first_name b_last_name attendance away_team date home_team start_time umpire_1B umpire_2B umpire_3B umpire_HP venue_name temp forecast wind_speed wind_dir delay
2.015e+09 0 0 FALSE FALSE FALSE 0 1 FF 0 572761 Groundout 201500001 1 1 0 L 452657 L TRUE Jon Lester Matt Carpenter 35055 sln 2015-04-05 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 clear 7 In from CF 0
2.015e+09 0 0 FALSE FALSE FALSE 0 2 FF 1 572761 Groundout 201500001 1 1 0 L 452657 L TRUE Jon Lester Matt Carpenter 35055 sln 2015-04-05 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 clear 7 In from CF 0
2.015e+09 0 0 FALSE FALSE FALSE 0 3 FF 2 572761 Groundout 201500001 1 1 0 L 452657 L TRUE Jon Lester Matt Carpenter 35055 sln 2015-04-05 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 clear 7 In from CF 0
2.015e+09 0 0 FALSE FALSE FALSE 0 4 FF 2 572761 Groundout 201500001 1 1 0 L 452657 L TRUE Jon Lester Matt Carpenter 35055 sln 2015-04-05 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 clear 7 In from CF 0
2.015e+09 1 0 FALSE FALSE FALSE 0 5 CU 2 572761 Groundout 201500001 1 1 0 L 452657 L TRUE Jon Lester Matt Carpenter 35055 sln 2015-04-05 chn 7:17 PM Mark Wegner Marty Foster Mike Muchlinski Mike Winters Wrigley Field 44 clear 7 In from CF 0
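The combination can be sketched as a chain of left joins on the shared ID columns. The toy rows reuse values from the tables above, except the full ab_id value, which is hypothetical because the report only displays it in scientific notation. The helper tables pitcher_names and batter_names and the name all_data are mine; I avoid calling the result all so that it does not shadow the base R function of that name.

```r
library(dplyr)

# Toy versions of the four tables (ab_id value is hypothetical)
pitches <- data.frame(ab_id = c(2015000001, 2015000001),
                      pitch_num = c(1, 2),
                      pitch_type = c("FF", "FF"))
atbats  <- data.frame(ab_id = 2015000001, g_id = 201500001,
                      batter_id = 572761, pitcher_id = 452657,
                      event = "Groundout")
games   <- data.frame(g_id = 201500001, venue_name = "Wrigley Field")
players <- data.frame(id = c(452657, 572761),
                      first_name = c("Jon", "Matt"),
                      last_name = c("Lester", "Carpenter"))

# The players table is joined twice, once per role, so rename its columns first
pitcher_names <- players %>%
  rename(pitcher_id = id, p_first_name = first_name, p_last_name = last_name)
batter_names <- players %>%
  rename(batter_id = id, b_first_name = first_name, b_last_name = last_name)

# One row per pitch, enriched with at-bat, player, and game details
all_data <- pitches %>%
  left_join(atbats, by = "ab_id") %>%
  left_join(pitcher_names, by = "pitcher_id") %>%
  left_join(batter_names, by = "batter_id") %>%
  left_join(games, by = "g_id")
```

Because every join is a left join starting from pitches, the result keeps exactly one row per pitch.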

Exploratory Data Analysis

Brent Suter

For the exploratory data analysis, I am focusing on a specific pitcher, Brent Suter. First, I created a pie chart to show his most common pitch types.
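A sketch of how such a pie chart can be built with ggplot2, which has no dedicated pie geom: a single stacked bar is wrapped into a circle with coord_polar(). The rows below are made up for illustration and do not reflect Brent Suter's real pitch mix; the column names follow the combined table.

```r
library(dplyr)
library(ggplot2)

# Made-up rows in the shape of the combined table
all_data <- data.frame(
  p_first_name = c("Brent", "Brent", "Brent", "Brent", "Jon"),
  p_last_name  = c("Suter", "Suter", "Suter", "Suter", "Lester"),
  pitch_type   = c("FF", "FF", "CH", "SL", "FF")
)

# Keep one pitcher, then count his pitches by type
suter <- all_data %>%
  filter(p_first_name == "Brent", p_last_name == "Suter")
pitch_counts <- suter %>%
  count(pitch_type, name = "n_pitches")

# One stacked bar wrapped around a circle gives a pie chart
pie_chart <- ggplot(pitch_counts,
                    aes(x = "", y = n_pitches, fill = pitch_type)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, fill = "Pitch type")
```

Calling print(pie_chart) draws the figure.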

The pie chart shows that Brent sticks to six different pitch types and mostly throws Four-seam Fastballs. Changeups and Sliders are his next most common choices. The frequency plot below shows how many pitches of each type he throws specifically on the first pitch of each at-bat.
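The first-pitch plot can be sketched by keeping only rows where pitch_num equals 1 and letting geom_bar() count them. The rows are again made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up pitches for one pitcher across three at-bats
suter <- data.frame(
  ab_id      = c(1, 1, 2, 2, 3),
  pitch_num  = c(1, 2, 1, 2, 1),
  pitch_type = c("FF", "CH", "FF", "SL", "CH")
)

# The first pitch of each at-bat has pitch_num == 1
first_pitches <- suter %>%
  filter(pitch_num == 1)

# geom_bar() counts the rows in each pitch_type group
first_pitch_plot <- ggplot(first_pitches, aes(x = pitch_type)) +
  geom_bar() +
  labs(x = "Pitch type", y = "Number of first pitches")
```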

The following graph shows the position of the ball as it crosses home plate for Brent's pitches. It is color-coded by the type of pitch thrown. X = 0 means the ball went right down the middle of the plate, and Z = 0 means the ball hit the ground.

I also created a similar plot that is color-coded by the pitch outcome (Strike, Ball, and In Play). Again, X = 0 means the ball went right down the middle of the plate, and Z = 0 means the ball hit the ground.
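Both location plots can be sketched as scatter plots of px against pz that differ only in the column mapped to color. The four rows are taken from the pitches sample shown earlier (so they are not Brent Suter's pitches), and the styling choices are my own assumptions.

```r
library(ggplot2)

# px, pz, pitch_type, and type copied from rows of the pitches table above
locations <- data.frame(
  px         = c(0.416, -0.191, -0.641, -1.821),
  pz         = c(2.963, 2.347, 1.221, 2.083),
  pitch_type = c("FF", "FF", "FF", "CU"),
  type       = c("S", "S", "B", "B")
)

# Shared base: horizontal location vs. height, with equal axis scales
base_plot <- ggplot(locations, aes(x = px, y = pz)) +
  coord_fixed() +
  labs(x = "px (0 = middle of the plate)", y = "pz (0 = ground)")

# Same points, colored by pitch type and by outcome respectively
by_pitch_type <- base_plot + geom_point(aes(color = pitch_type))
by_outcome    <- base_plot + geom_point(aes(color = type))
```

Using coord_fixed() keeps one unit on the x axis the same length as one unit on the y axis, so distances around the plate are not distorted.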

Summary

In conclusion, this data can be very beneficial to baseball players and coaches. They can gain insights, such as where pitches land for each type of pitch thrown, that improve a batter's strategy when facing a specific pitcher.

Using the 2015-2018 MLB data, I created several plots of Brent Suter's pitching. They show that Brent throws six pitch types, mostly the Four-seam Fastball.

I struggled to create predictive models. Even when using only 10% of the data, I still ran into memory issues. In the future, I think it would be great to use this data to build machine learning models.
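For reference, a 10% random sample can be drawn with dplyr's slice_sample(). This is a sketch on generated data, with a seed value chosen arbitrarily; it is not the sampling code used in the project.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the sample is reproducible

# Generated stand-in for the combined table
all_data <- data.frame(
  ab_id      = 1:1000,
  pitch_type = sample(c("FF", "CH", "SL"), 1000, replace = TRUE)
)

# Draw 10% of the rows at random, without replacement
sampled <- all_data %>%
  slice_sample(prop = 0.1)
```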

Appendix

A1. Data

atbats data

ab_id - at-bat ID (first 4 digits are year)

batter_id - player ID of the batter (player names found in player_names.csv)

event - description of the result of the at-bat

g_id - game ID (first 4 digits are year)

inning - inning number

o - number of outs after this at-bat

p_score - score for the pitcher’s team

p_throws - which hand pitcher throws with (single character, R or L)

pitcher_id - player ID of the pitcher (player names found in player_names.csv)

stand - which side batter hits on (single character, R or L)

top - True if it’s the top of the inning / False if it’s the bottom

games data

attendance - number of fans who attended (NOTE: for first game of doubleheaders, value is often erroneously 1 or 0)

away_final_score - final score for the visiting team

away_team - three-letter abbreviation for away team; third letter sometimes indicates league (National vs. American)

date - date of game

elapsed_time - length of game in minutes

g_id - game ID

home_final_score - final score for the home team

home_team - three-letter abbreviation for home team; third letter sometimes indicates league (National vs. American)

start_time - start time of game

umpire_1B - first and last name of the umpire at first base

umpire_2B - first and last name of the umpire at second base

umpire_3B - first and last name of the umpire at third base

umpire_HP - first and last name of the umpire at home plate

venue_name - name of stadium

weather - description of weather

wind - description of wind

delay - length of delay before game in minutes

pitches data

ab_id - at-bat ID

ax

ay

az

b_count - balls in the current count

b_score - score for the batter’s team

break_angle

break_length

break_y

code - records the result of the pitch (See A2)

end_speed - speed of the pitch when it reaches the plate

nasty

on_1b - True if there’s a runner on first, False if empty

on_2b - True if there’s a runner on second, False if empty

on_3b - True if there’s a runner on third, False if empty

outs - number of outs (before pitch is thrown)

pfx_x

pfx_z

pitch_num - pitch number (of at-bat)

pitch_type - type of pitch (See A3)

px - x-location as pitch crosses the plate (X=0 means right down the middle)

pz - z-location as pitch crosses the plate (Z=0 means the ground)

s_count - strikes in the current count

spin_dir - direction in which pitch is spinning, measured in degrees

spin_rate - the pitch’s spin rate, measured in RPM

start_speed - speed of the pitch just as it’s thrown

sz_bot

sz_top

type - simplified code: S (strike) B (ball) or X (in play)

type_confidence - confidence in pitch_type classification (unsure what 2 means)

vx0

vy0

vz0

x

x0

y

y0

z0

zone

players data

id - player ID (matches with batter_id and pitcher_id)

first_name - first name

last_name - last name

A2. Pitch Result Codes

B - Ball

*B - Ball in dirt

S - Swinging Strike

C - Called Strike

F - Foul

T - Foul Tip

L - Foul Bunt

I - Intentional Ball

W - Swinging Strike (Blocked)

M - Missed Bunt

P - Pitchout

Q - Swinging pitchout

R - Foul pitchout

Values that only occur on last pitch of at-bat:

X - In play, out(s)

D - In play, no out

E - In play, runs

H - Hit by pitch

Note: all codes, except for H, come directly from the XML files. All at-bats with code H were given no code in the XMLs.

A3. Pitch Types

CH - Changeup

CU - Curveball

EP - Eephus

FC - Cutter

FF - Four-seam Fastball

FO - Pitchout (also PO)

FS - Splitter

FT - Two-seam Fastball

IN - Intentional ball

KC - Knuckle curve

KN - Knuckleball

PO - Pitchout (also FO)

SC - Screwball

SI - Sinker

SL - Slider

UN - Unknown
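The codes above can be turned into readable labels with a named character vector, for example when labeling plots. The vector name pitch_names is hypothetical; the code-to-name pairs are the ones listed in this section.

```r
# Lookup table built from the A3 list: names are codes, values are labels
pitch_names <- c(
  CH = "Changeup",          CU = "Curveball",
  EP = "Eephus",            FC = "Cutter",
  FF = "Four-seam Fastball", FO = "Pitchout",
  FS = "Splitter",          FT = "Two-seam Fastball",
  IN = "Intentional ball",  KC = "Knuckle curve",
  KN = "Knuckleball",       PO = "Pitchout",
  SC = "Screwball",         SI = "Sinker",
  SL = "Slider",            UN = "Unknown"
)

# Indexing by code returns the matching label for each pitch
codes  <- c("FF", "CU", "FF")
labels <- unname(pitch_names[codes])
```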