When should you expect to see a fastball versus an off-speed pitch?
At what velocity should you expect a pitch to be thrown?
These questions are important for batters and pitchers alike. Batters can have an idea of what to expect through their at bat. Pitchers can see where they stack up compared to the rest for a given pitch.
This analysis uses the MLB Pitch Data set, which covers every pitch thrown in the 2015-2018 MLB seasons, to visualize the distribution of pitch speeds and the proportion of each pitch type used through an at bat.
Some fun stats for all baseball fans are included as well!
library(readr) # import data
library(magrittr) # use pipe operator
library(tidyr) # tidy messy data
library(dplyr) # manipulate data
library(purrr) # apply functions to multiple variables
library(broom) # create data frames from other formats
library(ggplot2) # visualize data
The MLB Pitch Data 2015-2018 is posted on Kaggle. It was scraped from this webpage which is a part of MLB.com. This collection of data includes the following 4 data sets as csv files:
To start, the data is loaded into R.
atbats <- read_csv("data/atbats.csv")
games <- read_csv("data/games.csv")
pitches <- read_csv("data/pitches.csv")
player_names <- read_csv("data/player_names.csv")
Here are descriptions for each variable in these data sets. Because there are 70 columns within this data, any variable not used for analysis will not be described. However, all variables were initially considered when cleaning data and may be seen in those steps.
At Bats
Games
Pitches
Player Names
Data Cleaning
At Bats
The structure of the data should be evaluated and modified as necessary.
ID numbers are stored as numerics but don’t actually represent numeric values. Additionally, all true numeric values here should logically take only integer values. For example, there can’t be 2.3 outs in the 4.6th inning with 3.7 runs for the pitcher’s team!
To preserve the original imported data, a new data frame atbats_clean is used as changes are made to data types. The structure is then checked again to ensure the correct changes were made.
str(atbats)
atbats_clean <- atbats
atbats_clean$ab_id <- as.character(atbats_clean$ab_id)
atbats_clean$batter_id <- as.character(atbats_clean$batter_id)
atbats_clean$g_id <- as.character(atbats_clean$g_id)
atbats_clean$pitcher_id <- as.character(atbats_clean$pitcher_id)
atbats_clean$inning <- as.integer(atbats_clean$inning)
atbats_clean$o <- as.integer(atbats_clean$o)
atbats_clean$p_score <- as.integer(atbats_clean$p_score)
str(atbats_clean)
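The same kind of conversion can also be written in one step with dplyr's across(). The following is only a minimal sketch on a small made-up data frame: toy_atbats and its values are hypothetical, and only the column names mirror those in atbats. It is not the code used for this analysis.

```r
library(dplyr)

# Hypothetical toy data; only the column names match the real atbats data
toy_atbats <- tibble(
  ab_id      = c(2015000001, 2015000002),
  batter_id  = c(572761, 518792),
  g_id       = c(201500001, 201500001),
  pitcher_id = c(452657, 452657),
  inning     = c(1, 1),
  o          = c(1, 2),
  p_score    = c(0, 0)
)

toy_clean <- toy_atbats %>%
  mutate(
    # IDs are labels rather than quantities, so store them as character
    across(c(ab_id, batter_id, g_id, pitcher_id), as.character),
    # innings, outs and runs can only take whole-number values
    across(c(inning, o, p_score), as.integer)
  )

str(toy_clean)
```

Grouping the columns by target type keeps the intent of each conversion visible and avoids repeating the data frame name on every line.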
After adjusting the structure of the data, summary statistics will give a better idea of the data that will be worked with.
summary(atbats)
## ab_id batter_id event
## Min. :2.015e+09 Min. :112526 Length:740389
## 1st Qu.:2.016e+09 1st Qu.:457759 Class :character
## Median :2.017e+09 Median :519317 Mode :character
## Mean :2.017e+09 Mean :520223
## 3rd Qu.:2.018e+09 3rd Qu.:592273
## Max. :2.018e+09 Max. :673633
## g_id inning o p_score
## Min. :201500001 Min. : 1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:201600013 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.: 0.000
## Median :201700015 Median : 5.000 Median :2.000 Median : 1.000
## Mean :201651556 Mean : 5.008 Mean :1.677 Mean : 2.286
## 3rd Qu.:201800006 3rd Qu.: 7.000 3rd Qu.:2.000 3rd Qu.: 4.000
## Max. :201802431 Max. :19.000 Max. :3.000 Max. :25.000
## p_throws pitcher_id stand top
## Length:740389 Min. :112526 Length:740389 Mode :logical
## Class :character 1st Qu.:462136 Class :character FALSE:363106
## Mode :character Median :534910 Mode :character TRUE :377283
## Mean :526830
## 3rd Qu.:592836
## Max. :673633
The most important check here is for nonsensical numeric values, such as a negative value or more than 3 outs. Additionally, you can see that the top variable is a logical with more TRUEs than FALSEs, which makes sense because the bottom of the 9th (or of the final inning of the game) is often not played when the home team is leading.
To further check the character variables, the unique() function was used to confirm that the pitcher’s throwing hand and the batter’s stance take only the values L for left or R for right, and that all event descriptions made sense.
unique(atbats_clean$event)
unique(atbats_clean$p_throws)
unique(atbats_clean$stand)
Next, missing values were checked; there are none in the atbats data.
sum(is.na(atbats))
## [1] 0
Games
The games data was checked and cleaned in much the same way as atbats. Summary statistics can be seen below.
str(games)
games_clean <- games
games_clean$g_id <- as.character(games_clean$g_id)
games_clean$attendance <- as.integer(games_clean$attendance)
games_clean$away_final_score <- as.integer(games_clean$away_final_score)
games_clean$home_final_score <- as.integer(games_clean$home_final_score)
str(games_clean)
summary(games_clean)
## attendance away_final_score away_team date
## Min. : 0 Min. : 0.000 Length:9718 Min. :2015-04-05
## 1st Qu.:21791 1st Qu.: 2.000 Class :character 1st Qu.:2016-04-03
## Median :30113 Median : 4.000 Mode :character Median :2017-04-02
## Mean :29765 Mean : 4.372 Mean :2017-01-01
## 3rd Qu.:37953 3rd Qu.: 6.000 3rd Qu.:2018-03-29
## Max. :56310 Max. :24.000 Max. :2018-10-01
## elapsed_time g_id home_final_score home_team
## Min. : 75.0 Length:9718 Min. : 0.000 Length:9718
## 1st Qu.:167.0 Class :character 1st Qu.: 2.000 Class :character
## Median :182.0 Mode :character Median : 4.000 Mode :character
## Mean :184.8 Mean : 4.538
## 3rd Qu.:199.0 3rd Qu.: 6.000
## Max. :409.0 Max. :25.000
## start_time umpire_1B umpire_2B
## Length:9718 Length:9718 Length:9718
## Class1:hms Class :character Class :character
## Class2:difftime Mode :character Mode :character
## Mode :numeric
##
##
## umpire_3B umpire_HP venue_name
## Length:9718 Length:9718 Length:9718
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## weather wind delay
## Length:9718 Length:9718 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.000
## Mean : 3.886
## 3rd Qu.: 0.000
## Max. :1860.000
unique(games_clean$away_team)
unique(games_clean$home_team)
unique(games_clean$venue_name)
This data does have a few missing values, all of them in the umpire_2B variable. Because this variable isn’t used for analysis, all observations are kept in the data set.
sum(is.na(games))
## [1] 3
games %>%
map_df(~sum(is.na(.))) %>%
gather(variable, num_missing) %>%
filter(num_missing > 0)
## # A tibble: 1 x 2
## variable num_missing
## <chr> <int>
## 1 umpire_2B 3
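The per-column missing-value count above can also be written with summarise(across()) and pivot_longer(), which tidyr recommends over gather() for new code. This is a sketch under assumptions: toy_games and the umpire placeholders in it are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing umpire entry
toy_games <- tibble(
  g_id      = c("201500001", "201500002", "201500003"),
  umpire_1B = c("A", "B", "C"),
  umpire_2B = c("D", NA, "F")
)

missing_by_col <- toy_games %>%
  # one row holding the number of NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # reshape to one row per variable
  pivot_longer(everything(), names_to = "variable", values_to = "num_missing") %>%
  # keep only the variables that actually have missing values
  filter(num_missing > 0)

missing_by_col
```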
Pitches
After importing the pitches data and checking the Global Environment, you can see it’s a large data set with many variables. So, missing values were handled first and only the needed variables were selected.
The code below displays the total number of missing values and breaks down which variables they fall in. The variables to select were determined based on which variables had no missing values and which were needed regardless of missing values.
sum(is.na(pitches))
colSums(is.na(pitches))
pitches %>%
select(ab_id, b_count, end_speed, outs, on_1b, on_2b, on_3b, pitch_type, s_count, start_speed, type) %>%
map_df(~sum(is.na(.))) %>%
gather(variable, num_missing) %>%
filter(num_missing > 0)
## # A tibble: 3 x 2
## variable num_missing
## <chr> <int>
## 1 end_speed 14114
## 2 pitch_type 14189
## 3 start_speed 14114
Rows with missing values make up just under half a percent of all pitch data, so pitches_clean is created by omitting any observations with missing values. A check is then done to make sure all missing values were removed.
pitches_clean <- pitches %>%
select(ab_id, b_count, end_speed, outs, on_1b, on_2b, on_3b, pitch_type, s_count, start_speed, type) %>%
na.omit()
sum(is.na(pitches_clean))
## [1] 0
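One way to check how much data na.omit() will drop is to compute the share of incomplete rows first. The sketch below uses a small made-up data frame (toy_pitches is hypothetical) and base R only.

```r
# Hypothetical toy data: one of four rows has missing values
toy_pitches <- data.frame(
  start_speed = c(92.1, NA, 88.4, 95.0),
  end_speed   = c(84.0, NA, 81.2, 87.3),
  pitch_type  = c("FF", NA, "SL", "FF")
)

# complete.cases() is TRUE for rows with no missing values,
# so the mean of its negation is the share of rows na.omit() removes
incomplete_share <- mean(!complete.cases(toy_pitches))
incomplete_share

# na.omit() keeps exactly the complete rows
toy_pitches_clean <- na.omit(toy_pitches)
nrow(toy_pitches_clean)
```

Counting incomplete rows rather than missing cells matters here, because a single pitch can be missing several values at once.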
Here the same structure check and data type changes were performed as they were with the previous 2 data sets.
str(pitches_clean)
pitches_clean$ab_id <- as.character(pitches_clean$ab_id)
pitches_clean$b_count <- as.integer(pitches_clean$b_count)
pitches_clean$outs <- as.integer(pitches_clean$outs)
pitches_clean$s_count <- as.integer(pitches_clean$s_count)
Summary statistics are explored and checks of the character variables are performed as well.
summary(pitches_clean)
## ab_id b_count end_speed outs
## Length:2852973 Min. :0.0000 Min. :32.40 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:77.90 1st Qu.:0.0000
## Mode :character Median :1.0000 Median :82.50 Median :1.0000
## Mean :0.8807 Mean :81.36 Mean :0.9813
## 3rd Qu.:2.0000 3rd Qu.:85.40 3rd Qu.:2.0000
## Max. :4.0000 Max. :96.90 Max. :2.0000
## on_1b on_2b on_3b pitch_type
## Mode :logical Mode :logical Mode :logical Length:2852973
## FALSE:1989943 FALSE:2326343 FALSE:2582291 Class :character
## TRUE :863030 TRUE :526630 TRUE :270682 Mode :character
##
##
##
## s_count start_speed type
## Min. :0.0000 Min. : 33.90 Length:2852973
## 1st Qu.:0.0000 1st Qu.: 84.30 Class :character
## Median :1.0000 Median : 89.70 Mode :character
## Mean :0.8832 Mean : 88.38
## 3rd Qu.:2.0000 3rd Qu.: 93.00
## Max. :2.0000 Max. :105.00
unique(pitches_clean$type)
unique(pitches_clean$pitch_type)
From the summary, you can see that there are ball counts of 4. This does not make sense because when a pitch is thrown, there cannot already be 4 balls. Since there are only 14 instances of this, these observations were filtered out.
pitches_clean <- filter(pitches_clean, b_count < 4)
Pitch type is a big part of this analysis, so this variable needs to be explored further. Here you can see the proportion of each pitch type thrown from all 2852959 pitches.
pitch_type_table <- table(pitches_clean$pitch_type)
pitch_type_table %>%
prop.table() %>%
round(3) %>%
as_tibble() %>%
rename(pitch = Var1, proportion = n) %>%
arrange(desc(proportion))
## # A tibble: 18 x 2
## pitch proportion
## <chr> <dbl>
## 1 FF 0.356
## 2 SL 0.158
## 3 FT 0.118
## 4 CH 0.103
## 5 SI 0.085
## 6 CU 0.082
## 7 FC 0.052
## 8 KC 0.023
## 9 FS 0.015
## 10 KN 0.004
## 11 IN 0.002
## 12 AB 0
## 13 EP 0
## 14 FA 0
## 15 FO 0
## 16 PO 0
## 17 SC 0
## 18 UN 0
The Kaggle page linked above can be viewed to see which pitch each abbreviation corresponds to. Here pitch types will be broken down into 4 buckets:
breakingballs <- c("CU","KC","SC","SL")
changeups <- c("CH","KN","EP")
fastballs <- c("FC","FF","FS","FT","SI")
other_pitches <- c("FO","PO","IN","UN","AB","FA")
pitches_clean$pitch_type_b <- pitches_clean$pitch_type
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% fastballs] <- "FB"
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% breakingballs] <- "BB"
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% changeups] <- "CU"
pitches_clean$pitch_type_b[pitches_clean$pitch_type %in% other_pitches] <- "OT"
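The same bucketing can be expressed as a single case_when() call. This is a sketch on made-up data (toy is hypothetical), and it differs from the code above in one respect: any pitch type not listed in the first three buckets falls through to "OT" instead of keeping its original code. Note that the raw code "CU" (curveball) lands in the breaking ball bucket, while the bucket label "CU" is used here for changeups.

```r
library(dplyr)

breakingballs <- c("CU", "KC", "SC", "SL")
changeups     <- c("CH", "KN", "EP")
fastballs     <- c("FC", "FF", "FS", "FT", "SI")

# Hypothetical toy data covering each bucket
toy <- tibble(pitch_type = c("FF", "SL", "CH", "CU", "PO"))

toy <- toy %>%
  mutate(pitch_type_b = case_when(
    pitch_type %in% fastballs     ~ "FB",
    pitch_type %in% breakingballs ~ "BB",
    pitch_type %in% changeups     ~ "CU",
    TRUE                          ~ "OT"  # assumption: everything else is "other"
  ))

toy
```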
Here’s a view of the proportions for each of these new pitch types:
pitch_type_b_table <- table(pitches_clean$pitch_type_b)
pitch_type_b_table %>%
prop.table() %>%
round(3) %>%
as_tibble() %>%
rename(pitch = Var1, proportion = n) %>%
arrange(desc(proportion)) %>%
ggplot(aes(x = pitch, y = proportion, fill = pitch)) +
geom_bar(stat = "identity") +
geom_text(aes(label = proportion), vjust = -0.25) +
ggtitle("Pitch Type Proportions")
Player Names
This data set is simply an ID, first name, and last name. The id variable was renamed to player_id for clarity when joining data later. It was also changed to a character value, as all other IDs have been.
str(player_names)
sum(is.na(player_names))
player_names_clean <- player_names
names(player_names_clean)[1] <- "player_id"
player_names_clean$player_id <- as.character(player_names_clean$player_id)
str(player_names_clean)
First, the relationship between pitch velocity and pitch type will be explored.
A data frame speed_means is created to hold the mean velocity for each pitch type as well as its standard deviation. Then, the distribution and frequencies for each pitch type can be viewed.
speed_means <- pitches_clean %>%
subset(pitch_type_b %in% c("FB","BB","CU")) %>%
group_by(pitch_type_b) %>%
summarise(Mean = mean(start_speed),
SD = sd(start_speed))
ggplot(subset(pitches_clean, pitch_type_b %in% c("FB","BB","CU")), aes(x = start_speed, fill = pitch_type_b)) +
geom_histogram(binwidth = 1, color = "grey30") +
geom_vline(data = speed_means, aes(xintercept = Mean), linetype = "dashed") +
facet_grid(~ pitch_type_b) +
xlim(60,105) +
ylab("Frequency") +
xlab("Pitch Speed (mph)") +
ggtitle("Pitch Velocity by Pitch Type") +
scale_fill_discrete(name = "Pitch Type", labels = c("Breaking Balls", "Changeups", "Fastballs")) +
theme(panel.grid.minor = element_blank(),
axis.ticks = element_blank())
You can see that fastballs are thrown much more often than breaking balls or changeups. Additionally, fastballs have a tighter distribution. This makes sense as fastballs are more of a standard pitch to be thrown as fast as possible, whereas breaking balls and changeups are used to throw off the batter by having more motion on the ball or confusing their timing.
Now, when can we expect to see each of these pitches throughout an at bat?
A table can be created to see how many pitches are thrown at each combination of balls and strikes throughout the at bat.
count_table <- table(pitches_clean$b_count, pitches_clean$s_count)
(count_prop <- prop.table(count_table))
##
## 0 1 2
## 0 0.25822804 0.12827910 0.06466129
## 1 0.10140735 0.10241717 0.09478300
## 2 0.03487327 0.05296851 0.08074354
## 3 0.01108709 0.02198630 0.04856537
There are 12 possible ball-strike counts, from a 0-0 count to a 3-2 count. A new variable bs_count is created to record the count before each pitch.
pitches_clean$bs_count <- paste(pitches_clean$b_count, pitches_clean$s_count, sep = "-")
head(pitches_clean[,c("ab_id","bs_count")],7)
## # A tibble: 7 x 2
## ab_id bs_count
## <chr> <chr>
## 1 2015000001 0-0
## 2 2015000001 0-1
## 3 2015000001 0-2
## 4 2015000001 0-2
## 5 2015000001 1-2
## 6 2015000001 2-2
## 7 2015000002 0-0
Here you can see the result of creating this variable and how it flows from one pitch to the next through an at bat. This flows correctly as the first 2 pitches were strikes, followed by a foul ball which kept that count at 0-2, then two balls were thrown before the final pitch of the at bat when there was a 2-2 count.
This next section of code creates a table showing the pitch types for each ball-strike count.
type_count_table <- table(pitches_clean$pitch_type_b, pitches_clean$bs_count)
(type_count_prop <- prop.table(type_count_table))
##
## 0-0 0-1 0-2 1-0 1-1
## BB 6.546992e-02 3.817896e-02 2.377847e-02 2.036447e-02 2.849287e-02
## CU 1.976544e-02 1.573279e-02 6.062828e-03 1.335911e-02 1.453543e-02
## FB 1.723880e-01 7.427271e-02 3.475690e-02 6.715554e-02 5.931596e-02
## OT 6.046354e-04 9.463858e-05 6.309239e-05 5.282235e-04 7.290676e-05
##
## 1-2 2-0 2-1 2-2 3-0
## BB 3.474358e-02 3.860904e-03 1.049437e-02 2.555207e-02 2.274831e-04
## CU 1.066016e-02 3.158826e-03 6.975565e-03 1.006779e-02 2.001431e-04
## FB 4.931196e-02 2.729482e-02 3.546774e-02 4.507706e-02 1.007270e-02
## OT 6.729855e-05 5.587182e-04 3.084517e-05 4.661827e-05 5.867592e-04
##
## 3-1 3-2
## BB 2.075389e-03 1.019573e-02
## CU 1.453929e-03 4.886155e-03
## FB 1.842543e-02 3.346631e-02
## OT 3.154619e-05 1.717515e-05
The table shows each cell as a proportion of all 2852959 pitches. What we want instead is the proportion within each ball-strike count.
Here a data frame is created from the table and each value is divided by its column sum to get the within-count proportion. The data frame is then reshaped from wide to long format for visualizing in the next step.
typeVcount <- type_count_table %>%
tidy() %>%
select(pitch_type_b = Var1, bs_count = Var2, count = n) %>%
spread(bs_count, count)
typeVcount_colSums <- colSums(typeVcount[,-1])
for (i in 1:(ncol(typeVcount) - 1)) {
typeVcount[,i + 1] <- typeVcount[,i + 1]/typeVcount_colSums[i]
}
typeVcount_long <- typeVcount %>%
gather(bs_count, prop, -pitch_type_b)
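As an alternative sketch, base R's prop.table() can produce within-column proportions directly through its margin argument, which may be handy as a cross-check on the loop above. The toy data here is made up for illustration and is not the real pitch data.

```r
# Hypothetical toy data: six pitches across three counts
toy <- data.frame(
  pitch_type_b = c("FB", "FB", "BB", "FB", "CU", "BB"),
  bs_count     = c("0-0", "0-0", "0-0", "3-0", "3-0", "0-2")
)

toy_table <- table(toy$pitch_type_b, toy$bs_count)

# margin = 2 divides every cell by its column total,
# giving the share of each pitch type within each count
within_count <- prop.table(toy_table, margin = 2)
within_count

# each ball-strike count now sums to 1
colSums(within_count)
```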
Now that we have the data cleaned up and formatted correctly, we will visualize pitch type use throughout the count.
ggplot(subset(typeVcount_long, pitch_type_b %in% c("BB","CU","FB")),
aes(x = bs_count, y = prop, fill = pitch_type_b)) +
geom_bar(stat = "identity", position = position_dodge()) +
ylab("Proportion") +
xlab("Ball-Strike Count") +
scale_x_discrete(limits = c("0-0","0-1","1-0","0-2","1-1","2-0","1-2","2-1","3-0","2-2","3-1","3-2")) +
scale_fill_discrete(name = "Pitch Type", labels = c("Breaking Balls", "Changeups", "Fastballs")) +
ggtitle("Pitch Type Used through an At Bat")
As you step through each count, whether it’s a 1-pitch count (i.e., 0-1 or 1-0) or a 4-pitch count (i.e., 2-2 or 3-1), fastballs are used more often when the pitcher is at a disadvantage (generally meaning when there are more balls than strikes in the count). For example, across the 3-pitch counts from 1-2 through 3-0, there is a sharp drop-off in the use of breaking balls and changeups.
After finding these significant insights analyzing pitch types, here are some fun stats found within the data!
Fastest Pitch
Keeping with pitch analysis, here are the 3 instances of the top recorded pitch speed of 105 mph.
pitches_clean %>%
filter(start_speed == max(start_speed)) %>%
inner_join(atbats_clean, by = "ab_id") %>%
select(start_speed, pitch_type, batter_id, pitcher_id) %>%
inner_join(player_names_clean, by = c("pitcher_id" = "player_id")) %>%
inner_join(player_names_clean, by = c("batter_id" = "player_id"), suffix = c("_pitcher", "_batter")) %>%
select(MPH = start_speed, PitchType = pitch_type, PitcherFirst = first_name_pitcher, PitcherLast = last_name_pitcher, BatterFirst = first_name_batter, BatterLast = last_name_batter)
## # A tibble: 3 x 6
## MPH PitchType PitcherFirst PitcherLast BatterFirst BatterLast
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 105 FF Aroldis Chapman J.J. Hardy
## 2 105 SI Jordan Hicks Odubel Herrera
## 3 105 SI Jordan Hicks Odubel Herrera
You can see that all of these happened in the 9th inning when closers came out to finish the game. Closers don’t throw as many pitches as starters and other relief pitchers. The pitches they do throw are expected to be elite pitches and thrown at very high speeds as you can see here. Amazingly, two of these pitches were thrown by the same pitcher in the same at bat!
Here’s an article about this at bat: Hicks hits 105 mph – twice – on radar gun.
The slowest pitch of the at bat was 103.7 mph…
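The fastest-pitch lookup joins the names table twice, once for the pitcher and once for the batter, and relies on the suffix argument to tell the two sets of name columns apart. Here is a minimal sketch of that pattern on made-up data; toy_names, toy_pitch, and the names in them are hypothetical.

```r
library(dplyr)

# Hypothetical lookup table and a single pitch record
toy_names <- tibble(
  player_id  = c("1", "2"),
  first_name = c("Pat", "Sam"),
  last_name  = c("Pitcher", "Batter")
)
toy_pitch <- tibble(pitcher_id = "1", batter_id = "2", start_speed = 105)

result <- toy_pitch %>%
  # first join adds first_name and last_name for the pitcher
  inner_join(toy_names, by = c("pitcher_id" = "player_id")) %>%
  # second join collides on those names; suffix marks the existing
  # columns as the pitcher's and the new ones as the batter's
  inner_join(toy_names, by = c("batter_id" = "player_id"),
             suffix = c("_pitcher", "_batter"))

names(result)
```

Because the suffix is only applied when column names collide, it has to be given on the second join rather than the first.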
Game with the Highest Attendance
games_clean %>%
filter(attendance == max(attendance)) %>%
select(Date = date, Attendance = attendance, Away = away_team, Home = home_team)
## # A tibble: 1 x 4
## Date Attendance Away Home
## <date> <int> <chr> <chr>
## 1 2018-07-21 56310 sfn oak
Longest Game (most innings and elapsed time)
atbats_clean %>%
filter(inning == max(inning)) %>%
select(g_id) %>%
unique() %>%
inner_join(games_clean, by = "g_id") %>%
select(Date = date, elapsed_time, Away = away_team, Home = home_team, AwayScore = away_final_score, HomeScore = home_final_score) %>%
arrange(desc(elapsed_time))
## # A tibble: 3 x 6
## Date elapsed_time Away Home AwayScore HomeScore
## <date> <dbl> <chr> <chr> <int> <int>
## 1 2015-04-10 409 bos nya 6 5
## 2 2016-07-01 373 cle tor 2 1
## 3 2017-09-05 360 tor bos 2 3
Most Runs Scored in a Game
atbats_clean %>%
filter(p_score == max(p_score)) %>%
select(g_id) %>%
unique() %>%
inner_join(games_clean, by = "g_id") %>%
select(Date = date, Away = away_team, Home = home_team, AwayScore = away_final_score, HomeScore = home_final_score)
## # A tibble: 1 x 5
## Date Away Home AwayScore HomeScore
## <date> <chr> <chr> <int> <int>
## 1 2018-07-31 nyn was 4 25
When should you expect to see a fastball versus an off-speed pitch? At what velocity should you expect a pitch to be thrown?
We now know the answers to these questions.
The major insights found include the following:
This information can be useful for all baseball players whether preparing for an at bat or analyzing pitcher ability.