Introduction

For this project, my goal is to define pitcher deception and create a leaderboard of the most deceptive pitchers in baseball using my definition. So what defines pitcher deception? Here are a few definitions:

• The inability for a hitter to predict what is coming.
• A tactic or trait that makes a pitch harder to hit than its flight through the air would indicate.
• The effectiveness of a pitcher beyond what we’re able to directly observe or quantify.

But the way a pitcher is “deceptive” can come from a variety of factors, both before and after the pitch is released: Some examples of deception include:

• Hiding the baseball well
• Rarity in profile (Releasing the pitch from ankle height, lefties don’t have to throw as hard since they’re more scarcely seen)
• Changing time between deliveries, or altering the delivery itself
• Creating diversion with herky-jerky mechanics
• Slot changing, and ironically, consistency in arm angles across different pitch types
• High extension and releasing the pitch closer to home plate
• Tunneling
• Seam-Shifted Wake

Until markerless motion-capture systems become universally implemented, its hard to evaluate pitcher deception with anything other than the tracked pitch data we already have. We have to look at different aspects of pitching we deem significant to fooling a hitter, and try to quantify and score their importance. We can do this for each pitcher, so that we may compare them by their “deception_score”.

For my project, I’m going to take a look at four forms of deception and try to quantify and rate each pitcher based on these factors. The four forms of deception are:

• Unpredictability
• Indistinguishability
• Unexpectedness
• Flashiness

I’ll explain what each of these are as I go through my analysis, but I will calculate four scores for every pitcher, find their importance to predicting deception, and create a weighted sum that returns a single score to rate individual pitcher deception.

Reading in the Data

The first thing I’ll do is read in pitch level data for the 2023 MLB season and do some data manipulation to make a ‘count’ column and filter out any position players pitching. The pitch level data was acquired using the “scrape_statcast_savant” function from the ‘baseballr’ package. Player positions were found using player ID data acquired from razzball.com.

library(dplyr)

data23 = read.csv("C:\\Users\\phill\\OneDrive\\Desktop\\Job_Search\\mlb_data_23.csv")
data23$count = paste(data23$balls, data23$strikes, sep = '-')

positions = read.csv("C:\\Users\\phill\\OneDrive\\Desktop\\Job_Search\\razzball_11_20.csv")
positions = positions %>%
  select(MLBAMID, ESPN_POS)

data23 = data23 %>%
  inner_join(positions, by = c('pitcher'='MLBAMID')) %>%
  filter(ESPN_POS %in% c('P','SP','RP'))

1. Unpredictability

The first aspect of deception I will focus on is unpredictability, which is the pitcher’s ability to hide pitch tendencies well based on count. For example, a pitcher that throws 40% fastballs overall but 80% in a 1-0 count would be easier to predict that that same pitcher throwing a fastball in a 1-0 count also around 40% of the time. This deviation will be calculated for each pitcher by first finding usage rates overall and in each count. Then, the deviation for each pitch type usage between that particular count and overall usage will be found in each count, and averaged across pitches to find deviation in each count. I will then take a weighted sum where the weights are based on proportions of counts that appear across the data, thus finding a deviation score for each pitcher.

# For all counts
total_pitches_per_player <- data23 %>%
  group_by(player_name) %>%
  summarise(total_pitches = n(), .groups = 'drop')

data23_pitch_usage <- data23 %>%
  select(player_name, pitch_type) %>%
  group_by(player_name, pitch_type) %>%
  summarise(pitch_usage = n(), .groups = 'drop') %>%
  left_join(total_pitches_per_player, by = "player_name") %>%
  mutate(percent_usage = (pitch_usage / total_pitches) * 100) %>%
  arrange(player_name, desc(percent_usage))

# For specific counts
total_pitches_per_player_count <- data23 %>%
  group_by(player_name, count) %>%
  summarise(total_pitches = n(), .groups = 'drop')

data23_pitch_usage_count <- data23 %>%
  select(player_name, pitch_type, count) %>%
  group_by(player_name, pitch_type, count) %>%
  summarise(pitch_usage = n(), .groups = 'drop') %>%
  left_join(total_pitches_per_player_count, join_by(player_name, count)) %>%
  mutate(percent_usage = (pitch_usage / total_pitches) * 100) %>%
  arrange(player_name, count, pitch_type)

# Calculate deviation for each pitch in each count by pitcher
data23_pitch_deviation = data23_pitch_usage_count %>%
  left_join(data23_pitch_usage, join_by(player_name,pitch_type)) %>%
  mutate(deviation = abs(percent_usage.x - percent_usage.y))

# Find the averages of these deviations by count
data23_pitch_deviation_avg = data23_pitch_deviation %>%
  filter(!(count %in% c('4-1','4-2'))) %>%
  group_by(player_name, count) %>%
  summarise(avg_deviation = mean(deviation), .groups = 'drop')

# Find the proportions of each count across league
count_proportions = data23 %>%
  group_by(count) %>%
  filter(!(count %in% c('4-1','4-2'))) %>%
  summarise(proportion = n())

total_pitches = sum(count_proportions$proportion)
count_proportions$proportion = count_proportions$proportion / total_pitches

# Put the proportions (weights) in the same df as the averages and calculate
# Overall deviation for each pitcher
overall_deviation = data23_pitch_deviation_avg %>%
  left_join(count_proportions, by = 'count') %>%
  group_by(player_name) %>%
  summarise(overall_deviation = sum(avg_deviation * proportion))

2. Indistinguishability

The second aspect of deception I’ll focus on is indistiguishability, or “release point tunneling”. Essentially, it’s how well a pitcher repeats a similar release point across different pitches, as to make it more difficult to a batter to guess what pitch is coming. To do this, I will use pitch level data and look at the difference in release point on consecutive pitches based on Euclidean distance, for each pitcher. This is a more realistic approach, as a batter is more likely to remember the arm slot of a pitcher on the previous pitch than that same pitcher two months ago. I’ll also group by at bat, to handle pitchers that pitch off different sides of the rubber based on batter hand.

release_point_data = data23 %>%
  select(player_name, game_pk, at_bat_number, release_pos_x, release_pos_z) %>%
  group_by(player_name, game_pk, at_bat_number) %>%
  mutate(diff_x = release_pos_x - lag(release_pos_x, default = first(release_pos_x)),
         diff_z = release_pos_z - lag(release_pos_z, default = first(release_pos_z)))

release_change = release_point_data %>%
  summarise(player_name, game_pk, at_bat_number, distance = sqrt((diff_x)^2 + (diff_z)^2), .groups = 'drop')

# Remove the first pitch of each at bat since there's no previous value to compare it to, making it null
release_change <- release_change %>%
  group_by(player_name, game_pk) %>%
  filter(at_bat_number == lag(at_bat_number))
release_change = na.omit(release_change)

Another issue that can affect these difference values are pitchers who intentionally change their release point (e.g. Rich Hill, Clayton Kershaw). This can be considered a form of deception itself, but will return poor results to what I am trying to observe if I include them. Below, I plot the density of the release point change from pitch to pitch.

plot(density(release_change$distance), xlim = c(0,1), xlab = 'Euclidean Distance', main = 'Release Point Change')
abline(v = c(mean(release_change$distance), mean(release_change$distance) + sd(release_change$distance), 
       mean(release_change$distance) + 2*sd(release_change$distance), 
       mean(release_change$distance) + 3*sd(release_change$distance),
       mean(release_change$distance) - sd(release_change$distance)), col = c('red','blue','green','yellow', 'blue'))
text(0.2,0.5,'mean')
text(0.35,0.5,'1 SD')
text(0.5,0.5,'2 SD')
text(0.65,0.5,'3 SD')
text(0.05,0.5,'1 SD')

paste('3 sd from mean:',mean(release_change$distance) + 3*sd(release_change$distance))

## [1] "3 sd from mean: 0.648315499490077"

paste('max release change:',max(release_change$distance))

## [1] "max release change: 3.66083323848547"

Three standard deviations above the mean is about 0.65, meaning most of the data falls below this point, except for these outliers we referred to. Going off the density plot, it appears a good point to cutoff the data is around 0.8, as most values above this were likely intentional changes to release point.

release_change = release_change[release_change$distance <= 0.8,]

I can now find the average release point tunneling for each pitcher.

average_release_change = release_change %>%
  group_by(player_name) %>%
  summarise(avg_distance = mean(distance), .groups = 'drop')

3. Unexpectedness

The third aspect of deception I want to evaluate is unexpectedness. Hitters will see a certain release point from a pitcher and use their experience to generalize how they expect the pitch to move. The more a pitch moves differently than what a hitter expects from a certain arm angle, the more “deceptive” it is. To model this, I will first find the average release point and average movement for each pitcher for each pitch type. I’ll multiply the horizontal break and horizontal release point by -1 for lefties to put them and righties on the same scale.

data_righties = data23 %>%
  filter(p_throws == 'R')

data_lefties = data23 %>%
  filter(p_throws == 'L')

avg_rel_mov_right = data_righties %>%
  group_by(player_name, pitch_type) %>%
  summarise(avg_rel_x = mean(release_pos_x), avg_rel_z = mean(release_pos_z), avg_horz_break = mean(pfx_x), avg_vert_break = mean(pfx_z))

avg_rel_mov_left = data_lefties %>%
  group_by(player_name, pitch_type) %>%
  summarise(avg_rel_x = mean(release_pos_x)*(-1), avg_rel_z = mean(release_pos_z), avg_horz_break = (mean(pfx_x))*(-1), avg_vert_break = mean(pfx_z))

avg_rel_mov = rbind(avg_rel_mov_right, avg_rel_mov_left)
avg_rel_mov = na.omit(avg_rel_mov)

I am using average movement and release point to better represent a hitter’s memory. Hitters are more likely to remember what a pitcher looks like in general rather than how the movement changes pitch by pitch. For example, hitters might see a sidearm pitcher like Grant Anderson and expect his fastball to have more armside run and sink, where as a more over the top pitcher like Josh Sborz would be expected to have more ride on his fastball. Below, I will visualize release point vs movement for different pitch types and look for any relationships.

library(ggplot2)

ggplot(avg_rel_mov, aes(x = avg_rel_x, y = avg_horz_break)) +
  geom_point() +
  facet_wrap(~pitch_type) +  # Creates a separate plot for each pitch type
  labs(title = "Pitch Movement vs. Release Point by Pitch Type (Horizontal)",
       x = "Release Point X",
       y = "Horizontal Movement") +
  theme_minimal()

ggplot(avg_rel_mov, aes(x = avg_rel_z, y = avg_vert_break)) +
  geom_point() +
  facet_wrap(~pitch_type) +  # Creates a separate plot for each pitch type
  labs(title = "Pitch Movement vs. Release Point by Pitch Type (Vertical)",
       x = "Release Point Z",
       y = "Vertical Movement") +
  theme_minimal()

From the graphs, there are clearly some weak but significant correlations between release point and movement. Also, each pitch has a very different relationship between release point and movement. Because of this, I will build predictive models on each pitch type individually to improve model accuracy. It may seem trivial to predict movement solely on release coordinates as there are clearly other factors in play that would make the model better but this is again indicative of a hitter’s approach at the plate. All he can really pick up is where the pitch is being released from.

TO calculate expected movement based on release point I will build a KNN-regression model for each pitch type. I will build separate models for average horizontal break and average vertical break, with both using release_x and release_z as predictors. I have decided to drop pitchouts (PO), knuckleballs (KN), and screwballs (SC) from the pitch types due to lack of data or irrelevance. I am also grouping curvevalls (CU), knuckle curves (KC), and slow curves (CS) together, grouping splitters (FS) and forkballs (FO) together, and grouping sweepers (ST) and slurves (SV) together. I’ll also group generic fastballs (FA) with four-seams (FF).

Let’s prepare the data accordingly:

avg_rel_mov = avg_rel_mov %>%
  filter(!pitch_type %in% c('PO','KN','SC'))
unique(avg_rel_mov$pitch_type)

##  [1] "FF" "KC" "SL" "CH" "SI" "CU" "ST" "FC" "SV" "FS" "FO" "FA" "CS"

ff = avg_rel_mov %>% # 4-seams
  filter(pitch_type %in% c('FF','FA'))
si = avg_rel_mov %>% # sinkers
  filter(pitch_type == 'SI')
cb = avg_rel_mov %>% # curveballs
  filter(pitch_type %in% c('CU','KC','CS'))
sl = avg_rel_mov %>% # sliders
  filter(pitch_type == 'SL')
sw = avg_rel_mov %>% # sweepers/slurves
  filter(pitch_type %in% c('ST','SV'))
ch = avg_rel_mov %>% # changeups
  filter(pitch_type == 'CH')
sp = avg_rel_mov %>% # splitters/forkballs
  filter(pitch_type %in% c('FS','FO'))
ct = avg_rel_mov %>% # cutters
  filter(pitch_type == 'FC')

KNN-Regression Models

To decide on the optimal k, in the past I have iterated through different values of k and chose the one that gave me the lowest RMSE value. But since I want this model to represent a batter, I would prefer to keep k smaller, as it represents the k nearest pitchers to that pitcher the batter is observing. Running through some of these iterations for k I found most “elbow points” in each graph fell somewhere between 5 and 10, indicating a k value around here will give the best tradeoff of interpretability and model performance. Here’s an example for 4-seam horizontal break:

# k values for 4-seam HB
library(caret)

set.seed(12)

indexes = createDataPartition(ff$avg_horz_break, p = .75, list = F)
train = ff[indexes, ]
test = ff[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

rmse_values <- c()

for (k in 1:50) {
  knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = k)
  
  pred_y = predict(knnmodel, data.frame(test_x_scaled))
  rmse = sqrt(mean((as.numeric(unlist(test_y)) - pred_y)^2))
  rmse_values = append(rmse_values, rmse)
}

plot(1:50, rmse_values, type = 'l')

For simplicities’ sake, I will go with k = 10 for each model, as it makes sense that a batter would compare the pitcher he’s facing to around 10 pitchers with similar release points.

I do not expect the model to handle outlier release points well, but just like in real life, a batter would have a harder time predicting what he’ll see from an unfamiliar release point over a pitcher that has a more common delivery.

First, lets model 4-seams (FF, FA) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(ff$avg_horz_break, p = .75, list = F)
train = ff[indexes, ]
test = ff[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ff[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ff[,5])) - pred_y)^2))

ff$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (4-Seams)') + 
  theme(plot.title = element_text(hjust = 0.5))

4-seams (FF, FA) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(ff$avg_horz_break, p = .75, list = F)
train = ff[indexes, ]
test = ff[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ff[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ff[,6])) - pred_y)^2))

ff$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (4-Seams)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sinkers (SI) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(si$avg_horz_break, p = .75, list = F)
train = si[indexes, ]
test = si[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = si[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(si[,5])) - pred_y)^2))

si$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Sinkers)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sinkers (SI) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(si$avg_horz_break, p = .75, list = F)
train = si[indexes, ]
test = si[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = si[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(si[,6])) - pred_y)^2))

si$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Sinkers)') + 
  theme(plot.title = element_text(hjust = 0.5))

Curveballs (CU, KC, CS) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(cb$avg_horz_break, p = .75, list = F)
train = cb[indexes, ]
test = cb[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = cb[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(cb[,5])) - pred_y)^2))

cb$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Curveballs)') + 
  theme(plot.title = element_text(hjust = 0.5))

Curveballs (CU, KC, CS) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(cb$avg_horz_break, p = .75, list = F)
train = cb[indexes, ]
test = cb[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = cb[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(cb[,6])) - pred_y)^2))

cb$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Curveballs)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sliders (SL) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(sl$avg_horz_break, p = .75, list = F)
train = sl[indexes, ]
test = sl[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sl[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sl[,5])) - pred_y)^2))

sl$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Sliders)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sliders (SL) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(sl$avg_horz_break, p = .75, list = F)
train = sl[indexes, ]
test = sl[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sl[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sl[,6])) - pred_y)^2))

sl$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Sliders)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sweepers/Slurves (ST, SV) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(sw$avg_horz_break, p = .75, list = F)
train = sw[indexes, ]
test = sw[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sw[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sw[,5])) - pred_y)^2))

sw$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Sweepers/Slurves)') + 
  theme(plot.title = element_text(hjust = 0.5))

Sweepers/Slurves (ST, SV) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(sw$avg_horz_break, p = .75, list = F)
train = sw[indexes, ]
test = sw[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sw[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sw[,6])) - pred_y)^2))

sw$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Sweepers/Slurves)') + 
  theme(plot.title = element_text(hjust = 0.5))

Changeups (CH) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(ch$avg_horz_break, p = .75, list = F)
train = ch[indexes, ]
test = ch[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ch[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ch[,5])) - pred_y)^2))

ch$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Changeups)') + 
  theme(plot.title = element_text(hjust = 0.5))

Changeups (CH) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(ch$avg_horz_break, p = .75, list = F)
train = ch[indexes, ]
test = ch[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ch[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ch[,6])) - pred_y)^2))

ch$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Changeups)') + 
  theme(plot.title = element_text(hjust = 0.5))

Splitters/Forkballs (FS, FO) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(sp$avg_horz_break, p = .75, list = F)
train = sp[indexes, ]
test = sp[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sp[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sp[,5])) - pred_y)^2))

sp$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Splitters/Forkballs)') + 
  theme(plot.title = element_text(hjust = 0.5))

Splitters/Forkballs (FS, FO) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(sp$avg_horz_break, p = .75, list = F)
train = sp[indexes, ]
test = sp[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = sp[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(sp[,6])) - pred_y)^2))

sp$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Splitters/Forkballs)') + 
  theme(plot.title = element_text(hjust = 0.5))

Cutters (FC) Horizontal Break

library(caret)

set.seed(12)

indexes = createDataPartition(ct$avg_horz_break, p = .75, list = F)
train = ct[indexes, ]
test = ct[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,5]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,5]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ct[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ct[,5])) - pred_y)^2))

ct$exp_horz_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'horz break', title = 'Horizontal Break vs Avg Rel Z & Avg Rel X (Cutters)') + 
  theme(plot.title = element_text(hjust = 0.5))

Cutters (FC) Vertical Break

library(caret)

set.seed(12)

indexes = createDataPartition(ct$avg_horz_break, p = .75, list = F)
train = ct[indexes, ]
test = ct[-indexes, ]

train_x = train[, 3:4]
preProc_train = preProcess(train_x, method =c("center", "scale"))
train_x_scaled = predict(preProc_train, train_x)
train_y = train[,6]

test_x = test[, 3:4]
test_x_scaled = predict(preProc_train, test_x)
test_y = test[,6]

knnmodel = knnreg(train_x_scaled, as.numeric(unlist(train_y)), k = 10)


all_data = ct[,3:4]
all_data_scaled = predict(preProc_train, all_data)

pred_y = predict(knnmodel, data.frame(all_data_scaled))
rmse = sqrt(mean((as.numeric(unlist(ct[,6])) - pred_y)^2))

ct$exp_vert_break = pred_y

mov_plot <- ggplot(data = all_data, aes(x = avg_rel_x, y = avg_rel_z, color = pred_y)) + geom_point()
mov_plot + scale_color_gradient(low="blue", high="red") + labs(x = 'rel x', y = 'rel z', color = 'vert break', title = 'Vertical Break vs Avg Rel Z & Avg Rel X (Cutters)') + 
  theme(plot.title = element_text(hjust = 0.5))

Now, I will find Euclidean distances between actual and expected pitch movement for each pitcher’s pitch types.

ff = ff %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
si = si %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
cb = cb %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
sl = sl %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
sw = sw %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
ch = ch %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
sp = sp %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))
ct = ct %>%
  mutate(distance = sqrt((exp_horz_break - avg_horz_break)^2 + (exp_vert_break - avg_vert_break)^2))

mov_all = rbind(ff,si,cb,sl,sw,ch,sp,ct)

mov_all = mov_all %>%
  left_join(data23_pitch_usage, join_by(player_name, pitch_type))

mov_all$percent_usage = mov_all$percent_usage / 100

unexp_score = mov_all %>%
  group_by(player_name) %>%
  summarise(score = sum(distance * percent_usage))

4. Flashiness

The fourth and final form of deception I will analyze is flashiness, which I will quantify using perceived velocity. Perceived velocity is a measure of how fast the pitch actually looked to a hitter based on the extension of the pitcher’s release. For example, a pitcher with a 6 foot extension throwing 93 will have a perceived velocity of 93 to the hitter, but another pitcher also throwing 93 with a 7 foot extension will have a perceived velocity closer to 96. This will obviously make the pitch appear to “get on them faster” because there’s less distance between where the pitch is released and the plate. To calculate this I will find the difference in true velocity and perceived velocity for each pitcher.

library(tidyr)

perceived_velo = data23 %>%
  drop_na(c(effective_speed,release_speed)) %>%
  filter(!effective_speed == 0) %>%
  summarise(player_name, effective_speed, release_speed, release_extension, velo_diff = effective_speed - release_speed) %>%
  group_by(player_name) %>%
  summarise(avg_velo_diff = mean(velo_diff))

Finding Each Form’s Importance

Now that we have our four forms of deception - unpredictability, indistinguishability, unexpectedness, and flashiness - I need to find their importance to calculating deception. I first want to use min-max scaling on each to put them all on the same scale.

library(scales)

overall_deviation = overall_deviation %>%
  mutate(unpredictability = rescale(-overall_deviation, to = c(0,1)))

average_release_change = average_release_change %>%
  mutate(indistinguishability = rescale(-avg_distance, to = c(0,1)))

unexp_score = unexp_score %>%
  mutate(unexpectedness = rescale(score, to = c(0,1)))

perceived_velo = perceived_velo %>%
  mutate(flashiness = rescale(avg_velo_diff, to = c(0,1)))

deception = cbind(overall_deviation$player_name, overall_deviation$unpredictability, 
                  average_release_change$indistinguishability, unexp_score$unexpectedness,
                  perceived_velo$flashiness)

deception = as.data.frame(deception) %>%
  rename('Pitcher' = 'V1',
         'Unpredictability' = 'V2',
         'Indistinguishability' = 'V3',
         'Unexpectedness' = 'V4',
         'Flashiness' = 'V5')

To find the importance of each of these variables, I’m going to use a multiple linear regression model with each of the four deception variables as predictors and a target variable I believe to be a good indicator of deception, Called Strike plus Swinging (CSW) rate. I’ll find CSW for each pitcher first.

csw = data23 %>%
  filter(description %in% c('called_strike','swinging_strike')) %>%
  group_by(player_name, pitcher) %>%
  summarise(total_csw = n())

csw = csw %>%
  left_join(total_pitches_per_player, by = 'player_name')
csw['csw_rate'] = csw[,3] / csw[,4]

deception = deception %>%
  inner_join(csw[,c(1:2,5)], by = c('Pitcher' = 'player_name'))

deception$Unpredictability = as.numeric(deception$Unpredictability)
deception$Indistinguishability = as.numeric(deception$Indistinguishability)
deception$Unexpectedness = as.numeric(deception$Unexpectedness)
deception$Flashiness = as.numeric(deception$Flashiness)

Now I’ll build the multiple linear regression model.

set.seed(12)

model = lm(csw_rate ~ Unpredictability + Indistinguishability + Unexpectedness + Flashiness, data = deception)
summary(model)

## 
## Call:
## lm(formula = csw_rate ~ Unpredictability + Indistinguishability + 
##     Unexpectedness + Flashiness, data = deception)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.175832 -0.020911 -0.000034  0.019932  0.170498 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.2444426  0.0108420  22.546   <2e-16 ***
## Unpredictability     0.0115662  0.0094084   1.229   0.2193    
## Indistinguishability 0.0003147  0.0065495   0.048   0.9617    
## Unexpectedness       0.0168207  0.0114581   1.468   0.1425    
## Flashiness           0.0192370  0.0094069   2.045   0.0412 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03573 on 797 degrees of freedom
## Multiple R-squared:  0.009157,   Adjusted R-squared:  0.004184 
## F-statistic: 1.841 on 4 and 797 DF,  p-value: 0.1189

From the summary statistics, I get t values for each of the four deception variables. T values tell the significance of each feature in predicting CSW. The larger it is, the more significant. I can see that flashiness had the most significance, while indistinguishability seemed to have the least effect on deception. I will use these t-values as my weights for the following equation:

deception = 1.229 * Unpredictability + 0.048 * Indistinguishability + 1.468 * Unexpectedness + 2.045 * Flashiness

Leaderboard

Using this equation, I’ll calculate deception for each pitcher. The dataframe stored in ‘percentiles’ was pulled from baseballsavant.mlb.com.

library(reactable)

deception = deception %>%
  mutate(deception_score = 1.229 * Unpredictability + 0.048 * Indistinguishability + 1.468 * Unexpectedness + 2.045 * Flashiness)

deception = deception %>%
  inner_join(total_pitches_per_player, by = c('Pitcher' = 'player_name')) %>%
  arrange(desc(deception_score))

deception['Rank'] = seq.int(nrow(deception))
deception$deception_score = round(rescale(deception$deception_score, to = c(0,1)) * 100, digits = 1)
deception$Unpredictability = round(deception$Unpredictability * 100, digits = 1)
deception$Indistinguishability = round(deception$Indistinguishability * 100, digits = 1)
deception$Unexpectedness = round(deception$Unexpectedness * 100, digits = 1)
deception$Flashiness = round(deception$Flashiness * 100, digits = 1)
deception$csw_rate = round(deception$csw_rate * 100, digits = 1)

percentiles = read.csv("C:\\Users\\phill\\OneDrive\\Desktop\\Job_Search\\percentile_rankings.csv")

deception = deception %>%
  left_join(percentiles[,c(2,4,12,17:19)], by = c('pitcher' = 'player_id'))

head(deception[c(1:5,8)], 25)

leaderboard = deception %>%
  select(Rank, Pitcher, total_pitches, deception_score, xwoba, hard_hit_percent, 
         fb_velocity, fb_spin, curve_spin) %>%
  head(25)

reactable(leaderboard,
          columns = list(
    deception_score = colDef(name = "Deception Score"),
    total_pitches = colDef(name = "# Pitches"),
    xwoba = colDef(name = 'xwOBA %ile'),
    hard_hit_percent = colDef(name = "Hard Hit %ile"),
    fb_velocity = colDef(name = 'FB Velocity %ile'),
    fb_spin = colDef(name = 'FB Spin %ile'),
    curve_spin = colDef(name = 'Curve Spin %ile')
  ))

There are some interesting names at the top of the leaderboard. The most deceptive pitcher based on my definition was Brent Suter. Despite his fastball velocity being in the 1st percentile, he was in the 99th percentile in hard hit balls and 88th percentile in xwOBA. So despite his stuff not being overpowering, he is able to generate weak contact. His unpredictablity and flashiness scores are both high, meaning he is able to make his pitches look harder than they actually are with good extension, and he is able to mix his pitches well in different counts, which keeps batters guessing. Some other names like Josh Hader and Tyler Rogers pass the eye test for deception as they are on there mostly due to their unique deliveries, which hitters aren’t used to facing. 5 of the top 10 are in the 93rd percentile or above in hard hit percentage, which is another factor that is likely closely correlated with more deceptive pitchers. Guys like Justin Steele and Phil Maton are both high spin rate pitchers which explains their unexpectedness, making them more deceptive than an average spin pitcher at the same arm slot. Devin Williams gets great extension which increases his flashiness, which may explain why hitters fall all over themselves trying to hit his filthy changeup.

All in all, this is not a perfect model. Many things are probably being left out that could vastly improve its performance, but these are some of the factors I deemed were most important in evaluating deception given the data I had access to. It is likely in the future that things like high framerate cameras and markerless motion capture systems will be used to get a true measure of how deceptive a pitcher is, rather than just what the numbers say. I think these things will be revolutionary in cracking the code of why certain pitchers are able to defy the odds and compete at a high level despite not having eye-popping stuff. If teams are able to find and develop pitchers that combine elite stuff with elite deception, it could put them ahead of the curve.

Rangers_Deception_Project

Lance Reynolds

2023-11-20