Question:

Can we use physical stats to predict the defensive capabilities of an MLB shortstop? If so, which physical traits are most correlated with defensive success?

library(readr)
library(dplyr)
library(tidyverse)
library(psych)

To answer this question, I created a CSV of all qualified MLB shortstops from the 2025 season, along with their Outs Above Average (OAA), average throwing velocity, max throwing velocity, and average sprint speed. All of this data comes from Baseball Savant.

# Read CSV of all qualified 2025 MLB shortstops and their defensive/physical stats

data <- read.csv("C:\\Users\\konod\\Downloads\\Physical_Traits_to_OAA - Sheet1.csv")


# Plot Max arm velo vs Outs Above Average

data %>%
  ggplot(aes(x = max_arm_velo, y = oaa)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'Max Arm Velocity vs Outs Above Average')
## `geom_smooth()` using formula = 'y ~ x'

# Check for correlation between Max arm velo and Outs Above Average
corr_result_max_arm <- corr.test(data$max_arm_velo, data$oaa)
print(corr_result_max_arm)
## Call:corr.test(x = data$max_arm_velo, y = data$oaa)
## Correlation matrix 
## [1] 0.21
## Sample Size 
## [1] 36
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.21
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
# Plot Average arm velo vs Outs Above Average
data %>%
  ggplot(aes(x = avg_arm_velo, y = oaa)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'Average Arm Velocity vs Outs Above Average')
## `geom_smooth()` using formula = 'y ~ x'

# Check for correlation between Average arm velo and Outs Above Average
corr_result_avg_arm <- corr.test(data$avg_arm_velo, data$oaa)
print(corr_result_avg_arm)
## Call:corr.test(x = data$avg_arm_velo, y = data$oaa)
## Correlation matrix 
## [1] 0.25
## Sample Size 
## [1] 36
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.15
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
# Plot Sprint Speed vs Outs Above Average
data %>%
  ggplot(aes(x = sprint_speed, y = oaa)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'Sprint Speed vs Outs Above Average')
## `geom_smooth()` using formula = 'y ~ x'

# Check for correlation between Sprint Speed and Outs Above Average
corr_result_sprint_speed <- corr.test(data$sprint_speed, data$oaa)
print(corr_result_sprint_speed)
## Call:corr.test(x = data$sprint_speed, y = data$oaa)
## Correlation matrix 
## [1] 0.39
## Sample Size 
## [1] 36
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.02
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Correlations

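From the output we can see that sprint speed correlates most strongly with Outs Above Average (r = 0.39, p = 0.02), followed by average arm velocity (r = 0.25, p = 0.15) and max arm velocity (r = 0.21, p = 0.21). With 36 players, only the sprint speed correlation is statistically significant at the 0.05 level.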
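
The three pairwise tests can also be run in one pass and collected into a single table with confidence intervals, which makes the comparison easier to read. This is only a sketch: the `sim` data frame below is simulated stand-in data (not the Baseball Savant export), and it uses base R's `cor.test()` rather than `psych::corr.test()`.

```r
# Simulated stand-in for the CSV (NOT real player data); column names match the report
set.seed(1)
n <- 36
sim <- data.frame(
  sprint_speed = rnorm(n, mean = 28, sd = 1),
  avg_arm_velo = rnorm(n, mean = 86, sd = 3)
)
sim$max_arm_velo <- sim$avg_arm_velo + runif(n, min = 3, max = 8)
sim$oaa <- 3 * (sim$sprint_speed - 28) + rnorm(n, sd = 8)

predictors <- c("max_arm_velo", "avg_arm_velo", "sprint_speed")

# One cor.test() per predictor, bound together into one summary table
cor_table <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(sim[[v]], sim$oaa)
  data.frame(
    predictor = v,
    r = unname(ct$estimate),
    ci_low = ct$conf.int[1],
    ci_high = ct$conf.int[2],
    p_value = ct$p.value
  )
}))
print(cor_table)
```

Swapping `sim` for the real `data` object would reproduce the three correlations above, with the 95% intervals shown alongside them.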

This partly answers the central question: average sprint speed has some predictive value for shortstop defensive success, but a correlation of 0.39 is too weak to predict OAA with confidence.

Next, we will combine these variables in a linear regression model, in the hope of producing a formula that correlates more strongly with Outs Above Average than any of the three single variables already tested.

# Use a linear regression model to find the best formula for predicting Outs Above Average, first using max arm velocity

model <- lm(oaa ~ sprint_speed + max_arm_velo, data=data)
summary(model)
## 
## Call:
## lm(formula = oaa ~ sprint_speed + max_arm_velo, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.664  -4.744   0.544   4.961  16.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -96.0821    40.8690  -2.351   0.0248 *
## sprint_speed   2.8031     1.3285   2.110   0.0425 *
## max_arm_velo   0.2270     0.3584   0.633   0.5310  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.183 on 33 degrees of freedom
## Multiple R-squared:  0.159,  Adjusted R-squared:  0.108 
## F-statistic: 3.119 on 2 and 33 DF,  p-value: 0.05746
# Second using average arm

model <- lm(oaa ~ sprint_speed + avg_arm_velo, data=data)
summary(model)
## 
## Call:
## lm(formula = oaa ~ sprint_speed + avg_arm_velo, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6958  -4.7242   0.0126   5.5391  15.8164 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -111.3275    42.8405  -2.599   0.0139 *
## sprint_speed    2.7936     1.2711   2.198   0.0351 *
## avg_arm_velo    0.4251     0.3750   1.134   0.2650  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.077 on 33 degrees of freedom
## Multiple R-squared:  0.1807, Adjusted R-squared:  0.131 
## F-statistic: 3.638 on 2 and 33 DF,  p-value: 0.03733
# Third using all three variables

model <- lm(oaa ~ sprint_speed + max_arm_velo + avg_arm_velo, data=data)
summary(model)
## 
## Call:
## lm(formula = oaa ~ sprint_speed + max_arm_velo + avg_arm_velo, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9352  -4.6272  -0.7454   5.9649  15.6115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -114.3653    43.2630  -2.643   0.0126 *
## sprint_speed    3.1209     1.3445   2.321   0.0268 *
## max_arm_velo   -0.6072     0.7720  -0.787   0.4373  
## avg_arm_velo    0.9962     0.8182   1.218   0.2323  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.124 on 32 degrees of freedom
## Multiple R-squared:  0.1962, Adjusted R-squared:  0.1209 
## F-statistic: 2.604 on 3 and 32 DF,  p-value: 0.06893
# Use the lm coefficients as weights to create a composite score from sprint speed and max arm velocity (intercept omitted; it does not affect the correlation)
data$athletic_score_max <- 2.8031 * data$sprint_speed +
                       .2270 * data$max_arm_velo
cor.test(data$athletic_score_max, data$oaa)
## 
##  Pearson's product-moment correlation
## 
## data:  data$athletic_score_max and data$oaa
## t = 2.5351, df = 34, p-value = 0.01601
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08075342 0.64301751
## sample estimates:
##       cor 
## 0.3987112
# Use the lm coefficients as weights to create a composite score from sprint speed and average arm velocity
data$athletic_score_avg <- 2.7936 * data$sprint_speed +
                       .4251 * data$avg_arm_velo
cor.test(data$athletic_score_avg, data$oaa)
## 
##  Pearson's product-moment correlation
## 
## data:  data$athletic_score_avg and data$oaa
## t = 2.7381, df = 34, p-value = 0.009762
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1121822 0.6612481
## sample estimates:
##       cor 
## 0.4250522
# Use the lm coefficients as weights to create a composite score from all three variables
data$athletic_score <- 3.1209 * data$sprint_speed +
                       0.9962 * data$avg_arm_velo +
                       -0.6072 * data$max_arm_velo
cor.test(data$athletic_score, data$oaa)
## 
##  Pearson's product-moment correlation
## 
## data:  data$athletic_score and data$oaa
## t = 2.8809, df = 34, p-value = 0.00682
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1339051 0.6734808
## sample estimates:
##       cor 
## 0.4429542
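
A quick check on how these composite scores relate to the regression summaries: for a least-squares fit, the correlation between the weighted score and the response is the square root of the model's Multiple R-squared, so the three `cor.test()` results restate the R-squared values already printed. The sketch below shows this on simulated stand-in data (`sim` is not the real CSV) and then on the figures reported above.

```r
# Simulated stand-in data (NOT real player data)
set.seed(2)
n <- 36
sim <- data.frame(
  sprint_speed = rnorm(n, mean = 28, sd = 1),
  avg_arm_velo = rnorm(n, mean = 86, sd = 3)
)
sim$oaa <- 3 * (sim$sprint_speed - 28) + rnorm(n, sd = 8)

fit <- lm(oaa ~ sprint_speed + avg_arm_velo, data = sim)

# Composite score from the fitted slopes, intercept omitted as in the report
b <- coef(fit)
score <- b[["sprint_speed"]] * sim$sprint_speed +
  b[["avg_arm_velo"]] * sim$avg_arm_velo

r_score <- cor(score, sim$oaa)
r_squared <- summary(fit)$r.squared

# The squared correlation of the score matches the model's R-squared
print(c(r_score_squared = r_score^2, r_squared = r_squared))

# Same comparison using the values printed in the report
reported <- data.frame(
  model = c("sprint + max arm", "sprint + avg arm", "all three"),
  r_squared = c(0.159, 0.1807, 0.1962),
  score_cor = c(0.3987112, 0.4250522, 0.4429542)
)
reported$sqrt_r_squared <- sqrt(reported$r_squared)
print(reported)
```

Because adding a predictor can never lower the in-sample R-squared, the score with the most variables will always show the highest correlation on the data it was fitted to.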

Conclusion

From the linear regression models, we can see that the composite of average sprint speed and max throwing velocity (r = 0.40) is only slightly more correlated with Outs Above Average than average sprint speed alone (r = 0.39).

The composite of average sprint speed and average throwing velocity (r = 0.43) is more correlated than both sprint speed on its own and the previous model of sprint speed and max throwing velocity.

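The final model, consisting of all three variables, is the most correlated with Outs Above Average (r = 0.44), but it explains only about 20% of the variance in OAA (R-squared = 0.196), so it predicts a player's defensive performance with modest confidence at best. Part of that edge is automatic, since adding a predictor to a linear regression can never lower the in-sample R-squared. By adjusted R-squared, the sprint speed and average arm velocity model (0.131) edges out the three-variable model (0.121), and sprint speed is the only individually significant predictor in every model.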
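
One way to compare the models on something closer to predictive accuracy is leave-one-out cross-validation, which this report does not use; I am adding it here only as a sketch of a possible follow-up. The `sim` data frame is simulated stand-in data, and `loo_rmse()` is a hypothetical helper written for this example.

```r
# Simulated stand-in data (NOT real player data); column names match the report
set.seed(3)
n <- 36
sim <- data.frame(
  sprint_speed = rnorm(n, mean = 28, sd = 1),
  avg_arm_velo = rnorm(n, mean = 86, sd = 3)
)
sim$max_arm_velo <- sim$avg_arm_velo + runif(n, min = 3, max = 8)
sim$oaa <- 3 * (sim$sprint_speed - 28) + rnorm(n, sd = 8)

formulas <- list(
  sprint_only = oaa ~ sprint_speed,
  sprint_avg_arm = oaa ~ sprint_speed + avg_arm_velo,
  all_three = oaa ~ sprint_speed + max_arm_velo + avg_arm_velo
)

# Leave-one-out RMSE: refit without row i, then predict the held-out row i
# (hypothetical helper; assumes the response column is named oaa)
loo_rmse <- function(formula, data) {
  errors <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$oaa[i] - predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  sqrt(mean(errors^2))
}

comparison <- data.frame(
  model = names(formulas),
  adj_r_squared = vapply(
    formulas,
    function(f) summary(lm(f, data = sim))$adj.r.squared,
    numeric(1)
  ),
  loo_rmse = vapply(formulas, loo_rmse, numeric(1), data = sim)
)
print(comparison)
```

A lower `loo_rmse` means better held-out prediction. Unlike the in-sample correlation, it can get worse when a predictor adds noise rather than signal.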

Real World Uses

Given the rather small sample of just the 2025 season, along with the final correlation still being below 0.5, I would not recommend using this formula to predict real-world performance. However, I do believe it can serve as a basic starting point for positional assignment when choosing a starting shortstop. If you have two shortstops fighting for the starting job, the one who possesses a more consistently strong arm, along with more consistent speed, is more likely to be the better fielder in the long run.
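
To make that comparison concrete, here is a rough sketch that plugs two hypothetical candidates into the sprint speed and average arm velocity model. The coefficients and residual standard error are copied from the model summary above. The two players and their stats are invented for illustration, and I am assuming Baseball Savant's usual units (sprint speed in ft/s, arm velocity in mph).

```r
# Coefficients copied from the lm(oaa ~ sprint_speed + avg_arm_velo) summary
coefs <- c(intercept = -111.3275, sprint_speed = 2.7936, avg_arm_velo = 0.4251)

# Hypothetical candidates (illustrative values, not real players)
candidates <- data.frame(
  player = c("Candidate A", "Candidate B"),
  sprint_speed = c(28.5, 27.0),  # ft/s (assumed unit)
  avg_arm_velo = c(88, 84)       # mph (assumed unit)
)

# Predicted OAA from the fitted equation
candidates$predicted_oaa <- coefs[["intercept"]] +
  coefs[["sprint_speed"]] * candidates$sprint_speed +
  coefs[["avg_arm_velo"]] * candidates$avg_arm_velo
print(candidates)

# Compare the predicted gap with the model's residual standard error
rse <- 8.077
predicted_gap <- diff(rev(candidates$predicted_oaa))  # Candidate A minus Candidate B
print(c(predicted_gap = predicted_gap, residual_se = rse))
```

For these made-up players the predicted gap is about 5.9 OAA, which is smaller than the model's residual standard error of about 8.1. The faster, stronger-armed candidate is favored, but not by enough to settle the decision on its own.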