library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.4.3
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
data <- read.csv("C:\\Users\\konod\\Downloads\\2025_college_bats - Sheet1.csv")

QUESTION: Can we use collegiate stats to predict the future MLB On-Base skills of prospects?

To answer this question, I built a dataset of every MLB batter who accumulated at least 400 plate appearances in the 2025 season and was drafted out of college. The stats I used were MLB walk rate, MLB strikeout rate, MLB BABIP, college walk rate, college strikeout rate, and college BABIP. All of the MLB stats come solely from the 2025 season, whereas the college stats come from each player’s draft-eligible season or, if that season was shortened, from the college season in which they had the most plate appearances.
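For reference, the three rate stats follow the standard definitions: walk rate is BB/PA, strikeout rate is SO/PA, and BABIP is (H - HR) / (AB - SO - HR + SF). The sketch below shows one way to compute them from counting stats. The function and argument names are hypothetical, and the spreadsheet used here may already store the rates directly.

```r
# Hypothetical helpers: rate stats computed from raw counting stats.
walk_rate <- function(bb, pa) bb / pa
strikeout_rate <- function(so, pa) so / pa

# BABIP counts only balls in play: home runs and strikeouts are excluded,
# and sacrifice flies are added back to the denominator.
babip <- function(h, hr, ab, so, sf) (h - hr) / (ab - so - hr + sf)

# Made-up stat line, used only to show the call shape (not a real player).
print(walk_rate(bb = 50, pa = 500))
print(strikeout_rate(so = 100, pa = 500))
print(babip(h = 130, hr = 20, ab = 440, so = 100, sf = 5))
```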

# Plot College Walk Rate vs 2025 MLB Walk Rate
data %>%
  ggplot(aes(x = college_BB_rate, y = BB_rate)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'College Walk Rate vs MLB 2025 Walk Rate')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College Walk Rate vs 2025 MLB Walk Rate
corr_result_BB_rate <- corr.test(data$college_BB_rate, data$BB_rate)
print(corr_result_BB_rate)
## Call:corr.test(x = data$college_BB_rate, y = data$BB_rate)
## Correlation matrix 
## [1] 0.21
## Sample Size 
## [1] 89
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.05
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Walk Rate

Walk rate has a small but positive correlation from college to MLB, at r = .21 (n = 89, p ≈ .05, right at the conventional significance threshold). This implies that while there is some evidence of collegiate walk rate translating to MLB, college walk rate alone cannot reliably predict a prospect’s plate discipline.
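To get a feel for how uncertain a correlation of .21 is with 89 players, a confidence interval can be approximated with the Fisher z-transformation. This is a minimal sketch in base R: `r_conf_int` is a hypothetical helper, not part of `psych`, and because it starts from the rounded r printed above, the result is only approximate.

```r
# Approximate confidence interval for a Pearson correlation using the
# Fisher z-transformation. r and n are the rounded values printed above.
r_conf_int <- function(r, n, level = 0.95) {
  z <- atanh(r)                        # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)                # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)   # two-sided normal critical value
  # Build the interval on the z scale, then transform back to the r scale.
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

bb_ci <- r_conf_int(r = 0.21, n = 89)
print(round(bb_ci, 2))
```

The same interval should also be available from the existing result object by printing it with `short = FALSE`, as the `corr.test()` output suggests.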

# Plot College Strikeout Rate vs 2025 MLB Strikeout Rate
data %>%
  ggplot(aes(x = college_SO_rate, y = SO_rate)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'College Strikeout Rate vs MLB 2025 Strikeout Rate')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College Strikeout Rate vs 2025 MLB Strikeout Rate
corr_result_SO_rate <- corr.test(data$college_SO_rate, data$SO_rate)
print(corr_result_SO_rate)
## Call:corr.test(x = data$college_SO_rate, y = data$SO_rate)
## Correlation matrix 
## [1] 0.56
## Sample Size 
## [1] 89
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Strikeout Rate

There is a strong, positive correlation between college strikeout rate and MLB strikeout rate, at r = .56 (n = 89; the printed p-value of 0 is rounded to two decimals, so p < .005). Given that this compares a single college season with a single MLB season, the strength of the correlation is remarkable. It suggests that players who struggle to make consistent contact tend to struggle with it at every level, so college strikeout rate can be used with reasonable confidence when profiling a prospect’s hit tool.
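If the goal is an actual projection rather than just a correlation, one option is a simple linear regression of MLB strikeout rate on college strikeout rate. The sketch below assumes a data frame with the `college_SO_rate` and `SO_rate` columns used above; the function names are hypothetical, and a one-predictor model like this is only a starting point.

```r
# Fit MLB strikeout rate as a linear function of college strikeout rate.
# Column names match the data frame used above; fit_so_model is hypothetical.
fit_so_model <- function(df) {
  lm(SO_rate ~ college_SO_rate, data = df)
}

# Point predictions of MLB strikeout rate for new college strikeout rates.
# The inputs must be on the same scale as the column the model was fit on.
predict_so_rate <- function(model, college_so_rate) {
  predict(model, newdata = data.frame(college_SO_rate = college_so_rate))
}
```

With the data loaded above, `so_model <- fit_so_model(data)` fits the model, `summary(so_model)$r.squared` reports the share of variance explained, and `predict_so_rate(so_model, x)` returns the projected MLB strikeout rate for a college rate `x`.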

# Plot College BABIP vs 2025 MLB BABIP
data %>%
  ggplot(aes(x = college_BABIP, y = BABIP)) +
  geom_point() +
  geom_smooth(method = 'lm') + 
  labs(title = 'College BABIP vs MLB 2025 BABIP')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College BABIP vs 2025 MLB BABIP
corr_result_BABIP <- corr.test(data$college_BABIP, data$BABIP)
print(corr_result_BABIP)
## Call:corr.test(x = data$college_BABIP, y = data$BABIP)
## Correlation matrix 
## [1] 0.03
## Sample Size 
## [1] 89
## These are the unadjusted probability values.
##   The probability values  adjusted for multiple tests are in the p.adj object. 
## [1] 0.75
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

BABIP

There is little to no correlation between college BABIP and MLB BABIP, at r = .03 (p = .75). This is to be expected, since a player’s BABIP fluctuates from year to year even within MLB. Many variables could contribute to the low correlation; I would assume that differences in stadiums and in level of play, and therefore in the quality of opposing defenses, account for much of the variation. Overall, we cannot predict a player’s MLB BABIP from their college BABIP.

Main Takeaways

While two of the three metrics in this analysis proved not to be useful predictors, the main takeaway is that a player’s strikeout rate in college tracks their strikeout rate in MLB. We can use this to grade a prospect’s hit tool more accurately: a hitter who struggles to make contact in college will likely struggle to make contact in the pros.
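One way to put the three results on a common scale is to square each correlation, which gives the share of variance in the MLB stat that a linear relationship with the college stat accounts for. The sketch below uses the rounded correlations printed earlier, so the percentages are approximate: roughly 4% for walk rate, 31% for strikeout rate, and under 1% for BABIP.

```r
# Correlations as printed by corr.test() above (rounded to two decimals).
r_values <- c(walk_rate = 0.21, strikeout_rate = 0.56, babip = 0.03)

# Squaring r gives the proportion of variance in the MLB stat that a
# linear relationship with the college stat accounts for.
r_squared <- r_values^2

# Express as a percentage for easier comparison across the three stats.
print(round(100 * r_squared, 1))
```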