library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.4.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
data <- read.csv("C:\\Users\\konod\\Downloads\\2025_college_bats - Sheet1.csv")
QUESTION: Can we use collegiate stats to predict the future MLB
On-Base skills of prospects?
To answer this question, I built a dataset of every MLB batter who
accumulated at least 400 plate appearances in the 2025 season and was
drafted out of college. The stats I used were MLB walk rate, MLB
strikeout rate, MLB BABIP, college walk rate, college strikeout rate,
and college BABIP. All of the MLB stats came solely from the 2025
season, whereas the college stats came from each player's draft-eligible
season or, if that season was shortened, from the college season in
which they had the most plate appearances.
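For reference, the three rate stats are conventionally defined from counting stats as shown in the sketch below. The spreadsheet used here already stores the finished rates, so this is only an illustration: the table `counting`, its column names (`PA`, `AB`, `H`, `HR`, `BB`, `SO`, `SF`), and the numbers in it are made up and are not part of the dataset.

```r
# Hypothetical counting stats for two made-up batters (not from the dataset).
counting <- data.frame(
  player = c("Batter A", "Batter B"),
  PA = c(600, 450),
  AB = c(530, 400),
  H  = c(150, 100),
  HR = c(20, 10),
  BB = c(60, 45),
  SO = c(120, 90),
  SF = c(5, 3)
)

# Standard definitions:
#   walk rate      = BB / PA
#   strikeout rate = SO / PA
#   BABIP          = (H - HR) / (AB - SO - HR + SF)
rates <- transform(
  counting,
  BB_rate = BB / PA,
  SO_rate = SO / PA,
  BABIP   = (H - HR) / (AB - SO - HR + SF)
)
print(rates[, c("player", "BB_rate", "SO_rate", "BABIP")])
```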
# Plot College Walk Rate vs 2025 MLB Walk Rate
data %>%
  ggplot(aes(x = college_BB_rate, y = BB_rate)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'College Walk Rate vs MLB 2025 Walk Rate')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College Walk Rate vs 2025 MLB Walk Rate
corr_result_BB_rate <- corr.test(data$college_BB_rate, data$BB_rate)
print(corr_result_BB_rate)
## Call:corr.test(x = data$college_BB_rate, y = data$BB_rate)
## Correlation matrix
## [1] 0.21
## Sample Size
## [1] 89
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0.05
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Walk Rate
Walk rate has a small positive correlation from college to MLB
(r = .21, n = 89, p ≈ .05). A correlation of that size means college
walk rate accounts for only about 4% of the variance in MLB walk rate
(r² ≈ .04). So while there is some evidence that collegiate walk rate
translates to MLB, college walk rate alone cannot reliably predict a
prospect's plate discipline.
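The printed output notes that confidence intervals are available with `short=FALSE`. As a rough cross-check that does not need the CSV, the sketch below approximates a 95% interval from the rounded values printed above (r = 0.21, n = 89) using the Fisher z-transformation. The helper `fisher_ci` is a name introduced here for illustration, and because the inputs are rounded the result will differ slightly from what `corr.test` would report on the raw data.

```r
# Approximate confidence interval for a Pearson correlation via the
# Fisher z-transformation. Inputs are the rounded values printed above.
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                      # Fisher z-transform of r
  se   <- 1 / sqrt(n - 3)               # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)    # two-sided normal critical value
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

r_bb <- 0.21
n    <- 89
print(fisher_ci(r_bb, n))   # interval for the walk-rate correlation
print(r_bb^2)               # share of variance explained by a linear fit
```

With these inputs the interval runs from roughly zero to about 0.40, which is consistent with a p-value sitting right around .05.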
# Plot College Strikeout Rate vs 2025 MLB Strikeout Rate
data %>%
  ggplot(aes(x = college_SO_rate, y = SO_rate)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'College Strikeout Rate vs MLB 2025 Strikeout Rate')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College Strikeout Rate vs 2025 MLB Strikeout Rate
corr_result_SO_rate <- corr.test(data$college_SO_rate, data$SO_rate)
print(corr_result_SO_rate)
## Call:corr.test(x = data$college_SO_rate, y = data$SO_rate)
## Correlation matrix
## [1] 0.56
## Sample Size
## [1] 89
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Strikeout Rate
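There is a fairly strong positive correlation between college strikeout
rate and MLB strikeout rate (r = .56, n = 89, p < .01; the p-value
prints as 0 because the output is rounded to two decimals). College
strikeout rate accounts for roughly 31% of the variance in MLB strikeout
rate (r² ≈ .31). Given that this compares a single college season with a
single MLB season, the strength of the relationship is notable. It
suggests that players who struggle to make consistent contact in college
tend to struggle with it at higher levels as well. College strikeout
rate can therefore be used with reasonable confidence when profiling a
prospect's hit tool.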
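To make the size of that relationship concrete: in a simple linear regression with both variables standardized, the slope equals the correlation, so the predicted MLB z-score is r times the college z-score. The sketch below applies that to the rounded r = 0.56 printed above. The function `predict_mlb_z` and the z-scores are illustrative only and do not correspond to real players.

```r
# With both variables standardized, the least-squares slope equals r,
# so the predicted MLB z-score is r times the college z-score.
predict_mlb_z <- function(college_z, r) r * college_z

r_so      <- 0.56                   # rounded correlation reported above
college_z <- c(-2, -1, 0, 1, 2)     # illustrative z-scores, not real players

predictions <- data.frame(
  college_z       = college_z,
  predicted_mlb_z = predict_mlb_z(college_z, r_so)
)
print(predictions)
```

A hitter whose college strikeout rate sits two standard deviations above the sample mean is therefore predicted to land about 1.12 standard deviations above the MLB mean, so predictions are pulled toward the average rather than carried over one-for-one.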
# Plot College BABIP vs 2025 MLB BABIP
data %>%
  ggplot(aes(x = college_BABIP, y = BABIP)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'College BABIP vs MLB 2025 BABIP')
## `geom_smooth()` using formula = 'y ~ x'

# Test Correlation between College BABIP vs 2025 MLB BABIP
corr_result_BABIP <- corr.test(data$college_BABIP, data$BABIP)
print(corr_result_BABIP)
## Call:corr.test(x = data$college_BABIP, y = data$BABIP)
## Correlation matrix
## [1] 0.03
## Sample Size
## [1] 89
## These are the unadjusted probability values.
## The probability values adjusted for multiple tests are in the p.adj object.
## [1] 0.75
##
## To see confidence intervals of the correlations, print with the short=FALSE option
BABIP
There is little to no correlation between college BABIP and MLB BABIP
(r = .03, n = 89, p = .75). This is to be expected, since BABIP
fluctuates from year to year even within MLB. Many variables could
contribute to the low correlation; I would assume that differences in
stadiums and in level of play, and therefore in the quality of opposing
defenses, account for much of the variation. Overall, we cannot predict
a player's MLB BABIP from their college BABIP.
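One way to see why single-season BABIP is so noisy is to look at its sampling error. The sketch below treats every ball in play as an independent trial with the same hit probability, which is a simplification, and uses a hypothetical underlying BABIP of .300 with hypothetical numbers of balls in play. None of these values come from the dataset, and `babip_se` is a helper name introduced here.

```r
# Binomial standard error of an observed BABIP, treating each ball in play
# as an independent trial with the same hit probability (a simplification).
babip_se <- function(true_babip, balls_in_play) {
  sqrt(true_babip * (1 - true_babip) / balls_in_play)
}

true_babip    <- 0.300               # hypothetical underlying skill
balls_in_play <- c(150, 300, 450)    # hypothetical season sizes

noise <- data.frame(
  balls_in_play = balls_in_play,
  se            = babip_se(true_babip, balls_in_play)
)
print(noise)
```

Under these assumptions the standard error is a few hundredths of a point of BABIP in each case, so two seasons from the same hitter can differ noticeably through chance alone.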
Main Takeaways
While two of the three metrics in this analysis proved not to be
useful predictors, a player's strikeout rate in college does carry over
to MLB. We can use this to grade prospects' hit tools more accurately:
hitters who struggle to make contact in college tend to struggle to
make contact in the pros.
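Because three separate correlations were tested, the unadjusted p-values can be read a little more conservatively. Each `corr.test` call above involved a single pair, so no adjustment took place within a call. The sketch below applies a Holm correction across the three printed p-values using base R's `p.adjust`. The inputs are the rounded values from the output (the strikeout p-value printed as 0), so the adjusted values are approximate.

```r
# Rounded, unadjusted p-values as printed in the three corr.test outputs.
p_raw <- c(walk_rate = 0.05, strikeout_rate = 0, babip = 0.75)

# Holm correction across the three tests (base R, stats package).
p_holm <- p.adjust(p_raw, method = "holm")
print(p_holm)
```

With these inputs the walk-rate p-value rises to about .10 after adjustment, while the conclusions for strikeout rate and BABIP do not change.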