Buried in this data set exists a roster of 40 batters that earns a below average salary and gives an above average performance.
Source: Major League Baseball data from 1985 to 2016
mlb=readxl::read_excel('MLB_1985-2016.xlsx')
dim(mlb)
## [1] 28293 28
There are 28,293 rows and 28 columns in this data set.
mlb_cleaned = mlb |>
drop_na() |>
dplyr::filter(!duplicated(`Player Name`)) |>
rename(Salary = salary, Runs = R)
dim(mlb_cleaned)
## [1] 4931 28
Removed NA and duplicate values, and renamed two columns for labeling clarity. Now the data set has 4,931 rows. Let’s look at batters earning below average salary AND giving above average performance.
summary(mlb_cleaned$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10900 109000 215000 312400 410000 22000000
Salaries below $410,000 will be the cut-off because $312,400 was too
low (zero players existed in this salary range). Next, we’ll look at
Runs
- the number of times a batter scores a run and
Runs Batted In (RBI
) - a statistic that
measures how many runs a batter has driven in.
summary(mlb_cleaned$RBI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 12.85 17.00 149.00
summary(mlb_cleaned$Runs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 14.24 19.00 121.00
HR
- Home Runs could have been used as a metric, but
they aren’t as common and reliable as RBI
or
Runs
. Now, we’ll use RBI
> 13 and
Runs
> 14 as the cut-off for above average metrics.
Also, only batters from 2006 to 2016 will be considered because they
would have more relevance than someone from previous seasons.
These metrics, along with Salary
< $410,000,
will make up the Most Valuable Player (MVP) roster.
What is the relationship between RBI
and
Runs
within the MVP roster?
mlb_mvp = mlb_cleaned |>
dplyr::filter(Season %in% c('2016':'2006') & RBI > 13 & Runs > 14 & Salary < 410000) |>
select(`Player Name`,Runs,RBI,Salary,Team, everything()) |> arrange(desc(Runs)) |>
arrange(desc(RBI))
library(gridExtra)
shapiro.test(mlb_mvp$RBI)
##
## Shapiro-Wilk normality test
##
## data: mlb_mvp$RBI
## W = 0.92403, p-value = 3.113e-07
shapiro.test(mlb_mvp$Runs)
##
## Shapiro-Wilk normality test
##
## data: mlb_mvp$Runs
## W = 0.95631, p-value = 9.625e-05
mdens1=mlb_mvp %>%
ggplot(aes(RBI))+
geom_histogram(aes(y=..density..))+
geom_density(col='red',lwd=1)
mdens2=mlb_mvp %>%
ggplot(aes(Runs))+
geom_histogram(aes(y=..density..))+
geom_density(col='green',lwd=1)
grid.arrange(mdens1,mdens2)
cor.test(mlb_mvp$Runs,mlb_mvp$RBI,method='spearman')
##
## Spearman's rank correlation rho
##
## data: mlb_mvp$Runs and mlb_mvp$RBI
## S = 103644, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8263648
With a correlation value of 0.8263, there is a
strong positive monotonic (not necessarily linear) relationship between
RBI
and Runs
.
mlb_mvp |>
ggplot(aes(x= Runs,y = RBI))+
geom_smooth(method='auto')+
geom_point(size=2)+
labs(title='RBI by Runs and less than $410,000',x='Runs',y='Runs Batted In',caption = 'Source: Major League Baseball data from 1985 to 2016')+
theme_minimal()+
theme(legend.position = 'none')+
theme(plot.title = element_text(hjust = .5))+
annotate('text',x=111,y=125,label='Carlos Gonzalez',size=3,col='darkred')+
annotate('text',x=104,y=157,label='Ryan Howard',size=3,col='darkred')+
annotate('text',x=35,y=120,label="rs = 0.8263",col='blue')
Here are the batters with above average RBI
and
Runs
with below average Salary
. Additionally,
the plot shows two batters, Ryan Howard and Carlos Gonzalez, colored in
teal and shaped as triangles scoring above 100 in both RBI
and Runs
. In other words, they have above averages within
the already above average MVP roster.
mlb_mvp %>%
head(40)
Ryan Howard scores triple digits in RBI
and
Runs
for costing a mere $355,000. I recommend
prioritizing the recruitment of Howard, and future batters using these
metrics, first before proceeding with the rest of the roster.
That said, here are the top 40 batters that cost below the average
Salary
(< $410,000) and score above average in
RBI
(> 13) and Runs
(> 14).
sessionInfo()
## R version 4.3.0 alpha (2023-04-01 r84141 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 performance_0.10.3 readxl_1.4.2 janitor_2.2.0
## [5] tidyselect_1.2.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [9] dplyr_1.1.1 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
## [13] tibble_3.2.1 tidyverse_2.0.0 ggplot2_3.4.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.5 utf8_1.2.3 generics_0.1.3 lattice_0.20-45
## [5] stringi_1.7.12 hms_1.1.3 digest_0.6.31 magrittr_2.0.3
## [9] evaluate_0.20 grid_4.3.0 timechange_0.2.0 fastmap_1.1.1
## [13] Matrix_1.5-3 jsonlite_1.8.4 cellranger_1.1.0 mgcv_1.8-42
## [17] fansi_1.0.4 scales_1.2.1 jquerylib_0.1.4 cli_3.6.1
## [21] rlang_1.1.0 splines_4.3.0 munsell_0.5.0 cachem_1.0.7
## [25] yaml_2.3.7 withr_2.5.0 tools_4.3.0 tzdb_0.3.0
## [29] colorspace_2.1-0 vctrs_0.6.1 R6_2.5.1 lifecycle_1.0.3
## [33] snakecase_0.11.0 insight_0.19.2 pkgconfig_2.0.3 bslib_0.4.2
## [37] pillar_1.9.0 gtable_0.3.3 glue_1.6.2 highr_0.10
## [41] xfun_0.38 rstudioapi_0.14 knitr_1.42 farver_2.1.1
## [45] nlme_3.1-162 htmltools_0.5.5 labeling_0.4.2 rmarkdown_2.21
## [49] compiler_4.3.0