Buried in this data set exists a roster of 40 batters that earns a below average salary and gives an above average performance.

Import

Source: Major League Baseball data from 1985 to 2016

mlb=readxl::read_excel('MLB_1985-2016.xlsx')
dim(mlb)
## [1] 28293    28

There are 28,293 rows and 28 columns in this data set.

Clean

mlb_cleaned = mlb |> 
  drop_na() |> 
  dplyr::filter(!duplicated(`Player Name`)) |> 
  rename(Salary = salary, Runs = R)

dim(mlb_cleaned)
## [1] 4931   28

Removed NA and duplicate values, and renamed two columns for labeling clarity. Now the data set has 4,931 rows. Let’s look at batters earning below average salary AND giving above average performance.

summary(mlb_cleaned$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    10900   109000   215000   312400   410000 22000000

Salaries below $410,000 will be the cut-off because $312,400 was too low (zero players existed in this salary range). Next, we’ll look at Runs - the number of times a batter scores a run and Runs Batted In (RBI) - a statistic that measures how many runs a batter has driven in.

summary(mlb_cleaned$RBI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00   12.85   17.00  149.00
summary(mlb_cleaned$Runs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    2.00   14.24   19.00  121.00

HR - Home Runs could have been used as a metric, but they aren’t as common and reliable as RBI or Runs. Now, we’ll use RBI > 13 and Runs > 14 as the cut-off for above average metrics. Also, only batters from 2006 to 2016 will be considered because they would have more relevance than someone from previous seasons. These metrics, along with Salary < $410,000, will make up the Most Valuable Player (MVP) roster.

Explore

What is the relationship between RBI and Runs within the MVP roster?

mlb_mvp = mlb_cleaned |> 
dplyr::filter(Season %in% c('2016':'2006') & RBI > 13 & Runs > 14 & Salary < 410000) |> 
select(`Player Name`,Runs,RBI,Salary,Team, everything()) |> arrange(desc(Runs)) |> 
arrange(desc(RBI))
library(gridExtra)

shapiro.test(mlb_mvp$RBI)
## 
##  Shapiro-Wilk normality test
## 
## data:  mlb_mvp$RBI
## W = 0.92403, p-value = 3.113e-07
shapiro.test(mlb_mvp$Runs)
## 
##  Shapiro-Wilk normality test
## 
## data:  mlb_mvp$Runs
## W = 0.95631, p-value = 9.625e-05
mdens1=mlb_mvp %>% 
  ggplot(aes(RBI))+
  geom_histogram(aes(y=..density..))+
  geom_density(col='red',lwd=1)

mdens2=mlb_mvp %>% 
  ggplot(aes(Runs))+
  geom_histogram(aes(y=..density..))+
  geom_density(col='green',lwd=1)

grid.arrange(mdens1,mdens2)

cor.test(mlb_mvp$Runs,mlb_mvp$RBI,method='spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  mlb_mvp$Runs and mlb_mvp$RBI
## S = 103644, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8263648

With a correlation value of 0.8263, there is a strong positive monotonic (not necessarily linear) relationship between RBI and Runs.

mlb_mvp |> 
  ggplot(aes(x= Runs,y = RBI))+
  geom_smooth(method='auto')+
  geom_point(size=2)+
  labs(title='RBI by Runs and less than $410,000',x='Runs',y='Runs Batted In',caption = 'Source: Major League Baseball data from 1985 to 2016')+
  theme_minimal()+
  theme(legend.position = 'none')+
  theme(plot.title = element_text(hjust = .5))+
  annotate('text',x=111,y=125,label='Carlos Gonzalez',size=3,col='darkred')+
  annotate('text',x=104,y=157,label='Ryan Howard',size=3,col='darkred')+
  annotate('text',x=35,y=120,label="rs = 0.8263",col='blue')

Here are the batters with above average RBI and Runs with below average Salary. Additionally, the plot shows two batters, Ryan Howard and Carlos Gonzalez, colored in teal and shaped as triangles scoring above 100 in both RBI and Runs. In other words, they have above averages within the already above average MVP roster.

Recommendation

mlb_mvp %>% 
head(40)

Ryan Howard scores triple digits in RBI and Runs for costing a mere $355,000. I recommend prioritizing the recruitment of Howard, and future batters using these metrics, first before proceeding with the rest of the roster. That said, here are the top 40 batters that cost below the average Salary (< $410,000) and score above average in RBI (> 13) and Runs (> 14).

Export

Top 40 MVP Roster.

sessionInfo()
## R version 4.3.0 alpha (2023-04-01 r84141 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3      performance_0.10.3 readxl_1.4.2       janitor_2.2.0     
##  [5] tidyselect_1.2.0   lubridate_1.9.2    forcats_1.0.0      stringr_1.5.0     
##  [9] dplyr_1.1.1        purrr_1.0.1        readr_2.1.4        tidyr_1.3.0       
## [13] tibble_3.2.1       tidyverse_2.0.0    ggplot2_3.4.1     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.5       utf8_1.2.3       generics_0.1.3   lattice_0.20-45 
##  [5] stringi_1.7.12   hms_1.1.3        digest_0.6.31    magrittr_2.0.3  
##  [9] evaluate_0.20    grid_4.3.0       timechange_0.2.0 fastmap_1.1.1   
## [13] Matrix_1.5-3     jsonlite_1.8.4   cellranger_1.1.0 mgcv_1.8-42     
## [17] fansi_1.0.4      scales_1.2.1     jquerylib_0.1.4  cli_3.6.1       
## [21] rlang_1.1.0      splines_4.3.0    munsell_0.5.0    cachem_1.0.7    
## [25] yaml_2.3.7       withr_2.5.0      tools_4.3.0      tzdb_0.3.0      
## [29] colorspace_2.1-0 vctrs_0.6.1      R6_2.5.1         lifecycle_1.0.3 
## [33] snakecase_0.11.0 insight_0.19.2   pkgconfig_2.0.3  bslib_0.4.2     
## [37] pillar_1.9.0     gtable_0.3.3     glue_1.6.2       highr_0.10      
## [41] xfun_0.38        rstudioapi_0.14  knitr_1.42       farver_2.1.1    
## [45] nlme_3.1-162     htmltools_0.5.5  labeling_0.4.2   rmarkdown_2.21  
## [49] compiler_4.3.0