Source: MLB 2012 data set
Which team(s) won the most number of games with the lowest amount of payroll for the 2012 season?
mlb2012 = readxl::read_xlsx('mlb2012.xlsx')
summary(mlb2012)
## Team Payroll (millions) Wins
## Length:30 Min. : 55.24 Min. :55.00
## Class :character 1st Qu.: 76.12 1st Qu.:72.25
## Mode :character Median : 85.75 Median :82.00
## Mean : 98.02 Mean :81.00
## 3rd Qu.:115.79 3rd Qu.:92.25
## Max. :197.96 Max. :98.00
Within the 30 teams in the MLB, there exists a select few that stand out.
mlb = mlb2012 |>
rename('Payroll in Millions' = `Payroll (millions)`)
summary(mlb$`Payroll in Millions`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.24 76.12 85.75 98.02 115.79 197.96
summary(mlb$Wins)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.00 72.25 82.00 81.00 92.25 98.00
The average payroll of 98.02 and the average wins of 81 will be used as the cut-off metrics because we’re looking for the teams that stand out (the outliers).
highpayroll=mlb%>%
dplyr::filter(mlb$`Payroll in Millions`>98.02 & Wins >= 81)
lowpayroll=mlb%>%
dplyr::filter(mlb$`Payroll in Millions`<98.02 & Wins >= 81)
lowpayroll
The teams have been split into two types: low-payroll (below 98.02 with 81 wins) and high-payroll (above 98.02 with 81 wins). Let’s have a closer look at the low-payroll teams to see which ones stand out:
summary(lowpayroll)
## Team Payroll in Millions Wins
## Length:10 Min. :55.37 Min. :81.00
## Class :character 1st Qu.:76.05 1st Qu.:85.25
## Mode :character Median :81.81 Median :91.50
## Mean :81.18 Mean :90.10
## 3rd Qu.:92.18 3rd Qu.:94.00
## Max. :97.65 Max. :98.00
Let’s dive deeper and examine these 10 teams using the average cut-off numbers of 81.18 for payroll and 90 for wins (because you can’t have a 0.10 win per game) with a scatter plot.
lowpayroll %>%
dplyr::filter(Wins >= 90.0 & `Payroll in Millions`<= 81.18) %>%
ggplot(aes(x= `Payroll in Millions`,y=Wins,shape=Team,color=Team))+
geom_point(aes(shape=Team,color=Team,size=1))+
scale_color_manual(values=c('red','darkblue'))+
scale_shape_manual(values=c(15,16))+
labs(title='Most Wins with Lowest Payroll',x='Payroll in Millions',y='Wins')+
theme_bw()+guides(size='none',shape=guide_legend(override.aes=list(size=3)))+
theme(plot.title=element_text(hjust=.5),legend.text=element_text(size=10))
lowpayroll %>%
dplyr::filter(Wins >= 90.0 & `Payroll in Millions`<= 81.18)
Only the Athletics and the Rays remain. While this current analysis doesn’t tell us exactly what the Athletics did, it does alert us to pay more attention to their business strategy.
highpayroll |>
dplyr::filter(Wins == 94) |>
arrange(`Payroll in Millions`) |>
select(Team, Wins, `Payroll in Millions`)
mlb |>
dplyr::filter(Team %in% c('Athletics','Giants')) |> arrange(`Payroll in Millions`) |>
select(Team, Wins, `Payroll in Millions`)
To put this analysis into perspective, let’s look at the high-payroll teams that had the same number of wins (94) as the Athletics did. In the 2012 season, the Athletics paid $589,042 per win (589,042 * 94 = $55,369,948). The Giants paid $1,251,277. So what business strategy drove the Athletics’ results? To answer that question, we must find data on their player salary and performance metrics, which itself will be another story to tell. The takeaway lesson here is - any team that isn’t re-evaluating its business strategy and modeling it after the Athletics risks reducing its chances of making it to the World Series, which in turn reduces its likelihood of winning one.
sessionInfo()
## R version 4.3.0 alpha (2023-04-01 r84141 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.1 tidyselect_1.2.0 lubridate_1.9.2 forcats_1.0.0
## [5] stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1 readr_2.1.4
## [9] tidyr_1.3.0 tibble_3.2.1 tidyverse_2.0.0 ggplot2_3.4.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.5 utf8_1.2.3 generics_0.1.3 stringi_1.7.12
## [5] hms_1.1.3 digest_0.6.31 magrittr_2.0.3 evaluate_0.20
## [9] grid_4.3.0 timechange_0.2.0 fastmap_1.1.1 cellranger_1.1.0
## [13] jsonlite_1.8.4 httr_1.4.5 fansi_1.0.4 viridisLite_0.4.1
## [17] scales_1.2.1 lazyeval_0.2.2 jquerylib_0.1.4 cli_3.6.1
## [21] rlang_1.1.0 munsell_0.5.0 cachem_1.0.7 withr_2.5.0
## [25] yaml_2.3.7 tools_4.3.0 tzdb_0.3.0 colorspace_2.1-0
## [29] vctrs_0.6.1 R6_2.5.1 lifecycle_1.0.3 htmlwidgets_1.6.2
## [33] pkgconfig_2.0.3 bslib_0.4.2 pillar_1.9.0 gtable_0.3.3
## [37] data.table_1.14.8 glue_1.6.2 highr_0.10 xfun_0.38
## [41] rstudioapi_0.14 knitr_1.42 farver_2.1.1 htmltools_0.5.5
## [45] labeling_0.4.2 rmarkdown_2.21 compiler_4.3.0 readxl_1.4.2