Introduction

I have spent the past year completing courses in Machine Learning and Data Analysis in the R language. Once I felt solid in both coding fundamentals and statistics, I began creating projects showcasing my passions. I included all pertinent code in my github repository.
For this project, I used a dataset titled ‘Pitching’ and one titled ‘Batting’ from the Sean Lahman Baseball Database (Lahman package) in R. I used these datasets because they included extensive statistics for pitchers and batters alike. Next, I imported a csv file I obtained from mlb.com which I had cleaned in order to only focus on players’ salaries from 2016. My goal was to look at both hitting and pitching data from 2013-2015 and 2017-2019 in order to see if there was a statistical correlation between output and salary paid. Specifically, I had multiple questions about this data that I hoped to concretely answer.

1. Of the pitching statistics, (Earned Run Average, Strikeouts, Home Runs, Wins, Saves, Opponent’s Batting Average), which (if any) had a clear correlation with a pitcher’s salary?

I hypothesized that Strikeouts and Wins would show to be directly correlated with a pitcher’s salary. Fundamentally, it makes sense to pay pitchers a higher salary who produce more outs as well as overall wins for their team. That being said, receiving a Win for pitching a game is directly related to the remainder of the team’s performance (batting, fielding, relief pitching).

2. Of the batting statistics, (Runs Batted In, Home Runs, Hits, Doubles, Triples, Walks, Strikeouts, Ground into a Double Play, Hit by Pitch, Stolen Bases, Caught Stealing, Games Played, At Bats), which (if any) had a clear correlation with a batter’s salary?

Initially, I believed that a batter’s number of home runs would be clearly correlated with his salary. Upon further thought, I hypothesized that RBIs, GIDPs, and Home runs would be related to a batter’s salary. Home runs are obvious because of team owners wanting to secure players who tend to hit balls out of the park. GIDPs also are a reasonable choice because those aforementioned ‘hard home run hitters’ statistically are more likely to ground into double plays. In the past decade with the wide usage of sabermetrics, I hypothesize that RBIs will be directly correlated with salary. More team owners have begun to value the daily contribution of players (hitting runs in) just as much and maybe even more so than the occasional long ball (and the outs that come as a result). An apt real life example of these two phenomena is Joe Mauer vs Adam Dunn.

3. Are these correlations greater in the three years prior (2013-2015) to the 2016 salary or the three years afterward (2017-2019)?

This question is especially prevalent with the MLB lockout currently taking place. An idealist might believe that a player who had three phenomenal seasons (2013-2015) would be paid a high salary (2016) and then continue to excel (2017-2019). As we’ve seen in recent history, much of baseball is random and MLB has shown to pay high performing players a higher salary after a couple of seasons even though it is statistically very likely they will drift down to the average in the years that follow.

4. Can these correlations help to assess a batter/pitcher’s value relative to others in the league?

These hypothetical correlations tell me that even though it is common practice to pay a player a massive salary after a couple of stellar seasons, doing so isn’t very logical. It makes more sense to avoid handing out those massive contracts and instead, to focus on players that might be playing 3A or 2A ball. This is because if you hone in on these players when they’re near a phenomenal rookie season, then you end up paying them what they deserve rather than spending massive amounts of money on under performing veterans.

5. Can I construct a machine learning model using a batter/pitcher’s performance data from 2013-2015 to predict if said batter/pitcher will receive an above average salary in 2016?

I hypothesize that this is possible because as I’ve already explained some of my theory above, I believe that players who have a couple of phenomenal seasons will likely be well compensated even though it stands to reason that their performance regresses back to average in the seasons to follow.

Note

In order to perform this analysis, I used the 2020 version of salary, batting, and pitching datasets. I was able to obtain the batting and pitching datasets directly from R and used mlb.com to obtain the salary dataset. Below is the documentation for the Lahman R package where this data was found.
https://cran.r-project.org/web/packages/Lahman/Lahman.pdf

A. Preliminary Analysis

I decided to focus on salaries paid in 2016 because the version of MLB data available in the Lahman package was through the end of the 2019 season. This provided a straightforward metric as I could look at the three seasons after 2016 (2017-2019) and well as the three that preceded it (2013-2015).
I began by focusing on the Salary dataset. I filtered by year (2016) and created a histogram focusing on each team’s salary to see the distribution of payroll across both leagues.

library(Lahman)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(tidyr)
library(stringr)
library(ggplot2)
library(knitr)
library(readr)
library(caret)

## Loading required package: lattice

teamSalary <- Salaries %>%
  filter(yearID == 2016) %>%
  select(-yearID, -lgID, -teamID) %>%
  arrange(desc(salary))

individual_salary <- Salaries %>%
  filter(yearID==2016) %>%
  group_by(teamID)

individual_salary

## # A tibble: 853 × 5
## # Groups:   teamID [30]
##    yearID teamID lgID  playerID   salary
##     <int> <fct>  <fct> <chr>       <int>
##  1   2016 ARI    NL    ahmedni01  521600
##  2   2016 ARI    NL    barreja01  507500
##  3   2016 ARI    NL    brachsi01  509300
##  4   2016 ARI    NL    britoso01  508500
##  5   2016 ARI    NL    castiwe01 3700000
##  6   2016 ARI    NL    chafian01  519700
##  7   2016 ARI    NL    clippty01 6100000
##  8   2016 ARI    NL    corbipa01 2525000
##  9   2016 ARI    NL    delarru01 2350000
## 10   2016 ARI    NL    delgara01 1275000
## # … with 843 more rows

hist(teamSalary$salary/1e6, main="2016 Distribution of MLB Salaries",
     ylab="Count", xlab="Team Salary in Millions of $", col="green")

Seven-Year Averages (Pitching) 2013-2019

I chose to import my cleaned salary dataset (.csv) only containing information from 2016. This was more thorough as it consisted of lengthy individual player data, which I wanted to use in order to merge the Batting and Pitching datasets with the salary dataset by playerID.

My goal is for both Batting and Pitching to construct three different data frames. The first will be a seven year average (2013-2019) of the statistics I choose to analyze. The second will be a three year average (2013-2015) prior to 2016 and the third will be a three year average post 2016 (2017-2019).

I began by reading the salary data set csv into R studio. Then, I filtered the Pitching data set in order to examine data from 2013-2019. Next, I merged Pitching with my imported salary data set by the parameter “playerID”. I continued cleaning the combined data set and removed columns I previously thought were important, but realized they would no longer be pertinent in this analysis. I filtered the data to only contain complete cases (rows) as well as no duplicate players. Then, I removed statistics from the tibble that I wouldn’t be analyzing. I created six graphs showing the relationship between average ERA, SO, HR, W, SV, BAOpp from 2013 to 2019 vs each pitcher’s 2016 salary. I filtered to focus on (0 < meanERA < 9), (meanSV > 0), and (salary_2016 >= 1,000,000).

mlb_salary_2016 <- read_csv("/Users/bethanyleach/Downloads/salaries_2016_final - Sheet1-2.csv")

## Rows: 1005 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): salary, name, playerID, total_value, pos, salary_2016, active_major...
## dbl (1): years
## lgl (1): traded_2016
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pitching_table <- Pitching %>%
  filter(yearID >= 2013, yearID <= 2019)

pitching_table2 <- pitching_table

pitch_salary <- merge(pitching_table2, mlb_salary_2016, by="playerID", all=TRUE)

pitch_salary_na <- pitch_salary
table(is.na(pitch_salary_na$salary))

## 
## FALSE  TRUE 
##  3657  2535

test <- subset(pitch_salary_na, is.na(salary))
data2 <- subset(pitch_salary_na, !is.na(salary))

pitch_salary_cleaner <- pitch_salary_na %>%
  group_by(playerID) %>%
  select(-active_majors_2016)

pitch_salary_na <- pitch_salary_cleaner %>%
  select(-traded_2016)

pitch_salary_aard <- pitch_salary_na[complete.cases(pitch_salary_na), ]

revised_pitching <- as_tibble(pitch_salary_aard)

selected_columns <- revised_pitching %>%
  dplyr::select(-stint, -L, -CG, -SHO, -H, -ER, -IBB,-IPouts, 
                -WP, -HBP, -BK, -BFP, -GF, -R, -SH, -SF, -GIDP, -salary, 
                -total_value, -pos, -years, -avg_annual, -team)

seven_years <- selected_columns %>%
  mutate(salary_2016=salary_2016) %>%
  dplyr::mutate(salary_2016 = gsub("^.","", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::select(-BB, -GS) %>%
  dplyr::summarize(meanW = mean(W), meanHR = mean(HR), meanSO = mean(SO), 
                   meanERA = mean(ERA), salary_2016, meanSV = mean(SV),
                   meanBAOpp = mean(BAOpp))

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

seven_years$salary_2016 <- as.numeric(seven_years$salary_2016)

seven_years_edited <- seven_years[!duplicated(seven_years$playerID),] %>%
  filter(meanERA > 0 && meanERA < 9 && salary_2016 >=1000000 && meanSV > 0)

seven_years_edited_2 <- seven_years_edited

mean_so_plot_16 <- ggplot(seven_years_edited, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Strike Outs vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_W_plot_16 <- ggplot(seven_years_edited, aes(meanW, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Wins from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Wins vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SV_plot_16 <- ggplot(seven_years_edited, aes(meanSV, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Saves from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Saves vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_era_plot_16 <- ggplot(seven_years_edited, aes(meanERA, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean ERA from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average ERA vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_16 <- ggplot(seven_years_edited, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Home Runs from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Home Runs vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BAOpp_plot_16 <- ggplot(seven_years_edited, aes(meanBAOpp, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean BAOpp from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Batting Opponent's Average vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_so_plot_16

## `geom_smooth()` using formula 'y ~ x'

mean_W_plot_16

## `geom_smooth()` using formula 'y ~ x'

mean_SV_plot_16

## `geom_smooth()` using formula 'y ~ x'

mean_era_plot_16

## `geom_smooth()` using formula 'y ~ x'

era_salary_correlation <- cor(seven_years_edited$meanERA, seven_years_edited$salary_2016)
so_salary_correlation <- cor(seven_years_edited$meanSO, seven_years_edited$salary_2016)
hr_salary_correlation <- cor(seven_years_edited$meanHR, seven_years_edited$salary_2016)
w_salary_correlation <- cor(seven_years_edited$meanW, seven_years_edited$salary_2016)
sv_salary_correlation <- cor(seven_years_edited$meanSV, seven_years_edited$salary_2016)
baopp_salary_correlation <- cor(seven_years_edited$meanBAOpp, seven_years_edited$salary_2016)

so_salary_correlation

## [1] 0.2281683

w_salary_correlation

## [1] 0.1665059

sv_salary_correlation

## [1] 0.3810178

era_salary_correlation

## [1] -0.1520302

Correlation Coefficients

I included a data table listing the correlations, but the three most positive correlations with salary were Strikeouts, Wins, and Saves.This aligns with my hypothesis. It makes sense that Saves would show a positive correlation as this statistic is related to Wins. There is also a clear negative correlation between ERA and salary. This relationship is expected as a pitcher would receive a hefty contract for producing a low ERA meaning the higher the ERA, the lower the salary.

Three-Year Averages (Pitching) 2013-2015

Next, I cleaned this pitching and salary merged dataset similar to how I did previously when looking at seven year averages. However, here I filtered in order to look at data from 2013 through 2015. In order to have a similar sized dataset (~150 entries), I only focused on limiting the data to meanERA > 0 and meanSV > 0. I understand that this inherently skews my data, but my goal was to be consistent regarding the number of rows being examined.

yr_2013_2015_pitching <- seven_years

yr_2013_2015_pitching <- selected_columns %>%
  filter(yearID > 2012, yearID < 2016) %>%
  mutate(salary_2016=salary_2016) %>%
  dplyr::mutate(salary_2016 = gsub("^.", "", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::select(-BB, -GS) %>%
  dplyr::summarize(meanW = mean(W), meanHR = mean(HR), meanSO = mean(SO), 
                   meanERA = mean(ERA), salary_2016, meanSV = mean(SV),
                   meanBAOpp = mean(BAOpp))

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

yr_2013_2015_pitching$salary_2016 <- as.numeric(yr_2013_2015_pitching$salary_2016)

yr_2013_2015_edited <- yr_2013_2015_pitching[!duplicated(yr_2013_2015_pitching$playerID),] %>%
  filter(meanERA > 0 && meanSV > 0)

mean_ERA_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanERA, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean ERA from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Average ERA vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Average Strikouts vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean HR from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Average Homeruns vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_W_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanW, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Wins from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Average Wins vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SV_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanSV, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Saves from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Average Saves vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BAOpp_plot_13_15 <- ggplot(yr_2013_2015_edited, aes(meanBAOpp, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean BAOpp from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three Year Batting Opponent's Average vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_W_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_SV_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_ERA_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

era_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanERA, yr_2013_2015_edited$salary_2016)
so_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanSO, yr_2013_2015_edited$salary_2016)
hr_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanHR, yr_2013_2015_edited$salary_2016)
w_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanW, yr_2013_2015_edited$salary_2016)
sv_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanSV, yr_2013_2015_edited$salary_2016)
baopp_2013_2015_salary_correlation <- cor(yr_2013_2015_edited$meanBAOpp, yr_2013_2015_edited$salary_2016)

so_2013_2015_salary_correlation

## [1] 0.3898406

w_2013_2015_salary_correlation

## [1] 0.2563163

sv_2013_2015_salary_correlation

## [1] 0.5214593

era_2013_2015_salary_correlation

## [1] -0.3214728

Correlation Coefficients

Again, the result was that the strongest positive correlations with salary were Strikeouts, Wins, and Saves. Conversely, there was an even more negative correlation between ERA and salary, which aligns with my hypothesis that high performing pitchers will post a lower ERA prior to their salary obtained in 2016.

Three-Year Averages (Pitching) 2017-2019

Lastly, I cleaned this pitching and salary merged data set similarly, but instead focusing on 2017-2019. However, here I filtered in order to look at data from 2013 through 2015. In order to have a similar sized data set (~150 entries), I focused on limiting the data to (meanERA > 0, (meanSV > 0), and salary_2016 > 600,000. Once again, I understand that this inherently skews my data, but my goal was to be consistent regarding the number of rows being examined.

yr_2017_2019_pitching <- selected_columns %>%
  filter(yearID > 2016, yearID < 2020) %>%
  mutate(salary_2016=salary_2016) %>%
  dplyr::mutate(salary_2016 = gsub("^.", "", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::select(-BB, -GS) %>%
  dplyr::summarize(meanW = mean(W), meanHR = mean(HR), meanSO = mean(SO), 
                   meanERA = mean(ERA), salary_2016, meanSV = mean(SV),
                   meanBAOpp = mean(BAOpp))

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

yr_2017_2019_pitching$salary_2016 <- as.numeric(yr_2017_2019_pitching$salary_2016)

yr_2017_2019_edited <- yr_2017_2019_pitching[!duplicated(yr_2017_2019_pitching$playerID),] %>%
  filter(meanERA > 0 && meanSV > 0 && salary_2016 > 600000)

mean_ERA_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanERA, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean ERA from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Average ERA vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Average Strikouts vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Homeruns from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Average Homeruns vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_W_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanW, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Wins from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Average Wins vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SV_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanSV, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Saves from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Average Saves vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BAOpp_plot_17_19 <- ggplot(yr_2017_2019_edited, aes(meanBAOpp, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean BAOpp from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three Year Batting Opponent's Average vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_W_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_SV_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_ERA_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

era_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanERA, yr_2017_2019_edited$salary_2016)
so_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanSO, yr_2017_2019_edited$salary_2016)
hr_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanHR, yr_2017_2019_edited$salary_2016)
w_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanW, yr_2017_2019_edited$salary_2016)
sv_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanSV, yr_2017_2019_edited$salary_2016)
baopp_2017_2019_salary_correlation <- cor(yr_2017_2019_edited$meanBAOpp, yr_2017_2019_edited$salary_2016)

so_2017_2019_salary_correlation

## [1] 0.1526471

w_2017_2019_salary_correlation

## [1] 0.07815378

sv_2017_2019_salary_correlation

## [1] 0.2919571

era_2017_2019_salary_correlation

## [1] 0.07869842

Correlation Coefficients

Here, as I anticipated, the correlations have all substantially decreased. There was a clear linear relationship between a pitcher’s performance in the three seasons prior and the salary he received in 2016, but this dependence was not observed in the three seasons following.

ERA <- c(-0.321, -0.152, 0.0787)
Strikeouts <- c(0.390, 0.228, 0.153)
Home_Runs <- c(0.183, 0.116, 0.0971)
Wins <- c(0.256, 0.167, -0.0782)
Saves <- c(0.521, 0.381, 0.292)
BAOpp <- c(-0.266, -0.206, -0.0377)

pitch_cor_dt_13_19 <- data.frame(ERA, Strikeouts, Home_Runs, Wins, Saves, BAOpp)
pitch_cor_dt_13_19_b <- as.data.frame(t(pitch_cor_dt_13_19))

colnames(pitch_cor_dt_13_19_b) <- c("Previous Three Seasons", "Seven-Year Average", "Latter Three Seasons")

pitch_cor_dt_13_19_b

##            Previous Three Seasons Seven-Year Average Latter Three Seasons
## ERA                        -0.321             -0.152               0.0787
## Strikeouts                  0.390              0.228               0.1530
## Home_Runs                   0.183              0.116               0.0971
## Wins                        0.256              0.167              -0.0782
## Saves                       0.521              0.381               0.2920
## BAOpp                      -0.266             -0.206              -0.0377

Correlation Coefficient Table Explanation

This goes hand in hand with what I’ve already discussed in this analysis, meaning, in any random experiment, an observed outlier will eventually over time regress back toward the average. In this case, pitchers who made above average appearances from 2013-2015 will receive large contracts (2016) even though statistically they are more likely to return back to the mean from 2017-2019. It stands to reason that those pitchers who were paid more will not be of any substantial value in the seasons to follow. A simpler example is that pitchers who posted more wins from 2013-2015 will post fewer wins following their salary paid in 2016.

Batting (2013-2019)

To start, I filtered the Batting dataset to examine the data from 2013-2019. As I did previously, I merged Batting with my cleaned salary dataset. I continued cleaning the combined dataset and removed columns I previously thought were important, but realized they would no longer be pertinent in this analysis. I filtered the data to only contain complete cases (rows) as well as no duplicate players. Then, I removed statistics from the tibble that I wouldn’t be analyzing. I created thirteen graphs showing the relationship between average RBI, HR, H, 2B, 3B, BB, SO, GIDP, HBP, SB, CS, G, AB from 2013 to 2019 vs each batter’s 2016 salary. I filtered to focus on position players (removing pitchers who had batting data) as well as meanG > 81 in order to restrict the number of rows between 250-300 and to look at batters who played at least half of the 162 game season.

Batting_2016 <- Batting %>%
  filter(yearID >= 2013, yearID <= 2019)

batting_salary <- merge(Batting_2016, mlb_salary_2016, by="playerID", all=TRUE)

batting_salary_n <- batting_salary

table(is.na(batting_salary_n$salary))

## 
## FALSE  TRUE 
##  6182  4249

test2 <- subset(batting_salary_n, is.na(salary))
data3 <- subset(batting_salary_n, !is.na(salary))

batting_salary_cleaner <- batting_salary_n %>%
  group_by(playerID) %>%
  select(-active_majors_2016)

batting_salary_n <- batting_salary_cleaner %>%
  select(-traded_2016)

batting_salary_adj <- batting_salary_n[complete.cases(batting_salary_n), ]

batting_selected_columns<- as_tibble(batting_salary_adj)

batting_selected_columns <- batting_selected_columns %>%
  select(-stint, -salary, -total_value, -years, -avg_annual, -team)

batting_revised_columns <- batting_selected_columns %>%
  select(-R, -IBB, -SH, -SF)

batting_altered_columns <- batting_revised_columns %>%
  mutate(salary_2016 = salary_2016, pos = pos) %>%
  dplyr::mutate(salary_2016 = gsub("^.","", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::summarize(meanG = mean(G), meanAB = mean(AB), meanH = mean(H), 
                   mean2B = mean(X2B), mean3B = mean(X3B), meanHR = mean(HR),
                   meanRBI = mean(RBI), meanSB = mean(SB), meanCS = mean(CS),
                   meanBB = mean(BB), meanSO = mean(SO), meanHBP = mean(HBP), 
                   meanGIDP = mean(GIDP), pos, salary_2016)

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

batting_altered_columns$salary_2016 <- as.numeric(batting_altered_columns$salary_2016)

batting_altered_columns_edited <- batting_altered_columns[!duplicated(batting_altered_columns$playerID),]

unique(batting_altered_columns_edited[c("pos")])

## # A tibble: 13 × 1
##    pos  
##    <chr>
##  1 RP   
##  2 1B   
##  3 LF   
##  4 SS   
##  5 CF   
##  6 2B   
##  7 SP   
##  8 P    
##  9 3B   
## 10 C    
## 11 RF   
## 12 OF   
## 13 DH

#filter for position players
pos_filtered_batting <- dplyr::filter(batting_altered_columns_edited, pos %in% c("1B", "LF", "SS", "CF", 
                                                                                 "2B", "3B", "C", "RF", "OF", "DH"))
pos_filtered_batting <- pos_filtered_batting %>%
  filter(meanG > 81)

mean_RBI_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanRBI, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean RBI from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average RBI vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Home Runs from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Home Runs vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_H_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanH, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hits from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Hits vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_2B_plot_13_19 <- ggplot(pos_filtered_batting, aes(mean2B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Doubles from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Doubles vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BB_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanBB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Walks from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Walks vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_GIDP_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanGIDP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Grounding into DP from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Grounding into DP vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HBP_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanHBP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hit by Pitch Occurences from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Hit by Pitch Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Strikeouts vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SB_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanSB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Stolen Bases from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Stolen Bases vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_CS_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanCS, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Caught Stealing Occurences from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Caught Stealing Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_3B_plot_13_19 <- ggplot(pos_filtered_batting, aes(mean3B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Triples from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Triples vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_G_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanG, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Games from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average Games vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_AB_plot_13_19 <- ggplot(pos_filtered_batting, aes(meanAB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean At Bats from 2013-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Seven-Year Average At Bats vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_13_19

## `geom_smooth()` using formula 'y ~ x'

mean_GIDP_plot_13_19

## `geom_smooth()` using formula 'y ~ x'

mean_RBI_plot_13_19

## `geom_smooth()` using formula 'y ~ x'

rbi_salary_correlation <- cor(pos_filtered_batting$meanRBI, pos_filtered_batting$salary_2016)
hr_salary_correlation <- cor(pos_filtered_batting$meanHR, pos_filtered_batting$salary_2016)
hits_salary_correlation <- cor(pos_filtered_batting$meanH, pos_filtered_batting$salary_2016)
dbls_salary_correlation <- cor(pos_filtered_batting$mean2B, pos_filtered_batting$salary_2016)
walks_salary_correlation <- cor(pos_filtered_batting$meanBB, pos_filtered_batting$salary_2016)
gidp_salary_correlation <- cor(pos_filtered_batting$meanGIDP, pos_filtered_batting$salary_2016)
hbp_salary_correlation <- cor(pos_filtered_batting$meanHBP, pos_filtered_batting$salary_2016)
so_salary_correlation <- cor(pos_filtered_batting$meanSO, pos_filtered_batting$salary_2016)
sb_salary_correlation <- cor(pos_filtered_batting$meanSB, pos_filtered_batting$salary_2016)
cs_salary_correlation <- cor(pos_filtered_batting$meanCS, pos_filtered_batting$salary_2016)
triples_salary_correlation <- cor(pos_filtered_batting$mean3B, pos_filtered_batting$salary_2016)
games_salary_correlation <- cor(pos_filtered_batting$meanG, pos_filtered_batting$salary_2016)
ab_salary_correlation <- cor(pos_filtered_batting$meanAB, pos_filtered_batting$salary_2016)

hr_salary_correlation

## [1] 0.2191916

gidp_salary_correlation

## [1] 0.3617475

rbi_salary_correlation

## [1] 0.3272181

Correlation Coefficients

I will include a data table listing the correlations, but the three most positive correlations with salary were GIDP (Grounding into Double Play), HR (Home Runs), and RBI (Runs batted in).This aligns with my hypothesis. As I previously explained, team management is looking for the player who can regularly ‘hit it out of the park’. More home runs equate to more hard-hit outs (GIDP). There is also a strong positive relationship between the seven year average RBIs and the batter’s 2016 salary, which makes sense because consistently performing players are valuable to the team as a whole.

Three-Year Averages (Batting) 2013-2015

Next, I cleaned this pitching and salary merged dataset similar to how I did previously when looking at seven year averages. However, here I filtered in order to look at data from 2013 through 2015. In order to have a similar sized dataset (~250 entries), I only focused on limiting the data to meanG > 81. I understand that this inherently skews my data, but my goal was to be consistent regarding the number of rows being examined. In this case, the strongest correlations with 2016 salary were RBI, HR, H, 2B, BB, GIDP, and AB.

yr_2013_2015_batting <- batting_revised_columns %>%
  filter(yearID > 2012, yearID < 2016) %>%
  mutate(salary_2016 = salary_2016, pos = pos) %>%
  dplyr::mutate(salary_2016 = gsub("^.","", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::summarize(meanG = mean(G), meanAB = mean(AB), meanH = mean(H), 
                   mean2B = mean(X2B), mean3B = mean(X3B), meanHR = mean(HR),
                   meanRBI = mean(RBI), meanSB = mean(SB), meanCS = mean(CS),
                   meanBB = mean(BB), meanSO = mean(SO), meanHBP = mean(HBP), 
                   meanGIDP = mean(GIDP), pos, salary_2016)

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

yr_2013_2015_batting$salary_2016 <- as.numeric(yr_2013_2015_batting$salary_2016)

yr_2013_2015_batting_edited <- yr_2013_2015_batting[!duplicated(yr_2013_2015_batting$playerID),]

unique(yr_2013_2015_batting_edited[c("pos")])

## # A tibble: 13 × 1
##    pos  
##    <chr>
##  1 RP   
##  2 1B   
##  3 LF   
##  4 SS   
##  5 CF   
##  6 2B   
##  7 SP   
##  8 P    
##  9 3B   
## 10 C    
## 11 RF   
## 12 OF   
## 13 DH

pos_filtered_batting_13_15 <- dplyr::filter(yr_2013_2015_batting_edited, pos %in% c("1B", "LF", "SS", "CF", 
                                                                                    "2B", "3B", "C", "RF", "OF", "DH"))

pos_filtered_batting_13_15 <- pos_filtered_batting_13_15 %>%
  filter(meanG > 81)

mean_RBI_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanRBI, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean RBI from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average RBI vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Home Runs from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Home Runs vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_H_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanH, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hits from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Hits vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_2B_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(mean2B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Doubles from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Doubles vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BB_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanBB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Walks from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Walks vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_GIDP_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanGIDP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Grounding into DP from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Grounding into DP vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HBP_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanHBP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hit by Pitch Occurences from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Hit by Pitch Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Strikeouts vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SB_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanSB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Stolen Bases from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Stolen Bases vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_CS_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanCS, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Caught Stealing Occurences from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Caught Stealing Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_3B_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(mean3B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Triples from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Triples vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_G_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanG, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Games from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average Games vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_AB_plot_13_15 <- ggplot(pos_filtered_batting_13_15, aes(meanAB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean At Bats from 2013-2015")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Previous Three-Year Average At Bats vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_GIDP_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_RBI_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_AB_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_H_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_2B_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

mean_BB_plot_13_15

## `geom_smooth()` using formula 'y ~ x'

batting_13_15_rbi <- cor(pos_filtered_batting_13_15$meanRBI, pos_filtered_batting_13_15$salary_2016)
batting_13_15_hr <- cor(pos_filtered_batting_13_15$meanHR, pos_filtered_batting_13_15$salary_2016)
batting_13_15_hits <- cor(pos_filtered_batting_13_15$meanH, pos_filtered_batting_13_15$salary_2016)
batting_13_15_dbls <- cor(pos_filtered_batting_13_15$mean2B, pos_filtered_batting_13_15$salary_2016)
batting_13_15_walks <- cor(pos_filtered_batting_13_15$meanBB, pos_filtered_batting_13_15$salary_2016)
batting_13_15_gidp <- cor(pos_filtered_batting_13_15$meanGIDP, pos_filtered_batting_13_15$salary_2016)
batting_13_15_hbp <- cor(pos_filtered_batting_13_15$meanHBP, pos_filtered_batting_13_15$salary_2016)
batting_13_15_so <- cor(pos_filtered_batting_13_15$meanSO, pos_filtered_batting_13_15$salary_2016)
batting_13_15_sb <- cor(pos_filtered_batting_13_15$meanSB, pos_filtered_batting_13_15$salary_2016)
batting_13_15_cs <- cor(pos_filtered_batting_13_15$meanCS, pos_filtered_batting_13_15$salary_2016)
batting_13_15_triples <- cor(pos_filtered_batting_13_15$mean3B, pos_filtered_batting_13_15$salary_2016)
batting_13_15_games <- cor(pos_filtered_batting_13_15$meanG, pos_filtered_batting_13_15$salary_2016)
batting_13_15_ab <- cor(pos_filtered_batting_13_15$meanAB, pos_filtered_batting_13_15$salary_2016)

batting_13_15_hr

## [1] 0.4838633

batting_13_15_gidp

## [1] 0.4095797

batting_13_15_rbi

## [1] 0.5641108

batting_13_15_ab

## [1] 0.4470293

batting_13_15_hits

## [1] 0.4597096

batting_13_15_dbls

## [1] 0.4479218

batting_13_15_walks

## [1] 0.4900238

Correlation Coefficients

I have explained the expected positive relationship between salary and HR, GIDP, and RBI. The other positive relationships observed in the data set from 2013-2015 are reasonable considering the large number of at-bats young players can have. More specifically, when referring to players receiving large contracts in 2016, it stands to reason that they performed above-average offensively - meaning more hits, extra-base hits, and walks.

Three-Year Averages (Batting) 2017-2019

Lastly, I cleaned this batting and salary merged data set similarly, but instead focusing on 2017-2019. However, here I filtered in order to look at data from 2013 through 2015. In order to have a similar sized data set (~250 entries), I again focused on limiting the data to meanG >81. I understand that this inherently skews my data, but my goal was to be consistent regarding the number of rows being examined.

yr_2017_2019_batting <- batting_revised_columns %>%
  filter(yearID > 2016, yearID < 2020) %>%
  mutate(salary_2016 = salary_2016, pos = pos) %>%
  dplyr::mutate(salary_2016 = gsub("^.","", salary_2016)) %>%
  dplyr::mutate(salary_2016 = gsub(",","", salary_2016)) %>%
  group_by(playerID) %>%
  dplyr::summarize(meanG = mean(G), meanAB = mean(AB), meanH = mean(H), 
                   mean2B = mean(X2B), mean3B = mean(X3B), meanHR = mean(HR),
                   meanRBI = mean(RBI), meanSB = mean(SB), meanCS = mean(CS),
                   meanBB = mean(BB), meanSO = mean(SO), meanHBP = mean(HBP), 
                   meanGIDP = mean(GIDP), pos, salary_2016)

## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.

yr_2017_2019_batting$salary_2016 <- as.numeric(yr_2017_2019_batting$salary_2016)

yr_2017_2019_batting_edited <- yr_2017_2019_batting[!duplicated(yr_2017_2019_batting$playerID),]

unique(yr_2017_2019_batting_edited[c("pos")])

## # A tibble: 13 × 1
##    pos  
##    <chr>
##  1 RP   
##  2 1B   
##  3 SS   
##  4 CF   
##  5 2B   
##  6 SP   
##  7 P    
##  8 LF   
##  9 3B   
## 10 C    
## 11 RF   
## 12 OF   
## 13 DH

pos_filtered_batting_17_19<- dplyr::filter(yr_2017_2019_batting_edited, pos %in% c("1B", "LF", "SS", "CF", 
                                                                                   "2B", "3B", "C", "RF", "OF", "DH"))

pos_filtered_batting_17_19 <- pos_filtered_batting_17_19 %>%
  filter(meanG > 81)

mean_RBI_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanRBI, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean RBI from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average RBI vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanHR, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Home Runs from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Home Runs vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_H_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanH, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hits from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Hits vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_2B_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(mean2B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Doubles from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Doubles vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_BB_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanBB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Walks from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Walks vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_GIDP_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanGIDP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Grounding into DP from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Grounding into DP vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HBP_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanHBP, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Hit by Pitch Occurences from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Hit by Pitch Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SO_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanSO, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Strikeouts from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Strikeouts vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_SB_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanSB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Stolen Bases from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Stolen Bases vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_CS_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanCS, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Caught Stealing Occurences from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Caught Stealing Occurences vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_3B_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(mean3B, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Triples from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Triples vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_G_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanG, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean Games from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average Games vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_AB_plot_17_19 <- ggplot(pos_filtered_batting_17_19, aes(meanAB, (salary_2016/1e6))) + geom_point() + 
  scale_x_continuous("Mean At Bats from 2017-2019")+ scale_y_continuous("Salary (millions of $)") + 
  ggtitle("Latter Three-Year Average At Bats vs 2016 Salary") + theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method=lm, level=0.99)

mean_HR_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_GIDP_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_RBI_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_AB_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_H_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_2B_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

mean_BB_plot_17_19

## `geom_smooth()` using formula 'y ~ x'

batting_17_19_rbi <- cor(pos_filtered_batting_17_19$meanRBI, pos_filtered_batting_17_19$salary_2016)
batting_17_19_hr <- cor(pos_filtered_batting_17_19$meanHR, pos_filtered_batting_17_19$salary_2016)
batting_17_19_hits <- cor(pos_filtered_batting_17_19$meanH, pos_filtered_batting_17_19$salary_2016)
batting_17_19_dbls <- cor(pos_filtered_batting_17_19$mean2B, pos_filtered_batting_17_19$salary_2016)
batting_17_19_walks <- cor(pos_filtered_batting_17_19$meanBB, pos_filtered_batting_17_19$salary_2016)
batting_17_19_gidp <- cor(pos_filtered_batting_17_19$meanGIDP, pos_filtered_batting_17_19$salary_2016)
batting_17_19_hbp <- cor(pos_filtered_batting_17_19$meanHBP, pos_filtered_batting_17_19$salary_2016)
batting_17_19_so <- cor(pos_filtered_batting_17_19$meanSO, pos_filtered_batting_17_19$salary_2016)
batting_17_19_sb <- cor(pos_filtered_batting_17_19$meanSB, pos_filtered_batting_17_19$salary_2016)
batting_17_19_cs <- cor(pos_filtered_batting_17_19$meanCS, pos_filtered_batting_17_19$salary_2016)
batting_17_19_triples <- cor(pos_filtered_batting_17_19$mean3B, pos_filtered_batting_17_19$salary_2016)
batting_17_19_games <- cor(pos_filtered_batting_17_19$meanG, pos_filtered_batting_17_19$salary_2016)
batting_17_19_ab <- cor(pos_filtered_batting_17_19$meanAB, pos_filtered_batting_17_19$salary_2016)

batting_17_19_hr

## [1] -0.01871258

batting_17_19_gidp

## [1] 0.2082704

batting_17_19_rbi

## [1] 0.05144179

batting_17_19_ab

## [1] 0.03730766

batting_17_19_hits

## [1] 0.05132616

batting_17_19_dbls

## [1] 0.01622466

batting_17_19_walks

## [1] 0.1055452

Correlation Coefficients

Here, as I anticipated, the correlations have all substantially decreased.

RBI <- c(0.564, 0.327, 0.051)
HR <- c(0.484, 0.219, -0.019)
H <- c(0.460, 0.271, 0.051)
Doubles <- c(0.448, 0.231, 0.0162)
Triples <- c(-0.169, -0.211, -0.216)
BB <- c(0.490, 0.295, 0.106)
SO <- c(0.272, 0.101, -0.104)
GIDP <- c(0.410, 0.362, 0.208)
HBP <- c(0.150, 0.044, -0.0082)
SB <- c(-0.035, -0.088, -0.112)
CS <- c(-0.110, -0.152, -0.184)
G <- c(0.370, 0.214, -0.034)
AB <- c(0.447, 0.272, 0.037)

batting_cor_dt_13_19 <- data.frame(RBI, HR, H, Doubles, Triples, BB, SO, GIDP,
                                   HBP, SB, CS, G, AB)
batting_cor_dt_13_19_b <- as.data.frame(t(batting_cor_dt_13_19))

colnames(batting_cor_dt_13_19_b) <- c("Previous Three Seasons", "Seven-Year Average", "Latter Three Seasons")

batting_cor_dt_13_19_b

##         Previous Three Seasons Seven-Year Average Latter Three Seasons
## RBI                      0.564              0.327               0.0510
## HR                       0.484              0.219              -0.0190
## H                        0.460              0.271               0.0510
## Doubles                  0.448              0.231               0.0162
## Triples                 -0.169             -0.211              -0.2160
## BB                       0.490              0.295               0.1060
## SO                       0.272              0.101              -0.1040
## GIDP                     0.410              0.362               0.2080
## HBP                      0.150              0.044              -0.0082
## SB                      -0.035             -0.088              -0.1120
## CS                      -0.110             -0.152              -0.1840
## G                        0.370              0.214              -0.0340
## AB                       0.447              0.272               0.0370

Correlation Coefficient Table Explanation

There was a clear linear relationship between a batter’s performance in the three seasons prior and the salary he received in 2016, but this dependence was not observed in the three seasons following. If teams can hone in on a player before he reaches his optimal performing season, they acquire the player for less money.

Further Analysis : T-Test (Pitching)

Again using 2016 as my year of interest (since I am looking at salaries from that year), I wanted to see if there was a measurable difference between a pitcher’s performance if he was paid above or below the mean. I needed to use an independent sample t-test due to the unrelatedness of the two means. I’ve included two tables showing the means and standard deviations and the t-scores and standard errors for all pertinent pitching statistics. I am going to delve into performing this t-test looking at the mean of SV (Saves) vs 2016 salary. Specifically, I was looking at the change in SV from the three seasons (2013-2015) preceding the 2016 salary and the latter three seasons (2017-2019). Logic shows that if this were a positive number, it would mean that the pitcher threw better in the years following the 2016 salary whereas a negative value would mean he threw worse. If my hypothesis is correct - meaning that over time, a pitcher’s performance falls to the average - then pitchers paid above the mean 2016 salary would have a smaller \(\Delta\)SV . If I am incorrect, then there would be no observable difference in \(\Delta\)SV between pitchers who were paid above average in 2016 and those who were paid below average. I believe it is appropriate to use a one-tailed t-test because my question to answer is if the means are different in one direction (not two).

\(\Delta\)SV = (Avg SV 2017-2019) - (Avg SV 2013-2015)

To perform any statistical test, I need to create a null and an alternative hypothesis. The purpose of the null hypothesis is to effectively state there is no difference between the two groups being compared, meaning it’s used to either certify or denounce the statistical claim I’m making. Conversely, the purpose of the alternative hypothesis is to state that there is a relationship (in this case an observed difference) between the two groups being compared, meaning that my analysis was statistically significant.

My null hypothesis (H₀) is: pitchers who were paid above average in 2016 \(\mu_{above avg salary}\) will have an average number of saves (\(\Delta\)SV) equivalent or greater than the (\(\Delta\)SV ) recorded by pitchers who were paid below average \(\mu_{below avg salary}\) in 2016.

H₀ : \(\mu_{above avg salary}\) - \(\mu_{below avg salary}\) >= 0

My alternative hypothesis (Ha) is: pitchers who were paid above average (\(\mu\)above avg salary) in 2016 will have an average number of saves (\(\Delta\)SV) less than the (\(\Delta\)SV ) recorded by pitchers who were paid below average (\(\mu\)below avg salary) in 2016.

Ha : \(\mu_{above avg salary}\) - \(\mu_{below avg salary}\) < 0

To perform a t-test, I began by establishing a confidence level ( ɑ ) of 0.05. Next, I had to calculate the degrees of freedom value (df) for my one-tailed t-Test in order to find the t-critical value.

df = #\(samplesabove_{avg salary}\) + #\(samples_{below avg salary}\) - 2
Since there were 27 pitchers paid above average and 38 paid below, my df value was 63. When looking at the t-table, 63 corresponded to a t-critical value of -1.669. The t-critical value is negative because the null hypothesis states that the means will change negatively.

I then had to calculate the sample mean and standard deviation of \(\Delta\)SV for pitchers paid above average in 2016. I had to perform the same calculation for pitchers paid below average in 2016. Next, I had to find the difference in means between pitchers paid above average and those paid below (the two sample groups). Afterwards, I used the standard deviations from each of my sample groups and normalized them using the number of pitchers in each respective sample in order to find the standard error. Then I found the t-statistic, which was the difference in means between the two sample groups divided by the standard error. The next step was comparing the calculated t-statistics to the previously found t-critical value in order to either reject or accept the null hypothesis.

I used my same altered pitching data frame that I have already discussed, but added a standardized column so that I could easily observe if a pitcher was paid above or below the average in 2016.

#t-test information

mean_salary_2016_pitch <- mean(seven_years_edited$salary_2016)
sd_salary_2016_pitch <- sd(seven_years_edited$salary_2016)

#Focusing on players who played 2013-2019

playerID_merge_pitch <- yr_2013_2015_edited %>%
  left_join(yr_2017_2019_edited, by = "playerID")

playerID_merge_na_pitch<- playerID_merge_pitch

table(is.na(playerID_merge_na_pitch$meanW.y))

## 
## FALSE  TRUE 
##    65   109

test_merge_pitch <- subset(playerID_merge_na_pitch, is.na(meanW.y))
data_merge_pitch <- subset(playerID_merge_na_pitch, !is.na(meanW.y))

playerID_merge_13_15_pitch <- data_merge_pitch %>%
  select(-meanW.y, -meanHR.y, -meanSO.y, -meanERA.y, -meanSV.y, -meanBAOpp.y, -salary_2016.y)

ERA_change_previous_latter_pitch <- data_merge_pitch$meanERA.y - data_merge_pitch$meanERA.x
W_change_previous_latter_pitch <- data_merge_pitch$meanW.y - data_merge_pitch$meanW.x
HR_change_previous_latter_pitch <- data_merge_pitch$meanHR.y - data_merge_pitch$meanHR.x
SO_change_previous_latter_pitch <- data_merge_pitch$meanSO.y - data_merge_pitch$meanSO.x
SV_change_previous_latter_pitch <- data_merge_pitch$meanSV.y - data_merge_pitch$meanSV.x
BAOpp_change_previous_latter_pitch <- data_merge_pitch$meanBAOpp.y - data_merge_pitch$meanBAOpp.x


standardized_salary_pitch <- (data_merge_pitch$salary_2016.x - mean_salary_2016_pitch) / (sd_salary_2016_pitch)

finalized_pitch_statistics_change_df <- data.frame(data_merge_pitch$playerID, data_merge_pitch$salary_2016.y, 
                                                   ERA_change_previous_latter_pitch, HR_change_previous_latter_pitch, 
                                                   W_change_previous_latter_pitch, SO_change_previous_latter_pitch,
                                                   SV_change_previous_latter_pitch,
                                                   BAOpp_change_previous_latter_pitch, standardized_salary_pitch)


salary_above_mean_pitch <- finalized_pitch_statistics_change_df %>%
  filter(standardized_salary_pitch > 0)

salary_below_mean_pitch <- finalized_pitch_statistics_change_df %>%
  filter(standardized_salary_pitch < 0)

nrow(salary_below_mean_pitch)

## [1] 38

nrow(salary_above_mean_pitch)

## [1] 27

hist_ERA_above_avg <- hist(salary_above_mean_pitch$ERA_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in ERA", col="blue") 

hist_ERA_below_avg <- hist(salary_below_mean_pitch$ERA_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in ERA", col="blue") 

hist_W_above_avg <- hist(salary_above_mean_pitch$W_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Wins (W)", col="blue") 

hist_W_below_avg <- hist(salary_below_mean_pitch$W_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Wins (W)", col="blue") 

hist_HR_above_avg <- hist(salary_above_mean_pitch$HR_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Home Runs (HR)", col="blue") 

hist_HR_below_avg <- hist(salary_below_mean_pitch$HR_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Home Runs (HR)", col="blue") 

hist_SO_above_avg <- hist(salary_above_mean_pitch$SO_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Strikeouts (SO)", col="blue") 

hist_SO_below_avg <- hist(salary_below_mean_pitch$SO_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Strikeouts (SO)", col="blue") 

hist_BAOpp_above_avg <- hist(salary_above_mean_pitch$W_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in BAOpp", col="blue") 

hist_BAOpp_below_avg <- hist(salary_below_mean_pitch$W_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in BAOpp", col="blue")

hist_SV_above_avg <- hist(salary_above_mean_pitch$SV_change_previous_latter_pitch, main="Pitchers with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Saves (SV)", col="blue")

hist_SV_below_avg <- hist(salary_below_mean_pitch$SV_change_previous_latter_pitch, main="Pitchers with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Saves (SV)", col="blue")

Histogram Explanation

Both histograms loosely follow a normal distribution, but the above average salary graph is skewed more negative. The mean for pitchers paid below average in 2016 was -0.7842 and the mean for those paid above average was -8.219. The negative change in saves (\(\Delta\)SV) demonstrates that the pitchers didn’t perform as well during the years post 2016 as they did in the years preceding 2016.

avg_salary_above_mean_pitch_ERA<- mean(salary_above_mean_pitch$ERA_change_previous_latter_pitch)
avg_salary_below_mean_pitch_ERA <- mean(salary_below_mean_pitch$ERA_change_previous_latter_pitch)

sd_salary_above_mean_pitch_ERA <- sd(salary_above_mean_pitch$ERA_change_previous_latter_pitch)
sd_salary_below_mean_pitch_ERA <- sd(salary_below_mean_pitch$ERA_change_previous_latter_pitch)

avg_salary_above_mean_pitch_W<- mean(salary_above_mean_pitch$W_change_previous_latter_pitch)
avg_salary_below_mean_pitch_W <- mean(salary_below_mean_pitch$W_change_previous_latter_pitch)

sd_salary_above_mean_pitch_W <- sd(salary_above_mean_pitch$W_change_previous_latter_pitch)
sd_salary_below_mean_pitch_W <- sd(salary_below_mean_pitch$W_change_previous_latter_pitch)

avg_salary_above_mean_pitch_HR<- mean(salary_above_mean_pitch$HR_change_previous_latter_pitch)
avg_salary_below_mean_pitch_HR <- mean(salary_below_mean_pitch$HR_change_previous_latter_pitch)

sd_salary_above_mean_pitch_HR <- sd(salary_above_mean_pitch$HR_change_previous_latter_pitch)
sd_salary_below_mean_pitch_HR <- sd(salary_below_mean_pitch$HR_change_previous_latter_pitch)

avg_salary_above_mean_pitch_SO <- mean(salary_above_mean_pitch$SO_change_previous_latter_pitch)
avg_salary_below_mean_pitch_SO <- mean(salary_below_mean_pitch$SO_change_previous_latter_pitch)

sd_salary_above_mean_pitch_SO <- sd(salary_above_mean_pitch$SO_change_previous_latter_pitch)
sd_salary_below_mean_pitch_SO <- sd(salary_below_mean_pitch$SO_change_previous_latter_pitch)

avg_salary_above_mean_pitch_BAOpp <- mean(salary_above_mean_pitch$BAOpp_change_previous_latter_pitch)
avg_salary_below_mean_pitch_BAOpp <- mean(salary_below_mean_pitch$BAOpp_change_previous_latter_pitch)

sd_salary_above_mean_pitch_BAOpp <- sd(salary_above_mean_pitch$BAOpp_change_previous_latter_pitch)
sd_salary_below_mean_pitch_BAOpp <- sd(salary_below_mean_pitch$BAOpp_change_previous_latter_pitch)

std_error_pitch_ERA = sqrt(((sd_salary_above_mean_pitch_ERA**2)/27) + ((sd_salary_below_mean_pitch_ERA**2)/38))
t_statistic_pitch_ERA <- (avg_salary_above_mean_pitch_ERA - avg_salary_below_mean_pitch_ERA) / (std_error_pitch_ERA)

std_error_pitch_W = sqrt(((sd_salary_above_mean_pitch_W**2)/27) + ((sd_salary_below_mean_pitch_W**2)/38))
t_statistic_pitch_W <- (avg_salary_above_mean_pitch_W - avg_salary_below_mean_pitch_W) / (std_error_pitch_W)

std_error_pitch_HR= sqrt(((sd_salary_above_mean_pitch_HR**2)/27) + ((sd_salary_below_mean_pitch_HR**2)/38))
t_statistic_pitch_HR <- (avg_salary_above_mean_pitch_HR - avg_salary_below_mean_pitch_HR) / (std_error_pitch_HR)

std_error_pitch_SO = sqrt(((sd_salary_above_mean_pitch_SO**2)/27) + ((sd_salary_below_mean_pitch_SO**2)/38))
t_statistic_pitch_SO <- (avg_salary_above_mean_pitch_SO - avg_salary_below_mean_pitch_SO) / (std_error_pitch_SO)

std_error_pitch_BAOpp = sqrt(((sd_salary_above_mean_pitch_BAOpp**2)/27) + ((sd_salary_below_mean_pitch_BAOpp**2)/38))
t_statistic_pitch_BAOpp <- (avg_salary_above_mean_pitch_BAOpp - avg_salary_below_mean_pitch_BAOpp) / (std_error_pitch_BAOpp)

avg_salary_above_mean_pitch_SV <- mean(salary_above_mean_pitch$SV_change_previous_latter_pitch)
avg_salary_below_mean_pitch_SV <- mean(salary_below_mean_pitch$SV_change_previous_latter_pitch)

sd_salary_above_mean_pitch_SV <- sd(salary_above_mean_pitch$SV_change_previous_latter_pitch)
sd_salary_below_mean_pitch_SV <- sd(salary_below_mean_pitch$SV_change_previous_latter_pitch)

std_error_pitch_SV = sqrt(((sd_salary_above_mean_pitch_SV**2)/27) + ((sd_salary_below_mean_pitch_SV**2)/38))
t_statistic_pitch_SV <- (avg_salary_above_mean_pitch_SV - avg_salary_below_mean_pitch_SV) / (std_error_pitch_SV)

t_statistic_pitch_SV

## [1] -3.155665

T-statistic Explanation

The calculated t-statistic (listed in the table below) was -3.156, which is less than -1.669 (the t-critical value) using the confidence level 0.05. In other words, I’m able to confidently reject H₀. I’ve used statistics to prove that pitchers who were paid above average in 2016 performed worse compared to pitchers who were paid below average from the years preceding 2016 (2013-2015) to the years post 2016 (2017-2019). Regression to the mean indeed occurred because pitchers who threw better from 2013-2015 were paid above average in 2016, but their output fell from 2017-2019 more than the group of pitchers who were paid below average in 2016. Statistically speaking, I used this t-test to see how a pitcher’s output changed three years prior to his 2016 salary and the three years afterwards. The change in saves was demonstrated by the more negative mean from 2017-2019 than the one found from 2013-2015. The calculated t-statistic being less than the t-critical value imply that pitchers who were paid above average in 2016 will perform worse from 2017-2019 than those paid below average in 2016. This means that over a lengthy period of time, an indicator of a pitcher’s performance (saves) will eventually fall towards the mean. This follows my claim that MLB pays pitchers hefty salaries even though they will inevitably throw worse in the seasons that follow. I also performed independent one-sided t-tests for ERA, W, HR, SO, and BAOpp with respect to pitchers and their 2016 salaries.

#batting t-test information

mean_salary_2016 <- mean(pos_filtered_batting$salary_2016)
sd_salary_2016 <- sd(pos_filtered_batting$salary_2016)

#Focusing on players who played 2013-2019

playerID_merge <- pos_filtered_batting_13_15 %>%
  left_join(pos_filtered_batting_17_19, by = "playerID")

playerID_merge_na<- playerID_merge

table(is.na(playerID_merge_na$pos.y))

## 
## FALSE  TRUE 
##   154   100

test_merge <- subset(playerID_merge_na, is.na(pos.y))
data_merge <- subset(playerID_merge_na, !is.na(pos.y))


playerID_merge_13_15 <- data_merge %>%
  select(-meanG.y, -meanAB.y, -meanH.y, -mean2B.y, -mean3B.y, -meanHR.y, -meanRBI.y, -meanSB.y, 
         -meanCS.y, -meanBB.y, -meanSO.y, -meanHBP.y, -meanGIDP.y, -pos.y, -salary_2016.y)


rbi_change_previous_latter <- data_merge$meanRBI.y - data_merge$meanRBI.x
H_change_previous_latter <- data_merge$meanH.y - data_merge$meanH.x
BB_change_previous_latter <- data_merge$meanBB.y - data_merge$meanBB.x
HR_change_previous_latter <- data_merge$meanHR.y - data_merge$meanHR.x


standardized_salary <- (data_merge$salary_2016.x - mean_salary_2016) / (sd_salary_2016)


finalized_batting_stats_change_df <- data.frame(data_merge$playerID, data_merge$salary_2016.y, rbi_change_previous_latter,
                                                H_change_previous_latter, BB_change_previous_latter, HR_change_previous_latter,
                                                standardized_salary)

salary_above_mean <- finalized_batting_stats_change_df%>%
  filter(standardized_salary > 0)

salary_below_mean <- finalized_batting_stats_change_df %>%
  filter(standardized_salary < 0)

hist(salary_above_mean$rbi_change_previous_latter, main="Players with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in RBI", col="blue") 

hist(salary_below_mean$rbi_change_previous_latter, main="Players with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in RBI", col="blue") 

hist(salary_above_mean$H_change_previous_latter, main="Players with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Hits", col="blue") 

hist(salary_below_mean$H_change_previous_latter, main="Players with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Hits", col="blue") 

hist(salary_above_mean$BB_change_previous_latter, main="Players with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Walks", col="blue") 

hist(salary_below_mean$BB_change_previous_latter, main="Players with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Walks", col="blue") 

hist(salary_above_mean$HR_change_previous_latter, main="Players with Salary Above the Average in 2016",
     ylab="Count", xlab="Change in Home Runs", col="blue") 

hist(salary_below_mean$HR_change_previous_latter, main="Players with Salary Below the Average in 2016",
     ylab="Count", xlab="Change in Home Runs", col="blue") 

avg_salary_above_mean_rbi <- mean(salary_above_mean$rbi_change_previous_latter)
avg_salary_below_mean_rbi <- mean(salary_below_mean$rbi_change_previous_latter)

sd_salary_above_mean_rbi <- sd(salary_above_mean$rbi_change_previous_latter)
sd_salary_below_mean_rbi <- sd(salary_below_mean$rbi_change_previous_latter)

avg_salary_above_mean_H <- mean(salary_above_mean$H_change_previous_latter)
avg_salary_below_mean_H <- mean(salary_below_mean$H_change_previous_latter)

sd_salary_above_mean_H <- sd(salary_above_mean$H_change_previous_latter)
sd_salary_below_mean_H <- sd(salary_below_mean$H_change_previous_latter)

avg_salary_above_mean_BB <- mean(salary_above_mean$BB_change_previous_latter)
avg_salary_below_mean_BB <- mean(salary_below_mean$BB_change_previous_latter)

sd_salary_above_mean_BB <- sd(salary_above_mean$BB_change_previous_latter)
sd_salary_below_mean_BB <- sd(salary_below_mean$BB_change_previous_latter)

avg_salary_above_mean_HR <- mean(salary_above_mean$HR_change_previous_latter)
avg_salary_below_mean_HR <- mean(salary_below_mean$HR_change_previous_latter)

sd_salary_above_mean_HR <- sd(salary_above_mean$HR_change_previous_latter)
sd_salary_below_mean_HR <- sd(salary_below_mean$HR_change_previous_latter)


std_error_rbi = sqrt(((sd_salary_above_mean_rbi**2)/76) + ((sd_salary_below_mean_rbi**2)/78))
t_statistic_rbi <- (avg_salary_above_mean_rbi - avg_salary_below_mean_rbi) / (std_error_rbi)

std_error_H = sqrt(((sd_salary_above_mean_H**2)/76) + ((sd_salary_below_mean_H**2)/78))
t_statistic_H <- (avg_salary_above_mean_H - avg_salary_below_mean_H) / (std_error_H)

std_error_BB = sqrt(((sd_salary_above_mean_BB**2)/76) + ((sd_salary_below_mean_BB**2)/78))
t_statistic_BB <- (avg_salary_above_mean_BB - avg_salary_below_mean_BB) / (std_error_BB)

std_error_HR = sqrt(((sd_salary_above_mean_HR**2)/76) + ((sd_salary_below_mean_HR**2)/78))
t_statistic_HR <- (avg_salary_above_mean_HR - avg_salary_below_mean_HR) / (std_error_HR)

Explanation

Then, I performed independent one-sided t-tests for RBI, H, BB, and HR with respect to batters and their 2016 salaries. As previously stated, these were all one-tailed t-test because I wanted to determine if the means differed in one direction (negative) and not two. As expected, the statistics I chose to analyze demonstrated a negative correlation between 2016 salary and a batter’s/pitcher’s performance in the years that followed. The only positive correlation was between ERA and a pitcher’s 2016 salary - meaning that a pitcher who was paid above average tended to post a lower ERA.

Create a Model with Machine Learning

As stated in the beginning of this report, my fifth objective was to construct a machine learning model using a batter/pitcher’s performance data from 2013-2015 in order to predict if said batter/pitcher would receive a salary above the average in 2016. I used the same filter - meaning pitchers who posted a meanERA > 0, meanSV >0, and were paid a salary in 2016, I used their data from 2013-2015 to make the model. Continuing to build the model, I used the same pitching metrics (meanERA, meanW, meanHR, meanSO, meanSV, meanBAOpp).

testing_set_13_15_pitch <- playerID_merge_13_15_pitch

mean_salary_test_pitch <-mean(testing_set_13_15_pitch$salary_2016.x)
sd_salary_test_pitch <-sd(testing_set_13_15_pitch$salary_2016.x)

standardized_salary_test_pitch <- (testing_set_13_15_pitch$salary_2016.x - mean_salary_test_pitch) / (sd_salary_test_pitch)

testing_set_13_15_pitch$labels <- as.integer(standardized_salary_test_pitch > 0)
trainIndex2=createDataPartition(testing_set_13_15_pitch$labels, p=0.7)

testing_set_13_15_pitch <- testing_set_13_15_pitch %>%
  select(-meanW.x, -meanHR.x, -meanSO.x, -meanERA.x, -meanSV.x, -meanBAOpp.x, 
         -salary_2016.x) %>%
  ungroup()

testing_set_13_15_pitch <- testing_set_13_15_pitch %>%
  select(-playerID) %>%
  ungroup

testing_set_13_15_pitch <- testing_set_13_15_pitch %>%
  mutate(standardized_salary_test_pitch = standardized_salary_test_pitch)

testing_set_13_15_pitch$labels <- as.character(testing_set_13_15_pitch$labels)
testing_set_13_15_pitch$labels <- as.numeric(testing_set_13_15_pitch$labels)

testing_set_13_15_pitch <- testing_set_13_15_pitch[sample(nrow(testing_set_13_15_pitch)),]
testing_13_15_pitch_train <- testing_set_13_15_pitch[1:48,]
testing_13_15_pitch_test <- testing_set_13_15_pitch[48:65,]

testing_13_15_pitch_test$labels <- as.factor(testing_13_15_pitch_test$labels)
testing_13_15_pitch_train$labels <- as.factor(testing_13_15_pitch_train$labels)

nb_pitch <- naivebayes::naive_bayes(labels ~ ., data = testing_13_15_pitch_train)

plot(nb_pitch)

pitching.output <- cbind(testing_13_15_pitch_test, pred_pitch = predict(nb_pitch, testing_13_15_pitch_test))

## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.

pred_pitch  <- predict(nb_pitch, newdata = select(testing_13_15_pitch_test,-labels))

caret::confusionMatrix(pitching.output$pred_pitch, pitching.output$labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 13  0
##          1  0  5
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8147, 1)
##     No Information Rate : 0.7222     
##     P-Value [Acc > NIR] : 0.002858   
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.7222     
##          Detection Rate : 0.7222     
##    Detection Prevalence : 0.7222     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##

Explanation

In this data frame, my inputs were meanERA, meanW, meanHR, meanSO, meanSV, meanBAOpp averaged from 2013-2015. To carry out machine learning, I also needed an output. The output was if a pitcher with those inputs was paid a salary in 2016 that was above the average. To do this, I standardized the 2016 salaries as I did when performing the t-test and then alter them to be label of 1 or 0 with the 1 referencing above average salaries and 0 referring to below average salaries.

I used a Gaussian Naive Bayes classifier because it follows a Gaussian (normal) distribution, which is what I wanted since I was dealing with data that was continuous. My confusion matrix produced an accuracy of 1, which means that it can predict with better than random accuracy that a player will be paid above the average in 2016 based off the meanERA, meanW, meanHR, meanSO, meanSV, meanBAOpp from 2013-2015. It’s obvious that this classifier would be more effective with the data from more pitchers. In addition, these statistics (inputs) weren’t significantly tied with the salaries pitchers were paid. It’s possible that other statistics could better demonstrate with more accuracy if a pitcher will mandate a salary above the average.

#machine learning batting
mean_salary_2016 <- mean(pos_filtered_batting$salary_2016)
sd_salary_2016 <- sd(pos_filtered_batting$salary_2016)

testing_set_13_15 <- playerID_merge_13_15

mean_salary_test <-mean(testing_set_13_15$salary_2016.x)
sd_salary_test <-sd(testing_set_13_15$salary_2016.x)

standardized_salary_test <- (testing_set_13_15$salary_2016.x - mean_salary_test) / (sd_salary_test)

testing_set_13_15$labels <- as.integer(standardized_salary_test > 0)
trainIndex=createDataPartition(testing_set_13_15$labels, p=0.7)

testing_set_13_15 <- testing_set_13_15 %>%
  select(-pos.x, -meanG.x, -meanAB.x, -mean2B.x, -mean3B.x, -meanSB.x, 
         -meanCS.x, -meanHBP.x, -meanGIDP.x, -salary_2016.x) %>%
  ungroup()

testing_set_13_15 <- testing_set_13_15 %>%
  select(-playerID) %>%
  ungroup()

testing_set_13_15 <- testing_set_13_15 %>%
  mutate(standardized_salary_test = standardized_salary_test)
         
testing_set_13_15$labels <- as.character(testing_set_13_15$labels)
testing_set_13_15$labels <- as.numeric(testing_set_13_15$labels)

testing_set_13_15 <- testing_set_13_15[sample(nrow(testing_set_13_15)),]
testing_13_15_train <- testing_set_13_15[1:120,]
testing_13_15_test <- testing_set_13_15[120:154,]

testing_13_15_test$labels <- as.factor(testing_13_15_test$labels)
testing_13_15_train$labels <- as.factor(testing_13_15_train$labels)
nb <- naivebayes::naive_bayes(labels ~ ., data = testing_13_15_train)
plot(nb)

batting.output <- cbind(testing_13_15_test, pred = predict(nb, testing_13_15_test))

## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.

pred <- predict(nb, newdata = select(testing_13_15_test,-labels))

caret::confusionMatrix(batting.output$pred, batting.output$labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20  5
##          1  0 10
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.6974, 0.9519)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : 0.0003014       
##                                           
##                   Kappa : 0.6957          
##                                           
##  Mcnemar's Test P-Value : 0.0736383       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5714          
##          Detection Rate : 0.5714          
##    Detection Prevalence : 0.7143          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 0               
##

Explanation

The machine learning model I built looking at position players’ salaries vs the specified batting metrics (meanG, meanAB, mean2B, mean3B, meanSB, meanCS, meanHBP, meanGIDP) included more entries (154) than the 65 used with respect to pitchers. The batting model still produced an accuracy rate of 88.6 percent. This usage of machine learning showed that even without a lengthy amount of data, this mechanism can make predictions with a higher accuracy than guessing randomly.

Overall Summary of Analysis

This analysis showed that a pitcher/batter’s statistics might be related to his salary, but this does not equate to said statistics leading to a less than or greater than average salary.
These were the conclusions I made:

Pitching data from 2013-2019 demonstrated that SV (saves) and SO (strikeouts) were the statistics most highly correlated to 2016 salary. These correlations were certainly stronger (larger) from 2013-2015 than they were from 2017-2019.
Batting data from 2013-2019 demonstrated that GIDP (grounding into a double play) and RBI (runs batted in) were the statistics most highly correlated to 2016 salary. These correlations were certainly stronger (larger) from 2013-2015 than they were from 2017-2019.
Using a t-Test, I showed that pitchers who were paid above average in 2016 performed worse (measured by Saves) from 2017-2019 than they did from 2013-2015. This statistically shows that regarding performance, pitchers paid above the average regress towards the mean over a lengthy period of time.
Using a t-Test, I showed that batters who were paid above average in 2016 performed worse (measured by Hits, Walks, Home Runs, and RBIs) from 2017-2019 than they did from 2013-2015. This statistically shows that regarding performance, batters paid above the average regress towards the mean over a lengthy period of time. Therefore, team owners should try to find batters/pitchers who are undervalued prior to them having strong performing seasons. There is a huge disparity in payrolls of MLB teams and by adopting this approach, I believe more teams will be successful. This analysis showed that once a pitcher/batter has gotten to the point where they’re paid a heftier salary, they statistically will not perform as well in the seasons that follow. This is because his previous performance was mathematically an outlier and will eventually fall towards the mean.

These were the limitations I ran into while performing this analysis:

I was not able to set a control for the age of the players when analyzing this data. The age of the pitchers/batters was an issue in that as a player ages and spends more time in the majors, his salary increases even if his performance likely falters.
The number of entries (pitchers/batters) in the datasets. When filtering the pitching dataset - I was only left with around 65 pitchers and when filtering the batting dataset, I was only left with 154 batters. In order to look at more data, I could have looked at a longer span of years (maybe 14) or I could have removed filters (each player batting in at least 81 games).
Some teams inevitably make it to the post-season and it’s entirely possible that a batter/pitcher’s salary could increase based off their performance. It would be interesting to conduct this analysis again, but to include post-season data in order to see its effect vs that of regular season play on one’s salary. In addition, I didn’t account for the phenomena that a pitcher’s number of wins/saves is very related to the defense of the team (their ability to field effectively). If I were to conduct another analysis, I would use these basic metrics (RBIs, Hits, HRs, etc) to look at more in-depth statistics in order to pin down all of the factors that might contribute to a recorded number of home runs. An example of this is WAR (wins above replacement) which gives each batter a number that indicates how many wins they helped their team achieve vs the number that any random batter would have created.
The randomness in baseball statistics meaning that some years batters do better and some pitchers perform better. This is obviously in conjunction with the different sizes of MLB stadiums as well as the climate found at each (precipitation, air quality, etc).

Overall, I chose MLB pitching and batting as the focus of my analysis because I am very passionate about baseball and have been for years. This was a great topic for generating hypotheses about the topic, cleaning the data, and then using statistics to point out pertinent discoveries. I even attempted using machine learning to see if I could construct a model that could predict more accurately than if I guessed at random!

Data Tables

Pitching

ERA_mean <- c(0.568, 1.650)
ERA_st_dev <- c(2.023, 1.902)
W_mean <- c(-0.734, -1.258)
W_st_dev <- c(2.010, 2.760)
HR_mean <- c(0.474, 0.542)
HR_st_dev <- c(3.755, 2.940)
SO_mean <- c(-13.820, -18.550)
SO_st_dev <- c(23.470, 23.010)
SV_mean <- c(-0.784, -8.219)
SV_st_dev <- c(6.890, 10.780)
BAOpp_mean <- c(0.005, 0.026)
BAOpp_st_dev <- c(0.040, 0.034)

pitch_mean_dev <- data.frame(ERA_mean, ERA_st_dev, W_mean, W_st_dev, HR_mean, HR_st_dev,
                             SO_mean, SO_st_dev, SV_mean, SV_st_dev, 
                             BAOpp_mean, BAOpp_st_dev)
pitch_mean_dev_b <- as.data.frame(t(pitch_mean_dev))

colnames(pitch_mean_dev_b) <- c("2016 Salary Above the Mean", "2016 Salary Below the Mean")

pitch_mean_dev_b

##              2016 Salary Above the Mean 2016 Salary Below the Mean
## ERA_mean                          0.568                      1.650
## ERA_st_dev                        2.023                      1.902
## W_mean                           -0.734                     -1.258
## W_st_dev                          2.010                      2.760
## HR_mean                           0.474                      0.542
## HR_st_dev                         3.755                      2.940
## SO_mean                         -13.820                    -18.550
## SO_st_dev                        23.470                     23.010
## SV_mean                          -0.784                     -8.219
## SV_st_dev                         6.890                     10.780
## BAOpp_mean                        0.005                      0.026
## BAOpp_st_dev                      0.040                      0.034

ERA_st <- c(0.492, 2.201)
W_st <- c(0.623, -0.841)
HR_st <- c(0.831, 0.082)
SO_st <- c(5.840, -0.811)
SV_st <- c(2.356, -3.156)
BAOpp_st <- c(0.009, 2.306)

pitch_st_t <- data.frame(ERA_st, W_st, HR_st, SO_st, SV_st, BAOpp_st)
pitch_st_t_b <- as.data.frame(t(pitch_st_t))

colnames(pitch_st_t_b) <- c("Standard Error", "T-Statistic")

pitch_st_t_b

##          Standard Error T-Statistic
## ERA_st            0.492       2.201
## W_st              0.623      -0.841
## HR_st             0.831       0.082
## SO_st             5.840      -0.811
## SV_st             2.356      -3.156
## BAOpp_st          0.009       2.306

Batting

RBI_mean <- c(12.880, -9.660)
RBI_st_dev <- c(19.840, 18.260)
H_mean <- c(5.898, -28.160)
H_st_dev <- c(30.670, 27.480)
BB_mean <- c(9.850, -6.642)
BB_st_dev <- c(15.420, 11.800)
HR_mean <- c(6.213, -1.165)
HR_st_dev <- c(6.940, 6.629)

batting_mean_dev <- data.frame(RBI_mean, RBI_st_dev, H_mean, H_st_dev, BB_mean, BB_st_dev,
                             HR_mean, HR_st_dev)
batting_mean_dev_b <- as.data.frame(t(batting_mean_dev))

colnames(batting_mean_dev_b) <- c("2016 Salary Above the Mean", "2016 Salary Below the Mean")

batting_mean_dev_b

##            2016 Salary Above the Mean 2016 Salary Below the Mean
## RBI_mean                       12.880                     -9.660
## RBI_st_dev                     19.840                     18.260
## H_mean                          5.898                    -28.160
## H_st_dev                       30.670                     27.480
## BB_mean                         9.850                     -6.642
## BB_st_dev                      15.420                     11.800
## HR_mean                         6.213                     -1.165
## HR_st_dev                       6.940                      6.629

RBI_st <- c(3.072, -7.338)
Hit_st <- c(4.689, -7.263)
BB_st <- c(2.210, -7.464)
HR_b_st <- c(1.093, -6.748)

batting_st_t <- data.frame(RBI_st, Hit_st, BB_st, HR_b_st)
batting_st_t_b <- as.data.frame(t(batting_st_t))

colnames(batting_st_t_b) <- c("Standard Error", "T-Statistic")

batting_st_t_b

##         Standard Error T-Statistic
## RBI_st           3.072      -7.338
## Hit_st           4.689      -7.263
## BB_st            2.210      -7.464
## HR_b_st          1.093      -6.748

Analysis of MLB Pitching/Batting from 2013-2019

Bethany Leach

2022-06-28

Introduction

1. Of the pitching statistics, (Earned Run Average, Strikeouts, Home Runs, Wins, Saves, Opponent’s Batting Average), which (if any) had a clear correlation with a pitcher’s salary?

2. Of the batting statistics, (Runs Batted In, Home Runs, Hits, Doubles, Triples, Walks, Strikeouts, Ground into a Double Play, Hit by Pitch, Stolen Bases, Caught Stealing, Games Played, At Bats), which (if any) had a clear correlation with a batter’s salary?

3. Are these correlations greater in the three years prior (2013-2015) to the 2016 salary or the three years afterward (2017-2019)?

4. Can these correlations help to assess a batter/pitcher’s value relative to others in the league?

5. Can I construct a machine learning model using a batter/pitcher’s performance data from 2013-2015 to predict if said batter/pitcher will receive an above average salary in 2016?

Note

A. Preliminary Analysis

Seven-Year Averages (Pitching) 2013-2019

Correlation Coefficients

Three-Year Averages (Pitching) 2013-2015

Correlation Coefficients

Three-Year Averages (Pitching) 2017-2019

Correlation Coefficients

Correlation Coefficient Table Explanation

Batting (2013-2019)

Correlation Coefficients

Three-Year Averages (Batting) 2013-2015

Correlation Coefficients

Three-Year Averages (Batting) 2017-2019

Correlation Coefficients

Correlation Coefficient Table Explanation

Further Analysis : T-Test (Pitching)

Histogram Explanation

T-statistic Explanation

Explanation

Create a Model with Machine Learning

Explanation

Explanation

Overall Summary of Analysis

Data Tables

Pitching

Batting