knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
To answer this question, I used the Major League Baseball (2010) salary dataset from openintro.org. The dataset is small: it contains 828 cases (rows), each representing one MLB player, and 4 columns: player, team, position, and salary. To answer the question, I will use only 3 of these columns: team, position, and salary.
mlb <- read.csv("mlb.csv")
head(mlb)
## player team position salary
## 1 Brandon Webb Arizona Diamondbacks Pitcher 8500
## 2 Danny Haren Arizona Diamondbacks Pitcher 8250
## 3 Chris Snyder Arizona Diamondbacks Catcher 5250
## 4 Edwin Jackson Arizona Diamondbacks Pitcher 4600
## 5 Adam LaRoche Arizona Diamondbacks First Baseman 4500
## 6 Chad Qualls Arizona Diamondbacks Pitcher 4185
summary(mlb)
## player team position salary
## Length:828 Length:828 Length:828 Min. : 400.0
## Class :character Class :character Class :character 1st Qu.: 418.3
## Mode :character Mode :character Mode :character Median : 1093.8
## Mean : 3281.8
## 3rd Qu.: 4250.0
## Max. :33000.0
To prepare the dataset, I selected the team, position, and salary columns using the select function. I then removed rows with missing values in the salary or position columns using the filter function, and kept only players with salaries greater than zero. I made a histogram to show the distribution of MLB salaries and a boxplot to show salary by player position.
mlb2 <- select(mlb, team, position, salary)
mlb2 <- filter(mlb2, !is.na(salary), !is.na(position))
mlb2 <- filter(mlb2, salary > 0)
hist(mlb2$salary, main = "Distribution of MLB Salaries",
     xlab = "Salary",
     col = "yellow")
boxplot(salary ~ position, data = mlb2,
main = "Salary by Player Position",
xlab = "Position", ylab = "Salary")
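A numeric summary can complement the boxplot. The sketch below computes the median salary and player count per position with base R. To stay self-contained it uses only the six rows printed by head(mlb) above, stored in a data frame named mlb_head (a name introduced here for illustration); on the full data, mlb2 would take its place.

```r
# The six rows shown in the head(mlb) output, copied by hand
mlb_head <- data.frame(
  position = c("Pitcher", "Pitcher", "Catcher",
               "Pitcher", "First Baseman", "Pitcher"),
  salary   = c(8500, 8250, 5250, 4600, 4500, 4185)
)

# Median salary for each position
med_by_pos <- tapply(mlb_head$salary, mlb_head$position, median)

# Number of players for each position
n_by_pos <- table(mlb_head$position)

print(med_by_pos)
print(n_by_pos)
```

With only six rows these numbers say nothing about the league as a whole; the point is the pattern of the calculation.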
Because team and position are categorical predictors and salary is a continuous outcome variable, I’ll use a multiple linear regression model.
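Since both predictors are categorical, lm() converts each one into a set of 0/1 indicator (dummy) columns, leaving out one level as the reference. The sketch below shows that encoding with model.matrix() on the six positions printed by head(mlb); the data frame name toy is introduced here for illustration only.

```r
# Positions from the six rows shown in the head(mlb) output
toy <- data.frame(
  position = c("Pitcher", "Pitcher", "Catcher",
               "Pitcher", "First Baseman", "Pitcher")
)

# Build the design matrix that lm() would use for a position-only model.
# Character columns are treated as factors with alphabetically sorted levels,
# so "Catcher" becomes the reference level and gets no column of its own.
X <- model.matrix(~ position, data = toy)

print(colnames(X))
print(X)
```

Each remaining column is 1 for players at that position and 0 otherwise, so a coefficient on that column measures a difference from the reference level.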
mod <- lm(salary ~ team + position, data = mlb2)
summary(mod)
##
## Call:
## lm(formula = salary ~ team + position, data = mlb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8262 -2462 -1225 1344 23271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 936.3 970.8 0.964 0.335106
## teamAtlanta Braves 814.0 1165.1 0.699 0.484958
## teamBaltimore Orioles 856.4 1176.0 0.728 0.466713
## teamBoston Red Sox 3234.4 1146.6 2.821 0.004910 **
## teamChicago Cubs 3150.3 1165.9 2.702 0.007040 **
## teamChicago White Sox 1848.6 1180.4 1.566 0.117739
## teamCincinnati Reds 449.3 1176.7 0.382 0.702687
## teamCleveland Indians -144.4 1149.7 -0.126 0.900102
## teamColorado Rockies 626.2 1155.2 0.542 0.587920
## teamDetroit Tigers 2315.7 1165.7 1.986 0.047324 *
## teamFlorida Marlins -266.9 1166.6 -0.229 0.819125
## teamHouston Astros 1032.7 1155.0 0.894 0.371549
## teamKansas City Royals 390.9 1165.8 0.335 0.737443
## teamLos Angeles Angeles 1513.2 1147.2 1.319 0.187535
## teamLos Angeles Dodgers 1252.1 1165.9 1.074 0.283173
## teamMilwaukee Brewers 478.3 1146.2 0.417 0.676587
## teamMinnesota Twins 1265.8 1161.6 1.090 0.276193
## teamNew York Mets 2652.0 1159.2 2.288 0.022410 *
## teamNew York Yankees 6049.6 1191.5 5.077 4.77e-07 ***
## teamOakland Athletics -632.3 1128.6 -0.560 0.575469
## teamPhiladelphia Phillies 2729.2 1154.4 2.364 0.018312 *
## teamPittsburgh Pirates -804.4 1167.5 -0.689 0.491043
## teamSan Diego Padres -833.3 1177.1 -0.708 0.479186
## teamSan Francisco Giants 1190.6 1156.0 1.030 0.303368
## teamSeattle Mariners 736.6 1155.9 0.637 0.524155
## teamSt. Louis Cardinals 1474.4 1187.5 1.242 0.214746
## teamTampa Bay Rays 479.0 1167.8 0.410 0.681825
## teamTexas Rangers -439.4 1146.2 -0.383 0.701571
## teamToronto Blue Jays -277.3 1137.4 -0.244 0.807450
## teamWashington Nationals -139.8 1136.2 -0.123 0.902092
## positionDesignated Hitter 3334.6 1711.8 1.948 0.051768 .
## positionFirst Baseman 3876.0 837.3 4.629 4.30e-06 ***
## positionInfielder -2424.8 2216.1 -1.094 0.274202
## positionOutfielder 1729.1 625.9 2.762 0.005872 **
## positionPitcher 1072.4 559.7 1.916 0.055744 .
## positionSecond Baseman 1147.6 804.5 1.427 0.154098
## positionShortstop 953.2 769.3 1.239 0.215729
## positionThird Baseman 2743.0 819.5 3.347 0.000855 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4237 on 790 degrees of freedom
## Multiple R-squared: 0.1398, Adjusted R-squared: 0.09956
## F-statistic: 3.471 on 37 and 790 DF, p-value: 6.034e-11
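As a quick consistency check, the adjusted R-squared and F-statistic in the output above can be recomputed from the reported R-squared, the number of predictors, and the sample size. This is a sketch using the rounded values printed in the summary, so the results agree with the output only up to rounding. The sample size 828 comes from the summary(mlb) output and matches 790 residual degrees of freedom plus 38 estimated coefficients.

```r
r2 <- 0.1398   # Multiple R-squared, as printed
n  <- 828      # number of players
k  <- 37       # number of predictors (29 team dummies + 8 position dummies)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic for the model against an intercept-only model
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(adj_r2, 4))
print(round(f_stat, 2))
```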
Model Summary: Each coefficient is estimated relative to the reference levels, which are the Arizona Diamondbacks for team and Catcher for position. Team coefficients that were significant at the 0.05 level: Boston Red Sox (p=0.0049), Chicago Cubs (p=0.0070), Detroit Tigers (p=0.0473), New York Mets (p=0.0224), New York Yankees (p=4.77e-07), and Philadelphia Phillies (p=0.0183).
Position coefficients that were significant at the 0.05 level: First Baseman (p=4.30e-06), Outfielder (p=0.0059), and Third Baseman (p=0.000855).
Interpretation: A positive coefficient means that the team or position is associated with a higher salary than the reference level, holding the other predictor fixed. The R-squared value is approximately 0.14 (adjusted R-squared about 0.10), meaning that a player’s team and position explain about 14% of the variation in salary. This also means that other variables influence a player’s salary.
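To make the coefficients concrete, a fitted salary is the intercept plus the coefficient for the player's team plus the coefficient for the player's position. The sketch below does this arithmetic by hand for a New York Yankees first baseman, using the estimates printed in the model output; the vector and variable names are introduced here for illustration.

```r
# Estimates copied from the summary(mod) output
coefs <- c(
  intercept     = 936.3,   # reference player: Arizona Diamondbacks catcher
  yankees       = 6049.6,  # teamNew York Yankees
  first_baseman = 3876.0   # positionFirst Baseman
)

# Fitted salary for the reference team and position
pred_baseline <- coefs[["intercept"]]

# Fitted salary for a Yankees first baseman
pred_yankees_1b <- sum(coefs)

print(pred_baseline)
print(pred_yankees_1b)

# With the fitted model object available, the same value (up to rounding)
# would come from:
# predict(mod, newdata = data.frame(team = "New York Yankees",
#                                   position = "First Baseman"))
```

Both values are in the same units as the salary column.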
par(mfrow = c(2,2))
plot(mod)
par(mfrow = c(1, 1))
vif(mod)
## GVIF Df GVIF^(1/(2*Df))
## team 1.149783 29 1.002409
## position 1.149783 8 1.008761
Interpretation:
For linearity, the Residuals vs. Fitted plot shows a slightly curved pattern, with the points loosely scattered around the horizontal line. Linearity appears acceptable.
For independence of observations, each row in the dataset represents a different MLB player, so the independence assumption is satisfied.
For homoscedasticity, the Scale-Location plot shows a fairly even spread of residuals as the fitted values increase, so homoscedasticity is mostly acceptable.
For normality of residuals, the Q-Q plot shows the residuals mostly following the straight line, with some deviation in the upper tail. That deviation matches the residual summary, where the largest residual (23271) is much larger in magnitude than the smallest (-8262). Normality is reasonably satisfied.
For multicollinearity, the GVIF values for team and position are both approximately 1.15, and the adjusted values GVIF^(1/(2*Df)) are approximately 1.00 and 1.01. Values near 1 indicate very low multicollinearity, which means team and position aren’t strongly correlated with each other.
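Because team and position each take up several degrees of freedom, car::vif() reports a generalized VIF (GVIF) together with the adjusted value GVIF^(1/(2*Df)), which is comparable across predictors with different numbers of levels. The sketch below reproduces the adjusted column from the GVIF and Df values printed above; the variable names are introduced here for illustration.

```r
# Values copied from the vif(mod) output
gvif <- 1.149783
df   <- c(team = 29, position = 8)

# Adjusted GVIF: take the 2*Df-th root of the GVIF
adj_gvif <- gvif^(1 / (2 * df))

print(round(adj_gvif, 6))
```

Both adjusted values land very close to 1, which supports the low-multicollinearity reading.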
Based on the multiple linear regression model, I concluded that both team and position showed some relationship with player salary. Several teams (such as the Yankees and Red Sox) and some positions (first baseman, outfielder, and third baseman) had statistically significant positive coefficients, meaning players on these teams and in these positions tend to have higher salaries. However, the model fit was low, with an R-squared value of 0.14, meaning that a player’s team and position explain only about 14% of the variation in salary.
While the model gave useful insight, other factors that explain MLB salaries were not included. Age, performance, and experience are a few that weren’t in the dataset. For future research, more predictors could be added to the model, such as home runs, batting average, and other measures of individual skill.