Project 3

knitr::opts_chunk$set(echo = TRUE)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

Do an MLB player’s team and position predict their salary?

To answer this question, I used the Salary Data for Major League Baseball (2010) dataset from openintro.org. The dataset is small: it contains 828 cases (rows), each representing one MLB player, and 4 columns: player, team, position, and salary. The salary values appear to be recorded in thousands of US dollars. To answer the question, I will use only 3 of these columns: team, position, and salary.

mlb <- read.csv("mlb.csv")
head(mlb)

##          player                 team      position salary
## 1  Brandon Webb Arizona Diamondbacks       Pitcher   8500
## 2   Danny Haren Arizona Diamondbacks       Pitcher   8250
## 3  Chris Snyder Arizona Diamondbacks       Catcher   5250
## 4 Edwin Jackson Arizona Diamondbacks       Pitcher   4600
## 5  Adam LaRoche Arizona Diamondbacks First Baseman   4500
## 6   Chad Qualls Arizona Diamondbacks       Pitcher   4185

summary(mlb)

##     player              team             position             salary       
##  Length:828         Length:828         Length:828         Min.   :  400.0  
##  Class :character   Class :character   Class :character   1st Qu.:  418.3  
##  Mode  :character   Mode  :character   Mode  :character   Median : 1093.8  
##                                                           Mean   : 3281.8  
##                                                           3rd Qu.: 4250.0  
##                                                           Max.   :33000.0

Data Analysis

To analyze the dataset, I selected the team, position, and salary columns using the select function. I then removed rows with missing values in the salary or position columns using the filter function, and kept only players with salaries greater than zero. In this dataset these filters act mainly as a safeguard: summary(mlb) reports no missing salaries and a minimum salary of 400. I made a histogram to show the distribution of MLB salaries and a boxplot to show salary by player position.

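# Keep only the columns used in the model
mlb2 <- select(mlb, team, position, salary)
# Drop rows with a missing salary or position
mlb2 <- filter(mlb2, !is.na(salary), !is.na(position))
# Keep only players with a positive salary
mlb2 <- filter(mlb2, salary > 0)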
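To see how many rows each cleaning rule can remove, it helps to compare row counts before and after filtering. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the MLB data) and base R indexing in place of dplyr, so it runs without any packages. It applies the same three conditions as the filter calls.

```r
# Hypothetical toy data: one clean row, one missing position,
# one missing salary, and one zero salary
toy <- data.frame(
  team     = c("A", "A", "B", "B"),
  position = c("Pitcher", NA, "Catcher", "Pitcher"),
  salary   = c(500, 700, NA, 0)
)

n_before <- nrow(toy)

# Same conditions as the filter() calls: non-missing salary,
# non-missing position, and salary greater than zero
keep <- !is.na(toy$salary) & !is.na(toy$position) & toy$salary > 0
kept <- toy[keep, ]

n_after <- nrow(kept)
c(before = n_before, after = n_after)  # before = 4, after = 1
```

Running the same comparison with nrow(mlb) and nrow(mlb2) would show whether any players were dropped from the real data.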

hist(mlb2$salary, main = "Distribution of MLB Salaries",
     xlab = "Salary",
     col = "yellow")

boxplot(salary ~ position, data = mlb2,
        main = "Salary by Player Position",
        xlab = "Position", ylab = "Salary")

Regression Analysis

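Since team and position are the predictors and salary is a continuous outcome variable, I used a multiple linear regression model. Both predictors are categorical, so R represents each one with indicator (dummy) variables and treats the first level of each as the reference group. Here the reference groups are the Arizona Diamondbacks and Catcher, which is why neither appears in the coefficient table.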

mod <- lm(salary ~ team + position, data = mlb2)
summary(mod)

## 
## Call:
## lm(formula = salary ~ team + position, data = mlb2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8262  -2462  -1225   1344  23271 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  936.3      970.8   0.964 0.335106    
## teamAtlanta Braves           814.0     1165.1   0.699 0.484958    
## teamBaltimore Orioles        856.4     1176.0   0.728 0.466713    
## teamBoston Red Sox          3234.4     1146.6   2.821 0.004910 ** 
## teamChicago Cubs            3150.3     1165.9   2.702 0.007040 ** 
## teamChicago White Sox       1848.6     1180.4   1.566 0.117739    
## teamCincinnati Reds          449.3     1176.7   0.382 0.702687    
## teamCleveland Indians       -144.4     1149.7  -0.126 0.900102    
## teamColorado Rockies         626.2     1155.2   0.542 0.587920    
## teamDetroit Tigers          2315.7     1165.7   1.986 0.047324 *  
## teamFlorida Marlins         -266.9     1166.6  -0.229 0.819125    
## teamHouston Astros          1032.7     1155.0   0.894 0.371549    
## teamKansas City Royals       390.9     1165.8   0.335 0.737443    
## teamLos Angeles Angeles     1513.2     1147.2   1.319 0.187535    
## teamLos Angeles Dodgers     1252.1     1165.9   1.074 0.283173    
## teamMilwaukee Brewers        478.3     1146.2   0.417 0.676587    
## teamMinnesota Twins         1265.8     1161.6   1.090 0.276193    
## teamNew York Mets           2652.0     1159.2   2.288 0.022410 *  
## teamNew York Yankees        6049.6     1191.5   5.077 4.77e-07 ***
## teamOakland Athletics       -632.3     1128.6  -0.560 0.575469    
## teamPhiladelphia Phillies   2729.2     1154.4   2.364 0.018312 *  
## teamPittsburgh Pirates      -804.4     1167.5  -0.689 0.491043    
## teamSan Diego Padres        -833.3     1177.1  -0.708 0.479186    
## teamSan Francisco Giants    1190.6     1156.0   1.030 0.303368    
## teamSeattle Mariners         736.6     1155.9   0.637 0.524155    
## teamSt. Louis Cardinals     1474.4     1187.5   1.242 0.214746    
## teamTampa Bay Rays           479.0     1167.8   0.410 0.681825    
## teamTexas Rangers           -439.4     1146.2  -0.383 0.701571    
## teamToronto Blue Jays       -277.3     1137.4  -0.244 0.807450    
## teamWashington Nationals    -139.8     1136.2  -0.123 0.902092    
## positionDesignated Hitter   3334.6     1711.8   1.948 0.051768 .  
## positionFirst Baseman       3876.0      837.3   4.629 4.30e-06 ***
## positionInfielder          -2424.8     2216.1  -1.094 0.274202    
## positionOutfielder          1729.1      625.9   2.762 0.005872 ** 
## positionPitcher             1072.4      559.7   1.916 0.055744 .  
## positionSecond Baseman      1147.6      804.5   1.427 0.154098    
## positionShortstop            953.2      769.3   1.239 0.215729    
## positionThird Baseman       2743.0      819.5   3.347 0.000855 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4237 on 790 degrees of freedom
## Multiple R-squared:  0.1398, Adjusted R-squared:  0.09956 
## F-statistic: 3.471 on 37 and 790 DF,  p-value: 6.034e-11

Model Summary: The team coefficients that were significant at the 0.05 level were: Boston Red Sox (p = 0.0049), Chicago Cubs (p = 0.0070), Detroit Tigers (p = 0.0473), New York Mets (p = 0.0224), New York Yankees (p = 4.77e-07), and Philadelphia Phillies (p = 0.0183).
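As a check on where these p-values come from, the sketch below recomputes the t statistic and two-sided p-value for the New York Yankees row from the estimate, standard error, and residual degrees of freedom printed in the summary. The variable names are illustrative only.

```r
# Values copied from the summary(mod) output for teamNew York Yankees
estimate  <- 6049.6
std_error <- 1191.5
df_resid  <- 790   # residual degrees of freedom reported by the model

# The t statistic is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with df_resid degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 3)  # 5.077, matching the "t value" column
p_value            # should be close to the printed 4.77e-07
```

The same calculation applies to any other row of the coefficient table.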

The position coefficients that were significant at the 0.05 level were: First Baseman (p = 4.30e-06), Outfielder (p = 0.0059), and Third Baseman (p = 0.000855).
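Each coefficient is a difference from the reference group, which for this model is an Arizona Diamondbacks catcher (the team and position that do not appear in the coefficient table). As a rough illustration, the sketch below adds the printed coefficients to get the model's predicted salary for a New York Yankees first baseman. The variable names are illustrative only.

```r
# Coefficients copied from the summary(mod) output
intercept      <- 936.3    # predicted salary for an Arizona Diamondbacks catcher
team_yankees   <- 6049.6   # teamNew York Yankees
pos_first_base <- 3876.0   # positionFirst Baseman

# With no interaction term, the team and position effects simply add up
yankees_first_baseman <- intercept + team_yankees + pos_first_base
yankees_first_baseman  # 10861.9
```

On the fitted model itself, predict(mod, newdata = ...) would give the same kind of value without copying coefficients by hand.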

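Interpretation: Each coefficient compares a team or position with the reference group (Arizona Diamondbacks for team, Catcher for position). A positive coefficient means that team or position is associated with a higher salary than the reference group, holding the other variable constant. The R-squared value is approximately 0.14 (adjusted R-squared about 0.10), meaning that a player’s team and position explain about 14% of the variation in salary. Most of the variation is therefore left unexplained, so other variables must also influence a player’s salary.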
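The adjusted R-squared penalizes the model for its 37 estimated slopes. A minimal sketch of the standard adjustment, using only numbers printed in the summary (variable names are illustrative):

```r
# Values taken from the summary(mod) output
r_squared <- 0.1398   # Multiple R-squared
n         <- 828      # number of players
p         <- 37       # number of predictors (29 team + 8 position dummies)

# Adjusted R-squared: shrink R-squared according to the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(adj_r_squared, 4)  # 0.0995
```

This agrees with the reported 0.09956 up to rounding of the printed R-squared.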

Model Assumptions and Diagnostics

par(mfrow = c(2,2))
plot(mod)

par(mfrow = c(1, 1))
vif(mod)

##              GVIF Df GVIF^(1/(2*Df))
## team     1.149783 29        1.002409
## position 1.149783  8        1.008761

Interpretation:

For Linearity, the Residuals vs. Fitted plot shows a slightly curved pattern, with the points loosely scattered around the horizontal line. Linearity appears acceptable.
For Independence of Observations, each row in the dataset represents a different MLB player, so the observations are treated as independent.
For Homoscedasticity, the Scale-Location plot shows a fairly even spread of residuals as the fitted values increase, so homoscedasticity is mostly acceptable.
For Normality of Residuals, the Q-Q plot shows the residuals roughly following the straight line, with deviation in the upper tail. This matches the right skew in the residual summary (median -1225, maximum 23271), so normality is only approximately satisfied.
For Multicollinearity, the GVIF values for team and position are both approximately 1.15, and the degrees-of-freedom-adjusted values GVIF^(1/(2*Df)) are about 1.00 and 1.01. These are close to 1, the minimum possible value, which indicates very low multicollinearity. This means team and position aren’t strongly associated with each other.
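The third column of the vif output is derived from the first two. A short sketch that reproduces it from the printed GVIF and Df values (variable names are illustrative):

```r
# Values copied from the vif(mod) output
gvif        <- 1.149783   # same GVIF for team and position
df_team     <- 29
df_position <- 8

# Adjust each GVIF for the number of dummy variables in the term
adj_team     <- gvif^(1 / (2 * df_team))
adj_position <- gvif^(1 / (2 * df_position))

# Approximately 1.0024 for team and 1.0088 for position
round(c(team = adj_team, position = adj_position), 4)
```

The adjustment makes terms with different numbers of dummy variables comparable on one scale.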

Conclusion

Based on the multiple linear regression model, I concluded that both team and position showed some relationship with player salary. Six teams (such as the Yankees and Red Sox) and three positions (first baseman, outfielder, and third baseman) had statistically significant positive coefficients, meaning players on these teams and at these positions tend to have higher salaries than the reference group (Arizona Diamondbacks, catcher). However, the model fit was low, with an R-squared value of 0.14, meaning that a player’s team and position explain only about 14% of the variation in salary.

While the model gave useful insight, other factors that help explain MLB salaries weren’t included. Age, performance, and experience are a few that weren’t in the dataset. In future research, more predictors could be added to the model, such as home runs, batting average, other batting percentages, and measures of individual player skill.