NHL Salaries

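This is a stripped-down Kaggle.com data set of NHL salaries. The original data set had 151 columns, but I narrowed it down to 9 for this regression. The variables are Salary, Ht (height), Wt (weight), Ovrl (overall draft pick), Hand (L or R), GP (games played), G (goals), A (assists), and TOI (time on ice).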

For this analysis, I would like to see if I can predict the Salary from the other given variables.

Load and Tidy Data

#Load dplyr, which provides mutate() and if_else() used below
library(dplyr)

#Pull csv file from Github
url <- "https://raw.githubusercontent.com/smithchad17/Class605/master/train.csv"
nhl_salaries <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)

#Mutate the 'Hand' column to numeric
#L = 0 , R = 1
nhl_salaries <- mutate(nhl_salaries, Hand = (if_else(Hand == "R", 1, 0)))
head(nhl_salaries)
##    Salary Ht  Wt Ovrl Hand GP  G  A    TOI
## 1  925000 74 190   18    0  1  0  0    429
## 2 2250000 74 207   15    1 79  2 15 109992
## 3 8000000 72 218    7    1 65 19 26  73983
## 4 3500000 77 220    3    1 30  1  5  36603
## 5 1750000 76 217   16    1 82  7 12  63592
## 6 1500000 70 192  156    0 80  5 12  88462
pairs(nhl_salaries)

From the ‘pairs’ plot it is hard to pick out a single clear predictor of Salary, but several of the predictors are linearly related to each other. Unsurprisingly, height and weight are positively correlated, and games played (GP), goals (G), and assists (A) all increase with time on ice (TOI).

At a glance, salary may depend on the overall draft pick (Ovrl), with players picked later in the draft (a larger Ovrl) tending to earn less. Salary also appears to rise with a player’s time on ice.
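A correlation matrix can put numbers on what the ‘pairs’ plot shows. The sketch below is only an illustration: `nhl_head` is a hypothetical name for the six rows printed by `head()` above, retyped by hand, and six rows are far too few to draw conclusions from. On the full data the same `cor()` call would be applied to `nhl_salaries`.

```r
# The six rows printed by head(nhl_salaries) above, retyped by hand.
# Six rows are far too few for real conclusions; this only shows the call.
nhl_head <- data.frame(
  Salary = c(925000, 2250000, 8000000, 3500000, 1750000, 1500000),
  Ht     = c(74, 74, 72, 77, 76, 70),
  Wt     = c(190, 207, 218, 220, 217, 192),
  Ovrl   = c(18, 15, 7, 3, 16, 156),
  Hand   = c(0, 1, 1, 1, 1, 0),
  GP     = c(1, 79, 65, 30, 82, 80),
  G      = c(0, 2, 19, 1, 7, 5),
  A      = c(0, 15, 26, 5, 12, 12),
  TOI    = c(429, 109992, 73983, 36603, 63592, 88462)
)

# Pairwise Pearson correlations; "pairwise.complete.obs" skips NAs pair by pair
cor_mat <- cor(nhl_head, use = "pairwise.complete.obs")

# Correlation of every other variable with Salary, strongest positive first
salary_cor <- sort(cor_mat["Salary", colnames(cor_mat) != "Salary"],
                   decreasing = TRUE)
round(salary_cor, 2)
```

The `use = "pairwise.complete.obs"` argument matters on the full data, where some columns contain missing values.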

Model

salary_lm <- lm(Salary ~ Ht + Wt + Ovrl + Hand + GP + G + A + TOI, data=nhl_salaries)
summary(salary_lm)
## 
## Call:
## lm(formula = Salary ~ Ht + Wt + Ovrl + Hand + GP + G + A + TOI, 
##     data = nhl_salaries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5572154  -841947  -187174   575703  8838197 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.439e+06  3.076e+06   2.093 0.036834 *  
## Ht          -1.831e+05  5.274e+04  -3.472 0.000561 ***
## Wt           4.053e+04  7.111e+03   5.699 2.05e-08 ***
## Ovrl        -1.138e+03  1.265e+03  -0.899 0.368960    
## Hand        -1.112e+05  1.530e+05  -0.727 0.467821    
## GP          -5.310e+04  7.205e+03  -7.370 7.07e-13 ***
## G            7.173e+04  1.361e+04   5.270 2.03e-07 ***
## A            3.314e+04  1.236e+04   2.681 0.007587 ** 
## TOI          5.767e+01  6.828e+00   8.446 3.26e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1666000 on 503 degrees of freedom
##   (100 observations deleted due to missingness)
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4938 
## F-statistic: 63.32 on 8 and 503 DF,  p-value: < 2.2e-16

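The \(R^2\) (0.5017) and adjusted-\(R^2\) (0.4938) are not very high: the model explains only about half of the variance in salary, and the residual standard error is roughly $1.67 million, so it does not predict the salary very well. The two values are close to each other, which only means that the penalty for the number of predictors is small relative to the sample size; it does not mean the fit is good. Note also that 100 observations were dropped because of missing values, so this fit uses 512 players.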
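To see why the two values sit so close together, the adjusted-\(R^2\) can be recomputed by hand from the printed summary. This is a small sketch using the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\); the variable names are illustrative and the inputs are the rounded figures from the output above.

```r
# Values read off the summary(salary_lm) output above
r2 <- 0.5017        # Multiple R-squared
p  <- 8             # predictors, excluding the intercept
n  <- 503 + p + 1   # residual df + estimated coefficients = 512 rows used

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)    # 0.4938, matching the summary
```

With 512 rows and only 8 predictors, the factor \((n - 1)/(n - p - 1)\) is about 1.016, so the adjustment barely moves the value.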

Let’s remove the variables ‘Hand’ and ‘Ovrl’ since they have the highest p-values.

test_lm <- lm(Salary ~ Ht + Wt + GP + G + A + TOI, data=nhl_salaries)
summary(test_lm)
## 
## Call:
## lm(formula = Salary ~ Ht + Wt + GP + G + A + TOI, data = nhl_salaries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5360698  -734875  -180454   535622  8967269 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.662e+06  2.524e+06   1.847 0.065190 .  
## Ht          -1.515e+05  4.451e+04  -3.404 0.000707 ***
## Wt           3.697e+04  6.208e+03   5.956 4.40e-09 ***
## GP          -5.098e+04  6.207e+03  -8.214 1.30e-15 ***
## G            7.117e+04  1.235e+04   5.761 1.33e-08 ***
## A            3.241e+04  1.113e+04   2.912 0.003719 ** 
## TOI          5.662e+01  6.025e+00   9.398  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1576000 on 605 degrees of freedom
## Multiple R-squared:  0.508,  Adjusted R-squared:  0.5031 
## F-statistic: 104.1 on 6 and 605 DF,  p-value: < 2.2e-16

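The \(R^2\) and adjusted-\(R^2\) increased, but the amount is negligible (0.5017 to 0.508 and 0.4938 to 0.5031). The two fits are also not on the same data: the first model dropped 100 observations with missing values and used 512 rows, while this one uses all 612, so the missing values must be in Ovrl or Hand, and the change in \(R^2\) reflects the extra rows as well as the dropped variables. Either way, it’s still not a good model for predicting the salary.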
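Because `lm()` silently drops rows with missing values, a like-for-like comparison of a full and a reduced model needs both fitted on the same rows. The sketch below shows one way to do that. It runs on simulated data, not the NHL file: `toy`, `common`, `full`, and `reduced` are made-up names, and the numbers it produces say nothing about real salaries.

```r
# Simulated stand-in for nhl_salaries (NOT the real data)
set.seed(1)
n <- 200
toy <- data.frame(
  TOI  = runif(n, 1e3, 1e5),
  G    = rpois(n, 8),
  Ovrl = sample(1:200, n, replace = TRUE),
  Hand = rbinom(n, 1, 0.4)
)
toy$Ovrl[sample(n, 30)] <- NA   # pretend 30 players have no draft position
toy$Salary <- 5e5 + 40 * toy$TOI + 5e4 * toy$G + rnorm(n, sd = 1e6)

# Keep only rows that are complete for every variable in the larger model
vars   <- c("Salary", "TOI", "G", "Ovrl", "Hand")
common <- toy[complete.cases(toy[, vars]), ]

full    <- lm(Salary ~ TOI + G + Ovrl + Hand, data = common)
reduced <- lm(Salary ~ TOI + G, data = common)

# Both fits now use the same observations
nobs(full) == nobs(reduced)

# Partial F-test: does dropping Ovrl and Hand make the fit significantly worse?
anova(reduced, full)
```

On the real data the same pattern would use `nhl_salaries` with the full set of columns in `vars`.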

Coefficients

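The residual median (about -$180,000) sits below zero and the residual standard error is about $1.58 million, so predictions built from these coefficients are rough. Nevertheless, each coefficient can be read as the expected change in salary for a one-unit change in that variable with the others held fixed. In the reduced model, for example, each additional second of time on ice (assuming TOI is recorded in seconds) is associated with about $57 more salary.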
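As a worked example of reading a coefficient, the TOI estimate and its standard error from the reduced model can be turned into an approximate 95% confidence interval and a per-minute figure. The variable names are illustrative, the inputs are the rounded values printed above, and the per-minute conversion assumes TOI is in seconds.

```r
# Numbers read off the TOI row of summary(test_lm) above
toi_est <- 56.62    # dollars of salary per unit of TOI
toi_se  <- 6.025    # standard error of that estimate
res_df  <- 605      # residual degrees of freedom

# Approximate 95% confidence interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = res_df)
toi_ci <- toi_est + c(-1, 1) * t_crit * toi_se
round(toi_ci, 1)

# If TOI is in seconds, one extra minute of ice time corresponds to
per_minute <- 60 * toi_est
per_minute
```

With the fitted model in hand, `confint(test_lm, "TOI")` gives the same interval without the rounding.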

Residuals

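Plotted in row order, the residuals look evenly scattered around zero with no obvious pattern, although the model summary (median -187,174, maximum 8,838,197 against minimum -5,572,154) points to a somewhat longer right tail.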

nhl_res <- salary_lm$residuals
plot(nhl_res)
abline(h=0)
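A rough numeric check of skew is possible from the five-number summary alone. The sketch below uses the quartile (Bowley) skewness, which is my choice for illustration rather than part of the original analysis, together with a simple ratio of the two extremes. The inputs are copied from the `summary(salary_lm)` output.

```r
# Five-number summary of the residuals, copied from summary(salary_lm) above
res_min <- -5572154
q1      <- -841947
med     <- -187174
q3      <-  575703
res_max <-  8838197

# Bowley (quartile) skewness: 0 for a symmetric middle half, positive for right skew
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)
round(bowley, 3)

# Ratio of the largest positive residual to the largest negative one
tail_ratio <- res_max / abs(res_min)
round(tail_ratio, 2)
```

A value of `bowley` near zero would mean the middle half of the residuals is close to symmetric, while a `tail_ratio` well above 1 would mean the largest underpredictions are bigger than the largest overpredictions.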

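The Q-Q plot checks whether the residuals are normally distributed. Points that bend away from the reference line in the tails indicate that they are not, which is further evidence that this linear model does not fit the data very well and that these variables alone are not good predictors for salaries.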

qqnorm(resid(salary_lm))
qqline(resid(salary_lm))