NHL Salaries
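This is a stripped-down Kaggle.com data set for NHL Salaries. The original data set had 151 columns but I narrowed it down to 9 for this regression. The variables are Salary, Ht (height), Wt (weight), Ovrl (overall draft pick), Hand (L or R), GP (games played), G (goals), A (assists), and TOI (time on ice, in seconds).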
For this analysis, I would like to see if I can predict the Salary from the other given variables.
Load and Tidy Data
#Load dplyr, which provides mutate() and if_else()
library(dplyr)
#Pull csv file from Github
url <- "https://raw.githubusercontent.com/smithchad17/Class605/master/train.csv"
nhl_salaries <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)
#Mutate the 'Hand' column to numeric
#L = 0 , R = 1
nhl_salaries <- mutate(nhl_salaries, Hand = (if_else(Hand == "R", 1, 0)))
head(nhl_salaries)
## Salary Ht Wt Ovrl Hand GP G A TOI
## 1 925000 74 190 18 0 1 0 0 429
## 2 2250000 74 207 15 1 79 2 15 109992
## 3 8000000 72 218 7 1 65 19 26 73983
## 4 3500000 77 220 3 1 30 1 5 36603
## 5 1750000 76 217 16 1 82 7 12 63592
## 6 1500000 70 192 156 0 80 5 12 88462
pairs(nhl_salaries)
From the ‘pairs’ plot it is hard to pick out a single clear predictor of Salary, but several predictors are linearly related to each other. Unsurprisingly, height and weight are positively correlated, and games played (GP), goals (G), and assists (A) all increase with time on ice (TOI).
At a glance, salary may depend on the overall draft pick (Ovrl): players picked later in the draft (a higher Ovrl number) tend to have lower salaries. Players with more time on ice also appear to earn higher salaries.
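The impressions from the pairs plot can be checked numerically with a correlation matrix. The sketch below is only illustrative: it retypes the six rows printed by `head()` above into a data frame named `first_six` (a name introduced here), so the correlations it prints describe those six players, not the full data set.

```r
# The six rows shown by head() above, retyped so the example runs on its own
first_six <- data.frame(
  Salary = c(925000, 2250000, 8000000, 3500000, 1750000, 1500000),
  Ovrl   = c(18, 15, 7, 3, 16, 156),
  GP     = c(1, 79, 65, 30, 82, 80),
  G      = c(0, 2, 19, 1, 7, 5),
  A      = c(0, 15, 26, 5, 12, 12),
  TOI    = c(429, 109992, 73983, 36603, 63592, 88462)
)

# Pairwise Pearson correlations; "pairwise.complete.obs" skips missing values
# pair by pair, which matters on the full data because some rows contain NAs
cor_mat <- cor(first_six, use = "pairwise.complete.obs")

# Correlation of every variable with Salary
round(cor_mat["Salary", ], 2)
```

On the full data the same call would be `cor(nhl_salaries, use = "pairwise.complete.obs")`.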
Model
salary_lm <- lm(Salary ~ Ht + Wt + Ovrl + Hand + GP + G + A + TOI, data=nhl_salaries)
summary(salary_lm)
##
## Call:
## lm(formula = Salary ~ Ht + Wt + Ovrl + Hand + GP + G + A + TOI,
## data = nhl_salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5572154 -841947 -187174 575703 8838197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.439e+06 3.076e+06 2.093 0.036834 *
## Ht -1.831e+05 5.274e+04 -3.472 0.000561 ***
## Wt 4.053e+04 7.111e+03 5.699 2.05e-08 ***
## Ovrl -1.138e+03 1.265e+03 -0.899 0.368960
## Hand -1.112e+05 1.530e+05 -0.727 0.467821
## GP -5.310e+04 7.205e+03 -7.370 7.07e-13 ***
## G 7.173e+04 1.361e+04 5.270 2.03e-07 ***
## A 3.314e+04 1.236e+04 2.681 0.007587 **
## TOI 5.767e+01 6.828e+00 8.446 3.26e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1666000 on 503 degrees of freedom
## (100 observations deleted due to missingness)
## Multiple R-squared: 0.5017, Adjusted R-squared: 0.4938
## F-statistic: 63.32 on 8 and 503 DF, p-value: < 2.2e-16
The \(R^2\) (0.50) and adjusted \(R^2\) (0.49) are not very high: the model explains only about half of the variance in salary, so it does not predict salary very well. The two values are close, which indicates that the number of predictors is not inflating \(R^2\) by much.
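To see how the two figures relate, the adjusted \(R^2\) can be recomputed by hand from the summary output. This is a small arithmetic check, not part of the original analysis; the sample size `n = 512` is inferred here from the 503 residual degrees of freedom plus the 9 estimated coefficients.

```r
r2 <- 0.5017   # Multiple R-squared from the summary above
p  <- 8        # number of predictors (excluding the intercept)
n  <- 503 + p + 1  # inferred sample size: residual df + estimated coefficients

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

With 512 observations and only 8 predictors the penalty factor \((n - 1)/(n - p - 1)\) is about 1.016, which is why the two values barely differ.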
Let’s remove the variables ‘Hand’ and ‘Ovrl’ since they have the highest p-values.
test_lm <- lm(Salary ~ Ht + Wt + GP + G + A + TOI, data=nhl_salaries)
summary(test_lm)
##
## Call:
## lm(formula = Salary ~ Ht + Wt + GP + G + A + TOI, data = nhl_salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5360698 -734875 -180454 535622 8967269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.662e+06 2.524e+06 1.847 0.065190 .
## Ht -1.515e+05 4.451e+04 -3.404 0.000707 ***
## Wt 3.697e+04 6.208e+03 5.956 4.40e-09 ***
## GP -5.098e+04 6.207e+03 -8.214 1.30e-15 ***
## G 7.117e+04 1.235e+04 5.761 1.33e-08 ***
## A 3.241e+04 1.113e+04 2.912 0.003719 **
## TOI 5.662e+01 6.025e+00 9.398 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1576000 on 605 degrees of freedom
## Multiple R-squared: 0.508, Adjusted R-squared: 0.5031
## F-statistic: 104.1 on 6 and 605 DF, p-value: < 2.2e-16
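The \(R^2\) and adjusted \(R^2\) increased slightly (0.5017 to 0.508 and 0.4938 to 0.5031), but the amount is negligible. The comparison is also not like for like: the first model dropped 100 observations with missing values (503 residual degrees of freedom), while this one is fitted on more rows (605 residual degrees of freedom). Either way, it is still not a good model for predicting the salary.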
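One way to compare a full and a reduced model fairly is to fit both on the same rows and run a nested-model F-test. The sketch below is an assumption-laden illustration on simulated data (`toy`, `x1` to `x3`, and `y` are invented names and values, not the NHL data); it is not what the analysis above did.

```r
set.seed(1)
n <- 200

# Simulated data: y depends on x1 only; x2 and x3 are pure noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

# Inject missing values into x3 so lm() would drop rows in the full model only
toy$x3[sample(n, 30)] <- NA

# Keep only rows that are complete for every variable in the larger model
complete <- toy[complete.cases(toy[, c("y", "x1", "x2", "x3")]), ]

full    <- lm(y ~ x1 + x2 + x3, data = complete)
reduced <- lm(y ~ x1 + x2,      data = complete)

# Both models now use the same observations
nobs(full) == nobs(reduced)

# F-test for whether the dropped predictor improves the fit
anova(reduced, full)
```

Applied to this report, the same idea would mean restricting `nhl_salaries` to rows that are complete in all nine columns before fitting both `salary_lm` and `test_lm`.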
Coefficients
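The median residual is well below zero (about -$180,000 in both models), which points to right-skewed residuals, so the coefficients should be read with caution. Nevertheless, holding the other variables fixed, each additional second of time on ice (TOI) is associated with about $57 more salary (coefficient 56.62 in the reduced model).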
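The size of the TOI effect is easier to judge with an uncertainty range around it. The sketch below rebuilds an approximate 95% confidence interval from the estimate, standard error, and residual degrees of freedom printed in the reduced-model summary; the variable names are introduced here for illustration.

```r
est <- 56.62   # TOI estimate from the reduced model summary
se  <- 6.025   # its standard error
df  <- 605     # residual degrees of freedom

# Two-sided 95% interval based on the t distribution
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
ci

# The same interval expressed per extra minute of ice time
ci * 60
```

With the fitted model available, `confint(test_lm, "TOI")` should give the same interval up to rounding of the printed values.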
Residuals
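Plotted in observation order, the residuals of the full model look fairly evenly scattered around zero, although the residual summary above (median below zero, maximum of about 8.8 million against a minimum of about -5.6 million) hints at a right skew.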
nhl_res <- salary_lm$residuals
plot(nhl_res)
abline(h=0)
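A plot against observation order can hide patterns, so a common complement is to plot residuals against fitted values. The example below is a sketch on simulated data (`x`, `y`, and `toy_fit` are invented for illustration) whose spread grows with the predictor, which is the kind of pattern this plot is meant to reveal.

```r
set.seed(7)

# Invented data whose noise grows with x (not the NHL data)
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200, sd = 0.5 + 0.3 * x)
toy_fit <- lm(y ~ x)

# Residuals against fitted values: a funnel shape signals non-constant variance
plot(fitted(toy_fit), resid(toy_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

For the model in this report the equivalent call would be `plot(fitted(salary_lm), resid(salary_lm))`.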
The Q-Q plot suggests that the residuals are not normally distributed, which is further evidence that the model does not fit the data very well. These variables alone are not good predictors of salary.
qqnorm(resid(salary_lm))
qqline(resid(salary_lm))
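A Q-Q plot is judged by eye, so it can help to back it up with a formal normality test. The sketch below uses the Shapiro-Wilk test on simulated, deliberately right-skewed data (`x`, `y`, and `skew_fit` are invented here); it is a swapped-in check, not something the analysis above ran.

```r
set.seed(42)

# Invented right-skewed response: exponentiating makes large values stretch out
x <- runif(300)
y <- exp(1 + 2 * x + rnorm(300, sd = 0.8))
skew_fit <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(skew_fit))
sw$p.value
```

On the model in this report the call would be `shapiro.test(resid(salary_lm))`; `shapiro.test()` accepts between 3 and 5000 values, which covers the 512 residuals here.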