Clean the data before analysis. Remove the duplicate name columns (player first and last name) and the unnecessary columns (Year, PlsMin.Road, OTP, Year.Pts.Rank), and turn character fields such as Team and player name into factors. Then remove player name and team, since we won't use them for prediction.
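As a rough sketch of these cleaning steps (not necessarily the code used for this analysis), the snippet below drops the listed columns and converts the remaining character columns to factors. The data frame `raw`, its made-up values, and the helper `clean_players()` are assumptions for illustration; only the column names are taken from the data shown in this section.

```r
# Toy stand-in for the raw player table (made-up values, a subset of the columns).
raw <- data.frame(
  Player        = c("A One", "B Two", "C Three"),
  First.Name    = c("A", "B", "C"),
  Last.Name     = c("One", "Two", "Three"),
  Year          = c(2016, 2016, 2016),
  Team          = c("EDM", "PIT", "CHI"),
  Year.Pts.Rank = c(NA, NA, NA),
  GP            = c(80, 70, 60),
  G             = c(20, 15, 10),
  OTP           = c(1.11, 1.11, 1.11),
  PlsMin.Road   = c(1.11, 1.11, 1.11),
  Pos           = c("C", "C", "RW"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop unused columns, then convert characters to factors.
clean_players <- function(df) {
  drop_cols <- c("First.Name", "Last.Name",              # duplicate name columns
                 "Year", "PlsMin.Road", "OTP",           # unnecessary columns
                 "Year.Pts.Rank",
                 "Player", "Team")                       # identifiers, not predictors
  df <- df[, !(names(df) %in% drop_cols), drop = FALSE]

  # Convert every remaining character column (here: Pos) to a factor.
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  df
}

cleaned <- clean_players(raw)
str(cleaned)
```

Dropping by name with `%in%` means the helper does not fail if one of the listed columns is already missing from the input.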
Here’s what the data looks like now.
## Player First.Name Last.Name Year Player.Data Team
## 1 Connor McDavid Connor McDavid 2016 Connor McDavid 2016 EDM
## 2 Sidney Crosby Sidney Crosby 2016 Sidney Crosby 2016 PIT
## 3 Patrick Kane Patrick Kane 2016 Patrick Kane 2016 CHI
## 4 Nicklas Backstrom Nicklas Backstrom 2016 Nicklas Backstrom 2016 WSH
## 5 Nikita Kucherov Nikita Kucherov 2016 Nikita Kucherov 2016 TBL
## 6 Brad Marchand Brad Marchand 2016 Brad Marchand 2016 BOS
## Year.Pts.Rank GP G A PTS PlsMin PIM PPG PPP SHG SHP GWG OTP S S.
## 1 NA 82 30 70 100 27 26 3 27 1 2 6 1.11 251 11.95
## 2 NA 75 44 45 89 17 24 14 25 0 0 5 1.11 255 17.25
## 3 NA 82 34 55 89 11 32 7 23 0 0 5 1.11 292 11.64
## 4 NA 82 23 63 86 17 38 8 35 0 0 5 1.11 162 14.19
## 5 NA 74 40 45 85 13 38 17 32 0 0 7 1.11 246 16.26
## 6 NA 80 39 46 85 18 81 9 24 3 5 8 1.11 226 17.25
## Sft.G FO. TOI TOI.PP Hits BkS GvA Tka G.Road Pts.Road PlsMin.Road
## 1 24.3658 43.17 1732 248 34 29 54 76 1.11 1.11 1.11
## 2 24.6933 48.16 1490 269 80 27 70 39 1.11 1.11 1.11
## 3 23.2926 13.72 1754 279 28 15 42 49 1.11 1.11 1.11
## 4 22.4634 51.38 1497 251 45 33 54 61 1.11 1.11 1.11
## 5 23.4459 0.00 1438 248 30 20 64 54 1.11 1.11 1.11
## 6 24.2750 36.11 1554 215 51 35 84 69 1.11 1.11 1.11
## Pos Age InjFreq LineMate1 LineMate2 Image Peer1 Peer2 Peer3 Salary
## 1 C NA NA NA NA NA NA NA NA NA
## 2 C NA NA NA NA NA NA NA NA NA
## 3 RW NA NA NA NA NA NA NA NA NA
## 4 C NA NA NA NA NA NA NA NA NA
## 5 RW NA NA NA NA NA NA NA NA NA
## 6 LW NA NA NA NA NA NA NA NA NA
## Height Weight NHL.Experience Origin Jersey.Num Manual.Adjustment
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## Manual.Adjustment.GP Manual.Adjustment...Goals
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Manual.Adjustment...PlsMin Manual.Adjustment...PPP
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Manual.Adjustment...PPG Manual.Adjustment...PIM Manual.Adjustment...ShP
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Manual.Adjustment...Shots Manual.Adjustment...BkS
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Manual.Adjustment...Hits GameType
## 1 NA R
## 2 NA R
## 3 NA R
## 4 NA R
## 5 NA R
## 6 NA R
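Try a regression against nearly all the variables. This is unlikely to be the correct model, but let’s poke around… Here we’re trying to predict Goals (G) using almost all the other variables. The fit is perfect (R-squared of 1, residuals on the order of 1e-12 or smaller), but only because points are goals plus assists: the model simply recovers G = PTS - A. The coefficients on A and PTS are -1 and 1, and every other coefficient is effectively zero (below 1e-14 in magnitude), so this model is pretty useless for prediction.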
##
## Call:
## lm(formula = G ~ GP + A + PTS + PlsMin + PIM + PPG + PPP + SHG +
## SHP + GWG + S + TOI + Hits + BkS + GvA + Tka - 1, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.066e-12 -7.090e-15 -3.500e-16 6.320e-15 1.446e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## GP -1.614e-15 2.701e-16 -5.973e+00 3.38e-09 ***
## A -1.000e+00 1.160e-15 -8.618e+14 < 2e-16 ***
## PTS 1.000e+00 9.344e-16 1.070e+15 < 2e-16 ***
## PlsMin -9.336e-17 2.915e-16 -3.200e-01 0.74881
## PIM -1.604e-16 1.472e-16 -1.090e+00 0.27595
## PPG 8.639e-15 2.181e-15 3.960e+00 8.09e-05 ***
## PPP -1.967e-15 1.215e-15 -1.618e+00 0.10600
## SHG -7.599e-16 7.153e-15 -1.060e-01 0.91541
## SHP 8.543e-17 4.856e-15 1.800e-02 0.98597
## GWG -4.873e-15 2.596e-15 -1.877e+00 0.06081 .
## S -1.781e-16 1.133e-16 -1.571e+00 0.11650
## TOI -1.901e-17 2.403e-17 -7.910e-01 0.42921
## Hits 3.134e-17 7.090e-17 4.420e-01 0.65860
## BkS 1.097e-16 1.451e-16 7.560e-01 0.44972
## GvA 7.594e-16 2.388e-16 3.180e+00 0.00152 **
## Tka -5.290e-16 2.889e-16 -1.831e+00 0.06746 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.03e-14 on 872 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.471e+30 on 16 and 872 DF, p-value: < 2.2e-16
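To see why a fit like this comes out perfect, here is a minimal sketch on simulated data (the data frame `toy` and its values are invented, not the player data). Because PTS is built as G + A, a model that contains both A and PTS can reproduce G exactly.

```r
set.seed(1)
n <- 50

# Simulated counts; only the relationship PTS = G + A matters here.
toy <- data.frame(
  G  = rpois(n, 10),
  A  = rpois(n, 15),
  GP = sample(40:82, n, replace = TRUE)
)
toy$PTS <- toy$G + toy$A  # points are goals plus assists

# Same style of formula as above: no intercept, A and PTS both included.
fit <- lm(G ~ GP + A + PTS - 1, data = toy)

round(coef(fit), 6)     # A is -1, PTS is 1, GP is numerically zero
max(abs(resid(fit)))    # residuals are at floating-point noise level
```

Any model meant to predict goals therefore has to leave out PTS (or A), otherwise it is only restating the definition of points.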
Plot each variable against every other variable. Pay attention to the G (Goals) row and column.
## Loading required package: ggplot2
## Loading required package: GGally
Looking at the correlations, we can see which variables are highly correlated with Goals: PTS, GWG and S all have a correlation above 0.8.
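One way to pull out the highly correlated variables programmatically is sketched below with base R's `cor()`. The data frame `toy` is simulated so that shots drive goals, and the 0.8 cutoff mirrors the threshold mentioned above; none of the numbers come from the real data.

```r
set.seed(2)
n <- 200

# Simulated data: goals depend on shots, hits are unrelated noise.
toy <- data.frame(S = round(runif(n, 20, 300)))
toy$G    <- rbinom(n, size = toy$S, prob = 0.1)
toy$Hits <- rpois(n, 60)

# Correlation of every numeric column with G.
num  <- toy[vapply(toy, is.numeric, logical(1))]
cors <- cor(num, use = "pairwise.complete.obs")[, "G"]
cors <- cors[names(cors) != "G"]   # drop the trivial G-vs-G entry

# Variables whose absolute correlation with G exceeds the cutoff.
high <- names(cors)[abs(cors) > 0.8]

sort(cors, decreasing = TRUE)
high
```

Using `pairwise.complete.obs` keeps columns with a few missing values from turning the whole correlation vector into `NA`.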
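Let’s leave out the highly correlated variables (PTS, GWG, S) and regress Goals on each of the remaining variables one at a time. Each table below shows the coefficients of one single-variable model.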
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0223494 0.427365918 -4.732126 2.585049e-06
## GP 0.1919828 0.007377308 26.023420 2.352566e-111
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.098642 0.26677807 4.118187 4.175881e-05
## A 0.509690 0.01484784 34.327557 6.466966e-165
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6123773 0.28456775 26.750668 4.992791e-116
## PlsMin 0.1567163 0.02881663 5.438398 6.956258e-08
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4866550 0.39068813 11.48398 1.485393e-28
## PIM 0.1237308 0.01135215 10.89933 4.796307e-26
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.585029 0.19607159 18.28428 1.300262e-63
## PPG 2.505690 0.06139443 40.81299 1.018810e-205
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.571588 0.2910740 22.577038 1.633482e-89
## SHG 4.719725 0.4731951 9.974164 2.806626e-22
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.263886 0.3017579 20.75799 2.641229e-78
## SHP 3.119315 0.3084293 10.11355 7.889276e-23
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.139763604 0.3858285255 -0.3622428 7.172570e-01
## TOI 0.009355408 0.0003827524 24.4424565 2.925649e-101
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.02549808 0.413487972 12.153916 1.516277e-31
## Hits 0.04191099 0.005074621 8.258939 5.317529e-16
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.55785852 0.394780971 16.611385 3.765335e-54
## BkS 0.02516565 0.006882134 3.656664 2.705775e-04
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4641191 0.35925372 6.858994 1.300621e-11
## GvA 0.2169627 0.01129446 19.209653 5.070594e-69
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3352832 0.27444233 1.221689 2.221501e-01
## Tka 0.3862455 0.01083881 35.635416 2.883262e-173
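Coefficient tables like the ones above can be produced in a loop rather than one model at a time. The sketch below assumes a data frame with a `G` column and a vector of predictor names; `df_toy` and its values are simulated for illustration, and this is not necessarily how the tables above were generated.

```r
set.seed(3)
n <- 100

# Simulated stand-in for the cleaned player data.
df_toy <- data.frame(
  GP  = sample(20:82, n, replace = TRUE),
  TOI = round(runif(n, 200, 1800))
)
df_toy$G <- rpois(n, lambda = 0.15 * df_toy$GP)

predictors <- c("GP", "TOI")

# Fit G ~ <predictor> for each predictor and keep only the coefficient table.
coef_tables <- lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "G"), data = df_toy)
  coef(summary(fit))
})
names(coef_tables) <- predictors

coef_tables
```

Each list element has the same four columns as the output above (Estimate, Std. Error, t value, Pr(>|t|)), which makes it easy to compare slopes and p-values across predictors.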