Data cleanup

Clean the data before analysis. Remove duplicate columns (player first and last name) and unnecessary columns (Year, PlsMin.Road, OTP, Year.Pts.Rank), and turn character fields such as Team and the player name into factors. Then remove the player name and team, since we won’t use them for prediction.
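
A minimal sketch of these cleanup steps in base R. The data frame names `raw` and `df1` are assumptions, and `raw` holds only a handful of columns copied from the six rows printed in this section, not the full dataset.

```r
# Toy stand-in for the raw data: a few columns from the six rows shown in this section.
# `raw` is an assumed name; the real data frame has many more columns.
raw <- data.frame(
  Player        = c("Connor McDavid", "Sidney Crosby", "Patrick Kane",
                    "Nicklas Backstrom", "Nikita Kucherov", "Brad Marchand"),
  First.Name    = c("Connor", "Sidney", "Patrick", "Nicklas", "Nikita", "Brad"),
  Last.Name     = c("McDavid", "Crosby", "Kane", "Backstrom", "Kucherov", "Marchand"),
  Year          = 2016,
  Team          = c("EDM", "PIT", "CHI", "WSH", "TBL", "BOS"),
  Year.Pts.Rank = NA,
  GP            = c(82, 75, 82, 82, 74, 80),
  G             = c(30, 44, 34, 23, 40, 39),
  A             = c(70, 45, 55, 63, 45, 46),
  PTS           = c(100, 89, 89, 86, 85, 85),
  OTP           = 1.11,
  PlsMin.Road   = 1.11,
  stringsAsFactors = FALSE
)

# Turn character fields into factors.
raw$Player <- factor(raw$Player)
raw$Team   <- factor(raw$Team)

# Drop duplicate, unnecessary, and non-predictive columns.
drop_cols <- c("First.Name", "Last.Name",                      # duplicates of Player
               "Year", "PlsMin.Road", "OTP", "Year.Pts.Rank",  # unnecessary
               "Player", "Team")                               # not used for prediction
df1 <- raw[, !(names(raw) %in% drop_cols), drop = FALSE]

str(df1)
```

Selecting by name with `%in%` keeps the step safe to rerun even if a listed column has already been removed.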

Here’s what the first six rows of the data look like, with all of the original columns still present.

##              Player First.Name Last.Name Year            Player.Data Team
## 1    Connor McDavid     Connor   McDavid 2016    Connor McDavid 2016  EDM
## 2     Sidney Crosby     Sidney    Crosby 2016     Sidney Crosby 2016  PIT
## 3      Patrick Kane    Patrick      Kane 2016      Patrick Kane 2016  CHI
## 4 Nicklas Backstrom    Nicklas Backstrom 2016 Nicklas Backstrom 2016  WSH
## 5   Nikita Kucherov     Nikita  Kucherov 2016   Nikita Kucherov 2016  TBL
## 6     Brad Marchand       Brad  Marchand 2016     Brad Marchand 2016  BOS
##   Year.Pts.Rank GP  G  A PTS PlsMin PIM PPG PPP SHG SHP GWG  OTP   S    S.
## 1            NA 82 30 70 100     27  26   3  27   1   2   6 1.11 251 11.95
## 2            NA 75 44 45  89     17  24  14  25   0   0   5 1.11 255 17.25
## 3            NA 82 34 55  89     11  32   7  23   0   0   5 1.11 292 11.64
## 4            NA 82 23 63  86     17  38   8  35   0   0   5 1.11 162 14.19
## 5            NA 74 40 45  85     13  38  17  32   0   0   7 1.11 246 16.26
## 6            NA 80 39 46  85     18  81   9  24   3   5   8 1.11 226 17.25
##     Sft.G   FO.  TOI TOI.PP Hits BkS GvA Tka G.Road Pts.Road PlsMin.Road
## 1 24.3658 43.17 1732    248   34  29  54  76   1.11     1.11        1.11
## 2 24.6933 48.16 1490    269   80  27  70  39   1.11     1.11        1.11
## 3 23.2926 13.72 1754    279   28  15  42  49   1.11     1.11        1.11
## 4 22.4634 51.38 1497    251   45  33  54  61   1.11     1.11        1.11
## 5 23.4459  0.00 1438    248   30  20  64  54   1.11     1.11        1.11
## 6 24.2750 36.11 1554    215   51  35  84  69   1.11     1.11        1.11
##   Pos Age InjFreq LineMate1 LineMate2 Image Peer1 Peer2 Peer3 Salary
## 1   C  NA      NA        NA        NA    NA    NA    NA    NA     NA
## 2   C  NA      NA        NA        NA    NA    NA    NA    NA     NA
## 3  RW  NA      NA        NA        NA    NA    NA    NA    NA     NA
## 4   C  NA      NA        NA        NA    NA    NA    NA    NA     NA
## 5  RW  NA      NA        NA        NA    NA    NA    NA    NA     NA
## 6  LW  NA      NA        NA        NA    NA    NA    NA    NA     NA
##   Height Weight NHL.Experience Origin Jersey.Num Manual.Adjustment
## 1     NA     NA             NA     NA         NA                NA
## 2     NA     NA             NA     NA         NA                NA
## 3     NA     NA             NA     NA         NA                NA
## 4     NA     NA             NA     NA         NA                NA
## 5     NA     NA             NA     NA         NA                NA
## 6     NA     NA             NA     NA         NA                NA
##   Manual.Adjustment.GP Manual.Adjustment...Goals
## 1                   NA                        NA
## 2                   NA                        NA
## 3                   NA                        NA
## 4                   NA                        NA
## 5                   NA                        NA
## 6                   NA                        NA
##   Manual.Adjustment...PlsMin Manual.Adjustment...PPP
## 1                         NA                      NA
## 2                         NA                      NA
## 3                         NA                      NA
## 4                         NA                      NA
## 5                         NA                      NA
## 6                         NA                      NA
##   Manual.Adjustment...PPG Manual.Adjustment...PIM Manual.Adjustment...ShP
## 1                      NA                      NA                      NA
## 2                      NA                      NA                      NA
## 3                      NA                      NA                      NA
## 4                      NA                      NA                      NA
## 5                      NA                      NA                      NA
## 6                      NA                      NA                      NA
##   Manual.Adjustment...Shots Manual.Adjustment...BkS
## 1                        NA                      NA
## 2                        NA                      NA
## 3                        NA                      NA
## 4                        NA                      NA
## 5                        NA                      NA
## 6                        NA                      NA
##   Manual.Adjustment...Hits GameType
## 1                       NA        R
## 2                       NA        R
## 3                       NA        R
## 4                       NA        R
## 5                       NA        R
## 6                       NA        R

Linear Regression

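Try regression against nearly all the variables. This is unlikely to be the right model, but let’s poke around. Here we’re trying to predict Goals (G) using almost all the other variables. Most coefficients are effectively zero (on the order of 1e-15), while A and PTS come out at exactly -1 and 1: because PTS = G + A, the model just recovers G = PTS - A, which is why R-squared is 1 and the residuals sit at floating-point precision. So this fit is pretty useless for prediction.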

## 
## Call:
## lm(formula = G ~ GP + A + PTS + PlsMin + PIM + PPG + PPP + SHG + 
##     SHP + GWG + S + TOI + Hits + BkS + GvA + Tka - 1, data = df1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.066e-12 -7.090e-15 -3.500e-16  6.320e-15  1.446e-12 
## 
## Coefficients:
##          Estimate Std. Error    t value Pr(>|t|)    
## GP     -1.614e-15  2.701e-16 -5.973e+00 3.38e-09 ***
## A      -1.000e+00  1.160e-15 -8.618e+14  < 2e-16 ***
## PTS     1.000e+00  9.344e-16  1.070e+15  < 2e-16 ***
## PlsMin -9.336e-17  2.915e-16 -3.200e-01  0.74881    
## PIM    -1.604e-16  1.472e-16 -1.090e+00  0.27595    
## PPG     8.639e-15  2.181e-15  3.960e+00 8.09e-05 ***
## PPP    -1.967e-15  1.215e-15 -1.618e+00  0.10600    
## SHG    -7.599e-16  7.153e-15 -1.060e-01  0.91541    
## SHP     8.543e-17  4.856e-15  1.800e-02  0.98597    
## GWG    -4.873e-15  2.596e-15 -1.877e+00  0.06081 .  
## S      -1.781e-16  1.133e-16 -1.571e+00  0.11650    
## TOI    -1.901e-17  2.403e-17 -7.910e-01  0.42921    
## Hits    3.134e-17  7.090e-17  4.420e-01  0.65860    
## BkS     1.097e-16  1.451e-16  7.560e-01  0.44972    
## GvA     7.594e-16  2.388e-16  3.180e+00  0.00152 ** 
## Tka    -5.290e-16  2.889e-16 -1.831e+00  0.06746 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.03e-14 on 872 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.471e+30 on 16 and 872 DF,  p-value: < 2.2e-16
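
The perfect fit in this output (R-squared of 1, residuals around 1e-13, coefficients of -1 on A and 1 on PTS) is what you would expect when points are goals plus assists. The sketch below reproduces the effect on simulated data; `sim`, `leaky`, and `honest` are hypothetical names, and the numbers are random counts, not the real player stats.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (NOT the real player stats): random counts.
sim <- data.frame(G = rpois(n, 8), A = rpois(n, 12), S = rpois(n, 90))
sim$PTS <- sim$G + sim$A   # points are goals plus assists

# Including both A and PTS lets the model reproduce G exactly (G = PTS - A).
leaky <- lm(G ~ A + PTS + S - 1, data = sim)
print(round(coef(leaky), 6))
print(max(abs(residuals(leaky))))

# Dropping PTS removes the identity; the residual error is no longer near zero.
honest <- lm(G ~ A + S - 1, data = sim)
print(sigma(honest))
```

Removing PTS (or A) from the predictors is one way to get a model whose coefficients say something about Goals beyond this identity.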

Pairs plot

Plot each variable against every other variable. Pay attention to the G (Goals) row and column.

## Loading required package: ggplot2
## Loading required package: GGally
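
A pairs plot like this could be drawn roughly as follows. The loading messages suggest GGally was used, so the sketch calls `GGally::ggpairs()` when that package is installed and falls back to base R `pairs()` otherwise. The `toy` data frame is simulated for illustration and only borrows a few column names from the real data.

```r
set.seed(2)
n <- 100

# Simulated stand-in columns (NOT the real data).
toy <- data.frame(GP = sample(20:82, n, replace = TRUE))
toy$S <- rpois(n, toy$GP * 2)      # shots scale with games played
toy$G <- rbinom(n, toy$S, 0.1)     # goals are a fraction of shots
toy$A <- rpois(n, toy$GP * 0.4)    # assists scale with games played

if (requireNamespace("GGally", quietly = TRUE)) {
  # ggpairs() draws scatterplots, densities, and correlations in one matrix.
  print(GGally::ggpairs(toy))
} else {
  # Base R fallback: scatterplot matrix only.
  pairs(toy)
}
```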

Correlation plot

The correlation plot shows which variables are highly correlated with Goals: PTS, GWG, and S all have correlations above 0.8.
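
The same screening can be done numerically with `cor()`. This is a sketch on simulated data, not the real dataset: `toy`, `cors`, and `high` are hypothetical names, and the 0.8 cutoff is the one used in the text.

```r
set.seed(3)
n <- 300

# Simulated stand-in data (NOT the real stats).
toy <- data.frame(S    = round(runif(n, 10, 300)),  # shots vary widely between players
                  Hits = rpois(n, 60))              # unrelated to goals in this toy setup
toy$G <- rbinom(n, toy$S, 0.1)                      # goals depend on shots

# Correlation of every other column with G.
cors <- cor(toy)[, "G"]
cors <- cors[names(cors) != "G"]
print(round(cors, 2))

# Variables whose correlation with G exceeds the 0.8 cutoff.
high <- names(cors)[abs(cors) > 0.8]
print(high)
```

On real data with missing values, `cor()` would also need a `use` argument such as `"complete.obs"`.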

Try regression using each variable against Goals.

Leave out the variables that are highly correlated with Goals (PTS, GWG, S).

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -2.0223494 0.427365918 -4.732126  2.585049e-06
## GP           0.1919828 0.007377308 26.023420 2.352566e-111
##             Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 1.098642 0.26677807  4.118187  4.175881e-05
## A           0.509690 0.01484784 34.327557 6.466966e-165
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 7.6123773 0.28456775 26.750668 4.992791e-116
## PlsMin      0.1567163 0.02881663  5.438398  6.956258e-08
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 4.4866550 0.39068813 11.48398 1.485393e-28
## PIM         0.1237308 0.01135215 10.89933 4.796307e-26
##             Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 3.585029 0.19607159 18.28428  1.300262e-63
## PPG         2.505690 0.06139443 40.81299 1.018810e-205
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 6.571588  0.2910740 22.577038 1.633482e-89
## SHG         4.719725  0.4731951  9.974164 2.806626e-22
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 6.263886  0.3017579 20.75799 2.641229e-78
## SHP         3.119315  0.3084293 10.11355 7.889276e-23
##                 Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept) -0.139763604 0.3858285255 -0.3622428  7.172570e-01
## TOI          0.009355408 0.0003827524 24.4424565 2.925649e-101
##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 5.02549808 0.413487972 12.153916 1.516277e-31
## Hits        0.04191099 0.005074621  8.258939 5.317529e-16
##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 6.55785852 0.394780971 16.611385 3.765335e-54
## BkS         0.02516565 0.006882134  3.656664 2.705775e-04
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 2.4641191 0.35925372  6.858994 1.300621e-11
## GvA         0.2169627 0.01129446 19.209653 5.070594e-69
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 0.3352832 0.27444233  1.221689  2.221501e-01
## Tka         0.3862455 0.01083881 35.635416 2.883262e-173
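
Coefficient tables like the ones above can be produced by fitting one single-variable model per predictor in a loop. A minimal sketch, assuming simulated data in place of `df1`; the names `toy`, `predictors`, and `fits` are hypothetical.

```r
set.seed(4)
n <- 200

# Simulated stand-in for df1 (NOT the real data).
toy <- data.frame(GP = sample(20:82, n, replace = TRUE))
toy$TOI <- toy$GP * 15 + rnorm(n, sd = 50)   # time on ice grows with games played
toy$G   <- rpois(n, toy$GP * 0.15)           # goals grow with games played

# Every column except the response; on the real data PTS, GWG, and S would also be excluded.
predictors <- setdiff(names(toy), "G")

# Fit G ~ <predictor> once per predictor.
fits <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "G"), data = toy)
})
names(fits) <- predictors

# Print the coefficient table (estimate, std. error, t value, p-value) for each model.
for (v in predictors) print(coef(summary(fits[[v]])))
```

Unlike the all-variables model earlier, these fits keep the intercept, which is why each table has an `(Intercept)` row.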