The 2018-2019 Hockey Season

This was an exciting hockey season for me. My team home town team, St. Louis Blues, won Lord Stanley’s Cup. I wanted to examine the season and use the R programming language to do so. I gathered my data from hockeyreference.com going to the Skaters tab and selecting basic stats. I then took that data, created a csv file and uploaded that file to my github account. I reference that data set here via direct link to read the csv into the programming language as a data frame. Below is a sample of my data:

df[1:10,]
##    Rk                       Player Age  Tm Pos GP  G  A PTS X... PIM   PS EV PP
## 1   1 Justin Abdelkader\\abdelju01  31 DET  LW 71  6 13  19  -14  38  0.1  4  1
## 2   2      Pontus Aberg\\abergpo01  25 TOT  LW 59 12 13  25  -14  20  2.0  9  3
## 3   2      Pontus Aberg\\abergpo01  25 ANA  LW 37 11  8  19  -10  14  1.8  8  3
## 4   2      Pontus Aberg\\abergpo01  25 MIN  LW 22  1  5   6   -4   6  0.2  1  0
## 5   3    Vitaly Abramov\\abramvi01  20 OTT  RW  1  0  0   0   -3   0 -0.1  0  0
## 6   4      Noel Acciari\\acciano01  27 BOS   C 72  6  8  14   -3  47  0.4  6  0
## 7   5    Kenny Agostino\\agostke01  26 TOT  LW 63  6 18  24   -3  34  1.6  6  0
## 8   5    Kenny Agostino\\agostke01  26 MTL  LW 36  2  9  11   -1  26  0.5  2  0
## 9   5    Kenny Agostino\\agostke01  26 NJD  LW 27  4  9  13   -2   8  1.1  4  0
## 10  6   Andrew Agozzino\\agozzan01  28 COL  LW 11  1  1   2    2   0  0.2  1  0
##    SH GW EV.1 PP.1 SH.1   S   S.  TOI  ATOI BLK HIT FOW FOL  FO.
## 1   1  0   12    1    0  95  6.3 1093 15:24  34 185  52  51 50.5
## 2   0  1    9    4    0 101 11.9  861 14:36  11  45   2  17 10.5
## 3   0  1    7    1    0  74 14.9  578 15:37   7  31   2   9 18.2
## 4   0  0    2    3    0  27  3.7  283 12:52   4  14   0   8  0.0
## 5   0  0    0    0    0   0   NA   14 13:52   1   0   0   0   NA
## 6   0  2    8    0    0  99  6.1  935 12:59  36 221 200 203 49.6
## 7   0  0   13    5    0  79  7.6  814 12:55  14 143  15  27 35.7
## 8   0  0    6    3    0  38  5.3  403 11:11   4  80   8  14 36.4
## 9   0  0    7    2    0  41  9.8  411 15:13  10  63   7  13 35.0
## 10  0  0    1    0    0  11  9.1   86  7:49   3  14  16  22 42.1

Find Your Data: Project Part 1

Data Cleaning

There is a couple of interesting things we should point out right away. We notice that Pontus Aberg is listed multiple times. This is because he was traded during the season. The stats from both teams have been recorded as well as his totals (signified by TOT). We will need to be careful and deal with this before we proceed. Let’s first see how many payers there were in the season

players <- unique(df["Player"])

length(players$Player)
## [1] 906

Looks like there were 906 unique players. Next let’s see how many were traded throughout the season.

trades <- df[which(df$Tm=='TOT'),]
length(trades$Player)
## [1] 81

Looks like 81 players played with multiple teams during the season. I will just keep the data for the season and not for each of their individual teams that they played for.

tradedplayers <- trades$Player
nottraded <- df[-which(df$Player %in% tradedplayers),]
df <- rbind(trades,nottraded)
df[1:10,1:10]
##     Rk                      Player Age  Tm Pos GP  G  A PTS X...
## 2    2     Pontus Aberg\\abergpo01  25 TOT  LW 59 12 13  25  -14
## 7    5   Kenny Agostino\\agostke01  26 TOT  LW 63  6 18  24   -3
## 23  19 Darren Archibald\\archida02  28 TOT  RW 12  1  1   2   -3
## 51  45  Nathan Beaulieu\\beaulna01  26 TOT   D 48  3  9  12    6
## 67  59  Anthony Bitetto\\bitetan01  28 TOT   D 36  0  3   3   -7
## 72  62    Nick Bjugstad\\bjugsni01  26 TOT   C 64 14 12  26    6
## 77  65  Joseph Blandisi\\blandjo01  24 TOT   C  9  0  0   0   -1
## 94  80    Madison Bowey\\boweyma01  23 TOT   D 50  2  8  10   -1
## 99  83      Brian Boyle\\boylebr01  34 TOT   C 73 18  6  24  -14
## 104 86  Derick Brassard\\brassde01  31 TOT   C 70 14  9  23  -19

Data Description

So now my data set just includes individual players with no duplicates. Let’s see if we can analize the varaibles.

summary(df)
##        Rk                              Player          Age              Tm     
##  Min.   :  1.0   A.J. Greer\\greeraj01     :  1   Min.   :18.00   TOT    : 81  
##  1st Qu.:227.2   Aaron Ekblad\\ekblaaa01   :  1   1st Qu.:23.00   ANA    : 33  
##  Median :453.5   Adam Clendening\\clendad01:  1   Median :25.00   OTT    : 33  
##  Mean   :453.5   Adam Cracknell\\crackad01 :  1   Mean   :26.01   DET    : 32  
##  3rd Qu.:679.8   Adam Erne\\ernead01       :  1   3rd Qu.:28.75   BOS    : 31  
##  Max.   :906.0   Adam Gaudette\\gaudead01  :  1   Max.   :42.00   NJD    : 31  
##                  (Other)                   :900                   (Other):665  
##  Pos            GP             G                A              PTS        
##  C :307   Min.   : 1.0   Min.   : 0.000   Min.   : 0.00   Min.   :  0.00  
##  D :326   1st Qu.:23.0   1st Qu.: 1.000   1st Qu.: 2.00   1st Qu.:  4.00  
##  LW:147   Median :60.0   Median : 5.000   Median :10.00   Median : 16.00  
##  RW:124   Mean   :50.5   Mean   : 8.363   Mean   :14.05   Mean   : 22.41  
##  W :  2   3rd Qu.:78.0   3rd Qu.:13.000   3rd Qu.:21.00   3rd Qu.: 33.75  
##           Max.   :84.0   Max.   :51.000   Max.   :87.00   Max.   :128.00  
##                                                                           
##       X...               PIM               PS               EV        
##  Min.   :-41.0000   Min.   :  0.00   Min.   :-1.100   Min.   : 0.000  
##  1st Qu.: -6.0000   1st Qu.:  6.00   1st Qu.: 0.200   1st Qu.: 1.000  
##  Median : -1.0000   Median : 18.00   Median : 1.700   Median : 4.000  
##  Mean   : -0.5099   Mean   : 22.52   Mean   : 2.557   Mean   : 6.488  
##  3rd Qu.:  4.0000   3rd Qu.: 33.00   3rd Qu.: 4.100   3rd Qu.:10.000  
##  Max.   : 39.0000   Max.   :153.00   Max.   :14.600   Max.   :37.000  
##                                                                       
##        PP               SH               GW              EV.1     
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 2.0  
##  Median : 0.000   Median :0.0000   Median : 1.000   Median : 9.0  
##  Mean   : 1.618   Mean   :0.2572   Mean   : 1.307   Mean   :10.7  
##  3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.: 2.000   3rd Qu.:16.0  
##  Max.   :20.000   Max.   :6.0000   Max.   :10.000   Max.   :54.0  
##                                                                   
##       PP.1             SH.1              S                S.         
##  Min.   : 0.000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 22.25   1st Qu.:  3.300  
##  Median : 0.000   Median :0.0000   Median : 79.00   Median :  7.100  
##  Mean   : 3.074   Mean   :0.2704   Mean   : 88.23   Mean   :  8.024  
##  3rd Qu.: 4.000   3rd Qu.:0.0000   3rd Qu.:133.75   3rd Qu.: 11.300  
##  Max.   :33.000   Max.   :4.0000   Max.   :365.00   Max.   :100.000  
##                                                     NA's   :18       
##       TOI              ATOI          BLK              HIT        
##  Min.   :   2.0   15:40  :  6   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 263.5   14:01  :  5   1st Qu.: 11.00   1st Qu.: 18.00  
##  Median : 879.5   11:22  :  4   Median : 29.00   Median : 50.00  
##  Mean   : 833.0   12:41  :  4   Mean   : 40.03   Mean   : 62.73  
##  3rd Qu.:1298.0   12:46  :  4   3rd Qu.: 53.00   3rd Qu.: 92.00  
##  Max.   :2189.0   13:03  :  4   Max.   :208.00   Max.   :305.00  
##                   (Other):879                                    
##       FOW               FOL              FO.        
##  Min.   :   0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.: 34.20  
##  Median :   2.00   Median :  3.00   Median : 44.90  
##  Mean   :  82.79   Mean   : 82.79   Mean   : 42.23  
##  3rd Qu.:  38.50   3rd Qu.: 50.75   3rd Qu.: 51.50  
##  Max.   :1086.00   Max.   :935.00   Max.   :100.00  
##                                     NA's   :333

So this output is very intense! It went through every variable (and the case descriptors) and did some basic analysis of each. I am interested in the plus/minus discrete quantitative variable but it is not named properly so let’s fix that. I’ll also grab the data for just my team.

names(df)[names(df)=='X...'] <- 'PM'
blues <- df[which(df$Tm == 'STL'),]

For your paper you should carefully describe any variable you think you might use. For me that would be PM, this is the plus/minus variables. A player gets a plus one if they are on the ice for a goal and a minus one if they are on the ice for a goal against. You may not score the goal but it shows that you were part of the play that scored or was scored on. Pos is position of the player, these include Center, Right and Left Wing, and Defense. Goalies were not included in this dataset. TM stands for Team that the player was on. There are more variables here (clearly) but rather than describe them all I will move on with the project.

Categorical Variables:Project Part 2

I want to examine the categorical variables for my project. Let’s recreate one of the frequency tables above and see if we can get relative frequency in it.

options(digits=2)
freq <- rbind(table(df$Pos),table(df$Pos)/length(df$Pos))
row.names(freq)<-c('Frequency','Relative Frequency')
freq
##                         C      D     LW     RW      W
## Frequency          307.00 326.00 147.00 124.00 2.0000
## Relative Frequency   0.34   0.36   0.16   0.14 0.0022

Cool! We see about 34% of the leagues skaters are centers and there are sigificantly more Left Wingers than right. I left my relative frequency in decimal form. Feel free to change it to percentages in your project.

I am going to try to make the frequency table again using some other functions in R.

percent <- function(x, digits = 2, format = "f", ...) {      # Create user-defined function
  paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
} 

df %>%
  group_by(Pos) %>%
  summarize(Frequency = length(Pos),RelativeFrequency = percent(length(Pos)/900))
## # A tibble: 5 x 3
##   Pos   Frequency RelativeFrequency
##   <fct>     <int> <chr>            
## 1 C           307 34.11%           
## 2 D           326 36.22%           
## 3 LW          147 16.33%           
## 4 RW          124 13.78%           
## 5 W             2 0.22%

Next let’s look at a two way table.

table(df$Pos,df$Tm)
##     
##      ANA ARI BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL NJD NSH
##   C   11   9  11  11   8   8  10   9   8   8   9   8  10   9   6   8   9  13
##   D   10  10  12  10   9   9   9   9  10  11  13  10  11   9   9  10  10   9
##   LW   5   2   5   5   5   6   7   4   8   2   7   2   3   4   4   6   4   2
##   RW   7   5   3   3   3   5   2   2   3   7   3   7   4   5   3   2   7   2
##   W    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
##     
##      NYI NYR OTT PHI PIT SJS STL TBL TOR TOT VAN VEG WPG WSH
##   C   10  10   8   9   8   9  11   8   8  32   9   7   5   8
##   D    8  10  14   9   8   9  10   8   9  24  11   8  10   8
##   LW   5   4   7   4   1   1   5   4   2  16   5   5   4   3
##   RW   3   1   4   3   3   5   3   3   3   9   3   4   3   4
##   W    0   0   0   0   0   0   0   0   0   0   0   0   1   0

Not the most interesting table I have ever made but there are a few interesting aspects like Pittsburg Penguins and San Jose Sharks only had 1 left winger. I have no idea why that might be the case, perhaps they are not “natural” left wingers but still play on that side and are therefore not counted.

Quanitative Variables: Project Part 3

I am going to examine the plus/minus of the players, I think it is one of the most acurate statistics of a player’s overall effectiveness on the ice. Below is the 5 number summary, the mean and standard deviation.

summary(df$PM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     -41      -6      -1      -1       4      39
sd(df$PM)
## [1] 10

Now let’s generate some graphics. A histogram will show if the data is bell shaped and a box plot will help me look for outliers

hist(df$PM, xlab = 'Plus/Minus', main = 'Histogram of Plus/Minus')

boxplot(df$PM, horizontal = TRUE, xlab = 'Plus/Minus', main = 'Box Plot of Plus/Minus')

Looks rather bell shaped but there are many outliers! I want to see if I am truely normal so I will also examine the QQ Plot. You will not need to do this in excel, it is an advanced technique that you might cover in a second semester of statistics.

qqnorm(df$PM)

The data almost lies on the line so I will go with approximately normal.

I am also curious what the St. Louis Blues data might look like.

hist(blues$PM, xlab ='Blues Plus/Minus', main = 'Histogram of Blues Plus/Minus')

For the side-by-side chart, I want to compare team plus/minus. I will do this with box plots

boxplot(df$PM ~ df$Tm, ylab ='Plus/Minus', xlab ="Teams")

The teams have been put in alphabetical order so it is hard to tell if there is a pattern but the number of outliers here is greatly reduced.

Linear Regression

I like linear regression, so I will do it here even if it wasn’t required anywhere for the project.

plot(df$PM,df$G,xlab = 'Plus/Minus', ylab = 'Goals')
abline(lm(df$G~df$PM))

This plot is really boring, I want to spice it up a bit. Let’s see if I can get the teams in there too.

ggplot(df,aes(x = PM, y = G, color=Tm))+
  geom_point()

So now each team is coloured. For giggles, let’s see what happens if we do a linear regression for each team.

ggplot(df,aes(x = PM, y = G, color=Tm))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Weirdly enough on some teams, scoring more goals will mean you have a worse plus/minus. This seems counter intuitive but it might be a result of cutting the data up too much. Let’s also use this same package to make the side=by-side charts with colours.

ggplot(df,aes(x = G,y=PM, color=Tm))+
  geom_boxplot() + coord_flip()

I am not certain this table says much but it sure is pretty…

Hypothesis Testing: Project Part 4

I would like to check and see if the Blues had a higher plus/minus than the average. So I am would like to establish the hypothesis test, so I state my null and alternative hypothesis as

\[ H_0:\quad \mu_{Blues}=-1\\ H_a:\quad \mu_{Blues}\neq-1. \] I have picked \(-1\) here as that was the mean and median for the data set.

I would also like to do a hypothesis test on a proportion. Let’s compare the proportion of right wingers and left wingers and see if they are different. In statistical language, my null hypothesis is that these proportions are equal and my alternative is that they are different.

\[ H_0:\quad p_{rw}=p_{lw}\\ H_a:\quad p_{rw}\neq p_{lw} \] This is actually a fairly tricky hypothesis. I might have made life easier for myself had I compared \(p_{rw}\) to a value like 16% because that was what the proportion of left wingers was.

Bootstrapping: Project Part 5

Let’s go ahead and create a bootstrap here for the Blues plus/minus

samp_mean <- function(x, i) {
  mean(x[i])
}
b1 <- boot(blues$PM, samp_mean, 10000)
plot(b1)

So we now that the mean was \(\overline{x} =2.5\) and \(SE=1.3\) so we will build a 95% confidence interval

c(2.5-2*1.3,2.5+2*1.3)
## [1] -0.1  5.1

We note that the Blues confidence interval did not contain \(-1\) so we are able to reject the null hypothesis, there for we have evidence to suggest that the Blues were better than the average team (at least in plus/minus).

Let’s repeat this using a built in function.

boot.ci(b1,type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b1, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (-0.028,  5.078 )  
## Calculations and Intervals on Original Scale

Pretty similar! I also prefer to use the percentile way. Instead of using normality assumption, it counts the actual values computed by the bootstrap and uses 2.5% below and 2.5% above.

boot.ci(b1,type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b1, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.1,  5.2 )  
## Calculations and Intervals on Original Scale
myprop<-function(x,i){
sum(x[i]==TRUE)/length(x)
}

boot(df$Pos=="LW",myprop,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df$Pos == "LW", statistic = myprop, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     0.16 -0.00027       0.012

So my 95% confidence interval for left wingers is

c(0.160-2*0.012,0.160+2*0.012)
## [1] 0.14 0.18

Let’s do the same for right wingers

boot(df$Pos=="RW",myprop,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df$Pos == "RW", statistic = myprop, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     0.14 0.00019       0.011
c(0.140-2*0.011,0.140+2*0.011)
## [1] 0.12 0.16

So we notice that both point estimates of the statistics fall within the other’s 95% confidence interval so we are going to fail to reject the null hypothesis, there is no difference between the proportion of right and left wingers. I should note that the second bootstrap was not necessary but I wanted to try it anyway. Also let’s generate some visuals while we are doing things that are not neccessary.

plot(boot(df$Pos=="RW",myprop,1000))

Categorical Hypothesis Test: Project Part 6

I would also like to do a hypothesis test on a proportion. Let’s compare the proportion of right wingers and left wingers and see if they are different. In statistical language, my null hypothesis is that these proportions are equal and my alternative is that they are different.

\[ H_0:\quad p_{rw}=p_{lw}\\ H_a:\quad p_{rw}\neq p_{lw} \]

Let’s try this same test with the traditional methods. I’ll use the pooled proportion \(\hat p=\frac{\hat{p}_1n_1+\hat{p}_2n_2}{n_1+n_2}\)

p = (sum(df$Pos=="LW")+sum(df$Pos=="RW"))/(2*length(df$Pos))

So then the Standard Error will be calculated as \[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \]

So here that is

se = sqrt(2*p*(1-p)/length(df$Pos))

So the test statistic here will be \(z=\frac{(\hat{p}_1-\hat{p}_2)-0}{SE}\)

(sum(df$Pos=="LW")/length(df$Pos)-sum(df$Pos=="RW")/length(df$Pos))/se
## [1] 1.5

We see this \(z\) is between -2 and 2 so we fail to reject the null hypothesis at the 95% confidence level.

Quantitative Hypothesis Testing: Project Part 7

Let’s go ahead and compute this value using the formula

\[ SE = \frac{s}{\sqrt{n}}. \] But first I am going to need the sample standard deviation,

sd(blues$PM)/sqrt(length(blues$PM))
## [1] 1.3

Wow that was exactly the same as the bootstrapping! No need to build another confidence interval instead we’ll compute a \(p\) value using the \(t\) distribution.

t.test(blues$PM, mu =-1)
## 
##  One Sample t-test
## 
## data:  blues$PM
## t = 3, df = 28, p-value = 0.01
## alternative hypothesis: true mean is not equal to -1
## 95 percent confidence interval:
##  -0.2  5.2
## sample estimates:
## mean of x 
##       2.5

My \(p=0.01\) so that we will be able to reject the null hypothesis as expected.

Conditional Probabilities: Project Part 8

So for the last part of the project I will compute some conditional probabilities using the two way table I created for Part 2. Let’s recreate that table for easier access.

table(df$Pos,df$Tm)
##     
##      ANA ARI BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL NJD NSH
##   C   11   9  11  11   8   8  10   9   8   8   9   8  10   9   6   8   9  13
##   D   10  10  12  10   9   9   9   9  10  11  13  10  11   9   9  10  10   9
##   LW   5   2   5   5   5   6   7   4   8   2   7   2   3   4   4   6   4   2
##   RW   7   5   3   3   3   5   2   2   3   7   3   7   4   5   3   2   7   2
##   W    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
##     
##      NYI NYR OTT PHI PIT SJS STL TBL TOR TOT VAN VEG WPG WSH
##   C   10  10   8   9   8   9  11   8   8  32   9   7   5   8
##   D    8  10  14   9   8   9  10   8   9  24  11   8  10   8
##   LW   5   4   7   4   1   1   5   4   2  16   5   5   4   3
##   RW   3   1   4   3   3   5   3   3   3   9   3   4   3   4
##   W    0   0   0   0   0   0   0   0   0   0   0   0   1   0

Then I’ll ask a few questions. Given that a player was on the Blues, what is the probability that they are a Left Winger? \[ P(LW|Blues)=\frac{P(LW\cap Blues)}{P(Blues)}=\frac{n(LW\cap Blues)}{n(Blues)}=\frac 5{30}=\frac16 \]

I did some simplifications there since both of the probabilities have the same denominator it is not necessary!

Let’s do the conditional probability the other way. Given that a player was a left winger, what is the probability that they played on the blues?

\[ P(Blues|LW)=\frac{P(Blues\cap LW)}{P(LW)}=\frac{n(Blues\cap LW)}{n(LW)}=\frac5{147} \]

That’s it! The project is complete! Make sure you wrap it up saying what you learned!