This was an exciting hockey season for me. My hometown team, the St. Louis Blues, won Lord Stanley's Cup. I wanted to examine the season using the R programming language. I gathered my data from hockey-reference.com by going to the Skaters tab and selecting basic stats. I then took that data, created a csv file, and uploaded that file to my GitHub account. I reference that data set here via a direct link and read the csv into R as a data frame. Below is a sample of my data:
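That setup code is not shown below, so here is a minimal sketch of what it might look like. The URL is only a placeholder for the raw link to your own csv, and the library calls cover the packages used later in this post (dplyr for the %>% pipe, ggplot2 for the fancier plots, and boot for the bootstrapping).
# Packages used later in this post
library(dplyr)
library(ggplot2)
library(boot)

# Placeholder URL -- substitute the raw link to the csv in your own GitHub repository
url <- "https://raw.githubusercontent.com/<your-account>/<your-repo>/master/skaters.csv"
df <- read.csv(url)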
df[1:10,]
## Rk Player Age Tm Pos GP G A PTS X... PIM PS EV PP
## 1 1 Justin Abdelkader\\abdelju01 31 DET LW 71 6 13 19 -14 38 0.1 4 1
## 2 2 Pontus Aberg\\abergpo01 25 TOT LW 59 12 13 25 -14 20 2.0 9 3
## 3 2 Pontus Aberg\\abergpo01 25 ANA LW 37 11 8 19 -10 14 1.8 8 3
## 4 2 Pontus Aberg\\abergpo01 25 MIN LW 22 1 5 6 -4 6 0.2 1 0
## 5 3 Vitaly Abramov\\abramvi01 20 OTT RW 1 0 0 0 -3 0 -0.1 0 0
## 6 4 Noel Acciari\\acciano01 27 BOS C 72 6 8 14 -3 47 0.4 6 0
## 7 5 Kenny Agostino\\agostke01 26 TOT LW 63 6 18 24 -3 34 1.6 6 0
## 8 5 Kenny Agostino\\agostke01 26 MTL LW 36 2 9 11 -1 26 0.5 2 0
## 9 5 Kenny Agostino\\agostke01 26 NJD LW 27 4 9 13 -2 8 1.1 4 0
## 10 6 Andrew Agozzino\\agozzan01 28 COL LW 11 1 1 2 2 0 0.2 1 0
## SH GW EV.1 PP.1 SH.1 S S. TOI ATOI BLK HIT FOW FOL FO.
## 1 1 0 12 1 0 95 6.3 1093 15:24 34 185 52 51 50.5
## 2 0 1 9 4 0 101 11.9 861 14:36 11 45 2 17 10.5
## 3 0 1 7 1 0 74 14.9 578 15:37 7 31 2 9 18.2
## 4 0 0 2 3 0 27 3.7 283 12:52 4 14 0 8 0.0
## 5 0 0 0 0 0 0 NA 14 13:52 1 0 0 0 NA
## 6 0 2 8 0 0 99 6.1 935 12:59 36 221 200 203 49.6
## 7 0 0 13 5 0 79 7.6 814 12:55 14 143 15 27 35.7
## 8 0 0 6 3 0 38 5.3 403 11:11 4 80 8 14 36.4
## 9 0 0 7 2 0 41 9.8 411 15:13 10 63 7 13 35.0
## 10 0 0 1 0 0 11 9.1 86 7:49 3 14 16 22 42.1
There are a couple of interesting things we should point out right away. We notice that Pontus Aberg is listed multiple times. This is because he was traded during the season. The stats from each of his teams have been recorded, as well as his season totals (signified by TOT). We will need to be careful and deal with this before we proceed. Let's first see how many players there were in the season.
players <- unique(df["Player"])
length(players$Player)
## [1] 906
Looks like there were 906 unique players. Next let’s see how many were traded throughout the season.
trades <- df[which(df$Tm=='TOT'),]
length(trades$Player)
## [1] 81
Looks like 81 players played for multiple teams during the season. For those players I will keep only the combined season totals (the TOT rows) and drop the rows for each individual team they played for.
tradedplayers <- trades$Player
nottraded <- df[-which(df$Player %in% tradedplayers),]
df <- rbind(trades,nottraded)
df[1:10,1:10]
## Rk Player Age Tm Pos GP G A PTS X...
## 2 2 Pontus Aberg\\abergpo01 25 TOT LW 59 12 13 25 -14
## 7 5 Kenny Agostino\\agostke01 26 TOT LW 63 6 18 24 -3
## 23 19 Darren Archibald\\archida02 28 TOT RW 12 1 1 2 -3
## 51 45 Nathan Beaulieu\\beaulna01 26 TOT D 48 3 9 12 6
## 67 59 Anthony Bitetto\\bitetan01 28 TOT D 36 0 3 3 -7
## 72 62 Nick Bjugstad\\bjugsni01 26 TOT C 64 14 12 26 6
## 77 65 Joseph Blandisi\\blandjo01 24 TOT C 9 0 0 0 -1
## 94 80 Madison Bowey\\boweyma01 23 TOT D 50 2 8 10 -1
## 99 83 Brian Boyle\\boylebr01 34 TOT C 73 18 6 24 -14
## 104 86 Derick Brassard\\brassde01 31 TOT C 70 14 9 23 -19
So now my data set just includes individual players with no duplicates. Let's see if we can analyze the variables.
summary(df)
## Rk Player Age Tm
## Min. : 1.0 A.J. Greer\\greeraj01 : 1 Min. :18.00 TOT : 81
## 1st Qu.:227.2 Aaron Ekblad\\ekblaaa01 : 1 1st Qu.:23.00 ANA : 33
## Median :453.5 Adam Clendening\\clendad01: 1 Median :25.00 OTT : 33
## Mean :453.5 Adam Cracknell\\crackad01 : 1 Mean :26.01 DET : 32
## 3rd Qu.:679.8 Adam Erne\\ernead01 : 1 3rd Qu.:28.75 BOS : 31
## Max. :906.0 Adam Gaudette\\gaudead01 : 1 Max. :42.00 NJD : 31
## (Other) :900 (Other):665
## Pos GP G A PTS
## C :307 Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. : 0.00
## D :326 1st Qu.:23.0 1st Qu.: 1.000 1st Qu.: 2.00 1st Qu.: 4.00
## LW:147 Median :60.0 Median : 5.000 Median :10.00 Median : 16.00
## RW:124 Mean :50.5 Mean : 8.363 Mean :14.05 Mean : 22.41
## W : 2 3rd Qu.:78.0 3rd Qu.:13.000 3rd Qu.:21.00 3rd Qu.: 33.75
## Max. :84.0 Max. :51.000 Max. :87.00 Max. :128.00
##
## X... PIM PS EV
## Min. :-41.0000 Min. : 0.00 Min. :-1.100 Min. : 0.000
## 1st Qu.: -6.0000 1st Qu.: 6.00 1st Qu.: 0.200 1st Qu.: 1.000
## Median : -1.0000 Median : 18.00 Median : 1.700 Median : 4.000
## Mean : -0.5099 Mean : 22.52 Mean : 2.557 Mean : 6.488
## 3rd Qu.: 4.0000 3rd Qu.: 33.00 3rd Qu.: 4.100 3rd Qu.:10.000
## Max. : 39.0000 Max. :153.00 Max. :14.600 Max. :37.000
##
## PP SH GW EV.1
## Min. : 0.000 Min. :0.0000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 2.0
## Median : 0.000 Median :0.0000 Median : 1.000 Median : 9.0
## Mean : 1.618 Mean :0.2572 Mean : 1.307 Mean :10.7
## 3rd Qu.: 2.000 3rd Qu.:0.0000 3rd Qu.: 2.000 3rd Qu.:16.0
## Max. :20.000 Max. :6.0000 Max. :10.000 Max. :54.0
##
## PP.1 SH.1 S S.
## Min. : 0.000 Min. :0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 22.25 1st Qu.: 3.300
## Median : 0.000 Median :0.0000 Median : 79.00 Median : 7.100
## Mean : 3.074 Mean :0.2704 Mean : 88.23 Mean : 8.024
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:133.75 3rd Qu.: 11.300
## Max. :33.000 Max. :4.0000 Max. :365.00 Max. :100.000
## NA's :18
## TOI ATOI BLK HIT
## Min. : 2.0 15:40 : 6 Min. : 0.00 Min. : 0.00
## 1st Qu.: 263.5 14:01 : 5 1st Qu.: 11.00 1st Qu.: 18.00
## Median : 879.5 11:22 : 4 Median : 29.00 Median : 50.00
## Mean : 833.0 12:41 : 4 Mean : 40.03 Mean : 62.73
## 3rd Qu.:1298.0 12:46 : 4 3rd Qu.: 53.00 3rd Qu.: 92.00
## Max. :2189.0 13:03 : 4 Max. :208.00 Max. :305.00
## (Other):879
## FOW FOL FO.
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 34.20
## Median : 2.00 Median : 3.00 Median : 44.90
## Mean : 82.79 Mean : 82.79 Mean : 42.23
## 3rd Qu.: 38.50 3rd Qu.: 50.75 3rd Qu.: 51.50
## Max. :1086.00 Max. :935.00 Max. :100.00
## NA's :333
So this output is very intense! It went through every variable (and the case descriptors) and did some basic analysis of each. I am interested in the plus/minus variable, a discrete quantitative variable, but its +/- header got mangled into X... when the csv was read in, so let's fix that. I'll also grab the data for just my team.
names(df)[names(df)=='X...'] <- 'PM'
blues <- df[which(df$Tm == 'STL'),]
For your paper you should carefully describe any variable you think you might use. For me that would be PM, the plus/minus variable. A player gets a plus one if they are on the ice when their team scores a goal and a minus one if they are on the ice for a goal against; you may not have scored the goal yourself, but it shows that you were part of the play that scored or was scored on. Pos is the player's position: Center, Left Wing, Right Wing, or Defense. Goalies were not included in this data set. Tm stands for the team the player was on. There are more variables here (clearly), but rather than describe them all I will move on with the project.
I want to examine the categorical variables for my project. Let’s recreate one of the frequency tables above and see if we can get relative frequency in it.
options(digits=2)
freq <- rbind(table(df$Pos),table(df$Pos)/length(df$Pos))
row.names(freq)<-c('Frequency','Relative Frequency')
freq
## C D LW RW W
## Frequency 307.00 326.00 147.00 124.00 2.0000
## Relative Frequency 0.34 0.36 0.16 0.14 0.0022
Cool! We see about 34% of the league's skaters are centers and there are significantly more left wingers than right wingers. I left my relative frequency in decimal form. Feel free to change it to percentages in your project.
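If you do want percentages without writing anything yourself, base R's prop.table() converts a table of counts into proportions; a quick sketch:
round(100 * prop.table(table(df$Pos)), 2)   # relative frequencies as percentages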
I am going to try to make the frequency table again using some other functions in R.
percent <- function(x, digits = 2, format = "f", ...) { # Create user-defined function
paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
}
df %>%
group_by(Pos) %>%
summarize(Frequency = length(Pos), RelativeFrequency = percent(length(Pos)/nrow(df)))
## # A tibble: 5 x 3
## Pos Frequency RelativeFrequency
## <fct> <int> <chr>
## 1 C           307 33.89%
## 2 D           326 35.98%
## 3 LW          147 16.23%
## 4 RW          124 13.69%
## 5 W             2 0.22%
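For reference, here is a slightly more idiomatic dplyr version of the same table (a sketch); count() tallies the positions and sum(n) supplies the denominator, so nothing is hard coded.
df %>%
  count(Pos) %>%
  mutate(RelativeFrequency = percent(n / sum(n)))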
Next let's look at a two-way table.
table(df$Pos,df$Tm)
##
## ANA ARI BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL NJD NSH
## C 11 9 11 11 8 8 10 9 8 8 9 8 10 9 6 8 9 13
## D 10 10 12 10 9 9 9 9 10 11 13 10 11 9 9 10 10 9
## LW 5 2 5 5 5 6 7 4 8 2 7 2 3 4 4 6 4 2
## RW 7 5 3 3 3 5 2 2 3 7 3 7 4 5 3 2 7 2
## W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
##
## NYI NYR OTT PHI PIT SJS STL TBL TOR TOT VAN VEG WPG WSH
## C 10 10 8 9 8 9 11 8 8 32 9 7 5 8
## D 8 10 14 9 8 9 10 8 9 24 11 8 10 8
## LW 5 4 7 4 1 1 5 4 2 16 5 5 4 3
## RW 3 1 4 3 3 5 3 3 3 9 3 4 3 4
## W 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Not the most interesting table I have ever made, but there are a few interesting aspects, like the Pittsburgh Penguins and San Jose Sharks each having only one left winger. I have no idea why that might be the case; perhaps some of their wingers are not "natural" left wingers but still play on that side and are therefore not counted.
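If you would like the totals alongside the counts, base R's addmargins() appends a Sum row and column to the table; a quick sketch that also pulls out a few columns of interest:
addmargins(table(df$Pos, df$Tm))[, c("PIT", "SJS", "STL", "Sum")]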
I am going to examine the plus/minus of the players; I think it is one of the most accurate statistics of a player's overall effectiveness on the ice. Below are the five-number summary, the mean, and the standard deviation.
summary(df$PM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -41 -6 -1 -1 4 39
sd(df$PM)
## [1] 10
Now let's generate some graphics. A histogram will show whether the data is bell shaped, and a box plot will help me look for outliers.
hist(df$PM, xlab = 'Plus/Minus', main = 'Histogram of Plus/Minus')
boxplot(df$PM, horizontal = TRUE, xlab = 'Plus/Minus', main = 'Box Plot of Plus/Minus')
Looks rather bell shaped, but there are many outliers! I want to see if the data is truly normal, so I will also examine the QQ plot. You will not need to do this in Excel; it is a more advanced technique that you might cover in a second semester of statistics.
qqnorm(df$PM)
qqline(df$PM)   # adds the reference line the points are compared against
The data almost lies on the line so I will go with approximately normal.
I am also curious what the St. Louis Blues data might look like.
hist(blues$PM, xlab ='Blues Plus/Minus', main = 'Histogram of Blues Plus/Minus')
For the side-by-side chart, I want to compare plus/minus across teams. I will do this with box plots.
boxplot(df$PM ~ df$Tm, ylab ='Plus/Minus', xlab ="Teams")
The teams have been put in alphabetical order, so it is hard to tell if there is a pattern, but the number of outliers here is greatly reduced.
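If you want a pattern to stand out more, one option is to sort the teams by their median plus/minus instead of alphabetically; here is a quick sketch using base R's reorder() (an extra step, not something the project requires):
boxplot(df$PM ~ reorder(df$Tm, df$PM, FUN = median),
        ylab = 'Plus/Minus', xlab = 'Teams (ordered by median plus/minus)')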
I like linear regression, so I will do it here even though it wasn't required anywhere for the project.
plot(df$PM,df$G,xlab = 'Plus/Minus', ylab = 'Goals')
abline(lm(df$G~df$PM))
This plot is really boring; I want to spice it up a bit. Let's see if I can get the teams in there too.
ggplot(df,aes(x = PM, y = G, color=Tm))+
geom_point()
So now each team is coloured. For giggles, let’s see what happens if we do a linear regression for each team.
ggplot(df,aes(x = PM, y = G, color=Tm))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Weirdly enough, on some teams scoring more goals is associated with a worse plus/minus. This seems counterintuitive, but it might be a result of cutting the data up too much; the sketch below pulls out the per-team slopes so we can see which ones actually come out negative. After that, let's also use this same package to make the side-by-side charts with colours.
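This is a minimal sketch (not part of the original analysis) that fits the same goals-versus-plus/minus regression separately for each team and collects the slopes:
# Fit G ~ PM within each team and extract the slope on PM
slopes <- sapply(split(df, df$Tm), function(team) coef(lm(G ~ PM, data = team))[["PM"]])
sort(slopes)[1:5]   # the five most negative slopes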
ggplot(df,aes(x = G,y=PM, color=Tm))+
geom_boxplot() + coord_flip()
I am not certain this plot says much, but it sure is pretty…
I would like to check and see if the Blues had a higher plus/minus than the league average, so I would like to set up a hypothesis test. I state my null and alternative hypotheses as
\[ H_0:\quad \mu_{Blues}=-1\\ H_a:\quad \mu_{Blues}\neq-1. \] I have picked \(-1\) here as that was the mean and median for the data set.
I would also like to do a hypothesis test on a proportion. Let’s compare the proportion of right wingers and left wingers and see if they are different. In statistical language, my null hypothesis is that these proportions are equal and my alternative is that they are different.
\[ H_0:\quad p_{rw}=p_{lw}\\ H_a:\quad p_{rw}\neq p_{lw} \] This is actually a fairly tricky hypothesis. I might have made life easier for myself had I compared \(p_{rw}\) to a value like 16% because that was what the proportion of left wingers was.
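For what it's worth, that simpler one-proportion version can be run directly with prop.test(); a sketch, testing \(p_{rw}\) against 0.16 as mentioned above:
prop.test(sum(df$Pos == "RW"), nrow(df), p = 0.16)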
Let's go ahead and create a bootstrap distribution for the Blues' plus/minus.
# Statistic passed to boot(): the mean of the resampled values x[i]
samp_mean <- function(x, i) {
  mean(x[i])
}
# 10,000 bootstrap resamples of the Blues' plus/minus
b1 <- boot(blues$PM, samp_mean, 10000)
plot(b1)
So we now know that the mean was \(\overline{x} = 2.5\) and \(SE = 1.3\), so we will build a rough 95% confidence interval.
c(2.5-2*1.3,2.5+2*1.3)
## [1] -0.1 5.1
We note that the Blues' confidence interval does not contain \(-1\), so we are able to reject the null hypothesis; therefore we have evidence to suggest that the Blues were better than the average team (at least in plus/minus).
Let’s repeat this using a built in function.
boot.ci(b1,type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b1, type = "norm")
##
## Intervals :
## Level Normal
## 95% (-0.028, 5.078 )
## Calculations and Intervals on Original Scale
Pretty similar! I also prefer the percentile method. Instead of relying on a normality assumption, it takes the actual values computed by the bootstrap and uses the cutoffs with 2.5% of the replicates below and 2.5% above.
boot.ci(b1,type = "perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b1, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.1, 5.2 )
## Calculations and Intervals on Original Scale
Now let's bootstrap the proportion of left wingers.
# Statistic passed to boot(): the proportion of TRUE values in the resample
myprop <- function(x, i) {
  sum(x[i] == TRUE) / length(x)
}
boot(df$Pos=="LW",myprop,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df$Pos == "LW", statistic = myprop, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.16 -0.00027 0.012
So my 95% confidence interval for left wingers is
c(0.160-2*0.012,0.160+2*0.012)
## [1] 0.14 0.18
Let's do the same for right wingers.
boot(df$Pos=="RW",myprop,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df$Pos == "RW", statistic = myprop, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.14 0.00019 0.011
c(0.140-2*0.011,0.140+2*0.011)
## [1] 0.12 0.16
So we notice that each point estimate falls within the other's 95% confidence interval, so we fail to reject the null hypothesis: we do not have evidence of a difference between the proportions of right and left wingers. I should note that the second bootstrap was not strictly necessary, but I wanted to try it anyway. Also, let's generate some visuals while we are doing things that are not necessary.
plot(boot(df$Pos=="RW",myprop,1000))
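A more direct route (a sketch, not something the project requires) is to bootstrap the difference between the two proportions and check whether its percentile interval contains zero:
# Statistic for boot(): proportion of LW minus proportion of RW in the resample
diff_prop <- function(x, i) mean(x[i] == "LW") - mean(x[i] == "RW")
bdiff <- boot(df$Pos, diff_prop, 1000)
boot.ci(bdiff, type = "perc")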
As a reminder, the hypotheses for the proportion test are
\[ H_0:\quad p_{rw}=p_{lw}\\ H_a:\quad p_{rw}\neq p_{lw} \]
Let’s try this same test with the traditional methods. I’ll use the pooled proportion \(\hat p=\frac{\hat{p}_1n_1+\hat{p}_2n_2}{n_1+n_2}\)
p = (sum(df$Pos=="LW")+sum(df$Pos=="RW"))/(2*length(df$Pos))
So then the Standard Error will be calculated as \[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \]
So here that is
se = sqrt(2*p*(1-p)/length(df$Pos))
So the test statistic here will be \(z=\frac{(\hat{p}_1-\hat{p}_2)-0}{SE}\)
(sum(df$Pos=="LW")/length(df$Pos)-sum(df$Pos=="RW")/length(df$Pos))/se
## [1] 1.5
We see this \(z\) is between \(-2\) and \(2\) (the approximate critical values for a two-sided test at the 5% significance level), so we fail to reject the null hypothesis.
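As a cross-check (a sketch), the built-in prop.test() runs the equivalent test; it applies a continuity correction by default, so set correct = FALSE if you want its chi-squared statistic to match the square of the hand-computed \(z\):
prop.test(c(sum(df$Pos == "LW"), sum(df$Pos == "RW")),
          c(nrow(df), nrow(df)), correct = FALSE)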
Let's return to the Blues' plus/minus and compute its standard error using the traditional formula
\[ SE = \frac{s}{\sqrt{n}}. \] For that I am going to need the sample standard deviation,
sd(blues$PM)/sqrt(length(blues$PM))
## [1] 1.3
Wow, that is exactly the same as the bootstrapping! No need to build another confidence interval; instead we'll compute a \(p\)-value using the \(t\) distribution.
t.test(blues$PM, mu =-1)
##
## One Sample t-test
##
## data: blues$PM
## t = 3, df = 28, p-value = 0.01
## alternative hypothesis: true mean is not equal to -1
## 95 percent confidence interval:
## -0.2 5.2
## sample estimates:
## mean of x
## 2.5
My \(p = 0.01\), so we are able to reject the null hypothesis, as expected.
So for the last part of the project I will compute some conditional probabilities using the two-way table I created for Part 2. Let's recreate that table for easier access.
table(df$Pos,df$Tm)
##
## ANA ARI BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL NJD NSH
## C 11 9 11 11 8 8 10 9 8 8 9 8 10 9 6 8 9 13
## D 10 10 12 10 9 9 9 9 10 11 13 10 11 9 9 10 10 9
## LW 5 2 5 5 5 6 7 4 8 2 7 2 3 4 4 6 4 2
## RW 7 5 3 3 3 5 2 2 3 7 3 7 4 5 3 2 7 2
## W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
##
## NYI NYR OTT PHI PIT SJS STL TBL TOR TOT VAN VEG WPG WSH
## C 10 10 8 9 8 9 11 8 8 32 9 7 5 8
## D 8 10 14 9 8 9 10 8 9 24 11 8 10 8
## LW 5 4 7 4 1 1 5 4 2 16 5 5 4 3
## RW 3 1 4 3 3 5 3 3 3 9 3 4 3 4
## W 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Then I'll ask a few questions. Given that a player was on the Blues, what is the probability that they are a left winger? \[ P(LW|Blues)=\frac{P(LW\cap Blues)}{P(Blues)}=\frac{n(LW\cap Blues)}{n(Blues)}=\frac{5}{29}\approx 0.17 \]
I did some simplification there: since both probabilities have the same denominator, it cancels, so computing the full probabilities is not necessary and we can work with the counts directly.
Let's do the conditional probability the other way. Given that a player was a left winger, what is the probability that they played on the Blues?
\[ P(Blues|LW)=\frac{P(Blues\cap LW)}{P(LW)}=\frac{n(Blues\cap LW)}{n(LW)}=\frac5{147} \]
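Both conditional probabilities can also be read straight off the table in R by normalizing it with prop.table(); a quick sketch, where margin = 2 divides each column by its team total and margin = 1 divides each row by its position total:
twoway <- table(df$Pos, df$Tm)
prop.table(twoway, margin = 2)["LW", "STL"]   # P(LW | Blues)
prop.table(twoway, margin = 1)["LW", "STL"]   # P(Blues | LW)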
That’s it! The project is complete! Make sure you wrap it up saying what you learned!