Loading in the tidyverse, data and setting seed

# Loading tidyverse 

library(tidyverse)
library(gridExtra)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

For Data Dive 5, our goal is to create variable combinations to explore the relationship between explanatory variable and repsonse variables. Thus, for our NHL Draft set, this means we will be taking quantitative variables and mutating them to create new variables.

When I think of player impact, traditionally, offense comes to mind for statisticians as it is hard for people to quantify defense in a way that can lead to better team performance. For simplicity of the concept, confidence intervals have the ability to infer a population parameter range for mean.

Let’s get started by talking about which 3 custom variable groups I will be using for this data dive.

colnames(nhl_draft)
##  [1] "id"                    "year"                  "overall_pick"         
##  [4] "team"                  "player"                "nationality"          
##  [7] "position"              "age"                   "to_year"              
## [10] "amateur_team"          "games_played"          "goals"                
## [13] "assists"               "points"                "plus_minus"           
## [16] "penalties_minutes"     "goalie_games_played"   "goalie_wins"          
## [19] "goalie_losses"         "goalie_ties_overtime"  "save_percentage"      
## [22] "goals_against_average" "point_shares"

Looking at all of the variables above, the 3 quantitative variables that catch my eye are goals, assists, and points. With these 3 we can look at how they vary by game and what can we say the mean for the population falls into.

Goals Per Game:

Starting off with goals, goals are the universal way in hockey to quantify who wins a game. With hockey also having lower scores, each goal hold a lot of weight on whether a team will win or not.

Creating the goals per game variable:

nhl_draft <- nhl_draft |> 
  mutate(goals_per_game = goals/games_played)

Our first grouping of variables to explain goals per game is going to be plus/minus (do players who have a higher plus/minus also score more goals per game?), games played, and goals per game.

Let’s visualize the relationships between these variables with games played and plus minus as explanatory variables for the goals per game variable.

gpg_vs_plusminus <- nhl_draft |> 
  ggplot(mapping = aes(x = plus_minus, y = goals_per_game))+
  geom_point()+
  ggtitle("Goals per Game vs Plus/Minus")

gpg_vs_gameplayed <- nhl_draft |> 
  ggplot(mapping = aes(x = games_played, y = goals_per_game))+
  geom_point()+
  ggtitle("Goals per Game vs Games Played")

grid.arrange(gpg_vs_plusminus, gpg_vs_gameplayed, ncol= 2)

Looking at the 2 graphs we can see that Plus/Minus is centered around 0 with multiple outliers including a couple of guys that have 1 goal per game with a 0 plus minus. There is also a guy with 750 plus minus. But the majority of the observations lies within -250 to 250 plus minus and 0 to 0.5 goals per game. Goals per game only has a positive linear relationship since there are no such thing as “negative games played”.

Let’s calculate the correlation between goals per game and the explanatory variables even there isn’t a linear relationship for either in my opinion.

# Goals Per Game vs Plus Minus
cor(nhl_draft$goals_per_game, nhl_draft$plus_minus, use = "complete.obs")
## [1] 0.1386319
# Goals Per Game vs Games Played 
cor(nhl_draft$goals_per_game, nhl_draft$games_played, use = "complete.obs")
## [1] 0.4805797

We can see that for goals per game and plus minus there is a very weak to non existent relationship with a 0.138 correlation coefficient which means there isn’t a linear relationship between the two variables. Goals per game and games played had a better linear relationship with a 0.48 correlation coefficient but still is weak and can’t be assumed to be a useful predictor of goals per game.

Calculating the confidence intervals for goals per game we get:

# Take a sample of 50

sample <- sample(na.omit(nhl_draft$goals_per_game), size = 50, replace = TRUE)


x_bar <- mean(sample)
n <- length(sample) 
sd <- sd(sample)
margin <- qt(0.975,df=n-1)*sd/sqrt(n)

low <- x_bar - margin
high <- x_bar + margin

ci <- c(low, high)

ci
## [1] 0.09644342 0.15903930

Based on the 95% confidence interval, we can expect out of our sample of 50 that excludes NA values that the true mean goals per game falls between 0.096 and 0.15 95% of the time. This means that players are most of the time not getting goals instead of getting goals since there isn’t such thing as 0.08 of a goal.

Assists Per Game:

Next, let’s look at assists which looks at how much players are spreading the puck to help other players score goals.

Creating the assists per game variable:

nhl_draft <- nhl_draft |> 
  mutate(assists_per_game = assists/games_played)

Our second grouping of variables to explain goals per game is going to be points (do players who have a higher amount of points also get more assists per game?), games played, and assists per game.

Let’s visualize the relationships between these variables with games played and points as explanatory variables for the assists per game variable.

apg_vs_plusminus <- nhl_draft |> 
  ggplot(mapping = aes(x = points, y = assists_per_game))+
  geom_point()+
  ggtitle("Assists per Game vs Points")

apg_vs_gameplayed <- nhl_draft |> 
  ggplot(mapping = aes(x = games_played, y = goals_per_game))+
  geom_point()+
  ggtitle("Assists per Game vs Games Played")

grid.arrange(apg_vs_plusminus, apg_vs_gameplayed, ncol= 2)

Looking at the 2 graphs we can see that Assists per Game has a positive relationship with points. This is because there is no such thing as negative points in hockey and thus your time on the rink can only really negatively impact your plus minus. Assists per game also only has a positive linear relationship since there are no such thing as “negative games played”. Knowing this fact looking at the correlations will only be to see if there is any linear relationship.

Let’s calculate the correlation between assists per game and the explanatory variables to see if there is any relation between the two.

# Assists Per Game vs Points
cor(nhl_draft$assists_per_game, nhl_draft$points, use = "complete.obs")
## [1] 0.7131423
# Assists Per Game vs Games Played 
cor(nhl_draft$assists_per_game, nhl_draft$games_played, use = "complete.obs")
## [1] 0.5561629

We can see that for assists per game and points there is a strong relationship with a 0.7131 correlation coefficient which means the two variables are correlated but this could influenced by a third unknown factor. Assists per game and games played had a worse linear relationship with a 0.55 correlation coefficient but still is stronger than goals per game and games played. This still can’t be assumed to be a useful predictor of assists per game though.

Calculating the confidence intervals for assists per game we get:

# Take a sample of 50

sample <- sample( na.omit(nhl_draft$assists_per_game), size = 50, replace = TRUE)


x_bar <- mean(sample)
n <- length(sample) 
sd <- sd(sample)
margin <- qt(0.975,df=n-1)*sd/sqrt(n)

low <- x_bar - margin
high <- x_bar + margin

ci <- c(low, high)

ci
## [1] 0.1303783 0.2080975

Based on the 95% confidence interval, we can expect out of our sample of 50 that excludes NA values that the true mean assists per game falls between 0.130 and 0.208 95% of the time.

This data dive introduced the concept of Confidence intervals and variable combinations. Even though there wasn’t much that we got out of this, I’ve realized how important a fully recorded observation can be for accurate results. These data dives are a way to practice these skills and understand data through practice.

Points Per Game:

Finally, points per game gives us an overall idea of how players are doing in the eyes of others and can be a sum of their goals and assists to help with standings and rankings.

Creating the points per game variable:

nhl_draft <- nhl_draft |> 
  mutate(points_per_game = points/games_played)

The last grouping of variables to explain goals per game is going to be goals (do players who have more goals also score more points per game?), games played, and points per game.

Let’s visualize the relationships between these variables with games played and goals as explanatory variables for the points per game variable.

ppg_vs_plusminus <- nhl_draft |> 
  ggplot(mapping = aes(x = goals, y = points_per_game))+
  geom_point()+
  ggtitle("Points per Game vs Goals")

ppg_vs_gameplayed <- nhl_draft |> 
  ggplot(mapping = aes(x = games_played, y = points_per_game))+
  geom_point()+
  ggtitle("Points per Game vs Games Played")

grid.arrange(ppg_vs_plusminus, ppg_vs_gameplayed, ncol= 2)

Looking at the 2 graphs we can see that Points per Game has a positive relationship with goals. This is because there is no such thing as negative goals in hockey and thus your time on the rink can only really negatively impact your plus minus. Points per game also only has a positive linear relationship since there are no such thing as “negative games played”. Knowing this fact looking at the correlations will only be to see if there is any linear relationship.

Let’s calculate the correlation between goals per game and the explanatory variables even there isn’t a linear relationship for either in my opinion.

# Points Per Game vs Goals
cor(nhl_draft$points_per_game, nhl_draft$goals, use = "complete.obs")
## [1] 0.7571647
# Points Per Game vs Games Played 
cor(nhl_draft$points_per_game, nhl_draft$games_played, use = "complete.obs")
## [1] 0.5762226

The correlation for points per game and goals has the strongest linear relationship of the data dive with a 0.757 correlation coefficient which means the two variables are correlated but this could influenced by a third unknown factor. Points per game and games played had a worse linear relationship with a 0.57 correlation coefficient but still is stronger than assists per game and games played. This still can’t be assumed to be a useful predictor of points per game though.

Calculating the confidence intervals for points per game we get:

# Take a sample of 50

sample <- sample(na.omit(nhl_draft$goals_per_game), size = 50, replace = TRUE)


x_bar <- mean(sample)
n <- length(sample) 
sd <- sd(sample)
margin <- qt(0.975,df=n-1)*sd/sqrt(n)

low <- x_bar - margin
high <- x_bar + margin

ci <- c(low, high)

ci
## [1] 0.08318782 0.14829371

Based on the 95% confidence interval, we can expect out of our sample of 50 that excludes NA values that the true mean points per game falls between 0.083 and 0.148 95% of the time. This mens that players are most of the time not getting goals instead of getting goals since there isn’t such thing as 0.08 of a goal.