Data Dive 7: ANOVA Tests and Linear Regression

Loading in the tidyverse, data and setting seed

# Loading tidyverse and car (car provides the Anova() function used later)

library(tidyverse)
library(car)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

For this week’s data dive we will be discussing the Analysis of Variance (ANOVA) test for comparing group means, followed by Linear Regression.

To start, I’m going to determine my response (dependent) variable for my regression by choosing a variable that people like to compare players with in hockey.

Here are the columns we have to work with and their variable types:

glimpse(nhl_draft)

## Rows: 12,250
## Columns: 23
## $ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ year                  <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, …
## $ overall_pick          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ team                  <chr> "Montreal Canadiens", "New Jersey Devils", "Ariz…
## $ player                <chr> "Juraj Slafkovsky", "Simon Nemec", "Logan Cooley…
## $ nationality           <chr> "SK", "SK", "US", "CA", "SE", "CZ", "CA", "AT", …
## $ position              <chr> "LW", "D", "C", "C", "LW", "D", "D", "C", "C", "…
## $ age                   <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ to_year               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ amateur_team          <chr> "TPS (Finland)", "HK Nitra (Slovakia)", "USA U-1…
## $ games_played          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ assists               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ points                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ plus_minus            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ penalties_minutes     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_games_played   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_wins           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_losses         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_ties_overtime  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ save_percentage       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals_against_average <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ point_shares          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

For this analysis, I chose the “goals” variable. Just like points in basketball or touchdowns in football, goals is a variable that anyone can understand when talking about hockey.

Now it’s time to determine a categorical variable to compare group means with. Since I believe that player name, nationality, and amateur team would not have an effect on a player’s goals in the NHL, I will choose the position column as my categorical variable.

This leads us to state our null and alternative hypotheses for goals and position.

Since we are comparing the group means to see whether they are all equal to each other, we can state the hypotheses as follows:

Null Hypothesis (Ho): Average goals are identical across offense positions. Any observed difference is due to chance.

Alternative Hypothesis (Ha): Average goals vary by offense position. We reject the null if the differences are larger than what we would expect from chance.
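Written in symbols, the hypotheses and the statistic that ANOVA uses to compare them can be sketched as follows. The notation is introduced here for illustration only: k is the number of groups, n is the number of observations used, and each mean is the average goals for one position.

```latex
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad
H_a:\ \text{at least one } \mu_j \text{ differs from the others}

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (n - k)}
```

A large F means the group means are spread further apart than the variation inside each group would suggest by chance.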

Now, before we perform the ANOVA test, we must check three conditions on the data:
• the observations are independent within and across groups,
• the data within each group are nearly normal, and
• the variability across the groups is about equal.

First, let’s look at the number of observations per position:

# Dropping NA observations for position
nhl_draft <- nhl_draft |> 
  filter(!(is.na(position)))

# Counting observations per position
nhl_draft |>
  group_by(position) |> 
  summarize(count = n()) |> 
  print(n = 40)

## # A tibble: 24 × 2
##    position count
##    <chr>    <int>
##  1 C         2688
##  2 C / R        2
##  3 C RW         2
##  4 C/D          5
##  5 C/LW        74
##  6 C/RW        49
##  7 C/W          3
##  8 C; LW        2
##  9 Centr        1
## 10 D         3966
## 11 D/C          2
## 12 D/LW         6
## 13 D/RW         4
## 14 D/W          1
## 15 F           18
## 16 G         1217
## 17 L/RW         1
## 18 LW        2080
## 19 LW/C        18
## 20 LW/D         8
## 21 RW        2021
## 22 RW/C         8
## 23 RW/D         3
## 24 W           44

As you can see from above, there are a lot of categories! So for this analysis, instead of including the variety of combo players, we will focus on the main offense positions: Centre (“C”), Left Wing (“LW”), and Right Wing (“RW”).

# Filtering for main positions
nhl_draft <- nhl_draft |> 
  filter(position %in% c("C", "LW", "RW"))


nhl_draft |>
  group_by(position) |> 
  summarize(count = n())

## # A tibble: 3 × 2
##   position count
##   <chr>    <int>
## 1 C         2688
## 2 LW        2080
## 3 RW        2021

It looks like we have a different number of observations in each group, which means that we have an unbalanced design. This will be important later when we choose which type of ANOVA test to run.

Next, let’s look at the variances of each group:

# Checking variances with boxplot

nhl_draft |> 
  ggplot(aes(x = position, y = goals))+
  geom_boxplot()

As we can see, the spread of goals for the C, LW, and RW positions is roughly the same, with a large number of outliers in each group.
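A boxplot gives a visual impression only, so it can help to put numbers next to it. The sketch below computes the standard deviation per group and the ratio of the largest to the smallest. It runs on a small made-up data frame (`toy`, not the draft data) so that it is self-contained; the same `tapply()` call should work on `nhl_draft$goals` and `nhl_draft$position`.

```r
# Hypothetical stand-in data: three positions with made-up goal totals.
# None of these numbers come from nhldraft.csv.
set.seed(1)
toy <- data.frame(
  position = rep(c("C", "LW", "RW"), times = c(40, 30, 30)),
  goals    = c(rpois(40, 60), rpois(30, 55), rpois(30, 58))
)

# Standard deviation of goals within each position (NAs ignored)
group_sd <- tapply(toy$goals, toy$position, sd, na.rm = TRUE)
print(round(group_sd, 2))

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
print(round(sd_ratio, 2))
```

A common rule of thumb treats a ratio well above 2 as a warning sign for the equal-variability condition, though this is a guideline rather than a formal test.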

I won’t go into the various types of ANOVA tests, but since we aren’t checking for any interactions and are only looking at position vs. goals, a Type II ANOVA test is a reasonable choice for an unbalanced design. (With a single explanatory variable, Type I, II, and III tests all give the same result.)

Now let’s test this hypothesis using the ANOVA test.

# Goals is on left side of the equation in the table since it is the dependent variable.
# All explanatory (independent) variables are on the right of the equation.

position_anova <- aov(goals ~ position, data = nhl_draft)

# Using the "car" library we can get type 2 and 3 ANOVA tests
Anova(position_anova, type = "II")

## Anova Table (Type II tests)
## 
## Response: goals
##             Sum Sq   Df F value Pr(>F)
## position     36102    2  1.4136 0.2434
## Residuals 38487360 3014

Looking at the summary table above, the p-value (0.2434) is larger than alpha = 0.05, which means we fail to reject the Null Hypothesis. We do not have enough evidence to conclude that average goals differ between the offensive positions; the observed differences in means are consistent with chance. Note that failing to reject the null is not proof that the means are exactly equal.
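To see where the F value and p-value in the table come from, the calculation can be reproduced by hand from the sums of squares and degrees of freedom. This is a minimal sketch using only the numbers printed in the table above; the variable names are my own.

```r
# Values copied from the Type II ANOVA table above
ss_position  <- 36102
df_position  <- 2
ss_residuals <- 38487360
df_residuals <- 3014

# Mean squares: each sum of squares divided by its degrees of freedom
ms_position  <- ss_position / df_position
ms_residuals <- ss_residuals / df_residuals

# F statistic and its upper-tail p-value from the F distribution
f_value <- ms_position / ms_residuals
p_value <- pf(f_value, df_position, df_residuals, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 4))
```

The result should agree with the table: an F value of about 1.4136 and a p-value of about 0.2434.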

Now, let’s choose another continuous variable that could influence the number of goals per player. In this case, I’m choosing the variable “assists”, since a player who chooses to pass the puck to a teammate rather than shoot gives up a scoring chance of their own, which could lower the number of goals that player scores on average.

We’re now going to create a linear regression model of the relationship between goals and assists. Let’s first plot the variables and look at the correlation between them.

# Plotting variables
nhl_draft |> 
  ggplot(aes(x = assists, y = goals)) +
  geom_point()

# Correlation

cor(nhl_draft$goals, nhl_draft$assists, use = "complete.obs")

## [1] 0.947992

We can see that there is a high correlation between the two!
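For a regression with a single predictor, the squared correlation equals the R-squared of the fitted line. The sketch below checks this on made-up data (the `toy_` vectors are hypothetical and not taken from the draft data).

```r
# Hypothetical toy data, not taken from nhldraft.csv
set.seed(1)
toy_assists <- round(runif(200, min = 0, max = 500))
toy_goals   <- 5 + 0.7 * toy_assists + rnorm(200, sd = 30)

# Pearson correlation between the two variables
r <- cor(toy_goals, toy_assists)

# R-squared from a one-predictor linear model on the same data
toy_fit   <- lm(toy_goals ~ toy_assists)
r_squared <- summary(toy_fit)$r.squared

# The two quantities agree up to floating-point error
print(all.equal(r^2, r_squared))
```

Squaring the 0.947992 reported above gives roughly 0.8987, so a straight-line fit on assists alone can be expected to explain about 90% of the variance in goals.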

For our null and alternative hypotheses let’s say that:

Null Hypothesis (Ho): There is not a significant relationship between assists and goals. (B1 = 0)

Alternative Hypothesis (Ha): There is a significant relationship between assists and goals. (B1 != 0)
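In symbols, the model behind these hypotheses can be sketched as follows, where the slope written as B1 above appears as β1, and β0 and the error term ε are notation added here for illustration:

```latex
\text{goals}_i = \beta_0 + \beta_1\,\text{assists}_i + \varepsilon_i,
\qquad
H_0:\ \beta_1 = 0
\quad\text{vs.}\quad
H_a:\ \beta_1 \neq 0

t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)}
```

The t statistic compares the estimated slope with its standard error; a slope many standard errors away from zero gives a small p-value.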

Alpha = 0.05

Now let’s create the model and see how well it fits:

goals_assists <- lm(goals ~ assists , data = nhl_draft)

summary(goals_assists)

## 
## Call:
## lm(formula = goals ~ assists, data = nhl_draft)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -326.26   -6.84   -5.13    2.46  348.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.458801   0.773484   7.057  2.1e-12 ***
## assists     0.675662   0.004132 163.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.98 on 3015 degrees of freedom
##   (3772 observations deleted due to missingness)
## Multiple R-squared:  0.8987, Adjusted R-squared:  0.8987 
## F-statistic: 2.674e+04 on 1 and 3015 DF,  p-value: < 2.2e-16

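Looking at the summary, at the 5% significance level, there is enough evidence to support the claim that there is a significant linear relationship between the number of assists and the number of goals, because the p-value for the assists coefficient (< 2e-16) is far below 0.05. We therefore reject the null hypothesis that B1 = 0. The slope estimate says that each additional assist is associated with about 0.68 more goals on average.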
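The fitted coefficients can be turned into predictions directly. The sketch below copies the two estimates from the summary output; `predict_goals()` is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(goals_assists) output above
intercept <- 5.458801
slope     <- 0.675662

# Hypothetical helper: expected goals for a given number of assists
predict_goals <- function(assists) {
  intercept + slope * assists
}

# Expected goals for players with 0, 100, and 300 assists
print(predict_goals(c(0, 100, 300)))
```

With the model object available, `predict(goals_assists, newdata = data.frame(assists = c(0, 100, 300)))` should give the same values up to the rounding of the printed coefficients.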

Here is the line plotted through the points:

# Plotting the points with the fitted regression line
nhl_draft |> 
  ggplot(aes(x = assists, y = goals)) +
  geom_point() +
  geom_abline(intercept = coef(goals_assists)[1],
              slope = coef(goals_assists)[2],
              colour = "red")

Now, for this linear model analysis I only used one explanatory variable. In normal practice, it is best to fit several models with different combinations of independent variables and compare how well each explains the dependent variable. Weighing which variables in the data set to include is also important for building a balanced model.
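One way to compare candidate models is a partial F-test between nested fits, optionally alongside AIC. The sketch below is run on made-up data with a deliberately built-in position effect (the `toy` data frame and the `shift` values are assumptions for illustration, not estimates from the draft data).

```r
# Hypothetical toy data with a built-in position effect (not from nhldraft.csv)
set.seed(1)
n <- 300
toy <- data.frame(
  assists  = round(runif(n, min = 0, max = 500)),
  position = factor(sample(c("C", "LW", "RW"), n, replace = TRUE))
)
shift <- c(C = 0, LW = 15, RW = 15)
toy$goals <- 5 + 0.7 * toy$assists +
  unname(shift[as.character(toy$position)]) +
  rnorm(n, sd = 5)

# Two nested candidate models
fit_small <- lm(goals ~ assists, data = toy)
fit_large <- lm(goals ~ assists + position, data = toy)

# Partial F-test: does adding position reduce the residual error
# by more than chance alone would explain?
comparison <- anova(fit_small, fit_large)
print(comparison)

# AIC as a second yardstick (lower is better)
print(AIC(fit_small, fit_large))
```

A small p-value in the second row of the comparison table suggests the extra variable is worth keeping.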

Lastly, I will create a model with both variables used in this Data Dive to see if the model improves or not.

Let’s create a model using position and assists as our variables:

goals_assists_position <- lm(goals ~ assists + position , data = nhl_draft)

summary(goals_assists_position)

## 
## Call:
## lm(formula = goals ~ assists + position, data = nhl_draft)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -319.81  -11.45   -3.30    5.45  341.59 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.117427   1.124217  -3.662 0.000254 ***
## assists      0.679433   0.004059 167.388  < 2e-16 ***
## positionLW  14.487188   1.551664   9.337  < 2e-16 ***
## positionRW  15.892781   1.554623  10.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.22 on 3013 degrees of freedom
##   (3772 observations deleted due to missingness)
## Multiple R-squared:  0.903,  Adjusted R-squared:  0.9029 
## F-statistic:  9349 on 3 and 3013 DF,  p-value: < 2.2e-16

After looking at our summary, we can see that the model improved, but only slightly: the adjusted R-squared rose from 0.8987 to 0.9029 and the residual standard error fell from 35.98 to 35.22. The model accounts for about 90% of the variance in goals.

In this case, it is a good idea to keep both variables, since both assists and position have significant coefficients and together they improve the fit. Note that position was not significant on its own in the ANOVA earlier; its effect only shows up once assists are held constant.

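Here is the new model plotted. The red line uses the model intercept, which belongs to the reference position (C); the lines for LW and RW have the same slope but are shifted up by their coefficients:

# Plotting the points with the fitted line for the reference position (C)
nhl_draft |> 
  ggplot(aes(x = assists, y = goals)) +
  geom_point() +
  geom_abline(intercept = coef(goals_assists_position)[1],
              slope = coef(goals_assists_position)[2],
              colour = "red")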
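Because position enters the model as a set of shifts relative to the reference level C, the fit is really three parallel lines rather than one. The sketch below works out the intercept for each position from the coefficients printed in the summary; the variable names are my own.

```r
# Coefficients copied from the summary(goals_assists_position) output above
b_intercept <- -4.117427   # intercept for the reference level, C
b_assists   <-  0.679433   # common slope shared by all positions
b_lw        <- 14.487188   # shift for LW relative to C
b_rw        <- 15.892781   # shift for RW relative to C

# One line per position: same slope, different intercepts
position_lines <- data.frame(
  position  = c("C", "LW", "RW"),
  intercept = c(b_intercept, b_intercept + b_lw, b_intercept + b_rw),
  slope     = b_assists
)
print(position_lines)
```

Passing this data frame to `geom_abline(data = position_lines, aes(intercept = intercept, slope = slope, colour = position))` should draw all three lines on the scatter plot.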

There you go! I hope you enjoyed this data dive. In real life there will be a lot more variables to consider, as well as interaction terms. However, I hope this gave a basic breakdown of how goals, assists, and position relate to each other in hockey and how that relationship can be modeled through linear regression. ANOVA tests are also important for comparing the means of different groups and can be applied in various scenarios with different hypotheses.

Connor Bryson

10/17/2023