Loading the tidyverse and data, and setting the seed

# Loading tidyverse and gridExtra

library(tidyverse)
library(gridExtra)

# Loading in data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

For this week’s Data Dive we will be going over hypothesis testing and p-values. In statistics, we want to figure out whether certain variables are related to each other and how large the effect of that relationship is.

We’re going to follow the six-step process of hypothesis testing for this module and produce results from the NHL dataset!

The process of null hypothesis testing:

  1. Formulate a hypothesis that embodies our prediction (before seeing the data)
  2. Specify null and alternative hypotheses
  3. Collect some data relevant to the hypothesis
  4. Fit a model to the data that represents the alternative hypothesis and compute a test statistic
  5. Compute the probability of the observed value of that statistic assuming that the null hypothesis is true
  6. Assess the “statistical significance” of the result

  1. Formulate a hypothesis that embodies our prediction (before seeing the data):

For the first hypothesis test, I want to look at whether a player’s position in the draft is related to the number of goals that player scores in the NHL.

  2. Specify null and alternative hypotheses:

Null Hypothesis (H0): “Average Goals Scored for players in the 2010 draft in the top half of the draft (1-105) is equal to the bottom half (106-210)”

Alternative Hypothesis (Ha): “Average Goals Scored for players in the 2010 draft in the top half of the draft (1-105) is not equal to the bottom half (106-210)”
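In symbols, the same pair of hypotheses can be written as follows. The notation is introduced here for illustration and is not part of the original write-up: the two symbols stand for the mean goals of players picked 1-105 and 106-210 in the 2010 draft.

```latex
H_0:\ \mu_{\text{top}} = \mu_{\text{bottom}}
\qquad\text{versus}\qquad
H_a:\ \mu_{\text{top}} \neq \mu_{\text{bottom}}
```

Because the alternative uses "not equal", the test is two-sided: a difference in either direction counts as evidence against the null.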

  3. Collect some data relevant to the hypothesis:

# Filtering to the 2010 draft

draft_2010 <- nhl_draft |> 
  filter(year ==  2010)

# Separating into two groups

top_105 <- draft_2010 |> 
  filter(overall_pick %in% (1:105))

bottom_105 <- draft_2010 |> 
  filter(overall_pick %in% (106:210))


# Box plot to look at distribution of goals for each group
p1 <- top_105 |> 
  ggplot(aes(x = goals)) + 
  geom_boxplot() +
  coord_flip() +
  ggtitle("Top Half of Draft")

p2 <- bottom_105 |> 
  ggplot(aes(x = goals)) + 
  geom_boxplot() +
  coord_flip() +
  ggtitle("Bottom Half of Draft")
  

grid.arrange(p1, p2, ncol = 2)

Looking at the two groups above, we can see that, among players who do have goals, the interquartile range is larger for the top half of the draft than for the bottom half.

To compare the means of two groups we can use a two-sample t-test, which evaluates whether the difference between the means is larger than chance alone would plausibly produce.
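To make the test statistic concrete, here is a minimal sketch that computes the pooled two-sample t statistic by hand and checks it against `t.test(..., var.equal = TRUE)`. The vectors `x` and `y` are simulated stand-ins with made-up means and spreads, not the NHL data.

```r
# Simulated stand-ins for the two groups (hypothetical values, not the NHL data)
set.seed(1)
x <- rnorm(72, mean = 58, sd = 80)
y <- rnorm(35, mean = 24, sd = 60)

n_x <- length(x)
n_y <- length(y)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)

# t statistic: difference in means divided by its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n_x + 1 / n_y))

# Degrees of freedom and two-sided p-value
df_manual <- n_x + n_y - 2
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with R's built-in pooled t-test
fit <- t.test(x, y, var.equal = TRUE)
all.equal(t_manual, unname(fit$statistic))
all.equal(p_manual, fit$p.value)
```

Both comparisons should agree, which shows that steps 4 and 5 of the process are what `t.test()` carries out internally.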

Let’s now do steps 4, 5, and 6 at the same time thanks to R’s built-in t-test function!

Let’s compute the values using a two-sample t-test:

t.test(top_105$goals, bottom_105$goals, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  top_105$goals and bottom_105$goals
## t = 2.3264, df = 105, p-value = 0.02192
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.964657 62.258359
## sample estimates:
## mean of x mean of y 
##  58.09722  24.48571
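The output reports df = 105. For a pooled two-sample test the degrees of freedom are n1 + n2 - 2, so 107 values entered the test, which suggests that some `goals` entries are missing and were dropped. A minimal sketch on toy vectors (hypothetical values, not the NHL data) shows that `t.test()` ignores `NA` entries:

```r
# Toy vectors with missing entries (hypothetical values)
x <- c(12, NA, 30, 5, NA, 44)
y <- c(3, 8, NA, 1, 0)

fit <- t.test(x, y, var.equal = TRUE)

# Number of non-missing observations actually used by the test
n_used <- sum(!is.na(x)) + sum(!is.na(y))

# Pooled degrees of freedom equal n_used - 2
unname(fit$parameter) == n_used - 2
```

The same count on the real data would be `sum(!is.na(top_105$goals))` and `sum(!is.na(bottom_105$goals))`.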

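We can see that the p-value is about 0.022. For these examples I’m using a significance threshold (alpha) of 0.03, ideally fixed before looking at the data, so this result is statistically significant and we can reject the null hypothesis in favor of the alternative. I’m also treating 0.8 as the target power, since an 80% chance of detecting a true effect is a conventional goal; this is a design target rather than a value computed from this test. Note that df = 105 means only 107 players entered the test, because `t.test()` drops players whose `goals` value is missing.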
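Power depends on the sample size, the size of the true difference, the spread of the data, and alpha, so it can be checked rather than assumed. A minimal sketch with `power.t.test()` from base R's stats package follows; the `delta` and `sd` values are placeholder assumptions, not estimates from the NHL data, and the function assumes equal group sizes.

```r
# Placeholder inputs (assumptions for illustration only)
assumed_delta <- 30   # true difference in mean goals we want to detect
assumed_sd    <- 70   # common standard deviation of goals in each group
alpha         <- 0.03 # significance threshold used in this write-up

# Sample size per group needed to reach 80% power
needed <- power.t.test(delta = assumed_delta, sd = assumed_sd,
                       sig.level = alpha, power = 0.8)
needed$n

# Power achieved with a given per-group sample size, e.g. 35 players
achieved <- power.t.test(n = 35, delta = assumed_delta, sd = assumed_sd,
                         sig.level = alpha)
achieved$power
```

If the achieved power is well below 0.8, a non-significant result would say little about whether a real difference exists.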

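To conclude, since the p-value is below the threshold and the t-test is significant, we can say that players in the top 105 of the 2010 draft do not have the same average number of goals as players in the bottom 105. The sample means (about 58.1 vs. 24.5 goals) and the 95% confidence interval for the difference (roughly 5 to 62 goals) indicate that the top half averaged more.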
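A p-value says whether a difference is detectable, not how large it is. One common effect-size measure for two means is Cohen's d, the difference in means divided by the pooled standard deviation. The helper below, `cohens_d()`, is a hypothetical function written for this sketch and is not part of base R or the tidyverse.

```r
# Hypothetical helper: Cohen's d with a pooled standard deviation
cohens_d <- function(x, y) {
  # Drop missing values, mirroring what t.test() does
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  n_x <- length(x)
  n_y <- length(y)
  # Pooled standard deviation across both groups
  sp <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2))
  (mean(x) - mean(y)) / sp
}

# Toy check: means differ by 1 and both variances are 1, so d is -1
cohens_d(c(1, 2, 3, NA), c(2, 3, 4))
```

On the draft data the call would be `cohens_d(top_105$goals, bottom_105$goals)`.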

Second Hypothesis:

  1. Formulate a hypothesis that embodies our prediction (before seeing the data):

For the second hypothesis I want to see whether players with more than 100 games have more penalty minutes on average than players who have played 100 games or fewer.

  2. Specify null and alternative hypotheses:

Null Hypothesis (H0): “Average Penalty Minutes for players with more than 100 games is equal to that of players with 100 games or fewer.”

Alternative Hypothesis (Ha): “Average Penalty Minutes for players with more than 100 games is not equal to that of players with 100 games or fewer.”

  3. Collect some data relevant to the hypothesis:

# Filtering to the players with more than 100 games

over_100 <- nhl_draft |> 
  filter(games_played > 100)

# Filtering to the players with 100 games or fewer

under_100 <- nhl_draft |> 
  filter(games_played <= 100)


# Box plot to look at distribution of penalty minutes for each group
p1 <- over_100 |> 
  ggplot(aes(x = penalties_minutes)) + 
  geom_boxplot() +
  coord_flip() +
  ggtitle("Over 100 Games")

p2 <- under_100 |> 
  ggplot(aes(x = penalties_minutes)) + 
  geom_boxplot() +
  coord_flip() +
  ggtitle("100 Games or Fewer")
  

grid.arrange(p1, p2, ncol = 2)

Looking at the two groups above, we can see that players with more than 100 games have a larger interquartile range than players with 100 games or fewer.

To compare the means of two groups we can again use a two-sample t-test to evaluate the difference between the means.

Let’s now do steps 4, 5, and 6 at the same time thanks to R’s built-in t-test function!

Let’s compute the values using a two-sample t-test:

t.test(over_100$penalties_minutes, under_100$penalties_minutes, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  over_100$penalties_minutes and under_100$penalties_minutes
## t = 40.249, df = 5244, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  381.7132 420.8013
## sample estimates:
## mean of x mean of y 
## 418.60369  17.34643

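We can see that the p-value is extremely small (below 2.2e-16), far under the 0.03 threshold I’m using for these examples, so the result is statistically significant and we can reject the null hypothesis in favor of the alternative. As before, I’m treating 0.8 as the target power, since an 80% chance of detecting a true effect is a conventional goal. Players with more than 100 games average about 419 penalty minutes versus about 17 for players with 100 games or fewer, with a 95% confidence interval for the difference of roughly 382 to 421 minutes.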
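Both tests in this write-up set `var.equal = TRUE`, which assumes the two groups share a common variance. When the box plots show very different spreads, Welch's t-test (the default in `t.test()`, with `var.equal = FALSE`) is a common alternative. A minimal sketch on simulated groups with unequal sizes and spreads (made-up values, not the NHL data):

```r
# Simulated groups with very different spreads (hypothetical values)
set.seed(1)
a <- rnorm(40, mean = 400, sd = 450)
b <- rnorm(200, mean = 20, sd = 40)

# Pooled (Student) test, as used in this write-up
pooled <- t.test(a, b, var.equal = TRUE)

# Welch test: does not assume equal variances
welch <- t.test(a, b)

# Welch adjusts the degrees of freedom downward when variances differ
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
c(pooled_p = pooled$p.value, welch_p = welch$p.value)
```

With results as extreme as the penalty-minutes comparison the conclusion is unlikely to change, but for borderline cases like the goals comparison it is worth running both.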