Does Not Playing Hockey Make You Worse At Hockey?

Introduction

Project Motivation: Investigate the impact of pandemic restrictions on hockey player development.

In 2020-2021 many leagues had a shortened season or no season
Players had to find new leagues and/or tournament
Some players did not play that season

We are looking at skaters who played in the OHL during the 2019-2020 and 2020-2021 seasons.

# load the data
library(readr)
sams_ohl_data_request <- read_csv("sams_ohl_data_request.csv")

# filter years
recent <- sams_ohl_data_request %>% 
  filter(season %in% c("2019-2020", "2020-2021", "2021-2022"))

# add points per game columns
recent <- recent %>% 
  mutate(ppg = g/gp)

# add drafted y/n column 
recent <- recent %>% 
  mutate(got_drafted = case_when(!is.na(draft_year) ~ 'Yes',
                                 TRUE ~ 'No'))

# yes is only plays drafted before COVID
recent2 <- recent %>% 
  mutate(got_drafted = case_when(!is.na(draft_year) & draft_year < 2020
                                 ~ 'Yes',
                                 TRUE ~ 'No'))

# make dataframe of drafted players
drafted <- recent %>% 
  filter(!is.na(draft_year)) %>% 
  filter(draft_year < 2020)

rank2 <- recent %>%
  filter(league == "OHL") %>%
  group_by(team_name, season) %>%
  arrange(team_name, season, desc(pm)) %>%
  mutate(rank = 1:n()) %>%
  ungroup()

ohl <- read_csv("sams_ohl_data_request.csv")
ohl_covid <- ohl %>%
  filter(season %in% c("2019-2020","2020-2021", "2021-2022"))
## Add treatment var:

# Filtering only for players who played more than 10 games (should we combine number of games played across leagues?)
ohl_20_21 <- ohl %>%
  filter(season == "2020-2021", gp > 10)

# Split players up into treatment vs non-treatment
ohl_treatment <- ohl %>%
  filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
  group_by(player_id, season) %>%
  mutate(ppg = sum(pts)/sum(gp),
         treatment = ifelse(player_id %in% ohl_20_21$player_id, "Played", "Didn't play")
  ) %>%
  ungroup()

plyr_quality <- ohl %>%
  filter(season == "2019-2020", league == "OHL") %>%
  group_by(player_id) %>%
  mutate(plyr_quality = sum(pts)/sum(gp)) %>%
  filter(duplicated(player_id) == FALSE) %>%
  ungroup() %>%
  select(player_id, plyr_quality)

# View(plyr_quality %>%
#   filter(duplicated(plyr_quality$player_id) == TRUE))
# ## need to deal with duplicate players/rows
# View(ohl_filtered %>%
#   filter(duplicated(ohl_filtered$player_id) == TRUE))

ohl_pm <- ohl %>%
  filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
  group_by(team_name, season) %>%
  arrange(team_name, season, desc(pm)) %>%
  mutate(pm_rank = 1:n(),
         pm_mean = mean(pm),
         pm_relative = pm - mean(pm)
  ) %>%
  ungroup()

# We need to check that each player played in the OHL in 2019-2020 and 2021-2022
ohl_szn <- ohl_pm %>%
  group_by(player_id) %>%
  mutate(played_2019 = season == "2019-2020",
         played_2021 = season == "2021-2022",
         played_both_szn = sum(played_2019) & sum(played_2021)) %>%
  ungroup() %>%
  filter(played_both_szn == TRUE)

# add treatment variable through join
ohl_trt <- ohl_szn %>%
  left_join(select(ohl_treatment, player_id, season, team_name, treatment), by = c("player_id", "team_name", "season")) %>%
  mutate(ppg = pts/gp) 

# add age variable
ohl_age <- ohl_trt %>%
  group_by(player_id, season) %>%
  mutate(year = strsplit(season, "-")[[1]][1]) %>%
  mutate(age = trunc((dob %--% as.Date(paste0(year, "-9-15"))) / years(1)),
         age_continuous = (dob %--% as.Date(paste0(year, "-09-15"))) / years(1)
  ) %>%
  ungroup()
    
# add player quality
ohl_qlty <- ohl_age %>%
  left_join(plyr_quality, by = "player_id") %>%
  group_by(player_id, season) %>%
  mutate(ppg_total = sum(pts)/sum(gp),
         gp_total = sum(gp),
         pts_total = sum(pts)) %>%
  arrange(player_id, season, -plyr_quality) %>%
  filter(duplicated(player_id) == FALSE) %>%
  ungroup()

# were they drafted
ohl_filtered <- ohl_qlty %>%
  mutate(drafted = !is.na(draft_year)) %>%
  select(season, player_id, treatment, first_name, last_name, position, plyr_quality, age, gp_total, pts_total, ppg_total, drafted, draft_year, round, overall_pick_num, age_continuous, pm_relative, pm_rank, pm)

# just post-COVID stats
ohl_treatment_21_22 <- ohl_filtered %>%
  filter(season == "2021-2022")

Data

We are looking at data from the 2019-2020, 2020-2021, and 2021-2022 seasons for junior leagues across Canada and Europe.

Wrangling:

Filtering (unless otherwise specified):

2019-2020 (“pre-COVID”) and 2021-2022 (“post-COVID”) seasons
league == OHL
Only players who played in the OHL during both pre- and post-COVID seasons

Variables added:

points per game (PPG) per season (combined if a player played for multiple teams in a season)
games played (GP) per season (combined if a player played for multiple teams in a season)
treatment (i.e. whether a player played more than 10 games during the COVID season)
age (approximately the oldest a player was in a given season)
player quality approximated by plus-minus (PM) relative to average team PM in pre-COVID season
whether a player was drafted (not totally up to date)
all players includes any player with data from the 2019-2020, 2020-2021 and/or 2021-2022 seasons

Snippet of Raw Data:

recent %>% 
  dplyr::select(name, team_name, season, league, position, 
                gp, g, a, pts, pm) %>%
  head(5) %>% 
  knitr::kable()%>% 
  kable_styling("striped")

name	team_name	season	league	position	gp	g	a	pts	pm
Michael Renwick	Hamilton Bulldogs	2019-2020	OHL	D	59	3	14	17	2
Michael Renwick	Windsor Spitfires	2021-2022	OHL	D	67	11	18	29	28
Grayson Tiller	Barrie Colts	2021-2022	OHL	D	39	1	5	6	-3
Kasper Larsen	Mississauga Steelheads	2021-2022	OHL	D	57	8	35	43	12
Landen Hookey	Soo Greyhounds	2021-2022	OHL	F	43	1	4	5	-2

Player Example:

recent %>% 
  dplyr::select(name, team_name, season, league, position, 
                gp, g, a, pts, pm) %>%
  filter(name == "Brett Harrison") %>%
  arrange(season) %>% 
  knitr::kable() %>% 
  kable_styling("striped")

name	team_name	season	league	position	gp	g	a	pts	pm
Brett Harrison	Oshawa Generals	2019-2020	OHL	F	58	21	16	37	-21
Brett Harrison	Canada White U17	2019-2020	WHC-17	F	6	2	0	2	NA
Brett Harrison	Canada U18	2020-2021	WJC-18	F	7	2	0	2	4
Brett Harrison	KOOVEE	2020-2021	Mestis	F	1	0	0	0	0
Brett Harrison	KOOVEE U20	2020-2021	U20 SM-sarja	F	7	4	5	9	-1
Brett Harrison	Oshawa Generals	2021-2022	OHL	F	65	27	34	61	-15

Exploratory Data Analysis

PPG

Treatment vs PPG

# treatment vs ppg
ggplot(ohl_filtered, aes(x = ppg_total, color = treatment)) +
  geom_density() +
  labs(title = "Players who played during COVID season generally have higher PPG",
       x = "PPG") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

Age vs PPG

# age vs ppg
ggplot(ohl_filtered, aes(x = pm_relative, y = ppg_total)) +
  geom_point(alpha = .5) +
  labs(title = "Positive linear relationship between relative PM and PPG",
       x = "Relative PM") +
  geom_smooth(method = "lm") +
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

GP vs PPG

# gp vs ppg
ggplot(ohl_filtered, aes(x = gp_total, y = ppg_total)) +
  geom_point(alpha = .5) +
  labs(title = "Positive non-linear relationship between relative games played and PPG",
       x = "GP", 
       y = "PPG") +
  theme_bw()

Drafted vs PPG

# drafted vs ppg
ggplot(ohl_filtered) +
  geom_density(aes(x = ppg_total, color = drafted)) +
  labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
       x = "PPG") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

PM vs PPG

# pm vs ppg
ggplot(ohl_filtered, aes(x = pm_relative, y = ppg_total)) +
  geom_point(alpha = .5) +
  labs(title = "Positive linear relationship between relative plus-minus and PPG") +
  theme_bw()

Position vs PPG

# position vs ppg
ggplot(ohl_filtered) +
  geom_density(aes(x = ppg_total, color = position)) +
  labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
       x = "PPG") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

Season vs PPG

# season vs ppg
ggplot(ohl_filtered) +
  geom_density(aes(x = ppg_total, color = position)) +
  labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
       x = "PPG") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

Treatment

Drafted vs Treatment

# drafted status vs treatment
mosaicplot(table("Drafted" = ohl_filtered$drafted,
                 "Treatment" = ohl_filtered$treatment),
           main = "Drafted players more likely to have played during COVID",
           shade = TRUE)

PM vs Treatment

# pm vs treatment
ggplot(ohl_filtered, aes(x = pm_relative, color = treatment)) +
  geom_density() +
  labs("Insignificant difference in relative PM for players who played during COVID season", x = "relative plus minus") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

Age vs Treatment

# age vs treatment
ggplot(ohl_filtered, aes(x = age, color = treatment)) +
  geom_density() +
  labs("Insignificant difference in age for players who played during COVID season", x = "Age") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

GP vs Treatment

# gp vs treatment
ggplot(ohl_filtered, aes(x = gp_total, color = treatment)) +
  geom_density() +
  labs("Players who played during COVID season generally played more games", x = "Games Played") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()

Points Per Game and Games Played

ppg_ohl <- recent %>% 
  filter(league == "OHL") %>%
  filter(ppg > 0) %>% 
  ggplot(aes(x = gp, y = ppg)) +
  geom_point(aes(color = got_drafted), alpha = 0.5, size = 1) +
  labs(title = "OHL", x = "Games Played", y = "Points Per Game",
       color = "Drafted Before \n 2020-2021 Season") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw() +
  theme(legend.position = "none")
ppg_all <- recent %>% 
  filter(ppg > 0) %>% 
  ggplot(aes(x = gp, y = ppg)) +
  geom_point(aes(color = got_drafted), alpha = 0.5, size =1) +
  labs(title = "All Leagues", x = "Games Played", y = "Points Per Game",
       color = "Drafted Before \n 2020-2021 Season") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw()
ppg_ohl + ppg_all

People that played less games had a higher points per game than those with greater game counts.

Draft Status “yes” refers to players drafted before the 2020-2021 season.

Seasons Played by Draft Status

mosaicplot(table(recent$got_drafted, recent$season), shade = TRUE,
           main = element_blank())

During the 2020-2021 season the majority of players had already previously been drafted. The following season was dominated by undrafted players in this dataset.

Possible Confounding Variables:

Draft Pick Number and Games Played

# draft pick number on x-axis, gp on y-axis
drafted %>% 
  ggplot(aes(x = overall_pick_num, y = gp)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
  labs(x = "Draft Pick Number", y = "Games Played", caption = "Any player that was drafted, regardless of draft year.") +
  theme_bw()

Players picked later in the draft recorded more games.

Draft Pick Status and Games Played

recent %>% 
  ggplot(aes(x = gp)) +
  geom_density(aes(color = factor(got_drafted)), alpha = 0.3) +
  labs(x = "Games Played", y = "Density", color = "Draft Status",
       caption = "All players.") +
  scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
  theme_bw() +
  facet_wrap(~season)

According to this graph, during the 2020-2021 season drafted players only played slightly more games than undrafted players. In the 2021-2022 undrafted players particpated in more games.

Plus-Minus

In the following graphs, the blue line acts as a reference line at y = 0. The gold line is the team’s average plus-minus.

Highest Player

rank2 %>% 
  filter(team_name == "Ottawa 67's") %>%
  ggplot(aes(x = player_id, y = pm)) +
  geom_point() +
  geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
  geom_hline(aes(yintercept = 0), color = "royalblue3") +
  labs(title = "Ottawa 67's", subtitle = "Highest Player Plus-Minus", 
       y = "Plus-Minus") +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.title.x = element_blank()) +
  facet_grid(~season)

The Ottawa 67’s contain the individual with the highest plus-minus in the Ontario Hockey League. Overall the team had much lower numbers in the 2021-2022 season. A singular player was not below the mean.

Lowest Player

rank2 %>% 
  filter(team_name == "North Bay Battalion") %>%
  ggplot(aes(x = player_id, y = pm)) +
  geom_point() +
  geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
  geom_hline(aes(yintercept = 0), color = "royalblue3") +
  labs(title = "North Bay Battalion", subtitle = "Lowest Player Plus-Minus", 
       y = "Plus-Minus") +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.title.x = element_blank()) +
  facet_grid(~season)

The North Bay Battalion possess the player with the lowest plus-minus in the Ontario Hockey League. This team greatly improved in the later season. The majority switched to above the mean, opposite of the previous season.

Median Player

rank2 %>% 
  filter(team_name == "Hamilton Bulldogs") %>%
  ggplot(aes(x = player_id, y = pm)) +
  geom_point() +
  geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
  geom_hline(aes(yintercept = 0), color = "royalblue3") +
  labs(title = "Hamilton Bulldogs", subtitle = "Median Player Plus-Minus", 
       y = "Plus-Minus") +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.title.x = element_blank()) +
  facet_grid(~season)

The mean plus-minus across the OHL is -1. The Hamilton Bulldogs is one of the teams with the highest number of players with that score. This graph looks highly similar to the North Bay Battalion.

Teams

teampm <- recent %>%
  filter(league == "OHL") %>%
  group_by(team_name, season) %>%
  arrange(team_name, season, desc(pm)) %>%
  summarize(team_pm = mean(pm))
teampm %>%
  ggplot(aes(x = team_name, y = team_pm)) +
  geom_point() +
  geom_hline(aes(yintercept = mean(team_pm)), color = "darkgoldenrod3") +
  geom_hline(aes(yintercept = 0), color = "royalblue3") +
  labs(title = "Team Plus-Minus", y = "Plus-Minus") +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.title.x = element_blank()) +
  facet_grid(~season)

The distributions of team plus-minus scores in the 2019-2020 and 2021-2022 are alike.

Methods

Before performing causal analysis, we wanted to determine whether playing during the COVID season had a significant effect on PPG when controlling for variables (and interactions between variables) suspected to be associated with the response through EDA. Because PPG is right skewed and bounded between 0 and some positive number, the assumptions of ordinary least squares regression are not met. We performed Gamma regression to more accurately model the relationship between our response and explanatory variables.

Initial model: ppg_alt ~ position * pm_relative + treatment * drafted + gp_total + pm_relative * gp_total + pm_relative * drafted + age_continuous * pm_relative + season

ohl_filtered <- ohl_filtered %>%
  mutate(ppg_alt = ppg_total + .001)
ohl_glm_interaction <- glm(ppg_alt ~ position*pm_relative + treatment*drafted + drafted*season + gp_total + pm_relative*gp_total + pm_relative*drafted + age_continuous*pm_relative + season, 
                          data = ohl_filtered, 
                          family = Gamma(log))
plot(ohl_glm_interaction, which = 1)

#summary(ohl_glm_interaction)

Discussion

Model Results

\(\widehat{PPG + .001} = -5.983 + .602*forward + .00412*relative \hspace{.1 cm} PM +\\ .0622*played \hspace{.1 cm} during \hspace{.1 cm} covid + .365*drafted + .0174* GP + .204*age \\ + .0266*season + .000635*forward*relative \hspace{.1 cm} PM \\+ .338*played \hspace{.1 cm} during \hspace{.1 cm} covid * drafted -.0000285*relative \hspace{.1 cm} PM*GP\\ + .00570*relative \hspace{.1 cm} PM*drafted + .000470*relative \hspace{.1 cm} PM*age\)

Variables found to be significant at the \(\alpha = .001\) level in our model: - position - relative plus-minus - drafted status - games played - age

Variables found to be significant at the \(\alpha = .05\) level in our model: - interaction of our treatment variable and drafted status

Limitations

What are the model assumptions for gamma regression?

Our response variable to approximate player performance was points per game. It is a simple measure that may not adequately capture all aspects of a player’s performance. Though we control for position in our model, PPG will generally be higher for forwards than defensemen. Defensive performance is not well-approximated by PPG, and this measure would bias offensive defensemen as the best-performing. Additionally, good defensive forwards may be undervalued by this metric. A more nuanced measure that incorporates important defensive statistics like turnovers created, passes completed, shots blocked, etc. may better approximate player performance.

We defined player quality in terms of relative plus-minus given by \(PM_{relative} = PM_{player} - \mu_{team\hspace{.05cm}PM}\). Plus-minus does not account for all of the statistics that might measure the quality of a player (shots blocked, turnovers created, passes completed, etc) and does not consider factors like sheltering. However, plus-minus is one of the best options to approximate player quality given our limited dataset.

Player Performance: - Defined as \(PPG = \frac{(points + assists)}{games \hspace{.1 cm} played}\) - May not adequately capture all aspects of a player’s performance - Though we control for position in our model, PPG will generally be higher for forwards than defensemen - Defensive performance is not well-approximated by PPG, and this measure would bias offensive defensemen as the best-performing defensemen. - Good defensive forwards may be undervalued using this metric - A more nuanced measure that incorporates important defensive statistics like turnovers created, passes completed, shots blocked, etc. may better approximate player performance.

Player quality - defined by \(player quality = PM_{relative} = PM_{player} - \mu_{team\hspace{.05cm}PM}\) - Plus minus does not account for all of the statistics that might measure the quality of a player (shots blocked, turnovers created, passes completed, etc) - Does not consider factors like sheltering

Treatment Variable:

Our treatment variable was defined as players who participated in more than 10 games in a single league or tournament during the COVID season. We did not control for quality of league. Additionally, though typically only the best players are selected to play in championships/tournaments, many of them were not included as treated in our analysis because of their low game count. Conversely, players who played only a few games in mediocre leagues may have been counted as treated. Players who totaled more than 10 games cumulatively across two or more different leagues would also not be considered treated, though they played a significant number of games. Future analysis could provide a more distinct view of the treatment variable in which league quality and number of games played are controlled for.

Further, games played was a significant variable in our model, but we did not control for percentage of length of season. Both the 2020-2021 and 2021-2022 seasons were shortened due to the pandemic.

Next steps for future work:

Redefining/tweaking treatment and response variables
Preform causal analysis, with a focus on using BART and propensity scores

Acknowledgments

Thank you to our advisor Dr. Sam Ventura (and assistant Dominic Ventura) and our professor, Dr. Ron Yurko. We would also like to thank our TAs, Nick Kissel, Meg Ellingwood, Wanshan Li, Kenta Takatsu, and YJ Choe. Lastly, thank you to Professor Ben Baumer, Smith College and Dr. Ryne Vankrevelen, Elon University. We could not have completed this project without you all and we sincerely appreciate your help!