Project Motivation: Investigate the impact of pandemic restrictions on hockey player development.
We are looking at skaters who played in the OHL during the 2019-2020 and 2020-2021 seasons.
# load the data
library(readr)
sams_ohl_data_request <- read_csv("sams_ohl_data_request.csv")
# filter years
recent <- sams_ohl_data_request %>%
filter(season %in% c("2019-2020", "2020-2021", "2021-2022"))
# add points per game columns
recent <- recent %>%
mutate(ppg = g/gp)
# add drafted y/n column
recent <- recent %>%
mutate(got_drafted = case_when(!is.na(draft_year) ~ 'Yes',
TRUE ~ 'No'))
# yes is only plays drafted before COVID
recent2 <- recent %>%
mutate(got_drafted = case_when(!is.na(draft_year) & draft_year < 2020
~ 'Yes',
TRUE ~ 'No'))
# make dataframe of drafted players
drafted <- recent %>%
filter(!is.na(draft_year)) %>%
filter(draft_year < 2020)
rank2 <- recent %>%
filter(league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
mutate(rank = 1:n()) %>%
ungroup()
ohl <- read_csv("sams_ohl_data_request.csv")
ohl_covid <- ohl %>%
filter(season %in% c("2019-2020","2020-2021", "2021-2022"))
## Add treatment var:
# Filtering only for players who played more than 10 games (should we combine number of games played across leagues?)
ohl_20_21 <- ohl %>%
filter(season == "2020-2021", gp > 10)
# Split players up into treatment vs non-treatment
ohl_treatment <- ohl %>%
filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
group_by(player_id, season) %>%
mutate(ppg = sum(pts)/sum(gp),
treatment = ifelse(player_id %in% ohl_20_21$player_id, "Played", "Didn't play")
) %>%
ungroup()
plyr_quality <- ohl %>%
filter(season == "2019-2020", league == "OHL") %>%
group_by(player_id) %>%
mutate(plyr_quality = sum(pts)/sum(gp)) %>%
filter(duplicated(player_id) == FALSE) %>%
ungroup() %>%
select(player_id, plyr_quality)
# View(plyr_quality %>%
# filter(duplicated(plyr_quality$player_id) == TRUE))
# ## need to deal with duplicate players/rows
# View(ohl_filtered %>%
# filter(duplicated(ohl_filtered$player_id) == TRUE))
ohl_pm <- ohl %>%
filter(season %in% c("2019-2020", "2021-2022"), league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
mutate(pm_rank = 1:n(),
pm_mean = mean(pm),
pm_relative = pm - mean(pm)
) %>%
ungroup()
# We need to check that each player played in the OHL in 2019-2020 and 2021-2022
ohl_szn <- ohl_pm %>%
group_by(player_id) %>%
mutate(played_2019 = season == "2019-2020",
played_2021 = season == "2021-2022",
played_both_szn = sum(played_2019) & sum(played_2021)) %>%
ungroup() %>%
filter(played_both_szn == TRUE)
# add treatment variable through join
ohl_trt <- ohl_szn %>%
left_join(select(ohl_treatment, player_id, season, team_name, treatment), by = c("player_id", "team_name", "season")) %>%
mutate(ppg = pts/gp)
# add age variable
ohl_age <- ohl_trt %>%
group_by(player_id, season) %>%
mutate(year = strsplit(season, "-")[[1]][1]) %>%
mutate(age = trunc((dob %--% as.Date(paste0(year, "-9-15"))) / years(1)),
age_continuous = (dob %--% as.Date(paste0(year, "-09-15"))) / years(1)
) %>%
ungroup()
# add player quality
ohl_qlty <- ohl_age %>%
left_join(plyr_quality, by = "player_id") %>%
group_by(player_id, season) %>%
mutate(ppg_total = sum(pts)/sum(gp),
gp_total = sum(gp),
pts_total = sum(pts)) %>%
arrange(player_id, season, -plyr_quality) %>%
filter(duplicated(player_id) == FALSE) %>%
ungroup()
# were they drafted
ohl_filtered <- ohl_qlty %>%
mutate(drafted = !is.na(draft_year)) %>%
select(season, player_id, treatment, first_name, last_name, position, plyr_quality, age, gp_total, pts_total, ppg_total, drafted, draft_year, round, overall_pick_num, age_continuous, pm_relative, pm_rank, pm)
# just post-COVID stats
ohl_treatment_21_22 <- ohl_filtered %>%
filter(season == "2021-2022")
We are looking at data from the 2019-2020, 2020-2021, and 2021-2022 seasons for junior leagues across Canada and Europe.
Filtering (unless otherwise specified):
2019-2020 (“pre-COVID”) and 2021-2022 (“post-COVID”) seasons
league == OHL
Only players who played in the OHL during both pre- and post-COVID seasons
Variables added:
points per game (PPG) per season (combined if a player played for multiple teams in a season)
games played (GP) per season (combined if a player played for multiple teams in a season)
treatment (i.e. whether a player played more than 10 games during the COVID season)
age (approximately the oldest a player was in a given season)
player quality approximated by plus-minus (PM) relative to average team PM in pre-COVID season
whether a player was drafted (not totally up to date)
all players includes any player with data from the 2019-2020, 2020-2021 and/or 2021-2022 seasons
Snippet of Raw Data:
recent %>%
dplyr::select(name, team_name, season, league, position,
gp, g, a, pts, pm) %>%
head(5) %>%
knitr::kable()%>%
kable_styling("striped")
| name | team_name | season | league | position | gp | g | a | pts | pm |
|---|---|---|---|---|---|---|---|---|---|
| Michael Renwick | Hamilton Bulldogs | 2019-2020 | OHL | D | 59 | 3 | 14 | 17 | 2 |
| Michael Renwick | Windsor Spitfires | 2021-2022 | OHL | D | 67 | 11 | 18 | 29 | 28 |
| Grayson Tiller | Barrie Colts | 2021-2022 | OHL | D | 39 | 1 | 5 | 6 | -3 |
| Kasper Larsen | Mississauga Steelheads | 2021-2022 | OHL | D | 57 | 8 | 35 | 43 | 12 |
| Landen Hookey | Soo Greyhounds | 2021-2022 | OHL | F | 43 | 1 | 4 | 5 | -2 |
Player Example:
recent %>%
dplyr::select(name, team_name, season, league, position,
gp, g, a, pts, pm) %>%
filter(name == "Brett Harrison") %>%
arrange(season) %>%
knitr::kable() %>%
kable_styling("striped")
| name | team_name | season | league | position | gp | g | a | pts | pm |
|---|---|---|---|---|---|---|---|---|---|
| Brett Harrison | Oshawa Generals | 2019-2020 | OHL | F | 58 | 21 | 16 | 37 | -21 |
| Brett Harrison | Canada White U17 | 2019-2020 | WHC-17 | F | 6 | 2 | 0 | 2 | NA |
| Brett Harrison | Canada U18 | 2020-2021 | WJC-18 | F | 7 | 2 | 0 | 2 | 4 |
| Brett Harrison | KOOVEE | 2020-2021 | Mestis | F | 1 | 0 | 0 | 0 | 0 |
| Brett Harrison | KOOVEE U20 | 2020-2021 | U20 SM-sarja | F | 7 | 4 | 5 | 9 | -1 |
| Brett Harrison | Oshawa Generals | 2021-2022 | OHL | F | 65 | 27 | 34 | 61 | -15 |
# treatment vs ppg
ggplot(ohl_filtered, aes(x = ppg_total, color = treatment)) +
geom_density() +
labs(title = "Players who played during COVID season generally have higher PPG",
x = "PPG") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# age vs ppg
ggplot(ohl_filtered, aes(x = pm_relative, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive linear relationship between relative PM and PPG",
x = "Relative PM") +
geom_smooth(method = "lm") +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
# gp vs ppg
ggplot(ohl_filtered, aes(x = gp_total, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive non-linear relationship between relative games played and PPG",
x = "GP",
y = "PPG") +
theme_bw()
# drafted vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = drafted)) +
labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
x = "PPG") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# pm vs ppg
ggplot(ohl_filtered, aes(x = pm_relative, y = ppg_total)) +
geom_point(alpha = .5) +
labs(title = "Positive linear relationship between relative plus-minus and PPG") +
theme_bw()
# position vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = position)) +
labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
x = "PPG") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# season vs ppg
ggplot(ohl_filtered) +
geom_density(aes(x = ppg_total, color = position)) +
labs(title = "PPG generally greater for forwards in 2019-20 and 2021-22 season",
x = "PPG") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# drafted status vs treatment
mosaicplot(table("Drafted" = ohl_filtered$drafted,
"Treatment" = ohl_filtered$treatment),
main = "Drafted players more likely to have played during COVID",
shade = TRUE)
# pm vs treatment
ggplot(ohl_filtered, aes(x = pm_relative, color = treatment)) +
geom_density() +
labs("Insignificant difference in relative PM for players who played during COVID season", x = "relative plus minus") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# age vs treatment
ggplot(ohl_filtered, aes(x = age, color = treatment)) +
geom_density() +
labs("Insignificant difference in age for players who played during COVID season", x = "Age") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
# gp vs treatment
ggplot(ohl_filtered, aes(x = gp_total, color = treatment)) +
geom_density() +
labs("Players who played during COVID season generally played more games", x = "Games Played") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
ppg_ohl <- recent %>%
filter(league == "OHL") %>%
filter(ppg > 0) %>%
ggplot(aes(x = gp, y = ppg)) +
geom_point(aes(color = got_drafted), alpha = 0.5, size = 1) +
labs(title = "OHL", x = "Games Played", y = "Points Per Game",
color = "Drafted Before \n 2020-2021 Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw() +
theme(legend.position = "none")
ppg_all <- recent %>%
filter(ppg > 0) %>%
ggplot(aes(x = gp, y = ppg)) +
geom_point(aes(color = got_drafted), alpha = 0.5, size =1) +
labs(title = "All Leagues", x = "Games Played", y = "Points Per Game",
color = "Drafted Before \n 2020-2021 Season") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw()
ppg_ohl + ppg_all
People that played less games had a higher points per game than those with greater game counts.
Draft Status “yes” refers to players drafted before the 2020-2021 season.
mosaicplot(table(recent$got_drafted, recent$season), shade = TRUE,
main = element_blank())
During the 2020-2021 season the majority of players had already previously been drafted. The following season was dominated by undrafted players in this dataset.
# draft pick number on x-axis, gp on y-axis
drafted %>%
ggplot(aes(x = overall_pick_num, y = gp)) +
geom_point(alpha = 0.5) +
geom_smooth() +
labs(x = "Draft Pick Number", y = "Games Played", caption = "Any player that was drafted, regardless of draft year.") +
theme_bw()
Players picked later in the draft recorded more games.
recent %>%
ggplot(aes(x = gp)) +
geom_density(aes(color = factor(got_drafted)), alpha = 0.3) +
labs(x = "Games Played", y = "Density", color = "Draft Status",
caption = "All players.") +
scale_color_manual(values = c("royalblue3", "darkgoldenrod3")) +
theme_bw() +
facet_wrap(~season)
According to this graph, during the 2020-2021 season drafted players only played slightly more games than undrafted players. In the 2021-2022 undrafted players particpated in more games.
In the following graphs, the blue line acts as a reference line at y = 0. The gold line is the team’s average plus-minus.
rank2 %>%
filter(team_name == "Ottawa 67's") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Ottawa 67's", subtitle = "Highest Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The Ottawa 67’s contain the individual with the highest plus-minus in the Ontario Hockey League. Overall the team had much lower numbers in the 2021-2022 season. A singular player was not below the mean.
rank2 %>%
filter(team_name == "North Bay Battalion") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "North Bay Battalion", subtitle = "Lowest Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The North Bay Battalion possess the player with the lowest plus-minus in the Ontario Hockey League. This team greatly improved in the later season. The majority switched to above the mean, opposite of the previous season.
rank2 %>%
filter(team_name == "Hamilton Bulldogs") %>%
ggplot(aes(x = player_id, y = pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Hamilton Bulldogs", subtitle = "Median Player Plus-Minus",
y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The mean plus-minus across the OHL is -1. The Hamilton Bulldogs is one of the teams with the highest number of players with that score. This graph looks highly similar to the North Bay Battalion.
teampm <- recent %>%
filter(league == "OHL") %>%
group_by(team_name, season) %>%
arrange(team_name, season, desc(pm)) %>%
summarize(team_pm = mean(pm))
teampm %>%
ggplot(aes(x = team_name, y = team_pm)) +
geom_point() +
geom_hline(aes(yintercept = mean(team_pm)), color = "darkgoldenrod3") +
geom_hline(aes(yintercept = 0), color = "royalblue3") +
labs(title = "Team Plus-Minus", y = "Plus-Minus") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(),
axis.title.x = element_blank()) +
facet_grid(~season)
The distributions of team plus-minus scores in the 2019-2020 and 2021-2022 are alike.
Before performing causal analysis, we wanted to determine whether playing during the COVID season had a significant effect on PPG when controlling for variables (and interactions between variables) suspected to be associated with the response through EDA. Because PPG is right skewed and bounded between 0 and some positive number, the assumptions of ordinary least squares regression are not met. We performed Gamma regression to more accurately model the relationship between our response and explanatory variables.
Initial model: ppg_alt ~ position * pm_relative + treatment * drafted + gp_total + pm_relative * gp_total + pm_relative * drafted + age_continuous * pm_relative + season
ohl_filtered <- ohl_filtered %>%
mutate(ppg_alt = ppg_total + .001)
ohl_glm_interaction <- glm(ppg_alt ~ position*pm_relative + treatment*drafted + drafted*season + gp_total + pm_relative*gp_total + pm_relative*drafted + age_continuous*pm_relative + season,
data = ohl_filtered,
family = Gamma(log))
plot(ohl_glm_interaction, which = 1)
#summary(ohl_glm_interaction)
\(\widehat{PPG + .001} = -5.983 + .602*forward + .00412*relative \hspace{.1 cm} PM +\\ .0622*played \hspace{.1 cm} during \hspace{.1 cm} covid + .365*drafted + .0174* GP + .204*age \\ + .0266*season + .000635*forward*relative \hspace{.1 cm} PM \\+ .338*played \hspace{.1 cm} during \hspace{.1 cm} covid * drafted -.0000285*relative \hspace{.1 cm} PM*GP\\ + .00570*relative \hspace{.1 cm} PM*drafted + .000470*relative \hspace{.1 cm} PM*age\)
Variables found to be significant at the \(\alpha = .001\) level in our model: - position - relative plus-minus - drafted status - games played - age
Variables found to be significant at the \(\alpha = .05\) level in our model: - interaction of our treatment variable and drafted status
What are the model assumptions for gamma regression?
Our response variable to approximate player performance was points per game. It is a simple measure that may not adequately capture all aspects of a player’s performance. Though we control for position in our model, PPG will generally be higher for forwards than defensemen. Defensive performance is not well-approximated by PPG, and this measure would bias offensive defensemen as the best-performing. Additionally, good defensive forwards may be undervalued by this metric. A more nuanced measure that incorporates important defensive statistics like turnovers created, passes completed, shots blocked, etc. may better approximate player performance.
We defined player quality in terms of relative plus-minus given by \(PM_{relative} = PM_{player} - \mu_{team\hspace{.05cm}PM}\). Plus-minus does not account for all of the statistics that might measure the quality of a player (shots blocked, turnovers created, passes completed, etc) and does not consider factors like sheltering. However, plus-minus is one of the best options to approximate player quality given our limited dataset.
Player Performance: - Defined as \(PPG = \frac{(points + assists)}{games \hspace{.1 cm} played}\) - May not adequately capture all aspects of a player’s performance - Though we control for position in our model, PPG will generally be higher for forwards than defensemen - Defensive performance is not well-approximated by PPG, and this measure would bias offensive defensemen as the best-performing defensemen. - Good defensive forwards may be undervalued using this metric - A more nuanced measure that incorporates important defensive statistics like turnovers created, passes completed, shots blocked, etc. may better approximate player performance.
Player quality - defined by \(player quality = PM_{relative} = PM_{player} - \mu_{team\hspace{.05cm}PM}\) - Plus minus does not account for all of the statistics that might measure the quality of a player (shots blocked, turnovers created, passes completed, etc) - Does not consider factors like sheltering
Treatment Variable:
Our treatment variable was defined as players who participated in more than 10 games in a single league or tournament during the COVID season. We did not control for quality of league. Additionally, though typically only the best players are selected to play in championships/tournaments, many of them were not included as treated in our analysis because of their low game count. Conversely, players who played only a few games in mediocre leagues may have been counted as treated. Players who totaled more than 10 games cumulatively across two or more different leagues would also not be considered treated, though they played a significant number of games. Future analysis could provide a more distinct view of the treatment variable in which league quality and number of games played are controlled for.
Further, games played was a significant variable in our model, but we did not control for percentage of length of season. Both the 2020-2021 and 2021-2022 seasons were shortened due to the pandemic.
Next steps for future work:
Thank you to our advisor Dr. Sam Ventura (and assistant Dominic Ventura) and our professor, Dr. Ron Yurko. We would also like to thank our TAs, Nick Kissel, Meg Ellingwood, Wanshan Li, Kenta Takatsu, and YJ Choe. Lastly, thank you to Professor Ben Baumer, Smith College and Dr. Ryne Vankrevelen, Elon University. We could not have completed this project without you all and we sincerely appreciate your help!