NHL Head Injuries Before and After Majority Visor Use
Comparing the 2005-2006 and 2024-2025 Seasons
Jamie Benn of the Dallas Stars is one of the last four players to not wear a visor. Photo: Richard A. Whittaker/Icon Sports
Avalanche Nathan MacKinnon was hit in the mouth by a puck May 11th, 2026 during a playoff game. He returned to the game. Source: ESPN, Associated Press
Introduction
Fast paced, on a near-frictionless surface with blades strapped to everyone’s feet, and hard pieces of rubber being shot across the ice at speeds that can exceed 100 miles per hour, it is unsurprising that hockey is a dangerous sport. Now consider that it is part of the strategy to slam opposing players into the rink boards, and the long tradition of fights – even if those are technically against the rules– and that NHL teams play 82 games between September and April.
My data set is the NHL Injury Database, compiled by LW3H, from data from the teams, puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, consisting of injury data for all NHL teams to from the 2000-2001 season to midway through 2025-2026, when I downloaded it. It is posted on the NHL Injury Viz blog page, and Tableau Public. I joined this with team rosters and player stats from Hockey-reference.com, official NHL data provided by sportsradar.
Collegiate hockey wears full face protection. The Professional Womens Hockey League, entering their third playoffs final, wear full facial protection. The American Hockey League – the top minor league in North America – has mandated visors since 2006-2007. Every other level of hockey mandates facial protection – the NHL was the last (Associated P)
The NHL mandated the use of helmet visors starting for the 2013-2014 season for all players who had played less than 26 games. The final incident in the long debate over visors was Marc Staal of the New York Ranger’s bad eye injury in the spring of 2013 from being hit in face with a puck. The National Hockey League Player’s Association is reported to have had a “strong majority” in favor of the mandate that June, compared to prior to Staal’s injury, when they had rejected a mandate (Halford).
73% of the league were voluntarily wearing visors already in 2012-2013, when Staal was injured, which is why I did not choose that year for my analysis (Halford). In 2004-2005, only 38% of players wore visors (CBC). By 2024-2025, all but four players wore visors: Dallas Star’s Jamie Benn (above), Minnesota Wild’s Zach Bogosian, of the Nashville Predators’s Ryan O’Reilly, and San Jose Sharks’ Ryan Reaves. The question regarding visors has mostly shifted to which of the four will be the last visorless player (Johnson, Trettenero).
Variable (dataset)
Defintion
Notes
Season (both)
Season of play, presented as the year it ended it
either 2006 or 2025
Team (both)
Team player was signed with
used in joining the datasets
Position (injuries)
Position played (Forward, Defense, or Goalie)
also indicates retired – filtered for only Forward and Defense
fullname (injuries) /Player (roster)
Player’s given and surname
had to wrangle injuries dataset to create to join to rosters
InjuryType (injuries)
What the injury or illness was
injurygroup (created)
classified InjuryType by area of body
head, upper or lower body, or other for sickness, undisclosed, non location specific injuries
is_concussion (created)
Concussion (1) or not (0)
for filtered data set only, for logistic regression
GamesMissed (injuries)
how many games were missed
Indicates severity of injury
Nationality (roster)
What country the player is from
modified from original column “Birth” due to formatting
Pos (roster)
also position, but classifies as Center, Left and Right Wing, Defense, and Goalie
Age (roster)
Players age as of February of that season
height (roster)
height in inches
wrangled this column from original
Wt (roster)
weight in pounds
Shoots (roster)
side player shoots puck from L or R
often opposite of dominate hand
Exp (Experience) (roster)
Years in league before that season
GP (Games Played) (roster)
how many games played that season
specifically for that team
Goals
Goals scored
Assists
Goals they contributed to
usually by making one of the last two passes
PTS (Points) (Roster)
Sum of Goals and Assists
+/- (plus minus) (Roster
Differential of goals scored for and against when player is on the ice
PIM (Roster)
Penalties in minutes
minor penalties are 2 min, major is 5, game misconduct is 10
TOI (Roster
Time on Ice, total for whole season
displayed as mm:ss
ATOI (roster)
Average time on ice, TOI/Games played
displayed as mm:ss
TOI_min (roster)
Time on ice, total
converted to minutes as a decimal
ATOI_min (roster)
Average time on ice, TOI/Games played
converted to minutes as a decimal
Have majority use of visors improved player safety in regards to head injuries?
Two important notes about this dataset:
One: Injuries are not independent necessarily independent, as the same players can get repeat injuries, and are often injuring each other.
Two: Only injuries that meant the player missed subsequent games appear on this database. The above picture of Nathan MacKinnon’s injury would not appear in this dataset later this week as he came back in that game, and he did not miss the next game.
Conclusions
Once again, my results should be taken with a grain of salt as my observations are not independent. My chi-squared analysis of association between InjuryType and Season showed no association with a p-value of .1636 This was surprising, as the graph suggests drop offs in nearly all head injury types. I wonder if a more focused test on just proportions of concussion injuries, for example, would produce the same results. Similarly, the difference of two means 95% confidence interval of the average of Games Missed for all head injuries in 2006 and 2025 seasons produced a range that included zero, suggesting there is no difference in injury severity between the two seasons. This result was initially surprising, considering some of the high values for Games Missed on my vizualization for the Data 110 project, but upon looking at the box plot here, I realize they were outliers. Also, I don’t think I transformed the data for the 110 visualization.
My logistic regression was a little frustrating. I tried to use the machine learning method, but was stalled by the output only showing the coefficients and not the p-values. I figured it was best to do it manually, a way I knew I understood. It is also frustrating that once again, I have made a poor model. My final model is is_concussion ~ Age + GP + Wt, and the AIC only got down to 138.71. Surprisingly, according to my confusion matrix, my accuracy is 63.9%
One problem with my dataset is I had no way of knowing if a player with a head injury – especially in ’06 – was wearing a visor or not. While I was able to find a site that showed players current and most recent gear, it did not show gear (or lack of) choices from years ago, nor would it have been efficient to acquire the data from.
Future directions I would like to go to, would be looking at comparisons to collegiate hockey injuries and the PWHL. While it seemed like there are datasets on the first, it seemed hard to access, and the PWHL is only in it’s third season that the data might just not be out there yet.
References
Associated Press. “Avs Star MacKinnon Bloodied after Puck to Face from Teammate.” ESPN.Com, 12 May 2026, https://www.espn.com/nhl/story/_/id/48746176/avs-star-mackinnon-bloodied-puck-face-teammate.
Associated Press. “Use of Visors up for Debate in Rough-&-Tumble NHL.” ESPN.Com, https://www.espn.com/espn/wire/_/section/nhl/id/9048371. Accessed 14 May 2026.
Benson, B. W., et al. “The Impact of Face Shield Use on Concussions in Ice Hockey: A Multivariate Analysis.” British Journal of Sports Medicine, vol. 36, no. 1, Feb. 2002, pp. 27–32. Original Articles. bjsm.bmj.com, https://doi.org/10.1136/bjsm.36.1.27.
CBC. “Use of Visors in NHL on the Rise: Report.” CBC Sports, 18 Oct. 2005. CBC.ca, https://www.cbc.ca/sports/hockey/use-of-visors-in-nhl-on-the-rise-report-1.528068.
Halford, Mike. “NHL Makes Visors Mandatory for New Players.” NBC Sports, 4 June 2013, https://www.nbcsports.com/nhl/news/nhl-makes-visors-mandatory-for-new-players.
Johnston, Chris. “Why 4 NHL Players Still Don’t Wear Visors, and Who Will Be the Last Standing?” The New York Times, 13 Mar. 2026. NYTimes.com, https://www.nytimes.com/athletic/7108800/2026/03/12/nhl-players-visor-holdouts/.
Kennedy, Ian. “The Last Players Not to Wear Visors in the NHL.” The Hockey News, 22 Apr. 2023, https://thehockeynews.com/news/news/the-last-players-not-to-wear-visors-in-the-nhl.
Warning: package 'patchwork' was built under R version 4.5.3
setwd("~/Schol Stuff/Montgomery College 2025/Math 217 Stat for Scientists/Final Project")nhlinjury <-read_csv("NHL Injury Database_data.csv") Roster06 <-read_excel("2006_2025Rosters.xlsx", sheet =1)Roster25 <-read_excel("2006_2025Rosters.xlsx", sheet =2)
Preview and Clean The Rosters
starting with the Rosters and player stats: two different tables from Hockey-reference.com were combined per team in Excel, and not all columns were used.
head(Roster06)
# A tibble: 6 × 21
Season Team Player Birth Nationality Pos Age Ht Wt `S/C` Exp
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 2006 Anaheim François… ca CA CA D 25 5-11 208 L/- 1
2 2006 Anaheim Kip Bren… ca CA CA LW 25 6-4 230 L/- 3
3 2006 Anaheim Ilya Bry… su SU SU G 25 6-3 213 -/L 2
4 2006 Anaheim Keith Ca… us US US D 35 6-1 207 L/- 13
5 2006 Anaheim Joe DiPe… ca CA CA D 26 6-2 199 L/- 1
6 2006 Anaheim Sergei F… su SU SU C 36 6-2 207 L/- 14
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
# Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
# ATOI <chr>
head(Roster25)
# A tibble: 6 × 21
Season Team Player Birth Nationality Pos Age Ht Wt `S/C` Exp
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 2025 Anaheim Leo Carl… se SE SE C 20 6-3 208 L/- 1
2 2025 Anaheim Sam Cola… us US US RW 23 6-2 213 R/- 1
3 2025 Anaheim Lukáš Do… cz CZ CZ G 24 6-2 190 -/L 3
4 2025 Anaheim Brian Du… us US US D 33 6-4 215 L/- 11
5 2025 Anaheim Robby Fa… ca CA CA C 29 5-11 185 L/- 8
6 2025 Anaheim Cam Fowl… ca CA CA D 33 6-2 213 L/- 14
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
# Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
# ATOI <chr>
Nationality was created as the lowercase characters in ‘Birth’ were a country flag image on the website.
Bind the rosters
Combine the two roster dataframes into one before any other cleaning is done to them:
# change Roster06 Goals to numeric so the bind worksRoster06$Goals <-as.numeric(Roster06$Goals) Rosterall <-bind_rows(Roster06, Roster25)dim(Rosterall)
[1] 2205 21
head(Rosterall)
# A tibble: 6 × 21
Season Team Player Birth Nationality Pos Age Ht Wt `S/C` Exp
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
1 2006 Anaheim François… ca CA CA D 25 5-11 208 L/- 1
2 2006 Anaheim Kip Bren… ca CA CA LW 25 6-4 230 L/- 3
3 2006 Anaheim Ilya Bry… su SU SU G 25 6-3 213 -/L 2
4 2006 Anaheim Keith Ca… us US US D 35 6-1 207 L/- 13
5 2006 Anaheim Joe DiPe… ca CA CA D 26 6-2 199 L/- 1
6 2006 Anaheim Sergei F… su SU SU C 36 6-2 207 L/- 14
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
# Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
# ATOI <chr>
Making Numeric and Seperating Columns (or both)
Time on Ice (TOI) and Average Time on Ice (ATOI) need converted to numeric
Experience (Exp) (years in league) needs the R for rookie replaced with 0, and converted to numeric
# ms() pulls out the mm:ss format, as.numeric converts to total seconds, divide by 60 to make minutesRosterall$TOI_min <-as.numeric(ms(Rosterall$TOI))/60Rosterall$ATOI_min <-as.numeric(ms(Rosterall$ATOI))/60Rosterall$Exp <-as.numeric(gsub("R", "0", Rosterall$Exp))head(Rosterall)
# A tibble: 6 × 23
Season Team Player Birth Nationality Pos Age Ht Wt `S/C` Exp
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 2006 Anaheim François… ca CA CA D 25 5-11 208 L/- 1
2 2006 Anaheim Kip Bren… ca CA CA LW 25 6-4 230 L/- 3
3 2006 Anaheim Ilya Bry… su SU SU G 25 6-3 213 -/L 2
4 2006 Anaheim Keith Ca… us US US D 35 6-1 207 L/- 13
5 2006 Anaheim Joe DiPe… ca CA CA D 26 6-2 199 L/- 1
6 2006 Anaheim Sergei F… su SU SU C 36 6-2 207 L/- 14
# ℹ 12 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
# Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
# ATOI <chr>, TOI_min <dbl>, ATOI_min <dbl>
Ht (height) in feet-inches needs to be separated and recombined. S/C (shoots/catches) needs to be separated as well
Rosterall2 <- Rosterall |>separate(Ht, into =c("feet", "inch"), sep ="-", convert =TRUE) |>mutate(height =as.numeric(feet)*12+as.numeric(inch), .after ="inch") |>#multiply feet by 12, add to inches to get height in inches as new column separate(`S/C`, into=c("shoots", "catches"), sep ="/", convert =TRUE) # separate S/C and names them; not concerned about the dashes in place of NAs as Catches is only relevant for goalies and they should be the only ones without "shoots" and I'm filtering them out! head(Rosterall2)
# A tibble: 6 × 26
Season Team Player Birth Nationality Pos Age feet inch height Wt
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <int> <dbl> <dbl>
1 2006 Anaheim Françoi… ca CA CA D 25 5 11 71 208
2 2006 Anaheim Kip Bre… ca CA CA LW 25 6 4 76 230
3 2006 Anaheim Ilya Br… su SU SU G 25 6 3 75 213
4 2006 Anaheim Keith C… us US US D 35 6 1 73 207
5 2006 Anaheim Joe DiP… ca CA CA D 26 6 2 74 199
6 2006 Anaheim Sergei … su SU SU C 36 6 2 74 207
# ℹ 15 more variables: shoots <chr>, catches <chr>, Exp <dbl>,
# `Birth Date` <chr>, Summary <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>,
# PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
# ATOI_min <dbl>
Rosterall3 <- Rosterall2 |># removing a few columns I absolutely don't need for ease of viewingselect(-Birth, -feet, -inch, -catches, -Summary)head(Rosterall3)
# A tibble: 6 × 21
Season Team Player Nationality Pos Age height Wt shoots Exp
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 2006 Anaheim François Bea… CA D 25 71 208 L 1
2 2006 Anaheim Kip Brennan CA LW 25 76 230 L 3
3 2006 Anaheim Ilya Bryzgal… SU G 25 75 213 - 2
4 2006 Anaheim Keith Carney US D 35 73 207 L 13
5 2006 Anaheim Joe DiPenta CA D 26 74 199 L 1
6 2006 Anaheim Sergei Fedor… SU C 36 74 207 L 14
# ℹ 11 more variables: `Birth Date` <chr>, GP <dbl>, Goals <dbl>,
# Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>,
# TOI_min <dbl>, ATOI_min <dbl>
Cleaning the Injury Data base
head(nhlinjury)
# A tibble: 6 × 8
Season Team Position Player `Injury Type` `Cap Hit` Chip `Games Missed`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2000/01 Anaheim F Kariya,… Foot NA NA 16
2 2000/01 Anaheim F Leclerc… Abdominal NA NA 13
3 2000/01 Anaheim F Leclerc… Knee NA NA 15
4 2000/01 Anaheim F McDonal… Concussion NA NA 7
5 2000/01 Anaheim F McInnis… Groin NA NA 3
6 2000/01 Anaheim F McInnis… Neck NA NA 3
Remove spaces in column names
Fix Season to be just the ending year of the season
# A tibble: 6 × 8
Season Team Position Player InjuryType CapHit Chip GamesMissed
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2001 Anaheim F Kariya, Paul Foot NA NA 16
2 2001 Anaheim F Leclerc, Mike Abdominal NA NA 13
3 2001 Anaheim F Leclerc, Mike Knee NA NA 15
4 2001 Anaheim F McDonald, Andy Concussion NA NA 7
5 2001 Anaheim F McInnis, Marty Groin NA NA 3
6 2001 Anaheim F McInnis, Marty Neck NA NA 3
Filter for my two seasons, 2006 and 2025
Then separate Player from Surname, Given name and recombine into Given Surname to match the Rosters
# A tibble: 6 × 10
Season Team Position surname given fullname InjuryType CapHit Chip
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2006 Anaheim F Brennan " Kip" " Kip Br… Shoulder NA NA
2 2006 Anaheim F Fedorov " Sergei" " Sergei… Groin NA NA
3 2006 Anaheim F Fedoruk " Todd" " Todd F… Back NA NA
4 2006 Anaheim F Getzlaf " Ryan" " Ryan G… Shoulder NA NA
5 2006 Anaheim F Hedström " Jonathan" " Jonath… Groin NA NA
6 2006 Anaheim F Konopka " Zenon" " Zenon … Ankle NA NA
# ℹ 1 more variable: GamesMissed <dbl>
# A tibble: 6 × 11
Season Team Position surname given fullname InjuryType injurygroup CapHit
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2006 Anaheim F Brennan " Kip" " Kip B… Shoulder upper NA
2 2006 Anaheim F Fedorov " Ser… " Serge… Groin lower NA
3 2006 Anaheim F Fedoruk " Tod… " Todd … Back upper NA
4 2006 Anaheim F Getzlaf " Rya… " Ryan … Shoulder upper NA
5 2006 Anaheim F Hedström " Jon… " Jonat… Groin lower NA
6 2006 Anaheim F Konopka " Zen… " Zenon… Ankle lower NA
# ℹ 2 more variables: Chip <dbl>, GamesMissed <dbl>
Join the dataframes
First trim all whitespace from the joining columns
Rosterall3$Player <-trimws(Rosterall3$Player) # weirdly this did not seem to work on every row and I had to manually remove spaces from the Excel fileinjury2$fullname <-trimws(injury2$fullname)Rosterall$Team <-trimws(Rosterall3$Team)injury2$Team <-trimws(injury2$Team)Rosterall3$Season <-trimws(Rosterall3$Season)injury2$Season <-trimws(injury2$Season)
complete the join
joined <-left_join(injury2, Rosterall3, by =c("Season"="Season", "fullname"="Player", "Team"="Team"), relationship ="many-to-many")head(joined)
# A tibble: 6 × 29
Season Team Position surname given fullname InjuryType injurygroup CapHit
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2006 Anaheim F Brennan " Kip" Kip Bre… Shoulder upper NA
2 2006 Anaheim F Fedorov " Ser… Sergei … Groin lower NA
3 2006 Anaheim F Fedoruk " Tod… Todd Fe… Back upper NA
4 2006 Anaheim F Getzlaf " Rya… Ryan Ge… Shoulder upper NA
5 2006 Anaheim F Hedström " Jon… Jonatha… Groin lower NA
6 2006 Anaheim F Konopka " Zen… Zenon K… Ankle lower NA
# ℹ 20 more variables: Chip <dbl>, GamesMissed <dbl>, Nationality <chr>,
# Pos <chr>, Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
# `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
# `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
# ATOI_min <dbl>
Please note that joins sometimes failed due to the player never actually playing a game (during the regular season) with the team they were signed to when reported as injured. In some cases, this is because they missed the entire season due to injury. In others, it is because they were traded or otherwise changed teams mid-season, or were signed to an NHL level contract, but played solely for the minor league affiliate team.
Filter for only active Forwards and Defensmen
Goalies have completely different stats that I did not grab. They also had all been wearing full face masks.
The “retired” status is for players who haven’t played since a previous season due to injuries or physical condition but had not officially retired. Those players would matter for salary cap investigations, but I’m not looking at money at all.
join1 <- joined |>filter(Position %in%c("F", "D")) |>filter(!is.na(Nationality)) |># as the first column from the second dataframe, any row that failed to join doesn't have a value in Nationality and thus will be dropped.select(-CapHit, -Chip) # remove the columns concerning injury time effect on team's salary cap, as that is outside of the scope of this projecthead(join1)
# A tibble: 6 × 27
Season Team Position surname given fullname InjuryType injurygroup
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2006 Anaheim F Brennan " Kip" Kip Brenn… Shoulder upper
2 2006 Anaheim F Fedorov " Sergei" Sergei Fe… Groin lower
3 2006 Anaheim F Fedoruk " Todd" Todd Fedo… Back upper
4 2006 Anaheim F Getzlaf " Ryan" Ryan Getz… Shoulder upper
5 2006 Anaheim F Hedström " Jonathan" Jonathan … Groin lower
6 2006 Anaheim F Konopka " Zenon" Zenon Kon… Ankle lower
# ℹ 19 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
# Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
# `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
# `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
# ATOI_min <dbl>
# A tibble: 6 × 27
Season Team Position surname given fullname InjuryType injurygroup
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2006 Anaheim F Lupul " Joffrey" Joffrey… Concussion head
2 2006 Anaheim F Niedermayer " Rob" Rob Nie… Concussion head
3 2006 Anaheim F Perry " Corey" Corey P… Concussion head
4 2006 Anaheim D Marshall " Jason" Jason M… Nose head
5 2006 Anaheim D Salei " Ruslan" Ruslan … Orbital b… head
6 2006 Boston D Leetch " Brian" Brian L… Head head
# ℹ 19 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
# Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
# `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
# `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
# ATOI_min <dbl>
The original column Nationality was created from was “Birth”, and since many of the players – everyone in 2006 – were born before the collapse of the Soviet Union – all players from that area are listed as born there, instead of the component countries they now hold citizenship in. As I am realizing this incredibly late, this is just going to be a quirk of the data.
create a binary is_concussion
for the logistic regression
headjoin2 <- headjoin |>mutate(is_concussion =ifelse(InjuryType =="Concussion", 1, 0)) # 1 if concussion, 0 if nothead(headjoin2)
# A tibble: 6 × 28
Season Team Position surname given fullname InjuryType injurygroup
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2006 Anaheim F Lupul " Joffrey" Joffrey… Concussion head
2 2006 Anaheim F Niedermayer " Rob" Rob Nie… Concussion head
3 2006 Anaheim F Perry " Corey" Corey P… Concussion head
4 2006 Anaheim D Marshall " Jason" Jason M… Nose head
5 2006 Anaheim D Salei " Ruslan" Ruslan … Orbital b… head
6 2006 Boston D Leetch " Brian" Brian L… Head head
# ℹ 20 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
# Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
# `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
# `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
# ATOI_min <dbl>, is_concussion <dbl>
Exploratory Data Analysis
Make histograms or density plots for all numeric data
Display the first group of Numeric Distributions in a Grid
# display in a grid; plot_annotation shares gives one title and caption p1 + p2 + p3 + p4 + p5 + p10 +plot_layout(ncol =2) +plot_annotation(title ="Distrubtions of Numeric Variables for Injuries in 2005-2006 and 2024-2025 Seasons",caption ="Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Age Exp height Wt
Min. :20.00 Min. : 0.000 Min. :68.00 Min. :176.0
1st Qu.:24.00 1st Qu.: 2.000 1st Qu.:72.00 1st Qu.:196.8
Median :28.00 Median : 5.000 Median :73.00 Median :205.0
Mean :27.81 Mean : 6.139 Mean :73.66 Mean :206.6
3rd Qu.:31.00 3rd Qu.: 9.000 3rd Qu.:75.00 3rd Qu.:216.2
Max. :38.00 Max. :18.000 Max. :80.00 Max. :248.0
GP PIM
Min. : 2.00 Min. : 2.00
1st Qu.:40.00 1st Qu.: 20.75
Median :59.00 Median : 37.50
Mean :54.58 Mean : 43.65
3rd Qu.:75.00 3rd Qu.: 58.00
Max. :81.00 Max. :156.00
The distribution of Age is close enough to be called nearly normal. Players who had head injures ranged from age 20-38, which is nearly the age range of the whole league. The mean and median are nearly identical, at 27.81 and 28.
Experience is right skewed, and again shows nearly the whole range of the league from 0-18 years. The mean and median are a little farther apart than age’s were, at 5 and 6.139 respectivel.
Height and weight are nearly normal. These ranges, 68-80 in for height, and 176-248 lbs for weight, do not show the range of the entire league, as they are missing the smaller players. This is opposite of what I would have expected, honestly.
Games Played is a wild mess. It is strongly left skewed with two main clusters. it rangers from 2 to 81, with the mean at 54.48 and median at 59. This is mostly interesting as it means that that most players who got a head injury played over half the season.
Penalties in minutes is another long ranged, from 2 minutes (i.e. a single minor penalty) to 156 minutes. It is skewed right with probable outliers.
Second Group of Numerical Varibles
p7 + p8 + p6 + p9 + p11 + p12 + p13 +plot_layout(ncol =2) +plot_annotation(title ="Distrubtions of Numeric Variables for Injuries in 2005-2006 and 2024-2025 Seasons",caption ="Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar")
summary(headjoin2[c("Goals", "Assists", "PTS", "TOI_min", "ATOI_min", "GamesMissed")]) # this is refusing to believe `+/-` exists, so will call separately
Goals Assists PTS TOI_min
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.8333
1st Qu.: 2.000 1st Qu.: 5.00 1st Qu.: 7.00 1st Qu.: 511.0750
Median : 6.000 Median :11.00 Median :18.00 Median : 935.9083
Mean : 8.583 Mean :14.72 Mean :23.31 Mean : 869.9593
3rd Qu.:11.000 3rd Qu.:22.00 3rd Qu.:32.25 3rd Qu.:1233.4083
Max. :44.000 Max. :47.00 Max. :78.00 Max. :1850.1333
ATOI_min GamesMissed
Min. : 0.4167 Min. : 1.00
1st Qu.:12.1167 1st Qu.: 2.00
Median :15.3583 Median : 4.00
Mean :15.0599 Mean : 7.63
3rd Qu.:18.7542 3rd Qu.: 9.00
Max. :24.6667 Max. :73.00
summary(headjoin2$`+/-`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-24.0000 -7.0000 -1.0000 -0.6389 3.0000 29.0000
(The awkward bottom row of the output is for Plus-minus)
Goals, Assists, and Points are all skewed right – as Points is the sum of the first two, this makes sense. Goals has the most isolated cluster on the left side, with assists and points more spread out. Again this makes sense, as more players get assists than goals.
Time on Ice is all over the place. the min is under a minute, and the max is 1850.1333 minutes. WIth that range, the mean and median are surprisingly close at approx. 870 and approx. 936.
Average Time on Ice is near enough that I can call it nearly normal. A slightly less vast range, if only because ATOI gets divided by games played, it ranges from a little under half a minute to 24.6667 minutes. The mean and median are both around 15 minutes.
GamesMissed is highly right skewed, with a few observations very far from the main cluster, and will need a logarithmic transformation for any statistical tests. It ranges from 1 to 73, but the mean and median are way down at 7.63 and 4, showing that the max is most definitely an outlier.
Chi-squared test Between InjuryType and Season
headcount <- headjoin2 |>group_by(Season, InjuryType) |>count()ggplot(headcount, aes(reorder(InjuryType, n), n, fill = Season)) +geom_col()+coord_flip() +labs(title="InjuryType by Season",x ="Head Injury TYpe",y ="Count",caption ="Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar") +theme_bw() +scale_fill_manual(values =c("#b4d4ed","#83b6de"))
Null Hypothesis: There is no association between Season and InjuryType
Alternative Hypothesis: There is an association between Season and Injury Type
Check Conditions:
Independence: the injuries are NOT indenpendent, so my results are not going to be valid
Fisher's Exact Test for Count Data
data: headjoin2$Season and headjoin2$InjuryType
p-value = 0.1636
alternative hypothesis: two.sided
With a p-value of 0.1636, I fail to reject the null: there is no evidence that there is an association between Season and InjuryType.
Difference of TWo Means Bootstrap: GamesMissed for each Season
Null Hypothesis: The mean GamesMissed for 2006 for head injuries is the same as the mean GamesMissed for 2025 for head injuries.
Check conditions:
Independence: Not exactly independent, so results aren’t exactly valid!
Distribution and counts: Need to transform GamesMissed
ggplot(headjoin2, aes(log(GamesMissed), color = Season)) +geom_density(linewidth =2) +labs(title ='Distribution of Log(GamesMissed)',x ="Log(GamesMissed)",color ="Season",caption ="Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar") +scale_color_manual(values =c("#b4d4ed","#83b6de"))
headjoin |>group_by(Season) |>count()
# A tibble: 2 × 2
# Groups: Season [2]
Season n
<chr> <int>
1 2006 73
2 2025 35
The log transformation of GamesMissed is better, and while still right skewed, close enough to normal. (The other transformations are worse.) Both seasons have counts of over 30, so I can proceed.
Null Hypothesis: The mean GamesMissed for 2006 for head injuries is the same as the mean GamesMissed for 2025 for head injuries.
Alternative Hypothesis: The mean GamesMisssed for ’06 and ’25 are different.
ggplot(headjoin2, aes(x = Season, y =log(GamesMissed))) +geom_boxplot(aes(fill = Season)) +geom_jitter(alpha =0.8) +labs(title ="Games Missed For Head Injuries by Season",x="Season",y ="Games Missed",caption ="Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar" ) +theme_bw() +scale_fill_manual(values =c("#b4d4ed","#83b6de")) +theme(legend.position ="none")
diff_mean_ci <- headjoin2 |>specify(GamesMissed ~ Season) |>generate(reps =1000, type ="bootstrap") |>calculate(stat ="diff in means", order =c("2006", "2025"))head(diff_mean_ci)
As zero is included in the confidence interval range, I fail to reject the null hypothesis. I am 95% confident that there is no true difference in mean games missed for head injuries between the 2006 and 2025 seasons. ## Model: Logistic RegressionL Probability a head injury is a concussion
fit1 <-glm(is_concussion ~ GP, family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(fit1)
Call:
glm(formula = is_concussion ~ GP, family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.782530 0.536227 1.459 0.1445
GP -0.017095 0.009157 -1.867 0.0619 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 145.52 on 106 degrees of freedom
AIC: 149.52
Number of Fisher Scoring iterations: 4
ggplot(headjoin2, aes(x = GP, y = is_concussion)) +geom_jitter(width =0, height =0.05, alpha =0.5) +geom_smooth(method ="glm", color ="#b4d4ed", method.args =list(family ="binomial"), se =FALSE)
`geom_smooth()` using formula = 'y ~ x'
Fit the model
full_model <-glm(is_concussion ~factor(Season) +factor(Team)+factor(Position) + Nationality + GamesMissed + Age + GP + Wt +factor(shoots) + PTS +`+/-`+ ATOI_min , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
See what removing Team does – some have very high p-values
model2 <-glm(is_concussion ~factor(Season) +factor(Position) + Nationality + GamesMissed + Age + GP + Wt +factor(shoots) + PTS +`+/-`+ ATOI_min , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model2)
Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) +
Nationality + GamesMissed + Age + GP + Wt + factor(shoots) +
PTS + `+/-` + ATOI_min, family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.06150 5.27966 2.095 0.0362 *
factor(Season)2025 0.53137 0.56274 0.944 0.3450
factor(Position)F 1.17107 0.82861 1.413 0.1576
NationalityCZ 14.36280 1665.41629 0.009 0.9931
NationalityCzechoslovakia 0.39530 1.04098 0.380 0.7041
NationalityFinland -0.10589 1.23623 -0.086 0.9317
NationalityGermany 15.90907 1688.74714 0.009 0.9925
NationalityRussia 0.55801 1.62543 0.343 0.7314
NationalitySoviet Union 0.71796 1.00166 0.717 0.4735
NationalitySweden -0.82357 0.92568 -0.890 0.3736
NationalitySwitzerland 17.01237 2399.54490 0.007 0.9943
NationalityUSA -0.09172 0.60745 -0.151 0.8800
GamesMissed 0.01273 0.02572 0.495 0.6208
Age -0.11174 0.05970 -1.872 0.0612 .
GP -0.02372 0.01611 -1.472 0.1410
Wt -0.04272 0.01978 -2.159 0.0308 *
factor(shoots)R -0.35413 0.49563 -0.715 0.4749
PTS -0.01830 0.02555 -0.716 0.4739
`+/-` -0.02170 0.02687 -0.808 0.4194
ATOI_min 0.09351 0.10146 0.922 0.3567
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 120.38 on 88 degrees of freedom
AIC: 160.38
Number of Fisher Scoring iterations: 15
That helped. See if removing nationality helps.
model3<-glm(is_concussion ~factor(Season) +factor(Position) + GamesMissed + Age + GP + Wt +factor(shoots) + PTS +`+/-`+ ATOI_min , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model3)
Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) +
GamesMissed + Age + GP + Wt + factor(shoots) + PTS + `+/-` +
ATOI_min, family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.804177 4.975071 2.172 0.0299 *
factor(Season)2025 0.499764 0.481003 1.039 0.2988
factor(Position)F 0.961435 0.802260 1.198 0.2308
GamesMissed 0.005791 0.023328 0.248 0.8039
Age -0.134712 0.055702 -2.418 0.0156 *
GP -0.027012 0.014905 -1.812 0.0699 .
Wt -0.036533 0.018133 -2.015 0.0439 *
factor(shoots)R -0.499054 0.462877 -1.078 0.2810
PTS -0.011735 0.023346 -0.503 0.6152
`+/-` -0.025867 0.025204 -1.026 0.3048
ATOI_min 0.091491 0.096183 0.951 0.3415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 125.72 on 97 degrees of freedom
AIC: 147.72
Number of Fisher Scoring iterations: 3
Now remove GamesMissed, as highest p-value.
model4<-glm(is_concussion ~factor(Season) +factor(Position) + Age + GP + Wt +factor(shoots) + PTS +`+/-`+ ATOI_min , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model4)
Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) +
Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min,
family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.65472 4.93231 2.160 0.0308 *
factor(Season)2025 0.49312 0.48000 1.027 0.3043
factor(Position)F 0.98889 0.79460 1.245 0.2133
Age -0.13285 0.05508 -2.412 0.0159 *
GP -0.02818 0.01418 -1.987 0.0469 *
Wt -0.03562 0.01775 -2.007 0.0447 *
factor(shoots)R -0.51216 0.45981 -1.114 0.2653
PTS -0.01224 0.02325 -0.526 0.5986
`+/-` -0.02514 0.02500 -1.005 0.3147
ATOI_min 0.09277 0.09601 0.966 0.3339
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 125.78 on 98 degrees of freedom
AIC: 145.78
Number of Fisher Scoring iterations: 3
AIC Still going down. Now remove PTS as highest p-value
model5<-glm(is_concussion ~factor(Season) +factor(Position) + Age + GP + Wt +factor(shoots) +`+/-`+ ATOI_min , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model5)
Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) +
Age + GP + Wt + factor(shoots) + `+/-` + ATOI_min, family = binomial(),
data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.80937 4.43393 2.663 0.00774 **
factor(Season)2025 0.50091 0.48000 1.044 0.29669
factor(Position)F 0.71986 0.60165 1.196 0.23151
Age -0.13307 0.05489 -2.424 0.01534 *
GP -0.03182 0.01247 -2.552 0.01072 *
Wt -0.03821 0.01704 -2.243 0.02492 *
factor(shoots)R -0.50242 0.45923 -1.094 0.27393
`+/-` -0.02898 0.02387 -1.214 0.22467
ATOI_min 0.05797 0.06868 0.844 0.39868
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 126.06 on 99 degrees of freedom
AIC: 144.06
Number of Fisher Scoring iterations: 3
Slightly lower AIC; now remove ATOI_min
model6<-glm(is_concussion ~factor(Season) +factor(Position) + Age + GP + Wt +factor(shoots) +`+/-` , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model6)
Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) +
Age + GP + Wt + factor(shoots) + `+/-`, family = binomial(),
data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.36402 4.09724 3.262 0.00111 **
factor(Season)2025 0.50490 0.47834 1.056 0.29119
factor(Position)F 0.41324 0.47288 0.874 0.38218
Age -0.11789 0.05146 -2.291 0.02197 *
GP -0.02712 0.01099 -2.467 0.01362 *
Wt -0.04369 0.01594 -2.741 0.00613 **
factor(shoots)R -0.55077 0.45633 -1.207 0.22745
`+/-` -0.02413 0.02315 -1.042 0.29744
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 126.78 on 100 degrees of freedom
AIC: 142.78
Number of Fisher Scoring iterations: 3
Still lowering slightly. Now remove Position
model7<-glm(is_concussion ~factor(Season) + Age + GP + Wt +factor(shoots) +`+/-` , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model7)
Call:
glm(formula = is_concussion ~ factor(Season) + Age + GP + Wt +
factor(shoots) + `+/-`, family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.82003 4.07471 3.392 0.000695 ***
factor(Season)2025 0.47077 0.47412 0.993 0.320744
Age -0.12249 0.05098 -2.403 0.016269 *
GP -0.02664 0.01100 -2.422 0.015436 *
Wt -0.04407 0.01593 -2.765 0.005684 **
factor(shoots)R -0.50373 0.45129 -1.116 0.264338
`+/-` -0.02456 0.02307 -1.064 0.287200
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 127.55 on 101 degrees of freedom
AIC: 141.55
Number of Fisher Scoring iterations: 3
I am surprised that Season is my largest p-value
model8<-glm(is_concussion ~ Age + GP + Wt +factor(shoots) +`+/-` , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model8)
Call:
glm(formula = is_concussion ~ Age + GP + Wt + factor(shoots) +
`+/-`, family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.46598 3.99479 3.621 0.000293 ***
Age -0.13111 0.05035 -2.604 0.009222 **
GP -0.02621 0.01088 -2.410 0.015962 *
Wt -0.04544 0.01568 -2.898 0.003759 **
factor(shoots)R -0.48809 0.44684 -1.092 0.274694
`+/-` -0.02127 0.02277 -0.934 0.350087
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 128.54 on 102 degrees of freedom
AIC: 140.54
Number of Fisher Scoring iterations: 3
AIC is still decreasing SLIGHTLY. Now remove Plus Minus
model9<-glm(is_concussion ~ Age + GP + Wt +factor(shoots) , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model9)
Call:
glm(formula = is_concussion ~ Age + GP + Wt + factor(shoots),
family = binomial(), data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.43681 3.99554 3.613 0.000302 ***
Age -0.12364 0.04956 -2.495 0.012609 *
GP -0.02732 0.01082 -2.525 0.011554 *
Wt -0.04595 0.01567 -2.932 0.003372 **
factor(shoots)R -0.49713 0.44324 -1.122 0.262048
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 129.43 on 103 degrees of freedom
AIC: 139.43
Number of Fisher Scoring iterations: 3
Still decreasing a little. Removing shoots next.
model10<-glm(is_concussion ~ Age + GP + Wt , family =binomial(), data = headjoin2) # family = bimonial makes the sigmoid curvesummary(model10)
Call:
glm(formula = is_concussion ~ Age + GP + Wt, family = binomial(),
data = headjoin2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.81041 3.89129 3.549 0.000387 ***
Age -0.11611 0.04842 -2.398 0.016481 *
GP -0.02726 0.01074 -2.538 0.011158 *
Wt -0.04486 0.01548 -2.898 0.003754 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.13 on 107 degrees of freedom
Residual deviance: 130.71 on 104 degrees of freedom
AIC: 138.71
Number of Fisher Scoring iterations: 4
All p-values are very small. The AIC is still rather large, at 138.71 though. This is probably not a very good model.