217 FINAL NHL Head Injuries Visor Use V2

Author

Z Griffin

NHL Head Injuries Before and After Majority Visor Use

Comparing the 2005-2006 and 2024-2025 Seasons

Jamie Benn of the Dallas Stars is one of the last four players to not wear a visor. Photo: Richard A. Whittaker/Icon Sports

Avalanche Nathan MacKinnon was hit in the mouth by a puck May 11th, 2026 during a playoff game. He returned to the game. Source: ESPN, Associated Press

Introduction

Fast paced, on a near-frictionless surface with blades strapped to everyone’s feet, and hard pieces of rubber being shot across the ice at speeds that can exceed 100 miles per hour, it is unsurprising that hockey is a dangerous sport. Now consider that it is part of the strategy to slam opposing players into the rink boards, and the long tradition of fights – even if those are technically against the rules– and that NHL teams play 82 games between September and April.

My data set is the NHL Injury Database, compiled by LW3H, from data from the teams, puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, consisting of injury data for all NHL teams to from the 2000-2001 season to midway through 2025-2026, when I downloaded it. It is posted on the NHL Injury Viz blog page, and Tableau Public. I joined this with team rosters and player stats from Hockey-reference.com, official NHL data provided by sportsradar.

Collegiate hockey wears full face protection. The Professional Womens Hockey League, entering their third playoffs final, wear full facial protection. The American Hockey League – the top minor league in North America – has mandated visors since 2006-2007. Every other level of hockey mandates facial protection – the NHL was the last (Associated P)

The NHL mandated the use of helmet visors starting for the 2013-2014 season for all players who had played less than 26 games. The final incident in the long debate over visors was Marc Staal of the New York Ranger’s bad eye injury in the spring of 2013 from being hit in face with a puck. The National Hockey League Player’s Association is reported to have had a “strong majority” in favor of the mandate that June, compared to prior to Staal’s injury, when they had rejected a mandate (Halford).

73% of the league were voluntarily wearing visors already in 2012-2013, when Staal was injured, which is why I did not choose that year for my analysis (Halford). In 2004-2005, only 38% of players wore visors (CBC). By 2024-2025, all but four players wore visors: Dallas Star’s Jamie Benn (above), Minnesota Wild’s Zach Bogosian, of the Nashville Predators’s Ryan O’Reilly, and San Jose Sharks’ Ryan Reaves. The question regarding visors has mostly shifted to which of the four will be the last visorless player (Johnson, Trettenero).

Variable (dataset) Defintion Notes
Season (both) Season of play, presented as the year it ended it either 2006 or 2025
Team (both) Team player was signed with used in joining the datasets
Position (injuries) Position played (Forward, Defense, or Goalie) also indicates retired – filtered for only Forward and Defense
fullname (injuries) /Player (roster) Player’s given and surname had to wrangle injuries dataset to create to join to rosters
InjuryType (injuries) What the injury or illness was
injurygroup (created) classified InjuryType by area of body head, upper or lower body, or other for sickness, undisclosed, non location specific injuries
is_concussion (created) Concussion (1) or not (0) for filtered data set only, for logistic regression
GamesMissed (injuries) how many games were missed Indicates severity of injury
Nationality (roster) What country the player is from modified from original column “Birth” due to formatting
Pos (roster) also position, but classifies as Center, Left and Right Wing, Defense, and Goalie
Age (roster) Players age as of February of that season
height (roster) height in inches wrangled this column from original
Wt (roster) weight in pounds
Shoots (roster) side player shoots puck from L or R often opposite of dominate hand
Exp (Experience) (roster) Years in league before that season
GP (Games Played) (roster) how many games played that season specifically for that team
Goals Goals scored
Assists Goals they contributed to usually by making one of the last two passes
PTS (Points) (Roster) Sum of Goals and Assists
+/- (plus minus) (Roster Differential of goals scored for and against when player is on the ice
PIM (Roster) Penalties in minutes minor penalties are 2 min, major is 5, game misconduct is 10
TOI (Roster Time on Ice, total for whole season displayed as mm:ss
ATOI (roster) Average time on ice, TOI/Games played displayed as mm:ss
TOI_min (roster) Time on ice, total converted to minutes as a decimal
ATOI_min (roster) Average time on ice, TOI/Games played converted to minutes as a decimal

Have majority use of visors improved player safety in regards to head injuries?

Two important notes about this dataset:

One: Injuries are not independent necessarily independent, as the same players can get repeat injuries, and are often injuring each other.

Two: Only injuries that meant the player missed subsequent games appear on this database. The above picture of Nathan MacKinnon’s injury would not appear in this dataset later this week as he came back in that game, and he did not miss the next game.

Conclusions

Once again, my results should be taken with a grain of salt as my observations are not independent. My chi-squared analysis of association between InjuryType and Season showed no association with a p-value of .1636 This was surprising, as the graph suggests drop offs in nearly all head injury types. I wonder if a more focused test on just proportions of concussion injuries, for example, would produce the same results. Similarly, the difference of two means 95% confidence interval of the average of Games Missed for all head injuries in 2006 and 2025 seasons produced a range that included zero, suggesting there is no difference in injury severity between the two seasons. This result was initially surprising, considering some of the high values for Games Missed on my vizualization for the Data 110 project, but upon looking at the box plot here, I realize they were outliers. Also, I don’t think I transformed the data for the 110 visualization.

My logistic regression was a little frustrating. I tried to use the machine learning method, but was stalled by the output only showing the coefficients and not the p-values. I figured it was best to do it manually, a way I knew I understood. It is also frustrating that once again, I have made a poor model. My final model is is_concussion ~ Age + GP + Wt, and the AIC only got down to 138.71. Surprisingly, according to my confusion matrix, my accuracy is 63.9%

One problem with my dataset is I had no way of knowing if a player with a head injury – especially in ’06 – was wearing a visor or not. While I was able to find a site that showed players current and most recent gear, it did not show gear (or lack of) choices from years ago, nor would it have been efficient to acquire the data from.

Future directions I would like to go to, would be looking at comparisons to collegiate hockey injuries and the PWHL. While it seemed like there are datasets on the first, it seemed hard to access, and the PWHL is only in it’s third season that the data might just not be out there yet.

References

Associated Press. “Avs Star MacKinnon Bloodied after Puck to Face from Teammate.” ESPN.Com, 12 May 2026, https://www.espn.com/nhl/story/_/id/48746176/avs-star-mackinnon-bloodied-puck-face-teammate.

Associated Press. “Use of Visors up for Debate in Rough-&-Tumble NHL.” ESPN.Com, https://www.espn.com/espn/wire/_/section/nhl/id/9048371. Accessed 14 May 2026.

Benson, B. W., et al. “The Impact of Face Shield Use on Concussions in Ice Hockey: A Multivariate Analysis.” British Journal of Sports Medicine, vol. 36, no. 1, Feb. 2002, pp. 27–32. Original Articles. bjsm.bmj.com, https://doi.org/10.1136/bjsm.36.1.27.

CBC. “Use of Visors in NHL on the Rise: Report.” CBC Sports, 18 Oct. 2005. CBC.ca, https://www.cbc.ca/sports/hockey/use-of-visors-in-nhl-on-the-rise-report-1.528068.

Halford, Mike. “NHL Makes Visors Mandatory for New Players.” NBC Sports, 4 June 2013, https://www.nbcsports.com/nhl/news/nhl-makes-visors-mandatory-for-new-players.

Johnston, Chris. “Why 4 NHL Players Still Don’t Wear Visors, and Who Will Be the Last Standing?” The New York Times, 13 Mar. 2026. NYTimes.com, https://www.nytimes.com/athletic/7108800/2026/03/12/nhl-players-visor-holdouts/.

Kennedy, Ian. “The Last Players Not to Wear Visors in the NHL.” The Hockey News, 22 Apr. 2023, https://thehockeynews.com/news/news/the-last-players-not-to-wear-visors-in-the-nhl.

Miller, Michael. English: Pittsburgh Penguins Defensemen Kris Letang. 9 Dec. 2017, Own work. Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Kris_Letang_2017-12-09_17960_%281%29.jpg.

Trettenero, Brady. “NHL Players Without Visors (2024).” Gino Hard, 5 Apr. 2023, https://www.ginohard.com/nhl-visorless-players/.

Data

Load Libraries and Data

library(tidyverse)
library(readxl)
Warning: package 'readxl' was built under R version 4.5.3
library(webshot2)
Warning: package 'webshot2' was built under R version 4.5.3
library(tidymodels)
library(RColorBrewer)
library(patchwork)
Warning: package 'patchwork' was built under R version 4.5.3
setwd("~/Schol Stuff/Montgomery College 2025/Math 217 Stat for Scientists/Final Project")

nhlinjury <- read_csv("NHL Injury Database_data.csv") 
 
Roster06 <- read_excel("2006_2025Rosters.xlsx", sheet = 1)
Roster25 <- read_excel("2006_2025Rosters.xlsx", sheet = 2)

Preview and Clean The Rosters

starting with the Rosters and player stats: two different tables from Hockey-reference.com were combined per team in Excel, and not all columns were used.

head(Roster06)
# A tibble: 6 × 21
  Season Team    Player    Birth Nationality Pos     Age Ht       Wt `S/C` Exp  
   <dbl> <chr>   <chr>     <chr> <chr>       <chr> <dbl> <chr> <dbl> <chr> <chr>
1   2006 Anaheim François… ca CA CA          D        25 5-11    208 L/-   1    
2   2006 Anaheim Kip Bren… ca CA CA          LW       25 6-4     230 L/-   3    
3   2006 Anaheim Ilya Bry… su SU SU          G        25 6-3     213 -/L   2    
4   2006 Anaheim Keith Ca… us US US          D        35 6-1     207 L/-   13   
5   2006 Anaheim Joe DiPe… ca CA CA          D        26 6-2     199 L/-   1    
6   2006 Anaheim Sergei F… su SU SU          C        36 6-2     207 L/-   14   
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
#   Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
#   ATOI <chr>
head(Roster25)
# A tibble: 6 × 21
  Season Team    Player    Birth Nationality Pos     Age Ht       Wt `S/C` Exp  
   <dbl> <chr>   <chr>     <chr> <chr>       <chr> <dbl> <chr> <dbl> <chr> <chr>
1   2025 Anaheim Leo Carl… se SE SE          C        20 6-3     208 L/-   1    
2   2025 Anaheim Sam Cola… us US US          RW       23 6-2     213 R/-   1    
3   2025 Anaheim Lukáš Do… cz CZ CZ          G        24 6-2     190 -/L   3    
4   2025 Anaheim Brian Du… us US US          D        33 6-4     215 L/-   11   
5   2025 Anaheim Robby Fa… ca CA CA          C        29 5-11    185 L/-   8    
6   2025 Anaheim Cam Fowl… ca CA CA          D        33 6-2     213 L/-   14   
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
#   Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
#   ATOI <chr>

Nationality was created as the lowercase characters in ‘Birth’ were a country flag image on the website.

Bind the rosters

Combine the two roster dataframes into one before any other cleaning is done to them:

# change Roster06 Goals to numeric so the bind works
Roster06$Goals <- as.numeric(Roster06$Goals) 

Rosterall <- bind_rows(Roster06, Roster25)

dim(Rosterall)
[1] 2205   21
head(Rosterall)
# A tibble: 6 × 21
  Season Team    Player    Birth Nationality Pos     Age Ht       Wt `S/C` Exp  
   <dbl> <chr>   <chr>     <chr> <chr>       <chr> <dbl> <chr> <dbl> <chr> <chr>
1   2006 Anaheim François… ca CA CA          D        25 5-11    208 L/-   1    
2   2006 Anaheim Kip Bren… ca CA CA          LW       25 6-4     230 L/-   3    
3   2006 Anaheim Ilya Bry… su SU SU          G        25 6-3     213 -/L   2    
4   2006 Anaheim Keith Ca… us US US          D        35 6-1     207 L/-   13   
5   2006 Anaheim Joe DiPe… ca CA CA          D        26 6-2     199 L/-   1    
6   2006 Anaheim Sergei F… su SU SU          C        36 6-2     207 L/-   14   
# ℹ 10 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
#   Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
#   ATOI <chr>

Making Numeric and Seperating Columns (or both)

Time on Ice (TOI) and Average Time on Ice (ATOI) need converted to numeric

Experience (Exp) (years in league) needs the R for rookie replaced with 0, and converted to numeric

# ms() pulls out the mm:ss format, as.numeric converts to total seconds, divide by 60 to make minutes
Rosterall$TOI_min <- as.numeric(ms(Rosterall$TOI))/60 

Rosterall$ATOI_min <-as.numeric(ms(Rosterall$ATOI))/60

Rosterall$Exp <- as.numeric(gsub("R", "0", Rosterall$Exp))


head(Rosterall)
# A tibble: 6 × 23
  Season Team    Player    Birth Nationality Pos     Age Ht       Wt `S/C`   Exp
   <dbl> <chr>   <chr>     <chr> <chr>       <chr> <dbl> <chr> <dbl> <chr> <dbl>
1   2006 Anaheim François… ca CA CA          D        25 5-11    208 L/-       1
2   2006 Anaheim Kip Bren… ca CA CA          LW       25 6-4     230 L/-       3
3   2006 Anaheim Ilya Bry… su SU SU          G        25 6-3     213 -/L       2
4   2006 Anaheim Keith Ca… us US US          D        35 6-1     207 L/-      13
5   2006 Anaheim Joe DiPe… ca CA CA          D        26 6-2     199 L/-       1
6   2006 Anaheim Sergei F… su SU SU          C        36 6-2     207 L/-      14
# ℹ 12 more variables: `Birth Date` <chr>, Summary <chr>, GP <dbl>,
#   Goals <dbl>, Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>,
#   ATOI <chr>, TOI_min <dbl>, ATOI_min <dbl>

Ht (height) in feet-inches needs to be separated and recombined. S/C (shoots/catches) needs to be separated as well

Rosterall2 <- Rosterall |>
  
  separate(Ht, into = c("feet", "inch"), sep = "-", convert = TRUE) |> 
  mutate(height = as.numeric(feet)*12 + as.numeric(inch), .after = "inch") |> #multiply feet by 12, add to inches to get height in inches as new column 
  
  separate(`S/C`, into= c("shoots", "catches"), sep ="/", convert = TRUE)  # separate S/C and names them; not concerned about the dashes in place of NAs as Catches is only relevant for goalies and they should be the only ones without "shoots" and I'm filtering them out! 
  


head(Rosterall2)
# A tibble: 6 × 26
  Season Team    Player   Birth Nationality Pos     Age  feet  inch height    Wt
   <dbl> <chr>   <chr>    <chr> <chr>       <chr> <dbl> <int> <int>  <dbl> <dbl>
1   2006 Anaheim Françoi… ca CA CA          D        25     5    11     71   208
2   2006 Anaheim Kip Bre… ca CA CA          LW       25     6     4     76   230
3   2006 Anaheim Ilya Br… su SU SU          G        25     6     3     75   213
4   2006 Anaheim Keith C… us US US          D        35     6     1     73   207
5   2006 Anaheim Joe DiP… ca CA CA          D        26     6     2     74   199
6   2006 Anaheim Sergei … su SU SU          C        36     6     2     74   207
# ℹ 15 more variables: shoots <chr>, catches <chr>, Exp <dbl>,
#   `Birth Date` <chr>, Summary <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>,
#   PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
#   ATOI_min <dbl>
Rosterall3 <- Rosterall2 |>
  # removing a few columns I absolutely don't need for ease of viewing
  select(-Birth, -feet, -inch, -catches, -Summary)

head(Rosterall3)
# A tibble: 6 × 21
  Season Team    Player        Nationality Pos     Age height    Wt shoots   Exp
   <dbl> <chr>   <chr>         <chr>       <chr> <dbl>  <dbl> <dbl> <chr>  <dbl>
1   2006 Anaheim François Bea… CA          D        25     71   208 L          1
2   2006 Anaheim Kip Brennan   CA          LW       25     76   230 L          3
3   2006 Anaheim Ilya Bryzgal… SU          G        25     75   213 -          2
4   2006 Anaheim Keith Carney  US          D        35     73   207 L         13
5   2006 Anaheim Joe DiPenta   CA          D        26     74   199 L          1
6   2006 Anaheim Sergei Fedor… SU          C        36     74   207 L         14
# ℹ 11 more variables: `Birth Date` <chr>, GP <dbl>, Goals <dbl>,
#   Assists <dbl>, PTS <dbl>, `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>,
#   TOI_min <dbl>, ATOI_min <dbl>

Cleaning the Injury Data base

head(nhlinjury)
# A tibble: 6 × 8
  Season  Team    Position Player   `Injury Type` `Cap Hit`  Chip `Games Missed`
  <chr>   <chr>   <chr>    <chr>    <chr>             <dbl> <dbl>          <dbl>
1 2000/01 Anaheim F        Kariya,… Foot                 NA    NA             16
2 2000/01 Anaheim F        Leclerc… Abdominal            NA    NA             13
3 2000/01 Anaheim F        Leclerc… Knee                 NA    NA             15
4 2000/01 Anaheim F        McDonal… Concussion           NA    NA              7
5 2000/01 Anaheim F        McInnis… Groin                NA    NA              3
6 2000/01 Anaheim F        McInnis… Neck                 NA    NA              3

Remove spaces in column names

Fix Season to be just the ending year of the season

#nhlinjury <- read_csv("NHL Injury Database_data.csv")

names(nhlinjury) <- gsub(" ", "", names(nhlinjury))

str_sub(nhlinjury$Season, start=3, end = 5) <- ""


head(nhlinjury)
# A tibble: 6 × 8
  Season Team    Position Player         InjuryType CapHit  Chip GamesMissed
  <chr>  <chr>   <chr>    <chr>          <chr>       <dbl> <dbl>       <dbl>
1 2001   Anaheim F        Kariya, Paul   Foot           NA    NA          16
2 2001   Anaheim F        Leclerc, Mike  Abdominal      NA    NA          13
3 2001   Anaheim F        Leclerc, Mike  Knee           NA    NA          15
4 2001   Anaheim F        McDonald, Andy Concussion     NA    NA           7
5 2001   Anaheim F        McInnis, Marty Groin          NA    NA           3
6 2001   Anaheim F        McInnis, Marty Neck           NA    NA           3

Filter for my two seasons, 2006 and 2025

Then separate Player from Surname, Given name and recombine into Given Surname to match the Rosters

nhlinjury1 <- nhlinjury |>
  filter(Season %in% c('2006','2025')) |>  
  
  separate(Player, into = c('surname', 'given'), sep =",", convert = TRUE) |>
  mutate(fullname = paste({given},{surname}, sep = " "), .after = 'given')

head(nhlinjury1)
# A tibble: 6 × 10
  Season Team    Position surname  given       fullname  InjuryType CapHit  Chip
  <chr>  <chr>   <chr>    <chr>    <chr>       <chr>     <chr>       <dbl> <dbl>
1 2006   Anaheim F        Brennan  " Kip"      " Kip Br… Shoulder       NA    NA
2 2006   Anaheim F        Fedorov  " Sergei"   " Sergei… Groin          NA    NA
3 2006   Anaheim F        Fedoruk  " Todd"     " Todd F… Back           NA    NA
4 2006   Anaheim F        Getzlaf  " Ryan"     " Ryan G… Shoulder       NA    NA
5 2006   Anaheim F        Hedström " Jonathan" " Jonath… Groin          NA    NA
6 2006   Anaheim F        Konopka  " Zenon"    " Zenon … Ankle          NA    NA
# ℹ 1 more variable: GamesMissed <dbl>

Group Injury Types

sort(unique(nhlinjury1$InjuryType))
 [1] "Abdominal"     "Adductor"      "Ankle"         "Appendectomy" 
 [5] "Arm"           "Back"          "Bicep"         "Blood clots"  
 [9] "Bronchitis"    "Cancer"        "Charley horse" "Cheek"        
[13] "Chest"         "Collarbone"    "Concussion"    "Dehydration"  
[17] "Dental"        "Dizziness"     "Ear"           "Elbow"        
[21] "Eye"           "Facial"        "Finger"        "Flu"          
[25] "Foot"          "Groin"         "Hamstring"     "Hand"         
[29] "Head"          "Heart"         "Heel"          "Hepatitis"    
[33] "Hernia"        "Hip"           "Illness"       "Infection"    
[37] "Jaw"           "Knee"          "Larynx"        "Leg"          
[41] "Lower body"    "Mid-body"      "Neck"          "Nose"         
[45] "Oblique"       "Orbital bone"  "Pectoral"      "Pelvis"       
[49] "Quadriceps"    "Respiratory"   "Ribs"          "Shoulder"     
[53] "Sinus"         "Sternum"       "Tailbone"      "Thigh"        
[57] "Thumb"         "Toe"           "Torso"         "Undisclosed"  
[61] "Upper body"    "Wrist"        

Group for injuries to the head, upper body, lower body, and illness or unspecified location injuries

head <- c("Cheek", "Concussion", "Dizziness", "Ear", "Eye", "Facial", "Head", "Jaw", "Nose", "Orbital bone")

upper <- c("Abdominal", "Arm", "Back", "Bicep", "Chest", "Collarbone", "Elbow", "Finger", "Hand", "Heart", "Larynx", "Mid-body", "Oblique", "Pectoral", "Ribs", "Shoulder", "Sternum", "Thumb", "Torso", "Upper body", "Wrist")

lower <- c("Adductor", "Ankle", "Charley horse", "Foot", "Groin", "Hamstring", "Heel", "Hip", "Knee", "Leg", "Lower body", "Pelvis", "Quadricepts", "Tailbone", "Thigh", "Toe")

other <- c("Appendectomy", "Blood clots", "Bronchitis", "Cancer", "Dehydration", "Dental", "Hepatitis", "Hernia", "Illness", "infection", "Sinus", "Undisclosed")

injury2 <- nhlinjury1 |>
  mutate(injurygroup = case_when(
    InjuryType %in% head ~ "head",
    InjuryType %in% upper ~ "upper", 
    InjuryType %in% lower ~ "lower",
    TRUE ~ "other"), .after = InjuryType) 

head(injury2)    
# A tibble: 6 × 11
  Season Team    Position surname  given  fullname InjuryType injurygroup CapHit
  <chr>  <chr>   <chr>    <chr>    <chr>  <chr>    <chr>      <chr>        <dbl>
1 2006   Anaheim F        Brennan  " Kip" " Kip B… Shoulder   upper           NA
2 2006   Anaheim F        Fedorov  " Ser… " Serge… Groin      lower           NA
3 2006   Anaheim F        Fedoruk  " Tod… " Todd … Back       upper           NA
4 2006   Anaheim F        Getzlaf  " Rya… " Ryan … Shoulder   upper           NA
5 2006   Anaheim F        Hedström " Jon… " Jonat… Groin      lower           NA
6 2006   Anaheim F        Konopka  " Zen… " Zenon… Ankle      lower           NA
# ℹ 2 more variables: Chip <dbl>, GamesMissed <dbl>

Join the dataframes

First trim all whitespace from the joining columns

Rosterall3$Player <-trimws(Rosterall3$Player) # weirdly this did not seem to work on every row and I had to manually remove spaces from the Excel file

injury2$fullname <- trimws(injury2$fullname)

Rosterall$Team <- trimws(Rosterall3$Team)

injury2$Team <- trimws(injury2$Team)

Rosterall3$Season <-trimws(Rosterall3$Season)

injury2$Season <- trimws(injury2$Season)

complete the join

joined <- left_join(injury2, Rosterall3, by = c("Season" = "Season", "fullname" = "Player", "Team" = "Team"), relationship = "many-to-many")

head(joined)
# A tibble: 6 × 29
  Season Team    Position surname  given  fullname InjuryType injurygroup CapHit
  <chr>  <chr>   <chr>    <chr>    <chr>  <chr>    <chr>      <chr>        <dbl>
1 2006   Anaheim F        Brennan  " Kip" Kip Bre… Shoulder   upper           NA
2 2006   Anaheim F        Fedorov  " Ser… Sergei … Groin      lower           NA
3 2006   Anaheim F        Fedoruk  " Tod… Todd Fe… Back       upper           NA
4 2006   Anaheim F        Getzlaf  " Rya… Ryan Ge… Shoulder   upper           NA
5 2006   Anaheim F        Hedström " Jon… Jonatha… Groin      lower           NA
6 2006   Anaheim F        Konopka  " Zen… Zenon K… Ankle      lower           NA
# ℹ 20 more variables: Chip <dbl>, GamesMissed <dbl>, Nationality <chr>,
#   Pos <chr>, Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
#   `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
#   `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
#   ATOI_min <dbl>

Please note that joins sometimes failed due to the player never actually playing a game (during the regular season) with the team they were signed to when reported as injured. In some cases, this is because they missed the entire season due to injury. In others, it is because they were traded or otherwise changed teams mid-season, or were signed to an NHL level contract, but played solely for the minor league affiliate team.

Filter for only active Forwards and Defensmen

Goalies have completely different stats that I did not grab. They also had all been wearing full face masks.

The “retired” status is for players who haven’t played since a previous season due to injuries or physical condition but had not officially retired. Those players would matter for salary cap investigations, but I’m not looking at money at all.

unique(joined$Position)
[1] "F"             "D"             "G"             "D \"retired\""
[5] "G \"retired\""
join1 <- joined |>
  filter(Position %in% c("F", "D")) |>
  filter(!is.na(Nationality)) |> # as the first column from the second dataframe, any row that failed to join doesn't have a value in Nationality and thus will be dropped.

  select(-CapHit, -Chip) # remove the columns concerning injury time effect on team's salary cap, as that is outside of the scope of this project

head(join1)
# A tibble: 6 × 27
  Season Team    Position surname  given       fullname   InjuryType injurygroup
  <chr>  <chr>   <chr>    <chr>    <chr>       <chr>      <chr>      <chr>      
1 2006   Anaheim F        Brennan  " Kip"      Kip Brenn… Shoulder   upper      
2 2006   Anaheim F        Fedorov  " Sergei"   Sergei Fe… Groin      lower      
3 2006   Anaheim F        Fedoruk  " Todd"     Todd Fedo… Back       upper      
4 2006   Anaheim F        Getzlaf  " Ryan"     Ryan Getz… Shoulder   upper      
5 2006   Anaheim F        Hedström " Jonathan" Jonathan … Groin      lower      
6 2006   Anaheim F        Konopka  " Zenon"    Zenon Kon… Ankle      lower      
# ℹ 19 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
#   Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
#   `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
#   `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
#   ATOI_min <dbl>

Filter for only injury group head

headjoin <- join1 |> 
  filter(injurygroup == "head")

head(headjoin)
# A tibble: 6 × 27
  Season Team    Position surname     given      fullname InjuryType injurygroup
  <chr>  <chr>   <chr>    <chr>       <chr>      <chr>    <chr>      <chr>      
1 2006   Anaheim F        Lupul       " Joffrey" Joffrey… Concussion head       
2 2006   Anaheim F        Niedermayer " Rob"     Rob Nie… Concussion head       
3 2006   Anaheim F        Perry       " Corey"   Corey P… Concussion head       
4 2006   Anaheim D        Marshall    " Jason"   Jason M… Nose       head       
5 2006   Anaheim D        Salei       " Ruslan"  Ruslan … Orbital b… head       
6 2006   Boston  D        Leetch      " Brian"   Brian L… Head       head       
# ℹ 19 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
#   Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
#   `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
#   `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
#   ATOI_min <dbl>

Spell out the nationalities

sort(unique(headjoin$Nationality))
 [1] "CA" "CH" "CS" "CZ" "DE" "FI" "RU" "SE" "SU" "US"
headjoin$Nationality[headjoin$Nationality == "CA"] <- "Canada"

headjoin$Nationality[headjoin$Nationality ==  "CH"] <- "Switzerland"

headjoin$Nationality[headjoin$Nationality == "CS"] <- "Czechoslovakia"

headjoin$Nationality[headjoin$Nationality == "DE"] <- "Germany"

headjoin$Nationality[headjoin$Nationality == "FI"] <- "Finland"

headjoin$Nationality[headjoin$Nationality == "RU"] <- "Russia"

headjoin$Nationality[headjoin$Nationality == "SE"] <- "Sweden"

headjoin$Nationality[headjoin$Nationality == "SU"] <- "Soviet Union" 

headjoin$Nationality[headjoin$Nationality == "US"] <- "USA"

The original column Nationality was created from was “Birth”, and since many of the players – everyone in 2006 – were born before the collapse of the Soviet Union – all players from that area are listed as born there, instead of the component countries they now hold citizenship in. As I am realizing this incredibly late, this is just going to be a quirk of the data.

create a binary is_concussion

for the logistic regression

headjoin2 <- headjoin |>
  mutate(is_concussion = ifelse(InjuryType == "Concussion", 1, 0)) # 1 if concussion, 0 if not
head(headjoin2)
# A tibble: 6 × 28
  Season Team    Position surname     given      fullname InjuryType injurygroup
  <chr>  <chr>   <chr>    <chr>       <chr>      <chr>    <chr>      <chr>      
1 2006   Anaheim F        Lupul       " Joffrey" Joffrey… Concussion head       
2 2006   Anaheim F        Niedermayer " Rob"     Rob Nie… Concussion head       
3 2006   Anaheim F        Perry       " Corey"   Corey P… Concussion head       
4 2006   Anaheim D        Marshall    " Jason"   Jason M… Nose       head       
5 2006   Anaheim D        Salei       " Ruslan"  Ruslan … Orbital b… head       
6 2006   Boston  D        Leetch      " Brian"   Brian L… Head       head       
# ℹ 20 more variables: GamesMissed <dbl>, Nationality <chr>, Pos <chr>,
#   Age <dbl>, height <dbl>, Wt <dbl>, shoots <chr>, Exp <dbl>,
#   `Birth Date` <chr>, GP <dbl>, Goals <dbl>, Assists <dbl>, PTS <dbl>,
#   `+/-` <dbl>, PIM <dbl>, TOI <chr>, ATOI <chr>, TOI_min <dbl>,
#   ATOI_min <dbl>, is_concussion <dbl>

Exploratory Data Analysis

Make histograms or density plots for all numeric data

p1 <- ggplot(headjoin2, aes(Age)) +
  geom_density(fill = "#b4d4ed") +
  labs(x = "Age") +
  theme_bw()

p2<-ggplot(headjoin2, aes(Exp)) +
  geom_density(fill = "#b4d4ed") +
  labs(x = "Experience") +
  theme_bw()

p3<- ggplot(headjoin2, aes(height)) +
  geom_density(fill = "#b4d4ed") +
  labs(x = "Height (inches)") +
  theme_bw()

p4 <- ggplot(headjoin2, aes(Wt)) +
  geom_density(fill = "#b4d4ed") +
  labs(x = "Weight (lbs)") +
  theme_bw()
  
p5<-ggplot(headjoin2, aes(GP)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Games Played (GP)") +
  theme_bw()

p6<- ggplot(headjoin2, aes(PTS)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Points (PTS = Goals + Assists)") +
  theme_bw() 

p7 <- ggplot(headjoin2, aes(Goals)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Goals") +
  theme_bw()

p8 <- ggplot(headjoin2, aes(Assists)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Assists") +
  theme_bw()

p9 <- ggplot(headjoin2, aes(`+/-`)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Plus Minus: Goal Differential") +
  theme_bw()

p10 <- ggplot(headjoin2, aes(PIM)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Penalties in Minutes") +
  theme_bw()

p11 <- ggplot(headjoin2, aes(TOI_min)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Total Time on Ice, minutes") +
  theme_bw()

p12 <- ggplot(headjoin2, aes(ATOI_min)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Average Time on Ice, minutes") +
  theme_bw()

p13 <- ggplot(headjoin2, aes(GamesMissed)) +
  geom_histogram(fill = "#b4d4ed") +
  labs(x = "Games Missed per Injury") +
  theme_bw()

Display the first group of Numeric Distributions in a Grid

# display in a grid; plot_annotation shares gives one title and caption  
p1 + p2 + p3 + p4 + p5 + p10   + plot_layout(ncol = 2) + plot_annotation(
  title = "Distrubtions of Numeric Variables for Injuries in 2005-2006 and 2024-2025 Seasons",
  caption = "Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca,
   and sportsforcaster.com, and sportsradar")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Summary Statistics:

summary(headjoin2[c('Age','Exp', 'height', 'Wt', 'GP', 'PIM')])
      Age             Exp             height            Wt       
 Min.   :20.00   Min.   : 0.000   Min.   :68.00   Min.   :176.0  
 1st Qu.:24.00   1st Qu.: 2.000   1st Qu.:72.00   1st Qu.:196.8  
 Median :28.00   Median : 5.000   Median :73.00   Median :205.0  
 Mean   :27.81   Mean   : 6.139   Mean   :73.66   Mean   :206.6  
 3rd Qu.:31.00   3rd Qu.: 9.000   3rd Qu.:75.00   3rd Qu.:216.2  
 Max.   :38.00   Max.   :18.000   Max.   :80.00   Max.   :248.0  
       GP             PIM        
 Min.   : 2.00   Min.   :  2.00  
 1st Qu.:40.00   1st Qu.: 20.75  
 Median :59.00   Median : 37.50  
 Mean   :54.58   Mean   : 43.65  
 3rd Qu.:75.00   3rd Qu.: 58.00  
 Max.   :81.00   Max.   :156.00  

The distribution of Age is close enough to be called nearly normal. Players who had head injures ranged from age 20-38, which is nearly the age range of the whole league. The mean and median are nearly identical, at 27.81 and 28.

Experience is right skewed, and again shows nearly the whole range of the league from 0-18 years. The mean and median are a little farther apart than age’s were, at 5 and 6.139 respectivel.

Height and weight are nearly normal. These ranges, 68-80 in for height, and 176-248 lbs for weight, do not show the range of the entire league, as they are missing the smaller players. This is opposite of what I would have expected, honestly.

Games Played is a wild mess. It is strongly left skewed with two main clusters. it rangers from 2 to 81, with the mean at 54.48 and median at 59. This is mostly interesting as it means that that most players who got a head injury played over half the season.

Penalties in minutes is another long ranged, from 2 minutes (i.e. a single minor penalty) to 156 minutes. It is skewed right with probable outliers.

Second Group of Numerical Varibles

p7 + p8 + p6 + p9 + p11 + p12 + p13 + plot_layout(ncol = 2) + plot_annotation(
  title = "Distrubtions of Numeric Variables for Injuries in 2005-2006 and 2024-2025 Seasons",
  caption = "Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca,
   and sportsforcaster.com, and sportsradar")

summary(headjoin2[c("Goals", "Assists", "PTS", "TOI_min", "ATOI_min", "GamesMissed")]) # this is refusing to believe `+/-` exists, so will call separately
     Goals           Assists           PTS           TOI_min         
 Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   :   0.8333  
 1st Qu.: 2.000   1st Qu.: 5.00   1st Qu.: 7.00   1st Qu.: 511.0750  
 Median : 6.000   Median :11.00   Median :18.00   Median : 935.9083  
 Mean   : 8.583   Mean   :14.72   Mean   :23.31   Mean   : 869.9593  
 3rd Qu.:11.000   3rd Qu.:22.00   3rd Qu.:32.25   3rd Qu.:1233.4083  
 Max.   :44.000   Max.   :47.00   Max.   :78.00   Max.   :1850.1333  
    ATOI_min        GamesMissed   
 Min.   : 0.4167   Min.   : 1.00  
 1st Qu.:12.1167   1st Qu.: 2.00  
 Median :15.3583   Median : 4.00  
 Mean   :15.0599   Mean   : 7.63  
 3rd Qu.:18.7542   3rd Qu.: 9.00  
 Max.   :24.6667   Max.   :73.00  
summary(headjoin2$`+/-`)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-24.0000  -7.0000  -1.0000  -0.6389   3.0000  29.0000 

(The awkward bottom row of the output is for Plus-minus)

Goals, Assists, and Points are all skewed right – as Points is the sum of the first two, this makes sense. Goals has the most isolated cluster on the left side, with assists and points more spread out. Again this makes sense, as more players get assists than goals.

Time on Ice is all over the place. the min is under a minute, and the max is 1850.1333 minutes. WIth that range, the mean and median are surprisingly close at approx. 870 and approx. 936.

Average Time on Ice is near enough that I can call it nearly normal. A slightly less vast range, if only because ATOI gets divided by games played, it ranges from a little under half a minute to 24.6667 minutes. The mean and median are both around 15 minutes.

GamesMissed is highly right skewed, with a few observations very far from the main cluster, and will need a logarithmic transformation for any statistical tests. It ranges from 1 to 73, but the mean and median are way down at 7.63 and 4, showing that the max is most definitely an outlier.

Chi-squared test Between InjuryType and Season

headcount <- headjoin2 |>
  group_by(Season, InjuryType) |>
  count()

ggplot(headcount, aes(reorder(InjuryType, n), n,  fill = Season)) +
  geom_col()+
  coord_flip() +
  labs(title= "InjuryType by Season",
       x = "Head Injury TYpe",
       y = "Count",
       caption =  "Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca,
   and sportsforcaster.com, and sportsradar") +
  theme_bw() +
  scale_fill_manual(values = c("#b4d4ed","#83b6de"))

Null Hypothesis: There is no association between Season and InjuryType

Alternative Hypothesis: There is an association between Season and Injury Type

Check Conditions:

Independence: the injuries are NOT indenpendent, so my results are not going to be valid

Expected cell counts:

chisq.test(headjoin2$Season, headjoin2$InjuryType)$expected
Warning in chisq.test(headjoin2$Season, headjoin2$InjuryType): Chi-squared
approximation may be incorrect
                headjoin2$InjuryType
headjoin2$Season Concussion Dizziness       Ear      Eye   Facial      Head
            2006    33.7963 1.3518519 0.6759259 5.407407 9.462963 12.842593
            2025    16.2037 0.6481481 0.3240741 2.592593 4.537037  6.157407
                headjoin2$InjuryType
headjoin2$Season      Jaw      Nose Orbital bone
            2006 4.731481 2.0277778     2.703704
            2025 2.268519 0.9722222     1.296296

Multiple expected cell counts less than 5. Proceeding to non-parametric fisher exact text:

fisher.test(headjoin2$Season, headjoin2$InjuryType)

    Fisher's Exact Test for Count Data

data:  headjoin2$Season and headjoin2$InjuryType
p-value = 0.1636
alternative hypothesis: two.sided

With a p-value of 0.1636, I fail to reject the null: there is no evidence that there is an association between Season and InjuryType.

Difference of TWo Means Bootstrap: GamesMissed for each Season

Null Hypothesis: The mean GamesMissed for 2006 for head injuries is the same as the mean GamesMissed for 2025 for head injuries.

Check conditions:

Independence: Not exactly independent, so results aren’t exactly valid!

Distribution and counts: Need to transform GamesMissed

ggplot(headjoin2, aes(log(GamesMissed), color = Season)) +
  geom_density(linewidth = 2) +
  labs(title = 'Distribution of Log(GamesMissed)',
       x = "Log(GamesMissed)",
       color = "Season",
       caption = "Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca, and sportsforcaster.com, and sportsradar") +
  scale_color_manual(values = c("#b4d4ed","#83b6de"))

headjoin |>
  group_by(Season) |>
  count()
# A tibble: 2 × 2
# Groups:   Season [2]
  Season     n
  <chr>  <int>
1 2006      73
2 2025      35

The log transformation of GamesMissed is better, and while still right skewed, close enough to normal. (The other transformations are worse.) Both seasons have counts of over 30, so I can proceed.

Null Hypothesis: The mean GamesMissed for 2006 for head injuries is the same as the mean GamesMissed for 2025 for head injuries.

Alternative Hypothesis: The mean GamesMisssed for ’06 and ’25 are different.

ggplot(headjoin2, aes(x = Season, y = log(GamesMissed))) +
  geom_boxplot(aes(fill = Season)) +
  geom_jitter(alpha = 0.8) + 
  labs(title = "Games Missed For Head Injuries by Season",
       x= "Season",
       y = "Games Missed",
       caption = "Source: puckpedia.com, cbssports.com, the former capfriendly.com, tsn.ca,
   and sportsforcaster.com, and sportsradar" ) +
  theme_bw() +
  scale_fill_manual(values = c("#b4d4ed","#83b6de")) +
  theme(legend.position = "none")

diff_mean_ci <- headjoin2 |>
  specify(GamesMissed ~ Season) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("2006", "2025"))

head(diff_mean_ci)
Response: GamesMissed (numeric)
Explanatory: Season (factor)
# A tibble: 6 × 2
  replicate  stat
      <int> <dbl>
1         1  3.56
2         2  2.12
3         3  4.12
4         4  4.29
5         5  3.87
6         6  2.94

Calculate the 95% Confidence interval:

diff_mean_ci |>
  get_ci(level = 0.95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1   -0.494     6.47

As zero is included in the confidence interval range, I fail to reject the null hypothesis. I am 95% confident that there is no true difference in mean games missed for head injuries between the 2006 and 2025 seasons. ## Model: Logistic RegressionL Probability a head injury is a concussion

fit1 <- glm(is_concussion ~ GP, family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve
summary(fit1)

Call:
glm(formula = is_concussion ~ GP, family = binomial(), data = headjoin2)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.782530   0.536227   1.459   0.1445  
GP          -0.017095   0.009157  -1.867   0.0619 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 145.52  on 106  degrees of freedom
AIC: 149.52

Number of Fisher Scoring iterations: 4
ggplot(headjoin2, aes(x = GP, y = is_concussion)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.5) +
  geom_smooth(method = "glm", color = "#b4d4ed", method.args = list(family = "binomial"), se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Fit the model

full_model <- glm(is_concussion ~ factor(Season) + factor(Team)+ factor(Position) + Nationality + GamesMissed + Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(full_model)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Team) + 
    factor(Position) + Nationality + GamesMissed + Age + GP + 
    Wt + factor(shoots) + PTS + `+/-` + ATOI_min, family = binomial(), 
    data = headjoin2)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)  
(Intercept)                1.934e+01  9.362e+00   2.065   0.0389 *
factor(Season)2025         1.728e+00  1.023e+00   1.689   0.0912 .
factor(Team)Boston        -1.983e+01  1.075e+04  -0.002   0.9985  
factor(Team)Buffalo       -3.049e-01  1.528e+00  -0.200   0.8418  
factor(Team)Calgary       -3.974e+00  1.951e+00  -2.037   0.0416 *
factor(Team)Carolina      -7.796e-01  1.825e+00  -0.427   0.6693  
factor(Team)Chicago        1.857e+01  5.963e+03   0.003   0.9975  
factor(Team)Columbus      -9.056e-01  1.445e+00  -0.627   0.5307  
factor(Team)Dallas        -2.125e+01  4.283e+03  -0.005   0.9960  
factor(Team)Detroit       -2.183e+01  7.576e+03  -0.003   0.9977  
factor(Team)Edmonton       1.746e+01  1.075e+04   0.002   0.9987  
factor(Team)Florida       -3.143e+00  2.284e+00  -1.376   0.1688  
factor(Team)Los Angeles   -1.274e+00  2.003e+00  -0.636   0.5247  
factor(Team)Minnesota     -3.416e+00  2.143e+00  -1.594   0.1109  
factor(Team)Nashville     -3.488e+00  1.962e+00  -1.778   0.0754 .
factor(Team)New Jersey    -3.533e+00  2.404e+00  -1.470   0.1417  
factor(Team)NY Islanders  -7.473e-01  1.635e+00  -0.457   0.6477  
factor(Team)NY Rangers    -2.433e+01  6.245e+03  -0.004   0.9969  
factor(Team)Ottawa         4.499e-01  1.550e+00   0.290   0.7716  
factor(Team)Philadelphia   2.557e-02  1.696e+00   0.015   0.9880  
factor(Team)Phoenix       -3.276e+00  2.397e+00  -1.367   0.1717  
factor(Team)Pittsburgh    -1.899e+00  1.771e+00  -1.072   0.2835  
factor(Team)San Jose       1.088e-01  2.301e+00   0.047   0.9623  
factor(Team)St. Louis     -1.108e+00  2.131e+00  -0.520   0.6030  
factor(Team)Tampa Bay     -9.582e-01  2.108e+00  -0.455   0.6494  
factor(Team)Toronto        1.605e+00  3.459e+00   0.464   0.6426  
factor(Team)Utah          -2.066e+01  1.075e+04  -0.002   0.9985  
factor(Team)Vancouver     -2.435e+00  1.635e+00  -1.490   0.1363  
factor(Team)Washington    -1.634e+01  1.075e+04  -0.002   0.9988  
factor(Team)Winnipeg       3.376e+01  7.245e+03   0.005   0.9963  
factor(Position)F          2.817e+00  1.323e+00   2.129   0.0333 *
NationalityCZ              1.524e+01  7.404e+03   0.002   0.9984  
NationalityCzechoslovakia -2.244e-01  1.601e+00  -0.140   0.8885  
NationalityFinland        -1.800e+01  4.749e+03  -0.004   0.9970  
NationalityGermany         1.693e+01  7.306e+03   0.002   0.9982  
NationalityRussia         -1.308e-01  1.940e+00  -0.067   0.9462  
NationalitySoviet Union    2.400e+00  1.529e+00   1.570   0.1164  
NationalitySweden         -8.946e-01  1.480e+00  -0.605   0.5455  
NationalitySwitzerland     4.032e+01  1.158e+04   0.003   0.9972  
NationalityUSA            -6.540e-01  1.057e+00  -0.619   0.5361  
GamesMissed               -2.971e-02  3.700e-02  -0.803   0.4221  
Age                       -6.608e-02  8.606e-02  -0.768   0.4426  
GP                        -4.538e-02  2.413e-02  -1.881   0.0600 .
Wt                        -7.587e-02  3.566e-02  -2.128   0.0334 *
factor(shoots)R           -1.433e+00  8.367e-01  -1.713   0.0868 .
PTS                       -2.349e-02  3.815e-02  -0.616   0.5381  
`+/-`                     -3.132e-02  4.035e-02  -0.776   0.4376  
ATOI_min                   5.536e-02  1.491e-01   0.371   0.7104  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.127  on 107  degrees of freedom
Residual deviance:  82.986  on  60  degrees of freedom
AIC: 178.99

Number of Fisher Scoring iterations: 18

See what removing Team does – some have very high p-values

model2 <- glm(is_concussion ~ factor(Season) + factor(Position) + Nationality + GamesMissed + Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model2)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) + 
    Nationality + GamesMissed + Age + GP + Wt + factor(shoots) + 
    PTS + `+/-` + ATOI_min, family = binomial(), data = headjoin2)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)  
(Intercept)                 11.06150    5.27966   2.095   0.0362 *
factor(Season)2025           0.53137    0.56274   0.944   0.3450  
factor(Position)F            1.17107    0.82861   1.413   0.1576  
NationalityCZ               14.36280 1665.41629   0.009   0.9931  
NationalityCzechoslovakia    0.39530    1.04098   0.380   0.7041  
NationalityFinland          -0.10589    1.23623  -0.086   0.9317  
NationalityGermany          15.90907 1688.74714   0.009   0.9925  
NationalityRussia            0.55801    1.62543   0.343   0.7314  
NationalitySoviet Union      0.71796    1.00166   0.717   0.4735  
NationalitySweden           -0.82357    0.92568  -0.890   0.3736  
NationalitySwitzerland      17.01237 2399.54490   0.007   0.9943  
NationalityUSA              -0.09172    0.60745  -0.151   0.8800  
GamesMissed                  0.01273    0.02572   0.495   0.6208  
Age                         -0.11174    0.05970  -1.872   0.0612 .
GP                          -0.02372    0.01611  -1.472   0.1410  
Wt                          -0.04272    0.01978  -2.159   0.0308 *
factor(shoots)R             -0.35413    0.49563  -0.715   0.4749  
PTS                         -0.01830    0.02555  -0.716   0.4739  
`+/-`                       -0.02170    0.02687  -0.808   0.4194  
ATOI_min                     0.09351    0.10146   0.922   0.3567  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 120.38  on  88  degrees of freedom
AIC: 160.38

Number of Fisher Scoring iterations: 15

That helped. See if removing nationality helps.

model3<- glm(is_concussion ~ factor(Season) + factor(Position) + GamesMissed + Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model3)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) + 
    GamesMissed + Age + GP + Wt + factor(shoots) + PTS + `+/-` + 
    ATOI_min, family = binomial(), data = headjoin2)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)  
(Intercept)        10.804177   4.975071   2.172   0.0299 *
factor(Season)2025  0.499764   0.481003   1.039   0.2988  
factor(Position)F   0.961435   0.802260   1.198   0.2308  
GamesMissed         0.005791   0.023328   0.248   0.8039  
Age                -0.134712   0.055702  -2.418   0.0156 *
GP                 -0.027012   0.014905  -1.812   0.0699 .
Wt                 -0.036533   0.018133  -2.015   0.0439 *
factor(shoots)R    -0.499054   0.462877  -1.078   0.2810  
PTS                -0.011735   0.023346  -0.503   0.6152  
`+/-`              -0.025867   0.025204  -1.026   0.3048  
ATOI_min            0.091491   0.096183   0.951   0.3415  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 125.72  on  97  degrees of freedom
AIC: 147.72

Number of Fisher Scoring iterations: 3

Now remove GamesMissed, as highest p-value.

model4<- glm(is_concussion ~ factor(Season) + factor(Position) + Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model4)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) + 
    Age + GP + Wt + factor(shoots) + PTS + `+/-` + ATOI_min, 
    family = binomial(), data = headjoin2)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)  
(Intercept)        10.65472    4.93231   2.160   0.0308 *
factor(Season)2025  0.49312    0.48000   1.027   0.3043  
factor(Position)F   0.98889    0.79460   1.245   0.2133  
Age                -0.13285    0.05508  -2.412   0.0159 *
GP                 -0.02818    0.01418  -1.987   0.0469 *
Wt                 -0.03562    0.01775  -2.007   0.0447 *
factor(shoots)R    -0.51216    0.45981  -1.114   0.2653  
PTS                -0.01224    0.02325  -0.526   0.5986  
`+/-`              -0.02514    0.02500  -1.005   0.3147  
ATOI_min            0.09277    0.09601   0.966   0.3339  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 125.78  on  98  degrees of freedom
AIC: 145.78

Number of Fisher Scoring iterations: 3

AIC Still going down. Now remove PTS as highest p-value

model5<- glm(is_concussion ~ factor(Season) + factor(Position) + Age + GP + Wt + factor(shoots) + `+/-` + ATOI_min , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model5)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) + 
    Age + GP + Wt + factor(shoots) + `+/-` + ATOI_min, family = binomial(), 
    data = headjoin2)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        11.80937    4.43393   2.663  0.00774 **
factor(Season)2025  0.50091    0.48000   1.044  0.29669   
factor(Position)F   0.71986    0.60165   1.196  0.23151   
Age                -0.13307    0.05489  -2.424  0.01534 * 
GP                 -0.03182    0.01247  -2.552  0.01072 * 
Wt                 -0.03821    0.01704  -2.243  0.02492 * 
factor(shoots)R    -0.50242    0.45923  -1.094  0.27393   
`+/-`              -0.02898    0.02387  -1.214  0.22467   
ATOI_min            0.05797    0.06868   0.844  0.39868   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 126.06  on  99  degrees of freedom
AIC: 144.06

Number of Fisher Scoring iterations: 3

Slightly lower AIC; now remove ATOI_min

model6<- glm(is_concussion ~ factor(Season) + factor(Position) + Age + GP + Wt + factor(shoots) + `+/-` , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model6)

Call:
glm(formula = is_concussion ~ factor(Season) + factor(Position) + 
    Age + GP + Wt + factor(shoots) + `+/-`, family = binomial(), 
    data = headjoin2)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        13.36402    4.09724   3.262  0.00111 **
factor(Season)2025  0.50490    0.47834   1.056  0.29119   
factor(Position)F   0.41324    0.47288   0.874  0.38218   
Age                -0.11789    0.05146  -2.291  0.02197 * 
GP                 -0.02712    0.01099  -2.467  0.01362 * 
Wt                 -0.04369    0.01594  -2.741  0.00613 **
factor(shoots)R    -0.55077    0.45633  -1.207  0.22745   
`+/-`              -0.02413    0.02315  -1.042  0.29744   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 126.78  on 100  degrees of freedom
AIC: 142.78

Number of Fisher Scoring iterations: 3

Still lowering slightly. Now remove Position

model7<- glm(is_concussion ~ factor(Season)  + Age + GP + Wt + factor(shoots) + `+/-` , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model7)

Call:
glm(formula = is_concussion ~ factor(Season) + Age + GP + Wt + 
    factor(shoots) + `+/-`, family = binomial(), data = headjoin2)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        13.82003    4.07471   3.392 0.000695 ***
factor(Season)2025  0.47077    0.47412   0.993 0.320744    
Age                -0.12249    0.05098  -2.403 0.016269 *  
GP                 -0.02664    0.01100  -2.422 0.015436 *  
Wt                 -0.04407    0.01593  -2.765 0.005684 ** 
factor(shoots)R    -0.50373    0.45129  -1.116 0.264338    
`+/-`              -0.02456    0.02307  -1.064 0.287200    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 127.55  on 101  degrees of freedom
AIC: 141.55

Number of Fisher Scoring iterations: 3

I am surprised that Season is my largest p-value

model8<- glm(is_concussion ~ Age + GP + Wt + factor(shoots) + `+/-` , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model8)

Call:
glm(formula = is_concussion ~ Age + GP + Wt + factor(shoots) + 
    `+/-`, family = binomial(), data = headjoin2)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     14.46598    3.99479   3.621 0.000293 ***
Age             -0.13111    0.05035  -2.604 0.009222 ** 
GP              -0.02621    0.01088  -2.410 0.015962 *  
Wt              -0.04544    0.01568  -2.898 0.003759 ** 
factor(shoots)R -0.48809    0.44684  -1.092 0.274694    
`+/-`           -0.02127    0.02277  -0.934 0.350087    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 128.54  on 102  degrees of freedom
AIC: 140.54

Number of Fisher Scoring iterations: 3

AIC is still decreasing SLIGHTLY. Now remove Plus Minus

model9<- glm(is_concussion ~ Age + GP + Wt + factor(shoots) , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model9)

Call:
glm(formula = is_concussion ~ Age + GP + Wt + factor(shoots), 
    family = binomial(), data = headjoin2)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     14.43681    3.99554   3.613 0.000302 ***
Age             -0.12364    0.04956  -2.495 0.012609 *  
GP              -0.02732    0.01082  -2.525 0.011554 *  
Wt              -0.04595    0.01567  -2.932 0.003372 ** 
factor(shoots)R -0.49713    0.44324  -1.122 0.262048    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 129.43  on 103  degrees of freedom
AIC: 139.43

Number of Fisher Scoring iterations: 3

Still decreasing a little. Removing shoots next.

model10<- glm(is_concussion ~ Age + GP + Wt  , family = binomial(), data = headjoin2) # family = bimonial makes the sigmoid curve

summary(model10)

Call:
glm(formula = is_concussion ~ Age + GP + Wt, family = binomial(), 
    data = headjoin2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 13.81041    3.89129   3.549 0.000387 ***
Age         -0.11611    0.04842  -2.398 0.016481 *  
GP          -0.02726    0.01074  -2.538 0.011158 *  
Wt          -0.04486    0.01548  -2.898 0.003754 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 149.13  on 107  degrees of freedom
Residual deviance: 130.71  on 104  degrees of freedom
AIC: 138.71

Number of Fisher Scoring iterations: 4

All p-values are very small. The AIC is still rather large, at 138.71 though. This is probably not a very good model.

Manually Making the Confusion Matrix

concussion_plus <- model10 |>
  augment(type.predict = "response") |>
  mutate(y_hat = .fitted)

head(concussion_plus)
# A tibble: 6 × 11
  is_concussion   Age    GP    Wt .fitted .resid   .hat .sigma .cooksd
          <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1             1    22    81   206   0.452  1.26  0.0478   1.12 0.0160 
2             1    31    76   203   0.275  1.61  0.0234   1.12 0.0161 
3             1    20    56   210   0.632  0.958 0.0414   1.12 0.00656
4             0    34    23   200   0.565 -1.29  0.0682   1.12 0.0255 
5             0    31    78   212   0.194 -0.656 0.0263   1.12 0.00167
6             0    37    61   185   0.390 -0.994 0.0727   1.12 0.0135 
# ℹ 2 more variables: .std.resid <dbl>, y_hat <dbl>
fit_plus <- augment(model10, type.predict ="response") |>
  mutate(concussion_hat = round(.fitted))

fit_plus |>
  select(is_concussion, concussion_hat) |>
  table()
             concussion_hat
is_concussion  0  1
            0 40 18
            1 21 29
accuracy_mod10 <- (40 +29) / (40+29 +18 +21)
accuracy_mod10
[1] 0.6388889