LJ Data Dive - Summaries

#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

#Loading the MoneyPuck Shot Dataset
mpd = read.csv('./shots_2024.csv')

#adding descriptors to dataframe

# Load the data dictionary (update with your file path)
data_dict <- read.csv('./MoneyPuck_Shot_Data_Dictionary (1) (1).csv')

# Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)
for (i in 1:nrow(data_dict)) {
  column_name <- data_dict$Variable[i]
  description <- data_dict$Definition[i]
  
  if (column_name %in% colnames(mpd)) {
    label(mpd[[column_name]]) <- description
  }
}

Question 1: Which players have taken shots while playing for multiple teams this season?

For this analysis, we will make use of the teamCode and shooterplayerID column.

teamCode Summary

table(mpd$teamCode)

## 
##  ANA  BOS  BUF  CAR  CBJ  CGY  CHI  COL  DAL  DET  EDM  FLA  LAK  MIN  MTL  NJD 
## 1792 1839 1812 2227 1836 1866 1593 1909 1872 1624 1965 2035 1728 1802 1639 2018 
##  NSH  NYI  NYR  OTT  PHI  PIT  SEA  SJS  STL  TBL  TOR  UTA  VAN  VGK  WPG  WSH 
## 1831 1820 1878 1834 1795 1943 1867 1794 1704 1698 1846 1791 1610 1890 1849 1852

This summary validates that there is data for all 32 NHL teams, and all teams have a reasonable amount of shot data.

Any summary statistics of shooterPlayerID would not be super valuable as this is an ID column.

Answering the Question

To answer the question, I will aggregate on teamCode and keep the players teams grouped as a vector. I will then filter by if the vector is a list (indicating multiple teams), and then output the remaining players names.

#constructing the aggregated dataframe
player_group_team = mpd |> group_by(shooterPlayerId) |>
  summarize(teamCode = list(unique(teamCode)), 
            shooterName=list(unique(shooterName))
            )

#filtering by players who played for multiple teams
mult_teams = player_group_team |> filter(lengths(teamCode) > 1)

#printing a list of names
mult_teams$shooterName

## [[1]]
## [1] "Lars Eller"
## 
## [[2]]
## [1] "Cam Fowler"
## 
## [[3]]
## [1] "Olli Maatta"
## 
## [[4]]
## [1] "Jacob Trouba"
## 
## [[5]]
## [1] "Kasperi Kapanen"
## 
## [[6]]
## [1] "Travis Dermott"
## 
## [[7]]
## [1] "Daniel Sprong"
## 
## [[8]]
## [1] "Will Borgen"
## 
## [[9]]
## [1] "Alexandre Carrier"
## 
## [[10]]
## [1] "Dante Fabbro"
## 
## [[11]]
## [1] "Givani Smith"
## 
## [[12]]
## [1] "Urho Vaakanainen"
## 
## [[13]]
## [1] "Pierre-Olivier Joseph" "P.O Joseph"           
## 
## [[14]]
## [1] "Oliver Wahlstrom"
## 
## [[15]]
## [1] "Nikolai Kovalenko"
## 
## [[16]]
## [1] "Kaapo Kakko"
## 
## [[17]]
## [1] "Philip Tomasino"
## 
## [[18]]
## [1] "Juuso Parssinen"
## 
## [[19]]
## [1] "Justin Barron"
## 
## [[20]]
## [1] "David Jiricek"

The analysis found that 20 NHL players took shots while playing for multiple teams this season. All of these cases were the result of trades. This shows how trades play a role in redistributing players who contribute offensively. These players are often key assets, and their movement can have a noticeable impact on team performance. It would be interesting to follow up on these players xG and Goals Above xG and analyze their contributions to their new team versus their old team.

Question 2: How many non-empty net shots this season have had an expected goals value of .5 or higher?

For this analysis, I will make use of the xGoal column and the goalieNameForShot (if its blank its an empty net).

#is this equal to whitespace
mpd_no_en = mpd |> select(goalieNameForShot, xGoal) |> filter(trimws(mpd$goalieNameForShot) != "")

#summary
summary(mpd_no_en$xGoal)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001442 0.014450 0.034358 0.068005 0.082898 0.956395

This summary shows a large portion of the shots should be below .5 expected goals.

#The numerical solution
mpd_no_en_5_higher = mpd_no_en |> filter(xGoal >= .5)

length(mpd_no_en_5_higher$goalieNameForShot)

## [1] 434

This analysis shows that 434 non-empty net shots this season had an xGoal value of 0.5 or higher. This indicates that these shots had a higher chance of scoring. The finding is useful for understanding which shots are likely more dangerous, and we could further the analysis by looking at which areas of the ice these shots came from. Even further investigation could look at how many of these high xGoal shots actually resulted in goals. Additionally, we could look at which teams give up these high danger shots.

Question 3: What is the distribution of shot distance and goals for slap shot vs wrist shot goals?

To answer this question, I will use shotType, arenaAdjustedShotDistance, shotAngle, and goal columns.

ws = mpd %>% filter(shotType %in% c("WRIST", "SLAP") & goal == 1)

ggplot(ws, aes(x = arenaAdjustedShotDistance, y = shotAngle, color = shotType)) +
  geom_point() +
  xlim(0, 100) +
  ylim(-90, 90) +
  ggtitle('Goals Shot Angle and Distance Distribution by Shot Type')

This analysis highlights the differences between slap shots and wrist shots in terms of shot angle and distance. Wrist shots are versatile, with a wide range of distances and angles, making them effective under pressure. Slap shots, however, are more concentrated at shorter distances due to their longer wind-up time, limiting their effectiveness in high-speed play. As the NHL has evolved to prioritize speed and adaptability, wrist shots have become more viable, while slap shots have declined. While slap shots still have situational value, their narrow range of effectiveness reflects their reduced role in modern hockey. It would be interesting to explore the xG of the different shot types.

Question 4: What is the most common event to occur before a shot (essentially a distribution analysis of categorical data) and which lead to more goals?

ggplot(mpd, aes(x = lastEventCategory, fill = factor(goal))) +
  geom_bar() + labs (x = 'Event Before Shot', y = "Count of Event", title="Distribution of Event Before Shot",fill="Goal")

The most common event before a shot is a faceoff, followed by another shot and then a block. While faceoffs happen most often, they result in very few goals relative to their frequency. Events like shots have a higher percentage of successful outcomes compared to their occurrence. Other events, such as a Delayed Penalty or Stoppage rarely happen and lead to very few goals. This suggests that certain events before a shot are more likely to result in a goal.

Other Data Exploration (ADDED FOR SUBMISSION 2)

For the interest of better understanding the set, and to meet the assignment criteria, below are two more data summaries, one categorical and one numeric.

For the categorical summary, I will be summarizing the L/R shooter column to see which handedness have taken the most shots.

table(mpd$shooterLeftRight)

## 
##           L     R 
##  2879 34680 21000

Finally, I will look at the defending team time on ice to better understand the range of values in the column.

describe(mpd$defendingTeamAverageTimeOnIce)

## mpd$defendingTeamAverageTimeOnIce : The average playing time in seconds the shooting team's players have been on the ice 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##    58559        0     1259        1    34.05    32.62    20.09     8.50 
##      .10      .25      .50      .75      .90      .95 
##    13.00    21.00    31.80    43.92    57.25    67.40 
## 
## lowest : 0       1       1.2     1.5     1.6    
## highest: 169     169.8   175.833 176.833 185.4