#Package Loading
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
#Loading the MoneyPuck Shot Dataset
mpd = read.csv('./shots_2024.csv')
#adding descriptors to dataframe
# Load the data dictionary (update with your file path)
data_dict <- read.csv('./MoneyPuck_Shot_Data_Dictionary (1) (1).csv')
# Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)
for (i in 1:nrow(data_dict)) {
column_name <- data_dict$Variable[i]
description <- data_dict$Definition[i]
if (column_name %in% colnames(mpd)) {
label(mpd[[column_name]]) <- description
}
}
For this analysis, we will make use of the teamCode and shooterplayerID column.
table(mpd$teamCode)
##
## ANA BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL NJD
## 1792 1839 1812 2227 1836 1866 1593 1909 1872 1624 1965 2035 1728 1802 1639 2018
## NSH NYI NYR OTT PHI PIT SEA SJS STL TBL TOR UTA VAN VGK WPG WSH
## 1831 1820 1878 1834 1795 1943 1867 1794 1704 1698 1846 1791 1610 1890 1849 1852
This summary validates that there is data for all 32 NHL teams, and all teams have a reasonable amount of shot data.
Any summary statistics of shooterPlayerID would not be super valuable as this is an ID column.
To answer the question, I will aggregate on teamCode and keep the players teams grouped as a vector. I will then filter by if the vector is a list (indicating multiple teams), and then output the remaining players names.
#constructing the aggregated dataframe
player_group_team = mpd |> group_by(shooterPlayerId) |>
summarize(teamCode = list(unique(teamCode)),
shooterName=list(unique(shooterName))
)
#filtering by players who played for multiple teams
mult_teams = player_group_team |> filter(lengths(teamCode) > 1)
#printing a list of names
mult_teams$shooterName
## [[1]]
## [1] "Lars Eller"
##
## [[2]]
## [1] "Cam Fowler"
##
## [[3]]
## [1] "Olli Maatta"
##
## [[4]]
## [1] "Jacob Trouba"
##
## [[5]]
## [1] "Kasperi Kapanen"
##
## [[6]]
## [1] "Travis Dermott"
##
## [[7]]
## [1] "Daniel Sprong"
##
## [[8]]
## [1] "Will Borgen"
##
## [[9]]
## [1] "Alexandre Carrier"
##
## [[10]]
## [1] "Dante Fabbro"
##
## [[11]]
## [1] "Givani Smith"
##
## [[12]]
## [1] "Urho Vaakanainen"
##
## [[13]]
## [1] "Pierre-Olivier Joseph" "P.O Joseph"
##
## [[14]]
## [1] "Oliver Wahlstrom"
##
## [[15]]
## [1] "Nikolai Kovalenko"
##
## [[16]]
## [1] "Kaapo Kakko"
##
## [[17]]
## [1] "Philip Tomasino"
##
## [[18]]
## [1] "Juuso Parssinen"
##
## [[19]]
## [1] "Justin Barron"
##
## [[20]]
## [1] "David Jiricek"
The analysis found that 20 NHL players took shots while playing for multiple teams this season. All of these cases were the result of trades. This shows how trades play a role in redistributing players who contribute offensively. These players are often key assets, and their movement can have a noticeable impact on team performance. It would be interesting to follow up on these players xG and Goals Above xG and analyze their contributions to their new team versus their old team.
For this analysis, I will make use of the xGoal column and the goalieNameForShot (if its blank its an empty net).
#is this equal to whitespace
mpd_no_en = mpd |> select(goalieNameForShot, xGoal) |> filter(trimws(mpd$goalieNameForShot) != "")
#summary
summary(mpd_no_en$xGoal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001442 0.014450 0.034358 0.068005 0.082898 0.956395
This summary shows a large portion of the shots should be below .5 expected goals.
#The numerical solution
mpd_no_en_5_higher = mpd_no_en |> filter(xGoal >= .5)
length(mpd_no_en_5_higher$goalieNameForShot)
## [1] 434
This analysis shows that 434 non-empty net shots this season had an xGoal value of 0.5 or higher. This indicates that these shots had a higher chance of scoring. The finding is useful for understanding which shots are likely more dangerous, and we could further the analysis by looking at which areas of the ice these shots came from. Even further investigation could look at how many of these high xGoal shots actually resulted in goals. Additionally, we could look at which teams give up these high danger shots.
To answer this question, I will use shotType, arenaAdjustedShotDistance, shotAngle, and goal columns.
ws = mpd %>% filter(shotType %in% c("WRIST", "SLAP") & goal == 1)
ggplot(ws, aes(x = arenaAdjustedShotDistance, y = shotAngle, color = shotType)) +
geom_point() +
xlim(0, 100) +
ylim(-90, 90) +
ggtitle('Goals Shot Angle and Distance Distribution by Shot Type')
This analysis highlights the differences between slap shots and wrist shots in terms of shot angle and distance. Wrist shots are versatile, with a wide range of distances and angles, making them effective under pressure. Slap shots, however, are more concentrated at shorter distances due to their longer wind-up time, limiting their effectiveness in high-speed play. As the NHL has evolved to prioritize speed and adaptability, wrist shots have become more viable, while slap shots have declined. While slap shots still have situational value, their narrow range of effectiveness reflects their reduced role in modern hockey. It would be interesting to explore the xG of the different shot types.
ggplot(mpd, aes(x = lastEventCategory, fill = factor(goal))) +
geom_bar() + labs (x = 'Event Before Shot', y = "Count of Event", title="Distribution of Event Before Shot",fill="Goal")
The most common event before a shot is a faceoff, followed by another shot and then a block. While faceoffs happen most often, they result in very few goals relative to their frequency. Events like shots have a higher percentage of successful outcomes compared to their occurrence. Other events, such as a Delayed Penalty or Stoppage rarely happen and lead to very few goals. This suggests that certain events before a shot are more likely to result in a goal.
For the interest of better understanding the set, and to meet the assignment criteria, below are two more data summaries, one categorical and one numeric.
For the categorical summary, I will be summarizing the L/R shooter column to see which handedness have taken the most shots.
table(mpd$shooterLeftRight)
##
## L R
## 2879 34680 21000
Finally, I will look at the defending team time on ice to better understand the range of values in the column.
describe(mpd$defendingTeamAverageTimeOnIce)
## mpd$defendingTeamAverageTimeOnIce : The average playing time in seconds the shooting team's players have been on the ice
## n missing distinct Info Mean pMedian Gmd .05
## 58559 0 1259 1 34.05 32.62 20.09 8.50
## .10 .25 .50 .75 .90 .95
## 13.00 21.00 31.80 43.92 57.25 67.40
##
## lowest : 0 1 1.2 1.5 1.6
## highest: 169 169.8 175.833 176.833 185.4