# Load packages: Hmisc provides label(); tidyverse provides dplyr and ggplot2
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

# Load the MoneyPuck shot dataset
mpd <- read.csv('C:\\Users\\Logan\\Downloads\\shots_2024.csv')

# Load the data dictionary used to attach descriptions to the columns
# (update with your file path)
data_dict <- read.csv("C:\\Users\\Logan\\Downloads\\MoneyPuck_Shot_Data_Dictionary (1).csv")

# Attach each variable's definition as a label on the matching column
# (from ChatGPT -- QOL step)
for (i in seq_len(nrow(data_dict))) {
  column_name <- data_dict$Variable[i]
  description <- data_dict$Definition[i]

  # Skip dictionary entries that have no matching column in mpd
  if (column_name %in% colnames(mpd)) {
    label(mpd[[column_name]]) <- description
  }
}

Grouping by Shot Type

The first group_by() call groups by the shotType column to compare shot types as a follow-up to last week's analysis. We theorized that wrist shots would have a higher shotGoalProbability (summarized below as the mean of xGoal), so let's test that.

shots_by_type = mpd |> filter(shotType!="") |> group_by(shotType)

summary_by_type = shots_by_type |> summarize(
  avg_goal_probability = mean(xGoal, na.rm = TRUE),
  avg_shot_distance = mean(shotDistance, na.rm = TRUE),
  count = n()
)

summary_by_type
## # A tibble: 7 × 4
##   shotType   avg_goal_probability avg_shot_distance count
##   <labelled>                <dbl>             <dbl> <int>
## 1 BACK                     0.101              19.5   4197
## 2 DEFL                     0.139              17.3    998
## 3 SLAP                     0.0427             48.1   6902
## 4 SNAP                     0.0742             35.0  11195
## 5 TIP                      0.115              17.1   4985
## 6 WRAP                     0.0689              6.02   434
## 7 WRIST                    0.0634             37.2  29332
summary_by_type <- summary_by_type |>
  mutate(frequency_tag = ifelse(count == min(count), "Least Common", ""))

summary_by_type
## # A tibble: 7 × 5
##   shotType   avg_goal_probability avg_shot_distance count frequency_tag 
##   <labelled>                <dbl>             <dbl> <int> <chr>         
## 1 BACK                     0.101              19.5   4197 ""            
## 2 DEFL                     0.139              17.3    998 ""            
## 3 SLAP                     0.0427             48.1   6902 ""            
## 4 SNAP                     0.0742             35.0  11195 ""            
## 5 TIP                      0.115              17.1   4985 ""            
## 6 WRAP                     0.0689              6.02   434 "Least Common"
## 7 WRIST                    0.0634             37.2  29332 ""

The WRAP shot type occurs only 434 times and is marked as least common. A shot drawn at random from the dataset is therefore far less likely to be a WRAP shot than a common type such as WRIST or SNAP. In probability terms, with N total shots, the chance that a shot is of type WRAP is 434/N. Summing the counts above gives N = 58,043 shots with a recorded type, so that chance is about 0.75%.
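The 434/N calculation can be reproduced for every shot type directly from the printed counts. This is a minimal sketch in base R; the vector name type_counts is introduced here for illustration, and N counts only shots with a recorded shot type.

```r
# Counts copied from the summary_by_type output above
type_counts <- c(BACK = 4197, DEFL = 998, SLAP = 6902, SNAP = 11195,
                 TIP = 4985, WRAP = 434, WRIST = 29332)

# N is the number of shots with a recorded shot type
n_typed <- sum(type_counts)

# Empirical probability of each shot type, smallest first
type_share <- sort(type_counts / n_typed)
round(type_share, 4)
```

Sorting the shares puts WRAP first, which matches the "Least Common" tag in the table.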

The rarity of WRAP shots might suggest that these shots are harder to execute, are less favored by players because of lower success rates, or are attempted only in specific, less common game situations.

Testable Hypothesis:

H0: Mean shot distance of WRAP = Mean shot distance of other shot types

H1: Mean shot distance of WRAP < Mean shot distance of other shot types
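One possible way to test this hypothesis later is a one-sided Welch t-test on the shot-level distances. The sketch below runs on a small synthetic data frame (toy_wrap, with made-up distances), not on the MoneyPuck data, so its numbers say nothing about real shots. On the real data the same steps would use mpd$shotType and mpd$shotDistance.

```r
# Synthetic stand-in for mpd: made-up distances, NOT MoneyPuck data
set.seed(1)
toy_wrap <- data.frame(
  shotType     = rep(c("WRAP", "WRIST", "SLAP"), times = c(30, 60, 60)),
  shotDistance = c(rnorm(30, mean = 6,  sd = 2),
                   rnorm(60, mean = 37, sd = 10),
                   rnorm(60, mean = 48, sd = 10))
)

# Split distances into WRAP vs. every other shot type
wrap_dist  <- toy_wrap$shotDistance[toy_wrap$shotType == "WRAP"]
other_dist <- toy_wrap$shotDistance[toy_wrap$shotType != "WRAP"]

# One-sided Welch t-test (unequal variances): H1 is mean(WRAP) < mean(other)
wrap_test <- t.test(wrap_dist, other_dist, alternative = "less")
wrap_test$p.value
```

Welch's version is used because the two groups differ greatly in size and spread. Shot distances are also unlikely to be normally distributed, so a rank-based test such as wilcox.test() may be worth comparing.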

To better visualize the differences between groups and prepare for future testing of this hypothesis, I will draw a simple bar plot of the average shot distance for each shot type.

ggplot(summary_by_type, aes(x = reorder(shotType, avg_shot_distance), y = avg_shot_distance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average Shot Distance",
       x = "Shot Type",
       y = "Average Shot Distance") +
  theme_minimal() +
  coord_flip()

The bar graph shows that wraparounds have a much shorter average shot distance (about 6) than every other shot type (roughly 17 to 48). If I had to decide now, I would expect the null hypothesis to be rejected, though a formal test is still needed.

The analysis of shot types revealed that WRAP shots are the least common and have the shortest average shot distance, while DEFL shots have the highest average goal probability (0.139). Contrary to our expectation, WRIST shots have one of the lower averages (0.0634), ahead of only SLAP shots. This suggests that WRAP shots may be rare because of positional constraints, and that DEFL shots tend to be higher-quality scoring chances. Further investigation could explore why WRAP shots are less favored by players and identify the game situations where they are more likely to be attempted.

Grouping by Home vs. Away Team Performance

shots_by_home_away <- mpd |> group_by(isHomeTeam)

summary_by_home_away <- shots_by_home_away |> summarize(
  avg_goal_probability = mean(xGoal, na.rm = TRUE),
  count = n()
)

summary_by_home_away <- summary_by_home_away |>
  mutate(frequency_tag = ifelse(count == min(count), "Least Common", ""))

summary_by_home_away
## # A tibble: 2 × 4
##   isHomeTeam avg_goal_probability count frequency_tag 
##   <labelled>                <dbl> <int> <chr>         
## 1 0                        0.0711 28785 "Least Common"
## 2 1                        0.0734 29774 ""

The average goal probability barely differs between home (0.0734) and away (0.0711) teams. This is logical: playing at home or away does not change a team's overall skill, and every team plays both home and away games, so both groups are made up of the same teams.

Because away teams have the lower shot count, a randomly selected shot is slightly less likely to come from an away team than from a home team. With N total shots and 28785 from away teams, the probability of an away-team shot is 28785/N. Here N = 28785 + 29774 = 58,559, so that probability is about 49.2%.

Testable Hypothesis:

“Away teams take significantly fewer shots than home teams due to factors like rink familiarity, crowd influence, and home-ice advantages.”

Null Hypothesis (H0): The number of shots taken by home and away teams is equal.

Alternative Hypothesis (H1): The number of shots taken by away teams is significantly lower than that of home teams.
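As a rough first check of H0 against H1, the two printed counts can be fed into an exact binomial test. This sketch assumes every shot is an independent draw with a 50% chance of coming from either side under H0. That assumption is a simplification, because shots within the same game are correlated, so the p-value should be read with caution.

```r
# Shot counts copied from summary_by_home_away above
away_shots <- 28785
home_shots <- 29774
n_shots    <- away_shots + home_shots

# Under H0 each shot is equally likely to come from either side (p = 0.5).
# alternative = "less" matches H1: the away share is below one half.
away_test <- binom.test(away_shots, n_shots, p = 0.5, alternative = "less")

away_test$estimate   # observed away share
away_test$p.value
```

With almost 59,000 shots, even a gap of under 1,000 shots can be statistically detectable, so the test result and the visual impression from a bar chart may not agree. A per-game comparison (for example, a paired test on home and away shot counts within each game) would respect the game structure better.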

ggplot(summary_by_home_away, aes(x = factor(isHomeTeam, labels = c("Away Team", "Home Team")), y = count, fill = factor(isHomeTeam))) +
  geom_bar(stat = "identity") +
  labs(title = "Shot Frequency by Home vs. Away Teams",
       x = "Team Type",
       y = "Shot Count") +
  theme_minimal() +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme(legend.position = "none")

The two bars look nearly identical: the gap is 989 shots, about 1.7% of the total. While rink familiarity and crowd influence may play a role in other aspects of the game, any effect on shot volume looks small. A full hypothesis test would be needed to say whether the gap is significant, but a better follow-up is to look at the areas of the ice the home team shoots from, which might indicate more familiarity with ice conditions.

When comparing home and away team performance, the average goal probability differs only slightly, and away teams take somewhat fewer shots overall (28785 versus 29774). This indicates that factors like rink familiarity and crowd influence might not affect goal probability but could affect the number of shots taken. Further questions include how the location of shots taken by home teams compares to those taken by away teams and whether rink familiarity influences shot quality.

Grouping by Team Strength (Power Play vs. Even Strength vs. Shorthanded)

Shot frequency may depend on whether a team is on a power play, at even strength, or shorthanded. A logical assumption is that shorthanded teams (playing with fewer skaters because of penalties) take significantly fewer shot attempts. The grouping below, however, is by the goaltender facing each shot (goalieNameForShot), not by strength state.

shots_by_goalie <- mpd |> group_by(goalieNameForShot)
summary_by_goalie <- shots_by_goalie |> summarize(
  count = n(),
  avg_goal_probability = mean(xGoal, na.rm = TRUE)
)

summary_by_goalie
## # A tibble: 91 × 3
##    goalieNameForShot    count avg_goal_probability
##    <labelled>           <int>                <dbl>
##  1 ""                     493               0.579 
##  2 "Adin Hill"           1102               0.0690
##  3 "Akira Schmid"          24               0.0312
##  4 "Aleksei Kolosov"      546               0.0719
##  5 "Alex Lyon"            603               0.0652
##  6 "Alex Nedeljkovic"     806               0.0692
##  7 "Alexandar Georgiev"   927               0.0745
##  8 "Andrei Vasilevskiy"  1237               0.0639
##  9 "Anthony Stolarz"      696               0.0636
## 10 "Anton Forsberg"       526               0.0717
## # ℹ 81 more rows
summary_by_goalie <- summary_by_goalie |>
  mutate(frequency_tag = ifelse(count == min(count), "Least Common", ""))
summary_by_goalie
## # A tibble: 91 × 4
##    goalieNameForShot    count avg_goal_probability frequency_tag
##    <labelled>           <int>                <dbl> <chr>        
##  1 ""                     493               0.579  ""           
##  2 "Adin Hill"           1102               0.0690 ""           
##  3 "Akira Schmid"          24               0.0312 ""           
##  4 "Aleksei Kolosov"      546               0.0719 ""           
##  5 "Alex Lyon"            603               0.0652 ""           
##  6 "Alex Nedeljkovic"     806               0.0692 ""           
##  7 "Alexandar Georgiev"   927               0.0745 ""           
##  8 "Andrei Vasilevskiy"  1237               0.0639 ""           
##  9 "Anthony Stolarz"      696               0.0636 ""           
## 10 "Anton Forsberg"       526               0.0717 ""           
## # ℹ 81 more rows

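Yaniv Perets, the goaltender who faced the fewest shots, also had the highest avg_goal_probability. (The blank goalieNameForShot row, with 493 shots and an average of 0.579, has no goalie recorded and most likely represents empty-net shots, so it should be set aside when comparing goaltenders.) Perets' numbers are interesting and may indicate that he is an inexperienced goaltender who rarely plays; he appeared in only one game.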
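The steps for finding the goaltender with the fewest shots can be sketched as follows. The data frame goalie_rows holds only the ten rows printed above, so Perets does not appear in it; on the full summary_by_goalie table the same filter and sort would apply. Treating the blank name as "no goalie in net" is an assumption, not something the output confirms.

```r
# First ten rows of summary_by_goalie as printed above (not the full table)
goalie_rows <- data.frame(
  goalieNameForShot = c("", "Adin Hill", "Akira Schmid", "Aleksei Kolosov",
                        "Alex Lyon", "Alex Nedeljkovic", "Alexandar Georgiev",
                        "Andrei Vasilevskiy", "Anthony Stolarz", "Anton Forsberg"),
  count = c(493, 1102, 24, 546, 603, 806, 927, 1237, 696, 526),
  avg_goal_probability = c(0.579, 0.0690, 0.0312, 0.0719, 0.0652,
                           0.0692, 0.0745, 0.0639, 0.0636, 0.0717)
)

# Drop the row with no goalie name, then sort so the fewest shots come first
named_goalies <- goalie_rows[goalie_rows$goalieNameForShot != "", ]
named_goalies <- named_goalies[order(named_goalies$count), ]
head(named_goalies, 3)
```

Sorting by count avoids relying on the truncated tibble printout, which shows only the first ten goaltenders alphabetically.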

For example, if one goalie (Perets) faces only 9 shots while others face 1,000 or more, then a randomly selected shot is very unlikely to be against that goalie. In probability terms, the chance that a shot was faced by Perets is 9/N, where N is the total number of shots (58,559 here, so roughly 0.015%).

Hypothesis: “The goalie with the fewest shots faced (Perets) is likely a backup.”

(H0): The shot frequency faced by this goalie is the same as for the starting goaltenders.

(H1): The shot frequency faced by this goalie is significantly lower than that faced by starting goaltenders.

This hypothesis is testable by comparing the shot counts (or shot rates per game) for Perets with those of other goaltenders, and by correlating those counts with team defensive metrics or goaltender role (starter vs. backup). The role information is not in the current data and would need to be added.

ggplot(summary_by_goalie, aes(x = count)) +
  geom_histogram(binwidth = 200, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Shots Faced by Goalies",
       x = "Shot Count",
       y = "Frequency") +
  theme_minimal()

The histogram above shows that shot counts vary widely across goaltenders, which suggests that shots faced depend on many factors and are hard to predict or attribute to a single cause.

This section set out to compare team strengths (power play, even strength, shorthanded), but the grouping above is by goaltender, so it does not yet show how shot frequency varies with the team's situation. The expectation that shorthanded teams take fewer shots remains an untested assumption. Further investigation could group shots by strength state, examine how goal probability changes under different team strengths, and identify strategies that teams use to mitigate the disadvantage of being shorthanded.
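The strength-state comparison discussed in this section could be set up along these lines. The sketch uses a tiny synthetic data frame (toy_strength), not MoneyPuck data. The column names homeSkatersOnIce and awaySkatersOnIce are assumptions that should be checked against the data dictionary before running this on mpd.

```r
# Synthetic rows, NOT MoneyPuck data. The column names homeSkatersOnIce and
# awaySkatersOnIce are assumed; verify them against the data dictionary.
toy_strength <- data.frame(
  isHomeTeam       = c(1, 1, 0, 0, 1, 0),
  homeSkatersOnIce = c(5, 5, 5, 4, 4, 5),
  awaySkatersOnIce = c(5, 4, 4, 5, 5, 5)
)

# Skater counts from the shooting team's point of view
shooting  <- ifelse(toy_strength$isHomeTeam == 1,
                    toy_strength$homeSkatersOnIce,
                    toy_strength$awaySkatersOnIce)
defending <- ifelse(toy_strength$isHomeTeam == 1,
                    toy_strength$awaySkatersOnIce,
                    toy_strength$homeSkatersOnIce)

# Label each shot by comparing the two skater counts
toy_strength$strength <- ifelse(shooting > defending, "Power Play",
                         ifelse(shooting < defending, "Shorthanded",
                                "Even Strength"))

# Shot counts per strength state
table(toy_strength$strength)
```

Comparing raw skater counts does not separate a true power play from an extra attacker with the goalie pulled, so empty-net situations may need their own category.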

Grouping by Handedness and Wing Play

The final analysis fulfills the requirement to pick two categorical variables. I chose shooter handedness (shooterLeftRight) and whether the shot was taken from the shooter's off wing (offWing).

shots_by_handedness_offwing = mpd |> filter(shooterLeftRight!="") |> group_by(shooterLeftRight, offWing) |> 
  summarize(
    avg_goal_probability = mean(xGoal, na.rm = TRUE),
    count = n()
  )
## `summarise()` has grouped output by 'shooterLeftRight'. You can override using
## the `.groups` argument.
shots_by_handedness_offwing
## # A tibble: 4 × 4
## # Groups:   shooterLeftRight [2]
##   shooterLeftRight offWing    avg_goal_probability count
##   <labelled>       <labelled>                <dbl> <int>
## 1 L                0                        0.0692 20075
## 2 L                1                        0.0846 14605
## 3 R                0                        0.0590 12802
## 4 R                1                        0.0811  8198
# Drop the leftover grouping so min(count) is taken over all four rows
shots_by_handedness_offwing <- shots_by_handedness_offwing |>
  ungroup() |>
  mutate(frequency_tag = ifelse(count == min(count), "Least Common", ""))

shots_by_handedness_offwing
## # A tibble: 4 × 5
##   shooterLeftRight offWing    avg_goal_probability count frequency_tag 
##   <labelled>       <labelled>                <dbl> <int> <chr>         
## 1 L                0                        0.0692 20075 ""            
## 2 L                1                        0.0846 14605 ""            
## 3 R                0                        0.0590 12802 ""            
## 4 R                1                        0.0811  8198 "Least Common"

Off-wing shots by right-handed shooters occur less often than any other combination. This makes sense: right-handed shooters take fewer shots overall in this dataset (21,000 versus 34,680 for left-handed shooters), and off-wing shots are unlikely to outnumber on-wing shots.
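To separate the effect of handedness from the effect of wing, the counts can be converted into off-wing shares within each handedness. This is a minimal sketch in base R built from the printed counts; the matrix name hand_counts is introduced here for illustration.

```r
# Counts copied from shots_by_handedness_offwing above
hand_counts <- matrix(
  c(20075, 14605,
    12802,  8198),
  nrow = 2, byrow = TRUE,
  dimnames = list(shooterLeftRight = c("L", "R"), offWing = c("0", "1"))
)

# Total shots per handedness
hand_totals <- rowSums(hand_counts)

# Share of on-wing (0) and off-wing (1) shots within each handedness
offwing_share <- prop.table(hand_counts, margin = 1)
round(offwing_share, 3)
```

By this calculation about 42.1% of left-handed shots and 39.0% of right-handed shots come from the off wing, so the low right-handed off-wing count reflects both fewer right-handed shots and a slightly lower off-wing share.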

Finally, the analysis of handedness and wing play found that off-wing shots by right-handed shooters are the least common combination, and that off-wing shots have a higher average goal probability than on-wing shots for both left-handed (0.0846 vs. 0.0692) and right-handed (0.0811 vs. 0.0590) shooters. This suggests that off-wing shots, though less frequent, might be more effective in scoring. Further questions include what factors contribute to the higher goal probability of off-wing shots and whether there are particular game situations where off-wing shots are more advantageous.

ggplot(shots_by_handedness_offwing, aes(x = shooterLeftRight, y = offWing, fill = count)) +
  geom_tile() +
  geom_text(aes(label = count), color = "white", size = 5, fontface = "bold") +  
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Shot Frequency by Handedness and Off-Wing Status",
       x = "Shooter Handedness",
       y = "Off-Wing Status",
       fill = "Shot Count") +
  theme_minimal()