LJ Data Dive - Documentation

#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

# Loading the MoneyPuck Shot Dataset
mpd <- read.csv('./shots_2024.csv')

#adding descriptors to dataframe

# Load the data dictionary (update with your file path)
data_dict <- read.csv('./MoneyPuck_Shot_Data_Dictionary (1) (1).csv')

# Iterate through the data dictionary and attach each definition as a
# column label (from ChatGPT; quality-of-life step)
for (i in seq_len(nrow(data_dict))) {
  column_name <- data_dict$Variable[i]
  description <- data_dict$Definition[i]

  # Only label columns that actually exist in the shot data
  if (column_name %in% colnames(mpd)) {
    label(mpd[[column_name]]) <- description
  }
}

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

Column 1 - Speed From Last Event

This column name was unclear because it is not obvious what “speed” refers to. Per the documentation, the value is built from the distance between this event and the previous one, with the time between them as an added factor, so it describes how quickly play moved from the last event to the shot. It is labeled as speed because that is the idea the authors wanted to capture, and this is the closest available data point. If I hadn’t read the documentation, I would have misread it as a pure time measurement. It is a complex, derived proxy for puck speed, and it needs every bit of the documentation to interpret.
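To make the idea concrete, here is a minimal sketch of a distance-over-time calculation on toy data. The column names (`x`, `y`, `time`) and the formula are my assumptions for illustration only; this is not necessarily how MoneyPuck computes the column.

```r
# Toy events: x/y are rink coordinates in feet, time is game seconds.
# Column names are hypothetical, not MoneyPuck's.
events <- data.frame(
  x    = c(0, 30, 30, 60),
  y    = c(0, 40, 40, 0),
  time = c(10, 15, 15, 25)
)

# Straight-line distance from each event to the one before it
dist_ft <- c(NA, sqrt(diff(events$x)^2 + diff(events$y)^2))

# Seconds elapsed since the previous event
elapsed_s <- c(NA, diff(events$time))

# Distance per second; left as NA when no time has passed
events$speed_from_last <- ifelse(elapsed_s > 0, dist_ft / elapsed_s, NA)
events
```

The first row has no previous event and the third row shares a timestamp with the second, so both are left as NA rather than dividing by zero.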

Column 2 - Goaltender

Some values in this column are blank. At first this looks like missing data, but the documentation explains that a blank represents an empty-net situation. It is encoded as a blank because, well, there was no goalie. If I hadn’t read the documentation, I would have dropped these rows as bad data, when in fact they add context and make it possible to analyze empty-net situations.
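A minimal sketch of turning the blank goalie values into an explicit flag instead of dropping the rows. The data frame and the `goalie` column name are made up for illustration, and treating NA the same as a blank is my assumption, not something the documentation states.

```r
# Toy shots; `goalie` is a hypothetical column name
shots <- data.frame(
  goalie = c("A. Keeper", "", "B. Stopper", NA, ""),
  goal   = c(0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Flag rows with no goalie recorded (blank string, or NA by assumption)
shots$empty_net <- is.na(shots$goalie) | shots$goalie == ""

# Goal rate with and without a goalie in net
goal_rate <- tapply(shots$goal, shots$empty_net, mean)
goal_rate
```

Keeping the flag as its own column means the empty-net rows can be included or excluded deliberately in later analysis.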

Column 3 - Adjusted Distances

The adjusted distance columns are unclear because you have to understand what the adjustment means. The adjustment isn’t even consistent between columns, because each one uses a different calculation. Here the documentation is very helpful, because it clearly outlines the methodology used to calculate these adjusted distances. Had I not read it, I would not have known how these values were formed, or when to use the absolute value versus the adjusted distance as is. This raises a key point about documentation: when you perform additional calculations on publicly available raw data, you need to show what you did and why, so that someone who reuses your data set can assess its validity.

Still Unclear

I’m still unsure why the column for the player number that did the last event contains so many zeroes. I would expect every value to be an actual player number.
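One way to dig into this would be to check whether the zeroes cluster around particular kinds of previous events. The sketch below uses toy data; the column names and event codes are hypothetical and only show the shape of the check.

```r
# Toy data; column names and event codes are hypothetical
toy <- data.frame(
  last_event        = c("FAC", "SHOT", "STOP", "HIT", "STOP", "FAC", "SHOT"),
  last_event_player = c(0, 13, 0, 86, 0, 91, 63)
)

toy$is_zero <- toy$last_event_player == 0

# Overall share of zeroes
zero_share <- mean(toy$is_zero)

# Share of zeroes within each previous-event type
zero_by_event <- tapply(toy$is_zero, toy$last_event, mean)
zero_by_event
```

If the zeroes turned out to sit almost entirely in event types that have no acting player, that would suggest zero is a placeholder rather than an error, but that would need to be confirmed against the real data.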

Viz 1 and 2 using the column

Player shooting hand isn’t always filled in. This creates gaps when trying to analyze this information.

# Replace missing shooting hand (NA or blank string) with "Unknown"
mpd_hand <- mpd %>%
  mutate(shooterLeftRight = ifelse(is.na(shooterLeftRight) | shooterLeftRight == "",
                                   "Unknown", shooterLeftRight)) %>%
  count(shooterLeftRight)

ggplot(mpd_hand, aes(x = shooterLeftRight, y = n, fill = shooterLeftRight)) +
  geom_col() +
  scale_fill_manual(values = c("L" = "blue", "R" = "red", "Unknown" = "gray")) +
  labs(title = "Distribution of Player Shooting Hands",
       x = "Shooting Hand",
       y = "Number of Shots",
       fill = "Shooting Hand") +
  theme_minimal()

The chart shows that missing shooting hands affect a small but noticeable number of rows. Without knowing in advance that these values are empty, it would be unclear what the grey bar means. This matters because the unknown rows add clutter and an extra question mark to any analysis that uses shooting hand; if hand is a secondary feature rather than the primary variable, those rows could be mishandled or dropped without the analyst noticing. I don’t think this poses a significant risk overall, since a high volume of rows still include this information. Unless the gaps are all tied to one or two specific players and those players are the focus of the analysis, there is little real concern.
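To put a number on "small", the share of rows in each category can be computed directly. This is a minimal sketch on a toy vector, not the real column; it assumes a missing hand may appear either as NA or as a blank string.

```r
# Toy shooting-hand values; "" and NA both stand for a missing hand
hand <- c("L", "R", "", "L", NA, "R", "L", "")

# Collapse both kinds of missing value into one label
hand_clean <- ifelse(is.na(hand) | hand == "", "Unknown", hand)

# Counts and proportions per category
hand_counts <- table(hand_clean)
hand_share  <- prop.table(hand_counts)
round(hand_share, 3)
```

Reporting the proportion alongside the bar chart makes it easier to judge whether the unknown rows are negligible for a given analysis.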

# Filter for shots by the NJD team; uses columns 'teamCode', 'shooterLeftRight',
# 'playerNumThatDidEvent', and 'goal' (dplyr is already loaded via tidyverse)
goals_by_shooting_hand <- mpd %>%
  filter(teamCode == "NJD") %>%
  mutate(shooterLeftRight = ifelse(is.na(shooterLeftRight) | shooterLeftRight == "",
                                   "Unknown", shooterLeftRight)) %>%
  group_by(shooterLeftRight, playerNumThatDidEvent) %>%
  summarise(total_goals = sum(goal, na.rm = TRUE))

## `summarise()` has grouped output by 'shooterLeftRight'. You can override using
## the `.groups` argument.

ggplot(goals_by_shooting_hand, aes(x = playerNumThatDidEvent, y = total_goals, fill = shooterLeftRight)) +
  geom_col() +
  scale_fill_manual(values = c("L" = "blue", "R" = "red", "Unknown" = "gray")) +
  labs(title = "Goals by Player Shooting Hand for Team NJD (Stacked)",
       x = "Player Number",
       y = "Total Goals",
       fill = "Shooting Hand") +
  theme_minimal()

While not the cleanest visualization, this second chart shows that these gaps center around one player. This will cause issues when looking at player-specific attributes. It would be best to simply fix this in the data by adding an “S” for switch-shooters!
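The same conclusion can be checked without a chart by counting missing-hand rows per player. A minimal sketch on toy data follows; the column names `player_num` and `hand` are hypothetical stand-ins.

```r
# Toy shots; `player_num` and `hand` are hypothetical column names
shots <- data.frame(
  player_num = c(13, 13, 86, 86, 86, 63, 63, 91),
  hand       = c("L", "L", "", "", "", "R", "R", "L"),
  stringsAsFactors = FALSE
)

# TRUE where the shooting hand is missing (NA or blank)
shots$unknown <- is.na(shots$hand) | shots$hand == ""

# Number of missing-hand rows per player, keeping only affected players
unknown_by_player <- tapply(shots$unknown, shots$player_num, sum)
flagged <- unknown_by_player[unknown_by_player > 0]
flagged
```

In this toy example only one player is flagged, which is the pattern the stacked chart suggests for the real data.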

Checking Categorical Columns awayTeamCode and homeTeamCode

missing_home_team <- sum(is.na(mpd$homeTeamCode))
missing_away_team <- sum(is.na(mpd$awayTeamCode))

cat("Explicitly missing rows in 'homeTeamCode':", missing_home_team, "\n")

## Explicitly missing rows in 'homeTeamCode': 0

cat("Explicitly missing rows in 'awayTeamCode':", missing_away_team, "\n")

## Explicitly missing rows in 'awayTeamCode': 0

implicit_missing_home_team <- sum(mpd$homeTeamCode == "")
implicit_missing_away_team <- sum(mpd$awayTeamCode == "")

cat("Implicitly missing rows in 'homeTeamCode':", implicit_missing_home_team, "\n")

## Implicitly missing rows in 'homeTeamCode': 0

cat("Implicitly missing rows in 'awayTeamCode':", implicit_missing_away_team, "\n")

## Implicitly missing rows in 'awayTeamCode': 0

unique_home_teams <- length(unique(mpd$homeTeamCode))
unique_away_teams <- length(unique(mpd$awayTeamCode))

cat("Number of unique teams in 'homeTeamCode':", unique_home_teams, "\n")

## Number of unique teams in 'homeTeamCode': 32

cat("Number of unique teams in 'awayTeamCode':", unique_away_teams, "\n")

## Number of unique teams in 'awayTeamCode': 32

There are no missing values in the team code columns. This is expected, and it is reassuring that the data is recorded correctly.
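The explicit and implicit checks above can be wrapped in one small helper so the same test is easy to repeat on other columns. This is a sketch on toy data; `count_missing` is a name I made up, and counting whitespace-only strings as implicitly missing is my assumption.

```r
# Hypothetical helper: count explicit (NA) and implicit (blank) missing values
count_missing <- function(x) {
  x <- as.character(x)
  c(
    explicit = sum(is.na(x)),
    # Guard with !is.na() so an NA does not turn the whole sum into NA
    implicit = sum(!is.na(x) & trimws(x) == "")
  )
}

# Toy data reusing the column names from the checks above
toy <- data.frame(
  homeTeamCode = c("NJD", "NYR", "", NA),
  awayTeamCode = c("BOS", " ", "TOR", "MTL"),
  stringsAsFactors = FALSE
)

missing_summary <- sapply(toy[c("homeTeamCode", "awayTeamCode")], count_missing)
missing_summary
```

Note that a plain `sum(x == "")` returns NA as soon as the column contains a single NA, which is why the helper checks `!is.na(x)` first.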

Defining the Practical Outlier for shotDistance

For shot distance, I would define an outlier as any shot longer than 75 ft. This is not a mathematical outlier but what I like to call a practical outlier, because 75 ft is roughly where the blue line sits. Shots from beyond that line are likely accidental dump-ins or empty-net attempts, and they would skew a shot analysis because they are not necessarily legitimate attempts that reflect real game situations.
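A minimal sketch of applying this rule, using a toy vector of distances rather than the real `shotDistance` column. The variable names are mine, and the 75 ft cutoff is the one defined above.

```r
# Toy shot distances in feet
shot_distance <- c(12, 34.5, 75, 88, 150, 60)

# Practical-outlier cutoff defined above
cutoff_ft <- 75

# Flag shots strictly beyond the cutoff rather than deleting them
practical_outlier <- shot_distance > cutoff_ft
n_outliers <- sum(practical_outlier)

# Distances kept for a typical shot analysis
kept <- shot_distance[!practical_outlier]
kept
```

Flagging instead of deleting keeps the long shots available for analyses where they matter, such as empty-net situations.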