More Time On Ice, More Shots On Net? 2023-2024 NHL Regular Season Player Analysis

Author

Ash Ibasan

Oilers vs. Kings (Game 1)

Edmonton Oilers forward Ryan Nugent-Hopkins attempts a shot on goal towards LA Kings goaltender Cam Talbot during Game 1 of the Oilers vs. Kings playoff series.

Project Introduction

I analyzed the relationship between ice time and shots on goal for all skaters in the 2023-24 NHL regular season, using data collected throughout the season. The goal is to find out whether players who spend more time on the ice also record more shots on net. I test this with a simple linear regression of shots on goal on ice time. Along the way, the project covers correlation, regression, data visualization, and exploratory data analysis (EDA), and offers some player insights from the regular season.

1. Load necessary libraries for data handling and visualization

library(tidyverse) 
Warning: package 'tidyverse' was built under R version 4.4.1
Warning: package 'ggplot2' was built under R version 4.4.1
Warning: package 'tibble' was built under R version 4.4.1
Warning: package 'tidyr' was built under R version 4.4.1
Warning: package 'readr' was built under R version 4.4.1
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'dplyr' was built under R version 4.4.1
Warning: package 'stringr' was built under R version 4.4.1
Warning: package 'forcats' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2. Load and clean dataset

skaters23_24 <- read_csv("skaters.csv")
Rows: 4620 Columns: 154
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): name, team, position, situation
dbl (150): playerId, season, games_played, icetime, shifts, gameScore, onIce...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skater_data_cleaned <- skaters23_24 %>%
  filter(!is.na(icetime) & !is.na(OnIce_F_shotsOnGoal)) # take out missing values

head(skater_data_cleaned) # to confirm first few rows
# A tibble: 6 × 154
  playerId season name      team  position situation games_played icetime shifts
     <dbl>  <dbl> <chr>     <chr> <chr>    <chr>            <dbl>   <dbl>  <dbl>
1  8480950   2023 Ilya Lyu… TOR   D        other               74    2881     56
2  8480950   2023 Ilya Lyu… TOR   D        all                 74   76034   1717
3  8480950   2023 Ilya Lyu… TOR   D        5on5                74   61758   1389
4  8480950   2023 Ilya Lyu… TOR   D        4on5                74   11271    259
5  8480950   2023 Ilya Lyu… TOR   D        5on4                74     124     13
6  8478438   2023 Tommy No… NSH   C        other               71    2378     44
# ℹ 145 more variables: gameScore <dbl>, onIce_xGoalsPercentage <dbl>,
#   offIce_xGoalsPercentage <dbl>, onIce_corsiPercentage <dbl>,
#   offIce_corsiPercentage <dbl>, onIce_fenwickPercentage <dbl>,
#   offIce_fenwickPercentage <dbl>, iceTimeRank <dbl>, I_F_xOnGoal <dbl>,
#   I_F_xGoals <dbl>, I_F_xRebounds <dbl>, I_F_xFreeze <dbl>,
#   I_F_xPlayStopped <dbl>, I_F_xPlayContinuedInZone <dbl>,
#   I_F_xPlayContinuedOutsideZone <dbl>, I_F_flurryAdjustedxGoals <dbl>, …

3. Run linear regression

Let’s see how ice time affects shots on goal (SOG).

lr_skater_model <- lm(OnIce_F_shotsOnGoal ~ icetime, data = skater_data_cleaned)
summary(lr_skater_model) # model summary for viewing coefficients, R-squared, and p-values

Call:
lm(formula = OnIce_F_shotsOnGoal ~ icetime, data = skater_data_cleaned)

Residuals:
    Min      1Q  Median      3Q     Max 
-238.12  -17.79    1.35    7.68  392.64 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.773e+00  9.527e-01  -1.861   0.0628 .  
icetime      8.500e-03  2.687e-05 316.374   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53.18 on 4618 degrees of freedom
Multiple R-squared:  0.9559,    Adjusted R-squared:  0.9559 
F-statistic: 1.001e+05 on 1 and 4618 DF,  p-value: < 2.2e-16

4. Convert ice time from seconds to minutes

skater_data_cleaned <- skater_data_cleaned %>% # for readability
  mutate(icetime_minutes = icetime / 60)
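
Because the model in step 3 was fitted before this conversion, its slope of 8.500e-03 is expressed in shots per second of ice time. Dividing a predictor by 60 multiplies its slope by 60 and leaves R-squared unchanged, so the per-minute figure is 0.0085 × 60 = 0.51. The sketch below illustrates this on synthetic data (not the NHL dataset); the data frame `toy` and its values are made up for the example.

```r
# Synthetic data: shots grow roughly linearly with ice time in seconds.
set.seed(1)
toy <- data.frame(icetime = runif(200, min = 0, max = 120000))
toy$shots <- 0.0085 * toy$icetime + rnorm(200, sd = 50)
toy$icetime_minutes <- toy$icetime / 60

# Same model fitted on two different time units.
fit_seconds <- lm(shots ~ icetime, data = toy)
fit_minutes <- lm(shots ~ icetime_minutes, data = toy)

slope_seconds <- unname(coef(fit_seconds)["icetime"])
slope_minutes <- unname(coef(fit_minutes)["icetime_minutes"])

# The per-minute slope is 60 times the per-second slope,
# and the fit quality (R-squared) does not change.
all.equal(slope_minutes, 60 * slope_seconds)
all.equal(summary(fit_seconds)$r.squared, summary(fit_minutes)$r.squared)

# Applying the same rescaling to the slope reported in step 3:
0.0085 * 60
```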

5. Fix position names

skater_data_cleaned <- skater_data_cleaned %>% # to match colors
  mutate(position = recode(position,
                           "C" = "Center",
                           "D" = "Defense",
                           "L" = "Left Wing",
                           "R" = "Right Wing"))

6. Visualize the data (scatterplot) - Ice time (in minutes) vs shots on goal (SOG), color-coded by position

ggplot(skater_data_cleaned, aes(x = icetime_minutes, y = OnIce_F_shotsOnGoal, color = position)) +
  geom_point(alpha = 0.5, size = 3) +  
  geom_smooth(method = "lm", se = FALSE, color = "#ff6347", linewidth = 1.5) +  # regression line (linewidth replaces size for lines since ggplot2 3.4.0)
  scale_color_manual(values = c("Center" = "#1f77b4",
                                "Defense" = "#ffd700",  
                                "Left Wing" = "#32cd32",
                                "Right Wing" = "#ff69b4" 
                               )) + 
  labs(title = 'Impact of Ice Time on Shots by Player Position',
       x = 'Total Time Spent on Ice (Minutes)',
       y = 'Total Number of Shots on Goal',
       color = 'Position') + 
  annotate("text", x = Inf, y = -Inf, label = "Dataset Source: https://moneypuck.com/data.htm", hjust = 1, vjust = -1, size = 3, color = "gray50") +
  theme_minimal() +
  theme(plot.title = element_text(color = "#ff7f0e", size = 16, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10))
`geom_smooth()` using formula = 'y ~ x'

7. Visualize shot distribution based on position

ggplot(skater_data_cleaned, aes(x = position, y = OnIce_F_shotsOnGoal, fill = position)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Center" = "#1f77b4",
                                "Defense" = "#ffd700",  
                                "Left Wing" = "#32cd32",
                                "Right Wing" = "#ff69b4" 
                               )) + 
  labs(title = 'Distribution of Shots on Goal by Position',
       x = 'Player Position',
       y = 'Number of Shots on Goal',
       fill = 'Position') +
  theme_minimal() +
  theme(plot.title = element_text(color = "#ff7f0e", size = 16, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10))

8. Summary stats

Let’s check out the distribution of Ice Time and Shots on Goal

summary(skater_data_cleaned$icetime_minutes)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.000    7.817   73.375  337.176  554.550 2123.117 
summary(skater_data_cleaned$OnIce_F_shotsOnGoal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     3.0    29.0   170.2   265.2  1345.0 
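
The median ice time (73.4 minutes) sits far below the mean (337.2 minutes), which is consistent with small special-teams rows being pooled with full-season totals. A grouped summary is one way to check this. The sketch below uses a made-up tibble; `toy`, `by_situation`, and all values are hypothetical and only illustrate the `group_by()` plus `summarise()` pattern.

```r
library(dplyr)

# Toy data (invented values): three players, three situations each.
toy <- tibble(
  situation = rep(c("all", "5on5", "5on4"), times = 3),
  icetime_minutes = c(1200, 1000, 150, 900, 800, 20, 600, 580, 5)
)

# Median and mean ice time within each situation, instead of pooled.
by_situation <- toy %>%
  group_by(situation) %>%
  summarise(
    n_rows = n(),
    median_minutes = median(icetime_minutes),
    mean_minutes = mean(icetime_minutes),
    .groups = "drop"
  )

by_situation
```

On the real data, the same pattern applied to `skater_data_cleaned` would show whether the skew comes from the `situation` rows or from the players themselves.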

Exploring the Impact of Ice Time on Shot Attempts in the 2023-2024 NHL Regular Season

Data cleaning and prep

The dataset I worked with gave me the quantitative data I needed, but there was one issue: Ice Time was recorded in seconds, which made the numbers hard to interpret. I converted Ice Time from seconds to minutes, which made the data easier for a general audience to read and gave the graph a more intuitive x-axis. The conversion helped put into context how much time players were really spending on the ice.

I also cleaned the data by removing rows with missing values and focused on the three variables needed for the analysis: Ice Time, Shots on Goal (SOG), and Position. With the data prepped, I ran a linear regression to test whether Ice Time and Shots on Goal are significantly related.

Model results, interpretation, and data visualization

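The linear regression model showed a strong, positive relationship between Ice Time and Shots on Goal. The model was fitted with Ice Time in seconds, so the slope of 0.0085 means about 0.0085 more shots on goal per additional second on the ice, or roughly 0.51 more per additional minute (0.0085 × 60). The variable name `OnIce_F_shotsOnGoal` suggests these are shots on goal recorded while the player is on the ice, not only the player's own shots. The p-value was less than 2.2e-16, so the relationship is statistically significant: more Ice Time goes hand in hand with more shots on net.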

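Even better, the model’s R-squared value was 0.956, meaning that about 95.6% of the variation in shots on goal can be explained by how much time a player spends on the ice. Only a small share is left for other factors, like skill level, team strategies, or player position. One caveat: the data contains a separate row for each game situation (all, 5on5, 5on4, 4on5, other), so the fit pools full-season totals with much smaller special-teams totals, which may inflate the R-squared.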
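
In a regression with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response, which gives a quick cross-check on the reported value. The sketch below demonstrates the identity on synthetic data; `x`, `y`, and their values are made up and stand in for ice time and shots on goal.

```r
# Synthetic data (not the NHL dataset): a noisy linear relationship.
set.seed(42)
x <- runif(100, min = 0, max = 2000)   # stand-in for ice time in minutes
y <- 0.5 * x + rnorm(100, sd = 60)     # stand-in for shots on goal

fit <- lm(y ~ x)

r_squared <- summary(fit)$r.squared    # R-squared reported by the model
r <- cor(x, y)                         # Pearson correlation

# For simple linear regression the two quantities agree.
all.equal(r_squared, r^2)
```

Applied to this project, `cor(skater_data_cleaned$icetime, skater_data_cleaned$OnIce_F_shotsOnGoal)^2` should reproduce the R-squared from step 3.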

To compare positions, I added Position to the visuals: a scatterplot and a boxplot, both color-coded for centers, defensemen, left wings, and right wings. All positions followed the overall trend of more ice time going with more shots, but the distributions differed slightly, with defensemen generally recording fewer shots on goal than forwards (centers, left wings, and right wings).

Takeaways

This analysis supports what you would expect in hockey: players who spend more time on the ice record more shots. What’s interesting is the strength of that connection. An R-squared value of 0.956 shows just how closely these two variables are linked. Data science techniques like linear regression can help uncover useful insights, even for those who don’t follow hockey or sports in general.

In the future, I want to explore more deeply how other factors, such as team strategies and position, influence shot attempts. The position analysis hints that player roles affect shooting behavior, and adding variables like team dynamics or game situations would give fuller context on what drives overall player performance. This project shows how useful statistical modeling can be for understanding sports data: complex relationships become accessible and engaging when broken down with effective EDA techniques.
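
One possible way to start on that follow-up is to add position as a second predictor and test whether it improves on the ice-time-only model. The sketch below is an illustration on synthetic data, not a result from this project: the data frame `toy`, the invented 40-shot deficit for defensemen, and the model names are all assumptions made for the example.

```r
# Synthetic data (not the NHL dataset) with an invented position effect.
set.seed(7)
n <- 400
toy <- data.frame(
  icetime_minutes = runif(n, min = 50, max = 2000),
  position = factor(sample(c("Center", "Defense", "Left Wing", "Right Wing"),
                           n, replace = TRUE))
)
# Assumption for the demo: defensemen record 40 fewer shots at equal ice time.
toy$shots <- 0.5 * toy$icetime_minutes -
  40 * (toy$position == "Defense") +
  rnorm(n, sd = 30)

# Baseline model and extended model with position added.
fit_time     <- lm(shots ~ icetime_minutes, data = toy)
fit_time_pos <- lm(shots ~ icetime_minutes + position, data = toy)

# F-test: does position add explanatory power beyond ice time?
comparison <- anova(fit_time, fit_time_pos)
comparison

# Position coefficients are offsets relative to the reference level (Center).
coef(fit_time_pos)
```

On the real data, the same two `lm()` calls with `OnIce_F_shotsOnGoal` as the response would show whether the position differences seen in the boxplot persist once ice time is accounted for.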