More Time On Ice, More Shots On Net? 2023-2024 NHL Regular Season Player Analysis
Author
Ash Ibasan
Edmonton Oilers forward Ryan Nugent-Hopkins attempts a shot on goal towards LA Kings goaltender Cam Talbot during Game 1 of the Oilers vs. Kings playoff series.
Project Introduction
I analyzed the relationship between ice time and shots on goal for all skaters in the 2023-24 NHL regular season, using data collected throughout the season. The goal is to find out whether players who spend more time on the ice attempt more shots on net. To test this, I fit a linear regression of shots on goal on ice time, which estimates how many additional shots go with each additional unit of playing time. Along the way, the project covers correlation, regression, data visualization, and EDA, and offers some player insights from the regular season.
1. Load necessary libraries for data handling and visualization
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.1
Warning: package 'ggplot2' was built under R version 4.4.1
Warning: package 'tibble' was built under R version 4.4.1
Warning: package 'tidyr' was built under R version 4.4.1
Warning: package 'readr' was built under R version 4.4.1
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'dplyr' was built under R version 4.4.1
Warning: package 'stringr' was built under R version 4.4.1
Warning: package 'forcats' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2. Load and clean dataset
skaters23_24 <- read_csv("skaters.csv")
Rows: 4620 Columns: 154
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, team, position, situation
dbl (150): playerId, season, games_played, icetime, shifts, gameScore, onIce...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skater_data_cleaned <- skaters23_24 %>%
  filter(!is.na(icetime) & !is.na(OnIce_F_shotsOnGoal)) # take out missing values
head(skater_data_cleaned) # to confirm first few rows
# A tibble: 6 × 154
playerId season name team position situation games_played icetime shifts
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 8480950 2023 Ilya Lyu… TOR D other 74 2881 56
2 8480950 2023 Ilya Lyu… TOR D all 74 76034 1717
3 8480950 2023 Ilya Lyu… TOR D 5on5 74 61758 1389
4 8480950 2023 Ilya Lyu… TOR D 4on5 74 11271 259
5 8480950 2023 Ilya Lyu… TOR D 5on4 74 124 13
6 8478438 2023 Tommy No… NSH C other 71 2378 44
# ℹ 145 more variables: gameScore <dbl>, onIce_xGoalsPercentage <dbl>,
# offIce_xGoalsPercentage <dbl>, onIce_corsiPercentage <dbl>,
# offIce_corsiPercentage <dbl>, onIce_fenwickPercentage <dbl>,
# offIce_fenwickPercentage <dbl>, iceTimeRank <dbl>, I_F_xOnGoal <dbl>,
# I_F_xGoals <dbl>, I_F_xRebounds <dbl>, I_F_xFreeze <dbl>,
# I_F_xPlayStopped <dbl>, I_F_xPlayContinuedInZone <dbl>,
# I_F_xPlayContinuedOutsideZone <dbl>, I_F_flurryAdjustedxGoals <dbl>, …
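The situation column in the preview above shows several rows per skater (other, all, 5on5, 4on5, 5on4), so each player is counted once per game state. One possible refinement, sketched here on a made-up toy table rather than the real file, is to keep only the "all" rows before modeling. The toy_skaters values are invented for illustration; only the column names come from the dataset.

```r
library(dplyr)

# Toy stand-in for the MoneyPuck table: made-up numbers, real column names
toy_skaters <- data.frame(
  name      = c("Player A", "Player A", "Player A",
                "Player B", "Player B", "Player B"),
  situation = c("all", "5on5", "5on4", "all", "5on5", "5on4"),
  icetime   = c(60000, 50000, 10000, 30000, 28000, 2000),
  OnIce_F_shotsOnGoal = c(500, 400, 100, 240, 220, 20)
)

# Keep one row per skater: the row that aggregates all game situations
toy_all <- toy_skaters %>%
  filter(situation == "all")

nrow(toy_all) # one row per toy player
```

On the real data the same filter would be applied to skaters23_24; the regression results reported in this write-up were computed on all rows, without this filter.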
3. Run linear regression
Let’s see how ice time affects shots on goal (SOG).
lr_skater_model <- lm(OnIce_F_shotsOnGoal ~ icetime, data = skater_data_cleaned)
summary(lr_skater_model) # model summary for viewing coefficients, R-squared, and p-values
Call:
lm(formula = OnIce_F_shotsOnGoal ~ icetime, data = skater_data_cleaned)
Residuals:
Min 1Q Median 3Q Max
-238.12 -17.79 1.35 7.68 392.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.773e+00 9.527e-01 -1.861 0.0628 .
icetime 8.500e-03 2.687e-05 316.374 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53.18 on 4618 degrees of freedom
Multiple R-squared: 0.9559, Adjusted R-squared: 0.9559
F-statistic: 1.001e+05 on 1 and 4618 DF, p-value: < 2.2e-16
4. Convert ice time from seconds to minutes
skater_data_cleaned <- skater_data_cleaned %>% # for readability
  mutate(icetime_minutes = icetime / 60)
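Because the model in step 3 was fit on icetime in seconds, its slope is in shots per second. A minimal sketch of rescaling it to minutes follows. The slope value is copied from the summary output above, while the toy data frame is made up purely to show that refitting on minutes multiplies the slope by 60.

```r
# Slope reported by summary(lr_skater_model): shots on goal per second of ice time
slope_per_second <- 8.500e-03

# Rescale to shots on goal per minute of ice time
slope_per_minute <- slope_per_second * 60
slope_per_minute # 0.51

# Toy check (made-up numbers): refitting on minutes scales the slope by 60
toy <- data.frame(
  icetime = c(600, 1200, 1800, 2400, 3000), # seconds
  shots   = c(6, 9, 17, 19, 27)
)
fit_sec <- lm(shots ~ icetime, data = toy)
fit_min <- lm(shots ~ I(icetime / 60), data = toy)

slopes_match <- all.equal(
  unname(coef(fit_min)[2]),
  unname(coef(fit_sec)[2]) * 60
)
slopes_match # TRUE
```

Rescaling the predictor changes only the units of the slope; the R-squared and p-value stay the same.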
5. Fix position names
skater_data_cleaned <- skater_data_cleaned %>% # to match colors
  mutate(position = recode(position,
                           "C" = "Center",
                           "D" = "Defense",
                           "L" = "Left Wing",
                           "R" = "Right Wing"))
6. Visualize the data (scatterplot) - Ice time (in minutes) vs shots on goal (SOG), color-coded by position
ggplot(skater_data_cleaned, aes(x = icetime_minutes, y = OnIce_F_shotsOnGoal, color = position)) +
  geom_point(alpha = 0.5, size = 3) +
  # regression line; linewidth replaces the size argument deprecated for lines in ggplot2 3.4.0
  geom_smooth(method = "lm", se = FALSE, color = "#ff6347", linewidth = 1.5) +
  scale_color_manual(values = c("Center" = "#1f77b4", "Defense" = "#ffd700",
                                "Left Wing" = "#32cd32", "Right Wing" = "#ff69b4")) +
  labs(title = 'Impact of Ice Time on Shots by Player Position',
       x = 'Total Time Spent on Ice (Minutes)',
       y = 'Total Number of Shots on Goal',
       color = 'Position') +
  annotate("text", x = Inf, y = -Inf,
           label = "Dataset Source: https://moneypuck.com/data.htm",
           hjust = 1, vjust = -1, size = 3, color = "gray50") +
  theme_minimal() +
  theme(plot.title = element_text(color = "#ff7f0e", size = 16, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10))
`geom_smooth()` using formula = 'y ~ x'
7. Visualize shot distribution based on position
ggplot(skater_data_cleaned, aes(x = position, y = OnIce_F_shotsOnGoal, fill = position)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Center" = "#1f77b4", "Defense" = "#ffd700",
                               "Left Wing" = "#32cd32", "Right Wing" = "#ff69b4")) +
  labs(title = 'Distribution of Shots Attempted by Position',
       x = 'Player Position',
       y = 'Number of Shots Attempted',
       fill = 'Position') +
  theme_minimal() +
  theme(plot.title = element_text(color = "#ff7f0e", size = 16, face = "bold"),
        axis.title = element_text(size = 12),
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 10))
8. Summary stats
Let’s check out the distribution of Ice Time and Shots on Goal
summary(skater_data_cleaned$icetime_minutes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 7.817 73.375 337.176 554.550 2123.117
summary(skater_data_cleaned$OnIce_F_shotsOnGoal)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 29.0 170.2 265.2 1345.0
Exploring the Impact of Ice Time on Shot Attempts in the 2023-2024 NHL Regular Season
Data cleaning and prep
The dataset I worked with gave me the quantitative data I needed, but there was one issue: Ice Time was recorded in seconds, which made the values hard to interpret. I converted Ice Time from seconds to minutes, which made the data easier to understand for a general audience and more readable on the graph, especially along the x-axis. The conversion helped put into context how much time players were really spending on the ice.
I also cleaned up the data by removing rows with missing values and focused on the key variables I needed for the analysis: Ice Time, Shots on Goal (SOG), and Position. With everything prepped, I ran a linear regression to see if there was a significant relationship between Ice Time and Shots on Goal.
Model results, interpretation, and data visualization
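The linear regression model showed a strong, positive relationship between Ice Time and Shots on Goal. Because the model was fit with Ice Time in seconds, the slope of 0.0085 is per second: each additional minute on the ice goes with about 0.51 more shots on goal (0.0085 × 60). Note that OnIce_F_shotsOnGoal counts shots on goal by the player's team while that player is on the ice, so these are on-ice shots rather than only the player's own. The p-value was less than 2.2e-16, meaning the relationship is statistically significant: more Ice Time is strongly associated with more shots directed at the net.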
Even better, the model’s R-squared value was 0.956, meaning that 95.6% of the variation in shots on goal can be explained by how much time a player spends on the ice. Only the remaining 4.4% is left to be explained by other factors, like skill level, team strategies, or player position.
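The introduction mentions correlation alongside regression. In a regression with a single predictor, R-squared is the square of the Pearson correlation between the two variables. The sketch below checks that identity on a made-up toy data frame (not the skater data), then backs out the correlation implied by the reported R-squared.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  icetime_minutes = c(10, 20, 30, 40, 50, 60),
  shots           = c(4, 11, 14, 22, 24, 31)
)

# Pearson correlation and the R-squared of the matching simple regression
r  <- cor(toy$icetime_minutes, toy$shots)
r2 <- summary(lm(shots ~ icetime_minutes, data = toy))$r.squared

identity_holds <- all.equal(r^2, r2)
identity_holds # TRUE

# Correlation implied by the reported Multiple R-squared of 0.9559
# (positive, because the fitted slope is positive)
implied_r <- sqrt(0.9559)
round(implied_r, 3) # 0.978
```

On the real data, cor(skater_data_cleaned$icetime, skater_data_cleaned$OnIce_F_shotsOnGoal) would give the same number directly.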
To compare the hockey positions, I added Position to the analysis and looked at how shots on goal are distributed among centers, defensemen, left wings, and right wings, using a scatterplot and a boxplot color-coded by position. All positions followed the overall trend of more ice time going with more shots, but the distributions differed slightly: defensemen generally had fewer shots on goal than forwards (centers, left wings, and right wings).
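A numeric companion to the boxplot can make the position comparison easier to quote. The sketch below shows one way to summarize shots on goal by position with dplyr. The toy_positions data frame holds made-up numbers, so its medians say nothing about the real season.

```r
library(dplyr)

# Toy data, invented for illustration; column names follow the analysis
toy_positions <- data.frame(
  position = rep(c("Center", "Defense", "Left Wing", "Right Wing"), each = 3),
  OnIce_F_shotsOnGoal = c(300, 420, 510,
                          250, 380, 600,
                          280, 400, 450,
                          310, 390, 470)
)

# Row count, median, and mean shots on goal for each position
position_summary <- toy_positions %>%
  group_by(position) %>%
  summarise(
    n_rows     = n(),
    median_sog = median(OnIce_F_shotsOnGoal),
    mean_sog   = mean(OnIce_F_shotsOnGoal),
    .groups    = "drop"
  )

position_summary
```

Swapping toy_positions for skater_data_cleaned would produce the same table for the actual dataset.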
Takeaways
This analysis supports what you would expect in a sports game: players who spend more time on the ice are more likely to take more shots. What’s interesting is the strength of that connection. An R-squared value of 0.956 shows just how closely these two variables are linked. Data science techniques used in this analysis, like linear regression, can help uncover critical insights, even for those who don’t follow hockey or sports in general.
What I want to explore more deeply in the future is how other factors, such as team strategies and position, play a role in influencing shot attempts. The position analysis gives a hint that player roles do affect shooting behavior, and incorporating more variables like team dynamics or game situations would provide an even more robust context on what drives overall player performance. The analysis in this project shows how beneficial statistical modeling can be for understanding sports data, where complex relationships become accessible and engaging when broken down with effective EDA techniques.
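As a starting point for that follow-up, position could enter the model as an additional predictor. The sketch below fits such a model on a made-up toy data frame. The numbers are invented, and the formula is one possible specification, not a result from this project.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  position = rep(c("Center", "Defense", "Left Wing", "Right Wing"), each = 3),
  icetime_minutes = c(200, 800, 1400,
                      300, 1000, 1800,
                      150, 700, 1300,
                      250, 900, 1500),
  OnIce_F_shotsOnGoal = c(110, 420, 730,
                          130, 480, 900,
                          80, 370, 690,
                          120, 470, 760)
)

# Ice time plus position; "Center" is the reference level (first alphabetically)
fit_with_position <- lm(OnIce_F_shotsOnGoal ~ icetime_minutes + position, data = toy)

# One intercept, one ice-time slope, and three position offsets relative to Center
coef(fit_with_position)
```

Each position coefficient is the estimated difference in shots on goal from centers at the same ice time.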