Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)library(dplyr)
#Importing my data setshots_2023 <-read_csv("~/Downloads/shots_2023 2.csv")
Rows: 122472 Columns: 124
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): awayTeamCode, event, goalieNameForShot, homeTeamCode, lastEventCa...
dbl (111): shotID, arenaAdjustedShotDistance, arenaAdjustedXCord, arenaAdjus...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Selecting only the columns that I need for analysisadj_shots_23 <-select(shots_2023,c("shotID", "awayTeamCode", "awayTeamGoals","event":"goalieNameForShot","homeTeamCode":"homeTeamWon","isHomeTeam", "period", "shotOnEmptyNet","teamCode"))#Inspecting the cleaned datahead(adj_shots_23)
This data is all of the shots taken in the 2023-2024 NHL regular season. There are 124 variables (columns) and 122,472 observations (rows). The variables vary from what team took the shot to what goalie was trying to make the save. I selected certain variables to keep based on what I think I will need because 124 variables is way too many to realistically work with.
This data comes from Money Puck. This is the link to the data: Data.
Motivation
My motivation for exploring this data is that I love hockey and want to explore hockey data more in depth. Questions that I want to try and answer include;
What factors affect whether or not a goal is scored?
If a team shoots more, are they more likely to win the game?
Hypothesis
Better teams shoot more than bad teams and are more likely to win.
Ethical Considerations
My bias is that I think the team that shoots more shots is more likely to win any given game.
Table Creation/Data Dictionary
Var Name
Class
Continuity
Description
shotID
Numeric
Discrete
A unique ID for the shot taken ordered from the first shot of the season to the last one.
awayTeamCode
Character
Discrete
The abbreviation for the team name of the away team.
awayTeamGoals
Numeric
Discrete
The number of goals that the away team has in the game when the shot was taken.
event
Character
Discrete
Whether it was a shot on goal, a missed shot, or a goal.
game_id
Numeric
Discrete
An unique ID given to each game in the season increasing by one for each new game.
goal
Numeric
Discrete
Whether or not a goal was scored on the shot. 0 for no goal and 1 for a goal.
goalieIdForShot
Numeric
Discrete
A unique ID number given to each goalie that has played in the season. It is the ID for the goalie that the shot was taken on.
goalieNameForShot
Character
Discrete
The name of the goalie that was shot at.
homeTeamCode
Character
Discrete
The abbreviation for the team name of the home team.
homeTeamGoals
Numeric
Discrete
The number of goals that the home team has in the game when the shot was taken.
homeTeamWon
Numeric
Discrete
Whether or not the home team won the game. 0 for an away team win and 1 for a home team win.
isHomeTeam
Numeric
Discrete
Whether or not the team shooting the puck is the home team. 0 for the away team and 1 for the home team.
period
Numeric
Discrete
What period of the game does the shot take place in. Typically 1 to 3, but can be 4 if there was OT (overtime).
shotOnEmptyNet
Numeric
Discrete
Was the shot on an empty net or not. 0 if the goalie was in the net and 1 if the net was empty.
teamCode
Character
Discrete
The abbreviation for the team name of the team that took the shot.
Where are missing values concentrated in my dataset?
# Count total missing values in each columnna_counts <-colSums(is.na(adj_shots_23))#Find variables with NA'sprint(na_counts[na_counts >0])
goalieNameForShot
843
#Replacing missing values for goalieNameForShot with "Unknown"adj_shots_23 %>%mutate(goalieNameForShot =replace_na(goalieNameForShot, "Unknown"))
The missing data in my dataset appear in the “goalieNameForShot” variable. This is a character variable so I replaced all of the missing values with “Unknown”.
What outliers exist in my dataset using both IQR and Z-score methods?
Creating a box plot to look for outliers with the IQR method.
#Create box plot for homeTeamGoals and awayTeamGoals to check for outliersadj_shots_23 %>%select(homeTeamGoals, awayTeamGoals) %>%gather(metric, value) %>%ggplot(aes(x = metric, y = value)) +geom_boxplot() +theme_minimal() +labs(title ="Distribution of Home Team Goals",x ="Metric",y ="Value") +theme(axis.text.x =element_text(angle =45))
Interpretation: The distribution of goals scored by the home team seems to be skewed up, or right. There also seem to be some really high outliers. The same can be said for away team goals.
Using the z-score method to examine outliers.
# Function to identify outliers using z-scoreidentify_outliers_z <-function(x) { z_scores <-scale(x)abs(z_scores) >3}# Apply to numerical columnsadj_shots_23 %>%select(homeTeamGoals, awayTeamGoals) %>%map_df(~sum(identify_outliers_z(.))) %>%gather(variable, outlier_count) %>%filter(outlier_count >0)
# Calculate z-scores for loudnessadj_shots_23_away_z <- adj_shots_23 %>%mutate(awayTeamGoals_zscore = (awayTeamGoals -mean(awayTeamGoals)) /sd(awayTeamGoals))# Look at the most extreme outliers by loudnessadj_shots_23_away_z %>%filter(abs(awayTeamGoals_zscore) >3) %>%arrange(desc(abs(awayTeamGoals_zscore))) %>%select(awayTeamCode, awayTeamGoals, awayTeamGoals_zscore) %>%head(100)
# A tibble: 100 × 3
awayTeamCode awayTeamGoals awayTeamGoals_zscore
<chr> <dbl> <dbl>
1 VAN 10 6.12
2 VAN 10 6.12
3 VAN 10 6.12
4 VAN 10 6.12
5 VAN 10 6.12
6 VAN 10 6.12
7 VAN 10 6.12
8 VAN 10 6.12
9 VAN 10 6.12
10 VAN 10 6.12
# ℹ 90 more rows
Interpretation: This code shows that there are a ton of outliers in this data. This is misleading however, because most of these outliers are from the same game. This means that they are essentially the same observation. The “awayTeamGoals” variable displays the number of goals that the away team has in the game whenever a shot is taken in the game. This means that if, for example, VAN has 10 goals in a game, then every time a shot is taken in that same game, it will display “awayTeamGoals” as 10. I have tried to run this same code while grouping by “game_id” and “teamCode”, but the code would not run anymore.
Exploratory Visualization
# Count the number of times each awayTeamCode appearsaway_team_counts <-table(adj_shots_23$awayTeamCode)# Print the resultprint(away_team_counts)
ANA ARI BOS BUF CAR CBJ CGY CHI COL DAL DET EDM FLA LAK MIN MTL
3469 3500 4063 3573 4066 3711 3694 3433 4297 4450 3690 4569 4469 3831 3606 3516
NJD NSH NYI NYR OTT PHI PIT SEA SJS STL TBL TOR VAN VGK WPG WSH
3588 3915 3957 4425 3640 3581 3688 3577 3680 3603 3689 4027 3871 3977 3727 3590
# Create the counts data frameteam_counts_df <-as.data.frame(table(adj_shots_23$teamCode))colnames(team_counts_df) <-c("teamCode", "count")# Select top 10 teams by counttop_teams_df <- team_counts_df %>%arrange(desc(count)) %>%slice(1:10)# Plot the bar chartggplot(top_teams_df, aes(x =reorder(teamCode, -count), y = count)) +geom_bar(stat ="identity", fill ="steelblue") +labs(title ="Top 10 Teams by Shot Count", x ="Team Code", y ="Shot Count") +theme_minimal(base_size =24) +theme(axis.text.x =element_text(angle =45, hjust =1, size =24),axis.text.y =element_text(size =24),axis.title =element_text(size =24),plot.title =element_text(size =24, face ="bold"))
Interpretation: The above bar chart shows the ten teams with the highest number of shots taken over the course of an entire season. This graph shows that the ten teams with the highest number of shots taken are all very good teams that made it to the playoffs. This is pretty much what I expected.
#Summarize total goals by teamteam_goals <- adj_shots_23 %>%filter(goal ==1) %>%group_by(teamCode) %>%summarise(total_goals =n(), .groups ="drop")#Plotggplot(team_goals, aes(x =reorder(teamCode, -total_goals), y = total_goals)) +geom_bar(stat ="identity", fill ="steelblue") +labs(title ="Total Goals Scored by Each Team",x ="Team Code",y ="Total Goals" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
Interpretation: This bar chart shows the total number of goals scored by every team in the NHL. This graph shows that, once again, the best teams are at the top.
#Filter out "MISS" events and group by teamCode and game_idfiltered_grouped_data <- adj_shots_23 %>%filter(event !="MISS") %>%group_by(teamCode, game_id)#Filter out "MISS" events and count shots per team per gameshots_per_team_game <- adj_shots_23 %>%filter(event !="MISS") %>%group_by(teamCode, game_id) %>%summarise(shot_count =n(), .groups ="drop")#Calculate mean shot count for each teammean_shot_count_per_team <- shots_per_team_game %>%group_by(teamCode) %>%summarise(mean_shot_count =mean(shot_count)) %>%arrange(desc(mean_shot_count))#Sum goals per team per gamegoals_per_team_game <- adj_shots_23 %>%group_by(teamCode, game_id) %>%summarise(total_goals =sum(goal, na.rm =TRUE), .groups ="drop")#Calculate mean goals per team across all gamesmean_goals_per_team <- goals_per_team_game %>%group_by(teamCode) %>%summarise(mean_goals =mean(total_goals), .groups ="drop") %>%arrange(desc(mean_goals))#View the resulthead(mean_goals_per_team)
# A tibble: 6 × 2
teamCode mean_goals
<chr> <dbl>
1 COL 3.71
2 EDM 3.52
3 TOR 3.48
4 TBL 3.46
5 DAL 3.41
6 CAR 3.38
#For non-overlapping labelslibrary(ggrepel) #Merge goal and shot data by teamteam_shooting_summary <-left_join(mean_goals_per_team, mean_shot_count_per_team, by ="teamCode")#Clean and readable scatter plotggplot(team_shooting_summary, aes(x = mean_shot_count, y = mean_goals, label = teamCode)) +geom_point(color ="steelblue", size =5) +geom_smooth(method ="lm", se =FALSE, color ="gray30", linetype ="dashed") + ggrepel::geom_text_repel(size =6) +# Approx 24ptlabs(title ="More Shots Means More Goals",x ="Mean Shot Count per Game",y ="Mean Goals per Game" ) +theme_minimal(base_size =24) +theme(plot.title =element_text(size =26, face ="bold"),axis.title =element_text(size =24),axis.text =element_text(size =24) )
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Interpretation: This scatter plot shows the average number of shots taken per game plotted against the average number of goals scored per game for every team. Ideally, a team would want to be in the top-right of the graph. All of the best teams in the league are in the top-right of the graph.
#Count events (excluding "MISS") by game_id and teamCodeevent_counts <- filtered_grouped_data %>%filter(event !="MISS") %>%group_by(game_id, teamCode) %>%summarise(event_count =n(), .groups ="drop")#Create a winner column#Extract unique game infogame_winners <- filtered_grouped_data %>%select(game_id, homeTeamWon, homeTeamCode, awayTeamCode) %>%distinct() %>%mutate(winner =ifelse(homeTeamWon ==1, homeTeamCode, awayTeamCode))
Adding missing grouping variables: `teamCode`
# Join winner info to event countsfinal_data <-left_join(event_counts, game_winners, by ="game_id")
Warning in left_join(event_counts, game_winners, by = "game_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
#Extract unique game_id and winner from final_datawinners_by_game <- final_data %>%select(game_id, winner) %>%distinct()#Merge with shots_per_team_gameshots_with_winner <- shots_per_team_game %>%left_join(winners_by_game, by ="game_id")#View the resulthead(shots_with_winner)
# A tibble: 6 × 4
teamCode game_id shot_count winner
<chr> <dbl> <int> <chr>
1 ANA 20031 23 VGK
2 ANA 20034 25 ANA
3 ANA 20059 29 DAL
4 ANA 20066 33 ARI
5 ANA 20081 31 BOS
6 ANA 20084 27 ANA
#Identify the shot leader for each gameshot_leaders <- shots_with_winner %>%group_by(game_id) %>%slice_max(shot_count, with_ties =FALSE) %>%ungroup()#Check if shot leader was the winnershot_leaders <- shot_leaders %>%mutate(shot_leader_won = teamCode == winner)#Count and calculate percentagesummary_counts <- shot_leaders %>%group_by(shot_leader_won) %>%summarise(games =n()) %>%mutate(label =ifelse(shot_leader_won, "Shot Leader Won", "Shot Leader Lost"),percent =round(games /sum(games) *100, 1),label_text =paste0(percent, "%") )#Plot with percent inside barsggplot(summary_counts, aes(x = label, y = games, fill = label)) +geom_bar(stat ="identity", width =0.6) +geom_text(aes(label = label_text), vjust =1.5, size =8, color ="white", fontface ="bold") +labs(title ="The Shot Leader Tends to Win",x ="",y ="Number of Games" ) +theme_minimal(base_size =24) +scale_fill_manual(values =c("steelblue", "tomato")) +theme(legend.position ="none",plot.title =element_text(size =26, face ="bold"),axis.text =element_text(size =22),axis.title =element_text(size =24) )
Interpretation: This bar chart shows whether or not the team that took more shots over the course of a game won the game or not. This graph shows that, on average the team that took more shots won the game. This coincides with the earlier findings that better teams take more shots on average, thus, they win more.
ggplot(adj_shots_23, aes(x = homeTeamGoals)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Distribution of Home Team Goals",x ="Home Team Goals",y ="Frequency" ) +theme_minimal()
Interpretation: This histogram shows the distribution of goals scored by the home team in every game over the course of the season. It has a minimum of 0 because you can’t score negative goals. It appears to be skewed right. There are some very high outliers.
Exploratory Analysis Questions
Does homeTeamGoals or awayTeamGoals have more variance, and what might this tell us?
I think that this is an important question to answer because it will tell us if home teams or away teams are more consistent scorers. To do this I found the variance of each of the variables, and then compared them.
It appears that homeTeamGoals has a higher variance than awayTeamGoals. This means that, in general, away teams score at a more consistent rate than home teams. This is probably due to the fact that home teams are more likely to score more goals and create outliers in the data set because of their home ice advantage.
What relationship exists between shots in a game and goals in a game?
I want to look into this relationship to see if there is a distinct relationship between the number of shots taken in a game and how many goals that team would score in the game. To do this, I make a scatter plot of all of the data and try to fit a line to it.
#Merging datasets to make one dataset that has how many shots a team took in a game and how many goals they scoredmerged_data <- shots_per_team_game %>%left_join(goals_per_team_game, by =c("game_id", "teamCode"))ggplot(merged_data, aes(x = shot_count, y = total_goals)) +geom_point(color ="steelblue", alpha =0.6, size =2) +geom_smooth(method ="lm", se =TRUE, color ="darkred") +labs(title ="Relationship Between Shot Count and Total Goals",x ="Shot Count",y ="Total Goals" ) +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
This graph doesn’t really show a distinct relationship between shots in a game and goals in a game. There is a vaguely positive linear relationship between the two. It appears that to score a certain number of goals, in general you need to shoot a certain number of shots. I did think, however, that there would be a more obvious relationship between the two variables.
How do total goals differ for home and away teams?
I feel that this is an important thing investigate as it will show if there really is a home ice advantage. If there is one, then we can expect there to be a fair difference between the total number of goals scored by home teams versus away teams. To do this, I add up all of the goals scored by the home teams in every game and then do the same for the away teams.
#Extract unique game_id/teamCode/isHomeTeam comboshome_info <- adj_shots_23 %>%select(game_id, teamCode, isHomeTeam) %>%distinct()#Join with merged_datamerged_data <- merged_data %>%left_join(home_info, by =c("game_id", "teamCode"))#Summarize total goals by home vs awayhome_away_goals <- merged_data %>%group_by(isHomeTeam) %>%summarise(total_goals =sum(total_goals, na.rm =TRUE), .groups ="drop") %>%mutate(team_type =ifelse(isHomeTeam ==1, "Home", "Away"))#Plot goals for home team vs away teamggplot(home_away_goals, aes(x = team_type, y = total_goals, fill = team_type)) +geom_bar(stat ="identity") +geom_text(aes(label = total_goals), vjust =1.5, color ="white", size =5) +labs(title ="Total Goals: Home vs Away Teams",x ="Team Type",y ="Total Goals" ) +scale_fill_manual(values =c("Home"="skyblue", "Away"="tomato")) +theme_minimal()
This bar chart shows that there is a fairly significant difference in the number of goals scored by home teams versus away teams. It shows that home teams do score more in general than away teams. This is very interesting because over the course of an entire season, one would expect the total number of goals to be pretty much even. This graph however shows that there is a difference of about 300 goals, which is a lot.
How many goals does a team score, on average, at different number of shots taken?
I think that looking into this could give some good insight into whether or not more shots really does equate to more goals. To do this, I split up the number of shots taken in a game into different ranges. Then I find the mean number of goals scored by teams that had games where they took the number of shots in one of the ranges.
These results show that, in general, the more shots a team takes, the more goals they score. Some of the groups have a larger jump in average goals than others. One of the largest jumps is from 21-30 shots to 31-40 shots. These two groups also have the highest number of observations so this must be a consistent difference throughout the season for all teams.
Hypothesis Generation
Hypothesis: Shooting more shots means winning more games.
In my exploratory data analysis I looked at how shots and goals are related to each other. In some of the graphs, I looked at how many shots each team was taking and found that the best teams tend to shoot more. I also looked at how many goals teams were scoring and again, the best teams were at the top. This leads me to believe that the best teams score more goals because they are shooting more. This is seemingly true as can be seen through some of my other data explorations. This means that teams should be trying to get more shots on goal throughout the course of every game because this will lead to more goals and more success. To fully test this hypothesis I would need more data gathered over more seasons to make this more concrete. Ideally, I would want more data from different eras in the NHL, some where there was less scoring and some where there was more. I would probably want to run some kind of regression using shots and goals to determine how many wins a team would get throughout a season. If my hypothesis is true, then it might become easier to predict what teams will do well and what teams would not based soley on the number of shots that they take in games.
Stakeholder Communication
I found many interesting things through my data exploration. The first thing that I found that really pushed my thought process is the graph of the top ten teams in terms of the number of shots taken in the season. This graph showed that the ten teams with the highest number of shots taken over the course of the season were all some of the best teams in the entire NHL. All ten of them made the playoffs in the year that this data is from. After I saw that, it really drove my exploration process. Next I examined the total number of goals scored by each team in the league over the course of the season. This graph also showed the best teams in league, many of the same teams as the first graph, being at the very top. These two graphs together made me more interested n exploring shots as they relate to goals, especially when it comes to the top teams. Thus, the next thing that I examined was the average number of goals scored per game per team versus the average number of shots taken per game per team. This scatter plot once again showed the same teams being at the top right of the graph, which is where you would want your team to be. I then examined is the team that shot more shots in a game tended to win the game. The graph showed that the team that shot more did tend to win the game. This result is supported by my previous results because better teams are going to win more games and they will also shoot more shots. Then I looked at the distribution of goals scored by the home team because the best teams tend to perform best at home. This would show that the distribution is skewed right with some high-end results from the home teams. The most important result, to me anyways, is the one showing the average number of goals scored at different levels of shots. This result really supports my hyppothesis that shooting more means more wins. It showed that there was a positive relation between the number of shots taken and the number of goals scored. Thus, good teams score more whening shooting more and win more because of it. All of these results mean that teams should implement more shots into their game plans’ if they want better results. For further investigation I would suggest doing a closer examination of the worst teams in the league and how much they’re shooting because I was more focused on the best teams.
#Extract unique game_id/teamCode/isHomeTeam comboswin_info <- adj_shots_23 %>%select(game_id, teamCode, homeTeamWon) %>%distinct()#Join with merged_datamerged_data <- merged_data %>%left_join(win_info, by =c("game_id", "teamCode"))#Join with merged_datamerged_data <- merged_data %>%left_join(winners_by_game, by =c("game_id"))#Prepare datawinner_shots <- merged_data %>%mutate(is_winner =ifelse(teamCode == winner, "Winner", "Not Winner")) %>%group_by(is_winner) %>%summarise(avg_shot_count =mean(shot_count, na.rm =TRUE),.groups ="drop" )#Plotggplot(winner_shots, aes(x = is_winner, y = avg_shot_count, fill = is_winner)) +geom_bar(stat ="identity") +geom_text(aes(label =round(avg_shot_count, 1)), vjust =1.5, color ="white", size =5) +labs(title ="Average Shot Count: Winners vs Non-Winners",x ="Team Outcome",y ="Average Shot Count" ) +scale_fill_manual(values =c("Winner"="forestgreen", "Not Winner"="gray70")) +theme_minimal()