library(tidyverse)
# Importing my data set
shots_2023 <- read_csv("~/Downloads/shots_2023 2.csv")
Rows: 122472 Columns: 124
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): awayTeamCode, event, goalieNameForShot, homeTeamCode, lastEventCa...
dbl (111): shotID, arenaAdjustedShotDistance, arenaAdjustedXCord, arenaAdjus...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Selecting only the columns that I want
adj_shots_23 <- select(
  shots_2023,
  shotID, awayTeamCode, awayTeamGoals,
  event:goalieNameForShot,
  homeTeamCode:homeTeamWon,
  isHomeTeam, period, shotOnEmptyNet, teamCode
)

# Inspecting the data
head(shots_2023)
This data set contains every shot taken in the 2023-2024 NHL regular season. It has 124 variables (columns) and 122,472 observations (rows). The variables range from which team took the shot to which goalie was trying to make the save. I kept only the variables I expect to need, because 124 variables is too many to work with realistically.
This data comes from MoneyPuck. This is the link to the data: Data.
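The range selections used to build adj_shots_23 (such as event to goalieNameForShot) keep both endpoint columns and every column that sits between them in the file. The sketch below illustrates this on a small made-up tibble; the values, the column order, and the extra xCord column are assumptions for illustration only and are not taken from the real file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for shots_2023; values and column order are assumed
toy_shots <- tibble(
  shotID            = 1:2,
  awayTeamCode      = c("AAA", "AAA"),
  event             = c("SHOT", "GOAL"),
  game_id           = c(1, 1),
  goal              = c(0, 1),
  goalieIdForShot   = c(10, 10),
  goalieNameForShot = c("Goalie A", "Goalie A"),
  xCord             = c(61, -65)
)

# event:goalieNameForShot keeps both endpoints and every column between them,
# so game_id, goal, and goalieIdForShot are kept without being named
toy_selected <- select(toy_shots, shotID, event:goalieNameForShot)

names(toy_selected)
```

Because a range depends on column order, the same selection on a file with reordered columns would keep a different set of variables.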
Motivation
My motivation for exploring this data is that I love hockey and want to explore hockey data in more depth. Questions that I want to try to answer include:
What factors affect whether or not a goal is scored?
If a team scores a goal in the first period, are they more likely to win the game?
Hypothesis
A team with a higher average number of goals is more likely to score in the first period.
Ethical Considerations
My bias is that I think the team that scores in the first period has a better chance of winning the game.
Table Creation/Data Dictionary
| Var Name | Class | Continuity | Description |
|---|---|---|---|
| shotID | Numeric | Discrete | A unique ID for the shot taken, ordered from the first shot of the season to the last one. |
| awayTeamCode | Character | Discrete | The abbreviation for the team name of the away team. |
| awayTeamGoals | Numeric | Discrete | The number of goals that the away team has in the game when the shot was taken. |
| event | Character | Discrete | Whether it was a shot on goal, a missed shot, or a goal. |
| game_id | Numeric | Discrete | A unique ID given to each game in the season, increasing by one for each new game. |
| goal | Numeric | Discrete | Whether or not a goal was scored on the shot. 0 for no goal and 1 for a goal. |
| goalieIdForShot | Numeric | Discrete | A unique ID number given to each goalie that has played in the season. It is the ID for the goalie that the shot was taken on. |
| goalieNameForShot | Character | Discrete | The name of the goalie that was shot at. |
| homeTeamCode | Character | Discrete | The abbreviation for the team name of the home team. |
| homeTeamGoals | Numeric | Discrete | The number of goals that the home team has in the game when the shot was taken. |
| homeTeamWon | Numeric | Discrete | Whether or not the home team won the game. 0 for an away team win and 1 for a home team win. |
| isHomeTeam | Numeric | Discrete | Whether or not the team shooting the puck is the home team. 0 for the away team and 1 for the home team. |
| period | Numeric | Discrete | The period of the game in which the shot takes place. Typically 1 to 3, but can be 4 if there was OT (overtime). |
| shotOnEmptyNet | Numeric | Discrete | Whether or not the shot was on an empty net. 0 if the goalie was in the net and 1 if the net was empty. |
| teamCode | Character | Discrete | The abbreviation for the team name of the team that took the shot. |
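As a rough sketch of how the variables above could be combined for the first-period question, the code below flags whether the home team scored in period 1 of each game and compares home win rates between the two groups. It runs on a small made-up tibble, not the real data. The names toy_shots, game_summary, win_rates, home_scored_p1, and home_won are hypothetical, and looking only at the home team is a simplifying assumption.

```r
library(dplyr)
library(tibble)

# Hypothetical shots from three made-up games (not real MoneyPuck rows)
toy_shots <- tibble(
  game_id     = c(1, 1, 1, 2, 2, 3, 3),
  period      = c(1, 2, 3, 1, 2, 1, 3),
  isHomeTeam  = c(1, 0, 1, 0, 1, 1, 0),
  goal        = c(1, 1, 0, 1, 0, 0, 1),
  homeTeamWon = c(1, 1, 1, 0, 0, 0, 0)
)

# One row per game: did the home team score in period 1, and did it win?
game_summary <- toy_shots %>%
  group_by(game_id) %>%
  summarise(
    home_scored_p1 = any(goal == 1 & period == 1 & isHomeTeam == 1),
    home_won       = first(homeTeamWon) == 1
  )

# Home win rate split by whether the home team scored in period 1
win_rates <- game_summary %>%
  group_by(home_scored_p1) %>%
  summarise(games = n(), home_win_rate = mean(home_won))

win_rates
```

The same pattern should carry over to adj_shots_23, since it keeps game_id, period, isHomeTeam, goal, and homeTeamWon. A fuller version would also treat away teams and overtime games.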