DSA406_001_SP25_project_gpetkau

Reading in the Data

library(readr)
library(tidyverse)
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Importing my data set
shots_2023 <- read_csv("~/Downloads/shots_2023 2.csv")
Rows: 122472 Columns: 124
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (13): awayTeamCode, event, goalieNameForShot, homeTeamCode, lastEventCa...
dbl (111): shotID, arenaAdjustedShotDistance, arenaAdjustedXCord, arenaAdjus...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Keep only the columns needed for the analysis
adj_shots_23 <- select(shots_2023,
                       shotID, awayTeamCode, awayTeamGoals,
                       event:goalieNameForShot,
                       homeTeamCode:homeTeamWon,
                       isHomeTeam, period, shotOnEmptyNet,
                       teamCode)
#Inspecting the data
head(shots_2023)
# A tibble: 6 × 124
  shotID arenaAdjustedShotDistance arenaAdjustedXCord arenaAdjustedXCordABS
   <dbl>                     <dbl>              <dbl>                 <dbl>
1      0                      39.7                 59                    59
2      1                      11.3                 81                    81
3      2                      45.3                 55                    55
4      3                      43.1                 58                    58
5      4                      42.2                -64                    64
6      5                      19.9                 83                    83
# ℹ 120 more variables: arenaAdjustedYCord <dbl>, arenaAdjustedYCordAbs <dbl>,
#   averageRestDifference <dbl>, awayEmptyNet <dbl>, awayPenalty1Length <dbl>,
#   awayPenalty1TimeLeft <dbl>, awaySkatersOnIce <dbl>, awayTeamCode <chr>,
#   awayTeamGoals <dbl>, defendingTeamAverageTimeOnIce <dbl>,
#   defendingTeamAverageTimeOnIceOfDefencemen <dbl>,
#   defendingTeamAverageTimeOnIceOfDefencemenSinceFaceoff <dbl>,
#   defendingTeamAverageTimeOnIceOfForwards <dbl>, …

This data set contains every shot taken in the 2023-2024 NHL regular season. It has 124 variables (columns) and 122,472 observations (rows). The variables range from which team took the shot to which goalie was trying to make the save. I kept only the variables I expect to need, because 124 is too many to work with realistically.
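The column ranges in the select() call (event through goalieNameForShot, and homeTeamCode through homeTeamWon) keep every column positioned between the two endpoints, which is why variables such as game_id and goal end up in the subset without being named. Below is a minimal sketch of that behavior on a made-up tibble. The name toy_shots, its values, and its column order are assumptions for illustration only, not the real file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for shots_2023; values and column order are made up
toy_shots <- tibble(
  shotID            = 0:2,
  awayTeamCode      = c("AAA", "AAA", "AAA"),
  event             = c("SHOT", "MISS", "GOAL"),
  game_id           = c(1, 1, 1),
  goal              = c(0, 0, 1),
  goalieNameForShot = c("Goalie X", "Goalie X", "Goalie X"),
  xCord             = c(59, 81, 55)
)

# A range keeps both endpoints and every column positioned between them
toy_subset <- select(toy_shots, shotID, event:goalieNameForShot)

names(toy_subset)
```

Because ranges depend on column position, a reordered source file would change which columns are kept. Listing each column by name is the safer choice if the file layout might change.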

This data comes from MoneyPuck. Link to the data: Data.

Motivation

My motivation for exploring this data is that I love hockey and want to explore hockey data in more depth. Questions that I want to try to answer include:

  • What factors affect whether or not a goal is scored?

  • If a team scores a goal in the first period, are they more likely to win the game?
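One possible way to approach the second question is to collapse the shot-level data to one row per team per game, flag whether that team scored in the first period, and compare win rates. The sketch below uses a made-up tibble named toy (hypothetical values) with the same column names as adj_shots_23. It assumes, as the data dictionary states, that homeTeamWon is 1 for a home win and 0 for an away win.

```r
library(dplyr)
library(tibble)

# Hypothetical shots from three games; column names match adj_shots_23
toy <- tibble(
  game_id     = c(1, 1, 1, 2, 2, 2, 3, 3),
  isHomeTeam  = c(1, 0, 1, 0, 1, 0, 1, 0),
  period      = c(1, 1, 3, 1, 2, 3, 2, 3),
  goal        = c(1, 0, 1, 1, 1, 1, 0, 1),
  homeTeamWon = c(1, 1, 1, 0, 0, 0, 0, 0)
)

# One row per team per game
team_games <- toy |>
  group_by(game_id, isHomeTeam) |>
  summarise(
    # TRUE if any of the team's shots was a first-period goal
    scored_in_p1 = any(goal == 1 & period == 1),
    # The shooting team won when its home/away flag matches homeTeamWon
    won = first(homeTeamWon) == first(isHomeTeam),
    .groups = "drop"
  )

# Win rate for teams that did and did not score in the first period
win_rates <- team_games |>
  group_by(scored_in_p1) |>
  summarise(games = n(), win_rate = mean(won))

win_rates
```

A team that records no shots at all in a game would not appear in team_games. That should be rare in real data but is worth checking.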

Hypothesis

A team that averages more goals per game is more likely to score in the first period.
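The hypothesis could be checked by computing, for each team, the average number of goals per game and the share of games with at least one first-period goal. This is a rough sketch under stated assumptions: the tibble toy and its values are made up, and the summary column names (avg_goals, p1_scoring_rate) are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical shots from two games; column names match adj_shots_23
toy <- tibble(
  game_id  = c(1, 1, 1, 2, 2, 2, 2),
  teamCode = c("AAA", "AAA", "BBB", "AAA", "CCC", "CCC", "AAA"),
  period   = c(1, 2, 3, 1, 2, 3, 3),
  goal     = c(1, 1, 1, 0, 1, 0, 1)
)

team_summary <- toy |>
  # First collapse to one row per team per game
  group_by(teamCode, game_id) |>
  summarise(
    goals = sum(goal),
    scored_in_p1 = any(goal == 1 & period == 1),
    .groups = "drop"
  ) |>
  # Then average across each team's games
  group_by(teamCode) |>
  summarise(
    avg_goals = mean(goals),
    p1_scoring_rate = mean(scored_in_p1)
  )

team_summary
```

With the full season, plotting p1_scoring_rate against avg_goals, or computing their correlation, would show whether the two move together. Note that first-period goals are part of each team's goal total, so some positive relationship is expected by construction.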

Ethical Considerations

My bias is that I think the team that scores in the first period has a better chance of winning the game.

Table Creation/Data Dictionary

| Var Name | Class | Continuity | Description |
|---|---|---|---|
| shotID | Numeric | Discrete | A unique ID for the shot taken, ordered from the first shot of the season to the last one. |
| awayTeamCode | Character | Discrete | The abbreviation for the team name of the away team. |
| awayTeamGoals | Numeric | Discrete | The number of goals that the away team has in the game when the shot was taken. |
| event | Character | Discrete | Whether it was a shot on goal, a missed shot, or a goal. |
| game_id | Numeric | Discrete | A unique ID given to each game in the season, increasing by one for each new game. |
| goal | Numeric | Discrete | Whether or not a goal was scored on the shot. 0 for no goal and 1 for a goal. |
| goalieIdForShot | Numeric | Discrete | A unique ID number given to each goalie that has played in the season. It is the ID for the goalie that the shot was taken on. |
| goalieNameForShot | Character | Discrete | The name of the goalie that was shot at. |
| homeTeamCode | Character | Discrete | The abbreviation for the team name of the home team. |
| homeTeamGoals | Numeric | Discrete | The number of goals that the home team has in the game when the shot was taken. |
| homeTeamWon | Numeric | Discrete | Whether or not the home team won the game. 0 for an away team win and 1 for a home team win. |
| isHomeTeam | Numeric | Discrete | Whether or not the team shooting the puck is the home team. 0 for the away team and 1 for the home team. |
| period | Numeric | Discrete | The period of the game in which the shot takes place. Typically 1 to 3, but can be 4 if there was overtime (OT). |
| shotOnEmptyNet | Numeric | Discrete | Whether or not the shot was on an empty net. 0 if the goalie was in the net and 1 if the net was empty. |
| teamCode | Character | Discrete | The abbreviation for the team name of the team that took the shot. |
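The Class column of the data dictionary can be cross-checked against the data itself. The sketch below builds a small summary of each column's class and number of distinct values. It runs on a made-up tibble named toy (hypothetical values), and dictionary_check is a hypothetical name. The same calls could be pointed at adj_shots_23.

```r
library(tibble)

# Hypothetical stand-in for adj_shots_23 with three of its columns
toy <- tibble(
  shotID   = c(0, 1),
  teamCode = c("AAA", "BBB"),
  goal     = c(0, 1)
)

# One row per variable: its class and how many distinct values it takes
dictionary_check <- tibble(
  var_name   = names(toy),
  class      = vapply(toy, function(col) class(col)[1], character(1)),
  n_distinct = vapply(toy, function(col) length(unique(col)), integer(1))
)

dictionary_check
```

Columns that read_csv reports as dbl show up here as "numeric", which corresponds to the Numeric label used in the table.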