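# Load packages: tidyverse for dplyr/ggplot2/stringr, readxl for read_excel()
library(tidyverse)
library(readxl)
# Read the season's game log from the Excel file
data <- read_excel("../00_data/MyData.xlsx")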
data
## # A tibble: 19 × 100
## Date Opponent Score Column1 `All shifts` `Time on ice` Goals
## <dttm> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 2024-10-24 00:00:00 vs New En… 5:2 0 17 18:02 1
## 2 2024-11-01 00:00:00 @ Univers… 0:5 0 23 22:59 0
## 3 2024-11-03 00:00:00 @ Babson … 4:5 0 23 25:07 0
## 4 2024-11-07 00:00:00 vs SUNY-P… 5:3 0 22 21:26 0
## 5 2024-11-09 00:00:00 @ New Eng… 5:1 0 20 21:19 0
## 6 2024-11-15 00:00:00 vs Worces… 8:2 0 18 18:16 0
## 7 2024-11-21 00:00:00 vs Framin… 7:1 0 20 19:11 0
## 8 2024-11-23 00:00:00 @ Salem S… 4:3 0 24 20:05 0
## 9 2024-11-26 00:00:00 vs Trinit… 1:2 0 24 21:10 0
## 10 2024-12-05 00:00:00 @ UMass D… 4:2 0 17 19:00 0
## 11 2024-12-07 00:00:00 vs Westfi… 1:0 0 27 22:20 1
## 12 2025-01-03 00:00:00 @ Hamilto… 0:4 0 20 19:35 0
## 13 2025-01-04 00:00:00 @ William… 8:2 0 19 18:18 0
## 14 2025-01-09 00:00:00 @ Rivier … 6:1 0 20 18:30 0
## 15 2025-01-11 00:00:00 vs Fitchb… 3:2 0 24 21:30 0
## 16 2025-01-16 00:00:00 @ Worcest… 5:2 0 20 20:22 0
## 17 2025-01-18 00:00:00 vs Massac… 3:1 0 19 17:18 0
## 18 2025-01-23 00:00:00 @ Anna Ma… 2:4 0 25 25:17 0
## 19 NA Average p… <NA> 169 20 18:59 0.17
## # ℹ 93 more variables: `First assist` <dbl>, `Second assist` <dbl>,
## # Assists <dbl>, `Puck touches` <dbl>, `Puck control time` <chr>,
## # Points <dbl>, `+/-` <dbl>, Plus <dbl>, Minus <dbl>, Penalties <dbl>,
## # `Penalties drawn` <dbl>, `Penalty time` <chr>, Hits <dbl>,
## # `Hits against` <dbl>, `Error leading to goal` <dbl>, `Dump ins` <dbl>,
## # `Dump outs` <dbl>, `Team goals when on ice` <dbl>,
## # `Opponent's goals when on ice` <dbl>, Shots <dbl>, `Shots on goal` <dbl>, …
This is an analysis of my hockey data across one season. It examines metrics such as shots on goal, plus/minus, and hits using different data visualization techniques. One recurring problem: the last row of the data (row 19, where Date is NA) holds calculated averages rather than a game, so it shows up in the visuals unless it is filtered out.
data %>%
ggplot(aes(x = Shots)) +
geom_histogram(binwidth = 0.5) +
labs(title = "Distribution of Shots per Game",
x = "# of Shots",
y = "Frequency")
How are the typical plus/minus values distributed over games?
data %>%
ggplot(aes(x = `+/-`)) +
geom_histogram() +
labs(title = "Plus/Minus Rating Distribution")
Here the average +/- from the summary row is included. We can exclude it with the modulo operator: x %% 1 == 0 is TRUE only for whole numbers, so a fractional average is dropped. This works only as long as the average itself is not a whole number:
data %>%
filter(`+/-` %% 1 == 0) %>%
ggplot(aes(x = `+/-`)) +
geom_histogram() +
labs(title = "Plus/Minus Rating Distribution")
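The modulo filter depends on the average being fractional. A minimal sketch of a more robust alternative follows, assuming (as in the printed tibble above) that the summary row is the only row whose Date is NA. The toy tibble and its values are made up for illustration and are not taken from the real data.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the real data: three games plus a summary row (Date is NA).
toy <- tibble(
  Date  = as.Date(c("2024-10-24", "2024-11-01", "2024-11-03", NA)),
  `+/-` = c(1, -2, 0, 1)  # hypothetical case: the average happens to be whole
)

# Modulo filter: keeps every row whose +/- is a whole number,
# so a whole-number average slips through.
by_modulo <- toy %>% filter(`+/-` %% 1 == 0)

# Date filter: drops the summary row regardless of the values it holds.
by_date <- toy %>% filter(!is.na(Date))

nrow(by_modulo)
nrow(by_date)
```

Filtering on Date also protects every other column at once, whereas the modulo test has to be repeated for each variable that is plotted.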
Let's look at the distribution of shots per game and see if there are any outliers.
data %>%
ggplot(aes(Shots)) +
geom_histogram() +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Distribution of Shot Count",
x = "Shots per Game",
y = "Frequency")
Seems like 6 is my lucky number.
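Reading the most common value off a histogram can be checked with a plain frequency table. The sketch below uses a made-up Shots vector, not the real season data, so the counts are illustrative only.

```r
library(dplyr)
library(tibble)

# Made-up shot counts for seven hypothetical games.
toy <- tibble(Shots = c(6, 3, 6, 4, 6, 2, 4))

# Tabulate each distinct value and sort by frequency, most common first.
shot_counts <- toy %>% count(Shots, sort = TRUE)

shot_counts
```

The first row of the result is the most frequent shot count; values with a count of one that sit far from the rest are candidate outliers.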
No missing values in my game data. The only NAs visible in the output above are Date and Score in the summary row (row 19).
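One way to check for missing values is to count the NAs in each column. This is a sketch on a made-up three-row tibble, not the real data; the column names mirror a few of the real ones for readability.

```r
library(dplyr)
library(tibble)

# Made-up data: two games plus a summary row with missing Date and Score.
toy <- tibble(
  Date  = as.Date(c("2024-10-24", "2024-11-01", NA)),
  Score = c("5:2", "0:5", NA),
  Shots = c(3, 6, 4.5)
)

# Count the NA values in every column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Any column with a non-zero count deserves a closer look before it is plotted.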
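My dataset has no categorical variables, so I derive one, Game_Location, from the Opponent column: an opponent that starts with "@" is an away game, and anything else is a home game.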
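data <- data %>%
# Drop the summary row (Date is NA); it would otherwise be labelled "Home"
filter(!is.na(Date)) %>%
mutate(Game_Location = ifelse(str_detect(Opponent, "^@"), "Away", "Home"))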
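A two-way ifelse() labels every row that does not start with "@" as "Home", including rows that are not games at all. A stricter variant, sketched here on made-up opponents, matches both prefixes explicitly and leaves everything else as NA. The "Season summary" label is hypothetical and stands in for any non-game row.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up opponents: one home game, one away game, one non-game row.
toy <- tibble(
  Opponent = c("vs Team A", "@ Team B", "Season summary")
)

toy <- toy %>%
  mutate(Game_Location = case_when(
    str_detect(Opponent, "^@")  ~ "Away",        # away games start with "@"
    str_detect(Opponent, "^vs") ~ "Home",        # home games start with "vs"
    TRUE                        ~ NA_character_  # anything else is not a game
  ))

toy
```

Rows with an NA location can then be dropped with filter(!is.na(Game_Location)) before plotting.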
home_games <- data %>%
filter(Game_Location == "Home")
ggplot(home_games, mapping = aes(x = `Puck battles`)) +
geom_freqpoly() +
labs(title = "Distribution of Puck Battles in Home Games",
x = "Number of puck battles",
y = "Frequency")
ggplot(data, mapping = aes(x = Takeaways)) +
geom_freqpoly(mapping = aes(colour = Game_Location), binwidth = 1) +
labs(title = "Distribution - Takeaways by Game Location",
x = "Number of Takeaways",
y = "Frequency",
colour = "Game Location")
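When the two groups have different numbers of games, raw counts make the larger group look taller everywhere. A possible refinement, sketched on made-up takeaway numbers, is to plot density instead of count so that the area under each line is the same. after_stat() requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Made-up data: five home games and four away games.
toy <- data.frame(
  Takeaways     = c(2, 3, 3, 4, 5, 1, 2, 2, 3),
  Game_Location = c(rep("Home", 5), rep("Away", 4))
)

# Map y to density so each group's polygon is normalised to unit area.
p <- ggplot(toy, aes(x = Takeaways, y = after_stat(density))) +
  geom_freqpoly(aes(colour = Game_Location), binwidth = 1) +
  labs(title = "Distribution - Takeaways by Game Location (density)",
       x = "Number of Takeaways",
       y = "Density",
       colour = "Game Location")

p
```

With density on the y-axis the shapes of the two distributions can be compared directly, at the cost of no longer reading off game counts.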