Loading in the tidyverse and data
# Loading tidyverse
library(tidyverse)
#Loading in Data
nhl_draft <- read_csv("nhldraft.csv")
Preface: The data discussed in this project comes from the NHL Draft Hockey Player Data (1963 to 2022) dataset published by Kaggle user “Matt OP” (https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022). I do not own this data, and this project is purely for academic purposes. I hope you enjoy the analysis and gain some insight into how NHL draft and player data can inform who to pick in next year’s draft.
This project is broken down into 3 parts:
1. Exploring how the draft has evolved over time.
2. Which country tends to produce the best players, and why.
3. How this data can help teams choose draft picks in the future.
Each part will have its own data analysis, with the intention of gaining insight into NHL draft picks and each player’s impact in the NHL.
To start, let’s get a summary of the data to get a better idea of what lies within.
#Looking at data
glimpse(nhl_draft)
## Rows: 12,250
## Columns: 23
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ year <chr> "2022", "2022", "2022", "2022", "2022", "2022", …
## $ overall_pick <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ team <chr> "Montreal Canadiens", "New Jersey Devils", "Ariz…
## $ player <chr> "Juraj Slafkovsky", "Simon Nemec", "Logan Cooley…
## $ nationality <chr> "SK", "SK", "US", "CA", "SE", "CZ", "CA", "AT", …
## $ position <chr> "LW", "D", "C", "C", "LW", "D", "D", "C", "C", "…
## $ age <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ to_year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ amateur_team <chr> "TPS (Finland)", "HK Nitra (Slovakia)", "USA U-1…
## $ games_played <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ assists <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ points <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ plus_minus <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ penalties_minutes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_games_played <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_wins <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_losses <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_ties_overtime <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ save_percentage <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals_against_average <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ point_shares <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
The dataset has 12,250 rows and 23 columns. There are 6 categorical (character) columns and 17 numeric columns.
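The column-type tally above can be checked programmatically rather than counted by eye. The following is a minimal sketch on a small invented table (`toy_draft` is a hypothetical stand-in for `nhl_draft`, not the real data):

```r
library(tibble)

# Toy stand-in for nhl_draft; the values are invented for illustration only
toy_draft <- tibble(
  year         = c("2022", "2021"),
  team         = c("Team A", "Team B"),
  overall_pick = c(1, 2),
  age          = c(18, 19)
)

# Take the first class of each column, then tally how many columns share it
col_types   <- vapply(toy_draft, function(col) class(col)[1], character(1))
type_counts <- table(col_types)
print(type_counts)
```

Running the same two lines against `nhl_draft` should reproduce the 6 character and 17 numeric columns reported by `glimpse()`.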
Let’s look at a summary of all of the numeric columns before going to the categorical:
# select(where(...)) is the current replacement for the superseded select_if()
numeric_cols <- nhl_draft |> select(where(is.double))
summary(numeric_cols)
## id overall_pick age to_year
## Min. : 1 Min. : 1.0 Min. :16.00 Min. :1968
## 1st Qu.: 3063 1st Qu.: 55.0 1st Qu.:18.00 1st Qu.:1993
## Median : 6126 Median :112.0 Median :18.00 Median :2006
## Mean : 6126 Mean :116.6 Mean :18.68 Mean :2004
## 3rd Qu.: 9188 3rd Qu.:174.0 3rd Qu.:19.00 3rd Qu.:2018
## Max. :12250 Max. :293.0 Max. :37.00 Max. :2022
## NA's :3959 NA's :7004
## games_played goals assists points
## Min. : 1.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 25.0 1st Qu.: 0.00 1st Qu.: 2.00 1st Qu.: 3.0
## Median : 151.5 Median : 9.00 Median : 18.00 Median : 28.0
## Mean : 305.4 Mean : 50.86 Mean : 85.04 Mean : 135.9
## 3rd Qu.: 521.0 3rd Qu.: 56.00 3rd Qu.: 107.00 3rd Qu.: 166.0
## Max. :1779.0 Max. :780.00 Max. :1249.00 Max. :1921.0
## NA's :7004 NA's :7004 NA's :7004 NA's :7004
## plus_minus penalties_minutes goalie_games_played goalie_wins
## Min. :-257.000 Min. : 0.0 Min. : 1.0 Min. : 0.00
## 1st Qu.: -17.000 1st Qu.: 8.0 1st Qu.: 7.0 1st Qu.: 2.00
## Median : -2.000 Median : 65.0 Median : 74.0 Median : 23.50
## Mean : -2.283 Mean : 245.3 Mean : 181.3 Mean : 77.97
## 3rd Qu.: 1.000 3rd Qu.: 311.0 3rd Qu.: 294.0 3rd Qu.:117.00
## Max. : 722.000 Max. :3971.0 Max. :1266.0 Max. :691.00
## NA's :7016 NA's :7004 NA's :11760 NA's :11762
## goalie_losses goalie_ties_overtime save_percentage goals_against_average
## Min. : 0.0 Min. : 0.00 Min. :0.500 Min. : 0.000
## 1st Qu.: 3.0 1st Qu.: 1.00 1st Qu.:0.876 1st Qu.: 2.710
## Median : 32.0 Median : 9.00 Median :0.895 Median : 3.100
## Mean : 69.2 Mean : 21.32 Mean :0.887 Mean : 3.376
## 3rd Qu.:112.2 3rd Qu.: 33.00 3rd Qu.:0.908 3rd Qu.: 3.697
## Max. :397.0 Max. :154.00 Max. :1.000 Max. :27.270
## NA's :11762 NA's :11762 NA's :11762 NA's :11760
## point_shares
## Min. :-10.10
## 1st Qu.: 0.10
## Median : 3.30
## Mean : 17.36
## 3rd Qu.: 22.40
## Max. :242.70
## NA's :7004
Now let’s look at the 6 categorical variables closer:
Year:
# Unique Years
unique(nhl_draft$year)
## [1] "2022" "2021" "2020" "2019" "2018" "2017" "2016" "2015" "2014" "2013"
## [11] "2012" "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003"
## [21] "2002" "2001" "2000" "1999" "1998" "1997" "1996" "1995" "1994" "1993"
## [31] "1992" "1991" "1990" "1989" "1988" "1987" "1986" "1985" "1984" "1983"
## [41] "1982" "1981" "1980" "1979" "1978" "1977" "1976" "1975" "1974" "1973"
## [51] "1972" "1971" "1970" "1969" "1968" "1967" "1966" "1965" "1964" "1963"
This dataset covers every draft from 1963 through 2022, which will let us break the data down by decade and by type of draft later on. Note that year is stored as a character column.
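Because `year` is read in as text, grouping by decade needs a numeric conversion first. This is a rough sketch of one way to do it, using a small invented table (`toy_draft` and the `decade` column are assumptions for illustration):

```r
library(dplyr)

# Toy stand-in for nhl_draft; only the year column matters here
toy_draft <- tibble(year = c("1963", "1979", "1985", "2005", "2022"))

toy_decades <- toy_draft |>
  mutate(
    year   = as.integer(year),      # "1963" -> 1963L
    decade = (year %/% 10L) * 10L   # integer division drops the last digit
  )

print(toy_decades)
```

Integer division keeps the decade as a plain number, so it sorts correctly and can be used directly in `group_by()`.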
# Number of Records per year
ggplot(nhl_draft) +
  # geom_bar() counts rows per category; geom_histogram(stat = "count")
  # draws the same chart but warns about ignored binning parameters
  geom_bar(mapping = aes(x = year, fill = year)) +
  labs(x = "Year", y = "Observations", fill = "Year",
       title = "Number of Observations per Year",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022") +
  guides(x = guide_axis(angle = 75)) +
  theme(plot.caption = element_text(hjust = 1.2))
Looking at the graph above, we can see that the number of picks per year rises steeply after the 1960s. The early drafts were small, with only a few dozen players selected each year. As the draft grew, more picks, and therefore more observations, were recorded per year: every draft since 1980 has more than 200 picks in this dataset.
Team:
# Unique Values
unique(nhl_draft$team)
## [1] "Montreal Canadiens" "New Jersey Devils"
## [3] "Arizona Coyotes" "Seattle Kraken"
## [5] "Philadelphia Flyers" "Columbus Blue Jackets"
## [7] "Chicago Blackhawks" "Detroit Red Wings"
## [9] "Buffalo Sabres" "Anaheim Ducks"
## [11] "Winnipeg Jets" "Vancouver Canucks"
## [13] "Nashville Predators" "Dallas Stars"
## [15] "Minnesota Wild" "Washington Capitals"
## [17] "Pittsburgh Penguins" "St. Louis Blues"
## [19] "San Jose Sharks" "Tampa Bay Lightning"
## [21] "Edmonton Oilers" "Toronto Maple Leafs"
## [23] "Vegas Golden Knights" "Los Angeles Kings"
## [25] "Boston Bruins" "Calgary Flames"
## [27] "Carolina Hurricanes" "New York Rangers"
## [29] "Ottawa Senators" "New York Islanders"
## [31] "Florida Panthers" "Colorado Avalanche"
## [33] "Phoenix Coyotes" "Atlanta Thrashers"
## [35] "Hartford Whalers" "Quebec Nordiques"
## [37] "Minnesota North Stars" "Colorado Rockies"
## [39] "Atlanta Flames" "Cleveland Barons"
## [41] "California Golden Seals" "Kansas City Scouts"
## [43] "Oakland Seals" NA
Looking at the unique values of teams in the dataset, we can see 43 team names (plus a missing value), more than the 32 teams currently in the NHL. This is because teams have been created, relocated, renamed, and dissolved since the start of the NHL draft. For example, the Phoenix Coyotes and the Arizona Coyotes are listed separately.
# Counts
ggplot(nhl_draft)+
  # geom_bar() counts rows per team without the warning that
  # geom_histogram(stat = "count") produces
  geom_bar(mapping = aes(x = team, fill = team)) +
labs(x = "Team", y= "Observations", fill = "Team",
title = "Number of Observations per NHL Team",
subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
caption = "Data provided by Kaggle User \"Matt OP\".
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022")+
guides(x = guide_axis(angle = 75))+
theme(plot.caption = element_text(hjust = 1.85))
The graph above is consistent with the team history noted earlier: teams that are newer or were short-lived have the fewest observations. The number of observations per team gives a rough indication of which teams are young and which have been around for a long time.
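The bar chart makes it hard to read exact values, so a sorted count table can back it up. A minimal sketch, again on invented rows (`toy_draft` is hypothetical; the team names are taken from the list above, the counts are not real):

```r
library(dplyr)

# Toy stand-in for nhl_draft; the number of rows per team is invented
toy_draft <- tibble(
  team = c("Montreal Canadiens", "Montreal Canadiens", "Montreal Canadiens",
           "Oakland Seals", "Oakland Seals", "Seattle Kraken", NA)
)

# count() tallies rows per team; sort = TRUE puts the largest groups first
team_counts <- toy_draft |> count(team, sort = TRUE)
print(team_counts)
```

Calling `tail()` on the result shows the teams with the fewest picks, and the `NA` row shows how many picks have no team recorded.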
Player:
This variable holds each player’s name. There is no meaningful analysis to do on names alone. For example, I can’t generalize about whether players with the first name “Cody” were successful in making it to the NHL.
Nationality:
Nationality is, in my opinion, one of the most interesting variables in this analysis, because of the sheer number of countries that players come from and how the origins of drafted players have changed over time.
# Unique Countries
unique(nhl_draft$nationality)
## [1] "SK" "US" "CA" "SE" "CZ" "AT" "RU" "FI" "CH" "DE" "LV" "PL" "BY" "GB" "KZ"
## [16] "NO" "UA" "UZ" "DK" "AU" "TH" "JM" "FR" "SI" "BE" "NL" "CN" "LT" NA "IT"
## [31] "NG" "EE" "JP" "ME" "HU" "YU" "BS" "BR" "TZ" "BN" "KR" "ZA" "SU" "HT" "TW"
## [46] "PY" "VE"
The unique countries above are formatted as two-letter country codes.
Below are the full names of the countries.
## [1] "Austria" "Australia" "Belgium"
## [4] "Brunei Darussalam" "Brazil" "Bahamas"
## [7] "Belarus" "Canada" "Switzerland"
## [10] "China" "Czech Republic" "Germany"
## [13] "Denmark" "Estonia" "Finland"
## [16] "France" "United Kingdom" "Haiti"
## [19] "Hungary" "Italy" "Jamaica"
## [22] "Japan" "South Korea" "Kazakhstan"
## [25] "Lithuania" "Latvia" "Montenegro"
## [28] "Nigeria" "Netherlands" "Norway"
## [31] "Poland" "Paraguay" "Russia"
## [34] "Sweden" "Slovenia" "Slovakia"
## [37] "Soviet Union" "Thailand" "Taiwan"
## [40] "Tanzania" "Ukraine" "United States"
## [43] "Uzbekistan" "Venezuela" "Yugoslavia"
## [46] "South Africa"
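The code that produced the full names is not shown in this write-up. One plausible way to build such a mapping is a named character vector used as a lookup table. The sketch below is an assumption, not the original method: `country_lookup` is a hypothetical name and only a handful of the 46 codes are included.

```r
# Names are the two-letter codes, values are the display names.
# Historical codes such as SU and YU are entered by hand here.
country_lookup <- c(
  CA = "Canada",
  US = "United States",
  SE = "Sweden",
  SU = "Soviet Union",
  YU = "Yugoslavia"
)

# Example codes as they might appear in the nationality column
codes <- c("CA", "SU", "US", NA, "CA")

# Indexing by name translates each code; a missing code stays NA
full_names <- unname(country_lookup[codes])
print(full_names)
```

A hand-written vector keeps the historical codes under explicit control, which matters because ready-made country tables may not include them.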
Now let’s look at how many players have been drafted per nationality.
# Count per nationality
ggplot(nhl_draft) +
  # geom_bar() counts rows per nationality without the warning that
  # geom_histogram(stat = "count") produces
  geom_bar(mapping = aes(x = nationality, fill = nationality)) +
  # The legend shows countries, so label it "Country" rather than "Team"
  labs(x = "Country", y = "Observations", fill = "Country",
       title = "Number of Observations per Player Nationality",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022") +
  guides(x = guide_axis(angle = 75)) +
  theme(plot.caption = element_text(hjust = 1.21)) +
  scale_fill_discrete(labels = countries)
From the graph above, we can see that most drafted players come from Canada and the United States; together those two countries appear to account for well over 50% of the data. Many other countries have only a handful of drafted players. This suggests that Canada has the largest presence in the draft.
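The claim about the combined share of Canada and the United States can be checked with a share table instead of being read off the chart. A minimal sketch with invented counts (`toy_draft` is hypothetical and its proportions are not the real ones):

```r
library(dplyr)

# Toy stand-in for nhl_draft; the nationality mix is invented
toy_draft <- tibble(
  nationality = c("CA", "CA", "CA", "CA", "US", "US", "US", "SE", "SE", "FI")
)

nationality_share <- toy_draft |>
  count(nationality, sort = TRUE) |>        # largest groups first
  mutate(
    share            = n / sum(n),          # fraction of all picks
    cumulative_share = cumsum(share)        # running total down the table
  )

print(nationality_share)
```

On the real data, the second row of `cumulative_share` would give the combined share of the two largest countries.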
Position:
Now let’s look at the different positions listed for drafted players.
unique(nhl_draft$position)
## [1] "LW" "D" "C" "RW" "G" "W" "C/LW" "C/RW" "L/RW"
## [10] NA "F" "C/W" "D/W" "LW/C" "C/D" "RW/C" "D/LW" "LW/D"
## [19] "RW/D" "D/C" "D/RW" "C / R"
As you can see, many entries list more than one position, since some players enter the draft having played multiple positions. That versatility can add to a player’s value in the eyes of teams.
# Counts per position
ggplot(nhl_draft) +
  # geom_bar() counts rows per position without the warning that
  # geom_histogram(stat = "count") produces
  geom_bar(mapping = aes(x = position, fill = position)) +
  labs(x = "Position", y = "Observations", fill = "Position",
       title = "Number of Observations per Position",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022") +
  guides(x = guide_axis(angle = 75)) +
  theme(plot.caption = element_text(hjust = 1.2))
Looking at the graph above, we can see that the positions with the most observations are the standard positions of hockey: Center (C), Defenseman (D), Goaltender (G), Right Wing (RW), and Left Wing (LW). This makes sense, as more drafted players are listed at a single position than at two.
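The position values are not perfectly consistent: the unique values include "C / R", written with spaces around the slash. If single-position and multi-position players are to be compared, a small cleanup step may help. The sketch below is one possible approach on invented rows; `position_clean` and `multi_position` are hypothetical column names.

```r
library(dplyr)
library(stringr)

# Toy stand-in for nhl_draft, using position strings seen in the data
toy_draft <- tibble(position = c("C", "D", "C / R", "LW/C", "G", NA))

toy_positions <- toy_draft |>
  mutate(
    # Drop all whitespace so "C / R" becomes "C/R"
    position_clean = str_remove_all(position, "\\s+"),
    # A slash marks a player listed at more than one position
    multi_position = str_detect(position_clean, "/")
  )

print(toy_positions)
```

Missing positions stay `NA` in both new columns, so they are not silently counted as single-position players.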
Amateur Team:
cat("Count:", NROW(unique(nhl_draft$amateur_team)))
## Count: 1548
As you can see above, there are 1,548 distinct values in the amateur team column. Later in this analysis we will explore which of these teams have produced the best players and whether there is any correlation between a player’s success and the amateur team they came from.
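One caveat with this kind of count: `unique()` treats `NA` as a value of its own, so a column with missing entries reports one more "team" than actually exists. A short sketch of the difference, using two amateur team names from the data and an invented missing entry:

```r
library(dplyr)

# Small example vector; the NA stands for a player with no amateur team recorded
amateur_team <- c("TPS (Finland)", "HK Nitra (Slovakia)", "TPS (Finland)", NA)

with_na    <- n_distinct(amateur_team)                # NA counted as a value
without_na <- n_distinct(amateur_team, na.rm = TRUE)  # NA ignored

cat("Including NA:", with_na, "\n")
cat("Excluding NA:", without_na, "\n")
```

Whether the 1,548 figure includes such a missing value depends on the data, so it is worth checking before quoting the number.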
Now let’s talk about the history of the NHL draft. In 1963, the NHL held its first ever draft, called the Amateur Draft, which ran through 1978. In 1979, the rules changed to allow players who had previously played professional hockey to be drafted. This followed the NHL absorbing teams from the newly defunct World Hockey Association, so that those players could still play. Starting in 1980, any player between 18 and 20 years old was eligible to be drafted, and any non-North American player over the age of 20 could be selected. These changes were the foundation of the draft known today as the NHL Entry Draft.
To illustrate the ages of players in each draft, let’s break the data down by draft.
Amateur Draft (1963-1978):
amateur_draft_era <- nhl_draft |>
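  # year is a character column, so compare against character years;
  # note the lowercase c-style vector (C() is an unrelated contrasts function)
  filter(year %in% as.character(1963:1978))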
unique(amateur_draft_era$team)
## [1] "Minnesota North Stars" "Washington Capitals"
## [3] "St. Louis Blues" "Vancouver Canucks"
## [5] "Colorado Rockies" "Philadelphia Flyers"
## [7] "Montreal Canadiens" "Detroit Red Wings"
## [9] "Chicago Blackhawks" "Atlanta Flames"
## [11] "Buffalo Sabres" "New York Islanders"
## [13] "Boston Bruins" "Toronto Maple Leafs"
## [15] "Pittsburgh Penguins" "New York Rangers"
## [17] "Los Angeles Kings" "Cleveland Barons"
## [19] "California Golden Seals" "Kansas City Scouts"
## [21] "Oakland Seals" NA
amateur_draft_era |>
group_by(year) |>
summarise(min_age = min(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE),
n = n())
## # A tibble: 16 × 6
## year min_age median_age max_age avg_age n
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1963 16 16 17 16.2 21
## 2 1964 16 17 17 16.6 24
## 3 1965 17 17 17 17 11
## 4 1966 17 17 18 17.2 24
## 5 1967 20 20 21 20.3 18
## 6 1968 19 20 21 19.9 24
## 7 1969 19 20 20 19.7 84
## 8 1970 19 20 20 19.7 115
## 9 1971 19 20 20 19.7 117
## 10 1972 19 20 20 19.7 152
## 11 1973 19 20 24 19.8 168
## 12 1974 17 20 20 19.5 246
## 13 1975 19 20 20 19.7 217
## 14 1976 19 20 20 19.8 135
## 15 1977 19 20 20 19.8 185
## 16 1978 19 20 21 19.8 234
NHL Entry Draft (1979-Present):
nhl_entry_draft <- nhl_draft |>
  # year is a character column, so compare against character years
  filter(year %in% as.character(1979:2022))
nhl_entry_draft |>
group_by(year) |>
summarise(min_age = min(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE),
n = n()) |>
print(n = 44)
## # A tibble: 44 × 6
## year min_age median_age max_age avg_age n
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1979 18 19 20 19.3 126
## 2 1980 18 19 20 18.8 210
## 3 1981 18 18 20 18.5 211
## 4 1982 18 18 30 19.1 252
## 5 1983 18 18 31 18.7 242
## 6 1984 18 18 31 18.6 250
## 7 1985 18 18 24 18.3 252
## 8 1986 18 18 26 18.5 252
## 9 1987 18 18 25 18.8 252
## 10 1988 18 18 25 18.9 252
## 11 1989 18 19 37 19.4 252
## 12 1990 18 19 26 19.0 252
## 13 1991 18 19 27 19.2 264
## 14 1992 18 18 30 19.1 264
## 15 1993 18 18 30 18.9 286
## 16 1994 18 18 28 18.7 286
## 17 1995 18 18 26 18.5 234
## 18 1996 18 18 28 18.7 241
## 19 1997 17 18 26 18.8 246
## 20 1998 18 18 29 19.1 258
## 21 1999 18 18 27 19.0 272
## 22 2000 18 19 31 19.7 293
## 23 2001 18 18 29 19.1 289
## 24 2002 18 18 32 18.9 291
## 25 2003 18 18 26 18.6 292
## 26 2004 18 18 26 18.5 291
## 27 2005 18 18 20 18.3 230
## 28 2006 18 18 20 18.3 213
## 29 2007 18 18 21 18.3 211
## 30 2008 18 18 21 18.3 211
## 31 2009 18 18 21 18.3 210
## 32 2010 18 18 20 18.3 210
## 33 2011 18 18 21 18.3 211
## 34 2012 18 18 22 18.3 211
## 35 2013 18 18 21 18.3 211
## 36 2014 18 18 21 18.3 210
## 37 2015 18 18 21 18.2 211
## 38 2016 18 18 21 18.4 211
## 39 2017 18 18 21 18.3 217
## 40 2018 18 18 21 18.3 217
## 41 2019 18 18 21 18.3 217
## 42 2020 18 18 21 18.3 216
## 43 2021 18 18 21 18.2 223
## 44 2022 18 18 21 18.3 225
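The two era tables can also be produced from a single pipeline by labelling each row with its era instead of filtering twice. This is a sketch under the assumption that 1978 is the last Amateur Draft year, as in the headings above; `toy_draft` and the `era` column are hypothetical.

```r
library(dplyr)

# Toy stand-in for nhl_draft; ages are invented for illustration
toy_draft <- tibble(
  year = c("1963", "1978", "1979", "2022"),
  age  = c(16, 20, 19, 18)
)

era_summary <- toy_draft |>
  mutate(
    year = as.integer(year),
    # Label each pick by the draft format in force that year
    era  = if_else(year <= 1978L, "Amateur Draft", "Entry Draft")
  ) |>
  group_by(era) |>
  summarise(
    median_age = median(age, na.rm = TRUE),
    n          = n()
  )

print(era_summary)
```

Keeping both eras in one table makes side-by-side comparisons simpler, and adding `year` to `group_by()` recovers the per-year breakdown.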
After looking at all of this data, there are several questions that we’ll look into further next.
5 Questions from the Data: