Week 2 Data Dive by Connor Bryson

Loading in the tidyverse and data

# Loading tidyverse 

library(tidyverse)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

Preface: The data discussed in this project is the NHL Draft Hockey Player Data from 1963 to 2022, provided by Kaggle user “Matt OP” (https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022). I do not own this data, and this project is purely for academic purposes. I hope you enjoy the analysis and gain insight into how NHL draft and player data can inform who to pick in next year’s draft.

This project is broken down into 3 parts: 1. How the draft has evolved over time. 2. Which countries tend to produce the best players, and why. 3. How this data can help teams choose draft picks in the future.

Each part will have its own data analysis, with the intention of gaining insight into NHL draft picks and each player’s impact in the NHL.

To start, let’s get a summary of the data to get a better idea of what lies within.

#Looking at data

glimpse(nhl_draft)

## Rows: 12,250
## Columns: 23
## $ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ year                  <chr> "2022", "2022", "2022", "2022", "2022", "2022", …
## $ overall_pick          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ team                  <chr> "Montreal Canadiens", "New Jersey Devils", "Ariz…
## $ player                <chr> "Juraj Slafkovsky", "Simon Nemec", "Logan Cooley…
## $ nationality           <chr> "SK", "SK", "US", "CA", "SE", "CZ", "CA", "AT", …
## $ position              <chr> "LW", "D", "C", "C", "LW", "D", "D", "C", "C", "…
## $ age                   <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ to_year               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ amateur_team          <chr> "TPS (Finland)", "HK Nitra (Slovakia)", "USA U-1…
## $ games_played          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ assists               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ points                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ plus_minus            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ penalties_minutes     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_games_played   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_wins           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_losses         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goalie_ties_overtime  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ save_percentage       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ goals_against_average <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ point_shares          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

The dataset has 12,250 rows and 23 columns. There are 6 categorical (character) columns and 17 numeric columns.
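One way to double-check a type tally like this is to count columns by storage type. The sketch below runs on a small made-up tibble (the name `toy` and its values are hypothetical stand-ins, not the real data):

```r
library(dplyr)

# Hypothetical stand-in for nhl_draft: two character and three double columns
toy <- tibble(
  year = c("2022", "2021"),
  team = c("Montreal Canadiens", "Buffalo Sabres"),
  id = c(1, 2),
  overall_pick = c(1, 1),
  age = c(18, 18)
)

# Tally how many columns are stored as each type
type_counts <- table(vapply(toy, typeof, character(1)))
type_counts
```

Running the same two lines on `nhl_draft` should reproduce the 6 character and 17 double columns reported by `glimpse()`.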

Let’s look at a summary of all of the numeric columns before going to the categorical:

numeric_cols <- nhl_draft |> select(where(is.double))

summary(numeric_cols)

##        id         overall_pick        age           to_year    
##  Min.   :    1   Min.   :  1.0   Min.   :16.00   Min.   :1968  
##  1st Qu.: 3063   1st Qu.: 55.0   1st Qu.:18.00   1st Qu.:1993  
##  Median : 6126   Median :112.0   Median :18.00   Median :2006  
##  Mean   : 6126   Mean   :116.6   Mean   :18.68   Mean   :2004  
##  3rd Qu.: 9188   3rd Qu.:174.0   3rd Qu.:19.00   3rd Qu.:2018  
##  Max.   :12250   Max.   :293.0   Max.   :37.00   Max.   :2022  
##                                  NA's   :3959    NA's   :7004  
##   games_played        goals           assists            points      
##  Min.   :   1.0   Min.   :  0.00   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:  25.0   1st Qu.:  0.00   1st Qu.:   2.00   1st Qu.:   3.0  
##  Median : 151.5   Median :  9.00   Median :  18.00   Median :  28.0  
##  Mean   : 305.4   Mean   : 50.86   Mean   :  85.04   Mean   : 135.9  
##  3rd Qu.: 521.0   3rd Qu.: 56.00   3rd Qu.: 107.00   3rd Qu.: 166.0  
##  Max.   :1779.0   Max.   :780.00   Max.   :1249.00   Max.   :1921.0  
##  NA's   :7004     NA's   :7004     NA's   :7004      NA's   :7004    
##    plus_minus       penalties_minutes goalie_games_played  goalie_wins    
##  Min.   :-257.000   Min.   :   0.0    Min.   :   1.0      Min.   :  0.00  
##  1st Qu.: -17.000   1st Qu.:   8.0    1st Qu.:   7.0      1st Qu.:  2.00  
##  Median :  -2.000   Median :  65.0    Median :  74.0      Median : 23.50  
##  Mean   :  -2.283   Mean   : 245.3    Mean   : 181.3      Mean   : 77.97  
##  3rd Qu.:   1.000   3rd Qu.: 311.0    3rd Qu.: 294.0      3rd Qu.:117.00  
##  Max.   : 722.000   Max.   :3971.0    Max.   :1266.0      Max.   :691.00  
##  NA's   :7016       NA's   :7004      NA's   :11760       NA's   :11762   
##  goalie_losses   goalie_ties_overtime save_percentage goals_against_average
##  Min.   :  0.0   Min.   :  0.00       Min.   :0.500   Min.   : 0.000       
##  1st Qu.:  3.0   1st Qu.:  1.00       1st Qu.:0.876   1st Qu.: 2.710       
##  Median : 32.0   Median :  9.00       Median :0.895   Median : 3.100       
##  Mean   : 69.2   Mean   : 21.32       Mean   :0.887   Mean   : 3.376       
##  3rd Qu.:112.2   3rd Qu.: 33.00       3rd Qu.:0.908   3rd Qu.: 3.697       
##  Max.   :397.0   Max.   :154.00       Max.   :1.000   Max.   :27.270       
##  NA's   :11762   NA's   :11762        NA's   :11762   NA's   :11760        
##   point_shares   
##  Min.   :-10.10  
##  1st Qu.:  0.10  
##  Median :  3.30  
##  Mean   : 17.36  
##  3rd Qu.: 22.40  
##  Max.   :242.70  
##  NA's   :7004
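The summary above reports 7,004 NA values for games_played out of 12,250 rows, which leaves 5,246 picks (about 43%) with a recorded games_played value. A minimal sketch of how such NA counts could be gathered in one step, using a hypothetical toy tibble in place of the real data:

```r
library(dplyr)

# Hypothetical toy data: two skaters with games, two picks without
toy <- tibble(
  player = c("A", "B", "C", "D"),
  games_played = c(120, NA, 3, NA),
  goalie_wins = c(NA, NA, NA, NA_real_)
)

# Count missing values in every numeric column
na_counts <- toy |>
  summarise(across(where(is.numeric), ~ sum(is.na(.x))))
na_counts

# Share of picks with a games_played value, using the figures printed above
share_with_games <- (12250 - 7004) / 12250
round(share_with_games, 3)
```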

Now let’s look at the 6 categorical variables closer:

Year:

# Unique Years

unique(nhl_draft$year)

##  [1] "2022" "2021" "2020" "2019" "2018" "2017" "2016" "2015" "2014" "2013"
## [11] "2012" "2011" "2010" "2009" "2008" "2007" "2006" "2005" "2004" "2003"
## [21] "2002" "2001" "2000" "1999" "1998" "1997" "1996" "1995" "1994" "1993"
## [31] "1992" "1991" "1990" "1989" "1988" "1987" "1986" "1985" "1984" "1983"
## [41] "1982" "1981" "1980" "1979" "1978" "1977" "1976" "1975" "1974" "1973"
## [51] "1972" "1971" "1970" "1969" "1968" "1967" "1966" "1965" "1964" "1963"

This data set covers every draft year from 1963 to 2022, which will allow us to break the data down by decade and by type of draft later on. Note that year is stored as a character column, not a number.
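Because year is read in as character, a decade or draft-type breakdown needs a numeric conversion first. The sketch below shows one possible approach on made-up rows; the column names `year_num`, `decade`, and `draft_type` are illustrative assumptions, and the 1978 cutoff follows the Amateur Draft range (1963-1978) used later in this analysis:

```r
library(dplyr)

# Hypothetical sample of draft years, stored as character like the real column
toy <- tibble(year = c("1963", "1979", "1980", "2005", "2022"))

toy_decades <- toy |>
  mutate(
    year_num = as.integer(year),
    # Integer division drops the last digit: 1979 becomes 1970
    decade = (year_num %/% 10) * 10,
    draft_type = if_else(year_num <= 1978, "Amateur Draft", "Entry Draft")
  )
toy_decades
```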

# Number of Records per year

ggplot(nhl_draft)+
  # geom_bar() counts rows per category by default, so no stat argument is needed
  geom_bar(mapping = aes(x = year, fill = year))+
  labs(x = "Year", y= "Observations", fill = "Year",
       title = "Number of Observations per Year",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".
                https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022")+
  guides(x = guide_axis(angle = 75))+
  theme(plot.caption = element_text(hjust = 1.2))

Looking at the graph above, we can see that the number of picks per year grew sharply after the 1960s. The earliest drafts were small, with only 11 to 24 picks per year from 1963 to 1968. The count peaked at 293 picks in 2000, and every draft since 1980 has included at least 210 picks.
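The same counts can be read as a table instead of a chart. A small sketch on hypothetical rows (the `picks` column name is an assumption chosen for illustration):

```r
library(dplyr)

# Hypothetical rows: one row per drafted player
toy <- tibble(year = c("1963", "1963", "2000", "2000", "2000", "2022"))

# One row per year, largest drafts first
picks_per_year <- toy |>
  count(year, name = "picks") |>
  arrange(desc(picks))
picks_per_year
```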

Team:

# Unique Values
unique(nhl_draft$team)

##  [1] "Montreal Canadiens"      "New Jersey Devils"      
##  [3] "Arizona Coyotes"         "Seattle Kraken"         
##  [5] "Philadelphia Flyers"     "Columbus Blue Jackets"  
##  [7] "Chicago Blackhawks"      "Detroit Red Wings"      
##  [9] "Buffalo Sabres"          "Anaheim Ducks"          
## [11] "Winnipeg Jets"           "Vancouver Canucks"      
## [13] "Nashville Predators"     "Dallas Stars"           
## [15] "Minnesota Wild"          "Washington Capitals"    
## [17] "Pittsburgh Penguins"     "St. Louis Blues"        
## [19] "San Jose Sharks"         "Tampa Bay Lightning"    
## [21] "Edmonton Oilers"         "Toronto Maple Leafs"    
## [23] "Vegas Golden Knights"    "Los Angeles Kings"      
## [25] "Boston Bruins"           "Calgary Flames"         
## [27] "Carolina Hurricanes"     "New York Rangers"       
## [29] "Ottawa Senators"         "New York Islanders"     
## [31] "Florida Panthers"        "Colorado Avalanche"     
## [33] "Phoenix Coyotes"         "Atlanta Thrashers"      
## [35] "Hartford Whalers"        "Quebec Nordiques"       
## [37] "Minnesota North Stars"   "Colorado Rockies"       
## [39] "Atlanta Flames"          "Cleveland Barons"       
## [41] "California Golden Seals" "Kansas City Scouts"     
## [43] "Oakland Seals"           NA

Looking at the unique values of team, we can see 43 distinct team names (plus NA), more than the 32 teams currently in the NHL. This is because franchises have been created, renamed, relocated, and dissolved since the start of the NHL draft; for example, both “Phoenix Coyotes” and “Arizona Coyotes” appear in the list.
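A rough way to separate current team names from former ones is to treat any team that drafted in the most recent year as current. This is only a heuristic sketch on made-up rows: it assumes every active team makes at least one pick in the latest draft, which may not hold if a team trades away all of its picks.

```r
library(dplyr)

# Hypothetical rows using team names that appear in the list above
toy <- tibble(
  year = c("2022", "2022", "1995", "1975", "1975"),
  team = c("Montreal Canadiens", "Arizona Coyotes", "Hartford Whalers",
           "Kansas City Scouts", "Montreal Canadiens")
)

latest_year <- max(as.integer(toy$year))

# Names seen in the latest draft are treated as current (an assumption)
current_names <- toy |>
  filter(as.integer(year) == latest_year) |>
  distinct(team) |>
  pull(team)

# Everything else is a name that no longer drafts
former_names <- setdiff(unique(toy$team), current_names)
former_names
```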

# Counts
ggplot(nhl_draft)+
  geom_bar(mapping = aes(x = team, fill = team))+
  labs(x = "Team", y= "Observations", fill = "Team",
       title = "Number of Observations per NHL Team",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".                               
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022")+
  guides(x = guide_axis(angle = 75))+
  theme(plot.caption = element_text(hjust = 1.85))

The graph above shows that teams with shorter histories tend to have the fewest observations. Based on the number of observations per team, we can get a rough sense of which teams are younger and which have been around for a long time.

Player:

This variable is a categorical variable of a player’s name. Thus there is no particular analysis that I can do based upon a player name. For example, I can’t make any generalizations about players with the first name “Cody” and whether or not that player was successful in making it to the NHL.

Nationality:

Nationality, in my opinion, is one of the most interesting variables we will look at in this analysis. This is because of the sheer number of countries that players come from and how the origins of drafted players have changed over time.

# Unique Countries

unique(nhl_draft$nationality)

##  [1] "SK" "US" "CA" "SE" "CZ" "AT" "RU" "FI" "CH" "DE" "LV" "PL" "BY" "GB" "KZ"
## [16] "NO" "UA" "UZ" "DK" "AU" "TH" "JM" "FR" "SI" "BE" "NL" "CN" "LT" NA   "IT"
## [31] "NG" "EE" "JP" "ME" "HU" "YU" "BS" "BR" "TZ" "BN" "KR" "ZA" "SU" "HT" "TW"
## [46] "PY" "VE"

Looking at the unique countries above, they are formatted as two-letter country codes.

Below are the full names of the countries.

##  [1] "Austria"           "Australia"         "Belgium"          
##  [4] "Brunei Darussalam" "Brazil"            "Bahamas"          
##  [7] "Belarus"           "Canada"            "Switzerland"      
## [10] "China"             "Czech Republic"    "Germany"          
## [13] "Denmark"           "Estonia"           "Finland"          
## [16] "France"            "United Kingdom"    "Haiti"            
## [19] "Hungary"           "Italy"             "Jamaica"          
## [22] "Japan"             "South Korea"       "Kazakhstan"       
## [25] "Lithuania"         "Latvia"            "Montenegro"       
## [28] "Nigeria"           "Netherlands"       "Norway"           
## [31] "Poland"            "Paraguay"          "Russia"           
## [34] "Sweden"            "Slovenia"          "Slovakia"         
## [37] "Soviet Union"      "Thailand"          "Taiwan"           
## [40] "Tanzania"          "Ukraine"           "United States"    
## [43] "Uzbekistan"        "Venezuela"         "Yugoslavia"       
## [46] "South Africa"
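The code that produced this list is not shown, but the names appear to be ordered by country code rather than by name (Austria, AT, comes before Australia, AU). A minimal sketch of one way to build such a list is a named lookup vector; `country_lookup` is a hypothetical name, and only a handful of the 46 codes are filled in here:

```r
# Hypothetical lookup: only a few codes shown; a full version needs all 46
country_lookup <- c(
  CA = "Canada",
  US = "United States",
  SE = "Sweden",
  FI = "Finland",
  SU = "Soviet Union"
)

# Hypothetical sample of the nationality column, including a missing value
codes <- c("SE", "CA", NA, "US", "CA", "SU")

# sort() drops NA by default, leaving the distinct codes in alphabetical order
codes_sorted <- sort(unique(codes))

# Index the lookup by code to get the full names in the same order
country_names <- unname(country_lookup[codes_sorted])
country_names
```

Ordering the names by code keeps them aligned with the alphabetical order that a discrete ggplot2 scale gives to the codes.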

Now let’s look at how many players have been drafted per nationality.

# Count per nationality

ggplot(nhl_draft)+
  geom_bar(mapping = aes(x = nationality, fill = nationality))+
  labs(x = "Country", y= "Observations", fill = "Country",
       title = "Number of Observations per Player Nationality",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".                               
https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022")+
  guides(x = guide_axis(angle = 75))+
  theme(plot.caption = element_text(hjust = 1.21))+
  scale_fill_discrete(labels=countries)

From the graph above, we can see that most drafted players have come from Canada and the United States; together, these 2 countries appear to account for well over 50% of the data. Many other countries have only a handful of drafted players. This is consistent with Canada having the largest presence in the hockey world, at least as measured by draft picks.
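The "more than 50%" reading can be checked numerically rather than by eye. The sketch below computes each nationality's share and a running total on hypothetical rows; the column names `share` and `cumulative_share` are assumptions for illustration, and the toy values are not the real proportions:

```r
library(dplyr)

# Hypothetical sample of the nationality column
toy <- tibble(nationality = c("CA", "CA", "CA", "US", "US", "SE", NA, "FI"))

nationality_share <- toy |>
  count(nationality, sort = TRUE) |>   # most common nationality first
  mutate(
    share = n / sum(n),
    cumulative_share = cumsum(share)
  )
nationality_share
```

On the real data, the cumulative_share value in the second row would give the combined share of the top two countries.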

Position:

Now let’s look at the different positions listed for drafted players.

unique(nhl_draft$position)

##  [1] "LW"    "D"     "C"     "RW"    "G"     "W"     "C/LW"  "C/RW"  "L/RW" 
## [10] NA      "F"     "C/W"   "D/W"   "LW/C"  "C/D"   "RW/C"  "D/LW"  "LW/D" 
## [19] "RW/D"  "D/C"   "D/RW"  "C / R"

As you can see, there are 21 distinct position labels (plus NA), because some players are listed at more than one position going into the draft (for example, “C/LW”). Versatility like this can add to a player’s value in the eyes of teams. A few labels, such as “C / R”, look like formatting variants rather than separate positions.
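Labels with stray spaces can be tidied before counting. This is a small base R sketch on hypothetical values; it only removes whitespace and flags multi-position labels, and it does not decide whether a cleaned label like "C/R" should be merged with "C/RW":

```r
# Hypothetical sample of the position column
positions <- c("C", "C / R", "LW/C", " D", NA)

# Remove all whitespace so "C / R" becomes "C/R"; NA stays NA
positions_clean <- gsub("\\s+", "", positions)

# Flag players listed at more than one position (NA counts as FALSE)
multi_position <- grepl("/", positions_clean, fixed = TRUE)

positions_clean
multi_position
```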

# Counts per position

ggplot(nhl_draft)+
  geom_bar(mapping = aes(x = position, fill = position))+
  labs(x = "Position", y= "Observations", fill = "Position",
       title = "Number of Observations per Position",
       subtitle = "NHL Draft Hockey Player Data (1963 - 2022)",
       caption = "Data provided by Kaggle User \"Matt OP\".
                https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022")+
  guides(x = guide_axis(angle = 75))+
  theme(plot.caption = element_text(hjust = 1.2))

Looking at the graph above, we can see that the positions with the most observations are the standard hockey positions: Center (C), Defenseman (D), Goaltender (G), Right Wing (RW), and Left Wing (LW). This makes sense, as far more players enter the draft listed at one position than at two.

Amateur Team:

cat("Count:", NROW(unique(nhl_draft$amateur_team)))

## Count: 1548

As you can see above, there are 1,548 distinct amateur team values that these players have come from. Later in this analysis we will explore which of these teams have produced the best players and whether there is any correlation between player success and the amateur team they came from.
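One caveat with counting via `unique()` is that a missing value counts as its own entry. Whether amateur_team actually contains NA is not shown here, so the sketch below uses hypothetical values to illustrate the difference:

```r
library(dplyr)

# Hypothetical sample of the amateur_team column with one missing value
amateur_team <- c("TPS (Finland)", "HK Nitra (Slovakia)", NA, "TPS (Finland)")

# unique() keeps NA as a distinct value
count_with_na <- length(unique(amateur_team))

# n_distinct() can skip missing values explicitly
count_without_na <- n_distinct(amateur_team, na.rm = TRUE)

cat("With NA:", count_with_na, "Without NA:", count_without_na)
```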

Now let’s talk about the history of the NHL draft. In 1963, the NHL held its first-ever draft, called the Amateur Draft, which ran through 1978. In 1979, the rules changed to allow players who had previously played professional hockey to be drafted. This was due to the NHL absorbing the newly defunct World Hockey Association, so that those players could still play. In 1980, any player between 18 and 20 years old became eligible to be drafted, and any non-North American player over the age of 20 could also be selected. These changes were the foundation of the draft known today as the NHL Entry Draft.

To illustrate the ages of players in each draft, let’s break the data down by draft.

Amateur Draft (1963-1978):

amateur_draft_era <- nhl_draft |> 
  # year is stored as character, so compare against character values
  filter(year %in% as.character(1963:1978))

unique(amateur_draft_era$team)

##  [1] "Minnesota North Stars"   "Washington Capitals"    
##  [3] "St. Louis Blues"         "Vancouver Canucks"      
##  [5] "Colorado Rockies"        "Philadelphia Flyers"    
##  [7] "Montreal Canadiens"      "Detroit Red Wings"      
##  [9] "Chicago Blackhawks"      "Atlanta Flames"         
## [11] "Buffalo Sabres"          "New York Islanders"     
## [13] "Boston Bruins"           "Toronto Maple Leafs"    
## [15] "Pittsburgh Penguins"     "New York Rangers"       
## [17] "Los Angeles Kings"       "Cleveland Barons"       
## [19] "California Golden Seals" "Kansas City Scouts"     
## [21] "Oakland Seals"           NA

amateur_draft_era |> 
  group_by(year) |> 
  summarise(min_age = min(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            max_age = max(age, na.rm = TRUE),
            avg_age = mean(age, na.rm = TRUE),
            n = n())

## # A tibble: 16 × 6
##    year  min_age median_age max_age avg_age     n
##    <chr>   <dbl>      <dbl>   <dbl>   <dbl> <int>
##  1 1963       16         16      17    16.2    21
##  2 1964       16         17      17    16.6    24
##  3 1965       17         17      17    17      11
##  4 1966       17         17      18    17.2    24
##  5 1967       20         20      21    20.3    18
##  6 1968       19         20      21    19.9    24
##  7 1969       19         20      20    19.7    84
##  8 1970       19         20      20    19.7   115
##  9 1971       19         20      20    19.7   117
## 10 1972       19         20      20    19.7   152
## 11 1973       19         20      24    19.8   168
## 12 1974       17         20      20    19.5   246
## 13 1975       19         20      20    19.7   217
## 14 1976       19         20      20    19.8   135
## 15 1977       19         20      20    19.8   185
## 16 1978       19         20      21    19.8   234

NHL Entry Draft (1979-Present):

nhl_entry_draft <- nhl_draft |> 
  filter(year %in% as.character(1979:2022))

nhl_entry_draft |> 
  group_by(year) |> 
  summarise(min_age = min(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            max_age = max(age, na.rm = TRUE),
            avg_age = mean(age, na.rm = TRUE),
            n = n()) |> 
  print(n = 44)

## # A tibble: 44 × 6
##    year  min_age median_age max_age avg_age     n
##    <chr>   <dbl>      <dbl>   <dbl>   <dbl> <int>
##  1 1979       18         19      20    19.3   126
##  2 1980       18         19      20    18.8   210
##  3 1981       18         18      20    18.5   211
##  4 1982       18         18      30    19.1   252
##  5 1983       18         18      31    18.7   242
##  6 1984       18         18      31    18.6   250
##  7 1985       18         18      24    18.3   252
##  8 1986       18         18      26    18.5   252
##  9 1987       18         18      25    18.8   252
## 10 1988       18         18      25    18.9   252
## 11 1989       18         19      37    19.4   252
## 12 1990       18         19      26    19.0   252
## 13 1991       18         19      27    19.2   264
## 14 1992       18         18      30    19.1   264
## 15 1993       18         18      30    18.9   286
## 16 1994       18         18      28    18.7   286
## 17 1995       18         18      26    18.5   234
## 18 1996       18         18      28    18.7   241
## 19 1997       17         18      26    18.8   246
## 20 1998       18         18      29    19.1   258
## 21 1999       18         18      27    19.0   272
## 22 2000       18         19      31    19.7   293
## 23 2001       18         18      29    19.1   289
## 24 2002       18         18      32    18.9   291
## 25 2003       18         18      26    18.6   292
## 26 2004       18         18      26    18.5   291
## 27 2005       18         18      20    18.3   230
## 28 2006       18         18      20    18.3   213
## 29 2007       18         18      21    18.3   211
## 30 2008       18         18      21    18.3   211
## 31 2009       18         18      21    18.3   210
## 32 2010       18         18      20    18.3   210
## 33 2011       18         18      21    18.3   211
## 34 2012       18         18      22    18.3   211
## 35 2013       18         18      21    18.3   211
## 36 2014       18         18      21    18.3   210
## 37 2015       18         18      21    18.2   211
## 38 2016       18         18      21    18.4   211
## 39 2017       18         18      21    18.3   217
## 40 2018       18         18      21    18.3   217
## 41 2019       18         18      21    18.3   217
## 42 2020       18         18      21    18.3   216
## 43 2021       18         18      21    18.2   223
## 44 2022       18         18      21    18.3   225
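Both eras use the same per-year age summary, so the repeated `summarise()` call could be wrapped in a small helper. The function name `summarise_age_by_year` is a hypothetical choice, and the toy rows below are made up for illustration:

```r
library(dplyr)

# Hypothetical helper: the per-year age summary used for both draft eras
summarise_age_by_year <- function(data) {
  data |>
    group_by(year) |>
    summarise(
      min_age = min(age, na.rm = TRUE),
      median_age = median(age, na.rm = TRUE),
      max_age = max(age, na.rm = TRUE),
      avg_age = mean(age, na.rm = TRUE),
      n = n()
    )
}

# Hypothetical rows, including one pick with a missing age
toy <- tibble(
  year = c("1978", "1978", "1979", "1979", "1979"),
  age = c(19, 20, 18, NA, 20)
)

age_summary <- summarise_age_by_year(toy)
age_summary
```

Note that n counts every pick in a year, including those with a missing age, while the age statistics skip missing values.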

After looking at all of this data, there are several questions that we’ll be looking into further next.

5 Questions from the Data:

How do different decades compare in terms of nationality and why?
Are modern players better than players from a decade or two ago?
Are there any amateur teams that the best players tend to come from?
How has the NHL Draft evolved over time?
Have the best draft picks made the greatest impact in the league?

Connor Bryson

2023-09-04