Week 3 Data Dive

Loading in the tidyverse and data

# Loading tidyverse 

library(tidyverse)

# Loading in data

nhl_draft <- read_csv("nhldraft.csv")

In this Data Dive we look at the probabilities behind different groupings of the data. Each grouping I chose comes with summary statistics for the other variables, which gives us a view of which groups are more or less likely to appear when a row is picked at random.

In the code below, I grouped the data by nationality and added columns that summarize each group: the number of rows per country and the means for age, games played, goals, assists, points, plus_minus (a player’s goal differential while on the ice), and penalty minutes. Reading the means alongside the counts gives us an idea of why a mean is high or low for a given group. I also divided the number of rows for each nationality by the total number of rows in the data to get the probability of that group being picked at random. Using the probability variable I then created an “is_anomaly” variable that labels the groups with the lowest probability as an “Anomaly”.

NOTE: For the calculations of the means, all of the “NA”s were removed and the means were rounded to 2 decimals.
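
One side effect of removing the NAs is worth keeping in mind when reading the tables: if a statistic is missing for every player in a group, nothing is left to average and the mean comes back as NaN. A minimal base R sketch, using made-up vectors (`all_missing` and `partly_missing` are illustrative names, not columns from the data):

```r
# A group whose stat is missing for every player has nothing left to average
# once the NAs are removed, so mean() returns NaN instead of a number.
all_missing <- c(NA_real_, NA_real_)
mean(all_missing, na.rm = TRUE)

# With at least one observed value, the mean uses only the observed values.
partly_missing <- c(24, NA_real_)
mean(partly_missing, na.rm = TRUE)
```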

To start, let’s look at players and their stats by nationality.

nationality_summary <- nhl_draft |> 
  group_by(nationality) |> 
  summarise(count = n(),
            avg_age = round(mean(age, na.rm = TRUE), 2),
            avg_games_played = round(mean(games_played, na.rm = TRUE), 2),
            avg_goals = round(mean(goals, na.rm = TRUE), 2),
            avg_assists = round(mean(assists, na.rm = TRUE), 2),
            avg_points = round(mean(points, na.rm = TRUE), 2),
            avg_plus_minus = round(mean(plus_minus, na.rm = TRUE), 2),
            avg_penalty_minutes = round(mean(penalties_minutes, na.rm = TRUE), 2)
            ) |> 
  mutate(prob = count/NROW(nhl_draft)) |> 
  relocate(prob, .after = count) |> 
  mutate(is_anomaly = case_when(
    prob == min(prob) ~ "Anomaly"
  )) |> 
  relocate(is_anomaly, .after = prob)

nationality_summary |> 
  print(n = 40)

## # A tibble: 47 × 11
##    nationality count      prob is_anomaly avg_age avg_games_played avg_goals
##    <chr>       <int>     <dbl> <chr>        <dbl>            <dbl>     <dbl>
##  1 AT             19 0.00155   <NA>          19.5            322.       88.6
##  2 AU              1 0.0000816 Anomaly       18               24         2  
##  3 BE              2 0.000163  <NA>          18                2         0  
##  4 BN              1 0.0000816 Anomaly       19              951        55  
##  5 BR              2 0.000163  <NA>          18              546.       18  
##  6 BS              2 0.000163  <NA>          18               31         0  
##  7 BY             32 0.00261   <NA>          18.6            223.       30.1
##  8 CA           6498 0.530     <NA>          18.6            319.       53.7
##  9 CH             73 0.00596   <NA>          18.9            203.       30  
## 10 CN              1 0.0000816 Anomaly       18              NaN       NaN  
## 11 CZ            479 0.0391    <NA>          19.1            314.       52.8
## 12 DE             81 0.00661   <NA>          18.4            319.       54.7
## 13 DK             27 0.00220   <NA>          18.4            335.       53.9
## 14 EE              2 0.000163  <NA>          19              491        63  
## 15 FI            497 0.0406    <NA>          19.1            233.       37.2
## 16 FR              9 0.000735  <NA>          20              188.       50.3
## 17 GB             22 0.00180   <NA>          18.6            296.       61.4
## 18 HT              1 0.0000816 Anomaly       19               89        21  
## 19 HU              3 0.000245  <NA>          18              NaN       NaN  
## 20 IT              2 0.000163  <NA>          18              276.       10  
## 21 JM              1 0.0000816 Anomaly       20              NaN       NaN  
## 22 JP              3 0.000245  <NA>          19.5             18.5       0.5
## 23 KR              2 0.000163  <NA>          18              478.       53.5
## 24 KZ             24 0.00196   <NA>          18.5            244.       27.6
## 25 LT              3 0.000245  <NA>          18.3            723        85  
## 26 LV             39 0.00318   <NA>          19.9            214.       23.0
## 27 ME              1 0.0000816 Anomaly       18              NaN       NaN  
## 28 NG              2 0.000163  <NA>          18.5             38         2  
## 29 NL              1 0.0000816 Anomaly       18              202        46  
## 30 NO             23 0.00188   <NA>          18.6             88.6       6.8
## 31 PL              8 0.000653  <NA>          18.8            463        82.8
## 32 PY              1 0.0000816 Anomaly       20              834       222  
## 33 RU            724 0.0591    <NA>          19.1            266.       53.3
## 34 SE            800 0.0653    <NA>          18.8            312.       50.0
## 35 SI              9 0.000735  <NA>          18.1            628       184. 
## 36 SK            166 0.0136    <NA>          18.7            289.       58.2
## 37 SU              2 0.000163  <NA>          27.5            281        18  
## 38 TH              1 0.0000816 Anomaly       18              NaN       NaN  
## 39 TW              1 0.0000816 Anomaly       20              994        51  
## 40 TZ              1 0.0000816 Anomaly       18               52         6  
## # ℹ 7 more rows
## # ℹ 4 more variables: avg_assists <dbl>, avg_points <dbl>,
## #   avg_plus_minus <dbl>, avg_penalty_minutes <dbl>

Looking at the data, we can see that the count for each nationality ranges from 1 to 6,498. There are also players who are listed with no country. To understand the probabilities for each country, we need to look at why the draft came to include international players.

Back in 1979, the rules of the draft were changed to allow players who had previously played professional hockey to be drafted. At that time there were two major leagues: the NHL and the World Hockey Association (WHA). The WHA dissolved that year when the NHL agreed to absorb four of its teams (the Edmonton Oilers, New England Whalers, Quebec Nordiques, and Winnipeg Jets); the Cincinnati Stingers and Houston Aeros were not part of the merger. A lot of international players had the opportunity to play during that time, but most of the teams and players came from Canada, which helps explain why Canada has so many players in the data. As the years went on, hockey grew in popularity and the NHL became well known around the world. The NHL now draws players from many countries, but most countries contribute only a few because of where, and in what climate, hockey is played. This is why Canada and the United States are the two largest country groups: hockey is popular in both countries during the fall and winter.

A count of 1 means that only one player from that country was drafted, so that group’s means are simply that one player’s stats (or NaN when his stats are missing). Canada has the largest sample size of any country, and therefore both the highest probability of any group and the most stable means. Players from Canada and the United States make up most of the pool in any given draft.
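
The “is_anomaly” flag only marks the groups tied for the lowest probability, but a country with 2 or 9 draftees has means that are nearly as unstable. One possible way to mark every low-count group is sketched below. The counts are copied from four rows of the table above; the cutoff `min_count` and the column name `small_sample` are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Four rows copied from the nationality table above
nationality_counts <- tibble(
  nationality = c("CA", "SE", "FR", "AU"),
  count       = c(6498, 800, 9, 1)
)

# Assumed cutoff: groups below this size are treated as too small to trust
min_count <- 30

flagged <- nationality_counts |>
  mutate(small_sample = count < min_count)

flagged
```

With this cutoff, FR and AU are flagged while CA and SE are not; a different cutoff would move the line, so the value should be chosen with the question being asked in mind.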

One interesting thing to note is that a former country, the Soviet Union (SU), still appears in this data, which traces back to the 1972 Summit Series between Team Canada and the Soviets. WHA players were not permitted to play for Team Canada in that series.

Next, let’s look at grouping by NHL team.

team_summary <- nhl_draft |> 
  group_by(team) |> 
  summarise(count = n(),
            avg_age = round(mean(age, na.rm = TRUE), 2),
            avg_games_played = round(mean(games_played, na.rm = TRUE), 2),
            avg_goals = round(mean(goals, na.rm = TRUE), 2),
            avg_assists = round(mean(assists, na.rm = TRUE), 2),
            avg_points = round(mean(points, na.rm = TRUE), 2),
            avg_plus_minus = round(mean(plus_minus, na.rm = TRUE), 2),
            avg_penalty_minutes = round(mean(penalties_minutes, na.rm = TRUE), 2)
  ) |> 
  mutate(prob = count/NROW(nhl_draft)) |> 
  relocate(prob, .after = count) |> 
  mutate(is_anomaly = case_when(
    prob == min(prob) ~ "Anomaly"
  )) |> 
  relocate(is_anomaly, .after = prob)

team_summary |> 
  print(n = 40)

## # A tibble: 44 × 11
##    team  count    prob is_anomaly avg_age avg_games_played avg_goals avg_assists
##    <chr> <int>   <dbl> <chr>        <dbl>            <dbl>     <dbl>       <dbl>
##  1 Anah…   224 1.83e-2 <NA>          18.8             290.      42.2        73.8
##  2 Ariz…    74 6.04e-3 <NA>          18.3             120.      22.0        30.8
##  3 Atla…    78 6.37e-3 <NA>          19.7             332.      63.8       109. 
##  4 Atla…   107 8.73e-3 <NA>          18.5             276.      43.2        62.8
##  5 Bost…   474 3.87e-2 <NA>          18.8             316.      54.8        95.3
##  6 Buff…   496 4.05e-2 <NA>          18.6             359.      56.9        93.4
##  7 Calg…   397 3.24e-2 <NA>          18.7             300.      52.7        84.5
##  8 Cali…    51 4.16e-3 <NA>          19.8             217.      50.1        73.3
##  9 Caro…   202 1.65e-2 <NA>          18.4             234.      36.6        55.3
## 10 Chic…   545 4.45e-2 <NA>          18.5             301.      51.2        86.1
## 11 Clev…     9 7.35e-4 <NA>          19.7             160.      24.8        34.4
## 12 Colo…   221 1.80e-2 <NA>          18.4             255.      40.2        70.0
## 13 Colo…    45 3.67e-3 <NA>          19.2             218.      34.1        76.1
## 14 Colu…   183 1.49e-2 <NA>          18.5             238.      33.6        55.4
## 15 Dall…   234 1.91e-2 <NA>          18.3             290.      45.0        69.4
## 16 Detr…   534 4.36e-2 <NA>          18.6             358.      67.0       106. 
## 17 Edmo…   405 3.31e-2 <NA>          18.7             307.      56.0        92.7
## 18 Flor…   252 2.06e-2 <NA>          18.4             281.      37.5        66.1
## 19 Hart…   178 1.45e-2 <NA>          18.8             400.      65.8       109. 
## 20 Kans…    26 2.12e-3 <NA>          19.6             285.      72.8        92.6
## 21 Los …   475 3.88e-2 <NA>          18.9             280.      46.9        80.8
## 22 Minn…   261 2.13e-2 <NA>          19.4             271.      40.9        73.5
## 23 Minn…   165 1.35e-2 <NA>          18.6             270.      38.8        67.7
## 24 Mont…   627 5.12e-2 <NA>          18.8             360.      57.4       102. 
## 25 Nash…   205 1.67e-2 <NA>          18.6             308.      43.5        79.4
## 26 New …   386 3.15e-2 <NA>          18.5             306.      46.2        75.7
## 27 New …   496 4.05e-2 <NA>          18.7             347.      56.2        96.0
## 28 New …   551 4.50e-2 <NA>          18.8             334.      55.8        96.3
## 29 Oakl…    20 1.63e-3 <NA>          19.8             261       30.3        80.7
## 30 Otta…   250 2.04e-2 <NA>          18.6             313.      54.6        87.1
## 31 Phil…   509 4.16e-2 <NA>          18.8             283.      50.9        85.2
## 32 Phoe…   140 1.14e-2 <NA>          18.4             247.      34.8        64.4
## 33 Pitt…   463 3.78e-2 <NA>          18.7             285.      51.8        85.0
## 34 Queb…   182 1.49e-2 <NA>          19               362.      64.5       107. 
## 35 San …   259 2.11e-2 <NA>          18.6             334.      51.5        82.8
## 36 Seat…    18 1.47e-3 <NA>          18.2              10        3           6  
## 37 St. …   498 4.07e-2 <NA>          18.9             299.      46.4        79.3
## 38 Tamp…   269 2.20e-2 <NA>          18.6             236.      39.4        67.2
## 39 Toro…   509 4.16e-2 <NA>          18.7             310.      51.7        83.3
## 40 Vanc…   452 3.69e-2 <NA>          18.7             307.      56.0        87.0
## # ℹ 4 more rows
## # ℹ 3 more variables: avg_points <dbl>, avg_plus_minus <dbl>,
## #   avg_penalty_minutes <dbl>

We can see that the average draft age is between 18 and 19 for most teams, which makes sense since players can come straight from high school or junior hockey, or play one year in college, and then enter the draft. Different countries also differ on whether an 18-year-old is considered an adult. Teams find this age useful because it gives players time to adapt and develop before their prime years.

It is also interesting that we can see all of the teams that were added or removed over the history of the NHL up to the present. The Cleveland Barons and California Golden Seals played during the 1960s and 1970s, while the Seattle Kraken and Vegas Golden Knights are expansion teams that were founded within the last decade.

The low counts for these teams reflect their short histories, so their means rest on fewer players and are more easily skewed than those of teams that have been in the league for a long time.
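
Each grouping in this report runs the same pipeline with only the grouping column changed, so the steps could be wrapped in a helper. The sketch below is one way to do that, not the code used for the tables here: `summarise_group()`, its `stat_cols` argument, and the toy data are all hypothetical, and the probability is computed as `count / sum(count)`, which equals count divided by the total number of rows because every row falls into exactly one group.

```r
library(dplyr)

# Hypothetical helper (not part of the original report): wraps the repeated
# group -> summarise -> probability -> anomaly steps in one function.
summarise_group <- function(data, group_col, stat_cols) {
  data |>
    group_by({{ group_col }}) |>
    summarise(
      count = n(),
      # Mean of each requested column, NAs removed, rounded to 2 decimals
      across(all_of(stat_cols),
             ~ round(mean(.x, na.rm = TRUE), 2),
             .names = "avg_{.col}")
    ) |>
    mutate(
      prob = count / sum(count),
      # Label the group(s) with the lowest probability, NA otherwise
      is_anomaly = if_else(prob == min(prob), "Anomaly", NA_character_)
    ) |>
    relocate(prob, is_anomaly, .after = count)
}

# Toy rows standing in for nhl_draft (values are made up)
toy_draft <- tibble(
  team   = c("A", "A", "A", "B"),
  goals  = c(10, 20, NA, 5),
  points = c(30, 40, 50, 8)
)

toy_team_summary <- summarise_group(toy_draft, team, c("goals", "points"))
toy_team_summary
```

Under these assumptions, the same call with `nationality` or `overall_pick` as the grouping column would produce the other two tables.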

Finally, let’s look at grouping by overall draft pick.

overall_pick_summary <- nhl_draft |> 
  group_by(overall_pick) |> 
  summarise(count = n(),
            avg_age = round(mean(age, na.rm = TRUE), 2),
            avg_games_played = round(mean(games_played, na.rm = TRUE), 2),
            avg_goals = round(mean(goals, na.rm = TRUE), 2),
            avg_assists = round(mean(assists, na.rm = TRUE), 2),
            avg_points = round(mean(points, na.rm = TRUE), 2),
            avg_plus_minus = round(mean(plus_minus, na.rm = TRUE), 2),
            avg_penalty_minutes = round(mean(penalties_minutes, na.rm = TRUE), 2)
  ) |> 
  mutate(prob = count/NROW(nhl_draft)) |> 
  relocate(prob, .after = count) |> 
  mutate(is_anomaly = case_when(
    prob == min(prob) ~ "Anomaly"
  )) |> 
  relocate(is_anomaly, .after = prob)


overall_pick_summary |> 
  print(n = 40)

## # A tibble: 293 × 11
##    overall_pick count    prob is_anomaly avg_age avg_games_played avg_goals
##           <dbl> <int>   <dbl> <chr>        <dbl>            <dbl>     <dbl>
##  1            1    60 0.00490 <NA>          18.3             809.     250. 
##  2            2    60 0.00490 <NA>          18.3             789.     226. 
##  3            3    60 0.00490 <NA>          18.4             708.     153. 
##  4            4    60 0.00490 <NA>          18.4             632.     152. 
##  5            5    60 0.00490 <NA>          18.4             621.     113. 
##  6            6    60 0.00490 <NA>          18.3             550.     123. 
##  7            7    60 0.00490 <NA>          18.4             597.     124. 
##  8            8    60 0.00490 <NA>          18.3             485.      94.8
##  9            9    60 0.00490 <NA>          18.4             528.      99.2
## 10           10    60 0.00490 <NA>          18.4             450.      82.0
## 11           11    59 0.00482 <NA>          18.4             563.     109. 
## 12           12    59 0.00482 <NA>          18.4             428.      78.0
## 13           13    59 0.00482 <NA>          18.4             486.      86.0
## 14           14    59 0.00482 <NA>          18.3             528.      84.5
## 15           15    59 0.00482 <NA>          18.4             397.      96.0
## 16           16    59 0.00482 <NA>          18.6             394.      72.3
## 17           17    59 0.00482 <NA>          18.5             471.      82.6
## 18           18    59 0.00482 <NA>          18.4             375.      53.5
## 19           19    58 0.00473 <NA>          18.4             362.      64.2
## 20           20    58 0.00473 <NA>          18.4             393.      64.9
## 21           21    58 0.00473 <NA>          18.4             385.      62.2
## 22           22    57 0.00465 <NA>          18.4             393.      84.1
## 23           23    57 0.00465 <NA>          18.3             377.      55.3
## 24           24    57 0.00465 <NA>          18.5             357.      62.6
## 25           25    54 0.00441 <NA>          18.5             336.      51.3
## 26           26    54 0.00441 <NA>          18.3             338.      65.6
## 27           27    54 0.00441 <NA>          18.5             375.      50.7
## 28           28    54 0.00441 <NA>          18.4             312.      54.3
## 29           29    54 0.00441 <NA>          18.4             268.      44.4
## 30           30    54 0.00441 <NA>          18.5             295.      43.0
## 31           31    54 0.00441 <NA>          18.5             160.      17.2
## 32           32    54 0.00441 <NA>          18.4             273.      40.4
## 33           33    54 0.00441 <NA>          18.4             389.      70.2
## 34           34    54 0.00441 <NA>          18.5             206.      21.9
## 35           35    54 0.00441 <NA>          18.5             277.      44.7
## 36           36    54 0.00441 <NA>          18.6             324.      48.4
## 37           37    54 0.00441 <NA>          18.4             234.      28.1
## 38           38    54 0.00441 <NA>          18.7             311.      32.8
## 39           39    54 0.00441 <NA>          18.5             270.      40.7
## 40           40    54 0.00441 <NA>          18.6             341.      53.6
## # ℹ 253 more rows
## # ℹ 4 more variables: avg_assists <dbl>, avg_points <dbl>,
## #   avg_plus_minus <dbl>, avg_penalty_minutes <dbl>

Overall pick does not by itself determine how a player performs, but we can get a glimpse of how each pick position has done over time and what impact those players have had in the league.

Now let’s sort the draft picks by average points, which we did not do for the other tables.

overall_pick_summary |> 
  arrange(desc(avg_points)) |> 
  print(n = 40)

## # A tibble: 293 × 11
##    overall_pick count     prob is_anomaly avg_age avg_games_played avg_goals
##           <dbl> <int>    <dbl> <chr>        <dbl>            <dbl>     <dbl>
##  1            1    60 0.00490  <NA>          18.3             809.     250. 
##  2            2    60 0.00490  <NA>          18.3             789.     226. 
##  3            3    60 0.00490  <NA>          18.4             708.     153. 
##  4            4    60 0.00490  <NA>          18.4             632.     152. 
##  5            6    60 0.00490  <NA>          18.3             550.     123. 
##  6            7    60 0.00490  <NA>          18.4             597.     124. 
##  7            5    60 0.00490  <NA>          18.4             621.     113. 
##  8          264    10 0.000816 <NA>          18.5             491      114  
##  9           11    59 0.00482  <NA>          18.4             563.     109. 
## 10            9    60 0.00490  <NA>          18.4             528.      99.2
## 11          134    50 0.00408  <NA>          18.6             451.      87.7
## 12           15    59 0.00482  <NA>          18.4             397.      96.0
## 13            8    60 0.00490  <NA>          18.3             485.      94.8
## 14           14    59 0.00482  <NA>          18.3             528.      84.5
## 15           17    59 0.00482  <NA>          18.5             471.      82.6
## 16           13    59 0.00482  <NA>          18.4             486.      86.0
## 17          171    47 0.00384  <NA>          18.8             310.      91.4
## 18           22    57 0.00465  <NA>          18.4             393.      84.1
## 19           12    59 0.00482  <NA>          18.4             428.      78.0
## 20          210    46 0.00376  <NA>          18.7             280.      76  
## 21          231    25 0.00204  <NA>          20.2             371       81.3
## 22          120    51 0.00416  <NA>          18.7             326.      84.5
## 23          214    33 0.00269  <NA>          19.8             461.      62  
## 24           10    60 0.00490  <NA>          18.4             450.      82.0
## 25          288     5 0.000408 <NA>          18.8             462      104  
## 26          183    46 0.00376  <NA>          18.6             448.      77.9
## 27           71    54 0.00441  <NA>          18.6             393.      69.6
## 28          257    11 0.000898 <NA>          20.6             357       82.3
## 29          156    48 0.00392  <NA>          18.7             436.      64.2
## 30           20    58 0.00473  <NA>          18.4             393.      64.9
## 31           16    59 0.00482  <NA>          18.6             394.      72.3
## 32           33    54 0.00441  <NA>          18.4             389.      70.2
## 33          250    19 0.00155  <NA>          19.7             428       34.2
## 34           21    58 0.00473  <NA>          18.4             385.      62.2
## 35           69    54 0.00441  <NA>          18.4             315.      70.4
## 36          166    48 0.00392  <NA>          18.6             370.      67.9
## 37          133    50 0.00408  <NA>          19.6             318.      59.3
## 38           19    58 0.00473  <NA>          18.4             362.      64.2
## 39           26    54 0.00441  <NA>          18.3             338.      65.6
## 40          121    51 0.00416  <NA>          18.6             302.      73.3
## # ℹ 253 more rows
## # ℹ 4 more variables: avg_assists <dbl>, avg_points <dbl>,
## #   avg_plus_minus <dbl>, avg_penalty_minutes <dbl>

Sorting the data above by average points, draft pick 264 pops up as 8th in the ranking! With only 10 players at that pick, a few high scorers can be enough to pull the mean well above what a late pick usually produces. We therefore need to give more weight to picks with many players than to picks with far less data.
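
One common way to give low-count groups less weight is to shrink each group’s mean toward the overall mean, with the pull growing stronger as the count gets smaller. This is a technique swapped in for illustration, not something done in the analysis above. The sketch uses made-up pick summaries (the `avg_points` values are illustrative, not taken from the data), and `k` is an assumed tuning constant that acts like `k` extra players sitting at the overall mean.

```r
library(dplyr)

# Made-up pick summaries (avg_points values are illustrative only)
toy_picks <- tibble(
  overall_pick = c(1, 8, 150, 200, 250, 264),
  count        = c(60, 60, 50, 45, 20, 10),
  avg_points   = c(600, 280, 40, 30, 20, 300)
)

# Assumed prior strength: behaves like k extra players at the overall mean
k <- 20

pick_ranked <- toy_picks |>
  mutate(
    # Overall mean across all players, weighting each pick by its count
    overall_mean  = weighted.mean(avg_points, count),
    # Shrunken mean: small counts are pulled harder toward overall_mean
    shrunk_points = (count * avg_points + k * overall_mean) / (count + k)
  ) |>
  arrange(desc(shrunk_points))

pick_ranked
```

In this toy data, pick 264 ranks second on the raw `avg_points` but drops below pick 8 once its 10-player sample is shrunk, which is the kind of adjustment the paragraph above calls for.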

This data is also interesting because it shows how players at each draft position lived up to the value of their pick. It suggests that picks 1 through 4 tend to live up to the expectations placed on a top pick.

After looking at these groups, there is a lot to unpack about why some groups are smaller than others. The probabilities of these groups depend on the year the players were drafted as well as the culture and decade they played in. A natural next step is a deeper dive into each decade and how its players performed.
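
A decade-level grouping could follow the same pattern as the tables above. The sketch below assumes the data has a draft-year column, here called `year`; that column name and the toy rows are assumptions, since the draft year does not appear in the output shown in this report.

```r
library(dplyr)

# Toy rows with an assumed `year` column (values are made up)
toy_draft <- tibble(
  year   = c(1979, 1983, 1988, 2004, 2021),
  points = c(500, 300, 100, 120, 10)
)

decade_summary <- toy_draft |>
  # Round each draft year down to the start of its decade, e.g. 1983 -> 1980
  mutate(decade = floor(year / 10) * 10) |>
  group_by(decade) |>
  summarise(count = n(),
            avg_points = round(mean(points, na.rm = TRUE), 2)) |>
  mutate(prob = count / sum(count))

decade_summary
```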

Connor Bryson

9/9/2023