Introduction

This analysis uses datasets provided by TidyTuesday: three covering NHL players, rosters, and teams, and one covering Canadian births. It addresses the following key questions:

What is the distribution of players by position type? How do player characteristics (height, weight, BMI) vary across positions? Are NHL players born at different times of the year compared to general Canadian birth trends? What are the most common sweater numbers?

Loading the Data

# Load necessary libraries
library(ggplot2)
library(data.table)
library(tidytuesdayR)
library(RColorBrewer)

# Load TidyTuesday data for the week of 2024-01-09
tuesdata <- tidytuesdayR::tt_load('2024-01-09')
## ---- Compiling #TidyTuesday Information for 2024-01-09 ----
## --- There are 4 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 4: "canada_births_1991_2022.csv"
##   2 of 4: "nhl_player_births.csv"
##   3 of 4: "nhl_rosters.csv"
##   4 of 4: "nhl_teams.csv"

# Extract relevant NHL datasets
canada_births_1991_2022 <- tuesdata$canada_births_1991_2022
nhl_player_births <- tuesdata$nhl_player_births
nhl_rosters <- tuesdata$nhl_rosters
nhl_teams <- tuesdata$nhl_teams

Exploring the Data

# Structure and summary of the NHL rosters dataset
head(nhl_rosters)
## # A tibble: 6 × 18
##   team_code   season position_type player_id headshot       first_name last_name
##   <chr>        <dbl> <chr>             <dbl> <chr>          <chr>      <chr>    
## 1 ATL       19992000 forwards        8467867 https://asset… Bryan      Adams    
## 2 ATL       19992000 forwards        8445176 https://asset… Donald     Audette  
## 3 ATL       19992000 forwards        8460014 https://asset… Eric       Bertrand 
## 4 ATL       19992000 forwards        8460510 https://asset… Jason      Botterill
## 5 ATL       19992000 forwards        8459596 https://asset… Andrew     Brunette 
## 6 ATL       19992000 forwards        8445733 https://asset… Kelly      Buchberg…
## # ℹ 11 more variables: sweater_number <dbl>, position_code <chr>,
## #   shoots_catches <chr>, height_in_inches <dbl>, weight_in_pounds <dbl>,
## #   height_in_centimeters <dbl>, weight_in_kilograms <dbl>, birth_date <date>,
## #   birth_city <chr>, birth_country <chr>, birth_state_province <chr>
summary(nhl_rosters)
##   team_code             season         position_type        player_id      
##  Length:54883       Min.   :19171918   Length:54883       Min.   :8444850  
##  Class :character   1st Qu.:19801981   Class :character   1st Qu.:8448170  
##  Mode  :character   Median :19961997   Mode  :character   Median :8456153  
##                     Mean   :19919668                      Mean   :8459076  
##                     3rd Qu.:20112012                      3rd Qu.:8470645  
##                     Max.   :20232024                      Max.   :8484314  
##                                                                            
##    headshot          first_name         last_name         sweater_number
##  Length:54883       Length:54883       Length:54883       Min.   : 1.0  
##  Class :character   Class :character   Class :character   1st Qu.:10.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :20.0  
##                                                           Mean   :24.2  
##                                                           3rd Qu.:31.0  
##                                                           Max.   :99.0  
##                                                           NA's   :149   
##  position_code      shoots_catches     height_in_inches weight_in_pounds
##  Length:54883       Length:54883       Min.   :63.0     Min.   :125.0   
##  Class :character   Class :character   1st Qu.:71.0     1st Qu.:185.0   
##  Mode  :character   Mode  :character   Median :72.0     Median :195.0   
##                                        Mean   :72.4     Mean   :195.4   
##                                        3rd Qu.:74.0     3rd Qu.:207.0   
##                                        Max.   :81.0     Max.   :265.0   
##                                        NA's   :28       NA's   :25      
##  height_in_centimeters weight_in_kilograms   birth_date        
##  Min.   :160.0         Min.   : 57.00      Min.   :1879-07-27  
##  1st Qu.:180.0         1st Qu.: 84.00      1st Qu.:1955-01-03  
##  Median :183.0         Median : 88.00      Median :1970-01-23  
##  Mean   :183.9         Mean   : 88.66      Mean   :1965-12-29  
##  3rd Qu.:188.0         3rd Qu.: 94.00      3rd Qu.:1984-05-01  
##  Max.   :206.0         Max.   :120.00      Max.   :2005-07-17  
##  NA's   :28            NA's   :25                              
##   birth_city        birth_country      birth_state_province
##  Length:54883       Length:54883       Length:54883        
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
## 
nhl_rosters <- as.data.table(nhl_rosters)
colnames(nhl_rosters)
##  [1] "team_code"             "season"                "position_type"        
##  [4] "player_id"             "headshot"              "first_name"           
##  [7] "last_name"             "sweater_number"        "position_code"        
## [10] "shoots_catches"        "height_in_inches"      "weight_in_pounds"     
## [13] "height_in_centimeters" "weight_in_kilograms"   "birth_date"           
## [16] "birth_city"            "birth_country"         "birth_state_province"

Player Distribution by Position Type

# Remove unnecessary column 'headshot'
nhl_rosters[, headshot := NULL]

# Filter for position_type and make a pie chart of the position type distribution
unique(nhl_rosters$position_type)
## [1] "forwards"   "defensemen" "goalies"
sum(is.na(nhl_rosters$position_type))
## [1] 0
nhl_rosters[, .N, by = position_type][order(-N)]
##    position_type     N
##           <char> <int>
## 1:      forwards 33038
## 2:    defensemen 16823
## 3:       goalies  5022
ggplot(nhl_rosters[, .N, by = position_type][order(-N)], aes(x = "", y = N, fill = position_type)) + geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") + #makes it a pie chart
  labs(title = "Proportion of Players by Position Type") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Forwards make up the largest position type in the dataset, goalies the smallest, and defensemen fall in between.
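To put numbers on this ordering, the counts printed above can be converted into shares. This is a minimal sketch: the vector name `position_counts` is introduced here for illustration only, and the counts are copied by hand from the table above rather than read from `nhl_rosters`.

```r
# Counts copied from the position_type table above
position_counts <- c(forwards = 33038, defensemen = 16823, goalies = 5022)

# Share of roster entries per position, as percentages
position_share <- round(100 * prop.table(position_counts), 1)
print(position_share)
```

Forwards come out at roughly 60% of entries, defensemen at about 31%, and goalies at about 9%.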

Height and Proportion Analysis by Position Type

# Count players taller than the mean height (183.9 cm) by position type
sum(is.na(nhl_rosters$height_in_centimeters))
## [1] 28
nhl_rosters[height_in_centimeters > 183.9, .N, by = position_type][order(-N)]
##    position_type     N
##           <char> <int>
## 1:      forwards 13929
## 2:    defensemen 10726
## 3:       goalies  2318
# Proportion of tall people by position type
proportion_tall <- merge(nhl_rosters[height_in_centimeters > 183.9, .N, by = position_type],
                         nhl_rosters[, .N, by = position_type],
                         by = "position_type")[, proportion := N.x / N.y][order(-proportion)]
print(proportion_tall)
##    position_type   N.x   N.y proportion
##           <char> <int> <int>      <num>
## 1:    defensemen 10726 16823  0.6375795
## 2:       goalies  2318  5022  0.4615691
## 3:      forwards 13929 33038  0.4216054
ggplot(proportion_tall, aes(x = position_type, y = N.y, fill = position_type)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  geom_segment(aes(x = position_type, xend = position_type, 
                   y = 0, yend = N.y * proportion), color = "darkblue", linewidth = 1) + 
  geom_text(aes(y = N.y * proportion, 
                label = scales::percent(proportion, accuracy = 0.1)), 
            vjust = -0.5, color = "darkblue") +  # Adding percentage
  labs(title = "Number of Players by Position with Proportion of Tall Players", 
       x = "Position Type", 
       y = "Total Count", 
       caption = "Percentage indicates proportion of tall players (height > 183.9 cm)") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")

Each bar’s height represents the total number of roster entries for a given position type (forwards, defensemen, goalies). The dark blue segment on each bar marks the number of tall players within that position, where tall is defined here as above the dataset’s mean height of 183.9 cm. The percentage label gives the proportion of tall players out of the total for that position. Defensemen have the highest proportion of tall players (63.8%, against 46.2% for goalies and 42.2% for forwards), which may suggest that height is a more critical attribute for this position.

Weight Distribution by Position Type

# Boxplot for weight distribution by position type
ggplot(nhl_rosters, aes(x = position_type, y = weight_in_kilograms, fill = position_type)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Weight Distributions by Position", x = "Position Type", y = "Weight (kg)") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
## Warning: Removed 25 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Defensemen had the highest average weight, followed by forwards and then goalies. This complements the height results, since taller players tend to weigh more.

Height Distribution by Position Type

# Boxplot: Height by Position Type
ggplot(nhl_rosters, aes(x = position_type, y = height_in_centimeters, fill = position_type)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Height Distribution by Position Type", x = "Position Type", y = "Height (cm)") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The box plot of height by position type again shows that defensemen tend to be the tallest group.

Relationship Between Height and Weight

#Scatterplot for Canada weight_in_kilograms vs height_in_centimeters
ggplot(nhl_rosters[birth_country == "CAN"], aes(x = weight_in_kilograms, y = height_in_centimeters, color = position_type)) + 
  geom_point() +
  labs(title = "Canadian NHL Players' Weight vs Height by Position", 
       x = "Weight (kg)", 
       y = "Height (cm)") +
  theme_bw() +
  scale_color_brewer(palette = "Set2")
## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_point()`).

This scatterplot shows a positive association between weight and height among Canadian-born players, colour-coded by position type. Defensemen (in green) appear to make up most of the points at the upper end of both axes, which again supports the finding that defensemen tend to be the tallest position.

BMI Analysis by Position Type

# Calculate BMI and categorize
nhl_rosters$bmi <- nhl_rosters$weight_in_kilograms / (nhl_rosters$height_in_centimeters / 100)^2
nhl_rosters[, bmicat := cut(bmi, c(0, 18.5, 25, Inf))]

# BMI Categories by Position Type
ggplot(nhl_rosters, aes(x = position_type, fill = bmicat)) + 
  geom_bar() +
  labs(title = "BMI Categories by Position Type", 
       x = "Position Type", 
       y = "Count",
       fill = "BMI Category") +
  theme_bw() +
  scale_fill_brewer(palette = "Paired")

# BMI Histogram
ggplot(nhl_rosters, aes(x = bmi)) +
  geom_histogram(fill = "skyblue", alpha = 0.7) +
  labs(title = "Distribution of BMI", x = "BMI", y = "Count") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 29 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Density Plot: BMI Distribution by Position Type
ggplot(nhl_rosters, aes(x = bmi, fill = position_type)) +
  geom_density(alpha = 0.7) +
  labs(title = "BMI Distribution by Position Type", 
       x = "BMI", 
       y = "Density") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")
## Warning: Removed 29 rows containing non-finite outside the scale range
## (`stat_density()`).

Together, these three plots show that most players have a BMI of around 26, with no players below a BMI of 20. Goalies generally had lower BMIs than other positions, while defensemen again stood out, with a BMI distribution skewed slightly higher.
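As a worked check on the figure of roughly 26, the BMI formula used above can be applied to the median weight (88 kg) and median height (183 cm) reported by `summary(nhl_rosters)`. This is only a sketch: the variable names and the category labels are assumptions added for readability, while the break points are the ones passed to `cut()` in the analysis.

```r
# Median weight and height taken from summary(nhl_rosters) above
weight_kg <- 88
height_cm <- 183

# BMI = weight (kg) / height (m)^2
bmi_example <- weight_kg / (height_cm / 100)^2

# Same break points as the analysis; the labels are illustrative assumptions
bmi_label <- cut(bmi_example,
                 breaks = c(0, 18.5, 25, Inf),
                 labels = c("under 18.5", "18.5 to 25", "over 25"))

print(round(bmi_example, 1))
print(as.character(bmi_label))
```

A player with the median weight and median height lands at a BMI of about 26.3, in the highest of the three categories.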

Teams with the Highest BMI

# Highest individual BMI on each team's rosters (top five teams)
nhl_rosters[, .(max_bmi = max(bmi)), by = team_code][order(-max_bmi)][1:5]
##    team_code  max_bmi
##       <char>    <num>
## 1:       STL 33.33333
## 2:       CHI 32.72462
## 3:       NJD 32.24940
## 4:       NYR 32.24206
## 5:       EDM 31.68855
nhl_merged <- merge(nhl_rosters, nhl_teams, by = "team_code")
nhl_merged[, .(max_bmi = max(bmi)), by = full_name][order(-max_bmi)][1:5]
##             full_name  max_bmi
##                <char>    <num>
## 1:    St. Louis Blues 33.33333
## 2: Chicago Blackhawks 32.72462
## 3:  New Jersey Devils 32.24940
## 4:   New York Rangers 32.24206
## 5:    Edmonton Oilers 31.68855

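The five teams whose rosters include the highest individual BMI values are the St. Louis Blues, Chicago Blackhawks, New Jersey Devils, New York Rangers, and Edmonton Oilers. Because the ranking uses the maximum BMI per team, it reflects each team's most extreme player rather than the team as a whole.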
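Since `max(bmi)` picks out a single player per team, a mean per team may be a better summary of a roster as a whole. The sketch below contrasts the two on a small made-up data frame; `toy_rosters`, its team codes, and its BMI values are hypothetical and stand in for `nhl_rosters`. It also passes `na.rm = TRUE`, on the assumption that a team with one missing BMI should still get a result.

```r
# Toy data standing in for nhl_rosters; all values are made up for illustration
toy_rosters <- data.frame(
  team_code = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  bmi       = c(25.0, 27.0, NA, 26.0, 33.0)
)

# The maximum reflects one extreme player; the mean summarises the whole roster.
# na.rm = TRUE keeps a single missing BMI from turning a team's result into NA.
max_bmi  <- tapply(toy_rosters$bmi, toy_rosters$team_code, max,  na.rm = TRUE)
mean_bmi <- tapply(toy_rosters$bmi, toy_rosters$team_code, mean, na.rm = TRUE)

team_bmi <- data.frame(team_code = names(mean_bmi),
                       max_bmi   = as.numeric(max_bmi),
                       mean_bmi  = as.numeric(mean_bmi))

# Rank teams by mean BMI, highest first
team_bmi <- team_bmi[order(-team_bmi$mean_bmi), ]
print(team_bmi)
```

In the data.table syntax used elsewhere in this analysis, the equivalent would be along the lines of `nhl_rosters[, .(mean_bmi = mean(bmi, na.rm = TRUE)), by = team_code][order(-mean_bmi)]`.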

Places of Birth (Countries and Provinces)

# Player counts by birth country and Canadian provinces
nhl_rosters[, .N, by = birth_country][order(-N)]
##     birth_country     N
##            <char> <int>
##  1:           CAN 36405
##  2:           USA  8906
##  3:           SWE  2423
##  4:           CZE  1701
##  5:           RUS  1634
##  6:           FIN  1362
##  7:           SVK   583
##  8:           DEU   296
##  9:           GBR   287
## 10:           CHE   222
## 11:           UKR   143
## 12:           LVA   133
## 13:           DNK   126
## 14:           AUT    77
## 15:           FRA    69
## 16:           POL    65
## 17:           KAZ    52
## 18:           BLR    46
## 19:           NOR    40
## 20:           LTU    39
## 21:           NLD    26
## 22:           SRB    21
## 23:           SVN    21
## 24:           KOR    21
## 25:           VEN    17
## 26:           ITA    17
## 27:           ZAF    17
## 28:           BRA    17
## 29:           BRN    16
## 30:           TWN    15
## 31:           PRY    13
## 32:           IRL    12
## 33:           EST     9
## 34:           NGA     7
## 35:           BGR     7
## 36:           HTI     5
## 37:           JAM     5
## 38:           JPN     5
## 39:           UZB     4
## 40:           HRV     4
## 41:           LBN     3
## 42:           BHS     3
## 43:           AUS     3
## 44:           BEL     2
## 45:           IDN     2
## 46:           TZA     2
##     birth_country     N
nhl_rosters[!is.na(birth_state_province) & birth_country == "CAN", .N, by = birth_state_province][order(-N)]
##          birth_state_province     N
##                        <char> <int>
##  1:                   Ontario 16378
##  2:                    Quebec  6019
##  3:                   Alberta  4201
##  4:              Saskatchewan  3431
##  5:          British Columbia  2724
##  6:                  Manitoba  2410
##  7:               Nova Scotia   510
##  8:             New Brunswick   323
##  9: Newfoundland and Labrador   190
## 10:      Prince Edward Island   187
## 11:     Northwest Territories    25
## 12:           Yukon Territory     7

Canada is the most represented birth country in the dataset, and among Canadian-born players Ontario is by far the most common province of birth.

Weight Distribution by Top Birth Countries

# Weight distribution by top countries (Canada, USA, Sweden)
ggplot(nhl_rosters[birth_country %in% c("CAN", "USA", "SWE")], aes(x = birth_country, y = weight_in_kilograms, fill = birth_country)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Weight Distributions by Birth Country", 
       x = "Birth Country", 
       y = "Weight (kg)") +
  theme_bw() +
  scale_fill_brewer(palette = "Paired")
## Warning: Removed 25 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Among the top three countries represented in the dataset, weight trends show that U.S.-born players have the highest average weight, followed by Sweden and then Canada.

Birth Month Distribution

# Merge with nhl_player_births
nhl_merged <- merge(nhl_merged, nhl_player_births, by = intersect(names(nhl_merged), names(nhl_player_births)))
# Birth Month Distribution
nhl_merged[, .N, by = birth_month][order(-N)]
##     birth_month     N
##           <num> <int>
##  1:           1  5607
##  2:           2  5321
##  3:           3  5258
##  4:           4  5021
##  5:           5  4966
##  6:           6  4660
##  7:           7  4565
##  8:          10  4173
##  9:           9  4077
## 10:           8  3850
## 11:          12  3804
## 12:          11  3581
ggplot(nhl_merged, aes(x = factor(birth_month))) +
  geom_bar(fill = "lightblue") +
  labs(title = "Birth Month Distribution of NHL Players", 
       x = "Birth Month", 
       y = "Count") +
  theme_bw()

January is the most common birth month among NHL players in the dataset, followed in order by months 2, 3, 4, 5, 6, and 7.
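The monthly counts above can be turned into shares to see how strong the early-year skew is. This is a minimal sketch using the counts copied from the table, reordered from January to December; the names `month_counts`, `month_share`, and `first_half_share` are introduced here for illustration.

```r
# Counts by birth month (January to December), copied from the table above
month_counts <- c(5607, 5321, 5258, 5021, 4966, 4660,
                  4565, 3850, 4077, 4173, 3581, 3804)
names(month_counts) <- month.abb

# Share of roster entries per month, and for the first half of the year
month_share      <- month_counts / sum(month_counts)
first_half_share <- sum(month_share[1:6])

print(round(100 * month_share, 1))
print(round(100 * first_half_share, 1))
```

No single month comes close to half of the entries, but January to June together account for a little over half of them.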

Birth Year Distribution for 2023-2024 Season

# Birth Year Distribution for 2023-2024
nhl_merged[season == "20232024", .N, by = birth_year][order(-N)]
##     birth_year     N
##          <num> <int>
##  1:       1996    84
##  2:       1994    68
##  3:       1995    66
##  4:       1998    64
##  5:       1999    64
##  6:       1997    60
##  7:       2001    52
##  8:       1992    51
##  9:       1993    51
## 10:       2000    47
## 11:       1991    43
## 12:       1990    35
## 13:       2002    27
## 14:       1989    25
## 15:       1987    16
## 16:       1988    14
## 17:       2003     8
## 18:       2004     7
## 19:       1985     6
## 20:       1986     5
## 21:       1984     2
## 22:       2005     2
## 23:       1983     1
##     birth_year     N
nhl_merged[season == "20232024", .N, by = birth_year][order(-birth_year)]
##     birth_year     N
##          <num> <int>
##  1:       2005     2
##  2:       2004     7
##  3:       2003     8
##  4:       2002    27
##  5:       2001    52
##  6:       2000    47
##  7:       1999    64
##  8:       1998    64
##  9:       1997    60
## 10:       1996    84
## 11:       1995    66
## 12:       1994    68
## 13:       1993    51
## 14:       1992    51
## 15:       1991    43
## 16:       1990    35
## 17:       1989    25
## 18:       1988    14
## 19:       1987    16
## 20:       1986     5
## 21:       1985     6
## 22:       1984     2
## 23:       1983     1
##     birth_year     N
ggplot(nhl_merged[season == "20232024", .N, by = birth_year], aes(x = birth_year, y = N)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Birth Year Distribution of NHL Players for 2023-2024 Season", 
       x = "Birth Year", 
       y = "Count of Players") +
  theme_bw()

For the 2023-2024 season, the most common birth year among NHL players was 1996.

Comparing NHL and General Canadian Birth Trends

# Canada_births data set
colnames(canada_births_1991_2022)
## [1] "year"   "month"  "births"
head(canada_births_1991_2022)
## # A tibble: 6 × 3
##    year month births
##   <dbl> <dbl>  <dbl>
## 1  1991     1  32213
## 2  1991     2  30345
## 3  1991     3  34869
## 4  1991     4  35398
## 5  1991     5  36371
## 6  1991     6  34378
canada_births_1991_2022 <- as.data.table(canada_births_1991_2022)

# Total Canadian births by month, then grouped into quarters
canada_births_1991_2022[, list(total_births = sum(births)), by = month][order(-total_births)]
##     month total_births
##     <num>        <num>
##  1:     7      1042392
##  2:     5      1029538
##  3:     8      1024021
##  4:     9      1019490
##  5:     6      1001064
##  6:     3       995978
##  7:     4       983363
##  8:    10       982729
##  9:     1       941370
## 10:    12       920185
## 11:    11       918434
## 12:     2       886150
canada_births_1991_2022[, quarter := cut(month, 
                                         breaks = c(0, 3, 6, 9, 12), 
                                         labels = c("Q1", "Q2", "Q3", "Q4"))]

canada_quarterly_births <- canada_births_1991_2022[, .(total_births = sum(births)), by = quarter][order(-total_births)]
nhl_merged[, quarter := cut(birth_month, breaks = c(0, 3, 6, 9, 12), labels = c("Q1", "Q2", "Q3", "Q4"))]
nhl_quarterly_births <- nhl_merged[, .(total_births = .N), by = quarter][order(-total_births)]

# Plot Canada Births by Quarter
ggplot(canada_quarterly_births, aes(x = quarter, y = total_births, fill = quarter)) +
  geom_bar(stat = "identity") +
  labs(title = "Canada Births by Quarter", 
       x = "Quarter", 
       y = "Total Births") +
  theme_bw() +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.position = "none")

# Plot NHL Players' Births by Quarter
ggplot(nhl_quarterly_births, aes(x = quarter, y = total_births, fill = quarter)) +
  geom_bar(stat = "identity") +
  labs(title = "NHL Players Births by Quarter", 
       x = "Quarter", 
       y = "Total Births") +
  theme_bw() +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.position = "none")

# Mann-Whitney U test on the four NHL quarterly counts vs the four Canadian quarterly counts
test_results_mannwhitney <- wilcox.test(nhl_quarterly_births$total_births, 
                                        canada_quarterly_births$total_births)

test_results_mannwhitney
## 
##  Wilcoxon rank sum exact test
## 
## data:  nhl_quarterly_births$total_births and canada_quarterly_births$total_births
## W = 0, p-value = 0.02857
## alternative hypothesis: true location shift is not equal to 0

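The first quarter of the year accounts for the largest share of NHL players in the dataset. In contrast, Canadian birth data for 1991–2022 show the third quarter (July–September) with the highest number of births. The Mann-Whitney U test returned a p-value of 0.0286, but this result should be read with caution: the test compares four raw NHL counts (in the thousands) with four raw Canadian counts (in the millions), so W = 0 only reflects that every NHL count is smaller than every Canadian count, and 0.0286 is the smallest two-sided p-value possible with four values per group. The evidence that NHL births are skewed toward the first quarter therefore comes from the descriptive comparison of the two quarterly distributions rather than from this test.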
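Because the Mann-Whitney test above compares raw counts of very different magnitude, one alternative is a chi-square goodness-of-fit test, which asks whether the NHL quarterly counts follow the Canadian quarterly shares. The sketch below is not part of the original analysis. The quarterly totals are obtained by summing the monthly counts printed earlier, and the names `nhl_quarters`, `canada_quarters`, `canada_share`, and `gof` are introduced here for illustration. Two assumptions should be kept in mind: roster entries repeat the same player across seasons, so the observations are not independent and the test will overstate the evidence, and the NHL data include players born outside Canada and before 1991.

```r
# Quarterly totals obtained by summing the monthly counts printed earlier
nhl_quarters    <- c(Q1 = 16186, Q2 = 14647, Q3 = 12492, Q4 = 11558)
canada_quarters <- c(Q1 = 2823498, Q2 = 3013965, Q3 = 3085903, Q4 = 2821348)

# Expected share of births per quarter under the general Canadian pattern
canada_share <- canada_quarters / sum(canada_quarters)

# Goodness-of-fit: do the NHL counts follow the Canadian quarterly shares?
gof <- chisq.test(x = nhl_quarters, p = canada_share)

# Side-by-side quarterly shares (in percent), then the test result
print(round(100 * rbind(nhl    = nhl_quarters / sum(nhl_quarters),
                        canada = canada_share), 1))
print(gof)
```

Working with shares rather than raw counts removes the difference in scale between the two datasets, so the comparison is about the shape of the distributions.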

Fun Fact: Top 3 Common/Favourite Sweater Numbers

# Filter for top 3 common sweater_number
nhl_rosters[, .N, by = sweater_number][order(-N)][1:3]
##    sweater_number     N
##             <num> <int>
## 1:             14  1699
## 2:              4  1609
## 3:             17  1592

The most common sweater number in the dataset was 14, followed by 4 and 17.

Conclusion

This small data analysis looked at various characteristics of NHL players and found some meaningful patterns. First, a pie chart of players by position showed that forwards made up the largest group in the dataset, goalies the smallest, and defensemen fell in between. Interestingly, defensemen had the highest proportion of tall players, making them the “tallest” position overall, while forwards had the lowest proportion. This trend carries over to weight, where defensemen also had the highest average weight, followed by forwards and then goalies. Taller players tend to weigh more, which is consistent with defensemen ranking highest in both categories.

When examining BMI, most players had a BMI of around 26, with no players below a BMI of 20. Goalies generally had lower BMIs than other positions, while defensemen again stood out, with a BMI distribution skewed slightly higher.

Team-wise, the five teams whose rosters included the highest individual BMI values were the St. Louis Blues, Chicago Blackhawks, New Jersey Devils, New York Rangers, and Edmonton Oilers. These results loosely matched the earlier weight trends by country, where the U.S. had the highest average weight, followed by Sweden and then Canada, the top three countries represented in the dataset.

An analysis of birth data revealed some surprising patterns as well. January was the most common birth month in the dataset, followed in order by months 2, 3, 4, 5, 6, and 7, so most players in this dataset (about 56%) were born in the first half of the year. Similarly, the first quarter of the year accounted for the largest share of NHL players. In contrast, Canadian birth data for 1991–2022 showed that the third quarter of the year (July–September) had the highest number of births. The Mann-Whitney U test gave a p-value of 0.0286, but because it compared raw quarterly counts that differ in scale by orders of magnitude, it cannot by itself show that the shapes of the two distributions differ. The descriptive comparison nevertheless suggests that NHL players are more likely to have been born in the first quarter of the year than the general Canadian population.

Lastly, for a fun fact: the most common sweater number in the dataset was 14, followed by 4 and 17.