This analysis uses the TidyTuesday datasets for 2024-01-09, which cover NHL players, rosters, and teams, along with a dataset of Canadian births from 1991 to 2022. The analysis addresses the following key questions:
What is the distribution of players by position type? How do player characteristics (height, weight, BMI) vary across positions? Are NHL players born at different times of the year compared to general Canadian birth trends? What are the most common sweater numbers?
# Load necessary libraries
library(ggplot2)
library(data.table)
library(tidytuesdayR)
library(RColorBrewer)
# Load TidyTuesday data for the week of 2024-01-09
tuesdata <- tidytuesdayR::tt_load('2024-01-09')
## ---- Compiling #TidyTuesday Information for 2024-01-09 ----
## --- There are 4 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 4: "canada_births_1991_2022.csv"
## 2 of 4: "nhl_player_births.csv"
## 3 of 4: "nhl_rosters.csv"
## 4 of 4: "nhl_teams.csv"
# Extract relevant NHL datasets
canada_births_1991_2022 <- tuesdata$canada_births_1991_2022
nhl_player_births <- tuesdata$nhl_player_births
nhl_rosters <- tuesdata$nhl_rosters
nhl_teams <- tuesdata$nhl_teams
# Structure and summary of the NHL rosters dataset
head(nhl_rosters)
## # A tibble: 6 × 18
## team_code season position_type player_id headshot first_name last_name
## <chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 ATL 19992000 forwards 8467867 https://asset… Bryan Adams
## 2 ATL 19992000 forwards 8445176 https://asset… Donald Audette
## 3 ATL 19992000 forwards 8460014 https://asset… Eric Bertrand
## 4 ATL 19992000 forwards 8460510 https://asset… Jason Botterill
## 5 ATL 19992000 forwards 8459596 https://asset… Andrew Brunette
## 6 ATL 19992000 forwards 8445733 https://asset… Kelly Buchberg…
## # ℹ 11 more variables: sweater_number <dbl>, position_code <chr>,
## # shoots_catches <chr>, height_in_inches <dbl>, weight_in_pounds <dbl>,
## # height_in_centimeters <dbl>, weight_in_kilograms <dbl>, birth_date <date>,
## # birth_city <chr>, birth_country <chr>, birth_state_province <chr>
summary(nhl_rosters)
## team_code season position_type player_id
## Length:54883 Min. :19171918 Length:54883 Min. :8444850
## Class :character 1st Qu.:19801981 Class :character 1st Qu.:8448170
## Mode :character Median :19961997 Mode :character Median :8456153
## Mean :19919668 Mean :8459076
## 3rd Qu.:20112012 3rd Qu.:8470645
## Max. :20232024 Max. :8484314
##
## headshot first_name last_name sweater_number
## Length:54883 Length:54883 Length:54883 Min. : 1.0
## Class :character Class :character Class :character 1st Qu.:10.0
## Mode :character Mode :character Mode :character Median :20.0
## Mean :24.2
## 3rd Qu.:31.0
## Max. :99.0
## NA's :149
## position_code shoots_catches height_in_inches weight_in_pounds
## Length:54883 Length:54883 Min. :63.0 Min. :125.0
## Class :character Class :character 1st Qu.:71.0 1st Qu.:185.0
## Mode :character Mode :character Median :72.0 Median :195.0
## Mean :72.4 Mean :195.4
## 3rd Qu.:74.0 3rd Qu.:207.0
## Max. :81.0 Max. :265.0
## NA's :28 NA's :25
## height_in_centimeters weight_in_kilograms birth_date
## Min. :160.0 Min. : 57.00 Min. :1879-07-27
## 1st Qu.:180.0 1st Qu.: 84.00 1st Qu.:1955-01-03
## Median :183.0 Median : 88.00 Median :1970-01-23
## Mean :183.9 Mean : 88.66 Mean :1965-12-29
## 3rd Qu.:188.0 3rd Qu.: 94.00 3rd Qu.:1984-05-01
## Max. :206.0 Max. :120.00 Max. :2005-07-17
## NA's :28 NA's :25
## birth_city birth_country birth_state_province
## Length:54883 Length:54883 Length:54883
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
nhl_rosters <- as.data.table(nhl_rosters)
colnames(nhl_rosters)
## [1] "team_code" "season" "position_type"
## [4] "player_id" "headshot" "first_name"
## [7] "last_name" "sweater_number" "position_code"
## [10] "shoots_catches" "height_in_inches" "weight_in_pounds"
## [13] "height_in_centimeters" "weight_in_kilograms" "birth_date"
## [16] "birth_city" "birth_country" "birth_state_province"
# Remove unnecessary column 'headshot'
nhl_rosters[, headshot := NULL]
# Filter for position_type and make a pie chart of the position type distribution
unique(nhl_rosters$position_type)
## [1] "forwards" "defensemen" "goalies"
sum(is.na(nhl_rosters$position_type))
## [1] 0
nhl_rosters[, .N, by = position_type][order(-N)]
## position_type N
## <char> <int>
## 1: forwards 33038
## 2: defensemen 16823
## 3: goalies 5022
ggplot(nhl_rosters[, .N, by = position_type][order(-N)], aes(x = "", y = N, fill = position_type)) + geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") + #makes it a pie chart
labs(title = "Proportion of Players by Position Type") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
Forwards make up the largest group in the dataset (33,038 roster entries), followed by defensemen (16,823), with goalies the smallest group (5,022).
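The counts above are rows of `nhl_rosters`, and the table appears to hold one row per player per team and season, so a long-serving player is counted many times. A minimal sketch of counting each player once through `player_id`, using a small made-up table (`toy_rosters` is hypothetical, not the real data):

```r
library(data.table)

# Toy stand-in for nhl_rosters (hypothetical values, not the real data):
# player 1 appears in two seasons and player 3 in three.
toy_rosters <- data.table(
  player_id     = c(1, 1, 2, 3, 3, 3, 4),
  season        = c(20212022, 20222023, 20222023,
                    20202021, 20212022, 20222023, 20222023),
  position_type = c("forwards", "forwards", "forwards",
                    "defensemen", "defensemen", "defensemen", "goalies")
)

# Roster entries per position (what .N on the raw table counts)
entries <- toy_rosters[, .(entries = .N), by = position_type]

# Distinct players per position: unique player_id values in each group
players <- toy_rosters[, .(players = uniqueN(player_id)), by = position_type]

position_counts <- merge(entries, players, by = "position_type")
print(position_counts)
```

If the two columns rank the positions the same way on the real data, the pie chart's message is unchanged; if not, the distinct-player counts are probably the better answer to "how many players play each position".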
# Count players taller than the mean height (183.9 cm) by position type
sum(is.na(nhl_rosters$height_in_centimeters))
## [1] 28
nhl_rosters[height_in_centimeters > 183.9, .N, by = position_type][order(-N)]
## position_type N
## <char> <int>
## 1: forwards 13929
## 2: defensemen 10726
## 3: goalies 2318
# Proportion of tall people by position type
proportion_tall <- merge(nhl_rosters[height_in_centimeters > 183.9, .N, by = position_type],
nhl_rosters[, .N, by = position_type],
by = "position_type")[, proportion := N.x / N.y][order(-proportion)]
print(proportion_tall)
## position_type N.x N.y proportion
## <char> <int> <int> <num>
## 1: defensemen 10726 16823 0.6375795
## 2: goalies 2318 5022 0.4615691
## 3: forwards 13929 33038 0.4216054
ggplot(proportion_tall, aes(x = position_type, y = N.y, fill = position_type)) +
geom_bar(stat = "identity", alpha = 0.7) +
geom_segment(aes(x = position_type, xend = position_type,
y = 0, yend = N.y * proportion), color = "darkblue", linewidth = 1) +
geom_text(aes(y = N.y * proportion,
label = scales::percent(proportion, accuracy = 0.1)),
vjust = -0.5, color = "darkblue") + # Adding percentage
labs(title = "Number of Players by Position with Proportion of Tall Players",
x = "Position Type",
y = "Total Count",
caption = "Percentage indicates proportion of tall players (height > 183.9 cm)") +
theme_bw() +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none")
Each bar's height represents the total number of roster entries for a given position type (forwards, defensemen, goalies). The dark blue segment on each bar marks the number of tall players within that position, where "tall" means above the overall mean height of 183.9 cm. The percentage label gives the proportion of tall players out of all players at that position. Defensemen have the highest proportion of tall players (63.8%, compared with 46.2% for goalies and 42.2% for forwards), which may suggest that height matters more for this position.
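The proportion of tall players can also be computed in a single grouped pass, because the mean of a logical vector is the share of `TRUE` values. The sketch below uses a small made-up table (`toy_rosters` and its heights are hypothetical). Unlike the merge-based calculation above, it divides by the rows with a known height rather than by all rows, so the result may differ slightly when heights are missing:

```r
library(data.table)

# Toy stand-in for nhl_rosters (hypothetical heights in cm, one missing)
toy_rosters <- data.table(
  position_type         = c("forwards", "forwards", "forwards",
                            "defensemen", "defensemen", "goalies", "goalies"),
  height_in_centimeters = c(178, 185, NA, 190, 188, 180, 186)
)

# Threshold: overall mean height, ignoring missing values
tall_cutoff <- toy_rosters[, mean(height_in_centimeters, na.rm = TRUE)]

# One grouped pass: total rows, rows with a known height, and the share
# of known-height rows above the cutoff
proportion_tall <- toy_rosters[, .(
  n_total        = .N,
  n_known_height = sum(!is.na(height_in_centimeters)),
  proportion     = mean(height_in_centimeters > tall_cutoff, na.rm = TRUE)
), by = position_type][order(-proportion)]

print(proportion_tall)
```

Computing the cutoff from the data rather than hard-coding 183.9 keeps the threshold in step with the table if rows are filtered later.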
# Boxplot for weight distribution by position type
ggplot(nhl_rosters, aes(x = position_type, y = weight_in_kilograms, fill = position_type)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Weight Distributions by Position", x = "Position Type", y = "Weight (kg)") +
theme_bw() +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none")
## Warning: Removed 25 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Defensemen had the highest average weight, followed by forwards and then goalies. This complements the height results, since taller players tend to weigh more.
# Boxplot: Height by Position Type
ggplot(nhl_rosters, aes(x = position_type, y = height_in_centimeters, fill = position_type)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Height Distribution by Position Type", x = "Position Type", y = "Height (cm)") +
theme_bw() +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none")
## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The box plot of height by position type shows again that defensemen tend to be the tallest group.
# Scatterplot of weight vs height for Canadian-born players
ggplot(nhl_rosters[birth_country == "CAN"], aes(x = weight_in_kilograms, y = height_in_centimeters, color = position_type)) +
geom_point() +
labs(title = "Canadian NHL Players' Weight vs Height by Position",
x = "Weight (kg)",
y = "Height (cm)") +
theme_bw() +
scale_color_brewer(palette = "Set2")
## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_point()`).
This scatterplot shows a positive association between weight and height, colour-coded by position type. Defensemen (green) appear to make up most of the points at the upper end of both axes, again suggesting that defensemen tend to be the tallest players.
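The positive association seen in the scatterplot could be summarised with a correlation coefficient. A minimal sketch on made-up values (`toy_players` is hypothetical), using `use = "complete.obs"` so that rows with a missing weight or height are dropped, much as `geom_point()` drops them from the plot:

```r
# Toy stand-in for the Canadian subset (hypothetical values, with gaps)
toy_players <- data.frame(
  weight_in_kilograms   = c(80, 85, 90, 95, NA, 100),
  height_in_centimeters = c(178, 180, 185, 188, 190, NA)
)

# Pearson correlation on the rows where both measurements are present
weight_height_cor <- cor(
  toy_players$weight_in_kilograms,
  toy_players$height_in_centimeters,
  use = "complete.obs"
)
print(weight_height_cor)
```

A value close to 1 indicates a strong positive linear relationship; on the real data the coefficient would need to be computed before drawing any conclusion about its strength.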
# Calculate BMI and categorize
nhl_rosters$bmi <- nhl_rosters$weight_in_kilograms / (nhl_rosters$height_in_centimeters / 100)^2
nhl_rosters[, bmicat := cut(bmi, c(0, 18.5, 25, Inf))]
# BMI Categories by Position Type
ggplot(nhl_rosters, aes(x = position_type, fill = bmicat)) +
geom_bar() +
labs(title = "BMI Categories by Position Type",
x = "Position Type",
y = "Count",
fill = "BMI Category") +
theme_bw() +
scale_fill_brewer(palette = "Paired")
# BMI Histogram
ggplot(nhl_rosters, aes(x = bmi)) +
geom_histogram(fill = "skyblue", alpha = 0.7) +
labs(title = "Distribution of BMI", x = "BMI", y = "Count") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 29 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Density Plot: BMI Distribution by Position Type
ggplot(nhl_rosters, aes(x = bmi, fill = position_type)) +
geom_density(alpha = 0.7) +
labs(title = "BMI Distribution by Position Type",
x = "BMI",
y = "Density") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
## Warning: Removed 29 rows containing non-finite outside the scale range
## (`stat_density()`).
These three plots show that most players have a BMI of around 26, with no players below 20. Goalies generally had lower BMIs than the other positions, while defensemen again stood out, with a BMI distribution skewed slightly higher.
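As a worked check of the BMI formula (weight in kilograms divided by height in metres squared), the sketch below applies it to the median weight (88 kg) and median height (183 cm) reported by `summary(nhl_rosters)`. It also shows how `cut()` could attach readable labels to the same breaks. The helper name `bmi()` and the label names are assumptions for illustration, not part of the original analysis:

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Worked example with the median weight and median height from the summary
median_player_bmi <- bmi(88, 183)
print(round(median_player_bmi, 1))

# Same breaks as the analysis, with hypothetical readable labels.
# Intervals are right-closed by default, so a BMI of exactly 25 is "Normal".
bmi_category <- cut(
  c(17.9, 24.2, 25, median_player_bmi, 31),
  breaks = c(0, 18.5, 25, Inf),
  labels = c("Underweight", "Normal", "Overweight or above")
)
print(bmi_category)
```

The result of roughly 26.3 for the median measurements is consistent with the histogram peaking near 26.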
# Highest individual BMI on each team's rosters (top 5 teams)
nhl_rosters[, .(max_bmi = max(bmi)), by = team_code][order(-max_bmi)][1:5]
## team_code max_bmi
## <char> <num>
## 1: STL 33.33333
## 2: CHI 32.72462
## 3: NJD 32.24940
## 4: NYR 32.24206
## 5: EDM 31.68855
nhl_merged <- merge(nhl_rosters, nhl_teams, by = "team_code")
nhl_merged[, .(max_bmi = max(bmi)), by = full_name][order(-max_bmi)][1:5]
## full_name max_bmi
## <char> <num>
## 1: St. Louis Blues 33.33333
## 2: Chicago Blackhawks 32.72462
## 3: New Jersey Devils 32.24940
## 4: New York Rangers 32.24206
## 5: Edmonton Oilers 31.68855
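The five teams whose rosters include the highest individual BMI values were the St. Louis Blues, Chicago Blackhawks, New Jersey Devils, New York Rangers, and Edmonton Oilers. Note that this ranks teams by their single highest-BMI player, not by the typical BMI of the roster.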
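Ranking teams by `max(bmi)` reflects only each team's most extreme roster entry. A minimal sketch on made-up data (`toy_rosters` and its team codes are hypothetical) of how a ranking by mean BMI can differ from a ranking by maximum BMI, with `na.rm = TRUE` so that missing values do not turn a team's summary into `NA`:

```r
library(data.table)

# Toy stand-in (hypothetical team codes and BMI values, one missing)
toy_rosters <- data.table(
  team_code = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  bmi       = c(24, 25, 33, 27, 28, NA)
)

team_bmi <- toy_rosters[, .(
  max_bmi  = max(bmi, na.rm = TRUE),   # set by a single roster entry
  mean_bmi = mean(bmi, na.rm = TRUE)   # typical value across the roster
), by = team_code]

print(team_bmi[order(-max_bmi)])   # AAA ranks first on maximum BMI
print(team_bmi[order(-mean_bmi)])  # BBB ranks first on mean BMI
```

Which summary is appropriate depends on the question: the maximum identifies outlying players, while the mean or median describes the team as a whole.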
# Player counts by birth country and Canadian provinces
nhl_rosters[, .N, by = birth_country][order(-N)]
## birth_country N
## <char> <int>
## 1: CAN 36405
## 2: USA 8906
## 3: SWE 2423
## 4: CZE 1701
## 5: RUS 1634
## 6: FIN 1362
## 7: SVK 583
## 8: DEU 296
## 9: GBR 287
## 10: CHE 222
## 11: UKR 143
## 12: LVA 133
## 13: DNK 126
## 14: AUT 77
## 15: FRA 69
## 16: POL 65
## 17: KAZ 52
## 18: BLR 46
## 19: NOR 40
## 20: LTU 39
## 21: NLD 26
## 22: SRB 21
## 23: SVN 21
## 24: KOR 21
## 25: VEN 17
## 26: ITA 17
## 27: ZAF 17
## 28: BRA 17
## 29: BRN 16
## 30: TWN 15
## 31: PRY 13
## 32: IRL 12
## 33: EST 9
## 34: NGA 7
## 35: BGR 7
## 36: HTI 5
## 37: JAM 5
## 38: JPN 5
## 39: UZB 4
## 40: HRV 4
## 41: LBN 3
## 42: BHS 3
## 43: AUS 3
## 44: BEL 2
## 45: IDN 2
## 46: TZA 2
## birth_country N
nhl_rosters[!is.na(birth_state_province) & birth_country == "CAN", .N, by = birth_state_province][order(-N)]
## birth_state_province N
## <char> <int>
## 1: Ontario 16378
## 2: Quebec 6019
## 3: Alberta 4201
## 4: Saskatchewan 3431
## 5: British Columbia 2724
## 6: Manitoba 2410
## 7: Nova Scotia 510
## 8: New Brunswick 323
## 9: Newfoundland and Labrador 190
## 10: Prince Edward Island 187
## 11: Northwest Territories 25
## 12: Yukon Territory 7
Canada is the most represented country in the dataset, and among Canadian-born players Ontario is by far the most common birth province (16,378 roster entries, compared with 6,019 for Quebec).
# Weight distribution by top countries (Canada, USA, Sweden)
ggplot(nhl_rosters[birth_country %in% c("CAN", "USA", "SWE")], aes(x = birth_country, y = weight_in_kilograms, fill = birth_country)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Weight Distributions by Birth Country",
x = "Birth Country",
y = "Weight (kg)") +
theme_bw() +
scale_fill_brewer(palette = "Paired")
## Warning: Removed 25 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Among the three most represented countries in the dataset, the U.S. has the highest average weight, followed by Sweden and then Canada.
# Merge with nhl_player_births
nhl_merged <- merge(nhl_merged, nhl_player_births, by = intersect(names(nhl_merged), names(nhl_player_births)))
# Birth Month Distribution
nhl_merged[, .N, by = birth_month][order(-N)]
## birth_month N
## <num> <int>
## 1: 1 5607
## 2: 2 5321
## 3: 3 5258
## 4: 4 5021
## 5: 5 4966
## 6: 6 4660
## 7: 7 4565
## 8: 10 4173
## 9: 9 4077
## 10: 8 3850
## 11: 12 3804
## 12: 11 3581
ggplot(nhl_merged, aes(x = factor(birth_month))) +
geom_bar(fill = "lightblue") +
labs(title = "Birth Month Distribution of NHL Players",
x = "Birth Month",
y = "Count") +
theme_bw()
January was the most common birth month among NHL players in the dataset, followed in order by months 2 through 7 (February to July).
# Birth Year Distribution for 2023-2024
nhl_merged[season == "20232024", .N, by = birth_year][order(-N)]
## birth_year N
## <num> <int>
## 1: 1996 84
## 2: 1994 68
## 3: 1995 66
## 4: 1998 64
## 5: 1999 64
## 6: 1997 60
## 7: 2001 52
## 8: 1992 51
## 9: 1993 51
## 10: 2000 47
## 11: 1991 43
## 12: 1990 35
## 13: 2002 27
## 14: 1989 25
## 15: 1987 16
## 16: 1988 14
## 17: 2003 8
## 18: 2004 7
## 19: 1985 6
## 20: 1986 5
## 21: 1984 2
## 22: 2005 2
## 23: 1983 1
## birth_year N
nhl_merged[season == "20232024", .N, by = birth_year][order(-birth_year)]
## birth_year N
## <num> <int>
## 1: 2005 2
## 2: 2004 7
## 3: 2003 8
## 4: 2002 27
## 5: 2001 52
## 6: 2000 47
## 7: 1999 64
## 8: 1998 64
## 9: 1997 60
## 10: 1996 84
## 11: 1995 66
## 12: 1994 68
## 13: 1993 51
## 14: 1992 51
## 15: 1991 43
## 16: 1990 35
## 17: 1989 25
## 18: 1988 14
## 19: 1987 16
## 20: 1986 5
## 21: 1985 6
## 22: 1984 2
## 23: 1983 1
## birth_year N
ggplot(nhl_merged[season == "20232024", .N, by = birth_year], aes(x = birth_year, y = N)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Birth Year Distribution of NHL Players for 2023-2024 Season",
x = "Birth Year",
y = "Count of Players") +
theme_bw()
For the 2023-2024 season, the most common birth year among NHL players was 1996.
### Comparing NHL and General Canadian Birth Trends
# Canada_births data set
colnames(canada_births_1991_2022)
## [1] "year" "month" "births"
head(canada_births_1991_2022)
## # A tibble: 6 × 3
## year month births
## <dbl> <dbl> <dbl>
## 1 1991 1 32213
## 2 1991 2 30345
## 3 1991 3 34869
## 4 1991 4 35398
## 5 1991 5 36371
## 6 1991 6 34378
canada_births_1991_2022 <- as.data.table(canada_births_1991_2022)
# Total Canadian births by month, then grouped into quarters
canada_births_1991_2022[, list(total_births = sum(births)), by = month][order(-total_births)]
## month total_births
## <num> <num>
## 1: 7 1042392
## 2: 5 1029538
## 3: 8 1024021
## 4: 9 1019490
## 5: 6 1001064
## 6: 3 995978
## 7: 4 983363
## 8: 10 982729
## 9: 1 941370
## 10: 12 920185
## 11: 11 918434
## 12: 2 886150
canada_births_1991_2022[, quarter := cut(month,
breaks = c(0, 3, 6, 9, 12),
labels = c("Q1", "Q2", "Q3", "Q4"))]
canada_quarterly_births <- canada_births_1991_2022[, .(total_births = sum(births)), by = quarter][order(-total_births)]
nhl_merged[, quarter := cut(birth_month, breaks = c(0, 3, 6, 9, 12), labels = c("Q1", "Q2", "Q3", "Q4"))]
nhl_quarterly_births <- nhl_merged[, .(total_births = .N), by = quarter][order(-total_births)]
# Plot Canada Births by Quarter
ggplot(canada_quarterly_births, aes(x = quarter, y = total_births, fill = quarter)) +
geom_bar(stat = "identity") +
labs(title = "Canada Births by Quarter",
x = "Quarter",
y = "Total Births") +
theme_bw() +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none")
# Plot NHL Players' Births by Quarter
ggplot(nhl_quarterly_births, aes(x = quarter, y = total_births, fill = quarter)) +
geom_bar(stat = "identity") +
labs(title = "NHL Players Births by Quarter",
x = "Quarter",
y = "Total Births") +
theme_bw() +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none")
# Mann-Whitney U test on the four raw quarterly counts (NHL vs Canada)
test_results_mannwhitney <- wilcox.test(nhl_quarterly_births$total_births,
canada_quarterly_births$total_births)
test_results_mannwhitney
##
## Wilcoxon rank sum exact test
##
## data: nhl_quarterly_births$total_births and canada_quarterly_births$total_births
## W = 0, p-value = 0.02857
## alternative hypothesis: true location shift is not equal to 0
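The first quarter was the most common birth quarter among NHL players (about 29.5% of roster entries, based on the monthly counts above). In contrast, Canadian birth data for 1991–2022 showed that the third quarter of the year (July–September) had the highest number of births, with the first quarter accounting for about 24.0%. The Mann-Whitney U test returned a p-value of 0.0286, but this should be read with caution: it compares the four raw NHL counts with the four raw Canadian counts, and because every NHL count is far smaller than every Canadian count (W = 0), the p-value mostly reflects the difference in dataset size rather than a difference in how births are spread across quarters. The gap in quarterly shares is the more direct evidence that the NHL birth distribution is skewed toward the first quarter.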
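Because the Mann-Whitney test above compares raw counts from datasets of very different sizes, one alternative is to compare quarterly shares directly. The sketch below swaps in a chi-squared goodness-of-fit test, which is not the test used in this analysis, and asks whether the NHL quarterly counts follow the Canadian quarterly proportions. The monthly counts are copied from the tables printed earlier. It rests on strong assumptions: roster entries are treated as independent even though players recur across seasons, not all players are Canadian, and the Canadian data only covers 1991–2022.

```r
# Monthly counts copied from the tables above (months 1 to 12)
nhl_monthly <- c(5607, 5321, 5258, 5021, 4966, 4660,
                 4565, 3850, 4077, 4173, 3581, 3804)
canada_monthly <- c(941370, 886150, 995978, 983363, 1029538, 1001064,
                    1042392, 1024021, 1019490, 982729, 918434, 920185)

# Collapse months into quarters (Q1 = months 1-3, ..., Q4 = months 10-12)
quarter <- rep(c("Q1", "Q2", "Q3", "Q4"), each = 3)
nhl_quarterly    <- sapply(split(nhl_monthly, quarter), sum)
canada_quarterly <- sapply(split(canada_monthly, quarter), sum)

# Share of births falling in each quarter
nhl_share    <- nhl_quarterly / sum(nhl_quarterly)
canada_share <- canada_quarterly / sum(canada_quarterly)
print(round(rbind(nhl = nhl_share, canada = canada_share), 3))

# Goodness of fit: do the NHL counts follow the Canadian quarterly shares?
gof <- chisq.test(x = nhl_quarterly, p = canada_share)
print(gof)
```

Given the dependence between roster rows, the resulting p-value is best treated as indicative only; repeating the test on one row per `player_id`, restricted to Canadian-born players, would be a more defensible comparison.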
# Filter for top 3 common sweater_number
nhl_rosters[, .N, by = sweater_number][order(-N)][1:3]
## sweater_number N
## <num> <int>
## 1: 14 1699
## 2: 4 1609
## 3: 17 1592
The most common sweater number in the dataset was 14, followed by 4 and 17.
This small data analysis looked at various characteristics of NHL players and found some meaningful insights. First, the pie chart of players by position showed that forwards made up the largest group in the dataset, with goalies the smallest and defensemen in between. Interestingly, defensemen had the highest proportion of tall players, making them the "tallest" position overall, while forwards had the lowest proportion. This trend carries over to weight, where defensemen also had the highest average weight, followed by forwards and then goalies. Taller players tend to weigh more, which is consistent with defensemen ranking highest in both categories.
When examining BMI, most players had a BMI of around 26, with no players below 20. Goalies generally had lower BMIs than the other positions, while defensemen again stood out, with a BMI distribution skewed slightly higher.
Team-wise, the five teams whose rosters included the highest individual BMI values were the St. Louis Blues, Chicago Blackhawks, New Jersey Devils, New York Rangers, and Edmonton Oilers. These results are loosely in line with the earlier weight trends by country, where the U.S. had the highest average weight, followed by Sweden and then Canada, the top three countries represented in the dataset.
An analysis of birth data revealed some surprising insights as well. January was the most common birth month among NHL players in the dataset, followed in order by months 2 through 7, indicating that more players in this dataset were born in the first half of the year than in the second. Similarly, the first quarter was the most common birth quarter, accounting for about 29.5% of roster entries. In contrast, Canadian birth data for 1991–2022 showed that the third quarter of the year (July–September) had the highest number of births, with the first quarter accounting for about 24.0%. The Mann-Whitney U test returned a p-value of 0.0286, but because it compared raw quarterly counts from datasets of very different sizes, that result mainly reflects scale rather than the shape of the distributions. The quarterly shares nonetheless suggest that NHL players are more likely to be born in the first quarter of the year than the general Canadian population, although a test on proportions would be needed to confirm this formally.
Lastly, for a fun fact: the most common sweater number in the dataset was 14, followed by 4 and 17.