Olympic weightlifting is one of the most iconic sports in the history of the Olympic Games, showcasing the pinnacle of human strength, skill, and determination. The sport has evolved over decades, with athletes competing in various weight classes for gold, silver, and bronze medals. For this project, I chose to focus on Olympic weightlifting because of its rich history and the inspiration it brings to athletes worldwide, including myself as a weightlifting enthusiast.
Olympic weightlifting has been part of the Games since 1896 and involves two key lifts: the snatch and the clean and jerk. These lifts test athletes’ strength and technique in specific weight classes, ensuring fair competition. Men have competed in weightlifting since the beginning, and women joined the sport in 2000, marking a significant milestone for gender inclusion in the Olympics.
Country Dominance
Countries like China, Russia, and Bulgaria have historically excelled in
the sport, often due to well-funded training programs and a focus on
athlete development. China, in particular, has dominated in recent
decades, becoming a leader in both men’s and women’s events.
Age and Gender in Weightlifting
Athletes typically peak in their 20s or early 30s, aligning with the
physical demands of weightlifting. Historically, men have outnumbered
women in participation, but the introduction of women’s weightlifting
has significantly boosted female representation in recent years.
Sources
International Weightlifting Federation (IWF): www.iwf.net
Olympic History Archives: www.olympic.org
athlete_events.csv comes from a publicly available Kaggle dataset containing information on Olympic athletes from 1896 to 2016. The data was scraped from www.sports-reference.com in May 2018. It includes 271,116 observations and 15 variables. Key variables I used in this project include:
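Categorical Variables:
Sex : The sex of the athlete (M or F).
NOC : The three-letter National Olympic Committee code of the athlete’s country.
Medal : The medal won (Gold, Silver, or Bronze); NA for non-medalists.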
Quantitative Variables:
Age : The age of the athlete during the event (years).
Weight : The weight of the athlete (kilograms).
Year : The year of the Olympic Games.
To analyze the dataset effectively, I filtered it to include only weightlifting events. Through additional cleaning and analysis, I focused on the years 1972 to 2016, roughly the most recent 50 years of Olympic weightlifting history (the dataset ends in 2016). Non-medalists (rows with a missing Medal value) were removed, and rows with a missing Age or Sex were dropped before fitting the regression model, so that the data was complete for analysis.
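The filtering and cleaning steps described above can be sketched on a small toy table. The rows below are invented for illustration (only the column names follow athlete_events.csv), and `toy_events`, `toy_clean`, and `missing_before` are hypothetical names, so this is a minimal sketch rather than the exact pipeline used in the project.

```r
library(dplyr)

# Hypothetical toy rows that mimic a few columns of athlete_events.csv
toy_events <- data.frame(
  Sport = c("Weightlifting", "Weightlifting", "Weightlifting", "Swimming", "Weightlifting"),
  Year  = c(1968, 1976, 2008, 2008, 2016),
  Age   = c(24, NA, 27, 21, 30),
  Medal = c("Gold", "Silver", NA, "Gold", "Bronze")
)

# Count missing values per column before any cleaning
missing_before <- colSums(is.na(toy_events))

toy_clean <- toy_events %>%
  filter(Sport == "Weightlifting", Year >= 1972) %>%  # keep the sport and time window
  filter(!is.na(Age), !is.na(Medal))                  # drop incomplete rows

toy_clean
```

Checking `missing_before` first shows how many rows each `is.na()` filter will cost, which is useful to know before deciding which variables to treat as required.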
The primary objective of this project is to uncover trends and patterns in Olympic weightlifting, including which countries have been the most dominant in the sport, how factors like age and gender influence medal performance, and how the sport has evolved over time. Initially, I explored the entire dataset spanning 1896 to 2016 to gain a broad understanding of the data. However, as the analysis progressed, I focused specifically on the past 50 years (1972–2016) to highlight more recent trends in Olympic weightlifting history. This dataset is personally meaningful because it combines my passion for strength sports with my academic pursuit of data science.
setwd("/Users/ronaldohernandez/Desktop/Data 110 folder/Project 2 Data110")
#libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
athlete_events <- read_csv("athlete_events.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filter to keep only weightlifting data.
Filtered_Weightlifting <- athlete_events %>%
  filter(Sport == "Weightlifting")
# Select only the variables relevant to the analysis and remove non-medalists.
weightlifting <- Filtered_Weightlifting %>%
  select(Name, Sex, NOC, Medal, Event, Age, Weight, Year) %>%
  filter(!is.na(Medal))
# Count medals by country (NOC) and medal type, then pivot to wide format
# (one column per medal type) so medal counts can be compared across countries.
medals_by_country <- weightlifting %>%
  group_by(NOC, Medal) %>%
  summarize(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Medal, values_from = Count, values_fill = 0) %>%
  mutate(Total = rowSums(across(c(Gold, Silver, Bronze)))) %>%
  arrange(desc(Total))
# Top 10 countries with the most medals
top_10_countries <- medals_by_country %>%
  slice_head(n = 10)
# Filter for 1972 to 2016 (matching the text and the plot title) and keep medalists only
Last_50 <- Filtered_Weightlifting %>%
  filter(Year >= 1972, !is.na(Medal))
# Summarize medal counts by country
medals_by_country_last_50 <- Last_50 %>%
  group_by(NOC, Medal) %>%
  summarize(Count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Medal, values_from = Count, values_fill = 0) %>%
  mutate(Total = rowSums(across(c(Gold, Silver, Bronze)))) %>%
  arrange(desc(Total))
# Top 10 countries with the most medals
top_10_countries_last_50 <- medals_by_country_last_50 %>%
  slice_head(n = 10)
# Create a new variable, Medal_Score, to convert medal types into numeric values:
weightlifting_Analysis <- weightlifting %>%
  mutate(Medal_Score = case_when(
    Medal == "Gold" ~ 3,
    Medal == "Silver" ~ 2,
    Medal == "Bronze" ~ 1,
    TRUE ~ 0
  ))
# Remove rows with missing values for age, sex, and medal score
weightlifting_clean <- weightlifting_Analysis %>%
  filter(!is.na(Age), !is.na(Sex), !is.na(Medal_Score))
# Fit a multiple linear regression
regression_model <- lm(Medal_Score ~ Age + Sex, data = weightlifting_clean)
summary(regression_model)
##
## Call:
## lm(formula = Medal_Score ~ Age + Sex, data = weightlifting_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1250 -0.9663 -0.0192 0.9543 1.1500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.313130 0.210057 11.012 <2e-16 ***
## Age -0.013231 0.008209 -1.612 0.108
## SexM 0.036839 0.088855 0.415 0.679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8186 on 633 degrees of freedom
## Multiple R-squared: 0.004114, Adjusted R-squared: 0.000967
## F-statistic: 1.307 on 2 and 633 DF, p-value: 0.2713
par(mfrow = c(2, 2))
plot(regression_model)
Based on the model output, the fitted regression equation is Medal_Score = 2.313130 - 0.013231(Age) + 0.036839(SexM)
Age - The slope (-0.013231) indicates that for every additional year of age, the predicted medal score decreases by about 0.013 points. However, the p-value (0.108) indicates that age is not a statistically significant predictor (p > 0.05).
Sex - The coefficient (0.036839) suggests that male athletes (Sex = M) have a slightly higher predicted medal score than female athletes (Sex = F). However, the p-value (0.679) shows that sex is also not statistically significant (p > 0.05).
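To make the equation concrete, the coefficients from the output can be plugged in by hand. The sketch below uses a hypothetical helper, `predict_score()`, which is not part of the original analysis; it only evaluates the fitted equation for a few example athletes.

```r
# Coefficients copied from the summary() output above
b0    <- 2.313130   # intercept
b_age <- -0.013231  # change in Medal_Score per additional year of age
b_sex <- 0.036839   # shift for male athletes (SexM = 1)

# Hypothetical helper: predicted Medal_Score for a given age and sex
predict_score <- function(age, male) {
  b0 + b_age * age + b_sex * as.numeric(male)
}

score_m25 <- predict_score(25, TRUE)   # 25-year-old male
score_f25 <- predict_score(25, FALSE)  # 25-year-old female
score_m35 <- predict_score(35, TRUE)   # 35-year-old male

c(score_m25, score_f25, score_m35)
```

The three predictions come out at roughly 2.02, 1.98, and 1.89. On a 1-to-3 scale these differences are tiny, which is consistent with neither coefficient being statistically significant.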
The adjusted R-squared (0.000967) suggests that only about 0.1% of the variation in medal score is explained by the model.
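The adjusted R-squared can be reproduced from the other numbers in the output using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below assumes n is the residual degrees of freedom plus the number of estimated coefficients (633 + 3 = 636); the variable names are only for illustration.

```r
r_squared <- 0.004114  # Multiple R-squared from the output
df_resid  <- 633       # residual degrees of freedom from the output
p         <- 2         # number of predictors: Age and Sex
n         <- df_resid + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 6)
```

The result is about 0.00097, in line with the reported value, so the model explains roughly 0.1% of the variation once the two predictors are accounted for.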
The regression model does not provide meaningful insights into medal performance in Olympic weightlifting: Age and Sex are not statistically significant predictors of medal score. Because non-medalists were removed before fitting, the model only compares medal colors among medalists. The low R-squared value suggests that other factors, such as training experience, competition type, or country (NOC), might better explain the variation in medal performance in future analysis.
ggplot(weightlifting, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black", alpha = 0.7) +
  labs(
    title = "Age Distribution of Weightlifters",
    x = "Age",
    y = "Frequency"
  ) +
  theme_minimal()
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_bin()`).
This histogram shows the distribution of ages among Olympic weightlifting medalists. Most athletes appear to be in the 20-30 year range, which aligns with the typical peak of performance in strength sports.
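The "most athletes are 20-30" reading of the histogram can be backed by a number. The sketch below uses invented ages (`toy_ages` is hypothetical, not taken from the dataset) to show one way to compute the share of athletes in that band while ignoring missing values.

```r
# Hypothetical ages, including one missing value
toy_ages <- c(19, 22, 24, 25, 27, 29, 31, 34, NA)

# TRUE when the age falls in the 20-30 band, NA when the age is missing
in_band <- toy_ages >= 20 & toy_ages <= 30

share_20_30 <- mean(in_band, na.rm = TRUE)    # proportion of known ages in the band
median_age  <- median(toy_ages, na.rm = TRUE) # typical age, robust to outliers

share_20_30
median_age
```

Applied to the real data, the same two lines on `weightlifting$Age` would give the share and median that the histogram only suggests visually.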
ggplot(weightlifting, aes(x = Sex, fill = Sex)) +
  geom_bar() +
  labs(
    title = "Gender Distribution in Weightlifting",
    x = "Gender",
    y = "Count"
  ) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()
This bar chart shows the gender distribution of Olympic weightlifting medalists across all Games in the dataset. Male athletes dominate, in part because women’s events were only added in 2000.
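A single bar chart pools every year together, so it cannot show how the balance between men and women changes over time. One way to look at that is to count medalists by year and sex. The rows below are invented for illustration, and `toy_medalists` and `share_by_year` are hypothetical names.

```r
library(dplyr)

# Hypothetical medalist rows: one row per medal won
toy_medalists <- data.frame(
  Year = c(1996, 1996, 2000, 2000, 2000, 2016, 2016, 2016, 2016),
  Sex  = c("M", "M", "M", "M", "F", "M", "M", "F", "F")
)

share_by_year <- toy_medalists %>%
  count(Year, Sex, name = "Medalists") %>%     # medalists per year and sex
  group_by(Year) %>%
  mutate(Share = Medalists / sum(Medalists)) %>%  # share within each Games
  ungroup()

share_by_year
```

Running the same steps on `weightlifting` would give a per-Games share that could be plotted as a line or stacked bar to check whether female representation actually rises after 2000.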
# This code creates a stacked bar chart to visualize the distribution of Gold, Silver, and Bronze medals
# for the top 10 countries in Olympic weightlifting of all time
top_10_countries %>%
  pivot_longer(cols = c(Gold, Silver, Bronze), names_to = "Medal", values_to = "Count") %>%
  ggplot(aes(x = reorder(NOC, -Total), y = Count, fill = Medal)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Medals by Country in Olympic Weightlifting All Time",
    x = "Country (NOC)",
    y = "Medal Count",
    fill = "Medal Type"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
This visualization focuses on the all-time medal distribution in Olympic weightlifting for the top 10 countries. By highlighting the total number of medals (Gold, Silver, and Bronze) each country has won, this chart provides insights into historical dominance in the sport.
# Interactive plot using plotly
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
interactive_plot <- ggplotly(
  ggplot(
    top_10_countries_last_50 %>%
      pivot_longer(cols = c(Gold, Silver, Bronze), names_to = "Medal", values_to = "Count"),
    aes(x = reorder(NOC, -Total), y = Count, fill = Medal)
  ) +
    geom_bar(stat = "identity", position = "stack") +
    labs(
      title = "Top 10 Countries in Olympic Weightlifting (1972–2016)",
      x = "Country (NOC)",
      y = "Total Medals",
      fill = "Medal Type"
    ) +
    scale_fill_viridis_d() +
    theme_classic()
)
interactive_plot
The goal of this visualization is to identify the countries that have dominated Olympic weightlifting in the past 50 years (1972–2016). By focusing on the top 10 countries with the most medals, this visualization highlights the distribution of Gold, Silver, and Bronze medals for each country. The interactive nature of the plot allows for deeper exploration of the data.
The visualizations provided the following insights:
Age Distribution: Most Olympic weightlifting medalists are between 20 and 30 years old, reflecting the typical peak performance age for strength sports.
Gender Distribution: Male athletes account for most Olympic weightlifting medals in the dataset. However, female representation has grown significantly since women’s weightlifting was added in 2000.
Country Dominance: China and Russia are the most dominant nations in Olympic weightlifting, with China excelling in Gold medal counts. Smaller nations like Bulgaria and North Korea also perform exceptionally well, highlighting the effectiveness of their specialized training programs.
Trends Over Time: The interactive visualization of the last 50 years (1972–2016) emphasizes the rise of China as a leading nation and the consistency of countries like Russia.
This project explored trends in Olympic weightlifting, focusing on age, gender, and country dominance over the past 50 years. The findings highlight the rise of countries like China and the consistency of countries such as Russia. The analysis provides valuable insights, identifying opportunities for deeper exploration, such as weight class performance or the role of training strategies.
By combining data science with a personal interest in strength sports, this project demonstrates how real-world datasets can provide meaningful insights into sports history and performance trends.