Problem Statement: The main objective of this analysis is to identify the major factors that influence an individual’s performance in powerlifting. In the sport, each player gets three attempts at each of three lifts: squat, bench press and deadlift. Championships take place around the world throughout the year. These competitions may have different age, weight and equipment categories, but the burning question everyone debates, whether as a spectator or as a powerlifter, is how much weight each player will lift successfully. By the end of this analysis, fans of the sport will have some inkling of their favorite player’s ability. The players themselves can also understand their strengths and weaknesses and plan to perform better in future competitions.
As a powerlifter trains throughout their career, many factors keep varying, both internal and external. We will dive into some of these to better understand and predict lifting ability, and we will analyze men and women separately. The available data covers players from their early teens to their late 70s, along with their bodyweights in those competitions. It also records the maximum lift by each player in their respective division, i.e. best squat, bench press and deadlift.
Approach: We first look at the overall trend in average weights lifted to see whether overall strength has increased over time. After that, the EDA focuses on the effect of age, bodyweight and equipment on a player’s strength in the three lifts. A linear regression model is then built to predict a player’s lift. The data also contains records where players failed to lift the required weight; we could have analyzed these, but too few such records were available. This can be done in the future when more data becomes available.
Findings & Insights:
For the analysis, base R is used for most of the data cleaning, manipulation and analysis. Some other packages are also used: readr, dplyr, ggplot2, plotly, broom, car, DT, lubridate and dvmisc. Users must install (install.packages("package_name")) and load these packages prior to the analysis.
The packages are loaded below, each with a short description:
# Loading all the packages required to perform the analysis
library(readr) # Reading CSV file
library(dplyr) # Majorly used for data cleaning and manipulation
library(ggplot2) # For creating better visualization
library(broom) # Turns output of built in R functions to tidy data frames
library(car) # Contains multiple functions to analyze regression output
library(DT) # Display data frames with better visualization in HTML
library(lubridate) # Extract date parts from date column
library(dvmisc) # To get MSE of linear model
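Since the packages have to be installed before they can be loaded, a small check like the one below can report which ones are missing. This is only a minimal sketch: `missing_packages` is a hypothetical helper name that is not part of the original analysis, and the install call is left commented out.

```r
# Hypothetical helper (not part of the original analysis):
# returns the names in `pkgs` that cannot be found in the current library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in the chunk above
required <- c("readr", "dplyr", "ggplot2", "broom", "car", "DT",
              "lubridate", "dvmisc")

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Running this before the `library()` calls avoids stopping halfway through the loading chunk on the first missing package.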
The data is obtained from Open Powerlifting, a project which aims to create and provide an open archive of the world’s powerlifting data. There is also a Github Repository where people can contribute to the project. Multiple projects have used this data to understand how powerlifting has evolved over the years.
A subset of the above data, containing 16 variables, can be obtained here. Below is the data dictionary of all variables:
Below are the steps for building the Master Dataset.
In the following steps, each of the above variables is handled separately.
# Importing CSV file
pl_data <- read_csv("ipf_lifts.csv")
# Display the data
head(pl_data)
# There are 41,152 observations with 16 variables
dim(pl_data)
# Analysing the structure of the data set
str(pl_data)
# Verifying if names are assigned properly
names(pl_data)
# Since all the weights are in kg, the suffix _kg can be removed
pl_data <- rename(pl_data,
bodyweight = bodyweight_kg,
weight_class = weight_class_kg,
best3squat = best3squat_kg,
best3bench = best3bench_kg,
best3deadlift = best3deadlift_kg)
# Removing duplicate rows at an overall level
pl_data <- unique(pl_data)
# Checking NA values
colSums(is.na(pl_data))
# There are many NAs spread across columns which can be due to valid/invalid reasons
# There's only 1 NA value in weight_class, which can be removed as it won't hinder our analysis
pl_data <- filter(pl_data, !is.na(weight_class))
colSums(is.na(pl_data))
# Handling NAs in bodyweight
pl_data_bw_check <-
pl_data %>%
filter(is.na(bodyweight)) %>%
select(bodyweight, weight_class)
# There are only 187 rows with NA
# Let's see if we can impute these NA values using the weight_class variable
unique(pl_data_bw_check$weight_class)
# The NAs are scattered across 6 classes. It won't make sense to impute these values since they make up only 187 rows (0.45%) of the data. Hence, we can safely remove them
pl_data <- filter(pl_data, !is.na(bodyweight))
colSums(is.na(pl_data))
# The remaining columns have high NA concentration, hence removing/imputing them might disrupt our analysis. Hence, we won't remove any other observation at this stage
# Unique values in character columns
unique(pl_data$sex)
unique(pl_data$event)
unique(pl_data$equipment)
unique(pl_data$division)
unique(pl_data$place)
unique(pl_data$federation)
# Since federation has only one unique value this variable can be removed
pl_data <- select(pl_data, -federation)
unique(pl_data$meet_name)
unique(pl_data$age_class)
# One class is incorrectly defined 80-999. Hence, correcting it
pl_data <- mutate(pl_data, age_class = ifelse(age_class == "80-999", "80-99", age_class))
unique(pl_data$age_class)
unique(pl_data$weight_class)
# Outlier checks will be performed on all numeric variables
## Age
boxplot(pl_data$age)
# The minimum age value of outlier is 79
out_age_check <- filter((pl_data %>%
group_by(age_class) %>%
summarize(count = sum(age > 79, na.rm = T))), count > 0)
sum(out_age_check$count) # 55
# There are 55 outlier observations with age > 79
# This is only about 0.1% of our data, and since very few people aged 79+
# participate in competitions, we can reasonably remove these observations
pl_data <- filter(pl_data, age < 79 | is.na(age))
## Bodyweight
boxplot(pl_data$bodyweight)
out_bw_check <- filter((pl_data %>%
group_by(weight_class) %>%
summarize(count = sum(bodyweight > 153.27, na.rm = T))), count > 0)
sum(out_bw_check$count) # 409
# There are 409 observations with bodyweights > 153.27
# This is less than 1% of the data. We can safely assume that not many powerlifters have such high bodyweights
pl_data <- filter(pl_data, bodyweight < 153.27 | is.na(bodyweight))
## best3squat
boxplot(pl_data$best3squat)
# There are negative values which need to be filtered out
out_bs_check <- filter((pl_data %>%
summarize(count = sum(best3squat < 0, na.rm = T))), count > 0)
pl_data <- filter(pl_data, best3squat > 0 | is.na(best3squat))
## best3bench
boxplot(pl_data$best3bench)
# There are negative values which need to be filtered out
out_bb_check <- filter((pl_data %>%
summarize(count = sum(best3bench < 0, na.rm = T))), count > 0)
pl_data <- filter(pl_data, best3bench > 0 | is.na(best3bench))
## best3deadlift
boxplot(pl_data$best3deadlift)
# There are negative values which need to be filtered out
out_bd_check <- filter((pl_data %>%
summarize(count = sum(best3deadlift < 0, na.rm = T))), count > 0)
pl_data <- filter(pl_data, best3deadlift > 0 | is.na(best3deadlift))
# Removing 3 columns as they won't be needed in our analysis
pl_mds <- select(pl_data, -c(division, place, meet_name))
# Reordering the columns
pl_mds <- pl_mds[, c(1,2,3,12,4,5,6,7,8,9,10,11)]
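The cut-offs used above (such as 79 for age and 153.27 for bodyweight) are read off boxplots, which flag points lying more than 1.5 times the interquartile range beyond the hinges. The sketch below shows the same rule with `boxplot.stats()` on made-up ages; the vector and variable names are illustrative assumptions, not values from the powerlifting data.

```r
# Toy ages (made up for illustration; not taken from the powerlifting data)
ages <- c(18, 21, 23, 25, 27, 30, 32, 35, 38, 41, 45, 50, 85, 90)

# boxplot.stats() applies the same rule as boxplot():
# points more than 1.5 * IQR beyond the hinges are returned in $out
stats <- boxplot.stats(ages)
outliers <- stats$out
upper_whisker <- stats$stats[5]  # largest value that is not an outlier

# Using min(outliers) as a cut-off assumes all outliers sit on the high side,
# which holds for this toy vector
cutoff <- if (length(outliers) > 0) min(outliers) else Inf

# Keep non-outliers and, as in the cleaning steps above, keep missing ages
ages_with_na <- c(ages, NA)
kept <- ages_with_na[ages_with_na < cutoff | is.na(ages_with_na)]
```

Deriving the cut-off from `boxplot.stats()` rather than typing it in by hand keeps the filter in step with the data if the file is refreshed.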
Here’s a glimpse of the Master Dataset:
# Creating subsets of final data by gender
pl_mds_m <- filter(pl_mds, sex == "M")
pl_mds_f <- filter(pl_mds, sex == "F")
age_ol_f <- boxplot(pl_mds_f$age)$out
# Removing outliers from age as it is an extremely small percentage
pl_mds_f <- filter(pl_mds_f, age < min(age_ol_f))
# There are some outliers in bodyweight which can hinder the analysis. Hence, removing them
bw_ol_m <- boxplot(pl_mds_m$bodyweight)$out
pl_mds_m <- filter(pl_mds_m, bodyweight < min(bw_ol_m))
bw_ol_f <- boxplot(pl_mds_f$bodyweight)$out
pl_mds_f <- filter(pl_mds_f, bodyweight < min(bw_ol_f))
# Adding a variable which extracts year from date
pl_mds_m <- mutate(pl_mds_m, year = year(date))
# Creating a dataframe which contains average lifts by year
year_avg_m <-
pl_mds_m %>%
group_by(year) %>%
summarize(avg_bs = mean(best3squat, na.rm = T),
avg_bb = mean(best3bench, na.rm = T),
avg_bd = mean(best3deadlift, na.rm = T))
# Plotting the trend of 3 lifts on a yearly basis
year_avg_m %>%
ggplot(aes(x = year)) +
geom_line(aes(y = avg_bs, color = "Squat")) +
geom_line(aes(y = avg_bb, color = "Bench press")) +
geom_line(aes(y = avg_bd, color = "Deadlift")) +
xlab("Year") +
ylab("Average weight lifted") +
scale_colour_manual(name = "Lift type", breaks = c("Deadlift", "Squat", "Bench press"),
values = c("green", "blue", "red")) +
ylim(140, 275) +
ggtitle("Average lifts over time - Male")
# Adding a variable which extracts year from date
pl_mds_f <- mutate(pl_mds_f, year = year(date))
# Creating a dataframe which contains average lifts by year
year_avg_f <-
pl_mds_f %>%
group_by(year) %>%
summarize(avg_bs = mean(best3squat, na.rm = T),
avg_bb = mean(best3bench, na.rm = T),
avg_bd = mean(best3deadlift, na.rm = T))
# Plotting the trend of 3 lifts on a yearly basis
year_avg_f %>%
ggplot(aes(x = year)) +
geom_line(aes(y = avg_bs, color = "Squat")) +
geom_line(aes(y = avg_bb, color = "Bench press")) +
geom_line(aes(y = avg_bd, color = "Deadlift")) +
xlab("Year") +
ylab("Average weight lifted") +
scale_colour_manual(name = "Lift type", breaks = c("Deadlift", "Squat", "Bench press"),
values = c("green", "blue", "red")) +
ggtitle("Average lifts over time - Female")
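The yearly averages for men and women are computed with the same steps, so the summary could be wrapped in one function and reused. The sketch below is one possible way to do that in base R; `yearly_avg` and the toy data frame are hypothetical and only illustrate the grouping logic.

```r
# Hypothetical helper: average of each lift per year, ignoring missing values
yearly_avg <- function(df, lifts = c("best3squat", "best3bench", "best3deadlift")) {
  aggregate(df[lifts], by = list(year = df$year), FUN = mean, na.rm = TRUE)
}

# Toy data (made up for illustration)
toy <- data.frame(
  year          = c(2000, 2000, 2001, 2001),
  best3squat    = c(200, 220, 210, NA),
  best3bench    = c(130, 150, NA, 140),
  best3deadlift = c(240, 260, 250, 270)
)

res <- yearly_avg(toy)
res
```

With the real data, the same function could be called once on the male subset and once on the female subset, assuming the `year` column has already been added.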
After observing the trend at an overall level throughout these years, we’ll try to see how age affects an individual’s lifting ability for all 3 lifts.
Male
pl_mds_m %>%
ggplot(aes(x = age, y = best3squat)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Squat lifts in kg") +
ggtitle("Effect of age on squat weights")
pl_mds_m %>%
ggplot(aes(x = age, y = best3bench)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Bench press lifts in kg") +
ggtitle("Effect of age on bench press weights")
pl_mds_m %>%
ggplot(aes(x = age, y = best3deadlift)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Deadlift weight in kg") +
ggtitle("Effect of age on deadlift weights")
Female
pl_mds_f %>%
ggplot(aes(x = age, y = best3squat)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Squat lifts in kg") +
ggtitle("Effect of age on squat weights")
pl_mds_f %>%
ggplot(aes(x = age, y = best3bench)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Bench press lifts in kg") +
ggtitle("Effect of age on bench press weights")
pl_mds_f %>%
ggplot(aes(x = age, y = best3deadlift)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Age") +
ylab("Deadlift weight in kg") +
ggtitle("Effect of age on deadlift weights")
From the above charts, it looks like women tend to maintain their peak strength for longer than men. The type of equipment used is also a major factor in an individual’s lifting ability: it has little influence on the deadlift but gives a large boost in the squat and bench press. Let’s now see how bodyweight affects lifting ability.
We’ll measure the overall effect of bodyweight on the lifting abilities.
Male
There is a very noticeable relation between bodyweight and the lifting ability of individuals: as bodyweight increases, strength and lifting ability increase. This can be seen in the charts below for all 3 lifts.
pl_mds_m %>%
ggplot(aes(x = bodyweight, y = best3squat)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Squat lifts in kg") +
ggtitle("Effect of bodyweight on squat weights")
pl_mds_m %>%
ggplot(aes(x = bodyweight, y = best3bench)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Bench press lifts in kg") +
ggtitle("Effect of bodyweight on bench press weights")
pl_mds_m %>%
ggplot(aes(x = bodyweight, y = best3deadlift)) +
geom_point(colour = "blue3") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Deadlift weight in kg") +
ggtitle("Effect of bodyweight on deadlift weights")
Female
It’s no surprise that we observe a trend similar to the men’s for bodyweight here. Bodyweight clearly holds a positive relation with strength.
pl_mds_f %>%
ggplot(aes(x = bodyweight, y = best3squat)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Squat lifts in kg") +
ggtitle("Effect of bodyweight on squat weights")
pl_mds_f %>%
ggplot(aes(x = bodyweight, y = best3bench)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Bench press weight in kg") +
ggtitle("Effect of bodyweight on bench press weights")
pl_mds_f %>%
ggplot(aes(x = bodyweight, y = best3deadlift)) +
geom_point(colour = "darkorange1") +
facet_wrap(~ equipment, nrow = 1) +
xlab("Bodyweight") +
ylab("Deadlift weight in kg") +
ggtitle("Effect of bodyweight on deadlift weights")
Therefore, going by the analysis of age and bodyweight, lifting strength decreases beyond a certain age, while higher bodyweight is associated with more lifting power. We’ll try to derive a clearer relation between strength, age and bodyweight using linear regression.
After the initial analysis, let’s build a regression model which can estimate the effect of age and bodyweight for men and women. The model is built at an overall level, and for that only SBD events are kept, since most of them have values for all 3 lifts. An additional metric, total_lift, which is the sum of the 3 lifts, is created for both the men’s and women’s datasets.
Male
The initial model was built with total_lift as the response variable, but multiple violations were observed during the assessment of model performance. Therefore, a log transformation is applied to total_lift to make the data more symmetric and closer to normal. This helped us achieve better model performance. The coefficient values are shown in the model output below. The final regression equation is,
log(total_lift) = 5.95 - 0.005038(age) + 0.006955(bodyweight) + 0.1152(single-ply) + 0.1255(wraps)
If we predict lifting ability for equipment type raw, both equipment coefficient terms become 0. If the prediction is for single-ply only, the wraps term becomes zero, and vice versa. As per the equation, if all other terms are held constant, then for every unit increase in age, total_lift decreases by approximately (0.005038 * 100)%, i.e. about 0.5%. Similarly, for every unit increase in bodyweight, lifting strength in men increases by approximately 0.7%. E.g., if the total weight lifted by a man is 500 kg, then a year later, keeping all other factors constant, his total will drop by about 0.5%, i.e. to about 497.5 kg. The respective age and bodyweight values, along with equipment type, can be used to predict an individual’s total lifting ability.
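The percentage reading of the coefficients can be checked with a few lines of arithmetic. Because the response is log(total_lift), a one-unit increase in a predictor multiplies the total by exp(coefficient); for small coefficients this is close to coefficient * 100 percent, while for the larger equipment coefficients the exact figure is somewhat higher (roughly 12% and 13%). The sketch below uses the coefficients from the equation above; `pct_change` is a hypothetical helper name.

```r
# Coefficients copied from the male regression equation
coef_m <- c(age = -0.005038, bodyweight = 0.006955,
            single_ply = 0.1152, wraps = 0.1255)

# Exact percentage change in total_lift for a one-unit increase in a predictor
pct_change <- function(beta) (exp(beta) - 1) * 100

round(pct_change(coef_m), 2)

# Worked example from the text: a 500 kg total, one year later,
# with bodyweight and equipment unchanged
total_next_year <- 500 * exp(coef_m[["age"]])
total_next_year
```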
# Filtering out records which has all 3 lifts
pl_mds_msbd <- filter(pl_mds_m, event == "SBD")
pl_mds_msbd <- mutate( pl_mds_msbd, total_lift = best3squat + best3bench + best3deadlift)
# Fitting linear model on log(total_lift)
fit_pl_mds_msbd <- lm(log(total_lift) ~ age + bodyweight + equipment, pl_mds_msbd)
summary(fit_pl_mds_msbd)
##
## Call:
## lm(formula = log(total_lift) ~ age + bodyweight + equipment,
## data = pl_mds_msbd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32996 -0.09124 0.01677 0.11114 0.45158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.950e+00 6.567e-03 906.186 < 2e-16 ***
## age -5.038e-03 9.042e-05 -55.720 < 2e-16 ***
## bodyweight 6.955e-03 5.948e-05 116.930 < 2e-16 ***
## equipmentSingle-ply 1.152e-01 3.285e-03 35.077 < 2e-16 ***
## equipmentWraps 1.255e-01 2.088e-02 6.012 1.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1627 on 15350 degrees of freedom
## (3458 observations deleted due to missingness)
## Multiple R-squared: 0.5276, Adjusted R-squared: 0.5275
## F-statistic: 4287 on 4 and 15350 DF, p-value: < 2.2e-16
get_mse(fit_pl_mds_msbd) # 0.0264
## [1] 0.02645952
sqrt(get_mse(fit_pl_mds_msbd)) # 0.1626
## [1] 0.1626638
vif(fit_pl_mds_msbd) # ~1. Hence, no multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## age 1.003551 1 1.001774
## bodyweight 1.004446 1 1.002220
## equipment 1.001642 2 1.000410
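The third column of the `vif()` output can be reproduced by hand. For a term with more than one degree of freedom, such as equipment with its two dummy variables, car reports a generalized VIF and scales it as GVIF^(1/(2*Df)) so that terms with different degrees of freedom are comparable. A minimal sketch using the values printed above:

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(age = 1.003551, bodyweight = 1.004446, equipment = 1.001642)
df   <- c(age = 1,        bodyweight = 1,        equipment = 2)

# Adjusted value reported in the third column: GVIF^(1 / (2 * Df))
adj_gvif <- gvif^(1 / (2 * df))
round(adj_gvif, 6)
```

All three adjusted values are very close to 1, which is what supports the "no multicollinearity" comment.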
Female
As with the model for men, multiple violations were observed with total_lift during model assessment. Therefore, the model is built on the log transformation of total_lift to make the data more symmetric. The coefficient values are shown in the model output below. The final regression equation is,
log(total_lift) = 5.327 - 0.00284(age) + 0.00934(bodyweight) + 0.1579(single-ply)
If we predict lifting ability for equipment type raw, the single-ply term becomes 0; there is no wraps term in the women’s model. As per the equation, if all other terms are held constant, then for every unit increase in age, total_lift decreases by approximately (0.00284 * 100)%, i.e. about 0.3%. Similarly, for every unit increase in bodyweight, lifting strength in women increases by approximately 0.9%. E.g., if the total weight lifted by a woman is 500 kg, then with a unit increase in bodyweight, keeping all other factors constant, her total will increase by about 0.9%, i.e. to about 504.5 kg. The respective age and bodyweight values, along with equipment type, can be used to predict an individual’s total lifting ability. These model results confirm some of our previous conclusions on age and bodyweight: men tend to lose their strength at a faster rate than women.
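To turn the equation into a prediction in kg, the fitted log value has to be exponentiated. The sketch below hard-codes the coefficients from the women’s model output; `predict_total_f` and the example lifter are hypothetical. Note that exponentiating a prediction made on the log scale gives something closer to a typical (median) total than to a mean, assuming roughly symmetric errors on the log scale.

```r
# Coefficients copied from the female model output
coef_f <- c(intercept = 5.3271011, age = -0.0028479,
            bodyweight = 0.0093483, single_ply = 0.1579675)

# Hypothetical helper: predicted total (kg) on the original scale
predict_total_f <- function(age, bodyweight, single_ply = FALSE) {
  log_total <- coef_f[["intercept"]] +
    coef_f[["age"]] * age +
    coef_f[["bodyweight"]] * bodyweight +
    coef_f[["single_ply"]] * as.numeric(single_ply)
  exp(log_total)
}

# Illustrative lifter (values made up): 30 years old, 63 kg
raw_total   <- predict_total_f(30, 63)
equipped    <- predict_total_f(30, 63, single_ply = TRUE)
one_kg_more <- predict_total_f(30, 64)

# Ratios recover the multiplicative effects of equipment and bodyweight
c(equipment = equipped / raw_total, bodyweight = one_kg_more / raw_total)
```

With the fitted model object available, `exp(predict(fit_pl_mds_fsbd, newdata = ...))` would do the same job without copying coefficients by hand.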
# Filtering out records which has all 3 lifts
pl_mds_fsbd <- filter(pl_mds_f, event == "SBD")
pl_mds_fsbd <- mutate( pl_mds_fsbd, total_lift = best3squat + best3bench + best3deadlift)
# Fitting linear model on log(total_lift)
fit_pl_mds_fsbd <- lm(log(total_lift) ~ age + bodyweight + equipment, pl_mds_fsbd)
summary(fit_pl_mds_fsbd)
##
## Call:
## lm(formula = log(total_lift) ~ age + bodyweight + equipment,
## data = pl_mds_fsbd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11020 -0.11411 0.00881 0.12530 0.50603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3271011 0.0118181 450.76 <2e-16 ***
## age -0.0028479 0.0001632 -17.45 <2e-16 ***
## bodyweight 0.0093483 0.0001545 60.52 <2e-16 ***
## equipmentSingle-ply 0.1579675 0.0048032 32.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1878 on 8051 degrees of freedom
## (429 observations deleted due to missingness)
## Multiple R-squared: 0.3851, Adjusted R-squared: 0.3848
## F-statistic: 1680 on 3 and 8051 DF, p-value: < 2.2e-16
get_mse(fit_pl_mds_fsbd) # 0.0352
## [1] 0.03526094
sqrt(get_mse(fit_pl_mds_fsbd)) # 0.1877
## [1] 0.187779
vif(fit_pl_mds_fsbd) # ~1. Hence, no multicollinearity
## age bodyweight equipment
## 1.006266 1.000555 1.006652
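The MSE and RMSE figures reported for both models are on the log scale, which makes them hard to read in kg. In the outputs above, the square root of the `get_mse()` value agrees with the residual standard error printed by `summary()`. The sketch below reproduces that relationship with base R on a built-in dataset (mtcars is used only as a stand-in, not as part of the analysis) and then converts the two reported RMSE values into multiplicative factors.

```r
# Toy model on a built-in dataset, only to show the arithmetic
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# MSE as residual sum of squares over residual degrees of freedom
mse  <- sum(residuals(fit)^2) / df.residual(fit)
rmse <- sqrt(mse)

# rmse matches the "Residual standard error" printed by summary()
all.equal(rmse, sigma(fit))

# On the log scale, an RMSE of s corresponds to a multiplicative factor exp(s).
# Values copied from the male and female model outputs above.
error_factor <- exp(c(male = 0.1627, female = 0.1878))
error_factor
```

Read this way, a typical residual corresponds to a predicted total that is off by roughly 18% for men and 21% for women.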
Below are some of the next steps that can be taken to improve the model and analysis. It also includes some limitations from current analysis.