What is Powerlifting?
Powerlifting is the ultimate strength competition. The International Powerlifting Federation heads nearly 100 national federations worldwide. The three powerlifts (squat, bench press and deadlift) are increasingly recognized as principal exercises in the development of an individual’s true strength. Analyzing the factors that affect performance in these lifts forms the core of this project.
Problem Statement:
Analyze the factors that affect a person’s ability to powerlift. This project is all about understanding this exciting and intense sport better.
Some of the key questions that will be answered in this project include (but are not limited to) -
Solution Overview/ Approach:
We will be using a subset of the International Powerlifting Data in this project. Once we are done with the tedious but necessary task of data cleaning, we will perform univariate and multivariate analysis of the attributes that affect an individual’s performance. We will also look at scatter plots and histograms broken down by multiple filters to visually comprehend the influence of different attributes on performance.
Finally, the relationship will be described with a regression model, which gives us a mathematical equation relating independent variables such as age, gender and body weight to the response variable - “ability to lift” - across different divisions and worldwide meets.
In short - this report aims to equip the reader with the ability to guesstimate and explain the lifting capacity of an individual when provided with basic data points such as gender, age and bodyweight.
The packages we will be using for this project are -
library(tidyverse) # To clean data, filter, subset, plot graphs (ggplot2), organize data in nicer data formats using tibble
library(DT) # To display the data in a clean table format with options to select/ de-select columns
library(knitr) # To display the data in a clean table format using kable()
library(plotly) # For interactive plots
library(lattice) # To plot multiple clean looking graphs
library(broom) # To get the regression output in a neat table format
library(car) # To calculate the variance inflation factor (vif()) while fitting a multiple regression model
All the packages required to run the document have been loaded.
This section walks through the steps from importing the data set to cleaning it. Each step is explained and the corresponding code is shown. An option has been provided to hide the code if the reader wishes to skip it. All tables are shown in a scrollable HTML format, which lets us filter and check only the columns of interest using the “Column Visibility” option in addition to the “Search bar” at the top of the table.
At the end of this step, the data set will be ready to use for the analysis.
The data set that is being used in this project can be found on the github page: IPF Cleaned Dataset (Note - This link works only when opened in a new window)
The original source of the data set which is a much larger dataset can be found at: Open Powerlifting
This data set aims to create an accessible, accurate and open archive of the world’s powerlifting data. The original dataset is updated every month, but the data set we are using for this project was last collected on September 20th 2019 and consists of data until August 26th 2019.
The dataset we are using in this project contains 41,152 observations with 16 variables holding information on the lifter’s gender, age, weight, lifting capacity, event participated in and the date of the meet. We notice that the missing values have been recorded as "NA". Some data entries to call out are negative values in the weight lifted variables, which indicate failed attempts. The dataset also contains information on people who were disqualified and on guest lifters, i.e. people who succeeded but aren’t eligible for awards.
Our preliminary observation suggests there might be a few data entry errors, outlier values and columns irrelevant to the analysis, which are dealt with in the subsequent data cleaning steps.
First, let us start with the important step of importing our dataset -
ipf <- read.csv("IPF.csv",stringsAsFactors = FALSE)
str(ipf)
## 'data.frame': 41152 obs. of 16 variables:
## $ name : chr "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
## $ sex : chr "M" "M" "M" "M" ...
## $ event : chr "SBD" "SBD" "SBD" "SBD" ...
## $ equipment : chr "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
## $ age : num NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
## $ age_class : chr NA "24-34" "35-39" "20-23" ...
## $ division : chr NA NA NA NA ...
## $ bodyweight_kg : num 67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
## $ weight_class_kg : chr "67.5" "67.5" "67.5" "67.5" ...
## $ best3squat_kg : num 205 225 245 195 240 ...
## $ best3bench_kg : num 140 132 158 110 140 ...
## $ best3deadlift_kg: num 225 235 270 240 215 230 235 335 310 295 ...
## $ place : chr "1" "2" "3" "4" ...
## $ date : chr "1985-08-03" "1985-08-03" "1985-08-03" "1985-08-03" ...
## $ federation : chr "IPF" "IPF" "IPF" "IPF" ...
## $ meet_name : chr "World Games" "World Games" "World Games" "World Games" ...
From the structure, we notice that the dataset has 41,152 rows and 16 columns - the dataset has been imported correctly. We see that the column names are appropriate and have suitable datatypes assigned, except for the column “date”, which is in character format.
We change the class of the date variable to Date format:
ipf$date <- as.Date(ipf$date)
We check the structure again to verify that the command worked and that no new NA’s were introduced.
str(ipf)
## 'data.frame': 41152 obs. of 16 variables:
## $ name : chr "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
## $ sex : chr "M" "M" "M" "M" ...
## $ event : chr "SBD" "SBD" "SBD" "SBD" ...
## $ equipment : chr "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
## $ age : num NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
## $ age_class : chr NA "24-34" "35-39" "20-23" ...
## $ division : chr NA NA NA NA ...
## $ bodyweight_kg : num 67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
## $ weight_class_kg : chr "67.5" "67.5" "67.5" "67.5" ...
## $ best3squat_kg : num 205 225 245 195 240 ...
## $ best3bench_kg : num 140 132 158 110 140 ...
## $ best3deadlift_kg: num 225 235 270 240 215 230 235 335 310 295 ...
## $ place : chr "1" "2" "3" "4" ...
## $ date : Date, format: "1985-08-03" "1985-08-03" ...
## $ federation : chr "IPF" "IPF" "IPF" "IPF" ...
## $ meet_name : chr "World Games" "World Games" "World Games" "World Games" ...
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight_kg
## 2906 2884 627 187
## weight_class_kg best3squat_kg best3bench_kg best3deadlift_kg
## 1 13698 2462 14028
## place date federation meet_name
## 0 0 0 0
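The check that the conversion introduced no new NA values can also be made explicit. A minimal sketch on a toy vector (the values are made up for illustration and are not taken from IPF.csv):

```r
# Toy stand-in for the raw character date column (hypothetical values).
raw_dates <- c("1985-08-03", "1990-11-17", NA, "2019-08-26")

# With an explicit format, as.Date() returns NA for strings it cannot parse
# instead of stopping with an error.
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# The conversion is safe if it created no NA that was not already there.
new_nas <- sum(is.na(parsed)) - sum(is.na(raw_dates))
print(new_nas)
```

A result of 0 means every non-missing string was parsed into a date.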
Let us see if the data is unique at this level -
ipf <- unique(ipf)
As there is no change in the number of observations, the data is unique at this level.
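The same uniqueness check can be written so that it reports a number rather than relying on a visual comparison. A small sketch on a toy data frame (hypothetical values):

```r
# Toy data frame with one exact duplicate row (hypothetical values).
toy <- data.frame(
  name = c("A", "B", "B", "C"),
  best3bench = c(100, 120, 120, 90),
  stringsAsFactors = FALSE
)

n_before <- nrow(toy)
toy_unique <- unique(toy)   # drops rows that repeat an earlier row exactly
n_removed <- n_before - nrow(toy_unique)

# anyDuplicated() returns 0 when every row is unique,
# otherwise the index of the first duplicated row.
first_dup <- anyDuplicated(toy)
```

If `n_removed` is 0 (equivalently, `anyDuplicated()` returns 0), the data is already unique at the row level.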
Let us also rename the column ‘place’ to ‘position’ and ‘date’ to ‘meet_date’ to make them more intuitive. We also drop the suffix “kg” from a few columns -
names(ipf)
## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight_kg"
## [9] "weight_class_kg" "best3squat_kg" "best3bench_kg" "best3deadlift_kg"
## [13] "place" "date" "federation" "meet_name"
colnames(ipf)[13] <- "position"
colnames(ipf)[8] <- "bodyweight"
colnames(ipf)[9] <- "weight_class"
colnames(ipf)[10] <- "best3squat"
colnames(ipf)[11] <- "best3bench"
colnames(ipf)[12] <- "best3deadlift"
colnames(ipf)[14] <- "meet_date"
names(ipf)
## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight"
## [9] "weight_class" "best3squat" "best3bench" "best3deadlift"
## [13] "position" "meet_date" "federation" "meet_name"
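Renaming by position works, but it silently breaks if the column order ever changes. As an alternative sketch (not the approach used in this report), columns can be renamed by name with a lookup vector; the toy data frame below is hypothetical and carries only a few of the original columns:

```r
# Toy data frame carrying some of the original column names (values made up).
toy <- data.frame(
  place = c("1", "2"),
  bodyweight_kg = c(67.5, 90),
  best3bench_kg = c(140, 132),
  date = c("1985-08-03", "1985-08-03"),
  stringsAsFactors = FALSE
)

# Lookup vector: old name -> new name.
renames <- c(
  place = "position",
  bodyweight_kg = "bodyweight",
  best3bench_kg = "best3bench",
  date = "meet_date"
)

# Find each old name's position, then overwrite it with the new name.
idx <- match(names(renames), names(toy))
stopifnot(!anyNA(idx))   # fail loudly if an expected column is missing
names(toy)[idx] <- unname(renames)
names(toy)
```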
Now, we have fixed the datatype and column names - Let’s move ahead!
We will remove the column ‘federation’ as it has only one unique value and would not add any value to the analysis
ipf <- ipf[,-15]
dim(ipf)
## [1] 41152 15
datatable(head(ipf) ,extensions = 'Buttons', selection = list(mode="multiple", selected = c(1,3,5), target = 'column') ,options = list(dom = 'Bfrtip', buttons = I('colvis')))
Of the columns age and age_class, we will probably use only one in our analysis. As the age classes are of unequal width, we will most likely stick to the age column, but let us not remove either column now, as it might turn out to be useful for the analysis later.
One way the age_class column could help is in looking at distributions across classes. As it has only 16 unique values, we will retain this column for now.
unique(ipf$age_class)
## [1] NA "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19"
## [9] "50-54" "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-999"
## [17] "5-12"
We see that one particular class is wrongly named, so we rename the age class 80-999 to 80-99. However, as the age classes are of unequal width, we might not use this column and might stick to the age column.
ipf <- mutate(ipf,age_class = ifelse(age_class == "80-999","80-99",age_class))
unique(ipf$age_class)
## [1] NA "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19" "50-54"
## [10] "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-99" "5-12"
Let us check for any similar outlier in the weight_class column:
unique(ipf$weight_class)
## [1] "67.5" "90" "90+" "52" "56" "60" "75" "82.5" "100"
## [10] "110" "125" "125+" "44" "48" "67.5+" "100+" "57" "63"
## [19] "72" "84" "84+" "74" "83" "105" "120" "120+" "66"
## [28] "105+" "72+" "53" "59" "93" "43" "47" "110+" "82.5+"
## [37] NA "40" "75+"
For the weight class column, the classes are again irregular, with many different groups, so this column might not be useful when we try to correlate the amount of weight lifted with body weight. However, we will try to standardize the weight classes, as there are too many of them. Let us create a new column with equally spaced buckets.
(labs <- c("1-10","10-20","20-30","30-40","40-50","50-60","60-70","70-80",
"80-90","90-100","100-110","110-120","120-130","130-140","140-150",
"150-160","160-170","170-180","180-190","190-200","200-210","210-220"
,"220-230","230-240","240+"))
## [1] "1-10" "10-20" "20-30" "30-40" "40-50" "50-60" "60-70"
## [8] "70-80" "80-90" "90-100" "100-110" "110-120" "120-130" "130-140"
## [15] "140-150" "150-160" "160-170" "170-180" "180-190" "190-200" "200-210"
## [22] "210-220" "220-230" "230-240" "240+"
ipf$bodyweight_class <- cut(ipf$bodyweight, breaks = c(seq(0, 240,
by = 10), Inf),labels = labs, right = FALSE)
unique(ipf$bodyweight_class)
## [1] 60-70 90-100 <NA> 50-60 40-50 70-80 80-90 100-110 120-130
## [10] 110-120 130-140 140-150 150-160 160-170 170-180 190-200 180-190 30-40
## [19] 200-210 210-220 240+
## 25 Levels: 1-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 ... 240+
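To see how `cut()` treats values that fall exactly on a bucket boundary, here is a small sketch. It builds the labels programmatically, and, as an assumption that differs from the hand-typed vector above, writes the first label as "0-10" to match the first break at 0. The bodyweights are hypothetical.

```r
# Build the 25 labels programmatically. Assumption: the first label is
# written "0-10" here, whereas the hand-typed vector in the report uses "1-10".
lower <- seq(0, 230, by = 10)
labs <- c(paste0(lower, "-", lower + 10), "240+")

# right = FALSE makes each bucket closed on the left: [60, 70), [70, 80), ...
weights <- c(59.99, 60, 67.5, 70, 240, NA)   # hypothetical bodyweights
buckets <- cut(weights, breaks = c(seq(0, 240, by = 10), Inf),
               labels = labs, right = FALSE)
as.character(buckets)
```

A bodyweight of exactly 60 lands in "60-70" rather than "50-60", 240 lands in "240+", and a missing bodyweight stays NA.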
Now, we will look at the summary statistics of the numeric variables. This will help us understand how the values are distributed in these columns:
summary(ipf$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.50 22.50 31.50 34.77 45.00 93.50 2906
summary(ipf$bodyweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 37.29 60.00 75.55 81.15 97.30 240.00 187
summary(ipf$best3squat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -210.0 160.0 215.0 217.6 270.0 490.0 13698
summary(ipf$best3bench)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -160.0 97.5 140.0 144.7 185.0 415.0 2462
summary(ipf$best3deadlift)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -215.0 170.0 222.5 221.8 270.0 420.0 14028
Let us also check for missing values before we determine what our next steps should be -
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2906 2884 627 187
## weight_class best3squat best3bench best3deadlift
## 1 13698 2462 14028
## position meet_date meet_name bodyweight_class
## 0 0 0 187
Having looked at the numerical summaries, we see that the data has both extreme values and missing values. We will have to clean this dataset further, so let us deal with each of these variables separately in the next step of the data cleaning process.
As there are missing values in multiple columns, we clean the data one column at a time, checking each for outliers and missing values.
For columns - weight_class and bodyweight
As the weight class column has only one missing value, we remove that row as it doesn’t add any value.
ipf <- ipf[!is.na(ipf$weight_class),]
unique(ipf$weight_class)
## [1] "67.5" "90" "90+" "52" "56" "60" "75" "82.5" "100"
## [10] "110" "125" "125+" "44" "48" "67.5+" "100+" "57" "63"
## [19] "72" "84" "84+" "74" "83" "105" "120" "120+" "66"
## [28] "105+" "72+" "53" "59" "93" "43" "47" "110+" "82.5+"
## [37] "40" "75+"
Missing value for the Weight Class column has been removed.
We could impute the missing values of bodyweight from the column weight class, but this wouldn’t make much sense: the weight class is a range, and the payoff would be just 187 rows (~0.5% of the data), recovered with error after a lot of computation. So, we just go ahead and remove these rows.
ipf <- ipf[!is.na(ipf$bodyweight),]
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2830 2808 613 0
## weight_class best3squat best3bench best3deadlift
## 0 13663 2452 13990
## position meet_date meet_name bodyweight_class
## 0 0 0 0
We also see that the missing values for the body weight column have been removed. As we do not get any additional information from the old column “weight_class”, we get rid of it and check that the dataset is still unique at this level.
colnames(ipf)
## [1] "name" "sex" "event" "equipment"
## [5] "age" "age_class" "division" "bodyweight"
## [9] "weight_class" "best3squat" "best3bench" "best3deadlift"
## [13] "position" "meet_date" "meet_name" "bodyweight_class"
ipf <- ipf[,-9]
ipf <- unique(ipf)
Let us check for outliers in the column bodyweight now,
boxplot(ipf$bodyweight)
As we observe a lot of outliers in the upper range, let us follow the standard 1.5 * IQR approach to deal with them: any value above the third quartile plus 1.5 times the interquartile range is treated as an outlier.
(outlier_bodyweight <- quantile(ipf$bodyweight, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf$bodyweight, na.rm = TRUE)))
## 75%
## 153.25
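The cutoff above is the upper Tukey fence. A minimal sketch of the rule on a made-up vector, computing both fences at once; the helper name `tukey_fences` is hypothetical and is not part of the report's code:

```r
# Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
tukey_fences <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, NA)   # made-up values
fences <- tukey_fences(x)
fences                           # lower = -3.5, upper = 14.5
x[which(x > fences["upper"])]    # 100
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and the fences sit 6.75 beyond each quartile.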
We check the body weight column to see how many rows have a value above 153.25.
ipf_outlier_bodyweight <- filter(ipf,bodyweight > 153.25)
We observe that there are 410 such observations. We decide to go ahead and remove these observations.
ipf <- ipf %>%
filter((bodyweight <= 153.25) %>% replace_na(TRUE))
For columns - age and age class
We will check if we can derive the missing values from either of the 2 columns age and age class.
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2801 610 0
## best3squat best3bench best3deadlift position
## 13476 2405 13802 0
## meet_date meet_name bodyweight_class
## 0 0 0
ipf_age_check <- ipf[is.na(ipf$age_class),c(5,6)]
We observe that there are 2823 and 2801 missing values in the columns age and age class, respectively.
sum(ipf_age_check$age,na.rm = T)
## [1] 0.5
filter(ipf_age_check,ipf_age_check$age == 0.5)
## age age_class
## 1 0.5 <NA>
Among the rows with a missing age class, the only one with a recorded age has the value 0.5, from which we cannot infer anything meaningful, i.e., we will not be able to impute the missing values from this column. This value could be a data entry error.
As we cannot infer the value of age, it doesn’t make sense to impute these values. We decide to leave these rows as they are: removing them would lose ~7% of the data, and we can still use the information from their other columns.
class(ipf$age)
## [1] "numeric"
Now, let us look at the outliers for the column age.
boxplot(ipf$age)
We notice that there are a lot of outliers at the upper end of the scale.
(outlier_age <- quantile(ipf$age, c(0.75), na.rm=TRUE) +
(1.5*IQR(ipf$age, na.rm = TRUE)))
## 75%
## 80
We check the age column to see how many rows have a value above 80.
ipf_greater80 <- filter(ipf,ipf$age > 80)
We decide to remove the rows where the age is greater than 80, as there are only 41 such observations. We also get rid of the observation where age is 0.5, which seems to be a clear anomaly, and of the row in which the age class is “5-12”, as it has only one record.
ipf <- ipf %>%
filter((age <= 80) %>% replace_na(TRUE))
ipf <- ipf %>%
filter(age != 0.5 | is.na(age))
ipf <- ipf %>%
filter(age_class != "5-12" | is.na(age_class))
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2800 610 0
## best3squat best3bench best3deadlift position
## 13445 2404 13773 0
## meet_date meet_name bodyweight_class
## 0 0 0
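The filters above are written so that rows with a missing age survive, which is why the NA count for age is unchanged. A small sketch of the difference on a toy tibble (made-up ages), assuming dplyr and tidyr are installed as part of the tidyverse:

```r
library(dplyr)
library(tidyr)

toy <- tibble(age = c(25, NA, 85, 0.5, 80))   # made-up ages

# filter() keeps only rows where the condition is TRUE, so the NA row is lost.
dropped_na <- toy %>% filter(age <= 80)

# Turning NA into TRUE in the logical condition retains the NA row.
kept_na <- toy %>% filter((age <= 80) %>% replace_na(TRUE))

# An equivalent spelling without replace_na().
kept_na_alt <- toy %>% filter(age <= 80 | is.na(age))

c(nrow(dropped_na), nrow(kept_na), nrow(kept_na_alt))
```

The parentheses around the comparison matter: `%>%` binds more tightly than `<=`, so without them `replace_na()` would be applied to the constant 80 rather than to the logical result.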
For the division column, there are 610 missing values. We could impute them if the distribution of age within classes were consistent, but this doesn’t seem to be the case, and hence these values cannot be imputed.
However, we will not remove the missing values now, as we are not sure if this variable will be used in the analysis. Also, considering the number of NA’s (~1.5% of the total data), we decided to leave them as they are. Another major reason is that these rows belong to the time period 1985-1994, and their other columns could provide useful information during the analysis.
Also, let us look at the unique values and count for each of the type of event.
table(ipf$event)
##
## B SB SBD
## 12343 2 28166
As we have only 2 observations for SB, we remove them. The rationale is that 2 is too small a number to understand a pattern or draw insights from.
ipf <- ipf %>%
filter((event != "SB") %>% replace_na(TRUE))
Now, we are left with 3 numeric columns - best3squat, best3bench and best3deadlift.
colSums(is.na(ipf))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2823 2800 610 0
## best3squat best3bench best3deadlift position
## 13445 2404 13771 0
## meet_date meet_name bodyweight_class
## 0 0 0
Although we have approximately 13,500 and 13,800 missing values for squat and deadlift, we are not really concerned about them, as a majority fall under the event “B”, where these values are not expected. This is shown below:
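# Parentheses are needed: %>% binds tighter than !=, so without them replace_na() acts on 'B' alone
ipf_checkB <- ipf %>%
filter((event != 'B') %>% replace_na(TRUE))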
colSums(is.na(ipf_checkB))
## name sex event equipment
## 0 0 0 0
## age age_class division bodyweight
## 2519 2498 327 0
## best3squat best3bench best3deadlift position
## 1102 1349 1428 0
## meet_date meet_name bodyweight_class
## 0 0 0
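Another way to see where the missing lift values sit is to count them per event in a single table instead of filtering first. A sketch on a toy data frame (made-up values; only two of the lift columns are included):

```r
# Toy data: bench-only ("B") entries have no squat by design.
toy <- data.frame(
  event = c("B", "B", "B", "SBD", "SBD", "SBD"),
  best3squat = c(NA, NA, NA, 200, NA, 180),
  best3bench = c(120, NA, 100, 130, 110, 95),
  stringsAsFactors = FALSE
)

# Count missing values per event for each lift column.
na_by_event <- sapply(
  toy[c("best3squat", "best3bench")],
  function(col) tapply(is.na(col), toy$event, sum)
)
na_by_event
```

Rows of the result are events and columns are lifts, so a large count in the "B" row for squat confirms that those NA's are structural rather than data quality problems.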
We observe that the resulting dataset with the event type “SBD” has approximately 1,100 to 1,500 missing values per lift, compared to the previously seen ~13k missing values. So, we split the dataset into 2 datasets, one for Bench and one for SBD.
ipf_Bench <- ipf %>%
filter(event == 'B')
ipf_SBD <- ipf %>%
filter(event == 'SBD')
Now, we will have to look at the 3 variables - Bench, Deadlift and Squat - for treatment.
For the dataset ipf_Bench, however, only the variable bench applies:
boxplot(ipf_Bench$best3bench)
(outlier_bench <- quantile(ipf_Bench$best3bench, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_Bench$best3bench, na.rm = TRUE)))
## 75%
## 347.5
We observe that there are outliers above the best3bench value of 347.5. Let us see how many rows have a value above 347.5.
ipf_outlier_bench <- filter(ipf_Bench,best3bench > 347.5)
We decide to remove the rows where the value is above 347.5, as there are only 22 such observations.
ipf_outlier_bench_lesszero <- filter(ipf_Bench,best3bench < 0)
We also check for negative values and observe that we do not have any.
ipf_Bench <- ipf_Bench %>%
filter((best3bench <= 347.5) %>% replace_na(TRUE))
Now, let us proceed to look at the other dataset where we have observations for all the 3 variables - Squat, Deadlift and Bench i.e. we will look at the dataset ipf_SBD.
For bench in event SBD,
boxplot(ipf_SBD$best3bench)
We observe that there are outliers at both extremes. Although there is a reason for the existence of negative values (failed attempts), we go ahead and remove them, as they are rare events with too few observations for us to draw any conclusions or insights.
(outlier_bench_SBD_upper <- quantile(ipf_SBD$best3bench, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))
## 75%
## 292.5
(outlier_bench_SBD_lower <- quantile(ipf_SBD$best3bench, c(0.25), na.rm = TRUE) -
(1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))
## 25%
## -27.5
We see outliers above 292.5 and below -27.5. Let us check the number of rows having values above 292.5 or below -27.5.
ipf_outlier_bench_SBD <- filter(ipf_SBD,best3bench > 292.5 |
best3bench < -27.5)
There are 151 such observations, and we decide to remove them.
ipf_SBD <- ipf_SBD %>%
filter((best3bench <= 292.5 & best3bench >= -27.5) %>%
replace_na(TRUE))
For squat in event SBD,
boxplot(ipf_SBD$best3squat)
Again, we notice that there are outliers on both positive and negative ends.
(outlier_squat_SBD_upper <- quantile(ipf_SBD$best3squat, c(0.75), na.rm = TRUE) +
(1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))
## 75%
## 422.5
(outlier_squat_SBD_lower <- quantile(ipf_SBD$best3squat, c(0.25), na.rm = TRUE) -
(1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))
## 25%
## 2.5
Here, we observe outliers above 422.5 and below 2.5. Let us check the number of rows that have values above 422.5 or below 2.5.
ipf_outlier_squat_SBD <- filter(ipf_SBD,best3squat > 422.5 |
best3squat < 2.5)
There are 25 such observations, and we decide to remove them.
ipf_SBD <- ipf_SBD %>%
filter((best3squat <= 422.5 & best3squat >= 2.5)
%>% replace_na(TRUE))
For deadlift in event SBD,
boxplot(ipf_SBD$best3deadlift)
We notice that there are outliers only at the lower end. However, let us check both ends for outliers, along with the number of rows that have a value above 420 or below 20.
(outlier_deadlift_SBD_upper <- quantile(ipf_SBD$best3deadlift, c(0.75), na.rm = TRUE)
+ (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))
## 75%
## 420
(outlier_deadlift_SBD_lower <- quantile(ipf_SBD$best3deadlift, c(0.25), na.rm = TRUE)
- (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))
## 25%
## 20
ipf_outlier_deadlift_SBD <- filter(ipf_SBD,best3deadlift > 420 |
best3deadlift < 20)
There are 3 such observations, and we decide to get rid of them.
ipf_SBD <- ipf_SBD %>%
filter((best3deadlift <= 420 & best3deadlift >= 20) %>%
replace_na(TRUE))
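The same compute-fences-then-filter pattern is repeated for bench, squat and deadlift, so it could be wrapped in a helper. The sketch below is one possible consolidation in base R, not the code used in this report; the function name `drop_iqr_outliers` and the toy values are hypothetical.

```r
# Hypothetical helper: drop rows whose value in `col` lies outside the
# Tukey fences, while keeping rows where `col` is NA.
drop_iqr_outliers <- function(df, col, k = 1.5) {
  x <- df[[col]]
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  inside <- x >= q[[1]] - k * iqr & x <= q[[2]] + k * iqr
  df[is.na(inside) | inside, , drop = FALSE]
}

toy <- data.frame(best3bench = c(100, 110, 120, 130, 140, 500, -150, NA))
cleaned <- drop_iqr_outliers(toy, "best3bench")
cleaned$best3bench
```

On the toy column the quartiles are 105 and 135, so the fences are 60 and 180; the values 500 and -150 are dropped and the NA row is kept. Applying such a helper to several columns in sequence recomputes the fences on the already-filtered rows, which mirrors the step-by-step order followed above.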
As we evaluated the event types separately, to avoid removing values that are not outliers within a specific group, we now combine them to form the final dataset and reorder the columns.
ipf_main <- rbind(ipf_Bench,ipf_SBD)
ipf_main <- ipf_main[,c(1:8,15,9:14)]
We perform a few sanity checks on the combined dataset and display the head of the final dataset we will be using for the analysis:
datatable(head(ipf_main) ,extensions = 'Buttons', selection = list(mode = "multiple", selected = c(1,3,5), target = 'column') ,options = list(dom = 'Bfrtip', buttons = I('colvis')))
ipf <- unique(ipf_main)
Finally, after cleaning the dataset, we are left with a data frame of 40,308 rows and 15 columns.
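Sanity checks of this kind can be written as assertions that stop the script when something is off. A minimal sketch on toy stand-ins for the two subsets (made-up values; the specific checks are assumptions about what is worth verifying, not a record of the checks that were run):

```r
# Toy stand-ins for the two cleaned subsets (values are made up).
bench <- data.frame(event = "B", best3squat = NA_real_,
                    best3bench = c(120, 150), stringsAsFactors = FALSE)
sbd <- data.frame(event = "SBD", best3squat = c(200, 220, 180),
                  best3bench = c(130, 140, 110), stringsAsFactors = FALSE)

combined <- rbind(bench, sbd)

# 1. No rows were lost or duplicated by the row bind.
stopifnot(nrow(combined) == nrow(bench) + nrow(sbd))
# 2. Only the expected event types remain.
stopifnot(all(combined$event %in% c("B", "SBD")))
# 3. No failed (negative) lifts survive; NA entries are ignored.
lifts <- combined[c("best3squat", "best3bench")]
stopifnot(all(lifts > 0, na.rm = TRUE))
```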
Let us look at the summary statistics for the numerical variables:
age: The minimum and maximum values are 13.5 and 80 after removing ~40 outlier observations. The descriptive statistics are shown below:
summary(ipf_main$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 13.50 22.50 31.50 34.78 45.50 80.00 2822
bodyweight: The minimum and maximum values are 37.29 and 153.24 after removing ~410 outlier observations. The descriptive statistics are shown below:
summary(ipf_main$bodyweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.29 60.00 75.00 80.10 94.80 153.24
best3squat: The minimum and maximum values are 25 and 422.5 after removing ~25 outlier observations. The descriptive statistics are shown below:
summary(ipf_main$best3squat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 160.0 212.5 215.0 265.0 422.5 13415
best3bench: The minimum and maximum values are 25 and 347.5 after removing ~170 outlier observations. The descriptive statistics are shown below:
summary(ipf_main$best3bench)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 95.0 140.0 142.7 182.5 347.5 2398
best3deadlift: The minimum and maximum values are 25 and 420 after removing ~3 outlier observations. The descriptive statistics are shown below:
summary(ipf_main$best3deadlift)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 25.0 170.0 220.0 220.3 270.0 420.0 13739
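From the summary statistics of the numeric columns, we observe that the three lift variables are not strongly skewed, as the difference between mean and median is minimal. Age and bodyweight are mildly right-skewed, with the mean exceeding the median by about 3 years and 5 kg respectively.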
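One way to put a number on the mean-versus-median comparison is Pearson's second (median) skewness coefficient. This is a technique swapped in here for illustration and is not something the report computes; the helper name `median_skew` and the vectors are hypothetical.

```r
# Pearson's second (median) skewness coefficient: 3 * (mean - median) / sd.
# Near 0 suggests symmetry; positive values suggest a longer right tail.
median_skew <- function(x) {
  x <- x[!is.na(x)]
  3 * (mean(x) - median(x)) / sd(x)
}

symmetric <- c(1, 2, 3, 4, 5)          # made-up values
right_tail <- c(1, 2, 3, 4, 20, NA)    # made-up values

median_skew(symmetric)    # 0
median_skew(right_tail)   # positive
```

Dividing by the standard deviation puts columns with different units, such as years and kilograms, on a comparable scale.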
A few comments on the dataset -
We have removed outliers and treated NA’s to a major extent
For the column - “age”, we could have derived some of the NA values from the column “age_class”
Although we will not be using the “name” column in our analysis - we will retain this column in the main dataset due to the following reason
Data Dictionary:
data_dictionary <- read.csv("Data_Dictionary.csv",stringsAsFactors = FALSE)
kable(data_dictionary)
variable_name | data_type | description |
---|---|---|
name | character | Individual lifter name |
sex | character | Binary gender (M/F) |
event | character | The type of competition that the lifter entered - unique values are SBD(Squat Bench Deadlift) and B(Bench) |
equipment | character | The equipment category under which the lifts were performed. Values such as Raw (Bare Knees or Sleeves), Wraps(with knee wraps) and Single-ply (single-ply suits) |
age | double | The age of the lifter on the start date of the meet, if known. |
age_class | character | The age class in which the lifter falls, for example 40-45 |
division | character | Text describing the division of competition, like Open or Juniors 20-23 or Professional. |
bodyweight | double | The recorded bodyweight of the lifter at the time of competition, to two decimal places. |
bodyweight_class | character | The weight class in which the lifter competed, to two decimal places. Max specified by number and min specified by the symbol “+” to the right |
best3squat | double | Maximum of the first three successful attempts for the lift - squat. |
best3bench | double | Maximum of the first three successful attempts for the lift - bench. |
best3deadlift | double | Maximum of the first three successful attempts for the lift - deadlift. |
position | character | The recorded place of the lifter in the given division at the end of the meet. Special Values: G - Guest Lifter, DQ - Disqualified, DD - Doping Disqualification |
meet_date | date | The date of the meet |
meet_name | character | The name of the meet without the year. Can be seen in conjunction with the date column |
We want to test multiple hypotheses to understand the powerlifting dataset in detail.
Some of the questions which will help us in this process are:
To be able to answer these questions, we will have to create various subsets of the data at different levels such as age - weight lifted, age - equipment - weight lifted, gender - weight lifted, gender - equipment - weight lifted, gender - position, gender - meet - date. The data summarized at these levels will help us answer the questions listed above. For a few questions, we might have to merge some subsets together.
We will look at histograms to understand the distributions of gender and age classes participating in powerlifting, the type of equipment being used, and whether these vary across different meets.
Once we cover ggplot in detail, we will use other intuitive plots such as violin plots to understand where the bulk of the population is concentrated. We can also make the scatter plots more reader-friendly, which will help us understand the relationships between variables and identify a pattern, if one exists.
We also plan to perform hypothesis testing to compare the mean lifting capacity of different groups, and finally to fit a linear regression model to check for linear relationships between the variables and quantify any that exist. We intend to use response variables such as weight lifted and position (numeric values only).
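As a rough sketch of what the planned test and model could look like, the example below runs a two-sample t-test and a multiple linear regression on simulated data. Everything about the simulation (sample size, coefficients, noise level) is an assumption chosen for illustration; it says nothing about the actual IPF results.

```r
set.seed(1)   # simulated data only, not the IPF dataset

n <- 200
sim <- data.frame(
  sex = sample(c("F", "M"), n, replace = TRUE),
  age = runif(n, 18, 60),
  bodyweight = runif(n, 50, 120)
)
# Assumed data-generating process, chosen purely for illustration.
sim$best3bench <- 20 + 1.0 * sim$bodyweight + 40 * (sim$sex == "M") -
  0.3 * sim$age + rnorm(n, sd = 10)

# Welch two-sample t-test comparing mean bench press between the sexes.
tt <- t.test(best3bench ~ sex, data = sim)

# Multiple linear regression of bench press on age, bodyweight and sex.
fit <- lm(best3bench ~ age + bodyweight + sex, data = sim)
coef(fit)
```

The fitted coefficients give the mathematical equation mentioned in the overview: an intercept plus one slope per predictor, with `sexM` measuring the shift for male lifters relative to the baseline level "F".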
More details will be added during the next phase of the project, stay tuned!
This will be updated during the next phase of the project!