Do you know how much you can lift?

Introduction

What is Powerlifting?

Powerlifting is the ultimate strength competition. The International Powerlifting Federation (IPF) is the governing body for nearly 100 national federations worldwide. The three powerlifts (squat, bench press and deadlift) are increasingly recognized as principal exercises for developing an individual’s true strength. Analyzing the factors that affect performance in these lifts forms the core of this project.

Problem Statement:

Analyze the various factors affecting a person’s ability to powerlift. This project is all about understanding this exciting and intense sport better.

Some of the key questions that will be answered in this project include (but are not limited to):

Does age play a role in the amount of weight a person lifts?
Is the performance influenced by the equipment being used?
How does the body weight of an individual correlate with the amount of weight they can lift?
What role does gender play in this competition of strength?
….

Solution Overview/ Approach:

We will be using a subset of the International Powerlifting data in this project. Once we are done with the tedious but necessary task of data cleaning, we will perform univariate and multivariate analysis of the attributes affecting an individual’s performance. We will also look at scatter plots and histograms broken down by multiple filters to visually understand the influence of different attributes on performance.

Finally, the relationship will be explained with the help of a regression model, which will give us a mathematical equation relating independent variables such as age, gender and body weight to the response variable, “ability to lift”, across different divisions and worldwide meets.

In short, this report aims to equip the reader to estimate and explain the lifting capacity of an individual when provided with basic data points such as gender, age and bodyweight.

Packages Required

The packages we will be using for this project are -

library(tidyverse)  # To clean data, filter, subset, plot graphs (ggplot2), organize data in nicer data formats using tibble 
library(DT)         # To display the data in a clean table format with options to select/ de-select columns
library(knitr)      # To display the data in a clean table format, helps with commands such as head
library(plotly)     # For interactive plots
library(lattice)    # To plot multiple clean looking graphs
library(broom)      # To get the regression output in a neat table format
library(car)        # To calculate variance inflation factor (VIF()) while fitting a multiple regression model

All the packages required to run the document have been loaded.

Data Preparation

This section covers the steps from importing the data set to cleaning it. Each step is explained and the corresponding code is shown. An option has been provided to hide the code if the reader wishes to skip it. All tables are provided in a scrollable HTML format. This lets us filter and check only the columns of interest using the “Column Visibility” option, in addition to the search bar at the top of each table.

At the end of this step, the data set will be ready to use for the analysis.

3.1 Data Source and Explanation

The data set that is being used in this project can be found on the github page: IPF Cleaned Dataset (Note - This link works only when opened in a new window)

The original source of the data set which is a much larger dataset can be found at: Open Powerlifting

This data set aims to create an accessible, accurate and open archive of the world’s powerlifting data. The original dataset is updated every month; the version we are using for this project was last collected on September 20th 2019 and contains data up to August 26th 2019.

The dataset we are using in this project contains 41,152 observations of 16 variables, with information on each lifter’s gender, age, weight, lifting capacity, the event entered and the date of the meet. Missing values are recorded as "NA". Some entries worth calling out are negative values in the weight-lifted variables, which indicate failed attempts. The dataset also contains information on people who were disqualified and on guest lifters, i.e. people who succeeded but are not eligible for awards.

Our preliminary observation suggests there might be a few data entry errors, outlier values and columns irrelevant to the analysis, which are dealt with in the subsequent data cleaning steps.

3.2 Importing Dataset and Preliminary Checks

First, let us start with the important step of importing our dataset -

ipf <- read.csv("IPF.csv",stringsAsFactors = FALSE)
str(ipf)

## 'data.frame':    41152 obs. of  16 variables:
##  $ name            : chr  "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
##  $ sex             : chr  "M" "M" "M" "M" ...
##  $ event           : chr  "SBD" "SBD" "SBD" "SBD" ...
##  $ equipment       : chr  "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
##  $ age             : num  NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
##  $ age_class       : chr  NA "24-34" "35-39" "20-23" ...
##  $ division        : chr  NA NA NA NA ...
##  $ bodyweight_kg   : num  67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
##  $ weight_class_kg : chr  "67.5" "67.5" "67.5" "67.5" ...
##  $ best3squat_kg   : num  205 225 245 195 240 ...
##  $ best3bench_kg   : num  140 132 158 110 140 ...
##  $ best3deadlift_kg: num  225 235 270 240 215 230 235 335 310 295 ...
##  $ place           : chr  "1" "2" "3" "4" ...
##  $ date            : chr  "1985-08-03" "1985-08-03" "1985-08-03" "1985-08-03" ...
##  $ federation      : chr  "IPF" "IPF" "IPF" "IPF" ...
##  $ meet_name       : chr  "World Games" "World Games" "World Games" "World Games" ...

From the structure, we see that the dataset has 41,152 rows and 16 columns, so it has been imported correctly. The column names are appropriate and each column has a suitable data type, except for “date”, which is stored as character.

We change the class of the date variable to Date:

ipf$date <- as.Date(ipf$date)

We check the structure again to verify that the command worked and that no new NAs were introduced.

str(ipf)

## 'data.frame':    41152 obs. of  16 variables:
##  $ name            : chr  "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
##  $ sex             : chr  "M" "M" "M" "M" ...
##  $ event           : chr  "SBD" "SBD" "SBD" "SBD" ...
##  $ equipment       : chr  "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
##  $ age             : num  NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
##  $ age_class       : chr  NA "24-34" "35-39" "20-23" ...
##  $ division        : chr  NA NA NA NA ...
##  $ bodyweight_kg   : num  67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
##  $ weight_class_kg : chr  "67.5" "67.5" "67.5" "67.5" ...
##  $ best3squat_kg   : num  205 225 245 195 240 ...
##  $ best3bench_kg   : num  140 132 158 110 140 ...
##  $ best3deadlift_kg: num  225 235 270 240 215 230 235 335 310 295 ...
##  $ place           : chr  "1" "2" "3" "4" ...
##  $ date            : Date, format: "1985-08-03" "1985-08-03" ...
##  $ federation      : chr  "IPF" "IPF" "IPF" "IPF" ...
##  $ meet_name       : chr  "World Games" "World Games" "World Games" "World Games" ...

colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division    bodyweight_kg 
##             2906             2884              627              187 
##  weight_class_kg    best3squat_kg    best3bench_kg best3deadlift_kg 
##                1            13698             2462            14028 
##            place             date       federation        meet_name 
##                0                0                0                0
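One way to make the date check explicit is to compare the NA count before and after the conversion. The sketch below uses a small made-up vector rather than the IPF data; `raw_dates`, `parsed` and `new_nas` are illustrative names. With an explicit `format`, `as.Date()` returns NA for strings that do not match, so any increase in the NA count points to values that failed to parse.

```r
# Hypothetical input: two valid ISO dates, one missing value, one malformed entry
raw_dates <- c("1985-08-03", "2019-08-26", NA, "03/08/1985")

# With an explicit format, strings that do not match become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Count the NAs introduced by the conversion itself
new_nas <- sum(is.na(parsed)) - sum(is.na(raw_dates))
new_nas
```

A result of zero on the real column would confirm that every non-missing date string was parsed.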

Let us see if the data is unique at this level -

ipf <- unique(ipf)

As there is no change in the number of observations, the data is unique at this level.
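The uniqueness check can also be made explicit by counting duplicates before dropping them. This is a minimal sketch on a made-up data frame (`toy` is not part of the report); it relies only on base R's `duplicated()` and `unique()`.

```r
# Toy data frame with one exact duplicate row (illustrative only)
toy <- data.frame(
  name = c("A", "B", "B", "C"),
  meet_date = as.Date(c("2019-01-05", "2019-01-05", "2019-01-05", "2019-02-10")),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later copies of a row
n_dupes <- sum(duplicated(toy))

# unique() drops them, so the row counts differ exactly when duplicates exist
rows_before <- nrow(toy)
rows_after <- nrow(unique(toy))
c(duplicates = n_dupes, before = rows_before, after = rows_after)
```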

Let us also rename the column ‘place’ to ‘position’ and ‘date’ to ‘meet_date’ to make them more intuitive. We also drop the suffix “_kg” from a few columns:

names(ipf)

##  [1] "name"             "sex"              "event"            "equipment"       
##  [5] "age"              "age_class"        "division"         "bodyweight_kg"   
##  [9] "weight_class_kg"  "best3squat_kg"    "best3bench_kg"    "best3deadlift_kg"
## [13] "place"            "date"             "federation"       "meet_name"

colnames(ipf)[13] <- "position"
colnames(ipf)[8] <- "bodyweight"
colnames(ipf)[9] <- "weight_class"
colnames(ipf)[10] <- "best3squat"
colnames(ipf)[11] <- "best3bench"
colnames(ipf)[12] <- "best3deadlift"
colnames(ipf)[14] <- "meet_date"
names(ipf)

##  [1] "name"          "sex"           "event"         "equipment"    
##  [5] "age"           "age_class"     "division"      "bodyweight"   
##  [9] "weight_class"  "best3squat"    "best3bench"    "best3deadlift"
## [13] "position"      "meet_date"     "federation"    "meet_name"

Now, we have fixed the datatype and column names - Let’s move ahead!
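Renaming by position works here, but it silently breaks if the column order ever changes. As a hedged alternative, the sketch below renames by matching on the old names. It operates on a plain character vector (`cols`, an illustrative stand-in for `names(ipf)`) so that it runs without the data file.

```r
# Column names as they appear in the imported data (subset, for illustration)
cols <- c("name", "bodyweight_kg", "weight_class_kg", "best3squat_kg",
          "best3bench_kg", "best3deadlift_kg", "place", "date")

# Strip a trailing "_kg" wherever it occurs
cols <- sub("_kg$", "", cols)

# Rename individual columns by matching on the old name
cols[cols == "place"] <- "position"
cols[cols == "date"] <- "meet_date"
cols
```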

We will remove the column ‘federation’ as it has only one unique value and would not add any value to the analysis

ipf <- ipf[,-15]
dim(ipf)

## [1] 41152    15

datatable(head(ipf) ,extensions = 'Buttons', selection = list(mode="multiple", selected = c(1,3,5), target = 'column') ,options = list(dom = 'Bfrtip', buttons = I('colvis')))

Of the columns age and age_class, we will probably use only one in our analysis. Because the age classes are not evenly spaced, we expect to rely mainly on the age column, but we will not remove either column yet in case it turns out to be useful later.

One way the age_class column could help is in comparing distributions across classes, since it has only 16 unique non-missing values. We will retain this column for now.

unique(ipf$age_class)

##  [1] NA       "24-34"  "35-39"  "20-23"  "40-44"  "45-49"  "16-17"  "18-19" 
##  [9] "50-54"  "13-15"  "55-59"  "60-64"  "65-69"  "70-74"  "75-79"  "80-999"
## [17] "5-12"

We see that one particular class is wrongly named, so we rename the age class 80-999 to 80-99. As noted above, we may still end up using only the age column.

ipf <- mutate(ipf,age_class = ifelse(age_class == "80-999","80-99",age_class))
unique(ipf$age_class)

##  [1] NA      "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19" "50-54"
## [10] "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-99" "5-12"

Let us check for any similar outlier in the weight_class column:

unique(ipf$weight_class)

##  [1] "67.5"  "90"    "90+"   "52"    "56"    "60"    "75"    "82.5"  "100"  
## [10] "110"   "125"   "125+"  "44"    "48"    "67.5+" "100+"  "57"    "63"   
## [19] "72"    "84"    "84+"   "74"    "83"    "105"   "120"   "120+"  "66"   
## [28] "105+"  "72+"   "53"    "59"    "93"    "43"    "47"    "110+"  "82.5+"
## [37] NA      "40"    "75+"

For the weight_class column, the classes are again unevenly spaced and numerous, so this column might not be useful when we relate the amount of weight lifted to bodyweight. Instead, we standardize it by creating a new column with equally spaced 10 kg buckets based on bodyweight.

(labs <- c("1-10","10-20","20-30","30-40","40-50","50-60","60-70","70-80",
           "80-90","90-100","100-110","110-120","120-130","130-140","140-150",
           "150-160","160-170","170-180","180-190","190-200","200-210","210-220"
           ,"220-230","230-240","240+"))

##  [1] "1-10"    "10-20"   "20-30"   "30-40"   "40-50"   "50-60"   "60-70"  
##  [8] "70-80"   "80-90"   "90-100"  "100-110" "110-120" "120-130" "130-140"
## [15] "140-150" "150-160" "160-170" "170-180" "180-190" "190-200" "200-210"
## [22] "210-220" "220-230" "230-240" "240+"

ipf$bodyweight_class <- cut(ipf$bodyweight, breaks = c(seq(0, 240, 
                            by = 10), Inf),labels = labs, right = FALSE)
unique(ipf$bodyweight_class)

##  [1] 60-70   90-100  <NA>    50-60   40-50   70-80   80-90   100-110 120-130
## [10] 110-120 130-140 140-150 150-160 160-170 170-180 190-200 180-190 30-40  
## [19] 200-210 210-220 240+   
## 25 Levels: 1-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 ... 240+
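To see how `cut()` treats values that sit exactly on a bucket edge, here is a small sketch using the same breaks and `right = FALSE`, which makes each interval closed on the left and open on the right. The weights are made up, and the labels are generated from the break points rather than typed by hand, so the first label reads "0-10" instead of "1-10".

```r
# Bucket edges: 0, 10, ..., 240, then an open-ended last bucket
breaks <- c(seq(0, 240, by = 10), Inf)

# Labels generated from the edges (25 labels for 25 intervals)
lower <- seq(0, 230, by = 10)
labs_generated <- c(paste0(lower, "-", lower + 10), "240+")

# Made-up bodyweights chosen to sit on or near bucket edges
weights <- c(59.9, 60, 67.5, 240, 250)
buckets <- cut(weights, breaks = breaks, labels = labs_generated, right = FALSE)
as.character(buckets)
```

With `right = FALSE`, a bodyweight of exactly 60 falls in "60-70" rather than "50-60", and 240 falls in the open-ended "240+" bucket.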

Now, we will look at the summary statistics of the numeric variables. This will help us understand how the values are distributed in these columns:

summary(ipf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.50   22.50   31.50   34.77   45.00   93.50    2906

summary(ipf$bodyweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   37.29   60.00   75.55   81.15   97.30  240.00     187

summary(ipf$best3squat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -210.0   160.0   215.0   217.6   270.0   490.0   13698

summary(ipf$best3bench)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -160.0    97.5   140.0   144.7   185.0   415.0    2462

summary(ipf$best3deadlift)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -215.0   170.0   222.5   221.8   270.0   420.0   14028

Let us also check for missing values before we determine what our next steps should be -

colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2906             2884              627              187 
##     weight_class       best3squat       best3bench    best3deadlift 
##                1            13698             2462            14028 
##         position        meet_date        meet_name bodyweight_class 
##                0                0                0              187

Having looked at the numerical summaries, we see that the data does have both extreme values and missing values. We will have to clean this dataset further, so let us deal with each of these variables separately in the next step of the data cleaning process.

3.3 Missing Value and Outlier Treatment

We observe that there are missing values in multiple columns, so we clean the data column by column, checking each for outliers and missing values.

For Columns - weight_class and bodyweight

As the weight_class column has only one missing value, we remove that row, since dropping it costs us almost nothing.

ipf <- ipf[!is.na(ipf$weight_class),]
unique(ipf$weight_class)

##  [1] "67.5"  "90"    "90+"   "52"    "56"    "60"    "75"    "82.5"  "100"  
## [10] "110"   "125"   "125+"  "44"    "48"    "67.5+" "100+"  "57"    "63"   
## [19] "72"    "84"    "84+"   "74"    "83"    "105"   "120"   "120+"  "66"   
## [28] "105+"  "72+"   "53"    "59"    "93"    "43"    "47"    "110+"  "82.5+"
## [37] "40"    "75+"

The missing value in the weight_class column has been removed.

We could impute the missing bodyweight values from the weight_class column, but this would not make much sense: the weight class is a range, so the imputed values would be imprecise, and the payoff is only 187 additional rows (~0.5% of the data). So we simply remove these rows.

ipf <- ipf[!is.na(ipf$bodyweight),]

colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2830             2808              613                0 
##     weight_class       best3squat       best3bench    best3deadlift 
##                0            13663             2452            13990 
##         position        meet_date        meet_name bodyweight_class 
##                0                0                0                0

We see that the missing values in the bodyweight column have been removed. As the old column “weight_class” gives us no additional information, we drop it and check whether the dataset is still unique at this level.

colnames(ipf)

##  [1] "name"             "sex"              "event"            "equipment"       
##  [5] "age"              "age_class"        "division"         "bodyweight"      
##  [9] "weight_class"     "best3squat"       "best3bench"       "best3deadlift"   
## [13] "position"         "meet_date"        "meet_name"        "bodyweight_class"

ipf <- ipf[,-9]
ipf <- unique(ipf)

Let us now check for outliers in the bodyweight column:

boxplot(ipf$bodyweight)

We observe many outliers at the upper end, so we follow the standard 1.5 × IQR approach to deal with them.

(outlier_bodyweight <- quantile(ipf$bodyweight, c(0.75), na.rm = TRUE) + 
    (1.5 * IQR(ipf$bodyweight, na.rm = TRUE)))

##    75% 
## 153.25

We filter the rows with bodyweight above 153.25 to see how many observations are affected.

ipf_outlier_bodyweight <- filter(ipf,bodyweight >  153.25)

We observe that there are 410 such observations and decide to remove them.

ipf <-  ipf %>% 
  filter((bodyweight <= 153.25) %>% replace_na(TRUE))
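The 1.5 × IQR rule is applied several times in this section, so it can be convenient to wrap it in a helper. The function below is a hypothetical sketch (`iqr_fences` is not part of the report's code) and is demonstrated on made-up numbers. It uses the default quantile type, the same one `IQR()` uses.

```r
# Hypothetical helper: lower and upper fences under the k * IQR rule
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up values with one obvious extreme
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- iqr_fences(x)

# Values falling outside the fences
flagged <- x[x < fences["lower"] | x > fences["upper"]]
flagged
```

Here the quartiles are 3.25 and 7.75, so the fences are -3.5 and 14.5 and only the value 100 is flagged.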

For Columns - age and age class

We will check if we can derive the missing values from either of the 2 columns age and age class.

colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2801              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13476             2405            13802                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

ipf_age_check <- ipf[is.na(ipf$age_class),c(5,6)]

We observe that there are 2823 and 2801 missing values in the columns age and age_class, respectively.

sum(ipf_age_check$age,na.rm = T)

## [1] 0.5

filter(ipf_age_check,ipf_age_check$age == 0.5)

##   age age_class
## 1 0.5      <NA>

Among the rows with a missing age class, the only non-missing age is 0.5, which is most likely a data entry error. In other words, wherever the age class is missing the age is effectively missing too, so we cannot use one column to fill in the other for these rows.

As we cannot infer the value of age here, it does not make sense to impute these values. We leave these rows as they are: removing them would cost us ~7% of the data, and we can still use the information in their other columns.
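A cross-tabulation of missingness is a compact way to see whether one column could fill in the other. The sketch below uses made-up vectors standing in for age and age_class; the counts it produces are not those of the IPF data.

```r
# Toy columns standing in for age and age_class (values are made up)
age <- c(24, NA, 35.5, NA, 0.5, NA)
age_class <- c("24-34", NA, "35-39", "20-23", NA, NA)

# Cross-tabulate missingness: rows = age missing?, columns = age_class missing?
miss <- table(age_missing = is.na(age), class_missing = is.na(age_class))
miss

# Only rows with a known age class but an unknown age could be
# approximated from age_class
recoverable <- sum(is.na(age) & !is.na(age_class))
recoverable
```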

class(ipf$age)

## [1] "numeric"

Now, let us look at the outliers for the column age.

boxplot(ipf$age)

We notice a number of outliers at the upper end of the age range.

(outlier_age <- quantile(ipf$age, c(0.75), na.rm=TRUE) + 
    (1.5*IQR(ipf$age, na.rm = TRUE)))

## 75% 
##  80

We filter the rows with age above 80 to see how many observations are affected.

ipf_greater80 <- filter(ipf, age > 80)

We decide to remove the rows where age is greater than 80, as there are only 41 such observations. We also remove the observation where age is 0.5, which is a clear anomaly, and the single record whose age class is “5-12”.

ipf <-  ipf %>% 
         filter((age <= 80) %>% replace_na(TRUE))
ipf <- ipf %>% 
       filter(age != 0.5 | is.na(age))
ipf <- ipf %>% 
  filter(age_class != "5-12" | is.na(age_class))
colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2800              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13445             2404            13773                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0
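The filters above keep rows whose age is missing. `dplyr::filter()` drops rows where the condition evaluates to NA, which is why the conditions are wrapped in `replace_na(TRUE)` or combined with `is.na()`. The same logic can be sketched in base R on made-up ages, without loading any packages:

```r
# Made-up ages, including a missing value and two values to be dropped
age <- c(24, NA, 85, 0.5, 45)

# The raw comparison is NA wherever age is NA
cond <- age <= 80 & age != 0.5

# Treat "unknown" as "keep", mirroring replace_na(TRUE) in the pipeline above
keep <- ifelse(is.na(cond), TRUE, cond)
age[keep]
```

The rows with age 85 and 0.5 are dropped, while the row with a missing age survives.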

For the division column, there are 610 missing values. We could impute them if the distribution of age within divisions were consistent, but this does not seem to be the case here, so these values cannot be imputed.

However, we will not remove these rows now, as we are not sure whether this variable will be used in the analysis, and the NAs make up only ~1.5% of the data. Another major reason is that these rows belong to the period 1985-1994, and their other columns could still be useful in the analysis.

Also, let us look at the unique values and counts for each type of event.

table(ipf$event)

## 
##     B    SB   SBD 
## 12343     2 28166

As we have only 2 observations for SB, we remove them: two observations are too few to reveal a pattern or support any insight.

ipf <- ipf %>% 
  filter((event != "SB") %>%  replace_na(TRUE))

Now, we are left with 3 numeric columns - best3squat, best3bench and best3deadlift.

colSums(is.na(ipf))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2800              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13445             2404            13771                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

Although we have approximately 13,400 and 13,800 missing values for squat and deadlift, we are not really concerned about them, as the majority fall under the event “B” (bench only), where these values are not expected. This is shown below:

ipf_checkB <-  ipf %>% 
               filter((event != 'B') %>% replace_na(TRUE))
colSums(is.na(ipf_checkB))

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2519             2498              327                0 
##       best3squat       best3bench    best3deadlift         position 
##             1102             1349             1428                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

We observe that the resulting dataset, which contains only the event type “SBD”, has roughly 1,100 to 1,450 missing values per lift, compared with the ~13k seen previously. So we split the data into two datasets, one for Bench and one for SBD:

# Parentheses make replace_na() apply to the comparison result, not to the string
ipf_Bench <- ipf %>% 
  filter((event == 'B') %>% replace_na(TRUE))
ipf_SBD <- ipf %>% 
  filter((event == 'SBD') %>% replace_na(TRUE))
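The claim that most missing squat and deadlift values come from bench-only entries can be checked directly by counting NAs within each event type. The sketch below does this with base R's `tapply()` on a made-up data frame (`toy`); the counts are illustrative, not those of the IPF data.

```r
# Toy data: bench-only ("B") entries have no squat result by design
toy <- data.frame(
  event = c("B", "B", "SBD", "SBD", "SBD"),
  best3squat = c(NA, NA, 205, NA, 245),
  stringsAsFactors = FALSE
)

# Number of missing squat values within each event type
na_by_event <- tapply(is.na(toy$best3squat), toy$event, sum)
na_by_event
```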

Now, we will look at the three variables bench, deadlift and squat for outlier treatment.

For the dataset ipf_Bench, only the bench variable is relevant:

boxplot(ipf_Bench$best3bench)

(outlier_bench <- quantile(ipf_Bench$best3bench, c(0.75), na.rm = TRUE) + 
      (1.5 * IQR(ipf_Bench$best3bench, na.rm = TRUE)))

##   75% 
## 347.5

We observe outliers above a best3bench of 347.5. Let us see how many rows have a value above 347.5.

ipf_outlier_bench <- filter(ipf_Bench,best3bench >  347.5)

There are only 22 such observations, so we decide to remove them.

ipf_outlier_bench_lesszero <- filter(ipf_Bench, best3bench < 0)

We also check for negative values and find that there are none.

ipf_Bench <-  ipf_Bench %>% 
  filter((best3bench <= 347.5) %>% replace_na(TRUE))

Now, let us proceed to look at the other dataset where we have observations for all the 3 variables - Squat, Deadlift and Bench i.e. we will look at the dataset ipf_SBD.

For bench in event SBD,

boxplot(ipf_SBD$best3bench)

We observe outliers at both extremes. Although there is a reason for the negative values (they indicate failed attempts), we remove them, as they are rare and too few for us to draw any conclusions or insights.

(outlier_bench_SBD_upper <- quantile(ipf_SBD$best3bench, c(0.75), na.rm = TRUE) + 
    (1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))

##   75% 
## 292.5

(outlier_bench_SBD_lower <- quantile(ipf_SBD$best3bench, c(0.25), na.rm = TRUE) - 
    (1.5 * IQR(ipf_SBD$best3bench, na.rm = TRUE)))

##   25% 
## -27.5

We see outliers above 292.5 and below -27.5. Let us check the number of rows with values outside this range.

ipf_outlier_bench_SBD <- filter(ipf_SBD, best3bench > 292.5 |
                                best3bench < -27.5)

There are 151 such observations, and we decide to remove the rows where the value is above 292.5 or below -27.5.

ipf_SBD <-  ipf_SBD %>% 
  filter((best3bench <= 292.5 & best3bench >= -27.5) %>% 
           replace_na(TRUE))

For squat in event SBD,

boxplot(ipf_SBD$best3squat)

Again, we notice that there are outliers on both positive and negative ends.

(outlier_squat_SBD_upper <- quantile(ipf_SBD$best3squat, c(0.75), na.rm = TRUE) + 
    (1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))

##   75% 
## 422.5

(outlier_squat_SBD_lower <- quantile(ipf_SBD$best3squat, c(0.25), na.rm = TRUE) - 
    (1.5 * IQR(ipf_SBD$best3squat, na.rm = TRUE)))

## 25% 
## 2.5

Here, we observe outliers above 422.5 and below 2.5. Let us check the number of rows with values outside this range.

ipf_outlier_squat_SBD <- filter(ipf_SBD, best3squat > 422.5 | 
                                  best3squat < 2.5)

There are 25 such observations, and we decide to remove them.

ipf_SBD <- ipf_SBD %>% 
  filter((best3squat <= 422.5 & best3squat >= 2.5) 
         %>% replace_na(TRUE))

For deadlift in event SBD,

boxplot(ipf_SBD$best3deadlift)

We notice that there are outliers only at the lower end. However, let us compute the fences on both sides and count the rows that fall outside them.

(outlier_deadlift_SBD_upper <- quantile(ipf_SBD$best3deadlift, c(0.75), na.rm = TRUE) 
  + (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))

## 75% 
## 420

(outlier_deadlift_SBD_lower <- quantile(ipf_SBD$best3deadlift, c(0.25), na.rm = TRUE) 
  - (1.5 * IQR(ipf_SBD$best3deadlift, na.rm = TRUE)))

## 25% 
##  20

ipf_outlier_deadlift_SBD <- filter(ipf_SBD, best3deadlift > 420 | 
                                best3deadlift < 20)

There are 3 such observations, and we decide to remove them.

ipf_SBD <- ipf_SBD %>% 
  filter((best3deadlift <= 420 & best3deadlift >= 20) %>% 
           replace_na(TRUE))

As we evaluated the event types separately to avoid removal of values that are not outliers within a specific group, we will now combine them to form the final dataset and reorder the columns.

ipf_main <- rbind(ipf_Bench,ipf_SBD)
ipf_main <- ipf_main[,c(1:8,15,9:14)]
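The reason for treating each event type separately is that a single fence computed over the pooled data can mislabel values that are typical within their own group. As a hedged sketch of the same idea without splitting the data frame, the code below computes an upper fence per group with `tapply()` and compares each row against the fence of its own group. The data frame `toy` and its values are made up.

```r
# Toy bench results for two event types (values are made up)
toy <- data.frame(
  event = rep(c("B", "SBD"), each = 6),
  best3bench = c(100, 110, 120, 130, 140, 400,
                 50, 60, 70, 80, 90, 300),
  stringsAsFactors = FALSE
)

# Upper fence (Q3 + 1.5 * IQR) computed separately for each event
upper_by_event <- tapply(toy$best3bench, toy$event, function(v) {
  unname(quantile(v, 0.75, na.rm = TRUE)) + 1.5 * IQR(v, na.rm = TRUE)
})

# Compare every row against the fence of its own group
toy$is_outlier <- toy$best3bench > unname(upper_by_event[as.character(toy$event)])
toy[toy$is_outlier, ]
```

In this toy example the fences are 175 for "B" and 125 for "SBD", so one value is flagged in each group.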

We perform a few sanity checks on the combined dataset and display the head of the final dataset we will use for the analysis.

datatable(head(ipf_main) ,extensions = 'Buttons', selection = list(mode = "multiple", selected = c(1,3,5), target = 'column') ,options = list(dom = 'Bfrtip', buttons = I('colvis')))

ipf <- unique(ipf_main)

Finally, after cleaning, we are left with a data frame of 40,308 rows and 15 columns.

Let us look at the summary statistics for the numerical variables:

age: The minimum and maximum values are 13.5 and 80 after removing ~40 outlier observations. The descriptive statistics are shown below:

summary(ipf_main$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.50   22.50   31.50   34.78   45.50   80.00    2822

bodyweight: The minimum and maximum values are 37.29 and 153.24 after removing ~410 outlier observations. The descriptive statistics are shown below:

summary(ipf_main$bodyweight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.29   60.00   75.00   80.10   94.80  153.24

best3squat: The minimum and maximum values are 25 and 422.5 after removing ~25 outlier observations. The descriptive statistics are shown below:

summary(ipf_main$best3squat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0   160.0   212.5   215.0   265.0   422.5   13415

best3bench: The minimum and maximum values are 25 and 347.5 after removing ~170 outlier observations. The descriptive statistics are shown below:

summary(ipf_main$best3bench)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0    95.0   140.0   142.7   182.5   347.5    2398

best3deadlift: The minimum and maximum values are 25 and 420 after removing ~3 outlier observations. The descriptive statistics are shown below:

summary(ipf_main$best3deadlift)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0   170.0   220.0   220.3   270.0   420.0   13739

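From the summary statistics of the numeric columns, we observe that the three lift variables are not noticeably skewed, as the difference between mean and median is minimal. For age and bodyweight the mean sits somewhat above the median (34.78 vs 31.50 and 80.10 vs 75.00), which suggests a mild right skew.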
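Comparing mean and median is an informal check. One simple way to put a number on it is Pearson's second skewness coefficient, 3 × (mean − median) / sd, which is zero for a symmetric sample and positive when the right tail is longer. The helper `median_skew` below is a hypothetical sketch, shown on made-up samples rather than the IPF columns.

```r
# Pearson's second skewness coefficient: 3 * (mean - median) / sd
median_skew <- function(x) {
  x <- x[!is.na(x)]
  3 * (mean(x) - median(x)) / sd(x)
}

# Made-up samples: one symmetric, one with a long right tail
symmetric <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 2, 3, 4, 50)
c(symmetric = median_skew(symmetric), right_tailed = median_skew(right_tailed))
```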

3.4 Final summary of the Dataset

A few comments on the dataset -

We have removed outliers and treated NAs to a major extent
- There might still be some outliers, but given our limited context on the data, we refrain from removing these values
For the column “age”, we could have derived some of the NA values from the column “age_class”
- However, this involves a series of data transformations, and we observed that only ~10% of the NA values would be fixed this way
- As we already have ~37k non-missing values, the remaining NAs for age should not cause us any trouble
- So we do not perform this treatment, as the gain is small for the amount of computation required
Although we will not use the “name” column in our analysis, we retain it in the main dataset for the following reason
- During the analysis, when we subset the data and come across an interesting finding that is particular to a small subset of the population, this column will help us identify those observations
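For reference, the transformation mentioned above (deriving an approximate age from the age class) could look roughly like the sketch below. This was not done in this report. `class_midpoint` is a hypothetical helper, and using the midpoint of the class is an assumption that introduces error of up to half the class width.

```r
# Hypothetical helper: midpoint of an age class string such as "24-34"
class_midpoint <- function(age_class) {
  parts <- strsplit(age_class, "-", fixed = TRUE)
  vapply(parts, function(p) {
    # A well-formed class splits into exactly two bounds; anything else gives NA
    if (length(p) == 2) mean(as.numeric(p)) else NA_real_
  }, numeric(1))
}

midpoints <- class_midpoint(c("24-34", "80-99", NA))
midpoints
```

A missing age could then be filled with the midpoint of its class wherever the class is known, at the cost of the approximation noted above.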

Data Dictionary:

data_dictionary <- read.csv("Data_Dictionary.csv",stringsAsFactors = FALSE)
kable(data_dictionary)

variable_name	data_type	description
name	character	Individual lifter name
sex	character	Binary gender (M/F)
event	character	The type of competition that the lifter entered - unique values are SBD(Squat Bench Deadlift) and B(Bench)
equipment	character	The equipment category under which the lifts were performed. Values such as Raw (Bare Knees or Sleeves), Wraps(with knee wraps) and Single-ply (single-ply suits)
age	double	The age of the lifter on the start date of the meet, if known.
age_class	character	The age class in which the lifter falls, for example 40-44
division	character	Text describing the division of competition, like Open or Juniors 20-23 or Professional.
bodyweight	double	The recorded bodyweight of the lifter at the time of competition, to two decimal places.
bodyweight_class	factor	The 10 kg bodyweight bucket derived from bodyweight, for example 60-70. The last bucket, 240+, is open-ended
best3squat	double	Maximum of the first three successful attempts for the lift - squat.
best3bench	double	Maximum of the first three successful attempts for the lift - bench.
best3deadlift	double	Maximum of the first three successful attempts for the lift - deadlift.
position	character	The recorded place of the lifter in the given division at the end of the meet. Special Values: G - Guest Lifter, DQ - Disqualified, DD - Doping Disqualification
meet_date	date	The date of the meet
meet_name	character	The name of the meet without the year. Can be read in conjunction with the date column

Exploratory Data Analysis

We want to test multiple hypotheses to understand the powerlifting dataset in detail.

Some of the questions which will help us in this process are:

How is the weight lifted by a person being impacted by age?
How is the weight lifted by a person being impacted by gender?
Is age impacting the equipment being used to do powerlifting?
Which section of people are more likely to get disqualified?
How is age impacting different types of lifts such as bench, deadlift and squat?
How is gender impacting different types of lifts such as bench, deadlift and squat?
Is there a relationship between bodyweight and position?
How is the gender split in different meets? Is it changing over the years?
What is the distribution of different age classes in the meets? How does it vary over the years?

To answer these questions, we will have to create various subsets of the data at different levels, such as age - weight lifted, age - equipment - weight lifted, gender - weight lifted, gender - equipment - weight lifted, gender - position and gender - meet - date. The data summarized at these levels will help us answer the questions listed above. For a few questions, we might have to merge some of these subsets.

We will look at histograms to understand the distributions of gender and age classes participating in powerlifting, the type of equipment being used, and whether these vary across different meets.

Once we cover ggplot in detail, we will use other intuitive plots such as violin plots to understand where the bulk of the population is concentrated. We can also make the scatter plots more reader-friendly, which will help us understand the relationships between variables and identify patterns where they exist.

We also plan to perform hypothesis testing to compare mean lifting capacity of different groups and finally close it with the implementation of a linear regression model to understand if there is any linear relationship between any of the variables and quantify it in case a relationship exists. We intend to use response variables such as weight lifted, position (numeric values only).
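As a rough preview of what this step could look like, the sketch below fits a linear model and runs a two-sample t-test on simulated data. Nothing here comes from the IPF dataset: the data frame `sim`, the variable `lift` and the coefficients used to generate it are assumptions made purely for illustration.

```r
# Simulated data only: these numbers are not taken from the IPF dataset
set.seed(42)
n <- 200
sim <- data.frame(
  bodyweight = runif(n, 50, 120),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Assumed relationship used to generate the response
sim$lift <- 50 + 2 * sim$bodyweight + 40 * (sim$sex == "M") + rnorm(n, sd = 10)

# Fit a linear model and inspect the estimated coefficients
fit <- lm(lift ~ bodyweight + sex, data = sim)
round(coef(fit), 2)

# Welch two-sample t-test comparing mean lift between the two groups
p_value <- t.test(lift ~ sex, data = sim)$p.value
p_value
```

On the real data, `broom::tidy(fit)` would give the coefficient table in a tidy format and `car::vif(fit)` would report variance inflation factors, in line with the packages loaded earlier.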

More details will be added during the next phase of the project, stay tuned!

Results

This will be updated during the next phase of the project!