Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(readr)
library(readxl)
library(magrittr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(plyr)
library(foreign)
library(lubridate)
library(forecast)
1.Two datasets consisting of information about a body weight lifting competition was uploaded in R to perform detailed data preprocessing functions.
2.Dataset_1 consisting of information of weight lifting competition and Dataset_2 consisting of event information of the weightlifting competition was joined to form a final detailed dataset with all the information in one table and was named Final_Dataset.
3.As our data didn’t conform to the tidy data principles, manipulation and changes were made to tidy the dataset.
4.A new variable Totalliftkg was formed by mutating 3 variables namely BestSquatKg, BestBenchKg and BestDeadliftkg from the Final_Dataset.
5.The data was then scanned for missing values, inconsistances and obvious errors.
6.It was found that the data consisted of large number of missing values in the dataset.So,in order to overcome such a problem and we were missing values only in numeric variables, we decided to impute the mean of the variables in the cells that contained no values.
7.To visually represent the numeric variables in our dataset, we then plotted a boxplot for all the variables.
8.Outliers were seen the visual representation and hence to overcome such values we filtered out the outliers.
9.Data transformation on Best Bench (kg) was done using the log and the boxcox transformation.
10.However we felt that the boxcox transformation provided better visual represenation in reducing skewness and in converting the distribution into normal distribution.
1.Datasets were acquired from the following website“https://www.kaggle.com/open-powerlifting/powerlifting-database” in xlsx format.
2.Dataset 1 stored as powerlifting.xlsx in the working directory consisted information of different powerlifting results.At total it consisted of 1117 observations and 11 variables. The eleven variables listed were MeetId,Name,Sex,Equipment,Age,Bodyweight(kg),Weightclass(kg),Best Squat(kg), BestBench(kg) and Wilks.
3.Dataset 2 on the other-hand was stored as meets.xlsx, consisting of information about the meets of the different powerlifting competitions. At total it consisted of 21 observations and 8 variables. The variables included MeetID, Federation,MeetCountry,MeetState,MeetTown,MeetName and date.
4.Both dataset were imported into R using the read_excel function from the library(readxl).
DataSet1<-read_excel("powerlifting.xlsx")
head(DataSet1)
DataSet2<-read_excel("meets.xlsx")
head(DataSet2)
Final_Dataset<-DataSet1 %>% full_join(DataSet2)
Joining, by = "MeetID"
str(Final_Dataset)
Classes 'tbl_df', 'tbl' and 'data.frame': 1117 obs. of 18 variables:
$ MeetID : num 0 0 0 0 0 0 0 0 0 0 ...
$ Name : chr "Angie Belk Terry" "Dawn Bogart" "Dawn Bogart" "Dawn Bogart" ...
$ Sex : chr "F" "F" "F" "F" ...
$ Equipment : chr "Wraps" "Single-ply" "Single-ply" "Raw" ...
$ Age : num 47 42 42 42 18 28 60 60 52 52 ...
$ BodyweightKg : num 59.6 58.5 58.5 58.5 63.7 ...
$ WeightClassKg : chr "60" "60" "60" "60" ...
$ BestSquatKg : num 47.6 142.9 142.9 NA NA ...
$ BestBenchKg : num 20.4 95.2 95.2 95.2 31.8 ...
$ BestDeadliftKg: num 70.3 163.3 163.3 NA 90.7 ...
$ Wilks : num 155 456 456 108 130 ...
$ MeetPath : chr "365strong/1601" "365strong/1601" "365strong/1601" "365strong/1601" ...
$ Federation : chr "365Strong" "365Strong" "365Strong" "365Strong" ...
$ MeetCountry : chr "USA" "USA" "USA" "USA" ...
$ MeetState : chr "NC" "NC" "NC" "NC" ...
$ MeetTown : chr "Charlotte" "Charlotte" "Charlotte" "Charlotte" ...
$ MeetName : chr "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" ...
$ Date : POSIXct, format: "2016-10-29" "2016-10-29" "2016-10-29" "2016-10-29" ...
head(Final_Dataset)
Final_Dataset <- Final_Dataset %>% separate(MeetPath, into=c("Federation", "Path"), sep="/")
Final_Dataset$Sex<- Final_Dataset$Sex %>% factor(levels = c("M","F"),labels = c("Male","Female"))
Final_Dataset$Equipment<-Final_Dataset$Equipment %>% factor(levels =c("Raw","Wraps","Multi-ply","Single-ply"))
Final_Dataset$WeightClassKg<-Final_Dataset$WeightClassKg %>% factor(levels =c("140+","140","125","110","100","90","82.5","75","67.5","60","56","52","48","44","42","40"), ordered = TRUE)
Final_Dataset$MeetID<-as.factor(Final_Dataset$MeetID)
Final_Dataset<-rename(Final_Dataset,c("WeightClassKg"="Weight Category (Kg)"))
#date formating
Final_Dataset$Date <- as.Date(Final_Dataset$Date, "%d/%m/%Y")
unknown timezone '%d/%m/%Y'
head(Final_Dataset)
Final_Dataset %>% replace_na(list(BestSquatKg=0, BestBenchKg=0, BestDeadliftKg = 0)) %>% mutate (Final_Dataset, TotalLiftKg = BestSquatKg + BestBenchKg + BestDeadliftKg)
colSums(is.na(Final_Dataset))
MeetID Name Sex Equipment Age BodyweightKg Weight Category (Kg)
0 0 0 0 11 1 0
BestSquatKg BestBenchKg BestDeadliftKg Wilks Federation Path MeetCountry
412 125 224 39 0 0 0
MeetState MeetTown MeetName Date
0 0 0 0
Final_Dataset$BodyweightKg<-impute(Final_Dataset$BodyweightKg,fun = mean)
Final_Dataset$Age <-impute(Final_Dataset$Age, fun = mean)
Final_Dataset$BestSquatKg<-impute(Final_Dataset$BestSquatKg, fun = mean)
Final_Dataset$BestBenchKg<-impute(Final_Dataset$BestBenchKg, fun = mean)
Final_Dataset$BestDeadliftKg<-impute(Final_Dataset$BestDeadliftKg, fun = mean)
Final_Dataset$Wilks<-impute(Final_Dataset$Wilks, fun = mean)
colSums(is.na(Final_Dataset))
MeetID Name Sex Equipment Age BodyweightKg Weight Category (Kg)
0 0 0 0 0 0 0
BestSquatKg BestBenchKg BestDeadliftKg Wilks Federation Path MeetCountry
0 0 0 0 0 0 0
MeetState MeetTown MeetName Date
0 0 0 0
head(Final_Dataset)
summary(Final_Dataset)
11 values imputed to 36.7387
1 values imputed to 90.80884
412 values imputed to 185.5859
125 values imputed to 129.5909
224 values imputed to 197.929
39 values imputed to 278.6083
MeetID Name Sex Equipment Age BodyweightKg Weight Category (Kg) BestSquatKg BestBenchKg
15 :211 Length:1117 Male :819 Raw :775 Min. : 5.00 Min. : 26.13 90 :200 Min. : 20.0 Min. : 15.5
5 :129 Class :character Female:298 Wraps :235 1st Qu.:25.00 1st Qu.: 74.16 100 :144 1st Qu.:155.0 1st Qu.: 92.5
12 :128 Mode :character Multi-ply : 56 Median :33.00 Median : 88.36 75 :130 Median :185.6 Median :129.6
2 : 92 Single-ply: 51 Mean :36.74 Mean : 90.81 110 :126 Mean :185.6 Mean :129.6
0 : 79 3rd Qu.:47.00 3rd Qu.:106.87 82.5 :126 3rd Qu.:205.0 3rd Qu.:160.0
7 : 64 Max. :95.00 Max. :179.99 125 :110 Max. :457.5 Max. :350.0
(Other):414 (Other):281
BestDeadliftKg Wilks Federation Path MeetCountry MeetState MeetTown MeetName
Min. : 25.0 Min. : 45.56 Length:1117 Length:1117 Length:1117 Length:1117 Length:1117 Length:1117
1st Qu.:150.0 1st Qu.:158.77 Class :character Class :character Class :character Class :character Class :character Class :character
Median :197.9 Median :286.81 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
Mean :197.9 Mean :278.61
3rd Qu.:238.1 3rd Qu.:375.19
Max. :374.2 Max. :666.62
Date
Min. :2016-04-10
1st Qu.:2016-09-23
Median :2017-01-21
Mean :2017-02-03
3rd Qu.:2017-06-10
Max. :2017-12-09
str(Final_Dataset)
Classes 'tbl_df', 'tbl' and 'data.frame': 1117 obs. of 18 variables:
$ MeetID : Factor w/ 21 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Name : chr "Angie Belk Terry" "Dawn Bogart" "Dawn Bogart" "Dawn Bogart" ...
$ Sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
$ Equipment : Factor w/ 4 levels "Raw","Wraps",..: 2 4 4 1 1 2 1 1 2 1 ...
$ Age : 'impute' num 47 42 42 42 18 28 60 60 52 52 ...
..- attr(*, "imputed")= int 731 854 862 863 864 865 872 873 874 875 ...
$ BodyweightKg : 'impute' num 59.6 58.5 58.5 58.5 63.7 ...
..- attr(*, "imputed")= int 1103
$ Weight Category (Kg): Ord.factor w/ 16 levels "140+"<"140"<"125"<..: 10 10 10 10 9 9 9 9 9 9 ...
$ BestSquatKg : 'impute' num 47.6 142.9 142.9 185.6 185.6 ...
..- attr(*, "imputed")= int 4 5 10 18 19 34 38 39 43 44 ...
$ BestBenchKg : 'impute' num 20.4 95.2 95.2 95.2 31.8 ...
..- attr(*, "imputed")= int 10 18 38 41 60 69 73 106 107 108 ...
$ BestDeadliftKg : 'impute' num 70.3 163.3 163.3 197.9 90.7 ...
..- attr(*, "imputed")= int 4 18 34 38 41 43 44 53 55 56 ...
$ Wilks : 'impute' num 155 456 456 108 130 ...
..- attr(*, "imputed")= int 18 38 41 43 73 107 116 133 135 154 ...
$ Federation : chr "365strong" "365strong" "365strong" "365strong" ...
$ Path : chr "1601" "1601" "1601" "1601" ...
$ MeetCountry : chr "USA" "USA" "USA" "USA" ...
$ MeetState : chr "NC" "NC" "NC" "NC" ...
$ MeetTown : chr "Charlotte" "Charlotte" "Charlotte" "Charlotte" ...
$ MeetName : chr "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" "2016 Junior & Senior National Powerlifting Championships" ...
$ Date : Date, format: "2016-10-29" "2016-10-29" "2016-10-29" "2016-10-29" ...
Final_Dataset$Wilks = as.numeric(Final_Dataset$Wilks)
boxplot(Final_Dataset$Wilks, main="Strength of Power Lifters", ylab = "Wilks", col = "grey")
Final_Dataset$BodyweightKg = as.numeric(Final_Dataset$BodyweightKg)
boxplot(Final_Dataset$BodyweightKg, main="Body Weight in kg", ylab = "Weight(kg)",col = "red")
x<-IQR(Final_Dataset$BodyweightKg,na.rm = TRUE)
summary(Final_Dataset$BodyweightKg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
26.13 74.16 88.36 90.81 106.87 179.99
c<-1.5*x+106.87
z<-Final_Dataset%>% filter(BodyweightKg<156.895)
boxplot(z$BodyweightKg, main="Body Weight in kg After Removing Outliers", ylab="weight(kg)", col = "red")
Final_Dataset$BestSquatKg= as.numeric(Final_Dataset$BestSquatKg)
boxplot(Final_Dataset$BestSquatKg, main = " Box Plot of Best Squat ", col="blue", ylab="weight(kg)")
Y<-IQR(Final_Dataset$BestSquatKg)
summary(Final_Dataset$BestSquatKg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.0 155.0 185.6 185.6 205.0 457.5
D<-1.5*Y+205.0
R<-155-1.5*Y
R3<-Final_Dataset %>% filter((80.0<BestSquatKg) & (BestSquatKg<280))
boxplot(R3$BestSquatKg, main = " Box Plot of Best Squat After Removing Outlier ", col="blue",ylab="weight(kg)")
Final_Dataset$BestBenchKg= as.numeric(Final_Dataset$BestBenchKg)
boxplot(Final_Dataset$BestBenchKg, main = " Box Plot of Best Bench ", col="green",ylab="weight(kg)",ylab="weight(kg)")
summary(Final_Dataset$BestBenchKg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.5 92.5 129.6 129.6 160.0 350.0
Q<-IQR(Final_Dataset$BestBenchKg)
H<-1.5*Q+160
T<- Final_Dataset %>% filter(BestBenchKg<261.25)
boxplot(T$BestBenchKg, main = " Box Plot of Best Bench After Removing Outlier ", col="green",ylab="weight(kg)")
Final_Dataset$BestDeadliftKg<-as.numeric(Final_Dataset$BestDeadliftKg)
boxplot(Final_Dataset$BestDeadliftKg, main = " Box Plot of Best Dead Lift ", col="orange",ylab="weight(kg)")
summary(Final_Dataset$BestDeadliftKg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
25.0 150.0 197.9 197.9 238.1 374.2
W<-IQR(Final_Dataset$BestDeadliftKg)
K<-1.5*W+238.1
R<-Final_Dataset %>% filter(BestDeadliftKg<370.31)
boxplot(R$BestDeadliftKg, main = "Box Plot of Best Dead Lift After Removing Outlier", col="orange",ylab="weight(kg)")
1.Data transformation on Best Bench (kg) was done using the log and the boxcox transformation.
2.However it was found that the transformation with boxcox transformation provided better visual represenation in reducing skewness and converting the distribution into normal distribution.
hist(Final_Dataset$BestBenchKg,col = "grey")
hist(log(Final_Dataset$BestBenchKg), main = "Log Transformation to BestBenchKg",col = "grey")
boxcox<-BoxCox(Final_Dataset$BestBenchKg, lambda = "auto")
hist(boxcox, main = "Boxcox Transformation to BestBenchkg",col = "grey")