Body Performance
Introduction
For this LBB (Learning by Building) project on Classification in Machine Learning 2, I am using the human body performance data set downloaded from Kaggle (Body Performance). The purpose of this report is to find which variables play an important role in achieving a better Body Performance score.
List of libraries
library(manipulate)
library(dplyr)
library(performance)
library(GGally)
library(ggplot2)
library(car)
library(randomForest)
library(rsample)
library(caret)
Data Preparation
performance <- read.csv("bodyPerformance.csv")
rmarkdown::paged_table(performance, options = list(rows.print = 5))
The data table contains height_cm and weight_kg columns. We want to combine these two columns into a single column named BMI_status.
To calculate BMI, we can use the formula below: \[ \text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2} \]
#make new column for BMI
performance$BMI <- round(performance$weight_kg / ((performance$height_cm/100)^2),2)
#check BMI column
head(performance$BMI,5)
## [1] 25.34 20.50 24.18 23.35 22.41
With the BMI column created, we now want to convert it from a numeric value into a categorical value, based on the BMI status ranges below:
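#create function
#note: the upper bound for "Overweight" is 29.9, so BMI values from 29.9 upward are labelled "Obese"
convert_bmi <- function(y){
  if(y < 18.5){
    y <- "Underweight"
  }else if(y >= 18.5 & y < 25){
    y <- "Healthy"
  }else if(y >= 25 & y < 29.9){
    y <- "Overweight"
  }else{
    y <- "Obese"
  }
  y #return the BMI status label
}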
#apply function
performance$BMI_status <- sapply(performance$BMI,
                                 FUN = convert_bmi)
#To check whether the BMI status classification is correct or wrong.
performance %>%
  ggplot(aes(BMI_status, BMI, fill = BMI_status))+
  geom_boxplot()+
  theme_minimal()+
  theme(legend.position="none")
The BMI_status column has been created and classified properly, so we can now remove the weight_kg, height_cm and BMI columns from our data frame. From the box plot we also observe that there are outliers in the "Obese" and "Underweight" groups.
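The same classification can also be written without a helper function. The sketch below uses base R's cut() with the thresholds from convert_bmi; the toy BMI values and the names bmi and bmi_status are made up for illustration and are not taken from the data set.

```r
# Toy BMI values (hypothetical, not taken from the data set)
bmi <- c(17.20, 22.41, 25.34, 29.95, 31.80)

# right = FALSE makes each interval closed on the left:
# [-Inf, 18.5), [18.5, 25), [25, 29.9), [29.9, Inf)
bmi_status <- cut(bmi,
                  breaks = c(-Inf, 18.5, 25, 29.9, Inf),
                  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
                  right = FALSE)

as.character(bmi_status)
```

Unlike the sapply() approach, cut() returns a factor directly and works on the whole vector at once.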
#delete weight, height and BMI column from data frame.
performance <- performance %>% select(-weight_kg, -height_cm, -BMI)
#check N/A
colSums(is.na(performance))
##                     age                  gender              body.fat_. 
##                       0                       0                       0 
##               diastolic                systolic               gripForce 
##                       0                       0                       0 
## sit.and.bend.forward_cm          sit.ups.counts           broad.jump_cm 
##                       0                       0                       0 
##                   class              BMI_status 
##                       0                       0 
#check data type
glimpse(performance)
## Rows: 13,393
## Columns: 11
## $ age <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57~
## $ gender <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "~
## $ body.fat_. <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9~
## $ diastolic <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69~
## $ systolic <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1~
## $ gripForce <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9~
## $ sit.and.bend.forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,~
## $ sit.ups.counts <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,~
## $ broad.jump_cm <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2~
## $ class <chr> "C", "A", "C", "B", "B", "B", "D", "B", "C", "~
## $ BMI_status <chr> "Overweight", "Healthy", "Healthy", "Healthy",~
Data columns description:
- age: 20 ~ 64 (years)
- gender: F (Female) & M (Male)
- body fat_%: body fat in percentage
- diastolic: blood pressure, the bottom number; measures the force your heart exerts on the walls of your arteries in between beats
- systolic: blood pressure, the top number; measures the force your heart exerts on the walls of your arteries each time it beats
- gripForce: a person's grip strength in kilograms
- sit and bend forward_cm: forward bend measured in centimeters (flexibility)
- sit-ups counts: number of sit-ups performed
- broad jump_cm: broad jump distance measured in centimeters
- class: performance score A, B, C, D (A: best) / stratified (Target Variable)
- BMI_status: a person's body mass index status
Based on the glimpse of the data above, several columns hold character values, so we will recode them as numbers and then convert them to factors.
#Changing several data type and information
performance$gender[performance$gender=="M"]<-"0"
performance$gender[performance$gender=="F"]<-"1"
performance$class[performance$class=="A"]<-"0"
performance$class[performance$class=="B"]<-"1"
performance$class[performance$class=="C"]<-"2"
performance$class[performance$class=="D"]<-"3"
performance$BMI_status[performance$BMI_status=="Underweight"]<- "0"
performance$BMI_status[performance$BMI_status=="Healthy"]<- "1"
performance$BMI_status[performance$BMI_status=="Overweight"]<- "2"
performance$BMI_status[performance$BMI_status=="Obese"]<- "3"
performance <- performance %>% mutate_if(is.character, as.factor)
#check dataframe performance
glimpse(performance)
## Rows: 13,393
## Columns: 11
## $ age <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57~
## $ gender <fct> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1~
## $ body.fat_. <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9~
## $ diastolic <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69~
## $ systolic <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1~
## $ gripForce <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9~
## $ sit.and.bend.forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,~
## $ sit.ups.counts <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,~
## $ broad.jump_cm <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2~
## $ class <fct> 2, 0, 2, 1, 1, 1, 3, 1, 2, 1, 0, 3, 2, 2, 2, 0~
## $ BMI_status <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1~
The performance data frame has been cleaned and modified. Next we move on to preprocessing and then modelling.
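The recoding above can also be done in one step per column with factor(), which maps the original labels to the numeric codes and fixes the level order at the same time. This is only a sketch on a made-up data frame named toy, not a replacement for the code used in this report.

```r
# Hypothetical toy data with the same kind of character columns
toy <- data.frame(gender = c("M", "F", "F"),
                  class  = c("A", "D", "B"),
                  stringsAsFactors = FALSE)

# levels = values found in the data, labels = codes to store instead
toy$gender <- factor(toy$gender, levels = c("M", "F"), labels = c("0", "1"))
toy$class  <- factor(toy$class,
                     levels = c("A", "B", "C", "D"),
                     labels = c("0", "1", "2", "3"))

str(toy)
```

Because the levels are declared explicitly, class keeps all four levels even though "C" does not occur in the toy rows.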
Data Preprocessing
# checking proportion of target variable
prop.table(table(performance$class))
##
## 0 1 2 3
## 0.2499813 0.2499067 0.2500560 0.2500560
#Split data into training (80%) and testing (20%) datasets
RNGkind(sample.kind = "Rounding")
set.seed(120)
index <- initial_split(performance, prop = 0.8, strata = "class")
perform_train <- training(index)
perform_test <- testing(index)
The data has been split into two objects: perform_train for modelling and perform_test for model testing.
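To illustrate what a stratified split does, here is a minimal base R sketch on made-up data: 80% of the rows are drawn separately within each class, so the class proportions stay the same in both parts. The names toy, rows_by_class, toy_train and toy_test are hypothetical, and this is not how rsample implements initial_split() internally.

```r
set.seed(120)

# Hypothetical toy data: 10 rows for each of the four classes
toy <- data.frame(id = 1:40,
                  class = factor(rep(c("0", "1", "2", "3"), each = 10)))

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(toy)), toy$class)

# Within each class, pick 80% of the rows for training
train_idx <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), round(0.8 * length(i)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

table(toy_train$class)  # 8 rows per class
table(toy_test$class)   # 2 rows per class
```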
Modelling
A Random Forest model usually takes a long time to compute, often because of the large number of predictors used. Our data consists of only 10 predictors, so we will use all of them in our model.
Random Forest
Random Forest 1
#Create Random Forest Model
model_rf_1 <- randomForest(class~.,
                           data = perform_train,
                           importance = TRUE,
                           ntree = 500)
model_rf_1
##
## Call:
## randomForest(formula = class ~ ., data = perform_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 25.51%
## Confusion matrix:
## 0 1 2 3 class.error
## 0 2298 349 26 5 0.1418969
## 1 614 1621 339 103 0.3944714
## 2 238 494 1866 81 0.3034714
## 3 45 120 319 2195 0.1806644
The first model, model_rf_1, has an OOB error rate of 25.51%, which is quite high. The highest error occurs for class "1"/"B" and the lowest for class "0"/"A".
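The OOB error rate and the class.error column can be recomputed by hand from the OOB confusion matrix printed above for model_rf_1. The sketch below copies those counts into a matrix; the object names oob, class_error and oob_error are introduced only for illustration.

```r
# OOB confusion matrix of model_rf_1, copied from the printed output
# (rows = actual class, columns = predicted class)
oob <- matrix(c(2298,  349,   26,    5,
                 614, 1621,  339,  103,
                 238,  494, 1866,   81,
                  45,  120,  319, 2195),
              nrow = 4, byrow = TRUE,
              dimnames = list(actual = 0:3, predicted = 0:3))

# Share of misclassified rows within each actual class
class_error <- 1 - diag(oob) / rowSums(oob)

# Share of misclassified rows overall
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(oob_error, 4)  # 0.2551
```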
#prediction using Random Forest model on testing data
pred_mod1 <- predict(model_rf_1,
                     newdata = perform_test)
#Confusion matrix result
confusionMatrix(data = pred_mod1,
reference = perform_test$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 577 166 62 8
## 1 88 383 120 40
## 2 4 86 473 66
## 3 1 35 15 556
##
## Overall Statistics
##
## Accuracy : 0.7422
## 95% CI : (0.7252, 0.7586)
## No Information Rate : 0.25
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6562
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 0.8612 0.5716 0.7060 0.8299
## Specificity 0.8826 0.8766 0.9224 0.9746
## Pos Pred Value 0.7097 0.6070 0.7520 0.9160
## Neg Pred Value 0.9502 0.8599 0.9039 0.9450
## Prevalence 0.2500 0.2500 0.2500 0.2500
## Detection Rate 0.2153 0.1429 0.1765 0.2075
## Detection Prevalence 0.3034 0.2354 0.2347 0.2265
## Balanced Accuracy 0.8719 0.7241 0.8142 0.9022
Based on the confusion matrix, this model predicts the body performance score (class) with a fairly high accuracy of 0.7422. Let us compare this result with Model 2, which uses K-Fold Cross Validation.
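Note that caret's confusion matrix puts predictions in rows and reference classes in columns, the opposite of the randomForest OOB matrix. As a sketch, the accuracy, sensitivity and positive predictive value reported above for model_rf_1 can be recomputed from the printed counts; the object names cm, accuracy, sensitivity and ppv are introduced only for illustration.

```r
# Test-set confusion matrix of model_rf_1, copied from the printed output
# (caret layout: rows = prediction, columns = reference)
cm <- matrix(c(577, 166,  62,   8,
                88, 383, 120,  40,
                 4,  86, 473,  66,
                 1,  35,  15, 556),
             nrow = 4, byrow = TRUE,
             dimnames = list(prediction = 0:3, reference = 0:3))

accuracy    <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)      # correct / all rows truly in the class
ppv         <- diag(cm) / rowSums(cm)      # correct / all rows predicted as the class

round(accuracy, 4)  # 0.7422
round(sensitivity, 4)
round(ppv, 4)
```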
Random Forest 2
For the second random forest model, we import a pre-created model called performance_forest. This model was built using k-fold cross-validation with trainControl set to method = "repeatedcv", number of folds = 4 and repeats = 2, with the sampling seed set to 120.
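The call that produced performance_forest is not shown in this report. The following is a minimal sketch of how a model with those settings could be trained with caret, assuming method = "rf"; the built-in iris data stands in for perform_train, and the names ctrl and fit_cv are hypothetical.

```r
library(caret)  # train() with method = "rf" also needs the randomForest package

set.seed(120)

# 4-fold cross-validation, repeated 2 times
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 2)

# iris is used here only so the example runs on its own;
# in this report the formula would be class ~ . on perform_train
fit_cv <- train(Species ~ ., data = iris,
                method = "rf",
                trControl = ctrl)

fit_cv$bestTune  # mtry value with the highest cross-validated accuracy
```

A fitted object like this can be stored with saveRDS() and loaded again later with readRDS(), which avoids repeating the cross-validation each time the report is knitted.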
#read pre created model
model_rf_2 <- readRDS("performance_forest.RDS")
model_rf_2
## Random Forest
##
## 10713 samples
## 10 predictor
## 4 classes: '0', '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 2 times)
## Summary of sample sizes: 8036, 8034, 8034, 8035, 8034, 8035, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7075983 0.6101292
## 7 0.7405034 0.6540072
## 12 0.7382168 0.6509591
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
Trained on 10,713 samples with 10 predictors, model_rf_2 reaches its highest cross-validated accuracy, 0.7405, at mtry = 7.
model_rf_2$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 25.15%
## Confusion matrix:
## 0 1 2 3 class.error
## 0 2334 307 30 7 0.1284541
## 1 618 1669 299 91 0.3765409
## 2 253 524 1822 80 0.3198955
## 3 47 140 298 2194 0.1810377
The final model uses mtry = 7, which gave the highest accuracy. The second model, model_rf_2, has an OOB error rate of 25.15%, which is still quite high. The highest error occurs for class "1"/"B" and the lowest for class "0"/"A".
#predict model in testing data, using type "raw" to predict class.
pred_mod2 <- predict(model_rf_2,
                     newdata = perform_test, type = "raw")
#Confusion matrix result
confusionMatrix(data = pred_mod2,
reference = perform_test$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3
## 0 578 168 68 10
## 1 85 396 125 45
## 2 6 77 456 61
## 3 1 29 21 554
##
## Overall Statistics
##
## Accuracy : 0.7403
## 95% CI : (0.7233, 0.7568)
## No Information Rate : 0.25
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6537
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity 0.8627 0.5910 0.6806 0.8269
## Specificity 0.8776 0.8731 0.9284 0.9746
## Pos Pred Value 0.7015 0.6083 0.7600 0.9157
## Neg Pred Value 0.9504 0.8650 0.8971 0.9441
## Prevalence 0.2500 0.2500 0.2500 0.2500
## Detection Rate 0.2157 0.1478 0.1701 0.2067
## Detection Prevalence 0.3075 0.2429 0.2239 0.2257
## Balanced Accuracy 0.8701 0.7321 0.8045 0.9007
Conclusion
On the testing data, model_rf_1 reaches an accuracy of 74.22%, while model_rf_2, built with K-fold Cross Validation, reaches a slightly lower accuracy of 74.03%. We can therefore conclude that both models perform about equally well in predicting Body Performance when using all predictors.
Although a Random Forest model cannot be interpreted explicitly in detail, we can still identify which predictors are important to the model's predictions, as shown below:
#Create plot to check important predictor
plot(varImp(model_rf_2),10)
sit and bend forward plays the most important role in predicting the Body Performance score. In practice, this variable measures a person's flexibility. Increased flexibility can improve muscle strength and endurance, which could affect the Body Performance result.
sit-ups counts is the second most important variable, which is fairly intuitive, as the sit-up is one of many exercises used to develop core muscle strength.
gender is the variable considered least important in our model.
Here are some tips to develop or maintain our Body Performance in order to stay healthy:
- Exercise daily
- Eat the right foods
- Get proper sleep
- Stay motivated and keep a positive mindset