Body Performance

Introduction

For LBB (Learning by Building) - Classification in Machine Learning 2, I am using human body performance data downloaded from Kaggle (Body Performance). The purpose of this report is to find which variables play an important role in achieving a better Body Performance score.

List of Libraries

library(manipulate)
library(dplyr)
library(performance)
library(GGally)
library(ggplot2)
library(car)
library(randomForest)
library(rsample)
library(caret)

Data Preparation

performance <- read.csv("bodyPerformance.csv")
rmarkdown::paged_table(performance, options = list(rows.print = 5))

Note that the data table contains height_cm and weight_kg. We want to combine those columns into a single column named BMI_status.

To calculate BMI, we can use the formula below: \[ \text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2} \]
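As a quick check of the formula, here is a minimal worked example. The weight and height values are hypothetical and are not taken from the data set.

```r
# Hypothetical measurements, not taken from the data set
weight_kg <- 70
height_cm <- 175

# Convert height from centimetres to metres before squaring
bmi <- round(weight_kg / (height_cm / 100)^2, 2)
bmi
```

Dividing height_cm by 100 first matters: squaring the height in centimetres would shrink the result by a factor of 10,000.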

#make new column for BMI 
performance$BMI <- round(performance$weight_kg / ((performance$height_cm/100)^2),2)

#check BMI column
head(performance$BMI,5)
## [1] 25.34 20.50 24.18 23.35 22.41

We have created the BMI column; now we want to convert it from a numeric value into a categorical value, using the BMI status categories implemented in the function below:

#create function: map a single BMI value to its status label
convert_bmi <- function(y){
    if(y < 18.5){
      "Underweight"
    }else if(y < 25){
      "Healthy"
    }else if(y < 29.9){
      # cutoff kept at 29.9, so a BMI from 29.9 up to 30 is labelled "Obese"
      "Overweight"
    }else{
      "Obese"
    }
}
#apply function to every BMI value
performance$BMI_status <- sapply(performance$BMI,
                                 FUN = convert_bmi)
#Check whether the BMI status classification is correct.
performance %>% 
  ggplot(aes(BMI_status, BMI, fill = BMI_status))+
  geom_boxplot()+
  theme_minimal()+
  theme(legend.position="none")

The BMI_status column has been created and classified properly, so we can now remove the weight_kg, height_cm and BMI columns from our data frame. From the box plot we also observe outliers in the “Obese” and “Underweight” statuses.
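The same binning can also be done in one vectorized step with base R's cut(). This is only an alternative sketch, not the approach used in this report. The BMI values are hypothetical, and the cutoffs are copied from convert_bmi.

```r
# Hypothetical BMI values chosen to hit every boundary
bmi <- c(17.0, 18.5, 24.99, 25.0, 29.9, 35.0)

bmi_status <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 29.9, Inf),
  labels = c("Underweight", "Healthy", "Overweight", "Obese"),
  right  = FALSE  # left-closed intervals: [18.5, 25), [25, 29.9), [29.9, Inf)
)
as.character(bmi_status)
```

cut() returns a factor directly, so no separate conversion with sapply is needed.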

#delete weight, height and BMI column from data frame.
performance <- performance %>% select(-weight_kg, -height_cm, -BMI)
#check N/A
colSums(is.na(performance)) 
##                     age                  gender              body.fat_. 
##                       0                       0                       0 
##               diastolic                systolic               gripForce 
##                       0                       0                       0 
## sit.and.bend.forward_cm          sit.ups.counts           broad.jump_cm 
##                       0                       0                       0 
##                   class              BMI_status 
##                       0                       0
#check data type
glimpse(performance)
## Rows: 13,393
## Columns: 11
## $ age                     <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57~
## $ gender                  <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "~
## $ body.fat_.              <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9~
## $ diastolic               <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69~
## $ systolic                <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1~
## $ gripForce               <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9~
## $ sit.and.bend.forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,~
## $ sit.ups.counts          <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,~
## $ broad.jump_cm           <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2~
## $ class                   <chr> "C", "A", "C", "B", "B", "B", "D", "B", "C", "~
## $ BMI_status              <chr> "Overweight", "Healthy", "Healthy", "Healthy",~

Data columns description:

age: 20 to 64 (years)

gender: F (Female) & M (Male)

body fat_%: body fat in percentage

diastolic: blood pressure, the bottom number, measures the force your heart exerts on the walls of your arteries in between beats

systolic: blood pressure, the top number, measures the force your heart exerts on the walls of your arteries each time it beats

gripForce: grip strength in kilograms

sit and bend forward_cm: forward bend measured in centimeters (flexibility)

sit-ups counts: number of sit-up repetitions

broad jump_cm: broad jump (horizontal jump) distance measured in centimeters

class: performance score A, B, C, D (A: best), stratified (target variable)

BMI_status: the person's body mass index status

Based on the glimpse of the data above, several columns hold character values, so we will recode them as numbers and then convert them to factors.

#Changing several data type and information
performance$gender[performance$gender=="M"]<-"0"
performance$gender[performance$gender=="F"]<-"1"
performance$class[performance$class=="A"]<-"0"
performance$class[performance$class=="B"]<-"1"
performance$class[performance$class=="C"]<-"2"
performance$class[performance$class=="D"]<-"3"
performance$BMI_status[performance$BMI_status=="Underweight"]<- "0"
performance$BMI_status[performance$BMI_status=="Healthy"]<- "1"
performance$BMI_status[performance$BMI_status=="Overweight"]<- "2"
performance$BMI_status[performance$BMI_status=="Obese"]<- "3"

performance <- performance %>% mutate_if(is.character, as.factor)
#check dataframe performance
glimpse(performance)
## Rows: 13,393
## Columns: 11
## $ age                     <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57~
## $ gender                  <fct> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1~
## $ body.fat_.              <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9~
## $ diastolic               <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69~
## $ systolic                <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1~
## $ gripForce               <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9~
## $ sit.and.bend.forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,~
## $ sit.ups.counts          <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,~
## $ broad.jump_cm           <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2~
## $ class                   <fct> 2, 0, 2, 1, 1, 1, 3, 1, 2, 1, 0, 3, 2, 2, 2, 0~
## $ BMI_status              <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1~

The performance data frame has been cleaned and modified; now we move on to preprocessing and then modelling.
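The recoding above can also be written as one factor() call per column, which sets the level order explicitly. This is a minimal sketch on a hypothetical character vector standing in for performance$class.

```r
# Hypothetical class labels standing in for performance$class
class_chr <- c("C", "A", "B", "D", "A")

# levels gives the original values in order; labels gives their replacements
class_fct <- factor(class_chr,
                    levels = c("A", "B", "C", "D"),
                    labels = c("0", "1", "2", "3"))
class_fct
```

Setting levels explicitly keeps the mapping fixed even when a level is absent from the data.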

Data Preprocessing

# checking proportion of target variable
prop.table(table(performance$class))
## 
##         0         1         2         3 
## 0.2499813 0.2499067 0.2500560 0.2500560
#Split data for training (80%) and testing dataset (20%)
RNGkind(sample.kind = "Rounding")
set.seed(120)

index <- initial_split(performance, prop = 0.8, strata = "class")

perform_train <- training(index)
perform_test <- testing(index)

The data has been split into two objects: perform_train for modelling and perform_test for model testing.
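To illustrate what stratifying on class means, here is a minimal base R sketch that samples 80% of the rows within each class. It is not how rsample implements initial_split internally, and the toy data frame is hypothetical.

```r
set.seed(120)

# Hypothetical data: 100 rows, 25 per class
toy <- data.frame(id = 1:100,
                  class = factor(rep(c("0", "1", "2", "3"), each = 25)))

# Draw 80% of the row indices within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in the training part stay balanced
prop.table(table(toy_train$class))
```

Because sampling happens inside each class, the training and testing parts keep the same class balance as the full data.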

Modelling

Random Forest usually takes a long time to fit, often because of a high number of predictors. Our data consists of only 10 predictors, so we will use all of them in our model.

Random Forest

Random Forest 1

#Create Random Forest Model
model_rf_1 <- randomForest(class~., 
                         data = perform_train, 
                         importance = TRUE,
                         ntree = 500)
model_rf_1
## 
## Call:
##  randomForest(formula = class ~ ., data = perform_train, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 25.51%
## Confusion matrix:
##      0    1    2    3 class.error
## 0 2298  349   26    5   0.1418969
## 1  614 1621  339  103   0.3944714
## 2  238  494 1866   81   0.3034714
## 3   45  120  319 2195   0.1806644

The first model, model_rf_1, has an OOB error rate of 25.51%, which is quite high. The highest error occurs when the model predicts class “1”/“B”, and the lowest when it predicts class “0”/“A”.
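The OOB error rate and the per-class errors can be recomputed from the confusion matrix printed for model_rf_1. The sketch below copies those counts by hand and assumes that rows are the actual classes, which matches the class.error column.

```r
# Counts copied from the model_rf_1 confusion matrix (rows = actual class)
conf <- matrix(c(2298,  349,   26,    5,
                  614, 1621,  339,  103,
                  238,  494, 1866,   81,
                   45,  120,  319, 2195),
               nrow = 4, byrow = TRUE,
               dimnames = list(actual    = as.character(0:3),
                               predicted = as.character(0:3)))

# Overall OOB error: share of observations off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

round(oob_error * 100, 2)
round(class_error, 4)
```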

#prediction using Random Forest model on testing data
pred_mod1 <- predict(model_rf_1, 
                    newdata = perform_test)
#Confusion matrix result
confusionMatrix(data = pred_mod1,
                reference = perform_test$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3
##          0 577 166  62   8
##          1  88 383 120  40
##          2   4  86 473  66
##          3   1  35  15 556
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7422          
##                  95% CI : (0.7252, 0.7586)
##     No Information Rate : 0.25            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6562          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.8612   0.5716   0.7060   0.8299
## Specificity            0.8826   0.8766   0.9224   0.9746
## Pos Pred Value         0.7097   0.6070   0.7520   0.9160
## Neg Pred Value         0.9502   0.8599   0.9039   0.9450
## Prevalence             0.2500   0.2500   0.2500   0.2500
## Detection Rate         0.2153   0.1429   0.1765   0.2075
## Detection Prevalence   0.3034   0.2354   0.2347   0.2265
## Balanced Accuracy      0.8719   0.7241   0.8142   0.9022

Based on the confusion matrix, this model has a fairly high accuracy of 0.7422 in predicting the body performance class. Let us compare the result with Model 2, which uses K-Fold Cross Validation.
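To show where the reported statistics come from, the sketch below recomputes accuracy, Kappa and sensitivity from the test confusion matrix of model_rf_1. The counts are copied by hand from the output above.

```r
# Test-set confusion matrix of model_rf_1
# (rows = prediction, columns = reference)
cm <- matrix(c(577, 166,  62,   8,
                88, 383, 120,  40,
                 4,  86, 473,  66,
                 1,  35,  15, 556),
             nrow = 4, byrow = TRUE)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions divided by the true total of each class
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: accuracy corrected for the agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa), 4)
round(sensitivity, 4)
```

The low sensitivity of class “1” shows that most of the remaining error comes from class “B” observations being assigned to neighbouring classes.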

Random Forest 2

For the second random forest model, we import a pre-created model stored as performance_forest. This model was created using k-fold cross validation with trainControl method = “repeatedcv”, number of folds = 4, repeats = 2 and the sampling seed set to 120.
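The training script for performance_forest is not shown in this report. The sketch below shows how such a model might be produced, assuming caret's train() with method = "rf". The iris data set is a stand-in so the code runs without bodyPerformance.csv, and the temporary file path is hypothetical.

```r
library(caret)

# Repeated k-fold cross validation as described in the text: 4 folds, 2 repeats
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 2)

set.seed(120)
# iris is a stand-in; with the report's data the call would presumably use
# class ~ . and data = perform_train
fit <- train(Species ~ ., data = iris,
             method    = "rf",   # assumption: caret's randomForest wrapper
             trControl = ctrl)

fit$bestTune  # mtry value selected by cross-validated accuracy

# Save and reload the fitted object, mirroring the readRDS() call in the report
rds_path <- tempfile(fileext = ".RDS")
saveRDS(fit, rds_path)
fit_loaded <- readRDS(rds_path)
```

Saving the fitted object avoids repeating the cross validation every time the report is knitted.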

#read pre created model

model_rf_2 <- readRDS("performance_forest.RDS")
model_rf_2
## Random Forest 
## 
## 10713 samples
##    10 predictor
##     4 classes: '0', '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 2 times) 
## Summary of sample sizes: 8036, 8034, 8034, 8035, 8034, 8035, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7075983  0.6101292
##    7    0.7405034  0.6540072
##   12    0.7382168  0.6509591
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.

Using 10 predictors, model_rf_2 was trained on 10,713 samples and reached its highest cross-validated accuracy, 0.7405, at mtry = 7.
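The tuning grid includes mtry = 12 even though the data has only 10 predictors. A likely explanation, assuming the model was fitted through caret's formula interface, is that factor predictors are expanded into dummy columns before tuning. The sketch below counts those columns for a hypothetical data frame with the same column types as the report's predictors.

```r
# Hypothetical stand-in: 8 numeric columns, a 2-level factor and a 4-level factor
toy <- data.frame(matrix(seq_len(64), ncol = 8))
toy$gender     <- factor(rep(c("0", "1"), times = 4))
toy$BMI_status <- factor(rep(c("0", "1", "2", "3"), times = 2))

# Expand factors into dummy columns and drop the intercept
x <- model.matrix(~ ., data = toy)[, -1]
ncol(x)
```

The 8 numeric columns, 1 dummy for gender and 3 dummies for BMI_status add up to 12 columns, which would match the largest mtry value in the grid.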

model_rf_2$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 25.15%
## Confusion matrix:
##      0    1    2    3 class.error
## 0 2334  307   30    7   0.1284541
## 1  618 1669  299   91   0.3765409
## 2  253  524 1822   80   0.3198955
## 3   47  140  298 2194   0.1810377

The final model uses mtry = 7, which gave the highest accuracy. The second model, model_rf_2, has an OOB error rate of 25.15%, which is still quite high. The highest error occurs when the model predicts class “1”/“B”, and the lowest when it predicts class “0”/“A”.

#predict model in testing data, using type "raw" to predict class.
pred_mod2 <- predict(model_rf_2, 
                    newdata = perform_test, type = "raw")
#Confusion matrix result
confusionMatrix(data = pred_mod2,
                reference = perform_test$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3
##          0 578 168  68  10
##          1  85 396 125  45
##          2   6  77 456  61
##          3   1  29  21 554
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7403          
##                  95% CI : (0.7233, 0.7568)
##     No Information Rate : 0.25            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6537          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3
## Sensitivity            0.8627   0.5910   0.6806   0.8269
## Specificity            0.8776   0.8731   0.9284   0.9746
## Pos Pred Value         0.7015   0.6083   0.7600   0.9157
## Neg Pred Value         0.9504   0.8650   0.8971   0.9441
## Prevalence             0.2500   0.2500   0.2500   0.2500
## Detection Rate         0.2157   0.1478   0.1701   0.2067
## Detection Prevalence   0.3075   0.2429   0.2239   0.2257
## Balanced Accuracy      0.8701   0.7321   0.8045   0.9007

Conclusion

Based on Random Forest model testing, model_rf_1 achieves an accuracy of 74.22%, while model_rf_2, tuned with K-fold Cross Validation, achieves a slightly lower accuracy of 74.03%. Thus, we can conclude that both models perform comparably well in predicting Body Performance using all predictors.

Although a Random Forest model cannot be interpreted explicitly in detail, we can still identify which predictors are important in our model, as shown below:

#Create plot to check important predictor
plot(varImp(model_rf_2), top = 10)

  • sit and bend forward plays the most important role in predicting the Body Performance score. In practice, that variable measures a person's flexibility. Increased flexibility can improve muscle strength and endurance, which could affect our Body Performance result.
  • sit-ups counts is the second most important variable, which is fairly intuitive, as the sit-up is one of many workout movements that develop core muscle strength.
  • gender is the variable considered least important in our model.
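Importance scores can also be read directly from a randomForest fit with importance(). The sketch below uses iris as a stand-in for perform_train, so its ranking says nothing about the body performance data.

```r
library(randomForest)

set.seed(120)
# iris stands in for perform_train so the sketch runs on its own
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)

# type = 1: mean decrease in accuracy when a predictor is permuted
imp <- importance(rf, type = 1)

# Sort predictors from most to least important
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Applying the same call to model_rf_1, which was fitted with importance = TRUE, would give a ranking to compare against the plot for model_rf_2.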

Here are some tips to develop or maintain our Body Performance in order to stay healthy:

  • Exercise daily
  • Eat the right foods
  • Get proper sleep
  • Stay motivated and keep a positive mindset