XGBoost (eXtreme Gradient Boosting) is a machine learning technique based on decision trees that uses a boosting algorithm to enhance predictive performance. It combines many weak decision trees into a single stronger model, making it one of the most popular approaches for classification and regression problems.
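
Before jumping into the exercise, here is a toy sketch of the residual-fitting principle behind boosting, built from rpart stumps on simulated data. This is an illustration of the general idea only, not XGBoost's exact algorithm (which adds regularization and second-order gradient information):

library(rpart)
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

pred <- rep(0, nrow(toy))  # the ensemble prediction starts at zero
eta <- 0.3                 # learning rate: how much each new tree contributes
for (m in 1:50) {
  toy$resid <- toy$y - pred                  # what the ensemble still gets wrong
  stump <- rpart(resid ~ x, data = toy,
                 control = rpart.control(maxdepth = 1))
  pred <- pred + eta * predict(stump, toy)   # each weak tree nudges the fit
}
mean((toy$y - pred)^2)  # training error shrinks as trees accumulate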

Let’s work through a classification exercise using measurements recorded by a cellphone’s gyroscope. These measurements capture the acceleration of different types of movement performed with the phone, such as rotational and translational movements along the X, Y, and Z axes. Our output variable is the activity being performed by the phone’s user, namely: 1 - walking; 2 - going upstairs; 3 - going downstairs; 4 - sitting; 5 - standing; 6 - lying down.

Package installation

pacotes <- c('tidyverse','viridis','rpart','rpart.plot','gtools','Rmisc','scales','caret','shapr','gamlss','gamlss.add','mlbench','reshape')

# Install any packages that are missing, then load them all
instalador <- pacotes[!pacotes %in% installed.packages()]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)

Step 1 - Data preprocessing

The dataset is already divided into “train” and “test” sets. As we can see, we have a large number of variables.

The only preprocessing steps here are converting the response variable to a factor and removing the variable ‘y’.

Loading the dataset

load("HAR_train.RData")
load("HAR_test.RData")

Converting the response variable to a factor and adding descriptive labels to its levels

HAR_train$V1 <- as.factor(HAR_train$V1)
levels(HAR_train$V1) <- c("walking", "going_up", "going_down", "sitting", "standing", "lying_down")

HAR_test$V1 <- as.factor(HAR_test$V1)
levels(HAR_test$V1) <- c("walking", "going_up", "going_down", "sitting", "standing", "lying_down")

Evaluating the response variable

HAR_train$V1 %>% table
## .
##    walking   going_up going_down    sitting   standing lying_down 
##       1226       1073        986       1286       1374       1407
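
The classes are reasonably balanced; relative frequencies make this easier to see:

HAR_train$V1 %>% table %>% prop.table %>% round(3)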

Removing the variable ‘y’

HAR_train <- subset(HAR_train, select = -y)
HAR_test <- subset(HAR_test, select = -y)

Step 2 - Basic descriptive analysis

Creating boxplots

ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[,2])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[2]) +
  ggtitle("Average acceleration (x) by activity")

ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[,4])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[4]) +
  ggtitle(paste(colnames(HAR_train)[4], "by activity"))

ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[,15])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[15]) +
  ggtitle(paste(colnames(HAR_train)[15], "by activity"))

We can observe from the boxplots that some variables separate the user’s activities much better than others. In the examples above, only the variable in column 15 appears to discriminate clearly between activities.
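
A numeric complement to the boxplots is the per-activity mean of each plotted variable (a quick sketch, assuming dplyr >= 1.0 for across()):

vars_plotted <- colnames(HAR_train)[c(2, 4, 15)]
HAR_train %>%
  dplyr::group_by(V1) %>%
  dplyr::summarise(dplyr::across(dplyr::all_of(vars_plotted), mean))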

Step 3 - Train the model

Since not all variables appear to be effective in explaining the response variable, and since we have a very large number of them, let’s select only the most relevant ones and train the model on those alone.

Let’s first train an initial model (a single decision tree) and use it to identify the most relevant variables.

A convenient method for variable selection is to use the variable importances provided by a traditional tree.

Training our tree

set.seed(1729)
tree <- rpart::rpart(V1 ~ .,
                     data = HAR_train,
                     control = rpart.control(cp = 0.1,      # minimum fit improvement required to keep a split
                                             minsplit = 2,  # minimum observations in a node to attempt a split
                                             maxdepth = 6)  # maximum depth of the tree
)
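
Since rpart.plot is already in our package list, we can also draw the fitted tree to see which variables drive the first splits:

rpart.plot::rpart.plot(tree)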

Selecting the variables

variables <- tree$variable.importance %>% sort(decreasing = TRUE) %>% names

variables <- variables[1:3]

variables
## [1] "X53tGravityAcc.min.X"  "X41tGravityAcc.mean.X" "X559angleXgravityMean"

Configuring the “control” and “search-grid” parameters

control <- caret::trainControl(
  "cv",
  number = 2,
  summaryFunction = multiClassSummary, # Performance evaluation function
  classProbs = TRUE # Required for calculating ROC curve
)

search_grid <- expand.grid(
  nrounds = c(300),
  max_depth = c(2, 8),
  gamma = c(0),
  eta = c(0.05, 0.4),
  colsample_bytree = c(.7),
  min_child_weight = c(1),
  subsample = c(.7)
)
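
expand.grid crosses every combination of the values above, so with two values each for max_depth and eta we get four candidate models, each evaluated with 2-fold cross-validation:

nrow(search_grid) # 2 max_depth x 2 eta = 4 combinations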

Running the model with the “xgbTree” method and our pre-configured parameters

set.seed(1729)
model <- caret::train(
  V1 ~ .,
  data = HAR_train[,c(variables, "V1")],
  method = "xgbTree",
  trControl = control,
  tuneGrid = search_grid,
  verbosity = 0
)

Model output

model
## eXtreme Gradient Boosting 
## 
## 7352 samples
##    3 predictor
##    6 classes: 'walking', 'going_up', 'going_down', 'sitting', 'standing', 'lying_down' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 3676, 3676 
## Resampling results across tuning parameters:
## 
##   eta   max_depth  logLoss    AUC        prAUC      Accuracy   Kappa    
##   0.05  2          0.8614320  0.9098798  0.5205846  0.6787269  0.6135597
##   0.05  8          0.6629906  0.9441747  0.6449938  0.7501360  0.6993790
##   0.40  2          0.7084707  0.9367611  0.5945825  0.7340860  0.6799144
##   0.40  8          0.8768972  0.9419044  0.6232079  0.7487758  0.6977106
##   Mean_F1    Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
##   0.6658070  0.6675688         0.9361256         0.6763717          
##   0.7388447  0.7385160         0.9504036         0.7408079          
##   0.7222443  0.7214551         0.9471287         0.7244638          
##   0.7362391  0.7361125         0.9501505         0.7367487          
##   Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall  Mean_Detection_Rate
##   0.9364651            0.6763717       0.6675688    0.1131211          
##   0.9503042            0.7408079       0.7385160    0.1250227          
##   0.9471083            0.7244638       0.7214551    0.1223477          
##   0.9500570            0.7367487       0.7361125    0.1247960          
##   Mean_Balanced_Accuracy
##   0.8018472             
##   0.8444598             
##   0.8342919             
##   0.8431315             
## 
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.7
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 300, max_depth = 8, eta
##  = 0.05, gamma = 0, colsample_bytree = 0.7, min_child_weight = 1 and
##  subsample = 0.7.
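caret stores the winning combination in the bestTune element of the fitted object; it should match the final values reported above:

model$bestTune
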
Generating predicted classes for the training and test sets

class_HAR_train <- predict(model, HAR_train)
class_HAR_test <- predict(model, HAR_test)

train_accuracy <- sum(class_HAR_train == HAR_train$V1)/nrow(HAR_train)
train_accuracy
## [1] 0.9510337
test_accuracy <- sum(class_HAR_test == HAR_test$V1)/nrow(HAR_test)
test_accuracy
## [1] 0.6202918

The large gap between training accuracy (95.1%) and test accuracy (62.0%) indicates that the model is overfitting the training data.
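
For a per-class view of where the test errors concentrate, caret’s confusionMatrix provides the full cross-tabulation along with sensitivity and specificity by class:

caret::confusionMatrix(class_HAR_test, HAR_test$V1)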