XGBoost (eXtreme Gradient Boosting) is a machine learning technique based on decision trees that uses a boosting algorithm to enhance predictive performance. It combines many weak decision trees into a single stronger model, which makes it one of the most popular approaches for classification and regression problems.
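To make the idea concrete, here is a minimal sketch of fitting boosted trees with the xgboost package called directly (the mtcars toy data and the parameter values below are illustrative assumptions; the exercise that follows uses caret’s xgbTree interface, which calls the same package under the hood):
library(xgboost)
# Toy binary example: predict transmission type (am) from the other mtcars columns
X <- as.matrix(mtcars[, setdiff(names(mtcars), "am")])
y <- mtcars$am
dtrain <- xgb.DMatrix(data = X, label = y)
fit <- xgb.train(
  params = list(objective = "binary:logistic",  # boosted trees for a binary target
                max_depth = 3,                  # depth of each weak tree
                eta = 0.1),                     # learning rate
  data = dtrain,
  nrounds = 50                                  # number of boosting rounds (trees)
)
head(predict(fit, dtrain))  # predicted probabilities of am == 1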
Let’s do a classification exercise using information recorded by the gyroscope of a cellphone. These signals capture the acceleration of different types of movement, both rotational and translational, along the X, Y, and Z axes. Our output variable is therefore the activity being performed by the cellphone user, namely: 1 - walking; 2 - going upstairs; 3 - going downstairs; 4 - sitting; 5 - standing; 6 - lying down.
pacotes <- c('tidyverse','viridis','rpart','rpart.plot','gtools','Rmisc','scales','caret','shapr','gamlss','gamlss.add','mlbench','reshape')
# Install any packages that are missing, then load all of them
instalador <- pacotes[!pacotes %in% installed.packages()]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
The dataset is already divided into “train” and “test” sets. As we can see, we have a lot of variables.
The only preprocessing steps here are converting the response variable (V1) to a factor with descriptive labels and removing the redundant variable ‘y’.
load("HAR_train.RData")
load("HAR_test.RData")
HAR_train$V1 <- as.factor(HAR_train$V1)
levels(HAR_train$V1) <- c("walking", "going_up", "going_down", "sitting", "standing", "lying_down")
HAR_test$V1 <- as.factor(HAR_test$V1)
levels(HAR_test$V1) <- c("walking", "going_up", "going_down", "sitting", "standing", "lying_down")
HAR_train$V1 %>% table
## .
## walking going_up going_down sitting standing lying_down
## 1226 1073 986 1286 1374 1407
# Drop the separate response column 'y', keeping V1 as the only label
HAR_train <- subset(HAR_train, select = -y)
HAR_test <- subset(HAR_test, select = -y)
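A quick check of the dimensions of the two data frames makes the point about the large number of variables concrete:
dim(HAR_train)      # rows x columns in the training set
dim(HAR_test)       # rows x columns in the test set
table(HAR_test$V1)  # class balance in the test set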
ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[, 2])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[2]) +
  ggtitle("Average acceleration (x) by activity")
ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[, 4])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[4]) +
  ggtitle(paste(colnames(HAR_train)[4], "by activity"))
ggplot(HAR_train) +
  geom_boxplot(aes(x = V1, y = HAR_train[, 15])) +
  xlab("Activity performed") +
  ylab(colnames(HAR_train)[15]) +
  ggtitle(paste(colnames(HAR_train)[15], "by activity"))
We can observe from our boxplots that certain variables are more effective than others at describing the user’s activities. In the examples above, only the variable in column 15 appears to separate the activities clearly.
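To compare the three variables side by side rather than in separate plots, one option is a faceted version; the sketch below relies on the tidyverse functions already loaded and reuses the same column indices (2, 4, and 15) plotted above:
HAR_train %>%
  dplyr::select(V1, all_of(colnames(HAR_train)[c(2, 4, 15)])) %>%
  pivot_longer(-V1, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = V1, y = value)) +
  geom_boxplot() +                           # one boxplot per activity within each facet
  facet_wrap(~ feature, scales = "free_y") +
  xlab("Activity performed")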
Since not all variables appear to be effective in explaining the response variable, and since we have a very large number of variables, let’s select only the most relevant ones and train a model using just those.
Let’s first train an initial model (a single tree) so that we can identify the most relevant variables from it.
A convenient method for variable selection is to use the ‘variable importances’ provided by a traditional decision tree.
set.seed(1729)
tree <- rpart::rpart(V1 ~ .,
                     data = HAR_train,
                     control = rpart.control(cp = 0.1,
                                             minsplit = 2,
                                             maxdepth = 6))
variables <- tree$variable.importance %>% sort(decreasing = TRUE) %>% names
variables <- variables[1:3]  # keep only the three most important predictors
variables
## [1] "X53tGravityAcc.min.X" "X41tGravityAcc.mean.X" "X559angleXgravityMean"
control <- caret::trainControl(
  "cv",
  number = 2,
  summaryFunction = multiClassSummary,  # performance evaluation function
  classProbs = TRUE                     # required for calculating the ROC curve
)
search_grid <- expand.grid(
  nrounds = c(300),
  max_depth = c(2, 8),
  gamma = c(0),
  eta = c(0.05, 0.4),
  colsample_bytree = c(.7),
  min_child_weight = c(1),
  subsample = c(.7)
)
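With two values for max_depth and two for eta (and everything else held fixed), the grid expands to four candidate combinations, each evaluated by the 2-fold cross-validation defined above:
nrow(search_grid)  # 2 (max_depth) x 2 (eta) = 4 combinations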
set.seed(1729)
model <- caret::train(
  V1 ~ .,
  data = HAR_train[, c(variables, "V1")],  # only the three selected predictors plus the response
  method = "xgbTree",
  trControl = control,
  tuneGrid = search_grid,
  verbosity = 0
)
model
## eXtreme Gradient Boosting
##
## 7352 samples
## 3 predictor
## 6 classes: 'walking', 'going_up', 'going_down', 'sitting', 'standing', 'lying_down'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 3676, 3676
## Resampling results across tuning parameters:
##
## eta max_depth logLoss AUC prAUC Accuracy Kappa
## 0.05 2 0.8614320 0.9098798 0.5205846 0.6787269 0.6135597
## 0.05 8 0.6629906 0.9441747 0.6449938 0.7501360 0.6993790
## 0.40 2 0.7084707 0.9367611 0.5945825 0.7340860 0.6799144
## 0.40 8 0.8768972 0.9419044 0.6232079 0.7487758 0.6977106
## Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.6658070 0.6675688 0.9361256 0.6763717
## 0.7388447 0.7385160 0.9504036 0.7408079
## 0.7222443 0.7214551 0.9471287 0.7244638
## 0.7362391 0.7361125 0.9501505 0.7367487
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9364651 0.6763717 0.6675688 0.1131211
## 0.9503042 0.7408079 0.7385160 0.1250227
## 0.9471083 0.7244638 0.7214551 0.1223477
## 0.9500570 0.7367487 0.7361125 0.1247960
## Mean_Balanced_Accuracy
## 0.8018472
## 0.8444598
## 0.8342919
## 0.8431315
##
## Tuning parameter 'nrounds' was held constant at a value of 300
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.7
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 0.7
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 300, max_depth = 8, eta
## = 0.05, gamma = 0, colsample_bytree = 0.7, min_child_weight = 1 and
## subsample = 0.7.
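To inspect the tuning results programmatically instead of reading the printed summary, caret stores them in the fitted object:
model$bestTune  # hyperparameter combination selected by the grid search
model$results   # resampled performance for every combination in the grid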
class_HAR_train <- predict(model, HAR_train)
class_HAR_test <- predict(model, HAR_test)
train_accuracy <- sum(class_HAR_train == HAR_train$V1)/nrow(HAR_train)
train_accuracy
## [1] 0.9510337
test_accuracy <- sum(class_HAR_test == HAR_test$V1)/nrow(HAR_test)
test_accuracy
## [1] 0.6202918
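The gap between the training accuracy (about 95%) and the test accuracy (about 62%) shows the model fits the training data far better than it generalizes. For a per-class view of where the test errors occur, caret’s confusion matrix is a natural next step:
# Per-class performance of the boosted model on the test set
caret::confusionMatrix(class_HAR_test, HAR_test$V1)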