With the advent of wearable devices like Fitbit, gathering vast amounts of personal activity data has become more accessible and affordable. These devices are embraced by the quantified self movement, a community of people who regularly track and measure various aspects of their lives to improve their well-being, find patterns in their behavior, or simply because they are tech enthusiasts. While people often quantify how much of a particular activity they do, they rarely assess how well they do it.
In this project, the objective is to use data captured by accelerometers placed on the belt, forearm, arm, and dumbbell of six participants, who were asked to perform barbell lifts correctly and incorrectly in five different ways. For further details about the dataset, see http://groupware.les.inf.puc-rio.br/har, specifically the section on the Weight Lifting Exercise Dataset.
require(caret)
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
require(corrplot)
## Loading required package: corrplot
## corrplot 0.92 loaded
require(Rtsne)
## Loading required package: Rtsne
require(xgboost)
## Loading required package: xgboost
require(stats)
require(knitr)
## Loading required package: knitr
require(ggplot2)
knitr::opts_chunk$set(cache=TRUE)
To ensure fast and accurate model training, I have opted for XGBoost, an implementation of the tree-based extreme gradient boosting algorithm. XGBoost is known for its efficiency on large datasets and its ability to capture complex relationships and patterns within the data.
# URL of the training and testing data
train.url ="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# file names
train.name = "./data/pml-training.csv"
test.name = "./data/pml-testing.csv"
# if directory does not exist, create new
if (!file.exists("./data")) {
dir.create("./data")
}
# if the files do not exist, download them
if (!file.exists(train.name)) {
download.file(train.url, destfile=train.name, method="curl")
}
if (!file.exists(test.name)) {
download.file(test.url, destfile=test.name, method="curl")
}
# load the CSV files as data.frame
train = read.csv("./data/pml-training.csv")
test = read.csv("./data/pml-testing.csv")
dim(train)
## [1] 19622 160
dim(test)
## [1] 20 160
The raw training data consist of 19,622 observations and 158 candidate features (predictors). One of the columns, named “X”, is simply the row number and is not usable for training the model. The testing data comprise 20 rows with the same 158 features. The target outcome is a single column called “classe”, which contains the class that the model aims to predict.
To preprocess the training data, first extract the target outcome (the activity quality), leaving a training dataset that contains only the predictor variables, i.e. the activity monitor measurements.
# Extract target outcome (label) from training data
outcome.org <- train[["classe"]]
outcome <- as.factor(outcome.org)
# Check the levels of the outcome variable
levels(outcome)
## [1] "A" "B" "C" "D" "E"
The outcome has 5 levels, stored as characters. Convert the outcome to numeric, because the XGBoost gradient booster only recognizes numeric data.
# convert character levels to numeric
num.class = length(levels(outcome))
levels(outcome) = 1:num.class
head(outcome)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3 4 5
The outcome is then removed from the training data.
# remove outcome from train
train$classe = NULL
The features are extracted from the accelerometer data on the belt, forearm, arm, and dumbbell by filtering column names on these keywords.
# filter columns on: belt, forearm, arm, dumbell
filter = grepl("belt|arm|dumbell", names(train))
train = train[, filter]
test = test[, filter]
Instead of a less accurate imputation of missing data, remove all columns that contain NA values.
# remove columns with NA, using the test data as the reference for NA columns
cols.without.na = colSums(is.na(test)) == 0
train = train[, cols.without.na]
test = test[, cols.without.na]
Features with little or no variance carry almost no information for distinguishing the outcome classes; this is the same intuition behind principal component analysis (PCA), which seeks the directions of maximum variance in the data. Before any further preprocessing, check whether any features have near-zero variance and should therefore be removed.
# check for zero variance
zero.var = nearZeroVar(train, saveMetrics=TRUE)
zero.var
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.7781062 FALSE FALSE
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## gyros_belt_x 1.058651 0.7134849 FALSE FALSE
## gyros_belt_y 1.144000 0.3516461 FALSE FALSE
## gyros_belt_z 1.066214 0.8612782 FALSE FALSE
## accel_belt_x 1.055412 0.8357966 FALSE FALSE
## accel_belt_y 1.113725 0.7287738 FALSE FALSE
## accel_belt_z 1.078767 1.5237998 FALSE FALSE
## magnet_belt_x 1.090141 1.6664968 FALSE FALSE
## magnet_belt_y 1.099688 1.5187035 FALSE FALSE
## magnet_belt_z 1.006369 2.3290184 FALSE FALSE
## roll_arm 52.338462 13.5256345 FALSE FALSE
## pitch_arm 87.256410 15.7323412 FALSE FALSE
## yaw_arm 33.029126 14.6570176 FALSE FALSE
## total_accel_arm 1.024526 0.3363572 FALSE FALSE
## gyros_arm_x 1.015504 3.2769341 FALSE FALSE
## gyros_arm_y 1.454369 1.9162165 FALSE FALSE
## gyros_arm_z 1.110687 1.2638875 FALSE FALSE
## accel_arm_x 1.017341 3.9598410 FALSE FALSE
## accel_arm_y 1.140187 2.7367241 FALSE FALSE
## accel_arm_z 1.128000 4.0362858 FALSE FALSE
## magnet_arm_x 1.000000 6.8239731 FALSE FALSE
## magnet_arm_y 1.056818 4.4439914 FALSE FALSE
## magnet_arm_z 1.036364 6.4468454 FALSE FALSE
## roll_forearm 11.589286 11.0895933 FALSE FALSE
## pitch_forearm 65.983051 14.8557741 FALSE FALSE
## yaw_forearm 15.322835 10.1467740 FALSE FALSE
## total_accel_forearm 1.128928 0.3567424 FALSE FALSE
## gyros_forearm_x 1.059273 1.5187035 FALSE FALSE
## gyros_forearm_y 1.036554 3.7763735 FALSE FALSE
## gyros_forearm_z 1.122917 1.5645704 FALSE FALSE
## accel_forearm_x 1.126437 4.0464784 FALSE FALSE
## accel_forearm_y 1.059406 5.1116094 FALSE FALSE
## accel_forearm_z 1.006250 2.9558659 FALSE FALSE
## magnet_forearm_x 1.012346 7.7667924 FALSE FALSE
## magnet_forearm_y 1.246914 9.5403119 FALSE FALSE
## magnet_forearm_z 1.000000 8.5771073 FALSE FALSE
There are no features without variability (all have enough variance), so no further features need to be removed.
Plot the relationship between the features and the outcome. From the plot below, each feature has roughly the same distribution across the 5 outcome levels (A, B, C, D, E).
featurePlot(train, outcome, "strip")
Plot a correlation matrix between features.
A good set of features exhibits low pairwise correlation, i.e. the features are close to orthogonal. The correlation matrix plot below (together with the quick numeric check after it) shows that the average correlation is not excessively high, so I have decided not to apply additional preprocessing such as principal component analysis (PCA).
corrplot.mixed(cor(train), lower="circle", upper="color",
tl.pos="lt", diag="n", order="hclust", hclust.method="complete")
t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces the dimensionality of the multidimensional features and projects them onto a 2D plane while trying to preserve the local structure of the data. In the t-SNE plot below, there is no clear separation or clustering among the five levels of the outcome variable (A, B, C, D, E), so it is difficult to draw conclusions or construct a classification rule by hand.
# t-Distributed Stochastic Neighbor Embedding
tsne = Rtsne(as.matrix(train), check_duplicates=FALSE, pca=TRUE,
perplexity=30, theta=0.5, dims=2)
embedding = as.data.frame(tsne$Y)
embedding$Class = outcome
g = ggplot(embedding, aes(x=V1, y=V2, color=Class)) +
geom_point(size=1.25) +
guides(colour=guide_legend(override.aes=list(size=6))) +
xlab("") + ylab("") +
ggtitle("t-SNE 2D Embedding of 'Classe' Outcome") +
theme_light(base_size=20) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank())
print(g)
Now we build a machine learning model to predict the activity quality (the classe outcome) from the activity monitors (the features, or predictors), using the XGBoost extreme gradient boosting algorithm.
XGBoost supports only numeric matrix data, so convert all training, testing, and outcome data to numeric matrices.
# convert data to matrix
train.matrix = as.matrix(train)
mode(train.matrix) = "numeric"
test.matrix = as.matrix(test)
mode(test.matrix) = "numeric"
# convert outcome from factor to numeric matrix
# xgboost takes multi-labels in [0, numOfClass)
y = as.matrix(as.integer(outcome)-1)
Set the XGBoost parameters for cross-validation and training: a multiclass classification objective as the gradient boosting's learning function, and merror (the multiclass error rate) as the evaluation metric.
# xgboost parameters
param <- list("objective" = "multi:softprob", # multiclass classification
"num_class" = num.class, # number of classes
"eval_metric" = "merror", # evaluation metric
"nthread" = 8, # number of threads to be used
"max_depth" = 16, # maximum depth of tree
"eta" = 0.3, # step size shrinkage
"gamma" = 0, # minimum loss reduction
"subsample" = 1, # part of data instances to grow tree
"colsample_bytree" = 1, # subsample ratio of columns when constructing each tree
"min_child_weight" = 12 # minimum sum of instance weight needed in a child
)
For a good classification, the expected error rate should be below 1%. Estimate the error rate with 4-fold cross-validation, running 200 boosting rounds to reach that target.
# set random seed, for reproducibility
set.seed(1234)
# k-fold cross validation, with timing
nround.cv = 200
system.time( bst.cv <- xgb.cv(param=param, data=train.matrix, label=y,
nfold=4, nrounds=nround.cv, prediction=TRUE, verbose=FALSE) )
## user system elapsed
## 90.928 0.075 91.269
Elapsed time is around 90 seconds (1.5 minutes).
tail(bst.cv$evaluation_log)
## iter train_merror_mean train_merror_std test_merror_mean test_merror_std
## 1: 195 0 0 0.006370624 0.001555904
## 2: 196 0 0 0.006217719 0.001516092
## 3: 197 0 0 0.006217719 0.001516092
## 4: 198 0 0 0.006268698 0.001528951
## 5: 199 0 0 0.006268698 0.001528951
## 6: 200 0 0 0.006166782 0.001494608
# index of minimum merror
evaluation_df <- as.data.frame(bst.cv$evaluation_log)
min.merror.idx <- which.min(evaluation_df$test_merror_mean)
min.merror.idx
## [1] 184
bst.cv$evaluation_log[min.merror.idx,]
## iter train_merror_mean train_merror_std test_merror_mean test_merror_std
## 1: 184 0 0 0.006166782 0.001494608
The cross-validation's minimum error rate (test_merror_mean) is around 0.006 (0.6%), reached at the 184th iteration.
Tabulate the model's cross-validation predictions against the true labels.
# get CV's prediction decoding
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/num.class, ncol=num.class)
pred.cv = max.col(pred.cv, "last")
# confusion matrix
confusionMatrix(factor(y+1), factor(pred.cv))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5567 10 2 1 0
## 2 15 3761 21 0 0
## 3 2 18 3388 14 0
## 4 0 0 22 3190 4
## 5 0 1 0 11 3595
##
## Overall Statistics
##
## Accuracy : 0.9938
## 95% CI : (0.9926, 0.9949)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9922
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9970 0.9923 0.9869 0.9919 0.9989
## Specificity 0.9991 0.9977 0.9979 0.9984 0.9993
## Pos Pred Value 0.9977 0.9905 0.9901 0.9919 0.9967
## Neg Pred Value 0.9988 0.9982 0.9972 0.9984 0.9998
## Prevalence 0.2846 0.1932 0.1750 0.1639 0.1834
## Detection Rate 0.2837 0.1917 0.1727 0.1626 0.1832
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9980 0.9950 0.9924 0.9952 0.9991
The confusion matrix shows that the correct predictions are concentrated on the diagonal, as expected. The overall accuracy is 99.38%, i.e. an error rate of 0.62%, so the expected error rate of less than 1% is met.
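As a sanity check (not part of the original output), the overall cross-validation accuracy and error rate can also be computed directly from the CV predictions; cv.accuracy and cv.error are illustrative names, and the result should agree with the confusionMatrix summary above.
# cross-check overall CV accuracy and error rate against the confusion matrix
cv.accuracy = mean(pred.cv == (as.vector(y) + 1))
cv.error = 1 - cv.accuracy
c(accuracy = cv.accuracy, error = cv.error)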
Fit the XGBoost gradient boosting model on all of the training data.
system.time( bst <- xgboost(param=param, data=train.matrix, label=y,
nrounds=min.merror.idx, verbose=0) )
## user system elapsed
## 28.400 0.020 28.455
Time elapsed is around 28 seconds.
# xgboost predict test data using the trained model
pred <- predict(bst, test.matrix)
head(pred, 10)
## [1] 7.850143e-04 9.969912e-01 1.699796e-03 1.538672e-04 3.701838e-04
## [6] 9.991676e-01 5.930706e-04 2.202196e-04 5.147834e-06 1.402383e-05
The prediction output is the predicted probability for each of the 5 outcome levels (columns). Decode the numeric levels back to the qualitative letters (A, B, C, D, E).
# decode prediction
pred = matrix(pred, nrow=num.class, ncol=length(pred)/num.class)
pred = t(pred)
pred = max.col(pred, "last")
pred.char = toupper(letters[pred])
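# (illustrative addition, not part of the original analysis)
# quick look at the decoded test-set predictions
head(pred.char)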
# get the trained model
model = xgb.dump(bst, with_stats=TRUE)
# get the feature real names
names = dimnames(train.matrix)[[2]]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=bst)
# plot
gp = xgb.plot.importance(importance_matrix)
print(gp)
## Feature Gain Cover Frequency Importance
## 1: roll_belt 0.179693469 0.121764440 0.057974580 0.179693469
## 2: pitch_forearm 0.104947328 0.092465767 0.059614596 0.104947328
## 3: yaw_belt 0.102475210 0.095275090 0.091266913 0.102475210
## 4: pitch_belt 0.092135506 0.071272885 0.067650677 0.092135506
## 5: roll_forearm 0.069670770 0.064869754 0.057236572 0.069670770
## 6: accel_forearm_z 0.037746487 0.026622591 0.033128331 0.037746487
## 7: magnet_belt_z 0.037438375 0.039328015 0.027634276 0.037438375
## 8: accel_forearm_x 0.034442853 0.026717664 0.021238212 0.034442853
## 9: magnet_forearm_z 0.029086592 0.025872505 0.031570316 0.029086592
## 10: yaw_arm 0.024875200 0.020199864 0.022222222 0.024875200
## 11: roll_arm 0.020061558 0.022156598 0.037884379 0.020061558
## 12: gyros_belt_z 0.018613804 0.037328584 0.018696187 0.018613804
## 13: accel_belt_z 0.018396164 0.021744467 0.025256253 0.018396164
## 14: magnet_belt_x 0.018055847 0.021371456 0.026486265 0.018055847
## 15: gyros_arm_y 0.017025835 0.024099412 0.024846248 0.017025835
## 16: magnet_arm_z 0.016503448 0.025150016 0.024272243 0.016503448
## 17: magnet_forearm_x 0.014500613 0.017758645 0.022140221 0.014500613
## 18: total_accel_forearm 0.014454439 0.011082587 0.010332103 0.014454439
## 19: magnet_belt_y 0.013472704 0.019088688 0.018204182 0.013472704
## 20: yaw_forearm 0.012920827 0.011525266 0.023124231 0.012920827
## 21: magnet_forearm_y 0.012472699 0.016735308 0.021894219 0.012472699
## 22: magnet_arm_x 0.012335533 0.006822218 0.011890119 0.012335533
## 23: magnet_arm_y 0.011368374 0.019503077 0.021074211 0.011368374
## 24: gyros_forearm_z 0.010087971 0.012498078 0.014596146 0.010087971
## 25: accel_arm_y 0.009375662 0.012738708 0.017466175 0.009375662
## 26: accel_forearm_y 0.009330366 0.012748419 0.022140221 0.009330366
## 27: gyros_arm_x 0.008447371 0.015855702 0.024518245 0.008447371
## 28: gyros_forearm_y 0.008301205 0.013944945 0.025830258 0.008301205
## 29: accel_arm_x 0.007769142 0.016177037 0.020500205 0.007769142
## 30: pitch_arm 0.007106283 0.012049470 0.022878229 0.007106283
## 31: gyros_forearm_x 0.005417408 0.009446135 0.016072161 0.005417408
## 32: accel_arm_z 0.005278959 0.008382962 0.019188192 0.005278959
## 33: gyros_belt_x 0.002989696 0.009767955 0.016810168 0.002989696
## 34: gyros_belt_y 0.002935031 0.013814657 0.007462075 0.002935031
## 35: accel_belt_x 0.002510063 0.008031963 0.010496105 0.002510063
## 36: total_accel_arm 0.002334722 0.003809815 0.008938089 0.002334722
## 37: accel_belt_y 0.002290135 0.005811711 0.004920049 0.002290135
## 38: total_accel_belt 0.001786903 0.002294077 0.003116031 0.001786903
## 39: gyros_arm_z 0.001345448 0.003873466 0.009430094 0.001345448
## Feature Gain Cover Frequency Importance
The feature importance plot is useful for selecting only the features that contribute most to predicting the outcome. To improve model fitting performance (training time, or to reduce overfitting), the less important features could be removed, for example as sketched below.
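For example, one could keep only the highest-gain features and retrain the model. The sketch below assumes an illustrative cutoff of the top 20 features; top.n, top.features, train.matrix.top and bst.top are hypothetical names, and this step was not performed in the original analysis.
# keep only the top-n features by gain and rebuild the training matrix
top.n = 20
top.features = importance_matrix$Feature[1:top.n]
train.matrix.top = train.matrix[, top.features]
# retrain with the same parameters and number of rounds
bst.top = xgboost(param=param, data=train.matrix.top, label=y,
                  nrounds=min.merror.idx, verbose=0)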
In this project, an XGBoost algorithm was implemented for multi-class classification. The training data was used to train the model, and cross-validation was performed to assess its performance. The trained model was then applied to the test data to make predictions. Additionally, the feature importance was computed and visualized to gain insights into the significance of different features in the classification task.