1. Introduction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise, as labelled by the classe variable. The Weight Lifting Exercise Dataset[1] is used in this project.

The main objectives of this project are as follows:

  1. Preprocess the weight lifting exercise dataset;
  2. Use the XGBoost algorithm to build a prediction model;
  3. Evaluate the model's performance, including accuracy, out-of-sample error, and the feature importance list;
  4. Use the prediction model to predict the test data.

2. Data Preprocessing

2.1 Downloading data sets and reading the files

trainingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"

testingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

if (!file.exists("pml_training.csv")){
        download.file(trainingURL, 
                      destfile = "pml_training.csv")}

if (!file.exists("pml_testing.csv")){
        download.file(testingURL, 
                      destfile = "pml_testing.csv")}

# Missing values "#DIV/0!", "", or "NA" are all coded as NA for consistency
pml_building <- read.csv("pml_training.csv", na.strings = c("","NA","#DIV/0!"))
pml_testing <- read.csv("pml_testing.csv", na.strings = c("","NA","#DIV/0!"))

# pml_building will be used to build the prediction model, whereas pml_testing will be used to predict the 'classe' variable. Both datasets will be cleaned first to remove any missing or irrelevant data.

2.2 Loading packages to be used

library(knitr)
library(caret)
library(xgboost)
library(Matrix)
library(DiagrammeR)
opts_chunk$set(echo = TRUE, fig.width=7, fig.height=5, dev='svg', fig.path='Figs/', cache=TRUE)

options(scipen=999)

2.3 Cleaning the data

There are 160 variables in total in the pml_building data set, some of which contain many NAs. To build a sound prediction model, we remove columns containing more than 95% NAs, as well as the columns holding the row number, user name, timestamps, and window information, which are unlikely to be useful predictors.

## 160 variables in the data sets and some with many NAs
str(pml_building)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : logi  NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : logi  NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]
dim(pml_building)
## [1] 19622   160
## Keep columns containing less than 5% NAs
building <- pml_building[lapply(pml_building, function(x) sum(is.na(x)) / length(x) ) < 0.05]

testing <- pml_testing[lapply(pml_testing, function(x) sum(is.na(x)) / length(x)) < 0.05]

## Remove unlikely predictor variables, i.e., row number, user name, times, and window 
building_data <- building[,-c(1:7)]
testing_data <- testing[,-c(1:7)]
dim(building_data); dim(testing_data)
## [1] 19622    53
## [1] 20 53

After cleaning, 53 variables remain in building_data (including classe) and in testing_data (including problem_id). In the following section, we employ the XGBoost algorithm to train a prediction model on building_data and test the model's performance.
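
Because the testing set was filtered on its own NA percentages, it is worth confirming that the two cleaned data sets keep the same predictor columns. A minimal sanity check, sketched with the objects created above:

## The only difference between the cleaned data sets should be the outcome
## column: 'classe' in building_data versus 'problem_id' in testing_data
setdiff(names(building_data), names(testing_data))
setdiff(names(testing_data), names(building_data))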

3. Modelling using the XGBoost algorithm

XGBoost is used in this project because it is generally more powerful than a traditional Random Forest[2]. It implements the gradient boosting decision tree algorithm and is one of the most popular implementations used in Kaggle scripts to solve data science challenges.

3.1 Random Subsampling

The building_data dataset is randomly split into two subsets using the createDataPartition() function in the caret package: 70% of the original data is used for training, and 30% is used for testing.

## Unclass the factor 'classe' as XGBoost works only with numeric data;
## the labels A-E become the integers 1-5
building_data$classe <- unclass(building_data$classe)

## Subsampling
set.seed(23166)
inTrain <- createDataPartition(y = building_data$classe,
                               p = .7, list = FALSE)
subtraining <- building_data[inTrain,]
subtesting <- building_data[-inTrain,]

## Construct XGBoost input type; XGBoost is optimized for sparse input; label is the outcome of our dataset
trainingM <- sparse.model.matrix(classe ~ .-1, data = subtraining)
trainingM <- xgb.DMatrix(trainingM, label = subtraining$classe)

testingM <- sparse.model.matrix(classe ~ .-1, data = subtesting)
testingM <- xgb.DMatrix(testingM, label = subtesting$classe)

3.2 Fine-tuning parameters

We run 5-fold cross-validation 10 times, each time with parameters drawn at random under a different seed. The best parameter set is the one with the smallest mean evaluation error on the held-out CV folds. The xgb.cv() function in the xgboost package is used to cross-validate the model's performance; if the held-out error fails to improve for 8 consecutive rounds, training stops early.

best_param = list()
best_seednumber = 1234
best_test_merror_mean = Inf
best_test_merror_mean_index = 0

for (iter in 1:10) {
    param <- list(objective = "multi:softmax",
                  eval_metric = "merror",
                  num_class = 6,
                  max_depth = sample(6:10, 1),
                  eta = runif(1, .01, .3),
                  gamma = runif(1, 0.0, 0.2),
                  subsample = runif(1, .6, .9),
                  colsample_bytree = runif(1, .5, .8),
                  min_child_weight = sample(1:40, 1),
                  max_delta_step = sample(1:10, 1)
                  )

    seed.number = sample.int(10000, 1)[[1]]
    set.seed(seed.number)
    bst_cv <- xgb.cv(data = trainingM,
                     params = param,
                     nthread = 3,
                     nfold = 5,
                     nrounds = 200,
                     verbose = FALSE,
                     early_stopping_rounds = 8,
                     maximize = FALSE)

    min_test_merror_mean = min(bst_cv$evaluation_log[, test_merror_mean])
    min_test_merror_mean_index = which.min(bst_cv$evaluation_log[, test_merror_mean])

    if (min_test_merror_mean < best_test_merror_mean) {
        best_test_merror_mean = min_test_merror_mean
        best_test_merror_mean_index = min_test_merror_mean_index
        best_seednumber = seed.number
        best_param = param
    }
}

best_parameter <- c(best_test_merror_mean_index,best_test_merror_mean,best_seednumber)
names(best_parameter) = c("nround","CV test error","Seed number")
kable(head(best_parameter), format = "html", align = "c", caption = "Best Parameter Set")
Best Parameter Set

nround           107.0000000
CV test error      0.0053142
Seed number     1710.0000000

The lowest CV-based evaluation error is 0.0053142, which occurs at seed number 1710 with nround = 107.
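
For completeness, the full hyperparameter set selected by the random search is stored in best_param and can be tabulated as well; a short sketch (the exact values depend on the random draws in the loop above):

## Tabulate the winning hyperparameter set from the random search
kable(data.frame(parameter = names(best_param),
                 value = unlist(best_param, use.names = FALSE)),
      format = "html", align = "c", caption = "Tuned Hyperparameters")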

3.3 Training the model

We use nround = 107 and seed number 1710 to train the model with the xgb.train() function in the xgboost package.

nround = best_test_merror_mean_index
set.seed(best_seednumber)
bst_Model <- xgb.train(data = trainingM, 
                params=best_param, 
                nrounds = nround, 
                nthread = 3)
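
Before looking at the held-out data, the model's accuracy on the training subset itself can be computed as a quick sanity check; a sketch using the objects above (training accuracy is expected to be optimistic compared with validation accuracy):

## Quick sanity check: accuracy on the data the model was trained on
train_pred <- predict(bst_Model, trainingM)
mean(train_pred == subtraining$classe)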

3.4 Evaluating performance

We calculate the accuracy of the model on the validation data set testingM.

testing_pred <- predict(bst_Model, testingM)
## confusionMatrix() expects factors, so coerce the predictions and the reference
accuracy.validation <- confusionMatrix(factor(testing_pred),
                                       factor(subtesting$classe))$overall[1]
accuracy.validation
##  Accuracy 
## 0.9969414

As the output shows, the accuracy is high (>99%), which indicates that the final model generalises well. The estimated out-of-sample error is therefore 1 - 0.9969414 = 0.0030586.
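
The out-of-sample error quoted above is simply one minus the validation accuracy; the confusion matrix also reports per-class statistics. A short sketch using the objects computed above:

## Out-of-sample error estimated from the held-out 30%
1 - accuracy.validation

## Per-class sensitivity and specificity on the validation set
cm <- confusionMatrix(factor(testing_pred), factor(subtesting$classe))
cm$byClass[, c("Sensitivity", "Specificity")]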

To further examine the model, we plot the 10 most important features according to their gain values. Gain measures how much a feature improves the purity of the splits in the decision trees where it is used; the higher the gain, the more the feature contributes to the model. In this model, yaw_belt, roll_belt, and pitch_forearm are the three most important features for predicting the classe variable.

importance <- xgb.importance(dimnames(trainingM)[[2]],
                             model = bst_Model)

xgb.plot.importance(importance[1:10], 
                    xlab = "Gain")

The first two trees of the model are plotted below (tree indices are 0-based in xgb.plot.tree).

## Plot the first two trees of the boosted model (0-based indices)
xgb.plot.tree(feature_names = dimnames(trainingM)[[2]], 
              model = bst_Model,
              trees = 0:1,
              plot_height = 1200, 
              plot_width = 1000)

4. Predicting the testing_data data set

In the following code, we use the model obtained above to predict the 20 test cases from pml_testing.csv (held in testing_data after cleaning).

## Build the model matrix for the 20 test cases (problem_id stands in for classe)
testM <- sparse.model.matrix(problem_id ~ .-1, 
                             data = testing_data)
test_pred <- predict(bst_Model, testM)
## Map the numeric labels 1-5 back to the original classes A-E
prediction_results <- chartr("12345", "ABCDE", test_pred)
prediction_results
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
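
To make the results easier to read, the predicted classes can be paired with their problem IDs; a minimal sketch using the objects above:

## Pair each problem ID with its predicted class
data.frame(problem_id = testing_data$problem_id,
           prediction = prediction_results)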

5. Conclusion

In this project, we used the XGBoost algorithm to build a prediction model based on 19622 observations from the Weight Lifting Exercise Dataset[1].

70% of the observations were used to train the model, while the remaining 30% were held out for validation. To tune the hyperparameters, we ran 5-fold cross-validation 10 times. The final model achieves an overall accuracy of 0.9969414 on the validation set, giving an out-of-sample error of 0.0030586.

We believe the model is well suited to predicting the manner in which the exercise was performed.

References

[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

[2] Srivastava, T. How to use XGBoost algorithm in R in easy steps. January 22, 2016.

[3] The Comprehensive R Archive Network - R Project. XGBoost R Tutorial.