Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise, as labelled by the classe variable. The Weight Lifting Exercise Dataset[1] is used in this project.
The main objectives of this project are to use the XGBoost algorithm to build a prediction model for the classe variable and to estimate its out-of-sample error.
trainingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml_training.csv")){
download.file(trainingURL,
destfile = "pml_training.csv")}
if (!file.exists("pml_testing.csv")){
download.file(testingURL,
destfile = "pml_testing.csv")}
# Missing values "#DIV/0!", "", or "NA" are all coded as NA for consistency
pml_building <- read.csv("pml_training.csv", na.strings = c("","NA","#DIV/0!"))
pml_testing <- read.csv("pml_testing.csv", na.strings = c("","NA","#DIV/0!"))
# pml_building will be used to build the prediction model, whereas pml_testing will be used to predict the 'classe' variable. Both datasets will be cleaned first to remove any missing or irrelevant data.
library(knitr)
library(caret)
library(xgboost)
library(Matrix)
library(DiagrammeR)
opts_chunk$set(echo = TRUE, fig.width=7, fig.height=5, dev='svg', fig.path='Figs/', cache=TRUE)
options(scipen=999)
There are 160 variables in total in the pml_building data set, and some of them contain mostly NAs. To build a good prediction model, we remove the columns containing more than 95% NAs, as well as the columns holding the row number, user name, timestamps, and window indicators, which are unlikely to be useful predictors.
## 160 variables in the data sets and some with many NAs
str(pml_building)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_belt : logi NA NA NA NA NA NA ...
## $ skewness_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_belt.1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_belt : logi NA NA NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ skewness_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ skewness_yaw_dumbbell : logi NA NA NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
dim(pml_building)
## [1] 19622 160
## Keep columns containing less than 5% NAs
building <- pml_building[lapply(pml_building, function(x) sum(is.na(x)) / length(x) ) < 0.05]
testing <- pml_testing[lapply(pml_testing, function(x) sum(is.na(x)) / length(x)) < 0.05]
## Remove unlikely predictor variables, i.e., row number, user name, times, and window
building_data <- building[,-c(1:7)]
testing_data <- testing[,-c(1:7)]
dim(building_data); dim(testing_data)
## [1] 19622 53
## [1] 20 53
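As a quick sanity check (an illustrative step, not part of the original analysis), we can confirm that the cleaned data sets are NA-free and differ only in their outcome column:
## Sanity check: the cleaned data sets should contain no NAs
sum(is.na(building_data))  # expected: 0
sum(is.na(testing_data))   # expected: 0
## The only column difference should be the outcome:
## building_data has 'classe', testing_data has 'problem_id'
setdiff(names(building_data), names(testing_data))
setdiff(names(testing_data), names(building_data))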
After cleaning, building_data and testing_data each contain 53 variables (the outcome column is classe in building_data and problem_id in testing_data). In the following section, we use the XGBoost algorithm to train a prediction model on building_data and evaluate its performance.
XGBoost algorithm
XGBoost is used in this project because it is often more powerful than a traditional Random Forest[2]. It implements the gradient boosting decision tree algorithm and is one of the most popular implementations used in Kaggle scripts to solve data science challenges.
The building_data dataset is randomly split into two subsets using the createDataPartition() function from the caret package: 70% of the observations are used for training and the remaining 30% for validation.
## Unclass the factor 'classe' as XGBoost works only with numeric data
building_data$classe <- unclass(building_data$classe)
## Subsampling
set.seed(23166)
inTrain <- createDataPartition(y = building_data$classe,
p = .7, list = FALSE)
subtraining <- building_data[inTrain,]
subtesting <- building_data[-inTrain,]
## Construct XGBoost input type; XGBoost is optimized for sparse input; label is the outcome of our dataset
trainingM <- sparse.model.matrix(classe ~ .-1, data = subtraining)
trainingM <- xgb.DMatrix(trainingM, label = subtraining$classe)
testingM <- sparse.model.matrix(classe ~ .-1, data = subtesting)
testingM <- xgb.DMatrix(testingM, label = subtesting$classe)
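As a side note (an illustrative check, not part of the original pipeline), unclass() maps the factor levels A-E to the integers 1-5, which is why the final predictions are later translated back with chartr(); the 70/30 split can also be verified quickly:
## Numeric labels 1..5 correspond to the original classes A..E
table(subtraining$classe)
## Proportions of the 70/30 split
nrow(subtraining) / nrow(building_data)  # roughly 0.7
nrow(subtesting) / nrow(building_data)   # roughly 0.3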
We run 5-fold cross-validation 10 times, each time with a randomly drawn parameter set and a random seed. The best parameter set is the one with the smallest mean CV test error. The xgb.cv() function from the xgboost package is used to cross-validate model performance; if the test error fails to improve for 8 consecutive rounds, training stops early.
best_param = list()
best_seednumber = 1234
best_test_merror_mean = Inf
best_test_merror_mean_index = 0
for (iter in 1:10) {
param <- list(objective = "multi:softmax",
eval_metric = "merror",
num_class = 6,
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3),
gamma = runif(1, 0.0, 0.2),
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
seed.number = sample.int(10000, 1)[[1]]
set.seed(seed.number)
bst_cv <- xgb.cv(data = trainingM,
params = param,
nthread = 3,
nfold = 5,
nrounds = 200,
verbose = FALSE,
early_stopping_rounds = 8,
maximize = FALSE)
min_test_merror_mean = min(bst_cv$evaluation_log[, test_merror_mean])
min_test_merror_mean_index = which.min(bst_cv$evaluation_log[, test_merror_mean])
if (min_test_merror_mean < best_test_merror_mean) {
best_test_merror_mean = min_test_merror_mean
best_test_merror_mean_index = min_test_merror_mean_index
best_seednumber = seed.number
best_param = param
}
}
best_parameter <- c(best_test_merror_mean_index,best_test_merror_mean,best_seednumber)
names(best_parameter) = c("nround","CV test error","Seed number")
kable(head(best_parameter), format = "html", align = "c", caption = "Best Parameter Set")
| Parameter      | Value     |
|:---------------|----------:|
| nround         | 107       |
| CV test error  | 0.0053142 |
| Seed number    | 1710      |
The lowest CV test error is 0.0053142, which occurs with seed number 1710 and nround = 107.
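The table above reports only nround, the CV test error, and the seed; the full set of tuned hyperparameters is kept in best_param and can be inspected with, for example:
## Inspect the winning hyperparameter set found by the random search
str(best_param)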
We use nround = 107 and seed number 1710 to train the final model with the xgb.train() function from the xgboost package.
nround = best_test_merror_mean_index
set.seed(best_seednumber)
bst_Model <- xgb.train(data = trainingM,
params=best_param,
nrounds = nround,
nthread = 3)
We calculate the accuracy of the model on the validation data set testingM.
testing_pred <- predict(bst_Model,testingM)
## confusionMatrix() expects factor inputs, so the numeric predictions and labels are converted
accuracy.validation <- confusionMatrix(factor(testing_pred), factor(subtesting$classe))$overall[1]
accuracy.validation
## Accuracy
## 0.9969414
As the output shows, the accuracy is high (>99%), which indicates that the final model is quite robust. The corresponding estimated out-of-sample error is 1 minus the validation accuracy, i.e. 0.0030586.
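A minimal check of that figure:
## Estimated out-of-sample error = 1 - validation accuracy
unname(1 - accuracy.validation)
## [1] 0.0030586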
To further examine the model, we plot the 10 most important features according to their gain. Gain measures how much a feature improves the purity of the tree branches it is used to split, i.e. how much information it contributes to the model. In this model, yaw_belt, roll_belt, and pitch_forearm are the three most important features for predicting the classe variable.
importance <- xgb.importance(dimnames(trainingM)[[2]],
model = bst_Model)
xgb.plot.importance(importance[1:10],
xlab = "Gain")
The first two trees of the model are plotted below.
xgb.plot.tree(feature_names = dimnames(trainingM)[[2]],
model = bst_Model,
trees = 0:1, # tree indices are 0-based, so 0:1 plots the first two trees
plot_height = 1200,
plot_width = 1000)
Prediction on the testing_data data set
In the following code, we use the model obtained above to predict the testing data from pml_testing.csv.
testM <- sparse.model.matrix(problem_id ~ .-1,
data = testing_data)
test_pred <- predict(bst_Model, testM)
prediction_results <- chartr("12345", "ABCDE", test_pred)
prediction_results
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
In this project, we used the XGBoost algorithm to build a prediction model based on 19622 observations from the Weight Lifting Exercise Dataset.
70% of the observations were used to train the model, while the remaining 30% were held out for validation; parameters were tuned with 5-fold cross-validation repeated 10 times on the training portion. The final model has an overall accuracy of 0.9969414 on the validation set, corresponding to an estimated out-of-sample error of 0.0030586.
We believe the model is well suited for this prediction task.
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
[2] Srivastava, T. How to use XGBoost algorithm in R in easy steps. January 22, 2016.
[3] The Comprehensive R Archive Network - R Project. XGBoost R Tutorial.