Executive Summary

This detailed analysis has been performed to fulfill the requirements of the course project for the Practical Machine Learning course on Coursera. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The main objectives of this project are as follows

Data retrieval, clearing and transformation

Setting up required environment in R

In the following code segment, we set the required global options and load the required packages in R.

library(knitr)
library(e1071)
library(xgboost)
library(Matrix)
library(methods)
library(caret)
library(dplyr)
library(Metrics)

set.seed(111)


Load data

The links for the training and test data are given below:

First of all we load them:

if (!file.exists("training.csv")) {
    download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", 
                  destfile = "training.csv")
}
if (!file.exists("testing.csv")) {
    download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", 
                  destfile = "testing.csv")
}

training <- read.csv("training.csv", na.strings=c("NA", "", "#DIV/0!"), stringsAsFactors=FALSE)
testing <- read.csv("testing.csv", na.strings=c("NA", "", "#DIV/0!"), stringsAsFactors=FALSE)


Clear data

Whole dataset devided on 1-second windows and a lot of accelerometers have data only once in each window. There are a lot of columns, where dataset has values only at the first row of 1-second window. All another rows have NA in these columns. I think, this columns have constant values for whole 1-second window, but anyway in this course project I will simply remove these columns.

# for training dataset
columnNACounts <- colSums(is.na(training))        # getting NA counts for all columns
badColumns <- columnNACounts >= 19000             # ignoring columns with majority NA values
training <- training[!badColumns]                 # getting clean data
sum(is.na(training))                              # checking for NA values
## [1] 0
# for testing dataset
columnNACounts <- colSums(is.na(testing))
badColumns <- columnNACounts >= 20
testing <- testing[!badColumns]
sum(is.na(testing))
## [1] 0

As you can see, now we don’t have any NA values.

Transform data

training$classe <- factor(training$classe)
training$user_name <- factor(training$user_name)
training$new_window <- factor(training$new_window)
training$cvtd_timestamp <- NULL

testing$classe <- -1
testing$user_name <- factor(testing$user_name)
testing$new_window <- factor(testing$new_window)
testing$cvtd_timestamp <- NULL


Build XGBOOST prediction model

Create feature list

I decided to remove some features, because using XGBOOST and only two predictors user_name and raw_timestamp_part_1 we can predict with perfect accuracy, but I think it will be cheat, not a good prediction model.

feature.names <- names(training)
feature.names <- feature.names[-which(feature.names %in% c('X', 'classe'))]
feature.names <- feature.names[-which(feature.names %in% c('user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'new_window', 'num_window'))]
feature.formula <- formula(paste('classe ~ ', paste(feature.names, collapse = ' + '), sep = ''))

Create sparse matrixes with data

We’ll create training and validation datasets in proportion 1:5

dtrain_cv <- training[, c(feature.names, 'classe')]
indexes <- createDataPartition(y = dtrain_cv$classe, p = 0.8, list = FALSE)
dtrain.matrix <- sparse.model.matrix(feature.formula, data = dtrain_cv[indexes, ])
dtrain <- xgb.DMatrix(dtrain.matrix, label = dtrain_cv[indexes, 'classe'])
dvalid <- xgb.DMatrix(sparse.model.matrix(feature.formula, data = dtrain_cv[-indexes, ]),
                      label = dtrain_cv[-indexes, 'classe'])

dtest_cv <- testing[, c(feature.names, 'classe')]
dtest <- sparse.model.matrix(feature.formula, data = dtest_cv)

Make cross validation

n_rounds.cv <- 301
params <- list(booster = "gbtree", objective = "multi:softmax",
               num_class = 6, eval_metric = 'merror',
               max_depth = 6, eta = 0.1,
               colsample_bytree = 1, subsample = 1)

bst.cv <- xgb.cv(params, dtrain, n_rounds.cv, nfold = 5, metrics = {'merror'},
                 print.every.n = 20, prediction = TRUE)
## [0]  train-merror:0.154995+0.008774  test-merror:0.173320+0.012080
## [20] train-merror:0.044748+0.001004  test-merror:0.060259+0.006736
## [40] train-merror:0.018058+0.000840  test-merror:0.032932+0.005338
## [60] train-merror:0.005637+0.000594  test-merror:0.018536+0.003544
## [80] train-merror:0.001943+0.000321  test-merror:0.012357+0.002630
## [100]    train-merror:0.000605+0.000311  test-merror:0.008663+0.001565
## [120]    train-merror:0.000048+0.000044  test-merror:0.006688+0.001441
## [140]    train-merror:0.000000+0.000000  test-merror:0.005733+0.001543
## [160]    train-merror:0.000000+0.000000  test-merror:0.005287+0.001022
## [180]    train-merror:0.000000+0.000000  test-merror:0.004777+0.001191
## [200]    train-merror:0.000000+0.000000  test-merror:0.004395+0.001089
## [220]    train-merror:0.000000+0.000000  test-merror:0.004013+0.001398
## [240]    train-merror:0.000000+0.000000  test-merror:0.003694+0.001399
## [260]    train-merror:0.000000+0.000000  test-merror:0.003503+0.001293
## [280]    train-merror:0.000000+0.000000  test-merror:0.003567+0.001179
## [300]    train-merror:0.000000+0.000000  test-merror:0.003567+0.001395
n_rounds.train <- which.min(bst.cv$dt[, test.merror.mean])
n_rounds.train
## [1] 256

Cross validation gave us required number of nrounds for training model.

Train the model

model <- xgb.train(params = params, data = dtrain, nrounds = n_rounds.train)

Training dataset accuracy

predicted <- factor(predict(model, dtrain), labels = levels(training$classe))
confusionMatrix(predicted, dtrain_cv[indexes, 'classe'])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Validation dataset accuracy

predicted <- factor(predict(model, dvalid), labels = levels(training$classe))
confusionMatrix(predicted, dtrain_cv[-indexes, 'classe'])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1114    2    0    0    0
##          B    0  756    2    0    0
##          C    1    1  681    3    0
##          D    0    0    1  639    0
##          E    1    0    0    1  721
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9969          
##                  95% CI : (0.9947, 0.9984)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9961          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9960   0.9956   0.9938   1.0000
## Specificity            0.9993   0.9994   0.9985   0.9997   0.9994
## Pos Pred Value         0.9982   0.9974   0.9927   0.9984   0.9972
## Neg Pred Value         0.9993   0.9991   0.9991   0.9988   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1927   0.1736   0.1629   0.1838
## Detection Prevalence   0.2845   0.1932   0.1749   0.1631   0.1843
## Balanced Accuracy      0.9987   0.9977   0.9970   0.9967   0.9997

Feature importance

feature.importance <- xgb.importance(dimnames(dtrain.matrix)[[2]], model = model)
head(feature.importance)
##              Feature       Gain      Cover  Frequence
## 1:         roll_belt 0.16582316 0.10458000 0.05081047
## 2:          yaw_belt 0.09121426 0.08290369 0.07515932
## 3:     pitch_forearm 0.08322418 0.08143426 0.04564976
## 4: magnet_dumbbell_y 0.06693414 0.06407337 0.04021197
## 5:      roll_forearm 0.06596198 0.05321316 0.04561513
## 6:        pitch_belt 0.05612549 0.05575112 0.05042948
xgb.plot.importance(feature.importance)

Predict test cases

answers <- factor(predict(model, dtest), labels = levels(training$classe))
answers <- as.character(answers)
answers
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"

Finally, we write the answers to files

pml_write_files = function(x) {
    n = length(x)
    for (i in 1:n) {
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, 
            col.names = FALSE)
    }
}

pml_write_files(answers)

Conclusion

We got accuracy for validation sample near 0.9969 which is 99.7%.