This analysis was performed to fulfill the requirements of the course project for the Practical Machine Learning course on Coursera. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The main objective of this project is to predict the manner in which the participants performed the exercise, which is encoded in the classe variable. In the following code segment, we set the required global options and load the required packages in R.
library(knitr)
library(e1071)
library(xgboost)
library(Matrix)
library(methods)
library(caret)
library(dplyr)
library(Metrics)
set.seed(111) # for reproducibility
The links for the training and test data are given in the code below. First of all, we download the files (if not already present) and load them:
if (!file.exists("training.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
destfile = "training.csv")
}
if (!file.exists("testing.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
destfile = "testing.csv")
}
training <- read.csv("training.csv", na.strings=c("NA", "", "#DIV/0!"), stringsAsFactors=FALSE)
testing <- read.csv("testing.csv", na.strings=c("NA", "", "#DIV/0!"), stringsAsFactors=FALSE)
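As a quick sanity check, we can look at the dimensions of the loaded data frames (the raw training set is known to contain 19622 rows and the test set 20 rows, each with 160 columns):
# quick sanity check of the raw dimensions (expected: 19622 x 160 and 20 x 160)
dim(training)
dim(testing)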
The whole dataset is divided into 1-second windows, and many accelerometer-derived columns contain a value only in the first row of each window; all other rows have NA in these columns. These are most likely summary statistics that are constant for the whole 1-second window, but in this course project I will simply remove these columns.
# for training dataset
columnNACounts <- colSums(is.na(training)) # getting NA counts for all columns
badColumns <- columnNACounts >= 19000 # ignoring columns with majority NA values
training <- training[!badColumns] # getting clean data
sum(is.na(training)) # checking for NA values
## [1] 0
# for testing dataset
columnNACounts <- colSums(is.na(testing))
badColumns <- columnNACounts >= 20 # the test set has only 20 rows, so these columns are entirely NA
testing <- testing[!badColumns]
sum(is.na(testing))
## [1] 0
As you can see, there are no NA values left.
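As a side check on the window structure described earlier (a sketch that re-reads the raw file, since training has already been cleaned in place), we can confirm that the dropped summary columns are populated only on rows that open a new window:
# re-read the raw training file, because training was cleaned in place
raw <- read.csv("training.csv", na.strings = c("NA", "", "#DIV/0!"), stringsAsFactors = FALSE)
summaryCols <- names(which(colSums(is.na(raw)) >= 19000)) # the columns we dropped
# cross-tabulate: non-NA summary values should occur only where new_window == "yes"
table(newWindow = raw$new_window, hasSummary = rowSums(!is.na(raw[summaryCols])) > 0)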
training$classe <- factor(training$classe)
training$user_name <- factor(training$user_name)
training$new_window <- factor(training$new_window)
training$cvtd_timestamp <- NULL
testing$classe <- -1 # dummy label so the model-matrix formula below also works for the test set
testing$user_name <- factor(testing$user_name)
testing$new_window <- factor(testing$new_window)
testing$cvtd_timestamp <- NULL
I decided to remove several identification and timestamp features. With XGBoost and only two predictors, user_name and raw_timestamp_part_1, one can already predict with near-perfect accuracy, but that would be cheating (the model would memorize the recording schedule) rather than a genuinely useful prediction model.
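To illustrate the leakage (a quick sketch, not part of the final model): each participant performed the five classes in contiguous time blocks, so user_name plus a raw timestamp almost uniquely identify the classe:
# per participant, each classe occupies its own contiguous time interval,
# so identity + timestamp let a model memorize the schedule instead of
# learning anything from the sensor readings
training %>%
    group_by(user_name, classe) %>%
    summarise(start = min(raw_timestamp_part_1), end = max(raw_timestamp_part_1)) %>%
    arrange(user_name, start)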
feature.names <- names(training)
feature.names <- feature.names[-which(feature.names %in% c('X', 'classe'))]
feature.names <- feature.names[-which(feature.names %in% c('user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'new_window', 'num_window'))]
feature.formula <- formula(paste('classe ~ ', paste(feature.names, collapse = ' + '), sep = ''))
We’ll split the data into training and validation sets in proportion 4:1 (80% / 20%):
dtrain_cv <- training[, c(feature.names, 'classe')]
indexes <- createDataPartition(y = dtrain_cv$classe, p = 0.8, list = FALSE)
dtrain.matrix <- sparse.model.matrix(feature.formula, data = dtrain_cv[indexes, ])
dtrain <- xgb.DMatrix(dtrain.matrix, label = dtrain_cv[indexes, 'classe'])
dvalid <- xgb.DMatrix(sparse.model.matrix(feature.formula, data = dtrain_cv[-indexes, ]),
label = dtrain_cv[-indexes, 'classe'])
dtest_cv <- testing[, c(feature.names, 'classe')]
dtest <- sparse.model.matrix(feature.formula, data = dtest_cv)
n_rounds.cv <- 301
# Note: the classe labels enter the DMatrix as the factor's integer codes 1..5
# rather than 0..4, so num_class is set to 6 (class 0 simply never occurs)
params <- list(booster = "gbtree", objective = "multi:softmax",
num_class = 6, eval_metric = 'merror',
max_depth = 6, eta = 0.1,
colsample_bytree = 1, subsample = 1)
bst.cv <- xgb.cv(params, dtrain, n_rounds.cv, nfold = 5, metrics = {'merror'},
print.every.n = 20, prediction = TRUE)
## [0] train-merror:0.154995+0.008774 test-merror:0.173320+0.012080
## [20] train-merror:0.044748+0.001004 test-merror:0.060259+0.006736
## [40] train-merror:0.018058+0.000840 test-merror:0.032932+0.005338
## [60] train-merror:0.005637+0.000594 test-merror:0.018536+0.003544
## [80] train-merror:0.001943+0.000321 test-merror:0.012357+0.002630
## [100] train-merror:0.000605+0.000311 test-merror:0.008663+0.001565
## [120] train-merror:0.000048+0.000044 test-merror:0.006688+0.001441
## [140] train-merror:0.000000+0.000000 test-merror:0.005733+0.001543
## [160] train-merror:0.000000+0.000000 test-merror:0.005287+0.001022
## [180] train-merror:0.000000+0.000000 test-merror:0.004777+0.001191
## [200] train-merror:0.000000+0.000000 test-merror:0.004395+0.001089
## [220] train-merror:0.000000+0.000000 test-merror:0.004013+0.001398
## [240] train-merror:0.000000+0.000000 test-merror:0.003694+0.001399
## [260] train-merror:0.000000+0.000000 test-merror:0.003503+0.001293
## [280] train-merror:0.000000+0.000000 test-merror:0.003567+0.001179
## [300] train-merror:0.000000+0.000000 test-merror:0.003567+0.001395
n_rounds.train <- which.min(bst.cv$dt[, test.merror.mean])
n_rounds.train
## [1] 256
Cross-validation gives us the optimal number of boosting rounds (the minimum of the test error curve) for training the final model.
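Optionally (a small sketch using the same data.table that xgb.cv returned above), we can plot the cross-validated test error to see where the curve flattens out:
# visualize the CV test error by boosting round; the dashed line marks the
# selected number of rounds
plot(bst.cv$dt[, test.merror.mean], type = 'l',
     xlab = 'boosting round', ylab = 'test merror (CV mean)')
abline(v = n_rounds.train, lty = 2)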
model <- xgb.train(params = params, data = dtrain, nrounds = n_rounds.train)
predicted <- factor(predict(model, dtrain), labels = levels(training$classe))
confusionMatrix(predicted, dtrain_cv[indexes, 'classe'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4464 0 0 0 0
## B 0 3038 0 0 0
## C 0 0 2738 0 0
## D 0 0 0 2573 0
## E 0 0 0 0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
predicted <- factor(predict(model, dvalid), labels = levels(training$classe))
confusionMatrix(predicted, dtrain_cv[-indexes, 'classe'])
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 2 0 0 0
## B 0 756 2 0 0
## C 1 1 681 3 0
## D 0 0 1 639 0
## E 1 0 0 1 721
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9947, 0.9984)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9960 0.9956 0.9938 1.0000
## Specificity 0.9993 0.9994 0.9985 0.9997 0.9994
## Pos Pred Value 0.9982 0.9974 0.9927 0.9984 0.9972
## Neg Pred Value 0.9993 0.9991 0.9991 0.9988 1.0000
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1927 0.1736 0.1629 0.1838
## Detection Prevalence 0.2845 0.1932 0.1749 0.1631 0.1843
## Balanced Accuracy 0.9987 0.9977 0.9970 0.9967 0.9997
feature.importance <- xgb.importance(dimnames(dtrain.matrix)[[2]], model = model)
head(feature.importance)
## Feature Gain Cover Frequence
## 1: roll_belt 0.16582316 0.10458000 0.05081047
## 2: yaw_belt 0.09121426 0.08290369 0.07515932
## 3: pitch_forearm 0.08322418 0.08143426 0.04564976
## 4: magnet_dumbbell_y 0.06693414 0.06407337 0.04021197
## 5: roll_forearm 0.06596198 0.05321316 0.04561513
## 6: pitch_belt 0.05612549 0.05575112 0.05042948
xgb.plot.importance(feature.importance)
answers <- factor(predict(model, dtest), labels = levels(training$classe))
answers <- as.character(answers)
answers
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"
Finally, we write the answers to files, one file per test case, as required for the submission:
pml_write_files = function(x) {
n = length(x)
for (i in 1:n) {
filename = paste0("problem_id_", i, ".txt")
write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
col.names = FALSE)
}
}
pml_write_files(answers)
We obtained an accuracy of about 0.9969 (99.7%) on the validation sample, so the expected out-of-sample error is roughly 0.3%.