Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified-self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The original article is "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements".
You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
# RequireOrInstall() is a small helper defined in the helper-functions section
# further below; it loads a package, installing it first if necessary.
# If dplyr is already loaded, detach it first: plyr must be loaded
# before dplyr to avoid function-masking problems.
RequireOrInstall("plyr")   # first load plyr
RequireOrInstall("dplyr")  # then load dplyr
# navigate to the datasets directory first, using the setwd() command
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(training) # 19622 rows, 160 cols
## [1] 19622 160
dim(testing) # 20 rows, 160 cols
## [1] 20 160
# combine training and testing so the same feature selection is applied to both
fulldata <- bind_rows(training, testing)
check_for_NAs <- function(x, useceil = FALSE) {
  if (useceil) {
    # 1 if the column contains any NAs at all, 0 otherwise
    apply(as.data.frame(x), 2, FUN = function(col) ceiling(sum(is.na(col)) / length(col)))
  } else {
    # fraction of NA values in each column
    apply(as.data.frame(x), 2, FUN = function(col) sum(is.na(col)) / length(col))
  }
}
NAs <- check_for_NAs(fulldata,useceil = T)
NAs_w <- check_for_NAs(fulldata,useceil = F)
NAs_n <- as.numeric(NAs)
# In the training data alone, 67 of the 160 columns contain NAs (leaving 93
# NA-free); in the combined data only 59 columns are NA-free, because the
# summary columns, classe, and problem_id are NA for part of the rows.
featuredata <- fulldata %>% select(which(NAs_n == 0))
dim(featuredata)
## [1] 19642 59
# the same selection expressed by name: columns with zero NAs in the combined data
features <- names(fulldata[, colSums(is.na(fulldata)) == 0])
features
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [13] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [16] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [19] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [22] "pitch_arm" "yaw_arm" "total_accel_arm"
## [25] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [28] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [31] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [34] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [37] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [40] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [43] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [46] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [49] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [52] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [55] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [58] "magnet_forearm_y" "magnet_forearm_z"
length(features) # of the 93 variables that are NA-free in the training set, only 59 remain NA-free in the combined data
## [1] 59
featuredata <- subset(featuredata, select = features)
featuredata
## Source: local data frame [19,642 x 59]
##
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## (int) (fctr) (int) (int)
## 1 1 carlitos 1323084231 788290
## 2 2 carlitos 1323084231 808298
## 3 3 carlitos 1323084231 820366
## 4 4 carlitos 1323084232 120339
## 5 5 carlitos 1323084232 196328
## 6 6 carlitos 1323084232 304277
## 7 7 carlitos 1323084232 368296
## 8 8 carlitos 1323084232 440390
## 9 9 carlitos 1323084232 484323
## 10 10 carlitos 1323084232 484434
## .. ... ... ... ...
## Variables not shown: cvtd_timestamp (chr), new_window (chr), num_window
## (int), roll_belt (dbl), pitch_belt (dbl), yaw_belt (dbl),
## total_accel_belt (int), gyros_belt_x (dbl), gyros_belt_y (dbl),
## gyros_belt_z (dbl), accel_belt_x (int), accel_belt_y (int), accel_belt_z
## (int), magnet_belt_x (int), magnet_belt_y (int), magnet_belt_z (int),
## roll_arm (dbl), pitch_arm (dbl), yaw_arm (dbl), total_accel_arm (int),
## gyros_arm_x (dbl), gyros_arm_y (dbl), gyros_arm_z (dbl), accel_arm_x
## (int), accel_arm_y (int), accel_arm_z (int), magnet_arm_x (int),
## magnet_arm_y (int), magnet_arm_z (int), roll_dumbbell (dbl),
## pitch_dumbbell (dbl), yaw_dumbbell (dbl), total_accel_dumbbell (int),
## gyros_dumbbell_x (dbl), gyros_dumbbell_y (dbl), gyros_dumbbell_z (dbl),
## accel_dumbbell_x (int), accel_dumbbell_y (int), accel_dumbbell_z (int),
## magnet_dumbbell_x (int), magnet_dumbbell_y (int), magnet_dumbbell_z
## (dbl), roll_forearm (dbl), pitch_forearm (dbl), yaw_forearm (dbl),
## total_accel_forearm (int), gyros_forearm_x (dbl), gyros_forearm_y (dbl),
## gyros_forearm_z (dbl), accel_forearm_x (int), accel_forearm_y (int),
## accel_forearm_z (int), magnet_forearm_x (int), magnet_forearm_y (dbl),
## magnet_forearm_z (dbl)
# drop the bookkeeping / time-series columns (row index, timestamps, windows),
# keeping user_name plus the 52 sensor features (and classe for the training rows)
train <- as.data.frame(mutate(featuredata[1:19622, ], classe = training$classe)[, c(2, 8:60)])
test <- as.data.frame(featuredata[19623:19642, c(2, 8:59)])
I opted not to use the well-known Principal Component Analysis (PCA) preprocessing technique here, since rotating the feature space would likely cost descriptive power (the components are no longer interpretable sensor readings), which I did not want. Instead, I chose to remove the unnecessary collinearities in the data in the following way:
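For contrast, the PCA route I decided against would have looked roughly like the sketch below. This is illustrative only: it assumes caret's preProcess() function is loaded, and the 95% variance threshold is an assumed value, not something used elsewhere in this report.
# illustrative only -- the PCA preprocessing I chose NOT to use
numeric_train <- train[, sapply(train, is.numeric)]
# center/scale and rotate, keeping enough components for 95% of the variance
pca_pre <- preProcess(numeric_train, method = "pca", thresh = 0.95)
train_pca <- predict(pca_pre, numeric_train)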
set.seed(123)
RequireOrInstall("caret")
# select the numeric/integer predictors (everything except user_name and classe)
numeric_cols <- sapply(train, is.numeric) | sapply(train, is.integer)
numericset <- train[, numeric_cols]
# findCorrelation searches a correlation matrix and returns the indices of the
# columns to remove in order to reduce pair-wise correlations above the cutoff
highCor_feat <- names(numericset[, findCorrelation(abs(cor(numericset)), 0.88)])
lowcor_features <- setdiff(setdiff(names(train), "classe"), highCor_feat)
train_lowcor <- train[, c(lowcor_features, "classe")]
dim(train_lowcor)
## [1] 19622 46
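As a quick sanity check (my addition, not part of the original analysis), the largest remaining absolute pairwise correlation should now sit below the 0.88 cutoff passed to findCorrelation:
num_lowcor <- train_lowcor[, sapply(train_lowcor, is.numeric)]
cor_lowcor <- abs(cor(num_lowcor))
max(cor_lowcor[upper.tri(cor_lowcor)]) # expected to fall below 0.88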
#### split the labeled data: 60% training, 40% validation ####
inTrain <- createDataPartition(train_lowcor$classe, p = 0.6, list = FALSE)
train_ <- train_lowcor[inTrain, ]
validate_ <- train_lowcor[-inTrain, ]
classe_idx <- which(names(train_) == "classe")
classe_idx # index of the classe column
## [1] 46
str(train_)
## 'data.frame': 11776 obs. of 46 variables:
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0.02 0.03 0.03 0.02 0.02 0 0 0 ...
## $ gyros_belt_y : num 0 0 0 0 0 0 0 0 0.02 0 ...
## $ gyros_belt_z : num -0.02 -0.03 -0.02 0 -0.02 -0.02 0 -0.02 0 -0.02 ...
## $ magnet_belt_x : int -2 -6 0 -3 -5 -2 -3 -6 1 -3 ...
## $ magnet_belt_y : int 600 604 603 609 596 602 606 598 600 603 ...
## $ magnet_belt_z : int -305 -310 -312 -308 -317 -319 -309 -317 -316 -313 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -129 -129 -129 ...
## $ pitch_arm : num 22.5 22.1 22 21.6 21.5 21.5 21.4 21.3 21.2 21.2 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_y : num -0.02 -0.03 -0.03 -0.03 -0.03 -0.03 -0.02 0 -0.02 -0.02 ...
## $ gyros_arm_z : num -0.02 0.02 0 -0.02 0 0 -0.02 -0.02 -0.03 -0.02 ...
## $ accel_arm_x : int -289 -289 -289 -288 -290 -288 -287 -289 -288 -289 ...
## $ accel_arm_y : int 110 111 111 110 110 111 111 110 108 109 ...
## $ accel_arm_z : int -126 -123 -122 -124 -123 -123 -124 -122 -124 -122 ...
## $ magnet_arm_x : int -368 -372 -369 -376 -366 -363 -372 -371 -373 -369 ...
## $ magnet_arm_y : int 344 344 342 334 339 343 338 337 336 340 ...
## $ magnet_arm_z : int 513 512 513 516 509 520 509 512 510 509 ...
## $ roll_dumbbell : num 12.9 13.4 13.4 13.3 13.1 ...
## $ pitch_dumbbell : num -70.3 -70.4 -70.8 -70.9 -70.6 ...
## $ yaw_dumbbell : num -85.1 -84.9 -84.5 -84.4 -84.7 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 36 37 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ accel_dumbbell_x : int -232 -232 -234 -235 -233 -233 -234 -233 -231 -233 ...
## $ accel_dumbbell_y : int 46 48 48 48 47 47 48 47 47 47 ...
## $ accel_dumbbell_z : int -270 -269 -269 -270 -269 -270 -269 -272 -268 -271 ...
## $ magnet_dumbbell_x : int -561 -552 -558 -558 -564 -554 -552 -551 -557 -559 ...
## $ magnet_dumbbell_y : int 298 303 294 291 299 291 302 296 292 295 ...
## $ magnet_dumbbell_z : num -63 -60 -66 -69 -64 -65 -69 -56 -62 -74 ...
## $ roll_forearm : num 28.3 28.1 27.9 27.7 27.6 27.5 27.2 27.1 27 26.9 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.9 -64 -64 -64 ...
## $ yaw_forearm : num -152 -152 -152 -152 -152 -152 -151 -151 -151 -151 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.02 0.02 0.02 0.02 0 0.02 0.02 0.02 ...
## $ gyros_forearm_y : num -0.02 -0.02 -0.02 0 -0.02 0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num 0 0 -0.03 -0.02 -0.02 -0.03 -0.03 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 196 189 193 190 193 191 193 192 192 192 ...
## $ accel_forearm_y : int 204 206 203 205 205 203 205 204 206 203 ...
## $ accel_forearm_z : int -213 -214 -215 -215 -214 -215 -215 -213 -216 -216 ...
## $ magnet_forearm_x : int -18 -16 -9 -22 -17 -11 -15 -13 -16 -10 ...
## $ magnet_forearm_y : num 658 658 660 656 657 657 655 653 653 657 ...
## $ magnet_forearm_z : num 469 469 478 473 465 478 472 481 472 466 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(validate_)
## [1] 7846 46
I used the default randomForest settings to build the classifier, since a random forest is a very powerful nonlinear model with potentially low bias, and its error can be reliably estimated with cross-validation. I did not use a separate cross-validation set for the parallel run, as I had already split the training data into training and validation sets (60% and 40% of the initial training data, respectively), but I did perform cross-validation for the random forest in serial mode.
I experimented with Random Forest and Support Vector Machine (SVM) classifiers. I tried a few SVM runs, the first with sigmoidal and linear kernels (interestingly, parallelSVM does not appear to offer a generic Gaussian kernel), but I could not quickly work out how to one-hot encode the response so that the parallel SVM would recognize a multi-class outcome. It presumably fell back to binary classification (its default), which is why its accuracy was only around 20% instead of near 100%.
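For reference, here is a minimal multi-class SVM sketch that sidesteps the encoding problem entirely by using the e1071 package, whose svm() function handles a factor response natively through one-vs-one voting (its "radial" kernel is the Gaussian kernel that parallelSVM seemed to lack). This is an illustrative alternative, not a model I actually ran for this report:
RequireOrInstall("e1071")
# svm() performs multi-class C-classification automatically when the response is a factor
svm_fit <- svm(classe ~ ., data = train_, kernel = "radial")
confusionMatrix(predict(svm_fit, validate_), validate_$classe)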
What I ultimately opted for was the Random Forest classifier. I had one bug that I was not able to fix: for some reason, cross-validation did not work with my parallel implementation of the random forest and produced a confusing runtime termination error. Cross-validation did work nicely through the caret package, however, so I used it via caret's train function.
# print a confusion matrix for a fitted model against the validation set
printconfusion <- function(ml_model, validate_) {
  confusionMatrix(predict(ml_model, validate_), validate_$classe)
}
# load a package, installing it first if it is not available
RequireOrInstall <- function(package) {
  suppressWarnings({
    if (!require(package, character.only = TRUE)) {
      install.packages(package, dep = TRUE)
      require(package, character.only = TRUE)
    }
  })
}
RequireOrInstall("randomForest")
RequireOrInstall("caret")
set.seed(123)
dim(train_)
## [1] 11776 46
dim(validate_)
## [1] 7846 46
classe_idx
## [1] 46
# caret fit with 5-fold cross-validation (used below for the 20 test predictions)
samsung.randomforest2 <- train(train_[, -classe_idx], train_[, classe_idx],
                               method = "rf",
                               trControl = trainControl(method = "cv", number = 5))
# plain randomForest fit: 700 trees, tracking variable importance
samsung.randomforest <- randomForest(train_[, -classe_idx], train_[, classe_idx],
                                     importance = TRUE, ntree = 700)
print(samsung.randomforest, digits=3)
##
## Call:
## randomForest(x = train_[, -classe_idx], y = train_[, classe_idx], ntree = 700, importance = T)
## Type of random forest: classification
## Number of trees: 700
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 0.89%
## Confusion matrix:
## A B C D E class.error
## A 3344 1 1 0 2 0.001194743
## B 17 2255 7 0 0 0.010530935
## C 0 27 2024 3 0 0.014605648
## D 2 0 34 1892 2 0.019689119
## E 0 0 2 7 2156 0.004157044
# PLOT TRAINING ERROR
plot(samsung.randomforest, ylim=c(0,0.15))
legend('topright', colnames(samsung.randomforest$err.rate), col=1:5, fill=1:5)
printconfusion(samsung.randomforest,validate_)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 7 0 4 0
## B 0 1508 17 1 0
## C 0 3 1351 23 2
## D 0 0 0 1256 3
## E 0 0 0 2 1437
##
## Overall Statistics
##
## Accuracy : 0.9921
## 95% CI : (0.9899, 0.9939)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9934 0.9876 0.9767 0.9965
## Specificity 0.9980 0.9972 0.9957 0.9995 0.9997
## Pos Pred Value 0.9951 0.9882 0.9797 0.9976 0.9986
## Neg Pred Value 1.0000 0.9984 0.9974 0.9954 0.9992
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1922 0.1722 0.1601 0.1832
## Detection Prevalence 0.2859 0.1945 0.1758 0.1605 0.1834
## Balanced Accuracy 0.9990 0.9953 0.9916 0.9881 0.9981
# mtry -- number of variables randomly sampled as split candidates at each node
# nodesize -- minimum size of terminal nodes
# ntree -- number of trees grown (here, per computing node)
timer <- proc.time()
############ PARALLEL FOREST ################
cores <- 4 # set to the number of processors / hyperthreads on the machine
RequireOrInstall("doSNOW")
cluster1 <- makeCluster(cores)
registerDoSNOW(cluster1)
# grow 400 trees on each of the 4 workers and combine them into a single forest
samsung.parallelforest <- foreach(ntree = rep(400, cores), .combine = combine,
                                  .multicombine = TRUE,
                                  .packages = "randomForest") %dopar%
  randomForest(train_[, -classe_idx], train_[, classe_idx], importance = FALSE,
               do.trace = TRUE, ntree = ntree, nodesize = 150, mtry = 8)
stopCluster(cluster1) # shut the workers down once the forest is combined
# note: adding cross-validation to this parallel run failed with an error
# claiming the response is not a factor (see the discussion above)
proc.time()-timer
## user system elapsed
## 0.41 0.16 44.42
printconfusion(samsung.parallelforest,validate_)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2167 113 5 17 5
## B 11 1186 89 10 26
## C 16 177 1266 153 96
## D 33 39 7 1087 48
## E 5 3 1 19 1267
##
## Overall Statistics
##
## Accuracy : 0.8887
## 95% CI : (0.8816, 0.8956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8592
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9709 0.7813 0.9254 0.8453 0.8786
## Specificity 0.9751 0.9785 0.9318 0.9806 0.9956
## Pos Pred Value 0.9393 0.8971 0.7412 0.8954 0.9784
## Neg Pred Value 0.9883 0.9491 0.9834 0.9700 0.9733
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2762 0.1512 0.1614 0.1385 0.1615
## Detection Prevalence 0.2940 0.1685 0.2177 0.1547 0.1651
## Balanced Accuracy 0.9730 0.8799 0.9286 0.9129 0.9371
#######################################
# print the predictions for the 20 test cases, using the cross-validated caret model
pred_rf <- predict(samsung.randomforest2, newdata = test)
pred_rf # the predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
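For the course submission, each of the 20 predictions goes into its own text file. A minimal base-R sketch follows; the problem_id_i.txt file-name pattern is my assumption about the expected format:
# hypothetical helper: write each prediction to its own file
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(as.character(pred_rf))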
We see that although samsung.parallelforest is an ensemble of 4 x 400 = 1600 trees, it does worse than the 700-tree samsung.randomforest model. The most likely cause is the large minimum terminal node size (nodesize = 150), which limits how deep each tree can grow and makes the parallel model underfit, whereas the serial model uses the randomForest default of nodesize = 1.
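A quick way to probe the node-size effect, sketched below with illustrative values (ntree = 100 keeps the runs short; I did not run this for the report), is to compare the out-of-bag error across a few node sizes:
# compare out-of-bag error across minimum terminal node sizes (illustrative values)
for (ns in c(1, 50, 150)) {
  fit <- randomForest(train_[, -classe_idx], train_[, classe_idx],
                      ntree = 100, nodesize = ns)
  cat("nodesize =", ns, " final OOB error =",
      round(tail(fit$err.rate[, "OOB"], 1), 4), "\n")
}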
Out-of-sample error is the error on cases that are not in the training set. Since the original testing set here does not contain the true labels, the out-of-sample error cannot be measured on it directly. To estimate it, I split the initial training set randomly in two, into training and validation sets, and used the validation records: the estimate is 1 minus the model's accuracy on the validation set (here 1 - 0.9921 = 0.0079), which agrees well with both the cross-validation error and the out-of-bag (OOB) error estimate (0.89%) that the randomForest package reports while fitting.
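This estimate can also be computed directly as the misclassification rate on the held-out validation set:
# out-of-sample error estimate from the validation set
val_pred <- predict(samsung.randomforest, validate_)
mean(val_pred != validate_$classe) # about 0.0079, i.e. 1 - 0.9921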
RequireOrInstall("plyr")
RequireOrInstall("dplyr")
RequireOrInstall("ggplot2")
# variable importance from the serial random forest (mean decrease in Gini)
importance <- varImp(samsung.randomforest)
importance <- data.frame(Variables = row.names(importance),
                         Importance = round(samsung.randomforest$importance[, "MeanDecreaseGini"], 2))
# create a rank variable based on decreasing importance
rankImportance <- importance %>%
  mutate(Rank = dense_rank(desc(Importance))) %>%
  arrange(Rank)
first7 <- rankImportance[1:7, 1]
inImp <- createDataPartition(train_$classe, p = 0.05)[[1]]
# pairs plot of the 7 most important variables on a 5% subsample
featurePlot(train_[inImp, first7], train_$classe[inImp], plot = "pairs",
            xlab = "Most Important Variables", ylab = "Activity Type (Classe)")
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance[1:7,], aes(x = reorder(Variables, Importance),
y = Importance)) +
geom_bar(stat='identity', colour = 'green') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'lavender',
fontface = 'bold') +
labs(x = 'Variables', title = 'Relative Variable Importance') +
coord_flip()
## [1] "total_accel_belt > 21.5"
Removing the 8 strongly collinear features (that is, using train_lowcor) lowered the classification accuracy by only 0.0017 in the best scenario (from 0.9938 to 0.9921). This supports the point that once you have a versatile classifier (which the random forest is), compressing the feature space roughly 3.5-fold barely moves the chosen classification metric (here, accuracy). The effective dimensionality reduction is thus from a 159-dimensional to a 46-dimensional feature space while still achieving 99.2% accuracy on the validation set. For me, the project has therefore fulfilled its purpose.