Human Activity Recognition - HAR - has emerged as a key research area in the last years and is gaining increasing attention for applications of machine learning, supervised and unsupervised. For more informations, please visit: http://groupware.les.inf.puc-rio.br/har
We will use the caret R Package for its ease of use and the unified interface to a large number of modeling and prediction models. To estimate model performance we will use the accuracy, in case of multiclassifications problem the most significant outcome. Similar model performance indicators are AUC, the area under the Receiver Operating Characteristic (ROC) curve, used to estimate performance using a combination of sensitivity and specificity.
#rm(list=ls()); gc()
suppressWarnings(suppressMessages(library(caret)))
# some other libraries are required as well: gbm, plot3D, randomForest, lattice and several subdependecies
dataUrl = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv';
testUrl = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv';
dataSet = read.csv(dataUrl, na.strings=c("NA",""))
testSet = read.csv(testUrl,na.strings=c("NA",""))
# a large object, candidate for data.table?
print(object.size(dataSet), units="Mb")
## 19 Mb
# 19622 rows x 160 predictors data.frame; 20 rows x 160 final testing set
dim(dataSet); dim(testSet)
## [1] 19622 160
## [1] 20 160
# which variable is in the Training Set and not in the Testing Set, respective the reverse (in the Testing Set, not in the Training Set)
setdiff(names(dataSet), names(testSet)); setdiff(names(testSet), names(dataSet))
## [1] "classe"
## [1] "problem_id"
# 20 - very few test-cases.
nrow(testSet)/nrow(dataSet)
## [1] 0.001019264
# outcome variable spread
table(dataSet$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
# Now some charts to get an idea of the data complexity
# plot cross-combinations of magnet_arm values by activity
plot(dataSet[,grep("^magnet_arm", names(dataSet))],
pch=20,
cex=0.8,
col=dataSet$classe,
main="Magnet Arm coloured by Activity")
# in 3D, some beautiful patterns emerge
suppressWarnings(
plot3D::points3D(
dataSet$magnet_arm_x,
dataSet$magnet_arm_y,
dataSet$magnet_arm_z,
col=dataSet$classe,
main="3D Magnet Arm by Activity",
pch=20, cex=0.8
))
# 3D Magnet dumbbell
suppressWarnings(
plot3D::points3D(
dataSet$magnet_dumbbell_x,
dataSet$magnet_dumbbell_y,
dataSet$magnet_dumbbell_z,
col=dataSet$classe,
theta=0,
main="3D Magnet dumbbell by Activity",
pch=20, cex=0.8
))
# ... way more data to explore and plot, a true project on its own.
# chartGroupPatterns = c('magnet_belt', 'magnet_arm', 'magnet_dumbbell', 'magnet_forearm' ), group charts and so on
A large number of predictor variables lead to complicated models, our goal is to prevent that but the same time avoid over-simplification. We will remove variables that have more than 30% NAs, investigate the use of imputation for the remaining ones. We will ignore autoincrement IDs & timestamps.
# create the threshold
thresholdNAs <- round(nrow(dataSet)*0.3)
# keep only the predictor variables that have less than 30% NAs
dataSet <- dataSet[, colSums(is.na(dataSet))< thresholdNAs]
# this actually got rid of all NAs, no imputation needed
sum(colSums(is.na(dataSet)))
## [1] 0
# remove timestamps & autoincrement IDs
dataSet <- dataSet[,6:ncol(dataSet)]
# 55 predictors left
dim(dataSet)
## [1] 19622 55
# only numeric and integer variables left + 2 remaining factor variables: "new_window"" and the outcome variable "classe"
table(sapply(dataSet, class))
##
## factor integer numeric
## 2 26 27
# look at "new_window" closer
prop.table(table(dataSet$new_window, dataSet$classe),1)
##
## A B C D E
## no 0.2847107 0.1934846 0.1744380 0.1637698 0.1835970
## yes 0.2684729 0.1945813 0.1724138 0.1699507 0.1945813
# new_window looks irelevant, dropping
dataSet$new_window<-NULL
We use caret’s createDataPartition() to conduct data splits within groups of the data (stratified sampling). The final model will be used just once on the generated 30% testing set to prevent overfitting. Same goes for the final 20 testSet cases.
set.seed(2015)
inTrain <- createDataPartition(y=dataSet$classe,
p=0.70, list=FALSE)
training <- dataSet[inTrain,]
testing <- dataSet[-inTrain,]
#* centering and scaling variables except for the factor outcome
preObj <- preProcess(training[, -ncol(training)],method=c("center","scale"))
trainVARS <- predict(preObj, training[, -ncol(training)])
# mean=0
round(sum(sapply(trainVARS, mean)),8)
## [1] 0
# sd = 1
table(sapply(trainVARS, sd))
##
## 1
## 53
We’re looking into “binding” correlated predictors together using Principal Component Analysis in order to simplify the dataset even further without losing relevance. Covariates creation is actually refered to as Feature engineering.
# build correlation matrix
M = cor(training[, -ncol(training)])
diag(M) = 0
# add the actual correlation to the result set
cbind(which(M>0.8, arr.ind=T), round(M[which(M>0.8, arr.ind=T)],2))
## row col
## yaw_belt 4 2 0.81
## total_accel_belt 5 2 0.98
## accel_belt_y 10 2 0.93
## roll_belt 2 4 0.81
## roll_belt 2 5 0.98
## accel_belt_y 10 5 0.93
## magnet_belt_x 12 9 0.89
## roll_belt 2 10 0.93
## total_accel_belt 5 10 0.93
## accel_belt_x 9 12 0.89
## magnet_arm_x 25 22 0.81
## accel_arm_x 22 25 0.81
## magnet_arm_z 27 26 0.81
## magnet_arm_y 26 27 0.81
## accel_dumbbell_x 35 29 0.80
## accel_dumbbell_z 37 30 0.85
## pitch_dumbbell 29 35 0.80
## yaw_dumbbell 30 37 0.85
# highly correlated variables are magnet_arm_x and accel_arm_x (> 80% correlation)
plot(
training$magnet_arm_x,
training$accel_arm_x,
col=training$classe,
pch=20,
cex=0.8,
main="Correlated Magnet and Accel predictor values (cor>0.8)"
)
preProc <- preProcess(
training[, -ncol(training)] ,
method="pca",
pcaComp=16
)
# so we could use the parameters of the preferred PC to "aggregate" variables without losing relevancy
round(preProc$rotation[,1],4)
## num_window roll_belt pitch_belt
## -0.0198 -0.3286 0.0491
## yaw_belt total_accel_belt gyros_belt_x
## -0.2554 -0.3200 0.0415
## gyros_belt_y gyros_belt_z accel_belt_x
## -0.1524 0.1629 -0.0652
## accel_belt_y accel_belt_z magnet_belt_x
## -0.3153 0.3316 -0.0869
## magnet_belt_y magnet_belt_z roll_arm
## 0.0923 0.0304 0.1020
## pitch_arm yaw_arm total_accel_arm
## 0.0207 0.0743 0.1172
## gyros_arm_x gyros_arm_y gyros_arm_z
## -0.0239 0.0922 -0.1965
## accel_arm_x accel_arm_y accel_arm_z
## -0.1285 0.2912 -0.1183
## magnet_arm_x magnet_arm_y magnet_arm_z
## -0.0842 0.0575 0.0258
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0.0502 -0.0724 -0.0581
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 0.1304 0.0221 -0.0197
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## -0.0526 -0.1347 0.1312
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## -0.0912 -0.1136 0.0953
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0.1676 0.0743 -0.1162
## yaw_forearm total_accel_forearm gyros_forearm_x
## 0.1176 -0.0261 -0.1172
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## -0.0185 -0.0255 0.2103
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0.0107 -0.0444 0.1059
## magnet_forearm_y magnet_forearm_z
## 0.0104 -0.0626
In terms of model selection for this multiclass classification problem, there are several choices: CART, randomForest, glm, GBM. We train two multinomial gradient boosting machine (GBM) models. GMB creates many weak learners with little predictive power, stacking the learners on top of each other get a very powerful model as we will see. The first model is using gbm default settings and 5-fold CV, the second one uses an alternate tuning grid for parameters. Model Training and Parameter Tuning for caret and gbm. We also use caret’s model independent syntax to create a randomForest model as benchmark.
# using 5-fold Cross-Validation
fitControl <- trainControl(method="cv",
number=5,
verboseIter=FALSE)
set.seed(2015)
gbmFit <- train(classe ~ ., data=training,
method="gbm",
trControl=fitControl,
verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.2.2
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
# plot the profile, examine the relationship between the estimates of performance and the tuning parameters.
plot(gbmFit)
gbmFit
## Stochastic Gradient Boosting
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10989, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD
## 1 50 0.7618090 0.6979322 0.014509474
## 1 100 0.8314764 0.7867031 0.002921711
## 1 150 0.8716606 0.8375399 0.008185305
## 2 50 0.8846917 0.8539368 0.005167990
## 2 100 0.9411803 0.9255747 0.004679793
## 2 150 0.9654948 0.9563422 0.003347979
## 3 50 0.9327356 0.9148558 0.003390564
## 3 100 0.9713181 0.9637149 0.002556355
## 3 150 0.9851490 0.9812154 0.003562295
## Kappa SD
## 0.018323165
## 0.003679978
## 0.010361872
## 0.006581991
## 0.005902242
## 0.004228813
## 0.004288672
## 0.003234625
## 0.004504355
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
#Alternate Tuning Grids, varying largely the amount of trees
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (1:10)*50,
shrinkage = 0.1,
n.minobsinnode = 20)
set.seed(2015)
gbmFit2 <- train(classe ~ ., data = training,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaluate:
tuneGrid = gbmGrid)
plot(gbmFit2)
# as the trees number increases, so does the Accuracy and the Kappa GBM indicator.
gbmFit2
## Stochastic Gradient Boosting
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10989, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD
## 1 50 0.7621744 0.6983984 0.0107401590
## 1 100 0.8306754 0.7856991 0.0060192446
## 1 150 0.8722427 0.8382968 0.0039195011
## 1 200 0.8951002 0.8672370 0.0047394737
## 1 250 0.9120614 0.8886980 0.0045373589
## 1 300 0.9280769 0.9089925 0.0017659522
## 1 350 0.9389962 0.9228139 0.0027735102
## 1 400 0.9469311 0.9328576 0.0026672337
## 1 450 0.9531915 0.9407794 0.0030527576
## 1 500 0.9589426 0.9480600 0.0030774188
## 5 50 0.9703713 0.9625133 0.0039971973
## 5 100 0.9919194 0.9897786 0.0024060911
## 5 150 0.9966513 0.9957645 0.0020809737
## 5 200 0.9978887 0.9973296 0.0013714249
## 5 250 0.9986168 0.9982504 0.0007002394
## 5 300 0.9986895 0.9983425 0.0008379905
## 5 350 0.9989080 0.9986188 0.0008139148
## 5 400 0.9988351 0.9985266 0.0007002325
## 5 450 0.9988351 0.9985266 0.0007891478
## 5 500 0.9989808 0.9987108 0.0007001222
## 9 50 0.9907547 0.9883056 0.0019197306
## 9 100 0.9978888 0.9973297 0.0016154556
## 9 150 0.9985439 0.9981584 0.0011218655
## 9 200 0.9988352 0.9985267 0.0008689599
## 9 250 0.9987624 0.9984347 0.0007546587
## 9 300 0.9989081 0.9986189 0.0007279345
## 9 350 0.9989081 0.9986189 0.0007279345
## 9 400 0.9988353 0.9985268 0.0008300253
## 9 450 0.9988353 0.9985268 0.0008300253
## 9 500 0.9988353 0.9985268 0.0008300253
## Kappa SD
## 0.0135059149
## 0.0076293568
## 0.0049171807
## 0.0059594773
## 0.0057179910
## 0.0022125828
## 0.0034942384
## 0.0033610703
## 0.0038447038
## 0.0038798112
## 0.0050552670
## 0.0030429534
## 0.0026319462
## 0.0017346278
## 0.0008856313
## 0.0010598738
## 0.0010294174
## 0.0008856585
## 0.0009980943
## 0.0008855051
## 0.0024277649
## 0.0020431949
## 0.0014188848
## 0.0010990447
## 0.0009544517
## 0.0009206682
## 0.0009206682
## 0.0010497866
## 0.0010497866
## 0.0010497866
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500,
## interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 20.
rfFit<-train(classe ~ ., data=training,
method="rf",
trControl=fitControl,
prox=TRUE,
allowParallel=TRUE
)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
rfFit
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10989, 10991, 10990, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9938848 0.9922639 0.0015762315 0.0019950490
## 27 0.9975247 0.9968691 0.0006520696 0.0008250165
## 53 0.9944670 0.9930014 0.0021457431 0.0027145723
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
We build the Confusion Matrix for all the 3 models build previously. Pay special attention to the Accuracy rate and the 95% Confidence Interval. All models use 5-fold Cross Validation.
# GBM with default settings (150 trees)
confusionMatrix(
predict(gbmFit, newdata=testing),
testing$classe
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 10 0 0 1
## B 2 1113 8 8 4
## C 0 16 1014 14 1
## D 1 0 4 942 6
## E 0 0 0 0 1070
##
## Overall Statistics
##
## Accuracy : 0.9873
## 95% CI : (0.9841, 0.99)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9839
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9772 0.9883 0.9772 0.9889
## Specificity 0.9974 0.9954 0.9936 0.9978 1.0000
## Pos Pred Value 0.9935 0.9806 0.9703 0.9885 1.0000
## Neg Pred Value 0.9993 0.9945 0.9975 0.9955 0.9975
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1891 0.1723 0.1601 0.1818
## Detection Prevalence 0.2858 0.1929 0.1776 0.1619 0.1818
## Balanced Accuracy 0.9978 0.9863 0.9910 0.9875 0.9945
# GBM with alternate tuning grid, trees count between 50 and 500
confusionMatrix(
predict(gbmFit2, newdata=testing),
testing$classe
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 1
## E 0 0 0 0 1081
##
## Overall Statistics
##
## Accuracy : 0.9998
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9998
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 0.9991
## Specificity 1.0000 1.0000 1.0000 0.9998 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 0.9990 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1935 0.1743 0.1638 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1640 0.1837
## Balanced Accuracy 1.0000 1.0000 1.0000 0.9999 0.9995
# an incredible performance, misclassified one single result out of almost 6.000!
# too bad CPU time is around 2 hours on a quadcore I5/16GB RAM
# randomForest
confusionMatrix(
predict(rfFit, newdata=testing),
testing$classe
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 2 0 0 0
## B 0 1134 2 0 0
## C 0 3 1024 2 0
## D 0 0 0 962 1
## E 1 0 0 0 1081
##
## Overall Statistics
##
## Accuracy : 0.9981
## 95% CI : (0.9967, 0.9991)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9976
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9956 0.9981 0.9979 0.9991
## Specificity 0.9995 0.9996 0.9990 0.9998 0.9998
## Pos Pred Value 0.9988 0.9982 0.9951 0.9990 0.9991
## Neg Pred Value 0.9998 0.9989 0.9996 0.9996 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1927 0.1740 0.1635 0.1837
## Detection Prevalence 0.2846 0.1930 0.1749 0.1636 0.1839
## Balanced Accuracy 0.9995 0.9976 0.9985 0.9989 0.9994
# collect resamples
results = resamples(list(GBM=gbmFit, GBM_GRID=gbmFit2, RandomForest=rfFit))
# summarize the distributions
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: GBM, GBM_GRID, RandomForest
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.9814 0.9833 0.9847 0.9851 0.9854 0.9909 0
## GBM_GRID 0.9982 0.9985 0.9989 0.9990 0.9993 1.0000 0
## RandomForest 0.9967 0.9971 0.9975 0.9975 0.9982 0.9982 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.9765 0.9788 0.9807 0.9812 0.9816 0.9885 0
## GBM_GRID 0.9977 0.9982 0.9986 0.9987 0.9991 1.0000 0
## RandomForest 0.9959 0.9963 0.9968 0.9969 0.9977 0.9977 0
# boxplots of results
bwplot(results)
We will choose the GBM model with the alternate tuning grid since with 99.98% it has the highest accuracy on the testing set, even higher than the one of RandomForest (99.80%) as shown in the bwplot. The accuracy is so high that we almost shout overfiting!, but the Confusion Matrix was generated only once on the testing set, the machine learning algo was therefore not overfit for it and 30% of the data or almost 6.000 observations represent a solid testing set size. The 95% Confidence Interval of the classifier is also very high, low p-values. We therefore expect the lowest true Out of Sample error on the 20 test-cases with both GBM with alternate grid and randomForest.
answers = predict(gbmFit2, newdata=testSet)
dir.create("output_files/", showWarnings = FALSE)
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("output_files/problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(answers)
# now clicking 20 times the Submit button... surprise!