People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this project is to predict the manner in which the participants did the exercise (the classe variable). The strategy is as follows: after cleaning the data, three subsets are created: one for training, one for testing the different models, and a last one for validating the chosen model.
# Download the training and quiz data once, if they are not already cached locally
if (!file.exists('preddata.csv')) {
  download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
                destfile = './preddata.csv', method = 'curl')
}
if (!file.exists('quiz.csv')) {
  download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',
                destfile = './quiz.csv', method = 'curl')
}
preddata <- read.csv('preddata.csv')
quiz <- read.csv('quiz.csv')
library(caret)
library(rattle)
library(ElemStatLearn)
library(randomForest)
library(e1071)
library(h2o)
set.seed(68490)
We then convert certain variables to the appropriate classes: factor for the categorical ones and POSIXct for the timestamp.
preddata$classe <- factor(preddata$classe)
preddata$user_name <- factor(preddata$user_name)
preddata$cvtd_timestamp <- as.POSIXct(preddata$cvtd_timestamp,format='%d/%m/%Y %H:%M')
We drop the variable X, which merely holds the row number of each observation. Likewise, variables containing a considerable number of NA values are removed.
preddata <- preddata[,-1] # drop X, the row index
navles <- apply(preddata,2,function(x) any(is.na(x))) # flag columns containing any NA
preddata <- preddata[,!navles]
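As a quick check, not part of the original output, we can see how many columns the NA filter flagged and what remains afterwards (the exact figures depend on the raw CSV):
sum(navles) # number of columns flagged for NA values
dim(preddata) # remaining dimensions after dropping X and the NA columns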
Now we create the data subsets: 60% of the observations go to training, and the remaining 40% are split evenly between testing and validation.
inTrain <- createDataPartition(y=preddata$classe,p=.6,list=FALSE)
training <- preddata[inTrain,]
testing <- preddata[-inTrain,]
inTrain <- createDataPartition(y=testing$classe,p=.5,list=FALSE)
validation <- testing[inTrain,]
testing <- testing[-inTrain,]
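A brief sanity check, not part of the original output, confirms the roughly 60/20/20 split:
dim(training) # about 60% of the observations
dim(testing) # about 20%
dim(validation) # about 20%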
Finally, we remove the variables that have strikingly low variance and therefore carry barely any predictive power. For this purpose we use the training subset.
novar <- nearZeroVar(training)
training <- training[,-novar]
testing <- testing[,-novar]
validation <- validation[,-novar]
The first model is a classification tree, fitted with rpart through caret.
modtree <- train(classe~.,method='rpart',data=training)
fancyRpartPlot(modtree$finalModel)
confusionMatrix(testing$classe,predict(modtree,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2512 42 206 0 5
## B 1081 1005 603 0 0
## C 1586 108 1728 0 0
## D 1436 568 1212 0 0
## E 524 486 966 0 1631
##
## Overall Statistics
##
## Accuracy : 0.438
## 95% CI : (0.4302, 0.4458)
## No Information Rate : 0.4547
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3031
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.3519 0.45496 0.3665 NA 0.9969
## Specificity 0.9704 0.87517 0.8458 0.7951 0.8595
## Pos Pred Value 0.9085 0.37374 0.5050 NA 0.4522
## Neg Pred Value 0.6423 0.90746 0.7567 NA 0.9996
## Prevalence 0.4547 0.14071 0.3003 0.0000 0.1042
## Detection Rate 0.1600 0.06402 0.1101 0.0000 0.1039
## Detection Prevalence 0.1761 0.17128 0.2180 0.2049 0.2298
## Balanced Accuracy 0.6612 0.66506 0.6061 NA 0.9282
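To compare the models on an equal footing, the accuracy can be pulled out of each confusion matrix programmatically. A small helper sketch (the function name is ours, not part of the original analysis):
# Hypothetical helper: testing-set accuracy for a fitted model,
# mirroring the confusionMatrix calls used throughout this report
acc <- function(model, data) {
  cm <- confusionMatrix(data$classe, predict(model, data))
  unname(cm$overall['Accuracy'])
}
acc(modtree, testing) # about 0.44 for the tree above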
The tree only reaches about 44% accuracy on the testing set, so we try a support vector machine next.
modsvm <- svm(classe~.,data=training)
confusionMatrix(testing$classe,predict(modsvm,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2727 8 29 0 1
## B 218 2385 71 5 10
## C 7 106 3270 34 5
## D 6 2 334 2872 2
## E 1 15 81 98 3412
##
## Overall Statistics
##
## Accuracy : 0.9342
## 95% CI : (0.9302, 0.938)
## No Information Rate : 0.2411
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9175
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9216 0.9479 0.8639 0.9545 0.9948
## Specificity 0.9970 0.9769 0.9872 0.9729 0.9841
## Pos Pred Value 0.9863 0.8869 0.9556 0.8930 0.9459
## Neg Pred Value 0.9821 0.9899 0.9581 0.9890 0.9985
## Prevalence 0.1885 0.1603 0.2411 0.1917 0.2185
## Detection Rate 0.1737 0.1519 0.2083 0.1829 0.2173
## Detection Prevalence 0.1761 0.1713 0.2180 0.2049 0.2298
## Balanced Accuracy 0.9593 0.9624 0.9256 0.9637 0.9894
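The svm call above relies on e1071's default kernel and cost parameters. If one wanted to tune them (not done here), e1071 provides tune.svm; a sketch over a deliberately small, hypothetical grid, which can still be slow on the ~12,000 training rows:
# Hypothetical tuning run: 10-fold cross-validation over a small grid
tuned <- tune.svm(classe~., data=training, gamma=c(0.01,0.1), cost=c(1,10))
summary(tuned) # best parameters and their cross-validated error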
The SVM improves the accuracy to roughly 93%. The next candidate is a gradient boosted machine, fitted with h2o.
h2o.init()
predictors <- setdiff(colnames(training),'classe') # every column except the response
response <- 'classe'
trainsp <- as.h2o(training)
testsp <- as.h2o(testing)
modgbm <- h2o.gbm(x=predictors,y=response,training_frame = trainsp,
validation_frame = testsp,ntrees=100,distribution='multinomial')
predgbm <- predict(modgbm,testsp)
perf <- h2o.performance(modgbm,testsp)
h2o.confusionMatrix(modgbm,testsp)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## A B C D E Error Rate
## A 2765 0 0 0 0 0.0000 = 0 / 2,765
## B 2 2687 0 0 0 0.0007 = 2 / 2,689
## C 0 2 3420 0 0 0.0006 = 2 / 3,422
## D 0 0 0 3216 0 0.0000 = 0 / 3,216
## E 0 0 0 0 3607 0.0000 = 0 / 3,607
## Totals 2767 2689 3420 3216 3607 0.0003 = 4 / 15,699
1-0.0003 #Accuracy, i.e. one minus the total error rate above
## [1] 0.9997
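Since h2o is not needed beyond this point, its local cluster could be shut down (assuming no later chunk depends on it):
h2o.shutdown(prompt = FALSE) # stop the local h2o instance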
The GBM is nearly perfect on the testing set. The last model is a random forest.
modrf <- randomForest(classe~.,data=training)
varImpPlot(modrf)
confusionMatrix(testing$classe,predict(modrf,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2765 0 0 0 0
## B 0 2689 0 0 0
## C 0 5 3417 0 0
## D 0 0 1 3214 1
## E 0 0 0 1 3606
##
## Overall Statistics
##
## Accuracy : 0.9995
## 95% CI : (0.999, 0.9998)
## No Information Rate : 0.2298
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9994
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9981 0.9997 0.9997 0.9997
## Specificity 1.0000 1.0000 0.9996 0.9998 0.9999
## Pos Pred Value 1.0000 1.0000 0.9985 0.9994 0.9997
## Neg Pred Value 1.0000 0.9996 0.9999 0.9999 0.9999
## Prevalence 0.1761 0.1716 0.2177 0.2048 0.2298
## Detection Rate 0.1761 0.1713 0.2177 0.2047 0.2297
## Detection Prevalence 0.1761 0.1713 0.2180 0.2049 0.2298
## Balanced Accuracy 1.0000 0.9991 0.9997 0.9998 0.9998
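randomForest also reports an out-of-bag (OOB) error estimate, an internal approximation of the out-of-sample error that requires no separate holdout; printing the model displays it (the exact value depends on the run):
print(modrf) # shows the OOB error estimate and the per-class error rates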
The model that performs best in terms of accuracy is the random forest, so we now apply it to the validation subset.
conf<-confusionMatrix(validation$classe,predict(modrf,validation))
ac <- conf$overall[1]
conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2815 0 0 0 0
## B 1 1107 0 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.9997
## 95% CI : (0.9986, 1)
## No Information Rate : 0.7178
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9994
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9996 1.0000 NA NA NA
## Specificity 1.0000 0.9996 1 1 1
## Pos Pred Value 1.0000 0.9991 NA NA NA
## Neg Pred Value 0.9991 1.0000 NA NA NA
## Prevalence 0.7178 0.2822 0 0 0
## Detection Rate 0.7176 0.2822 0 0 0
## Detection Prevalence 0.7176 0.2824 0 0 0
## Balanced Accuracy 0.9998 0.9998 NA NA NA
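The expected out-of-sample error quoted below is simply one minus this stored accuracy:
1 - ac # expected out-of-sample error on the validation subset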
Using the validation data, the random forest model reaches an accuracy of 0.9997451, which indicates a very precise prediction. The associated expected out-of-sample error is hence 1 - 0.9997451 ≈ 2.5490696 × 10^-4.
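As a final step, the chosen model could be applied to the 20 quiz cases loaded at the beginning. This is only a sketch, not part of the original analysis: it assumes the quiz file shares the training predictor columns and mirrors the preprocessing above, and the column handling may need adjustment in practice.
# Hypothetical: mirror the preprocessing on the quiz set and predict its cases
quiz$cvtd_timestamp <- as.POSIXct(quiz$cvtd_timestamp, format='%d/%m/%Y %H:%M')
quizpred <- quiz[, setdiff(colnames(training), 'classe')] # keep only the predictors
quizpred$user_name <- factor(quizpred$user_name, levels = levels(training$user_name)) # align factor levels
predict(modrf, quizpred)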