The goal of this project is to predict the manner in which 6 individuals exercised, using data from accelerometers on the belt, forearm, arm, and dumbbell, to compare which machine learning algorithm performs best for this task, and to apply the chosen model to 20 separate test cases.
##   freq percentage
## A 5580   28.43747
## B 3797   19.35073
## C 3422   17.43961
## D 3216   16.38977
## E 3607   18.38243
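The class distribution above can be reproduced once the raw data are loaded. A minimal sketch, assuming the course files pml-training.csv and pml-testing.csv and the usual missing-value markers (the file names and na.strings are assumptions, since the loading chunk is not shown):
# Assumed file names and NA markers; adjust to the actual download location
train <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
test <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
# Frequency and percentage of each exercise class (A-E)
cbind(freq = table(train$classe),
      percentage = prop.table(table(train$classe)) * 100)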
#Create a filter that drops the first 7 metadata/timestamp columns and keeps only the columns with no missing values.
library(dplyr)
library(caret)  # needed below for createDataPartition, train, trainControl, resamples and confusionMatrix
gooddata <- names(test[, colMeans(is.na(test)) == 0])[8:59]
#Apply the filter to both data sets so that they keep the same predictor variables for the analysis.
train <- train[, c(gooddata, "classe")]
test <- test[, c(gooddata, "problem_id")]
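As a quick sanity check (illustrative only), the cleaned sets can be compared to confirm they now share the same 52 predictors plus one label column each:
# Expect 19622 x 53 for train and 20 x 53 for test after filtering
dim(train)
dim(test)
# The 52 predictor names should be identical in both sets
identical(names(train)[1:52], names(test)[1:52])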
#Split the training data into a 70% training partition and a 30% validation partition to estimate out-of-sample performance.
inTrain <- createDataPartition(train$classe, p=0.7, list=FALSE)
trtest <- train[inTrain, ]
ttest <- train[-inTrain, ]
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
set.seed(10)
fit.lda <- train(classe~., data=trtest, method="lda", metric=metric, trControl=control)
set.seed(10)
fit.knn <- train(classe~., data=trtest, method="knn", metric=metric, trControl=control)
set.seed(10)
fit.gbm <- train(classe~., data=trtest, method="gbm", metric=metric, trControl=control, verbose=FALSE)
set.seed(10)
fit.rf <- train(classe~., data=trtest, method="rf", metric=metric, trControl=control)
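Fitting the boosted and random forest models with 10-fold cross-validation on roughly 13,700 rows can be slow. caret can run the resampling in parallel if a backend is registered before the train() calls; a minimal sketch, assuming the doParallel package and a 4-core machine (both assumptions about the environment):
library(doParallel)
cl <- makePSOCKcluster(4)  # core count is an assumption; adjust to the machine
registerDoParallel(cl)
# ... run the train() calls while the cluster is registered ...
stopCluster(cl)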
### Checking performance of all the models
results <- resamples(list(lda=fit.lda, knn=fit.knn, gbm=fit.gbm, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, knn, gbm, rf
## Number of resamples: 10
##
## Accuracy
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda 0.6906841 0.7002730 0.7041486 0.7063417 0.7140007 0.7227074    0
## knn 0.8915575 0.8968717 0.8981077 0.8985956 0.8994902 0.9082969    0
## gbm 0.9534207 0.9570675 0.9599709 0.9602537 0.9621267 0.9716157    0
## rf  0.9847050 0.9894451 0.9912696 0.9908276 0.9919927 0.9956300    0
##
## Kappa
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda 0.6079599 0.6212435 0.6255691 0.6283562 0.6376937 0.6488074    0
## knn 0.8626099 0.8695342 0.8711036 0.8716748 0.8728454 0.8839254    0
## gbm 0.9410368 0.9456811 0.9493503 0.9497100 0.9520885 0.9641039    0
## rf  0.9806554 0.9866472 0.9889550 0.9883970 0.9898709 0.9944723    0
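The resampling distributions can also be compared visually with caret's lattice helpers (illustrative, not part of the original output):
# Box-and-whisker and dot plots of accuracy and kappa across the 10 folds
bwplot(results)
dotplot(results)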
predictLDA <- predict(fit.lda, newdata=ttest)
confMatLDA <- confusionMatrix(predictLDA, ttest$classe)
predictKNN <- predict(fit.knn, newdata=ttest)
confMatKNN <- confusionMatrix(predictKNN, ttest$classe)
predictGBM <- predict(fit.gbm, newdata=ttest)
confMatGBM <- confusionMatrix(predictGBM, ttest$classe)
predictRF <- predict(fit.rf, newdata=ttest)
confMatRF <- confusionMatrix(predictRF, ttest$classe)
performance <- matrix(round(c(confMatLDA$overall, confMatKNN$overall, confMatGBM$overall, confMatRF$overall), 3), ncol=4)
rownames(performance) <- names(confMatLDA$overall)
colnames(performance) <- c('Linear Discriminant Analysis (LDA)', 'K-Nearest Neighbors (KNN)', 'Gradient Boosting (GBM)', 'Random Forest (RF)')
performance.table <- as.table(performance)
print(performance.table)
##                Linear Discriminant Analysis (LDA) K-Nearest Neighbors (KNN)
## Accuracy                                     0.697                     0.908
## Kappa                                        0.616                     0.883
## AccuracyLower                                0.685                     0.900
## AccuracyUpper                                0.708                     0.915
## AccuracyNull                                 0.284                     0.284
## AccuracyPValue                               0.000                     0.000
## McnemarPValue                                0.000                     0.000
##                Gradient Boosting (GBM) Random Forest (RF)
## Accuracy                         0.956              0.995
## Kappa                            0.945              0.993
## AccuracyLower                    0.951              0.992
## AccuracyUpper                    0.962              0.996
## AccuracyNull                     0.284              0.284
## AccuracyPValue                   0.000              0.000
## McnemarPValue                    0.000
predictions <- predict(fit.rf, test)
table(predictions,test$problem_id)
##
## predictions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##           A 0 1 0 1 1 0 0 0 1  1  0  0  0  1  0  0  1  0  0  0
##           B 1 0 1 0 0 0 0 1 0  0  1  0  1  0  0  0  0  1  1  1
##           C 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0
##           D 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0
##           E 0 0 0 0 0 1 0 0 0  0  0  0  0  0  1  1  0  0  0  0
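For the quiz submission, the same information can be shown as one predicted class per problem id (an equivalent, illustrative view of the table above):
# One predicted class (A-E) per test case, in problem_id order
data.frame(problem_id = test$problem_id, prediction = predictions)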
8. Conclusion
As shown in step 5, the best model is the random forest, which reached a mean cross-validated accuracy of about 0.991 and an accuracy of 0.995 on the held-out validation set, and it was therefore chosen as the model for predicting the test cases. GBM performed very well too. A random forest averages the votes of a large number of decorrelated decision trees, which reduces variance and typically yields high accuracy on multi-class sensor data such as this; for that reason we believe this approach was the most suitable for our purpose, and the results support that choice.
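To see which sensor measurements the chosen random forest relies on most, caret's variable importance can be inspected (an illustrative follow-up, not part of the original analysis):
# Rank the 52 predictors by importance in the fitted random forest
rfImp <- varImp(fit.rf)
print(rfImp)
plot(rfImp, top = 20)  # plot the 20 most influential predictors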