Project goal:

The goal of this project is to predict the manner in which six participants performed a weight-lifting exercise, using data from accelerometers on the belt, forearm, arm, and dumbbell, to apply the resulting model to 20 test cases, and to compare which machine learning algorithm performs best for this task.
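The train and test data frames used below are the Weight Lifting Exercise datasets. A minimal loading sketch, assuming the standard course download URLs and treating "NA", "#DIV/0!", and empty fields as missing values:

#Assumed download locations for the two datasets
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

train <- read.csv(trainURL, na.strings=c("NA", "#DIV/0!", ""))
test  <- read.csv(testURL,  na.strings=c("NA", "#DIV/0!", ""))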

  1. Step - Observe the frequency distribution of the “classe” variable in the training dataset to understand its general balance across the five classes (a sketch of code that can produce this table follows the output below).
##   freq percentage
## A 5580   28.43747
## B 3797   19.35073
## C 3422   17.43961
## D 3216   16.38977
## E 3607   18.38243
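A sketch that reproduces the table above, assuming the training data is loaded as train:

#Counts and percentages of each "classe" level in the training data
freq <- table(train$classe)
data.frame(freq = as.vector(freq),
           percentage = 100 * as.vector(freq) / sum(freq),
           row.names = names(freq))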
  2. Step - Clean the data: remove irrelevant identifier variables and every variable with missing values from both the training and test datasets to improve model performance.
#Build a filter that keeps only the complete (NA-free) columns and drops the first 7 of them, which are identifiers, timestamps, and window markers rather than sensor measurements.
library(caret)   # provides createDataPartition(), train(), resamples(), and confusionMatrix() used below

gooddata <- names(test[, colMeans(is.na(test)) == 0])[8:59]

#Apply the filter to both datasets so they keep the same predictor variables for the analysis.

train<- train[,c(gooddata,"classe")]
test<-test[,c(gooddata,"problem_id")]

#Split the training data 70/30: models are fit on trtest and evaluated on the held-out ttest.
inTrain <- createDataPartition(train$classe, p=0.7, list=FALSE)
trtest <- train[inTrain, ]
ttest <- train[-inTrain, ]
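A quick sanity check (a suggested verification) confirms the split sizes and that no missing values survived the cleaning:

dim(trtest); dim(ttest)                 # ~70%/30% of the rows, 53 columns each
sum(is.na(trtest)) + sum(is.na(ttest))  # should be 0 after the cleaning step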
  3. Step - Use 10-fold cross-validation on the training data: it is split into 10 partitions (validation folds), a model is fit on the remaining folds, its performance is measured against each validation fold in turn, and the results are averaged, giving a better estimate of how the model will perform on new observations.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
  4. Step - Experiment with different models.
Linear Algorithms
1. Linear Discriminant Analysis (LDA)
set.seed(10)
fit.lda <- train(classe~., data=trtest, method="lda", metric=metric, trControl=control)
Non-Linear Algorithms
2. k-Nearest Neighbors (kNN)
set.seed(10)
fit.knn <- train(classe~., data=trtest, method="knn", metric=metric, trControl=control)
Advanced algorithms
3. Gradient Boosting (GBM)
set.seed(10)
fit.gbm <- train(classe~., data=trtest, method="gbm", metric=metric, trControl=control, verbose=FALSE)   # verbose=FALSE suppresses gbm's per-iteration log
4. Random Forest (RF)
set.seed(10)
fit.rf <- train(classe~., data=trtest, method="rf", metric=metric, trControl=control)
  5. Step - Compare the performance of all the models.
### Compare the cross-validation resampling results across the four models

results <- resamples(list(lda=fit.lda, knn=fit.knn, gbm=fit.gbm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, knn, gbm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda 0.6906841 0.7002730 0.7041486 0.7063417 0.7140007 0.7227074    0
## knn 0.8915575 0.8968717 0.8981077 0.8985956 0.8994902 0.9082969    0
## gbm 0.9534207 0.9570675 0.9599709 0.9602537 0.9621267 0.9716157    0
## rf  0.9847050 0.9894451 0.9912696 0.9908276 0.9919927 0.9956300    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda 0.6079599 0.6212435 0.6255691 0.6283562 0.6376937 0.6488074    0
## knn 0.8626099 0.8695342 0.8711036 0.8716748 0.8728454 0.8839254    0
## gbm 0.9410368 0.9456811 0.9493503 0.9497100 0.9520885 0.9641039    0
## rf  0.9806554 0.9866472 0.9889550 0.9883970 0.9898709 0.9944723    0
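The same resamples object also supports a quick visual comparison: lattice's dotplot (loaded with caret) draws each model's accuracy and kappa with confidence intervals.

#Visual comparison of the resampling distributions across the four models
dotplot(results)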
  6. Step - Apply each model to the held-out part of the training data (ttest) and check its predictions against the true classes.
predictLDA <- predict(fit.lda, newdata=ttest)
confMatLDA <- confusionMatrix(predictLDA, ttest$classe)

predictKNN <- predict(fit.knn, newdata=ttest)
confMatKNN <- confusionMatrix(predictKNN, ttest$classe)

predictGBM <- predict(fit.gbm, newdata=ttest)
confMatGBM <- confusionMatrix(predictGBM, ttest$classe)

predictRF <- predict(fit.rf, newdata=ttest)
confMatRF <- confusionMatrix(predictRF, ttest$classe)

performance <- matrix(round(c(confMatLDA$overall, confMatKNN$overall, confMatGBM$overall, confMatRF$overall), 3), ncol=4)
rownames(performance) <- names(confMatLDA$overall)   # label rows with the statistic names instead of default letters
colnames(performance) <- c('Linear Discriminant Analysis (LDA)', 'k-Nearest Neighbors (kNN)', 'Gradient Boosting (GBM)', 'Random Forest (RF)')
performance.table <- as.table(performance)
print(performance.table)
##                Linear Discriminant Analysis (LDA) k-Nearest Neighbors (kNN)
## Accuracy                                     0.697                     0.908
## Kappa                                        0.616                     0.883
## AccuracyLower                                0.685                     0.900
## AccuracyUpper                                0.708                     0.915
## AccuracyNull                                 0.284                     0.284
## AccuracyPValue                               0.000                     0.000
## McnemarPValue                                0.000                     0.000
##                Gradient Boosting (GBM) Random Forest (RF)
## Accuracy                         0.956              0.995
## Kappa                            0.945              0.993
## AccuracyLower                    0.951              0.992
## AccuracyUpper                    0.962              0.996
## AccuracyNull                     0.284              0.284
## AccuracyPValue                   0.000              0.000
## McnemarPValue                    0.000
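Since the random forest's held-out accuracy is 0.995, its expected out-of-sample error can be read off directly as one minus that accuracy:

#Expected out-of-sample error of the chosen model (~0.005 here)
1 - confMatRF$overall["Accuracy"]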
  7. Step - Apply the chosen model to the real test dataset to predict the class of each of the 20 test cases.
predictions <- predict(fit.rf, test)
table(predictions,test$problem_id)
##            
## predictions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##           A 0 1 0 1 1 0 0 0 1  1  0  0  0  1  0  0  1  0  0  0
##           B 1 0 1 0 0 0 0 1 0  0  1  0  1  0  0  0  0  1  1  1
##           C 0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0
##           D 0 0 0 0 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0
##           E 0 0 0 0 0 1 0 0 0  0  0  0  0  0  1  1  0  0  0  0
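For readability, the same result can be listed as one predicted class per test case, e.g.:

#One prediction per problem_id, in test-case order
data.frame(problem_id = test$problem_id, prediction = predictions)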

8. Conclusion

The best model, as shown in step 5, is the random forest, with a mean cross-validated accuracy of 0.9908 and a held-out accuracy of 0.995, so it was chosen as the model for predicting the activity classes. GBM also performed very well. For a random forest classifier, accuracy generally improves as the number of trees grows until it levels off, which is why we believe this approach was the most suitable for our purpose, and the results support that choice.
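One way to see this levelling-off is to plot the fitted random forest's error against the number of trees (a suggested check using the model trained above):

#Out-of-bag and per-class error rates versus number of trees
plot(fit.rf$finalModel, main="Random forest error vs. number of trees")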