With the advent of devices such as Jawbone Up, Nike FuelBand, and Fitbit, an increasing number of fitness enthusiasts are tracking and monitoring their activity levels, though with little regard for the quality of those activities. This assignment seeks to understand, quantify, and predict how well an individual performs a particular physical activity. It begins with data extraction, storage, cleaning, and partitioning. Several predictive models are then built on the training set, and the accuracy of each is estimated on a validation set. The most accurate model is chosen to predict how well the activities were performed for 20 different cases in the testing dataset.
With devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements of themselves regularly to improve their health, to find patterns in their behaviour, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
The data has been made available through the following links:
1. Training Data Set - https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
2. Testing Data Set - https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The Weight Lifting Exercises dataset for this project was obtained from this source: http://groupware.les.inf.puc-rio.br/har.
The data were originally collected for the following paper: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The aim of the assignment is to design prediction models, based on the available dataset, that assess the quality of the barbell lifts performed by the participants. Once several prediction models have been built, the most accurate one is chosen to predict the quality of the lifts for 20 different cases. The results of this prediction are then used to answer a quiz that validates the prediction model.
The libraries that we will be tapping into are as follows:
library(ggplot2)        # plotting
library(caret)          # data partitioning, model training and evaluation
library(kernlab)        # kernel-based methods
library(MASS)           # backend for linear discriminant analysis
library(rpart)          # CART decision trees
library(randomForest)   # random forests
Using the aforementioned links, the data are read into the training and testing variables.
training<- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
header = TRUE,
sep = ",",
na.strings = c("NA","#DIV/0!",""))
testing<- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
header = TRUE,
sep = ",",
na.strings = c("NA","#DIV/0!",""))
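Reading the files over the network on every run can be slow. Below is an optional sketch that downloads each file once and reads the cached copy afterwards; the local file names are my own choice, and the result is equivalent to the code above.
# Download each file only once, then read the cached local copy
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(train_url, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, "pml-testing.csv")
training <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))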
The training data set is now split into a training set and a cross-validation set; the testing set is also copied into the variable test.
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
my_train<- training[inTrain, ]
my_test<- training[-inTrain, ]
test<- testing
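Since createDataPartition samples within each level of classe, the class proportions should be roughly preserved in both partitions. A quick check:
# Class shares in the two partitions; these should be nearly identical
round(prop.table(table(my_train$classe)), 3)
round(prop.table(table(my_test$classe)), 3)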
The first step of data cleaning and tidying is the removal of near-zero-variance predictors: variables whose values are almost constant and therefore carry little information about the outcome.
near_zero_train<- nearZeroVar(training)
near_zero_test<- nearZeroVar(testing)
my_train<- my_train[,-near_zero_train]
my_test<- my_test[,-near_zero_train]
test<- test[,-near_zero_test]
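To see why particular columns are flagged, nearZeroVar can also return its diagnostics. A quick sketch:
# saveMetrics = TRUE returns the frequency ratio and percent-unique for every column
nzv_metrics <- nearZeroVar(training, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])   # a few of the columns flagged as near zero variance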
Further on, variables with more than 20% of their data points set to NA are removed (in this dataset, such columns are in practice almost entirely NA).
my_train <- my_train[, colMeans(is.na(my_train)) < 0.2]
my_test <- my_test[, colMeans(is.na(my_test)) < 0.2]
test <- test[, colMeans(is.na(test)) < 0.2]
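Because nearZeroVar was run separately on the training and testing files, it is worth confirming that the filtered sets still share the same columns. They should differ only in the outcome column: classe in the training partitions, problem_id in the test set.
# Column names should agree except for the outcome column
setdiff(names(my_train), names(test))   # expected: "classe"
setdiff(names(test), names(my_train))   # expected: "problem_id"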
Finally, the identification variables (the first five columns, which hold the row index, participant name, and timestamps) are also removed.
my_train <- my_train[, -(1:5)]
my_test <- my_test[, -(1:5)]
test <- test[, -(1:5)]
The final datasets now have the following dimensions:
dim(my_train)
## [1] 13737 54
dim(my_test)
## [1] 5885 54
dim(test)
## [1] 20 54
Data cleaning and tidying have resulted in considerably narrower datasets, each with the same 54 columns.
Since we still have more than 50 predictors, it would be tempting to shrink the dataset a little further before proceeding to the prediction models. Before going ahead with principal component analysis (PCA), however, we should verify that doing so would actually be beneficial.
corMatrix <- abs(cor(my_train[, -54]))   # absolute correlations between the 53 predictors
values <- length(corMatrix)/2            # entries in one half of the matrix
n_high <- sum(corMatrix > 0.8)           # entries above the 0.8 threshold
percent <- round(n_high/values * 100, 1)
Only about 5.9% of the correlation entries exceed 0.8, so a principal component transformation would bring only a small reduction in variance at the cost of added bias and lost interpretability. PCA will therefore not be conducted.
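Had collinearity been a bigger concern, caret's findCorrelation offers a lighter-weight alternative to PCA: it identifies the specific columns that could be dropped. A sketch:
# Column indices that findCorrelation suggests removing at the 0.8 cutoff
high_cor <- findCorrelation(cor(my_train[, -54]), cutoff = 0.8)
length(high_cor)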
Decision trees handle non-linear relationships between predictors and outcome naturally and cope well with large datasets, so a CART model (via rpart) is fitted first.
set.seed(333)
fit_rpart<- train(classe~., method = "rpart", data = my_train)
pred_rpart<- predict(fit_rpart, my_test)
conf_rpart<- confusionMatrix(pred_rpart, my_test$classe)
conf_rpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1516 495 487 431 102
## B 24 384 23 158 88
## C 109 260 516 336 231
## D 0 0 0 0 0
## E 25 0 0 39 661
##
## Overall Statistics
##
## Accuracy : 0.5229
## 95% CI : (0.51, 0.5357)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3767
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9056 0.33714 0.50292 0.0000 0.6109
## Specificity 0.6402 0.93826 0.80737 1.0000 0.9867
## Pos Pred Value 0.5002 0.56721 0.35537 NaN 0.9117
## Neg Pred Value 0.9446 0.85503 0.88495 0.8362 0.9184
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.1839
## Detection Rate 0.2576 0.06525 0.08768 0.0000 0.1123
## Detection Prevalence 0.5150 0.11504 0.24673 0.0000 0.1232
## Balanced Accuracy 0.7729 0.63770 0.65515 0.5000 0.7988
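The fitted tree can be visualised to see which sensor readings drive the splits. A sketch, assuming the rpart.plot package (not among the libraries loaded above) is installed:
# fit_rpart$finalModel is the underlying rpart object
rpart.plot::rpart.plot(fit_rpart$finalModel)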
Linear discriminant analysis (LDA) is closely related to dimensionality-reduction techniques: it finds the linear combinations of predictors that maximise the separability between classes. It is therefore often employed when the outcome depends on a large number of predictors.
set.seed(333)
fit_lda<- train(classe~., method = "lda", data = my_train)
pred_lda<- predict(fit_lda, my_test)
conf_lda<- confusionMatrix(pred_lda, my_test$classe)
conf_lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1414 169 96 75 37
## B 37 739 91 39 165
## C 116 135 687 122 93
## D 101 51 125 688 110
## E 6 45 27 40 677
##
## Overall Statistics
##
## Accuracy : 0.7145
## 95% CI : (0.7028, 0.726)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6383
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8447 0.6488 0.6696 0.7137 0.6257
## Specificity 0.9105 0.9300 0.9041 0.9214 0.9754
## Pos Pred Value 0.7895 0.6900 0.5958 0.6400 0.8516
## Neg Pred Value 0.9365 0.9169 0.9284 0.9426 0.9204
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2403 0.1256 0.1167 0.1169 0.1150
## Detection Prevalence 0.3043 0.1820 0.1959 0.1827 0.1351
## Balanced Accuracy 0.8776 0.7894 0.7868 0.8175 0.8006
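Beyond hard class labels, caret's predict method can return the LDA posterior probabilities, which is useful for inspecting borderline cases. A short sketch:
# Posterior class probabilities for the first few validation rows
head(predict(fit_lda, my_test, type = "prob"))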
Random forest is an ensemble machine learning method: it constructs a large number of decision trees on bootstrap samples of the data and aggregates their votes to produce a single classification.
set.seed(333)
fit_rf<- randomForest(classe~., data = my_train)
pred_rf<- predict(fit_rf, my_test)
conf_rf<- confusionMatrix(pred_rf, my_test$classe)
conf_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 4 0 0 0
## B 0 1135 4 0 0
## C 0 0 1022 6 0
## D 0 0 0 958 2
## E 1 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9954, 0.9983)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9963
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9965 0.9961 0.9938 0.9982
## Specificity 0.9991 0.9992 0.9988 0.9996 0.9998
## Pos Pred Value 0.9976 0.9965 0.9942 0.9979 0.9991
## Neg Pred Value 0.9998 0.9992 0.9992 0.9988 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1929 0.1737 0.1628 0.1835
## Detection Prevalence 0.2850 0.1935 0.1747 0.1631 0.1837
## Balanced Accuracy 0.9992 0.9978 0.9974 0.9967 0.9990
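As a sanity check on such a high accuracy, the randomForest fit itself reports an out-of-bag (OOB) error estimate, and its importance measures show which sensors drive the classification. A short sketch:
fit_rf                      # the printout includes the out-of-bag (OOB) error estimate
imp <- importance(fit_rf)   # mean decrease in Gini impurity per predictor
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
varImpPlot(fit_rf)          # graphical view of the same ranking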
The table below compares the accuracies that the different prediction methods achieved on the validation set (my_test).
a <- data.frame(Prediction_Method = c("Decision Trees", "Linear Discriminant Analysis", "Random Forest"),
                Accuracy = c("52.3%", "71.5%", "99.7%"))
knitr::kable(a)
| Prediction_Method | Accuracy |
|---|---|
| Decision Trees | 52.3% |
| Linear Discriminant Analysis | 71.5% |
| Random Forest | 99.7% |
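Hard-coding the percentages invites transcription slips; they can instead be pulled straight from the stored confusionMatrix objects. A small sketch:
# Extract the overall accuracy from each confusion matrix
acc <- sapply(list(`Decision Trees` = conf_rpart,
                   `Linear Discriminant Analysis` = conf_lda,
                   `Random Forest` = conf_rf),
              function(cm) cm$overall["Accuracy"])
round(acc * 100, 1)   # accuracies in percent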
From the table above it is evident that, with an accuracy of about 99.7% on the validation set, the random forest method is the most effective predictive model. It is therefore used to predict the classes of the 20 cases in the testing dataset.
predict(fit_rf,test)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
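For submitting the quiz, it can help to pair each test case with its predicted class. A minimal sketch (the variable name is my own):
# Tabulate case number against predicted class for the 20 quiz questions
quiz_answers <- data.frame(case = 1:20, prediction = predict(fit_rf, test))
quiz_answers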