The goal of this project is to predict the manner in which people performed an exercise.
To perform this study, six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions, as recorded in the “classe” variable of the data set.
Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20 and 28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25 kg).
The dataset has 19622 rows and 160 columns.
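This can be checked directly once the raw CSV has been read into the data frame train used throughout this analysis:
dim(train)
## [1] 19622   160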
There are different ways to measure the out-of-sample error rate.
Continuous outcomes:
- RMSE = root mean squared error
- RSquared = the squared correlation coefficient (R²) from regression models
Categorical outcomes:
- Accuracy = fraction of correct predictions
- Kappa = a measure of concordance (agreement corrected for chance)
As we are performing a classification prediction, we will use Accuracy and Kappa as our out-of-sample error measures. Ideally, we would like an Accuracy of more than 90%.
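To make these two measures concrete, here is a minimal sketch that computes them by hand for a small hypothetical prediction vector (the obs and pred vectors below are invented for illustration; caret's postResample() returns the same pair of statistics):
# Hypothetical 5-observation example, invented for illustration
obs  <- factor(c("A", "A", "B", "B", "C"))
pred <- factor(c("A", "A", "B", "C", "C"), levels = levels(obs))
n <- length(obs)
accuracy <- mean(pred == obs)                     # fraction correct: 0.8
p_e <- sum((table(pred) / n) * (table(obs) / n))  # agreement expected by chance: 0.32
kappa <- (accuracy - p_e) / (1 - p_e)             # Cohen's kappa: ~0.71
c(Accuracy = accuracy, Kappa = kappa)             # same as caret::postResample(pred, obs)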
With a medium-sized data set, we split the data into training and testing sets at a 60:40 ratio.
## Create a training (60%) and testing (40%) set
library(caret)   # provides createDataPartition(), trainControl(), train(), confusionMatrix()
inTrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
training <- train[inTrain, ]; testing <- train[-inTrain, ]
dim(training)
## [1] 11776 153
dim(testing)
## [1] 7846 153
We will use 10-fold cross-validation to estimate accuracy.
This splits the training set into 10 parts; the model is trained on 9 parts and tested on the remaining 1, rotating through all 10 train-test combinations to get a more reliable accuracy estimate.
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
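As an aside, caret's createFolds() can be used to see what such a split looks like. This is purely illustrative, since train() constructs its own folds internally from the trainControl object above:
# Illustrative only: how 10 stratified folds of the training set would look
folds <- createFolds(training$classe, k = 10)
sapply(folds, length)   # each fold holds roughly a tenth of the 11776 rows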
Since we do not know which algorithms will perform well on this problem or which configurations to use, we will evaluate four common algorithms and compare their performance: CART, k-Nearest Neighbors (kNN), a Support Vector Machine (SVM) with a radial kernel, and Random Forest.
We reset the random number seed before each run so that every algorithm is evaluated on exactly the same data splits. This ensures the results are directly comparable.
Let’s build our four models:
# CART
set.seed(7)
fit.cart <- train(classe~., data=training, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(classe~., data=training, method="knn", metric=metric, trControl=control)
# SVM
set.seed(7)
fit.svm <- train(classe~., data=training, method="svmRadial", metric=metric, trControl=control)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
(This warning is repeated many times during resampling: some predictors are constant within a fold, so kernlab cannot scale them. It does not prevent the SVM from being fit, though removing near-zero-variance predictors beforehand would silence it.)
# Random Forest
set.seed(7)
fit.rf <- train(classe~., data=training, method="rf", metric=metric, trControl=control)
We now have four models and an accuracy estimate for each. We need to compare the models to each other and select the most accurate.
We can report the accuracy of each model by collecting the fitted models in a list with resamples() and passing the result to summary().
# summarize accuracy of models
results <- resamples(list(cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cart 0.4808836 0.4948015 0.5095550 0.5070446 0.5145279 0.5424448 0
## knn 0.8581138 0.8649973 0.8696399 0.8702445 0.8732486 0.8837012 0
## svm 0.8071368 0.8285229 0.8398471 0.8386540 0.8551717 0.8581138 0
## rf 0.9872666 0.9893798 0.9910826 0.9914225 0.9938441 0.9957555 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## cart 0.3222774 0.3405656 0.3602228 0.3565816 0.3663887 0.4047856 0
## knn 0.8205567 0.8292003 0.8351034 0.8358497 0.8395636 0.8531256 0
## svm 0.7555647 0.7825658 0.7970345 0.7955157 0.8165723 0.8201732 0
## rf 0.9838910 0.9865619 0.9887224 0.9891495 0.9922122 0.9946311 0
dotplot(results)
fit.rf
## Random Forest
##
## 11776 samples
## 152 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10598, 10599, 10598, 10599, 10597, 10599, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8412042 0.7970340
## 77 0.9914225 0.9891495
## 152 0.9825901 0.9779762
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 77.
We can see that the most accurate model in this case was Random Forest with mtry = 77, meaning 77 of the 152 predictors are sampled as candidate split variables at each tree node (all predictors remain available to the forest as a whole).
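As a side note, the tuning grid {2, 77, 152} was not chosen by us; this appears to be caret's default behaviour when no tuneGrid is supplied, namely three mtry values spanning 2 to the number of predictors:
# Assumed origin of the default mtry grid for method = "rf" (not set by us)
p <- ncol(training) - 1                   # 152 predictors (153 columns minus classe)
unique(floor(seq(2, p, length.out = 3)))  # 2, 77, 152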
To validate the model choice, we apply the chosen model once to the held-out testing set.
pred <- predict(fit.rf, testing)
testing$predRight <- pred == testing$classe   # flag each correct prediction
cmatrix <- confusionMatrix(table(pred, testing$classe))
cmatrix
## Confusion Matrix and Statistics
##
##
## pred A B C D E
## A 2231 7 0 0 0
## B 1 1509 0 0 0
## C 0 1 1364 3 0
## D 0 1 4 1283 2
## E 0 0 0 0 1440
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.9962, 0.9985)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9969
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9996 0.9941 0.9971 0.9977 0.9986
## Specificity 0.9988 0.9998 0.9994 0.9989 1.0000
## Pos Pred Value 0.9969 0.9993 0.9971 0.9946 1.0000
## Neg Pred Value 0.9998 0.9986 0.9994 0.9995 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1923 0.1738 0.1635 0.1835
## Detection Prevalence 0.2852 0.1925 0.1744 0.1644 0.1835
## Balanced Accuracy 0.9992 0.9970 0.9982 0.9983 0.9993
The result shows an Accuracy of 0.9976 and a Kappa of 0.9969, in line with the cross-validated accuracy estimated on the training set (0.9914).
From the confusion matrix, we know that the 95% confidence interval of the Accuracy lies between 0.9962 and 0.9985, and the estimated out-of-sample error rate is 1 - 0.9976 = 0.0024, so we can be very confident that the Random Forest prediction model is accurate.
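These headline numbers can be pulled straight from the confusionMatrix object computed above:
cmatrix$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
1 - cmatrix$overall[["Accuracy"]]   # estimated out-of-sample error, about 0.0024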