Introduction

The goal of this project is to predict the manner in which participants performed a weight lifting exercise.

Weight Lifting Exercises Dataset: on-body sensing schema

To perform this study, six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

  • exactly according to the specification (Class A),
  • throwing the elbows to the front (Class B),
  • lifting the dumbbell only halfway (Class C),
  • lowering the dumbbell only halfway (Class D), and
  • throwing the hips to the front (Class E),

as defined in the “classe” variable in the data set.

Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged 20-28 years with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25 kg).

The dataset has 19622 rows and 160 columns.
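
For reference, a minimal sketch of how the data can be loaded and inspected; the file name and URL are assumptions based on the standard distribution of this dataset and are not shown in the original analysis:

# Load the Weight Lifting Exercises training data (file name/URL assumed)
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if (!file.exists("pml-training.csv")) download.file(url, "pml-training.csv")
train <- read.csv("pml-training.csv")
dim(train)            # 19622 rows, 160 columns
table(train$classe)   # counts of the five classes A-E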

Prediction Study Design

  1. Define error rate

There are different ways to measure the out-of-sample error rate.

Continuous outcomes:
- RMSE = root mean squared error
- RSquared = R² from regression models

Categorical outcomes:
- Accuracy = fraction of correct predictions
- Kappa = a measure of concordance (agreement beyond chance)

As we are performing a classification prediction, we will use Accuracy and Kappa as our out-of-sample error measures. Ideally, we would like an Accuracy rate of more than 90%.
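
As an illustration (not part of the original analysis), both metrics can be computed directly from a confusion table; caret's confusionMatrix reports them, but the underlying arithmetic is simple:

# Illustrative sketch: Accuracy and Cohen's Kappa from a confusion table
accuracy_kappa <- function(pred, truth) {
  tab <- table(pred, truth)
  n   <- sum(tab)
  po  <- sum(diag(tab)) / n                      # observed agreement (Accuracy)
  pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  c(Accuracy = po, Kappa = (po - pe) / (1 - pe))
}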

  2. Split data into training and test sets

With a medium-sized data set, we set our train/test split at a 60/40 ratio.

## Create a training (60%) and testing (40%) set
library(caret)  # provides createDataPartition, train, confusionMatrix
inTrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
training <- train[inTrain, ]
testing  <- train[-inTrain, ]

dim(training)
## [1] 11776   153
dim(testing)
## [1] 7846  153
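
Note that the partitions show 153 columns rather than the 160 in the raw data, so a cleaning step (presumably dropping identifier and mostly-NA columns) was applied beforehand but is not shown. As a quick sanity check (also not in the original write-up), createDataPartition samples within each class, so the class proportions should be nearly identical across the two sets:

# Stratified split: class proportions should match across the two sets
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)
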
  3. On the train set, pick features using cross-validation

We will use 10-fold cross-validation to estimate accuracy.

This splits the training set into 10 parts; each model is trained on 9 parts and tested on the remaining one, rotating through all 10 train-test combinations. Repeating the whole procedure with fresh random splits would tighten the estimate further, but here a single run of 10-fold cross-validation is used, as configured below.

# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
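
If repeated resampling were desired, caret supports it directly; a sketch of the alternative configuration (not used in this analysis):

# Alternative (illustrative only): 10-fold CV repeated 3 times
control_rep <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
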
  4. On the train set, pick a prediction function using cross-validation

Since we do not know which algorithms will perform well on this problem or what configurations to use, we will evaluate four common algorithms and compare their performance: CART, k-Nearest Neighbors (kNN), a Support Vector Machine (SVM) with a radial kernel, and Random Forest.

We reset the random number seed before each run so that each algorithm is evaluated using exactly the same data splits, which makes the results directly comparable.

Let’s build our four models:

# CART
set.seed(7)
fit.cart <- train(classe~., data=training, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(classe~., data=training, method="knn", metric=metric, trControl=control)
# SVM
set.seed(7)
fit.svm <- train(classe~., data=training, method="svmRadial", metric=metric, trControl=control)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.

(kernlab emits this warning once per resample: one or more predictors are constant, so they cannot be centered and scaled.)
# Random Forest
set.seed(7)
fit.rf <- train(classe~., data=training, method="rf", metric=metric, trControl=control)

We now have four models and a cross-validated accuracy estimate for each. We need to compare the models to each other and select the most accurate.

We can report the accuracy of each model by collecting the fitted models into a list with resamples() and calling summary() on the result.

# summarize accuracy of models
results <- resamples(list(cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart 0.4808836 0.4948015 0.5095550 0.5070446 0.5145279 0.5424448    0
## knn  0.8581138 0.8649973 0.8696399 0.8702445 0.8732486 0.8837012    0
## svm  0.8071368 0.8285229 0.8398471 0.8386540 0.8551717 0.8581138    0
## rf   0.9872666 0.9893798 0.9910826 0.9914225 0.9938441 0.9957555    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart 0.3222774 0.3405656 0.3602228 0.3565816 0.3663887 0.4047856    0
## knn  0.8205567 0.8292003 0.8351034 0.8358497 0.8395636 0.8531256    0
## svm  0.7555647 0.7825658 0.7970345 0.7955157 0.8165723 0.8201732    0
## rf   0.9838910 0.9865619 0.9887224 0.9891495 0.9922122 0.9946311    0
dotplot(results)

fit.rf
## Random Forest 
## 
## 11776 samples
##   152 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10598, 10599, 10598, 10599, 10597, 10599, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.8412042  0.7970340
##    77   0.9914225  0.9891495
##   152   0.9825901  0.9779762
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 77.

We can see that the most accurate model in this case was Random Forest with mtry = 77, i.e. 77 of the 152 predictors considered as split candidates at each node.
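
As a follow-up (not shown in the original output), caret can rank predictors by their contribution to the Random Forest, which is useful for checking which sensor readings drive the model:

# Rank predictors by importance in the fitted Random Forest
varImp(fit.rf)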

  5. Apply the prediction model one time to the test set

To validate the model choice, we apply the chosen model exactly once to the held-out test set, which gives an unbiased estimate of its out-of-sample performance.

pred <- predict(fit.rf, testing)
testing$predRight <- pred == testing$classe  # flag rows predicted correctly

cmatrix <- confusionMatrix(table(pred, testing$classe))
cmatrix
## Confusion Matrix and Statistics
## 
##     
## pred    A    B    C    D    E
##    A 2231    7    0    0    0
##    B    1 1509    0    0    0
##    C    0    1 1364    3    0
##    D    0    1    4 1283    2
##    E    0    0    0    0 1440
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9976          
##                  95% CI : (0.9962, 0.9985)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9969          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9941   0.9971   0.9977   0.9986
## Specificity            0.9988   0.9998   0.9994   0.9989   1.0000
## Pos Pred Value         0.9969   0.9993   0.9971   0.9946   1.0000
## Neg Pred Value         0.9998   0.9986   0.9994   0.9995   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1923   0.1738   0.1635   0.1835
## Detection Prevalence   0.2852   0.1925   0.1744   0.1644   0.1835
## Balanced Accuracy      0.9992   0.9970   0.9982   0.9983   0.9993

The result shows an Accuracy of 0.9976 and a Kappa of 0.9969, in line with the cross-validated estimates from the training set (Accuracy 0.9914, Kappa 0.9891).
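
The estimated out-of-sample error rate follows directly as 1 - Accuracy (about 0.0024 here); it can be extracted from the confusionMatrix object:

# Estimated out-of-sample error rate on the held-out test set
1 - cmatrix$overall["Accuracy"]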

Conclusion

From the confusion matrix, we know that the 95% confidence interval for the Accuracy lies between 0.9962 and 0.9985, so we can be confident that the Random Forest prediction model generalizes well to unseen data.