I. Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed an exercise. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

II. Data loading and manipulation

1. Load the database

The dataset used in the project comes from “Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013”.

Data description: Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20 and 28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25 kg).

The training data for this project is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The datasets are loaded directly from the URLs above.

training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE)
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE)
dim(training)
## [1] 19622   160
dim(test)
## [1]  20 160
table(training$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
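
As a reference point for the accuracies reported later, the class counts above can be turned into proportions. The following is a minimal sketch in base R; the counts are copied from the table(training$classe) output, and the variable names counts, props and baseline are illustrative only.

```r
# Class counts copied from the table(training$classe) output above
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class among all training rows
props <- counts / sum(counts)
print(round(props, 4))

# Accuracy of a trivial rule that always predicts the most common class
baseline <- max(props)
print(names(which.max(props)))
```

Class A is the most frequent class, so a model is only informative if its accuracy is clearly above the share of Class A.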

2. Required packages

The R packages below are required for the data analyses. The caret (Classification And Regression Training) package streamlines the model training process for complex regression and classification problems, and the rattle package supplies the fancyRpartPlot() function used later to plot the classification tree.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

3. Data manipulation

Data manipulation takes three steps. First, the first 7 variables in the training dataset are removed, since they hold participant names and time-stamp information rather than sensor measurements and are not meaningful predictors. Second, variables that contain any NA are excluded, since missing values are not allowed in model building. Third, near-zero-variance variables are excluded as well. The 53 columns that remain are 52 predictors plus the outcome classe, which has 5 classes to be predicted.

training_clean <- training[, -c(1:7)] 
dim(training_clean)
## [1] 19622   153
training_na <- training_clean[sapply(training_clean, function(x) !any(is.na(x)))] 
dim(training_na)
## [1] 19622    86
training_nzv <- training_na[, -nearZeroVar(training_na)] 
dim(training_nzv)
## [1] 19622    53
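
The filtering steps can be tried on a small example. The sketch below uses base R only and a hypothetical toy data frame (toy, flagged and the column names are made up for illustration). It also shows a pitfall of negative indexing: if the vector of flagged columns is ever empty, as could happen if nearZeroVar() flags nothing, indexing with its negation selects zero columns rather than all of them. That does not occur in the analysis above, where 33 columns are flagged, but a guarded form is cheap.

```r
# Toy data frame (hypothetical), not the real sensor data
toy <- data.frame(
  roll  = c(1.2, 3.4, 5.6, 7.8),
  pitch = c(0.5, NA, 0.7, 0.9),   # contains a missing value
  yaw   = c(10, 20, 30, 40)
)

# Keep only the columns that contain no NA
complete_cols <- vapply(toy, function(x) !anyNA(x), logical(1))
toy_na <- toy[, complete_cols, drop = FALSE]

# Pitfall: negative indexing with an empty vector selects zero columns
flagged <- integer(0)   # stands in for an empty result from a filter
dropped_all <- toy_na[, -flagged, drop = FALSE]

# Guarded version: compute the columns to keep explicitly
keep <- setdiff(seq_along(toy_na), flagged)
toy_nzv <- toy_na[, keep, drop = FALSE]

print(ncol(dropped_all))
print(names(toy_nzv))
```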

4. Cross validation

The training data is split into a 70% train set and a 30% test set. The train set is used to build the prediction models, and the test set is used to estimate the out-of-sample error.

set.seed(1212)
CVdata <- createDataPartition(y=training_nzv$classe, p=0.7, list=FALSE)
trainset <- training_nzv[CVdata,]; dim(trainset)
## [1] 13737    53
testset <- training_nzv[-CVdata,]; dim(testset)
## [1] 5885   53
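
The idea behind a class-balanced 70/30 split can be sketched in base R. This is an illustration under stated assumptions, not caret's implementation of createDataPartition(): the outcome vector y is synthetic (it only reuses the class counts of training$classe), and the exact rows and set sizes will differ from those above.

```r
set.seed(1212)

# Synthetic outcome vector with the same class counts as training$classe
y <- factor(rep(c("A", "B", "C", "D", "E"),
                times = c(5580, 3797, 3422, 3216, 3607)))

# Sample about 70% of the row indices within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.7 * length(i)))
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

# Class proportions are nearly identical in both parts
print(round(prop.table(table(y[train_idx])), 3))
print(round(prop.table(table(y[test_idx])), 3))
```

Sampling within each class keeps the class mix of the train and test sets close to that of the full data, which matters when accuracy is compared against the share of the largest class.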

III. Prediction model building

For this project, three types of approaches are used to build the prediction model: (1) Classification tree, (2) Random forest, (3) Gradient boosted model.

A confusion matrix is displayed at the end of each analysis to show where each model's predictions agree and disagree with the true classes. The final model is the one with the highest accuracy on the test set among the three models.

1. Classification Trees

First, a model is built using a classification tree, and the fancyRpartPlot() function is then used to plot the tree.

set.seed(1235)
modTree <- train(classe ~., method="rpart", data=trainset)
print(modTree$finalModel)
## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 12559 8662 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -33.95 1122    7 A (0.99 0.0062 0 0 0) *
##      5) pitch_forearm>=-33.95 11437 8655 A (0.24 0.23 0.21 0.2 0.12)  
##       10) magnet_dumbbell_y< 439.5 9694 6967 A (0.28 0.18 0.24 0.19 0.11)  
##         20) roll_forearm< 123.5 6037 3599 A (0.4 0.19 0.18 0.17 0.058) *
##         21) roll_forearm>=123.5 3657 2434 C (0.079 0.18 0.33 0.22 0.18) *
##       11) magnet_dumbbell_y>=439.5 1743  864 B (0.032 0.5 0.045 0.23 0.19) *
##    3) roll_belt>=130.5 1178    9 E (0.0076 0 0 0 0.99) *
fancyRpartPlot(modTree$finalModel)

The model “modTree” is validated on the “testset” dataset to evaluate its accuracy.

predtree <- predict(modTree, testset)
confusionMatrix(predtree, testset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1527  457  492  414  172
##          B   26  407   30  173  150
##          C  116  275  504  377  298
##          D    0    0    0    0    0
##          E    5    0    0    0  462
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4928          
##                  95% CI : (0.4799, 0.5056)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.337           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9122  0.35733  0.49123   0.0000  0.42699
## Specificity            0.6355  0.92014  0.78061   1.0000  0.99896
## Pos Pred Value         0.4987  0.51781  0.32102      NaN  0.98929
## Neg Pred Value         0.9479  0.85644  0.87903   0.8362  0.88557
## Prevalence             0.2845  0.19354  0.17434   0.1638  0.18386
## Detection Rate         0.2595  0.06916  0.08564   0.0000  0.07850
## Detection Prevalence   0.5203  0.13356  0.26678   0.0000  0.07935
## Balanced Accuracy      0.7738  0.63874  0.63592   0.5000  0.71297
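
The headline numbers in this output can be recomputed directly from the confusion matrix. The sketch below, in base R, copies the counts from the table above; the names cm, accuracy, sensitivity and pred_counts are illustrative.

```r
# Confusion matrix copied from the output above (rows = predictions)
cm <- matrix(c(1527, 457, 492, 414, 172,
                 26, 407,  30, 173, 150,
                116, 275, 504, 377, 298,
                  0,   0,   0,   0,   0,
                  5,   0,   0,   0, 462),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over true cases of that class
sensitivity <- diag(cm) / colSums(cm)

# Number of times each class is predicted
pred_counts <- rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(pred_counts)
```

The row for Class D is all zeros: this tree never predicts Class D, which is why its sensitivity is 0 and its positive predictive value is NaN.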

2. Random forest

Second, a model is built using random forest. Tuning uses 4 bootstrap resamples, set through trainControl().

trainCT <- trainControl(method="boot", number=4)
modrf <- train(classe ~., method="rf", data=trainset, prox=TRUE, trControl=trainCT)
print(modrf$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3901    4    0    0    1 0.001280082
## B   18 2633    7    0    0 0.009405568
## C    0    9 2380    7    0 0.006677796
## D    0    0   27 2224    1 0.012433393
## E    0    1    4    6 2514 0.004356436

The model “modrf” is validated on the “testset” dataset to evaluate its accuracy.

predrf <- predict(modrf, testset)
confusionMatrix(predrf, testset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    4    0    0    0
##          B    1 1133    7    0    0
##          C    0    2 1012   15    2
##          D    0    0    7  945    2
##          E    0    0    0    4 1078
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9925        
##                  95% CI : (0.99, 0.9946)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9905        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9947   0.9864   0.9803   0.9963
## Specificity            0.9991   0.9983   0.9961   0.9982   0.9992
## Pos Pred Value         0.9976   0.9930   0.9816   0.9906   0.9963
## Neg Pred Value         0.9998   0.9987   0.9971   0.9961   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1925   0.1720   0.1606   0.1832
## Detection Prevalence   0.2850   0.1939   0.1752   0.1621   0.1839
## Balanced Accuracy      0.9992   0.9965   0.9912   0.9892   0.9977
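
The random forest gives two error estimates: the out-of-bag (OOB) estimate computed during training, and the error on the held-out test set. Both can be recomputed from the confusion matrices printed above. This is a base R sketch; error_rate is a helper name introduced here for illustration.

```r
# OOB confusion matrix from print(modrf$finalModel)
oob <- matrix(c(3901,    4,    0,    0,    1,
                  18, 2633,    7,    0,    0,
                   0,    9, 2380,    7,    0,
                   0,    0,   27, 2224,    1,
                   0,    1,    4,    6, 2514),
              nrow = 5, byrow = TRUE)

# Test-set confusion matrix from confusionMatrix(predrf, testset$classe)
holdout <- matrix(c(1673,    4,    0,    0,    0,
                       1, 1133,    7,    0,    0,
                       0,    2, 1012,   15,    2,
                       0,    0,    7,  945,    2,
                       0,    0,    0,    4, 1078),
                  nrow = 5, byrow = TRUE)

# Error rate: share of cases off the diagonal
error_rate <- function(m) 1 - sum(diag(m)) / sum(m)

oob_error <- error_rate(oob)
holdout_error <- error_rate(holdout)

print(round(100 * oob_error, 2))
print(round(100 * holdout_error, 2))
```

The two estimates are close (about 0.62% and 0.75%), which suggests the OOB estimate is a reasonable guide to the out-of-sample error for this model.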

3. Gradient boosted model

Third, a model is built using a gradient boosted model, with the same resampling settings as the random forest.

modgbm <- train(classe~., method="gbm", data=trainset, verbose=FALSE, trControl=trainCT)
print(modgbm$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 51 had non-zero influence.

The model “modgbm” is validated on the “testset” dataset to evaluate its accuracy.

predgbm <- predict(modgbm, testset)
confusionMatrix(predgbm, testset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1651   34    0    2    2
##          B   14 1069   30    6   15
##          C    5   34  981   29    8
##          D    2    0   14  921   22
##          E    2    2    1    6 1035
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9613        
##                  95% CI : (0.956, 0.966)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.951         
##                                         
##  Mcnemar's Test P-Value : 3.522e-07     
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9863   0.9385   0.9561   0.9554   0.9566
## Specificity            0.9910   0.9863   0.9844   0.9923   0.9977
## Pos Pred Value         0.9775   0.9427   0.9281   0.9604   0.9895
## Neg Pred Value         0.9945   0.9853   0.9907   0.9913   0.9903
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2805   0.1816   0.1667   0.1565   0.1759
## Detection Prevalence   0.2870   0.1927   0.1796   0.1630   0.1777
## Balanced Accuracy      0.9886   0.9624   0.9702   0.9738   0.9771
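
The Kappa statistic in this output adjusts accuracy for the agreement expected by chance. It can be recomputed from the confusion matrix with the standard Cohen's kappa formula, as in the base R sketch below; the variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows = predictions)
cm <- matrix(c(1651,   34,    0,    2,    2,
                 14, 1069,   30,    6,   15,
                  5,   34,  981,   29,    8,
                  2,    0,   14,  921,   22,
                  2,    2,    1,    6, 1035),
             nrow = 5, byrow = TRUE)

n <- sum(cm)

# Observed agreement: the overall accuracy
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa
kappa_hat <- (p_observed - p_expected) / (1 - p_expected)

print(round(p_observed, 4))
print(round(kappa_hat, 3))
```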

IV. Model comparison

Comparing test-set accuracy across the three models, random forest performs best (0.9925, versus 0.9613 for the gradient boosted model and 0.4928 for the classification tree). Therefore, it is chosen to predict classes for the 20 test cases.

confusionMatrix(predtree, testset$classe)$overall['Accuracy']
##  Accuracy 
## 0.4927782
confusionMatrix(predrf, testset$classe)$overall['Accuracy']
##  Accuracy 
## 0.9925234
confusionMatrix(predgbm, testset$classe)$overall['Accuracy']
##  Accuracy 
## 0.9612574
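
The comparison can also be done programmatically. The sketch below, in base R, collects the three accuracies printed above into a named vector, ranks them, and converts them to estimated out-of-sample error rates; the labels tree, rf and gbm are chosen here for illustration.

```r
# Test-set accuracies copied from the output above
accuracy <- c(tree = 0.4927782, rf = 0.9925234, gbm = 0.9612574)

# Rank the models from most to least accurate
ranked <- sort(accuracy, decreasing = TRUE)
best <- names(ranked)[1]

# Estimated out-of-sample error is one minus the test-set accuracy
oos_error <- 1 - accuracy

print(best)
print(round(100 * oos_error, 2))
```

For the random forest this gives an estimated out-of-sample error of about 0.75%.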

V. Apply optimal model to predict classes for the 20 test cases

The results are shown below.

predict(modrf, test)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E