Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The dataset used in the project comes from “Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013”.
Data description: Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).
The training data for this project is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The datasets are loaded directly from the URLs above.
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE)
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE)
dim(training)
## [1] 19622 160
dim(test)
## [1] 20 160
table(training$classe)
##
##    A    B    C    D    E
## 5580 3797 3422 3216 3607
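One thing worth checking when reading spreadsheet-exported CSV files is how missing values are encoded, because read.csv() only treats the string "NA" as missing by default. The sketch below uses a small made-up CSV (the values and the "#DIV/0!" marker are assumptions for illustration, not taken from the files above) to show how the na.strings argument changes what is counted as NA.

```r
# Toy CSV text: an empty field, a spreadsheet error marker and a literal NA.
# All values here are invented for illustration.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,NA,B"

# Default behaviour: only the literal string NA becomes a missing value.
raw <- read.csv(text = csv_text, header = TRUE)

# Declaring the other encodings as missing values at load time.
parsed <- read.csv(text = csv_text, header = TRUE,
                   na.strings = c("NA", "", "#DIV/0!"))

colSums(is.na(raw))     # kurtosis_roll_belt has 1 NA
colSums(is.na(parsed))  # kurtosis_roll_belt has 3 NA
```

Whether this matters for a given file depends on how it was exported; inspecting a few of the sparsest columns with summary() is a quick way to find out.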
The R packages below are required for the analysis. The caret (Classification And Regression Training) package streamlines model training for regression and classification problems, and the rattle package provides the fancyRpartPlot() function used to plot the classification tree.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
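Besides the two packages loaded above, caret calls separate modelling packages for the methods used later ("rpart", "rf" and "gbm"). A minimal sketch for checking that everything is installed before a long training run might look like this; the variable names are illustrative only.

```r
# Packages loaded directly, plus the engines caret relies on for
# method = "rpart", "rf" and "gbm".
needed <- c("caret", "rattle", "rpart", "randomForest", "gbm")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}
```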
Data cleaning takes three steps. First, the first 7 variables in the training dataset are removed, since they hold identifiers such as participant names and time stamps, which are not sensor measurements and should not be used as predictors. Second, variables containing any NA are excluded, since the models used here cannot be fitted with missing values. Third, near-zero-variance variables are excluded, since they carry little information for distinguishing the 5 classes. This leaves 52 predictors plus the outcome classe.
training_clean <- training[, -c(1:7)]
dim(training_clean)
## [1] 19622 153
training_na <- training_clean[sapply(training_clean, function(x) !any(is.na(x)))]
dim(training_na)
## [1] 19622 86
training_nzv <- training_na[, -nearZeroVar(training_na)]
dim(training_nzv)
## [1] 19622 53
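One detail of the column filtering above is worth illustrating: negative indexing with an empty index vector drops every column, so a call such as df[, -nearZeroVar(df)] would return an empty data frame if no near-zero-variance column were found. The sketch below shows the behaviour on a toy data frame, together with a hypothetical helper, drop_columns(), that guards against it; neither the toy data nor the helper is part of the original analysis.

```r
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c("x", "y", "z"))

# Same idea as the NA filter above: keep only columns with no missing values.
complete_cols <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]
names(complete_cols)                 # "a" "c"

# Hypothetical helper: drop columns by index, tolerating an empty index.
drop_columns <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[, -idx, drop = FALSE]
}

ncol(toy[, -integer(0)])             # 0: an empty negative index drops everything
ncol(drop_columns(toy, integer(0)))  # 3: the helper leaves the data frame unchanged

# With caret loaded, the third cleaning step could then be written as:
# training_nzv <- drop_columns(training_na, nearZeroVar(training_na))
```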
The training data is split into a 70% train set and a 30% test set. The train set is used to build the prediction models, and the test set is used to estimate the out-of-sample error.
set.seed(1212)
CVdata <- createDataPartition(y=training_nzv$classe, p=0.7, list=FALSE)
trainset <- training_nzv[CVdata,]; dim(trainset)
## [1] 13737 53
testset <- training_nzv[-CVdata,]; dim(testset)
## [1] 5885 53
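The split above is stratified by classe, so both parts keep roughly the same class proportions as the full data. The base R sketch below illustrates the idea by sampling 70% of the rows within each class, using the class counts printed earlier. It is an approximation for illustration only: caret's createDataPartition() may differ in detail, and the rows selected here will not match CVdata.

```r
set.seed(1212)

# Class labels rebuilt from the counts printed by table(training$classe).
labels <- factor(rep(c("A", "B", "C", "D", "E"),
                     times = c(5580, 3797, 3422, 3216, 3607)))

# Sample 70% of the row indices within each class (rounded up).
by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(by_class,
                           function(i) sample(i, ceiling(0.7 * length(i)))))

length(train_idx)                    # 13737, the same size as trainset
length(labels) - length(train_idx)   # 5885, the same size as testset

# Class proportions are nearly identical in both parts.
round(prop.table(table(labels[train_idx])), 3)
round(prop.table(table(labels[-train_idx])), 3)
```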
Three approaches are used to build the prediction model: (1) classification tree, (2) random forest, and (3) gradient boosted model.
A confusion matrix is displayed at the end of each analysis to show the accuracy of the model on the test set. The final model is the one with the highest accuracy among the three.
First, a model is built using a classification tree, and the fancyRpartPlot() function is used to plot the resulting tree.
set.seed(1235)
modTree <- train(classe ~., method="rpart", data=trainset)
print(modTree$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
##  1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
##    2) roll_belt< 130.5 12559 8662 A (0.31 0.21 0.19 0.18 0.11)
##      4) pitch_forearm< -33.95 1122 7 A (0.99 0.0062 0 0 0) *
##      5) pitch_forearm>=-33.95 11437 8655 A (0.24 0.23 0.21 0.2 0.12)
##       10) magnet_dumbbell_y< 439.5 9694 6967 A (0.28 0.18 0.24 0.19 0.11)
##         20) roll_forearm< 123.5 6037 3599 A (0.4 0.19 0.18 0.17 0.058) *
##         21) roll_forearm>=123.5 3657 2434 C (0.079 0.18 0.33 0.22 0.18) *
##       11) magnet_dumbbell_y>=439.5 1743 864 B (0.032 0.5 0.045 0.23 0.19) *
##    3) roll_belt>=130.5 1178 9 E (0.0076 0 0 0 0.99) *
fancyRpartPlot(modTree$finalModel)
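The printed tree has only five terminal nodes, so it can be read as a handful of nested rules. The sketch below transcribes those splits into a plain R function (tree_rule() is a hypothetical name used only for illustration) and makes one property of the tree easy to see: no terminal node predicts class D.

```r
# Hand transcription of the splits printed by print(modTree$finalModel).
tree_rule <- function(roll_belt, pitch_forearm, magnet_dumbbell_y, roll_forearm) {
  if (roll_belt >= 130.5) return("E")           # node 3
  if (pitch_forearm < -33.95) return("A")       # node 4
  if (magnet_dumbbell_y >= 439.5) return("B")   # node 11
  if (roll_forearm < 123.5) "A" else "C"        # nodes 20 and 21
}

# Made-up sensor readings, chosen only to reach different leaves.
tree_rule(roll_belt = 140, pitch_forearm = 0, magnet_dumbbell_y = 0, roll_forearm = 0)    # "E"
tree_rule(roll_belt = 100, pitch_forearm = 0, magnet_dumbbell_y = 0, roll_forearm = 150)  # "C"
```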
The model “modTree” is applied to the held-out “testset” dataset to estimate its accuracy.
predtree <- predict(modTree, testset)
confusionMatrix(predtree, testset$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1527  457  492  414  172
##          B   26  407   30  173  150
##          C  116  275  504  377  298
##          D    0    0    0    0    0
##          E    5    0    0    0  462
##
## Overall Statistics
##
## Accuracy : 0.4928
## 95% CI : (0.4799, 0.5056)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.337
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9122  0.35733  0.49123   0.0000  0.42699
## Specificity            0.6355  0.92014  0.78061   1.0000  0.99896
## Pos Pred Value         0.4987  0.51781  0.32102      NaN  0.98929
## Neg Pred Value         0.9479  0.85644  0.87903   0.8362  0.88557
## Prevalence             0.2845  0.19354  0.17434   0.1638  0.18386
## Detection Rate         0.2595  0.06916  0.08564   0.0000  0.07850
## Detection Prevalence   0.5203  0.13356  0.26678   0.0000  0.07935
## Balanced Accuracy      0.7738  0.63874  0.63592   0.5000  0.71297
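The headline numbers in this output can be recomputed by hand from the confusion matrix, which helps when interpreting them. The sketch below re-enters the counts reported above (rows are predictions, columns are reference classes) and derives the overall accuracy and two of the per-class statistics.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the classification tree confusion matrix above.
cm <- matrix(c(1527, 457, 492, 414, 172,
                 26, 407,  30, 173, 150,
                116, 275, 504, 377, 298,
                  0,   0,   0,   0,   0,
                  5,   0,   0,   0, 462),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)        # correct predictions / all cases
sensitivity <- diag(cm) / colSums(cm)      # share of each true class recovered
pos_pred_value <- diag(cm) / rowSums(cm)   # share of each predicted class that is right

round(accuracy, 4)
round(sensitivity, 4)
pos_pred_value["D"]   # NaN: class D is never predicted, so this is 0 / 0
```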
Second, a model is built using random forest, with 4 bootstrap resamples for parameter tuning (method="boot", number=4 in trainControl()).
trainCT <- trainControl(method="boot", number=4)
modrf <- train(classe ~., method="rf", data=trainset, prox=TRUE, trControl=trainCT)
print(modrf$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3901    4    0    0    1 0.001280082
## B   18 2633    7    0    0 0.009405568
## C    0    9 2380    7    0 0.006677796
## D    0    0   27 2224    1 0.012433393
## E    0    1    4    6 2514 0.004356436
The model “modrf” is applied to the held-out “testset” dataset to estimate its accuracy.
predrf <- predict(modrf, testset)
confusionMatrix(predrf, testset$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    4    0    0    0
##          B    1 1133    7    0    0
##          C    0    2 1012   15    2
##          D    0    0    7  945    2
##          E    0    0    0    4 1078
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9947   0.9864   0.9803   0.9963
## Specificity            0.9991   0.9983   0.9961   0.9982   0.9992
## Pos Pred Value         0.9976   0.9930   0.9816   0.9906   0.9963
## Neg Pred Value         0.9998   0.9987   0.9971   0.9961   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1925   0.1720   0.1606   0.1832
## Detection Prevalence   0.2850   0.1939   0.1752   0.1621   0.1839
## Balanced Accuracy      0.9992   0.9965   0.9912   0.9892   0.9977
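The accuracy and its 95% interval translate directly into an estimate of the out-of-sample error. The sketch below recomputes both from the counts above, assuming the reported interval is an exact binomial (Clopper-Pearson) interval as returned by binom.test(); that assumption is consistent with the printed values but is not stated in the output itself.

```r
# Correct predictions are the diagonal of the random forest confusion matrix.
correct <- 1673 + 1133 + 1012 + 945 + 1078
n <- 5885

accuracy <- correct / n
ci <- binom.test(correct, n)$conf.int   # exact binomial 95% interval

round(accuracy, 4)       # accuracy on the held-out test set
round(ci, 4)             # interval for the accuracy
round(1 - rev(ci), 4)    # the same interval expressed as an error rate
```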
Third, a model is built using a gradient boosted model (GBM), with the same resampling settings.
modgbm <- train(classe~., method="gbm", data=trainset, verbose=FALSE, trControl=trainCT)
print(modgbm$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 51 had non-zero influence.
The model “modgbm” is applied to the held-out “testset” dataset to estimate its accuracy.
predgbm <- predict(modgbm, testset)
confusionMatrix(predgbm, testset$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1651   34    0    2    2
##          B   14 1069   30    6   15
##          C    5   34  981   29    8
##          D    2    0   14  921   22
##          E    2    2    1    6 1035
##
## Overall Statistics
##
## Accuracy : 0.9613
## 95% CI : (0.956, 0.966)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.951
##
## Mcnemar's Test P-Value : 3.522e-07
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9863   0.9385   0.9561   0.9554   0.9566
## Specificity            0.9910   0.9863   0.9844   0.9923   0.9977
## Pos Pred Value         0.9775   0.9427   0.9281   0.9604   0.9895
## Neg Pred Value         0.9945   0.9853   0.9907   0.9913   0.9903
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2805   0.1816   0.1667   0.1565   0.1759
## Detection Prevalence   0.2870   0.1927   0.1796   0.1630   0.1777
## Balanced Accuracy      0.9886   0.9624   0.9702   0.9738   0.9771
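The Kappa value reported above adjusts accuracy for the agreement expected by chance, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the agreement expected from the row and column totals alone. A short sketch recomputing it from the counts above:

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the gradient boosted model confusion matrix above.
cm <- matrix(c(1651,   34,   0,   2,    2,
                 14, 1069,  30,   6,   15,
                  5,   34, 981,  29,    8,
                  2,    0,  14, 921,   22,
                  2,    2,   1,   6, 1035),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
p_observed <- sum(diag(cm)) / n                      # overall accuracy
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
kappa <- (p_observed - p_expected) / (1 - p_expected)

round(p_observed, 4)
round(kappa, 3)
```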
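Comparing accuracy on the held-out test set, the random forest model performs best (accuracy 0.9925, i.e. an estimated out-of-sample error of about 0.75%), ahead of the gradient boosted model (0.9613) and the classification tree (0.4928). The random forest is therefore chosen to predict the classes of the 20 test cases.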
confusionMatrix(predtree, testset$classe)$overall['Accuracy']
## Accuracy
## 0.4927782
confusionMatrix(predrf, testset$classe)$overall['Accuracy']
## Accuracy
## 0.9925234
confusionMatrix(predgbm, testset$classe)$overall['Accuracy']
## Accuracy
## 0.9612574
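For a side-by-side view, the three accuracies can be collected in one named vector. The sketch below re-enters the values printed above (the vector names are illustrative) and derives the estimated out-of-sample error of each model as 1 minus its accuracy.

```r
# Accuracies copied from the three confusionMatrix() calls above.
accuracy <- c(tree = 0.4927782, rf = 0.9925234, gbm = 0.9612574)

best <- names(which.max(accuracy))           # model with the highest accuracy
error_pct <- round(100 * (1 - accuracy), 2)  # out-of-sample error in percent

best
error_pct
```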
The predicted classes for the 20 test cases are shown below.
predict(modrf, test)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E