Synopsis

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements of themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

Data Processing

The training data for this project are available here, and the test data are available here; both files are read with the read.csv() function. The data for this project come from here.
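
If the files are not already present locally, they can be downloaded first. The sketch below is an addition to the original analysis, and the URLs are assumed to be the standard course links; adjust them if the data have moved.

# Hedged setup sketch: fetch the data files if they are missing.
# The URLs below are assumptions (the usual course links), not from the
# original report.
if (!dir.exists("./data")) dir.create("./data")
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("./data/pml-training.csv")) download.file(train_url, "./data/pml-training.csv")
if (!file.exists("./data/pml-testing.csv"))  download.file(test_url,  "./data/pml-testing.csv")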

training_raw <- read.csv("./data/pml-training.csv", header = T); dim(training_raw)
## [1] 19622   160
testing_raw <- read.csv("./data/pml-testing.csv", header = T); dim(testing_raw)
## [1]  20 160
head(colnames(training_raw), 10)
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"

The first 7 columns are ID and timestamp variables, so they are not needed for the analysis.

training_raw <- training_raw[, -c(1:7)]; dim(training_raw)
## [1] 19622   153
testing_raw <- testing_raw[, -c(1:7)]; dim(testing_raw)
## [1]  20 153

Next, the libraries needed for the analysis are loaded.

library(caret)
library(kernlab)
library(rpart)
library(ggplot2)
library(randomForest)
library(rattle)
library(Metrics)

Next, near-zero-variance variables are removed from the training data set and from the testing data set.

NZV_train <- nearZeroVar(training_raw)
training_raw <- training_raw[, -NZV_train]; dim(training_raw)
## [1] 19622    94
NZV_test <- nearZeroVar(testing_raw)
testing_raw <- testing_raw[, -NZV_test]; dim(testing_raw)
## [1] 20 53

Finally, columns that contain NA values are removed.

training_raw <- training_raw[, colSums(is.na(training_raw)) == 0]; dim(training_raw)
## [1] 19622    53
testing_raw <- testing_raw[, colSums(is.na(testing_raw)) == 0]; dim(testing_raw)
## [1] 20 53
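
Because nearZeroVar() was applied to each set separately, it is worth confirming that the two sets now share the same predictor columns. The check below is an addition (its output was not part of the original report); the only expected difference is that the test set carries problem_id in place of classe.

# Columns present in one set but not the other; ideally only the outcome
# (classe) and the test-set identifier (problem_id) should differ.
setdiff(colnames(training_raw), colnames(testing_raw))
setdiff(colnames(testing_raw), colnames(training_raw))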

Data Analysis

For cross-validation, a sub-training data set and a validation data set are created by splitting the training data in a 70:30 ratio.

set.seed(123321) 
inTrain <- createDataPartition(training_raw$classe, p = 0.7, list = F)
training_Data <- training_raw[inTrain, ]; dim(training_Data)
## [1] 13737    53
validation_Data <- training_raw[-inTrain, ]; dim(validation_Data)
## [1] 5885   53
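
Since createDataPartition() samples within each level of classe, the class proportions should be nearly identical in the two partitions. This can be verified with the added check below (its output was not part of the original report).

# Class proportions in each partition; stratified splitting should keep
# these very close to each other.
round(prop.table(table(training_Data$classe)), 3)
round(prop.table(table(validation_Data$classe)), 3)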

Creating and Testing the Models

Decision Tree, Random Forest, Gradient Boosted Machine (GBM), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA) models are created and tested. First, a train control is set up so that each model is tuned on the sub-training data set with 3-fold cross-validation repeated 5 times.

control <- trainControl(method = "repeatedcv", number = 3, repeats = 5, verboseIter = F)

Decision Tree Model

dec_tree_model <- train(classe ~., data = training_Data, method = "rpart", trControl = control, tuneLength = 5)
fancyRpartPlot(dec_tree_model$finalModel, sub = "Decision Tree Model")

pred_dec_tree <- predict(dec_tree_model, validation_Data)
confusionMatrix(pred_dec_tree, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1517  456  501  451  139
##          B   30  354   35    7  119
##          C   92  246  388  119  263
##          D   29   83  102  387   72
##          E    6    0    0    0  489
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5327          
##                  95% CI : (0.5199, 0.5455)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3907          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9062  0.31080  0.37817  0.40145  0.45194
## Specificity            0.6326  0.95976  0.85182  0.94188  0.99875
## Pos Pred Value         0.4951  0.64954  0.35018  0.57504  0.98788
## Neg Pred Value         0.9443  0.85300  0.86644  0.88929  0.88998
## Prevalence             0.2845  0.19354  0.17434  0.16381  0.18386
## Detection Rate         0.2578  0.06015  0.06593  0.06576  0.08309
## Detection Prevalence   0.5206  0.09261  0.18828  0.11436  0.08411
## Balanced Accuracy      0.7694  0.63528  0.61499  0.67167  0.72535

The estimated out-of-sample error for this model is:

1-as.numeric(confusionMatrix(pred_dec_tree, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.4672897

Random Forest Model

rf_model <- train(classe ~., data = training_Data, method = "rf", trControl = control, tuneLength = 5)
pred_rf <- predict(rf_model, validation_Data)
confusionMatrix(pred_rf, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    5    0    0    0
##          B    3 1127    3    0    0
##          C    0    7 1020    8    3
##          D    0    0    3  955    5
##          E    1    0    0    1 1074
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9934         
##                  95% CI : (0.991, 0.9953)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9916         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9895   0.9942   0.9907   0.9926
## Specificity            0.9988   0.9987   0.9963   0.9984   0.9996
## Pos Pred Value         0.9970   0.9947   0.9827   0.9917   0.9981
## Neg Pred Value         0.9990   0.9975   0.9988   0.9982   0.9983
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1915   0.1733   0.1623   0.1825
## Detection Prevalence   0.2846   0.1925   0.1764   0.1636   0.1828
## Balanced Accuracy      0.9982   0.9941   0.9952   0.9945   0.9961

The estimated out-of-sample error for this model is:

1-as.numeric(confusionMatrix(pred_rf, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.006627018

Gradient Boosted Machine (GBM) Model

gbm_model <- train(classe ~., data = training_Data, method = "gbm", trControl = control, tuneLength = 5)
pred_gbm <- predict(gbm_model, validation_Data)
confusionMatrix(pred_gbm, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1662   10    0    0    0
##          B    8 1119    4    0    5
##          C    4   10 1018    7    3
##          D    0    0    4  957    8
##          E    0    0    0    0 1066
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9893          
##                  95% CI : (0.9863, 0.9918)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9865          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9928   0.9824   0.9922   0.9927   0.9852
## Specificity            0.9976   0.9964   0.9951   0.9976   1.0000
## Pos Pred Value         0.9940   0.9850   0.9770   0.9876   1.0000
## Neg Pred Value         0.9972   0.9958   0.9983   0.9986   0.9967
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2824   0.1901   0.1730   0.1626   0.1811
## Detection Prevalence   0.2841   0.1930   0.1771   0.1647   0.1811
## Balanced Accuracy      0.9952   0.9894   0.9936   0.9952   0.9926

The estimated out-of-sample error for this model is:

1-as.numeric(confusionMatrix(pred_gbm, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.01070518

Support Vector Machine (SVM) Model

svm_model <- train(classe ~., data = training_Data, method = "svmLinear", trControl = control, tuneLength = 5)
pred_svm <- predict(svm_model, validation_Data)
confusionMatrix(pred_svm, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1519  153   84   58   59
##          B   37  824   89   41  134
##          C   53   54  818  119   76
##          D   53   24   23  710   67
##          E   12   84   12   36  746
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7845         
##                  95% CI : (0.7738, 0.795)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7262         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9074   0.7234   0.7973   0.7365   0.6895
## Specificity            0.9159   0.9366   0.9378   0.9661   0.9700
## Pos Pred Value         0.8110   0.7324   0.7304   0.8096   0.8382
## Neg Pred Value         0.9614   0.9338   0.9563   0.9493   0.9327
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2581   0.1400   0.1390   0.1206   0.1268
## Detection Prevalence   0.3183   0.1912   0.1903   0.1490   0.1512
## Balanced Accuracy      0.9117   0.8300   0.8676   0.8513   0.8297

The estimated out-of-sample error for this model is:

1-as.numeric(confusionMatrix(pred_svm, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.215463

Linear Discriminant Analysis (LDA) Model

lda_model <- train(classe ~., data = training_Data, method = "lda", trControl = control, tuneLength = 5)
pred_lda <- predict(lda_model, validation_Data)
confusionMatrix(pred_lda, reference = factor(validation_Data$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1341  168  103   61   47
##          B   46  734  107   38  177
##          C  132  129  690  116  108
##          D  146   49  114  711   94
##          E    9   59   12   38  656
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7021          
##                  95% CI : (0.6903, 0.7138)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8011   0.6444   0.6725   0.7376   0.6063
## Specificity            0.9100   0.9225   0.9002   0.9181   0.9754
## Pos Pred Value         0.7797   0.6661   0.5872   0.6382   0.8475
## Neg Pred Value         0.9200   0.9153   0.9287   0.9470   0.9167
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2279   0.1247   0.1172   0.1208   0.1115
## Detection Prevalence   0.2923   0.1873   0.1997   0.1893   0.1315
## Balanced Accuracy      0.8555   0.7834   0.7863   0.8278   0.7909

The estimated out-of-sample error for this model is:

1-as.numeric(confusionMatrix(pred_lda, reference = factor(validation_Data$classe))$overall["Accuracy"])
## [1] 0.297876

Results of the Analysis

The Random Forest model showed the highest accuracy and the lowest estimated out-of-sample error of the five models (see the summary sketch below), so it is used to predict the testing data set.
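
The validation accuracies can be collected into one small table. This is a minimal summary sketch added here (its output was not part of the original report), and it assumes the prediction vectors created above are still in scope.

# Collect each model's validation accuracy and estimated out-of-sample
# error in a single comparison table.
preds <- list(Decision_Tree = pred_dec_tree, Random_Forest = pred_rf,
              GBM = pred_gbm, SVM = pred_svm, LDA = pred_lda)
acc <- sapply(preds, function(p)
  confusionMatrix(p, reference = factor(validation_Data$classe))$overall["Accuracy"])
data.frame(Model = names(acc), Accuracy = round(unname(acc), 4),
           OOS_Error = round(1 - unname(acc), 4))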

pred_test <- predict(rf_model, testing_raw)
pred_test
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
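
If the assignment expects one text file per prediction, a small helper along these lines can write them out. This is a hedged sketch: the helper name and file naming scheme are assumptions, not part of the original analysis.

# Hypothetical helper: writes each prediction to its own
# problem_id_<i>.txt file (one file per test case).
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(pred_test)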