Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The dataset provided is described as follows:

Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).

Read more: Weight Lifting Exercises Dataset

A YouTube video of the experiment is also available at https://www.youtube.com/watch?v=meNf1b1yY0Y

The goal of the project is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set.

Exploratory Analysis

data is the training set downloaded from the course project site.

set.seed(0)
library(caret)    # createDataPartition, train, confusionMatrix
library(ggplot2)  # exploratory plots
data <- read.csv("pml-training.csv", na.strings = c("NA", ""))

Initial exploratory analysis of roll_belt colored by classe gives us

ggplot(data, aes(x = X, y = roll_belt)) + geom_point(aes(color = classe))

We notice two levels of roll_belt data for the same classe. To further understand this difference, we color the points by user instead.

ggplot(data, aes(x = X, y = roll_belt)) + geom_point(aes(color = user_name))

We now clearly distinguish the difference in the roll_belt data based on users. Each user performs the activity differently, and this can be seen in the sensor data. Hence data from many sensors is required to predict the activity correctly; a single sensor is not enough.
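As a quick numeric check of this pattern (a small sketch added here, not part of the original analysis), we can summarize roll_belt for each user:

# Per-user summary of roll_belt; the ranges differ markedly across users,
# matching the per-user clusters visible in the scatter plot
tapply(data$roll_belt, data$user_name, summary)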

Data Cleaning and Training Set Allocation

The project provides two data sets: a training set and a testing set.

The machine learning algorithm that is built will use the training set and then predict on the testing data. We further split the provided training set into a new training set and a validation set. The algorithm is built on the new training set and validated on the validation set.

We split data into training and testing (which serves as our validation set), stratified by the classe variable, in a 60-40% split.

inTrain = createDataPartition(y=data$classe, p=0.6, list=FALSE)

training = data[inTrain,]
testing = data[-inTrain,]
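As a quick sanity check (a small addition, not in the original write-up), we can confirm the split proportions:

# Fraction of rows in each partition; should be roughly 0.6 and 0.4
round(c(training = nrow(training), testing = nrow(testing)) / nrow(data), 2)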

If we look at the structure of the training data, we notice a lot of NAs in many of the variables. The training data set has 160 variables and 11776 rows. To speed up our machine learning algorithm, we clean the data set by removing the variables that are mostly NA.

str(training)
## 'data.frame':    11776 obs. of  160 variables:
##  $ X                       : int  2 3 4 5 7 8 9 10 11 14 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  808298 820366 120339 196328 368296 440390 484323 484434 500302 576390 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 12 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.42 1.48 1.48 1.42 1.42 1.43 1.45 1.45 1.42 ...
##  $ pitch_belt              : num  8.07 8.07 8.05 8.07 8.09 8.13 8.16 8.17 8.18 8.21 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 396 levels "-0.016850","-0.021024",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_belt     : Factor w/ 316 levels "-0.021887","-0.060755",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt      : Factor w/ 394 levels "-0.003095","-0.010002",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_belt.1    : Factor w/ 337 levels "-0.005928","-0.005960",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_belt       : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 67 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 3 levels "#DIV/0!","0.00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0.02 0 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 ...
##  $ gyros_belt_y            : num  0 0 0 0.02 0 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 0 -0.02 -0.02 ...
##  $ accel_belt_x            : int  -22 -20 -22 -21 -22 -22 -20 -21 -21 -22 ...
##  $ accel_belt_y            : int  4 5 3 2 3 4 2 4 2 4 ...
##  $ accel_belt_z            : int  22 23 21 24 21 21 24 22 23 21 ...
##  $ magnet_belt_x           : int  -7 -2 -6 -6 -4 -2 1 -3 -5 -8 ...
##  $ magnet_belt_y           : int  608 600 604 600 599 603 602 609 596 598 ...
##  $ magnet_belt_z           : int  -311 -305 -310 -302 -311 -313 -312 -308 -317 -310 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.1 22.1 21.9 21.8 21.7 21.6 21.5 21.4 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0.02 0.02 0.02 0 0 0.02 0.02 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  -0.02 -0.02 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 0 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 0.02 0 0 0 -0.02 -0.02 0 -0.03 ...
##  $ accel_arm_x             : int  -290 -289 -289 -289 -289 -289 -288 -288 -290 -288 ...
##  $ accel_arm_y             : int  110 110 111 111 111 111 109 110 110 111 ...
##  $ accel_arm_z             : int  -125 -126 -123 -123 -125 -124 -122 -124 -123 -124 ...
##  $ magnet_arm_x            : int  -369 -368 -372 -374 -373 -372 -369 -376 -366 -371 ...
##  $ magnet_arm_y            : int  337 344 344 337 336 338 341 334 339 331 ...
##  $ magnet_arm_z            : int  513 513 512 506 509 510 518 516 509 523 ...
##  $ kurtosis_roll_arm       : Factor w/ 329 levels "-0.02438","-0.04190",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_arm      : Factor w/ 327 levels "-0.00484","-0.01311",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_arm        : Factor w/ 394 levels "-0.01548","-0.01749",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_arm       : Factor w/ 330 levels "-0.00051","-0.00696",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_arm      : Factor w/ 327 levels "-0.00184","-0.01185",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_arm        : Factor w/ 394 levels "-0.00311","-0.00562",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 12.9 13.4 13.4 13.1 ...
##  $ pitch_dumbbell          : num  -70.6 -70.3 -70.4 -70.4 -70.2 ...
##  $ yaw_dumbbell            : num  -84.7 -85.1 -84.9 -84.9 -85.1 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 397 levels "-0.0035","-0.0073",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_picth_dumbbell : Factor w/ 400 levels "-0.0163","-0.0233",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_roll_dumbbell  : Factor w/ 400 levels "-0.0082","-0.0096",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_pitch_dumbbell : Factor w/ 401 levels "-0.0053","-0.0084",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ skewness_yaw_dumbbell   : Factor w/ 1 level "#DIV/0!": NA NA NA NA NA NA NA NA NA NA ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 72 levels "-0.1","-0.2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

We first count the number of NAs in each column of the training dataset.

na_train = sapply(training, function(x) {sum(is.na(x))})
table(na_train)
## na_train
##     0 11518 
##    60   100

If \(Sum > 0\), the variable contains at least one NA. From the table output, we clearly distinguish two groups: 60 variables with no NAs at all, and 100 variables in which 11518 of the 11776 rows are NA.

We then keep only those columns with \(Sum = 0\). We also exclude the first 7 columns (row index, user name, timestamps and window indicators), as they are not sensor measurements and should not be used as predictors.

good_columns = names(na_train[na_train == 0])

# Keep only the columns without NAs, then drop the first 7 columns
# (X, user_name, timestamps, new_window, num_window), which are not sensor data
training = training[, names(training) %in% good_columns]
training = training[,-c(1:7)]
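A short verification sketch (an addition, assuming the cleaning above ran as intended): the reduced training set should now have 53 columns (52 predictors plus classe) and no remaining NAs.

# Dimensions after cleaning, and total count of missing values (expected to be 0)
dim(training)
sum(is.na(training))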

Model Selection

There are multiple machine learning models available that can predict this data set accurately. We will choose our model based on accuracy and computation time.

We will be testing the following two models on the current data: a single classification tree (rpart) and a random forest (rf).

Since the computation time is high due to the large number of variables and observations, we will use parallel computing to expedite the training.

library(doParallel)
# Use all available cores; remember to call stopCluster(cl) after training completes
cl <- makeCluster(detectCores())
registerDoParallel(cl)

We will also be using repeated k-fold cross-validation (10 folds, 10 repeats) on the training data set to get an unbiased estimate of the out-of-sample (test set) error.

This is passed to the train function in the caret package through the trControl parameter.

fitControl <- trainControl(## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10)

Predicting with trees using rpart

We have used 10-fold, 10-repeat cross-validation for our current training set. The rpart method is used to create a single prediction tree.

model_rpart = train(classe ~ ., method = "rpart", data = training, trControl = fitControl)

Graphical representation of the tree
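A minimal sketch of how such a plot can be produced from the fitted caret object (this assumes the rattle package is installed; it was not shown in the original code):

# model_rpart$finalModel is the underlying rpart object
library(rattle)
fancyRpartPlot(model_rpart$finalModel)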

Output of model_rpart is

CART 

11776 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 

Summary of sample sizes: 10599, 10599, 10597, 10599, 10598, 10599, ... 

Resampling results across tuning parameters:

  cp          Accuracy   Kappa       Accuracy SD  Kappa SD  
  0.03334124  0.5053238  0.35403917  0.01531608   0.02051850
  0.06126404  0.4038033  0.18890208  0.05601898   0.09479561
  0.11556716  0.3184252  0.05229872  0.03874004   0.05941591

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.03334124. 

The cross-validated accuracy of the rpart model on the training set is ~50%.

We then test our rpart model on the validation data. We still get only 49% accuracy with a Kappa of about 0.34. The current model is definitely poor; it is not much better than naive guessing.

confusionMatrix(testing$classe,predict(model_rpart,testing))
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1982   42  174    0   34
         B  640  518  360    0    0
         C  651   40  677    0    0
         D  553  239  494    0    0
         E  206  189  362    0  685

Overall Statistics
                                          
               Accuracy : 0.4922          
                 95% CI : (0.4811, 0.5034)
    No Information Rate : 0.5139          
    P-Value [Acc > NIR] : 0.9999          
                                          
                  Kappa : 0.3368          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.4916  0.50389  0.32753       NA  0.95271
Specificity            0.9345  0.85333  0.88043   0.8361  0.89378
Pos Pred Value         0.8880  0.34124  0.49488       NA  0.47503
Neg Pred Value         0.6348  0.91941  0.78543       NA  0.99469
Prevalence             0.5139  0.13102  0.26345   0.0000  0.09164
Detection Rate         0.2526  0.06602  0.08629   0.0000  0.08731
Detection Prevalence   0.2845  0.19347  0.17436   0.1639  0.18379
Balanced Accuracy      0.7130  0.67861  0.60398       NA  0.92325

Predicting with trees using randomForest

We again use 10-fold, 10-repeat cross-validation on our current training set. Cross-validation is not strictly required for the random forest method, since the generalization error is estimated internally from bootstrap samples (the out-of-bag error). We still do it because we are comparing with the rpart model, for which cross-validation was done.

The rf method builds many trees and predicts the outcome by majority vote over the individual trees. It is usually time-consuming, and its parameters need to be tuned carefully.

If we use the caret package for training, it tests multiple values of mtry (the number of variables randomly sampled as candidates at each split) and chooses the best model automatically; ntree (the number of trees to grow) is not tuned and stays at the randomForest default of 500 unless set explicitly.

model_rf = train(classe ~ ., method = "rf", data = training, prox = TRUE, do.trace = 100, trControl = fitControl)
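Once training finishes, the chosen tuning values can be inspected directly from the fitted object (a hedged sketch; the output is not reproduced here):

model_rf$bestTune           # mtry selected by resampled accuracy
model_rf$finalModel$ntree   # number of trees grown (the randomForest default of 500)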

For our current model, the do.trace output below shows that the OOB (out-of-bag) error is lowest at \(ntree = 200\), where it is 0.83%.

ntree      OOB      1      2      3      4      5
  100:   1.02%  0.15%  2.02%  1.07%  2.02%  0.37%
  200:   0.83%  0.15%  1.45%  0.88%  1.87%  0.28%
  300:   0.92%  0.12%  1.67%  0.88%  2.07%  0.37%
  400:   0.85%  0.06%  1.54%  0.88%  1.92%  0.37%
  500:   0.84%  0.06%  1.40%  0.88%  2.02%  0.37%

For our current model, the cross-validated accuracy is highest for \(mtry = 2\), at about 99%.

  Random Forest 

11776 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 

Summary of sample sizes: 10598, 10598, 10599, 10598, 10599, 10598, ... 

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
   2    0.9903195  0.9877525  0.002502226  0.003167007
  27    0.9894281  0.9866255  0.002615775  0.003309730
  52    0.9801891  0.9749381  0.003626607  0.004588097

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 
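The resampled accuracy across the tested mtry values can also be visualized directly from the caret object (a small sketch, not part of the original report):

# Accuracy versus the number of randomly selected predictors (mtry)
plot(model_rf)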

We then test our randomForest model on the validation data and get an excellent accuracy of 99%.

confusionMatrix(testing$classe,predict(model_rf,testing))
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 2224    5    3    0    0
         B    8 1499   11    0    0
         C    0    6 1361    1    0
         D    0    0   31 1253    2
         E    0    0    7    4 1431

Overall Statistics
                                          
               Accuracy : 0.9901          
                 95% CI : (0.9876, 0.9921)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9874          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9964   0.9927   0.9632   0.9960   0.9986
Specificity            0.9986   0.9970   0.9989   0.9950   0.9983
Pos Pred Value         0.9964   0.9875   0.9949   0.9743   0.9924
Neg Pred Value         0.9986   0.9983   0.9920   0.9992   0.9997
Prevalence             0.2845   0.1925   0.1801   0.1603   0.1826
Detection Rate         0.2835   0.1911   0.1735   0.1597   0.1824
Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
Balanced Accuracy      0.9975   0.9949   0.9811   0.9955   0.9984

To understand the importance of the variables used by the randomForest algorithm, we plot the variable importance (varImp). We notice that out of 52 variables, only about 20 have an importance above 30%.
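A minimal sketch of how this plot can be produced with caret (using the model_rf object fitted above):

# Plot the 20 most important predictors as ranked by the random forest
plot(varImp(model_rf), top = 20)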

Variable Importance

Conclusion

The randomForest method (accuracy = 99%) predicts much better than the rpart method (accuracy = 49%) on the validation set, but its computation time is definitely higher. The following modifications to the train call in the caret package could decrease the computation time (a sketch follows the list below):

  • \(ntree = 200\)
  • \(mtry = 2\)
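A sketch of how these settings could be passed to train (an illustration of the suggestion above; model_rf_fast is a hypothetical name, and the parameter values are taken from the results in this report):

# Fix mtry via tuneGrid and lower ntree; extra arguments such as ntree are
# passed through to randomForest by caret's train()
model_rf_fast <- train(classe ~ ., method = "rf", data = training,
                       ntree = 200, tuneGrid = data.frame(mtry = 2),
                       trControl = fitControl)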