Practical Machine Learning Project

Overview

The goal of this project is to predict the manner in which 6 participants performed an exercise. That manner is recorded in the “classe” variable of the training set. The machine learning algorithm selected here is then applied to the 20 test cases in the test data.

Data Processing

Loading Libraries

library(caret)
library(rattle)
library(randomForest)
library(rpart)
library(e1071)
library(gbm)
library(corrplot)

Loading Data

trainData <- read.csv("./pml-training.csv",header=TRUE)
validData <- read.csv("./pml-testing.csv",header=TRUE)
dim(trainData)
## [1] 19622   160
dim(validData)
## [1]  20 160
str(trainData)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : chr  "" "" "" "" ...
##  $ kurtosis_picth_belt     : chr  "" "" "" "" ...
##  $ kurtosis_yaw_belt       : chr  "" "" "" "" ...
##  $ skewness_roll_belt      : chr  "" "" "" "" ...
##  $ skewness_roll_belt.1    : chr  "" "" "" "" ...
##  $ skewness_yaw_belt       : chr  "" "" "" "" ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : chr  "" "" "" "" ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : chr  "" "" "" "" ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : chr  "" "" "" "" ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : chr  "" "" "" "" ...
##  $ kurtosis_picth_arm      : chr  "" "" "" "" ...
##  $ kurtosis_yaw_arm        : chr  "" "" "" "" ...
##  $ skewness_roll_arm       : chr  "" "" "" "" ...
##  $ skewness_pitch_arm      : chr  "" "" "" "" ...
##  $ skewness_yaw_arm        : chr  "" "" "" "" ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : chr  "" "" "" "" ...
##  $ kurtosis_picth_dumbbell : chr  "" "" "" "" ...
##  $ kurtosis_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ skewness_roll_dumbbell  : chr  "" "" "" "" ...
##  $ skewness_pitch_dumbbell : chr  "" "" "" "" ...
##  $ skewness_yaw_dumbbell   : chr  "" "" "" "" ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : chr  "" "" "" "" ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

The training data has 19622 observations of 160 variables. The structure output shows that many columns consist mostly of NA or blank values. The first seven columns hold row indices, participant names, timestamps and window markers. None of these columns carries useful information for our model, so we will drop them.
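
As a rough illustration of how the share of NA or blank entries per column could be measured, here is a minimal base R sketch. The data frame `toy` and its values are made up for the example and only borrow column names from the output above; `missingShare` is a hypothetical helper name.

```r
# Toy data frame standing in for trainData (values are invented)
toy <- data.frame(
  roll_belt          = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt      = c(NA, NA, NA, 2.1),
  kurtosis_roll_belt = c("", "", "", "0.5"),
  stringsAsFactors   = FALSE
)

# Share of entries per column that are NA or empty strings
missingShare <- sapply(toy, function(col) {
  mean(is.na(col) | as.character(col) %in% "")
})
print(missingShare)
```

Columns with a share close to 1 are the ones the cleaning step is meant to discard.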

Cleaning Data

# removing columns containing missing (NA) values
trainData <- trainData[, colSums(is.na(trainData)) == 0]
dim(trainData)
## [1] 19622    93
validData <- validData[, colSums(is.na(validData)) == 0]
dim(validData)
## [1] 20 60
# removing identity columns
trainData <- trainData[, -c(1:7)]
validData <- validData[, -c(1:7)]
dim(trainData)
## [1] 19622    86
dim(validData)
## [1] 20 53

The training data now has 86 variables. Before modelling, we clean it further with nearZeroVar, which removes features with near-zero variance. Here these include the character columns that are almost entirely blank and therefore survived the NA filter.

NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
dim(trainData)
## [1] 19622    53
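
To give an idea of what a near-zero-variance check looks at, the sketch below computes two quantities in base R: the ratio between the counts of the two most frequent values, and the percentage of distinct values. It is a simplified stand-in, not caret's actual code; the function name `nzvSketch` is hypothetical, and the cutoffs 95/5 and 10 are assumed to mirror the defaults of nearZeroVar.

```r
# Simplified near-zero-variance check (illustration only, not caret's code)
nzvSketch <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x, useNA = "no"), decreasing = TRUE)
  # A column with at most one distinct value has zero variance
  if (length(counts) <= 1) return(TRUE)
  freqRatio     <- counts[[1]] / counts[[2]]          # most common vs. second most common
  percentUnique <- 100 * length(counts) / length(x)   # distinct values per 100 rows
  freqRatio > freqCut && percentUnique < uniqueCut
}

blankHeavy <- c(rep("", 98), "0.5", "1.2")  # almost entirely blank
sensor     <- seq(1, 100)                   # every value distinct

flagBlank  <- nzvSketch(blankHeavy)
flagSensor <- nzvSketch(sensor)
print(c(blankHeavy = flagBlank, sensor = flagSensor))
```

The mostly blank column is flagged while the sensor-like column is kept, which is the behaviour the cleaning step relies on.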

Splitting Data

After cleaning, we split trainData into a training set (70%) and a test set (30%). The 20 cases in validData are held back for the final prediction.

trainData$classe <- as.factor(trainData$classe)
inTrain  <- createDataPartition(trainData$classe, p=0.7, list=FALSE)
trainSet <- trainData[inTrain, ]
testSet  <- trainData[-inTrain, ]
dim(trainSet)
## [1] 13737    53
dim(testSet)
## [1] 5885   53
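
The idea behind a class-stratified 70/30 split can be sketched in base R as follows. This is an illustration of the principle, not the internals of createDataPartition. The labels are invented, and the seed value is an assumption added for reproducibility (the split above does not set one, so its exact rows will differ from run to run).

```r
set.seed(123)  # hypothetical seed, only to make the sketch repeatable

# Invented class labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20)))

# Sample 70% of the row indices within each class separately
idxByClass <- split(seq_along(labels), labels)
inTrainSketch <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

trainLabels <- labels[inTrainSketch]
testLabels  <- labels[-inTrainSketch]

# Class proportions are preserved in both parts
print(prop.table(table(trainLabels)))
print(prop.table(table(testLabels)))
```

Sampling within each class keeps the class mix of the two parts close to that of the full data, which matters when classes are unevenly represented.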

We now use trainSet to explore the variables and to build the models.

Correlation Analysis

corrMatrix <- cor(trainSet[, -53])
par(ps=16)
corrplot(corrMatrix, order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.5, tl.col = rgb(0, 0, 0))

The variables in this correlation matrix are ordered by first principal component (order = "FPC"). Darker cells mark pairs of variables that are more strongly correlated, positively or negatively.
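
Besides reading the plot, strongly correlated pairs can also be listed directly from the matrix. The base R sketch below does this on invented data; the variables `x1`, `x2`, `x3` and the cutoff of 0.9 are assumptions chosen for the example, not values used in this analysis.

```r
set.seed(1)  # hypothetical seed for the toy data

# Toy predictors: x2 is nearly a copy of x1, x3 is unrelated
x1  <- rnorm(200)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(200, sd = 0.1),
                  x3 = rnorm(200))

cm <- cor(toy)

# Keep each pair once (upper triangle) and apply the cutoff
highPairs <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
pairTable <- data.frame(var1 = rownames(cm)[highPairs[, "row"]],
                        var2 = colnames(cm)[highPairs[, "col"]],
                        correlation = cm[highPairs])
print(pairTable)
```

Applied to `corrMatrix`, the same two lines would name the sensor readings that largely duplicate each other.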

Prediction Model Training

We will fit three classification models to predict classe. They are:

  1. Classification Tree
  2. Random Forest
  3. Gradient Boosting Machine

Classification Tree

# fitting model
set.seed(1)
modelCT <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modelCT, cex=0.3)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

# testing model
predictCT <- predict(modelCT, newdata = testSet, type = "class")
cmCT <- confusionMatrix(predictCT, (testSet$classe))
cmCT
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1560  248   15   94   31
##          B   38  613  102   44   91
##          C   37  147  816  101  109
##          D   29   80   63  647   65
##          E   10   51   30   78  786
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7514          
##                  95% CI : (0.7402, 0.7624)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6839          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9319   0.5382   0.7953   0.6712   0.7264
## Specificity            0.9079   0.9421   0.9189   0.9518   0.9648
## Pos Pred Value         0.8008   0.6903   0.6744   0.7319   0.8230
## Neg Pred Value         0.9710   0.8947   0.9551   0.9366   0.9400
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2651   0.1042   0.1387   0.1099   0.1336
## Detection Prevalence   0.3310   0.1509   0.2056   0.1502   0.1623
## Balanced Accuracy      0.9199   0.7401   0.8571   0.8115   0.8456
cmCT$overall[1]
##  Accuracy 
## 0.7514019

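The classification tree reaches an accuracy of about 0.751 on the test set. That is well above the no-information rate of 0.2845, but roughly a quarter of the cases are still misclassified, and class B in particular has a sensitivity of only 0.538.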
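
For readers who want to see where these figures come from, the sketch below recomputes accuracy and per-class sensitivity from the confusion matrix printed above, using base R only. The counts are copied from that output; the variable names are chosen for the example.

```r
# Confusion matrix of the classification tree (rows: prediction, columns: reference)
cm <- matrix(c(1560, 248,  15,  94,  31,
                 38, 613, 102,  44,  91,
                 37, 147, 816, 101, 109,
                 29,  80,  63, 647,  65,
                 10,  51,  30,  78, 786),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases (diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over true members of that class
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The same two expressions apply unchanged to the confusion matrices of the other models.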

Random Forest

# fitting model
crossV <- trainControl(method="cv", number=5, verboseIter = FALSE)
modelRF <- train(classe ~ ., data=trainSet, method="rf", trControl=crossV)
# testing model
predictRF <- predict(modelRF, newdata = testSet)
cmRF <- confusionMatrix(predictRF, testSet$classe)
cmRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    2    0    0    0
##          B    0 1135    2    0    0
##          C    0    2 1022    3    0
##          D    0    0    2  959    1
##          E    0    0    0    2 1081
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9976         
##                  95% CI : (0.996, 0.9987)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.997          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9965   0.9961   0.9948   0.9991
## Specificity            0.9995   0.9996   0.9990   0.9994   0.9996
## Pos Pred Value         0.9988   0.9982   0.9951   0.9969   0.9982
## Neg Pred Value         1.0000   0.9992   0.9992   0.9990   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1929   0.1737   0.1630   0.1837
## Detection Prevalence   0.2848   0.1932   0.1745   0.1635   0.1840
## Balanced Accuracy      0.9998   0.9980   0.9975   0.9971   0.9993
cmRF$overall[1]
##  Accuracy 
## 0.9976211

Random Forest gives a much higher accuracy of about 0.998 on the test set, that is, an estimated out-of-sample error of about 0.2%.
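
The trainControl(method="cv", number=5) setting used above asks for 5-fold cross-validation during training. The base R sketch below shows only the fold bookkeeping behind that idea, not caret's implementation; the row count of 20 and the seed are assumptions for the example, and no model is actually fitted.

```r
set.seed(42)  # hypothetical seed for the toy example

n <- 20  # invented number of training rows

# Assign every row to one of 5 folds of equal size, in random order
foldId <- sample(rep(1:5, length.out = n))

heldOutSizes <- integer(5)
for (k in 1:5) {
  heldOut <- which(foldId == k)   # rows used to score the model in round k
  fitRows <- which(foldId != k)   # rows used to fit the model in round k
  # A model would be fitted on fitRows and evaluated on heldOut here
  heldOutSizes[k] <- length(heldOut)
}
print(heldOutSizes)
```

Each row is held out exactly once, so the averaged score across the five rounds gives an accuracy estimate that does not touch testSet.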

Gradient Boosting Machine

# fitting model
crossVgbm <- trainControl(method="repeatedcv", number=5, repeats=1)
modelGBM <- train(classe ~ ., data=trainSet, method="gbm", 
                  trControl=crossVgbm, verbose = FALSE)
# testing model
predictGBM <- predict(modelGBM, newdata = testSet)
cmGBM <- confusionMatrix(predictGBM, testSet$classe)
cmGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1653   30    0    1    2
##          B   12 1095   28    6    6
##          C    5   14  989   29   10
##          D    3    0    9  924   12
##          E    1    0    0    4 1052
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9708          
##                  95% CI : (0.9661, 0.9749)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.963           
##                                           
##  Mcnemar's Test P-Value : 2.848e-08       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9875   0.9614   0.9639   0.9585   0.9723
## Specificity            0.9922   0.9890   0.9881   0.9951   0.9990
## Pos Pred Value         0.9804   0.9547   0.9446   0.9747   0.9953
## Neg Pred Value         0.9950   0.9907   0.9924   0.9919   0.9938
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2809   0.1861   0.1681   0.1570   0.1788
## Detection Prevalence   0.2865   0.1949   0.1779   0.1611   0.1796
## Balanced Accuracy      0.9898   0.9752   0.9760   0.9768   0.9856
cmGBM$overall[1]
##  Accuracy 
## 0.9707732

The GBM model reaches an accuracy of about 0.971 on the test set, which is somewhat lower than that of Random Forest.
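
The three test-set accuracies reported so far can be placed side by side with a few lines of base R. The numbers are copied from the outputs above; the object names and the derived error column (1 minus accuracy) are additions made for this sketch.

```r
# Test-set accuracies as printed by cmCT, cmRF and cmGBM
accuracy <- c(ClassificationTree = 0.7514019,
              RandomForest       = 0.9976211,
              GBM                = 0.9707732)

comparison <- data.frame(model            = names(accuracy),
                         accuracy         = unname(accuracy),
                         outOfSampleError = unname(1 - accuracy))

# Sort from best to worst and pick the top model
comparison <- comparison[order(-comparison$accuracy), ]
bestModel  <- comparison$model[1]
print(comparison, row.names = FALSE)
```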

Conclusion

Of the three methods, Random Forest has the highest accuracy on the test set, so we use that model to predict the 20 test cases.

Output <- predict(modelRF, newdata=validData)
Output
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E