Practical Machine Learning: Prediction Assignment Writeup

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how the exercise was performed (the classe variable). The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset). The data for this project was taken from the same source.

Data

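The training data for this project is available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data is available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv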

Loading the required libraries and data

library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)

trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

readTrain <- read.csv(url(trainUrl))
readTest <- read.csv(url(testUrl))

dim(readTrain)

## [1] 19622   160

dim(readTest)

## [1]  20 160

We see that the training data set has 19622 records and the testing data set has 20 records. The number of variables is 160.

Cleaning the data

Removing variables with variance ≈ 0

var0 <- nearZeroVar(readTrain)

train <- readTrain[,-var0]
test <- readTest[,-var0]

dim(train)

## [1] 19622   100

This first step removes 60 near-zero-variance variables, leaving 100 of the original 160.
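To make the near-zero-variance idea concrete, here is a minimal base-R sketch of the two statistics such a filter typically relies on: the ratio of the most common value to the second most common, and the percentage of distinct values. The helper names (nzv_metrics, is_nzv) and the toy data frame are hypothetical. The cutoffs are chosen to mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10). This is an illustration of the idea, not caret's implementation.

```r
# Frequency ratio and percent of unique values for a single column.
nzv_metrics <- function(x) {
  counts <- sort(table(x, useNA = "no"), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Flag a column when it is constant, or when one value dominates
# AND there are few distinct values overall.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  m <- nzv_metrics(x)
  length(unique(x)) == 1 ||
    (m[["freq_ratio"]] > freq_cut && m[["pct_unique"]] < unique_cut)
}

# Toy data (hypothetical): one dominated flag column, one varied sensor column.
toy <- data.frame(
  mostly_no = c(rep("no", 98), rep("yes", 2)),  # 98:2 split, 2% unique values
  sensor    = (1:100) / 10                      # every value distinct
)

flags <- sapply(toy, is_nzv)
flags
```

In this toy case only mostly_no is flagged: it has a frequency ratio of 49 and 2% unique values, while sensor has a ratio of 1 and 100% unique values.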

Removing variables in which more than 95% of the values are NA

valNA <- sapply(train, function(x) mean(is.na(x))) > 0.95

train <- train[, valNA == FALSE]
test <- test[, valNA == FALSE]

dim(train)

## [1] 19622    59

The second step leaves 59 variables.
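The same filter can be tried on a small data frame to see what the logical mask looks like. The column names and values below are hypothetical; only the sapply / mean(is.na(x)) pattern and the 0.95 threshold come from the analysis above.

```r
# Toy data (hypothetical): one complete column, one that is entirely NA,
# and one with a single missing value.
toy <- data.frame(
  accel_x  = c(1.2, 0.8, 1.1, 0.9, 1.0),
  avg_roll = c(NA, NA, NA, NA, NA_real_),  # 100% missing
  gyro_y   = c(0.1, NA, 0.3, 0.2, 0.4)     # 20% missing
)

# Fraction of NA values per column, then keep columns at or below the threshold.
na_fraction <- sapply(toy, function(x) mean(is.na(x)))
keep <- na_fraction <= 0.95

toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```

Computing the mask once on the training data and applying it to both data sets, as done above, keeps the two sets aligned on the same columns.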

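Removing the leading columns, as they don't contribute to our model. These columns hold row identifiers, user names, timestamps, and window markers rather than sensor readings (several of them are in fact numeric), so columns 1 to 7 are dropped. Note that the column list printed below starts at pitch_belt and contains no roll_belt, although roll_arm, roll_dumbbell, and roll_forearm are present, so this index range appears to drop one sensor column as well.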

train <- train[,8:59]
test <- test[,8:59]

We now take a look at the column names of the data set.

colnames(train)

##  [1] "pitch_belt"           "yaw_belt"             "total_accel_belt"    
##  [4] "gyros_belt_x"         "gyros_belt_y"         "gyros_belt_z"        
##  [7] "accel_belt_x"         "accel_belt_y"         "accel_belt_z"        
## [10] "magnet_belt_x"        "magnet_belt_y"        "magnet_belt_z"       
## [13] "roll_arm"             "pitch_arm"            "yaw_arm"             
## [16] "total_accel_arm"      "gyros_arm_x"          "gyros_arm_y"         
## [19] "gyros_arm_z"          "accel_arm_x"          "accel_arm_y"         
## [22] "accel_arm_z"          "magnet_arm_x"         "magnet_arm_y"        
## [25] "magnet_arm_z"         "roll_dumbbell"        "pitch_dumbbell"      
## [28] "yaw_dumbbell"         "total_accel_dumbbell" "gyros_dumbbell_x"    
## [31] "gyros_dumbbell_y"     "gyros_dumbbell_z"     "accel_dumbbell_x"    
## [34] "accel_dumbbell_y"     "accel_dumbbell_z"     "magnet_dumbbell_x"   
## [37] "magnet_dumbbell_y"    "magnet_dumbbell_z"    "roll_forearm"        
## [40] "pitch_forearm"        "yaw_forearm"          "total_accel_forearm" 
## [43] "gyros_forearm_x"      "gyros_forearm_y"      "gyros_forearm_z"     
## [46] "accel_forearm_x"      "accel_forearm_y"      "accel_forearm_z"     
## [49] "magnet_forearm_x"     "magnet_forearm_y"     "magnet_forearm_z"    
## [52] "classe"

Data partitioning

We divide our training data (train) into 2 sets: training (60%) and testing (40%). We will use the original testing data (test) as our validation set.

trainClasse <- createDataPartition(train$classe, p=0.6, list=FALSE)
training <- train[trainClasse,]
testing <- train[-trainClasse,]
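The split above is random and no seed is set, so the exact numbers reported below will vary from run to run. The following base-R sketch shows the general idea of a class-stratified 60/40 split with a fixed seed. The seed value, the stratified_split helper, and the toy labels are all hypothetical. It is a simplified stand-in for the idea, not a reimplementation of createDataPartition.

```r
set.seed(2024)  # hypothetical seed; any fixed value makes the split repeatable

# Sample a fraction p of the row indices within each class separately,
# so the class proportions in the training set match the full data.
stratified_split <- function(y, p = 0.6) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy labels (hypothetical) with unequal class sizes.
labels <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20)))

in_train <- stratified_split(labels, p = 0.6)
train_share <- table(labels[in_train]) / table(labels)
train_share
```

Each class contributes 60% of its rows to the training indices, which is the property that matters when the classes are unbalanced.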

Decision tree

treeModfit <- train(classe ~ ., data = training, method="rpart")
treePred <- predict(treeModfit, testing)
confusionMatrix(treePred, as.factor(testing$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674  354   72  122   77
##          B   58  684   56   81  302
##          C  458  343 1074  689  436
##          D   40  136  154  394    7
##          E    2    1   12    0  620
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5667          
##                  95% CI : (0.5556, 0.5777)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.452           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7500  0.45059   0.7851  0.30638  0.42996
## Specificity            0.8887  0.92146   0.7027  0.94863  0.99766
## Pos Pred Value         0.7281  0.57917   0.3580  0.53899  0.97638
## Neg Pred Value         0.8994  0.87487   0.9393  0.87463  0.88601
## Prevalence             0.2845  0.19347   0.1744  0.16391  0.18379
## Detection Rate         0.2134  0.08718   0.1369  0.05022  0.07902
## Detection Prevalence   0.2930  0.15052   0.3824  0.09317  0.08093
## Balanced Accuracy      0.8193  0.68603   0.7439  0.62750  0.71381

rpart.plot(treeModfit$finalModel, roundint=FALSE)

We see that the accuracy ≈ 57% (Kappa ≈ 0.45), which is quite low. Class D in particular is poorly detected, with a sensitivity of about 0.31.
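The headline figures in the output above can be recomputed by hand from the confusion matrix, which helps in reading them. The sketch below retypes the decision tree's matrix from the output; the variable names are hypothetical.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
tree_cm <- matrix(
  c(1674, 354,   72, 122,  77,
      58, 684,   56,  81, 302,
     458, 343, 1074, 689, 436,
      40, 136,  154, 394,   7,
       2,   1,   12,   0, 620),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(tree_cm)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(tree_cm)) / n

# Cohen's kappa: accuracy corrected for the agreement expected by chance.
expected <- sum(rowSums(tree_cm) * colSums(tree_cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions divided by the reference column total.
sensitivity <- diag(tree_cm) / colSums(tree_cm)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
round(sensitivity, 4)
```

The results agree with the reported Accuracy (0.5667), Kappa (0.452), and per-class Sensitivity row.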

Random forest

forestModfit <- train(classe ~ ., data = training, method = "rf", ntree = 100)
forestPred <- predict(forestModfit, testing)
forestPredConfusion <- confusionMatrix(forestPred, as.factor(testing$classe))
forestPredConfusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231   12    0    0    0
##          B    0 1502   20    0    0
##          C    0    3 1344   14    2
##          D    1    0    4 1272    2
##          E    0    1    0    0 1438
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9925          
##                  95% CI : (0.9903, 0.9943)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9905          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9895   0.9825   0.9891   0.9972
## Specificity            0.9979   0.9968   0.9971   0.9989   0.9998
## Pos Pred Value         0.9947   0.9869   0.9861   0.9945   0.9993
## Neg Pred Value         0.9998   0.9975   0.9963   0.9979   0.9994
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1914   0.1713   0.1621   0.1833
## Detection Prevalence   0.2859   0.1940   0.1737   0.1630   0.1834
## Balanced Accuracy      0.9987   0.9931   0.9898   0.9940   0.9985

We see that the accuracy ≈ 99% (95% CI: 0.9903 to 0.9943), which is far better than the decision tree. Hence, we select the random forest model as our prediction model for this analysis.
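As a rough guide to what this accuracy means for the 20 validation cases, the sketch below derives the estimated out-of-sample error from the random forest's confusion matrix and then the chance that all 20 predictions are correct. The second figure assumes the 20 cases are independent and as hard as the held-out testing set; that is an assumption of this sketch, not something the analysis establishes. Variable names are hypothetical.

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
rf_cm <- matrix(
  c(2231,   12,    0,    0,    0,
       0, 1502,   20,    0,    0,
       0,    3, 1344,   14,    2,
       1,    0,    4, 1272,    2,
       0,    1,    0,    0, 1438),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy  <- sum(diag(rf_cm)) / sum(rf_cm)
oos_error <- 1 - accuracy   # estimated out-of-sample error rate

# Probability that 20 independent predictions are all correct (assumption:
# independence and the same per-case accuracy as on the held-out set).
p_all_20 <- accuracy^20

round(c(accuracy = accuracy, oos_error = oos_error, p_all_20 = p_all_20), 4)
```

An error rate below 1% on the held-out set is what justifies applying the model to the 20 unlabeled cases with reasonable confidence.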

Final Prediction

We now apply the random forest model to the original 20-record test set (test), which serves as our validation set.

finalPred <- predict(forestModfit, test )
finalPred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

As the results show, the random forest outperforms the decision tree in terms of accuracy. While the decision tree gives us ≈57% accuracy on the held-out testing set, the random forest gives us a whopping 99% accuracy.