Background

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

Data

In this project, I will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways: one correct form and four common mistakes, recorded in the classe variable.

More information is available from http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Prepare Workspace

To make the workspace ready for prediction, I load the necessary libraries and set a seed for reproducibility.

library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(randomForest)
library(corrplot)
library(e1071)

set.seed(39021)

Load Data

Now we can load our data and take a look at it.

# Treat empty strings, "NA", and the Excel artifact "#DIV/0!" as missing
training <- read.csv("training.csv", na.strings = c("", "NA", "#DIV/0!"))
testing <- read.csv("testing.csv", na.strings = c("", "NA", "#DIV/0!"))
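
A quick sanity check of the dimensions (the number of cases differs, but the two files should share the same columns):

dim(training)
dim(testing)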

Clean Data

The first seven features hold information about users, timestamps, and time windows. These won’t be used to make our predictions, and so will be dropped.

training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]

Many features exist only to store summary statistics for each time window; in all other rows they are filled with NAs. Since we aren't concerned with the recorded time windows, we won't be using these summary statistics.

# Count the NAs in each column and drop any column that is
# mostly NA (the summary-statistic columns)
NAcount <- colSums(is.na(training))
training <- training[, NAcount <= 1000]
testing <- testing[, NAcount <= 1000]
dim(training)[2]
## [1] 53

We are left with 53 features, including our response.
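
Since both sets were subset with the same column indices, the predictors should still line up. As a quick check (assuming the only difference is the last column: classe in training versus a case identifier in testing), the following should return TRUE:

# Predictor names should match between the two sets; only the
# last column (the response vs. a case identifier) differs
all(names(training)[-53] == names(testing)[-53])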

The predictors need to be numeric; let's make sure.

table(sapply(training, class))
## 
## character   integer   numeric 
##         1        25        27

There is one character column; all other variables are integer or numeric and can be used for prediction.

class(training$classe)
## [1] "character"

Our response classe is a character vector and needs to be a factor.

training$classe <- factor(training$classe)
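
As a quick sanity check, we can also look at the class distribution (the exact counts depend on the data):

# Distribution of the five activity classes
table(training$classe)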

Data Partitioning

Now I will partition the training data into a model-building set (60%) and a validation set (40%) so we can estimate out-of-sample error.

part <- createDataPartition(training$classe, p = .6, list = FALSE)
trainingfit <- training[part,]
testingfit <- training[-part,]
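
A quick check confirms the split is roughly 60/40:

# Rows in each subset and the realized training fraction
c(train = nrow(trainingfit), validation = nrow(testingfit))
round(nrow(trainingfit) / nrow(training), 3)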

Model Fitting

My first model will use a decision tree.

# Classification tree on all remaining predictors
fitrpart <- rpart(classe ~ ., data = trainingfit, method = "class")
fancyRpartPlot(fitrpart)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

predictionrpart <- predict(fitrpart, testingfit[,-53], type="class")
confusionMatrix(testingfit$classe, predictionrpart)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1978   82   50   82   40
##          B  231  872  291   92   32
##          C   31   72 1178   87    0
##          D   73  123  164  831   95
##          E    7  117  186   93 1039
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7517         
##                  95% CI : (0.742, 0.7612)
##     No Information Rate : 0.2957         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6858         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8526   0.6888   0.6303   0.7013   0.8615
## Specificity            0.9540   0.9018   0.9682   0.9317   0.9393
## Pos Pred Value         0.8862   0.5744   0.8611   0.6462   0.7205
## Neg Pred Value         0.9391   0.9377   0.8933   0.9460   0.9739
## Prevalence             0.2957   0.1614   0.2382   0.1510   0.1537
## Detection Rate         0.2521   0.1111   0.1501   0.1059   0.1324
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9033   0.7953   0.7992   0.8165   0.9004

The decision tree doesn't offer great predictions (about 75% accuracy on the validation set), but closer inspection may yield some interesting insights into the data.
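
For instance, rpart stores its own importance scores on the fitted object; a quick look (a sketch, since the exact values depend on the fit) shows which sensor readings drive the splits:

# Top split variables by rpart's built-in importance measure
head(round(fitrpart$variable.importance, 1), 10)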

A random forest will likely give more accurate predictions.

# Time the fit so we can compare against a slimmer model later
time <- proc.time()
(fitRF <- randomForest(classe ~ ., data = trainingfit, importance = TRUE))
## 
## Call:
##  randomForest(formula = classe ~ ., data = trainingfit, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.76%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3344    3    1    0    0 0.001194743
## B   16 2253   10    0    0 0.011408513
## C    0   15 2036    3    0 0.008763389
## D    0    0   28 1899    3 0.016062176
## E    0    0    3    7 2155 0.004618938
proc.time()-time
##    user  system elapsed 
##  58.994   0.315  59.334
predictionRF <- predict(fitRF, testingfit[,-53], type="response")
confusionMatrix(testingfit$classe, predictionRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2228    3    0    0    1
##          B   12 1497    9    0    0
##          C    0   10 1358    0    0
##          D    0    0   10 1275    1
##          E    0    0    1    4 1437
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9935          
##                  95% CI : (0.9915, 0.9952)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9918          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9946   0.9914   0.9855   0.9969   0.9986
## Specificity            0.9993   0.9967   0.9985   0.9983   0.9992
## Pos Pred Value         0.9982   0.9862   0.9927   0.9914   0.9965
## Neg Pred Value         0.9979   0.9979   0.9969   0.9994   0.9997
## Prevalence             0.2855   0.1925   0.1756   0.1630   0.1834
## Detection Rate         0.2840   0.1908   0.1731   0.1625   0.1832
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9970   0.9940   0.9920   0.9976   0.9989

The random forest model predicts very well (99.35% accuracy on the validation set), but the processing time is substantial at about a minute. This model would work fine for our needs, but I'd like to see whether the processing time can be lowered without increasing the error rate too much.

The random forest model also provides estimates of how important the various features are.

varImpPlot(fitRF, cex = .7)
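
The same information is available numerically through randomForest's importance() accessor, for example (values depend on the fit):

# Per-class importance scores, plus MeanDecreaseAccuracy (column 6)
# and MeanDecreaseGini (column 7)
head(round(importance(fitRF), 2))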

Let's take the most important features and build a model from them. I'll use the top ten selections according to the Mean Decrease Accuracy and Mean Decrease Gini estimates.

# Union of the top 10 features by MeanDecreaseAccuracy (column 6)
# and by MeanDecreaseGini (column 7) of the importance matrix
imptop <- unique(names(c(sort(fitRF$importance[, 6], decreasing = TRUE)[1:10],
                         sort(fitRF$importance[, 7], decreasing = TRUE)[1:10])))
paste(imptop, collapse = "+")
## [1] "roll_belt+roll_forearm+magnet_dumbbell_y+magnet_dumbbell_z+yaw_belt+magnet_dumbbell_x+pitch_belt+pitch_forearm+roll_dumbbell+accel_dumbbell_y"

Now I'll train a new model using only these high-importance features. I've also lowered the number of trees to 150, which should speed things up quite a bit.

time <- proc.time()
(fitRFimp <- randomForest(classe ~
                               roll_forearm+
                               roll_belt+
                               magnet_dumbbell_y+
                               yaw_belt+
                               magnet_dumbbell_z+
                               magnet_dumbbell_x+
                               pitch_belt+
                               pitch_forearm+
                               roll_dumbbell+
                               accel_dumbbell_y+
                               accel_belt_z,
                          data=trainingfit, ntree=150, importance=TRUE))
## 
## Call:
##  randomForest(formula = classe ~ roll_forearm + roll_belt + magnet_dumbbell_y + yaw_belt + magnet_dumbbell_z + magnet_dumbbell_x + pitch_belt + pitch_forearm + roll_dumbbell + accel_dumbbell_y + accel_belt_z, data = trainingfit, ntree = 150, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 150
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.8%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3314   20   12    2    0  0.01015532
## B   27 2202   35   13    2  0.03378675
## C    3   25 2007   19    0  0.02288218
## D    0    2   24 1898    6  0.01658031
## E    0    8    6    8 2143  0.01016166
proc.time() - time
##    user  system elapsed 
##   3.918   0.092   4.012

Processing time drops from roughly 59 seconds to about 4 seconds (close to a 15x speedup). The OOB error estimate rises from 0.76% to 1.8%, but remains reasonably low.

predictionRFimp <- predict(fitRFimp, testingfit[,-53], type="response")
confusionMatrix(testingfit$classe, predictionRFimp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2212   10    8    1    1
##          B   20 1463   27    8    0
##          C    0    8 1353    7    0
##          D    0    3    8 1275    0
##          E    0    5    8    3 1426
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9851          
##                  95% CI : (0.9822, 0.9877)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9811          
##                                           
##  Mcnemar's Test P-Value : 7.539e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9910   0.9825   0.9637   0.9853   0.9993
## Specificity            0.9964   0.9913   0.9977   0.9983   0.9975
## Pos Pred Value         0.9910   0.9638   0.9890   0.9914   0.9889
## Neg Pred Value         0.9964   0.9959   0.9921   0.9971   0.9998
## Prevalence             0.2845   0.1898   0.1789   0.1649   0.1819
## Detection Rate         0.2819   0.1865   0.1724   0.1625   0.1817
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9937   0.9869   0.9807   0.9918   0.9984

The overall accuracy drops by a little under 1% (from 0.9935 to 0.9851), which puts the estimated out-of-sample error at about 1 - 0.9851 = 0.0149.
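
Rather than reading it off the printout, the error estimate can be computed directly from the confusion matrix object:

# Estimated out-of-sample error = 1 - validation accuracy
cmRFimp <- confusionMatrix(testingfit$classe, predictionRFimp)
1 - as.numeric(cmRFimp$overall["Accuracy"])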

Though the full model is more accurate, I believe the faster model is accurate enough for the small test set I'll be applying it to.

Results

Finally, I apply the reduced model to the test set and report its predictions.

predictionRFimp <- predict(fitRFimp, testing[,-53])
predictionRFimp
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
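
If the predictions need to be submitted as one text file per test case, a small helper along these lines would do it (a sketch; pml_write_files is a hypothetical name, not part of any package):

# Hypothetical helper: write each prediction to its own text file
pml_write_files <- function(x) {
     for (i in seq_along(x)) {
          write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                      quote = FALSE, row.names = FALSE, col.names = FALSE)
     }
}
pml_write_files(predictionRFimp)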