Acknowledgement

I would like to thank the maintainers of http://groupware.les.inf.puc-rio.br/har for the use of their training and test data, which are used here to predict the quality of barbell lifts.

Executive Summary

It is now possible to use wearables to collect a large amount of data about personal activity. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this paper, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed an exercise (the classe variable). The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Loading the data

# Read the labelled training data and the 20 test cases
PmlTrain <- read.csv("pml-training.csv", header=TRUE)
PmlTest <- read.csv("pml-testing.csv", header=TRUE)
dim(PmlTrain); dim(PmlTest)
## [1] 19622   160
## [1]  20 160
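
Spreadsheet exports can carry empty cells or error tokens inside numeric columns, which read.csv would otherwise import as text. The following is a minimal sketch of declaring such tokens as missing at load time. It assumes (this is not confirmed by the output above) that tokens such as "#DIV/0!" occur in the file; the inline CSV text and its column names are made up for illustration.

```r
# Hypothetical three-row CSV; "#DIV/0!" and the empty cell stand in for bad values.
csv_text <- 'accel_x,kurtosis_x,classe
1.5,#DIV/0!,A
2.0,,B
3.5,0.7,A'

# na.strings lists the tokens that read.csv should treat as missing values.
toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("NA", "", "#DIV/0!"))

str(toy)             # kurtosis_x is read as numeric rather than as text
colSums(is.na(toy))  # number of missing values per column
```

The same na.strings argument could be passed to the two read.csv calls above if the raw files turn out to contain such tokens.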

Preparing and Cleaning the data

This analysis uses the accelerometer measurements from the belt, forearm, arm and dumbbell as predictors; the outcome is classe. Accelerometer columns that contain NA values are removed.

library(plyr); library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.3
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
PmlTrainAccel <- select(PmlTrain, contains("accel"), contains("classe"))
PmlTrainAccel <- PmlTrainAccel[ , colSums(is.na(PmlTrainAccel)) == 0]
summary(PmlTrainAccel)
##  total_accel_belt  accel_belt_x       accel_belt_y     accel_belt_z    
##  Min.   : 0.00    Min.   :-120.000   Min.   :-69.00   Min.   :-275.00  
##  1st Qu.: 3.00    1st Qu.: -21.000   1st Qu.:  3.00   1st Qu.:-162.00  
##  Median :17.00    Median : -15.000   Median : 35.00   Median :-152.00  
##  Mean   :11.31    Mean   :  -5.595   Mean   : 30.15   Mean   : -72.59  
##  3rd Qu.:18.00    3rd Qu.:  -5.000   3rd Qu.: 61.00   3rd Qu.:  27.00  
##  Max.   :29.00    Max.   :  85.000   Max.   :164.00   Max.   : 105.00  
##  total_accel_arm  accel_arm_x       accel_arm_y      accel_arm_z     
##  Min.   : 1.00   Min.   :-404.00   Min.   :-318.0   Min.   :-636.00  
##  1st Qu.:17.00   1st Qu.:-242.00   1st Qu.: -54.0   1st Qu.:-143.00  
##  Median :27.00   Median : -44.00   Median :  14.0   Median : -47.00  
##  Mean   :25.51   Mean   : -60.24   Mean   :  32.6   Mean   : -71.25  
##  3rd Qu.:33.00   3rd Qu.:  84.00   3rd Qu.: 139.0   3rd Qu.:  23.00  
##  Max.   :66.00   Max.   : 437.00   Max.   : 308.0   Max.   : 292.00  
##  total_accel_dumbbell accel_dumbbell_x  accel_dumbbell_y 
##  Min.   : 0.00        Min.   :-419.00   Min.   :-189.00  
##  1st Qu.: 4.00        1st Qu.: -50.00   1st Qu.:  -8.00  
##  Median :10.00        Median :  -8.00   Median :  41.50  
##  Mean   :13.72        Mean   : -28.62   Mean   :  52.63  
##  3rd Qu.:19.00        3rd Qu.:  11.00   3rd Qu.: 111.00  
##  Max.   :58.00        Max.   : 235.00   Max.   : 315.00  
##  accel_dumbbell_z  total_accel_forearm accel_forearm_x   accel_forearm_y 
##  Min.   :-334.00   Min.   :  0.00      Min.   :-498.00   Min.   :-632.0  
##  1st Qu.:-142.00   1st Qu.: 29.00      1st Qu.:-178.00   1st Qu.:  57.0  
##  Median :  -1.00   Median : 36.00      Median : -57.00   Median : 201.0  
##  Mean   : -38.32   Mean   : 34.72      Mean   : -61.65   Mean   : 163.7  
##  3rd Qu.:  38.00   3rd Qu.: 41.00      3rd Qu.:  76.00   3rd Qu.: 312.0  
##  Max.   : 318.00   Max.   :108.00      Max.   : 477.00   Max.   : 923.0  
##  accel_forearm_z   classe  
##  Min.   :-446.00   A:5580  
##  1st Qu.:-182.00   B:3797  
##  Median : -39.00   C:3422  
##  Mean   : -55.29   D:3216  
##  3rd Qu.:  26.00   E:3607  
##  Max.   : 291.00
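
The column filter used above, colSums(is.na(x)) == 0, keeps only the columns with no missing values. A small sketch on a toy data frame (the column names and values are hypothetical) shows what it does step by step:

```r
# Toy data frame (hypothetical values): one complete predictor, one with NAs.
toy <- data.frame(accel_a     = c(1, 2, 3),
                  var_accel_a = c(NA, 0.4, NA),
                  classe      = factor(c("A", "B", "A")))

na_per_col <- colSums(is.na(toy))       # number of NAs in each column
keep       <- na_per_col == 0           # TRUE for fully observed columns
toy_clean  <- toy[, keep, drop = FALSE] # drop = FALSE keeps a data frame

names(toy_clean)
```

Only accel_a and classe survive the filter, because var_accel_a contains at least one NA.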

Exploring the data

dim(PmlTrainAccel); names(PmlTrainAccel)
## [1] 19622    17
##  [1] "total_accel_belt"     "accel_belt_x"         "accel_belt_y"        
##  [4] "accel_belt_z"         "total_accel_arm"      "accel_arm_x"         
##  [7] "accel_arm_y"          "accel_arm_z"          "total_accel_dumbbell"
## [10] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [13] "total_accel_forearm"  "accel_forearm_x"      "accel_forearm_y"     
## [16] "accel_forearm_z"      "classe"
pie(summary(PmlTrainAccel$classe), main="5 different classes of barbell lifts")
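
Class shares are hard to read off a pie chart, so a table of percentages may be a useful companion. The sketch below recomputes them from the class counts printed in the summary() output above; the variable names are illustrative only.

```r
# Class counts copied from the summary() output of PmlTrainAccel.
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class as a percentage of all observations.
pct <- round(100 * counts / sum(counts), 1)
pct
```

With the full data loaded, round(100 * prop.table(table(PmlTrainAccel$classe)), 1) should give the same figures.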

Splitting the data and plotting the predictors

The labelled data are split into a training set (60%) and a testing set (40%). The training set is used to build the prediction model and the testing set is used to estimate its error rate.

library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.2
inTrain <- createDataPartition(y=PmlTrainAccel$classe, p=0.6, list=FALSE)
training <- PmlTrainAccel[inTrain,]; testing <- PmlTrainAccel[-inTrain,]
dim(training); dim(testing)
## [1] 11776    17
## [1] 7846   17
featurePlot(x=training[, c(-17)], y=training$classe, plot="pairs")
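
createDataPartition draws a random, stratified sample, so the split only repeats from run to run if the random seed is set before it is called. A minimal sketch of a reproducible split, using the built-in iris data as a stand-in for PmlTrainAccel (the seed value and object names are illustrative):

```r
library(caret)

set.seed(12345)  # set the seed BEFORE partitioning so the split is reproducible
idx <- createDataPartition(y = iris$Species, p = 0.6, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# The split is stratified: each class is sampled separately.
table(train_set$Species)
table(test_set$Species)
```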

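Fitting the Random Forest model (training)

Note that preProcess is an argument of caret's train(), not of randomForest(). In the call below it is passed through the unused ... arguments, so it does not appear to center or scale the predictors.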

library(randomForest); library(ipred); set.seed(12345)
ModFit <- randomForest(classe ~., data=training, preProcess=c("center","scale")); ModFit
## 
## Call:
##  randomForest(formula = classe ~ ., data = training, preProcess = c("center",      "scale")) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.99%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3207   35   45   58    3  0.04211470
## B   93 2068   75   22   21  0.09258447
## C   29   63 1938   18    6  0.05647517
## D   49   12   84 1774   11  0.08082902
## E    5   33   20   23 2084  0.03741339

The out-of-bag (OOB) accuracy is 1 minus the OOB error rate: 100% - 5.99% = 94.01%. The model fitted on the training data set therefore has an estimated accuracy above 90%.
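
The error figures can be checked by hand from the confusion matrix. The sketch below re-enters the OOB confusion matrix printed for ModFit and derives the overall accuracy and the per-class error; the object names are illustrative.

```r
# OOB confusion matrix copied from the ModFit printout (rows = actual class).
cm <- matrix(c(3207,   35,   45,   58,    3,
                 93, 2068,   75,   22,   21,
                 29,   63, 1938,   18,    6,
                 49,   12,   84, 1774,   11,
                  5,   33,   20,   23, 2084),
             nrow = 5, byrow = TRUE,
             dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

oob_accuracy <- sum(diag(cm)) / sum(cm)     # correctly classified / all cases
oob_error    <- 1 - oob_accuracy
class_error  <- 1 - diag(cm) / rowSums(cm)  # matches the class.error column

round(100 * c(accuracy = oob_accuracy, error = oob_error), 2)
round(class_error, 4)
```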

Fitting a second Random Forest model on the testing data set (testing)

ModFittest <- randomForest(classe ~ ., data=testing, importance=T, prox=T); ModFittest
## 
## Call:
##  randomForest(formula = classe ~ ., data = testing, importance = T,      prox = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 7.51%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 2123   17   34   53    5  0.04883513
## B   72 1359   58   16   13  0.10474308
## C   32   44 1271   17    4  0.07090643
## D   54   10   60 1149   13  0.10653188
## E    4   38   23   22 1355  0.06033287

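The OOB accuracy of this second model is 100% - 7.51% = 92.49%, again above 90%. Note that ModFittest is a new forest fitted on the testing data set, so this figure is the OOB estimate for that model and not the out-of-sample error of ModFit. The out-of-sample error of ModFit would be obtained by predicting the testing data set with ModFit and comparing the predictions with the known classe values.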
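
A held-out estimate of this kind can be sketched as follows. This is not the computation performed in this report; it is an illustration on the built-in iris data, with a simple random 60/40 split standing in for the training and testing objects.

```r
library(randomForest)

set.seed(12345)
# Simple 60/40 split of iris; stands in for the training/testing objects.
idx <- sample(nrow(iris), size = round(0.6 * nrow(iris)))
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Fit on the training rows only.
fit <- randomForest(Species ~ ., data = train_set)

# Predict the held-out rows with the model fitted on the training rows.
pred <- predict(fit, newdata = test_set)

# Cross-tabulate actual against predicted class and compute the accuracy.
conf_mat <- table(actual = test_set$Species, predicted = pred)
out_of_sample_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

conf_mat
out_of_sample_accuracy
```

With the objects in this report, the analogous prediction step would be predict(ModFit, newdata = testing), compared against testing$classe.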

Predicting the “classe” with the PmlTest data set (validation)

PMLA2 <- predict(ModFit, newdata=PmlTest); PMLA2
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Cross validating ModFit and ModFittest

The PMLA2 predictions are the answers to the 20 cases in the PmlTest data set, and all 20 were correct. This is consistent with the confusion matrices above, which put the accuracy of the Random Forest models above 90%.
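
For comparison, a formal k-fold cross validation can be run through caret's train(), which also applies preProcess itself. The sketch below is only an illustration on the built-in iris data; the choice of 5 folds, the fixed mtry value and the object names are assumptions, not settings used in this report.

```r
library(caret)

set.seed(12345)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross validation

# method = "rf" wraps randomForest; centering and scaling are handled by train().
cv_fit <- train(Species ~ ., data = iris,
                method     = "rf",
                preProcess = c("center", "scale"),
                trControl  = ctrl,
                tuneGrid   = data.frame(mtry = 2))

# Accuracy and Kappa averaged over the held-out folds.
cv_fit$results[, c("mtry", "Accuracy", "Kappa")]
```

The cross-validated accuracy is averaged over folds that were not used for fitting, which makes it a useful check on the OOB estimates reported above.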