Introduction and Study Design

This model was built for the Coursera Practical Machine Learning class project, part of the Data Science Specialization taught by Professors Jeff Leek, Roger Peng, and Brian Caffo.

The data come from sensors used during weightlifting. We use these sensor readings to predict whether the participants performed the exercise correctly (classe = 'A') or made one of four common mistakes (classe = 'B', 'C', 'D', or 'E').

Data Partitioning

This study uses separate training, validation, and testing datasets, with 20% of the data in the test set, 16% in the validation set, and 64% of the data in the training set.

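library(caret)   # provides createDataPartition
set.seed(1337)
allData   <- read.csv('./data/train.csv')
# Hold out 20% of all rows as the test set (the split is keyed on column X)
inTest    <- createDataPartition(y=allData$X, p=.2, list=FALSE)
testData  <- allData[inTest,]
trainData <- allData[-inTest,]
# Hold out 20% of the remaining 80% for validation: .8*.2 = 16% of allData
inValid   <- createDataPartition(y=trainData$X, p=.2, list=FALSE)
validData <- trainData[inValid,]
trainData <- trainData[-inValid,]   # the remaining 64% of allData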
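
The 20% / 16% / 64% arithmetic above can be checked with a small sketch. This is only an illustration in base R: it uses plain random sampling rather than caret's createDataPartition, and the row count `n` is a made-up value, not the size of the real dataset.

```r
# Sketch only: a nested random split in base R (not caret's partitioning).
set.seed(1337)
n   <- 1000                 # hypothetical number of rows
idx <- seq_len(n)

testIdx  <- sample(idx, size = round(0.2 * n))               # 20% of all rows
rest     <- setdiff(idx, testIdx)                            # remaining 80%
validIdx <- sample(rest, size = round(0.2 * length(rest)))   # 20% of 80% = 16%
trainIdx <- setdiff(rest, validIdx)                          # 80% of 80% = 64%

# Share of all rows that ends up in each set
splitShare <- c(train = length(trainIdx),
                valid = length(validIdx),
                test  = length(testIdx)) / n
print(splitShare)
```

Because the validation rows are drawn only from what is left after the test split, the three sets cannot overlap.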

Data Preparation

To minimize overfitting to sparsely populated variables, only columns in which more than 75% of the values were populated (neither blank nor NA) were kept in the training set.

## Keep only columns where more than 75% of the values are non-blank
## (NA values pass this check and are handled by the next step)
trainData <- trainData[, sapply(trainData, function(x) length(x[x != ""]) > 0.75*nrow(trainData))]
## Keep only columns where more than 75% of the values are non-NA
trainData <- trainData[, sapply(trainData, function(x) length(x[!is.na(x)]) > 0.75*nrow(trainData))]
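
To make the filter easier to follow, here is a minimal sketch on a toy data frame. The column names and values are invented for illustration, and the single-pass `filledShare` helper is my own simplification, not the two-step code used above.

```r
# Toy data frame; all column names and values here are hypothetical.
toy <- data.frame(
  full      = c(1.2, 3.4, 5.6, 7.8),
  mostlyNA  = c(NA, NA, NA, 2.0),
  mostlyGap = c("", "", "", "x"),
  classe    = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Share of populated (non-NA and non-blank) values in a column.
filledShare <- function(x) mean(!is.na(x) & x != "")

# Keep a column only if more than 75% of its values are populated.
keep     <- sapply(toy, filledShare) > 0.75
toyClean <- toy[, keep, drop = FALSE]
print(names(toyClean))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.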

Exploratory Data Analysis

Three-dimensional scatter plots were constructed to explore the data. Each plot shows the spatial movement recorded by the gyroscope in one of the four sensors.

We can see from some of the graphs that there is at least one outlier in the dataset that may impact the model. Were I doing regression, this observation would need to be reviewed more thoroughly, but because I’m using trees, which split on thresholds and are therefore less sensitive to extreme predictor values, I decided not to worry about this particular observation.
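
As a rough numerical counterpart to spotting an outlier by eye, one could flag rows that sit far from the median on any gyroscope axis. The sketch below is an assumption on my part, not part of the original analysis: the data are simulated, the column names are placeholders, and the 10-MAD threshold is arbitrary.

```r
# Simulated gyroscope readings; column names and values are made up.
set.seed(1)
gyro <- data.frame(x = rnorm(200), y = rnorm(200), z = rnorm(200))
gyro[42, ] <- c(-200, 50, 300)   # plant one extreme observation

# Robust z-score: distance from the median in units of the MAD.
robustZ <- function(v) abs(v - median(v)) / mad(v)

# Flag rows where any axis is more than 10 MADs from its median.
scores  <- sapply(gyro, robustZ)              # 200 x 3 matrix of scores
flagged <- which(apply(scores, 1, max) > 10)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation so that the outlier itself does not distort the scale it is measured against.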

Modeling

Models were evaluated on two key criteria:
1. Speed. Because of hardware and time constraints (an older computer and a limited schedule), I did not pursue the random forest model: it took too long to run and evaluate.
2. Accuracy. Models were compared based on accuracy when run against the validation set.
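
Both criteria can be measured with one small helper. This is a sketch under stated assumptions: `evaluateModel`, `fitFn`, and `predictFn` are hypothetical names, and the majority-class "model" is only a stand-in so that the example runs without any modeling package.

```r
# Sketch: time a model fit and score it on held-out data.
# fitFn and predictFn are hypothetical stand-ins for e.g. rpart() and predict().
evaluateModel <- function(fitFn, predictFn, train, valid, target = "classe") {
  timing <- system.time(fit <- fitFn(train))   # criterion 1: speed
  pred   <- predictFn(fit, valid)
  list(seconds  = unname(timing["elapsed"]),
       accuracy = mean(pred == valid[[target]]))   # criterion 2: accuracy
}

# Stand-in "model": always predict the most frequent training class.
majorityFit     <- function(d) names(which.max(table(d$classe)))
majorityPredict <- function(fit, d) rep(fit, nrow(d))

train  <- data.frame(classe = c("A", "A", "B", "C"))
valid  <- data.frame(classe = c("A", "B", "A", "A"))
result <- evaluateModel(majorityFit, majorityPredict, train, valid)
print(result$accuracy)
```

A majority-class baseline like this also gives a floor against which the tree models' accuracy can be judged.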

Model 1: Simple RPart Tree

Model 1 is a classification tree built with the rpart function from the rpart package.

library(rpart)
# Share of predictions that match the true labels
accuracy <- function(values,prediction){sum((prediction == values))/length(values)}
model1 <- rpart(classe~.,data=trainData,method="class")

data.frame(training_set = accuracy(trainData$classe,predict(model1,newdata=trainData,type="class")),
           validation_set = accuracy(validData$classe,predict(model1,newdata=validData,type="class")))
##   training_set validation_set
## 1    0.7539025      0.7515924

Model 2: Simple CTree

Model 2 is another tree model, a conditional inference tree fitted with method="ctree" through caret's train function.

model2 <- train(classe~.,data=trainData,method="ctree")
data.frame(training_set = accuracy(trainData$classe,predict(model2,newdata=trainData)),
         validation_set = accuracy(validData$classe,predict(model2,newdata=validData)))
##   training_set validation_set
## 1    0.9219497      0.8757962

Model 3: CTree with Principal Component Analysis

Model 3 also uses the ctree method, but adds principal component analysis (PCA) as a preprocessing step through caret's preProcess option.

model3 <- train(classe~.,data=trainData,method="ctree", preProcess="pca")
data.frame(training_set = accuracy(trainData$classe,predict(model3,newdata=trainData)),
         validation_set = accuracy(validData$classe,predict(model3,newdata=validData)))
##   training_set validation_set
## 1    0.8660401      0.7761146
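
To show what PCA preprocessing does to the predictors, here is a minimal sketch using base R's prcomp on the built-in iris data, not on the weightlifting data. The 95% variance cut-off is an assumption for illustration; I have not checked which threshold caret applied in Model 3.

```r
# Sketch with base R's prcomp on the built-in iris data.
features <- iris[, 1:4]

# Center and scale the predictors, then rotate onto principal components.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components.
cumVar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components that reaches the assumed 95% cut-off.
nComp   <- which(cumVar >= 0.95)[1]
reduced <- pca$x[, seq_len(nComp), drop = FALSE]
print(nComp)
print(dim(reduced))
```

The tree is then trained on the reduced components instead of the original columns. Directions with little variance are discarded even if they help separate the classes, which may be one reason Model 3 scored lower than Model 2 on the validation set.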

Results

Of the three models evaluated, Model 2 (CTree without PCA) performed best on the validation set, with an accuracy of 0.876 against 0.752 for Model 1 and 0.776 for Model 3.

library(xtable)
bestModel <- model2
print(xtable(table(trainData$classe, predict(bestModel, newdata=trainData)),
             caption="Performance in Training Set"), type="html")
Performance in Training Set (rows: actual classe, columns: predicted classe)
       A     B     C     D     E
A   3429    74    31    24    20
B     92  2201    66    26    52
C     33   106  1926    71    41
D     48    65    52  1850    39
E     20    64    24    32  2170
print(xtable(table(testData$classe,predict(bestModel,newdata=testData)),caption="Performance in Test Set"),type="html")
Performance in Test Set (rows: actual classe, columns: predicted classe)
       A     B     C     D     E
A   1025    54    10    11    15
B     50   631    31    21    25
C     18    55   576    20    20
D     17    35    32   545    18
E      9    37    20    13   638
print(xtable(data.frame(training_accuracy = 
                 accuracy(trainData$classe,predict(bestModel,newdata=trainData)),
            validation_accuracy = 
                 accuracy(validData$classe,predict(bestModel,newdata=validData)),
            testing_accuracy =
                 accuracy(testData$classe ,predict(bestModel,newdata=testData)))),
      type="html")
  training_accuracy validation_accuracy testing_accuracy
1              0.92                0.88             0.87
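
Overall accuracy hides how the model does on each class. As a rough sketch, the test-set confusion matrix reported above can be re-entered by hand to derive per-class recall and precision. The variable names are my own, and the numbers are copied from the table rather than recomputed from the model.

```r
# Test-set confusion matrix as reported above (rows: actual, columns: predicted).
classes  <- c("A", "B", "C", "D", "E")
confTest <- matrix(c(1025,  54,  10,  11,  15,
                       50, 631,  31,  21,  25,
                       18,  55, 576,  20,  20,
                       17,  35,  32, 545,  18,
                        9,  37,  20,  13, 638),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(actual = classes, predicted = classes))

overallAccuracy <- sum(diag(confTest)) / sum(confTest)
recall    <- diag(confTest) / rowSums(confTest)   # share of each actual class found
precision <- diag(confTest) / colSums(confTest)   # share of each prediction that is right

print(round(overallAccuracy, 4))
print(round(rbind(recall, precision), 3))
```

In these figures class B has the lowest recall, so the mistake labeled 'B' is the one the model misses most often.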

References

Data generously provided by Velloso et al.: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Read more: http://groupware.les.inf.puc-rio.br/har