Introduction and Study Design

This model was built for the Coursera Practical Machine Learning class project, part of the Data Science Specialization taught by Professors Jeff Leek, Roger Peng, and Brian Caffo.

The data come from sensors worn during weightlifting. We use the sensor readings to predict whether participants performed the exercise correctly (classe = 'A') or made one of four common mistakes (classe = 'B', 'C', 'D', or 'E').

Data Partitioning

This study uses separate training, validation, and testing datasets, with 20% of the data in the test set, 16% in the validation set, and 64% of the data in the training set.

library(caret)  # provides createDataPartition and train
set.seed(1337)
allData  <- read.csv('./data/train.csv')
# Hold out 20% of all rows as the test set
inTest   <- createDataPartition(y=allData$X,p=.2,list=FALSE)
testData <- allData[inTest,]
trainData<- allData[-inTest,]
# Hold out 20% of the remaining rows for validation (.8*.2 = 16% of allData)
inValid  <- createDataPartition(y=trainData$X,p=.2,list=FALSE)
validData<- trainData[inValid,]
trainData<- trainData[-inValid,]
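The 20/16/64 arithmetic can be sanity-checked without caret. The sketch below is only an illustration under assumptions: it uses base R's sample on a hypothetical row count (n = 1000) instead of createDataPartition, so it is not stratified, and the names n, testIdx, validIdx, trainIdx, and splitSizes are invented here.

```r
# Illustrative only: reproduce the 20/16/64 split with base R on a
# hypothetical row count (no stratification, unlike createDataPartition).
set.seed(1337)
n <- 1000                                    # hypothetical number of rows
rows <- seq_len(n)

testIdx   <- sample(rows, size = round(0.2 * n))       # 20% of all rows
remaining <- setdiff(rows, testIdx)                     # the other 80%
validIdx  <- sample(remaining,
                    size = round(0.2 * length(remaining)))  # 20% of 80% = 16%
trainIdx  <- setdiff(remaining, validIdx)               # 80% of 80% = 64%

# Share of all rows that lands in each set
splitSizes <- c(test = length(testIdx),
                validation = length(validIdx),
                training = length(trainIdx)) / n
print(splitSizes)
```

Because the validation rows are drawn from what is left after the test split, the three sets cannot overlap.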

Data Preparation

To reduce the risk of fitting to sparsely populated variables, only variables with more than 75% of their values populated (not null, blank, or NA) were kept in the training set.

##Keep only columns where more than 75% of the values are not blank
trainData<-trainData[,sapply(trainData,function(x) (length(x[x != ""])>(0.75*nrow(trainData))))]
##Keep only columns where more than 75% of the values are not NA
trainData<-trainData[,sapply(trainData,function(x) (length(x[!is.na(x)])>(0.75*nrow(trainData))))]
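To make the filter easier to follow, here is a minimal sketch on a toy data frame. It is an assumption-laden illustration, not the code used for the project: the data frame toy, its column names, and the combined NA-or-blank test are all invented for this example.

```r
# Illustrative only: drop sparsely populated columns from a toy data frame.
toy <- data.frame(
  full        = c(1, 2, 3, 4, 5, 6, 7, 8),
  mostlyNA    = c(NA, NA, NA, NA, NA, NA, NA, 1),
  mostlyBlank = c("", "", "", "", "", "", "x", "y"),
  stringsAsFactors = FALSE
)

# Share of populated (neither NA nor blank) values in each column
populated <- sapply(toy, function(x) mean(!is.na(x) & x != ""))

# Keep columns that are more than 75% populated
keep <- names(populated)[populated > 0.75]
toyFiltered <- toy[, keep, drop = FALSE]

print(populated)
print(names(toyFiltered))
```

A vector of kept column names like keep could also be reused to subset the validation and test sets, so that all three share the same columns.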

Exploratory Data Analysis

Three-dimensional scatter plots were constructed to explore the data. Each plot shows the spatial movement recorded by the gyroscope in one of the four sensors.

We can see from some of the graphs that there is at least one outlier in the dataset that may affect the model. Were I doing regression, this observation would need to be reviewed more thoroughly, but because I’m using trees, which are relatively insensitive to extreme predictor values, I decided not to pursue this particular observation.
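If the outlier did need a closer look, one simple way to locate it would be a robust z-score. The sketch below is only an illustration on made-up numbers: toyGyro is hypothetical data with one planted extreme value, and the cutoff of 10 is an arbitrary assumption, not something taken from the project.

```r
# Illustrative only: flag extreme readings in a toy gyroscope-like column.
set.seed(1337)
toyGyro <- c(rnorm(200, mean = 0, sd = 0.5), 40)  # hypothetical data plus one planted outlier

# Robust z-score: distance from the median in units of the MAD
robustZ <- (toyGyro - median(toyGyro)) / mad(toyGyro)

# Rows whose robust z-score exceeds the (assumed) cutoff
outlierRows <- which(abs(robustZ) > 10)
print(outlierRows)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them.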

Modeling

Models were evaluated on two key criteria:
1. Speed. Due to practical constraints (an older computer and limited time), I did not pursue a random forest model because it took too long to train and evaluate.
2. Accuracy. Models were compared based on accuracy when run against the validation set.
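Both criteria can be measured in a few lines. The following is a rough sketch under assumptions: the built-in iris data stands in for the sensor data, and the names holdout, fitData, heldOut, toyModel, fitTime, and toyAccuracy are invented for the example.

```r
# Illustrative only: time a model fit and score it on held-out rows.
library(rpart)  # ships with standard R installations as a recommended package

set.seed(1337)
holdout <- sample(seq_len(nrow(iris)), size = 30)  # iris stands in for the sensor data
fitData <- iris[-holdout, ]
heldOut <- iris[holdout, ]

# Criterion 1: speed, measured as elapsed seconds for the fit
fitTime <- system.time(
  toyModel <- rpart(Species ~ ., data = fitData, method = "class")
)[["elapsed"]]

# Criterion 2: accuracy on rows the model has not seen
toyAccuracy <- mean(predict(toyModel, newdata = heldOut, type = "class") == heldOut$Species)

print(c(elapsed_seconds = fitTime, holdout_accuracy = toyAccuracy))
```

Wrapping each candidate's fit in system.time gives comparable timings, provided all candidates are run on the same machine and data.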

Model 1: Simple RPart Tree

Model 1 is a classification tree built with the rpart function from the rpart library.

library(rpart)
# Share of predictions that match the observed class
accuracy <- function(values,prediction){sum((prediction == values))/length(values)}
model1 <- rpart(classe~.,data=trainData,method="class")

data.frame(training_set = accuracy(trainData$classe,predict(model1,newdata=trainData,type="class")),
           validation_set = accuracy(validData$classe,predict(model1,newdata=validData,type="class")))

##   training_set validation_set
## 1    0.7539025      0.7515924

Model 2: Simple CTree

Model 2 is another tree model, a conditional inference tree (ctree, from the party library) fitted through caret's train function.

model2 <- train(classe~.,data=trainData,method="ctree")
data.frame(training_set = accuracy(trainData$classe,predict(model2,newdata=trainData)),
         validation_set = accuracy(validData$classe,predict(model2,newdata=validData)))

##   training_set validation_set
## 1    0.9219497      0.8757962

Model 3: CTree with Principal Component Analysis

Model 3 also uses the ctree method, but adds caret's preprocessing option to apply principal component analysis (PCA) to the predictors first.

model3 <- train(classe~.,data=trainData,method="ctree", preProcess="pca")
data.frame(training_set = accuracy(trainData$classe,predict(model3,newdata=trainData)),
         validation_set = accuracy(validData$classe,predict(model3,newdata=validData)))

##   training_set validation_set
## 1    0.8660401      0.7761146
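For readers unfamiliar with PCA preprocessing, here is a minimal sketch of the idea using base R's prcomp rather than caret. It rests on assumptions: iris stands in for the sensor columns, the 95% variance cutoff is chosen for illustration (to my understanding it matches caret's default, but I have not confirmed that for this run), and the names predictors, pca, cumVar, nComp, and reduced are invented here.

```r
# Illustrative only: PCA preprocessing with base R instead of caret.
predictors <- iris[, 1:4]  # iris stands in for the sensor columns

# Center and scale, then rotate onto principal components
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components
cumVar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching a 95% threshold (assumed cutoff)
nComp <- which(cumVar >= 0.95)[1]
reduced <- pca$x[, seq_len(nComp), drop = FALSE]

print(round(cumVar, 3))
print(dim(reduced))
```

PCA keeps the directions of greatest variance, which are not necessarily the directions that best separate the classes. That may be one reason Model 3 validates worse than Model 2, though I did not test this.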

Results

Of the three models evaluated, Model 2 (CTree without PCA) performed best on the validation set.
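The comparison can be written down explicitly. This sketch simply copies the accuracies printed for each model into a data frame; the names results, gap, and best are invented for the example.

```r
# Accuracies copied from the output reported for each model
results <- data.frame(
  model          = c("rpart", "ctree", "ctree + PCA"),
  training_set   = c(0.7539025, 0.9219497, 0.8660401),
  validation_set = c(0.7515924, 0.8757962, 0.7761146),
  stringsAsFactors = FALSE
)

# Gap between training and validation accuracy, a rough overfitting signal
results$gap <- results$training_set - results$validation_set

# Pick the model with the highest validation accuracy
best <- results$model[which.max(results$validation_set)]

print(results)
print(best)
```

Selecting on validation accuracy rather than training accuracy keeps the test set untouched until the final evaluation.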

library(xtable)  # formats tables as HTML
bestModel <- model2
print(xtable(table(trainData$classe,predict(bestModel,newdata=trainData)),caption="Performance in Training Set"),type="html")

Performance in Training Set
	A	B	C	D	E
A	3429	74	31	24	20
B	92	2201	66	26	52
C	33	106	1926	71	41
D	48	65	52	1850	39
E	20	64	24	32	2170

print(xtable(table(testData$classe,predict(bestModel,newdata=testData)),caption="Performance in Test Set"),type="html")

Performance in Test Set
	A	B	C	D	E
A	1025	54	10	11	15
B	50	631	31	21	25
C	18	55	576	20	20
D	17	35	32	545	18
E	9	37	20	13	638
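The confusion matrix carries more detail than a single accuracy figure. As a sketch, the counts from the test-set table can be typed into a matrix to recover overall accuracy plus per-class recall and precision. Rows are the actual class and columns the predicted class, following the argument order of the table call; the names classes, testConfusion, overallAccuracy, recall, and precision are invented here.

```r
# Counts copied from the "Performance in Test Set" table
# (rows = actual class, columns = predicted class)
classes <- c("A", "B", "C", "D", "E")
testConfusion <- matrix(
  c(1025,  54,  10,  11,  15,
      50, 631,  31,  21,  25,
      18,  55, 576,  20,  20,
      17,  35,  32, 545,  18,
       9,  37,  20,  13, 638),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Correct predictions sit on the diagonal
overallAccuracy <- sum(diag(testConfusion)) / sum(testConfusion)
recall    <- diag(testConfusion) / rowSums(testConfusion)  # share of each actual class found
precision <- diag(testConfusion) / colSums(testConfusion)  # share of each prediction that is right

print(round(overallAccuracy, 2))
print(round(cbind(recall, precision), 3))
```

The overall accuracy computed this way rounds to 0.87.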

print(xtable(data.frame(training_accuracy = 
                 accuracy(trainData$classe,predict(bestModel,newdata=trainData)),
            validation_accuracy = 
                 accuracy(validData$classe,predict(bestModel,newdata=validData)),
            testing_accuracy =
                 accuracy(testData$classe ,predict(bestModel,newdata=testData)))),
      type="html")

	training_accuracy	validation_accuracy	testing_accuracy
1	0.92	0.88	0.87
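The test accuracy also gives an estimate of out-of-sample error. The sketch below is an optional extra and was not part of the original analysis: it takes the counts from the test-set confusion matrix and adds an exact binomial confidence interval with binom.test from base R's stats package. The names correct, total, testAccuracy, outOfSampleError, and ci are invented here.

```r
# Counts taken from the test-set confusion matrix
correct <- 1025 + 631 + 576 + 545 + 638  # diagonal: correctly classified rows
total   <- 3926                          # all rows in the test set

testAccuracy     <- correct / total
outOfSampleError <- 1 - testAccuracy

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(correct, total)$conf.int

print(round(c(accuracy = testAccuracy, error = outOfSampleError), 3))
print(round(ci, 3))
```

The interval treats each test row as an independent trial, which is an assumption that may not hold for consecutive sensor readings.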

References

Data generously provided by Velloso et al.: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Machine Learning Data Science

Dann Hekman

November 18, 2015