This model was built for the Coursera Practical Machine Learning class project, part of the Data Science Specialization taught by Professors Jeff Leek, Roger Peng, and Brian Caffo.
The data come from sensors used while weightlifting, and we use the sensor readings to predict whether the participants performed the exercise correctly (classe = 'A') or made one of 4 common mistakes (classe = 'B', 'C', 'D', or 'E').
This study uses separate training, validation, and testing datasets, with 20% of the data in the test set, 16% in the validation set, and 64% of the data in the training set.
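library(caret)  # provides createDataPartition() and train()
set.seed(1337)
allData <- read.csv('./data/train.csv')
# Hold out 20% of all rows as the test set
inTest <- createDataPartition(y=allData$X,p=.2,list=FALSE)
testData <- allData[inTest,]
trainData <- allData[-inTest,]
# Hold out 20% of the remaining rows for validation (.8*.2 = 16% of allData)
inValid <- createDataPartition(y=trainData$X,p=.2,list=FALSE)
validData <- trainData[inValid,]
trainData <- trainData[-inValid,]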
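The arithmetic behind the 64/16/20 split can be checked with a small base R sketch. It uses simple random sampling via `sample()` rather than caret's `createDataPartition()`, and the row count `n` is a hypothetical value chosen only for illustration.

```r
# Hypothetical row count, used only to illustrate the split arithmetic.
n <- 1000
set.seed(1337)
idx <- seq_len(n)

# Hold out 20% of all rows for testing.
test_idx <- sample(idx, size = round(0.2 * n))
remaining <- setdiff(idx, test_idx)

# Hold out 20% of the remaining 80% for validation (0.8 * 0.2 = 16% overall).
valid_idx <- sample(remaining, size = round(0.2 * length(remaining)))
train_idx <- setdiff(remaining, valid_idx)

# Share of all rows that ends up in each set.
split_share <- c(train = length(train_idx),
                 valid = length(valid_idx),
                 test  = length(test_idx)) / n
print(split_share)
```

Because the validation rows are drawn from what is left after the test split, the three sets never overlap.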
To limit overfitting to outliers, sparsely populated variables were removed from the training set: a column was kept only if more than 75% of its values were non-blank and more than 75% were non-NA.
## Keep only columns where more than 75% of the values are not blank
trainData<-trainData[,sapply(trainData,function(x) (length(x[x != ""])>(0.75*nrow(trainData))))]
## Keep only columns where more than 75% of the values are not NA
trainData<-trainData[,sapply(trainData,function(x) (length(x[!is.na(x)])>(0.75*nrow(trainData))))]
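To make the filtering rule concrete, here is a minimal sketch on a toy data frame. The column names and values are hypothetical, and the blank and NA checks are folded into a single pass rather than the two passes used above.

```r
# Toy data frame (hypothetical values) with one dense and two sparse columns.
toy <- data.frame(
  dense        = c(1.2, 3.4, 5.6, 7.8),
  mostly_blank = c("", "", "", "x"),
  mostly_na    = c(NA, NA, 9.1, NA),
  stringsAsFactors = FALSE
)

# Share of populated (non-NA and non-blank) values in each column.
populated_share <- sapply(toy, function(x) mean(!is.na(x) & x != ""))

# Keep only the columns that are more than 75% populated.
filtered <- toy[, populated_share > 0.75, drop = FALSE]
print(names(filtered))
```

Using `drop = FALSE` keeps the result a data frame even when only one column survives the filter.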
Three-dimensional scatter plots were constructed to explore the data. Each plot shows the spatial movement recorded by the gyroscope of one of the 4 sensors.
Some of the graphs show that there is at least one outlier in the dataset that may affect the model. Were I doing regression, this observation would need to be reviewed more thoroughly, but because I’m using trees, I decided not to worry about this particular observation.
Models were evaluated on two key criteria:
1. Speed. Due to technological constraints (having an older computer and limited time), I did not pursue the random forest model because it took too long to run and evaluate.
2. Accuracy. Models were compared based on accuracy when run against the validation set.
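The speed criterion can be measured with base R's `system.time()`. The sketch below is only an illustration: `time_fit` is a hypothetical helper, and a linear model on the built-in `mtcars` data stands in for the much slower fits discussed here.

```r
# Hypothetical helper: run a model-fitting function and return elapsed seconds.
time_fit <- function(fit_fun) {
  unname(system.time(fit_fun())["elapsed"])
}

# Stand-in workload (an assumption): a linear model on a built-in dataset.
elapsed <- time_fit(function() lm(mpg ~ ., data = mtcars))
print(elapsed)
```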
Model 1 is a classification tree built with the rpart function from the rpart package.
library(rpart)
# Share of predictions that match the true labels
accuracy <- function(values,prediction){sum((prediction == values))/length(values)}
model1 <- rpart(classe~.,data=trainData,method="class")
data.frame(training_set = accuracy(trainData$classe,predict(model1,newdata=trainData,type="class")),
validation_set = accuracy(validData$classe,predict(model1,newdata=validData,type="class")))
## training_set validation_set
## 1 0.7539025 0.7515924
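As a quick check of what the accuracy helper computes, the sketch below applies an equivalent definition (using `mean()` instead of `sum()/length()`) to hypothetical label vectors.

```r
# Equivalent formulation of the accuracy helper: share of matching labels.
accuracy <- function(values, prediction) {
  mean(prediction == values)
}

# Hypothetical true and predicted labels sharing the same five classes.
classes   <- c("A", "B", "C", "D", "E")
actual    <- factor(c("A", "B", "C", "A", "E"), levels = classes)
predicted <- factor(c("A", "B", "D", "A", "A"), levels = classes)

# Three of the five predictions match the true labels.
acc <- accuracy(actual, predicted)
print(acc)
```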
Model 2 is another tree model, a ctree fitted through the caret package (caret loads the party package to build it).
model2 <- train(classe~.,data=trainData,method="ctree")
data.frame(training_set = accuracy(trainData$classe,predict(model2,newdata=trainData)),
validation_set = accuracy(validData$classe,predict(model2,newdata=validData)))
## training_set validation_set
## 1 0.9219497 0.8757962
Model 3 is also a ctree model, but it uses caret to apply principal component analysis (PCA) as a preprocessing step.
model3 <- train(classe~.,data=trainData,method="ctree", preProcess="pca")
data.frame(training_set = accuracy(trainData$classe,predict(model3,newdata=trainData)),
validation_set = accuracy(validData$classe,predict(model3,newdata=validData)))
## training_set validation_set
## 1 0.8660401 0.7761146
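What a PCA preprocessing step does can be sketched with base R's `prcomp()`. This is not caret's internal code: the built-in `iris` measurements stand in for the sensor readings, and the 95% variance threshold is an assumption (I believe it matches caret's default, but treat that as unverified here).

```r
# Built-in numeric data stand in for the sensor readings (an assumption).
x <- as.matrix(iris[, 1:4])

# Center and scale each column, then rotate onto the principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative share of total variance explained by the leading components.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components reaching the assumed 95% threshold.
n_comp <- which(var_explained >= 0.95)[1]
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
print(dim(scores))
```

A tree fitted on `scores` sees fewer, uncorrelated inputs, but each input is a blend of the original variables, which may make axis-aligned splits less effective.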
Of the 3 models evaluated, Model 2 (ctree without PCA) performed best on the validation set.
library(xtable)
bestModel <- model2
print(xtable(table(trainData$classe,predict(bestModel,newdata=trainData)),caption="Performance in Training Set"),type="html")
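The comparison can be reproduced from the accuracies reported above. In this sketch the model labels are hypothetical names, and the `gap` column (training minus validation accuracy) is an added rough indicator of overfitting.

```r
# Accuracies copied from the output above; the model labels are hypothetical.
results <- data.frame(
  model          = c("model1_rpart", "model2_ctree", "model3_ctree_pca"),
  training_set   = c(0.7539025, 0.9219497, 0.8660401),
  validation_set = c(0.7515924, 0.8757962, 0.7761146),
  stringsAsFactors = FALSE
)

# Gap between training and validation accuracy for each model.
results$gap <- results$training_set - results$validation_set

# Select the model with the highest validation accuracy.
best <- results$model[which.max(results$validation_set)]
print(results)
print(best)
```

Selecting on validation accuracy rather than training accuracy keeps the test set untouched until the final evaluation.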
| Actual (row) / Predicted (column) | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 3429 | 74 | 31 | 24 | 20 |
| B | 92 | 2201 | 66 | 26 | 52 |
| C | 33 | 106 | 1926 | 71 | 41 |
| D | 48 | 65 | 52 | 1850 | 39 |
| E | 20 | 64 | 24 | 32 | 2170 |
print(xtable(table(testData$classe,predict(bestModel,newdata=testData)),caption="Performance in Test Set"),type="html")
| Actual (row) / Predicted (column) | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 1025 | 54 | 10 | 11 | 15 |
| B | 50 | 631 | 31 | 21 | 25 |
| C | 18 | 55 | 576 | 20 | 20 |
| D | 17 | 35 | 32 | 545 | 18 |
| E | 9 | 37 | 20 | 13 | 638 |
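The test-set table can be turned into summary metrics directly. The sketch below re-enters the counts from the table (rows are actual classes and columns are predicted classes, as produced by `table(actual, predicted)`) and derives overall accuracy plus per-class recall and precision. The per-class metrics are an addition for illustration and are not part of the original analysis.

```r
# Test-set confusion matrix copied from the table above.
test_cm <- matrix(
  c(1025,  54,  10,  11,  15,
      50, 631,  31,  21,  25,
      18,  55, 576,  20,  20,
      17,  35,  32, 545,  18,
       9,  37,  20,  13, 638),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases (the diagonal) over all cases.
overall_accuracy <- sum(diag(test_cm)) / sum(test_cm)

# Recall: share of each actual class that was predicted correctly.
recall <- diag(test_cm) / rowSums(test_cm)

# Precision: share of each predicted class that was actually that class.
precision <- diag(test_cm) / colSums(test_cm)

print(round(overall_accuracy, 2))
print(round(rbind(recall, precision), 2))
```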
print(xtable(data.frame(training_accuracy =
accuracy(trainData$classe,predict(bestModel,newdata=trainData)),
validation_accuracy =
accuracy(validData$classe,predict(bestModel,newdata=validData)),
testing_accuracy =
accuracy(testData$classe ,predict(bestModel,newdata=testData)))),
type="html")
| | training_accuracy | validation_accuracy | testing_accuracy |
|---|---|---|---|
| 1 | 0.92 | 0.88 | 0.87 |
Data generously provided by Velloso et al.: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz3sA3HLCEL