This study uses data from wearable device sensors to predict human activity. A combined model built from three base learners (randomForest, gbm, treebag) achieves a misclassification error rate of less than 0.5%.
This study uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information on this study is available from the website: http://groupware.les.inf.puc-rio.br/har.
The training data for this project is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. The training data is later separated into a training set and a test set. The validation data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. The data sets contain 5 classes of the lift execution to be predicted: exactly according to the specification (A), throwing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D), and throwing the hips to the front (E).
Both data sets were downloaded on July 15, 2015.
The goal of this study is to predict the human activity (“classe” variable) on the 20 validation data cases.
This report describes how the data are cleaned, how the base models and the combined model are built and evaluated, and what the estimated out-of-sample error is. The combined model is then used to predict the activity of the 20 cases in the validation data set.
library(caret)  # provides createDataPartition(), train(), confusionMatrix()
data <- read.csv("../data/pml-training.csv")
dim(data)
## [1] 19622 160
The training data contains 19622 rows and 160 columns.
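As a quick optional check (not part of the original cleaning pipeline), the distribution of the outcome variable can be inspected; a minimal sketch using the data frame loaded above:
# show how the observations are distributed over the five classes
table(data$classe)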
After exploratory data analysis, several actions are taken to clean the data:
# remove identifier column 'X'
data <- data[,-1]
# remove columns with more than 50% NAs
data <- data[,colSums(is.na(data)) < (nrow(data)*0.50)]
# remove columns that contain more than 50% empty cells
data <- data[,colSums(data=="") < (nrow(data)*0.50)]
dim(data)
## [1] 19622 59
The resulting training data set now contains only 59 columns.
The training data contains about 20000 rows, and training prediction models on a data set of this size takes a long time. Therefore a random sample of 5000 rows is taken for further processing. Exploratory analysis showed that this sample is sufficient for building the different prediction models.
set.seed(98765) # set seed for reproducibility
sampleLength <- 5000
dataSample <- data[sample(nrow(data),sampleLength),]
The training data is then sub-divided into a training set (75%) and a test set (25%). The training set is used to train the models. The test set is used to estimate the out-of-sample error of the models.
inTrain = createDataPartition(dataSample$classe, p = 0.75)[[1]]
training = dataSample[ inTrain,]
test = dataSample[-inTrain,]
# dim(training); dim(test)
Three models will be used for training: randomForest, gbm, and treebag (exploratory analysis showed that these models have low error rates).
All three models are trained with standardized and imputed data. Cross validation with k=5 folds is used for the randomForest and treebag models.
modRf <- train(classe~., method="rf", data=training, preProcess=c("center","scale","knnImpute"),
               trControl=trainControl(method="cv", number=5))
# modRf$finalModel$confusion
# gbm
modGbm <- train(classe~., method="gbm", data=training, verbose=FALSE, preProcess=c("center","scale","knnImpute"))
# modGbm$finalModel
modTreebag <- train(classe~., method="treebag", data=training, preProcess=c("center","scale","knnImpute"),
                    trControl=trainControl(method="cv", number=5))
# modTreebag$finalModel
To get an estimate of the out-of-sample error, the different models are evaluated on the test set. A confusion matrix is used to determine the error rate (1 - accuracy).
predictRf <- predict(modRf, test)
cmRf <- confusionMatrix(predictRf, test$classe)
acc <- cmRf$overall[[1]]
errorRf <- 1-acc
errorRf
## [1] 0.00400641
The estimated out-of-sample error of the randomForest model is 0.0040064. The 95% confidence interval for the accuracy is (0.9906753, 0.9986979).
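The interval reported above is the accuracy confidence interval that confusionMatrix() computes; a minimal sketch of how it can be extracted from the result object:
# the 'overall' element of a caret confusionMatrix holds the accuracy
# estimate together with its 95% confidence bounds
cmRf$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]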
predictGbm <- predict(modGbm, test)
cmGbm <- confusionMatrix(predictGbm, test$classe)
acc <- cmGbm$overall[[1]]
errorGbm <- 1-acc
errorGbm
## [1] 0.004807692
The estimated out-of-sample error of the gbm model is 0.0048077. The 95% confidence interval for the accuracy is (0.9895653, 0.9982337).
predictTreebag <- predict(modTreebag, test)
cmTreebag <- confusionMatrix(predictTreebag, test$classe)
acc <- cmTreebag$overall[[1]]
errorTreebag <- 1-acc
errorTreebag
## [1] 0.006410256
The estimated out-of-sample error of the treebag model is 0.0064103. The 95% confidence interval for the accuracy is (0.9874085, 0.9972286).
The three base models are now used to train a combined model (method: randomForest). The data set used for this contains the prediction results of the three base models on the test set.
combResults <- data.frame(predictRf, predictGbm, predictTreebag, classe=test$classe)
modComb <- train(classe ~.,method="rf",data=combResults)
predictComb <- predict(modComb,combResults)
cmComb <- confusionMatrix(predictComb, test$classe)
acc <- cmComb$overall[[1]]
errorComb <- 1-acc
errorComb
## [1] 0.002403846
The estimated out-of-sample error of the combined model is 0.0024038. The 95% confidence interval for the accuracy is (0.9929912, 0.999504).
The out-of-sample errors of the different models, calculated on the test data, are summarized below.
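A short sketch, assuming the error variables from the chunks above are still in the workspace, that collects the estimates in one data frame for comparison:
# gather the estimated out-of-sample errors of all models
errorSummary <- data.frame(model = c("randomForest", "gbm", "treebag", "combined"),
                           error = c(errorRf, errorGbm, errorTreebag, errorComb))
errorSummary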
As expected, the combined model has the smallest estimated out-of-sample error. Note, however, that the combined model is trained and evaluated on the same test-set predictions, so this estimate may be slightly optimistic.
The combined model is used to predict the classe of the validation data cases.
validation=read.csv("../data/pml-testing.csv")
dim(validation)
## [1] 20 160
The validation data set contains 20 test cases.
As the combined model is based on the prediction results of the three base models, the activity is first predicted for the base models.
predRf <- predict(modRf, validation)
predGbm <- predict(modGbm, validation)
predTreebag <- predict(modTreebag, validation)
The prediction results are used to form a new data set for the combined model, which is then used to predict the final value of the classe variable for submission.
combResults <- data.frame(predictRf=predRf, predictGbm=predGbm, predictTreebag=predTreebag)
predComb <- predict(modComb,combResults)
predComb
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction result for the 20 validation data cases: B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B
The prediction results are saved to disk for submission, one file for each of the 20 cases.
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("../submission/problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(predComb)