Summary

This study uses data from wearable device sensors to predict human activity. A combined model stacking three base learners (randomForest, gbm, treebag) achieves an estimated misclassification error rate of less than 0.5%.

Introduction

This study uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website: http://groupware.les.inf.puc-rio.br/har.

The training data for this project is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. It is later split into a training set and a test set. The validation data is available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. The datasets contain 5 execution classes to be predicted: exactly according to the specification (A), throwing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D), and throwing the hips to the front (E).

Both data sets were downloaded on July 15, 2015.
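For reproducibility, a minimal sketch of how the files can be fetched (assuming the ../data/ directory used by the code below already exists):

trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
validUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainUrl, destfile = "../data/pml-training.csv")
download.file(validUrl, destfile = "../data/pml-testing.csv")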

The goal of this study is to predict the human activity (“classe” variable) on the 20 validation data cases.

This report describes how the training data is cleaned, sampled, and split, how three base models and a combined model are trained, and how their out-of-sample errors are estimated.

The combined model is then used to predict the activity of the 20 cases in the validation data set.

Loading the training data

library(caret)  # provides createDataPartition, train, and confusionMatrix
data <- read.csv("../data/pml-training.csv")
dim(data)
## [1] 19622   160

The training data contains 19622 rows and 160 columns.

Cleaning the training data

After exploratory data analysis, several actions are taken to clean the data:

# remove identifier column 'X'
data <- data[,-1]

# remove columns with more than 50% NAs
data <- data[,colSums(is.na(data)) < (nrow(data)*0.50)]

# remove columns with more than 50% empty cells
data <- data[,colSums(data=="", na.rm=TRUE) < (nrow(data)*0.50)]
dim(data)
## [1] 19622    59

The resulting training data set now contains only 59 columns.

Get a random sample of the training data

The training data contains about 20000 rows. Training prediction models on a dataset this large takes a long time, so a random sample of 5000 rows is taken for further processing. Exploratory analysis showed that this sample is sufficient for building the different prediction models.

set.seed(98765)  # set seed for reproducibility
sampleLength <- 5000
dataSample <- data[sample(nrow(data),sampleLength),]

Sub-divide the training data into a training set and test set

The training data is then sub-divided into a training set (75%) and a test set (25%). The training set is used to train the models. The test set is used to estimate the out-of-sample error of the models.

inTrain <- createDataPartition(dataSample$classe, p = 0.75)[[1]]
training <- dataSample[inTrain,]
test <- dataSample[-inTrain,]
# dim(training); dim(test)

Training three base models

Three models will be used for training: randomForest, gbm, and treebag (exploratory analysis showed that these models have low error rates).

All three models are trained on standardized and imputed data. Cross-validation with k=5 is used for the randomForest and treebag models.
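These shared settings could also be defined once and reused across the train() calls (a sketch; the calls below spell the same options out inline):

# preprocessing: standardize, then impute missing values via k-nearest neighbours
preProc <- c("center", "scale", "knnImpute")
# resampling: 5-fold cross-validation
cvCtrl <- trainControl(method = "cv", number = 5)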

randomForest model

modRf <- train(classe~., method="rf", data=training, preProcess=c("center","scale","knnImpute"),
               trControl=trainControl(method="cv", number=5))
# modRf$finalModel$confusion

gbm model

# gbm
modGbm <- train(classe~., method="gbm", data=training, verbose=FALSE, preProcess=c("center","scale","knnImpute"))
# modGbm$finalModel

treebag model

modTreebag <- train(classe~., method="treebag", data=training, preProcess=c("center","scale","knnImpute"),
                    trControl=trainControl(method="cv", number=5))
# modTreebag$finalModel

Evaluating the different models

To get an estimate of the out-of-sample error, the different models are evaluated on the test set. A confusion matrix is used to determine the error rate (1 - accuracy).
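Since the same computation is repeated for each model, a small helper (hypothetical, not part of the original analysis) could extract the error rate from a caret confusionMatrix object:

# hypothetical helper: error rate = 1 - accuracy of a confusionMatrix object
errorRate <- function(cm) 1 - cm$overall[["Accuracy"]]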

randomForest model

predictRf <- predict(modRf, test)
cmRf <- confusionMatrix(predictRf, test$classe)
acc <- cmRf$overall[[1]]
errorRf <- 1-acc
errorRf
## [1] 0.00400641

The estimated out-of-sample error of the randomForest model is 0.0040064. The 95% CI for its accuracy is (0.9906753, 0.9986979).
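The interval can be read directly from the confusionMatrix object, which stores the bounds of the accuracy CI in its overall element:

# 95% CI of the accuracy, as computed by confusionMatrix()
cmRf$overall[c("AccuracyLower", "AccuracyUpper")]
## AccuracyLower AccuracyUpper
##     0.9906753     0.9986979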

gbm model

predictGbm <- predict(modGbm, test)
cmGbm <- confusionMatrix(predictGbm, test$classe)
acc <- cmGbm$overall[[1]]
errorGbm <- 1-acc
errorGbm
## [1] 0.004807692

The estimated out-of-sample error of the gbm model is 0.0048077. The 95% CI for its accuracy is (0.9895653, 0.9982337).

treebag model

predictTreebag <- predict(modTreebag, test)
cmTreebag <- confusionMatrix(predictTreebag, test$classe)
acc <- cmTreebag$overall[[1]]
errorTreebag <- 1-acc
errorTreebag
## [1] 0.006410256

The estimated out-of-sample error of the treebag model is 0.0064103. The 95% CI for its accuracy is (0.9874085, 0.9972286).

Fitting a combined model

The three base models are now stacked into a combined model (method: randomForest). Its training dataset contains the prediction results of the three base models on the test set.

combResults <- data.frame(predictRf, predictGbm, predictTreebag, classe=test$classe)
modComb <- train(classe ~.,method="rf",data=combResults)

predictComb <- predict(modComb,combResults)
cmComb <- confusionMatrix(predictComb, test$classe)
acc <- cmComb$overall[[1]]
errorComb <- 1-acc
errorComb
## [1] 0.002403846

The estimated error of the combined model is 0.0024038; the 95% CI for its accuracy is (0.9929912, 0.999504). Note that the combined model is both trained and evaluated on the same test-set predictions, so this estimate is likely optimistic compared to a true out-of-sample error; a separate hold-out split would give a more reliable estimate.

Comparison of the models

The out-of-sample error estimates of the different models, calculated on the test data, are:

model          estimated error
randomForest   0.0040064
gbm            0.0048077
treebag        0.0064103
combined       0.0024038

As expected, the combined model has the smallest error estimate (subject to the caveat above about its evaluation).
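The same comparison can be produced from the variables already computed (a small sketch):

# collect the error estimates into one table, sorted best-first
errors <- data.frame(model = c("randomForest", "gbm", "treebag", "combined"),
                     error = c(errorRf, errorGbm, errorTreebag, errorComb))
errors[order(errors$error), ]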

Predicting activity on the validation data

The combined model is used to predict the classe of the validation data cases.

Loading the validation data

validation <- read.csv("../data/pml-testing.csv")
dim(validation)
## [1]  20 160

The validation data set contains 20 test cases.

Predicting the activity of the validation data cases

As the combined model is based on the prediction results of the three base models, the activity is first predicted with each base model.

predRf <- predict(modRf, validation)
predGbm <- predict(modGbm, validation)
predTreebag <- predict(modTreebag, validation)

The prediction results form a new dataset for the combined model, which then predicts the final values of the classe variable used for submission.

# column names must match those used to train the combined model
combResults <- data.frame(predictRf=predRf, predictGbm=predGbm, predictTreebag=predTreebag)
predComb <- predict(modComb,combResults)
predComb
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

The prediction result for the 20 validation data cases: B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B

Saving the predictions to file

The prediction results are saved to disk for submission, one file for each of the 20 cases.

# write each prediction to its own file for submission
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("../submission/problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(predComb)
