Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity. These types of devices are part of the quantified self movement. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, collected while they performed barbell lifts correctly and incorrectly, in order to predict whether or not these lifts were done correctly. This is the “classe” variable in the data sets.
We would like to thank the ‘Human Activity Recognition’ group of the Informatics Department at the Pontifical Catholic University of Rio de Janeiro for making this dataset available for this study.
Though the assignment explicitly provides both training and test sets, for the purpose of this study we will partition only the training set to generate our own test set, and the original test set will be used as the validation set.
The training/test set can be found here, and the validation set can be downloaded from here. We will download both locally to avoid network overhead, and explore and process the training set first. Note that it will be necessary to transform the classe variable into a factor, and that doing this in the original training set will cause the generated test set to have it as a factor as well.
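library(caret) # createDataPartition, train, confusionMatrix
library(doMC)  # registerDoMC, used later for parallel training
# Treat spreadsheet error strings and empty cells as NA.
dat <- read.csv('./pml-training.csv',
                header = TRUE,
                sep = ",",
                na.strings = c("#DIV/0!", "NA", ""),
                strip.white = TRUE,
                stringsAsFactors = FALSE)
validation <- read.csv('./pml-testing.csv',
                       header = TRUE,
                       sep = ",",
                       na.strings = c("#DIV/0!", "NA", ""),
                       strip.white = TRUE,
                       stringsAsFactors = FALSE)
# The outcome must be a factor for classification.
dat$classe <- as.factor(dat$classe)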
As explained throughout the course, the first and foremost step in building a statistical learning model is to partition the data. We will have 70% of the original training set as training, and 30% of it as test set. Note that the validation set will not be touched beyond this point.
inTrain <- createDataPartition(y=dat$classe, p=0.7, list=F)
training <- dat[inTrain,] # Don't forget the commas!
test <- dat[-inTrain,] # Don't forget the commas!
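As a minimal sketch of what the stratified split does, the following uses the built-in iris data rather than our data set. The seed value and the names idx, trainPart and testPart are assumptions made for illustration only; the original analysis does not state a seed, so its exact rows cannot be reproduced from this.

```r
library(caret)

set.seed(1234)  # hypothetical seed, not taken from the original analysis

# Stratified 70/30 split on the outcome, as done above for classe.
idx <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
trainPart <- iris[idx, ]
testPart  <- iris[-idx, ]

# Because the sampling is done within each class, the class
# proportions are preserved in both parts.
prop.table(table(trainPart$Species))
prop.table(table(testPart$Species))
```

Fixing a seed before the partition is what would make the row counts and later accuracy figures repeatable from run to run.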
We end up with 13737 rows for the training set, 5885 for the test set, and 20 for the validation set. Upon first inspection, we find that many columns consist mostly of NA observations, which means we should only consider those columns that carry enough information. We have determined that the relevant variables are those with at least 30% non-NA values. This and subsequent transformations will only be applied to the training set, since both the test and validation sets must remain untouched.
redTraining <- training[,colMeans(!is.na(training)) >= 0.3]
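To make the filter above concrete, here is a small sketch on a made-up data frame. The columns a, b and c and the name nonNaShare are hypothetical and only illustrate how colMeans(!is.na(...)) yields the share of non-NA values per column.

```r
# Toy data frame (hypothetical values, not from the accelerometer data).
toy <- data.frame(
  a = c(1, 2, 3, 4),      # 100% non-NA
  b = c(NA, NA, NA, 1),   # 25% non-NA
  c = c(NA, 2, NA, 4)     # 50% non-NA
)

# !is.na() gives a logical matrix; its column means are the non-NA shares.
nonNaShare <- colMeans(!is.na(toy))

# Keep only the columns with at least 30% non-NA values.
kept <- toy[, nonNaShare >= 0.3]
names(kept)  # "a" "c"
```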
We also remove the X column since, being just a row id, it takes a distinct value on every row and this would have an effect on our prediction. We’ll also remove the user_name column since, being the subject’s name, it doesn’t add information to the model. Finally, we remove the timestamp and window columns, since they are time-related variables and we do not want a prediction model on a time series. Together these are the first seven columns of the reduced training set.
finalTraining <- redTraining[,-c(1:7)]
With this we have gone from 160 columns to 53 (52 predictors plus classe). We can now proceed to feature and model selection.
We choose a Random Forest model because it has built-in feature selection. According to the caret package documentation, its feature selection algorithm is coupled with the parameter estimation algorithm, making it faster than if the features were searched for externally.
Also, even though some authors state that RFs do not require cross-validation, we will nonetheless train our RF with 5-fold, single-pass cross-validation in order to reduce any potential bias and to get a more accurate estimate of the out-of-sample error.
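For comparison only, the claim that RFs can do without cross-validation rests on the out-of-bag (OOB) estimate, which caret exposes through trainControl(method = "oob"). The sketch below is not the approach used in this study; it runs on the built-in iris data with an assumed seed and a fixed mtry, and the names oobControl and oobFit are hypothetical.

```r
library(caret)
library(randomForest)

set.seed(1234)  # hypothetical seed, for repeatability of this sketch only

# Use the forest's own out-of-bag samples instead of k-fold resampling.
oobControl <- trainControl(method = "oob")

oobFit <- train(Species ~ ., data = iris,
                method = "rf",
                trControl = oobControl,
                tuneGrid = data.frame(mtry = 2))  # single candidate value

# Accuracy and Kappa here are OOB estimates, not cross-validated ones.
oobFit$results
```

The OOB route fits the forest once per candidate mtry, so it is cheaper than 5-fold cross-validation, at the cost of relying on a single resampling scheme.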
registerDoMC(cores = 3) # Register 3 cores for parallel processing.
tControl <- trainControl(method='cv', number = 5, allowParallel = T) # 5-fold cross-validation settings.
ptm <- proc.time()
rfModel <- train(classe ~ ., data = finalTraining, method='rf', trControl=tControl, importance = T) # model building
finalTime <- proc.time() - ptm
rfModel
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10989, 10990, 10990, 10990, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9903911 0.9878443 0.001736499 0.002197175
## 27 0.9895903 0.9868326 0.001533900 0.001939395
## 52 0.9809274 0.9758708 0.002719702 0.003442569
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
We can also see that the Random Forest algorithm is computationally expensive by looking at the following timing, which shows that on an Intel quad-core i7 with 16GB RAM and an SSD the model took about 12 minutes to train (714 seconds elapsed), even though we registered 3 cores for parallel execution. Doing it with a single core increases that time to 37 minutes.
finalTime
## user system elapsed
## 1114.083 9.257 714.307
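The figures in the timing output can be turned into minutes and into a rough measure of how well the three cores were used. This is only a back-of-the-envelope sketch: the variable names are ours, and it assumes the printed user and system values already include the time spent in the worker processes.

```r
# Values copied from the proc.time() output above (seconds).
userSec    <- 1114.083
systemSec  <- 9.257
elapsedSec <- 714.307

# Wall-clock time in minutes.
elapsedMin <- elapsedSec / 60

# CPU seconds consumed per second of wall-clock time; 3 would mean
# all three registered cores were busy the whole time.
cpuToWall <- (userSec + systemSec) / elapsedSec

round(elapsedMin, 1)  # 11.9
round(cpuToWall, 2)   # 1.57
```

A ratio of about 1.57 suggests that, on average, fewer than two of the three registered cores were kept busy during training.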
According to the variable importance plot, RF’s built-in feature selection has determined that the variables that contribute most to decreasing node impurity are roll_belt, yaw_belt, magnet_dumbbell_z and magnet_dumbbell_y. This suggests that it would be possible to build a random forest with only these variables, likely at the expense of a small decrease in accuracy and a small increase in OOB error.
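A ranking of this kind can be obtained from a fitted forest with the randomForest package's importance() and varImpPlot() functions. The sketch below is not the code that produced our plot; it fits a small forest on the built-in iris data with an assumed seed, and the names fit and imp are hypothetical.

```r
library(randomForest)

set.seed(1234)  # hypothetical seed, for repeatability of this sketch only

# importance = TRUE asks the forest to record variable importance measures.
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 2 selects the mean decrease in node impurity (Gini index).
imp <- importance(fit, type = 2)

# Variables sorted from most to least important.
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# The same ranking as a dot chart.
varImpPlot(fit, type = 2)
```

For a model trained through caret, the underlying forest is available as the finalModel element of the train object, so the same two calls could be applied to it.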
trainingConf <- rfModel$finalModel$confusion
trainingConf
## A B C D E class.error
## A 3903 2 0 0 1 0.0007680492
## B 13 2641 4 0 0 0.0063957863
## C 0 17 2376 3 0 0.0083472454
## D 0 0 34 2216 2 0.0159857904
## E 0 0 1 9 2515 0.0039603960
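From the fitted model above, averaging the per-class errors in the confusion matrix gives an error of about 0.7%, that is, an accuracy of 99.29% for mtry = 2 (the number of variables sampled as candidates at each split). Note that randomForest builds this confusion matrix from out-of-bag samples, so these figures are not a pure in-sample measure.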
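The 0.7% figure can be re-derived from the counts printed above. The sketch below re-enters the confusion matrix by hand (rows are the true classes) and computes both the unweighted mean of the per-class errors and the overall error rate; the variable names are ours.

```r
# Confusion matrix of the final model, copied from the output above.
conf <- matrix(c(3903,    2,    0,    0,    1,
                   13, 2641,    4,    0,    0,
                    0,   17, 2376,    3,    0,
                    0,    0,   34, 2216,    2,
                    0,    0,    1,    9, 2515),
               nrow = 5, byrow = TRUE,
               dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Per-class error: share of each true class that was misclassified.
classError <- 1 - diag(conf) / rowSums(conf)

# Unweighted mean of the five class errors.
meanClassError <- mean(classError)

# Overall error: all misclassified rows over all rows.
overallError <- 1 - sum(diag(conf)) / sum(conf)

round(100 * meanClassError, 2)  # 0.71
round(100 * overallError, 2)    # 0.63
```

The two numbers differ because the classes are not equally sized: class D, which has the highest error, carries the same weight as class A in the unweighted mean but fewer rows in the overall rate.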
We will now apply the model to the test set and explore its performance by looking at the out-of-sample error and the accuracy. Remember that our process of stripping variables was only applied to the training set.
pred <- predict(rfModel, test)
testConf <- confusionMatrix(pred, test$classe)
testConf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 10 0 0 0
## B 1 1125 8 0 0
## C 0 4 1018 18 0
## D 0 0 0 945 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9927
## 95% CI : (0.9902, 0.9947)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9908
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9877 0.9922 0.9803 0.9991
## Specificity 0.9976 0.9981 0.9955 0.9998 0.9998
## Pos Pred Value 0.9941 0.9921 0.9788 0.9989 0.9991
## Neg Pred Value 0.9998 0.9971 0.9983 0.9962 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1912 0.1730 0.1606 0.1837
## Detection Prevalence 0.2860 0.1927 0.1767 0.1607 0.1839
## Balanced Accuracy 0.9985 0.9929 0.9938 0.9900 0.9994
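The headline figures in this output can be checked by hand from the table of counts. The sketch below re-enters the table (rows are predictions, columns are the reference classes) and recomputes the accuracy and the error rate. The confidence interval is computed with an exact binomial test, which we assume is comparable to the interval reported above; the variable names are ours.

```r
# Test-set confusion table, copied from the output above.
testTab <- matrix(c(1673,   10,    0,    0,    0,
                       1, 1125,    8,    0,    0,
                       0,    4, 1018,   18,    0,
                       0,    0,    0,  945,    1,
                       0,    0,    0,    1, 1081),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(Prediction = LETTERS[1:5],
                                  Reference  = LETTERS[1:5]))

correct  <- sum(diag(testTab))  # correctly classified rows
total    <- sum(testTab)        # all test rows
accuracy <- correct / total

# Exact (Clopper-Pearson) 95% interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

round(accuracy, 4)              # 0.9927
round(100 * (1 - accuracy), 2)  # 0.73
round(ci, 4)
```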
As we can see, we’ve achieved an accuracy of 99.27% and an out-of-sample error of 0.73% on the test set. These figures fall closely in line with the 99.29% accuracy and 0.7% error obtained from the fitted model. The general expectation is for the out-of-sample error to be noticeably greater than the in-sample error. The small gap between 0.73% and the model’s 0.7% sample error is curious, but not entirely unexpected: the confusion matrix stored in the final model is computed from out-of-bag samples, so the 0.7% figure is itself close to an out-of-sample estimate.
We have called the original test set the ‘validation set’, and even though we don’t have the correct classes for it, we will nonetheless apply our model to it.
blindPred <- predict(rfModel, validation)
blindPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
We will convert the predictions to character type and generate 20 text files, one per case, each containing the individual prediction for submission purposes.
finalBlindPred <- as.character(blindPred)
pml_write_files <- function(x) {
  # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(finalBlindPred)
The accuracy of these predictions is withheld by the author.