The goal of this Human Activity Recognition project is to predict the manner in which a person performs a weight lifting exercise: specifically, to detect from sensor data whether the Unilateral Dumbbell Biceps Curl is being performed according to the specification. The data come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website [http://groupware.les.inf.puc-rio.br/har] (see the section on the Weight Lifting Exercise Dataset). The final task is to predict the classification of the exercise for each observation in a testing set.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
Read more: [http://groupware.les.inf.puc-rio.br/har]
Download and parse the raw data from the provided URLs. This is part of a Coursera class exercise; the datasets are modified from the original study and hosted on CloudFront.
# Packages used throughout this analysis
library(caret)    # createDataPartition, createFolds, train, confusionMatrix
library(dplyr)    # select, %>%
library(tictoc)   # tic/toc timing
test_datafile <- "./data/pml-testing.csv"
train_datafile <- "./data/pml-training.csv"
train_fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dir.create("data", showWarnings=FALSE)
if(!file.exists(train_datafile)) {
download.file(train_fileURL, destfile=train_datafile)
}
if(!file.exists(test_datafile)) {
download.file(test_fileURL, destfile=test_datafile)
}
raw_train <- read.csv(train_datafile)
raw_test <- read.csv(test_datafile)
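As a quick sanity check before cleaning, the following minimal sketch (using only the objects defined above) confirms the dimensions and the new_window counts referenced below.
# Inspect raw dimensions and the new_window indicator (expected: 19622 x 160, mostly "no")
dim(raw_train)
table(raw_train$new_window)
table(raw_test$new_window)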
Clear out the columns that are blank. Inspection shows there are \(19622\) raw observations with \(160\) attributes. The lists M1 and M2 hold the names of the columns that are missing data; in every case the number of missing values is \(19216\), which corresponds to the rows where new_window is "no" (the summary columns are only populated on the \(406\) rows where new_window is "yes"). A view of the testing data shows new_window is "no" in all cases, so it is safe to discard all of these attributes.
We also discard non-predictive attributes such as the row index, user name, timestamps, and window columns.
# M1: columns containing NA values; M2: columns containing blank strings
M1 <- sapply(raw_train, function(x) sum(is.na(x))); M1 <- M1[M1 > 0]
M2 <- sapply(raw_train, function(x) sum(x == "", na.rm=TRUE)); M2 <- M2[M2 > 0]
# Drop identifier/timestamp/window columns and the mostly-empty summary columns
training_validation <- select(raw_train, -c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window)) %>%
    select( -one_of(names(M1))) %>%
    select( -one_of(names(M2)))
testing <- select(raw_test, -c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window)) %>%
    select( -one_of(names(M1))) %>%
    select( -one_of(names(M2)))
For our purposes we will be using cross-validation, but we still want to hold out a set of data where the outcome is known so we can measure the accuracy of our models. We split off 15% of the data for this validation set.
# Hold out 15% of the data as a validation set; the remaining 85% is used for model building
inTrain = createDataPartition(training_validation$classe, p = 0.85)[[1]]
training = training_validation[ inTrain,]
validation = training_validation[-inTrain,]
# Create 4 cross-validation folds on the training portion (folds hold the held-out indices)
flds <- createFolds(training$classe, k = 4, list = TRUE, returnTrain = FALSE)
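A quick check of the resulting split sizes (a minimal sketch using the objects just created) confirms the partition behaved as intended.
# Expect roughly an 85/15 train/validation split and four folds of similar size
nrow(training); nrow(validation)
sapply(flds, length)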
We perform a 4-fold cross-validation by hand using caret's built-in train method: train four parallel random forest (parRF) models, each on three of the folds, and predict against the held-out fold.
tic("rf model1")
rf_fit1 <- train(classe ~ ., data=training[-flds[[1]], ], method="parRF")
rf_pred1 <- predict(rf_fit1, newdata=training[ flds[[1]],])
toc()
rf model1: 438.073 sec elapsed
tic("rf model2")
rf_fit2 <- train(classe ~ ., data=training[-flds[[2]], ], method="parRF")
rf_pred2 <- predict(rf_fit2, newdata=training[ flds[[2]],])
toc()
rf model2: 426.471 sec elapsed
tic("rf model3")
rf_fit3 <- train(classe ~ ., data=training[-flds[[3]], ], method="parRF")
rf_pred3 <- predict(rf_fit3, newdata=training[ flds[[3]],])
toc()
rf model3: 422.278 sec elapsed
tic("rf model4")
rf_fit4 <- train(classe ~ ., data=training[-flds[[4]], ], method="parRF")
rf_pred4 <- predict(rf_fit4, newdata=training[ flds[[4]],])
toc()
rf model4: 427.734 sec elapsed
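The four fits above are written out long-hand; they could equally be produced in a loop. A minimal sketch, assuming the same training and flds objects (the names rf_fits and rf_preds are hypothetical):
# Fit one parRF model per fold (trained on the other three folds) and predict on the held-out fold
rf_fits  <- lapply(flds, function(idx) train(classe ~ ., data = training[-idx, ], method = "parRF"))
rf_preds <- Map(function(fit, idx) predict(fit, newdata = training[idx, ]), rf_fits, flds)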
Build a combined (stacked) model: generate predictions from each of the four CV models on the entire training set, then train a new model on those predictions. We score this combined model on the validation set and use it to generate the prediction on the test set.
combined_data <- data.frame(
rf1_pred = predict(rf_fit1, newdata=training),
rf2_pred = predict(rf_fit2, newdata=training),
rf3_pred = predict(rf_fit3, newdata=training),
rf4_pred = predict(rf_fit4, newdata=training),
classe = training$classe)
tic("train our combined model")
combined_fit <- train(classe ~ ., method="parRF", data=combined_data)
toc()
train our combined model: 75.656 sec elapsed
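caret stores a resampling summary inside the fitted object; a quick inspection sketch (default bootstrap resampling, since no trainControl was supplied for the combined model):
# Resampling accuracy of the stacked model for each mtry value tried
combined_fit$results[, c("mtry", "Accuracy", "Kappa")]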
Let's review the results for each of the 4 models on their cross-validation folds. These scores represent the accuracy on the held-out fourth fold when training on the other three parts of the dataset.
# Helper: extract accuracy and its 95% confidence bounds from a confusion matrix
afn <- function(aa,bb) { confusionMatrix(aa,bb)$overall[c(1,3,4)] }
accuracyResults <- data.frame(cbind(
rf1=afn(rf_pred1, training[ flds[[1]],]$classe),
rf2=afn(rf_pred2, training[ flds[[2]],]$classe),
rf3=afn(rf_pred3, training[ flds[[3]],]$classe),
rf4=afn(rf_pred4, training[ flds[[4]],]$classe)
))
accuracyResults
rf1 rf2 rf3 rf4
Accuracy 0.9899305 0.9925677 0.9918465 0.9942418
AccuracyLower 0.9864131 0.9894669 0.9886248 0.9914444
AccuracyUpper 0.9927334 0.9949447 0.9943471 0.9963073
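From these fold accuracies we can also form a rough estimate of the expected out-of-sample error; a small sketch using the accuracyResults data frame printed above:
# Average CV accuracy across the four folds, and the implied out-of-sample error estimate
cv_accuracy <- mean(unlist(accuracyResults["Accuracy", ]))
cv_accuracy          # about 0.992 for the folds above
1 - cv_accuracy      # expected out-of-sample error, roughly 0.8%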
Score the combined model on the validation set we held out, and compare its score with the interim models. As the table shows, combining the four fits into another random forest yields a further small improvement.
combined_validation_data <- data.frame(
rf1_pred = predict(rf_fit1, newdata=validation),
rf2_pred = predict(rf_fit2, newdata=validation),
rf3_pred = predict(rf_fit3, newdata=validation),
rf4_pred = predict(rf_fit4, newdata=validation),
classe = validation$classe)
validation_pred <- predict(combined_fit, newdata=combined_validation_data)
accuracyResults <- data.frame(cbind(
rf1=afn(combined_validation_data$rf1_pred, validation$classe),
rf2=afn(combined_validation_data$rf2_pred, validation$classe),
rf3=afn(combined_validation_data$rf3_pred, validation$classe),
rf4=afn(combined_validation_data$rf4_pred, validation$classe),
combined=afn(validation_pred, validation$classe)
))
accuracyResults
rf1 rf2 rf3 rf4 combined
Accuracy 0.9932019 0.9945615 0.9918423 0.9932019 0.9959211
AccuracyLower 0.9895203 0.9911832 0.9878861 0.9895203 0.9928859
AccuracyUpper 0.9958427 0.9968883 0.9947664 0.9958427 0.9978907
# Full confusion matrix for the combined model on the validation set
confusionMatrix(validation_pred, validation$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 836 0 0 0 0
B 0 568 3 0 0
C 0 1 510 6 0
D 0 0 0 476 1
E 1 0 0 0 540
Overall Statistics
Accuracy : 0.9959
95% CI : (0.9929, 0.9979)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9948
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9988 0.9982 0.9942 0.9876 0.9982
Specificity 1.0000 0.9987 0.9971 0.9996 0.9996
Pos Pred Value 1.0000 0.9947 0.9865 0.9979 0.9982
Neg Pred Value 0.9995 0.9996 0.9988 0.9976 0.9996
Prevalence 0.2845 0.1934 0.1744 0.1638 0.1839
Detection Rate 0.2842 0.1931 0.1734 0.1618 0.1835
Detection Prevalence 0.2842 0.1941 0.1757 0.1621 0.1839
Balanced Accuracy 0.9994 0.9985 0.9956 0.9936 0.9989
#plot(combined_fit, main="Combination of 4-CV Random Forest")
#plot(varImp(combined_fit), top=10)
Using the trainControl option of caret's train function, we can achieve similar results with far less manual bookkeeping.
tic("run with traincontrol option")
tc_fit <- train(classe ~ ., data=training, method="parRF", trControl=trainControl(method="cv", number=4))
tc_pred <- predict(tc_fit, newdata=validation)
toc()
run with traincontrol option: 622.838 sec elapsed
# print the accuracy and its 95% confidence interval
afn(tc_pred, validation$classe)
Accuracy AccuracyLower AccuracyUpper
0.9945615 0.9911832 0.9968883
plot(tc_fit, main="Random Forest 4-CV using the trainControl option")
plot(varImp(tc_fit), top=10)
The random forest is a very effective method for this problem, and with cross-validation the accuracy can be improved even further.
A final calculation gives the predicted exercise classification for each of the 20 test cases.
combined_final_data <- data.frame(
rf1_pred = predict(rf_fit1, newdata=testing),
rf2_pred = predict(rf_fit2, newdata=testing),
rf3_pred = predict(rf_fit3, newdata=testing),
rf4_pred = predict(rf_fit4, newdata=testing))
final_pred <- predict(combined_fit, newdata=combined_final_data)
final_pred
[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
And since we have it, a second final prediction using the model trained with the trainControl option:
final_pred2 <- predict(tc_fit, newdata=testing)
final_pred2
[1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
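As a quick consistency check, a one-line sketch using the two prediction vectors above confirms the stacked model and the trainControl model agree on all 20 test cases.
# TRUE when both approaches give the same class for every test case
all(final_pred == final_pred2)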
The processing time to train these models was substantial. Although not documented in this report, the parallel random forest method (parRF) provided roughly a 10x speed-up in training; one good area of future study would be to see what architecture works best for these models.
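For reference, a minimal sketch of how a parallel backend can be registered for caret; this uses the doParallel package, which is an assumption and was not shown in the code above.
# Register a parallel backend so caret/parRF can spread work across cores (doParallel assumed installed)
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))   # leave one core free
registerDoParallel(cl)
# ... run the train() calls here ...
stopCluster(cl)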
Additionally, it would be interesting to increase the number of folds in cross-validation to see whether that helps, and to compare this against using more limited datasets, in order to make an informed comparison of the value of model tuning relative to the size of the dataset.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.