Human Activity Recognition

Summary

Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset above).

In the experiment, six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

The purpose of this project is to predict how the exercise was performed, using the variable “classe” as the outcome and the remaining variables as predictors.

Analysis

Loading the Data

library(caret);library(rpart); library(rpart.plot)
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fileUrl_train<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl_test<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
dir_train<-"train.csv"
dir_test<-"test.csv"
download.file(fileUrl_train,dir_train)
download.file(fileUrl_test,dir_test)

training<-read.table("train.csv",header = TRUE, sep = ",",na.strings=c("NA","#DIV/0!",""))

test<-read.table("test.csv",header = TRUE, sep = ",",na.strings=c("NA","#DIV/0!",""))

Cleaning the Data

First, let’s remove the first seven columns, which have no relevant relation to the variable classe. Then, let’s remove any variables that contain missing values.

# removing 7 columns
training<-training[,-c(1:7)]
test<-test[,-c(1:7)]

# removing columns that contain any missing values (NAs)
training <- training[, colSums(is.na(training)) == 0]
test <- test[, colSums(is.na(test)) == 0]

The cleaned data sets each have 53 columns and share the same first 52 variables; the 53rd column is classe in the training data and problem_id in the test data. The training data has 19622 rows while the test data has 20 rows.
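As a quick sanity check (an addition to the original write-up, assuming the objects keep the names used above), we can confirm these dimensions directly:

# check dimensions of the cleaned data sets
dim(training)  # 19622 rows, 53 columns
dim(test)      # 20 rows, 53 columns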

Splitting the Data

In order to estimate the out-of-sample error, we split the cleaned training dataset into a training set (mytrain, 70%) for building the models and a validation set (mytest, 30%) for computing the out-of-sample error.

intrain<-createDataPartition(y=training$classe,p=0.7,list = FALSE)
mytrain<-training[intrain,]
mytest<-training[-intrain,]
dim(mytrain);dim(mytest)
## [1] 13737    53
## [1] 5885   53

Test Harness

We will use 5-fold cross-validation to estimate accuracy. This splits the mytrain dataset into 5 parts: each model is trained on 4 parts and tested on the remaining part, repeating across all 5 train-test combinations to get a more reliable estimate.

control <- trainControl(method="cv", number=5,allowParallel = TRUE)
metric <- "Accuracy"

We are using “Accuracy” as the metric to evaluate models. It is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage. We will pass the metric variable when we build and evaluate each model next.
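For intuition, here is a tiny illustrative sketch (with made-up toy vectors, not part of the analysis) of how that percentage is computed:

# toy example: accuracy as the percentage of correct predictions
pred   <- c("A", "B", "A", "C", "A")
actual <- c("A", "B", "C", "C", "A")
mean(pred == actual) * 100  # 4 of 5 correct -> 80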

Creating a Model

Let’s evaluate three models: classification trees, random forests, and boosting.

  • Classification Trees
set.seed(125)
fit_ct <- train(classe~., data=mytrain, method="rpart", trControl=control,metric=metric)
  • Random Forest
set.seed(125)
fit_rf <- train(classe~., data=mytrain, method="rf", trControl=control,metric=metric)
  • Generalized Boosted Regression
set.seed(125)
fit_gbm <- train(classe ~ ., data=mytrain, method = "gbm", trControl=control,metric=metric, verbose = FALSE)
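
With all three models trained, we can release the parallel cluster registered at the start (a housekeeping step not in the original analysis, assuming the cluster object created earlier):

# shut down the parallel cluster and return to sequential processing
stopCluster(cluster)
registerDoSEQ()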

We now have 3 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

# summarize accuracy of models
results <- resamples(list(class_tree=fit_ct, ram_tree=fit_rf, boost=fit_gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: class_tree, ram_tree, boost 
## Number of resamples: 5 
## 
## Accuracy 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## class_tree 0.4848926 0.4896326 0.4925373 0.4995287 0.5120087 0.5185725
## ram_tree   0.9908992 0.9909025 0.9912696 0.9921384 0.9934474 0.9941733
## boost      0.9559680 0.9585002 0.9592134 0.9598889 0.9610484 0.9647144
##            NA's
## class_tree    0
## ram_tree      0
## boost         0
## 
## Kappa 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## class_tree 0.3313067 0.3378231 0.3553755 0.3516431 0.3620862 0.3716240
## ram_tree   0.9884860 0.9884930 0.9889537 0.9900545 0.9917108 0.9926289
## boost      0.9443222 0.9474924 0.9484009 0.9492516 0.9506975 0.9553449
##            NA's
## class_tree    0
## ram_tree      0
## boost         0

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 5 times (5-fold cross-validation).

# compare accuracy of models
dotplot(results)

The results for the random forest model are summarized below.

# summarize Best Model
print(fit_rf)
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10990, 10989, 10991, 10988 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9915560  0.9893173
##   27    0.9921384  0.9900545
##   52    0.9861689  0.9825008
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

Validating

The random forest was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set in case we made a slip during modeling, such as overfitting to the training set or a data leak; either would produce an overly optimistic result.

We can run the rf model directly on the validation set and summarize the results in a confusion matrix.

# estimate skill of the random forest model on the validation dataset
predictions <- predict(fit_rf, mytest)
confusionMatrix(predictions, mytest$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    7    0    0    0
##          B    3 1129    3    0    0
##          C    0    2 1021   12    2
##          D    0    1    2  950    4
##          E    1    0    0    2 1076
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9934         
##                  95% CI : (0.991, 0.9953)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9916         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9912   0.9951   0.9855   0.9945
## Specificity            0.9983   0.9987   0.9967   0.9986   0.9994
## Pos Pred Value         0.9958   0.9947   0.9846   0.9927   0.9972
## Neg Pred Value         0.9990   0.9979   0.9990   0.9972   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1918   0.1735   0.1614   0.1828
## Detection Prevalence   0.2850   0.1929   0.1762   0.1626   0.1833
## Balanced Accuracy      0.9980   0.9950   0.9959   0.9920   0.9969

We can see that the accuracy is about 99.3%, which indicates an accurate and reliable model.

Predicting

We now use random forests to predict the outcome variable classe for the testing set.

prediction_test <- predict(fit_rf, test)
prediction_test
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusions

For this dataset, the random forest method gives the best model. The accuracy on the validation set is 0.9934, so the estimated out-of-sample error rate is 0.0066. This strong performance may be due to the fact that many predictors are highly correlated: random forests choose a subset of predictors at each split, which decorrelates the trees. This leads to high accuracy, although the algorithm is sometimes difficult to interpret and computationally expensive.
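As a quick sketch (assuming the predictions object from the Validating section is still available), the out-of-sample error can be read directly from the validation confusion matrix:

# out-of-sample error = 1 - validation accuracy
1 - confusionMatrix(predictions, mytest$classe)$overall["Accuracy"]  # ~0.0066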

References

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Read more: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har#ixzz4qm2iBf9d