Human Activity Recognition (HAR) focuses on recognizing the body movements necessary to perform different activities. In this study, 6 subjects performed 1 set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different ways [1]. The participants wore sensors placed on their belts, armbands, gloves, and dumbbells.
The data set provides 39,242 samples with 159 variables, labeled with 5 types of activity to discriminate between 1 correct and 4 different incorrect executions.
Read more at: http://groupware.les.inf.puc-rio.br/har.
1. Question:
Is it possible to determine the quality of weight-lifting exercise activities and to identify mistakes using machine learning algorithms?
2. Input Data:
You can download the data-set (15.3 MB) at: http://groupware.les.inf.puc-rio.br/static/WLE/WearableComputing_weight_lifting_exercises_biceps_curl_variations.csv
3. Features:
Handle NA values; check the frequency ratio for near-zero variance; remove redundant, highly correlated variables if necessary; remove near-zero-variance predictors.
4. Training and Testing data sets:
A seed is set and the data are partitioned 70/30 into training and testing sets.
5. Build the model:
I used the 'Random Forest' algorithm to see if a successful predictive model could be built. This model achieved 99.97% prediction accuracy on this data set, making it unnecessary to try other models. However, over-fitting is a concern in random forest models, so 10-fold cross-validation was used to reduce bias and minimize over-fitting. The random forests were tuned to include 250 trees in each cross-validation iteration.
6. Predictions:
The final model was used to predict on the remaining 30% of the data, which was held out for testing. The model classified this test data set with 99.97% accuracy.
7. Conclusion
Citation:
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
library(plyr)
library(caret)
library(Hmisc)
library(ggplot2)
library(lattice)
library(plotly)
The data file can be downloaded in CSV format.
# Check if folder exist
if (!file.exists("Data")){
dir.create("Data")
}
# File URL and destination
fileURL <- "http://groupware.les.inf.puc-rio.br/static/WLE/WearableComputing_weight_lifting_exercises_biceps_curl_variations.csv"
desfile <- "./Data/Weight_Lifting.csv"  # same path that read.csv() uses below
# download.file(fileURL, destfile = desfile)  # commented out to avoid re-downloading
# Record the download date
dateDownloaded <- date()
Load the data set, replacing all 'NA', 'NaN', or empty strings with NA.
mydata = read.csv("./Data/Weight_Lifting.csv",
header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))
First, we will look at the proportion of missing values in each variable.
Find_NA <- data.frame(apply(mydata, 2, function(x) sum(is.na(x)))); Find_NA/nrow(mydata)
mydata1<-mydata[, colMeans(is.na(mydata)) <= .1]
This step removes every variable in which more than 10% of the observations are NA. In this particular set, 59 variables remain from the initial 159.
ZC <- nearZeroVar(mydata1, saveMetrics = TRUE); ZC
nzv_T <- which(ZC$nzv == TRUE); nzv_T  # locate the near-zero-variance variables
mydata1 <- mydata1[, -nzv_T]  # remove them
mydata1$classe <- mydata$classe  # re-insert the outcome variable classe
More than one variable has near-zero variance. As a result, 54 variables remain after cleaning the data set.
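A quick dimension check can confirm this count (a minimal sketch):
dim(mydata1)  # expect 54 columns after removing the near-zero-variance variables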
set.seed(512)
intrain <- createDataPartition(y = mydata1$classe, p = 0.7, list = F)
training <- mydata1[intrain,]
testing <- mydata1[-intrain,]
Exploratory Analysis
An interactive graph makes it possible to identify the users and the classe type. Zooming allows us to subset a specific region, or a specific user, if desired.
Visualization of the data for roll belt sensor (interactive graph)
Visualization of the data for pitch belt sensor (interactive graph)
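The interactive graphs themselves are not reproduced here. A minimal sketch of how such a plot could be built with plotly, assuming the data set's roll_belt, user_name, and classe columns, is:
# Sketch: interactive scatter of the roll belt sensor, colored by user,
# with the classe label shown on hover
p <- plot_ly(mydata1,
             x = ~seq_len(nrow(mydata1)),  # sample index
             y = ~roll_belt,
             color = ~user_name,
             text = ~classe,
             type = "scatter", mode = "markers")
p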
The plots show:
* The roll belt data points cluster around two different mean values. Hovering over the graph shows that Charles, Pedro, and Adelmo's means are around 130, while Eurico and Carlitos' means are around 0.
* The pitch belt data points cluster around different mean values: Pedro around 26, Charles around 16, Eurico, Jeremy, and Carlitos around 6, and Adelmo around -41.
This indicates that the calibration of the sensors is not the same for all the subjects.
Collinearity
The prediction model of choice is the random forest because it can balance errors in data sets with unbalanced class populations. However, correlation between any two trees in the forest increases the forest's error rate. Therefore, a correlation plot is generated to verify how strong the relationships between the variables are.
To do so, we first plot a correlation matrix after removing all the non-numerical variables from the data set.
library(corrplot)
# Plot a correlation matrix after the non-numerical variables are removed
chrVar <- sapply(training, is.character)  # flag the character (non-numeric) columns
training1 <- training[!chrVar]
corMTX <- cor(training1)
corrplot(corMTX, method = "color", type = "lower", order = "FPC", tl.cex = 0.6)
There is not much concern for highly correlated predictors, which means that none of the variables will be eliminated from the model.
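This can be checked with caret's findCorrelation(), which flags predictors above a chosen pairwise-correlation cutoff; the 0.9 threshold below is an illustrative choice:
highCor <- findCorrelation(corMTX, cutoff = 0.9)  # indices of highly correlated predictors
names(training1)[highCor]  # variables that would be flagged at this cutoff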
The variable classe now needs to be re-included in the data set used for modeling.
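The corresponding code is not shown in the original report; a minimal sketch of this step, converting classe to a factor as caret's train() expects for classification:
# Re-attach the outcome, which was removed along with the other character columns
training1$classe <- as.factor(training$classe)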
Parallel Processing
library(parallel)
library(doSNOW)
cl <- makeCluster(4, type = "SOCK"); registerDoSNOW(cl)
library(pROC);library(gbm)
Train Control parameters: We will use 10-fold cross validation to evaluate the models.
ctrl <- trainControl(method="cv",
number = 10,
classProbs=TRUE,
savePredictions =TRUE)
The random forest predictive method was used to improve the accuracy of the model. However, over-fitting is a concern in random forest models. To minimize over-fitting and reduce bias, 10-fold cross-validation was used. The random forests were tuned to include 250 trees in each cross-validation iteration.
set.seed(512)
rf_fit <- train(classe~.,
data = training1,
method = "rf",
ntree = 250,
metric = "ROC",
trControl = ctrl,
verbose=FALSE)
Now we stop parallel processing
stopCluster(cl)
library(randomForest)
rf_fit$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 250, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 26
##
## OOB estimate of error rate: 0.03%
## Confusion matrix:
## A B C D E class.error
## A 7811 1 0 0 0 0.0001280082
## B 2 5314 0 0 0 0.0003762227
## C 0 2 4789 0 0 0.0004174494
## D 0 0 2 4500 1 0.0006662225
## E 0 0 0 0 5050 0.0000000000
The random forest model produced a very small OOB error rate of 0.03%, which was satisfactory for proceeding to testing.
The error rates of the final model are plotted over 250 trees.
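The plotting call itself is not shown; a minimal sketch using plot.randomForest, which draws the OOB and per-class error rates against the number of trees:
# Error rates (OOB and one curve per class) as a function of the number of trees
plot(rf_fit$finalModel, main = "Error Rates of the Final Random Forest Model")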
A plot of the top ten most important variables for the random forest model is shown below.
varImpPlot(rf_fit$finalModel, n.var = 10,
           sort = TRUE, pch = 19, col = 'blue', main = "Variable Importance: Random Forest Model")
Random forest is a stochastic method, which makes it something of a black-box classifier: its purpose is not to explain the structure of the data. To visualize that structure, a tree technique such as recursive partitioning for classification, regression, and survival trees (rpart) will do the job.
library(rpart)
library(rpart.plot)
treeModel <- rpart(classe ~ ., data=training1, method="class")
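Since rpart.plot is already loaded, the fitted tree can be drawn; a minimal sketch:
# Visualize the classification tree produced by recursive partitioning
rpart.plot(treeModel)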
The random forest model is now used to classify the test data set, consisting of 30% of the initial data. The results are placed in a confusion matrix along with the reference classifications in order to determine the accuracy of the model.
PredTesting <- predict(rf_fit,testing)
ConMatrix <-confusionMatrix(testing$classe,PredTesting); ConMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3347 0 0 0 0
## B 0 2277 0 0 0
## C 0 1 2052 0 0
## D 0 0 0 1928 1
## E 0 0 0 1 2163
##
## Overall Statistics
##
## Accuracy : 0.9997
## 95% CI : (0.9993, 0.9999)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9996 1.0000 0.9995 0.9995
## Specificity 1.0000 1.0000 0.9999 0.9999 0.9999
## Pos Pred Value 1.0000 1.0000 0.9995 0.9995 0.9995
## Neg Pred Value 1.0000 0.9999 1.0000 0.9999 0.9999
## Prevalence 0.2844 0.1935 0.1743 0.1639 0.1839
## Detection Rate 0.2844 0.1935 0.1743 0.1638 0.1838
## Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 0.9998 0.9999 0.9997 0.9997
print(ConMatrix$overall)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9997451 0.9996776 0.9992553 0.9999474 0.2843670
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The random forest model predicted the classes of the test data set with 99.97% accuracy. This result demonstrates that the random forest model was a good choice for analyzing these data. The weight lifting training data set was used to create a model that predicts the way a subject performed the weight lifting exercise.