Luis Jaraquemada Silva
Machine Learning
ljaraque@yahoo.com
ljaraque@alfanetics.com
Human Activity Recognition (HAR) through machine learning has become common in the wearable device industry, e.g. wrist bands and smart watches [1]. Mobile phones are consistently evolving to incorporate various sensors that provide data collection capabilities, and software that reports on and draws conclusions from this data is a mainstream component of context-aware systems. Possible applications include elderly monitoring, daily activity tracking and weight-loss programs. Currently, the most common application of HAR is to recognize the type of activity an individual is performing at a given moment, such as sitting, sleeping, running or jumping. Studies so far have focused mainly on detecting the activity type, not the quality of execution (how well the activity is performed). The sensor data used as the basis of this article was collected by Groupware [2] using an implementation of body sensors, producing a labelled dataset. We train a prediction model on a training partition of this data, evaluate its accuracy on a testing partition, and finally perform predictions on unlabelled data. This is a testbed analysis, with possible extensions to industrial applications in which processes completed by humans or machines must be evaluated in terms of performance and effectiveness for specific goals.
The data is collected from an implementation of an on-body sensing approach, in which the individual is equipped with four accelerometer/gyroscope/magnetometer sets, three of them on-body and one on an exercising tool. They measure arm orientation, forearm orientation, belt orientation and dumbbell orientation. The dataset is obtained from the following sources:
- Labelled Dataset: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
- Unlabelled Testing Set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
if(file.exists("pml-training.csv")==FALSE) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url=fileUrl, destfile="pml-training.csv", method="curl")
}
if(file.exists("pml-testing.csv")==FALSE) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url=fileUrl, destfile="pml-testing.csv", method="curl")
}
Six supervised participants were asked to perform 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different manners (Figure 1), one of them correct and the remaining four corresponding to common mistakes. The outcome factor levels are: A: correctly, according to specification; B: throwing the elbows to the front; C: lifting the dumbbell only halfway; D: lowering the dumbbell only halfway; and E: throwing the hips to the front.
The dataset was constructed by feature extraction using a sliding-window approach, with window lengths between 0.5 s and 2.5 s and 0.5 s of overlap. Raw measurements from the accelerometers, gyroscopes and magnetometers were used as features, as well as derived features such as mean, variance, standard deviation, max, min, amplitude, kurtosis and skewness.
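To make the windowing concrete, the sketch below shows how such window-level summary features could be computed from a raw signal. It is only illustrative: the sampling rate, the slidingFeatures helper and its defaults are our own assumptions, not the original extraction pipeline of [1].
library(e1071)  # provides kurtosis() and skewness()
# Illustrative sliding-window feature extraction (not the original pipeline).
# 'signal' is a numeric vector; 'fs' is its sampling rate in Hz (assumed value).
slidingFeatures <- function(signal, fs = 45, winSec = 2.5, overlapSec = 0.5) {
    winLen <- round(winSec * fs)                 # samples per window
    step   <- round((winSec - overlapSec) * fs)  # hop between window starts
    starts <- seq(1, length(signal) - winLen + 1, by = step)
    t(sapply(starts, function(s) {
        w <- signal[s:(s + winLen - 1)]
        c(mean = mean(w), var = var(w), sd = sd(w),
          max = max(w), min = min(w), amplitude = max(w) - min(w),
          kurtosis = kurtosis(w), skewness = skewness(w))
    }))
}
# e.g. slidingFeatures(data$roll_belt) yields one row of summary features per window.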
Data Cleaning
The dataset contains 159 features and the corresponding outcome for each observation. The raw dataset obtained from [2] has many irregularities: empty elements, division-by-zero markers, blank spaces and literal NAs. We fill these elements with 0, which is suitable given the nature of the variables in which they occur.
In addition to the missing values, the dataset contains a large number of columns with a high percentage of zeros, which contribute little information to the analysis. We decided to remove all columns in which the proportion of zeros exceeds 95%; by this criterion the 159 variables were reduced to 59. Finally, some variables have no relation to the prediction task, such as the row id and the name of the individual. After deleting the corresponding columns we reach a final dataset of 57 variables and one outcome.
data <- read.table("pml-training.csv", sep = ",", header=T, na.strings = c("#DIV/0!", "", " ", "NA"))
data[is.na(data)] <- 0
validColumns <- -which(as.numeric(colSums(data==0)) >= nrow(data)*0.95) # columns with quantity of zeros < percentage
data <- data[, c(validColumns,-1,-2)] # Remove all columns with >=95% of zeros
realTesting <- read.table("pml-testing.csv", sep = ",", header=T, na.strings = c("#DIV/0!", "", " ", "NA"))
realTesting[is.na(realTesting)] <- 0
realTesting <- realTesting[,c(validColumns,-1,-2)] # Remove all columns with >=95% of zeros
# Adapt the realTesting set so that its column types and factor levels
# match those of the training dataset.
colnames(realTesting)[colnames(realTesting) == "problem_id"] <- "classe"
realTesting$classe <- as.factor("A")  # placeholder label; real labels are unknown
realTesting$magnet_forearm_y <- as.numeric(realTesting$magnet_forearm_y)
realTesting$magnet_forearm_z <- as.numeric(realTesting$magnet_forearm_z)
realTesting$magnet_dumbbell_z <- as.numeric(realTesting$magnet_dumbbell_z)
# rbind() coerces the 20 unlabelled rows to the exact variable types and
# factor levels of the training data; we then split them back out.
dataBig <- rbind(data, realTesting)
realTesting <- dataBig[(nrow(data) + 1):nrow(dataBig), ]
rownames(realTesting) <- c(1:20)
In addition to the previous manipulations, the dataset has been split into two groups, one for training and one for testing. We used 75% of the data for training purposes and held out the remaining 25% for testing, a hold-out validation scheme used to estimate the generalization error of the model to be constructed for prediction.
library(caret)  # provides createDataPartition() and confusionMatrix()
inTrain  <- createDataPartition(data$classe, p = 0.75, list = FALSE)
training <- data[inTrain, ]
testing  <- data[-inTrain, ]
Some Data Observations
We examined multiple pairs of variables to assess the degree of class separability with which our model will have to cope (Figures 2 and 3).
qplot(magnet_belt_z, roll_arm, colour = classe, data = training)   # Figure 2
qplot(pitch_dumbbell, roll_arm, colour = classe, data = training)  # Figure 3
Because of the characteristic noise in sensor data, we use a Random Forest approach to fit a suitable model to our training data, apply it to the testing set, and then perform the final predictions on the unlabelled data.
Random Forest algorithms grow each tree in the forest on a subset of the features, selected randomly and independently with the same distribution for every tree. We use the varImp() function to gain insight into the relative importance of the different features in the dataset, according to our fitted prediction model.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(298374)
# For classification, randomForest samples floor(sqrt(p)) candidate features at
# each split by default, implementing the random feature selection described above.
modFit <- randomForest(classe ~ ., data = training, importance = FALSE)
b <- varImp(modFit)
rev(order(b))  # feature column indices ordered from most to least important
## [1] 3 1 5 6 8 43 7 44 46 45 42 17 15 32 40 18 41 52 12 35 39 19 16
## [24] 29 9 34 37 54 57 26 33 30 56 21 14 55 13 20 31 47 11 27 23 53 28 36
## [47] 24 50 10 48 22 51 49 38 25 2 4
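The output above is a ranking of column indices in the training set. As a small convenience, the indices can be mapped back to feature names (this sketch assumes varImp() returned a single Overall column, as it typically does for a fit like this one):
# Top 10 features by importance, as variable names instead of column indices.
head(names(training)[rev(order(b$Overall))], 10)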
predictionsTrainingData <- predict(modFit,newdata=training)
confusionMatrix(predictionsTrainingData, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
We can see from the confusion matrix that the performance of the algorithm is extremely positive on the training data, with a classification accuracy of 100%.
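Perfect accuracy on the same data the forest was trained on is expected and not informative by itself. A more honest internal estimate comes for free with Random Forests: the out-of-bag (OOB) error, computed on the samples each tree did not see during training. Printing the fitted model displays it:
# Printing a randomForest fit shows the OOB estimate of the error rate
# and an OOB confusion matrix, without touching the held-out testing set.
print(modFit)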
In order to estimate the generalization error of our model, that is, how it will behave on data external to that used to train it, we predict the outcome classe on the testing data set aside at the beginning of our analysis and compare these predictions with the ground truth.
predictionsTestingData <- predict(modFit,newdata=testing)
confusionMatrix(predictionsTestingData, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 1 0 0 0
## B 1 948 1 0 0
## C 0 0 853 1 0
## D 0 0 1 803 1
## E 0 0 0 0 900
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9973, 0.9996)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9989 0.9977 0.9988 0.9989
## Specificity 0.9997 0.9995 0.9998 0.9995 1.0000
## Pos Pred Value 0.9993 0.9979 0.9988 0.9975 1.0000
## Neg Pred Value 0.9997 0.9997 0.9995 0.9998 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1933 0.1739 0.1637 0.1835
## Detection Prevalence 0.2845 0.1937 0.1741 0.1642 0.1835
## Balanced Accuracy 0.9995 0.9992 0.9987 0.9991 0.9994
We can see only 6 misclassifications out of the 4904 observations in the testing set. The accuracy obtained is very acceptable, at 99.88%. The expected out-of-sample error is therefore about 0.1%, which is the error level we should expect on future data.
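The same figures can be extracted programmatically: the object returned by confusionMatrix() exposes the overall accuracy, from which the estimated out-of-sample error follows directly.
cm <- confusionMatrix(predictionsTestingData, testing$classe)
accuracy <- cm$overall["Accuracy"]  # 0.9988 on this split
1 - accuracy                        # estimated out-of-sample error, ~0.0012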
We now take the model validated in the previous sections and apply it to the testing set that does not contain labelled outcomes:
predictionsRealTesting <- predict(modFit,newdata=realTesting)
predictionsRealTesting
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
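To keep a record of these predictions, they can be written to disk; the data frame layout and file name below are illustrative choices of ours, not part of the original study.
# Save the 20 predictions with their problem ids (file name is illustrative).
write.csv(data.frame(problem_id = 1:20,
                     predicted_classe = as.character(predictionsRealTesting)),
          file = "predictions.csv", row.names = FALSE)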
Based on the validation results in section 5, we estimate the error in these predictions at about 0.1%, i.e. roughly 99.9% accuracy.
This study is a simple approach to human activity recognition based on sensor data, collected and pre-processed to generate the full set of features that allowed us to build a model able to describe and predict the quality of execution of different activity types.
We used a Random Forest machine learning technique, with which we were able to classify the training data with 100% accuracy and, by validation on held-out data, estimate a generalization accuracy of 99.9% on the testing set. We predicted the quality outcome classe for a set of 20 testing observations for which no labels are available.
This study is of interest not only for this specific case, but also as a potential starting point for sensor data analysis in other scenarios, such as health monitoring and industrial applications, among others.
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
[2] Dataset on Human Activity Recognition, by Groupware@LES (group of research and development of groupware technologies), http://groupware.les.inf.puc-rio.br.