This work was prepared for the Practical Machine Learning course on Coursera. The objective is to build a model that can recognize human activity. The data come from movement sensors attached to participants who performed an exercise in one correct way and in four incorrect ways. The data are cleaned and then divided into a training set and a test set. The training set is used to build the model and the test set to validate it. The model built is a random forest that predicts the activity class with about 99% accuracy on the test set.
Huge amounts of data have become available through devices that measure physical activity. Analysis of these data can reveal the activity performed. Human Activity Recognition is the research field that does exactly that. The field is now well developed and human activity can be recognized quite well. This work investigates how well a specific activity is performed.
This work was prepared for the Practical Machine Learning course on Coursera and uses data from Velloso et al. (2013). The rest of this work describes the steps performed to build the model, presents the model and tests its accuracy. The full code of the project is given in the Appendix.
The data come from sensors attached to participants who each performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl. The exercise was performed in five different fashions: one according to the specification (Class A) and four (Classes B to E) in a specified “mistaken” way. The data set is available here (https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv).
The data set consists of 19622 observations of 160 variables. Entries that contain “NA” or “#DIV/0!”, or that are empty, are treated as missing values. The data set contains a lot of missing values. Variables with a high percentage of missing values have nothing to offer in the modelling procedure and are deleted from the data set.
The following table summarizes the number of variables by the number of missing values they contain. It can be seen that 60 variables do not contain any missing values. Each of the remaining 100 variables is missing in more than 19,215 of the 19,622 observations.
## na_test
## 0 19216 19217 19218 19220 19221 19225 19226 19227 19248 19293 19294
## 60 67 1 1 1 4 1 4 2 2 1 1
## 19296 19299 19300 19301 19622
## 2 1 4 2 6
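The counting and filtering step can be illustrated on a small example. The following is a minimal base-R sketch on a hypothetical toy data frame (the column names `a`, `b`, `c` and the 50% threshold are assumptions made for illustration, not values from the project):

```r
# Hypothetical toy data standing in for the real data set.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("", "", "#DIV/0!", "0.5"),
  c = c("x", "y", "", "z"),
  stringsAsFactors = FALSE
)

# Recode empty strings and "#DIV/0!" as NA, mimicking read.csv(na.strings = ...).
missing_tokens <- c("", "#DIV/0!")
toy[] <- lapply(toy, function(col) {
  col[col %in% missing_tokens] <- NA
  col
})

# Count the missing values in each column.
na_count <- sapply(toy, function(col) sum(is.na(col)))

# Keep only the columns in which fewer than half of the values are missing.
keep  <- na_count / nrow(toy) < 0.5
clean <- toy[, keep, drop = FALSE]

print(na_count)
print(names(clean))
```

In this toy example column `b` is dropped because three of its four values are missing, while `a` and `c` are kept.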
Additionally, the first 7 variables are not related to the movement. They describe the subject and the data collection (subject name, timestamps, etc.), so they are not useful in a model that tries to predict the movement. After removing them, 53 variables remain: 52 predictors and the outcome classe. The integer variables are converted to numeric. The data are now in a shape suitable for building the model.
Before building the prediction model, the data are partitioned into two sets with a 60%/40% split. The first set (training) is for building the model and the second (test) for validation.
set.seed(528963)
inTrain <- createDataPartition(y=data$classe, p=0.6, list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
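For a factor outcome, a partition of this kind samples within each class so that the class proportions are roughly preserved in both subsets. The following base-R sketch shows that idea on hypothetical labels; it is an illustration of stratified sampling under stated assumptions, not a reproduction of the internals of `createDataPartition`:

```r
set.seed(1)  # arbitrary seed, used only for this sketch

# Hypothetical outcome with three unbalanced classes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Group the row indices by class, then sample 60% of the indices in each group.
idx_by_class <- split(seq_along(classe), classe)
sampled <- lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.6 * length(idx)))
})
in_train <- sort(unlist(sampled, use.names = FALSE))

train_y <- classe[in_train]
test_y  <- classe[-in_train]

# Class counts in each subset keep the original 50:30:20 proportions.
print(table(train_y))
print(table(test_y))
```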
The purpose of the model is to predict the activity class from the 52 movement variables. Because some of the variables may not be normally distributed and may mislead the machine learning algorithm, the data are first pre-processed. The aim of the pre-processing is to center and scale the variables. The transformation estimated on the training subset is also applied to the testing subset.
prepr <-preProcess(training[,-53],method=c("center", "scale"))
trainp <- predict(prepr, training[,-53])
trainp$classe <- training$classe
testp <-predict(prepr,testing[,-53])
testorp <- predict(prepr,test.or[,-53])
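The key point of this step is that the centering and scaling parameters are estimated on the training subset only and then reused unchanged on any other subset. A minimal base-R sketch of that idea follows, using hypothetical toy data (the names `train_x`, `test_x`, `mu` and `sigma` are assumptions for illustration):

```r
# Hypothetical toy predictors.
train_x <- data.frame(x1 = c(1, 2, 3, 4, 5), x2 = c(10, 20, 30, 40, 50))
test_x  <- data.frame(x1 = c(3, 6),          x2 = c(30, 0))

# Estimate the mean and standard deviation on the training data only.
mu    <- sapply(train_x, mean)
sigma <- sapply(train_x, sd)

# Apply the same training-derived parameters to both subsets.
train_s <- as.data.frame(scale(train_x, center = mu, scale = sigma))
test_s  <- as.data.frame(scale(test_x,  center = mu, scale = sigma))

print(round(colMeans(train_s), 10))  # training columns are centered at 0
print(test_s)                        # test values are expressed in training units
```

Reusing the training parameters keeps information from the test subset out of the model-building process.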
The algorithm selected to build the model is the random forest. Random forests are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests give very accurate results, but the results are difficult to interpret. Additionally, they are time-consuming to train and may overfit.
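# number and allowParallel are trainControl() arguments; when passed to train() they have no effect on resampling.
# The output below was produced with the default 10-fold cross-validation, which is made explicit here.
modFit <- train(classe ~ ., method="rf", data=trainp, trControl=trainControl(method='cv', number=10, allowParallel=TRUE))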
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10597, 10599, 10598, 10599, 10599, 10598, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9915936 0.9893649 0.002637431 0.003336848
## 27 0.9904893 0.9879682 0.003576151 0.004524654
## 52 0.9851397 0.9811981 0.004259389 0.005392498
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
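The classification rule of a random forest, taking the mode of the individual tree predictions, can be illustrated in a few lines of base R. The votes below are hypothetical and the helper `majority` is an assumption made for this sketch; ties are resolved here by taking the first label in alphabetical order, which is a simplification:

```r
# Hypothetical votes: one row per observation, one column per tree.
votes <- matrix(
  c("A", "A", "B",
    "C", "B", "B",
    "E", "E", "E"),
  nrow = 3, byrow = TRUE
)

# Most frequent label in a vector (first label wins on ties in this sketch).
majority <- function(v) names(which.max(table(v)))

# The forest prediction for each observation is the mode of its row.
forest_pred <- apply(votes, 1, majority)
print(forest_pred)
```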
First, the model is evaluated on the training set, which estimates the in-sample error. The accuracy is 1: the confusion matrix shows that every training observation is classified correctly. This figure is optimistic, because the model has already seen these observations.
trPred <- predict(modFit, trainp)
confusionMatrix(trPred, trainp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2279 0 0 0
## C 0 0 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The model is also evaluated on the test set, which estimates the out-of-sample error. The accuracy here is expected to be lower than on the training set but still high. I expect it not to fall below 0.98.
tePred <- predict(modFit, testp)
confusionMatrix(tePred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2230 16 0 0 0
## B 2 1499 10 0 0
## C 0 3 1355 29 0
## D 0 0 3 1253 1
## E 0 0 0 4 1441
##
## Overall Statistics
##
## Accuracy : 0.9913
## 95% CI : (0.989, 0.9933)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.989
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9875 0.9905 0.9743 0.9993
## Specificity 0.9971 0.9981 0.9951 0.9994 0.9994
## Pos Pred Value 0.9929 0.9921 0.9769 0.9968 0.9972
## Neg Pred Value 0.9996 0.9970 0.9980 0.9950 0.9998
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1911 0.1727 0.1597 0.1837
## Detection Prevalence 0.2863 0.1926 0.1768 0.1602 0.1842
## Balanced Accuracy 0.9981 0.9928 0.9928 0.9869 0.9993
The results show an accuracy of 0.9913 with a 95% confidence interval of 0.989 to 0.9933, so the estimated out-of-sample error is about 0.9%.
The model built can predict the class of the exercise performed with about 99% accuracy. It is expected that the model will work very well on the original test set provided. The original test set contains 20 instances and the model should classify almost all of them correctly. The results are given below.
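The reported statistics can be recomputed directly from the test-set confusion matrix shown above. The sketch below enters that matrix by hand and derives the accuracy, the out-of-sample error and the per-class sensitivity; the confidence interval is computed with `binom.test`, on the assumption that the reported interval is an exact binomial interval:

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(2230,   16,    0,    0,    0,
       2, 1499,   10,    0,    0,
       0,    3, 1355,   29,    0,
       0,    0,    3, 1253,    1,
       0,    0,    0,    4, 1441),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

correct  <- sum(diag(cm))   # correctly classified observations
total    <- sum(cm)         # all test observations
accuracy <- correct / total
oos_error <- 1 - accuracy   # estimated out-of-sample error

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

# Sensitivity per class: correct predictions divided by the reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(as.numeric(ci), 4))
print(round(sensitivity, 4))
```

Of the 7846 test observations, 7778 are classified correctly, which gives the accuracy of 0.9913 reported above.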
testPred <- predict(modFit, testorp)
testPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har
Data and Caret loaded
data <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test.or <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
library(caret)
NA Detection and deletion
na_test = sapply(data, function(x) {sum(is.na(x))})
table(na_test)
na_columns <- names(na_test[na_test > 19215])
data <- data[, !names(data) %in% na_columns]
test.or <- test.or[, !names(test.or) %in% na_columns]
Data collection related columns deleted
numeric.features <- 8:60
data <- data[, numeric.features]
test.or <- test.or[,numeric.features]
for(i in c(1:52)) {data[,i] = as.numeric(as.character(data[,i]))}
for(i in c(1:52)) {test.or[,i] = as.numeric(as.character(test.or[,i]))}
Data partitioning into training and testing sets
set.seed(528963)
inTrain <- createDataPartition(y=data$classe, p=0.6, list=FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
Pre-processing
prepr <-preProcess(training[,-53],method=c("center", "scale"))
trainp <- predict(prepr, training[,-53])
trainp$classe <- training$classe
testp <-predict(prepr,testing[,-53])
testorp <- predict(prepr,test.or[,-53])
Model Fitting
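# number and allowParallel are trainControl() arguments; when passed to train() they have no effect on resampling.
# The reported results were produced with the default 10-fold cross-validation, which is made explicit here.
modFit <- train(classe ~ ., method="rf", data=trainp, trControl=trainControl(method='cv', number=10, allowParallel=TRUE))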
Training prediction and confusion matrix
trPred <- predict(modFit, trainp)
confusionMatrix(trPred, trainp$classe)
Testing prediction and confusion matrix
tePred <- predict(modFit, testp)
confusionMatrix(tePred, testing$classe)
Original test set prediction
testPred <- predict(modFit, testorp)