As part of the course “Practical Machine Learning”, this final assignment analyses the “HAR” data.
According to Leek et al., “one thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants” (Leek, J. et al. 2016, Coursera PML material).
In this report, a model is built to predict how well people exercise, using a known data set and a set of predictor variables.
This report contains:
* Data acquisition, data understanding, and data preparation.
* Exploratory data analysis: understanding the variable “classe”, and partitioning the data into training and testing subsets.
* A process to identify and apply predictive models to the data sets, with a description of how each model was built.
* Model selection: comparison and analysis of the expected out-of-sample error, and the reasoning for selecting one model over the other.
* Prediction of 20 different test cases using the selected model.
Accelerometers placed on the belt, forearm, arm, and dumbbell of 6 participants were used to record data related to physical activity. Individuals were directed to perform barbell lifts correctly and incorrectly in 5 different ways.
Data originates from the HAR project. This dataset is licensed under the Creative Commons license (CC BY-SA). Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. http://groupware.les.inf.puc-rio.br/har
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
library(ggplot2)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.5
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.5
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.5
library(RColorBrewer)
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.5
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(433)
urlT="http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFit <- read.csv(url(urlT),na.strings=c("NA","#DIV/0!",""))
dim(trainFit)
## [1] 19622 160
urls <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testset <- read.csv(url(urls),na.strings=c("NA","#DIV/0!",""))
dim(testset)
## [1] 20 160
Data tidying: variables containing “NA” values are removed. Columns 1-7 are also dropped because they are administrative variables that are not needed for prediction.
trainFit<-trainFit[,colSums(is.na(trainFit)) == 0]
testset <-testset[,colSums(is.na(testset)) == 0]
trainFit <- trainFit[,-c(1:7)]
testset <-testset[,-c(1:7)]
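The column filter above can be illustrated on a small example. This is a minimal sketch on a made-up data frame (`toy` and its column names are hypothetical, not part of the HAR data); it only shows how `colSums(is.na(...)) == 0` selects the fully observed columns.

```r
# Toy data frame standing in for the real training set (values are made up).
toy <- data.frame(
  id       = 1:4,
  roll     = c(1.2, 1.4, 1.1, 1.3),
  kurtosis = c(NA, NA, 0.5, NA),   # mostly missing, like a sparse summary column
  classe   = factor(c("A", "B", "A", "C"))
)

# Count missing values per column, then keep only fully observed columns.
na_counts     <- colSums(is.na(toy))
complete_cols <- names(na_counts)[na_counts == 0]
toy_clean     <- toy[, complete_cols]

print(na_counts)
print(names(toy_clean))
```

Because the training and test files are filtered independently in this report, it may be worth checking that both end up with the same predictor columns before predicting.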
The main data set is split into training and testing subsets (70% for training, 30% for testing), using the random seed 433.
set.seed(433)
trainIn <- createDataPartition(y=trainFit$classe, p=.70, list=FALSE)
training <- trainFit[trainIn,]
testing <- trainFit[-trainIn,]
dim(trainFit);dim(training);dim(testing)
## [1] 19622 53
## [1] 13737 53
## [1] 5885 53
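To make the idea of a stratified 70/30 split concrete, the following base-R sketch samples row indices within each class. The helper `stratified_indices` is hypothetical, and the `ceiling` rounding is an assumption that happens to reproduce the subset sizes reported above; it is not presented as caret's actual implementation. The class totals are the sums of the training and testing counts shown in this report.

```r
# Class totals in the full data set (training count + testing count per class).
class_totals <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Hypothetical helper: sample ceiling(p * n) row indices within each class.
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(433)
# Synthetic outcome vector with the same class sizes as the real data.
y <- factor(rep(names(class_totals), times = class_totals))
train_idx <- stratified_indices(y, p = 0.70)

train_counts <- table(y[train_idx])
test_counts  <- table(y[-train_idx])
print(train_counts)
print(test_counts)
```

Sampling within each class keeps the class proportions nearly identical in both subsets, which is why the prevalence figures match so closely between training and testing.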
The six participants in the study were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
plot(training$classe, col="red", main="Frequency Distribution of the CLASSE Variable - Training Data Set", xlab="classe categories", ylab="Freq.")
summary(training$classe);summary(testing$classe)
## A B C D E
## 3906 2658 2396 2252 2525
## A B C D E
## 1674 1139 1026 964 1082
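The class counts above can be turned into proportions to check that the split preserved the class balance. This small sketch uses only the counts printed in this report; the variable names are illustrative. The largest class share in the testing subset is also the accuracy a model would achieve by always predicting class A, which corresponds to the "No Information Rate" line in the confusion matrix output.

```r
# Class counts copied from the summaries above.
train_counts <- c(A = 3906, B = 2658, C = 2396, D = 2252, E = 2525)
test_counts  <- c(A = 1674, B = 1139, C = 1026, D = 964,  E = 1082)

# Convert counts to shares of each subset.
train_share <- prop.table(train_counts)
test_share  <- prop.table(test_counts)
print(round(rbind(train = train_share, test = test_share), 4))

# Accuracy of always guessing the most frequent class in the testing subset.
no_info_rate <- max(test_share)
print(round(no_info_rate, 4))
```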
The objective is to build a classification (predictive) model that predicts “classe” from the remaining variables in the tidy data set.
A decision tree will be used to demonstrate the application of a single model to understand the data and try to predict “classe”.
dt1 <- rpart(classe ~ . , data=training, method="class")
pd1 <- predict(dt1,testing, type = "class")
rpart.plot(dt1, main="HAR Data Set - RPart Predictive Model", extra=102, under=TRUE, faclen=0)
confusionMatrix(pd1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1535 157 16 49 27
## B 51 711 111 115 123
## C 45 147 799 155 130
## D 21 88 64 613 89
## E 22 36 36 32 713
##
## Overall Statistics
##
## Accuracy : 0.7427
## 95% CI : (0.7314, 0.7539)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6739
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9170 0.6242 0.7788 0.6359 0.6590
## Specificity 0.9409 0.9157 0.9018 0.9468 0.9738
## Pos Pred Value 0.8604 0.6400 0.6262 0.7006 0.8498
## Neg Pred Value 0.9661 0.9103 0.9507 0.9299 0.9269
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2608 0.1208 0.1358 0.1042 0.1212
## Detection Prevalence 0.3031 0.1888 0.2168 0.1487 0.1426
## Balanced Accuracy 0.9289 0.7700 0.8403 0.7913 0.8164
The accuracy of the rpart decision tree on the testing subset is 74.3% (95% CI: 73.1% to 75.4%).
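The accuracy and sensitivity figures can be recomputed by hand from the confusion matrix, which helps in reading the output. The sketch below re-enters the decision tree's matrix from the output above (rows are predictions, columns are reference classes); the object names are illustrative.

```r
# Decision tree confusion matrix, copied from the output above.
cm_tree <- matrix(
  c(1535, 157,  16,  49,  27,
      51, 711, 111, 115, 123,
      45, 147, 799, 155, 130,
      21,  88,  64, 613,  89,
      22,  36,  36,  32, 713),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm_tree)) / sum(cm_tree)

# Sensitivity per class: correct predictions over the true class total.
sensitivity <- diag(cm_tree) / colSums(cm_tree)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The per-class values show that the tree recognises class A well but confuses classes B, D and E much more often.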
Random forest will be used to build a model based on combined classifiers of the same type. Again, the variable “classe” in the HAR data set will be predicted from the remaining variables after tidying the data.
library(randomForest)
rf1 <- randomForest(classe ~ ., data=training)
pd2=predict(rf1, testing, type ="class")
head(pd2)
## 1 12 14 15 21 30
## A A A A A A
## Levels: A B C D E
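A random forest also provides an internal out-of-bag (OOB) error estimate, which can serve as a sanity check alongside the held-out testing subset. The following is a minimal sketch on the built-in `iris` data rather than the HAR data; `rf_demo` and the choice of `ntree = 200` are assumptions made for illustration only.

```r
library(randomForest)

set.seed(433)
# Small stand-in model on iris; not the model fitted in this report.
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 200)

# OOB error after the last tree: each case is scored only by trees
# that did not see it during training.
oob_error <- rf_demo$err.rate[rf_demo$ntree, "OOB"]
print(oob_error)

# Variable importance (mean decrease in Gini impurity by default).
print(importance(rf_demo))
```

The same two calls could be applied to `rf1` to compare its OOB error with the error measured on the testing subset.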
As expected, an ensemble classifier such as random forest produces a model with a higher level of accuracy. The accuracy of the random forest (99.3%) is substantially higher than that of a single model like rpart (74.3%). With the random forest, fewer than 1% of instances (about 0.75%) are expected to be misclassified (expected out-of-sample error rate).
pd2=predict(rf1, testing, type ="class")
confusionMatrix(pd2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 11 0 0 0
## B 3 1124 12 0 0
## C 0 4 1010 5 1
## D 0 0 4 958 2
## E 1 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9868 0.9844 0.9938 0.9972
## Specificity 0.9974 0.9968 0.9979 0.9988 0.9996
## Pos Pred Value 0.9935 0.9868 0.9902 0.9938 0.9981
## Neg Pred Value 0.9990 0.9968 0.9967 0.9988 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1910 0.1716 0.1628 0.1833
## Detection Prevalence 0.2856 0.1935 0.1733 0.1638 0.1837
## Balanced Accuracy 0.9975 0.9918 0.9912 0.9963 0.9984
EOOSE = (1 - confusionMatrix(pd2, testing$classe)$overall[[1]])
EOOSE
## [1] 0.007476636
The estimated out-of-sample error rate (EOOSE) on the testing data set is calculated as 1 - confusionMatrix(pd2, testing$classe)$overall[[1]], which gives about 0.75%. Given this small estimated error rate, the model fit is considered satisfactory.
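The error rate can also be reproduced from the raw counts, together with an interval that conveys its uncertainty. This sketch uses the diagonal of the random forest confusion matrix above and base R's `binom.test` (an exact binomial interval); applying that interval to the error rate is an illustrative choice here, not something computed in the original analysis.

```r
# Counts taken from the random forest confusion matrix above.
n_total   <- 5885
n_correct <- 1670 + 1124 + 1010 + 958 + 1079
n_wrong   <- n_total - n_correct

# Point estimate of the out-of-sample error rate.
error_rate <- n_wrong / n_total
print(error_rate)

# Exact binomial 95% confidence interval for the error rate.
error_ci <- binom.test(n_wrong, n_total)$conf.int
print(error_ci)
```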
Using the random forest model, the 20 cases provided in the test set (pml-testing.csv) are predicted.
testpred <- predict(rf1, testset, type="class")
testpred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E