Introduction

As part of the course “Practical Machine Learning”, this final assignment analyses the “HAR” data.

According to Leek et al., “one thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants” (Leek, J. et al., 2016, Coursera PML course material).

In this report, a model is built to predict how well people exercise, using a known data set and a set of predictor variables.

This report contains:

* Data acquisition, data understanding and data preparation.
* Exploratory data analysis: understanding the classe variable, and data partitioning into training and testing sets.
* A process to identify and apply predictive models to the data sets, with a description of how each model was built.
* Model selection: comparison and analysis of the expected out-of-sample error, and the reasoning for selecting one model over the other.
* Prediction of 20 different test cases based on the selected model.

The “Fit”" data

Accelerometers placed on the belt, forearm, arm, and dumbbell of 6 participants were used to record data related to physical activity. Individuals were directed to perform barbell lifts correctly and incorrectly in 5 different ways (the five “classe” categories described below), and data was collected on these activities.

Read more: http://groupware.les.inf.puc-rio.br/har#dataset#ixzz4PxFL5Tuh

Data originates from the HAR project. This dataset is licensed under the Creative Commons license (CC BY-SA). Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. http://groupware.les.inf.puc-rio.br/har#ixzz4PxEjxWYN

library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
library(ggplot2)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.5
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.5
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.5
library(RColorBrewer) 
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.5
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(433)
urlT="http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFit <- read.csv(url(urlT),na.strings=c("NA","#DIV/0!",""))
dim(trainFit)
## [1] 19622   160
urls <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testset <- read.csv(url(urls),na.strings=c("NA","#DIV/0!",""))
dim(testset)
## [1]  20 160

Data tidying: columns containing “NA” values are removed. Columns 1-7 are also not necessary, as they are administrative variables (row index, user name, timestamps and window indicators) rather than sensor measurements.

trainFit <- trainFit[, colSums(is.na(trainFit)) == 0]   # keep only columns with no NA values
testset  <- testset[, colSums(is.na(testset)) == 0]
trainFit <- trainFit[, -c(1:7)]                          # drop the administrative columns
testset  <- testset[, -c(1:7)]
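
As a quick sanity check (not part of the original analysis), the column names of the two cleaned data frames can be compared; they are expected to differ only in the last column, classe in the training file versus problem_id in the testing file:

# Columns present in one cleaned data frame but not the other;
# expected result: "classe" (training only) and "problem_id" (testing only)
setdiff(names(trainFit), names(testset))
setdiff(names(testset), names(trainFit))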

The main data set is split into training and testing subsets (70% for training / 30% for testing), using random seed 433.

set.seed(433);
trainIn <- createDataPartition(y=trainFit$classe,p=.70,list=F)
training <- trainFit[trainIn,]
testing <- trainFit[-trainIn,]
 
dim(trainFit);dim(training);dim(testing)
## [1] 19622    53
## [1] 13737    53
## [1] 5885   53

The classe variable

The six participants in the study were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

plot(training$classe, col="red", main="Frequency Distribution of the CLASSE Variable - Training Data Set", xlab="classe categories", ylab="Freq.")

summary(training$classe);summary(testing$classe)
##    A    B    C    D    E 
## 3906 2658 2396 2252 2525
##    A    B    C    D    E 
## 1674 1139 1026  964 1082
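
Because createDataPartition performs stratified sampling on the outcome, the class proportions in the two subsets should be nearly identical; a quick check of this (not part of the original output):

# Class proportions of classe in the training and testing subsets (rounded)
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)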

A Single Predictive Model

The objective is to build a classification (predictive) model that predicts “classe” from the subset of remaining variables in the tidy dataset.

A decision tree will be used to demonstrate the application of a single model to understand the data and try to predict “classe”.

dt1 <- rpart(classe ~ . , data=training, method="class")
pd1 <- predict(dt1,testing, type = "class")
rpart.plot(dt1, main="HAR Data Set - RPart Predictive Model", extra=102, under=TRUE, faclen=0)

confusionMatrix(pd1, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1535  157   16   49   27
##          B   51  711  111  115  123
##          C   45  147  799  155  130
##          D   21   88   64  613   89
##          E   22   36   36   32  713
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7427          
##                  95% CI : (0.7314, 0.7539)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6739          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9170   0.6242   0.7788   0.6359   0.6590
## Specificity            0.9409   0.9157   0.9018   0.9468   0.9738
## Pos Pred Value         0.8604   0.6400   0.6262   0.7006   0.8498
## Neg Pred Value         0.9661   0.9103   0.9507   0.9299   0.9269
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2608   0.1208   0.1358   0.1042   0.1212
## Detection Prevalence   0.3031   0.1888   0.2168   0.1487   0.1426
## Balanced Accuracy      0.9289   0.7700   0.8403   0.7913   0.8164

The accuracy of the rpart decision tree on the testing subset is approximately 74% (0.7427).
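
The same figure can be recomputed directly from the predictions; a minimal check (not in the original report) that accuracy is simply the share of correctly classified hold-out rows:

# Overall accuracy of the rpart model, computed by hand on the testing subset
sum(pd1 == testing$classe) / nrow(testing)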

A Combined Predictive Model

Random Forest will be used to build a model based on combined classifiers of the same type. Again, the variable “classe” in the HAR dataset will be predicted from the subset of remaining variables after tidying the data.

library(randomForest)
rf1 <- randomForest(classe ~. , data=training, method="class")
pd2=predict(rf1, testing, type ="class")
head(pd2)
##  1 12 14 15 21 30 
##  A  A  A  A  A  A 
## Levels: A B C D E
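
Before comparing the two models, the fitted forest itself is worth inspecting (output not shown in this report): printing the object reports the out-of-bag (OOB) error estimate that randomForest computes internally, and varImpPlot ranks the sensor variables by importance.

print(rf1)                       # OOB error estimate and per-class confusion on the bootstrap samples
varImpPlot(rf1, n.var = 15,      # top 15 variables by mean decrease in Gini impurity
           main = "Random Forest - Variable Importance")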

As expected, the use of a combined classifier such as random forest produces a model with a higher level of accuracy. The accuracy of the random forest (99.25%) is significantly higher than the accuracy of a single model like rpart (74%). Using random forest, fewer than 1% of instances are expected to be misclassified (the estimated out-of-sample error rate).
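
A back-of-the-envelope calculation (not part of the original report) shows what this error rate implies for the 20 graded cases: if each case is classified correctly with probability roughly equal to the hold-out accuracy, the chance of getting all 20 right is about 0.9925^20, i.e. roughly 86%.

# Rough probability that all 20 new cases are classified correctly,
# assuming independent errors at the hold-out accuracy of 0.9925
0.9925^20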

Confusion Matrix and Cross-Validation

pd2=predict(rf1, testing, type ="class")
confusionMatrix(pd2, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670   11    0    0    0
##          B    3 1124   12    0    0
##          C    0    4 1010    5    1
##          D    0    0    4  958    2
##          E    1    0    0    1 1079
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9925        
##                  95% CI : (0.99, 0.9946)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9905        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9868   0.9844   0.9938   0.9972
## Specificity            0.9974   0.9968   0.9979   0.9988   0.9996
## Pos Pred Value         0.9935   0.9868   0.9902   0.9938   0.9981
## Neg Pred Value         0.9990   0.9968   0.9967   0.9988   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1910   0.1716   0.1628   0.1833
## Detection Prevalence   0.2856   0.1935   0.1733   0.1638   0.1837
## Balanced Accuracy      0.9975   0.9918   0.9912   0.9963   0.9984
EOOSE = (1 - confusionMatrix(pd2,  testing$classe)$overall[[1]])
EOOSE
## [1] 0.007476636

The estimated out-of-sample error rate (EOOSE) on the testing data set is calculated as 1 - confusionMatrix(pd2, testing$classe)$overall[[1]], i.e. one minus the hold-out accuracy (about 0.75%). Given this small out-of-sample error rate, the model fit is considered satisfactory.
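
As a complementary check (a sketch only; it was not run for this report and is computationally heavy), caret could estimate the out-of-sample error by 5-fold cross-validation on the training subset instead of relying on a single hold-out split:

# 5-fold cross-validation of a random forest with caret (slow on this data set)
ctrl  <- trainControl(method = "cv", number = 5)
rf_cv <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)
rf_cv$results                      # resampled accuracy for each mtry value tried
1 - max(rf_cv$results$Accuracy)    # cross-validated estimate of the error rate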

Predicting new cases

Using the random forest model, the 20 cases provided in the testing file (pml-testing.csv) will be predicted.

testpred <- predict(rf1, testset, type="class")
testpred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
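
For submission, each prediction can be written to its own small text file. The helper below is illustrative only (the course provided a similar function); the function name and file names are assumptions.

# Write one text file per test case: problem_1.txt, problem_2.txt, ...
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(testpred)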