Background

The data for this prediction assignment come from Human Activity Recognition (HAR), a research area of growing interest, especially in the development of context-aware systems. The outcome is the “classe” variable, which has 5 classes (coded A through E in the data) denoting the “manner” in which the subjects performed an exercise. These are: “sitting-down, standing-up, standing, walking, and sitting.” The goal of this project is to predict the manner in which the subjects did the exercise.

Citation

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Cross-Validation & Expected Out-of-Sample Error

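The training data set has 19622 observations on 160 variables (153 once the first seven columns are dropped, as the dimensions printed below show). It is split into a training subset and a held-out subset, and the more accurate of the two models on the held-out subset is then applied to the original test set.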

The expected out-of-sample error is the expected proportion of misclassified observations in new data. It is estimated as the complement of the accuracy (1 - accuracy) measured on the held-out portion of the training data.
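As a minimal sketch of this definition, the error rate is the share of predictions that disagree with the true class. The vectors truth and pred below are made-up toy values, not data from this analysis.

```r
# Toy labels (hypothetical, for illustration only)
truth <- factor(c("A", "B", "A", "C", "B", "A"), levels = c("A", "B", "C"))
pred  <- factor(c("A", "B", "C", "C", "B", "A"), levels = c("A", "B", "C"))

accuracy  <- mean(pred == truth)  # proportion classified correctly
oos.error <- 1 - accuracy         # proportion misclassified
print(c(accuracy = accuracy, error = oos.error))
```

The same arithmetic applies to the Accuracy value that confusionMatrix reports on held-out data.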

This analysis first constructs a decision tree through recursive partitioning, then uses an ensemble learning method – random forests – to construct a multitude of decision trees so as to combine outputs and correct overfitting.

Preliminary Work

Load packages & set seed

packages <- c("caret", "rpart", "rpart.plot", "rattle", "randomForest", "kernlab")
invisible(lapply(packages, library, character.only = TRUE, quietly = TRUE))
## Warning: package 'caret' was built under R version 3.2.4
## Warning: package 'ggplot2' was built under R version 3.2.4
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
set.seed(33833)

Load Data

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

Read both files, treating “NA”, “#DIV/0!”, and empty strings as missing values (NA)

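# stringsAsFactors = TRUE keeps classe a factor under R >= 4.0 (it was the default in R 3.x)
train <- read.csv(url(trainURL), na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = TRUE)
test <- read.csv(url(testURL), na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = TRUE)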
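The effect of na.strings can be checked on a small inline table. This is only a sketch: the three-row CSV text below is invented for illustration and is not part of the assignment data.

```r
# Hypothetical CSV content containing the three kinds of missing markers
csv.text <- "x,y,classe\n1,#DIV/0!,A\nNA,2.5,B\n3,,A"

toy <- read.csv(text = csv.text,
                na.strings = c("NA", "#DIV/0!", ""),
                stringsAsFactors = TRUE)

# All three markers are read as NA, so y stays numeric instead of text
print(colSums(is.na(toy)))
str(toy)
```

Without the na.strings argument, the “#DIV/0!” entry would force column y to be read as text rather than as a number.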

Clean data

Remove irrelevant variables

The first seven columns are not needed to predict the manner in which the subjects performed the exercises, so they can be deleted.

train <- train[,-c(1:7)]; test <- test[,-c(1:7)]

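Remove variables with more than 70% missing values (an arbitrary threshold). The columns to keep are chosen from the training set, and the same selection is applied to the test set so that both sets keep matching columns.

# Share of missing values per column, computed on the training set only
na.share <- colMeans(is.na(train))
keep <- na.share <= 0.7
# Apply the same selection to both sets (their columns are in the same order)
clean.train <- train[, keep]
clean.test <- test[, keep]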

dim(train); dim(clean.train)
## [1] 19622   153
## [1] 19622    53
dim(test); dim(clean.test)
## [1]  20 153
## [1] 20 53

This process removed 100 variables, leaving 53 columns (in the training set, 52 predictors plus the classe outcome).
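The missing-value filter can be illustrated on a tiny data frame. This is a sketch on invented values; the name toy and its columns are hypothetical, and only the 70% threshold is taken from the text above.

```r
# Hypothetical data: column a is 25% missing, b is 75% missing, c is 100% missing
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, NA, 4),
                  c = c(NA, NA, NA, NA),
                  classe = c("A", "B", "A", "C"))

na.share  <- colMeans(is.na(toy))    # share of NA values per column
keep      <- na.share <= 0.7         # same 70% threshold as above
toy.clean <- toy[, keep, drop = FALSE]

print(na.share)
print(names(toy.clean))
```

Columns b and c exceed the threshold and are dropped, while a and classe are kept.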

Cross-Validation

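The cleaned training set is partitioned into 2 sets: train.train (70%), used to fit the models, and test.train (30%), held out to estimate accuracy. This is a single hold-out split rather than k-fold cross-validation.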

cval <- createDataPartition(y=clean.train$classe, p=0.7, list=FALSE)
train.train <- clean.train[cval, ] 
test.train <- clean.train[-cval, ]
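For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly preserved in both subsets. The sketch below imitates that idea in base R on made-up labels. It is an approximation for illustration, not caret's actual implementation, and the seed and class sizes are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed for this toy example
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
by.class  <- split(seq_along(classe), classe)
train.idx <- unlist(lapply(by.class, function(i) {
  sample(i, size = round(0.7 * length(i)))
}))

prop.table(table(classe))             # class shares in the full toy data
prop.table(table(classe[train.idx]))  # class shares in the 70% subset
```

Both tables show the same class shares, which is the property a stratified split is meant to guarantee.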

Exploratory Analysis

Bar plot of class counts:

ggplot(data = train.train, aes(x = classe, fill = classe)) + 
        geom_bar(colour = "#FF9999") + 
        labs(title = "Bar Plot for Class Variable", x = "Class", y = "Count") +
        scale_fill_brewer(palette = "Spectral", name = "Class")

Class A is clearly the most frequent class, whereas classes B through E have roughly similar counts.
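The imbalance can also be expressed as class shares. The counts below are the row totals of the random forest confusion matrix printed later in this report, which covers every row of train.train. With the data loaded, table(train.train$classe) should presumably give the same numbers.

```r
# Class counts in train.train (row totals of the OOB confusion matrix in this report)
counts <- c(A = 3906, B = 2658, C = 2396, D = 2252, E = 2525)

shares <- round(counts / sum(counts), 3)
print(sum(counts))  # total number of rows in train.train
print(shares)       # share of each class
```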

Prediction

Recursive Partitioning

Recursive partitioning classifies observations by repeatedly splitting the data into sub-populations, where each split is a dichotomous (yes/no) rule on one of the independent variables.

modelFitRP <- rpart(classe ~ ., data = train.train, method = "class")
predictionRP <- predict(modelFitRP, test.train, type = "class")
fancyRpartPlot(modelFitRP)

confusionMatrix(predictionRP, test.train$classe)$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.439252e-01   6.745133e-01   7.325698e-01   7.550383e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   2.118203e-59

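The accuracy of recursive partitioning on the held-out set is about 74% (95% confidence interval: 73.3% to 75.5%), so roughly one observation in four is misclassified.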
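To see what such dichotomous splits look like on a small scale, the same rpart call can be tried on R's built-in iris data. This is a sketch on a different data set, chosen only because it ships with R. The accuracy it prints is measured on the training rows themselves, so it is optimistic compared with a held-out estimate.

```r
library(rpart)

# Same kind of call as above, on the built-in iris data
toy.fit  <- rpart(Species ~ ., data = iris, method = "class")
toy.pred <- predict(toy.fit, iris, type = "class")

# Variables the tree actually split on (leaf rows are marked "<leaf>")
split.vars <- setdiff(unique(as.character(toy.fit$frame$var)), "<leaf>")
print(split.vars)

# Resubstitution accuracy (optimistic: measured on the training rows)
print(mean(toy.pred == iris$Species))
```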

Random Forest

A random forest constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees, which reduces the overfitting of any single tree.
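The "mode of the classes" step amounts to a majority vote. The sketch below shows only that voting step on invented votes from three hypothetical trees. It is not how randomForest stores or combines its trees internally.

```r
# Hypothetical votes: one row per observation, one column per tree
votes <- matrix(c("A", "A", "B",
                  "B", "B", "B",
                  "C", "A", "C"),
                nrow = 3, byrow = TRUE)

# For each observation, pick the class that most trees voted for
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
print(majority)
```

Each observation receives the class chosen by at least two of the three trees, so one tree's mistake is outvoted by the others.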

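# Note: method = "class" is not a randomForest() argument and has no effect; classification is used because classe is a factor
modelFitRF <- randomForest(classe ~ ., data=train.train, method = "class",
                           importance = TRUE, proximity = TRUE)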
print(modelFitRF)
## 
## Call:
##  randomForest(formula = classe ~ ., data = train.train, method = "class",      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.52%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    1    0    0    1 0.0005120328
## B   15 2636    7    0    0 0.0082768999
## C    0   19 2376    1    0 0.0083472454
## D    0    0   19 2230    3 0.0097690941
## E    0    0    1    5 2519 0.0023762376
predictionRF <- predict(modelFitRF, test.train, type = "class")

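# Note: caret expects confusionMatrix(data = predictions, reference = truth); with the order
# used here Accuracy and Kappa are unaffected, but AccuracyNull is based on the predictions
confusionMatrix(test.train$classe, predictionRF)$overall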
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9920136      0.9898977      0.9893938      0.9941262      0.2841121 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

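The random forest model is far more accurate (99.2% versus 74.4% on the held-out set), so this method is chosen as the prediction model for the original testing data. The expected out-of-sample error, estimated as 1 - accuracy on the held-out set, is 0.0080 or about 0.8%, in line with the OOB estimate of 0.52%.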
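Both error figures can be recomputed from the output printed above. The sketch below copies the OOB confusion matrix and the held-out Accuracy value from this report. The object names are hypothetical.

```r
# OOB confusion matrix as printed by print(modelFitRF): rows = true class, columns = predicted
oob <- matrix(c(3904,    1,    0,    0,    1,
                  15, 2636,    7,    0,    0,
                   0,   19, 2376,    1,    0,
                   0,    0,   19, 2230,    3,
                   0,    0,    1,    5, 2519),
              nrow = 5, byrow = TRUE,
              dimnames = list(LETTERS[1:5], LETTERS[1:5]))

oob.error     <- 1 - sum(diag(oob)) / sum(oob)  # off-diagonal share
holdout.error <- 1 - 0.9920136                  # 1 - Accuracy on test.train

print(round(100 * c(OOB = oob.error, holdout = holdout.error), 2))  # in percent
```

The OOB figure comes from rows each tree did not see during training, while the hold-out figure comes from the 30% of rows set aside before fitting.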

Final Prediction

The model will now be applied to the 20 test cases available in the test data:

finpred <- predict(modelFitRF, clean.test, type="class")
finpred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E