The data for this prediction assignment comes from Human Activity Recognition (HAR), a research area that has been gaining interest, especially for the development of context-aware systems. The dataset contains five classes, recorded in the "classe" variable, which denote the manner in which the subjects performed an exercise; the classes are labelled A through E. The goal of this project is to predict the manner in which they did the exercise.
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. In: Advances in Artificial Intelligence - SBIA 2012, Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin/Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
The raw training data set has 19622 observations on 160 variables. This data set is itself split into a training partition and a test partition, and the most accurate model is then applied to the original test set of 20 cases.
The expected out-of-sample error is the proportion of misclassified observations when the model is applied to the held-out test partition, i.e. one minus the accuracy of the model fit.
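As a minimal sketch of this definition (assuming hypothetical factor vectors pred and truth holding the predicted and observed classes), the error can be computed from caret's confusionMatrix output:
library(caret)
# Out-of-sample error as the complement of accuracy
# `pred` and `truth` are hypothetical placeholders for predicted and observed classes
oos.error <- function(pred, truth) {
  1 - confusionMatrix(pred, truth)$overall["Accuracy"]
}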
This analysis first fits a single decision tree through recursive partitioning, then uses an ensemble learning method, random forests, which constructs a multitude of decision trees and combines their outputs to reduce overfitting.
# Load the packages used for modelling and plotting
packages <- c("caret", "rpart", "rpart.plot", "rattle", "randomForest", "kernlab")
invisible(lapply(packages, library, character.only = TRUE, quietly = TRUE))
## Warning: package 'caret' was built under R version 3.2.4
## Warning: package 'ggplot2' was built under R version 3.2.4
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
set.seed(33833)
# Read the training and test data, treating blanks and "#DIV/0!" entries as missing
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- read.csv(url(trainURL), na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv(url(testURL), na.strings = c("NA", "#DIV/0!", ""))
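A quick dimension check of the raw files (they are expected to contain 160 columns each, the first seven being identifier fields):
dim(train)   # expected: 19622 rows, 160 columns
dim(test)    # expected: 20 rows, 160 columns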
The first seven columns (record index, user name, timestamps, and window indicators) identify the observations rather than describe the movements, so they are not needed to predict the manner in which the subjects performed the exercises and can be deleted.
train <- train[,-c(1:7)]; test <- test[,-c(1:7)]
Remove variables with over 70% missing data (arbitrary threshold) from the training and test sets.
# Keep only the columns with at most 70% missing values
clean.train <- train[, colMeans(is.na(train)) <= 0.7]
clean.test <- test[, colMeans(is.na(test)) <= 0.7]
dim(train); dim(clean.train)
## [1] 19622 153
## [1] 19622 53
dim(test); dim(clean.test)
## [1] 20 153
## [1] 20 53
This process removed 100 variables.
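As a sanity check (for this dataset the columns that survive the threshold are expected to be complete), one can confirm that no missing values remain:
sum(is.na(clean.train))   # expected to be 0
sum(is.na(clean.test))    # expected to be 0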
The training set is partitioned into two sets, train.train (70%) and test.train (30%).
cval <- createDataPartition(y=clean.train$classe, p=0.7, list=FALSE)
train.train <- clean.train[cval, ]
test.train <- clean.train[-cval, ]
Bar plot of class counts:
ggplot(data = train.train, aes(x = classe, fill = classe)) +
  geom_bar(colour = "#FF9999") +
  labs(title = "Bar Plot for Class Variable", x = "Class", y = "Count") +
  scale_fill_brewer(palette = "Spectral", name = "Class")
Class A accounts for a noticeably larger share of the observations, whereas classes B through E have roughly similar counts.
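A quick numeric check of the class distribution (counts and proportions) complements the bar plot:
table(train.train$classe)
round(prop.table(table(train.train$classe)), 3)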
Recursive partitioning builds a decision tree that classifies members of the population by splitting it into sub-populations through a sequence of binary splits on the predictor variables.
modelFitRP <- rpart(classe ~ ., data = train.train, method = "class")
predictionRP <- predict(modelFitRP, test.train, type = "class")
fancyRpartPlot(modelFitRP)
confusionMatrix(predictionRP, test.train$classe)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.439252e-01 6.745133e-01 7.325698e-01 7.550383e-01 2.844520e-01
## AccuracyPValue McnemarPValue
## 0.000000e+00 2.118203e-59
The accuracy of the recursive partitioning model on the held-out set is about 74%, corresponding to an expected out-of-sample error of roughly 26%.
Random forests construct a multitude of decision trees at training time and output the class that is the mode of the individual trees' predictions, which corrects the single tree's tendency to overfit the training data.
modelFitRF <- randomForest(classe ~ ., data=train.train, method = "class",
importance = TRUE, proximity = TRUE)
print(modelFitRF)
##
## Call:
## randomForest(formula = classe ~ ., data = train.train, method = "class", importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.52%
## Confusion matrix:
## A B C D E class.error
## A 3904 1 0 0 1 0.0005120328
## B 15 2636 7 0 0 0.0082768999
## C 0 19 2376 1 0 0.0083472454
## D 0 0 19 2230 3 0.0097690941
## E 0 0 1 5 2519 0.0023762376
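To illustrate that the forest's output is the mode of the individual trees' votes, the per-tree predictions can be inspected; this is a sketch using randomForest's predict.all option on the objects fitted above:
votes <- predict(modelFitRF, test.train, predict.all = TRUE)
head(votes$aggregate)           # majority-vote predictions for the first rows
table(votes$individual[1, ])    # how the 500 trees voted on the first observation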
predictionRF <- predict(modelFitRF, test.train, type = "class")
confusionMatrix(test.train$classe, predictionRF)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9920136 0.9898977 0.9893938 0.9941262 0.2841121
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
The random forest model is significantly more accurate, so it is chosen as the prediction model for the original testing data. The expected out-of-sample error is 1 - 0.9920 = 0.0080, or about 0.8%, consistent with the model's OOB error estimate of 0.52%.
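A compact side-by-side comparison of the two models on the held-out set, using the predictions already computed above:
acc.rp <- confusionMatrix(predictionRP, test.train$classe)$overall["Accuracy"]
acc.rf <- confusionMatrix(predictionRF, test.train$classe)$overall["Accuracy"]
data.frame(model = c("rpart", "randomForest"),
           accuracy = round(c(acc.rp, acc.rf), 4),
           oos.error = round(1 - c(acc.rp, acc.rf), 4))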
The model will now be applied to the 20 test cases available in the test data:
finpred <- predict(modelFitRF, clean.test, type="class")
finpred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E