Synopsis and objective

Human activity recognition (HAR) has had an influx of data thanks to devices like Fitbit, Jawbone Up and even some smartphones. Analyzing this data could be useful for a wide range of applications, from at-home elderly care to extreme sports. Most often the data is used in a quantitative or discriminative sense, to determine how much of an exercise was performed or which exercise it was. In this report, however, we’ll take data from Groupware’s HAR research (1) on bicep curls to determine the quality of the activity. The data was collected from six participants performing bicep curls in specific manners, both correctly (labeled A) and incorrectly (bad form, labeled B, C, D and E), via 4 accelerometers located on the bicep, waist, forearm and the dumbbell itself. Let’s not try to duplicate their report, but rather validate or reject it.
(1)http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201

Caret

The study will deploy a variety of machine learning approaches in R using the “caret” package. Caret, short for Classification And REgression Training, is a set of functions that attempt to streamline the process of creating predictive models.

The Attack

Preprocess

After a little exploring, we find that the data contained many NAs in columns that held summary statistics compiled from other columns of raw data. These columns were removed, and the remaining set was tested for near-zero variance with caret’s nearZeroVar function.

Truncated sample
                     freqRatio percentUnique zeroVar   nzv
X                         1.00        100.00   FALSE FALSE
user_name                 1.10          0.03   FALSE FALSE
raw_timestamp_part_1      1.00          4.27   FALSE FALSE
raw_timestamp_part_2      1.00         85.53   FALSE FALSE
cvtd_timestamp            1.00          0.10   FALSE FALSE
new_window               47.33          0.01   FALSE  TRUE
num_window                1.00          4.37   FALSE FALSE
roll_belt                 1.10          6.78   FALSE FALSE
pitch_belt                1.04          9.38   FALSE FALSE
yaw_belt                  1.06          9.97   FALSE FALSE
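
To make the freqRatio and percentUnique columns above less mysterious, here is a minimal base R sketch of how those two metrics can be computed by hand. The helper nzv_metrics and the two toy columns are made up for illustration. The cutoffs 95/19 and 10 are caret’s documented defaults for freqCut and uniqueCut, but the rest is an approximation of the idea, not caret’s actual implementation.

```r
# Hypothetical helper (not part of caret): recompute the near-zero-variance
# metrics for a single column.
nzv_metrics <- function(x, freq_cut = 95 / 19, unique_cut = 10) {
  # counts of each distinct value, largest first (table() drops NAs)
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  zero_var <- length(counts) <= 1
  # count of the most common value divided by count of the runner-up
  freq_ratio <- if (zero_var) 0 else counts[1] / counts[2]
  # distinct values as a percentage of all rows
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Toy columns (assumed data): a lopsided flag, similar in spirit to
# new_window, and a well-spread measurement.
flag <- c(rep("no", 142), rep("yes", 3))
measurement <- seq_len(145) / 10

metrics <- rbind(flag = nzv_metrics(flag),
                 measurement = nzv_metrics(measurement))
print(metrics)
```

A column is flagged only when one value dominates and there are few distinct values overall, which is why new_window is the only row marked TRUE in the sample above.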


The next step was to take a look at the principal components to get an idea of what the data might look like and to understand which components carry the most variance. This is useful for model selection and for choosing the number of components needed to explain a desired percentage of the variance.
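
As a small illustration of the "components needed to explain a desired percent of variance" idea, the sketch below runs prcomp on R’s built-in iris measurements. The iris data is only a stand-in for the sensor columns, and the 95% threshold and variable names are assumptions chosen for the example.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris set.
X <- iris[, 1:4]

# Center and scale before PCA so that no column dominates because of its units.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 95%.
n_comp_95 <- which(cum_var >= 0.95)[1]

print(round(100 * var_explained, 1))
print(n_comp_95)
```

The same calculation on the cleaned training predictors gives the curve shown in the scree plot of this report.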

Model building

As you can see from our previous analysis, 20 or so components make up the majority of our data variance (left plot). Also, the structure of the data on the right suggests that a classification approach would be appropriate. Let’s train 3 different models and compare.
Although we do have a test set, it carries no class labels, so I’m going to split my training data into a training and a test set as well. This will allow me to test my results against answers that I know are correct, so I can get an idea of how accurate the models will be. Let’s see the accuracies of Random Forests, Gradient Boosting and Linear Discriminant Analysis. We’ll set all training controls to 10-fold cross-validation; we don’t want to use too many folds, in order to keep the computing time down. In addition, we will use parallel processing to help speed things up.
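
The split and the fold assignment can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy classe vector is invented, the split samples 75% of the rows within each class (similar in spirit to what caret’s createDataPartition does for a factor outcome), and the folds here are plain random folds rather than whatever caret builds internally.

```r
set.seed(25)  # same seed value as the report, for reproducibility

# Toy outcome vector (assumed data) standing in for the classe column.
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 30, 20, 20, 30)))

# Stratified 75/25 split: sample 75% of the row indices within each class.
rows_by_class <- split(seq_along(classe), classe)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(classe), train_idx)

# 10-fold cross-validation: give every training row a fold number from 1 to 10.
folds <- sample(rep(1:10, length.out = length(train_idx)))

print(table(classe[train_idx]))
print(table(folds))
```

Each model is then fitted 10 times, each time holding out one fold, and the held-out accuracies are averaged. The separate 25% test set is never touched during that process.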

##          Random Forests Gradient Boosting Linear Discriminant
## Accuracy      0.9794046         0.8085237           0.5375204
Confusion Matrix Tables for RF, GBM and LDA (rows: prediction, columns: reference)

Random Forests
     A    B    C    D    E
A 1382    4    2    1    1
B    6  929    9    6    7
C    2   14  833   18    8
D    3    1   11  777    3
E    2    1    0    2  882

Gradient Boosting
     A    B    C    D    E
A 1211   85   54   32   34
B   49  696   53   20   72
C   52   96  706   97   51
D   73   39   29  634   26
E   10   33   13   21  718

Linear Discriminant Analysis
     A    B    C    D    E
A  940  192  211   58   88
B  119  416   94  143  182
C  113  174  481  131  115
D  196  106   50  377   94
E   27   61   19   95  422

Final predictions

Random Forests shows the best promise for a good prediction model, with an estimated accuracy of 98% and an out-of-sample error of about 2%.
Let’s see the full statistics.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1382    4    2    1    1
##          B    6  929    9    6    7
##          C    2   14  833   18    8
##          D    3    1   11  777    3
##          E    2    1    0    2  882
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9794         
##                  95% CI : (0.975, 0.9832)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.974          
##                                          
##  Mcnemar's Test P-Value : 0.02267        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9907   0.9789   0.9743   0.9664   0.9789
## Specificity            0.9977   0.9929   0.9896   0.9956   0.9988
## Pos Pred Value         0.9942   0.9707   0.9520   0.9774   0.9944
## Neg Pred Value         0.9963   0.9949   0.9945   0.9934   0.9953
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2818   0.1894   0.1699   0.1584   0.1799
## Detection Prevalence   0.2834   0.1951   0.1784   0.1621   0.1809
## Balanced Accuracy      0.9942   0.9859   0.9819   0.9810   0.9888
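
The headline numbers can be re-derived by hand from the Random Forests confusion matrix printed above. The sketch below simply retypes that table into a matrix (the variable names are my own) and recomputes accuracy, out-of-sample error and per-class sensitivity.

```r
# Random Forests confusion matrix from the report
# (rows: prediction, columns: reference).
rf_table <- matrix(
  c(1382,   4,   2,   1,   1,
       6, 929,   9,   6,   7,
       2,  14, 833,  18,   8,
       3,   1,  11, 777,   3,
       2,   1,   0,   2, 882),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(rf_table)) / sum(rf_table)

# Estimated out-of-sample error is the complement of the accuracy.
oos_error <- 1 - accuracy

# Sensitivity per class: correct predictions over the true count of that class.
sensitivity <- diag(rf_table) / colSums(rf_table)

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```

With 4803 of the 4904 held-out cases on the diagonal, this reproduces the 0.9794 accuracy reported by confusionMatrix, and the error of about 2% follows directly.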

Conclusion and answers

Now let’s create a model based on the entire training data set (not just a subset) and use that model to predict our test questions. Why use the entire set now? Mainly because we can and probably should: more training data generally helps, and predictions cast from a model built on 75% of the training data can differ slightly from those of a model built on 100%. In the run shown below, the two models agree on all 20 test cases.

## [1] "Prediction using subset of training data"
##  [1] B A C A A E D B A A B C B A E E A B B B
## Levels: A B C D E
## [1] "prediction using all available training data"
##  [1] B A C A A E D B A A B C B A E E A B B B
## Levels: A B C D E
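
A quick way to compare the two sets of predictions is to measure how often they agree. The sketch below retypes the two vectors printed above (the variable names are my own) and reports the agreement rate and the positions of any disagreements.

```r
# Predictions copied from the output above.
pred_subset <- c("B", "A", "C", "A", "A", "E", "D", "B", "A", "A",
                 "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
pred_full   <- c("B", "A", "C", "A", "A", "E", "D", "B", "A", "A",
                 "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

# Share of test cases on which both models give the same class.
agreement <- mean(pred_subset == pred_full)

# Positions of the test cases where the two models disagree (none if empty).
disagreements <- which(pred_subset != pred_full)

print(agreement)
print(disagreements)
```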

Final thoughts

Random Forests was a great option for this kind of problem. It is often worth at least attempting multiple methods, in order to establish not only a viable framework but the one that performs best among the alternatives.

Code

knitr::opts_chunk$set(echo = TRUE) 
library(dplyr)
library(caret)
library(ggplot2)
library(gridExtra)
library(xtable)
trainFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download files
if (!file.exists("./pml-training.csv")) {
        mydir<- paste0(getwd(),"/","pml-training.csv")
        download.file(trainFileUrl, destfile = mydir)
}
if (!file.exists("./pml-testing.csv")){
        mydir<- paste0(getwd(),"/","pml-testing.csv")
        download.file(testFileUrl, destfile = mydir)
}

# read files into R (stringsAsFactors = FALSE keeps text columns as character)
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
testing <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)

# reproducibility
set.seed(25)
# use caret to find near-zero-variance predictors:
# keep only the columns that don't contain NAs, then
# filter out the near-zero-variance covariates that are unlikely
# to be useful in a real-world analysis
clntrain <- training[colSums(is.na(training))==0]
nsv <- nearZeroVar(clntrain, saveMetrics = TRUE)

# clean training data: drop near-zero-variance columns and identifiers
clntrain <- clntrain[,nsv$nzv==FALSE]
clntrain <- select(clntrain, -c(X,
                                user_name,
                                cvtd_timestamp
                                #raw_timestamp_part_1,
                                #raw_timestamp_part_2,
                                #num_window
                                ))
# classe was read as character; make it a factor so the models
# treat this as a classification problem
clntrain$classe <- as.factor(clntrain$classe)

print(xtable(nsv[1:10,]), type = "html")
# PCA analysis: scree plot of the percentage of variance per component
predictors <- clntrain[, names(clntrain) != "classe"]
plotPcPros <- prcomp(predictors, scale. = TRUE, center = TRUE)
pctVar <- plotPcPros$sdev^2 / sum(plotPcPros$sdev^2) * 100
plot1 <- qplot(x = seq_along(plotPcPros$sdev),
        y = pctVar,
        xlab = "Principal Components",
        ylab = "Percentage of Variance Explained",
        main = "Principal Component Variance",
        geom = c("point", "line"))

# first two principal components, colored by exercise class
pcPros <- preProcess(predictors, method = c("pca", "center", "scale"))
pc <- predict(pcPros, predictors)
Exercise <- clntrain$classe
plot2 <- qplot(x = pc[,1], y = pc[,2], color = Exercise,
        xlab = "PC1",
        ylab = "PC2",
        main = "Principal Component Analysis") + theme(legend.position = "bottom")

grid.arrange(plot1, plot2, ncol = 2)
# no classe data in the testing data set which looking
# back now makes sense, soooooo split it to win it
inTrain <- createDataPartition(clntrain$classe, p = .75)[[1]]
xTraining <- clntrain[inTrain,]
xTesting <- clntrain[-inTrain,]

#turn on the afterburners... permission to buzz the tower
library(parallel)
library(doParallel)
library(knitr)
clustr <- makeCluster(detectCores()-1)
registerDoParallel(clustr)
#fit control: 10-fold cross-validation, parallel processing allowed,
#PCA keeps enough components to explain 95% of the variance
fitControl <- trainControl(method = "cv",
                           number = 10,
                           allowParallel = TRUE,
                           preProcOptions = list(thresh = .95))
#fit three models random forest, gradient boosting,
#and linear discriminant analysis
fitRf<- train(classe ~.,
                        data = xTraining,
                        method = "parRF",
                        preProcess = "pca",
                        trControl = fitControl)
#wrapped in capture.output to keep the verbose gbm output out of the report
garbage <- capture.output(
        fitGbm <- train(classe ~., data = xTraining,
                method = "gbm",
                preProcess = "pca",
                trControl = fitControl))

fitLda <- train(classe ~., data = xTraining,
                method = "lda",
                preProcess = "pca",
                trControl = fitControl)

predRf<- predict(fitRf, xTesting)
predGbm<- predict(fitGbm, xTesting)
predLda<- predict(fitLda, xTesting)


confMatRf <- confusionMatrix(predRf, as.factor(xTesting$classe))
confMatGbm <- confusionMatrix(predGbm, as.factor(xTesting$classe))
confMatLda <- confusionMatrix(predLda, as.factor(xTesting$classe))

#show results
Accuracy <- data.frame(confMatRf$overall[1],confMatGbm$overall[1],
                       confMatLda$overall[1])
names(Accuracy)<- c("Random Forests", "Gradient Boosting",
                    "Linear Discriminant")

Accuracy
References <- kable(list(confMatRf$table,
                         "|","|",confMatGbm$table,
                         "|","|",confMatLda$table),
                    caption = c("Confusion Matrix Tables - left to right are:
                                RF, GBM, LDA"),
                    align = c("c"))
References
confMatRf
lastfitRf<- train(classe ~.,
                        data = clntrain,
                        method = "parRF",
                        preProcess = "pca",
                        trControl = fitControl)
stopCluster(clustr)
lastPred<- predict(lastfitRf, testing)
print("Prediction using subset of training data")
predict(fitRf, testing)
print("prediction using all available training data")
lastPred