Machine Learning in R

The Attack

Preprocess

After a little exploring, we find that the data contained many NAs in columns that held summary statistics compiled from other columns of raw data. These columns were removed, and the remaining set was tested for near-zero variance, documentation here.

Truncated sample

	freqRatio	percentUnique	zeroVar	nzv
X	1.00	100.00	FALSE	FALSE
user_name	1.10	0.03	FALSE	FALSE
raw_timestamp_part_1	1.00	4.27	FALSE	FALSE
raw_timestamp_part_2	1.00	85.53	FALSE	FALSE
cvtd_timestamp	1.00	0.10	FALSE	FALSE
new_window	47.33	0.01	FALSE	TRUE
num_window	1.00	4.37	FALSE	FALSE
roll_belt	1.10	6.78	FALSE	FALSE
pitch_belt	1.04	9.38	FALSE	FALSE
yaw_belt	1.06	9.97	FALSE	FALSE
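
The two metrics in the table can be reproduced by hand. The following is a minimal base R sketch, assuming the usual definitions: freqRatio is the count of the most common value divided by the count of the second most common, and percentUnique is the share of distinct values. The toy data frame, the helper name `nzv_metrics`, and the cutoffs (95/5 and 10) are assumptions made for illustration; this is not the caret implementation.

```r
# Toy data invented for illustration; not the exercise data set.
toy <- data.frame(
  new_window = c(rep("no", 98), rep("yes", 2)),  # almost constant
  roll_belt  = seq(1, 100),                      # every value distinct
  avg_roll   = c(1.5, rep(NA, 99)),              # summary column, mostly NA
  stringsAsFactors = FALSE
)

# Step 1: keep only the columns that contain no NAs
toy <- toy[colSums(is.na(toy)) == 0]

# Step 2: compute the near-zero-variance metrics for one column
# (hypothetical helper; cutoffs are assumed values)
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # count of the most common value over count of the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  pct_unique <- 100 * length(counts) / length(x)
  zero_var   <- length(counts) <= 1
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = pct_unique,
    zeroVar       = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && pct_unique <= unique_cut)
  )
}

metrics <- do.call(rbind, lapply(toy, nzv_metrics))
print(metrics)
```

A column is flagged only when both conditions hold: one value dominates and there are few distinct values. That is why new_window is flagged in the table above while user_name, which also has few distinct values, is not.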

The next step was to take a look at the principal components to get an idea of what the data might look like and to understand which components carry the most variance. This is useful for model selection and for choosing the number of components needed to explain a desired percentage of the variance.
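
As a rough sketch of the idea, the share of variance per component and the number of components needed to reach a target can be read off `prcomp` directly. The built-in iris measurements stand in for the sensor columns here, and the 95% target is an assumed value; both are only there so the example runs anywhere.

```r
# iris measurements are a stand-in for the sensor data (an assumption)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# smallest number of components whose cumulative variance reaches the target
target <- 0.95
n_comp <- which(cum_var >= target)[1]

print(round(100 * var_explained, 1))
print(n_comp)
```

The same calculation on the cleaned training data gives the curve in the variance plot, and `n_comp` is the kind of number a PCA threshold selects for you.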

Model building

As you can see from our previous analysis, 20 or so components make up the majority of our data variance (left plot). The structure of the data on the right also suggests that a classification approach would be appropriate. Let’s train 3 different models and compare.
Although we do have a test set, I’m going to split my training data into a training set and a test set as well. This lets me test the models against results that I know are correct, so I can get an idea of how accurate they will be. Let’s see the accuracies of Random Forests, Gradient Boosting and Linear Discriminant Analysis. We’ll set all training controls to 10-fold cross-validation; we don’t want to use too many folds, to keep the computing time down. In addition, we will use parallel processing to help speed things up.
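
To make the split concrete, here is a minimal base R sketch of a 75/25 split stratified by class, followed by a 10-fold assignment of the training rows. It is a hand-rolled illustration, not the caret routines used in the report; iris stands in for the data, and the names `in_train` and `fold_id` are hypothetical.

```r
set.seed(25)

# iris$Species stands in for the outcome column (illustrative assumption)
outcome <- iris$Species

# sample 75% of the row indices within each class, so class proportions
# are preserved in both pieces
in_train <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = ceiling(0.75 * length(idx)))
}), use.names = FALSE)

train_set <- iris[in_train, ]
test_set  <- iris[-in_train, ]

# assign each training row to one of 10 folds of near-equal size
# (plain random assignment, not stratified)
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(train_set)))

print(table(train_set$Species))
print(table(fold_id))
```

Each fold is held out once while the model is fit on the other nine, so more folds means more model fits, which is the computing cost mentioned above.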

##          Random Forests Gradient Boosting Linear Discriminant
## Accuracy      0.9794046         0.8085237           0.5375204

Confusion Matrix Tables - in order: RF, GBM, LDA (rows are predictions, columns are reference classes)

	A	B	C	D	E
A	1382	4	2	1	1
B	6	929	9	6	7
C	2	14	833	18	8
D	3	1	11	777	3
E	2	1	0	2	882

	A	B	C	D	E
A	1211	85	54	32	34
B	49	696	53	20	72
C	52	96	706	97	51
D	73	39	29	634	26
E	10	33	13	21	718

	A	B	C	D	E
A	940	192	211	58	88
B	119	416	94	143	182
C	113	174	481	131	115
D	196	106	50	377	94
E	27	61	19	95	422

Final predictions

Random Forests shows the most promise as a prediction model, with an estimated accuracy of about 98% on the held-out data, which corresponds to an out-of-sample error of about 2%.
Let’s see the full statistics.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1382    4    2    1    1
##          B    6  929    9    6    7
##          C    2   14  833   18    8
##          D    3    1   11  777    3
##          E    2    1    0    2  882
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9794         
##                  95% CI : (0.975, 0.9832)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.974          
##                                          
##  Mcnemar's Test P-Value : 0.02267        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9907   0.9789   0.9743   0.9664   0.9789
## Specificity            0.9977   0.9929   0.9896   0.9956   0.9988
## Pos Pred Value         0.9942   0.9707   0.9520   0.9774   0.9944
## Neg Pred Value         0.9963   0.9949   0.9945   0.9934   0.9953
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2818   0.1894   0.1699   0.1584   0.1799
## Detection Prevalence   0.2834   0.1951   0.1784   0.1621   0.1809
## Balanced Accuracy      0.9942   0.9859   0.9819   0.9810   0.9888
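
The headline numbers above follow directly from the confusion matrix. As a small worked check in base R, using the Random Forests counts from the report (the variable names are only illustrative):

```r
# Random Forests confusion matrix from the report:
# rows are predictions, columns are reference classes
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1382,   4,   2,   1,   1,
       6, 929,   9,   6,   7,
       2,  14, 833,  18,   8,
       3,   1,  11, 777,   3,
       2,   1,   0,   2, 882),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# sensitivity: correct predictions over all true members of each class
sensitivity <- diag(cm) / colSums(cm)

# positive predictive value: correct predictions over all predictions
# made for each class
pos_pred <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred, 4))
```

The out-of-sample error estimate is simply `1 - accuracy`, which is where the figure of about 2% comes from.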

Conclusion and answers

Now we build a model on the entire training data set (not just a subset) and use that model to predict our test questions. Why use the entire set now? Mainly because we can and probably should, but also because predictions from a model built on 75% of the training data can differ slightly from those of a model built on 100% of it (in the run shown here, the two sets of predictions agree).

## [1] "Prediction using subset of training data"

##  [1] B A C A A E D B A A B C B A E E A B B B
## Levels: A B C D E

## [1] "prediction using all available training data"

##  [1] B A C A A E D B A A B C B A E E A B B B
## Levels: A B C D E
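
The two prediction vectors printed above can be compared element by element. A minimal base R sketch, with the values copied from the output above and a hypothetical helper `to_factor` to parse them:

```r
# hypothetical helper: turn a space-separated string into a factor
to_factor <- function(s) {
  factor(strsplit(s, " ")[[1]], levels = c("A", "B", "C", "D", "E"))
}

# predictions copied from the printed output above
pred_subset <- to_factor("B A C A A E D B A A B C B A E E A B B B")
pred_full   <- to_factor("B A C A A E D B A A B C B A E E A B B B")

# share of test cases where both models agree, and where they differ
agreement   <- mean(pred_subset == pred_full)
disagree_at <- which(pred_subset != pred_full)

print(agreement)
print(disagree_at)
```

Any index in `disagree_at` would be a test case worth a closer look, since the two models cannot both be right there.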

Final thoughts

Random Forests was a great option for this kind of prediction problem. It is often worth attempting several methods, not only to establish a viable framework, but to find the one that performs best among the candidates.

Code

knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(caret)
library(ggplot2)
library(gridExtra)
library(xtable)
trainFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testFileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download files
if (!file.exists("./pml-training.csv")) {
        mydir<- paste0(getwd(),"/","pml-training.csv")
        download.file(trainFileUrl, destfile = mydir)
}
if (!file.exists("./pml-testing.csv")){
        mydir<- paste0(getwd(),"/","pml-testing.csv")
        download.file(testFileUrl, destfile = mydir)
}

#read files into R *****edit stringsAsFactors = false
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
testing <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)

# reproducibility
set.seed(25)

# use caret to find near-zero-variance columns:
# create training data from the columns that don't contain NAs and
# filter out the near-zero-variance covariates that might not be
# applicable to real-world analysis
clntrain <- training[colSums(is.na(training))==0]
nsv <- nearZeroVar(clntrain, saveMetrics = TRUE)

# clean training data
clntrain <- clntrain[,nsv$nzv==FALSE]
clntrain <- select(clntrain, -c(X,
                                user_name,
                                cvtd_timestamp
                                #raw_timestamp_part_1,
                                #raw_timestamp_part_2,
                                #num_window
                                ))

print(xtable(nsv[1:10,]), type = "html")

#pca analysis
plotPcPros <- prcomp(clntrain[,-56], scale. = TRUE, center = TRUE)
plot1 <- qplot(x = seq_along(plotPcPros$sdev),
        y = plotPcPros$sdev^2 / sum(plotPcPros$sdev^2) * 100,
        xlab = "Principal Components",
        ylab = "Percentage of Variance Explained",
        main = "Principal Component Variance",
        geom = c("point", "line"))

pcPros <- preProcess(clntrain[,-56], method = c("pca", "center", "scale"))
pc <- predict(pcPros, clntrain[,-56])
Exercise <- clntrain$classe
plot2 <-qplot(x = pc[,1], y = pc[,2], color = Exercise,
        xlab = "PC1",
        ylab = "PC2",
        main = "Principal Component Analysis") + theme(legend.position = "bottom")

grid.arrange(plot1, plot2, ncol = 2)

# no classe data in the testing data set which looking
# back now makes sense, soooooo split it to win it
inTrain <- createDataPartition(clntrain$classe, p = .75)[[1]]
xTraining <- clntrain[inTrain,]
xTesting <- clntrain[-inTrain,]

#turn on the afterburners... permission to buzz the tower
library(parallel)
library(doParallel)
library(knitr)
clustr <- makeCluster(detectCores()-1)
registerDoParallel(clustr)
#fit control: 10-fold cross validation (caret's default for "cv", made
#explicit here), with parallel processing allowed
fitControl <- trainControl(method = "cv",
                           number = 10,
                           allowParallel = TRUE,
                           preProcOptions = list(thresh = .95))
#fit three models random forest, gradient boosting,
#and linear discriminant analysis
fitRf<- train(classe ~.,
                        data = xTraining,
                        method = "parRF",
                        preProcess = "pca",
                        trControl = fitControl)
#wrapped in capture.output to keep the gbm training output out of the report
garbage <- capture.output(
        fitGbm <- train(classe ~., data = xTraining,
                method = "gbm",
                preProcess = "pca",
                trControl = fitControl))

fitLda <- train(classe ~., data = xTraining,
                method = "lda",
                preProcess = "pca",
                trControl = fitControl)

predRf<- predict(fitRf, xTesting)
predGbm<- predict(fitGbm, xTesting)
predLda<- predict(fitLda, xTesting)


confMatRf <- confusionMatrix(predRf, as.factor(xTesting$classe))
confMatGbm <- confusionMatrix(predGbm, as.factor(xTesting$classe))
confMatLda <- confusionMatrix(predLda, as.factor(xTesting$classe))

#show results
Accuracy <- data.frame(confMatRf$overall[1],confMatGbm$overall[1],
                       confMatLda$overall[1])
names(Accuracy)<- c("Random Forests", "Gradient Boosting",
                    "Linear Discriminant")

Accuracy
# render the three confusion matrices as one group of tables
confTables <- kable(list(confMatRf$table,
                         confMatGbm$table,
                         confMatLda$table),
                    caption = "Confusion Matrix Tables - in order: RF, GBM, LDA",
                    align = "c")
confTables

confMatRf

lastfitRf<- train(classe ~.,
                        data = clntrain,
                        method = "parRF",
                        preProcess = "pca",
                        trControl = fitControl)
stopCluster(clustr)
lastPred<- predict(lastfitRf, testing)
print("Prediction using subset of training data")
predict(fitRf, testing)
print("prediction using all available training data")
lastPred

gr3n4d3s

January 19, 2018

Synopsis and objective

Caret