Practical machine learning

Data processing

Preprocessing

Loading necessary libraries and setting the seed:

library(caret)
library(rpart)
library(rattle)
library(randomForest)

set.seed(12345)

Downloading and reading training and testing datasets:

trainingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

training <- read.csv(trainingURL, na.strings=c("NA",""), header=TRUE)
testing <- read.csv(testURL, na.strings=c("NA",""), header=TRUE)

Removing from both datasets every column that contains at least one NA:

indexForNA_training <- apply(training,2,function(x) {sum(is.na(x))}) 
training <- training[,which(indexForNA_training == 0)]

indexForNA_testing <- apply(testing,2,function(x) {sum(is.na(x))}) 
testing <- testing[,which(indexForNA_testing == 0)]
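The same filter can be written more compactly with colSums(). The sketch below is only an illustration on a made-up data frame (the names toy, naCount and toyComplete are hypothetical, not part of the course data). It shows that the condition "NA count equals zero" drops every column with at least one NA, not only the columns that are entirely NA.

```r
# Toy data frame (hypothetical): column b has one NA, column c is all NA.
toy <- data.frame(a = c(1, 2, 3),
                  b = c(1, NA, 3),
                  c = c(NA, NA, NA))

# Count NAs per column; is.na() on a data frame returns a logical matrix.
naCount <- colSums(is.na(toy))

# Keep only the columns with zero NAs; drop = FALSE keeps a data frame
# even when a single column survives.
toyComplete <- toy[, naCount == 0, drop = FALSE]
names(toyComplete)
```

Only column a survives, because both the partly missing and the fully missing column have a nonzero NA count.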

Setting classe as factor:

training$classe <- as.factor(training$classe)

Preprocessing the numeric columns (k-nearest-neighbour imputation, centering and scaling):

numericCol <- which(lapply(training, class) %in% "numeric")

preObj <-preProcess(training[,numericCol],method=c('knnImpute', 'center', 'scale'))
trainPreProcessed <- predict(preObj, training[,numericCol])
trainPreProcessed$classe <- training$classe

testingPreProcessed <-predict(preObj,testing[,numericCol])
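The important detail in this step is that the transformation is estimated on the training data and then reused unchanged on the testing data. The base R sketch below illustrates only the centering and scaling part on made-up numbers (trainToy, testToy, mu and sigma are hypothetical names); the imputation step is not shown.

```r
# Toy numeric data (hypothetical values).
trainToy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))
testToy  <- data.frame(x = c(3, 6),          y = c(30, 0))

# Statistics are estimated on the training data only.
mu    <- colMeans(trainToy)
sigma <- apply(trainToy, 2, sd)

# scale() subtracts 'center' and divides by 'scale', column by column.
trainScaled <- scale(trainToy, center = mu, scale = sigma)
testScaled  <- scale(testToy,  center = mu, scale = sigma)

round(colMeans(trainScaled), 10)   # training columns are centred at 0
testScaled                         # test rows use the training mean and sd
```

A test value equal to the training mean maps to 0, while the test columns as a whole are generally not centred at 0, which is expected.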

Removing the variables with near-zero variance:

nzvTraining <- nearZeroVar(trainPreProcessed,saveMetrics=TRUE)
trainPreProcessed <- trainPreProcessed[,nzvTraining$nzv==FALSE]

nzvTesting <- nearZeroVar(testingPreProcessed,saveMetrics=TRUE)
testingPreProcessed <- testingPreProcessed[,nzvTesting$nzv==FALSE]

After preprocessing, the training set has 28 variables (27 predictors plus classe) of the initial 160.
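For intuition about the near-zero-variance filter, the sketch below computes the two metrics that nearZeroVar() reports, the frequency ratio and the percentage of unique values, using its default cut-offs (freqCut = 95/5, uniqueCut = 10). The helper nzvMetrics is hypothetical and simplified: it ignores NA handling and is not caret's implementation.

```r
# Hypothetical helper mirroring the two metrics reported by nearZeroVar().
nzvMetrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one.
  freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  percentUnique <- 100 * length(counts) / length(x)
  list(freqRatio = freqRatio,
       percentUnique = percentUnique,
       nzv = length(counts) == 1 ||
             (freqRatio > freqCut && percentUnique <= uniqueCut))
}

almostConstant <- c(rep(0, 99), 1)   # 99 zeros and a single 1
spread         <- 1:100              # every value distinct

nzvMetrics(almostConstant)$nzv
nzvMetrics(spread)$nzv
```

The almost constant vector is flagged (frequency ratio 99, 2% unique values), while the vector of distinct values is kept.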

Cross validation

Dividing the training set into two parts: three quarters for training and one quarter held out for validation:

inTrain <- createDataPartition(trainPreProcessed$classe, p = 3/4, list=FALSE)
trainingPart <- trainPreProcessed[inTrain,]

testingPart <- trainPreProcessed[-inTrain,]
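With a factor outcome, createDataPartition() samples within each class, so both parts keep roughly the class proportions of the full set. The base R sketch below illustrates that idea on made-up labels (labels, idxByClass and inTrainToy are hypothetical names); caret's exact rounding may differ.

```r
set.seed(1)  # arbitrary seed for the illustration

# Hypothetical labels with unequal class sizes.
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 12)))

# Sample 75% of the row indices inside each class, then combine them.
idxByClass <- split(seq_along(labels), labels)
inTrainToy <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

table(labels[inTrainToy])    # 30 A, 15 B, 9 C
table(labels[-inTrainToy])   # 10 A, 5 B, 3 C
```

Both parts keep the 10:5:3 ratio of the original labels, which a plain random sample of rows would only match approximately.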

Decision Tree Model

Fitting a decision tree:

decisiontree <- train(classe~.,method="rpart", data=trainingPart)
fancyRpartPlot(decisiontree$finalModel)

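Predicting on the held-out part and evaluating the result with a confusion matrix. The call passes the true labels as the first argument, so the rows labelled Prediction in the output hold the actual classes and the Reference columns hold the predicted ones; the overall accuracy and kappa are unaffected by this ordering: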

predictions <- predict(decisiontree,newdata = testingPart)
confusionMatrix(testingPart$classe, predictions)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 870   2 380 138   5
##          B 162 176 329 282   0
##          C  29  16 710 100   0
##          D  46   4 352 402   0
##          E  16   4 264 228 389
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5194          
##                  95% CI : (0.5053, 0.5334)
##     No Information Rate : 0.415           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4002          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7747  0.87129   0.3489  0.34957  0.98731
## Specificity            0.8611  0.83560   0.9495  0.89291  0.88647
## Pos Pred Value         0.6237  0.18546   0.8304  0.50000  0.43174
## Neg Pred Value         0.9279  0.99343   0.6728  0.81756  0.99875
## Prevalence             0.2290  0.04119   0.4150  0.23450  0.08034
## Detection Rate         0.1774  0.03589   0.1448  0.08197  0.07932
## Detection Prevalence   0.2845  0.19352   0.1743  0.16395  0.18373
## Balanced Accuracy      0.8179  0.85344   0.6492  0.62124  0.93689

The decision tree fits poorly: its accuracy on the held-out part is only about 52%.
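The accuracy and kappa values in the output can be recomputed by hand from the table of counts. The sketch below copies the counts from the decision tree confusion matrix above and uses only base R (cm, expected and kappaStat are names chosen for this illustration).

```r
# Counts copied from the decision tree confusion matrix above.
cm <- matrix(c(870,   2, 380, 138,   5,
               162, 176, 329, 282,   0,
                29,  16, 710, 100,   0,
                46,   4, 352, 402,   0,
                16,   4, 264, 228, 389),
             nrow = 5, byrow = TRUE,
             dimnames = list(row = LETTERS[1:5], col = LETTERS[1:5]))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa rescales accuracy so that 0 means chance-level agreement.
kappaStat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappaStat), 4)
```

This reproduces the reported accuracy of 0.5194 and kappa of 0.4002. Both values are symmetric in rows and columns, so they do not depend on which axis holds the predictions.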

Random forest

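A random forest is trained next because of its high accuracy. Cross-validation is used as the train control method; the output below reports 10 folds.

modFit <- train(classe ~ .,
                method="rf",
                data=trainingPart,
                # number and allowParallel are trainControl() arguments;
                # 10 folds matches the resampling reported below
                trControl=trainControl(method='cv',
                                       number=10,
                                       allowParallel=TRUE))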

modFit

## Random Forest 
## 
## 14718 samples
##    27 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13247, 13246, 13245, 13246, 13247, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9927982  0.9908903  0.002504764  0.003168151
##   14    0.9921190  0.9900311  0.002583399  0.003267831
##   27    0.9899450  0.9872823  0.003048269  0.003854566
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
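The sample sizes in the resampling summary follow from how k-fold cross-validation splits the rows: each resample trains on everything outside one fold. The sketch below is a simplified illustration with purely random folds (foldId and trainSizes are hypothetical names). caret builds its folds within each class, so its sizes can differ by a row or two, as the 13245 in the output shows.

```r
set.seed(1)  # arbitrary seed for the illustration

n <- 14718   # number of samples reported in the output above
k <- 10      # number of folds reported in the output above

# Assign every row to one of k folds of nearly equal size, in random order.
foldId <- sample(rep(seq_len(k), length.out = n))

# Each resample trains on all rows outside one fold.
trainSizes <- n - as.vector(table(foldId))
trainSizes
```

Every training size comes out as 13246 or 13247, in line with the summary printed by train().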

The accuracy on the training part and on the held-out validation part is computed next.

Training set:

trainingPartPrediction <- predict(modFit, trainingPart)
confusionMatrix(trainingPartPrediction, trainingPart$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4185    0    0    0    0
##          B    0 2848    0    0    0
##          C    0    0 2567    0    0
##          D    0    0    0 2412    0
##          E    0    0    0    0 2706
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Cross validation set:

testingPartPrediction <- predict(modFit, testingPart)
confusionMatrix(testingPartPrediction, testingPart$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395   12    0    0    0
##          B    0  933    5    0    0
##          C    0    4  844    8    1
##          D    0    0    6  795    2
##          E    0    0    0    1  898
## 
## Overall Statistics
##                                           
##                Accuracy : 0.992           
##                  95% CI : (0.9891, 0.9943)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9899          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9831   0.9871   0.9888   0.9967
## Specificity            0.9966   0.9987   0.9968   0.9980   0.9998
## Pos Pred Value         0.9915   0.9947   0.9848   0.9900   0.9989
## Neg Pred Value         1.0000   0.9960   0.9973   0.9978   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1903   0.1721   0.1621   0.1831
## Detection Prevalence   0.2869   0.1913   0.1748   0.1637   0.1833
## Balanced Accuracy      0.9983   0.9909   0.9920   0.9934   0.9982
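The held-out accuracy gives an estimate of the out-of-sample error, and the per-class sensitivities can be checked directly from the counts. The sketch below copies the counts from the validation confusion matrix above and uses base R only (cmRf, oosError and sensByClass are names chosen for this illustration).

```r
# Counts copied from the validation confusion matrix above
# (rows: predicted class, columns: actual class).
cmRf <- matrix(c(1395,  12,   0,   0,   0,
                    0, 933,   5,   0,   0,
                    0,   4, 844,   8,   1,
                    0,   0,   6, 795,   2,
                    0,   0,   0,   1, 898),
               nrow = 5, byrow = TRUE,
               dimnames = list(Prediction = LETTERS[1:5],
                               Reference  = LETTERS[1:5]))

accuracyRf <- sum(diag(cmRf)) / sum(cmRf)

# Estimated out-of-sample error rate on the held-out part.
oosError <- 1 - accuracyRf

# Sensitivity per class: correct predictions divided by actual cases.
sensByClass <- diag(cmRf) / colSums(cmRf)

round(oosError, 4)
round(sensByClass, 4)
```

The estimated out-of-sample error is about 0.8% (39 misclassified cases out of 4904), compared with an error of 0 on the training part, which is the usual optimistic result for a random forest evaluated on its own training data.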

The end: predictions on the real testing set

testingPrediction <- predict(modFit, testingPreProcessed)
testingPrediction

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Practical machine learning - course work

Zanin Pavel

April 25, 2016

Introduction