Course Project: Machine Learning Based Exercise Manner Prediction

Terence LIU, 2015/9


Background

This is a course project for Practical Machine Learning in the Data Science Specialization. The main work of this report is to apply machine learning (M-L) techniques, namely PCA preprocessing followed by a Random Forest classifier, to the prediction of exercise manner. More background information is available from <http://groupware.les.inf.puc-rio.br/har>.

Data Loading & Cleaning

With the working directory set to the folder containing the data files, we load the essential packages and the training and testing data. Note that all predictor variables are converted to numeric type.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(gridExtra)
## Loading required package: grid
train_dat <- read.csv("pml-training.csv",header=TRUE)
vali_dat <- read.csv("pml-testing.csv",header=TRUE)
train_dat <- train_dat[,-1] # the 1st column, X, is a row index, not a predictor
vali_dat <- vali_dat[,-1]
# convert every column except the last one to numeric
for (i in 1:(dim(train_dat)[2]-1)) { train_dat[,i] <- as.numeric(train_dat[,i])}
for (i in 1:(dim(vali_dat)[2]-1)) { vali_dat[,i] <- as.numeric(vali_dat[,i])}
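
One caveat about the conversion loop, sketched here on a toy vector rather than the project data: when a column is stored as a factor (as text columns were by default in `read.csv` before R 4.0), `as.numeric` returns the internal level codes rather than the printed values. The vector `x` below is a made-up example.

```r
# Toy factor whose labels look like numbers (hypothetical data).
x <- factor(c("10.5", "3.2", "10.5"))

# as.numeric() on a factor returns the integer level codes, not the labels.
codes <- as.numeric(x)

# Converting through character first recovers the printed values.
values <- as.numeric(as.character(x))

print(codes)
print(values)
```

Whether the level codes are acceptable depends on the column: for a genuinely categorical column they act as an arbitrary integer encoding, while for a numeric column read as text they would distort the values.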

We further drop the train_dat variables that contain NA values to obtain more efficient and reasonable models. After that, the number of variables is reduced to 92.

ind <- NULL
for (i in 1:(dim(train_dat)[2])) {
ind[i] <- all(!is.na(train_dat[,i]))
}
train_dat <- train_dat[,ind]
vali_dat <- vali_dat[,ind]
dim(train_dat)[2]
## [1] 92
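
The same column filter can be written without an explicit loop. The following is a minimal sketch on a hypothetical toy data frame (`toy`, `complete_cols`, and `toy_clean` are illustrative names, not part of the project code).

```r
# Hypothetical toy data: column b contains a missing value.
toy <- data.frame(a = c(1, 2, 3), b = c(4, NA, 6), c = c(7, 8, 9))

# TRUE for columns with no NA anywhere.
complete_cols <- colSums(is.na(toy)) == 0

# drop = FALSE keeps the result a data frame even if only one column remains.
toy_clean <- toy[, complete_cols, drop = FALSE]
print(names(toy_clean))
```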

Then we filter out the near-zero-variance variables, which contribute little to the model. After that, the number of variables is reduced to 58.

ind2 <- nearZeroVar(train_dat[,-dim(train_dat)[2]])
train_dat <- train_dat[,-ind2]
vali_dat <- vali_dat[,-ind2]
dim(train_dat)[2]
## [1] 58
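
To make the near-zero-variance criterion more concrete, here is a simplified base-R sketch of the kind of check involved. It assumes the two cut-offs that caret documents as defaults for `nearZeroVar` (a frequency ratio of 95/5 and 10 percent unique values); the function `near_zero_var_sketch` and the toy columns are hypothetical and are not caret's implementation.

```r
# Simplified sketch of a near-zero-variance check; not caret's implementation.
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # a single distinct value
  freq_ratio <- counts[[1]] / counts[[2]]         # most common vs. runner-up
  pct_unique <- 100 * length(counts) / length(x)  # distinct values, in percent
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Hypothetical columns: one almost constant, one well spread.
almost_constant <- c(rep(0, 99), 1)
well_spread <- seq_len(100)

flag_constant <- near_zero_var_sketch(almost_constant)
flag_spread <- near_zero_var_sketch(well_spread)
print(c(flag_constant, flag_spread))
```

A column is flagged only when one value dominates and there are few distinct values overall, which is why the almost constant column is flagged and the well spread one is not.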

Model Construction & Prediction

PCA Preprocessing

The clean dataset is split into a 60% training subset and a 40% testing subset.

inTrain <- createDataPartition(y=train_dat$classe,p=0.6,list=FALSE)
training <- train_dat[inTrain,]
testing <- train_dat[-inTrain,]
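
For a factor outcome, `createDataPartition` samples within each class so that both subsets keep roughly the same class proportions. The idea can be sketched in base R as follows; the labels, the seed value, and the variable names are assumptions made for illustration, and this is not caret's code. Setting a seed first makes the split reproducible.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

# Hypothetical labels with unequal class sizes.
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class.
in_train <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}))

train_labels <- labels[in_train]
test_labels <- labels[-in_train]
print(table(train_labels))
print(table(test_labels))
```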

Then PCA is employed to transform the data into another space in which the variables are uncorrelated with one another.

preProc <- preProcess(training[,-58],method=c("BoxCox","center","scale","pca"),thresh=0.8)
trainPC <- predict(preProc,training[,-58])
preProc
## 
## Call:
## preProcess.default(x = training[, -58], method = c("BoxCox",
##  "center", "scale", "pca"), thresh = 0.8)
## 
## Created from 11776 samples and 57 variables
## Pre-processing: Box-Cox transformation, centered, scaled,
##  principal component signal extraction 
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.600   0.700   0.700   1.086   1.450   2.000      50 
## 
## PCA needed 14 components to capture 80 percent of the variance

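PCA needs 14 principal components to capture 80% of the variance in the run shown above. The random partition is not seeded, so this count can vary slightly between runs.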
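
How a variance threshold translates into a number of components can be illustrated with base R's `prcomp`. The sketch below uses the built-in `USArrests` dataset purely as a stand-in for the project data and omits the Box-Cox step; `thresh` and `n_comp` are illustrative names.

```r
# USArrests is a dataset shipped with base R, used here only as a stand-in.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest number of components whose cumulative share reaches the threshold.
thresh <- 0.8
n_comp <- which(cumsum(var_explained) >= thresh)[1]
print(round(cumsum(var_explained), 3))
print(n_comp)
```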

To look for interesting patterns in how the PCs separate the outcome classes, we plot x-y scatterplots of several pairs of PCs.

# plot objects are named p1..p4 to avoid masking base functions such as c()
p1 <- ggplot(trainPC,aes(PC1,PC2,color=training$classe))+geom_point()
p2 <- ggplot(trainPC,aes(PC1,PC3,color=training$classe))+geom_point()
p3 <- ggplot(trainPC,aes(PC1,PC4,color=training$classe))+geom_point()
p4 <- ggplot(trainPC,aes(PC1,PC5,color=training$classe))+geom_point()
grid.arrange(p1,p2,p3,p4,ncol=2)

RF (Random Forest) Algorithm Training & Testing

In this section, we employ Random Forest, a widely used and powerful algorithm. The model is tuned with 10-fold cross-validation.

if (file.exists("modelRF.rda")) {
  load(file="modelRF.rda")
} else {
  train_cont <- trainControl(method="cv",number=10)
  modelFit <- train(training$classe~.,method="rf",data=trainPC,trControl=train_cont,prox=TRUE)
  save(modelFit,file="modelRF.rda")
}
modelFit
## Random Forest 
## 
## 11776 samples
##    13 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 10598, 10599, 10597, 10600, 10599, 10599, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9613606  0.9511289  0.006532348  0.008247808
##    8    0.9601728  0.9496205  0.007101099  0.008980548
##   14    0.9537190  0.9414631  0.008123538  0.010277867
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
pred <- predict(modelFit,trainPC)
confusionMatrix(pred,training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2279    0    0    0
##          C    0    0 2054    0    0
##          D    0    0    0 1930    0
##          E    0    0    0    0 2165
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Then we test the model on the testing subset.

testPC <- predict(preProc,testing[,-58])
pred <- predict(modelFit,testPC)
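# Note: the true classes are passed first here, so the rows labeled "Prediction"
# in the output hold the true classes and the "Reference" columns hold the predictions.
confusionMatrix(testing$classe,pred)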
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2179   19   18   13    3
##          B   27 1462   29    0    0
##          C   14   12 1331   10    1
##          D    6    1   63 1214    2
##          E    1   13   11   14 1403
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9672          
##                  95% CI : (0.9631, 0.9711)
##     No Information Rate : 0.2838          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9586          
##  Mcnemar's Test P-Value : 1.75e-13        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9784   0.9701   0.9167   0.9704   0.9957
## Specificity            0.9906   0.9912   0.9942   0.9891   0.9939
## Pos Pred Value         0.9763   0.9631   0.9730   0.9440   0.9730
## Neg Pred Value         0.9914   0.9929   0.9813   0.9944   0.9991
## Prevalence             0.2838   0.1921   0.1851   0.1594   0.1796
## Detection Rate         0.2777   0.1863   0.1696   0.1547   0.1788
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9845   0.9807   0.9554   0.9798   0.9948

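The out-of-sample accuracy is 0.9672 (95% CI: 0.9631 to 0.9711), that is, an estimated out-of-sample error of about 3.3%, which indicates that the RF model performs well. The perfect accuracy on the training subset is an in-sample figure and is therefore optimistic.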

Exercise Manner Prediction by RF

Finally, we apply the constructed RF model to the clean vali_dat dataset and obtain the results.

validPC <- predict(preProc,vali_dat[,-58])
res <- predict(modelFit,validPC)
res
##  [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E