This is a course project for Practical Machine Learning in the Data Science Specialization. The report applies machine learning (ML) techniques, namely principal component analysis (PCA) followed by a random forest, to predict the manner in which an exercise was performed. More background information is available from <http://groupware.les.inf.puc-rio.br/har>.
We set the working directory, load the essential packages, and read the training and testing data. Note that all predictor variables are converted to numeric type.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(gridExtra)
## Loading required package: grid
train_dat <- read.csv("pml-training.csv", header = TRUE)
vali_dat <- read.csv("pml-testing.csv", header = TRUE)
# Drop the first column (X): it is a row index, not a predictor.
train_dat <- train_dat[, -1]
vali_dat <- vali_dat[, -1]
# Convert every column except the last one to numeric.
for (i in seq_len(ncol(train_dat) - 1)) { train_dat[, i] <- as.numeric(train_dat[, i]) }
for (i in seq_len(ncol(vali_dat) - 1)) { vali_dat[, i] <- as.numeric(vali_dat[, i]) }
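The numeric conversion deserves some care: calling as.numeric() on a factor returns its internal level codes rather than the printed values, and whether read.csv() produces factors depends on the R version (stringsAsFactors defaults to FALSE since R 4.0.0). The following is a minimal sketch on made-up inline data, not the project's actual loading code; the toy CSV text and the choice of na.strings (including the spreadsheet error marker "#DIV/0!") are assumptions for illustration.

```r
# Pitfall: as.numeric() on a factor yields level codes, not the values.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # level codes: 1 2 1
values <- as.numeric(as.character(f))   # printed values: 10 20 10

# Assumed alternative: declare the missing-value markers when reading, so
# numeric columns are parsed as numeric without a conversion loop.
csv_text <- "x,y,classe\n1.5,NA,A\n2.5,#DIV/0!,B\n,3,A\n"   # toy data
dat <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
str(dat)
```

With the markers declared up front, columns that hold only numbers and missing-value markers arrive as numeric, and the outcome column is left untouched.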
We further drop the train_dat variables that contain NA values, to obtain more efficient and reasonable models. This reduces the number of variables to 92.
ind <- NULL
for (i in 1:(dim(train_dat)[2])) {
ind[i] <- all(!is.na(train_dat[,i]))
}
train_dat <- train_dat[,ind]
vali_dat <- vali_dat[,ind]
dim(train_dat)[2]
## [1] 92
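The same column filter can be written without an explicit loop. A minimal sketch on a toy data frame (the data and the names toy, keep and toy_clean are illustrative, not from the project):

```r
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), classe = c("A", "B", "A"))

# TRUE for every column that contains no missing value
keep <- colSums(is.na(toy)) == 0

# drop = FALSE keeps a data frame even if only one column survives
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```

Computing the mask once on the training data and applying it to both datasets keeps their columns aligned.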
Then we filter out the variables with near-zero variance, which can contribute very little to the model. This reduces the number of variables to 58.
ind2 <- nearZeroVar(train_dat[,-dim(train_dat)[2]])
train_dat <- train_dat[,-ind2]
vali_dat <- vali_dat[,-ind2]
dim(train_dat)[2]
## [1] 58
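To make "near-zero variance" concrete: nearZeroVar() flags a variable when it is constant, or when the ratio of its most common value to its second most common value is large and the percentage of distinct values is small (its defaults are freqCut = 95/5 and uniqueCut = 10). The base-R function below is a simplified sketch of that rule under a hypothetical name, nzv_sketch; it is not caret's implementation, and the toy columns are made up.

```r
# Simplified sketch of the near-zero-variance rule (hypothetical helper).
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)             # constant column
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. runner-up
  pct_unique <- 100 * length(counts) / length(x)   # share of distinct values
  freq_ratio > freq_cut && pct_unique < unique_cut
}

toy <- data.frame(
  constant = rep(1, 100),
  skewed   = c(rep(0, 98), 1, 2),   # one value dominates
  varied   = 1:100
)
flags <- vapply(toy, nzv_sketch, logical(1))
flags
```

Only the varied column survives here: the first column is constant, and the second has a frequency ratio of 98 with just 3% distinct values.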
The clean dataset is split into a 60% training subset and a 40% testing subset.
inTrain <- createDataPartition(y=train_dat$classe,p=0.6,list=FALSE)
training <- train_dat[inTrain,]
testing <- train_dat[-inTrain,]
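createDataPartition() samples within each outcome class, so the class proportions are similar in both subsets. Because the split is random, the counts and accuracies in this report change between runs unless a seed is fixed. A base-R sketch of the idea on toy labels follows; the seed value and all names are illustrative, and this is not caret's exact algorithm.

```r
set.seed(123)  # arbitrary seed, fixed only so the sketch is reproducible
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))  # toy labels

# Sample 60% of the row indices within each class separately.
idx_by_class <- split(seq_along(y), y)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE)

train_y <- y[in_train]
test_y  <- y[-in_train]
table(train_y)
table(test_y)
```

Each class contributes 60% of its rows to the training subset, so both subsets keep the original 50:30:20 class ratio.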
Then PCA is employed to transform the data into another space in which the variables are uncorrelated with one another.
preProc <- preProcess(training[,-58],method=c("BoxCox","center","scale","pca"),thresh=0.8)
trainPC <- predict(preProc,training[,-58])
preProc
##
## Call:
## preProcess.default(x = training[, -58], method = c("BoxCox",
## "center", "scale", "pca"), thresh = 0.8)
##
## Created from 11776 samples and 57 variables
## Pre-processing: Box-Cox transformation, centered, scaled,
## principal component signal extraction
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.600 0.700 0.700 1.086 1.450 2.000 50
##
## PCA needed 14 components to capture 80 percent of the variance
PCA needs 14 principal components (PCs) to capture the desired 80% of the variance.
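The number of components follows from the cumulative proportion of variance explained. The sketch below uses base R's prcomp() on the built-in iris measurements rather than caret's preProcess(), so it illustrates the idea only; the 0.8 threshold mirrors thresh = 0.8, and the split into "training" and "new" rows is arbitrary.

```r
x_train <- iris[1:100, 1:4]     # stand-in for the training predictors
x_new   <- iris[101:150, 1:4]   # stand-in for unseen data

# Fit PCA on centered and scaled training data only.
pca <- prcomp(x_train, center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches 80%.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
n_comp <- which(cum_var >= 0.8)[1]

# Project both datasets with the rotation learned on the training data.
train_pc <- pca$x[, seq_len(n_comp), drop = FALSE]
new_pc <- predict(pca, newdata = x_new)[, seq_len(n_comp), drop = FALSE]
```

The centering, scaling and rotation are estimated on the training rows and then reused for the new rows, which is the same reason the report applies predict(preProc, ...) to the testing and validation data.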
To look for interesting patterns in how those PCs separate the outcome classes, we draw x-y scatterplots for several pairs of PCs.
# Names p1..p4 avoid masking base functions such as c().
p1 <- ggplot(trainPC, aes(PC1, PC2, color = training$classe)) + geom_point()
p2 <- ggplot(trainPC, aes(PC1, PC3, color = training$classe)) + geom_point()
p3 <- ggplot(trainPC, aes(PC1, PC4, color = training$classe)) + geom_point()
p4 <- ggplot(trainPC, aes(PC1, PC5, color = training$classe)) + geom_point()
grid.arrange(p1, p2, p3, p4, ncol = 2)
In this section, we employ a random forest (RF), a widely used and powerful algorithm. Model performance is estimated by resampling with 10-fold cross-validation.
# Reuse the cached model if it exists; otherwise train it and cache it.
if (file.exists("modelRF.rda")) {
  load(file = "modelRF.rda")
} else {
  train_cont <- trainControl(method = "cv", number = 10)
  modelFit <- train(training$classe ~ ., method = "rf", data = trainPC, trControl = train_cont, prox = TRUE)
  save(modelFit, file = "modelRF.rda")
}
modelFit
## Random Forest
##
## 11776 samples
## 13 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10598, 10599, 10597, 10600, 10599, 10599, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9613606 0.9511289 0.006532348 0.008247808
## 8 0.9601728 0.9496205 0.007101099 0.008980548
## 14 0.9537190 0.9414631 0.008123538 0.010277867
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
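In 10-fold cross-validation each candidate model is fitted on roughly nine tenths of the samples and evaluated on the remaining tenth, which is why the sample sizes listed above are close to 0.9 * 11776 = 10598.4. Below is a minimal base-R sketch of a plain, unstratified fold assignment. caret builds its folds differently (stratified by class), so its exact sizes differ slightly; the seed and the names are illustrative.

```r
set.seed(1)   # arbitrary seed
n <- 11776    # number of training samples reported above
k <- 10       # number of folds

# Assign each sample to one of k folds, as evenly as possible.
fold_id <- sample(rep(seq_len(k), length.out = n))

held_out <- tabulate(fold_id, nbins = k)  # samples evaluated in each fold
fit_size <- n - held_out                  # samples used for fitting
fit_size
```

Every sample is held out exactly once, so the ten held-out accuracies together cover the whole training subset.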
pred <- predict(modelFit,trainPC)
confusionMatrix(pred,training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2279 0 0 0
## C 0 0 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
Then we test the model on the testing subset.
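testPC <- predict(preProc,testing[,-58])
pred <- predict(modelFit,testPC)
# The actual classes are passed first here, so in the table below the rows
# are the actual classes and the columns are the predicted classes.
confusionMatrix(testing$classe,pred)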
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2179 19 18 13 3
## B 27 1462 29 0 0
## C 14 12 1331 10 1
## D 6 1 63 1214 2
## E 1 13 11 14 1403
##
## Overall Statistics
##
## Accuracy : 0.9672
## 95% CI : (0.9631, 0.9711)
## No Information Rate : 0.2838
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9586
## Mcnemar's Test P-Value : 1.75e-13
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9784 0.9701 0.9167 0.9704 0.9957
## Specificity 0.9906 0.9912 0.9942 0.9891 0.9939
## Pos Pred Value 0.9763 0.9631 0.9730 0.9440 0.9730
## Neg Pred Value 0.9914 0.9929 0.9813 0.9944 0.9991
## Prevalence 0.2838 0.1921 0.1851 0.1594 0.1796
## Detection Rate 0.2777 0.1863 0.1696 0.1547 0.1788
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9845 0.9807 0.9554 0.9798 0.9948
The out-of-sample accuracy is 0.9672 (95% CI: 0.9631 to 0.9711), i.e. an estimated out-of-sample error of about 3.3%, which indicates that the RF model performs well.
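The headline figures can be recomputed directly from the counts in the table above. A short base-R check follows (the object names are illustrative); overall accuracy and kappa do not depend on which margin of the table holds the predictions.

```r
classes <- c("A", "B", "C", "D", "E")
# Counts copied from the confusion matrix on the testing subset.
cm <- matrix(c(2179,   19,   18,   13,    3,
                 27, 1462,   29,    0,    0,
                 14,   12, 1331,   10,    1,
                  6,    1,   63, 1214,    2,
                  1,   13,   11,   14, 1403),
             nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # share of correct predictions
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = 1 - accuracy, kappa = kappa_hat), 4)
```

The result reproduces the reported accuracy of 0.9672 and kappa of 0.9586 from 7589 correct predictions out of 7846 testing samples.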
Finally, we apply the fitted RF model to the cleaned vali_dat dataset and obtain the following predictions.
validPC <- predict(preProc,vali_dat[,-58])
res <- predict(modelFit,validPC)
res
## [1] B A A A A E D B A A B C B A E E A B B B
## Levels: A B C D E