Practical Machine Learning Project Report

METHODOLOGY

Loading Required Packages

library(rpart)
library(caret)
library(randomForest)

Loading Data

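# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where read.csv() no longer converts strings by default
train <- read.csv("pml-training.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)
test <- read.csv("pml-testing.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)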

Data Pre-processing

Let's remove every column that contains at least one NA from the data.

train <- train[, colSums(is.na(train)) == 0]
test <- test[, colSums(is.na(test)) == 0]
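
As a small illustration of the filter above, the following sketch applies the same colSums(is.na(...)) == 0 test to a made-up data frame. The object toy and its columns are hypothetical and are not part of the project data.

```r
# Toy data frame (hypothetical): b has one NA, c is entirely NA
toy <- data.frame(a = c(1, 2, 3),
                  b = c(4, NA, 6),
                  c = c(NA, NA, NA),
                  d = c("x", "y", "z"))

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns that have no missing values at all
toy_complete <- toy[, na_counts == 0]
kept_columns <- names(toy_complete)
print(kept_columns)
```

Note that a single NA is enough to drop a column, so this is a strict filter; a threshold on the NA fraction would be a different design choice.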

Let's remove the first seven columns from the data; they hold identifiers and timestamps rather than sensor measurements.

train <- train[, -c(1:7)]
test <- test[, -c(1:7)]
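
Because the NA filter is applied to train and test separately, it may be worth confirming that both end up with the same predictor columns. The sketch below shows one way to check this with setdiff() on two made-up data frames; toy_train, toy_test and their columns are hypothetical stand-ins for the real objects.

```r
# Hypothetical stand-ins for the cleaned train and test data frames
toy_train <- data.frame(roll_belt = c(1.4, 1.5),
                        pitch_belt = c(8.1, 8.2),
                        classe = factor(c("A", "B")))
toy_test <- data.frame(roll_belt = 1.3,
                       pitch_belt = 8.0,
                       problem_id = 1L)

# Columns present in one data frame but not in the other
only_in_train <- setdiff(names(toy_train), names(toy_test))
only_in_test <- setdiff(names(toy_test), names(toy_train))

print(only_in_train)
print(only_in_test)
```

On the real data the same two calls, run on train and test, should ideally report only the outcome column on the training side and the identifier column on the test side.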

Data Partition

Let's partition the train data into a training set (80%) and a validation set (20%).

trainset <- createDataPartition(train$classe, p = 0.8, list = FALSE)
Training <- train[trainset, ]
Validation <- train[-trainset, ]
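
createDataPartition() samples within each level of classe, so both subsets keep roughly the same class proportions. No seed is shown for the split, so the exact partition cannot be reproduced; calling set.seed() first would fix that. The following is a minimal base R sketch of the same idea (stratified 80/20 sampling with a fixed seed) on made-up data. It is not caret's implementation, and the seed value, toy data frame and class sizes are assumptions chosen for illustration.

```r
set.seed(1234)  # hypothetical seed; any fixed value makes the split repeatable

# Made-up data with five classes of unequal size
toy <- data.frame(
    x = rnorm(100),
    classe = factor(rep(c("A", "B", "C", "D", "E"), times = c(30, 20, 20, 15, 15)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$classe)

# Draw 80% of the rows within each class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

# Class counts in each subset
print(table(toy_train$classe))
print(table(toy_valid$classe))
```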

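Bar Plot

Let's draw a simple bar plot of the outcome variable (classe). Because classe is a factor, plot() shows the number of observations per level rather than a true histogram.

plot(Training$classe, col = "gray",
     main = "Distribution of outcome variable (classe) in Training set",
     xlab = "classe levels", ylab = "Frequency")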

Building Model

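Let's use a random forest to build the model. Note that method = "class" is an rpart argument; randomForest() does not use it and fits a classification forest because classe is a factor.

rfMod <- randomForest(classe ~ ., data = Training, method = "class")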
rfMod

## 
## Call:
##  randomForest(formula = classe ~ ., data = Training, method = "class") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.38%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   11 3024    3    0    0 0.0046082949
## C    0   12 2724    2    0 0.0051132213
## D    0    0   22 2550    1 0.0089389817
## E    0    0    2    5 2879 0.0024255024
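
The OOB error rate and the class.error column can be recomputed from the confusion matrix printed above (rows are the actual classes, columns the OOB predictions). The sketch below re-enters those counts by hand; the object names are hypothetical.

```r
# OOB confusion matrix copied from the randomForest output above
oob_cm <- matrix(c(4462,    2,    0,    0,    0,
                     11, 3024,    3,    0,    0,
                      0,   12, 2724,    2,    0,
                      0,    0,   22, 2550,    1,
                      0,    0,    2,    5, 2879),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 10))
print(round(100 * oob_error, 2))  # percentage
```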

Cross Validation

Let's validate the model on the held-out Validation set. This is a single hold-out split rather than k-fold cross-validation.

rfPred <- predict(rfMod, Validation, type = "class")
confusionMatrix(rfPred, Validation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    4    0    0    0
##          B    1  754    5    0    0
##          C    0    1  679    5    1
##          D    0    0    0  637    2
##          E    0    0    0    1  718
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9921, 0.9969)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9934   0.9927   0.9907   0.9958
## Specificity            0.9986   0.9981   0.9978   0.9994   0.9997
## Pos Pred Value         0.9964   0.9921   0.9898   0.9969   0.9986
## Neg Pred Value         0.9996   0.9984   0.9985   0.9982   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1922   0.1731   0.1624   0.1830
## Detection Prevalence   0.2852   0.1937   0.1749   0.1629   0.1833
## Balanced Accuracy      0.9988   0.9958   0.9953   0.9950   0.9978

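The accuracy on the validation set is 99.49% (95% CI: 99.21% to 99.69%), so the estimated out-of-sample error is about 0.51%. This is in line with the 0.38% OOB error estimate and indicates that the model performs well.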
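
To make the arithmetic behind these figures explicit, the sketch below recomputes the accuracy, the out-of-sample error and the per-class sensitivity from the validation confusion matrix printed above (rows are predictions, columns the reference classes). The counts are re-entered by hand and the object names are hypothetical. The interval uses base R's binom.test(), an exact binomial interval; treating it as comparable to the interval reported above is an assumption.

```r
# Validation confusion matrix copied from the confusionMatrix() output above
val_cm <- matrix(c(1115,   4,   0,   0,   0,
                      1, 754,   5,   0,   0,
                      0,   1, 679,   5,   1,
                      0,   0,   0, 637,   2,
                      0,   0,   0,   1, 718),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5]))

n_correct <- sum(diag(val_cm))  # correctly classified validation rows
n_total <- sum(val_cm)          # all validation rows

accuracy <- n_correct / n_total
oos_error <- 1 - accuracy       # estimated out-of-sample error

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_total)$conf.int

# Sensitivity per class: correct predictions divided by reference totals
sensitivity <- diag(val_cm) / colSums(val_cm)

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(ci, 4))
print(round(sensitivity, 4))
```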

Test Prediction

Let's predict the classes of the test set using our model rfMod.

testPred <- predict(rfMod, test, type = "class")
testPred

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Let's save each prediction to its own txt file, using the code provided in the submission instructions.

answers <- as.character(testPred)

pml_write_files <- function(x) {
    # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
    for (i in seq_along(x)) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
                    col.names = FALSE)
    }
}

pml_write_files(answers)
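
One way to check the written files is to read them back and compare them with the predictions. The sketch below does this in a temporary directory so that the working directory is left untouched. The helper write_answers() and its out_dir argument are hypothetical variations on the function above, and answers_demo holds only the first three predictions.

```r
# Hypothetical variant of pml_write_files() that writes into a chosen directory
write_answers <- function(x, out_dir) {
    dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
    for (i in seq_along(x)) {
        filename <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
        write.table(x[i], file = filename, quote = FALSE, row.names = FALSE,
                    col.names = FALSE)
    }
}

answers_demo <- c("B", "A", "B")  # first three predictions from testPred
out_dir <- file.path(tempdir(), "pml_answers_demo")
write_answers(answers_demo, out_dir)

# Read every file back and compare with what was written
read_back <- vapply(seq_along(answers_demo), function(i) {
    readLines(file.path(out_dir, paste0("problem_id_", i, ".txt")))
}, character(1))

print(identical(read_back, answers_demo))
```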