PROBLEM STATEMENT:

The goal of this project is to classify how an exercise was performed (the classe variable) using a subset of predictors from data recorded by four types of body sensors during bodybuilding exercises. After data cleanup and preprocessing, a model was trained and then used to predict 20 observations from a test dataset. This document explains the methods used and the results obtained.

DATA:

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
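The loading code further down reads both files from the working directory. A minimal sketch of fetching them first is shown here; the helper name fetch_if_missing is hypothetical and not part of the original analysis, and the download calls are left commented out because they need network access.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when no local copy exists, then return the local path.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

base_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/"

# Usage (requires network access on the first run):
# fetch_if_missing(paste0(base_url, "pml-training.csv"), "pml-training.csv")
# fetch_if_missing(paste0(base_url, "pml-testing.csv"), "pml-testing.csv")
```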

METHODOLOGY

Loading Required Packages

library(rpart)
library(caret)
library(randomForest)

Loading Data:

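# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default is FALSE
train <- read.csv("pml-training.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)
test <- read.csv("pml-testing.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)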

Data Pre-processing

Let's remove the columns that contain NA values from the data.

train <- train[, colSums(is.na(train)) == 0]
test <- test[, colSums(is.na(test)) == 0]

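Let's remove the unwanted columns from the data. The first seven columns hold row identifiers, user names, timestamps, and window markers rather than sensor measurements.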

train <- train[, -c(1:7)]
test <- test[, -c(1:7)]
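The two filtering steps above are applied to train and test independently, so in principle they could leave different sets of columns, and prediction needs every predictor used by the model to be present in the test data. A small sketch of how that could be checked is shown here on made-up toy data. The toy column names, including problem_id as the test set's identifier column, are assumptions for illustration.

```r
# Toy data standing in for the real files (all values are made up).
toy_train <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(NA, 5, 6),
                        classe = c("A", "B", "A"))
toy_test <- data.frame(id = 1:2, a = c(4, 5), b = c(7, 8),
                       problem_id = 1:2)

# Keep only the columns with zero missing values, as in the report.
toy_train <- toy_train[, colSums(is.na(toy_train)) == 0]
toy_test <- toy_test[, colSums(is.na(toy_test)) == 0]

# Compare the predictor sets that survive in each data frame.
predictors <- setdiff(names(toy_train), "classe")
missing_in_test <- setdiff(predictors, names(toy_test))
extra_in_test <- setdiff(names(toy_test), c(predictors, "problem_id"))

missing_in_test  # predictors the model would need but the test set lacks
extra_in_test    # test columns that the model would not use
```

In this toy case column b is dropped from toy_train only, so it appears in extra_in_test, while missing_in_test is empty. A non-empty missing_in_test would be the case to worry about.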

Data Partition

Let's partition the train data into a training set (80%) and a validation set (20%).

trainset <- createDataPartition(train$classe, p = 0.8, list = FALSE)
Training <- train[trainset, ]
Validation <- train[-trainset, ]
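createDataPartition samples rows at random within each class, so without a call to set.seed beforehand the split, and every number that follows, will differ slightly from run to run. The sketch below illustrates the idea of a seeded, stratified 80/20 split in base R on made-up data. It is an illustration of the concept, not caret's implementation, and the seed value and toy data are arbitrary assumptions.

```r
# Base R sketch of a stratified 80/20 split (an illustration only, not
# caret's implementation of createDataPartition).
set.seed(123)  # arbitrary seed, chosen only for this example

# Toy outcome with unequal class sizes (made-up data).
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
)

# Sample 80% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$classe)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

toy_training <- toy[train_idx, ]
toy_validation <- toy[-train_idx, ]

# The class proportions are preserved in both subsets.
prop.table(table(toy_training$classe))
prop.table(table(toy_validation$classe))
```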

Bar Plot

Let's draw a simple bar plot of the outcome variable to see how the classes are distributed.

plot(Training$classe, col = "gray",
     main = "Frequency of the outcome variable (classe) in the Training set",
     xlab = "classe levels", ylab = "Frequency")

Building Model

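Let's use a random forest to build the model. (The method = "class" argument is an rpart option; randomForest does not use it and fits a classification forest because classe is a factor.)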

rfMod <- randomForest(classe ~. , data=Training, method="class")
rfMod
## 
## Call:
##  randomForest(formula = classe ~ ., data = Training, method = "class") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.38%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   11 3024    3    0    0 0.0046082949
## C    0   12 2724    2    0 0.0051132213
## D    0    0   22 2550    1 0.0089389817
## E    0    0    2    5 2879 0.0024255024
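The OOB error rate and the class.error column in this output follow directly from the confusion matrix, where rows are the actual classes and columns are the predicted classes. As a worked check, the short sketch below re-enters the matrix by hand and recomputes both; the variable names are chosen for this illustration only.

```r
# OOB confusion matrix copied from the model output above
# (rows = actual class, columns = predicted class).
oob <- matrix(c(4462,    2,    0,    0,    0,
                  11, 3024,    3,    0,    0,
                   0,   12, 2724,    2,    0,
                   0,    0,   22, 2550,    1,
                   0,    0,    2,    5, 2879),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Overall OOB error: share of observations off the diagonal (60 of 15699).
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)  # 0.38 (percent)
round(class_error, 4)
```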

Cross Validation

Let's validate the model on the held-out Validation set (a single hold-out split rather than k-fold cross-validation).

rfPred <- predict(rfMod, Validation, type = "class")
confusionMatrix(rfPred, Validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    4    0    0    0
##          B    1  754    5    0    0
##          C    0    1  679    5    1
##          D    0    0    0  637    2
##          E    0    0    0    1  718
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9921, 0.9969)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9934   0.9927   0.9907   0.9958
## Specificity            0.9986   0.9981   0.9978   0.9994   0.9997
## Pos Pred Value         0.9964   0.9921   0.9898   0.9969   0.9986
## Neg Pred Value         0.9996   0.9984   0.9985   0.9982   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1922   0.1731   0.1624   0.1830
## Detection Prevalence   0.2852   0.1937   0.1749   0.1629   0.1833
## Balanced Accuracy      0.9988   0.9958   0.9953   0.9950   0.9978

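The accuracy on the validation set is 99.49% (95% CI: 99.21% to 99.69%), so the estimated out-of-sample error is about 0.5%. This is consistent with the OOB error estimate of 0.38% and indicates that the model performs well.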
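The accuracy, the out-of-sample error, and the per-class sensitivity can all be recomputed from the validation confusion matrix. The sketch below re-enters that matrix by hand (rows are predictions, columns are reference classes, as printed by confusionMatrix); the variable names are chosen for this illustration only.

```r
# Validation confusion matrix copied from the confusionMatrix output
# (rows = predicted class, columns = reference class).
val <- matrix(c(1115,   4,   0,   0,   0,
                   1, 754,   5,   0,   0,
                   0,   1, 679,   5,   1,
                   0,   0,   0, 637,   2,
                   0,   0,   0,   1, 718),
              nrow = 5, byrow = TRUE,
              dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5]))

# Accuracy: correctly classified observations (3903) over all (3923).
accuracy <- sum(diag(val)) / sum(val)
oos_error <- 1 - accuracy

# Sensitivity per class: correct predictions over the reference column total.
sensitivity <- diag(val) / colSums(val)

round(accuracy, 4)         # 0.9949
round(100 * oos_error, 2)  # 0.51 (percent)
round(sensitivity, 4)
```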

Test Prediction

Let's predict the test set using our model rfMod.

testPred <- predict(rfMod, test, type="class")
testPred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Let's save each prediction to its own text file, using the code provided in the submission instructions.

answers <- as.character(testPred)

pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_",i,".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, 
                    col.names=FALSE)
    }
}

pml_write_files(answers)
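Before submitting, it can be worth confirming that every file was written and holds the expected label. The sketch below shows one possible check. It runs in a temporary directory with made-up answers and writes the files with writeLines, so it is a stand-in for the submission code above rather than a call to it.

```r
# Hypothetical round-trip check in a temporary directory (made-up answers).
out_dir <- tempfile("answers_")
dir.create(out_dir)
demo_answers <- c("B", "A", "C")

# Write one file per answer, using the same naming scheme as the report.
for (i in seq_along(demo_answers)) {
  writeLines(demo_answers[i],
             file.path(out_dir, paste0("problem_id_", i, ".txt")))
}

# Read the files back in numeric order and compare with the source vector.
files <- file.path(out_dir,
                   paste0("problem_id_", seq_along(demo_answers), ".txt"))
read_back <- vapply(files, readLines, character(1), USE.NAMES = FALSE)

all(file.exists(files))
identical(read_back, demo_answers)
```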

CONCLUSION

  1. A random forest was used to build the model.
  2. Validation on the held-out set gave an estimated out-of-sample error of about 0.5%, which indicates that the model performs well.