Synopsis

This project concerns HAR (Human Activity Recognition): six participants were asked to perform barbell lifts correctly and incorrectly in five different ways, and the relevant data were collected using devices such as the Jawbone Up, Nike FuelBand, and Fitbit worn by the participants.

Participants regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The dataset labels each observation with one of five classes (A to E), where class A corresponds to performing the lift exactly as specified and classes B to E correspond to four common mistakes, all captured through those on-body sensor devices.

The aim is to build a machine learning model trained on the various sensor readings that can later be used to predict the classe variable, that is, the manner in which the participants performed the exercise.

Initial Setup

Environment

Hardware :
            MacBook Pro with OS X Yosemite 10.10.4
Software :
            RStudio

            GitHub Desktop / Web Version
            

Data Processing

Setting up the working directory:

setwd("~/Desktop/Machine Learning/Coursera-PML-Project")

The required R packages were loaded:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(corrplot)

Download the Data:

        The training and test data were downloaded as instructed.
        Training data : pml-training.csv 
        Test data     : pml-testing.csv
        
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile  <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
  dir.create("./data")
}
if (!file.exists(trainFile)) {
  download.file(trainUrl, destfile=trainFile, method="curl")
}
if (!file.exists(testFile)) {
  download.file(testUrl, destfile=testFile, method="curl")
}

Read the data into data frames and check the dimensions:

trainRaw <- read.csv("./data/pml-training.csv")
testRaw <- read.csv("./data/pml-testing.csv")
dim(trainRaw)
## [1] 19622   160
dim(testRaw)
## [1]  20 160
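
As a quick sanity check (illustrative; not part of the original writeup), we can confirm that the two files share the same columns apart from the label:

setdiff(names(trainRaw), names(testRaw))   # "classe" appears only in the training set
setdiff(names(testRaw), names(trainRaw))   # "problem_id" appears only in the test set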

Cleaning the Data:

On inspecting both the training and test datasets, a number of columns were found to contain missing or irrelevant values, which needed to be cleaned.

sum(complete.cases(trainRaw))
## [1] 406
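
Only 406 of the 19,622 rows are complete, which suggests that many columns consist almost entirely of NA values. An illustrative check (not in the original output) of the per-column NA fraction:

naFraction <- colMeans(is.na(trainRaw))  # fraction of NA values in each column
table(round(naFraction, 2))              # columns are either complete or mostly NA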

The columns containing NA values were removed:

trainRaw <- trainRaw[, colSums(is.na(trainRaw)) == 0] 
testRaw <- testRaw[, colSums(is.na(testRaw)) == 0] 

A few columns that carry no information about the sensor measurements (the row index, timestamps, and window indicators) were also removed.

classe <- trainRaw$classe
trainRemove <- grepl("^X|timestamp|window", names(trainRaw))
trainRaw <- trainRaw[, !trainRemove]
trainCleaned <- trainRaw[, sapply(trainRaw, is.numeric)]
trainCleaned$classe <- classe
testRemove <- grepl("^X|timestamp|window", names(testRaw))
testRaw <- testRaw[, !testRemove]
testCleaned <- testRaw[, sapply(testRaw, is.numeric)]

The cleaned training data set now contains 19622 observations and 53 variables, while the testing data set contains 20 observations and 53 variables. The “classe” variable is still in the cleaned training set.
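These dimensions can be verified directly; given the counts stated above, the expected output is:

dim(trainCleaned)
## [1] 19622    53
dim(testCleaned)
## [1] 20 53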

Slicing the data:

Then, we split the cleaned training set into a pure training data set (70%) and a validation data set (30%). The validation set will be used to estimate out-of-sample performance in later steps.

set.seed(22519) # for reproducibility
inTrain <- createDataPartition(trainCleaned$classe, p=0.70, list=F)
trainData <- trainCleaned[inTrain, ]
testData <- trainCleaned[-inTrain, ]
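
Because createDataPartition stratifies on classe, both subsets should preserve the overall class proportions; a quick illustrative check (not part of the original writeup):

round(prop.table(table(trainData$classe)), 3)  # class proportions in the training split
round(prop.table(table(testData$classe)), 3)   # should closely match those above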

Data Modeling:

We fit a predictive model for activity recognition using the Random Forest algorithm, because it automatically selects important variables and is generally robust to correlated covariates and outliers. We use 5-fold cross-validation when applying the algorithm.

controlRf <- trainControl(method="cv", 5)
modelRf <- train(classe ~ ., data=trainData, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10989, 10989, 10991, 10990, 10989 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9910462  0.9886729  0.001301269  0.001647776
##   27    0.9914102  0.9891334  0.001717547  0.002174708
##   52    0.9850037  0.9810264  0.002718384  0.003439965
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
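
As an optional diagnostic (not part of the original analysis), caret's varImp() shows which sensor measurements the forest relies on most:

impRf <- varImp(modelRf)  # scaled importance from the final random forest fit
print(impRf)              # lists the most influential predictors first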

The performance of the model is then estimated on the validation data set:

predictRf <- predict(modelRf, testData)
confusionMatrix(testData$classe, predictRf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    1    0    0    1
##          B    7 1128    4    0    0
##          C    0    0 1021    5    0
##          D    0    0   14  949    1
##          E    0    0    1    7 1074
## 
## Overall Statistics
##                                          
##                Accuracy : 0.993          
##                  95% CI : (0.9906, 0.995)
##     No Information Rate : 0.2853         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9912         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9958   0.9991   0.9817   0.9875   0.9981
## Specificity            0.9995   0.9977   0.9990   0.9970   0.9983
## Pos Pred Value         0.9988   0.9903   0.9951   0.9844   0.9926
## Neg Pred Value         0.9983   0.9998   0.9961   0.9976   0.9996
## Prevalence             0.2853   0.1918   0.1767   0.1633   0.1828
## Detection Rate         0.2841   0.1917   0.1735   0.1613   0.1825
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9977   0.9984   0.9903   0.9922   0.9982
accuracy <- postResample(predictRf, testData$classe)
accuracy
##  Accuracy     Kappa 
## 0.9930331 0.9911872
outSampErr <- 1 - as.numeric(confusionMatrix(testData$classe, predictRf)$overall[1])
outSampErr
## [1] 0.006966865

The accuracy of the model is estimated at 99.30%, and the estimated out-of-sample error is 0.70%.

Predicting on the Test Data Set:

Finally, we apply the model to the original testing data set downloaded from the data source, removing the problem_id column first.

result <- predict(modelRf, testCleaned[, -length(names(testCleaned))])
result
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
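
For submission, the course expected one text file per prediction. A minimal sketch of such a helper (the function name and exact format here are assumptions, not part of the original writeup):

# Hypothetical helper: write each predicted class to its own file,
# named problem_id_<n>.txt, matching the assumed submission format.
writeResults <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i],
                file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writeResults(result)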

Conclusion

The model achieves an estimated accuracy of 99.30% on the validation set, with an estimated out-of-sample error of 0.70%.

Plots are available in the folder ‘writeup_files/figure-html’ in the GitHub repository and are appended in the Appendix.

I thank the Coursera team and the course participants for this opportunity to enrich my knowledge through this interesting project.

I also thank the HAR data providers for this exercise.

Appendix

Plot 1: Decision Tree Visualization

treeModel <- rpart(classe ~ ., data=trainData, method="class")
prp(treeModel)

Plot 2: Correlation Matrix Visualization

corrPlot <- cor(trainData[, -length(names(trainData))])
corrplot(corrPlot, method="color")