This document presents the results of the Practical Machine Learning Peer Assessments in a report using a single R markdown document that can be processed by knitr and be transformed into an HTML file.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement Â- a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this data set, the participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants toto predict the manner in which praticipants did the exercise.
Since we have a data set with too many columns and we need make a class prediction, therefore decided to implement a random forests model, that’s no need cross-validation or a separate test set to get an unbiased estimate of the test set error.
The dependent variable or response is the “classe” variable in the training set.
Load the data.
training <- read.csv("pml-training.csv", header=T, sep=",", na.strings=c("NA",""))
testing<-read.csv("pml-testing.csv", header=T,sep=",", na.strings=c("NA",""))
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Firstly Load the Libraries and Set Seed for results to be Reproducible.
library(caret)
library(randomForest)
set.seed(45)
The following approaches for used for reducing the number of predictors.
##this will return the total number of Non NAs values in each Column
ColSums<-colSums(!is.na(training[,-ncol(training)]))
head(ColSums)
## X user_name raw_timestamp_part_1
## 19622 19622 19622
## raw_timestamp_part_2 cvtd_timestamp new_window
## 19622 19622 19622
## this gives the number of valid columns
sum(colSums(!is.na(training[,-ncol(training)]))>=0.7*nrow(training))
## [1] 59
Subset the Data by considering only valid columns
validCol<-colSums(!is.na(training[,-ncol(training)]))>=0.7*nrow(training)
trainingWithlessNA<-training[,validCol]
dim(trainingWithlessNA)
## [1] 19622 60
nzv <- nearZeroVar(trainingWithlessNA, saveMetrics = TRUE) ##this creates a matrix
head(nzv)
## freqRatio percentUnique zeroVar nzv
## X 1.000000 100.00000000 FALSE FALSE
## user_name 1.100679 0.03057792 FALSE FALSE
## raw_timestamp_part_1 1.000000 4.26562022 FALSE FALSE
## raw_timestamp_part_2 1.000000 85.53154622 FALSE FALSE
## cvtd_timestamp 1.000668 0.10192641 FALSE FALSE
## new_window 47.330049 0.01019264 FALSE TRUE
#this will return only those col numbers
nzv <- nearZeroVar(trainingWithlessNA)
trainingWithNozeroVar <- trainingWithlessNA[, -nzv]
dim(trainingWithNozeroVar)
## [1] 19622 59
trainingWithNozeroVarandID<-trainingWithNozeroVar[,-1]
dim(trainingWithNozeroVarandID)
## [1] 19622 58
##number of the columns with numeric values
sum(sapply(trainingWithNozeroVarandID, is.numeric))
## [1] 55
##VAriables with high correraltion
corrMatrix <-
cor(na.omit(trainingWithNozeroVarandID[sapply(trainingWithNozeroVarandID, is.numeric)]))
dim(corrMatrix)
## [1] 55 55
##A cut-off of 90% is kept
highCorr<-findCorrelation(corrMatrix, cutoff = .90, verbose = T)
## Final Training set
Final_training<-trainingWithNozeroVarandID[,-highCorr]
dim(Final_training)
## [1] 19622 51
The same is to be done with testing Data
testingWithLessNA<-testing[,validCol]
testingWithNoZeroVar<-testingWithLessNA[,-nzv]
testingWithNoZeroVarAndID<-testingWithNoZeroVar[,-1]
Final_testing<-testingWithNoZeroVarAndID[,-highCorr]
dim(Final_testing)
## [1] 20 51
set.seed(45)
inTrain<-createDataPartition(Final_training$classe,p=0.75,list=F)
modelTrain<-Final_training[inTrain,]
modeltest<-Final_training[-inTrain,]
Random forests build lots of bushy trees, and then average them to reduce the variance.
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the execution. So, we proced with the training the model (Random Forest) with the training data set.
set.seed(45)
require(randomForest)
modelFit<-randomForest(classe~.,data=Final_training,importance=TRUE)
modelFit
##
## Call:
## randomForest(formula = classe ~ ., data = Final_training, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.05%
## Confusion matrix:
## A B C D E class.error
## A 5580 0 0 0 0 0.0000000000
## B 2 3795 0 0 0 0.0005267316
## C 0 5 3417 0 0 0.0014611338
## D 0 0 1 3214 1 0.0006218905
## E 0 0 0 0 3607 0.0000000000
No. of variables tried at each split: 7. It means every time we only randomly use 7 predictors to grow the tree.
varImpPlot(modelFit)
Now lets evaluate this tree on the testing data.
Accuracy is 1 (100%)
confusionMatrix(predict(modelFit,newdata=modeltest[,-ncol(modeltest)]),
modeltest$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 0 949 0 0 0
## C 0 0 855 0 0
## D 0 0 0 804 0
## E 0 0 0 0 901
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
Now lets Predict for the Final_testing Data
predictions <- predict(modelFit,newdata=Final_testing)
predictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Those answers were submitted as a part of Course and are Correct