The goal of this project is to classify new exercise observations by predicting the classe variable from a subset of the predictors in the data, which were recorded by four types of body sensors during bodybuilding exercises. After applying several data cleanup and preprocessing steps, a model was trained and used to predict 20 observations from a test dataset. This document explains the methods used and the results obtained.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
library(rpart)
library(caret)
library(randomForest)
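# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed to FALSE
train <- read.csv("pml-training.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)
test <- read.csv("pml-testing.csv", na.strings = c("", "NA", "NULL"), stringsAsFactors = TRUE)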
Let's remove the columns that contain NAs from the data.
train <- train[, colSums(is.na(train)) == 0]
test <- test[, colSums(is.na(test)) == 0]
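As a small illustration of the filter above, the sketch below applies the same `colSums(is.na(...)) == 0` test to a toy data frame. The data frame `toy` and its column names are made up for this example and are not part of the project data.

```r
# Toy data frame (hypothetical): column b has a missing value, a and c do not.
toy <- data.frame(a = 1:3, b = c(1, NA, 3), c = c("x", "y", "z"))

# is.na() returns a logical matrix; colSums() counts the NAs in each column.
na_counts <- colSums(is.na(toy))

# Keep only the columns with zero missing values.
complete <- toy[, na_counts == 0]
names(complete)
```

Note that this rule drops a column even when only a single value is missing.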
Let's remove the unwanted columns (the first seven) from the data.
train <- train[, -c(1:7)]
test <- test[, -c(1:7)]
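Because the train and test sets are filtered independently, it may be worth checking that they still share the same predictor columns. The sketch below shows one way to do that with `setdiff`; the data frames `train_toy` and `test_toy` are hypothetical stand-ins with illustrative column names, not the real data.

```r
# Toy stand-ins (hypothetical) for the cleaned train and test data frames.
train_toy <- data.frame(roll_belt = c(1.4, 1.5), classe = factor(c("A", "B")))
test_toy  <- data.frame(roll_belt = c(1.3, 1.6), problem_id = 1:2)

# Columns present in one data frame but not in the other.
only_in_train <- setdiff(names(train_toy), names(test_toy))
only_in_test  <- setdiff(names(test_toy), names(train_toy))

only_in_train
only_in_test
```

If anything other than the outcome column shows up in `only_in_train`, the model would be asked to predict from columns the test set does not have.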
Let's partition the train data into a training set (80%) and a validation set (20%).
trainset <- createDataPartition(train$classe, p = 0.8, list = FALSE)
Training <- train[trainset, ]
Validation <- train[-trainset, ]
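The partition above is random, so the exact counts reported in this document will vary from run to run. A minimal sketch of a reproducible split follows, assuming the caret package is installed; the built-in `iris` data stands in for the project data, and the seed value and the names `training_toy` and `validation_toy` are arbitrary choices for this example.

```r
library(caret)

# Fix the random seed so the same rows are selected on every run (123 is arbitrary).
set.seed(123)

# iris stands in for the project data; Species plays the role of classe.
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training_toy   <- iris[idx, ]
validation_toy <- iris[-idx, ]

# The split is stratified by the outcome, so each class keeps its share in both parts.
table(training_toy$Species)
table(validation_toy$Species)
```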
Let's draw a simple bar plot of the outcome variable.
plot(Training$classe, col = "gray",
     main = "Distribution of the outcome variable (classe) in the training set",
     xlab = "classe levels", ylab = "Frequency")
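Let's use a random forest to build the model. Because classe is a factor, randomForest fits a classification forest; the method = "class" argument is an rpart convention and is not needed here.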
rfMod <- randomForest(classe ~. , data=Training, method="class")
rfMod
##
## Call:
## randomForest(formula = classe ~ ., data = Training, method = "class")
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.38%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   11 3024    3    0    0 0.0046082949
## C    0   12 2724    2    0 0.0051132213
## D    0    0   22 2550    1 0.0089389817
## E    0    0    2    5 2879 0.0024255024
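The class.error column and the OOB error rate can be recomputed by hand from the confusion matrix above. The sketch below copies the counts from the printed output into a matrix; the names `oob`, `class_error`, and `oob_error` are introduced only for this example.

```r
# OOB confusion matrix copied from the output above (rows = actual, columns = predicted).
oob <- matrix(c(4462,    2,    0,    0,    0,
                  11, 3024,    3,    0,    0,
                   0,   12, 2724,    2,    0,
                   0,    0,   22, 2550,    1,
                   0,    0,    2,    5, 2879),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts divided by the total count.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(100 * oob_error, 2)  # 0.38, matching the reported OOB estimate
```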
Let's validate the model on the held-out validation set.
rfPred <- predict(rfMod, Validation, type = "class")
confusionMatrix(rfPred, Validation$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    4    0    0    0
##          B    1  754    5    0    0
##          C    0    1  679    5    1
##          D    0    0    0  637    2
##          E    0    0    0    1  718
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9921, 0.9969)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9934   0.9927   0.9907   0.9958
## Specificity            0.9986   0.9981   0.9978   0.9994   0.9997
## Pos Pred Value         0.9964   0.9921   0.9898   0.9969   0.9986
## Neg Pred Value         0.9996   0.9984   0.9985   0.9982   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1922   0.1731   0.1624   0.1830
## Detection Prevalence   0.2852   0.1937   0.1749   0.1629   0.1833
## Balanced Accuracy      0.9988   0.9958   0.9953   0.9950   0.9978
The accuracy on the validation set is 99.49% (95% CI: 99.21% to 99.69%), so the estimated out-of-sample error is about 0.5%, which confirms that the model performs well.
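The accuracy figure can be checked directly from the validation confusion matrix. The sketch below copies the counts from the printed output; the variable names are introduced only for this example, and the interval uses `binom.test`, which is the method caret's documentation describes for the accuracy confidence interval.

```r
# Validation confusion matrix copied from the output above (rows = predicted, columns = reference).
cm <- matrix(c(1115,   4,   0,   0,   0,
                  1, 754,   5,   0,   0,
                  0,   1, 679,   5,   1,
                  0,   0,   0, 637,   2,
                  0,   0,   0,   1, 718),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5]))

correct  <- sum(diag(cm))   # correctly classified observations: 3903
total    <- sum(cm)         # all validation observations: 3923
accuracy <- correct / total

round(accuracy, 4)          # 0.9949
round(1 - accuracy, 4)      # 0.0051, the estimated out-of-sample error

# Exact binomial confidence interval for the accuracy.
binom.test(correct, total)$conf.int
```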
Let's predict the test set using our model rfMod.
testPred <- predict(rfMod, test, type="class")
testPred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
Let's save the predictions as text files, one per test case, using the code provided in the submission instructions.
answers <- as.character(testPred)
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE,
                col.names=FALSE)
  }
}
pml_write_files(answers)