Applications of human activity recognition have grown rapidly over the past few years, but they typically estimate the quantity of work done rather than its quality. The data used in this project are used to predict the manner in which participants performed barbell lifts: one of five different ways, only one of which is correct. Data were collected from accelerometers attached to the belt, arm, forearm, and dumbbell of each participant. A random forest model gives an accuracy of 99.78% on a held-out test set and an estimated out-of-bag (OOB) error rate of 0.25%.
library(caret)
library(randomForest)
# Read the full training data and the 20-sample final test set
traindata <- read.csv("pml-training.csv")
testfinal <- read.csv("pml-testing.csv")
dim <- as.data.frame(rbind("traindata"=dim(traindata),"testfinal"=dim(testfinal)))
names(dim) <- c("rows","columns")
dim
## rows columns
## traindata 19622 160
## testfinal 20 160
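As a quick orientation step (an added check, not part of the original analysis), the outcome variable classe can be tabulated to see how the five lift classes are distributed:
table(traindata$classe)
Class A, the correct execution, is the most frequent, which matches the No Information Rate of 0.2845 reported in the confusion matrix later on.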
# Drop index and metadata variables (columns 1 to 5)
traindata <- traindata[,-c(1,2,3,4,5)]
testfinal <- testfinal[,-c(1,2,3,4,5)]
# Removing NA's: flag columns with more than 5000 missing values
nacols <- logical(length(traindata))
for(i in 1:length(traindata)) nacols[i] <- sum(is.na(traindata[,i]))>5000
trainingdata <- traindata[!nacols]
testfinal <- testfinal[!nacols]
dim <- rbind(dim,trainingdata=dim(trainingdata))
dim
## rows columns
## traindata 19622 160
## testfinal 20 160
## trainingdata 19622 88
The complete dataset, traindata, and the final test set, testfinal, have 160 variables, many of which can be eliminated from our analysis. First, columns 1 to 5 are index and metadata variables; second, several variables consist almost entirely of NA's (flagged in nacols). Removing both groups yields a clean dataset, trainingdata, and the same columns are dropped from the final test set. The number of variables is considerably reduced, from 160 to 88.
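A one-line verification (an added check, not in the original code) confirms the cleaned dataset is complete, assuming the NA's were concentrated entirely in the dropped columns:
# Should be 0 if every NA-heavy column was removed
sum(is.na(trainingdata))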
As our final test set has only 20 samples, we build training (75%) and testing (25%) partitions from the clean trainingdata for the sake of cross validation. Any preprocessing must be fitted on the new training partition, on which we are about to build a model.
set.seed(1234)
inTrain <- createDataPartition(trainingdata$classe,p=0.75,list=FALSE)
training <- trainingdata[inTrain,]
testing <- trainingdata[-inTrain,]
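Because createDataPartition samples within each level of classe, the class proportions should be nearly identical across the two partitions; a short sanity check (added here, not in the original code):
# Class proportions should match between partitions
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)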
It is also important that all the predictor variables share the same class, for faster computation. Our variables are a mixture of integer and numeric classes, so all of them, except classe, are converted to numeric.
a <- length(training) - 1   # number of predictor columns; classe is the last column
for(i in 1:a) {
    training[,i] <- as.numeric(training[,i])
    testing[,i] <- as.numeric(testing[,i])
    testfinal[,i] <- as.numeric(testfinal[,i])
}
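A defensive check (an addition, not part of the original code) can confirm the coercion succeeded before modelling:
# Verify every predictor column is now numeric
stopifnot(all(sapply(training[, 1:a], is.numeric)))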
A random forest model is built on the training data and predictions are made on the testing data. The model grows 500 trees with 6 variables tried at each split, and reports an out-of-bag (OOB) error estimate of 0.25%. Validation on the held-out testing set gives an accuracy of 99.78%, consistent with that estimate.
mod1 <- randomForest(classe~.,data=training)
pred <- predict(mod1,testing,type="class")
mod1
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 0.25%
## Confusion matrix:
## A B C D E class.error
## A 4184 0 0 0 1 0.0002389486
## B 4 2842 2 0 0 0.0021067416
## C 0 10 2556 1 0 0.0042851578
## D 0 0 13 2398 1 0.0058043118
## E 0 0 0 5 2701 0.0018477458
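randomForest records Gini importance by default, so an optional inspection (not part of the original write-up) shows which sensor variables drive the splits:
# Most influential predictors by mean decrease in Gini impurity
imp <- importance(mod1)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)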
confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 1 0 0 0
## B 0 947 5 0 0
## C 0 1 850 4 0
## D 0 0 0 800 0
## E 0 0 0 0 901
##
## Overall Statistics
##
## Accuracy : 0.9978
## 95% CI : (0.996, 0.9989)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9972
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9979 0.9942 0.9950 1.0000
## Specificity 0.9997 0.9987 0.9988 1.0000 1.0000
## Pos Pred Value 0.9993 0.9947 0.9942 1.0000 1.0000
## Neg Pred Value 1.0000 0.9995 0.9988 0.9990 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1931 0.1733 0.1631 0.1837
## Detection Prevalence 0.2847 0.1941 0.1743 0.1631 0.1837
## Balanced Accuracy 0.9999 0.9983 0.9965 0.9975 1.0000
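The out-of-sample error rate follows directly from the test-set accuracy; a one-line sketch using caret's confusionMatrix object:
# Out of sample error = 1 - test-set accuracy
1 - as.numeric(confusionMatrix(pred, testing$classe)$overall["Accuracy"])
This gives roughly 0.0022, in line with the 0.25% OOB estimate above.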
predictions <- predict(mod1,testfinal,type="class")
answers <- as.character(predictions)
predictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
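The answers vector is presumably intended for submission; a minimal sketch of writing one prediction per file (the problem_id_ file-name convention is an assumption):
# Write each predicted class to its own text file for submission
for(i in seq_along(answers)) {
    writeLines(answers[i], paste0("problem_id_", i, ".txt"))
}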
# Build per-cell percentages of the test-set confusion matrix for the heatmap
actual <- as.data.frame(table(testing$classe))
names(actual) <- c("Actual","ActualFreq")
predicted <- as.data.frame(table("Predicted"=pred,"Actual"=testing$classe))
confusion <- cbind(predicted, actual)
confusion$Percent <- confusion$Freq/confusion$ActualFreq*100
## Plotting heatmap of the confusion matrix (cell values are percent of actual class)
tile <- ggplot() +
    geom_tile(aes(x=Actual, y=Predicted, fill=Percent), data=confusion,
              color="black", size=0.1) +
    labs(x="Actual", y="Predicted")
tile <- tile + geom_text(aes(x=Actual, y=Predicted, label=sprintf("%.2f", Percent)),
                         data=confusion, size=3, colour="black") +
    scale_fill_gradient(low="grey", high="red")
## Outline the diagonal (correct predictions)
tile <- tile + geom_tile(aes(x=Actual, y=Predicted),
                         data=subset(confusion, as.character(Actual)==as.character(Predicted)),
                         color="black", size=0.3, fill="black", alpha=0)
tile
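To keep a copy of the heatmap, ggplot2's ggsave can write it to disk; the file name and dimensions here are assumptions:
# Save the confusion heatmap to a PNG file (hypothetical file name)
ggsave("confusion_heatmap.png", tile, width=6, height=5)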
Source: Weight Lifting Exercises dataset, http://groupware.les.inf.puc-rio.br/har
Classes of exercises (output)