Synopsis

Applications of human activity recognition have grown rapidly over the past few years, and they often estimate the quantity of work done. This project uses experimental data to predict the form of barbell lifts performed in 5 different ways, only one of which is correct. The data are collected from accelerometers attached to the belt, arm, forearm and dumbbell of each participant. A random forest model gives an accuracy of 99.78% on held-out data, which corresponds to an estimated out-of-sample error rate of 0.22%. The model's out-of-bag (OOB) error estimate is 0.25%.

Loading Data

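library(caret)         # createDataPartition, nearZeroVar, findCorrelation, confusionMatrix
library(randomForest)
# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, which classification needs
traindata <- read.csv("pml-training.csv", stringsAsFactors = TRUE)
testfinal <- read.csv("pml-testing.csv", stringsAsFactors = TRUE)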
dim <- as.data.frame(rbind("traindata"=dim(traindata),"testfinal"=dim(testfinal)))
names(dim) <- c("rows","columns")
dim
##            rows columns
## traindata 19622     160
## testfinal    20     160

Cleaning Data

# Drop the five index columns
traindata <- traindata[, -(1:5)]
testfinal <- testfinal[, -(1:5)]

# Remove columns with more than 5000 NA's
nacols <- colSums(is.na(traindata)) > 5000
trainingdata <- traindata[!nacols]
testfinal <- testfinal[!nacols]
dim <- rbind(dim, trainingdata = dim(trainingdata))
dim
##               rows columns
## traindata    19622     160
## testfinal       20     160
## trainingdata 19622      88

The complete dataset, traindata, and the final test set, testfinal, each have 160 variables, some of which can be dropped from the analysis. First, columns 1 to 5 are index variables. Second, several other variables consist mostly of NA's (flagged in nacols). Removing both groups gives a clean dataset, trainingdata, and the same columns are removed from the final test set. This reduces the number of variables to 88.
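The NA filter can also be expressed as a share of missing values per column rather than an absolute count. The following is a minimal sketch on a made-up data frame. The column names, the values and the 0.5 threshold are assumptions for illustration and are not taken from the project data.

```r
# Toy data frame standing in for the training set (hypothetical values)
toy <- data.frame(
  id     = 1:6,
  roll   = c(1.2, 0.8, 1.1, 0.9, 1.0, 1.3),
  sparse = c(NA, NA, NA, NA, 2.5, NA),   # mostly missing
  classe = c("A", "B", "A", "C", "B", "A")
)

# Fraction of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep columns where at most half of the values are missing
# (the 0.5 threshold is an assumption for this example)
keep <- na_share <= 0.5
toy_clean <- toy[, keep, drop = FALSE]

print(round(na_share, 2))
print(names(toy_clean))
```

A relative threshold has the advantage that it does not need to be adjusted when the number of rows changes.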

Splitting Data

As our final test set has only 20 samples, we split the clean trainingdata into a training set (75%) and a test set (25%) so that the model can be validated on data it has not seen. Any preprocessing is to be derived from the new training set, on which we are about to build the model.

set.seed(1234)
inTrain <- createDataPartition(trainingdata$classe,p=0.75,list=FALSE)
train <- trainingdata[inTrain,]
test <- trainingdata[-inTrain,]
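createDataPartition() samples within each level of the outcome, so the class proportions are roughly the same in both parts. The idea can be sketched in base R as follows. This is an illustration on hypothetical labels, not caret's actual implementation.

```r
set.seed(1234)

# Hypothetical class labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Stratified 75/25 split: sample 75% of the row indices within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class shares should match in both parts
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

Running the same comparison of prop.table() on train$classe and test$classe is a quick way to confirm that the split is balanced.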

Check for highly correlated variables

Even though our data is now ready for modeling, it is more efficient to first remove variables that add little information. First, the variables with variances close to zero are removed using the nearZeroVar() function, and then the remaining variables are checked for redundancy. Three variables are found to be highly correlated with others and are removed using the findCorrelation() function.

set.seed(1234)
nearZeroVars <- nearZeroVar(train,saveMetrics=TRUE)
training_NZ <- train[!nearZeroVars$nzv]
testing_NZ <- test[!nearZeroVars$nzv]
testfinal <- testfinal[!nearZeroVars$nzv]

set.seed(2345)
highCorr <- findCorrelation(cor(training_NZ[,-length(training_NZ)]))
training <- training_NZ[,-highCorr]
testing <- testing_NZ[,-highCorr]
testfinal <- testfinal[,-highCorr]
dim <- rbind(dim,"training"=dim(training),
             "testing"=dim(testing),"testfinal"=dim(testfinal))
dim
##               rows columns
## traindata    19622     160
## testfinal       20     160
## trainingdata 19622      88
## training     14718      47
## testing       4904      47
## testfinal1      20      47
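To see which variables a correlation filter reacts to, the highly correlated pairs can be listed directly from the correlation matrix. The sketch below uses simulated predictors (x1, x2 and x3 are hypothetical) and the cutoff of 0.9, which is the default of findCorrelation(). It only lists the pairs and does not reproduce the rule findCorrelation() uses to choose which member of a pair to drop.

```r
set.seed(2345)

# Hypothetical predictors: x2 is almost a copy of x1, x3 is independent
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x2, x3 = x3)

# Absolute pairwise correlations, keeping only entries above the diagonal
cor_mat <- abs(cor(toy))
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0

# Pairs whose absolute correlation exceeds the cutoff
high <- which(cor_mat > 0.9, arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]]
)
print(pairs)
```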

Transformations

It is also important that all the predictors have the same class for faster computation. Our variables are a mixture of integer and numeric classes, so all of them, except classe, are converted to numeric.

a <- length(training) - 1
for (i in 1:a) {
  training[, i] <- as.numeric(training[, i])
  testing[, i] <- as.numeric(testing[, i])
  testfinal[, i] <- as.numeric(testfinal[, i])
}
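The same conversion can be written without an explicit loop. The following sketch uses lapply() on a small hypothetical data frame and selects the predictors by name instead of by position.

```r
# Hypothetical frame with integer and numeric predictors plus a factor outcome
toy <- data.frame(
  a = 1:3,
  b = c(0.5, 1.5, 2.5),
  classe = factor(c("A", "B", "A"))
)

# Convert every column except the outcome to numeric in one step
predictors <- setdiff(names(toy), "classe")
toy[predictors] <- lapply(toy[predictors], as.numeric)

print(sapply(toy, class))
```

Selecting by name avoids relying on the outcome being the last column.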

Fitting a model

A random forest is fitted to the training data and predictions are made on the testing data. Validation on the held-out testing set gives an accuracy of 99.78%, that is, an estimated out-of-sample error rate of 0.22%. The model's own out-of-bag (OOB) estimate of the error rate is 0.25%. The model has 500 trees with 6 variables tried at each split.

mod1 <- randomForest(classe~.,data=training)
pred <- predict(mod1,testing,type="class")
mod1
## 
## Call:
##  randomForest(formula = classe ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.25%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4184    0    0    0    1 0.0002389486
## B    4 2842    2    0    0 0.0021067416
## C    0   10 2556    1    0 0.0042851578
## D    0    0   13 2398    1 0.0058043118
## E    0    0    0    5 2701 0.0018477458

Cross Validation

confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    1    0    0    0
##          B    0  947    5    0    0
##          C    0    1  850    4    0
##          D    0    0    0  800    0
##          E    0    0    0    0  901
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9978         
##                  95% CI : (0.996, 0.9989)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9972         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9979   0.9942   0.9950   1.0000
## Specificity            0.9997   0.9987   0.9988   1.0000   1.0000
## Pos Pred Value         0.9993   0.9947   0.9942   1.0000   1.0000
## Neg Pred Value         1.0000   0.9995   0.9988   0.9990   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1931   0.1733   0.1631   0.1837
## Detection Prevalence   0.2847   0.1941   0.1743   0.1631   0.1837
## Balanced Accuracy      0.9999   0.9983   0.9965   0.9975   1.0000
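The overall statistics follow directly from the counts in the confusion matrix. As a check, the sketch below re-enters the hold-out counts printed above by hand and recomputes the accuracy, the error rate and the per-class sensitivity in base R.

```r
# Hold-out confusion matrix from the report (rows: predicted, columns: actual)
holdout <- matrix(
  c(1395,   1,   0,   0,   0,
       0, 947,   5,   0,   0,
       0,   1, 850,   4,   0,
       0,   0,   0, 800,   0,
       0,   0,   0,   0, 901),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy is the share of cases on the diagonal
accuracy   <- sum(diag(holdout)) / sum(holdout)
error_rate <- 1 - accuracy

# Sensitivity per class: correct predictions divided by actual class size
sensitivity <- diag(holdout) / colSums(holdout)

cat(sprintf("Accuracy:   %.4f\n", accuracy))
cat(sprintf("Error rate: %.2f%%\n", 100 * error_rate))
print(round(sensitivity, 4))
```

The 4904 hold-out cases contain 11 misclassifications, which gives the reported accuracy of 0.9978.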

Predicting the final test data

predictions <- predict(mod1,testfinal,type="class")
answers <- as.character(predictions)
predictions
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Plotting Percentage of Predictions

actual <- as.data.frame(table(testing$classe))
names(actual) <- c("Actual", "ActualFreq")

predicted <- as.data.frame(table("Predicted" = pred, "Actual" = testing$classe))
# Join on Actual so that every cell is divided by the size of its true class
confusion <- merge(predicted, actual, by = "Actual")
confusion$Percent <- confusion$Freq / confusion$ActualFreq * 100

## Plotting Heatmap
library(ggplot2)  # already attached by caret; loaded explicitly for clarity
tile <- ggplot() +
  geom_tile(aes(x = Actual, y = Predicted, fill = Percent), data = confusion,
            color = "black", size = 0.1) +
  labs(x = "Actual", y = "Predicted")
tile <- tile +
  geom_text(aes(x = Actual, y = Predicted, label = sprintf("%.2f", Percent)),
            data = confusion, size = 3, colour = "black") +
  scale_fill_gradient(low = "grey", high = "red")

# Outline the diagonal (correct predictions) with a heavier border
tile <- tile +
  geom_tile(aes(x = Actual, y = Predicted),
            data = subset(confusion, as.character(Actual) == as.character(Predicted)),
            color = "black", size = 0.3, fill = "black", alpha = 0)
tile

Appendix

Source:

Classes of exercises(Output):