Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it.

In this analysis, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this analysis is to predict the manner in which they did the exercise.

Before beginning the analysis, let's first load the required libraries.

library(knitr); library(printr)
library(dplyr)
library(caret); library(ggplot2); library(randomForest)

Obtaining data

The data for this analysis come from this source: Human Activity Recognition

The training data for this analysis are available here, and the test data are available here.

While reading the data, cells containing the Excel division error string #DIV/0! were replaced with NA values, and blank fields were likewise replaced with NA values.

trainData <- read.csv("./Data/pml-training.csv", header=TRUE, sep=",", na.strings=c("NA","#DIV/0!",""))
testData <- read.csv("./Data/pml-testing.csv", header=TRUE, sep=",", na.strings=c("NA","#DIV/0!",""))

The dimensions of the training and test data are:

train dimensions: 19622 160
test dimensions: 20 160
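
These dimensions can be reproduced directly with dim(); a minimal check, assuming the trainData and testData objects read in above:

dim(trainData)   # rows and columns of the training data
dim(testData)    # rows and columns of the testing data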

As mentioned earlier, the goal of this analysis is to predict the manner in which participants did the exercise. This is the classe variable in the training set, a factor with the following levels:

## [1] "A" "B" "C" "D" "E"
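
It is also worth glancing at how the five classes are distributed in the training data; a quick tabulation (not part of the original output), assuming the trainData object above:

table(trainData$classe)   # counts per class in the training set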

Data pre-processing

Detecting missing values: If missing values exist in the data set used for prediction, the prediction algorithm will fail to work. Hence, in this part of the analysis we clean the data of missing values.

First we calculate the percentage of missing values in each of the variables in the original train and test datasets.

NApercentTrain <- sapply(trainData, function(df) {sum(is.na(df)==TRUE)/length(df)})
NApercentTest <- sapply(testData, function(df) {sum(is.na(df)==TRUE)/length(df)})
table(NApercentTrain > .97)
## FALSE  TRUE 
##    60   100
table(NApercentTest > .97)
## FALSE  TRUE 
##    60   100

The result indicates that \(100\) variables in both datasets have more than \(97\%\) missing values. Therefore we decide to remove these \(100\) variables from the train and test datasets.

colnames1 <- names(which(NApercentTrain < 0.97))
trainingData <- trainData[, colnames1]
colnames2 <- names(which(NApercentTest < 0.97))
testingData <- testData[, colnames2]

Once again we check to make sure that both data sets are free from missing values.

sum(is.na(trainingData) == TRUE)
## [1] 0
sum(is.na(testingData) == TRUE)
## [1] 0

Removing zero covariates: Next we remove predictors that contribute little to the prediction. The basic idea is to identify variables that have very little variability and are therefore not useful predictors.

nsv1 <- nearZeroVar(trainingData,saveMetrics=TRUE)
nsv2 <- nearZeroVar(testingData,saveMetrics=TRUE)
CleanTrainData <- trainingData[,which(nsv1$nzv==FALSE)]
CleanTestData <- testingData[,which(nsv2$nzv==FALSE)]
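
To see exactly which variables this step dropped, the nzv flag in the saved metrics can be inspected; a quick check (output not shown in the original):

rownames(nsv1)[nsv1$nzv]   # near-zero-variance columns removed from the training data
rownames(nsv2)[nsv2$nzv]   # near-zero-variance columns removed from the testing data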

Removing non-measurement data: Next we eliminate variables that do not contribute to the prediction and may even be harmful if kept in the training dataset: X, the timestamp variables, user_name, problem_id, and so forth. We identify and remove them below.

RemoveInx1 <- grepl("X|timestamp|user_name", names(CleanTrainData))
CleanTrainData <- CleanTrainData[, RemoveInx1 == FALSE]

RemoveInx2 <- grepl("X|timestamp|user_name|problem_id", names(CleanTestData))
CleanTestData <- CleanTestData[, RemoveInx2 == FALSE]

As a result, the final training dataset used for this analysis has \(54\) variables and the testing dataset has \(53\) variables.

train dimensions: 19622 54
test dimensions: 20 53

Data splitting for resampling

For cross-validation purposes, we split the clean training dataset into a training set (\(70\%\)) and a validation set (\(30\%\)) that will be used to validate the fitted model.

set.seed(24322)
inTrain <- createDataPartition(CleanTrainData$classe, p=0.7, list=FALSE)
training <- CleanTrainData[inTrain, ]
testing <- CleanTrainData[-inTrain, ]
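
As a quick sanity check (not part of the original output), we can confirm the split sizes and that createDataPartition preserved the class proportions:

dim(training); dim(testing)                    # sizes of the two partitions
round(prop.table(table(training$classe)), 3)   # class proportions in the training split
round(prop.table(table(testing$classe)), 3)    # class proportions in the validation split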

Model fitting

Choosing an appropriate model in machine learning is difficult, as the choice depends on the nature and characteristics of the problem being modeled. In this study we have a classification problem, and I intend to use the random forest method. The reason is that we still have a large number of variables, and random forest is particularly well suited to handling many predictors. It also selects important variables automatically and is robust to outliers and correlated variables. There are many other reasons why random forest is a powerful methodology, but I believe these few justify our choice.

We therefore use random forest with 10-fold cross-validation, a common choice for the number of folds. We limit the number of trees to \(200\); as you will see in the following section, \(200\) is sufficiently large in terms of the error.

ctrl <- trainControl(method = "cv", number=10)

rf_fit <- train(classe ~ .,
             data = training,
             method = "rf",  
             trControl = ctrl,
             allowParallel=TRUE,
             ntree=200)

print(rf_fit)
## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 12364, 12363, 12362, 12364, 12364, 12362, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9944672  0.9930009  0.001790278  0.002264912
##   27    0.9974523  0.9967774  0.001465737  0.001853931
##   53    0.9946859  0.9932774  0.002932587  0.003709830
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
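
The tuning results can also be extracted programmatically from the caret fit; a small sketch, assuming the rf_fit object above:

rf_fit$bestTune                 # mtry value selected by resampling
max(rf_fit$results$Accuracy)    # best cross-validated accuracy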

Plotting the fitted model

Before evaluating the fitted model, let's look at a few exploratory graphs of the final fitted model. The first graph shows accuracy vs. the number of randomly selected predictors:

plot(rf_fit)

The next graph presents error vs. the number of trees; as you can observe, even limiting to \(50\) trees would be sufficient for this study:

plot(rf_fit$finalModel)

The next graph demonstrates the predictors in order of importance. As you can observe, num_window is the most important predictor.

print(plot(varImp(rf_fit)))
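
Beyond the plot, the importance scores can be listed numerically; a small sketch, assuming varImp() returns the usual caret importance table for a random forest fit:

imp <- varImp(rf_fit)$importance
head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10)   # top 10 predictors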

Evaluating the fitted model

Now it's time to evaluate the fitted model by using it to predict the validation data set. We compute the confusion matrix and associated statistics to assess the performance of the model fit:

pre_rf <- predict(rf_fit, newdata = testing)
confusionMatrix(data = pre_rf, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    3    0    0    0
##          B    1 1135    1    0    0
##          C    0    1 1025    8    0
##          D    0    0    0  956    1
##          E    0    0    0    0 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9958, 0.9986)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9965   0.9990   0.9917   0.9991
## Specificity            0.9993   0.9996   0.9981   0.9998   1.0000
## Pos Pred Value         0.9982   0.9982   0.9913   0.9990   1.0000
## Neg Pred Value         0.9998   0.9992   0.9998   0.9984   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1929   0.1742   0.1624   0.1837
## Detection Prevalence   0.2848   0.1932   0.1757   0.1626   0.1837
## Balanced Accuracy      0.9993   0.9980   0.9986   0.9957   0.9995
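
The estimated out-of-sample error can be read straight from the same confusion matrix object; a minimal sketch using the pre_rf and testing objects above (roughly 0.25% given the accuracy reported):

cm <- confusionMatrix(data = pre_rf, testing$classe)
1 - as.numeric(cm$overall["Accuracy"])   # estimated out-of-sample error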

That is a remarkably good model, with \(0.99\) accuracy. The graph below also demonstrates how the fitted model correctly predicted the vast majority of the classes. To construct this graph I selected only a couple of arbitrary variables.

predRight <- pre_rf == testing$classe
qplot(accel_belt_x,accel_belt_y,pitch_belt, data=testing, colour=predRight)   

Predicting new values

Now the final step is to use the fitted model to predict classe for the original testing dataset. The result below shows the final predictions for the original test data.

finalResult <- predict(rf_fit, CleanTestData)
finalResult
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Submission

And finally we use the following code to generate the text files for submission.

answers <- finalResult

# Write one text file per prediction (problem_id_1.txt, problem_id_2.txt, ...)
pml_write_files <- function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}

pml_write_files(answers)