Prediction - “how well they do it?”

In this report we analyze personal activity data collected from 6 participants using “Human Activity Recognition” devices: accelerometers on the belt, forearm, arm, and dumbbell. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and we will predict the manner in which they did the exercise. More details about “Human Activity Recognition” can be found at http://groupware.les.inf.puc-rio.br/har

Prediction study design

The prediction study design is composed of the following sequence of steps:

Question -> Input Data -> Features -> Machine Learning Algorithm -> Parameters -> Evaluation

There are 2 datasets provided: training and testing. We will use the provided testing dataset for validation purposes and not touch it until the model is finalized. The training dataset will be split into two parts, one for exploration and model creation and the other for testing.
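
The split described above can be sketched as follows. This is a minimal illustration in Python (the report's own analysis was run in R), not the code used for the report. The helper `stratified_split`, the toy labels, the seed and the 70/30 proportion are all assumptions; the 70/30 figure is only suggested by the dataset dimensions reported in the Exploration section (13737 vs 5885 rows).

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), keeping about train_fraction of each class in train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])           # first part goes to training
        test_idx.extend(indices[cut:])            # remainder goes to testing
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels standing in for the "classe" column.
toy_labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps the class proportions of the outcome roughly equal in both parts, which matters when the classes are not perfectly balanced.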

High level steps that we will follow for this study design:

1. Exploration to identify key features of the dataset

2. Pre-processing

3. Multiple Model creation and best fit selection

4. Out-of-sample application and errors

Exploration

Let us first load the relevant datasets and explore them for trends and observations.

Dimensions of the training, testing and validation datasets:

## [1] 13737   159
## [1] 5885  159
## [1]  20 160

Number of variables/columns in the training dataset with more than 95% NA values:

## [1] 67

Number of variables/columns in the training dataset with more than 95% blank entries (no observations):

## [1] 33

The training dataset has 159 variables/columns, of which 67 have more than 95% of their observations as NA and another 33 are more than 95% blank. Such columns behave like near-zero-variance variables: they carry almost no information that adds meaning to the dataset. It makes sense to remove them, as most prediction models do not handle NAs or blanks in the input data well. Imputing with methods like K-nearest neighbours (knnImpute) will not work either, as the columns are too sparse.
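
The column filter described above can be sketched as follows. This is a minimal illustration in Python, not the R code behind the report. The helper `sparse_columns`, the set of values treated as missing, and the toy rows are assumptions made for the example.

```python
def sparse_columns(rows, threshold=0.95, missing=(None, "", "NA")):
    """Return the names of columns whose share of missing entries exceeds threshold."""
    n_rows = len(rows)
    return [
        col
        for col in rows[0].keys()
        if sum(row[col] in missing for row in rows) / n_rows > threshold
    ]


# Hypothetical toy table: two columns are missing in 99 of 100 rows.
rows = [
    {"roll_belt": 1.4 + i, "mostly_na": None, "mostly_blank": "", "classe": "A"}
    for i in range(99)
]
rows.append({"roll_belt": 0.5, "mostly_na": 3.2, "mostly_blank": "x", "classe": "B"})

to_drop = sparse_columns(rows)
# Keep only the columns that are not flagged as sparse.
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)
```

Treating NA and blank entries with one rule removes both kinds of sparse column in a single pass.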

The summary statistics of the whole training dataset would be huge, so let's summarize only the outcome variable, “classe”:

##    A    B    C    D    E 
## 3906 2658 2396 2252 2525
##  Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

“classe”, the variable we are trying to predict, is a factor variable with 5 levels.

Pre-Processing

As we saw during exploration, the training dataset has variables that are more than 95% NA or blank. It makes sense to get rid of them first and examine how many variables we are left with.

dim(nTraining)
## [1] 13737    59

So we are left with 59 variables (159 - 67 - 33 = 59).

Model creation and selection

We will first apply Linear Discriminant Analysis, followed by decision trees, and finally Random Forests, which we expect to be the most accurate of the three.

Linear Discriminant Analysis

We chose LDA because other linear models, such as Linear Models and Generalized Linear Models, are not typically suited to predicting outcomes with more than 2 classes (“classe” is a factor variable with 5 levels).

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1556  102   16    0    0
##          B  162  820  154    3    0
##          C    4  132  852   35    3
##          D    0    1  109  801   53
##          E    0    0    5  116  961
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8479         
##                  95% CI : (0.8385, 0.857)
##     No Information Rate : 0.2926         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8075         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9036   0.7773   0.7500   0.8387   0.9449
## Specificity            0.9717   0.9340   0.9634   0.9669   0.9751
## Pos Pred Value         0.9295   0.7199   0.8304   0.8309   0.8882
## Neg Pred Value         0.9606   0.9505   0.9416   0.9687   0.9883
## Prevalence             0.2926   0.1793   0.1930   0.1623   0.1728
## Detection Rate         0.2644   0.1393   0.1448   0.1361   0.1633
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9376   0.8556   0.8567   0.9028   0.9600

So we can see that the LDA model has around 85% accuracy (0.8479) on the testing split.
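
As a check on the figures above, accuracy and Cohen's kappa can be recomputed from the confusion matrix alone. The sketch below is in Python (the report itself was produced in R). The counts are copied from the LDA table above, and the function names are chosen for this illustration.

```python
# Counts copied from the LDA confusion matrix above (rows and columns in order A..E).
lda = [
    [1556, 102, 16, 0, 0],
    [162, 820, 154, 3, 0],
    [4, 132, 852, 35, 3],
    [0, 1, 109, 801, 53],
    [0, 0, 5, 116, 961],
]


def accuracy(matrix):
    """Share of cases on the diagonal, i.e. where both labels agree."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def kappa(matrix):
    """Cohen's kappa: agreement beyond what the marginal totals give by chance."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    observed = accuracy(matrix)
    return (observed - expected) / (1 - expected)


print(round(accuracy(lda), 4), round(kappa(lda), 4))  # 0.8479 0.8075
```

Both values match the "Accuracy" and "Kappa" lines of the output: 4990 of the 5885 testing cases lie on the diagonal.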

Decision Trees

Now that we have considered a model (LDA) that does well with linear relationships, let us apply a decision-tree-based model, which is easy to interpret and does better in nonlinear settings.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1618   44   12    0    0
##          B   54  935  147    3    0
##          C    7   67  927   24    1
##          D    0   42  117  721   84
##          E    0    0    9  122  951
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8754          
##                  95% CI : (0.8667, 0.8838)
##     No Information Rate : 0.2853          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8425          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9637   0.8594   0.7649   0.8287   0.9180
## Specificity            0.9867   0.9575   0.9788   0.9515   0.9730
## Pos Pred Value         0.9665   0.8209   0.9035   0.7479   0.8789
## Neg Pred Value         0.9855   0.9678   0.9413   0.9697   0.9823
## Prevalence             0.2853   0.1849   0.2059   0.1478   0.1760
## Detection Rate         0.2749   0.1589   0.1575   0.1225   0.1616
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9752   0.9084   0.8718   0.8901   0.9455
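
The per-class rows in the table above follow directly from the confusion matrix: sensitivity divides each diagonal count by its column total, and positive predictive value divides it by its row total. The Python sketch below (the report itself was produced in R) uses counts copied from the decision tree table. The variable names are chosen for this illustration.

```python
# Counts copied from the decision tree confusion matrix above (order A..E).
tree = [
    [1618, 44, 12, 0, 0],
    [54, 935, 147, 3, 0],
    [7, 67, 927, 24, 1],
    [0, 42, 117, 721, 84],
    [0, 0, 9, 122, 951],
]
classes = ["A", "B", "C", "D", "E"]

row_sums = [sum(row) for row in tree]
col_sums = [sum(col) for col in zip(*tree)]

# Sensitivity: diagonal count over the "Reference" (column) total.
sensitivity = {c: round(tree[i][i] / col_sums[i], 4) for i, c in enumerate(classes)}
# Positive predictive value: diagonal count over the "Prediction" (row) total.
pos_pred_value = {c: round(tree[i][i] / row_sums[i], 4) for i, c in enumerate(classes)}

print(sensitivity)
print(pos_pred_value)
```

Class C has the lowest sensitivity (0.7649) and class D the lowest positive predictive value (0.7479), so most of the tree's errors involve those two classes.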

With decision trees the accuracy is around 88% (0.8754). Let us plot the decision tree as a diagram, like a dendrogram, as it is easier to understand.

Random Forests

Random forests are an extension of bagging (bootstrap aggregation) for classification and regression trees and, along with boosting, are considered among the most accurate prediction algorithms.
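
To make the bagging idea concrete, here is a toy sketch in Python (the report itself was produced in R). It shows bootstrap aggregation only: each "tree" is a one-split stump fitted to a bootstrap resample, and predictions are combined by majority vote. A random forest additionally considers a random subset of predictors at each split, which this sketch leaves out. The data, function names and parameters are hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split classifier on (x, label) pairs by minimising training errors."""
    best = None
    for cut, _ in points:
        left = [y for x, y in points if x <= cut]
        right = [y for x, y in points if x > cut]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, cut, left_label, right_label)
    _, cut, left_label, right_label = best
    return lambda x: left_label if x <= cut else right_label


def bagged_predict(points, x, n_trees=25, seed=0):
    """Majority vote over stumps, each fitted to a bootstrap resample of points."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        votes.append(fit_stump(sample)(x))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-feature toy data with two of the five class labels.
toy = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.4, "A"),
       (0.6, "E"), (0.7, "E"), (0.8, "E"), (0.9, "E")]
print(bagged_predict(toy, 0.05), bagged_predict(toy, 0.95))
```

Averaging many trees fitted to different resamples reduces the variance of a single tree. The random forest fitted for this report gave the results that follow.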

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    1 1138    0    0    0
##          C    0    1 1025    0    0
##          D    0    0    3  959    2
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9976, 0.9995)
##     No Information Rate : 0.2846          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9985          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9991   0.9971   1.0000   0.9982
## Specificity            1.0000   0.9998   0.9998   0.9990   1.0000
## Pos Pred Value         1.0000   0.9991   0.9990   0.9948   1.0000
## Neg Pred Value         0.9998   0.9998   0.9994   1.0000   0.9996
## Prevalence             0.2846   0.1935   0.1747   0.1630   0.1842
## Detection Rate         0.2845   0.1934   0.1742   0.1630   0.1839
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9997   0.9995   0.9984   0.9995   0.9991

So we can see that our model built using the Random Forests algorithm has an accuracy of about 99.9% (0.9988), with only 7 of the 5885 testing cases misclassified.

Out-of-sample application and errors

All the models in the last step were applied to the testing set carved out of the training dataset. Such estimates tend to be optimistic, as that data comes from the same source we explored, pre-processed and tuned on, and it was also used to choose the best model. To get an honest prediction accuracy we must apply our best model to a dataset that is untouched (the validation dataset here). The validation dataset doesn’t have the “classe” outcome variable, so the out-of-sample error cannot be computed from it directly; we will predict the outcome for its 20 cases using the Random Forests model and the predictors present in the validation dataset.

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
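
As a rough guide to how far these 20 predictions can be trusted, the error rate on the testing split can be carried over to the validation cases. The Python sketch below (the report itself was produced in R) uses counts copied from the Random Forests confusion matrix. It assumes the validation cases resemble the testing split and that errors are independent; neither assumption is verified here.

```python
# Counts copied from the Random Forests confusion matrix (order A..E).
rf = [
    [1674, 0, 0, 0, 0],
    [1, 1138, 0, 0, 0],
    [0, 1, 1025, 0, 0],
    [0, 0, 3, 959, 2],
    [0, 0, 0, 0, 1082],
]

total = sum(sum(row) for row in rf)
correct = sum(rf[i][i] for i in range(len(rf)))
misclassified = total - correct

# Estimated error rate from the testing split.
error_rate = misclassified / total
# Chance that all 20 validation cases are right, assuming independent errors.
p_all_20_correct = (correct / total) ** 20

print(misclassified, round(error_rate, 4))
```

Under these assumptions the estimated error rate is about 0.12%, and the chance that all 20 validation predictions are correct is roughly 98%.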