Health Data Analysis

Overview

This document is the final report for the peer-graded assignment of the Practical Machine Learning course offered by JHU. It was built in RStudio with knitr and published as an HTML document.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

The following statements were made by the original authors of the dataset.

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).

Source: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th Augmented Human (AH) International Conference in cooperation with ACM SIGCHI (Augmented Human’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Loading Required Libraries

suppressPackageStartupMessages(library(caret)) 
suppressPackageStartupMessages(library(rpart)) 
suppressPackageStartupMessages(library(rpart.plot)) 
suppressPackageStartupMessages(library(rattle))  
suppressPackageStartupMessages(library(randomForest)) 
suppressPackageStartupMessages(library(corrplot))

Loading and Cleaning the Dataset

We use the training data for the analysis and reserve the testing data for answering the quiz questions. The training data is split into a training set and a testing set in the ratio 70:30. Near-zero-variance columns, columns that contain NAs, and the first five columns (identifiers and timestamps) are then removed from both sets.

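# Download the raw CSV files (method='curl' needs curl installed on the system)
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',destfile='./training_set.csv',method='curl')
download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',destfile='./testing_set.csv',method='curl')
train_dat<-read.csv('./training_set.csv')
test_dat<-read.csv('./testing_set.csv')

# Split the labelled data 70:30 into a training set and a held-out testing set
inTrain<-createDataPartition(y=train_dat$classe,p=0.7,list=FALSE)
train_set<-train_dat[inTrain,]
test_set<-train_dat[-inTrain,]

# Drop near-zero-variance columns, identified on the training set only
NZV<-nearZeroVar(train_set)
train_set<-train_set[,-NZV]
test_set<-test_set[,-NZV]

# All_NA holds the fraction of NAs per column; == FALSE keeps columns where it is exactly 0
All_NA<-sapply(train_set,function(x) mean(is.na(x)))
train_set<-train_set[,All_NA==FALSE]
test_set<-test_set[,All_NA==FALSE]

# Drop the first five columns (identifiers and timestamps)
train_set<-train_set[,-(1:5)]
test_set<-test_set[,-(1:5)]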

After the cleaning process, both the training and testing sets have 54 columns.
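The NA filter used above can be illustrated on a small example. This is a minimal sketch in base R; the data frame `toy` and its values are hypothetical and only stand in for the real training set.

```r
# Hypothetical toy data standing in for the real training set
toy <- data.frame(
  roll_belt  = c(1.41, 1.42, 1.48, 1.45),
  max_roll   = c(NA, NA, NA, 9.5),        # mostly missing
  pitch_belt = c(8.07, 8.06, 8.05, 8.09),
  classe     = factor(c("A", "A", "B", "B"))
)

# Fraction of missing values in each column
na_frac <- sapply(toy, function(x) mean(is.na(x)))

# Keep only the columns with no missing values at all.
# Comparing the fraction against FALSE, as the report does,
# is equivalent because FALSE is coerced to 0.
toy_clean <- toy[, na_frac == 0]

print(na_frac)
print(names(toy_clean))
```

Note that this rule drops a column even if it has a single NA. A threshold such as `na_frac < 0.95` would be a looser alternative, but that is not what the report uses.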

Correlation

Before building the prediction models, we examine the correlation among the predictor variables.

Highly correlated variables are depicted in dark colors in the correlation plot. Because very few variables are strongly correlated, PCA is not needed as a pre-processing step.
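The code that produced the correlation plot is not shown in this report. As a rough sketch of the idea, the following base R example builds a correlation matrix on simulated data and lists the pairs whose absolute correlation exceeds a cutoff. The data, the variable names, and the cutoff of 0.8 are all assumptions made for illustration.

```r
set.seed(1)  # reproducible toy data
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1, so strongly correlated
  x3 = rnorm(n),
  x4 = rnorm(n)
)

cor_mat <- cor(toy)

# Use only the upper triangle so that each pair is listed once
cutoff <- 0.8  # assumed threshold for "strongly correlated"
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

A matrix like `cor_mat` can be passed to `corrplot()` from the corrplot package loaded earlier to draw the kind of plot described above. A short list of pairs relative to the number of predictors supports the decision to skip PCA.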

Prediction Model Building

Three methods are used for model building. The one with the highest accuracy on the held-out 30% testing set will be applied to the quiz test data.

Random Forest

set.seed(12345)
controlRF<-trainControl(method='cv',number=3,verboseIter=FALSE)
mdlRF<-train(classe~.,data=train_set,method='rf',trControl=controlRF)
mdlRF$finalModel


Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.23%
Confusion matrix:
     A    B    C    D    E  class.error
A 3904    1    0    0    1 0.0005120328
B    8 2647    2    1    0 0.0041384500
C    0    5 2391    0    0 0.0020868114
D    0    0    7 2244    1 0.0035523979
E    0    0    0    5 2520 0.0019801980

predRF<-predict(mdlRF,newdata=test_set)
conmatRF<-confusionMatrix(predRF,as.factor(test_set$classe))
conmatRF

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1673    3    0    0    0
         B    1 1132    3    0    0
         C    0    4 1023    2    0
         D    0    0    0  961    3
         E    0    0    0    1 1079

Overall Statistics
                                          
               Accuracy : 0.9971          
                 95% CI : (0.9954, 0.9983)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9963          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9994   0.9939   0.9971   0.9969   0.9972
Specificity            0.9993   0.9992   0.9988   0.9994   0.9998
Pos Pred Value         0.9982   0.9965   0.9942   0.9969   0.9991
Neg Pred Value         0.9998   0.9985   0.9994   0.9994   0.9994
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2843   0.1924   0.1738   0.1633   0.1833
Detection Prevalence   0.2848   0.1930   0.1749   0.1638   0.1835
Balanced Accuracy      0.9993   0.9965   0.9979   0.9981   0.9985
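The overall accuracy and the per-class sensitivity in the output above can be recomputed directly from the confusion matrix. The sketch below copies the random forest counts by hand into a base R matrix; the object names are illustrative and not part of the original analysis.

```r
# Random forest confusion matrix copied from the output above
# (rows = predictions, columns = reference classes)
cm_rf <- matrix(
  c(1673,    3,    0,   0,    0,
       1, 1132,    3,   0,    0,
       0,    4, 1023,   2,    0,
       0,    0,    0, 961,    3,
       0,    0,    0,   1, 1079),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm_rf)) / sum(cm_rf)    # correct predictions / all predictions
oos_error   <- 1 - accuracy                     # estimated out-of-sample error
sensitivity <- diag(cm_rf) / colSums(cm_rf)     # per-class recall

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```

Of the 5885 held-out observations, 5868 are classified correctly, which gives the reported accuracy of 0.9971 and an estimated out-of-sample error of about 0.29%.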

Decision Trees

set.seed(12345)
mdldtree<-rpart(classe~.,data=train_set,method='class')
fancyRpartPlot(mdldtree)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

preddtree<-predict(mdldtree,newdata=test_set,type='class')
conmatdtree<-confusionMatrix(preddtree,as.factor(test_set$classe))
conmatdtree

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1555  180   19   41    8
         B   34  773   85   96   41
         C    1   53  821   38    4
         D   70   82   92  712   84
         E   14   51    9   77  945

Overall Statistics
                                          
               Accuracy : 0.8167          
                 95% CI : (0.8065, 0.8265)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7675          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9289   0.6787   0.8002   0.7386   0.8734
Specificity            0.9411   0.9461   0.9802   0.9333   0.9686
Pos Pred Value         0.8625   0.7512   0.8953   0.6846   0.8622
Neg Pred Value         0.9708   0.9246   0.9587   0.9480   0.9714
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2642   0.1314   0.1395   0.1210   0.1606
Detection Prevalence   0.3064   0.1749   0.1558   0.1767   0.1862
Balanced Accuracy      0.9350   0.8124   0.8902   0.8360   0.9210
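The Kappa statistic reported above corrects the observed accuracy for the agreement expected by chance. As a worked example, the following base R sketch recomputes it from the decision tree counts, which are copied by hand from the output; the object names are illustrative.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predictions, columns = reference classes)
cm_tree <- matrix(
  c(1555, 180,  19,  41,   8,
      34, 773,  85,  96,  41,
       1,  53, 821,  38,   4,
      70,  82,  92, 712,  84,
      14,  51,   9,  77, 945),
  nrow = 5, byrow = TRUE
)

n  <- sum(cm_tree)
po <- sum(diag(cm_tree)) / n                         # observed agreement (accuracy)
pe <- sum(rowSums(cm_tree) * colSums(cm_tree)) / n^2 # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)                    # Cohen's kappa

print(round(po, 4))
print(round(kappa_hat, 4))
```

The gap between the accuracy (0.8167) and Kappa (0.7675) is much larger here than for the random forest, because chance agreement accounts for a bigger share of the tree's correct predictions.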

Generalized Boosted Model (GBM)

set.seed(12345)
controlGBM<-trainControl(method='repeatedcv',number=5,repeats=1)
mdlGBM<-train(classe~.,data=train_set,method='gbm',trControl=controlGBM,verbose=FALSE)
mdlGBM$finalModel

A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 53 predictors of which 53 had non-zero influence.

predGBM<-predict(mdlGBM,newdata=test_set)
conmatGBM<-confusionMatrix(predGBM,as.factor(test_set$classe))
conmatGBM

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1673   13    0    0    1
         B    1 1111   13    3    4
         C    0   14 1010   13    2
         D    0    1    2  947    6
         E    0    0    1    1 1069

Overall Statistics
                                        
               Accuracy : 0.9873        
                 95% CI : (0.9841, 0.99)
    No Information Rate : 0.2845        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.9839        
                                        
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9994   0.9754   0.9844   0.9824   0.9880
Specificity            0.9967   0.9956   0.9940   0.9982   0.9996
Pos Pred Value         0.9917   0.9814   0.9721   0.9906   0.9981
Neg Pred Value         0.9998   0.9941   0.9967   0.9966   0.9973
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2843   0.1888   0.1716   0.1609   0.1816
Detection Prevalence   0.2867   0.1924   0.1766   0.1624   0.1820
Balanced Accuracy      0.9980   0.9855   0.9892   0.9903   0.9938
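The 95% CI printed next to the accuracy can be approximated with an exact binomial interval on the number of correct predictions. The sketch below does this for the GBM counts using `binom.test` from base R. That caret computes its interval this way is an assumption here, so small differences in the last digit are possible.

```r
# Correct GBM predictions (diagonal of the matrix above) and total test cases
correct <- 1673 + 1111 + 1010 + 947 + 1069
total   <- 5885

# Exact (Clopper-Pearson) 95% interval for the proportion of correct predictions
ci <- binom.test(correct, total)$conf.int

print(round(correct / total, 4))
print(round(as.numeric(ci), 4))
```

Since this interval lies entirely below the random forest interval (0.9954, 0.9983), the difference between the two models is unlikely to be due to the particular 30% split alone.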

Applying the Best Model to the Test Data

The accuracies of the three classification models on the held-out testing set are: a) Random Forest: 0.9971 b) Decision Tree: 0.8167 c) Generalized Boosted Model: 0.9873

Random forest has the highest accuracy, so it is used to predict the 20 cases in the quiz test data.

predictTest<-predict(mdlRF,newdata=test_dat)
predictTest

 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
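The model selection step above can also be written programmatically. This is a small sketch that takes the accuracy values from the three confusion matrix outputs in this report; the vector names are illustrative.

```r
# Held-out accuracies copied from the three "Overall Statistics" outputs.
# With the report's objects these could be read from
# conmatRF$overall["Accuracy"] and so on.
acc <- c(RandomForest = 0.9971, DecisionTree = 0.8167, GBM = 0.9873)

ranked <- sort(acc, decreasing = TRUE)  # models ordered from best to worst
best   <- names(which.max(acc))         # name of the most accurate model

print(ranked)
print(best)
```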