Practical Machine Learning: Peer Assessment

Rithesh Kumar

October 2014

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how an exercise was performed. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Executive Summary

The Weight Lifting Exercise Dataset was analysed to predict the exercise class (the classe variable) from the other variables in the dataset. First, the data were preprocessed to remove columns with a large number of NA values. Next, the nearZeroVar function was used to check whether any remaining attributes had near-zero variance. After preprocessing, a classification tree was fitted with 4-fold cross-validation. Since its out-of-sample accuracy turned out to be below 50%, a random forest was trained on the same data, also with 4-fold cross-validation. It reached an out-of-sample accuracy of 99.4% on the held-out set, and this model was therefore selected.

Loading And PreProcessing The Data

Tabulating the number of missing values per column shows that 100 of the 160 columns have 19,216 missing values each, while the remaining 60 columns have none.

{
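     set.seed(1234)  # make the data split and model fits reproducible
     library(caret, quietly = TRUE)
     # Treat empty strings, "NA" and "NULL" as missing values
     data <- read.csv("pml-training.csv", na.strings = c("", "NA", "NULL"))
     quiz <- read.csv("pml-testing.csv", na.strings = c("", "NA", "NULL"))
     # Tabulate how many columns have each count of missing values
     table(sapply(data, function(x) sum(is.na(x))))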
}
## 
##     0 19216 
##    60   100

These 100 columns were removed to obtain a clean dataset.

The first 7 of the remaining columns are also removed, as they hold bookkeeping fields (row index, user name, timestamps and window markers) that do not aid in predicting the class.

Finally, the nearZeroVar function is used to check whether any remaining column has near-zero variance, since such columns carry little information and can hamper model training. It returns integer(0), so no further columns are removed.

     # Keep only the columns without any missing values
     cleanData <- data[, colSums(is.na(data)) == 0]
     cleanData <- cleanData[, -c(1:7)]  # first 7 columns of the dataset are removed
     nearZeroVar(cleanData)             # indices of near-zero-variance columns
## integer(0)
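
The two filters above can be illustrated on a toy data frame. This is only a base-R sketch, not caret's implementation: the data are made up, the helper near_zero_var_cols is hypothetical, and it merely approximates the default nearZeroVar criteria (frequency ratio of the two most common values above 95/5, and at most 10% distinct values).

```r
# Toy data (made up for illustration): one mostly-missing column,
# one near-constant column, and one informative column.
toy <- data.frame(
  mostly_na  = c(1, rep(NA, 39)),
  near_const = c(rep(0, 39), 5),
  signal     = seq(0.5, 20, by = 0.5)
)

# Filter 1: keep only the columns without any missing values.
na_counts <- colSums(is.na(toy))
complete <- toy[, na_counts == 0, drop = FALSE]

# Filter 2 (hypothetical helper): flag columns that are constant, or whose
# most common value dominates while few distinct values are present.
near_zero_var_cols <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    if (length(counts) < 2) return(TRUE)  # constant column: zero variance
    freq_ratio <- counts[[1]] / counts[[2]]
    pct_unique <- 100 * length(counts) / length(x)
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  which(flagged)
}

flagged_cols <- near_zero_var_cols(complete)
names(complete)      # columns that survive the NA filter
names(flagged_cols)  # columns flagged as near-zero variance
```

In the toy data the near-constant column is flagged, whereas in the report nearZeroVar flagged nothing, so no such removal step was needed there.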

Splitting The Data Into Training And Cross-Validation Sets

     # Stratified split on classe: 70% for training, 30% held out for validation
     inTrain <- createDataPartition(cleanData$classe, p = 0.7, list = FALSE)
     training <- cleanData[inTrain, ]
     testing <- cleanData[-inTrain, ]
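
createDataPartition samples within each level of the outcome, so the class proportions in both parts stay close to those of the full data. The idea can be sketched in base R as follows. The helper stratified_split_index and the toy outcome are hypothetical, and this is an illustration of stratified sampling rather than caret's own code.

```r
set.seed(1234)

# Toy outcome with unbalanced classes (made up for illustration).
classe <- factor(rep(c("A", "B", "C"), times = c(60, 30, 10)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_split_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_split_index(classe, p = 0.7)

# Class counts in each part keep the 6:3:1 ratio of the full data.
train_counts <- table(classe[in_train])
test_counts  <- table(classe[-in_train])
train_counts
test_counts
```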

Model Fitting Using Trees

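     modFit <- train(classe ~ ., data = training, method = "rpart",
                     trControl = trainControl(method = "cv", number = 4, allowParallel = TRUE))
## Loading required namespace: e1071
     # Note: actual classes are passed first, so rows ("Prediction") are actual classes
     # and columns ("Reference") are predicted; accuracy and kappa are unaffected.
     confusionMatrix(testing$classe, predict(modFit, testing))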
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1530   35  105    0    4
##          B  486  379  274    0    0
##          C  493   31  502    0    0
##          D  452  164  348    0    0
##          E  168  145  302    0  467
## 
## Overall Statistics
##                                         
##                Accuracy : 0.489         
##                  95% CI : (0.476, 0.502)
##     No Information Rate : 0.532         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.331         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.489   0.5027   0.3279       NA   0.9915
## Specificity             0.948   0.8519   0.8797    0.836   0.8864
## Pos Pred Value          0.914   0.3327   0.4893       NA   0.4316
## Neg Pred Value          0.620   0.9210   0.7882       NA   0.9992
## Prevalence              0.532   0.1281   0.2602    0.000   0.0800
## Detection Rate          0.260   0.0644   0.0853    0.000   0.0794
## Detection Prevalence    0.284   0.1935   0.1743    0.164   0.1839
## Balanced Accuracy       0.718   0.6773   0.6038       NA   0.9390

Since the accuracy on the held-out validation set (the estimated out-of-sample accuracy) is only 48.9%, and the tree never predicts class D, we try fitting a different model.
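
The accuracy and kappa values reported for the tree can be reproduced by hand from the printed counts. The sketch below copies the counts from the confusion matrix above. Because the call passes testing$classe first, the rows hold the actual classes and the columns the predicted ones. The variable names are chosen here for illustration.

```r
# Counts copied from the decision-tree confusion matrix
# (rows: actual class, columns: class predicted by the tree).
cm <- matrix(
  c(1530,  35, 105, 0,   4,
     486, 379, 274, 0,   0,
     493,  31, 502, 0,   0,
     452, 164, 348, 0,   0,
     168, 145, 302, 0, 467),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 3)
colSums(cm)  # the column total for D is 0: the tree never predicts D
```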

Model Fitting Using Random Forests

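     modFit <- train(classe ~ ., data = training, method = "rf",
                     trControl = trainControl(method = "cv", number = 4, allowParallel = TRUE))
     # Actual classes are passed first: rows are actual classes, columns are predicted.
     confusionMatrix(testing$classe, predict(modFit, testing))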
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B   11 1127    1    0    0
##          C    0    4 1018    4    0
##          D    0    2    6  955    1
##          E    0    1    2    3 1076
## 
## Overall Statistics
##                                         
##                Accuracy : 0.994         
##                  95% CI : (0.992, 0.996)
##     No Information Rate : 0.286         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.992         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.993    0.994    0.991    0.993    0.999
## Specificity             1.000    0.997    0.998    0.998    0.999
## Pos Pred Value          1.000    0.989    0.992    0.991    0.994
## Neg Pred Value          0.997    0.999    0.998    0.999    1.000
## Prevalence              0.286    0.193    0.175    0.163    0.183
## Detection Rate          0.284    0.192    0.173    0.162    0.183
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       0.997    0.996    0.995    0.995    0.999

This model is accepted, as its accuracy on the held-out validation set is 99.4% (95% CI: 99.2% to 99.6%), which corresponds to an estimated out-of-sample error of about 0.6%.
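
The estimated out-of-sample error and its confidence interval follow directly from the counts in the random-forest confusion matrix. The sketch below uses base R only. The caret documentation describes the accuracy interval as coming from binom.test, so the same function is used here, but the variable names are illustrative.

```r
# Counts read off the random-forest confusion matrix above.
correct <- 1674 + 1127 + 1018 + 955 + 1076         # diagonal entries
errors  <- 11 + 1 + 4 + 4 + 2 + 6 + 1 + 1 + 2 + 3  # off-diagonal entries
total   <- correct + errors                        # rows in the held-out set

# Estimated out-of-sample error: share of misclassified held-out cases.
error_rate <- errors / total

# Exact binomial (Clopper-Pearson) 95% interval for the accuracy.
ci <- as.numeric(binom.test(correct, total)$conf.int)

round(error_rate, 4)
round(ci, 3)
```

Since the held-out rows played no part in fitting the model, this error rate is a reasonable estimate of how the model behaves on new data from the same participants and sensors.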

Applying Selected Model To Test Set

     # Select the same variables in the test set as in the training set
     cleanTestData <- quiz[, colSums(is.na(data)) == 0]
     cleanTestData <- cleanTestData[, -c(1:7)]  # first 7 columns of the dataset are removed
     answers <- predict(modFit, cleanTestData)
     print(answers)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
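
The column selection above reuses the missing-value mask computed on the training file, which works because both files share the same column order. A more defensive variant, sketched below on made-up toy data frames, selects the predictors by name instead. The objects training_toy, quiz_toy and predictor_names are hypothetical and not part of the original analysis.

```r
# Toy stand-ins for the cleaned training data and the raw quiz file
# (made up for illustration; the real files have many more columns).
training_toy <- data.frame(roll_belt = c(1.4, 1.5),
                           pitch_arm = c(-70, -69),
                           classe    = factor(c("A", "B")))
quiz_toy <- data.frame(problem_id = 1:2,
                       pitch_arm  = c(-68, -71),
                       all_na     = c(NA, NA),
                       roll_belt  = c(1.3, 1.6))

# Predictor names are every training column except the outcome.
predictor_names <- setdiff(names(training_toy), "classe")

# Fail early if the quiz file lacks any predictor the model was trained on.
missing_cols <- setdiff(predictor_names, names(quiz_toy))
stopifnot(length(missing_cols) == 0)

# Select by name, so the column order in the quiz file does not matter.
quiz_clean <- quiz_toy[, predictor_names, drop = FALSE]
names(quiz_clean)
```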

Creating Submission File

     # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
     for (i in seq_along(answers)) {
          filename <- paste0("problem_id_", i, ".txt")
          write.table(answers[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
     }

These results will be submitted for the assignment.
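
The file-writing step can be checked before submission by writing to a temporary directory and reading the files back. The sketch below is an assumption-laden variant: the helper write_answer_files is hypothetical, it uses writeLines rather than write.table, and it only writes the first three predictions shown above.

```r
# Hypothetical helper: write one text file per prediction into out_dir.
write_answer_files <- function(answers, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(out_dir, paste0("problem_id_", seq_along(answers), ".txt"))
  for (i in seq_along(answers)) {
    writeLines(as.character(answers[i]), paths[i])
  }
  invisible(paths)
}

# First three predictions from the output above, written to a temp directory.
answers_toy <- factor(c("B", "A", "B"), levels = LETTERS[1:5])
out_dir <- file.path(tempdir(), "pml_answers")
paths <- write_answer_files(answers_toy, out_dir)

# Read the files back to confirm each one holds a single class label.
written <- vapply(paths, readLines, character(1), USE.NAMES = FALSE)
written
```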