Human Activity Recognition - Qualitative Activity Recognition

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Goal

The goal of the project is to predict the manner in which the participants did the exercise. This is the classe variable in the training set.

setwd("C:/Raghu/Rscipts/ML")

# download the training set only if it is not already cached locally
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if(!file.exists("./data/training.csv")){
  # use the default download method; forcing method = "curl" fails with
  # status 127 on systems where the curl binary is not installed
  download.file(fileUrl1, destfile = "./data/training.csv")
}
train <- read.csv("./data/training.csv")

dim(train)
## [1] 19622   160
table(train$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
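
The five classes are reasonably balanced. As a quick visual check, a minimal sketch using base R graphics (any plotting package would do the same):

# bar plot of the classe levels to confirm no class is badly under-represented
barplot(table(train$classe),
        main = "Distribution of classe in the training data",
        xlab = "classe", ylab = "count")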

Data Cleanup and Training Data Split

We remove unnecessary bookkeeping columns and any columns that contain NAs.

The training data is then split 70/30 into a training set and a validation set; the hold-out portion lets us check the model fit before applying it to the actual test data.

set.seed(32768)

# remove columns that contain any NA values (MARGIN = 2 applies over columns)
NACols <- apply(train, 2, function(x) sum(is.na(x)))
noNAsTrain <- train[, which(NACols == 0)]

# remove bookkeeping columns: row index (X), user name, timestamps, window flag
unNecessaryColumns <- grep("timestamp|X|user_name|new_window", names(noNAsTrain))
cleanedTrain <- noNAsTrain[, -unNecessaryColumns]
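
Note that read.csv() only produces NAs for cells it recognizes as missing. In this dataset many of the derived-feature columns encode missing values as empty strings or "#DIV/0!", so if those columns should also be caught by the NA filter above, one option is to declare them as NA at read time. A sketch of that variant:

# treat empty strings and '#DIV/0!' as NA so the NA-column filter also drops them
train <- read.csv("./data/training.csv",
                  na.strings = c("NA", "", "#DIV/0!"))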

library(caret);
## Warning: package 'caret' was built under R version 3.2.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.1
# Removing near zero variance columns
# nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors)
colNZV <- nearZeroVar(cleanedTrain)
cleanedTrain <- cleanedTrain[, -colNZV]
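
To see which columns are flagged and why, nearZeroVar can also return its per-column diagnostics instead of just the indices; a short sketch, run here against the pre-filter frame noNAsTrain:

# saveMetrics = TRUE returns frequency ratio and percent-unique values per column
nzvMetrics <- nearZeroVar(noNAsTrain, saveMetrics = TRUE)
subset(nzvMetrics, nzv)   # the near-zero-variance columns the filter above removes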

# split the cleaned data 70/30 into training and validation sets
inTrain <- createDataPartition(y=cleanedTrain$classe, p=0.70, list=FALSE)
inTrain_training <- cleanedTrain[inTrain,]
inTrain_testing <- cleanedTrain[-inTrain,]
dim(inTrain_training)
## [1] 13737    54
dim(inTrain_testing)
## [1] 5885   54
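
As an aside, a single 70/30 hold-out is only one way to estimate out-of-sample error. caret can resample with k-fold cross-validation instead; a minimal sketch (ctrl and modCV are names introduced here for illustration; this was not run for this report and is considerably slower than calling randomForest directly):

# 5-fold cross-validation; method = "rf" wraps randomForest under the hood
ctrl <- trainControl(method = "cv", number = 5)
modCV <- train(classe ~ ., data = inTrain_training, method = "rf", trControl = ctrl)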

Model Fit

Because of the characteristic noise in the sensor data, I think a Random Forest approach is appropriate and should provide better accuracy. The algorithm grows an ensemble of trees, each built on a subset of features selected in a random and independent manner with the same distribution for every tree in the forest.

require(randomForest)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.2.1
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# note: randomForest's argument is 'ntree'; the misspelled 'ntrees' is silently
# ignored, so the original 'ntrees = 10' still grew the default 500 trees
modRF <- randomForest(classe ~ ., data = inTrain_training, importance = TRUE, ntree = 500)
modRF
## 
## Call:
##  randomForest(formula = classe ~ ., data = inTrain_training, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.28%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3906    0    0    0    0 0.000000000
## B    4 2651    3    0    0 0.002633559
## C    0    8 2388    0    0 0.003338898
## D    0    0   17 2234    1 0.007992895
## E    0    0    0    6 2519 0.002376238
# imps <- varImp(modRF)   # caret's varImp also works on randomForest objects
# order(imps)
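
Since the model was fit with importance = TRUE, randomForest's own importance plot can show which predictors drive the classification; a short sketch:

# plot the 15 most important variables (mean decrease in accuracy and in Gini)
varImpPlot(modRF, n.var = 15, main = "Random forest variable importance")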

Confusion Matrix on the Training Dataset

The confusion matrix on the training dataset should show very high accuracy, since the Random Forest model was fit on this same data; it is a check of the fit, not an estimate of out-of-sample performance.

ptraining <- predict(modRF, inTrain_training)
print(confusionMatrix(ptraining, inTrain_training$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Note the perfect accuracy in the above outcome; this is expected in-sample and says little on its own about performance on unseen data.

Cross Validation

Here we perform prediction on the inTrain_testing dataset, which serves as the cross-validation (hold-out) set. If the model fit above is appropriate, the confusion matrix should show high accuracy on this set as well.

# predictions on the hold-out set (renamed from 'ptraining' for clarity)
pvalid <- predict(modRF, inTrain_testing)
print(confusionMatrix(pvalid, inTrain_testing$classe))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    2    0    0    0
##          B    0 1135    3    0    0
##          C    0    2 1023    9    0
##          D    0    0    0  955    2
##          E    0    0    0    0 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9969          
##                  95% CI : (0.9952, 0.9982)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9961          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9965   0.9971   0.9907   0.9982
## Specificity            0.9995   0.9994   0.9977   0.9996   1.0000
## Pos Pred Value         0.9988   0.9974   0.9894   0.9979   1.0000
## Neg Pred Value         1.0000   0.9992   0.9994   0.9982   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1929   0.1738   0.1623   0.1835
## Detection Prevalence   0.2848   0.1934   0.1757   0.1626   0.1835
## Balanced Accuracy      0.9998   0.9979   0.9974   0.9951   0.9991

The cross-validation accuracy is 99.69%, so the estimated out-of-sample error is about 0.31%; the model fit is therefore highly appropriate.
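
The same figures can be read off the predictions directly, which makes the out-of-sample error estimate explicit; a short sketch using the hold-out predictions above:

# accuracy = fraction of correct predictions on the hold-out set
accuracy <- mean(pvalid == inTrain_testing$classe)
accuracy       # ~0.9969
1 - accuracy   # estimated out-of-sample error, ~0.0031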

Testing with the Test Set

We now download the actual test data and perform predictions on it.

# download the testing set only if it is not already cached locally
if(!file.exists("./data")){dir.create("./data")}
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists("./data/testing.csv")){
  # default method again; method = "curl" fails where curl is not installed
  download.file(fileUrl1, destfile = "./data/testing.csv")
}
test <- read.csv("./data/testing.csv")

dim(test)
## [1]  20 160
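
predict() for randomForest matches predictors by name, so the raw test frame can be passed directly; still, applying the same column filtering as for training keeps the inputs symmetric. A minimal sketch (predCols and cleanedTest are names introduced here for illustration):

# keep exactly the predictor columns that survived the training-set cleanup
predCols <- setdiff(names(cleanedTrain), "classe")
cleanedTest <- test[, predCols]
dim(cleanedTest)   # 20 rows, same 53 predictors as training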

Predicting using the model modRF

testPreds <- predict(modRF, test)
testPreds
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
table(testPreds)
## testPreds
## A B C D E 
## 7 8 1 1 3

Writing predictions to files using the given function for submission

# write each prediction to its own problem_id_<i>.txt file, as required for submission
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}

answers <- as.vector(testPreds)
pml_write_files(answers)

After submission, all 20 predictions turned out to be correct.