Human Activity Recognition, Machine Learning Project

Bhavana Shah

Background

Devices such as Jawbone Up, Nike FuelBand, and Fitbit now collect a large amount of data about personal activity. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict the way in which each lift was performed.

More information is available from the website: http://groupware.les.inf.puc-rio.br/har (section on the Weight Lifting Exercise Dataset).

Load Data

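#Training data set
#stringsAsFactors = TRUE keeps 'classe' a factor on R >= 4.0 (it was the default in R 3.x)
pmlTrainDS <- read.csv("./pml-training.csv", na.strings = c("", "NA", "NULL"),
                       stringsAsFactors = TRUE)
#Testing data set
pmlTestDS <- read.csv("./pml-testing.csv", na.strings = c("", "NA", "NULL"),
                      stringsAsFactors = TRUE)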

Loading all necessary packages for the project

library(caret)

## Warning: package 'caret' was built under R version 3.2.5

## Loading required package: lattice

## Loading required package: ggplot2

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.2.5

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

dim(pmlTrainDS)

## [1] 19622   160

Exploring the training dataset, we observe that there are many variables available to predict the dependent 'classe' variable, which has 5 levels [A,B,C,D,E]. In order to build an accurate prediction model, we first pre-process the data to identify and filter out unnecessary, empty, highly correlated, and near-zero variance variables.

Removing columns that contain missing values from the dataset

filtered_pmT <- pmlTrainDS[ , colSums(is.na(pmlTrainDS)) == 0] 
dim(filtered_pmT)

## [1] 19622    60
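
The filter above keeps only columns with no missing values at all, so a column with even a single NA is dropped along with the fully empty ones. A minimal sketch of the same test on a toy data frame (the name `toy` and its columns are illustrative and not part of the dataset):

```r
# Toy data frame (hypothetical): one complete column, one partly missing, one empty
toy <- data.frame(
  complete = c(1, 2, 3, 4),
  partial  = c(1, NA, 3, 4),
  empty    = c(NA, NA, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns with zero missing values
kept <- toy[, na_counts == 0, drop = FALSE]
names(kept)
```

Here `drop = FALSE` keeps the result a data frame even when only one column survives.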

Removing near-zero variance columns, using nearZeroVar() from the 'caret' package

nzv <- nearZeroVar(filtered_pmT)
filtered_pmT <- filtered_pmT[, -nzv]
dim(filtered_pmT)

## [1] 19622    59
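
nearZeroVar() judges a column by two metrics: the frequency ratio (count of the most common value over the count of the second most common) and the percentage of distinct values. The base R sketch below computes both for a toy vector. The helper `nzv_metrics` is hypothetical, and the thresholds 95/5 and 10 are assumed to match caret's defaults for `freqCut` and `uniqueCut`; this is an illustration of the idea, not caret's implementation.

```r
# Hypothetical helper: the two metrics that nearZeroVar() is documented to use
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one;
  # a single-valued column has no second value, so treat it as extreme
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy column (hypothetical): 98 "no" and 2 "yes"
flag <- c(rep("no", 98), rep("yes", 2))
m <- nzv_metrics(flag)

# Assumed default thresholds: freqCut = 95/5, uniqueCut = 10
is_nzv <- m[["freq_ratio"]] > 95 / 5 && m[["pct_unique"]] <= 10
is_nzv
```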

Removing highly correlated variables, using 0.80 as cutoff point

#create correlation matrix
cor_pt <- cor(filtered_pmT[ , sapply(filtered_pmT, is.numeric)])
dim(cor_pt)

## [1] 56 56

#Plotting the correlation matrix, using 'corrplot' package
corrplot(cor_pt, order = "alphabet", tl.cex=0.7, tl.col ="steelblue")

#Display the correlation summary, prior to removal
summary(cor_pt[upper.tri(cor_pt)])

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992000 -0.102000  0.001729  0.001405  0.084720  0.980900

#using findCorrelation() from 'caret' package, flag the predictors
highlyCorVars <- findCorrelation(cor_pt, cutoff = 0.80)
filtered_pmT <- filtered_pmT[, -highlyCorVars]
#Display correlation summary, after removing predictors with absolute correlations above 0.80.
postCorRem  <- cor(filtered_pmT[ , sapply(filtered_pmT, is.numeric)])
summary(postCorRem[upper.tri(postCorRem)])

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992000 -0.102200  0.002635  0.003229  0.088840  0.980900

dim(filtered_pmT)

## [1] 19622    46
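
Note that the summary after removal still shows extremes of about -0.99 and 0.98, the same as before. A likely cause is that findCorrelation() returns column positions within cor_pt, which was built from the numeric columns only, while filtered_pmT also holds non-numeric columns (such as user_name and classe), so the positions may not refer to the same columns. The sketch below, on a toy data frame, translates the positions to column names before dropping them; `toy`, `cm`, `flagged`, and `cleaned` are illustrative names, and this is a suggestion rather than the procedure used for the results in this report.

```r
library(caret)

set.seed(1)
a <- rnorm(100)
toy <- data.frame(
  id = factor(rep(c("u1", "u2"), 50)),  # non-numeric column placed first
  a  = a,
  b  = a + rnorm(100, sd = 0.01),       # almost a copy of a
  c  = rnorm(100)                       # unrelated noise
)

# Correlation matrix over the numeric columns only, as in the report
num_cols <- sapply(toy, is.numeric)
cm <- cor(toy[, num_cols])

# findCorrelation() returns positions in cm, not in toy,
# so translate them to column names before dropping
flagged <- colnames(cm)[findCorrelation(cm, cutoff = 0.80)]
cleaned <- toy[, !(names(toy) %in% flagged), drop = FALSE]
names(cleaned)
```

Dropping by name leaves the non-numeric `id` column in place and removes exactly one of the two near-duplicate columns.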

Removing the first five columns [“X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “num_window”], which are not useful for prediction

filtered_pmT <- filtered_pmT[, -c(1:5)]
dim(filtered_pmT)

## [1] 19622    41
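
Dropping by position depends on the column order left behind by the earlier filters. A small sketch of the same step done by name, on a toy data frame (the toy values and the `sensor_1` column are made up for illustration):

```r
# Toy data frame (hypothetical values) with the bookkeeping columns named in the text
toy <- data.frame(
  X = 1:3,
  user_name = c("p1", "p2", "p3"),
  raw_timestamp_part_1 = c(10L, 20L, 30L),
  raw_timestamp_part_2 = c(1L, 2L, 3L),
  num_window = c(11L, 12L, 13L),
  sensor_1 = c(0.1, 0.2, 0.3),
  classe = c("A", "B", "A")
)

# Columns that identify rows, users, or time windows rather than movement
id_cols <- c("X", "user_name", "raw_timestamp_part_1",
             "raw_timestamp_part_2", "num_window")

# Keep every column whose name is not in id_cols
toy <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
names(toy)
```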

Using the methods above, the dataset has been reduced from 160 columns to 41: 40 predictors plus the outcome 'classe'.

Splitting data into training and validation sets

set.seed(999)
trainIndex  <- createDataPartition(filtered_pmT$classe, p = 0.70, list = FALSE)
training  <- filtered_pmT[trainIndex,] #70%
dim(training)

## [1] 13737    41

validSet  <- filtered_pmT[-trainIndex,] #30%
dim(validSet)

## [1] 5885   41
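
createDataPartition() samples within each level of the outcome so that the class proportions are similar in both parts. A rough base R sketch of that idea on toy labels follows; it is not caret's exact algorithm, and the toy class sizes are made up.

```r
set.seed(999)

# Toy outcome (hypothetical): unbalanced classes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(classe), classe)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}))

train_y <- classe[train_idx]
valid_y <- classe[-train_idx]

# Class counts in each part keep roughly the original 50/30/20 proportions
table(train_y)
table(valid_y)
```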

Creating Model using Random Forest method

There are numerous machine learning algorithms for building prediction models. For our classification problem, we choose the Random Forest method. Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees (ref Wikipedia). The algorithm is known for its accuracy and handles large datasets with many variables efficiently. It also provides estimates of which variables are important in the classification.
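
The "mode of the classes" rule can be illustrated with a few lines of base R. The votes below are hypothetical, and `majority_vote` is an illustrative helper, not a function from the randomForest package.

```r
# Hypothetical class predictions from 7 trees for 3 observations (rows = observations)
tree_votes <- matrix(
  c("A", "A", "B", "A", "C", "A", "A",
    "B", "C", "C", "C", "B", "C", "D",
    "E", "E", "D", "E", "E", "D", "E"),
  nrow = 3, byrow = TRUE
)

# Forest prediction = most frequent class (mode) among the trees' votes
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

forest_pred <- apply(tree_votes, 1, majority_vote)
forest_pred
```

In this sketch a tie goes to the alphabetically first class; a real implementation may break ties differently.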

First we will build the prediction model using only the training set. Then we explore importance and accuracy results.

set.seed(999)

Fitting the model using the randomForest algorithm

rfModel <- randomForest(classe ~ ., type= "classification", data = training, ntree = 200, 
                        importance = TRUE)
rfModel

## 
## Call:
##  randomForest(formula = classe ~ ., data = training, type = "classification",      ntree = 200, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.58%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B   14 2638    6    0    0 0.0075244545
## C    0   14 2381    1    0 0.0062604341
## D    0    0   30 2219    3 0.0146536412
## E    0    2    3    6 2514 0.0043564356
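
The OOB error rate and the class.error column can be recomputed from the confusion matrix itself. The sketch below copies the counts from the output above into a plain matrix (`oob_cm` is an illustrative name) and repeats the arithmetic.

```r
# OOB confusion matrix copied from the output above (rows = actual, columns = predicted)
oob_cm <- matrix(
  c(3905,    1,    0,    0,    0,
      14, 2638,    6,    0,    0,
       0,   14, 2381,    1,    0,
       0,    0,   30, 2219,    3,
       0,    2,    3,    6, 2514),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all off-diagonal counts over all observations (80 of 13737)
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

round(class_error, 4)
round(100 * oob_error, 2)  # 0.58 (percent)
```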

Plotting the error rates of the randomForest object, we observe that the error rates (misclassification) decrease as the number of trees increases. The black line is the out-of-bag estimate and the other colors denote the error of each class.

layout(matrix(c(1, 2), nrow = 1), widths = c(4, 1))
par(mar = c(5, 4, 4, 0)); plot(rfModel, main = "Error rates per class and OOB")
par(mar = c(5, 0, 4, 2)); plot(c(0, 1), type = "n", axes = FALSE, xlab = "", ylab = "")
legend("top", colnames(rfModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

Variable Importance

The plot below shows which predictors have the highest importance (sorted in decreasing order of importance).

varImpPlot(rfModel, main = "Variable Importance Plot", cex = 0.6, col ="steelblue")

Partial plots

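Partial dependence plots give a graphical depiction of the marginal effect of an individual variable on the class probability. Since no which.class argument is given below, partialPlot() uses its default, the first class (A).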

Displaying plots for top 10 variables

imp <- importance(rfModel)
impvar <- rownames(imp)[order(imp[, "MeanDecreaseAccuracy"], decreasing=TRUE)]
impvarTop10 <- impvar[1:10]
par(mfrow = c(2, 5), mar = c(1,1,1,1))
for (i in seq_along(impvarTop10)) {
        par(mar = c(4,2,2,2))
        partialPlot(rfModel, training, impvarTop10[i], xlab = impvarTop10[i], main = "")
}

Out of Sample Accuracy

Our Random Forest model had an out-of-bag (OOB) error estimate of 0.58% on the training data. We can estimate its out-of-sample accuracy using the validation set.

pred <- predict(rfModel, validSet)
print(confusionMatrix(pred, validSet$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    6    0    0    0
##          B    1 1129    6    0    0
##          C    0    4 1020   13    1
##          D    1    0    0  950    1
##          E    0    0    0    1 1080
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9942         
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9912   0.9942   0.9855   0.9982
## Specificity            0.9986   0.9985   0.9963   0.9996   0.9998
## Pos Pred Value         0.9964   0.9938   0.9827   0.9979   0.9991
## Neg Pred Value         0.9995   0.9979   0.9988   0.9972   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1918   0.1733   0.1614   0.1835
## Detection Prevalence   0.2851   0.1930   0.1764   0.1618   0.1837
## Balanced Accuracy      0.9987   0.9949   0.9952   0.9925   0.9990

The model achieves an accuracy of 99.4% on the validation data (95% CI: 99.19% to 99.60%), which corresponds to an estimated out-of-sample error of about 0.6%.
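
The headline statistics can be checked directly against the validation confusion matrix. The sketch below copies the counts from the output above (`val_cm` is an illustrative name) and recomputes accuracy, out-of-sample error, and two of the per-class statistics.

```r
# Validation confusion matrix copied from the output above
# (rows = predicted, columns = reference, as printed by confusionMatrix())
val_cm <- matrix(
  c(1672,    6,    0,    0,    0,
       1, 1129,    6,    0,    0,
       0,    4, 1020,   13,    1,
       1,    0,    0,  950,    1,
       0,    0,    0,    1, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(val_cm)) / sum(val_cm)

# Sensitivity: correct predictions over the number of reference cases per class
sensitivity <- diag(val_cm) / colSums(val_cm)

# Positive predictive value: correct predictions over all predictions per class
pos_pred_value <- diag(val_cm) / rowSums(val_cm)

round(accuracy, 4)      # 0.9942
round(1 - accuracy, 4)  # 0.0058, the estimated out-of-sample error
round(sensitivity, 4)
round(pos_pred_value, 4)
```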

Margin of predictions

The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes among the other classes. Thus, under majority voting, a positive margin means a correct classification.

plot(margin(rfModel, validSet$classe), cex = 0.7, main = "Margin of Predictions")

From the plot, we observe positive margins, which indicate correct classification.
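
The definition of the margin can be written out in a few lines of base R. The vote proportions below are hypothetical and `margin_of` is an illustrative helper, not the randomForest function. Note also that margin() applied to a randomForest object appears to use the votes stored in the fitted model (the OOB votes on the training data); votes for the validation set would have to be obtained separately, for example with predict(rfModel, validSet, type = "vote"), which is an assumption not verified in this report.

```r
# Hypothetical normalized vote proportions for 3 observations (each row sums to 1)
votes <- matrix(
  c(0.90, 0.05, 0.05,
    0.30, 0.60, 0.10,
    0.45, 0.50, 0.05),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)
observed <- c("A", "B", "A")  # true classes (hypothetical)

# Margin = votes for the true class minus the largest vote share among the other classes
margin_of <- function(votes, observed) {
  sapply(seq_len(nrow(votes)), function(i) {
    true_col <- match(observed[i], colnames(votes))
    votes[i, true_col] - max(votes[i, -true_col])
  })
}

m <- margin_of(votes, observed)
m  # 0.85, 0.30, -0.05
```

The third observation has a negative margin because class B received more votes than its true class A, so it would be misclassified.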

Conclusion

Finally, we apply the model to the 20 cases in the original test data.

result <- predict(rfModel, newdata = pmlTestDS)
result

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E