Overview

Human Activity Recognition (HAR) has emerged as a key research area in the last few years and is gaining increasing attention from the pervasive computing research community. It is especially useful for elderly monitoring, for life-log systems that track energy expenditure and support weight-loss programs, and for digital assistants for weight-lifting exercises. Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. [1]

Question Statement

Can we predict the manner in which an individual performed a particular exercise? This is the classe variable in the data set (A, B, C, D, E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Data Collection

The training and testing data sets are available, respectively, at:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

If you are interested in a better understanding of the data sets see: http://groupware.les.inf.puc-rio.br/har (ref. [1])

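library(caret)         # createDataPartition, nearZeroVar, findCorrelation, train, confusionMatrix
library(randomForest)  # randomForest
library(ggplot2)       # qplot

setwd("F:/Albert/Coursera/Practical Machine Learning/Project")


tnUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
ttUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"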

#train <- read.csv("./pml-training.csv", na.strings=c("NA",""))
#test <- read.csv("./pml-testing.csv", na.strings=c("NA",""))

train <- read.csv(url(tnUrl), na.strings=c("NA",""))
test <- read.csv(url(ttUrl), na.strings=c("NA",""))
str(train$classe)

##  Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Data Features

Cross Validation

The training data is split into a training set (train1, 70%) and a testing set (test1, 30%). These two sets will be used for cross validation later, when we build our models and predictions.

inTrain<-createDataPartition(y=train$classe,p=.70,list=FALSE)
train1<-train[inTrain,]
test1<-train[-inTrain,]
dim(test1);dim(train1);dim(test)

## [1] 5885  160

## [1] 13737   160

## [1]  20 160

Cleaning

We clean the data by removing the unwanted and unneeded columns. Initially, we are given data sets with 160 columns to analyze. After the cleaning effort there are only 46 columns remaining.

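# Remove identifier/time columns, near-zero-variance columns and columns containing NAs


train1<-train1[-c(1:7)] # remove names - time related etc.
t<-nearZeroVar(train1)  # indices of near-zero-variance columns
train1<-train1[-c(t)]
tt<-which(colSums(is.na(train1)) > 0)  # indices of columns with any missing value
train1<-train1[-c(tt)]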


test1<-test1[-c(1:7)] # remove names - time related etc.
test1<-test1[-c(t)]
test1<-test1[-c(tt)]

Test<-test[-c(1:7)] # remove names - time related etc.
Test<-Test[-c(t)]
Test<-Test[-c(tt)]




#Remove items with high correlation
t <- which(names(train1) == "classe")
## High Correlation Columns
hCC <- findCorrelation(abs(cor(train1[-c(t)])),0.90)
## High Correlation Features
hCF <- names(train1)[hCC]
train1 <- train1[-c(hCC)]
t <- which(names(train1) == "classe")
#Remove the same items from both test sets
test1 <- test1[-c(hCC)]
Test <- Test[-c(hCC)]


dim(train1);dim(test1);dim(Test)

## [1] 13737    46

## [1] 5885   46

## [1] 20 46
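
The cleaning steps above can be illustrated on a small made-up data frame using base R only. This is a rough sketch of the idea, not the algorithm caret's nearZeroVar or findCorrelation actually uses: the column names (a, b, noise, gap, const) and the greedy "drop the later column of a highly correlated pair" rule are assumptions made for illustration.

```r
# Toy data: one near-duplicate column, one column with an NA, one constant column.
set.seed(7)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a     = a,
  b     = a + rnorm(n, sd = 0.05),  # almost a copy of a
  noise = rnorm(n),                 # unrelated to a
  gap   = c(NA, rnorm(n - 1)),      # contains a missing value
  const = rep(1, n)                 # zero variance
)

# 1. Drop columns that contain any NA.
keep <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# 2. Drop columns with a single distinct value (a crude stand-in for nearZeroVar).
keep <- keep[, sapply(keep, function(col) length(unique(col)) > 1), drop = FALSE]

# 3. For each pair with |correlation| > 0.90, drop the later column of the pair.
cm <- abs(cor(keep))
cm[upper.tri(cm, diag = TRUE)] <- 0
too_correlated <- names(which(apply(cm, 1, max) > 0.90))
keep <- keep[, setdiff(names(keep), too_correlated), drop = FALSE]

print(names(keep))
```

As in the report, whatever columns are dropped from the training data must also be dropped from every test set so that the models see the same features.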

Classification Tree

The classification tree, shown in figure 1 and printed below, is built with rpart, which by default chooses its splits using the Gini index: each split separates the data so that the resulting groups are as pure as possible with respect to the class. The model is non-linear and captures interactions between the variables. Starting from the root (the top node), rpart splits the data into branches and terminal nodes (leaves) according to whether each observation's value for a feature falls below or above a selected threshold. Below the root, the first branch splits on pitch_forearm and ends with an asterisk (*). This corresponds to the left-hand side of the branch and is a terminal node. The second reference to pitch_forearm, on the next line, is depicted as the right side of the branch in figure 1.

As you can see by examining figure 1, the classification tree is six levels deep, counting the root. In the printed model below, the deepest split is on accel_forearm_x (nodes 50 and 51).

modRp <- train(classe ~ ., method = "rpart", data=train1)
print(modRp$finalModel)

## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##    2) pitch_forearm< -33.95 1071    5 A (1 0.0047 0 0 0) *
##    3) pitch_forearm>=-33.95 12666 9826 A (0.22 0.21 0.19 0.18 0.2)  
##      6) magnet_belt_y>=555.5 11665 8828 A (0.24 0.23 0.21 0.18 0.15)  
##       12) magnet_dumbbell_y< 439.5 9740 6966 A (0.28 0.18 0.24 0.17 0.13)  
##         24) roll_forearm< 122.5 6116 3630 A (0.41 0.18 0.18 0.14 0.093) *
##         25) roll_forearm>=122.5 3624 2411 C (0.079 0.18 0.33 0.22 0.19)  
##           50) accel_forearm_x>=-108.5 2552 1583 C (0.089 0.22 0.38 0.086 0.22) *
##           51) accel_forearm_x< -108.5 1072  497 D (0.057 0.087 0.23 0.54 0.092) *
##       13) magnet_dumbbell_y>=439.5 1925 1016 B (0.033 0.47 0.037 0.21 0.25) *
##      7) magnet_belt_y< 555.5 1001  190 E (0.003 0.003 0.001 0.18 0.81) *
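
To make the notion of node purity concrete, the Gini index can be computed by hand from the class proportions (yprob) printed for each node. The sketch below uses the rounded proportions of the root and of terminal node 7 from the printout above; the helper name gini is an assumption for illustration, and because the printed proportions are rounded the results are approximate.

```r
# Gini index of a node: 1 minus the sum of squared class proportions.
# 0 means a perfectly pure node; larger values mean a more mixed node.
gini <- function(p) 1 - sum(p^2)

# Rounded class proportions copied from the printed tree.
root  <- c(A = 0.28,  B = 0.19,  C = 0.17,  D = 0.16, E = 0.18)
node7 <- c(A = 0.003, B = 0.003, C = 0.001, D = 0.18, E = 0.81)

gini_root  <- gini(root)   # about 0.80: the root is very mixed
gini_node7 <- gini(node7)  # about 0.31: node 7 is mostly class E

# The "loss" column counts observations that do not belong to the predicted
# class, so loss / n is the misclassification share within the node.
loss_rate_node7 <- 190 / 1001

print(round(c(root = gini_root, node7 = gini_node7, loss7 = loss_rate_node7), 3))
```

The drop in the index from the root to node 7 is what a good split looks like: the split on magnet_belt_y isolates a group that is about 81% class E.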

Importance

The order of importance for the first 12 items is given below. Note that the top two items are provided in a scatter plot below in close detail. See figure 2 in the Appendix for a less detailed feature plot of the four items of highest importance.

#Feature Importance
fsRF = randomForest(train1[-c(t)], train1$classe, importance = TRUE)
rfImp = data.frame(fsRF$importance)
impF = order(-rfImp$MeanDecreaseGini)
names(train1[impF[1:12]])

##  [1] "yaw_belt"          "pitch_belt"        "magnet_dumbbell_z"
##  [4] "pitch_forearm"     "magnet_dumbbell_y" "roll_forearm"     
##  [7] "magnet_belt_y"     "magnet_dumbbell_x" "magnet_belt_z"    
## [10] "gyros_belt_z"      "roll_dumbbell"     "accel_dumbbell_y"

qplot(yaw_belt,pitch_belt,color=classe,data=train1)

Analysis of the Data (Algorithms)

The best Machine Learning methods should be interpretable, simple, accurate, fast and scalable.

Model Selection and Parameters

Bagging (bootstrap aggregating) resamples the data, fits a model to each resample, and combines the results by averaging or by majority vote. Bagging is most useful for non-linear functions; it reduces variance while keeping the bias similar.
Random forests use bootstrapping to grow multiple trees and then decide by majority vote. Their main advantage is accuracy. Their drawbacks are that they can be slow, are difficult to interpret, and can overfit.
Boosting takes many weak predictors, weights them, and adds them up to produce a stronger predictor. The gbm method uses boosting with trees.
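
The bagging idea described above (resample, fit, then vote) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the built-in iris data rather than the HAR data, and the weak learner is a hypothetical nearest-centroid classifier (fit_centroids, predict_centroids) instead of the trees a random forest would grow.

```r
# Bagging sketch: bootstrap resamples + majority vote, base R only.
set.seed(1)
data(iris)

# Hypothetical weak learner: store the mean of each feature per class.
fit_centroids <- function(df) aggregate(. ~ Species, data = df, FUN = mean)

# Predict the class whose centroid is closest (squared Euclidean distance).
predict_centroids <- function(cent, newdata) {
  x <- as.matrix(newdata[, 1:4])
  m <- as.matrix(cent[, -1])
  d <- sapply(seq_len(nrow(m)), function(k) rowSums(sweep(x, 2, m[k, ])^2))
  as.character(cent$Species)[max.col(-d, ties.method = "first")]
}

B <- 25  # number of bootstrap resamples
votes <- sapply(seq_len(B), function(b) {
  idx <- sample(nrow(iris), replace = TRUE)            # bootstrap resample
  predict_centroids(fit_centroids(iris[idx, ]), iris)  # one model, one vote per row
})

# Majority vote across the B models for every observation.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
acc <- mean(bagged == iris$Species)
print(acc)
```

The accuracy printed here is measured on the same rows used for fitting, so it is an in-sample figure and should not be read as an estimate of generalization.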

KNN Model

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression.[3] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

#Training
set.seed(49111)
ctrlKNN = trainControl(method = "adaptive_cv")
modelKNN = train(classe ~ ., train1, method = "knn", trControl = ctrlKNN)
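
The majority-vote rule described above is simple enough to write out directly. The following base R sketch is not what caret runs internally; the function knn_predict and the six toy points are assumptions made purely for illustration.

```r
# Minimal k-NN classifier: find the k closest training rows, return the most common label.
knn_predict <- function(train_x, train_y, query, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # Euclidean distance to each training row
  nearest <- order(d)[seq_len(k)]                 # indices of the k nearest neighbors
  names(which.max(table(train_y[nearest])))       # majority vote among their labels
}

# Two well-separated toy clusters.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(9, 8), c(8, 9))
train_y <- c("A", "A", "A", "B", "B", "B")

near_first  <- knn_predict(train_x, train_y, c(2, 2), k = 3)
near_second <- knn_predict(train_x, train_y, c(8.5, 8.5), k = 3)
print(c(near_first, near_second))
```

With k = 1 the same function simply returns the label of the single nearest neighbor, as noted in the text.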

GBM Model

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

set.seed(49111)
ctrlGBM <- trainControl(method = "repeatedcv",number = 5,repeats = 1)
modelGBM <- train(classe ~ ., data=train1, method = "gbm", trControl = ctrlGBM, verbose = FALSE)
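
The stage-wise idea behind gradient boosting can be shown with a small base R sketch. It is a simplified illustration, not the gbm package's algorithm: it assumes a regression problem with squared-error loss (so the negative gradient is just the residual), uses one-split stumps as weak learners, and the helper names fit_stump and predict_stump as well as the learning rate nu are hypothetical.

```r
# Weak learner: a one-split regression stump fitted to the current residuals.
fit_stump <- function(x, r) {
  pts  <- sort(unique(x))
  cuts <- (head(pts, -1) + tail(pts, -1)) / 2  # candidate thresholds between points
  sse <- sapply(cuts, function(cu) {
    left <- r[x < cu]; right <- r[x >= cu]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cu <- cuts[which.min(sse)]
  list(cut = cu, left = mean(r[x < cu]), right = mean(r[x >= cu]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x)

pred <- rep(mean(y), length(y))  # stage 0: predict the mean
nu   <- 0.1                      # learning rate (shrinkage)
mse  <- numeric(50)
for (m in seq_along(mse)) {
  s    <- fit_stump(x, y - pred)            # fit the weak learner to the residuals
  pred <- pred + nu * predict_stump(s, x)   # add a shrunken copy to the ensemble
  mse[m] <- mean((y - pred)^2)
}
print(round(mse[c(1, 10, 50)], 4))
```

Each stage corrects a little of what the previous stages got wrong, which is why the training error keeps falling as stumps are added.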

RF Model

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.[2]

set.seed(49111)
ctrlRF = trainControl(method = "oob")
modelRF = train(classe ~ ., train1, method = "rf", ntree = 200, trControl = ctrlRF)


# Frame the results
resultsKNN = data.frame(modelKNN$results)
resultsRF = data.frame(modelRF$results)
resultsGBM = data.frame(modelGBM$results)

Predictions

In-sample error is the error rate measured on the training data (aka resubstitution error). As you increase the number of predictors, the in-sample error decreases.
Out-of-sample error is the error rate measured on the testing data (aka generalization error). As you increase the number of predictors, the out-of-sample error initially goes down but later starts to increase because of overfitting.

fitKNN = predict(modelKNN, test1)
fitRF = predict(modelRF, test1)
fitGBM = predict(modelGBM,test1)
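
The difference between in-sample and out-of-sample error can be seen on a small simulated example. This is a sketch under assumptions, not part of the HAR analysis: the data are generated from a sine curve with noise, and polynomial degree stands in for "number of predictors".

```r
# Simulated data: a smooth signal plus noise, split into two halves.
set.seed(42)
n <- 60
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)
train_df <- dat[1:30, ]
test_df  <- dat[31:60, ]

# Fit polynomials of increasing degree and record both error types.
degrees <- 1:8
errs <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train_df)
  c(in_sample     = mean((train_df$y - predict(fit, train_df))^2),
    out_of_sample = mean((test_df$y  - predict(fit, test_df))^2))
}))
rownames(errs) <- paste0("degree_", degrees)
print(round(errs, 3))
```

The in_sample column can only go down as terms are added, because each model contains the previous one. The out_of_sample column need not follow it, which is why the models in this report are judged on test1 rather than on train1.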

Evaluation

For each model, we report the mean of the resampled accuracy over the tuning settings that train tried, along with the table and the overall statistics taken from the confusion matrix. The table compares the predictions made on the held-out data (test1) against the true classe values, which serve as the reference.

KNN Analysis

See Figure 3 in the Appendix for the complete confusion matrix.

cmknn <- confusionMatrix(fitKNN, test1$classe)
mean(resultsKNN$Accuracy);cmknn$table;cmknn$overall

## [1] 0.8852065

##           Reference
## Prediction    A    B    C    D    E
##          A 1613   59   11   20   20
##          B   13 1009   32    6   21
##          C   20   40  954   66   22
##          D   21   23   21  850   37
##          E    7    8    8   22  982

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.189465e-01   8.973924e-01   9.116791e-01   9.257953e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   1.019025e-14
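
As a check on how these statistics are derived, the overall accuracy and the per-class sensitivity can be recomputed by hand from the KNN table above. The sketch below uses base R only and copies the counts from the printed table; the variable names are assumptions for illustration.

```r
# KNN confusion table: rows are predictions, columns are the reference classes.
cm <- matrix(c(1613,   59,   11,   20,   20,
                 13, 1009,   32,    6,   21,
                 20,   40,  954,   66,   22,
                 21,   23,   21,  850,   37,
                  7,    8,    8,   22,  982),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5]))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the number of true cases of that class.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The recomputed accuracy agrees with the Accuracy entry reported by confusionMatrix, and the sensitivities agree with the per-class values listed in Figure 3.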

GBM Analysis

See Figure 4 in the Appendix for the complete confusion matrix.

cmgbm <- confusionMatrix(fitGBM, test1$classe)
mean(resultsGBM$Accuracy);cmgbm$table;cmgbm$overall

## [1] 0.8740545

##           Reference
## Prediction    A    B    C    D    E
##          A 1668   19    0    0    0
##          B    6 1101   14    3    0
##          C    0   16 1008   11    3
##          D    0    2    4  946    8
##          E    0    1    0    4 1071

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9845370      0.9804366      0.9810483      0.9875323      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

RF Analysis

See Figure 5 in the Appendix for the complete confusion matrix.

cmrf <- confusionMatrix(fitRF, test1$classe)
mean(resultsRF$Accuracy);cmrf$table;cmrf$overall

## [1] 0.9899541

##           Reference
## Prediction    A    B    C    D    E
##          A 1674    6    0    0    0
##          B    0 1131    5    0    0
##          C    0    2 1021    3    0
##          D    0    0    0  960    2
##          E    0    0    0    1 1080

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9967715      0.9959158      0.9949628      0.9980551      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

Conclusion

Examining the mean resampled accuracy for each model (averaged over the tuning settings tried), we find that KNN has 89%, GBM has 87% and RF has 99%. On the held-out set test1, the confusion matrices give accuracies of 91.9% for KNN, 98.5% for GBM and 99.7% for RF, so KNN is the worst, GBM is in the middle and RF is on top. Because the RF model has the best score on both measures, it will be used for future predictions.

Test Data Prediction

TP <- predict(modelRF, Test, type = "raw")

TPResults <- data.frame(problem_id=Test$problem_id,predicted=TP)
print(TPResults)

##    problem_id predicted
## 1           1         B
## 2           2         A
## 3           3         B
## 4           4         A
## 5           5         A
## 6           6         E
## 7           7         D
## 8           8         B
## 9           9         A
## 10         10         A
## 11         11         B
## 12         12         C
## 13         13         B
## 14         14         A
## 15         15         E
## 16         16         E
## 17         17         A
## 18         18         B
## 19         19         B
## 20         20         B

References

[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

[2] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5.

[3] Altman, N. S. (1992). “An introduction to kernel and nearest-neighbor nonparametric regression”. The American Statistician. 46 (3): 175-185. doi:10.1080/00031305.1992.10475879

APPENDIX - Reference Graphs

FIGURE 1: Tree Plot of data items

This is a tree plot of the data features. Sorry that the legibility is so poor.

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

FIGURE 2: Important Data Set Features

This is a list of the twelve most important items in the data set, in decreasing order. Their importance is based on their relationship to the five different classes.

FIGURE 3: Confusion Matrix for KNN Model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1613   59   11   20   20
##          B   13 1009   32    6   21
##          C   20   40  954   66   22
##          D   21   23   21  850   37
##          E    7    8    8   22  982
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9189          
##                  95% CI : (0.9117, 0.9258)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8974          
##  Mcnemar's Test P-Value : 1.019e-14       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9636   0.8859   0.9298   0.8817   0.9076
## Specificity            0.9739   0.9848   0.9695   0.9793   0.9906
## Pos Pred Value         0.9362   0.9334   0.8657   0.8929   0.9562
## Neg Pred Value         0.9853   0.9729   0.9849   0.9769   0.9794
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2741   0.1715   0.1621   0.1444   0.1669
## Detection Prevalence   0.2928   0.1837   0.1873   0.1618   0.1745
## Balanced Accuracy      0.9687   0.9353   0.9497   0.9305   0.9491

FIGURE 4: Confusion Matrix for GBM Model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668   19    0    0    0
##          B    6 1101   14    3    0
##          C    0   16 1008   11    3
##          D    0    2    4  946    8
##          E    0    1    0    4 1071
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9845         
##                  95% CI : (0.981, 0.9875)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9804         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9666   0.9825   0.9813   0.9898
## Specificity            0.9955   0.9952   0.9938   0.9972   0.9990
## Pos Pred Value         0.9887   0.9795   0.9711   0.9854   0.9954
## Neg Pred Value         0.9986   0.9920   0.9963   0.9963   0.9977
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2834   0.1871   0.1713   0.1607   0.1820
## Detection Prevalence   0.2867   0.1910   0.1764   0.1631   0.1828
## Balanced Accuracy      0.9960   0.9809   0.9881   0.9892   0.9944

FIGURE 5: Confusion Matrix for RF Model

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    6    0    0    0
##          B    0 1131    5    0    0
##          C    0    2 1021    3    0
##          D    0    0    0  960    2
##          E    0    0    0    1 1080
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9968         
##                  95% CI : (0.995, 0.9981)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9959         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9930   0.9951   0.9959   0.9982
## Specificity            0.9986   0.9989   0.9990   0.9996   0.9998
## Pos Pred Value         0.9964   0.9956   0.9951   0.9979   0.9991
## Neg Pred Value         1.0000   0.9983   0.9990   0.9992   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1922   0.1735   0.1631   0.1835
## Detection Prevalence   0.2855   0.1930   0.1743   0.1635   0.1837
## Balanced Accuracy      0.9993   0.9960   0.9970   0.9977   0.9990

```

How Well They Do It - A Machine Learning Project

Albert C Grover

August 6, 2016