PROJECT: “Human Activity Recognition” Class Prediction (Classification)

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Goal of the Project:

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

**Data:**

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

**[1] Load the Data:**

setwd("C:\\") # start with setting the working directory in C:
if(!file.exists("./c8_project")){dir.create("./c8_project")}
setwd("C:\\c8_project") # SETTING THE WORKING DIRECTORY 

url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(url, destfile = "./training.csv")

url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url, destfile = "./testing.csv")

# Read the downloaded files into data frames.
train.set = read.csv("training.csv")
test.set = read.csv("testing.csv")

**[2] Explore the Data:**

dim(train.set)
## [1] 19622   160
dim(test.set)
## [1]  20 160
table(colSums(is.na(train.set))) # per-column NA counts: 93 columns have no NAs; 67 columns have 19,216 NAs
## 
##     0 19216 
##    93    67
table(colSums(is.na(test.set))) # in the test set: 60 columns have no NAs; 100 columns are entirely NA (20 rows)
## 
##   0  20 
##  60 100
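
These tables are terse: in the training set, 93 columns are complete while 67 columns are NA in 19,216 of the 19,622 rows; in the test set, 100 of the 160 columns are entirely NA. To see which columns are affected, by name, a quick sketch (not in the original script; output omitted):

# names of the training-set columns that contain NAs (67 of them, per the table above):
na.cols = names(train.set)[colSums(is.na(train.set)) > 0]
head(na.cols)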

**[3] Pre-processing of the Data:**

Removing NA-dominated columns and unimportant predictors:

n = nrow(train.set)
p = n * 0.8 # threshold: 80% of the rows

# keep only the columns that are less than 80% NA:
train.nona = train.set[, which(colSums(is.na(train.set)) < p)]
table(colSums(is.na(train.nona)))
## 
##  0 
## 93
# apply the same column selection (derived from the training set) to the test set:
test.nona = test.set[, which(colSums(is.na(train.set)) < p)]
table(colSums(is.na(test.nona)))
## 
##  0 20 
## 60 33
# drop columns that are entirely NA in the 20-row test set:
test.nona.2 = test.nona[, which(colSums(is.na(test.nona)) != 20)]
table(colSums(is.na(test.nona.2)))
## 
##  0 
## 60
# drop those same columns from the training set:
train.nona.2 = train.nona[, which(colSums(is.na(test.nona)) != 20)]
table(colSums(is.na(train.nona.2)))
## 
##  0 
## 60
# remove the first six identifier/timestamp columns, which carry no signal for classifying "classe":
train.nona.2 = subset(train.nona.2, select=-c(1:6))
test.nona.2 = subset(test.nona.2, select=-c(1:6))

# which column in train.set is not in test.set:
table(names(train.nona.2) == names(test.nona.2))
## 
## FALSE  TRUE 
##     1    53
names(train.nona.2)[which(names(train.nona.2) != names(test.nona.2))]
## [1] "classe"
# "classe" column is in train.set but not in test.set. this makes sense as this is the column we will be trying to predict for test.set.

# which column in test.set is not in train.set:
table(names(test.nona.2) == names(train.nona.2))
## 
## FALSE  TRUE 
##     1    53
names(test.nona.2)[which(names(test.nona.2) != names(train.nona.2))]
## [1] "problem_id"
# "problem id" is the column in test.set not in train.set. we should remove this column.
test = subset(test.nona.2, select=-c(problem_id))

#write.csv(train.nona.2, "train.nona.2.csv")
#write.csv(test.nona.2, "test.nona.2.csv")
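
As an optional extra check (not part of the original pipeline), caret's nearZeroVar() can confirm that none of the 53 remaining predictors is near-constant; a minimal sketch:

# flag near-zero-variance predictors; an empty result means none are flagged
nzv = nearZeroVar(subset(train.nona.2, select = -classe), saveMetrics = TRUE)
nzv[nzv$nzv, ]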

**[4] Creating Training and Validation Sets:**

The idea here is to use 70% of the provided training set to fit the models, with 10-fold cross-validation on that portion to estimate the test error (or accuracy) of the class prediction.

The remaining 30% is held out as a validation set, used to check the validation error (or accuracy) of each trained model.

With 10-fold CV we estimate the test error; with the validation set we see how the selected model performs on a data set it has never seen.

set.seed(1)

index = createDataPartition(y=train.nona.2$classe, p=0.7, list=FALSE)

train = train.nona.2[index,]
validation = train.nona.2[-index,]

dim(train)
## [1] 13737    54
dim(validation)
## [1] 5885   54
#write.csv(train, "train.csv")

**[5] Modeling for Classification & Model Evaluation:**

We will create multiple classification models and evaluate each one based on its 10-fold CV test accuracy and its validation-set accuracy.

We will then perform model selection based on the best estimated test accuracy and validation-set accuracy. (A small helper that factors out the repeated setup is sketched after the model list below.)

The Models we will consider are:

  • Decision Tree
  • LDA (Linear Discriminant Analysis)
  • SVM (Support Vector Machines)
  • Bagging
  • Boosting
  • Random Forests
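
As noted above, the same 10-fold control and the same validation-accuracy computation recur for every model. Both could be defined once; a small sketch (the object and function names here are mine, not from the original script):

# one trainControl object that could be reused by every caret model below
cv10 = trainControl(method = "cv", number = 10)

# validation-set accuracy for any fitted model (name is illustrative)
validation.accuracy = function(model, data = validation) {
  mean(predict(model, data) == data$classe)
}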

[A] Model-1: Decision Tree

set.seed(1)

# MODELING:
tree.model = train(classe ~ ., method="rpart", data=train,
                   trControl = trainControl(method = "cv", number=10))
## Loading required package: rpart
tree.model
## CART 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.03931441  0.5299648  0.39652742
##   0.05947174  0.4173415  0.21089859
##   0.11789238  0.3152014  0.04728215
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03931441.
# PREDICTION:
tree.pred = predict(tree.model, validation)
table(tree.pred, validation$classe)
##          
## tree.pred    A    B    C    D    E
##         A 1510  463  462  427  179
##         B   27  389   36  163  141
##         C  132  287  528  374  299
##         D    0    0    0    0    0
##         E    5    0    0    0  463
# VALIDATION-SET ACCURACY:
tree.accuracy = mean(tree.pred == validation$classe)
tree.accuracy
## [1] 0.491079

**Decision-Tree Model Evaluation:**

  • Estimated Test-Accuracy = 53.0% (from 10-fold CV)
  • Validation-Accuracy = 49.1% (on the validation set)
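
Raw accuracy hides per-class behavior: the confusion table shows the tree never predicts class D at all. caret's confusionMatrix() reports per-class sensitivity and specificity alongside overall accuracy, and the same call works for each model below; a sketch (output omitted):

# per-class sensitivity/specificity plus overall accuracy and kappa
# (with R >= 4.0, validation$classe may need an explicit factor() conversion)
confusionMatrix(tree.pred, validation$classe)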

[B] Model-2: Linear Discriminant Analysis (LDA)

set.seed(1)

# MODELING:
lda.model = train(classe ~ ., method="lda", data=train,
                   trControl = trainControl(method = "cv", number=10))
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
lda.model
## Linear Discriminant Analysis 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7122376  0.6357642
## 
## 
# PREDICTION:
lda.pred = predict(lda.model, validation)
table(lda.pred, validation$classe)
##         
## lda.pred    A    B    C    D    E
##        A 1372  153  100   45   46
##        B   52  743   98   44  176
##        C  114  140  673  120   92
##        D  129   50  119  718   98
##        E    7   53   36   37  670
# VALIDATION-SET ACCURACY:
lda.accuracy = mean(lda.pred == validation$classe)
lda.accuracy
## [1] 0.7096007

**LDA Model Evaluation:**

  • Estimated Test-Accuracy = 71.2% (from 10-fold CV)
  • Validation-Accuracy = 71.0% (on the validation set)

[C] Model-3: SVM - Support Vector Machines

set.seed(1)

# MODELING:
svm.model = svm(classe ~ ., data=train, cross=10)
summary(svm.model)
## 
## Call:
## svm(formula = classe ~ ., data = train, cross = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.01886792 
## 
## Number of Support Vectors:  6281
## 
##  ( 1290 1435 1430 1165 961 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  A B C D E
## 
## 10-fold cross-validation on training data:
## 
## Total Accuracy: 93.98704 
## Single Accuracies:
##  94.39184 94.25036 94.54148 94.31901 93.44978 93.377 94.31901 93.95924 94.54148 92.72198
# PREDICTION:
svm.pred = predict(svm.model, validation)
table(svm.pred, validation$classe)
##         
## svm.pred    A    B    C    D    E
##        A 1659   70    4    2    0
##        B    1 1036   28    1    3
##        C   14   29  987   90   13
##        D    0    2    6  871   19
##        E    0    2    1    0 1047
# VALIDATION-SET ACCURACY:
svm.accuracy = mean(svm.pred == validation$classe)
svm.accuracy
## [1] 0.9515718

**“Support Vector Machines” Model Evaluation:**

  • Estimated Test-Accuracy = 94.0% (from 10-fold CV)
  • Validation-Accuracy = 95.2% (on the validation set)
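
The radial SVM above runs with e1071's defaults (cost = 1, gamma = 1/53). Accuracy is usually sensitive to both, so a grid search could plausibly improve on 94%; a hedged sketch with an assumed, coarse grid (slow on 13,737 rows):

# grid search over cost and gamma using e1071's built-in 10-fold CV tuner;
# the grid values are assumptions, not from the original analysis
svm.tuned = tune.svm(classe ~ ., data = train,
                     gamma = 10^(-3:-1), cost = 10^(0:2),
                     tunecontrol = tune.control(cross = 10))
svm.tuned$best.parameters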

[D] Model-4: Bagging

set.seed(1)

# MODELING:
bag.model = bag(train[,1:53], train[,54], B=10,
                bagControl = bagControl(fit = ctreeBag$fit,
                                      predict = ctreeBag$pred,
                                      aggregate = ctreeBag$aggregate),
              trControl = trainControl(method = "cv", number=10))
summary(bag.model)
## 
## Call:
## bag.default(x = train[, 1:53], y = train[, 54], B = 10, bagControl
##  = ctreeBag$aggregate), trControl = trainControl(method = "cv", number
##  = 10))
## 
## Out of bag statistics (B = 10):
## 
##        Accuracy  Kappa
##   0.0%   0.8608 0.8238
##   2.5%   0.8616 0.8249
##  25.0%   0.8674 0.8323
##  50.0%   0.8720 0.8379
##  75.0%   0.8788 0.8464
##  97.5%   0.8905 0.8615
## 100.0%   0.8912 0.8623
# PREDICTION:
bag.pred = predict(bag.model, validation)
table(bag.pred, validation$classe)
##         
## bag.pred    A    B    C    D    E
##        A 1646   22    5    4    0
##        B   17 1082   27    5   12
##        C    2   17  975   22    5
##        D    6    9   18  923   11
##        E    3    9    1   10 1054
# VALIDATION-SET ACCURACY:
bag.accuracy = mean(bag.pred == validation$classe)
bag.accuracy
## [1] 0.9651657

**“Bagging” Model Evaluation:**

  • Estimated Test-Accuracy = 89.1% (out-of-bag estimate, B = 10)
  • Validation-Accuracy = 96.5% (on the validation set)
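
Two possible refinements, both assumptions rather than part of the original analysis: B = 10 bootstrap replicates is small, and the spread of the out-of-bag accuracies (86.1% to 89.1%) suggests a larger B would stabilize the ensemble; alternatively, caret's built-in "treebag" method fits bagged trees through the same train() interface used for the other models:

# bagged CART via caret's train() interface (sketch; not run above)
treebag.model = train(classe ~ ., data = train, method = "treebag",
                      trControl = trainControl(method = "cv", number = 10))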

[E] Model-5: Boosting

set.seed(1)

# MODELING:
boost.model = train(classe ~., data=train, method = "gbm", 
                    verbose = FALSE,
                    trControl = trainControl(method = "cv", number=10))
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
boost.model
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7622458  0.6983534
##   1                  100      0.8308971  0.7859187
##   1                  150      0.8688964  0.8340455
##   2                   50      0.8835304  0.8524731
##   2                  100      0.9389997  0.9228070
##   2                  150      0.9618581  0.9517388
##   3                   50      0.9306287  0.9121743
##   3                  100      0.9702285  0.9623277
##   3                  150      0.9858784  0.9821365
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# PREDICTION:
boost.pred = predict(boost.model, validation)
table(boost.pred, validation$classe)
##           
## boost.pred    A    B    C    D    E
##          A 1667    7    0    2    0
##          B    6 1118   14    6    5
##          C    0   13 1010    8    2
##          D    1    1    1  948    5
##          E    0    0    1    0 1070
# VALIDATION-SET ACCURACY:
boost.accuracy = mean(boost.pred == validation$classe)
boost.accuracy
## [1] 0.9877655

**“Boosting” Model Evaluation:**

  • Estimated Test-Accuracy = 98.6% (from 10-fold CV)
  • Validation-Accuracy = 98.8% (on the validation set)
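
The CV table shows accuracy still climbing at the edge of caret's default grid (interaction depth 3, 150 trees), so a larger grid would likely do even better. A sketch of an expanded grid passed via tuneGrid (the extra values are assumptions):

# expanded gbm tuning grid; values beyond caret's defaults are assumptions
gbm.grid = expand.grid(n.trees = c(150, 300, 500),
                       interaction.depth = c(3, 5),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
boost.model.2 = train(classe ~ ., data = train, method = "gbm",
                      verbose = FALSE, tuneGrid = gbm.grid,
                      trControl = trainControl(method = "cv", number = 10))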

[F] Model-6: Random Forests

set.seed(1)

# MODELING:
rf.model = train(classe ~., data=train, method = "rf",
                    trControl = trainControl(method = "cv", number=10))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf.model
## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12365, 12363, 12363, 12364, 12363, 12362, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9941766  0.9926335
##   27    0.9975251  0.9968695
##   53    0.9947592  0.9933705
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
# PREDICTION:
rf.pred = predict(rf.model, validation)
table(rf.pred, validation$classe)
##        
## rf.pred    A    B    C    D    E
##       A 1674    1    0    0    0
##       B    0 1137    0    0    2
##       C    0    0 1026    0    0
##       D    0    1    0  964    1
##       E    0    0    0    0 1079
# VALIDATION-SET ACCURACY:
rf.accuracy = mean(rf.pred == validation$classe)
rf.accuracy
## [1] 0.9991504

**“Random Forests” Model Evaluation:**

  • Estimated Test-Accuracy = 99.8% (from 10-fold CV)
  • Validation-Accuracy = 99.92% (on the validation set)
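
Beyond raw accuracy, it is worth checking which sensor measurements drive the random forest; caret's varImp() extracts the forest's variable-importance scores (a quick sketch, output omitted):

# top 20 predictors by the forest's variable-importance measure
plot(varImp(rf.model), top = 20)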

**[6] Model Selection:**

Based on the highest 10-fold cross-validation estimated Test-Accuracy (99.8%) and the highest Validation-Set Accuracy (99.92%), I select “Random Forests” as the model of choice for the classification task of this project.
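
For the four caret-fitted models (tree, LDA, boosting, random forest), the fold-by-fold CV results can also be compared directly with caret's resamples(); the svm() and bag() fits are not train objects, so they are left out of this sketch:

# side-by-side CV accuracy distributions for the caret train objects
cv.results = resamples(list(CART = tree.model, LDA = lda.model,
                            GBM = boost.model, RF = rf.model))
summary(cv.results)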

**[7] Final Prediction of the Test-set using “Random Forests”:**

# final prediction on the test set:
test.pred = predict(rf.model, test)
table(test.pred)
## test.pred
## A B C D E 
## 7 8 1 1 3
# adding the predicted class to the test set as a column.
test$classe = test.pred

dim(test)
## [1] 20 54
test$classe
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
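
Finally, for submitting the 20 answers, each prediction can be written to its own small text file; a minimal sketch (the file-naming scheme is an assumption, not from the original script):

# write one single-letter file per test case (naming scheme assumed)
for (i in seq_along(test.pred)) {
  writeLines(as.character(test.pred[i]), con = paste0("problem_", i, ".txt"))
}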