Wanted: the Best Predictive Model

1. Synopsis

This is a project for the Practical Machine Learning course, which is part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.

Devices such as Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively. The data for the assignment come from Groupware and were recorded by accelerometers on the belt, forearm, arm, and dumbbell of six participants. Each participant performed 10 bicep curls in five different fashions: exactly according to the correct specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). The data come as a training set and a test set.

The project aims to build machine learning models with several algorithms that predict which exercise fashion was performed, and then to choose the best one to predict the 20 test cases in the test data set.
The best prediction model turned out to be the random forest, which had the highest accuracy and the lowest out-of-sample error.

2. Data Processing

Data


Load packages & data

library(caret); library(dplyr); library(rattle)

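# Download a CSV once (if it is not cached yet) and read it into a data frame.
# Note: this helper masks base::load() for the rest of the session.
load <- function(name, url) {
  dest <- paste0("./data/1124_DS-ML-w4_ActivityRecognition/", name, ".csv")
  if (!dir.exists(dirname(dest))) dir.create(dirname(dest), recursive = TRUE)
  if (!file.exists(dest)) download.file(url, destfile = dest, method = "curl")
  read.csv(dest)
}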
url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training<- load("training", url)

url<- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing <- load("testing", url)

Look at the dimension and classes of the training set:

dim(training)

[1] 19622   160

table(training$classe)


   A    B    C    D    E 
5580 3797 3422 3216 3607

There are 19622 observations of 160 variables, and the dependent variable classe has five levels: A, B, C, D, E.

Create the testIN data set, a 40% hold-out from the training data, for later validation:

set.seed(12345)
index <- createDataPartition(y = training$classe, p = 0.6, list = FALSE)
train <- training[index, ]
testIN <- training[-index, ]
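
As an illustration of what a stratified split means, here is a minimal base-R sketch on toy labels. The vector `y` and the helper names are hypothetical, and this is not caret's actual implementation of createDataPartition(); it only shows the idea of sampling 60% of the rows inside each class so that both parts keep the same class proportions.

```r
# Minimal base-R sketch of a stratified 60/40 split (toy data, not the
# project's data set; caret's createDataPartition() handles this for you).
set.seed(12345)
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))  # hypothetical labels

# Sample 60% of the row indices inside each class, then pool them
idx_by_class <- split(seq_along(y), y)
index <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.6 * length(i)))]
}), use.names = FALSE))

train_y <- y[index]
hold_y  <- y[-index]

# Class proportions are the same in both parts
prop.table(table(train_y))
prop.table(table(hold_y))
```

Because the sampling is done per class, a rare class cannot end up missing from either part, which a plain random split does not guarantee.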

Variables Selection

Deal with NAs

Treat blanks (“”), the string “NA” and the division error marker #DIV/0! as NA values too, then compute the proportion of non-NA values:

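# Treat blanks, spreadsheet division errors and textual NA markers as missing
train[train == ""] <- NA
train[train == "#DIV/0!"] <- NA
train[train == "<NA>"] <- NA
# Distinct per-column NA counts; the first is 0 (the leading columns are
# complete), the second belongs to the first column that contains NAs
NAs <- unique(apply(train, 2, function(x) sum(is.na(x))))
# Rows holding a value in that column, as a count and as a proportion
NAs <- dim(train)[1] - NAs[2]
nonNAs <- NAs / dim(train)[1]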

In the columns that contain NAs, only 1.95% of the rows hold a value.
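
An alternative way to get the same missing-value handling, not used in this report, is to declare the markers at import time through the na.strings argument of read.csv(). The sketch below uses a hypothetical three-row CSV held in a string, so the column values are made up for illustration.

```r
# Hypothetical three-row CSV with the same kinds of missing-value markers
csv_text <- 'roll_belt,kurtosis_roll_belt,max_roll_belt
1.41,,NA
1.42,#DIV/0!,
1.48,5.58,-94.3'

# Blanks, "NA" and "#DIV/0!" are all read as NA, so the columns stay numeric
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

colSums(is.na(toy))                   # NA count per column
mean(is.na(toy$kurtosis_roll_belt))   # share of missing values in one column
```

Declaring the markers on import also keeps affected columns numeric instead of character, which the post-hoc replacement does not do by itself.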

Keeping only the columns without NAs leaves the following variables:

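# Names of the columns that contain at least one NA
NAcols <- names(train)[colSums(is.na(train)) > 0]
train <- train %>% select(-all_of(NAcols))
cols <- dim(train)[2]     # number of remaining columns
NAs <- sum(is.na(train))  # number of remaining NAs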
names(train)

 [1] "X"                    "user_name"            "raw_timestamp_part_1"
 [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
 [7] "num_window"           "roll_belt"            "pitch_belt"          
[10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
[13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
[16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
[19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
[22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
[25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
[28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
[31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
[34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
[37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
[40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
[43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
[46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
[49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
[52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
[55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
[58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

So, there are now 60 columns in the training set with 0 NAs in them.

Remove the near-zero-variance predictors, then drop the first six remaining columns (identifiers and timestamps), which are irrelevant to the prediction:

near0 <- nearZeroVar(train, saveMetrics = TRUE)
train <- train[, !near0$zeroVar & !near0$nzv][, -c(1:6)]
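
With saveMetrics = TRUE, nearZeroVar() reports a frequency ratio and a percentage of unique values per column. The following base-R function is a simplified sketch of those two metrics, assuming caret's documented defaults (freqCut = 95/5, uniqueCut = 10); `nzv_sketch` and the two toy vectors are hypothetical names, and the real function handles more edge cases.

```r
# Simplified sketch of the metrics behind a near-zero-variance check
nzv_sketch <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zeroVar <- length(counts) == 1
  # Count of the most common value divided by the count of the runner-up
  freqRatio <- if (zeroVar) 0 else counts[[1]] / counts[[2]]
  # Distinct values as a percentage of all observations
  percentUnique <- 100 * length(counts) / length(x)
  data.frame(freqRatio, percentUnique, zeroVar,
             nzv = (freqRatio > freqCut && percentUnique <= uniqueCut) || zeroVar)
}

window_flag <- c(rep("no", 98), rep("yes", 2))  # hypothetical, highly unbalanced
sensor      <- seq(0.1, 10, by = 0.1)           # hypothetical, all values distinct

nzv_sketch(window_flag)  # flagged: one value dominates and few values are distinct
nzv_sketch(sensor)       # not flagged
```

A column is flagged only when both conditions hold, so a numeric sensor reading with many distinct values is kept even if one value happens to be frequent.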

Apply the same processing to the testIN set:

testIN <- testIN %>% select(-all_of(NAcols))
testIN <- testIN[, !near0$zeroVar & !near0$nzv][, -c(1:6)]

3. Predictive Models

Fit three models with 5-fold cross-validation: a classification tree, a random forest and gradient boosting.

3.1 Classification Tree

trControl <- trainControl(method="cv", number=5)
model_CT <- train(classe~., data=train, method="rpart", trControl=trControl)
accurT<- getTrainPerf(model_CT)[[1]]
fancyRpartPlot(model_CT$finalModel, sub="")

model_CT

CART 

11776 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 9421, 9421, 9420, 9421, 9421 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa     
  0.03440911  0.4753702  0.30552993
  0.05964246  0.4150861  0.20742048
  0.11449929  0.3322835  0.07309721

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.03440911.

Model accuracy of 47.54% is very low.

3.2 Random Forest

set.seed(1234)
model_RF <- train(classe~., data=train, method="rf",
                  trControl=trControl, verbose=FALSE)
model_RF

Random Forest 

11776 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 9420, 9422, 9420, 9421, 9421 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9885364  0.9854970
  27    0.9883664  0.9852831
  52    0.9828474  0.9783011

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Final model variable importance:

varImp(model_RF)

rf variable importance

  only 20 most important variables shown (out of 52)

                  Overall
roll_belt          100.00
yaw_belt            75.42
magnet_dumbbell_z   68.21
pitch_belt          59.97
pitch_forearm       59.68
magnet_dumbbell_y   58.39
magnet_dumbbell_x   50.67
roll_forearm        50.56
magnet_belt_y       43.84
accel_belt_z        42.68
accel_dumbbell_y    41.52
roll_dumbbell       41.24
magnet_belt_z       40.51
accel_dumbbell_z    34.59
accel_forearm_x     32.25
roll_arm            32.16
accel_dumbbell_x    30.00
yaw_dumbbell        29.15
gyros_belt_z        28.40
magnet_forearm_x    26.84

The cross-validated accuracy of 98.85% is very good, so validate the model on the held-out testIN subset and check the confusion matrix and accuracy.

Confusion matrix:

pred_Rf <- predict(model_RF, testIN)
confusionMatrix(pred_Rf, as.factor(testIN$classe))$table

          Reference
Prediction    A    B    C    D    E
         A 2231    6    0    0    0
         B    1 1511    9    0    0
         C    0    1 1353   16    4
         D    0    0    6 1269    3
         E    0    0    0    1 1435

confusionMatrix(pred_Rf, as.factor(testIN$classe))$overall[1]

 Accuracy 
0.9940097

accur<- confusionMatrix(pred_Rf, as.factor(testIN$classe))$overall[[1]]
err<- 1-accur

Accuracy on the held-out testIN subset is very high at 99.4%, so the expected out-of-sample error is 0.6%.
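
The accuracy and error figures can be reproduced by hand from the confusion matrix printed above. The sketch below retypes that matrix in base R; the variable names are illustrative, and the confidence interval and per-class sensitivities are extra checks that the report itself does not show.

```r
# Hold-out confusion matrix of the Random Forest, copied from the output above
# (rows = predictions, columns = reference classes)
cm_rf <- matrix(c(2231,    6,    0,    0,    0,
                     1, 1511,    9,    0,    0,
                     0,    1, 1353,   16,    4,
                     0,    0,    6, 1269,    3,
                     0,    0,    0,    1, 1435),
                nrow = 5, byrow = TRUE,
                dimnames = list(Prediction = LETTERS[1:5],
                                Reference  = LETTERS[1:5]))

correct  <- sum(diag(cm_rf))   # correctly classified cases (the diagonal)
total    <- sum(cm_rf)         # all hold-out cases
accuracy <- correct / total
error    <- 1 - accuracy       # expected out-of-sample error

round(c(accuracy = accuracy, error = error), 4)

# Exact binomial 95% confidence interval for the accuracy
binom.test(correct, total)$conf.int

# Per-class sensitivity: correct predictions divided by the reference totals
round(diag(cm_rf) / colSums(cm_rf), 4)
```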

3.3 Gradient Boosting Method

model_GBM <- train(classe~., data=train, method="gbm", trControl=trControl,
                   verbose=FALSE)
plot(model_GBM, lw=3)

model_GBM$finalModel

A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 52 predictors of which 51 had non-zero influence.

Accuracy of 96.01% is quite good. Validate the model on the held-out testIN subset and check the confusion matrix and accuracy.

Confusion matrix:

pred_GBM <- predict(model_GBM, testIN)
confusionMatrix(pred_GBM, as.factor(testIN$classe))$table

          Reference
Prediction    A    B    C    D    E
         A 2188   39    0    2    2
         B   34 1438   44    7    7
         C    4   31 1304   30   17
         D    6    3   18 1244   26
         E    0    7    2    3 1390

confusionMatrix(pred_GBM, as.factor(testIN$classe))$overall[1]

 Accuracy 
0.9640581

accurG<- confusionMatrix(pred_GBM, as.factor(testIN$classe))$overall[[1]]

Accuracy is good but lower than that of the Random Forest model, and the expected out-of-sample error of 3.59% is correspondingly higher.

4. Conclusion

The accuracy of the Random Forest model (99.4%) is higher than that of both the Classification Tree (47.54%) and the Gradient Boosting (96.41%) models.
In addition, the expected out-of-sample error of the Random Forest (0.6%) is lower than that of Gradient Boosting (3.59%).
So Random Forest is the best model.

Finally, apply the Random Forest model to the 20 cases of the original testing data set:

test_RF <- predict(model_RF, testing)
test_RF

 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
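
As a rough plausibility check on these 20 predictions, the short calculation below estimates how likely it is that all of them are correct. It assumes, as a simplification, that the 20 cases are independent and that the hold-out accuracy of 0.9940097 reported above carries over to them; the variable names are illustrative.

```r
accuracy <- 0.9940097   # hold-out accuracy of the Random Forest reported above
n_cases  <- 20          # number of test cases

p_all_correct   <- accuracy^n_cases          # probability that all 20 are right
expected_errors <- n_cases * (1 - accuracy)  # expected number of mistakes

round(c(p_all_correct = p_all_correct, expected_errors = expected_errors), 3)
```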