In this project, I implement various classification models to predict the manner in which people perform an exercise, based on accelerometer data from devices such as the Jawbone Up. I perform the machine learning experiments in two different environments: (i) AzureML Studio and (ii) RStudio.
(Text taken from the course project description page)
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
The initial training data consisted of 160 features across more than 19,000 instances, with some of the columns holding identification data and timestamps. Of the measurement features, more than 90 consisted primarily of NA (missing) values. These columns were removed from both the training and test data, leaving 52 features and 1 outcome (classe).
# Loading the required libraries
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(rpart.plot))
suppressMessages(library(rpart))
suppressMessages(library(rattle))
suppressMessages(library(randomForest))
# Read the csv files into R
file1 <- "./pml-training.csv"
file2 <- "./pml-testing.csv"
pmltrain <- read.csv(file1, header = TRUE)
pmltest <- read.csv(file2, header = TRUE)
# Preliminary exploration of the data
str(pmltrain)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","0.000673",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","0.006078",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","0.000000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","0.000000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","0.00","0.0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","0.01388","0.01574",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","0.00000","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","0.00000","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","0.0016","-0.0035",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","0.0045","0.0130",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","0.0011","0.0014",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","0.0063",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
As a first step in the data processing, we remove the first seven columns, which consist of participant identification details and timestamps of the observations (timestamp information is irrelevant for training on these data).
# Subset the variables used to train the model
pmltr.sub <- select(pmltrain,8:160)
pmlte.sub <- select(pmltest,8:160)
The R code below counts the columns of the test set that contain no missing values. Inspection of the remaining columns showed that each consisted almost entirely of missing values (> 98% NAs in the training set), so those columns were removed from both the training and test sets.
sum(apply(pmlte.sub, 2, function(x) sum(is.na(x))) == 0)
## [1] 53
# Keep only the columns with no NAs in the test set; in the training set,
# each dropped feature has > 19000 NAs
complete.cols <- apply(pmlte.sub, 2, function(x) sum(is.na(x))) == 0
pmltr.sub <- pmltr.sub[complete.cols]
pmlte.sub <- pmlte.sub[complete.cols]
# Dimensions of the training and test sets
dim(pmltr.sub)
## [1] 19622 53
dim(pmlte.sub)
## [1] 20 53
Above, we see the dimensions of the training set and of the final test set (the 20-case validation set). The final column of the training set is the outcome (classe), while the final column of the test set is a problem identifier (problem_id) rather than an outcome. Before modeling, we can also confirm that the predictor columns of the two sets line up, as sketched below.
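This one-line check (the 53rd column differs by design: classe vs. problem_id) should return TRUE:
# Confirm the 52 predictor names are identical across training and test sets
identical(names(pmltr.sub)[-53], names(pmlte.sub)[-53])
It is also useful to check whether any of the features used for modeling have near-zero variance.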
# Checking if there is any feature(s) with near zero variance
nzv <- nearZeroVar(pmltr.sub, saveMetrics = TRUE)
sum(nzv$nzv == TRUE)
## [1] 0
We see that none of the selected features has near-zero variance.
The training data (pmltr.sub) consists of 19622 observations. We split this dataset into a training set (p = 0.65) and a testing set (p = 0.35).
# Data partition
intrain <- createDataPartition(y = pmltr.sub$classe, p = 0.65, list = FALSE)
training <- pmltr.sub[intrain,]
testing <- pmltr.sub[-intrain,]
Up to this point, all data processing has been performed in RStudio. Now we write the testing and training sets to .csv files and import them into AzureML Studio, where the modeling experiment is conducted separately.
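A minimal sketch of that export step (the file names are illustrative):
# Write the partitioned sets to disk for import into AzureML Studio
write.csv(training, "pml-train-subset.csv", row.names = FALSE)
write.csv(testing, "pml-test-subset.csv", row.names = FALSE)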
First, let us discuss the modeling performed in the RStudio environment using Classification and Regression Tree (CART) and random forest methods.
rpart()
We use the rpart() function from the rpart package to train a multiclass classification model.
set.seed(12345)
# Training with classification tree
modfit.rpart <- rpart(classe ~ ., data=training, method="class", xval = 4)
print(modfit.rpart, digits = 3)
## n= 12757
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12757 9130 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130 11678 8060 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -26.7 1158 50 A (0.96 0.043 0 0 0) *
## 5) pitch_forearm>=-26.7 10520 8010 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 440 8902 6450 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122 5528 3330 A (0.4 0.18 0.19 0.17 0.062)
## 40) magnet_dumbbell_z< -27.5 1823 604 A (0.67 0.21 0.012 0.076 0.032) *
## 41) magnet_dumbbell_z>=-27.5 3705 2700 C (0.27 0.17 0.27 0.22 0.077)
## 82) accel_dumbbell_y>=-40.5 3248 2270 A (0.3 0.19 0.19 0.24 0.083)
## 164) yaw_belt>=170 425 51 A (0.88 0.054 0 0.066 0) *
## 165) yaw_belt< 170 2823 2060 D (0.21 0.2 0.22 0.27 0.095)
## 330) pitch_belt< -43.2 304 45 B (0.016 0.85 0.066 0.049 0.016) *
## 331) pitch_belt>=-43.2 2519 1770 D (0.24 0.13 0.23 0.3 0.1)
## 662) roll_belt>=126 593 255 C (0.38 0.032 0.57 0.01 0.0034)
## 1324) magnet_belt_z< -322 203 6 A (0.97 0.0049 0.015 0 0.0099) *
## 1325) magnet_belt_z>=-322 390 55 C (0.079 0.046 0.86 0.015 0) *
## 663) roll_belt< 126 1926 1180 D (0.19 0.16 0.13 0.39 0.14)
## 1326) accel_dumbbell_z< 26.5 1379 772 D (0.26 0.086 0.18 0.44 0.036)
## 2652) accel_forearm_x>=-79.5 736 420 A (0.43 0.11 0.16 0.24 0.057) *
## 2653) accel_forearm_x< -79.5 643 215 D (0.065 0.059 0.2 0.67 0.012) *
## 1327) accel_dumbbell_z>=26.5 547 336 E (0.027 0.33 0.0055 0.25 0.39)
## 2654) roll_dumbbell< 35.6 164 30 B (0.018 0.82 0.018 0.061 0.085) *
## 2655) roll_dumbbell>=35.6 383 186 E (0.031 0.13 0 0.33 0.51) *
## 83) accel_dumbbell_y< -40.5 457 56 C (0.0066 0.044 0.88 0.031 0.042) *
## 21) roll_forearm>=122 3374 2240 C (0.075 0.17 0.34 0.23 0.18)
## 42) magnet_dumbbell_y< 290 1997 1030 C (0.089 0.15 0.48 0.15 0.13)
## 84) magnet_dumbbell_z>=286 281 140 A (0.5 0.16 0.046 0.078 0.22) *
## 85) magnet_dumbbell_z< 286 1716 761 C (0.022 0.14 0.56 0.16 0.12)
## 170) pitch_belt>=26.1 125 19 B (0.08 0.85 0.032 0 0.04) *
## 171) pitch_belt< 26.1 1591 640 C (0.017 0.089 0.6 0.17 0.13) *
## 43) magnet_dumbbell_y>=290 1377 885 D (0.054 0.21 0.12 0.36 0.26)
## 86) accel_forearm_x>=-90.5 857 544 E (0.043 0.27 0.17 0.16 0.37)
## 172) roll_forearm< 132 106 11 C (0.0094 0.066 0.9 0 0.028) *
## 173) roll_forearm>=132 751 441 E (0.048 0.29 0.064 0.18 0.41)
## 346) roll_dumbbell< 41.6 170 38 B (0.053 0.78 0 0.053 0.12) *
## 347) roll_dumbbell>=41.6 581 291 E (0.046 0.15 0.083 0.22 0.5) *
## 87) accel_forearm_x< -90.5 520 164 D (0.071 0.13 0.044 0.68 0.075) *
## 11) magnet_dumbbell_y>=440 1618 791 B (0.034 0.51 0.038 0.23 0.19)
## 22) total_accel_dumbbell>=5.5 1143 396 B (0.048 0.65 0.052 0.02 0.23)
## 44) roll_belt>=-0.58 958 211 B (0.057 0.78 0.063 0.024 0.076) *
## 45) roll_belt< -0.58 185 0 E (0 0 0 0 1) *
## 23) total_accel_dumbbell< 5.5 475 133 D (0 0.17 0.0021 0.72 0.11) *
## 3) roll_belt>=130 1079 10 E (0.0093 0 0 0 0.99) *
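The printed tree is hard to read. Since the rattle and rpart.plot packages are already loaded, one way to render the tree graphically is:
# Draw the fitted classification tree
fancyRpartPlot(modfit.rpart, sub = "")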
The trained model is used on the testing set to predict the outcome. The predicted outcome is then compared with the actual outcome to compute the out-of-sample accuracy of the model.
# Predict the testing set with the trained model
predictions1 <- predict(modfit.rpart, testing, type = "class")
# Accuracy and other metrics
confusionMatrix(predictions1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1796 313 90 190 88
## B 50 739 70 37 88
## C 40 87 928 166 127
## D 36 109 82 630 59
## E 31 80 27 102 900
##
## Overall Statistics
##
## Accuracy : 0.7273
## 95% CI : (0.7166, 0.7378)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6517
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9196 0.5565 0.7753 0.56000 0.7132
## Specificity 0.8614 0.9558 0.9259 0.95017 0.9572
## Pos Pred Value 0.7251 0.7510 0.6884 0.68777 0.7895
## Neg Pred Value 0.9642 0.8998 0.9512 0.91679 0.9368
## Prevalence 0.2845 0.1934 0.1744 0.16387 0.1838
## Detection Rate 0.2616 0.1076 0.1352 0.09177 0.1311
## Detection Prevalence 0.3608 0.1433 0.1964 0.13343 0.1661
## Balanced Accuracy 0.8905 0.7561 0.8506 0.75509 0.8352
As we see, the classification tree model has not performed well on these data: the overall accuracy is only about 0.7273 (see the printed statistics above).
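Rather than reading these numbers off the printed summary, we can pull them out of caret's confusionMatrix object directly (a small sketch):
# Store the confusion matrix object and extract the accuracy
cm.rpart <- confusionMatrix(predictions1, testing$classe)
cm.rpart$overall["Accuracy"]      # overall accuracy, about 0.727 here
1 - cm.rpart$overall["Accuracy"]  # estimated out-of-sample error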
Next, we use the randomForest() function from the randomForest package, which is much faster than training with the train() function in the caret package.
set.seed(12345)
# Training with Random forest model
modfit.rf <- randomForest(classe ~. , data=training)
# Predict the testing set with the trained model
predictions2 <- predict(modfit.rf, testing, type = "class")
# Accuracy and other metrics
confusionMatrix(predictions2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1952 10 0 0 0
## B 1 1318 18 0 0
## C 0 0 1179 10 1
## D 0 0 0 1112 0
## E 0 0 0 3 1261
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9916, 0.9955)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9995 0.9925 0.9850 0.9884 0.9992
## Specificity 0.9980 0.9966 0.9981 1.0000 0.9995
## Pos Pred Value 0.9949 0.9858 0.9908 1.0000 0.9976
## Neg Pred Value 0.9998 0.9982 0.9968 0.9977 0.9998
## Prevalence 0.2845 0.1934 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1920 0.1717 0.1620 0.1837
## Detection Prevalence 0.2858 0.1948 0.1733 0.1620 0.1841
## Balanced Accuracy 0.9987 0.9945 0.9915 0.9942 0.9993
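As a cross-check on this estimate, randomForest computes an internal out-of-bag (OOB) error during training, which should be close to the held-out test error:
# The printed model summary includes the OOB error estimate
print(modfit.rf)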
The random forest model has performed very well, with an accuracy of 0.9937 and thus an estimated out-of-sample error rate of 0.0063. This is a very good model and is ready for prediction on the final test (CV) dataset. Let us use the above model to predict the outcome for that dataset. The outcome is printed below.
# Predict the outcome for actual test case
pred.final <- predict(modfit.rf, pmlte.sub)
pred.final
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction accuracy is 100% on this CV dataset, as confirmed by the final quiz score in the Practical Machine Learning course of the Coursera Data Science Specialization.
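If one wanted to save each prediction to its own file (for record-keeping, or for the course's earlier submission format), a small helper like the following could be used (the function name is illustrative; this was not part of the analysis above):
# Write each of the 20 predictions to its own single-character text file
write_preds <- function(preds) {
  for (i in seq_along(preds)) {
    fname <- paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = fname, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_preds(pred.final)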
Now, let us briefly discuss the implementation details of the modeling experiment in AzureML Studio and the resulting prediction metrics.
Figure 1 is a snapshot of the actual multiclass classification experiment performed in AzureML Studio, (i) independently with the Multiclass Decision Forest algorithm provided with ML Studio and (ii) with R’s random forest method (via the caret package). The random forest implementation discussed above is performed with the randomForest package in RStudio. The rf method as implemented in the caret package is computationally resource-intensive and is hard to run on large datasets with a normal laptop or PC (even with parallelization). That is one reason I tried caret’s rf method in the AzureML environment (the caret implementation of random forest inside AzureML using the ‘Execute R Script’ module is not discussed in this report).
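For reference, that caret-based training would look roughly like the following sketch (not run in this report; the cross-validation settings are illustrative):
# caret wrapper around random forest; computationally heavy on a typical PC
ctrl <- trainControl(method = "cv", number = 4)
modfit.caret <- train(classe ~ ., data = training,
                      method = "rf", trControl = ctrl)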
Figure 1
As stated before, the datasets that we processed and partitioned above into training and testing sets are the same datasets used in the AzureML experiment, so little to no data processing is needed here. The ‘Multiclass Decision Forest’ module in AzureML offers some flexibility in adjusting the model parameters: we select the bagging resampling method with 32 decision trees and a maximum tree depth of 128, and leave the rest at the default settings.
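Purely for intuition, a rough R analogue of these settings (randomForest likewise bags trees, but its interface exposes no direct maximum-depth parameter, so this is only an approximation) would be:
# Approximate analogue of the AzureML settings: 32 bagged trees
modfit.analog <- randomForest(classe ~ ., data = training, ntree = 32)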
As shown in the experiment process flow, we train the model on the training data using the Multiclass Decision Forest algorithm. Then, using the ‘Score Model’ module, we predict the outcome of the testing set with the trained model, and we use the ‘Evaluate Model’ module to calculate the metrics. Figures 2 and 3 show the accuracy metrics and the confusion matrix of the predicted outcome, respectively.
Figure 2
Figure 3
As we see, the Multiclass Decision Forest algorithm provided the highest out-of-sample accuracy of all the models that we tested in RStudio and in AzureML Studio.
The final test (CV) dataset that we predicted in the previous section (the RStudio experiment) is also used here as a final test of the model. The AzureML-trained model predicted the same outcomes as before, and we know from the previous section that those predictions were all correct. Thus AzureML, too, predicted every outcome accurately, with a (marginally) higher out-of-sample accuracy (0.9955).
We trained multiclass classification models using rpart() and randomForest() in RStudio, and using the Multiclass Decision Forest algorithm in AzureML Studio. The random forest model performed very well in both settings, with the AzureML implementation providing marginally better out-of-sample accuracy.