In this project, I implement various classification models to predict the manner in which people perform an exercise, based on accelerometer data from devices such as the Jawbone Up. I perform the machine learning experiments in two different environments: (i) AzureML Studio and (ii) RStudio.
(Text taken from the course project description page)
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
The initial training data consisted of 160 features across more than 19,000 instances, with some of the columns holding identification data and timestamps. Of the measurement features, more than 90 consisted primarily of NA (missing) values. These columns were removed from both the training and test data, leaving 52 features and 1 outcome (classe).
# Loading the required libraries
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(rpart.plot))
suppressMessages(library(rpart))
suppressMessages(library(rattle))
suppressMessages(library(randomForest))
# Read the csv files into R
file1 <- "./pml-training.csv"
file2 <- "./pml-testing.csv"
pmltrain <- read.csv(file1, header = TRUE)
pmltest <- read.csv(file2, header = TRUE)
# Preliminary exploration of the data
str(pmltrain)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","0.000673",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","0.006078",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","0.000000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","0.000000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","0.00","0.0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","0.01388","0.01574",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","0.00000","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","0.00000","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","0.0016","-0.0035",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","0.0045","0.0130",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","0.0011","0.0014",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","0.0063",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","0.0","-0.1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
As a first step in the data processing, we remove the first seven columns, which consist of participant identification details and timestamps of the observations (timestamp information is irrelevant for training on these data).
# Subset the variables used to train the model
pmltr.sub <- select(pmltrain,8:160)
pmlte.sub <- select(pmltest,8:160)
The R code below counts the columns of the test set that contain no missing values. Inspection of the remaining columns showed that each consisted almost entirely of missing values (> 98% NAs in the training set), so those columns were removed from both the training and test sets.
sum(apply(pmlte.sub, 2, function(x) sum(is.na(x))) == 0)
## [1] 53
# Keep only the columns with no NAs in the test set; in the training set,
# each dropped feature has > 19000 NAs
complete.cols <- apply(pmlte.sub, 2, function(x) sum(is.na(x))) == 0
pmltr.sub <- pmltr.sub[complete.cols]
pmlte.sub <- pmlte.sub[complete.cols]
# Dimensions of the training and test sets
dim(pmltr.sub)
## [1] 19622 53
dim(pmlte.sub)
## [1] 20 53
Above, we see the dimensions of the training set and of the final test set (the 20-case validation set). The final column of the training set is the outcome (classe), while the final column of the test set is a problem identifier (problem_id) rather than an outcome. Before modeling, we can also confirm that the predictor columns of the two sets line up, as sketched below.
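This one-line check (the 53rd column differs by design: classe vs. problem_id) should return TRUE:
# Confirm the 52 predictor names are identical across training and test sets
identical(names(pmltr.sub)[-53], names(pmlte.sub)[-53])
It is also useful to check whether any of the features used for modeling have near-zero variance.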
# Checking if there is any feature(s) with near zero variance
nzv <- nearZeroVar(pmltr.sub, saveMetrics = TRUE)
sum(nzv$nzv == TRUE)
## [1] 0
We see that none of the selected features has near-zero variance.
The training data (pmltr.sub) consists of 19622 observations. We split this dataset into a training set (p = 0.65) and a testing set (p = 0.35).
# Data partition
intrain <- createDataPartition(y = pmltr.sub$classe, p = 0.65, list = FALSE)
training <- pmltr.sub[intrain,]
testing <- pmltr.sub[-intrain,]
Up to this point, all data processing has been performed in RStudio. Now we write the testing and training sets to .csv files and import them into AzureML Studio, where the modeling experiment is conducted separately.
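A minimal sketch of that export step (the file names are illustrative):
# Write the partitioned sets to disk for import into AzureML Studio
write.csv(training, "pml-train-subset.csv", row.names = FALSE)
write.csv(testing, "pml-test-subset.csv", row.names = FALSE)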
First, let us discuss the modeling performed in the RStudio environment using Classification and Regression Tree (CART) and random forest methods.
rpart()
We use the rpart() function from the rpart package to train a multiclass classification model.
set.seed(12345)
# Training with classification tree
modfit.rpart <- rpart(classe ~ ., data=training, method="class", xval = 4)
print(modfit.rpart, digits = 3)
## n= 12757
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 12757 9130 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130 11678 8060 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -26.7 1158 50 A (0.96 0.043 0 0 0) *
## 5) pitch_forearm>=-26.7 10520 8010 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 440 8902 6450 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122 5528 3330 A (0.4 0.18 0.19 0.17 0.062)
## 40) magnet_dumbbell_z< -27.5 1823 604 A (0.67 0.21 0.012 0.076 0.032) *
## 41) magnet_dumbbell_z>=-27.5 3705 2700 C (0.27 0.17 0.27 0.22 0.077)
## 82) accel_dumbbell_y>=-40.5 3248 2270 A (0.3 0.19 0.19 0.24 0.083)
## 164) yaw_belt>=170 425 51 A (0.88 0.054 0 0.066 0) *
## 165) yaw_belt< 170 2823 2060 D (0.21 0.2 0.22 0.27 0.095)
## 330) pitch_belt< -43.2 304 45 B (0.016 0.85 0.066 0.049 0.016) *
## 331) pitch_belt>=-43.2 2519 1770 D (0.24 0.13 0.23 0.3 0.1)
## 662) roll_belt>=126 593 255 C (0.38 0.032 0.57 0.01 0.0034)
## 1324) magnet_belt_z< -322 203 6 A (0.97 0.0049 0.015 0 0.0099) *
## 1325) magnet_belt_z>=-322 390 55 C (0.079 0.046 0.86 0.015 0) *
## 663) roll_belt< 126 1926 1180 D (0.19 0.16 0.13 0.39 0.14)
## 1326) accel_dumbbell_z< 26.5 1379 772 D (0.26 0.086 0.18 0.44 0.036)
## 2652) accel_forearm_x>=-79.5 736 420 A (0.43 0.11 0.16 0.24 0.057) *
## 2653) accel_forearm_x< -79.5 643 215 D (0.065 0.059 0.2 0.67 0.012) *
## 1327) accel_dumbbell_z>=26.5 547 336 E (0.027 0.33 0.0055 0.25 0.39)
## 2654) roll_dumbbell< 35.6 164 30 B (0.018 0.82 0.018 0.061 0.085) *
## 2655) roll_dumbbell>=35.6 383 186 E (0.031 0.13 0 0.33 0.51) *
## 83) accel_dumbbell_y< -40.5 457 56 C (0.0066 0.044 0.88 0.031 0.042) *
## 21) roll_forearm>=122 3374 2240 C (0.075 0.17 0.34 0.23 0.18)
## 42) magnet_dumbbell_y< 290 1997 1030 C (0.089 0.15 0.48 0.15 0.13)
## 84) magnet_dumbbell_z>=286 281 140 A (0.5 0.16 0.046 0.078 0.22) *
## 85) magnet_dumbbell_z< 286 1716 761 C (0.022 0.14 0.56 0.16 0.12)
## 170) pitch_belt>=26.1 125 19 B (0.08 0.85 0.032 0 0.04) *
## 171) pitch_belt< 26.1 1591 640 C (0.017 0.089 0.6 0.17 0.13) *
## 43) magnet_dumbbell_y>=290 1377 885 D (0.054 0.21 0.12 0.36 0.26)
## 86) accel_forearm_x>=-90.5 857 544 E (0.043 0.27 0.17 0.16 0.37)
## 172) roll_forearm< 132 106 11 C (0.0094 0.066 0.9 0 0.028) *
## 173) roll_forearm>=132 751 441 E (0.048 0.29 0.064 0.18 0.41)
## 346) roll_dumbbell< 41.6 170 38 B (0.053 0.78 0 0.053 0.12) *
## 347) roll_dumbbell>=41.6 581 291 E (0.046 0.15 0.083 0.22 0.5) *
## 87) accel_forearm_x< -90.5 520 164 D (0.071 0.13 0.044 0.68 0.075) *
## 11) magnet_dumbbell_y>=440 1618 791 B (0.034 0.51 0.038 0.23 0.19)
## 22) total_accel_dumbbell>=5.5 1143 396 B (0.048 0.65 0.052 0.02 0.23)
## 44) roll_belt>=-0.58 958 211 B (0.057 0.78 0.063 0.024 0.076) *
## 45) roll_belt< -0.58 185 0 E (0 0 0 0 1) *
## 23) total_accel_dumbbell< 5.5 475 133 D (0 0.17 0.0021 0.72 0.11) *
## 3) roll_belt>=130 1079 10 E (0.0093 0 0 0 0.99) *
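The printed tree is hard to read. Since the rattle and rpart.plot packages are already loaded, one way to render the tree graphically is:
# Draw the fitted classification tree
fancyRpartPlot(modfit.rpart, sub = "")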
The trained model is used on the testing set to predict the outcome. The predicted outcome is then compared with the actual outcome to compute the out-of-sample accuracy of the model.
# Predict the testing set with the trained model
predictions1 <- predict(modfit.rpart, testing, type = "class")
# Accuracy and other metrics
confusionMatrix(predictions1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1796 313 90 190 88
## B 50 739 70 37 88
## C 40 87 928 166 127
## D 36 109 82 630 59
## E 31 80 27 102 900
##
## Overall Statistics
##
## Accuracy : 0.7273
## 95% CI : (0.7166, 0.7378)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6517
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9196 0.5565 0.7753 0.56000 0.7132
## Specificity 0.8614 0.9558 0.9259 0.95017 0.9572
## Pos Pred Value 0.7251 0.7510 0.6884 0.68777 0.7895
## Neg Pred Value 0.9642 0.8998 0.9512 0.91679 0.9368
## Prevalence 0.2845 0.1934 0.1744 0.16387 0.1838
## Detection Rate 0.2616 0.1076 0.1352 0.09177 0.1311
## Detection Prevalence 0.3608 0.1433 0.1964 0.13343 0.1661
## Balanced Accuracy 0.8905 0.7561 0.8506 0.75509 0.8352
As we see, the classification tree model has not performed well on these data: the overall accuracy is only about 0.7273 (see the printed statistics above).
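Rather than reading these numbers off the printed summary, we can pull them out of caret's confusionMatrix object directly (a small sketch):
# Store the confusion matrix object and extract the accuracy
cm.rpart <- confusionMatrix(predictions1, testing$classe)
cm.rpart$overall["Accuracy"]      # overall accuracy, about 0.727 here
1 - cm.rpart$overall["Accuracy"]  # estimated out-of-sample error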
Next, we use the randomForest() function from the randomForest package, which is much faster than training with the train() function in the caret package.
set.seed(12345)
# Training with Random forest model
modfit.rf <- randomForest(classe ~. , data=training)
# Predict the testing set with the trained model
predictions2 <- predict(modfit.rf, testing, type = "class")
# Accuracy and other metrics
confusionMatrix(predictions2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1952 10 0 0 0
## B 1 1318 18 0 0
## C 0 0 1179 10 1
## D 0 0 0 1112 0
## E 0 0 0 3 1261
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9916, 0.9955)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9921
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9995 0.9925 0.9850 0.9884 0.9992
## Specificity 0.9980 0.9966 0.9981 1.0000 0.9995
## Pos Pred Value 0.9949 0.9858 0.9908 1.0000 0.9976
## Neg Pred Value 0.9998 0.9982 0.9968 0.9977 0.9998
## Prevalence 0.2845 0.1934 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1920 0.1717 0.1620 0.1837
## Detection Prevalence 0.2858 0.1948 0.1733 0.1620 0.1841
## Balanced Accuracy 0.9987 0.9945 0.9915 0.9942 0.9993
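As a cross-check on this estimate, randomForest computes an internal out-of-bag (OOB) error during training, which should be close to the held-out test error:
# The printed model summary includes the OOB error estimate
print(modfit.rf)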
The random forest model has performed very well, with an accuracy of 0.9937 and thus an estimated out-of-sample error rate of 0.0063. This is a very good model and is ready for prediction on the final test (CV) dataset. Let us use the above model to predict the outcome for that dataset. The outcome is printed below.
# Predict the outcome for actual test case
pred.final <- predict(modfit.rf, pmlte.sub)
pred.final
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The prediction accuracy is 100% on this CV dataset, as confirmed by the final quiz score in the Practical Machine Learning course of the Coursera Data Science Specialization.
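If one wanted to save each prediction to its own file (for record-keeping, or for the course's earlier submission format), a small helper like the following could be used (the function name is illustrative; this was not part of the analysis above):
# Write each of the 20 predictions to its own single-character text file
write_preds <- function(preds) {
  for (i in seq_along(preds)) {
    fname <- paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = fname, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_preds(pred.final)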
Now, let us briefly discuss the implementation details of the modeling experiment in AzureML Studio and the resulting prediction metrics.
Figure 1 is a snapshot of the actual multiclass classification experiment performed in AzureML Studio, (i) independently with the Multiclass Decision Forest algorithm provided with ML Studio and (ii) with R’s random forest method (via the caret package). The random forest implementation discussed above is performed with the randomForest package in RStudio. The rf method as implemented in the caret package is computationally resource-intensive and is hard to run on large datasets with a normal laptop or PC (even with parallelization). That is one reason I tried caret’s rf method in the AzureML environment (the caret implementation of random forest inside AzureML using the ‘Execute R Script’ module is not discussed in this report).
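For reference, that caret-based training would look roughly like the following sketch (not run in this report; the cross-validation settings are illustrative):
# caret wrapper around random forest; computationally heavy on a typical PC
ctrl <- trainControl(method = "cv", number = 4)
modfit.caret <- train(classe ~ ., data = training,
                      method = "rf", trControl = ctrl)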
Figure 1
As stated before, the datasets that we processed and partitioned above into training and testing sets are the same datasets used in the AzureML experiment, so little to no data processing is needed here. The ‘Multiclass Decision Forest’ module in AzureML offers some flexibility in adjusting the model parameters: we select the bagging resampling method with 32 decision trees and a maximum tree depth of 128, and leave the rest at the default settings.
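Purely for intuition, a rough R analogue of these settings (randomForest likewise bags trees, but its interface exposes no direct maximum-depth parameter, so this is only an approximation) would be:
# Approximate analogue of the AzureML settings: 32 bagged trees
modfit.analog <- randomForest(classe ~ ., data = training, ntree = 32)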
As shown in the experiment process flow, we train the model on the training data using the Multiclass Decision Forest algorithm. Then, using the ‘Score Model’ module, we predict the outcome of the testing set with the trained model, and we use the ‘Evaluate Model’ module to calculate the metrics. Figures 2 and 3 show the accuracy metrics and the confusion matrix of the predicted outcome, respectively.
Figure 2
Figure 3
As we see, the Multiclass Decision Forest algorithm provided the highest out-of-sample accuracy of all the models that we tested in RStudio and in AzureML Studio.
The final test (CV) dataset that we predicted in the previous section (the RStudio experiment) is also used here as a final test of the model. The AzureML-trained model predicted the same outcomes as before, and we know from the previous section that those predictions were all correct. Thus AzureML, too, predicted every outcome accurately, with a (marginally) higher out-of-sample accuracy (0.9955).
We trained multiclass classification models using rpart() and randomForest() in RStudio, and using the Multiclass Decision Forest algorithm in AzureML Studio. The random forest model performed very well in both settings, with the AzureML implementation providing marginally better out-of-sample accuracy.