Random forests, also known as random decision forests, are a popular ensemble method that can be used to build predictive models for both classification and regression problems. Ensemble methods use multiple learning models to gain better predictive results - in the case of a random forest, the model creates an entire forest of random uncorrelated decision trees to arrive at the best possible answer.
To demonstrate how this works in practice - specifically in a classification context - I’ll be walking you through an example using a famous data set from the University of California, Irvine (UCI) Machine Learning Repository. The data set, called the Breast Cancer Wisconsin (Diagnostic) Data Set, deals with binary classification and includes features computed from digitized images of biopsies. The data set can be downloaded here. To follow this tutorial, you will need some familiarity with classification and regression tree (CART) modeling. I will provide a brief overview of different CART methodologies that are relevant to random forest, beginning with decision trees. If you’d like to brush up on your knowledge of CART modeling before beginning the tutorial, I highly recommend reading Chapter 8 of the book “An Introduction to Statistical Learning with Applications in R,” which can be downloaded here.
Decision trees are simple but intuitive models that utilize a top-down approach in which the root node creates binary splits until a certain criterion is met. This binary splitting of nodes provides a predicted value based on the interior nodes leading to the terminal (final) nodes. In a classification context, a decision tree outputs a predicted target class for each terminal node produced. Although intuitive, decision trees have limitations that restrict their usefulness on their own in machine learning applications. You can learn more about implementing a decision tree here.
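To make this concrete, here is a minimal sketch of a single classification tree. It uses the rpart package and the built-in iris data, neither of which is part of the tutorial pipeline below; it is purely for illustration.
library(rpart)
# Fit one classification tree: the root node splits the data repeatedly
# until the stopping rules are met, and each terminal node predicts a class
single_tree <- rpart(Species ~ ., data = iris, method = 'class')
# Printing the tree shows every binary split and the predicted class at each terminal node
print(single_tree)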
Decision trees tend to have high variance: trained on different subsets of the same data, they can produce very different models, because they tend to overfit the training data. This leads to poor performance on unseen data and, unfortunately, limits the usefulness of a single decision tree in predictive modeling. However, using ensemble methods, we can create models that use decision trees as a foundation and produce powerful results.
Through a process known as bootstrap aggregating (or bagging), it’s possible to create an ensemble (forest) of trees where multiple training sets are generated by sampling with replacement, meaning data instances - or, in the case of this tutorial, patients - can be repeated within a training set. Once the training sets are created, a CART model is trained on each subsample, and the ensemble’s predictions are combined by majority vote, which helps reduce variance. Another important feature of bagging is that each tree uses the entire feature space when considering node splits. Bagged trees are allowed to grow without pruning, so each individual tree is deep, with high variance but low bias, and the averaging then counteracts that variance. The downside of using the entire feature space, however, is the risk of correlation between the trees.
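Here is a hand-rolled sketch of that idea, again using rpart and the built-in iris data purely for illustration (the actual modeling below uses randomForest and the breast cancer data):
library(rpart)
set.seed(1)
n_trees <- 25
bagged_trees <- lapply(seq_len(n_trees), function(b) {
  # Bootstrap sample: draw the rows with replacement, so some observations repeat
  boot_idx <- sample(nrow(iris), replace = TRUE)
  # Grow a deep, unpruned tree on the bootstrap sample
  rpart(Species ~ ., data = iris[boot_idx, ], method = 'class',
        control = rpart.control(cp = 0, minsplit = 2))
})
# Majority vote across the ensemble for the first observation
votes <- sapply(bagged_trees, function(tree) as.character(predict(tree, iris[1, ], type = 'class')))
names(which.max(table(votes)))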
The main limitation of bagging is exactly that: it uses the entire feature space when creating splits. If a handful of variables are strongly predictive, most trees will split on them first, and you end up with a forest of correlated trees whose averaged predictions reduce variance far less than they could. However, a simple tweak of the bagging methodology proves advantageous to the model’s predictive power.
Random forest addresses this correlation issue by choosing only a random subsample of the feature space at each split. Essentially, it aims to de-correlate the trees; it also limits tree growth by setting a stopping criterion for node splits, which I will cover in more detail later.
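In terms of the randomForest package used later in this tutorial, the difference comes down to two arguments: mtry (how many predictors are sampled as split candidates at each node) and nodesize (the minimum terminal-node size, i.e. the stopping criterion). A quick sketch on the built-in iris data, for illustration only:
library(randomForest)
p <- ncol(iris) - 1   # four predictors
set.seed(1)
# Bagging: every predictor is a candidate at every split
bagged_model <- randomForest(Species ~ ., data = iris, mtry = p)
# Random forest: only a random subset of predictors is considered at each split
rf_model <- randomForest(Species ~ ., data = iris, mtry = floor(sqrt(p)), nodesize = 1)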
We load our packages into RStudio; in my case, I will be working in an R Markdown file.
suppressWarnings(library(tidyverse))
suppressWarnings(library(caret))
suppressWarnings(library(ggcorrplot))
suppressWarnings(library(GGally))
suppressWarnings(library(randomForest))
suppressWarnings(library(e1071))
suppressWarnings(library(ROCR))
suppressWarnings(library(pROC))
suppressWarnings(library(RCurl))
For this section, I’ll load the data with the RCurl package, similar to the Python version (I do recommend keeping a static copy of the data set as well). Next, I create a vector with the appropriate column names and assign it via col.names when reading the data into a data frame.
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean',
'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean',
'concavity_mean','concave_points_mean',
'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se',
'area_se', 'smoothness_se', 'compactness_se',
'concavity_se', 'concave_points_se',
'symmetry_se', 'fractal_dimension_se',
'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst',
'concavity_worst', 'concave_points_worst',
'symmetry_worst', 'fractal_dimension_worst')
breast_cancer <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)
breast_cancer$id_number <- NULL
Let’s preview the data set using the head() function, which returns the first six rows of the data frame.
head(breast_cancer)
## diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 M 17.99 10.38 122.80 1001.0
## 2 M 20.57 17.77 132.90 1326.0
## 3 M 19.69 21.25 130.00 1203.0
## 4 M 11.42 20.38 77.58 386.1
## 5 M 20.29 14.34 135.10 1297.0
## 6 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave_points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 1 0.03003 0.006193 25.38 17.33
## 2 0.01389 0.003532 24.99 23.41
## 3 0.02250 0.004571 23.57 25.53
## 4 0.05963 0.009208 14.91 26.50
## 5 0.01756 0.005115 22.54 16.67
## 6 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 1 184.60 2019.0 0.1622 0.6656
## 2 158.80 1956.0 0.1238 0.1866
## 3 152.50 1709.0 0.1444 0.4245
## 4 98.87 567.7 0.2098 0.8663
## 5 152.20 1575.0 0.1374 0.2050
## 6 103.40 741.6 0.1791 0.5249
## concavity_worst concave_points_worst symmetry_worst
## 1 0.7119 0.2654 0.4601
## 2 0.2416 0.1860 0.2750
## 3 0.4504 0.2430 0.3613
## 4 0.6869 0.2575 0.6638
## 5 0.4000 0.1625 0.2364
## 6 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 1 0.11890
## 2 0.08902
## 3 0.08758
## 4 0.17300
## 5 0.07678
## 6 0.12440
Next, we’ll look at the dimensions of the data set, where the first value is the number of patients and the second is the number of features. We also print the structure of the data set; this is important because it can reveal missing data and gives us context for any further data cleaning.
breast_cancer %>%
dim()
## [1] 569 31
breast_cancer %>%
str()
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave_points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave_points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave_points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
The distribution of diagnosis is important because it brings up the issue of class imbalance in machine learning and data mining applications. Class imbalance occurs when one target class in a data set is heavily outnumbered by the other target class (or classes). This can lead to misleading accuracy metrics, a problem known as the accuracy paradox, so we have to make sure our target classes aren’t imbalanced. We do so by computing the distribution of the target classes.
NOTE: If your data set suffers from class imbalance I suggest reading documentation on upsampling and downsampling.
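For reference, here is a hedged sketch of how caret’s upSample() could rebalance the classes if that were necessary (as we confirm below, it isn’t for this data set):
# Illustration only: resample the minority class until the classes are balanced
rebalanced <- upSample(x = breast_cancer[ , -1],
                       y = breast_cancer$diagnosis,
                       yname = 'diagnosis')
table(rebalanced$diagnosis)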
breast_cancer %>%
count(diagnosis) %>%
group_by(diagnosis) %>%
summarize(perc_dx = round((n / 569)* 100, 2))
## # A tibble: 2 x 2
## diagnosis perc_dx
## <fctr> <dbl>
## 1 B 62.74
## 2 M 37.26
Fortunately, this data set does not suffer from class imbalance. Next we will use the summary() function, which gives standard descriptive statistics for each feature: minimum, first quartile, median, mean, third quartile, and maximum.
summary(breast_cancer)
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave_points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median :0.006380 Median :0.020450 Median :0.02589
## Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave_points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave_points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
We can see from the maximum values that our features vary widely in scale, which is important to keep in mind when choosing classification models. Standardization is a requirement for many classification models and should be considered during pre-processing; some models (like neural networks) can perform poorly without it, and the summary output above is a good indicator of whether standardization is needed. Fortunately, random forest does not require any pre-processing (for handling categorical data in the Python version, see sklearn’s section on encoding categorical data).
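If standardization were needed - say, for a neural network or an SVM - caret’s preProcess() is one convenient option. A minimal sketch for illustration only; we won’t use the scaled data for the random forest:
# Center and scale every numeric feature (diagnosis, the first column, is excluded)
preproc <- preProcess(breast_cancer[ , -1], method = c('center', 'scale'))
scaled_features <- predict(preproc, breast_cancer[ , -1])
summary(scaled_features$radius_mean)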
We split the data set into training and test sets, (pseudo) randomly selected with an 80/20 split. We will use the training set to train and tune our model, and the test set as the unseen data that gives us a final metric of how well the model performs.
When using this method for machine learning, always be wary of touching your test set while building models. Data leakage is a serious issue that is common in practice and can result in over-fitting. More on data leakage can be found in this Kaggle article.
set.seed(42)
trainIndex <- createDataPartition(breast_cancer$diagnosis,
p = .8,
list = FALSE,
times = 1)
training_set <- breast_cancer[ trainIndex, ]
test_set <- breast_cancer[ -trainIndex, ]
NOTE: What I mean by pseudo-random is that we want everyone who replicates this project to get the same results, so we set the random number generator’s seed to a value of our choosing. Anyone who uses the same seed will get the same results, which is great for reproducibility.
The R version differs noticeably from the Python version here: with the caret package, hyperparameter optimization, cross-validation, and model fitting are all done in the same step. If you want a more in-depth look, check the Python version.
Here we’ll create a custom caret model that lets us grid search over mtry, ntree, and nodesize, and see which combination of parameters yields the best model based on accuracy.
# Custom grid search
# From https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree", "nodesize"), class = rep("numeric", 3), label = c("mtry", "ntree", "nodesize"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
randomForest(x, y, mtry = param$mtry, ntree=param$ntree, nodesize=param$nodesize, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
Now that we have the custom model specification, we’ll use the train() function, which cross-validates and performs the grid search, giving us the best parameters.
fitControl <- trainControl(## 3-fold CV
method = "repeatedcv",
number = 3,
## repeated ten times
repeats = 10)
grid <- expand.grid(.mtry=c(floor(sqrt(ncol(training_set))), (ncol(training_set) - 1), floor(log(ncol(training_set)))),
.ntree = c(100, 300, 500, 1000),
.nodesize =c(1:4))
set.seed(42)
fit_rf <- train(as.factor(diagnosis) ~ .,
data = training_set,
method = customRF,
metric = "Accuracy",
tuneGrid= grid,
trControl = fitControl)
Let’s print the final model chosen by the grid search, then the full train object to see the cross-validation results across all parameter combinations.
fit_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = param$ntree, mtry = param$mtry, nodesize = param$nodesize)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.39%
## Confusion matrix:
## B M class.error
## B 280 6 0.02097902
## M 14 156 0.08235294
fit_rf
## 456 samples
## 30 predictors
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 10 times)
## Summary of sample sizes: 304, 304, 304, 304, 304, 304, ...
## Resampling results across tuning parameters:
##
## mtry ntree nodesize Accuracy Kappa
## 3 100 1 0.9554934 0.9039383
## 3 100 2 0.9537347 0.9002084
## 3 100 3 0.9535139 0.8997377
## 3 100 4 0.9511117 0.8943956
## 3 300 1 0.9530839 0.8987023
## 3 300 2 0.9533018 0.8990579
## 3 300 3 0.9539626 0.9005017
## 3 300 4 0.9519874 0.8962544
## 3 500 1 0.9557156 0.9041983
## 3 500 2 0.9528589 0.8980725
## 3 500 3 0.9530825 0.8985487
## 3 500 4 0.9526396 0.8977251
## 3 1000 1 0.9548369 0.9023078
## 3 1000 2 0.9533018 0.8990624
## 3 1000 3 0.9530839 0.8987166
## 3 1000 4 0.9519860 0.8962269
## 5 100 1 0.9546176 0.9021625
## 5 100 2 0.9530825 0.8986986
## 5 100 3 0.9543983 0.9014955
## 5 100 4 0.9541776 0.9010637
## 5 300 1 0.9557127 0.9045110
## 5 300 2 0.9548354 0.9025743
## 5 300 3 0.9541805 0.9010976
## 5 300 4 0.9526424 0.8976464
## 5 500 1 0.9546176 0.9019989
## 5 500 2 0.9546191 0.9021272
## 5 500 3 0.9528632 0.8982402
## 5 500 4 0.9530825 0.8987442
## 5 1000 1 0.9546133 0.9021241
## 5 1000 2 0.9539626 0.9006260
## 5 1000 3 0.9546147 0.9019917
## 5 1000 4 0.9528675 0.8983071
## 30 100 1 0.9498074 0.8922880
## 30 100 2 0.9489187 0.8905055
## 30 100 3 0.9478193 0.8881618
## 30 100 4 0.9498002 0.8921772
## 30 300 1 0.9493602 0.8912774
## 30 300 2 0.9511117 0.8951204
## 30 300 3 0.9486979 0.8899375
## 30 300 4 0.9500181 0.8926719
## 30 500 1 0.9504552 0.8937226
## 30 500 2 0.9502345 0.8932458
## 30 500 3 0.9489187 0.8904801
## 30 500 4 0.9487023 0.8899668
## 30 1000 1 0.9506731 0.8941300
## 30 1000 2 0.9495766 0.8919188
## 30 1000 3 0.9502359 0.8933305
## 30 1000 4 0.9504567 0.8937393
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, ntree = 500
## and nodesize = 1.
Once we have trained the model, we can assess variable importance. A downside to creating ensembles of decision trees is that we lose the interpretability a single tree gives: a single tree can show us the important node splits and which variable drove each split.
Fortunately, ensemble methods built on CART models use a metric to evaluate the homogeneity of splits, and these metrics can be aggregated across the ensemble to give insight into which variables were important during training. Two commonly used metrics are Gini impurity and entropy.
The two metrics differ, and from reading documentation and discussions online, many people favor Gini impurity because entropy is more expensive to compute (it requires evaluating a logarithm). For more discussion I recommend reading this article.
Here we define each metric:
\[Gini\ Impurity = 1 - \sum_i p_i^2\]
\[Entropy = -\sum_i p_i \log_2 p_i\]
where \(p_i\) is the proportion of samples in a given node that belong to class \(i\).
As far as I can tell, the randomForest package uses the Gini index and does not offer entropy (information gain) as an alternative splitting criterion.
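To make the two definitions concrete, here is a small illustrative helper (not part of the modeling pipeline) that computes both metrics from a vector of class proportions:
# Illustration only: split criteria for a vector of class proportions
gini_impurity <- function(p) 1 - sum(p ^ 2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
# A node that is 80% benign and 20% malignant
gini_impurity(c(0.8, 0.2))   # 0.32
entropy(c(0.8, 0.2))         # roughly 0.72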
varImportance <- varImp(fit_rf, scale = FALSE)
varImportanceScores <- data.frame(varImportance$importance)
varImportanceScores <- data.frame(names = row.names(varImportanceScores), var_imp_scores = varImportanceScores$B)
varImportanceScores
## names var_imp_scores
## 1 radius_mean 0.9229638
## 2 texture_mean 0.7948272
## 3 perimeter_mean 0.9336487
## 4 area_mean 0.9239305
## 5 smoothness_mean 0.7208865
## 6 compactness_mean 0.8613534
## 7 concavity_mean 0.9328877
## 8 concave_points_mean 0.9599548
## 9 symmetry_mean 0.7148807
## 10 fractal_dimension_mean 0.5073118
## 11 radius_se 0.8651584
## 12 texture_se 0.5300288
## 13 perimeter_se 0.8709070
## 14 area_se 0.9187269
## 15 smoothness_se 0.5070444
## 16 compactness_se 0.7222851
## 17 concavity_se 0.7726347
## 18 concave_points_se 0.7910839
## 19 symmetry_se 0.5474496
## 20 fractal_dimension_se 0.6225422
## 21 radius_worst 0.9622378
## 22 texture_worst 0.8026018
## 23 perimeter_worst 0.9675339
## 24 area_worst 0.9615693
## 25 smoothness_worst 0.7520053
## 26 compactness_worst 0.8500103
## 27 concavity_worst 0.9108700
## 28 concave_points_worst 0.9595640
## 29 symmetry_worst 0.7410325
## 30 fractal_dimension_worst 0.6824558
ggplot(varImportanceScores,
aes(reorder(names, var_imp_scores), var_imp_scores)) +
geom_bar(stat='identity',
fill = '#875FDB') +
theme(panel.background = element_rect(fill = '#fafafa')) +
coord_flip() +
labs(x = 'Feature', y = 'Importance') +
ggtitle('Feature Importance for Random Forest Model')
Another useful feature of random forest is the out-of-bag (OOB) error rate. Because each tree is grown on a bootstrap sample, roughly two-thirds of the observations are used to train any given tree, leaving about one-third out-of-bag; those held-out observations serve as unseen data on which each tree can be evaluated.
oob_error <- data.frame(ntrees = seq_len(fit_rf$finalModel$ntree), oob = fit_rf$finalModel$err.rate[, 'OOB'])
paste0('Out of Bag Error Rate for model is: ', round(oob_error[nrow(oob_error), 'oob'], 4))
## [1] "Out of Bag Error Rate for model is: 0.0439"
ggplot(oob_error, aes(ntrees, oob)) +
geom_line(colour = 'red') +
theme_minimal() +
ggtitle('OOB Error Rate across 500 trees') +
labs(x = 'Number of Trees', y = 'OOB Error Rate')
Now we will use the test set created earlier to get another evaluation metric for our model. Recall the importance of data leakage: we didn’t touch the test set until now, after hyperparameter optimization was complete.
predict_values <- predict(fit_rf, newdata = test_set)
ftable(predict_values, test_set$diagnosis)
## B M
## predict_values
## B 70 1
## M 1 41
paste0('Test error rate is: ', round(mean(predict_values != test_set$diagnosis), 4))
## [1] "Test error rate is: 0.0177"
In this tutorial we went through a number of metrics to assess the capabilities of our random forest, but this can be taken further using background knowledge of the data set. Feature engineering would be a powerful tool for extracting and further investigating the important features, as would defining key metrics to optimize when tuning model parameters.
There have been advancements in image classification over the past decade that use the images themselves instead of features extracted from the images, but this data set remains a great resource for becoming familiar with machine learning processes, especially for those who are just beginning to learn machine learning concepts. If you have any suggestions, recommendations, or corrections, please reach out to me.