Introduction
In this blog, I will demonstrate how to use the random forest model to
predict the species of iris flowers in the popular ‘Iris’ dataset.
Random forest is a machine learning algorithm that constructs multiple
decision trees and combines their outputs to improve accuracy and reduce
overfitting.
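To make "combines their outputs" concrete for classification, here is a minimal sketch of majority voting. The vote matrix is made up for illustration and does not come from a fitted model.

```r
# Hypothetical votes from five trees (columns) for three flowers (rows).
votes <- matrix(
  c("setosa",     "setosa",    "setosa",     "setosa",     "setosa",
    "versicolor", "virginica", "versicolor", "versicolor", "virginica",
    "virginica",  "virginica", "versicolor", "virginica",  "virginica"),
  nrow = 3, byrow = TRUE
)

# The forest's prediction for each flower is the most common vote in its row.
majority_vote <- apply(votes, 1, function(row) names(which.max(table(row))))
print(majority_vote)
```

A single tree that votes the wrong way is outvoted by the others, which is the intuition behind the reduced overfitting.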
Load Packages
We will use the randomForest and caret packages to demonstrate random
forest.
#install.packages("randomForest")
library(randomForest)
library(caret)
Load Data
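The data consists of 150 rows/observations and five variables.
The five variables are Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width (all numeric), plus Species (a factor with three levels):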
data(iris) #load data
str(iris) #structure of data
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris) #first six rows
Split into Test and Train
The goal of this analysis, as previously stated, is to predict the iris
species based on sepal length, sepal width, petal length, and petal
width using random forest modeling. Splitting the dataset into training
and testing sets allows us to train the model on known data and evaluate
its performance on unseen test data. This helps us determine how
well the model predicts species on new data.
The code below splits the dataset into training and test partitions in an 80:20 (train:test) ratio, a split commonly used in machine learning. createDataPartition samples within each species, so the class proportions are preserved in both partitions.
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
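To show what a stratified 80:20 split does, here is a base-R approximation. It is a sketch of the idea only, not caret's exact algorithm, and the `sketch_` names are hypothetical so they do not overwrite the objects created above.

```r
# Base-R approximation of a stratified 80:20 split (not caret's exact algorithm).
data(iris)
set.seed(123)

# Within each species, sample 80% of the row indices for training.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
sketch_train_index <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

sketch_train <- iris[sketch_train_index, ]
sketch_test  <- iris[-sketch_train_index, ]

# Count the rows each species contributes to the two partitions.
print(table(sketch_train$Species))
print(table(sketch_test$Species))
```

Because each species has 50 flowers, sampling 80% within each species puts 40 flowers per species in the training set and 10 per species in the test set.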
Build and Train the Model
Below, the model is built with Species as the dependent variable and
the other four variables as the independent variables. The argument
‘ntree = 100’ sets the number of decision trees to 100. A larger
number of trees tends to give more stable and accurate predictions,
although the gains level off. The argument ‘mtry = 2’ sets the number of
variables randomly selected as split candidates at each node of the 100
decision trees. ‘mtry’ is set to 2 because, in classification models,
the number of variables considered at each node is typically the square
root of the number of predictors, and the square root of 4 is 2.
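The square-root rule can be checked with a quick calculation. This sketch assumes the rule is applied as floor(sqrt(p)), which is my understanding of the randomForest default for classification.

```r
data(iris)

# Number of predictors: every column except the response, Species.
n_predictors <- ncol(iris) - 1

# Square-root rule of thumb for classification forests (assumed: floor(sqrt(p))).
default_mtry <- floor(sqrt(n_predictors))
print(default_mtry)
```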
In the output below, the ‘OOB estimate of error rate’ of 4.17% means that 4.17% of the training samples (5 of 120) were misclassified when each sample was predicted using only the trees that did not see it during training (its out-of-bag trees). In the resulting confusion matrix, the actual species (true values) are in the rows and the predicted values are in the columns. Values on the diagonal are correct predictions, while off-diagonal values are classification errors.
set.seed(123)
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)
print(rf_model)
##
## Call:
## randomForest(formula = Species ~ ., data = train_data, ntree = 100, mtry = 2)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.17%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 40 0 0 0.000
## versicolor 0 38 2 0.050
## virginica 0 3 37 0.075
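The error figures in this output can be reproduced by hand from the confusion matrix. The sketch below copies the counts from the printed output rather than reading them from the fitted object.

```r
# OOB confusion matrix copied from the printed output (rows = actual, columns = predicted).
classes <- c("setosa", "versicolor", "virginica")
oob <- matrix(c(40,  0,  0,
                 0, 38,  2,
                 0,  3, 37),
              nrow = 3, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all training samples that were misclassified.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 3))
print(round(100 * oob_error, 2))
```

On the fitted model, the same matrix should be available as rf_model$confusion.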
Test the Model
Below, the model is run on the test data to evaluate its performance.
Using the test data, 28 of 30 predictions were correct, while 2 were misclassified (both were virginica flowers predicted as versicolor). Note that caret's confusion matrix places predictions in the rows and the reference (actual) values in the columns, the reverse of the randomForest output above. The correct classifications can be found along the diagonal of the confusion matrix, while the misclassifications are located in the off-diagonal cells.
The model's accuracy on the test data was 93.33%, meaning the model correctly classified 28 of the 30 test flowers. The 95% confidence interval was (0.7793, 0.9918), meaning the true accuracy of the model plausibly lies between 78% and 99%. The interval is wide because the test set contains only 30 flowers.
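The accuracy and its confidence interval can be recomputed from the two counts above. This sketch assumes the interval is an exact binomial (Clopper-Pearson) interval, which is what base R's binom.test returns and, as far as I know, what caret reports.

```r
correct <- 28  # test flowers classified correctly
total   <- 30  # test flowers overall

accuracy <- correct / total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total, conf.level = 0.95)$conf.int

print(round(accuracy, 4))
print(round(ci, 4))
```

The printed values can be compared against the Accuracy and 95% CI lines in the caret output.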
Interpreting the ‘Statistics by Class’
Setosa
The model correctly classified all setosa with 100% sensitivity and
specificity, meaning none were misclassified for that particular
species.
Versicolor
All versicolor flowers were correctly classified, giving 100%
sensitivity. Specificity was 90%, because 2 of the 20 non-versicolor
flowers (both virginica) were misclassified as versicolor. Precision
(Pos Pred Value) was 83.33%, since 10 of the 12 flowers predicted as
versicolor were truly versicolor.
Virginica
Sensitivity was 80%, meaning 20% of virginica (2 of 10) were
misclassified as versicolor. Precision was 100%, meaning all samples
predicted to be virginica were actually virginica.
rf_predictions <- predict(rf_model, test_data)
confusion_matrix <- confusionMatrix(rf_predictions, test_data$Species)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 2
## virginica 0 0 8
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.7793, 0.9918)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.747e-12
##
## Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8000
## Specificity 1.0000 0.9000 1.0000
## Pos Pred Value 1.0000 0.8333 1.0000
## Neg Pred Value 1.0000 1.0000 0.9091
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2667
## Detection Prevalence 0.3333 0.4000 0.2667
## Balanced Accuracy 1.0000 0.9500 0.9000
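The per-class statistics can be derived directly from the confusion matrix. The sketch below copies the counts from the output above and uses the usual one-versus-rest definitions; it is a hand calculation, not caret's internal code.

```r
# Test confusion matrix copied from the output (rows = prediction, columns = reference).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 2,
                0,  0, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

tp <- diag(cm)                # predicted as the class and truly the class
fp <- rowSums(cm) - tp        # predicted as the class but truly another
fn <- colSums(cm) - tp        # truly the class but predicted as another
tn <- sum(cm) - tp - fp - fn  # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # reported by caret as Pos Pred Value

print(round(rbind(sensitivity, specificity, precision), 4))
```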
Hyperparameter Tuning
Below, 5-fold cross-validation is used to evaluate model performance for
different values of mtry. In each of the 5 folds, 96 samples were used
for training and the remaining 24 were used for validation. Accuracy and
kappa were higher for mtry=2 and mtry=3 than for mtry=1, meaning the
model performed better when 2 or 3 variables were considered as split
candidates at each node, as opposed to 1. Note that mtry controls how
many variables are tried at each split, not how many predictors the
model uses overall. There was no difference in the model's accuracy or
kappa between mtry=2 and mtry=3, so trying a third variable at each
split did not improve the model.
tune_grid <- expand.grid(mtry = c(1, 2, 3))
set.seed(123)
tune_model <- train(Species ~ ., data = train_data,
                    method = "rf",
                    trControl = trainControl(method = "cv", number = 5),
                    tuneGrid = tune_grid)
print(tune_model)
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 96, 96, 96, 96, 96
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.9583333 0.9375
## 2 0.9666667 0.9500
## 3 0.9666667 0.9500
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
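The sample sizes of 96 in the summary follow from dividing the 120 training rows into 5 folds. The sketch below assigns folds with plain random shuffling in base R; this is a simplification, since caret builds its folds with class balance in mind.

```r
set.seed(123)
n <- 120  # rows in the training data
k <- 5    # number of folds

# Give every row a fold label 1..k in equal numbers, then shuffle the labels.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once for validation; the rest is used for training.
validation_sizes <- as.vector(table(fold_id))
training_sizes   <- n - validation_sizes

print(validation_sizes)
print(training_sizes)
```

On the fitted object, tune_model$results holds the accuracy and kappa table, and tune_model$bestTune holds the selected mtry.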
Mean Decrease Gini Analysis
Below, Mean Decrease Gini values are plotted to assess the importance of
each variable in the model. Mean Decrease Gini reflects how much each
variable contributes to reducing Gini impurity, which measures the
likelihood of misclassification in the dataset. Variables with higher
Mean Decrease Gini values contribute more to improving the model’s
predictive accuracy by minimizing impurity.
As seen below, Petal.Width is the most important variable in the model, as shown by the largest Mean Decrease Gini score. Petal.Length has the second largest Mean Decrease Gini score and is therefore the second most important variable in the model in terms of classification accuracy.
# Plot feature importance
varImpPlot(rf_model)
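To illustrate what a decrease in Gini impurity means, the sketch below computes it for a single hand-picked split on the full iris data. The threshold of 0.8 on Petal.Width is an assumption chosen for illustration, not a split taken from the fitted forest, and `gini` is a hypothetical helper.

```r
data(iris)

# Gini impurity of a set of class labels: 1 minus the sum of squared class shares.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

parent <- gini(iris$Species)

# Illustrative split: Petal.Width below 0.8 versus 0.8 and above.
left  <- iris$Species[iris$Petal.Width < 0.8]
right <- iris$Species[iris$Petal.Width >= 0.8]

# Impurity after the split, weighted by the size of each child node.
weighted_child <- (length(left) * gini(left) + length(right) * gini(right)) / nrow(iris)

gini_decrease <- parent - weighted_child
print(round(c(parent = parent, after_split = weighted_child, decrease = gini_decrease), 4))
```

The forest accumulates decreases like this one over every split that uses a variable and averages over the trees. The numeric values behind the plot should be available with importance(rf_model).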
Conclusion
In this analysis, we built and evaluated a random forest model to
classify iris species based on their sepal length, sepal width, petal
length and petal width. The model achieved an accuracy of 93% on the
test data with an out-of-bag error rate of about 4%, demonstrating the
model's high accuracy and low misclassification rate. Mean Decrease Gini
analysis showed that Petal.Width followed by Petal.Length were the most
important variables in predicting species. Cross-validation further
supported the model’s performance, with mtry=2 selected as the optimal
parameter (tied with mtry=3 for the highest accuracy and kappa values).