DATA 621: Blog 5

Introduction
In this blog, I will demonstrate how to use the random forest model to predict the species of iris flowers in the popular ‘Iris’ dataset. Random forest is a machine learning algorithm that constructs multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.
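To make the idea of combining tree outputs concrete, here is a minimal sketch of majority voting in base R. The vector `tree_votes` is a made-up illustration, not output from a fitted model; a real forest collects one vote per tree in the same way.

```r
# Hypothetical votes from five trees for a single flower (illustrative only)
tree_votes <- c("versicolor", "virginica", "versicolor", "versicolor", "setosa")

# Count the votes per class and keep the most frequent class
vote_counts <- table(tree_votes)
forest_prediction <- names(vote_counts)[which.max(vote_counts)]
forest_prediction
```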

Load Packages
We will use the randomForest and caret packages to demonstrate random forest.

#install.packages("randomForest")
library(randomForest)
library(caret)

Load Data
The data consists of 150 rows/observations and five variables.
The five variables are:

Sepal.Length = Sepal length in centimeters. Numerical variable.
Sepal.Width = Sepal width in centimeters. Numerical variable.
Petal.Length = Petal length in centimeters. Numerical variable.
Petal.Width = Petal width in centimeters. Numerical variable.
Species = Three species of iris: setosa, versicolor or virginica. Categorical variable.

data(iris) #load data
str(iris) #structure of data

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris) #first six rows
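Before modeling, it can help to confirm that the three species are equally represented and that no values are missing. A quick check using only base R (the object name `species_counts` is just for illustration):

```r
data(iris)

# Number of observations per species
species_counts <- table(iris$Species)
species_counts

# Number of missing values in each column
colSums(is.na(iris))
```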

Split into Test and Train
The goal of this analysis, as previously stated, is to predict the iris species based on sepal length, sepal width, petal length, and petal width using random forest modeling. Splitting the dataset into training and testing sets allows us to train the model on known data and evaluate its performance on the unseen test data. This helps us determine how well the model predicts species on new data.

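The code below splits the dataset into train and test partitions in an 80(train):20(test) ratio. Because createDataPartition samples within each species, both partitions keep the same class proportions as the full dataset. The 80:20 ratio is commonly employed in machine learning models.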

set.seed(123) 
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
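As a sanity check on the split, the sketch below repeats the partitioning and then counts rows and species in each partition. It assumes the same seed and arguments used above.

```r
library(caret)
data(iris)

set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data  <- iris[-train_index, ]

# Rows in each partition
c(train = nrow(train_data), test = nrow(test_data))

# Species counts per partition; the split is stratified by species
table(train_data$Species)
table(test_data$Species)
```

With 50 flowers per species, an 80:20 stratified split should leave 40 of each species for training and 10 of each for testing, which matches the row totals in the confusion matrices later in this post.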

Build and Train the Model
Below, the model is built with Species as the dependent variable and the other four variables as the independent variables. The argument ‘ntree = 100’ sets the number of decision trees to 100. A larger number of trees tends to give more stable and accurate predictions, although the gains level off as more trees are added. The argument ‘mtry = 2’ sets the number of variables that are randomly sampled as candidates for splitting at each node in the 100 decision trees. It is set to 2 because, in classification models, the number of variables considered at each node is typically chosen as the square root of the number of predictor variables, and here the square root of 4 is 2.
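The square-root rule can be worked out directly. The sketch below uses only base R; `mtry_rule` is an illustrative name. For classification, randomForest's default mtry is floor(sqrt(p)), where p is the number of predictors.

```r
data(iris)

# Number of predictors: every column except Species
n_predictors <- ncol(iris) - 1

# Square-root rule for classification
mtry_rule <- floor(sqrt(n_predictors))
mtry_rule
```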

Below, the ‘OOB estimate of error rate’ of 4.17% means that 5 of the 120 training samples were misclassified when each sample was predicted using only the trees that did not see it during training (its out-of-bag trees). In the resulting confusion matrix, the actual species (true values) are represented in the rows, while the predicted values are represented in the columns. Diagonal values represent correct predictions, while off-diagonal values are classification errors.

set.seed(123)  
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)
print(rf_model)

## 
## Call:
##  randomForest(formula = Species ~ ., data = train_data, ntree = 100,      mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0       0.000
## versicolor      0         38         2       0.050
## virginica       0          3        37       0.075
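The error rates in this output follow directly from the confusion matrix. As a small worked example, the sketch below re-enters the matrix by hand (values copied from the output above) and recomputes the overall OOB error and the per-class errors.

```r
# OOB confusion matrix copied from the printed output
# (rows = actual species, columns = predicted species)
species <- c("setosa", "versicolor", "virginica")
oob_cm <- matrix(c(40,  0,  0,
                    0, 38,  2,
                    0,  3, 37),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(actual = species, predicted = species))

# Overall OOB error: share of samples off the diagonal
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: share of each actual species that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(oob_error, 4)
round(class_error, 3)
```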

Test the Model
Below, the model is run on the test data to evaluate its performance.

Using the test data, 28 of the 30 predictions were correct, while 2 were misclassified. Both errors were virginica flowers predicted as versicolor, so they count as false negatives for virginica and false positives for versicolor. The correct classifications can be found along the diagonal of the confusion matrix, while the misclassifications are located in the off-diagonal cells.

The model’s accuracy on the test data was 93.33%, meaning the model correctly classified 28 of the 30 test samples. The 95% confidence interval was (0.7793, 0.9918), meaning the true accuracy of the model is likely between 78% and 99%. The interval is wide because the test set contains only 30 samples.
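The accuracy and its interval can be reproduced from the two counts alone. caret's confusionMatrix documents that it uses binom.test for this interval, so a base R sketch with the same function should give matching bounds. The variable names are illustrative.

```r
correct <- 28  # correct test predictions
total   <- 30  # test samples

accuracy <- correct / total

# Exact (Clopper-Pearson) 95% confidence interval for a binomial proportion
ci <- binom.test(correct, total)$conf.int

round(accuracy, 4)
round(ci, 4)
```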

Interpreting the ‘Statistics by Class’
Setosa
The model classified setosa with 100% sensitivity and specificity, meaning no setosa flower was missed and no flower of another species was predicted to be setosa.

Versicolor
All versicolor were correctly classified, giving 100% sensitivity. Specificity was 90% because 2 of the 20 non-versicolor samples (both virginica) were misclassified as versicolor. Precision (Pos Pred Value) was 83.33%, meaning 10 of the 12 samples predicted to be versicolor were actually versicolor.

Virginica
Sensitivity was 80%, meaning 20% of virginica were misclassified as versicolor. Precision was 100%, meaning all samples predicted to be virginica were actually virginica.

rf_predictions <- predict(rf_model, test_data)
confusion_matrix <- confusionMatrix(rf_predictions, test_data$Species)
print(confusion_matrix)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 8.747e-12       
##                                           
##                   Kappa : 0.9             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000
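The per-class statistics can be recomputed by hand from the confusion matrix, which makes the definitions easier to see. The sketch below re-enters the matrix from the output above and derives sensitivity, specificity, and precision (Pos Pred Value) for each species in a one-vs-rest fashion.

```r
# Test confusion matrix copied from the printed output
# (rows = predicted species, columns = reference species)
species <- c("setosa", "versicolor", "virginica")
test_cm <- matrix(c(10,  0, 0,
                     0, 10, 2,
                     0,  0, 8),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(prediction = species, reference = species))

tp <- diag(test_cm)                # correctly predicted as the class
fp <- rowSums(test_cm) - tp        # predicted as the class, actually another
fn <- colSums(test_cm) - tp        # actually the class, predicted as another
tn <- sum(test_cm) - tp - fp - fn  # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

round(rbind(sensitivity, specificity, precision), 4)
```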

Hyperparameter Tuning
Below, 5-fold cross-validation was used to evaluate model performance. In each of the 5 folds, 96 samples were used for training and the remaining 24 were used for validation. Accuracy and kappa were higher for mtry=2 and mtry=3 than for mtry=1, meaning that the model performed better when 2 or 3 predictors were considered as split candidates at each node, as opposed to 1.

Also of note, there was no difference in the model’s accuracy and kappa between mtry=2 and mtry=3, meaning that considering a third candidate predictor at each split did not improve the model. As the output shows, the final model used mtry=2.

tune_grid <- expand.grid(mtry = c(1, 2, 3))
set.seed(123)
tune_model <- train(Species ~ ., data = train_data,
                    method = "rf",
                    trControl = trainControl(method = "cv", number = 5),
                    tuneGrid = tune_grid)
print(tune_model)

## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 96, 96, 96, 96, 96 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa 
##   1     0.9583333  0.9375
##   2     0.9666667  0.9500
##   3     0.9666667  0.9500
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
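The sample sizes of 96 follow from holding out one fifth of the 120 training rows in each fold. The sketch below illustrates this with caret's createFolds. It is only an illustration of the fold sizes; the folds it draws are not necessarily the same ones that train() used internally.

```r
library(caret)
data(iris)

# Recreate the training partition used in this post
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]

# Draw 5 stratified folds; each list element holds the held-out row indices
folds <- createFolds(train_data$Species, k = 5)

held_out_sizes <- sapply(folds, length)
training_sizes <- nrow(train_data) - held_out_sizes

held_out_sizes
training_sizes
```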

Mean Decrease Gini Analysis
Below, Mean Decrease Gini values are plotted to assess the importance of each variable in the model. Mean Decrease Gini reflects how much each variable contributes to reducing Gini impurity, which measures the likelihood of misclassification in the dataset. Variables with higher Mean Decrease Gini values contribute more to the model’s predictive accuracy by reducing impurity.
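For intuition, Gini impurity for a set of labels is one minus the sum of the squared class proportions. The sketch below defines a small helper (`gini_impurity` is a hypothetical name, not a function from randomForest) and applies it to a perfectly mixed set and a pure set.

```r
# Hypothetical helper: Gini impurity = 1 - sum of squared class proportions
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

data(iris)

# Three equally sized species: 1 - 3 * (1/3)^2
mixed_gini <- gini_impurity(iris$Species)

# A node containing only setosa is pure, so its impurity is 0
pure_gini <- gini_impurity(iris$Species[iris$Species == "setosa"])

round(c(mixed = mixed_gini, pure = pure_gini), 4)
```

A split that moves a node from the mixed case toward the pure case lowers impurity, and Mean Decrease Gini credits that reduction to the variable used for the split.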

As seen below, Petal.Width is the most important variable in the model, as denoted by the largest Mean Decrease Gini score. Petal.Length has the second largest Mean Decrease Gini score and is therefore the second most important variable in the model in terms of classification accuracy.

# Plot feature importance
varImpPlot(rf_model)
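The numbers behind the plot can also be printed as a table. The sketch below refits the model with the same seed and arguments used earlier, then sorts the output of randomForest's importance() function. Exact values may vary with package and R versions.

```r
library(randomForest)
library(caret)
data(iris)

# Recreate the training partition and the model from this post
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]

set.seed(123)
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)

# importance() returns a matrix with a MeanDecreaseGini column
imp <- importance(rf_model)

# Sort variables from most to least important
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
imp_sorted
```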

Conclusion
In this analysis, we built and evaluated a random forest model to classify iris species based on their sepal length, sepal width, petal length and petal width. The model achieved a test accuracy of 93% with an out-of-bag error rate of 4.17%, demonstrating the model’s high accuracy and low misclassification rate. Mean Decrease Gini analysis determined that Petal.Width, followed by Petal.Length, were the most important variables in predicting species. Cross-validation further supported the model’s performance, with mtry=2 and mtry=3 tied for the highest accuracy and Kappa values and mtry=2 used for the final model.

Gregg Maloy