In the realm of botany and data science, the classification of plant species based on their morphological characteristics has long been a fundamental task. With advancements in machine learning and data analysis techniques, researchers now have powerful tools at their disposal to automate and enhance this classification process. The "iris" dataset, a well-known dataset in R, provides a rich foundation for exploring these methodologies. This study seeks to leverage machine learning algorithms to classify iris flowers into different species based on their distinct morphological attributes.
Understanding and accurately identifying plant species is crucial for various scientific, ecological, and horticultural endeavors. The ability to classify iris flowers accurately can aid researchers in studying species distributions, ecological patterns, and evolutionary relationships. Moreover, in horticulture and agriculture, precise species identification is essential for plant breeding, conservation efforts, and crop management.
By applying machine learning algorithms to the "iris" dataset, this study aims to:
Enhance Species Identification: Machine learning algorithms can analyze large datasets of iris flower measurements and learn patterns that distinguish between different species. This automated approach can provide more consistent and reliable species identification compared to manual methods.
Improve Research Efficiency: Automating the species classification process can save researchers significant time and effort, allowing them to focus on higher-level analysis and interpretation of results. This efficiency can accelerate botanical research and facilitate the exploration of broader ecological questions.
Facilitate Education and Training: The study serves as an educational resource for students and enthusiasts interested in both botany and data science. By demonstrating the application of machine learning algorithms to a well-known botanical dataset like "iris," learners can gain practical experience in data analysis and understand the interdisciplinary nature of modern scientific research.
Showcase the Power of Data Science: By showcasing the application of machine learning algorithms in botany, this study highlights the transformative potential of data-driven approaches in various scientific disciplines. It underscores how advancements in data science can complement traditional botanical methods and lead to new insights and discoveries.
This study uses the random forest algorithm to classify iris species from morphological measurements. The iris dataset records Sepal Length, Sepal Width, Petal Length, and Petal Width for 150 flowers, 50 from each of three species (setosa, versicolor, and virginica). The methodology involves preprocessing the data, exploring its characteristics, splitting it into training and test sets, training the model, and assessing the model's performance.
Data Preprocessing: Preprocessing begins with loading the data into R, followed by any necessary cleaning and manipulation. The iris dataset ships with R and is already clean and well structured, so no cleaning is needed here.
# Loading the iris dataset
data(iris)
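The claim that the data need no cleaning can be checked directly. The following is a minimal sketch using only base R; the variable names `missing_per_column` and `species_levels` are illustrative and not part of the original analysis.

```r
data(iris)

# Count missing values in each column; all zeros means nothing to impute
missing_per_column <- colSums(is.na(iris))
missing_per_column

# Confirm the response is a factor with the three expected species
species_levels <- levels(iris$Species)
species_levels

# Confirm every measurement column is numeric
sapply(iris[, 1:4], is.numeric)
```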
Exploratory Data Analysis: Let's take a look at the structure of the dataset, summary statistics, and visualize the data to understand its characteristics.
# Viewing the structure of the dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
# Visualize the data (ggplot() comes from the ggplot2 package)
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
# Correlation matrix
correlation_matrix <- cor(iris[,1:4])
correlation_matrix
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
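Data Split: The dataset is divided into a training set (70% of the observations) and a test set (the remaining 30%) so that the model can be evaluated on data it was not trained on.
# sample.split() comes from the caTools package
library(caTools)
# Fix the random seed so the split is reproducible
set.seed(123)
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_data <- subset(iris, split == TRUE)
test_data <- subset(iris, split == FALSE)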
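For readers without the caTools package, a comparable 70/30 split that keeps the species balanced can be sketched in base R. This is an illustrative alternative, not the method used in this study; the names `idx_by_species`, `train_idx`, `train_sketch`, and `test_sketch` are hypothetical, and the rows selected will differ from those chosen by `sample.split()`.

```r
data(iris)
set.seed(123)

# Group row indices by species
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Draw 70% of the rows within each species so class proportions are preserved
train_idx <- unlist(lapply(idx_by_species, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_sketch <- iris[train_idx, ]
test_sketch <- iris[-train_idx, ]

# Check the class balance in each set
table(train_sketch$Species)
table(test_sketch$Species)
```

With 50 flowers per species, this yields 35 training and 15 test flowers per class, matching the per-class test counts seen in the confusion matrix of this study.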
Random Forest Model Training: A random forest classifier is trained on the training set using the "randomForest" package in R.
library(randomForest)
model_rf <- randomForest(Species ~ ., data = train_data)
Model Performance: The model is evaluated on the test data using accuracy, precision, and recall.
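A fitted random forest can also be inspected before it is applied to the test set. The following is a minimal sketch, assuming the randomForest package is installed; it fits on the full iris data purely for illustration (the study itself fits on `train_data`), and the name `rf_sketch` is hypothetical.

```r
library(randomForest)  # assumed to be installed

data(iris)
set.seed(123)

# Illustrative fit on the full dataset; the study fits on train_data instead
rf_sketch <- randomForest(Species ~ ., data = iris)

# Printing the model shows the out-of-bag (OOB) error estimate
# and the OOB confusion matrix
print(rf_sketch)

# Mean decrease in Gini impurity for each predictor
importance(rf_sketch)
```

The out-of-bag estimate comes from the trees that did not see a given observation during training, so it gives a rough check on generalization that complements the held-out test set.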
# Make predictions on the test data
predictions <- predict(model_rf, test_data)
# Confusion matrix
confusion_matrix <- table(predictions, test_data$Species)
print(confusion_matrix)
##
## predictions  setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         12         1
##   virginica       0          3        14
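# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Per-class precision: rows of the table are the predicted classes
precision <- diag(confusion_matrix) / rowSums(confusion_matrix)
# Per-class recall: columns of the table are the actual classes
recall <- diag(confusion_matrix) / colSums(confusion_matrix)
# cbind() recycles the single accuracy value across all three rows
RFsummary <- cbind(accuracy, precision, recall)
RFsummary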
##             accuracy precision    recall
## setosa     0.9111111 1.0000000 1.0000000
## versicolor 0.9111111 0.9230769 0.8000000
## virginica  0.9111111 0.8235294 0.9333333
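The evaluation metrics show how the random forest model performs on the 45 test flowers across the three classes: setosa, versicolor, and virginica. Overall accuracy is 91.11% (41 of 45 correct). Precision, the share of predictions for a class that are correct, ranges from 82.35% (virginica) to 100% (setosa). Recall, the share of a class's flowers that are correctly identified, ranges from 80.00% (versicolor) to 100% (setosa). Setosa is classified perfectly; all four errors are confusions between versicolor and virginica (three versicolor flowers predicted as virginica and one virginica flower predicted as versicolor).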
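Precision and recall can be combined into a per-class F1 score, the harmonic mean of the two. The following is a minimal sketch that rebuilds the confusion matrix from the counts reported above rather than from the fitted model; the names `cm`, `f1`, and `macro_f1` are illustrative and not part of the original analysis.

```r
# Confusion matrix as reported above: rows = predicted, columns = actual.
# matrix() fills column by column, so each group of three is one actual class.
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(15,  0,  0,   # actual setosa
     0, 12,  3,   # actual versicolor
     0,  1, 14),  # actual virginica
  nrow = 3,
  dimnames = list(predicted = classes, actual = classes)
)

precision <- diag(cm) / rowSums(cm)
recall <- diag(cm) / colSums(cm)

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 4)
# setosa 1.0000, versicolor 0.8571, virginica 0.8750

# Macro-averaged F1: unweighted mean over the three classes
macro_f1 <- mean(f1)
round(macro_f1, 4)
# 0.9107
```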
In summary, the study on iris species classification using machine learning algorithms holds significant importance in advancing botanical research, enhancing species identification processes, and promoting interdisciplinary collaboration between botany and data science. Through this endeavor, we aim to contribute to the broader understanding of plant diversity and ecological dynamics while showcasing the potential of data-driven approaches in botanical science.