The goal of this project is to build a machine learning model that accurately classifies iris flowers into three species based on their sepal length, sepal width, petal length, and petal width. The model should identify the features that best distinguish the species and provide insight into the biology and ecology of iris flowers. The ultimate goal is a reliable and efficient method for identifying iris species in the wild, which can be used for conservation, research, and horticulture.
library(datasets)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
This shows the dimensions of the dataset (150 observations of 5 variables) and the data type of each variable.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
This will give you the minimum, maximum, mean, median, and quartile values for each numeric variable, along with the number of observations of each species.
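Because summary() pools all three species, it can hide differences between them. A minimal sketch, using base R's aggregate(), of the same kind of summary broken down by species (the name species_means is illustrative and not part of the original analysis):

```r
# Mean of each measurement within each species (base R only).
data(iris)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Replacing mean with median or sd gives the corresponding per-species statistic.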
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Scatter plot of Sepal Length vs Sepal Width", xlab = "Sepal Length", ylab = "Sepal Width", col = iris$Species)
This will create a scatter plot of the Sepal Length vs Sepal Width variables, with each species of iris flower represented by a different color.
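The colors in such a plot are the integer codes of the Species factor, so a legend helps the reader map colors back to species. A small sketch of the same idea for the petal measurements, assuming the default color palette (the name species_cols is illustrative):

```r
data(iris)
# One color index per species, in factor-level order.
species_cols <- setNames(seq_along(levels(iris$Species)), levels(iris$Species))
plot(iris$Petal.Length, iris$Petal.Width,
     main = "Scatter plot of Petal Length vs Petal Width",
     xlab = "Petal Length", ylab = "Petal Width",
     col = species_cols[as.character(iris$Species)], pch = 19)
# Legend that maps each color back to its species name.
legend("topleft", legend = names(species_cols), col = species_cols, pch = 19)
```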
hist(iris$Sepal.Length, main = "Histogram of Sepal Length", xlab = "Sepal Length", col = "lightblue")
hist(iris$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Sepal Width", col = "lightblue")
This will create histograms of the Sepal Length and Sepal Width variables, pooled over all three species. hist() applies col to the bars rather than to individual observations, so a single fill color is used for each histogram.
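To compare the distribution of a measurement across species, the counts have to be computed per species on shared bins. A rough sketch in base R, assuming half-centimeter bins from 4 to 8 (the bin edges and the names breaks, hists, and counts are illustrative choices):

```r
data(iris)
# Shared bin edges so the three species are directly comparable.
breaks <- seq(4, 8, by = 0.5)
# One histogram object per species; plot = FALSE returns the counts only.
hists <- lapply(split(iris$Sepal.Length, iris$Species),
                hist, breaks = breaks, plot = FALSE)
# Rows = species, columns = bins.
counts <- t(sapply(hists, function(h) h$counts))
colnames(counts) <- paste0(head(breaks, -1), "-", tail(breaks, -1))
# Side-by-side bars, one color per species.
barplot(counts, beside = TRUE, col = 1:3, legend.text = rownames(counts),
        main = "Sepal Length by Species", xlab = "Sepal Length", ylab = "Count")
```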
cor(iris[,1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
This will show you the correlation coefficients between the Sepal Length, Sepal Width, Petal Length, and Petal Width variables.
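Reading the strongest relationship off the matrix by eye works for four variables but not for many. A minimal sketch of finding the most strongly correlated pair programmatically (the names idx and strongest are illustrative):

```r
data(iris)
corr <- cor(iris[, 1:4])
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
corr[!upper.tri(corr)] <- NA
# Locate the pair with the largest absolute correlation.
idx <- which(abs(corr) == max(abs(corr), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(corr)[idx[1, "row"]], colnames(corr)[idx[1, "col"]])
print(strongest)
print(corr[idx[1, "row"], idx[1, "col"]])
```

For iris this picks out Petal.Length and Petal.Width, matching the 0.9628654 entry in the matrix.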
Preprocess the iris dataset by splitting it into training and testing sets, and standardizing the data if necessary.
library(datasets)
data(iris)
Install the caret package if you have not done so already, then load it.
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Loading required package: lattice
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[trainIndex, ]
testing <- iris[-trainIndex, ]
In this example, we are splitting the data into 70% training and 30% testing sets using the createDataPartition() function.
We also set a random seed to ensure reproducibility.
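createDataPartition() samples within each level of the outcome, so the species proportions should be preserved in both subsets. A quick check, as a sketch that repeats the split above with the same seed:

```r
library(caret)
data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[trainIndex, ]
testing <- iris[-trainIndex, ]
# Class counts in each subset; a stratified split keeps them balanced.
print(table(training$Species))
print(table(testing$Species))
```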
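trainScaled <- scale(training[,1:4])
training[,1:4] <- trainScaled
testing[,1:4] <- scale(testing[,1:4], center = attr(trainScaled, "scaled:center"), scale = attr(trainScaled, "scaled:scale"))
In this example, we are standardizing the sepal length, sepal width, petal length, and petal width variables in both the training and testing sets. The testing set is scaled with the means and standard deviations computed from the training set, so both sets share the same transformation and no information from the test data leaks into preprocessing.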
Holding out a testing set lets us estimate how well the model generalizes to unseen flowers and detect overfitting to the training set, and the stratified split keeps the species proportions the same in both sets.
Standardizing the data can also improve the performance of certain machine learning algorithms, such as those that rely on distance-based metrics.
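When standardizing, the centering and scaling parameters are usually estimated on the training set and then reused for the testing set. One way to do this, sketched here with caret's preProcess() (the names pp, training_std, and testing_std are illustrative):

```r
library(caret)
data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[trainIndex, ]
testing <- iris[-trainIndex, ]
# Learn centering and scaling parameters from the training features only.
pp <- preProcess(training[, 1:4], method = c("center", "scale"))
# Apply the same parameters to both subsets.
training_std <- predict(pp, training[, 1:4])
testing_std <- predict(pp, testing[, 1:4])
# Training columns now have mean ~0 and sd ~1; test columns are close but not exact.
print(round(colMeans(training_std), 3))
print(round(colMeans(testing_std), 3))
```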
Select relevant features that can help predict the species of the iris flower. You can use methods such as correlation analysis or principal component analysis (PCA) to identify the most important features.
Load the iris dataset and the caret package.
data(iris)
library(caret)
iris_df <- as.data.frame(iris)
corr <- cor(iris_df[,1:4])
This will calculate the correlation matrix between the sepal length, sepal width, petal length, and petal width variables in the iris dataset.
library(ggplot2)
ggplot(data = reshape2::melt(corr), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
This will produce a heatmap of the pairwise correlations. With this white-to-red scale, strongly positive correlations appear in deeper red, while negative correlations appear closest to white.
pca <- prcomp(iris[,1:4], center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
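The summary reports how much variance each component captures (about 95.8% for the first two together) but not which measurements drive each component. A minimal sketch of inspecting the loadings and plotting the flowers on the first two components (the name var_explained is illustrative):

```r
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# Loadings: contribution of each original feature to each component.
print(round(pca$rotation, 3))
# Proportion of variance explained, computed from the standard deviations.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 4))
# Scores on the first two components, colored by species.
plot(pca$x[, 1:2], col = iris$Species, main = "Iris projected onto PC1 and PC2")
```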
Choose an appropriate machine learning algorithm for the task. Some popular algorithms for the Iris Flower Classification problem include k-nearest neighbors (KNN), decision trees, and logistic regression.
For this particular project, we will be using the KNN algorithm.
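As a rough illustration of what KNN does, the sketch below classifies a single flower by majority vote among its k nearest neighbors under Euclidean distance. The helper knn_predict is hypothetical and written only for this example; it is not the implementation caret uses.

```r
# Hypothetical helper: predict the class of one observation by majority vote
# among its k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Subtract the new observation from every training row, feature by feature.
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(unlist(new_x)))
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points.
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  # Ties are broken in favor of the first factor level.
  names(votes)[which.max(votes)]
}

data(iris)
# Classify the first flower using all other flowers as training data.
pred <- knn_predict(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 5)
print(pred)
```

Because the vote depends on distances, features measured on larger scales dominate unless the data are standardized first.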
Load the caret library.
library(caret)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
This will split the data into a training set (trainData) and a testing set (testData), with 70% of the data in the training set.
knnModel <- train(Species ~ ., data = trainData, method = "knn", trControl = trainControl(method = "cv"), tuneLength = 10)
In this example, we’re using the train function to train a KNN model (method = "knn") on the training data (data = trainData). We’re also specifying cross-validation (trControl = trainControl(method = "cv")) to estimate the model’s performance, and asking train to evaluate 10 candidate values of the k parameter (tuneLength = 10).
knnPredictions <- predict(knnModel, testData)
This will generate predicted species labels (knnPredictions) for the testing set using the trained KNN model.
confusionMatrix(knnPredictions, testData$Species)
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         1
##   virginica       0          0        14
##
## Overall Statistics
##
##                Accuracy : 0.9778
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9667
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9333
## Specificity                 1.0000            0.9667           1.0000
## Pos Pred Value              1.0000            0.9375           1.0000
## Neg Pred Value              1.0000            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9833           0.9667
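This will generate a confusion matrix that shows the number of correctly and incorrectly classified instances for each species. In this run, 44 of the 45 test flowers are classified correctly; the single error is a virginica flower predicted as versicolor.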
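The headline statistics follow directly from the counts in the matrix. As a worked sketch, the counts reported above are entered by hand and the accuracy, per-class sensitivity, and positive predictive value are recomputed (the names cm, accuracy, sensitivity, and ppv are illustrative):

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = reference).
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(15,  0,  0,
                0, 15,  1,
                0,  0, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))
# Accuracy: correctly classified (diagonal) over all test instances.
accuracy <- sum(diag(cm)) / sum(cm)
# Sensitivity per class: correct predictions over the reference (column) totals.
sensitivity <- diag(cm) / colSums(cm)
# Positive predictive value per class: correct predictions over the predicted (row) totals.
ppv <- diag(cm) / rowSums(cm)
print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(ppv, 4))
```

These reproduce the reported accuracy of 0.9778, the virginica sensitivity of 0.9333, and the versicolor positive predictive value of 0.9375.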
Note that you may need to tune the k parameter and other hyperparameters of the KNN algorithm to improve its performance. You can do this using the train function’s tuneLength parameter, which specifies how many candidate values to try for each tuning parameter. You can also supply an explicit grid of candidate values using the tuneGrid parameter.
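A minimal sketch of the tuneGrid approach, assuming odd values of k from 1 to 21 as the candidates (the grid itself and the names kGrid and knnTuned are illustrative choices, not values from the analysis above):

```r
library(caret)
data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
# Candidate values of k to evaluate; odd values are an illustrative choice.
kGrid <- expand.grid(k = seq(1, 21, by = 2))
knnTuned <- train(Species ~ ., data = trainData, method = "knn",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = kGrid)
# Cross-validated accuracy for each k, and the selected value.
print(knnTuned$results[, c("k", "Accuracy")])
print(knnTuned$bestTune)
```

The selected k can vary with the random seed and the fold assignment, so it is worth checking how flat the accuracy curve is before reading much into a single best value.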