Objective:

To build a machine learning model that accurately classifies iris flowers into three species based on their sepal length, sepal width, petal length, and petal width. The model should be able to identify the most important features that distinguish between the different species and provide insights into the biology and ecology of the iris flowers. The ultimate goal of the project is to develop a reliable and efficient method for identifying the species of iris flowers in the wild, which can be used for conservation, research, and horticulture purposes.

Loading the data for exploration

Load the dataset into R using the “datasets” package:

library(datasets)
data(iris)

View the structure of the dataset:

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

This shows the dimensions of the dataset and the data type of each variable.
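
As a quick complement to str(), the sketch below checks the dimensions and column classes directly. It uses only base R functions on the built-in iris data; the variable names dims and col_classes are chosen here for illustration.

```r
# Dimensions: number of rows (observations) and columns (variables)
dims <- dim(iris)
print(dims)

# Class of each column: four numeric measurements and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# Species labels stored in the factor
print(levels(iris$Species))
```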

View the summary statistics of the dataset:

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

This gives the minimum, maximum, mean, median, and quartile values for each numeric variable, and the number of observations in each species.
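
Because the goal is to separate species, it may also help to summarize the measurements per species rather than over the whole dataset. The following is a minimal sketch using base R's aggregate(); species_means is a name introduced here for illustration.

```r
# Mean of each measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Large gaps between the species means for a variable suggest that the variable may be useful for classification.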

Plot the variables against each other to identify any patterns or trends:

plot(iris$Sepal.Length, iris$Sepal.Width, main = "Scatter plot of Sepal Length vs Sepal Width", xlab = "Sepal Length", ylab = "Sepal Width", col = iris$Species)

This will create a scatter plot of the Sepal Length vs Sepal Width variables, with each species of iris flower represented by a different color.
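
The same approach can be applied to the petal measurements, and a legend makes the colors easier to read. This is a sketch only: the palette in species_cols is an assumption made here, and any three distinct colors would work.

```r
# Assumed palette: one named color per species
species_cols <- c(setosa = "black", versicolor = "red", virginica = "green3")

# Look up the color for each observation by its species name
point_cols <- species_cols[as.character(iris$Species)]

plot(iris$Petal.Length, iris$Petal.Width,
     main = "Scatter plot of Petal Length vs Petal Width",
     xlab = "Petal Length", ylab = "Petal Width",
     col = point_cols, pch = 19)

# Legend mapping each color back to its species
legend("topleft", legend = names(species_cols), col = species_cols,
       pch = 19, title = "Species")
```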

Plot histograms of the variables to see their distributions:

hist(iris$Sepal.Length, main = "Histogram of Sepal Length", xlab = "Sepal Length", col = "steelblue")

hist(iris$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Sepal Width", col = "steelblue")

This will create histograms of the Sepal Length and Sepal Width variables across all three species. Note that hist() applies col to the bars rather than to individual observations, so it cannot color the bars by species.
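
To compare the distribution of a variable across species, one option is to overlay one semi-transparent histogram per species on shared bins. The sketch below does this for Sepal Length in base R; the bin width of 0.25 and the colors in fill are assumptions chosen for illustration.

```r
# Shared bins covering the full range of Sepal Length (4.3 to 7.9)
breaks <- seq(4, 8, by = 0.25)

# Assumed semi-transparent fill color per species
fill <- c(setosa = adjustcolor("black", 0.4),
          versicolor = adjustcolor("red", 0.4),
          virginica = adjustcolor("green3", 0.4))

# Bin counts per species, used only to size the y axis
counts <- lapply(split(iris$Sepal.Length, iris$Species),
                 function(x) hist(x, breaks = breaks, plot = FALSE)$counts)
ymax <- max(unlist(counts))

# Empty canvas, then one histogram per species drawn on top
plot(NULL, xlim = range(breaks), ylim = c(0, ymax),
     main = "Histogram of Sepal Length by Species",
     xlab = "Sepal Length", ylab = "Frequency")
for (sp in levels(iris$Species)) {
  hist(iris$Sepal.Length[iris$Species == sp], breaks = breaks,
       col = fill[[sp]], add = TRUE)
}
legend("topright", legend = names(fill), fill = fill, title = "Species")
```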

Calculate the correlation coefficients between the variables:

cor(iris[,1:4])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

This will show you the correlation coefficients between the Sepal Length, Sepal Width, Petal Length, and Petal Width variables.
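
These coefficients are computed over all three species pooled together, which can hide relationships that hold within a single species. As a hedged sketch, the correlation between Sepal Length and Sepal Width can be recomputed within each species; within_cor is a name introduced here.

```r
# Correlation between sepal length and sepal width, computed separately per species
within_cor <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Sepal.Width))
print(round(within_cor, 3))
```

Comparing these values with the pooled coefficient of -0.12 shows whether the weak overall correlation also holds inside each group.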

Preprocess the data:

Preprocess the iris dataset by splitting it into training and testing sets, and standardizing the data if necessary.

Load the iris dataset into R using the datasets package:

library(datasets)
data(iris)

Split the data into training and testing sets using the caret package. Install the caret package if you have not done so already.

library(caret)

## Warning: package 'caret' was built under R version 4.2.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.3

## Loading required package: lattice

set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[trainIndex, ]
testing <- iris[-trainIndex, ]

In this example, we are splitting the data into 70% training and 30% testing sets using the createDataPartition() function.

We also set a random seed to ensure reproducibility.
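
createDataPartition() samples within each level of the outcome, so the class proportions are preserved in both sets. The base R sketch below illustrates that idea only; it is not caret's implementation, it will not select the same rows, and the names train_sketch and test_sketch are hypothetical.

```r
set.seed(123)

# Row indices grouped by species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Sample 70% of the rows within each species
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}))

train_sketch <- iris[train_rows, ]
test_sketch <- iris[-train_rows, ]

# Each species should contribute equally to both sets
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```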

If necessary, standardize the data by scaling the variables to have zero mean and unit variance using the scale() function:

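# Standardize the training set, then apply the same center and scale to the testing set
scaledTraining <- scale(training[,1:4])
testing[,1:4] <- scale(testing[,1:4], center = attr(scaledTraining, "scaled:center"), scale = attr(scaledTraining, "scaled:scale"))
training[,1:4] <- scaledTraining

In this example, we are standardizing the sepal length, sepal width, petal length, and petal width variables. The means and standard deviations are computed from the training set only and then applied to the testing set, so that both sets are on the same scale and no information from the testing set leaks into preprocessing.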

Because createDataPartition() samples within each species, both sets keep the same class proportions as the full dataset. Holding out a testing set does not prevent overfitting by itself, but it lets us detect it by checking the model on data it has not seen.

Standardizing the data can also improve the performance of certain machine learning algorithms, such as those that rely on distance-based metrics.
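
To see why scale matters for distance-based methods, the sketch below compares how much each feature contributes to the squared Euclidean distance between two flowers, before and after standardization. The choice of rows 1 and 51 is arbitrary and made here purely for illustration.

```r
features <- iris[, 1:4]

# Squared difference per feature between two flowers (rows 1 and 51, chosen arbitrarily)
raw_sq <- (as.numeric(features[1, ]) - as.numeric(features[51, ]))^2

# The same comparison after standardizing every feature
scaled <- scale(features)
scaled_sq <- (scaled[1, ] - scaled[51, ])^2

# Share of the total squared distance contributed by each feature
shares <- rbind(raw = raw_sq / sum(raw_sq),
                scaled = scaled_sq / sum(scaled_sq))
colnames(shares) <- names(features)
print(round(shares, 3))
```

If one feature dominates the raw row far more than the scaled row, that feature's larger numeric range is driving the distance rather than its relevance.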

Feature selection

Select relevant features that can help predict the species of the iris flower. You can use methods such as correlation analysis or principal component analysis (PCA) to identify the most important features.

Load the iris dataset and caret package.

data(iris)

library(caret)

Copy the dataset into a working data frame (iris is already a data frame, so as.data.frame() returns it unchanged).

iris_df <- as.data.frame(iris)

Examine the correlation between variables using the cor() function:

corr <- cor(iris_df[,1:4])

This will calculate the correlation matrix between the sepal length, sepal width, petal length, and petal width variables in the iris dataset.

Visualize the correlation matrix as a heatmap to identify highly correlated variables, using geom_tile() from ggplot2 and melt() from the reshape2 package:

library(ggplot2)
ggplot(data = reshape2::melt(corr), aes(x=Var1, y=Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

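This will produce a heatmap that shows the correlation between the variables. With this color scale, pairs with a strong positive correlation, such as petal length and petal width (0.96), appear in darker red, while negative correlations appear close to white.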
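
Highly correlated pairs can also be listed programmatically instead of being read off the plot. This is a minimal base R sketch; the cutoff of 0.9 is an assumption made here, not a value taken from the analysis.

```r
corr <- cor(iris[, 1:4])

# Assumed cutoff for "highly correlated"
threshold <- 0.9

# Look only at the upper triangle so each pair is reported once
high <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  correlation = corr[high]
)
print(high_pairs)
```

Variables that appear together in this list carry largely overlapping information, so one of them may be a candidate for removal.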

Use PCA to identify the most important features that contribute to the variation in the data using the prcomp() function:

pca <- prcomp(iris[,1:4], center = TRUE, scale. = TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
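
The summary shows that the first two components account for about 96% of the variance. To relate the components back to the original features, one option is to inspect the loadings and plot the observations on the first two components. This is a sketch of that follow-up step; var_explained is a name introduced here.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance per component, recomputed from the standard deviations
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 4))

# Loadings: how strongly each original feature contributes to PC1 and PC2
print(round(pca$rotation[, 1:2], 3))

# Observations projected onto the first two components, colored by species
plot(pca$x[, 1], pca$x[, 2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Iris data on the first two principal components")
```

Features with large absolute loadings on PC1 contribute most to the dominant direction of variation.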

Model selection

Choose an appropriate machine learning algorithm for the task. Some popular algorithms for the Iris Flower Classification problem include k-nearest neighbors (KNN), decision trees, and logistic regression.

For this particular project, we will be using the KNN algorithm.

To get started, we will load the caret library:

library(caret)

Split the data into training and testing sets:

set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

This will split the data into a training set (trainData) and a testing set (testData), with 70% of the data in the training set.

Train the KNN model using the train function:

knnModel <- train(Species ~ ., data = trainData, method = "knn", trControl = trainControl(method = "cv"), tuneLength = 10)

In this example, we’re using the train function to train a KNN model (method = “knn”) using the training data (data = trainData). We’re also specifying cross-validation (trControl = trainControl(method = “cv”), which defaults to 10 folds) to estimate the model’s performance, and asking caret to try 10 candidate values of the k parameter (tuneLength = 10).
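
To make the idea behind KNN concrete, here is a minimal from-scratch sketch in base R: compute the Euclidean distance from a new flower to every training flower, then take a majority vote among the k closest. This is illustrative only and is not how caret implements the method; knn_predict and the every-fifth-row hold-out are hypothetical choices made here.

```r
# Minimal KNN classifier: Euclidean distance plus majority vote (illustrative only)
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(point) {
    # Euclidean distance from this point to every training observation
    distances <- sqrt(colSums((t(train_x) - point)^2))
    # Labels of the k closest training observations
    nearest_labels <- train_y[order(distances)[seq_len(k)]]
    # Majority vote; ties go to the first level in factor order
    votes <- table(nearest_labels)
    names(votes)[which.max(votes)]
  })
}

# Hypothetical hold-out: every fifth row is used for testing
test_rows <- seq(5, nrow(iris), by = 5)
predicted <- knn_predict(iris[-test_rows, 1:4], iris$Species[-test_rows],
                         iris[test_rows, 1:4], k = 5)

# Fraction of held-out flowers classified correctly
holdout_accuracy <- mean(predicted == iris$Species[test_rows])
print(holdout_accuracy)
```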

Use the trained model to make predictions on the testing set:

knnPredictions <- predict(knnModel, testData)

This will generate predicted species labels (knnPredictions) for the testing set using the trained KNN model.

Evaluate the performance of the model using a confusion matrix:

confusionMatrix(knnPredictions, testData$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         1
##   virginica       0          0        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9333
## Specificity                 1.0000            0.9667           1.0000
## Pos Pred Value              1.0000            0.9375           1.0000
## Neg Pred Value              1.0000            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9833           0.9667

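This will generate a confusion matrix that shows the number of correctly and incorrectly classified instances for each species. Here, 44 of the 45 test flowers are classified correctly (accuracy 0.9778); the single error is a virginica flower predicted as versicolor.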
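
The headline statistics can be reproduced by hand from the counts in the matrix printed above, which helps clarify what each one measures. The sketch below re-enters those counts manually; cm is a name introduced here.

```r
species <- c("setosa", "versicolor", "virginica")

# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(15,  0,  0,
                0, 15,  1,
                0,  0, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions over the true count of each species (column totals)
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over the predicted count (row totals)
pos_pred_value <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```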

Note that you may need to tune the k parameter to improve the model’s performance. The train function’s tuneLength parameter specifies how many candidate values to try for each tuning parameter. You can also supply an explicit set of values to search over using the tuneGrid parameter.
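
As a sketch of the tuneGrid approach, the example below searches over an explicit set of k values. The grid of odd values from 1 to 21 is an assumption chosen here for illustration, and the names kGrid and knnTuned are hypothetical; the split repeats the one used earlier so that the block runs by itself.

```r
library(caret)

# Same 70/30 split as before
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]

# Assumed grid: odd values of k from 1 to 21
kGrid <- expand.grid(k = seq(1, 21, by = 2))

# 10-fold cross-validation over every k in the grid
knnTuned <- train(Species ~ ., data = trainData, method = "knn",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = kGrid)

# Cross-validated accuracy for each k, and the value caret selected
print(knnTuned$results[, c("k", "Accuracy")])
print(knnTuned$bestTune)
```

Odd values of k are a common choice because they reduce the chance of tied votes, although ties are still possible with three classes.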

Iris Machine Learning Project

Isaac Tubonemi

2023-03-29
