Iris Data Project

#load library
library(corrplot)

## corrplot 0.92 loaded

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: ggplot2

## Loading required package: lattice

library(klaR)

## Warning: package 'klaR' was built under R version 4.3.3

## Loading required package: MASS

#load data
data("iris")
head(iris, n=20)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa

#calculate correlation
correlations <-cor(iris[,1:4])
correlations

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

There are strong positive correlations between Sepal.Length and Petal.Length (0.87), between Sepal.Length and Petal.Width (0.82), and between Petal.Length and Petal.Width (0.96).

There are weak to moderate negative correlations between Sepal.Width and both Petal.Length (-0.43) and Petal.Width (-0.37).

In short, the petal dimensions are highly correlated with each other and with sepal length, while sepal width is weakly to moderately negatively correlated with the other three features.
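The strongly correlated pairs can also be pulled out of the matrix programmatically instead of being read off by eye. The sketch below uses base R only; the 0.8 cut-off and the name `strong_pairs` are assumptions made for illustration, not part of the original analysis.

```r
data(iris)
correlations <- cor(iris[, 1:4])

# Look only at the upper triangle so that each pair is listed once
# and the diagonal (each attribute with itself) is skipped.
idx <- which(upper.tri(correlations) & abs(correlations) > 0.8, arr.ind = TRUE)

strong_pairs <- data.frame(
  feature_1 = rownames(correlations)[idx[, "row"]],
  feature_2 = colnames(correlations)[idx[, "col"]],
  r         = correlations[idx]
)

# Strongest relationship first
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Lowering the threshold (for example to 0.3) would also bring in the negative Sepal.Width relationships.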

#plot correlations
corrplot(correlations, method = "circle")

Interpretation: Each cell is drawn as a circle, where blue represents positive correlation and red negative. The larger the circle, the stronger the correlation. The matrix is symmetrical, and the diagonal entries are perfectly positively correlated because each one shows the correlation of an attribute with itself. Petal.Length and Petal.Width stand out as the most highly correlated pair.

#create a pair-wise scatter plot for the four numeric attributes
pairs(iris[,1:4], col = 'blue')

#using the class label to separate the classes
pairs(Species~., data=iris, col=iris$Species)

#create box and whisker plot
x <- iris[,1:4]
y <- iris[,5]
featurePlot(x=x, y=y, plot="box")

data <- iris

#splitting the dataset into train and test data

data_train <- createDataPartition(data$Species, p=0.80, list = FALSE)
data_test <- data[-data_train,]

#working with the training dataset
Dtrain <- data[data_train,]

#SUMMARIZE DATASET

#data dimension
dim(Dtrain)

## [1] 120   5

The training data has 120 instances and 5 attributes (4 numeric inputs plus the Species class label).
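Because no seed is set before `createDataPartition`, the 120 training rows differ from run to run. The sketch below shows one way to get a reproducible stratified 80/20 split. It uses base R rather than caret, and the seed value and the names `train_idx`, `train` and `test` are hypothetical.

```r
data(iris)
set.seed(123)  # hypothetical seed; any fixed value makes the split repeatable

# Sample 80% of the row numbers within each species so that the class
# proportions of the full data are preserved in the training set.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unname(unlist(lapply(
  rows_by_species,
  function(rows) sample(rows, size = round(0.8 * length(rows)))
)))

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

dim(train)            # 120 rows, 5 columns
table(train$Species)  # 40 instances per species
```

With caret, calling `set.seed()` immediately before `createDataPartition()` should have the same effect on reproducibility.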

#Type of attributes
sapply(Dtrain, class)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

All of the inputs are numeric (double) and the class value is a factor.

#Take a peek at the 1st 5 rows of your data
head(Dtrain,5)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa

#list the levels for the class variable 
levels(Dtrain$Species)

## [1] "setosa"     "versicolor" "virginica"

In the results, the class variable has 3 different labels. Hence, this is a multi-class or a multinomial classification problem. If there were two levels, it would be a binary classification problem.

#show a breakdown of each class in a frequency table
cbind(freq = table(Dtrain$Species), percentage = prop.table(table(Dtrain$Species))*100)

##            freq percentage
## setosa       40   33.33333
## versicolor   40   33.33333
## virginica    40   33.33333

Each class has the same number of instances (40, or 33% of the training set).

#Statistical Summary
summary(Dtrain)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.865   Mean   :3.067   Mean   :3.763   Mean   :1.191  
##  3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :40  
##  versicolor:40  
##  virginica :40  
##                 
##                 
##

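For the sepal dimensions the mean and median are close (5.87 vs 5.80 for Sepal.Length, 3.07 vs 3.00 for Sepal.Width), so these distributions are roughly symmetric. For the petal dimensions the mean falls below the median (3.76 vs 4.35 for Petal.Length, 1.19 vs 1.30 for Petal.Width), so these distributions are less symmetric. The three species are equally represented (40 instances each) because the original data has 50 instances per class and createDataPartition samples within each class, so there is no class imbalance. These descriptive statistics summarize the central tendency and spread of each feature and provide a foundation for further analysis or model building.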
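The symmetry of each feature can be quantified rather than judged from the summary table. The sketch below computes the gap between mean and median and a moment-based skewness for every input. It runs on the full `iris` data instead of `Dtrain` so that it is self-contained, and `skewness` and `shape` are hypothetical names.

```r
data(iris)
features <- iris[, 1:4]

# Moment-based sample skewness: close to 0 for a symmetric distribution,
# negative when the left tail is longer, positive when the right tail is.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

shape <- data.frame(
  mean_minus_median = sapply(features, function(v) mean(v) - median(v)),
  skewness          = sapply(features, skewness)
)
shape
```

A large gap between mean and median combined with a small skewness value usually hints at a multi-modal distribution, which the density plots can confirm.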

#Data Pre-processing (standardizing the parameters)
preprocess_Params <- preProcess(iris[,1:4], method=c("center", "scale"))

#print the preprocessed parameters
print(preprocess_Params)

## Created from 150 samples and 4 variables
## 
## Pre-processing:
##   - centered (4)
##   - ignored (0)
##   - scaled (4)

#apply the centering and scaling to the four input attributes
transform <- predict(preprocess_Params, iris[,1:4])

#summarize the transformed attributes
summary(transform)

##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064
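What the `center` and `scale` steps do can be reproduced by hand: subtract each column's mean, then divide by its standard deviation. The sketch below uses base R only, and the names `x`, `centers`, `spreads` and `z` are hypothetical.

```r
data(iris)
x <- as.matrix(iris[, 1:4])

centers <- colMeans(x)       # per-column mean, removed by "center"
spreads <- apply(x, 2, sd)   # per-column standard deviation, divided out by "scale"

# Standardize column by column: z = (x - mean) / sd
z <- sweep(sweep(x, 2, centers, "-"), 2, spreads, "/")

round(colMeans(z), 10)       # every column now has mean 0
apply(z, 2, sd)              # and standard deviation 1
min(z[, "Sepal.Length"])     # compare with the Min. of Sepal.Length in the summary above

# Base R's scale() performs the same computation in one call
all.equal(z, scale(x), check.attributes = FALSE)
```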

#Data Visualization

#separate the input data
x <- Dtrain[,1:4]
y <- Dtrain[,5]

par(mfrow=c(1,4))
for (i in 1:4) {
  boxplot(x[,i], main = names(iris)[i])
}

#barplot for class breakdown
plot(y)

#Scatterplot matrix
featurePlot(x=x, y=y, plot = "ellipse")

#Box and Whisker plot
featurePlot(x=x, y=y, plot = "box")

The box plots show clearly different distributions of the attributes for each class value.

#Density plot of the attributes
scales <- list(x=list(relation = "free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot = "density", scales=scales)

#Evaluation of some Algorithms
#10-fold cross validation will be used to estimate accuracy. This splits the training data into 10 parts, trains on 9, tests on 1, and repeats until each part has been used once as the test set.

#10 fold cross validation
trainControl <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

Accuracy will be used to evaluate the models. It is the number of correctly predicted instances divided by the total number of instances. caret reports it as a proportion between 0 and 1 (e.g. 0.95), which can be multiplied by 100 to give a percentage (95% accurate).
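To make the resampling and the accuracy calculation concrete, here is a minimal hand-rolled 10-fold cross-validation of LDA. It is a sketch, not what caret does internally: the folds are assigned at random rather than stratified by class, it runs on the full `iris` data to stay self-contained, and `fold_id` and `fold_accuracy` are hypothetical names. `lda()` comes from the MASS package, which the report already loads.

```r
library(MASS)  # provides lda()

data(iris)
set.seed(7)    # fixed seed so the fold assignment is repeatable
k <- 10

# Give every row a fold number from 1 to k, in random order
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(k), function(f) {
  held_out <- fold_id == f
  fit  <- lda(Species ~ ., data = iris[!held_out, ])   # train on the other 9 folds
  pred <- predict(fit, iris[held_out, ])$class         # predict the held-out fold
  mean(pred == iris$Species[held_out])                 # correct / total
})

fold_accuracy        # one accuracy value per held-out fold
mean(fold_accuracy)  # the cross-validated accuracy estimate
```

The mean over the 10 folds is the kind of figure reported in the Mean column of a resampling summary.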

#Evaluating five different models

#LDA
set.seed(7)
fit.lda <- train(Species~., data=Dtrain, method="lda", metric=metric, trControl=trainControl)

#CART (Classification and Regression Trees)
set.seed(7)
fit.cart <- train(Species~., data=Dtrain,method="rpart", metric=metric, trControl=trainControl)

#KNN (K Nearest Neighbour)
set.seed(7)
fit.knn <- train(Species~., data=Dtrain, method="knn", metric=metric, trControl=trainControl)

#SVM (Support Vector Machine)
set.seed(7)
fit.svm <- train(Species~., data=Dtrain, method="svmRadial", metric=metric, trControl=trainControl)

#RF (Random Forest)
set.seed(7)
fit.rf <- train(Species~., data = Dtrain, method="rf", metric=metric, trControl=trainControl)

The five models are a mixture of a simple linear method (LDA), non-linear methods (CART, KNN) and complex non-linear methods (SVM, RF). We reset the random number seed before each run so that every algorithm is evaluated on exactly the same data splits, which makes the results directly comparable.
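The effect of resetting the seed can be seen in isolation. In the sketch below, `make_folds` is a hypothetical helper that stands in for the random fold assignment; it is not caret's own routine.

```r
# Hypothetical helper: randomly assign n rows to k folds
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

set.seed(7)
folds_a <- make_folds(120, 10)

set.seed(7)                      # same seed again, as before each train() call
folds_b <- make_folds(120, 10)

identical(folds_a, folds_b)      # TRUE: both runs see the same splits
```

Without the second `set.seed(7)`, the random number generator would continue from where the first call left off and the two fold assignments would almost certainly differ.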

#Summary Accuracy of models
fitted_Results <- resamples(list(lda = fit.lda, cart = fit.cart, knn = fit.knn, svm = fit.svm, rf = fit.rf))
summary(fitted_Results)

## 
## Call:
## summary.resamples(object = fitted_Results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## lda  0.9166667 0.9375000 1.0000000 0.9750000 1.0000000    1    0
## cart 0.9166667 0.9166667 0.9166667 0.9333333 0.9166667    1    0
## knn  0.9166667 0.9375000 1.0000000 0.9750000 1.0000000    1    0
## svm  0.8333333 0.9166667 0.9166667 0.9333333 0.9791667    1    0
## rf   0.8333333 0.9166667 1.0000000 0.9583333 1.0000000    1    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.875 0.90625  1.000 0.9625 1.00000    1    0
## cart 0.875 0.87500  0.875 0.9000 0.87500    1    0
## knn  0.875 0.90625  1.000 0.9625 1.00000    1    0
## svm  0.750 0.87500  0.875 0.9000 0.96875    1    0
## rf   0.750 0.87500  1.000 0.9375 1.00000    1    0

All five models (LDA, CART, KNN, SVM, and RF) perform well, with mean accuracy between 0.93 and 0.975 and a maximum of 1.0 in at least one fold. The lowest single-fold accuracy is 0.83 (SVM and RF); LDA, CART, and KNN never fall below 0.92. LDA, KNN, and RF have a median accuracy of 1.0, meaning they classify at least half of the folds perfectly.

LDA and KNN share the highest mean accuracy (0.975). LDA is used as the final model below.

dotplot(fitted_Results)

#Summarize the best model
print(fit.lda)

## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.975     0.9625
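Kappa rescales accuracy by the agreement expected from chance alone: kappa = (po - pe) / (1 - pe), where po is the observed accuracy and pe the chance agreement. The sketch below computes it from a confusion matrix. `cohen_kappa` is a hypothetical helper, and the matrix is a made-up fold of 12 instances with one versicolor flower misclassified as virginica, not output from the report.

```r
# Cohen's kappa from a square confusion matrix
# (rows = predicted class, columns = reference class)
cohen_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Made-up fold: 4 instances per class, one error
cm <- matrix(c(4, 0, 0,
               0, 3, 0,
               0, 1, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("setosa", "versicolor", "virginica"),
                             Reference  = c("setosa", "versicolor", "virginica")))

sum(diag(cm)) / sum(cm)   # accuracy: 11/12 = 0.9166667
cohen_kappa(cm)           # 0.875

# With three equally sized reference classes pe is 1/3, so the
# accuracy of 0.975 corresponds to a kappa of 0.9625
(0.975 - 1/3) / (1 - 1/3)
```

This is why each accuracy of 0.9166667 in the resampling summary pairs with a kappa of 0.875, and 0.975 with 0.9625.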

# LDA had the joint-highest mean accuracy, so estimate its skill on the held-out test dataset
predictions <- predict(fit.lda, data_test)
predictions

##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica  virginica  virginica  virginica 
## [25] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

#confusion matrix
confusionMatrix(predictions, data_test$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Accuracy is 100%, and the model’s predictions perfectly match the true labels across all three classes. All metrics (sensitivity, specificity, precision, etc.) are 1 for each class, indicating no misclassifications or false positives/negatives. The model’s Kappa value of 1 further confirms that the predicted and true labels are in perfect agreement.
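The 95% CI and the P-Value [Acc > NIR] in the output appear to correspond to an exact binomial test on 30 correct predictions out of 30, which can be checked with base R's `binom.test`. This is a sketch under that assumption; the names `ci` and `p_value` are hypothetical.

```r
n_correct <- 30   # correct predictions on the test set
n_total   <- 30   # size of the test set

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_total)$conf.int
ci   # compare with the reported 95% CI

# One-sided test: is the accuracy greater than the no-information rate of 1/3?
p_value <- binom.test(n_correct, n_total, p = 1/3,
                      alternative = "greater")$p.value
p_value   # compare with the reported P-Value [Acc > NIR]
```

The lower bound of about 0.88 is a reminder that a perfect score on only 30 instances still leaves noticeable uncertainty about the true accuracy.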

In conclusion, accuracy is the proportion of correct predictions made by the model. LDA reached a mean cross-validated accuracy of 97.5% on the training set and classified all 30 held-out test instances correctly. This suggests the model separates the three species (setosa, versicolor, and virginica) very well, although a test set of only 30 instances leaves a wide confidence interval (0.8843 to 1).

Ezebuike Esther

2024-12-13