#load library
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
library(klaR)
## Warning: package 'klaR' was built under R version 4.3.3
## Loading required package: MASS
#load data
data("iris")
head(iris, n=20)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
#calculate correlation
correlations <-cor(iris[,1:4])
correlations
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
There are strong positive correlations between Sepal.Length and Petal.Length (0.87), between Sepal.Length and Petal.Width (0.82), and between Petal.Length and Petal.Width (0.96).
Sepal.Width is negatively correlated with the other features: moderately with Petal.Length (-0.43) and Petal.Width (-0.37), and only weakly with Sepal.Length (-0.12).
The matrix shows that the two petal dimensions are almost collinear and that both track Sepal.Length closely, while Sepal.Width is the feature least related to the rest.
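To read the matrix as a ranked list rather than a grid, the six distinct feature pairs can be pulled out of the upper triangle and sorted by absolute correlation. This is a minimal base-R sketch; the names `pair_index` and `pairs_df` are illustrative and not part of the original analysis.

```r
# Recompute the correlation matrix so the sketch is self-contained.
correlations <- cor(iris[, 1:4])

# Positions of the upper triangle, so each pair of features appears once.
pair_index <- which(upper.tri(correlations), arr.ind = TRUE)

# One row per feature pair, strongest absolute correlation first.
pairs_df <- data.frame(
  var1 = rownames(correlations)[pair_index[, "row"]],
  var2 = colnames(correlations)[pair_index[, "col"]],
  r = correlations[pair_index]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df)
```

The first row should be the Petal.Length / Petal.Width pair, which matches the 0.96 entry in the matrix.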
#plot correlations
corrplot(correlations, method = "circle")
Interpretation: Each correlation is drawn as a circle, where blue
represents positive correlation and red negative. The larger the circle,
the stronger the correlation. We can see that the matrix is symmetrical
and that the diagonal entries are perfectly positively correlated
(because each one is the correlation of an attribute with itself). We
can also see that Petal.Length, Petal.Width and Sepal.Length are highly
correlated with one another.
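The two structural properties read off the plot (symmetry and a diagonal of ones) can also be checked numerically. A small sketch in base R; the variable names are illustrative.

```r
# Recompute the correlation matrix so the check is self-contained.
correlations <- cor(iris[, 1:4])

# TRUE when the matrix equals its transpose (within numerical tolerance).
is_symmetric <- isSymmetric(correlations)

# TRUE when every attribute has correlation 1 with itself.
unit_diagonal <- isTRUE(all.equal(unname(diag(correlations)), rep(1, 4)))

is_symmetric
unit_diagonal
```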
#create a pair-wise scatter plot for the four attributes (Species column excluded)
pairs(iris[,1:4], col = 'blue')
#using the class label to separate the classes
pairs(Species~., data=iris, col=iris$Species)
#create box and whisker plot
x <- iris[,1:4]
y <- iris[,5]
featurePlot(x=x, y=y, plot="box")
data <- iris
#splitting the dataset into train and test data (data_train holds the row indices of the 80% training split)
data_train <- createDataPartition(data$Species, p=0.80, list = FALSE)
data_test <- data[-data_train,]
#working with the training dataset
Dtrain <- data[data_train,]
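The split above is random, so the rows that end up in each set change from run to run unless a seed is set first. The sketch below shows the same idea (an 80/20 split drawn separately within each species) in base R with a fixed seed. It is not caret's implementation; the seed value 123 and the names `train_set` and `test_set` are assumptions made for illustration.

```r
# Assumption: the seed value is arbitrary; it only makes the split repeatable.
set.seed(123)

# Row numbers grouped by species, so sampling happens within each class.
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Draw 80% of the rows of every class for training.
train_rows <- unname(unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
})))

train_set <- iris[train_rows, ]
test_set <- iris[-train_rows, ]

# Each species contributes the same number of training rows.
table(train_set$Species)
```

Sampling within each class is what keeps the 40/40/40 balance in the training set.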
#SUMMARIZE DATASET
#data dimension
dim(Dtrain)
## [1] 120 5
The training data has 120 instances and 5 attributes (4 inputs plus the class label).
#Type of attributes
sapply(Dtrain, class)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "numeric" "numeric" "numeric" "numeric" "factor"
All of the inputs are numeric (double) and the class value is a factor.
#Take a peek at the 1st 5 rows of your data
head(Dtrain,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
#list the levels for the class variable
levels(Dtrain$Species)
## [1] "setosa" "versicolor" "virginica"
In the results, the class variable has 3 different labels. Hence, this is a multi-class or a multinomial classification problem. If there were two levels, it would be a binary classification problem.
#show a breakdown of each class in a frequency table
cbind(freq = table(Dtrain$Species), percentage = prop.table(table(Dtrain$Species))*100)
## freq percentage
## setosa 40 33.33333
## versicolor 40 33.33333
## virginica 40 33.33333
Each class has the same number of instances (40, or 33.3% of the training set).
#Statistical Summary
summary(Dtrain)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.500 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.865 Mean :3.067 Mean :3.763 Mean :1.191
## 3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :40
## versicolor:40
## virginica :40
##
##
##
For the sepal dimensions the mean and median nearly coincide (Sepal.Length: 5.865 vs 5.800; Sepal.Width: 3.067 vs 3.000), which suggests roughly symmetric distributions. For the petal dimensions they differ more (Petal.Length: mean 3.763 vs median 4.350; Petal.Width: 1.191 vs 1.300), so those distributions are less symmetric. The three species are equally represented (40 each) because createDataPartition samples within each class, so there is no class imbalance. These descriptive statistics summarize the central tendency and spread of the features and give a starting point for further analysis and model building.
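Comparing mean and median is only a rough guide to symmetry. One way to put a number on it is the moment-based sample skewness, sketched below in base R. The helper `skewness` is written here for illustration (it is not a base-R function), and the sketch uses the full iris data because the training rows depend on a random split.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Assumption: this helper is defined here; base R has no skewness().
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# Positive values mean a longer right tail, negative a longer left tail.
feature_skewness <- sapply(iris[, 1:4], skewness)
feature_skewness
```

A single skewness value can hide structure such as several peaks, so it is best read alongside the density plots.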
#Data Pre-processing (standardizing the inputs)
preprocess_Params <- preProcess(iris[,1:4], method=c("center", "scale"))
#print the preprocessed parameters
print(preprocess_Params)
## Created from 150 samples and 4 variables
##
## Pre-processing:
## - centered (4)
## - ignored (0)
## - scaled (4)
#apply the centering and scaling to the data
transform <- predict(preprocess_Params, iris[,1:4])
#summarize the transformed data
summary(transform)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4258 Min. :-1.5623 Min. :-1.4422
## 1st Qu.:-0.89767 1st Qu.:-0.5904 1st Qu.:-1.2225 1st Qu.:-1.1799
## Median :-0.05233 Median :-0.1315 Median : 0.3354 Median : 0.1321
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5567 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.0805 Max. : 1.7799 Max. : 1.7064
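Centering and scaling means subtracting each column's mean and dividing by its standard deviation. When the standardized values feed a model, a common practice is to estimate the means and standard deviations on the training rows only and reuse them for the test rows. The base-R sketch below illustrates that; the seed, the simple random split and all variable names are assumptions for illustration.

```r
# Assumption: an arbitrary seed and a plain random 120/30 split.
set.seed(1)
train_rows <- sample(nrow(iris), 120)
train_x <- iris[train_rows, 1:4]
test_x <- iris[-train_rows, 1:4]

# Estimate the parameters on the training rows only.
column_means <- colMeans(train_x)
column_sds <- sapply(train_x, sd)

# Apply the same parameters to both sets.
train_z <- scale(train_x, center = column_means, scale = column_sds)
test_z <- scale(test_x, center = column_means, scale = column_sds)

# Training columns now have mean 0 and standard deviation 1.
round(colMeans(train_z), 10)
apply(train_z, 2, sd)
```

The test columns will be close to, but not exactly, mean 0 and standard deviation 1, because their parameters come from the training rows.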
#Data Visualization
#separate the input data
x <- Dtrain[,1:4]
y <- Dtrain[,5]
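#one box plot per input attribute, side by side
par(mfrow=c(1,4))
for (i in 1:4) {
boxplot(x[,i], main = names(x)[i])
}
#restore the single-panel layout before the next plot
par(mfrow=c(1,1))
#barplot for class breakdown
plot(y)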
#Scatterplot matrix
featurePlot(x=x, y=y, plot = "ellipse")
#Box and Whisker plot
featurePlot(x=x, y=y, plot = "box")
The box plots show that each attribute is distributed clearly
differently across the class values.
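The differences visible in the box plots can be backed up with per-class means. A short base-R sketch using `aggregate` on the full iris data (the training rows depend on a random split); `class_means` is an illustrative name.

```r
# Mean of every numeric attribute within each species.
class_means <- aggregate(. ~ Species, data = iris, FUN = mean)
class_means
```

The petal measurements increase from setosa to versicolor to virginica, which is the separation the box plots display.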
#Density plot of the attributes
scales <- list(x=list(relation = "free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot = "density", scales=scales)
#Evaluation of some Algorithms
#10-fold cross validation will be used to estimate accuracy. This splits the training data into 10 parts, trains on 9 and tests on the remaining 1, and repeats so that each part is used once as the test set.
#10 fold cross validation
trainControl <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
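What the 10-fold setup does can be sketched by hand. The loop below is not caret's procedure: it calls `lda()` from MASS directly, uses the full iris data instead of Dtrain, and assigns folds by simple random shuffling. These choices and the variable names are assumptions made to keep the sketch self-contained.

```r
library(MASS)  # provides lda()

set.seed(7)
k <- 10

# Assign every row to one of k folds of (nearly) equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- numeric(k)
for (fold in seq_len(k)) {
  held_out <- fold_id == fold
  # Train on the other k - 1 folds.
  fit <- lda(Species ~ ., data = iris[!held_out, ])
  # Test on the fold that was left out.
  predicted <- predict(fit, newdata = iris[held_out, ])$class
  fold_accuracy[fold] <- mean(predicted == iris$Species[held_out])
}

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # the cross-validated estimate
```

The mean over the folds is the kind of figure that appears in the Mean column of a resampling summary.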
The metric of Accuracy will be used to evaluate models. This is the number of correctly predicted instances divided by the total number of instances; multiplied by 100 it gives a percentage (e.g. 95% accurate). caret reports it as a proportion between 0 and 1.
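As a worked example of the definition, the sketch below computes accuracy for four made-up predictions. The function `accuracy` and the toy labels are hypothetical and serve only to show the arithmetic.

```r
# Proportion of predictions that equal the true label.
accuracy <- function(predicted, actual) mean(predicted == actual)

# Toy data: four flowers, one of them misclassified.
actual <- factor(c("setosa", "setosa", "versicolor", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica"),
                    levels = levels(actual))

accuracy(predicted, actual)        # 0.75
accuracy(predicted, actual) * 100  # 75, as a percentage
```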
#Evaluating five different models
#LDA
set.seed(7)
fit.lda <- train(Species~., data=Dtrain, method="lda", metric=metric, trControl=trainControl)
#CART (Classification and Regression Trees)
set.seed(7)
fit.cart <- train(Species~., data=Dtrain,method="rpart", metric=metric, trControl=trainControl)
#KNN (K Nearest Neighbour)
set.seed(7)
fit.knn <- train(Species~., data=Dtrain, method="knn", metric=metric, trControl=trainControl)
#SVM (Support Vector Machine)
set.seed(7)
fit.svm <- train(Species~., data=Dtrain, method="svmRadial", metric=metric, trControl=trainControl)
#RF (Random Forest)
set.seed(7)
fit.rf <- train(Species~., data = Dtrain, method="rf", metric=metric, trControl=trainControl)
The five algorithms are a mixture of simple linear (LDA), non-linear (CART, KNN) and complex non-linear methods (SVM, RF). We reset the random number seed before each run so that every algorithm is evaluated on exactly the same data splits, which makes the results directly comparable.
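Why resetting the seed gives identical splits can be seen with a small base-R sketch. The helper `make_folds` is hypothetical and much simpler than caret's own fold creation, but the principle is the same: the same seed produces the same sequence of random draws.

```r
# Assumption: a simplified fold assignment, not caret's internal one.
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

folds_a <- make_folds(120, 10, seed = 7)
folds_b <- make_folds(120, 10, seed = 7)
folds_c <- make_folds(120, 10, seed = 8)

identical(folds_a, folds_b)  # TRUE: same seed, same folds
identical(folds_a, folds_c)  # a different seed will generally give different folds
```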
#Summary Accuracy of models
fitted_Results <- resamples(list(lda = fit.lda, cart = fit.cart, knn = fit.knn, svm = fit.svm, rf = fit.rf))
summary(fitted_Results)
##
## Call:
## summary.resamples(object = fitted_Results)
##
## Models: lda, cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.9166667 0.9375000 1.0000000 0.9750000 1.0000000 1 0
## cart 0.9166667 0.9166667 0.9166667 0.9333333 0.9166667 1 0
## knn 0.9166667 0.9375000 1.0000000 0.9750000 1.0000000 1 0
## svm 0.8333333 0.9166667 0.9166667 0.9333333 0.9791667 1 0
## rf 0.8333333 0.9166667 1.0000000 0.9583333 1.0000000 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.875 0.90625 1.000 0.9625 1.00000 1 0
## cart 0.875 0.87500 0.875 0.9000 0.87500 1 0
## knn 0.875 0.90625 1.000 0.9625 1.00000 1 0
## svm 0.750 0.87500 0.875 0.9000 0.96875 1 0
## rf 0.750 0.87500 1.000 0.9375 1.00000 1 0
All five models (LDA, CART, KNN, SVM and RF) perform well, with mean accuracy between 0.93 and 0.975. The lowest accuracy on any single fold is 0.83 (SVM and RF); LDA, CART and KNN never drop below 0.92. LDA, KNN and RF have a median accuracy of 1.0, meaning they classify every instance correctly in at least half of the folds.
LDA and KNN share the highest mean accuracy, 0.975 (Kappa 0.9625); LDA is the model examined further below.
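The Kappa column is Cohen's kappa, which compares the observed accuracy with the accuracy expected by chance given the class frequencies. The sketch below computes it from a confusion matrix. The function `cohen_kappa` and the 12-flower matrix (one versicolor predicted as virginica) are hypothetical, chosen to resemble a single fold.

```r
# Cohen's kappa from a confusion matrix (rows = predicted, columns = actual).
cohen_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  (observed - expected) / (1 - expected)
}

classes <- c("setosa", "versicolor", "virginica")
# Hypothetical fold: 4 flowers per class, one versicolor called virginica.
fold_cm <- matrix(c(4, 0, 0,
                    0, 3, 0,
                    0, 1, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Prediction = classes, Reference = classes))

sum(diag(fold_cm)) / sum(fold_cm)  # accuracy: 11 / 12
cohen_kappa(fold_cm)               # 0.875
```

An accuracy of 11/12 (about 0.917) paired with a kappa of 0.875 is the same combination that appears in the resampling summary for folds with a single error.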
dotplot(fitted_Results)
#Summarize the best model
print(fit.lda)
## Linear Discriminant Analysis
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results:
##
## Accuracy Kappa
## 0.975 0.9625
# LDA had the highest mean accuracy (tied with KNN), so estimate its skill on the held-out test set (data_test)
predictions <- predict(fit.lda, data_test)
predictions
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica virginica virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
#confusion matrix
confusionMatrix(predictions, data_test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 4.857e-15
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
Accuracy on the test set is 100%: the predictions match the true labels for all 30 held-out flowers, 10 per class. Sensitivity, specificity, and positive and negative predictive value are all 1 for each class, so there are no false positives or false negatives. A Kappa of 1 likewise indicates perfect agreement between predicted and true labels. With only 30 test instances the estimate is still uncertain, as the 95% confidence interval of (0.8843, 1) shows.
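The per-class statistics become easier to read with a case that is not perfect. The sketch below derives sensitivity and specificity from a hypothetical 30-flower confusion matrix in which one virginica is predicted as versicolor. The matrix and variable names are made up for illustration; only the definitions are standard.

```r
classes <- c("setosa", "versicolor", "virginica")
# Hypothetical result (rows = predicted, columns = actual).
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

true_pos <- diag(cm)
false_neg <- colSums(cm) - true_pos  # actual members of the class that were missed
false_pos <- rowSums(cm) - true_pos  # other classes predicted as this class
true_neg <- sum(cm) - true_pos - false_neg - false_pos

sensitivity <- true_pos / (true_pos + false_neg)
specificity <- true_neg / (true_neg + false_pos)

sensitivity  # setosa 1, versicolor 1, virginica 0.9
specificity  # setosa 1, versicolor 0.95, virginica 1
```

A single error lowers the sensitivity of the class that was missed and the specificity of the class that absorbed it.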
In conclusion, accuracy is the proportion of correct predictions made by the model. The cross-validated accuracy of 97.5% is the estimate obtained on the training data: averaged over the 10 folds, LDA predicted the correct species for 97.5% of the held-out flowers. Together with the perfect score on the 30-flower test set, this suggests that the model classifies the flowers into the three species (setosa, versicolor and virginica) very well.
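The 95% CI of (0.8843, 1) in the confusion matrix output is consistent with an exact binomial interval for 30 correct predictions out of 30, which base R can compute with `binom.test`. The sketch assumes that this is how the interval was obtained; `test_ci` is an illustrative name.

```r
# Exact (Clopper-Pearson) interval for 30 successes in 30 trials.
test_ci <- binom.test(x = 30, n = 30, conf.level = 0.95)$conf.int
round(test_ci, 4)  # 0.8843 1.0000
```

Even with no errors observed, a test set of 30 flowers cannot rule out a true accuracy as low as about 88%.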