Sambhav Shrestha
In this project, I analyze the Iris flower dataset and classify the flowers based on their features. I will try several machine learning models to find the one that most accurately distinguishes one species from another. First, we start by importing the required libraries.
# Importing Required Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(cowplot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(caTools)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(MASS) # provides lda() for linear discriminant analysis
library(rpart)
library(rpart.plot)
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
The iris dataset is built into R and can be loaded with the following code.
# importing the iris dataset
data(iris)
# summary of iris dataset
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# first and last rows
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# check for any missing values
colSums(is.na(iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 0 0 0
From the above summary, we can see that this dataset has five columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species), with the first four being attributes and Species being the label (setosa, versicolor, virginica), distributed equally among 150 observations. Also, there are no missing values, and the four attributes are on the same scale (cm), so we don’t need to normalize them.
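If the attributes were on different scales, we could standardize them first. A minimal sketch using caret’s preProcess (not needed for iris; pp and iris_scaled are illustrative names):
# hypothetical: center and scale the four numeric columns
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris[, 1:4])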
Let’s prepare our training and test sets by splitting the data. I chose an 80:20 split.
# split the data into training set and test set
set.seed(100)
split <- sample.split(iris$Species, SplitRatio = 0.8)
train <- subset(iris, split == TRUE)
test <- subset(iris, split == FALSE)
dim(train)
## [1] 120 5
dim(test)
## [1] 30 5
# check if we have equal number of each species in training and test data.
count(train, Species)
## Species n
## 1 setosa 40
## 2 versicolor 40
## 3 virginica 40
count(test, Species)
## Species n
## 1 setosa 10
## 2 versicolor 10
## 3 virginica 10
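As an aside, caret’s createDataPartition performs an equivalent stratified split; a minimal sketch (it draws a different random split, so it is not used below):
# alternative stratified 80:20 split with caret
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_alt <- iris[idx, ]
test_alt <- iris[-idx, ]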
We will look at both univariate and multivariate plots to understand each attribute and the relationships between attributes in our training data.
qplot(x = train$Species, fill = train$Species, xlab = "Species", ylab ="count")
Box plots help visualize the interquartile range (IQR) and dispersion of the data.
par(mfrow = c(1, 4))
color <- c("red", "green", "orange", "yellow")
for (i in 1:4) {
  boxplot(train[, -c(5)][i], main = names(train)[i], col = color[i])
}
We can see that sepal length and sepal width have a comparatively smaller IQR than the petal measurements. Also, the petal measurements are more negatively skewed.
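To quantify the skewness rather than eyeball it, we could compute it per attribute. A quick sketch, assuming the e1071 package (a suggested dependency of caret) is installed:
# sample skewness of each attribute; negative values indicate left skew
apply(train[, 1:4], 2, e1071::skewness)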
Let’s compare the box plots for each species. First, let’s compute the mean of each attribute by species.
# table of means
train %>%
  group_by(Species) %>%
  summarise(avg_SL = mean(Sepal.Length), avg_SW = mean(Sepal.Width),
            avg_PL = mean(Petal.Length), avg_PW = mean(Petal.Width))
## # A tibble: 3 x 5
## Species avg_SL avg_SW avg_PL avg_PW
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.03 3.46 1.47 0.238
## 2 versicolor 5.92 2.78 4.21 1.32
## 3 virginica 6.66 2.99 5.64 2.05
We can already see some characteristics of each species: setosa has the smallest average petal length and petal width and the largest sepal width, while virginica has the largest average sepal length, petal length, and petal width. Let’s visualize the data in multivariate box plots.
featurePlot(x=train[, -c(5)], y=train[, 5], plot='box')
A scatterplot matrix helps us identify the separation of each species visually; here we overlay per-species ellipses.
featurePlot(x=train[, -c(5)], y=train[, 5], plot='ellipse')
From the above scatterplot matrix, the separation between setosa and the other two species is clearly visible. Versicolor and virginica overlap each other, but virginica generally has larger attribute values.
Scatter plots also help reveal the relationships between columns. We can check whether the length and width measurements are linearly related within each species, and plot the density of each attribute by species. First, the sepal measurements:
# scatter plot between sepal length and width
scatsepal <- train %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width, shape = Species, color = Species)) +
geom_point(size=2) + geom_smooth(method=lm, se = FALSE, formula = y ~ x) + theme(legend.position = "none")
# density plots
xdensity1 <- train %>% ggplot(aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.5)
ydensity1 <- train %>% ggplot(aes(x = Sepal.Width, fill = Species)) + geom_density(alpha = 0.5) + theme(legend.position = "none")
# blank plot
blankPlot <- ggplot()+ theme_void()
# arranging all the plots
grid.arrange(xdensity1, blankPlot, scatsepal, ydensity1, ncol=2, nrow=2, widths=c(3, 2), heights=c(2, 3))
# scatter plot between petal length and width
scatpetal <- train %>% ggplot(aes(x = Petal.Length, y = Petal.Width, shape = Species, color = Species)) +
geom_point(size=2) + geom_smooth(method=lm, se = FALSE, formula = y ~ x) + theme(legend.position = "none")
# density plots
xdensity2 <- train %>% ggplot(aes(x = Petal.Length, fill = Species)) + geom_density(alpha = 0.5)
ydensity2 <- train %>% ggplot(aes(x = Petal.Width, fill = Species)) + geom_density(alpha = 0.5) + theme(legend.position = "none")
# blank plot
blankPlot <- ggplot()+ theme_void()
# arranging all the plots
grid.arrange(xdensity2, blankPlot, scatpetal, ydensity2, ncol=2, nrow=2, widths=c(3, 2), heights=c(2, 3))
From the two combined plots above, we can see that the petal plots form three separate clusters, one per species, while the sepal plots have overlapping clusters. Thus, the petal measurements should have a strong bearing on the model. Before building models, let’s check the correlations among the attributes.
correlation <- cor(train[, c(1:4)], method = 'pearson')
corrplot(correlation, number.cex = 1, method = "color", type = "lower", tl.cex=0.8, tl.col="black")
The correlation plot shows strong correlations between Petal.Length and Sepal.Length, and between Petal.Width and Petal.Length.
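If we wanted to drop redundant predictors automatically, caret’s findCorrelation flags columns whose pairwise correlation exceeds a cutoff. A hedged sketch (we keep all four attributes in this project):
# column indices recommended for removal at a 0.9 cutoff
findCorrelation(correlation, cutoff = 0.9)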
We will use 10-fold cross-validation to estimate accuracy: the training data is split into 10 parts, and the model is trained on 9 parts and validated on the remaining one, rotating through all 10 folds.
control <- trainControl(method='cv', number=10)
metric <- 'Accuracy'
The first model I will use is LDA (Linear Discriminant Analysis), which is similar to logistic regression but handles more than two classes.
set.seed(123)
lda_fit <- train(Species~., data=train, method='lda',
trControl=control, metric=metric)
lda_fit
## Linear Discriminant Analysis
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9833333 0.975
The cross-validated accuracy of LDA on the training set is 98.33% with a Kappa of 0.975. Let’s test it on the test data.
# predicting on test data
lda_predict <- predict(lda_fit, test)
# confusion matrix
lda_cm <- confusionMatrix(lda_predict, test$Species)
lda_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 0
## virginica 0 1 10
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 1.0000
## Specificity 1.0000 1.0000 0.9500
## Pos Pred Value 1.0000 1.0000 0.9091
## Neg Pred Value 1.0000 0.9524 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3333
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.9500 0.9750
# keep track of all model accuracies
Models_Accuracies <- tibble(model = "LDA", accuracy = lda_cm$overall['Accuracy'])
The accuracy of LDA on the test set is 96.67%. The next model is CART (Classification and Regression Trees).
set.seed(123)
cart_fit <- train(Species~., data=train, method='rpart',
trControl=control, metric=metric)
cart_fit
## CART
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0000 0.9500000 0.925
## 0.4625 0.7833333 0.675
## 0.5000 0.3333333 0.000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
The best cross-validated accuracy from the CART model was 95% (Kappa 0.925) on the training data, obtained at cp = 0.
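Note that caret tried only three cp values by default. A denser grid could be searched; a sketch with illustrative values (cart_grid and cart_fit2 are hypothetical names):
# refit CART over a finer complexity-parameter grid
cart_grid <- expand.grid(cp = seq(0, 0.1, by = 0.01))
cart_fit2 <- train(Species~., data=train, method='rpart',
                   trControl=control, metric=metric, tuneGrid=cart_grid)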
# predicting on test data
cart_predict <- predict(cart_fit, test)
# confusion matrix
cart_cm <- confusionMatrix(cart_predict, test$Species)
cart_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 7 1
## virginica 0 3 9
##
## Overall Statistics
##
## Accuracy : 0.8667
## 95% CI : (0.6928, 0.9624)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.296e-09
##
## Kappa : 0.8
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.7000 0.9000
## Specificity 1.0000 0.9500 0.8500
## Pos Pred Value 1.0000 0.8750 0.7500
## Neg Pred Value 1.0000 0.8636 0.9444
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2333 0.3000
## Detection Prevalence 0.3333 0.2667 0.4000
## Balanced Accuracy 1.0000 0.8250 0.8750
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "CART", accuracy = cart_cm$overall['Accuracy'])
On the test set, the accuracy dropped to 86.67% with a Kappa of 0.8. The next model is k-nearest neighbors (KNN).
set.seed(123)
knn_fit <- train(Species~., data=train, method='knn', trControl=control, metric=metric)
knn_fit
## k-Nearest Neighbors
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9916667 0.9875
## 7 0.9916667 0.9875
## 9 0.9916667 0.9875
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
The cross-validated accuracy from KNN on the training data was 99.17%, the highest so far. Let’s see how it performs on the test data.
# test data
knn_predict <- predict(knn_fit, test)
# confusion matrix
knn_cm <- confusionMatrix(knn_predict, test$Species)
knn_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 2
## virginica 0 1 8
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.8000
## Specificity 1.0000 0.9000 0.9500
## Pos Pred Value 1.0000 0.8182 0.8889
## Neg Pred Value 1.0000 0.9474 0.9048
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.2667
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9000 0.8750
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "KNN", accuracy = knn_cm$overall['Accuracy'])
The accuracy on the test set dropped to 90%. It seems that KNN overfit the training set.
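One mitigation would be to tune k over a wider grid, since larger k values smooth the decision boundary. A minimal sketch with illustrative values (knn_grid and knn_fit2 are hypothetical names):
# search odd k from 3 to 21 instead of caret's default 5, 7, 9
knn_grid <- expand.grid(k = seq(3, 21, by = 2))
knn_fit2 <- train(Species~., data=train, method='knn',
                  trControl=control, metric=metric, tuneGrid=knn_grid)
Next, let’s try a support vector machine (SVM) with a radial basis function kernel.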
set.seed(123)
svm_fit <- train(Species~., data=train, method='svmRadial', trControl=control, metric=metric)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9583333 0.9375
## 0.50 0.9833333 0.9750
## 1.00 0.9833333 0.9750
##
## Tuning parameter 'sigma' was held constant at a value of 0.6165725
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.6165725 and C = 0.5.
The best cross-validated accuracy from the SVM model is 98.33% (Kappa 0.975) on the training data.
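Since sigma was held constant, we could also tune it jointly with C by passing an explicit grid; a sketch with illustrative values (svm_grid and svm_fit2 are hypothetical names):
# joint grid over kernel width (sigma) and cost (C)
svm_grid <- expand.grid(sigma = c(0.1, 0.5, 1), C = c(0.25, 0.5, 1, 2))
svm_fit2 <- train(Species~., data=train, method='svmRadial',
                  trControl=control, metric=metric, tuneGrid=svm_grid)
Now let’s evaluate the original fit on the test data.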
# test data
svm_predict <- predict(svm_fit, test)
# confusion matrix
svm_cm <- confusionMatrix(svm_predict, test$Species)
svm_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 9 0 0
## versicolor 1 8 2
## virginica 0 2 8
##
## Overall Statistics
##
## Accuracy : 0.8333
## 95% CI : (0.6528, 0.9436)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.444e-08
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 0.9000 0.8000 0.8000
## Specificity 1.0000 0.8500 0.9000
## Pos Pred Value 1.0000 0.7273 0.8000
## Neg Pred Value 0.9524 0.8947 0.9000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3000 0.2667 0.2667
## Detection Prevalence 0.3000 0.3667 0.3333
## Balanced Accuracy 0.9500 0.8250 0.8500
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "SVM", accuracy = svm_cm$overall['Accuracy'])
Like KNN, the SVM also performed relatively poorly on the test set, with an accuracy of only 83.33%. Next, let’s fit a standalone decision tree with rpart and visualize it.
tree_fit <- rpart(Species ~ ., train, control = rpart.control(minsplit = 6, minbucket = 2), method = "class")
rpart.plot(tree_fit)
# test data
tree_predict <- predict(tree_fit, test, type="class")
tree_cm <- confusionMatrix(tree_predict, test$Species)
tree_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 1
## virginica 0 2 9
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.9000
## Specificity 1.0000 0.9500 0.9000
## Pos Pred Value 1.0000 0.8889 0.8182
## Neg Pred Value 1.0000 0.9048 0.9474
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.3000
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.8750 0.9000
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "Decision Tree", accuracy = tree_cm$overall['Accuracy'])
The accuracy of the decision tree on the test set was 90%.
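To see which attributes drive the splits, rpart stores importance scores on the fitted object:
# importance accumulated over the tree's splits (larger = more influential)
tree_fit$variable.importance
The next model is a random forest, fit via the ranger package.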
set.seed(123)
rf_fit <- train(Species~., data=train, method='ranger', trControl=control, metric=metric)
rf_fit
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9833333 0.9750
## 2 extratrees 0.9666667 0.9500
## 3 gini 0.9833333 0.9750
## 3 extratrees 0.9666667 0.9500
## 4 gini 0.9750000 0.9625
## 4 extratrees 0.9750000 0.9625
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
## and min.node.size = 1.
The random forest model achieved a cross-validated training accuracy of 98.33%.
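ranger can also report variable importance if asked at fit time; a hedged sketch, refitting with impurity-based importance (rf_fit2 is a hypothetical name):
# the importance argument is passed through caret to ranger
rf_fit2 <- train(Species~., data=train, method='ranger',
                 trControl=control, metric=metric, importance='impurity')
varImp(rf_fit2)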
# Test data
rf_predict <- predict(rf_fit, test)
# Confusion matrix
rf_cm <- confusionMatrix(rf_predict, test$Species)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 2
## virginica 0 2 8
##
## Overall Statistics
##
## Accuracy : 0.8667
## 95% CI : (0.6928, 0.9624)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.296e-09
##
## Kappa : 0.8
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.8000
## Specificity 1.0000 0.9000 0.9000
## Pos Pred Value 1.0000 0.8000 0.8000
## Neg Pred Value 1.0000 0.9000 0.9000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.2667
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.8500 0.8500
# Keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest", accuracy = rf_cm$overall['Accuracy'])
The random forest model also didn’t perform well on the test data, with an accuracy of only 86.67%. Finally, let’s try XGBoost. Since xgboost expects 0-based integer class labels, we first convert the Species factor.
set.seed(40)
# convert class labels from factor to 0-based integers (required by xgboost)
labels <- train$Species
y <- as.integer(labels) - 1
# xgb fit
xgb_fit <- xgboost(data = data.matrix(train[,-5]),
                   label = y,
                   num_class = 3,
                   eta = 0.3,
                   gamma = 0.1,
                   max_depth = 30,
                   nrounds = 20,
                   objective = "multi:softprob",
                   colsample_bytree = 0.6,
                   verbose = 0,
                   nthread = 7,
                   nfold = 10,
                   prediction = TRUE)
## [14:34:56] WARNING: amalgamation/../src/learner.cc:541:
## Parameters: { nfold, prediction } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [14:34:56] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
xgb_fit
## ##### xgb.Booster
## raw: 45.5 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, num_class = 3, eta = 0.3, gamma = 0.1,
## max_depth = 30, objective = "multi:softprob", colsample_bytree = 0.6,
## nthread = 7, nfold = 10, prediction = TRUE)
## params (as set within xgb.train):
## num_class = "3", eta = "0.3", gamma = "0.1", max_depth = "30", objective = "multi:softprob", colsample_bytree = "0.6", nthread = "7", nfold = "10", prediction = "TRUE", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 4
## niter: 20
## nfeatures : 4
## evaluation_log:
## iter train_mlogloss
## 1 0.741699
## 2 0.525989
## ---
## 19 0.025517
## 20 0.024078
# Test data
xgb_predict <- predict(xgb_fit, data.matrix(test[, -5]), reshape = T) %>% as.data.frame()
colnames(xgb_predict) = levels(labels)
# Use the predicted label with the highest probability
xgb_predict$prediction = apply(xgb_predict, 1 ,function(x) colnames(xgb_predict)[which.max(x)])
# Confusion matrix
xgb_cm <- confusionMatrix(factor(xgb_predict$prediction), factor(test$Species))
xgb_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 1
## virginica 0 2 9
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.9000
## Specificity 1.0000 0.9500 0.9000
## Pos Pred Value 1.0000 0.8889 0.8182
## Neg Pred Value 1.0000 0.9048 0.9474
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.3000
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.8750 0.9000
# Keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "XGBoost", accuracy = xgb_cm$overall['Accuracy'])
XGBoost consistently achieved 90% accuracy on the test data across different hyperparameter settings.
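Recall the warning above: nfold and prediction are arguments to xgb.cv(), not xgboost(), so no cross-validation actually ran. A hedged sketch of cross-validating the same parameters with xgb.cv (xgb_cv is a hypothetical name):
# 10-fold CV with the parameters mirrored from the xgboost() call
xgb_cv <- xgb.cv(params = list(num_class = 3, eta = 0.3, gamma = 0.1,
                               max_depth = 30, objective = "multi:softprob",
                               colsample_bytree = 0.6),
                 data = data.matrix(train[, -5]), label = y,
                 nrounds = 20, nfold = 10, prediction = TRUE, verbose = 0)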
Let’s compare how the caret-trained models performed on the training data (the standalone rpart tree and the XGBoost fit are not caret objects, so resamples cannot include them).
# Compare the results of these algorithms
iris.results <- resamples(list(lda=lda_fit, cart=cart_fit, knn=knn_fit, svm=svm_fit, rf=rf_fit))
# Table Comparison
summary(iris.results)
##
## Call:
## summary.resamples(object = iris.results)
##
## Models: lda, cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.9166667 1.0000000 1.0000000 0.9833333 1 1 0
## cart 0.8333333 0.9166667 0.9583333 0.9500000 1 1 0
## knn 0.9166667 1.0000000 1.0000000 0.9916667 1 1 0
## svm 0.8333333 1.0000000 1.0000000 0.9833333 1 1 0
## rf 0.9166667 1.0000000 1.0000000 0.9833333 1 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.875 1.000 1.0000 0.9750 1 1 0
## cart 0.750 0.875 0.9375 0.9250 1 1 0
## knn 0.875 1.000 1.0000 0.9875 1 1 0
## svm 0.750 1.000 1.0000 0.9750 1 1 0
## rf 0.875 1.000 1.0000 0.9750 1 1 0
dotplot(iris.results)
From the above data, we can see that KNN performed better than all the other models on the training data, with a mean cross-validated accuracy of 99.17%. Training accuracy can be misleading, though; one common cause is overfitting, where a model performs extremely well on training data but breaks down on test data.
Let’s compare the accuracies on our test set.
arrange(Models_Accuracies, desc(accuracy))
## # A tibble: 7 x 2
## model accuracy
## <chr> <dbl>
## 1 LDA 0.967
## 2 KNN 0.9
## 3 Decision Tree 0.9
## 4 XGBoost 0.9
## 5 CART 0.867
## 6 Random Forest 0.867
## 7 SVM 0.833
As we can see, LDA performed considerably better than the other models, followed by KNN, XGBoost, and the decision tree. LDA models all four variables jointly, and so was able to capture the class structure well. That said, the dataset is quite small, which could also explain why the other models did not perform as well as expected.
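Finally, note that with only 30 test observations, each misclassification shifts accuracy by about 3.3 percentage points, so these rankings are noisy. Repeated cross-validation on the full data would give more stable estimates; a sketch (control_rep and lda_rep are hypothetical names):
# repeated 10-fold CV reduces the variance of the accuracy estimate
control_rep <- trainControl(method='repeatedcv', number=10, repeats=5)
lda_rep <- train(Species~., data=iris, method='lda',
                 trControl=control_rep, metric=metric)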