Dataset description

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Source: https://archive.ics.uci.edu/ml/datasets/iris. The dataset comes installed with R.
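As a quick sanity check of the description above, the following minimal sketch inspects the built-in iris data frame and counts the instances per class. The name classCounts is introduced here only for illustration.

```r
# iris ships with base R (datasets package), so nothing needs to be downloaded
data(iris)

# Structure of the data frame: four numeric measurements plus the Species factor
str(iris)

# Number of instances per class
classCounts <- table(iris$Species)
print(classCounts)
```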

Libraries used

ggplot2: https://cran.r-project.org/web/packages/ggplot2/index.html, GGally: https://cran.r-project.org/web/packages/GGally/index.html and dplyr: https://cran.r-project.org/web/packages/dplyr/index.html (used only for bind_rows)

# Import libraries
library(ggplot2)
library(GGally)
library(dplyr)

Preliminary descriptive visualization

The ggpairs function from the GGally library provides a summary of the dataset and the correlations between variables.

# Create custom color palette
colPalette <- c("#8f2d56", "#ffbc42","#73d2de")

# Named pairsPlot rather than summary to avoid masking base::summary()
pairsPlot <- ggpairs(iris, aes(color=Species, alpha=0.3))
pairsPlot <- pairsPlot + scale_color_manual(values=colPalette)
pairsPlot <- pairsPlot + scale_fill_manual(values=colPalette)

pairsPlot
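The correlations shown in the pairs plot can also be read as plain numbers. The sketch below uses only base R; the names overallCor and petalCorBySpecies are illustrative and not part of the original analysis.

```r
# Pearson correlations between the four numeric measurements, all species pooled
overallCor <- round(cor(iris[, 1:4]), 2)
print(overallCor)

# Correlation between the two petal dimensions, computed separately per species
petalCorBySpecies <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Petal.Length, d$Petal.Width)
)
print(round(petalCorBySpecies, 2))
```

Comparing the pooled and per-species values helps to see how much of a correlation comes from differences between species rather than from variation within one species.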

# This chunk of code marks datapoints at the intersection of versicolors and virginicas on the petal-size space.

# subset of versicolor
versicolor <- subset(iris, iris$Species == "versicolor")

# versicolor extreme top values
max.PetalW.Versicolor <- max(versicolor$Petal.Width)
max.PetalL.Versicolor <- max(versicolor$Petal.Length)

# subset of virginica
virginica <- subset(iris, iris$Species == "virginica")

# selection of virginica datapoints at or below the versicolor top boundaries
virginicaSubset <- subset(virginica,
                          Petal.Width <= max.PetalW.Versicolor &
                            Petal.Length <= max.PetalL.Versicolor)

# virginica extreme low values
min.PetalW.Virginica <- min(virginicaSubset$Petal.Width)
min.PetalL.Virginica <- min(virginicaSubset$Petal.Length)

# selection of versicolor datapoints at or above the virginica lower boundaries
versicolorSubset <- subset(versicolor,
                           Petal.Width >= min.PetalW.Virginica &
                             Petal.Length >= min.PetalL.Virginica)

intersection <- bind_rows(virginicaSubset,versicolorSubset)

# First plot: scatter plot of the petal-size space with the intersection boundaries
resumen <- ggplot(iris, aes(Petal.Length, Petal.Width))
# alpha is a fixed value, so it is set outside aes() to keep it out of the legend
resumen <- resumen + geom_point(aes(color=Species), alpha=0.3, position = "jitter")
# color palettes
resumen <- resumen + scale_fill_manual(values=colPalette)
resumen <- resumen + scale_color_manual(values=colPalette)
# boundaries
resumen <- resumen + geom_hline(yintercept=min.PetalW.Virginica, alpha=0.3)
resumen <- resumen + geom_hline(yintercept=max.PetalW.Versicolor, alpha=0.3)
resumen <- resumen + geom_vline(xintercept=min.PetalL.Virginica, alpha=0.3)
resumen <- resumen + geom_vline(xintercept=max.PetalL.Versicolor, alpha=0.3)

# labels
resumen <- resumen + labs(
  title = "Datapoints plotted on a petal-size space and organized by Species",
  subtitle = "Positions jittered to avoid occlusion",
  caption = "The boundaries show the intersection of versicolor and virginica species.")

# show plot
resumen

Preliminary conclusion

We can observe that either petal dimension separates Setosas completely from the other species. Hence, Setosas could be classified using Petal.Width or Petal.Length alone. The criterion could be: every datapoint with a Petal.Length no greater than the maximum petal length of the Setosas is a Setosa.
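The Setosa criterion can be checked directly against the labels. This is a minimal sketch in base R; setosaMaxLength and predictedSetosa are illustrative names, and the threshold is taken from the data itself rather than fixed in advance.

```r
# Largest petal length observed among the setosas
setosaMaxLength <- max(iris$Petal.Length[iris$Species == "setosa"])

# Rule: a datapoint is a setosa if its petal length does not exceed that maximum
predictedSetosa <- iris$Petal.Length <= setosaMaxLength

# Cross-tabulate the rule against the true species
print(table(predicted = predictedSetosa, actual = iris$Species))
```

Because the threshold comes from the same data it is evaluated on, this only shows that the rule fits this sample; it says nothing yet about unseen flowers.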

Similarly, every datapoint beyond the maximum Versicolor petal width or petal length is a Virginica. Between these two regions there is an area where Versicolors and Virginicas overlap, which requires further rules. We will determine them visually.
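The Virginica criterion can be sketched in the same way. The code assumes that "beyond" means exceeding the Versicolor maximum on at least one of the two petal dimensions, which mirrors how the intersection is built in the earlier chunk; beyondVersicolor is an illustrative name.

```r
versicolor <- subset(iris, Species == "versicolor")

# TRUE when a datapoint exceeds the versicolor maximum on either petal dimension
beyondVersicolor <- iris$Petal.Width > max(versicolor$Petal.Width) |
  iris$Petal.Length > max(versicolor$Petal.Length)

# Species of the datapoints selected by the rule versus those left over
print(table(beyond = beyondVersicolor, actual = iris$Species))
```

By construction no Versicolor can satisfy the rule, so the interesting part of the table is how many Virginicas it leaves unselected: those are the ones that fall into the overlapping area.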

Sequence to refine the classification elements

The following parallel coordinates diagrams show, step by step, the datapoints that remain to be classified.

# Parallel coordinates 1
plot <- ggparcoord(iris,
                   columns=c(1:4),
                   showPoints=TRUE,
                   groupColumn = 'Species',
                   scale = 'std',
                   splineFactor = 6,
                   alphaLines = 0.3)
plot <- plot + scale_color_manual(values = colPalette)
plot <- plot + labs(title = "Parallel coordinates plot of the entire Iris dataset")
plot
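The scale = 'std' option is documented in GGally as subtracting the mean and dividing by the standard deviation of each variable. The sketch below reproduces that transformation with base R so the axis values in the plot are easier to interpret; it illustrates the idea and is not GGally's internal code.

```r
# Column-wise standardization: (x - mean) / sd
standardized <- scale(iris[, 1:4])

# The same computation written out by hand for one column
manualPetalLength <- (iris$Petal.Length - mean(iris$Petal.Length)) /
  sd(iris$Petal.Length)

# After standardization every column has mean ~0 and standard deviation 1
print(round(colMeans(standardized), 10))
print(apply(standardized, 2, sd))
```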

STEP 1: Setosas are easily classified, hence they are removed. Similarly, the Versicolors and Virginicas beyond the extreme values are also removed. The following diagram shows the intersection of Versicolor and Virginica datapoints.

colPalette2 <- c("#ffbc42","#73d2de")

# Parallel coordinates 2

plot <- ggparcoord(intersection,
                   columns=c(1:4),
                   showPoints=TRUE,
                   groupColumn = 'Species',
                   scale = 'robust',
                   splineFactor = 6,
                   order=c(3,2,4,1),
                   alphaLines = 0.7)
plot <- plot + scale_color_manual(values = colPalette2)
plot <- plot + labs(title = "Parallel coordinates plot of datapoints in the intersection of versicolors and virginicas", subtitle = "Coordinates sorted for readability")
plot
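The scale = 'robust' option is documented in GGally as subtracting the median and dividing by the median absolute deviation, which makes the axes less sensitive to extreme values than the mean and standard deviation. The sketch below shows that idea on the full iris data for self-containment. The helper robustScale is hypothetical, and whether GGally applies the same consistency constant as base R's mad() is an assumption not verified here.

```r
# Hypothetical helper: center on the median, scale by the median absolute deviation
robustScale <- function(x) (x - median(x)) / mad(x)

# Apply it to each of the four measurements
robustScaled <- as.data.frame(lapply(iris[, 1:4], robustScale))

# Every column is now centered on a median of 0
print(sapply(robustScaled, median))
```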

STEP 2: Versicolor datapoints with a Sepal.Width above the maximum Sepal.Width of the Virginicas in the intersection were removed.

# Parallel coordinates 3
maxTemp <- max(virginicaSubset$Sepal.Width)
intersection2 <- subset(intersection, intersection$Sepal.Width <= maxTemp)

plot <- ggparcoord(intersection2,
                   columns=c(1:4),
                   showPoints=TRUE,
                   groupColumn = 'Species',
                   scale = 'robust',
                   splineFactor = 6,
                   order=c(3,2,4,1),
                   alphaLines = 0.7)
plot <- plot + scale_color_manual(values = colPalette2)
plot <- plot + labs(title = "Parallel coordinates plot of datapoints in the intersection of versicolors and virginicas", subtitle = "Removing Versicolors above the max Virginica's Sepal.Width")
plot

STEP 3: Virginicas whose Petal.Width equals the maximum Petal.Width remaining in the intersection were removed.

# Parallel coordinates 4
maxTemp <- max(intersection2$Petal.Width)
intersection3 <- subset(intersection2, intersection2$Petal.Width != maxTemp)

plot <- ggparcoord(intersection3,
                   columns=c(1:4),
                   showPoints=TRUE,
                   groupColumn = 'Species',
                   scale = 'std',
                   splineFactor = 6,
                   order=c(3,2,4,1),
                   alphaLines = 0.7)
plot <- plot + scale_color_manual(values = colPalette2)
plot <- plot + labs(title = "Parallel coordinates plot of datapoints in the intersection of versicolors and virginicas", subtitle = "Removing Virginicas equal to the max Virginica's Petal.Width")
plot

STEP 4: At this point 139/150 (93%) of the datapoints are correctly classified. The model could be fitted further with a second pass on the Sepal.Width variable, at the risk of overfitting.
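The tally can be recomputed from the filters themselves. The sketch below rebuilds the steps with base R only (rbind instead of dplyr's bind_rows) and counts how many datapoints are still unresolved after STEP 3. It assumes the resolved count is the total minus what remains in the last intersection; how the original figure was counted is not stated, so treat this as one plausible reading.

```r
versicolor <- subset(iris, Species == "versicolor")
virginica  <- subset(iris, Species == "virginica")

# Virginicas at or below the versicolor maxima on both petal dimensions
virginicaSubset <- subset(virginica,
                          Petal.Width <= max(versicolor$Petal.Width) &
                            Petal.Length <= max(versicolor$Petal.Length))

# Versicolors at or above the minima of that virginica subset
versicolorSubset <- subset(versicolor,
                           Petal.Width >= min(virginicaSubset$Petal.Width) &
                             Petal.Length >= min(virginicaSubset$Petal.Length))

intersection <- rbind(virginicaSubset, versicolorSubset)

# STEP 2: drop datapoints above the widest sepal in the virginica subset
intersection2 <- subset(intersection,
                        Sepal.Width <= max(virginicaSubset$Sepal.Width))

# STEP 3: drop datapoints at the maximum remaining petal width
intersection3 <- subset(intersection2,
                        Petal.Width != max(intersection2$Petal.Width))

# Datapoints still unresolved versus those settled by the rules so far
unresolved <- nrow(intersection3)
resolved <- nrow(iris) - unresolved
cat("resolved:", resolved, "of", nrow(iris), "\n")
print(table(droplevels(intersection3$Species)))
```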

Conclusion

The variable Petal.Width has the highest classification potential, followed by Sepal.Width. These visualizations inform the implementation of a classification tree, which still needs to be evaluated for precision and accuracy.
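As one possible next step, a classification tree restricted to the two variables named above could be fitted and evaluated as follows. This is a sketch that assumes the rpart package is available (it is a recommended package that normally ships with R). It is not the tree implied by the visual steps, and it is evaluated on the training data, so the figures it prints are optimistic.

```r
library(rpart)

# Fit a classification tree on the two variables highlighted in the conclusion
tree <- rpart(Species ~ Petal.Width + Sepal.Width, data = iris, method = "class")

# Predict on the same data and build a confusion matrix (rows = predicted)
predicted <- predict(tree, iris, type = "class")
confusion <- table(predicted = predicted, actual = iris$Species)
print(confusion)

# Overall accuracy and per-class precision (correct predictions / all predictions of that class)
accuracy <- sum(diag(confusion)) / sum(confusion)
precision <- diag(confusion) / rowSums(confusion)
print(round(accuracy, 3))
print(round(precision, 3))
```

A held-out test set or cross-validation would give a fairer estimate than the in-sample figures above.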