Dataset description

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Source: https://archive.ics.uci.edu/ml/datasets/iris. The dataset comes installed with R.
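As a quick check of the description above, the following minimal sketch loads the copy of the dataset bundled with R (the built-in `iris` data frame) and counts the instances per class. It uses base R only.

```r
# Load the copy of the iris dataset that ships with R (datasets package).
data(iris)

# Four numeric measurements plus the Species factor.
str(iris)

# Number of instances per class.
class_counts <- table(iris$Species)
print(class_counts)
```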

Libraries used

ggplot2: https://cran.r-project.org/web/packages/ggplot2/index.html and GGally: https://cran.r-project.org/web/packages/GGally/index.html

Preliminary descriptive visualization

The ggpairs function from the GGally package provides a pairwise summary of the dataset, including the correlations between variables.
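A call along the following lines would produce such a summary. This is a sketch, not necessarily the exact call used for this report: colouring by `Species` is an assumption made here so that the three classes can be told apart, and the plot is skipped if the packages are not installed.

```r
data(iris)

# Assumes ggplot2 and GGally are installed; the plot is skipped otherwise.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("GGally", quietly = TRUE)) {
  library(ggplot2)
  library(GGally)

  # Assumption for this sketch: colour every panel by Species.
  pairs_plot <- ggpairs(iris, mapping = aes(colour = Species))
  print(pairs_plot)
}
```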

Preliminary conclusion

We can observe that either petal dimension creates the greatest separation between Setosas and the other species. Hence, Setosas could be classified using Petal.Width or Petal.Length. The criterion could be: every datapoint with a Petal.Length no greater than the maximum petal length of the Setosas is a Setosa.
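The Setosa criterion can be sketched in base R as follows. The variable names (`setosa_max_length`, `is_setosa_pred`) are illustrative and not taken from the original analysis.

```r
data(iris)

# Threshold: the largest petal length observed among the Setosas.
setosa_max_length <- max(iris$Petal.Length[iris$Species == "setosa"])

# Rule: a datapoint at or below the threshold is labelled Setosa.
is_setosa_pred <- iris$Petal.Length <= setosa_max_length

# Cross-tabulate the rule against the true species.
print(table(predicted_setosa = is_setosa_pred, species = iris$Species))
```

On this dataset the rule selects the 50 Setosas and no other datapoint, because no Versicolor or Virginica has a petal as short as the longest Setosa petal.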

Similarly, every datapoint beyond the maximum Versicolor petal width and length is a Virginica. But there is a region where Versicolor and Virginica overlap, and it requires further criteria. We will determine them visually.
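The Virginica criterion, and the datapoints it leaves undecided, can be sketched as below. One assumption is made loudly here: "beyond the maximum" is read as exceeding at least one of the two Versicolor maxima; requiring both would be a stricter reading. Variable names are illustrative.

```r
data(iris)

versicolor <- iris[iris$Species == "versicolor", ]
max_length <- max(versicolor$Petal.Length)
max_width  <- max(versicolor$Petal.Width)
setosa_max_length <- max(iris$Petal.Length[iris$Species == "setosa"])

# Setosa rule: petal length at or below the Setosa maximum.
is_setosa_pred <- iris$Petal.Length <= setosa_max_length

# Assumption: exceeding EITHER Versicolor maximum is enough for Virginica.
is_virginica_pred <- iris$Petal.Length > max_length |
  iris$Petal.Width > max_width

# Datapoints that neither rule settles: all Versicolors, plus the
# Virginicas that fall inside the Versicolor petal range.
undecided <- iris[!is_setosa_pred & !is_virginica_pred, ]
print(table(undecided$Species))
```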

Sequence to refine the classification elements

The following parallel coordinates diagrams show, step by step, the datapoints that remain unclassified at each stage of the classification process.

STEP 1: Setosas are easily classified, hence they are removed. Similarly, the Versicolors and Virginicas beyond the extreme values are also removed. The following diagram shows the intersection of Versicolor and Virginica datapoints.
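A minimal sketch of how the remaining subset for STEP 1 might be built and plotted. It assumes that the "extreme values" are the petal ranges of the other species (Petal.Length and Petal.Width only), which the text does not state explicitly. The plot uses `ggparcoord` from GGally and is skipped if the package is not installed.

```r
data(iris)

# Setosas are removed first.
rest <- droplevels(iris[iris$Species != "setosa", ])

versicolor <- rest[rest$Species == "versicolor", ]
virginica  <- rest[rest$Species == "virginica", ]

# Assumption: a datapoint is "beyond the extreme values" when a petal
# measurement lies outside the range observed for the other species.
clear_virginica <- rest$Petal.Length > max(versicolor$Petal.Length) |
  rest$Petal.Width > max(versicolor$Petal.Width)
clear_versicolor <- rest$Petal.Length < min(virginica$Petal.Length) |
  rest$Petal.Width < min(virginica$Petal.Width)

# What is left is the region where the two species overlap.
overlap <- rest[!clear_virginica & !clear_versicolor, ]
print(table(overlap$Species))

# Parallel coordinates of the four measurements, coloured by Species
# (column 5). Skipped when GGally is unavailable.
if (nrow(overlap) > 0 && requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggparcoord(overlap, columns = 1:4, groupColumn = 5,
                           scale = "globalminmax"))
}
```

The `scale = "globalminmax"` setting keeps the axes in the original measurement units instead of standardizing each variable; this is a choice made for the sketch.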

STEP 2: Versicolor datapoints above the maximum value were removed.

STEP 3: Virginicas with the maximum value of Petal.Width were removed.

STEP 4: At this point 139 of the 150 datapoints (93%) are correctly classified. The model could be refined further with a second pass on the Sepal.Width variable, at the risk of overfitting.
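A tally like the one above can be computed for any candidate rule by cross-tabulating its predictions against the true species. The sketch below does this for a hypothetical petal-only rule. It is not the exact sequence of steps described here, so its count need not match the 139/150 reported.

```r
data(iris)

setosa_max_length <- max(iris$Petal.Length[iris$Species == "setosa"])
versicolor <- iris[iris$Species == "versicolor", ]

# Hypothetical rule (an assumption of this sketch):
#   Setosa     if the petal length is within the Setosa range,
#   Virginica  if either petal measurement exceeds the Versicolor maximum,
#   Versicolor otherwise.
predicted <- ifelse(
  iris$Petal.Length <= setosa_max_length, "setosa",
  ifelse(iris$Petal.Length > max(versicolor$Petal.Length) |
           iris$Petal.Width > max(versicolor$Petal.Width),
         "virginica", "versicolor")
)
predicted <- factor(predicted, levels = levels(iris$Species))

# Confusion matrix: rows are predictions, columns are true species.
confusion <- table(predicted = predicted, actual = iris$Species)
print(confusion)

# Correct classifications sit on the diagonal.
n_correct <- sum(diag(confusion))
cat(sprintf("%d/%d correct (%.0f%%)\n", n_correct, nrow(iris),
            100 * n_correct / nrow(iris)))
```

Because the thresholds are taken from the same 150 datapoints that are then scored, the figure is a resubstitution accuracy and will tend to be optimistic, which is the overfitting risk noted above.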

Simple demo of boxplots and parallel coordinates for classification

Juan Salamanca

6/13/2022
