Data Exploration Iris Data Set Slides

February 13, 2017

A Study of an 80% Random Sample of the "Iris" Data Set

Iris Data Set Data Exploration

About the Iris Data Set
* The Iris Data Set is a multivariate data set used by R. A. Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.
* The data was collected by Edgar Anderson in 1935 to quantify the morphological variation of Iris flowers of three related species.
* Two of the three species were collected on the Gaspé Peninsula, Canada: all from the same pasture, picked on the same day, and measured at the same time by the same person using the same equipment.
* The Iris data set contains 150 observations and 5 variables (one categorical and four numeric) from three iris species: setosa, versicolor, and virginica.
* There are 50 observations from each of the three iris species, measuring sepal length, sepal width, petal length and petal width, all numeric values in centimeters.
* There is no missing data.

Sample Data from a Random Sample (80%) Subset of Iris Data Set

## Summary Statistics for 80% Subset of Iris Data Set

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 120 5.81 0.80   5.70    5.79 0.89 4.3 7.9   3.6  0.26
## Sepal.Width     2 120 3.04 0.43   3.00    3.03 0.37 2.0 4.2   2.2  0.22
## Petal.Length    3 120 3.73 1.74   4.35    3.75 1.70 1.0 6.9   5.9 -0.32
## Petal.Width     4 120 1.20 0.78   1.30    1.19 1.04 0.1 2.5   2.4 -0.09
## Species*        5 120 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.66 0.07
## Sepal.Width     -0.03 0.04
## Petal.Length    -1.45 0.16
## Petal.Width     -1.38 0.07
## Species*        -1.52 0.07
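
A minimal sketch of how an 80% subset and a comparable summary table could be produced in base R. The seed, the use of simple random sampling, and the names `iris_train`, `iris_test`, and `summary_stats` are assumptions for illustration; the slides do not state how the subset was drawn, so the numbers will not match the table above exactly. The layout of that table resembles the output of `describe()` from the `psych` package, but this sketch uses only base R.

```r
# Assumption: illustrative seed and simple random sampling (not stated in the slides)
set.seed(42)
n_train    <- round(0.8 * nrow(iris))             # 120 of the 150 rows
train_idx  <- sample(seq_len(nrow(iris)), n_train)
iris_train <- iris[train_idx, ]                   # 80% subset used for exploration
iris_test  <- iris[-train_idx, ]                  # remaining 20% held out

# Summary statistics for the four numeric variables
numeric_cols  <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
summary_stats <- t(sapply(iris_train[numeric_cols], function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x), se = sd(x) / sqrt(length(x)))
}))
print(round(summary_stats, 2))
```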

Pairs Plot of Variables of Iris Data Subset

Histogram of All Three Species

Plot All Species Density Curves and Means

Boxplot of Three Species

3D Scatterplot of Petal.Width, Petal.Length and Petal.Width

Parallel Coordinates Plot

K Means Clustering (3 Species) Elliptical Cluster
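
A hedged sketch of a three-cluster k-means run on the four numeric variables. The seed, `nstart = 25`, and the object names are assumptions for illustration, and the elliptical plot is drawn with `clusplot()` from the `cluster` package only if that package is installed; the slides do not say which plotting function was used.

```r
# Assumption: illustrative seed and 80% simple random sample
set.seed(42)
iris_train <- iris[sample(nrow(iris), 120), ]
features   <- iris_train[, c("Sepal.Length", "Sepal.Width",
                             "Petal.Length", "Petal.Width")]

# k-means with three centers, one per species; several random starts for stability
km <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate cluster assignments against the known species labels
print(table(Cluster = km$cluster, Species = iris_train$Species))

# Optional elliptical cluster plot on the first two principal components
if (requireNamespace("cluster", quietly = TRUE)) {
  cluster::clusplot(features, km$cluster, color = TRUE, shade = TRUE,
                    labels = 0, lines = 0, main = "K-means clusters (k = 3)")
}
```

The cross-tabulation is the quickest way to see how well the unsupervised clusters line up with the species labels.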

Hierarchical Clustering
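
A minimal sketch of hierarchical clustering on the same four variables. Euclidean distance, complete linkage, the seed, and the cut into three groups are all assumptions; the slides do not state which distance or linkage method was used.

```r
# Assumption: illustrative seed and 80% simple random sample
set.seed(42)
iris_train <- iris[sample(nrow(iris), 120), ]
features   <- iris_train[, c("Sepal.Length", "Sepal.Width",
                             "Petal.Length", "Petal.Width")]

# Euclidean distances and complete linkage (both assumed here)
hc <- hclust(dist(features, method = "euclidean"), method = "complete")

# Dendrogram; leaf labels suppressed because 120 leaves are unreadable
plot(hc, labels = FALSE, main = "Hierarchical clustering of iris subset")

# Cut the tree into three groups and compare with the species labels
groups <- cutree(hc, k = 3)
print(table(Group = groups, Species = iris_train$Species))
```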

CONCLUSIONS

The histogram of petal length in the iris data set separates at least the "setosa" species from the other two. Sepal length and sepal width are approximately normally distributed, which suggests that the species cannot be differentiated on these two variables alone. Petal length and petal width show skewed, uneven distributions.
The density plot, a second visualization method, gave the same result as the histogram. It shows significant overlap, yet some differentiation, between the virginica and versicolor species. Further analysis is required to determine whether these two species can be distinguished from each other.
The box plot analysis shows that petal length in "setosa" is shorter than in the "versicolor" and "virginica" species. The range of petal length values for setosa is also much smaller than for the other two species.
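
The petal length observations above can be checked numerically. The sketch below, assuming an illustrative seed and a simple 80% random sample, tabulates the per-species range of petal length and tests whether the longest setosa petal is shorter than the shortest petal of the other two species.

```r
# Assumption: illustrative seed and 80% simple random sample
set.seed(42)
iris_train <- iris[sample(nrow(iris), 120), ]

# Minimum and maximum petal length per species
petal_ranges <- t(sapply(split(iris_train$Petal.Length, iris_train$Species), range))
colnames(petal_ranges) <- c("min", "max")
print(petal_ranges)

# Does petal length alone separate setosa from the other two species?
setosa_max <- max(iris_train$Petal.Length[iris_train$Species == "setosa"])
others_min <- min(iris_train$Petal.Length[iris_train$Species != "setosa"])
cat("Setosa separable by petal length:", setosa_max < others_min, "\n")

# Box plot of petal length by species
boxplot(Petal.Length ~ Species, data = iris_train,
        ylab = "Petal length (cm)", main = "Petal length by species")
```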

CONCLUSIONS

Scatterplot analysis of petal width and petal length shows that these two attributes have a linear relationship.
Cluster and hierarchical analysis of the four attributes shows that the "versicolor" and "virginica" species are more closely related to each other than either is to "setosa".
Petal length is a promising attribute to analyze for species differentiation.
The patterns found in the 80% sample successfully predicted the classification of the test set.
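
The linear relationship between petal length and petal width noted above can be quantified with a correlation coefficient and a simple linear fit. This is a sketch under assumptions (illustrative seed, simple 80% random sample, hypothetical object names `r` and `fit`), not a reproduction of the original analysis.

```r
# Assumption: illustrative seed and 80% simple random sample
set.seed(42)
iris_train <- iris[sample(nrow(iris), 120), ]

# Pearson correlation between petal length and petal width
r <- cor(iris_train$Petal.Length, iris_train$Petal.Width)
cat("Pearson correlation:", round(r, 3), "\n")

# Simple linear regression of petal width on petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris_train)
print(coef(fit))

# Scatterplot colored by species, with the fitted line overlaid
plot(Petal.Width ~ Petal.Length, data = iris_train, col = as.integer(Species),
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
abline(fit)
```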