A Study of an 80% Random Sample of the “Iris” Data Set

Iris Data Set Data Exploration

About the Iris Data Set

  • The Iris Data Set is a multivariate data set introduced by R. A. Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis.
  • The data were collected by Edgar Anderson in 1935 to quantify the morphological variation of Iris flowers of three related species.
  • Two of the three species were collected in the Gaspé Peninsula, Canada, all from the same pasture, picked on the same day, and measured at the same time by the same person using the same equipment.
  • The Iris data set contains 150 observations of 5 variables (one categorical and four numeric) covering three iris species: setosa, versicolor, and virginica.
  • There are 50 observations from each of the three iris species, measuring sepal length, sepal width, petal length and petal width, all numeric values in centimeters.
  • There is no missing data.

Sample Data from a Random Sample (80%) Subset of Iris Data Set
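
One way such an 80% subset could be drawn is sketched below in base R. The seed, the object name iris_subset, and the use of simple random sampling with sample() are assumptions for illustration; the original seed and sampling method are not stated, so the rows drawn here will differ from those behind the statistics in this study. A stratified draw of 40 rows per species would be an equally plausible reading.

```r
# Assumption: arbitrary seed, chosen only to make this sketch repeatable.
set.seed(42)

data(iris)                                  # built-in data set, 150 rows
n_total  <- nrow(iris)
n_sample <- round(0.8 * n_total)            # 80% of 150 = 120 rows

# Simple random sample of row indices, without replacement
idx         <- sample(seq_len(n_total), size = n_sample)
iris_subset <- iris[idx, ]

head(iris_subset)             # first few sampled rows
table(iris_subset$Species)    # species counts need not be exactly balanced
```

Because the sample is not stratified, the three species may end up with slightly different counts in the subset.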

Summary Statistics for 80% Subset of Iris Data Set

##              vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## Sepal.Length    1 120 5.81 0.80   5.70    5.79 0.89 4.3 7.9   3.6  0.26    -0.66 0.07
## Sepal.Width     2 120 3.04 0.43   3.00    3.03 0.37 2.0 4.2   2.2  0.22    -0.03 0.04
## Petal.Length    3 120 3.73 1.74   4.35    3.75 1.70 1.0 6.9   5.9 -0.32    -1.45 0.16
## Petal.Width     4 120 1.20 0.78   1.30    1.19 1.04 0.1 2.5   2.4 -0.09    -1.38 0.07
## Species*        5 120 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00    -1.52 0.07
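
The layout of the table resembles the output of describe() from the psych package, although the source does not say which function produced it. As a rough illustration of how several of the columns are defined, the sketch below recomputes them in base R for the four numeric variables. The seed and the names iris_subset, num_cols and summary_stats are assumptions, and because the original sample is unknown the numbers will not match the table exactly.

```r
# Assumption: arbitrary seed; the sampled rows differ from the original study.
set.seed(42)
iris_subset <- iris[sample(nrow(iris), 120), ]

# Keep only the numeric measurement columns (drops Species)
num_cols <- iris_subset[, sapply(iris_subset, is.numeric)]

# One row per variable, one column per statistic
summary_stats <- t(sapply(num_cols, function(x) {
  c(n      = length(x),
    mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    min    = min(x),
    max    = max(x),
    range  = max(x) - min(x),
    se     = sd(x) / sqrt(length(x)))   # standard error of the mean
}))

round(summary_stats, 2)
```

The se column is the standard deviation divided by the square root of n. For Sepal.Length in the table, 0.80 / sqrt(120) is about 0.07, which agrees with the reported value.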

Pairs Plot of Variables of Iris Data Subset

Histogram of All Three Species

Plot All Species Density Curves and Means

Boxplot of Three Species

3D Scatterplot of Petal.Width, Petal.Length and Petal.Width

Parallel Coordinates Plot

K Means Clustering (3 Species) Elliptical Cluster
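
A minimal sketch of a three-cluster k-means fit in base R follows. The seed, the standardization with scale(), and the nstart value are assumptions rather than settings taken from this study, and the elliptical cluster plot itself is not reproduced because the plotting function used is not stated.

```r
# Assumption: arbitrary seed; the sampled rows differ from the original study.
set.seed(42)
iris_subset <- iris[sample(nrow(iris), 120), ]

# Standardize the four measurements (assumption: the source may have
# clustered the raw centimeter values instead)
features <- scale(iris_subset[, 1:4])

# k-means with 3 centers; nstart = 25 tries 25 random starts and keeps the best
km <- kmeans(features, centers = 3, nstart = 25)

# Cross-tabulate the cluster labels against the known species
table(cluster = km$cluster, species = iris_subset$Species)
```

The cluster numbers are arbitrary labels, so the cross-tabulation is read by looking for the dominant species in each row rather than by matching label 1 to the first species.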

Hierarchical Clustering
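
A minimal sketch of agglomerative hierarchical clustering in base R is given below. The seed, the Euclidean distance on standardized values, and the complete-linkage method are assumptions, since the study does not state which distance or linkage was used.

```r
# Assumption: arbitrary seed; the sampled rows differ from the original study.
set.seed(42)
iris_subset <- iris[sample(nrow(iris), 120), ]

# Pairwise Euclidean distances between standardized measurements
d <- dist(scale(iris_subset[, 1:4]))

# Agglomerative clustering (assumption: complete linkage)
hc <- hclust(d, method = "complete")

# Cut the tree into 3 groups and compare with the known species
groups <- cutree(hc, k = 3)
table(group = groups, species = iris_subset$Species)

# plot(hc, labels = FALSE) would draw the dendrogram
```

Other linkage choices accepted by hclust(), such as "average" or "ward.D2", can give noticeably different groupings on this data, so the linkage is worth reporting alongside the dendrogram.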