February 13, 2017

A Study of an 80% Random Sample of the "Iris" Data Set

Iris Data Set Data Exploration

About the Iris Data Set

  • The Iris Data Set is a multivariate data set used by R. A. Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis
  • The data was collected by Edgar Anderson in 1935 to quantify the morphological variation of Iris flowers of three related species.
  • Two of the three species were collected on the Gaspé Peninsula, Canada, all from the same pasture, picked on the same day, and measured at the same time by the same person using the same equipment
  • The Iris data set contains 150 observations and 5 variables (one categorical and four numeric) from three iris species: setosa, versicolor, and virginica
  • There are 50 observations from each of the three iris species, each recording sepal length, sepal width, petal length, and petal width, all numeric values in centimeters
  • There is no missing data

Sample Data from a Random Sample (80%) Subset of Iris Data Set
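
Drawing an 80% random sample from 150 rows yields 120 rows, which matches the n reported in the summary statistics. The following is a minimal Python sketch of such a draw, not the code used for these slides. The function name `sample_indices`, the fixed seed, and the use of simple (non-stratified) sampling are all assumptions made for illustration.

```python
import random

# Assumed sizes: 150 rows in the full data set, 80% kept in the subset.
N_ROWS = 150
SAMPLE_FRACTION = 0.8

def sample_indices(n_rows, fraction, seed=0):
    """Return sorted row indices for a random sample drawn without replacement."""
    rng = random.Random(seed)        # fixed seed (an assumption) for reproducibility
    k = round(n_rows * fraction)     # 150 * 0.8 -> 120 rows
    return sorted(rng.sample(range(n_rows), k))

subset_idx = sample_indices(N_ROWS, SAMPLE_FRACTION)
# Rows not drawn into the subset form the remaining 20%.
held_out_idx = sorted(set(range(N_ROWS)) - set(subset_idx))

print(len(subset_idx), len(held_out_idx))
```

One plausible use of the 30 remaining rows is as a held-out test set, although the slides do not spell this out.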

## Summary Statistics for 80% Subset of Iris Data set

##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 120 5.81 0.80   5.70    5.79 0.89 4.3 7.9   3.6  0.26
## Sepal.Width     2 120 3.04 0.43   3.00    3.03 0.37 2.0 4.2   2.2  0.22
## Petal.Length    3 120 3.73 1.74   4.35    3.75 1.70 1.0 6.9   5.9 -0.32
## Petal.Width     4 120 1.20 0.78   1.30    1.19 1.04 0.1 2.5   2.4 -0.09
## Species*        5 120 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.66 0.07
## Sepal.Width     -0.03 0.04
## Petal.Length    -1.45 0.16
## Petal.Width     -1.38 0.07
## Species*        -1.52 0.07
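
The `se` column in the table above can be cross-checked against the `sd` and `n` columns, since the standard error of the mean is the standard deviation divided by the square root of the sample size. The short Python sketch below does this for the four numeric variables. It uses only the rounded values printed in the table, so agreement is checked to within rounding. The helper name `standard_error` is introduced here for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

N = 120  # sample size reported in the table

# (sd, reported se) copied from the summary table above
reported = {
    "Sepal.Length": (0.80, 0.07),
    "Sepal.Width": (0.43, 0.04),
    "Petal.Length": (1.74, 0.16),
    "Petal.Width": (0.78, 0.07),
}

for name, (sd, se) in reported.items():
    computed = standard_error(sd, N)
    print(f"{name}: computed {computed:.2f}, reported {se:.2f}")
```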

Pairs Plot of Variables of Iris Data Subset

Histogram of All Three Species

Plot All Species Density Curves and Means

Boxplot of Three Species

3D Scatterplot of Petal.Width, Petal.Length and Petal.Width

Parallel Coordinates Plot

K Means Clustering

1. Clustering is an unsupervised learning technique  
2. It is widely used for exploratory data analysis  
3. K-means clustering:  
   K: the number of clusters the data should be separated into  
   means: each cluster's centroid is the mean of the data points assigned to it, with points assigned by Euclidean distance  

Step 1 Randomly choose k data points as the initial centroids
Step 2 Calculate the Euclidean distance between each data point and each centroid, and assign the point to the cluster of its nearest centroid
Step 3 Calculate the mean of the data points in each cluster and make it the new centroid
Step 4 Repeat Steps 2 and 3 until the centroids no longer change or some other stopping condition has been met
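
The four steps above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code behind these slides. The function name `kmeans`, the `max_iter` and `seed` parameters, and the small two-dimensional sample points are all hypothetical. The points are made-up values, not rows from the Iris data.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments (and therefore centroids) no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical (length, width) pairs forming three loose groups.
points = [
    (1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
    (4.5, 1.5), (4.7, 1.4), (4.4, 1.3),
    (6.0, 2.5), (5.9, 2.3), (6.1, 2.2),
]
centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can produce different final clusters. A common remedy is to run the algorithm several times and keep the result with the smallest total within-cluster distance.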

K-Means on Iris Data Set with K = 3

Hierarchical Clustering

Conclusion

  • The histogram of petal length in the iris data set separates at least the "setosa" species from the other two species. Sepal length and sepal width show approximately normal distributions, suggesting that the species cannot be easily differentiated on these measurements. Petal length and petal width show skewed, uneven distributions
  • The density plot gave the same result as the histogram. It shows substantial overlap, yet some differentiation, between the virginica and versicolor species. Further analysis is required to determine whether these two species can be distinguished from each other
  • The box plot shows that petal length in "setosa" is shorter than in the "versicolor" and "virginica" species. The range of petal length values for setosa is much smaller than for the other two species
  • The scatterplot of petal width against petal length shows that these two attributes have a linear relationship
  • K-means and hierarchical clustering on the four attributes indicate that the "versicolor" and "virginica" species are more closely related to each other than either is to "setosa"
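
One way to put a number on the linear relationship between petal length and petal width noted above is the Pearson correlation coefficient. The sketch below computes it in plain Python. The helper name `pearson_r` is introduced here, and the six value pairs are hypothetical illustrations, not rows taken from the Iris subset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical petal measurements in centimeters (illustrative only).
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
petal_width = [0.2, 0.2, 1.5, 1.4, 2.3, 2.2]

r = pearson_r(petal_length, petal_width)
print(f"r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear association. It does not by itself show that the relationship holds within each species.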

Conclusion Cont'd

  • Petal length is a promising attribute to analyze for species differentiation
  • The pattern and classification were successfully predicted on the test set