The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
The dataset contains 150 observations with 5 attributes: sepal length, sepal width, petal length, petal width and flower type (species).
## [1] "Number of Rows"
## [1] 150
## [1] "Number of Column"
## [1] 5
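The counts above can be reproduced with base R; a minimal sketch, assuming the built-in iris data frame:

```r
# Dimensions of the built-in iris data frame
print("Number of Rows")
print(nrow(iris))    # 150
print("Number of Columns")
print(ncol(iris))    # 5
```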
Of these, four are quantitative variables (sepal length, sepal width, petal length and petal width) and one is a categorical variable, the flower type or species, which takes the three values setosa, versicolor and virginica.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Observing the data, we see that each of the three species has exactly 50 observations.
Following is the summary of the dataset.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
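The structure and summary output above are the results of str() and summary(); a minimal sketch:

```r
str(iris)      # compact display of each column's type and first values
summary(iris)  # quartiles and mean per numeric column, counts for Species
```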
The above graph shows that there are no missing values in the iris data frame.
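The same check can be made numerically; a minimal base-R sketch (not necessarily the plot the author used):

```r
# Count the missing values in each column; all zeros means no NAs
colSums(is.na(iris))
```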
Histogram: A visual representation of how the data points are distributed, showing the frequency of observations in each interval.
Looking at the overall distributions, petal length and petal width do not follow a normal distribution, whereas sepal length and sepal width look closer to a uniform distribution.
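The histograms themselves are not reproduced here; a minimal base-R sketch that draws comparable plots:

```r
# Histograms of the four numeric features in a 2x2 layout
par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Sepal Length", xlab = "cm")
hist(iris$Sepal.Width,  main = "Sepal Width",  xlab = "cm")
hist(iris$Petal.Length, main = "Petal Length", xlab = "cm")
hist(iris$Petal.Width,  main = "Petal Width",  xlab = "cm")
par(mfrow = c(1, 1))
```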
Scatter Plot: Scatter plots place data points along a horizontal and a vertical axis to show how much one variable is affected by another. The relationship between two variables is called their correlation.
From the plot below, there appears to be a positive correlation between the length and width measurements across all species, with a particularly strong relationship between petal length and petal width.
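A minimal sketch of such a scatter plot in base R (the exact plotting code is an assumption):

```r
# Scatter-plot matrix of all four features, colored by species
pairs(iris[, 1:4], col = iris$Species, pch = 19)

# Petal length vs. petal width, the strongest relationship
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
```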
Box Plot: A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. The ends of the box are the upper and lower quartiles, so the box spans the interquartile range, and the median is marked by a line inside the box.
Even if the box looks perfectly balanced, we cannot conclude that the data points are evenly distributed. We first look at the position of the median within the box: if the median sits toward the lower end of the box, the data are positively skewed; if it sits toward the upper end, the data are negatively skewed. If the median is centered, we then read the whiskers, since the side toward which they stretch also signals whether the data are positively or negatively skewed.
The boxplot also gives us information on outliers: an outlier is an observation that is numerically distant from the rest of the data.
From the box plot below it is clear that sepal length for virginica and sepal width for setosa both have outliers (the dots that fall beyond the whiskers). While the other boxplots look fairly balanced, petal width for both setosa and versicolor is positively skewed, as the median lies at the lower end of the box.
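A minimal sketch of per-species box plots in base R:

```r
# Box plots of each feature grouped by species;
# points beyond the whiskers are drawn as outlier dots
par(mfrow = c(2, 2))
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal Length")
boxplot(Sepal.Width  ~ Species, data = iris, main = "Sepal Width")
boxplot(Petal.Length ~ Species, data = iris, main = "Petal Length")
boxplot(Petal.Width  ~ Species, data = iris, main = "Petal Width")
par(mfrow = c(1, 1))
```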
Density plot: A density plot is a smoothed version of a histogram. Formally, it is a non-negative function that integrates to 1. The area under the curve over any given interval is an estimate of the probability that the variable falls into that interval.
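A minimal sketch of a density plot in base R:

```r
# Kernel density estimate of petal length; the curve integrates to 1
plot(density(iris$Petal.Length),
     main = "Density of Petal Length", xlab = "cm")
```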
The idea behind a partitioning clustering method like k-means is to define clusters such that the total intra-cluster variation is minimized. Trying different values of k by trial and error is a tedious process and is not feasible for a large dataset. For instance, we ran k-means for several values of k (3, 4 and 9) after standardizing the iris dataset (centering and scaling), as shown below.
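A minimal sketch of those runs, assuming base R's kmeans() on the centered and scaled features (the seed and nstart are assumptions):

```r
# Standardize the four numeric features (center and scale)
iris.scaled <- scale(iris[, 1:4])

set.seed(42)  # assumed seed, for reproducibility only
for (k in c(3, 4, 9)) {
  km <- kmeans(iris.scaled, centers = k, nstart = 25)
  cat("k =", k, "-> cluster sizes:", km$size, "\n")
}
```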
A more systematic approach to determining the optimum value of k is the elbow method. Running the scaled iris dataset through the elbow method, we find that the likely values for k are 3 or 4.
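A minimal sketch of the elbow computation, reusing iris.scaled from the k-means sketch above: the total within-cluster sum of squares is plotted against k, and the "elbow" marks where extra clusters stop paying off.

```r
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```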
From the graph, the bend indicates that additional clusters beyond the third add little value. Clustering the dataset for both k = 3 and k = 4, we see that k = 3 gives clearer clusters than k = 4 and also yields a less complex model, so we take k = 3 for our final output, shown below.
Also, performing hierarchical clustering for k = 3:
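A minimal sketch, assuming hclust() on the scaled features (the complete-linkage method is an assumption):

```r
# Hierarchical clustering on Euclidean distances of the scaled features
hc <- hclust(dist(iris.scaled), method = "complete")
plot(hc, labels = FALSE, main = "Cluster Dendrogram")

# Cut the tree into 3 groups and compare with the true species
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```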
KNN classification decides the label of a sample by a majority vote among its K nearest neighbour samples. Since KNN relies on Euclidean distance, the measure is sensitive to magnitudes, so all features should be scaled to weigh in equally. Running kNN for different values of k on the scaled iris dataset, we observe the results below.
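A minimal sketch of one such run with class::knn; the 120/30 train/test split and the seed are assumptions, not the author's exact setup:

```r
library(class)

set.seed(123)                          # assumed seed
idx <- sample(seq_len(nrow(iris)), 120)
X   <- scale(iris[, 1:4])              # scale so all features weigh in equally

train <- X[idx, ];          test  <- X[-idx, ]
cl    <- iris$Species[idx]; truth <- iris$Species[-idx]

pred <- knn(train, test, cl, k = 3)    # majority vote among 3 nearest neighbours
table(True = truth, Predicted = pred)  # confusion matrix
mean(pred == truth) * 100              # percentage accuracy
```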
Confusion Matrix for k = 3
## Predicted
## True setosa versicolor virginica
## setosa 9 1 0
## versicolor 0 9 1
## virginica 0 1 9
Number of mismatched values for k = 3
## [1] 3
Percentage Accuracy for k = 3
## [1] 90
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 5
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
Number of mismatched values for k = 5
## [1] 1
Percentage Accuracy for k = 5
## [1] 96.66667
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa versicolor setosa setosa versicolor versicolor
## [13] versicolor versicolor virginica versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 7
## Predicted
## True setosa versicolor virginica
## setosa 9 1 0
## versicolor 0 9 1
## virginica 0 1 9
Number of mismatched values for k = 7
## [1] 3
Percentage Accuracy for k = 7
## [1] 90
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 15
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
Number of mismatched values for k = 15
## [1] 1
Percentage Accuracy for k = 15
## [1] 96.66667
As we increase K, the model looks at more points in the neighbourhood, which smooths the decision boundary and improves generalizability by reducing variance, at the cost of increased bias. According to the general rule of thumb, the value of k is around sqrt(N)/2, where N is the number of samples in the training dataset; in our case sqrt(150)/2 ≈ 6.1, which we round to k = 7 as our optimum value, giving a relatively high prediction accuracy with a simple model.
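A minimal sketch of the rule-of-thumb calculation and an accuracy sweep over the candidate values of k, reusing train, test, cl and truth from the kNN sketch above:

```r
# Rule of thumb: k ~ sqrt(N)/2; sqrt(150)/2 is about 6.1
sqrt(nrow(iris)) / 2

# Accuracy for each candidate k on the held-out test set
for (k in c(3, 5, 7, 15)) {
  pred <- knn(train, test, cl, k = k)
  cat("k =", k, "accuracy:", round(mean(pred == truth) * 100, 2), "%\n")
}
```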