The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.
It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
The dataset contains 150 observations with 5 attributes: sepal length, sepal width, petal length, petal width and flower type (species).
## [1] "Number of Rows"
## [1] 150
## [1] "Number of Column"
## [1] 5
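The counts above can be reproduced with base R; a minimal sketch, assuming the built-in iris data frame:

```r
# Dimensions of the built-in iris data frame
print("Number of Rows")
print(nrow(iris))    # 150
print("Number of Columns")
print(ncol(iris))    # 5
```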
Of these, four are quantitative variables (sepal length, sepal width, petal length and petal width) and one is a categorical variable, the flower type or species, which takes the three values setosa, versicolor and virginica.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Observing the data, we see that each of the three species has exactly 50 observations.
Following is the summary of the dataset.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
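The structure and summary output above are the results of str() and summary(); a minimal sketch:

```r
str(iris)      # compact display of each column's type and first values
summary(iris)  # quartiles and mean per numeric column, counts for Species
```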
The above graph shows that there are no missing values in the iris data frame.
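The same check can be made numerically; a minimal base-R sketch (not necessarily the plot the author used):

```r
# Count the missing values in each column; all zeros means no NAs
colSums(is.na(iris))
```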
Histogram: A visual representation of how the data points are distributed, showing the frequency of observations in each interval.
Looking at the overall distributions, petal length and petal width do not follow a normal distribution, whereas sepal length and sepal width look closer to a uniform distribution.
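The histograms themselves are not reproduced here; a minimal base-R sketch that draws comparable plots:

```r
# Histograms of the four numeric features in a 2x2 layout
par(mfrow = c(2, 2))
hist(iris$Sepal.Length, main = "Sepal Length", xlab = "cm")
hist(iris$Sepal.Width,  main = "Sepal Width",  xlab = "cm")
hist(iris$Petal.Length, main = "Petal Length", xlab = "cm")
hist(iris$Petal.Width,  main = "Petal Width",  xlab = "cm")
par(mfrow = c(1, 1))
```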
Scatter Plot: Scatter plots place data points along a horizontal and a vertical axis to show how much one variable is affected by another. The relationship between two variables is called their correlation.
From the plot below, there appears to be a positive correlation between the length and width measurements across all species, with a particularly strong relationship between petal length and petal width.
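A minimal sketch of such a scatter plot in base R (the exact plotting code is an assumption):

```r
# Scatter-plot matrix of all four features, colored by species
pairs(iris[, 1:4], col = iris$Species, pch = 19)

# Petal length vs. petal width, the strongest relationship
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
```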
Box Plot: A box plot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. The ends of the box are the upper and lower quartiles, so the box spans the interquartile range, and the median is marked by a line inside the box.
Even if the box looks perfectly balanced, we cannot conclude that the data points are evenly distributed. We first look at the position of the median within the box: if the median sits toward the lower end of the box, the data are positively skewed; if it sits toward the upper end, the data are negatively skewed. If the median is centered, we then read the whiskers, since the side toward which they stretch also signals whether the data are positively or negatively skewed.
The boxplot also gives us information on outliers: an outlier is an observation that is numerically distant from the rest of the data.
From the box plot below it is clear that sepal length for virginica and sepal width for setosa both have outliers (the dots that fall beyond the whiskers). While the other boxplots look fairly balanced, petal width for both setosa and versicolor is positively skewed, as the median lies at the lower end of the box.
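A minimal sketch of per-species box plots in base R:

```r
# Box plots of each feature grouped by species;
# points beyond the whiskers are drawn as outlier dots
par(mfrow = c(2, 2))
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal Length")
boxplot(Sepal.Width  ~ Species, data = iris, main = "Sepal Width")
boxplot(Petal.Length ~ Species, data = iris, main = "Petal Length")
boxplot(Petal.Width  ~ Species, data = iris, main = "Petal Width")
par(mfrow = c(1, 1))
```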
Density plot: A density plot is a smoothed version of a histogram. Formally, it is a non-negative function that integrates to 1. The area under the curve over any given interval is an estimate of the probability that the variable falls into that interval.
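A minimal sketch of a density plot in base R:

```r
# Kernel density estimate of petal length; the curve integrates to 1
plot(density(iris$Petal.Length),
     main = "Density of Petal Length", xlab = "cm")
```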
The idea behind a partitioning clustering method like k-means is to define clusters such that the total intra-cluster variation is minimized. Trying different values of k by trial and error is a tedious process and is not feasible for a large dataset. For instance, we ran k-means for several values of k (3, 4 and 9) after standardizing the iris dataset (centering and scaling), as shown below.
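A minimal sketch of those runs, assuming base R's kmeans() on the centered and scaled features (the seed and nstart are assumptions):

```r
# Standardize the four numeric features (center and scale)
iris.scaled <- scale(iris[, 1:4])

set.seed(42)  # assumed seed, for reproducibility only
for (k in c(3, 4, 9)) {
  km <- kmeans(iris.scaled, centers = k, nstart = 25)
  cat("k =", k, "-> cluster sizes:", km$size, "\n")
}
```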
A more systematic approach to determining the optimum value of k is the elbow method. Running the scaled iris dataset through the elbow method, we find that the likely values for k are 3 or 4.
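A minimal sketch of the elbow computation, reusing iris.scaled from the k-means sketch above: the total within-cluster sum of squares is plotted against k, and the "elbow" marks where extra clusters stop paying off.

```r
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k)
  kmeans(iris.scaled, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```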
From the graph, the bend indicates that additional clusters beyond the third add little value. Clustering the dataset for both k = 3 and k = 4, we see that k = 3 gives clearer clusters than k = 4 and also yields a less complex model, so we take k = 3 for our final output, shown below.
Also, performing hierarchical clustering for k = 3:
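A minimal sketch, assuming hclust() on the scaled features (the complete-linkage method is an assumption):

```r
# Hierarchical clustering on Euclidean distances of the scaled features
hc <- hclust(dist(iris.scaled), method = "complete")
plot(hc, labels = FALSE, main = "Cluster Dendrogram")

# Cut the tree into 3 groups and compare with the true species
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```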
KNN classification decides the label of a sample by a majority vote among its K nearest neighbour samples. Since KNN relies on Euclidean distance, the measure is sensitive to magnitudes, so all features should be scaled to weigh in equally. Running kNN for different values of k on the scaled iris dataset, we observe the results below.
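A minimal sketch of one such run with class::knn; the 120/30 train/test split and the seed are assumptions, not the author's exact setup:

```r
library(class)

set.seed(123)                          # assumed seed
idx <- sample(seq_len(nrow(iris)), 120)
X   <- scale(iris[, 1:4])              # scale so all features weigh in equally

train <- X[idx, ];          test  <- X[-idx, ]
cl    <- iris$Species[idx]; truth <- iris$Species[-idx]

pred <- knn(train, test, cl, k = 3)    # majority vote among 3 nearest neighbours
table(True = truth, Predicted = pred)  # confusion matrix
mean(pred == truth) * 100              # percentage accuracy
```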
Confusion Matrix for k = 3
## Predicted
## True setosa versicolor virginica
## setosa 9 1 0
## versicolor 0 9 1
## virginica 0 1 9
Number of mismatched values for k = 3
## [1] 3
Percentage Accuracy for k = 3
## [1] 90
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 5
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
Number of mismatched values for k = 5
## [1] 1
Percentage Accuracy for k = 5
## [1] 96.66667
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa versicolor setosa setosa versicolor versicolor
## [13] versicolor versicolor virginica versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 7
## Predicted
## True setosa versicolor virginica
## setosa 9 1 0
## versicolor 0 9 1
## virginica 0 1 9
Number of mismatched values for k = 7
## [1] 3
Percentage Accuracy for k = 7
## [1] 90
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa versicolor versicolor
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor virginica virginica versicolor virginica
## [25] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Confusion Matrix for k = 15
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
Number of mismatched values for k = 15
## [1] 1
Percentage Accuracy for k = 15
## [1] 96.66667
As we increase K, the model looks at more points in the neighbourhood, which smooths the decision boundary and improves generalizability by reducing variance, at the cost of increased bias. According to the general rule of thumb, the value of k is around sqrt(N)/2, where N is the number of samples in the training dataset; in our case sqrt(150)/2 ≈ 6.1, which we round to k = 7 as our optimum value, giving a relatively high prediction accuracy with a simple model.
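A minimal sketch of the rule-of-thumb calculation and an accuracy sweep over the candidate values of k, reusing train, test, cl and truth from the kNN sketch above:

```r
# Rule of thumb: k ~ sqrt(N)/2; sqrt(150)/2 is about 6.1
sqrt(nrow(iris)) / 2

# Accuracy for each candidate k on the held-out test set
for (k in c(3, 5, 7, 15)) {
  pred <- knn(train, test, cl, k = k)
  cat("k =", k, "accuracy:", round(mean(pred == truth) * 100, 2), "%\n")
}
```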