1. Executive Summary

The main purpose of this study is to obtain a better understanding of the Iris flower data set through exploratory data analysis (EDA) and to estimate the likely performance of a classification model on out-of-sample data. The Iris flower data set used in our study is one of the best known data sets in the pattern recognition literature. It is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. The data set contains 3 classes (setosa, versicolor, and virginica) of 50 instances each, where each class refers to a species of iris plant. Four features were measured on each sample: the length and the width of the sepals and petals, in centimeters.

Section 2 presents the exploratory data analysis. First, summary statistics of the entire data set were obtained. Second, the means and standard deviations of the four quantitative variables were calculated within each species; these indicate that petal length and petal width differ substantially across species, with setosa standing apart most clearly. To make this point more concrete, histograms, density plots, and scatter plots of deliberately chosen variables are shown in Section 2.2. The visualizations reveal a strong classification criterion.

In Section 3, the data were split into a training set and a test set, and the K-nearest neighbor (KNN) method was used to make out-of-sample predictions of the classification. KNN classifies each observation by a majority vote among its K nearest neighbors in feature space. The last attribute of the data set, Species, is the target variable, i.e., the variable we are predicting. Since the choice of K has a drastic effect on the resulting KNN classifier, different K values were studied. When K is small, the decision boundary is overly flexible, corresponding to a classifier with low bias but very high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear, corresponding to a low-variance but high-bias classifier.

2. EDA of the Data

2.1 Summary Statistics

The summary() function in R is used to obtain summary statistics of the data set: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numerical variable, and the count for each level of the only categorical variable, Species. The result is displayed in Table 1.

Table 1: Summary Statistics of the Iris flower data set

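For reference, Table 1 is produced by a single call on the built-in iris data frame:

summary(iris)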
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  

Besides the summary statistics of the entire data set, we are also interested in the distribution of each variable within each species. Table 2 reports the mean and standard deviation of petal length by species: setosa differs markedly from the other two species, and the same pattern holds for petal width.

Table 2: Mean and standard deviation within each species

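The grouped summary below is a minimal sketch of how Table 2 can be computed with dplyr; the exact call is inferred from the column names in the output:

library(dplyr)
iris %>%
    group_by(Species) %>%
    summarise(mean(Petal.Length), sd(Petal.Length))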
## # A tibble: 3 x 3
##      Species `mean(Petal.Length)` `sd(Petal.Length)`
##       <fctr>                <dbl>              <dbl>
## 1     setosa                1.462          0.1736640
## 2 versicolor                4.260          0.4699110
## 3  virginica                5.552          0.5518947

2.2 Visualization

2.2.1 Histogram and Density Plot

The histograms and density plots of the four variables were plotted in R. These plots show that sepal length and sepal width do not vary much across species; petal length and petal width, however, differ considerably between species. As shown in Figure 1 and Figure 2, the petal measurements of versicolor and virginica are approximately normally distributed with different means and similar variability, while setosa lies far away from both.

# plotting histogram and density plot

library(dplyr)
library(ggplot2)
iris %>%
    ggplot(aes(x = Petal.Width, fill = Species)) +
    geom_histogram(bins = 30)   # make the default bin count explicit

Figure 1: Histogram of petal width

iris %>%
    ggplot(aes(x = Petal.Length, fill = Species)) +
    geom_density(alpha = 0.5)

Figure 2: Density plot of petal length

2.2.2 Scatter Plots

Scatter plots give us a good idea of how much one variable varies with another; in other words, they are very helpful for seeing whether two variables are correlated. As shown in Figure 3, sepal length and sepal width are highly correlated for the setosa flowers, while the correlation is noticeably weaker for the virginica and versicolor flowers: their points are more spread out over the graph and do not form as tight a cluster as the setosa points. The scatter plot of petal width against petal length, displayed in Figure 4, tells a similar story: there is a positive correlation between petal length and petal width for all three species in the iris data set. More importantly, the scatter plots reveal a strong classification criterion: setosa has the smallest petals, versicolor has medium-sized petals, and virginica has the largest petals.

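These correlations can be quantified directly. The sketch below is our own addition (the column names r_sepal and r_petal are arbitrary labels); it computes the per-species Pearson correlations underlying Figures 3 and 4:

library(dplyr)
iris %>%
    group_by(Species) %>%
    summarise(r_sepal = cor(Sepal.Length, Sepal.Width),
              r_petal = cor(Petal.Length, Petal.Width))
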
# scatter plot

iris %>%
    ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
    geom_point(size = 3)

Figure 3: Sepal Width versus Sepal Length by Species

iris %>%
    ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
    geom_point(size = 3)

Figure 4: Petal Width versus Petal Length by Species

3. K-Nearest Neighbor

3.1 Normalization

As part of data preparation, data sometimes need to be normalized so that variables with larger ranges do not dominate the distance calculations. In our case, the Iris data set does not need to be normalized: all four attributes are measured in centimeters, and all values lie between 0.1 and 7.9, so the features are on comparable scales.
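
For data sets where the scales do differ, a simple min-max rescaling could be applied first. The sketch below is illustrative only and is not used in this study; the helper function normalize is our own:

# min-max normalization to [0, 1] (illustrative; not applied in this study)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))
summary(iris_norm)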

3.2 Splitting Data Set into Training and Test Sets

Random sampling is used to take 60% of the original data set as the training set. In the training set, the species counts are 34 setosa, 27 versicolor, and 29 virginica. The numbers of instances of the three species are roughly equal, so the predictions do not favor one class over another.

3.3 K-Nearest Neighbor in R

As stated above, the iris flower data set was split into a training set (60%) and a test set (40%) to assess classification accuracy. There is no strong relationship between the training error rate and the test error rate. With K=1, the KNN training error rate is necessarily 0, since each training point is its own nearest neighbor, but the test error rate is quite high compared to those achieved by other K values. In Figure 5, the KNN training and test errors are plotted as a function of K. As K increases, the method becomes less flexible, and the training error rate consistently increases as the flexibility decreases. The test error rate, however, exhibits the characteristic U-shape: it declines at first, reaches a minimum at approximately K=6, and then increases again as the method becomes excessively inflexible.

########### KNN #############
# randomly sample 60% of the rows as the training set
set.seed(12345)
allrows <- 1:nrow(iris)
trainrows <- sample(allrows, replace = FALSE, size = 0.6 * length(allrows))
train_iris <- iris[trainrows, 1:4]   # predictor columns
train_label <- iris[trainrows, 5]    # target variable: Species
table(train_label)
## train_label
##     setosa versicolor  virginica 
##         34         27         29
test_iris <- iris[-trainrows, 1:4]   # remaining 40% as the test set
test_label <- iris[-trainrows, 5]
table(test_label)
## test_label
##     setosa versicolor  virginica 
##         16         23         21
library(class)

# training error for K = 1..30 (predicting the training set itself)
error.train <- numeric(30)
for (k in 1:30) {
    pred_iris <- knn(train = train_iris, test = train_iris, cl = train_label, k = k)
    error.train[k] <- 1 - mean(pred_iris == train_label)
}

# test error for K = 1..30 (predicting the held-out test set)
error.test <- numeric(30)
for (k in 1:30) {
    pred_iris <- knn(train = train_iris, test = test_iris, cl = train_label, k = k)
    error.test[k] <- 1 - mean(pred_iris == test_label)
}

plot(error.train, type = "o", ylim = c(0, 0.15), col = "blue",
     xlab = "K values", ylab = "Misclassification errors")
lines(error.test, type = "o", col = "red")
legend("topright", legend = c("Training error", "Test error"),
       col = c("blue", "red"), lty = 1)

Figure 5: Training and test error using KNN with different K's

Based on Figure 5, K=6 yields the smallest test error rate. Therefore, we chose K=6 as the number of neighbors for KNN in our prediction. The classification result for K=6 is shown in the scatter plot in Figure 6. Since the true values of the target variable are known for the test set, we can examine which points were misclassified. Comparing the scatter plots in Figure 6 and Figure 7 reveals a couple of such points: one versicolor is misclassified as virginica, and one virginica is misclassified as versicolor.

# final model: KNN with K = 6
pred_iris <- knn(train = train_iris, test = test_iris, cl = train_label, k = 6)
result <- cbind(test_iris, pred_iris)          # test set with predicted labels
combinetest <- cbind(test_iris, test_label)    # test set with true labels

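A confusion matrix gives a compact numerical view of the same comparison. The call below is our own addition; the dimension names predicted and actual are just labels:

# cross-tabulate predicted vs. true species on the test set
table(predicted = pred_iris, actual = test_label)
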
result %>%
    ggplot(aes(x = Petal.Width, y = Petal.Length, color = pred_iris)) +
    geom_point(size = 3)

Figure 6: Scatter plot based on predictions

combinetest %>%
    ggplot(aes(x = Petal.Width, y = Petal.Length, color = test_label)) +
    geom_point(size = 3)

Figure 7: Scatter plot of the real test data set