Case study: Iris flower

1.0 Data set Overview

1.1 Background:

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

Iris Flower

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.
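The separability statement above can be probed with a quick check on a single feature. The sketch below is only illustrative: it uses petal length alone (a choice made here, not taken from Fisher's paper) to show that a simple threshold already isolates setosa, while the ranges of versicolor and virginica overlap. Overlap on one feature does not by itself prove that two classes are not linearly separable in all four dimensions, so treat the second result as suggestive only.

```r
# Petal length alone: is there a gap between setosa and the other two species?
is_setosa  <- iris$Species == "setosa"
setosa_max <- max(iris$Petal.Length[is_setosa])   # largest setosa petal
others_min <- min(iris$Petal.Length[!is_setosa])  # smallest non-setosa petal
gap_exists <- setosa_max < others_min             # TRUE means a threshold separates setosa

# Versicolor vs. virginica: do their petal length ranges overlap?
versicolor_range <- range(iris$Petal.Length[iris$Species == "versicolor"])
virginica_range  <- range(iris$Petal.Length[iris$Species == "virginica"])
ranges_overlap   <- versicolor_range[2] >= virginica_range[1]

c(setosa_max = setosa_max, others_min = others_min)
c(gap_exists = gap_exists, ranges_overlap = ranges_overlap)
```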

1.2 Attribute Information:

  1. sepal length (in cm)
  2. sepal width (in cm)
  3. petal length (in cm)
  4. petal width (in cm)
  5. class: Iris Setosa, Iris Versicolour, Iris Virginica

1.3 General Overview:

Metadata Information

Dataset

library(knitr); library(kableExtra)  # kable(), kable_styling(), scroll_box() and the %>% pipe
kable(iris) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica

2.0 Exploratory Data Analysis

We divide the data into two parts for training and testing: the training data is 80% of the whole data set and the testing data is the remaining 20%. A typical workflow would run the EDA on the training data only.

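set.seed(123)  # arbitrary seed so that the split is reproducible
index <- sample(seq_len(nrow(iris)), size = floor(0.80 * nrow(iris)))
iris_train <- iris[index, ]   # 80% of the rows for training
iris_test  <- iris[-index, ]  # remaining 20% for testing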

Sampling like this is useful for cross-validation and model assessment. Because this is only an EDA, we may not need to split the data into training and test sets, so we use the full data set for the EDA below.
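As a rough illustration of how sampling supports cross-validation, the sketch below assigns every row to one of k folds and reports the training and test sizes for each fold. The seed and the choice of k = 5 are assumptions made for this example; the report itself does not fit a model or fix a number of folds.

```r
set.seed(42)  # arbitrary seed, used only to make the example reproducible
k <- 5        # number of folds (assumed here, not specified in the report)

# Shuffle a balanced vector of fold labels so each row gets a random fold.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold, hold it out as the test set and train on the remaining rows.
fold_sizes <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  c(train = nrow(train), test = nrow(test))
})
fold_sizes
```

With 150 rows and 5 folds, every fold holds out 30 rows and trains on the other 120.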

2.1 Checking count of data species

dim(iris)  # check dimensions: iris has 150 observations and 5 variables
## [1] 150   5

Observation

  • 150 observations to be analysed across 5 attributes.
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Observation

  • 4 continuous numeric attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
  • 1 nominal categorical attribute, stored as a factor: Species
table(iris$Species) #Have equal number of each
## 
##     setosa versicolor  virginica 
##         50         50         50

Observation

  • Each species has 50 observations
colSums(is.na(iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

Observation

  • No Missing values are present
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Observation

  • No abnormal values
  • Table is clean and tidy
  • Observations are present as rows
  • Attributes are present as columns
  • One type of observational unit per table
df<-data.frame(mini=sapply(iris[,1:4],min),maxi=sapply(iris[,1:4],max))
df$range<-df$maxi-df$mini
df
##              mini maxi range
## Sepal.Length  4.3  7.9   3.6
## Sepal.Width   2.0  4.4   2.4
## Petal.Length  1.0  6.9   5.9
## Petal.Width   0.1  2.5   2.4

Observation

The ranges (maximum minus minimum) of the attributes follow this order: Petal.Length > Sepal.Length > Sepal.Width = Petal.Width
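The same ordering can be obtained in one step with base R. This is just an alternative sketch of the calculation above: `range()` returns the minimum and maximum, and `diff()` turns that pair into the spread.

```r
# Spread (max - min) of each numeric column, largest first.
value_range <- sapply(iris[, 1:4], function(x) diff(range(x)))
sort(value_range, decreasing = TRUE)
```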

2.2 Analysis based on visualization


2.2.1 Number of observations

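library(dplyr)      # group_by(), summarise(), %>%
library(ggplot2)    # plotting
library(gridExtra)  # grid.arrange(), used in the sections below

iris %>%
  group_by(Species) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = Species, y = count)) +
  geom_bar(stat = "identity", fill = "#3990E5", alpha = 0.5) +
  labs(x = "Species",
       y = "count",
       title = "Number of observations per species",
       subtitle = "Segregated by type")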

Each of the 3 species has the same number of observations in the data.

2.2.2 Distribution Analysis

meanMedian<-iris %>%
  group_by(Species) %>%
  summarise(Spl.len.mean=mean(Sepal.Length),spl.len.median=median(Sepal.Length),
            Spl.Width.mean=mean(Sepal.Width),Spl.Width.median=median(Sepal.Width),
            ptl.len.mean=mean(Petal.Length),ptl.len.median=median(Petal.Length),
            ptl.width.mean=mean(Petal.Width),ptl.Width.median=median(Petal.Width))
kable(meanMedian)  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
Species Spl.len.mean spl.len.median Spl.Width.mean Spl.Width.median ptl.len.mean ptl.len.median ptl.width.mean ptl.Width.median
setosa 5.006 5.0 3.428 3.4 1.462 1.50 0.246 0.2
versicolor 5.936 5.9 2.770 2.8 4.260 4.35 1.326 1.3
virginica 6.588 6.5 2.974 3.0 5.552 5.55 2.026 2.0

Observation

  • Within each species, the mean of every attribute is close to its median (all gaps are below 0.1 cm).
  • Petal length shows the largest gap between mean and median (4.260 vs 4.35 for versicolor).

The histograms in the following sections give a clearer picture of the distributions.
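To make the mean versus median comparison easier to read than the wide table above, the gap can be computed directly. The following is a minimal base R sketch; the name `gap` is introduced here for illustration only.

```r
# Mean minus median for every numeric attribute, per species.
# Positive values hint at a longer right tail, negative at a longer left tail.
gap <- aggregate(. ~ Species, data = iris,
                 FUN = function(x) mean(x) - median(x))
gap

# Largest absolute gap across all species and attributes.
max(abs(as.matrix(gap[, -1])))
```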

2.2.3 Histogram: All species

  sl<-ggplot(iris, aes(Sepal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Sepal length")
  
 sw<-ggplot(iris, aes(Sepal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Sepal Width")
 
  pl<-ggplot(iris, aes(Petal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Petal length")
   
  pw<- ggplot(iris, aes(Petal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Petal Width")

  grid.arrange(sl,sw,pl,pw,ncol=2,nrow=2)

Observation

When all 3 species are pooled, petal length and petal width show bimodal density plots. The mean and median also differ noticeably for petal length (3.758 vs 4.350) and, to a lesser extent, for petal width (1.199 vs 1.300).

We can gain more insight by looking at each species separately.
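The histogram code in this report repeats the same five layers for every attribute and species. One way to cut the repetition is a small helper function. The sketch below is hypothetical (`plot_hist` is not part of the original report) and assumes a recent ggplot2 release (3.4 or later), where `after_stat(density)` and `linewidth` replace the older `..density..` and `size` spellings used elsewhere in this document.

```r
library(ggplot2)

# Hypothetical helper: density-scaled histogram with a density curve and
# dashed lines at the mean (grey) and median (green) of one numeric column.
plot_hist <- function(data, var, title = var) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.3,
                   alpha = 0.5, color = "darkgray", fill = "#3990E5") +
    geom_density(alpha = 0.7, color = "dodgerblue4", linewidth = 1) +
    geom_vline(xintercept = mean(data[[var]]),
               color = "grey28", linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = median(data[[var]]),
               color = "green", linetype = "dashed", linewidth = 1) +
    labs(title = title, x = var)
}

# Build the four setosa panels in one call instead of four copied blocks.
measure_cols <- names(iris)[1:4]
setosa_plots <- lapply(measure_cols, function(v) {
  plot_hist(subset(iris, Species == "setosa"), v, paste("Setosa", v))
})
```

The resulting list can be laid out with `gridExtra::grid.arrange(grobs = setosa_plots, ncol = 2)`, which matches the 2 x 2 grids used in this report.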

2.2.4 Histogram: Setosa

iris_s<-subset(iris,Species=="setosa")

sl_Set<-ggplot(iris_s, aes(Sepal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Setosa Sepal length")
  
 sw_Set<-ggplot(iris_s, aes(Sepal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Setosa Sepal Width")
 
  pl_Set<-ggplot(iris_s, aes(Petal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Setosa Petal length")
   
  pw_Set<- ggplot(iris_s, aes(Petal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Setosa Petal Width")

  grid.arrange(sl_Set,sw_Set,pl_Set,pw_Set,ncol=2,nrow=2)

Observation

For Setosa, all attributes except petal width have roughly bell-shaped density plots; the petal width distribution appears right-skewed.

2.2.5 Histogram: Versicolor

iris_ve<-subset(iris,Species=="versicolor")

sl_ve<-ggplot(iris_ve, aes(Sepal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Versicolor Sepal length")
  
 sw_ve<-ggplot(iris_ve, aes(Sepal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Versicolor Sepal Width")
 
  pl_ve<-ggplot(iris_ve, aes(Petal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Versicolor Petal length")
   
  pw_ve<- ggplot(iris_ve, aes(Petal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "Versicolor Petal Width")

  grid.arrange(sl_ve,sw_ve,pl_ve,pw_ve,ncol=2,nrow=2)

Observation

Compared with Setosa, Versicolor seems to have more skewed distributions, especially for petal length and petal width.

2.2.6 Histogram: Virginica

iris_vi<-subset(iris,Species=="virginica")

sl_vi<-ggplot(iris_vi, aes(Sepal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "virginica Sepal length")
  
 sw_vi<-ggplot(iris_vi, aes(Sepal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Sepal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Sepal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "virginica Sepal Width")
 
  pl_vi<-ggplot(iris_vi, aes(Petal.Length))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Length))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Length))
             , color="green", linetype="dashed", size=1) +
  labs(title = "virginica Petal length")
   
  pw_vi<- ggplot(iris_vi, aes(Petal.Width))+
  geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
  geom_density(alpha=0.7,color="dodgerblue4",size=1) +
  geom_vline(aes(xintercept=mean(Petal.Width))
             , color="grey28", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(Petal.Width))
             , color="green", linetype="dashed", size=1) +
  labs(title = "virginica Petal Width")

  grid.arrange(sl_vi,sw_vi,pl_vi,pw_vi,ncol=2,nrow=2)

Observation

Similar to Versicolor, Virginica also seems to have skewed distributions, especially for petal length and petal width.

2.3 Detecting Outliers with Boxplot

###Checking for outliers
boxplot(iris[,1:4], col=c("red", "blue", "yellow", "grey"))

slb<-ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) + 
    geom_boxplot(
        alpha=0.3,
        outlier.colour="red",
        outlier.fill="red",
        outlier.size=2) +
    theme(legend.position="none")

swb<-ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) + 
    geom_boxplot(
        alpha=0.3,
        outlier.colour="red",
        outlier.fill="red",
        outlier.size=2) +
    theme(legend.position="none")

plb<-ggplot(iris, aes(x=Species, y=Petal.Length, fill=Species)) + 
    geom_boxplot(
        alpha=0.3,
        outlier.colour="red",
        outlier.fill="red",
        outlier.size=2) +
    theme(legend.position="none")

plw<-ggplot(iris, aes(x=Species, y=Petal.Width, fill=Species)) + 
    geom_boxplot(
        alpha=0.3,
        outlier.colour="red",
        outlier.fill="red",
        outlier.size=2) +
    theme(legend.position="none")
  
grid.arrange(slb,swb,plb,plw,ncol=2,nrow=2)

Observation

  • Setosa has the most outliers (4 in total) compared with the other species.
  • Sepal Length: Virginica has 1 outlier
  • Sepal Width: Setosa has 1 outlier
  • Petal Length: Setosa has 1 outlier, Versicolor has 1 outlier
  • Petal Width: Setosa has 2 outliers
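Reading outlier counts off a plot is error-prone, so it can help to count them programmatically. The sketch below uses base R's `boxplot.stats()`, which flags points lying more than 1.5 times the box length beyond the hinges. This is an assumption about how to reproduce the plot: `geom_boxplot()` computes its quartiles slightly differently, and overlapping points are drawn on top of each other, so the numbers may not match the visual counts above exactly.

```r
# Number of points flagged as outliers by the 1.5 * IQR rule for one vector.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Rows are species, columns are the four measurements.
outlier_counts <- sapply(iris[, 1:4], function(col) {
  tapply(col, iris$Species, count_outliers)
})
outlier_counts
```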

Scatter plot

iris1<-iris
iris1$ID <- seq.int(nrow(iris1))
sls<-ggplot(iris1, aes(x=Sepal.Length, y=ID,  color=Species)) + geom_point() +labs(title = "Sepal length")

sws<-ggplot(iris1, aes(x=Sepal.Width, y=ID,  color=Species)) + geom_point() +labs(title = "Sepal Width")

pls<-ggplot(iris1, aes(x=Petal.Length, y=ID,  color=Species)) + geom_point() +labs(title = "Petal Length")

pws<-ggplot(iris1, aes(x=Petal.Width, y=ID,  color=Species)) + geom_point() +labs(title = "Petal Width")

grid.arrange(sls,sws,pls,pws,ncol=2,nrow=2)

Observation

  • Sepal Length: minor overlap between the 3 species
  • Sepal Width: considerably higher overlap between the 3 species
  • Petal Length: clear distinction between the 3 species
  • Petal Width: clear distinction between the 3 species

Ranking by how well each attribute separates the species in the scatter plots: Petal Width > Petal Length > Sepal Length > Sepal Width
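The ranking above is a visual judgement. One possible way to put a number on it is the one-way ANOVA F statistic of each attribute against species: a larger F means the species means are further apart relative to the spread within each species. This metric is swapped in here for illustration and is not the method used in the report, so the order of the two petal attributes may differ from the visual ranking.

```r
# One-way ANOVA F statistic for each measurement, grouped by species.
f_stat <- sapply(iris[, 1:4], function(x) {
  fit <- aov(x ~ iris$Species)
  summary(fit)[[1]][["F value"]][1]  # first row is the Species term
})
sort(f_stat, decreasing = TRUE)
```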

2.4 Correlation Analysis

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
C<-cor(iris[,c(1:4)])
c0<-corrplot(C,  method = "square",type="lower")


Observation

  • Strong positive correlation between sepal length and petal length.
  • Strong positive correlation between sepal length and petal width.

Inference: Flowers with longer sepals tend to have longer and wider petals.

  • Negative correlation between sepal width and petal length. Inference: as sepal width increases, petals tend to be shorter.

  • Negative correlation between sepal width and petal width. Inference: as sepal width increases, petals tend to be narrower.

  • Strong positive correlation between petal length and petal width. Inference: as petal length increases, petals tend to be wider.
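The correlations above are computed on all 150 flowers pooled together. Because the species form separate clusters, a pooled correlation can differ in size, or even in sign, from the correlation inside each species. The sketch below compares the two for sepal width and petal length; it is an additional check suggested here, not part of the original analysis.

```r
# Correlation across all flowers, ignoring species.
pooled <- cor(iris$Sepal.Width, iris$Petal.Length)

# The same correlation computed separately within each species.
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Width, d$Petal.Length)
})

round(c(pooled = pooled, by_species), 3)
```

If the within-species values have a different sign from the pooled value, the negative pooled correlation mostly reflects differences between species rather than a relationship within any one species.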

2.5 Density Ridgeline plots

The density ridgeline plot is an alternative to the standard geom_density() function that can be useful for visualizing changes in distributions, of a continuous variable, over time or space. Ridgeline plots are partially overlapping line plots that create the impression of a mountain range.

library(ggridges)  # provides theme_ridges() and geom_density_ridges()
theme_set(theme_ridges())

p1<-ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(aes(fill = Species)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

p2<-ggplot(iris, aes(x = Sepal.Width, y = Species)) +
  geom_density_ridges(aes(fill = Species)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))


p3<-ggplot(iris, aes(x = Petal.Length, y = Species)) +  # petal length; p4 covers petal width
  geom_density_ridges(aes(fill = Species)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))


p4<-ggplot(iris, aes(x = Petal.Width, y = Species)) +
  geom_density_ridges(aes(fill = Species)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
library(ggpubr)
## Loading required package: magrittr
p1
## Picking joint bandwidth of 0.181

p2
## Picking joint bandwidth of 0.13

p3

p4
## Picking joint bandwidth of 0.075

These plots show not only the location and spread of each distribution but also how much the species overlap. For example, the species overlap considerably in sepal width and sepal length, but much less in petal width and petal length.

Thank you!