The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
Iris Flower
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Predicted attribute: class of iris plant.
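
The statement that one class is linearly separable from the other two can be checked directly on R's built-in `iris` data frame. The sketch below is only an illustration: the 2.5 cm cut on `Petal.Length` is a value assumed for this example, not one taken from Fisher's paper.

```r
# Check whether a single cut on Petal.Length isolates setosa.
# The 2.5 cm threshold is an assumed, illustrative value.
threshold <- 2.5
is_setosa <- iris$Species == "setosa"

setosa_below <- all(iris$Petal.Length[is_setosa] < threshold)
others_above <- all(iris$Petal.Length[!is_setosa] > threshold)

setosa_below && others_above  # TRUE means the cut separates setosa perfectly
```

A single threshold on one feature is the simplest possible linear boundary, so a `TRUE` here is enough to show separability for setosa. It says nothing about versicolor versus virginica.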
Metadata Information
Dataset
library(knitr); library(kableExtra) # kable(), kable_styling(), scroll_box() and the %>% pipe
kable(iris) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 4.8 | 3.0 | 1.4 | 0.1 | setosa |
| 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |
| 5.1 | 3.7 | 1.5 | 0.4 | setosa |
| 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 5.0 | 3.0 | 1.6 | 0.2 | setosa |
| 5.0 | 3.4 | 1.6 | 0.4 | setosa |
| 5.2 | 3.5 | 1.5 | 0.2 | setosa |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 4.8 | 3.1 | 1.6 | 0.2 | setosa |
| 5.4 | 3.4 | 1.5 | 0.4 | setosa |
| 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 4.9 | 3.6 | 1.4 | 0.1 | setosa |
| 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 4.8 | 3.0 | 1.4 | 0.3 | setosa |
| 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 4.6 | 3.2 | 1.4 | 0.2 | setosa |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
| 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
| 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
| 5.9 | 3.0 | 4.2 | 1.5 | versicolor |
| 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
| 6.1 | 2.9 | 4.7 | 1.4 | versicolor |
| 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
| 6.7 | 3.1 | 4.4 | 1.4 | versicolor |
| 5.6 | 3.0 | 4.5 | 1.5 | versicolor |
| 5.8 | 2.7 | 4.1 | 1.0 | versicolor |
| 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
| 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
| 5.9 | 3.2 | 4.8 | 1.8 | versicolor |
| 6.1 | 2.8 | 4.0 | 1.3 | versicolor |
| 6.3 | 2.5 | 4.9 | 1.5 | versicolor |
| 6.1 | 2.8 | 4.7 | 1.2 | versicolor |
| 6.4 | 2.9 | 4.3 | 1.3 | versicolor |
| 6.6 | 3.0 | 4.4 | 1.4 | versicolor |
| 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
| 6.7 | 3.0 | 5.0 | 1.7 | versicolor |
| 6.0 | 2.9 | 4.5 | 1.5 | versicolor |
| 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
| 5.5 | 2.4 | 3.7 | 1.0 | versicolor |
| 5.8 | 2.7 | 3.9 | 1.2 | versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
| 5.4 | 3.0 | 4.5 | 1.5 | versicolor |
| 6.0 | 3.4 | 4.5 | 1.6 | versicolor |
| 6.7 | 3.1 | 4.7 | 1.5 | versicolor |
| 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
| 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 5.5 | 2.5 | 4.0 | 1.3 | versicolor |
| 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
| 6.1 | 3.0 | 4.6 | 1.4 | versicolor |
| 5.8 | 2.6 | 4.0 | 1.2 | versicolor |
| 5.0 | 2.3 | 3.3 | 1.0 | versicolor |
| 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
| 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
| 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
| 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
| 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 7.1 | 3.0 | 5.9 | 2.1 | virginica |
| 6.3 | 2.9 | 5.6 | 1.8 | virginica |
| 6.5 | 3.0 | 5.8 | 2.2 | virginica |
| 7.6 | 3.0 | 6.6 | 2.1 | virginica |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 6.7 | 2.5 | 5.8 | 1.8 | virginica |
| 7.2 | 3.6 | 6.1 | 2.5 | virginica |
| 6.5 | 3.2 | 5.1 | 2.0 | virginica |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 6.8 | 3.0 | 5.5 | 2.1 | virginica |
| 5.7 | 2.5 | 5.0 | 2.0 | virginica |
| 5.8 | 2.8 | 5.1 | 2.4 | virginica |
| 6.4 | 3.2 | 5.3 | 2.3 | virginica |
| 6.5 | 3.0 | 5.5 | 1.8 | virginica |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 6.0 | 2.2 | 5.0 | 1.5 | virginica |
| 6.9 | 3.2 | 5.7 | 2.3 | virginica |
| 5.6 | 2.8 | 4.9 | 2.0 | virginica |
| 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 6.7 | 3.3 | 5.7 | 2.1 | virginica |
| 7.2 | 3.2 | 6.0 | 1.8 | virginica |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 6.1 | 3.0 | 4.9 | 1.8 | virginica |
| 6.4 | 2.8 | 5.6 | 2.1 | virginica |
| 7.2 | 3.0 | 5.8 | 1.6 | virginica |
| 7.4 | 2.8 | 6.1 | 1.9 | virginica |
| 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 6.4 | 2.8 | 5.6 | 2.2 | virginica |
| 6.3 | 2.8 | 5.1 | 1.5 | virginica |
| 6.1 | 2.6 | 5.6 | 1.4 | virginica |
| 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 6.3 | 3.4 | 5.6 | 2.4 | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica |
| 6.0 | 3.0 | 4.8 | 1.8 | virginica |
| 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 5.9 | 3.0 | 5.1 | 1.8 | virginica |
We divide the entire data set into two parts for training and testing: the training data is 80% of the whole data set and the testing data is the remaining 20%. Ordinarily we would work on the training data for the EDA phase.
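set.seed(123) # assumed seed; any fixed value makes the random split reproducible
index <- sample(seq_len(nrow(iris)), size = nrow(iris) * 0.80) # 120 random row indices
iris_train <- iris[index, ]
iris_test <- iris[-index, ]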
Sampling like this is useful for cross-validation and model assessment. As this is only an EDA, we may not need to split the data into train and test sets, so we will use the full data set for our EDA.
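
If the split is used later for model assessment, a stratified variant keeps the species proportions identical in both parts. This is a minimal sketch in base R under stated assumptions: the seed value and the names `iris_train_strat` and `iris_test_strat` are hypothetical and do not appear in the original analysis.

```r
# Stratified 80/20 split: sample 80% of the row indices within each species.
set.seed(42)  # assumed seed, only so the split is reproducible

rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}), use.names = FALSE)

iris_train_strat <- iris[train_rows, ]
iris_test_strat  <- iris[-train_rows, ]

table(iris_train_strat$Species)  # 40 rows per species
table(iris_test_strat$Species)   # 10 rows per species
```

With only 50 rows per class, a purely random split can leave one species slightly under-represented in the test set. Sampling within each species avoids that.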
dim(iris) # Checking dimensions: iris has 150 observations and 5 variables (4 measurements plus Species)
## [1] 150 5
Observation
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Observation
table(iris$Species) #Have equal number of each
##
## setosa versicolor virginica
## 50 50 50
Observation
colSums(is.na(iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 0 0 0
Observation
df<-data.frame(mini=sapply(iris[,1:4],min),maxi=sapply(iris[,1:4],max))
df$range<-df$maxi-df$mini
df
## mini maxi range
## Sepal.Length 4.3 7.9 3.6
## Sepal.Width 2.0 4.4 2.4
## Petal.Length 1.0 6.9 5.9
## Petal.Width 0.1 2.5 2.4
Observation
The ranges (maximum minus minimum) of the attributes follow the order: Petal.Length > Sepal.Length > Sepal.Width = Petal.Width
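
The same ordering can be obtained in one step. The sketch below is an alternative formulation, not the original code: it rounds the ranges to one decimal so that floating-point noise does not break the tie between the two widths. The name `feature_range` is hypothetical.

```r
# Range (max - min) of each numeric column, rounded to avoid
# floating-point noise, then sorted from widest to narrowest.
feature_range <- round(sapply(iris[, 1:4], function(x) diff(range(x))), 1)
sort(feature_range, decreasing = TRUE)
```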
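library(dplyr) # group_by(), summarise(), n()
library(ggplot2)
# Count the rows per species and draw the counts as bars
iris %>%
  group_by(Species) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = Species, y = count)) +
  geom_bar(stat = "identity", fill = '#3990E5', alpha = 0.5) +
  labs(x = "Species",
       y = "count",
       title = "Number of observations per species",
       subtitle = "Segregated by type")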
Each of the three species has the same number of observations (50) in the data.
meanMedian<-iris %>%
group_by(Species) %>%
summarise(Spl.len.mean=mean(Sepal.Length),spl.len.median=median(Sepal.Length),
Spl.Width.mean=mean(Sepal.Width),Spl.Width.median=median(Sepal.Width),
ptl.len.mean=mean(Petal.Length),ptl.len.median=median(Petal.Length),
ptl.width.mean=mean(Petal.Width),ptl.Width.median=median(Petal.Width))
kable(meanMedian) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
| Species | Spl.len.mean | spl.len.median | Spl.Width.mean | Spl.Width.median | ptl.len.mean | ptl.len.median | ptl.width.mean | ptl.Width.median |
|---|---|---|---|---|---|---|---|---|
| setosa | 5.006 | 5.0 | 3.428 | 3.4 | 1.462 | 1.50 | 0.246 | 0.2 |
| versicolor | 5.936 | 5.9 | 2.770 | 2.8 | 4.260 | 4.35 | 1.326 | 1.3 |
| virginica | 6.588 | 6.5 | 2.974 | 3.0 | 5.552 | 5.55 | 2.026 | 2.0 |
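
The per-species means and medians can also be computed without naming every column by hand. The following is a sketch in base R, offered as an alternative to the `summarise()` call rather than a replacement for it. The name `stats_by_species` is hypothetical.

```r
# Mean and median of every numeric column per species, in base R.
# aggregate() returns one matrix-valued column per feature.
stats_by_species <- aggregate(
  . ~ Species, data = iris,
  FUN = function(x) c(mean = mean(x), median = median(x))
)

# Rows follow the order of levels(iris$Species).
stats_by_species$Petal.Length
```

Because the function is applied to every non-grouping column, adding a fifth measurement would need no change to the code.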
Observation
More clarity will come from looking at the histogram distributions.
sl<-ggplot(iris, aes(Sepal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Sepal length")
sw<-ggplot(iris, aes(Sepal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Sepal Width")
pl<-ggplot(iris, aes(Petal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Petal length")
pw<- ggplot(iris, aes(Petal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Petal Width")
library(gridExtra) # provides grid.arrange()
grid.arrange(sl,sw,pl,pw,ncol=2,nrow=2)
Observation
When all three species are taken together, Petal length and Petal width form bimodal density plots. Also notice that the mean and median differ noticeably for Petal length (3.758 vs 4.35) and, to a lesser extent, for Petal width (1.199 vs 1.3).
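
The gap between mean and median can be read off numerically as well as from the dashed lines. This is a small sketch for illustration, and the name `gap` is hypothetical.

```r
# Gap between mean and median for each feature over the whole data set.
# A negative gap means the mean sits below the median.
gap <- sapply(iris[, 1:4], function(x) mean(x) - median(x))
round(gap, 3)
```

A mean pulled well below the median, as happens for the petal measurements, is consistent with a cluster of small values (here the setosa flowers) sitting apart from the rest.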
We can gain more insight by looking at each flower species separately.
iris_s<-subset(iris,Species=="setosa")
sl_Set<-ggplot(iris_s, aes(Sepal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Setosa Sepal length")
sw_Set<-ggplot(iris_s, aes(Sepal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Setosa Sepal Width")
pl_Set<-ggplot(iris_s, aes(Petal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Setosa Petal length")
pw_Set<- ggplot(iris_s, aes(Petal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Setosa Petal Width")
grid.arrange(sl_Set,sw_Set,pl_Set,pw_Set,ncol=2,nrow=2)
Observation
For Setosa, all attributes except Petal width seem to have a rough density plot; the plots also seem right-skewed.
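
The four Setosa panels above share one plotting recipe, so the repetition could be folded into a function. The sketch below is one possible refactoring, not the original code: `plot_feature_hist()` is a hypothetical helper, and it assumes ggplot2 3.4.0 or later, where `after_stat(density)` and `linewidth` are the current spellings of `..density..` and line `size`.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical helper: density-scaled histogram of one column with a
# density curve plus dashed lines at the mean (grey) and median (green).
plot_feature_hist <- function(data, feature, title = feature) {
  ggplot(data, aes(x = .data[[feature]])) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.3,
                   alpha = 0.5, color = "darkgray", fill = "#3990E5") +
    geom_density(alpha = 0.7, color = "dodgerblue4", linewidth = 1) +
    geom_vline(xintercept = mean(data[[feature]]),
               color = "grey28", linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = median(data[[feature]]),
               color = "green", linetype = "dashed", linewidth = 1) +
    labs(title = title, x = feature)
}

# Build the four Setosa panels and arrange them in a 2 x 2 grid.
features <- names(iris)[1:4]
setosa <- subset(iris, Species == "setosa")
setosa_plots <- lapply(features, function(f) {
  plot_feature_hist(setosa, f, title = paste("Setosa", f))
})
grid.arrange(grobs = setosa_plots, ncol = 2, nrow = 2)
```

The same call with a different subset and title prefix would produce the panels for the other species.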
iris_ve<-subset(iris,Species=="versicolor")
sl_ve<-ggplot(iris_ve, aes(Sepal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Versicolor Sepal length")
sw_ve<-ggplot(iris_ve, aes(Sepal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Versicolor Sepal Width")
pl_ve<-ggplot(iris_ve, aes(Petal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "Versicolor Petal length")
pw_ve<- ggplot(iris_ve, aes(Petal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "Versicolor Petal Width")
grid.arrange(sl_ve,sw_ve,pl_ve,pw_ve,ncol=2,nrow=2)
Observation
Compared to Setosa, Versicolor seems to have more skewed distributions, especially for Petal length and Petal width.
iris_vi<-subset(iris,Species=="virginica")
sl_vi<-ggplot(iris_vi, aes(Sepal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "virginica Sepal length")
sw_vi<-ggplot(iris_vi, aes(Sepal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Sepal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Sepal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "virginica Sepal Width")
pl_vi<-ggplot(iris_vi, aes(Petal.Length))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Length))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Length))
, color="green", linetype="dashed", size=1) +
labs(title = "virginica Petal length")
pw_vi<- ggplot(iris_vi, aes(Petal.Width))+
geom_histogram(alpha=0.5, binwidth =.3,position="identity", aes(y = ..density..), color="darkgray",fill='#3990E5' ) +
geom_density(alpha=0.7,color="dodgerblue4",size=1) +
geom_vline(aes(xintercept=mean(Petal.Width))
, color="grey28", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(Petal.Width))
, color="green", linetype="dashed", size=1) +
labs(title = "virginica Petal Width")
grid.arrange(sl_vi,sw_vi,pl_vi,pw_vi,ncol=2,nrow=2)
Observation
Similar to Versicolor, Virginica also seems to have skewed distributions, especially in the case of Petal length and Petal width.
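
The impressions of skewness drawn from the histograms can be cross-checked with a number. The sketch below uses the moment-based sample skewness, which is one of several common definitions. The function `skewness()` is defined here for illustration and is not taken from a package.

```r
# Sample skewness (third standardized moment). Positive values indicate
# a longer right tail, negative values a longer left tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# One row per species, one column per feature.
skew_by_species <- sapply(iris[, 1:4], function(col) {
  sapply(split(col, iris$Species), skewness)
})
round(skew_by_species, 2)
```

With only 50 flowers per species and measurements recorded to one decimal place, these estimates are noisy, so they are best read alongside the plots rather than instead of them.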
Checking for outliers
boxplot(iris[,1:4], col=c("red", "blue", "yellow", "grey"))
slb<-ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
geom_boxplot(
alpha=0.3,
outlier.colour="red",
outlier.fill="red",
outlier.size=2) +
theme(legend.position="none")
swb<-ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) +
geom_boxplot(
alpha=0.3,
outlier.colour="red",
outlier.fill="red",
outlier.size=2) +
theme(legend.position="none")
plb<-ggplot(iris, aes(x=Species, y=Petal.Length, fill=Species)) +
geom_boxplot(
alpha=0.3,
outlier.colour="red",
outlier.fill="red",
outlier.size=2) +
theme(legend.position="none")
plw<-ggplot(iris, aes(x=Species, y=Petal.Width, fill=Species)) +
geom_boxplot(
alpha=0.3,
outlier.colour="red",
outlier.fill="red",
outlier.size=2) +
theme(legend.position="none")
grid.arrange(slb,swb,plb,plw,ncol=2,nrow=2)
Observation
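
The points drawn in red can also be listed as values. This sketch uses `boxplot.stats()` from base R, which applies the usual 1.5 × IQR rule. Note that ggplot2 computes its box limits from quantiles rather than hinges, so borderline points may differ slightly from the ggplot2 panels. The names `box_outliers`, `pooled_outliers` and `outliers_by_species` are hypothetical.

```r
# Values that a base-R boxplot would draw as outliers
# (beyond 1.5 * IQR from the hinges).
box_outliers <- function(x) boxplot.stats(x)$out

# Pooled over all species, matching the first boxplot above.
pooled_outliers <- lapply(iris[, 1:4], box_outliers)
lengths(pooled_outliers)

# Per species, comparable to the ggplot2 boxplots.
outliers_by_species <- lapply(iris[, 1:4], function(col) {
  lapply(split(col, iris$Species), box_outliers)
})
outliers_by_species$Petal.Length
```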
Scatter plot
iris1<-iris
iris1$ID <- seq.int(nrow(iris1))
sls<-ggplot(iris1, aes(x=Sepal.Length, y=ID, color=Species)) + geom_point() +labs(title = "Sepal length")
sws<-ggplot(iris1, aes(x=Sepal.Width, y=ID, color=Species)) + geom_point() +labs(title = "Sepal Width")
pls<-ggplot(iris1, aes(x=Petal.Length, y=ID, color=Species)) + geom_point() +labs(title = "Petal Length")
pws<-ggplot(iris1, aes(x=Petal.Width, y=ID, color=Species)) + geom_point() +labs(title = "Petal Width")
grid.arrange(sls,sws,pls,pws,ncol=2,nrow=2)
Observation
Ranking of the attributes by how clearly they separate the species in the scatter plots: Petal Width > Petal Length > Sepal Length > Sepal Width
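
A visual ranking like this can be compared with a simple numeric score. The sketch below uses the one-way ANOVA F statistic of each attribute against `Species`. This is one choice of score among many, assumed here for illustration, and its ordering need not match the visual ranking exactly.

```r
# One-way ANOVA F statistic of each feature against Species: a larger F
# means the species means differ more relative to within-species spread.
f_stat <- sapply(iris[, 1:4], function(x) {
  summary(aov(x ~ iris$Species))[[1]][["F value"]][1]
})
sort(round(f_stat, 1), decreasing = TRUE)
```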
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
C<-cor(iris[,c(1:4)])
c0<-corrplot(C, method = "square",type="lower")
Observation
Strong positive correlation between Sepal length and both Petal length (about 0.87) and Petal width (about 0.82). Inference: Flowers with longer sepals tend to have wider and longer petals.
Moderate negative correlation between Sepal width and Petal length (about -0.43). Inference: As sepal width increases, the petals tend to be shorter.
Weaker negative correlation between Sepal width and Petal width (about -0.37). Inference: As sepal width increases, the petals tend to be narrower as well as shorter.
Strong positive correlation between Petal length and Petal width (about 0.96). Inference: As petal length increases, the petals tend to be wider.
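
These inferences rest on correlations pooled over all three species. As a cautious cross-check, the sketch below prints the pooled matrix and then recomputes one pair, Sepal width against Petal length, within each species. Pooled and within-group correlations can differ, even in sign, when the groups have different means. The name `within_species` is hypothetical.

```r
# Pooled correlation matrix, rounded for reading.
round(cor(iris[, 1:4]), 2)

# The same pair (Sepal.Width vs Petal.Length) within each species.
within_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Width, d$Petal.Length)
})
round(within_species, 2)
```

If the within-species values come out positive while the pooled value is negative, the negative relationship is mostly a difference between species rather than a trend among flowers of the same species.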
The density ridgeline plot is an alternative to the standard geom_density() function that can be useful for visualizing changes in distributions, of a continuous variable, over time or space. Ridgeline plots are partially overlapping line plots that create the impression of a mountain range.
library(ggridges) # provides theme_ridges() and geom_density_ridges()
theme_set(theme_ridges())
p1<-ggplot(iris, aes(x = Sepal.Length, y = Species)) +
geom_density_ridges(aes(fill = Species)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
p2<-ggplot(iris, aes(x = Sepal.Width, y = Species)) +
geom_density_ridges(aes(fill = Species)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
p3<-ggplot(iris, aes(x = Petal.Length, y = Species)) +
geom_density_ridges(aes(fill = Species)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
p4<-ggplot(iris, aes(x = Petal.Width, y = Species)) +
geom_density_ridges(aes(fill = Species)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
library(ggpubr)
## Loading required package: magrittr
p1
## Picking joint bandwidth of 0.181
p2
## Picking joint bandwidth of 0.13
p3
p4
## Picking joint bandwidth of 0.075
These plots help us understand not just the mean and variance of each density but also how much the species' densities overlap. For example, the three species overlap heavily in Sepal width and Sepal length, whereas in Petal width and Petal length setosa does not overlap with the other two species at all, and versicolor and virginica overlap only partly.
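
The overlap read from the ridgeline plots can be checked against the raw ranges. The sketch below is a rough illustration that compares minimum and maximum values only, which is a stricter and cruder test than comparing densities. The names `species_range` and `setosa_separated` are hypothetical.

```r
# Per-species minimum and maximum of each feature.
species_range <- lapply(iris[, 1:4], function(col) {
  sapply(split(col, iris$Species), range)  # 2 x 3 matrix: rows = min, max
})

# TRUE when setosa's largest value is below the smallest value of both
# other species, i.e. setosa's range does not overlap theirs.
setosa_separated <- sapply(species_range, function(r) {
  r[2, "setosa"] < min(r[1, c("versicolor", "virginica")])
})
setosa_separated
```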
Thank you!