Exploring the Iris data set.

The goal of this project is to investigate the iris data set, which records the sepal length and width and the petal length and width of flowers from three species of iris: setosa, versicolor and virginica.

Plotting the Sepal width vs Sepal length for each flower species

iris$Species <- as.factor(iris$Species)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
ggplot(iris, aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point(position = "jitter")

From the plots and the median petal lengths, I conclude that virginica is the largest of the three flowers and setosa the smallest: their median petal lengths are 5.55 and 1.5 respectively. Versicolor comes close to virginica with a median petal length of 4.35.

Virginica, which has the longest petals, also has relatively long sepals, with a median sepal length of 6.5 (5.8 is the median sepal length across all three species).
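The medians quoted above can be checked directly. This is a minimal sketch, assuming the built-in `iris` data frame and base R's `aggregate`; the name `species_medians` is only illustrative and does not appear in the original analysis.

```r
# Median of every measurement, computed separately for each species.
# The formula `. ~ Species` means "all other columns, grouped by Species".
species_medians <- aggregate(. ~ Species, data = iris, FUN = median)
print(species_medians)
```

Each row of the result is one species, so the petal and sepal medians can be compared side by side.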

The graph below describes the relationship between sepal length and petal length for each of the species

#sepal length and petal length for each species
ggplot(iris, aes(Sepal.Length,Petal.Length , color=Species))+
  geom_point(position = "jitter")+
  geom_smooth()

To better describe the relationship between sepal length and petal length, I added a smoother. It shows that for versicolor and virginica the petal length has a roughly linear relationship with the sepal length. For setosa the smoother line is slightly curved, which may be an effect of outliers rather than a genuinely non-linear relationship. We can inspect the setosa species separately to learn more about its distribution.
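With no `method` argument, `geom_smooth()` falls back to a loess curve for groups this small, so some curvature is expected even for roughly linear data. One way to judge linearity is to draw a straight-line fit instead. The sketch below is an illustration, not part of the original analysis; `p_linear` is a hypothetical name.

```r
library(ggplot2)

# Same scatter plot, but with a per-species straight-line fit (method = "lm")
# in place of the default loess smoother, so departures from linearity stand out.
p_linear <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p_linear)
```

If the points scatter evenly around the straight line, the curvature in the loess version is more likely noise than structure.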

#plot for sepal length and petal length of setosa species
ggplot(subset(iris, Species == "setosa"), aes(Sepal.Length, Petal.Length)) +
  geom_point(position="jitter")+
  coord_cartesian(ylim=c(1,1.75))+
  geom_smooth()

We see that the plot is quite non-linear. There are notable outliers in the data for setosa (the points are jittered, so the values read off the plot are approximate). We can see a flower with a sepal length of 4.6 and a petal length of 1, while another flower has a petal length of 1.55 and a sepal length of around 4.55. Throughout the plot the data points are spread out in every direction, so when smoothed, the line becomes slightly curved because of these outliers. Another instance of this non-linearity appears at a sepal length of 5.75, where the petal length is around 1.2; to its left there is another point with a sepal length of around 5.7 and a petal length of 1.7.
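Reading outliers off a jittered plot is imprecise, so a numeric check can help. The sketch below swaps in a different technique from the visual inspection above: base R's `boxplot.stats()`, which flags values lying more than 1.5 times the inter-quartile range beyond the hinges. It looks at each variable on its own, and the names `setosa`, `petal_out` and `sepal_out` are illustrative.

```r
# Restrict the data to the setosa species.
setosa <- subset(iris, Species == "setosa")

# Values flagged by Tukey's 1.5 * IQR rule, one variable at a time.
petal_out <- boxplot.stats(setosa$Petal.Length)$out
sepal_out <- boxplot.stats(setosa$Sepal.Length)$out

print(petal_out)
print(sepal_out)
```

An empty result for a variable means no value is extreme by this rule, even if a point looks isolated on the scatter plot.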

This non-linearity raises a question: is there another species hidden inside the setosa data? To investigate, I checked the other measurements for the same species, namely sepal width and petal width.

#plot for sepal width and petal width faceted by species
ggplot(iris,aes(Sepal.Width,Petal.Width, color=Species))+
  geom_point(position="jitter")+
  geom_smooth()+
  facet_wrap(~Species)

Here the smoothed line suggests that sepal width is not correlated with petal width in the setosa species. For the other species there is a roughly linear relationship between petal width and sepal width. We check this using cor.test for each species.

#correlation for sepal width and petal width
with(subset(iris,iris$Species=="setosa"), cor.test(Sepal.Width,Petal.Width))

## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Width and Petal.Width
## t = 1.6581, df = 48, p-value = 0.1038
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0487543  0.4800023
## sample estimates:
##      cor 
## 0.232752

with(subset(iris,iris$Species=="virginica"), cor.test(Sepal.Width,Petal.Width))

## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Width and Petal.Width
## t = 4.4187, df = 48, p-value = 5.648e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3050368 0.7098315
## sample estimates:
##      cor 
## 0.537728

with(subset(iris,iris$Species=="versicolor"), cor.test(Sepal.Width,Petal.Width))

## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Width and Petal.Width
## t = 6.1523, df = 48, p-value = 1.467e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4730884 0.7953482
## sample estimates:
##       cor 
## 0.6639987

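The tests show a moderate, statistically significant positive correlation between sepal width and petal width for virginica (r = 0.54) and versicolor (r = 0.66). For setosa the correlation is weak (r = 0.23) and not significant at the 5% level (p = 0.10), and its confidence interval includes zero.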
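The three separate calls can also be collected into one table, which makes the species easier to compare. This is a minimal sketch using base R's `split` and `sapply`; the name `cor_by_species` is illustrative.

```r
# Run the same Pearson correlation test on each species and keep
# the estimate (r) and the p-value from every test.
cor_by_species <- sapply(split(iris, iris$Species), function(d) {
  test <- cor.test(d$Sepal.Width, d$Petal.Width)
  c(r = unname(test$estimate), p_value = test$p.value)
})
print(round(cor_by_species, 4))
```

The result is a matrix with one column per species and one row each for `r` and `p_value`.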

Final Plots and summary

sepal length vs petal length for each species

knitr::opts_chunk$set(echo=FALSE)

ggplot(iris, aes(Sepal.Length,Petal.Length , color=Species))+
  geom_point(position = "jitter")+
  geom_smooth()

sepal width vs petal width for each species

plot for sepal width and petal width faceted by species

From all the plots above, we can see that although the three species come from the same family, setosa is linearly separable, i.e. a line can be drawn that separates setosa from the other two species on each of the graphs. Virginica and versicolor are quite similar, as a few points from both species fall close to each other, whereas setosa is comparatively different: all its points lie away from the versicolor and virginica group.

In all the graphs there was also a linear trend for virginica and versicolor, whereas setosa showed non-linearity or, if adjusted for outliers, no trend with respect to its petal and sepal dimensions. I will look more into the setosa species in later analysis.
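The separability of setosa can be illustrated with a single cut-off on petal length. In the sketch below the threshold of 2.5 is a hypothetical value picked by eye from the plots, not something taken from the original analysis.

```r
# Hypothetical cut-off on petal length, chosen by eye from the plots.
threshold <- 2.5

is_setosa <- iris$Species == "setosa"

# TRUE if every setosa flower falls below the cut-off ...
setosa_below <- all(iris$Petal.Length[is_setosa] < threshold)
# ... and every versicolor or virginica flower falls above it.
others_above <- all(iris$Petal.Length[!is_setosa] > threshold)

print(c(setosa_below = setosa_below, others_above = others_above))
```

When both checks hold, one vertical line on any plot with petal length on an axis is enough to isolate setosa.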

EDA Project

Shivakumar Rajagopalan

22 August 2016