Problem 1.7: Fisher’s “iris” dataset

  1. There are 150 observations (cases) in the dataset.
  2. There are 4 numerical variables in the dataset: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. All 4 are continuous variables.
  3. “Species” is a categorical variable with three labels: setosa, versicolor and virginica.
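
The counts above can be checked directly against the built-in data frame. The following is a minimal sketch; the object names (n_obs, num_vars, species_levels) are illustrative choices and not part of the original solution.

```r
# Load the built-in iris data frame (ships with base R's datasets package)
data(iris)

# Number of observations (rows)
n_obs <- nrow(iris)

# Names of the numeric columns
num_vars <- names(iris)[sapply(iris, is.numeric)]

# Labels of the categorical variable
species_levels <- levels(iris$Species)

print(n_obs)
print(num_vars)
print(species_levels)
```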

Data Description

The dataset is summarized below:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

To interpret our dataset, we will start with the function “pairs.panels” from the psych library:

library(psych)
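# Index the colour vector by species so that each species gets its own fill colour
pairs.panels(iris[1:4], gap=0, main="Fisher's Iris Dataset", pch=21, bg=c("red","green","blue")[iris$Species])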

Using the “pairs.panels” function, a high correlation can be observed between:

  1. Sepal Length and Petal Length,

  2. Sepal Length and Petal Width and

  3. Petal Length and Petal Width.
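
The three pairs listed above can also be picked out numerically from the Pearson correlation matrix. This is a rough sketch only: the cut-off of 0.8 for "high" is an assumption made here for illustration, and the object names are hypothetical.

```r
# Pearson correlation matrix of the four numeric variables
cors <- cor(iris[1:4])
print(round(cors, 2))

# Locate variable pairs with |r| above an assumed threshold of 0.8,
# looking only at the upper triangle so each pair appears once
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(high_pairs)
```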

Fitting a linear regression model of Petal Width on Petal Length:

summary(lm(Petal.Width ~ Petal.Length, data=iris))
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

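The p-values above confirm what we see in the scatterplot for Petal Width and Petal Length: the slope is highly significant (p < 2e-16), and the model explains about 93% of the variance in Petal Width (R-squared = 0.9271).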
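
As a small follow-up sketch, the same fit can be stored and inspected further. The name fit_petal is introduced here for illustration. For a simple regression with one predictor, the R-squared should equal the squared Pearson correlation, which the last line checks.

```r
# Refit the simple regression and keep the model object
fit_petal <- lm(Petal.Width ~ Petal.Length, data = iris)

# 95% confidence intervals for the intercept and slope
print(confint(fit_petal, level = 0.95))

# With a single predictor, R-squared equals the squared correlation
r <- cor(iris$Petal.Length, iris$Petal.Width)
r_squared <- summary(fit_petal)$r.squared
print(all.equal(r^2, r_squared))
```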

The above correlations are quite easily observed, but let us take the example of a relationship that is not so simple: Sepal Length to Sepal Width. Using a Linear model:

summary(lm(Sepal.Width ~ Sepal.Length, data=iris))
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1095 -0.2454 -0.0167  0.2763  1.3338 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.41895    0.25356   13.48   <2e-16 ***
## Sepal.Length -0.06188    0.04297   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

With a slope p-value of 0.152 and an R-squared of only 0.014, this linear model provides no statistically significant evidence of a linear relationship between Sepal Length and Sepal Width when all three species are pooled.
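
The same conclusion can be cross-checked with a Pearson correlation test, whose p-value coincides with that of the slope in a one-predictor regression. This is a sketch; the name ct is an illustrative choice.

```r
# Pearson correlation test on the pooled data (all species together)
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

# Estimated correlation and its p-value
print(ct$estimate)
print(ct$p.value)
```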

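Let's see if looking at the data by species provides any further insights. Note that the model below takes Sepal Length as the response and fits a separate intercept and slope for each species.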

summary(lm(Sepal.Length ~ Sepal.Width:Species + Species-1, data=iris))
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width:Species + Species - 1, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26067 -0.25861 -0.03305  0.18929  1.44917 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## Speciessetosa                   2.6390     0.5715   4.618 8.53e-06 ***
## Speciesversicolor               3.5397     0.5580   6.343 2.74e-09 ***
## Speciesvirginica                3.9068     0.5827   6.705 4.25e-10 ***
## Sepal.Width:Speciessetosa       0.6905     0.1657   4.166 5.31e-05 ***
## Sepal.Width:Speciesversicolor   0.8651     0.2002   4.321 2.88e-05 ***
## Sepal.Width:Speciesvirginica    0.9015     0.1948   4.628 8.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4397 on 144 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9944 
## F-statistic:  4478 on 6 and 144 DF,  p-value: < 2.2e-16

# Colour each point by species (indexing by the factor avoids recycling the colours row by row)
cols <- c("red","green3","blue")
plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=cols[iris$Species], main="Fisher's Iris Data", xlab="Sepal Width", ylab="Sepal Length")
# Add one fitted regression line per species
abline(lm(Sepal.Length ~ Sepal.Width, data=iris, subset=Species=="setosa"), col="red")
abline(lm(Sepal.Length ~ Sepal.Width, data=iris, subset=Species=="versicolor"), col="green3")
abline(lm(Sepal.Length ~ Sepal.Width, data=iris, subset=Species=="virginica"), col="blue")

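We can see from the chart above that, within each species, Sepal Length increases with Sepal Width: all three within-species slopes are positive (0.69, 0.87 and 0.90) and significant at p < 0.001, even though the pooled relationship was slightly negative and not significant. The R-squared of 0.9947 should not be compared with the earlier models, because R computes it relative to zero rather than to the mean when the intercept is removed.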
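
To put numbers on the by-species pattern, the within-species correlations can be compared with the pooled one. This is a minimal sketch with illustrative object names (by_species, pooled). Squaring each within-species correlation gives the R-squared of the corresponding single-species regression with an intercept.

```r
# Correlation between Sepal Length and Sepal Width within each species
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Sepal.Width))

# Correlation when all species are pooled together
pooled <- cor(iris$Sepal.Length, iris$Sepal.Width)

print(round(by_species, 2))
print(round(pooled, 2))

# R-squared of each per-species simple regression (with intercept)
print(round(by_species^2, 2))
```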