Problem 1.7: Fisher’s “iris” dataset
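The dataset has four continuous measurement variables (sepal length, sepal width, petal length and petal width) and one categorical variable, Species.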
Data Description
The dataset is summarized below:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
To explore the dataset, we start with the “pairs.panels” function from the psych package:
library(psych)
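# Index the palette by Species so each point is colored by its species (not recycled by row)
pairs.panels(iris[1:4], gap=0, main="Fisher's Iris Dataset", pch=21, bg=c("red","green3","blue")[iris$Species])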
The “pairs.panels” plot shows high correlations between:
Sepal Length and Petal Length,
Sepal Length and Petal Width and
Petal Length and Petal Width.
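The correlations read off the panel plot can also be checked numerically. The following is a minimal sketch using base R's cor(); the name cor_mat and the 0.8 cutoff are illustrative choices of ours, not part of the original analysis.

```r
# Pearson correlation matrix of the four measurement columns
cor_mat <- cor(iris[, 1:4])
round(cor_mat, 2)

# List variable pairs whose absolute correlation exceeds an illustrative
# cutoff of 0.8 (the cutoff is an assumption made for this sketch)
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal.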
Fitting a linear regression of Petal Width on Petal Length gives:
summary(lm(Petal.Width ~ Petal.Length, data=iris))
##
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56515 -0.12358 -0.01898 0.13288 0.64272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.363076 0.039762 -9.131 4.7e-16 ***
## Petal.Length 0.415755 0.009582 43.387 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
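The slope is highly significant (p < 2e-16) and the model explains about 93% of the variance in Petal Width (R-squared = 0.9271), confirming the strong linear relationship seen in the scatterplot of Petal Width against Petal Length.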
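For a regression with a single predictor, the R-squared in the output above is the squared Pearson correlation between the two variables. The sketch below checks this and also reports confidence intervals for the coefficients; the names fit_petal, r2 and r are ours.

```r
# Refit the petal model and keep the fitted object for inspection
fit_petal <- lm(Petal.Width ~ Petal.Length, data = iris)

# 95% confidence intervals for the intercept and slope
confint(fit_petal)

# In simple linear regression, R-squared equals the squared correlation
r2 <- summary(fit_petal)$r.squared
r  <- cor(iris$Petal.Length, iris$Petal.Width)
all.equal(r2, r^2)
```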
The correlations above are easy to see, so let us now consider a relationship that is less clear: Sepal Length and Sepal Width. Fitting a linear model of Sepal Width on Sepal Length:
summary(lm(Sepal.Width ~ Sepal.Length, data=iris))
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1095 -0.2454 -0.0167 0.2763 1.3338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41895 0.25356 13.48 <2e-16 ***
## Sepal.Length -0.06188 0.04297 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
The slope is not statistically significant (p = 0.152) and the R-squared is only 0.014, so this pooled linear model provides no evidence of a linear relationship between Sepal Length and Sepal Width.
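The same conclusion can be reached without fitting a regression. As a rough sketch, base R's cor.test() tests whether the pooled Pearson correlation differs from zero; with a single predictor this is equivalent to the t-test on the slope. The name ct is ours.

```r
# Test the pooled correlation between sepal length and sepal width
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

ct$estimate   # pooled Pearson correlation (weak and negative)
ct$p.value    # matches the p-value of the slope in the pooled regression
```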
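Let's see whether looking at the data by species provides any further insight. The model below fits a separate intercept and a separate Sepal Width slope for each species, with Sepal Length as the response: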
summary(lm(Sepal.Length ~ Sepal.Width:Species + Species-1, data=iris))
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width:Species + Species - 1,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26067 -0.25861 -0.03305 0.18929 1.44917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Speciessetosa 2.6390 0.5715 4.618 8.53e-06 ***
## Speciesversicolor 3.5397 0.5580 6.343 2.74e-09 ***
## Speciesvirginica 3.9068 0.5827 6.705 4.25e-10 ***
## Sepal.Width:Speciessetosa 0.6905 0.1657 4.166 5.31e-05 ***
## Sepal.Width:Speciesversicolor 0.8651 0.2002 4.321 2.88e-05 ***
## Sepal.Width:Speciesvirginica 0.9015 0.1948 4.628 8.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4397 on 144 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9944
## F-statistic: 4478 on 6 and 144 DF, p-value: < 2.2e-16
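One caveat when reading this output: because the formula removes the intercept with "-1", R computes R-squared relative to zero rather than relative to the mean of the response, so the value of 0.9947 is not comparable with the R-squared of the earlier models. The sketch below, under our own naming (fit_noint, fit_int), refits the same model with a conventional intercept; the fitted values are identical, and only the R-squared baseline changes.

```r
# Model as specified in the text: per-species intercepts and slopes, no global intercept
fit_noint <- lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data = iris)

# Equivalent parameterization that keeps the global intercept
fit_int <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)

# Both parameterizations describe the same fitted lines
all.equal(fitted(fit_noint), fitted(fit_int))

# R-squared measured against the mean of Sepal.Length (comparable with the earlier models)
summary(fit_int)$r.squared
```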
# Index the palette by species so each point is colored by its species
cols <- c("red", "green3", "blue")
plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=cols[unclass(iris$Species)], main="Fisher's Iris Data", xlab="Sepal Width", ylab="Sepal Length")
# Add one least-squares line per species, in the matching color
species <- levels(iris$Species)
for (i in seq_along(species)) {
  abline(lm(Sepal.Length ~ Sepal.Width, data=iris[iris$Species == species[i], ]), col=cols[i])
}
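The chart above shows that, within each species, Sepal Length increases with Sepal Width: all three per-species slopes in the model are positive and significant (p < 0.0001). The pooled model missed this relationship because the species differ in their typical sepal dimensions, which masks the within-species trend when the data are combined.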
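The within-species pattern can also be checked numerically. The following is a minimal sketch that compares the pooled correlation with the correlation inside each species; the names by_species and pooled are ours.

```r
# Pearson correlation of sepal length and width within each species
by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
by_species

# Pooled correlation over all 150 flowers, for comparison
pooled <- cor(iris$Sepal.Length, iris$Sepal.Width)
pooled
```

The per-species correlations are all positive while the pooled correlation is slightly negative, which is why the pooled regression showed no relationship.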