Iris

For today’s learning log, I will be conducting a simple linear regression on Iris petal length and width. This will showcase my ability to use R to perform linear regression as well as provide some insight into Iris petals.

We’ll start by loading in our Iris data and checking its layout and formatting.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

So we see that the dataset includes five variables for each Iris observation: four measurements and the species. We’ll just focus on petal length and width for this regression.
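To confirm the layout beyond the first six rows, we can also check the dimensions and column types. This is a minimal sketch using only base R; `petals` is a name introduced here for illustration and is not used elsewhere in this log.

```r
# Sketch: confirm the layout of the built-in iris data (base R only).
data(iris)

dim(iris)   # number of rows and columns
str(iris)   # column types: four numeric measurements and one factor

# `petals` is an illustrative name: keep only the two columns used in the regression.
petals <- iris[, c("Petal.Length", "Petal.Width")]
summary(petals)
```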

Our question for today’s exercise: if we know the petal length of an Iris plant, can we accurately estimate its petal width?

Let’s get a better feel for the Iris petal data from a scatterplot.

attach(iris)
plot(Petal.Length, Petal.Width, ylab = "Petal Width",
     xlab = "Petal Length",
     main = "Iris Petal Length and Width")

Fascinating! It appears that there is a positive association between Petal Length and Width. Namely, as Petal Length increases, so does Petal Width.
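One way to put a number on what the scatterplot shows is the Pearson correlation coefficient. The sketch below is an addition to the original analysis; it refers to the columns through `iris$` so it does not depend on `attach()`, and `r` is just an illustrative name.

```r
# Sketch: quantify the linear association seen in the scatterplot.
data(iris)

# Pearson correlation between petal length and petal width.
# Values near +1 indicate a strong positive linear association.
r <- cor(iris$Petal.Length, iris$Petal.Width)
r
```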

But we’re interested in precise conclusions, so we’ll fit a linear model to the data to better understand this relationship.

Irismod <- lm(Petal.Width ~ Petal.Length)
Irismod
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      -0.3631        0.4158

As we originally thought, Petal Length and Width are positively associated, as shown by the slope of 0.4158 for our regression line. In context, our model says that each additional cm of Petal Length is associated with an average increase of 0.4158 cm in Petal Width. For this dataset, our model provides an intercept of -0.3631 cm. However, this implies a negative width when our Petal Length is 0 cm, which lies outside the range of our data and is not useful or valid information for us at this time. (B0 = -0.3631, B1 = 0.4158)
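To make the interpretation concrete, here is a sketch of how the fitted coefficients turn a petal length into an estimated petal width. The 4 cm input is a hypothetical example value, and `iris_fit`, `b0`, `b1`, and `new_length` are names introduced here for illustration; the model is refit with `data = iris` so the block runs on its own.

```r
# Sketch: use the fitted line to estimate petal width from petal length.
data(iris)

# Same model as above, with the data passed explicitly instead of via attach().
iris_fit <- lm(Petal.Width ~ Petal.Length, data = iris)

b0 <- coef(iris_fit)[["(Intercept)"]]
b1 <- coef(iris_fit)[["Petal.Length"]]

# Hypothetical example input: a petal that is 4 cm long.
new_length <- 4

# Estimate by hand from the regression equation: width = b0 + b1 * length
by_hand <- b0 + b1 * new_length

# The same estimate using predict()
by_predict <- predict(iris_fit, newdata = data.frame(Petal.Length = new_length))

by_hand
unname(by_predict)

# 95% confidence intervals for the intercept and slope
confint(iris_fit)
```

Both approaches should give the same estimate, since `predict()` simply evaluates the regression equation at the new value.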

Further, if we check our regression line against our data, our preliminary observation is that the line fits the data well.

plot(Petal.Length, Petal.Width, ylab = "Petal Width",
     xlab = "Petal Length",
     main = "Iris Petal Length and Width")

abline(Irismod)  # draw the fitted line from the model itself, avoiding rounded coefficients

But as statisticians, we want to know HOW well the model fits the data, so we’ll use residuals to check our model assumptions and predictive ability.

The first assumption to check is whether our residuals are normally distributed.

Let’s start by taking a peek at our residual histogram.

IrisResiduals <- Irismod$residuals
hist(IrisResiduals)

From this histogram, the residuals seem to follow a normal distribution, but we’ll double check ourselves anyway with a Q-Q plot.

qqnorm(IrisResiduals)
qqline(IrisResiduals)

From the Q-Q plot, we can further see that the residuals mostly follow the straight line with some slight deviation at the ends, which suggests that they are approximately normally distributed.
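For a numerical complement to the two plots, a Shapiro-Wilk test is one option. This is an addition to the original checks, not a replacement for them; `iris_fit` and `sw` are illustrative names, and the model is refit so the block is self-contained.

```r
# Sketch: a formal normality check on the residuals (base R `stats` package).
data(iris)
iris_fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed.
sw <- shapiro.test(residuals(iris_fit))

sw$statistic  # W statistic; values closer to 1 are more consistent with normality
sw$p.value    # a small p-value is evidence against normality
```

With 150 observations the test can flag fairly small departures, so it is best read alongside the histogram and Q-Q plot rather than on its own.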

Our next model assumption to check is whether the variance of our residuals is constant across the range of Petal Lengths.

plot(Petal.Length, Petal.Width, ylab = "Petal Width (cm)",
     xlab = "Petal Length (cm)",
     main = "Relationship Between Petal Length and Width")

From a first check of the scatterplot, it appears that the variance of the residuals is relatively constant, but we’ll double check.

plot(IrisResiduals ~ Petal.Length)
abline(0,0)

If we instead look at our residuals for different Petal Lengths, a pattern emerges. The variance for Petal Lengths 1-3 appears to be much smaller than the variance for Petal Lengths 4-6. This makes the homoscedasticity assumption harder to confirm.
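The difference in spread can be checked numerically by comparing the residual variance on either side of the gap in the data. In this sketch the 3 cm cutoff is an assumption chosen to separate the short-petal cluster from the rest, and `iris_fit`, `res`, `group`, and `group_var` are illustrative names.

```r
# Sketch: compare residual variance for short and long petals.
data(iris)
iris_fit <- lm(Petal.Width ~ Petal.Length, data = iris)
res <- residuals(iris_fit)

# Assumed cutoff: 3 cm separates the short-petal cluster from the longer petals.
group <- ifelse(iris$Petal.Length < 3, "short (< 3 cm)", "long (>= 3 cm)")

# Number of observations and residual variance within each group
table(group)
group_var <- tapply(res, group, var)
group_var

# Ratio of the two variances; a ratio far from 1 points to non-constant variance.
group_var[["long (>= 3 cm)"]] / group_var[["short (< 3 cm)"]]
```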

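Our other regression assumptions are confirmed as follows:

1. Mean of the error term = 0. (Because the model includes an intercept, least squares forces the residuals to average zero, so the value below should be zero up to floating-point error.)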

mean(IrisResiduals)
## [1] 4.245307e-18
2. Each error term is independent of other error terms. Our intuition of plant behavior allows us to accept this assumption.

Finally, we’ll assess the precision of our model.

summary(Irismod)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

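Our linear model produced a residual standard error of 0.2065 cm and an R-squared value of 0.9271, meaning that Petal Length explains about 93% of the variation in Petal Width. Together with the highly significant slope (p < 2e-16), this suggests the model describes the association between Iris Petal Length and Width fairly well, keeping in mind the non-constant variance seen in the residual plot.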
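The summary figures can be reproduced by hand, which also makes clear how the sum of squared errors, the residual standard error, and R-squared relate to one another. This is a sketch with illustrative names (`iris_fit`, `sse`, `rse`, `sst`, `r2`); the model is refit so the block runs on its own.

```r
# Sketch: rebuild the residual standard error and R-squared from the residuals.
data(iris)
iris_fit <- lm(Petal.Width ~ Petal.Length, data = iris)

sse <- sum(residuals(iris_fit)^2)   # sum of squared errors (residual sum of squares)
df  <- df.residual(iris_fit)        # residual degrees of freedom: n - 2
rse <- sqrt(sse / df)               # residual standard error

sst <- sum((iris$Petal.Width - mean(iris$Petal.Width))^2)  # total sum of squares
r2  <- 1 - sse / sst                # R-squared

c(SSE = sse, RSE = rse, R2 = r2)
```

The `RSE` and `R2` entries should match the "Residual standard error" and "Multiple R-squared" lines printed by `summary()`.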