Exploratory Data Analysis: Graphics and Descriptive Summary
#Boxplot
boxplot(Petal.Area~Species,data=iris, xlab="Iris Species", ylab="Petal Area (cm^2)")
title("Effect of Iris Species on Petal Area")

#boxplot of the petal area data from each iris species
#Plot the response variable as a function of the explanatory variables
plot(Petal.Area~Sepal.Area, data=iris, pch=as.character(Species), xlab="Sepal Area (cm^2)", ylab="Petal Area (cm^2)", main ="Petal Area vs Sepal Area")

plot(Petal.Area~Petal.Ratio, data=iris, pch=as.character(Species), xlab="Petal Ratio", ylab="Petal Area (cm^2)", main="Petal Area vs Petal Ratio")

The boxplot above shows petal area (cm^2) for each iris species. The medians differ clearly between species and the boxplot whiskers overlap very little, which suggests that much of the variation in petal area may be explained by the variation in flower type. The Petal Area vs Sepal Area plot shows that petal area increases with sepal area only for species 2 and 3. The dependence of petal area on species is clearest when species 2 and 3 are compared with species 1; the difference between species 2 and 3 is evident, but less pronounced. The Petal Area vs Petal Ratio plot shows that, for species 2 and 3, petal area decreases noticeably as the ratio increases, while it remains relatively constant for species 1.
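The visual comparison of medians can be backed by a numeric summary. The following is a minimal sketch, not the author's code: it rebuilds the derived column under the assumption that Petal.Area is petal length times petal width and that species are coded 1, 2, 3, and it uses a separate data frame `d` (a hypothetical name) so the original `iris` object is left untouched.

```r
# Assumption: Petal.Area = Petal.Length * Petal.Width and Species coded 1, 2, 3.
# These definitions are guesses made so the sketch runs on its own.
d <- transform(iris,
               Petal.Area = Petal.Length * Petal.Width,
               Species    = factor(as.integer(Species)))

# Five-number summary (min, lower hinge, median, upper hinge, max)
# of petal area within each species: the same numbers a boxplot draws.
area_summary <- t(sapply(split(d$Petal.Area, d$Species), fivenum))
colnames(area_summary) <- c("min", "lower.hinge", "median", "upper.hinge", "max")
print(round(area_summary, 2))
```

Each row corresponds to one species, so the medians and the spread of each box can be compared directly.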
ANCOVA Testing
Covariance measures how two variables vary together. Analysis of covariance (ANCOVA) is a general linear model that evaluates whether the means of a dependent variable are equal across the levels of a categorical independent variable, while controlling for the influence of continuous variables (covariates) that are not of primary interest. In a linear ANCOVA model, like the one in this recipe, we assume that Y_(i,n) = µ + alpha_i + beta*X_(i,n) + e_(i,n), for i = 1,…,I and n = 1,…,N, with one beta*X term for each covariate (here, petal ratio and sepal area). The null hypothesis is that alpha_i = 0 for i = 1,2,…,I and that beta = 0. The analysis will estimate these parameters. Written out, the null hypothesis is that the species of iris has no effect on the petal area, and that the explanatory variables sepal area and petal ratio also have no effect. If the null hypothesis is rejected, the alternative hypothesis, that the species has an effect on the petal area, in addition to the sepal area and petal ratio, is accepted.
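To see how R encodes this model, it can help to inspect the design matrix. The sketch below is illustrative only: the derived columns are rebuilt under assumed definitions (area as length times width, ratio as length divided by width), and `d` is a hypothetical data frame name. With R's default treatment contrasts, species 1 serves as the baseline, so its alpha is absorbed into the intercept and only species 2 and 3 get indicator columns.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area  = Petal.Length * Petal.Width,
               Sepal.Area  = Sepal.Length * Sepal.Width,
               Petal.Ratio = Petal.Length / Petal.Width,
               Species     = factor(as.integer(Species)))  # species coded 1, 2, 3

# Design matrix: one column per estimated parameter
X <- model.matrix(Petal.Area ~ Species + Petal.Ratio + Sepal.Area, data = d)
print(colnames(X))

# First flower of each species (rows 1, 51 and 101 of iris)
print(X[c(1, 51, 101), ])
```

The Species2 and Species3 columns are 0/1 indicators, while Petal.Ratio and Sepal.Area enter as continuous covariates with one slope each.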
model=lm(Petal.Area~Species+Petal.Ratio+Sepal.Area, data=iris)
anova(model)
## Analysis of Variance Table
##
## Response: Petal.Area
##              Df Sum Sq Mean Sq F value  Pr(>F)
## Species       2   2987    1494 1053.03 < 2e-16 ***
## Petal.Ratio   1      4       4    2.96   0.087 .
## Sepal.Area    1    112     112   78.64 2.5e-15 ***
## Residuals   145    206       1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-values in the ANOVA table above, the null hypothesis can be rejected: the variation in species explains a significant share of the variation in petal area (p < 2e-16). Sepal area also has a significant effect on petal area (p = 2.5e-15), while petal ratio does not at the 0.05 level (p = 0.087).
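One caveat when reading this table: anova() on an lm fit uses sequential sums of squares, so each term is tested after only the terms listed before it, and the p-values depend on the order of terms in the formula. A hedged sketch of the order-independent alternative is shown below, using drop1() to test each term after all the others. The derived columns and the names `d` and `fit` are assumptions made so the example is self-contained.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area  = Petal.Length * Petal.Width,
               Sepal.Area  = Sepal.Length * Sepal.Width,
               Petal.Ratio = Petal.Length / Petal.Width,
               Species     = factor(as.integer(Species)))  # species coded 1, 2, 3

fit <- lm(Petal.Area ~ Species + Petal.Ratio + Sepal.Area, data = d)

# Marginal F tests: each term is dropped from the full model in turn
marginal <- drop1(fit, test = "F")
print(marginal)
```

For a single-coefficient term such as Petal.Ratio, the marginal F statistic equals the square of the t statistic reported by summary(), which is why the two p-values for that term can differ between the sequential table and the coefficient table.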
Estimation of Parameters
#produce a summary of the linear model
summary(model)
##
## Call:
## lm(formula = Petal.Area ~ Species + Petal.Ratio + Sepal.Area,
## data = iris)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.039 -0.703 -0.036  0.643  4.489
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -4.1914     0.7374   -5.68  7.0e-08 ***
## Species2      5.3906     0.3241   16.63  < 2e-16 ***
## Species3     10.0519     0.3436   29.25  < 2e-16 ***
## Petal.Ratio  -0.0467     0.0590   -0.79     0.43
## Sepal.Area    0.2827     0.0319    8.87  2.5e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.19 on 145 degrees of freedom
## Multiple R-squared: 0.938, Adjusted R-squared: 0.936
## F-statistic: 547 on 4 and 145 DF, p-value: <2e-16
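This shows estimates and standard errors for the coefficients: mu = -4.1914 (the intercept), alpha_2 = 5.3906 and alpha_3 = 10.0519 (the effects of species 2 and 3 relative to species 1), beta = -0.0467 for petal ratio, and beta = 0.2827 for sepal area. Only the petal ratio coefficient did not return a low Pr(>|t|) value (0.43).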
#confidence intervals of the linear model
confint(model)
##               2.5 %   97.5 %
## (Intercept) -5.6488 -2.73405
## Species2     4.7500  6.03129
## Species3     9.3728 10.73105
## Petal.Ratio -0.1632  0.06992
## Sepal.Area   0.2197  0.34575
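These intervals follow from the coefficient table: each one is the estimate plus or minus a t quantile times the standard error. The sketch below reproduces that calculation by hand. The derived columns and the names `d`, `fit` and `manual_ci` are assumptions introduced for illustration.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area  = Petal.Length * Petal.Width,
               Sepal.Area  = Sepal.Length * Sepal.Width,
               Petal.Ratio = Petal.Length / Petal.Width,
               Species     = factor(as.integer(Species)))  # species coded 1, 2, 3

fit <- lm(Petal.Area ~ Species + Petal.Ratio + Sepal.Area, data = d)

est   <- coef(summary(fit))[, "Estimate"]
se    <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))  # two-sided 95% critical value

# estimate +/- t * standard error, one row per coefficient
manual_ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
print(manual_ci)
```

An interval that contains zero, such as the one for Petal.Ratio above, corresponds to a coefficient that is not significant at the 0.05 level.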
Diagnostics/Model Adequacy Checking
Quantile-Quantile (Q-Q) plots are graphs used to check a distributional assumption for a set of data. Each observed value is plotted against the value expected for it under the theoretical distribution. If the data follow the theoretical distribution, the points will fall approximately on a straight line.
# Q-Q norm plot with best fit line
qqnorm(residuals(model))
qqline(residuals(model))

The normal Q-Q plot above is relatively linear, indicating that the residuals are approximately normally distributed and that the statistical test used was appropriate.
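A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test to the residuals; it is a swapped-in supplement rather than part of the original analysis, and the derived columns and the names `d`, `fit` and `sw` are assumptions.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area  = Petal.Length * Petal.Width,
               Sepal.Area  = Sepal.Length * Sepal.Width,
               Petal.Ratio = Petal.Length / Petal.Width,
               Species     = factor(as.integer(Species)))  # species coded 1, 2, 3

fit <- lm(Petal.Area ~ Species + Petal.Ratio + Sepal.Area, data = d)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would suggest a departure from normality that the Q-Q plot may understate, so the two checks are best read together.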
A Residuals vs. Fits plot is a common graph used in residual analysis. It is a scatter plot of the residuals against the fitted values (the estimated responses). These plots are used to detect non-linearity, outliers, and unequal error variances.
plot(fitted(model),residuals(model), main = "Residual Plot", xlab= "Fitted Values", ylab ="Residual Values")

At the lowest fitted values, the residuals are mostly positive and then gradually become mostly negative. The middle set of residuals seems to be scattered evenly about zero. The last set of residuals seems to be centered about zero as well but has much more variation. The trend in the first set of residuals indicates that the fit was not appropriate for low fitted values. The later residuals appear to be better fitted.
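If the three clusters of fitted values correspond to the three species, which seems likely but is an assumption here, the change in spread can be quantified by summarising the residuals within each species. The sketch below is illustrative; the derived columns and the names `d`, `fit`, `res_mean` and `res_sd` are not from the original.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area  = Petal.Length * Petal.Width,
               Sepal.Area  = Sepal.Length * Sepal.Width,
               Petal.Ratio = Petal.Length / Petal.Width,
               Species     = factor(as.integer(Species)))  # species coded 1, 2, 3

fit <- lm(Petal.Area ~ Species + Petal.Ratio + Sepal.Area, data = d)

# Mean and standard deviation of the residuals within each species
res_mean <- tapply(residuals(fit), d$Species, mean)
res_sd   <- tapply(residuals(fit), d$Species, sd)
print(round(cbind(mean = res_mean, sd = res_sd), 3))
```

Because species is a term in the model, the residual means are zero within each species by construction, so the informative column is the standard deviation: clearly unequal values would point to non-constant error variance.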
#Petal Ratio
plot(Petal.Area~Petal.Ratio, data=iris, pch=unclass(Species), main="Petal Area vs Petal Ratio", xlab = "Petal Ratio", ylab = "Petal Area (cm^2)")
for (i in 1:3) abline(lm(Petal.Area~Petal.Ratio,data=iris[iris$Species==i,]))

#Sepal Area
plot(Petal.Area~Sepal.Area, data=iris, pch=unclass(Species), main="Petal Area vs Sepal Area", xlab = "Sepal Area (cm^2)", ylab = "Petal Area (cm^2)")
for (i in 1:3) abline(lm(Petal.Area~Sepal.Area,data=iris[iris$Species==i,]))

Both plots above show fit lines for species 2 and 3 that are roughly parallel to each other and that intersect the fit line for species 1. There is no indication that the true lines for species 2 and 3 are not parallel, while the true line for species 1 may not be parallel to them.
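The parallel-slopes question can also be examined numerically by comparing a model with a common slope against one that lets the slope differ by species. The sketch below does this for sepal area only. It is an illustration under assumed variable definitions, and the names `d`, `fit_parallel`, `fit_separate` and `slope_test` are hypothetical.

```r
# Assumption: the derived variables below are guesses at the author's definitions.
d <- transform(iris,
               Petal.Area = Petal.Length * Petal.Width,
               Sepal.Area = Sepal.Length * Sepal.Width,
               Species    = factor(as.integer(Species)))  # species coded 1, 2, 3

# Common slope for all species vs. a separate slope per species
fit_parallel <- lm(Petal.Area ~ Species + Sepal.Area, data = d)
fit_separate <- lm(Petal.Area ~ Species * Sepal.Area, data = d)

# F test of the Species:Sepal.Area interaction (2 extra parameters)
slope_test <- anova(fit_parallel, fit_separate)
print(slope_test)
```

A small p-value for the interaction would indicate that the slopes are not all equal, which would call the additive ANCOVA model used above into question. The same comparison can be repeated with Petal.Ratio in place of Sepal.Area.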