Recipe 6: Analysis of Covariance

Iris Petals: An Analysis of Covariance

Max Winkelman

Rensselaer Polytechnic Institute

October 30, 2014

Version 1

1. Setting

R. A. Fisher’s Iris Data

The data set contains the lengths and widths of the petals and sepals for three northern american species of iris.

Install the ‘iris.csv’ file

iris <- read.csv("~/RPI/Classes/Design of Experiments/R/iris.csv")
#reads in the csv file "iris.csv" from My Documents and stores it in the variable "iris"
iris$Species = as.factor(iris$Species)
#makes the factor "Species" as a categorical factor

Factors and Levels

Factor: Iris Species

Levels: Iris Setosa (1), Iris Versicolor (2), Iris Virginica (3)

Independent Explanatory Variables: Sepal Area and Petal Ratio

Response Variable: Petal Area

#Summary of Data 
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1          5.1         3.5          1.4         0.2       1      17.85
## 2          4.9         3.0          1.4         0.2       1      14.70
## 3          4.7         3.2          1.3         0.2       1      15.04
## 4          4.6         3.1          1.5         0.2       1      14.26
## 5          5.0         3.6          1.4         0.2       1      18.00
## 6          5.4         3.9          1.7         0.4       1      21.06
##   Petal.Area Sepal.Ratio Petal.Ratio
## 1       0.28       1.457        7.00
## 2       0.28       1.633        7.00
## 3       0.26       1.469        6.50
## 4       0.30       1.484        7.50
## 5       0.28       1.389        7.00
## 6       0.68       1.385        4.25
#displays the first 6 sets of variables 
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 145          6.7         3.3          5.7         2.5       3      22.11
## 146          6.7         3.0          5.2         2.3       3      20.10
## 147          6.3         2.5          5.0         1.9       3      15.75
## 148          6.5         3.0          5.2         2.0       3      19.50
## 149          6.2         3.4          5.4         2.3       3      21.08
## 150          5.9         3.0          5.1         1.8       3      17.70
##     Petal.Area Sepal.Ratio Petal.Ratio
## 145      14.25       2.030       2.280
## 146      11.96       2.233       2.261
## 147       9.50       2.520       2.632
## 148      10.40       2.167       2.600
## 149      12.42       1.824       2.348
## 150       9.18       1.967       2.833
#displays the last 6 sets of variables 
summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width  Species
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1   1:50   
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3   2:50   
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3   3:50   
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2          
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8          
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5          
##    Sepal.Area     Petal.Area     Sepal.Ratio    Petal.Ratio   
##  Min.   :10.0   Min.   : 0.11   Min.   :1.27   Min.   : 2.12  
##  1st Qu.:15.7   1st Qu.: 0.42   1st Qu.:1.55   1st Qu.: 2.80  
##  Median :17.7   Median : 5.62   Median :2.03   Median : 3.30  
##  Mean   :17.8   Mean   : 5.79   Mean   :1.95   Mean   : 4.31  
##  3rd Qu.:20.3   3rd Qu.: 9.69   3rd Qu.:2.22   3rd Qu.: 4.67  
##  Max.   :30.0   Max.   :15.87   Max.   :2.96   Max.   :15.00
#displays a summary of the variables

#Plot all sets of data with each other to see
plot(iris)

plot of chunk unnamed-chunk-2

The plot above displays all of the vairables in iris.csv plotted against each other. Specific varaibles will be plotted against each other in this recipe.

Continuous variables

The continuous variables in this data set are sepal length ande width, petal length and width, sepal and petal area, and sepal and petal ratio.

Response variables:

The response variable in this recipe will be petal area (cm^2).

The Data: How is it organized and what does it look like?

The data from “iris.csv” is organized into columns that contain the variables sepal length ande width (cm), petal length and width(cm), sepal and petal area (cm^2), sepal and petal ratio, and species.

Randomization

The data for length and width of the sepal and petal for the three species of iris can be assumed to have been gathered randomly for the sake of this recipe.

2. Experimental Design

In this recipe, an analysis of covariance (ANCOVA) will evaluate if the means of the dependent variable, petal area (cm^2), are equal for all levels of the independent, catagorical variable, species, while controlling for the effects of the continuous explanatory variables, sepal area (cm^2) and petal ratio.

What is the rationale for this design?

An analysis of covariance is an appropriate choice of statical analysis for this recipe because covariates, sepal area and petal ratio, are not the primary interest in the analysis and can not be controlled.

Randomize: What is the Randomization Scheme?

No randomization scheme was provided with this data set.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures in this data set.

Block: Did you use blocking in the design?

There will be no blocking variables in this recipe.

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive Summary

#Boxplot
boxplot(Petal.Area~Species,data=iris, xlab="Iris Species", ylab="Petal Area (cm^2)")
title("Effect of Iris Species on Petal Area")

plot of chunk unnamed-chunk-3

#boxplot of the petal area data from each iris species

#Plot response variable as a function of the explanatory explanatory variables
plot(Petal.Area~Sepal.Area, data=iris, pch=as.character(Species), xlab="Sepal Area (cm^2)", ylab="Petal Area (cm^2)", main ="Petal Area vs Sepal Area")

plot of chunk unnamed-chunk-3

plot(Petal.Area~Petal.Ratio, data=iris, pch=as.character(Species), xlab="Petal Ratio", ylab="Petal Area (cm^2)", main="Petal Area vs Petal Ratio")

plot of chunk unnamed-chunk-3

The boxplot above shows the effect of iris species on the petal area (cm^2). There is a clear difference between the medians of each species and very little overlap of boxplot whiskers, indicating that it may be true that the variation of petal area can be explained by the variation in flower type. The Petal Area vs Sepal Area shows that petal area increases with sepal area for only species 2 and 3. The dependence of petal area on the species of iris can clearly be seen between species 2 and 3 when compared to 1. The difference between species 2 and 3 is evident, but less pronouced. The Petal Area vs Petal Ratio plot shows that for species 2 and 3, area decreases significantly as the ratio increases. The area remains relatively constant for species 1.

ANCOVA Testing

Covariance is a measurement of how the change of one variable affects the change of another. Analysis of covariance (ANCOVA) is a general linear that evaluates if the means of a dependent variable are equal for all categorical, independent variable, and controls for any influence that other continuous variables that are not of interest might have. In a linear ANCOVA model, like the one in this recipe, we assume that Y_(i,n) = µ + alpha_i + beta*X_(i,n) + e_(i,n),i =1,..,I and n = 1,…,N. The null hypothesis is: alpha_i =0, i =1,2,…,I. and that beta =0. The analysis will estimate these parameters. The null hypthesis written out is that the species of iris has no effec to the petal area, and that the explainatory variables sepal area and petal ratio also do not have an effect. If the null hypothesis is reject, the alternative hypothesis, that the species does have an effect on the petal area, in addition to the sepal area and petal ratio, is accepted.

model=lm(Petal.Area~Species+Petal.Ratio+Sepal.Area, data=iris)
anova(model)
## Analysis of Variance Table
## 
## Response: Petal.Area
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## Species       2   2987    1494 1053.03 < 2e-16 ***
## Petal.Ratio   1      4       4    2.96   0.087 .  
## Sepal.Area    1    112     112   78.64 2.5e-15 ***
## Residuals   145    206       1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-values produced in the anova above, it can be stated that the null hypothesis can be rejected. The variation in species can explain the variation in petal area.The petal ratio was found to not have an affect on the petal ratio, while the sepal area did.

Estimation of Parameters

#produce a summary of the linear model
summary(model)
## 
## Call:
## lm(formula = Petal.Area ~ Species + Petal.Ratio + Sepal.Area, 
##     data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.039 -0.703 -0.036  0.643  4.489 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.1914     0.7374   -5.68  7.0e-08 ***
## Species2      5.3906     0.3241   16.63  < 2e-16 ***
## Species3     10.0519     0.3436   29.25  < 2e-16 ***
## Petal.Ratio  -0.0467     0.0590   -0.79     0.43    
## Sepal.Area    0.2827     0.0319    8.87  2.5e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.19 on 145 degrees of freedom
## Multiple R-squared:  0.938,  Adjusted R-squared:  0.936 
## F-statistic:  547 on 4 and 145 DF,  p-value: <2e-16

This shows estimates and confidence intervals for the coefficients: mu = -4.1914, beta = 5.3906, alpha1 = 10.0519, alpha2 = -0.0467, and alpha3 = 0.2827. Only alpha2 (petal ratio) did not return a low Pr(>|t|) value.

#confidence intervals of the linear model
confint(model)
##               2.5 %   97.5 %
## (Intercept) -5.6488 -2.73405
## Species2     4.7500  6.03129
## Species3     9.3728 10.73105
## Petal.Ratio -0.1632  0.06992
## Sepal.Area   0.2197  0.34575

Diagnostics/Model Adequacy Checking

Quantile-Quantile (Q-Q) plots are graphs used to verify the distributional assumption for a set of data. Based on the theoretical distribution, the expected value for each datum is determined. If the data values in a set follow the theoretical distribution, then they will appear as a straight line on a Q-Q plot.

# Q-Q norm plot with best fit line
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-7

The Q-Q normal plot above produced a relatively linear fit, indicating a normal distribution and that the statisitical test used was appropriate.

A Residuals vs. Fits Plot is a common graph used in residual analysis. It is a scatter plot of residuals as a function of fitted values, or the estimated responses. These plots are used to identify linearity, outliers, and error variances.

plot(fitted(model),residuals(model), main = "Residual Plot", xlab= "Fitted Values", ylab ="Residual Values")

plot of chunk unnamed-chunk-8

Early in the residual plot, the residuals are positively skewed and then graudally become negatively skewed. The middle set of residuals seems to be normally distributed about zero. The last set of residuals seems to be centered about zero as well but has a lot more vairation. This first set of residuals indicates that the fit was not appropriate for early fitted values. The later residuals appear to be better fitted.

#Petal Ratio
plot(Petal.Area~Petal.Ratio, data=iris, pch=unclass(Species), main="Petal Area vs Petal Ratio", xlab = "Petal Ratio", ylab = "Petal Area (cm^2)")
for (i in 1:3) abline(lm(Petal.Area~Petal.Ratio,data=iris[iris$Species==i,]))

plot of chunk unnamed-chunk-9

#Sepal Area
plot(Petal.Area~Sepal.Area, data=iris, pch=unclass(Species), main="Petal Area vs Sepal Area", xlab = "Sepal Area (cm^2)", ylab = "Petal Area (cm^2)") 
for (i in 1:3) abline(lm(Petal.Area~Sepal.Area,data=iris[iris$Species==i,]))

plot of chunk unnamed-chunk-9

Both plots above produced graphs with fit lines from species 2 and 3 being relatively parallel with each other and interesting the fit line for 1. There is no indication that the true lines for 2 and 3 would not be parallel, while the true line for species 1 may not be parallel.

4. References to the Literature

No references used in this recipe

5. Appendices

The raw data used in this statistical analysis can be found at: http://www.statlab.uni-heidelberg.de/data/iris/