The following experimental analysis looks at a data set containing the size characteristics of different flowers. Below are the first six rows of this data set, followed by a summary of its structure.
Flowers<-read.csv("C:/Users/Anthony/Desktop/Flowers.csv",header=TRUE)
# Convert Species to a factor for analysis (this creates a separate vector; Flowers$Species itself stays an integer column)
Species<-as.factor(Flowers$Species)
head(Flowers)
## Sepal.Length Petal.Length Species Petal.Area
## 1 5.1 1.4 1 0.28
## 2 4.9 1.4 1 0.28
## 3 4.7 1.3 1 0.26
## 4 4.6 1.5 1 0.30
## 5 5.0 1.4 1 0.28
## 6 5.4 1.7 1 0.68
str(Flowers)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Species : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Petal.Area : num 0.28 0.28 0.26 0.3 0.28 0.68 0.42 0.3 0.28 0.15 ...
In this experiment I will perform an analysis of covariance (ANCOVA) to determine the effects of two continuous variables, Petal Length and Sepal Length, on the response variable, Petal Area. A third factor of interest is the categorical variable Species, which will also be analyzed to determine whether it has an effect on the Petal Area of the flowers being observed.
For the categorical variable, Species, I will be analyzing all three species types (labelled as 1, 2, and 3) to determine if a change in flower species results in a change in the Petal Area of the flower.
For the two continuous covariates, Petal Length and Sepal Length, I will determine whether each has a statistically significant effect on Petal Area.
The continuous variables of interest for this statistical analysis are the response variable, Petal Area, and the independent variables, Petal Length and Sepal Length.
This data set contains recordings of Sepal Length, Petal Length, and Petal Area. Each of these is recorded for three different species of flowers, with 50 observations for each species.
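Because the CSV file itself is not reproduced here, the following is only a sketch of how a data frame with the same shape could be rebuilt from R's built-in `iris` data. It assumes that Petal Area is the product of petal length and petal width; the report does not state this, but the assumption is consistent with the first rows printed above (for example 1.4 × 0.2 = 0.28). The name `flowers_sketch` is hypothetical, and `Petal.Width` comes from `iris`, not from the report.

```r
# Sketch only: rebuild a stand-in for the Flowers data from the built-in iris data.
# Assumption (not stated in the report): Petal.Area = Petal.Length * Petal.Width.
data(iris)
flowers_sketch <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  Petal.Length = iris$Petal.Length,
  Species      = as.integer(iris$Species),  # species coded 1, 2, 3 as in the report
  Petal.Area   = iris$Petal.Length * iris$Petal.Width
)
head(flowers_sketch)            # first six rows, for comparison with head(Flowers)
table(flowers_sketch$Species)   # number of rows recorded for each species
```

If the assumption holds, the first six rows should line up with the `head(Flowers)` listing, which is a quick way to check the stand-in before using it.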
How will the experiment be organized and conducted to test the hypothesis?
The linear ANCOVA model assumes Y = µ + τ + βX + ε, where µ is the overall mean, τ is the effect of the categorical factor, β is the slope of the covariate X, and ε is the random error. In this experiment I will perform an analysis of covariance to test the null hypothesis that each effect in the linear model is zero. First, the analysis of covariance will be performed, and then the coefficients of the linear model will be obtained to estimate these effects.
After this data analysis is complete, I will look further into the adequacy of the linear model. I will do this by creating a Q-Q plot and a residuals vs. fits plot.
What is the rationale for this design?
I have chosen to use this experimental design to demonstrate proper usage of ANCOVA when analyzing the effects of uncontrollable continuous variables on a single response variable.
Below is a series of plots that represents the data of interest for this statistical analysis.
plot(Flowers)
This scatterplot matrix is a more compact tool for exploratory data analysis than a set of box plots because it shows every pairwise relationship among the variables, including each independent variable against the response variable, in a single figure, so any obvious trends can be seen at a glance.
Below I perform an analysis of covariance (ANCOVA) to determine the effect of Species, Petal Length, and Sepal Length on the response variable of Petal Area.
model<-lm(Petal.Area~Species+Petal.Length+Sepal.Length,data=Flowers)
anova(model)
## Analysis of Variance Table
##
## Response: Petal.Area
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 1 2987 2987 2289.1 < 2e-16 ***
## Petal.Length 1 107 107 82.1 7.7e-16 ***
## Sepal.Length 1 24 24 18.5 3.1e-05 ***
## Residuals 146 191 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
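In the table above, Species has 1 degree of freedom, which suggests that `lm()` treated the integer column `Flowers$Species` as a single numeric slope instead of a three-level factor. As a hedged sketch, one way to fit Species as a categorical factor is shown below. It uses a stand-in built from R's `iris` data, under the assumption that Petal Area is petal length times petal width; `flowers_sketch`, `model_factor`, and `anova_factor` are hypothetical names, and no results from this variant are reported here.

```r
# Sketch only: iris-based stand-in for Flowers (Petal.Area as a product is an assumption).
flowers_sketch <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  Petal.Length = iris$Petal.Length,
  Species      = factor(as.integer(iris$Species)),  # factor with levels 1, 2, 3
  Petal.Area   = iris$Petal.Length * iris$Petal.Width
)

# Same model formula as in the report, but Species is now categorical.
model_factor <- lm(Petal.Area ~ Species + Petal.Length + Sepal.Length,
                   data = flowers_sketch)
anova_factor <- anova(model_factor)
anova_factor  # the Species row uses 2 degrees of freedom (3 levels minus 1)
```

With the real data, the equivalent change would be to store the factor back into the data frame before calling `lm()`, so that the formula picks up the factor and not the integer column.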
Returning to the null hypothesis described earlier, the ANCOVA results above indicate that all three factors (Species, Petal Length, and Sepal Length) have significant effects on the response variable, Petal Area. Below we examine the estimated coefficients of the linear model to determine whether we can reject the null hypothesis, which in layman’s terms states that the three variables of interest have no effect on the response variable.
summary(model)
##
## Call:
## lm(formula = Petal.Area ~ Species + Petal.Length + Sepal.Length,
## data = Flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.684 -0.726 0.048 0.557 3.912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.259 1.105 -8.38 4.1e-14 ***
## Species 2.820 0.379 7.45 7.6e-12 ***
## Petal.Length 0.892 0.223 4.00 1e-04 ***
## Sepal.Length 1.037 0.241 4.30 3.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.14 on 146 degrees of freedom
## Multiple R-squared: 0.942, Adjusted R-squared: 0.941
## F-statistic: 797 on 3 and 146 DF, p-value: <2e-16
As seen above in the summary of our linear model, the estimated coefficients (2.820 for Species, 0.892 for Petal Length, and 1.037 for Sepal Length) differ significantly from zero, with every p-value below 0.001. We can therefore reject the null hypothesis that these variables have no effect on the response variable and conclude that their effects are unlikely to be due to random variation alone.
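A complementary way to express the same conclusion is with confidence intervals for the coefficients: a 95% interval that excludes zero corresponds to rejecting the null hypothesis for that coefficient at the 5% level. The sketch below uses `confint()` on a stand-in model built from R's `iris` data, assuming Petal Area is petal length times petal width; `flowers_sketch`, `model_sketch`, and `ci` are hypothetical names.

```r
# Sketch only: iris-based stand-in for Flowers (Petal.Area as a product is an assumption).
flowers_sketch <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  Petal.Length = iris$Petal.Length,
  Species      = as.integer(iris$Species),  # numeric coding 1, 2, 3, as in the report
  Petal.Area   = iris$Petal.Length * iris$Petal.Width
)
model_sketch <- lm(Petal.Area ~ Species + Petal.Length + Sepal.Length,
                   data = flowers_sketch)

# 95% confidence intervals: one row per coefficient, lower and upper bounds in columns.
ci <- confint(model_sketch, level = 0.95)
ci
```

With the original data, `confint(model)` would give the corresponding intervals for the model fitted in this report.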
To check the adequacy of the linear model used for this ANCOVA, I created a quantile-quantile (Q-Q) plot of the residuals to assess whether they follow a normal distribution.
The nearly linear pattern of the residuals in the Q-Q plot indicates that the normality assumption is reasonable for this analysis. Points falling exactly on the reference line would mean that the residuals perfectly match a normal distribution.
The second plot is a residuals vs. fits plot, which is used to check for non-linearity and non-constant variance in the residuals and to identify outlying values. Because the residuals appear to be centered around zero, the linear model appears adequate for assessing the effects of Species, Petal Length, and Sepal Length on the response variable, Petal Area.
# QQ Plot for residuals in ANCOVA analysis
qqnorm(residuals(model))
qqline(residuals(model))
# Residual vs. Fits plot
plot(fitted(model),residuals(model))
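The Q-Q plot is a visual check. As an optional numeric complement, which is not part of the original analysis, a Shapiro-Wilk test can be run on the residuals with `shapiro.test()`. The sketch below uses a stand-in model built from R's `iris` data, assuming Petal Area is petal length times petal width; `flowers_sketch`, `model_sketch`, and `sw` are hypothetical names.

```r
# Sketch only: iris-based stand-in for Flowers (Petal.Area as a product is an assumption).
flowers_sketch <- data.frame(
  Sepal.Length = iris$Sepal.Length,
  Petal.Length = iris$Petal.Length,
  Species      = as.integer(iris$Species),
  Petal.Area   = iris$Petal.Length * iris$Petal.Width
)
model_sketch <- lm(Petal.Area ~ Species + Petal.Length + Sepal.Length,
                   data = flowers_sketch)

# Shapiro-Wilk test of the null hypothesis that the residuals are normally distributed.
sw <- shapiro.test(residuals(model_sketch))
sw$p.value  # a small p-value would be evidence against normality
```

A formal test like this is sensitive to sample size, so it is best read alongside the Q-Q plot and not as a replacement for it.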
The raw data used in this demonstration of ANCOVA was obtained from http://www.statlab.uni-heidelberg.de/data/iris/.