This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses statistics about different clones and strains of a plant over 3 years to observe their effect on on the yield for those plants. A study was done in Poland over three years where statistics about 6 different clones of Miscanthus were recorded each year along with their tuft yield.
Below is the installation and initial examination of the dataset:
#Obtaining data package
data<-read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)
shapirodata<-read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)
head(data)
## Genotype.clone year height tillering tuftdia stemdia tuftweight
## 1 MG/1 2003 57 28 29.1 0.44 0.72
## 2 MG/1 2004 178 58 57.5 0.44 1.27
## 3 MG/1 2005 287 73 66.6 0.46 1.93
## 4 MG/2 2003 61 29 33.2 0.44 0.86
## 5 MG/2 2004 205 58 61.6 0.46 1.39
## 6 MG/2 2005 295 81 68.6 0.49 2.17
attach(data)
For this multi-factor analysis we will be examining the factors of height, tillering, tuft diameter, and stem diameter.Since each factor is continuous they will be separated into 3 levels reflecting the range of the data.
#Subset dataset so each factor has 3 levels.
data$height[as.numeric(data$height) >= 50 & as.numeric(data$height) < 150] ="50-150"
data$height[as.numeric(data$height) >= 150 & as.numeric(data$height) < 250] ="150-250"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$height[as.numeric(data$height) >= 250] =">250"
## Warning: NAs introduced by coercion
data$tillering[data$tillering>= 10 & data$tillering < 30] = "10-30"
data$tillering[data$tillering>= 30 & data$tillering < 60] = "30-60"
data$tillering[data$tillering>= 60 & data$tillering < 90] = "60-90"
data$tuftdia[data$tuftdia>= 10 & data$tuftdia < 30] = "10-30"
data$tuftdia[data$tuftdia>= 30 & data$tuftdia < 50] = "30-50"
data$tuftdia[data$tuftdia>= 50 & data$tuftdia < 70] = "50-70"
data$stemdia[data$stemdia>= .35 & data$stemdia < .4] = ".35-.4"
data$stemdia[data$stemdia>= .4 & data$stemdia < .45] = ".4-.45"
data$stemdia[data$stemdia>= .45 & data$stemdia < .5] = ".45-.5"
head(data)
## Genotype.clone year height tillering tuftdia stemdia tuftweight
## 1 MG/1 2003 50-150 10-30 10-30 .4-.45 0.72
## 2 MG/1 2004 150-250 30-60 50-70 .4-.45 1.27
## 3 MG/1 2005 >250 60-90 50-70 .45-.5 1.93
## 4 MG/2 2003 50-150 10-30 30-50 .4-.45 0.86
## 5 MG/2 2004 150-250 30-60 50-70 .45-.5 1.39
## 6 MG/2 2005 >250 60-90 50-70 .45-.5 2.17
tail(data)
## Genotype.clone year height tillering tuftdia stemdia tuftweight
## 13 MS/5 2003 50-150 10-30 10-30 .45-.5 0.55
## 14 MS/5 2004 150-250 10-30 10-30 .45-.5 0.82
## 15 MS/5 2005 150-250 10-30 30-50 .45-.5 0.99
## 16 MS/6 2003 50-150 10-30 10-30 .4-.45 0.57
## 17 MS/6 2004 150-250 10-30 10-30 .45-.5 0.84
## 18 MS/6 2005 150-250 10-30 10-30 .45-.5 1.06
summary(data)
## Genotype.clone year height tillering
## MG/1:3 Min. :2003 Length:18 Length:18
## MG/2:3 1st Qu.:2003 Class :character Class :character
## MS/3:3 Median :2004 Mode :character Mode :character
## MS/4:3 Mean :2004
## MS/5:3 3rd Qu.:2005
## MS/6:3 Max. :2005
## tuftdia stemdia tuftweight
## Length:18 Length:18 Min. :0.550
## Class :character Class :character 1st Qu.:0.745
## Mode :character Mode :character Median :0.940
## Mean :1.069
## 3rd Qu.:1.308
## Max. :2.170
#Levels of Height
unique(data$height)
## [1] "50-150" "150-250" ">250"
#Levels of Tillering
unique(data$tillering)
## [1] "10-30" "30-60" "60-90"
#Levels of Tuft Diameter
unique(data$tuftdia)
## [1] "10-30" "50-70" "30-50"
#Levels of Stem Diameter
unique(data$stemdia)
## [1] ".4-.45" ".45-.5" ".35-.4"
All variables are continuous.
The response variable for this study is the tuft weight which is considered the yield of the plant.
The data set is organized by plant strain and then year.
Height is the height of the plant.
Tillering is the number of productive shoots per plant.
Tuft Diameter is the diameter of the tufts of the plant.
Stem diameter is the diameter of the stem.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment. Several plants of each type were collected and analyzed each year.
To perform the experiment the data will first be subset for the continuous variables This is done to create the 3 levels for all factors. Then an analysis of variance will be performed on each factor individually, then on combinations of factors indicated as having an effect by the study. These combination of factors are Tillering, tuft diameter and height. From these analysis we will be able to test the null hypothesis, that the yield (tuft weight) of plants is independant of the factors.
The rationale for using an analysis of variance test is used when multiple factors are considered. It checks whether the means of several groups are equal. The alternative would be to use multiple two-sample t-tests however there is more likely chance of the test resulting in a false hypothesis.
The data was collected in groups once per year but do not know if there was any randomization to the collection.
There are repeated measures in the data. Each plant was measured once per year each year.
The original experiment involved blocking by year to determine results. However due to the lack of data we will be observing the mean across all years to see if we get similar results as the original study.
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.
#Defining height as a factor
data$height = as.factor(data$height)
#Defining tillering as a factor
data$tillering = as.factor(data$tillering)
#Defining tuft diameter as a factor
data$tuftdia = as.factor(data$tuftdia)
#Defining stem diameter as a factor
data$stemdia = as.factor(data$stemdia)
#Boxplots of of the means of each variable against response variable
boxplot(tuftweight~height, data=data, xlab="Height", ylab="Tuft Weight")
boxplot(tuftweight~tillering, data=data, xlab="Tillering", ylab="Tuft Weight")
boxplot(tuftweight~tuftdia, data=data, xlab="Tuft Diameter", ylab="Tuft Weight")
boxplot(tuftweight~stemdia, data=data, xlab="Stem Diameter", ylab="Tuft Weight")
Examining the boxplots there appears to be an obvious positive relationship between all factors and tuftweight except for stem diameter.
To test the hypotheses we perform an ANOVA test on the factors individually and then in the combination suggested by the study.
The null hypothesis of the first three tests is that the single factor does not have an effect on the response variable of traffic fatality rates.
#Analysis of Variance for Height
model1=aov(tuftweight~height, data=data)
anova(model1)
## Analysis of Variance Table
##
## Response: tuftweight
## Df Sum Sq Mean Sq F value Pr(>F)
## height 2 3.018 1.509 28.4 8e-06 ***
## Residuals 15 0.798 0.053
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Tillering
model2=aov(tuftweight~tillering, data=data)
anova(model2)
## Analysis of Variance Table
##
## Response: tuftweight
## Df Sum Sq Mean Sq F value Pr(>F)
## tillering 2 3.091 1.545 32 3.9e-06 ***
## Residuals 15 0.725 0.048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Tuft Diameter
model3=aov(tuftweight~tuftdia, data=data)
anova(model3)
## Analysis of Variance Table
##
## Response: tuftweight
## Df Sum Sq Mean Sq F value Pr(>F)
## tuftdia 2 2.58 1.289 15.6 0.00021 ***
## Residuals 15 1.24 0.082
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Analysis of Variance for Stem Diameter
model4=aov(tuftweight~stemdia, data=data)
anova(model4)
## Analysis of Variance Table
##
## Response: tuftweight
## Df Sum Sq Mean Sq F value Pr(>F)
## stemdia 2 0.37 0.185 0.81 0.46
## Residuals 15 3.44 0.230
#Analysis of Variance for combination of Tillering, Tuft Diameter and Height.
model123=aov(tuftweight~tillering*height*tuftdia, data=data)
anova(model123)
## Analysis of Variance Table
##
## Response: tuftweight
## Df Sum Sq Mean Sq F value Pr(>F)
## tillering 2 3.091 1.545 48.63 7e-06 ***
## height 1 0.187 0.187 5.89 0.036 *
## tuftdia 2 0.173 0.086 2.72 0.114
## tillering:tuftdia 1 0.036 0.036 1.14 0.311
## height:tuftdia 1 0.011 0.011 0.36 0.561
## Residuals 10 0.318 0.032
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Individual factor tests for height, tillering and tuft diameter had resulting p-values of less than .001.The p-value represents the probability that we can get the F value using the null hypothesis.Since our probability is extremely close to 0 we can assume that each factor demonstrates an effect on the response variable. We are lead to reject the null hypothesis on an individual factor basis for height, tillering and tuft diameter. However our p-value for stem diameter is extremely high (.4643) leading us to accept the null hypothesis that stem diameter does not have an effect.
Next we looked at the analysis of variance of the model with all three factors. We still saw effects of tillering and height, both with p-values less than .05. However the p-value for tuft diameter is now much less significant at .11. The p-values for the interaction effects also lead us to believe there is no interaction effect betweent he variables.
Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid. We also do a Shapiro-Wilke normality test on our data as a secondary normality test for our continuous variables.
#QQ plots for residuals of Height model
qqnorm(residuals(model1))
qqline(residuals(model1))
#QQ plots for residuals of tillering model
qqnorm(residuals(model2))
qqline(residuals(model2))
#QQ plots for residuals of tuft diameter model
qqnorm(residuals(model3))
qqline(residuals(model3))
#QQ plots for residuals of stem diameter model
qqnorm(residuals(model4))
qqline(residuals(model4))
#QQ plots for residuals of combination of factors model
qqnorm(residuals(model123))
qqline(residuals(model123))
#Shapiro-Wilke Test of Normality for our factors
shapiro.test(read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[,"height"])
##
## Shapiro-Wilk normality test
##
## data: read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[, "height"]
## W = 0.9008, p-value = 0.0593
shapiro.test(read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[,"tillering"])
##
## Shapiro-Wilk normality test
##
## data: read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[, "tillering"]
## W = 0.8865, p-value = 0.03364
shapiro.test(read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[,"tuftdia"])
##
## Shapiro-Wilk normality test
##
## data: read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[, "tuftdia"]
## W = 0.9078, p-value = 0.07871
shapiro.test(read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[,"stemdia"])
##
## Shapiro-Wilk normality test
##
## data: read.csv("C:/Users/tothk2/Desktop/Rpi/School/plantdata2.csv", header = TRUE)[, "stemdia"]
## W = 0.8895, p-value = 0.03779
Our qq-plots show data that appears normal.
We use the original data set in our Shapiro-Wilke test to include the original values for each factor. We feel the shapiro-wilke test shows adequate normality if our p is < 0.1. In all cases our p-value is adequate.
We also perform the Tukey test which is used to determine which groups in the sample differ as opposed to the anova which tells whether or not there are differences in groups.
#Tukey HSD Test for height model
TukeyHSD(model1, ordered = FALSE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tuftweight ~ height, data = data)
##
## $height
## diff lwr upr p adj
## 150-250->250 -0.9250 -1.3891 -0.4609 0.0003
## 50-150->250 -1.4017 -1.8909 -0.9125 0.0000
## 50-150-150-250 -0.4767 -0.7861 -0.1673 0.0031
#Tukey HSD Test for tillering model
TukeyHSD(model2, ordered = FALSE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tuftweight ~ tillering, data = data)
##
## $tillering
## diff lwr upr p adj
## 30-60-10-30 0.4967 0.2018 0.7916 0.0015
## 60-90-10-30 1.2900 0.8476 1.7324 0.0000
## 60-90-30-60 0.7933 0.3270 1.2596 0.0014
#Tukey HSD Test for tuft diameter model
TukeyHSD(model3, ordered = FALSE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tuftweight ~ tuftdia, data = data)
##
## $tuftdia
## diff lwr upr p adj
## 30-50-10-30 0.4302 0.01416 0.8463 0.0423
## 50-70-10-30 0.9522 0.50398 1.4005 0.0002
## 50-70-30-50 0.5220 0.02162 1.0224 0.0404
#Tukey HSD Test for stem diameter model
TukeyHSD(model4, ordered = FALSE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tuftweight ~ stemdia, data = data)
##
## $stemdia
## diff lwr upr p adj
## .4-.45-.35-.4 0.1380 -0.6493 0.9253 0.8928
## .45-.5-.35-.4 0.3387 -0.3709 1.0484 0.4491
## .45-.5-.4-.45 0.2007 -0.5089 0.9104 0.7471
#Tukey HSD Test for the model with combination of three factors
TukeyHSD(model123, ordered = FALSE, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tuftweight ~ tillering * height * tuftdia, data = data)
##
## $tillering
## diff lwr upr p adj
## 30-60-10-30 0.4967 0.2443 0.749 8e-04
## 60-90-10-30 1.2900 0.9115 1.669 0e+00
## 60-90-30-60 0.7933 0.3943 1.192 7e-04
##
## $height
## diff lwr upr p adj
## 150-250->250 0.0670 -0.3115 0.44553 0.8798
## 50-150->250 -0.1117 -0.5107 0.28734 0.7305
## 50-150-150-250 -0.1787 -0.4310 0.07369 0.1777
##
## $tuftdia
## diff lwr upr p adj
## 30-50-10-30 0.17004 -0.1025 0.4426 0.2489
## 50-70-10-30 0.10787 -0.1858 0.4015 0.5893
## 50-70-30-50 -0.06217 -0.3900 0.2656 0.8635
##
## $`tillering:tuftdia`
## diff lwr upr p adj
## 30-60:10-30-10-30:10-30 1.643e-01 -0.56579 0.8943 0.9901
## 60-90:10-30-10-30:10-30 NA NA NA NA
## 10-30:30-50-10-30:10-30 1.714e-01 -0.37280 0.7155 0.9350
## 30-60:30-50-10-30:10-30 6.043e-01 0.13828 1.0703 0.0097
## 60-90:30-50-10-30:10-30 NA NA NA NA
## 10-30:50-70-10-30:10-30 NA NA NA NA
## 30-60:50-70-10-30:10-30 6.043e-01 0.06011 1.1484 0.0270
## 60-90:50-70-10-30:10-30 1.324e+00 0.78011 1.8684 0.0001
## 60-90:10-30-30-60:10-30 NA NA NA NA
## 10-30:30-50-30-60:10-30 7.083e-03 -0.83592 0.8501 1.0000
## 30-60:30-50-30-60:10-30 4.400e-01 -0.35479 1.2348 0.5001
## 60-90:30-50-30-60:10-30 NA NA NA NA
## 10-30:50-70-30-60:10-30 NA NA NA NA
## 30-60:50-70-30-60:10-30 4.400e-01 -0.40301 1.2830 0.5651
## 60-90:50-70-30-60:10-30 1.160e+00 0.31699 2.0030 0.0064
## 10-30:30-50-60-90:10-30 NA NA NA NA
## 30-60:30-50-60-90:10-30 NA NA NA NA
## 60-90:30-50-60-90:10-30 NA NA NA NA
## 10-30:50-70-60-90:10-30 NA NA NA NA
## 30-60:50-70-60-90:10-30 NA NA NA NA
## 60-90:50-70-60-90:10-30 NA NA NA NA
## 30-60:30-50-10-30:30-50 4.329e-01 -0.19542 1.0613 0.2703
## 60-90:30-50-10-30:30-50 NA NA NA NA
## 10-30:50-70-10-30:30-50 NA NA NA NA
## 30-60:50-70-10-30:30-50 4.329e-01 -0.25540 1.1212 0.3608
## 60-90:50-70-10-30:30-50 1.153e+00 0.46460 1.8412 0.0014
## 60-90:30-50-30-60:30-50 NA NA NA NA
## 10-30:50-70-30-60:30-50 NA NA NA NA
## 30-60:50-70-30-60:30-50 2.220e-16 -0.62834 0.6283 1.0000
## 60-90:50-70-30-60:30-50 7.200e-01 0.09166 1.3483 0.0222
## 10-30:50-70-60-90:30-50 NA NA NA NA
## 30-60:50-70-60-90:30-50 NA NA NA NA
## 60-90:50-70-60-90:30-50 NA NA NA NA
## 30-60:50-70-10-30:50-70 NA NA NA NA
## 60-90:50-70-10-30:50-70 NA NA NA NA
## 60-90:50-70-30-60:50-70 7.200e-01 0.03169 1.4083 0.0386
##
## $`height:tuftdia`
## diff lwr upr p adj
## 150-250:10-30->250:10-30 NA NA NA NA
## 50-150:10-30->250:10-30 NA NA NA NA
## >250:30-50->250:10-30 NA NA NA NA
## 150-250:30-50->250:10-30 NA NA NA NA
## 50-150:30-50->250:10-30 NA NA NA NA
## >250:50-70->250:10-30 NA NA NA NA
## 150-250:50-70->250:10-30 NA NA NA NA
## 50-150:50-70->250:10-30 NA NA NA NA
## 50-150:10-30-150-250:10-30 -0.20450 -0.6662 0.2572 0.7308
## >250:30-50-150-250:10-30 NA NA NA NA
## 150-250:30-50-150-250:10-30 0.05843 -0.4283 0.5451 0.9999
## 50-150:30-50-150-250:10-30 0.13527 -0.6343 0.9048 0.9980
## >250:50-70-150-250:10-30 -0.03620 -0.6323 0.5599 1.0000
## 150-250:50-70-150-250:10-30 0.03713 -0.5590 0.6332 1.0000
## 50-150:50-70-150-250:10-30 NA NA NA NA
## >250:30-50-50-150:10-30 NA NA NA NA
## 150-250:30-50-50-150:10-30 0.26293 -0.1988 0.7247 0.4688
## 50-150:30-50-50-150:10-30 0.33977 -0.4142 1.0938 0.7150
## >250:50-70-50-150:10-30 0.16830 -0.4076 0.7442 0.9556
## 150-250:50-70-50-150:10-30 0.24163 -0.3343 0.8175 0.7769
## 50-150:50-70-50-150:10-30 NA NA NA NA
## 150-250:30-50->250:30-50 NA NA NA NA
## 50-150:30-50->250:30-50 NA NA NA NA
## >250:50-70->250:30-50 NA NA NA NA
## 150-250:50-70->250:30-50 NA NA NA NA
## 50-150:50-70->250:30-50 NA NA NA NA
## 50-150:30-50-150-250:30-50 0.07684 -0.6927 0.8464 1.0000
## >250:50-70-150-250:30-50 -0.09463 -0.6907 0.5015 0.9990
## 150-250:50-70-150-250:30-50 -0.02130 -0.6174 0.5748 1.0000
## 50-150:50-70-150-250:30-50 NA NA NA NA
## >250:50-70-50-150:30-50 -0.17148 -1.0145 0.6715 0.9948
## 150-250:50-70-50-150:30-50 -0.09814 -0.9411 0.7449 0.9999
## 50-150:50-70-50-150:30-50 NA NA NA NA
## 150-250:50-70->250:50-70 0.07333 -0.6150 0.7616 0.9999
## 50-150:50-70->250:50-70 NA NA NA NA
## 50-150:50-70-150-250:50-70 NA NA NA NA
We also use an interaction plot to visualize the interaction of the factors on the response variable.
# Interaction Plot of Height and Tillering
interaction.plot(data$height,data$tillering,data$tuftweight)
# Interaction Plot of Height and Tuft Diameter
interaction.plot(data$height,data$tuftdia,data$tuftweight)
# Interaction Plot of Tillering and Tuft Diameter
interaction.plot(data$tillering,data$tuftdia,data$tuftweight)
We can see parrallelism and no intersections in each of our plots which implies there is no interaction effect.
Lastly we plot a fitted model against the residuals. We do not see a very large degree of variation in the plot.
#Plot of Fitted vs Residuals of the model with all factors
plot(fitted(model123), residuals(model123))
Overrall the results of our models and our adequacy tests lead us to accept that the results of the study are true and that the factors height, tillering and tuft diameter all have an effect on our response variable, the yield of the planets (tuft weight).
No literature was used
The study examined can be found at:
http://www.sciencedirect.com/science/article/pii/S0926669007001148
It is possible that the conclusions of our analysis are the results of chance. One concern is that the data is collected over only 3 years. There are a large number of factors that fluctuate year to year that could effect the the yield of the plants. Changes in weather pattern and climates, animal effects on the plants, and insects could all effect the yield in addition to plant size statistics.
It’s also possible that the different strains of the plant could thrive under different conditions. So while certain strains with different statistics could thrive in the given environment, other strains may have thrived better in different weather or environments.