This recipe analyzes a data set on the nitrogen uptake of cotton, focusing on the uptake measured in the leaves. It uses analysis of variance (ANOVA) to test whether the N Rate and the Water Salinity affect the nitrogen uptake in the leaves, in the hope of supporting or refuting the claim that neither factor changes the mean uptake.
cotton <- read.csv("~/Desktop/cotton.csv")
head(cotton)
## Nrate Water.Salinity Year Stem Leaves Bolls Total
## 1 N0 FW 2011 7.02 19.52 35.81 62.35
## 2 N0 BW 2011 7.23 21.33 36.68 65.24
## 3 N0 SW 2011 5.81 20.22 35.39 61.42
## 4 N360 FW 2011 19.32 68.51 120.52 208.33
## 5 N360 BW 2011 20.05 76.82 106.90 203.70
## 6 N360 SW 2011 15.51 62.71 91.42 169.64
A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. In this instance, I am conducting a two-factor analysis.
A level is one of the values that a factor can take. Here Nrate has two levels (N0 and N360) and Water.Salinity has three (BW, FW, and SW), so this is a multi-level analysis.
The first factor that this experiment will examine is the N Rate.
The second factor that I will consider is the concentration of Water Salinity.
tail(cotton)
## Nrate Water.Salinity Year Stem Leaves Bolls Total
## 7 N0 FW 2012 14.81 38.53 48.05 101.39
## 8 N0 BW 2012 10.41 29.56 39.96 79.93
## 9 N0 SW 2012 8.88 22.20 37.74 68.82
## 10 N360 FW 2012 33.30 91.02 139.86 264.18
## 11 N360 BW 2012 17.76 85.09 112.89 215.73
## 12 N360 SW 2012 15.54 56.24 99.90 171.68
summary(cotton)
## Nrate Water.Salinity Year Stem Leaves
## N0 :6 BW:4 Min. :2011 Min. : 5.81 Min. :19.5
## N360:6 FW:4 1st Qu.:2011 1st Qu.: 8.47 1st Qu.:22.0
## SW:4 Median :2012 Median :15.16 Median :47.4
## Mean :2012 Mean :14.64 Mean :49.3
## 3rd Qu.:2012 3rd Qu.:18.15 3rd Qu.:70.6
## Max. :2012 Max. :33.30 Max. :91.0
## Bolls Total
## Min. : 35.4 Min. : 61.4
## 1st Qu.: 37.5 1st Qu.: 67.9
## Median : 69.7 Median :135.5
## Mean : 75.4 Mean :139.4
## 3rd Qu.:108.4 3rd Qu.:204.9
## Max. :139.9 Max. :264.2
If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.
In this instance, Stem, Leaves, Bolls, and Total are continuous measurements, while Nrate and Water.Salinity are categorical and Year takes only the values 2011 and 2012. The analysis uses Leaves, the nitrogen uptake in the leaves, as its continuous variable.
A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, the response variable is the nitrogen uptake in the leaves (Leaves), since the analysis attempts to describe how it differs between the levels of the two factors of interest.
The data is organized into a 7-column table. The columns are titled as follows: Nrate, Water.Salinity, Year, Stem, Leaves, Bolls, and Total. All columns are numeric except Nrate and Water.Salinity, which are textual.
This data comes from a field experiment conducted during the 2011 and 2012 cotton growing seasons at an agricultural experiment station in China. The experiment had six treatments replicated 3 times in a randomized complete factorial block design.
In order to conduct this experiment, I will conduct two separate analyses of the factors at hand. First, I will analyze the levels of the N Rate (Nrate). I will then look at the nitrogen uptake in the leaves (Leaves) values to see if an obvious difference or pattern can be seen.
Second, I will analyze the levels of the Water Salinity (Water.Salinity), which is the second factor. Again, I will then look at the Leaves values to see if an obvious difference or pattern can be seen.
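The level-by-level comparison described above can be previewed numerically before any plotting. The following is a minimal sketch, not part of the original analysis: it rebuilds only the Nrate, Water.Salinity, Year, and Leaves columns from the twelve rows printed by head() and tail() earlier, so that it runs without cotton.csv. The name cotton_demo is hypothetical.

```r
# cotton_demo is a hypothetical stand-in for the data frame read from
# cotton.csv; the values are copied from the printed rows 1-12.
cotton_demo <- data.frame(
  Nrate          = rep(rep(c("N0", "N360"), each = 3), times = 2),
  Water.Salinity = rep(c("FW", "BW", "SW"), times = 4),
  Year           = rep(c(2011, 2012), each = 6),
  Leaves         = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
                     38.53, 29.56, 22.20, 91.02, 85.09, 56.24),
  stringsAsFactors = TRUE
)

# Mean of Leaves at each level of each factor
means_by_nrate    <- tapply(cotton_demo$Leaves, cotton_demo$Nrate, mean)
means_by_salinity <- tapply(cotton_demo$Leaves, cotton_demo$Water.Salinity, mean)

print(round(means_by_nrate, 2))
print(round(means_by_salinity, 2))
```

With these rows, the mean of Leaves is roughly 25.2 at N0 and 73.4 at N360, while the Water Salinity means are much closer together (about 53.2 for BW, 54.4 for FW, and 40.3 for SW).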
I have chosen to use this type of experimental design to demonstrate proper experimentation with a data set with at least two factors and at least two levels of each factor.
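The field experiment was replicated three times, but the data set analyzed here holds only one row per combination of Nrate, Water.Salinity, and Year (12 rows in total), so the individual replicates are not available as separate rows. Instead, each treatment combination is observed once in each of the two years, which gives repeated measures across the factor levels.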
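The number of observations behind each treatment combination can be checked by cross-tabulating the design columns. This is a small sketch under the assumption that the twelve printed rows are the whole data set; design_demo is a hypothetical name, and only the three design columns are rebuilt.

```r
# design_demo is a hypothetical data frame holding only the design columns,
# in the same row order as the printed head() and tail() output.
design_demo <- data.frame(
  Nrate          = rep(rep(c("N0", "N360"), each = 3), times = 2),
  Water.Salinity = rep(c("FW", "BW", "SW"), times = 4),
  Year           = rep(c(2011, 2012), each = 6),
  stringsAsFactors = TRUE
)

# Observations per Nrate x Water.Salinity combination (pooled over years)
cell_counts <- table(design_demo$Nrate, design_demo$Water.Salinity)
print(cell_counts)

# Observations per Nrate x Water.Salinity x Year combination
cell_year_counts <- xtabs(~ Nrate + Water.Salinity + Year, data = design_demo)
print(cell_year_counts)
```

Every Nrate by Water.Salinity cell holds two rows, one from each year, and every cell within a single year holds exactly one row.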
I did not add a separate blocking variable in this data analysis. The only grouping applied is the division of the observations into the levels of the two factors.
At this point, I must define the N Rate (Nrate) and the Water Salinity (Water.Salinity) as the factors for analysis.
cotton$Nrate=as.factor(cotton$Nrate)
cotton$Water.Salinity=as.factor(cotton$Water.Salinity)
Below are a histogram of the nitrogen uptake in the leaves, a boxplot of all observations, and boxplots for the levels of each of the two factors of interest.
par(mfrow=c(1,1))
hist(cotton$Leaves)
par(mfrow=c(1,1))
boxplot(cotton$Leaves, main="Boxplot of Cotton Growth on Leaves", xlab="Leaves", ylab="Cotton Growth Metric", names=c("Leaves"))
boxplot(Leaves~Nrate, data=cotton)
boxplot(Leaves~Water.Salinity, data=cotton)
At this point, I am introducing the Analysis of Variance (ANOVA) test. The ANOVA test is used to analyze the differences in the mean nitrogen uptake in the leaves between the N Rate levels and between the Water Salinity levels. A third ANOVA test analyzes the interaction effect between the two factors.
model_Nrate=aov(Leaves~Nrate,data=cotton)
model_Water.Salinity=aov(Leaves~Water.Salinity,data=cotton)
model_Nrate_Water.Salinity=aov(Leaves~Nrate*Water.Salinity,data=cotton)
anova(model_Nrate)
## Analysis of Variance Table
##
## Response: Leaves
## Df Sum Sq Mean Sq F value Pr(>F)
## Nrate 1 6962 6962 59.5 1.6e-05 ***
## Residuals 10 1169 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_Water.Salinity)
## Analysis of Variance Table
##
## Response: Leaves
## Df Sum Sq Mean Sq F value Pr(>F)
## Water.Salinity 2 486 243 0.29 0.76
## Residuals 9 7645 849
anova(model_Nrate_Water.Salinity)
## Analysis of Variance Table
##
## Response: Leaves
## Df Sum Sq Mean Sq F value Pr(>F)
## Nrate 1 6962 6962 79.56 0.00011 ***
## Water.Salinity 2 486 243 2.78 0.14019
## Nrate:Water.Salinity 2 159 79 0.91 0.45299
## Residuals 6 525 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test that analyzed the variation in nitrogen uptake in the leaves across the N Rate levels returned a p-value of 1.6e-05. This small p-value means that, if the N Rate had no effect, a difference in means this large would very rarely arise from random variation alone. Thus the conclusion may be drawn that the change in nitrogen uptake in the leaves is likely a result of the change in the N Rate.
The ANOVA test that analyzed the variation in nitrogen uptake in the leaves across the Water Salinity levels returned a p-value of 0.76. This large p-value means that differences this large between the Water Salinity levels could easily arise from random variation alone. Thus this test provides no evidence that the change in nitrogen uptake in the leaves is a result of the change in the Water Salinity levels.
Although only the N Rate showed a significant effect in the one-factor ANOVAs, I then performed an ANOVA on both factors to analyze their interaction effect. The resulting p-value for the interaction term was 0.45299, which indicates that there is no evidence of an interaction: the effect of the N Rate does not appear to depend on the Water Salinity level. In this two-factor model the N Rate remains significant (p-value 0.00011) and the Water Salinity does not (p-value 0.14019).
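To make the N Rate result less of a black box, the F value and p-value of a one-factor ANOVA can be computed by hand from sums of squares. The sketch below is illustrative only and uses hypothetical variable names; the Leaves values and N Rate labels are copied from the twelve printed rows.

```r
# Leaves values and N Rate labels in the printed row order (rows 1-12)
leaves <- c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
            38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
nrate  <- factor(rep(rep(c("N0", "N360"), each = 3), times = 2))

grand_mean  <- mean(leaves)
group_means <- tapply(leaves, nrate, mean)
group_sizes <- tapply(leaves, nrate, length)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares: spread of values around their group mean
ss_within  <- sum((leaves - ave(leaves, nrate))^2)

df_between <- nlevels(nrate) - 1
df_within  <- length(leaves) - nlevels(nrate)

# F is the ratio of the two mean squares; the p-value is the upper tail of F
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(SS_between = ss_between, SS_within = ss_within, F = f_value, p = p_value))
```

These values should agree with the Nrate row of the first ANOVA table above: sums of squares of about 6962 and 1169 on 1 and 10 degrees of freedom, and an F value of about 59.5.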
To check the adequacy of using the ANOVA as a means of analyzing this set of data, I created normal Quantile-Quantile (Q-Q) plots of the residuals to judge whether the residuals follow a normal distribution. I also created an interaction plot to see if there was an interaction effect between the two factors.
The nearly linear fit of the residuals in the first Q-Q plot, for the ‘Nrate’ model, is an indication that the model is adequate for this analysis.
The barely linear fit of the residuals in the second Q-Q plot, for the ‘Water.Salinity’ model, is an indication that this model is less adequate for this analysis.
The interaction plot following the Q-Q plots draws one line per Water Salinity level across the two N Rates. Lines that cross or are clearly non-parallel suggest that the two factors interact with each other in their effect on the response variable. As can be seen from the plot, the lines for the different Water Salinity levels generally do not intersect, apart from a slight crossing between the BW and FW levels, so the factors appear largely non-interactive.
The third type of plot is a Residuals vs. Fits plot, which is used to check whether the residuals scatter evenly around zero across the fitted values and to determine if there are any outlying values.
qqnorm(residuals(model_Nrate))
qqline(residuals(model_Nrate))
qqnorm(residuals(model_Water.Salinity))
qqline(residuals(model_Water.Salinity))
interaction.plot(cotton$Nrate, cotton$Water.Salinity, cotton$Leaves)
plot(fitted(model_Nrate),residuals(model_Nrate))
plot(fitted(model_Water.Salinity),residuals(model_Water.Salinity))
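The points drawn by the interaction plot are simply the mean of Leaves in each Nrate by Water.Salinity cell, so the same information can be read from a small table. This is a minimal sketch with hypothetical names (cells_demo, cell_means, nrate_effect), using the values from the twelve printed rows.

```r
# cells_demo is a hypothetical stand-in for the cotton data frame,
# rebuilt from the printed rows 1-12.
cells_demo <- data.frame(
  Nrate          = rep(rep(c("N0", "N360"), each = 3), times = 2),
  Water.Salinity = rep(c("FW", "BW", "SW"), times = 4),
  Leaves         = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
                     38.53, 29.56, 22.20, 91.02, 85.09, 56.24),
  stringsAsFactors = TRUE
)

# Mean of Leaves in each cell: rows are Nrate levels, columns are salinity levels
cell_means <- tapply(cells_demo$Leaves,
                     list(cells_demo$Nrate, cells_demo$Water.Salinity),
                     mean)
print(round(cell_means, 2))

# Change in mean Leaves from N0 to N360 within each Water Salinity level;
# equal values here would correspond to parallel lines in the plot.
nrate_effect <- cell_means["N360", ] - cell_means["N0", ]
print(round(nrate_effect, 2))
```

In this table the BW mean lies below the FW mean at N0 but slightly above it at N360, which is the crossing between those two lines. The SW line stays below both at each N Rate.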
Tukey’s HSD test is a post-hoc test, meaning that it is performed after an analysis of variance (ANOVA) test. This means that to maintain integrity, a statistician should not perform Tukey’s HSD test unless she has first performed an ANOVA analysis. In statistics, post-hoc tests are used only for further data analysis; these types of tests are not pre-planned. In other words, you should have no plans to use Tukey’s HSD test before you collect and analyze the data once.
The purpose of Tukey’s HSD test is to determine which groups in the sample differ. While ANOVA can tell the researcher whether groups in the sample differ, it cannot tell the researcher which groups differ. That is, if the results of ANOVA are positive in the sense that they state there is a significant difference among the groups, the obvious question becomes: Which groups in this sample differ significantly? It is not likely that all groups differ when compared to each other, only that a handful have significant differences. Tukey’s HSD can clarify to the researcher which groups among the sample in specific have significant differences.
Tuke<- TukeyHSD(aov(Leaves~Nrate*Water.Salinity, data=cotton))
plot(Tuke)
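Besides the plot, the pairwise comparisons can be read directly from the object that TukeyHSD() returns, which holds one matrix per model term with the columns diff, lwr, upr, and p adj. The sketch below is a hedged illustration rather than part of the original analysis: tukey_demo_data, model_demo, and tukey_demo are hypothetical names, and the data are rebuilt from the twelve printed rows so the example runs on its own.

```r
# tukey_demo_data is a hypothetical stand-in for the cotton data frame
tukey_demo_data <- data.frame(
  Nrate          = rep(rep(c("N0", "N360"), each = 3), times = 2),
  Water.Salinity = rep(c("FW", "BW", "SW"), times = 4),
  Leaves         = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
                     38.53, 29.56, 22.20, 91.02, 85.09, 56.24),
  stringsAsFactors = TRUE
)

model_demo <- aov(Leaves ~ Nrate * Water.Salinity, data = tukey_demo_data)

# Restrict the comparisons to the two main effects
tukey_demo <- TukeyHSD(model_demo, which = c("Nrate", "Water.Salinity"))

print(tukey_demo$Nrate)           # one comparison: N360-N0
print(tukey_demo$Water.Salinity)  # three comparisons among BW, FW, SW

# Pull out single numbers for reporting
nrate_diff <- tukey_demo$Nrate["N360-N0", "diff"]
nrate_padj <- tukey_demo$Nrate["N360-N0", "p adj"]
```

With these rows the N360-N0 difference in mean Leaves is about 48.2 with a small adjusted p-value, while none of the three Water Salinity comparisons reaches the 0.05 level, in line with the two-factor ANOVA table.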
See course canvas site. Also http://www.sciencedirect.com/science/article/pii/S0378429014002524