Recipe 4

Matthew Macchi

Rensselaer Polytechnic Institute

10/16/14 Version 1

1. Setting

System under test

This recipe analyzes a dataset on the nitrogen uptake of cotton plants, focusing on the uptake in the leaves. The experiment uses analysis of variance (ANOVA) to examine whether the N Rate and the Water Salinity affect Cotton Uptake on Leaves, in hopes of supporting or refuting the claim that Cotton Uptake on Leaves does not vary much across the levels of these two factors.

cotton <- read.csv("~/Desktop/cotton.csv")
head(cotton)
##   Nrate Water.Salinity Year  Stem Leaves  Bolls  Total
## 1    N0             FW 2011  7.02  19.52  35.81  62.35
## 2    N0             BW 2011  7.23  21.33  36.68  65.24
## 3    N0             SW 2011  5.81  20.22  35.39  61.42
## 4  N360             FW 2011 19.32  68.51 120.52 208.33
## 5  N360             BW 2011 20.05  76.82 106.90 203.70
## 6  N360             SW 2011 15.51  62.71  91.42 169.64

Factors and Levels

A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. In this instance, I am conducting a two-factor analysis.

The term level refers to the values that a factor can take. Here, Nrate has two levels (N0 and N360) and Water.Salinity has three levels (BW, FW, and SW), so this is a multi-level analysis.

The first factor that this experiment will examine is the N Rate.

The second factor that I will consider is the concentration of Water Salinity.
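
The levels of each factor can be listed directly in R. The sketch below is illustrative only: it rebuilds the relevant columns from the head(cotton) and tail(cotton) output shown in this recipe instead of reading the CSV file, and the name cotton_demo is a placeholder chosen here so that the original cotton data frame is not overwritten.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Year = rep(c(2011, 2012), each = 6),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# List the levels of each factor
nrate_levels <- levels(cotton_demo$Nrate)
salinity_levels <- levels(cotton_demo$Water.Salinity)
print(nrate_levels)
print(salinity_levels)

# Count the levels of each factor
n_levels <- c(Nrate = nlevels(cotton_demo$Nrate),
              Water.Salinity = nlevels(cotton_demo$Water.Salinity))
print(n_levels)
```

R orders factor levels alphabetically by default, which is why the salinity levels are listed as BW, FW, SW even though FW appears first in the data.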

head(cotton)
##   Nrate Water.Salinity Year  Stem Leaves  Bolls  Total
## 1    N0             FW 2011  7.02  19.52  35.81  62.35
## 2    N0             BW 2011  7.23  21.33  36.68  65.24
## 3    N0             SW 2011  5.81  20.22  35.39  61.42
## 4  N360             FW 2011 19.32  68.51 120.52 208.33
## 5  N360             BW 2011 20.05  76.82 106.90 203.70
## 6  N360             SW 2011 15.51  62.71  91.42 169.64
tail(cotton)
##    Nrate Water.Salinity Year  Stem Leaves  Bolls  Total
## 7     N0             FW 2012 14.81  38.53  48.05 101.39
## 8     N0             BW 2012 10.41  29.56  39.96  79.93
## 9     N0             SW 2012  8.88  22.20  37.74  68.82
## 10  N360             FW 2012 33.30  91.02 139.86 264.18
## 11  N360             BW 2012 17.76  85.09 112.89 215.73
## 12  N360             SW 2012 15.54  56.24  99.90 171.68
summary(cotton)
##   Nrate   Water.Salinity      Year           Stem           Leaves    
##  N0  :6   BW:4           Min.   :2011   Min.   : 5.81   Min.   :19.5  
##  N360:6   FW:4           1st Qu.:2011   1st Qu.: 8.47   1st Qu.:22.0  
##           SW:4           Median :2012   Median :15.16   Median :47.4  
##                          Mean   :2012   Mean   :14.64   Mean   :49.3  
##                          3rd Qu.:2012   3rd Qu.:18.15   3rd Qu.:70.6  
##                          Max.   :2012   Max.   :33.30   Max.   :91.0  
##      Bolls           Total      
##  Min.   : 35.4   Min.   : 61.4  
##  1st Qu.: 37.5   1st Qu.: 67.9  
##  Median : 69.7   Median :135.5  
##  Mean   : 75.4   Mean   :139.4  
##  3rd Qu.:108.4   3rd Qu.:204.9  
##  Max.   :139.9   Max.   :264.2

Continuous variables (if any)

If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

In this dataset, Stem, Leaves, Bolls, and Total are all continuous measurements. Of these, only Leaves is used in this analysis. Since uptake of cotton on leaves is a measured quantity and not a categorical variable, it is continuous.

Response variables

A response variable is defined as the outcome of a study. It is a variable you would be interested in predicting or forecasting. It is often called a dependent variable or predicted variable. In this instance, the response variable is the uptake of cotton on leaves (Leaves), since the analysis will attempt to describe how it differs between the levels of the two factors of interest.

The Data: How is it organized and what does it look like?

The data is organized initially into a 7-column table with 12 rows. The columns are titled as follows: Nrate, Water.Salinity, Year, Stem, Leaves, Bolls, and Total. All columns are numeric except Nrate and Water.Salinity, which are categorical (text labels).

Randomization

This data comes from a field experiment conducted during the 2011 and 2012 cotton growing seasons at an agricultural experiment station in China. The experiment used six treatments replicated 3 times in a randomized complete factorial block design.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In order to conduct this experiment, I will conduct two separate analyses of the factors at hand, followed by a third analysis that considers both factors together. First, I will analyze the levels of the N Rate (Nrate) factor. I will then look at the Cotton growth on Leaves (Leaves) values to see if an obvious difference or pattern can be seen.

Second, I will analyze multiple levels of the Water Salinity (Water.Salinity) of the data, which is the second factor. Again, I will then look at the Cotton growth on Leaves (Leaves) values to see if an obvious difference or pattern can be seen.

What is the rationale for this design?

I have chosen to use this type of experimental design to demonstrate proper experimentation with a data set with at least two factors and at least two levels of each factor.

Randomize: What is the Randomization Scheme?

This experiment was conducted in six treatments replicated 3 times in a randomized complete factorial block design.
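
To illustrate what such a randomization scheme involves, a randomized complete block layout with six treatments and three blocks could be generated along the following lines. This is only a sketch: the block and plot labels, the seed, and the name field_layout are assumptions made here, and the result is not the actual field layout used in the study.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# The six treatments are all combinations of the two factors
combos <- expand.grid(Nrate = c("N0", "N360"),
                      Water.Salinity = c("FW", "BW", "SW"))
treatments <- paste(combos$Nrate, combos$Water.Salinity, sep = "/")

# Within each block, every treatment appears exactly once in random order
field_layout <- sapply(1:3, function(block) sample(treatments))
colnames(field_layout) <- paste0("Block", 1:3)
rownames(field_layout) <- paste0("Plot", 1:6)
print(field_layout)
```

Each column is one block, and each block contains its own independent random ordering of the same six treatments, which is the defining property of a complete block design.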

Replicate: Are there replicates and/or repeated measures?

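The original field experiment replicated each treatment 3 times. In the dataset as loaded, however, each of the six combinations of Nrate and Water.Salinity appears only once per year, so the 2011 and 2012 rows act as repeated measures of each combination.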
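
One way to check how many observations each treatment combination has is to cross-tabulate the two factors. The sketch below rebuilds the relevant columns from the head(cotton) and tail(cotton) output shown earlier; the name cotton_demo is a placeholder used only here.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Year = rep(c(2011, 2012), each = 6),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# Observations per Nrate x Water.Salinity cell
cell_counts <- with(cotton_demo, table(Nrate, Water.Salinity))
print(cell_counts)

# The same counts split by year
year_counts <- with(cotton_demo, table(Nrate, Water.Salinity, Year))
print(year_counts)
```

Equal counts in every cell indicate a balanced design, which matters for the two-way ANOVA because the sums of squares then do not depend on the order of the terms.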

Block: Did you use blocking in the design?

I did not add any blocking in this data analysis beyond grouping the observations by the levels of their respective factors (Nrate and Water.Salinity).

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

At this point, I must define the N Rate (Nrate) and the Water Salinity (Water.Salinity) as the factors for analysis.

cotton$Nrate=as.factor(cotton$Nrate)
cotton$Water.Salinity=as.factor(cotton$Water.Salinity)

Below are a histogram and boxplots of the Cotton uptake on Leaves, first for all observations and then for all levels of the two factors of interest.

par(mfrow=c(1,1))
hist(cotton$Leaves)

[Figure: histogram of cotton$Leaves]

par(mfrow=c(1,1))
boxplot(cotton$Leaves, main="Boxplot of Cotton Growth on Leaves", xlab="Leaves", ylab="Cotton Growth Metric", names=c("Leaves"))

[Figure: boxplot of Leaves for all observations]

boxplot(Leaves~Nrate, data=cotton)

[Figure: boxplots of Leaves by Nrate]

boxplot(Leaves~Water.Salinity, data=cotton)

[Figure: boxplots of Leaves by Water.Salinity]
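
The group differences visible in the boxplots can also be summarized numerically with group means. The following is a minimal sketch that rebuilds the relevant columns from the head(cotton) and tail(cotton) output shown earlier; cotton_demo is a placeholder name used only here.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# Mean of Leaves for each level of each factor
means_by_nrate <- tapply(cotton_demo$Leaves, cotton_demo$Nrate, mean)
means_by_salinity <- tapply(cotton_demo$Leaves, cotton_demo$Water.Salinity, mean)

print(round(means_by_nrate, 2))
print(round(means_by_salinity, 2))
```

The gap between the two Nrate means is much larger than any gap between the Water.Salinity means, which is the pattern the ANOVA tests then evaluate formally.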

Testing

At this point, I am introducing the Analysis of Variance (ANOVA) test. Two one-way ANOVA tests are used to analyze the differences in the mean Cotton growth on leaves across the N Rates and across the Water Salinity levels. A third, two-way ANOVA test also analyzes the interaction effect between the two factors.

model_Nrate=aov(Leaves~Nrate,data=cotton)
model_Water.Salinity=aov(Leaves~Water.Salinity,data=cotton)
model_Nrate_Water.Salinity=aov(Leaves~Nrate*Water.Salinity,data=cotton)
anova(model_Nrate)
## Analysis of Variance Table
## 
## Response: Leaves
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## Nrate      1   6962    6962    59.5 1.6e-05 ***
## Residuals 10   1169     117                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_Water.Salinity)
## Analysis of Variance Table
## 
## Response: Leaves
##                Df Sum Sq Mean Sq F value Pr(>F)
## Water.Salinity  2    486     243    0.29   0.76
## Residuals       9   7645     849
anova(model_Nrate_Water.Salinity)
## Analysis of Variance Table
## 
## Response: Leaves
##                      Df Sum Sq Mean Sq F value  Pr(>F)    
## Nrate                 1   6962    6962   79.56 0.00011 ***
## Water.Salinity        2    486     243    2.78 0.14019    
## Nrate:Water.Salinity  2    159      79    0.91 0.45299    
## Residuals             6    525      87                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA Results

The ANOVA test that analyzed the variation in cotton growth on leaves as a result of variation in the N Rate returned a p-value of 1.6e-05. This small p-value means that, if the N Rate had no effect, differences in cotton growth on leaves this large would very rarely arise from random variation alone. Thus the conclusion may be drawn that the change in cotton growth on leaves is likely a result of the change in the N Rate.
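
The F value and p-value reported for Nrate can be reproduced by hand, which makes clear what the ANOVA table is measuring: the F value is the ratio of the mean square between groups to the mean square within groups. The sketch below rebuilds the needed columns from the head(cotton) and tail(cotton) output shown earlier; cotton_demo and the other variable names are placeholders introduced here.

```r
# Rebuild Nrate and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

grand_mean  <- mean(cotton_demo$Leaves)
group_means <- tapply(cotton_demo$Leaves, cotton_demo$Nrate, mean)
group_sizes <- tapply(cotton_demo$Leaves, cotton_demo$Nrate, length)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: spread of observations around their own group mean
own_group_mean <- ave(cotton_demo$Leaves, cotton_demo$Nrate)
ss_within <- sum((cotton_demo$Leaves - own_group_mean)^2)

df_between <- nlevels(cotton_demo$Nrate) - 1
df_within  <- nrow(cotton_demo) - nlevels(cotton_demo$Nrate)

# F is the ratio of the two mean squares; the p-value is the upper tail of F
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(SS_between = ss_between, SS_within = ss_within,
        F = f_value, p = p_value))
```

The sums of squares, degrees of freedom, and F value computed this way should agree with the Nrate and Residuals rows of anova(model_Nrate) up to rounding.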

The ANOVA test that analyzed the variation in cotton growth on leaves as a result of variation in the Water Salinity levels returned a p-value of 0.76. This large p-value means that differences of this size between Water Salinity levels would often arise from random variation alone. Thus this test provides no evidence that the change in cotton growth on leaves is a result of the change in the Water Salinity levels.

I then performed a two-way ANOVA to analyze the two factors together along with their interaction effect. In this model the N Rate remained significant (p-value of 0.00011) and Water Salinity did not (p-value of 0.14019). The resulting p-value for the interaction was 0.45299, which indicates that there is no evidence of an interaction effect: any apparent interaction between the two factors in the cotton growth on leaves could easily be a result of random variation.
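
The p-values of the two-way model can be pulled out of the ANOVA table and compared against a significance level programmatically. This is a minimal sketch: the 0.05 threshold is an assumption made here (the recipe does not state one), and cotton_demo is a placeholder data frame rebuilt from the head(cotton) and tail(cotton) output shown earlier.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# Two-way ANOVA with interaction
model_two_way <- aov(Leaves ~ Nrate * Water.Salinity, data = cotton_demo)
anova_table <- anova(model_two_way)

# Named vector of p-values; the Residuals row has no p-value and is dropped
p_values <- setNames(anova_table[["Pr(>F)"]], rownames(anova_table))
p_values <- p_values[!is.na(p_values)]

alpha <- 0.05  # assumed significance level, not stated in the recipe
significant <- p_values < alpha

print(round(p_values, 5))
print(significant)
```

Under this assumed threshold, only the term whose p-value falls below alpha is flagged, which gives a quick check on the verbal interpretation of the table.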

Diagnostics/Model Adequacy Checking

To check the adequacy of using the ANOVA as a means of analyzing this set of data I created Quantile-Quantile (Q-Q) plots of the residuals to determine if the residuals followed a normal distribution. I also created an interaction plot to see if there was an interaction effect between the two factors.

The nearly linear fit of the residuals in the first QQ plot, for the model with ‘Nrate’, is an indication that the normality assumption is reasonable and the model is adequate for this analysis.

The barely linear fit of the residuals in the second QQ plot, for the model with ‘Water.Salinity’, is an indication that the model is less adequate for this analysis.

The interaction plot following the QQ plots shows whether the two factors are interacting with each other to create an effect in the response variable: an intersection of curves on the plot suggests an interaction. As can be seen from the plot, the lines for the different levels of Water Salinity generally do not intersect and are somewhat non-interactive, besides the slight interaction between the BW and FW Water Salinity levels.

The third type of plot is a Residuals vs. Fits plot, which is used to check the residuals for patterns, such as a spread that changes with the fitted values, and to determine if there are any outlying values.

qqnorm(residuals(model_Nrate))
qqline(residuals(model_Nrate))

[Figure: normal Q-Q plot of the residuals of model_Nrate]

qqnorm(residuals(model_Water.Salinity))
qqline(residuals(model_Water.Salinity))

[Figure: normal Q-Q plot of the residuals of model_Water.Salinity]

interaction.plot(cotton$Nrate, cotton$Water.Salinity, cotton$Leaves)

[Figure: interaction plot of Leaves by Nrate and Water.Salinity]
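
The interaction plot draws the mean of Leaves in each Nrate and Water.Salinity cell, so the same information can be read from a table of cell means. The sketch below rebuilds the needed columns from the head(cotton) and tail(cotton) output shown earlier; cotton_demo is a placeholder name used only here.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# Cell means: rows are Nrate levels, columns are Water.Salinity levels
cell_means <- with(cotton_demo,
                   tapply(Leaves, list(Nrate, Water.Salinity), mean))
print(round(cell_means, 2))

# Change from N0 to N360 within each salinity level (the slope of each line)
slopes <- cell_means["N360", ] - cell_means["N0", ]
print(round(slopes, 2))
```

If there were no interaction at all, the three slopes would be equal and the lines parallel. Two lines cross when the ordering of two salinity levels differs between N0 and N360.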

plot(fitted(model_Nrate),residuals(model_Nrate))

[Figure: residuals vs. fitted values for model_Nrate]

plot(fitted(model_Water.Salinity),residuals(model_Water.Salinity))

[Figure: residuals vs. fitted values for model_Water.Salinity]
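
As a numerical complement to judging the Q-Q plots by eye, a Shapiro-Wilk test could be run on the residuals. This test is not part of the original analysis; it is shown here only as an optional additional check, using a placeholder data frame cotton_demo rebuilt from the head(cotton) and tail(cotton) output shown earlier.

```r
# Rebuild Nrate and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

# Fit the one-way model and test its residuals for normality
model_nrate_demo <- aov(Leaves ~ Nrate, data = cotton_demo)
shapiro_result <- shapiro.test(residuals(model_nrate_demo))
print(shapiro_result)
```

A small p-value would suggest that the residuals depart from normality. With only 12 observations the test has little power, so it should be read alongside the Q-Q plot and not in place of it.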

4. Post-Hoc Test

Tukey’s HSD test is a post-hoc test, meaning that it is performed after an analysis of variance (ANOVA) test. This means that to maintain integrity, a statistician should not perform Tukey’s HSD test unless she has first performed an ANOVA analysis. In statistics, post-hoc tests are used only for further data analysis; these types of tests are not pre-planned. In other words, you should have no plans to use Tukey’s HSD test before you collect and analyze the data once.

The purpose of Tukey’s HSD test is to determine which groups in the sample differ. While ANOVA can tell the researcher whether groups in the sample differ, it cannot tell the researcher which groups differ. That is, if the results of ANOVA are positive in the sense that they state there is a significant difference among the groups, the obvious question becomes: Which groups in this sample differ significantly? It is not likely that all groups differ when compared to each other, only that a handful have significant differences. Tukey’s HSD can clarify to the researcher which groups among the sample in specific have significant differences.

Tuke<- TukeyHSD(aov(Leaves~Nrate*Water.Salinity, data=cotton))
plot(Tuke)

[Figures: Tukey HSD confidence intervals for Nrate, Water.Salinity, and Nrate:Water.Salinity]
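
Besides plotting, the Tukey HSD result can be inspected as a table, where each row is one pairwise comparison with its estimated difference (diff), confidence bounds (lwr, upr), and adjusted p-value (p adj). The sketch below uses a placeholder data frame cotton_demo rebuilt from the head(cotton) and tail(cotton) output shown earlier, and restricts the output to the two main effects.

```r
# Rebuild the factor columns and Leaves from the printed rows 1-12
# (cotton_demo is a hypothetical name used only for this sketch)
cotton_demo <- data.frame(
  Nrate = factor(rep(rep(c("N0", "N360"), each = 3), times = 2)),
  Water.Salinity = factor(rep(c("FW", "BW", "SW"), times = 4)),
  Leaves = c(19.52, 21.33, 20.22, 68.51, 76.82, 62.71,
             38.53, 29.56, 22.20, 91.02, 85.09, 56.24)
)

model_two_way <- aov(Leaves ~ Nrate * Water.Salinity, data = cotton_demo)

# Pairwise comparisons for the two main effects only
tukey_result <- TukeyHSD(model_two_way, which = c("Nrate", "Water.Salinity"))

nrate_table <- tukey_result$Nrate
salinity_table <- tukey_result$Water.Salinity

print(round(nrate_table, 4))
print(round(salinity_table, 4))
```

A comparison whose interval (lwr to upr) excludes zero corresponds to an adjusted p-value below the family-wise level, which is 0.05 by default in TukeyHSD.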

5. References to the literature

See course canvas site. Also http://www.sciencedirect.com/science/article/pii/S0378429014002524

A summary of, or pointer to, the raw data

complete and documented R code