library(qcc)
## Warning: package 'qcc' was built under R version 4.3.3
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
library(stats)

OK, lecture 8 in Tim Matis's series on statistical analysis/DoE.

Analysis of variance. Developed by agricultural statisticians, and the terminology reflects that background. We could compare the fertilizer results pairwise, as we have done in the past. The problem is that when we test 1-2, 2-3, and 1-3, each of these two-sample t tests has a probability of alpha = 0.05 of rejecting the null hypothesis when it is true. Looking at the entire experiment, the familywise alpha, the probability of a type I error in any of the tests, is 1 - (1 - alpha)^3 = 14%, which is unacceptably high. The number of pairwise comparisons is n!/(2!(n-2)!). If we were looking at 5 different fertilizers, that is 10 comparisons, and the familywise alpha is 1 - (0.95)^10, which is 40%. Not good.
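
A quick sanity check of those numbers in R (a minimal sketch; the per-test alpha of 0.05 is the only input):

alpha <- 0.05
k3 <- choose(3, 2)     # 3 pairwise comparisons among 3 fertilizers
k5 <- choose(5, 2)     # 10 pairwise comparisons among 5 fertilizers
1 - (1 - alpha)^k3     # familywise alpha, about 0.14
1 - (1 - alpha)^k5     # familywise alpha, about 0.40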

The correct way to do this is called ANOVA, or analysis of variance. The name sounds like a hypothesis test concerning variances, but in fact the hypothesis is about the means. It is a two-step process. First, is there any difference between the fertilizers at all?

H0: u1 = u2 = u3, Ha: at least one of them is different.

If we reject H0, then which one is different? We'll use Tukey's test, which controls the familywise error rate.

ANOVA assumes that the populations being compared are all normal and that they all have the same variance. Now we have to back up a bit to understand this. We have 3 treatments, from each of which we will sample how much corn is produced. Take the same number of samples from each population - this is a balanced experiment. Plot the observations from each treatment on a line. If H0 is true, then all observations come from a single distribution, and we can estimate u with xbar.., the average of all measurements. If H0 is false, then u1, u2 and u3 differ, and we estimate u1 with xbar1., u2 with xbar2., and u3 with xbar3.. Here xbar.. is called the grand mean, and the others are the treatment (population) means. Remember that we have assumed all the populations are normal and have the same variance.

The sum of squares total = sum of squares treatment + sum of squares error: SST = SSTr + SSE. These are calculated the usual way, by summing squared differences. SSTr compares each treatment mean to the grand mean (weighted by the number of observations per treatment); SSE compares each observation to its own treatment mean. It's easiest to organize this in an ANOVA table:

Source      SS     df        MS                    F
Treatment   SSTr   I - 1     MSTr = SSTr/(I - 1)   MSTr/MSE
Error       SSE    I(J - 1)  MSE = SSE/(I(J - 1))
Total       SST    IJ - 1

I = number of treatments

J = number of observations per treatment

The ratio MSTr/MSE is the test statistic, which under H0 follows an F distribution with I-1 and I(J-1) degrees of freedom. After picking an error level alpha, this determines a critical value for the ratio; then test the hypothesis using F(I-1, I(J-1)).
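
To make the table concrete, here is a minimal sketch with made-up numbers - 3 treatments, 4 observations each (the treatment means 95/100/105 and sd of 5 are invented purely for illustration; I and J mirror the table notation):

set.seed(1)
I <- 3; J <- 4
toy <- data.frame(trt = factor(rep(1:I, each = J)),
                  y   = rnorm(I * J, mean = rep(c(95, 100, 105), each = J), sd = 5))
grand <- mean(toy$y)                    # grand mean, xbar..
means <- tapply(toy$y, toy$trt, mean)   # treatment means, xbar_i.
SSTr  <- J * sum((means - grand)^2)     # between-treatment sum of squares
SSE   <- sum((toy$y - means[toy$trt])^2)  # within-treatment sum of squares
MSTr  <- SSTr / (I - 1)
MSE   <- SSE / (I * (J - 1))
Fstat <- MSTr / MSE
c(F = Fstat, p = pf(Fstat, I - 1, I * (J - 1), lower.tail = FALSE))
summary(aov(y ~ trt, data = toy))       # should match the hand calculation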

There are some errors in the slides regarding the right degrees of freedom; check Tim's reposting of the slide deck and Excel sheet. Larger numbers of observations create more reliable predictions. Smaller variances also make it easier to distinguish between populations. Now in R:
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzer.csv") #read in data
str(dat)           #fertilizer is read as integer
## 'data.frame':    120 obs. of  2 variables:
##  $ Fertilizer: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight    : num  975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
str(dat)
## 'data.frame':    120 obs. of  2 variables:
##  $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight    : num  975 823 920 843 988 ...
model<-aov(Weight~Fertilizer,data=dat)
summary(model)
##              Df  Sum Sq Mean Sq F value  Pr(>F)   
## Fertilizer    2  123394   61697     6.7 0.00176 **
## Residuals   117 1077397    9209                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(model)

model$residuals
##         1         2         3         4         5         6         7         8 
##  -21.7225 -174.0225  -76.6225 -153.7225   -9.0225   24.6775  -23.7225 -139.5225 
##         9        10        11        12        13        14        15        16 
##   62.4775  124.2775   62.9775  114.1775  -24.6225   53.3775   30.2775   89.1775 
##        17        18        19        20        21        22        23        24 
##   54.9775  -11.9225   57.6775   28.2775  -52.0225  174.5775  -42.0225  -33.2225 
##        25        26        27        28        29        30        31        32 
##   98.1775  -77.8225 -261.6225  -32.8225   71.5775   80.4775  -83.3225  136.1775 
##        33        34        35        36        37        38        39        40 
##  -73.3225  -85.9225   98.0775   -6.5225  -89.7225   14.7775  -32.9225  129.9775 
##        41        42        43        44        45        46        47        48 
##  -99.0750  -44.3750  -41.8750   62.0250  -92.1750   84.3250   38.3250  220.6250 
##        49        50        51        52        53        54        55        56 
##  109.3250   84.1250    9.5250   -8.7750  -66.4750  -14.7750 -134.2750 -204.8750 
##        57        58        59        60        61        62        63        64 
##  -80.9750   26.5250   -1.1750 -110.8750  176.3250 -100.3750  119.4250  -82.1750 
##        65        66        67        68        69        70        71        72 
##   86.7250   -0.5750   40.0250   69.7250  -72.9750  135.0250   -3.6750  -28.1750 
##        73        74        75        76        77        78        79        80 
##  -32.0750    3.6250   54.4250    8.9250 -178.1750  108.3250  -46.1750    6.7250 
##        81        82        83        84        85        86        87        88 
##   87.6100   10.3100   92.4100   32.7100  -90.9900  144.9100 -146.7900  -22.0900 
##        89        90        91        92        93        94        95        96 
##   85.2100  129.5100  -31.8900 -147.0900 -207.2900  -26.2900    8.9100   10.0100 
##        97        98        99       100       101       102       103       104 
##  -41.3900 -113.6900   53.4100   40.7100  197.8100  -56.4900  137.4100  185.8100 
##       105       106       107       108       109       110       111       112 
##  -16.1900  169.5100  -83.0900  -42.3900   50.9100 -179.0900 -127.4900    6.5100 
##       113       114       115       116       117       118       119       120 
##    9.7100  -74.8900  -27.7900   30.7100   38.9100   58.9100   -0.4900 -146.4900

Looks good - the diagnostic plots show roughly normal residuals with similar spread across fertilizers. The F test is significant, so which fertilizers are different?
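
If you want formal checks to go with the plots (not part of the lecture, just a sketch using base R tests): Shapiro-Wilk for normality of the residuals and Bartlett's test for equal variances.

shapiro.test(model$residuals)                   # H0: residuals are normally distributed
bartlett.test(Weight ~ Fertilizer, data = dat)  # H0: equal variances across fertilizers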

Use Tukey's HSD:
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Weight ~ Fertilizer, data = dat)
## 
## $Fertilizer
##         diff          lwr      upr     p adj
## 2-1  50.3525   -0.5858011 101.2908 0.0534220
## 3-1 -27.0325  -77.9708011  23.9058 0.4207520
## 3-2 -77.3850 -128.3233011 -26.4467 0.0013213
plot(TukeyHSD(model))

Now, let’s add water…

The two-way model is Y_ijk = u + alpha_i + beta_j + (alphabeta)_ij + epsilon_ijk, where u is the grand average, alpha is the fertilizer effect, beta is the water effect, alphabeta is the interaction, and epsilon is the error.

We assume epsilon is normally distributed. Now H0: (alphabeta)_ij = 0 for all i and j, i.e. no interaction. Ha: there is an interaction effect between fertilizer and water. Then test for alpha and beta individually (the main effects).

Get the data:

dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzerWater.csv")

Convert fertilizer and water to factors:

str(dat)
## 'data.frame':    120 obs. of  3 variables:
##  $ Fertilizer: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Water     : chr  "low" "low" "low" "low" ...
##  $ Weight    : num  975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
dat$Water<-as.factor(dat$Water)
levels(dat$Water)
## [1] "high" "low"
str(dat)
## 'data.frame':    120 obs. of  3 variables:
##  $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Water     : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Weight    : num  975 823 920 843 988 ...

To fit the two main effects (fertilizer and water) plus their interaction:

model<-aov(Weight~Fertilizer+Water+Fertilizer:Water,data=dat)
summary(model)
##                   Df  Sum Sq Mean Sq F value Pr(>F)  
## Fertilizer         2   20544   10272   0.709 0.4943  
## Water              1   57064   57064   3.938 0.0496 *
## Fertilizer:Water   2   99894   49947   3.447 0.0352 *
## Residuals        114 1651712   14489                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
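
Side note: in R's formula language, Fertilizer*Water is shorthand for Fertilizer + Water + Fertilizer:Water, so this fits exactly the same model (model2 is just a name chosen here):

model2 <- aov(Weight ~ Fertilizer*Water, data = dat)  # same model as the explicit formula above
summary(model2)                                        # identical ANOVA table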

Looks like there is an interaction between fertilizer and water, p-value of .035. Let's look at the interaction plot:

interaction.plot(dat$Fertilizer,dat$Water,dat$Weight)

Not much difference between high and low water for fertilizers 1 and 3, but a big difference for fertilizer 2. So if I have a lot of water, go with fertilizer 2; if it's dry, use fertilizer 1. Look for lines that are not parallel.
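
To put numbers behind the interaction plot, the cell means can be pulled from the fitted object; this is just a sketch using base R helpers:

model.tables(model, type = "means")                              # grand, main-effect, and cell means
aggregate(Weight ~ Fertilizer + Water, data = dat, FUN = mean)   # same cell means in long format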

New data set: a water effect, but no interaction.

dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzerWaterNoInteraction.csv")
str(dat)
## 'data.frame':    120 obs. of  3 variables:
##  $ Fertilizer: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Water     : chr  "low" "low" "low" "low" ...
##  $ Weight    : num  1129 870 951 1049 983 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
dat$Water<-as.factor(dat$Water)
levels(dat$Water)
## [1] "high" "low"
str(dat)
## 'data.frame':    120 obs. of  3 variables:
##  $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Water     : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Weight    : num  1129 870 951 1049 983 ...
model<-aov(Weight~Fertilizer+Water+Fertilizer:Water,data=dat)
summary(model)
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## Fertilizer         2  18232    9116   1.139 0.323885    
## Water              1 108222  108222  13.517 0.000362 ***
## Fertilizer:Water   2   2330    1165   0.146 0.864743    
## Residuals        114 912753    8007                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now the p-value for the interaction is .86, so no interaction. The plot?

interaction.plot(dat$Fertilizer,dat$Water,dat$Weight)

Much different - the lines are nearly parallel. Use high water regardless of which fertilizer. Let's take the interaction out and see if the main effects are significant:

model<-aov(Weight~Fertilizer+Water,data=dat)
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Fertilizer    2  18232    9116   1.156 0.318462    
## Water         1 108222  108222  13.719 0.000327 ***
## Residuals   116 915083    7889                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Weight ~ Fertilizer + Water, data = dat)
## 
## $Fertilizer
##       diff       lwr      upr     p adj
## 2-1  13.52 -33.63186 60.67186 0.7751579
## 3-1 -16.62 -63.77186 30.53186 0.6809594
## 3-2 -30.14 -77.29186 17.01186 0.2864403
## 
## $Water
##               diff       lwr       upr     p adj
## low-high -60.06167 -92.17926 -27.94407 0.0003267
plot(TukeyHSD(model))

What does Farmer Bob really do? The completely randomized design above is unrealistic. Completely randomized is good, but it is not always possible in practice. How do we quantify the quality of an experiment? Power.

#from the mean pattern Bob has seen
var(c(950,1000,1050))
## [1] 2500
power.anova.test(groups=3,n=NULL,
                  between.var=var(c(950,1000,1050)),
                  within.var=100^2,
                 sig.level=0.05,
                 power=.80)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 20.30205
##     between.var = 2500
##      within.var = 10000
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

What if the mean pattern changed - say two fertilizers at 950 and one at 1050 - so the between-group variance is larger?

var(c(950,950,1050))
## [1] 3333.333
power.anova.test(groups=3,n=NULL,
                  between.var=var(c(950,950,1050)),
                  within.var=100^2,
                 sig.level=0.05,
                 power=.80)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 15.49562
##     between.var = 3333.333
##      within.var = 10000
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

What if I wanted 90% power?

var(c(950,950,1050))
## [1] 3333.333
power.anova.test(groups=3,n=NULL,
                  between.var=var(c(950,950,1050)),
                  within.var=100^2,
                 sig.level=0.05,
                 power=.90)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 20.01728
##     between.var = 3333.333
##      within.var = 10000
##       sig.level = 0.05
##           power = 0.9
## 
## NOTE: n is number in each group

And what if the means were 950, 950, and 1000, closer together?

var(c(950,950,1000))
## [1] 833.3333
power.anova.test(groups=3,n=NULL,
                  between.var=var(c(950,950,1000)),
                  within.var=100^2,
                 sig.level=0.05,
                 power=.90)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 76.93183
##     between.var = 833.3333
##      within.var = 10000
##       sig.level = 0.05
##           power = 0.9
## 
## NOTE: n is number in each group
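
power.anova.test can also be run the other way around: fix the sample size and leave power as the unknown. A sketch with the original 950/1000/1050 pattern and an assumed 20 plots per fertilizer:

power.anova.test(groups = 3, n = 20,
                 between.var = var(c(950, 1000, 1050)),
                 within.var  = 100^2,
                 sig.level   = 0.05)   # power is left unspecified, so R solves for it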

If you're really doomed - not normal, not constant variance, and transformations don't work - use the Kruskal-Wallis test. Go back to the original data.

dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzer.csv")

str(dat)
## 'data.frame':    120 obs. of  2 variables:
##  $ Fertilizer: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight    : num  975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
str(dat)
## 'data.frame':    120 obs. of  2 variables:
##  $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight    : num  975 823 920 843 988 ...
model<-aov(Weight~Fertilizer,data=dat)
plot(model)

summary(model)
##              Df  Sum Sq Mean Sq F value  Pr(>F)   
## Fertilizer    2  123394   61697     6.7 0.00176 **
## Residuals   117 1077397    9209                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model$residuals
##         1         2         3         4         5         6         7         8 
##  -21.7225 -174.0225  -76.6225 -153.7225   -9.0225   24.6775  -23.7225 -139.5225 
##         9        10        11        12        13        14        15        16 
##   62.4775  124.2775   62.9775  114.1775  -24.6225   53.3775   30.2775   89.1775 
##        17        18        19        20        21        22        23        24 
##   54.9775  -11.9225   57.6775   28.2775  -52.0225  174.5775  -42.0225  -33.2225 
##        25        26        27        28        29        30        31        32 
##   98.1775  -77.8225 -261.6225  -32.8225   71.5775   80.4775  -83.3225  136.1775 
##        33        34        35        36        37        38        39        40 
##  -73.3225  -85.9225   98.0775   -6.5225  -89.7225   14.7775  -32.9225  129.9775 
##        41        42        43        44        45        46        47        48 
##  -99.0750  -44.3750  -41.8750   62.0250  -92.1750   84.3250   38.3250  220.6250 
##        49        50        51        52        53        54        55        56 
##  109.3250   84.1250    9.5250   -8.7750  -66.4750  -14.7750 -134.2750 -204.8750 
##        57        58        59        60        61        62        63        64 
##  -80.9750   26.5250   -1.1750 -110.8750  176.3250 -100.3750  119.4250  -82.1750 
##        65        66        67        68        69        70        71        72 
##   86.7250   -0.5750   40.0250   69.7250  -72.9750  135.0250   -3.6750  -28.1750 
##        73        74        75        76        77        78        79        80 
##  -32.0750    3.6250   54.4250    8.9250 -178.1750  108.3250  -46.1750    6.7250 
##        81        82        83        84        85        86        87        88 
##   87.6100   10.3100   92.4100   32.7100  -90.9900  144.9100 -146.7900  -22.0900 
##        89        90        91        92        93        94        95        96 
##   85.2100  129.5100  -31.8900 -147.0900 -207.2900  -26.2900    8.9100   10.0100 
##        97        98        99       100       101       102       103       104 
##  -41.3900 -113.6900   53.4100   40.7100  197.8100  -56.4900  137.4100  185.8100 
##       105       106       107       108       109       110       111       112 
##  -16.1900  169.5100  -83.0900  -42.3900   50.9100 -179.0900 -127.4900    6.5100 
##       113       114       115       116       117       118       119       120 
##    9.7100  -74.8900  -27.7900   30.7100   38.9100   58.9100   -0.4900 -146.4900

This is OK, but let's pretend it isn't. We need the boxcox function, which lives in the MASS package.

library(MASS)
boxcox(model)
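
The boxcox plot suggests a power transformation lambda for Weight. If you wanted to extract that lambda and refit, here is a sketch (model_bc is just a name chosen here):

bc <- boxcox(model, plotit = FALSE)      # grid of lambda values and their log-likelihoods
lambda <- bc$x[which.max(bc$y)]          # lambda with the highest log-likelihood
model_bc <- aov(Weight^lambda ~ Fertilizer, data = dat)  # refit on the transformed response
# (if lambda were near 0, log(Weight) would be the appropriate transform instead)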

kruskal.test(Weight~Fertilizer,data=dat)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Weight by Fertilizer
## Kruskal-Wallis chi-squared = 10.956, df = 2, p-value = 0.004177
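
The Kruskal-Wallis p-value of 0.004 again says at least one fertilizer differs. The lecture stops here, but if you wanted a nonparametric analogue of the Tukey step, one option (not covered in the lecture) is pairwise Wilcoxon rank-sum tests with a multiplicity correction:

pairwise.wilcox.test(dat$Weight, dat$Fertilizer, p.adjust.method = "holm")  # adjusted pairwise p-values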