library(qcc)
## Warning: package 'qcc' was built under R version 4.3.3
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
library(stats)
Ok, lecture 8 in Tim Matis series on statistical analysis/DoE.
Analysis of variance. Developed by agriculture statisticians. Terms reflect this background. We could compare the fertilizer results pairwise, as we have done in the past. The problem with this is that when we look at 1-2, 2-3, 1-3 each of these two sample t tests has a probability of alpha=0.05 that we reject the null hypothesis when it’s true. Looking at the entire experiement we end up with alpha family, a type one error in any of the tests. This is 1-(1-alpha)3 = 14%, which is unacceptably high. The number of comparison is n!/(2!(n-2)!). If we were looking at 5 different fertilizers? This is 10 comparisons. Then the alpha family is 1- (0.95)^5 which is 40% not good.
The correct way to do this is called anova, or analysis of variance. Sounds like a hypothesis concerning variances, but in fact the hypothesis test is on the mean. It is a two step process. First, is there any difference between the fertilizers?
H0: u1 = u2 = u3, Ha: at least one of them is different.
If we reject H0 then which one is different. We’ll use Tukeys test which controls the familywise error rate.
Anova assumes that the populations being compared are all normal, and that they have the same variance. Now have to back up a lot to understand this. We have 3 treatments from which we will sample how much corn is produced. Take the same number of samples from each population - this is a balanced experiment. Plot observations and plot on liee. If H0 is true then all observations come from a single distribution, can estimate u with xbar..,, average of all measurements. If H0 is false then u1, u2 and u3 differ. We could estimate u1 = xbar1., u2 = xbar2., xbar3.. xbar.. is called the grand mean. The others are the populations means. R emember that we have assumed all the populations are normal. and have the same variances.
The sum of squares total = sum of squares treatment + sum of squares error. SST = SSTR + SSE. These are calculated the usual way, summing over the squares of the differences between the grand mean and the populations means. The SSTr uses the grand mean, SSE compares each observation to the relevant population mean. It’s easiest to generate an anova table:
source SS dof MS t
treatment SSTr I-1 MSTr=SSTr/(I-1) MSTr/MSE
error SSE I*J - I MSE= SSE/(I-1)
total SST I*J - 1
I = # treatments
J = # observations
The ratios of the MSTr/MSE is the t characteristic which is distributed as an F distribution. After picking a error level, alpha, this determines a critical value for the ratio. then test hypothesis using F(I-1, IJ-1).
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzer.csv") #read in data
str(dat) #fertilizer is read as integer
## 'data.frame': 120 obs. of 2 variables:
## $ Fertilizer: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : num 975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
str(dat)
## 'data.frame': 120 obs. of 2 variables:
## $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : num 975 823 920 843 988 ...
model<-aov(Weight~Fertilizer,data=dat)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 2 123394 61697 6.7 0.00176 **
## Residuals 117 1077397 9209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(model)
model$residuals
## 1 2 3 4 5 6 7 8
## -21.7225 -174.0225 -76.6225 -153.7225 -9.0225 24.6775 -23.7225 -139.5225
## 9 10 11 12 13 14 15 16
## 62.4775 124.2775 62.9775 114.1775 -24.6225 53.3775 30.2775 89.1775
## 17 18 19 20 21 22 23 24
## 54.9775 -11.9225 57.6775 28.2775 -52.0225 174.5775 -42.0225 -33.2225
## 25 26 27 28 29 30 31 32
## 98.1775 -77.8225 -261.6225 -32.8225 71.5775 80.4775 -83.3225 136.1775
## 33 34 35 36 37 38 39 40
## -73.3225 -85.9225 98.0775 -6.5225 -89.7225 14.7775 -32.9225 129.9775
## 41 42 43 44 45 46 47 48
## -99.0750 -44.3750 -41.8750 62.0250 -92.1750 84.3250 38.3250 220.6250
## 49 50 51 52 53 54 55 56
## 109.3250 84.1250 9.5250 -8.7750 -66.4750 -14.7750 -134.2750 -204.8750
## 57 58 59 60 61 62 63 64
## -80.9750 26.5250 -1.1750 -110.8750 176.3250 -100.3750 119.4250 -82.1750
## 65 66 67 68 69 70 71 72
## 86.7250 -0.5750 40.0250 69.7250 -72.9750 135.0250 -3.6750 -28.1750
## 73 74 75 76 77 78 79 80
## -32.0750 3.6250 54.4250 8.9250 -178.1750 108.3250 -46.1750 6.7250
## 81 82 83 84 85 86 87 88
## 87.6100 10.3100 92.4100 32.7100 -90.9900 144.9100 -146.7900 -22.0900
## 89 90 91 92 93 94 95 96
## 85.2100 129.5100 -31.8900 -147.0900 -207.2900 -26.2900 8.9100 10.0100
## 97 98 99 100 101 102 103 104
## -41.3900 -113.6900 53.4100 40.7100 197.8100 -56.4900 137.4100 185.8100
## 105 106 107 108 109 110 111 112
## -16.1900 169.5100 -83.0900 -42.3900 50.9100 -179.0900 -127.4900 6.5100
## 113 114 115 116 117 118 119 120
## 9.7100 -74.8900 -27.7900 30.7100 38.9100 58.9100 -0.4900 -146.4900
Looks good. They all look good. Which ones are different?
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Weight ~ Fertilizer, data = dat)
##
## $Fertilizer
## diff lwr upr p adj
## 2-1 50.3525 -0.5858011 101.2908 0.0534220
## 3-1 -27.0325 -77.9708011 23.9058 0.4207520
## 3-2 -77.3850 -128.3233011 -26.4467 0.0013213
plot(TukeyHSD(model))
Now, let’s add water…
We assume epsilon is normally distributed. Now H0: alphabetaij =0, no interaction. Ha: there is an effect of the interaction between fertilizer and water. Then test for alpha and beta individually (main effects).
Get the data:
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzerWater.csv")
convert fertilizer and water to factors.
str(dat)
## 'data.frame': 120 obs. of 3 variables:
## $ Fertilizer: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Water : chr "low" "low" "low" "low" ...
## $ Weight : num 975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
dat$Water<-as.factor(dat$Water)
levels(dat$Water)
## [1] "high" "low"
str(dat)
## 'data.frame': 120 obs. of 3 variables:
## $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Water : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
## $ Weight : num 975 823 920 843 988 ...
to get the three factors - main water and fertilizer plus interaction
model<-aov(Weight~Fertilizer+Water+Fertilizer:Water,data=dat)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 2 20544 10272 0.709 0.4943
## Water 1 57064 57064 3.938 0.0496 *
## Fertilizer:Water 2 99894 49947 3.447 0.0352 *
## Residuals 114 1651712 14489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks like there is an interaction between fertilizer and water, p value of .032. Let’s look at plot
interaction.plot(dat$Fertilizer,dat$Water,dat$Weight)
Not a lot of differences between high water and low water for fertilizers 1 and 3, but a big difference for fertilizer 2. So, if I have a lot of water, go with fertilizer 2. If it’s dry, use fertilizer 1. Look for lines that are not parallel.
New data set, water but no interaction
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzerWaterNoInteraction.csv")
str(dat)
## 'data.frame': 120 obs. of 3 variables:
## $ Fertilizer: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Water : chr "low" "low" "low" "low" ...
## $ Weight : num 1129 870 951 1049 983 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
levels(dat$Fertilizer)
## [1] "1" "2" "3"
dat$Water<-as.factor(dat$Water)
levels(dat$Water)
## [1] "high" "low"
str(dat)
## 'data.frame': 120 obs. of 3 variables:
## $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Water : Factor w/ 2 levels "high","low": 2 2 2 2 2 2 2 2 2 2 ...
## $ Weight : num 1129 870 951 1049 983 ...
model<-aov(Weight~Fertilizer+Water+Fertilizer:Water,data=dat)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 2 18232 9116 1.139 0.323885
## Water 1 108222 108222 13.517 0.000362 ***
## Fertilizer:Water 2 2330 1165 0.146 0.864743
## Residuals 114 912753 8007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the p value for the interaction is .8, so no interaction. The plot?
interaction.plot(dat$Fertilizer,dat$Water,dat$Weight)
Much different. Use water regardless of which fertilizer. Let’s take the interaction out and see if main effects are significant:
model<-aov(Weight~Fertilizer+Water,data=dat)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 2 18232 9116 1.156 0.318462
## Water 1 108222 108222 13.719 0.000327 ***
## Residuals 116 915083 7889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Weight ~ Fertilizer + Water, data = dat)
##
## $Fertilizer
## diff lwr upr p adj
## 2-1 13.52 -33.63186 60.67186 0.7751579
## 3-1 -16.62 -63.77186 30.53186 0.6809594
## 3-2 -30.14 -77.29186 17.01186 0.2864403
##
## $Water
## diff lwr upr p adj
## low-high -60.06167 -92.17926 -27.94407 0.0003267
plot(TukeyHSD(model))
What does Farmer Bob really do. The completely randomized model above is unrealistic. Much easier to do this:
Completely randomized is good, but not always possible. How do we quantify this, quality of experiment? Power.
#from the mean pattern Bob has seen
var(c(950,1000,1050))
## [1] 2500
power.anova.test(groups=3,n=NULL,
between.var=var(c(950,1000,1050)),
within.var=100^2,
sig.level=0.05,
power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 20.30205
## between.var = 2500
## within.var = 10000
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
What if the variances were closer?
var(c(950,950,1050))
## [1] 3333.333
power.anova.test(groups=3,n=NULL,
between.var=var(c(950,950,1050)),
within.var=100^2,
sig.level=0.05,
power=.80)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 15.49562
## between.var = 3333.333
## within.var = 10000
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
What if I wanted 90% certainty?
var(c(950,1000,1050))
## [1] 2500
power.anova.test(groups=3,n=NULL,
between.var=var(c(950,950,1050)),
within.var=100^2,
sig.level=0.05,
power=.90)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 20.01728
## between.var = 3333.333
## within.var = 10000
## sig.level = 0.05
## power = 0.9
##
## NOTE: n is number in each group
and if the variances were 950,950,1000?
var(c(950,1000,1050))
## [1] 2500
power.anova.test(groups=3,n=NULL,
between.var=var(c(950,950,1000)),
within.var=100^2,
sig.level=0.05,
power=.90)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 76.93183
## between.var = 833.3333
## within.var = 10000
## sig.level = 0.05
## power = 0.9
##
## NOTE: n is number in each group
If you’re really doomed - not normal, not constant variance, transformations don’t work, use Kruskal Wallace. Go back to original data.
dat<-read.csv("C:/Users/rgale/Downloads/DoEstuff/FarmerBobFertilzer.csv")
str(dat)
## 'data.frame': 120 obs. of 2 variables:
## $ Fertilizer: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : num 975 823 920 843 988 ...
dat$Fertilizer<-as.factor(dat$Fertilizer)
str(dat)
## 'data.frame': 120 obs. of 2 variables:
## $ Fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : num 975 823 920 843 988 ...
model<-aov(Weight~Fertilizer,data=dat)
plot(model)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 2 123394 61697 6.7 0.00176 **
## Residuals 117 1077397 9209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model$residuals
## 1 2 3 4 5 6 7 8
## -21.7225 -174.0225 -76.6225 -153.7225 -9.0225 24.6775 -23.7225 -139.5225
## 9 10 11 12 13 14 15 16
## 62.4775 124.2775 62.9775 114.1775 -24.6225 53.3775 30.2775 89.1775
## 17 18 19 20 21 22 23 24
## 54.9775 -11.9225 57.6775 28.2775 -52.0225 174.5775 -42.0225 -33.2225
## 25 26 27 28 29 30 31 32
## 98.1775 -77.8225 -261.6225 -32.8225 71.5775 80.4775 -83.3225 136.1775
## 33 34 35 36 37 38 39 40
## -73.3225 -85.9225 98.0775 -6.5225 -89.7225 14.7775 -32.9225 129.9775
## 41 42 43 44 45 46 47 48
## -99.0750 -44.3750 -41.8750 62.0250 -92.1750 84.3250 38.3250 220.6250
## 49 50 51 52 53 54 55 56
## 109.3250 84.1250 9.5250 -8.7750 -66.4750 -14.7750 -134.2750 -204.8750
## 57 58 59 60 61 62 63 64
## -80.9750 26.5250 -1.1750 -110.8750 176.3250 -100.3750 119.4250 -82.1750
## 65 66 67 68 69 70 71 72
## 86.7250 -0.5750 40.0250 69.7250 -72.9750 135.0250 -3.6750 -28.1750
## 73 74 75 76 77 78 79 80
## -32.0750 3.6250 54.4250 8.9250 -178.1750 108.3250 -46.1750 6.7250
## 81 82 83 84 85 86 87 88
## 87.6100 10.3100 92.4100 32.7100 -90.9900 144.9100 -146.7900 -22.0900
## 89 90 91 92 93 94 95 96
## 85.2100 129.5100 -31.8900 -147.0900 -207.2900 -26.2900 8.9100 10.0100
## 97 98 99 100 101 102 103 104
## -41.3900 -113.6900 53.4100 40.7100 197.8100 -56.4900 137.4100 185.8100
## 105 106 107 108 109 110 111 112
## -16.1900 169.5100 -83.0900 -42.3900 50.9100 -179.0900 -127.4900 6.5100
## 113 114 115 116 117 118 119 120
## 9.7100 -74.8900 -27.7900 30.7100 38.9100 58.9100 -0.4900 -146.4900
This is ok, but let’s pretend it isn’t. Need library MASS which is in the package MASS
library(MASS)
boxcox(model)
kruskal.test(Weight~Fertilizer,data=dat)
##
## Kruskal-Wallis rank sum test
##
## data: Weight by Fertilizer
## Kruskal-Wallis chi-squared = 10.956, df = 2, p-value = 0.004177