Exercises: 15.22, 16.9, 17.10, 16.12, 16.17, 17.18, 18.15, 18.16, 18.19
The experimental unit is a bag (or other package) of popcorn.
The experimental factors are cooking time and power setting. Cooking time has 3 levels (3, 4, and 5 minutes) and cooking power has 2 levels (low and high), giving \(3 \times 2 = 6\) factor-level combinations.
One blocking factor could be the microwave's voltage; another could be the brand of popcorn.
Let's assume we look at two different microwave voltages and two brands of popcorn.
We will perform a randomized complete block design with r = 10 replicates per cell (or whatever number prior studies suggest is adequate). We will split the popcorn bags so that every factor-level-by-block combination receives the same number of bags. In this instance we have \(3 \times 2 \times 2 \times 2 = 24\) different slots, with 10 popcorn bags to be popped for each (240 popcorn bags in total).
The randomization works as follows: within each of the 4 blocking-level combinations, we randomly assign 10 bags to each factor-level combination. For instance, for voltage 1 and brand 2 of popcorn, we randomly assign 60 popcorn bags so that 10 fall into each of the 6 factor-level combinations.
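As a concrete sketch, this assignment could be generated in R as follows; the voltage/brand labels ("V1", "V2", "B1", "B2") and the seed are illustrative assumptions, not part of the exercise.
set.seed(1)  # arbitrary seed, only so the assignment is reproducible
# 6 treatment combinations and 4 blocking combinations
treatments <- expand.grid(time = c(3, 4, 5), power = c("low", "high"))
blocks <- expand.grid(voltage = c("V1", "V2"), brand = c("B1", "B2"))
design <- do.call(rbind, lapply(seq_len(nrow(blocks)), function(b) {
  # within each block, 60 bags are split at random, 10 per treatment
  slots <- treatments[rep(seq_len(nrow(treatments)), each = 10), ]
  slots <- slots[sample(nrow(slots)), ]  # random run order within the block
  cbind(blocks[rep(b, nrow(slots)), ], slots, bag = seq_len(nrow(slots)))
}))
head(design)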
df = read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2016%20Data%20Sets/CH16PR09.txt")
df$Recover = df$V1
df$Fitness = df$V2
df$V1 <- df$V2 <- df$V3 <- NULL
df$Fitness[df$Fitness ==1] = "Below Average"
df$Fitness[df$Fitness ==2] = "Average"
df$Fitness[df$Fitness ==3] = "Above Average"
library(devtools)
## Warning: package 'devtools' was built under R version 3.4.2
library(easyGgplot2)
## Loading required package: ggplot2
library("ggplot2")
ggplot2.dotplot(data=df, xName='Fitness',yName='Recover', groupName='Fitness'
,legendPosition="top",addBoxplot=TRUE)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Both the means and the variability appear to differ across the groups. The below-average group shows especially high variability (a few extreme observations).
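A quick numerical check of this impression (a sketch; just the per-group means and standard deviations):
# group means and SDs back up the visual impression from the dotplot
tapply(df$Recover, df$Fitness, mean)
tapply(df$Recover, df$Fitness, sd)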
below.mean = mean(df$Recover[df$Fitness == "Below Average"])
average.mean = mean(df$Recover[df$Fitness == "Average"])
above.mean = mean(df$Recover[df$Fitness == "Above Average"])
below.mean
## [1] 38
average.mean
## [1] 32
above.mean
## [1] 24
df$Fitness = factor(df$Fitness)
av = aov(Recover~Fitness,data=df)
sum.resid = sum(av$residuals)
sum.resid/.Machine$double.eps
## [1] -26
round(sum.resid,10)
## [1] 0
The residuals sum to zero (numerically they differ from 0 by only about 26 machine epsilons).
anova(av)
## Analysis of Variance Table
##
## Response: Recover
## Df Sum Sq Mean Sq F value Pr(>F)
## Fitness 2 672 336.00 16.962 4.129e-05 ***
## Residuals 21 416 19.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let \(\alpha = 0.01\). We test \[H_0: \mu_1 = \mu_2 = \mu_3\] \[H_a: \neg H_0 \]
qf(.99,2,21)
## [1] 5.780416
Since \(F^* = 16.962 > 5.780416 = F(.99,2,21)\) we reject the null hypothesis and conclude that at least two of the means are not the same.
1-pf(16.962,2,21) # Also given in anova table
## [1] 4.128573e-05
Since \(p = 4.128573 \times 10^{-5} < 0.01 = \alpha\), we can reject the null hypothesis.
The \(p\)-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one obtained. We are willing to make a Type I error with probability 0.01; here the \(p\)-value of \(4.128573 \times 10^{-5}\) is even smaller, so we can reject the null.
## Part G
The nature of the relationship between prior physical fitness and duration of therapy is that the healthier you were to begin with, the less time you spend in therapy. That is, there is a negative relationship between fitness and duration of therapy, which is what you would expect.
x <- data.frame(c(below.mean,average.mean,above.mean),1) ## 1 is your "height"
plot(x, type = 'b')
The left-most point is above average, the middle point is average, and the right-most point is below average. The plot suggests that better prior physical fitness is associated with shorter recovery time.
n_avg = length(df$Fitness[df$Fitness=="Average"])
t = qt(.995, length(df$Fitness) - 3)  # MSE has n_T - r = 24 - 3 = 21 degrees of freedom
MSE = 19.81 # From ANOVA table from the last question
s_Y_avg = sqrt(MSE/n_avg)
average.mean+t*c(-s_Y_avg,s_Y_avg)
## [1] 28.01492 35.98508
That is, the 99% confidence interval is \(28.01492 < \mu_{average} < 35.98508\).
We will construct 95% Bonferroni joint confidence intervals.
bonf = qt(1-.05/(2*2),length(df$Fitness)-3)
d_1 = average.mean-above.mean
d_2 = below.mean-average.mean
n_1 = length(df$Fitness[df$Fitness=="Below Average"])
n_2 = length(df$Fitness[df$Fitness=="Average"])
n_3 = length(df$Fitness[df$Fitness=="Above Average"])
s_1 = sqrt(MSE*(1/n_2+1/n_3))
s_2 =sqrt(MSE*(1/n_1+1/n_2))
d_1+bonf*s_1*c(-1,1)
## [1] 2.452006 13.547994
d_2+bonf*s_2*c(-1,1)
## [1] 0.9038421 11.0961579
So \(s^2\{\hat{D}_1\} = MSE\left(\frac{1}{n_2}+\frac{1}{n_3}\right) = 5.282667\) and \(s^2\{\hat{D}_2\} = MSE\left(\frac{1}{n_1}+\frac{1}{n_2}\right) = 4.45725\).
With 95% family confidence, the Bonferroni intervals are \(2.452006 < D_1 < 13.547994\) and \(0.9038421 < D_2 < 11.0961579\).
Since 0 is not in either of the family confidence intervals, it suggests that neither difference is 0.
The Tukey procedure would use a different constant \(C\) in \(\hat{D} \pm C \cdot s\{\hat{D}\}\) for both \(D_1\) and \(D_2\).
The \(C\) for Bonferroni is:
qt(1-.05/(2*2),length(df$Fitness)-3)
## [1] 2.413845
The \(C\) for Tukey is:
tukey = qtukey(.95,3,length(df$Fitness)-3)/sqrt(2)
tukey
## [1] 2.52057
This is larger than the Bonferroni multiple, so for these two comparisons the Bonferroni procedure is preferred (it gives narrower intervals).
If all three pairwise comparisons were of interest, the Bonferroni multiple would change from \(t(1-\frac{.05}{2 \cdot 2}; 21)\) to \(t(1-\frac{.05}{2 \cdot 3}; 21)\), while the Tukey multiple would stay the same.
The new Bonferroni multiple is
qt(1-.05/(2*3),length(df$Fitness)-3)
## [1] 2.60135
This is larger than the Tukey multiple, so Tukey would now be preferred.
## Part F - NOT SURE
We will find the Tukey confidence intervals for all 3 differences:
d_3 = below.mean-above.mean
s_3 =sqrt(MSE*(1/n_1+1/n_3))
d_1+tukey*s_1*c(-1,1)
## [1] 2.206708 13.793292
d_2+tukey*s_2*c(-1,1)
## [1] 0.6785216 11.3214784
d_3+tukey*s_3*c(-1,1)
## [1] 7.94123 20.05877
So the Tukey intervals are \(2.206708 < D_1 < 13.793292\), \(0.6785216 < D_2 < 11.3214784\), and \(7.94123 < D_3 < 20.05877\).
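As a sanity check, these hand-computed intervals can be compared with R's built-in Tukey procedure (a sketch; `TukeyHSD` reports each pairwise difference in its own ordering, so some signs may be flipped relative to \(D_1\), \(D_2\), \(D_3\) above):
# built-in Tukey HSD intervals for the same aov fit
TukeyHSD(av, conf.level = 0.95)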
df = read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2016%20Data%20Sets/CH16PR12.txt")
df$Days = df$V1
df$Dist = as.factor(df$V2)
df$V1 <- df$V2 <- df$V3 <- NULL
library(devtools)
library(easyGgplot2)
library("ggplot2")
ggplot2.dotplot(data=df, xName='Dist',yName='Days', groupName='Dist'
,legendPosition="top",addBoxplot=TRUE)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The factor level means appear to differ, while the variability looks roughly the same across the groups.
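A quick check of the equal-spread impression (a sketch; only the per-group sample standard deviations):
# similar SDs across centers support the constant-variance assumption
aggregate(Days ~ Dist, data = df, FUN = sd)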
mean.1 = mean(df$Days[df$Dist == 1 ])
mean.2 = mean(df$Days[df$Dist == 2 ])
mean.3 = mean(df$Days[df$Dist == 3 ])
mean.4 = mean(df$Days[df$Dist == 4 ])
mean.5 = mean(df$Days[df$Dist == 5 ])
mean.1
## [1] 24.55
mean.2
## [1] 22.55
mean.3
## [1] 11.75
mean.4
## [1] 14.8
mean.5
## [1] 30.1
av = aov(Days~Dist,data=df)
sum.resid = sum(av$residuals)
sum.resid/.Machine$double.eps
## [1] 17.125
round(sum.resid,10)
## [1] 0
The residuals sum to zero, up to numerical error.
anova(av)
## Analysis of Variance Table
##
## Response: Days
## Df Sum Sq Mean Sq F value Pr(>F)
## Dist 4 4430.1 1107.53 147.23 < 2.2e-16 ***
## Residuals 95 714.6 7.52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Set \(\alpha = 0.1\). We test the hypotheses \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\] \[H_a: \neg H_0\]
qf(.9,4,95)
## [1] 2.004992
Since \(F^* = 147.23 > F(.9; 4, 95) = 2.004992\),
we reject the null hypothesis and conclude that \(\mu_i \neq \mu_j\) for some \(i \neq j\), \(i, j \in \{1,2,3,4,5\}\).
Since the \(p\)-value satisfies \(p < 2.2 \times 10^{-16} < 0.1 = \alpha\), we can reject the null hypothesis.
The \(p\)-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one obtained. We are willing to make a Type I error with probability 0.1; here the \(p\)-value is below \(2.2 \times 10^{-16}\), far smaller, so we can reject the null.
There appears to be some variation in the lapse time among the five agents. However, there is not enough evidence to conclude that this variation is due to efficiency of operations; it could be due to different distances traveled for each distributor or some other factor not accounted for.