Exercises: 15.22, 16.9, 17.10, 16.12, 16.17, 17.18, 18.15, 18.16, 18.19

Question 1 - 15.22

Part A

The experimental unit is a bag (or other container) of popcorn.

Part B

The experimental factors are cooking time and power setting. Cooking time has 3 levels (3, 4, and 5 minutes) while cooking power has 2 levels (low and high). We have \(3 \times 2 = 6\) factor-level combinations.

Part C

One blocking factor could be the voltage of the microwave. Another blocking factor could be the brand of popcorn.

Part D

Let’s assume we will look at two different microwave voltages and two brands of popcorn.

We will perform a randomized complete block design with r = 10 (or any replication count found to be adequate in previous studies). We will split our popcorn bags so that every factor-level-block combination receives an equal number of bags. In this instance we have \(3 \times 2 \times 2 \times 2 = 24\) different slots, with 10 popcorn bags to be popped for each (240 popcorn bags in total).

The randomization works as follows: within each of the 4 blocking-level combinations, we randomly assign 10 bags to each factor-level combination. For instance, for voltage 1 and brand 2, we randomly assign 60 popcorn bags so that 10 fall into each of the 6 factor-level combinations.
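A minimal sketch of this randomization in R (assuming two voltages and two brands as blocks; the seed and labels here are hypothetical):

set.seed(42)  # hypothetical seed, for reproducibility only
trts = expand.grid(time = c(3, 4, 5), power = c("low", "high"))      # 6 treatments
blocks = expand.grid(voltage = c("V1", "V2"), brand = c("B1", "B2")) # 4 blocks
plan = do.call(rbind, lapply(seq_len(nrow(blocks)), function(b) {
  # 60 bags per block: each treatment replicated 10 times, then dealt
  # to the block's bags in a random order
  reps = trts[rep(seq_len(nrow(trts)), each = 10), ]
  cbind(blocks[rep(b, 60), ], bag = sample(60), reps)
}))
head(plan)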

Question 2 - 16.9

Part A

# Read the recovery data (CH16PR09) and give the columns descriptive names
df = read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2016%20Data%20Sets/CH16PR09.txt")
df$Recover = df$V1
df$Fitness = df$V2
df$V1 <- df$V2 <- df$V3 <- NULL 
# Recode the numeric fitness levels with descriptive labels
df$Fitness[df$Fitness == 1] = "Below Average"
df$Fitness[df$Fitness == 2] = "Average"
df$Fitness[df$Fitness == 3] = "Above Average"

library(devtools)
## Warning: package 'devtools' was built under R version 3.4.2
library(easyGgplot2)
## Loading required package: ggplot2
library("ggplot2")
ggplot2.dotplot(data=df, xName='Fitness',yName='Recover', groupName='Fitness'
                ,legendPosition="top",addBoxplot=TRUE)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Both the means and the variability appear to differ across the groups. The below-average group shows especially high variability (a few extreme observations).

Part B

below.mean = mean(df$Recover[df$Fitness == "Below Average"])
average.mean = mean(df$Recover[df$Fitness == "Average"])
above.mean = mean(df$Recover[df$Fitness == "Above Average"])
below.mean
## [1] 38
average.mean
## [1] 32
above.mean
## [1] 24
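The same three means can be computed in one line with tapply:

tapply(df$Recover, df$Fitness, mean)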

Part C

df$Fitness = factor(df$Fitness)
av = aov(Recover~Fitness,data=df)
sum.resid = sum(av$residuals)
sum.resid/.Machine$double.eps
## [1] -26
round(sum.resid,10)
## [1] 0

The residuals sum to zero (numerically they are off from 0 by only 26 machine epsilons).

Part D

anova(av)
## Analysis of Variance Table
## 
## Response: Recover
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Fitness    2    672  336.00  16.962 4.129e-05 ***
## Residuals 21    416   19.81                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part E

Let \(\alpha = .01\). \[H_0: \mu_1 = \mu_2 = \mu_3\] \[H_1: \neg H_0 \]

qf(.99,2,21)
## [1] 5.780416

Since \(F^* = 16.962 > 5.780416 = F(.99,2,21)\) we reject the null hypothesis and conclude that at least two of the means are not the same.

Part F

1-pf(16.962,2,21) # Also given in anova table
## [1] 4.128573e-05

Since \(p_{value} < 0.01 = \alpha\), we can reject the null hypothesis.

The \(p_{value}\) represents the probability, given the null hypothesis, of observing a value more extreme than the one actually observed. We are willing to make a type I error with probability 0.01; here we see a probability of \(4.128573 \times 10^{-5}\) of being wrong, which is even lower. Thus we can reject the null.

Part G

The nature of the relationship between prior physical fitness and duration of therapy is that the healthier you were to begin with, the shorter the time you will spend in therapy. That is a negative relationship between fitness and duration of therapy, which is what you would expect.

Question 3 - 17.10

Part A

# Plot the three factor-level means against a dummy height of 1
x <- data.frame(c(below.mean, average.mean, above.mean), 1)
plot(x, type = 'b')

The left-most point is the above-average group, the middle point is the average group, and the right-most point is the below-average group. The plot suggests that prior physical fitness has a negative effect on recovery time.

Part B

n_blavg = length(df$Fitness[df$Fitness=="Below Average"]) 
n_avg = length(df$Fitness[df$Fitness=="Average"])
n_abvg = length(df$Fitness[df$Fitness=="Above Average"])
t = qt(.995, length(df$Fitness) - 3) # df = n_T - r = 24 - 3 = 21
MSE = 19.81 # From the ANOVA table in the last question
s_Y_avg = sqrt(MSE/n_avg)
average.mean+t*c(-s_Y_avg,s_Y_avg)
## [1] 28.01492 35.98508

That is, the 99% confidence interval is \(28.01492 < \mu_{average} < 35.98508\).
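As a cross-check (a sketch), the same interval comes from a cell-means fit, for which confint uses the pooled MSE with \(n_T - r = 21\) degrees of freedom automatically:

fit = lm(Recover ~ Fitness - 1, data = df) # cell-means parameterization
confint(fit, level = .99)["FitnessAverage", ]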

Part C

We will construct 95% Bonferroni joint confidence intervals for the two differences.

bonf = qt(1-.05/(2*2),length(df$Fitness)-3)
 
s_1 = sqrt(MSE*(1/n_avg+1/n_abvg))
s_2 =sqrt(MSE*(1/n_blavg+1/n_avg))

d_1 = average.mean- above.mean 
d_2 = below.mean-average.mean 

d_1+bonf*s_1*c(-1,1)
## [1]  2.452006 13.547994
d_2+bonf*s_2*c(-1,1)
## [1]  0.9038421 11.0961579

So \(S^2\{\hat{D}_1\} = MSE\left(\frac{1}{n_{avg}}+\frac{1}{n_{abv}}\right) = 19.81\left(\frac{1}{10}+\frac{1}{6}\right) = 5.282667\) and \(S^2\{\hat{D}_2\} = MSE\left(\frac{1}{n_{bel}}+\frac{1}{n_{avg}}\right) = 19.81\left(\frac{1}{8}+\frac{1}{10}\right) = 4.45725\).

We get the 95% Bonferroni intervals:

\[2.452006<D_1< 13.547994\] and \[0.9038421<D_2< 11.0961579\]

Since 0 is not in either of the family confidence intervals, it suggests that neither difference is 0.

Further, the two intervals jointly cover the true values of \(D_1\) and \(D_2\) with probability 0.95.

Part D

The Tukey procedure would use a different constant \(C\) in \(\hat{D} \pm C \cdot s\{\hat{D}\}\) for both \(D_1\) and \(D_2\).

The \(C\) for Bonferroni is:

qt(1-.05/(2*2),length(df$Fitness)-3)
## [1] 2.413845

The \(C\) for Tukey is:

tukey = qtukey(.95,3,length(df$Fitness)-3)/sqrt(2)
tukey
## [1] 2.52057

The Bonferroni constant is smaller, so it gives tighter intervals. Thus it is preferred.

Part E

For Bonferroni, \(B\) would change from \(t(1-\frac{.05}{2 \cdot 2},21)\) to \(t(1-\frac{.05}{2 \cdot 3},21)\), while the Tukey constant stays the same.

The new Bonferroni constant is:

qt(1-.05/(2*3),length(df$Fitness)-3)
## [1] 2.60135

This is bigger than the Tukey constant, so with three comparisons Tukey would now be preferred.

Part F

We will find the Tukey confidence intervals for all 3 differences:

d_3 = below.mean-above.mean
s_3 =sqrt(MSE*(1/n_blavg+1/n_abvg))
d_1+tukey*s_1*c(-1,1)
## [1]  2.206708 13.793292
d_2+tukey*s_2*c(-1,1)
## [1]  0.6785216 11.3214784
d_3+tukey*s_3*c(-1,1)
## [1]  7.94123 20.05877

Since none of these confidence intervals contains 0, we conclude that all three pairwise differences are significant.
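As a cross-check (a sketch), R's built-in Tukey HSD on the aov fit should reproduce these pairwise intervals:

TukeyHSD(av, conf.level = .95)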

Question 4 - 16.12

df = read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2016%20Data%20Sets/CH16PR12.txt")
df$Days = df$V1
df$Dist = as.factor(df$V2)
df$V1 <- df$V2 <- df$V3 <- NULL

Part A

library(devtools)
library(easyGgplot2)
library("ggplot2")
ggplot2.dotplot(data=df, xName='Dist',yName='Days', groupName='Dist'
                ,legendPosition="top",addBoxplot=TRUE)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

The factor-level means appear to differ. The variability looks roughly the same in each group, although the 5th group may have an outlier.

Part B

mean.1 = mean(df$Days[df$Dist == 1 ])
mean.2 = mean(df$Days[df$Dist == 2 ])
mean.3 = mean(df$Days[df$Dist == 3 ])
mean.4 = mean(df$Days[df$Dist == 4 ])
mean.5 = mean(df$Days[df$Dist == 5 ])
mean.1
## [1] 24.55
mean.2
## [1] 22.55
mean.3
## [1] 11.75
mean.4
## [1] 14.8
mean.5
## [1] 30.1

Part C

av = aov(Days~Dist,data=df)
sum.resid = sum(av$residuals)
sum.resid/.Machine$double.eps
## [1] 17.125
round(sum.resid,10)
## [1] 0

The residuals sum to zero, apart from numerical error.

Part D

anova(av)
## Analysis of Variance Table
## 
## Response: Days
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Dist       4 4430.1 1107.53  147.23 < 2.2e-16 ***
## Residuals 95  714.6    7.52                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part E

Set \(\alpha = 0.1\). We have these hypotheses: \[H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\] \[H_a: \neg H_0\]

qf(.9,4,95)
## [1] 2.004992

\(F^* = 147.23 > F(.9,4,95) = 2.004992\)

We thus reject the null hypothesis and conclude that \(\mu_i \neq \mu_j\) for some \(i \neq j, i,j \in \{1,2,3,4,5\}\)

Part F

Since the \(p_{value} < 2.2 \times 10 ^{-16} < .1 = \alpha\), we can reject the null hypothesis.

The \(p_{value}\) represents the probability, given the null, of observing a value more extreme than the one actually observed. We are willing to make a type I error with probability 0.1; here the probability of being wrong is below \(2.2 \times 10 ^{-16}\), far lower still. Thus we can reject the null.

Part G

Yes, there appears to be some variation in the time lapse among the five agents. However, there is not enough evidence to conclude that this variation is due to efficiency of operations: no variables in the study suggest a causal relationship of this nature. It could be due to different distances traveled for each distributor or some other factor not accounted for.

Question 5 - 16.17

Part A

\(\mu_\cdot = .25 \mu_1 +.2(\mu_2+\mu_3+\mu_4) + .15\mu_5\)

mu_dot = .25*mean.1 + .2*(mean.2+mean.3+mean.4) + .15*mean.5
mu_dot
## [1] 20.4725

Part B

\[H_0: \tau_1 = \tau_2 = \tau_3 =\tau_4 = \tau_5\] \[H_a : \neg H_0\] The statement would be the same regardless of how \(\mu_\cdot\) is defined. Since \(\tau_i = \mu_i - \mu_\cdot\), the choice of \(\mu_\cdot\) only subtracts the same constant from every \(\mu_i\), so it doesn’t affect the equality above; the equality relies only on the \(\mu_i\)'s.
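A small numeric illustration (a sketch using the means computed above): changing the weights that define \(\mu_\cdot\) shifts every \(\tau_i\) by the same constant, leaving the equality in \(H_0\) untouched.

means = c(mean.1, mean.2, mean.3, mean.4, mean.5)
tau.w = means - sum(c(.25, .2, .2, .2, .15)*means) # tau_i under the weighted mu.
tau.u = means - mean(means)                        # tau_i under the unweighted mu.
tau.w - tau.u # a constant vector: only a common shift separates the two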

Question 6 - 17.18

Part A

An unbiased estimator of \(L = \frac{\mu_1 +\mu_2}{2}-\frac{\mu_3 +\mu_4}{2}\) is \(\hat{L} = \sum_{i=1}^r c_i \bar{Y}_{i\cdot} = \frac{1}{2}\left[\bar{Y}_{1\cdot}+\bar{Y}_{2\cdot}-\bar{Y}_{3\cdot}-\bar{Y}_{4\cdot}\right]\).

L_hat = .5*mean.1 + .5*mean.2 - .5*(mean.3+mean.4) 
L_hat
## [1] 10.275

\(\hat{L} = 10.275\)

We will now construct a confidence interval for \(L\). An estimator of \(\sigma^2\{\hat{L}\}\) is \(S^2\{\hat{L}\} = MSE\sum_{i=1}^r \frac{c_i^2}{n_i}\).

From the ANOVA table in question 4 part D we get \(MSE = 7.52\), and from the description of the study we know \(n_i = 20\) for all \(i\) and \(c_i^2 = .5^2 = .25\) for \(i \in \{1,2,3,4\}\) (with \(c_5 = 0\)). The estimator becomes \(S^2\{\hat{L}\} = MSE\frac{4(.25)}{20} = \frac{7.52}{20} = 0.376\).

So \(S\{\hat{L}\}= 0.61318\)

Our critical value is \(t(.95,95)\):

t = qt(.95,95)

So we get:

s = 0.61318
L_hat+t*s*c(-1,1)
## [1]  9.256476 11.293524

The 90% Confidence interval for L is \((9.256476, 11.293524)\).

The interval procedure covers the true value of \(L\) 90% of the time.
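The same interval can be sketched directly from the general contrast formula:

MSE = 7.52 # from the Question 4 ANOVA table
cc = c(.5, .5, -.5, -.5, 0) # contrast coefficients
n_i = rep(20, 5)
s.L = sqrt(MSE*sum(cc^2/n_i)) # = 0.6131884, matching S{L-hat} above
L_hat + qt(.95, 95)*s.L*c(-1, 1)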

Part B

We will construct 90% Scheffé confidence intervals for the following contrasts: \[D_1 = \mu_1 - \mu_2\] \[D_2 = \mu_3 - \mu_4\]

\[L_1 = \frac{\mu_1+\mu_2}{2}-\mu_5\] \[L_2 = \frac{\mu_3+\mu_4}{2}-\mu_5\] \[L_3 = \frac{\mu_1 +\mu_2}{2}-\frac{\mu_3 +\mu_4}{2}\]

d1 = mean.1-mean.2
d2 = mean.3-mean.4
l1 = (mean.1+mean.2)/2-mean.5
l2 = (mean.3+mean.4)/2-mean.5
l3 = L_hat
d1
## [1] 2
d2
## [1] -3.05
l1
## [1] -6.55
l2
## [1] -16.825
l3
## [1] 10.275

\(S^2\{\hat{D}_1\} = S^2\{\hat{D}_2\} = MSE\left(\frac{1}{20}+\frac{1}{20}\right) = 0.752\) and \(S^2\{\hat{L}_1\} = S^2\{\hat{L}_2\} = MSE\left(\frac{(1/2)^2}{20}+\frac{(1/2)^2}{20}+\frac{1}{20}\right) = 0.564\); from the last section, \(S\{\hat{L}_3\}= 0.61318\).

s.dif = sqrt(0.752)
s.l1 = sqrt(0.564)
s.l3 = 0.61318
shaf = sqrt((5-1)*qf(.9,4,95)) # Scheffe multiplier sqrt((r-1)F(.9; r-1, 95)) with r = 5
d1+shaf*s.dif*c(-1,1)
## [1] -0.4558127  4.4558127
d2+shaf*s.dif*c(-1,1)
## [1] -5.5058127 -0.5941873
l1+shaf*s.l1*c(-1,1)
## [1] -8.676796 -4.423204
l2+shaf*s.l1*c(-1,1)
## [1] -18.95180 -14.69820
l3+shaf*s.l3*c(-1,1)
## [1]  8.538502 12.011498

The Scheffé confidence intervals are: \[-0.4558127 < D_1 < 4.4558127\] \[-5.5058127 <D_2< -0.5941873\] \[-8.676796 < L_1< -4.423204\] \[-18.95180< L_2 < -14.69820\] \[8.538502< L_3< 12.011498\] With family confidence 90%, all five intervals simultaneously contain their contrasts. Note that the interval for \(D_1\) contains 0, so that pairwise difference is not significant at the family level.

Part C

Question 7 - 18.15

df = read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2018%20Data%20Sets/CH18PR15.txt")
df$Shift = as.factor(df$V2)
df$Freq = df$V1
df$V1 <- df$V2 <- df$V3 <- NULL

Part A

mean.1 = mean(df$Freq[df$Shift == 1 ])
mean.2 = mean(df$Freq[df$Shift == 2 ])
mean.3 = mean(df$Freq[df$Shift == 3 ])
mean.4 = mean(df$Freq[df$Shift == 4 ])
mean.1
## [1] 3.9
mean.2
## [1] 1.15
mean.3
## [1] 2
mean.4
## [1] 3.4
av =aov(Freq~Shift,data=df)
residual = av$residuals
residual
##             1             2             3             4             5 
##  1.000000e-01 -9.000000e-01  1.100000e+00  1.000000e-01  2.100000e+00 
##             6             7             8             9            10 
## -9.000000e-01 -1.900000e+00  1.100000e+00  3.100000e+00 -2.900000e+00 
##            11            12            13            14            15 
## -1.900000e+00  1.100000e+00  1.000000e-01  3.100000e+00  1.000000e-01 
##            16            17            18            19            20 
##  1.100000e+00 -3.900000e+00  1.000000e-01 -2.900000e+00  2.100000e+00 
##            21            22            23            24            25 
## -1.150000e+00  8.500000e-01 -1.150000e+00  1.850000e+00  8.500000e-01 
##            26            27            28            29            30 
## -1.500000e-01 -1.150000e+00  1.850000e+00 -1.500000e-01 -1.150000e+00 
##            31            32            33            34            35 
## -1.150000e+00 -1.500000e-01 -1.500000e-01 -1.150000e+00 -1.500000e-01 
##            36            37            38            39            40 
##  1.850000e+00 -1.500000e-01  8.500000e-01  8.500000e-01 -1.150000e+00 
##            41            42            43            44            45 
##  1.543904e-16 -1.000000e+00 -2.000000e+00  1.000000e+00  2.000000e+00 
##            46            47            48            49            50 
## -1.000000e+00  1.000000e+00  2.000000e+00  1.543904e-16 -2.000000e+00 
##            51            52            53            54            55 
## -1.000000e+00  1.000000e+00  1.543904e-16  2.000000e+00 -2.000000e+00 
##            56            57            58            59            60 
## -1.000000e+00  1.000000e+00 -2.000000e+00  1.543904e-16  2.000000e+00 
##            61            62            63            64            65 
##  1.600000e+00 -1.400000e+00  6.000000e-01  6.000000e-01  2.600000e+00 
##            66            67            68            69            70 
##  1.600000e+00 -4.000000e-01  1.600000e+00  3.600000e+00 -4.000000e-01 
##            71            72            73            74            75 
## -2.400000e+00 -3.400000e+00 -1.400000e+00 -4.000000e-01 -4.000000e-01 
##            76            77            78            79            80 
##  6.000000e-01 -2.400000e+00  1.600000e+00 -1.400000e+00 -4.000000e-01

Part B

plot(av,1)

The residual spread does not look equal across the groups.

Part C

We will conduct a Brown-Forsythe test: \[H_0: \sigma_i = \sigma \;\forall i\] \[H_a: \sigma_i \neq \sigma_j \text{ for some } i\neq j\]

require("onewaytests")
## Loading required package: onewaytests
## Warning: package 'onewaytests' was built under R version 3.4.3
bf.test(Freq~Shift, df, alpha = .1, verbose = TRUE)
## 
##   Brown-Forsythe Test 
## --------------------------------------------------------- 
##   data : Freq and Shift 
## 
##   statistic  : 12.3149 
##   num df     : 3 
##   denom df   : 65.63471 
##   p.value    : 1.739806e-06 
## 
##   Result     : Difference is statistically significant. 
## ---------------------------------------------------------

The p-value of the test is \(p_{value} = 1.739806 \times 10^{-6} < 0.1 = \alpha\); we thus reject the null hypothesis of homoskedasticity. These results are consistent with part B.
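One caveat worth flagging: bf.test in onewaytests implements the Brown-Forsythe modified F test for equality of means under unequal variances, not the median-centered Levene test of constant variance that the textbook calls the Brown-Forsythe test. A sketch of the latter, via the car package (output not shown):

library(car)
leveneTest(Freq ~ Shift, data = df, center = median) # Levene test with medians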

Part D

# Within-group standard deviations (note the divisor n_i rather than n_i - 1)
s_1 = sqrt(sum((df$Freq[df$Shift == 1 ]-mean.1)^2)/(length(df$Freq[df$Shift == 1 ])))
s_2 = sqrt(sum((df$Freq[df$Shift == 2 ]-mean.2)^2)/(length(df$Freq[df$Shift == 2 ])))
s_3 = sqrt(sum((df$Freq[df$Shift == 3 ]-mean.3)^2)/(length(df$Freq[df$Shift == 3 ])))
s_4 = sqrt(sum((df$Freq[df$Shift == 4 ]-mean.4)^2)/(length(df$Freq[df$Shift == 4 ])))
s_1
## [1] 1.920937
s_2
## [1] 1.061838
s_3
## [1] 1.414214
s_4
## [1] 1.74356

\(S_1 = 1.920937\), \(S_2 = 1.061838\), \(S_3=1.414214\), \(S_4 = 1.74356\); \(\bar{Y}_{1\cdot} = 3.9\), \(\bar{Y}_{2\cdot} = 1.15\), \(\bar{Y}_{3\cdot} = 2\), \(\bar{Y}_{4\cdot}=3.4\)

s2y = c(s_1^2/mean.1,s_2^2/mean.2,s_3^2/mean.3,s_4^2/mean.4)
sy = c(s_1/mean.1,s_2/mean.2,s_3/mean.3,s_4/mean.4)
sy2 = c(s_1/mean.1^2,s_2/mean.2^2,s_3/mean.3^2,s_4/mean.4^2)
my_data = data.frame(s2y,sy,sy2)
row.names(my_data) = c(1,2,3,4)
my_data
##         s2y        sy       sy2
## 1 0.9461538 0.4925480 0.1262944
## 2 0.9804348 0.9233374 0.8029021
## 3 1.0000000 0.7071068 0.3535534
## 4 0.8941176 0.5128116 0.1508270

The most stable of these ratios is \(S^2_i/\bar{Y}_{i\cdot}\), which indicates the variance is roughly proportional to the mean, so we will use \(Y' = \sqrt{Y}\).

Part E

require(ALSM)
## Loading required package: ALSM
## Loading required package: leaps
## Loading required package: SuppDists
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.1
bxcx = boxcox.sse(df$Shift,df$Freq+1,l=seq(-1,1,.1))

bxcx$lambda[bxcx$SSE==min(bxcx$SSE)]
## [1] 0.5

The minimum SSE occurs at \(\lambda = 0.5\), making it a reasonable value for \(\lambda\) according to Box-Cox; this agrees with the square-root transformation chosen above.
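As an alternative sketch, MASS::boxcox profiles the log-likelihood over \(\lambda\) for the cell-means fit (again shifting the counts by +1 to handle zeros):

library(MASS)
boxcox(lm(Freq + 1 ~ Shift, data = df), lambda = seq(-1, 1, .1))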

Question 8 - 18.16

Part A

df$Freq.t = sqrt(df$Freq)
mean.1 = mean(df$Freq.t[df$Shift == 1 ])
mean.2 = mean(df$Freq.t[df$Shift == 2 ])
mean.3 = mean(df$Freq.t[df$Shift == 3 ])
mean.4 = mean(df$Freq.t[df$Shift == 4 ])
means.df = data.frame(c("1","2","3","4"),c(mean.1,mean.2,mean.3,mean.4))
colnames(means.df) = c("Group","Mean Values")
means.df
##   Group Mean Values
## 1     1   1.8713641
## 2     2   0.8426503
## 3     3   1.2292529
## 4     4   1.7471204
av = aov(Freq.t~Shift,data=df)
anova(av)
## Analysis of Variance Table
## 
## Response: Freq.t
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Shift      3 13.609  4.5362  10.294 9.172e-06 ***
## Residuals 76 33.489  0.4406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Part B

plot(av,1)

The residual variation looks fairly similar across the fitted values.

plot(av,2)

n = length(av$residuals)
se = sqrt(sum(av$residuals^2/n))
exp_values =  sapply(1:n, function(k) se * qnorm((k-.375)/(n+.25)))
cor(exp_values,sort(av$residuals))
## [1] 0.9637813

\(\hat{\rho} = 0.9637813\)

We will test for normality with \(\alpha = .05\). From table B.6 the critical value is .987; since \(\hat{\rho} = 0.9637813 < .987\), we reject normality.
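As a cross-check (a sketch), the Shapiro-Wilk test gives a formal p-value for the same normality question:

shapiro.test(av$residuals)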

Part C

We will conduct a Brown-Forsythe test: \[H_0: \sigma_i = \sigma \;\forall i\] \[H_a: \sigma_i \neq \sigma_j \text{ for some } i\neq j\]

bf.test(Freq.t~Shift,data=df, alpha = 0.1, na.rm = TRUE, verbose = TRUE)
## 
##   Brown-Forsythe Test 
## --------------------------------------------------------- 
##   data : Freq.t and Shift 
## 
##   statistic  : 10.29446 
##   num df     : 3 
##   denom df   : 74.83915 
##   p.value    : 9.436758e-06 
## 
##   Result     : Difference is statistically significant. 
## ---------------------------------------------------------

We thus reject the null hypothesis and conclude that we still do not have homoskedasticity. Note that this conflicts with the visual impression in part B, where the residual spread looked fairly similar, and the caveat about bf.test from Question 7 applies here as well.

Question 9 - 18.19

We will test with \(\alpha = 0.05\): \[H_0: \mu_i = \mu_j \;\forall i,j\] \[H_a:\exists i,j: \mu_i \neq \mu_j; i\neq j \] We have the full model \[Y_{ij} = \mu_1X_{ij1}+\mu_2X_{ij2}+\mu_3X_{ij3}+\mu_4X_{ij4} +\epsilon_{ij}\] and the reduced model \[Y_{ij} = \mu_c+\epsilon_{ij}\]

rm.f = lm(Freq~Shift-1,data=df) # full model: one mean per shift (lm, not MASS's robust rlm)
sse.f = sum(rm.f$residuals^2)
sse.f
## [1] 197.15
rm.r = lm(Freq~1,data=df) # reduced model: one common mean
sse.r = sum(rm.r$residuals^2)
sse.r
## [1] 292.9875
f = (sse.r-sse.f)/3 * 76/sse.f
f
## [1] 12.3149
qf(.95,3,76)
## [1] 2.724944

\(SSE_F = 197.15\) and \(SSE_R = 292.9875\), so \(F^* = \frac{(SSE_R-SSE_F)/(r-1)}{SSE_F/(n_T-r)} = 12.3149 > 2.724944 = F(.95; 3, 76)\).

We thus reject the null hypothesis and conclude that the full model is appropriate, i.e., not all the shift means are the same.
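As a cross-check (a sketch), anova() on the two nested fits should reproduce \(F^*\) and report its p-value directly:

anova(rm.r, rm.f) # reduced vs. full model comparison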