Introduction
We have a dataset with measurements of wing length as the response variable and hawk species as the explanatory variable. We test whether the data meet the assumptions of Single Factor ANOVA, and then adjust or transform the data if they do not.
Summary
The model fit for the original data is $Y_{ij} = \mu_{\cdot} + \gamma_i + \varepsilon_{ij}$, where $\sum_i \gamma_i = 0$.
After running diagnostics, the data does not meet the assumptions of Single Factor ANOVA. It fails the Shapiro-Wilk test for normality, with a p-value of $1.4307987 \times 10^{-7}$, which is lower than any reasonable alpha, but it does pass the Brown-Forsythe test for constant variance with a p-value of 0.0991.
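These p-values come from the Shapiro-Wilk test on the model residuals and from the Brown-Forsythe (median-centered Levene) test across species. A minimal sketch of those checks, mirroring the code in the appendix and assuming the NewHawk data frame has columns Wing and Species:

library(car)                                               # for leveneTest()
fit = lm(Wing ~ Species, data = NewHawk)                   # single factor ANOVA model
ei = fit$residuals
shapiro.test(ei)                                           # Shapiro-Wilk test of normality
leveneTest(ei ~ Species, data = NewHawk, center = median)  # Brown-Forsythe test of equal variance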
Diagnostics
After testing for outliers using a t-cutoff with alpha = 0.05, we found three.
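The outlier rule is the usual semi-studentized residual cutoff; with $n_T$ the total sample size, $a$ the number of species, and $MSE$ from the original model,

$$e_{ij}^{*} = \frac{e_{ij}}{\sqrt{MSE}}, \qquad \text{flag } Y_{ij} \text{ as an outlier if } |e_{ij}^{*}| > t_{1-\alpha/(2n_T);\, n_T - a}.$$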
After removing the outliers, the assumption of normality is still violated. The Shapiro-Wilk test on the data with outliers removed has a p-value of 0.0040214, which is lower than any reasonable alpha.
We decided to transform the data with a Box-Cox transformation, choosing lambda to maximize the p-value of the Shapiro-Wilk test.
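For a chosen $\lambda \neq 0$, the Box-Cox transformation applied to each wing length is

$$Y^{(\lambda)} = \frac{Y^{\lambda} - 1}{\lambda},$$

with $\lambda$ selected by the boxcox() routine using the Shapiro-Wilk objective, as in the appendix code.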
After the transformation, the assumptions of normality and constant variance are met. The Shapiro-Wilk test for the transformed data resulted in a p-value of 0.0117899 and the Brown-Forsythe test resulted in a p-value of 0.0893, both of which are larger than alpha = 0.01. However, this transformation did lower the p-value for the Brown-Forsythe test. We will carry on with this data because the assumptions of constant variance and normality are met.
To be thorough, we also tested normality and constant variance after applying the Box-Cox transformation to the original data (outliers included). The assumption of normality is not met: the Shapiro-Wilk test results in a p-value of $3.4758865 \times 10^{-7}$, which is less than any reasonable alpha.
Discussion
We removed three outliers and transformed the data using the Box-Cox transformation, which resulted in the data meeting the assumptions of constant variance and normality; however, this holds only if alpha = 0.01, since the Shapiro-Wilk p-value (0.0118) is only just above 0.01 and the Brown-Forsythe p-value (0.0893) is also fairly low.
Additionally, while transforming the data does allow us to perform ANOVA tests, it can make inferences about the original data more difficult.
We decided to continue with the transformed data because it was the only version of the data that met the assumptions of ANOVA.
Introduction
We were given a dataset containing Beck scores for patients who received either a long or a short treatment, and who are further grouped by drug-use history: never used drugs, previously used drugs, or recently used drugs. We are going to test whether there are any interaction effects between the history of drug use and the length of treatment on Beck scores. We will test this through Two Factor ANOVA hypothesis tests, as well as by examining partial $R^2$ values to get a better idea of how much each factor contributes to the model.
Furthermore, we will test whether there are any significant differences in mean Beck scores between groups using confidence intervals.
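As a rough sketch of how these comparisons are carried out (the appendix applies them to the data after removing an outlier, and the model names here are only illustrative), the interaction and no-interaction models are fit with lm() and compared with nested F tests via anova():

full = lm(Beck ~ Drug*Treatment, data = Scores)      # model with interaction effects
add = lm(Beck ~ Drug + Treatment, data = Scores)     # no-interaction (additive) model
anova(add, full)                                     # F test for interaction (AB) effects
anova(lm(Beck ~ Drug, data = Scores), add)           # F test for Treatment (Factor B) effects
anova(lm(Beck ~ Treatment, data = Scores), add)      # F test for Drug (Factor A) effects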
Summary
The group means of the data, with Factor A as the history of drug use and Factor B as the length of treatment, are:
## $A
## Never Previous Recent
## 18.06742 19.39286 23.67470
##
## $B
## Long Short
## 17.09259 21.86986
##
## $AB
## Never Previous Recent
## Long 15.83871 15.90000 21.00000
## Short 19.25862 21.33333 24.17143
The sizes of the groups are:
## $A
## Never Previous Recent
## 89 28 83
##
## $B
## Long Short
## 54 146
##
## $AB
## Never Previous Recent
## Long 31 10 13
## Short 58 18 70
The standard deviations of the groups are:
## $A
## Never Previous Recent
## 9.367252 8.517256 10.014755
##
## $B
## Long Short
## 8.600721 9.993283
##
## $AB
## Never Previous Recent
## Long 9.455498 6.887186 6.770032
## Short 9.179612 8.884885 10.470070
The groups appear to have roughly equal means of around 20. The groups appear to be approximately normally distributed, but they may not have constant variance. There appears to be one outlier in the Never/Long group.
Diagnostics
We removed the outlier from the Never/Long group. This one observation makes up only about 3% of the Never/Long group and less than 1% of the overall data.
With the outlier removed, the data meet the assumptions of Two Factor ANOVA: the Shapiro-Wilk test p-value is 0.0490472 and the Brown-Forsythe test p-value is 0.3632602, both of which are larger than alpha = 0.01, so the assumptions of normality and constant variance are satisfied.
Analysis and Interpretation
The best model is the no-interaction model, with both Factor A and Factor B effects:
$$Y_{ijk} = \mu_{\cdot\cdot} + \gamma_i + \delta_j + \varepsilon_{ijk}$$
We tested for interaction effects using an F test, which had a p-value of 0.8853328, which is larger than alpha = 0.01, so we conclude that there are no significant interaction effects. The test for Factor B effects has a p-value of 0.0073423 and the test for Factor A effects has a p-value of 0.0022854, both of which are lower than alpha = 0.01, so we conclude that both Factors are significant.
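Each of these is a nested-model (general linear) F test; with $SSE_R, df_R$ from the reduced model and $SSE_F, df_F$ from the larger model, the statistic computed by anova() is

$$F^{*} = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}.$$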
Additionally, the partial $R^2$ for Factor A effects is 0.0362777, and for Factor B effects it is 0.0604661. These are, respectively, the proportional reduction in error from adding Factor A to a model with only Factor B, and from adding Factor B to a model with only Factor A.
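These values are the relative drop in SSE when the extra factor is added to the smaller model:

$$R^{2}_{A \mid B} = \frac{SSE(B) - SSE(A,B)}{SSE(B)}, \qquad R^{2}_{B \mid A} = \frac{SSE(A) - SSE(A,B)}{SSE(A)}.$$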
The estimated parameters are
## $Mu..
## [1] 19.40828
##
## $Gam
## Never Previous Recent
## -2.2460891 -0.6044762 2.8505654
##
## $Del
## Long Short
## -2.061677 2.061677
##
## $GamDel
## Long Short
## Never 0 0
## Previous 0 0
## Recent 0 0
where the $\gamma_i$ sum to 0 and the $\delta_j$ sum to 0.
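Each interval below has the form estimate $\pm$ multiplier $\times$ SE, where the estimate is a linear combination of group means with contrast coefficients $c_i$, the multiplier is a Bonferroni or Tukey value at overall $\alpha = 0.01$, and, as in the scary.CI function in the appendix (with equal.weights = FALSE),

$$SE = \sqrt{MSE \sum_i \frac{c_i^{2}}{n_i}},$$

with $n_i$ the corresponding group sample sizes.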
## (1)Long+(-1)Short lower bound upper bound
## -5.209486 -9.079478 -1.339493
We are overall 99% confident that the true average Beck score for longer-treatment patients is lower than that for shorter-treatment patients by about 1.34 to 9.08 points, ignoring the drug-use effect.
## (1)Never+(-1)Previous lower bound upper bound
## -1.574675 -7.509307 4.359956
We are overall 99% confident that there is no significant difference between the true average Beck score for those that never used drugs and those that used drugs previously, ignoring treatment time.
## (1)Never+(-1)Recent lower bound upper bound
## -5.856517 -10.041587 -1.671446
We are overall 99% confident that the true average Beck score for those that never used drugs is lower than for those that used drugs recently by about 1.67 to 10.04 points, ignoring length of treatment.
## (1)Previous+(-1)Recent lower bound upper bound
## -4.281842 -10.259465 1.695782
We are overall 99% confident that there is no significant difference in the true average Beck score between those that used drugs previously and those that used drugs recently, ignoring length of treatment.
## someAB lower bound upper bound
## -4.829310 -10.808797 1.150176
We are overall 99% confident that there is no significant difference in the true average Beck score between those that never used drugs and those that previously used drugs, when each group is averaged over the long and short treatment lengths.
## someAB lower bound upper bound
## -3.698358 -8.657492 1.260776
We are overall 99% confident that there is no significant difference in the true average Beck score between those that never used drugs and those that used drugs recently, when each group is averaged over the short and long treatment lengths, since this interval contains zero.
Conclusion
After examining the data, it appeared that the different drug-use groups had about the same mean Beck score, with the never-used group possibly being somewhat lower, and the short and long treatment groups also appeared to have similar means. We tested for interaction effects and for Factor A and Factor B effects to find the best model. There are no significant interactions between drug use and length of treatment, and both factors are statistically significant, so we used the no-interaction model with Factor A and Factor B effects.
After constructing various confidence intervals, we can infer that the true average Beck score for patients with the longer treatment is significantly lower than for those with the shorter treatment. Additionally, the true average Beck score for those that have never used drugs is significantly lower than for those that have recently used drugs; this is seen in the pairwise interval that ignored treatment length, although the corresponding contrast interval, in which each drug-use group is averaged over the short and long treatment lengths, did not show a significant difference at the 99% level.
library(readr)
NewHawk <- read.csv("~/Downloads/NewHawk.csv")
the.model1 = lm(Wing ~ Species, data = NewHawk)   # single factor ANOVA model for the original data
ei = the.model1$residuals
qqnorm(the.model1$residuals)                      # QQ plot of the residuals
qqline(the.model1$residuals)
the.SWtest = shapiro.test(ei)                     # Shapiro-Wilk test of normality
p.val.sw.1 = the.SWtest[2]
boxplot(Wing ~ Species, data = NewHawk, horizontal = TRUE)
plot(the.model1$fitted.values, the.model1$residuals, main = "Errors vs. Groups",xlab = "Groups",ylab = "Errors")
abline(h = 0,col = "purple")
library(car)                                      # for leveneTest()
the.BFtest = leveneTest(ei ~ Species, data = NewHawk, center = median)   # Brown-Forsythe test
p.val.bf.1 = the.BFtest[[3]][1]
# Outlier check using semi-studentized residuals and a t cutoff
NewHawk$ei = the.model1$residuals
nt1 = nrow(NewHawk)                               # total sample size
a1 = length(unique(NewHawk$Species))              # number of species (groups)
SSE1 = sum(NewHawk$ei^2)
MSE1 = SSE1/(nt1-a1)
eij.star1 = the.model1$residuals/sqrt(MSE1)       # semi-studentized residuals
alphahawk = 0.05
t.cutoff= qt(1-alphahawk/(2*nt1), nt1-a1)         # outlier cutoff
CO.eij = which(abs(eij.star1) > t.cutoff)         # indices of flagged observations
outliers1 = CO.eij
new.data1 = NewHawk[-outliers1,]                  # data with outliers removed
hawk.model1 = lm(Wing ~ Species, data = new.data1)   # refit after removing outliers
ei2 = hawk.model1$residuals
the.SWtest.2 = shapiro.test(ei2)
p.val.sw.2 = the.SWtest.2[2]
qqnorm(hawk.model1$residuals)
qqline(hawk.model1$residuals)
the.BFtest.2 = leveneTest(ei2 ~ Species, data = new.data1, center = median)
p.val.bf.2 = the.BFtest.2[[3]][1]
plot(hawk.model1$fitted.values, hawk.model1$residuals, main = "Errors vs. Groups",xlab = "Groups",ylab = "Errors")
abline(h = 0,col = "purple")
library(EnvStats)                                 # for boxcox()
# Box-Cox transformation, with lambda chosen to maximize the Shapiro-Wilk objective
L2 = boxcox(hawk.model1 ,objective.name = "Shapiro-Wilk",optimize = TRUE)$lambda
YT = (new.data1$Wing^(L2)-1)/L2                   # transformed response
t.data = data.frame(Wing = YT, Species = new.data1$Species)
t.model = lm(Wing ~ Species, data = t.data)       # model on the transformed (outlier-removed) data
ei3 = t.model$residuals
the.SWtest.3 = shapiro.test(ei3)
p.val.sw.3 = the.SWtest.3[2]
qqnorm(t.model$residuals)
qqline(t.model$residuals)
the.BFtest.3 = leveneTest(ei3 ~ Species, data = t.data, center = median)
p.val.bf.3 = the.BFtest.3[[3]][1]
plot(t.model$fitted.values, t.model$residuals, main = "Errors vs. Groups",xlab = "Groups",ylab = "Errors")
abline(h = 0,col = "purple")
library(EnvStats)
# Box-Cox transformation applied to the original data (outliers included)
L1 = boxcox(the.model1 ,objective.name = "Shapiro-Wilk",optimize = TRUE)$lambda
YT1 = (NewHawk$Wing^(L1)-1)/L1
t.data1 = data.frame(Wing = YT1, Species = NewHawk$Species)
t.model1 = lm(Wing ~ Species, data = t.data1)
ei4 = t.model1$residuals
the.SWtest.4 = shapiro.test(ei4)
p.val.sw.4 = the.SWtest.4[2]
the.BFtest.4 = leveneTest(ei4 ~ Species, data = t.data1, center = median)
p.val.bf.4 = the.BFtest.4[[3]][1]
Scores <- read.csv("~/Downloads/Scores.csv")
# Helper: apply fun.name to the response by Factor A, Factor B, and each AB cell
find.means = function(the.data,fun.name = mean){
a = length(unique(the.data[,2]))
b = length(unique(the.data[,3]))
means.A = by(the.data[,1], the.data[,2], fun.name)
means.B = by(the.data[,1],the.data[,3],fun.name)
means.AB = by(the.data[,1],list(the.data[,2],the.data[,3]),fun.name)
MAB = matrix(means.AB,nrow = b, ncol = a, byrow = TRUE)
colnames(MAB) = names(means.A)
rownames(MAB) = names(means.B)
MA = as.numeric(means.A)
names(MA) = names(means.A)
MB = as.numeric(means.B)
names(MB) = names(means.B)
results = list(A = MA, B = MB, AB = MAB)
return(results)
}
the.means = find.means(Scores,mean)
the.sizes = find.means(Scores,length)
the.sds = find.means(Scores,sd)
the.means
the.sizes
the.sds
boxplot(Beck ~ Drug*Treatment, data = Scores)
CO3 = which(Scores$Drug=="Never" & Scores$Treatment=="Long" & Scores$Beck > 36)  # index of the Never/Long outlier
outliers = CO3
the.data = Scores[-outliers,]                # data with the outlier removed
names(the.data) = c("Y","A","B")             # Y = Beck, A = Drug, B = Treatment
AB = lm(Y ~ A*B,the.data)      # interaction model
A.B = lm(Y ~ A + B,the.data)   # no-interaction (additive) model
A = lm(Y ~ A,the.data)         # Factor A (drug history) only
B = lm(Y ~ B,the.data)         # Factor B (treatment length) only
N = lm(Y ~ 1, the.data)        # null model
ei5 = AB$residuals
the.BFtest5 = leveneTest(ei5~ A*B, data=the.data, center=median)
p.val.bf.5 = the.BFtest5[[3]][1]
the.SWtest.5 = shapiro.test(ei5)
p.val.sw.5 = the.SWtest.5[2]
interaction.test = anova(A.B, AB)[2,6]   # p-value for AB interaction effects
factorb.test = anova(A, A.B)[2,6]        # p-value for Factor B effects
factora.test = anova(B, A.B)[2,6]        # p-value for Factor A effects
# Helper: partial R^2, the proportional reduction in SSE from small.model to big.model
Partial.R2 = function(small.model,big.model){
SSE1 = sum(small.model$residuals^2)
SSE2 = sum(big.model$residuals^2)
PR2 = (SSE1 - SSE2)/SSE1
return(PR2)
}
a.r2 = Partial.R2(A, A.B)
b.r2 = Partial.R2(B, A.B)
nt = nrow(the.data)
a = length(unique(the.data[,2]))
b = length(unique(the.data[,3]))
# Helper: estimate mu.., the gamma_i, delta_j, and interaction terms from a fitted model
get.gamma.delta = function(the.model,the.data){
nt = nrow(the.data)
a = length(unique(the.data[,2]))
b = length(unique(the.data[,3]))
the.data$hat = the.model$fitted.values
the.ns = find.means(the.data,length)
a.vals = sort(unique(the.data[,2]))
b.vals= sort(unique(the.data[,3]))
muij = matrix(nrow = a, ncol = b)
rownames(muij) = a.vals
colnames(muij) = b.vals
for(i in 1:a){
for(j in 1:b){
muij[i,j] = the.data$hat[which(the.data[,2] == a.vals[i] & the.data[,3] == b.vals[j])[1]]
}
}
mi. = rowMeans(muij)
m.j = colMeans(muij)
mu.. = sum(muij)/(a*b)
gammai = mi. - mu..
deltaj = m.j - mu..
gmat = matrix(rep(gammai,b),nrow = a, ncol = b, byrow= FALSE)
dmat = matrix(rep(deltaj,a),nrow = a, ncol = b,byrow=TRUE)
gamma.deltaij =round(muij -(mu.. + gmat + dmat),8)
results = list(Mu.. = mu.., Gam = gammai, Del = deltaj, GamDel = gamma.deltaij)
return(results)
}
get.gamma.delta(A.B, the.data)
SSE = sum(A.B$residuals^2)                # SSE for the no-interaction model
dfSSE = nt - (a+b-1)                      # error df for the no-interaction model
MSE = SSE/(dfSSE)
# Helper: Bonferroni, Tukey, and Scheffe multipliers for g intervals on group "A", "B", or "AB"
find.mult = function(alpha,a,b,dfSSE,g,group){
if(group == "A"){
Tuk = round(qtukey(1-alpha,a,dfSSE)/sqrt(2),3)
Bon = round(qt(1-alpha/(2*g), dfSSE ) ,3)
Sch = round(sqrt((a-1)*qf(1-alpha, a-1, dfSSE)),3)
}else if(group == "B"){
Tuk = round(qtukey(1-alpha,b,dfSSE)/sqrt(2),3)
Bon = round(qt(1-alpha/(2*g), dfSSE ) ,3)
Sch = round(sqrt((b-1)*qf(1-alpha, b-1, dfSSE)),3)
}else if(group == "AB"){
Tuk = round(qtukey(1-alpha,a*b,dfSSE)/sqrt(2),3)
Bon = round(qt(1-alpha/(2*g), dfSSE ) ,3)
Sch = round(sqrt((a*b-1)*qf(1-alpha, a*b-1, dfSSE)),3)
}
results = c(Bon, Tuk,Sch)
names(results) = c("Bonferroni","Tukey","Scheffe")
return(results)
}
# Helper: confidence interval for a contrast of Factor A, Factor B, or cell means
scary.CI = function(the.data,MSE,equal.weights = TRUE,multiplier,group,cs){
if(sum(cs) != 0 & sum(cs !=0 ) != 1){
return("Error - you did not input a valid contrast")
}else{
the.means = find.means(the.data)
the.ns =find.means(the.data,length)
nt = nrow(the.data)
a = length(unique(the.data[,2]))
b = length(unique(the.data[,3]))
if(group =="A"){
if(equal.weights == TRUE){
a.means = rowMeans(the.means$AB)
est = sum(a.means*cs)
mul = rowSums(1/the.ns$AB)
SE = sqrt(MSE/b^2 * (sum(cs^2*mul)))
N = names(a.means)[cs!=0]
CS = paste("(",cs[cs!=0],")",sep = "")
fancy = paste(paste(CS,N,sep =""),collapse = "+")
names(est) = fancy
} else{
a.means = the.means$A
est = sum(a.means*cs)
SE = sqrt(MSE*sum(cs^2*(1/the.ns$A)))
N = names(a.means)[cs!=0]
CS = paste("(",cs[cs!=0],")",sep = "")
fancy = paste(paste(CS,N,sep =""),collapse = "+")
names(est) = fancy
}
}else if(group == "B"){
if(equal.weights == TRUE){
b.means = colMeans(the.means$AB)
est = sum(b.means*cs)
mul = colSums(1/the.ns$AB)
SE = sqrt(MSE/a^2 * (sum(cs^2*mul)))
N = names(b.means)[cs!=0]
CS = paste("(",cs[cs!=0],")",sep = "")
fancy = paste(paste(CS,N,sep =""),collapse = "+")
names(est) = fancy
} else{
b.means = the.means$B
est = sum(b.means*cs)
SE = sqrt(MSE*sum(cs^2*(1/the.ns$B)))
N = names(b.means)[cs!=0]
CS = paste("(",cs[cs!=0],")",sep = "")
fancy = paste(paste(CS,N,sep =""),collapse = "+")
names(est) = fancy
}
} else if(group == "AB"){
est = sum(cs*the.means$AB)
SE = sqrt(MSE*sum(cs^2/the.ns$AB))
names(est) = "someAB"
}
the.CI = est + c(-1,1)*multiplier*SE
results = c(est,the.CI)
names(results) = c(names(est),"lower bound","upper bound")
return(results)
}
}
bon.b = find.mult(0.01, a, b, dfSSE, 1, "B")[1]    # Bonferroni multiplier for one Factor B interval
b.cs.1 = c(1, -1)                                  # Long vs. Short
scary.CI(the.data, MSE, equal.weights = FALSE, bon.b, "B", b.cs.1)
tuk.a = find.mult(0.01, a, b, dfSSE, 3, "A")[2]    # Tukey multiplier for all pairwise Factor A intervals
a.cs.1 = c(1, -1, 0)                               # Never vs. Previous
a.cs.2 = c(1, 0, -1)                               # Never vs. Recent
a.cs.3 = c(0, 1, -1)                               # Previous vs. Recent
scary.CI(the.data, MSE, equal.weights = FALSE, tuk.a, "A", a.cs.1)
scary.CI(the.data, MSE, equal.weights = FALSE, tuk.a, "A", a.cs.2)
scary.CI(the.data, MSE, equal.weights = FALSE, tuk.a, "A", a.cs.3)
a2 = length(unique(the.data[,3]))
b2 = length(unique(the.data[,2]))
bon.ab = find.mult(0.01, a, b, dfSSE, 2, "AB")[1]
AB.cs.2 = matrix(0,nrow = a2, ncol = b2)
AB.cs.2[1,1] = 1/2
AB.cs.2[1,2] = 1/2
AB.cs.2[2,1] = -1/2
AB.cs.2[2,2] = -1/2
scary.CI(the.data, MSE, equal.weights = FALSE, bon.ab, "AB", AB.cs.2)
AB.cs.3 = matrix(0,nrow = a2, ncol = b2)
AB.cs.3[1,1] = 1/2
AB.cs.3[1,3] = 1/2
AB.cs.3[2,1] = -1/2
AB.cs.3[2,3] = -1/2
scary.CI(the.data, MSE, equal.weights = FALSE, bon.ab, "AB", AB.cs.3)