##Kevin Kuipers (Completed by myself) ##November 13, 2018

##Problem 1

  1. Consider the alpha dataset from the coin package. Compare the results when using glht() and TukeyHSD (Refer to Chap 5 review for TukeyHSD).
library(lme4)
library(coin)
library(multcomp)
library(HSAUR3)
library(sandwich)
data(alpha)

a <- data.frame(
  length = abbreviate(alpha$alength),
  level = alpha$elevel
)

##General Linear Hypothesis & TukeyHSD

Using the aov() command I will fit a model and compare the glht() results to TukeyHSD(). In both cases the formula is expression level explained by allele length.

aov_model <- aov(level ~ length, data=a)
alpha_glht <- glht(aov_model, linfct = mcp(length='Tukey'))
alpha_tukey <- TukeyHSD(aov_model)

#Summary of General Linear Hypotheses

summary(alpha_glht)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = level ~ length, data = a)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)  
## long - intr == 0   0.7546     0.4579   1.648   0.2270  
## shrt - intr == 0  -0.4342     0.3836  -1.132   0.4924  
## shrt - long == 0  -1.1888     0.5203  -2.285   0.0614 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

#Summary of TukeyHSD

alpha_tukey
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = level ~ length, data = a)
## 
## $length
##                 diff       lwr        upr     p adj
## long-intr  0.7545977 -0.335752 1.84494741 0.2307995
## shrt-intr -0.4341523 -1.347742 0.47943766 0.4970962
## shrt-long -1.1887500 -2.427675 0.05017513 0.0628589
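Since TukeyHSD() reports confidence intervals while summary(alpha_glht) reports adjusted p values, a small extra sketch of my own (not part of the textbook code) pulls simultaneous confidence intervals from the glht object for a direct numeric comparison with the lwr/upr columns above.

#Simultaneous 95% confidence intervals from the glht fit,
#directly comparable to the TukeyHSD intervals above
confint(alpha_glht)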

#Confidence Interval Plots

The top plot represents the 95% family-wise confidence level of the general linear hypotheses and the bottom plot represents the 95% family-wise confidence level for the TukeyHSD comparisons.

layout(matrix(2:1, ncol=1))
plot(alpha_tukey)
plot(alpha_glht, main="95% Family-wise confidence level for General Linear Hypotheses")

coef(alpha_glht)
## long - intr shrt - intr shrt - long 
##   0.7545977  -0.4341523  -1.1887500
vcov(alpha_glht)
##             long - intr shrt - intr shrt - long
## long - intr  0.20963611  0.04307591  -0.1665602
## shrt - intr  0.04307591  0.14717604   0.1041001
## shrt - long -0.16656020  0.10410012   0.2706603
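As a quick sanity check (my own addition, reusing the alpha_glht object from above), the standard errors reported in the glht summary can be recovered from the diagonal of this covariance matrix:

#Square roots of the diagonal of vcov(alpha_glht) reproduce the
#Std. Error column of summary(alpha_glht)
sqrt(diag(vcov(alpha_glht)))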

##Problem 2

  1. Consider the clouds data from the HSAUR3 package.

##Problem 2 A)

a. Read and write a report (no longer than one page) 
on the clouds data given in Chapter 15 section 15.3.3 from Handbook Ed 3.
data(clouds)
?clouds

The clouds data come from an interesting study on weather modification. The study focuses on clouds, or storm systems, and how they interact with inorganic or organic materials, asking whether certain factors increase rainfall. The experiment took place in 1975 and explored whether or not silver iodide (100 to 1000 grams per cloud) used in cloud seeding increased rainfall. The study covered 24 days judged suitable on the basis of a measurement called the SNE (suitability criterion), and the question was whether the iodide caused an increase in rainfall.

During the experiment several variables were recorded: seeding, time, SNE, cloudcover, prewetness, echomotion, and rainfall.

Seeding: is a categorical variable, yes if the seeding action occurred and no otherwise

Time: is the number of days after the first day of the experiment

SNE: is the suitability criterion

cloudcover: was measured using radar and tracked the percentage of cloud cover in the experimental area

prewetness: is the total rainfall in the target area one hour before seeding, measured in cubic meters times 1e+8

echomotion: is a categorical variable documenting the radar echo as moving or stationary

rainfall: is the amount of rain in cubic meters times 1e+8

It has been noted that rainfall and the SNE values have a dependency, so let's look at some graphs. There will be two scatterplots, one without seeding and one with seeding, showing rainfall explained by the SNE values. Each graph contains a best-fit regression line with confidence bands. The code comes from A Handbook of Statistical Analyses Using R, 3rd Edition, page 294.

#Function to plot rainfall against the S-Ne criterion for a subset of the
#data, with the fitted line and simultaneous confidence bands
confband <- function(subset, main) {
  mod <- lm(rainfall ~ sne, data=clouds, subset=subset)
  sne_grid <- seq(from = 1.5, to = 4.5, by = 0.25)
  K <- cbind(1, sne_grid)
  sne_ci <- confint(glht(mod, linfct = K))
  plot(rainfall ~ sne, data=clouds, subset = subset,
       xlab='S-Ne criterion', main=main,
       xlim=range(clouds$sne),
       ylim=range(clouds$rainfall))
  abline(mod)
  lines(sne_grid, sne_ci$confint[,2], lty=2)
  lines(sne_grid, sne_ci$confint[,3], lty=2)
}

layout(matrix(1:2, ncol=2))
confband(clouds$seeding=='no', main='No Seeding')
confband(clouds$seeding=='yes', main='Seeding')

It appears that the no-seeding category has a weaker relationship between rainfall and the SNE values. In contrast, the seeding category seems to have a stronger relationship between rainfall and the SNE values. The confidence bands are helpful because they can be used to pinpoint where the greatest certainty and uncertainty lie for rainfall. For example, in the seeding graph the greatest uncertainty about rainfall occurs when the SNE values are very low. Right around an SNE value of 3.2 the confidence band is narrowest, giving the greatest amount of certainty for explaining rainfall. As SNE values increase further, the uncertainty also increases, meaning there is a much wider range for the confidence intervals explaining rainfall.
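To make the claim about the band being narrowest near an SNE value of 3.2 more concrete, here is a small sketch of my own (the object names mod_yes, K_grid, and ci_yes are mine) that reuses the same model and grid as confband() and computes the width of the simultaneous band for the seeded days at each grid point:

#Width of the simultaneous confidence band for the seeded days,
#using the same model and S-Ne grid as confband() above
mod_yes <- lm(rainfall ~ sne, data = clouds, subset = clouds$seeding == 'yes')
sne_grid <- seq(from = 1.5, to = 4.5, by = 0.25)
K_grid <- cbind(1, sne_grid)
ci_yes <- confint(glht(mod_yes, linfct = K_grid))
#Band width (upper minus lower limit) at each S-Ne grid value
data.frame(sne = sne_grid, width = ci_yes$confint[, 3] - ci_yes$confint[, 2])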

##Problem 2 B)

b. Consider the linear model fitted to the clouds data 
as summarised in Chapter 6, Figure 6.5. Set up a matrix K 
corresponding to the global null hypothesis that all interaction 
terms present in the model are zero. 
Test both the global hypothesis and all hypotheses corresponding 
to each of the interaction terms. 

#Linear Model

I will fit the linear model found in A Handbook of Statistical Analyses Using R, 3rd Edition, page 110. Rainfall is explained by seeding, by the interactions of seeding with SNE, cloudcover, prewetness, and echomotion, and by time.

clouds_formula <- rainfall ~ seeding + seeding:(sne + cloudcover + prewetness + echomotion) + time

clouds_lm <- lm(clouds_formula, data=clouds)
summary(clouds_lm)
## 
## Call:
## lm(formula = clouds_formula, data = clouds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5259 -1.1486 -0.2704  1.0401  4.3913 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     -0.34624    2.78773  -0.124  0.90306   
## seedingyes                      15.68293    4.44627   3.527  0.00372 **
## time                            -0.04497    0.02505  -1.795  0.09590 . 
## seedingno:sne                    0.41981    0.84453   0.497  0.62742   
## seedingyes:sne                  -2.77738    0.92837  -2.992  0.01040 * 
## seedingno:cloudcover             0.38786    0.21786   1.780  0.09839 . 
## seedingyes:cloudcover           -0.09839    0.11029  -0.892  0.38854   
## seedingno:prewetness             4.10834    3.60101   1.141  0.27450   
## seedingyes:prewetness            1.55127    2.69287   0.576  0.57441   
## seedingno:echomotionstationary   3.15281    1.93253   1.631  0.12677   
## seedingyes:echomotionstationary  2.59060    1.81726   1.426  0.17757   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.205 on 13 degrees of freedom
## Multiple R-squared:  0.7158, Adjusted R-squared:  0.4972 
## F-statistic: 3.274 on 10 and 13 DF,  p-value: 0.02431

#General Linear Hypotheses

Now I will test the coefficients using the general linear hypotheses method with the same formula as the linear regression model. The K matrix sets up the hypotheses that each coefficient other than the intercept, including every interaction term, is zero.

#creating the K matrix and grabbing the coefficient names
K <- cbind(0, diag(length(coef(clouds_lm)) - 1))
rownames(K) <- names(coef(clouds_lm))[-1]

#fitting the aov model
clouds_aov <- aov(clouds_formula, data=clouds)
#fitting the glht model with K matrix
clouds_glht <- glht(clouds_aov, linfct=K)
#printing summary 
summary(clouds_glht)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: aov(formula = clouds_formula, data = clouds)
## 
## Linear Hypotheses:
##                                      Estimate Std. Error t value Pr(>|t|)
## seedingyes == 0                      15.68293    4.44627   3.527   0.0295
## time == 0                            -0.04497    0.02505  -1.795   0.5004
## seedingno:sne == 0                    0.41981    0.84453   0.497   0.9992
## seedingyes:sne == 0                  -2.77738    0.92837  -2.992   0.0773
## seedingno:cloudcover == 0             0.38786    0.21786   1.780   0.5096
## seedingyes:cloudcover == 0           -0.09839    0.11029  -0.892   0.9656
## seedingno:prewetness == 0             4.10834    3.60101   1.141   0.8856
## seedingyes:prewetness == 0            1.55127    2.69287   0.576   0.9978
## seedingno:echomotionstationary == 0   3.15281    1.93253   1.631   0.6032
## seedingyes:echomotionstationary == 0  2.59060    1.81726   1.426   0.7331
##                                       
## seedingyes == 0                      *
## time == 0                             
## seedingno:sne == 0                    
## seedingyes:sne == 0                  .
## seedingno:cloudcover == 0             
## seedingyes:cloudcover == 0            
## seedingno:prewetness == 0             
## seedingyes:prewetness == 0            
## seedingno:echomotionstationary == 0   
## seedingyes:echomotionstationary == 0  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
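The problem also asks for a test of the global null hypothesis. One way to do this (a sketch of my own, using multcomp's joint-test option and the K matrix defined above; K_int is my own name) is to request an F test in summary(), both for all coefficients in K and for just the interaction rows:

#Global F test that every coefficient in K is simultaneously zero
summary(clouds_glht, test = Ftest())

#Restricting K to the interaction rows (names containing ':') gives the
#global test that all interaction terms are zero
K_int <- K[grep(':', rownames(K)), ]
summary(glht(clouds_aov, linfct = K_int), test = Ftest())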

##Problem 2 C)

c. How does adjustment for multiple testing change which interactions are significant?

It appears the multiplicity adjustment in the general linear hypotheses tends to diminish the significance of the coefficients. For example, the seedingyes:sne interaction has a higher p value under the adjusted test (0.0773) than in the regular linear regression model (0.0104), dropping from significant to only marginally significant. In addition, seedingno:cloudcover was marginally significant in the regular linear model, but the general linear hypotheses method does not find any significance for this interaction.
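One way to see the effect of the adjustment directly (a small sketch of my own, reusing clouds_glht) is to print the unadjusted and single-step adjusted p values side by side:

#Unadjusted p values for the same hypotheses
summary(clouds_glht, test = adjusted('none'))
#Single-step adjusted p values (the default shown above)
summary(clouds_glht, test = adjusted('single-step'))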

##Problem 3)

  1. For the logistic regression model presented in Chapter 7 in Figure 7.7, perform a multiplicity adjusted test on all regression coefficients (except for the intercept) being zero. Do the conclusions drawn in Chapter 7 remain valid?

#GLM Model Summary

I will fit the glm model for the women's role data found in A Handbook of Statistical Analyses Using R on page 133. The summary is printed below.

#Loading Data
data(womensrole)
#Regular glm method 
fm2 <- cbind(agree,disagree) ~ gender*education
womensrole_glm_2 <- glm(fm2, data=womensrole, family=binomial())
summary(womensrole_glm_2)
## 
## Call:
## glm(formula = fm2, family = binomial(), data = womensrole)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.39097  -0.88062   0.01532   0.72783   2.45262  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.09820    0.23550   8.910  < 2e-16 ***
## genderFemale            0.90474    0.36007   2.513  0.01198 *  
## education              -0.23403    0.02019 -11.592  < 2e-16 ***
## genderFemale:education -0.08138    0.03109  -2.617  0.00886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 451.722  on 40  degrees of freedom
## Residual deviance:  57.103  on 37  degrees of freedom
## AIC: 203.16
## 
## Number of Fisher Scoring iterations: 4

#General Linear Hypotheses

#glht approach 
K <- cbind(0, diag(length(coef(womensrole_glm_2))-1))
rownames(K) <- names(coef(womensrole_glm_2))[-1]

womensrole_glht <- glht(womensrole_glm_2, linfct=K)
#printing summary 
summary(womensrole_glht)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glm(formula = fm2, family = binomial(), data = womensrole)
## 
## Linear Hypotheses:
##                             Estimate Std. Error z value Pr(>|z|)    
## genderFemale == 0            0.90474    0.36007   2.513   0.0247 *  
## education == 0              -0.23403    0.02019 -11.592   <0.001 ***
## genderFemale:education == 0 -0.08138    0.03109  -2.617   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
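The problem asks about all regression coefficients (except the intercept) being zero, so a joint test is also of interest. A short sketch of my own, reusing womensrole_glht, requests a global chi-square test of all three hypotheses at once:

#Global test that genderFemale, education, and their interaction are
#simultaneously zero
summary(womensrole_glht, test = Chisqtest())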

#Discussion

When looking at the summary of the glm approach and the glht approach, the same conclusions can be drawn, even though the intercept is excluded from the glht tests. The only difference between the models is that the p values for the variables, while still significant under the general linear hypotheses, are not as significant as in the regular glm method. The estimates, standard errors, and z values are the same between both approaches. Therefore, the conclusions drawn in Chapter 7 from the glm approach remain valid under the glht approach.
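As a further optional check (my own sketch), simultaneous confidence intervals from the glht fit can be set against unadjusted Wald intervals from the glm fit; the adjusted intervals should be somewhat wider but tell the same story:

#Simultaneous (single-step adjusted) confidence intervals from glht
confint(womensrole_glht)
#Unadjusted Wald intervals from the glm fit for comparison
confint.default(womensrole_glm_2)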