Homework Solutions (5:30 PM - 6:00 PM)

Basic 4.30 Sunlight Protection Factor

Rebecca Jenkins

Annie O’Connell

Basic 4.31 Effect of Group Therapy on Survival of Breast Cancer Patients

Jaxon Horn3e

Supplemental 4.19 Bumpus’s Study

Jacob Levenson

Supplemental 4.29 Salvage Logging

Volunteers for next week?

Master Problem Exact Permutation Test of Sleuth Case 4.1.1 Challenger Data

Yibiao Liang

Outline of Class 3 on Sleuth Chapter 5 Comparisons Among Several Samples

Fisher’s contributions
Major types of ANOVA
Steps in doing ANOVA
Case Study 5.1 Diet Restriction & Longevity
- 5 a priori hypotheses
- boxplots indicate no problem with heteroscedasticity
- Statistical summary & evaluation of the 5 hypotheses
- Graphical ANOVA
Case Study 5.2: The Spock trial;
- boxplots indicate no problem with heteroscedasticity
- Statistical Summary
- Graphical ANOVA
Comparing 2 means with the pooled estimate of the sd
Testing a priori hypotheses with t statistic with pooled sd from error mean square
ANOVA can be performed given just means and variances
Extra sum of squares F test
The F distribution
The ANOVA table
- Fixed effects ANOVA
- nested ANOVA table
- Type II & Type III sums of squares
- ANOVA Models I, II, and III
Linear contrasts
Random vs. Fixed Effects
ANOVA assumptions
- Analyzing residuals for violations of ANOVA assumptions
Kruskal-Wallis nonparametric ANOVA
Interpreting Confidence Limits
Statistical vs. Scientific significance
- Type II error and the precautionary principle
Spock Judge vs. the others: Fixed or Random Effect?
Conclusions

R. A. Fisher, Inventor of ANOVA

R. A. Fisher, Inventor of ANOVA ## 25 terms invented by Fisher (courtesy of HA David & David Bee (AP Stat group)

variance 1918a; analysis of variance 1918b; likelihood 1921; maximum likelihood 1922a; parameter 1922a; statistic 1922a; consistency 1922a; efficiency 1922a; sufficiency 1922a; degrees of freedom 1922b; Student’s t, t 1924; z-distribution 1924; test of significance 1925a; level of significance 1925a; sufficient statistic 1925b; randomization 1926; randomized blocks 1926; confounding 1926; interaction 1926; sampling distribution 1928; covariance 1930; Greco-Latin square (with F. Yates) 1934; null hypothesis 1935; Yates’s correction for continuity 1936; normal score (with F. Yates) 1938

Most experimental & survey design is based on an ANOVA framework

One can’t really appreciate the need for proper replication without considering the implications for testing treatment effects with ANOVA
Hurlbert’s (1984) monograph criticizing statistics in ecological papers is largely a criticism of inappropriate ANOVA design
- Hurlbert’s pseudoreplication is Tony Underwood’s ‘model misspecification’ and both are largely based on using an inappropriate ANOVA model
While ANOVA is a proper subset of the general linear model (GLM) and regression, as we’ll see, the concepts involving design and partitioning degrees of freedom are more evident in ANOVA models

Ehinger’s partition of statistical methods

image source: Dani Servén Marín (one of the developers of pyGAM)

Major types of ANOVA (there are many more)

One-factor or one-way ANOVA
- 1 factor with 2 or more levels
- If there is just 1 factor with just 2 levels, ANOVA is identical to an independent samples Student’s t test
Randomized block ANOVA
- Similar to factorial ANOVA. The block effect is designed to reduce the contribution of the blocked factor to the error variance producing a more powerful test of the main factors (Sleuth Case Study 13.1)
- However, interactions between blocks and treatments can be important
- If there are replicate blocks, the block x treatment interaction can be formally assessed
- If no replicates, Tukey’s test for additivity can be used to test for interactions (not covered in Sleuth)
Factorial ANOVA
- 2 or more factors with 2 or more levels of each factor
- 2⁴ factorial: 4 factors, each with two levels
- Can assess the interactions among variables, for example salinity and temperature effects on marine animal growth
Split-plot ANOVA
- Full replication of factors isn’t possible, so factors divided into whole plots and subplots
- Experiments in greenhouses or fields; large wafers used to print integrated circuits
Nested or hierarchical ANOVA
- cores within quadrats, replicate quadrats within treatments, fish within aquaria
- What is the experimental unit? It’s the smallest unit to which a treatment can be applied
- For example, for fish within aquaria in an experiment on ocean acidity , the experimental units on which degrees of freedom are alloted are the aquaria not the fish
Mixed model ANOVA: random and fixed factors, often with nested factors
- R’s lme4 & lme
Repeated measures ANOVA
- Repeated sampling of the same plots or individuals
- Now handled as longitudinal models, usually as a type of mixed model
- R’s lme4 & lme

Steps in doing an ANOVA-based experiment or survey

Design the experiment or survey after identifying the hypotheses and describe statistical tests BEFORE collecting data (a priori vs . a posteriori hypotheses)
Do a pre-test or preliminary survey
- If the variance is unknown, consider doing a preliminary experiment to calculate power for the full analysis
- Also, a pre-test will allow covariates and problems to be identified
Endeavor to create balanced designs with equal number of replicates at each combination of treatment and block levels
- ANOVA is robust to heteroscedasticity (unequal spread) if the design is balanced, i.e., equal number of cases per group (Winer et al. 1991)
Decide whether your factors are fixed or random
- ANOVA tests for difference in means (fixed effect) or whether σ²= 0 (random effect) or both (mixed model ANOVA)
- The choice of fixed vs. random effects is often crucial and depends on whether the factor levels represent a random or representative sample from some larger statistical population
- The F statistics, the interpretation of the results, and the extent of statistical inference often change markedly depending on whether factors are fixed or random
Avoid pseudoreplication (Hurlbert 1984)
- Pseudoreplication only occurs after an incorrect analysis has been presented: no analysis, no pseudoreplication
- Pseudoreplication has two causes: inadequate replication at the design stage, or using an inappropriate model especially the wrong ANOVA model with an inappropriate error mean square and error d.f.
- Examples of model misspecification
1. Failing to use a repeated measures design for longitudinal data
2. Confusing factorial and nested ANOVA
3. Inappropriately pooling terms in a nested, randomized block, or factorial ANOVA
4. Avoid pseudofactorialism (Hurlbert & White 1993): Using a factorial ANOVA when separate ANOVAs are appropriate
Set the alpha level for hypothesis tests (i.e., the critical values) in advance. Tests and hypothesis, as far as possible, should be specified in advance. A priori hypotheses, if a small subset of possible tests, can be tested at the experiment-wise alpha level, usually α=0.05.
- Note that Mead, Gelman and Hurlbert all reject the use of multiple comparison tests
- Patterns which reveal themselves after the data have been analyzed, or just graphed, must be assessed using an appropriate multiple comparison procedure that reduces the test α to maintain the experiment-wise or family-wise α level
Sample the appropriate populations & tabulate the data
Graph the data and critically look for violations of the assumptions, especially unequal spread
- Unequal variance = heteroscedasticity = heteroskedacity = lack of homoscedasticity
- Unequal variance is best revealed by box plots
- Unequal spread can be tested with Levene’s test (there are 5 different versions of Levene’s test)
Transform the data to correct unequal spread
- √ transform for Poisson-distributed counts, log (X+1) for logarithmically or log-normally distributed data
- Logit (log (p/(1-p)) transform or arcsin √P for binomial data
- Note that Zuur in his many R-based books rarely if ever transforms data, preferring to use analyses that assume non-normal error structure
Perform the ANOVA
- Assess higher order interaction effects and analyze the influence of outliers
- Graphically display residuals vs. Expected values & assess heteroscedasticity (again) and effects of outliers. Note that an outlier is only an outlier when compared to an underlying probability model
- Use appropriate rules for pooling sums of squares to produce more powerful tests of lower order interactions & main effects
- Evaluate null hypotheses, report p values & effect sizes
Examine the data for outliers, but never remove outliers without strong justification
- Examine data notebooks to find out if there were conditions that justify treating outliers as a different statistical population (e.g., different analyst or different analytical instrument)
- If the outlier’s removal might be justified, do the analysis with and without the outlier
1. If the conclusion remains the same, leave the outlier in, unless it has caused a major violation in assumptions
2. If the conclusion differs, drop the outlier and all similar data
- If there is no reason for removing the outlier
  1. Use rank-based methods, like Kruskal-Wallis or Friedman’s ANOVA which are resistant to outlier effects
  2. Report the results with and without the oultier(s)
Multiple comparisons procedures, from most to least conservative: Sleuth Chapter 6 (next week)
- Scheffé: must be used whenever more than one group is combined in a linear contrast, more conservative than Bonferroni
- Bonferroni
- Tukey’s Honestly Significant Difference (HSD), also called Tukey-Kramer if sample sizes are unequal
- Student-Newman-Keuls More powerful than HSD
- Benjamini–Hochberg False Discovery Rate
- Dunnet’s, appropriate if there is a control group
- Tukey’s LSD with F-protection: Use LSD if the overall F statistic is significant; not sufficiently conservative
Report all relevant p values and df needed to reconstruct the ANOVA table
- Hurlbert (1984): it wasn’t clear in the majority of ecological studies what test was performed
- Avoid the significant/non-significant dichotomy (see Sterne & Smith 2001, Hurlbert et al. 2019, Gelman et al. 2021)
Summarize the results of the ANOVA in the text, table or figure. It is unlikely that a journal will allow both a table and figure, but summary in the text is essential
Report the effect size (i.e., difference in means with 95% confidence intervals, or estimate and standard error)
Report negative results, (e.g., failure to reject the null hypothesis)

Case 5.1 Diet Restriction & Longevity

Sleuth3 Display 5.2

$Sleuth3 Display 5.2$ ***

5 planned contrasts out of choose (6,2) or 15 possible two-way tests.

Sleuth3 Display 5.3

If hypotheses are specified in advance, then you can test at a pre-set α level, without a posteriori (or post hoc, multiple comparison) adjustment of α levels. Recall that alpha = P(Type I error)

See Cook & Farewell (1996, J. Roy. Stat. Assoc. A; Posted in Blackboard). In dose-response studies, no need to adjust for number of dose treatments.
One large design allows the use of a more precise estimate of the error variance
- Separate control vs. treatment t tests are not powerful
- If interaction effects are evident, separate tests can be misleading. They can miss interaction effects.

Summary of Statistical Findings

There is overwhelming evidence that mean lifetimes in the six groups are different (p-value < 0.001); analysis of variance F-test).
Analysis of the 5 particular questions are
1. There is convincing evidence that lifetime increases as a result of restricting the diet from 85 kcal/wk to 50 kcal/wk (1-sided p-value < 0.0001; t test)
2. There is no evidence that reducing the calories before weaning increased lifespan, when the caloric intake after weaning is 50 kcal/wk (1-sided p value = 0.32, t test). A 95% CI for the amount by which the lifetime under the R/R50 diet exceeds the lifetime under the N/R50 diet is - 1.7 to 2.9 months.
3. Further restriction of the diet from 50 to 40 kcal/wk increases lifetime by an estimated 2.8 months (95% CI: 0.5 to 5.1 months). The evidence that this effect is greater than zero is moderate (p=0.017, t test).
4. There was moderate evidence that lifetime was decreased by the lowering of protein in addition to the 50 kcal/wk diet (2-sided p value =0.024; t-test)
5. There is convincing evidence that the control mice live longer than the mice on the non-purified (NP) diet (1-sided p-value <0.0001)
Note that all 5 of these hypotheses can be tested as Tukey Least Significant Difference (LSD) tests (or linear contrasts) see Sleuth Ch 6

Case 0501 Box et al. (2005) Graphical ANOVA

Case 5.2 The Spock Conspiracy Trial—An Observational Study

Sleuth, page 117: Dr. Spock’s venire contained only 1 woman, who was released by the prosecution

Sleuth3 Display 5.4

Key questions

Is there evidence that women were underrepresented on the Spock judge’s venires
Is there evidence that there are differences in women’s representation on the other juries?

Tsarnaev Jury: Was it a biased sample of the eastern Massachusetts population?

NBC News 2/27/2015

The defense said 1,373 people, summoned from a population of about 5 million in eastern Massachusetts, were originally given numbers based on a random pool order list. New numbers were later assigned, based on when the jurors reported to court to complete written questionnaires.

The defense argued that the reordering undermined the randomness of the selection process and pushed certain groups — including blacks, people under 30 and people who live in Boston — down on the list and made them less likely to be chosen for the jury.

Sleuth3 Case 0502. See Display 5.5 in Sleuth3

Statistical Summary

The percentage of women on the Spock judge’s venires were substantially lower than the other judges (t test of Spock judge vs.‘Other judges’)
There is little evidence to reject the null hypothesis of no difference in female representation among the other judges p=0.32 (1-way ANOVA)
The percentage of women is 15% less on the Spock judge’s venires (95% CI: 10% to 20%)

Case0502 Graphical ANOVA, Box et al. (2005)

Case Study 5.1 Diet Restriction and Longevity—A Randomized Experiment: R analysis

First look at the structure of the data

str(case0501)

## 'data.frame':    349 obs. of  2 variables:
##  $ Lifetime: num  35.5 35.4 34.9 34.8 33.8 33.5 32.6 32.4 31.8 31.6 ...
##  $ Diet    : Factor w/ 6 levels "N/N85","N/R40",..: 4 4 4 4 4 4 4 4 4 4 ...

favstats(Lifetime ~ Diet, data = case0501)  # from Horton's mosaic

##    Diet  min     Q1 median    Q3  max     mean       sd  n missing
## 1 N/N85 17.9 31.400  33.10 36.40 42.3 32.69123 5.125297 57       0
## 2 N/R40 19.6 42.275  46.05 50.35 54.6 45.11667 6.703406 60       0
## 3 N/R50 18.6 37.950  43.90 48.20 51.9 42.29718 7.768195 71       0
## 4    NP  6.4 24.800  28.90 31.40 35.5 27.40204 6.133701 49       0
## 5 R/R50 24.2 39.150  43.95 48.35 50.7 42.88571 6.683152 56       0
## 6 lopro 23.4 35.000  41.05 46.45 49.7 39.68571 6.991695 56       0

Do a graphical analysis to evaluate outliers and homogeneity of variance

# Re-order levels for better boxplot organization:
myDiet <- factor(case0501$Diet, levels=c("NP","N/N85","N/R50","R/R50","lopro","N/R40") )
myNames <- c("NP(49)","N/N85(57)","N/R50(71)","R/R50(56)","lopro(56)",
             "N/R40(60)") # Make these for boxplot labeling.
boxplot(Lifetime ~ myDiet, ylab= "Lifetime (months)", names=myNames,
        xlab="Treatment (and sample size)", data=case0501)

densityplot(~Lifetime, groups = Diet, auto.key = TRUE, data=case0501)

Do the analysis of variance (ANOVA) on the Case0501 longevity data

myAov1 <- aov(Lifetime ~ Diet, data=case0501) # One-way analysis of variance
myAov1

## Call:
##    aov(formula = Lifetime ~ Diet, data = case0501)
## 
## Terms:
##                     Diet Residuals
## Sum of Squares  12733.94  15297.42
## Deg. of Freedom        5       343
## 
## Residual standard error: 6.678239
## Estimated effects may be unbalanced

summary(myAov1)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Diet          5  12734  2546.8    57.1 <2e-16 ***
## Residuals   343  15297    44.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.lm(myAov1)

## 
## Call:
## aov(formula = Lifetime ~ Diet, data = case0501)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
## DietN/R40    12.4254     1.2352  10.059  < 2e-16 ***
## DietN/R50     9.6060     1.1877   8.088 1.06e-14 ***
## DietNP       -5.2892     1.3010  -4.065 5.95e-05 ***
## DietR/R50    10.1945     1.2565   8.113 8.88e-15 ***
## Dietlopro     6.9945     1.2565   5.567 5.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

attributes(myAov1)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "aov" "lm"

Analyze the residuals to examine homogeneity of variance and outliers

# Residual vs. fitted plot
plot(myAov1, which=1) # Plot residuals versus estimated means.

summary(myAov1)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Diet          5  12734  2546.8    57.1 <2e-16 ***
## Residuals   343  15297    44.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual plot indicates no violation of assumptions.

Plot the Honestly Significant Differences (Sleuth Ch 6). Not as powerful as testing a priori hypotheses

TukeyHSD(myAov1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Lifetime ~ Diet, data = case0501)
## 
## $Diet
##                    diff        lwr         upr     p adj
## N/R40-N/N85  12.4254386   8.885436  15.9654413 0.0000000
## N/R50-N/N85   9.6059550   6.202170  13.0097399 0.0000000
## NP-N/N85     -5.2891873  -9.017748  -1.5606269 0.0008380
## R/R50-N/N85  10.1944862   6.593417  13.7955556 0.0000000
## lopro-N/N85   6.9944862   3.393417  10.5955556 0.0000008
## N/R50-N/R40  -2.8194836  -6.175736   0.5367684 0.1564608
## NP-N/R40    -17.7146259 -21.399845 -14.0294069 0.0000000
## R/R50-N/R40  -2.2309524  -5.787127   1.3252222 0.4684413
## lopro-N/R40  -5.4309524  -8.987127  -1.8747778 0.0002306
## NP-N/R50    -14.8951423 -18.449713 -11.3405719 0.0000000
## R/R50-N/R50   0.5885312  -2.832070   4.0091319 0.9963976
## lopro-N/R50  -2.6114688  -6.032070   0.8091319 0.2460200
## R/R50-NP     15.4836735  11.739756  19.2275913 0.0000000
## lopro-NP     12.2836735   8.539756  16.0275913 0.0000000
## lopro-R/R50  -3.2000000  -6.816968   0.4169683 0.1167873

plot(TukeyHSD(myAov1))

Box et al. (2005) Graphical ANOVA see Display 5.16, 5.17, p 134 & 135

anovaPlot(myAov1, stacked = TRUE, base = TRUE, axes = TRUE,
          faclab = TRUE, labels = TRUE, cex = par("cex"),
          cex.lab = par("cex.lab"))

The graphical ANOVA shows the strong differences between most of the diets, especially NP and N/R40

All Pairwise t tests with pooled sd, as in Ramsey & Schafer (2013)

pairwise.t.test(case0501$Lifetime,case0501$Diet, pool.SD=TRUE, p.adj="none")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  case0501$Lifetime and case0501$Diet 
## 
##       N/N85   N/R40   N/R50   NP      R/R50
## N/R40 < 2e-16 -       -       -       -    
## N/R50 1.1e-14 0.017   -       -       -    
## NP    5.9e-05 < 2e-16 < 2e-16 -       -    
## R/R50 8.9e-15 0.073   0.622   < 2e-16 -    
## lopro 5.2e-08 1.6e-05 0.029   < 2e-16 0.012
## 
## P value adjustment method: none

General Linear Hypothesis Testing (glht package) of Specified Means

p-VALUES AND CONFIDENCE INTERVALS FOR SPECIFIED COMPARISONS OF MEANS

if(require(multcomp)){
  diet <- factor(case0501$Diet,labels=c("lopro", "NN85", "NR40", "NR50", "NP", "RR50"))
  
  myAov2 <- aov(Lifetime ~ diet - 1, data=case0501)  # Note the -1 is in the Sleuth3 vignette
  # It drops the dietlopro diet from the analysis, making it the llinear model
  # reference group
  summary(myAov2)
  myComparisons <- glht(myAov2,
                        linfct=c("dietNR50 - dietNN85 = 0",
                                 "dietNR40 - dietNR50 = 0",
                                 "dietRR50 - dietNR50 = 0",
                                 "dietlopro - dietNR50 = 0",
                                 "dietNN85 - dietNP = 0") )
  summary(myComparisons,test=adjusted("none")) # No multiple comparison adjust.
  confint(myComparisons, calpha = univariate_calpha()) # No adjustment
}

## 
##   Simultaneous Confidence Intervals
## 
## Fit: aov(formula = Lifetime ~ diet - 1, data = case0501)
## 
## Quantile = 1.9669
## 95% confidence level
##  
## 
## Linear Hypotheses:
##                           Estimate lwr      upr     
## dietNR50 - dietNN85 == 0  -17.7146 -20.2438 -15.1854
## dietNR40 - dietNR50 == 0   14.8951  12.4556  17.3347
## dietRR50 - dietNR50 == 0   12.2837   9.7142  14.8532
## dietlopro - dietNR50 == 0   5.2892   2.7302   7.8481
## dietNN85 - dietNP == 0      2.2310  -0.2097   4.6716

Case Study 5.2 The Spock Conspiracy Trial—An Observational Study: R analysis

First look at the structure of the data

str(case0502)

## 'data.frame':    46 obs. of  2 variables:
##  $ Percent: num  6.4 8.7 13.3 13.6 15 15.2 17.7 18.6 23.1 16.8 ...
##  $ Judge  : Factor w/ 7 levels "A","B","C","D",..: 7 7 7 7 7 7 7 7 7 1 ...

favstats(Percent ~ Judge, data = case0502)  # from Horton's mosaic

##     Judge  min     Q1 median     Q3  max     mean        sd n missing
## 1       A 16.8 30.800  33.60 40.500 48.9 34.12000 11.941817 5       0
## 2       B 27.0 29.675  32.35 34.800 45.6 33.61667  6.582224 6       0
## 3       C 21.0 27.500  30.50 32.500 33.8 29.10000  4.592929 9       0
## 4       D 24.3 25.650  27.00 28.350 29.7 27.00000  3.818377 2       0
## 5       E 17.7 20.150  24.70 33.075 40.2 26.96667  9.010142 6       0
## 6       F 16.5 23.500  26.70 29.800 36.2 26.80000  5.968878 9       0
## 7 Spock's  6.4 13.300  15.00 17.700 23.1 14.62222  5.038794 9       0

Do a graphical analysis with a boxplot to evaluate outliers and homogeneity of variance

# Re-order levels for better boxplot organization:
## Make new factor level names (with sample sizes) and analyze boxplots
myNames <- c("A (5)", "B (6)", "C (9)", "D (2)", "E (6)", "F (9)", "Spock's (9)")

boxplot(Percent ~ Judge, ylab = "Percent of Women on Judges' Venires",
        names = myNames, xlab = "Judge (and number of venires)",
        main = "Percent Women on Venires of 7 Massachusetts Judges", 
        data=case0502)

Plot a colored boxplot

boxplot(Percent ~ Judge,  ylab= "Percent of Women on Judges' Venires",
        names=myNames, xlab="Judge (and number of venires)",
        main= "Percent Women on Venires of 7 Massachusetts Judges",
        col="green", boxlwd=2,  medlwd=2,  whisklty=1,  whisklwd=2,
        staplewex=.2, staplelwd=2,  outlwd=2,  outpch=21, outbg="green",
        outcex=1.5, data=case0502)

Do the analysis of variance (ANOVA) for Case0502 Spock data

myAov2  <- aov(Percent ~ Judge, data=case0502)
summary(myAov2) # Initial screening. Any evidence of judge differences? (yes)

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Judge        6   1927   321.2   6.718 6.1e-05 ***
## Residuals   39   1864    47.8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.lm(myAov2)

## 
## Call:
## aov(formula = Percent ~ Judge, data = case0502)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.320  -4.367  -0.250   3.319  14.780 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.1200     3.0921  11.034 1.47e-13 ***
## JudgeB        -0.5033     4.1868  -0.120   0.9049    
## JudgeC        -5.0200     3.8566  -1.302   0.2007    
## JudgeD        -7.1200     5.7848  -1.231   0.2258    
## JudgeE        -7.1533     4.1868  -1.709   0.0955 .  
## JudgeF        -7.3200     3.8566  -1.898   0.0651 .  
## JudgeSpock's -19.4978     3.8566  -5.056 1.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.914 on 39 degrees of freedom
## Multiple R-squared:  0.5083, Adjusted R-squared:  0.4326 
## F-statistic: 6.718 on 6 and 39 DF,  p-value: 6.096e-05

attributes(myAov2)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "aov" "lm"

Analyze the residuals to examine homogeneity of variance and outliers

# Residual vs. fitted plot
plot(myAov2, which=1) # Plot residuals versus estimated means.

summary(myAov2)

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Judge        6   1927   321.2   6.718 6.1e-05 ***
## Residuals   39   1864    47.8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual plot indicates no apparent violation of assumptions.

Analysis 1. TWO-SAMPLE t-TEST (ASSUMING NON-SPOCK JUDGES HAVE A COMMON MEAN)

SpockOrOther <- factor(ifelse(case0502$Judge=="Spock's","Spock","Other"))                    
aovFull      <- aov(Percent ~ Judge, data=case0502)
aovReduced   <- aov(Percent ~ SpockOrOther, data=case0502) 
anova(aovReduced,aovFull) #Any evidence that 7 means fit better than 2 means?

## Analysis of Variance Table
## 
## Model 1: Percent ~ SpockOrOther
## Model 2: Percent ~ Judge
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     44 2190.9                           
## 2     39 1864.5  5    326.46 1.3658 0.2582

 # Evidence that 2 means differ?
t.test(Percent ~ SpockOrOther, var.equal=TRUE, data=case0502)

## 
##  Two Sample t-test
## 
## data:  Percent by SpockOrOther
## t = 5.6697, df = 44, p-value = 1.03e-06
## alternative hypothesis: true difference in means between group Other and group Spock is not equal to 0
## 95 percent confidence interval:
##   9.584045 20.155294
## sample estimates:
## mean in group Other mean in group Spock 
##            29.49189            14.62222

There is strong evidence that the Spock judge differs from the other 6. There is little evidence for a difference among the 6 non-Spock judges

Graphical ANOVA from Box, Hunter & Hunter, 2nd edition.

anovaPlot(aovFull, stacked = TRUE, base = TRUE, axes = TRUE,
          faclab = TRUE, labels = TRUE, cex = par("cex"),
          cex.lab = par("cex.lab"))

There is strong evidence from the Box et al. (2005) graphical ANOVA that the Spock juries differs from the other 6 in the percentage of females.

ANALYSIS 2. COMPARE SPOCK MEAN TO AVERAGE OF OTHER MEANS, using multcomp::glht

myAov3 <- aov(Percent ~ Judge - 1, data = case0502)

myContrast    <- rbind(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6, - 1))
if(require(multcomp)) {    
  myComparison  <- glht(myAov3, linfct=myContrast) 
  summary(myComparison, test=adjusted("none"))
  confint(myComparison)
}

## 
##   Simultaneous Confidence Intervals
## 
## Fit: aov(formula = Percent ~ Judge - 1, data = case0502)
## 
## Quantile = 2.0227
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##        Estimate lwr     upr    
## 1 == 0 14.9783   9.6348 20.3219

myAov3 <- aov(Percent ~ Judge - 1)

myContrast <- rbind(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6, - 1)) if(require(multcomp)){ # use multcomp library myComparison <- glht(myAov3, linfct=myContrast) summary(myComparison, test=adjusted(“none”))
confint(myComparison) }

Sleuth3 5.2 Comparing any two of several means

Comparisons among means in ANOVA can be analyzed using t statistics, with a new, more precise estimate of pooled error. It is that pooling, with higher df, that makes ANOVA a more powerful method than multiple t tests.

Sleuth3 Display 0506 sp = √ Error Mean Square = √ Within Groups MS

Case 5.1 Pooled standard error

Extra Sum of Squares F test

F distribution

Display 5.10

Case0502 ANOVA table

Case0502 ANOVA linear contrast

Testing linear contrasts with R’s multcomp

myContrast    <- rbind(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6, - 1))
if(require(multcomp)){  # use multcomp library
  myComparison  <- glht(myAov3, linfct=myContrast) 
  summary(myComparison, test=adjusted("none"))   
  confint(myComparison) 
}

## 
##   Simultaneous Confidence Intervals
## 
## Fit: aov(formula = Percent ~ Judge - 1, data = case0502)
## 
## Quantile = 2.0227
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##        Estimate lwr     upr    
## 1 == 0 14.9783   9.6348 20.3219

When is an effect Random?

See Sleuth Page 137-139 in 3rd ed, 136-138 in 2nd: ‘The Random Effects model’

The differences among subgroup means is NOT of intrinsic interest (not the case with Judges)
- You may be interested in whether the effect changes from day to day – i.e, estimating day-to-day or ‘among day’ variance – but you are not interested specifically in the differences on any pair of days
If the number of levels of a factor is small relative to the total possible levels of a factor (not the case with district Judges since ALL were sampled)
Are the subgroups a representative or random sample of some larger group? (No for judges)

Quinn and Keough on random effects

Spock judge versus the other six judges. Is the judge effect fixed or random?

Model I ANOVA: Fixed effects ANOVA: test for differences in the averages among groups
Model II ANOVA: Random effects ANOVA: test differences in variances due to the group classification
Model III model: Fixed & random factors

Note: The calculations are often identical for random and fixed-effects ANOVA, but the interpretations are different. With Factorial ANOVA (>1 factor), the F statistics differ among models, with a different denominator mean square for random factors. The inference allowed differs among models

ANOVA Robustness of Assumptions

Normality is not critical. Extremely long-tailed distributions or skewed distributions, coupled with different sample sizes present the only serious distributional problems
The assumptions of independence within and across groups is critical
The assumption of equal standard deviations in the populations is crucial. Also called the equal variance, equal spread, or homoscedasticity assumption (vs. Heterorscedasticity). Winer et al. (1991) argue that ANOVA is robust to violations of equal variance if the sample sizes are equal.
The tools are not resistant to severely outlying observations
Check the assumptions by analyzing the residuals with Levene’s test and boxplots

Sleuth3 Display 5.13 p 131

Sleuth3 Display 5.15 p 133

Sleuth3 Display 5.19 p 139

Cumming and Finch (2005) Inference by eye

Statistical versus scientific significance

Always report the effect size and NEVER report ‘significant’ or ‘not significant’

Deming: report effect sizes for tests
Many statistically significant results are trivial ecologically (environmentally, chemically or socially)
- Most null hypotheses (μ1 = μ2) are false and the p-value is often dependent on the sample size
- Gelman never reports p-values in his statistics books, usually effect and standard error
- e.g., a p value of 0.00001 may not be ecologically meaningful if there is only a minor difference in effect and a much larger difference causes meaningful changes in the ecosystem
Test statistics with large p values (>0.1) but with broad 95% confidence intervals may be consistent with important ecological or other effects
- What is the probability of Type II error?
- What are the ecological consequences of failing to reject a false null hypothesis (i.e., committing a Type II error) ?

Buhl-Mortensen (1996) The Precautionary Principle

Type II error and Power

Nonparametric 1-way ANOVA Kruskal Wallis test

Type II error and Power

Conclusions

Assumptions

Homoscedasticity or equal variance among groups
- Levene’s test a rough guide
- Boxplots or residual plots are the standard tools for assessing homoscedasticity (equal variance among groups)
- Spread vs. Level plots
Independence of errors among groups a key ANOVA assumption
Normally distributed errors (not underlying data) not crucial

An ANOVA is more efficient & powerful than multiple, separate t tests

The ANOVA error MS (=within groups MS) provides a more precise estimate of the population standard deviation [It is not a smaller estimate of error {it is an unbiased estimator})

Kruskal-Wallis ANOVA is the rank-based analogue of 1-way ANOVA and is resistant to outliers but not unequal spread

Ties correction must be used
Effect sizes, hierarchic structure, and covariates difficult to handle

ANOVA tests for difference in means (fixed effect) or whether σ²= 0 (random effect) or both (mixed model)

Fixed vs. random effects
- The choice of fixed vs. random effects is often crucial and depends on whether the factor levels (judges in the Spock example) represent a random or representative sample from some larger statistical population
- The F statistics and interpretation of the results sometimes change depending on whether fixed or random effects are chosen

Week 4: Sleuth Chapter 4

Eugene D. Gallagher

9/28/2021

Homework Solutions (5:30 PM - 6:00 PM)

Basic 4.30 Sunlight Protection Factor

Rebecca Jenkins

Annie O’Connell

Basic 4.31 Effect of Group Therapy on Survival of Breast Cancer Patients

Jaxon Horn3e

Supplemental 4.19 Bumpus’s Study

Jacob Levenson

Supplemental 4.29 Salvage Logging

Volunteers for next week?

Master Problem Exact Permutation Test of Sleuth Case 4.1.1 Challenger Data

Yibiao Liang

Outline of Class 3 on Sleuth Chapter 5 Comparisons Among Several Samples

R. A. Fisher, Inventor of ANOVA

Most experimental & survey design is based on an ANOVA framework

Major types of ANOVA (there are many more)

Steps in doing an ANOVA-based experiment or survey

Case 5.1 Diet Restriction & Longevity

5 planned contrasts out of choose (6,2) or 15 possible two-way tests.

Summary of Statistical Findings

Case 5.2 The Spock Conspiracy Trial—An Observational Study

Key questions

Tsarnaev Jury: Was it a biased sample of the eastern Massachusetts population?

Statistical Summary

Case Study 5.1 Diet Restriction and Longevity—A Randomized Experiment: R analysis

First look at the structure of the data

Do a graphical analysis to evaluate outliers and homogeneity of variance

Do the analysis of variance (ANOVA) on the Case0501 longevity data

Analyze the residuals to examine homogeneity of variance and outliers

Plot the Honestly Significant Differences (Sleuth Ch 6). Not as powerful as testing a priori hypotheses

Box et al. (2005) Graphical ANOVA see Display 5.16, 5.17, p 134 & 135

All Pairwise t tests with pooled sd, as in Ramsey & Schafer (2013)

General Linear Hypothesis Testing (glht package) of Specified Means

p-VALUES AND CONFIDENCE INTERVALS FOR SPECIFIED COMPARISONS OF MEANS

Case Study 5.2 The Spock Conspiracy Trial—An Observational Study: R analysis

First look at the structure of the data

Do a graphical analysis with a boxplot to evaluate outliers and homogeneity of variance

Plot a colored boxplot

Do the analysis of variance (ANOVA) for Case0502 Spock data

Analyze the residuals to examine homogeneity of variance and outliers

Analysis 1. TWO-SAMPLE t-TEST (ASSUMING NON-SPOCK JUDGES HAVE A COMMON MEAN)

Graphical ANOVA from Box, Hunter & Hunter, 2nd edition.

ANALYSIS 2. COMPARE SPOCK MEAN TO AVERAGE OF OTHER MEANS, using multcomp::glht

Sleuth3 5.2 Comparing any two of several means

Testing linear contrasts with R’s multcomp

When is an effect Random?

Spock judge versus the other six judges. Is the judge effect fixed or random?

ANOVA Robustness of Assumptions

Statistical versus scientific significance

Nonparametric 1-way ANOVA Kruskal Wallis test

Conclusions

Assumptions

An ANOVA is more efficient & powerful than multiple, separate t tests

Kruskal-Wallis ANOVA is the rank-based analogue of 1-way ANOVA and is resistant to outliers but not unequal spread

ANOVA tests for difference in means (fixed effect) or whether σ2= 0 (random effect) or both (mixed model)

ANOVA tests for difference in means (fixed effect) or whether σ²= 0 (random effect) or both (mixed model)