Homework Solutions

Basic 1.21

Rob Digiovanni

Bryan Mccormack

Basic 1.26

Jessie Kittel

Supplemental 1.16

Jacob Levenson

Supplemental 1.25

Master Problem

Yibiao Liang

Statistical Sleuth Case 2.1

Display 2.1 from Statistical Sleuth 3rd edition


attach(case0201)
stem.leaf.backback(Depth[Year == 1976],
                   Depth[Year == 1978],
                   back.to.back = TRUE,unit=0.1,
                   trim.outliers = FALSE,
                   show.no.depths = TRUE)
## ___________________________________________________
##   1 | 2: represents 1.2, leaf unit: 0.1 
##     Depth[Year == 1976]       Depth[Year == 1978]
## ___________________________________________________
##                      2|  6* |                      
##                      8|  6. |                      
##                    411|  7* |1                     
##                     98|  7. |9                     
##                  44420|  8* |044                   
##          9999977765555|  8. |778                   
##        444321111100000|  9* |0011123344            
##      99999888887777655|  9. |5666666777789999      
##     444433322111111000| 10* |000222223333334444    
##             8766655555| 10. |55555566666777778999  
##                    440| 11* |0000111134444         
##                      7| 11. |5667                  
##                       | 12* |                      
## ___________________________________________________
## n:                  89       89                
## ___________________________________________________
detach(case0201)

Two great books on Darwin’s Finches

Weiner’s Pullitzer prize winning “The Beak of the Finch”

The Grants’ “40 years of evolution”

Weiner The Beak of the Finch

Weiner The Beak of the Finch

The Grants 40 Years of Evolution

The Grants 40 Years of Evolution


Back-to-back histogram from Frank Harrell’s Hmisc package

## Plot back-to-back histograms from Frank Harrell's Hmisc package
options(digits=1)
out <- histbackback(split(case0201$Depth, case0201$Year), probability=FALSE, 
                    xlim=c(-30,30),ylab="Beak Depth (mm)",
                    main = 'Sleuth Case 2.1')          
#! just adding color
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)

options(digits=5) # return the default number of digits

Matlab, SPSS and R back-to-back histograms

R Boxplots

boxplot(Depth ~ Year, data=case0201,     
        ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),  
        main="Beak Depths of Darwin Finches in 1976 and 1978", col="green", 
        boxlwd=2)

# car Boxplot which labels the outliers
Boxplot(Depth ~ Year,             
        ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),  
        main="Beak Depths of Darwin Finches in 1976 and 1978", col="green", 
        boxlwd=2, medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,  
        outlwd=2, outpch=21, outbg="green", outcex=1.5, data=case0201)

## [1] "1"  "2"  "90" "91"
# I don't know why the outlers are printed out.

Two Student’s t test

mean(case0201$Depth[case0201$Year==1978]) - mean(case0201$Depth[case0201$Year==1976])  
## [1] 0.66854
yearFactor <- factor(case0201$Year) # Convert the numerical variable Year into a factor
# with 2 levels. 1976 is "group 1" (it comes first alphanumerically)
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE) # 2-sample t-test; 2-sided by default 
## 
##  Two Sample t-test
## 
## data:  case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 8.6e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is not equal to 0
## 95 percent confidence interval:
##  -0.95641 -0.38067
## sample estimates:
## mean in group 1976 mean in group 1978 
##             9.4697            10.1382
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE, 
       alternative = "less") # 1-sided; alternative: group 1 mean is less 
## 
##  Two Sample t-test
## 
## data:  case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 4.3e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
## 95 percent confidence interval:
##      -Inf -0.42734
## sample estimates:
## mean in group 1976 mean in group 1978 
##             9.4697            10.1382

Matlab plot of Student’s t distribution, 176 degrees of freedom

Statistical Summary of Case 2.1

“These data provide overwhelming evidence that the mean beak depth increased from 1976 to 1978 (one-sided p-value < 0.00001 from a two sample t-test). The 1978 mean was estimated to exceed the 1976 (pre-drought) mean by 0.67 mm (95% confidence interval 0.38 to 0.96 mm).”

This was an observational study, and the entire population was surveyed. Even though the entire population was sampled a t test was appropriate. Lack of independence is an issue.

Statistical Sleuth Case 2.2 Anatomical abnormalities & schizophrenia

A stem & leaf plot

Display 2.2 from Statistical Sleuth 3rd edition

str(case0202)
## 'data.frame':    15 obs. of  2 variables:
##  $ Unaffected: num  1.94 1.44 1.56 1.58 2.06 1.66 1.75 1.77 1.78 1.92 ...
##  $ Affected  : num  1.27 1.63 1.47 1.39 1.93 1.26 1.71 1.67 1.28 1.85 ...
diff <- case0202$Unaffected-case0202$Affected
summary(diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.190   0.055   0.110   0.199   0.315   0.670
# A Tukey Stem & leaf plot
stem(diff, scale = 5, width = 80)
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   -1 | 9
##   -1 | 
##   -0 | 
##   -0 | 
##    0 | 234
##    0 | 79
##    1 | 013
##    1 | 9
##    2 | 3
##    2 | 
##    3 | 
##    3 | 
##    4 | 0
##    4 | 
##    5 | 0
##    5 | 9
##    6 | 
##    6 | 7

Two R boxplots

boxplot(diff,       
        ylab="Difference in Hippocampus Volume (cubic cm)", 
        xlab="15 Sets of Twins, One Affected with Schizophrenia", 
        main="Hippocampus Difference: Unaffected Twin Minus Affected Twin") 
abline(h=0,lty=2)   # Draw a dashed (lty=2) horizontal line at 0    

# BOXPLOT FOR PRESENTATION from Sleuth3 vignette
boxplot(diff, 
        ylab="Difference in Hippocampus Volume (cubic cm)", 
        xlab="15 Sets of Twins, One Affected with Schizophrenia",
        main="Hippocampus Difference: Unaffected Minus Affected Twin",  
        col="green", boxlwd=2, medlwd=2, whisklty=1, whisklwd=2, 
        staplewex=.2, staplelwd=2, outlwd=2, outpch=21, outbg="green",
        outcex=1.5)      
abline(h=0,lty=2) 

Matlab plot of paired t test assumptions for case 2.2 t.test(case0202\(Unaffected, case0202\)Affected, paired = TRUE)

# Paired t test
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
## 
##  Paired t-test
## 
## data:  case0202$Unaffected and case0202$Affected
## t = 3.23, df = 14, p-value = 0.0061
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.066704 0.330629
## sample estimates:
## mean of the differences 
##                 0.19867

Statistical Summary for Case Study 2.2

There is substantial evidence that the mean difference in the left hippocampus volumes between schizophrenic individuals and their nonschizophrenic twins is nonzero (two-sided p-value = 0.006, from a paired t test). It is estimated that the the mean volume is 0.20 cm3 smaller for those with schizophrenia (about 11% smaller). A 95% confidence interval for the difference is from 0.07 to 0.33 cm3.

Interpreting the p-value

p-value (modified from K Wuensch and reef fish, edstat, 3/19/03, and October 2005) The p-value is the probability of obtaining data as or more discrepant with respect to the null and alternative hypothesis than are those in the present sample, assuming that the null hypothesis is absolutely correct. In a one-sample z-test of the hypothesis μ = μo, a z-score of 1.96 can have p-value 0.975, 0.05 or 0.025 depending on whether the alternative hypothesis is μ < μo, μ ≠ μo, or μ > μo, respectively. Contributions of Fisher, Neyman & Pearson and Deming


Sterne & Smith’s (2001) and Sleuth’s interpretation of p-values


The Central Limit Theorem. It really is important!

Display 2.4 demonstrating on the Central Limit Theorem


The Central Limit Theorem for sums and means

The Central Limit Theorem for sums and means

Demonstration of the Central Limit Theorem

With Matlab

Simulation to demonstrate the central limit theorem


Simulation to demonstrate the central limit theorem


Skewed distributions are slow to obey the central limit theorem.


Most Environmental data is positively skewed.


Demonstration of the Central Limit Theorem with Radziwill’s CentralLimitTheorem.R

Uniform distribution

# Program written by Nicole Radziwill at r-bloggers
# https://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/
# modified by Eugene.Gallagher@umb.edu, 10/10/19: added QQplot, last rev 9/14/21

sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
  r <- 10000  # Number of replications/samples - DO NOT ADJUST
  # This produces a matrix of observations with  
  # n columns and r rows. Each row is one sample:
  my.samples <- switch(src.dist,
                       "E" = matrix(rexp(n*r,param1),r),
                       "N" = matrix(rnorm(n*r,param1,param2),r),
                       "U" = matrix(runif(n*r,param1,param2),r),
                       "P" = matrix(rpois(n*r,param1),r),
                       "C" = matrix(rcauchy(n*r,param1,param2),r),
                       "B" = matrix(rbinom(n*r,param1,param2),r),
                       "G" = matrix(rgamma(n*r,param1,param2),r),
                       "X" = matrix(rchisq(n*r,param1),r),
                       "T" = matrix(rt(n*r,param1),r))
  all.sample.sums <- apply(my.samples,1,sum)
  all.sample.means <- apply(my.samples,1,mean)   
  all.sample.vars <- apply(my.samples,1,var) 
  op <- par(no.readonly = TRUE)  # save plotting parameters in op
  par(mfrow=c(2,2))  # will plot figures in a 2 x 2 array
  
  hist(my.samples[1,],col="gray",main="Distribution of One Sample")
  hist(all.sample.sums,col="gray",main="Sampling Distribution of
    the Sum")
  hist(all.sample.means,col="gray",main="Sampling Distribution of the Mean")
  # hist(all.sample.vars,col="gray",main="Sampling Distribution of
    # the Variance")
  # add a qq plot for the 4th graph
  qqnorm(all.sample.means,col="gray",main="Quantile-Quantile Plot of Means")
  qqline(all.sample.means, col="red")
#  Returns the plotting parameters to their original values.
  par(op) }

# uncomment just one of these calls to the function sdm.sim at a time.
# sdm.sim(10,src.dist="B",param1=10,param2=0.5)
#  sdm.sim(50,src.dist="B",param1=10,param2=0.5)
# sdm.sim(100,src.dist="B",param1=10,param2=0.5)
# sdm.sim(10,src.dist="C",param1=0,param2=1)  # This produces results where the CLT fails
## See: https://stats.stackexchange.com/questions/74268/cauchy-distribution-and-central-limit-theorem
##  https://web.ma.utexas.edu/users/mks/M358KInstr/Cauchy.pdf
# sdm.sim(50,src.dist="C",param1=0,param2=1)
# sdm.sim(100,src.dist="C",param1=0,param2=1)
# sdm.sim(10,src.dist="E",param1=1)
# sdm.sim(50,src.dist="E",param1=1)
# sdm.sim(100,src.dist="E",param1=1)
# sdm.sim(10,src.dist="G",param1=10,param2=2)
# sdm.sim(50,src.dist="G",param1=5,param2=2)
# sdm.sim(100,src.dist="G",param1=5,param2=2)
# sdm.sim(10,src.dist="N",param1=20,param2=3)
# sdm.sim(50,src.dist="N",param1=20,param2=3)
# sdm.sim(100,src.dist="N",param1=20,param2=3)
# sdm.sim(10,src.dist="P",param1=0.5)
# sdm.sim(50,src.dist="P",param1=0.5)
# sdm.sim(100,src.dist="P",param1=0.5)
# sdm.sim(10,src.dist="T",param1=5)
# sdm.sim(50,src.dist="T",param1=5)
# sdm.sim(100,src.dist="T",param1=5)
# sdm.sim(10,src.dist="U",param1=0,param2=1) # specify min & max
sdm.sim(50,src.dist="U",param1=0,param2=1)

# sdm.sim(100,src.dist="U",param1=0,param2=1)
# sdm.sim(10,src.dist="X",param1=14)
# sdm.sim(50,src.dist="X",param1=14)
# sdm.sim(100,src.dist="X",param1=14)

t distribution distribution

# Now test with the t distribution
sdm.sim(50,src.dist="T",param1=5)


Exponential distribution

# Now test with the exponential distribution
sdm.sim(50,src.dist="E",param1=1)


Cauchy distribution: The Central Limit Theorem doesn’t work with Cauchy

# Now test with the Cauchy distribution
sdm.sim(100,src.dist="C",param1=0,param2=1)

The Cauchy distribution arises often in environmental science. If X is normally distributed and Y is normally distributed, X/Y is NOT normally distributed. It obeys the Cauchy distributions. The use of ratios, particularly by marine chemists is prevalent in the environmental science.


Confidence limits and other methods for expressing uncertainty

Methods for expressing uncertainty


Methods for expressing uncertainty

What is a confidence interval?

Confidence interval Kendall & Stuart (1979, p. 199) state that the ideas of confidence interval estimation are due to Neyman, especially Neyman (1937). If a confidence interval, say a 95% confidence interval for the mean, is properly calculated, 95% of such intervals will contain the true value of the parameter, e.g., the mean, being estimated.


z and t ratios


z and t ratios


Degrees of Freedom


The rule of 5: Take at least 5 replicates to produce more precise confidence limits (with a t statistic with 4 df)

Student’ss t ratio


A t-ratio for two-sample inference

Sampling distribution of the difference between two averages


Pooled standard deviation


Standard error of the difference


95% confidence interval for the difference in means


If one collects a random sample from a distribution with mean μ and calculates the sample mean () and a 95% confidence interval for the mean (μ), this confidence interval for μ will contain the unknown μ 95% of the time; 5% of the time, it will not contain μ. The probability that a single confidence interval contains μ is either 100% or 0%; μ is either inside or outside the confidence interval.

Bayesians use credence levels and refer to the probability that a parameter falls within a certain range.


95% confidence interval for the difference in means


Testing a hypothesis about the difference in means


Was the difference in beaks consistent with chance?


Randomization distribution & P-value interpretation


Randomization distribution & P-value interpretation


Randomization distribution & P-value interpretation


100 simulated data size 4 from a normal distribution


100 simulated data size 40 from a normal distribution


100 simulated data size 40 from a normal distribution


p=pbinom(8, 100, 0.05, lower.tail = FALSE)
    sprintf("The probability of getting 8 CI's NOT containing the parameter with 100 trials with 95%% CI's is %5.3f",p) 
## [1] "The probability of getting 8 CI's NOT containing the parameter with 100 trials with 95% CI's is 0.063"

Conclusions from Chapter 2


Sleuth Chapter 3: A Closer Look at Assumptions

Dislpay 3.1


Dislpay 3.2


# Run code from RScs0301_3rd
attach(case0301) 
str(case0301) #Seeded is level 1 of Treatment (it's first alphabetically)
## 'data.frame':    52 obs. of  2 variables:
##  $ Rainfall : num  1203 830 372 346 321 ...
##  $ Treatment: Factor w/ 2 levels "Seeded","Unseeded": 2 2 2 2 2 2 2 2 2 2 ...
boxplot(Rainfall ~ Treatment) 

boxplot(log(Rainfall) ~ Treatment) # Boxplots of natural logs of Rainfall 

t.test(log(Rainfall) ~ Treatment, var.equal=TRUE,
       alternative="greater") # 1-sided t-test; alternative: level 1 mean is greater
## 
##  Two Sample t-test
## 
## data:  log(Rainfall) by Treatment
## t = 2.54, df = 50, p-value = 0.007
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is greater than 0
## 95 percent confidence interval:
##  0.3904    Inf
## sample estimates:
##   mean in group Seeded mean in group Unseeded 
##                 5.1342                 3.9904
myTest <- t.test(log(Rainfall) ~ Treatment,  var.equal=TRUE,
                 alternative="two.sided") # 2-sided alternative to get confidence interval
exp(myTest$est[1] - myTest$est[2])  # Back-transform estimate on log scale 
## mean in group Seeded 
##               3.1386
exp(myTest$conf) # Back transform endpoints of confidence interval 
## [1] 1.2723 7.7423
## attr(,"conf.level")
## [1] 0.95
boxplot(log(Rainfall) ~ Treatment, 
        ylab="Log of Rainfall Volume in Target Area (Acre Feet)", 
        names=c("On 26 Seeded Days", "On 26 Unseeded Days"), 
        main="Distributions of Rainfalls from Cloud Seeding Experiment") 

## POLISHED BOXPLOTS FOR PRESENTATION FROM HORTON:
opar <- par(no.readonly=TRUE)  # Store device graphics parameters
par(mar=c(4,4,4,4))   # Change margins to allow more space on right
boxplot(log(Rainfall) ~ Treatment, ylab="Log Rainfall (Acre-Feet)",
        names=c("on 26 seeded days","on 26 unseeded days"), 
        main="Boxplots of Rainfall on Log Scale", col="green", boxlwd=2,
        medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,
        outlwd=2, outpch=21, outbg="green", outcex=1.5      )        
myTicks <- c(1,5,10,100,500,1000,2000,3000) # some tick marks for original scale 
axis(4, at=log(myTicks), label=myTicks)   # Add original-scale axis on right    
mtext("Rainfall (Acre Feet)", side=4, line=2.5) # Add right-side axis label 

par(opar)  # Restore previous graphics parameter settings

Back tranform the difference and CI on the log scale to the standard scale

Difference is 1.14 on the log scale, so the difference is 3.1 Acre-Feet on the standard scale exp(1.14)≈3.1 exp(0.241)≈1.3 exp(2.047)≈7.7

Box-Cox transform analysis to check whether the log transform is the most appropriate: it is

MASS::boxcox(lm(I(Rainfall) ~ Treatment, data=case0301),lam=seq(-1,1.5,.1))

# The 95% CI includes lambda=0, which indicates a log transform is appropriate 
# (Draper & Smith, 1998, p 280)

Levene’s test for equality of variance

# Levene's test with the car package
## default for Levene's test is center=median, SPSS is center = mean
attach(case0301)
## The following objects are masked from case0301 (pos = 3):
## 
##     Rainfall, Treatment
OUT=leveneTest(Rainfall ~ Treatment, center=mean, data=case0301)
OUT   
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)  
## group  1    6.09  0.017 *
##       50                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OUTLN=leveneTest(log(Rainfall) ~ Treatment, center=mean, data=case0301)
OUTLN
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  1    0.06   0.81
##       50
detach(case0301)

Boxplots and density plots of the untransformed and transformed rainfall data

# Nick Horton's (Smith/Amherst) code for the case study 3.1 data

trellis.par.set(theme = col.mosaic()) # get a better color scheme for lattice
options(digits = 3, show.signif.stars = FALSE)
summary(case0301)
##     Rainfall       Treatment 
##  Min.   :   1   Seeded  :26  
##  1st Qu.:  29   Unseeded:26  
##  Median : 117                
##  Mean   : 303                
##  3rd Qu.: 307                
##  Max.   :2746
favstats(Rainfall ~ Treatment, data = case0301)
##   Treatment min   Q1 median  Q3  max mean  sd  n missing
## 1    Seeded 4.1 98.1  221.6 406 2746  442 651 26       0
## 2  Unseeded 1.0 24.8   44.2 159 1203  165 278 26       0
bwplot(Rainfall ~ Treatment, data = case0301)

densityplot(~Rainfall, groups = Treatment, auto.key = TRUE, data = case0301)

case0301$lograin = log(case0301$Rainfall)
favstats(lograin ~ Treatment, data = case0301)
##   Treatment  min   Q1 median   Q3  max mean   sd  n missing
## 1    Seeded 1.41 4.58   5.40 6.00 7.92 5.13 1.60 26       0
## 2  Unseeded 0.00 3.21   3.79 5.07 7.09 3.99 1.64 26       0
bwplot(lograin ~ Treatment, data = case0301)

densityplot(~lograin, groups = Treatment, auto.key = TRUE, data = case0301)

ttestlog = t.test(log(Rainfall) ~ Treatment, var.equal = TRUE, data = case0301)
ttestlog
## 
##  Two Sample t-test
## 
## data:  log(Rainfall) by Treatment
## t = 3, df = 50, p-value = 0.01
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0
## 95 percent confidence interval:
##  0.241 2.047
## sample estimates:
##   mean in group Seeded mean in group Unseeded 
##                   5.13                   3.99
summary(lm(lograin ~ Treatment, data = case0301))
## 
## Call:
## lm(formula = lograin ~ Treatment, data = case0301)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.990 -0.745  0.162  1.019  3.102 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)          5.134      0.318   16.15   <2e-16
## TreatmentUnseeded   -1.144      0.450   -2.54    0.014
## 
## Residual standard error: 1.62 on 50 degrees of freedom
## Multiple R-squared:  0.115,  Adjusted R-squared:  0.0969 
## F-statistic: 6.47 on 1 and 50 DF,  p-value: 0.0141
obslogdiff = -diff(mean(lograin ~ Treatment, data = case0301))
obslogdiff
## Unseeded 
##     1.14
multiplier = exp(obslogdiff)
multiplier
## Unseeded 
##     3.14
ttestlog$conf.int
## [1] 0.241 2.047
## attr(,"conf.level")
## [1] 0.95
exp(ttestlog$conf.int)
## [1] 1.27 7.74
## attr(,"conf.level")
## [1] 0.95

Case Study 3.2 Effects of Agent Orange on Troops in Vietnam — An Observational Study

Dislpay 3.3


# Use tidyverse code to summarize the data
case0302 %>%
  group_by(Veteran) %>%
  summarise(min = min(Dioxin), Q1 = quantile(Dioxin, probs = .25),
            median = median(Dioxin), Q3 = quantile(Dioxin, probs = .75),
            max = max(Dioxin), mean = mean(Dioxin), sd = sd(Dioxin))
## # A tibble: 2 x 8
##   Veteran   min    Q1 median    Q3   max  mean    sd
##   <fct>   <int> <dbl>  <dbl> <dbl> <int> <dbl> <dbl>
## 1 Other       0     3      4     5    15  4.19  2.30
## 2 Vietnam     0     3      4     5    45  4.26  2.64
# The data contain zeros, so a small increment, say 0.05 must be added before
# log transforming
attach(case0302)
boxplot(Dioxin ~ Veteran)  

## HISTOGRAMS FOR PRESENTATION  
opar <- par(no.readonly=TRUE)  # Store device graphics parameter settings
par(mfrow=c(2,1), mar=c(3,3,1,1)) # 2 by 1 layout of plots; change margins    
myBreaks <- (0:46) - .5    # Make breaks for histogram bins  
hist(Dioxin[Veteran=="Other"], breaks=myBreaks, xlim=range(Dioxin),
     col="green", xlab="", ylab="", main="")
text(5,25, 
     "Dioxin in 97 'Other' Veterans; Estimated mean =  4.19 ppt (95% CI: 3.72 to 4.65 ppt)",
     pos=4, cex=.75) # CI from 1-sample t-test & subset=(Veteran="Other")    
hist(Dioxin[Veteran=="Vietnam"],breaks=myBreaks,xlim=range(Dioxin),
     col="green", xlab="", ylab="", main="")   
text(5,160,
     "Dioxin in 646 Vietnam Veterans; Estimated mean =  4.26 ppt (95% CI: 4.06 to 4.64 ppt)",
     pos=4, cex=.75)   
text(8,145,"[Estimated Difference in Means: 0.07 ppt (95% CI: -0.63 to 0.48 ppt)]",
     pos=4, cex=.75)  

par(opar) # Restore previous graphics parameter settings

# Gallagher addendum

## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
              ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(log(Dioxin+0.05) ~ Veteran,xlab="Treatment",
              ylab="ln(Dioxin+0.05)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))

detach(case0302)

Dislpay 3.3


Box-Cox Transformation Analysis of the dioxin data from Jim Robison’s Montana State University code

MASS::boxcox(lm(I(Dioxin+.05) ~ Veteran, data=case0302),lam=seq(0,1,.1))

## 0.4 is the appropriate transform
## Here is the Box Cox transformation from p. 280 of Draper & Smith.
geoMean<- exp(mean(log(case0302$Dioxin+0.05)))
geoMean
## [1] 3.72
lambda<-0.4
# See equation on p 280 Equ 13.2.2 in Draper & Smith's 1998\
# Applied Regression Analysis 3rd edition
case0302$V<-((case0302$Dioxin^lambda) - 1)/(lambda*geoMean^(lambda-1))

## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
              ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(V ~ Veteran,xlab="Treatment",
              ylab="Box-Cox Transform (lambda=0.4)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))

t.test(V ~ Veteran, var.equal=TRUE,
       alternative="less", data=case0302) # 1-sided t-test; alternative: group 1 mean is less  
## 
##  Two Sample t-test
## 
## data:  V by Veteran
## t = -0.8, df = 741, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
##   -Inf 0.181
## sample estimates:
##   mean in group Other mean in group Vietnam 
##                  3.82                  4.01
# Back transformed means
t.test(V ~ Veteran, alternative="less", var.equal=TRUE, 
       subset(case0302, Dioxin < 40)) # t-test on subset for which Dioxin < 40  
## 
##  Two Sample t-test
## 
## data:  V by Veteran
## t = -0.8, df = 740, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
##   -Inf 0.191
## sample estimates:
##   mean in group Other mean in group Vietnam 
##                  3.82                  3.99
t.test(V ~ Veteran, alternative="less", var.equal=TRUE, 
       subset(case0302, Dioxin < 20))
## 
##  Two Sample t-test
## 
## data:  V by Veteran
## t = -0.7, df = 739, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
##  -Inf  0.2
## sample estimates:
##   mean in group Other mean in group Vietnam 
##                  3.82                  3.97
t.test(V ~ Veteran, var.equal=TRUE, data=case0302) # 2-sided--to get confidence interval 
## 
##  Two Sample t-test
## 
## data:  V by Veteran
## t = -0.8, df = 741, p-value = 0.4
## alternative hypothesis: true difference in means between group Other and group Vietnam is not equal to 0
## 95 percent confidence interval:
##  -0.633  0.252
## sample estimates:
##   mean in group Other mean in group Vietnam 
##                  3.82                  4.01

Randomization distribution of the t statistic for the 3.2 dioxin data

Robustness of the two-sample t tools

  • Two major assumptions

  • Both samples are independent samples from normally distributed populations

  • Both samples have identical standard deviations

  • The t test is usually robust to modest violations of the assumptions

  • These assumptions are never strictly met, but the t test is remarkably robust to violations of the assumptions, especially if the sample sizes are nearly equal

  • Robust means the conclusions from test — e.g., p values & confidence limits — are valid even when the assumptions aren’t strictly met, especially if sample sizes nearly equal

  • Transformations of the data are often used

Violations of the Assumptions that matter

  • With equal sample sizes, the t-test is affected moderately by long-tailedness (leptokurtic or peaked distribution) and very little by skewness (the symmetry of the distribution)

  • Kurtosis: peakedness, platykurtic (flat distribution), leptokurtic (peaked distribution)

  • Skewness: symmetry

  • If the two populations have the same standard deviations and approximately the same shape, with unequal sample size, the t tests are affected moderately by long tailedness (leptokurtic) and substantially by skewness

  • If the skewness differs considerably, the tools can be misleading with small and moderate sample sizes


Display 3.4


Display 3.5


Independence assumption important


Display 3.8


Transformations

  • Log (x+1) transform

  • Most biological data, but not usually diversity

  • Needed when there is a multiplicative process in action:

++ exponential growth

++ bank account interest

++ Many pollutants (aquatic and terrestrial & human)

+++ polynuclear aromatic hydrocarbons

+++ fecal coliform bacteria, but not usually metals

  • Calculate the mean and 95% CI on the transformed scale and then back-transform. For data that are symmetric on the log-transformed scale, the back-transformed mean of the log-transformed data ≈ median. However, this back-transformed mean is the geometric mean and may differ from the median.

  • Many other transforms

  • Arcsin (√Y) for frequency data ranging between 0 and 1 (but the logit transform may be better)

  • Logit transform: log [Y/(1-Y)]

++ % silt clay, but the data must be on the interval 0 to 1

  • Square roots for counts

  • reciprocal for waiting times

  • logit transforms for proportions between 0 and 1 (log (P/(1-P))

  • “… it is recommended here that a trial-and error approach, with graphical analysis, be used instead.” Sleuth


Box-Cox Transformation of the 3.1 data


Matlab Box-Cox analysis of Case 3.1 data


Analysis of the transformed Case 3.1 data


CDC serum pollutant survey cover


CDC serum pollutant survey: lead


CDC serum pollutant survey: DDE


Outliers and resistant procedures

  • A procedure is resistant if it doesn’t change much when a small part of the data changes, perhaps drastically.

  • t tools are based on averages and are strongly affected by outliers

  • Chapter 4 introduces tests based on ranks, which protect against outliers (but not against unequal variance)

  • Practical strategies

  • Do side-by-side box plots to analyze departures from assumptions

  • Consider & test for serial spatial and cluster effects

  • Analyze spatial patterns in the residuals, use more sophisticated tools

++ Legendre & Legendre: if positive spatial autocorrelation, decrease the p value, e.g., test for differences at the 0.001 level instead of the 0.05 level to claim P (Type I error (alpha level) <0.05)


Dealing with outliers 1


Dealing with outliers 2


Dealing with outliers 2

Practical strategies for dealing with outliers

  • Outlier strategy

  • Run analysis with and without outliers

  • Throw-out outliers only if there is very compelling evidence to do so, and document this data paring or culling

  • Note that outlier removal has created tremendous problems

++ POC flux to the deep sea

++ Failure to detect the ozone hole

++ Mendel’s data: 1:2:1 ratios and the chi-square test; documented by Fisher & Bruce Weir

++ Milliken’s study of the charge of the electron


Dealing with outliers 1


Dealing with outliers 2


Conclusions from Chapter 3

  • The t tools are quite robust to violations of the assumptions, especially if sample sizes are equal

  • If variances are unequal and sample sizes unequal, test p values are liberal if the smaller group has the larger variance

  • Boxplots reveal the major problems for Student’s t test: Unequal variances between groups and outliers

  • Transformations, especially the log transform are often necessary

  • Beware of outlier removal, but sometimes it is necessary