Display 2.1 from Statistical Sleuth 3rd edition
library(aplpack)  # provides stem.leaf.backback()
attach(case0201)
stem.leaf.backback(Depth[Year == 1976],
                   Depth[Year == 1978],
                   back.to.back = TRUE, unit = 0.1,
                   trim.outliers = FALSE,
                   show.no.depths = TRUE)
## ___________________________________________________
## 1 | 2: represents 1.2, leaf unit: 0.1
## Depth[Year == 1976] Depth[Year == 1978]
## ___________________________________________________
## 2| 6* |
## 8| 6. |
## 411| 7* |1
## 98| 7. |9
## 44420| 8* |044
## 9999977765555| 8. |778
## 444321111100000| 9* |0011123344
## 99999888887777655| 9. |5666666777789999
## 444433322111111000| 10* |000222223333334444
## 8766655555| 10. |55555566666777778999
## 440| 11* |0000111134444
## 7| 11. |5667
## | 12* |
## ___________________________________________________
## n: 89 89
## ___________________________________________________
detach(case0201)
Weiner, The Beak of the Finch
The Grants, 40 Years of Evolution
## Plot back-to-back histograms with histbackback() from Frank Harrell's Hmisc package
library(Hmisc)
options(digits=1)
out <- histbackback(split(case0201$Depth, case0201$Year), probability=FALSE,
                    xlim=c(-30,30), ylab="Beak Depth (mm)",
                    main = 'Sleuth Case 2.1')
# Add color by overplotting the two sides
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
options(digits=5) # restore more digits (R's default is 7)
MATLAB, SPSS, and R back-to-back histograms
boxplot(Depth ~ Year, data=case0201,
ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),
main="Beak Depths of Darwin Finches in 1976 and 1978", col="green",
boxlwd=2)
# car::Boxplot labels the outliers
library(car) # provides Boxplot()
Boxplot(Depth ~ Year,
ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),
main="Beak Depths of Darwin Finches in 1976 and 1978", col="green",
boxlwd=2, medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,
outlwd=2, outpch=21, outbg="green", outcex=1.5, data=case0201)
## [1] "1" "2" "90" "91"
# car::Boxplot returns the labels of the points it identifies as outliers,
# which is why they print.
mean(case0201$Depth[case0201$Year==1978]) - mean(case0201$Depth[case0201$Year==1976])
## [1] 0.66854
yearFactor <- factor(case0201$Year) # Convert the numerical variable Year into a factor
# with 2 levels. 1976 is "group 1" (it comes first alphanumerically)
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE) # 2-sample t-test; 2-sided by default
##
## Two Sample t-test
##
## data: case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 8.6e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is not equal to 0
## 95 percent confidence interval:
## -0.95641 -0.38067
## sample estimates:
## mean in group 1976 mean in group 1978
## 9.4697 10.1382
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE,
alternative = "less") # 1-sided; alternative: group 1 mean is less
##
## Two Sample t-test
##
## data: case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 4.3e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
## 95 percent confidence interval:
## -Inf -0.42734
## sample estimates:
## mean in group 1976 mean in group 1978
## 9.4697 10.1382
MATLAB plot of Student’s t distribution, 176 degrees of freedom
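The original figure is from MATLAB; a minimal R equivalent (my sketch, not part of the original script) plots the t density with 176 df and marks the observed statistic:
curve(dt(x, df = 176), from = -6, to = 6, ylab = "Density",
      main = "Student's t distribution, 176 df")
abline(v = -4.58, lty = 2) # observed t statistic from the two-sample test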
“These data provide overwhelming evidence that the mean beak depth increased from 1976 to 1978 (one-sided p-value < 0.00001 from a two sample t-test). The 1978 mean was estimated to exceed the 1976 (pre-drought) mean by 0.67 mm (95% confidence interval 0.38 to 0.96 mm).”
This was an observational study, and the entire population was surveyed. Even though the entire population was sampled, a t test was appropriate; lack of independence, however, is an issue.
Display 2.2 from Statistical Sleuth 3rd edition
str(case0202)
## 'data.frame': 15 obs. of 2 variables:
## $ Unaffected: num 1.94 1.44 1.56 1.58 2.06 1.66 1.75 1.77 1.78 1.92 ...
## $ Affected : num 1.27 1.63 1.47 1.39 1.93 1.26 1.71 1.67 1.28 1.85 ...
diff <- case0202$Unaffected - case0202$Affected # paired differences (masks base::diff)
summary(diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.190 0.055 0.110 0.199 0.315 0.670
# A Tukey Stem & leaf plot
stem(diff, scale = 5, width = 80)
##
## The decimal point is 1 digit(s) to the left of the |
##
## -1 | 9
## -1 |
## -0 |
## -0 |
## 0 | 234
## 0 | 79
## 1 | 013
## 1 | 9
## 2 | 3
## 2 |
## 3 |
## 3 |
## 4 | 0
## 4 |
## 5 | 0
## 5 | 9
## 6 |
## 6 | 7
boxplot(diff,
ylab="Difference in Hippocampus Volume (cubic cm)",
xlab="15 Sets of Twins, One Affected with Schizophrenia",
main="Hippocampus Difference: Unaffected Twin Minus Affected Twin")
abline(h=0,lty=2) # Draw a dashed (lty=2) horizontal line at 0
# BOXPLOT FOR PRESENTATION from Sleuth3 vignette
boxplot(diff,
ylab="Difference in Hippocampus Volume (cubic cm)",
xlab="15 Sets of Twins, One Affected with Schizophrenia",
main="Hippocampus Difference: Unaffected Minus Affected Twin",
col="green", boxlwd=2, medlwd=2, whisklty=1, whisklwd=2,
staplewex=.2, staplelwd=2, outlwd=2, outpch=21, outbg="green",
outcex=1.5)
abline(h=0,lty=2)
# Paired t test
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
##
## Paired t-test
##
## data: case0202$Unaffected and case0202$Affected
## t = 3.23, df = 14, p-value = 0.0061
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.066704 0.330629
## sample estimates:
## mean of the differences
## 0.19867
There is substantial evidence that the mean difference in left hippocampus volumes between schizophrenic individuals and their nonschizophrenic twins is nonzero (two-sided p-value = 0.006, from a paired t test). It is estimated that the mean volume is 0.20 cm³ smaller for those with schizophrenia (about 11% smaller); a 95% confidence interval for the difference is 0.07 to 0.33 cm³.
p-value (modified from K. Wuensch and reef fish, edstat, 3/19/03, and October 2005): The p-value is the probability of obtaining data as or more discrepant with respect to the null and alternative hypotheses than those in the present sample, assuming that the null hypothesis is exactly correct. In a one-sample z-test of the hypothesis μ = μ₀, a z-score of 1.96 has p-value 0.975, 0.05, or 0.025 depending on whether the alternative hypothesis is μ < μ₀, μ ≠ μ₀, or μ > μ₀, respectively.
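As a quick check (my addition, not in the original notes), the three p-values quoted above can be computed directly in R:
pnorm(1.96)                         # alternative mu < mu_0: 0.975
2 * pnorm(1.96, lower.tail = FALSE) # alternative mu != mu_0: 0.05
pnorm(1.96, lower.tail = FALSE)     # alternative mu > mu_0: 0.025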
Sterne & Smith’s (2001) and Sleuth’s interpretation of p-values
Display 2.4, demonstrating the Central Limit Theorem
The Central Limit Theorem for sums and means
Simulation to demonstrate the central limit theorem
Skewed distributions are slow to obey the central limit theorem.
Most environmental data are positively skewed.
# Program written by Nicole Radziwill at r-bloggers
# https://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/
# modified by Eugene.Gallagher@umb.edu, 10/10/19: added QQplot, last rev 9/14/21
sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
r <- 10000 # Number of replications/samples - DO NOT ADJUST
# This produces a matrix of observations with
# n columns and r rows. Each row is one sample:
my.samples <- switch(src.dist,
"E" = matrix(rexp(n*r,param1),r),
"N" = matrix(rnorm(n*r,param1,param2),r),
"U" = matrix(runif(n*r,param1,param2),r),
"P" = matrix(rpois(n*r,param1),r),
"C" = matrix(rcauchy(n*r,param1,param2),r),
"B" = matrix(rbinom(n*r,param1,param2),r),
"G" = matrix(rgamma(n*r,param1,param2),r),
"X" = matrix(rchisq(n*r,param1),r),
"T" = matrix(rt(n*r,param1),r))
all.sample.sums <- apply(my.samples,1,sum)
all.sample.means <- apply(my.samples,1,mean)
all.sample.vars <- apply(my.samples,1,var)
op <- par(no.readonly = TRUE) # save plotting parameters in op
par(mfrow=c(2,2)) # will plot figures in a 2 x 2 array
hist(my.samples[1,],col="gray",main="Distribution of One Sample")
hist(all.sample.sums,col="gray",main="Sampling Distribution of the Sum")
hist(all.sample.means,col="gray",main="Sampling Distribution of the Mean")
# hist(all.sample.vars,col="gray",main="Sampling Distribution of the Variance")
# add a qq plot for the 4th graph
qqnorm(all.sample.means,col="gray",main="Quantile-Quantile Plot of Means")
qqline(all.sample.means, col="red")
# Returns the plotting parameters to their original values.
par(op)
}
# uncomment just one of these calls to the function sdm.sim at a time.
# sdm.sim(10,src.dist="B",param1=10,param2=0.5)
# sdm.sim(50,src.dist="B",param1=10,param2=0.5)
# sdm.sim(100,src.dist="B",param1=10,param2=0.5)
# sdm.sim(10,src.dist="C",param1=0,param2=1) # This produces results where the CLT fails
## See: https://stats.stackexchange.com/questions/74268/cauchy-distribution-and-central-limit-theorem
## https://web.ma.utexas.edu/users/mks/M358KInstr/Cauchy.pdf
# sdm.sim(50,src.dist="C",param1=0,param2=1)
# sdm.sim(100,src.dist="C",param1=0,param2=1)
# sdm.sim(10,src.dist="E",param1=1)
# sdm.sim(50,src.dist="E",param1=1)
# sdm.sim(100,src.dist="E",param1=1)
# sdm.sim(10,src.dist="G",param1=10,param2=2)
# sdm.sim(50,src.dist="G",param1=5,param2=2)
# sdm.sim(100,src.dist="G",param1=5,param2=2)
# sdm.sim(10,src.dist="N",param1=20,param2=3)
# sdm.sim(50,src.dist="N",param1=20,param2=3)
# sdm.sim(100,src.dist="N",param1=20,param2=3)
# sdm.sim(10,src.dist="P",param1=0.5)
# sdm.sim(50,src.dist="P",param1=0.5)
# sdm.sim(100,src.dist="P",param1=0.5)
# sdm.sim(10,src.dist="T",param1=5)
# sdm.sim(50,src.dist="T",param1=5)
# sdm.sim(100,src.dist="T",param1=5)
# sdm.sim(10,src.dist="U",param1=0,param2=1) # specify min & max
sdm.sim(50,src.dist="U",param1=0,param2=1)
# sdm.sim(100,src.dist="U",param1=0,param2=1)
# sdm.sim(10,src.dist="X",param1=14)
# sdm.sim(50,src.dist="X",param1=14)
# sdm.sim(100,src.dist="X",param1=14)
# Now test with the t distribution
sdm.sim(50,src.dist="T",param1=5)
# Now test with the exponential distribution
sdm.sim(50,src.dist="E",param1=1)
# Now test with the Cauchy distribution
sdm.sim(100,src.dist="C",param1=0,param2=1)
The Cauchy distribution arises often in environmental science. If X and Y are independent, zero-mean normal variables, X/Y is NOT normally distributed; it follows a Cauchy distribution. The use of ratios, particularly by marine chemists, is prevalent in environmental science.
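A small simulation (my addition) makes the point: the empirical quantiles of a ratio of independent standard normals track the theoretical Cauchy quantiles:
set.seed(1) # arbitrary seed for reproducibility
ratio <- rnorm(1e5) / rnorm(1e5) # ratio of independent standard normals
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(simulated = quantile(ratio, probs),
            cauchy    = qcauchy(probs)), 2) # the two rows should agree closely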
Methods for expressing uncertainty
Confidence interval Kendall & Stuart (1979, p. 199) state that the ideas of confidence interval estimation are due to Neyman, especially Neyman (1937). If a confidence interval, say a 95% confidence interval for the mean, is properly calculated, 95% of such intervals will contain the true value of the parameter, e.g., the mean, being estimated.
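Neyman's coverage property is easy to verify by simulation; a minimal sketch (my addition) using one-sample t intervals:
set.seed(2) # arbitrary seed for reproducibility
covered <- replicate(10000, {
  ci <- t.test(rnorm(25, mean = 10, sd = 2))$conf.int # 95% CI for the mean
  ci[1] < 10 && 10 < ci[2] # does the interval contain the true mean?
})
mean(covered) # proportion of intervals covering 10; should be near 0.95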
z and t ratios
Degrees of Freedom
Student’s t ratio
Sampling distribution of the difference between two averages
Pooled standard deviation
Standard error of the difference
95% confidence interval for the difference in means
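These four quantities can be computed by hand; a minimal sketch (my addition, assuming the Sleuth3 case0201 data are loaded) that reproduces the t.test() results above:
x1 <- case0201$Depth[case0201$Year == 1976]
x2 <- case0201$Depth[case0201$Year == 1978]
n1 <- length(x1); n2 <- length(x2)
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)) # pooled SD
se <- sp * sqrt(1 / n1 + 1 / n2)    # standard error of the difference
tstat <- (mean(x2) - mean(x1)) / se # t ratio with n1 + n2 - 2 df
ci <- (mean(x2) - mean(x1)) + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se
c(t = tstat, lower = ci[1], upper = ci[2]) # t = 4.58, 95% CI 0.38 to 0.96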
If one collects a random sample from a distribution with mean μ and calculates the sample mean (x̄) and a 95% confidence interval for the mean μ, this confidence interval will contain the unknown μ 95% of the time; 5% of the time, it will not contain μ. The probability that a single confidence interval contains μ is either 100% or 0%; μ is either inside or outside the confidence interval.
Bayesians use credible intervals and refer to the probability that a parameter falls within a certain range.
95% confidence interval for the difference in means
Testing a hypothesis about the difference in means
Was the difference in beaks consistent with chance?
Randomization distribution & P-value interpretation
100 simulated samples of size 4 from a normal distribution
100 simulated samples of size 40 from a normal distribution
p = pbinom(8, 100, 0.05, lower.tail = FALSE) # P(more than 8 of 100 CIs miss)
sprintf("The probability that more than 8 of 100 95%% CI's fail to contain the parameter is %5.3f", p)
## [1] "The probability that more than 8 of 100 95% CI's fail to contain the parameter is 0.063"
This chapter covers 2-sample tests for the difference in means when the standard deviations of the two populations are equal
Independent samples t test
Paired t test.
With common σ, the distribution of the difference in sample means converges to a normal distribution with mean μ1 - μ2 and standard deviation σ·sqrt(1/n1 + 1/n2), as described by the Central Limit Theorem
Since σ must be estimated for the independent samples t test, the pooled standard deviation, with n1 + n2 - 2 df, is used to estimate the standard error of the difference in means
5 ways of expressing the uncertainty in an estimate: standard deviation, standard error, 95% confidence interval, coefficient of variation, and more rarely, the analytical error.
The t ratio, the observed difference divided by the standard error of the difference, is used to calculate p values and confidence limits for samples
Fisher noted that the t ratio has meaning only so far as it estimates the p value that would result from a randomization test using the underlying data.
Fisher and Neyman & Pearson agreed that p values should be used in evaluating tests and all converged on 0.05 as a rough rule of thumb for ruling out chance alone.
Gelman & Hill (2007) and Gelman et al. (2021) don’t use the significant/nonsignificant dichotomy, p-values, or confidence intervals, relying instead on the estimate and its standard error.
Neyman & Pearson felt Fisher hadn’t gone far enough in just rejecting null hypotheses with a test & p value. Neyman & Pearson proposed Type II error & Neyman introduced confidence limits
Confidence intervals calculated for multiple sample estimates will contain the true parameter 95% of the time
For a single sample, the probability that the confidence interval contains the true parameter is 0 or 1
In reporting results, you should include:
The null and alternative hypotheses (at least specifying whether the alternative is 1-tailed or 2-tailed)
The statistical test & p value
The effect size and estimate of error, either the confidence interval or the standard error
A concluding statement about the hypothesis
Display 3.1
Display 3.2
# Run code from RScs0301_3rd
attach(case0301)
str(case0301) #Seeded is level 1 of Treatment (it's first alphabetically)
## 'data.frame': 52 obs. of 2 variables:
## $ Rainfall : num 1203 830 372 346 321 ...
## $ Treatment: Factor w/ 2 levels "Seeded","Unseeded": 2 2 2 2 2 2 2 2 2 2 ...
boxplot(Rainfall ~ Treatment)
boxplot(log(Rainfall) ~ Treatment) # Boxplots of natural logs of Rainfall
t.test(log(Rainfall) ~ Treatment, var.equal=TRUE,
alternative="greater") # 1-sided t-test; alternative: level 1 mean is greater
##
## Two Sample t-test
##
## data: log(Rainfall) by Treatment
## t = 2.54, df = 50, p-value = 0.007
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is greater than 0
## 95 percent confidence interval:
## 0.3904 Inf
## sample estimates:
## mean in group Seeded mean in group Unseeded
## 5.1342 3.9904
myTest <- t.test(log(Rainfall) ~ Treatment, var.equal=TRUE,
alternative="two.sided") # 2-sided alternative to get confidence interval
exp(myTest$est[1] - myTest$est[2]) # Back-transform estimate on log scale
## mean in group Seeded
## 3.1386
exp(myTest$conf) # Back transform endpoints of confidence interval
## [1] 1.2723 7.7423
## attr(,"conf.level")
## [1] 0.95
boxplot(log(Rainfall) ~ Treatment,
ylab="Log of Rainfall Volume in Target Area (Acre Feet)",
names=c("On 26 Seeded Days", "On 26 Unseeded Days"),
main="Distributions of Rainfalls from Cloud Seeding Experiment")
## POLISHED BOXPLOTS FOR PRESENTATION FROM HORTON:
opar <- par(no.readonly=TRUE) # Store device graphics parameters
par(mar=c(4,4,4,4)) # Change margins to allow more space on right
boxplot(log(Rainfall) ~ Treatment, ylab="Log Rainfall (Acre-Feet)",
names=c("on 26 seeded days","on 26 unseeded days"),
main="Boxplots of Rainfall on Log Scale", col="green", boxlwd=2,
medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,
outlwd=2, outpch=21, outbg="green", outcex=1.5 )
myTicks <- c(1,5,10,100,500,1000,2000,3000) # some tick marks for original scale
axis(4, at=log(myTicks), labels=myTicks) # Add original-scale axis on right
mtext("Rainfall (Acre Feet)", side=4, line=2.5) # Add right-side axis label
par(opar) # Restore previous graphics parameter settings
The difference is 1.14 on the log scale, so on the original scale the effect is multiplicative, not additive: seeded rainfall is estimated to be exp(1.14) ≈ 3.1 times unseeded rainfall, with 95% CI from exp(0.241) ≈ 1.3 to exp(2.047) ≈ 7.7.
MASS::boxcox(lm(I(Rainfall) ~ Treatment, data=case0301),lam=seq(-1,1.5,.1))
# The 95% CI includes lambda=0, which indicates a log transform is appropriate
# (Draper & Smith, 1998, p 280)
# Levene's test with the car package
## default for Levene's test is center=median, SPSS is center = mean
attach(case0301)
## The following objects are masked from case0301 (pos = 3):
##
## Rainfall, Treatment
OUT=leveneTest(Rainfall ~ Treatment, center=mean, data=case0301)
OUT
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 6.09 0.017 *
## 50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OUTLN=leveneTest(log(Rainfall) ~ Treatment, center=mean, data=case0301)
OUTLN
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.06 0.81
## 50
detach(case0301)
# Nick Horton's (Smith/Amherst) code for the Case 3.1 data
library(mosaic) # provides favstats() and col.mosaic(); loads lattice
trellis.par.set(theme = col.mosaic()) # get a better color scheme for lattice
options(digits = 3, show.signif.stars = FALSE)
summary(case0301)
## Rainfall Treatment
## Min. : 1 Seeded :26
## 1st Qu.: 29 Unseeded:26
## Median : 117
## Mean : 303
## 3rd Qu.: 307
## Max. :2746
favstats(Rainfall ~ Treatment, data = case0301)
## Treatment min Q1 median Q3 max mean sd n missing
## 1 Seeded 4.1 98.1 221.6 406 2746 442 651 26 0
## 2 Unseeded 1.0 24.8 44.2 159 1203 165 278 26 0
bwplot(Rainfall ~ Treatment, data = case0301)
densityplot(~Rainfall, groups = Treatment, auto.key = TRUE, data = case0301)
case0301$lograin = log(case0301$Rainfall)
favstats(lograin ~ Treatment, data = case0301)
## Treatment min Q1 median Q3 max mean sd n missing
## 1 Seeded 1.41 4.58 5.40 6.00 7.92 5.13 1.60 26 0
## 2 Unseeded 0.00 3.21 3.79 5.07 7.09 3.99 1.64 26 0
bwplot(lograin ~ Treatment, data = case0301)
densityplot(~lograin, groups = Treatment, auto.key = TRUE, data = case0301)
ttestlog = t.test(log(Rainfall) ~ Treatment, var.equal = TRUE, data = case0301)
ttestlog
##
## Two Sample t-test
##
## data: log(Rainfall) by Treatment
## t = 3, df = 50, p-value = 0.01
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0
## 95 percent confidence interval:
## 0.241 2.047
## sample estimates:
## mean in group Seeded mean in group Unseeded
## 5.13 3.99
summary(lm(lograin ~ Treatment, data = case0301))
##
## Call:
## lm(formula = lograin ~ Treatment, data = case0301)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.990 -0.745 0.162 1.019 3.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.134 0.318 16.15 <2e-16
## TreatmentUnseeded -1.144 0.450 -2.54 0.014
##
## Residual standard error: 1.62 on 50 degrees of freedom
## Multiple R-squared: 0.115, Adjusted R-squared: 0.0969
## F-statistic: 6.47 on 1 and 50 DF, p-value: 0.0141
obslogdiff = -diff(mean(lograin ~ Treatment, data = case0301))
obslogdiff
## Unseeded
## 1.14
multiplier = exp(obslogdiff)
multiplier
## Unseeded
## 3.14
ttestlog$conf.int
## [1] 0.241 2.047
## attr(,"conf.level")
## [1] 0.95
exp(ttestlog$conf.int)
## [1] 1.27 7.74
## attr(,"conf.level")
## [1] 0.95
Display 3.3
# Use tidyverse code to summarize the data
library(dplyr) # provides %>%, group_by(), summarise()
case0302 %>%
group_by(Veteran) %>%
summarise(min = min(Dioxin), Q1 = quantile(Dioxin, probs = .25),
median = median(Dioxin), Q3 = quantile(Dioxin, probs = .75),
max = max(Dioxin), mean = mean(Dioxin), sd = sd(Dioxin))
## # A tibble: 2 x 8
## Veteran min Q1 median Q3 max mean sd
## <fct> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Other 0 3 4 5 15 4.19 2.30
## 2 Vietnam 0 3 4 5 45 4.26 2.64
# The data contain zeros, so a small increment, say 0.05, must be added
# before log-transforming
attach(case0302)
boxplot(Dioxin ~ Veteran)
## HISTOGRAMS FOR PRESENTATION
opar <- par(no.readonly=TRUE) # Store device graphics parameter settings
par(mfrow=c(2,1), mar=c(3,3,1,1)) # 2 by 1 layout of plots; change margins
myBreaks <- (0:46) - .5 # Make breaks for histogram bins
hist(Dioxin[Veteran=="Other"], breaks=myBreaks, xlim=range(Dioxin),
col="green", xlab="", ylab="", main="")
text(5,25,
"Dioxin in 97 'Other' Veterans; Estimated mean = 4.19 ppt (95% CI: 3.72 to 4.65 ppt)",
pos=4, cex=.75) # CI from 1-sample t-test with subset = (Veteran == "Other")
hist(Dioxin[Veteran=="Vietnam"],breaks=myBreaks,xlim=range(Dioxin),
col="green", xlab="", ylab="", main="")
text(5,160,
"Dioxin in 646 Vietnam Veterans; Estimated mean = 4.26 ppt (95% CI: 4.06 to 4.64 ppt)",
pos=4, cex=.75)
text(8,145,"[Estimated Difference in Means: 0.07 ppt (95% CI: -0.63 to 0.48 ppt)]",
pos=4, cex=.75)
par(opar) # Restore previous graphics parameter settings
# Gallagher addendum
## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(log(Dioxin+0.05) ~ Veteran,xlab="Treatment",
ylab="ln(Dioxin+0.05)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))
detach(case0302)
Display 3.3
MASS::boxcox(lm(I(Dioxin+.05) ~ Veteran, data=case0302),lam=seq(0,1,.1))
## 0.4 is the appropriate transform
## Here is the Box Cox transformation from p. 280 of Draper & Smith.
geoMean<- exp(mean(log(case0302$Dioxin+0.05)))
geoMean
## [1] 3.72
lambda<-0.4
# See Equation 13.2.2, p. 280, in Draper & Smith's 1998
# Applied Regression Analysis, 3rd edition
case0302$V<-((case0302$Dioxin^lambda) - 1)/(lambda*geoMean^(lambda-1))
## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(V ~ Veteran,xlab="Treatment",
ylab="Box-Cox Transform (lambda=0.4)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))
t.test(V ~ Veteran, var.equal=TRUE,
alternative="less", data=case0302) # 1-sided t-test; alternative: group 1 mean is less
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 741, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.181
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 4.01
# Sensitivity checks: repeat the test with the largest Dioxin values removed
t.test(V ~ Veteran, alternative="less", var.equal=TRUE,
subset(case0302, Dioxin < 40)) # t-test on subset for which Dioxin < 40
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 740, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.191
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 3.99
t.test(V ~ Veteran, alternative="less", var.equal=TRUE,
subset(case0302, Dioxin < 20))
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.7, df = 739, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.2
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 3.97
t.test(V ~ Veteran, var.equal=TRUE, data=case0302) # 2-sided--to get confidence interval
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 741, p-value = 0.4
## alternative hypothesis: true difference in means between group Other and group Vietnam is not equal to 0
## 95 percent confidence interval:
## -0.633 0.252
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 4.01
Randomization distribution of the t statistic for the 3.2 dioxin data
Two major assumptions
Both samples are independent samples from normally distributed populations
Both samples have identical standard deviations
The t test is usually robust to modest violations of the assumptions
These assumptions are never strictly met, but the t test is remarkably robust to violations of the assumptions, especially if the sample sizes are nearly equal
Robust means the conclusions from the test, e.g., p values & confidence limits, are valid even when the assumptions aren’t strictly met, especially if sample sizes are nearly equal
Transformations of the data are often used
With equal sample sizes, the t test is affected moderately by long-tailedness (a leptokurtic, or peaked, distribution) and very little by skewness (asymmetry of the distribution)
Kurtosis: peakedness; platykurtic (flat distribution), leptokurtic (peaked distribution)
Skewness: asymmetry
If the two populations have the same standard deviations and approximately the same shape but unequal sample sizes, the t tests are affected moderately by long-tailedness and substantially by skewness (see the sketch below)
If the skewness of the two populations differs considerably, the tools can be misleading at small and moderate sample sizes
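A rough simulation (my addition) of the skewness claim: two samples from the same skewed (exponential) population, with unequal n, pushed through the pooled t test:
set.seed(3) # arbitrary seed for reproducibility
pvals <- replicate(10000,
  t.test(rexp(10), rexp(40), var.equal = TRUE)$p.value) # same mean, skewed parent
mean(pvals < 0.05) # rejection rate drifts away from the nominal 0.05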
Display 3.4
Display 3.5
Independence assumption important
Display 3.8
Log (x+1) transform
Most biological data, but not usually diversity
Needed when there is a multiplicative process in action:
++ exponential growth
++ bank account interest
++ Many pollutants (aquatic and terrestrial & human)
+++ polynuclear aromatic hydrocarbons
+++ fecal coliform bacteria, but not usually metals
Calculate the mean and 95% CI on the transformed scale and then back-transform. The back-transformed mean of log-transformed data is the geometric mean; when the data are symmetric on the log scale it approximates the median, but otherwise it may differ from the median (see the sketch below).
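A short illustration (my addition) with the seeded-day rainfall from case0301:
seeded <- case0301$Rainfall[case0301$Treatment == "Seeded"]
exp(mean(log(seeded))) # geometric mean, about 170 acre-feet
median(seeded)         # sample median, 221.6 acre-feet (favstats output above)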
Many other transforms (R one-liners follow this list):
Arcsin (√Y) for frequency data ranging between 0 and 1 (but the logit transform may be better)
Logit transform: log [Y/(1-Y)]
++ % silt clay, but the data must be on the interval 0 to 1
Square roots for counts
reciprocal for waiting times
logit transforms for proportions between 0 and 1: log(P/(1-P))
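One-line R versions of these transforms (my addition); p is a proportion strictly between 0 and 1, y a count, and w a waiting time:
p <- c(0.1, 0.5, 0.9); y <- c(0, 3, 12); w <- c(0.5, 2, 8)
asin(sqrt(p))    # arcsine square root for proportions
log(p / (1 - p)) # logit
sqrt(y)          # square root for counts
1 / w            # reciprocal for waiting times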
“… it is recommended here that a trial-and-error approach, with graphical analysis, be used instead.” (Sleuth)
Box-Cox transformation of the Case 3.1 data
MATLAB Box-Cox analysis of the Case 3.1 data
Analysis of the transformed Case 3.1 data
CDC serum pollutant survey cover
CDC serum pollutant survey: lead
CDC serum pollutant survey: DDE
A procedure is resistant if it doesn’t change much when a small part of the data changes, perhaps drastically.
t tools are based on averages and are strongly affected by outliers
Chapter 4 introduces tests based on ranks, which protect against outliers (but not against unequal variance)
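A toy example (my addition) of resistance: one wild value drags the mean and weakens the t test, but barely moves the median or the rank-based test:
x <- c(5.1, 4.8, 5.3, 5.0, 4.9)
y <- c(6.0, 6.2, 5.9, 6.1, 60) # 60 is a gross outlier
c(mean = mean(y), median = median(y)) # mean jumps to 16.84; median stays 6.1
t.test(x, y)$p.value      # outlier inflates the variance; the t test misses the shift
wilcox.test(x, y)$p.value # rank test ignores the outlier's magnitude: p = 0.0079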
Practical strategies
Do side-by-side box plots to analyze departures from assumptions
Consider & test for serial, spatial, and cluster effects
Analyze spatial patterns in the residuals; use more sophisticated tools
++ Legendre & Legendre: with positive spatial autocorrelation, tighten the significance threshold, e.g., test at the 0.001 level instead of 0.05 to keep the Type I error rate (alpha) below 0.05
Dealing with outliers 1
Dealing with outliers 2
Outlier strategy
Run analysis with and without outliers
Throw out outliers only if there is very compelling evidence to do so, and document this data paring or culling
Note that outlier removal has created tremendous problems
++ POC flux to the deep sea
++ Failure to detect the ozone hole
++ Mendel’s data: 1:2:1 ratios and the chi-square test; documented by Fisher & Bruce Weir
++ Millikan’s study of the charge of the electron
The t tools are quite robust to violations of the assumptions, especially if sample sizes are equal
If variances are unequal and sample sizes unequal, the test's p values are liberal when the smaller group has the larger variance (see the sketch after this list)
Boxplots reveal the major problems for Student’s t test: Unequal variances between groups and outliers
Transformations, especially the log transform, are often necessary
Beware of outlier removal, but sometimes it is necessary
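A closing sketch (my addition) of the liberal-p-value problem and its standard fix: when the smaller group has the larger variance, the pooled test rejects a true null too often, while Welch's test (t.test's default, var.equal = FALSE) holds its level:
set.seed(4) # arbitrary seed for reproducibility
pooled <- replicate(10000,
  t.test(rnorm(10, sd = 3), rnorm(40, sd = 1), var.equal = TRUE)$p.value)
welch <- replicate(10000,
  t.test(rnorm(10, sd = 3), rnorm(40, sd = 1))$p.value)
c(pooled = mean(pooled < 0.05), welch = mean(welch < 0.05)) # pooled runs well above 0.05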