Display 2.1 from Statistical Sleuth 3rd edition
library(aplpack)  # provides stem.leaf.backback()
attach(case0201)
stem.leaf.backback(Depth[Year == 1976],
                   Depth[Year == 1978],
                   back.to.back = TRUE, unit = 0.1,
                   trim.outliers = FALSE,
                   show.no.depths = TRUE)
## ___________________________________________________
## 1 | 2: represents 1.2, leaf unit: 0.1
## Depth[Year == 1976] Depth[Year == 1978]
## ___________________________________________________
## 2| 6* |
## 8| 6. |
## 411| 7* |1
## 98| 7. |9
## 44420| 8* |044
## 9999977765555| 8. |778
## 444321111100000| 9* |0011123344
## 99999888887777655| 9. |5666666777789999
## 444433322111111000| 10* |000222223333334444
## 8766655555| 10. |55555566666777778999
## 440| 11* |0000111134444
## 7| 11. |5667
## | 12* |
## ___________________________________________________
## n: 89 89
## ___________________________________________________
detach(case0201)
Weiner, The Beak of the Finch
The Grants, 40 Years of Evolution
## Plot back-to-back histograms with histbackback() from Frank Harrell's Hmisc package
library(Hmisc)
options(digits=1)
out <- histbackback(split(case0201$Depth, case0201$Year), probability=FALSE,
                    xlim=c(-30,30), ylab="Beak Depth (mm)",
                    main = 'Sleuth Case 2.1')
# Add color by overplotting the two sides
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
options(digits=5) # restore more digits (R's default is 7)
MATLAB, SPSS, and R back-to-back histograms
boxplot(Depth ~ Year, data=case0201,
ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),
main="Beak Depths of Darwin Finches in 1976 and 1978", col="green",
boxlwd=2)
# car::Boxplot labels the outliers
library(car) # provides Boxplot()
Boxplot(Depth ~ Year,
ylab="Beak Depth (mm)", names=c("89 Finches in 1976","89 Finches in 1978"),
main="Beak Depths of Darwin Finches in 1976 and 1978", col="green",
boxlwd=2, medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,
outlwd=2, outpch=21, outbg="green", outcex=1.5, data=case0201)
## [1] "1" "2" "90" "91"
# car::Boxplot returns the labels of the points it identifies as outliers,
# which is why they print.
mean(case0201$Depth[case0201$Year==1978]) - mean(case0201$Depth[case0201$Year==1976])
## [1] 0.66854
yearFactor <- factor(case0201$Year) # Convert the numerical variable Year into a factor
# with 2 levels. 1976 is "group 1" (it comes first alphanumerically)
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE) # 2-sample t-test; 2-sided by default
##
## Two Sample t-test
##
## data: case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 8.6e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is not equal to 0
## 95 percent confidence interval:
## -0.95641 -0.38067
## sample estimates:
## mean in group 1976 mean in group 1978
## 9.4697 10.1382
t.test(case0201$Depth ~ yearFactor, var.equal=TRUE,
alternative = "less") # 1-sided; alternative: group 1 mean is less
##
## Two Sample t-test
##
## data: case0201$Depth by yearFactor
## t = -4.58, df = 176, p-value = 4.3e-06
## alternative hypothesis: true difference in means between group 1976 and group 1978 is less than 0
## 95 percent confidence interval:
## -Inf -0.42734
## sample estimates:
## mean in group 1976 mean in group 1978
## 9.4697 10.1382
MATLAB plot of Student’s t distribution, 176 degrees of freedom
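The original figure is from MATLAB; a minimal R equivalent (my sketch, not part of the original script) plots the t density with 176 df and marks the observed statistic:
curve(dt(x, df = 176), from = -6, to = 6, ylab = "Density",
      main = "Student's t distribution, 176 df")
abline(v = -4.58, lty = 2) # observed t statistic from the two-sample test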
“These data provide overwhelming evidence that the mean beak depth increased from 1976 to 1978 (one-sided p-value < 0.00001 from a two sample t-test). The 1978 mean was estimated to exceed the 1976 (pre-drought) mean by 0.67 mm (95% confidence interval 0.38 to 0.96 mm).”
This was an observational study, and the entire population was surveyed. Even though the entire population was sampled, a t test was appropriate; lack of independence, however, is an issue.
Display 2.2 from Statistical Sleuth 3rd edition
str(case0202)
## 'data.frame': 15 obs. of 2 variables:
## $ Unaffected: num 1.94 1.44 1.56 1.58 2.06 1.66 1.75 1.77 1.78 1.92 ...
## $ Affected : num 1.27 1.63 1.47 1.39 1.93 1.26 1.71 1.67 1.28 1.85 ...
diff <- case0202$Unaffected - case0202$Affected # paired differences (masks base::diff)
summary(diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.190 0.055 0.110 0.199 0.315 0.670
# A Tukey Stem & leaf plot
stem(diff, scale = 5, width = 80)
##
## The decimal point is 1 digit(s) to the left of the |
##
## -1 | 9
## -1 |
## -0 |
## -0 |
## 0 | 234
## 0 | 79
## 1 | 013
## 1 | 9
## 2 | 3
## 2 |
## 3 |
## 3 |
## 4 | 0
## 4 |
## 5 | 0
## 5 | 9
## 6 |
## 6 | 7
boxplot(diff,
ylab="Difference in Hippocampus Volume (cubic cm)",
xlab="15 Sets of Twins, One Affected with Schizophrenia",
main="Hippocampus Difference: Unaffected Twin Minus Affected Twin")
abline(h=0,lty=2) # Draw a dashed (lty=2) horizontal line at 0
# BOXPLOT FOR PRESENTATION from Sleuth3 vignette
boxplot(diff,
ylab="Difference in Hippocampus Volume (cubic cm)",
xlab="15 Sets of Twins, One Affected with Schizophrenia",
main="Hippocampus Difference: Unaffected Minus Affected Twin",
col="green", boxlwd=2, medlwd=2, whisklty=1, whisklwd=2,
staplewex=.2, staplelwd=2, outlwd=2, outpch=21, outbg="green",
outcex=1.5)
abline(h=0,lty=2)
# Paired t test
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
##
## Paired t-test
##
## data: case0202$Unaffected and case0202$Affected
## t = 3.23, df = 14, p-value = 0.0061
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.066704 0.330629
## sample estimates:
## mean of the differences
## 0.19867
There is substantial evidence that the mean difference in left hippocampus volumes between schizophrenic individuals and their nonschizophrenic twins is nonzero (two-sided p-value = 0.006, from a paired t test). It is estimated that the mean volume is 0.20 cm³ smaller for those with schizophrenia (about 11% smaller); a 95% confidence interval for the difference is 0.07 to 0.33 cm³.
p-value (modified from K. Wuensch and reef fish, edstat, 3/19/03, and October 2005): The p-value is the probability of obtaining data as or more discrepant with respect to the null and alternative hypotheses than those in the present sample, assuming that the null hypothesis is exactly correct. In a one-sample z-test of the hypothesis μ = μ₀, a z-score of 1.96 has p-value 0.975, 0.05, or 0.025 depending on whether the alternative hypothesis is μ < μ₀, μ ≠ μ₀, or μ > μ₀, respectively.
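As a quick check (my addition, not in the original notes), the three p-values quoted above can be computed directly in R:
pnorm(1.96)                         # alternative mu < mu_0: 0.975
2 * pnorm(1.96, lower.tail = FALSE) # alternative mu != mu_0: 0.05
pnorm(1.96, lower.tail = FALSE)     # alternative mu > mu_0: 0.025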
Sterne & Smith’s (2001) and Sleuth’s interpretation of p-values
Display 2.4, demonstrating the Central Limit Theorem
The Central Limit Theorem for sums and means
Simulation to demonstrate the central limit theorem
Skewed distributions are slow to obey the central limit theorem.
Most environmental data are positively skewed.
# Program written by Nicole Radziwill at r-bloggers
# https://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/
# modified by Eugene.Gallagher@umb.edu, 10/10/19: added QQplot, last rev 9/14/21
sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
r <- 10000 # Number of replications/samples - DO NOT ADJUST
# This produces a matrix of observations with
# n columns and r rows. Each row is one sample:
my.samples <- switch(src.dist,
"E" = matrix(rexp(n*r,param1),r),
"N" = matrix(rnorm(n*r,param1,param2),r),
"U" = matrix(runif(n*r,param1,param2),r),
"P" = matrix(rpois(n*r,param1),r),
"C" = matrix(rcauchy(n*r,param1,param2),r),
"B" = matrix(rbinom(n*r,param1,param2),r),
"G" = matrix(rgamma(n*r,param1,param2),r),
"X" = matrix(rchisq(n*r,param1),r),
"T" = matrix(rt(n*r,param1),r))
all.sample.sums <- apply(my.samples,1,sum)
all.sample.means <- apply(my.samples,1,mean)
all.sample.vars <- apply(my.samples,1,var)
op <- par(no.readonly = TRUE) # save plotting parameters in op
par(mfrow=c(2,2)) # will plot figures in a 2 x 2 array
hist(my.samples[1,],col="gray",main="Distribution of One Sample")
hist(all.sample.sums,col="gray",main="Sampling Distribution of the Sum")
hist(all.sample.means,col="gray",main="Sampling Distribution of the Mean")
# hist(all.sample.vars,col="gray",main="Sampling Distribution of the Variance")
# add a qq plot for the 4th graph
qqnorm(all.sample.means,col="gray",main="Quantile-Quantile Plot of Means")
qqline(all.sample.means, col="red")
# Returns the plotting parameters to their original values.
par(op)
}
# uncomment just one of these calls to the function sdm.sim at a time.
# sdm.sim(10,src.dist="B",param1=10,param2=0.5)
# sdm.sim(50,src.dist="B",param1=10,param2=0.5)
# sdm.sim(100,src.dist="B",param1=10,param2=0.5)
# sdm.sim(10,src.dist="C",param1=0,param2=1) # This produces results where the CLT fails
## See: https://stats.stackexchange.com/questions/74268/cauchy-distribution-and-central-limit-theorem
## https://web.ma.utexas.edu/users/mks/M358KInstr/Cauchy.pdf
# sdm.sim(50,src.dist="C",param1=0,param2=1)
# sdm.sim(100,src.dist="C",param1=0,param2=1)
# sdm.sim(10,src.dist="E",param1=1)
# sdm.sim(50,src.dist="E",param1=1)
# sdm.sim(100,src.dist="E",param1=1)
# sdm.sim(10,src.dist="G",param1=10,param2=2)
# sdm.sim(50,src.dist="G",param1=5,param2=2)
# sdm.sim(100,src.dist="G",param1=5,param2=2)
# sdm.sim(10,src.dist="N",param1=20,param2=3)
# sdm.sim(50,src.dist="N",param1=20,param2=3)
# sdm.sim(100,src.dist="N",param1=20,param2=3)
# sdm.sim(10,src.dist="P",param1=0.5)
# sdm.sim(50,src.dist="P",param1=0.5)
# sdm.sim(100,src.dist="P",param1=0.5)
# sdm.sim(10,src.dist="T",param1=5)
# sdm.sim(50,src.dist="T",param1=5)
# sdm.sim(100,src.dist="T",param1=5)
# sdm.sim(10,src.dist="U",param1=0,param2=1) # specify min & max
sdm.sim(50,src.dist="U",param1=0,param2=1)
# sdm.sim(100,src.dist="U",param1=0,param2=1)
# sdm.sim(10,src.dist="X",param1=14)
# sdm.sim(50,src.dist="X",param1=14)
# sdm.sim(100,src.dist="X",param1=14)
# Now test with the t distribution
sdm.sim(50,src.dist="T",param1=5)
# Now test with the exponential distribution
sdm.sim(50,src.dist="E",param1=1)
# Now test with the Cauchy distribution
sdm.sim(100,src.dist="C",param1=0,param2=1)
The Cauchy distribution arises often in environmental science. If X and Y are independent, zero-mean normal variables, X/Y is NOT normally distributed; it follows a Cauchy distribution. The use of ratios, particularly by marine chemists, is prevalent in environmental science.
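A small simulation (my addition) makes the point: the empirical quantiles of a ratio of independent standard normals track the theoretical Cauchy quantiles:
set.seed(1) # arbitrary seed for reproducibility
ratio <- rnorm(1e5) / rnorm(1e5) # ratio of independent standard normals
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(simulated = quantile(ratio, probs),
            cauchy    = qcauchy(probs)), 2) # the two rows should agree closely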
Methods for expressing uncertainty
Confidence interval Kendall & Stuart (1979, p. 199) state that the ideas of confidence interval estimation are due to Neyman, especially Neyman (1937). If a confidence interval, say a 95% confidence interval for the mean, is properly calculated, 95% of such intervals will contain the true value of the parameter, e.g., the mean, being estimated.
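Neyman's coverage property is easy to verify by simulation; a minimal sketch (my addition) using one-sample t intervals:
set.seed(2) # arbitrary seed for reproducibility
covered <- replicate(10000, {
  ci <- t.test(rnorm(25, mean = 10, sd = 2))$conf.int # 95% CI for the mean
  ci[1] < 10 && 10 < ci[2] # does the interval contain the true mean?
})
mean(covered) # proportion of intervals covering 10; should be near 0.95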
z and t ratios
Degrees of Freedom
Student’s t ratio
Sampling distribution of the difference between two averages
Pooled standard deviation
Standard error of the difference
95% confidence interval for the difference in means
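These four quantities can be computed by hand; a minimal sketch (my addition, assuming the Sleuth3 case0201 data are loaded) that reproduces the t.test() results above:
x1 <- case0201$Depth[case0201$Year == 1976]
x2 <- case0201$Depth[case0201$Year == 1978]
n1 <- length(x1); n2 <- length(x2)
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)) # pooled SD
se <- sp * sqrt(1 / n1 + 1 / n2)    # standard error of the difference
tstat <- (mean(x2) - mean(x1)) / se # t ratio with n1 + n2 - 2 df
ci <- (mean(x2) - mean(x1)) + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se
c(t = tstat, lower = ci[1], upper = ci[2]) # t = 4.58, 95% CI 0.38 to 0.96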
If one collects a random sample from a distribution with mean μ and calculates the sample mean (x̄) and a 95% confidence interval for the mean μ, this confidence interval will contain the unknown μ 95% of the time; 5% of the time, it will not contain μ. The probability that a single confidence interval contains μ is either 100% or 0%; μ is either inside or outside the confidence interval.
Bayesians use credible intervals and refer to the probability that a parameter falls within a certain range.
95% confidence interval for the difference in means
Testing a hypothesis about the difference in means
Was the difference in beaks consistent with chance?
Randomization distribution & P-value interpretation
100 simulated samples of size 4 from a normal distribution
100 simulated samples of size 40 from a normal distribution
p = pbinom(8, 100, 0.05, lower.tail = FALSE) # P(more than 8 of 100 CIs miss)
sprintf("The probability that more than 8 of 100 95%% CI's fail to contain the parameter is %5.3f", p)
## [1] "The probability that more than 8 of 100 95% CI's fail to contain the parameter is 0.063"
This chapter covers 2-sample tests for the difference in means when the standard deviations of the two populations are equal
Independent samples t test
Paired t test.
With common σ, the distribution of the difference in sample means converges to a normal distribution with mean μ1 - μ2 and standard deviation σ·sqrt(1/n1 + 1/n2), as described by the Central Limit Theorem
Since σ must be estimated for the independent samples t test, the pooled standard deviation, with n1 + n2 - 2 df, is used to estimate the standard error of the difference in means
5 ways of expressing the uncertainty in an estimate: standard deviation, standard error, 95% confidence interval, coefficient of variation, and more rarely, the analytical error.
The t ratio, the observed difference divided by the standard error of the difference, is used to calculate p values and confidence limits for samples
Fisher noted that the t ratio has meaning only so far as it estimates the p value that would result from a randomization test using the underlying data.
Fisher and Neyman & Pearson agreed that p values should be used in evaluating tests and all converged on 0.05 as a rough rule of thumb for ruling out chance alone.
Gelman & Hill (2007) and Gelman et al. (2021) don’t use the significant/nonsignificant dichotomy, p-values, or confidence intervals, relying instead on the estimate and its standard error.
Neyman & Pearson felt Fisher hadn’t gone far enough in just rejecting null hypotheses with a test & p value. Neyman & Pearson proposed Type II error & Neyman introduced confidence limits
Confidence intervals calculated for multiple sample estimates will contain the true parameter 95% of the time
For a single sample, the probability that the confidence interval contains the true parameter is 0 or 1
In reporting results, you should include:
The null and alternative hypotheses (at least specifying whether the alternative is 1-tailed or 2-tailed)
The statistical test & p value
The effect size and estimate of error, either the confidence interval or the standard error
A concluding statement about the hypothesis
Display 3.1
Display 3.2
# Run code from RScs0301_3rd
attach(case0301)
str(case0301) #Seeded is level 1 of Treatment (it's first alphabetically)
## 'data.frame': 52 obs. of 2 variables:
## $ Rainfall : num 1203 830 372 346 321 ...
## $ Treatment: Factor w/ 2 levels "Seeded","Unseeded": 2 2 2 2 2 2 2 2 2 2 ...
boxplot(Rainfall ~ Treatment)
boxplot(log(Rainfall) ~ Treatment) # Boxplots of natural logs of Rainfall
t.test(log(Rainfall) ~ Treatment, var.equal=TRUE,
alternative="greater") # 1-sided t-test; alternative: level 1 mean is greater
##
## Two Sample t-test
##
## data: log(Rainfall) by Treatment
## t = 2.54, df = 50, p-value = 0.007
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is greater than 0
## 95 percent confidence interval:
## 0.3904 Inf
## sample estimates:
## mean in group Seeded mean in group Unseeded
## 5.1342 3.9904
myTest <- t.test(log(Rainfall) ~ Treatment, var.equal=TRUE,
alternative="two.sided") # 2-sided alternative to get confidence interval
exp(myTest$est[1] - myTest$est[2]) # Back-transform estimate on log scale
## mean in group Seeded
## 3.1386
exp(myTest$conf) # Back transform endpoints of confidence interval
## [1] 1.2723 7.7423
## attr(,"conf.level")
## [1] 0.95
boxplot(log(Rainfall) ~ Treatment,
ylab="Log of Rainfall Volume in Target Area (Acre Feet)",
names=c("On 26 Seeded Days", "On 26 Unseeded Days"),
main="Distributions of Rainfalls from Cloud Seeding Experiment")
## POLISHED BOXPLOTS FOR PRESENTATION FROM HORTON:
opar <- par(no.readonly=TRUE) # Store device graphics parameters
par(mar=c(4,4,4,4)) # Change margins to allow more space on right
boxplot(log(Rainfall) ~ Treatment, ylab="Log Rainfall (Acre-Feet)",
names=c("on 26 seeded days","on 26 unseeded days"),
main="Boxplots of Rainfall on Log Scale", col="green", boxlwd=2,
medlwd=2, whisklty=1, whisklwd=2, staplewex=.2, staplelwd=2,
outlwd=2, outpch=21, outbg="green", outcex=1.5 )
myTicks <- c(1,5,10,100,500,1000,2000,3000) # some tick marks for original scale
axis(4, at=log(myTicks), labels=myTicks) # Add original-scale axis on right
mtext("Rainfall (Acre Feet)", side=4, line=2.5) # Add right-side axis label
par(opar) # Restore previous graphics parameter settings
The difference is 1.14 on the log scale, so on the original scale the effect is multiplicative, not additive: seeded rainfall is estimated to be exp(1.14) ≈ 3.1 times unseeded rainfall, with 95% CI from exp(0.241) ≈ 1.3 to exp(2.047) ≈ 7.7.
MASS::boxcox(lm(I(Rainfall) ~ Treatment, data=case0301),lam=seq(-1,1.5,.1))
# The 95% CI includes lambda=0, which indicates a log transform is appropriate
# (Draper & Smith, 1998, p 280)
# Levene's test with the car package
## default for Levene's test is center=median, SPSS is center = mean
attach(case0301)
## The following objects are masked from case0301 (pos = 3):
##
## Rainfall, Treatment
OUT=leveneTest(Rainfall ~ Treatment, center=mean, data=case0301)
OUT
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 6.09 0.017 *
## 50
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OUTLN=leveneTest(log(Rainfall) ~ Treatment, center=mean, data=case0301)
OUTLN
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.06 0.81
## 50
detach(case0301)
# Nick Horton's (Smith/Amherst) code for the Case 3.1 data
library(mosaic) # provides favstats() and col.mosaic(); loads lattice
trellis.par.set(theme = col.mosaic()) # get a better color scheme for lattice
options(digits = 3, show.signif.stars = FALSE)
summary(case0301)
## Rainfall Treatment
## Min. : 1 Seeded :26
## 1st Qu.: 29 Unseeded:26
## Median : 117
## Mean : 303
## 3rd Qu.: 307
## Max. :2746
favstats(Rainfall ~ Treatment, data = case0301)
## Treatment min Q1 median Q3 max mean sd n missing
## 1 Seeded 4.1 98.1 221.6 406 2746 442 651 26 0
## 2 Unseeded 1.0 24.8 44.2 159 1203 165 278 26 0
bwplot(Rainfall ~ Treatment, data = case0301)
densityplot(~Rainfall, groups = Treatment, auto.key = TRUE, data = case0301)
case0301$lograin = log(case0301$Rainfall)
favstats(lograin ~ Treatment, data = case0301)
## Treatment min Q1 median Q3 max mean sd n missing
## 1 Seeded 1.41 4.58 5.40 6.00 7.92 5.13 1.60 26 0
## 2 Unseeded 0.00 3.21 3.79 5.07 7.09 3.99 1.64 26 0
bwplot(lograin ~ Treatment, data = case0301)
densityplot(~lograin, groups = Treatment, auto.key = TRUE, data = case0301)
ttestlog = t.test(log(Rainfall) ~ Treatment, var.equal = TRUE, data = case0301)
ttestlog
##
## Two Sample t-test
##
## data: log(Rainfall) by Treatment
## t = 3, df = 50, p-value = 0.01
## alternative hypothesis: true difference in means between group Seeded and group Unseeded is not equal to 0
## 95 percent confidence interval:
## 0.241 2.047
## sample estimates:
## mean in group Seeded mean in group Unseeded
## 5.13 3.99
summary(lm(lograin ~ Treatment, data = case0301))
##
## Call:
## lm(formula = lograin ~ Treatment, data = case0301)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.990 -0.745 0.162 1.019 3.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.134 0.318 16.15 <2e-16
## TreatmentUnseeded -1.144 0.450 -2.54 0.014
##
## Residual standard error: 1.62 on 50 degrees of freedom
## Multiple R-squared: 0.115, Adjusted R-squared: 0.0969
## F-statistic: 6.47 on 1 and 50 DF, p-value: 0.0141
obslogdiff = -diff(mean(lograin ~ Treatment, data = case0301))
obslogdiff
## Unseeded
## 1.14
multiplier = exp(obslogdiff)
multiplier
## Unseeded
## 3.14
ttestlog$conf.int
## [1] 0.241 2.047
## attr(,"conf.level")
## [1] 0.95
exp(ttestlog$conf.int)
## [1] 1.27 7.74
## attr(,"conf.level")
## [1] 0.95
Display 3.3
# Use tidyverse code to summarize the data
library(dplyr) # provides %>%, group_by(), summarise()
case0302 %>%
group_by(Veteran) %>%
summarise(min = min(Dioxin), Q1 = quantile(Dioxin, probs = .25),
median = median(Dioxin), Q3 = quantile(Dioxin, probs = .75),
max = max(Dioxin), mean = mean(Dioxin), sd = sd(Dioxin))
## # A tibble: 2 x 8
## Veteran min Q1 median Q3 max mean sd
## <fct> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Other 0 3 4 5 15 4.19 2.30
## 2 Vietnam 0 3 4 5 45 4.26 2.64
# The data contain zeros, so a small increment, say 0.05, must be added
# before log-transforming
attach(case0302)
boxplot(Dioxin ~ Veteran)
## HISTOGRAMS FOR PRESENTATION
opar <- par(no.readonly=TRUE) # Store device graphics parameter settings
par(mfrow=c(2,1), mar=c(3,3,1,1)) # 2 by 1 layout of plots; change margins
myBreaks <- (0:46) - .5 # Make breaks for histogram bins
hist(Dioxin[Veteran=="Other"], breaks=myBreaks, xlim=range(Dioxin),
col="green", xlab="", ylab="", main="")
text(5,25,
"Dioxin in 97 'Other' Veterans; Estimated mean = 4.19 ppt (95% CI: 3.72 to 4.65 ppt)",
pos=4, cex=.75) # CI from 1-sample t-test with subset = (Veteran == "Other")
hist(Dioxin[Veteran=="Vietnam"],breaks=myBreaks,xlim=range(Dioxin),
col="green", xlab="", ylab="", main="")
text(5,160,
"Dioxin in 646 Vietnam Veterans; Estimated mean = 4.26 ppt (95% CI: 4.06 to 4.64 ppt)",
pos=4, cex=.75)
text(8,145,"[Estimated Difference in Means: 0.07 ppt (95% CI: -0.63 to 0.48 ppt)]",
pos=4, cex=.75)
par(opar) # Restore previous graphics parameter settings
# Gallagher addendum
## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(log(Dioxin+0.05) ~ Veteran,xlab="Treatment",
ylab="ln(Dioxin+0.05)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))
detach(case0302)
Display 3.3
MASS::boxcox(lm(I(Dioxin+.05) ~ Veteran, data=case0302),lam=seq(0,1,.1))
## 0.4 is the appropriate transform
## Here is the Box Cox transformation from p. 280 of Draper & Smith.
geoMean<- exp(mean(log(case0302$Dioxin+0.05)))
geoMean
## [1] 3.72
lambda<-0.4
# See Equation 13.2.2, p. 280, in Draper & Smith's 1998
# Applied Regression Analysis, 3rd edition
case0302$V<-((case0302$Dioxin^lambda) - 1)/(lambda*geoMean^(lambda-1))
## Side by side lattice box plots (from Murrell p 132)
plot1<-bwplot(Dioxin ~ Veteran,xlab="Treatment",
ylab="Dioxin",case0302) ## Vertical boxplots
plot2<-bwplot(V ~ Veteran,xlab="Treatment",
ylab="Box-Cox Transform (lambda=0.4)",case0302) ## Vertical boxplots
print(plot1, position=c(0,0,1/2,1), more=TRUE)
print(plot2, position=c(1/2,0,1,1))
t.test(V ~ Veteran, var.equal=TRUE,
alternative="less", data=case0302) # 1-sided t-test; alternative: group 1 mean is less
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 741, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.181
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 4.01
# Sensitivity checks: repeat the test with the largest Dioxin values removed
t.test(V ~ Veteran, alternative="less", var.equal=TRUE,
subset(case0302, Dioxin < 40)) # t-test on subset for which Dioxin < 40
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 740, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.191
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 3.99
t.test(V ~ Veteran, alternative="less", var.equal=TRUE,
subset(case0302, Dioxin < 20))
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.7, df = 739, p-value = 0.2
## alternative hypothesis: true difference in means between group Other and group Vietnam is less than 0
## 95 percent confidence interval:
## -Inf 0.2
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 3.97
t.test(V ~ Veteran, var.equal=TRUE, data=case0302) # 2-sided--to get confidence interval
##
## Two Sample t-test
##
## data: V by Veteran
## t = -0.8, df = 741, p-value = 0.4
## alternative hypothesis: true difference in means between group Other and group Vietnam is not equal to 0
## 95 percent confidence interval:
## -0.633 0.252
## sample estimates:
## mean in group Other mean in group Vietnam
## 3.82 4.01
Randomization distribution of the t statistic for the 3.2 dioxin data
Two major assumptions
Both samples are independent samples from normally distributed populations
Both samples have identical standard deviations
The t test is usually robust to modest violations of the assumptions
These assumptions are never strictly met, but the t test is remarkably robust to violations of the assumptions, especially if the sample sizes are nearly equal
Robust means the conclusions from the test, e.g., p values & confidence limits, are valid even when the assumptions aren’t strictly met, especially if sample sizes are nearly equal
Transformations of the data are often used
With equal sample sizes, the t test is affected moderately by long-tailedness (a leptokurtic, or peaked, distribution) and very little by skewness (asymmetry of the distribution)
Kurtosis: peakedness; platykurtic (flat distribution), leptokurtic (peaked distribution)
Skewness: asymmetry
If the two populations have the same standard deviations and approximately the same shape but unequal sample sizes, the t tests are affected moderately by long-tailedness and substantially by skewness (see the sketch below)
If the skewness of the two populations differs considerably, the tools can be misleading at small and moderate sample sizes
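A rough simulation (my addition) of the skewness claim: two samples from the same skewed (exponential) population, with unequal n, pushed through the pooled t test:
set.seed(3) # arbitrary seed for reproducibility
pvals <- replicate(10000,
  t.test(rexp(10), rexp(40), var.equal = TRUE)$p.value) # same mean, skewed parent
mean(pvals < 0.05) # rejection rate drifts away from the nominal 0.05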
Display 3.4
Display 3.5
Independence assumption important
Display 3.8
Log (x+1) transform
Most biological data, but not usually diversity
Needed when there is a multiplicative process in action:
++ exponential growth
++ bank account interest
++ Many pollutants (aquatic and terrestrial & human)
+++ polynuclear aromatic hydrocarbons
+++ fecal coliform bacteria, but not usually metals
Calculate the mean and 95% CI on the transformed scale and then back-transform. The back-transformed mean of log-transformed data is the geometric mean; when the data are symmetric on the log scale it approximates the median, but otherwise it may differ from the median (see the sketch below).
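A short illustration (my addition) with the seeded-day rainfall from case0301:
seeded <- case0301$Rainfall[case0301$Treatment == "Seeded"]
exp(mean(log(seeded))) # geometric mean, about 170 acre-feet
median(seeded)         # sample median, 221.6 acre-feet (favstats output above)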
Many other transforms (R one-liners follow this list):
Arcsin (√Y) for frequency data ranging between 0 and 1 (but the logit transform may be better)
Logit transform: log [Y/(1-Y)]
++ % silt clay, but the data must be on the interval 0 to 1
Square roots for counts
reciprocal for waiting times
logit transforms for proportions between 0 and 1: log(P/(1-P))
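One-line R versions of these transforms (my addition); p is a proportion strictly between 0 and 1, y a count, and w a waiting time:
p <- c(0.1, 0.5, 0.9); y <- c(0, 3, 12); w <- c(0.5, 2, 8)
asin(sqrt(p))    # arcsine square root for proportions
log(p / (1 - p)) # logit
sqrt(y)          # square root for counts
1 / w            # reciprocal for waiting times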
“… it is recommended here that a trial-and-error approach, with graphical analysis, be used instead.” (Sleuth)
Box-Cox transformation of the Case 3.1 data
MATLAB Box-Cox analysis of the Case 3.1 data
Analysis of the transformed Case 3.1 data
CDC serum pollutant survey cover
CDC serum pollutant survey: lead
CDC serum pollutant survey: DDE
A procedure is resistant if it doesn’t change much when a small part of the data changes, perhaps drastically.
t tools are based on averages and are strongly affected by outliers
Chapter 4 introduces tests based on ranks, which protect against outliers (but not against unequal variance)
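A toy example (my addition) of resistance: one wild value drags the mean and weakens the t test, but barely moves the median or the rank-based test:
x <- c(5.1, 4.8, 5.3, 5.0, 4.9)
y <- c(6.0, 6.2, 5.9, 6.1, 60) # 60 is a gross outlier
c(mean = mean(y), median = median(y)) # mean jumps to 16.84; median stays 6.1
t.test(x, y)$p.value      # outlier inflates the variance; the t test misses the shift
wilcox.test(x, y)$p.value # rank test ignores the outlier's magnitude: p = 0.0079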
Practical strategies
Do side-by-side box plots to analyze departures from assumptions
Consider & test for serial, spatial, and cluster effects
Analyze spatial patterns in the residuals; use more sophisticated tools
++ Legendre & Legendre: with positive spatial autocorrelation, tighten the significance threshold, e.g., test at the 0.001 level instead of 0.05 to keep the Type I error rate (alpha) below 0.05
Dealing with outliers 1
Dealing with outliers 2
Outlier strategy
Run analysis with and without outliers
Throw out outliers only if there is very compelling evidence to do so, and document this data paring or culling
Note that outlier removal has created tremendous problems
++ POC flux to the deep sea
++ Failure to detect the ozone hole
++ Mendel’s data: 1:2:1 ratios and the chi-square test; documented by Fisher & Bruce Weir
++ Millikan’s study of the charge of the electron
The t tools are quite robust to violations of the assumptions, especially if sample sizes are equal
If variances are unequal and sample sizes unequal, the test's p values are liberal when the smaller group has the larger variance (see the sketch after this list)
Boxplots reveal the major problems for Student’s t test: Unequal variances between groups and outliers
Transformations, especially the log transform, are often necessary
Beware of outlier removal, but sometimes it is necessary
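A closing sketch (my addition) of the liberal-p-value problem and its standard fix: when the smaller group has the larger variance, the pooled test rejects a true null too often, while Welch's test (t.test's default, var.equal = FALSE) holds its level:
set.seed(4) # arbitrary seed for reproducibility
pooled <- replicate(10000,
  t.test(rnorm(10, sd = 3), rnorm(40, sd = 1), var.equal = TRUE)$p.value)
welch <- replicate(10000,
  t.test(rnorm(10, sd = 3), rnorm(40, sd = 1))$p.value)
c(pooled = mean(pooled < 0.05), welch = mean(welch < 0.05)) # pooled runs well above 0.05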