QG Lecture 11

No class next Thursday (Oct 10) and the following Tuesday (Oct 15). Revised schedule (see syllabus).

Last time I showed you how to apply the formal t-test. I also introduced the idea of graphical inference.

Let's review and practice.

Two Sample Tests

Data are from two groups. We test the null hypothesis that the two samples have the same population mean by computing the t-statistic.

The t-statistic is the difference in sample means divided by the standard error of the difference in means (SEDM).

There are two ways to calculate SEDM. (1) Assume equal variance: A pooled standard deviation (s) is computed. Under the null hypothesis, the t-statistic will follow a t distribution with n1 + n2 - 2 degrees of freedom (dof).

(2) Do not assume equal variance: The t-statistic approximates a t distribution after adjusting the dof.

The two methods give similar results unless the sample sizes and variances are very different.

Example: We consider total lung capacity by sex from the dataset tlc (ISwR).

require(ISwR)

## Loading required package: ISwR

## Warning: there is no package called 'ISwR'

head(tlc)

## Error: object 'tlc' not found

`?`(tlc)

## No documentation for 'tlc' in specified packages and libraries:
## you could try '??tlc'

Start with side-by-side boxplots.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.0.2

ggplot(tlc, aes(x = factor(sex), y = tlc)) + geom_boxplot() + xlab("Sex (1: female)") + 
    ylab("Total lung capacity (liters)")

## Error: object 'tlc' not found

You see a difference in mean lung expenditure between the two groups. Is this difference statistically signficant?

With this data format (long) you use R's model syntax for specifying the t test.

R's model syntax is response ~ explanatory, where “response” refers to the response variable and “explanatory” refers to the explanatory variable. The ~ is a tilde given on your keyboard.

Thus tlc is your response variable and sex is your explanatory variable.

t.test(tlc ~ sex, data = tlc)

## Error: object 'tlc' not found

The small p-value allows you to conclude that there is convincing evidence against the null hypothesis of no difference.

The confidence interval (CI) is on the difference between the two sample means.

with(tlc, mean(tlc[sex == 1]) - mean(tlc[sex == 2]))

## Error: object 'tlc' not found

It is best to think of the CI as an uncertainty interval. The wider it is the more uncertain we are about the true population difference.

Graphically we should be able to pick out our data from a lineup of graphs.

library(nullabor)

## Error: there is no package called 'nullabor'

fun = null_permute("sex")

## Error: could not find function "null_permute"

inf = lineup(fun, tlc, n = 6)

## Error: could not find function "lineup"

ggplot(inf, aes(x = factor(sex), y = tlc)) + geom_boxplot() + facet_wrap(~.sample)

## Error: object 'inf' not found

Here the null function is simpler. You simply permute the explanatory variable.

By default there is no assumption that the variances are the same in the two groups so the Welch procedure is used. You can override this default by adding the argument var.equal=TRUE. What is the p value under this assumption?

t.test(tlc ~ sex, data = tlc, var.equal = TRUE)

## Error: object 'tlc' not found

F Test

To test the assumption of equal variances you use var.test() function.

var.test(tlc ~ sex, data = tlc)

## Error: object 'tlc' not found

What is the null hypothesis? What do you conclude?

The uncertainty interval includes the value of 1 and is quite wide. For small data sets such as this one, the assumption of constant variance is largely a matter of belief. This test is sensitive to small departures from a normal distribution.

The variance test is based on the assumption that the groups are independent. You should not apply this test to paired data.

The t.test() and var.test() functions are in the stats package as part of the base install of R.

The ctest package contains all the “classical tests,” and has several alternative tests for variance homogeneity, each with its own assumptions, benefits, and drawbacks.

Wilcoxon Test

The t test assumes that the two populations follow a normal distribution. That means the set of values when plotted with a histogram looks bell-shaped.

Let's plot using a stacked histogram.

ggplot(tlc, aes(tlc, fill = factor(sex))) + geom_histogram(binwidth = 1, color = "white") + 
    xlab("Total Lung Capacity (l)")

## Error: object 'tlc' not found

The histograms are not bell-shaped.

You can avoid distributional assumptions by using a non-parametric test. The non-parametric alternative to the t test is the Wilcoxon test. Also known as the Mann-Whitney U test (and Mann-Whitney-Wilcoxon test).

The test statistic W is the sum of the ranks in the first group minus the sum of the ranks in the second. You use the wilcox.test() function. Try it on the lung capacity data.

wilcox.test(tlc ~ sex, data = tlc)

## Error: object 'tlc' not found

Recent application: Atmospheric pressure and insect sexual behavior. Insects reduce courtship in anticipation of storms.

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0075004

Another example: Is there evidence of more or fewer U.S. hurricanes recently? One way to examine this question is to divide the record into two samples and compare the means from both samples.

Read the data into R.

loc = "http://myweb.fsu.edu/jelsner/data/US.txt"
H = read.table(loc, header = TRUE)

We can consider the first half of the record as separate from the second half and ask is there a statistical difference in hurricane counts between the two halfs. The null hypothesis is that the sample means are the same.

First create a vector that divides the record length in two equal parts.

early = H$Year <= median(H$Year)
head(early)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

tail(early)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Then use the test on the U.S. hurricane counts where the explanatory variable is the vector early.

t.test(H$All ~ early)

## 
##  Welch Two Sample t-test
## 
## data:  H$All by early
## t = -0.1631, df = 157.8, p-value = 0.8706
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4915  0.4165
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##               1.675               1.712

The p-value is large so we fail to reject the null hypothesis of no difference.

The 95% uncertainty interval is around the difference in means. Since the interval contains zero, there is no evidence against the null hypothesis.

Small counts are not normally distributed

ggplot(H, aes(factor(All))) + geom_bar() + ylab("Number of Years") + xlab("Number of Hurricanes")

plot of chunk barplotH

So you might want to use the Wilcoxon test instead.

s1 = H$All[early]
s2 = H$All[!early]
wilcox.test(s1, s2)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  s1 and s2
## W = 3248, p-value = 0.869
## alternative hypothesis: true location shift is not equal to 0

The conclusion is the same. The second half of the data set is statistically indistinguishable from the first half.

Here you created two samples and then applied the test function.

Paired Observations

Sometimes two sample data come from paired observations. In this case you should use the paired=TRUE argument.

Paired tests are used when there are two measurements on the same experimental unit. The theory is based on taking the differences in paired values thus reducing the problem to that of a one-sample test. To invoke the paired t test, use the argument paired=TRUE.

Example: Shoe material. A manufacturer makes two different shoe materials (A and B). A sample of 10 kids try both materials and recorded are the wear times (months) for each material. Compare whether or not there is any difference in shoe material wear times.

matA = c(14, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)
matB = c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)

t.test(matA, matB, paired = TRUE)

## 
##  Paired t-test
## 
## data:  matA and matB
## t = 3.349, df = 9, p-value = 0.008539
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.133 0.687
## sample estimates:
## mean of the differences 
##                    0.41

t.test(matA, matB)

## 
##  Welch Two Sample t-test
## 
## data:  matA and matB
## t = 0.3689, df = 17.99, p-value = 0.7165
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.925  2.745
## sample estimates:
## mean of x mean of y 
##     11.04     10.63

You arrive at two different conclusions. This is because the variation in wear times from one kid to another is large compared with the difference in wear material.

In the first case you control for the difference by using the paired=TRUE option so you can detect the smaller difference in wear times.

Graphically we might do something like:

Shoes.df = data.frame(matA, matB)
Shoes.df$Diff = matA - matB
Shoes.df$Boy = 1:length(matA)
head(Shoes.df)

##   matA matB Diff Boy
## 1 14.0 13.2  0.8   1
## 2  8.8  8.2  0.6   2
## 3 11.2 10.9  0.3   3
## 4 14.2 14.3 -0.1   4
## 5 11.8 10.7  1.1   5
## 6  6.4  6.6 -0.2   6

Generate line up.

fun = null_dist(var = "Diff", dist = "normal", params = list(mean = 0, sd = sd(Shoes.df$Diff)))

## Error: could not find function "null_dist"

inf = lineup(fun, Shoes.df)

## Error: could not find function "lineup"

ggplot(inf, aes(x = "", y = Diff)) + geom_boxplot() + facet_wrap(~.sample, nrow = 1, 
    ncol = 20)

## Error: object 'inf' not found

Example: Reaction time while driving

The use of a cell phone while driving is hypothosized to increase the chance of an accident due to slowed reaction time. The dataset reaction.time (UsingR) gives the time it takes to react to an external event while driving by various groups.

library(UsingR)

## Warning: package 'UsingR' was built under R version 3.0.2

## Loading required package: MASS
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies

str(reaction.time)

## 'data.frame':    60 obs. of  4 variables:
##  $ age    : Factor w/ 2 levels "16-24","25+": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 1 2 ...
##  $ control: Factor w/ 2 levels "C","T": 2 2 2 2 2 1 2 2 2 1 ...
##  $ time   : num  1.36 1.47 1.51 1.39 1.38 ...

The numeric variable 'time' is the reaction time in seconds. The factor variable 'control' has two groups 'C' control (not using cell phone) and 'T' (using cell).

The data set is in the long format.

Suppose you are asked to see if reaction time is shorter for those not using a cell phone.

First, plot the response variable ('time').

ggplot(reaction.time, aes(x = control, y = time)) + geom_boxplot()

plot of chunk plotReactionTime

As expected, the reaction time for those not on the phone © is shorter than the reaction time for those on the phone (T).

The box plots indicate symmetric distributions about the median, so you are safe to use a t test.

The reaction times are not done as a paired experiment.

The null hypothesis is that there is no difference in reaction times and the alternative is that the reaction time is less for those not using a cell phone.

t.test(time ~ control, data = reaction.time, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  time by control
## t = -2.205, df = 29.83, p-value = 0.01765
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -0.01288
## sample estimates:
## mean in group C mean in group T 
##           1.390           1.446

The evidence in support of the null hypothesis that the reaction times are the same is summarized by the p-value. The p-value is .01765, so you conclude there is moderate evidence indicating cell phone use slows reaction times.

Recall the adjective 'moderate' comes from the following table: p-value as evidence against the null hypothesis.

   less than 0.01: convincing
        0.01-0.05: moderate 
        0.05-0.15: suggestive, but inconclusive
greater than 0.15: no

How does the evidence change if you consider women and men separately?

Recall the data frame has the following columns.

head(reaction.time)

##     age gender control  time
## 1 16-24      F       T 1.360
## 2 16-24      M       T 1.468
## 3 16-24      M       T 1.512
## 4 16-24      F       T 1.391
## 5 16-24      M       T 1.384
## 6 16-24      M       C 1.394

So you subset the data by 'gender.'

t.test(time ~ control, data = reaction.time[reaction.time$gender == "F", ], 
    alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  time by control
## t = -0.875, df = 9.966, p-value = 0.2011
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##    -Inf 0.0368
## sample estimates:
## mean in group C mean in group T 
##           1.416           1.451

t.test(time ~ control, data = reaction.time[reaction.time$gender == "M", ], 
    alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  time by control
## t = -1.989, df = 19.13, p-value = 0.0306
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -0.008845
## sample estimates:
## mean in group C mean in group T 
##           1.372           1.439

For both women and men, reaction time is slower while on the cell phone. But for the women, the p-value is > .15, while for the men the p-value is 0.03.

For the subsample of women, the reaction time does not appear to be significantly slower while on the cell phone.

Repeat the test separately for the two age groups. What do you conclude?

Example: Time spent on homework

The dataset homework (UsingR) contains the average number of minutes of homework given to students at 15 public and 15 private schools. Perform a t test to determine if there is a significant difference in the means between public and private schools.

head(homework)

##   Private Public
## 1    21.3   15.3
## 2    16.8   17.4
## 3     8.5   12.3
## 4    12.6   10.7
## 5    15.8   16.4
## 6    19.3   11.3

Here the data frame is in the wide format.

You can use the t.test() function directly as

t.test(homework$Private, homework$Public)

## 
##  Welch Two Sample t-test
## 
## data:  homework$Private and homework$Public
## t = 1.713, df = 27.73, p-value = 0.09779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5345  5.9878
## sample estimates:
## mean of x mean of y 
##     17.63     14.91

with(data = homework, t.test(Private, Public))

## 
##  Welch Two Sample t-test
## 
## data:  Private and Public
## t = 1.713, df = 27.73, p-value = 0.09779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5345  5.9878
## sample estimates:
## mean of x mean of y 
##     17.63     14.91

Or you can turn the wide data frame into a long data frame using the melt() function from the reshape2 package.

require(reshape2)

## Loading required package: reshape2

## Warning: package 'reshape2' was built under R version 3.0.2

hw.long = melt(homework)

## Using  as id variables

head(hw.long)

##   variable value
## 1  Private  21.3
## 2  Private  16.8
## 3  Private   8.5
## 4  Private  12.6
## 5  Private  15.8
## 6  Private  19.3

Now you can plot it and use the formula syntax with the t.test() function.

ggplot(hw.long, aes(x = variable, y = value)) + geom_boxplot() + ylab("Time on homework (min)") + 
    xlab("School type")

plot of chunk plotHomework

t.test(value ~ variable, data = hw.long)

## 
##  Welch Two Sample t-test
## 
## data:  value by variable
## t = 1.713, df = 27.73, p-value = 0.09779
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5345  5.9878
## sample estimates:
## mean in group Private  mean in group Public 
##                 17.63                 14.91

Suggestive, but inconclusive evidence of a difference.

Problem Set #3: Due Oct 17th.