1. For a two-sample, two-tailed t-test, consider the following example data on men’s and women’s weights.
  a. Create a data frame (call it ‘dat’) containing all the weights in the first column (‘weight’) and the gender identity in the second column (‘gender’).
men_weights <- c(71.1, 73.4, 85.9, 75.5, 75.9, 87, 78.2, 66.1, 70.2)
women_weights <- c(52.8, 61.1, 56.8, 57, 55.6, 52.2, 63.9, 57.5, 45.2)
dat <- data.frame(weight = c(men_weights, women_weights), gender = rep(c("male", "female"), each = 9))
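
A quick sanity check (not part of the required answer, just a sketch) that the data frame was assembled as intended:

## Inspect the first rows and the group sizes
head(dat)
table(dat$gender) # should show 9 weights per group
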
  b. Check for normality in the two groups separately using the hist function, qqnorm in conjunction with qqline, and the shapiro.test function. If the normality criterion is not grossly violated, run a two-tailed t-test using the formula approach (refer to the lab notes or the last example in the help file for the t.test function if you don’t remember).
hist(dat[dat$gender == "female", "weight"])

hist(dat[dat$gender == "male", "weight"])

qqnorm(dat[dat$gender == "female", "weight"])
qqline(dat[dat$gender == "female", "weight"])

qqnorm(dat[dat$gender == "male", "weight"])
qqline(dat[dat$gender == "male", "weight"])

shapiro.test(dat[dat$gender == "male", "weight"])
## 
##  Shapiro-Wilk normality test
## 
## data:  dat[dat$gender == "male", "weight"]
## W = 0.93793, p-value = 0.5603
shapiro.test(dat[dat$gender == "female", "weight"])
## 
##  Shapiro-Wilk normality test
## 
## data:  dat[dat$gender == "female", "weight"]
## W = 0.95875, p-value = 0.7852
mean(dat[dat$gender == "male", "weight"]) - mean(dat[dat$gender == "female", "weight"])
## [1] 20.13333
boxplot(weight ~ gender, data = dat)

t.test(weight ~ gender, data = dat)
## 
##  Welch Two Sample t-test
## 
## data:  weight by gender
## t = -6.8617, df = 15.08, p-value = 5.247e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.38443 -13.88224
## sample estimates:
## mean in group female   mean in group male 
##             55.78889             75.92222
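
Note that t.test defaults to the Welch test, which does not assume equal group variances. If equal variances were a reasonable assumption, the classic Student’s t-test could be requested explicitly; a small sketch (output not shown):

## Sketch: classic Student's t-test assuming equal variances
t.test(weight ~ gender, data = dat, var.equal = TRUE)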
  c. Run a one-tailed t-test using the formula approach and hypothesise that women’s weights are lower than men’s (hint: to find out which group is considered x and which y when using the formula interface, use levels(dat$gender); whatever is listed first is considered x in the t.test function).
str(dat)
## 'data.frame':    18 obs. of  2 variables:
##  $ weight: num  71.1 73.4 85.9 75.5 75.9 87 78.2 66.1 70.2 52.8 ...
##  $ gender: Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 1 ...
levels(dat$gender)
## [1] "female" "male"
t.test(weight ~ gender, data = dat, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  weight by gender
## t = -6.8617, df = 15.08, p-value = 2.624e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -14.99142
## sample estimates:
## mean in group female   mean in group male 
##             55.78889             75.92222
t.test(x = dat[dat$gender == "female", "weight"], y = dat[dat$gender == "male", "weight"], alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  dat[dat$gender == "female", "weight"] and dat[dat$gender == "male", "weight"]
## t = -6.8617, df = 15.08, p-value = 2.624e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -14.99142
## sample estimates:
## mean of x mean of y 
##  55.78889  75.92222
## Aggregate the data
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dat2 <- group_by(.data = dat, gender) %>%
  summarise(se = sd(weight)/sqrt(n()), # standard error of the mean
            sd = sd(weight),
            weight = mean(weight))
barplot(dat2$weight, ylim = c(0, 100), las = 1)
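
As an optional extension (a sketch, not part of the original solution), the standard errors stored in dat2 can be drawn on the barplot with arrows():

## Sketch: barplot of group means with +/- 1 standard error bars
bp <- barplot(dat2$weight, ylim = c(0, 100), las = 1,
              names.arg = as.character(dat2$gender))
arrows(x0 = bp, y0 = dat2$weight - dat2$se,
       x1 = bp, y1 = dat2$weight + dat2$se,
       angle = 90, code = 3, length = 0.1)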


2. One-sample t-test (consult the t.test help file and read up on the one-sample t-test).
    A biologist was interested in determining whether treating sunflower seedlings with an extract from Vinca minor roots results in a lower average height than the standard height of 15.7 cm. The biologist treated a random sample of n = 33 seedlings with the extract and subsequently obtained the following heights in cm:

11.5, 11.8, 15.7, 16.1, 14.1, 10.5, 15.2, 19.0, 12.8, 12.4, 19.2, 13.5, 16.5, 13.5, 14.4, 16.7, 10.9, 13.0, 15.1, 17.1, 13.3, 12.4, 8.5, 14.3, 12.9, 11.1, 15.0, 13.3, 15.8, 13.5, 9.3, 12.2, 10.3

Highlight and copy the data above (Ctrl + C on PC or Cmd + C on Mac) and read it into R using the scan() function.  

On PC use the following code to read in data from the clipboard:

# On PC use:
sunflower <- scan(file = "clipboard", sep = ",") # on PC


On Mac use the following code to read in data from the clipboard:

# On Mac use:
sunflower <- scan(pipe(description = "pbpaste"), sep = ",")
sunflower
##  [1] 11.5 11.8 15.7 16.1 14.1 10.5 15.2 19.0 12.8 12.4 19.2 13.5 16.5 13.5
## [15] 14.4 16.7 10.9 13.0 15.1 17.1 13.3 12.4  8.5 14.3 12.9 11.1 15.0 13.3
## [29] 15.8 13.5  9.3 12.2 10.3
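
If clipboard access is inconvenient (e.g. when working on a remote server), the same values can also be supplied directly to scan() via its text argument; a sketch for a fully reproducible script:

## Sketch: read the same values from a string instead of the clipboard
sunflower <- scan(text = "11.5, 11.8, 15.7, 16.1, 14.1, 10.5, 15.2, 19.0, 12.8, 12.4, 19.2, 13.5, 16.5, 13.5, 14.4, 16.7, 10.9, 13.0, 15.1, 17.1, 13.3, 12.4, 8.5, 14.3, 12.9, 11.1, 15.0, 13.3, 15.8, 13.5, 9.3, 12.2, 10.3", sep = ",")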

Now write down the null and alternative hypotheses for the mean (\(\mu\)). Note: Because the treatment is expected to have a negative effect, you should phrase the alternative hypothesis accordingly.

\[ \begin{eqnarray} H_0: \mu = 15.7 \\ H_A: \mu < 15.7 \end{eqnarray} \]

  a. Formulate a t-test for this scenario.

# Make sure you're dealing with a numeric vector
class(sunflower)
## [1] "numeric"
# Check for normally distributed data
par(mfrow = c(1, 2), cex.main = 0.8)
hist(sunflower, col = "lightgrey") # histogram to assess normality
qqnorm(sunflower) # Quantile-quantile plot for assessing normality
qqline(sunflower)
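
For completeness, the Shapiro-Wilk test used earlier can be applied here as well (a quick sketch; output not shown):

shapiro.test(sunflower) # formal normality check to complement the plots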

t.test(sunflower, mu = 15.7, alternative = "less")
## 
##  One Sample t-test
## 
## data:  sunflower
## t = -4.599, df = 32, p-value = 3.174e-05
## alternative hypothesis: true mean is less than 15.7
## 95 percent confidence interval:
##      -Inf 14.41366
## sample estimates:
## mean of x 
##  13.66364
  b. Can the null hypothesis be rejected? Yes, the P-value is smaller than the significance level \(\alpha\) = 0.05, thus lending support to the alternative hypothesis.

  c. What is the critical t-value for a significance level \(\alpha\) of 0.05, given n - 1 = 32 degrees of freedom?

qt(p = 0.05, df = 32)
## [1] -1.693889

We can also reverse engineer the t-value from our test result like this:

qt(p = 3.174e-05, df = 32)
## [1] -4.599033
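
To make the decision rule explicit, the observed t statistic can be compared directly with this critical value (a sketch):

## Sketch: reject H0 if the observed t falls below the critical value
t_obs  <- t.test(sunflower, mu = 15.7, alternative = "less")$statistic
t_crit <- qt(p = 0.05, df = length(sunflower) - 1)
t_obs < t_crit # TRUE: -4.599 lies below -1.694, so H0 is rejected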
3. Consider this small dataset on depression scale measurements in nine patients, taken at the first (x) and second (y) visit after initiation of a therapy (administration of a tranquilizer). The t-test procedure assumes independent samples. However, in this scenario the measurements are taken from the same patients before and after therapy, so the samples are not independent. In this case, we need to account for this dependence.
x <- c(1.83,  0.50,  1.62,  2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)

Run a paired t-test on the data.
a. Does the therapy show a statistically significant effect at the i) 0.05 significance level and the ii) 0.01 significance level?

t.test(x, y, paired = TRUE)
## 
##  Paired t-test
## 
## data:  x and y
## t = 3.0354, df = 8, p-value = 0.01618
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1037787 0.7599991
## sample estimates:
## mean of the differences 
##               0.4318889

At the 0.05 significance level, the therapy has a statistically significant effect (P = 0.016 < 0.05), but we conclude that there is no significant therapy effect if we apply a 0.01 significance level (P = 0.016 > 0.01).

  b. What is the power of this paired t-test? (Hint: carefully read the help file for the power.t.test function.) The power.t.test function is supplied here with the standard error of the difference in sample means, which is computed by pooling the two group variances, each divided by its sample size, and taking the square root [recall that the standard deviation \(\sigma\) is given as \(\sigma = \sqrt{var(x)}\)]:

\[ se_{pooled} = \sqrt{\frac{var(x)}{n_x} + \frac{var(y)}{n_y}} \]

## Calculate the pooled standard error
se.pooled <- sqrt(var(x)/length(x) + var(y)/length(y))

In the power.t.test function, this pooled standard error is passed to the sd argument, i.e. it is treated as a standard deviation (the standard error is nothing but the standard deviation divided by the square root of n), but don’t let that confuse you.

## Run the power t-test
power.t.test(n = 9, delta = mean(x) - mean(y), sd = se.pooled, type = "paired")
## 
##      Paired t test power calculation 
## 
##               n = 9
##           delta = 0.4318889
##              sd = 0.3583626
##       sig.level = 0.05
##           power = 0.8846098
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs

Our t-test has a power of 0.88; that is, given the level of replication, the magnitude of the true difference in means, and the pooled standard deviation in this dataset, we have an 88% probability of detecting a difference if it is really there.
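
As a follow-up sketch (not part of the original question), power.t.test can also be asked how many pairs would be required to reach, say, 90% power under the same assumptions (output not shown):

## Sketch: solve for the number of pairs needed for 90% power
power.t.test(delta = mean(x) - mean(y), sd = se.pooled,
             power = 0.9, type = "paired")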

Next, run a single sample t-test using the difference between x and y (i.e. x – y) as input data (perform the test with the default setting for mu = 0).

What is the null hypothesis we are testing?

\[ H_0: \mu_{x-y} = 0 \]

xy.diff <- x - y 
t.test(xy.diff, mu = 0)
## 
##  One Sample t-test
## 
## data:  xy.diff
## t = 3.0354, df = 8, p-value = 0.01618
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.1037787 0.7599991
## sample estimates:
## mean of x 
## 0.4318889

Compare the results to those of the paired t-test. What can you conclude about the inner workings of the paired t-test? The results are exactly the same, suggesting that the paired t-test procedure simply computes the before-after differences and then performs a one-sample t-test under the hood, with the null hypothesis that the mean difference is zero.
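
To see this equivalence explicitly, the paired t statistic can be computed by hand from the differences (a sketch):

## Sketch: paired t statistic computed manually from the differences
d <- x - y
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
t_manual # matches t = 3.0354 reported above
2 * pt(-abs(t_manual), df = length(d) - 1) # two-sided P-value, ~0.016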

4. Consider the built-in dataset ‘airquality’ and suppose we wanted to perform pairwise comparisons of ozone concentrations between all possible combinations of months. The dataset comprises five months, which yields ten pairwise comparisons. In other words, we would have to run ten t-tests by hand. R offers the pairwise.t.test function for this purpose.
data(airquality)
head(airquality) # inspect the first few rows
month.abb # built-in vector of abbreviated month names
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
## Create a factor month
month <- factor(airquality$Month, labels = month.abb[5:9])
pairwise.t.test(x = airquality$Ozone, g = month)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  airquality$Ozone and month 
## 
##     May     Jun     Jul     Aug    
## Jun 1.00000 -       -       -      
## Jul 0.00026 0.05113 -       -      
## Aug 0.00019 0.04987 1.00000 -      
## Sep 1.00000 1.00000 0.00488 0.00388
## 
## P value adjustment method: holm
pairwise.t.test(x = airquality$Ozone, g = airquality$Month)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  airquality$Ozone and airquality$Month 
## 
##   5       6       7       8      
## 6 1.00000 -       -       -      
## 7 0.00026 0.05113 -       -      
## 8 0.00019 0.04987 1.00000 -      
## 9 1.00000 1.00000 0.00488 0.00388
## 
## P value adjustment method: holm
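
The adjustment method can be changed via the p.adjust.method argument, e.g. to the more conservative Bonferroni correction (a sketch; output not shown):

## Sketch: same pairwise comparisons with a Bonferroni adjustment
pairwise.t.test(x = airquality$Ozone, g = month, p.adjust.method = "bonferroni")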
## Check out the choose function
choose(length(month.abb[5:9]), k = 2)
## [1] 10
  a. The output reports adjusted P-values. Why do we need to adjust (correct) the P-values of these ten t-tests? The reported P-values are adjusted to account for multiple testing on the same dataset (also referred to as multiplicity adjustment). Recall that the significance level \(\alpha\) = 0.05 also represents the Type I error probability, i.e. on average one test in twenty comes out significant just by chance, even when the null hypothesis is true.
  b. What is the P-value of the ozone comparison between July and September?
    The P-value for the July-September comparison is 0.00488.
  c. Is there a similar pairwise routine available for the non-parametric Wilcoxon test? Yes, it is the pairwise.wilcox.test function:
# Non-parametric alternative
pairwise.wilcox.test(x = airquality$Ozone, g = month)
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  airquality$Ozone and month 
## 
##     May    Jun    Jul    Aug   
## Jun 0.5775 -      -      -     
## Jul 0.0003 0.0848 -      -     
## Aug 0.0011 0.1295 1.0000 -     
## Sep 0.4744 1.0000 0.0060 0.0227
## 
## P value adjustment method: holm

Read the help file for the p.adjust function. At some stage of your research career you will need to adjust for multiple testing. Here is a dummy example:

pvals <- c(0.03, 0.08, 0.1, 0.29, 0.57, 0.001, 0.00087)
p.adjust(pvals, method = "BH")
## [1] 0.0700000 0.1400000 0.1400000 0.3383333 0.5700000 0.0035000 0.0035000
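
For context, here is a quick sketch of why such corrections matter: with ten unadjusted tests at \(\alpha\) = 0.05, the probability of at least one false positive is already about 40% (assuming independent tests).

## Sketch: family-wise error rate for ten independent tests at alpha = 0.05
1 - (1 - 0.05)^10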