Learning targets

  1. Understand the terms ‘standard deviation’, ‘variance’, and ‘degrees of freedom’.

  2. Understand the process of a simple statistical test.

  3. Get to know the different types of t-tests, including a non-parametric alternative.

  4. Understand the notion of statistical power. Conduct a power analysis for a t-test with power.t.test().

Example exam question from the last lab

You have a population of 87 snails with a mean shell size of 13.4 mm. You determine that the standard deviation is 2.04 mm. Simulate this population and draw a histogram.
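
One possible solution (a sketch; snails is just the name chosen here):

## Simulate 87 shell sizes with the given mean and standard deviation
snails <- rnorm(n = 87, mean = 13.4, sd = 2.04)
hist(snails, xlab = "Shell size (mm)", main = "Simulated snail population")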

Variance and standard deviation

How do we get a meaningful measure of spread around the mean of our data? Intuitively, one would calculate the differences between the individual observations and their mean and sum those differences up. To obtain an average deviation from the mean, we would then divide this sum of differences by the total number of observations (n). However, because our observations will be partly larger and partly smaller than the mean, we end up with positive and negative differences, which will partially or completely cancel each other out. A simple mathematical trick to get around this issue is to square the differences before we sum them up. This provides an unbiased estimate of the variation around the mean; however, the squaring makes this variation look much bigger than it actually is, right? In order to get back to the scale of the mean, we can simply take the square root of the variance, which results in the so-called “standard deviation”. Here are a few figures that highlight the concepts:

## Simply adding the differences between observations and the mean results in equalisation
plot(1, 1, ylim = c(0, 12), xlim = c(0, 10), type = "n", ann = F, tcl = -0.4, las = 1)
abline(h = 5) # line indicating the mean
x <- 1:4
y <- c(7, 7, 3, 3)
points(x, y, pch = 21, bg = "black")
segments(x, rep(5, 4), x, y)
mtext("Sum of differences divided by the number of observations\n(no good):", side = 3, line = 0.5, at = 0, adj = 0, cex = 0.8)
text(6, 10, expression(frac(2 + 2 + group("(", -2, ")") + group("(", -2, ")"), 4) == 0), cex = 0.9)

## Solution: square the differences first to make negative differences positive
plot(1, 1, ylim = c(0, 12), xlim = c(0, 10), type = "n", ann = F, tcl = -0.3, las = 1)
abline(h = 5)
x <- 1:4
y <- c(7, 7, 3, 3)
points(x, y, pch = 21, bg = "black")
segments(x, rep(5, 4), x, y)
mtext(bquote(paste(bold("Trick"), ": Square the differences first before adding them up")), 
      side = 3, line = 1.8, at = 0, adj = 0, cex = 0.8)
mtext("and dividing their sum by the number of observations:", side = 3, line = 0.8, 
      at = 0, adj = 0, cex = 0.8)
text(0, 11.75, "Population variance:", adj = 0, cex = 0.7)
text(5, 10.5, expression(paste({sigma^2 == ~frac(sum((x[i] - bar(x))^2, i==1, n), 
                                italic("n"))} == {frac((7-5)^2 + (7-5)^2 + (3-5)^2 + 
                                                         (3-5)^2, 4) == 4})), cex = 0.7)

text(0, 9, "Population standard deviation:", adj = 0, cex = 0.7)
text(6.5, 7.25, expression(paste(sigma, " = ", {sqrt(frac(sum((x[i] - bar(x))^2, i==1, n), 
                                  italic("n"))) == sqrt("variance")} == {sqrt(4) == 2})), 
                                  cex = 0.7)

text(5.5, 4.25, "Sample variance and\nstandard deviation:", adj = 0, cex = 0.7)
text(5.75, 2.75, expression(paste({s^2 == ~frac(sum((x[i] - bar(x))^2, i==1, n), 
                                   italic("n") - 1)} == 5.333)), adj = 0, cex = 0.7)
text(5.75, 0.75, expression(paste("s = ", sqrt(frac(sum((x[i] - bar(x))^2, i==1, n),
                                  italic("n") - 1)) == 2.31)), adj = 0, cex = 0.7)

These calculations of the variance and the standard deviation are valid for entire populations. However, in reality we rarely have the time and resources to look at an entire population. We usually take a random sample of a population and hope that the mean, variance and standard deviation of this sample are fairly close to the “real” population values. For example, when we measure tree growth, we would not sample all the trees in a forest (too time-consuming and expensive) but only a limited number of trees - a random sample. The mean and variance of a random sample will very rarely be exactly the same as the population mean and variance. In particular, because the sample mean is calculated from the sample itself, the observations lie on average closer to it than to the true population mean, so dividing by n would systematically underestimate the variance. Therefore, we correct the sample variance and standard deviation by dividing by n-1 instead of n.

Standard deviation vs. standard error

The variability of a sample can be indicated by comparing individual values to the mean, for example by summing up the differences between the individual data points and their mean. To ‘punish’ those values farthest away from the mean and to get rid of negative numbers, those differences are squared. In order to make the measure independent of the sample size, we divide by the latter. (Actually, we divide by n-1, why?) Finally, to get back to our original units, we take the square root. This leads to the definition of the standard deviation, a measure for the spread of a sample: \[ sd = \sqrt{\frac{\sum{(x_i - \overline{x})^2}}{(n-1)}} \] The standard error, on the other hand, indicates how well we are able to estimate the mean. It is therefore dependent on the sample size - the higher the sample size, the better the estimate: \[ se = \frac{sd}{\sqrt{n}} \] There is no built-in R function to calculate the standard error, possibly because it is an excellent starting point to get into writing your own functions :-) And here is how it is done:

set.seed(321) # this command fixes the seed of the random number generator, so that we all 
# get the same random set of values (always needs to be executed in tandem with the rnorm function)
dat <- rnorm(n = 100)

## Calculate the standard error by hand
sd(dat)/sqrt(length(dat))
## [1] 0.09557693
## Wrap the algorithm up in a custom function
se <- function(x) {sd(x)/sqrt(length(x))}
se(dat)
## [1] 0.09557693

If you are interested further, short simulations in R can illustrate both why we divide by \(n-1\) (the degrees of freedom!) rather than by \(n\) for the standard deviation, and how the standard deviation differs from the standard error; one such simulation for the first question is sketched below. On bar plots, we normally show means with ‘antennas’ (error bars) of one standard error, to show how well we estimate that mean.
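
A minimal sketch of such a simulation (the exact numbers will vary slightly from run to run): with samples of size 5, dividing the summed squared deviations by n underestimates the true variance of 1 on average by the factor (n-1)/n = 0.8, while dividing by n-1 does not.

set.seed(321)
n <- 5
## estimate the variance of many small samples, dividing by n vs. by n - 1
biased   <- replicate(10000, {x <- rnorm(n); sum((x - mean(x))^2) / n})
unbiased <- replicate(10000, {x <- rnorm(n); sum((x - mean(x))^2) / (n - 1)})
mean(biased)   # close to (n - 1)/n = 0.8, i.e. too small
mean(unbiased) # close to the true variance of 1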

You should familiarise yourself with the notion of ‘degrees of freedom’. You will find several different explanations on the internet and in books. An intuitive one is that estimating metrics or parameters ‘eats up’ degrees of freedom. For example, if you can freely choose 5 numbers, you have 5 degrees of freedom. However, when I tell you that the mean of those numbers is to be 10, you lose exactly 1 degree of freedom: you can choose 4 numbers freely, but the fifth one is locked, as the five numbers have to have a mean of 10!
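
A quick illustration of this with made-up numbers:

free <- c(4, 12, 9, 15)      # four numbers chosen freely
locked <- 5 * 10 - sum(free) # the fifth is fixed by the required mean of 10
locked
## [1] 10
mean(c(free, locked))        # the constraint holds
## [1] 10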

As an exercise, make up a vector of any 10 numbers, then calculate (by hand, using R functions) the standard deviation and the standard error for this sample.
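
One possible solution (any 10 numbers will do):

x <- c(12, 15, 9, 11, 14, 10, 13, 8, 16, 12) # a made-up sample of 10 numbers
n <- length(x)
## standard deviation by hand: sqrt of the summed squared deviations over n - 1
sqrt(sum((x - mean(x))^2) / (n - 1))
## [1] 2.581989
sd(x) # the built-in function agrees
## [1] 2.581989
## standard error: standard deviation divided by sqrt(n)
sd(x) / sqrt(n)
## [1] 0.8164966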

The basics of a statistical test

The fundamental idea behind statistical inference is always the following: ‘What are the odds that our data is no different from random numbers?’ Therefore, we always calculate a test statistic and then compare it to the random distribution of this particular statistic. This is hard to grasp in theory but much more comprehensible when illustrated by examples. Let us run through a simple example where we draw a sample of 30 from a base population of 10,000 values, called ‘pop’.

set.seed(321)
pop <- rnorm(10000)
s1 <- sample(pop, 30)
h1 <- hist(pop)

h2 <- hist(s1, col = "lightgrey")

## Plot the two histograms together
hist(pop, freq = F, ylim = c(0, 0.6), col = rgb(0, 0, 1, 1/4))
hist(s1, breaks = h1$breaks, freq = F, add = T, col = rgb(1, 0, 0, 1/4), ylim = c(0, 0.6))

We pretend that we do not know whether our sample actually originates from our base population or not, but we do know that the mean of the base population is zero. We also know that for samples of size 30 taken from a normal distribution, the standardised sample mean (the mean divided by its standard error) follows a t-distribution with 29 degrees of freedom. Why it does not follow a normal distribution is a bit more difficult to explain, so just accept this for now. Also, we know that a t-distribution approaches a normal distribution as the sample size increases. We now follow a process to statistically test whether our sample originates from ‘pop’:

qt(p = 0.025, df = 29) # df indicates the degrees of freedom
## [1] -2.04523
qt(p = 0.975, df = 29)
## [1] 2.04523

The two quantiles are the same with opposite signs, as the t-distribution is symmetrical.

mean(s1)
## [1] -0.03511305
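
To complete the test, we bring the sample mean onto the t scale by dividing it by its standard error and check whether the resulting statistic falls between the two critical quantiles (a sketch of this last step):

## t statistic: (sample mean - hypothesised mean) / standard error
t.stat <- (mean(s1) - 0) / (sd(s1) / sqrt(length(s1)))
## does it fall inside the central 95% of the t-distribution?
abs(t.stat) < qt(p = 0.975, df = 29) # TRUE here: the sample is consistent with 'pop'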

Now play with this further by setting up a second population of 1000 values, centred around e.g. 2 with standard deviation 1. If you now sample from your second population, you should detect this with the simple statistical test outlined above (see the sketch below). In the age of computer-assisted analyses, calculating a test statistic by hand and comparing it to tabulated quantiles is not necessary any more. Instead, the output of a statistical test usually provides the probability of obtaining a test statistic at least as extreme as ours if our sample really were just random noise from the null population. This metric is called the p-value and is a standard output of simple statistical tests.
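
Here is a sketch for that exercise (pop2 and s2 are names chosen here; we still test against the known mean of zero):

set.seed(321)
pop2 <- rnorm(1000, mean = 2, sd = 1) # second population centred around 2
s2 <- sample(pop2, 30)
t.stat2 <- (mean(s2) - 0) / (sd(s2) / sqrt(30)) # still testing against a mean of 0
abs(t.stat2) > qt(p = 0.975, df = 29) # TRUE: s2 clearly does not come from 'pop'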

Simple two-sample tests

If we now want to compare two groups (rather than testing whether a single sample originates from a certain population, as above), we can use a t-test. The mechanics are very much the same as above - we calculate our statistic based on the difference between the samples and their standard deviations (R will do this for us) and then compare it to a random distribution of this statistic. In the t-test, this statistic is (again) a t-value following a t-distribution. All we need to do is look at the p-value of the output, which indicates the probability of this statistic (i.e. the difference between the samples, given their standard deviations) occurring at random. If this value is sufficiently small (normally < 5%), we state that the two samples are ‘significantly different’. Look at the example below:

t.test(x = rnorm(20, mean = 0, sd = 1), y = rnorm(20, mean = 3, sd = 3), alternative = "less")
## 
##     Welch Two Sample t-test
## 
## data:  rnorm(20, mean = 0, sd = 1) and rnorm(20, mean = 3, sd = 3)
## t = -3.812, df = 23.772, p-value = 0.0004286
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -1.282709
## sample estimates:
##   mean of x   mean of y 
## -0.04614292  2.28178390 

wilcox.test(x = rnorm(20, mean = 0, sd = 1), y = rnorm(20, mean = 3, sd = 3))
## 
##     Wilcoxon rank sum test
## 
## data:  rnorm(20, mean = 0, sd = 1) and rnorm(20, mean = 3, sd = 3)
## W = 99, p-value = 0.005618
## alternative hypothesis: true location shift is not equal to 0

Create an example (simulate numbers) that shows an application of a paired t-test. One possible approach is sketched below.
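
One way to do this (a sketch; think of the same individuals measured before and after a treatment, so the two measurements are paired):

set.seed(321)
before <- rnorm(15, mean = 50, sd = 10)        # e.g. body mass before treatment
after  <- before + rnorm(15, mean = 3, sd = 2) # each individual gains about 3 units
t.test(x = before, y = after, paired = TRUE)   # paired t-test on the differences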

Again with an unpaired t-test, repeatedly (use the ‘up-arrow’ key!) test rnorm(20) against rnorm(20). What can you conclude?
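
Instead of pressing the up-arrow many times, you can also let R repeat the test and count the ‘significant’ results (a sketch; since both samples come from the same distribution, every significant result is a false positive):

p.values <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p.values < 0.05) # proportion of false positives, around 0.05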

In the above t-tests, what assumption are you making regarding the distribution of the base population? Show the use of the Wilcoxon rank sum test, a ‘non-parametric’ alternative to the t-test. In what cases would you use this test?

The power of a t-test

In statistics, the ‘power’ of a test is its ability to detect a difference (one that in fact exists), e.g. between two samples. In a t-test, the power depends on: the desired significance level \(\alpha\), the sample size, the standard deviation of the samples, and the true difference between the samples. Before you start an experiment or an observational study, the fundamental question of how many samples to take will arise. Because of the above dependencies, the answer to this question requires an estimate of (1) the variation in your population and (2) the expected difference between treatments or groups. To get estimates for these metrics, you usually have to conduct a pilot study. This is usually a good investment: think of a situation where an expensive experiment is conducted but it is later found that, given the variability in the data, the expected difference could never have been detected. For example, if your expected difference is 1 (arbitrary units), the mean standard deviation in your samples is 3 and your sample size is 10, your power is as low as 10%! In other words, you have only a 10% chance of detecting a true difference of 1 between your groups! In this case, you would have to increase your sample size dramatically, for example to about 150, to reach a power of approximately 80%.

R has a nice function to juggle with sample size, significance level, power, standard deviation, and the expected difference between groups when conducting a t-test: power.t.test(). Use this function to solve the following problem:

Imagine you are asked to determine whether there is a significant difference in, say, blood chemistry between two populations of mice. The lab work to get a single sample is substantial and involves expensive chemicals. So you want to take enough, but not too many samples. Your supervisor tells you that from past experience, you can expect a difference of about 5 (let’s not worry about units for now). Here are your two mouse populations:

mice1 <- rnorm(1000, mean = 30, sd = 7)
mice2 <- rnorm(1000, mean = 24, sd = 9)

Determine a reasonable sample size by estimating the mean standard deviations in a pilot study, which analyses 6 random mice, 3 from each population (use the function sample() for this, then power.t.test()).

## Examples from the power.t.test help file:
power.t.test(n = 20, delta = 1)
## 
##      Two-sample t test power calculation 
## 
##               n = 20
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.8689528
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power.t.test(power = 0.90, delta = 1)
## 
##      Two-sample t test power calculation 
## 
##               n = 22.0211
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power.t.test(power = 0.90, delta = 1, alternative = "one.sided")
## 
##      Two-sample t test power calculation 
## 
##               n = 17.84713
##           delta = 1
##              sd = 1
##       sig.level = 0.05
##           power = 0.9
##     alternative = one.sided
## 
## NOTE: n is number in *each* group
power.t.test(n = 10, delta = 1, sd = 2)
## 
##      Two-sample t test power calculation 
## 
##               n = 10
##           delta = 1
##              sd = 2
##       sig.level = 0.05
##           power = 0.1838375
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power.t.test(delta = 5, power = 0.8)
## 
##      Two-sample t test power calculation 
## 
##               n = 2.117245
##           delta = 5
##              sd = 1
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
mice.samples <- c(sample(x = mice1, size = 3, replace = F), sample(x = mice2, size = 3, replace = F))
sd(mice.samples)
## [1] 8.680657
power.t.test(delta = 5, sd = sd(mice.samples), power = 0.7)
## 
##      Two-sample t test power calculation 
## 
##               n = 38.1903
##           delta = 5
##              sd = 8.680657
##       sig.level = 0.05
##           power = 0.7
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
power.t.test(delta = 10, sd = 10, power = 0.8)
## 
##      Two-sample t test power calculation 
## 
##               n = 16.71477
##           delta = 10
##              sd = 10
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

As indicated above, the Wilcoxon rank sum test (also known as the Mann-Whitney U test) is a non-parametric alternative to the t-test. Non-parametric means that this type of test does not assume a certain distribution of your data. Having said that, you can certainly use non-parametric tests for normally distributed data as well.

wilcox.test(rnorm(20), rnorm(20, mean = 2))
## 
##  Wilcoxon rank sum test
## 
## data:  rnorm(20) and rnorm(20, mean = 2)
## W = 37, p-value = 1.654e-06
## alternative hypothesis: true location shift is not equal to 0

During the lab, we asked how to derive the probability for an interval. Consider the standard normal distribution (mean = 0, sd = 1) and say we would like to know the probability of values between -1 and 1, i.e. plus/minus one standard deviation from the mean. The solution to this problem is simple: we calculate the cumulative probability associated with 1 and subtract the cumulative probability associated with -1.

## Derive the probability for a range of values, here from -1 to +1 around a mean of 0.
pnorm(q = 1) - pnorm(q = -1) # tells us that 68% of the values should lie between plus/minus one standard deviation
## [1] 0.6826895
## Let's validate that by simulating a random sample of normally distributed values
set.seed(321) # this command fixes the seed of the random number generator, so that we all 
# get the same random set of values (always needs to be executed in tandem with the rnorm function)
dat <- rnorm(n = 100)

## We know how to get the mean and standard deviation
mean(dat)
## [1] 0.009067669
sd(dat)
## [1] 0.9557693
## Now let's compute the range (mean - sd to mean + sd)
msd.pos <- mean(dat) + sd(dat) # msd.pos = short for mean + standard deviation
msd.neg <- mean(dat) - sd(dat) # msd.neg = short for mean - standard deviation
msd.pos
## [1] 0.964837
msd.neg
## [1] -0.9467016
## Now confine the dat vector to this range of values and count them...
## should boil down to around 68 numbers; mind you, we are talking about 
## probabilities, so repeating this exercise a large number of times will 
## result in a mean number of about 68 (just run it several times and see what it returns).
dat1 <- dat[dat <= msd.pos & dat >= msd.neg]
range(dat1) # check whether values are in the specified range
## [1] -0.9096546  0.9538786
length(dat1) # should be around 68 (when repeated many times)
## [1] 69