October 1, 2014

Timetable

Tentative schedule for second half of semester

Week Lecture/Lab
8 t-test and extensions, power test, scientific plotting
9 correlation, regression, scientific plotting
10 comparing several means (ANOVA)
11 ANOVA extensions, ANOVA power test
12 Revision
13-15 Exam weeks

What will we do today?

  • Recap week 8: example questions on the Wilcoxon rank sum test and the power test
  • Correlation
  • Distribution R app
  • R skills / R demonstration continued

Example Question (1)

A p-value of 0.1 in a Wilcoxon test with \(n\) = 3 should not be overinterpreted because

  1. this is the smallest p-value you can expect
  2. there are only very few possible p-values at such a low \(n\)
  3. the better test to use at such a small sample size is the t-test
  4. both (1) and (2) are correct

Example Question (2)

What will be the power of a t-test if your expected difference between samples is 7, your standard deviation 5, and your sample size 8?

  1. It cannot be computed based on this information
  2. Too high for the experiment to make sense
  3. Far too low for the experiment to make sense
  4. Around 75%
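
This can be checked with power.t.test(); a minimal sketch, assuming a two-sample t-test with n = 8 per group and the usual 5% significance level:

power.t.test(n = 8, delta = 7, sd = 5, sig.level = 0.05,
             type = 'two.sample')
# power comes out at roughly 0.74, i.e. 'around 75%' (answer 4)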

Correlation

Measuring relationships between variables

A strong positive relationship: [scatter plot]

A weak positive relationship: [scatter plot]

No relationship: [scatter plot]

A negative relationship: [scatter plot]

Measuring relationships

Remember the notion of variance? It measures the variation within a single variable:

\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)

  • Why square the differences from the mean?
  • Why divide by \(n-1\)?
  • How does the variance relate to the standard deviation and the standard error? (See the sketch below.)
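
A minimal numerical sketch of these relations, using simulated data:

set.seed(42)
x <- rnorm(10)                 # 10 random values
n <- length(x)
sum((x - mean(x))^2) / (n - 1) # variance 'by hand'
var(x)                         # built-in variance: identical
sqrt(var(x))                   # standard deviation = sqrt(variance)
sd(x)                          # built-in: identical
sd(x) / sqrt(n)                # standard error of the mean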

Now we apply the same idea to two variables simultaneously

Covariance

  • We want to characterise how the two variables are related
  • We need to see whether, as one variable increases, the other increases, decreases or stays the same
  • We are not making any conclusion in terms of causality!

The metric we use is the covariance:

\(cov(x,y) = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{n-1}\)

Can we understand this formula?

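One way to see it is to compute the covariance by hand on two made-up example vectors and compare with R's built-in cov():

x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 7, 8, 12)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1) # covariance 'by hand'
cov(x, y)                                    # built-in: identical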

Issues with covariance

  • It depends on the units of measurement, e.g. the covariance of two variables measured in miles differs from the covariance of the same data measured in kilometers
  • We therefore have to standardise by dividing by the standard deviations of both variables
  • This is then called the correlation coefficient:

\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)

  • Independent of units
  • Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation); see the quick check below
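
A quick check, using the same made-up vectors as in the covariance sketch above: standardising the covariance reproduces what R's cor() computes.

x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 7, 8, 12)
cov(x, y) / (sd(x) * sd(y)) # correlation coefficient 'by hand'
cor(x, y)                   # built-in: identical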

Things to know about the correlation coefficient

  • It is an effect size
  • ± 0.1 = small effect
  • ± 0.3 = medium effect
  • ± 0.5 = large effect
  • The coefficient of determination, \(R^2\), is the proportion of variance in one variable shared by the other (see the sketch after this list).
  • The third-variable problem: in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results
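
A small simulated illustration of \(R^2\) as shared variance (all values below are made up):

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50) # y shares part of its variance with x
r <- cor(x, y)     # correlation coefficient R
r^2                # coefficient of determination R^2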

Calculating correlation coefficients in R

You can use cor.test() for most applications:

cor.test(rnorm(20), rnorm(20))
    Pearson's product-moment correlation

data:  rnorm(20) and rnorm(20)
t = 0.543, df = 18, p-value = 0.5938
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3344  0.5392
sample estimates:
   cor 
0.1269 

This gives you the estimated value of \(R\), a 95 percent confidence interval for it, and a p-value for testing the null hypothesis 'the true correlation between the two variables is zero'

Options to select from when testing or calculating correlations

The function cor.test() has several options (arguments) you can set.

  • 'method': you can select from 'pearson' (the default), 'spearman' and 'kendall' (note the lower-case spelling that R expects)
  • use 'pearson' for normally distributed data
  • use 'spearman' or 'kendall' for non-normal data; the latter is particularly good for small sample sizes
  • 'alternative': use 'two.sided' (the default) when testing for either a positive or a negative correlation, 'greater' for a positive correlation, and 'less' for a negative correlation

Correlation, example 1:

We would like to find out whether the number of murders occurring in US states correlates with the number of assaults, the number of rapes, and the percentage of the population living in urban areas. We use the data set USArrests for this:

data(USArrests) #load inbuilt data set
plot(USArrests) #'pairs' plot

[pairs plot of the four USArrests variables]

Correlation, example 1:

To get all the correlation coefficients between all variables at a glance, we can use cor()

cor(USArrests)
          Murder Assault UrbanPop   Rape
Murder   1.00000  0.8019  0.06957 0.5636
Assault  0.80187  1.0000  0.25887 0.6652
UrbanPop 0.06957  0.2589  1.00000 0.4113
Rape     0.56358  0.6652  0.41134 1.0000

Now see whether UrbanPop is significantly correlated with Assault:

cor.test(USArrests$Assault, USArrests$UrbanPop)

    Pearson's product-moment correlation

data:  USArrests$Assault and USArrests$UrbanPop
t = 1.857, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02099  0.50111
sample estimates:
   cor 
0.2589 

Correlation, example 1:

  • The correlation coefficient between Assault and UrbanPop is 0.26
  • The p-value for the null hypothesis 'Assault and UrbanPop are not significantly correlated' is 0.07
  • We therefore cannot reject the null hypothesis
  • Careful: we have not checked the normality assumption:

What happens if you use a non-parametric method? (In fact the data are not normal, so a non-parametric test should be used.)

cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')

    Kendall's rank correlation tau

data:  USArrests$Assault and USArrests$UrbanPop
z = 2.018, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
   tau 
0.1988 

Now the correlation is significant! However, we won't worry about how 'tau' is interpreted here.

Correlation, example 2:

We would like to see whether there is a positive correlation between the lightness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check the normality assumption first.

x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)

[scatter plot and Q-Q plots] The assumption of normality is violated (check also using shapiro.test()), so we will use the non-parametric (rank-based) correlation tests.
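
To complement the visual check, the Shapiro-Wilk test can be run on both variables; a small p-value indicates a departure from normality:

shapiro.test(x) # Shapiro-Wilk test on tuna lightness
shapiro.test(y) # and on the panel scores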

Correlation, example 2:

cor.test(x, y, method = 'kendall', alternative = 'greater')
    Kendall's rank correlation tau

data:  x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
   tau 
0.4444 
cor.test(x, y, method = 'spearman', alternative = 'greater')
    Spearman's rank correlation rho

data:  x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho 
0.6 

Plotting two continuous variables

Let us use the USArrests data set:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year')


Customising plots

Let us use the USArrests data set:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     main = 'Correlation plot')


Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')


Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     las = 1, tcl = .3, pch = 16, col = 'blue')
text(20, 350, 'p-value (Kendall) = 0.04')


What will we have learnt by the end of this week?

  • How to perform a correlation test
  • When to apply a parametric or non-parametric test
  • How to interpret an \(R\) and an \(R^2\)-value
  • How to do a 'pairs' plot
  • How to compute a correlation matrix
  • How to customise a plot of two continuous variables

Glossary

  • Variance, covariance
  • non-parametric correlation vs. parametric correlation
  • Pearson's correlation coefficient
  • Coefficient of determination (\(R^2\))
  • 'pairs' plot
  • Correlation matrix