October 1, 2014

Timetable

Tentative schedule for second half of semester

Week Lecture/Lab
8 t-test and extensions, power test, scientific plotting
9 correlation, regression, scientific plotting
10 comparing several means (ANOVA)
11 ANOVA extensions, ANOVA power test
12 Revision
13-15 Exam weeks

What will we do today?

  • Recap week 8: example questions on the Wilcoxon rank sum test and the power test
  • Correlation
  • Distribution R app
  • R skills / R demonstration continued

Example Question (1)

A p-value of 0.1 in a Wilcoxon test with \(n\) = 3 should not be overinterpreted because

  1. this is the smallest p-value you can expect
  2. there are only very few possible p-values at such a low \(n\)
  3. the better test to use at such a small sample size is the t-test
  4. both (1) and (2) are correct

Example Question (2)

What will be the power of a t-test if your expected difference between samples is 7, your standard deviation 5, and your sample size 8?

  1. It cannot be computed based on this information
  2. Too high for the experiment to make sense
  3. Far too low for the experiment to make sense
  4. Around 75%
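
This can be checked with power.t.test(); a minimal sketch, assuming a two-sample t-test with n = 8 per group and the usual 5% significance level:

power.t.test(n = 8, delta = 7, sd = 5, sig.level = 0.05,
             type = 'two.sample')
# power comes out at roughly 0.74, i.e. 'around 75%' (answer 4)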

Correlation

Measuring relationships between variables

A strong positive relationship: [scatter plot]

A weak positive relationship: [scatter plot]

No relationship: [scatter plot]

A negative relationship: [scatter plot]

Measuring relationships

Remember the notion of variance? It measures the variation within a single variable:

\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)

  • Why square the differences from the mean?
  • Why divide by \(n-1\)?
  • How does the variance relate to the standard deviation and the standard error? (See the sketch below.)
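
A minimal numerical sketch of these relations, using simulated data:

set.seed(42)
x <- rnorm(10)                 # 10 random values
n <- length(x)
sum((x - mean(x))^2) / (n - 1) # variance 'by hand'
var(x)                         # built-in variance: identical
sqrt(var(x))                   # standard deviation = sqrt(variance)
sd(x)                          # built-in: identical
sd(x) / sqrt(n)                # standard error of the mean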

Now we apply the same idea to two variables simultaneously

Covariance

  • We want to characterise how the two variables are related
  • We need to see whether, as one variable increases, the other increases, decreases or stays the same
  • We are not making any conclusion in terms of causality!

The metric we use is the covariance:

\(cov(x,y) = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{n-1}\)

Can we understand this formula?

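One way to see it is to compute the covariance by hand on two made-up example vectors and compare with R's built-in cov():

x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 7, 8, 12)
n <- length(x)
sum((x - mean(x)) * (y - mean(y))) / (n - 1) # covariance 'by hand'
cov(x, y)                                    # built-in: identical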

Issues with covariance

  • It depends on the units of measurement, e.g. the covariance of two variables measured in miles differs from the covariance of the same data measured in kilometers
  • We therefore have to standardise by dividing by the standard deviations of both variables
  • This is then called the correlation coefficient:

\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)

  • Independent of units
  • Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation); see the quick check below
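
A quick check, using the same made-up vectors as in the covariance sketch above: standardising the covariance reproduces what R's cor() computes.

x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 7, 8, 12)
cov(x, y) / (sd(x) * sd(y)) # correlation coefficient 'by hand'
cor(x, y)                   # built-in: identical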

Things to know about the correlation coefficient

  • It is an effect size
  • ± 0.1 = small effect
  • ± 0.3 = medium effect
  • ± 0.5 = large effect
  • The coefficient of determination, \(R^2\), is the proportion of variance in one variable shared by the other (see the sketch after this list).
  • The third-variable problem: in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results
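
A small simulated illustration of \(R^2\) as shared variance (all values below are made up):

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50) # y shares part of its variance with x
r <- cor(x, y)     # correlation coefficient R
r^2                # coefficient of determination R^2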

Calculating correlation coefficients in R

You can use cor.test() for most applications:

cor.test(rnorm(20), rnorm(20))
    Pearson's product-moment correlation

data:  rnorm(20) and rnorm(20)
t = 0.543, df = 18, p-value = 0.5938
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3344  0.5392
sample estimates:
   cor 
0.1269 

This gives you the estimated value of \(R\), a 95 percent confidence interval for it, and a p-value for testing the null hypothesis 'the true correlation between the two variables is zero'

Options to select from when testing or calculating correlations

The function cor.test() has several options (arguments) you can set.

  • 'method': you can select from 'pearson' (the default), 'spearman' and 'kendall' (note the lower-case spelling that R expects)
  • use 'pearson' for normally distributed data
  • use 'spearman' or 'kendall' for non-normal data; the latter is particularly good for small sample sizes
  • 'alternative': use 'two.sided' (the default) when testing for either a positive or a negative correlation, 'greater' for a positive correlation, and 'less' for a negative correlation

Correlation, example 1:

We would like to find out whether the number of murders occurring in US states correlates with the number of assaults, the number of rapes, and the percentage of the population living in urban areas. We use the data set USArrests for this:

data(USArrests) #load inbuilt data set
plot(USArrests) #'pairs' plot

[pairs plot of the four USArrests variables]

Correlation, example 1:

To get all the correlation coefficients between all variables at a glance, we can use cor()

cor(USArrests)
          Murder Assault UrbanPop   Rape
Murder   1.00000  0.8019  0.06957 0.5636
Assault  0.80187  1.0000  0.25887 0.6652
UrbanPop 0.06957  0.2589  1.00000 0.4113
Rape     0.56358  0.6652  0.41134 1.0000

Now see whether UrbanPop is significantly correlated with Assault:

cor.test(USArrests$Assault, USArrests$UrbanPop)

    Pearson's product-moment correlation

data:  USArrests$Assault and USArrests$UrbanPop
t = 1.857, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02099  0.50111
sample estimates:
   cor 
0.2589 

Correlation, example 1:

  • The correlation coefficient between Assault and UrbanPop is 0.26
  • The p-value for the null hypothesis 'Assault and UrbanPop are not significantly correlated' is 0.07
  • We therefore cannot reject the null hypothesis
  • Careful: we have not checked the normality assumption:

What happens if you use a non-parametric method? (In fact the data are not normal, so a non-parametric test should be used.)

cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')

    Kendall's rank correlation tau

data:  USArrests$Assault and USArrests$UrbanPop
z = 2.018, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
   tau 
0.1988 

Now the correlation is significant! However, we won't worry about how 'tau' is interpreted here.

Correlation, example 2:

We would like to see whether there is a positive correlation between the lightness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check the normality assumption first.

x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)

[scatter plot and Q-Q plots] The assumption of normality is violated (check also using shapiro.test()), so we will use the non-parametric (rank-based) correlation tests.
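
To complement the visual check, the Shapiro-Wilk test can be run on both variables; a small p-value indicates a departure from normality:

shapiro.test(x) # Shapiro-Wilk test on tuna lightness
shapiro.test(y) # and on the panel scores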

Correlation, example 2:

cor.test(x, y, method = 'kendall', alternative = 'greater')
    Kendall's rank correlation tau

data:  x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
   tau 
0.4444 
cor.test(x, y, method = 'spearman', alternative = 'greater')
    Spearman's rank correlation rho

data:  x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho 
0.6 

Plotting two continuous variables

Let us use the USArrests data set:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year')


Customising plots

Let us use the USArrests data set:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     main = 'Correlation plot')


Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')


Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = '% Urban population',
     ylab = 'Number of assaults/year',
     xlim = c(0, 100), ylim = c(0, 400),
     las = 1, tcl = .3, pch = 16, col = 'blue')
text(20, 350, 'p-value (Kendall) = 0.04')


What will we have learnt by the end of this week?

  • How to perform a correlation test
  • When to apply a parametric or non-parametric test
  • How to interpret an \(R\) and an \(R^2\)-value
  • How to do a 'pairs' plot
  • How to compute a correlation matrix
  • How to customise a plot of two continuous variables

Glossary

  • Variance, covariance
  • non-parametric correlation vs. parametric correlation
  • Pearson's correlation coefficient
  • Coefficient of determination (\(R^2\))
  • 'pairs' plot
  • Correlation matrix