‘Keep your eyes on that one, Tom… he’s not… you know… normal…’
What will we do today?
- Distribution app
- Recap week 9: Example questions
- How to analyse two continuous variables
- Correlation analysis
- Non-parametric alternatives
- Examples
- How to plot two continuous variables
- How to report results
- Reviewing lab reports
Distribution App / Lab report 6
https://gallery.shinyapps.io/dist_calc/
‘What is the probability of scoring x or higher when sampling randomly from data that follow a normal/Chi-squared/Poisson/… distribution’ (given the relevant parameters, which differ for every distribution)
Try to understand statistical testing in a generic sense! You obtain a test statistic that you compare against the distribution that statistic would have under random sampling (the null hypothesis)
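If you want the same numbers without the app, the p*() family of functions in R returns these tail probabilities directly; a minimal sketch (parameter values chosen arbitrarily for illustration):
pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE)  # standard normal: P(X >= 1.96) ~ 0.025
pchisq(7.81, df = 3, lower.tail = FALSE)           # Chi-squared with 3 df: ~0.05
ppois(9, lambda = 4, lower.tail = FALSE)           # Poisson(4): P(X >= 10)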
Recap week 9
- How to test a data set with two categorical variables
- What is a contingency table
- How to conduct and interpret a Chi-squared test in R
- How to manually conduct a Chi-squared test
- Other distributions: Binomial, Poisson, Uniform, Chi-squared…
Example questions (1)
If I told you that, for a given test, the test statistic follows the standard normal distribution, would you consider an observed value of 10 for that statistic a rare outcome?
- That would be relatively rare, yes
- This would be a very common outcome
- This would be extremely rare, approaching a probability of zero
- Your question does not make sense, you’d need to give us more information
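If in doubt, you can let R settle this; a one-line check:
pnorm(10, lower.tail = FALSE)  # ~7.6e-24: effectively zero, so such a value is extremely rare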
Example questions (2)
The Poisson distribution and the Chi-squared distribution have what in common?
1. Both are asymmetrical
2. Both are always symmetrical
3. Both only require one parameter to be specified
4. Both (1) and (3) are correct
Example questions (3)
If your test statistic is Chi-squared distributed with 3 degrees of freedom, would you consider a value of 10 significant?
- Yes, because this is a very rare value to occur by chance when sampling from a Chi-squared distribution with 3 degrees of freedom
- Yes, because any test statistic that follows a Chi-squared distribution with 3 degrees of freedom will be significant
- No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is very common
- No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is extremely rare
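Here, too, a one-line check in R gives the answer:
pchisq(10, df = 3, lower.tail = FALSE)  # ~0.019, i.e. rare at the conventional 5% level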
Example questions (4) (non-MC)
You would like to test whether the numbers of plane crashes, aborted take-offs, and go-arounds are associated with the airline alliance in which they occurred (Star Alliance, OneWorld, Skyteam).
Which test would you likely use?
Sketch out the data in the long format. Then, using some imaginary (but reasonable) numbers, write down the corresponding contingency table.
Perform your test on those numbers and interpret your results.
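One possible solution sketch; the counts below are entirely invented, as the question requests:
# Invented counts: rows = event type, columns = alliance
m <- matrix(c( 8,  6,  7,   # crashes
              20, 25, 18,   # aborted take-offs
              60, 70, 55),  # go-arounds
            nrow = 3, byrow = TRUE,
            dimnames = list(c('crash', 'aborted take-off', 'go-around'),
                            c('Star Alliance', 'OneWorld', 'Skyteam')))
chisq.test(m)  # tests whether event type and alliance are independent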
Example questions (5) (non-MC)
You are given a data set called ‘d1’ with 2 variables, one called ‘x’ (a continuous variable), and one called ‘y’ (a binomial variable). Now you want to test whether the two levels of y are different in x. Write out the code you would use to test this.
Now write down the code to plot this, using the function boxplot().
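One possible answer, sketched with a simulated stand-in for d1 (the real d1 would be provided):
d1 <- data.frame(x = rnorm(40), y = rep(c('A', 'B'), each = 20))  # stand-in data
t.test(x ~ y, data = d1)       # if x is roughly normal within each level of y
wilcox.test(x ~ y, data = d1)  # non-parametric alternative
boxplot(x ~ y, data = d1)      # one box per level of y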
Correlation
Measuring relationships between continuous variables
A strong positive relationship:
Correlation
Measuring relationships between variables
A weak positive relationship:
Correlation
Measuring relationships between variables
No relationship:
Correlation
Measuring relationships between variables
A negative relationship:
Measuring relationships
Remember the notion of variance? It measures the variation within a single variable:
\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)
- Why square the differences from the mean?
- Why divide by \(n-1\)?
- How does the variance relate to the standard deviation and the standard error?
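To make the formula concrete, here is a small made-up example computing the variance by hand and comparing it to R’s built-in var():
x <- c(2, 4, 4, 4, 5, 5, 7, 9)          # made-up numbers
sum((x - mean(x))^2) / (length(x) - 1)  # variance ‘by hand’: 4.571429
var(x)                                  # built-in, same result
sqrt(var(x)) == sd(x)                   # TRUE: the sd is the square root of the variance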
Now we apply the same idea to two variables simultaneously
Covariance
- We want to characterise how the two variables are related
- We need to see whether, as one variable increases, the other one increases, decreases or stays the same
- We are not making any conclusion in terms of causality!
The metric we use is the covariance:
\(cov(x,y) = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{n-1}\)
Can we understand this formula?
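Let’s try: a small made-up example, checked against R’s built-in cov():
x <- c(1, 2, 3, 4, 5)                                 # made-up numbers
y <- c(2, 4, 5, 4, 6)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # covariance ‘by hand’: 2
cov(x, y)                                             # built-in, same result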
Another great R app
https://shiny.rit.albany.edu/stat/rectangles/
Issues with covariance
- It depends on the units of measurement, e.g. the covariance of two variables measured in miles differs from the covariance of the same variables measured in kilometers
- We therefore have to standardise by dividing by the standard deviations of both variables
- This is then called the correlation coefficient:
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)
- Independent of units
- Ranges from -1 to 1, -1: perfect negative correlation, 1: perfect positive correlation
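Continuing the made-up example from the covariance slide, standardising indeed reproduces R’s cor():
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cov(x, y) / (sd(x) * sd(y))  # ‘by hand’: ~0.85
cor(x, y)                    # built-in, same value, always between -1 and 1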
Things to know about the correlation coefficient
- It is an effect size
- ± 0.1 = small effect
- ± 0.3 = medium effect
- ± 0.5 = large effect
- The coefficient of determination, \(R^2\) is the proportion of variance in one variable shared/explained by the other (more on this later)
- The third-variable problem (confounding): in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting or causing the correlation
https://rpsychologist.com/d3/correlation/
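As a quick illustration of \(R^2\), using the same made-up vectors as before:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cor(x, y)^2  # ~0.73: about 73% of the variance is shared between x and y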
Calculating correlation coefficients in R
The correlation test
You can use cor.test() for most applications:
set.seed(0)
cor.test(rnorm(20), rnorm(20))
Pearson's product-moment correlation
data: rnorm(20) and rnorm(20)
t = 0.33035, df = 18, p-value = 0.7449
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3778722 0.5028752
sample estimates:
cor
0.07762955
This gives you the value of \(R\), a confidence interval for it, and the error probability (p-value) for rejecting the null hypothesis ‘the true correlation is zero’ (i.e. the two variables are not correlated)
Calculating correlation coefficients in R
Interpreting the output
- What test statistic is the correlation test using?
- Therefore, what distribution are we comparing our test statistic against?
- How many degrees of freedom do we have? Why?
- What is the key assumption of this test? (normally distributed data!)
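As a hint towards the first question: the test statistic is a t value, and you can reproduce it from the reported \(R\) and the sample size:
r <- 0.07762955; n <- 20         # values from the output above
r * sqrt(n - 2) / sqrt(1 - r^2)  # 0.33035, matching the t in the cor.test() output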
Options to select from when testing or calculating correlations
The function cor.test() has several options (arguments) you can set.
- ‘method’: you can select from ‘pearson’ (the default), ‘spearman’ and ‘kendall’ (R expects these names in lower case)
- use ‘pearson’ for normally distributed data
- use ‘spearman’ or ‘kendall’ for non-normal data; the latter is particularly good for small sample sizes
- ‘alternative’: use ‘two.sided’ (the default) if testing for a positive OR negative correlation, ‘greater’ for a positive correlation, and ‘less’ for a negative correlation
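In code, with placeholder vectors x and y, these options look as follows:
x <- rnorm(20); y <- rnorm(20)                               # placeholder data
cor.test(x, y)                                               # Pearson, two-sided (the defaults)
cor.test(x, y, method = 'spearman')                          # non-parametric
cor.test(x, y, method = 'kendall', alternative = 'greater')  # one-sided, positive correlation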
Correlation, example 1:
We would like to find out whether the number of murders occurring in US cities correlates with the number of assaults, the number of rapes, and the percentage of the population living in urban areas. We use the data set USArrests for this:
data(USArrests) #load the inbuilt data set 'USArrests'
pairs(USArrests) #'pairs' plot
Correlation, example 1:
To get all the correlation coefficients between all variables at a glance, we can use cor()
cor(USArrests)
             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.8018733 0.06957262 0.5635788
Assault  0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape     0.56357883 0.6652412 0.41134124 1.0000000
Why are all the correlation coefficients along the diagonal equal to one?
Alternative pairs plot, using ggplot2
## install.packages("ggplot2"); install.packages("GGally") #install packages
library("ggplot2") #load ggplot2 package
library("GGally") #load GGally package
ggpairs(swiss) #demonstrated here on the built-in 'swiss' data set
Correlation, example 1:
Now let’s see whether UrbanPop is significantly correlated with Assault:
cor.test(USArrests$Assault, USArrests$UrbanPop)
Pearson's product-moment correlation
data: USArrests$Assault and USArrests$UrbanPop
t = 1.8568, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02098836 0.50111118
sample estimates:
cor
0.2588717
Correlation, example 1:
- The correlation coefficient between Assault and UrbanPop is 0.26
- The p-value for the null hypothesis ‘Assault and UrbanPop are not significantly correlated’ is 0.07
- We therefore cannot reject the null hypothesis
- Careful: we have not checked the normality assumption:
What happens if you use a non-parametric method? (In fact the data look non-normal [test for it!] and a non-parametric test should be used)
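One way to run that check yourself, using Q-Q plots and Shapiro-Wilk tests:
qqnorm(USArrests$Assault); qqline(USArrests$Assault)
qqnorm(USArrests$UrbanPop); qqline(USArrests$UrbanPop)
shapiro.test(USArrests$Assault)  # a small p-value indicates deviation from normality
shapiro.test(USArrests$UrbanPop)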
Correlation, example 1:
cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')
Kendall's rank correlation tau
data: USArrests$Assault and USArrests$UrbanPop
z = 2.0182, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.1988482
Now the correlation is significant! However, we won’t worry about how ‘tau’ is interpreted here. We simply conclude: ‘The number of assaults across US states is significantly correlated with the percentage of people living in urban areas (Kendall’s correlation test, p < 0.05).’
Correlation, example 2:
We would like to see whether there is a positive correlation between the tenderness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check for the normality assumption first.
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)
The assumption of normality seems violated (check also using shapiro.test()) and the sample size is small, so we will use the non-parametric (rank-based) correlation tests.
Correlation, example 2:
cor.test(x, y, method = 'kendall', alternative = 'greater')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman', alternative = 'greater')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6
Correlation, example 2:
Was it correct to test one-sided (i.e. using the argument alternative = 'greater')?
What would happen to the p-value if we tested two-sided?
cor.test(x, y, method = 'kendall')
cor.test(x, y, method = 'spearman')
- Why is there no ‘alternative’ argument specified this time?
- The p-values are now both non-significant, why? Did you expect this?
Plotting two continuous variables
Let us use the USArrests data set to produce a ‘scatter plot’:
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year')
Customising plots
Let us use the USArrests data set:
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400), #Note that plots usually don't have titles
main = 'Correlation plot') #but titles are useful on exploratory plots!
Customising plots
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')
Customising plots
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
las = 1, tcl = .3, pch = 16, col = 'blue')
text(x = 20, y = 350, 'p-value (Kendall) = 0.04')
Example data analysis
In the data set ‘iris’, does petal length correlate with sepal length? Take all the necessary steps to test for this correlation and conclude.
Is your answer true for all three species?
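A possible sequence of steps, as a sketch; which correlation method is appropriate depends on what your normality checks show:
data(iris) #built-in data set
plot(iris$Petal.Length, iris$Sepal.Length)
qqnorm(iris$Petal.Length); qqline(iris$Petal.Length) #check normality first
qqnorm(iris$Sepal.Length); qqline(iris$Sepal.Length)
cor.test(iris$Petal.Length, iris$Sepal.Length, method = 'kendall')
#Then repeat within each species, e.g.:
setosa <- subset(iris, Species == 'setosa')
cor.test(setosa$Petal.Length, setosa$Sepal.Length, method = 'kendall')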
Reporting statistical results in a text
T-test/Wilcoxon test:
- Significant: E.g. ‘Rats on treated islands were significantly lighter than rats on control islands (t-test, p = 0.0287)’
- Non-significant: E.g. ‘Rats on treated islands were not significantly lighter than rats on control islands (t-test, p = 0.64)’
- Wilcoxon rank-sum test: ditto, but ‘…(Wilcoxon rank-sum test, p = …)’
Chi-squared test:
- Significant: E.g. ‘Significantly more females than males were bullied (Chi-squared test, p < 0.0001)’
- Non-significant: E.g. ‘Bullying was not dependent on gender (Chi-squared test, p = 0.43)’
Reporting statistical results in a text
Correlation test:
- Significant: E.g. ‘Antibody levels were significantly correlated with rat size (Pearson’s correlation test, p = 0.01545)’
- Non-significant: E.g. ‘Antibody levels were not significantly correlated with rat size (Pearson’s correlation test, p = 0.92)’
- Non-parametric correlation tests: ditto, but ‘…(Kendall’s test, p = …)’
P-values
- Either indicate exact p-value (preferred)
- Or categorise them, e.g. p < 0.05, p < 0.01, p < 0.001
- Often, asterisks are used to designate significance in a plot: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001
What will we have learnt by the end of this week?
- How to perform a correlation test
- When to apply a parametric or non-parametric test
- How to interpret an \(R\) and an \(R^2\)-value
- How to do a ‘pairs’ plot
- How to compute a correlation matrix
- How to customise a plot of two continuous variables (scatter plot)
- How to report results in a text
Glossary
- Variance, covariance
- Non-parametric correlation vs. parametric correlation
- Pearson’s correlation coefficient
- Spearman’s and Kendall’s correlation coefficient
- Coefficient of determination (\(R^2\))
- ‘pairs’ plot
- Scatter plot
- Correlation matrix