October 3, 2022

Good morning

Week 10

‘Keep your eyes on that one, Tom… he’s not… you know… normal…’

What will we do today

  • Distribution app
  • Recap week 9: Example questions
  • How to analyse two continuous variables
  • Correlation analysis
  • Non-parametric alternatives
  • Examples
  • How to plot two continuous variables
  • How to report results
  • Reviewing lab reports

Distribution App / Lab report 6


https://gallery.shinyapps.io/dist_calc/

‘What is the probability of scoring x or higher when sampling randomly from data that follow a normal/Chi-squared/Poisson/… distribution?’ (given the relevant parameters, which differ between distributions)

Try to understand statistical testing in a generic sense! You obtain a test statistic and compare it against the distribution that statistic would follow by chance alone, i.e. if the null hypothesis were true.
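The same kind of question can also be asked directly in R, without the app. The sketch below uses the base R functions pnorm(), pchisq() and ppois(); the observed values and the parameters are made up purely for illustration.

```r
#Probability of observing a value at least as large as the one observed
#(upper tail); all numbers below are illustrative assumptions
p_norm  <- pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE) #standard normal
p_chisq <- pchisq(5, df = 2, lower.tail = FALSE)             #Chi-squared, 2 degrees of freedom
#For a discrete distribution, P(X >= 5) is the same as P(X > 4)
p_pois  <- ppois(4, lambda = 2, lower.tail = FALSE)          #Poisson with mean 2
p_norm
p_chisq
p_pois
```

Note that each distribution needs its own parameters (mean and standard deviation, degrees of freedom, or lambda), just as in the app.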

Recap week 9

  • How to test a data set with two categorical variables
  • What is a contingency table
  • How to conduct and interpret a Chi-squared test in R
  • How to manually conduct a Chi-squared test
  • Other distributions: Binomial, Poisson, Uniform, Chi-squared…

Example questions (1)


If I told you that, for a given test, the test statistic follows the standard normal distribution, would you consider an outcome of 10 for that statistic a rare one?

  1. That would be relatively rare, yes

  2. This would be a very common outcome

  3. This would be extremely rare, approaching a probability of zero

  4. Your question does not make sense, you’d need to give us more information

Example questions (2)


The Poisson distribution and the Chi-squared distribution have what in common?

  1. Both are asymmetrical

  2. Both are always symmetrical

  3. Both only require one parameter to be specified

  4. Both (1) and (3) are correct

Example questions (3)


If your test statistic is Chi-squared distributed with 3 degrees of freedom, would you consider a value of 10 significant?

  1. Yes, because this is a very rare value to occur by chance when sampling from a Chi-squared distribution with 3 degrees of freedom
  2. Yes, because any test statistic that follows a Chi-squared distribution with 3 degrees of freedom will be significant
  3. No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is very common
  4. No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is extremely rare

Example questions (4) (non-MC)


You would like to test whether the numbers of plane crashes, aborted take-offs, and go-arounds depend on the airline alliance they happened with (Star Alliance, OneWorld, SkyTeam).

  1. What test are you likely using?

  2. Sketch out the data in the long format. Then, using some imaginary (but reasonable) numbers, write down the corresponding contingency table.

  3. Perform your test on those numbers and interpret your results

Example questions (5) (non-MC)


You are given a data set called ‘d1’ with 2 variables: ‘x’ (a continuous variable) and ‘y’ (a binary variable, i.e. one with two levels). You want to test whether x differs between the two levels of y. Write out the code you would use to test this.

Now write down the code to plot this, using the function boxplot().

Correlation

Measuring relationships between continuous variables


A strong positive relationship:

Correlation

Measuring relationships between variables


A weak positive relationship:

Correlation

Measuring relationships between variables


No relationship:

Correlation

Measuring relationships between variables


A negative relationship:
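The four patterns named above (strong positive, weak positive, none, negative) can be simulated in R. The following is a minimal sketch with made-up data; the sample size and the amounts of noise are arbitrary assumptions, not values from the lecture.

```r
set.seed(1)                            #make the simulated data reproducible
n <- 100
x <- rnorm(n)
#The amount of random noise added to x controls the strength of the relationship
y_strong   <-  x + rnorm(n, sd = 0.3)  #strong positive
y_weak     <-  x + rnorm(n, sd = 3)    #weak positive
y_none     <-  rnorm(n)                #generated independently of x
y_negative <- -x + rnorm(n, sd = 0.5)  #negative
par(mfrow = c(2, 2))                   #arrange four plots in a 2 x 2 grid
plot(x, y_strong,   main = 'Strong positive')
plot(x, y_weak,     main = 'Weak positive')
plot(x, y_none,     main = 'No relationship')
plot(x, y_negative, main = 'Negative')
```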

Measuring relationships


Remember the notion of variance? - Measuring the variation within a single variable

\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)

  • Why square the differences from the mean?
  • Why divide by \(n-1\)?
  • How does the variance relate to the standard deviation and the standard error?

Now we apply the same idea to two variables simultaneously

Covariance

  • We want to characterise how the two variables are related
  • We need to see whether, as one variable increases, the other one increases, decreases or stays the same
  • We are not drawing any conclusions about causality!

The metric we use is the covariance:

\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)

Can we understand this formula?
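One way to get a feel for the formula is to compute it step by step and compare the result with R's inbuilt cov() function. The two short vectors below are made-up numbers chosen only to keep the arithmetic easy to follow.

```r
x <- c(2, 4, 6, 8, 10)   #made-up values
y <- c(1, 3, 2, 5, 4)    #made-up values
dx <- x - mean(x)        #deviation of each x from the mean of x
dy <- y - mean(y)        #deviation of each y from the mean of y
#A product is positive when x and y deviate from their means in the same direction
cov_manual <- sum(dx * dy) / (length(x) - 1)
cov_manual
cov(x, y)                #inbuilt function, gives the same value
```

If large values of x tend to go with large values of y, most products are positive and so is the covariance; if large x goes with small y, the covariance is negative.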

Another great R app

Issues with covariance

  • It depends on the units of measurement, e.g. the covariance of two variables measured in miles differs from the covariance of the same variables measured in kilometres
  • We therefore have to standardise by dividing by the standard deviations of both variables
  • This is then called the correlation coefficient:

\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{(n-1)s_xs_y}\)

  • Independent of units
  • Ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation)
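The effect of standardising can be checked with a small simulation. This is a sketch on made-up data: the variable names (dist_miles, time_min) and all numbers are assumptions for illustration only.

```r
set.seed(2)
dist_miles <- runif(30, min = 1, max = 50)          #simulated distances in miles
time_min   <- 3 * dist_miles + rnorm(30, sd = 20)   #simulated travel times in minutes
dist_km    <- dist_miles * 1.609344                 #the same distances in kilometres
cov(dist_miles, time_min)                           #the covariance changes with the unit...
cov(dist_km, time_min)
#Correlation coefficient: covariance divided by both standard deviations
r_manual <- cov(dist_miles, time_min) / (sd(dist_miles) * sd(time_min))
r_manual
cor(dist_miles, time_min)                           #...the correlation does not
cor(dist_km, time_min)
```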

Things to know about the correlation coefficient

  • It is an effect size; as a rough guide:
  • ± 0.1 = small effect
  • ± 0.3 = medium effect
  • ± 0.5 = large effect
  • The coefficient of determination, \(R^2\), is the proportion of variance in one variable shared with/explained by the other (more on this later)
  • The third-variable problem (confounding): in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting or causing the correlation

https://rpsychologist.com/d3/correlation/

Calculating correlation coefficients in R

The correlation test


You can use cor.test() for most applications:

set.seed(0)
cor.test(rnorm(20), rnorm(20))
    Pearson's product-moment correlation

data:  rnorm(20) and rnorm(20)
t = 0.33035, df = 18, p-value = 0.7449
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3778722  0.5028752
sample estimates:
       cor 
0.07762955 

This gives you the estimate of \(R\), a confidence interval for it, and a p-value for the null hypothesis ‘the true correlation between the two variables is zero’

Calculating correlation coefficients in R

Interpreting the output

  • What test statistic is the correlation test using?
  • Therefore, what distribution are we comparing our test statistic against?
  • How many degrees of freedom do we have? Why?
  • What is the key assumption of this test? (normally distributed data!)
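As a hint for the first three questions: for Pearson's correlation, the test statistic can be recomputed from \(R\) and the sample size \(n\) as \(t = R\sqrt{n-2}/\sqrt{1-R^2}\), which is compared against a t distribution with \(n-2\) degrees of freedom. The sketch below checks this against cor.test() on the same kind of random data as above; the object names (a, b, t_manual, p_manual) are chosen here for illustration.

```r
set.seed(0)
a <- rnorm(20)
b <- rnorm(20)
n <- length(a)
r <- cor(a, b)                                #Pearson's correlation coefficient
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)   #test statistic
#Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)
res <- cor.test(a, b)
c(t_manual, res$statistic)                    #both t-values agree
c(p_manual, res$p.value)                      #both p-values agree
```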

Options to select from when testing or calculating correlations


The function cor.test() has several options (arguments) you can set.

  • ‘method’: you can select from ‘pearson’ (the default), ‘spearman’ and ‘kendall’ (written in lower case)
  • use ‘pearson’ for normally distributed data
  • use ‘spearman’ or ‘kendall’ for non-normal data; the latter is particularly good for small sample sizes
  • ‘alternative’: use ‘two.sided’ (the default) if testing for a positive OR negative correlation, ‘greater’ for a positive correlation, and ‘less’ for a negative correlation

Correlation, example 1:


We would like to find out whether the murder arrest rate in US states correlates with the assault arrest rate, the rape arrest rate, and the percentage of the population living in urban areas. We use the data set USArrests for this (arrests per 100,000 residents in each of the 50 US states in 1973):

data(USArrests) #load the inbuilt data set 'USArrests'
pairs(USArrests) #'pairs' plot

Correlation, example 1:


To get all the correlation coefficients between all variables at a glance, we can use cor()

cor(USArrests)
             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.8018733 0.06957262 0.5635788
Assault  0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape     0.56357883 0.6652412 0.41134124 1.0000000

Why are all the correlation coefficients along the diagonal equal to one?

Alternative pairs plot, using ggplot2

## install.packages("ggplot2"); install.packages("GGally") #install packages
library("ggplot2")  #load ggplot2 package
library("GGally")   #load GGally package
ggpairs(swiss)      #'pairs' plot, here for the inbuilt data set 'swiss'

Correlation, example 1:


Now let’s see whether UrbanPop is significantly correlated with Assault:

cor.test(USArrests$Assault, USArrests$UrbanPop)
    Pearson's product-moment correlation

data:  USArrests$Assault and USArrests$UrbanPop
t = 1.8568, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02098836  0.50111118
sample estimates:
      cor 
0.2588717 

Correlation, example 1:

  • The correlation coefficient between Assault and UrbanPop is 0.26
  • The p-value for the null hypothesis ‘the true correlation between Assault and UrbanPop is zero’ is 0.07
  • We therefore cannot reject the null hypothesis at the 5% significance level
  • Careful: we have not checked the normality assumption:

What happens if you use a non-parametric method? (In fact the data look non-normal [test for it!] and a non-parametric test should be used)
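A minimal sketch of how the normality check could be done, using Q-Q plots and the Shapiro-Wilk test (shapiro.test()). The object names sw_assault and sw_urbanpop are chosen here for illustration.

```r
data(USArrests)
par(mfrow = c(1, 2))                             #two plots side by side
qqnorm(USArrests$Assault,  main = 'Assault');  qqline(USArrests$Assault)
qqnorm(USArrests$UrbanPop, main = 'UrbanPop'); qqline(USArrests$UrbanPop)
#Shapiro-Wilk test: a small p-value suggests a departure from normality
sw_assault  <- shapiro.test(USArrests$Assault)
sw_urbanpop <- shapiro.test(USArrests$UrbanPop)
sw_assault$p.value
sw_urbanpop$p.value
```

Points that stray far from the line in a Q-Q plot, or a small Shapiro-Wilk p-value, would both argue for a rank-based test.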

Correlation, example 1:

cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')
    Kendall's rank correlation tau

data:  USArrests$Assault and USArrests$UrbanPop
z = 2.0182, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.1988482 

Now the correlation is significant! However, we won’t worry about how ‘tau’ is interpreted here. We simply conclude: ‘The number of assaults across states of the USA is significantly correlated with the percentage of people living in urban areas (Kendall’s correlation test, p < 0.05).’

Correlation, example 2:


We would like to see whether there is a positive correlation between the tenderness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check the normality assumption first.

x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y)             #scatter plot
qqnorm(x); qqline(x)   #normal Q-Q plot for x
qqnorm(y); qqline(y)   #normal Q-Q plot for y

The assumption of normality seems violated (check also using shapiro.test()) and the sample size is small, so we will use the non-parametric (rank-based) correlation tests.

Correlation, example 2:

cor.test(x, y, method = 'kendall', alternative = 'greater')
    Kendall's rank correlation tau

data:  x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
      tau 
0.4444444 
cor.test(x, y, method = 'spearman', alternative = 'greater')
    Spearman's rank correlation rho

data:  x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho 
0.6 

Correlation, example 2:


Was it correct to test one-sided (i.e. using the argument alternative = 'greater')?

What would happen to the p-value if we tested two-sided?

cor.test(x, y, method = 'kendall')
cor.test(x, y, method = 'spearman')
  • Why is there no ‘alternative’ argument specified this time?
  • The p-values are now both non-significant, why? Did you expect this?
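One way to explore these questions is to compute the one-sided and the two-sided p-values side by side. This is a sketch using the tuna data from above; the object names (p_one, p_two) are chosen here for illustration.

```r
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
#One-sided test: only a positive correlation counts as evidence
p_one <- cor.test(x, y, method = 'kendall', alternative = 'greater')$p.value
#Two-sided test: 'two.sided' is the default, so the argument can be left out
p_two <- cor.test(x, y, method = 'kendall')$p.value
p_one
p_two
p_two / p_one   #ratio of the two p-values
```

For these data the two-sided p-value comes out at twice the one-sided one, because extreme outcomes in both directions are now counted as evidence against the null hypothesis.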

Plotting two continuous variables


Let us use the USArrests data set to produce a ‘scatter plot’:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = 'Urban population (%)',
     ylab = 'Assault arrests per 100,000 residents')

Customising plots


Let us use the USArrests data set:

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = 'Urban population (%)',
     ylab = 'Assault arrests per 100,000 residents',
     xlim = c(0, 100), ylim = c(0, 400), #Note that plots usually don't have titles
     main = 'Correlation plot') #but titles are useful on exploratory plots!

Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = 'Urban population (%)',
     ylab = 'Assault arrests per 100,000 residents',
     xlim = c(0, 100), ylim = c(0, 400),
     main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')

Customising plots

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = 'Urban population (%)',
     ylab = 'Assault arrests per 100,000 residents',
     xlim = c(0, 100), ylim = c(0, 400),
     las = 1, tcl = .3, pch = 16, col = 'blue')
text(x = 20, y = 350, 'p-value (Kendall) = 0.04')
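Instead of typing the p-value by hand, the label can be built from the test result, so that plot and test cannot get out of sync. This is a sketch; the object names res and p_label are chosen here for illustration.

```r
data(USArrests)
res <- cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')
#Build the label from the test result, rounded to two decimal places
p_label <- paste('p-value (Kendall) =', round(res$p.value, 2))
plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = 'Urban population (%)',
     ylab = 'Assault arrests per 100,000 residents',
     xlim = c(0, 100), ylim = c(0, 400),
     las = 1, tcl = .3, pch = 16, col = 'blue')
text(x = 20, y = 350, labels = p_label)
```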

Example data analysis


In the data set ‘iris’, does petal length correlate with sepal length? Take all the necessary steps to test for this correlation and conclude.

Is your answer true for all three species?
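One possible way to organise the per-species part of this analysis is sketched below. The choice of Spearman's method is an assumption made for the sketch only; in a full analysis the normality check (as in the earlier examples) should guide that choice. The object names by_species and p_values are chosen here for illustration.

```r
data(iris)
#Overall test across all three species (exact = FALSE because the data contain ties)
cor.test(iris$Petal.Length, iris$Sepal.Length, method = 'spearman', exact = FALSE)
#Same test within each species separately
by_species <- split(iris, iris$Species)   #list with one data frame per species
p_values <- sapply(by_species, function(d) {
  cor.test(d$Petal.Length, d$Sepal.Length, method = 'spearman', exact = FALSE)$p.value
})
p_values                                  #one p-value per species
```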

Reporting statistical results in a text


T-test/Wilcoxon test:

  • Significant: E.g. ‘Rats on treated islands were significantly lighter than rats on control islands (t-test, p = 0.0287)’
  • Non-significant: E.g. ‘Rats on treated islands were not significantly lighter than rats on control islands (t-test, p = 0.64)’
  • Wilcoxon rank-sum test: ditto, but ‘…(Wilcoxon rank-sum test, p = …)’

Chi-squared test:

  • Significant: E.g. ‘Significantly more females than males were bullied (Chi-squared test, p < 0.0001)’
  • Non-significant: E.g. ‘Bullying was not dependent on gender (Chi-squared test, p = 0.43)’

Reporting statistical results in a text


Correlation test:

  • Significant: E.g. ‘Antibody levels were significantly correlated with rat size (Pearson’s correlation test, p = 0.01545)’
  • Non-significant: E.g. ‘Antibody levels were not significantly correlated with rat size (Pearson’s correlation test, p = 0.92)’
  • Non-parametric correlation tests: ditto, but ‘…(Kendall’s test, p = …)’

P-values

  • Either indicate exact p-value (preferred)
  • Or categorise them, e.g. p < 0.05, p < 0.01, p < 0.001
  • Often, asterisks are used to designate significance in a plot: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001
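The asterisk convention can be turned into a small helper function. This is a sketch: p_to_stars() is a hypothetical function written for this example (it is not part of base R), and the label 'n.s.' for non-significant values is an assumption.

```r
#Hypothetical helper: convert p-values into significance asterisks
p_to_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 1),
                   labels = c('***', '**', '*', 'n.s.'),
                   right = FALSE,           #intervals are [lower, upper), so p = 0.05 is 'n.s.'
                   include.lowest = TRUE))  #lets the last interval include p = 1
}
p_to_stars(c(0.0004, 0.004, 0.04357, 0.06948))
```

The last two values are the Kendall and Pearson p-values from example 1, which receive one asterisk and 'n.s.' respectively.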

What will we have learnt by the end of this week?

  • How to perform a correlation test
  • When to apply a parametric or non-parametric test
  • How to interpret an \(R\) and an \(R^2\)-value
  • How to do a ‘pairs’ plot
  • How to compute a correlation matrix
  • How to customise a plot of two continuous variables (scatter plot)
  • How to report results in a text

Glossary

  • Variance, covariance
  • non-parametric correlation vs. parametric correlation
  • Pearson’s correlation coefficient
  • Spearman’s and Kendall’s correlation coefficients
  • Coefficient of determination (\(R^2\))
  • ‘pairs’ plot
  • Scatter plot
  • Correlation matrix