‘Keep your eyes on that one, Tom… he’s not… you know… normal…’
What will we do today?
- Distribution app
- Recap week 9: Example questions
- How to analyse two continuous variables
- Correlation analysis
- Non-parametric alternatives
- Examples
- How to plot two continuous variables
- How to report results
- Reviewing lab reports
Distribution App / Lab report 6
https://gallery.shinyapps.io/dist_calc/
‘What is the probability of scoring x or higher when sampling randomly from data that follow a normal/Chi-squared/Poisson/… distribution’ (given the relevant parameters, which differ for every distribution)
Try to understand statistical testing in a generic sense! You obtain a test statistic that you compare against the distribution that statistic would have under random sampling (the null hypothesis)
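If you want the same numbers without the app, the p*() family of functions in R returns these tail probabilities directly; a minimal sketch (parameter values chosen arbitrarily for illustration):
pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE)  # standard normal: P(X >= 1.96) ~ 0.025
pchisq(7.81, df = 3, lower.tail = FALSE)           # Chi-squared with 3 df: ~0.05
ppois(9, lambda = 4, lower.tail = FALSE)           # Poisson(4): P(X >= 10)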
Recap week 9
- How to test a data set with two categorical variables
- What is a contingency table
- How to conduct and interpret a Chi-squared test in R
- How to manually conduct a Chi-squared test
- Other distributions: Binomial, Poisson, Uniform, Chi-squared…
Example questions (1)
If I told you that, for a given test, the test statistic follows the standard normal distribution, would you consider an observed value of 10 for that statistic a rare outcome?
- That would be relatively rare, yes
- This would be a very common outcome
- This would be extremely rare, approaching a probability of zero
- Your question does not make sense, you’d need to give us more information
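If in doubt, you can let R settle this; a one-line check:
pnorm(10, lower.tail = FALSE)  # ~7.6e-24: effectively zero, so such a value is extremely rare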
Example questions (2)
The Poisson distribution and the Chi-squared distribution have what in common?
1. Both are asymmetrical
2. Both are always symmetrical
3. Both only require one parameter to be specified
4. Both (1) and (3) are correct
Example questions (3)
If your test statistic is Chi-squared distributed with 3 degrees of freedom, would you consider a value of 10 significant?
- Yes, because this is a very rare value to occur by chance when sampling from a Chi-squared distribution with 3 degrees of freedom
- Yes, because any test statistic that follows a Chi-squared distribution with 3 degrees of freedom will be significant
- No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is very common
- No, because sampling 10 from a Chi-squared distribution with 3 degrees of freedom at random is extremely rare
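Here, too, a one-line check in R gives the answer:
pchisq(10, df = 3, lower.tail = FALSE)  # ~0.019, i.e. rare at the conventional 5% level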
Example questions (4) (non-MC)
You would like to test whether the numbers of plane crashes, aborted take-offs, and go-arounds are associated with the airline alliance in which they occurred (Star Alliance, OneWorld, Skyteam).
Which test would you likely use?
Sketch out the data in the long format. Then, using some imaginary (but reasonable) numbers, write down the corresponding contingency table.
Perform your test on those numbers and interpret your results.
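One possible solution sketch; the counts below are entirely invented, as the question requests:
# Invented counts: rows = event type, columns = alliance
m <- matrix(c( 8,  6,  7,   # crashes
              20, 25, 18,   # aborted take-offs
              60, 70, 55),  # go-arounds
            nrow = 3, byrow = TRUE,
            dimnames = list(c('crash', 'aborted take-off', 'go-around'),
                            c('Star Alliance', 'OneWorld', 'Skyteam')))
chisq.test(m)  # tests whether event type and alliance are independent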
Example questions (5) (non-MC)
You are given a data set called ‘d1’ with 2 variables, one called ‘x’ (a continuous variable), and one called ‘y’ (a binomial variable). Now you want to test whether the two levels of y are different in x. Write out the code you would use to test this.
Now write down the code to plot this, using the function boxplot().
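One possible answer, sketched with a simulated stand-in for d1 (the real d1 would be provided):
d1 <- data.frame(x = rnorm(40), y = rep(c('A', 'B'), each = 20))  # stand-in data
t.test(x ~ y, data = d1)       # if x is roughly normal within each level of y
wilcox.test(x ~ y, data = d1)  # non-parametric alternative
boxplot(x ~ y, data = d1)      # one box per level of y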
Correlation
Measuring relationships between continuous variables
A strong positive relationship:
Correlation
Measuring relationships between variables
A weak positive relationship:
Correlation
Measuring relationships between variables
No relationship:
Correlation
Measuring relationships between variables
A negative relationship:
Measuring relationships
Remember the notion of variance? It measures the variation within a single variable:
\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)
- Why square the differences from the mean?
- Why divide by \(n-1\)?
- How does the variance relate to the standard deviation and the standard error?
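To make the formula concrete, here is a small made-up example computing the variance by hand and comparing it to R’s built-in var():
x <- c(2, 4, 4, 4, 5, 5, 7, 9)          # made-up numbers
sum((x - mean(x))^2) / (length(x) - 1)  # variance ‘by hand’: 4.571429
var(x)                                  # built-in, same result
sqrt(var(x)) == sd(x)                   # TRUE: the sd is the square root of the variance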
Now we apply the same idea to two variables simultaneously
Covariance
- We want to characterise how the two variables are related
- We need to see whether, as one variable increases, the other one increases, decreases or stays the same
- We are not making any conclusion in terms of causality!
The metric we use is the covariance:
\(cov(x,y) = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{n-1}\)
Can we understand this formula?
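Let’s try: a small made-up example, checked against R’s built-in cov():
x <- c(1, 2, 3, 4, 5)                                 # made-up numbers
y <- c(2, 4, 5, 4, 6)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # covariance ‘by hand’: 2
cov(x, y)                                             # built-in, same result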
Another great R app
https://shiny.rit.albany.edu/stat/rectangles/
Issues with covariance
- It depends on the units of measurement, e.g. the covariance of two variables measured in miles differs from the covariance of the same variables measured in kilometers
- We therefore have to standardise by dividing by the standard deviations of both variables
- This is then called the correlation coefficient:
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{(n-1)s_xs_y}\)
- Independent of units
- Ranges from -1 to 1, -1: perfect negative correlation, 1: perfect positive correlation
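Continuing the made-up example from the covariance slide, standardising indeed reproduces R’s cor():
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cov(x, y) / (sd(x) * sd(y))  # ‘by hand’: ~0.85
cor(x, y)                    # built-in, same value, always between -1 and 1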
Things to know about the correlation coefficient
- It is an effect size
- ± 0.1 = small effect
- ± 0.3 = medium effect
- ± 0.5 = large effect
- The coefficient of determination, \(R^2\) is the proportion of variance in one variable shared/explained by the other (more on this later)
- The third-variable problem (confounding): in any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting or causing the correlation
https://rpsychologist.com/d3/correlation/
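As a quick illustration of \(R^2\), using the same made-up vectors as before:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cor(x, y)^2  # ~0.73: about 73% of the variance is shared between x and y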
Calculating correlation coefficients in R
The correlation test
You can use cor.test() for most applications:
set.seed(0)
cor.test(rnorm(20), rnorm(20))
Pearson's product-moment correlation
data: rnorm(20) and rnorm(20)
t = 0.33035, df = 18, p-value = 0.7449
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3778722 0.5028752
sample estimates:
cor
0.07762955
This gives you the value of \(R\), a confidence interval for it, and the error probability (p-value) for rejecting the null hypothesis ‘the true correlation is zero’ (i.e. the two variables are not correlated)
Calculating correlation coefficients in R
Interpreting the output
- What test statistic is the correlation test using?
- Therefore, what distribution are we comparing our test statistic against?
- How many degrees of freedom do we have? Why?
- What is the key assumption of this test? (normally distributed data!)
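As a hint towards the first question: the test statistic is a t value, and you can reproduce it from the reported \(R\) and the sample size:
r <- 0.07762955; n <- 20         # values from the output above
r * sqrt(n - 2) / sqrt(1 - r^2)  # 0.33035, matching the t in the cor.test() output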
Options to select from when testing or calculating correlations
The function cor.test() has several options (arguments) you can set.
- ‘method’: you can select from ‘pearson’ (the default), ‘spearman’ and ‘kendall’ (R expects these names in lower case)
- use ‘pearson’ for normally distributed data
- use ‘spearman’ or ‘kendall’ for non-normal data; the latter is particularly good for small sample sizes
- ‘alternative’: use ‘two.sided’ (the default) if testing for a positive OR negative correlation, ‘greater’ for a positive correlation, and ‘less’ for a negative correlation
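In code, with placeholder vectors x and y, these options look as follows:
x <- rnorm(20); y <- rnorm(20)                               # placeholder data
cor.test(x, y)                                               # Pearson, two-sided (the defaults)
cor.test(x, y, method = 'spearman')                          # non-parametric
cor.test(x, y, method = 'kendall', alternative = 'greater')  # one-sided, positive correlation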
Correlation, example 1:
We would like to find out whether the number of murders occurring in US cities correlates with the number of assaults, the number of rapes, and the percentage of the population living in urban areas. We use the data set USArrests for this:
data(USArrests) #load the inbuilt data set 'USArrests'
pairs(USArrests) #'pairs' plot
Correlation, example 1:
To get all the correlation coefficients between all variables at a glance, we can use cor()
cor(USArrests)
             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.8018733 0.06957262 0.5635788
Assault  0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape     0.56357883 0.6652412 0.41134124 1.0000000
Why are all the correlation coefficients along the diagonal equal to one?
Alternative pairs plot, using ggplot2
## install.packages("ggplot2"); install.packages("GGally") #install packages
library("ggplot2") #load ggplot2 package
library("GGally") #load GGally package
ggpairs(swiss) #demonstrated here on the built-in 'swiss' data set
Correlation, example 1:
Now let’s see whether UrbanPop is significantly correlated with Assault:
cor.test(USArrests$Assault, USArrests$UrbanPop)
Pearson's product-moment correlation
data: USArrests$Assault and USArrests$UrbanPop
t = 1.8568, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02098836 0.50111118
sample estimates:
cor
0.2588717
Correlation, example 1:
- The correlation coefficient between Assault and UrbanPop is 0.26
- The p-value for the null hypothesis ‘Assault and UrbanPop are not significantly correlated’ is 0.07
- We therefore cannot reject the null hypothesis
- Careful: we have not checked the normality assumption:
What happens if you use a non-parametric method? (In fact the data look non-normal [test for it!] and a non-parametric test should be used)
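One way to run that check yourself, using Q-Q plots and Shapiro-Wilk tests:
qqnorm(USArrests$Assault); qqline(USArrests$Assault)
qqnorm(USArrests$UrbanPop); qqline(USArrests$UrbanPop)
shapiro.test(USArrests$Assault)  # a small p-value indicates deviation from normality
shapiro.test(USArrests$UrbanPop)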
Correlation, example 1:
cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')
Kendall's rank correlation tau
data: USArrests$Assault and USArrests$UrbanPop
z = 2.0182, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.1988482
Now the correlation is significant! However, we won’t worry about how ‘tau’ is interpreted here. We simply conclude: ‘The number of assaults across US states is significantly correlated with the percentage of people living in urban areas (Kendall’s correlation test, p < 0.05).’
Correlation, example 2:
We would like to see whether there is a positive correlation between the tenderness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check for the normality assumption first.
x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y = c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y); qqnorm(x); qqline(x); qqnorm(y); qqline(y)
The assumption of normality seems violated (check also using shapiro.test()) and the sample size is small, so we will use the non-parametric (rank-based) correlation tests.
Correlation, example 2:
cor.test(x, y, method = 'kendall', alternative = 'greater')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.4444444
cor.test(x, y, method = 'spearman', alternative = 'greater')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6
Correlation, example 2:
Was it correct to test one-sided (i.e. using the argument alternative = 'greater')?
What would happen to the p-value if we tested two-sided?
cor.test(x, y, method = 'kendall')
cor.test(x, y, method = 'spearman')
- Why is there no ‘alternative’ argument specified this time?
- The p-values are now both non-significant, why? Did you expect this?
Plotting two continuous variables
Let us use the USArrests data set to produce a ‘scatter plot’:
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year')
Customising plots
Let us use the USArrests data set:
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400), #Note that plots usually don't have titles
main = 'Correlation plot') #but titles are useful on exploratory plots!
Customising plots
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')
Customising plots
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = 'Urban population (%)',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
las = 1, tcl = .3, pch = 16, col = 'blue')
text(x = 20, y = 350, 'p-value (Kendall) = 0.04')
Example data analysis
In the data set ‘iris’, does petal length correlate with sepal length? Take all the necessary steps to test for this correlation and conclude.
Is your answer true for all three species?
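A possible sequence of steps, as a sketch; which correlation method is appropriate depends on what your normality checks show:
data(iris) #built-in data set
plot(iris$Petal.Length, iris$Sepal.Length)
qqnorm(iris$Petal.Length); qqline(iris$Petal.Length) #check normality first
qqnorm(iris$Sepal.Length); qqline(iris$Sepal.Length)
cor.test(iris$Petal.Length, iris$Sepal.Length, method = 'kendall')
#Then repeat within each species, e.g.:
setosa <- subset(iris, Species == 'setosa')
cor.test(setosa$Petal.Length, setosa$Sepal.Length, method = 'kendall')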
Reporting statistical results in a text
T-test/Wilcoxon test:
- Significant: E.g. ‘Rats on treated islands were significantly lighter than rats on control islands (t-test, p = 0.0287)’
- Non-significant: E.g. ‘Rats on treated islands were not significantly lighter than rats on control islands (t-test, p = 0.64)’
- Wilcoxon rank-sum test: ditto, but ‘…(Wilcoxon rank-sum test, p = …)’
Chi-squared test:
- Significant: E.g. ‘Significantly more females than males were bullied (Chi-squared test, p < 0.0001)’
- Non-significant: E.g. ‘Bullying was not dependent on gender (Chi-squared test, p = 0.43)’
Reporting statistical results in a text
Correlation test:
- Significant: E.g. ‘Antibody levels were significantly correlated with rat size (Pearson’s correlation test, p = 0.01545)’
- Non-significant: E.g. ‘Antibody levels were not significantly correlated with rat size (Pearson’s correlation test, p = 0.92)’
- Non-parametric correlation tests: ditto, but ‘…(Kendall’s test, p = …)’
P-values
- Either indicate exact p-value (preferred)
- Or categorise them, e.g. p < 0.05, p < 0.01, p < 0.001
- Often, asterisks are used to designate significance in a plot: * for p < 0.05, ** for p < 0.01, and *** for p < 0.001
What will we have learnt by the end of this week?
- How to perform a correlation test
- When to apply a parametric or non-parametric test
- How to interpret an \(R\) and an \(R^2\)-value
- How to do a ‘pairs’ plot
- How to compute a correlation matrix
- How to customise a plot of two continuous variables (scatter plot)
- How to report results in a text
Glossary
- Variance, covariance
- Non-parametric correlation vs. parametric correlation
- Pearson’s correlation coefficient
- Spearman’s and Kendall’s correlation coefficient
- Coefficient of determination (\(R^2\))
- ‘pairs’ plot
- Scatter plot
- Correlation matrix