| Week | Lecture/Lab |
|---|---|
| 8 | t-test and extensions, power test, scientific plotting |
| 9 | correlation, regression, scientific plotting |
| 10 | comparing several means (ANOVA) |
| 11 | ANOVA extensions, ANOVA power test |
| 12 | Revision |
| 13-15 | Exam weeks |
October 1, 2014
A p-value of 0.1 in a Wilcoxon test with \(n\) = 3 should not be overinterpreted because
What will be the power of a t-test if your expected difference between samples is 7, your standard deviation 5, and your sample size 8?
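One way to approach the power question is with `power.t.test()` from base R's stats package. This is only a sketch: the question does not say which kind of t-test is meant, so the two-sample design, the two-sided alternative, the 5% significance level, and the reading of "sample size 8" as 8 observations per group are all assumptions.

```r
# Sketch: power of a t-test for the values given in the question.
# ASSUMPTIONS (not stated in the question): two-sample design,
# two-sided test, alpha = 0.05, n = 8 observations per group.
pw <- power.t.test(n = 8,          # observations per group
                   delta = 7,      # expected difference between means
                   sd = 5,         # standard deviation
                   sig.level = 0.05,
                   type = "two.sample",
                   alternative = "two.sided")

pw$power  # estimated power under the assumptions above
```

Changing `type` to `"one.sample"` or `"paired"` gives the power for those designs instead.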
A strong positive relationship:
A weak positive relationship:
No relationship:
A negative relationship:
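The four cases above can be reproduced with simulated data. The sketch below is illustrative only: the sample size, the seed, the target correlations (0.9, 0.3, 0 and -0.7) and all variable names are assumptions chosen for the example, not values from the lecture.

```r
# Simulate four relationships between x and a second variable.
set.seed(1)                 # arbitrary seed, for reproducibility only
n     <- 100
x     <- rnorm(n)
noise <- rnorm(n)

# y = r * x + sqrt(1 - r^2) * noise has a population correlation of r with x
strong_pos <- 0.9 * x + sqrt(1 - 0.9^2) * noise
weak_pos   <- 0.3 * x + sqrt(1 - 0.3^2) * noise
none       <- rnorm(n)      # generated independently of x
negative   <- -0.7 * x + sqrt(1 - 0.7^2) * noise

# Draw the four scatterplots in a 2 x 2 grid, then restore the settings
op <- par(mfrow = c(2, 2))
plot(x, strong_pos, main = "Strong positive")
plot(x, weak_pos,   main = "Weak positive")
plot(x, none,       main = "No relationship")
plot(x, negative,   main = "Negative")
par(op)
```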
Remember the notion of variance? - Measuring the variation within a single variable
\(var = \frac{\sum{(x_i-\bar{x})^2}}{n-1} = \frac{\sum{(x_i-\bar{x})(x_i-\bar{x})}}{n-1}\)
Now we apply the same idea to two variables simultaneously
The metric we use is the covariance:
\(cov(x,y) = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}\)
Can we understand this formula?
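Dividing the covariance by the two standard deviations \(s_x\) and \(s_y\) gives the (Pearson) correlation coefficient \(R\):
\(R = \frac{cov(x,y)}{s_xs_y} = \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{(n-1)s_xs_y}\)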
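To see that the formulas do what they claim, the covariance and \(R\) can be computed term by term and compared with R's built-in `cov()` and `cor()`. The sketch below uses two columns of the built-in USArrests data set purely as example input; the variable names are chosen for illustration.

```r
# Covariance and correlation "by hand", checked against cov() and cor().
x <- USArrests$UrbanPop
y <- USArrests$Assault
n <- length(x)

# Covariance: sum of the products of deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations
r_xy <- cov_xy / (sd(x) * sd(y))

all.equal(cov_xy, cov(x, y))  # TRUE if the hand calculation matches
all.equal(r_xy,  cor(x, y))
```

Because the standard deviations carry the same units as the covariance, \(R\) is unit-free and always lies between -1 and 1.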
You can use cor.test() for most applications:
cor.test(rnorm(20), rnorm(20))
Pearson's product-moment correlation
data: rnorm(20) and rnorm(20)
t = 0.543, df = 18, p-value = 0.5938
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3344 0.5392
sample estimates:
cor
0.1269
This gives you the value of \(R\), a confidence interval for it, and the error probability of rejecting the null hypothesis 'the two variables are not correlated' (the p-value).
The function cor.test() has several options (arguments) you can set.
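A few of the arguments most often changed are `method`, `alternative`, `conf.level` and `exact`. The sketch below shows them on two columns of the built-in USArrests data set, used here only as example input; see `?cor.test` for the full list.

```r
x <- USArrests$UrbanPop
y <- USArrests$Assault

# Defaults: Pearson correlation, two-sided test, 95% confidence interval
res_default <- cor.test(x, y)

# 'method' switches to a rank-based coefficient ("kendall" or "spearman").
# exact = FALSE asks for the large-sample approximation, which avoids the
# warning about exact p-values when the data contain ties.
res_spearman <- cor.test(x, y, method = "spearman", exact = FALSE)

# 'alternative' selects a one-sided test; 'conf.level' sets the confidence
# level of the interval (an interval is reported for Pearson only)
res_onesided <- cor.test(x, y, alternative = "greater", conf.level = 0.99)

res_default$estimate
res_spearman$estimate
res_onesided$conf.int
```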
We would like to find out whether the murder arrest rate in US states correlates with the assault arrest rate, the rape arrest rate, and the percentage of the population living in urban areas. We use the data set USArrests (arrests per 100,000 residents in each of the 50 US states) for this:
data(USArrests)  # load inbuilt data set
plot(USArrests)  # 'pairs' plot
To get all the correlation coefficients between all variables at a glance, we can use cor()
cor(USArrests)
          Murder Assault UrbanPop   Rape
Murder   1.00000  0.8019  0.06957 0.5636
Assault  0.80187  1.0000  0.25887 0.6652
UrbanPop 0.06957  0.2589  1.00000 0.4113
Rape     0.56358  0.6652  0.41134 1.0000
Now see whether UrbanPop is significantly correlated with Assault:
cor.test(USArrests$Assault, USArrests$UrbanPop)
Pearson's product-moment correlation
data: USArrests$Assault and USArrests$UrbanPop
t = 1.857, df = 48, p-value = 0.06948
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02099 0.50111
sample estimates:
cor
0.2589
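The correlation between Assault and UrbanPop is 0.26.
The error probability of rejecting the null hypothesis 'Assault and UrbanPop are not correlated' is 0.07.
What happens if you use a non-parametric method? (In fact the data are not normal and a non-parametric test should be used.)
cor.test(USArrests$Assault, USArrests$UrbanPop, method = 'kendall')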
Kendall's rank correlation tau
data: USArrests$Assault and USArrests$UrbanPop
z = 2.018, p-value = 0.04357
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.1988
Now the correlation is significant! However, we won't worry about how 'tau' is interpreted here.
We would like to see whether there is a positive correlation between the lightness of tuna flesh (x) and the consumer panel scores (y). This time we make sure we check for the normality assumption first.
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)
plot(x, y)
qqnorm(x); qqline(x)
qqnorm(y); qqline(y)
The assumption of normality is violated (check also using shapiro.test()), so we will use the non-parametric (rank-based) correlation tests.
cor.test(x, y, method = 'kendall', alternative = 'greater')
Kendall's rank correlation tau
data: x and y
T = 26, p-value = 0.05972
alternative hypothesis: true tau is greater than 0
sample estimates:
tau
0.4444
cor.test(x, y, method = 'spearman', alternative = 'greater')
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0484
alternative hypothesis: true rho is greater than 0
sample estimates:
rho
0.6
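The `shapiro.test()` check suggested for the tuna data can be sketched as follows. The names `sw_x` and `sw_y` are chosen for illustration; a small p-value argues against normality, but with only nine observations the test has little power, so the Q-Q plots remain worth reading alongside it.

```r
# Shapiro-Wilk normality test for the tuna data
x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)  # lightness
y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)  # panel score

sw_x <- shapiro.test(x)
sw_y <- shapiro.test(y)

# Null hypothesis of each test: the sample comes from a normal distribution
sw_x$p.value
sw_y$p.value
```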
Let us use the USArrests data set:
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = '% Urban population',
ylab = 'Number of assaults/year')
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = '% Urban population',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
main = 'Correlation plot')
text(20, 350, 'p-value (Kendall) = 0.04')
plot(USArrests$Assault ~ USArrests$UrbanPop,
xlab = '% Urban population',
ylab = 'Number of assaults/year',
xlim = c(0, 100), ylim = c(0, 400),
las = 1, tcl = .3, pch = 16, col = 'blue')
text(20, 350, 'p-value (Kendall) = 0.04')
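Instead of typing the p-value into the label by hand, it can be taken from the test result so that plot and test always agree. This is a sketch of one possible approach; the names `kt` and `lab` are chosen for illustration.

```r
# Run the Kendall test once and reuse its p-value in the plot annotation
kt  <- cor.test(USArrests$Assault, USArrests$UrbanPop, method = "kendall")
lab <- paste0("p-value (Kendall) = ", round(kt$p.value, 2))

plot(USArrests$Assault ~ USArrests$UrbanPop,
     xlab = "% Urban population",
     ylab = "Number of assaults/year",
     xlim = c(0, 100), ylim = c(0, 400),
     las = 1, tcl = .3, pch = 16, col = "blue")
text(20, 350, lab)  # same position as in the plots above
```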