Lecture 4 Two samples

Eamonn Mallon
25/09/2019

Occam's razor applied to statistical models

mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)

No point carrying out an analysis that is more complicated than it has to be
The tests we will look at today, the classical tests, deal with some of the most frequent types of analysis
e.g. men's height vs women's, height versus weight etc.
FYI the R code is an example of a linear model, more on those in BS1070/MB1080

Today's tests

t-test (comparing two sample means with normal residuals)
Wilcoxon's test (comparing the locations of two samples with non-normal residuals)
Pearson's or Spearman's rank correlation (correlating two variables)
chi-squared test (testing for independence in contingency tables)

Student's t-test and Guinness

Student was the pseudonym of W.S. Gosset (1876 - 1937)
Head Experimental Brewer at Guinness, where he ran small-sample, stratified, and repeated balanced experiments on barley to identify the best-yielding varieties
Gosset was a friend of both Pearson and Fisher, a noteworthy achievement, for each had a massive ego and a loathing for the other. He was a modest man who once cut short an admirer with this comment: “Fisher would have discovered it all anyway.”
Other awesome Guinness ads

The t-test


how likely is it that the two sample means were drawn from populations with the same average?
calculate a test statistic
how likely is it that we obtain a test statistic this big or bigger if the null hypothesis is true?
- compare the calculated test statistic to the critical value which is calculated on the assumption that the null hypothesis is true
quick test: what is the null hypothesis when comparing two means?

The t-test

t = \( \frac{difference\, between\, two\, means}{standard\, error\, of\, the\, difference} \)
t = \( \frac{\bar{y}_A-\bar{y}_B}{S.E.D} \)
- Lecture 2 explains the standard error of the mean (an estimate of how far the sample mean is likely to be from the population mean)
- For two independent variables, the variance of a difference is the sum of the separate variances
- \( S.E.M =\sqrt{\frac{s^2}{n}} \)
- \( S.E.D =\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}} \)
t = \( \frac{\bar{y}_A-\bar{y}_B}{\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}} \)
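The formula above can be checked by hand. The following is a minimal sketch, assuming two small made-up samples (`group_a` and `group_b` are hypothetical values, not data from the lecture); it builds the standard error of the difference from the two sample variances and compares the result with the statistic that `t.test()` reports.

```r
# Hypothetical measurements for two independent groups (illustrative values only)
group_a <- c(20.1, 22.4, 19.8, 23.0, 21.5, 20.9)
group_b <- c(17.2, 18.9, 16.5, 19.4, 18.0, 17.7)

# Standard error of the difference: sqrt(s_A^2 / n_A + s_B^2 / n_B)
sed <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))

# Test statistic: difference between the two means over its standard error
t_manual <- (mean(group_a) - mean(group_b)) / sed

# t.test() defaults to the Welch test, which uses the same statistic
t_builtin <- unname(t.test(group_a, group_b)$statistic)

all.equal(t_manual, t_builtin)
```

The two values should agree; what the Welch test adds on top of this formula is an adjusted number of degrees of freedom, which is why the df in R output is often not a whole number.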

R code for a t-test

library(SMPracticals)  # The darwin data set is in this package
t.test(formula = height ~ type,  # Formula
       data = darwin) # Dataframe containing the variables


    Welch Two Sample t-test

data:  height by type
t = 2.4371, df = 22.164, p-value = 0.02328
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3909566 4.8423767
sample estimates:
mean in group Cross  mean in group Self 
           20.19167            17.57500

Outcrossed plants (mean height 20.19) are taller than selfed plants (mean height 17.58); the difference in means is 2.62 (95% confidence interval: 0.39 to 4.84) (Welch t-test: t = 2.4371, df = 22.164, p = 0.02328)
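To connect this output back to the idea of comparing a test statistic with a critical value, here is a small sketch that recovers the p-value and the 5% critical value from the reported t and df. The numbers are copied (rounded) from the Welch output above, so the recomputed p-value should match only approximately.

```r
# Values copied from the Welch output above
t_obs <- 2.4371
df    <- 22.164

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value <- 2 * pt(-abs(t_obs), df = df)

# Critical value for a two-sided test at the 5% level
t_crit <- qt(0.975, df = df)

p_value          # close to the reported 0.02328
t_obs > t_crit   # TRUE, so H0 is rejected at the 5% level
```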

Wilcoxon test


When the residuals are non-normal
Also known as a Mann-Whitney test
Rank all the data together
Add up the ranks for each treatment
compare the smaller value to a critical value
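The ranking steps above can be sketched directly in R. This example uses the built-in `ToothGrowth` data and assumes the convention used by `wilcox.test()`, which reports W for the first group (its rank sum minus the smallest rank sum it could possibly have) and a p-value, rather than looking up the smaller rank sum in a table of critical values.

```r
# ToothGrowth ships with base R (datasets package)
# Rank all 60 values together; tied values receive the average of their ranks
ranks <- rank(ToothGrowth$len)

# Add up the ranks for the first treatment (OJ is the first factor level)
is_oj       <- ToothGrowth$supp == "OJ"
rank_sum_oj <- sum(ranks[is_oj])
n_oj        <- sum(is_oj)

# W = rank sum of the first group minus its minimum possible value n(n + 1) / 2
w_manual <- rank_sum_oj - n_oj * (n_oj + 1) / 2

w_builtin <- unname(wilcox.test(len ~ supp, data = ToothGrowth,
                                exact = FALSE)$statistic)

all.equal(w_manual, w_builtin)
```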

R code for a Wilcoxon test

wilcox.test(formula = len ~ supp,  # Formula
            data = ToothGrowth,    # Dataframe containing the variables
            exact = FALSE)         # Use the normal approximation rather than an exact p-value


    Wilcoxon rank sum test with continuity correction

data:  len by supp
W = 575.5, p-value = 0.06449
alternative hypothesis: true location shift is not equal to 0

There is no significant difference between supplement types in their effect on tooth growth (Wilcoxon rank-sum test: W = 575.5, n = 60, p = 0.06449)

Correlation


defined in terms of variance of x, variance of y and covariance of xy
covariance: the way the two vary together
\( r =\frac{cov(x,y)}{\sqrt{s_x^2s_y^2}} \)
after some algebra (the n - 1 divisors in the covariance and the variances cancel):
\( r =\frac{SSXY}{\sqrt{SSX \cdot SSY}} \)
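Both forms of r can be checked numerically. The sketch below uses the built-in `cars` data purely as a convenient example (it is not a data set used in this lecture) and compares the covariance form and the sums-of-squares form with R's own `cor()`.

```r
# cars ships with base R (datasets package); used here only for illustration
x <- cars$speed
y <- cars$dist

# Definition: covariance scaled by the square root of the product of the variances
r_cov <- cov(x, y) / sqrt(var(x) * var(y))

# Sums-of-squares form: the (n - 1) divisors cancel top and bottom
ssx  <- sum((x - mean(x))^2)
ssy  <- sum((y - mean(y))^2)
ssxy <- sum((x - mean(x)) * (y - mean(y)))
r_ss <- ssxy / sqrt(ssx * ssy)

all.equal(r_cov, cor(x, y))
all.equal(r_ss, cor(x, y))
```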

Pearson's product-moment correlation (parametric)

\( r =\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \sqrt{n \sum y_i^2 - (\sum y_i)^2}} \)
Assumptions
- both variables should be normally distributed
- linearity (a straight-line relationship between the two variables)

R code for Pearson's product-moment correlation (parametric)

cor.test(babies$gestation, babies$bwt)


    Pearson's product-moment correlation

data:  babies$gestation and babies$bwt
t = 15.609, df = 1221, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3600303 0.4535398
sample estimates:
     cor 
0.407854

The length of gestation is positively correlated with a baby's weight (Pearson: r = 0.408, t = 15.609, df = 1221, p < 0.0001)

Spearman rank correlation (non-parametric)

Based on ranks
monotonic (not necessarily linear)
\( \rho = 1 -\frac{6\sum d_i^2}{n(n^2-1)} \)
\( d_i \) is the difference between the two ranks of observation i, and n is the number of observations
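The rank-difference formula can be reproduced in a few lines. This is a minimal sketch, assuming small made-up vectors with no tied values (`x` and `y` are hypothetical); the shortcut formula is only exact when there are no ties, which is why tie-free data are used here.

```r
# Hypothetical paired observations with no tied values
x <- c(3.1, 4.7, 2.2, 5.9, 4.0, 6.3, 1.8)
y <- c(12, 15, 11, 14, 18, 20, 9)

# Difference between the ranks of each pair
d <- rank(x) - rank(y)
n <- length(x)

# Spearman's rho from the rank differences
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

rho_builtin <- cor(x, y, method = "spearman")
all.equal(rho_manual, rho_builtin)
```

Spearman's rho is equivalent to Pearson's r calculated on the ranks, so `cor(rank(x), rank(y))` gives the same value.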

R code for Spearman's rank correlation (non-parametric)

cor.test(babies$gestation, babies$bwt, method = "spearman")


    Spearman's rank correlation rho

data:  babies$gestation and babies$bwt
S = 181438572, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4048838

The length of gestation is positively correlated with a baby's weight (Spearman: rho = 0.405, n = 1223, p < 0.0001)

Correlation does not mean causation

Chi squared contingency table

	Blue.eyes	Brown.eyes	Row.totals
Fair hair	38	11	49
Dark hair	14	51	65
Column totals	52	62	114

counts (number of leaves, number of patients who die etc.)
Is there an association between hair colour and eye colour?
Calculate expected values based on the null hypothesis (\( H_0 \): there is no association between hair and eye colour)
expected = (row total x column total) / grand total
\( \chi^2 = \sum\frac{(O-E)^2}{E} \)
d.f. = (r-1) x (c-1)
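The expected counts and the test statistic can be worked through by hand for the hair and eye colour table. This sketch uses the counts from the table above; the object names (`observed`, `expected`, `chi_sq`) are illustrative. Note that it computes the uncorrected statistic, whereas `chisq.test()` applies Yates' continuity correction to 2 x 2 tables by default.

```r
# Observed counts: rows = hair colour, columns = eye colour
observed <- matrix(c(38, 14, 11, 51), nrow = 2,
                   dimnames = list(hair = c("Fair", "Dark"),
                                   eyes = c("Blue", "Brown")))

# Expected count for each cell = (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic, degrees of freedom and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# correct = FALSE switches off Yates' correction so the two values match
all.equal(chi_sq, unname(chisq.test(observed, correct = FALSE)$statistic))
```

Without the correction the statistic comes out a little larger than the corrected value R prints by default; the conclusion is the same either way.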

R code for Chi squared contingency table

count <- matrix(c(38,14,11,51), nrow=2)
chisq.test(count)


    Pearson's Chi-squared test with Yates' continuity correction

data:  count
X-squared = 33.112, df = 1, p-value = 8.7e-09

Hair colour is associated with eye colour (\( \chi^2 \) with Yates' continuity correction = 33.112, d.f. = 1, p = \( 8.7 \times 10^{-9} \))

What you learned to do today

Are men and women different heights?
Is height correlated with weight?
Is hair colour associated with eye colour?
These are examples of almost any biological question. Can you think of one that doesn't fit above? At a simple level, there isn't anything you can't ask with the tools you have

Next semester

Regression: How x changes y
ANOVA: lets you test whether more than two sample means differ
Preview of linear models: one test to rule them all.