Lecture 4 Two samples

Eamonn Mallon
25/09/2019

Occam's razor applied to statistical models

mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
  • No point carrying out an analysis that is more complicated than it has to be
  • The tests we will look at today, the classical tests, deal with some of the most frequent types of analysis
  • e.g. men's height vs women's, height versus weight etc.
  • FYI the R code is an example of a linear model, more on those in BS1070/MB1080

Today's tests

  • t test (comparing two sample means with normal residuals)
  • Wilcoxon's test (comparing two samples with non-normal residuals)
  • Pearson's or Spearman's rank correlation (correlating two variables)
  • chi-squared test (testing for independence in contingency tables)

Student's t-test and Guinness

  • Guinness
  • Student was the pseudonym of W.S. Gosset (1876 - 1937)
  • As Head Experimental Brewer he ran small-sample, stratified, and repeated balanced experiments on barley to identify the best-yielding varieties
  • Gosset was a friend of both Pearson and Fisher, a noteworthy achievement, for each had a massive ego and a loathing for the other. He was a modest man who once cut short an admirer with this comment: “Fisher would have discovered it all anyway.”
  • Other awesome Guinness ads

The t-test

  • how likely is it that the two sample means were drawn from populations with the same average?
  • calculate a test statistic
  • how likely is it that we would obtain a test statistic this big or bigger if the null hypothesis is true?
    • compare the calculated test statistic to the critical value which is calculated on the assumption that the null hypothesis is true
  • quick test: what is the null hypothesis when comparing two means?
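
The comparison with a critical value can be sketched in R as follows. The values of `t_obs` and `df` are made-up numbers for illustration, not results from any dataset in this lecture; `qt()` and `pt()` are the base R quantile and distribution functions for the t distribution.

```r
# Hypothetical test statistic and degrees of freedom (illustration only)
t_obs <- 2.5
df    <- 20

# Two-sided critical value at the 5% level: the value of |t| that is
# exceeded only 5% of the time when the null hypothesis is true
t_crit <- qt(0.975, df)

# Probability of a test statistic this big or bigger (in either direction)
# if the null hypothesis is true
p_value <- 2 * pt(-abs(t_obs), df)

# Reject the null hypothesis at the 5% level if |t| exceeds the critical value
reject_null <- abs(t_obs) > t_crit
```

Comparing `abs(t_obs)` with `t_crit` and comparing `p_value` with 0.05 always lead to the same decision; they are two views of the same calculation.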

The t-test

  • t = \( \frac{difference\, between\, two\, means}{standard\, error\, of\, the\, difference} \)
  • t = \( \frac{\bar{y}_A-\bar{y}_B}{S.E.D} \)
    • Lecture 2 explains the standard error of the mean (an estimate of how far the sample mean is likely to be from the population mean)
    • For two independent variables, the variance of a difference is the sum of the separate variances
    • \( S.E.M =\sqrt{\frac{s^2}{n}} \)
    • \( S.E.D =\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}} \)
  • t = \( \frac{\bar{y}_A-\bar{y}_B}{\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}} \)
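
As a rough check, the formula above can be computed by hand and compared with what `t.test()` reports. This sketch uses the `ToothGrowth` data that ships with R so that it runs without extra packages; the variable names (`a`, `b`, `sed`) are just illustrative choices.

```r
# ToothGrowth ships with base R: tooth length (len) by supplement type (supp)
a <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
b <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Standard error of the difference: square root of the summed
# variance-over-n terms for the two samples
sed <- sqrt(var(a) / length(a) + var(b) / length(b))

# t = difference between the two means / standard error of the difference
t_by_hand <- (mean(a) - mean(b)) / sed

# The same statistic as reported by t.test() (Welch test by default)
t_from_r <- unname(t.test(a, b)$statistic)

print(c(by_hand = t_by_hand, from_r = t_from_r))
```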

R code for a t-test

library(SMPracticals)  # Data is in this package
t.test(formula = height ~ type,  # Formula
       data = darwin) # Dataframe containing the variables

    Welch Two Sample t-test

data:  height by type
t = 2.4371, df = 22.164, p-value = 0.02328
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3909566 4.8423767
sample estimates:
mean in group Cross  mean in group Self 
           20.19167            17.57500 

Outcrossed plants (mean = 20.19) are taller than selfed plants (mean = 17.58); difference in means = 2.62 (95% confidence interval: 0.39 to 4.84) (Welch t-test: t = 2.4371, df = 22.164, p = 0.02328)
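
The numbers in a report sentence like this can be pulled from the test object instead of being copied by hand. The sketch below uses the built-in `ToothGrowth` data rather than `darwin`, so it runs without installing a package. Note that `conf.int` is the confidence interval for the difference between the two means, not for either group mean; a one-sample `t.test()` gives the interval for a single group.

```r
res <- t.test(len ~ supp, data = ToothGrowth)

group_means <- res$estimate   # mean of each group
mean_diff   <- unname(group_means[1] - group_means[2])
ci_diff     <- res$conf.int   # 95% CI for the difference in means

# A 95% CI for one group's own mean comes from a one-sample t.test()
ci_oj <- t.test(ToothGrowth$len[ToothGrowth$supp == "OJ"])$conf.int

report <- sprintf(
  "difference = %.2f (95%% CI: %.2f to %.2f), t = %.3f, df = %.2f, p = %.4f",
  mean_diff, ci_diff[1], ci_diff[2],
  res$statistic, res$parameter, res$p.value
)
print(report)
```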

Wilcoxon test

  • When the residuals are non-normal
  • Also known as a Mann-Whitney test
  • Rank all the data together
  • Add up the ranks for each treatment
  • compare the smaller value to a critical value
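
The ranking steps above can be sketched by hand in R, here on the built-in `ToothGrowth` data. One caveat: instead of the smaller rank sum, R reports W, the rank sum of the first group minus its smallest possible value, and works out the p-value from that. The variable names are illustrative.

```r
len  <- ToothGrowth$len
supp <- ToothGrowth$supp

# Rank all the data together (tied values share their average rank)
r <- rank(len)

# Add up the ranks for each treatment
rank_sums <- tapply(r, supp, sum)

# R reports W: the rank sum of the first group minus its smallest
# possible value, n1 * (n1 + 1) / 2
n1 <- sum(supp == "OJ")
w_by_hand <- unname(rank_sums["OJ"] - n1 * (n1 + 1) / 2)

# The same statistic from the built-in test
w_from_r <- unname(wilcox.test(len ~ supp, data = ToothGrowth,
                               exact = FALSE)$statistic)

print(c(by_hand = w_by_hand, from_r = w_from_r))
```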

R code for a Wilcoxon test

wilcox.test(formula = len ~ supp,  # Formula
            data = ToothGrowth,    # Dataframe containing the variables
            exact = FALSE)         # Use the normal approximation (the data contain ties)

    Wilcoxon rank sum test with continuity correction

data:  len by supp
W = 575.5, p-value = 0.06449
alternative hypothesis: true location shift is not equal to 0

There is no significant difference between supplement types in their effect on tooth growth (Wilcoxon rank-sum test: W = 575.5, n = 60, p = 0.06449)

Correlation

  • defined in terms of variance of x, variance of y and covariance of xy
  • covariance: the way the two vary together
  • \( r =\frac{cov(x,y)}{\sqrt{s_x^2s_y^2}} \)
  • lots of maths and finally:
  • \( r =\frac{SSXY}{\sqrt{SSX \cdot SSY}} \)
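
Both forms of r can be checked against `cor()` with a few lines of R. This sketch uses the `women` dataset that ships with R (average height and weight of 15 women); `ssx`, `ssy` and `ssxy` are the corrected sums of squares and products, named here to match SSX, SSY and SSXY.

```r
# 'women' ships with base R: average height and weight of 15 women
x <- women$height
y <- women$weight

# r from the covariance and the two variances
r_cov <- cov(x, y) / sqrt(var(x) * var(y))

# r from the corrected sums of squares and products
ssx  <- sum((x - mean(x))^2)
ssy  <- sum((y - mean(y))^2)
ssxy <- sum((x - mean(x)) * (y - mean(y)))
r_ss <- ssxy / sqrt(ssx * ssy)

# Both agree with the built-in function
r_builtin <- cor(x, y)

print(c(r_cov = r_cov, r_ss = r_ss, r_builtin = r_builtin))
```

The two versions agree because the variances and the covariance are the sums of squares and products each divided by n - 1, and that factor cancels.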

Pearson's product-moment correlation (parametric)

  • \( r =\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \sqrt{n \sum y_i^2 - (\sum y_i)^2}} \)
  • Assumptions
    • both variables should be normally distributed
    • linearity (a straight-line relationship between the two variables)

R code for a Pearson's product-moment correlation (parametric)

cor.test(babies$gestation, babies$bwt)

    Pearson's product-moment correlation

data:  babies$gestation and babies$bwt
t = 15.609, df = 1221, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3600303 0.4535398
sample estimates:
     cor 
0.407854 

The length of gestation correlates positively with a baby's weight (Pearson: r = 0.408, t = 15.609, df = 1221, p < 0.0001)

Spearman rank correlation (non-parametric)

  • Based on ranks
  • monotonic (not necessarily linear)
  • \( \rho = 1 -\frac{6\sum d_i^2}{n(n^2-1)} \)
  • d is the difference between the ranks of corresponding variables
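
A minimal sketch of the rank formula, using made-up measurements with no tied values (the shortcut formula is exact only when there are no ties). `x` and `y` are hypothetical data, not taken from the lecture.

```r
# Made-up measurements with no tied values (illustration only)
x <- c(1.2, 3.4, 2.2, 5.1, 4.0)
y <- c(2.0, 3.1, 3.9, 6.5, 4.2)
n <- length(x)

# d: difference between the ranks of each pair of observations
d <- rank(x) - rank(y)

# rho from the formula above
rho_by_hand <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# The built-in version, and Pearson's r computed on the ranks
rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y))

print(c(by_hand = rho_by_hand, builtin = rho_builtin, on_ranks = rho_ranks))
```

Here the ranks differ for only two observations, so the sum of squared differences is 2 and rho works out as 1 - 12/120 = 0.9.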

R code for a Spearman rank correlation (non-parametric)

cor.test(babies$gestation, babies$bwt, method = "spearman")

    Spearman's rank correlation rho

data:  babies$gestation and babies$bwt
S = 181438572, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4048838 

The length of gestation correlates positively with a baby's weight (Spearman: rho = 0.405, n = 1223, p < 0.0001)

Correlation does not mean causation

Chi squared contingency table

               Blue.eyes  Brown.eyes  Row.totals
Fair hair             38          11          49
Dark hair             14          51          65
Column totals         52          62         114

  • counts (number of leaves, number of patients who die etc.)
  • Is there an association between hair colour and eye colour?
  • Calculate expected values based on the null hypothesis (\( H_0 \): there is no association between hair and eye colour)
  • expected = (row total x column total) / grand total
  • \( \chi^2 = \sum\frac{(O-E)^2}{E} \)
  • d.f. = (r-1) x (c-1)
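
The expected counts and the test statistic can be worked through by hand in R as a rough check on the steps above. The object names (`observed`, `expected`, `chi_sq`) are illustrative. One thing to be aware of: for 2 x 2 tables `chisq.test()` applies Yates' continuity correction by default, so its statistic is a little smaller than the uncorrected formula gives.

```r
# Observed counts from the hair and eye colour table
observed <- matrix(c(38, 14, 11, 51), nrow = 2,
                   dimnames = list(hair = c("Fair", "Dark"),
                                   eyes = c("Blue", "Brown")))

# expected = (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p      <- pchisq(chi_sq, df, lower.tail = FALSE)

# Switch the correction off to reproduce the uncorrected formula
chi_sq_r <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Yates' correction subtracts 0.5 from each |O - E| before squaring
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

print(c(uncorrected = chi_sq, from_r = chi_sq_r, yates = chi_sq_yates))
```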

R code for Chi squared contingency table

count <- matrix(c(38,14,11,51), nrow=2)
chisq.test(count)

    Pearson's Chi-squared test with Yates' continuity correction

data:  count
X-squared = 33.112, df = 1, p-value = 8.7e-09

Hair colour is associated with eye colour (\( \chi^2 \) = 33.112, d.f. = 1, p = \( 8.7 \times 10^{-9} \))

What you learned to do today

  • Are men and women different heights?
  • Is height correlated with weight?
  • Is hair colour associated with eye colour?

  • Almost any biological question is an example of one of these. Can you think of one that doesn't fit above? At a simple level, there isn't anything you can't ask with the tools you have

Next semester

  • Regression: How x changes y
  • ANOVA: Allows you to test whether more than two sample means are different
  • Preview of linear models: one test to rule them all.