Week 4: Two-variable statistics and L-D experiment tips

Some useful information for the L-D report

You can always use ?function_name to see what a function does (e.g., ?abline for the L-D report).

The range of the x- and y-axes can be specified with xlim = c(start, end) and ylim = c(start, end).

vx = runif(20, 0, 1)
vy = runif(20, 10, 100)
plot(vx, vy, pch = 20)

#if we want the x and y axes to cover the same range, we can use xlim and ylim
plot(vx, vy, pch = 20, xlim = c(0, 100), ylim = c(0, 100))

Oftentimes, plotting the data on a different scale gives you a different view of it. For the reports, you can decide which plot(s) are most useful.
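
If the values span several orders of magnitude (as colony counts in the L-D experiment often do), a logarithmic axis is another scale worth trying. A small sketch using the vx and vy from above (log = "y" requires positive values):

plot(vx, vy, pch = 20, log = "y") #same data, logarithmic y-axis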

You will need to count the number of plates without any colonies to estimate the mutation rate. There are many ways you could do it; here are some possibilities.

v1 = c(0, 1, 2, 6, 0, 10, 20, 3, 5, 2, 0, 0, 2, 5, 7, 19)
bool = v1 == 0

#use table to summarize how often your vector equals zero
table(bool)
bool
FALSE  TRUE 
   12     4 
#sum of a logical vector: TRUE = 1, FALSE = 0
sum(bool)
[1] 4
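
A couple of other equivalent possibilities (a quick sketch; both give the same answer, 4, for this v1):

#skip the intermediate bool variable
sum(v1 == 0)
#or find the positions of the zero plates first, then count them
length(which(v1 == 0))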

Count data - summarizing data and statistical tests

When the numerical data are sparse (for example, counts that take only a few distinct values), what we learned in Week 3 about summarizing data may not be that useful.

Do you think a histogram or a boxplot would be helpful for summarizing the v1 below?

It is likely that the data consist of only a few distinct values. table() is always a good function for checking whether that is true.

table(v1)
v1
 1 19 20 
 8  6  5 
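
Note that the table above comes from a different v1 than the plate-count vector defined earlier (that one contains ten distinct values). A hypothetical vector with matching counts, like the one below, would reproduce this output.

v1 = c(rep(1, 8), rep(19, 6), rep(20, 5)) #hypothetical data matching the table above
table(v1)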

So, in this example, there are only three possible values of v1. Instead of a histogram, a barplot may be more appropriate for v1. Making a barplot from the outcome of table() is easy.

barplot(table(v1))

You can also specify the heights of the bars by providing a vector. The bar names can be specified with the names = c( ) option. Remember, group names need " " around them (since they are strings, not variables).

v2 = c(1, 10, 4)
barplot(v2, names = c("control", "treatment 1", "treatment 2"))

To compare observed counts with expected counts, we can use fisher.test() (for 2x2 tables) or chisq.test().

A commonly used example to understand Fisher’s exact test and Chi-square test is coin tossing. Let’s say we are tossing a fair coin.

?sample

coin = c('head', 'tail')
ntoss = 100 #the number of random tosses
random_coin = sample(coin, size = ntoss, replace = TRUE) #a vector of ntoss random tosses

table(random_coin)
random_coin
head tail 
  49   51 
obs_toss = as.vector(table(random_coin))  #force the table outcome to be a vector
exp_toss = c(ntoss/2, ntoss/2) #expectation

#combine the two vectors into one long vector
m = c(obs_toss, exp_toss)
#make it into a 2x2 matrix
m = matrix(m, nrow = 2)

fisher.test(m)

    Fisher's Exact Test for Count Data

data:  m
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5315849 1.7363182
sample estimates:
odds ratio 
 0.9609541 

Let’s try tossing another “unfair” coin. You can specify the probability of each outcome in sample() with prob = c( , ).

coin = c('head', 'tail')
ntoss = 100 #the number of random tosses
random_coin2 = sample(coin, size = ntoss, replace = TRUE, prob = c(0.8, 0.2)) #tossing an unfair coin

table(random_coin2)
random_coin2
head tail 
  73   27 
obs_toss2 = as.vector(table(random_coin2))  #force the table outcome to be a vector
exp_toss2 = c(ntoss/2, ntoss/2) #expectation

#combine the two vectors into one long vector
m2 = c(obs_toss2, exp_toss2)
#make it into a 2x2 matrix
m2 = matrix(m2, nrow = 2)
m2
     [,1] [,2]
[1,]   73   50
[2,]   27   50
fisher.test(m2)

    Fisher's Exact Test for Count Data

data:  m2
p-value = 0.001316
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.439928 5.105704
sample estimates:
odds ratio 
  2.689832 

Make sure you know how to interpret the Fisher’s Exact test results!

Consider the following data structure. When you conduct a Fisher’s Exact test on the following 2x2 table, what are you testing?

     sunny cloudy
head    45     53
tail    55     47
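
To actually run the test on this table, here is a minimal sketch (the counts are copied from the table above; the dimnames are optional but make the matrix easier to read):

weather = matrix(c(45, 55, 53, 47), nrow = 2,
                 dimnames = list(c("head", "tail"), c("sunny", "cloudy")))
fisher.test(weather) #H0: the coin outcome is independent of the weather (odds ratio = 1)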

Fisher’s Exact test calculates the p-value exactly, while the Chi-square test only approximates it. For tables larger than 2x2, we will use chisq.test(). The usage is very similar.

obs = c(100, 105, 50)
exp = c(100, 100, 100)

m3 = c(obs, exp)
m3 = matrix(m3, nrow = 2, byrow = T) ##by default, R fills the columns first; byrow = T fills row by row instead!

chisq.test(m3)

    Pearson's Chi-squared test

data:  m3
X-squared = 13.227, df = 2, p-value = 0.001342

Tip: when generating a matrix, remember that R fills the columns first. If that is hard to remember (it is for me, at least), always specify byrow = T or byrow = F to avoid confusion.
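
A quick illustration of the difference (nothing to do with our data, just the numbers 1 to 6):

matrix(1:6, nrow = 2)            #default: filled column by column
matrix(1:6, nrow = 2, byrow = T) #filled row by row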


Comparing distributions - Kolmogorov–Smirnov test

As we discussed last week, two sample distributions can sometimes have the same mean and similar variances but still look very different when you make histograms. Unlike t.test(), which tests whether the means are the same, and wilcox.test(), which tests whether the medians are the same, ks.test() compares the whole distributions.

v1 = rnorm(100, 0, 4)
v2 = rnorm(100, 0, 1)

par(mfrow = c(1, 2)) #put two figures (1 row, 2 columns) in one plot
hist(v1)
hist(v2)

#compare their distribution using ks test
ks.test(v1, v2)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  v1 and v2
D = 0.38, p-value = 1.071e-06
alternative hypothesis: two-sided
#let's try another distribution with a very similar sd to v1
v3 = rnorm(100, 0, 3.9)
hist(v1)
hist(v3)

ks.test(v1, v3)

    Asymptotic two-sample Kolmogorov-Smirnov test

data:  v1 and v3
D = 0.12, p-value = 0.4676
alternative hypothesis: two-sided

Testing associations - correlation tests

There are multiple ways to test the association between two variables. In fact, t.test() and wilcox.test() test whether the values of a continuous variable (e.g., age) are associated with the levels of a categorical variable (e.g., positions in baseball games).

Similarly, fisher.test() and chisq.test() can be used to test the association between two categorical variables (see the examples above).

For two continuous variables, we usually use the Pearson correlation test, cor.test(), to test whether the association is statistically significant and how strong it is. The Pearson correlation is a parametric test with normality assumptions.

Let’s look at an example we covered last week.

Based on this figure, what null hypothesis (H0) would you propose?
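
Assuming the figure was the Weight-Height scatterplot from last week, a quick sketch to re-create it (using the data_mlb2 data frame from that session):

plot(data_mlb2$Weight, data_mlb2$Height, pch = 20, xlab = "Weight", ylab = "Height")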

cor.test(x = data_mlb2$Weight, y = data_mlb2$Height)

    Pearson's product-moment correlation

data:  data_mlb2$Weight and data_mlb2$Height
t = 20.192, df = 1032, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4869843 0.5744776
sample estimates:
      cor 
0.5321502 

Pearson correlation tests can easily be biased by outliers.

Do you think there is an association between x and y in the figure below? (The v1 and v2 used here are a small, 12-point dataset, not the rnorm vectors from the previous section.)

cor.test(v1, v2)

    Pearson's product-moment correlation

data:  v1 and v2
t = 4.6796, df = 10, p-value = 0.000868
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4855298 0.9504906
sample estimates:
     cor 
0.828557 

The nonparametric Spearman correlation test is less sensitive to outliers. It can be requested with the method = "spearman" option (remember the " ").

cor.test(v1, v2, method = "spearman")

    Spearman's rank correlation rho

data:  v1 and v2
S = 216, p-value = 0.4435
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2447552
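
If you want to see this outlier effect for yourself, here is a minimal sketch with made-up data (not the v1 and v2 used above): eleven unrelated points plus one extreme point. The single outlier inflates the Pearson correlation, while the Spearman correlation, which only uses ranks, is much less affected.

set.seed(1)
x = c(runif(11, 0, 1), 10) #11 unrelated values plus one outlier
y = c(runif(11, 0, 1), 10)
cor.test(x, y)                      #Pearson: dominated by the single outlier
cor.test(x, y, method = "spearman") #Spearman: the outlier is just the largest rank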