vx = runif(20, 0, 1)
vy = runif(20, 10, 100)
plot(vx, vy, pch = 20) #if we want to make the range of x and y the same, we could use xlim and ylim
plot(vx, vy, pch = 20, xlim = c(0, 100), ylim = c(0, 100))
You can always use ?function_name to see what a function does (e.g., ?abline in the L-D report).
The range of the x- and y-axis can be specified with xlim = c(start, end) and ylim = c(start, end).
Oftentimes, making plots on different scales gives you a different view of the data. For the reports, you can decide which plot(s) are useful.
You will need to count the number of plates without any colonies to estimate mutation rate. There are many ways you could do it, and here are some possibilities.
v1 = c(0, 1, 2, 6, 0, 10, 20, 3, 5, 2, 0, 0, 2, 5, 7, 19)
bool = v1 == 0
#use table to summarize how often your vector equals zero
table(bool)
bool
FALSE  TRUE
   12     4
#sum of bool variables: T = 1, F = 0
sum(bool)
[1] 4
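Another option, just as a small sketch equivalent to the sum above, is to use which() to find the positions of the zeros and length() to count them:
#which() returns the positions where the condition is TRUE
which(v1 == 0)
#length() counts how many such positions there are
length(which(v1 == 0))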
When the numerical data is sparse, what we learned in Week 3 about summarizing data may not be that useful.
Do you think a histogram and a boxplot would be helpful for summarizing v1 below?
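The histogram and boxplot themselves are not reproduced here; a minimal sketch of how you might draw them side by side is:
par(mfrow = c(1, 2)) #one row, two panels in one plot
hist(v1)
boxplot(v1)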
It is likely that the data only consist of a few values. table() is always a good function to check whether that is true.
table(v1)
v1
 1 19 20
 8  6  5
So, in this example, there are only three possible values of v1. Instead of a histogram, a barplot may be more appropriate for v1. Making a barplot from the outcome of table() is easy.
barplot(table(v1))
You can also specify the heights of the bars by providing a vector. The names can be specified with the option names = c( ). Remember, group names need " " around them (since they are not variables).
v2 = c(1, 10, 4)
barplot(v2, names = c("control", "treatment 1", "treatment 2"))
To compare observed counts with an expectation, we can use fisher.test() (for 2x2 tables) and chisq.test().
A commonly used example to understand Fisher’s exact test and Chi-square test is coin tossing. Let’s say we are tossing a fair coin.
?sample
coin = c('head', 'tail')
ntoss = 100 #the number of random tosses
random_coin = sample(coin, size = ntoss, replace = TRUE) #a vector of ntoss random tosses
table(random_coin)
random_coin
head tail
  49   51
obs_toss = as.vector(table(random_coin)) #force the table outcome to be a vector
exp_toss = c(ntoss/2, ntoss/2) #expectation
#combine the two vectors into one long vector
m = c(obs_toss, exp_toss)
#make it into a 2x2 matrix
m = matrix(m, nrow = 2)
fisher.test(m)
Fisher's Exact Test for Count Data
data: m
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5315849 1.7363182
sample estimates:
odds ratio
0.9609541
Let’s try tossing another “unfair” coin. You can specify the probability of each outcome in sample() with prob = c( , ).
coin = c('head', 'tail')
ntoss = 100 #the number of random tosses
random_coin2 = sample(coin, size = ntoss, replace = TRUE, prob = c(0.8, 0.2)) #tossing an unfair coin
table(random_coin2)
random_coin2
head tail
  73   27
obs_toss2 = as.vector(table(random_coin2)) #force the table outcome to be a vector
exp_toss2 = c(ntoss/2, ntoss/2) #expectation
#combine the two vectors into one long vector
m2 = c(obs_toss2, exp_toss2)
#make it into a 2x2 matrix
m2 = matrix(m2, nrow = 2)
m2
     [,1] [,2]
[1,]   73   50
[2,]   27   50
fisher.test(m2)
Fisher's Exact Test for Count Data
data: m2
p-value = 0.001316
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.439928 5.105704
sample estimates:
odds ratio
2.689832
Make sure you know how to interpret the Fisher’s Exact test results!
Consider the following data structure. When you conduct Fisher’s Exact test with the following 2x2 table, what are you testing?
|      | sunny | cloudy |
|------|-------|--------|
| head | 45    | 53     |
| tail | 55    | 47     |
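If you want to try it yourself, here is a minimal sketch that reproduces the counts in the table above so you can run the test (the object name weather_coin is just for illustration):
#rows = head/tail, columns = sunny/cloudy
weather_coin = matrix(c(45, 53, 55, 47), nrow = 2, byrow = TRUE)
rownames(weather_coin) = c("head", "tail")
colnames(weather_coin) = c("sunny", "cloudy")
fisher.test(weather_coin)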
Fisher’s Exact test calculates the p-value exactly, while the Chi-square test approximates it. For tables larger than 2x2, we use chisq.test(). The usage is very similar.
obs = c(100, 105, 50)
exp = c(100, 100, 100)
m3 = c(obs, exp)
m3 = matrix(m3, nrow = 2, byrow = T) #by default, R fills the columns first!
chisq.test(m3)
Pearson's Chi-squared test
data: m3
X-squared = 13.227, df = 2, p-value = 0.001342
Tip: when generating a matrix, remember that R fills the columns first. If that is hard to remember (at least for me it is), always specify byrow = T or byrow = F to avoid confusion.
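A quick toy example with the numbers 1 to 6 shows the difference:
matrix(1:6, nrow = 2) #filled column by column: the first column is 1, 2
matrix(1:6, nrow = 2, byrow = TRUE) #filled row by row: the first row is 1, 2, 3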
As we discussed last week, sample distributions can sometimes have the same mean and similar variance yet look very different when you make histograms. Unlike t.test(), which tests whether the means are the same, and wilcox.test(), which tests whether the medians are the same, ks.test() compares the whole distribution.
v1 = rnorm(100, 0, 4)
v2 = rnorm(100, 0, 1)
par(mfrow = c(1, 2)) #put two figures (1 row, 2 columns) in one plot
hist(v1)
hist(v2)
#compare their distributions using a KS test
ks.test(v1, v2)
Asymptotic two-sample Kolmogorov-Smirnov test
data: v1 and v2
D = 0.38, p-value = 1.071e-06
alternative hypothesis: two-sided
#let's try another distribution with a very similar sd to v1
v3 = rnorm(100, 0, 3.9)
hist(v1)
hist(v3)
ks.test(v1, v3)
Asymptotic two-sample Kolmogorov-Smirnov test
data: v1 and v3
D = 0.12, p-value = 0.4676
alternative hypothesis: two-sided
To test the association between two variables, we have multiple options. In fact, t.test() and wilcox.test() test whether the values of a continuous variable (e.g., age) are associated with the levels of a categorical variable (e.g., position in baseball games).
Similarly, fisher.test() and chisq.test() can be used to test the association between two categorical variables (see the examples above).
For two continuous variables, we usually use the Pearson correlation test, cor.test(), to test whether the association is statistically significant and how strong it is. The Pearson correlation is a parametric test with a normality assumption.
Let’s look at an example we covered last week.
Based on this figure, what null hypothesis (H0) would you propose?
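The figure itself is not reproduced here; it was presumably a weight-height scatterplot along the lines of the sketch below (assuming data_mlb2 is the MLB data frame from last week, with Weight and Height columns):
plot(data_mlb2$Weight, data_mlb2$Height, pch = 20, xlab = "Weight", ylab = "Height") #scatterplot of weight vs. height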
cor.test(x = data_mlb2$Weight, y = data_mlb2$Height)
Pearson's product-moment correlation
data: data_mlb2$Weight and data_mlb2$Height
t = 20.192, df = 1032, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4869843 0.5744776
sample estimates:
cor
0.5321502
Pearson correlation tests can easily be biased by outliers.
Do you think there is an association between x and y in the figure below?
cor.test(v1, v2)
Pearson's product-moment correlation
data: v1 and v2
t = 4.6796, df = 10, p-value = 0.000868
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4855298 0.9504906
sample estimates:
cor
0.828557
The nonparametric Spearman correlation test is less sensitive to outliers. It can be specified with the option method = "spearman" (remember the " ").
cor.test(v1, v2, method = "spearman")
Spearman's rank correlation rho
data: v1 and v2
S = 216, p-value = 0.4435
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.2447552
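The v1 and v2 used in the two correlation tests above are not shown in this section. As a self-contained sketch with made-up data (the vectors x and y below are purely illustrative), here is how a single extreme point can inflate the Pearson correlation while the rank-based Spearman correlation is affected much less:
set.seed(1) #so the example is reproducible
x = c(rnorm(11), 10) #11 unrelated values plus one extreme point
y = c(rnorm(11), 10)
cor.test(x, y) #Pearson: pulled up by the single extreme point
cor.test(x, y, method = "spearman") #Spearman: uses ranks, so the extreme point is just another rank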