David Robinson
1/31/14
One of the most common statistical tasks is to test whether two continuous variables are related.
library(ggplot2)
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
cor(mtcars$wt, mtcars$mpg)
[1] -0.8677
cor.test(mtcars$wt, mtcars$mpg)
Pearson's product-moment correlation
data: mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338 -0.7441
sample estimates:
cor
-0.8677
ct = cor.test(mtcars$wt, mtcars$mpg)
ct$p.value
[1] 1.294e-10
ct$conf.int
[1] -0.9338 -0.7441
attr(,"conf.level")
[1] 0.95
A Spearman correlation is a “non-parametric” test. The Pearson correlation makes assumptions: in particular, that the relationship between the variables is linear, and its significance test assumes the values are approximately normally distributed, which makes it sensitive to outliers.
A Spearman correlation instead compares the ranks of the values, and so it is more robust when those assumptions are violated.
cor(mtcars$wt, mtcars$mpg, method="spearman")
[1] -0.8864
cor.test(mtcars$wt, mtcars$mpg, method="spearman")
Spearman's rank correlation rho
data: mtcars$wt and mtcars$mpg
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.8864
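To see that robustness in action, here is a hypothetical illustration (the added point is invented, not part of mtcars): append a single extreme outlier and compare how much each correlation moves.
wt.out = c(mtcars$wt, 10)    # hypothetical outlier: an extremely heavy car...
mpg.out = c(mtcars$mpg, 60)  # ...with extremely high mileage
cor(wt.out, mpg.out)  # Pearson is pulled strongly toward the outlier
cor(wt.out, mpg.out, method="spearman")  # Spearman, using ranks, moves far less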
Linear regression is very similar to a Pearson correlation test, but it estimates more than the strength of the relationship: it fits an intercept and a slope describing the best-fit line, and it extends naturally to multiple predictors.
fit = lm(mpg ~ wt, data=mtcars)
summary(fit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.543 -2.365 -0.125 1.410 6.873
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.878 19.86 < 2e-16 ***
wt -5.344 0.559 -9.56 1.3e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared: 0.753, Adjusted R-squared: 0.745
F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
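To see the fit visually, you can overlay the regression line on the earlier scatterplot (geom_smooth with method="lm" is the standard ggplot2 way to do this):
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() + geom_smooth(method="lm")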
Test multiple predictors in combination:
multiple.fit = lm(mpg ~ wt + disp + gear, data=mtcars)
summary(multiple.fit)
Call:
lm(formula = mpg ~ wt + disp + gear, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.589 -2.350 -0.807 1.604 6.200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.41433 4.92749 7.59 2.9e-08 ***
wt -3.50937 1.21227 -2.89 0.0073 **
disp -0.01825 0.00935 -1.95 0.0610 .
gear -0.49435 0.88920 -0.56 0.5827
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.95 on 28 degrees of freedom
Multiple R-squared: 0.783, Adjusted R-squared: 0.76
F-statistic: 33.7 on 3 and 28 DF, p-value: 1.95e-09
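To test whether disp and gear actually improve the model beyond wt alone, compare the two nested fits with an F test:
anova(fit, multiple.fit)  # F test: single-predictor vs. multiple-predictor model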
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285 1.8776 19.858 8.242e-19
wt -5.344 0.5591 -9.559 1.294e-10
summary(fit)$r.squared
[1] 0.7528
summary(fit)$adj.r.squared
[1] 0.7446
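Two other standard accessors are worth knowing: confint gives confidence intervals for the coefficients, and predict applies the fit to new data (the wt value below is just an example):
confint(fit)  # 95% confidence intervals for the intercept and slope
predict(fit, newdata=data.frame(wt=3.5))  # predicted mpg for a car weighing 3500 lbs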
Comparing mpg between automatic (0) and manual (1):
ggplot(mtcars, aes(x=factor(am), y=mpg)) + geom_boxplot()
mpg is a numeric vector representing miles per gallon.
mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
am is a numeric vector representing automatic (0) or manual (1).
mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
[26] 1 1 1 1 1 1 1
tt = t.test(mpg ~ am, data=mtcars)
tt
Welch Two Sample t-test
data: mpg by am
t = -3.767, df = 18.33, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.28 -3.21
sample estimates:
mean in group 0 mean in group 1
17.15 24.39
tt$p.value
[1] 0.001374
The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is the non-parametric version: like the Spearman correlation, it works on ranks, and so it is more robust when the t-test's assumptions are violated.
wilcox.test(mpg ~ am, data=mtcars)
Wilcoxon rank sum test with continuity correction
data: mpg by am
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0
Let's look at an experimental dataset that needs a different approach: an experiment on how cold temperature affects CO2 uptake in plants. Each plant was either chilled or not chilled, and its CO2 uptake was measured at several ambient CO2 concentrations. Each plant came from either Quebec or Mississippi.
data(CO2)
head(CO2)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
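Before modeling, it helps to plot the data (a sketch using the same ggplot2 tools as above):
ggplot(CO2, aes(x=conc, y=uptake, color=Treatment)) + geom_point() + facet_wrap(~ Type)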
summary(lm(uptake ~ conc + Treatment + Type + Plant, CO2))
Call:
lm(formula = uptake ~ conc + Treatment + Type + Plant, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-17.37 -1.64 1.27 3.89 8.79
Coefficients: (2 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       37.41512    4.67429    8.00  1.6e-11 ***
conc               0.01773    0.00223    7.96  2.0e-11 ***
Treatmentchilled -12.49674    5.10128   -2.45  0.01676 *
TypeMississippi  -23.33293    6.01035   -3.88  0.00023 ***
Plant.L           21.58453   11.14199    1.94  0.05670 .
Plant.Q           -4.61669    2.26946   -2.03  0.04566 *
Plant.C            1.45994    5.10346    0.29  0.77566
Plant^4            2.33920    2.26946    1.03  0.30617
Plant^5           -0.47824    5.77002   -0.08  0.93418
Plant^6           -0.03902    2.26946   -0.02  0.98633
Plant^7           -1.91482    3.64147   -0.53  0.60064
Plant^8           -3.27825    2.26946   -1.44  0.15300
Plant^9                 NA         NA      NA       NA
Plant^10           0.54631    2.26946    0.24  0.81046
Plant^11                NA         NA      NA       NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6 on 71 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.692
F-statistic: 16.5 on 12 and 71 DF, p-value: 4.5e-16
CO2$Plant
[1] Qn1 Qn1 Qn1 Qn1 Qn1 Qn1 Qn1 Qn2 Qn2 Qn2 Qn2 Qn2
[13] Qn2 Qn2 Qn3 Qn3 Qn3 Qn3 Qn3 Qn3 Qn3 Qc1 Qc1 Qc1
[25] Qc1 Qc1 Qc1 Qc1 Qc2 Qc2 Qc2 Qc2 Qc2 Qc2 Qc2 Qc3
[37] Qc3 Qc3 Qc3 Qc3 Qc3 Qc3 Mn1 Mn1 Mn1 Mn1 Mn1 Mn1
[49] Mn1 Mn2 Mn2 Mn2 Mn2 Mn2 Mn2 Mn2 Mn3 Mn3 Mn3 Mn3
[61] Mn3 Mn3 Mn3 Mc1 Mc1 Mc1 Mc1 Mc1 Mc1 Mc1 Mc2 Mc2
[73] Mc2 Mc2 Mc2 Mc2 Mc2 Mc3 Mc3 Mc3 Mc3 Mc3 Mc3 Mc3
12 Levels: Qn1 < Qn2 < Qn3 < Qc1 < Qc3 < ... < Mc1
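The < signs show that Plant is an ordered factor. That is why lm used polynomial contrasts (Plant.L, Plant.Q, and so on) rather than one coefficient per plant, and the NA rows appear because each plant belongs to exactly one Type and Treatment, making two of the contrasts redundant. If you would rather see treatment contrasts (one coefficient per plant), one option is to drop the ordering first, sketched here on a copy of the data:
CO2.unord = CO2
CO2.unord$Plant = factor(CO2.unord$Plant, ordered=FALSE)  # unordered factor: treatment contrasts
summary(lm(uptake ~ conc + Treatment + Type + Plant, CO2.unord))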
anova(lm(uptake ~ conc + Treatment + Type + Plant, CO2))
Analysis of Variance Table
Response: uptake
Df Sum Sq Mean Sq F value Pr(>F)
conc 1 2285 2285 63.38 2.0e-11 ***
Treatment 1 988 988 27.41 1.6e-06 ***
Type 1 3366 3366 93.35 1.4e-14 ***
Plant 9 509 57 1.57 0.14
Residuals 71 2560 36
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
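One caveat about anova on an lm fit: it uses sequential (Type I) sums of squares, so each term is tested only after the terms listed before it. Reordering the formula can therefore change the table:
anova(lm(uptake ~ Type + Treatment + conc + Plant, CO2))  # same model, different term order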
The data.table package, which is effectively an improvement on the built-in data frame, is a very powerful tool for analyzing a dataset. We'll use it to summarize a larger dataset, computing statistics within groups.
d = read.csv("http://dgrtwo.github.io/pages/celldata.csv")
head(d)
Area Length Width Eccentricity Perimeter TotalDye1
1 212 29.99 9.534 0.9481 67.84 384757
2 293 36.93 10.608 0.9579 81.50 27480
3 218 30.06 10.390 0.9384 68.18 26859
4 557 66.94 11.524 0.9851 142.67 35491
5 120 19.15 8.202 0.9036 44.63 22401
6 228 25.85 11.375 0.8980 60.87 474736
MeanDye1 TotalDye2 MeanDye2 TotalDye3 MeanDye3 Drug
1 1814.89 20017 94.42 352758 1664.0 DrugB
2 93.79 29566 100.91 235347 803.2 DrugB
3 123.21 18150 83.26 135568 621.9 DrugB
4 63.72 12645 22.70 80788 145.0 DrugE
5 186.68 59788 498.23 147036 1225.3 DrugF
6 2082.18 167238 733.50 200961 881.4 DrugA
This dataset has 3000 observations of E. coli cells under different drug treatments (DrugA-F). Multiple phenotypes (physical attributes) of each cell were measured, which we can plot against each other:
ggplot(d, aes(Length, Width, col=Drug)) + geom_point() + stat_smooth(method="lm")
ggplot(d, aes(MeanDye1, MeanDye2, col=Drug)) + geom_point() + stat_smooth(method="lm") + scale_x_log10() + scale_y_log10()
ggplot(d, aes(MeanDye2, MeanDye3, col=Drug)) + geom_point() + stat_smooth(method="lm") + scale_x_log10() + scale_y_log10()
summary(lm(Length ~ Drug, data=d))
Call:
lm(formula = Length ~ Drug, data = d)
Residuals:
Min 1Q Median 3Q Max
-44.97 -9.92 -2.94 6.06 114.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.914 0.937 41.54 < 2e-16 ***
DrugDrugB 7.215 1.074 6.72 2.2e-11 ***
DrugDrugC -8.259 1.217 -6.79 1.4e-11 ***
DrugDrugD -17.774 1.305 -13.62 < 2e-16 ***
DrugDrugE 12.441 1.326 9.38 < 2e-16 ***
DrugDrugF -11.935 1.204 -9.91 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.9 on 2994 degrees of freedom
Multiple R-squared: 0.266, Adjusted R-squared: 0.265
F-statistic: 217 on 5 and 2994 DF, p-value: <2e-16
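Note how to read these coefficients: the intercept is the mean Length in the baseline group (DrugA, the first level alphabetically), and each DrugDrugX estimate is the difference between that drug's mean and DrugA's. A quick check:
mean(d$Length[d$Drug == "DrugA"])  # matches the intercept above, 38.914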
anova(lm(Length ~ Drug, data=d))
Analysis of Variance Table
Response: Length
Df Sum Sq Mean Sq F value Pr(>F)
Drug 5 309734 61947 217 <2e-16 ***
Residuals 2994 854006 285
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which drugs result in the same global phenotypic landscape as DrugF? There are many approaches we could use; we'll do something simple today: average each attribute within each drug.
library(data.table)
dt = as.data.table(d)
dt[, mean(Area), by=Drug]
Drug V1
1: DrugB 363.3
2: DrugE 551.4
3: DrugF 211.4
4: DrugA 338.2
5: DrugD 162.3
6: DrugC 253.3
dt[, mean(Length), by=Drug]
Drug V1
1: DrugB 46.13
2: DrugE 51.35
3: DrugF 26.98
4: DrugA 38.91
5: DrugD 21.14
6: DrugC 30.65
dt[, sd(Length), by=Drug]
Drug V1
1: DrugB 21.758
2: DrugE 20.147
3: DrugF 10.961
4: DrugA 12.905
5: DrugD 7.277
6: DrugC 14.721
dt[, list(Area=mean(Area)), by=Drug]
Drug Area
1: DrugB 363.3
2: DrugE 551.4
3: DrugF 211.4
4: DrugA 338.2
5: DrugD 162.3
6: DrugC 253.3
dt[, list(Area=mean(Area), Length=mean(Length)), by=Drug]
Drug Area Length
1: DrugB 363.3 46.13
2: DrugE 551.4 51.35
3: DrugF 211.4 26.98
4: DrugA 338.2 38.91
5: DrugD 162.3 21.14
6: DrugC 253.3 30.65
dt[, list(Area=mean(Area), Length=mean(Length), Width=mean(Width), Eccentricity=mean(Eccentricity), Perimeter=mean(Perimeter), TotalDye1=mean(TotalDye1), TotalDye2=mean(TotalDye2), TotalDye3=mean(TotalDye3), MeanDye1=mean(MeanDye1), MeanDye2=mean(MeanDye2), MeanDye3=mean(MeanDye3)), by=Drug]
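Typing out every column like that is tedious and error-prone. data.table has a shortcut: inside the brackets, .SD stands for the subset of columns within each group, so you can average every non-grouping column at once (this assumes the remaining columns are all numeric, as they are here):
dt[, lapply(.SD, mean), by=Drug]  # mean of every column, within each drug
Another approach is to reshape the data so that each attribute becomes its own row, using the reshape2 package: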
library(reshape2)
dt.m = melt(dt, id="Drug")
dt.m = as.data.table(dt.m)
head(dt.m)
Drug variable value
1: DrugB Area 212
2: DrugB Area 293
3: DrugB Area 218
4: DrugE Area 557
5: DrugF Area 120
6: DrugA Area 228
ggplot(dt.m, aes(value, fill=Drug)) + geom_histogram() + facet_wrap(~ variable) + scale_x_log10()
averages = dt.m[, list(average=mean(value)), by=c("Drug", "variable")]
head(averages)
Drug variable average
1: DrugB Area 363.3
2: DrugE Area 551.4
3: DrugF Area 211.4
4: DrugA Area 338.2
5: DrugD Area 162.3
6: DrugC Area 253.3
averages.matrix = dcast(averages, variable ~ Drug)
head(averages.matrix)
      variable     DrugA     DrugB     DrugC     DrugD     DrugE     DrugF
1         Area 3.382e+02 3.633e+02 2.533e+02 1.623e+02 5.514e+02 2.114e+02
2       Length 3.891e+01 4.613e+01 3.065e+01 2.114e+01 5.135e+01 2.698e+01
3        Width 1.184e+01 1.133e+01 1.117e+01 1.027e+01 1.540e+01 1.069e+01
4 Eccentricity 9.395e-01 9.548e-01 8.995e-01 8.368e-01 9.363e-01 8.898e-01
5    Perimeter 8.738e+01 1.044e+02 7.200e+01 5.369e+01 1.291e+02 6.656e+01
6    TotalDye1 3.379e+05 1.254e+05 9.490e+04 2.913e+04 7.978e+04 5.733e+04
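These per-drug profiles bring us back to the opening question: which drugs look most like DrugF? One simple follow-up (a sketch, not the only option) is to scale each attribute so they are comparable, then compute pairwise distances between the drug columns; the drug with the smallest distance to DrugF has the most similar average phenotype.
m = as.matrix(averages.matrix[, -1])  # drop the variable column, keeping one column per drug
rownames(m) = averages.matrix$variable
dist(scale(t(m)))  # rows are drugs; smaller distances mean more similar profiles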
Simulations play an important role in many areas of natural and social science. Some are meant to gain a clearer understanding of the behavior of a system, some are designed to test models, and some statistical methods, such as permutation tests and bootstrapping, use random number generation to assess statistical significance.
You can generate a random number with the sample function. Give it the range or vector to be sampled from and the number of elements to sample:
sample(1:10, 1)
[1] 8
sample(1:10, 1)
[1] 7
x = c(1, 4, 9, 15, 21, 30)
sample(x, 1)
[1] 1
sample(x, 1)
[1] 21
When you don't give a number of elements, it randomly permutes the vector given.
sample(1:10)
[1] 8 6 3 7 2 1 4 5 9 10
sample(1:10)
[1] 1 6 3 2 5 9 10 4 7 8
sample(x)
[1] 1 30 9 15 21 4
sample(x)
[1] 30 9 1 15 21 4
You can generate a random decimal between 0 and 1 with the runif function:
runif(1)
[1] 0.3073
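Random number generation is what powers the permutation tests mentioned above. As a closing sketch, here is a minimal permutation test of the mtcars transmission comparison from earlier: shuffle the am labels many times and see how often a shuffled difference in mean mpg is as large as the real one.
observed = with(mtcars, mean(mpg[am == 1]) - mean(mpg[am == 0]))
permuted = replicate(10000, {
    shuffled = sample(mtcars$am)  # randomly reassign the transmission labels
    mean(mtcars$mpg[shuffled == 1]) - mean(mtcars$mpg[shuffled == 0])
})
mean(abs(permuted) >= abs(observed))  # two-sided permutation p-value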