Data Analysis and Visualization Using R: Lesson 3

David Robinson
1/31/14

Stuff We Won't Be Discussing

Mathematical formulas for anything (those can be found in a class, statistical textbooks, online resources, Wikipedia etc if you want to do these by hand)
Philosophy behind p-values (below .05 means it's significant, above means it's probably not. Deal?)

Statistical Testing

Testing relationship between two values

One of the most common statistical tasks is to test whether two continuous variables are related.

Relationship between weight and miles-per-gallon

ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

plot of chunk plot_wt_mpg

Use Correlation for Linear Relationship

cor(mtcars$wt, mtcars$mpg)

[1] -0.8677

cor.test(mtcars$wt, mtcars$mpg)


    Pearson's product-moment correlation

data:  mtcars$wt and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338 -0.7441
sample estimates:
    cor 
-0.8677

Getting Values out of a Test: Use $

ct = cor.test(mtcars$wt, mtcars$mpg)
ct$p.value

[1] 1.294e-10

ct$conf.int

[1] -0.9338 -0.7441
attr(,"conf.level")
[1] 0.95

Use Spearman if Relationship is not Linear

A Spearman correlation is a “non-parametric” test. The Pearson correlation makes assumptions:

the two variables are linearly related
the residual error is normally distributed

A Spearman correlation instead compares the ranks, and so it is more robust when those assumptions are violated.

Use Spearman if Relationship is not Linear

cor(mtcars$wt, mtcars$mpg, method="spearman")

[1] -0.8864

cor.test(mtcars$wt, mtcars$mpg, method="spearman")


    Spearman's rank correlation rho

data:  mtcars$wt and mtcars$mpg
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
    rho 
-0.8864

Other resources

Regression

Linear regression is very similar to a Pearson correlation test, but it

offers more details about the relationship, including the slope
allows for multiple dependent variables
can be extended to non-linear data (generalized linear models)

Linear regression

fit = lm(mpg ~ wt, data=mtcars)
summary(fit)


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753, Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Multiple Regression

Test multiple predictors in combination:

multiple.fit = lm(mpg ~ wt + disp + gear, data=mtcars)
summary(multiple.fit)


Call:
lm(formula = mpg ~ wt + disp + gear, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.589 -2.350 -0.807  1.604  6.200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.41433    4.92749    7.59  2.9e-08 ***
wt          -3.50937    1.21227   -2.89   0.0073 ** 
disp        -0.01825    0.00935   -1.95   0.0610 .  
gear        -0.49435    0.88920   -0.56   0.5827    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.95 on 28 degrees of freedom
Multiple R-squared:  0.783, Adjusted R-squared:  0.76 
F-statistic: 33.7 on 3 and 28 DF,  p-value: 1.95e-09

Extracting values from the summary

summary(fit)$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   37.285     1.8776  19.858 8.242e-19
wt            -5.344     0.5591  -9.559 1.294e-10

summary(fit)$r.squared

[1] 0.7528

summary(fit)$adj.r.squared

[1] 0.7446

Other resources

Regression
- Detailed guide to regression in R

Testing difference between groups

Testing between two groups

Comparing mpg between automatic (0) and manual (1):

ggplot(mtcars, aes(x=factor(am), y=mpg)) + geom_boxplot()

plot of chunk auto_manual_boxplot

Testing between two groups

mpg is a numeric vector representing miles per gallon.

mtcars$mpg

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

am is a numeric vector representing automatic (0) or manual (1).

mtcars$am

 [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
[26] 1 1 1 1 1 1 1

For two groups, use a t-test

tt = t.test(mpg ~ am, data=mtcars)
tt


    Welch Two Sample t-test

data:  mpg by am
t = -3.767, df = 18.33, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.28  -3.21
sample estimates:
mean in group 0 mean in group 1 
          17.15           24.39

tt$p.value

[1] 0.001374

Difference between non-normal groups

Wilcoxon rank-sum test, also known as a Mann-Whitney U-Test, is the non-parametric version that is more robust.

wilcox.test(mpg ~ am, data=mtcars)


    Wilcoxon rank sum test with continuity
    correction

data:  mpg by am
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0

ANOVA

Let's look at an experimental set of data that needs a different approach: an experiment on how cold temperature affects CO2 uptake in plants. The plants were either chilled or not chilled, and measures of both CO2 concentration and CO2 uptake were performed. Each plant came either from Quebec or Mississippi.

CO2 Data

data(CO2)
head(CO2)

  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

Summary of a Linear Model Shows Effect of Each Plant

summary(lm(uptake ~ conc + Treatment + Type + Plant, CO2))


Call:
lm(formula = uptake ~ conc + Treatment + Type + Plant, data = CO2)

Residuals:
   Min     1Q Median     3Q    Max 
-17.37  -1.64   1.27   3.89   8.79 

Coefficients: (2 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       37.41512    4.67429    8.00  1.6e-11
conc               0.01773    0.00223    7.96  2.0e-11
Treatmentchilled -12.49674    5.10128   -2.45  0.01676
TypeMississippi  -23.33293    6.01035   -3.88  0.00023
Plant.L           21.58453   11.14199    1.94  0.05670
Plant.Q           -4.61669    2.26946   -2.03  0.04566
Plant.C            1.45994    5.10346    0.29  0.77566
Plant^4            2.33920    2.26946    1.03  0.30617
Plant^5           -0.47824    5.77002   -0.08  0.93418
Plant^6           -0.03902    2.26946   -0.02  0.98633
Plant^7           -1.91482    3.64147   -0.53  0.60064
Plant^8           -3.27825    2.26946   -1.44  0.15300
Plant^9                 NA         NA      NA       NA
Plant^10           0.54631    2.26946    0.24  0.81046
Plant^11                NA         NA      NA       NA

(Intercept)      ***
conc             ***
Treatmentchilled *  
TypeMississippi  ***
Plant.L          .  
Plant.Q          *  
Plant.C             
Plant^4             
Plant^5             
Plant^6             
Plant^7             
Plant^8             
Plant^9             
Plant^10            
Plant^11            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6 on 71 degrees of freedom
Multiple R-squared:  0.736, Adjusted R-squared:  0.692 
F-statistic: 16.5 on 12 and 71 DF,  p-value: 4.5e-16

Plant is a "factor" variable

CO2$Plant

 [1] Qn1 Qn1 Qn1 Qn1 Qn1 Qn1 Qn1 Qn2 Qn2 Qn2 Qn2 Qn2
[13] Qn2 Qn2 Qn3 Qn3 Qn3 Qn3 Qn3 Qn3 Qn3 Qc1 Qc1 Qc1
[25] Qc1 Qc1 Qc1 Qc1 Qc2 Qc2 Qc2 Qc2 Qc2 Qc2 Qc2 Qc3
[37] Qc3 Qc3 Qc3 Qc3 Qc3 Qc3 Mn1 Mn1 Mn1 Mn1 Mn1 Mn1
[49] Mn1 Mn2 Mn2 Mn2 Mn2 Mn2 Mn2 Mn2 Mn3 Mn3 Mn3 Mn3
[61] Mn3 Mn3 Mn3 Mc1 Mc1 Mc1 Mc1 Mc1 Mc1 Mc1 Mc2 Mc2
[73] Mc2 Mc2 Mc2 Mc2 Mc2 Mc3 Mc3 Mc3 Mc3 Mc3 Mc3 Mc3
12 Levels: Qn1 < Qn2 < Qn3 < Qc1 < Qc3 < ... < Mc1

ANOVA instead tests for the effect of including the plant variable

anova(lm(uptake ~ conc + Treatment + Type + Plant, CO2))

Analysis of Variance Table

Response: uptake
          Df Sum Sq Mean Sq F value  Pr(>F)    
conc       1   2285    2285   63.38 2.0e-11 ***
Treatment  1    988     988   27.41 1.6e-06 ***
Type       1   3366    3366   93.35 1.4e-14 ***
Plant      9    509      57    1.57    0.14    
Residuals 71   2560      36                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Other resources

t-test
- http://www.stat.columbia.edu/~martin/W2024/R2.pdf
- Description of the math behind the t-test

Using data.table for data analysis

The data.table package, which is effectively an improvement on the built in data frame, is a very powerful tool for analyzing a dataset. We'll use it to

Educational Purposes Only

This is real, unpublished data.
It is not my data, and it is also not yours. Don't use it to scoop the author.
The names of drugs and dyes have been anonymized, and a few hundred rows removed, to protect it from theft

Introduction to the dataset

d = read.csv("http://dgrtwo.github.io/pages/celldata.csv")
head(d)

  Area Length  Width Eccentricity Perimeter TotalDye1
1  212  29.99  9.534       0.9481     67.84    384757
2  293  36.93 10.608       0.9579     81.50     27480
3  218  30.06 10.390       0.9384     68.18     26859
4  557  66.94 11.524       0.9851    142.67     35491
5  120  19.15  8.202       0.9036     44.63     22401
6  228  25.85 11.375       0.8980     60.87    474736
  MeanDye1 TotalDye2 MeanDye2 TotalDye3 MeanDye3  Drug
1  1814.89     20017    94.42    352758   1664.0 DrugB
2    93.79     29566   100.91    235347    803.2 DrugB
3   123.21     18150    83.26    135568    621.9 DrugB
4    63.72     12645    22.70     80788    145.0 DrugE
5   186.68     59788   498.23    147036   1225.3 DrugF
6  2082.18    167238   733.50    200961    881.4 DrugA

Introduction to the dataset

This dataset has 3000 observations of E. coli cells under different drug treatments (DrugA-F). Multiple phenotypes (physical attributes) of each cell were measured:

Area
Eccentricity
Perimeter
Flourescence Under Particular Dyes (Dye1, Dye2, Dye3)

Can examine simple questions visually with ggplot2

ggplot(d, aes(Length, Width, col=Drug)) + geom_point() + stat_smooth(method="lm")

plot of chunk unnamed-chunk-11

Can examine simple questions visually with ggplot2

ggplot(d, aes(MeanDye1, MeanDye2, col=Drug)) + geom_point() + stat_smooth(method="lm") + scale_x_log10() + scale_y_log10()

plot of chunk unnamed-chunk-12

Can examine simple questions visually with ggplot2

ggplot(d, aes(MeanDye2, MeanDye3, col=Drug)) + geom_point() + stat_smooth(method="lm") + scale_x_log10() + scale_y_log10()

plot of chunk unnamed-chunk-13

Can see whether each phenotype depends on the drug using lm/anova

summary(lm(Length ~ Drug, data=d))


Call:
lm(formula = Length ~ Drug, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
-44.97  -9.92  -2.94   6.06 114.36 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   38.914      0.937   41.54  < 2e-16 ***
DrugDrugB      7.215      1.074    6.72  2.2e-11 ***
DrugDrugC     -8.259      1.217   -6.79  1.4e-11 ***
DrugDrugD    -17.774      1.305  -13.62  < 2e-16 ***
DrugDrugE     12.441      1.326    9.38  < 2e-16 ***
DrugDrugF    -11.935      1.204   -9.91  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.9 on 2994 degrees of freedom
Multiple R-squared:  0.266, Adjusted R-squared:  0.265 
F-statistic:  217 on 5 and 2994 DF,  p-value: <2e-16

Can see whether each phenotype depends on the drug using lm/anova

anova(lm(Length ~ Drug, data=d))

Analysis of Variance Table

Response: Length
            Df Sum Sq Mean Sq F value Pr(>F)    
Drug         5 309734   61947     217 <2e-16 ***
Residuals 2994 854006     285                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Main question of interest

Which drugs result in the same global phenotypic landscape as DrugF?

There are many approaches we can use:
- Multinomial logistic regression
- Principal component analysis
We'll do something simple today: average each attribute within each drug

Using data.table to perform calculations within a grouped column

library(data.table)
dt = as.data.table(d)
dt[, mean(Area), by=Drug]

    Drug    V1
1: DrugB 363.3
2: DrugE 551.4
3: DrugF 211.4
4: DrugA 338.2
5: DrugD 162.3
6: DrugC 253.3

Using data.table to perform calculations within a grouped column

dt[, mean(Length), by=Drug]

    Drug    V1
1: DrugB 46.13
2: DrugE 51.35
3: DrugF 26.98
4: DrugA 38.91
5: DrugD 21.14
6: DrugC 30.65

Doesn't have to be the average:

dt[, sd(Length), by=Drug]

    Drug     V1
1: DrugB 21.758
2: DrugE 20.147
3: DrugF 10.961
4: DrugA 12.905
5: DrugD  7.277
6: DrugC 14.721

We can name the column with list()

dt[, list(Area=mean(Area)), by=Drug]

    Drug  Area
1: DrugB 363.3
2: DrugE 551.4
3: DrugF 211.4
4: DrugA 338.2
5: DrugD 162.3
6: DrugC 253.3

Similarly, can use list() to calculate multiple columns

dt[, list(Area=mean(Area), Length=mean(Length)), by=Drug]

    Drug  Area Length
1: DrugB 363.3  46.13
2: DrugE 551.4  51.35
3: DrugF 211.4  26.98
4: DrugA 338.2  38.91
5: DrugD 162.3  21.14
6: DrugC 253.3  30.65

If we wanted to, we could use this to average all columns

dt[, list(Area=mean(Area), Length=mean(Length), Width=mean(Width), Eccentricity=mean(Eccentricity), Perimeter=mean(Perimeter), TotalDye1=mean(TotalDye1), TotalDye2=mean(TotalDye2), TotalDye3=mean(TotalDye3), MeanDye1=mean(MeanDye1), MeanDye2=mean(TotalDye2), MeanDye3=mean(MeanDye3)), by=Drug]

But we don't want to.

Melt data so that the phenotype is just another column

library(reshape2)
dt.m = melt(dt, id="Drug")
dt.m = as.data.table(dt.m)
head(dt.m)

    Drug variable value
1: DrugB     Area   212
2: DrugB     Area   293
3: DrugB     Area   218
4: DrugE     Area   557
5: DrugF     Area   120
6: DrugA     Area   228

This melted format is more friendly to some plots

ggplot(dt.m, aes(value, fill=Drug)) + geom_histogram() + facet_wrap(~ variable) + scale_x_log10()

plot of chunk unnamed-chunk-17

Now our operation is just averaging within those "Drug x variable" groups

averages = dt.m[, list(average=mean(value)), by=c("Drug", "variable")]
head(averages)

    Drug variable average
1: DrugB     Area   363.3
2: DrugE     Area   551.4
3: DrugF     Area   211.4
4: DrugA     Area   338.2
5: DrugD     Area   162.3
6: DrugC     Area   253.3

dcast is like melt in reverse: turn tall matrix into wide

averages.matrix = dcast(averages, variable ~ Drug)
head(averages.matrix)

      variable     DrugA     DrugB     DrugC     DrugD
1         Area 3.382e+02 3.633e+02 2.533e+02 1.623e+02
2       Length 3.891e+01 4.613e+01 3.065e+01 2.114e+01
3        Width 1.184e+01 1.133e+01 1.117e+01 1.027e+01
4 Eccentricity 9.395e-01 9.548e-01 8.995e-01 8.368e-01
5    Perimeter 8.738e+01 1.044e+02 7.200e+01 5.369e+01
6    TotalDye1 3.379e+05 1.254e+05 9.490e+04 2.913e+04
      DrugE     DrugF
1 5.514e+02 2.114e+02
2 5.135e+01 2.698e+01
3 1.540e+01 1.069e+01
4 9.363e-01 8.898e-01
5 1.291e+02 6.656e+01
6 7.978e+04 5.733e+04

Random Simulation

Simulations play an important role in many areas of natural and social science. Some are meant to gain a clearer understanding of the behavior of a system, some are designed to test models, and some statistical methods, such as permutation tests and bootstrapping, use random number generation to assess statistical significance.

Generate a Random Integer

You can generate a random number with the sample function. Give it the range or vector to be sampled from and the number of elements to sample:

sample(1:10, 1)

[1] 8

sample(1:10, 1)

[1] 7

x = c(1, 4, 9, 15, 21, 30)
sample(x, 1)

[1] 1

sample(x, 1)

[1] 21

Randomly Permute a Vector of Values

When you don't give a number of elements, it randomly permutes the vector given.

sample(1:10)

 [1]  8  6  3  7  2  1  4  5  9 10

sample(1:10)

 [1]  1  6  3  2  5  9 10  4  7  8

sample(x)

[1]  1 30  9 15 21  4

sample(x)

[1] 30  9  1 15 21  4

Generate a Random Number Between 0 and 1

You can generate

runif(1)

[1] 0.3073