1 Day 3 outline

Book Chapter 2:

Book chapter 7:

2 Exploratory data analysis

2.1 Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

  • Discover biases, systematic errors and unexpected variability in data
  • Graphical approach to detecting these issues
  • Represents a first step in data analysis and guides hypothesis testing
  • Opportunities for discovery in the outliers

2.2 Quantile Quantile Plots

  • Quantiles divide a distribution into equally sized bins
  • Division into 100 bins gives percentiles
  • Quantiles of a theoretical distribution are plotted against an experimental distribution
    • alternatively, quantiles of two experimental distributions
  • Given a perfect fit, \(x=y\)
  • Useful in determining data distribution (normal, t, etc.)

2.3 Quantile Quantile Plots

suppressPackageStartupMessages(library(UsingR))
suppressPackageStartupMessages(library(rafalib))
# height qq plot
x <- father.son$fheight
ps <- (seq(0, 99) + 0.5) / 100
qs <- quantile(x, ps)
normalqs <- qnorm(ps, mean(x), popsd(x))
par(mfrow = c(1, 3))
plot(normalqs,
     qs,
     xlab = "Normal Percentiles",
     ylab = "Height Percentiles",
     main = "Height Q-Q Plot")
abline(0, 1)
# t-distributed for 12 df
x <- rt(1000, 12)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=12) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)
# t-distributed for 3 df
x <- rt(1000, 3)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=3) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)

2.4 Boxplots

  • Provide a graph that is easy to interpret where data may not be normally distributed
  • Qualitative + quantitative
  • Particularly informative in relation to outliers and range
  • Possible to compare multiple distributions side by side

2.5 Boxplots

par(mfrow = c(1, 3))
hist(exec.pay, main = "CEO Compensation")
qqnorm(exec.pay, main = "CEO Compensation")
boxplot(exec.pay,
        ylab = "10,000s of dollars",
        ylim = c(0, 400),
        main = "CEO Compensation")
Three different views of a continuous variable

2.6 Scatterplots And Correlation

  • For two continuous variables, scatter plot and calculation of correlation is useful
  • Provides a graphical and numeric estimation of relationships
  • Quick and easy with plot() and cor()

2.7 Scatterplots And Correlation

par(mfrow = c(1, 3))
plot(
    father.son$fheight,
    father.son$sheight,
    xlab = "Father's height in inches",
    ylab = "Son's height in inches",
    main = paste("correlation =", signif(
        cor(father.son$fheight, father.son$sheight), 2
    ))
)
plot(
    cars$speed,
    cars$dist,
    xlab = "Speed",
    ylab = "Stopping Distance",
    main = paste("correlation =", signif(cor(
        cars$speed, cars$dist
    ), 2))
)
plot(
    faithful$eruptions,
    faithful$waiting,
    xlab = "Eruption Duration",
    ylab = "Waiting Time",
    main = paste("correlation =", signif(
        cor(faithful$eruptions, faithful$waiting), 2
    ))
)

2.8 Exploratory data analysis in high dimensions

#install from Github
BiocManager::install("genomicsclass/GSE5859Subset")  
BiocManager::install("genomicsclass/tissuesGeneExpression")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix"     "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

2.9 Volcano plots

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
##    user  system elapsed 
##   0.006   0.000   0.007
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

2.10 Volcano plots

par(mar = c(4, 4, 0, 0))
plot(results$dm,
     -log10(results$p.value),
     xlab = "Effect size (difference in group means)",
     ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
       lty = 1,
       col = "red",
       legend = "Bonferroni = 0.05")

2.11 Volcano plots (summary)

  • Many small p-values with small effect size indicate low within-group variability
  • Inspect for asymmetry
  • Can color points by significance threshold

2.12 P-value histograms

  • If all null hypotheses are true, expect a flat histogram of p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value
par(mfrow = c(1, 2))
hist(pvals, ylim = c(0, 1400))
hist(nullpvals, ylim = c(0, 1400))

2.13 P-value histograms

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

2.14 P-value histograms (summary)

  • Give a quick look at how many significant p-values there may be
  • When using permuted labels, can expose non-independence among the samples
    • can be due to batch effects or family structure
  • Most common approaches for correcting batch effects are:
    • including batch in your model formula
    • ComBat: corrects for known batch effects by linear model), and
    • sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
    • correction using control (housekeeping) genes

All are available from the sva Bioconductor package

2.15 MA plot

  • just a scatterplot rotated 45\(^o\)
rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

2.16 MA plot (summary)

  • useful for quality control of high-dimensional data
  • plot all data values for a sample against another sample or a median “pseudosample”
  • affyPLM::MAplots better MA plots
    • adds a smoothing line to highlight departures from horizontal line
    • plots a “cloud” rather than many data points

2.17 Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

2.18 Heatmaps

  • Clustering becomes slow and memory-intensivefor thousands of rows
    • probably too detailed for thousands of rows
  • can show co-expressed genes, groups of samples

2.19 Colors

  • Types of color pallettes:
    • sequential: shows a gradient
    • diverging: goes in two directions from a center
    • qualitative: for categorical variables
  • Keep color blindness in mind (10% of all men)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

rafalib::mypar(1, 1)
RColorBrewer::display.brewer.all(n = 7)

2.20 Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

  • Always avoid pie charts
  • Avoid doughnut charts too
  • Avoid pseudo 3D and most Excel defaults
  • Effective graphs use color judiciously

3 Distances in high-dimensional data analysis

3.1 The importance of distance

  • High-dimensional data are complex and impossible to visualize in raw form
  • Thousands of dimensions, we can only visualize 2-3
  • Distances can simplify thousands of dimensions
animals
  • Distances can help organize samples and genomic features
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
set.seed(1)
ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

  • Any clustering or classification of samples and/or genes involves combining or identifying objects that are close or similar.
  • Distances or similarities are mathematical representations of what we mean by close or similar.
  • The choice of distance is important and requires thought.
    • choice is subject-matter specific

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

3.2 Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
  • A distance is only required to satisfy 1-3.
  • A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
  • A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

3.3 Euclidian distance (metric)

  • Remember grade school:
rafalib::mypar()
plot(
    c(0, 1, 1),
    c(0, 0, 1),
    pch = 16,
    cex = 2,
    xaxt = "n",
    yaxt = "n",
    xlab = "",
    ylab = "",
    bty = "n",
    xlim = c(-0.25, 1.25),
    ylim = c(-0.25, 1.25)
)
lines(c(0, 1, 1, 0), c(0, 0, 1, 0))
text(0, .2, expression(paste('(A'[x] * ',A'[y] * ')')), cex = 1.5)
text(1, 1.2, expression(paste('(B'[x] * ',B'[y] * ')')), cex = 1.5)
text(-0.1, 0, "A", cex = 2)
text(1.1, 1, "B", cex = 2)
Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
  • Side note: also referred to as \(L_2\) norm

3.4 Euclidian distance in high dimensions

##BiocManager::install("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

3.4.1 Notes:

  • Points are no longer on the Cartesian plane,
  • instead they are in higher dimensions. For example:
    • sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
    • feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

3.5 Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3.6 3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919

3.7 3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

3.8 The dist() function

Excerpt from ?dist:

dist(
    x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2
)
  • method: the distance measure to be used.
    • This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
  • dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

3.9 Note on standardization

  • In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

  • This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
    • GC bias, PCR amplification efficiency, …
  • Euclidian distance and \(1-r\) (Pearson cor) are related:
    • \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)

4 Dimension reduction and PCA

4.1 Motivation for dimension reduction

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

4.1.1 Visualizing twin pairs data

z1 = (y[1,] + y[2,]) / 2 #the sum
z2 = (y[1,] - y[2,])   #the difference

z = rbind(z1, z2) #matrix now same dimensions as y

thelim <- c(-3, 3)
rafalib::mypar(1, 2)

plot(
    y[1,],
    y[2,],
    xlab = "Twin 1 (standardized height)",
    ylab = "Twin 2 (standardized height)",
    xlim = thelim,
    ylim = thelim,
    main = "Original twin heights"
)
points(y[1, 1:2], y[2, 1:2], col = 2, pch = 16)

plot(
    z[1,],
    z[2,],
    xlim = thelim,
    ylim = thelim,
    xlab = "Average height",
    ylab = "Difference in height",
    main = "Manual PCA-like projection"
)
points(z[1, 1:2] , z[2, 1:2], col = 2, pch = 16)

4.1.2 Not much distance is lost in the second dimension

rafalib::mypar()
d = dist(t(y))
d3 = dist(z[1, ]) * sqrt(2) ##distance computed using just first dimension mypar(1,1)
plot(as.numeric(d),
     as.numeric(d3),
     xlab = "Pairwise distances in 2 dimensions",
     ylab = "Pairwise distances in 1 dimension")
abline(0, 1, col = "red")
  • Not much loss of height differences when just using average heights of twin pairs.
    • because twin heights are highly correlated

4.2 Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD
  • note: the above formulation is for \(m > n\)

  • \(\mathbf{Y}\): the m rows x n cols matrix of measurements
  • \(\mathbf{U}\): m x n matrix relating original scores to PCA scores (loadings)
  • \(\mathbf{D}\): n x n diagonal matrix (eigenvalues)
  • \(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or PCA scores)
    • orthogonal = unit length and “perpendicular” in 3-D

4.3 SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <-
                t(apply(e, 1, function(x)
                    x - mean(x))))
##    user  system elapsed 
##   1.560   0.080   1.678
system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))
##    user  system elapsed 
##   0.137   0.001   0.139
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

4.4 Components of SVD results

dim(s$u)     # loadings
## [1] 22215   189
length(s$d)  # eigenvalues
## [1] 189
dim(s$v)     # d %*% vT = scores
## [1] 189 189
SVD

4.5 PCA of gene expression dataset

rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

4.6 PCA interpretation: loadings

SVD
  • \(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
    • think of principal component axes as a weighted combination of original axes

4.7 Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "Scores of PC1") #or, predict(p)
abline(h = c(-0.03, 0.04), col = "red")

Genes with high PC1 loadings:

e.pc1genes <-
    e.standardize.fast[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
    e.pc1genes,
    scale = "none",
    show_rownames = TRUE,
    show_colnames = FALSE
)

4.8 PCA interpretation: eigenvalues

  • \(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.
rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

Alternatively as cumulative % variance explained (using cumsum() function):

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

4.9 PCA interpretation: scores

SVD
  • \(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
  • In some implementations (like prcomp()), scores \(\mathbf{D V^T}\)

4.10 PCA interpretation: scores

rafalib::mypar()
plot(
    p$x[, 1:2],
    xlab = "PC1",
    ylab = "PC2",
    main = "plot of p$x[, 1:2]",
    col = factor(tissue),
    pch = as.integer(factor(tissue))
)
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    pch = 1:length(unique(tissue)),
    bty = 'n'
)

4.11 Multi-dimensional Scaling (MDS)

  • also referred to as Principal Coordinates Analysis (PCoA)
  • a reduced SVD, performed on a distance matrix
  • identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
rafalib::mypar()
plot(mds, col = factor(tissue), pch = as.integer(factor(tissue)))
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    bty = 'n',
    pch = 1:length(unique(tissue))
)

4.12 Summary: distances and dimension reduction

  • Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
  • PCA and MDS are useful for dimension reduction when you have correlated variables
  • Variables are always centered.
  • Variables are also scaled unless you know they have the same scale in the population
  • PCA projection can be applied to new datasets if you know the matrix calculations
  • PCA is subject to over-fitting, screeplot can be tested by cross-validation

5 Batch Effects

5.1 Batch Effects

  • pervasive in genomics (e.g. Leek et al. )
  • affect DNA and RNA sequencing, proteomics, imaging, microarray…
  • have caused high-profile problems and retractions
    • you can’t get rid of them
    • but you can make sure they are not confounded with your experiment

5.2 Batch Effects - an example

  • Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
  • Title: Common genetic variants account for differences in gene expression among ethnic groups.
    • “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

5.3 Date of processing as a proxy for batch

  • Sample metadata included date of processing:
head(table(sampleInfo$date))
## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2
year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

5.4 Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

5.5 Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by batch")

5.6 The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)
##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

5.7 Batch Effects - summary

  • tend to affect many or all measurements by a little bit
  • during experimental design:
    • keep track of anything that might cause a batch effect for post-hoc analysis
    • include control samples in each batch
  • batches can be corrected for if randomized in study design
    • if confounded with treatment or outcome of interest, nothing can help you

6 Lab

Exploratory Data Analysis: http://genomicsclass.github.io/book/pages/exploratory_data_analysis_exercises.html

hist
  1. Given the above histogram, how many people are between the ages of 35 and 45?

  2. The InsectSprays data set is included in R. The dataset reports the counts of insects in agricultural experimental units treated with different insecticides. Make a boxplot and determine which insecticide appears to be most effective.

  3. Download and load this dataset into R. Use exploratory data analysis tools to determine which two columns are different from the rest. Which is the column that is positively skewed?

  4. Which is the column that is negatively skewed?

  5. Let’s consider a random sample of finishers from the New York City Marathon in 2002. This dataset can be found in the UsingR package. Load the library and then load the nym.2002 dataset.

suppressPackageStartupMessages(library(dplyr))
data(nym.2002, package="UsingR")

Use boxplots and histograms to compare the finishing times of males and females. Which of the following best describes the difference?

A) Males and females have the same distribution.
B) Most males are faster than most women.
C) Male and females have similar right skewed distributions with the former, 20 minutes shifted to the left.
D) Both distribution are normally distributed with a difference in mean of about 30 minutes.
  1. Use dplyr to create two new data frames: males and females, with the data for each gender. For males, what is the Pearson correlation between age and time to finish?

  2. For females, what is the Pearson correlation between age and time to finish?

  3. If we interpret these correlations without visualizing the data, we would conclude that the older we get, the slower we run marathons, regardless of gender. Look at scatterplots and boxplots of times stratified by age groups (20-25, 25-30, etc..). After examining the data, what is a more reasonable conclusion?

    1. Finish times are constant up until about our 40s, then we get slower.
    2. On average, finish times go up by about 7 minutes every five years.
    3. The optimal age to run a marathon is 20-25.
    4. Coding errors never happen: a five year old boy completed the 2012 NY city marathon.
  4. When is it appropriate to use pie charts or donut charts?
    1. When you are hungry.
    2. To compare percentages.
    3. To compare values that add up to 100%.
    4. Never.
  5. The use of pseudo-3D plots in the literature mostly adds:
    1. Pizzazz.
    2. The ability to see three dimensional data.
    3. Ability to discover.
    4. Confusion.