Exploratory Data Analysis

Robust Summaries

The normal approximation is often useful when analyzing life sciences data. However, due to the complexity of the measurement devices, it is also common to mistakenly observe data points generated by an undesired process. For example, a defect on a scanner can produce a handful of very high intensities or a PCR bias can lead to a fragment appearing much more often than others. We therefore may have situations that are approximated by, for example, 99 data points from a standard normal distribution and one large number.

set.seed(1)
x=c(rnorm(100,0,1)) ##real distribution
x[23] <- 100 ##mistake made in 23th measurement
boxplot(x)

Normally distributed data with one point that is very large due to a mistake.

In statistics we refer to these type of points as outliers. A small number of outliers can throw off an entire analysis. For example, notice how the following one point results in the sample mean and sample variance being very far from the 0 and 1 respectively.

cat("The average is",mean(x),"and the SD is",sd(x))

## The average is 1.108142 and the SD is 10.02938

The median

The median, defined as the point having half the data larger and half the data smaller, is a summary statistic that is robust to outliers. Note how much closer the median is to 0, the center of our actual distribution:

median(x)

## [1] 0.1684483

The median absolute deviation

The median absolute deviation (MAD) is a robust summary for the standard deviation. It is defined by computing the differences between each point and the median, and then taking the median of their absolute values:

\[ 1.4826 \mbox{median} \lvert X_i - \mbox{median}(X_i) \rvert \]

The number \(1.4826\) is a scaling factor such that the MAD is an unbiased estimate of the standard deviation. Notice how much closer we are to 1 with the MAD:

mad(x)

## [1] 0.8857141

Spearman correlation

Earlier we saw that the correlation is also sensitive to outliers. Here we construct an independent list of numbers, but for which a similar mistake was made for the same entry:

set.seed(1)
x=rnorm(100,0,1) ##real distribution
x[23] <- 100 ##mistake made in 23th measurement
y=rnorm(100,0,1) ##real distribution
y[23] <- 84 ##similar mistake made in 23th measurement
library(rafalib)
mypar()
plot(x,y,main=paste0("correlation=",round(cor(x,y),2)),
     pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
abline(0,1)

Scatterplot showing bivariate normal data with one signal outlier resulting in large values in both dimensions.

The Spearman correlation follows the general idea of median and MAD, that of using quantiles. The idea is simple: we convert each dataset to ranks and then compute correlation:

mypar(1,2)
plot(x,y,main=paste0("correlation=",round(cor(x,y),3)),pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
plot(rank(x),rank(y), main=paste0("correlation=",round(cor(x,y,method="spearman"),3)),
     pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
abline(0,1)

Scatterplot of original data (left) and ranks (right). Spearman correlation reduces the influence of outliers by considering the ranks instead of original data.

So if these statistics are robust to outliers, why would we ever use the non-robust version? In general, if we know there are outliers, then median and MAD are recommended over the mean and standard deviation counterparts. However, there are examples in which robust statistics are less powerful than the non-robust versions.

We also note that there is a large statistical literature on robust statistics that go far beyond the median and the MAD.

Symmetry of log ratios

Ratios are not symmetric. To see this, we will start by simulating the ratio of two positive random numbers, which will represent the expression of genes in two samples:

x <- 2^(rnorm(100))
y <- 2^(rnorm(100)) 
ratios <- x / y

Reporting ratios or fold changes are common in the life sciences. Suppose you are studying ratio data showing, say, gene expression before and after treatment. You are given ratio data so values larger than 1 imply gene expression was higher after the treatment. If the treatment has no effect, we should see as many values below 1 as above 1. A histogram seems to suggest that the treatment does in fact have an effect:

mypar(1,2)
hist(ratios)

logratios <- log2(ratios)
hist(logratios)

Histogram of original (left) and log (right) ratios.

The problem here is that ratios are not symmetrical around 1. For example, 1/32 is much closer to 1 than 32/1. Using the log takes care of this problem. The log of ratios are of course symmetric around 0 because:

\[\log(x/y) = \log(x)-\log(y) = -(\log(y)-\log(x)) = -log(y/x)\]

As demonstrated by these simple plots:

Histogram of original (left) and log (right) powers of 2 seen as ratios.

In the life sciences, the log transformation is also commonly used because (multiplicative) fold changes are the most widely used quantification of interest. Note that a fold change of 100 can be a ratio of 100/1 or 1/100. However, 1/100 is much closer to 1 (no fold change) than 100: ratios are not symmetric about 1.

Wilcoxon Rank Sum Test

We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we perform a t-test on data for which the null is true. However, we change one sum observation by mistake in each sample and the values incorrectly entered are different. Here we see that the t-test results in a small p-value, while the Wilcoxon test does not:

set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)

Create outliers:

x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)

## t-test pval: 0.04439948

cat("Wilcox test pval:",wilcox.test(x,y)$p.value)

## Wilcox test pval: 0.1310212

The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.

library(rafalib)
mypar(1,2)

stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)

xrank<-rank(c(x,y))[seq(along=x)]
yrank<-rank(c(x,y))[-seq(along=x)]

stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)

ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)

Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values

W <-sum(ws)

W is the sum of the ranks for the first group relative to the second group. We can compute an exact p-value for \(W\) based on combinatorics. We can also use the CLT since statistical theory tells us that this W is approximated by the normal distribution. We can construct a z-score as follows:

n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)

## [1] 1.523124

Here the Z is not large enough to give us a p-value less than 0.05. These are part of the calculations performed by the R function wilcox.test.

Homework 3

The InsectSprays data set is included in R. The dataset reports the counts of insects in agricultural experimental units treated with different insecticides. Make a boxplot and determine which insecticide appears to be most effective.
Consider a random sample of finishers from the New York City Marathon in 2002. This dataset can be found in the UsingR package. Load the library and then load the nym.2002 dataset. Use boxplots and histograms to compare the finishing times of males and females.
Which of the following best describes the difference?

Males and females have the same distribution.
Most males are faster than most women.
Male and females have similar right skewed distributions with the former, 20 minutes shifted to the left.
Both distribution are normally distributed with a difference in mean of about 30 minutes to the left.

Use dplyr to create two new data frames: males and females, with the data for each gender. For males, what is the Pearson correlation between age and time to finish? For females, what is the Pearson correlation between age and time to finish? Display the correlations in a side-by-side scatter plot seperately for males and females.
If we interpret these correlations without visualizing the data, we would conclude that the older we get, the slower we run marathons, regardless of gender. Look at scatterplots and boxplots of times stratified by age groups (20-25, 25-30, etc..). After examining the data, what is a more reasonable conclusion?

Finish times are constant up until about our 40s, then we get slower.
On average, finish times go up by about 7 minutes every five years.
The optimal age to run a marathon is 20-25.
Coding errors never happen: a five year old boy completed the 2012 NY City Marathon.

Use the ChickWeight data set included in R for the following questions. This data set contains weight of chicks in grams as they grow from day 0 to day 21. This dataset also splits up the chicks by different protein diets, which are coded from 1 to 4.

Take a look at the weights of all observations over time and color the points to represent the Diet. Plot using ggplot2. Notice that the rows in this data set represent time points rather than individuals. To facilitate the comparison of weights at different time points and across the different chicks, you may want to reshape the data so that each row is a chick. You can use the functions built into dplyr package or use reshape in base R.
How much does the average of chick weigh at day 4 and how much does the average change if we exchange one of the chick’s weights with an outlier measurement of 3000 grams?
How much does the median change with the outlier chick from the actual weights?
How much does the standard deviation change with the outlier chick from the actual weights?
Calculate the Pearson correlation of the weight of chicks at 4 days and at 21 days? Also, calculate the Spearman correlation.
Save the weights of the chicks on day 4 from diet 1 as a vector x. Save the weights of the chicks on day 4 from diet 4 as a vector y. Perform a t-test comparing x and y. Which t-test did you choose and why? Then perform a Wilcoxon test of x and y.
Load the react data set in the ISwR library. This is a single vector. Do the values of this data set look reasonably normally distributed? What is the mean? Does the mean differ significantly from zero according to one sided t-test?
In the dat set vitcap use a t-test to compare the vital capacity for the two groups.
Perform the analyses of the react and vitcap data using nonparametric tecniques.

Exploratory Data Analysis - Lecture 2

Rafael Irizarry and Michael Love modified by Mete Civelek