Lucas Schiffer

February 25, 2016

Data Analysis for the Life Sciences

- Introduction
- Quantile Quantile Plots
- Boxplots
- Scatterplots And Correlation
- Stratification
- Bi-variate Normal Distribution
- Plots To Avoid
- Misunderstanding Correlation

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

- Exploratory data analysis helps discover biases, systematic errors and unexpected variability in data
- It takes a graphical approach to detecting these issues
- It is a first step in data analysis and guides hypothesis testing
- Outliers offer opportunities for discovery

- Quantiles are cut points that divide a distribution into bins of equal probability
- Division into 100 bins gives percentiles
- In a quantile-quantile (QQ) plot, quantiles of a theoretical distribution are plotted against the quantiles of the observed data
- Given a perfect fit, the points fall on the identity line \( y = x \)
- Useful for checking whether data follow a particular distribution (normal, t, etc.)
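
A minimal sketch of the QQ plot described above, in base R. The sample is simulated here purely for illustration (it is not data from the slides), and the variable names are assumptions.

```r
set.seed(1)                          # make the simulated sample reproducible
x <- rnorm(500, mean = 10, sd = 2)   # simulated data, for illustration only

p <- seq(0.01, 0.99, by = 0.01)      # the 1st through 99th percentiles
observed    <- quantile(x, p)        # quantiles of the observed sample
theoretical <- qnorm(p, mean = mean(x), sd = sd(x))  # normal quantiles with matching mean and SD

plot(theoretical, observed)          # points close to the line suggest a good fit
abline(0, 1)                         # identity line y = x

# Built-in shortcut: compares the sample against the standard normal
qqnorm(x)
qqline(x)
```

Swapping qnorm() for another quantile function, such as qt(), checks the same sample against a different theoretical distribution.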

- Boxplots provide a summary that is easy to interpret when data are not normally distributed
- An appropriate choice for exploring income data, whose distribution is highly skewed
- Particularly informative about outliers and range
- Make it possible to compare multiple distributions side by side
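
As a rough illustration of the points above, the sketch below draws side-by-side boxplots of two right-skewed samples. The "incomes" are hypothetical values simulated from a log-normal distribution, not real income data, and the group names A and B are placeholders.

```r
set.seed(2)
# Hypothetical right-skewed "incomes" simulated from log-normal distributions
income_a <- rlnorm(200, meanlog = 10.5, sdlog = 0.6)
income_b <- rlnorm(200, meanlog = 10.8, sdlog = 0.6)

# Side-by-side boxplots: median, quartiles, whiskers and outlying points
boxplot(list(A = income_a, B = income_b), ylab = "Income")

# The five numbers behind one box: lower whisker, lower hinge, median,
# upper hinge, upper whisker
stats_a <- boxplot.stats(income_a)$stats
n_out_a <- length(boxplot.stats(income_a)$out)  # points drawn beyond the whiskers
```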

- When data are bivariate and approximately normally distributed, a scatterplot and the correlation coefficient are useful
- Together they provide a graphical and a numeric summary of the relationship
- Quick and easy in R with plot() and cor()
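
A short sketch of the plot() and cor() workflow mentioned above. The paired values are simulated for illustration; the means, standard deviations and slope are arbitrary assumptions.

```r
set.seed(3)
n <- 1000
x <- rnorm(n, mean = 68, sd = 3)               # hypothetical paired measurements
y <- 0.5 * x + rnorm(n, mean = 34, sd = 2.6)

plot(x, y)                # graphical view of the relationship
r <- cor(x, y)            # Pearson correlation (the default method)

# The same quantity by hand: sum of products of standardized values over n - 1
r_manual <- sum(scale(x) * scale(y)) / (n - 1)
```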

- Stratification is useful where a hypothesized difference exists between groups
- Bivariate data can also be stratified into bins of x, instead of drawing a scatterplot
- When the stratified data are displayed as boxplots, trends become obvious
- The average of y within each bin is a better predictor of y than the overall average
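
One way to sketch the stratification described above is to round x to the nearest unit and treat each rounded value as a bin. The data are simulated and the rounding rule is an assumption made for illustration; other bin widths would work equally well.

```r
set.seed(4)
n <- 1000
x <- rnorm(n, mean = 68, sd = 3)             # hypothetical predictor
y <- 34 + 0.5 * x + rnorm(n, sd = 2.6)       # hypothetical response

bins   <- round(x)               # stratify by x rounded to the nearest unit
strata <- split(y, bins)         # y values grouped by bin

boxplot(strata, xlab = "x (rounded)", ylab = "y")   # one box per stratum

bin_means <- sapply(strata, mean)     # average of y within each bin
bin_sizes <- sapply(strata, length)   # number of observations per bin
```

Bins at the extremes of x hold few observations, so their averages are noisier than those near the center.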

\( \Pr(X < a, Y < b) = \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left( \frac{-1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right) } \, dy \, dx \)

- The equation looks difficult, but it has an intuitive interpretation
- Hold x fixed: if the (x, y) pairs are bivariate normal, the y values at that x are themselves normally distributed
- This is referred to as conditioning in statistics
- Theoretical quantiles can be plotted and compared to the regression line
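
For reference, the standard result behind the conditioning argument, written with the same symbols as the density above: given \( X = x \), \( Y \) is normally distributed with

\( \mathrm{E}[Y \mid X = x] = \mu_y + \rho \frac{\sigma_y}{\sigma_x} (x - \mu_x), \qquad \mathrm{Var}(Y \mid X = x) = \sigma_y^2 (1 - \rho^2) \)

The conditional mean is a straight line in \( x \) with slope \( \rho \sigma_y / \sigma_x \), which is the regression line.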

“Pie charts are a very bad way of displaying information.” - R Help

- Always avoid pie charts
- Avoid doughnut charts too
- Avoid pseudo-3D effects and most Excel defaults
- Effective graphs use color judiciously

“Correlation does not imply causation!”

- Even where hypothesis tests produce highly correlated results, those results must be reproducible
- For example, gene expression data tend to be skewed and are not well approximated by a normal distribution
- It is essential to select the correct distribution for data analysis, as given by theory
- Exploratory data analysis is an important tool, but theoretical knowledge is essential
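
A small simulated sketch of why the distribution matters when interpreting correlation: one extreme pair added to two otherwise independent samples can dominate the Pearson correlation. The rank-based Spearman correlation is swapped in here only as a comparison (it is not named in the slides), and all values are made up for illustration.

```r
set.seed(5)
a <- rnorm(100)            # two independent simulated samples
b <- rnorm(100)
a[1] <- 25; b[1] <- 26     # replace one pair with a single extreme point

pearson  <- cor(a, b)                       # dominated by the extreme pair
spearman <- cor(a, b, method = "spearman")  # rank-based, far less affected

plot(a, b)                 # the plot makes the lone extreme point obvious
```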