Exploratory Data Analysis

Lucas Schiffer
February 25, 2016

Data Analysis for the Life Sciences

Topics

  • Introduction
  • Quantile Quantile Plots
  • Boxplots
  • Scatterplots And Correlation
  • Stratification
  • Bi-variate Normal Distribution
  • Plots To Avoid
  • Misunderstanding Correlation

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

  • Discover biases, systematic errors and unexpected variability in data
  • Graphical approach to detecting these issues
  • Represents a first step in data analysis and guides hypothesis testing
  • Opportunities for discovery in the outliers

Quantile Quantile Plots

  • Quantiles are cut points that divide a distribution into bins of equal probability
  • Division into 100 bins gives percentiles
  • Quantiles of a theoretical distribution are plotted against the quantiles of the observed data
  • Given a perfect fit, the points fall on the line \( y = x \)
  • Useful in determining data distribution (normal, t, etc.)
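
The construction of a Q-Q plot can be sketched in a few lines. The slides work in R; the following is a minimal Python sketch using only the standard library, and the simulated sample, the seed, and the helper name qq_points are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, median, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    observed = sorted(sample)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i + 0.5) / n avoid the undefined 0% and 100% quantiles.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

random.seed(1)
sample = [random.gauss(70, 3) for _ in range(500)]  # simulated, not real data
points = qq_points(sample)

# For normally distributed data the pairs stay close to the line y = x.
gaps = [abs(t - o) for t, o in points]
print(f"median distance from y = x: {median(gaps):.3f}")
```

Plotting the pairs as a scatterplot gives the Q-Q plot; a systematic bend away from the diagonal suggests the chosen theoretical distribution is a poor fit.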

Boxplots

  • Provide a graph that is easy to interpret where data are not normally distributed
  • An appropriate choice for exploring income data, whose distribution is highly skewed
  • Particularly informative in relation to outliers and range
  • Possible to compare multiple distributions side by side
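
The numbers behind a boxplot can be computed directly. This is a minimal Python sketch, assuming simulated log-normal "incomes" (not real data) and the common 1.5 × IQR convention for whiskers; the helper name boxplot_stats is hypothetical, and plotting software may compute quartiles slightly differently.

```python
import random
from statistics import mean, quantiles

def boxplot_stats(values):
    """Summaries a boxplot draws: quartiles, whisker ends, and outliers."""
    q1, med, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {"q1": q1, "median": med, "q3": q3,
            "whisker_low": min(inside), "whisker_high": max(inside),
            "outliers": outliers}

random.seed(2)
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(1000)]  # skewed
stats = boxplot_stats(incomes)

print(f"mean {mean(incomes):.0f} vs median {stats['median']:.0f}")
print(f"box: {stats['q1']:.0f} to {stats['q3']:.0f}")
print(f"points flagged as outliers: {len(stats['outliers'])}")
```

In skewed data the mean is pulled toward the long tail, while the median and quartiles shown by the box are not, which is why the boxplot remains readable here.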

Scatterplots And Correlation

  • Applies where data are paired (bivariate) and approximately normally distributed
  • A scatterplot and the correlation coefficient are then useful summaries
  • Together they give a graphical and a numeric estimate of the relationship
  • Quick and easy in R with plot() and cor()
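
For readers who want to see what a correlation function computes, here is a minimal Python sketch of the Pearson correlation coefficient written out by hand. The simulated pairs, the assumed true correlation of 0.5, and the helper name pearson are illustrative assumptions, not part of the original slides.

```python
import math
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(3)
rho = 0.5  # assumed true correlation used to simulate the pairs
x = [random.gauss(0, 1) for _ in range(2000)]
y = [rho * xi + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for xi in x]

r = pearson(x, y)
print(f"sample correlation: {r:.3f}")
```

The estimate should land near the value used in the simulation; the scatterplot of the same pairs shows the relationship that this single number summarizes.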

Stratification

  • Useful where a hypothesized difference exists between groups
  • Bivariate data can also be stratified into bins of x, instead of being drawn as a scatterplot
  • When stratified data are displayed as boxplots, trends become obvious
  • The average of y within a bin is a better prediction for that bin than the overall average
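
Stratifying bivariate data can be sketched as follows. This is a minimal Python example under stated assumptions: the pairs are simulated with an assumed correlation of 0.5, x is binned by rounding to the nearest integer, and bins with fewer than 30 points are dropped; none of these choices come from the original slides.

```python
import math
import random
from collections import defaultdict
from statistics import mean

random.seed(4)
rho = 0.5  # assumed true correlation used to simulate the pairs
pairs = []
for _ in range(5000):
    x = random.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    pairs.append((x, y))

# Stratify: group the y values by the rounded value of x.
strata = defaultdict(list)
for x, y in pairs:
    strata[round(x)].append(y)

# Keep only bins with enough points for a stable average.
bin_means = {b: mean(ys) for b, ys in sorted(strata.items()) if len(ys) >= 30}
for b, m in bin_means.items():
    print(f"x near {b:+d}: n = {len(strata[b]):4d}, mean y = {m:+.2f}")
```

Each list in strata is what one box in a stratified boxplot would display, and the bin means rising with x is the trend that the boxplots make visible.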

Bi-variate Normal Distribution

\( \Pr(X<a, Y<b) = \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left( -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right) } \, dy \, dx \)

  • The equation looks difficult, but the idea behind it is intuitive
  • Hold x fixed at one value: the y values of the matching (x, y) pairs are normally distributed
  • This is referred to as conditioning in statistics
  • Theoretical quantiles can be plotted and compared to the regression line
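
The conditioning step has a standard closed form, given here as a reference sketch in the symbols of the density above. For a bivariate normal pair, fixing \( X = x \) leaves \( Y \) normally distributed:

\( Y \mid X = x \sim N\left( \mu_y + \rho \, \sigma_y \frac{x-\mu_x}{\sigma_x}, \; \sigma_y^2 \left(1-\rho^2\right) \right) \)

The conditional mean is linear in \( x \), which is the regression line; in standardized units its slope is \( \rho \). The conditional variance does not depend on \( x \), so every stratum has the same spread.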

Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

  • Always avoid pie charts
  • Avoid doughnut charts too
  • Avoid pseudo 3D and most Excel defaults
  • Effective graphs use color judiciously

Misunderstanding Correlation

“Correlation does not imply causation!”

  • Even where hypothesis tests produce highly correlated results, those results must be reproducible
  • For example, gene expression data tend to be skewed and are not well approximated by a normal distribution
  • It is essential to select the correct distribution for data analysis, as given by theory
  • Exploratory data analysis is an important tool, but theoretical knowledge is essential
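
One way the correlation coefficient misleads on non-normal data can be sketched with a single extreme point. This minimal Python example is an illustration under assumptions: the data are simulated, the extreme pair (20, 20) is invented, and the rank-based (Spearman) correlation is brought in here as a comparison and does not appear in the original slides.

```python
import math
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

random.seed(5)
x = [random.gauss(0, 1) for _ in range(100)]
y = [random.gauss(0, 1) for _ in range(100)]  # generated independently of x
x.append(20.0)  # one extreme pair, far from the rest of the data
y.append(20.0)

r_pearson = pearson(x, y)
r_spearman = pearson(ranks(x), ranks(y))  # Pearson correlation of the ranks
print(f"Pearson: {r_pearson:.2f}, Spearman: {r_spearman:.2f}")
```

The first 100 pairs are unrelated, yet the single extreme pair drives the Pearson correlation up, while the rank-based version stays near zero. A scatterplot of the data reveals the problem immediately, which is the case for exploring before summarizing.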