# Exploratory Data Analysis

Lucas Schiffer
February 25, 2016

Data Analysis for the Life Sciences

### Topics

• Introduction
• Quantile Quantile Plots
• Boxplots
• Scatterplots And Correlation
• Stratification
• Bivariate Normal Distribution
• Plots To Avoid
• Misunderstanding Correlation

### Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

• Discover biases, systematic errors and unexpected variability in data
• Graphical approach to detecting these issues
• Represents a first step in data analysis and guides hypothesis testing
• Opportunities for discovery in the outliers

### Quantile Quantile Plots

• Quantiles divide a distribution into bins of equal probability
• Division into 100 bins gives percentiles
• Quantiles of a theoretical distribution are plotted against the quantiles of the observed data
• Given a perfect fit, the points fall on the identity line $$y=x$$
• Useful in checking which distribution (normal, t, etc.) the data follow
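
The comparison of theoretical and observed quantiles can be sketched numerically. This is a minimal illustration in Python's standard library (not the R used in the course); the sample is simulated, and the names `sample`, `qq_pairs` and `largest_gap` are assumptions made for the example.

```python
import random
from statistics import NormalDist, mean, stdev, quantiles

random.seed(1)
# Simulated data standing in for an observed sample (assumption: normal, mean 10, sd 2)
sample = [random.gauss(10, 2) for _ in range(2000)]

# 19 cut points that split each distribution into 20 equal-probability bins
observed = quantiles(sample, n=20)
theoretical = NormalDist(mean(sample), stdev(sample)).quantiles(n=20)

# Each pair is one point of the Q-Q plot: (theoretical quantile, observed quantile)
qq_pairs = list(zip(theoretical, observed))

# For a good fit the points sit close to the identity line y = x
largest_gap = max(abs(t - o) for t, o in qq_pairs)
print(f"largest distance from the identity line: {largest_gap:.3f}")
```

Plotting `qq_pairs` as a scatterplot gives the Q-Q plot itself; a skewed sample would bend away from the line in the tails.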

### Boxplots

• Provide a graph that is easy to interpret where data is not normally distributed
• An appropriate choice for exploring income data, whose distribution is highly skewed
• Particularly informative in relation to outliers and range
• Possible to compare multiple distributions side by side
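
The numbers behind a boxplot can be computed directly. The sketch below is in Python's standard library and assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the income values are simulated from a log-normal distribution, not real data, and `boxplot_stats` is a hypothetical helper.

```python
import random
from statistics import quantiles

def boxplot_stats(values):
    """Quartiles, whisker ends and outliers, using the 1.5 * IQR convention (an assumption)."""
    q1, median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme points that are not outliers
        "whisker_high": max(inside),
        "outliers": outliers,
    }

random.seed(2)
# Simulated, right-skewed "incomes" (illustrative only)
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(500)]
stats = boxplot_stats(incomes)
print(f"median {stats['median']:.0f}, box {stats['q1']:.0f} to {stats['q3']:.0f}")
print(f"{len(stats['outliers'])} points beyond the whiskers")
```

With skewed data like this, the outliers fall on the high side only, which the boxplot shows at a glance while a mean and standard deviation would hide it.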

### Scatterplots And Correlation

• Applies where data are bivariate (paired) and approximately normally distributed
• A scatterplot and the correlation coefficient together summarize such data
• Provides a graphical and numeric estimation of relationships
• Quick and easy with plot() and cor()
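
For readers who want to see what a correlation function computes, here is a minimal sketch of the Pearson correlation coefficient in Python's standard library (the course itself uses R's `cor()`). The data are simulated with a true correlation of 0.5, and `pearson` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(3)
true_rho = 0.5  # assumed value for the simulation
xs = [random.gauss(0, 1) for _ in range(1000)]
# y shares part of its variation with x, so the pairs are correlated
ys = [true_rho * x + sqrt(1 - true_rho ** 2) * random.gauss(0, 1) for x in xs]

r = pearson(xs, ys)
print(f"sample correlation: {r:.3f}")
```

The estimate should land near the simulated value of 0.5; the scatterplot of `xs` against `ys` shows the same relationship graphically.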

### Stratification

• Useful where a hypothesized difference exists between groups
• Can also stratify bivariate data into bins, instead of scatterplot
• When stratified data is displayed as a boxplot, trends become obvious
• The average within each bin estimates the expected value of y for that range of x
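
Stratifying bivariate data into bins can be sketched as follows. This is an illustrative Python example on simulated data; the bin width of 1.0 and the helper name `stratify` are assumptions, not part of the original material.

```python
import random
from collections import defaultdict
from statistics import mean

def stratify(xs, ys, width=1.0):
    """Group y values by the x bin (nearest multiple of `width`) they fall in."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        center = round(x / width) * width
        groups[center].append(y)
    return dict(sorted(groups.items()))

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [0.5 * x + random.gauss(0, 0.87) for x in xs]  # simulated linear trend plus noise

strata = stratify(xs, ys)
for center, group in strata.items():
    if len(group) >= 20:  # skip sparse bins, whose averages are unreliable
        print(f"x near {center:+.0f}: n = {len(group):4d}, mean y = {mean(group):+.2f}")
```

Each list in `strata` is what one box of a stratified boxplot would summarize; the bin means rising with x is the trend the boxplots make visible.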

### Bivariate Normal Distribution

$$\mathrm{Pr}(X<a, Y<b) = \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left( -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right) } \, dy \, dx$$

• Difficult equation but logical explanation
• Hold x fixed at a value: the y values of the (x,y) pairs at that x are normally distributed
• Referred to as conditioning in statistics
• Theoretical quantiles can be plotted and compared to the regression line
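
Conditioning can be checked by simulation. For a bivariate normal pair, the conditional mean of y given x lies on the regression line, $$\mathrm{E}(Y \mid X=x) = \mu_y + \rho \frac{\sigma_y}{\sigma_x}(x-\mu_x)$$, and the conditional standard deviation is $$\sigma_y\sqrt{1-\rho^2}$$. The Python sketch below uses made-up parameter values (all constants are assumptions for illustration) and approximates "holding x constant" by keeping pairs whose x falls in a narrow band.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Hypothetical parameters, chosen only for illustration
MU_X, MU_Y = 68.0, 69.0
SD_X, SD_Y = 3.0, 3.0
RHO = 0.5

def draw_pair(rng):
    """Draw one (x, y) pair from a bivariate normal with the parameters above."""
    zx = rng.gauss(0, 1)
    ze = rng.gauss(0, 1)
    x = MU_X + SD_X * zx
    y = MU_Y + SD_Y * (RHO * zx + sqrt(1 - RHO ** 2) * ze)
    return x, y

rng = random.Random(5)
pairs = [draw_pair(rng) for _ in range(20000)]

# Condition on x: keep only pairs whose x is within 0.25 of x0
x0 = 71.0
ys_given_x = [y for x, y in pairs if abs(x - x0) < 0.25]

predicted_mean = MU_Y + RHO * (SD_Y / SD_X) * (x0 - MU_X)  # point on the regression line
predicted_sd = SD_Y * sqrt(1 - RHO ** 2)

print(f"conditional mean: observed {mean(ys_given_x):.2f}, regression line {predicted_mean:.2f}")
print(f"conditional sd:   observed {stdev(ys_given_x):.2f}, theory {predicted_sd:.2f}")
```

Repeating this for several values of `x0` traces out the regression line, and a Q-Q plot of `ys_given_x` against normal quantiles would check the normality of each stratum.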

### Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

• Always avoid pie charts
• Avoid doughnut charts too
• Avoid pseudo 3D and most Excel defaults
• Effective graphs use color judiciously

### Misunderstanding Correlation

“Correlation does not imply causation!”

• Even where hypothesis tests produce highly correlated results, they must be reproducible
• For example, gene expression data tend to be skewed and are not well approximated by a normal distribution
• It is essential to select the correct distribution for data analysis, as given by theory
• Exploratory data analysis is an important tool, but theoretical knowledge is essential
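
One way correlation misleads on non-normal data is its sensitivity to extreme values. The Python sketch below is a simulated illustration, not gene expression data: two unrelated variables appear strongly correlated once a single extreme pair is added. The outlier value of 25 and the helper `pearson` are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(6)
# Two independent variables: no real relationship between them
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [rng.gauss(0, 1) for _ in range(100)]

r_clean = pearson(xs, ys)
# Add one extreme pair, as a skewed distribution might produce
r_outlier = pearson(xs + [25.0], ys + [25.0])

print(f"correlation without the extreme point: {r_clean:+.2f}")
print(f"correlation with the extreme point:    {r_outlier:+.2f}")
```

A scatterplot of the same data shows immediately that one point drives the result, which is why the plot should be inspected before the number is trusted.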