Exploratory Data Analysis

STOR 390
1/31/17

Load data

library(tidyverse)
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))

EDA is an iterative cycle

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

Ask lots of questions

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.

Things to look for

  • modes/clusters
  • outliers
  • the unexpected
  • any kind of pattern

Two categories of questions

  • variation
  • covariation

Definitions

  • variable (column)
  • observation (row)
  • tabular data (matrix)
  • value (entry)
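As a small illustration of these terms, the sketch below uses R's built-in faithful data frame. That choice is an assumption made for convenience; it is not the salary data loaded above, and the object names are hypothetical.

```r
# Built-in data set: Old Faithful eruptions (ships with base R)
df <- faithful

n_obs  <- nrow(df)   # number of observations (rows)
n_vars <- ncol(df)   # number of variables (columns)

one_variable    <- df$waiting      # a variable: one whole column
one_observation <- df[1, ]         # an observation: one whole row
one_value       <- df$waiting[1]   # a value: a single entry

str(df)
```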

Summary statistics: location vs. range

median(data$totalsal)
[1] 59342
max(data$totalsal)
[1] 819069
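
Location statistics describe where a typical value sits, while range and spread statistics describe how far values stray from it. A minimal sketch on simulated, right-skewed data (the sal vector is made up for illustration and is not the UNC salary column):

```r
set.seed(1)
# Hypothetical right-skewed "salaries": log-normal draws
sal <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

# Location: where is the center?
loc <- c(mean = mean(sal), median = median(sal))

# Range / spread: how far do the values extend?
spread <- c(min = min(sal), max = max(sal), IQR = IQR(sal), sd = sd(sal))

round(loc)
round(spread)
```

For right-skewed data the mean is usually pulled above the median, which is one reason to look at more than one summary.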

Plot one variable

# plot each data point
ggplot(data=data) +
    geom_point(aes(x=totalsal, y=0)) +
    ylim(-10, 10)

Jitter plots

# same plot as above but with random y values
ggplot(data=data) +
    geom_jitter(aes(x=totalsal, y=0)) +
    ylim(-10, 10)

Boxplots

ggplot(data=data) +
    geom_boxplot(aes(x=0, y=totalsal))

Histograms

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 30)

Too many bins

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 10000)

Too few bins

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 2)

Just right (maybe?)

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 100)

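No bin count is right for every data set, but rules of thumb give a starting point. The sketch below uses base R's nclass.Sturges(), nclass.scott() and nclass.FD() on simulated skewed data (an assumption; substitute data$totalsal to try it on the salaries):

```r
set.seed(1)
x <- rlnorm(5000, meanlog = 11, sdlog = 0.5)  # hypothetical stand-in for salaries

# Three classic rule-of-thumb bin counts from grDevices
bins <- c(sturges = nclass.Sturges(x),
          scott   = nclass.scott(x),
          fd      = nclass.FD(x))   # Freedman-Diaconis
bins
```

Treat these as starting values and still try several bin counts, as in the histograms above.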

Multimodal

Gaussian mixture with two modes

set.seed(342)
mix <- tibble(val=c(rnorm(n=4000, mean=0, sd=1), 
                  rnorm(n=4000, mean=2.5, sd=1)))

Wide binwidth misses the modes

# wide binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 10)

Moderate binwidth: two modes are faintly visible

# moderate binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 30)

Small binwidth: false modes appear

# small binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 2000)

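One rough way to see the effect of binwidth without looking at a plot is to count the local peaks in the bar heights. The sketch below rebuilds the same mixture and uses a hypothetical helper, count_peaks(). Its equal-width breaks are an assumption and will not match ggplot2's bin placement exactly.

```r
set.seed(342)
val <- c(rnorm(4000, mean = 0, sd = 1), rnorm(4000, mean = 2.5, sd = 1))

# Count strict local maxima of the bar heights (a rise followed directly by a fall)
count_peaks <- function(x, n_bins) {
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  sum(diff(sign(diff(counts))) == -2)
}

peaks <- sapply(c(10, 30, 2000), function(b) count_peaks(val, b))
names(peaks) <- c("bins=10", "bins=30", "bins=2000")
peaks
```

With very many bins most bars hold only a handful of points, so sampling noise produces far more peaks than the two modes the data were generated with.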

Kernel density estimate

# geom_density with its default values
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=1)

KDE warning

  • a KDE is a continuous version of a histogram
  • it has one (or more) parameters that need to be set

Warning: always be wary of “smart defaults”. No one default value will work well in every (or even a majority of) situations.

KDE with fat window

# geom_density with a fat window
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=10)

KDE with skinny window

# geom_density with a skinny window
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=.1)

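The adjust argument is a multiplier on the default bandwidth (window width). The sketch below shows this with base R's density() on simulated data. Both the data and the assumption that geom_density() behaves the same way with its default settings should be checked against the ggplot2 documentation.

```r
set.seed(1)
x <- rlnorm(2000, meanlog = 11, sdlog = 0.5)  # hypothetical stand-in for salaries

bw_default <- bw.nrd0(x)                # default bandwidth rule used by density()
d_fat      <- density(x, adjust = 10)   # like the "fat window" plot
d_skinny   <- density(x, adjust = 0.1)  # like the "skinny window" plot

# Bandwidth actually used in each case
c(default = bw_default, fat = d_fat$bw, skinny = d_skinny$bw)
```

A fat window smooths away real structure, while a skinny one follows individual points, which mirrors the too-few and too-many bins problem for histograms.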

Best practice: combine hist (or KDE) with points

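The code for this plot was not shown. One possible way to combine a histogram with the raw points is to add a rug layer, sketched here on simulated data (the data frame df and its column x are hypothetical; use data and totalsal for the salaries):

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the salary column
df <- data.frame(x = rlnorm(500, meanlog = 11, sdlog = 0.5))

p <- ggplot(data = df, aes(x = x)) +
    geom_histogram(bins = 30) +   # overall shape of the distribution
    geom_rug(alpha = 0.3)         # one tick per observation along the x axis
p
```

The rug shows where individual observations fall, so gaps and outliers hidden by the binning stay visible.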

Covariation

  • visualize relationship between two variables
  • three or more variables are possible, but the plots quickly become hard to read

The scatter plot is the most basic visualization of covariation

ggplot(data=data) +
    geom_point(aes(x=age, y=totalsal)) 

Correlation is the simplest summary of covariation

cor(data$age, data$totalsal)
[1] 0.2355144
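
The default cor() is the Pearson correlation, which only measures linear association, so a value near zero does not rule out a strong relationship. A small sketch on made-up data:

```r
# A perfectly deterministic but non-linear relationship
x <- seq(-3, 3, length.out = 101)
y <- x^2

r_nonlinear <- cor(x, y)           # essentially zero: no linear trend
r_linear    <- cor(x, 2 * x + 1)   # an exactly linear relationship

c(nonlinear = r_nonlinear, linear = r_linear)
```

This is one reason to look at the scatter plot before trusting the single number.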

Bar plot

data %>% 
    filter(dept %in% c("Pediatrics", "Orthodontics" , 'Ophthalmology')) %>%
    group_by(dept) %>%
    summarise(mean_sal = mean(totalsal)) %>%
    ggplot() +
    geom_col(aes(x=dept, y=mean_sal)) # geom_col() is shorthand for geom_bar(stat='identity')

Box plots

data %>% 
    filter(dept %in% c("Pediatrics", "Orthodontics" , 'Ophthalmology')) %>%
    ggplot() +
    geom_boxplot(aes(x=dept, y=totalsal)) + 
    coord_flip() # make the labels horizontal so people can read them!

Clusters!

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

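Once a plot suggests clusters, a quick follow-up is to split the data and summarise each group. The sketch below uses a cut at 3 minutes, which is an assumption read off the scatter plot rather than an estimated boundary.

```r
# Assumption: eruptions shorter than 3 minutes form the "short" cluster
grp <- ifelse(faithful$eruptions < 3, "short", "long")

table(grp)                                           # size of each group
med_wait <- tapply(faithful$waiting, grp, median)    # median wait per group
med_wait
```

Comparing the groups this way is a natural next question in the EDA cycle: short eruptions tend to be followed by shorter waits.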

Add color

ggplot(data=data) +
    geom_point(aes(x=age, y=totalsal, color=status)) 
