Exploratory Data Analysis

STOR 390
1/31/17

Load data

library(tidyverse)
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))

EDA is an iterative cycle

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

Ask lots of questions

EDA is fundamentally a creative process. As with most creative processes, the key to asking good questions is to ask a lot of them.

Things to look for

modes/clusters
outliers
the unexpected
any kind of pattern

Two categories of questions

variation
covariation

Definitions

variable (column)
observation (row)
tabular data (matrix)
value (entry)
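
To make these terms concrete, here is a small sketch on a made-up table. The object `toy` and all of its values are hypothetical and are not taken from the salary data.

```r
# Hypothetical toy table: names and values are made up for illustration
toy <- data.frame(
  name     = c("A", "B", "C"),
  age      = c(34, 51, 45),
  totalsal = c(52000, 88000, 61000)
)

toy$age        # a variable: one column
toy[2, ]       # an observation: one row
toy$age[2]     # a value: a single entry (here 51)
dim(toy)       # tabular data: 3 observations by 3 variables
```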

Summary statistics: location vs. range

median(data$totalsal)

[1] 59342

max(data$totalsal)

[1] 819069
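
The difference between location and range is easier to see on a tiny example. The sketch below uses a made-up vector `sal` (hypothetical values, not the UNC data) with one very large salary.

```r
# Hypothetical salaries (made up), with one very large value
sal <- c(40000, 52000, 59000, 61000, 75000, 800000)

mean(sal)     # location, pulled upward by the large value
median(sal)   # location, barely affected by it
range(sal)    # smallest and largest values
IQR(sal)      # spread of the middle half of the data
```

The mean lands far above most of the values while the median stays near the middle, which is one reason to look at more than one summary.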

Plot one variable

# plot each data point
ggplot(data=data) +
    geom_point(aes(x=totalsal, y=0)) +
    ylim(-10, 10)

Jitter plots

# same plot as above but with random y values
ggplot(data=data) +
    geom_jitter(aes(x=totalsal, y=0)) +
    ylim(-10, 10)

Boxplots

ggplot(data=data) +
    geom_boxplot(aes(x=0, y=totalsal))
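
As a rough guide to what the boxplot draws, the sketch below computes the box and the usual 1.5 * IQR outlier rule by hand. The vector `sal` is made up, and the names `lower_fence` and `upper_fence` are only illustrative.

```r
# Hypothetical salaries in thousands of dollars (made up)
sal <- c(30, 42, 45, 50, 55, 59, 62, 70, 85, 300)

q <- unname(quantile(sal, c(0.25, 0.5, 0.75)))  # lower hinge, median, upper hinge
iqr <- q[3] - q[1]                              # height of the box

# Points more than 1.5 * IQR beyond the box are drawn individually
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- sal[sal < lower_fence | sal > upper_fence]
outliers
```

Here only the value 300 falls outside the fences, so it would show up as a separate point.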

Histograms

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 30)

Too many bins

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 10000)

Too few bins

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 2)

Just right (maybe?)

ggplot(data=data) +
    geom_histogram(aes(x=totalsal), bins = 100)
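
There is no single correct number of bins, but base R ships a few rules of thumb that can serve as starting points. The sketch below applies them to a made-up right-skewed sample `x` (a hypothetical stand-in for a salary column, not the real data).

```r
set.seed(1)
# Hypothetical right-skewed sample standing in for a salary column
x <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

# Three common rules of thumb for the number of bins
n_sturges <- nclass.Sturges(x)
n_scott   <- nclass.scott(x)
n_fd      <- nclass.FD(x)   # Freedman-Diaconis
c(sturges = n_sturges, scott = n_scott, fd = n_fd)
```

The rules can disagree, especially on skewed data, so it is still worth trying several values of `bins` (or setting `binwidth` directly) and looking at the result.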

Multimodal

Gaussian mixture with two modes

set.seed(342)
mix <- tibble(val=c(rnorm(n=4000, mean=0, sd=1), 
                  rnorm(n=4000, mean=2.5, sd=1)))

Wide binwidth misses the modes

# wide binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 10)

Moderate binwidth: two modes start to show

# moderate binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 30)

Small binwidth: false modes appear

# small binwidth
ggplot(data=mix) +
    geom_histogram(aes(x=val), bins = 2000)
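
One way to put a number on "false modes" is to count local peaks in the bin counts. The sketch below regenerates the same mixture as a plain vector and uses a hypothetical helper, `count_peaks`, with equal-width bins from the minimum to the maximum. ggplot2 may place its bins slightly differently, so the counts will not match the plots exactly.

```r
set.seed(342)
# Same two-component Gaussian mixture as above, as a plain vector
val <- c(rnorm(n = 4000, mean = 0, sd = 1),
         rnorm(n = 4000, mean = 2.5, sd = 1))

# Count local peaks in the bin counts for a given number of equal-width bins
count_peaks <- function(x, bins) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  counts <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  # a peak is a bin strictly taller than both of its neighbours
  sum(diff(sign(diff(counts))) == -2)
}

peaks_few  <- count_peaks(val, 10)
peaks_many <- count_peaks(val, 2000)
c(bins_10 = peaks_few, bins_2000 = peaks_many)
```

With very narrow bins most of the peaks are sampling noise rather than real structure in the data.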

Kernel density estimate

# geom_density with its default values
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=1)

KDE warning

A KDE is a smoothed, continuous version of a histogram
It has one (or more) parameters that need to be set
Warning: always be wary of “smart defaults”. No single default value works well in every (or even most) situations.
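
To see what the main parameter does, here is a minimal hand-rolled Gaussian KDE. The function `kde`, the sample `x`, and the grid are all hypothetical. The bandwidth `h` comes from `bw.nrd0`, the default rule used by `density()`, and multiplying it plays the role of `adjust`.

```r
set.seed(1)
x <- rnorm(200)   # hypothetical sample (made up)

# Hand-rolled Gaussian KDE: average of normal bumps of width h centred on each point
kde <- function(grid, x, h) {
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}

grid <- seq(-6, 6, length.out = 1201)
h <- bw.nrd0(x)                      # default bandwidth rule used by density()

f_default <- kde(grid, x, h)         # like adjust = 1
f_fat     <- kde(grid, x, 10 * h)    # like adjust = 10: oversmoothed
f_skinny  <- kde(grid, x, 0.1 * h)   # like adjust = 0.1: spiky

# Sanity check: a density should integrate to about 1
area <- sum(f_default) * (grid[2] - grid[1])
area
```

A larger bandwidth flattens the curve and a smaller one makes it spiky, which is the same trade-off as too few versus too many histogram bins.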

KDE with fat window

# geom_density with a fat window
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=10)

KDE with skinny window

# geom_density with a skinny window
ggplot(data=data) +
    geom_density(aes(x=totalsal), kernel="gaussian", adjust=.1)

Best practice: combine hist (or KDE) with points
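
The source gives only the resulting plot for this slide, so the following is a guess at one way to build such a figure, not the original code. It overlays a rug (one tick per observation) on a histogram, using a made-up data frame `toy` as a stand-in for the salary data.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the salary data (made-up values)
toy <- data.frame(totalsal = rlnorm(500, meanlog = 11, sdlog = 0.5))

p <- ggplot(data = toy) +
    geom_histogram(aes(x = totalsal), bins = 30) +
    # rug: one tick per observation along the x axis
    geom_rug(aes(x = totalsal), alpha = 0.3)
p
```

The rug shows where the individual points actually sit, so gaps or lone values that a bin would hide remain visible.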

Covariation

visualize relationship between two variables
three or more variables are possible, but the plots become harder to read

The scatter plot is the most basic visualization of covariation

ggplot(data=data) +
    geom_point(aes(x=age, y=totalsal))

Correlation is the simplest summary of covariation

cor(data$age, data$totalsal)

[1] 0.2355144
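
For reference, the number returned by `cor` is the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on made-up vectors `age` and `sal` (hypothetical values, not the UNC data).

```r
# Hypothetical ages and salaries (made up)
age <- c(25, 32, 41, 48, 55, 63)
sal <- c(38000, 52000, 50000, 75000, 69000, 90000)

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- cov(age, sal) / (sd(age) * sd(sal))
r_manual
cor(age, sal)   # same number
```

Correlation only measures linear association. If either column contains missing values, `cor` returns `NA` unless an option such as `use = "complete.obs"` is passed.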

Bar plot

data %>% 
    filter(dept %in% c("Pediatrics", "Orthodontics", "Ophthalmology")) %>%
    group_by(dept) %>%
    summarise(mean_sal = mean(totalsal)) %>%
    ggplot() +
    geom_bar(aes(x=dept, y=mean_sal), stat='identity')
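
It can help to look at the summarised table before plotting it. The sketch below runs the same `group_by` and `summarise` steps on a few made-up rows (the tibble `toy` is hypothetical) and also records the group size `n`, which a bar of means does not show.

```r
library(dplyr)

# Hypothetical rows (made up) mimicking the dept and totalsal columns
toy <- tibble(
  dept     = c("Pediatrics", "Pediatrics", "Orthodontics", "Orthodontics", "Ophthalmology"),
  totalsal = c(60000, 100000, 90000, 110000, 150000)
)

dept_summary <- toy %>%
    group_by(dept) %>%
    summarise(mean_sal = mean(totalsal), n = n())
dept_summary
```

Each bar in the plot corresponds to one row of this table. A mean based on a single person looks the same as a mean based on hundreds, so the bar plot hides both group size and spread.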

Box plots

data %>% 
    filter(dept %in% c("Pediatrics", "Orthodontics", "Ophthalmology")) %>%
    ggplot() +
    geom_boxplot(aes(x=dept, y=totalsal)) + 
    coord_flip() # make the labels horizontal so people can read them!

Clusters!

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

Add color

ggplot(data=data) +
    geom_point(aes(x=age, y=totalsal, color=status))
