Histos

Jason Van Pelt, Jeni Rainer
3/22/2019

Overview

Shared Vocabulary
Why use histograms
Abstract samples
Spike samples

Shared Vocabulary and Why

Histograms answer questions about frequency and data distribution:
1. How similar or diverse are the points in my data set?
Filter Histograms:
1. Small
2. Low fidelity
3. A cue that leads the user to narrow the result set and find specific data points
Bin size:
1. Narrow enough to reveal interesting features about the distribution
2. Wide enough to reduce noise
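AREA is what matters, not height: each bar's area shows the share of the data in its bin, so heights are directly comparable only when all bins have the same width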
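The area point can be checked numerically. The sketch below is an illustration only: the values in `x` and the unequal `breaks` are made up for this example and are not the deck's `hist_data`. It uses base R's `hist()` with `plot = FALSE` to show that three bins holding the same number of points get different heights but equal areas.

```r
# Hypothetical sample, chosen so the counts are easy to verify by hand
x <- c(0.2, 0.4, 0.6, 0.8, 1.5, 1.7, 2.5, 3.5, 6, 9)

# Deliberately unequal bin widths: 1, 1, 2 and 6
breaks <- c(0, 1, 2, 4, 10)
h <- hist(x, breaks = breaks, plot = FALSE)

bin_width   <- diff(h$breaks)
bar_height  <- h$density               # height of each bar on the density scale
bar_area    <- bar_height * bin_width  # area of each bar
proportions <- h$counts / length(x)    # share of the sample in each bin

# The last three bins each hold 2 of the 10 points, so their areas match
# even though their heights do not.
total_area <- sum(bar_area)
```

Under these assumptions, `bar_area` equals `proportions` and `total_area` is 1, while `bar_height` falls as the bins get wider.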

plot of chunk histo1

Same Data, Different User Experience

library(ggplot2)

ggplot(hist_data, aes(spanCount)) +
  geom_histogram(binwidth = 0.1, col = "black", fill = "blue")

plot of chunk rawData

This is test data, but it has not been massaged in any way.

Naturalize Histogram by putting all outliers in one bin

library(dplyr)

hist_data %>%
  # Cap spanCount at 10 so every outlier lands in the last bin
  mutate(spanCountNew = ifelse(spanCount > 10, 10, spanCount)) %>%
  ggplot(aes(spanCountNew)) +
  geom_histogram(binwidth = 0.1, col = "black", fill = "blue")

plot of chunk hist_data2
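To see what the capping step does to individual values, here is a minimal base R sketch. The vector `span_count` and the threshold `cap` are hypothetical stand-ins, not the deck's `hist_data`, and `pmin()` is used as an equivalent of the `ifelse()` call, on the assumption that the goal is simply to clamp values at the threshold.

```r
# Hypothetical span counts with a few large outliers
span_count <- c(1, 2, 2, 3, 5, 8, 12, 40, 250)
cap <- 10  # assumed threshold, matching the value used in the slide code

# pmin() returns the element-wise minimum, so anything above cap becomes cap
span_count_capped <- pmin(span_count, cap)

# Number of observations that were folded into the last bin
n_grouped <- sum(span_count > cap)

# Frequency of each capped value; the entry for 10 now means "10 or more"
capped_table <- table(span_count_capped)
```

One consequence worth keeping in mind: after capping, the last bar no longer represents exactly 10, so the axis label for that bin probably should read something like "10+".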

Same Data, Different User Experience

Change Bin Width

ggplot(hist_data, aes(spanCount)) + 
  geom_histogram(binwidth = 1, col = "black", fill = "blue")

plot of chunk unnamed-chunk-2
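One way to reason about why the two bin widths feel so different is to count how many bins end up empty. The sketch below is a rough illustration: `span_count` is made-up whole-number data, `bin_summary()` is a hypothetical helper, and it assumes bins are centered on multiples of the bin width (which appears to be how `geom_histogram` places bins by default, but treat that as an assumption).

```r
# Hypothetical whole-number data standing in for spanCount
span_count <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 7)

# Count total, occupied and empty bins between min(x) and max(x),
# assuming bins are centered on multiples of binwidth
bin_summary <- function(x, binwidth) {
  total    <- round(diff(range(x)) / binwidth) + 1
  occupied <- length(unique(round(x / binwidth)))
  c(total = total, occupied = occupied, empty = total - occupied)
}

narrow <- bin_summary(span_count, 0.1)  # many bins, most of them empty
wide   <- bin_summary(span_count, 1)    # one bin per whole number
```

With this sample, the narrow setting spreads six distinct values over 61 bins, which is what produces thin, isolated spikes; the wide setting uses 7 bins and leaves only one empty.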

Change Bin Width and Group Outliers

hist_data %>%
  # Cap spanCount at 10 so every outlier lands in the last bin
  mutate(spanCountNew = ifelse(spanCount > 10, 10, spanCount)) %>%
  ggplot(aes(spanCountNew)) +
  geom_histogram(binwidth = 1, col = "black", fill = "blue")

plot of chunk unnamed-chunk-3