Histos

Jason Van Pelt, Jeni Rainer
3/22/2019

Overview

  • Shared Vocabulary
  • Why use histograms
  • Abstract samples
  • Spike samples

Shared Vocabulary and Why

  • Histograms answer questions about frequency and data distribution
    1. How similar or diverse are the points in my data set?
  • Filter Histograms:
    1. Small
    2. Low fidelity
    3. A cue that leads the user to narrow the result set and find specific data points
  • Bin size:
    1. Narrow enough to reveal interesting features about the distribution
    2. Wide enough to reduce noise
  • AREA is what matters, not height

plot of chunk histo1
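
The point about area can be checked numerically. The sketch below is not from the original deck: it uses synthetic data (`x`) and arbitrary, unequal `breaks` chosen only for illustration, and relies on base R's `hist()` to show that each bar's area (density times bin width) equals the share of observations in that bin.

```r
# Synthetic stand-in data; the deck's hist_data is not available here.
set.seed(42)
x <- rexp(1000, rate = 0.5)

# Unequal bin widths: narrow bins near zero, one wide bin for the tail.
breaks <- c(0, 0.5, 1, 2, 4, max(x) + 1)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density is the share of observations per unit of bin width.
widths <- diff(h$breaks)
areas <- h$density * widths

# Each bar's area equals the share of observations in that bin,
# so the areas sum to 1 even though bar heights depend on bin width.
shares <- h$counts / sum(h$counts)
total_area <- sum(areas)
```

With equal-width bins, height and area tell the same story; the distinction only becomes visible once bin widths differ, as they do here.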


Same Data, Different User Experience

ggplot(hist_data, aes(spanCount)) + 
  geom_histogram(binwidth = .1, col = "black", fill = "blue")

plot of chunk rawData

This is test data, but it has not been massaged in any way.

Naturalize Histogram by putting all outliers in one bin

# Cap spanCount at 10 so every outlier falls into the last bin
hist_data %>%
  mutate(spanCountNew = pmin(spanCount, 10)) %>%
  ggplot(aes(spanCountNew)) +
  geom_histogram(binwidth = .1, col = "black", fill = "blue")

plot of chunk hist_data2
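
One possible variation on grouping outliers, sketched here under assumptions: instead of capping values at 10, bucket them with `cut()` and label the last bin "10+" so the user can see it is an overflow bin. The `hist_data` built below is a hypothetical stand-in (the deck's real data is not shown), and the one-unit bins and the `bucket` and `n_overflow` names are illustrative choices, not the authors'.

```r
library(ggplot2)

# Hypothetical stand-in for hist_data: 200 ordinary values and 10 outliers.
set.seed(1)
hist_data <- data.frame(spanCount = c(runif(200, 0, 10), runif(10, 10, 100)))

# One-unit bins from 0 to 10, plus one open-ended overflow bin.
breaks <- c(0:10, Inf)
labels <- c(paste0(0:9, "-", 1:10), "10+")
hist_data$bucket <- cut(hist_data$spanCount, breaks = breaks,
                        labels = labels, right = FALSE)

# Number of points that landed in the overflow bin.
n_overflow <- sum(hist_data$bucket == "10+")

# Bar chart of the pre-computed buckets, styled like the deck's histograms.
p <- ggplot(hist_data, aes(bucket)) +
  geom_bar(col = "black", fill = "blue") +
  labs(x = "spanCount", y = "count")
```

Printing `p` draws the chart. Labeling the bin explicitly trades a continuous axis for a categorical one, which may or may not suit a small filter histogram.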

Same Data, Different User Experience

Change Bin Width

ggplot(hist_data, aes(spanCount)) + 
  geom_histogram(binwidth = 1, col = "black", fill = "blue")

plot of chunk unnamed-chunk-2
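
As a rough starting point for choosing a bin width, one commonly used heuristic is the Freedman-Diaconis rule (width = 2 × IQR / n^(1/3)). The sketch below is an illustration only: the deck does not say how its widths of .1 and 1 were chosen, and `x`, `fd_width`, and `candidate_widths` are hypothetical names on synthetic data.

```r
# Synthetic, right-skewed stand-in data; not the deck's hist_data.
set.seed(7)
x <- rlnorm(500, meanlog = 1, sdlog = 0.6)

# Freedman-Diaconis rule: width = 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(x) / length(x)^(1/3)

# Number of bins needed to cover the data range at each candidate width.
candidate_widths <- c(narrow = 0.1, fd = fd_width, wide = 1)
n_bins <- ceiling(diff(range(x)) / candidate_widths)
```

Comparing `n_bins` across the candidates gives a quick sense of how noisy or how coarse the resulting histogram is likely to be before plotting it.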

Change Bin Width and Group Outliers

# Cap spanCount at 10, then plot with one-unit bins
hist_data %>%
  mutate(spanCountNew = pmin(spanCount, 10)) %>%
  ggplot(aes(spanCountNew)) +
  geom_histogram(binwidth = 1, col = "black", fill = "blue")

plot of chunk unnamed-chunk-3