Summer 2020

Outline

  1. Plotting Tools
  2. Visualizing Single Variate Distributions & Values
  3. Visualizing Multi-Variate Distributions & Values
  4. Interaction & Navigation
  5. Analytical Navigation
  6. Further Help

Plotting Tools

Out-of-the-Box Tools

  • Microsoft Excel
  • Google Spreadsheet
  • Many Eyes
  • Tableau
  • SPSS

Some Programming Required

  • Python
  • PHP, HTML, JavaScript, etc.
  • Processing
  • R
  • SAS

For interactive visualization:

  • Python + Plotly
  • R + Shiny
  • JavaScript + D3

Illustration Tools

  • Adobe Illustrator
  • Gimp
  • Inkscape
  • Dia

Visualizing Single Variate Distributions & Values

Lollipop Plots for Discrete Distributions

Suppose we want to visualize a Binomial distribution, \(n=15,\; p=0.25\)

library(ggplot2)

k = 0:15
pmf = dbinom(k,  size=max(k),  prob=0.25)
MyData = data.frame(k,  pmf)

ggplot(MyData,aes(x=k,  y=pmf)) + 
  geom_linerange(ymin=0,  ymax=pmf,  size=1.25) + 
  geom_point(size=3.5) + 
  ylab("Pr{k}") +
  theme(text=element_text(size=18, family="Times"))

Distribution Plot for Continuous Distributions

Suppose we want to visualize a Normal distribution, \(\mu = 5, \sigma=2\)

library(ggplot2)

ggplot(data.frame(x=c(-5,15),y=c(0,1)),aes(x=x,y=y)) + 
  stat_function(fun=dnorm,args=list(mean=5,sd=2)) + 
  ggtitle("Normal Distribution, ~N(5,2)") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Histograms

To get a rough picture of the distribution of a sample, use a histogram

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_histogram(binwidth=0.5, col="white", fill="darkblue") +
  xlab("Value") + ylab("Count") + ggtitle("Histogram of MyData") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Density Plots

Or a density plot

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  xlab("Value") + ylab("Density") + ggtitle("Density of MyData") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Several Plots

Or all of these

library(ggplot2)

MyData = data.frame(val=rnorm(200)) 

mu = mean(MyData$val)
sig = sqrt(var(MyData$val))

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  geom_histogram(binwidth=0.5, aes(y=..density..), col="white", alpha=0.4) +
  stat_function(fun=dnorm,args=list(mean=mu,sd=sig), size=1.5, col="darkred") +
  xlab("Value") + ylab("Density") + 
  ggtitle("Estimating MyData Distribution") +
  theme(text=element_text(size=18, family="Times"))

Q-Q Norm Plots

Q-Q plots give us a way to see how close to a normal distribution
our data might be

Example Q-Q patterns: right skew, short tails, left skew, long tails

MyData = data.frame(val=rnorm(200))

qqnorm(MyData$val,pch=19,col="darkgray")
qqline(MyData$val,lwd=2,col="darkred")
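
To see what the Q-Q plot is comparing, the sketch below rebuilds its coordinates by hand in base R: the sorted sample values against the matching standard normal quantiles. The seed and the variable names (sample_q, theory_q) are assumptions made for illustration only.

```r
# Sketch: compute the points of a normal Q-Q plot by hand
set.seed(1)                         # assumed seed, only for reproducibility
x = rnorm(200)

n = length(x)
sample_q = sort(x)                  # ordered sample values (y-axis)
theory_q = qnorm(ppoints(n))        # matching standard normal quantiles (x-axis)

# qqnorm() returns the same coordinates when plotting is switched off
qq = qqnorm(x, plot.it=FALSE)
same_x = isTRUE(all.equal(sort(qq$x), theory_q))
same_y = isTRUE(all.equal(sort(qq$y), sample_q))

plot(theory_q, sample_q, pch=19, col="darkgray",
     xlab="Theoretical Quantiles", ylab="Sample Quantiles")
# Reference line from the sample mean and sd (qqline instead uses the quartiles)
abline(mean(x), sd(x), lwd=2, col="darkred")
```

If the data are close to normal, the points fall near the straight line; curvature at the ends suggests skew or unusual tails.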

Dot Plot

Dot plots use position to encode a numeric value, proportion, or frequency

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])

ggplot(MyData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)") +
  theme(text=element_text(size=18, family="Times"))   

Note: There’s no implicit meaning to the \(y\)-axis positions

Ordered Dot Plot

So we typically order the dot plot by value to make it easier to read

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)")  +
  theme(text=element_text(size=18, family="Times"))

Ordered Bar Plot

Bar plots use length and position to encode a numeric value

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=State,y=Area)) +
  geom_bar(stat="identity") +
  coord_flip() +
  ylab("Area (sq. miles)")  +   # Recall we flipped the axes ...
  theme(text=element_text(size=18, family="Times"))

Note: Again, these are ordered for ease of reading …

Bar Plots are Not Histograms

  • Histograms visualize an estimate of the distribution of a numeric variable
  • The bins in a histogram stay in the order given by the values
  • Bar plots, in contrast, visualize the values of specific observations
  • The order in which the bars are presented is typically up to us
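
The difference can be seen in the data behind each plot. The base R sketch below is only illustrative: the seed, the bin width of 0.5, and the names bins, byArea and byName are assumptions, not part of the original slides.

```r
set.seed(1)                          # assumed seed, only for reproducibility
vals = rnorm(200)

# Histogram: bins are derived from the values and stay in numeric order
breaks = seq(floor(min(vals)), ceiling(max(vals)), by=0.5)
h = hist(vals, breaks=breaks, plot=FALSE)
bins = data.frame(mid=h$mids, count=h$counts)

# Bar plot: one bar per observation, in whatever order we choose
MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
byArea = MyData[order(MyData$Area), ]    # ordered by value
byName = MyData[order(MyData$State), ]   # ordered alphabetically; equally valid

print(bins)
print(byArea)
```

Reordering the rows of bins would misrepresent the distribution, while either ordering of the states is a legitimate bar plot.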

Strip Plots

Strip plots encode 1D numeric values and imply distribution information
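
A minimal sketch of a strip plot in ggplot2, assuming the built-in mtcars data and a small vertical jitter so that overlapping points remain visible. The jitter height and transparency are arbitrary choices.

```r
library(ggplot2)

# Every car sits on one horizontal strip; only the x position carries data
p = ggplot(mtcars, aes(x=mpg, y=1)) +
  geom_jitter(width=0, height=0.1, size=3, alpha=0.5) +
  ylim(c(0,2)) +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
  xlab("Mileage (mpg)") + ylab("") +
  theme(text=element_text(size=18, family="Times"))

print(p)
```

Where the points crowd together the sample is dense, which is how the strip implies distribution information without binning.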

Line Plots

Lines imply connection … don’t use them if there isn’t any

Line Plot

For example, use lines to connect the same algorithm at different points during a run

library(ggplot2)

fakeData = data.frame(evals=c(100,150,200,250),
                      performance=c(1000.1,1300.2,1410.6,1470.3),
                      ci=c(150,90,50,30))

ggplot(fakeData,aes(evals,performance)) +
  geom_errorbar(aes(ymin=performance-ci/2, ymax=performance+ci/2),
                size=0.5, width=10) +
  geom_line(color="darkblue", size=1.25) +
  geom_point(size=5) +
  xlab("Number of Evaluations") +
  ylab("Algorithm Performance") +
  theme(text=element_text(size=18, family="Times"))

Box Plots

Box plots give information about the median, quartiles, and outliers, as well as confidence intervals (via notches)

library(ggplot2)

ggplot(mtcars, aes(1,y=mpg)) +
  geom_boxplot(notch=T, fill="pink") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlim(c(0,2)) + 
  xlab("") + ylab("Mileage") + 
  ggtitle("Distribution of Car Mileage") +
  theme(text=element_text(size=18, family="Times"))
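
The notch is a rough confidence interval around the median. As a sketch, the base R code below reproduces the interval reported by boxplot.stats, which is the median plus or minus 1.58 * IQR / sqrt(n). The ggplot2 documentation describes its notches with the same formula, though its quartile computation may differ slightly from the hinges used here.

```r
x = mtcars$mpg
n = length(x)

q = fivenum(x)                 # min, lower hinge, median, upper hinge, max
iqr = q[4] - q[2]              # spread of the box
half_width = 1.58 * iqr / sqrt(n)

notch = c(q[3] - half_width, q[3] + half_width)
print(notch)

# Base R reports the same interval
print(boxplot.stats(x)$conf)
```

When the notches of two boxes do not overlap, that is commonly read as evidence that the medians differ.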

Visualizing Multi-Variate Distributions & Values

Overlaid Lollipop Plots for Discrete Distributions

Use dodge to visualize multiple Binomial distributions

library(ggplot2)

k = 0:15
p = factor(c(rep(0.25,length(k)),rep(0.4,length(k))))
pmf = c(dbinom(k,  size=max(k),  prob=0.25), dbinom(k,  size=max(k),  prob=0.4))
MyData = data.frame(k,  p, pmf)

ggplot(MyData, aes(x=k,  y=pmf, group=p)) + 
  geom_linerange(ymin=0,  
                 aes(ymax=pmf, color=p),  
                 size=1.25, 
                 position=position_dodge(width=0.25)) + 
  geom_point(size=3.5, position=position_dodge(width=0.25), aes(color=p)) + 
  ylab("Pr{k}") +
  ggtitle("Two Binomial Distributions, n=15, p=0.25 and p=0.4") 

Label text is too small? Use theme()

Overlaid Density Plots of Multiple Variables

You can use factors to separate different plots straightforwardly

library(ggplot2)
library(MASS)                         # Contains a lot of extra data sets

birthwt1 = birthwt                    # Copy a birth Wt / risk factor data set 
birthwt1$smoke = factor(birthwt$smoke) # Make "smoking during preg." a factor

ggplot(birthwt1, aes(x=bwt, fill=smoke)) + 
  geom_density(alpha=0.3) +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Overlaid Histograms of Multiple Variables

library(ggplot2)
library(MASS)                    # Contains a lot of extra data sets

bwt = birthwt$bwt                # Get the birth Wt / risk factor vector
smoke = as.factor(birthwt$smoke) # Make "smoking during preg." variable a factor
MyData = data.frame(bwt,smoke)

ggplot(MyData, aes(x=bwt, fill=smoke)) +
  geom_histogram(aes(y=..density..),
           binwidth=500,
           position=position_dodge(width=500),
           color="black") +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Two-Dimensional Density Plots

You can use stat_density2d to create contour density plots

library(ggplot2)
library(gcookbook)

ggplot(faithful, aes(x=eruptions, y=waiting)) +
  stat_density2d(aes(color=..level..), size=1.5) +
  xlab("Eruption Time (min)") +
  ylab("Time Between Eruptions (min)") +
  scale_color_continuous(name="Distribution\nDensity") +
  ggtitle("Old Faithful Geyser Eruptions") + 
  theme(text=element_text(size=20, family="Times"))

The Basic Scatterplot

Use geom_point for scatter plots of numeric values

library(ggplot2)
library(MASS)

ggplot(Boston,aes(x=age, y=medv, size=crim, color=dis)) + 
  geom_point() + 
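  xlab("Homes Built Before 1940 (%)") +   # Boston$age is a proportion, not an age
  ylab("Median Home Value (thousands)") + 
  # One size scale only: a second scale_size_*() call would discard the range
  scale_size_continuous(name="Township\nCrime Rate", range=c(2.5,10)) +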
  scale_color_continuous(name="Distance to\nEmployment") +
  ggtitle("Houses of Boston") + 
  theme(text=element_text(size=20, family="Times"))

Pairwise Scatterplots

The standard R function pairs allows us to see all pairwise scatter plots

pairs(iris[1:4],pch=19)

Pairwise Scatterplots with GGally

If you install the GGally library, you get a ggplot version with ggpairs

library(GGally)

ggpairs(iris) + 
  theme(text=element_text(size=20, family="Times"))

Co-Plotting Multiple Trends

Stacking Multiple Trends

Multiple Bar plots, Grouped

We can make “grouped” bar plots using dodge

library(ggplot2)
library(gcookbook)                    # Provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", position="dodge", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Multiple Bar plots, Stacked

By default, ggplot wants to stack …

library(ggplot2)
library(gcookbook)                    # Provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Mosaic Plots

  • Mosaic plots are like multi-dimensional bar plots
  • Encode values using area
  • In R, we need to install and load the library vcd
  • The vcd mosaic function requires a somewhat more sophisticated data table structure (more on this in another lecture)

library(vcd)

# mosaic() draws with grid graphics, not ggplot2, so a ggplot2 theme() cannot be added
mosaic(HairEyeColor)
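
As a brief preview of that table structure, the base R sketch below inspects HairEyeColor, which is a three-way contingency table rather than a data frame, and builds a similar table from a data frame with xtabs. Using mtcars with cyl and gear is an arbitrary illustration, not something from the original slides.

```r
# HairEyeColor is a 3-way contingency table of counts: Hair x Eye x Sex
print(class(HairEyeColor))
print(dim(HairEyeColor))
print(names(dimnames(HairEyeColor)))

# Build a 2-way contingency table from an ordinary data frame
tab = xtabs(~ cyl + gear, data=mtcars)
print(tab)

# The tile areas in a mosaic plot correspond to these cell proportions
props = prop.table(tab)
print(round(props, 2))

# A table can be converted back to long form: one row per cell, with a count
long = as.data.frame(HairEyeColor)
print(head(long))
```

A table like tab could then be passed to mosaic in the same way as HairEyeColor.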

Coxcomb Plots

Florence Nightingale used coxcomb plots to convince the British that the biggest threat to their soldiers during the Crimean War was preventable disease

library(ggplot2)

nightengale = read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/nightengale.csv",header=TRUE)
Month = as.Date(paste("01",nightengale$Date),"%d %B %Y")
DeathType = factor(nightengale$DeathType,ordered=TRUE)
DeathRate = sqrt((1000*nightengale$NumDeaths/nightengale$AvgArmySize)/pi)
MyData = data.frame(Month,DeathType,DeathRate)

ggplot(MyData, aes(x=Month, 
                   y=DeathRate, 
                   fill=DeathType, 
                   order=as.numeric(DeathType))) + 
  geom_bar(stat="identity") +
  coord_polar() +
  scale_x_date(breaks=MyData$Month,labels=format(MyData$Month,"%b %Y")) + 
  theme(text=element_text(size=20, family="Times"))

Parallel Coordinate Plots

  • We can show an arbitrary number of dimensions using a parallel coordinate plot
  • But these are typically pretty hard to understand
  • They’re sometimes useful for quickly intuiting relationships among variables

library(MASS)                         # Provides parcoord

parcoord(iris[1:4],col=as.numeric(iris$Species),lwd=2,main="Iris Dataset")

Multiple Boxplots

library(ggplot2)

ggplot(iris,aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot(outlier.size=3, notch=TRUE) +
  ylab("Iris Sepal Length (cm)") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Displays

  • Separating data on multiple plots creates a “lookup” problem for the reader when decoding
  • Putting too much data on one plot creates a “clutter” problem for the reader when decoding
  • So sometimes we separate the plots but organize them so that their scales and axes are the same, well aligned, and close together
  • These are called trellis displays
  • We’ll come back to these after we learn about models
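
In the meantime, here is a minimal sketch of a trellis display using ggplot2's facet_wrap, assuming the built-in iris data. Each species gets its own panel, and the panels share the same x and y scales by default, so they can be compared directly.

```r
library(ggplot2)

p = ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point(size=2) +
  facet_wrap(~ Species, nrow=1) +     # one aligned panel per species, fixed scales
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)") +
  theme(text=element_text(size=18, family="Times"))

print(p)
```

Placing the panels in a single row keeps the y-axes aligned, which reduces the "lookup" cost without adding clutter.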

Interaction & Navigation

Interacting with Data

  • Increasingly, people are using interactive data visualizations to help explore and understand data
  • There are a number of ways to interact with data, including:
    • Comparing
    • Sorting, selecting and filtering
    • Brushing, highlighting, and annotating
    • Re-visualizing, re-expressing and dynamic detail

Comparing

Sorting, Selecting, and Filtering

Brushing, Highlighting, & Annotating

Re-visualizing, Re-expressing, and Dynamic Detail

Analytical Navigation

Directed vs. Exploratory Navigation

Analytical Navigation: Visual navigation through data as a means to learn something about patterns, relationships, and idiosyncrasies within it (and the underlying phenomena that produced it)

“Data analysis, like experimentation, must be considered as an open-minded, highly interactive, iterative process …” – John Tukey

  • Directed Navigation: navigation driven by a specific question (or questions) to answer
  • Exploratory Navigation: navigation to learn more about the data before forming (then pursuing answers for) a question (or questions)

Shneiderman’s Mantra

“Overview first, zoom and filter, then details-on-demand” – Ben Shneiderman

ARCC Ganglia Report

Like a paper: abstract, method, then results

Hierarchical Navigation

  • Humans are used to thinking of things in hierarchies
  • Interactive visualizations can explicitly facilitate analytical navigation of data in a hierarchical manner
  • Hierarchical navigation can help support Shneiderman’s Mantra
  • Tree Maps are one such method
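
The same overview, zoom, and detail sequence can be mimicked on a data table, which is roughly what an interactive tree map does for the reader. The base R sketch below uses a small made-up budget table. The departments, programs, and amounts are hypothetical and are not taken from any real budget.

```r
# Hypothetical two-level hierarchy: Department > Program (made-up amounts)
budget = data.frame(
  Department = c("Health", "Health", "Defense", "Defense", "Education"),
  Program    = c("A", "B", "C", "D", "E"),
  Amount     = c(40, 25, 60, 15, 30)
)

# Overview first: totals at the top level of the hierarchy
overview = aggregate(Amount ~ Department, data=budget, FUN=sum)
print(overview)

# Zoom and filter: descend into a single department
zoom = subset(budget, Department == "Defense")
print(zoom)

# Details on demand: inspect a single leaf
detail = subset(zoom, Program == "C")
print(detail)
```

In a tree map, the overview totals would set the area of the outer rectangles, and zooming would redraw the chosen department's programs at full size.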

Tree Map Example

2011 Proposed Budget

Further Help

Some Useful Links

Sidebar: What is Most Important?