Simple Effective Graphs, Interaction, & Navigation

Summer 2020

Outline

Plotting Tools
Visualizing Single Variate Distributions & Values
Visualizing Multi-Variate Distributions & Values
Interaction & Navigation
Analytical Navigation
Further Help

Plotting Tools

Out-of-the-Box Tools

Microsoft Excel
Google Spreadsheet
Many Eyes
Tableau
SPSS

Some Programming Required

Python
PHP, HTML, JavaScript, etc.
Processing
R
SAS

For Interactive visualization:

Python + Plot.ly
R + Shiny
JavaScript + D3

Illustration Tools

Adobe Illustrator
GIMP
Inkscape
Dia

Visualizing Single Variate Distributions & Values

Lollipop Plots for Discrete Distributions

Suppose we want to visualize a Binomial distribution, \(n=15,\; p=0.25\)

library(ggplot2)

k = 0:15                                      # possible numbers of successes
pmf = dbinom(k,  size=max(k),  prob=0.25)     # Pr{k} for n=15, p=0.25
MyData = data.frame(k,  pmf)

ggplot(MyData,aes(x=k,  y=pmf)) + 
  geom_linerange(aes(ymin=0,  ymax=pmf),  size=1.25) +   # the stick
  geom_point(size=3.5) +                                  # the head
  ylab("Pr{k}") +
  theme(text=element_text(size=18, family="Times"))
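
As a quick sanity check on the values being plotted, the sketch below recomputes the binomial probabilities from the textbook formula \(\binom{n}{k} p^k (1-p)^{n-k}\) and compares them with dbinom. The names n, p, pmf_formula, and pmf_builtin are introduced here only for illustration.

```r
n = 15
p = 0.25
k = 0:n

# Probability of exactly k successes in n independent trials
pmf_formula = choose(n, k) * p^k * (1 - p)^(n - k)
pmf_builtin = dbinom(k, size=n, prob=p)

# The two should agree up to floating-point error, and a pmf must sum to 1
agree = isTRUE(all.equal(pmf_formula, pmf_builtin))
total = sum(pmf_builtin)
```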

Distribution Plot for Continuous Distributions

Suppose we want to visualize a Normal distribution, \(\mu = 5, \sigma=2\)

library(ggplot2)

ggplot(data.frame(x=c(-5,15),y=c(0,1)),aes(x=x,y=y)) + 
  stat_function(fun=dnorm,args=list(mean=5,sd=2)) + 
  ggtitle("Normal Distribution, mean=5, sd=2") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Histograms

To get a rough picture of the distribution of a sample, use a histogram

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_histogram(binwidth=0.5, col="white", fill="darkblue") +
  xlab("Value") + ylab("Count") + ggtitle("Histogram of MyData") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Density Plots

Or a density plot

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  xlab("Value") + ylab("Density") + ggtitle("Density of MyData") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Several Plots

Or all of these

library(ggplot2)

MyData = data.frame(val=rnorm(200)) 

mu = mean(MyData$val)
sig = sd(MyData$val)                  # sample standard deviation

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  geom_histogram(binwidth=0.5, aes(y=..density..), col="white", alpha=0.4) +
  stat_function(fun=dnorm,args=list(mean=mu,sd=sig), size=1.5, col="darkred") +
  xlab("Value") + ylab("Density") + 
  ggtitle("Estimating MyData Distribution") +
  theme(text=element_text(size=18, family="Times"))

Q-Q Norm Plots

Q-Q plots give us a way to see how close our data might be to a normal distribution

Patterns to look for: right skew, left skew, short tails, long tails

Q-Q Norm Plots

MyData = data.frame(val=rnorm(200))

qqnorm(MyData$val,pch=19,col="darkgray")
qqline(MyData$val,lwd=2,col="darkred")
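
To see what right skew, left skew, short tails, and long tails look like on a Q-Q plot, the sketch below draws one sample of each kind and plots all four in a grid. The choice of distributions (exponential, mirrored exponential, uniform, Student t with 3 degrees of freedom) and the seed are assumptions made for illustration, not part of the original slides.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
n = 200

samples = list(
  "Right Skew"  = rexp(n),           # long right tail
  "Left Skew"   = -rexp(n),          # mirror image: long left tail
  "Short Tails" = runif(n, -1, 1),   # bounded, so lighter tails than a normal
  "Long Tails"  = rt(n, df=3)        # heavier tails than a normal
)

old_par = par(mfrow=c(2,2))          # 2 x 2 grid of panels
for (name in names(samples)) {
  qqnorm(samples[[name]], main=name, pch=19, col="darkgray")
  qqline(samples[[name]], lwd=2, col="darkred")
}
par(old_par)                         # restore the previous layout
```

Skewed samples typically bend away from the line in one direction at both ends, while short and long tails tend to produce opposite S-shaped departures.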

Dot Plot

Dot plots use position to encode a numeric value, proportion, or frequency

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])

ggplot(MyData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)") +
  theme(text=element_text(size=18, family="Times"))

Note: There’s no implicit meaning to the \(y\)-axis positions

Ordered Dot Plot

So we typically order the dot plot by value to make it easier to read

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)")  +
  theme(text=element_text(size=18, family="Times"))

Ordered Bar Plot

Bar plots use length and position to encode a numeric value

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=State,y=Area)) +
  geom_bar(stat="identity") +
  coord_flip() +
  ylab("Area (sq. miles)")  +   # Recall we flipped the axes ...
  theme(text=element_text(size=18, family="Times"))

Note: Again, these are ordered for ease of reading …

Bar Plots are Not Histograms

Histograms visualize an estimate for a distribution of a numeric variable
The bins in the histogram remain in the order given by the values
While bar plots visualize the values of specific observations
And the order of the bar plots presented is typically up to us
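
The difference can be sketched in a few lines of base R. The histogram bins below are fixed by the values themselves, while the bars can be reordered freely. The sample, the seed, the break points, and the variable names are assumptions chosen for illustration.

```r
set.seed(2)                               # arbitrary seed, only for reproducibility
x = rnorm(200)

# Histogram: bins follow the numeric axis, so their order is not ours to choose
h = hist(x, breaks=seq(-5, 5, by=0.5), plot=FALSE)
plot(h, col="darkblue", border="white", main="Histogram of x", xlab="Value")

# Bar plot: one bar per observation (here, per state), in any order we like
areas = setNames(state.area[1:10], state.name[1:10])
sorted_areas = sort(areas)                # we pick the order, e.g. by value
barplot(sorted_areas, horiz=TRUE, las=1, xlab="Area (sq. miles)")
```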

Strip Plots

Strip plots encode 1D numeric values and imply distribution information
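
A minimal sketch of a strip plot, using the base R function stripchart on the mtcars mileage column. The data set and the jitter amount are assumptions chosen for illustration. Jittering spreads overlapping points vertically so that crowded regions of the axis hint at where the distribution is dense.

```r
set.seed(3)                   # jitter is random; the seed is arbitrary
mileage = mtcars$mpg

# Each observation becomes one point along a single numeric axis
stripchart(mileage, method="jitter", jitter=0.1,
           pch=19, col="darkblue",
           xlab="Mileage (mpg)", main="Strip Plot of Car Mileage")
```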

Line Plots

Lines imply connection … don’t use them if there isn’t any

Line Plot

For example, use lines to connect the same algorithm at different points during a run

library(ggplot2)

fakeData = data.frame(evals=c(100,150,200,250),
                      performance=c(1000.1,1300.2,1410.6,1470.3),
                      ci=c(150,90,50,30))

ggplot(fakeData,aes(evals,performance)) +
  geom_errorbar(aes(ymin=performance-ci/2, ymax=performance+ci/2),
                size=0.5, width=10) +
  geom_line(color="darkblue", size=1.25) +
  geom_point(size=5) +
  xlab("Number of Evaluations") +
  ylab("Algorithm Performance") +
  theme(text=element_text(size=18, family="Times"))

Box Plots

Box plots give information about the median, quartiles, and outliers, as well as an approximate confidence interval around the median (the notch)

library(ggplot2)

ggplot(mtcars, aes(1,y=mpg)) +
  geom_boxplot(notch=TRUE, fill="pink") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlim(c(0,2)) + 
  xlab("") + ylab("Mileage") + 
  ggtitle("Distribution of Car Mileage") +
  theme(text=element_text(size=18, family="Times"))
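
The numbers behind a box plot can be inspected directly with the base R function boxplot.stats. This is a sketch of the same quantities rather than what ggplot2 computes internally, and the variable names are introduced here for illustration.

```r
s = boxplot.stats(mtcars$mpg)

five_numbers = s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
notch_limits = s$conf    # lower and upper ends of the notch around the median
outliers     = s$out     # points beyond the whiskers (may be empty)

print(five_numbers)
print(notch_limits)
print(outliers)
```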

Visualizing Multi-Variate Distributions & Values

Overlaid Lollipop Plots for Discrete Distributions

Use dodge to visualize multiple Binomial distributions

library(ggplot2)

k = 0:15
p = factor(c(rep(0.25,length(k)),rep(0.4,length(k))))
pmf = c(dbinom(k,  size=max(k),  prob=0.25), dbinom(k,  size=max(k),  prob=0.4))
MyData = data.frame(k,  p, pmf)

ggplot(MyData, aes(x=k,  y=pmf, group=p)) + 
  geom_linerange(ymin=0,  
                 aes(ymax=pmf, color=p),  
                 size=1.25, 
                 position=position_dodge(width=0.25)) + 
  geom_point(size=3.5, position=position_dodge(width=0.25), aes(color=p)) + 
  ylab("Pr{k}") +
  ggtitle("Two Binomial Distributions, n=15, p=0.25 and p=0.4")

Overlaid Lollipop Plots

Label text too small? Add theme(text=element_text(size=18)) to the plot, as the other examples here do

Overlaid Density Plots of Multiple Variables

You can use factors to separate different plots straightforwardly

library(ggplot2)
library(MASS)                         # Contains a lot of extra data sets

birthwt1 = birthwt                    # Copy a birth Wt / risk factor data set 
birthwt1$smoke = factor(birthwt$smoke) # Make "smoking during preg." a factor

ggplot(birthwt1, aes(x=bwt, fill=smoke)) + 
  geom_density(alpha=0.3) +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Overlaid Histograms of Multiple Variables

library(ggplot2)
library(MASS)                    # Contains a lot of extra data sets

bwt = birthwt$bwt                # Get the birth Wt / risk factor vector
smoke = as.factor(birthwt$smoke) # Make "smoking during preg." variable a factor
MyData = data.frame(bwt,smoke)

ggplot(MyData, aes(x=bwt, fill=smoke)) +
  geom_histogram(aes(y=..density..),
           binwidth=500,
           position=position_dodge(width=500),
           color="black") +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Two-Dimensional Density Plots

You can use stat_density2d to create contour density plots

library(ggplot2)       # faithful ships with base R (datasets), so no data package is needed

ggplot(faithful, aes(x=eruptions, y=waiting)) +
  stat_density2d(aes(color=..level..), size=1.5) +
  xlab("Eruption Time (min)") +
  ylab("Time Between Eruptions (min)") +
  scale_color_continuous(name="Distribution\nDensity") +
  ggtitle("Old Faithful Geyser Eruptions") + 
  theme(text=element_text(size=20, family="Times"))

The Basic Scatterplot

Use geom_point for scatter plots of numeric values

library(ggplot2)
library(MASS)

ggplot(Boston,aes(x=age, y=medv, size=crim, color=dis)) + 
  geom_point() + 
  xlab("Age of Home") + 
  ylab("Median Home Value (thousands)") + 
  # Use a single size scale: a second scale_size_*() call would replace the first
  scale_size_continuous(name="Township\nCrime Rate", range=c(2.5,10)) +
  scale_color_continuous(name="Distance to\nEmployment") +
  ggtitle("Houses of Boston") + 
  theme(text=element_text(size=20, family="Times"))

Pairwise Scatterplots

The standard R function pairs allows us to see all pairwise scatter plots

pairs(iris[1:4],pch=19)

Pairwise Scatterplots with GGally

If you install the GGally library, you get a ggplot version with ggpairs

library(GGally)

ggpairs(iris) + 
  theme(text=element_text(size=20, family="Times"))

Co-Plotting Multiple Trends

We can use group to separate trends in a dataset
Color and line thickness can help distinguish these groups
Still, too many separate lines are hard to decode

library(ggplot2)
library(gcookbook)

ggplot(uspopage, aes(x=Year, y=Thousands, group=AgeGroup)) +
  geom_line(aes(color=AgeGroup,size=AgeGroup)) +
  xlab("Year") + 
  ylab("Number of People in US (thousands)") + 
  theme(text=element_text(size=20, family="Times"))

Stacking Multiple Trends

Stacking can make sense when:
- a factor is ordered (ordinal categorical variable)
- we care mainly about the aggregated values at any stack level
Try to match stack order with legend order

Stacking Multiple Trends

library(ggplot2)
library(gcookbook)

Year = uspopage$Year
Thousands = uspopage$Thousands
AgeGroup = factor(uspopage$AgeGroup,levels=rev(levels(uspopage$AgeGroup)))
MyData = data.frame(Year,Thousands,AgeGroup)

ggplot(MyData, aes(x=Year, 
                  y=Thousands, 
                  fill=AgeGroup, 
                  order=-as.numeric(AgeGroup))) +
  geom_area()   +  scale_fill_grey(start=0.8, end=0) +
  xlab("Year")  +  ylab("Number of People in US (thousands)") + 
  theme(text=element_text(size=20, family="Times"))

Multiple Bar plots, Grouped

We can make “grouped” bar plots using dodge

library(ggplot2)
library(gcookbook)     # provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", position="dodge", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Multiple Bar plots, Stacked

By default, ggplot wants to stack …

library(ggplot2)
library(gcookbook)     # provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Mosaic Plots

Mosaic plots are like multi-dimensional bar plots
Encode values using area
In R, we need to install and load the library vcd
The vcd mosaic function requires a somewhat more sophisticated data table structure (more on this in another lecture)

library(vcd)

# mosaic() draws with grid graphics, not ggplot2, so a ggplot theme() cannot be added to it
mosaic(HairEyeColor)
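
The "more sophisticated data table structure" is a contingency table: counts indexed by every combination of factor levels. As a rough sketch, the base R function xtabs builds one from a data frame and the base R function mosaicplot (used here in place of vcd's mosaic, so no extra package is needed) draws it. The choice of mtcars and the variable names are assumptions for illustration.

```r
# Cross-tabulate cars by number of cylinders and number of gears
tab = xtabs(~ cyl + gear, data=mtcars)
print(tab)

# Tile area is proportional to the count in each cell
mosaicplot(tab, main="Cylinders by Gears", color=TRUE)

# HairEyeColor is already stored this way: a 3-way table of counts
hair_eye_dims = dim(HairEyeColor)    # hair x eye x sex
```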

Coxcomb Plots

Florence Nightingale used coxcomb plots to convince the British that the biggest threat to their soldiers during the Crimean War was preventable disease

library(ggplot2)

nightengale = read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/nightengale.csv",header=TRUE)
Month = as.Date(paste("01",nightengale$Date),"%d %B %Y")   # first day of each month
DeathType = factor(nightengale$DeathType,ordered=TRUE)
# Deaths per 1000 soldiers, converted to the radius of a circle with that area
DeathRate = sqrt((1000*nightengale$NumDeaths/nightengale$AvgArmySize)/pi)
MyData = data.frame(Month,DeathType,DeathRate)

ggplot(MyData, aes(x=Month, 
                   y=DeathRate, 
                   fill=DeathType, 
                   order=as.numeric(DeathType))) + 
  geom_bar(stat="identity") +
  coord_polar() +
  scale_x_date(breaks=MyData$Month,labels=format(MyData$Month,"%b %Y")) + 
  theme(text=element_text(size=20, family="Times"))

Parallel Coordinate Plots

We can show an arbitrary number of dimensions using a parallel coordinate plot
But these are typically pretty hard to understand
They are sometimes useful for quickly intuiting relationships among variables

library(MASS)          # provides parcoord

parcoord(iris[1:4],col=iris$Species,lwd=2,main="Iris Dataset")

Multiple Boxplots

library(ggplot2)

ggplot(iris,aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot(outlier.size=3, notch=TRUE) +
  ylab("Iris Sepal Length (cm)") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Displays

Separating data on multiple plots creates a “lookup” problem for the reader when decoding
Putting too much data on one plot creates a “clutter” problem for the reader when decoding
So sometimes we separate the plots, but organize them so that their scales and axes are the same, well aligned, and close together
These are called trellis displays
We’ll come back to these after we learn about models
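
In the meantime, here is a minimal sketch of the idea using the lattice package, which ships with standard R installations. The choice of the iris data, the variables plotted, and the one-row layout are assumptions for illustration. Each species gets its own panel, and by default the panels share the same axes and scales.

```r
library(lattice)

# One panel per species, laid out in a single row with common scales
trellis_plot = xyplot(Sepal.Length ~ Petal.Length | Species,
                      data=iris, layout=c(3,1), pch=19,
                      xlab="Petal Length (cm)", ylab="Sepal Length (cm)")
print(trellis_plot)
```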

Interaction & Navigation

Interacting with Data

Increasingly, people are using interactive data visualizations to help explore and understand data
There are a number of ways to interact with data, including:
- Comparing
- Sorting, selecting and filtering
- Brushing, highlighting, and annotating
- Re-visualizing, re-expressing, and dynamic detail
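
Interactive tools perform these operations in response to clicks and drags, but the underlying data operations are simple. The sketch below is a static stand-in in base R, not an interactive visualization. The data set (mtcars), the weight threshold, and the variable names are assumptions for illustration.

```r
car_data = mtcars
car_data$model = rownames(mtcars)

sorted_cars   = car_data[order(-car_data$mpg), ]            # sorting: best mileage first
selected_cols = sorted_cars[, c("model", "mpg", "wt")]      # selecting: keep a few columns
filtered_cars = subset(selected_cols, wt < 2.5)             # filtering: light cars only

# Highlighting: color the filtered cars differently in a plot of all cars
highlight = ifelse(car_data$wt < 2.5, "darkred", "darkgray")
plot(car_data$wt, car_data$mpg, pch=19, col=highlight,
     xlab="Weight (1000 lbs)", ylab="Mileage (mpg)")
```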

Comparing

Satellite Database

Sorting, Selecting, and Filtering

MLS Standings

Brushing, Highlighting, & Annotating

NY Times Buy vs. Rent

Gap Minder

Re-visualizing, Re-expressing, and Dynamic Detail

Climate Opinions

2013 Proposed Budget

Analytical Navigation

Directed vs. Exploratory Navigation

Analytical Navigation: Visual navigation through data as a means to learn something about patterns, relationships, and idiosyncrasies within it (and the underlying phenomena that produced it)

“Data analysis, like experimentation, must be considered as an open-minded, highly interactive, iterative process …” – John Tukey

Directed Navigation: navigation driven by a specific question (or questions) to answer
Exploratory Navigation: navigation to learn more about the data before forming (then pursuing answers for) a question (or questions)

Shneiderman’s Mantra

“Overview first, zoom and filter, then details-on-demand” – Ben Shneiderman
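
The three steps of the mantra can be mimicked, very roughly, with plain data-frame operations. This is only a non-interactive sketch. The data set (mtcars), the filter condition, and the variable names are assumptions for illustration.

```r
# Overview first: average mileage for each cylinder count
overview = aggregate(mpg ~ cyl, data=mtcars, FUN=mean)
print(overview)

# Zoom and filter: narrow to four-cylinder cars with high mileage
zoomed = subset(mtcars, cyl == 4 & mpg > 30)

# Details on demand: the full record for one car of interest
details = zoomed[which.max(zoomed$mpg), ]
print(details)
```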

Hierarchical Navigation

Further Help

Some Useful Links

Sidebar: What is Most Important?