Visualization, Part 2

Spring 2025

Graphical Perception Tasks

The Visual Perception System

Eyes sense light reflecting & refracting off of surfaces
A composite object is formed in our brain from various visual properties
We perceive the composite as a whole object, but we can still distinguish these properties
For example: 2D location, length, width, area, shape, color, orientation
We do not attend to everything we see
We do not have good working memory for what we see

Preattentive Processing

Humans have a limited set of visual properties that are detected very rapidly and accurately by our visual system before we are consciously aware of it

We easily detect the presence or absence of a target within a visual field
We easily detect a texture boundary between two groups of elements
We easily track an element with a unique visual feature in space and time

“Perception in Visualization”, Chris Healey, NC State

Preattentive Processing – Color is Easy

Find the red circle:

Preattentive Processing – Shape is Easy

Find the red circle:

Preattentive Processing – Conjunction is Harder

Find the red circle:

Preattentive Processing – Other Pre-attentive Cues

See: “Perception in Visualization”, Chris Healey, NC State

Postattentive Vision

What happens to our visual representation when we stop attending and look at something else?

Sustained attention to objects does not make visual search more efficient
Repeated visual searches are not more efficient
Once we see a pattern, we match the pattern even when it isn’t there
Moral: Do not make users search for things in your visualization, but draw attention to things explicitly

“Perception in Visualization”, Chris Healey, NC State

Familiar Patterns

Humans are pattern matchers …

Familiar Patterns

See the dolphin!

Familiar Patterns

We cannot easily “unsee” things …

Familiar Patterns

There is no spoon … er … dolphin!

Poor Working Memory

We rely on memory, but our working visual memory is very limited

Visual Encoding

Tasks to be done when visualizing information:
- Encode numeric data visually
- Encode categorical data visually
- Encode distinctions between different pieces of information
- Encode methods to associate data / distinctions to some context
Objective: To make the reader’s decoding process as easy and error-free as possible

Encoding Numeric & Categorical Data

Typically the categorical data we wish to encode is in fact numeric:
- Proportions: a continuous number between 0 and 1
- Frequencies: a discrete integer or count
So often the most fundamental encoding choices for numeric and categorical values to be plotted are the same
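
As a minimal sketch of the point above, the following Python snippet reduces a categorical sample to frequencies and proportions, which are then ordinary numbers to plot. The helper name `frequencies_and_proportions` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import Counter


def frequencies_and_proportions(labels):
    """Return {level: (count, proportion)} for a list of categorical labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {level: (n, n / total) for level, n in counts.items()}


# Hypothetical categorical sample (illustrative values only).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]
summary = frequencies_and_proportions(eye_colors)

for level, (count, proportion) in sorted(summary.items()):
    print(f"{level:>6}: count={count}, proportion={proportion:.3f}")
```

Either column (the integer counts or the proportions between 0 and 1) can then be encoded by position or length like any other numeric value.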

Distinguishing Graphical Elements

Whether underlying variables are categorical or numeric, we often have multiple things on a plot. E.g.,
- Proportions from different levels of some variable
- Different categorical level values in a factor
- Different numeric variable values
- Different trend lines in a time series
Because the reader needs to discern these as different things, our encoding must distinguish these for the reader in some way

Common Ways to Visually Encode Numbers

In order from most easily perceived to least:
1. Position along a common scale, axis, and baseline
2. Position along non-aligned axes
3. Length, direction, angles of relative lines / slope
4. Area
5. Volume, curvature, arcs / angles within a shape
6. Color or shading

Cleveland & McGill (1984). “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association, 79(387), pp. 531–554.

Position on a Common Scale

Position on Non-Aligned Axes

Length Comparisons

Area Comparisons

Angle / Curve Comparisons

Color Comparisons

Common Ways to Distinguish Visual Elements

Often we need to separate or distinguish discrete visual items using:

Distinct positions
Different colors or shading
Distinguishing symbols, words, or annotations
Other plot elements (e.g., line thickness)

Distinct Positions

Different Colors or Shading

Distinguishing Symbols, Words, or Annotations

Other Plot Elements

Color Considerations

Color can be used for a number of purposes:
- Encoding numeric values
- Distinguishing or highlighting visual elements
- Mood & effect
Perception of colors depends on context:
- Medium: paper, poster, screen, projected presentation
- Lighting: glare, contrast
- Audience: Colorblindness?

Grouping: Gestalt Principles

Proximity: When objects are close together, we often perceive them as a group
Similarity: When objects share similar attributes (color, shape, etc.), we often perceive them as a group
Enclosure: When objects are surrounded by a boundary, we often perceive them as a group
Closure: Sometimes partially open structures can still be perceived as a grouping metaphor (e.g., “\(\left[ \ldots \right]\)”)
Connectivity: When you draw curves or lines through data elements, this is often perceived as creating a connection between them

Proximity

When objects are close together, we often perceive them as a group

Similarity

When objects share similar attributes (color, shape, etc.), we often perceive them as a group

Enclosure

When objects are surrounded by a boundary, we often perceive them as a group

Closure

Sometimes partially open structures can still be perceived as a grouping metaphor

Connectivity

When you draw curves or lines through data elements, this is often perceived as creating a connection between them

Lines Imply Connection

Lines imply connection … don’t use them if there isn’t any

Groups Imply Connection

Group things so that the most important things to compare are closest

Memory Limitations

Humans have different kinds of memory, stored differently and in different parts of the brain
- Long-term vs. working memory
- Verbal memory vs. visual memory
Working memory for visual information is very limited
Humans can retain roughly three chunks of information at a time
Visualizations can help “chunk” information together

Keeping It Together

We should avoid “fragmentation” (separating things that should be remembered together)
So place the most closely related things nearest each other, especially those you most want the reader to remember together
Highlight and annotate things explicitly, if you want the reader to notice them

R: Visualizing Single Variate Distributions & Values

Lollipop Plots for Discrete Distributions

Suppose we want to visualize a Binomial distribution, \(n=15,\; p=0.25\)

library(ggplot2)

k = 0:15
pmf = dbinom(k,  size=max(k),  prob=0.25)
MyData = data.frame(k,  pmf)

ggplot(MyData,aes(x=k,  y=pmf)) + 
  geom_linerange(ymin=0,  ymax=pmf,  size=1.25) + 
  geom_point(size=3.5) + 
  ylab("Pr{k}") +
  theme(text=element_text(size=18, family="Times"))
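
For readers who want to check the values being plotted, here is a small Python sketch of the same Binomial probabilities, computed directly from the definition rather than with R's dbinom. The function name `binomial_pmf` is an assumption made for this example.

```python
from math import comb


def binomial_pmf(n, p):
    """Return [Pr{K = k} for k = 0..n] for a Binomial(n, p) distribution."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


# Same parameters as the slide: n = 15, p = 0.25.
pmf = binomial_pmf(15, 0.25)

for k, prob in enumerate(pmf):
    print(f"Pr{{{k}}} = {prob:.5f}")
```

Each value corresponds to the height of one lollipop stem; the probabilities sum to one.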

Distribution Plot for Continuous Distributions

Suppose we want to visualize a Normal distribution, \(\mu = 5, \sigma=2\)

library(ggplot2)

ggplot(data.frame(x=c(-5,15),y=c(0,1)),aes(x=x,y=y)) + 
  stat_function(fun=dnorm,args=list(mean=5,sd=2)) + 
  ggtitle("Normal Distribution, ~N(5,2)") +
  theme(text=element_text(size=18, family="Times"))

Estimating Distributions with Histograms

To get a rough picture of the distribution of a sample, use a histogram

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_histogram(binwidth=0.5, col="white", fill="darkblue") +
  xlab("Value") + ylab("Count") + ggtitle("Histogram of MyData") +
  theme(text=element_text(size=18, family="Times"))
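
To make explicit what a histogram computes, the sketch below bins a sample by hand in Python. It assumes bin edges aligned to multiples of the bin width, which may differ from the defaults geom_histogram uses, and the function name `histogram` is hypothetical.

```python
import math
import random
from collections import Counter


def histogram(values, binwidth):
    """Return (left, right, count, density) per non-empty bin.

    Bins are [k*binwidth, (k+1)*binwidth); this alignment is an assumption.
    """
    counts = Counter(math.floor(v / binwidth) for v in values)
    n = len(values)
    bins = []
    for k in sorted(counts):
        left = k * binwidth
        density = counts[k] / (n * binwidth)  # bar areas then sum to 1
        bins.append((left, left + binwidth, counts[k], density))
    return bins


random.seed(1)
sample = [random.gauss(0, 1) for _ in range(200)]
hist_bins = histogram(sample, binwidth=0.5)

for left, right, count, density in hist_bins:
    print(f"[{left:5.1f}, {right:5.1f}): count={count:3d} density={density:.3f}")
```

The count column is what a plain histogram shows; the density column is the rescaling needed to overlay a histogram on a density curve.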

Estimating Distributions with Density Plots

Or a density plot

library(ggplot2)

MyData = data.frame(val=rnorm(200))

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  xlab("Value") + ylab("Density") + ggtitle("Density of MyData") +
  theme(text=element_text(size=18, family="Times"))
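
A density plot is a smoothed estimate of the distribution. The Python sketch below shows one common approach, a Gaussian kernel density estimate, with a bandwidth picked by hand; geom_density chooses its bandwidth automatically, so this is an illustration of the idea rather than a reproduction of its output. The name `gaussian_kde` is an assumption for this example.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return a function x -> estimated density, using Gaussian kernels."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


random.seed(1)
sample = [random.gauss(0, 1) for _ in range(200)]
kde = gaussian_kde(sample, bandwidth=0.4)  # bandwidth chosen by hand

# Riemann sum over a wide grid: the estimated density should integrate to ~1.
step = 0.01
grid = [-8 + i * step for i in range(1601)]
area = sum(kde(x) for x in grid) * step
print(f"area under the estimated density: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve that follows the sample closely; a larger one gives a smoother curve that may hide real structure.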

Estimating Distributions with Several Plots

Or all of these

library(ggplot2)

MyData = data.frame(val=rnorm(200)) 

mu = mean(MyData$val)
sig = sqrt(var(MyData$val))

ggplot(MyData,aes(x=val)) + 
  geom_density(fill="pink",col=NA) +
  geom_histogram(binwidth=0.5, aes(y=after_stat(density)), col="white", alpha=0.4) +
  stat_function(fun=dnorm,args=list(mean=mu,sd=sig), size=1.5, col="darkred") +
  xlab("Value") + ylab("Density") + 
  ggtitle("Estimating MyData Distribution") +
  theme(text=element_text(size=18, family="Times"))
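
The dark red curve in this plot is a Normal density fitted from the sample mean and standard deviation. A rough Python equivalent of that fitting step, using only the standard library, might look like the following; the variable names mirror the R code, and the sample is regenerated here, so the numbers will not match the R output.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(200)]

mu = mean(sample)
sig = stdev(sample)  # n-1 denominator, matching sqrt(var(x)) in R

# The fitted Normal distribution whose density would be overlaid on the plot.
fitted = NormalDist(mu, sig)
peak = fitted.pdf(mu)
print(f"mu={mu:.3f}, sig={sig:.3f}, peak density={peak:.3f}")
```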

Q-Q Norm Plots

Q-Q plots give us a way to see how close to a normal distribution our data might be

[Figure: example Q-Q plots in four panels: Right Skew, Short Tails, Left Skew, Long Tails]

MyData = data.frame(val=rnorm(200))

qqnorm(MyData$val,pch=19,col="darkgray")
qqline(MyData$val,lwd=2,col="darkred")
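
The points in a normal Q-Q plot pair each sorted sample value with the standard Normal quantile at a matching probability. The Python sketch below computes those pairs; the plotting positions follow the formula that R's ppoints is understood to use, which should be treated as an assumption, and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Return [(theoretical quantile, sample quantile), ...] for a Q-Q plot."""
    n = len(sample)
    a = 3 / 8 if n <= 10 else 1 / 2  # plotting-position offset (assumed)
    probs = [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]
    theoretical = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(theoretical, sorted(sample)))


# Hypothetical sample, for illustration only.
qq = normal_qq_points([2.1, -0.3, 0.8, 1.5, -1.2])

for theo, obs in qq:
    print(f"theoretical={theo:6.3f}  observed={obs:5.2f}")
```

If the data are close to Normal, these pairs fall near a straight line; systematic bending away from the line signals skew or unusual tails.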

Dot Plot

Dot plots use position to encode a numeric value, proportion, or frequency

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])

ggplot(MyData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)") +
  theme(text=element_text(size=18, family="Times"))

Note: There’s no implicit meaning to the \(y\)-axis positions

Ordered Dot Plot

So we typically order the dot plot by value to make it easier to read

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=Area,y=State)) +
  geom_point(size=4) +
  xlab("Area (sq. miles)")  +
  theme(text=element_text(size=18, family="Times"))
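
The reordering step is simply a sort of the categories by their values. A minimal Python sketch of the same idea follows, using hypothetical category names and values rather than the state data.

```python
# Hypothetical category -> value table (illustrative only).
values = {"North": 42.0, "South": 17.5, "East": 88.1, "West": 63.4}

# Sort categories by value, ascending, as reorder(State, Area) does for levels.
ordered = sorted(values.items(), key=lambda item: item[1])

for name, value in ordered:
    print(f"{name:<6} {value:6.1f}")
```

Plotting the categories in this order turns the arbitrary axis order into a ranking the reader can scan.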

Ordered Bar Plot

Bar plots use length and position to encode a numeric value

library(ggplot2)

MyData = data.frame(State=state.name[1:10], Area=state.area[1:10])
MySortedData = transform(MyData, State=reorder(State,Area))

ggplot(MySortedData,aes(x=State,y=Area)) +
  geom_bar(stat="identity") +
  coord_flip() +
  ylab("Area (sq. miles)")  +   # Recall we flipped the axes ...
  theme(text=element_text(size=18, family="Times"))

Note: Again, these are ordered for ease of reading …

Bar Plots are Not Histograms

Histograms visualize an estimate of the distribution of a numeric variable
The bins in a histogram stay in the order given by the values
Bar plots, by contrast, visualize the values of specific observations
The order in which the bars are presented is typically up to us

Line Plot

For example, use lines to connect the same algorithm at different points during a run

library(ggplot2)

fakeData = data.frame(evals=c(100,150,200,250),
                      performance=c(1000.1,1300.2,1410.6,1470.3),
                      ci=c(150,90,50,30))

ggplot(fakeData,aes(evals,performance)) +
  geom_errorbar(aes(ymin=performance-ci/2, ymax=performance+ci/2),
                size=0.5, width=10) +
  geom_line(color="darkblue", size=1.25) +
  geom_point(size=5) +
  xlab("Number of Evaluations") +
  ylab("Algorithm Performance") +
  theme(text=element_text(size=18, family="Times"))

Box Plots

Box plots give information about the median, quartiles, and outliers, and notches give an approximate confidence interval for the median

library(ggplot2)

ggplot(mtcars, aes(x=1, y=mpg)) +
  geom_boxplot(notch=TRUE, fill="pink") +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlim(c(0,2)) + 
  xlab("") + ylab("Mileage") + 
  ggtitle("Distribution of Car Mileage") +
  theme(text=element_text(size=18, family="Times"))
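
The quantities a notched box plot displays can be computed by hand. The Python sketch below does so for a small hypothetical sample. It assumes the usual conventions (quartiles by linear interpolation, outliers beyond 1.5 IQR from the box, notch half-width of 1.58 * IQR / sqrt(n)); these should be checked against the plotting library in use, and the function name `boxplot_summary` is an assumption.

```python
import math
from statistics import quantiles


def boxplot_summary(values):
    """Return the median, quartiles, notch limits, and outliers of a sample."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    half_notch = 1.58 * iqr / math.sqrt(len(values))  # assumed notch rule
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "notch": (med - half_notch, med + half_notch),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical sample, for illustration only.
box = boxplot_summary([10, 12, 13, 15, 15, 16, 18, 19, 21, 33])
print(box)
```

When the notches of two boxes do not overlap, that is commonly read as rough evidence that the medians differ.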

R: Visualizing Multi-Variate Distributions & Values

Overlaid Lollipop Plots for Discrete Distributions

Use dodge to visualize multiple Binomial distributions

library(ggplot2)

k = 0:15
p = factor(c(rep(0.25,length(k)),rep(0.4,length(k))))
pmf = c(dbinom(k,  size=max(k),  prob=0.25), dbinom(k,  size=max(k),  prob=0.4))
MyData = data.frame(k,  p, pmf)

ggplot(MyData, aes(x=k,  y=pmf, group=p)) + 
  geom_linerange(ymin=0,  
                 aes(ymax=pmf, color=p),  
                 size=1.25, 
                 position=position_dodge(width=0.25)) + 
  geom_point(size=3.5, position=position_dodge(width=0.25), aes(color=p)) + 
  ylab("Pr{k}") +
  ggtitle("Two Binomial Distributions, n=15, p=0.25 and p=0.4")
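
Dodging shifts each group sideways by a small, fixed offset so that elements sharing an x value do not overlap. The Python sketch below shows the arithmetic for evenly spaced slots centered on x. It illustrates the idea and is not a copy of ggplot2's implementation; the function name `dodge_positions` is hypothetical.

```python
def dodge_positions(x, n_groups, width):
    """Return the shifted x position of each group, centered on x."""
    slot = width / n_groups  # horizontal space given to each group
    return [x + (i - (n_groups - 1) / 2) * slot for i in range(n_groups)]


# Two groups (p = 0.25 and p = 0.4) dodged within a total width of 0.25 at k = 3.
positions = dodge_positions(3, n_groups=2, width=0.25)
print(positions)
```

Keeping the total width small relative to the spacing between k values keeps each pair of stems visually tied to its k.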

Overlaid Lollipop Plots

Label text is too small? Use theme()

Overlaid Density Plots of Multiple Variables

You can use factors to separate different plots straightforwardly

library(ggplot2)
library(MASS)                         # Contains a lot of extra data sets

birthwt1 = birthwt                    # Copy a birth Wt / risk factor data set 
birthwt1$smoke = factor(birthwt$smoke) # Make "smoking during preg." a factor

ggplot(birthwt1, aes(x=bwt, fill=smoke)) + 
  geom_density(alpha=0.3) +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Overlaid Histograms of Multiple Variables

library(ggplot2)
library(MASS)                    # Contains a lot of extra data sets

bwt = birthwt$bwt                # Get the birth Wt / risk factor vector
smoke = as.factor(birthwt$smoke) # Make "smoking during preg." variable a factor
MyData = data.frame(bwt,smoke)

ggplot(MyData, aes(x=bwt, fill=smoke)) +
  geom_histogram(aes(y=after_stat(density)),
           binwidth=500,
           position=position_dodge(width=500),
           color="black") +
  xlab("Birth Weight (g)") +
  ylab("Distribution Density") +
  scale_fill_discrete(name="Mom Smoked?",
                      labels=c("No","Yes")) + 
  theme(text=element_text(size=20, family="Times"))

Two-Dimensional Density Plots

You can use stat_density2d to create contour density plots

library(ggplot2)
library(gcookbook)

ggplot(faithful, aes(x=eruptions, y=waiting)) +
  stat_density2d(aes(color=..level..), size=1.5) +
  xlab("Eruption Time (min)") +
  ylab("Time Between Eruptions (min)") +
  scale_color_continuous(name="Distribution\nDensity") +
  ggtitle("Old Faithful Geyser Eruptions") + 
  theme(text=element_text(size=20, family="Times"))

The Basic Scatterplot

Use geom_point for scatter plots of numeric values

library(ggplot2)
library(MASS)

ggplot(Boston,aes(x=age, y=medv, size=crim, color=dis)) + 
  geom_point() + 
  xlab("Age of Home") + 
  ylab("Median Home Value (thousands)") + 
  scale_size_continuous(name="Township\nCrime Rate", range=c(2.5,10)) +
  scale_color_continuous(name="Distance to\nEmployment") +
  ggtitle("Houses of Boston") + 
  theme(text=element_text(size=20, family="Times"))

Pairwise Scatterplots

The standard R function pairs allows us to see all pairwise scatter plots

pairs(iris[1:4],pch=19)

Pairwise Scatterplots with GGally

If you install the GGally library, you get a ggplot version with ggpairs

library(GGally)

ggpairs(iris) + 
  theme(text=element_text(size=20, family="Times"))

Co-Plotting Multiple Trends

We can use group to separate trends in a dataset
Color and line thickness can help distinguish these groups
Still, too many separate lines are hard to decode

library(ggplot2)
library(gcookbook)

ggplot(uspopage, aes(x=Year, y=Thousands, group=AgeGroup)) +
  geom_line(aes(color=AgeGroup,size=AgeGroup)) +
  xlab("Year") + 
  ylab("Number of People in US (thousands)") + 
  theme(text=element_text(size=20, family="Times"))

Stacking Multiple Trends

Stacking can make sense when:
- a factor is ordered (ordinal categorical variable)
- we care mainly about the aggregated values at any stack level
Try to match stack order with legend order

library(ggplot2)
library(gcookbook)

Year = uspopage$Year
Thousands = uspopage$Thousands
AgeGroup = factor(uspopage$AgeGroup,levels=rev(levels(uspopage$AgeGroup)))
MyData = data.frame(Year,Thousands,AgeGroup)

ggplot(MyData, aes(x=Year, 
                  y=Thousands, 
                  fill=AgeGroup)) +   # Stack order follows the levels of AgeGroup
  geom_area()   +  scale_fill_grey(start=0.8, end=0) +
  xlab("Year")  +  ylab("Number of People in US (thousands)") + 
  theme(text=element_text(size=20, family="Times"))
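
Stacking means each layer is drawn on top of the running total of the layers beneath it, which is why the aggregate is easy to read and the individual upper layers are not. A minimal Python sketch of that bookkeeping follows, with a hypothetical helper `stack_layers` and made-up numbers.

```python
def stack_layers(layers):
    """Given layers (bottom first), return a (lower, upper) band per layer."""
    lower = [0.0] * len(layers[0])
    bands = []
    for layer in layers:
        upper = [lo + value for lo, value in zip(lower, layer)]
        bands.append((lower, upper))
        lower = upper  # the next layer sits on top of this one
    return bands


# Hypothetical counts for three groups over three years (illustrative only).
layers = [
    [10.0, 12.0, 15.0],  # bottom layer
    [20.0, 18.0, 17.0],
    [5.0, 6.0, 8.0],     # top layer
]
bands = stack_layers(layers)

for lower, upper in bands:
    print(lower, "->", upper)
```

Only the bottom layer shares a common baseline; every other layer must be judged by the distance between two moving lines.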

Multiple Bar plots, Grouped

We can make “grouped” bar plots using dodge

library(ggplot2)
library(gcookbook)   # Provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", position="dodge", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Multiple Bar plots, Stacked

By default, ggplot wants to stack …

library(ggplot2)
library(gcookbook)   # Provides the cabbage_exp data set

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) +
  geom_bar(stat="identity", color="white") + 
  scale_fill_brewer(palette="Set1") +
  theme(text=element_text(size=20, family="Times"))

Mosaic Plots

Mosaic plots are like multi-dimensional bar plots
Encode values using area
In R, we need to install and load the library vcd
The vcd mosaic function requires a somewhat more sophisticated data table structure (more on this in another lecture)

library(vcd)
mosaic(HairEyeColor)   # vcd draws with grid graphics, so ggplot2 theme() does not apply

Coxcomb Plots

Florence Nightingale used Coxcomb plots to convince the British that the biggest threat to their soldiers during the Crimean War was preventable disease

library(ggplot2)

nightengale = read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/nightengale.csv",header=TRUE)
Month = as.Date(paste("01",nightengale$Date),"%d %B %Y")
DeathType = factor(nightengale$DeathType,ordered=TRUE)
DeathRate = sqrt((1000*nightengale$NumDeaths/nightengale$AvgArmySize)/pi)
MyData = data.frame(Month,DeathType,DeathRate)

ggplot(MyData, aes(x=Month, 
                   y=DeathRate, 
                   fill=DeathType)) + 
  geom_bar(stat="identity") +
  coord_polar() +
  scale_x_date(breaks=MyData$Month,labels=format(MyData$Month,"%b %Y")) + 
  theme(text=element_text(size=20, family="Times"))
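
The square root in the DeathRate computation matters because a Coxcomb plot encodes values by area, while the plotted quantity is a radius. The Python sketch below shows the relationship for a full circle; a wedge differs only by a constant factor. The function name `radius_for_area` and the example values are assumptions.

```python
import math


def radius_for_area(value):
    """Radius of a circle whose area equals the given value."""
    return math.sqrt(value / math.pi)


# Hypothetical rates: the second is four times the first.
rates = [10.0, 40.0]
radii = [radius_for_area(rate) for rate in rates]

for rate, radius in zip(rates, radii):
    print(f"rate={rate:5.1f} radius={radius:.3f} area={math.pi * radius**2:.1f}")
```

Quadrupling the value only doubles the radius. Plotting the raw value as the radius would instead inflate the area sixteen-fold and exaggerate the difference.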

Multiple Boxplots

library(ggplot2)

ggplot(iris,aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot(outlier.size=3, notch=TRUE) +
  ylab("Iris Sepal Length (cm)") + 
  theme(text=element_text(size=20, family="Times"))

A Few Simple Viz Examples in Python/Matplotlib

Bar Plot

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

fruits = ['apple', 'blueberry', 'cherry', 'orange']
counts = [40, 100, 30, 55]
bar_labels = ['red', 'blue', '_red', 'orange']
bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange']

ax.bar(fruits, counts, label=bar_labels, color=bar_colors)

ax.set_ylabel('fruit supply')
ax.set_title('Fruit supply by kind and color')
ax.legend(title='Fruit color')

plt.show()

Bubble Plot

import matplotlib.pyplot as plt
import numpy as np

# Fixing random state for reproducibility
np.random.seed(19680801)


N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = (30 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()

Boxplots

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(19680801)
fruit_weights = [
    np.random.normal(130, 10, size=100),
    np.random.normal(125, 20, size=100),
    np.random.normal(120, 30, size=100),
]
labels = ['peaches', 'oranges', 'tomatoes']
colors = ['peachpuff', 'orange', 'tomato']

fig, ax = plt.subplots()
ax.set_ylabel('fruit weight (g)')

bplot = ax.boxplot(fruit_weights,
                   patch_artist=True,  # fill with color
                   tick_labels=labels)  # will be used to label x-ticks

# fill with colors
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)

plt.show()