Spring 2025
Humans have a limited set of visual properties that are detected very rapidly and accurately by our visual system before we are consciously aware of it
Find the red circle:
Find the red circle:
Find the red circle:
What hapens to our visual representation when we stop attending and look at something else?
Humans are pattern matchers …
See the dolphin!
We cannot easily “unsee” things …
There is no spoon … er … dolphin!
We rely on memory, but our working visual memory is very limited
In order from most easily perceived to least:
Cleveland McGill (1984). “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association, 79(387), pp. 531–554.
Often we need to separate or distinguish discrete visual items using:
Proximity: When objects are close together, we often perceive them as a group
Similarity: When objects share similar attributes (color, shape, etc.), we often perceive them as a group
Enclosure: When objects are surrounded by a boundary, we often perceive them as a group
Closure: Sometimes partially open structures can still be perceived as a grouping metaphor (e.g., “\(\left[ \ldots \right]\)”)
Connectivity: When you draw curves or lines through data elements, this is often perceived as creating a connection between them
When objects are close together, we often perceive them as a group
When objects share similar attributes (color, shape, etc.), we often perceive them as a group
When objects are surrounded by a boundary, we often perceive them as a group
Sometimes partially open structures can still be perceived as a grouping metaphor
When you draw curves or lines through data elements, this is often perceived as creating a connection between them
Lines imply connection … don’t use them if there isn’t any
Group things so that the most important things to compare are closest
Suppose we want to visualize a Binomial distribution, \(n=15,\; p=0.25\)
library(ggplot2) k = 0:15 pmf = dbinom(k, size=max(k), prob=0.25) MyData = data.frame(k, pmf) ggplot(MyData,aes(x=k, y=pmf)) + geom_linerange(ymin=0, ymax=pmf, size=1.25) + geom_point(size=3.5) + ylab("Pr{k}") + theme(text=element_text(size=18, family="Times"))
Suppose we want to visualize a Normal distribution, \(\mu = 5, \sigma=2\)
library(ggplot2) ggplot(data.frame(x=c(-5,15),y=c(0,1)),aes(x=x,y=y)) + stat_function(fun=dnorm,args=list(mean=5,sd=2)) + ggtitle("Normal Distribution, ~N(5,2)") + theme(text=element_text(size=18, family="Times"))
To get a rough picture of the distribution of a sample, use a histogram
library(ggplot2) MyData = data.frame(val=rnorm(200)) ggplot(MyData,aes(x=val)) + geom_histogram(binwidth=0.5, col="white", fill="darkblue") + xlab("Value") + ylab("Count") + ggtitle("Histogram of MyData") + theme(text=element_text(size=18, family="Times"))
Or a density plot
library(ggplot2) MyData = data.frame(val=rnorm(200)) ggplot(MyData,aes(x=val)) + geom_density(fill="pink",col=NA) + xlab("Value") + ylab("Density") + ggtitle("Density of MyData") + theme(text=element_text(size=18, family="Times"))
Or all of these
library(ggplot2) MyData = data.frame(val=rnorm(200)) mu = mean(MyData$val) sig = sqrt(var(MyData$val)) ggplot(MyData,aes(x=val)) + geom_density(fill="pink",col=NA) + geom_histogram(binwidth=0.5, aes(y=..density..), col="white", alpha=0.4) + stat_function(fun=dnorm,arg=list(mean=mu,sd=sig), size=1.5, col="darkred") + xlab("Value") + ylab("Density") + ggtitle("Estimating MyData Distribution") + theme(text=element_text(size=18, family="Times"))
Q-Q plots give us a way to see how close to a normal distribution
our data might be
Right Skew | ![]() |
Short Tails | ![]() |
Left Skew | ![]() |
Long Tails | ![]() |
MyData = data.frame(val=rnorm(200)) qqnorm(MyData$val,pch=19,col="darkgray") qqline(MyData$val,lwd=2,col="darkred")
Dot plots use position to encode a numeric value, proportion, or frequency
library(ggplot2) MyData = data.frame(State=state.name[1:10], Area=state.area[1:10]) ggplot(MyData,aes(x=Area,y=State)) + geom_point(size=4) + xlab("Area (sq. miles)") + theme(text=element_text(size=18, family="Times"))
Dot plots use position to encode a numeric value, proportion, or frequency
Note: There’s no implicit meaning to the \(y\)-axis positions
So we can order the dot plot based on value typically to make it easier to read
library(ggplot2) MyData = data.frame(State=state.name[1:10], Area=state.area[1:10]) MySortedData = transform(MyData, State=reorder(State,Area)) ggplot(MySortedData,aes(x=Area,y=State)) + geom_point(size=4) + xlab("Area (sq. miles)") + theme(text=element_text(size=18, family="Times"))
Bar plots use length and position to encode a numeric value
library(ggplot2) MyData = data.frame(State=state.name[1:10], Area=state.area[1:10]) MySortedData = transform(MyData, State=reorder(State,Area)) ggplot(MySortedData,aes(x=State,y=Area)) + geom_bar(stat="identity") + coord_flip() + ylab("Area (sq. miles)") + # Recall we flipped the axes ... theme(text=element_text(size=18, family="Times"))
Note: Again, these are ordered for ease of reading …
For example, use lines to connect the same algorithm at different points during a run
library(ggplot2) fakeData = data.frame(evals=c(100,150,200,250), performance=c(1000.1,1300.2,1410.6,1470.3), ci=c(150,90,50,30)) ggplot(fakeData,aes(evals,performance)) + geom_errorbar(aes(ymin=performance-ci/2, ymax=performance+ci/2), size=0.5, width=10) + geom_line(color="darkblue", size=1.25) + geom_point(size=5) + xlab("Number of Evaluations") + ylab("Algorithm Performance") + theme(text=element_text(size=18, family="Times"))
Box plots give information about the median, inter-quartiles, outliers, as well as confidence inervals
library(ggplot2) ggplot(mtcars, aes(1,y=mpg)) + geom_boxplot(notch=T, fill="pink") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + xlim(c(0,2)) + xlab("") + ylab("Mileage") + ggtitle("Distribution of Car Mileage") + theme(text=element_text(size=18, family="Times"))
Use dodge to visualize multiple Binomial distributions
library(ggplot2) k = 0:15 p = factor(c(rep(0.25,length(k)),rep(0.4,length(k)))) pmf = c(dbinom(k, size=max(k), prob=0.25), dbinom(k, size=max(k), prob=0.4)) MyData = data.frame(k, p, pmf) ggplot(MyData, aes(x=k, y=pmf, group=p)) + geom_linerange(ymin=0, aes(ymax=pmf, color=p), size=1.25, position=position_dodge(width=0.25)) + geom_point(size=3.5, position=position_dodge(width=0.25), aes(color=p)) + ylab("Pr{k}") + ggtitle("Two Binomial Distributions, n=15, p=0.25 and p=0.4")
Label text is too small? Use theme()
You can use factors to separate different plots straightforwardly
library(ggplot2) library(MASS) # Contains a lot of extra data sets birthwt1 = birthwt # Copy a birth Wt / risk factor data set birthwt1$smoke = factor(birthwt$smoke) # Make "smoking during preg." a factor ggplot(birthwt1, aes(x=bwt, fill=smoke)) + geom_density(alpha=0.3) + xlab("Birth Weight (g)") + ylab("Distribution Density") + scale_fill_discrete(name="Mom Smoked?", labels=c("No","Yes")) + theme(text=element_text(size=20, family="Times"))
library(ggplot2) library(MASS) # Contains a lot of extra data sets bwt = birthwt$bwt # Get the birth Wt / risk factor vector smoke = as.factor(birthwt$smoke) # Make "smoking during preg." variable a factor MyData = data.frame(bwt,smoke) ggplot(MyData, aes(x=bwt, fill=smoke)) + geom_histogram(aes(y=..density..), binwidth=500, position=position_dodge(width=500), color="black") + xlab("Birth Weight (g)") + ylab("Distribution Density") + scale_fill_discrete(name="Mom Smoked?", labels=c("No","Yes")) + theme(text=element_text(size=20, family="Times"))
You can use stat_density2d to create contour density plots
library(ggplot2) library(gcookbook) ggplot(faithful, aes(x=eruptions, y=waiting)) + stat_density2d(aes(color=..level..), size=1.5) + xlab("Eruption Time (min)") + ylab("Time Between Eruptions (min)") + scale_color_continuous(name="Distribution\nDensity") + ggtitle("Old Gaithful Geyser Eruptions") + theme(text=element_text(size=20, family="Times"))
Use geom_point for scatter plots of numeric values
library(ggplot2) library(MASS) ggplot(Boston,aes(x=age, y=medv, size=crim, color=dis)) + geom_point() + scale_size(range=c(2.5,10)) + xlab("Age of Home") + ylab("Median Home Value (thousands)") + scale_size_continuous(name="Township\nCrime Rate") + scale_color_continuous(name="Distance to\nEmployment") + ggtitle("Houses of Boston") + theme(text=element_text(size=20, family="Times"))
The standard R function pairs allows us to see all pairwise scatter plots
pairs(iris[1:4],pch=19)
If you install the GGally library, you get a ggplot version with ggpairs
library(GGally) ggpairs(iris) + theme(text=element_text(size=20, family="Times"))
library(ggplot2) library(gcookbook) ggplot(uspopage, aes(x=Year, y=Thousands, group=AgeGroup)) + geom_line(aes(color=AgeGroup,size=AgeGroup)) + xlab("Year") + ylab("Number of People in US (thousands)") + theme(text=element_text(size=20, family="Times"))
library(ggplot2) library(gcookbook) Year = uspopage$Year Thousands = uspopage$Thousands AgeGroup = factor(uspopage$AgeGroup,levels=rev(levels(uspopage$AgeGroup))) MyData = data.frame(Year,Thousands,AgeGroup) ggplot(MyData, aes(x=Year, y=Thousands, fill=AgeGroup, order=-as.numeric(AgeGroup))) + geom_area() + scale_fill_grey(start=0.8, end=0) + xlab("Year") + ylab("Number of People in US (thousands)") + theme(text=element_text(size=20, family="Times"))
We can make “grouped” boxplots using dodge
library(ggplot2) ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) + geom_bar(stat="identity", position="dodge", color="white") + scale_fill_brewer(palette="Set1") + theme(text=element_text(size=20, family="Times"))
By default, ggplot wants to stack …
library(ggplot2) ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) + geom_bar(stat="identity", color="white") + scale_fill_brewer(palette="Set1") + theme(text=element_text(size=20, family="Times"))
library(vcd) mosaic(HairEyeColor) + theme(text=element_text(size=20, family="Times"))
## NULL
Florence Nightengale used Coxcomb plots to convince the the Brittish that the biggest threat to their soldiers during the Crimean war were preventable diseases
nightengale = read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/nightengale.csv",header=TRUE) Month = as.Date(paste("01",nightengale$Date),"%d %B %Y") DeathType = factor(nightengale$DeathType,ordered=TRUE) DeathRate = sqrt((1000*nightengale$NumDeaths/nightengale$AvgArmySize)/pi) MyData = data.frame(Month,DeathType,DeathRate) ggplot(MyData, aes(x=Month, y=DeathRate, fill=DeathType, order=as.numeric(DeathType))) + geom_bar(stat="identity") + coord_polar() + scale_x_date(breaks=MyData$Month,labels=format(MyData$Month,"%b %Y")) + theme(text=element_text(size=20, family="Times"))
library(ggplot2) ggplot(iris,aes(x=Species, y=Sepal.Length)) + geom_boxplot(outlier.size=3, notch=TRUE) + ylab("Iris Sepal Length (cm)") + theme(text=element_text(size=20, family="Times"))
import matplotlib.pyplot as plt fig, ax = plt.subplots() fruits = ['apple', 'blueberry', 'cherry', 'orange'] counts = [40, 100, 30, 55] bar_labels = ['red', 'blue', '_red', 'orange'] bar_colors = ['tab:red', 'tab:blue', 'tab:red', 'tab:orange'] ax.bar(fruits, counts, label=bar_labels, color=bar_colors) ax.set_ylabel('fruit supply') ax.set_title('Fruit supply by kind and color') ax.legend(title='Fruit color') plt.show()
import matplotlib.pyplot as plt import numpy as np # Fixing random state for reproducibility np.random.seed(19680801) N = 50 x = np.random.rand(N) y = np.random.rand(N) colors = np.random.rand(N) area = (30 * np.random.rand(N))**2 # 0 to 15 point radii plt.scatter(x, y, s=area, c=colors, alpha=0.5) plt.show()
import matplotlib.pyplot as plt import numpy as np np.random.seed(19680801) fruit_weights = [ np.random.normal(130, 10, size=100), np.random.normal(125, 20, size=100), np.random.normal(120, 30, size=100), ] labels = ['peaches', 'oranges', 'tomatoes'] colors = ['peachpuff', 'orange', 'tomato'] fig, ax = plt.subplots() ax.set_ylabel('fruit weight (g)') bplot = ax.boxplot(fruit_weights, patch_artist=True, # fill with color tick_labels=labels) # will be used to label x-ticks # fill with colors for patch, color in zip(bplot['boxes'], colors): patch.set_facecolor(color) plt.show()