General Principles, Scales & Axes

Spring 2020

Outline

Clarity of Data
Trellis Displays
Clarity of Plot Elements
Clarity of Understanding
Scales & Axes
Summary of Principles

Clarity of Data

Emphasize Data

Good plots should:

Make the data stand out
Make it easy for the reader to decode the data
Avoid unnecessary “chart junk”
Use large enough plot elements to see and distinguish data

ARL Notebook LibQUAL+ Example

An Alternative

No Gridlines Make Referencing Hard

Prominent Gridlines Can Distract from Data

Emphasize Data Plot Elements

De-emphasize Gridlines
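
One way to de-emphasize gridlines in ggplot2 is to keep them but draw them in a light, dotted style so that the data points dominate. The following is only a sketch: the data are random, and the name `p_grid` and the gray shade are illustrative choices, not the settings used for the figure on this slide.

```r
library(ggplot2)

# Illustrative random data (same shape as the rug example in these slides)
set.seed(1)
MyData <- data.frame(x = rnorm(15), y = rnorm(15))

p_grid <- ggplot(MyData, aes(x, y)) +
  geom_point(size = 5) +                      # large, dark points stand out
  theme_bw() +
  theme(panel.grid.major = element_line(color = "gray90", linetype = "dotted"),
        panel.grid.minor = element_blank(),   # drop the minor grid entirely
        text = element_text(size = 20))

print(p_grid)
```

The gridlines remain available for reference, but they sit visually behind the data.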

Another Way to De-emphasize Gridlines

Tufte Suggests Rugs for Reference

library(ggplot2)

MyData = data.frame(x=rnorm(15), y=rnorm(15))
ggplot(MyData,aes(x,y)) + geom_point(size=5) + 
  theme_bw() + 
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        text = element_text(size=20, family="Times")) +
  geom_rug(color="gray")

Tufte Suggests Implied Grid Lines

library(dplyr)

MyData = data.frame(state.x77[3:8,])
MyData$Name = rownames(MyData)

ggplot(MyData, aes(x=reorder(Name,-Area),y=Area)) + 
  geom_bar(stat="identity", fill="darkred") +
  geom_hline(yintercept=seq(from=0,to=150000,by=25000), color="white", size=1.25) +
  theme_bw() + 
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        text = element_text(size=20, family="Times")) +
  xlab("State") +
  ylab("Area (square miles)")

Enrollment in This Class By Department & Degree

Overlapping Plot Elements

When there are many overlapping plot elements (points, lines, etc.), plots can become difficult to read
Can reduce the size of elements, but only so much
Can remove fill color (or use transparency), but this can make distinctions harder
Can use different colors or shapes, but again only so much
Sometimes we can “jitter” points to make them easier to see, but jittered plots can be harder to understand
Sometimes we must separate the data
Often it is better to summarize in some way if there’s too much data to make sense of

Overlapping Points Are Hard To Read

WineData <- read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/r-lab1.dat")
ggplot(WineData, aes(Alcohol, WineType)) + 
  geom_point(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

Jittering Points Can Help Some

ggplot(WineData, aes(Alcohol, WineType)) +  
  geom_jitter(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

Combining Ways to Distinguish Can Help Some

ggplot(WineData, aes(Alcohol, WineType, fill=WineType)) +  
  geom_jitter(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

But What is the Objective? Summarize?

ggplot(WineData, aes(y=Alcohol, x=WineType, fill=WineType)) +  
  geom_boxplot() + 
  coord_flip() +
  theme(text = element_text(size=20, family="Times"))

Sometimes You Need to Break Data Up

Sometimes Transparency Helps …

MyData = data.frame(x=rnorm(1500), y=rnorm(1500))
ggplot(MyData, aes(x,y)) +
  geom_point(size=5, alpha=0.4, color="steelblue") +
  theme(text = element_text(size=20, family="Times"))

Or Making Plot Elements Smaller …

ggplot(MyData, aes(x,y)) +
  geom_point(size=2, color="steelblue") +
  theme(text = element_text(size=20, family="Times"))

Shapes & Colors Distinguish …

n=40
MyData = data.frame(x=c(rnorm(n),0.6*rnorm(n)+1,1.1*rnorm(n)-1),
                    y=c(rnorm(n),0.6*rnorm(n)+1,1.1*rnorm(n)-1),
                    Type=c(rep("Thing A",n),rep("Thing B",n),rep("Thing C",n)))
ggplot(MyData, aes(x,y,color=Type,shape=Type)) + geom_point(size=5)

But Sometimes Separation is Best
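
A minimal sketch of separating the three groups from the previous example into side-by-side panels. It regenerates the same kind of random data; the name `p_sep` and the seed are illustrative assumptions.

```r
library(ggplot2)

# Same structure as the "Thing A/B/C" data above, regenerated with a fixed seed
set.seed(42)
n <- 40
MyData <- data.frame(
  x = c(rnorm(n), 0.6 * rnorm(n) + 1, 1.1 * rnorm(n) - 1),
  y = c(rnorm(n), 0.6 * rnorm(n) + 1, 1.1 * rnorm(n) - 1),
  Type = rep(c("Thing A", "Thing B", "Thing C"), each = n)
)

p_sep <- ggplot(MyData, aes(x, y, color = Type, shape = Type)) +
  geom_point(size = 3) +
  facet_grid(. ~ Type) +               # one panel per group, shared axes
  theme(legend.position = "none")      # panel strips already label the groups

print(p_sep)
```

Because the panels share the same x and y scales, the groups can still be compared by position.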

Color & Linetype Help Distinguish Line Plots
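
As a sketch of this idea, the code below maps both color and linetype to the same grouping variable. It uses R's built-in `Orange` dataset as a stand-in, since the data behind the slide's figure is not shown; `OrangeData` and `p_lines` are illustrative names.

```r
library(ggplot2)

# Built-in dataset: circumference of 5 orange trees measured at several ages
OrangeData <- data.frame(
  age = Orange$age,
  circumference = Orange$circumference,
  Tree = factor(as.character(Orange$Tree))   # plain (unordered) factor
)

p_lines <- ggplot(OrangeData,
                  aes(x = age, y = circumference,
                      color = Tree, linetype = Tree)) +
  geom_line() +
  xlab("Age (days)") +
  ylab("Trunk Circumference (mm)")

print(p_lines)
```

Encoding the group twice (color and linetype) keeps the lines distinguishable even when the plot is printed in grayscale.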

Again, Sometimes Separation is Best

Trellis Displays

Separating data on multiple plots creates a “lookup” problem for the reader when decoding
Putting too much data on one plot creates a “clutter” problem for the reader when decoding
So sometimes we separate the data into several plots, but organize the plots so that their scales and axes are the same, well aligned, and close together
These are called trellis displays

Trellis Histograms

In ggplot2 we create a trellis by adding a facet, specified with a model formula (e.g., smoke ~ .)

library(MASS)  # provides the birthwt dataset

bwt = birthwt$bwt
smoke = factor(c("Mother Didn't Smoke","Mother Smoked")[birthwt$smoke+1])
MyData = data.frame(bwt,smoke)

ggplot(MyData, aes(x=bwt)) +
  geom_histogram(fill="white",color="black", binwidth=500) +
  facet_grid(smoke ~ .) +
  xlab("Birth Weight") +
  ylab("Count") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Scatterplots

ggplot(mpg, aes(x=displ, y=hwy)) +
  geom_point(size=3) +
  facet_grid(drv ~ class) +
  xlab("Engine Displacement (liters)") +
  ylab("MPG on Highway") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Bar Charts

ggplot(mpg, aes(x=manufacturer,y=hwy)) + 
  geom_bar(stat="identity") + 
  facet_grid(year ~ .) +
  xlab("Car Manufacturer") +
  ylab("MPG on the Highway") + 
  theme(text=element_text(size=20, family="Times"))

Trellis of Boxplots

Cylinders=mtcars$cyl
MPG=mtcars$mpg
Gears=factor(paste(mtcars$gear,"Gears"))
MyData = data.frame(MPG,Cylinders,Gears)

ggplot(MyData, aes(x=factor(Cylinders), y=MPG)) + 
  geom_boxplot() + 
  facet_grid(Gears ~ .) +
  xlab("Number of Engine Cylinders") +
  ylab("Average MPG") + 
  theme(text=element_text(size=20, family="Times"))

Clarity of Plot Elements

Use Plot Elements to Clarify

Place the things you most wish to compare close together
Encode data that requires the most nuanced differentiation with the way we perceive best (e.g., position)
Save other, harder-to-perceive elements (e.g., area, color) for numeric data that requires less nuanced differentiation
Label all axes, include units where relevant
But don’t clutter the plot with labels
And don’t let data elements obscure labels
Annotate important elements
Use legends & keys to clarify distinctions
Plots look different when reduced / typeset / projected
Proofread your graphs just as you do your prose

Labels are Good, But Clutter is Bad …

Pick Simple Plot Elements

ggplot(data.frame(state.x77), aes(x=Population, y=Income)) + 
  theme(text = element_text(size=20, family="Times")) +  
  geom_text(aes(label=state.abb),hjust=-0.25,size=4) + 
  ggtitle("State Income vs. Population")

Adding Context Supports Your Narrative

Adding reference lines, points, or regions can help contextualize your plot
Adding annotations provides explicit context
All of these mechanisms help focus the reader’s attention on your narrative

Annotate & Reference to Focus Reader

ggplot(longley, aes(x=Year, y=Armed.Forces)) + 
  geom_rect(xmin=1950.2,xmax=1953.5,ymin=140,ymax=370,fill="lightgreen",alpha=0.03) +
  geom_text(x=1951.55,y=366,label="Korean War", color="darkgreen") +
  geom_line(size=1.5) + 
  theme(text=element_text(size=20, family="Times")) + 
  ylab("Number of People in Armed Forces (thousands)") +
  ylim(c(150,360))

Small Font Sizes Frustrate Readers

Annotations & Reference Lines Add Context

library(dplyr)
library(ggplot2)
crime <- read.csv('Crime/fbi-crime-1996-2015.csv', header=TRUE)
ggplot(arrange(crime, Year), aes(x=Year, y=Murder.and..nonnegligent..manslaughter..rate.)) +
    geom_line(size=1.35, col="darkblue") +
    geom_point(size=4, shape=21, fill="white", color="darkblue") +
    ylab("Murder & Manslaughter Rate in U.S. (per 100K people)") +
    theme(text=element_text(family="Times", size=16)) +
    geom_hline(yintercept=7.7, linetype="dashed", color="black") +
    annotate("text",2005,7.75,vjust=0,label="Murder Rate of Bolivia, 2011", color="black") +        
    geom_hline(yintercept=3.7, linetype="dashed", color="black") +
    annotate("text",2005,3.75,vjust=0,label="Murder Rate of Chile, 2011", color="black")  +
    annotate("text", 1996, 4.5, hjust=0, label="The U.S. Murder Rate was the lowest in the last\n100 years in 1955, when it was 4.5 per 100K people") +
    ggtitle("Murder in the U.S. is at an Historic Low")

Clarity of Understanding

Make Sure Visualizations Do Not Add Inaccuracies

Draw data to the correct / consistent scale
Make sure there are no computational inaccuracies
Don’t make the reader do math your computer could have done for you
Properly baseline visualizations when possible
When grouping, be consistent with order, color, and other elements

Optimal Quantitative Scales

When using a bar graph, begin the scale at zero and end the scale a little above the highest value
- Recall that bar graphs use both length and position to encode a number
- When it isn’t properly baselined, these do not agree, which creates a cognitive mismatch
- Can mislead readers because pre-attentively, they fixate on the differences between lengths … though length is no longer an accurate encoding
With other types of graph, begin the scale a little below the lowest value and end it a little above the highest value
Make intervals on the scale easy numbers to understand (e.g., round numbers)
Good visualization software will make following such rules easy
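
The mismatch between length and position can be made concrete with a little arithmetic. The sketch below uses two hypothetical values (95 and 100) and a hypothetical truncated baseline of 90; all object names are illustrative.

```r
library(ggplot2)

# Hypothetical values, chosen only to make the arithmetic easy to follow
Sales <- data.frame(Group = c("A", "B"), Value = c(95, 100))

# Ratio of the actual values: B is only about 5% larger than A
true_ratio <- Sales$Value[2] / Sales$Value[1]

# Ratio of the drawn bar lengths when the axis starts at 90 instead of 0
baseline <- 90
drawn_ratio <- (Sales$Value[2] - baseline) / (Sales$Value[1] - baseline)

# Properly baselined bar chart: bar length and position agree
p_zero <- ggplot(Sales, aes(x = Group, y = Value)) +
  geom_bar(stat = "identity") +
  ylab("Value")

# Truncated version (for comparison only): B's bar is drawn twice as long as A's
p_truncated <- p_zero + coord_cartesian(ylim = c(baseline, 100))

print(p_zero)
print(p_truncated)
```

With the truncated axis, bar B is drawn twice as long as bar A even though its value is only about 1.05 times as large. `coord_cartesian()` is used here because it zooms the view; setting scale limits with `ylim()` would instead drop bars that extend outside the limits.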

Spot the Problem

Pie Chart of Loans

Which is Right?

Wait. What?

Positive, but negative bar chart

What Are We Comparing?

Which Direction?

Where to Start?

More Examples of Bad Viz

Scales & Axes

Aspect Ratio for Line Plots
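
A common guideline for line plots (often attributed to Cleveland as "banking to 45 degrees") is to pick an aspect ratio at which typical line segments are neither nearly flat nor nearly vertical. The sketch below shows one way to control the panel aspect ratio in ggplot2, using R's built-in `sunspot.year` series as a stand-in; the ratio of 0.1 and the names `Sun` and `p_wide` are illustrative assumptions, not values from the slide.

```r
library(ggplot2)

# Built-in yearly sunspot counts, converted from a time series to a data frame
Sun <- data.frame(Year = as.numeric(time(sunspot.year)),
                  Spots = as.numeric(sunspot.year))

p_default <- ggplot(Sun, aes(x = Year, y = Spots)) +
  geom_line() +
  ylab("Yearly Sunspot Count")

# A wide, short panel (height / width = 0.1) makes the slopes less steep,
# which can make the shape of each cycle easier to see
p_wide <- p_default + theme(aspect.ratio = 0.1)

print(p_default)
print(p_wide)
```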

Baseline Correctly, But Context Matters

Bar plots should almost always be baselined at zero
(length / position perceptions are similar only when the baseline is zero)
Unless there is a good reason, baseline at zero for other plots, as well
Good reasons to baseline at a different value (for non-bar plots):
- The range of the data hides relevant changes because of its magnitude
- Length encodings (e.g., confidence intervals or error bars) are obscured because of data magnitude
- Baseline value has some contextual meaning in the narrative (e.g., there is a “natural” baseline in the data)
Always make it clear to the reader when a graph is not baselined at zero
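
For non-bar plots, ggplot2 picks axis limits from the data range by default, so a zero baseline has to be requested explicitly. The sketch below shows both choices on the `longley` data used earlier in these slides; the caption wording and the object names are illustrative.

```r
library(ggplot2)

# Default limits: the y-axis spans only the range of the data
p_default <- ggplot(longley, aes(x = Year, y = Armed.Forces)) +
  geom_line() +
  ylab("Number of People in Armed Forces (thousands)") +
  labs(caption = "Note: y-axis does not start at zero")  # tell the reader

# Zero-baselined version: expand_limits() forces 0 into the y scale
p_zero <- ggplot(longley, aes(x = Year, y = Armed.Forces)) +
  geom_line() +
  ylab("Number of People in Armed Forces (thousands)") +
  expand_limits(y = 0)

print(p_default)
print(p_zero)
```

Which version is preferable depends on the narrative: the first shows year-to-year changes more clearly, while the second shows their size relative to the total.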

Dealing with Highly Skewed Data

When data contains a few very extreme values, it can be difficult to display data appropriately. Things to consider:

Consider separating long-tail values into a separate plot
Consider using a logarithmic scale instead
Try not to break the scale because it can lead to data misrepresentation
If you must, then make the break as easy as possible for the reader to decode

Logarithmic Scales

“Unskews” data with large values
Helps spread out data with long tails by bringing the tails in and using more of the plotting space for the high-density data
But requires a bit more sophisticated understanding from the reader
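
A minimal sketch of the effect, using simulated log-normal data as a stand-in for a real long-tailed variable. The variable name `Income`, the distribution parameters, and the seed are all assumptions made for illustration.

```r
library(ggplot2)

# Simulated right-skewed data: most values are small, a few are very large
set.seed(7)
Skewed <- data.frame(Income = rlnorm(500, meanlog = 10, sdlog = 1.2))

# Linear scale: most of the data is squeezed into the first few bins
p_linear <- ggplot(Skewed, aes(x = Income)) +
  geom_histogram(bins = 30, fill = "white", color = "black") +
  ylab("Count")

# Log scale: the long tail is pulled in and the dense region is spread out
p_log <- p_linear + scale_x_log10()

print(p_linear)
print(p_log)
```

On the log scale, equal distances along the axis represent equal ratios rather than equal differences, which is the extra step the reader has to keep in mind.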

Making Scale Breaks Very Obvious

Don’t Do This

Typically Dual-Axes Are a Bad Idea

Can be confusing to decode
It’s easy to mislead the reader about the scale
Difficult to accomplish in most plotting tools
Typically it is easier to separate data or repeat a plot in a different scale

Using Two Scales for Same Data Can Be Useful
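
One case where a second axis is less risky is when both axes describe the same data in different units, so that one is a fixed transformation of the other. The sketch below reuses the state-area bar chart from earlier and adds a secondary axis in square kilometers; the choice of example and the name `p_two` are illustrative, not taken from the slide's figure.

```r
library(ggplot2)

# Same six states as the implied-gridlines example
MyData <- data.frame(state.x77[3:8, ])
MyData$Name <- rownames(MyData)

p_two <- ggplot(MyData, aes(x = reorder(Name, -Area), y = Area)) +
  geom_bar(stat = "identity", fill = "darkred") +
  scale_y_continuous(
    name = "Area (square miles)",
    # 1 square mile is about 2.589988 square kilometers
    sec.axis = sec_axis(~ . * 2.589988, name = "Area (square km)")
  ) +
  xlab("State")

print(p_two)
```

Because the right-hand axis is a one-to-one rescaling of the left-hand axis, every bar has a single, unambiguous reading on either side.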

Very Careful Consideration is Needed

Summary of Principles

Some General Strategies

Good data visualization is not an afterthought, but takes thought and attention
Producing a graph is an iterative process: proofread and correct
Try to create graphs such that the main message supporting the narrative is very easy to decode
Use visual elements to maximize data content
Choose grouping and encoding mechanisms that focus on your narrative and maximize differentiation
Avoid misrepresenting the data numerically or visually