Spring 2020

Outline

  1. Clarity of Data
  2. Trellis Displays
  3. Clarity of Plot Elements
  4. Clarity of Understanding
  5. Scales & Axes
  6. Summary of Principles

Clarity of Data

Emphasize Data

Good plots should:

  • Make the data stand out
  • Make it easy for the reader to decode the data
  • Avoid unnecessary “chart junk”
  • Use large enough plot elements to see and distinguish data

ARL Notebook LibQUAL+ Example

An Alternative

No Gridlines Make Referencing Hard

Prominent Gridlines Can Distract from Data

Emphasize Data Plot Elements

De-emphasize Gridlines

Another Way to De-emphasize Gridlines
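The figure for this slide is not reproduced here. As a minimal sketch of one way to de-emphasize gridlines in ggplot2, the code below keeps the major gridlines but draws them in a light, dotted gray and drops the minor ones. The data, the `p_grid` name, and the specific color and linetype are illustrative assumptions, not the settings used in the original figure.

```r
library(ggplot2)

# Illustrative data only: 15 random points
set.seed(1)
MyData <- data.frame(x = rnorm(15), y = rnorm(15))

# Keep major gridlines for reference, but make them light and dotted
# so the large, dark data points dominate; remove minor gridlines
p_grid <- ggplot(MyData, aes(x, y)) +
  geom_point(size = 5) +
  theme_bw() +
  theme(panel.grid.major = element_line(color = "gray90", linetype = "dotted"),
        panel.grid.minor = element_blank(),
        text = element_text(size = 20))
print(p_grid)
```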

Tufte Suggests Rugs for Reference

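library(ggplot2)

# Random data for illustration only
MyData = data.frame(x=rnorm(15), y=rnorm(15))
# Drop the gridlines; the gray rug marks along each axis serve as the reference
ggplot(MyData,aes(x,y)) + geom_point(size=5) + 
  theme_bw() + 
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        text = element_text(size=20, family="Times")) +
  geom_rug(color="gray")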

Tufte Suggests Implied Grid Lines

library(ggplot2)

# Six states (rows 3 to 8) from the built-in state.x77 matrix
MyData = data.frame(state.x77[3:8,])
MyData$Name = rownames(MyData)

# White horizontal lines drawn over the bars act as implied gridlines
ggplot(MyData, aes(x=reorder(Name,-Area),y=Area)) + 
  geom_bar(stat="identity", fill="darkred") +
  geom_hline(yintercept=seq(from=0,to=150000,by=25000), color="white", size=1.25) +
  theme_bw() + 
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        text = element_text(size=20, family="Times")) +
  xlab("State") +
  ylab("Area (square miles)")

Enrollment in This Class By Department & Degree

Overlapping Plot Elements

  • When there are many overlapping plot elements (points, lines, etc.), plots can become difficult to read
  • Can reduce the size of elements, but only so much
  • Can remove fill color (or use transparency), but this can make distinctions harder
  • Can use different colors or shapes, but again only so much
  • Sometimes we can “jitter” points to make them easier to see, but jittered plots can be harder to understand
  • Sometimes we must separate the data
  • Often it is better to summarize in some way if there’s too much data to make sense of

Overlapping Points Are Hard To Read

WineData <- read.csv("http://eecs.ucf.edu/~wiegand/ids6938/datasets/r-lab1.dat")
ggplot(WineData, aes(Alcohol, WineType)) + 
  geom_point(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

Jittering Points Can Help Some

ggplot(WineData, aes(Alcohol, WineType)) +  
  geom_jitter(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

Combining Ways to Distinguish Can Help Some

ggplot(WineData, aes(Alcohol, WineType, fill=WineType)) +  
  geom_jitter(shape=21, size=5) +
  theme(text = element_text(size=20, family="Times"))

But What is the Objective? Summarize?

ggplot(WineData, aes(y=Alcohol, x=WineType, fill=WineType)) +  
  geom_boxplot() + 
  coord_flip() +
  theme(text = element_text(size=20, family="Times"))

Sometimes You Need to Break Data Up

Sometimes Transparency Helps …

MyData = data.frame(x=rnorm(1500), y=rnorm(1500))
ggplot(MyData, aes(x,y)) +
  geom_point(size=5, alpha=0.4, color="steelblue") +
  theme(text = element_text(size=20, family="Times"))

Or Making Plot Elements Smaller …

ggplot(MyData, aes(x,y)) +
  geom_point(size=2, color="steelblue") +
  theme(text = element_text(size=20, family="Times"))

Shapes & Colors Distinguish …

n=40
MyData = data.frame(x=c(rnorm(n),0.6*rnorm(n)+1,1.1*rnorm(n)-1),
                    y=c(rnorm(n),0.6*rnorm(n)+1,1.1*rnorm(n)-1),
                    Type=c(rep("Thing A",n),rep("Thing B",n),rep("Thing C",n)))
ggplot(MyData, aes(x,y,color=Type,shape=Type)) + geom_point(size=5)

But Sometimes Separation is Best
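The figure for this slide is not reproduced here. One plausible way to separate the three groups from the previous example is to give each group its own panel with `facet_wrap()`. This is a minimal sketch under that assumption; the `p_sep` name and the seed are illustrative.

```r
library(ggplot2)

# Same kind of synthetic three-group data as the shapes and colors example
set.seed(1)
n <- 40
MyData <- data.frame(
  x = c(rnorm(n), 0.6 * rnorm(n) + 1, 1.1 * rnorm(n) - 1),
  y = c(rnorm(n), 0.6 * rnorm(n) + 1, 1.1 * rnorm(n) - 1),
  Type = rep(c("Thing A", "Thing B", "Thing C"), each = n)
)

# One panel per group; facet_wrap() shares the x and y scales by default,
# so positions remain comparable across panels
p_sep <- ggplot(MyData, aes(x, y, color = Type)) +
  geom_point(size = 3) +
  facet_wrap(~ Type) +
  theme(legend.position = "none")  # the panel strips already name the groups
print(p_sep)
```

Shared scales matter here: if each panel had its own axis range, the reader could no longer compare the groups by position.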

Color & Linetype Help Distinguish Line Plots
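As a minimal sketch of this idea, the code below maps the same grouping variable to both color and linetype, so the series stay distinguishable even when printed in grayscale. The three series are made-up curves and the `Lines` and `p_lines` names are illustrative; this is not the data behind the original figure.

```r
library(ggplot2)

# Three made-up series over the same 20 time steps
t <- 1:20
Lines <- data.frame(
  t = rep(t, times = 3),
  value = c(0.5 * t, 10 - 0.3 * t, 5 + 2 * sin(t / 3)),
  Series = rep(c("Series A", "Series B", "Series C"), each = length(t))
)

# Redundant encoding: Series drives both the color and the linetype
p_lines <- ggplot(Lines, aes(x = t, y = value, color = Series, linetype = Series)) +
  geom_line() +
  xlab("Time step") +
  ylab("Value")
print(p_lines)
```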

Again, Sometimes Separation is Best

Trellis Displays

  • Separating data on multiple plots creates a “lookup” problem for the reader when decoding
  • Putting too much data on one plot creates a “clutter” problem for the reader when decoding
  • So sometimes we try to separate plots, but organize the plots so that their scales and axes are the same, well aligned, and close together
  • These are called trellis displays

Trellis Histograms

In ggplot2, we create a trellis using a facet and a formula (for example, smoke ~ . gives one row of panels per level of smoke)

library(MASS)  # provides the birthwt data set

bwt = birthwt$bwt
# Map the 0/1 smoke indicator to a readable label for the facet strips
smoke = factor(c("Mother Didn't Smoke","Mother Smoked")[birthwt$smoke+1])
MyData = data.frame(bwt,smoke)

ggplot(MyData, aes(x=bwt)) +
  geom_histogram(fill="white",color="black", binwidth=500) +
  facet_grid(smoke ~ .) +
  xlab("Birth Weight (grams)") +
  ylab("Count") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Scatterplots

ggplot(mpg, aes(x=displ, y=hwy)) +
  geom_point(size=3) +
  facet_grid(drv ~ class) +
  xlab("Engine Displacement (liters)") +
  ylab("MPG on Highway") + 
  theme(text=element_text(size=20, family="Times"))

Trellis Bar Charts

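# Note: with stat="identity", the bars for all cars of a manufacturer are stacked,
# so each bar's height is the sum of their hwy values, not a typical MPG
ggplot(mpg, aes(x=manufacturer,y=hwy)) + 
  geom_bar(stat="identity") + 
  facet_grid(year ~ .) +
  xlab("Car Manufacturer") +
  ylab("MPG on the Highway") + 
  theme(text=element_text(size=20, family="Times"))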

Trellis of Boxplots

Cylinders=mtcars$cyl
MPG=mtcars$mpg
Gears=factor(paste(mtcars$gear,"Gears"))
MyData = data.frame(MPG,Cylinders,Gears)

ggplot(MyData, aes(x=factor(Cylinders), y=MPG)) + 
  geom_boxplot() + 
  facet_grid(Gears ~ .) +
  xlab("Number of Engine Cylinders") +
  ylab("Average MPG") + 
  theme(text=element_text(size=20, family="Times"))

Clarity of Plot Elements

Use Plot Elements to Clarify

  • Place the things you most wish to compare close together
  • Encode data that requires the most nuanced differentiation with the way we perceive best (e.g., position)
  • Save other, harder-to-perceive elements for numeric data that requires less nuanced differentiation (e.g., area, color)
  • Label all axes, include units where relevant
  • But don’t clutter the plot with labels
  • And don’t let data elements obscure labels
  • Annotate important elements
  • Use legends & keys to clarify distinctions
  • Plots look different when reduced / typeset / projected
  • Proofread your graphs just as you do your prose

Labels are Good, But Clutter is Bad …

Pick Simple Plot Elements

ggplot(data.frame(state.x77), aes(x=Population, y=Income)) + 
  theme(text = element_text(size=20, family="Times")) +  
  geom_text(aes(label=state.abb),hjust=-0.25,size=4) + 
  ggtitle("State Income vs. Population")

Adding Context Supports Your Narrative

  • Adding reference lines, points, or regions can help contextualize your plot
  • Adding annotations provides explicit context
  • All of these mechanisms help focus the reader’s attention on your narrative

Annotate & Reference to Focus Reader

ggplot(longley, aes(x=Year, y=Armed.Forces)) + 
  geom_rect(xmin=1950.2,xmax=1953.5,ymin=140,ymax=370,fill="lightgreen",alpha=0.03) +
  geom_text(x=1951.55,y=366,label="Korean War", color="darkgreen") +
  geom_line(size=1.5) + 
  theme(text=element_text(size=20, family="Times")) + 
  ylab("Number of People in Armed Forces (thousands)") +
  ylim(c(150,360))

Small Font Sizes Frustrate Readers

Annotations & Reference Lines Add Context

library(dplyr)
library(ggplot2)
crime <- read.csv('Crime/fbi-crime-1996-2015.csv', header=TRUE)
ggplot(arrange(crime, Year), aes(x=Year, y=Murder.and..nonnegligent..manslaughter..rate.)) +
    geom_line(size=1.35, col="darkblue") +
    geom_point(size=4, shape=21, fill="white", color="darkblue") +
    ylab("Murder & Manslaughter Rate in U.S (per 100K people)") +
    theme(text=element_text(family="Times", size=16)) +
    geom_hline(yintercept=7.7, linetype="dashed", color="black") +
    annotate("text",2005,7.75,vjust=0,label="Murder Rate of Bolivia, 2011", color="black") +        
    geom_hline(yintercept=3.7, linetype="dashed", color="black") +
    annotate("text",2005,3.75,vjust=0,label="Murder Rate of Chile, 2011", color="black")  +
    annotate("text", 1996, 4.5, hjust=0, label="The U.S. Murder Rate was the lowest in the last\n100 years in 1955, when it was 4.5 per 100K people") +
    ggtitle("Murder in the U.S. is at an Historic Low")

Clarity of Understanding

Make Sure Visualizations Do Not Add Inaccuracies

  • Draw data to the correct / consistent scale
  • Make sure there are no computational inaccuracies
  • Don’t make the reader do math your computer could have done for you
  • Properly baseline visualizations when possible
  • When grouping, be consistent with order, color, and other elements

Optimal Quantitative Scales

  • When using a bar graph, begin the scale at zero and end the scale a little above the highest value
    • Recall that bar graphs use both length and position to encode a number
    • When it isn’t properly baselined, these do not agree, which creates a cognitive mismatch
    • Can mislead readers because pre-attentively, they fixate on the differences between lengths … though length is no longer an accurate encoding
  • With other types of graph, begin the scale a little below the minimum and end it a little above the maximum
  • Make intervals on the scale easy numbers to understand (e.g., round numbers)
  • Good visualization software will make following such rules easy
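The bar graph rules above can be sketched in ggplot2 as follows. This is one possible approach, not the only one: the y scale starts at zero, ends at the first multiple of 25 above the largest value, and uses round-number breaks. The six states reuse the earlier state.x77 example; the `upper` and `p_bar` names and the step of 25 (thousand square miles) are illustrative assumptions.

```r
library(ggplot2)

# Six states (rows 3 to 8) from the built-in state.x77 matrix
MyData <- data.frame(state.x77[3:8, ])
MyData$Name <- rownames(MyData)
MyData$AreaK <- MyData$Area / 1000  # area in thousands of square miles

# Round upper limit: the first multiple of 25 at or above the largest value
upper <- ceiling(max(MyData$AreaK) / 25) * 25

p_bar <- ggplot(MyData, aes(x = reorder(Name, -AreaK), y = AreaK)) +
  geom_col() +
  scale_y_continuous(limits = c(0, upper),             # baseline at zero
                     breaks = seq(0, upper, by = 25),  # round-number intervals
                     expand = c(0, 0)) +               # no padding below zero
  xlab("State") +
  ylab("Area (thousands of square miles)")
print(p_bar)
```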

Spot the Problem

Pie Chart of Loans

Which is Right?

Wait. What?

Positive, but negative bar chart

What Are We Comparing?

Which Direction?

Where to Start?

More Examples of Bad Viz

Scales & Axes

Aspect Ratio for Line Plots
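The figure for this slide is not reproduced here. As a minimal sketch, assuming the point is that the height-to-width ratio of a panel changes the slopes the reader perceives, the code below draws the same line plot at two aspect ratios using `theme(aspect.ratio = ...)`. The built-in `sunspot.year` series and the two ratio values are illustrative choices.

```r
library(ggplot2)

# Built-in yearly sunspot series, converted from a ts object to a data frame
Spots <- data.frame(Year = as.numeric(time(sunspot.year)),
                    Sunspots = as.numeric(sunspot.year))

p_base <- ggplot(Spots, aes(x = Year, y = Sunspots)) +
  geom_line() +
  ylab("Yearly sunspot number")

# aspect.ratio is the panel height divided by its width
p_square <- p_base + theme(aspect.ratio = 1)     # tall panel: steep, crowded cycles
p_wide   <- p_base + theme(aspect.ratio = 0.15)  # short, wide panel: gentler slopes
print(p_square)
print(p_wide)
```

Comparing the two renderings side by side is a quick way to check whether the shape of individual rises and falls is easier to judge at one ratio than the other.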

Baseline Correctly, But Context Matters

  • Bar plots should almost always be baselined at zero
    (length / position perceptions are similar only when the baseline is zero)
  • Unless there is a good reason, baseline at zero for other plots, as well
  • Good reasons to baseline at a different value (for non-bar plots):
    • The range of the data hides relevant changes because of its magnitude
    • Length encodings (e.g., confidence intervals or error bars) are obscured because of data magnitude
    • Baseline value has some contextual meaning in the narrative (e.g., there is a “natural” baseline in the data)
  • Always make it clear to the reader when a graph is not baselined at zero
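For non-bar plots, ggplot2 fits the axis to the data range by default, so the baseline is a choice the author has to make explicitly. The sketch below shows two options on the longley data used earlier: forcing the axis to include zero with `expand_limits()`, or keeping the tighter range and saying so in a caption. The object names and the caption wording are illustrative assumptions.

```r
library(ggplot2)

# Default: the y axis is fitted to the data range and does not include zero
p_default <- ggplot(longley, aes(x = Year, y = Armed.Forces)) +
  geom_line() +
  ylab("Number of People in Armed Forces (thousands)")

# Option 1: baseline at zero by forcing the y axis to include 0
p_zero <- p_default + expand_limits(y = 0)

# Option 2: keep the tighter range, but tell the reader about it
p_flagged <- p_default +
  labs(caption = "Note: the vertical axis does not start at zero")

print(p_zero)
print(p_flagged)
```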

Dealing with Highly Skewed Data

When data contains a few very extreme values, it can be difficult to display data appropriately. Things to consider:

  • Consider separating long-tail values into a separate plot
  • Consider using a logarithmic scale instead
  • Try not to break the scale because it can lead to data misrepresentation
  • If you must, then make the break as easy as possible for the reader to decode

Logarithmic Scales

  • “Unskews” data with large values
  • Helps spread out data with long-tails by bringing the tails in and using more of the plotting space for the high-density data
  • But requires a bit more sophisticated understanding from the reader
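A minimal sketch of a logarithmic scale in ggplot2, using the built-in state.x77 data, where a few states have much larger areas and populations than the rest. The axis labels state the log scale explicitly, since the reader needs that to decode the plot. The choice of variables, the `States` and `p_log` names, and the reading of Population as thousands of people are assumptions made for illustration.

```r
library(ggplot2)

States <- data.frame(state.x77)

# Both variables are strictly positive, so a log10 scale is well defined
p_log <- ggplot(States, aes(x = Area, y = Population)) +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Area (square miles, log scale)") +
  ylab("Population (thousands, log scale)")
print(p_log)
```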

Making Scale Breaks Very Obvious

Don’t Do This

Typically Dual-Axes Are a Bad Idea

  • Can be confusing to decode
  • It’s easy to mislead reader about the scale
  • Difficult to accomplish in most plotting tools
  • Typically it is easier to separate data or repeat a plot in a different scale
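As a sketch of the "separate the data" alternative, the code below stacks two measures with different units in aligned panels that share the x axis but have independent y scales, instead of overlaying them on dual axes. It uses the `economics` data set that ships with ggplot2; the choice of the two measures and the `Long` and `p_stack` names are illustrative.

```r
library(ggplot2)

# Reshape two columns with different units into long form (base R only)
Long <- rbind(
  data.frame(date = economics$date,
             Measure = "Personal savings rate (%)",
             Value = economics$psavert),
  data.frame(date = economics$date,
             Measure = "Median unemployment duration (weeks)",
             Value = economics$uempmed)
)

# One panel per measure, stacked in a single column: the x axis is shared
# and aligned, while each panel gets its own y scale
p_stack <- ggplot(Long, aes(x = date, y = Value)) +
  geom_line() +
  facet_wrap(~ Measure, ncol = 1, scales = "free_y") +
  xlab("Date") +
  ylab(NULL)  # the units are given in each panel's strip label
print(p_stack)
```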

Using Two Scales for Same Data Can Be Useful
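The figure for this slide is not reproduced here. One case where two scales are generally safe is when the second axis is an exact unit conversion of the first, so both axes describe the same data. The sketch below shows this with `sec_axis()` on the built-in mtcars data; the conversion factor from miles per US gallon to kilometers per liter (about 0.425144) and the `p_two` name are supplied here for illustration.

```r
library(ggplot2)

# 1 mile per US gallon = 1.609344 km / 3.785411784 L, about 0.425144 km per liter
mpg_to_kml <- 0.425144

p_two <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3) +
  scale_y_continuous(
    name = "Fuel economy (miles per US gallon)",
    # The secondary axis is a fixed transformation of the primary axis
    sec.axis = sec_axis(~ . * mpg_to_kml, name = "Fuel economy (km per liter)")
  ) +
  xlab("Weight (1000 lbs)")
print(p_two)
```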

Very Careful Consideration is Needed

Summary of Principles

Some General Strategies

  • Good data visualization is not an afterthought, but takes thought and attention
  • Producing a graph is an iterative process: proofread and correct
  • Try to create graphs such that the main message supporting the narrative is very easy to decode
  • Use visual elements to maximize data content
  • Choose grouping and encoding mechanisms that focus on your narrative and maximize differentiation
  • Avoid misrepresenting the data numerically or visually