In his book “The Visual Display of Quantitative Information”, Edward R. Tufte introduces the term “data-ink”. He defines it as the “non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented”. Conversely, non-data-ink comprises all those parts of a graphic that are not directly related to its core information. He also points out that some ink may technically be data-ink but redundant. For example, many bar plots that already include an axis with labeled tick marks also state the heights of the bars explicitly with additional text above the bars. Tufte advocates keeping the following principles in mind when creating plots and visualizations of data: above all else show the data, maximize the data-ink ratio, erase non-data-ink, erase redundant data-ink, and revise and edit.
In his book, Tufte goes over many different types of plots and
illustrates how they can be simplified to satisfy these principles. In
this post, we will consider three types of plots: scatter plots, bar
plots and box plots. We will first have a look at how these plots are
created by the base R plot commands, i.e. plot(),
barplot() and boxplot(), when the user does
not play around with any of their numerous parameters and arguments.
Then, we will try to tweak these plots to make them look like those in Tufte’s
book. Lastly, I will add what I think is the sweet spot between the
two.
# Scatter plot: simulated data and the base R default
set.seed(1234)
n <- 50
x <- 1:n + rnorm(n, 0, 10)
y <- 1:n + rnorm(n, 0, 10)
plot(
  x = x,
  y = y
)
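# Scatter plot, Tufte style: no box, no annotation, and axis lines
# that span only the range of the data
plot(
  x = x,
  y = y,
  pch = 16,
  ann = FALSE,
  axes = FALSE,
  frame.plot = FALSE
)
axis(
  side = 1,
  at = range(x),
  lwd.ticks = 0,  # full argument name; suppresses the tick marks
  labels = FALSE
)
axis(
  side = 2,
  at = range(y),
  lwd.ticks = 0,
  labels = FALSE
)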
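One possible extension of the plot above is to label the two ends of each axis with the minimum and maximum, so that the frame itself reports the range of the data. The following is only a minimal sketch of that idea, not code taken from Tufte's book; the helper name range_axis() and the rounding to whole numbers are assumptions made for illustration.

```r
set.seed(1234)
n <- 50
x <- 1:n + rnorm(n, 0, 10)
y <- 1:n + rnorm(n, 0, 10)

# Hypothetical helper: draw an axis line that spans only the data range
# and label its two end points with the rounded minimum and maximum.
range_axis <- function(side, values, digits = 0) {
  ends <- range(values)
  axis(
    side = side,
    at = ends,
    labels = round(ends, digits),
    lwd.ticks = 0,  # keep the axis line, drop the tick marks
    las = 1
  )
  invisible(ends)
}

plot(x = x, y = y, pch = 16, ann = FALSE, axes = FALSE)
x_ends <- range_axis(side = 1, values = x)
y_ends <- range_axis(side = 2, values = y)
```

Whether the extra labels are worth their ink depends on whether the reader needs the actual magnitudes or only the shape of the point cloud.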
# Scatter plot, compromise: no box, but labeled axes with round tick values
plot(
  x = x,
  y = y,
  pch = 16,
  xlim = c(-20, 60),
  ylim = c(-20, 80),
  ann = FALSE,
  axes = FALSE,
  frame.plot = FALSE
)
axis(
  side = 1,
  las = 1,
  at = seq(-20, 60, 20)
)
axis(
  side = 2,
  las = 1,
  at = seq(-20, 80, 20)
)
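The limits and tick positions in the plot above are hard-coded for this particular sample. A minimal sketch of how they could be derived from the data instead, assuming that the tick positions suggested by pretty() are acceptable (the names x_ticks and y_ticks are chosen for illustration only):

```r
set.seed(1234)
n <- 50
x <- 1:n + rnorm(n, 0, 10)
y <- 1:n + rnorm(n, 0, 10)

# pretty() returns round tick positions that cover the range of the data
x_ticks <- pretty(x)
y_ticks <- pretty(y)

plot(
  x = x,
  y = y,
  pch = 16,
  xlim = range(x_ticks),
  ylim = range(y_ticks),
  ann = FALSE,
  axes = FALSE
)
axis(side = 1, las = 1, at = x_ticks)
axis(side = 2, las = 1, at = y_ticks)
```

This keeps the look of the plot but avoids having to adjust xlim, ylim and at by hand whenever the data change.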
# Bar plot: simulated data and the base R default
set.seed(1234)
n <- 10
x <- runif(n, 0, 100)
barnames <- LETTERS[1:n]
barplot(
  height = x,
  names.arg = barnames
)
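# Bar plot, Tufte style: no borders, no axis line, and white
# horizontal grid lines drawn across the bars
barplot(
  height = x,
  names.arg = barnames,
  border = NA,
  axes = FALSE
)
grid(
  nx = 0,      # no vertical grid lines
  ny = NULL,   # horizontal lines at the default tick positions
  col = "white",
  lty = 1,
  lwd = 2
)
axis(
  side = 2,
  lty = 0,     # labels only, no axis line or ticks
  las = 1
)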
# Bar plot, compromise: no borders, but a regular y axis
barplot(
  height = x,
  names.arg = barnames,
  border = NA,
  axes = FALSE
)
axis(
  side = 2,
  las = 1
)
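The opening paragraph mentions bar plots that print each bar's height above the bar even though a labeled axis is already present. If the exact values matter, one option is to keep the direct labels and drop the axis instead, so that the heights are encoded only once. The following is a rough sketch that relies on barplot() returning the bar midpoints; the 10% headroom in ylim, the rounding and the label size are arbitrary choices.

```r
set.seed(1234)
n <- 10
x <- runif(n, 0, 100)
barnames <- LETTERS[1:n]

# barplot() invisibly returns the x positions of the bar midpoints
mids <- barplot(
  height = x,
  names.arg = barnames,
  border = NA,
  axes = FALSE,
  ylim = c(0, max(x) * 1.1)  # headroom so the top label is not clipped
)

# Print each rounded height just above its bar; no y axis is drawn
text(x = mids, y = x, labels = round(x), pos = 3, cex = 0.8)
```

With many bars the labels become crowded, in which case the axis-only version above is probably the better trade-off.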
# Box plot: simulated data from several distributions and the base R default
set.seed(1234)
n <- 100
dat <- data.frame(
  A = runif(n, 0, 100),
  B = rchisq(n, 20),
  C = rnorm(n, 50, 20),
  D = rlogis(n, 50, 10),
  E = rlnorm(n, 3, .5),
  F = runif(n, 30, 90),
  G = rnorm(n, 70, 10)
)
boxplot(dat)
# Box plot, Tufte style: the box is hidden, the median is a dot,
# and only the whiskers remain as lines
boxplot(
  dat,
  las = 1,
  axes = FALSE,
  frame.plot = FALSE,
  pars = list(
    whisklty = 1,       # whisker: line type
    staplelty = 0,      # staple: line type
    boxcol = "white",   # box: border color
    boxfill = "white",  # box: fill color
    medlty = 0,         # median: line type
    medpch = 16,        # median: point type
    outpch = 1,         # outlier: point type
    outcex = 0.7        # outlier: point size
  )
)
axis(
  side = 1,
  lty = 0,
  at = 1:ncol(dat),
  labels = colnames(dat)
)
axis(
  side = 2,
  lty = 0,
  las = 1,
  at = seq(0, 100, 20)
)
# Box plot, compromise: slim boxes, a thin median line,
# and groups sorted by their median
reorder <- order(sapply(dat, median))
boxplot(
  dat[reorder],
  axes = FALSE,
  frame.plot = FALSE,
  pars = list(
    whisklty = 1,       # whisker: line type
    staplelty = 0,      # staple: line type
    boxfill = "white",  # box: fill color
    boxwex = 0.3,       # box: width (default = 0.8)
    medlwd = 1,         # median: line width
    outpch = 1,         # outlier: point type
    outcex = 0.7        # outlier: point size
  )
)
axis(
  side = 1,
  lty = 0,
  at = 1:ncol(dat),
  labels = colnames(dat)[reorder]
)
axis(
  side = 2,
  las = 1,
  at = seq(0, 100, 20)
)
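However much ink is removed, each of the box plots above still encodes the same numbers per group: the ends of the two whiskers, the two hinges and the median, plus any outliers. As a quick check, boxplot() can return these values without drawing anything when it is called with plot = FALSE. The sketch below recreates the simulated data; the object name bp and the row labels are chosen for illustration.

```r
set.seed(1234)
n <- 100
dat <- data.frame(
  A = runif(n, 0, 100),
  B = rchisq(n, 20),
  C = rnorm(n, 50, 20),
  D = rlogis(n, 50, 10),
  E = rlnorm(n, 3, .5),
  F = runif(n, 30, 90),
  G = rnorm(n, 70, 10)
)

# plot = FALSE computes the statistics but draws nothing
bp <- boxplot(dat, plot = FALSE)

# One column per group; the rows are lower whisker, lower hinge,
# median, upper hinge and upper whisker
rownames(bp$stats) <- c("whisker_lo", "hinge_lo", "median", "hinge_hi", "whisker_hi")
colnames(bp$stats) <- bp$names
print(round(bp$stats, 1))

# Points beyond the whiskers are stored separately, with their group index
print(data.frame(group = bp$names[bp$group], value = round(bp$out, 1)))
```

Looking at this small table next to the plots is a simple way to confirm that none of the simplifications dropped information that the default box plot shows.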
Tufte, Edward R., 2001, “The Visual Display of Quantitative Information”, Graphics Press, Cheshire, Connecticut.