Effective scientific graphs don’t need to be beautifully designed, in fact, many of the most beautifully designed graphs sacrifice effectiveness in data communication for aesthetics. This is often misleading.
However, the best scientific graphics convey a message honestly, helping the viewer see the facts (i.e. numbers) in clear and straightforward way. Clarity > aesthetics!
While visual elements can powerfully convey information, they can also mislead — intentionally or unintentionally. Techniques like hiding relevant data, displaying too much data, or using inappropriate graphic forms can distort the truth and mislead the audience. Remember that while lying involves deliberate deception, even honest mistakes in visual design can result in misleading interpretations.
x <- 1:6 # basically the same vector as x <- c(1,2,3,4,5,6)
# y is a combine function, which is used to make vectors
y <- c(2,1,4,5,3,6)
# vectors can be combined by columns using cbind
print(cbind(x,y))
## x y
## [1,] 1 2
## [2,] 2 1
## [3,] 3 4
## [4,] 4 5
## [5,] 5 3
## [6,] 6 6
plot(cbind(x,y))
# trying out something
a <- c("oranges","plums","apples","peaches","bananas","pears")
b <- c(5,7,9,2,4,6)
# if i use cbind here it'll turn the numeric values into character strings, so instead data.frame()
# function needs to be used
# matrices in R must have **homogeneous data types**, so cbind() forces this by converting to the
# "highest" type in the hierarchy
# + **coercion hierarchy in matrices**: logical -> integer -> numeric -> character
# data frames can handle **mixed data types**, so b would stay numeric while a remains character
print(data.frame(a,b))
## a b
## 1 oranges 5
## 2 plums 7
## 3 apples 9
## 4 peaches 2
## 5 bananas 4
## 6 pears 6
This dataset gives CO2 emission by different economic sectors (in millions of CO2 per year). Based on this dataset, this tutorial explores several different types of graphs and graphical features, including: + pie charts + bar plots + color schemes and pallets + area + Hawkins climate change visualization
# first loading the data by copying the code directly from the tutorial sheet
emitters <- data.frame(
Sector = c("Other energy", "Electricity and heat production", "Industry", "Transportation",
"Buildings", "Agriculture, forestry, and other land use"),
Emission = c(.1,.25,.21,.14,.06,.24)*6558)
# printing the data to the screen
emitters
## Sector Emission
## 1 Other energy 655.80
## 2 Electricity and heat production 1639.50
## 3 Industry 1377.18
## 4 Transportation 918.12
## 5 Buildings 393.48
## 6 Agriculture, forestry, and other land use 1573.92
Pie charts use elements of angles and area to convey information of how different pieces fit into the whole.They generally only make sense if interested in something proportional that can be divided into sensible categories.
In this graphic, the pie chart attempts to shows how big or small each economic sector’s contribution is to total greenhouse gas emissions. Food for thought: How effective is it? Can you easily discern the largest emitter? What about the second largest?
?pie
# pie(x, labels = names(x), edges = 200, radius = 0.8, clockwise = FALSE, init.angle =
# if(clockwise) 90 else 0, density = NULL, angle = 45,
# col = NULL, border = NULL, lty = NULL, main = NULL, ...)
# for better clarity, i'll also sort the emissions so that these appear in descending order
# this can be done to the dataframe itself
pie(emitters$Emission[order(emitters$Emission, decreasing=TRUE)],labels=emitters$Sector
[order(emitters$Emission, decreasing=TRUE)], clockwise=TRUE,
radius=1, col=c("yellow","palegreen","darkred","mediumblue","lightyellow","lightpink"),
main="Emissions by Economic Sector", cex=0.8, font.main=4)
# color palette that takes colorblind people into account -> but must be installed first
# install.packages("viridis") # downloads and installs the package to your computer (must only be
# done once)
library (viridis) # loads the package into your current R session (must be done every time when
# starting R)
pie(emitters$Emission[order(emitters$Emission, decreasing=TRUE)],labels=emitters$Sector
[order(emitters$Emission, decreasing=TRUE)], clockwise=TRUE, radius=1, col=viridis
(length(emitters$Sector)), main="Emissions by Economic Sector", cex=0.8, font.main=4)
# to see the full range of available colors -> use colors()
length(colors())
## [1] 657
head(colors(), 20)
## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3" "antiquewhite4" "aquamarine"
## [9] "aquamarine1" "aquamarine2" "aquamarine3" "aquamarine4"
## [13] "azure" "azure1" "azure2" "azure3"
## [17] "azure4" "beige" "bisque" "bisque1"
# BUT must fix the cut-off sector descriptions as this makes it hard to understand the data ->
# using parameter function to set the margins
par(mar=c(1,5,5,10)) # order of margins: bottom, left, top, right
# AND must be careful with color as it often has cultural meaning and may inadvertently convey
# unintended meanings
pie(emitters$Emission[order(emitters$Emission, decreasing=TRUE)],labels=emitters$Sector
[order(emitters$Emission, decreasing=TRUE)], clockwise=TRUE, radius=1, col="lightpink",
main="Emissions by Economic Sector", cex=0.8, font.main=4)
# Leiden University QRS team apparently created their own package of colors, which allows for
# interactively picking colors -> devtools
options(repos = c(CRAN = "https://cran.rstudio.com/")) # tells R where to download packages from
install.packages("devtools") # this package allows for installing non-standard R packages
##
## The downloaded binary packages are in
## /var/folders/cz/21g21r953_d4b1yk50_xhxc40000gn/T//Rtmpsm85Ff/downloaded_packages
require(devtools) # library() can also be used -> require() keeps going if package isn't found
# install.packages() always requires the package name in quotes! remember this
install_github("MarcoDVisser/choosecolor") # installs QRS custom package
require(choosecolor)
mycol <- color.choose()
A bar plot - or bar graph or bar chart - is used to present categorical data, and it uses different lengths to show the values across the categories. The bar chart makes it easier for the reader to compare sectors but doesn’t easily convey how each one relates to the total.
?barplot
# barplot(height, width = 1, space = NULL, names.arg = NULL, legend.text = NULL, beside = FALSE,
# horiz = FALSE, density = NULL, angle = 45, col = NULL, border = par("fg"), main = NULL,
# sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
# axes = TRUE, axisnames = TRUE, cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
# inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0, add = FALSE, ann = !add
# && par("ann"), args.legend = NULL, ...)
barplot(emitters$Emission, names.arg=emitters$Sector, main="Greenhouse Gas Emissions per
Economic Sector", font.main=4, cex.main=1.4)
# there isn’t enough space to display the sector names in the margins as they are simply too long
# one way to fix this is leaving names.arg() empty and using the function text() to add the
# sector names onto the bars later on
# it may also help to display the bars horizontally because this way the text is easier to read
# this is done by changing "horiz" element -> if FALSE, the bars are drawn vertically, if TRUE,
# the bars are drawn horizontally
barinfo <- barplot(emitters$Emission, main="Greenhouse Gas Emissions per Economic Sector",
col="#a6cee3", horiz=TRUE, xlab="Emissions (million tons CO2)", cex.lab=1,
cex.axis=0.8)
# cex.lab changes axis label size (for xlab and ylab titles), cex.axis changes axis tick label
# size (actual numbers/categories)
barinfo <- barplot(emitters$Emission, main="Greenhouse Gas Emissions per Economic Sector",
col="#a6cee3", horiz=TRUE, xlab="Emissions (million tons CO2)", cex.lab=1,
cex.axis=0.8)
barinfo # shows what is stored in barinfo so labels can be attached
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
# adding the labels manually through text() function
text(800,barinfo[,1], labels=emitters$Sector, col="darkblue", cex=1) # cex sets size of text
Another way to convey information is through the use of colors and color scales. Choosing the correct color scheme and scale can really make scientific graphics stand out, however choosing a poor scheme can hurt your message - or even exclude certain groups from your message (such as people suffering from various forms of colorblindness).
R has a number of color sets, and color interpolation functions that can give you a range of colors. Alternatively, you can create your own new color palettes.
R provides numerous color ramp functions (functions that do color interpolations) that can be used on continuous data to indicate different values. The viridis R package, developed by Simon Garnier, provides a set of color palettes designed to enhance the visual appeal of plots while ensuring they are accessible to a wide audience, including individuals with color vision deficiencies - thereby enhancing the clarity and inclusivity of your work. These palettes are printer-friendly, perceptually uniform, and colorblind-friendly.
Sometimes it is much more useful to make your own color palettes. One very useful resource is the color brewer 2 website, which allows you to easily select effective color palettes (link: https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3). + go to the webpage (see link) + select a color scheme to your liking + click EXPORT to download the color scheme + select code from ”java” box to copy to R script + simply alter the java code + mycols<-c(‘#762a83’,‘#af8dc3’,‘#e7d4e8’,‘#d9f0d3’,‘#7fbf7b’,‘#1b7837’) + myPalette<-colorRampPalette(mycols)
# setting the margin very small using mfrow (4 panel plot (2x2)), and dropping the x and y axes
par(mfrow=c(2,2), mar=c(0,0,2,0), xaxt="n", yaxt="n")
# use a color ramp by inputting the number of colors you want
grey.colors(4) ## this returns 4 grey colors
## [1] "#4D4D4D" "#969696" "#C3C3C3" "#E6E6E6"
## [1] "#4D4D4D" "#969696" "#C3C3C3" "#E6E6E6"
# now make a sequence of increasing numbers
?seq # generates regular sequences
# seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
# length.out = NULL, along.with = NULL, ...)
seqData1 <- seq(1,1000,1)
# food for thought: is this different from what 1:1000 would do?
seqData2 <- seq(1:1000)
head(seqData1 == seqData2,10) # -> gives TRUE for all values
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# "image" is used to plot a matrix, and "matrix" makes a matrix out of seqData
image(matrix(seqData1), col=terrain.colors(1000), main="terrain.colors()")
image(matrix(seqData1), col=topo.colors(1000), main="topo.colors()")
image(matrix(seqData1), col=heat.colors(1000), main="heat.colors()")
image(matrix(seqData1), col=grey.colors(1000), main="grey.colors()")
# the viridis package contains four sequential color scales:
# 1. viridis -> the default and primary palette
# 2. magma -> a warm-toned alternative
# 3. plasma -> a vibrant palette with a wide range of colors
# 4. inferno -> an intense gradient palette
require(viridis)
# creating a mathematical surface visualization to showcase the Viridis color palette
# first, creating two identical sequences from -2pi to 2pi with 40 points each (for x and y)
x <- y <- seq(-2*pi, 2*pi, len = 40)
# second, calculating the distance from origin for every (x,y) point combination, creating a radial
# distance matrix
r <- sqrt(outer(x^2, y^2, "+"))
# filled.contour(...) -> creates a contour plot, with the color.palette argument specifying the
# palette to be used (viridis, magma, plasma, or inferno)
# used to plot a mathematical function that creates concentric wave-like patterns
# (cos(r^2) -> oscilatting rings / exp(-r/2pi) -> makes rings fade towards edges)
# using the Viridis palette
filled.contour(cos(r^2)*exp(-r/(2*pi)), axes=FALSE, color.palette=viridis, asp=1, main="Viridis")
filled.contour(cos(r^4)*exp(-r/(4*pi)), axes=FALSE, color.palette=magma, asp=1, main="Magma")
filled.contour(sin(r^4)*exp(-r/(2*pi)), axes=FALSE, color.palette=inferno, asp=1, main="Inferno")
# exporting my own color scheme from color brewer and changing it to R syntax
mycols <- c('#ffffcc','#c7e9b4','#7fcdbb','#41b6c4','#2c7fb8','#253494')
# next, saving it as my own palette function
myPalette <- colorRampPalette(mycols)
# now using myPalette() in future plots!
par(mfrow=c(1,1),mar=c(0,0,2,0),xaxt="n",yaxt="n")
# using seq() function to create a simple sequence of consecutive integers from 1 to 1000, stepping
# by 1, so seqData becomes: [1,2,3,4,5,...,998,999,1000]
# the purpose is to create 1000 distinct values that will each get mapped to one of the 1000 colors
# from my color palette myPalette(1000)
# when doing matrix(seqData), it converts this sequence into a 1000x1 matrix (a single column)
# and image() then displays each value as a thin horizontal stripe with its corresponding color,
# essentially creating a color gradient bar showing a smooth transition across the 1000 colors
seqData3<-seq(1,1000,1)
image(matrix(seqData3),col=myPalette(1000),main="myPalette() - Sunrise")
# here, the effects are not blended as the parameters have changed
seqData3<-seq(1,1000,100)
image(matrix(seqData3),col=myPalette(5000),main="myPalette() - Sunrise")
# now let's apply this color scheme to the emitters data!
# no axis, foreground color or margins other than for the title needed
par(xaxt = "n", yaxt = "n", bty = "n", mar = c(1, 1, 3, 1), cex.main = 1.2)
# obtaining the relative ranking of each sector by using order() for a decreasing order
rnk <- order(emitters$Emission, decreasing = TRUE)
rnk # order of rank per sector recorded
## [1] 2 6 3 4 1 5
N <- nrow(emitters)
N # provides number of emitters recorded
## [1] 6
# assigning a ranked color from heat.colors for eacg data point recorded, and increasing x-axis
# limits range to -5:5 to have space for the sector names and a legend
plot(rep(1, N), 1:N, col = heat.colors(N)[rnk], xlab = "", ylab = "", fg = "white",
main = "Greenhouse gas emission per sector (mil tons/year)", pch = 15, cex = 4,
xlim = c(-5, 5))
# adding the sector names manually using text()
text(-2, 1:N, emitters$Sector, col="black", cex=0.8)
# rounding to the nearest integer
minmax <- round(range(emitters$Emission)/100) * 100
minmax
## [1] 400 1600
# adding a legend to understand the color scheme
# legend(x, y=NULL, legend, fill=NULL, col=par("col"), border="black", lty, lwd, pch,
# angle=45, density=NULL, bty = "o", bg=par("bg"), box.lwd=par("lwd"),
# box.lty=par("lty"), box.col=par("fg"), pt.bg=NA, cex=1, pt.cex=cex, pt.lwd=lwd,
# xjust=0, yjust=1, x.intersp=1, y.intersp=1, adj=c(0, 0.5), text.width=NULL,
# text.col=par("col"), text.font=NULL, merge=do.lines && has.pch, trace=FALSE,
# plot=TRUE, ncol=1, horiz=FALSE, title=NULL, inset=0, xpd, title.col=text.col[1],
# title.adj=0.5, title.cex=cex[1], title.font=text.font[1], seg.len=2)
legend("right", col = heat.colors(N)[5:1], bty = "n",
legend = seq(minmax[1], minmax[2], length.out = 5), pch = 15)
A final means with which to convey information in numerical data is by use of area. This works in a very similar way to the color graphic.
## we don't need axis, or foreground color or margins other than
## for the title
par(xaxt = "n", yaxt = "n", bty = "n", mar = c(1, 1, 3, 1), cex.main = 1.2)
## To set an area for each sector we calculate a radius that
## will feed to the cex argument in the plot
radius <- sqrt(emitters$Emission/100)
## how many data point do we have?
N <- nrow(emitters) ## what does nrow do? use ?nrow
## Below we assign a relative area with cex for each data point.
## We also set the x-axis limits broad from -5 to 5 so we have
## space for the sector names and a legend
plot(rep(1, N), 1:N, col = "grey", xlab = ""
, ylab = ""
, fg = "white",
main = "Greenhouse gas emission per sector (mil tons/year)", pch = 16,
cex = radius, xlim = c(-5, 5))
## add the sector names
text(-2, 1:N, emitters$Sector, col = "black", cex = 0.7)
## get the min and max values for the area legend
minmax <- round(range(emitters$Emission)/100) * 100
minmaxCex <- round(range(radius))
## add a legend (see ?legend) want to know what a function
## within the legend function does? Copy and paste into R!
legend("right", col = "grey", bty = "n", pt.cex = seq(minmaxCex[1], minmaxCex[2], length.out = 5),
legend = seq(minmax[1], minmax[2], length.out = 5), pch = 16, y.intersp = 2)
A) Make a R-script that loads the NASA historic temperature data, subsets the it correctly, calculates yearly means, and creates a color ramp palette. Then make the Ed Hawkins graphic (see figure 2.1). You are free to use Ed’s color palette or choose your own color palette - if you feel another palette is better. Also, if you feel you can make a more honest and effective graphic to communicate how the climate is changing… upload this instead! We will look at all the graphics. Save your graphic as a pdf, or jpg, or png and submit both your finished R-code and visualization online.
NOTE: If you upload the figure above, this is wrong, you need to at the very least remake the Hawkins palette, and adapt the colors.
B) Rate several of the above graphics on honesty and effectiveness in communicating the real nature of the numbers. How well do you feel each graphic is at allowing people to make up there own minds regarding what the data/numbers mean? + To be completed on the quiz itself
# prepare the data
setwd("~/OneDrive - Universiteit Leiden/Quantitative Research Skills/R Tutorials")
temperature <- read.csv("QRS.NASAGISS1880-2021.csv")
# always check the 1,2,3!
summary(temperature)
## temp year doy
## Min. :10.78 Min. :1880 Min. : 14.0
## 1st Qu.:12.29 1st Qu.:1915 1st Qu.: 76.0
## Median :13.59 Median :1950 Median :167.0
## Mean :13.59 Mean :1950 Mean :182.3
## 3rd Qu.:14.94 3rd Qu.:1986 3rd Qu.:259.0
## Max. :16.36 Max. :2021 Max. :350.0
hist(temperature$temp, main="Distribution of Global Temperatures", xlab="Temperature (Co")
# distribution is bimodal due to seasonality!
# calculating the yearly average temperature for subset 1880:2020 -> recall Exercise 5!
temp1880to2020 <- subset(temperature, year < 2021)
# alternatively: remove final year with negative index
# yearlytemp[-length(yearlytemp)] -> yearlytemp without final year (2021)
# e.g. yearlytemp[-142] -> also removes 2021
# e.g. yearlytemp[-141] -> removes 2020
# year 2020 can also be removed automatically: yearlytemp[-(length(yearlytemp)-1)]
yearlytemp <- tapply(temp1880to2020$temp, temp1880to2020$year, mean)
hist(yearlytemp, main="Distribution of Means of Global Temperatures", xlab="Temperature (Co")
# extracting names of year column (vector) and turning them into numbers
head(yearlytemp) # shows that temp is classified by year, with the year as the column name
## 1880 1881 1882 1883 1884 1885
## 13.37917 13.46333 13.43833 13.37750 13.26417 13.21167
# because R reads this as a categorical variable, must be converted to numeric variable
years <- as.numeric(names(yearlytemp)) # alternatively: years <- 1880:2020
# now we need to put the years together with their mean temperatures
annual_data <- data.frame(year=years, temp=yearlytemp)
# choose a diverging color palette from the color brewer website using c function (combine)
new_cols <- rev(c('#a50026','#d73027','#f46d43','#fdae61','#fee090','#ffffbf','#e0f3f8','#abd9e9',
'#74add1','#4575b4','#313695'))
# mapping the interval of values to specific colors, i.e. creating a function using colorRampPalette
# which creates a palette of smoothly divergent colors! -> see COLOR SCHEMES AND PALETTES
temp_cols <- colorRampPalette(new_cols)
# temp_cols(length(yearlytemp)) shows selected colors based on number of temperature values
head(temp_cols, 36)
##
## 1 function (n)
## 2 {
## 3 x <- ramp(seq.int(0, 1, length.out = n))
## 4 if (ncol(x) == 4L)
## 5 rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
## 6 else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
## 7 }
# making the visualization (without margins, or box around graphic)
par(mar=c(0,0,0,0),bty="n") # mar -> no margins by setting them to 0 / bty -> box type "none" to
# remove encircling box
# rank() -> numeric / order() -> numeric, alphabetic
# we want the lowest number to have the lowest rank, so that it is at the start of the plot, this
# way, red colors correspond to hot years
# using ties.method="random" ensures every year has a unique color stripe while still preserving
# the overall trend (the same color would break the gradient)
colRank <- rank(annual_data$temp,ties.method="random")
print(colRank)
## [1] 46 68 61 45 22 15 17 12 43 62 13 34 27 16 19 32 58 60
## [19] 30 42 67 48 23 8 2 29 33 7 6 1 5 4 10 14 49 52
## [37] 11 3 18 24 26 40 21 28 25 31 57 35 36 9 44 63 47 20
## [55] 54 37 50 74 79 77 96 103 90 94 104 93 69 73 56 55 41 70
## [73] 81 92 53 51 39 85 89 84 75 88 83 87 38 59 72 76 66 86
## [91] 82 65 80 99 71 78 64 101 91 100 107 111 97 109 98 95 102 112
## [109] 115 108 119 117 105 106 110 118 113 120 125 114 116 122 127 126 121 133
## [127] 128 131 123 130 134 124 129 132 135 137 140 138 136 139 141
# now we use the image function which takes as input a matrix, not a vector or a data.frame
image(matrix(yearlytemp),col=temp_cols(colRank))
# new way of plotting -> a better way to visualize this data
# annual_data -> shows full dataset, but for the sake of space, will just show head and tail
head(annual_data, 20)
## year temp
## 1880 1880 13.37917
## 1881 1881 13.46333
## 1882 1882 13.43833
## 1883 1883 13.37750
## 1884 1884 13.26417
## 1885 1885 13.21167
## 1886 1886 13.23167
## 1887 1887 13.18500
## 1888 1888 13.37083
## 1889 1889 13.43917
## 1890 1890 13.19417
## 1891 1891 13.31917
## 1892 1892 13.27250
## 1893 1893 13.23167
## 1894 1894 13.24417
## 1895 1895 13.31750
## 1896 1896 13.43250
## 1897 1897 13.43500
## 1898 1898 13.27667
## 1899 1899 13.36667
tail(annual_data, 20)
## year temp
## 2001 2001 14.08000
## 2002 2002 14.16750
## 2003 2003 14.15917
## 2004 2004 14.07750
## 2005 2005 14.21833
## 2006 2006 14.17917
## 2007 2007 14.20500
## 2008 2008 14.08333
## 2009 2009 14.19917
## 2010 2010 14.26333
## 2011 2011 14.15000
## 2012 2012 14.18500
## 2013 2013 14.21750
## 2014 2014 14.28500
## 2015 2015 14.43417
## 2016 2016 14.55417
## 2017 2017 14.46250
## 2018 2018 14.38583
## 2019 2019 14.51667
## 2020 2020 14.55667
par(mar=c(5,5,5,5))
plot(temp~year, data=annual_data, ylab=bquote("Temperature"~C^o), xlab="Year", main="Global Mean Temperature
between 1880 and 2020", font.main=4, col=temp_cols(colRank), pch=16) # pch -> point character
# same as: plot(annual_data$year, annual_data$temp)
# being more fair with the data by looking at anomalies
annual_data$anom <- annual_data$temp - mean(annual_data$temp[1:length(1880:2020)])
plot(anom~year, data=annual_data, ylab=bquote("Temperature Anomaly"~C^o), xlab="Year", main="Temperature
Relative to the Global Mean", font.main=4, pch=16)