ggplot2 workshop
How much R do you need to know to make good charts?
None: you can get away with using SigmaPlot, GraphPad, Excel … for almost everything.
… And if you wanted to learn to reproducibly make charts and graphs you could spend your time learning any number of other tools in python or javascript or one of the mathematically-oriented languages (matlab / maple) or statistical tools (stata / sas / igorPro).
I just rather like R & R-graphics. When I started learning R, Romain François’ now-defunct R graph gallery existed and I liked monkeying around with the suggested figures. Its newer counterpart is a pretty good place to look for plotspiration too. R has a bunch of datasets preinstalled and has good tools for reading-in, manipulating and analysing datasets.
In this workshop we’ll read in some data, reorganise that data and plot it out. It is not assumed that you’ve ever used R, but that you have made some kind of graph at least once. In the process, you will write an R-script to generate some graphics and hopefully find out that it’s not too daunting to make charts from data.
In this document:
This is text.
# This is a comment.
this <- is("code that you would type into the R console or into your script.")
## [1] "This is" "a valid" "result"
## This is just a message from R - you don't need to worry about this.
But bad code throws a warning …
## Warning: .. that looks like this
… or if it’s really bad it might throw an “Error:” (you’ll probably see a few of them if you’ve never used R before).
If you stick with R, you might want to learn how to write up reports in R-markdown (like the document you’re looking at) or in R-notebooks. Here we’re just going to put everything in a script and view the results that R prints to the command line, or that get plotted to the ‘Plots’ panel in Rstudio. All will be revealed.
Feel free to copy / modify this workshop for your own needs. The code is available at bitbucket.
To keep up with the workshop, you need the following on your computer:
Once installed, you should boot up Rstudio using its icon.
If that was too terse, have a look at this wonderful data-carpentry course on R for ecological data-science. In particular, look at their “Before-we-start” notes where there’s more info on installing R and Rstudio, and on working with R inside Rstudio.
Then open a new R script: File -> New File -> R Script. This should open up a blank, plain-text document in the top-left panel of Rstudio. There should also be a console window in the bottom left.
During the workshop you’re going to type code into the script, and then run it inside the console. You can do this by highlighting the code you want to run and then clicking Run at the top of the window containing your script. To prove you’ve got everything set up, copy the following into your script and run it (Run > Run selected Line(s)):
# EX: Can you get this to run?
hist(rnorm(100), col = "grey", main = "My first histogram!")
That should give you a histogram in the plot window (bottom-right) of Rstudio that looks something like this:
Whenever you see a code-block in this workshop, copy it into your script and run it in the same way.
You will be installing a variety of R packages in the process of this workshop. So, you’ll need internet access, or to have already installed them. The packages are all available from CRAN (which hosts the most established R packages; for bioinformatics there’s another repository called bioconductor that will save your life). The list of required packages is as follows:
MASS
dplyr
ggplot2
If you want to install these before you arrive, try using the Tools > Install Packages tab in Rstudio - don’t worry if you can’t get it to work, we can fix that during the workshop (see later).
If you’ve used GraphPad or SigmaPlot before, you’ll be used to the point-and-click approach to data analysis and visualisation. You load your data into a spreadsheet-like interface and then mouse-click your way through to your graph. When you need to remake a graph, or make multiple related graphs, it’s very easy to forget a step and it can be a bit of a chore to remake the same graph several times.
To make a graph of some dataset in R, rather than clicking your way to success, you write down precisely how to make that graph in code. This code can be saved in a script and rerun to produce exactly the same graph. The code can be applied to a different dataset, so that you can automate producing multiple identically-formatted graphs.
I wrote a basic introduction to R code and the types of data that it’s happy to work with here. There are many similar resources out there. You don’t need to understand R programming in much detail to follow this workshop though.
The very first thing you should know is this:
# type this into the console
? plot
That brings up the help page for the plot function. If you’re having any trouble with any of the functions discussed later, have a look at its help page using ? function_name.
If you want to plot data out using R, you first need to know how to store data inside R. We do this by associating your data values with a variable.
To store the value 123 in the variable abc we type the following into R.
# Store some value in the variable `abc`
abc <- 123
Here, abc is just a placeholder - it’s an alias. It points to somewhere in your computer’s memory where the value 123 is held. Make sure you type <- without a space (between ‘<’ and ‘-’): this is R’s principal ‘assignment operator’.
You can access the value stored in a variable as follows:
# dereference the variable `abc`
abc
## [1] 123
You can do any number of operations on your stored values:
# Double it using the multiplication operator '*'
abc * 2
## [1] 246
# Take its base-2 logarithm using the function `log2`
log2(abc)
## [1] 6.942515
# Repeat it using the function `rep` and the argument `times`
rep(abc, times = 5)
## [1] 123 123 123 123 123
# Compare it to some other value
abc > -113
## [1] TRUE
Note that last example and have a think what the difference is between abc < - 123 and abc <- 123 (spaces added for emphasis). Type the two commands into the console to see if R behaves as you expect.
# EX: your turn
You can also redefine your variables in R:
# Redefine the contents of `abc`
abc <- 123 * 456
# Print out the value that's now stored in `abc` to the console:
abc
## [1] 56088
There are a few constraints on what you call your variables:
Don’t use operators or other special characters (*!?+/= etc) inside variable names (eg, abc*abc).
Don’t start a variable name with a number (eg, 123_data).
Note also that variable names are case-sensitive (some_var is not the same as some_Var) and you should try not to reuse names that are already in use.
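A quick sketch of those naming rules in action. The variable names here are just the ones from the text, and make.names is a base-R helper that shows how R itself would repair an invalid name:

```r
# Names are case-sensitive, so these are two separate variables
some_var <- 1
some_Var <- 2

# They hold different values
some_var == some_Var

# `make.names` turns an invalid name into a valid one; here it shows that
# a name can't start with a number
make.names("123_data")
```

If you really do need an awkward name (eg, a column name from a spreadsheet), R lets you wrap it in backticks, but it’s easier to avoid the problem.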
The data that you typically store within R will rarely be as simple as abc. You can store many different values in the same variable:
# `c` combines multiple values into a `vector` of values of the same type
xs <- c("A", "def", "10", "some_value", 10.001)
# Note that final 10.001 has been converted into characters:
xs
## [1] "A" "def" "10" "some_value" "10.001"
ys <- c(1, 5, 4.3, 9.900000001)
ys
## [1] 1.0 5.0 4.3 9.9
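If you want to check what type of values a vector holds, the base-R function class will tell you. This is a small sketch using the two vectors defined above; it also shows that the 9.9 printed for ys is only rounded for display:

```r
xs <- c("A", "def", "10", "some_value", 10.001)
ys <- c(1, 5, 4.3, 9.900000001)

# `xs` mixes text and numbers, so everything is stored as text
class(xs)

# `ys` is purely numerical
class(ys)

# The stored value keeps its full precision; ask for more digits to see it
print(ys[4], digits = 10)
```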
The data-frame structure is pervasive in R. This is a bit more complicated than the values and vectors that you’ve just seen. It is a two-dimensional way of storing data (ie, it has both rows and columns, like in a database table or a spreadsheet). In it, the contents of a given column must have a particular type - be it:
numeric (like ys above);
character (like xs above; eg, names);
categorical (factor);
logical (TRUE / FALSE values).
You can make a data-frame as follows:
my_df <- data.frame(
x = 1:5,
y = c(1, 2.1, 3, 5, 4),
z = letters[1:5]
)
# print it to the console:
my_df
| x | y | z |
|---|---|---|
| 1 | 1.0 | a |
| 2 | 2.1 | b |
| 3 | 3.0 | c |
| 4 | 5.0 | d |
| 5 | 4.0 | e |
The above used the base-R code [2] for making data-frame structures [3]. It made a data-frame with three columns and five rows. In addition to defining them as above, you could construct data-frames (or related structures) by importing data from a plain-text file (see readr or read.csv), or from excel files (see readxl) or by import from a database / website / OMG_THERES_SO_MUCH_DATA.
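Here’s a minimal sketch of the plain-text route using base-R’s read.csv. There’s no data file shipped with this workshop, so the sketch first writes my_df out to a temporary file (csv_path is just a made-up name for that location) and then reads it back in:

```r
my_df <- data.frame(
  x = 1:5,
  y = c(1, 2.1, 3, 5, 4),
  z = letters[1:5]
)

# A throwaway file location; with your own data this would be the path to
# your .csv file
csv_path <- tempfile(fileext = ".csv")

# Write the data-frame out as comma-separated text (without the row-names)
write.csv(my_df, csv_path, row.names = FALSE)

# Read it back in as a new data-frame
from_file <- read.csv(csv_path)
from_file
```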
You can get some summary info for data-frames using the functions str and summary.
# EX: Summarise my_df using `summary` and then using `str`
Packages are available that extend the functionality of base-R. Some of these contain functions, or define data-structures. Some of them contain datasets that you can use. A list of the datasets that can be accessed through R is available here.
Anscombe’s quartet is a well-known synthetic dataset. It’s available within the datasets package. datasets is loaded at start-up by R, so all its contents are available for you to play with.
To make the Anscombe data visible use the data function:
# Add this to your script, and then run it in *Rstudio* using
# `Run selected lines` or by copying it into the `console`
data("anscombe")
The data is now available in the anscombe variable:
anscombe
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
There are eight columns in the dataset (the first printed row contains the column-names), each of which contains eleven values. Each column is numerical. The columns are paired, in the sense that the x1 column corresponds to the y1 column in this dataset (and so on; although the pairing might not be obvious).
This is a charts & graphs workout, not a programming or stats one, so you don’t need to understand the following code: we summarise a bit of numerical info about this dataset.
# Literally, `Apply the function 'mean' to each column of anscombe and then
# simplify the results`:
sapply(anscombe, mean)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
You could also get the means from the summary of the anscombe dataset.
# Similarly get the sd for each column
sapply(anscombe, sd)
## x1 x2 x3 x4 y1 y2 y3 y4
## 3.316625 3.316625 3.316625 3.316625 2.031568 2.031657 2.030424 2.030579
All of the y-columns have the same mean and the same sd (to 3SF at least). This is also true for the x-columns. You could look in more detail at these paired columns: the regression coefficients and the correlations are virtually the same as well. See the Extras section for details on how to do this.
We can plot out one column against another like so (put this code in your script and run it in Rstudio):
# Base-R plotting - y1 column on the y-axis, x1 column on the x-axis
plot(y1 ~ x1, data = anscombe)
There are a number of arguments that you can pass to the plot function to alter how the resulting charts look. For example, you could set:
the x-axis limits: xlim = c(3, 19) (or choose your own limits);
the y-axis limits: ylim;
the axis labels: xlab = "some label" (or ylab, resp.);
the title: main = "Scatter";
the point sizes (cex = 2), colours (col = "red"), or types (pch = 3);
To use these arguments, the code would be similar to the following: plot(column_name2 ~ column_name1, data = some_dataset, xlim = c(3, 19))
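To make that concrete, here is one possible combination of those arguments applied to the third pair of Anscombe columns. The particular limits, labels and colours are arbitrary choices for this sketch:

```r
data("anscombe")

# Scatter plot of y3 against x3 with hand-picked limits, labels and points
plot(
  y3 ~ x3,
  data = anscombe,
  xlim = c(3, 19),
  ylim = c(3, 13),
  xlab = "x3",
  ylab = "y3",
  main = "Anscombe: third pair",
  cex = 1.5,
  col = "red",
  pch = 3
)
```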
Try plotting y2 against x2 for the Anscombe dataset. Pick whatever arguments aesthetically suit your mood.
# EX: Your code goes here
Pair-one and pair-two have (virtually) the same means, standard deviations, regression coefficients and correlation, but there’s a notable difference in their distribution when you plot them out.
And, this difference was revealed by plotting the data (charts are good)
And, plotting the data was literally, two lines of code (efficient is good)
And, those two lines of code are now stored in a script that you could rerun to generate the same figures (reproducibility is good).
Within base-R, you have access to a range of different types of chart: histograms, boxplots, heatmaps. Indeed, the plot function can take data of varying type and make sane choices about how to represent it, for example, plot(y ~ x, ...) will make a Box-plot if x is categorical.
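A small sketch of that behaviour, using some made-up data (the toy data-frame isn’t part of the workshop): because x is stored as a factor, the same plot call that gave a scatter plot earlier now gives a box-plot.

```r
# Made-up data: a categorical x (a factor) and a numerical y
set.seed(1)
toy <- data.frame(
  x = factor(rep(c("a", "b", "c"), each = 10)),
  y = rnorm(30)
)

# x is categorical, so `plot` draws one box per level of x
plot(y ~ x, data = toy)

# The other chart types have their own functions as well
hist(toy$y)
boxplot(y ~ x, data = toy)
```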
But, …
…
… that scatter plot …
… it was a bit ugly wasn’t it?
Even with a bit of fancy, it’s hardly dashing [5]:
# You won't be expected to write code like this by the end of the workshop.
# Based on https://www.r-bloggers.com/mastering-r-plot-part-3-outer-margins/
# Store the initial- global- plotting parameters
op <- par(no.readonly = TRUE)
# Decide on the lower/upper limits for the x- and y-axis, so each chart has the
# same range.
xls <- c(2, 19)
yls <- c(2, 14)
# Set various graphical parameters:
par(oma = c(2, 2, 0, 2), # Expands the outside margins to allow a common X/Y
# label.
mar = c(3, 3, 1, 0), # Tweaks the margins of each plot, so the bottom
# layers don't draw over the x-axis label.
mfrow = c(2, 2), # 2x2 plot.
pch = 16 # Filled-in circles.
)
# We plot each group in the quartet separately
plot(y1 ~ x1, data = anscombe, xlim = xls, ylim = yls, xlab = "", ylab = "", col = "red")
plot(y2 ~ x2, data = anscombe, xlim = xls, ylim = yls, xlab = "", ylab = "", col = "blue")
plot(y3 ~ x3, data = anscombe, xlim = xls, ylim = yls, xlab = "", ylab = "", col = "green")
plot(y4 ~ x4, data = anscombe, xlim = xls, ylim = yls, xlab = "", ylab = "", col = "black")
# Adds in a shared x- and a shared y-axis label
mtext(text = "X axis", side = 1, line = 0, outer = TRUE)
mtext(text = "Y axis", side = 2, line = 0, outer = TRUE)
# Set the parameters back
par(op)
We could spend our time tweaking base-R and all its curiously-named graphical parameters. But we’re going to learn ggplot2 instead.
ggplot2 was first released in 2007 and provides a more consistent plotting vocabulary / syntax than is available in base-R. The creator of ggplot2, Hadley Wickham, was also responsible for dplyr, readr and several other packages that are now a standard part of the R-programming lexicon. See the tidyverse website for more details and see the ggplot2 section of the R graph gallery for further examples.
If you are working in Rstudio, the easiest way to install packages is using the Tools -> Install Packages widget at the top. To install a single package, say MASS, you just need to enter ‘MASS’ into the ‘Install Packages’ dialogue box. This will set up a library for you and install any dependencies that the package may need.
Alternatively, in the console you could type:
install.packages("MASS") # Installs MASS
Then to make the package available in your current R workspace, add this to your script & run it in the console:
library("MASS") # LOAD
The packages that we’re going to use in the rest of the workshop are as follows
## [1] "MASS" "ggplot2" "dplyr"
Try installing / loading each of these in turn, using the same approach that was used to install MASS [6].
# EX: Install `ggplot2` and `dplyr`
# EX: Load `ggplot2` and `dplyr`
If you’re writing a script that depends upon a specific package, by putting library("some_package") at the top of that script, all the functions etc will be loaded up by R every time that script is run. So add the following to the top of your script:
library("MASS")
library("ggplot2")
library("dplyr")
Hopefully, no errors were thrown while installing / loading those packages. We occasionally see issues for Windows users who are working on a computer where they can’t write to the directory where R stores its packages by default (eg, UoG SSD PCs). See the footnotes for a remedy [7].
ggplot2
OK. You’ve got ggplot2 installed. You’ve got some data in the data-frame anscombe. We’ll first generate a scatter plot similar to the one we made earlier on, but using ggplot2 instead of base-R.
The simplest plotting call in ggplot2 looks like the following (in the abstract):
ggplot(
data = <some_data_frame>,
mapping = aes(
x = <some_column>,
y = <some_other_column>
)
) +
geom_<some_plot_type>()
The main function in there is ggplot, but by itself, it doesn’t plot anything out. It needs to be told what type of plot to make by a function with the prefix geom_***. There are a range of functions whose name begins with geom_ that could be used in the above: for example, we use geom_point (for scatter plots) and geom_line (for line graphs) below.
Here’s a concrete example, where we’re using the anscombe data-frame, and columns x1 and y1 as the x- and y-variables.
# ? nothing's plotted out
ggplot(data = anscombe, mapping = aes(x = x1, y = y1))
As you can see, it just prints the background and labels some axes. It’s only when ggplot() is used in conjunction with a geom_... function that you get any points plotted out. To get the same scatter plot as before, you’d use the function geom_point.
ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) + geom_point()
Again, try and plot the Anscombe y2 column against x2, but this time using a ggplot2-scatter plot.
# EX: Plot y2 against x2 using ggplot2
So, you have to specify the type of chart that you want to make separately from the call to ggplot(). Have a look at the reference for ggplot2 to see the different types of geom_... functions that are available. For example, you could use geom_line to plot a line joining the x/y points (it connects them in order of increasing x-value).
ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) + geom_line()
ggplot2
That ‘plus’ sign between ggplot() and the geom_...() functions is a bit suggestive. If you can do A() + B() and you can do A() + C(), I wonder what happens if you do A() + B() + C(). Run the following from your script in Rstudio to find out:
# EX: Try it yourself
ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) +
geom_line() +
geom_point()
This is typical of how to construct a more complicated ggplot2 chart from a simpler one: you layer different plot-types on top of each other. Later you’ll see how, using related syntax, you can overlay (or replace) other parts of a chart, such as the axis labels, axis limits and colour scheme.
Much as you can generate charts with two different types of geom on top of each other for the same data, you can generate a chart for two different sets of data.
Whether it makes sense to compare your Apples data against your Oranges cohort, is up to you.
In
ggplot(data = my_data, mapping = aes(x = some_col, y = other_col)) + ...
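the mapping argument in the call to ggplot tells it the correspondence between the columns of your data-frame and the visual properties (aesthetics) of the chart, such as position along the x- and y-axes.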
It can also be used to colour and size your points, see the next section.
In the above, that’s how we indicated the x1 and y1 columns were to be plotted on the x- and y-axes. If you look at the help page for geom_point (or one of the other geoms) there is a mapping argument there as well.
# EX: Get the help page for geom_point up
We can recreate the earlier line-graph by using the mapping argument in geom_line instead of ggplot. Note that labs can be used to add main/sub-titles and axis labels to your plots.
ggplot(anscombe) +
geom_line(mapping = aes(x = x1, y = y1), col = "red") +
labs(title = "We've seen this a few times now", x = "X", y = "Y")
So you could specify a different mapping between your columns and your axes within each call to a geom_... function. Hence, you should be able to add a second call to geom_line into the above code to plot the Anscombe x2/y2 data over the x1/y1 data on the same axes.
# EX: Your turn
#
#
#
In the last couple of graphs we were able to add some colour to the line-graphs. But, we had to hard-code the desired colours for each line. This approach doesn’t easily scale: suppose you wanted to plot a time-course graph for each of 4 different treatments - you’d need to write geom_line(….) four times if you followed the above approach. ggplot provides a cleaner way to specify how things like colours, line-thicknesses, point-sizes should be set up for different subsets of your dataset. However, your data might need neatening up before you can use this system.
Here’s a motivating example using a suitably cleaned-up version of the Anscombe dataset.
# Tidy up the Anscombe dataset
# - Again, you don't have to understand this code today,
# but hopefully you can see what it's doing
tidy_ans <- data.frame(
# - combine all the x values into a single column
x = with(anscombe, c(x1, x2, x3, x4)),
# - combine all the y values into a single column
y = with(anscombe, c(y1, y2, y3, y4)),
# - indicate which group of the quartet each entry comes from:
# eg, the first 11 rows are from the x1/y1 group and are
# indicated with the group-label "Q1"
group = rep(paste0("Q", 1:4), each = 11)
)
Compare (a subset of) the original dataset [8]:
head(anscombe)
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
… to part of the tidied-up dataset:
# EX: Use the `head` function to show the first few rows of `tidy_ans`
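In the original dataset, there were four different columns for the x-axis points, and four columns for the y-points. The newer dataset is tidier because each variable (x, y and group) has a single column and each row holds a single observation (one point from one group of the quartet).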
There’s a bit more to it than that, and you can find out more about tidy data here, but all you need to know today is that tidy data plays well inside ggplot2.
Having tidied up the dataset, we can trivially overlay graphs for the groups of the Anscombe quartet:
ggplot(data = tidy_ans, mapping = aes(x = x, y = y, col = group)) +
geom_point() +
geom_line()
Note that we used col=group inside the mapping argument for ggplot. This tells ggplot to convert the contents of the group column (ie, the quartet-groups) into a colour-indicator, and then geom_line/geom_point can use this information to plot an appropriately coloured line for the different groups.
See the Extras for a neater version, where the different groups are split into a separate panel using the facet_wrap function.
We can encode other aesthetic components of a plot based on columns of a data-frame. For example,
you could encode a different line-type (dashed / dotted / solid) or fill-colour (in a Box plot) for each level of a categorical variable;
or encode point sizes or colours based on a numerical variable (so larger / darker / brighter points correspond to larger values).
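As a rough sketch of both ideas, using the tidied Anscombe data from above (the names p_linetype and p_colour are just made up for this example): the first plot maps the categorical group column to line-type, the second maps the numerical y column to point colour.

```r
library("ggplot2")
data("anscombe")

# Rebuild the tidy version of the Anscombe data so this runs on its own
tidy_ans <- data.frame(
  x = with(anscombe, c(x1, x2, x3, x4)),
  y = with(anscombe, c(y1, y2, y3, y4)),
  group = rep(paste0("Q", 1:4), each = 11)
)

# Categorical variable -> line-type (one style of line per group)
p_linetype <- ggplot(tidy_ans, aes(x = x, y = y, linetype = group)) +
  geom_line()
p_linetype

# Numerical variable -> colour (a continuous gradient along the y values)
p_colour <- ggplot(tidy_ans, aes(x = x, y = y, colour = y)) +
  geom_point()
p_colour
```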
You don’t have to use the default colours provided by ggplot. There are a few ggplot2 functions with the prefix scale_*** that can be used to change colour scheme.
For example, you can use ... + scale_colour_manual(values = c("red", "blue", "green", "black")) to generate the same colour scheme we used in base-R.
Or you could use a built-in colour scheme. The function scale_colour_brewer() can use some colour schemes devised specifically for sequential, diverging or qualitative data types (see the website or the function’s help page). You can choose between the colour schemes by using the arguments type = "XXX" (where XXX is either seq, div, or qual) and palette = YYY (where YYY is some number 1, 2, …, or the name of a palette).
Try remaking the last plot and add + scale_colour_brewer(type = "something", palette = some_number) to pick a better colour scheme for that last plot.
# EX: Use the scale_colour_brewer() function at the end of your plotting code
For further information on mapping variables into aesthetic components of your graphs see the “aesthetics” and “scales” sections of the ggplot2 reference.
The other dataset we’re going to work with today contains some record hill-running times for Scottish peaks. If you’ve loaded the MASS package, you should be able to import the hills dataset using the data function, as we did for the anscombe dataset earlier on. You can also find out some info about the new dataset using ? hills.
# EX: Read in the `hills` dataset
The dataset looks like this:
head(hills)
| | dist | climb | time |
|---|---|---|---|
| Greenmantle | 2.5 | 650 | 16.083 |
| Carnethy | 6.0 | 2500 | 48.350 |
| Craig Dunain | 6.0 | 900 | 33.650 |
| Ben Rha | 7.5 | 800 | 45.600 |
| Ben Lomond | 8.0 | 3070 | 62.267 |
| Goatfell | 8.0 | 2866 | 73.217 |
It contains record times (minutes) for a range of Scottish hill races from back in 1984 [9]. It also includes the distances (miles) and heights (feet) of the respective hill-runs.
In its current form, the different hills are indicated by the row-names in the hills dataset. We’re going to stick them inside the data-frame instead - keep it tidy everyone.
# `mutate` can be used to (re)construct a column based on the values in the
# existing columns of a data-frame. Here we use it to make a column called
# `peak` that contains the values that were previously in the row-names of
# `hills`.
tidy_hills <- mutate(hills, peak = row.names(hills))
head(tidy_hills)
| dist | climb | time | peak |
|---|---|---|---|
| 2.5 | 650 | 16.083 | Greenmantle |
| 6.0 | 2500 | 48.350 | Carnethy |
| 6.0 | 900 | 33.650 | Craig Dunain |
| 7.5 | 800 | 45.600 | Ben Rha |
| 8.0 | 3070 | 62.267 | Ben Lomond |
| 8.0 | 2866 | 73.217 | Goatfell |
With the skills you picked up on the Anscombe toy dataset, you can generate a few interesting charts for the hills dataset already.
For example:
# EX: Plot the record time (y-axis) against the race distance (x-axis)
# - Make sure you put in some nice axis labels using labs()
# - To tell ggplot what x- and y-limits you want to use:
# - try using ggplot(...) + blah_blah() + xlim(lo, hi) + ylim(lo, hi), where
# you've substituted some appropriate numbers for `lo` and `hi`
# - a neat trick to just specify the lower axis-limit is `... + xlim(0, NA)`
# - Note that NA is the R 'not available' value
# EX: Plot the record time against the height-climbed
So longer races take longer, and so do races with more climbing. Try and make a chart that indicates all three factors: height-climbed, distance and time.
For example, you could indicate the record time using point-size or point-colour.
# EX: Plot the height-climbed against distance
# and colour the points by record time
# - Note that scale_colour_brewer can't be used for continuous variables like
# `time`, but you could use `scale_colour_gradient` (with `low` and `high`
# args) or `scale_colour_distiller` (with `type` and `palette` args)
You can try playing around with various other mappings as well. Try using the size aesthetic to plot time against distance, with the points sized by the height climbed.
ggplot(data = tidy_hills, aes(x = dist, y = time, size = climb)) +
geom_point() +
xlim(0, NA) +
ylim(0, NA)
Or you could play around with different geom_*** functions.
For example, here’s a scatter plot with the hill names indicated (note that the data and all mappings defined within the ggplot function are passed to all subsequent geom_*s; this doesn’t happen when mappings are defined in a geom_* function, so the size aesthetic in geom_point doesn’t affect the font-size in geom_text):
# EX: approximate this graph using
# <stuff_you_already_know> + geom_text(mapping = aes(label = peak, vjust = 2))
.. and here’s a histogram of the times (if you want a relative-frequency histogram, put mapping = aes(y = ..density..) into the args for geom_histogram; from ggplot2 3.3.0 onwards, aes(y = after_stat(density)) is the preferred spelling).
# note that you don't need a `y`-aesthetic when you're passing into
# geom_histogram
ggplot(tidy_hills, mapping = aes(x = time)) +
geom_histogram(bins = 15, fill = "orangered1", col = "black") +
labs(x = "Time (minutes)", y = "Count")
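For the density version of that histogram, a sketch might look like the following. It assumes ggplot2 3.3.0 or later (for after_stat) and works on the hills data directly, since only the time column is needed; p_density is just a name picked for this example.

```r
library("MASS")
library("ggplot2")
data("hills")

# Same histogram as above, but the bar heights are densities rather than
# counts: the total area of the bars is 1
p_density <- ggplot(hills, mapping = aes(x = time)) +
  geom_histogram(
    mapping = aes(y = after_stat(density)),
    bins = 15, fill = "orangered1", col = "black"
  ) +
  labs(x = "Time (minutes)", y = "Density")
p_density
```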
Hopefully we’ve got this far and you’ve seen how to read in some data, reorganise it and plot it out using both base-R and ggplot2.
We’ve purposely avoided going into the R language in any great detail so that the workshop could be run in a morning and could focus on graphical applications in some detail. But you should have learned some good practice: by writing out explicitly how to construct a graph in code, you should be able to reproduce making that graph at a later date, you can readily modify the style of the graph, and you can show others how to make that graph.
Nonetheless, to work with your own data in R, you’ll still need to learn how to import it (eg, from plain-text or excel files) and reorganise it into a tidy form.
Between them base-R and ggplot2 provide a powerful basis to construct statistical graphics. We’ve mainly worked with scatter plots and line-graphs, although we did print out a couple of histograms. Many other types of graphs are possible - Box-plots, Density plots, heatmaps - and there are infinitely many ways of altering the designs of the graphs you’ve already generated. R can do all this stuff, and by learning the basics you’ve already got half way to any graph you can draw on paper.
I’d strongly recommend you persevere with ggplot2 if you want to use R to make graphs for papers / presentations. The syntax might be a bit difficult to initially learn and it places some constraints on how you organise your datasets, but it’s pretty consistent once you’ve learned it (and makes you think how your data should really have been organised anyway).
ggplot(some_data, aes(x = some_col, y = another_col, ...)) + geom_***()
… hopefully makes a bit more sense now.
There’s loads of good material for learning R graphics and learning visualisation more generally, in more detail.
Here’s some examples:
There’s a ggplot2 cheat-sheet, but you’ll need to be pretty fluent to follow it
You could try the swirl package if you want a more well-rounded introduction to R. Their “Exploratory_Data_Analysis” course has more ggplot2 visualisation stuff as well.
You could try working through one of the data-carpentry workshops on R and data visualisation.
Just search.
Or: draw a sketch of a graph you want to make
Or: find a paper containing a graph you want to make
… and hack at R until you can make it in code [10].
These are things you might find interesting, and I think they’re relevant to the current workshop. But we probably won’t get into them in the available time.
Duplicated code smells. When you’re programming, if you find yourself writing the same code over-and-over, have a think whether you can rewrite it to reduce that duplication.
Let’s say you are using a dataset, and wanted to make several related plots. Maybe you just want to monkey around with formatting until you’re happy. It would be a bit wasteful to have to type in all of the following each time, just to play around with the values in scale_colour_manual.
ggplot(my_data, aes(x = blah, y = why)) +
geom_my_favourite_plottype() +
geom_some_overlayed_plot() +
labs(title = "A really long title", x = "...", y = "yawnnnn") +
xlim(0, NA) +
ylim(NA, 0) +
scale_colour_manual(values = c("red", "firebrick", "indianred4"))
Instead, you can store ggplot objects inside a variable, and manipulate them later.
For example, the following code sets up a ggplot object called p, but it doesn’t plot it out.
# store all the parts of the plot that you've already decided on in `p`
p <- ggplot(tidy_hills, aes(x = time, fill = dist > 5)) +
geom_histogram(bins = 15) +
labs(title = "A really long title", x = "Yawnnnn", y = "Count")
However, if you evaluate p it will print out the plot:
p
So you can now monkey around with the formatting until you’re happy:
p + scale_fill_brewer()
Nah, that’s no good. Try again:
# You'll have to find out about `theme`s yourself ...
p +
scale_fill_brewer(palette = "PuOr") +
theme_dark() +
theme(axis.title.y = element_text(angle = 0, vjust = 0.5))
Being able to pass plots around like they are normal variables in R is a huge benefit of ggplot2 over base-R graphics. In the latter, you can only add stuff to a plot while it is painted up on the current graphics device (typically the screen), so you end up copying and pasting a lot of code before you get your figure right.
R grew out of statistical applications. So there is extensive support for statistical modelling / summarising functions.
The code to run a linear regression of a response variable, y, against a predictor x, looks like this:
# lm(y ~ x)
# OR, if the variables are columns within a dataframe:
# lm(y ~ x, data = my_data_frame)
The regression coefficients can be extracted from the returned object using the coef function.
So, this runs a regression model of y1 against x1 in the Anscombe quartet, and then returns the regression coefficients:
# To regress the column y1 against the column x1:
coef(lm(y1 ~ x1, data = anscombe))
## (Intercept) x1
## 3.0000909 0.5000909
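If you keep the fitted model in a variable, you can reuse it. The following is a small sketch (the name fit and the value x1 = 10 are arbitrary choices for illustration) that pulls out the coefficients by name and checks a prediction from predict() against the same value computed by hand:

```r
# Store the fitted model rather than passing it straight to coef()
fit <- lm(y1 ~ x1, data = anscombe)
coefs <- coef(fit)

# The coefficients are a named vector, so they can be extracted by name
intercept <- coefs[["(Intercept)"]]
slope <- coefs[["x1"]]

# Predicted y1 at x1 = 10, computed two ways
by_predict <- predict(fit, newdata = data.frame(x1 = 10))
by_hand <- intercept + slope * 10

# The two should agree (up to floating-point error)
all.equal(unname(by_predict), by_hand)
```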
Similarly, you can run a regression of x1 against y1:
# To regress x1 against y1:
coef(lm(x1 ~ y1, data = anscombe))
## (Intercept) y1
## -0.9975311 1.3328426
To compute the correlation coefficient you use the following function:
# To correlate x1 and y1
cor(anscombe$x1, anscombe$y1)
## [1] 0.8164205
Feel free to check these values for the other column-pairs (substitute your chosen columns for x1/y1).
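One way to run that check for all four pairs at once is sketched below (anscombe_summary is just a name chosen here). Anscombe’s quartet was constructed so that the pairs share nearly identical summary statistics, so the four rows should look very similar:

```r
# Fit y_i ~ x_i for each of the four Anscombe pairs and collect
# the intercept, slope and correlation coefficient for each one
anscombe_summary <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]),
    correlation = cor(x, y)
  )
}))
rownames(anscombe_summary) <- paste0("pair", 1:4)

# One row per pair, rounded for easier comparison
round(anscombe_summary, 2)
```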
We didn’t talk about the dplyr package in much detail, although it was used to manipulate the hills dataset. It contains tools for data-frame manipulation, like filtering on observations (rows), subsetting variables (columns), computing new variables, summarising variables over subsets of observations…
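As a rough sketch of those four kinds of operation, here they are on the built-in mtcars dataset (chosen only so that the code runs on its own; the intermediate names light_cars and by_cyl are made up for illustration):

```r
library(dplyr)

# Filter on observations (rows): cars lighter than 3000 lbs
light_cars <- dplyr::filter(mtcars, wt < 3)

# Subset variables (columns)
light_cars <- dplyr::select(light_cars, mpg, cyl, wt)

# Compute a new variable: `wt` is in units of 1000 lbs,
# and 1 lb is about 0.4536 kg
light_cars <- dplyr::mutate(light_cars, wt_kg = wt * 1000 * 0.4536)

# Summarise variables over subsets of observations
# (here, one row per cylinder count)
by_cyl <- dplyr::summarise(
  dplyr::group_by(light_cars, cyl),
  n_cars = dplyr::n(),
  mean_mpg = mean(mpg)
)
by_cyl
```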
Suppose you only wanted to look at those hills with a climb < 2000 feet and a distance < 10 miles. You could filter the hills dataset like this:
# try it yourself
dplyr::filter(tidy_hills, climb < 2000 & dist < 10)
You can pass the results of this kind of filtering code into ggplot:
sub_data <- dplyr::filter(tidy_hills, climb < 2000 & dist < 10)
ggplot(sub_data, aes(x = time)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There was a single hill in this sub-dataset that had a record time above 1 hour. We can put the name of this isolated hill on a scatter plot to identify it. This can be done by filtering within a call to geom_text:
ggplot(tidy_hills, aes(x = dist, y = time)) +
geom_point(aes(size = climb)) +
geom_text(data = dplyr::filter(tidy_hills, dist < 5 & time > 75),
aes(label = peak, vjust = 2)) +
xlim(0, NA) +
ylim(0, NA)
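The general pattern is that any geom can be given its own data argument, which replaces the plot-level dataset for that layer only. A self-contained sketch on mtcars (the names car_data, heavy and labelled_plot, and the 5000 lbs cut-off, are arbitrary choices for illustration):

```r
library(ggplot2)

# mtcars keeps the car names as row names; move them into a column
car_data <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# The subset of rows that should be labelled
heavy <- car_data[car_data$wt > 5, ]

labelled_plot <- ggplot(car_data, aes(x = wt, y = mpg)) +
  geom_point() +                 # drawn for every row of `car_data`
  geom_text(data = heavy,        # drawn only for the rows of `heavy`
            aes(label = model),
            vjust = -1)

print(labelled_plot)
```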
We used the following code to overlay line-graphs for the Anscombe quartet:
ggplot(data = tidy_ans, mapping = aes(x = x, y = y, col = group)) +
geom_point() +
geom_line()
As a means to distinguish the four groups in the quartet, it was rather poor: the individual points couldn’t be seen, and if anything the lines obscured each other. Fortunately, we can use the group category in the tidy version of the Anscombe dataset to easily make a 4-panel plot (as we did in base-R). This uses the ggplot2 function facet_wrap:
ggplot(data = tidy_ans, mapping = aes(x = x, y = y, col = group)) +
geom_point() +
geom_line() +
facet_wrap(~ group, ncol = 2) +
labs(x = "X axis", y = "Y axis")
Note:
- how few lines of code were required;
- that matching up the x- and y-axis ranges is automated;
- and that we didn’t have to alter any volatile global parameters;
… compared to the base-R version of the same plot. But we did have to tidy up the raw dataset before we could get the code to look this neat.
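For reference, a tidy version of the quartet can be rebuilt in a few lines of base R. This is only a sketch of one way to do it, and not necessarily how tidy_ans was made earlier in the workshop: the column names x, y and group are taken from the plotting code above, while the group labels "1" to "4" and the names tidy_sketch and four_panel are assumptions.

```r
library(ggplot2)

# Stack the four (x_i, y_i) column pairs on top of each other,
# recording which pair each row came from
tidy_sketch <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    group = as.character(i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# One panel per group; by default the panels share their axis ranges
four_panel <- ggplot(tidy_sketch, aes(x = x, y = y, col = group)) +
  geom_point() +
  facet_wrap(~ group, ncol = 2)

print(four_panel)
```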
This document was written as a ~ 3h workshop for cancer scientists, as an introduction to plotting in R. We subsequently ran a plotting / coding club within our department to develop these visualisation skills.↩
The “base-R” functions are those that are available in R before you’ve loaded any extension packages.↩
If you use R ever again, you might start getting frustrated with the base R data.frame structure: why did it convert my strings into factors? why does it print the whole thing to the screen by default? why did it turn into a vector when I removed that column? I urge you to look into the tibble package (especially) or the data.table package (for very large data-frames).↩
Those playing ‘data-viz bingo’ might want to mark off the Anscombe’s quartet square.↩
To me, all the side-effects involved in making this plot are way uglier than the plot itself (that is, setting and resetting the global options, the four separate calls to plot()). The plot’s perfectly reasonable.↩
When installing a bunch of packages all at once, as an alternative to calling install.packages multiple times, you could do the following. Put all the package names that you want to install into a vector called pkgs and install them all at the same time using install.packages(pkgs).↩
If this happens to you, have a look at how to change the environment-variable “R_LIBS_USER” in this Stack-Overflow question. Then add a directory called “C:/rpackages” (or similar) to your file-system and add this location as your R_LIBS_USER environment variable, then restart Rstudio. The new directory should be the first entry when you type .libPaths() into the console and R should now save any downloaded packages into this directory.↩
The head function returns the first few rows of a dataset.↩
You can get a more recent dataset (hills2000) from the DAAG package if you want.↩
If you swear loud enough, I’ll probably come and help out.↩