ggplot2Some of the exercises below have been adapted from Chapter 4 of Getting Started with R by Beckerman, Childs and Petchey.
Data Visualization by Kieran Healy is the best I have found on the use of the ggplot2 package for making superb figures for exploratory data analysis and for the communication of information from data. The whole book is available free online.
The R Graphics Cookbook by Winston Chang gives many recipes for doing this or that within ggplot2.
You will also find the documentation on ggplot2 very useful.
The Data visualisation and Graphics for communication chapters in R for Data Science by Grolemund and Wickham
The Graphics with ggplot chapter in An Introduction to R by Douglas et al. is concise and extremely clear. It gives a very good idea of how a plot done using this package is built up in layers, rather like maps in GIS.
Before you do any statistical analysis of your data it is almost always a good idea to plot the data in some way. This will very likely tell you whether or not the relationships you suspect to be present in the data are or are not present. In this exercise we introduce you to a powerful plotting package called ggplot2 that is one of the packages within tidyverse.
There are other way to do plots in R that are also very powerful. Within base R, that is without need for any extra packages, you can produce publication quality plots, and if you want to go down that road, don’t let me stop you.
However, once you get the hang of the general approach of the tidyverse and get to the point where you want to produce a plot, you will find that ggplot2 is the obvious thing to use to do them.
Make no mistake, to get a plot that includes all the finicky things you want for your dissertation or paper will probably take some time, however you produce it. This time will very likely include a lot of Google queries. But, to do exploratory data analysis, where you just want plots that will give you insight into your data along the way towards carrying out statistical tests, ggplot2 is a tool that can quickly give you what you need,
To begin our exploration of ggplot2, we will again investigate the compensation.csv data set.
This has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.
You can find this on the Teams/Files/RStuff/data folder and save it into your own RStuff/data folder.
You could put it anywhere, but it will help enormously if we are organised and have a regular place where we put our data within our Project. On which topic…
Project, and open the projectA Project in RStudio terms is just a folder (aka directory or repository) with added superpowers. You can ignore those powers and just treat your Rtuff folder as a regular folder, if you want. For our purposes, denoting our RStuff folder as a Project enables the here package to work so that we don’t have to worry about ‘Working Directory’ stuff or busy ourselves with the arcane syntax of computer pathways when trying to find a file. That’s reason enough to use Projects, but it turns out there are other benefits further down the line. My advice is : use them!
Once you open your project, the Environment pane should be empty, and you should see the contents of your RStuff folder displayed under the Files tab in the bottom right window.
Create a new R Notebook in RStudio. Name it visualising_data and save it into your RStuff/scripts folder. RStudio will add the .Rmd extension to it so you don’t have to do that.
In the script pane, amend the yaml section appropriately as you have done before.
---
title: "Visualising data in R using `ggplot2`"
author: Your name
date: dd-mm-yyyy
output:
html_document:
df_print: paged
---
Delete everything below the yaml.
Now read the rest of these instructions and enter each of the chunks below, implementing them as you go. Include suitable headers and notes before each chunk to let the reader know what is happening.
Remember that each code chunk needs to be placed between rows of three back ticks, like this:
```{r}
# the code goes here
```
After entering each code chunk, place your cursor somewhere on the line or highlight the lines if there are more than one, then press Ctrl-Enter or Command-Enter if you are using a Mac. You should see the line being implemented in the console pane and if any plots are generated, they will appear in the plot pane, bottom right, so long as you select the Plots tab.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Much of the power of R lies in what you can do with the packages that you can build into it on top of the base R that you get when you first install it.
Most scripts you write will make use of some package or other. Remember that the first time you use a package you have to install it on your machine by typing install.packages("<name of package - note the surrounding quotes>") into your console window - you don’t want this line in your script. Thereafter, you have to make that package available to your scripts by including in your scripts lines like these, where you write library(<name of package - but leave out the quotes>)
library(tidyverse)
library(here)
library(kableExtra)
If this throws an error, then one or other of the tidyverse or here packages have not yet been installed on your machine. The error message(s) will tell you which. To sort this, you need to type install.packages('tidyverse') and/or install.packages('here')or install.packages('kableExtra') into your console window, not in your script.
This will install tidyverse or here. You should then run the library(....) lines again.
Remember that tidyverse is just a big Christmas stocking full of other useful (like, really, really useful) packages. I suggest that every one of your scripts includes and uses it. Also here(), which removes much of the headaches around working directories, and finding your files. kableExtra is useful for producing nice tables. In time, you will gather a list of packages that you load by default. In the world of ecology, I suspect that the vegan() package might be one of them.
Now we load into R the compensation.csv data set that we are going to analyse. This can be found in the RStuff/data folder of the Files tab within Teams and should be saved into your RStuff/data folder.
mypath<-here("data", "compensation.csv")
compensation<-read_csv(mypath)
Remind yourself what this data set contains
glimpse(compensation)
In this data set, Fruit production and Root mass are response variables to the type of grazing (Grazed or Ungrazed).
There are several ways to inspect the data. One way is
head(compensation)
which shows you the first few lines. Another is is to click on the text of the name of the compensation object in the environment pane. That will show you the data in a spreadsheet-like view in the scripts pane.
Another is summary(<name of data frame>). That gives you summary statistics of each of the columns.
summary (compensation)
This is useful especially for seeing if you have any missing values in your dataset, which will show up as ‘NA’s, but is not very informative about columns that contains categorical (’factors’ in RSpeak) data, as in our Grazing column.
I often do all of these.
Before we get going with plotting the data, let us just look at some summary statistics for each grazing condition. To do this we use the powerful combination of the group_by() and summarise() commands plus the pipe operator %>% which feeds the results of one line into the next line.
In English, this next code chunk means:
summary_table_rootcompensation data frame and thensummary_table_root#summary table for the Root values
summary_table_root<-compensation %>%
group_by(Grazing) %>%
summarise(min.Root = min(Root), med.Root = median(Root), mean.Root = mean(Root, na.rm = TRUE), max.Root = max(Root))
summary_table_root
And here is the same for the Fruit column
#summary table for the Fruit values
summary_table_fruit<-compensation %>%
group_by(Grazing) %>%
summarise(min.Fruit = min(Fruit), med.Fruit = median(Fruit), mean.Fruit = mean(Fruit), max.Fruit = max(Fruit))
summary_table_fruit
Later, when we come to consider particular statistical tests for our data, we may well need to consider how the data are distributed, and in paticular whether they are normally distributed.
Do these summary tables suggest that this may well be the case, or not?
ggplot2: a grammar for graphicsggplot2 is a powerful package within tidyverse for drawing graphics. It creates plots in layers. The possibilities are endless, so here we will just work through a few examples
A simple, plain, bivariate scatter plot of the data.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
geom_point() # this is the second layer- a 'geometric' layer which in this case adds points
In the code, the information is added to the plot in succesive layers, each written on a separate line and separated from the next one by a +.
The first line specifies what we want to plot. It has the aes (short for aesthetic) argument. In this you specify how information from the data will be mapped to aspects of the plot. What will be plotted along the x-axis, what variable will determine the colour of the points, and so on?
In the second line we specify what type of plot we want using one of many possible geom_....() functions. Type ?geom into the console and see how many options you.are offered. Don’t worry, a few of these will do most of what you want most of the time and whenever they don’t then there is plenty of help available. Here we want a scatter plot, for which geom_point() is the correct choice.
There are many ways in which you can customise the basic plot given above. Here are a few:
Don’t like the grey background? Add a theme layer. There are a few built into R, and others are available through packages. I often use theme_cowplot() from the cowplot package.
library(cowplot) # I usually have this line in my top 'load all the packages' code chunk
compensation %>%
ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
geom_point() +# this is the second layer- a 'geometric' layer which in this case adds points
theme_cowplot()
We normally do this, even for on-the-fly exploratory plots, since it is helpful and easy to do.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point() +
labs(x = 'Root Biomass',
y = 'Fruit Production') +
theme_cowplot()
We can include a size argument in the geom_point() line.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point(size = 5) + # all the points will be size 5 (big!)
theme_cowplot()
Notice that because the size = 5 argument appears in the geom_point() line and not in the first line, in the aes() function, it affects all points. It is not mapped to any particular set of points depending on values in some column of the data.
Just as we changed the size of the points, we can change the colour or shape of all of them using a colour = or shape = argument in the geom_point() layer.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point(colour = "blue") + # all the points will be coloured blue
xlab('Root Biomass') +
ylab('Fruit Production') +
theme_cowplot()
While perhaps prettier than the default black, the colour in the previous plot has added no new information to the plot. Nor would it have helped if we had used triangular symbols instead of circles.
While we might do those things to fit in with some general style we wanted our plot to have, (or just because we liked the colour blue!) we can use colour, size, shape and so on to add information if the appearance attribute of the symbols is mapped to some feature of the data. That is, for example, if the points are coloured or shaped or sized differently depending on the value of some variable, such as species, site or as here, grazing condition.
We do this in the first, ggplot layer, as an additional arguemt within the aes() function. All the arguments to this map variables (ie columns) of the dataset to aspects of the plot.
Like this:
compensation %>%
ggplot(aes(x = Root, y = Fruit, colour = Grazing)) + # the colour will depend on the value of Grazing.
geom_point() +
xlab('Root Biomass') +
ylab('Fruit Production') +
theme_cowplot()
Notice that by default we now get a legend, which in this plot we need. You can move adjust or even remove this legend altogether if you want.
See if you can adapt the previous code chunk to give a scatter plot of Fruit Production vs Root Biomass, with a different symbol depending on whether the tree was in grazed or ungrazed pasture.
You just have to adapt the code above to do what you want (Hint: shape). Often, when we use R, we do not have to write scripts from scratch. We just adapt existing scripts that almost do what we want so that they do exactly what we want.
Often, when we present data we want to convey some idea of their central value (mean, median etc) and some idea of their spread. A bar plot hides much of this information. A Box and Whisker plot is a good way to show it.
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot()
For each data set, the thick bar in the middle shows the median, the box shows the interquartile range, from the 25th to the 75 percentile, by default the whiskers show the distance to the points whose distance from the median is 1.5 times the width of the interquartile range, and any outliers are shown as individual points.
Note that we have by default been given a legend. Since hre this adds no information that was not already in the axis labels, we would like to remove it. This can be done with a final theme() layer:
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
You can use plots like this to see whether two data sets are similar or markedly different. If your research hypothesis was about whether two or more groups were different, a box plot often goes some way towards giving you an answer.
Do a similar plot to the last one, but this time show how Root Mass varies with Grazing.
If there are not too many data points, it can be instructive to include these in the box plot. To do this we just amend the code for drawing the basic box plot with another line (think of this as another layer) where we include a geom_point(),
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) +
geom_boxplot() +
geom_point(size = 4, colour = 'lightblue', alpha = 0.5) + # all the points will have these attributes
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
Do you see how we added a layer to the plot we had earlier?
Do you also see again how arguments within the aesthetic of the first ggplot line map to elements of the data, so that, for example, the fill colour depends on the value of Grazing, whereas the arguments in the geom... layers do not map to any subset of the data? All points will be size 4, light blue and so on.
What does the alpha variable do?
Don’t like these colours? R recognises an extensive vocabulary of colour names, including all the obvious ones - so it is easy to guess and get something that will work. The full list is shown here. If that list is not enough you can also use palettes which make available a vast array of colour schemes suitable for different purposes, such as in maps. The rcolorbrewer() package is especially good. Trust me, there is a whole world of colour fun out there if you want it.
p1<-compensation %>%
ggplot(aes(x = Root, y = Fruit, colour = Grazing)) +
geom_point() +
xlab('Root Biomass') +
ylab('Fruit Production') +
theme_cowplot()
p2<-compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) +
geom_boxplot() +
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)
Can you use either of these to answer these questions?
Looking at the distribution of our variables is very important. It can tell us about the central tendency and spread of the data, and give us an idea as to whether particular statistical tests will be valid.
A histogram is a good way to do this.
In a histogram, counts are made of numeric observations whose value fall within defined ranges or, as they are called, ‘bins’.
Let us produce a histogram of Fruit production.
For a histogram, we onlt specify the variable to be mapped to the x axis, in this case fruit mass, since the y-variable will be the counts of observations where the value of that variable fell within each bin.
g9<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram() +
theme_cowplot()
g9
That wasn’t very nice! Thre are too many bins, so that hardly any observations fall in any one bin. We need to change the bin width, or, alternatively, the number of bins.
Let us try both…
pn<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram(bins = 10) +
#title('Fixing the number of bins') +
theme_cowplot()
pw<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram(binwidth = 15) +
theme_cowplot()
# this is a handy function from the gridExtra() package that you can use to arrange plots
grid.arrange(pn, pw, nrow = 1)
Try using more/narrower bins or fewer/fatter bins.
Do you see that, for a histogram to be informative, the number of bins and/or the bin width has to be chosen carefully?
From these histograms, can you tell roughly where the data is centered, and roughly how variable it is?
When we want several plots done in the same style, one for each of the different values of a categorical variable (or factor as R would say) the facet_wrap() and facet_grid() functions are useful;
compensation %>%
ggplot(aes(x = Fruit)) +
geom_histogram(binwidth = 15) +
facet_wrap(~Grazing, ncol = 2) +
theme_cowplot()
Try altering the ncol = 2 argument to ncol = 1. What happens? When might doing that be useful?
There are very many other plot types you can do using ggplot. Here are a couple:
To illustrate this, we will make up a dataset of discrete count datav: fruit types:
fruit_data<-tibble(fruits = c(rep("apple", 11), rep("orange", 7), rep("pear", 4)))
glimpse(fruit_data)
## Rows: 22
## Columns: 1
## $ fruits <chr> "apple", "apple", "apple", "apple", "apple", "apple", "apple", …
That gives a data frame with one categorical variable, fruits. To see how many there are of each type we can use the table() function:
table(fruit_data$fruits)
##
## apple orange pear
## 11 7 4
and we can illustrate this count data with a bar vhart, for which we use geom_bar() within ggplot:
fruit_data %>%
ggplot(aes(x = fruits)) +
geom_bar() +
theme_cowplot()
We don’t need to specify a y variable, since geom_bar() understands that it is supposed to count up how many of each of the x variable the data set contains.
If our dataset is already a set of counts, for example:
fruit_count<-fruit_data %>%
group_by(fruits) %>%
summarise(Count = n())
fruit_count
then we can get a bar plot of these counts most simply by using using geom_col()
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col() +
theme_cowplot()
There are many, many variations you can make to these basic plots. For example, you might want some colour:
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col(fill = "darkred") +
theme_cowplot() +
theme(legend.position="none")
Notice the line theme(legend.position="none") line which we have used before. It suppresses a legend, which we would get by default but do not need here.
To view the plot sideways we add the coord_flip() line:
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col(fill = "cornflowerblue") +
coord_flip() +
theme_cowplot() +
theme(legend.position="none")
This is useful when you have lots of categories and the category names, eg species names, are too long to fit under each column.
Let us use ggplot2 to display the data in the iris.csv data file that should already be in your RStuff/data folder. If not, you should be able to get it from the RStuff/data folder on Teams.
Create a new R Notebook called iris_plot and save it in your RStuff/scripts folder
tidyverse and hereirisglimpse()facet_wrap() to produce 3 histograms for sepal width, one for each species, one above the other. Adjust the bin number until the histograms look about right.# load packages
library(tidyverse)
library(here)
# load the data
filepath <- here("data","iris.csv")
iris<-read_csv(filepath)
#inspect it
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "setosa…
There is only one variable per column and each variable appears only once in each row, so the data is tidy
means_table<-iris %>%
group_by(Species) %>%
summarise(mean.sepal.length=mean(Sepal.Length),mean.sepal.width=mean(Sepal.Width))
means_table
means_table %>%
kbl(digits=2) %>%
kable_styling(full_width=F)
| Species | mean.sepal.length | mean.sepal.width |
|---|---|---|
| setosa | 5.01 | 3.43 |
| versicolor | 5.94 | 2.77 |
| virginica | 6.59 | 2.97 |
# Scatter plot, sepal length vs sepal width
iris %>%
ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species))+
geom_point()+
xlab('Sepal Width')+
ylab('Sepal Length')+
theme_cowplot()
There appears to be a positive correlation between sepal length and sepal width for each of the species.
# Box plots for sepal length
iris %>%
ggplot(aes(x=Species,y=Sepal.Length,fill=Species))+
geom_boxplot()+
theme_cowplot()+
theme(legend.position="none") # remove the legend as it is redndant in this plot
It looks as though the sepal lengths do differ between species, with setosa < versicolor < virginica. Having seen this plot and got this idea we would properly establish whether or not the observed differences could be down to chance by using either an ANOVA or Kruskal-Wallis test.
# Histograms for sepal width, one for each species, one above the other
iris %>%
ggplot(aes(x=Sepal.Width))+
geom_histogram()+
facet_wrap(~Species,ncol=1) +
theme_cowplot()
This defaults to 30 bins, which is too many for these data. Let’s reduce it until we get something that looks about right:
iris %>%
ggplot(aes(x=Sepal.Width))+
geom_histogram(bins=10)+
facet_wrap(~Species,ncol=1) +
theme_cowplot()
I think around 10 bins is ‘about right’. What happens if we make the bin number too low?
Each data set is roughly symmetrically distributed. They could well be normally distributed but it is difficult to tell from histograms of samll samples. A more effective graphical way to tell is to use [quantile-quantile plots](https://rpubs.com/mbh038/725314.
The sepal width of setosa appears to be greater than that of the other two, between which there is no obvious difference. We can see this easily because we have used facet_wrap() and placed the sub-figures one above the other.