Visualising data in R using ggplot2

Sources

Some of the exercises below have been adapted from Chapter 4 of Getting Started with R by Beckerman, Childs and Petchey.
Data Visualization by Kieran Healy is the best I have found on the use of the ggplot2 package for making superb figures for exploratory data analysis and for the communication of information from data. The whole book is available free online.
The R Graphics Cookbook by Winston Chang gives many recipes for doing this or that within ggplot2.
You will also find the documentation on ggplot2 very useful.
The Data visualisation and Graphics for communication chapters in R for Data Science by Grolemund and Wickham
The Graphics with ggplot chapter in An Introduction to R by Douglas et al. is concise and extremely clear. It gives a very good idea of how a plot done using this package is built up in layers, rather like maps in GIS.

Introduction

Before you do any statistical analysis of your data it is almost always a good idea to plot the data in some way. This will very likely tell you whether or not the relationships you suspect to be present in the data are or are not present. In this exercise we introduce you to a powerful plotting package called ggplot2 that is one of the packages within tidyverse.

There are other way to do plots in R that are also very powerful. Within base R, that is without need for any extra packages, you can produce publication quality plots, and if you want to go down that road, don’t let me stop you.

However, once you get the hang of the general approach of the tidyverse and get to the point where you want to produce a plot, you will find that ggplot2 is the obvious thing to use to do them.

Make no mistake, to get a plot that includes all the finicky things you want for your dissertation or paper will probably take some time, however you produce it. This time will very likely include a lot of Google queries. But, to do exploratory data analysis, where you just want plots that will give you insight into your data along the way towards carrying out statistical tests, ggplot2 is a tool that can quickly give you what you need,

Preliminaries

To begin our exploration of ggplot2, we will again investigate the compensation.csv data set.

This has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.

You can find this on the Teams/Files/RStuff/data folder and save it into your own RStuff/data folder.

Put the data set into your RStuff/data folder

You could put it anywhere, but it will help enormously if we are organised and have a regular place where we put our data within our Project. On which topic…

Make sure that your RStuff folder is a `Project`, and open the project

If it is, it will contain an RStuff.RProj file. Click on this to open the project.
If it isn’t yet, make it become one by clicking File/New Project/Existing Directory

A Project in RStudio terms is just a folder (aka directory or repository) with added superpowers. You can ignore those powers and just treat your Rtuff folder as a regular folder, if you want. For our purposes, denoting our RStuff folder as a Project enables the here package to work so that we don’t have to worry about ‘Working Directory’ stuff or busy ourselves with the arcane syntax of computer pathways when trying to find a file. That’s reason enough to use Projects, but it turns out there are other benefits further down the line. My advice is : use them!

Once you open your project, the Environment pane should be empty, and you should see the contents of your RStuff folder displayed under the Files tab in the bottom right window.

Create a new R Notebook.

Create a new R Notebook in RStudio. Name it visualising_data and save it into your RStuff/scripts folder. RStudio will add the .Rmd extension to it so you don’t have to do that.

In the script pane, amend the yaml section appropriately as you have done before.

---
title: "Visualising data in R using `ggplot2`"
author: Your name
date: dd-mm-yyyy
output:
  html_document:
    df_print: paged
---

Delete everything below the yaml.

Now read the rest of these instructions and enter each of the chunks below, implementing them as you go. Include suitable headers and notes before each chunk to let the reader know what is happening.

Remember that each code chunk needs to be placed between rows of three back ticks, like this:

```{r}
# the code goes here
```

After entering each code chunk, place your cursor somewhere on the line or highlight the lines if there are more than one, then press Ctrl-Enter or Command-Enter if you are using a Mac. You should see the line being implemented in the console pane and if any plots are generated, they will appear in the plot pane, bottom right, so long as you select the Plots tab.

The set-up chunk

knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Load packages we need

Much of the power of R lies in what you can do with the packages that you can build into it on top of the base R that you get when you first install it.

Most scripts you write will make use of some package or other. Remember that the first time you use a package you have to install it on your machine by typing install.packages("<name of package - note the surrounding quotes>") into your console window - you don’t want this line in your script. Thereafter, you have to make that package available to your scripts by including in your scripts lines like these, where you write library(<name of package - but leave out the quotes>)

library(tidyverse)
library(here)
library(kableExtra)

If this throws an error, then one or other of the tidyverse or here packages have not yet been installed on your machine. The error message(s) will tell you which. To sort this, you need to type install.packages('tidyverse') and/or install.packages('here')or install.packages('kableExtra') into your console window, not in your script.

This will install tidyverse or here. You should then run the library(....) lines again.

Remember that tidyverse is just a big Christmas stocking full of other useful (like, really, really useful) packages. I suggest that every one of your scripts includes and uses it. Also here(), which removes much of the headaches around working directories, and finding your files. kableExtra is useful for producing nice tables. In time, you will gather a list of packages that you load by default. In the world of ecology, I suspect that the vegan() package might be one of them.

Load the data into R

Now we load into R the compensation.csv data set that we are going to analyse. This can be found in the RStuff/data folder of the Files tab within Teams and should be saved into your RStuff/data folder.

mypath<-here("data", "compensation.csv")
compensation<-read_csv(mypath)

Inspect the data

Remind yourself what this data set contains

glimpse(compensation)

In this data set, Fruit production and Root mass are response variables to the type of grazing (Grazed or Ungrazed).

Is this data tidy?
What makes it so, or not so?

There are several ways to inspect the data. One way is

head(compensation)

which shows you the first few lines. Another is is to click on the text of the name of the compensation object in the environment pane. That will show you the data in a spreadsheet-like view in the scripts pane.

Another is summary(<name of data frame>). That gives you summary statistics of each of the columns.

summary (compensation)

This is useful especially for seeing if you have any missing values in your dataset, which will show up as ‘NA’s, but is not very informative about columns that contains categorical (’factors’ in RSpeak) data, as in our Grazing column.

I often do all of these.

Summary statistics by group

Before we get going with plotting the data, let us just look at some summary statistics for each grazing condition. To do this we use the powerful combination of the group_by() and summarise() commands plus the pipe operator %>% which feeds the results of one line into the next line.

In English, this next code chunk means:

Let’s create a data frame called summary_table_root
Start from the compensation data frame and then
Group this by the values in the Grazing column and then
For each group, find the minimum, median, mean and maximum values of the Root column.
Finally, print out summary_table_root

#summary table for the Root values
summary_table_root<-compensation %>%
  group_by(Grazing) %>%
  summarise(min.Root = min(Root), med.Root = median(Root), mean.Root = mean(Root, na.rm = TRUE), max.Root = max(Root))
summary_table_root

And here is the same for the Fruit column

#summary table for the Fruit values
summary_table_fruit<-compensation %>%
  group_by(Grazing) %>%
  summarise(min.Fruit = min(Fruit), med.Fruit = median(Fruit), mean.Fruit = mean(Fruit), max.Fruit = max(Fruit))
summary_table_fruit

Later, when we come to consider particular statistical tests for our data, we may well need to consider how the data are distributed, and in paticular whether they are normally distributed.

Do these summary tables suggest that this may well be the case, or not?

Visualising data

`ggplot2`: a grammar for graphics

ggplot2 is a powerful package within tidyverse for drawing graphics. It creates plots in layers. The possibilities are endless, so here we will just work through a few examples

Scatter plots

A simple, plain, bivariate scatter plot of the data.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
  geom_point() # this is the second layer- a 'geometric' layer which in this case adds points

In the code, the information is added to the plot in succesive layers, each written on a separate line and separated from the next one by a +.

The first line specifies what we want to plot. It has the aes (short for aesthetic) argument. In this you specify how information from the data will be mapped to aspects of the plot. What will be plotted along the x-axis, what variable will determine the colour of the points, and so on?

In the second line we specify what type of plot we want using one of many possible geom_....() functions. Type ?geom into the console and see how many options you.are offered. Don’t worry, a few of these will do most of what you want most of the time and whenever they don’t then there is plenty of help available. Here we want a scatter plot, for which geom_point() is the correct choice.

Customise the plot

There are many ways in which you can customise the basic plot given above. Here are a few:

Get rid of the grey background if you don’t like it

Don’t like the grey background? Add a theme layer. There are a few built into R, and others are available through packages. I often use theme_cowplot() from the cowplot package.

library(cowplot) # I usually have this line in my top 'load all the packages' code chunk
compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
  geom_point() +# this is the second layer- a 'geometric' layer which in this case adds points
  theme_cowplot()

Change the x and y axis labels

We normally do this, even for on-the-fly exploratory plots, since it is helpful and easy to do.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point() +
  labs(x = 'Root Biomass', 
       y = 'Fruit Production') +
  theme_cowplot()

Change the size of the points

We can include a size argument in the geom_point() line.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point(size = 5) + # all the points will be size 5 (big!)
  theme_cowplot()

Notice that because the size = 5 argument appears in the geom_point() line and not in the first line, in the aes() function, it affects all points. It is not mapped to any particular set of points depending on values in some column of the data.

Change the colours of all the points

Just as we changed the size of the points, we can change the colour or shape of all of them using a colour = or shape = argument in the geom_point() layer.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point(colour = "blue") + # all the points will be coloured blue 
  xlab('Root Biomass') +
  ylab('Fruit Production') +
  theme_cowplot()

Change colours of the points to match levels (Grazed, Ungrazed) within the data

While perhaps prettier than the default black, the colour in the previous plot has added no new information to the plot. Nor would it have helped if we had used triangular symbols instead of circles.

While we might do those things to fit in with some general style we wanted our plot to have, (or just because we liked the colour blue!) we can use colour, size, shape and so on to add information if the appearance attribute of the symbols is mapped to some feature of the data. That is, for example, if the points are coloured or shaped or sized differently depending on the value of some variable, such as species, site or as here, grazing condition.

We do this in the first, ggplot layer, as an additional arguemt within the aes() function. All the arguments to this map variables (ie columns) of the dataset to aspects of the plot.

Like this:

compensation %>%
  ggplot(aes(x = Root, y = Fruit, colour = Grazing)) +  # the colour will depend on the value of Grazing.
  geom_point() +
  xlab('Root Biomass') +
  ylab('Fruit Production') +
  theme_cowplot()

Notice that by default we now get a legend, which in this plot we need. You can move adjust or even remove this legend altogether if you want.

Change shapes of the points to match levels (Grazed, Ungrazed) within the data

See if you can adapt the previous code chunk to give a scatter plot of Fruit Production vs Root Biomass, with a different symbol depending on whether the tree was in grazed or ungrazed pasture.

You just have to adapt the code above to do what you want (Hint: shape). Often, when we use R, we do not have to write scripts from scratch. We just adapt existing scripts that almost do what we want so that they do exactly what we want.

Box and Whisker plots

Often, when we present data we want to convey some idea of their central value (mean, median etc) and some idea of their spread. A bar plot hides much of this information. A Box and Whisker plot is a good way to show it.

How does Fruit production vary, and how does it depend on the type of grazing?

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
  geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot()

For each data set, the thick bar in the middle shows the median, the box shows the interquartile range, from the 25th to the 75 percentile, by default the whiskers show the distance to the points whose distance from the median is 1.5 times the width of the interquartile range, and any outliers are shown as individual points.

Note that we have by default been given a legend. Since hre this adds no information that was not already in the axis labels, we would like to remove it. This can be done with a final theme() layer:

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
  geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")

You can use plots like this to see whether two data sets are similar or markedly different. If your research hypothesis was about whether two or more groups were different, a box plot often goes some way towards giving you an answer.

Your turn:

Do a similar plot to the last one, but this time show how Root Mass varies with Grazing.

Add the raw data to the box plot .

If there are not too many data points, it can be instructive to include these in the box plot. To do this we just amend the code for drawing the basic box plot with another line (think of this as another layer) where we include a geom_point(),

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + 
  geom_boxplot() +
  geom_point(size = 4, colour = 'lightblue', alpha = 0.5) + # all the points will have these attributes
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")

Do you see how we added a layer to the plot we had earlier?

Do you also see again how arguments within the aesthetic of the first ggplot line map to elements of the data, so that, for example, the fill colour depends on the value of Grazing, whereas the arguments in the geom... layers do not map to any subset of the data? All points will be size 4, light blue and so on.

What does the alpha variable do?

Try changing its values between 0 and 1.
Try different colours of points
Try different sizes of points

Don’t like these colours? R recognises an extensive vocabulary of colour names, including all the obvious ones - so it is easy to guess and get something that will work. The full list is shown here. If that list is not enough you can also use palettes which make available a vast array of colour schemes suitable for different purposes, such as in maps. The rcolorbrewer() package is especially good. Trust me, there is a whole world of colour fun out there if you want it.

What use are these plots?

Scatter plot

p1<-compensation %>%
  ggplot(aes(x = Root, y = Fruit, colour = Grazing)) + 
  geom_point() +
  xlab('Root Biomass') +
  ylab('Fruit Production') +
  theme_cowplot()

Box and whisker plot

p2<-compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + 
  geom_boxplot() +
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")

library(gridExtra)
grid.arrange(p1, p2, nrow = 1)

Can you use either of these to answer these questions?

Do plants with wider root diameters at the start of the experiment produce more fruit?
Do grazed or ungrazed plants produce more fruit?
Is root mass correlated with fruit production?

Distributions: Making histograms of numerical variables

Looking at the distribution of our variables is very important. It can tell us about the central tendency and spread of the data, and give us an idea as to whether particular statistical tests will be valid.

A histogram is a good way to do this.

In a histogram, counts are made of numeric observations whose value fall within defined ranges or, as they are called, ‘bins’.

Let us produce a histogram of Fruit production.

For a histogram, we onlt specify the variable to be mapped to the x axis, in this case fruit mass, since the y-variable will be the counts of observations where the value of that variable fell within each bin.

g9<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram() +
  theme_cowplot()
g9

That wasn’t very nice! Thre are too many bins, so that hardly any observations fall in any one bin. We need to change the bin width, or, alternatively, the number of bins.
Let us try both…

pn<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(bins = 10) +
  #title('Fixing the number of bins') +
  theme_cowplot()

pw<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  theme_cowplot()

# this is a handy function from the gridExtra() package that you can use to arrange plots
grid.arrange(pn, pw, nrow = 1)

Try using more/narrower bins or fewer/fatter bins.
Do you see that, for a histogram to be informative, the number of bins and/or the bin width has to be chosen carefully?

From these histograms, can you tell roughly where the data is centered, and roughly how variable it is?

Facets - seeing the effect of a categorical variable on the distribution of a numerical variable.

Faceted plots

When we want several plots done in the same style, one for each of the different values of a categorical variable (or factor as R would say) the facet_wrap() and facet_grid() functions are useful;

compensation %>%
  ggplot(aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~Grazing, ncol = 2) + 
  theme_cowplot()

Try altering the ncol = 2 argument to ncol = 1. What happens? When might doing that be useful?

Other plot types

There are very many other plot types you can do using ggplot. Here are a couple:

Bar charts

To illustrate this, we will make up a dataset of discrete count datav: fruit types:

fruit_data<-tibble(fruits = c(rep("apple", 11), rep("orange", 7), rep("pear", 4)))
glimpse(fruit_data)

## Rows: 22
## Columns: 1
## $ fruits <chr> "apple", "apple", "apple", "apple", "apple", "apple", "apple", …

That gives a data frame with one categorical variable, fruits. To see how many there are of each type we can use the table() function:

table(fruit_data$fruits)

## 
##  apple orange   pear 
##     11      7      4

and we can illustrate this count data with a bar vhart, for which we use geom_bar() within ggplot:

fruit_data %>%
  ggplot(aes(x = fruits)) +
  geom_bar() +
  theme_cowplot()

We don’t need to specify a y variable, since geom_bar() understands that it is supposed to count up how many of each of the x variable the data set contains.

If our dataset is already a set of counts, for example:

fruit_count<-fruit_data %>%
  group_by(fruits) %>%
  summarise(Count = n())
fruit_count

then we can get a bar plot of these counts most simply by using using geom_col()

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col() +
  theme_cowplot()

There are many, many variations you can make to these basic plots. For example, you might want some colour:

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col(fill = "darkred") +
  theme_cowplot() +
  theme(legend.position="none")

Notice the line theme(legend.position="none") line which we have used before. It suppresses a legend, which we would get by default but do not need here.

To view the plot sideways we add the coord_flip() line:

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col(fill = "cornflowerblue") +
  coord_flip() +
  theme_cowplot() +
  theme(legend.position="none")

This is useful when you have lots of categories and the category names, eg species names, are too long to fit under each column.

Exercise

Let us use ggplot2 to display the data in the iris.csv data file that should already be in your RStuff/data folder. If not, you should be able to get it from the RStuff/data folder on Teams.

Create a new R Notebook called iris_plot and save it in your RStuff/scripts folder

Load the the packages tidyverse and here
Load the data into a data frame called iris
Inspect the data using glimpse()
Is the data tidy?
Group the data by species and find the mean sepal length and sepal width for each one. Make a nice table of the results with sensible levels of precision.
Produce a scatter plot of sepal length against sepal width, with a different colour for each species.
Does sepal length appear to be correlated with sepal width?
Produce a box and whisker plot for sepal length against species. You should end up with three boxes, side by side, all within the same set of axes. Give each box a different fill colour, and remove the legend.
Does there appear to be a difference between the sepal lengths of the different species?
Use facet_wrap() to produce 3 histograms for sepal width, one for each species, one above the other. Adjust the bin number until the histograms look about right.
Does it look from this plot or from the box and whisker plots for sepal lengths as though lengths and widths might well be normally distributed?
Does it look as though there is a difference between the sepal width of the three species?

Solution to Exercise

Load the packages

# load packages
library(tidyverse)
library(here)

Load the data

# load the data
filepath <- here("data","iris.csv")
iris<-read_csv(filepath)

Inspect the data

#inspect it
glimpse(iris)

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "setosa…

There is only one variable per column and each variable appears only once in each row, so the data is tidy

Group the data by species and find the mean sepal length and sepal width for each one

means_table<-iris %>%
  group_by(Species) %>%
  summarise(mean.sepal.length=mean(Sepal.Length),mean.sepal.width=mean(Sepal.Width))
means_table

Make a nice table of that summary

means_table %>%
  kbl(digits=2) %>%
  kable_styling(full_width=F)

Species	mean.sepal.length	mean.sepal.width
setosa	5.01	3.43
versicolor	5.94	2.77
virginica	6.59	2.97

Scatter plot of sepal length against sepal width

# Scatter plot, sepal length vs sepal width
iris %>%
  ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species))+
  geom_point()+
  xlab('Sepal Width')+
  ylab('Sepal Length')+
  theme_cowplot()

There appears to be a positive correlation between sepal length and sepal width for each of the species.

Box plot

# Box plots for sepal length
iris %>%
  ggplot(aes(x=Species,y=Sepal.Length,fill=Species))+
  geom_boxplot()+
  theme_cowplot()+
  theme(legend.position="none") # remove the legend as it is redndant in this plot

It looks as though the sepal lengths do differ between species, with setosa < versicolor < virginica. Having seen this plot and got this idea we would properly establish whether or not the observed differences could be down to chance by using either an ANOVA or Kruskal-Wallis test.

Histograms

# Histograms for sepal width, one for each species, one above the other
iris %>%
  ggplot(aes(x=Sepal.Width))+
  geom_histogram()+
  facet_wrap(~Species,ncol=1) +
  theme_cowplot()

This defaults to 30 bins, which is too many for these data. Let’s reduce it until we get something that looks about right:

iris %>%
  ggplot(aes(x=Sepal.Width))+
  geom_histogram(bins=10)+
  facet_wrap(~Species,ncol=1) +
  theme_cowplot()

I think around 10 bins is ‘about right’. What happens if we make the bin number too low?

Each data set is roughly symmetrically distributed. They could well be normally distributed but it is difficult to tell from histograms of samll samples. A more effective graphical way to tell is to use [quantile-quantile plots](https://rpubs.com/mbh038/725314.

The sepal width of setosa appears to be greater than that of the other two, between which there is no obvious difference. We can see this easily because we have used facet_wrap() and placed the sub-figures one above the other.

Visualising data in R using ggplot2

Sources

Introduction

Preliminaries

Put the data set into your RStuff/data folder

Make sure that your RStuff folder is a Project, and open the project

Create a new R Notebook.

The set-up chunk

Load packages we need

Load the data into R

Inspect the data

Summary statistics by group

Visualising data

ggplot2: a grammar for graphics

Scatter plots

Customise the plot

Get rid of the grey background if you don’t like it

Change the x and y axis labels

Change the size of the points

Change the colours of all the points

Change colours of the points to match levels (Grazed, Ungrazed) within the data

Change shapes of the points to match levels (Grazed, Ungrazed) within the data

Box and Whisker plots

How does Fruit production vary, and how does it depend on the type of grazing?

Your turn:

Add the raw data to the box plot .

What use are these plots?

Scatter plot

Box and whisker plot

Distributions: Making histograms of numerical variables

Facets - seeing the effect of a categorical variable on the distribution of a numerical variable.

Faceted plots

Other plot types

Bar charts

Exercise

Solution to Exercise

Load the packages

Load the data

Inspect the data

Group the data by species and find the mean sepal length and sepal width for each one

Make a nice table of that summary

Scatter plot of sepal length against sepal width

Box plot

Histograms

Visualising data in R using `ggplot2`

Make sure that your RStuff folder is a `Project`, and open the project

`ggplot2`: a grammar for graphics