Sources

I could fill page after page with suggested resources on how best to do visualisations (graphs, as we used to call them). Here are a few that I have used and found useful. Some are on the detail of how to use the ggplot2 package - which you will most likely use in your dissertation - and some are on more general principles of making effective data driven visualisations.


First, the ggplot2 sources:

  1. Some of the exercises below have been adapted from Chapter 4 of Getting Started with R by Beckerman, Childs and Petchey.

  2. Data Visualization by Kieran Healy is one of the best texts I have found on the use of the ggplot2 package for making superb figures for exploratory data analysis and for the communication of information from data. The whole book is available free online.

  3. The R Graphics Cookbook by Winston Chang gives many recipes for doing this or that within ggplot2.

  4. You will also find the documentation on ggplot2 very useful.

  5. The Data visualisation and Graphics for communication chapters in R for Data Science by Grolemund and Wickham. This is from the horse’s mouth. Wickham and Grloemun are leading lights behind RStudio and the tidyverse.

  6. The Graphics with ggplot chapter in An Introduction to R by Douglas et al. is concise and extremely clear. It gives a very good idea of how a plot done using this package is built up in layers, rather like maps in GIS.


The next two books are about general principles of good data driven visualisations:

  1. The Fundamentals of Data Visualization by Claus Wilke is available free online. Rather than being wedded to any particular software, it concentrates on the principles of how to make visualizations that look great and which accurately reflect the data.

  2. Andy Kirk’s Data Visualisation : a handbook for data driven design (2016), Sage is in a similar vein, but focuses more on infographic-y design principles.


Other software besides ggplot2 in R.

Research starts with questions. We design studies then gather and analyse data in order to answer these as precisely as we can. Plotting the data in a sensible way is part of the process of doing that and normally this requires software of some kind. In our sessions we will mainly use R and the plotting package ggplot2 because they offer an unparalleled combination of integrated analysis and plotting capability.

However, we also need to communicate our results, sometimes to a wide variety of audiences using a variety of media. To that end, depending on your audience or the setting, don’t be afraid to try other software packages. One online tool that is very powerful and requires no programming is Flourish, through which, for free, you have access to a huge and fantastic range of templates for visualisations, many of them interactive or animated. No programming required.

Introduction

Before you do any statistical analysis of your data it is almost always a good idea to plot the data in some way. This will very likely tell you whether or not the relationships you suspect to be present in the data are or are not present. In this exercise we introduce you to a powerful plotting package called ggplot2 that is one of the packages within tidyverse.

There are other way to do plots in R that are also very powerful. Within base R, that is without need for any extra packages, you can produce publication quality plots, and if you want to go down that road, don’t let me stop you.

However, once you get the hang of the general approach of the tidyverse and get to the point where you want to produce a plot, you will find that ggplot2 is the obvious thing to use to do them.

Make no mistake, to get a plot that includes all the finicky things you want for your dissertation or paper will probably take some time, however you produce it. This time will very likely include a lot of Google queries. But, to do exploratory data analysis, where you just want plots that will give you insight into your data along the way towards carrying out statistical tests, ggplot2 is a tool that can quickly give you what you need.

Example

As an example, let us take the iris dataset that is built into R. For three species of iris it contains the sepal length and width, and petal length and width. Suppose we want to see how the sepal length compares to the sepal width for each species.

We can do a quick and dirty (but still pretty good) scatter plot, where each species is indicated by a different colour, but we don’t bother about proper axis labels or anything else. This is really easy to do and would be good enough for ourselves as researchers while we explore our own data.

# quick and dirty
p1<-iris %>%
  ggplot(aes(x=Sepal.Length,y=Petal.Length,colour=Species)) +
  geom_point() +
  labs(caption="Really easy to do")

If we want something a bit fancier, to show someone else, we might put more effort in. For example we might want to specify the axis labels, include annotation or choose a colour scheme suitable for colour-blind people. That requires a few more lines of script, but not many….

# a bit more work involved for a fancier plot
p2<-iris %>%
  ggplot(aes(x=Sepal.Length,y=Petal.Length,colour=Species)) +
  geom_point() +
  labs(x="Petal Width (cm)",
       y="Petal Length (cm)",
       caption="Takes more time - but not that much more.") +
  scale_color_colorblind() +
  annotate("text",x=6.7,y=2.5,label="Using a colour scheme\nsuitable for those who\n are colour blind",size=3)+
  theme_bw()

Preliminaries

To begin our exploration of ggplot2, we will again investigate the compensation.csv data set.

This has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.

You can find this in the place stated by your tutor - typically the Moodle site for the module, or in the files tab of the channel used for this work within whatever Teams page this module uses. Your tutor will direct you to the right place. If you haven’t already got it, save it into your own /data folder. You could put it anywhere, but it will help enormously if we are organised and have a regular place where we put our data within our Project. On which topic…

Make sure that your Project folder is a Project, and open the project

  • If it is, it will contain an project_name.RProj file. Click on this to open the project.
  • If it isn’t yet, make it become one by clicking File/New Project/Existing Directory

A Project in RStudio terms is just a folder (aka in other circles as ‘directory’ or ‘repository’ or (amongst friends), ‘repo’.) with added superpowers. You can ignore those powers and just treat your project folder as a regular folder, if you want. For our purposes, denoting our project folder as a Project enables the here package to work so that we don’t have to worry about ‘Working Directory’ stuff or busy ourselves with the arcane syntax of computer pathways when trying to find a file. That’s reason enough to use Projects, but it turns out there are other benefits further down the line. My advice is : use them!

Once you open your project, the Environment pane should be empty, and you should see the contents of your Project folder displayed under the Files tab in the bottom right window.

Create a new R Notebook.

Create a new R Notebook in RStudio. Name it something sensible like visualising data (or visualising_data or visualing.data. R does not mind if file names have spaces) and save it into your Project/scripts folder. RStudio will add the .Rmd extension to it, so you don’t have to do that.

In the script pane (the one on the top-left of RStudio), amend the yaml section appropriately as you have done before.

---
title: "Visualising data in R using `ggplot2`"
author: "Your name"
date: "dd-mm-yyyy"
output:
  html_document:
    df_print: paged
---

Delete everything below the yaml. That is just exemplar material to show you what you can do in R Markdown. But you know this already. If not, read the material now, and once you are done, from then until eternity, just delete it all every time you start a new script.

Now read the rest of the instructions below and enter each of the chunks as you come to them, implementing them as you go by clicking the little green arrow in the top-right of each chunk.. Include suitable headers and notes before each chunk to let the reader (who is most likely to be you some weeks/months later) know what is the purpose of each chunk.

Remember that each code chunk needs to be placed between rows of three back ticks, like this:

```{r}
# the code goes here
```

The easiest way to type these in is to use the shortcut Ctrl-Alt-I on a Windows machine, or Cmd-Option-I on a Mac.

After entering each code chunk, place your cursor somewhere on the line or highlight the lines if there are more than one, then press Ctrl-Enter or Cmd-Enter if you are using a Mac. You should see the line being implemented in the console pane and if any plots are generated, they will appear in the scripts pane beneath the code chunk that generated them.

The set-up chunk

This chunk is only useful if you want to render your script as an html document. If you don’t, just ignore it for now and skip to the next section.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Load packages we need

Much of the power of R lies in what you can do with the packages that you can build into it on top of the base R that you get when you first install it.

Most scripts you write will make use of some package or other. Remember that the first time you use a package you have to install it on your machine by typing install.packages("<name of package - note the surrounding quotes>") into your console window - you don’t want this line in your script. Thereafter, you have to make that package available to your scripts by including in your scripts lines like these, where you write library(<name of package - but leave out the quotes>)

library(tidyverse)
library(here)
library(cowplot)

If this throws an error, then one or other of the packages you are tryng to load have not yet been installed on your machine. The error message(s) will tell you which. To sort this, you need to type install.packages('tidyverse') etc into your console window, not in your script.

This will install the missing package(s). You should then run the library(....) lines again.

Remember that tidyverse is just a big Christmas stocking full of other useful (like, really, really useful) packages. I suggest that every one of your scripts includes and uses it. Also here(), which removes much of the headaches that would otherwise afflict you around finding your data before reading it into R. In time, you will gather a list of packages that you find useful that you load by default i nevery notebook. In the world of ecology, the vegan() package will very likely be one of them.

Load the data into R

Now we load into R the compensation.csv data set that we are going to analyse. This can be found in the edenr/data folder of the Files tab within Teams and should be saved into your /data folder.

mypath<-here("data", "compensation.csv")
compensation<-read_csv(mypath)

Inspect the data

Remind yourself what this data set contains

glimpse(compensation)

In this data set, Fruit production and Root mass are response variables to the type of grazing (Grazed or Ungrazed).

  • Is this data tidy?
  • What makes it so, or not so?

There are several ways to inspect the data. One way is

head(compensation)

which shows you the first few lines. Another is is to click on the text of the name of the compensation object in the environment pane. That will show you the data in a spreadsheet-like view in the scripts pane. That is what I now mostly do. I don’t bother using the head() function.

Another is summary(<name of data frame>). That gives you summary statistics of each of the columns.

summary (compensation)

This is useful especially for seeing if you have any missing values in your data set, which will show up as ‘NA’s, but is not very informative about columns that contains categorical data (’factors’ in RSpeak) , as in our Grazing column.

I often do all of these.

Summary statistics by group

Before we get going with plotting the data, let us just look at some summary statistics for each grazing condition. To do this we use the powerful combination of the group_by() and summarise() commands plus the pipe operator %>% which feeds the results of one line into the next line.

In English, this next code chunk means:

  • Let’s create a data frame called summary_table_root
  • To do so, we start from the compensation data frame and then
  • Group this by the values in the Grazing column and then
  • For each group, find the minimum, median, mean and maximum values of the Root column.
  • Finally, print out summary_table_root
#summary table for the Root values
summary_table_root<-compensation %>%
  group_by(Grazing) %>%
  summarise(min.Root = min(Root),
            med.Root = median(Root),
            mean.Root = mean(Root, na.rm = TRUE),
            max.Root = max(Root))
summary_table_root

And here is the same for the Fruit column

#summary table for the Fruit values
summary_table_fruit<-compensation %>%
  group_by(Grazing) %>%
  summarise(min.Fruit = min(Fruit),
            med.Fruit = median(Fruit),
            mean.Fruit = mean(Fruit),
            max.Fruit = max(Fruit))
summary_table_fruit

Later, when we come to consider particular statistical tests for our data, we may well need to consider how the data are distributed, and in particular, more often than not, whether they are normally distributed.

Do these summary tables suggest that this may well be the case, or not?

Visualising data

ggplot2: a grammar for graphics

ggplot2 is a powerful package within tidyverse for drawing graphics. It creates plots in layers. The possibilities are endless, so here we will just work through a few examples

Scatter plots

A simple, plain, bivariate scatter plot of the data.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
  geom_point() # this is the second layer- a 'geometric' layer which in this case adds points

In the code, the information is added to the plot in successive layers, each written on a separate line and separated from the next one by a +.

The first line specifies what we want to plot. It has the aes (short for aesthetic) argument. In this you specify how information from the data will be mapped to aspects of the plot. What will be plotted along the x-axis, what variable will determine the colour of the points or the fill colour of any bars, and so on?

In the second line we specify what type of plot we want using one of many possible geom_....() functions. Type ?geom into the console and see how many options you.are offered. Loads, right? Don’t worry, a few of these will do most of what you want most of the time and whenever they don’t then there is plenty of help available. Here we want a scatter plot, for which geom_point() is the correct choice.

Customise the plot

There are many ways in which you can customise the basic plot given above. Here are a few:

Get rid of the grey background if you don’t like it

Don’t like the grey background? Add a theme layer. There are a few built into R, and others are available through packages. I often use theme_cowplot() from the cowplot package.

library(cowplot) # I usually have this line in my top 'load all the packages' code chunk
compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
  geom_point() +# this is the second layer- a 'geometric' layer which in this case adds points
  theme_cowplot()

You can have more than one theme() line. They are where you alter various aspects of the ‘look’ of a plot, such as where to put a legend, whether to have one at all, fonts, font sizes etc.

Change the x and y axis labels

We normally do this, even for on-the-fly exploratory plots, since it is helpful and easy to do.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point() +
  labs(x = 'Root Biomass', 
       y = 'Fruit Production') +
  theme_cowplot()

Change the size of the points

We can include a size argument in the geom_point() line.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point(size = 5) + # all the points will be size 5 (big!)
  theme_cowplot()

Notice that because the size = 5 argument appears in the geom_point() line and not in the first line, in the aes() function, it affects all points. It is not mapped to any particular set of points depending on values in some column of the data.

Change the colours of all the points

Just as we changed the size of the points, we can change the colour or shape of all of them using a colour = or shape = argument in the geom_point() layer.

compensation %>%
  ggplot(aes(x = Root, y = Fruit)) + 
  geom_point(colour = "blue") + # all the points will be coloured blue 
  labs('Root Biomass',
       y='Fruit Production') +
  theme_cowplot()

Change colours of the points to match levels (Grazed, Ungrazed) within the data

While perhaps prettier than the default black, the colour in the previous plot has added no new information to the plot. Nor would it have helped if we had used triangular symbols instead of circles.

While we might do those things to fit in with some general style we wanted our plot to have, (or just because we liked the colour blue!) we can use colour, size, shape and so on to add information if the appearance attribute of the symbols is mapped to some feature of the data. That is, for example, if the points are coloured or shaped or sized differently depending on the value of some variable, such as species, site or as here, grazing condition.

We do this in the first, ggplot layer, as an additional argument within the aes() function. All the arguments to this ‘map’ variables (ie columns) of the dataset to aspects of the plot.

So if we wanted to colour of the points to depend on the grazng condition, we would do this:

compensation %>%
  ggplot(aes(x = Root, y = Fruit, colour = Grazing)) +  # the colour will depend on the value of Grazing.
  geom_point() +
  labs(x='Root Biomass',
       y='Fruit Production') +
  theme_cowplot()

Notice that by default we now get a legend, which in this plot we need. As we mentioned above, you can move, adjust or even remove this legend altogether if you want using a theme() line.

Change shapes of the points to match levels (Grazed, Ungrazed) within the data

See if you can adapt the previous code chunk to give a scatter plot of Fruit Production vs Root Biomass, with a different symbol depending on whether the tree was in grazed or ungrazed pasture.

You just have to adapt the code above to do what you want (Hint: shape). Often, when we use R, we do not have to write scripts from scratch. We just adapt existing scripts that almost do what we want so that they do exactly what we want.

Box and Whisker plots

Often, when we present data we want to convey some idea of their central value (mean, median etc) and some idea of their spread. A bar plot hides much of this information. A Box and Whisker plot is a good way to show it.

How does Fruit production vary, and how does it depend on the type of grazing?

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
  geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
  labs(x='Grazing treatment',
       y='Fruit Production') +
  theme_cowplot()

For each data set, the thick bar in the middle shows the median, the box shows the interquartile range, from the 25th to the 75 percentile, by default the whiskers show the distance to the points whose distance from the median is 1.5 times the width of the interquartile range, and any outliers are shown as individual points.

Note that we have by default been given a legend. Since here this adds no information that was not already in the axis labels, we would like to remove it. This can be done with a final theme() layer:

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
  geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
  labs(x='Grazing treatment',
       y='Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")

You can use plots like this to see whether two data sets are similar or markedly different. If your research hypothesis was about whether two or more groups were different, a box plot often goes some way towards giving you an answer.

Your turn:

Do a similar plot to the last one, but this time show how Root Mass varies with Grazing.

Add the raw data to the box plot .

If there are not too many data points, it can be instructive to include these in the box plot. To do this we just amend the code for drawing the basic box plot with another line (think of this as another layer) where we include a geom_point(),

compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing, alpha = 0.2)) + 
  geom_boxplot() +
  geom_point(aes(colour=Grazing),size = 4 ) + # all the points will have these attributes
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")

Do you also see again how arguments within the aesthetic of the first ggplot line map to elements of the data, so that, for example, the fill colour depends on the value of Grazing.

In the geom_point() layer we have also included an aesthetic argument so that we can map the colour of the points to the Grazing treatment.

What does the alpha variable do?

  • Try changing its values between 0 and 1.
  • Try different colours of points
  • Try different sizes of points

Don’t like these colours? R recognises an extensive vocabulary of colour names, including all the obvious ones - so it is easy to guess and get something that will work. The full list is shown here. If that list is not enough you can also use palettes which make available a vast array of colour schemes suitable for different purposes, such as in maps. The RColorBrewer() package is especially good. Trust me, there is a whole world of colour fun out there if you want it.

What use are these plots?

Scatter plot

p1<-compensation %>%
  ggplot(aes(x = Root, y = Fruit, colour = Grazing)) + 
  geom_point() +
  xlab('Root Biomass') +
  ylab('Fruit Production') +
  theme_cowplot()

Box and whisker plot

p2<-compensation %>%
  ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + 
  geom_boxplot() +
  xlab('Grazing treatment') +
  ylab('Fruit Production') +
  theme_cowplot() +
  theme(legend.position = "none")
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)

Can you use either of these to answer these questions?

  1. Do plants with wider root diameters at the start of the experiment produce more fruit?
  2. Do grazed or ungrazed plants produce more fruit?
  3. Is root mass correlated with fruit production?

Is there anything in either of these plots that adds no additional information.If not, good. If so, we should get rid of it!

Distributions: Making histograms of numerical variables

Looking at the distribution of our variables is very important. It can tell us about the central tendency and spread of the data, and give us an idea as to whether particular statistical tests will be valid.

A histogram is a good way to do this.

In a histogram, counts are made of numeric observations whose value fall within defined ranges or, as they are called, ‘bins’.

Let us produce a histogram of Fruit production.

For a histogram, we only specify the variable to be mapped to the x axis, in this case fruit mass, since the y-variable will be the counts of observations where the value of that variable fell within each bin.

g9<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram() +
  theme_cowplot()
g9

That wasn’t very nice! Thre are too many bins, so that hardly any observations fall in any one bin. We need to change the bin width, or, alternatively, the number of bins.
Let us try both…

pn<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(bins = 10) +
  #title('Fixing the number of bins') +
  theme_cowplot()
pw<-ggplot(compensation, aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  theme_cowplot()
# this is a handy function from the gridExtra() package that you can use to arrange plots
grid.arrange(pn, pw, nrow = 1)

Try using more/narrower bins or fewer/fatter bins.
Do you see that, for a histogram to be informative, the number of bins and/or the bin width has to be chosen carefully?

From these histograms, can you tell roughly where the data is centered, and roughly how variable it is?

Facets - seeing the effect of a categorical variable on the distribution of a numerical variable.

Faceted plots

When we want several plots done in the same style, one for each of the different values of a categorical variable (or factor as R would say) the facet_wrap() and facet_grid() functions are useful;

compensation %>%
  ggplot(aes(x = Fruit)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~Grazing, ncol = 2) + 
  theme_cowplot()

Try altering the ncol = 2 argument to ncol = 1. What happens? When might doing that be useful?

Try removing the facet_wrap() line. What do you get? Do you see how facet_wrap() can be useful?

Other plot types

There are very many other plot types you can do using ggplot. Here are a couple:

Bar charts

To illustrate this, we will make up a dataset of discrete count data: fruit types:

fruit_data<-tibble(fruits = c(rep("apple", 11), rep("orange", 7), rep("pear", 4)))
glimpse(fruit_data)
## Rows: 22
## Columns: 1
## $ fruits <chr> "apple", "apple", "apple", "apple", "apple", "apple", "apple", …

That gives a data frame with one categorical variable, fruits. To see how many there are of each type we can use the table() function:

table(fruit_data$fruits)
## 
##  apple orange   pear 
##     11      7      4

and we can illustrate this count data with a bar chart, for which we use geom_bar() within ggplot:

fruit_data %>%
  ggplot(aes(x = fruits)) +
  geom_bar() +
  theme_cowplot()

We don’t need to specify a y variable, since geom_bar() understands that it is supposed to count up how many of each of the x variable the data set contains.

If our dataset is already a set of counts, for example:

fruit_count<-fruit_data %>%
  group_by(fruits) %>%
  summarise(Count = n())
fruit_count

then we can get a bar plot of these counts most simply by using using geom_col()

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col() +
  theme_cowplot()

There are many, many variations you can make to these basic plots. For example, you might want some colour:

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col(fill = "darkred") +
  theme_cowplot() +
  theme(legend.position="none")

Notice the line theme(legend.position="none") line which we have used before. It suppresses a legend, which we would get by default but do not need here.

To view the plot sideways we add the coord_flip() line:

fruit_count %>%
  ggplot(aes(x = fruits, y = Count)) +
  geom_col(fill = "cornflowerblue") +
  coord_flip() +
  theme_cowplot() +
  theme(legend.position="none")

This is useful when you have lots of categories and the category names, eg species names, are too long to fit under each column.

Error Bars

The long version

When you gather a sample of data with some number of replicates, you hope that that sample is representative of the population from which it has been drawn. Why? Because it is the population that you want to know about. The variation of the sample, which we characterise by its standard deviation, is our best estimate of the variability of the population from which the sample has been drawn.

The mean of of some aspect of that sample is your best estimate of the mean value of that aspect for the whole population. You know, though, that if you were to take another sample, then the mean of that sample would be different from that of the first. And the mean of a third sample would be different from the mean of the second, and so on. In time, you would see that the successive means were distributed, probably normally distributed, around a central value. That central value you can reasonably assume to be the true population mean. The spread of the distribution can be characterised by what we call the standard error of the mean. This distribution is sometimes referred to as the sampling distribution.

But, hang on, this is fictitious. We don’t, in reality, take sample after sample. We take just one. That single, actual sample will have a mean that lies somewhere within the sampling distribution - the distribution of sample means we would get if we did take sample after sample. The mean of our actual sample might be a bit more or a bit less than the true population mean. We don’t know this population mean but the mean of our solitary sample is our best estimate of it. It is all we have to go on. We can give people an idea of how close our sample mean is likely to be to the true mean by telling them the standard error of the sampling distribution, and this we can estimate from the standard deviation of our actual sample. If the standard deviation is \(\sigma\), then the standard error of the mean is \(\frac{\sigma}{\sqrt{n}}\), where \(n\) is the sample size.

So, in the end, if we are going to display mean values of samples in a plot or a table, and talk of these as those they are estimates of population means, then we ought to include some information as to how precise an estimate they are. We do this by also including the standard error of the mean. Typically, you might see in a table a value presented as something like \(2.4 \pm 0.35\) together with information (hopefully!) in the caption to tell you that this is a mean value plus or minus one standard error of the mean. In a figure, we would include an error bar, where that error bar would have a length equal to one standard error of the mean. Crucially too, we should say in the caption to the figure what that error bar represented. In the case discussed here we mean it to indicate one standard error of the mean. But in another paper, another author might mean by it the standard deviation of the sample which, remember, is our best estimate of the spread of values in the population. This is not the same thing as the standard error of the mean, which is our estimate of how precisely we know the mean of that population. So, we need to clearly tell the reader what we actually mean.

tldr: how to include error bars in a plot.

Here we use the iris data set, a commonly used data set that is built into R. It has 150 rows of data, 50 for each of 3 species of iris. For each species there are measurements of 4 quantitative variables, sepal length and width, and petal length and width. Suppose we are interested in sepal lengths and how that differs between species. We first use group_by() and summarise() to find the mean and standard error of the mean of the sepal length for each species, then we plot the mean values using geom_point() and add error bars using geom_errorbar().

iris %>%
  group_by(Species) %>%
  summarise(n=n(),
            mean_SL=mean(Sepal.Length),
            se_SL=sd(Sepal.Length/sqrt(n()))) %>%
  
  
  ggplot(aes(x=Species,y=mean_SL,colour=Species,fill=Species)) +
  geom_point() +
  geom_errorbar(aes(ymin=mean_SL-se_SL,ymax=mean_SL+se_SL),width=0.2) + 
  labs(x="Species",
       y="Sepal Length (mm)",
       caption="Figure shows mean values plus or minus one standard error of the mean") +
  theme_cowplot() +
  theme(legend.position="none")

Exercise

Let us use ggplot2 to display the data in the iris.csv data file that should already be in your RStuff/data folder. If not, you should be able to get it from the RStuff/data folder on Teams.

Create a new R Notebook called iris_plot and save it in your RStuff/scripts folder

  1. Load the the packages tidyverse and here
  2. Load the data into a data frame called iris
  3. Inspect the data using glimpse()
  4. Is the data tidy?
  5. Group the data by species and find the mean sepal length and sepal width for each one. Make a nice table of the results with sensible levels of precision.
  6. Produce a scatter plot of sepal length against sepal width, with a different colour for each species.
  7. Does sepal length appear to be correlated with sepal width?
  8. Produce a box and whisker plot for sepal length against species. You should end up with three boxes, side by side, all within the same set of axes. Give each box a different fill colour, and remove the legend.
  9. Does there appear to be a difference between the sepal lengths of the different species?
  10. Use facet_wrap() to produce 3 histograms for sepal width, one for each species, one above the other. Adjust the bin number until the histograms look about right.
  11. Does it look from this plot or from the box and whisker plots for sepal lengths as though lengths and widths might well be normally distributed?
  12. Does it look as though there is a difference between the sepal width of the three species?

Solution to Exercise

Load the packages
# load packages
library(tidyverse)
library(here)
Load the data
# load the data
filepath <- here("data","iris.csv")
iris<-read_csv(filepath)
Inspect the data
#inspect it
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "setosa…

There is only one variable per column and each variable appears only once in each row, so the data is tidy

Group the data by species and find the mean sepal length and sepal width for each one
means_table<-iris %>%
  group_by(Species) %>%
  summarise(mean.sepal.length=mean(Sepal.Length),mean.sepal.width=mean(Sepal.Width))
means_table
Scatter plot of sepal length against sepal width
# Scatter plot, sepal length vs sepal width
iris %>%
  ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species))+
  geom_point()+
  xlab('Sepal Width')+
  ylab('Sepal Length')+
  theme_cowplot()

There appears to be a positive correlation between sepal length and sepal width for each of the species.

Box plot

This next plot is useful overall but has a feature of questionable usefulness.

# Box plots for sepal length
iris %>%
  ggplot(aes(x=Species,y=Sepal.Length,fill=Species))+
  geom_boxplot()+
  theme_cowplot()+
  theme(legend.position="none") # remove the legend as it is redndant in this plot

It looks as though the sepal lengths do differ between species, with setosa < versicolor < virginica. Having seen this plot and got this idea we would next properly establish whether or not the observed differences could be down to chance by using, preferably, a parametric test such as an ANOVA or, if the data did not permit that, its less powerful non-parametric equivalent, the Kruskal-Wallis test.

Now, does the use of colour add information to this plot? Hint: no! It possibly even confuses the reader by making them think there is some other feature that differentiates the three sets of measurement besides that they relate to different species. The species are displayed on the x-axis, so what do the coulrs do for us? Nothing, so best leave them out.

Histograms
# Histograms for sepal width, one for each species, one above the other
iris %>%
  ggplot(aes(x=Sepal.Width))+
  geom_histogram()+
  facet_wrap(~Species,ncol=1) +
  theme_cowplot()

This defaults to 30 bins, which is too many for these data. You end up with too few points in each bin and a very ragged looking histogram. Let’s reduce it until we get something that looks about right:

iris %>%
  ggplot(aes(x=Sepal.Width))+
  geom_histogram(bins=10)+
  facet_wrap(~Species,ncol=1) +
  theme_cowplot()

I think that for this data set, around 10 bins is ‘about right’. What happens if we make the bin number too low?

Each data set is roughly symmetrically distributed. They could well be normally distributed but it is difficult to tell from histograms of small samples. A more effective graphical way to tell is to use [quantile-quantile plots](https://rpubs.com/mbh038/725314.

The sepal width of setosa appears to be greater than that of the other two, between which there is no obvious difference. We can see this easily because we have used facet_wrap() and placed the sub-figures one above the other.