ggplot2I could fill page after page with suggested resources on how best to
do visualisations (graphs, as we used to call them). Here are a few that
I have used and found useful. Some are on the detail of how to use the
ggplot2 package - which you will most likely use in your
dissertation - and some are on more general principles of making
effective data driven visualisations.
First, the ggplot2 sources:
Some of the exercises below have been adapted from Chapter 4 of Getting Started with R by Beckerman, Childs and Petchey.
Data Visualization by
Kieran Healy is one of the best texts I have found on the use of the
ggplot2 package for making superb figures for exploratory
data analysis and for the communication of information from data. The
whole book is available free online.
The R Graphics
Cookbook by Winston Chang gives many recipes for doing this or
that within ggplot2.
You will also find the documentation on
ggplot2 very useful.
The Data visualisation and Graphics for communication chapters in R for Data Science by Grolemund and Wickham. This is from the horse’s mouth. Wickham and Grloemun are leading lights behind RStudio and the tidyverse.
The Graphics with ggplot chapter in An Introduction to R by Douglas et al. is concise and extremely clear. It gives a very good idea of how a plot done using this package is built up in layers, rather like maps in GIS.
The next two books are about general principles of good data driven visualisations:
The Fundamentals of Data Visualization by Claus Wilke is available free online. Rather than being wedded to any particular software, it concentrates on the principles of how to make visualizations that look great and which accurately reflect the data.
Andy Kirk’s Data Visualisation : a handbook for data driven design (2016), Sage is in a similar vein, but focuses more on infographic-y design principles.
ggplot2 in R.Research starts with questions. We design studies then gather and
analyse data in order to answer these as precisely as we can. Plotting
the data in a sensible way is part of the process of doing that and
normally this requires software of some kind. In our sessions we will
mainly use R and the plotting package ggplot2
because they offer an unparalleled combination of integrated analysis
and plotting capability.
However, we also need to communicate our results, sometimes to a wide variety of audiences using a variety of media. To that end, depending on your audience or the setting, don’t be afraid to try other software packages. One online tool that is very powerful and requires no programming is Flourish, through which, for free, you have access to a huge and fantastic range of templates for visualisations, many of them interactive or animated. No programming required.
Before you do any statistical analysis of your data it is almost
always a good idea to plot the data in some way. This will very likely
tell you whether or not the relationships you suspect to be present in
the data are or are not present. In this exercise we introduce you to a
powerful plotting package called ggplot2 that is one of the
packages within tidyverse.
There are other way to do plots in R that are also very powerful. Within base R, that is without need for any extra packages, you can produce publication quality plots, and if you want to go down that road, don’t let me stop you.
However, once you get the hang of the general approach of the
tidyverse and get to the point where you want to produce a
plot, you will find that ggplot2 is the obvious thing to
use to do them.
Make no mistake, to get a plot that includes all the finicky things
you want for your dissertation or paper will probably take some time,
however you produce it. This time will very likely include a lot of
Google queries. But, to do exploratory data analysis, where you
just want plots that will give you insight into your data along the way
towards carrying out statistical tests, ggplot2 is a tool
that can quickly give you what you need.
As an example, let us take the iris dataset that is
built into R. For three species of iris it contains the
sepal length and width, and petal length and width. Suppose we want to
see how the sepal length compares to the sepal width for each
species.
We can do a quick and dirty (but still pretty good) scatter plot, where each species is indicated by a different colour, but we don’t bother about proper axis labels or anything else. This is really easy to do and would be good enough for ourselves as researchers while we explore our own data.
# quick and dirty
p1<-iris %>%
ggplot(aes(x=Sepal.Length,y=Petal.Length,colour=Species)) +
geom_point() +
labs(caption="Really easy to do")
If we want something a bit fancier, to show someone else, we might put more effort in. For example we might want to specify the axis labels, include annotation or choose a colour scheme suitable for colour-blind people. That requires a few more lines of script, but not many….
# a bit more work involved for a fancier plot
p2<-iris %>%
ggplot(aes(x=Sepal.Length,y=Petal.Length,colour=Species)) +
geom_point() +
labs(x="Petal Width (cm)",
y="Petal Length (cm)",
caption="Takes more time - but not that much more.") +
scale_color_colorblind() +
annotate("text",x=6.7,y=2.5,label="Using a colour scheme\nsuitable for those who\n are colour blind",size=3)+
theme_bw()
To begin our exploration of ggplot2, we will again
investigate the compensation.csv data set.
This has 40 observations of the root stock mass and mass of fruit harvested, for apple trees in both grazed and ungrazed conditions.
You can find this in the place stated by your tutor - typically the
Moodle site for the module, or in the files tab of the channel used for
this work within whatever Teams page this module uses. Your tutor will
direct you to the right place. If you haven’t already got it, save it
into your own
Project, and open the projectA Project in RStudio terms is just a folder (aka in other circles as
‘directory’ or ‘repository’ or (amongst friends), ‘repo’.) with added
superpowers. You can ignore those powers and just treat your project
folder as a regular folder, if you want. For our purposes, denoting our
project folder as a Project enables the
here package to work so that we don’t have to worry about
‘Working Directory’ stuff or busy ourselves with the arcane syntax of
computer pathways when trying to find a file. That’s reason enough to
use Projects, but it turns out there are other benefits further down the
line. My advice is : use them!
Once you open your project, the Environment pane should be empty, and you should see the contents of your Project folder displayed under the Files tab in the bottom right window.
Create a new R Notebook in RStudio. Name it something sensible like
visualising data (or
visualising_data or visualing.data. R does not mind if file
names have spaces) and save it into your Project/scripts folder. RStudio
will add the .Rmd extension to it, so you don’t have to do
that.
In the script pane (the one on the top-left of RStudio), amend the
yaml section appropriately as you have done before.
---
title: "Visualising data in R using `ggplot2`"
author: "Your name"
date: "dd-mm-yyyy"
output:
html_document:
df_print: paged
---
Delete everything below the yaml. That is just exemplar
material to show you what you can do in R Markdown. But you know this
already. If not, read the material now, and once you are done, from then
until eternity, just delete it all every time you start a new
script.
Now read the rest of the instructions below and enter each of the chunks as you come to them, implementing them as you go by clicking the little green arrow in the top-right of each chunk.. Include suitable headers and notes before each chunk to let the reader (who is most likely to be you some weeks/months later) know what is the purpose of each chunk.
Remember that each code chunk needs to be placed between rows of three back ticks, like this:
```{r}
# the code goes here
```
The easiest way to type these in is to use the shortcut
Ctrl-Alt-I on a Windows machine, or
Cmd-Option-I on a Mac.
After entering each code chunk, place your cursor somewhere on the
line or highlight the lines if there are more than one, then press
Ctrl-Enter or Cmd-Enter if you are using a
Mac. You should see the line being implemented in the console pane and
if any plots are generated, they will appear in the scripts pane beneath
the code chunk that generated them.
This chunk is only useful if you want to render your script as an html document. If you don’t, just ignore it for now and skip to the next section.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Much of the power of R lies in what you can do with the packages that
you can build into it on top of the base R that you get
when you first install it.
Most scripts you write will make use of some package or other.
Remember that the first time you use a package you have to install it on
your machine by typing
install.packages("<name of package - note the surrounding quotes>")
into your console window - you don’t want this line in your
script. Thereafter, you have to make that package available to your
scripts by including in your scripts lines like these, where you write
library(<name of package - but leave out the quotes>)
library(tidyverse)
library(here)
library(cowplot)
If this throws an error, then one or other of the packages you are
tryng to load have not yet been installed on your machine. The error
message(s) will tell you which. To sort this, you need to type
install.packages('tidyverse') etc into your console window,
not in your script.
This will install the missing package(s). You should then run the
library(....) lines again.
Remember that tidyverse is just a big Christmas stocking
full of other useful (like, really, really useful) packages. I suggest
that every one of your scripts includes and uses it. Also
here(), which removes much of the headaches that would
otherwise afflict you around finding your data before reading it into R.
In time, you will gather a list of packages that you find useful that
you load by default i nevery notebook. In the world of ecology, the
vegan() package will very likely be one of them.
Now we load into R the compensation.csv data set that we
are going to analyse. This can be found in the edenr/data folder of the
Files tab within Teams and should be saved into your
mypath<-here("data", "compensation.csv")
compensation<-read_csv(mypath)
Remind yourself what this data set contains
glimpse(compensation)
In this data set, Fruit production and Root mass are response variables to the type of grazing (Grazed or Ungrazed).
There are several ways to inspect the data. One way is
head(compensation)
which shows you the first few lines. Another is is to click on the
text of the name of the compensation object in the
environment pane. That will show you the data in a spreadsheet-like view
in the scripts pane. That is what I now mostly do. I don’t bother using
the head() function.
Another is summary(<name of data frame>). That
gives you summary statistics of each of the columns.
summary (compensation)
This is useful especially for seeing if you have any missing values in your data set, which will show up as ‘NA’s, but is not very informative about columns that contains categorical data (’factors’ in RSpeak) , as in our Grazing column.
I often do all of these.
Before we get going with plotting the data, let us just look at some
summary statistics for each grazing condition. To do this we use the
powerful combination of the group_by() and
summarise() commands plus the pipe operator
%>% which feeds the results of one line into the next
line.
In English, this next code chunk means:
summary_table_rootcompensation data frame
and thensummary_table_root#summary table for the Root values
summary_table_root<-compensation %>%
group_by(Grazing) %>%
summarise(min.Root = min(Root),
med.Root = median(Root),
mean.Root = mean(Root, na.rm = TRUE),
max.Root = max(Root))
summary_table_root
And here is the same for the Fruit column
#summary table for the Fruit values
summary_table_fruit<-compensation %>%
group_by(Grazing) %>%
summarise(min.Fruit = min(Fruit),
med.Fruit = median(Fruit),
mean.Fruit = mean(Fruit),
max.Fruit = max(Fruit))
summary_table_fruit
Later, when we come to consider particular statistical tests for our data, we may well need to consider how the data are distributed, and in particular, more often than not, whether they are normally distributed.
Do these summary tables suggest that this may well be the case, or not?
ggplot2: a grammar for graphicsggplot2 is a powerful package within
tidyverse for drawing graphics. It creates plots in layers.
The possibilities are endless, so here we will just work through a few
examples
A simple, plain, bivariate scatter plot of the data.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
geom_point() # this is the second layer- a 'geometric' layer which in this case adds points
In the code, the information is added to the plot in successive
layers, each written on a separate line and separated from the next one
by a +.
The first line specifies what we want to plot. It has the
aes (short for aesthetic) argument. In this you specify how
information from the data will be mapped to aspects of the plot. What
will be plotted along the x-axis, what variable will determine the
colour of the points or the fill colour of any bars, and so on?
In the second line we specify what type of plot we want using one of
many possible geom_....() functions. Type
?geom into the console and see how many options you.are
offered. Loads, right? Don’t worry, a few of these will do most of what
you want most of the time and whenever they don’t then there is plenty
of help available. Here we want a scatter plot, for which
geom_point() is the correct choice.
There are many ways in which you can customise the basic plot given above. Here are a few:
Don’t like the grey background? Add a theme layer. There are a few
built into R, and others are available through packages. I often use
theme_cowplot() from the cowplot package.
library(cowplot) # I usually have this line in my top 'load all the packages' code chunk
compensation %>%
ggplot(aes(x = Root, y = Fruit)) + # this is the fist layer
geom_point() +# this is the second layer- a 'geometric' layer which in this case adds points
theme_cowplot()
You can have more than one theme() line. They are where
you alter various aspects of the ‘look’ of a plot, such as where to put
a legend, whether to have one at all, fonts, font sizes etc.
We normally do this, even for on-the-fly exploratory plots, since it is helpful and easy to do.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point() +
labs(x = 'Root Biomass',
y = 'Fruit Production') +
theme_cowplot()
We can include a size argument in the
geom_point() line.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point(size = 5) + # all the points will be size 5 (big!)
theme_cowplot()
Notice that because the size = 5 argument appears in the
geom_point() line and not in the first line, in the
aes() function, it affects all points. It is not mapped to
any particular set of points depending on values in some column of the
data.
Just as we changed the size of the points, we can change the colour
or shape of all of them using a colour = or
shape = argument in the geom_point()
layer.
compensation %>%
ggplot(aes(x = Root, y = Fruit)) +
geom_point(colour = "blue") + # all the points will be coloured blue
labs('Root Biomass',
y='Fruit Production') +
theme_cowplot()
While perhaps prettier than the default black, the colour in the previous plot has added no new information to the plot. Nor would it have helped if we had used triangular symbols instead of circles.
While we might do those things to fit in with some general style we wanted our plot to have, (or just because we liked the colour blue!) we can use colour, size, shape and so on to add information if the appearance attribute of the symbols is mapped to some feature of the data. That is, for example, if the points are coloured or shaped or sized differently depending on the value of some variable, such as species, site or as here, grazing condition.
We do this in the first, ggplot layer, as an additional
argument within the aes() function. All the arguments to
this ‘map’ variables (ie columns) of the dataset to aspects of the
plot.
So if we wanted to colour of the points to depend on the grazng condition, we would do this:
compensation %>%
ggplot(aes(x = Root, y = Fruit, colour = Grazing)) + # the colour will depend on the value of Grazing.
geom_point() +
labs(x='Root Biomass',
y='Fruit Production') +
theme_cowplot()
Notice that by default we now get a legend, which in this plot we
need. As we mentioned above, you can move, adjust or even remove this
legend altogether if you want using a theme() line.
See if you can adapt the previous code chunk to give a scatter plot of Fruit Production vs Root Biomass, with a different symbol depending on whether the tree was in grazed or ungrazed pasture.
You just have to adapt the code above to do what you want (Hint: shape). Often, when we use R, we do not have to write scripts from scratch. We just adapt existing scripts that almost do what we want so that they do exactly what we want.
Often, when we present data we want to convey some idea of their central value (mean, median etc) and some idea of their spread. A bar plot hides much of this information. A Box and Whisker plot is a good way to show it.
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
labs(x='Grazing treatment',
y='Fruit Production') +
theme_cowplot()
For each data set, the thick bar in the middle shows the median, the box shows the interquartile range, from the 25th to the 75 percentile, by default the whiskers show the distance to the points whose distance from the median is 1.5 times the width of the interquartile range, and any outliers are shown as individual points.
Note that we have by default been given a legend. Since here this
adds no information that was not already in the axis labels, we would
like to remove it. This can be done with a final theme()
layer:
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) + # we map the fill colour to the value of Grazing.
geom_boxplot() + # if we had fill = "blue" as an argument here, all boxes would be blue.
labs(x='Grazing treatment',
y='Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
You can use plots like this to see whether two data sets are similar or markedly different. If your research hypothesis was about whether two or more groups were different, a box plot often goes some way towards giving you an answer.
Do a similar plot to the last one, but this time show how Root Mass varies with Grazing.
If there are not too many data points, it can be instructive to
include these in the box plot. To do this we just amend the code for
drawing the basic box plot with another line (think of this as another
layer) where we include a geom_point(),
compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing, alpha = 0.2)) +
geom_boxplot() +
geom_point(aes(colour=Grazing),size = 4 ) + # all the points will have these attributes
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
Do you also see again how arguments within the aesthetic of the first
ggplot line map to elements of the data, so that, for
example, the fill colour depends on the value of Grazing.
In the geom_point() layer we have also included an
aesthetic argument so that we can map the colour of the points to the
Grazing treatment.
What does the alpha variable do?
Don’t like these colours? R recognises an extensive
vocabulary of colour names, including all the obvious ones - so it is
easy to guess and get something that will work. The full list is shown
here.
If that list is not enough you can also use palettes which
make available a vast array of colour schemes suitable for different
purposes, such as in maps. The RColorBrewer() package is
especially good. Trust me, there is a whole world of colour fun out
there if you want it.
p1<-compensation %>%
ggplot(aes(x = Root, y = Fruit, colour = Grazing)) +
geom_point() +
xlab('Root Biomass') +
ylab('Fruit Production') +
theme_cowplot()
p2<-compensation %>%
ggplot(aes(x = Grazing, y = Fruit, fill = Grazing)) +
geom_boxplot() +
xlab('Grazing treatment') +
ylab('Fruit Production') +
theme_cowplot() +
theme(legend.position = "none")
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)
Can you use either of these to answer these questions?
Is there anything in either of these plots that adds no additional information.If not, good. If so, we should get rid of it!
Looking at the distribution of our variables is very important. It can tell us about the central tendency and spread of the data, and give us an idea as to whether particular statistical tests will be valid.
A histogram is a good way to do this.
In a histogram, counts are made of numeric observations whose value fall within defined ranges or, as they are called, ‘bins’.
Let us produce a histogram of Fruit production.
For a histogram, we only specify the variable to be mapped to the x axis, in this case fruit mass, since the y-variable will be the counts of observations where the value of that variable fell within each bin.
g9<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram() +
theme_cowplot()
g9
That wasn’t very nice! Thre are too many bins, so that hardly any
observations fall in any one bin. We need to change the bin width, or,
alternatively, the number of bins.
Let us try both…
pn<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram(bins = 10) +
#title('Fixing the number of bins') +
theme_cowplot()
pw<-ggplot(compensation, aes(x = Fruit)) +
geom_histogram(binwidth = 15) +
theme_cowplot()
# this is a handy function from the gridExtra() package that you can use to arrange plots
grid.arrange(pn, pw, nrow = 1)
Try using more/narrower bins or fewer/fatter bins.
Do you see that, for a histogram to be informative, the number of bins
and/or the bin width has to be chosen carefully?
From these histograms, can you tell roughly where the data is centered, and roughly how variable it is?
When we want several plots done in the same style, one for each of
the different values of a categorical variable (or factor as R would
say) the facet_wrap() and facet_grid()
functions are useful;
compensation %>%
ggplot(aes(x = Fruit)) +
geom_histogram(binwidth = 15) +
facet_wrap(~Grazing, ncol = 2) +
theme_cowplot()
Try altering the ncol = 2 argument to
ncol = 1. What happens? When might doing that be
useful?
Try removing the facet_wrap() line. What do you get? Do
you see how facet_wrap() can be useful?
There are very many other plot types you can do using
ggplot. Here are a couple:
To illustrate this, we will make up a dataset of discrete count data: fruit types:
fruit_data<-tibble(fruits = c(rep("apple", 11), rep("orange", 7), rep("pear", 4)))
glimpse(fruit_data)
## Rows: 22
## Columns: 1
## $ fruits <chr> "apple", "apple", "apple", "apple", "apple", "apple", "apple", …
That gives a data frame with one categorical variable,
fruits. To see how many there are of each type we can use
the table() function:
table(fruit_data$fruits)
##
## apple orange pear
## 11 7 4
and we can illustrate this count data with a bar chart, for which we
use geom_bar() within ggplot:
fruit_data %>%
ggplot(aes(x = fruits)) +
geom_bar() +
theme_cowplot()
We don’t need to specify a y variable, since
geom_bar() understands that it is supposed to count up how
many of each of the x variable the data set contains.
If our dataset is already a set of counts, for example:
fruit_count<-fruit_data %>%
group_by(fruits) %>%
summarise(Count = n())
fruit_count
then we can get a bar plot of these counts most simply by using using
geom_col()
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col() +
theme_cowplot()
There are many, many variations you can make to these basic plots. For example, you might want some colour:
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col(fill = "darkred") +
theme_cowplot() +
theme(legend.position="none")
Notice the line theme(legend.position="none") line which
we have used before. It suppresses a legend, which we would get by
default but do not need here.
To view the plot sideways we add the coord_flip()
line:
fruit_count %>%
ggplot(aes(x = fruits, y = Count)) +
geom_col(fill = "cornflowerblue") +
coord_flip() +
theme_cowplot() +
theme(legend.position="none")
This is useful when you have lots of categories and the category names, eg species names, are too long to fit under each column.
When you gather a sample of data with some number of replicates, you hope that that sample is representative of the population from which it has been drawn. Why? Because it is the population that you want to know about. The variation of the sample, which we characterise by its standard deviation, is our best estimate of the variability of the population from which the sample has been drawn.
The mean of of some aspect of that sample is your best estimate of the mean value of that aspect for the whole population. You know, though, that if you were to take another sample, then the mean of that sample would be different from that of the first. And the mean of a third sample would be different from the mean of the second, and so on. In time, you would see that the successive means were distributed, probably normally distributed, around a central value. That central value you can reasonably assume to be the true population mean. The spread of the distribution can be characterised by what we call the standard error of the mean. This distribution is sometimes referred to as the sampling distribution.
But, hang on, this is fictitious. We don’t, in reality, take sample after sample. We take just one. That single, actual sample will have a mean that lies somewhere within the sampling distribution - the distribution of sample means we would get if we did take sample after sample. The mean of our actual sample might be a bit more or a bit less than the true population mean. We don’t know this population mean but the mean of our solitary sample is our best estimate of it. It is all we have to go on. We can give people an idea of how close our sample mean is likely to be to the true mean by telling them the standard error of the sampling distribution, and this we can estimate from the standard deviation of our actual sample. If the standard deviation is \(\sigma\), then the standard error of the mean is \(\frac{\sigma}{\sqrt{n}}\), where \(n\) is the sample size.
So, in the end, if we are going to display mean values of samples in a plot or a table, and talk of these as those they are estimates of population means, then we ought to include some information as to how precise an estimate they are. We do this by also including the standard error of the mean. Typically, you might see in a table a value presented as something like \(2.4 \pm 0.35\) together with information (hopefully!) in the caption to tell you that this is a mean value plus or minus one standard error of the mean. In a figure, we would include an error bar, where that error bar would have a length equal to one standard error of the mean. Crucially too, we should say in the caption to the figure what that error bar represented. In the case discussed here we mean it to indicate one standard error of the mean. But in another paper, another author might mean by it the standard deviation of the sample which, remember, is our best estimate of the spread of values in the population. This is not the same thing as the standard error of the mean, which is our estimate of how precisely we know the mean of that population. So, we need to clearly tell the reader what we actually mean.
Here we use the iris data set, a commonly used data set
that is built into R. It has 150 rows of data, 50 for each of 3 species
of iris. For each species there are measurements of 4 quantitative
variables, sepal length and width, and petal length and width. Suppose
we are interested in sepal lengths and how that differs between species.
We first use group_by() and summarise() to
find the mean and standard error of the mean of the sepal length for
each species, then we plot the mean values using
geom_point() and add error bars using
geom_errorbar().
iris %>%
group_by(Species) %>%
summarise(n=n(),
mean_SL=mean(Sepal.Length),
se_SL=sd(Sepal.Length/sqrt(n()))) %>%
ggplot(aes(x=Species,y=mean_SL,colour=Species,fill=Species)) +
geom_point() +
geom_errorbar(aes(ymin=mean_SL-se_SL,ymax=mean_SL+se_SL),width=0.2) +
labs(x="Species",
y="Sepal Length (mm)",
caption="Figure shows mean values plus or minus one standard error of the mean") +
theme_cowplot() +
theme(legend.position="none")
Let us use ggplot2 to display the data in the
iris.csv data file that should already be in your
RStuff/data folder. If not, you should be able to get it from the
RStuff/data folder on Teams.
Create a new R Notebook called iris_plot and save it in your RStuff/scripts folder
tidyverse and
hereirisglimpse()facet_wrap() to produce 3 histograms for sepal
width, one for each species, one above the other. Adjust the bin number
until the histograms look about right.# load packages
library(tidyverse)
library(here)
# load the data
filepath <- here("data","iris.csv")
iris<-read_csv(filepath)
#inspect it
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "setosa…
There is only one variable per column and each variable appears only once in each row, so the data is tidy
means_table<-iris %>%
group_by(Species) %>%
summarise(mean.sepal.length=mean(Sepal.Length),mean.sepal.width=mean(Sepal.Width))
means_table
# Scatter plot, sepal length vs sepal width
iris %>%
ggplot(aes(x=Sepal.Width,y=Sepal.Length,color=Species))+
geom_point()+
xlab('Sepal Width')+
ylab('Sepal Length')+
theme_cowplot()
There appears to be a positive correlation between sepal length and sepal width for each of the species.
This next plot is useful overall but has a feature of questionable usefulness.
# Box plots for sepal length
iris %>%
ggplot(aes(x=Species,y=Sepal.Length,fill=Species))+
geom_boxplot()+
theme_cowplot()+
theme(legend.position="none") # remove the legend as it is redndant in this plot
It looks as though the sepal lengths do differ between species, with setosa < versicolor < virginica. Having seen this plot and got this idea we would next properly establish whether or not the observed differences could be down to chance by using, preferably, a parametric test such as an ANOVA or, if the data did not permit that, its less powerful non-parametric equivalent, the Kruskal-Wallis test.
Now, does the use of colour add information to this plot? Hint: no! It possibly even confuses the reader by making them think there is some other feature that differentiates the three sets of measurement besides that they relate to different species. The species are displayed on the x-axis, so what do the coulrs do for us? Nothing, so best leave them out.
# Histograms for sepal width, one for each species, one above the other
iris %>%
ggplot(aes(x=Sepal.Width))+
geom_histogram()+
facet_wrap(~Species,ncol=1) +
theme_cowplot()
This defaults to 30 bins, which is too many for these data. You end up with too few points in each bin and a very ragged looking histogram. Let’s reduce it until we get something that looks about right:
iris %>%
ggplot(aes(x=Sepal.Width))+
geom_histogram(bins=10)+
facet_wrap(~Species,ncol=1) +
theme_cowplot()
I think that for this data set, around 10 bins is ‘about right’. What happens if we make the bin number too low?
Each data set is roughly symmetrically distributed. They could well be normally distributed but it is difficult to tell from histograms of small samples. A more effective graphical way to tell is to use [quantile-quantile plots](https://rpubs.com/mbh038/725314.
The sepal width of setosa appears to be greater than that of
the other two, between which there is no obvious difference. We can see
this easily because we have used facet_wrap() and placed
the sub-figures one above the other.