Palmer Penguins Data Set
To begin, let’s quickly refresh our memories of the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020).
This data set contains information on three species of penguin that live on different islands in the Palmer Archipelago, off the coast of Antarctica. For more details, you can refer to Section 2 of the Data Visualisation in R supplement.
Note: If you do not have the palmerpenguins package installed, run the code below first:
install.packages("palmerpenguins")
Open up RStudio, and run the following code to load the palmerpenguins package and summarise the data it contains.
Note that the package is called palmerpenguins, but once it is loaded, the actual data to access in R is stored in the object penguins.
# This code loads the `palmerpenguins` package into your current R working environment.
library(palmerpenguins)
# This code summarises the `penguins` data set provided by the `palmerpenguins` package.
summary(penguins)
Don’t worry too much about the values shown in the summary table - the main things to note at this stage are the different variables, namely species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.
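If the summary output feels dense, a lighter way to see just the variables is to ask R for the column names and dimensions of the data. The following is a minimal sketch using only base R functions on the penguins object; it is an optional extra rather than a required step of the lab.

```r
library(palmerpenguins)

# Column names of the penguins data set (one per variable)
names(penguins)

# Number of rows (penguins) and columns (variables)
dim(penguins)

# Compact overview of each variable's type and first few values
str(penguins)
```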
Plotly Scatter Plots
From our summary table, we can see that the measurement variables for the penguins include body_mass_g and flipper_length_mm.
It seems reasonable to think that penguins with larger body masses might also have longer flippers.
To visualise this, and check our assumption, we could use a scatter plot. To create this scatter plot, we will use the plotly package, which offers several benefits over using the default plotting options in R. For more details on plotly, you can refer to Section 3 of the Data Visualisation in R supplement.
First, let’s load the plotly package. Using our process in 1 as a guide, load the plotly package in R.
(If you do not have the plotly package installed, make sure to install it now too.)
Hint: Check the code chunk below if you are not sure how to proceed.
install.packages("plotly")
library(plotly)
We can create a plotly plot using the function plot_ly(). Let’s take a look at the typical composition of a plotly plot:
plot_name <- plot_ly(data = ..., x = ~ ..., y = ~ ...)
Let’s break this down.
- Firstly, using the assignment operator <-, we assign a name to our plot - here we have chosen the generic plot_name.
- Next, within plot_ly(), we specify the main arguments of the function.
- The data = ... part tells R what data we are analysing.
- The x = ~ ... part tells R which variable in our data set to plot on the x-axis of our plot.
- The y = ~ ... part tells R which variable in our data set to plot on the y-axis of our plot.
Note that we simply replace the ...s with whatever data we are using.
The code below will create a simple scatter plot of flipper_length_mm versus body_mass_g. Make sure to inspect this code, and check that you understand each component. Since we have specified that our data set is penguins, we don’t need to repeat this when specifying our x and y inputs - we can simply name any of the variables contained within this data set.
penguins_scatter <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm)
penguins_scatter
Note that once we assign the plot to the object penguins_scatter, we then have to run this object in a subsequent line, in order for the plot to be rendered.
One key benefit of plotly graphs compared to base R graphs is that the plotly graphs are interactive!
Notice that if you hover over the data in the scatter plot, you can see the specific coordinates of each point.
If you left-click and drag your cursor over a section to create a box, you can also zoom in on a particular section of the plot. Just double left-click to zoom back out.
As we suspected, it seems quite clear that as the body mass of penguins increases, so too does their flipper length.
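The visual impression can also be backed up with a number. The sketch below, which is an addition rather than part of the original lab, computes the correlation between the two variables with base R's cor() function; the use = "complete.obs" setting drops penguins with missing measurements. A value close to 1 would support the upward trend seen in the plot.

```r
library(palmerpenguins)

# Pearson correlation between body mass and flipper length,
# using only rows where both measurements are present
mass_flipper_cor <- cor(penguins$body_mass_g, penguins$flipper_length_mm,
                        use = "complete.obs")
mass_flipper_cor
```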
But our graph is quite basic at the moment - we can do better.
Another great aspect of plotly graphs is that it is very easy to include a third variable which can help to further distinguish the data plotted on the x and y axes. We can do so by adding the argument color = ~... within plot_ly().
Perhaps the body_mass_g and flipper_length_mm of the penguins are also related to their sex?
Let’s take a look at how our scatter plot changes, if we distinguish between male and female penguins.
penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex)
penguins_scatter2
That’s looking a bit nicer! Now we can see that most of the smaller penguins are female, and most of the larger penguins are male. If you hover over the data, you’ll notice that the sex is now shown alongside the coordinates of each data point.
We also have a helpful legend in the top right. This is not only useful as a guide - try clicking on one of the labels in the legend.
While our scatter plot is looking better, the default colours chosen to distinguish between male and female penguins are quite similar. Perhaps we would like more contrast?
To specify the set of colours to use for the plot, we can add the additional argument colors = ... to our plot_ly function. This argument accepts any valid R colour codes. Take a look at this pdf for an overview of different colours we can use in R.
Complete and then run the code below to change the colours you use in your scatter plot.
penguins_scatter_colours <- plot_ly(data = penguins,
x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = ...)
penguins_scatter_colours
Hint: You will need a combination of two colours. Check the Code box below if you are stuck.
# If you are specifying individual colours, you will need to use the format
# colors = c("...", "...")
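To see this format on a different variable (so as not to give away the exercise), the sketch below colours the points by species instead of sex, which needs three colours rather than two. The colour names chosen and the object names species_colours and penguins_scatter_species are illustrative assumptions; any valid R colour names or hex codes could be used instead.

```r
library(plotly)
library(palmerpenguins)

# One colour per species, given as a character vector of R colour names
species_colours <- c("darkorange", "purple", "cyan4")

penguins_scatter_species <- plot_ly(data = penguins,
                                    x = ~body_mass_g, y = ~flipper_length_mm,
                                    color = ~species, colors = species_colours,
                                    type = "scatter", mode = "markers")
penguins_scatter_species
```

Running colors() in the R Console lists every colour name that R recognises.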
If you do not want to spend too much time customising the colours used in your plots, there are pre-existing sets of colours you can use. Try setting your colors = ... argument in 2.5 sequentially to colors = "Set1", then to colors = "Set2" and finally to colors = "Set3". Do any particular sets appeal to you?
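These set names appear to correspond to ColorBrewer palettes, which plotly accepts by name. Assuming the RColorBrewer package is available (it is normally installed alongside plotly), the sketch below shows the colour codes behind a palette and previews the qualitative palettes side by side, which may help when choosing between them.

```r
library(RColorBrewer)

# The first three colours of the "Set1" palette, as hex codes
set1_colours <- brewer.pal(n = 3, name = "Set1")
set1_colours

# Preview all qualitative palettes, including Set1, Set2 and Set3
display.brewer.all(type = "qual")
```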
There are many different display options for plot_ly graphics, and if you try running the R commands
penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1")
penguins_scatter2
you may see some red Warning messages appear in the R Console. Often, you don’t have to worry about these, but if you would like to minimise them, you can add the following arguments to your plot_ly function.
penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1",
type = "scatter", mode = "markers")
penguins_scatter2
Here, we have included the additional arguments type = ... and mode = ... .
- We set type = "scatter" to ensure our data is plotted as a scatter plot.
- We set mode = "markers" to ensure that each of our data points is plotted individually.
These additional arguments are often helpful, as sometimes we like to have a little more control over how our data is presented.
You’ll notice, however, that if these arguments are omitted from your function, R will just work out what it thinks is the optimal presentation format (hence the warning messages informing us which options R has selected, since some details haven’t been user-specified).
This is often for the best - try changing the mode = "markers" section of code to mode = "lines" and then re-running the plot. What happens?
So far, we have treated all the penguins as one large group, differentiated by sex. However, we actually have data for three separate species of penguin - Adelie, Chinstrap, and Gentoo.
We have already used different colours to differentiate the male and female penguins, but so far all the data points are the same symbol - a dot. We can use the additional argument symbol = ... within our plot_ly function to further improve our graph, and distinguish between the different species of penguin.
Take a look at the R code and resultant graph below:
penguins_scatter3 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1", symbol = ~species,
type = "scatter", mode = "markers")
penguins_scatter3
Great! This is looking much more informative than our initial scatter plot. Now it’s quite clear for instance that the majority of larger penguins (both male and female) are of the Gentoo species, which we couldn’t discern from our previous version of the scatter plot.
Just as R has many colour options available, so too are there many symbol options available. Since we have not specified which symbols to use in 2.8, R has used its first three default symbols.
We used colors = ... to modify our color = ... specification, and similarly, we can use symbols = ... to modify our symbol = ... specification.
There are 26 different base R symbols you can choose from - these can be specified either by number, or by name. Some of the names are quite long, e.g. "filled triangle point-up", so it is often easier to use numbers. However, some names are easy to remember - take a look at the table below.
Table 2.1: Symbol Options
Number | Name
0 | square
1 | circle
2 | triangle point up
3 | plus
4 | cross
5 | diamond
8 | star
Using Table 2.1 and the symbols = ... argument, change the symbols used in the penguins_scatter3 scatter plot created in 2.8.
Hint: If you are using symbol names, and your code isn’t working, check the code chunk below.
# Note that just like for the colours argument, if you are using words,
# these need to be surrounded with quotation marks,
# e.g. "square", or 'square' will work, but square will not
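To illustrate the format without giving away the exercise, the sketch below applies symbols to a different variable, island, which also has three levels. The symbol names chosen and the object names island_symbols and penguins_scatter_island are assumptions made for illustration only.

```r
library(plotly)
library(palmerpenguins)

# One symbol per island, given by name (note the quotation marks)
island_symbols <- c("circle", "square", "diamond")

penguins_scatter_island <- plot_ly(data = penguins,
                                   x = ~body_mass_g, y = ~flipper_length_mm,
                                   symbol = ~island, symbols = island_symbols,
                                   type = "scatter", mode = "markers")
penguins_scatter_island
```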
As a final touch, you may also like to change the size of the symbols in your scatter plot.
To do so, we can include the marker = ... argument in our plot_ly function.
This is a little more complicated to use than our previous arguments, as multiple specifications can be made within this argument. As a result, we use the format marker = list(...). Within the list() function, we can include multiple specifications which all pertain to the marker argument.
To change the size of the symbols, we use the appropriately named size = argument, within the list() function.
As a result, if we want to change the default marker size (6) to be a little larger, we could include the argument marker = list(size = 8) within our plot_ly function for our penguins_scatter3 scatter plot created in 2.8.
Try this now, observe the changes, and then try increasing and decreasing the marker size.
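Because marker takes a list, several settings can be combined in one place. The sketch below sets both the size and the opacity of the points; opacity is not covered in this lab and is included here only as an assumed example of a second marker setting, and the object name penguins_scatter_markers is hypothetical.

```r
library(plotly)
library(palmerpenguins)

penguins_scatter_markers <- plot_ly(data = penguins,
                                    x = ~body_mass_g, y = ~flipper_length_mm,
                                    color = ~sex, colors = "Set1", symbol = ~species,
                                    type = "scatter", mode = "markers",
                                    # size is in pixels; opacity runs from 0 to 1
                                    marker = list(size = 8, opacity = 0.7))
penguins_scatter_markers
```

Slightly transparent markers can make overlapping points easier to tell apart.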
Mixed Subplots
Recall from our first Data Science Computer Lab how we created some histograms for our palmerpenguins data set. Some of the code used for that lab is reproduced below:
penguin_hist <- plot_ly(data = penguins, x = ~body_mass_g, color = ~island, type = "histogram", alpha = 0.6)
penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'), barmode ="overlay")
penguin_hist
Suppose that we would like to present all our palmerpenguins data visualisations together. We can do this using the subplot function.
Take a look at the R code below:
penguin_combined_plots <- subplot(penguins_scatter3, penguin_hist,
nrows = 2, margin = 0.05)
penguin_combined_plots <- penguin_combined_plots %>%
layout(title = "Palmer Penguin Data",
xaxis = list(title = 'body_mass_g'),
yaxis = list(title = "flipper_length_mm"),
xaxis2 = list(title = 'body_mass_g'),
yaxis2 = list(title = "count"))
Note that here:
- We are using the subplot command to plot the penguins_scatter3 and penguin_hist plots together.
- The nrows = 2 argument tells R to produce these plots in 2 rows.
- The margin = 0.05 argument tells R to leave a small margin between the two plots.
- The subsequent lines of code are used to add a title to our selection of plots, and add axis labels to the plots - note that we use xaxis to define the x-axis label for the first plot, and xaxis2 to define the x-axis label for the second plot (and similarly for the y-axes).
When we now run this object penguin_combined_plots, we obtain the following:
penguin_combined_plots
Note that the two plots are still fully interactive. The legends have been combined, and can be used to filter the individual plots.
While we have only combined two plots here, the subplot function can be used to present several plots together, which can be particularly informative when you would like to display multiple aspects of your data simultaneously.
The only major downside of presenting plots together using subplot is that their axis labels are removed by default, and must be respecified, as above.
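As an alternative to respecifying the labels in layout(), the subplot function has titleX and titleY arguments that, when set to TRUE, keep each plot's own axis titles. The sketch below is a minimal, self-contained illustration using two simplified plots; the object names are hypothetical, and the retained titles are simply the variable names, so layout() remains the better choice when custom labels are wanted.

```r
library(plotly)
library(palmerpenguins)

# Two simple plots to combine
simple_scatter <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
                          type = "scatter", mode = "markers")
simple_hist <- plot_ly(data = penguins, x = ~body_mass_g, type = "histogram")

# titleX and titleY ask subplot to keep the axis titles of the individual plots
combined_with_titles <- subplot(simple_scatter, simple_hist,
                                nrows = 2, margin = 0.05,
                                titleX = TRUE, titleY = TRUE)
combined_with_titles
```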
Using the information from 4.1, try to combine the scatter plot you produced in 3 with the histogram shown above at the start of 4.
Hint: You don’t need to write any code for the histogram, you can simply use the R code shown at the start of 4.
Great job, that’s everything for today!
Hopefully you now feel confident creating plotly scatter plots. Don’t worry if some of the code seems difficult at the moment - we are only at the second lab, and we will have plenty of time to practice and improve as the semester progresses.
Before you finish up, make sure to save your script file somewhere safe - it might come in handy later on.
