Topic 2B: Data Visualisation I
Welcome to the second computer lab for the Data Science stream of STM1001.
In this computer lab we will create some informative, interactive data visualisations using the penguins data set from the palmerpenguins package (Horst, Hill, and Gorman 2020) and a new package, plotly (Sievert 2020).
The types of plots we consider in this lab were introduced in
Topic 1 and Topic 2.
By the end of this lab, you should feel comfortable creating interactive histograms and scatter plots in RStudio.
🎧 Reminder: Online students
Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:
Prompts for you
💬 Write your answer in the chat.
Modes at different times during the lab
🏡 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion
💡 Breakout rooms. The person whose birthday is closest to a date picked at random by your computer lab demonstrator shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.
💻 Focus mode. You will still be in the main room, but working independently. All students will share their screens during this time so that your computer lab demonstrator (but not other students) can see them.
🏫 Reminder: Face-to-face (blended) students
Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.
Palmer Penguins Data Set
🏡 Let’s quickly refresh our memories of the penguins
data set from the palmerpenguins
R package (Horst, Hill, and Gorman 2020).
This data set contains information on 3 species of penguin, who live on different islands in the Palmer archipelago, off the coast of Antarctica. For more details, you can refer to Section 2 of the Data Visualisation in R supplement.
🏡 To begin, make sure you have the palmerpenguins
R package installed in RStudio.
Note: If you do not have the palmerpenguins
package downloaded, just click on the Code
box below, and run the code that appears:
install.packages("palmerpenguins")
🏡 Run the following code to load the palmerpenguins
R package, and to summarise the penguins
data set.
Note: The package is called palmerpenguins
, but once this is loaded, the actual data to access in R is stored in the object penguins
.
# Load the `palmerpenguins` package into your current R working environment.
library(palmerpenguins)
# Summarise the `penguins` data in the `palmerpenguins` package.
summary(penguins)
We don’t need to spend much time assessing this summary - the main things to note at this stage are the different variables, namely species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.
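If you would like a more compact view of these variables than summary() gives, the short sketch below lists the variable names, the size of the data set, and the number of missing values per variable. It uses only base R functions; the object name missing_counts is an illustrative choice and is not used elsewhere in this lab.

```r
library(palmerpenguins)

# List the variable names, then the number of rows and columns.
names(penguins)
dim(penguins)

# Count the missing values (NA) recorded for each variable.
missing_counts <- colSums(is.na(penguins))
missing_counts
```

Knowing which variables contain missing values can be helpful later on, since observations with missing values cannot be drawn in a plot.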
Creating Interactive Histograms in RStudio
💻 Suppose that we are interested in the distribution of recorded body masses (in grams) of the penguins living in the Palmer Archipelago. To visualise this distribution using the penguins
data, we can produce a histogram.
Note: Refer to Section 3.2 of Topic 1 for details on histograms.
Recall that we can create a simple histogram using the built-in R function hist
:
hist(penguins$body_mass_g, breaks = 19)

This histogram is static, meaning that we cannot interact with the image, and we cannot manipulate it in real time to display different details - perhaps, for example, we would like to see the distribution of the penguins’ body_mass_g
values, but only for the penguins on a specific island.
To achieve these objectives using just the hist
function would take some extensive coding.
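As a rough sketch of the static approach, the code below draws a base R histogram for a single island by subsetting the data by hand. The object name dream_mass, the choice of Dream island, and the title and axis label are illustrative assumptions. Note that every different view (another island, or several islands together) would require editing and re-running code like this.

```r
library(palmerpenguins)

# Keep only the body masses recorded on Dream island.
dream_mass <- penguins$body_mass_g[penguins$island == "Dream"]

# Draw a static histogram for this single island.
hist(dream_mass, breaks = 19,
     main = "Body mass of penguins on Dream island",
     xlab = "Body mass (g)")
```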
💻 As an alternative to the built-in hist
function, we could use the plot_ly
function from the plotly
package to create an interactive, responsive histogram. Let’s take a look at how to do this now.
To begin, just as for the previous packages, we will need to download and load the plotly
package in RStudio, before we can use any plotly
functions.
Run the code below to install and load the plotly
package.
install.packages("plotly")
library(plotly)
💻 To create plotly
plots, we use the function plot_ly()
.
Run the code below to create an interactive histogram of the penguins’ body_mass_g values and store it in the object penguin_hist_base.
At this point, don’t worry about the composition of this function - we’ll cover this in more detail shortly.
- For the moment, take a look at the code below, and see if you can get a general idea of what’s going on.
penguin_hist_base <- plot_ly(data = penguins,
x = ~body_mass_g,
type = "histogram")
penguin_hist_base <- penguin_hist_base %>% layout(yaxis = list(title = 'count'))
❓Note
Once you have taken some time to consider the code above, if you would like more details or would like to check the accuracy of your interpretation, click the Show
button below for a brief explanation.
# Here, we are creating a plotly object called "penguin_hist_base"
penguin_hist_base <- plot_ly(data = penguins, # We are using the penguins data
x = ~body_mass_g, # and plotting the body_mass_g data
type = "histogram") # in a histogram format
# The code below is used to modify the layout of the histogram
# to include a label for the y-axis
penguin_hist_base <- penguin_hist_base %>% layout(yaxis = list(title = 'count'))
💻 The plotly
histogram has now been created, but won’t appear until we call the object in which it was stored. If you run the code below, your plotly
histogram (as shown below) should appear in the Viewer
section of RStudio.
penguin_hist_base
💻 Unlike graphs created using base R functions, plotly
graphs are interactive - even when embedded in web pages like this one!
Try the following:
- If you hover over the histogram in 2.3, you can see the specific details of the data.
- If you left-click and drag your cursor over a section of the histogram to create a box, you can also zoom in on a particular section of the plot. Just double left-click to zoom back out.
💻 Perhaps you are not impressed with plotly
yet. After all, our histogram in 2.3 doesn’t look that different to the hist
function version we created at the start of 2, so what is all the fuss about?
Well, it is very easy to modify our plot_ly
histogram to show extra detail. For example, we can easily produce separate histograms for the penguins on each island. Take a look at the R code below, which builds upon what we used in penguin_hist_base
.
penguin_hist <- plot_ly(data = penguins,
x = ~body_mass_g,
color = ~island,
type = "histogram", alpha = 0.6)
penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'),
barmode ="overlay")
Before you move on to the next question, run this code in RStudio.
❓Note
Once you have taken some time to consider the code above, if you would like more details or would like to check the accuracy of your interpretation, click the Show
button below for a brief explanation.
# Here, we are creating a plotly object called "penguin_hist"
penguin_hist <- plot_ly(data = penguins, # We are using the penguins data
x = ~body_mass_g, # and plotting the body_mass_g data
color = ~island, type = "histogram", alpha = 0.6)
# We are producing a histogram for this data, with bars coloured differently,
# depending on the island on which the penguin is located
# The code below is used to modify the layout of the histogram
# This includes adding a label to the y-axis
# and setting the histograms to be layered over each other
# (hence the alpha = 0.6 above to change the opacity)
penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'),
barmode ="overlay")
💻 To produce this updated plotly
histogram, run the R code below. Your new histogram (as shown below) should appear in the Viewer
section of RStudio.
penguin_hist
This is looking better than our previous histogram! Because we have told our plot_ly
function to assign different colours to the different islands, we now have three histograms, rather than one with all the data clumped together.
Even better, these are all presented within the one graph, which also includes a handy legend. Hopefully you are now beginning to appreciate the additional functionality offered by plotly
over built-in R functions.
Note: For more details on plotly
, you can refer to Section 3 of the Data Visualisation in R supplement.
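The overlaid layout used above is only one option. As a hedged sketch, the code below swaps barmode = "overlay" for barmode = "stack", so that the bars for each island are stacked on top of one another rather than layered. The object name penguin_hist_stacked is an illustrative choice, and alpha is left out here because the bars no longer overlap.

```r
library(palmerpenguins)
library(plotly)

# Same histogram as before, coloured by island
penguin_hist_stacked <- plot_ly(data = penguins,
                                x = ~body_mass_g,
                                color = ~island,
                                type = "histogram")

# Stack the island histograms instead of overlaying them
penguin_hist_stacked <- penguin_hist_stacked %>%
  layout(yaxis = list(title = 'count'),
         barmode = "stack")

penguin_hist_stacked
```

With stacked bars, the total height of each bar shows the count across all islands, while the colours show how that count is split between islands.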
💻 Finally, and perhaps most importantly for this specific example, it is important to note that we can dynamically filter results in plotly
graphs. For this example, we can filter observations to focus on data from a specific island. Simply click on one of the entries in the legend in the top right of our histogram in 2.6 to temporarily hide that island’s data (note that the axes dynamically adjust too).
Try focusing just on the Dream island penguins.
Hint: To bring the removed data back, simply click once more on the relevant line in the legend.
🎧 Online students
💬 Leave a comment in the chat about your favourite aspect of the interactive plotly graphs so far.
🏡 Reconvene in main room to discuss results
Creating Interactive Scatter Plots in RStudio
💻 From our summary table in 1.2, we can see that the measurement variables in the penguins
data set include body_mass_g
and flipper_length_mm
.
It seems reasonable to assume that penguins with larger body masses might also have longer flippers.
To visualise the observations for these variables, and check our assumption, we could use a scatter plot. To create this scatter plot, we will again use the plotly
package, as it offers several benefits over using the default plotting options in R.
Note: Refer to Section 5.1 of Topic 2 for details on scatter plots.
💻 To create an interactive plotly
plot we need to use the function plot_ly()
. In 2.2, we brushed over the details of the plot_ly()
function, so let’s remedy that now.
The typical composition of a simple plotly
plot looks like this:
plot_name <- plot_ly(data = ..., x = ~ ..., y = ~ ...)
Let’s break this down:
- Firstly, using the assignment operator <-, we assign a name to our plot - here we have chosen the generic plot_name.
- Next, within plot_ly(), we specify the main arguments of the function.
- The data = ... part tells R what data we are analysing.
- The x = ~ ... part tells R which variable in our data set to plot on the x-axis of our plot.
- The y = ~ ... part tells R which variable in our data set to plot on the y-axis of our plot.
Note: We simply replace each ... with whatever data we are using.
💻 Run the code below to create a simple scatter plot of the recorded flipper_length_mm
versus body_mass_g
values in the penguins
data set (as shown below). Make sure to inspect this code, and check that you understand each component.
Note that since we have specified our data set is penguins
in the code, we don’t then need to do this when specifying our x
and y
inputs - we can simply specify any of the variables contained within this data set.
penguins_scatter <- plot_ly(data = penguins,
x = ~body_mass_g, y = ~flipper_length_mm)
penguins_scatter
Note: Once we assign the plot to the object penguins_scatter
, we then have to run this object in a subsequent line, in order for the plot to be rendered.
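Just as we used layout() to label the y-axis of our histogram, we can use it to give a scatter plot more readable axis titles. The sketch below is one possible way to do this; the object name penguins_scatter_labelled and the title text are illustrative assumptions.

```r
library(palmerpenguins)
library(plotly)

# Basic scatter plot of flipper length versus body mass
penguins_scatter_labelled <- plot_ly(data = penguins,
                                     x = ~body_mass_g, y = ~flipper_length_mm)

# Replace the default axis titles (the variable names) with descriptive labels
penguins_scatter_labelled <- penguins_scatter_labelled %>%
  layout(xaxis = list(title = "Body mass (g)"),
         yaxis = list(title = "Flipper length (mm)"))

penguins_scatter_labelled
```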
💻 As we suspected, it seems quite clear from the scatter plot above in 3.2 that, in general, as the body mass of penguins increases so too does their flipper length.
But our graph is quite basic at the moment - we can do better.
It is very easy to include a third variable in a plotly graph, which can help to further distinguish the data plotted on the x and y axes. We can do so by adding the argument color = ~... within plot_ly().
Perhaps the body_mass_g and flipper_length_mm of the penguins are also related to their sex?
Let’s take a look at how our scatter plot changes, if we distinguish between male and female penguins.
penguins_scatter2 <- plot_ly(data = penguins,
x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex)
penguins_scatter2
That’s looking a bit nicer! Now we can see that the smallest penguins tend to be female, and the largest penguins tend to be male. If you hover over the data, you’ll notice that the sex
is now shown alongside the coordinates of each data point.
We also have a helpful legend in the top right. Remember, this is not only useful as a guide - try clicking on one of the labels in the legend.
💻 While our scatter plot is looking better, the default colours chosen to distinguish between male and female penguins are quite similar. Perhaps we would like more contrast?
To specify the set of colours to use for the plot, we can add the additional argument colors = ...
to our plot_ly
function. This argument accepts any valid R colour codes. Take a look at this pdf for an overview of different colours we can use, or simply try a few basic colours like red, green etc.
Complete and then run the code below to change the colours you use in your scatter plot.
penguins_scatter_colours <- plot_ly(data = penguins,
x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = ...)
penguins_scatter_colours
❓Hint
You will need a combination of two colours. Check the Show
box below if you are stuck.
# If you are specifying individual colours, you will need to use the format
# colors = c("...", "...")
# within the plot_ly() function
💻 If you do not want to spend too much time customising the colours used in your plots, there are pre-existing sets of colours you can use. Try setting your colors =...
argument in 3.4 sequentially to colors = "Set1"
, then to colors = "Set2"
and finally to colors = "Set3"
. Do any particular sets appeal to you?
🎧 Online students
💬 Post your colour choice(s) in the chat.
💻 There are many different display options for plot_ly
graphics. If you try running the R commands below, you may see some red Warning messages appear in the RStudio Console
section. As discussed in Computer Lab 1B, while it is important to read them, often you don’t have to worry about these.
penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1")
penguins_scatter2
In this instance, if you would like to minimise warning messages, you can add the arguments type = "scatter", mode = "markers"
to your plot_ly
function, so that your code now looks like this:
penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1",
type = "scatter", mode = "markers")
penguins_scatter2
Here:
- We set
type = "scatter"
to ensure our data is plotted as a scatter plot.
- We set
mode = "markers"
to ensure that each of our data points is plotted individually.
These additional arguments are often helpful, as sometimes we like to have a little more control over how our data is presented.
You’ll notice however that if these commands are omitted from your function, R will just work out what it thinks is the optimal presentation format (hence the warning messages informing us which options have been selected, since some details haven’t been user-specified).
This automatic selection is often for the best - try changing the mode = "markers"
section of code above to mode ="lines"
and then re-running the code chunk. What happens?
💻 So far, we have treated all the penguins as one large group, differentiated by sex
.
When we hover over a point in our scatter plot (representing a penguin), we see the flipper length, body mass, and sex details for that penguin. This is great, but we are missing one important piece of information - the species of penguin! Remember, we actually have data for three separate species
of penguin - Adelie
, Chinstrap
, and Gentoo
.
Fortunately, it is straightforward to add this information to the hover text of our plot. We can do this by including the argument text = ~species
in our code, in a similar way to how we have used color = ~sex
to colour the points.
Update your penguins_scatter2
plot with this text = ~species
addition now, and hover over some points to check that your code has worked as intended.
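If you get stuck, the sketch below shows one possible way to add the species to the hover text. It stores the result in a new object, penguins_scatter_text (an illustrative name), so that your existing penguins_scatter2 plot is left unchanged.

```r
library(palmerpenguins)
library(plotly)

# Scatter plot coloured by sex, with the species added to the hover text
penguins_scatter_text <- plot_ly(data = penguins,
                                 x = ~body_mass_g, y = ~flipper_length_mm,
                                 color = ~sex, colors = "Set1",
                                 text = ~species,
                                 type = "scatter", mode = "markers")

penguins_scatter_text
```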
💻 We have already used different colours to differentiate the male and female penguins, but all the data points are the same symbol - a dot.
Instead of including the argument text = ~species
, we can use the additional argument symbol = ...
within our plot_ly
function to distinguish between the different species
of penguin.
Take a look at the R code and resultant graph below:
penguins_scatter3 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
color = ~sex, colors = "Set1", symbol = ~species,
type = "scatter", mode = "markers")
penguins_scatter3
Great! This is looking much more informative than our initial scatter plot in 3.2.
Now it’s quite clear, for instance, that the majority of larger penguins (both male and female) are of the Gentoo
species, which we couldn’t discern from our initial versions of this scatter plot.
💻 There are 26 different base R symbols you can choose from - these can be specified either by number, or by name. Since we have not specified which symbols to use in 3.8, R has used the first three defaults.
We used colors = ...
to modify our color = ...
specification, and similarly, we can use symbols = ...
to modify our symbol = ...
specification.
Some of the symbol names are quite long, e.g. "filled triangle point-up"
, so it is often easier to use numbers. The table below lists some of the available options:
Table 3.1: Symbol Options
Number | Name
0 | square
1 | circle
2 | triangle point up
3 | plus
4 | cross
5 | diamond
8 | star
Using Table 3.1 and the symbols = ... argument, change the symbols used in the penguins_scatter3 scatter plot created in 3.8.
❓Hint
If you are using symbol names, and your code isn’t working, check the code chunk below.
# Note that just like for the colours argument, if you are using words,
# these need to be surrounded with quotation marks,
# e.g. "square", or 'square' will work, but square will not
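As a hedged example, the sketch below supplies three symbol names, one for each species. The particular symbols and the object name penguins_scatter_symbols are illustrative choices only, so feel free to substitute your own.

```r
library(palmerpenguins)
library(plotly)

# One symbol per species; names must be quoted
penguins_scatter_symbols <- plot_ly(data = penguins,
                                    x = ~body_mass_g, y = ~flipper_length_mm,
                                    color = ~sex, colors = "Set1",
                                    symbol = ~species,
                                    symbols = c("circle", "square", "diamond"),
                                    type = "scatter", mode = "markers")

penguins_scatter_symbols
```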
🎧 Online students
💬 Post your symbol choice(s) in the chat.
💻 As a final touch, you may also like to change the size of the symbols in your scatter plot.
To do so, we can include the marker = ...
argument in our plot_ly
function.
This is a little more complicated to use than our previous arguments, as multiple specifications can be made within this argument. As a result, we use the format marker = list(...)
. Within the list()
function, we can include multiple specifications which all pertain to the marker
argument. To change the size of the symbols, we use the appropriately named size =
argument, within the list()
function.
For example, if we want to change the default marker size (6) to be a little larger, we could include the argument
marker = list(size = 8)
within our plot_ly
function for our penguins_scatter3
scatter plot created in 3.8.
Try this now, observe the changes, and then try increasing and decreasing the marker size.
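For reference, the sketch below shows where the marker = list(size = 8) argument could sit within the plot_ly function. The object name penguins_scatter_sized is an illustrative choice.

```r
library(palmerpenguins)
library(plotly)

# Scatter plot with symbols by species and a slightly larger marker size
penguins_scatter_sized <- plot_ly(data = penguins,
                                  x = ~body_mass_g, y = ~flipper_length_mm,
                                  color = ~sex, colors = "Set1",
                                  symbol = ~species,
                                  marker = list(size = 8),
                                  type = "scatter", mode = "markers")

penguins_scatter_sized
```

Other marker specifications can be added to the same list(), separated by commas.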
We have now covered the basics of creating plotly
histograms and scatter plots, well done!
🏡 Reconvene in main room to discuss results
