Data Science Module

Topic 2B: Data Visualisation II


In this computer lab, we will extend our appreciation of the potential of the plotly (Sievert 2020) R package, and create some informative, interactive data visualisations, using data from the penguins data set. The types of plots we consider in this lab were introduced in Topic 2.

By the end of this lab, you should feel comfortable creating a customized, interactive scatter plot, and be able to combine different plotly graphs together in a single, customized display.


1 Palmer Penguins Data Set

To begin, let’s quickly refresh our memories of the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This data set contains information on 3 species of penguin, who live on different islands in the Palmer archipelago, off the coast of Antarctica. For more details, you can refer to Section 2 of the Data Visualisation in R supplement.

Note: If you do not have the palmerpenguins package downloaded, just click on the Code box below, and run the code that appears:

install.packages("palmerpenguins")

Open up RStudio, and run the following code to load the palmerpenguins package and summarise the penguins data.

Note that the package is called palmerpenguins, but once this is loaded, the actual data to access in R is stored in the object penguins.

# This code loads the `palmerpenguins` package into your current R working environment.
library(palmerpenguins)
# This code summarises the `penguins` data set from the `palmerpenguins` package.
summary(penguins)

Don’t worry too much about the values shown in the summary table - the main things to note at this stage are the different variables, namely species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.
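If you would like a closer look at the variables before plotting, the short sketch below shows a few standard base R ways to inspect the data. It is only an optional illustration and assumes the palmerpenguins package is already installed; the object name na_counts is our own choice.

```r
library(palmerpenguins)

# Show the type of each variable (factor, numeric or integer).
str(penguins)

# Show the first six rows of the data.
head(penguins)

# Count the missing values (NA) in each variable.
na_counts <- colSums(is.na(penguins))
na_counts
```

Knowing which variables contain missing values is handy later on, since penguins with a missing measurement cannot be drawn as points in a scatter plot.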

2 Plotly Scatter Plots

From our summary table, we can see that the measurement variables for the penguins include body_mass_g and flipper_length_mm. It seems reasonable to think that penguins with larger body masses might also have longer flippers.

To visualise this, and check our assumption, we could use a scatter plot. To create this scatter plot, we will use the plotly package, which offers several benefits over using the default plotting options in R. For more details on plotly, you can refer to Section 3 of the Data Visualisation in R supplement.

2.1

First, let’s load the plotly package. Using our process in 1 as a guide, load the plotly package in R. (If you do not have the plotly package downloaded, make sure to do this now too).

Hint: Check the code chunk below if you are not sure how to proceed.

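# Install the plotly package (this only needs to be done once).
install.packages("plotly")
# Load the plotly package into your current R working environment.
library(plotly)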
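Since a package only needs to be installed once, you may prefer a version of this step that skips the installation when plotly is already available. The sketch below is one common way to do this; it is a suggestion rather than a required part of the lab.

```r
# Install plotly only if it is not already available, then load it.
if (!requireNamespace("plotly", quietly = TRUE)) {
  install.packages("plotly")
}
library(plotly)
```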

2.2

We can create a plotly plot using the function plot_ly(). Let’s take a look at the typical composition of a plotly plot:

plot_name <- plot_ly(data = ..., x = ~ ..., y = ~ ...)

Let’s break this down.

  • Firstly, (using the assignment operator <-) we assign a name to our plot - here we have chosen the generic plot_name.
  • Next, within plot_ly(), we specify the main arguments of the function.
  • The data = ... part tells R what data we are analysing.
  • The x = ~ ... part tells R which variable in our data set to plot on the x-axis of our plot.
  • The y = ~ ... part tells R which variable in our data set to plot on the y-axis of our plot.

Note that we simply replace the ...s with whatever data we are using.

The code below will create a simple scatter plot of flipper_length_mm versus body_mass_g. Make sure to inspect this code, and check that you understand each component. Since we have specified our data set is penguins, we don’t then need to do this when specifying our x and y inputs - we can simply specify any of the variables contained within this data set.

penguins_scatter <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm)
penguins_scatter

Note that once we assign the plot to the object penguins_scatter, we then have to run this object in a subsequent line, in order for the plot to be rendered.

2.3

One key benefit of plotly graphs compared to base R graphs is that the plotly graphs are interactive!

Notice that if you hover over the data in the scatter plot, you can see the specific coordinates of each point. If you left-click and drag your cursor over a section to create a box, you can also zoom in on a particular section of the plot. Just double left-click to zoom back out.
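The hover labels can also be customised. The sketch below is an optional illustration, not part of the lab tasks: it uses the text and hoverinfo arguments of plot_ly() to show the species and island of each penguin alongside its coordinates. The object name penguins_hover is our own choice, and the type and mode arguments are included only so that the plot type is stated explicitly.

```r
library(palmerpenguins)
library(plotly)

# Build a custom hover label for each penguin from the species and island columns.
penguins_hover <- plot_ly(
  data = penguins,
  x = ~body_mass_g, y = ~flipper_length_mm,
  type = "scatter", mode = "markers",
  text = ~paste("Species:", species, "<br>Island:", island),
  hoverinfo = "x+y+text"  # show the coordinates together with the custom text
)
penguins_hover
```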

2.4

As we suspected, it seems quite clear that as the body mass of penguins increases, so too does their flipper length.
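If you would like a number to accompany this visual impression, one option is the correlation coefficient between the two variables. The sketch below uses the base R function cor(); the object name mass_flipper_cor is our own choice, and use = "complete.obs" is needed because a few penguins have missing measurements.

```r
library(palmerpenguins)

# Pearson correlation between body mass and flipper length.
# use = "complete.obs" drops the penguins with missing measurements.
mass_flipper_cor <- cor(penguins$body_mass_g, penguins$flipper_length_mm,
                        use = "complete.obs")
mass_flipper_cor
```

A value close to 1 indicates a strong positive linear relationship, which is what the scatter plot suggests.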

But our graph is quite basic at the moment - we can do better.

Another great aspect of plotly graphs is that it is very easy to include a third variable which can help to further distinguish the data plotted on the x and y axes. We can do so by adding the argument color = ~... within plot_ly().

Perhaps the body_mass_g and flipper_length_mm of the penguins are also related to their sex? Let’s take a look at how our scatter plot changes, if we distinguish between male and female penguins.

penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex)
penguins_scatter2

That’s looking a bit nicer! Now we can see that most of the smaller penguins are female, and most of the larger penguins are male. If you hover over the data, you’ll notice that the sex is now shown alongside the coordinates of each data point.

We also have a helpful legend in the top right. This is not only useful as a guide - try clicking on one of the labels in the legend.

2.5

While our scatter plot is looking better, the default colours chosen to distinguish between male and female penguins are quite similar. Perhaps we would like more contrast?

To specify the set of colours to use for the plot, we can add the additional argument colors = ... to our plot_ly function. This argument accepts any valid R colour codes. Take a look at this pdf for an overview of different colours we can use in R.

Complete and then run the code below to change the colours you use in your scatter plot.

penguins_scatter_colours <- plot_ly(data = penguins, 
                                    x = ~body_mass_g, y = ~flipper_length_mm, 
                                    color = ~sex, colors = ...)
penguins_scatter_colours

Hint: You will need a combination of two colours. Check the Code box below if you are stuck.

# If you are specifying individual colours, you will need to use the layout
# colors = c("...", "...")

2.6

If you do not want to spend too much time customising the colours used in your plots, there are pre-existing sets of colours you can use. Try setting your colors =... argument in 2.5 sequentially to colors = "Set1", then to colors = "Set2" and finally to colors = "Set3". Do any particular sets appeal to you?
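The names "Set1", "Set2" and "Set3" are ColorBrewer palettes. If you are curious about what these palettes contain, the sketch below uses the RColorBrewer package to look them up. We are assuming here that RColorBrewer is installed on your machine (it is normally installed together with plotly); the object name set2_colours is our own choice.

```r
library(RColorBrewer)

# Properties of the three palettes used above:
# maximum number of colours, palette category, and colour-blind friendliness.
brewer.pal.info[c("Set1", "Set2", "Set3"), ]

# The hex colour codes of the first three colours in "Set2".
set2_colours <- brewer.pal(n = 3, name = "Set2")
set2_colours
```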

2.7

There are many different display options for plot_ly graphics, and if you try running the R commands

penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1")
penguins_scatter2

you may see some red Warning messages appear in the R Console. Often, you don’t have to worry about these, but if you would like to minimise them, you can add the following arguments to your plot_ly function.

penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1",
                             type = "scatter", mode = "markers")
penguins_scatter2

Here, we have included the additional arguments type = ... and mode = ....

  • We set type = "scatter" to ensure our data is plotted as a scatter plot.
  • We set mode = "markers" to ensure that each of our data points is plotted individually.

These additional arguments are often helpful, as sometimes we like to have a little more control over how our data is presented.

You’ll notice however that if these commands are omitted from your function, R will just work out what it thinks is the optimal presentation format (hence the warning messages informing us which options R has selected, since some details haven’t been user-specified).

This is often for the best - try changing the mode = "markers" section of code to mode ="lines" and then re-running the plot. What happens?

2.8

So far, we have treated all the penguins as one large group, differentiated by sex. However, we actually have data for three separate species of penguin - Adelie, Chinstrap, and Gentoo.

We have already used different colours to differentiate the male and female penguins, but so far all the data points are the same symbol - a dot. We can use the additional argument symbol = ... within our plot_ly function to further improve our graph, and distinguish between the different species of penguin.

Take a look at the R code and resultant graph below:

penguins_scatter3 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1", symbol = ~species, 
                             type = "scatter", mode = "markers")
penguins_scatter3

Great! This is looking much more informative than our initial scatter plot. Now it’s quite clear for instance that the majority of larger penguins (both male and female) are of the Gentoo species, which we couldn’t discern from our previous version of the scatter plot.

2.9

Just as R has many colour options available, so too are there many symbol options available. Since we have not specified which symbols to use in 2.8, R has used the first three defaults.

We used colors = ... to modify our color = ... specification, and similarly, we can use symbols = ... to modify our symbol = ... specification.

There are 26 different base R symbols you can choose from - these can be specified either by number, or by name. Some of the names are quite long, e.g. "filled triangle point-up", so it is often easier to use numbers. However, some names are easy to remember - take a look at the table below.

Table 2.1: Symbol Options
Number  Name
0       square
1       circle
2       triangle point up
3       plus
4       cross
5       diamond
8       star

Using Table 2.1 and the symbols = ... argument, change the symbols used in the penguins_scatter3 scatter plot created in 2.8.

Hint: If you are using symbol names, and your code isn’t working, check the code chunk below.

# Note that just like for the colours argument, if you are using words, 
# these need to be surrounded with quotation marks,
# e.g. "square", or 'square' will work, but square will not
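To see the general pattern without giving away the task above, the sketch below applies the same idea to a different variable: the symbols distinguish the three islands rather than the species, and are chosen by name. The object name penguins_symbols_demo is our own choice, and this is just one possible combination of symbols.

```r
library(palmerpenguins)
library(plotly)

# Map the symbol to island (rather than species), and pick the symbols by name.
penguins_symbols_demo <- plot_ly(
  data = penguins,
  x = ~body_mass_g, y = ~flipper_length_mm,
  symbol = ~island, symbols = c("circle", "square", "diamond"),
  type = "scatter", mode = "markers"
)
penguins_symbols_demo
```

One symbol is supplied per island, in the same way that one colour is supplied per group in the colors = ... argument.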

2.10

As a final touch, you may also like to change the size of the symbols in your scatter plot. To do so, we can include the marker = ... argument in our plot_ly function.

This is a little more complicated to use than our previous arguments, as multiple specifications can be made within this argument. As a result, we use the format marker = list(...). Within the list() function, we can include multiple specifications which all pertain to the marker argument.

To change the size of the symbols, we use the appropriately named size = argument, within the list() function. As a result, if we want to change the default marker size (6) to be a little larger, we could include the argument marker = list(size = 8) within our plot_ly function for our penguins_scatter3 scatter plot created in 2.8.

Try this now, observe the changes, and then try increasing and decreasing the marker size.
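As an illustration of how several specifications can sit inside the same list(), the sketch below sets both the size and the opacity of the markers. It uses a simpler plot than penguins_scatter3, coloured by species, and the object name penguins_marker_demo and the particular values are our own choices.

```r
library(palmerpenguins)
library(plotly)

# Several marker settings can be supplied together inside list().
penguins_marker_demo <- plot_ly(
  data = penguins,
  x = ~body_mass_g, y = ~flipper_length_mm,
  color = ~species, colors = "Set1",
  type = "scatter", mode = "markers",
  marker = list(size = 8,       # slightly larger than the default size of 6
                opacity = 0.7)  # slightly transparent, so overlapping points stay visible
)
penguins_marker_demo
```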

3 Creating your own Plotly Scatter Plot

Now that we have covered the basics of creating plotly scatter plots in 2, it’s time for you to create your own using the penguins data set.

3.1

To begin, try creating a simple plotly scatter plot of bill_length_mm versus body_mass_g.

If you’re not sure that you’re on the right track, refer back to 2.2, and/or check the code below:

penguins_scatter_new <- plot_ly(data = penguins, x = ~body_mass_g, y = ~...,
                                type = "scatter", mode = "markers")
penguins_scatter_new

3.2

Once you are happy with your initial scatter plot, try using the color = ~ argument to differentiate the data in your plot by island. Do you notice any patterns?

You can refer back to 2.4 if you are not sure how to proceed.

3.3

Next, use the symbol = ~ argument to show different symbols for each species in your plot.

3.4

To finish off your plot, change the symbols in your plot, and increase the marker size of your symbols slightly.

3.5

Does it seem like penguins living on different islands have noticeably different body_mass_g or bill_length_mm measurements?

4 Mixed Subplots

Recall from our first Data Science Computer Lab how we created some histograms for our palmerpenguins data set. Some of the code used for that lab is reproduced below:

penguin_hist <- plot_ly(data = penguins, x = ~body_mass_g, color = ~island, type = "histogram", alpha = 0.6)

penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'), barmode ="overlay")
penguin_hist

Suppose that we would like to present all our palmerpenguins data visualisations together. We can do this using the subplot function.

4.1

Take a look at the R code below:

penguin_combined_plots <- subplot(penguins_scatter3, penguin_hist, 
                                  nrows = 2, margin = 0.05) 
penguin_combined_plots <- penguin_combined_plots %>% 
                          layout(title = "Palmer Penguin Data",
                                 xaxis = list(title = 'body_mass_g'), 
                                 yaxis = list(title = "flipper_length_mm"),
                                 xaxis2 = list(title = 'body_mass_g'), 
                                 yaxis2 = list(title = "count"))

Note that here:

  • We are using the subplot command to plot the penguins_scatter3 and penguin_hist plots together.
  • The nrows = 2 argument tells R to produce these plots in 2 rows.
  • The margin = 0.05 argument tells R to leave a small margin between the two plots.
  • The subsequent lines of code are used to add a title to our selection of plots, and add axis labels to the plots - note that we use xaxis to define the x-axis label for the first plot, and xaxis2 to define the x-axis label for the second plot (and similarly for the y-axes).

When we now run this object penguin_combined_plots, we obtain the following:

penguin_combined_plots

Note that the two plots are still wholly interactive. The legends have been combined, and can be used to filter the individual plots.

While we have only combined two plots here, the subplot function can be used to present several plots together, which can be particularly informative when you would like to display multiple aspects of your data simultaneously. The only major downside of presenting plots together using subplot is that their axis labels are removed by default, and must be respecified, as above.
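As an alternative to respecifying the labels by hand, the subplot function has titleX and titleY arguments which ask it to keep the axis titles of the individual plots. The sketch below is one way this could look; the object names ending in _demo are our own, and the two plots are simplified versions of those used in this lab.

```r
library(palmerpenguins)
library(plotly)

# A simple scatter plot and histogram to combine.
scatter_demo <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm,
                        color = ~island, type = "scatter", mode = "markers")
hist_demo <- plot_ly(data = penguins, x = ~body_mass_g, color = ~island,
                     type = "histogram", alpha = 0.6)
hist_demo <- hist_demo %>% layout(yaxis = list(title = "count"), barmode = "overlay")

# titleX = TRUE and titleY = TRUE keep the axis titles of each individual plot.
combined_demo <- subplot(scatter_demo, hist_demo, nrows = 2, margin = 0.05,
                         titleX = TRUE, titleY = TRUE)
combined_demo
```

Respecifying the labels with layout(), as in the main example, gives you more control over the wording, so either approach can be reasonable depending on how much customisation you need.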

4.2

Using the information from 4.1, try to combine the scatter plot you produced in 3 with the histogram shown above at the start of 4.

Hint: You don’t need to write any code for the histogram, you can simply use the R code shown at the start of 4.


Great job, that’s everything for today!

Hopefully you now feel confident creating plotly scatter plots. Don’t worry if some of the code seems difficult at the moment - we are only at the second lab, and we will have plenty of time to practice and improve as the semester progresses.

Before you finish up, make sure to save your script file somewhere safe - it might come in handy later on.


References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B. Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

