Data Science Stream

Topic 2B: Data Visualisation I


Welcome to the second computer lab for the Data Science stream of STM1001.

In this computer lab we will create some informative, interactive data visualisations, using the penguins data set from the palmerpenguins package (Horst, Hill, and Gorman 2020), and a new package, plotly (Sievert 2020).

The types of plots we consider in this lab were introduced in Topic 1 and Topic 2.

By the end of this lab, you should feel comfortable creating interactive histograms and scatter plots in RStudio.


🎧 Reminder: Online students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:

Prompts for you

💬 Write your answer in the chat.

Modes at different times during the lab

🏡 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion

💡 Breakout rooms. Person with birthday closest to (your computer lab demonstrator will pick a random date) shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.

💻 Focus mode. You will still be in the main room, but working independently. All students will be sharing screen during this time so that your computer lab demonstrator (but not other students) can see your screen.


🏫 Reminder: Face-to-face (blended) students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.


1 Palmer Penguins Data Set

🏡 Let’s quickly refresh our memories of the penguins data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020). This data set contains information on three species of penguin, which live on different islands in the Palmer Archipelago, off the coast of Antarctica. For more details, you can refer to Section 2 of the Data Visualisation in R supplement.

1.1

🏡 To begin, make sure you have the palmerpenguins R package installed in RStudio.

Note: If you do not have the palmerpenguins package downloaded, just click on the Code box below, and run the code that appears:

install.packages("palmerpenguins")

1.2

🏡 Run the following code to load the palmerpenguins R package, and to summarise the penguins data set.

Note: The package is called palmerpenguins, but once this is loaded, the actual data to access in R is stored in the object penguins.

# Load the `palmerpenguins` package into your current R working environment.
library(palmerpenguins)
# Summarise the `penguins` data in the `palmerpenguins` package.
summary(penguins)

We don’t need to spend much time assessing this summary - the main things to note at this stage are the different variables, namely species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.
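If you would like to see just the variable names and their types, without the full summary, the following is a minimal sketch using the base R functions names() and str(). Neither function is used elsewhere in this lab, and the object name penguin_variables is only illustrative.

```r
library(palmerpenguins)

# Store and display the column (variable) names of the penguins data set.
penguin_variables <- names(penguins)
penguin_variables

# Show each variable's type alongside its first few values.
str(penguins)
```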

2 Creating Interactive Histograms in RStudio

💻 Suppose that we are interested in the distribution of recorded body masses (in grams) of the penguins living in the Palmer Archipelago. To visualise this distribution using the penguins data, we can produce a histogram.

Note: Refer to Section 3.2 of Topic 1 for details on histograms.

Recall that we can create a simple histogram using the built-in R function hist:

hist(penguins$body_mass_g, breaks = 19)

This histogram is static, meaning that we cannot interact with the image, and we cannot manipulate it in real time to display different details - perhaps, for example, we would like to see the distribution of the penguins’ body_mass_g values, but only for the penguins on a specific island.

To achieve these objectives using just the hist function would take some extensive coding.
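To get a feel for the extra work involved, here is a rough sketch of a static histogram for a single island using only base R. The object name dream_penguins and the title and label text are illustrative choices, not part of the lab materials.

```r
library(palmerpenguins)

# Keep only the penguins recorded on Dream island.
dream_penguins <- subset(penguins, island == "Dream")

# Draw a static histogram for that island alone. Repeating this for every
# island, and then combining the results, is where the extra coding comes in.
hist(dream_penguins$body_mass_g,
     main = "Body mass of penguins on Dream island",
     xlab = "Body mass (g)")
```

Even then, the result is still a fixed image that we cannot hover over or zoom into.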

2.1

💻 As an alternative to the built-in hist function, we could use the plot_ly function from the plotly package to create an interactive, responsive histogram. Let’s take a look at how to do this now.

To begin, just as for the previous packages, we will need to install and load the plotly package in RStudio, before we can use any plotly functions.

Run the code below to install and load the plotly package.

install.packages("plotly")
library(plotly)
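A package only needs to be installed once per computer, but it must be loaded with library() in every new R session. If you expect to re-run your script several times, one possible pattern (a sketch, not a requirement of this lab) is to install plotly only when it is missing:

```r
# Install plotly only if it is not already available on this computer.
if (!requireNamespace("plotly", quietly = TRUE)) {
  install.packages("plotly")
}

# Load plotly into the current R session.
library(plotly)
```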

2.2

💻 To create plotly plots, we use the function plot_ly().

Run the code below to create and store an interactive histogram of the penguins’ body_mass_g values in the object penguin_hist_base.

At this point, don’t worry about the composition of this function - we’ll cover this in more detail shortly.

  • For the moment, take a look at the code below, and see if you can get a general idea of what’s going on.

penguin_hist_base <- plot_ly(data = penguins, 
                             x = ~body_mass_g, 
                             type = "histogram")

penguin_hist_base <- penguin_hist_base %>% layout(yaxis = list(title = 'count'))


Note

Once you have taken some time to consider the code above, if you would like more details or would like to check the accuracy of your interpretation, click the Show button below for a brief explanation.

# Here, we are creating a plotly object called "penguin_hist_base"
penguin_hist_base <- plot_ly(data = penguins, # We are using the penguins data
                             x = ~body_mass_g, # and plotting the body_mass_g data
                             type = "histogram") # in a histogram format

# The code below is used to modify the layout of the histogram
# to include a label for the y-axis
penguin_hist_base <- penguin_hist_base %>% layout(yaxis = list(title = 'count'))


2.3

💻 The plotly histogram has now been created, but won’t appear until we call the object in which it was stored. If you run the code below, your plotly histogram (as shown below) should appear in the Viewer section of RStudio.

penguin_hist_base
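The layout() function can set more than the y-axis label. As a small sketch, the code below also adds an x-axis label and an overall title. The object name penguin_hist_labelled and the label text are illustrative assumptions, so feel free to change them.

```r
library(palmerpenguins)
library(plotly)

# Create the same basic histogram as before, stored under a new name.
penguin_hist_labelled <- plot_ly(data = penguins,
                                 x = ~body_mass_g,
                                 type = "histogram")

# Label both axes and add an overall title.
penguin_hist_labelled <- penguin_hist_labelled %>%
  layout(title = "Penguin body mass",
         xaxis = list(title = "Body mass (g)"),
         yaxis = list(title = "count"))

# Call the object to display the plot.
penguin_hist_labelled
```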

2.4

💻 Unlike graphs created using base R functions, plotly graphs are interactive - even when embedded in web pages like this one!

Try the following:

  • If you hover over the histogram in 2.3, you can see the specific details of the data.
  • If you left-click and drag your cursor over a section of the histogram to create a box, you can also zoom in on a particular section of the plot. Just double left-click to zoom back out.

2.5

💻 Perhaps you are not impressed with plotly yet. After all, our histogram in 2.3 doesn’t look that different to the hist function version we created at the start of Section 2, so what is all the fuss about?

Well, it is very easy to modify our plot_ly histogram to show extra detail. For example, we can easily produce separate histograms for the penguins on each island. Take a look at the R code below, which builds upon what we used in penguin_hist_base.

penguin_hist <- plot_ly(data = penguins, 
                        x = ~body_mass_g, 
                        color = ~island, 
                        type = "histogram", alpha = 0.6)

penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'), 
                                        barmode = "overlay")

Before you move on to the next question, run this code in RStudio.


Note

Once you have taken some time to consider the code above, if you would like more details or would like to check the accuracy of your interpretation, click the Show button below for a brief explanation.

# Here, we are creating a plotly object called "penguin_hist"
penguin_hist <- plot_ly(data = penguins, # We are using the penguins data
                        x = ~body_mass_g, # and plotting the body_mass_g data
                        color = ~island, type = "histogram", alpha = 0.6)
# We are producing a histogram for this data, with bars coloured differently, 
# depending on the island on which the penguin is located

# The code below is used to modify the layout of the histogram
# This includes adding a label to the y-axis
# and setting the histograms to be layered over each other
# (hence the alpha = 0.6 above to change the opacity)
penguin_hist <- penguin_hist %>% layout(yaxis = list(title = 'count'), 
                                        barmode = "overlay")


2.6

💻 To produce this updated plotly histogram, run the R code below. Your new histogram (as shown below) should appear in the Viewer section of RStudio.

penguin_hist

This is looking better than our previous histogram! Because we have told our plot_ly function to assign different colours to the different islands, we now have three histograms, rather than one with all the data clumped together.

Even better, these are all presented within the one graph, which also includes a handy legend. Hopefully you are now beginning to appreciate the additional functionality offered by plotly over built-in R functions.

Note: For more details on plotly, you can refer to Section 3 of the Data Visualisation in R supplement.

2.7

💻 Finally, and perhaps most importantly for this example, we can dynamically filter the results shown in plotly graphs. Here, we can filter observations to focus on data from a specific island. Simply click on one of the entries in the legend in the top right of our histogram in 2.6 to temporarily hide that data (note that the axes dynamically adjust too).

Try focusing just on the Dream island penguins.

Hint: To bring the hidden data back, simply click once more on the relevant entry in the legend.
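Clicking on the legend only hides data temporarily. If you wanted a histogram that always shows a single island, one option is to filter the data in code before plotting. The sketch below does this with the base R subset() function; the object names dream_penguins and dream_hist are illustrative.

```r
library(palmerpenguins)
library(plotly)

# Keep only the Dream island penguins before plotting.
dream_penguins <- subset(penguins, island == "Dream")

# Create an interactive histogram from the filtered data.
dream_hist <- plot_ly(data = dream_penguins,
                      x = ~body_mass_g,
                      type = "histogram")

dream_hist <- dream_hist %>% layout(yaxis = list(title = 'count'))
dream_hist
```

The trade-off is that the other islands can no longer be brought back with a click, so the legend approach is usually more convenient for exploring.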


🎧 Online students 💬 Leave a comment in the chat about your favourite aspect of the interactive plotly graphs so far.


🏡 Reconvene in main room to discuss results


3 Creating Interactive Scatter Plots in RStudio

💻 From our summary table in 1.2, we can see that the measurement variables in the penguins data set include body_mass_g and flipper_length_mm.

It seems reasonable to assume that penguins with larger body masses might also have longer flippers.

To visualise the observations for these variables, and check our assumption, we could use a scatter plot. To create this scatter plot, we will again use the plotly package, as it offers several benefits over using the default plotting options in R.

Note: Refer to Section 5.1 of Topic 2 for details on scatter plots.

3.1

💻 To create an interactive plotly plot we need to use the function plot_ly(). In 2.2, we brushed over the details of the plot_ly() function, so let’s remedy that now.

The typical composition of a simple plotly plot looks like this:

plot_name <- plot_ly(data = ..., x = ~ ..., y = ~ ...)

Let’s break this down:

  • Firstly, (using the assignment operator <-) we assign a name to our plot - here we have chosen the generic plot_name.
  • Next, within plot_ly(), we specify the main arguments of the function.
  • The data = ... part tells R what data we are analysing.
  • The x = ~ ... part tells R which variable in our data set to plot on the x-axis of our plot.
  • The y = ~ ... part tells R which variable in our data set to plot on the y-axis of our plot.

Note: We simply replace the ...s with whatever data we are using.

3.2

💻 Run the code below to create a simple scatter plot of the recorded flipper_length_mm versus body_mass_g values in the penguins data set (as shown below). Make sure to inspect this code, and check that you understand each component.

Note that since we have specified our data set is penguins in the code, we don’t then need to do this when specifying our x and y inputs - we can simply specify any of the variables contained within this data set.

penguins_scatter <- plot_ly(data = penguins, 
                            x = ~body_mass_g, y = ~flipper_length_mm)
penguins_scatter

Note: Once we assign the plot to the object penguins_scatter, we then have to run this object in a subsequent line, in order for the plot to be rendered.
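To see what the data = ... argument and the ~ symbol are saving us, compare the code above with the sketch below, which leaves out the data argument and instead spells out each variable in full using the $ operator. The object name penguins_scatter_explicit is illustrative.

```r
library(palmerpenguins)
library(plotly)

# Without a data argument, each variable must be spelled out in full.
penguins_scatter_explicit <- plot_ly(x = penguins$body_mass_g,
                                     y = penguins$flipper_length_mm)
penguins_scatter_explicit
```

This should plot the same points, although the axis labels may differ. The version with data = penguins is shorter and easier to read, which is why we use it throughout this lab.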

3.3

💻 As we suspected, it seems quite clear from the scatter plot above in 3.2 that, in general, as the body mass of penguins increases so too does their flipper length.

But our graph is quite basic at the moment - we can do better.

It is very easy to include a third variable in a plotly graph, which can help to further distinguish the data plotted on the x and y axes. We can do so by adding the argument color = ~... within plot_ly().

Perhaps the body_mass_g and flipper_length_mm of the penguins are also related to their sex? Let’s take a look at how our scatter plot changes, if we distinguish between male and female penguins.

penguins_scatter2 <- plot_ly(data = penguins, 
                             x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex)
penguins_scatter2

That’s looking a bit nicer! Now we can see that the smallest penguins tend to be female, and the largest penguins tend to be male. If you hover over the data, you’ll notice that the sex is now shown alongside the coordinates of each data point.

We also have a helpful legend in the top right. Remember, this is not only useful as a guide - try clicking on one of the labels in the legend.
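The penguins data includes a small number of penguins whose sex was not recorded (these appear as NA's in the summary from 1.2). Depending on your version of plotly, these penguins may be handled differently in the plot, or may trigger a warning. If you would prefer to leave them out, one possible approach is sketched below, where the object names penguins_sexed and penguins_scatter_sexed are illustrative.

```r
library(palmerpenguins)
library(plotly)

# Keep only the penguins whose sex was recorded.
penguins_sexed <- subset(penguins, !is.na(sex))

# Re-create the coloured scatter plot using the filtered data.
penguins_scatter_sexed <- plot_ly(data = penguins_sexed,
                                  x = ~body_mass_g, y = ~flipper_length_mm,
                                  color = ~sex)
penguins_scatter_sexed
```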

3.4

💻 While our scatter plot is looking better, the default colours chosen to distinguish between male and female penguins are quite similar. Perhaps we would like more contrast?

To specify the set of colours to use for the plot, we can add the additional argument colors = ... to our plot_ly function. This argument accepts any valid R colour codes. Take a look at this pdf for an overview of different colours we can use, or simply try a few basic colours like red, green etc.

Complete and then run the code below to change the colours you use in your scatter plot.

penguins_scatter_colours <- plot_ly(data = penguins, 
                                    x = ~body_mass_g, y = ~flipper_length_mm, 
                                    color = ~sex, colors = ...)
penguins_scatter_colours


Hint

You will need a combination of two colours. Check the Show box below if you are stuck.

# If you are specifying individual colours, you will need to use the format 
colors = c("...", "...")
# within the plot_ly() function
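As one worked example of this format, the sketch below uses the built-in R colour names "darkorange" and "purple". These two colours are an arbitrary choice for illustration, as is the object name penguins_scatter_two_colours, so do try your own combination.

```r
library(palmerpenguins)
library(plotly)

# Supply one colour per group, collected together using c().
penguins_scatter_two_colours <- plot_ly(data = penguins,
                                        x = ~body_mass_g, y = ~flipper_length_mm,
                                        color = ~sex,
                                        colors = c("darkorange", "purple"))
penguins_scatter_two_colours
```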


3.5

💻 If you do not want to spend too much time customising the colours used in your plots, there are pre-existing sets of colours you can use. Try setting your colors =... argument in 3.4 sequentially to colors = "Set1", then to colors = "Set2" and finally to colors = "Set3". Do any particular sets appeal to you?


🎧 Online students 💬 Post your colour choice(s) in the chat.


3.6

💻 There are many different display options for plot_ly graphics. If you try running the R commands below, you may see some red Warning messages appear in the RStudio Console section. As discussed in Computer Lab 1B, while it is important to read them, often you don’t have to worry about these.

penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1")
penguins_scatter2

In this instance, if you would like to minimise warning messages, you can add the arguments type = "scatter", mode = "markers" to your plot_ly function, so that your code now looks like this:

penguins_scatter2 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1",
                             type = "scatter", mode = "markers")
penguins_scatter2

Here:

  • We set type = "scatter" to ensure our data is plotted as a scatter plot.
  • We set mode = "markers" to ensure that each of our data points is plotted individually.

These additional arguments are often helpful, as sometimes we like to have a little more control over how our data is presented.

You’ll notice however that if these arguments are omitted from your function, plotly will just work out what it thinks is the optimal presentation format (hence the warning messages informing us which options have been selected, since some details haven’t been user-specified).

This automatic selection is often for the best - try changing the mode = "markers" section of code above to mode = "lines" and then re-running the code chunk. What happens?

3.7

💻 So far, we have treated all the penguins as one large group, differentiated by sex.

When we hover over a point in our scatter plot (representing a penguin), we see the flipper length, body mass, and sex details for that penguin. This is great, but we are missing one important piece of information - the species of penguin! Remember, we actually have data for three separate species of penguin - Adelie, Chinstrap, and Gentoo.

Fortunately, it is straightforward to add this information to the hover text of our plot. We can do this by including the argument text = ~species in our code, in a similar way to how we have used color = ~sex to colour the points.

Update your penguins_scatter2 plot with this text = ~species addition now, and hover over some points to check that your code has worked as intended.
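Once you have the basic version working, you might like to make the hover text a little more descriptive. The sketch below is one possible variation, which uses the base R paste() function to add a short label in front of each species name. The object name penguins_scatter_hover and the label wording are illustrative.

```r
library(palmerpenguins)
library(plotly)

# Build the hover text by joining a fixed label to each penguin's species.
penguins_scatter_hover <- plot_ly(data = penguins,
                                  x = ~body_mass_g, y = ~flipper_length_mm,
                                  color = ~sex, colors = "Set1",
                                  text = ~paste("Species:", species),
                                  type = "scatter", mode = "markers")
penguins_scatter_hover
```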

3.8

💻 We have already used different colours to differentiate the male and female penguins, but all the data points are the same symbol - a dot.

Instead of including the argument text = ~species, we can use the additional argument symbol = ... within our plot_ly function to distinguish between the different species of penguin.

Take a look at the R code and resultant graph below:

penguins_scatter3 <- plot_ly(data = penguins, x = ~body_mass_g, y = ~flipper_length_mm, 
                             color = ~sex, colors = "Set1", symbol = ~species, 
                             type = "scatter", mode = "markers")
penguins_scatter3

Great! This is looking much more informative than our initial scatter plot in 3.2.

Now it’s quite clear, for instance, that the majority of larger penguins (both male and female) are of the Gentoo species, which we couldn’t discern from our initial versions of this scatter plot.

3.9

💻 There are 26 different base R symbols you can choose from - these can be specified either by number, or by name. Since we did not specify which symbols to use in 3.8, the first three defaults have been used.

We used colors = ... to modify our color = ... specification, and similarly, we can use symbols = ... to modify our symbol = ... specification.

Some of the symbol names are quite long, e.g. "filled triangle point-up", so it is often easier to use numbers. The table below lists some of the available options:

Table 3.1: Symbol Options

Number  Name
0       square
1       circle
2       triangle point up
3       plus
4       cross
5       diamond
8       star

Using Table 3.1 and the symbols = ... argument, change the symbols used in the penguins_scatter3 scatter plot created in 3.8.


Hint

If you are using symbol names, and your code isn’t working, check the code chunk below.

# Note that just like for the colours argument, if you are using words, 
# these need to be surrounded with quotation marks,
# e.g. "square", or 'square' will work, but square will not
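As a sketch of how this can look when using names, the code below supplies three symbol names, one for each species. The particular symbols chosen here, and the object name penguins_scatter_symbols, are illustrative only.

```r
library(palmerpenguins)
library(plotly)

# Supply one symbol name per species, collected together using c().
penguins_scatter_symbols <- plot_ly(data = penguins,
                                    x = ~body_mass_g, y = ~flipper_length_mm,
                                    color = ~sex, colors = "Set1",
                                    symbol = ~species,
                                    symbols = c("circle", "square", "diamond"),
                                    type = "scatter", mode = "markers")
penguins_scatter_symbols
```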


🎧 Online students 💬 Post your symbol choice(s) in the chat.


3.10

💻 As a final touch, you may also like to change the size of the symbols in your scatter plot. To do so, we can include the marker = ... argument in our plot_ly function.

This is a little more complicated to use than our previous arguments, as multiple specifications can be made within this argument. As a result, we use the format marker = list(...). Within the list() function, we can include multiple specifications which all pertain to the marker argument. To change the size of the symbols, we use the appropriately named size = argument, within the list() function.

For example, if we want to change the default marker size (6) to be a little larger, we could include the argument marker = list(size = 8) within our plot_ly function for our penguins_scatter3 scatter plot created in 3.8.

Try this now, observe the changes, and then try increasing and decreasing the marker size.
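If you are unsure where the marker = list(...) argument should go, the sketch below shows one possible placement, based on the code from 3.8. The object name penguins_scatter_large is illustrative.

```r
library(palmerpenguins)
library(plotly)

# The marker argument sits alongside the other plot_ly() arguments,
# with the size specification placed inside list().
penguins_scatter_large <- plot_ly(data = penguins,
                                  x = ~body_mass_g, y = ~flipper_length_mm,
                                  color = ~sex, colors = "Set1",
                                  symbol = ~species,
                                  type = "scatter", mode = "markers",
                                  marker = list(size = 8))
penguins_scatter_large
```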


We have now covered the basics of creating plotly histograms and scatter plots, well done!


🏡 Reconvene in main room to discuss results


4 Extension: Creating your own plotly Scatter Plot

💻 To finish up, let’s try creating your own plotly scatter plot.

4.1

💻 To begin, create a simple plotly scatter plot of bill_length_mm versus body_mass_g, using the penguins data set.


Hint

If you are not sure that you are on the right track, refer back to 3.1, and/or check the code below:

# This partially complete code should help
penguins_scatter_new <- plot_ly(data = penguins, 
                                x = ~body_mass_g, y = ~...,
                                type = "scatter", mode = "markers")
penguins_scatter_new

4.2

💻 Once you are happy with your initial scatter plot, try using the color = ~ argument to differentiate the data in your plot by island. Do you notice any patterns?

Hint: You can refer back to 3.3 if you are not sure how to proceed.

4.3

💻 Next, use the symbol = ~ argument to show different symbols for each species in your plot.

4.4

💻 To finish off your plot, change the symbols used in your plot, and increase the marker size of your symbols slightly.


🎧 Online students 💬 If you have been able to reach this point before the end of the lab, take a snippet/screenshot of your plot and copy-paste it into the chat.


4.5

💻 By inspecting your plot, does it seem like penguins living on different islands have noticeably different body_mass_g or bill_length_mm measurements?


Great job, that’s everything for today!

Hopefully you now feel confident creating interactive histograms and scatter plots using plotly. Don’t worry if some of the code seems difficult at the moment - we are only at the second data science computer lab, and we will have plenty of time to practice and improve as the semester progresses.

Before you finish up, make sure to save your script file somewhere safe - it might come in handy later on.


References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

