Data Science Stream

Topic 3B: Data Visualisation II


Welcome to the third computer lab for the Data Science stream of STM1001.

Previously, in the second Data Science computer lab, we familiarised ourselves with the plotly package, and made some informative and interactive histograms and scatter plots of data from the palmerpenguins package (Horst, Hill, and Gorman 2020).

In this computer lab we will focus on further developing our plotly (Sievert 2020) and R coding skills, and cover how to create interactive box plots and violin plots. These plots were introduced in Topic 2.

By the end of this lab, you should feel comfortable using plotly to create a range of interactive data visualisations in RStudio.

🎧 Online students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. Each emoji has a particular meaning and will sometimes be associated with additional instructions:

Prompts for you

💬 Write your answer in the chat.

Modes at different times during the lab

🏡 Main room. All together in the main room – your computer lab demonstrator will be presenting information or facilitating class discussion

💡 Breakout rooms. Your computer lab demonstrator will pick a random date, and the person whose birthday is closest to that date shares their screen or whiteboard. Here you will discuss a question together and bring your group’s answer back to the main room.

💻 Focus mode. You will still be in the main room, but working independently. All students will be sharing their screens during this time so that your computer lab demonstrator (but not other students) can see them.


🏫 Face-to-face (blended) students

Throughout the computer lab question sheets, you will see emojis and/or collapsible sections like this one. You can ignore the emojis and collapsible sections, as they contain information relevant to students who are studying online.


1 Preparations

🏡 Before we begin, we will need to carry out some initial preparations.

1.1

🏡 First, we will need to load all the requisite packages in RStudio.

By now, you should have the palmerpenguins and plotly packages installed. Open up RStudio and load these packages now.

Note: If for whatever reason you do not have one or both of these packages installed on your current device, just click the Code button below to see the relevant code you will need to run in RStudio.

install.packages("palmerpenguins")
install.packages("plotly")

Note: If you need a quick refresher on how to load packages in RStudio, just click the Code button below.

library(palmerpenguins)
library(plotly)
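
The two steps above can also be combined. The following is a minimal sketch (not part of the original lab code) that installs each package only if it is missing, and then loads it. The object name required_packages is purely illustrative.

```r
# Packages needed for this lab.
required_packages <- c("palmerpenguins", "plotly")

for (pkg in required_packages) {
  # requireNamespace() returns FALSE when the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name stored in pkg.
  library(pkg, character.only = TRUE)
}
```

This can be handy at the top of a script that you plan to run on more than one device.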

2 Creating Interactive Box Plots in RStudio

💻 Recall that we can obtain some key summary information on the penguins data from the palmerpenguins package using the function summary, as shown below:

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Rather than showing this information as numbers, we can use box plots to visually present the key information for the numeric variables in this data set.

Note: Refer to Section 5.2 of Topic 2 for details on box plots.
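
As a quick illustration of which numbers a box plot is built from, the sketch below uses the base R functions quantile and IQR on body_mass_g. The object name body_mass_quartiles is an illustrative choice. Note that the quartiles shown when hovering over a plotly box plot may differ slightly from these, as different software can calculate quartiles in slightly different ways.

```r
library(palmerpenguins)

# Minimum, first quartile, median, third quartile and maximum of body mass.
# na.rm = TRUE drops the penguins with no recorded body mass first.
body_mass_quartiles <- quantile(penguins$body_mass_g, na.rm = TRUE)
body_mass_quartiles

# The interquartile range (third quartile minus first quartile)
# corresponds to the height of the box in a box plot.
IQR(penguins$body_mass_g, na.rm = TRUE)
```

The minimum, median and maximum returned here should match the body_mass_g column of the summary output above.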

2.1

💻 Recall that we can create a simple box plot in RStudio using the built-in R function boxplot, as shown below:

boxplot(penguins$body_mass_g)

This box plot leaves much to be desired, and is a little underwhelming. Let’s see what we can do using the plotly package instead.

Note: As we create box plots using plotly, several red Warning messages may appear in the RStudio Console. It is safe to ignore these.

2.2

💻 Recall from the second Data Science computer lab that the typical composition of a simple plotly plot looks like this:

plot_name <- plot_ly(data = ..., x = ~ ..., y = ~ ..., type = ...)

Using this as a base, we can create a simple plotly box plot (as shown below) using the following code:

penguins_box <- plot_ly(data = penguins, y = ~body_mass_g, type = "box")
penguins_box

Note: Here type = "box" tells plotly to create a box plot, similar to how we used type = "scatter" to plot data as a scatter plot in the second Data Science computer lab.

2.3

💻 The plotly box plot in 2.2 looks better than the one created using the built-in boxplot function, and if we hover over it we can observe different informative details like the minimum, median, maximum, and first and third quartiles.

However, notice how the term trace 0 appears when you hover over the plotly box plot? We can replace this with a more informative term, by adding the argument x0 = ... in our code. For example, run the R code below to change trace 0 to body mass (g), and assess the result:

penguins_box <- plot_ly(data = penguins, 
                        y = ~body_mass_g, 
                        type = "box", 
                        x0 = "body mass (g)")
penguins_box
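
As an aside, plotly traces also accept a name argument, which labels the trace itself. The sketch below is an alternative you may like to try alongside x0. It is offered as an assumption that it gives a similarly informative label here, rather than as the approach used in this lab, and the object name penguins_box_named is illustrative.

```r
library(palmerpenguins)
library(plotly)

# name labels the trace, in place of the default "trace 0".
penguins_box_named <- plot_ly(data = penguins,
                              y = ~body_mass_g,
                              type = "box",
                              name = "body mass (g)")
penguins_box_named
```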

2.4

💻 While this is a good start, currently our box plot does not tell us about the distribution of body mass values across the male and female penguins, or across the different species.

Let’s see how we can add this information to our box plot.

Run the R code below to split the box plot of body_mass_g into separate box plots for male and female penguins:

penguins_box <- plot_ly(data = penguins, 
                        y = ~body_mass_g,
                        color = ~sex, 
                        type = "box")
penguins_box

Note: We no longer need to use the x0 argument here, as plotly now uses the values within the sex variable.
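
The summary output earlier shows that 11 penguins have no recorded sex. If you would prefer to plot only the penguins with a recorded sex and body mass, one possible approach (a sketch, using the base R function subset and the illustrative object names penguins_known_sex and penguins_box_known_sex) is to filter the data first:

```r
library(palmerpenguins)
library(plotly)

# Keep only the rows where both sex and body mass were recorded.
penguins_known_sex <- subset(penguins, !is.na(sex) & !is.na(body_mass_g))

# The same box plot as above, drawn from the filtered data.
penguins_box_known_sex <- plot_ly(data = penguins_known_sex,
                                  y = ~body_mass_g,
                                  color = ~sex,
                                  type = "box")
penguins_box_known_sex
```

The rest of this lab continues to use the full penguins data.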

2.5

💻 What do you observe about the distribution of body_mass_g values across the male and female penguins? Would you say that the distributions are symmetric or skewed?


🎧 Online students 💬 Leave a comment about your results in the chat.


2.6

💻 Our box plot is looking better, but we can improve it further by segmenting the data by species.

We can add the species information into our box plot (as shown below), using the following code:

penguins_box <- plot_ly(data = penguins, 
                        x = ~species, y = ~body_mass_g, 
                        color = ~sex, type = "box")
penguins_box

2.7

💻 But wait, the box plots for the males and females of each species are overlapping, which doesn’t look great!

Fortunately, we can fix this easily by changing the layout specifications for the box plot. Run the code below, and observe what happens:

penguins_box %>% layout(boxmode = "group")

Note: Don’t worry about the warning messages that appear when running this code.

2.8

💻 Write a short summary statement discussing the notable features of the 2.7 box plots.


🎧 Online students 💬 Post a short summary of the box plots’ notable features in the chat.


🏡 Reconvene in main room to discuss results


3 Piping

💻 You may have noticed that we used the pipe operator in 2.7 (see footnote 1). Recall from Section 4.1 of the Data Visualisation in R supplement that the pipe operator can be used to chain together a sequence of operations, in an intuitive manner which is typically easier to read than alternative methods.

In 2.7 we used piping to add additional details to an existing object, without needing to define a new object.

Note: Before you continue with this lab, make sure you have read over Section 4.1 of the Data Visualisation in R supplement.
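
To illustrate the idea of chaining operations, here is a small sketch that calculates the mean body mass, rounded to the nearest gram, in two equivalent ways. The object names nested_result and piped_result are illustrative only.

```r
library(palmerpenguins)
library(plotly)  # loading plotly makes the %>% pipe operator available

# Without the pipe: nested function calls are read from the inside out.
nested_result <- round(mean(penguins$body_mass_g, na.rm = TRUE))

# With the pipe: the same steps are read from left to right.
piped_result <- penguins$body_mass_g %>%
  mean(na.rm = TRUE) %>%
  round()

# Both approaches give the same answer.
identical(nested_result, piped_result)
```

In each step, the pipe passes the result on its left-hand side in as the first argument of the function on its right-hand side.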

3.1

💻 Suppose that we would like to add a title to our grouped box plots from 2.7.

Instead of rewriting our penguins_box object and assigning the output to a new object (e.g. penguins_box2), we could use the pipe operator to add this information directly to penguins_box.

The code below does just this:

penguins_box %>% layout(title = "Box Plots of Penguin body mass Data", 
                        boxmode = "group")

3.2

💻 We can also add a title to our legend. This can often help to make our graphs more informative. Try running the code below, and check the result:

penguins_box %>% layout(title = "Box Plots of Penguin body mass Data", 
                        boxmode = "group",
                        legend = list(title = list(text = "Sex")))

3.3

💻 Notice that in the 3.2 code, we were able to use the layout function to add details to multiple components of our plot. Generally, when we make changes to plotly plots via piping, we are making changes to the layout, rather than the core data being visualised.

Within the layout function, we have used the argument title (the function of which is to, rather appropriately, change the title). This is one of many possible arguments - some you will learn as we develop our understanding of plotly, and some you may never use, as they are quite context specific. Typically though, the names of the arguments are clear and easy to remember - for instance, legend allows us to change details in the legend.

3.4

💻 To conclude this example, suppose that we would like to rename the x-axis and y-axis of penguins_box. The default names are ok, but perhaps we would like something a little different. Run the code below and assess the results.

penguins_box %>% layout(xaxis = list(title = "Penguin Species"),
                        yaxis = list(title = "Penguin Body Mass (grams)"),
                        boxmode = "group",
                        legend = list(title = list(text = "Sex")))

3.5

💻 Notice that we have included the list function within our layout function code in 3.4. The xaxis and yaxis arguments can each take several settings - for example, we could change both the x-axis title and its font size. Wrapping these settings in the list function is typically required for layout arguments (the title in 3.1 was an exception).

Therefore, please keep in mind that generally speaking, in the context of our plotly graphs, when dealing with the layout function we need to use the list function before specifying our desired changes to layout arguments.
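
As a sketch of how several settings can be passed to a single layout argument, the code below sets both the text and the font size of the x-axis title by nesting one list inside another. The font size of 18 and the object name penguins_box_large_title are arbitrary, illustrative choices.

```r
library(palmerpenguins)
library(plotly)

penguins_box <- plot_ly(data = penguins,
                        x = ~species, y = ~body_mass_g,
                        color = ~sex, type = "box")

# The x-axis title is itself a list, holding the title text and its font.
penguins_box_large_title <- penguins_box %>%
  layout(xaxis = list(title = list(text = "Penguin Species",
                                   font = list(size = 18))),
         boxmode = "group")
penguins_box_large_title
```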

3.6

💻 As a final note, it’s worth pointing out that our main title from 3.1 has disappeared in our new plot in 3.4. This is because we did not assign our enhanced plot to a new object.

When we use piping, we are not modifying the original object, but rather are carrying out operations on/with it. Therefore any changes we implement via piping are not saved to the original object.
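
If we do want to keep our changes, we can assign the piped result to an object. The sketch below (with the illustrative object name penguins_box_titled) saves the titled, grouped box plots, so that later changes build on them and the title is retained.

```r
library(palmerpenguins)
library(plotly)

penguins_box <- plot_ly(data = penguins,
                        x = ~species, y = ~body_mass_g,
                        color = ~sex, type = "box")

# Assigning the piped result stores the layout changes in a new object.
penguins_box_titled <- penguins_box %>%
  layout(title = "Box Plots of Penguin body mass Data",
         boxmode = "group")

# Further changes can now be piped onto the saved plot.
penguins_box_titled %>%
  layout(xaxis = list(title = "Penguin Species"),
         yaxis = list(title = "Penguin Body Mass (grams)"))
```

Note that penguins_box itself is still unchanged after running this code.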

4 Creating Interactive Violin Plots in RStudio

💻 Let us now use our plotly and piping skills to create another type of plot, the violin plot.

We can think of violin plots as being an extension of box plots, which also show the density of the observations (a bit like a smoothed version of a histogram).

Note: Refer to Section 5.3 of Topic 2 for details on violin plots.

4.1

💻 We can use the code from 2.3 as a good starting point for our first violin plot, although we will need to replace the type = 'box' specification with…you guessed it - type = 'violin'.

Run the R code below to produce our first violin plot:

penguins_violin <- plot_ly(data = penguins, 
                           y = ~body_mass_g, 
                           type = "violin", 
                           x0 = "body mass (g)",
                           box = list(visible = TRUE))
penguins_violin

Note: We include the argument box = list(visible = TRUE) to ensure the box plot of the data also appears in the violin plot.

4.2

💻 Let us now start making our violin plot more informative. Using the code in 2.6 as a guide, break the violin plot in 4.1 into separate violin plots for each species of penguin.

Hint: If you are stuck on this and not sure how to proceed, check the Code chunk below:

penguins_violin <- plot_ly(data = penguins, 
                           x = ~species, # this is the line we add
                           y = ~body_mass_g, 
                           type = "violin",
                           box = list(visible = TRUE))
penguins_violin

4.3

💻 Next, let’s split these violin plots into separate ones for male and female penguins. To do this, we can add either the argument color = ~sex, or the argument split = ~sex. Choose one, and modify your penguins_violin object accordingly.

4.4

💻 Unfortunately, using either option in 4.3 results in our violin plots overlapping like in 2.6.

Fortunately, we can address this using our new piping skills.

Using the code in 2.7 and 3.1 as a guide, pipe additional details to your penguins_violin object so that the violin plots for the male and female penguins of each species are next to each other in separate lanes, rather than overlapping.

Hint: You will have to change the boxmode = "group" code used in the layout parts of 2.7 and 3.1 - think about what type of plot you are considering now.

4.5

💻 To finish up, use piping to add an appropriate title to your grouped violin plots graph.


🏡 Reconvene in main room to discuss results


5 Extension: Creating your own plotly plots

💡 Now that you have had a chance to practice creating interactive box plots and violin plots using plotly, it is time to make your own.

Using all the skills you have gained from this lab, create a set of either box plots or violin plots (or both if you would like) with the following characteristics:

  • The plots should show information on the bill length (bill_length_mm) of the penguins
  • The plots should be grouped by the sex of the penguin
  • The plots should be coloured according to the island the penguin inhabits
  • The plots should be split according to the species of the penguin
  • The hover text should show the island the penguin inhabits
  • The plots should have informative axes and legend labels, and an informative title

See how you go, and if you get stuck, remember to ask your lab demonstrator for help. Good luck!


🎧 Online students 💬 If your breakout room group had time to create your own Plotly penguin plot in Question 5, take a snippet/screenshot of it and copy-paste it into the chat.


🏡 Discuss results in main room and conclude lab


Well done! There was a lot of content today.

Don’t worry if you weren’t able to finish everything in the one session - there is quite a lot of material to work through in this lab, and it’s not easy.

Hopefully though, you are beginning to feel quite skilled with using plotly. The techniques and coding skills you are learning should stand you in good stead for the following weeks. Remember, you can always refer back to this material at a later date if you need a quick refresher.

Before you finish up, make sure to save your script file somewhere safe - it might come in handy later on.

References

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC. https://plotly-r.com.


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.


  1. We used it a couple of times in the second Data Science computer lab too.

