class: middle background-image: url(data:image/png;base64,#LTU_logo.jpg) background-position: top left background-size: 30% # STM1001 Lecture # Introduction to Data Visualisation in R ## Data Science stream ### La Trobe University --- # Welcome! ### In this lecture we cover an Introduction to Data Visualisation in R. -- * By the end of this lecture you will: -- * have a deeper appreciation of the importance of data visualisation -- * be able to distinguish between histograms, box plots, violin plots and scatter plots -- * have a clearer understanding of the content to be covered in the 2nd, 3rd and 4th Data Science computer labs -- We will use RStudio to create all our Data Science data visualisations. --- # Overview Over the following slides, we will cover: -- * What is data visualisation and why do we use it? -- * The Palmer Penguins Data Set -- * Histograms -- * Box Plots -- * Violin Plots -- * Scatter Plots -- * Plotly Custom Controls and Animations --- class: middle # 1. What is data visualisation and why do we use it? Humanity has created more data in the last few years than in all of previous human history. -- STM1001 is all about making sense of data, and .teal_style[data visualisation] is an integral tool in this endeavour. -- Data visualisation involves presenting data in a visual format, in order to highlight key information and to make it more accessible to others. -- Visualising our data can make it easier for us to assess, investigate, and understand our data. It is often one of the first steps in an analysis. -- Data visualisations are also an effective way in which to summarise and convey key information to others. -- Data visualisations can range from simple static plots to complex interactive plots to web apps and dashboard content which updates in real time. 
--- # Hans Rosling Data Visualisation Video As an example of effective data visualisation, let's watch a short (roughly 5-minute) video by the late, world-famous physician and statistician .teal_style[Hans Rosling] on GDP changes across countries in the last 200 years. -- * By completing Computer Labs 2B-4B, you will develop the skills to create a similar interactive and animated plot! .center[ <iframe width="560" height="315" src="https://www.youtube.com/embed/jbkSRLYSojo" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture"></iframe> ] --- # Data Visualisation Careers Effective data visualisation skills are in high demand in industry, and some data scientists specialise primarily in data visualisation. -- As an example of a potential career pathway as a data visualisation expert, here is a recent job advertisement from Seek (June 2023): -- <img src="data:image/png;base64,#data_vis_ad.png" width="800px" style="display: block; margin: auto 0 auto auto;" /> -- Of course, we won't be able to cover all the skills required for such a job in just 3 weeks, but the content we cover will lay important foundations for future studies if a career in data visualisation appeals to you. --- # Supplementary Material Note * All of the material we cover in this lecture is available in the LMS in the supplementary material .teal_style[Introduction to Data Visualisation in R] ([available here](https://bookdown.org/rehk/stm1001_dsm_data_visualisation_in_r/)). -- * Once you have attended this lecture, and gone through sections 1 - 3 in the .teal_style[Introduction to Data Visualisation in R] supplement in your own time, you will be ready to start Computer Lab 2B. -- * The later sections of the supplement should be covered before beginning Computer Labs 3B and 4B. 
-- * It is recommended that you refer regularly to the .teal_style[Introduction to Data Visualisation in R] content, especially in the early part of the subject, to help consolidate your understanding of R data visualisation techniques. --- class:middle We can produce high-quality, professional graphics with precision using R. -- We will use RStudio to create various types of plots throughout the teaching period, both in the core and stream-specific parts of the subject. -- To help us introduce the plots we will cover today, let's introduce a data set. --- ## 2. The Palmer Penguins Data Set Throughout the teaching period, we will use various data sets to teach important data science techniques. -- One such data set, which we will use extensively, is the .teal_style[penguins] data set from the .teal_style[palmerpenguins] R package (Horst, Hill, and Gorman 2020). -- The .teal_style[penguins] data set contains recorded characteristics of three species of penguin living in the Palmer archipelago, off the coast of Antarctica, on three specific islands: * Dream * Biscoe * Torgersen -- The three species of penguin are: --- ### Adelie Penguins <img src="data:image/png;base64,#adelie.jpg" width="600px" style="display: block; margin: auto;" /> “Adelie Penguin (Pygoscelis adeliae)” by Gregory ‘Slobirdr’ Smith is licensed under CC BY-SA 2.0 --- ### Chinstrap Penguins <img src="data:image/png;base64,#chinstrap.jpg" width="600px" style="display: block; margin: auto;" /> “Chinstrap Penguins” by D-Stanley is licensed under CC BY 2.0 --- ### Gentoo Penguins <img src="data:image/png;base64,#gentoo.jpg" width="600px" style="display: block; margin: auto;" /> “Gentoo Penguins” by D-Stanley is licensed under CC BY 2.0 --- # ### Inspecting the data <br> The `penguins` data set contains measurements for different characteristics of over 300 adult penguins - a selection is shown below: <div style="border: 0px;overflow-x: scroll; width:100%; "><table class="table" style="margin-left: auto; 
margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> species </th> <th style="text-align:left;"> island </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> <th style="text-align:right;"> body_mass_g </th> <th style="text-align:left;"> sex </th> <th style="text-align:right;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:left;"> Torgersen </td> <td style="text-align:right;"> 39.1 </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 181 </td> <td style="text-align:right;"> 3750 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 2007 </td> </tr> <tr> <td style="text-align:left;"> Adelie </td> <td style="text-align:left;"> Biscoe </td> <td style="text-align:right;"> 41.4 </td> <td style="text-align:right;"> 18.6 </td> <td style="text-align:right;"> 191 </td> <td style="text-align:right;"> 3700 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 2008 </td> </tr> <tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:left;"> Biscoe </td> <td style="text-align:right;"> 46.5 </td> <td style="text-align:right;"> 14.4 </td> <td style="text-align:right;"> 217 </td> <td style="text-align:right;"> 4900 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 2008 </td> </tr> <tr> <td style="text-align:left;"> Gentoo </td> <td style="text-align:left;"> Biscoe </td> <td style="text-align:right;"> 41.7 </td> <td style="text-align:right;"> 14.7 </td> <td style="text-align:right;"> 210 </td> <td style="text-align:right;"> 4700 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 2009 </td> </tr> <tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:left;"> Dream </td> <td style="text-align:right;"> 49.7 </td> <td style="text-align:right;"> 18.6 
</td> <td style="text-align:right;"> 195 </td> <td style="text-align:right;"> 3600 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 2008 </td> </tr> <tr> <td style="text-align:left;"> Chinstrap </td> <td style="text-align:left;"> Dream </td> <td style="text-align:right;"> 43.5 </td> <td style="text-align:right;"> 18.1 </td> <td style="text-align:right;"> 202 </td> <td style="text-align:right;"> 3400 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 2009 </td> </tr> </tbody> </table></div> -- <br> We have data on the penguins' .teal_style[species], the .teal_style[island] on which they live, their .teal_style[bill length], .teal_style[bill depth] and .teal_style[flipper length] (all measured in mm), their .teal_style[body mass] (measured in grams), their .teal_style[sex], and the .teal_style[year] in which the recordings were made. --- layout: true ### Inspecting the data As a starting point, suppose we would like to know the number of penguins measured for each of the three species. --- * We could create a .teal_style[Frequency Table]:
```r
summary(penguins$species)
```
```
##    Adelie Chinstrap    Gentoo
##       146        68       119
```
--- Or as a visual alternative, use a .teal_style[Bar Chart] (see Topic 1): --
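The code behind the interactive bar chart is not shown on the slide, so here is a minimal sketch of how one might be built with `plotly`. It assumes that rows with missing values were dropped first (this reproduces the counts 146, 68 and 119 in the frequency table); the names `penguins_complete`, `species_counts` and `bar_fig` are illustrative only.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed, which matches the
# species counts shown in the frequency table above.
penguins_complete <- na.omit(penguins)

# Count the penguins in each species (columns: species, Freq)
species_counts <- as.data.frame(table(species = penguins_complete$species))

# Interactive bar chart with one bar per species
bar_fig <- plot_ly(species_counts, x = ~species, y = ~Freq, type = "bar") %>%
  layout(
    title = "Number of penguins by species",
    xaxis = list(title = "Species"),
    yaxis = list(title = "Count")
  )

bar_fig
```

Hovering over a bar in the rendered chart shows the exact count for that species.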
* *Note that this bar chart is interactive - we'll discuss this more shortly!* --- layout:false class:middle Of course, we would like to know more than just how many penguins there are. -- Over the following slides, we will look at various .teal_style[data visualisation] methods that can help us to quickly and easily identify the differences in the characteristics of these species of penguin. -- In Computer Labs 2B, 3B and 4B you will learn how to create these data visualisations. -- Let's introduce some of these methods now! --- # 3. Histograms Histograms are commonly used to swiftly visualise the shape of data (see Topic 1). -- A histogram is a chart that depicts how often the values of a numerical variable fall into non-overlapping intervals, called 'bins', that span the entire range of the data. -- * While we have used a bar chart for the categorical variable `species`, a histogram would be the equivalent kind of chart for numerical data like `body mass`. -- * A histogram helps us to answer questions about shape: for example, does the data look bell-shaped, or does it seem to be skewed to the left or to the right? --- Suppose we are interested in the .teal_style[body masses] of the penguins. We could produce a simple histogram using the inbuilt R function `hist`: -- <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> -- However, this histogram leaves a lot to be desired - it's static, and a little boring. We cannot interact with the image, and we cannot manipulate it in real time to display different details. -- We could add some extra details, but this would require a lot more coding. --- layout: true In the .red_style[Data Science] stream, we will learn how to create dynamic, interactive plots using the `plotly` package. --- Let's have a look at the penguin `body mass` data, in a `plotly` histogram format. --
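As a rough sketch of the two approaches just described, the code below draws a static histogram with base R's `hist` and then an interactive one with `plot_ly`. The exact options used for the lecture's figures are not shown, so the defaults here are an assumption, as is the removal of rows with missing values.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Static histogram using base R; hist() also returns the bin counts
base_hist <- hist(
  penguins_complete$body_mass_g,
  main = "Penguin body mass",
  xlab = "Body mass (g)"
)

# Interactive histogram of the same variable using plotly
hist_fig <- plot_ly(penguins_complete, x = ~body_mass_g, type = "histogram")

hist_fig
```

In the `plotly` version you can hover over a bar to see its bin range and frequency, and drag to zoom in on part of the distribution.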
* Unlike graphs created using base R functions, `plotly` graphs are .teal_style[interactive] - even when embedded in web pages and slides like this! --- It is also relatively easy to add extra detail to our `plotly` graphs - perhaps we want to split the body mass data by the species of penguin (and add some clear titles):
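One possible way to split the histogram by species and add titles is sketched below. Mapping `species` to `color` gives one set of bars per species; the overlay mode, transparency and title wording are illustrative choices, not necessarily those used for the lecture's plot.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# One histogram trace per species, drawn semi-transparently on shared axes
species_hist_fig <- plot_ly(
  penguins_complete,
  x = ~body_mass_g,
  color = ~species,
  type = "histogram",
  alpha = 0.6
) %>%
  layout(
    barmode = "overlay",
    title = "Penguin body mass by species",
    xaxis = list(title = "Body mass (g)"),
    yaxis = list(title = "Frequency")
  )

species_hist_fig
```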
* The plot has a handy legend, which we can also use to .teal_style[dynamically filter] the results displayed. The axes will scale automatically too! --- layout: false layout: true # 4. Box Plots .teal_style[Box plots] are also commonly used to show the shape or distribution of data (see Topic 2). A box plot displays the minimum, 25% quantile (Q1), median (Q2), 75% quantile (Q3), and the maximum for the observations of a numerical variable. --- Let's take a look at an example box plot: -- <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> --- If there are any outliers these will also be shown, typically as dots: -- <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> --- Box plots can also indicate how skewed the data is - let's take a look at some examples: --- <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> --- layout: false layout: true # 4. Box Plots The box plots shown so far have been static - we can produce more informative, interactive box plots using `plotly`. ---
--- * We can easily display multiple box plots in the one graph:
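A minimal sketch of such a graph, assuming we want one box per species for the body mass data (the variable choice and titles are illustrative):

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Putting a categorical variable on the x axis gives one box per category
box_fig <- plot_ly(
  penguins_complete,
  x = ~species,
  y = ~body_mass_g,
  type = "box"
) %>%
  layout(
    title = "Penguin body mass by species",
    xaxis = list(title = "Species"),
    yaxis = list(title = "Body mass (g)")
  )

box_fig
```

Hovering over a box displays its minimum, Q1, median, Q3 and maximum.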
--- * We can also split the data we present by another variable:
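The sketch below splits each species' box by a second categorical variable. Using `sex` as the splitting variable is an assumption made for illustration; `boxmode = "group"` places the boxes side by side rather than on top of one another.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Colour by sex to get a separate box for each sex within each species
grouped_box_fig <- plot_ly(
  penguins_complete,
  x = ~species,
  y = ~body_mass_g,
  color = ~sex,
  type = "box"
) %>%
  layout(
    boxmode = "group",
    title = "Penguin body mass by species and sex",
    xaxis = list(title = "Species"),
    yaxis = list(title = "Body mass (g)")
  )

grouped_box_fig
```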
--- layout: false # 5. Violin Plots .teal_style[Violin plots] are another excellent way to display data (see Topic 2). -- We can think of violin plots as extensions of box plots, which also show the density of the observations in our data (a bit like a smoothed version of a histogram). -- Let's take a look at some examples. --- # 5. Violin Plots Here we have included the corresponding histograms, to help introduce violin plots. <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" /> --- layout: true # 5. Violin Plots Using `plotly`, we can create interactive violin plots. ---
--- * We can easily display multiple violin plots in the one graph:
--- * We can also split the data we present by another variable:
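Violin plots follow the same pattern as box plots in `plotly`, with `type = "violin"`. The sketch below shows one violin per species, split by `sex` (an illustrative choice of variable), with a small box plot and mean line drawn inside each violin; removing `color = ~sex` and the `violinmode` setting gives the simpler one-violin-per-species graph.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Grouped violins with an embedded box plot and mean line
violin_fig <- plot_ly(
  penguins_complete,
  x = ~species,
  y = ~body_mass_g,
  color = ~sex,
  type = "violin",
  box = list(visible = TRUE),
  meanline = list(visible = TRUE)
) %>%
  layout(
    violinmode = "group",
    title = "Penguin body mass by species and sex",
    xaxis = list(title = "Species"),
    yaxis = list(title = "Body mass (g)")
  )

violin_fig
```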
--- layout: false class: middle # 6. Scatter Plots We have focused so far on plots for presenting data for one numerical variable (e.g. the penguins' body masses), with additional information from categorical variables (e.g. species) added. -- If we would like to compare two numerical variables (and perhaps check for a relationship between them), a convenient visual option is to use a .teal_style[Scatter Plot]. -- Suppose we would like to check if there is a relationship between the penguins' .teal_style[body masses] and their .teal_style[flipper lengths]. --- # 6. Scatter Plots We could create a simple scatter plot using the inbuilt R function `plot`: -- <img src="data:image/png;base64,#STM1001_DS_Data_Visualisation_Lecture_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" /> -- Just like the default histogram, this scatter plot is not very exciting - it displays the relevant data, but we can make it more interesting and informative by creating an interactive `plotly` version. --- # 6. Scatter Plots Let's have a look at these numerical variables, in a `plotly` scatter plot format:
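A minimal sketch of such a scatter plot is given below. Placing flipper length on the x axis and body mass on the y axis is an assumption; the lecture does not state which variable goes on which axis.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Each point is one penguin: flipper length against body mass
scatter_fig <- plot_ly(
  penguins_complete,
  x = ~flipper_length_mm,
  y = ~body_mass_g,
  type = "scatter",
  mode = "markers"
)

scatter_fig
```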
--- # 6. Scatter Plots Now let's add some more information to our scatter plot:
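One way to add more information is sketched below: points are coloured by species, extra details appear in the hover text, and the axes are given clear titles. Which details the lecture's plot actually adds is not shown, so these choices are illustrative.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Colour by species and show island and sex when hovering over a point
detailed_scatter_fig <- plot_ly(
  penguins_complete,
  x = ~flipper_length_mm,
  y = ~body_mass_g,
  color = ~species,
  text = ~paste("Island:", island, "<br>Sex:", sex),
  type = "scatter",
  mode = "markers"
) %>%
  layout(
    title = "Body mass against flipper length",
    xaxis = list(title = "Flipper length (mm)"),
    yaxis = list(title = "Body mass (g)")
  )

detailed_scatter_fig
```

Clicking a species in the legend hides or shows its points, which makes it easier to compare the species one pair at a time.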
--- layout: false layout: true # 7. Plotly Custom Controls and Animations Once you become familiar with the basics of `plotly` functions, we will cover how to add extra details to your plots, such as: --- -- * Range sliders (easy)
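As a rough illustration, a range slider can be attached to a plot with the `rangeslider()` function from `plotly`. The scatter plot used here is an assumed example, not necessarily the plot shown in the lecture.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# rangeslider() adds a draggable slider beneath the x axis
slider_fig <- plot_ly(
  penguins_complete,
  x = ~flipper_length_mm,
  y = ~body_mass_g,
  type = "scatter",
  mode = "markers"
) %>%
  rangeslider()

slider_fig
```

Dragging the handles of the slider restricts the plot to the chosen range of x values.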
--- * Animations (easy)
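A minimal animation sketch: mapping a variable to `frame` creates one animation frame per value, together with a play button and slider. Animating over `year` is an assumption made for illustration. Note that different penguins were measured each year, so the frames show different sets of points rather than the same penguins moving.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# One frame per year of recording
anim_fig <- plot_ly(
  penguins_complete,
  x = ~flipper_length_mm,
  y = ~body_mass_g,
  color = ~species,
  frame = ~year,
  type = "scatter",
  mode = "markers"
) %>%
  animation_opts(frame = 1000, transition = 500) %>%
  animation_slider(currentvalue = list(prefix = "Year: "))

anim_fig
```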
--- * Buttons to switch between presentation formats (slightly harder)
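Buttons can be added through the `updatemenus` option of `layout()`. The sketch below, which assumes we want to switch the body mass plot between a box plot and a violin plot, uses the `restyle` method to change the trace `type` when a button is clicked; the button positions and labels are illustrative.

```r
library(palmerpenguins)
library(plotly)

# Assumption: rows with missing values are removed before plotting
penguins_complete <- na.omit(penguins)

# Start with a box plot; each button restyles the trace type
button_fig <- plot_ly(
  penguins_complete,
  x = ~species,
  y = ~body_mass_g,
  type = "box"
) %>%
  layout(
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.1,
        y = 1.15,
        buttons = list(
          list(method = "restyle", args = list("type", "box"),
               label = "Box plot"),
          list(method = "restyle", args = list("type", "violin"),
               label = "Violin plot")
        )
      )
    )
  )

button_fig
```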
--- layout: false # End That concludes our Introduction to Data Visualisation in R lecture. -- What to do next: * Before Computer Lab 2B, please check over sections 1 to 3 of the supplementary material .teal_style[Introduction to Data Visualisation in R]. * If you have any questions, we can resolve them in the computer labs. --- background-image: url(data:image/png;base64,#computerlab.jpg) background-position: bottom background-size: 75% class: center # See you in the computer labs! --- # References * BBC. (2010, Nov. 26). *Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC 4* [Video]. YouTube. [https://www.youtube.com/watch?v=jbkSRLYSojo](https://www.youtube.com/watch?v=jbkSRLYSojo). * Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. *palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data*. [https://doi.org/10.5281/zenodo.3960218](https://doi.org/10.5281/zenodo.3960218). * Sievert, Carson. 2020. *Interactive Web-Based Data Visualization with R, Plotly, and Shiny*. Chapman and Hall/CRC. [https://plotly-r.com](https://plotly-r.com). --- class: middle <font color = "grey"> These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a> </font>