R Programming - Class 3: Introduction to ggplot2 and mutate

🎯 Objectives

By the end of this class, you will be able to:

  • Understand the grammar of graphics using ggplot2.
  • Create basic plots using ggplot().
  • Use the mutate() function to create new variables.
  • Combine data manipulation with plotting for data exploration.

🎨 Introduction to ggplot2

  • ggplot2 is part of the tidyverse and is used for creating elegant and customizable data visualizations.
  • It is based on the “Grammar of Graphics” concept: a plot is described by its data, aesthetic mappings, and geometric layers.
  • A plot is built in layers. Each layer is added to the ggplot() call with the ‘+’ operator, not the pipe ‘|>’.

✨ Basic Structure of a ggplot

ggplot(data = <DATA>, aes(x = <X>, y = <Y>)) +
  <GEOM_FUNCTION>()
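
As a minimal sketch of how the template above might be filled in, the example below uses the built-in iris dataset. The choice of Sepal.Length and Sepal.Width as the x and y variables, and the object name p, are assumptions made for illustration only.

```r
library(ggplot2)

# <DATA> becomes iris, <X> and <Y> become two of its numeric columns,
# and <GEOM_FUNCTION> becomes geom_point() for a scatter plot
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# printing the object draws the plot
p
```

Swapping geom_point() for another geom changes the plot type while the data and aes() mapping stay the same.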

📊 Common Geoms

Geom Function       Description
geom_point()        Scatter plot
geom_bar()          Bar plot
geom_col()          Bar plot (pre-counted)
geom_histogram()    Histogram
geom_line()         Line plot
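
The difference between geom_bar() and geom_col() in the table above can be sketched as follows. This is an illustration rather than part of the class code: the object names bar_plot, col_plot, and species_counts are hypothetical, and dplyr's count() is used here as one way of producing pre-counted data.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each group itself, so only x is mapped
bar_plot <- ggplot(iris, aes(x = Species)) +
  geom_bar()

# geom_col() expects the bar heights to be in the data already.
# count() returns one row per species with the row count in column n
species_counts <- iris |>
  count(Species)

col_plot <- ggplot(species_counts, aes(x = Species, y = n)) +
  geom_col()
```

Both plots should show the same bars; the only difference is where the counting happens.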

✅ Try it out using the iris dataset

🧹 Cleaning the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Let's clean the data and keep only what we need
iris_clean <- iris |> 
  select(Species, Sepal.Length) |> 
  group_by(Species) |> 
  # one row per species: mean, sd, and the mean -/+ one sd for the error bars
  summarise(mean_length = mean(Sepal.Length),
            sd_length = sd(Sepal.Length),
            min_length = mean_length - sd_length,
            max_length = mean_length + sd_length) |> 
  # reorder the Species factor levels by mean_length, largest first
  mutate(Species = fct_reorder(Species, mean_length, .desc = TRUE))

#Look at the summarised data by calling the variable
iris_clean
# A tibble: 3 × 5
  Species    mean_length sd_length min_length max_length
  <fct>            <dbl>     <dbl>      <dbl>      <dbl>
1 setosa            5.01     0.352       4.65       5.36
2 versicolor        5.94     0.516       5.42       6.45
3 virginica         6.59     0.636       5.95       7.22
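
The summary above computes the error-bar bounds inside summarise(). As a sketch of the same idea with mutate(), the bounds could instead be added as new columns after summarising. The object name iris_summary and the extra cv_length column (sd divided by mean) are hypothetical and only illustrate how mutate() builds new variables from existing ones.

```r
library(dplyr)

# summarise first: one row per species
iris_summary <- iris |>
  group_by(Species) |>
  summarise(mean_length = mean(Sepal.Length),
            sd_length = sd(Sepal.Length))

# mutate() keeps every row and adds new columns computed from existing ones
iris_summary <- iris_summary |>
  mutate(min_length = mean_length - sd_length,
         max_length = mean_length + sd_length,
         cv_length = sd_length / mean_length)  # hypothetical extra column

iris_summary
```

Unlike summarise(), mutate() does not reduce the number of rows, so the result still has one row per species.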

📊 Plotting a simple bar graph using the clean data

Because fill = Species is mapped inside aes(), ggplot2 gives each species its own fill colour from the default palette. No colours need to be chosen by hand.

iris_clean |> 
  
  # Note: y is mean_length, not Sepal.Length, because each bar shows a single summary value (the mean) per species
  ggplot(aes(x = Species, y = mean_length, fill = Species))+
  geom_col(color = "black")+
  geom_errorbar(aes(ymin = min_length, ymax = max_length), width = 0.3)


📊 Let’s make the graph a little bit more aesthetic

To add the raw data points to the plot, add a geom_jitter() or geom_point() layer. geom_jitter() is shorthand for geom_point(position = "jitter"). Because the points come from the non-summarised dataset, that dataset must be passed separately through the layer's data argument, with its own aesthetics specified in aes().

#this is the original code
iris_clean |> 
  ggplot(aes(Species, mean_length, fill = Species))+
  geom_col(color = "black")+
  geom_errorbar(aes(ymin = min_length, ymax = max_length), width = 0.3)+
  
  # adding the raw data points from the non-summarised iris data
  # (no select() or group_by() is needed: the layer only uses the mapped columns)
  geom_jitter(data = iris,
              aes(x = Species, y = Sepal.Length),
              width = 0.3, size = 3, alpha = 0.5)+
  
  # expand = c(0, 0) removes the small gap between the x-axis and the start of the y-axis
  scale_y_continuous(expand = c(0, 0), limits = c(0, 8), breaks = seq(0, 8, 1))+
  theme_classic()+
  
  #making the legend disappear
  theme(
    legend.position = "none"
  )+
  
  #adding labels to x and y axis
  labs(
    x = "",
    y = "Sepal Length"
  )
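
The relationship between geom_jitter() and geom_point() described earlier in this section can be sketched with two layers that are intended to be equivalent. The object names p_jitter and p_point are hypothetical, and height = 0 is an added assumption so that points are only spread sideways and their Sepal.Length values are not shifted.

```r
library(ggplot2)

# geom_jitter() with explicit jitter settings
p_jitter <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.5)

# the same layer written as geom_point() with a jitter position
p_point <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(position = position_jitter(width = 0.3, height = 0), alpha = 0.5)
```

Because jitter is random, the two plots will not place points identically unless a seed is passed to position_jitter().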