R Programming - Class 3: Introduction to ggplot2 and mutate
🎯 Objectives
By the end of this class, you will be able to:
Understand the grammar of graphics using ggplot2.
Create basic plots using ggplot().
Use the mutate() function to create new variables.
Combine data manipulation with plotting for data exploration.
🎨 Introduction to ggplot2
ggplot2 is part of the tidyverse and is used for creating elegant and customizable data visualizations.
It is based on the concept of “The Grammar of Graphics”.
ggplot2 builds a plot in layers. Each layer is added to the initial ggplot() call with the ‘+’ symbol, not with the ‘|>’ pipe.
✨ Basic Structure of a ggplot
ggplot(data = <DATA>, aes(x = <X>, y = <Y>)) + <GEOM_FUNCTION>()
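As a minimal sketch of the template filled in, the built-in iris dataset can stand in for <DATA>, two of its columns for <X> and <Y>, and geom_point() for the geom. The object names p_scatter and p_piped are illustrative choices, not part of the class material.

```r
library(ggplot2)

# Fill in the template: data = iris, x = Sepal.Length, y = Petal.Length
p_scatter <- ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# The same plot with the data piped in; the layers are still joined with `+`
p_piped <- iris |>
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Printing the object draws the plot
p_scatter
```

The pipe only hands the data to ggplot(); everything after that first call is a layer and therefore uses ‘+’.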
📊 Common Geoms
Geom Function | Description
geom_point() | Scatter plot
geom_bar() | Bar plot
geom_col() | Bar plot (pre-counted)
geom_histogram() | Histogram
geom_line() | Line plot
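The difference between geom_bar() and geom_col() is easiest to see side by side. In the sketch below, geom_bar() counts the rows of iris per species itself, while geom_col() plots counts that were computed beforehand with count() from dplyr. The names species_counts, n_flowers, p_bar and p_col are illustrative.

```r
library(ggplot2)
library(dplyr)

# geom_bar(): give it the raw rows and it counts them for you
p_bar <- ggplot(iris, aes(x = Species)) +
  geom_bar()

# geom_col(): give it values that have already been computed
species_counts <- iris |>
  count(Species, name = "n_flowers")

p_col <- ggplot(species_counts, aes(x = Species, y = n_flowers)) +
  geom_col()

p_bar
p_col
```

Both plots show bars of the same height. The only difference is where the counting happens.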
✅ Try it out using the iris dataset
🧹 Cleaning the data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Let's clean the data and keep only what we need
iris_clean <- iris |>
  select(Species, Sepal.Length) |>
  group_by(Species) |>
  summarise(
    mean_length = mean(Sepal.Length),
    sd_length = sd(Sepal.Length),
    min_length = mean_length - sd_length,
    max_length = mean_length + sd_length
  ) |>
  mutate(Species = fct_reorder(Species, mean_length, .desc = TRUE))

# Look at the summarised data by calling the variable
iris_clean
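The final mutate() call in the cleaning step is what controls the order of the bars later on. As a small sketch of what it does, the code below rebuilds a cut-down version of the summary and compares the factor levels before and after fct_reorder(). The names species_means and reordered are illustrative.

```r
library(dplyr)
library(forcats)

# One mean per species, as in the cleaning step
species_means <- iris |>
  group_by(Species) |>
  summarise(mean_length = mean(Sepal.Length))

# mutate() overwrites Species with a re-levelled copy of itself
reordered <- species_means |>
  mutate(Species = fct_reorder(Species, mean_length, .desc = TRUE))

levels(species_means$Species)  # default level order in iris
levels(reordered$Species)      # levels sorted by mean_length, largest first
```

ggplot2 places a discrete variable along the axis in level order, so re-levelling the factor is a common way to sort bars by value.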
📊 Plotting a simple bar graph using the clean data
Because fill = Species is mapped inside aes(), ggplot2 assigns each species its own colour from its default palette, so the bars are coloured according to Species.
iris_clean |>
  # Note: y is mean_length rather than Sepal.Length because geom_col() draws one bar per pre-computed value (here, one mean per species)
  ggplot(aes(x = Species, y = mean_length, fill = Species)) +
  geom_col(color = "black") +
  geom_errorbar(aes(ymin = min_length, ymax = max_length), width = 0.3)
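To see why fill sits inside aes(), it may help to compare it with a fixed fill set outside aes(). The sketch below rebuilds a small summary table so that it runs on its own. The names species_means, p_mapped and p_fixed are illustrative, and "steelblue" is an arbitrary colour choice.

```r
library(ggplot2)
library(dplyr)

species_means <- iris |>
  group_by(Species) |>
  summarise(mean_length = mean(Sepal.Length))

# Mapped: fill depends on the data, so each species gets its own colour and a legend
p_mapped <- ggplot(species_means, aes(x = Species, y = mean_length, fill = Species)) +
  geom_col(color = "black")

# Set: fill is a constant outside aes(), so every bar has the same colour and no legend is drawn
p_fixed <- ggplot(species_means, aes(x = Species, y = mean_length)) +
  geom_col(fill = "steelblue", color = "black")

p_mapped
p_fixed
```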
📊 Let’s make the graph a little bit more aesthetic
To add the individual data points to the plot, add either a geom_jitter() or a geom_point() layer. Technically speaking, geom_jitter() is shorthand for geom_point(position = "jitter"). Because iris_clean only holds the summary values, the non-summarised dataset has to be passed separately through the data argument of geom_point() or geom_jitter(), with its own aesthetics specified in aes().
# This is the original code
iris_clean |>
  ggplot(aes(Species, mean_length, fill = Species)) +
  geom_col(color = "black") +
  geom_errorbar(aes(ymin = min_length, ymax = max_length), width = 0.3) +
  # Adding the data points to the plot
  geom_jitter(
    data = iris |>
      select(Species, Sepal.Length) |>
      group_by(Species),
    aes(x = Species, y = Sepal.Length),
    width = 0.3, size = 3, alpha = 0.5
  ) +
  # Getting rid of the small gap between the x-axis and the start of the y-axis; expand does the trick
  scale_y_continuous(expand = c(0, 0), limits = c(0, 8), breaks = seq(0, 8, 1)) +
  theme_classic() +
  # Making the legend disappear
  theme(legend.position = "none") +
  # Adding labels to the x and y axes
  labs(x = "", y = "Sepal Length")
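The relationship between geom_jitter() and geom_point() can be sketched as follows. Both layers draw jittered points, although the exact positions differ from run to run because the jitter is random. The plot names are illustrative.

```r
library(ggplot2)

# geom_jitter() is shorthand for geom_point() with a jitter position
p_jitter <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_jitter()

p_point <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(position = "jitter")

# geom_jitter() takes width and height directly;
# with geom_point() the same control goes through position_jitter()
p_narrow <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_point(position = position_jitter(width = 0.3, height = 0))

p_jitter
```

Setting height = 0 keeps every point at its true Sepal.Length and spreads the points only sideways, which is often preferable when the y-axis carries the measured values.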