Masterclass: Data Visualization with ggplot2

A Systematic Guide from Basics to Publication-Ready Plots

Author

Abdullah Al Shamim

Published

February 7, 2026

Introduction

Data visualization is more than aesthetics; it is a tool for storytelling. In this notebook, we use ggplot2 to build professional-grade visualizations layer by layer.


1. Environment Setup

We begin by loading the tidyverse suite and configuring professional typography using showtext.

Code
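library(tidyverse)      # ggplot2, dplyr, tidyr, forcats (gss_cat), and the %>% pipe
library(showtext)       # Renders custom fonts in plots
library(sysfonts)       # font_add_google(); also attached by showtext
library(palmerpenguins) # penguins dataset
library(car)            # Attaches carData, which provides the Salaries dataset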

# Adding Google Font "Poppins"
font_add_google("Poppins", "poppins")
showtext_auto()
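
font_add_google() downloads the font files, so the setup above needs an internet connection. As a minimal sketch, assuming a failed download raises an error and that falling back to showtext's built-in "sans" family is acceptable, the call can be wrapped in tryCatch(). The variable base_font is a hypothetical name introduced here for illustration; it is not part of the original notebook.

```r
library(sysfonts)
library(showtext)

# Try to register Poppins; fall back to the built-in "sans" family if the
# download fails (for example, when working offline).
base_font <- tryCatch(
  {
    font_add_google("Poppins", "poppins")
    "poppins"
  },
  error = function(e) {
    message("Could not download Poppins; using 'sans' instead.")
    "sans"
  }
)

showtext_auto()

# base_font can then be passed to a theme,
# e.g. theme_minimal(base_family = base_font)
```

With this in place, plots still render when the font cannot be fetched, at the cost of a different typeface.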

2. Distribution at a Glance: Histograms

Objective: To visualize the distribution of a continuous variable.

Code
iris %>% 
  ggplot(aes(Sepal.Length)) + 
  geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  theme_minimal(base_family = "poppins") +
  labs(title = "Sepal Length Distribution", x = "Length", y = "Frequency")

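Explanation: geom_histogram() divides the range of the variable into equal-width ‘bins’ (20 here) and counts the observations in each, which makes skewness, multiple peaks, or an approximately normal shape easy to spot. The apparent shape depends on the number of bins, so it is worth trying a few values.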
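
To see what the binning actually produces, the sketch below sets the bin width directly instead of the bin count and then reads back the counts ggplot2 computed. The binwidth of 0.25 and the boundary of 4 are illustrative choices, not values from the original notebook, and the default font is used so that the block runs without showtext.

```r
library(ggplot2)

# Fixed-width bins: each bin is 0.25 units wide and bin edges align with 4.0.
p_hist <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.25, boundary = 4,
                 fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Sepal Length Distribution (binwidth = 0.25)",
       x = "Length", y = "Frequency")

# layer_data() returns the values ggplot2 computed for the layer,
# including the edges and the count of every bin.
bin_table <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
head(bin_table)

p_hist
```

Setting binwidth tends to be easier to describe in a caption ("bins of 0.25 cm") than a bin count, which is one reason to prefer it for published figures.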


3. Comparing Categories: Bar Chart

Objective: To compare the frequencies of different categorical groups.

Code
gss_cat %>%   
  ggplot(aes(marital)) +   
  geom_bar(fill = "skyblue", color = "black") +   
  theme_minimal(base_family = "poppins") +
  labs(title = "Marital Status Count", x = "Status", y = "Count")
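
geom_bar() counts the rows in each category itself. A possible variation, sketched here under the assumption that sorted, horizontal bars are wanted, is to count first with dplyr::count() and then draw the pre-computed values with geom_col(). The object names marital_counts and p_bar are introduced for illustration only.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Count rows per category, then order the factor levels by that count.
marital_counts <- gss_cat %>%
  count(marital, name = "n") %>%
  mutate(marital = fct_reorder(marital, n))

# geom_col() plots the values in `n` as they are; no counting happens here.
p_bar <- ggplot(marital_counts, aes(n, marital)) +
  geom_col(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Marital Status Count (sorted)", x = "Count", y = "Status")

p_bar
```

Putting the categories on the y-axis keeps long labels readable, and ordering by count makes the ranking visible at a glance.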


4. Data Summary & Outliers: Box Plots

Objective: To identify the median, spread, and outliers within categories.

Code
chickwts %>%    
  ggplot(aes(weight, feed, fill = feed)) + 
  geom_boxplot(alpha = 0.6) +   
  theme_minimal(base_size = 15, base_family = "poppins") +   
  labs(title = "Chicken Weight by Feed Type", x = "Weight", y = "Feed Type")

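Explanation: The line inside each box marks the median, and the box spans the interquartile range (IQR), from the 25th to the 75th percentile. The whiskers extend to the most extreme values within 1.5 × IQR of the box; points beyond the whiskers are drawn individually as outliers.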
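
The quantities a box plot draws can also be computed directly. The following is a minimal sketch, assuming the default settings of geom_boxplot() (the standard quantile definition and a whisker coefficient of 1.5); the column names in box_stats are chosen here for illustration.

```r
library(dplyr)

# Per-feed summary of the values a box plot encodes.
box_stats <- chickwts %>%
  group_by(feed) %>%
  summarise(
    q1          = quantile(weight, 0.25, names = FALSE),  # lower edge of the box
    med         = median(weight),                         # line inside the box
    q3          = quantile(weight, 0.75, names = FALSE),  # upper edge of the box
    iqr         = q3 - q1,
    lower_fence = q1 - 1.5 * iqr,  # whiskers cannot extend below this value
    upper_fence = q3 + 1.5 * iqr,  # whiskers cannot extend above this value
    n_outliers  = sum(weight < lower_fence | weight > upper_fence)
  )

box_stats
```

A table like this is useful alongside the plot when exact medians or the number of flagged points need to be reported.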


6. Determining Relationships: Scatter Plots & Regression

Objective: To visualize correlation and trend lines between two numerical variables.

Code
penguins %>% 
  drop_na(body_mass_g, flipper_length_mm) %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", color = "red", se = TRUE) + 
  facet_wrap(~species) + 
  theme_light(base_size = 15, base_family = "poppins") +
  labs(title = "Flipper Length vs Body Mass", subtitle = "Linear Regression by Species",
       x = "Flipper Length (mm)", y = "Body Mass (g)")
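
Because the plot is faceted, geom_smooth(method = "lm") fits a separate line in each panel. As a rough numerical companion, the sketch below computes the correlation and the fitted slope for each species with the same simple linear model. The object name species_fits is introduced here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# The explicit namespace avoids ambiguity with any other object named `penguins`.
species_fits <- palmerpenguins::penguins %>%
  filter(!is.na(body_mass_g), !is.na(flipper_length_mm)) %>%
  group_by(species) %>%
  summarise(
    n     = n(),
    # Pearson correlation between the two plotted variables
    r     = cor(flipper_length_mm, body_mass_g),
    # Slope of body mass on flipper length: grams per additional millimetre
    slope = coef(lm(body_mass_g ~ flipper_length_mm))[["flipper_length_mm"]]
  )

species_fits
```

Reporting the slope next to the figure tells the reader how steep each trend line is, which is hard to judge by eye across panels.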


7. Professional Customization (Advanced Plot)

Objective: To create a publication-ready visualization with polished axes and themes.

Code
Salaries %>%
  filter(salary < 220000) %>% 
  ggplot(aes(rank, salary, fill = sex)) +
  geom_boxplot(alpha = 0.5) +
  scale_y_continuous(labels = c("$50k", "$100k", "$150k", "$200k"), 
                     breaks = c(50000, 100000, 150000, 200000)) +
  scale_x_discrete(labels = c("AsstProf" = "Assistant\nProfessor", 
                              "AssocProf" = "Associate\nProfessor", 
                              "Prof" = "Professor")) +
  theme_minimal(base_size = 15, base_family = "poppins") +
  theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Faculty Salary Analysis", fill = "Gender")
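
Two further steps often matter for publication: generating axis labels from the data rather than typing them by hand, and exporting at a fixed size and resolution. The sketch below is one way to do both, assuming the scales and carData packages are installed. The output size, the 300 dpi setting, and the names p_salary and out_file are placeholders introduced here, and the default font is used so the block runs without showtext.

```r
library(ggplot2)
library(scales)

# Assumes the carData package is installed; it ships the Salaries dataset.
data("Salaries", package = "carData")

p_salary <- ggplot(Salaries, aes(rank, salary, fill = sex)) +
  geom_boxplot(alpha = 0.5) +
  # label_dollar() builds "$50k"-style labels from whatever breaks are drawn.
  scale_y_continuous(labels = label_dollar(scale = 1 / 1000, suffix = "k")) +
  theme_minimal(base_size = 15) +
  theme(legend.position = "top") +
  labs(title = "Faculty Salary Analysis",
       x = "Rank", y = "Salary", fill = "Gender")

# Write a 300-dpi PNG to a temporary file; replace with a real path as needed.
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p_salary,
       width = 8, height = 6, units = "in", dpi = 300)
```

When showtext is active, calling showtext_opts(dpi = 300) before ggsave() is usually needed so that text is rendered at the intended size in the exported file.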


Quick Reference Checklist

| Chart Type | Best Use Case | Function |
|---|---|---|
| Histogram | Distribution of a single numerical variable. | geom_histogram() |
| Bar Chart | Frequency of categorical data. | geom_bar() |
| Scatter Plot | Relationship/correlation between two numerical variables. | geom_point() |
| Boxplot | Distribution summary and outlier detection. | geom_boxplot() |
