Masterclass: Data Visualization with ggplot2

Introduction

Data visualization is more than aesthetics; it is a tool for storytelling. This notebook uses ggplot2 to build professional-grade visualizations layer by layer, starting from a dataset, adding geometry, and finishing with themes and labels.

1. Environment Setup

We begin by loading the tidyverse suite and configuring professional typography using showtext.

Code

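library(tidyverse)
library(showtext)
library(sysfonts)
library(palmerpenguins)  # For the penguins data
library(carData)         # For the Salaries data

# Register the Google Font "Poppins" under the family name "poppins"
# (the font is downloaded, so this step needs an internet connection)
font_add_google("Poppins", "poppins")
# Use showtext to render text in all subsequent plots
showtext_auto()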
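
Every plot in this notebook passes base_family = "poppins" to its theme. One optional way to avoid repeating that is to set a session-wide default with theme_set(). The sketch below assumes the "poppins" family has already been registered as above; the base size of 15 is simply the value used in later sections, and old_theme is an illustrative name.

```r
library(ggplot2)

# Set a session-wide default theme. Plots that do not add their own
# theme_*() call will pick this up. Assumes the "poppins" family was
# registered with font_add_google() beforehand.
old_theme <- theme_set(theme_minimal(base_size = 15, base_family = "poppins"))

# theme_set() returns the previous default, so it can be restored later:
# theme_set(old_theme)
```

Plots that call theme_minimal() or theme_light() explicitly, as the sections below do, still override this default.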

2. Distribution at a Glance: Histograms

Objective: To visualize the distribution of a continuous variable.

Code

iris %>% 
  ggplot(aes(Sepal.Length)) + 
  geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  theme_minimal(base_family = "poppins") +
  labs(title = "Sepal Length Distribution", x = "Length", y = "Frequency")

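Explanation: geom_histogram() groups the values into ‘bins’ and plots the count in each one, which makes skewness, multiple peaks, or an approximately normal shape easy to spot. The picture depends on the binning (here bins = 20), so it is worth trying a few settings before drawing conclusions.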
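
Because the shape of a histogram depends on the binning, it can help to fix the bin width in data units instead of the bin count. The following is a minimal sketch; the width of 0.25 and the object names p_hist and bins are illustrative choices, not part of the original example.

```r
library(ggplot2)

# binwidth fixes the width of each bin in data units (0.25 cm here, an
# illustrative choice); boundary = 0 aligns bin edges to multiples of it.
p_hist <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.25, boundary = 0,
                 fill = "steelblue", color = "white") +
  labs(title = "Sepal Length Distribution", x = "Length (cm)", y = "Frequency")

# layer_data() returns the computed bins, which is handy for checking
# what the histogram actually counted.
bins <- layer_data(p_hist)
head(bins[, c("xmin", "xmax", "count")])
```

Printing p_hist draws the plot; inspecting bins shows the bin edges and counts behind it.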

3. Comparing Categories: Bar Chart

Objective: To compare the frequencies of different categorical groups.

Code

gss_cat %>%
  ggplot(aes(marital)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal(base_family = "poppins") +
  labs(title = "Marital Status Count", x = "Status", y = "Count")
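
geom_bar() counts the rows in each category and draws the bars in factor-level order. When the categories have no natural order, sorting them by frequency often makes comparisons easier. A possible sketch using forcats::fct_infreq() follows; gss_ordered and p_bar are illustrative names.

```r
library(ggplot2)
library(forcats)  # provides gss_cat and fct_infreq()

# Reorder the marital levels by how often each occurs, most frequent first.
gss_ordered <- gss_cat
gss_ordered$marital <- fct_infreq(gss_ordered$marital)

p_bar <- ggplot(gss_ordered, aes(marital)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Marital Status Count", x = "Status", y = "Count")
```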

4. Data Summary & Outliers: Box Plots

Objective: To identify the median, spread, and outliers within categories.

Code

chickwts %>%
  # Mapping the numeric variable to x and the category to y gives horizontal boxes
  ggplot(aes(x = weight, y = feed, fill = feed)) +
  geom_boxplot(alpha = 0.6) +
  theme_minimal(base_size = 15, base_family = "poppins") +
  labs(title = "Chicken Weight by Feed Type", x = "Weight", y = "Feed Type")

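Explanation: The line inside each box marks the median, and the box spans the first to third quartile (the interquartile range, IQR). By default the whiskers extend to the most extreme values within 1.5 × IQR of the box, and any points beyond them are drawn as individual dots and treated as outliers.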
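
To see which observations the box plot would draw as outlier dots, the same 1.5 × IQR rule can be applied by hand. This is a minimal sketch of that rule using dplyr; the column names q1, q3, iqr, and is_outlier are introduced here for illustration.

```r
library(dplyr)

# For each feed type, compute the quartiles and flag weights that lie more
# than 1.5 * IQR below the first quartile or above the third quartile.
outliers <- chickwts %>%
  group_by(feed) %>%
  mutate(
    q1 = quantile(weight, 0.25),
    q3 = quantile(weight, 0.75),
    iqr = q3 - q1,
    is_outlier = weight < q1 - 1.5 * iqr | weight > q3 + 1.5 * iqr
  ) %>%
  ungroup() %>%
  filter(is_outlier)

outliers
```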

5. Trends Over Time: Time-series

Objective: To observe how a variable fluctuates over time.

Code

economics %>% 
  drop_na() %>% 
  ggplot(aes(date, psavert)) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(color = "steelblue", linewidth = 1) +
  theme_minimal(base_size = 15, base_family = "poppins") +
  labs(title = "Personal Savings Rate Over Time", x = "Year", y = "Savings Rate (%)")
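
When the x aesthetic is a Date column, as it is for economics$date, the axis ticks can be controlled with scale_x_date(). The sketch below is one possible setup; the ten-year spacing is an arbitrary choice, p_ts is an illustrative name, and linewidth assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

p_ts <- ggplot(economics, aes(date, psavert)) +
  geom_line(color = "steelblue", linewidth = 1) +
  # One labelled tick per decade, showing the four-digit year only.
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  labs(title = "Personal Savings Rate Over Time", x = "Year", y = "Savings Rate (%)")
```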

6. Determining Relationships: Scatter Plots & Regression

Objective: To visualize correlation and trend lines between two numerical variables.

Code

penguins %>% 
  drop_na(body_mass_g, flipper_length_mm) %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.7, size = 3) +
  # One straight-line fit per facet; se = TRUE adds the confidence band
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = TRUE) + 
  facet_wrap(~species) + 
  theme_light(base_size = 15, base_family = "poppins") +
  labs(title = "Flipper Length vs Body Mass", subtitle = "Linear Regression by Species",
       x = "Flipper Length (mm)", y = "Body Mass (g)", color = "Species")
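
The red lines drawn by geom_smooth(method = "lm") are ordinary least-squares fits, one per facet. To report the numbers behind them, the same model can be fitted per species with lm(). A minimal base-R sketch follows; complete_cases and slopes are illustrative names.

```r
library(palmerpenguins)

# Keep only rows where both variables are present, as in the plot.
complete_cases <- penguins[!is.na(penguins$body_mass_g) &
                             !is.na(penguins$flipper_length_mm), ]

# Fit body mass on flipper length separately for each species and
# extract the slope (grams of body mass per millimetre of flipper).
slopes <- vapply(
  split(complete_cases, complete_cases$species),
  function(d) coef(lm(body_mass_g ~ flipper_length_mm, data = d))[["flipper_length_mm"]],
  numeric(1)
)

slopes
```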

7. Professional Customization (Advanced Plot)

Objective: To create a publication-ready visualization with polished axes and themes.

Code

Salaries %>%
  filter(salary < 220000) %>% 
  ggplot(aes(rank, salary, fill = sex)) +
  geom_boxplot(alpha = 0.5) +
  scale_y_continuous(labels = c("$50k", "$100k", "$150k", "$200k"), 
                     breaks = c(50000, 100000, 150000, 200000)) +
  scale_x_discrete(labels = c("AsstProf" = "Assistant\nProfessor", 
                              "AssocProf" = "Associate\nProfessor", 
                              "Prof" = "Professor")) +
  theme_minimal(base_size = 15, base_family = "poppins") +
  theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Faculty Salary Analysis", fill = "Gender")
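
The axis labels above are typed by hand, so they must be kept in sync with the breaks. As an alternative, the scales package can generate the same style of label from the values. This is a sketch under the assumption that scales is installed (it is a ggplot2 dependency); dollars_k is a hypothetical helper name.

```r
library(scales)

# Build a labelling function: scale = 1e-3 converts dollars to thousands,
# and the "$" prefix plus "k" suffix match the hand-written labels.
dollars_k <- label_dollar(accuracy = 1, scale = 1e-3, suffix = "k")

dollars_k(c(50000, 100000, 150000, 200000))
```

Passing it as scale_y_continuous(labels = dollars_k, breaks = c(50000, 100000, 150000, 200000)) would keep the labels correct if the breaks change later.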

Quick Reference Checklist

Chart Type	Best Use Case	Function
Histogram	Distribution of a single numerical variable.	`geom_histogram()`
Bar Chart	Frequency of categorical data.	`geom_bar()`
Boxplot	Distribution summary and outlier detection.	`geom_boxplot()`
Line Chart	Change in a numerical variable over time.	`geom_line()`
Scatter Plot	Relationship/correlation between two numerical variables.	`geom_point()`
