IV. Data Visualisation

Introduction to `ggplot2`

ggplot2 is an R package dedicated to data visualization. It can greatly improve the quality and aesthetics of your graphics and make you much more efficient at creating them.
ggplot2 lets you build almost any type of chart. The R Graph Gallery focuses on it, so almost every section there starts with ggplot2 examples.

Basic syntax and structure of `ggplot2`

In this section, we’ll explore a series of videos that highlight key concepts related to data visualization using ggplot2 and the Grammar of Graphics. These videos will help you appreciate the principles behind effective data visualization and how ggplot2 can be leveraged to create powerful visual representations of data.

Appreciating Grammar of Graphics - This video introduces the foundational concept of the Grammar of Graphics, which is the theoretical framework behind ggplot2.

New York Times example - The New York Times is renowned for its exemplary data visualizations. In this video, we analyze an example of how ggplot2 can be used to replicate the style and clarity of a New York Times chart.

ggplot2 nuanced example - This video dives deeper into the nuances of ggplot2.

`ggplot2` cheatsheet

Understanding this grammar will enable you to build complex and meaningful visualizations by combining different components in a structured way.

Newer version (maintained by Posit)

Older version of cheatsheet

Creating simple plots (scatter plots, bar charts, histograms)

Note that the ggplot2 package must be installed and loaded before you run the examples.

# install.packages("ggplot2")

# Load necessary libraries
library(ggplot2)

When using the ggplot2 package, every chart starts with a call to the ggplot() function.

1. Scatter Plot

Scatter plots are useful for visualizing the relationship between two continuous variables. For example, we can plot mpg (miles per gallon) against hp (horsepower).

# Scatter plot: mpg vs hp
ggplot(mtcars, 
       aes(x = hp, y = mpg)) +
  geom_point()

A ggplot2 chart can be saved as an object and called upon later.

chart1 <-
ggplot(mtcars, 
       aes(x = hp, y = mpg)) +
  geom_point()
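
Because a saved chart is an ordinary R object, further layers can be added to it later with `+`. The sketch below is only an illustration: the name chart1_labelled is made up for this example, and the linear trend line is just one possible extra layer.

```r
library(ggplot2)

# Save the base scatter plot to an object
chart1 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# Extend the saved object with extra layers; chart1 itself is left unchanged
chart1_labelled <- chart1 +
  geom_smooth(method = "lm", se = FALSE) +  # straight-line trend, no confidence band
  labs(title = "mpg vs hp with a linear trend")

# Typing the object name (or calling print) draws the chart
print(chart1_labelled)
```

Keeping the base chart in one object and the variations in others makes it easy to compare versions without retyping the shared code.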

ggplot2 charts can be exported as an image as well.

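?ggsave  # open the help page for ggsave()

# Save the plot to a PNG file (the images/ folder must already exist)
ggsave(filename = "images/scatter_plot_mtcars.png",
       plot     = chart1,
       width    = 8,    # inches
       height   = 6,    # inches
       dpi      = 300   # dots per inch
       )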

Exporting to Different Formats

You can specify different file extensions in the filename argument to save the plot in various formats like PNG, JPEG, PDF, SVG.

Adjusting Size and Resolution

width: Width of the saved image in inches.
height: Height of the saved image in inches.
dpi: Resolution of the image in dots per inch (only for raster formats like PNG and JPEG).
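
For raster formats, the pixel dimensions of the saved file are the size in inches multiplied by the dpi. The sketch below works that out for the values used above and then saves the same chart as a PDF. The temporary output path is an assumption made here so the example does not depend on an existing folder.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

width_in  <- 8
height_in <- 6
dpi       <- 300

# Pixel size of a raster export = inches * dots per inch
pixel_size <- c(width = width_in * dpi, height = height_in * dpi)
print(pixel_size)  # 2400 x 1800 pixels

# The file extension selects the format; here a vector PDF in a temporary folder
out_pdf <- file.path(tempdir(), "scatter_plot_mtcars.pdf")
ggsave(filename = out_pdf, plot = p, width = width_in, height = height_in)

file.exists(out_pdf)
```

Vector formats such as PDF and SVG scale without loss, so dpi matters mainly for PNG and JPEG output.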

2. Histogram

Histograms are used to show the distribution of a single continuous variable by dividing it into bins. We can create a histogram of mpg to see its distribution.

# Histogram: Distribution of mpg
ggplot(data = mtcars, 
       mapping = aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  labs(title = "Histogram: Distribution of Miles per Gallon (mpg)",
       x = "Miles per Gallon (mpg)",
       y = "Frequency") +
  theme_minimal()
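
To see what the binning step actually produces, the computed layer data can be inspected. This is a rough sketch that uses layer_data() from ggplot2; the object names p_hist and bins are made up for the example.

```r
library(ggplot2)

p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# layer_data() returns the data ggplot2 computed for a layer (here: the bins)
bins <- layer_data(p_hist, i = 1)

# One row per bin: lower edge, upper edge, and number of cars in the bin
print(bins[, c("xmin", "xmax", "count")])

# Every car lands in exactly one bin, so the counts add up to the row count
sum(bins$count) == nrow(mtcars)
```

Changing binwidth and re-running the last lines is a quick way to see how the choice of bins reshapes the histogram.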

Density plots are used to visualize the distribution of a continuous variable and estimate its probability density function.

# Density plot: Distribution of miles per gallon (mpg)
ggplot(mtcars, aes(x = mpg)) +
  geom_density()

Box plots show the distribution of a continuous variable and highlight the median, the quartiles (Q1 and Q3), and potential outliers (points more than 1.5 times the IQR below Q1 or above Q3).

# Box plot: Distribution of miles per gallon (mpg)
ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot()
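
The 1.5 times IQR rule can be checked by hand in base R. The sketch below assumes the default quantile() method and the default multiplier of 1.5 (the coef argument of geom_boxplot()); the variable names are chosen for this example only.

```r
mpg <- mtcars$mpg

# Quartiles computed with R's default quantile method
q   <- quantile(mpg, probs = c(0.25, 0.50, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points beyond these are drawn individually as potential outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- mpg[mpg < lower_fence | mpg > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Note that the rule is built from the quartiles and the IQR; the mean plays no part in it.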

3. Bar Chart

Bar charts are used to show the frequency of categorical data. We can create a bar chart of the number of cars by the number of cylinders (cyl).

# Bar chart: Number of cars by number of cylinders
ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl))) +
  geom_bar()

Customizing Plots

Adding titles, labels, and themes

ggplot2 offers extensive options for customizing plots to make them more informative and visually appealing. Key aspects include adding titles, axis labels, annotations, and choosing suitable themes.

1. Adding Titles and Labels

Title: Provides a descriptive title for the plot.
X and Y Labels: Label the axes to indicate what data they represent.

chart1  # print the chart saved earlier in the document

# Scatter plot: mpg vs hp with a title and axis labels
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
       x = "Horsepower (hp)",
       y = "Miles per Gallon (mpg)")

2. Choosing Themes

Themes: Control the overall appearance of the plot, such as background color, grid lines, and font sizes.
Check out the ggplot2 themes.

# Scatter plot: mpg vs hp with a minimal theme
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
       x = "Horsepower (hp)",
       y = "Miles per Gallon (mpg)") +
  theme_minimal()

3. Adding Annotations (advanced)

Annotations: Text or markers added to specific locations on the plot to highlight important points or add additional information.

chart1  # print the chart saved earlier in the document

# Scatter plot: mpg vs hp with general annotation
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot: mpg vs hp",
       x = "Horsepower (hp)",
       y = "Miles per Gallon (mpg)") +
  theme_minimal() +
  annotate("text", x = 200, y = 30, label = "High HP & MPG", size = 4, vjust = 1)
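
Annotation coordinates can also be taken from the data rather than typed in by hand. The sketch below labels the car with the highest mpg; the names best and p_annot are made up for this example.

```r
library(ggplot2)

# Pick the row to highlight: the car with the highest mpg
best <- mtcars[which.max(mtcars$mpg), ]

p_annot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # Place the label at that car's own coordinates, nudged to the right of the point
  annotate("text", x = best$hp, y = best$mpg,
           label = rownames(best), hjust = -0.1, size = 4) +
  theme_minimal()

print(p_annot)
```

Tying the label to a data row keeps the annotation in the right place if the data or the axis limits change.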

Customizing colors and themes

1. Customizing Colors

Points and Lines: You can change the color of points, lines, and other plot elements to enhance visibility or match a specific color scheme.
Fills: Customize the fill color of areas such as bars or regions in density plots.

Lots of colors in ggplot2.

Scales: When creating visualizations, you may want to customize the colors used for different data groups. The scale_color_manual(), scale_fill_manual(), and other scale functions let you define the colors used in your plots manually, giving you full control over their appearance.

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = c("firebrick", "slateblue", "green4"))
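
When the colors are given as an unnamed vector, they are assigned to the groups in order. A named vector ties each color to a specific group instead, which is a little more robust. This is a small sketch of that variant; the legend title and the object name p_named are choices made for this example.

```r
library(ggplot2)

# Names must match the factor levels of cyl ("4", "6", "8")
cyl_colors <- c("4" = "firebrick", "6" = "slateblue", "8" = "green4")

p_named <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = cyl_colors, name = "Cylinders")

print(p_named)
```

The same named vector can be passed to scale_fill_manual() so that bars and points share one color scheme.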

2. Customizing Themes

Themes: Control the overall appearance of your plot, including background colors, grid lines, text sizes, and more.
Built-in Themes: ggplot2 provides several built-in themes such as theme_minimal(), theme_classic(), and theme_light().
Custom Themes: You can create your own themes or modify existing ones to better suit your presentation needs.

# Scatter plot: mpg vs hp with custom colors and annotation
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkred") +
  labs(title = "Scatter Plot: mpg vs hp",
       x = "Horsepower (hp)",
       y = "Miles per Gallon (mpg)") +
  annotate("text", x = 200, y = 30, label = "High HP & MPG", color = "blue", size = 4, vjust = 1) +
  theme_minimal()
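
A custom theme can be as simple as a built-in theme plus a few theme() overrides stored in an object. The sketch below is one possible setup; my_theme and the specific settings are illustrative choices, not a recommended house style.

```r
library(ggplot2)

# Start from a built-in theme and override a few elements
my_theme <- theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),  # bold plot title
    panel.grid.minor = element_blank(),              # hide minor grid lines
    legend.position  = "bottom"                      # move any legend below the plot
  )

# Reuse the same theme object on any chart
p_themed <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkred") +
  labs(title = "Scatter Plot: mpg vs hp") +
  my_theme

print(p_themed)
```

Storing the theme once and adding it to each chart keeps a set of figures visually consistent.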

# Density plot: Distribution of miles per gallon (mpg)
ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "lightblue", color = "black", alpha = 0.5) +
  labs(title = "Density Plot: Distribution of Miles per Gallon (mpg)",
       x = "Miles per Gallon (mpg)",
       y = "Density") +
  theme_minimal()

# Box plot: Distribution of miles per gallon (mpg) for all cars
ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot: Distribution of Miles per Gallon (mpg)",
       x = NULL,  # no grouping variable on the x-axis
       y = "Miles per Gallon (mpg)") +
  theme_minimal()

# Bar chart: Number of cars by number of cylinders
ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl))) +
  geom_bar(fill = "lightblue") +
  labs(title = "Bar Chart: Number of Cars by Cylinders",
       x = "Number of Cylinders",
       y = "Count") +
  theme_minimal()

You can split one variable by another to explore patterns.

# Box plot: Distribution of miles per gallon (mpg) by number of cylinders (cyl)
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot: Distribution of Miles per Gallon (mpg) by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles per Gallon (mpg)") +
  theme_minimal()

Distributions

Let’s create and plot one discrete distribution (Poisson distribution) and one continuous distribution (Normal distribution) using ggplot2.

We’ll generate the x values and compute the corresponding y values using density functions - dpois for the Poisson distribution and dnorm for the Normal distribution. Then we will plot them.

Base R Cheat Sheet.
To see other distributions and their identifying parameters, type ?distribution in R.

?distribution

1. Poisson Distribution (Discrete)

The Poisson distribution is commonly used to model the number of events occurring within a fixed interval of time or space.

Defined by \(\lambda\), the rate parameter.
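
The probability mass function behind dpois can be written out by hand as \(\lambda^x e^{-\lambda} / x!\). The short check below compares that formula with R's built-in function; the variable names are chosen for this example.

```r
lambda <- 5
x <- 0:20

# Probability mass function written out by hand: lambda^x * exp(-lambda) / x!
manual_pmf <- lambda^x * exp(-lambda) / factorial(x)

# The hand-written formula agrees with dpois()
print(all.equal(manual_pmf, dpois(x, lambda = lambda)))

# Probabilities over 0..20 sum to just under 1; the remainder sits in the tail above 20
print(sum(manual_pmf))
```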

# Generate x values for Poisson distribution (discrete)
x_pois <- 0:20  # Poisson counts are non-negative integers

# Generate y values for Poisson distribution with lambda = 5
y_pois <- dpois(x = x_pois, lambda = 5)


# Plotting the Poisson distribution
ggplot(mapping = aes(x = x_pois, y = y_pois)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "blue") +
  labs(title = "Poisson Distribution (λ = 5)",
       x = "Number of Events",
       y = "Probability") +
  theme_minimal()

The stat parameter within a geom function specifies the statistical transformation to be applied to the data before plotting.
- The stat = "identity" argument is used when you want the raw data to be plotted directly, without any statistical transformation.

DETAILED EXPLANATION

geom_bar(stat = "identity"): When creating a bar chart with geom_bar(), the default behavior is to count the number of occurrences of each category and plot these counts as the heights of the bars. This is useful when you’re working with categorical data, where you want to show the frequency of each category.

However, when you already have the y-values (e.g., probabilities, counts, or some other metric) that you want to plot, you use stat = "identity" to tell ggplot2 to use these values directly rather than performing a count or another transformation.
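
The difference between the two behaviors can be seen side by side. In this sketch the counts are first computed with table() and then plotted with stat = "identity"; the data frame cyl_counts and the plot names are made up for the example.

```r
library(ggplot2)

# Default: geom_bar() counts the rows in each category itself
p_count <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Pre-computed counts: one row per category, heights stored in Freq
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
print(cyl_counts)  # 4 -> 11 cars, 6 -> 7 cars, 8 -> 14 cars

# stat = "identity" plots the supplied heights as they are
p_identity <- ggplot(cyl_counts, aes(x = cyl, y = Freq)) +
  geom_bar(stat = "identity")

print(p_count)
print(p_identity)
```

Both charts show the same bars. geom_col() is a shorthand in ggplot2 for geom_bar(stat = "identity").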

2. Normal Distribution (Continuous)

The Normal distribution is often used to represent real-valued random variables with a symmetric, bell-shaped distribution.

Defined by \(\mu\) and \(\sigma\), the mean and standard deviation parameters.
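
To get a feel for the two parameters, several Normal curves can be drawn on one chart. This is a minimal sketch: the three parameter combinations and the column name params are arbitrary choices for illustration, and linewidth assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

x <- seq(from = -6, to = 6, by = 0.01)

# One block of rows per curve, stacked into a single long data frame
curves <- rbind(
  data.frame(x = x, density = dnorm(x, mean = 0, sd = 1), params = "mean = 0, sd = 1"),
  data.frame(x = x, density = dnorm(x, mean = 0, sd = 2), params = "mean = 0, sd = 2"),
  data.frame(x = x, density = dnorm(x, mean = 2, sd = 1), params = "mean = 2, sd = 1")
)

# Changing the mean shifts the curve; changing the sd widens and flattens it
p_norm <- ggplot(curves, aes(x = x, y = density, color = params)) +
  geom_line(linewidth = 1) +
  labs(x = "Value", y = "Density", color = "Parameters") +
  theme_minimal()

print(p_norm)
```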

# Generate x values for Normal distribution (continuous)
x_norm <- seq(from = -4, to = 4, by = 0.01)

# Generate y values for Normal distribution with mean = 0 and sd = 1
y_norm <- dnorm(x = x_norm, mean = 0, sd = 1)

# Plotting the Normal distribution
# (linewidth replaces the size argument for lines, deprecated in ggplot2 3.4.0)
ggplot(mapping = aes(x = x_norm, y = y_norm)) +
  geom_line(color = "red", linewidth = 1) +
  labs(title = "Standard Normal Distribution (mean = 0, sd = 1)",
       x = "Value",
       y = "Density") +
  theme_minimal()

Advanced Visualizations

Faceting and creating multi-panel plots

Reshaping Data into long form for faceting

Faceting is a technique in data visualization where you create multiple plots based on a variable’s values. It helps in visualizing data distributions or relationships across different subsets of the data. In ggplot2, faceting is achieved using functions like facet_wrap() and facet_grid().
The reshape2 package provides tools to reshape data, which is essential for preparing data for faceting in ggplot2. Specifically, the melt() function is used to convert data from wide to long format.

Wide Format: Each variable is in a separate column.
Long Format: All values of a variable are in a single column, with an additional column indicating the variable name.
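
A tiny example makes the difference between the two formats concrete. The data frame below is made up for illustration, and id.vars tells melt() which column identifies each row and should be kept as it is.

```r
library(reshape2)

# Wide format: one column per variable (values invented for this example)
wide <- data.frame(
  car = c("A", "B"),
  mpg = c(21.0, 22.8),
  hp  = c(110, 93)
)

# Long format: one row per car-variable pair
long <- melt(wide, id.vars = "car")

print(wide)
print(long)
```

The two measured columns become two columns called variable and value, so two cars with two measurements each turn into four rows.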

Visualization of data in long form with faceting command

library(reshape2)
?melt  # open the help page for melt()

df <- mtcars

head(df)                     # wide format

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

df_ggplot <- melt(df)        # long format

No id variables; using all as measure variables

head(df_ggplot)

  variable value
1      mpg  21.0
2      mpg  21.0
3      mpg  22.8
4      mpg  21.4
5      mpg  18.7
6      mpg  18.1

tail(df_ggplot)

    variable value
347     carb     2
348     carb     2
349     carb     4
350     carb     6
351     carb     8
352     carb     2

ggplot(data    = df_ggplot,
       mapping = aes(x = value)) +
  geom_histogram() +
  facet_wrap(facets = ~ variable,
             scales = "free")  # x and y axis scales vary for each subchart

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
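
Faceting also works directly on a grouping column, without reshaping first. The sketch below contrasts facet_wrap() with facet_grid() on mtcars; the choice of cyl and am as facet variables is just an example.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# facet_wrap(): one panel per value of a single variable, wrapped into rows
p_wrap <- base_plot + facet_wrap(~ cyl)

# facet_grid(): rows for one variable (am), columns for another (cyl)
p_grid <- base_plot + facet_grid(am ~ cyl)

print(p_wrap)
print(p_grid)
```

facet_wrap() suits a single variable with many levels, while facet_grid() lays out every combination of two variables in a fixed grid.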

You can keep on controlling different aspects of the chart.

ggplot(data    = df_ggplot,
       mapping = aes(x = value)) +
  geom_histogram() +
  facet_wrap(facets = ~ variable,
             scales = "free") +
  labs(title = "Customized Histogram Facets",  # Add a plot title
       x = NULL,                               # Remove x-axis label
       y = "Frequency") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),  # Rotate x-axis labels, adjust size
    strip.text = element_text(size = 14),                          # Adjust size of facet labels
    panel.grid.major = element_line(color = "grey80"),             # Customize major grid lines
    panel.grid.minor = element_line(color = "grey90")              # Customize minor grid lines
  )

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plotting time series data

Let's get US GDP time series data from the FRED® API. This web service lets developers write programs and build applications that retrieve economic data from the FRED® and ALFRED® websites, which are hosted by the Economic Research Division of the Federal Reserve Bank of St. Louis. Requests can be customized according to data source, release, category, series, and other preferences.

Loading the U.S. Gross Domestic Product (GDP) time series with the fredr package is useful for economic analysis because it pulls up-to-date data directly from FRED into your R environment.
Reading the help files or a blog post about an API is usually a good way to learn how to use it. Example: “Getting started with fredr” at https://cran.r-project.org/web/packages/fredr/vignettes/fredr.html

Step 1: Register for FRED API key.

Sign up for your own key and use it in place of the one below.
Signup Link for key: https://fred.stlouisfed.org/docs/api/fred/

remove(list=ls())

# install.packages("fredr")
library(fredr)

FRED_API_KEY <- "8a9ec1330374c1696f05cc8e526233b5"  # replace with your own key
fredr_set_key(FRED_API_KEY)
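
API keys are usually better kept out of scripts that get shared. One common pattern, sketched here, is to store the key in an environment variable (for example in a .Renviron file) and read it at run time. The helper get_fred_key() is hypothetical and written only for this example; it is not part of fredr.

```r
# Hypothetical helper: read the API key from an environment variable
get_fred_key <- function(var = "FRED_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set. ",
         "Add a line such as ", var, "=your_key to your .Renviron file.")
  }
  key
}

# Usage (requires the fredr package and a real key in the environment):
# fredr::fredr_set_key(get_fred_key())
```

With this approach the script can be shared or committed to version control without exposing the key itself.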

Step 2: Search for Data

Search for the five most popular series that match "gdp".

fredr_series_search_text(
    search_text = "gdp",
    order_by    = "popularity",
    sort_order  = "desc",
    limit       = 5
)

# A tibble: 5 × 16
  id         realtime_start realtime_end title observation_start observation_end
  <chr>      <chr>          <chr>        <chr> <chr>             <chr>          
1 GDP        2025-07-31     2025-07-31   Gros… 1947-01-01        2025-04-01     
2 GDPC1      2025-07-31     2025-07-31   Real… 1947-01-01        2025-04-01     
3 GFDEGDQ18… 2025-07-31     2025-07-31   Fede… 1966-01-01        2025-01-01     
4 PAYEMS     2025-07-31     2025-07-31   All … 1939-01-01        2025-06-01     
5 M2V        2025-07-31     2025-07-31   Velo… 1959-01-01        2025-04-01     
# ℹ 10 more variables: frequency <chr>, frequency_short <chr>, units <chr>,
#   units_short <chr>, seasonal_adjustment <chr>,
#   seasonal_adjustment_short <chr>, last_updated <chr>, popularity <int>,
#   group_popularity <int>, notes <chr>

Step 3: Import Data

Data comes straight to your environment.
Try changing the frequency to annual, or changing the dates.

GDP <-
fredr(
  series_id = "GDP",
  observation_start = as.Date("1950-01-01"),
  observation_end = as.Date("2023-01-01"), 
  frequency = "q" # quarterly
)

Step 4: Plot Data

Let's use ggplot2 to visualise the data.

ggplot(data = GDP, 
       mapping = aes(x = date,
                     y = value/1000, 
                     color = series_id)
       ) +
    geom_line() +
    labs(x = "Observation Date", 
         y = "GDP (in trillions of USD)", 
         color = "Series"
         )
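
Date axes can be tuned with scale_x_date(). The sketch below uses an invented quarterly series rather than the FRED download, so it runs without an API key; the data frame demo and its values are simulated purely for illustration.

```r
library(ggplot2)

set.seed(42)

# Simulated quarterly series (NOT real GDP data), 40 quarters from 2000 Q1
demo <- data.frame(
  date  = seq(as.Date("2000-01-01"), by = "quarter", length.out = 40),
  value = 10000 + cumsum(rnorm(40, mean = 100, sd = 50))
)

p_ts <- ggplot(demo, aes(x = date, y = value / 1000)) +
  geom_line() +
  # One tick every two years, labelled with the four-digit year
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  labs(x = "Observation Date", y = "Value (thousands)")

print(p_ts)
```

The same scale_x_date() layer can be added to the GDP chart, since the date column returned by fredr is a Date as well.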

Step 5: Model Data (Optional)

Once you have the data, you can apply time series models and generate forecasts.

# Install the forecast package if you haven't already
# install.packages("forecast")

# Load the forecast package
library(forecast)

# Fit an ARIMA model
fit <- auto.arima(GDP$value)

# Print the model
summary(fit)

# Forecast the next 24 periods (24 quarters, i.e. 6 years)
forecasted_values <- forecast(fit, h = 24)

# Plot the forecast
plot(forecasted_values)
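
Passing a plain numeric vector to a time series model drops the calendar information. One option, sketched below, is to wrap the values in a ts object with frequency = 4 so that the quarterly structure is kept. The series here is simulated rather than downloaded, and the forecasting step only runs if the forecast package is installed.

```r
set.seed(1)

# Simulated quarterly values (NOT real GDP data)
values <- 1000 + cumsum(rnorm(40, mean = 50, sd = 10))

# ts() attaches a start date and a frequency: 4 observations per year
gdp_ts <- ts(values, start = c(2000, 1), frequency = 4)
print(gdp_ts)

# Optional: fit and forecast only when the forecast package is available
if (requireNamespace("forecast", quietly = TRUE)) {
  fit_ts <- forecast::auto.arima(gdp_ts)
  fc     <- forecast::forecast(fit_ts, h = 8)  # next 8 quarters
  plot(fc)
}
```

With a ts object the forecast plot is labelled in years instead of observation numbers, and seasonal models become available to auto.arima().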

Appendix

ggplot2: Elegant Graphics for Data Analysis (3e) was written by Hadley Wickham, Danielle

https://ggplot2-book.org/introduction
https://r-graph-gallery.com/ggplot2-package.html
