Data Visualization Examples for Self-Study

Example 1: 2d bin counts plot

A 2d bin counts plot divides the plane into rectangles, counts the number of cases in each rectangle, and then (by default) maps the count to the rectangle's fill color. It addresses the “overplotting” problem, much as position = "jitter" does for a scatter plot.
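To make the binning step concrete, here is a minimal base R sketch of what a 2d bin count computes. The toy coordinates and the two-by-two grid are made up for illustration; geom_bin_2d() chooses its own bin boundaries.

```r
# Toy data: five points in the unit square (made up for illustration).
x <- c(0.1, 0.2, 0.8, 0.9, 0.85)
y <- c(0.1, 0.3, 0.7, 0.9, 0.95)

# Cut each axis into two equal-width intervals: [0, 0.5] and (0.5, 1].
x_bin <- cut(x, breaks = c(0, 0.5, 1), include.lowest = TRUE)
y_bin <- cut(y, breaks = c(0, 0.5, 1), include.lowest = TRUE)

# Cross-tabulate the interval labels to count the points in each rectangle.
counts <- table(x_bin, y_bin)
print(counts)
```

A 2d bin counts plot draws one filled rectangle per cell of such a table, colored by its count.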

For example, in the loans_full_schema data set, if we plot interest_rate against debt_to_income using a scatter plot, the result looks like this:

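# Load ggplot2; loans_full_schema is assumed here to come from the openintro package
library(ggplot2)
library(openintro)

ggplot(loans_full_schema) +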
  geom_point(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Interest Rate (in percentage)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

This graph is not very informative because many points overlap each other (overplotting). To make things clearer, we can use the geom_bin_2d() function to create a 2d bin counts plot.

ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Interest Rate (in percentage)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

In the graph above, the colors represent the counts (equivalently, the density) in each rectangular bin. It is clear that most data points have a low interest rate, between 5% and 13%, combined with a low debt_to_income ratio, between 0% and 20%.

Because there are relatively few points with a debt-to-income ratio higher than 100%, we can filter the data and make the plot more detailed:

ggplot(dplyr::filter(loans_full_schema, debt_to_income < 100)) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Interest Rate (in percentage)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))
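
The size of the bins controls how much detail the plot shows. As a hedged sketch on a small made-up data frame (toy is not part of the loan data), the binwidth argument of geom_bin_2d() sets the bin width along each axis, and layer_data() returns the counts that ggplot2 computed:

```r
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(x = c(1, 2, 11, 12, 13),
                  y = c(1, 2, 11, 12, 13))

# binwidth = c(width along x, height along y). Alternatively, bins = n
# requests n bins in each direction (the default is 30).
p <- ggplot(toy) +
  geom_bin_2d(aes(x = x, y = y), binwidth = c(10, 10))

# layer_data() returns the data computed for the first layer,
# including the count for each bin.
bin_counts <- layer_data(p)$count
print(sum(bin_counts))  # every point falls in exactly one bin
```

Smaller bins show finer structure but give noisier counts; larger bins smooth the picture at the cost of detail.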

Example 2: Customize scales (breaks and labels)

For the position aesthetics, scales control the tick marks (breaks) on the x- and y-axes and their labels. Several functions customize them. Let’s take the following graph as an example:

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We see that the tick marks on the x-axis are at 0, 100, 200, 300 and 400, and those on the y-axis are at 0, 500000, 1000000, 1500000 and 2000000.

Let’s first learn how to change the positions of the tick marks using the scale_x_continuous and scale_y_continuous functions.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(breaks = seq(0, 450, 50)) +
  scale_y_continuous(breaks = seq(0, 2000000, 250000)) + 
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

By setting the breaks argument of the scale_x_continuous or scale_y_continuous function, one can specify every tick position on the corresponding axis.
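
The breaks argument takes a plain numeric vector, so it can help to inspect what seq() produces before passing it to the scale. A small base R sketch (the variable names x_breaks and y_breaks are introduced here for illustration):

```r
# Tick positions for the x-axis: every 50 units from 0 to 450.
x_breaks <- seq(0, 450, 50)

# Tick positions for the y-axis: every 250,000 dollars from 0 to 2,000,000.
y_breaks <- seq(0, 2000000, 250000)

print(x_breaks)
print(length(y_breaks))
```

Any numeric vector works in the same way, for example c(0, 100, 400) for unevenly spaced ticks.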

We can also customize the labels of scales.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Here labels = NULL removes all tick labels on the corresponding axis. Alternatively, we can define the labels ourselves.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We can also customize the axis title with the name argument and the axis range with the limits argument. Note that points outside the limits are dropped from the plot. Useful label formatters from the scales package include scales::percent, scales::dollar and scales::comma.
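
To see what these formatters do, they can be called directly on numbers. A short sketch, assuming the scales package is installed (the input values are made up):

```r
# Each formatter takes a numeric vector and returns character labels.
pct <- scales::percent(0.25)
usd <- scales::dollar(1500000)
num <- scales::comma(1500000)

print(pct)  # "25%"
print(usd)  # "$1,500,000"
print(num)  # "1,500,000"
```

Passing the function itself, as in labels = scales::dollar, lets ggplot2 apply it to whatever breaks the scale ends up using.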

Example 3: Use log scale

In many data sets, one numeric variable may span several orders of magnitude (for example, household income from $1,000 to $1,000,000). With the default linear scale, the graph does not show the details well, because most values are squeezed into a small part of the axis. In that case we can switch to a log scale, which positions each value by its logarithm.
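
A quick base R check illustrates the effect. The income values below are made up and chosen as exact powers of ten:

```r
# Four incomes spanning three orders of magnitude.
incomes <- c(1000, 10000, 100000, 1000000)

# On a linear scale the gaps between neighbors grow tenfold each step.
linear_gaps <- diff(incomes)
print(linear_gaps)

# On a log10 scale the same values are evenly spaced.
log_incomes <- log10(incomes)
log_gaps <- diff(log_incomes)
print(log_incomes)
print(log_gaps)
```

Equal distances on a log10 axis therefore correspond to equal ratios, not equal differences.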

For data exploration, it is common to use log10 scales:

ggplot(loans_full_schema) + 
  geom_bin_2d(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_log10(limits = c(5000, 2500000), labels = scales::dollar) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

The functions scale_y_log10 and scale_x_log10 convert the y-axis and the x-axis, respectively, to a log10 scale.

Example 4: Use preset themes

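ggplot2 offers eight preset themes that give different settings for the axes, grid, and background appearance: theme_bw(), theme_light(), theme_classic(), theme_linedraw(), theme_dark(), theme_minimal(), theme_gray() and theme_void().

[Figure: The eight themes built into ggplot2.]

We can change the theme by adding one of the theme functions to the plot: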

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  theme_classic()
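
A preset theme can also be combined with the individual theme() settings used in the earlier examples. The sketch below is one possible arrangement, not the only one; the title text is made up for illustration:

```r
library(ggplot2)

# Start from a preset theme with a larger base font size, then layer
# individual tweaks on top with theme().
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "Fuel Economy Data") +  # illustrative title
  theme_bw(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5))

print(p)
```

The order matters: a preset theme replaces all earlier theme settings, so theme() adjustments should come after it.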

Miao Yu

2023-02-05
