A 2d bin counts plot divides the plane into rectangles, counts
the number of cases in each rectangle, and then (by default) maps the
number of cases to the rectangle's fill. It is used to resolve the
“overplotting” problem, similar to using position = "jitter" in a scatter plot.
For example, in the loans_full_schema data set, if we plot interest_rate
against debt_to_income using a scatter plot, it looks like this:
library(ggplot2)
library(openintro)  # provides the loans_full_schema data set

ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Interest Rate (in percentage)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
This graph is not very informative because many points overlap
(overplotting). To make things clearer, we can use the geom_bin_2d()
function to create a 2d bin counts plot.
ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Interest Rate (in percentage)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
In the graph above, the fill color represents the count (equivalently, the density) of observations in each rectangular bin. Most data points fall at low interest rates, between 5% and 13%, combined with low debt_to_income ratios, between 0% and 20%.
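To make the idea of "counting cases in each bin" concrete, here is a minimal base R sketch of the same computation done by hand. It uses simulated values as stand-ins for debt_to_income and interest_rate (not the real loan data), and the bin edges are arbitrary choices for illustration; geom_bin_2d() chooses its own bins.

```r
# Simulated stand-ins for the two variables (assumption: not the real data).
set.seed(42)
n <- 1000
debt_to_income <- runif(n, min = 0, max = 100)
interest_rate <- runif(n, min = 5, max = 30)

# Cut each axis into equal-width intervals; each pair of intervals is one bin.
x_bin <- cut(debt_to_income, breaks = seq(0, 100, by = 20), include.lowest = TRUE)
y_bin <- cut(interest_rate, breaks = seq(5, 30, by = 5), include.lowest = TRUE)

# Cross-tabulate to get the number of observations in each rectangle.
bin_counts <- table(x_bin, y_bin)
print(bin_counts)
```

Each cell of bin_counts corresponds to one rectangle in the plot, and its value is what would be mapped to the fill color.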
Since there are relatively few points with a debt-to-income ratio higher than 100%, we can filter the data to make the plot more detailed:
# dplyr::filter() keeps only the rows with a debt-to-income ratio below 100%
ggplot(dplyr::filter(loans_full_schema, debt_to_income < 100)) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Interest Rate (in percentage)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
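Besides filtering the data, the level of detail can also be adjusted through the bin size. The sketch below uses the bins and binwidth arguments of geom_bin_2d() on simulated data (the data frame sim is hypothetical and only stands in for the loan data); the specific values 10 and c(5, 1) are arbitrary.

```r
library(ggplot2)

# Simulated data for illustration only (not the LendingClub data).
set.seed(1)
sim <- data.frame(
  debt_to_income = runif(2000, min = 0, max = 100),
  interest_rate  = runif(2000, min = 5, max = 30)
)

# bins sets the number of bins along each axis (ggplot2 defaults to 30).
p_coarse <- ggplot(sim) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate), bins = 10)

# binwidth sets the bin size in data units: 5 wide by 1 tall here.
p_fine <- ggplot(sim) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate), binwidth = c(5, 1))

# layer_data() exposes the computed bins, including the count per bin.
head(layer_data(p_fine))
```

Fewer, larger bins give a smoother but coarser picture; smaller bins show more detail but leave many bins with only a handful of points.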
Here, scales refer to the tick marks on the x- and y-axes and their labels. A few functions let us customize them. Let’s take the following graph as an example:
ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Annual Income (in US dollar)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
We see that the tick marks on the x-axis are at 0, 100, 200, 300, and 400, and those on the y-axis are at 0, 500000, 1000000, 1500000, and 2000000.
Let’s first learn how to change the positions of the tick marks using the scale_x_continuous and scale_y_continuous functions.
ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(breaks = seq(0, 450, 50)) +
  scale_y_continuous(breaks = seq(0, 2000000, 250000)) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Annual Income (in US dollar)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
By setting the breaks argument of scale_x_continuous or scale_y_continuous, one can specify the positions of all tick marks.
We can also customize the tick labels.
ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Annual Income (in US dollar)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
Here labels = NULL removes all labels on the corresponding axis. Alternatively, we can define the labels ourselves.
ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Annual Income (in US dollar)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
We can also set the axis title with the name argument and the axis range with the limits argument; observations outside the limits are dropped from the plot with a warning. Some useful label formatters are scales::percent, scales::dollar, and scales::comma, which change the format of the tick labels.
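As a small illustration of what these formatters do, the sketch below calls them directly on a few numbers and then shows one way to supply hand-written labels instead. The names income_breaks and income_labels are made up for this example; the only requirement is one label per break, in the same order.

```r
library(scales)

# Built-in formatters turn numbers into display strings.
percent(0.25)      # "25%"
dollar(1500000)    # "$1,500,000"
comma(1500000)     # "1,500,000"

# Hand-written labels: one label per break, in the same order.
income_breaks <- c(0, 500000, 1000000, 1500000, 2000000)
income_labels <- c("0", "0.5M", "1M", "1.5M", "2M")

# These could then be passed to a scale, for example:
#   scale_y_continuous(breaks = income_breaks, labels = income_labels)
```

Passing a function (such as scales::dollar) lets ggplot2 format whatever breaks it picks, while a character vector only works when the breaks are fixed as well.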
In many data sets, a numeric variable may span several orders of
magnitude (for example, household income from $1,000 to $1,000,000). If
we use the default linear scale, the graph does not show the details
very well. In that case we can switch to a log scale
(plotting the logarithm of the variable).
For data exploration, it is common to use log10 scales:
ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_log10(limits = c(5000, 2500000), labels = scales::dollar) +
  labs(title = "LendingClub Loan Data",
       x = "Debt to Income Ratio (in percentage)",
       y = "Annual Income (in US dollar)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
        axis.title = element_text(size = rel(1.4)),
        axis.title.x = element_text(margin = margin(10,5,5,5)),
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.4)))
The functions scale_y_log10 and scale_x_log10 convert the y-axis or x-axis to a log10 scale, respectively.
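A quick base R calculation shows why a log10 scale helps when values span several orders of magnitude. The income values below are made-up round numbers chosen for illustration.

```r
# Hypothetical incomes spanning four orders of magnitude.
income <- c(1000, 10000, 100000, 1000000)

# On a linear scale the gaps grow tenfold each step,
# so the small values are squeezed together near zero.
linear_gaps <- diff(income)      # 9000 90000 900000

# On a log10 scale each tenfold increase takes the same distance.
log_income <- log10(income)      # 3 4 5 6
log_gaps <- diff(log_income)     # 1 1 1
```

This is the transformation scale_y_log10 applies to positions on the axis, while the tick labels still show the original values.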
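ggplot2 offers eight preset themes, which provide different settings for the
axes, grid, and background appearance. They are theme_gray() (the default),
theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(),
theme_classic(), and theme_void().
[Figure: The eight themes built into ggplot2.]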
We can change the theme by calling the theme functions:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  theme_classic()
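A preset theme can also serve as a starting point for further adjustments. The sketch below, using the mpg data that ships with ggplot2, combines a preset theme with a theme() tweak and shows theme_set() for changing the default theme; the particular choices (theme_minimal, base_size = 14, legend at the bottom) are only examples.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  theme_minimal(base_size = 14) +      # preset theme with a larger base font size
  theme(legend.position = "bottom")    # adjust one element on top of the preset

# theme_set() changes the default theme for later plots
# and returns the previous default so it can be restored.
old_theme <- theme_set(theme_bw())
# ... plots created here use theme_bw() unless they add their own theme ...
theme_set(old_theme)
```

The theme() call should come after the preset theme function, because a preset theme added later would replace the earlier adjustments.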