library(tidyverse)
library(openintro)
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
grade, state, annual_income, homeownership, debt_to_income)
In this module, we will first learn how to adjust the appearance of data-related and non-data-related components for a figure. Then we will study a few data visualization and analysis examples, which naturally raises the necessity of performing data transformation before data visualization in many situations.
As we saw in previous examples, we can customize the color, shape, fill color and other aesthetic properties of data-related components (symbols, lines, bars, fills, etc.). Note that this is different from aesthetic grouping, since here we control the appearance of all data elements at once rather than letting it vary with a variable.
These components are customized by arguments inside the geom_ functions (but outside the aes() function). For example,
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(color = "blue", fill = "green", shape = 21, size = 3) +
geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)
All aesthetic settings given this way apply to all data components drawn by that geom_ function.
For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
color, fill and alpha - color customization
Almost all geoms have either color or fill (or both) to customize the color of points/lines/bars/… To specify a color, one may use the following ways:
- A color name, such as "red", "blue" etc. R has 657 built-in named colors in total.
- A hexadecimal RGB code, such as "#A52A2A".
- NA, which refers to a completely transparent color.
alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
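The ways of specifying a color listed above, together with alpha, can be tried side by side. The following is only a sketch: it reuses the mpg data from the earlier example, and the particular colors, the alpha value of 0.3 and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# Number of built-in named colors in R
length(colors())

# A hex RGB code for the point color, made partly transparent with alpha
p_color <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "#A52A2A", alpha = 0.3, size = 3)

# A named color for the bar outline, and NA for a fully transparent fill
p_fill <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(color = "blue", fill = NA, binwidth = 2)

p_color
p_fill
```

With alpha below 1, regions where many points overlap appear darker, which also helps with overplotting.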
shape, size and linetype - point/line customization
shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
size can be specified with a numerical value (in mm) or a relative size with the rel() function.
linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
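To see which integer corresponds to which point type, one option is to draw all 26 shapes in a single chart. This is a small sketch, not part of the original examples; the data frame shapes and its grid layout are made up for the purpose.

```r
library(ggplot2)

# One row per shape code, laid out on a small grid
shapes <- data.frame(
  code = 0:25,
  x = 0:25 %% 7,
  y = -(0:25 %/% 7)
)

p_shapes <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = code), size = 5, color = "blue", fill = "green") +
  geom_text(aes(label = code), nudge_y = -0.3) +
  scale_shape_identity()  # use the numbers in `code` directly as shape codes

p_shapes
```

In the resulting chart only shapes 21 to 25 show the green fill; the other shapes are drawn with color alone.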
Create a density plot of interest_rate in the loans data with
- color to be blue
- fill to be green
- linetype to be dashed
- linewidth to be 1.5
Now let’s learn how to polish non-data-related components. This includes:
titles and labels
axis, ticks
margins
positions of all components
grids
fonts and font sizes for all texts
legends
For a graph to be accessible to a wider audience, it must have proper axis labels and a title. In ggplot, the function labs() is used to specify these details.
ggplot(data = loans) +
geom_histogram(
mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(
title = "Interest rate from lending club data",
x = "Interest Rate (%)",
y = "Count"
)
theme()
To further polish graph details, we need to add theme() to our code. The theme() function can customize the appearance of all non-data components of the plot.
To use theme(), follow this template:
theme(
COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
)
Let’s look at an example to understand this. Suppose we want to center the title in the previous graph. We can do this:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
How the theme() function works
theme(plot.title = element_text(hjust = 0.5))
The argument plot.title of theme() specifies that we want to customize the text appearance of the title.
To change any setting for text, we must use the element_text() function.
Therein, we set hjust to 0.5. hjust refers to the horizontal justification, and a value of 0.5 places the text in the center.
Let’s see another example. Now we want to enlarge the font size of title to be 20 pts. The following code would work:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 20))
As an exercise, please guess how the following code will change the graph:
theme(
  plot.title = element_text(hjust = 0.5, size = 20),
  axis.title = element_text(size = 15),
  axis.text = element_text(size = 15)
)
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 20), axis.title = element_text(size = 15), axis.text = element_text(size = 15))
Nobody can remember the names of all the arguments of the theme() function. So when you want to customize a particular element in your graph, use the help documentation as your reference.
?theme
There are three main element_ functions used in theme():
- element_rect(): for borders and backgrounds
- element_line(): for lines
- element_text(): for texts
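As a rough illustration of how each element_ function pairs with a matching component, the sketch below uses one of each in a single theme() call. The choice of components (panel.background, panel.grid.major, plot.title), the settings and the plot title are arbitrary.

```r
library(ggplot2)

p_elements <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "City vs highway mileage") +
  theme(
    # element_rect(): background fill and border of the plotting panel
    panel.background = element_rect(fill = "white", colour = "black"),
    # element_line(): major grid lines
    panel.grid.major = element_line(colour = "grey80", linetype = "dashed"),
    # element_text(): the plot title
    plot.title = element_text(hjust = 0.5, size = 16)
  )

p_elements
```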
In the future, when you encounter new ways of using the theme() function, you should research them yourself to understand how they work.
rel() and margin()
There are two useful functions, rel() and margin(), when we customize our graphs.
- rel() is used to specify relative sizes. For example, rel(1.5) means 1.5 times larger in size.
- margin() is used to specify the margins of elements from top (t), right (r), bottom (b) and left (l), along with a unit.
theme(
  axis.text = element_text(colour = "blue", size = rel(1.5)),
  plot.margin = margin(1,1,1,1, unit = "cm")
)
Read https://ggplot2.tidyverse.org/reference/element.html for more details.
An exemplary graph is shown below after adjusting the margin and font colors.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), colour = "red"), axis.title = element_text(colour = "blue", size = rel(1.2), margin = margin(b = 3)), axis.text = element_text(size = rel(1.2)), plot.margin = margin(1,1,1,1, unit = "cm"))
Starting from the simple graph ggplot(mpg) + geom_point(aes(x = cty, y = hwy)), make the following customizations to your graph:
Axis limits (xlim and ylim)
To set the limits of the x and y axes, which is often needed for graph polishing, we use the xlim() or ylim() functions:
ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() +
xlim(0, 40) + ylim(0, 50) +
theme(axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)), axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)), axis.text = element_text(size = rel(1.0)), plot.margin = margin(1,1,1,1,"cm"))
On the next page, I am going to show you a graph that is relatively polished in detail (You need really large font sizes for presentation).
A 2d bin counts plot divides the plane into rectangles, counts the number of cases in each rectangle, and then (by default) maps the number of cases to the rectangle's fill. It is used to resolve the “overplotting” problem, similar to using position = "jitter" when making a scatter plot.
For example, in the loans_full_schema data set, if we plot interest_rate against debt_to_income using a scatter plot, it looks like this:
ggplot(loans_full_schema) +
geom_point(aes(x = debt_to_income, y = interest_rate))
This graph is not very informative since many points overlap with each other (overplotting). To make things clearer, we can use the geom_bin_2d() function to create a 2d bin counts plot.
ggplot(loans_full_schema) +
geom_bin_2d(aes(x = debt_to_income, y = interest_rate))
In the graph above, the colors represent the counts (equivalently, the density) in each rectangular bin. It is clear that most data points have a low interest rate, between 5% and 13%, combined with a low debt_to_income ratio, between 0% and 20%.
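How coarse or fine such a picture looks depends on the bin size. As a rough sketch, the bins and binwidth arguments of geom_bin_2d() control this. The example uses the diamonds data that ships with ggplot2 instead of the loans data, and the values 20 and c(0.1, 500) are arbitrary.

```r
library(ggplot2)

# Fewer, larger bins: 20 bins in each direction
p_coarse <- ggplot(diamonds) +
  geom_bin_2d(aes(x = carat, y = price), bins = 20)

# Bin size given in data units: 0.1 carat wide, 500 dollars tall
p_fine <- ggplot(diamonds) +
  geom_bin_2d(aes(x = carat, y = price), binwidth = c(0.1, 500))

p_coarse
p_fine
```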
Since there are relatively few points with a debt-to-income ratio higher than 100%, we can restrict the x-axis to the range 0 to 100 and make the plot more detailed:
ggplot(loans_full_schema) +
geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) +
xlim(0, 100)
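Note that xlim() really does drop the observations outside the limits before the bins are counted (ggplot2 warns about the removed rows). If the goal is only to zoom in without discarding data, coord_cartesian() is an alternative. A minimal sketch on the built-in diamonds data, with the cutoff of 2 carats chosen only for illustration:

```r
library(ggplot2)

# How many rows lie outside the range we are about to keep
sum(diamonds$carat > 2)

# xlim(): rows with carat > 2 are removed before binning
p_drop <- ggplot(diamonds) +
  geom_bin_2d(aes(x = carat, y = price)) +
  xlim(0, 2)

# coord_cartesian(): bins are computed on all rows, then the view is zoomed
p_zoom <- ggplot(diamonds) +
  geom_bin_2d(aes(x = carat, y = price)) +
  coord_cartesian(xlim = c(0, 2))
```

Because the bins are computed from different sets of rows, the two plots can differ in bin boundaries and counts even though they show the same x range.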
Scales refer to the x- and y-ticks and their labels on axes. There are a few functions that can customize the scale. Let’s take the following graph as an example:
ggplot(loans_full_schema) +
geom_point(aes(x = debt_to_income, y = annual_income))
We see that the scales on x-axis are 0, 100, 200, 300, 400 and the scales on y-axis are 0, 500000, 1000000, 1500000 and 2000000.
Let’s first learn how to change the position of scales using the scale_x_continuous() and scale_y_continuous() functions.
ggplot(loans_full_schema) +
geom_point(aes(x = debt_to_income,
y = annual_income)) +
scale_x_continuous(breaks = seq(0, 450, 50)) +
scale_y_continuous(breaks = seq(0, 2000000, 250000))
By defining the breaks argument inside the scale_x_continuous() or scale_y_continuous() function, one can set the positions of all scales (tick marks).
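The breaks argument simply takes a numeric vector of positions, so it helps to look at what seq() produces before passing it on. A small sketch on the mpg data; the break values are arbitrary.

```r
library(ggplot2)

# seq(from, to, by) generates evenly spaced break positions
x_breaks <- seq(0, 40, 5)
x_breaks

p_breaks <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_continuous(breaks = x_breaks) +
  scale_y_continuous(breaks = c(15, 25, 35, 45))  # breaks need not come from seq()

p_breaks
```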
We can also customize the labels of scales.
ggplot(loans_full_schema) +
geom_point(aes(x = debt_to_income, y = annual_income)) +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
Here labels = NULL removes all labels on the corresponding scale. Or we can define them ourselves.
ggplot(loans_full_schema) +
geom_point(aes(x = debt_to_income/100, y = annual_income)) +
scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
scale_y_continuous(labels = scales::dollar)
We can also customize the axis title here with the name argument, and customize the limits with the limits argument. Some useful label formatters are scales::percent, scales::dollar and scales::comma, which change the format of the scale labels.
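These formatters are ordinary functions from the scales package that turn numbers into character labels, so they can be tried directly in the console. The input values below are arbitrary.

```r
# Each formatter takes numbers and returns character labels
pct <- scales::percent(0.25)
usd <- scales::dollar(5000)
big <- scales::comma(1500000)

pct  # "25%"
usd  # "$5,000"
big  # "1,500,000"
```

This is also why the x variable was divided by 100 in the plot above: scales::percent expects proportions such as 0.25, not percentages such as 25.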
In many data sets, one numeric variable may span a few orders of magnitude (for example, household income from $1,000 to $1,000,000). If we use a linear continuous scale, the graph does not show details very well. In that case we need to change our scale to a log scale (plotting the logarithm of the variable).
For data exploration, it is common to use log10 scales:
ggplot(loans_full_schema) +
geom_bin_2d(aes(x = debt_to_income/100, y = annual_income)) +
scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
scale_y_log10(limits = c(5000, 2500000), labels = scales::dollar) +
labs(title = "LendingClub Loan Data",
x = "Debt to Income Ratio (in percentage)",
y = "Annual Income (in US dollar)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
The functions scale_y_log10() and scale_x_log10() convert the y-axis or x-axis into a log10 scale, respectively.
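To see why a log10 scale helps, note that values differing by a factor of 10 end up equally spaced. The sketch below checks this with a few made-up income values and then applies scale_y_log10() to the price variable of the built-in diamonds data, which is used here only as a stand-in for the loans data.

```r
library(ggplot2)

# Each tenfold increase becomes one equal step after taking log10
incomes <- c(1000, 10000, 100000, 1000000)
log10(incomes)  # 3 4 5 6

p_log <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin_2d() +
  scale_y_log10(labels = scales::dollar)

p_log
```

The axis labels still show the original values; only the spacing between them changes.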
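There are eight preset themes offered in ggplot that give different settings for axes, grid and background appearance. They are: theme_gray() (the default), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic() and theme_void().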
We can change the theme by calling the theme functions:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
theme_classic()
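A preset theme can be combined with further theme() adjustments. Because later theme calls override earlier ones, the theme() call should come after the preset. A small sketch, with the preset, legend position and title chosen arbitrarily:

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "Engine displacement vs highway mileage") +
  theme_minimal() +                               # start from a preset theme
  theme(legend.position = "bottom",               # then adjust individual elements
        plot.title = element_text(hjust = 0.5))

p_theme
```

If the order were reversed, theme_minimal() would reset the legend position and title justification to its own defaults.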
Using “Facets” is another way to add additional variables into a graph.
Facets divide a plot into subplots based on the values of one or more discrete variables.
When creating subplots based on values of a single categorical variable, one should use facet_wrap().
Below is an example.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2) +
labs(title = "Vehicle Fuel Economy Data by Vehicle Class",
x = "Engine Displacement (liter)",
y = "Highway Mile per Gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
The facet_wrap() function wraps subplots into a 2-dimensional array. This is generally a better use of screen space because most displays are roughly rectangular.
In the code above, ~ class is called a formula in R. We will study it later. For now you just need to know that ~ <VARIABLE_NAME> is needed as the first argument of the facet_wrap() function.
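A formula is an ordinary R object, which can be checked in the console. The sketch below also shows vars(), which ggplot2 accepts as an alternative way of naming the faceting variable; the object names are made up for illustration.

```r
library(ggplot2)

# A one-sided formula is an R object of class "formula"
f <- ~ class
class(f)

# Formula notation, as in the example above
p_formula <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# vars() notation, which should give the same layout of subplots
p_vars <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(vars(class), nrow = 2)
```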
In the graph above, we still plot engine displacement vs highway mpg, but each subplot only shows the data for one class. By doing this, we clearly see where each group is, which is better than plotting them all together.
facet_grid function
When creating subplots based on values of two categorical variables, one should use facet_grid():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class) + # A formula with two variables
labs(title = "Fuel Economy Data by Vehicle Class and Drive Train Type",
x = "Engine Displacement (liter)",
y = "Highway Mile per Gallon") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.2)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.2)))
In this case, a grid of subplots is created, and the rows and columns of the grid correspond to the values of drv and class, respectively.
For example, the subplot at the top right corner plots displ against hwy for data points with a class value of suv and a drv value of 4, which corresponds to 4-wheel drive SUVs.
facet_grid() can be used to study the relationship between four variables (two numeric and two categorical). When the data set is large and complicated, it can provide useful insights.
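Before drawing a facet grid, it can help to count how many observations fall into each combination of the two categorical variables, since some subplots may turn out empty. A quick sketch with base R's table(), arranged like facet_grid(drv ~ class):

```r
library(ggplot2)

# Rows are drv values and columns are class values, as in facet_grid(drv ~ class)
panel_counts <- table(mpg$drv, mpg$class)
panel_counts

dim(panel_counts)        # 3 rows (drv) by 7 columns (class)
sum(panel_counts == 0)   # number of combinations with no data, i.e. empty subplots
```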
Finish all Lab Exercises
Create a graph based on the diamonds data set with the following requirements:
- x being carat and y being price.
- [...] clarity quality.
- [...] $5,000 etc.
Submit your answer in a single pdf or html file knitted from an R Markdown file. Submit your R Markdown file as well.