In this class, we continue to learn how to visualize data with the ggplot2 package in R. We will learn the following topics
To summarize a single numeric variable, the most commonly used chart is a histogram. An example is shown below.
A good histogram shows the distribution shape of the data set.
In this class, we will use the data set loans_full_schema to show examples. The data set is from the package openintro.
library(tidyverse)  # ggplot2 for plotting; dplyr for glimpse() and select()
library(openintro)
If you don’t have the package installed, use install.packages("openintro") to install it first. After loading the package, we can take a look at the data.
glimpse(loans_full_schema)
Question: How many samples are there? How many variables are there?
There are too many variables in the data set. To make things simpler, we will only handle 8 of them for now by executing the following code in R.
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
grade, state, annual_income, homeownership, debt_to_income)
Selecting some variables from the original data set is a form of data transformation, which will be our next topic.
We have now created a new data set named loans that stores only the selected variables (loan_amount, interest_rate, etc.) from the original set.
Question: What is the data set about?
glimpse(loans)
## Rows: 10,000
## Columns: 8
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
variable | description |
---|---|
loan_amount | Amount of the loan received, in US dollars |
interest_rate | Interest rate on the loan, as an annual percentage |
term | The length of the loan, which is always set as a whole number of months |
grade | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid |
state | US state where the borrower resides |
annual_income | Borrower’s annual income, including any second income, in US dollars |
homeownership | Indicates whether the person owns, owns but has a mortgage, or rents |
debt_to_income | Debt-to-income ratio, in percentage |
<fct> means factor, which is a data type in R used to store and process categorical data.
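As a minimal sketch of what a factor looks like in practice, the base R snippet below builds one from a few made-up grade values. The vector grade_chr and its contents are hypothetical and are not taken from the loans data.

```r
# A small made-up vector of loan grades (hypothetical values, not from the data set)
grade_chr <- c("C", "A", "B", "A", "D", "C", "A")

# Convert to a factor; the levels are the categories the variable can take
grade_fct <- factor(grade_chr, levels = c("A", "B", "C", "D"))

levels(grade_fct)      # the categories
table(grade_fct)       # number of observations in each category
as.integer(grade_fct)  # factors are stored internally as integer codes
```

Because the levels are stored explicitly, functions such as table() and ggplot2's discrete scales know every category, including ones that happen to have no observations.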
variable | type |
---|---|
loan_amount | numerical, continuous |
interest_rate | numerical, continuous |
term | numerical, discrete |
grade | categorical, ordinal |
state | categorical, not ordinal |
annual_income | numerical, continuous |
homeownership | categorical, not ordinal |
debt_to_income | numerical, continuous |
Now let’s summarize a numeric variable by creating a histogram. Let’s pick interest_rate since its meaning is easy to understand.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate))
bins=20
In the previous graph, which uses the default of 30 bins, there are two “missing” bars around 10% that look unnatural. We can resolve this by reducing the number of bins (groups).
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), bins = 20)
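To see what the bins argument does numerically, one option is to inspect the data that ggplot2 computes for the histogram layer with layer_data(). The sketch below uses simulated interest rates as a stand-in for loans$interest_rate; the data frame rates and its values are assumptions made only so the example runs without the openintro package.

```r
library(ggplot2)

# Made-up stand-in for loans$interest_rate (simulated, not the real data)
set.seed(1)
rates <- data.frame(interest_rate = runif(500, min = 5, max = 31))

p <- ggplot(data = rates) +
  geom_histogram(mapping = aes(x = interest_rate), bins = 20)

# layer_data() returns the data computed for a layer:
# one row per bin, with its left edge, right edge, and count
bin_table <- layer_data(p)
bin_table[, c("xmin", "xmax", "count")]

# Every observation falls into exactly one bin
sum(bin_table$count) == nrow(rates)
```

A bin with a count of 0 in this table corresponds to a "missing" bar in the plot.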
bins=10
We can reduce the number of bins even further, to 10. Among the three histograms, which one do you think is the best? Why?
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), bins = 10)
In some situations, we want to specify the bin width instead of the number of bins. We can do the following:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5)
We can specify the center or boundary of one bin to adjust the position of the bins.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)
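The difference between boundary and center can be checked by looking at the computed bin edges. This is a rough sketch on simulated data (the data frame rates is made up for illustration): with binwidth = 5, boundary = 10 puts a bin edge at 10, while center = 10 puts the middle of a bin at 10.

```r
library(ggplot2)

# Made-up stand-in for loans$interest_rate (simulated, not the real data)
set.seed(1)
rates <- data.frame(interest_rate = runif(500, min = 5, max = 31))

# boundary = 10: one bin edge sits at 10, so the edges are ..., 5, 10, 15, ...
p_boundary <- ggplot(data = rates) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)

# center = 10: one bin is centered at 10, so the edges are ..., 7.5, 12.5, ...
p_center <- ggplot(data = rates) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, center = 10)

# Left edges of the bins in each plot
edges_boundary <- layer_data(p_boundary)$xmin
edges_center <- layer_data(p_center)$xmin
edges_boundary
edges_center
```

For percentages, edges at round numbers (boundary) are usually easier to read off the axis than edges at 7.5, 12.5, and so on.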
The graph above does not look very polished, and therefore does not look professional. To produce a good-quality graph, we must refine its details. Next, we will learn how to do that.
For a graph to be accessible to a wider audience, it must have proper labels and a title. In ggplot2, the function labs() is used to specify these details.
ggplot(data = loans) +
geom_histogram(
mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(
title = "Interest rate from lending club data",
x = "Interest Rate (%)",
y = "Count"
)
theme()
To further polish graph details, we need to add theme() to our code. The theme() function customizes the appearance of all non-data components of a plot, such as titles, labels, fonts, background, gridlines, and legends.
theme()
To use theme(), follow this template:
theme(
COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
)
Let’s look at an example to understand this. Suppose we want to center the title in the previous graph. We can do this:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
How the theme() function works
theme(plot.title = element_text(hjust = 0.5))
- The plot.title argument of theme() specifies that we want to customize the text appearance of the title.
- The text appearance is set with the element_text() function.
- We set hjust to 0.5. hjust refers to the horizontal justification, and a value of 0.5 places the title in the center.
Let’s see another example. Now we want to enlarge the font size of the title to 25 pts. The following code would work:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 25))
As an exercise, try to guess how the following code will change the graph:
theme(
  plot.title = element_text(hjust = 0.5, size = 25),
  axis.title = element_text(size = 25),
  axis.text = element_text(size = 20)
)
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 25), axis.title = element_text(size = 25), axis.text = element_text(size = 20))
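When the same theme() settings are needed for several graphs, one possible approach is to store them in an object and add that object to each plot. The sketch below uses the mpg data that ships with ggplot2 so that it runs on its own; the name big_text_theme and the label texts are made up for illustration.

```r
library(ggplot2)

# Store the theme settings once (big_text_theme is a hypothetical name)
big_text_theme <- theme(
  plot.title = element_text(hjust = 0.5, size = 25),
  axis.title = element_text(size = 25),
  axis.text = element_text(size = 20)
)

# Add the stored theme to any plot with +
p <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 5, boundary = 10) +
  labs(title = "Highway mileage", x = "Miles per gallon", y = "Count") +
  big_text_theme
p
```

This keeps the plotting code short and makes it easy to change the font sizes of all graphs in one place.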
Nobody remembers the names of all the arguments of the theme() function. So when you want to customize a particular element in your graph, use the help documentation as your reference.
?theme
There are three main element_ functions used in theme():
- element_rect(): for borders and backgrounds
- element_line(): for lines
- element_text(): for text
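As a small illustration of the three functions working together, the sketch below applies one of each to a scatter plot of the mpg data from ggplot2. The particular components and colours are arbitrary choices for demonstration, not recommendations.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy)) +
  theme(
    # element_rect(): background fill and border of the plotting panel
    panel.background = element_rect(fill = "white", colour = "grey50"),
    # element_line(): major grid lines
    panel.grid.major = element_line(colour = "grey85", linetype = "dashed"),
    # element_text(): axis titles
    axis.title = element_text(face = "bold")
  )
p
```

Each component name on the left (panel.background, panel.grid.major, axis.title) is listed in ?theme together with the element function it expects.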
In the future, when you come across new ways of using the theme() function, consult the documentation to understand how they work.
rel() and margin()
Two functions, rel() and margin(), are useful when we customize our graphs.
- rel() is used to specify relative sizes. For example, rel(1.5) means 1.5 times the size of the parent element.
- margin() is used to specify the margins of elements from top (t), right (r), bottom (b) and left (l), along with a unit.
theme(
  axis.text = element_text(colour = "blue", size = rel(1.5)),
  plot.margin = margin(1, 1, 1, 1, unit = "cm")
)
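The following sketch shows rel() and margin() in a complete plot, again using the mpg data from ggplot2. Naming the margin() arguments makes the order (top, right, bottom, left) explicit; the specific values are arbitrary and only meant to make each side easy to tell apart.

```r
library(ggplot2)

# rel() tags a number as a relative size
size_factor <- rel(1.5)

# margin() takes its values in the order top, right, bottom, left
plot_margin <- margin(t = 1, r = 2, b = 3, l = 4, unit = "cm")

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy)) +
  theme(
    axis.text = element_text(colour = "blue", size = size_factor),
    plot.margin = plot_margin
  )
p
```

With unequal margins like these, it is easy to verify on the rendered plot which number controls which side.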
Read https://ggplot2.tidyverse.org/reference/element.html for more details.
An example graph after adjusting the margin and font colors is shown below.
Create a simple graph with ggplot(mpg) + geom_point(aes(x = cty, y = hwy)), then make the following customizations to your graph:
Please note that the aesthetics of data components are customized inside the geom_ functions or with other built-in functions.
For example, the col argument is used to set the color of your plotting data (in points, lines, bars or other objects) in most cases.
ggplot(mpg, aes(cty, hwy)) +
geom_point(col = "blue") +
geom_smooth(col = "red")
In the code above, we make all points blue and the line red. Also note that the argument names data, mapping, x and y are omitted (think about why this works).
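One way to check your reasoning is to write the same plot with every argument named and compare the two. In this sketch the names p_short and p_long are made up, and geom_smooth() is left out to keep the comparison simple.

```r
library(ggplot2)

# Short form: arguments are matched by position; col is an alias for colour
p_short <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(col = "blue")

# Long form: the same plot with every argument named
p_long <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(colour = "blue")

# Both versions compute the same data for the point layer
all.equal(layer_data(p_short), layer_data(p_long))
```

The short form works because data and mapping are the first two arguments of ggplot(), and x and y are the first two arguments of aes().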
xlim and ylim
To set the limits of the x and y axes, which is often needed when polishing a graph, we use the xlim() and ylim() functions:
ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() +
xlim(0, 40) + ylim(0, 50)
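xlim() and ylim() are shorthand for setting the limits argument of the corresponding continuous scale. The sketch below, with the made-up names p_short and p_long, builds the same plot both ways and reads back the limits of the x scale; geom_smooth() is omitted to keep the output free of fitting messages.

```r
library(ggplot2)

# Shorthand form
p_short <- ggplot(mpg, aes(cty, hwy)) + geom_point() +
  xlim(0, 40) + ylim(0, 50)

# Equivalent long form using the scale functions directly
p_long <- ggplot(mpg, aes(cty, hwy)) + geom_point() +
  scale_x_continuous(limits = c(0, 40)) +
  scale_y_continuous(limits = c(0, 50))

# The x scale of each plot reports the same limits
limits_short <- layer_scales(p_short)$x$get_limits()
limits_long <- layer_scales(p_long)$x$get_limits()
limits_short
limits_long
```

Keep in mind that observations falling outside the limits set this way are dropped from the plot with a warning, so choose limits that cover all the data you want to show.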
On the next page, I will show you a graph that is relatively polished in its details (you need really large font sizes for a presentation).