In this class, we continue to learn how to visualize data with the ggplot2 package in R. We will learn the following topics
library(tidyverse)
In this class, we will use the data set loans_full_schema to show examples. The data set is from the package openintro.
library(openintro) # Install the package if it's not available
If you don’t have the package installed, use install.packages("openintro") to install it first. After loading the package, we can take a look at the data.
glimpse(loans_full_schema)
Question: How many samples are there? How many variables are there?
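Besides reading the output of glimpse(), the counts can be obtained directly in code. The following is a minimal sketch that uses the built-in mtcars data set as a stand-in (an assumption for illustration only, it is not this class's data); the same calls should work on loans_full_schema.

```r
# Stand-in data: mtcars ships with base R, so no extra package is needed.
df <- mtcars

n_samples   <- nrow(df)  # number of rows (samples / observations)
n_variables <- ncol(df)  # number of columns (variables)
dim(df)                  # both at once: rows first, then columns
```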
To summarize a single numeric variable, the most commonly used chart is a histogram. An example is shown below.
A good histogram shows the distribution shape of the data set.
There are too many variables in the data set. To make things simpler, we will only handle 8 of them for now by executing the following code in R.
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
grade, state, annual_income, homeownership, debt_to_income)
This operation of selecting some variables from the original data set belongs to data transformation, which will be our topic in the next chapter.
Now we have created a new data set named loans that only stores the selected variables (loan_amount, interest_rate, etc.) from the original set.
Question: What is the data set about?
glimpse(loans)
## Rows: 10,000
## Columns: 8
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,…
## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,…
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…
variable | description |
---|---|
loan_amount | Amount of the loan received, in US dollars |
interest_rate | Interest rate on the loan, in an annual percentage |
term | The length of the loan, which is always set as a whole number of months |
grade | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid |
state | US state where the borrower resides |
annual_income | Borrower’s annual income, including any second income, in US dollars |
homeownership | Indicates whether the person owns, owns but has a mortgage, or rents |
debt_to_income | Debt-to-income ratio, in percentage |

<fct> means factor, which is a data type in R used to store and process categorical data.
variable | type |
---|---|
loan_amount | numerical, continuous |
interest_rate | numerical, continuous |
term | numerical, discrete |
grade | categorical, ordinal |
state | categorical, not ordinal |
annual_income | numerical, continuous |
homeownership | categorical, not ordinal |
debt_to_income | numerical, continuous |
The table and unique functions
Oftentimes, we hope to quickly check all values for a categorical or discrete variable. There are two functions to fulfill the job:
unique(loans$term)
## [1] 60 36
The unique function returns all unique values for a vector. As we can see here, for all the loans in the data set, the term length is either 36 or 60 months.
An even more powerful function is the table function, which creates a frequency table for any given categorical or discrete variable.
table(loans$term)
##
## 36 60
## 6970 3030
As we see, the function not only lists all values of the variable, but also lists the counts (frequency) for each value.
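The two functions combine well with length() and sort() when we want the number of distinct values or the most common value. This is a minimal sketch on a made-up toy vector (the vector x and the variable names are hypothetical, not taken from the loans data).

```r
# Hypothetical toy vector standing in for a categorical column.
x <- c("RENT", "OWN", "RENT", "MORTGAGE", "RENT", "MORTGAGE")

freq <- table(x)                          # frequency table
n_distinct_values <- length(unique(x))    # how many distinct values
sorted <- sort(freq, decreasing = TRUE)   # largest count first
most_common <- names(sorted)[1]           # label with the largest count
```

Sorting the table first is a design choice that also makes the full ranking visible, which can be handy when a variable has many levels.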
Answer the following questions by using the unique or table function:
How many distinct values are there for the homeownership variable? Which value is the most common one?
How many distinct interest rates are there? Which value is the most common one?
Apply the table function to the annual_income variable. Do you think the result is helpful or not?
Now let’s learn how to summarize a numeric variable by creating a histogram. Let’s pick interest_rate since its meaning is easy to understand.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate))
By default, there are 30 bins and their ranges are automatically determined. In many cases, this won’t give us a satisfactory graph. We need to customize it.
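To see exactly which bins ggplot2 chose, we can inspect the data computed for the histogram layer with layer_data(). The sketch below uses made-up toy data (the data frame toy is hypothetical) and sets bins explicitly; the same call should work on any of the histograms in this section.

```r
library(ggplot2)

# Hypothetical toy data standing in for a numeric column.
set.seed(1)
toy <- data.frame(x = runif(200, min = 5, max = 30))

p <- ggplot(toy) +
  geom_histogram(mapping = aes(x = x), bins = 20)

# layer_data() returns the data computed by the stat for a layer:
# one row per bin, with its edges (xmin, xmax) and its count.
bins_tbl <- layer_data(p)[, c("xmin", "xmax", "count")]
head(bins_tbl)
```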
bins=20
In the previous graph, there are two “missing” bars around 10%, which look unnatural. We can resolve this by reducing the number of bins (groups).
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), bins = 20)
bins=10
We can make the number of bins even smaller, say 10. Among the three histograms, which one do you think is the best? Why?
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), bins = 10)
In some situations, we hope to specify the bin width instead of bin number. We may do the following:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5)
We can specify the center or boundary of one bin to adjust the position of the bins. With boundary = 10, one bin edge is placed exactly at 10:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)
With center = 10, the middle of one bin is placed exactly at 10 instead:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 1, center = 10)
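The difference between boundary and center can be worked out by hand. The sketch below only reproduces the arithmetic with seq(); it does not call ggplot2, and the ranges and variable names are made up for illustration.

```r
# binwidth = 5, boundary = 10: one bin edge sits exactly at 10,
# so all edges fall on multiples of 5.
edges_boundary <- seq(from = 0, to = 30, by = 5)

# binwidth = 1, center = 10: one bin is centered at 10,
# so its edges are 9.5 and 10.5 and every edge ends in .5
edges_center <- seq(from = 4.5, to = 31.5, by = 1)

10 %in% edges_boundary           # TRUE: 10 is a bin edge
c(9.5, 10.5) %in% edges_center   # TRUE TRUE: 10 is a bin midpoint
```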
It may look a little unnatural if the x-axis does not start from zero. We may fix it by using the function xlim.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 1, center = 10) +
xlim(0, 40)
Why is the lowest interest rate 5%-ish and there was no lower interest rate? Can you explain?
Why are there some peak interest rates around 7%, 10%, 14%? Can you explain?
Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
Create a histogram of annual_income. What is the issue with your graph?
For continuous random variables, it is common practice to draw a density plot, which is a smoothed version of the histogram.
The density here is the same as “probability density” in statistics. The density plot can be understood as a probability density function fit to the data. An example is shown below.
ggplot(loans, aes(x = loan_amount)) +
geom_density()
Here the geom_density() function creates a density plot:
The total area under the density plot is one.
By default, the smoothness (bandwidth) of the density curve is chosen automatically from the data. In the graph below, it is drawn on top of a histogram with the default number of bins (30 bins).
To adjust the “smoothness” of the plot, change the argument adjust.
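Both properties above (unit area and the effect of adjust) can be checked numerically. This is a minimal sketch, assuming base R's density() as a stand-in for the curve that geom_density() draws, and using made-up toy data rather than the loans data.

```r
set.seed(42)
x <- rnorm(500, mean = 15000, sd = 5000)  # hypothetical toy data

d1 <- density(x)              # default bandwidth
d2 <- density(x, adjust = 2)  # bandwidth multiplied by 2: smoother curve

# Trapezoid rule: approximate the area under the estimated curve.
area <- sum(diff(d1$x) * (head(d1$y, -1) + tail(d1$y, -1)) / 2)

area            # close to 1
d2$bw / d1$bw   # the ratio of the two bandwidths equals adjust
```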
Usually, it can be a good idea to plot histogram and density in one plot:
ggplot(loans, aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)),
boundary = 0, colour = "black", fill = "white") +
geom_density(linewidth = 1.2)
Note that to plot the histogram with density on the y-axis as well, we need to add y = after_stat(density) in the aes function.
A larger adjust value gives a smoother density curve, because adjust multiplies the default bandwidth. For example, as a rough rule of thumb we can set it to 30/8 to match the curve to a histogram with 8 bins (which is smoother).
ggplot(loans, aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5000,
boundary = 0, colour = "black", fill = "white") +
geom_density(adjust = 30/8, linewidth = 1.2) #30/8 = 30bins/8bins
The after_stat function is used in the example above because the y-axis is mapped to data that are not a variable in the original data frame. For histograms, the y-axis is mapped to counts by default, and the counts are computed by another function, stat_bin().
Writing after_stat(count) or after_stat(density) indicates that count or density only becomes available after the original data are transformed.
With this understanding, we can create a relative frequency histogram (although rarely used) where the y-axis is the proportion of samples in each bin rather than the counts:
ggplot(loans) +
geom_histogram(mapping = aes(x = loan_amount, y = after_stat(count/sum(count))),
binwidth = 5000, boundary = 0, colour = "black", fill = "white")
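One way to confirm that the bars now show proportions is to check that the bar heights add up to one. The sketch below uses made-up toy data (the data frame toy and the column amount are hypothetical) and reads the computed heights with layer_data().

```r
library(ggplot2)

set.seed(7)
toy <- data.frame(amount = runif(300, min = 1000, max = 40000))  # toy data

p <- ggplot(toy) +
  geom_histogram(mapping = aes(x = amount, y = after_stat(count / sum(count))),
                 binwidth = 5000, boundary = 0)

props <- layer_data(p)$y  # bar heights after the stat has been applied
sum(props)                # the proportions add up to 1
```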
Create a histogram of the variable debt_to_income in loans with the following requirements:
Question: Can you explain the distribution of debt_to_income?
In many cases, it is very useful to add another level of aesthetic mapping to a figure. For example, in the mpg data set, if we plot hwy vs displ, we would see a plot like this:
Example: How would we explain the red dots which seem to deviate from the overall trend (larger displ leading to lower hwy)?
How do we explain those data points which have good fuel economy despite a large engine displacement? The following plot answers the question:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
We see that most of these cars are 2seaters, e.g., sports cars. Those cars are lighter with a powerful engine, and thus have better highway mpg than other non-sports cars (SUVs, pickups).
Here, we put a color argument inside the aes function. This creates a new aesthetic group (in color) by the categorical variable class.
Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency etc.) to any categorical variable. The template to do this is:
<GEOM_FUNCTION>(mapping = aes(x = ..., y = ...,
color/shape/size/linetype/alpha = <VARIABLE_NAME>))
Note that the argument must be inside the aes() function.
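The position of the argument matters because ggplot2 treats anything inside aes() as data to be mapped, and anything outside as a fixed setting. The sketch below contrasts the two cases on the mpg data, plus a common slip; the object names p_fixed, p_mapped and p_slip are made up for illustration.

```r
library(ggplot2)

# Outside aes(): a fixed appearance, every point is drawn in blue.
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): a mapping, one color per value of `class`,
# and a legend is added automatically.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Common slip: "blue" inside aes() is treated as a one-level
# categorical variable, so the points are NOT drawn in blue.
p_slip <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```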
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl))
Question: What can we learn from this graph?
For the loans data, create a scatter plot of interest_rate vs debt_to_income, mapping color to grade. What can you learn from the graph?
Sometimes we hope to compare histograms between groups. One way to do this is to use fill in the aes function. For example, if we want to investigate the effect of homeownership on loan_amount, we can do the following:
ggplot(loans, aes(loan_amount, fill = homeownership)) +
geom_histogram(binwidth = 5000, alpha = 0.5)
The argument alpha takes a value between 0 and 1 and controls the transparency of each histogram. The smaller alpha is, the more transparent the histogram. It is very useful when we plot multiple charts that overlap with each other.
When the absolute counts between groups are quite different, comparing histograms is not a good idea. Instead, we would like to compare density curves between groups to see their differences.
ggplot(loans, aes(x = loan_amount, fill = homeownership)) +
geom_density(adjust = 2, alpha = 0.5) # Transparency is necessary
So this graph shows the relatively insignificant differences between groups: people with a mortgage tend to borrow more money than those renting houses (maybe they care less about having more debt since they are indebted anyway).
When there are too many categories, a density ridge plot can be useful.
library(ggridges) # The package "ggridges" must be installed
ggplot(loans, aes(x = loan_amount, y = grade,
fill = grade, color = grade)) +
geom_density_ridges(alpha = 0.5)
This enhanced graph becomes available after installing the package ggridges. You should be able to understand the code now without my explanation after studying many examples!
Question: What can we learn from this graph?
Use ggsave to save a figure as a file in the current working folder.
ggplot(mpg) + geom_point(aes(cty, hwy))
ggsave("my-plot.pdf")
Check the current working folder to find the file you just saved on your disk. You may use the following command to know your current working folder.
getwd()
You may save your figure as pdf, png, jpeg and other compatible formats.
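By default ggsave() saves the last plot that was displayed, and the file type follows the file extension. When more control is needed, the plot and its size can be passed explicitly. This is a minimal sketch; the file location (tempdir()) and the size values are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# tempdir() is used here only so the sketch does not clutter your folder;
# replace it with a path of your choice.
out_file <- file.path(tempdir(), "my-plot.pdf")

# Pass the plot explicitly and fix the figure size in inches.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Passing the plot explicitly avoids saving the wrong figure when several plots were drawn in a row.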
Create a scatter plot of loan_amount vs interest_rate with a color grouping using the term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder. Submit your code and graph on Canvas.
As we see in previous examples, we can customize the color, shape, fill color and other aesthetic components of data-related components (symbols, lines, bars, fills etc.). Note that this is different from aesthetic grouping since we would control the appearance of the whole plot.
These components are customized by arguments inside the geom_ functions (but outside the aes() function). For example,
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(color = "blue", fill = "green", shape = 21, size = 3) +
geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)
Aesthetic features set this way apply to all data components drawn by that geom_ function.
For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
color, fill and alpha - color customization
Almost all geoms have either color or fill (or both) to customize the color of points/lines/bars/… To specify a color, one may use the following ways:
A color name, e.g. "red", "blue" etc. R has 657 built-in named colors in total.
A hexadecimal color code, e.g. #A52A2A
NA, which refers to a completely transparent color
alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
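The ways of specifying a color can be explored with a few base R functions. This is a minimal sketch; the variable names are made up, and adjustcolor() is shown only as one possible way to attach an alpha value to a color.

```r
n_named <- length(colors())   # number of built-in color names
head(colors(), 3)             # the first few names

hex_rgb   <- col2rgb("#A52A2A")  # red/green/blue components of a hex code
brown_rgb <- col2rgb("brown")    # the named color "brown" has the same components

# alpha can also be attached to a color directly:
half_red <- adjustcolor("red", alpha.f = 0.5)  # a hex code with an alpha part
```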
shape, size and linetype - point/line customization
shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
size can be specified with a numerical value (in mm) or a relative size with the rel() function.
linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
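A quick way to learn the shape codes is to draw all of them in one chart. The sketch below lays the 26 codes out on a grid (the data frame shapes and its columns are made up for illustration). Note that only shapes 21 to 25 have a separate fill color, which is why shape = 21 was used together with fill earlier.

```r
library(ggplot2)

# One row per shape code, arranged on a 9-column grid.
shapes <- data.frame(code = 0:25,
                     col  = (0:25) %% 9,
                     row  = (0:25) %/% 9)

p_shapes <- ggplot(shapes, aes(x = col, y = -row)) +
  # scale_shape_identity() uses the numbers in `code` directly as shapes
  geom_point(aes(shape = code), size = 5, color = "blue", fill = "green") +
  scale_shape_identity() +
  geom_text(aes(label = code), nudge_y = -0.35)

p_shapes
```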
Create a density plot of interest_rate in the loans data with
color to be blue
fill to be green
linetype to be dashed
linewidth to be 1.5
Now let’s learn how to polish non-data-related components. This includes:
titles and labels
axis, ticks
margins
positions of all components
grids
fonts and font sizes for all texts
legends
For a graph to be accessible to a wider audience, it must have proper axis labels and a title. In ggplot, the function labs() is used to specify these details.
ggplot(data = loans) +
geom_histogram(
mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(
title = "Interest rate from lending club data",
x = "Interest Rate (%)",
y = "Count"
)
theme()
To further polish graph details, we need to add theme() into our code. The theme() function can customize the appearance of all non-data components of the plot.
To use theme, one needs to follow this template:
theme(
COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
)
Let’s look at an example to understand this. For example, we want to center the title in the previous graph. We can do this:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5))
How the theme() function works
theme(plot.title = element_text(hjust = 0.5))
The argument plot.title of theme() specifies that we hope to customize the text appearance of the title.
To change any setting for text, we must use the element_text() function.
Therein, we set hjust to 0.5. hjust refers to the horizontal justification, and a value of 0.5 places the text in the center.
Let’s see another example. Now we want to enlarge the font size of title to be 20 pts. The following code would work:
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 20))
As an exercise, please make a guess how the following code will change the graph
theme(
plot.title = element_text(hjust = 0.5, size = 20),
axis.title = element_text(size = 15),
axis.text = element_text(size = 15)
)
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = 20), axis.title = element_text(size = 15), axis.text = element_text(size = 15))
Nobody would remember all names of arguments for the theme() function. So when you want to customize a particular element in your graph, use the help documentation as your reference.
?theme
There are three element_ functions used in theme():
- element_rect(): for borders and backgrounds
- element_line(): for lines
- element_text(): for texts
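The three functions can be combined in a single theme() call. This is a minimal sketch with arbitrary style choices; it also uses element_blank(), a fourth helper not listed above, which removes a component entirely.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  theme(
    # rectangles: borders and backgrounds
    panel.background = element_rect(fill = "white", colour = "black"),
    # lines: grid lines, axis lines, ticks
    panel.grid.major = element_line(colour = "grey80", linetype = "dashed"),
    # element_blank() draws nothing, which removes the minor grid
    panel.grid.minor = element_blank(),
    # text: titles and labels
    axis.title = element_text(face = "bold")
  )

p_theme
```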
In the future, when you come across new ways of using the theme() function, you should research by yourself to understand how they work.
rel() and margin()
There are two useful functions, rel() and margin(), when we customize our graphs.
rel() is used to specify relative sizes. For example, rel(1.5) means 1.5 times larger in size.
margin() is used to specify the margins of elements from top (t), right (r), bottom (b) and left (l), along with a unit.
theme(
axis.text = element_text(colour = "blue", size = rel(1.5)),
plot.margin = margin(1,1,1,1, unit = "cm")
)
Read https://ggplot2.tidyverse.org/reference/element.html for more details.
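When the four margins differ, the order of the values matters: positionally, margin() takes top, right, bottom, left. The sketch below shows the positional and the named form side by side; the numbers and object names are arbitrary and only for illustration.

```r
library(ggplot2)

# Positional order is top, right, bottom, left (t, r, b, l).
m_positional <- margin(1, 2, 3, 4, unit = "cm")

# Naming the sides avoids having to remember the order.
m_named <- margin(t = 1, r = 2, b = 3, l = 4, unit = "cm")

p_margin <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  theme(
    plot.margin  = m_named,
    # unit is not given here, so the default unit (points) is used
    axis.title.x = element_text(margin = margin(t = 10))
  )

p_margin
```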
An exemplary graph is shown below after adjusting the margin and font colors.
ggplot(data = loans) +
geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), colour = "red"), axis.title = element_text(colour = "blue", size = rel(1.2), margin = margin(b = 3)), axis.text = element_text(size = rel(1.2)), plot.margin = margin(1,1,1,1, unit = "cm"))
Create a simple graph with ggplot(mpg) + geom_point(aes(x = cty, y = hwy)), then make the following customization of your graph:
xlim and ylim
To set the limits of the x and y axes, which is usually needed for graph polishing, we need to use the xlim or ylim functions:
ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() +
xlim(0, 40) + ylim(0, 50) +
theme(axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)), axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)), axis.text = element_text(size = rel(1.0)), plot.margin = margin(1,1,1,1,"cm"))
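One point to keep in mind is that xlim() and ylim() discard the data outside the limits before anything is computed, so a fitted line such as geom_smooth() can change. If the goal is only to zoom in, coord_cartesian() keeps all data. The sketch below contrasts the two; the limits and the method = "lm" choice are arbitrary assumptions for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# xlim(): points outside 10..25 are removed BEFORE the line is fitted
# (ggplot2 warns about the removed rows when the plot is drawn).
p_limits <- base + xlim(10, 25)

# coord_cartesian(): all points are used for the fit,
# and only the visible window is restricted.
p_zoom <- base + coord_cartesian(xlim = c(10, 25))
```

Drawing p_limits and p_zoom side by side should make any difference between the two fitted lines visible.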
On the next page, I am going to show you a graph that is relatively polished in detail (You need really large font sizes for presentation).