loans
library(tidyverse)
library(openintro)
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
grade, state, annual_income, homeownership, debt_to_income)
Again, let’s only work with the variables above to keep things simple. To keep the code short for presentation, the figures are not polished (but you should polish them in your assignments).
loan_amount
ggplot(loans, aes(loan_amount)) +
geom_histogram(binwidth = 5000)
If we want to investigate the effect of homeownership on loan_amount, we can do the following:
ggplot(loans, aes(loan_amount, fill = homeownership)) +
geom_histogram(binwidth = 5000, alpha = 0.5)
Here the argument fill in the aes() function means “fill with a different color for each value of homeownership”.
The argument alpha takes a value between 0 and 1 and controls the transparency of each histogram. The smaller alpha is, the more transparent the bars are. It is very useful when we plot multiple charts that overlap with each other.
Question: Is there an effect of homeownership on loan_amount?
When the sample sizes differ greatly between groups (such as the different values of homeownership), it is more convenient to plot relative frequency on the y-axis to remove the effect of sample size.
To do this, we can use the following code:
ggplot(loans) +
geom_histogram(aes(x = loan_amount, y = after_stat(count/sum(count))),
binwidth = 5000, alpha = 0.5) +
labs(y = "relative frequency")
Here the after_stat() function creates new quantities for plotting from the values computed by the histogram statistic. I will explain this later in this course. For now, you may simply copy my code and note that count/sum(count) computes the relative frequency of each bin for a single numeric data set.
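To see what count/sum(count) computes, here is a minimal sketch in base R that does the same arithmetic by hand. The vector amounts is made up for illustration (it is not the loans data), and the bin edges are chosen for simplicity, so they need not match the edges ggplot2 picks.

```r
# Hypothetical loan amounts, made up for illustration only
amounts <- c(1000, 3000, 4500, 6000, 7500, 9000, 12000, 14000, 18000, 22000)

# Bin edges with a width of 5000, starting at 0
breaks <- seq(0, 25000, by = 5000)

# Count the observations that fall in each bin
counts <- table(cut(amounts, breaks = breaks, include.lowest = TRUE))

# Relative frequency = count in the bin / total count
rel_freq <- counts / sum(counts)
print(rel_freq)

# Relative frequencies always add up to 1
print(sum(rel_freq))
```

Because every bin is divided by the same total, the shape of the histogram does not change; only the scale of the y-axis does.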
To compare relative frequencies between different groups, we can use facets. An example is shown below:
So we can see that homeownership has no noticeable effect on loan_amount.
The code to create the previous plot is as follows:
ggplot(loans) +
geom_histogram(aes(x = loan_amount, y = after_stat(density*5000)),
binwidth = 5000) +
facet_wrap(~ homeownership, nrow = 3) +
labs (y = "relative frequency")
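Here density*5000 is another way to compute relative frequency (relative frequency = density × binwidth). We need it because count/sum(count) divides by the total count over all panels and groups combined, so it does not give the relative frequency within each panel.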
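The difference between the two formulas can be checked numerically. The sketch below is only an illustration: it uses the mpg data that ships with ggplot2 instead of loans, a binwidth of 5 chosen for that data, and layer_data() to look at the values ggplot2 computed for each panel.

```r
library(ggplot2)

# Relative frequency computed as count / sum(count), faceted by drive type
p_count <- ggplot(mpg) +
  geom_histogram(aes(x = hwy, y = after_stat(count / sum(count))),
                 binwidth = 5) +
  facet_wrap(~ drv, nrow = 3)

# Relative frequency computed as density * binwidth, faceted the same way
p_density <- ggplot(mpg) +
  geom_histogram(aes(x = hwy, y = after_stat(density * 5)),
                 binwidth = 5) +
  facet_wrap(~ drv, nrow = 3)

# Extract the computed bar heights (column y) together with the panel id
d_count <- layer_data(p_count)
d_density <- layer_data(p_density)

# With count / sum(count), the bars sum to 1 only over all panels together
print(tapply(d_count$y, d_count$PANEL, sum))

# With density * binwidth, the bars sum to 1 within every panel
print(tapply(d_density$y, d_density$PANEL, sum))
```

The second version is the one to use when panels with different sample sizes should be comparable.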
The function facet_wrap() is used to facet by a single variable, with the template
facet_wrap(~ variable_name, nrow = ..., ncol = ...)
For continuous variables, it is common practice to draw a density plot, which is a smoothed version of the histogram.
The density here is the same as “probability density” in statistics. The density plot can be understood as a probability density function estimated from the data, much like a smoothed histogram. An example is below.
ggplot(loans, aes(x = loan_amount)) +
geom_density()
Here the geom_density() function creates a density plot.
adjust
ggplot(loans, aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black", fill = "white") +
geom_density(linewidth = 1.2)
A larger adjust value gives a smoother density curve:
ggplot(loans, aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5000,
boundary = 0, colour = "black", fill = "white") +
geom_density(adjust = 30/8, linewidth = 1.2) #30/8 = 30bins/8bins
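geom_density() draws a kernel density estimate, and adjust multiplies the bandwidth that the default rule would choose. The sketch below uses base R's density() on simulated data (the vector x is made up for illustration) to show that adjust = 2 doubles the bandwidth, which is why the curve gets smoother.

```r
set.seed(1)
x <- rnorm(200)  # simulated data, for illustration only

d1 <- density(x)              # default bandwidth rule, adjust = 1
d2 <- density(x, adjust = 2)  # same rule, bandwidth multiplied by 2

print(d1$bw)
print(d2$bw)

# The bandwidth actually used scales linearly with adjust
print(all.equal(d2$bw, 2 * d1$bw))
```

Values of adjust below 1 shrink the bandwidth instead and produce a wigglier curve.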
## Lab Exercise
Create a histogram of the variable debt_to_income in loans with the following requirements:
Question: Can you explain the distribution of debt_to_income?
As we have seen in many previous examples, we can customize the color, shape, fill color and other aesthetic components of the plotted data.
These components are customized by arguments inside the geom_ functions (but outside the aes() function). For example,
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(color = "blue", fill = "green", shape = 21, size = 3) +
geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)
Aesthetic values set this way apply to all data points drawn by that geom_ function.
For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
color, fill and alpha - color customization
- Use color or fill (or both) to customize the color of points/lines/bars/… To specify a color, one may use the following ways:
  - A color name such as "red", "blue" etc. R has 657 built-in named colors in total.
  - A hexadecimal RGB code such as #A52A2A
  - NA, which refers to a completely transparent color
- alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
shape, size and linetype - point/line customization
- shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
- size can be specified with a numerical value (in mm) or a relative size with the rel() function.
- linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
## Lab Exercise
Create a density plot of interest_rate in loans data with
- color to be blue
- fill to be green
- linetype to be dashed
- linewidth to be 1.5

Next, let’s learn how to use color/shape/size/linetype to group data in different visualizations. The purpose is to illustrate the relationship between three or more variables.
Example: How can you explain the red dots?
How do we explain the data points that have good fuel economy despite a large engine displacement? The following plot answers the question:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
We see that most of these cars are two-seaters (class 2seater), i.e., sports cars. These cars are lighter and have a powerful engine, so they get better highway mpg than non-sports cars.
Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency etc.) to any categorical variable. The template to do this is:
<GEOM_FUNCTION>(mapping = aes(x = ..., y = ...,
color/shape/size/linetype/alpha = <VARIABLE_NAME>))
**Note that this is different from what we just learned for aesthetic customization, since color/shape/size/linetype/alpha = <VARIABLE_NAME> is inside the aes() function!**
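To make the contrast concrete, the sketch below builds three versions of the same scatter plot of mpg and inspects the colors ggplot2 actually assigns, using layer_data(). The object names p_set, p_map and p_slip are made up for this illustration.

```r
library(ggplot2)

# Setting: color is a fixed value, given outside aes()
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Mapping: color depends on a variable, given inside aes()
p_map <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

# A common slip: a constant inside aes() is treated as a one-level variable,
# so the points get a default palette color and a legend, not blue
p_slip <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

print(unique(layer_data(p_set)$colour))          # the fixed color we set
print(length(unique(layer_data(p_map)$colour)))  # one color per class
print(unique(layer_data(p_slip)$colour))         # a palette color, not "blue"
```

If a plot shows an unexpected color together with a one-entry legend, the likely cause is a constant placed inside aes().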
shape
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy, shape = drv),
size = rel(2))
Question: What can we learn from this graph?
shape and color
We can make things even clearer by mapping two aesthetic components to the same variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy, shape = drv, color = drv),
size = rel(2), position = "jitter")
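The code above also uses position = "jitter", which moves each point by a small random amount so that cars with identical cty and hwy values no longer sit exactly on top of each other. The sketch below counts how many rows of mpg repeat an earlier (cty, hwy) pair, and then uses position_jitter() with a seed so that the random noise is reproducible. The jitter width and height of 0.3 are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Rows whose (cty, hwy) pair already appeared in an earlier row;
# without jitter these points are hidden behind other points
n_overlap <- sum(duplicated(mpg[, c("cty", "hwy")]))
print(n_overlap)
print(nrow(mpg))

# Same plot as above, but with explicit and reproducible jitter
p <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, shape = drv, color = drv),
             size = 2,
             position = position_jitter(width = 0.3, height = 0.3, seed = 1))
print(p)
```

Jitter changes where points are drawn, not the data, so it is best kept small relative to the spacing of the values.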
For the loans data, create a scatter plot of interest_rate vs debt_to_income, mapping color to grade. What can you learn from the graph?
ggplot(loans) + geom_point(aes(interest_rate, debt_to_income,
color = grade))
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy,
                 color = as.factor(cyl), size = as.factor(cyl)))
Question: What can you learn from this graph?
Mapping by fill can be useful to investigate the effect of a categorical variable on the distribution of another numeric variable.
ggplot(loans, aes(x = loan_amount,
                  fill = homeownership)) +
  geom_density(adjust = 2,
               alpha = 0.5) +  # transparency keeps overlapping curves visible
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans",
    fill = "Homeownership"  # title of the fill legend
  ) +
  theme(plot.title = element_text(hjust = 0.5))  # center the plot title
Question: What can we learn from this graph?
When there are too many categories, a density ridge plot can be useful.
library(ggridges) # The package "ggridges" must be installed
ggplot(loans, aes(x = loan_amount, y = grade,
fill = grade, color = grade)) +
geom_density_ridges(alpha = 0.5)
This enhanced graph becomes available after installing the package ggridges. You should be able to understand the code without further explanation after studying so many examples.
Question: What can we learn from this graph?
Use ggsave() to save a figure as a file in the current working folder. By default, it saves the most recently displayed plot.
ggplot(mpg) + geom_point(aes(cty, hwy))
ggsave("my-plot.pdf")
Check the current working folder to find the file you just saved. You may use the following command to find your current working folder.
getwd()
You may save your figure as pdf, png, jpeg and other supported formats; ggsave() picks the format from the file extension.
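When a script creates several plots, it can be safer to pass the plot to ggsave() explicitly and to state the size. The sketch below is one way to do this; the object names p and out_file and the 6 by 4 inch size are choices made for this illustration, and the file is written to a temporary folder so the example leaves nothing behind in your working folder.

```r
library(ggplot2)

# Store the plot in an object instead of relying on "the last plot"
p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# Build a file path inside R's temporary folder
out_file <- file.path(tempdir(), "my-plot.pdf")

# Save the stored plot; width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

# Confirm that the file was written
print(file.exists(out_file))
```

Replacing the path with a plain file name such as "my-plot.pdf" writes the file to the current working folder instead.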