library(tidyverse)
library(openintro)
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
                grade, state, annual_income, homeownership, debt_to_income)
Again, let’s work only with the variables above to keep things simple. To keep the code short for presentation, the figures are not polished (but you should polish them in your assignments).
We start with a histogram of loan_amount:
ggplot(loans, aes(loan_amount)) +
  geom_histogram(binwidth = 5000)
If we want to investigate the effect of homeownership on loan_amount, we can do the following:
ggplot(loans, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, alpha = 0.5)
Here the argument fill in the aes function means “fill with a different color for each value of ‘homeownership’”.
The argument alpha takes a value between 0 and 1 and controls the transparency of each histogram. The smaller alpha is, the more transparent the bars. It is very useful when we plot multiple charts that overlap with each other.
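One detail worth checking here: as far as I know, geom_histogram() stacks the groups by default (position = "stack"), so the bars sit on top of each other instead of overlapping. The sketch below compares the default with position = "identity", which draws every group from zero so that the bars really overlap. It uses loans_full_schema directly to stay self-contained; the object names p_stack and p_overlap are only for illustration.

```r
library(ggplot2)
library(openintro)  # provides loans_full_schema

# Default: the bars of the groups are stacked on top of each other
p_stack <- ggplot(loans_full_schema, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, alpha = 0.5)

# position = "identity": every group's bars start at zero and overlap
p_overlap <- ggplot(loans_full_schema, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, alpha = 0.5, position = "identity")

# layer_data() returns the bar coordinates that ggplot2 computed for a layer
range(layer_data(p_stack)$ymin)    # some bars start above zero
range(layer_data(p_overlap)$ymin)  # all bars start at zero
```

With overlapping bars, alpha matters much more, because otherwise the group drawn last hides the others.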
Question: Is there an effect of homeownership on
loan_amount?
When the sample sizes differ greatly between groups (such as for the different values of homeownership), it is more convenient to plot relative frequency on the y-axis to remove the effect of sample size.
To do this, we can use the following code:
ggplot(loans) +
  geom_histogram(aes(x = loan_amount, y = after_stat(count / sum(count))),
                 binwidth = 5000, alpha = 0.5) +
  labs(y = "relative frequency")
Here the after_stat() function creates new quantities for plotting from the computed statistics. I will explain this later in this course. For now, you may simply copy my code and know that count/sum(count) computes the relative frequency for a single numeric data set.
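If you want to convince yourself that count/sum(count) really gives relative frequencies, a small check is to add up the bar heights, which should give 1. The sketch below does this with layer_data(), which returns the values ggplot2 computed for a layer. The names p and bars are only for illustration.

```r
library(ggplot2)
library(openintro)  # provides loans_full_schema

p <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(count / sum(count))),
                 binwidth = 5000)

# One row per bar; the column y holds the value of after_stat(...)
bars <- layer_data(p)
sum(bars$y)  # the bar heights add up to 1 (up to rounding error)
```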
To compare relative frequencies between groups, we can use facets, with one panel per value of homeownership.
From such a plot we can see that the effect of homeownership on loan_amount is not significant.
The code to create the plot is as follows:
ggplot(loans) +
  geom_histogram(aes(x = loan_amount, y = after_stat(density * 5000)),
                 binwidth = 5000) +
  facet_wrap(~ homeownership, nrow = 3) +
  labs(y = "relative frequency")
Here density*5000 is another way to compute relative frequency (RF = DENSITY * BINWIDTH). We need it because count/sum(count) divides by the total count over all panels or groups, so it does not give the relative frequency within each panel.
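The difference between the two formulas can be checked numerically. The following sketch builds the faceted plot with both versions and sums the bar heights within each panel. Under my understanding of ggplot2, density * binwidth sums to 1 in every panel, while count / sum(count) sums to 1 only over all panels together. The object names are only for illustration.

```r
library(ggplot2)
library(openintro)  # provides loans_full_schema

# Relative frequency via density * binwidth, one panel per group
p_panel <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(density * 5000)),
                 binwidth = 5000) +
  facet_wrap(~ homeownership)

# The same plot, but with count / sum(count)
p_total <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(count / sum(count))),
                 binwidth = 5000) +
  facet_wrap(~ homeownership)

bars_panel <- layer_data(p_panel)
bars_total <- layer_data(p_total)

# Sum of the bar heights within each panel
per_panel_density <- tapply(bars_panel$y, bars_panel$PANEL, sum)
per_panel_count   <- tapply(bars_total$y, bars_total$PANEL, sum)

per_panel_density     # about 1 in every panel
per_panel_count       # each panel's share of all loans
sum(per_panel_count)  # about 1 only when added over all panels
```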
The function facet_wrap() is used to facet by a single variable with the template
facet_wrap(~ variable_name, nrow = ..., ncol = ...)
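As a small illustration of the template, the sketch below facets a histogram of hwy in the mpg data by drv, once with nrow = 1 and once with ncol = 1. The data set and variable choice are my own; any categorical variable would do.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

p_row <- base + facet_wrap(~ drv, nrow = 1)  # panels side by side
p_col <- base + facet_wrap(~ drv, ncol = 1)  # panels stacked vertically

# drv has three values, so both plots have three panels
n_panels <- length(unique(layer_data(p_row)$PANEL))
n_panels
```

Printing p_row and p_col shows the two layouts. Stacking the panels vertically makes it easier to compare the location of the distributions along the x-axis.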
For continuous random variables, it is common practice to draw a density plot, which is a smoothed version of the histogram.
The density here is the same as “probability density” in statistics. The density plot can be understood as a probability density function fitted to the histogram. An example is shown below.
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
Here the geom_density() function creates a density plot.
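To see what “probability density” means in practice, we can check that the area under the estimated curve is about 1. The sketch below uses the base R function density(), which computes a kernel density estimate; to my knowledge geom_density() relies on the same kind of estimate, but treat that link as an assumption. The area is approximated by a simple Riemann sum.

```r
library(openintro)  # provides loans_full_schema

x <- loans_full_schema$loan_amount

d <- density(x)          # kernel density estimate on an evenly spaced grid
step <- d$x[2] - d$x[1]  # distance between neighboring grid points
area <- sum(d$y) * step  # Riemann-sum approximation of the area under the curve
area                     # close to 1
```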
The density curve can also be drawn on top of a histogram:
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour = "black", fill = "white") +
  geom_density(linewidth = 1.2)
A larger adjust value gives a smoother density curve:
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000,
                 boundary = 0, colour = "black", fill = "white") +
  geom_density(adjust = 30/8, linewidth = 1.2)  # 30/8 = 30 bins / 8 bins
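The reason a larger adjust gives a smoother curve is that adjust multiplies the bandwidth of the kernel density estimate. The sketch below shows this with the base R function density(), whose adjust argument works this way; I am assuming that the adjust argument of geom_density() behaves the same, which is how the ggplot2 documentation describes it.

```r
library(openintro)  # provides loans_full_schema

x <- loans_full_schema$loan_amount

d_default <- density(x)                   # default bandwidth
d_smooth  <- density(x, adjust = 30 / 8)  # bandwidth multiplied by 30/8

bw_ratio <- d_smooth$bw / d_default$bw
bw_ratio  # equals the adjust value, 3.75
```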
## Lab Exercise
Create a histogram of variable debt_to_income in
loans with the following requirements:
Question: Can you explain the distribution of
debt_to_income?
As we have seen in many previous examples, we can customize the color, shape, fill color and other aesthetic properties of the plotted data.
These properties are customized by arguments inside the geom_ functions (but outside the aes() function). For example,
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(color = "blue", fill = "green", shape = 21, size = 3) +
  geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)
Aesthetics set this way apply to every data point drawn by that geom_ function.
For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
## color, fill and alpha - color customization
- Use color or fill (or both) to customize the color of points/lines/bars/… To specify a color, one may use the following ways:
  - A name, e.g. "red", "blue" etc. R has 657 built-in named colors in total.
  - A hexadecimal RGB code, e.g. "#A52A2A"
  - NA, which refers to a completely transparent color
- alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.

## shape, size and linetype - point/line customization
- shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
- size can be specified with a numerical value (in mm) or a relative size with the rel() function.
- linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).

## Lab Exercise
Create a density plot of interest_rate in the loans data with
- color to be blue
- fill to be green
- linetype to be dashed
- linewidth to be 1.5

Next, let’s learn how to use color/shape/size/linetype to group data in different visualizations. The purpose is to illustrate the relationship between three or more variables.
Example: How can you explain the red dots?
How can we explain the data points that have good fuel economy despite a large engine displacement? The following plot answers the question:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
We see that most of these cars are two-seaters (class 2seater), i.e., sports cars. These cars are lighter and have powerful engines, so they get better highway mpg than other non-sports cars.
Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency, etc.) to any categorical variable. The template to do this is:
<GEOM_FUNCTION>(mapping = aes(x = ..., y = ...,
                              color/shape/size/linetype/alpha = <VARIABLE_NAME>))
**Note that this is different from what we just learned for aesthetic customization since
color/shape/size/linetype/alpha = <VARIABLE_NAME>
is inside the aes() function!**
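A common mistake shows why this distinction matters. The sketch below puts the constant "blue" once outside aes() and once inside it. In the second case ggplot2 treats "blue" as a variable with a single value and assigns it a color from its default palette. The object names p_set and p_map are only for illustration.

```r
library(ggplot2)

# Setting: color is a constant given outside aes()
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Mapping a constant by mistake: "blue" is treated as a one-level variable
p_map <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

# layer_data() shows the color actually used for the points
unique(layer_data(p_set)$colour)  # "blue"
unique(layer_data(p_map)$colour)  # a default palette color, not blue
```

The second plot also gets a legend with the single entry "blue", which is a quick visual hint that a constant ended up inside aes().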
For example, we can map shape to drv:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, shape = drv),
             size = rel(2))
Question: What can we learn from this graph?
We can make things even clearer by mapping two aesthetic components, shape and color, to the same variable.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, shape = drv, color = drv),
             size = rel(2), position = "jitter")
For the loans data, create a scatter plot of interest_rate vs debt_to_income with color mapped to grade. What can you learn from the graph?
ggplot(loans) +
  geom_point(aes(interest_rate, debt_to_income, color = grade))
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy,
                 color = as.factor(cyl), size = as.factor(cyl)))
Question: What can you learn from this graph?
Mapping by fill can be useful to investigate the effect
of a categorical variable on the distribution of another numeric
variable.
ggplot(loans, aes(x = loan_amount,
                  fill = homeownership)) +
  geom_density(adjust = 2,
               alpha = 0.5) +  # transparency keeps overlapping curves visible
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans",
    fill = "Homeownership"  # title of the fill legend
  ) +
  theme(plot.title = element_text(hjust = 0.5))
Question: What can we learn from this graph?
When there are too many categories, a density ridge plot can be useful.
library(ggridges)  # the package "ggridges" must be installed
ggplot(loans, aes(x = loan_amount, y = grade,
                  fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
This enhanced graph becomes available after installing the package
ggridges. You should be able to understand the code now
without explanation after studying so many examples.
Question: What can we learn from this graph?
Use ggsave to save the most recently displayed figure as a file in the current working folder.
ggplot(mpg) + geom_point(aes(cty, hwy))
ggsave("my-plot.pdf")
Check the current working folder to find the file you just saved. You may use the following command to find out your current working folder.
getwd()
You may save your figure as pdf, png, jpeg and other compatible formats.
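If you prefer not to depend on which plot was displayed last, you can store the plot in a variable and pass it to ggsave() explicitly. The sketch below also sets the size of the figure. The file location (a temporary folder) and the names p and out_file are my own choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# Hypothetical output location: a temporary folder instead of the working folder
out_file <- file.path(tempdir(), "my-plot.pdf")

# plot = p saves this specific plot; width and height are in inches by default
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)  # TRUE if the file was written
```

The file format is chosen from the file extension, so changing the name to end in .png would save a png file instead.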