In this module, we will learn to


Loading the loans data set

library(tidyverse)
library(openintro)

loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)

Again, we work only with the variables above to keep things simple. To keep the code short for presentation, the figures are not polished (but you should polish them in your assignments).


Create a histogram of loan_amount

ggplot(loans, aes(loan_amount)) + 
  geom_histogram(binwidth = 5000)


Add the effect of another categorical variable by filling

If we want to investigate the effect of homeownership on loan_amount, we can do the following:

ggplot(loans, aes(loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000, alpha = 0.5) 

Here the fill argument in aes() means “fill with a different color for each value of homeownership”.

The alpha argument takes a value between 0 and 1 and controls the transparency of the bars: the smaller alpha is, the more transparent they are. It is very useful when several charts overlap each other.
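One detail worth checking: by default geom_histogram() stacks the bars of different groups on top of each other rather than overlapping them. The sketch below is a minimal illustration of truly overlapping histograms. The argument position = "identity" is an addition that the original example does not use, and the built-in mpg data stands in for loans so the block runs without the openintro package.

```r
library(ggplot2)

# position = "identity" draws every group's bars from zero,
# so the groups overlap and alpha keeps the hidden bars visible.
p_overlap <- ggplot(mpg, aes(hwy, fill = drv)) +
  geom_histogram(binwidth = 2, alpha = 0.5, position = "identity")
p_overlap

# For comparison: the default position stacks the groups.
p_stacked <- ggplot(mpg, aes(hwy, fill = drv)) +
  geom_histogram(binwidth = 2, alpha = 0.5)
p_stacked
```

The same argument can be added to the loans histogram above if overlapping bars are preferred to stacked ones.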


Question: Is there an effect of homeownership on loan_amount?


Plot with relative frequency (proportion)

When the sample sizes differ a lot between groups (for example, between the values of homeownership), it is more convenient to plot relative frequency on the y-axis, which removes the effect of sample size.
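Before rescaling, it can help to check how unequal the groups actually are. A minimal sketch, assuming the openintro package is installed; group_sizes is a name introduced here for illustration.

```r
library(dplyr)
library(openintro)

# Number of loans in each homeownership group
group_sizes <- count(loans_full_schema, homeownership)
group_sizes
```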

To do this, map y to the proportion of observations in each bin:

ggplot(loans) + 
  geom_histogram(aes(x = loan_amount, y = after_stat(count/sum(count))), 
                 binwidth = 5000, alpha = 0.5) + 
  labs(y = "relative frequency")

Here the after_stat() function computes new quantities from the binned data for plotting. I will explain it later in this course. For now, you may simply copy my code and note that count/sum(count) computes the relative frequency for a single numeric data set.
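To see what count/sum(count) produces, we can pull the computed bins out of a plot with layer_data(). This is a small sketch on the built-in mpg data (an assumption made so the block runs without openintro); p_rf and bins are illustrative names.

```r
library(ggplot2)

p_rf <- ggplot(mpg) +
  geom_histogram(aes(x = hwy, y = after_stat(count / sum(count))),
                 binwidth = 5)

# One row per bin: count is the raw frequency, y is the bar height
bins <- layer_data(p_rf)
bins[, c("xmin", "xmax", "count", "y")]

# The bar heights are proportions, so they add up to 1
sum(bins$y)
```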


Facets with a categorical variable

To compare relative frequencies between different groups, we can use facets, with one panel per group.

In the faceted plot, the distribution of loan_amount looks similar across the homeownership groups, so homeownership does not appear to have a strong effect on loan_amount.


The code to create the plot is as follows:

ggplot(loans) + 
  geom_histogram(aes(x = loan_amount, y = after_stat(density*5000)),
                 binwidth = 5000) +
  facet_wrap(~ homeownership, nrow = 3) + 
  labs(y = "relative frequency")

Here density*5000 is another way to compute the relative frequency (relative frequency = density × binwidth). We need it because count/sum(count) won’t work with multiple panels or groups.
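The difference between the two formulas can be checked numerically. The sketch below uses the built-in mpg data with a binwidth of 5 (both are assumptions for illustration) and sums the bar heights within each panel.

```r
library(ggplot2)

# density * binwidth: relative frequency within each panel
p_facet <- ggplot(mpg) +
  geom_histogram(aes(x = hwy, y = after_stat(density * 5)), binwidth = 5) +
  facet_wrap(~ drv, nrow = 3)
bins_facet <- layer_data(p_facet)
panel_totals <- tapply(bins_facet$y, bins_facet$PANEL, sum)
panel_totals   # each panel adds up to 1

# count / sum(count): the denominator covers all panels together
p_naive <- ggplot(mpg) +
  geom_histogram(aes(x = hwy, y = after_stat(count / sum(count))),
                 binwidth = 5) +
  facet_wrap(~ drv, nrow = 3)
bins_naive <- layer_data(p_naive)
naive_totals <- tapply(bins_naive$y, bins_naive$PANEL, sum)
naive_totals   # each panel is below 1; only the grand total is 1
```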

The function facet_wrap() facets by a single variable, with the template

facet_wrap(~ variable_name, nrow = ..., ncol = ...)


Density plot

For continuous random variables, it is common practice to draw a density plot, which is a smoothed version of the histogram.

The density here is the same as the “probability density” in statistics. The density plot can be understood as a probability density function fitted to the histogram. An example is below.

ggplot(loans, aes(x = loan_amount)) +
  geom_density()

Here the geom_density() function creates the density plot.
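Because the curve is a probability density, the area under it should be close to 1. A rough check, assuming the default kernel settings and using base R's density() on the built-in mpg data (both choices are illustrative, not part of the original example):

```r
library(ggplot2)  # only needed for the mpg data

# Kernel density estimate evaluated on a grid of x values
d <- density(mpg$hwy)

# Trapezoid rule: sum of (grid step) * (average of neighbouring heights)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The result is only approximately 1, since the curve is evaluated on a finite grid.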


ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), 
                 colour = "black", fill = "white") +
  geom_density(linewidth = 1.2)


A larger adjust value gives smoother density curves

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, 
                 boundary = 0, colour = "black", fill = "white") +
  geom_density(adjust = 30/8, linewidth = 1.2) #30/8 = 30bins/8bins
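To see the effect of adjust on its own, we can overlay several curves of the same variable. In this sketch the values 0.5, 1 and 3, the colours, and the use of the built-in mpg data are all illustrative assumptions; adjust multiplies the default bandwidth, and adjust = 1 is the default.

```r
library(ggplot2)

p_adjust <- ggplot(mpg, aes(x = hwy)) +
  geom_density(adjust = 0.5, colour = "red") +    # narrower bandwidth, wigglier
  geom_density(adjust = 1,   colour = "black") +  # default bandwidth
  geom_density(adjust = 3,   colour = "blue")     # wider bandwidth, smoother
p_adjust
```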


Lab Exercise

Create a histogram of the variable debt_to_income in loans that meets the following requirements:

  1. The plotting range of x is between 0 and 100
  2. The binwidth is 2
  3. Create a density plot on top of the histogram

Question: Can you explain the distribution of debt_to_income?


Customize the appearance of data components

As we have seen in many previous examples, we can customize the color, shape, fill color and other aesthetic properties of the plotted data.

These properties are set by arguments inside the geom_ functions (but outside the aes() function). For example,

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(color = "blue", fill = "green", shape = 21, size = 3) + 
  geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)

Each aesthetic set this way applies to every data point or line drawn by that geom_ function.

For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html


color, fill and alpha - color customization


shape, size and linetype - point/line customization


Lab Exercise

Create a density plot of interest_rate in loans data with


Visualize the relationship between three variables

Next, let’s learn how to use color, shape, size and linetype to group data in different visualizations. The purpose is to illustrate the relationship between three or more variables.

Example: How can you explain the red dots?


How can we explain the data points that have good fuel economy despite a large engine displacement? The following plot answers the question:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

We see that most of these cars are 2seaters, i.e., sports cars. Those cars are lighter and have a powerful engine, so they get better highway mpg than non-sports cars with similarly large engines.


Aesthetic mappings

Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency, etc.) to a categorical variable. The template is:

<GEOM_FUNCTION>(mapping = aes(x = ..., y = ..., 
                color/shape/size/linetype/alpha = <VARIABLE_NAME>))

**Note that this is different from the aesthetic customization we just learned, because color/shape/size/linetype/alpha = <VARIABLE_NAME> is inside the aes() function!**
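The difference between setting and mapping is easiest to see side by side. A minimal sketch on the built-in mpg data; the object names p_fixed, p_mapped and p_slip are introduced here for illustration.

```r
library(ggplot2)

# Outside aes(): a fixed appearance. Every point is blue and there is no legend.
p_fixed <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Inside aes(): color is mapped to a variable. One color per class, with a legend.
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

# Common slip: a constant inside aes() is treated as data, so the points get
# a default palette color and the legend shows a single entry labelled "blue".
p_slip <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))
```

Printing each object in turn shows the three results.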

Example: mapping data by shape

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cty, y = hwy, shape = drv), 
             size = 2)

Question: What can we learn from this graph?


Example: mapping data by both shape and color

We can make things even clearer by mapping two aesthetic components to the same variable.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cty, y = hwy, shape = drv, color = drv), 
             size = 2, position = "jitter")
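The argument position = "jitter" adds a small amount of random noise to each point, so that points sharing the same integer cty and hwy values no longer hide each other. Because the noise is random, the plot changes on every run. A sketch of a reproducible version, where the width, height and seed values are illustrative assumptions:

```r
library(ggplot2)

p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, shape = drv, color = drv),
             size = 2,
             # width/height: maximum shift in each direction; seed fixes the noise
             position = position_jitter(width = 0.3, height = 0.3, seed = 1))
p_jitter
```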


Lab Exercise

For the loans data, create a scatter plot of interest_rate vs debt_to_income, mapping color to grade. What can you learn from the graph?


Solution

# "interest_rate vs debt_to_income": interest_rate on the y-axis
ggplot(loans) + 
  geom_point(aes(x = debt_to_income, y = interest_rate, 
                 color = grade))


Example: mapping data by size (for ordinal variables)

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, 
                 color = as.factor(cyl), size = as.factor(cyl)))

Question: What can you learn from this graph?


Example: mapping data by fill for histograms

Mapping by fill can be useful to investigate the effect of a categorical variable on the distribution of a numeric variable.

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) +
  geom_density(adjust = 2, 
               alpha = 0.5) +       # transparency keeps overlapping curves visible
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
    fill = "Homeownership"          # title of the fill legend
  ) +
  theme(plot.title = element_text(hjust = 0.5))  # center the title


Question: What can we learn from this graph?

Ridge plots for density curves

When there are too many categories, a density ridge plot can be useful.

library(ggridges)   # The package "ggridges" must be installed 
ggplot(loans, aes(x = loan_amount, y = grade, 
                  fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)

This plot requires the ggridges package. After studying so many examples, you should now be able to understand the code without further explanation.


Question: What can we learn from this graph?


Save your figure

Use ggsave() to save a figure as a file in the current working folder. By default, it saves the last plot that was displayed.

ggplot(mpg) + geom_point(aes(cty, hwy)) 
ggsave("my-plot.pdf")

Check the current working folder to find the file you just saved. You can use the following command to find your current working folder.

getwd()

You may save your figure as pdf, png, jpeg or another supported format; ggsave() chooses the format from the file extension.
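It is often safer to store the plot in an object and pass it to ggsave() explicitly, together with the output size. A minimal sketch: the file name, the size and the dpi value are illustrative assumptions, and the file is written to a temporary folder so that the working folder is left untouched.

```r
library(ggplot2)

p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# Build a path inside R's temporary folder
out_file <- file.path(tempdir(), "my-plot.png")

# Save this specific plot as a 6 x 4 inch PNG at 300 dpi
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```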


Summary