Lecture 4 - Data Visualization Part 3

In this module, we will learn to

- customize histograms
- create a density plot to summarize a numeric variable
- customize data-related components (shape, color, line type etc.)
- visualize the relationship between three variables
- save your figure

Loading data set `loans`

library(tidyverse)
library(openintro)

loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)

Again, let’s work only with the variables above to keep things simple. To keep the code short for presentation, the details of the figures are not polished (but you should polish them in your assignments).

Create a histogram of `loan_amount`

ggplot(loans, aes(loan_amount)) + 
  geom_histogram(binwidth = 5000)

Add the effect of another categorical variable by filling

If we want to investigate the effect of homeownership on loan_amount, we can do the following:

ggplot(loans, aes(loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000, alpha = 0.5)

Here the argument fill inside the aes() function means “fill with a different color for each value of homeownership”.

The argument alpha takes a value between 0 and 1 and controls the transparency of the bars. The smaller alpha is, the more transparent the bars are. It is very useful when we plot multiple charts that overlap with each other.
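
One detail is worth checking here: with a fill mapping, geom_histogram() stacks the groups by default (position = "stack"), so the bars sit on top of each other rather than overlapping. The sketch below is one possible way to overlay the groups instead, so that alpha reveals where they overlap. It uses loans_full_schema directly so that it runs on its own; the object names p_stacked and p_overlaid are made up for illustration.

```r
library(ggplot2)
library(openintro)

# Default behaviour: the bars of the groups are stacked on top of each other
p_stacked <- ggplot(loans_full_schema, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, alpha = 0.5)

# position = "identity": every group starts at zero, so the bars overlap
# and the transparency shows where they do
p_overlaid <- ggplot(loans_full_schema, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, alpha = 0.5, position = "identity")

p_overlaid
```

Which version to use depends on the question: stacking shows the total count per bin, while overlaying makes the shapes of the groups easier to compare.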

Question: Is there an effect of homeownership on loan_amount?

Plot with relative frequency (proportion)

When the sample sizes differ greatly between groups (such as the different values of homeownership), it is more convenient to plot relative frequency on the y-axis, which removes the effect of sample size.

To do this, we can use the following code:

ggplot(loans) + 
  geom_histogram(aes(x = loan_amount, y = after_stat(count/sum(count))), 
                 binwidth = 5000, alpha = 0.5) + 
  labs(y = "relative frequency")

Here the after_stat() function lets us plot quantities that ggplot2 computes from the data, such as the bin counts. I will explain this later in the course. For now, you may simply copy my code and note that count/sum(count) computes the relative frequency for a single numeric data set.
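
To see what count/sum(count) produces, one option is to pull the computed bar heights out of the plot with layer_data() and add them up. This is only a quick check, not part of the plotting recipe; the names p and bars are illustrative.

```r
library(ggplot2)
library(openintro)

p <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(count / sum(count))),
                 binwidth = 5000, alpha = 0.5) +
  labs(y = "relative frequency")

# layer_data() returns the values ggplot2 computed for the first layer:
# one row per bin, with the bar height stored in the column y
bars <- layer_data(p)

# The bar heights are proportions of the whole data set, so they add up to 1
sum(bars$y)
```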

Facets with a categorical variable

To compare relative frequencies between groups, we can use facets, which draw one panel per group.

In the resulting plot, the distributions look similar across the panels, so we see no clear effect of homeownership on loan_amount.

The code to create the faceted plot is as follows:

ggplot(loans) + 
  geom_histogram(aes(x = loan_amount, y = after_stat(density*5000)),
                 binwidth = 5000) +
  facet_wrap(~ homeownership, nrow = 3) + 
  labs(y = "relative frequency")

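Here density*5000 is another way to compute relative frequency (relative frequency = density × bin width). We need it because after_stat() evaluates sum(count) over all panels and groups together, so count/sum(count) would give each bar's share of the whole data set rather than of its own panel.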
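
The difference between the two formulas can be checked by summing the bar heights within each panel. The sketch below uses the computed variable width in place of the literal 5000 (an assumption made so the code keeps working if the bin width changes), and compares it with the pooled count/sum(count) version. The object names are illustrative.

```r
library(ggplot2)
library(openintro)

# Relative frequency within each panel: density times bin width
p_facet <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(density * width)),
                 binwidth = 5000) +
  facet_wrap(~ homeownership, nrow = 3) +
  labs(y = "relative frequency")

# Same facets, but with the formula that pools all panels together
p_pooled <- ggplot(loans_full_schema) +
  geom_histogram(aes(x = loan_amount, y = after_stat(count / sum(count))),
                 binwidth = 5000) +
  facet_wrap(~ homeownership, nrow = 3)

# Sum the bar heights separately within each panel
facet_bars  <- layer_data(p_facet)
pooled_bars <- layer_data(p_pooled)
panel_totals  <- tapply(facet_bars$y, facet_bars$PANEL, sum)
pooled_totals <- tapply(pooled_bars$y, pooled_bars$PANEL, sum)

panel_totals   # each panel adds up to 1
pooled_totals  # the panels only add up to 1 when taken together
```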

The function facet_wrap() facets by a single variable, with the template

facet_wrap(~ variable_name, nrow = ..., ncol = ...)

Density plot

For continuous random variables, it is common practice to draw a density plot, which is a smoothed version of the histogram.

The density here is the same as “probability density” in statistics. The density plot can be understood as an estimate of the probability density function of the variable. Here is an example.

ggplot(loans, aes(x = loan_amount)) +
  geom_density()

Here the geom_density() function creates a density plot.

- The total area under the density curve is one.
- The curve is a kernel density estimate computed directly from the data, not from the histogram bins. The next example overlays it on a histogram with the default number of bins (30) for comparison.
- To adjust the smoothness of the curve, change the argument adjust, which multiplies the default bandwidth.

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), 
                 colour = "black", fill = "white") +
  geom_density(linewidth = 1.2)

A larger adjust value gives a smoother density curve.

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, 
                 boundary = 0, colour = "black", fill = "white") +
  geom_density(adjust = 30/8, linewidth = 1.2)  # adjust = 30/8 = 3.75 widens the default bandwidth
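
To make the role of adjust more concrete, the sketch below uses base R's density() function, on the assumption that geom_density() relies on the same kind of kernel density estimate. It checks that adjust = 2 doubles the bandwidth, approximates the area under the curve numerically, and overlays three values of adjust. The object names are illustrative.

```r
library(ggplot2)
library(openintro)

x <- loans_full_schema$loan_amount

# density() reports the bandwidth it used, after multiplying by adjust
bw_default <- density(x, na.rm = TRUE)$bw
bw_wide    <- density(x, adjust = 2, na.rm = TRUE)$bw
bw_wide / bw_default  # ratio of the two bandwidths

# Approximate the area under the default curve with the trapezoid rule
d <- density(x, na.rm = TRUE)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # close to 1

# Overlay three choices of adjust: dotted = 0.5, solid = 1, dashed = 2
p_adjust <- ggplot(loans_full_schema, aes(x = loan_amount)) +
  geom_density(adjust = 0.5, linetype = "dotted") +
  geom_density(adjust = 1) +
  geom_density(adjust = 2, linetype = "dashed")
p_adjust
```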

Lab Exercise

Create a histogram of variable debt_to_income in loans with the following requirements:

- The plotting range of x is between 0 and 100
- The binwidth is 2
- Create a density plot on top of the histogram

Question: Can you explain the distribution of debt_to_income?

Customize the appearance of data components

As we have seen in many previous cases, we can customize the color, shape, fill color and other aesthetic components of the data elements themselves.

These components are customized by arguments inside the geom_ functions (but outside the aes() function). For example,

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(color = "blue", fill = "green", shape = 21, size = 3) + 
  geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)

An aesthetic set this way applies to every data element drawn by that geom_ function.

For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

`color`, `fill` and `alpha` - color customization

Almost all geoms have color, fill, or both to set the color of points, lines, bars and so on. A color can be specified in the following ways:
- A name, such as "red" or "blue". R has 657 built-in named colors in total.
- An RGB specification, such as "#A52A2A".
- NA, which gives a completely transparent color.

alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
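
These color specifications can be explored in base R without drawing anything. The short sketch below is only an illustration of the three ideas above (names, hex codes, alpha); the variable name half_transparent_red is made up.

```r
# Number of built-in color names
length(colors())

# A hex string encodes red, green and blue with two hexadecimal digits each
col2rgb("#A52A2A")

# A color name can be converted to its red, green and blue components too
col2rgb("brown")

# An optional fourth pair of hex digits encodes alpha (opacity)
half_transparent_red <- rgb(1, 0, 0, alpha = 0.5)
half_transparent_red
```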

`shape`, `size` and `linetype` - point/line customization

- shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
- size can be specified with a numerical value (in mm). Some examples in this lecture write size = rel(2); rel() is mainly meant for theme elements, and inside a geom it should behave like the plain number 2.
- linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
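
A quick way to see what each shape code looks like is to draw all 26 of them in a grid. The sketch below is one possible layout; the data frame shapes and its columns col and row are made up for illustration. Note that only shapes 21 to 25 use fill for the interior, with color drawing the border, which is why the earlier example combines shape = 21 with both color and fill.

```r
library(ggplot2)

shapes <- data.frame(
  shape = 0:25,
  col = 0:25 %% 7,      # column position in the grid
  row = -(0:25 %/% 7)   # row position, negated so that code 0 is at the top
)

p_shapes <- ggplot(shapes, aes(x = col, y = row)) +
  geom_point(aes(shape = shape), size = 5, fill = "green", color = "blue") +
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +  # use the numbers in `shape` as shape codes directly
  theme_void()

p_shapes
```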

Lab Exercise

Create a density plot of interest_rate in loans data with

- color to be blue
- fill to be green
- linetype to be dashed
- linewidth to be 1.5

Visualize the relationship between three variables

Next, let’s learn how to use color/shape/size/linetype to group data in different visualizations. The purpose is to illustrate the relationship between three or more variables.

Example: How can you explain the red dots?

In a scatter plot of hwy against displ for the mpg data, a few points stand out (the red dots in the figure): cars with large engine displacement but good fuel economy. How can we explain them? The following plot answers the question:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

We see that most of these cars are 2seaters, i.e., sports cars. These cars have powerful engines but are lighter, so they get better highway mpg than non-sports cars with similar engines.
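
The same conclusion can be checked numerically by selecting the unusual points and counting their classes. The cutoffs displ > 5 and hwy > 20 below are an assumption about which points count as unusual; they are not given in the lecture.

```r
library(ggplot2)

# Hypothetical cutoffs for "large engine" and "good highway mileage"
unusual <- subset(mpg, displ > 5 & hwy > 20)

# Count the selected cars by class, most common class first
class_counts <- sort(table(unusual$class), decreasing = TRUE)
class_counts
```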

Aesthetic mappings

Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency etc.) to any categorical variable. The template to do this is:

<GEOM_FUNCTION>(mapping = aes(x = ..., y = ..., 
                color/shape/size/linetype/alpha = <VARIABLE_NAME>))

**Note that this is different from what we just learned for aesthetic customization since

color/shape/size/linetype/alpha = <VARIABLE_NAME>

is inside the aes() function!**
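
The difference is easy to get wrong, so here is a small sketch that contrasts the two. It shows what happens when a constant such as "blue" is placed inside aes() by mistake. The object names p_set and p_mapped are illustrative.

```r
library(ggplot2)

# Setting: every point is drawn in blue and no legend is created
p_set <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy), color = "blue")

# Mapping a constant by mistake: "blue" is treated as a categorical
# variable with a single value, so ggplot2 picks its own color and adds a legend
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = "blue"))

# Compare the colors that are actually used for drawing
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```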

Example: mapping data by `shape`

ggplot(data = mpg) + 
     geom_point(mapping = aes(x = cty, y = hwy, shape = drv), 
                size = rel(2))

Question: What can we learn from this graph?

Example: mapping data by both `shape` and `color`

We can make things even clearer by mapping two aesthetic components to the same variable.

ggplot(data = mpg) + 
     geom_point(mapping = aes(x = cty, y = hwy, shape = drv, color = drv), 
                size = rel(2), position = "jitter")
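
The argument position = "jitter" adds a small amount of random noise to each point, which helps when many observations share the same coordinates. The sketch below first counts how many rows of mpg repeat an earlier (cty, hwy) pair, then uses position_jitter() to make the amount of noise and the random seed explicit. The values 0.3 and seed = 1 are arbitrary choices for illustration.

```r
library(ggplot2)

# How many rows share their (cty, hwy) pair with an earlier row?
n_overplotted <- sum(duplicated(mpg[, c("cty", "hwy")]))
n_overplotted

# position_jitter() controls the amount of noise; a fixed seed makes
# the plot look the same every time it is drawn
p_jitter <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, shape = drv, color = drv),
             size = 2,
             position = position_jitter(width = 0.3, height = 0.3, seed = 1))

p_jitter
```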

Lab Exercise

For the loans data, create a scatter plot of interest_rate vs debt_to_income with color mapped to grade. What can you learn from the graph?

Solution

ggplot(loans) + geom_point(aes(interest_rate, debt_to_income, 
                               color = grade))

Example: mapping data by size (for ordinal variables)

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy, 
                 color = as.factor(cyl), size = as.factor(cyl)))

Question: What can you learn from this graph?

Example: mapping data by fill for histograms

Mapping by fill can be useful to investigate the effect of a categorical variable on the distribution of another numeric variable.

ggplot(loans, aes(x = loan_amount, 
                  fill = homeownership)) +
  geom_density(adjust = 2, 
               alpha = 0.5) +       # transparency keeps overlapping curves visible
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans", 
    fill = "Homeownership"          # title of the fill legend
  ) +
  theme(plot.title = element_text(hjust = 0.5))

Question: What can we learn from this graph?

Ridge plots for density curves

When there are too many categories, a density ridge plot can be useful.

library(ggridges)   # The package "ggridges" must be installed 
ggplot(loans, aes(x = loan_amount, y = grade, 
                  fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)

This graph requires the package ggridges to be installed. After studying so many examples, you should be able to understand the code without further explanation.

Question: What can we learn from this graph?

Save your figure

Use ggsave() to save a figure as a file in the current working folder. By default, it saves the most recently displayed plot.

ggplot(mpg) + geom_point(aes(cty, hwy)) 
ggsave("my-plot.pdf")

Check the current working folder to find the file you just saved. You can use the following command to print the path of your current working folder.

getwd()

You may save your figure as pdf, png, jpeg and other compatible formats.
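
Because ggsave() saves the last displayed plot unless told otherwise, it can be safer to store the plot in an object and pass it explicitly. The sketch below also sets the figure size; the 6 by 4 inch size and the use of tempdir() are arbitrary choices for illustration, and out_file is a made-up name.

```r
library(ggplot2)

# Store the plot in an object so that ggsave() does not depend on
# whichever plot happened to be displayed last
p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# tempdir() is used here only so that the sketch leaves no file behind;
# replace out_file with a path of your choice
out_file <- file.path(tempdir(), "my-plot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```

The file format is chosen from the file extension, so changing "my-plot.pdf" to "my-plot.png" would save a png file instead.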

Summary

Finding a good graph to answer a given question, or to explore data, takes thought.
- You need to know your goal
- You need to try as many ways as you can (don’t be lazy!)
- You need to understand your data very well
- You need to be proficient in basic skills and sometimes be creative

Miao Yu

2023-02-02
