Load Library and Prepare Data


library(tidyverse)
library(openintro)
loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)

In this module, we will first learn how to adjust the appearance of the data-related and non-data-related components of a figure. Then we will study a few data visualization and analysis examples, which show that in many situations the data must be transformed before it can be visualized.

color, fill and alpha - color customization


  • Almost all geoms have either color or fill (or both) to customize the color of points/lines/bars/… A color can be specified in the following ways:

    • A name, such as "red", "blue" etc. R has 657 built-in named colors in total.
    • An RGB specification in hexadecimal form, such as "#A52A2A"
    • NA, which gives a completely transparent color


  • alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
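
As a minimal sketch of these three settings, the example below uses the mpg data that ships with ggplot2 (chosen here only for illustration; it is not the loans data). Values given outside aes() apply to every point.

```r
library(ggplot2)

# Shape 21 is a circle with both an outline (color) and an interior (fill).
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(
    shape = 21,
    color = "#A52A2A",   # outline: hexadecimal RGB specification
    fill  = "steelblue", # interior: one of R's built-in color names
    alpha = 0.4,         # fairly transparent, so overlapping points look darker
    size  = 3
  )
p

length(colors())  # number of built-in color names
```

Running colors() without length() lists every available name.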

shape, size and linetype - point/line customization


  • shape can be specified with an integer between 0 and 25. Each code refers to a type of point.

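  • size can be specified with a numerical value (in mm). For lines, ggplot2 version 3.4.0 and later uses linewidth instead of size. Inside theme(), a relative size can be given with the rel() function.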

  • linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
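
The following sketch combines the three settings in one plot. It uses the economics data from ggplot2 purely as an illustration; the specific shape code and line type are arbitrary choices.

```r
library(ggplot2)

# economics ships with ggplot2; used here only as an illustration.
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(linetype = "dotdash", color = "grey40") +  # same as linetype = 4
  geom_point(shape = 17, size = 1.5)                   # 17 = filled triangle, size in mm
p
```

Swapping "dotdash" for 4 should give the same line, since the names and integer codes are interchangeable.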

Lab Exercise


Create a density plot of interest_rate in loans data with

  • color to be blue

  • fill to be green

  • linetype to be dashed

  • linewidth to be 1.5

Titles and labels


For a graph to be accessible to a wider audience, it must have proper axis labels and title. In ggplot, the function labs() is used to specify these details.

ggplot(data = loans) +
  geom_histogram(
    mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(
    title = "Interest rate from lending club data", 
    x = "Interest Rate (%)", 
    y = "Count"
       )

Customize graph details with theme()


To further polish graph details, we need to add theme() into our code. The theme() function can customize all non-data components of the plots regarding their appearances.

To use theme(), follow this template:

theme(
      COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
     )

Example


Let’s look at an example. Suppose we want to center the title of the previous graph. We can do this:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))

How theme() function works


theme(plot.title = element_text(hjust = 0.5))


  • The argument plot.title of theme() specifies that we hope to customize the text appearance of the title.

  • To change any setting for text, we must use the element_text() function.

  • Inside element_text(), we set hjust, the horizontal justification, to 0.5, which places the title in the center (0 is left-aligned and 1 is right-aligned).

Another example


Let’s see another example. Now we want to enlarge the font size of the title to 20 pts. The following code would work:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 20))

Question


As an exercise, guess how the following code will change the graph:

theme(
  plot.title = element_text(hjust = 0.5, size = 20),
  axis.title = element_text(size = 15),
  axis.text = element_text(size = 15)
)

Solution


ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 20), 
        axis.title = element_text(size = 15), 
        axis.text = element_text(size = 15))

Use help documentation


Nobody remembers all the argument names of the theme() function. So when you want to customize a particular element in your graph, use the help documentation as your reference.

?theme


There are three element_ functions commonly used in theme():

  • element_rect(): for borders and backgrounds
  • element_line(): for lines
  • element_text(): for texts

In the future, when new ways of using theme() function appear, you should research by yourself to understand how it works.
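
As a hedged illustration, the sketch below uses each kind of element function once. The component names (panel.background, panel.grid.major, panel.grid.minor, axis.title) are standard theme() arguments, but the colors and styles are arbitrary choices. It also uses element_blank(), a further ggplot2 function that draws nothing and so removes a component.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  theme(
    panel.background = element_rect(fill = "white", colour = "black"),      # background and border
    panel.grid.major = element_line(colour = "grey80", linetype = "dashed"), # lines
    panel.grid.minor = element_blank(),                                      # remove this component
    axis.title       = element_text(face = "bold")                           # text
  )
p
```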

rel() and margin()


There are two useful functions rel() and margin() when we customize our graphs.

  • rel() is used to specify relative sizes. For example, rel(1.5) means 1.5 times the size the element would otherwise have.
  • margin() is used to specify the margins of elements. Its arguments are, in order, top (t), right (r), bottom (b) and left (l), along with a unit.

theme(
  axis.text = element_text(colour = "blue", size = rel(1.5)),
  plot.margin = margin(1, 1, 1, 1, unit = "cm")
)


Read https://ggplot2.tidyverse.org/reference/element.html for more details.

Example


An exemplary graph is shown below after adjusting the margin and font colors.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), colour = "red"), 
        axis.title = element_text(colour = "blue", size = rel(1.2), margin = margin(b = 3)), 
        axis.text = element_text(size = rel(1.2)), 
        plot.margin = margin(1,1,1,1, unit = "cm"))

Lab Exercise


Starting from the simple graph ggplot(mpg) + geom_point(aes(x = cty, y = hwy)), make the following customizations:

  1. Make a title of “Fuel Economy Data”
  2. Make your x label of “miles per gallon in city”, and a y label of “miles per gallon in highway”.
  3. Make your title 2 times larger and in color blue
  4. Make your labels in x and y 1.5 times larger
  5. Change the aspect ratio of your graph to be 1 to make your graph square. Find how to do this by yourself with the assistance of help documentation.

Setting scales (xlim and ylim)


To set the limits of the x- and y-axes, which is often needed when polishing a graph, use the xlim() and ylim() functions:

ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() + 
  xlim(0, 40) + ylim(0, 50) + 
  theme(axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)), 
        axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)), 
        axis.text = element_text(size = rel(1.0)), 
        plot.margin = margin(1,1,1,1,"cm"))

Principles of graph polishing


  1. (Most important) The graph must look good in its final form and fit its actual purpose.
  2. For the same graph, the best settings differ from one output format to another.
  3. Always try different settings and check your figure in its final form.

On the next page, I am going to show you a graph that is relatively polished in detail (You need really large font sizes for presentation).


2d bin counts plot


A 2d bin counts plot divides the plane into rectangles, counts the number of cases in each rectangle, and then (by default) maps the number of cases to the rectangle's fill. It is used to resolve the “overplotting” problem, similar to using position = "jitter" in a scatter plot.

For example, in the loans_full_schema data set, if we hope to plot interest_rate against debt_to_income using a scatter plot, it looks like this:

ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = interest_rate))

This graph is not very informative since many points overlap with each other (overplotting). To make things clearer, we can use the geom_bin_2d() function to create a 2d bin counts plot.

ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate))

In the graph above, the colors represent the counts (equivalently, the density) in each rectangular bin. It is clear that most data points have a low interest rate, between 5% and 13%, combined with a low debt_to_income ratio, between 0% and 20%.

Since there are relatively few points with a debt-to-income ratio higher than 100%, we can restrict the x-axis to the range 0 to 100 and make the plot more detailed. Note that xlim() drops the observations outside this range:

ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate)) + 
  xlim(0, 100)
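
One caveat worth checking: xlim() sets scale limits, so observations outside the limits are removed before the bin counts are computed, and ggplot2 warns about the removed rows. coord_cartesian() is ggplot2's alternative that zooms the view without dropping data. The sketch below contrasts the two on the mpg data, chosen here only so that the example runs without extra packages; the limits 2 and 5 are arbitrary.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_bin_2d()

# xlim() sets scale limits: rows with displ outside [2, 5] are dropped (with a warning).
p_limits <- base + xlim(2, 5)

# coord_cartesian() only zooms the view; all rows still enter the bin counts.
p_zoom <- base + coord_cartesian(xlim = c(2, 5))

# Number of rows that fall outside the limits and would be dropped by xlim()
n_dropped <- sum(mpg$displ < 2 | mpg$displ > 5)
n_dropped
```

For a 2d bin counts plot the difference is usually small, but for layers that fit a model, such as geom_smooth(), dropping rows can change the fitted curve.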

Customize scales (breaks and labels)


Scales refer to the x- and y-ticks and their labels on axes. There are a few functions that can customize the scale. Let’s take the following graph as an example:

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income))

We see that the tick marks on the x-axis are at 0, 100, 200, 300, 400 and the tick marks on the y-axis are at 0, 500000, 1000000, 1500000 and 2000000.

Scale control functions

Let’s first learn how to change the positions of the tick marks using the scale_x_continuous() and scale_y_continuous() functions.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, 
                 y = annual_income)) +
  scale_x_continuous(breaks = seq(0, 450, 50)) +
  scale_y_continuous(breaks = seq(0, 2000000, 250000))

By setting the breaks argument inside scale_x_continuous() or scale_y_continuous(), one can define the positions of all tick marks.

Customize scales (Cont’d)

We can also customize the labels of scales.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)

Here labels = NULL removes all labels on the corresponding scale. Or we can define them by ourselves.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_continuous(labels = scales::dollar) 

We can also customize the axis title with the name argument and the axis limits with the limits argument. Some useful label formatters are scales::percent, scales::dollar and scales::comma, which change the format of the tick labels.
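
To see what these formatters do, they can be called directly on a few numbers. This is a small sketch assuming the scales package is installed (it is installed together with ggplot2); the input values are arbitrary.

```r
library(scales)

percent(0.25)    # "25%"
dollar(5000)     # "$5,000"
comma(1000000)   # "1,000,000"

# seq() generates evenly spaced positions, as used for the breaks argument
seq(0, 450, 50)
```

Passing the function itself, as in labels = scales::dollar, lets ggplot2 apply it to every tick value.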

Customize scales (Cont’d)


In many data sets, one numeric variable may span several orders of magnitude (for example, household income from $1,000 to $1,000,000). With a linear continuous scale, the graph does not show the details very well. In that case we need to change to a log scale (plotting the variable on a logarithmic axis).

For data exploration, it is common to use log10 scales:

ggplot(loans_full_schema) + 
  geom_bin_2d(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_log10(limits = c(5000, 2500000), labels = scales::dollar) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

The functions scale_y_log10() and scale_x_log10() convert the y-axis or x-axis to a log10 scale, respectively.
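
A minimal side-by-side sketch of a linear and a log10 axis is given below. It assumes the midwest data that ships with ggplot2, where county populations (poptotal) range from thousands to millions; the data set is only an illustration and is not part of the loans example.

```r
library(ggplot2)

# Linear y-axis: most counties are squeezed near the bottom of the panel.
p_linear <- ggplot(midwest, aes(x = area, y = poptotal)) + geom_point()

# Log10 y-axis: the same data, spread out across orders of magnitude.
p_log <- p_linear + scale_y_log10(labels = scales::comma)

# Equal distances on a log10 axis correspond to equal ratios:
log10(c(1000, 10000, 100000))  # 3 4 5
```

Printing p_linear and p_log one after the other shows how much detail the log scale recovers.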

Use preset themes


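There are eight preset themes offered in ggplot2 that give different settings for the axes, grid and background appearance. They are: theme_gray() (the default), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic() and theme_void().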

We can change the theme by calling the theme functions:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  theme_classic()
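
A preset theme can be combined with individual theme() settings. Because a preset theme replaces all theme settings, the theme() call should come after it. The sketch below illustrates this ordering; the choice of theme_minimal(), the base_size value and the title text are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "Fuel economy by engine size") +
  theme_minimal(base_size = 14) +                 # preset first; base_size is the base font size in pts
  theme(legend.position = "bottom",               # then the individual overrides
        plot.title = element_text(hjust = 0.5))
p
```

If theme_minimal() were added after theme(), it would reset the legend position and title justification to the preset's values.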

Facets


Using “Facets” is another way to add additional variables into a graph.

  • Facets divide a plot into subplots based on the values of one or more discrete variables.

  • When creating subplots based on the values of a single categorical variable, one should use facet_wrap(). An example is shown below.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2) +
  labs(title = "Vehicle Fuel Economy Data by Vehicle Class", 
       x = "Engine Displacement (liter)", 
       y = "Highway Mile per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

The facet_wrap() function wraps subplots into a 2-dimensional array. This is generally a better use of screen space because most displays are roughly rectangular.

In the code above, ~ class is called a formula in R. We will study it later. For now you just need to know that ~ <VARIABLE_NAME> is needed as the first argument of facet_wrap function.

In the graph above, we still plot highway mpg against engine displacement, but each subplot shows only the data for one class. By doing this, we clearly see where each group is, which is better than plotting them all together.
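
As a further sketch, facet_wrap() also accepts ncol to control the layout and scales to let each subplot use its own axis range. The example below facets by drv instead of class, which is an arbitrary choice for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ drv, ncol = 3, scales = "free_y")  # one panel per drive train, each with its own y range
p

table(mpg$drv)  # number of observations behind each panel
```

With free scales the panels are easier to read individually but harder to compare with each other, so the default shared scales are often the safer starting point.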

The facet_grid function

When creating subplots based on values of two categorical variables, one should use facet_grid():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class) + # A formula with two variables
  labs(title = "Fuel Economy Data by Vehicle Class and Drive Train Type", 
       x = "Engine Displacement (liter)", 
       y = "Highway Mile per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

In this case, a grid of subplots is created: the rows of the grid correspond to the values of drv and the columns to the values of class.

For example, the subplot at the top right corner plots hwy against displ for data points with a class value of suv and a drv value of 4, which corresponds to 4-wheel drive SUVs.

facet_grid() can be used to study the relationship between four variables (two numeric and two categorical). When the data set is large and complicated, it can be very useful to provide some insights for us.
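
The row and column variables can also be named explicitly with the rows and cols arguments and vars(), which some readers find easier to follow than the formula. The sketch below uses cyl as the column variable, an arbitrary choice for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl))  # equivalent to facet_grid(drv ~ cyl)
p

# Combinations with a count of 0 appear as empty panels in the grid
table(mpg$drv, mpg$cyl)
```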

Lab Homework


  1. Finish all Lab Exercises

  2. Create a graph based on the diamonds data set with the following requirements:

  • A grid of scatter plots with x being carat and y being price.
  • In each plot, use different colors for different clarity quality.
  • For the grid of subplots, the x-axis should refer to the different cut qualities and the y-axis to the different diamond colors.
  • The scale of the y-axis should be in a format like $5,000.
  3. Do you think the plot is informative? Provide your opinion.

Submit your answer in a single pdf or html file knitted from an R Markdown file. Submit your R Markdown file as well.