Introduction

In this class, we continue learning how to visualize data with the ggplot2 package in R. We will cover the following topics:

  1. How to create a histogram and/or a density plot
  2. How to create labels and titles
  3. How to control elements of the plot (font size, ticks, plotting ranges, grids, etc.)
  4. How to save your figure
  5. How to create plots with groups in color, size etc.


Histogram

To summarize a single numeric variable, the most commonly used chart is a histogram. An example is shown below.

A good histogram shows the shape of the distribution of the data.
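As a minimal sketch of what such a chart looks like in code, the following draws a histogram of one numeric variable. It assumes the built-in faithful data set, which is not part of the class data and is used here only for illustration.

```r
library(ggplot2)

# faithful is a built-in R data set; `waiting` is a numeric variable
# (minutes between eruptions). It is used here only as an illustration.
p_hist <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = waiting), bins = 20)

p_hist  # printing the object draws the histogram
```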


Lending Club data set

In this class, we will use the data set loans_full_schema to show examples. The data set is from the package openintro.

library(tidyverse)  # provides glimpse(), select() and ggplot()
library(openintro)

If you don’t have the openintro package installed, use install.packages("openintro") to install it first. After loading the package, we can take a look at the data.

glimpse(loans_full_schema)

Question: How many samples are there? How many variables are there?
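One way to check your answer, sketched under the assumption that the openintro package is installed: the number of rows is the number of samples, and the number of columns is the number of variables. The names n_samples and n_variables are introduced here only for illustration.

```r
library(openintro)

# Each row is one sample (one loan); each column is one variable.
n_samples   <- nrow(loans_full_schema)
n_variables <- ncol(loans_full_schema)

# dim() returns both numbers at once: c(rows, columns)
dim(loans_full_schema)
```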


Selected variables

There are too many variables in the data set. To make things simpler, we will only handle 8 of them for now by executing the following code in R.

loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)

Selecting some variables from the original data set is an example of data transformation, which will be our next topic.

We have now created a new data set named loans that stores only the selected variables (loan_amount, interest_rate, etc.) from the original data set.


Glimpse of the data

Question: What is the data set about?

glimpse(loans)
## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …


Meaning of variables


variable        description
loan_amount     Amount of the loan received, in US dollars
interest_rate   Interest rate on the loan, as an annual percentage
term            The length of the loan, which is always set as a whole number of months
grade           Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid
state           US state where the borrower resides
annual_income   Borrower’s annual income, including any second income, in US dollars
homeownership   Indicates whether the person owns, owns but has a mortgage, or rents
debt_to_income  Debt-to-income ratio, as a percentage


Variable types

<fct> means factor, which is a data type in R used to store and process categorical data.
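The following sketch shows how a factor behaves, using a small made-up vector of grades rather than the loans data. The variable names (grade_chr, grade_fct, grade_ord) are hypothetical and chosen only for this example.

```r
# A small, made-up vector of loan grades (illustrative values only)
grade_chr <- c("C", "A", "B", "A", "C")

# factor() stores the distinct categories as "levels"
grade_fct <- factor(grade_chr, levels = c("A", "B", "C"))

levels(grade_fct)   # the categories, in the order given above
table(grade_fct)    # how many observations fall in each category

# For ordinal data such as grades, ordered = TRUE records the ordering
grade_ord <- factor(grade_chr, levels = c("A", "B", "C"), ordered = TRUE)
grade_ord[1] > grade_ord[2]   # comparisons follow the level order
```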

variable        type
loan_amount     numerical, continuous
interest_rate   numerical, continuous
term            numerical, discrete
grade           categorical, ordinal
state           categorical, not ordinal
annual_income   numerical, continuous
homeownership   categorical, not ordinal
debt_to_income  numerical, continuous


Create a histogram

Now let’s summarize a numeric variable by creating a histogram. Let’s pick interest_rate, since its meaning is easy to understand.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate))


Changing the bin number - bins=20

In the previous graph, there are two “missing” bars around 10%, which look unnatural. We can resolve this by reducing the number of bins (groups).

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), bins = 20) 


Changing the bin number - bins=10

We can make the number of bins even smaller, say 10. Among the three histograms, which one do you think is the best? Why?

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), bins = 10) 


Change the bin width

In some situations, we want to specify the bin width instead of the number of bins. We can do the following:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5) 


Adjust the position of bins

We can adjust the position of the bins by specifying the center or the boundary of one bin.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) 
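To see what boundary and center actually do, one can inspect the bin edges that ggplot2 computed. The sketch below uses layer_data() from ggplot2 and reads interest_rate directly from loans_full_schema so that it runs on its own; the object names (p_boundary, p_center, and so on) are illustrative.

```r
library(ggplot2)
library(openintro)

p_boundary <- ggplot(data = loans_full_schema) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)

# layer_data() returns the values ggplot2 computed for a layer;
# for a histogram, xmin and xmax are the edges of each bin.
bins_boundary <- layer_data(p_boundary)[, c("xmin", "xmax", "count")]
bins_boundary

# center = 10 instead places the middle of one bin at 10,
# so the edges shift by half a bin width.
p_center <- ggplot(data = loans_full_schema) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, center = 10)
bins_center <- layer_data(p_center)[, c("xmin", "xmax", "count")]
bins_center
```

With boundary = 10 the bin edges fall on multiples of 5, while with center = 10 they fall halfway between them.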


Polishing details of a graph

The graph above does not look polished, and therefore does not look professional. A good-quality graph needs its details polished. Next, we will learn how to do this.


Polish details of a graph - Title and Labels

For a graph to be accessible to a wider audience, it must have proper labels and a title. In ggplot2, the labs() function is used to specify them.

ggplot(data = loans) +
  geom_histogram(
    mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(
    title = "Interest rate from lending club data", 
    x = "Interest Rate (%)", 
    y = "Count"
       )
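Besides title, x, and y, labs() also accepts subtitle and caption. The sketch below adds both to the same histogram; the subtitle and caption wording is made up for illustration, and the data is read from loans_full_schema so the example runs on its own.

```r
library(ggplot2)
library(openintro)

p_labs <- ggplot(data = loans_full_schema) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10) +
  labs(
    title = "Interest rate from lending club data",
    subtitle = "Bins are 5 percentage points wide",  # illustrative wording
    x = "Interest Rate (%)",
    y = "Count",
    caption = "Source: loans_full_schema, openintro package"
  )
p_labs
```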




Customize graph details with theme()

To polish the graph further, we add theme() to our code. The theme() function customizes the appearance of all non-data components of a plot, such as titles, labels, fonts, background, gridlines, and legends.


Basics of theme()

To use theme(), follow this template:

theme(
      COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
     )

Let’s look at an example. Suppose we want to center the title of the previous graph. We can do this:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))


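How the theme() function works

theme(plot.title = element_text(hjust = 0.5))

Following the template, plot.title is the component name, element_text() is the element-adjusting function, and hjust = 0.5 is the style setting: a horizontal justification of 0.5 centers the title.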


Another example

Let’s look at another example. Suppose we want to increase the font size of the title to 25 pt. The following code works:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 25))


Exercises

As an exercise, guess how the following code will change the graph:

theme(
  plot.title = element_text(hjust = 0.5, size = 25),
  axis.title = element_text(size = 25),
  axis.text = element_text(size = 20)
)


ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 25), axis.title = element_text(size = 25), axis.text = element_text(size = 20))


Use help documentation

Nobody remembers the names of all the arguments of the theme() function. When you want to customize a particular element of your graph, use the help documentation as a reference.

?theme

Three element_ functions are commonly used in theme():

  - element_rect(): for borders and backgrounds
  - element_line(): for lines
  - element_text(): for text
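The sketch below uses each of the three element_ functions once, on the built-in mpg data. The particular components and style values are arbitrary choices for illustration, not recommended settings.

```r
library(ggplot2)

p_theme <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  theme(
    # element_text(): text components such as titles and tick labels
    axis.title = element_text(size = 14, face = "bold"),
    # element_line(): line components such as grid lines
    panel.grid.major = element_line(colour = "grey80", linetype = "dashed"),
    # element_rect(): rectangular components such as backgrounds
    panel.background = element_rect(fill = "white", colour = "black")
  )
p_theme
```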

When you come across an unfamiliar use of theme(), look it up yourself to understand how it works.


rel() and margin()

Two functions are useful when customizing graphs: rel() sets a size relative to the inherited default, and margin() sets the top, right, bottom, and left margins.

theme(
  axis.text = element_text(colour = "blue", size = rel(1.5)),
  plot.margin = margin(1, 1, 1, 1, unit = "cm")
)

Read https://ggplot2.tidyverse.org/reference/element.html for more details.
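The theme() call above is only a fragment. A complete, runnable sketch that attaches it to a plot might look like the following; the choice of the mpg scatter plot is an assumption made for illustration.

```r
library(ggplot2)

p_margin <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  theme(
    # rel(1.5): 1.5 times the size inherited from the current theme
    axis.text = element_text(colour = "blue", size = rel(1.5)),
    # margin(top, right, bottom, left): 1 cm of space on every side
    plot.margin = margin(1, 1, 1, 1, unit = "cm")
  )
p_margin
```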


An example graph after adjusting the margin and font colors is shown below.


Exercise

Start with the simple graph ggplot(mpg) + geom_point(aes(x = cty, y = hwy)) and make the following customizations:

  1. Add the title “Fuel Economy Data”.
  2. Set the x label to “miles per gallon in city” and the y label to “miles per gallon in highway”.
  3. Make the title 2 times larger and blue.
  4. Make the x and y labels 1.5 times larger.
  5. (Advanced) Set the aspect ratio of the graph to 1 to make it square. Find out how to do this yourself with the help documentation.


Customize data components

Note that the aesthetics of data components are customized inside the geom_ functions or with other built-in functions.

For example, in most cases the col argument sets the color of the plotted data (points, lines, bars, or other objects).

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(col = "blue") + 
  geom_smooth(col = "red")

In the code above, we make all points blue and the line red. Also note that the argument names data, mapping, x, and y are omitted (think about why this works).
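Setting a fixed color, as above, is different from mapping color to a variable to show groups. The following sketch contrasts the two on the mpg data; drv is a categorical column of mpg, and the object names p_set and p_map are illustrative.

```r
library(ggplot2)

# Setting: colour is a fixed value, written outside aes()
p_set <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(colour = "blue")

# Mapping: colour depends on a variable, written inside aes();
# drv (drive train type) is a categorical column of mpg
p_map <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = drv))

p_set
p_map
```

In the second plot, ggplot2 picks one color per group and adds a legend automatically.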


Setting scales (xlim and ylim)

To set the limits of the x and y axes, which is often needed when polishing a graph, use the xlim() and ylim() functions:

ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() + 
  xlim(0, 40) + ylim(0, 50)
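One detail worth knowing: xlim() and ylim() treat observations outside the limits as missing, whereas coord_cartesian() only zooms the view. The sketch below compares the two with a deliberately narrow range (10 to 20, chosen only to make the difference visible) and counts the points that remain in each plot.

```r
library(ggplot2)

# xlim() sets the scale limits: observations outside 10-20 are treated
# as missing, so they are dropped before any statistics are computed.
p_lim <- ggplot(mpg, aes(cty, hwy)) + geom_point() + xlim(10, 20)

# coord_cartesian() only zooms the view; all observations are kept.
p_zoom <- ggplot(mpg, aes(cty, hwy)) + geom_point() +
  coord_cartesian(xlim = c(10, 20))

# Count the points that still have a valid x value in each plot
n_lim  <- sum(!is.na(layer_data(p_lim)$x))
n_zoom <- sum(!is.na(layer_data(p_zoom)$x))
c(n_lim = n_lim, n_zoom = n_zoom)
```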


Principles of graph polishing

  1. (Most important) The graph must look good in its final form and fit its actual purpose.
  2. For the same graph, the optimal settings differ when the output format differs.
  3. Always try different settings and check the figure in its final form.
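To check a figure in its final form, it helps to save it at the size it will actually be used. A minimal sketch with ggsave() follows; the file name, dimensions, and font sizes are assumptions for illustration, and a temporary directory is used so the example leaves no files behind.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  theme(axis.title = element_text(size = 20), axis.text = element_text(size = 16))

# Hypothetical output path; a temporary directory keeps the sketch self-contained
out_file <- file.path(tempdir(), "fuel_economy.pdf")

# width and height fix the final size, so fonts can be judged as they will appear
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```

Open the saved file rather than the RStudio preview when judging font sizes, since the preview pane usually has a different size.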

On the next page, I will show you a graph that is relatively polished in detail (you need really large font sizes for a presentation).