Data Visualization

Introduction

In this class, we continue to learn how to visualize data with the ggplot2 package in R. We will cover the following topics:

How to create a histogram and/or a density plot
How to customize histograms and density plots
Use aesthetic grouping to explore the effect of an additional categorical variable
Save your figure to disk
Control the appearance of data-related components
Control the appearance of non-data-related components

library(tidyverse)

Lending Club data set

In this class, we will use the data set loans_full_schema to show examples. The data set is from the package openintro.

library(openintro) # Install the package if it's not available

If you don’t have the package installed, use install.packages("openintro") to install it first. After loading the package, we can take a look at the data.

glimpse(loans_full_schema)

Question: How many samples are there? How many variables are there?

Histogram

To summarize a single numeric variable, the most commonly used chart is a histogram. An example is shown below.

A good histogram shows the distribution shape of the data set.

Selected variables

There are too many variables in the data set. To make things simpler, we will only handle 8 of them for now by executing the following code in R.

loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)

This operation of selecting some variables from the original data set belongs to data transformation, which will be our topic in the next chapter.

Now we create a new data set named loans that only stores the selected variables (loan_amount, interest_rate etc.) from the original set.

Glimpse of the data

Question: What is the data set about?

glimpse(loans)

## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, 24000, 20000, 20…
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, 13.59, 11.99, 1…
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, 60, 60, 36, 60,…
## $ grade          <fct> C, C, D, A, C, A, C, B, C, A, C, B, C, B, D, D, D, F, E…
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, IL, FL, SC, CO,…
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 34000, 35000, 110000…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, MORTGAGE, MORTGA…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, 23.66, 16.19, 3…

Meaning of variables

variable	description
`loan_amount`	Amount of the loan received, in US dollars
`interest_rate`	Interest rate on the loan, in an annual percentage
`term`	The length of the loan, which is always set as a whole number of months
`grade`	Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid
`state`	US state where the borrower resides
`annual_income`	Borrower’s annual income, including any second income, in US dollars
`homeownership`	Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income`	Debt-to-income ratio, in percentage

Variable types

<fct> means factor, which is a data type in R used to store and process categorical data.

variable	type
`loan_amount`	numerical, continuous
`interest_rate`	numerical, continuous
`term`	numerical, discrete
`grade`	categorical, ordinal
`state`	categorical, not ordinal
`annual_income`	numerical, continuous
`homeownership`	categorical, not ordinal
`debt_to_income`	numerical, continuous

`table` and `unique` function

Oftentimes, we hope to quickly check all values for a categorical or discrete variable. There are two functions to fulfill the job:

unique(loans$term)

## [1] 60 36

The unique function returns all unique values for a vector. As we can see here, for all the loans in the data set, the term length is either 36 or 60 months.

An even more powerful function is the table function, which creates a frequency table for any given categorical or discrete variable.

table(loans$term)

## 
##   36   60 
## 6970 3030

As we see, the function not only lists all values of the variable, but also lists the counts (frequency) for each value.
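
As a small follow-up sketch, the two functions combine well with length(), sort() and prop.table() from base R. The vector term_example below is made up for illustration and is not the loans data.

```r
# Hypothetical loan terms in months (illustration only, not the real data)
term_example <- c(36, 60, 36, 36, 60, 36)

# Number of distinct values
n_distinct_terms <- length(unique(term_example))

# Frequency table sorted from most to least common
term_counts <- sort(table(term_example), decreasing = TRUE)

# Name of the most common value (names of a table are character strings)
most_common <- names(term_counts)[1]

# Relative frequencies instead of raw counts
term_props <- prop.table(table(term_example))
```

The same pattern should work on a column such as loans$term.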

Lab Exercise

Answer the following questions by using unique or table function:

How many distinct values are there for homeownership variable? Which value is the most common one?
How many distinct interest rates are there? Which value is the most common one?
Apply table function to the annual_income variable. Do you think the result is helpful or not?

Create a histogram

Now let’s learn how to summarize a numeric variable by creating a histogram. Let’s pick interest_rate since its meaning is easy to understand.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate))

By default, there are 30 bins and their ranges are automatically determined. In many cases, this won’t give us a satisfactory graph. We need to customize it.

Changing the bin number - `bins=20`

In the previous graph, there are two “missing” bars around 10%, which look unnatural. We can resolve this by reducing the number of bins (groups).

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), bins = 20)

Changing the bin number - `bins=10`

We can make the bin number even smaller, say 10. Among the three histograms, which one do you think is the best? Why?

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), bins = 10)

Change the bin width

In some situations, we hope to specify the bin width instead of bin number. We may do the following:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5)

Adjust the position of bins

We can specify the center or boundary of one bin to make adjustments to the position of bins.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)

Adjust the position of bins

Alternatively, we can specify the center of one bin. Here each bin is 1 unit wide and one of the bins is centered at 10.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 1, center = 10)
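
To see what boundary and center actually do, one option is to inspect the computed bin edges with layer_data() from ggplot2. This is only a sketch: the data frame rates is a made-up stand-in for the interest_rate column.

```r
library(ggplot2)

# Hypothetical interest rates (illustration only, not the real data)
rates <- data.frame(
  interest_rate = c(5.3, 6.7, 9.9, 10.4, 12.6, 14.1, 17.1, 21.5)
)

p_boundary <- ggplot(rates) +
  geom_histogram(aes(x = interest_rate), binwidth = 5, boundary = 10)

p_center <- ggplot(rates) +
  geom_histogram(aes(x = interest_rate), binwidth = 1, center = 10)

# layer_data() returns the values computed by the stat;
# xmin and xmax are the edges of each bin
edges_boundary <- layer_data(p_boundary)[, c("xmin", "xmax", "count")]
edges_center   <- layer_data(p_center)[, c("xmin", "xmax", "count")]

edges_boundary  # one edge falls on 10, the others are 5 apart
edges_center    # one bin runs from 9.5 to 10.5, so it is centered at 10
```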

Adjust the plotting range of x

It may look a little unnatural if we don’t plot x starting from zero. We may fix it by using the function xlim.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 1, center = 10) +
  xlim(0, 40)
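
One caveat: xlim() sets the limits of the scale, so anything falling outside the range is dropped before plotting, which may trigger a warning. If the goal is only to zoom the view, coord_cartesian() is an alternative that keeps all the data. A minimal sketch, assuming the openintro package is installed:

```r
library(ggplot2)
library(openintro)  # provides loans_full_schema

# Zoom the view to 0-40 without removing any observations or bins
p_zoom <- ggplot(data = loans_full_schema) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 1, center = 10) +
  coord_cartesian(xlim = c(0, 40))

p_zoom
```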

Questions on data

Why is the lowest interest rate 5%-ish and there was no lower interest rate? Can you explain?
Why are there some peak interest rates around 7%, 10%, 14%? Can you explain?

Lab Exercise

Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
Create a histogram of annual_income. What is the issue with your graph?

Density plot

For continuous random variables, it is common practice to draw a density plot, which shows a smoothed version of the histogram.

The density here is the same as “probability density” in statistics. The density plot can be understood as a smooth estimate of the probability density function of the data. An example is shown below.

ggplot(loans, aes(x = loan_amount)) +
  geom_density()

Here the geom_density() function creates a density plot:

The total area under the density curve is one.
The curve is a kernel density estimate computed from the data themselves; its default bandwidth (smoothness) is chosen automatically and does not depend on the histogram bins.
To adjust the “smoothness” of the curve, change the argument adjust, which multiplies the default bandwidth.
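
For readers curious about what adjust does: geom_density() relies on kernel density estimation as in stats::density(), where adjust multiplies the automatically chosen bandwidth. The base R sketch below uses simulated data (the vector x is made up, not the loans data) to check both the effect of adjust and the area under the curve.

```r
set.seed(1)

# Simulated, hypothetical loan amounts (illustration only)
x <- rgamma(500, shape = 2, scale = 8000)

d1 <- density(x)              # default bandwidth, adjust = 1
d2 <- density(x, adjust = 2)  # bandwidth doubled, smoother curve

# adjust scales the bandwidth that density() reports
bw_ratio <- d2$bw / d1$bw

# Approximate the area under the curve with the trapezoid rule;
# it should be close to 1
area <- sum(diff(d1$x) * (head(d1$y, -1) + tail(d1$y, -1)) / 2)
```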

Plot histogram and density together

Usually, it can be a good idea to plot histogram and density in one plot:

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)),  
                 boundary = 0, colour = "black", fill = "white") +
  geom_density(linewidth = 1.2)

Note that to make the histogram also plotted with y-axis being density, we need to add y = after_stat(density) in the aes function.

Adjust density smoothness

A larger adjust value gives a smoother density curve. For example, we can set it to 30/8 as a rough heuristic, so that the curve is about as smooth as a histogram with 8 bins rather than the default 30.

ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, 
                 boundary = 0, colour = "black", fill = "white") +
  geom_density(adjust = 30/8, linewidth = 1.2) #30/8 = 30bins/8bins

Create a relative frequency histogram (optional)

The after_stat function is used in the example above because the y-axis is mapped to values that are not a variable in the original data frame. For histograms, the y-axis is mapped by default to counts, which are computed by another function, stat_bin(). Writing after_stat(count) or after_stat(density) indicates that count and density become available only after the original data have been transformed.

With this understanding, we can create a relative frequency histogram (although rarely used) where the y-axis is the proportion of samples in each bin rather than the counts:

ggplot(loans) + 
  geom_histogram(mapping = aes(x = loan_amount, y = after_stat(count/sum(count))), 
                 binwidth = 5000, boundary = 0, colour = "black", fill = "white")
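
The relation between count, proportion and density can be checked by hand. The base R sketch below uses a made-up vector amount (not the loans data) and bins of width 5000 starting at 0, mirroring binwidth = 5000 and boundary = 0.

```r
# Hypothetical loan amounts (illustration only)
amount <- c(2000, 4500, 5000, 7000, 12000, 12500, 18000, 23000)

# Bins of width 5000 starting at 0
bin_width <- 5000
breaks <- seq(0, 25000, by = bin_width)
h <- hist(amount, breaks = breaks, plot = FALSE)

count      <- h$counts
proportion <- count / sum(count)                # relative frequency histogram
dens       <- count / (sum(count) * bin_width)  # density histogram

# Proportions add up to 1; densities times the bin width also add up to 1
sum(proportion)
sum(dens * bin_width)
```

In other words, the proportion is the density multiplied by the bin width, which is why the two histograms have the same shape but different y-axis scales.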

Lab Exercise

Create a histogram of variable debt_to_income in loans with the following requirements:

The plotting range of x is between 0 and 100
The binwidth is 2
Create a density plot on top of the histogram

Question: Can you explain the distribution of debt_to_income?

Aesthetic Grouping

In many cases, it is very useful to add another level of aesthetic mapping to a figure. For example, in the mpg data set, if we plot hwy vs displ, we would see a plot like this:

Example: How would we explain the red dots, which seem to deviate from the overall trend (larger displ leading to lower hwy)?

The same plot with color groups in vehicle class

How can we explain the data points that have good fuel economy despite a large engine displacement? The following plot answers the question:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

We see that most of these cars are 2seaters, e.g., sports cars. Those cars are light and have a powerful engine, so they get better highway mpg than other cars with large engines (SUVs, pickups).

Here, we put a color argument inside the aes function. This creates a new aesthetic group (in color) defined by the categorical variable class.

Template for aesthetic grouping

Similarly, we can map other aesthetic components (shape, color, size, linetype, transparency etc.) to any categorical variable. The template to do this is:

<GEOM_FUNCTION>(mapping = aes(x = ..., y = ..., 
                color/shape/size/linetype/alpha = <VARIABLE_NAME>))

Note that the argument must be inside the aes() function.
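
The difference between placing an argument inside and outside aes() can be sketched with the mpg data. The choice of drv as the grouping variable is only for illustration.

```r
library(ggplot2)

# Inside aes(): color and shape vary with the categorical variable drv,
# and a legend is drawn
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv, shape = drv))

# Outside aes(): a constant setting; every point is blue and no legend is drawn
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_fixed
```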

Example - group by color and size (for ordinal variable)

ggplot(data = mpg) + 
     geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl))

Question: What can we learn from this graph?
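
Because cyl is stored as an integer, ggplot2 treats it as continuous and draws a color gradient. If discrete colors are preferred, wrapping the variable in factor() is one option, sketched below.

```r
library(ggplot2)

# factor(cyl) turns the number of cylinders into a categorical variable,
# so each value gets its own color
p_cyl <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = factor(cyl))) +
  labs(color = "cyl")  # shorter legend title than "factor(cyl)"

p_cyl
```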

Lab Exercise

For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?

Aesthetic grouping for histogram

Sometimes we hope to compare histograms between groups. One way to do this is to use fill in the aes function. For example, if we want to investigate the effect of homeownership on loan_amount, we can do the following:

ggplot(loans, aes(loan_amount, fill = homeownership)) + 
  geom_histogram(binwidth = 5000, alpha = 0.5)

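The argument alpha takes a value between 0 and 1 and controls the transparency of each histogram. The smaller alpha is, the more transparent the bars. It is very useful when we plot multiple charts that overlap with each other. Note that geom_histogram() stacks the groups on top of each other by default; to overlay them instead, add position = "identity".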
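
A minimal sketch of an overlaid version, assuming the openintro package is installed. By default geom_histogram() stacks the groups; with position = "identity" each group is drawn from the baseline, so the bars overlap and alpha becomes important.

```r
library(ggplot2)
library(openintro)  # provides loans_full_schema

# Each homeownership group starts at y = 0, so the histograms overlap
p_overlay <- ggplot(loans_full_schema, aes(loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000, boundary = 0, alpha = 0.5,
                 position = "identity")

p_overlay
```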

Aesthetic grouping for density curve

When the absolute counts between groups are quite different, comparing histograms is not a good idea. Instead, we would like to compare density curves between groups to see their differences.

ggplot(loans, aes(x = loan_amount, fill = homeownership)) + 
  geom_density(adjust = 2, alpha = 0.5)   # Transparency is necessary

This graph shows that the differences between groups are relatively small: people with a mortgage tend to borrow somewhat more money than those renting (maybe they care less about taking on more debt since they are indebted anyway).

Ridge plots for density curves

When there are too many categories, a density ridge plot can be useful.

library(ggridges)   # The package "ggridges" must be installed 
ggplot(loans, aes(x = loan_amount, y = grade, 
                  fill = grade, color = grade)) + 
  geom_density_ridges(alpha = 0.5)

This enhanced graph becomes available after installing the package ggridges. You should be able to understand the code now without my explanation after studying many examples!

Question: What can we learn from this graph?

Save your figure

Use ggsave to save a figure as a file in the current working folder.

ggplot(mpg) + geom_point(aes(cty, hwy)) 
ggsave("my-plot.pdf")

Check the current working folder to find the file you just saved on your disk. You may use the following command to know your current working folder.

getwd()

You may save your figure as pdf, png, jpeg and other compatible formats.
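
By default ggsave() saves the last plot displayed, at the size of the current graphics device. For more reproducible output, the plot and its size can be passed explicitly. In the sketch below, the temporary folder and file name are placeholders to replace with your own.

```r
library(ggplot2)

p <- ggplot(mpg) + geom_point(aes(cty, hwy))

# Hypothetical output path in a temporary folder; replace with your own
out_file <- file.path(tempdir(), "my-plot.pdf")

# Pass the plot explicitly and fix the size in inches
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```

The file format follows the extension of the file name. For raster formats such as png, the dpi argument of ggsave() additionally controls the resolution.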

Lab Homework (Required)

Create a scatter plot of loan_amount vs interest_rate with a color grouping using term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder. Submit your code and graph on Canvas.

Customize the appearance of data-related components

As we see in previous examples, we can customize the color, shape, fill color and other aesthetic components of data-related components (symbols, lines, bars, fills etc.). Note that this is different from aesthetic grouping since we would control the appearance of the whole plot.

These components are customized by arguments inside the geom_ functions (but outside the aes() function). For example,

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(color = "blue", fill = "green", shape = 21, size = 3) + 
  geom_smooth(color = "purple", linetype = "dashed", linewidth = 1.5)

All aesthetic features apply to all data components for that geom_ function.

For more details, please refer to https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

`color`, `fill` and `alpha` - color customization

Almost all geoms have either color or fill (or both) to customize the color of points/lines/bars/… To specify a color, one may use the following ways:
- A name, such as "red", "blue" etc. R has 657 built-in named colors in total.
- An RGB specification in hexadecimal form, given as a string such as "#A52A2A"
- An NA, which refers to completely transparent color

alpha refers to the opacity. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.

`shape`, `size` and `linetype` - point/line customization

shape can be specified with an integer between 0 and 25. Each code refers to a type of point.
size can be specified with a numerical value (in mm) or a relative size with rel() function.
linetype can be specified with an integer (0-6) or a name (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash).
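
A short sketch that puts these settings together on the mpg data. The particular values (shape 17, the color "#A52A2A", the "dotdash" line) are arbitrary choices for illustration.

```r
library(ggplot2)

p_styles <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  # shape 17 is a filled triangle; alpha makes overlapping points visible
  geom_point(shape = 17, size = 2, color = "#A52A2A", alpha = 0.6) +
  # a straight-line fit drawn with a named linetype and no confidence band
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash", color = "black")

p_styles
```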

Lab Exercise

Create a density plot of interest_rate in loans data with

color to be blue
fill to be green
linetype to be dashed
linewidth to be 1.5

Titles and labels

For a graph to be accessible to a wider audience, it must have proper axis labels and title. In ggplot, the function labs() is used to specify these details.

ggplot(data = loans) +
  geom_histogram(
    mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(
    title = "Interest rate from lending club data", 
    x = "Interest Rate (%)", 
    y = "Count"
       )

Customize graph details with `theme()`

To further polish graph details, we need to add theme() into our code. The theme() function can customize all non-data components of the plots regarding their appearances.

To use theme, one needs to follow the following template:

theme(
      COMPONENT_NAME = ELEMENT_ADJUSTING_FUNCTION(STYLE_NAME = SETTING)
     )

Example

Let’s look at an example to understand this. For example, we want to center the title in the previous graph. We can do this:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))

How `theme()` function works

theme(plot.title = element_text(hjust = 0.5))

The argument plot.title of theme() specifies that we hope to customize the text appearance of the title.
To change any setting for text, we must use the element_text() function.
Therein, we change hjust to be 0.5, which refers to the horizontal justification, and a value of 0.5 refers to placing in center.

Another example

Let’s see another example. Now we want to enlarge the font size of title to be 20 pts. The following code would work:

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 20))

Exercises

As an exercise, please guess how the following code will change the graph:

theme(
  plot.title = element_text(hjust = 0.5, size = 20),
  axis.title = element_text(size = 15),
  axis.text = element_text(size = 15)
  )

Solution

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(
    plot.title = element_text(hjust = 0.5, size = 20),
    axis.title = element_text(size = 15),
    axis.text = element_text(size = 15)
  )

Use help documentation

Nobody remembers all the argument names of the theme() function. So when you want to customize a particular element in your graph, use the help documentation as your reference.

?theme

There are three main element_ functions used in theme():
- element_rect(): for borders and backgrounds
- element_line(): for lines
- element_text(): for texts
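
A sketch that uses each of these element_ functions once, on the mpg data. The colors and line types are arbitrary choices. It also uses element_blank(), another ggplot2 function, which removes a component entirely.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  theme(
    # background and border of the plotting panel
    panel.background = element_rect(fill = "white", colour = "black"),
    # major grid lines
    panel.grid.major = element_line(colour = "grey80", linetype = "dotted"),
    # remove the minor grid lines entirely
    panel.grid.minor = element_blank(),
    # axis titles in bold
    axis.title = element_text(face = "bold")
  )

p_theme
```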

In the future, when new ways of using theme() function appear, you should research by yourself to understand how it works.

`rel()` and `margin()`

There are two useful functions rel() and margin() when we customize our graphs.

rel() is used to specify relative sizes. For example, rel(1.5) means 1.5 times larger in size
margin() is used to specify the margins of elements, in the order top (t), right (r), bottom (b) and left (l), along with a unit.

theme(
     axis.text = element_text(colour = "blue", size = rel(1.5)),
     plot.margin = margin(1,1,1,1, unit = "cm")
     )

Read https://ggplot2.tidyverse.org/reference/element.html for more details.

Example

An exemplary graph is shown below after adjusting the margin and font colors.

ggplot(data = loans) +
  geom_histogram(mapping = aes(x = interest_rate), binwidth = 5, boundary = 10)  +
  labs(title = "Interest rate from lending club data", x = "Interest Rate (%)", y = "Count") + 
  theme(
    plot.title = element_text(hjust = 0.5, size = rel(1.5), colour = "red"),
    axis.title = element_text(colour = "blue", size = rel(1.2), margin = margin(b = 3)),
    axis.text = element_text(size = rel(1.2)),
    plot.margin = margin(1,1,1,1, unit = "cm")
  )

Lab Homework (Required)

Do a simple graph ggplot(mpg) + geom_point(aes(x = cty, y = hwy)), make the following customization of your graph:

Make a title of “Fuel Economy Data”
Make your x label of “miles per gallon in city”, and a y label of “miles per gallon in highway”.
Make your title 2 times larger and in color blue
Make your labels in x and y 1.5 times larger
(Advanced) Change the aspect ratio of your graph to be 1 to make your graph square. Find how to do this by yourself with the assistance of help documentation.

Setting scales (`xlim` and `ylim`)

To set the limits of x and y axis, which is usually needed for graph polishing, we need to use xlim or ylim functions:

ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth() + 
  xlim(0, 40) + ylim(0, 50) + 
  theme(
    axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)),
    axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)),
    axis.text = element_text(size = rel(1.0)),
    plot.margin = margin(1,1,1,1,"cm")
  )

Principles of graph polishing

(Most important) The graph needs to look good in its final form and fit the actual needs.
For the same graph, the optimal settings differ when the output format is different.
Always try different settings and check your figure in its final form.

On the next page, I am going to show you a graph that is relatively polished in detail (You need really large font sizes for presentation).

Data Visualization - Part Two

Miao Yu

2024-01-24
