Load Library


library(tidyverse)
library(openintro)

In this module, we will first study a few data visualization and analysis examples, which naturally raises the necessity of performing data transformation before data visualization in many situations.


1. Data Visualization Examples


2d bin counts plot


A 2d bin counts plot divides the plane into regular hexagons, counts the number of cases in each hexagon, and then (by default) maps the number of cases to the hexagon fill. It is used to resolve the “overplotting” problem, similar to using position = "jitter" when doing the scatter plot.

For example, in the loans_full_schema data set, if we hope to plot interest_rate against debt_to_income using a scatter plot, it looks like this:

ggplot(loans_full_schema) +
  geom_point(aes(x = debt_to_income, y = interest_rate))

This graph is not very informative since many points overlap with each other (overplotting). To make things more clear, we can use the geom_bin_2d() function to create a 2d bin counts plot.

ggplot(loans_full_schema) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate))

In the graph above, the colors represent the counts (equivalently density) in each square bin. It is clear that we have more data points at low interest rate between 5% to 13% combined with low debt_to_income ratio between 0% to 20%.

Since there are relatively few points for a debt-to-income ratio of higher than 100%. We can filter our data and make the plot more detailed:

ggplot(filter(loans_full_schema, debt_to_income < 100)) +
  geom_bin_2d(aes(x = debt_to_income, y = interest_rate))

Customize scales (breaks and labels)


Scales refer to the x- and y-ticks and their labels on axes. There are a few functions that can customize the scale. Let’s take the following graph as an example:

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income))

We see that the scales on x-axis are 0, 100, 200, 300, 400 and the scales on y-axis are 0, 500000, 1000000, 1500000 and 2000000.

Let’s first learn how to change the position of scales using scale_x_continuous and scale_y_continuous functions.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(breaks = seq(0, 450, 50)) +
  scale_y_continuous(breaks = seq(0, 2000000, 250000))

By defining the breaks argument inside scale_x_continuous or scale_y_continuous function one can define all positions of scales.

We can also customize the labels of scales.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income, y = annual_income)) +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)

Here labels = NULL removes all labels on the corresponding scale. Or we can define them by ourselves.

ggplot(loans_full_schema) + 
  geom_point(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_continuous(labels = scales::dollar) 

We can also customize the label names here with the name argument, and customize the limits with the limits argument. Some useful scale options are scale::percent, scale::dollar and scale::comma to change the format of scales.

Customize scale


In many data sets, one numeric variable may span a few orders of magnitudes (for example, household income from $1,000 to $1,000,000). If we use continuous_scale, the graph does not show details very well. In that case we need to change our scale to log scale (plotting the logarithm of variable).

For data exploration, it is common that one use log10 scales:

ggplot(loans_full_schema) + 
  geom_bin_2d(aes(x = debt_to_income/100, y = annual_income)) +
  scale_x_continuous(name = "debt to income ratio", labels = scales::percent, limits = c(0, 1)) +
  scale_y_log10(limits = c(5000, 2500000), labels = scales::dollar) +
  labs(title = "LendingClub Loan Data", 
       x = "Debt to Income Ratio (in percentage)", 
       y = "Annual Income (in US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

The functions scale_y_log10 and scale_x_log10 converts y-axis or x-axis into log10 scale, respectively.


Use preset themes


There are eight preset themes offered in ggplot, that gives different settings in axes, grid and background appearance. They are:

We can change the theme by calling the theme functions:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  theme_classic()


Facets


Using “Facets” is another way to add additional variables into a graph.

  • Facets divide a plot into subplots based on the values of one or more discrete variables.

  • When creating subplots based on values of a single categorical variable, one should use facet_wrap(). As below is an example.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2) +
  labs(title = "Vehicle Fuel Economy Data by Vehicle Class", 
       x = "Engine Displacement (liter)", 
       y = "Highway Mile per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

The facet_wrap() function wraps subplots into a 2-dimensional array. This is generally a better use of screen space because most displays are roughly rectangular.

In the code above, ~ class is called a formula in R. We will study it later. For now you just need to know that ~ <VARIABLE_NAME> is needed as the first argument of facet_wrap function.

In the graph above, we still plot engine displacement vs highway mpg, but only plot grouped data for each class in every subplot. By doing this, we clearly see where each group is - better than plotting them altogether.

  • When creating subplots based on values of two categorical variables, one should use facet_grid():
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class) + # A formula with two variables
  labs(title = "Fuel Economy Data by Vehicle Class and Drive Train Type", 
       x = "Engine Displacement (liter)", 
       y = "Highway Mile per Gallon") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

In this case, a grid of subplots is created and the x- and y-axis of the grid corresponds to the values of drv and class, respectively.

For example, in the subplot at the top right corner, it plots displ against hwy for data points with a class value of suv, and a drv value of 4, which corresponds to 4-wheel drive suvs.

facet_grid() can be used to study the relationship between four variables (two numeric and two categorical). When the data set is large and complicated, it can be very useful to provide some insights for us.


2. Exemplary exploratory data analysis of diamonds data set


Next, let’s study a new data set to start our journey in data exploration. We will use the diamonds which is a built-in data set in ggplot2 package. The data set should be available after you load tidyverse into the R session.

First, let’s do some basic exploration of the data set by doing the following exercise:


Lab Exercise for diamonds data set


  1. How many samples are there? How many variables are there?
  2. Understand the meaning and the data type for each variable.
  3. Create a plot to study the relationship between price and carat. How do you understand this plot?
  4. Create plots to study the relationship between price and cut, color, clarity, respectively. How do you understand these plots?
  5. After finishing 3 and 4, do you have more questions raised in your mind? Create a plot to answer those questions.


Change the y variable in bar charts with stat_summary()


So far our bar plots are all plotting for counts. In many cases we hope to plot for other descriptive measures. In the diamonds data set, for example, we may hope to the mean price for diamonds with various combination of color/clarity levels. The following code does the job for us:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = clarity, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by clarity and Color", 
       x = "Clarity Level", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Basically, stat_summary or any other <STAT_FUNCTION> is an alternative to to <GEOM_FUNCTIONS>. The different is that, -<GEOM_FUNCTIONS> creates a graph of a particular graph type (point, line, bar plot etc.) and we need to specify the mapping in it. -<STAT_FUNCTION> creates a graph that plots a particular statistic (count, proportion, density, or any specified variables and their functions), and we need to specify the graph type in the function by using the geom argument.

Graph analysis: The graph above suggests some information that is not consistent with common sense.

  • The clarity level is from I1 (worst) to IF (best), but the mean price seems to decrease as the clarity level increases.
  • Similarly, diamonds of the best color “D” for most clarity groups do not have the highest price, but rather the worst color “J” usually has a higher price.

Why? That is because we didn’t consider the major price indicator carat, along with another important factor sample size in this graph. To find the correct trend, we need to look into the data more carefully. Sometimes, we may need to do data transformation - the next topic of our course.


An Example leading to Data Transformation and EDA (exploratory data analysis)

To resolve the effect of carat, one way (of course not the best way here) to handle this is to create a new ordinal variable named carat_group. Then we label all samples into one group in <= 1 carat, 1-2 carat, 2-3 carat, and > 3 carat based on the diamond carat.

diamonds <- mutate(diamonds, carat_group = cut(diamonds$carat, c(0, 1,2,3, Inf), c('<= 1 carat', '1-2 carat', '2-3 carat', '> 3 carat')))

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = carat_group, y = price, fill = color), fun = "mean", geom = "bar", position = "dodge") + 
  labs(title = "Diamond Price by Carat and Color", 
       x = "Carat", 
       y = "Mean Price (US dollar)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))

Analysis: This graph makes much more sense now since for diamonds of less than 3 carats, the mean price decreases as the color level downgrades as an overall trend. However, the group “> 3 carat” still looks weird since the worst color still has the highest mean price.

In this case, we need to speculate why this may happen and investigate whether our speculation is correct or not by graphs. Here a natural speculation is that “> 3 carat” diamonds are very rare so the sample size is too small to reflect the trend. We may verify our speculation with the following graph:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = carat_group, fill = color) , position = "dodge") + 
  labs(title = "Diamond Data by Carat and Color", 
       x = "Carat", 
       y = "Counts") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

The graph above clearly shows that the sample size in the “> 3 carats” group is simply too small, so we are not able to draw reliable conclusions for that group.

It is also helpful to check the mean carat for each color level:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Diamond Weight by Color", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

As we see, in this data set diamonds with lower color quality are heavier on average, and that’s why we have to consider this factor when analyzing the relationship between price and color level.

It is also helpful to study diamonds of less than 2 carat to exclude big diamonds that are rare. In this case, we need to first filter our data using the filter function in dplyr package.

ggplot(data = filter(diamonds, carat < 2)) + 
  stat_summary(mapping = aes(x = color, y = carat, fill = color), fun = "mean", geom = "bar") + 
  labs(title = "Mean Weight vs Color for '< 2 carat' Diamonds", 
       x = "Color Level", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.2)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.2)))  

The trend changes little compared with the previous graph. Therefore we must take into account the effect of carat when analyzing the effect of color on price.

This also shows that diamond color indeed impacts the price because although diamonds with color “D” and “F” share similar mean price, but on average “F” is heavier than “D”.


Data Transformation and Explorative Data Analysis (EDA)

The example above shows the process of Exploratory Data Analysis (EDA), which involves using visualization and transformation to explore your data in a systematic way. This is really an iterative cycle of the following steps:

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

During this process, you will find that in many situations we don’t have the data in exactly the right form for what we need. In that case we have to refer to data transformation, which refers to the operations of

  • pick observations that meet our requirement (filter())
  • reorder the rows (arrange())
  • pick variables by their names (select())
  • create new variables with functions of existing variables (mutate())
  • collapse many values down to a single summary (collapse())
  • group data by particular conditions (group_by())


Lab Exercise: Free exploration of diamonds data set or mpg data set by performing EDA


  1. Ask a question of your interest
  2. Visualize data to answer your question
  3. Try to raise new questions from your plot
  4. Visualize data to answer the new question
  5. Repeat cycle 1-4