Data Visualization

Introduction
- Prerequisites
A Graphing Template
- Question
- Template
Aesthetics Mappings
- Mapping Aesthetics to Variables
- Mapping Aesthetics to geoms:
Facets
- Facetting on a Single Variable
- Facetting on two Variables
Geometric objects
- Display Multiple geoms on the Same Plot
Statistical transformations
Position adjustments

Introduction

This chapter will teach you how to visualise your data using ggplot2. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.

Prerequisites

library(ggplot2)
library(tibble)

A Graphing Template

Question

Let’s use our first graph to answer a question: What is the relationship between engine size and fuel efficiency? Is it positive? Negative? Linear? Nonlinear?

To answer this question, we’ll use the mpg data set that comes with ggplot2. Let’s have a look at it:

mpg

## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

To learn what each variable means, open the data set’s documentation by running ?mpg.

From the help file, we can see that to answer the question, we need to plot the relationship between displ, the engine size, and hwy, the highway miles per gallon. Let’s use ggplot2 to do that:

ggplot(data = mpg) +
  geom_point(mapping = aes(x=displ, y=hwy))

The graph shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with big engines use more fuel.
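As a rough numeric cross-check of what the scatter plot suggests, the sketch below computes the Pearson correlation between displ and hwy. This is only an illustration: correlation measures linear association, so it cannot tell you whether the relationship is linear or nonlinear, and the variable name r is just a placeholder.

```r
library(ggplot2)  # provides the mpg data set

# Pearson correlation between engine size and highway fuel efficiency.
# A negative value agrees with the downward trend seen in the scatter plot.
r <- cor(mpg$displ, mpg$hwy)
r
```

A value close to 0 would have suggested no linear trend; the sign is what answers the "positive or negative?" part of the question.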

Template

The code we just used is a template for every ggplot2 graph:

ggplot(data = <YOUR DATA>) creates an empty graph.
The + operator adds layers to your empty graph.
geom_point() adds a layer of points, i.e., a scatter plot. ggplot2 comes with many geom functions that each add a different type of layer to a plot, and we will learn about many of them later.
Each geom function takes a mapping argument that defines how variables are mapped to visual properties.
The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.

This is the template:

ggplot(data = <YOUR DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
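The sketch below walks through the template one step at a time. It assumes that a ggplot object exposes its layers through $layers, which is used here only to count them; the names empty_plot, scatter, and boxes are placeholders, and geom_boxplot() is just one example of swapping in another geom function.

```r
library(ggplot2)

# Step 1: ggplot() on its own creates an empty graph with no layers.
empty_plot <- ggplot(data = mpg)
n_layers_empty <- length(empty_plot$layers)

# Step 2: the + operator adds a layer; here, a layer of points.
scatter <- empty_plot +
  geom_point(mapping = aes(x = displ, y = hwy))
n_layers_scatter <- length(scatter$layers)

# Same template, different geom function and mappings:
# a box plot of highway mileage for each class of car.
boxes <- empty_plot +
  geom_boxplot(mapping = aes(x = class, y = hwy))
boxes
```

Only the geom function and the mappings change between the two plots; the data argument and the overall shape of the call stay the same.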

Aesthetics Mappings

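An aesthetic is a visual property of the objects in your plot, such as the color, size, or transparency of the points. An aesthetic mapping ties such a property to a variable in your dataset. For example, the class variable in the mpg dataset classifies cars into groups such as compact, midsize, and SUV. In the following examples, we will map class to different aesthetics.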

Mapping Aesthetics to Variables

To map an aesthetic to a variable, set the name of the aesthetic to the name of the variable inside aes():

Map class to color:

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy, color=class))

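Map class to size (ggplot2 warns here, because mapping an unordered variable such as class to an ordered aesthetic such as size is not advised):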

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy, size=class))

Map class to alpha, which controls the transparency of the points:

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy, alpha=class))

Mapping Aesthetics to geoms:

To set an aesthetic for the geom itself, rather than mapping it to a variable, pass the aesthetic by name as an argument of your geom function, outside aes():

Make the points blue:

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy), color='blue')

Note: Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot.
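A common slip is to put the constant inside aes(). The sketch below contrasts the two forms. It uses ggplot_build() only to inspect the colours that each plot ends up drawing; the object names are placeholders.

```r
library(ggplot2)

# Setting: color is an argument of geom_point(), outside aes().
set_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping by mistake: inside aes(), "blue" is treated as a one-level
# variable, so ggplot2 picks a colour from its default scale and adds a legend.
mapped_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Inspect the colours actually assigned to the points in each plot.
set_colours <- unique(ggplot_build(set_blue)$data[[1]]$colour)
mapped_colours <- unique(ggplot_build(mapped_blue)$data[[1]]$colour)
set_colours
mapped_colours
```

If your points come out in an unexpected colour with a legend labelled "blue", check whether the colour ended up inside aes().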

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

Facetting on a Single Variable

Use facet_wrap() and pass it the ~ operator followed by the conditioning variable:

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy)) +
  facet_wrap(~ class, nrow=2)

Facetting on two Variables

Use facet_grid() and pass it the two conditioning variables, separated by the ~ operator (rows ~ columns):

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy)) +
  facet_grid(drv ~ cyl)
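By default, facet_grid() draws one panel for every combination of the two variables, including combinations that have no data. The sketch below counts the observations behind each panel of the drv ~ cyl grid with base R's table(), then shows the `. ~ cyl` form, which facets on the columns dimension only. The names panel_counts, n_panels, and n_empty are placeholders.

```r
library(ggplot2)

# One cell per panel of facet_grid(drv ~ cyl): rows are drv, columns are cyl.
panel_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
panel_counts

n_panels <- length(panel_counts)      # rows x columns of the grid
n_empty <- sum(panel_counts == 0)     # panels that will be drawn empty

# Use . in place of a variable to facet on one dimension only.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```

Empty cells in the table correspond to the empty panels in the grid.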

Geometric objects

Look at these two graphs:

They’re different, although we used the same dataset and plotted the same variables on the x and y axes. This is because we used a different geom for each of them. A geom is the geometrical object that a plot uses to represent data: the left plot uses points, while the right plot uses a smooth line fitted to the data. Here’s the code for the two plots:

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Example:

Reproduce the right graph above, but condition on the drv variable, so that each category of the variable gets a different line type.

Solution:

Just add a linetype aesthetic and set it equal to drv:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype=drv))

Example:

Produce two graphs. The left one should be the same as the last graph, but with the same line type for every curve. Hint: use the group aesthetic instead of linetype.

The right graph is also the same as the last graph, but maps the color aesthetic to the drv variable.

Hint: use the gridExtra package to plot multiple graphs on the same row.

Solution:

# left
left <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

# right
right <- ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, color=drv),
              show.legend = FALSE)

# install.packages('gridExtra')  # run once if gridExtra is not installed
library(gridExtra)
grid.arrange(left, right, ncol=2)

Display Multiple geoms on the Same Plot

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()

Note: we placed the mapping inside ggplot() rather than repeating it in both geom_point() and geom_smooth(), which avoids duplicated code.

Note: mappings placed inside ggplot() are global, i.e., they apply to every geom in the graph. Mappings placed inside a geom function are local: ggplot2 uses them to extend or overwrite the global mappings for that layer only.

Example:

Repeat the last two-layer plot, but map the color aesthetic to the drv variable for the scatter plot layer only.

Solution:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color=drv)) +
  geom_smooth()

Example:

Repeat the last example, but now map the color aesthetic to the drv variable for both layers.

Solution:

We only need to move color=drv into the global mapping in ggplot():

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv)) + 
  geom_point() +
  geom_smooth()
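One way to see the difference between the last two solutions is to count how many curves geom_smooth() fits in each. The sketch below rebuilds both plots and uses ggplot_build() to read the group column of the second layer (the smooth layer). The object names are placeholders.

```r
library(ggplot2)

# color = drv is local to geom_point(): the smooth layer sees no grouping.
local_colour <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth()

# color = drv is global: both layers are split by drv.
global_colour <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

# Layer 2 is geom_smooth(); each distinct group gets its own fitted curve.
n_curves_local <- length(unique(ggplot_build(local_colour)$data[[2]]$group))
n_curves_global <- length(unique(ggplot_build(global_colour)$data[[2]]$group))
n_curves_local
n_curves_global
```

With the local mapping there is a single curve for all cars; with the global mapping there is one curve per value of drv.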

Statistical transformations

The diamonds dataset comes with ggplot2. Have a look at its documentation using ?diamonds.

The following bar chart displays the total number of diamonds in the diamonds dataset, grouped by cut:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

The bar chart plots count on the y-axis, but count is not a variable in diamonds! Where did it come from?
Scatterplots, among other graphs, plot the raw values of your dataset.
Bar charts, among other graphs, calculate new values to plot. Here, geom_bar() bins your data and then plots the bin counts, the number of points that fall in each bin.
ggplot2 calls the algorithm that a graph uses to calculate new values a stat, short for statistical transformation.
Each geom in ggplot2 is associated with a default stat. Even geom_point() has one, the identity stat, which means it plots the raw data as it is.
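To make the count stat less mysterious, the sketch below compares the values that geom_bar() computes with a manual tally from base R's table(). It uses ggplot_build() to read the computed data, and also draws the same chart with stat_count(), the stat function behind geom_bar()'s default. The object names are placeholders.

```r
library(ggplot2)

bars <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counts computed by the stat, one per bar.
stat_counts <- ggplot_build(bars)$data[[1]]$count

# The same tally done by hand.
manual_counts <- as.vector(table(diamonds$cut))

stat_counts
manual_counts

# geom_bar() and stat_count() can be used interchangeably here.
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```

The two vectors match, which shows that the y values of the bar chart are derived from the data rather than read from a column.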

You rarely need to think about the stat, because it works behind the scenes on your behalf. There are two cases where you do:

1. You want to override the default stat. The code below changes the stat of geom_bar() from count (the default) to identity, so the bar heights come directly from the variable b:

demo <- tibble::tibble(
  a = c("bar_1", "bar_2", "bar_3"),
  b = c(20, 30, 40)
)
demo

## # A tibble: 3 x 2
##       a     b
##   <chr> <dbl>
## 1 bar_1    20
## 2 bar_2    30
## 3 bar_3    40

ggplot(data = demo) +
  geom_bar(mapping = aes(x = a, y = b), stat = "identity")

2. You want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions rather than counts:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

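By default, geom_bar() maps y to count, but you can ask it to use prop instead with aes(y = ..prop..). The two dots surrounding prop tell ggplot2 that prop is a variable in the transformed dataset, not in the raw dataset. In ggplot2 3.3.0 and later the same mapping can be written aes(y = after_stat(prop)), and the double-dot notation is deprecated as of ggplot2 3.4.0.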
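The group = 1 in the proportion chart matters: proportions are computed within each group, and without it every value of cut forms its own group. The sketch below shows the effect. It assumes ggplot2 3.3.0 or later, because it uses after_stat(prop), the newer spelling of ..prop.., and it uses ggplot_build() to read the computed values. The object names are placeholders.

```r
library(ggplot2)

# Without group = 1, each cut is its own group, so each bar is 100% of its group.
no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all rows belong to one group and the bars share a denominator.
with_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

props_no_group <- ggplot_build(no_group)$data[[1]]$prop
props_with_group <- ggplot_build(with_group)$data[[1]]$prop

props_no_group
props_with_group
```

If every bar in a proportion chart has the same height, a missing group = 1 is a likely cause.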

Position adjustments

You can color a bar chart using either the color aesthetic or, more usefully, fill:

bar_color <- ggplot(data=diamonds, mapping = aes(x = cut, color = cut)) +
  geom_bar()

bar_fill <- ggplot(data=diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar()

grid.arrange(bar_color, bar_fill, ncol=2)

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of the other options:

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

# position = 'stack'
stacked_bars <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()

filled_bars <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = 'fill')

dodged_bars <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = 'dodge')


grid.arrange(stacked_bars, filled_bars, dodged_bars, ncol=3)
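To check what "stack" and "fill" do to the bar heights, the sketch below reads the top of each stacked bar from the built plot data. It relies on ggplot_build() and on the ymax column that geom_bar() produces; the object names are placeholders.

```r
library(ggplot2)

stacked <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()
filled <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")

stacked_data <- ggplot_build(stacked)$data[[1]]
filled_data <- ggplot_build(filled)$data[[1]]

# Height of the whole stack at each value of cut.
stacked_tops <- tapply(stacked_data$ymax, as.numeric(stacked_data$x), max)
filled_tops <- tapply(filled_data$ymax, as.numeric(filled_data$x), max)

stacked_tops  # total number of diamonds per cut
filled_tops   # every stack rescaled to the same height
```

With "stack" the tops equal the total count for each cut, while "fill" rescales every stack to a height of 1, which is why it shows proportions rather than counts.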

There’s one other type of adjustment that’s very useful for scatterplots. Recall our first scatterplot of hwy against displ.

That plot displays fewer points than there are rows in the dataset: the values of hwy and displ are rounded, so the points fall on a grid and many of them overlap. This problem is known as overplotting, and it makes it hard to see where the mass of the data is.

Setting position = "jitter" adds a small amount of random noise to each point. This spreads the points out, because no two points are likely to receive the same amount of random noise.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
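The sketch below quantifies the overplotting by comparing the number of rows in mpg with the number of distinct (displ, hwy) positions, then draws the jittered plot with geom_jitter(). The width and height values are illustrative assumptions, not recommendations; they control how far points may be moved along each axis.

```r
library(ggplot2)

# Every row is one car, but many cars share the same (displ, hwy) position.
n_rows <- nrow(mpg)
n_distinct_positions <- nrow(unique(mpg[, c("displ", "hwy")]))
n_rows
n_distinct_positions

# geom_jitter() lets you tune the amount of noise per axis.
ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.1, height = 0.5)
```

Jittering trades a little accuracy at the level of individual points for a clearer view of where the data is dense, so keep the noise small relative to the spacing of the values.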