Reference book - ggplot2: Elegant Graphics for Data Analysis

# Fuel economy data

library(ggplot2)
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

# Key components

Every ggplot2 plot has three key components:

1. Data,
2. A set of aesthetic mappings between variables in the data and visual properties, and
3. At least one layer which describes how to render each observation. Layers are usually created with a geom function.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()
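
As a rough check of how the three components fit together, the sketch below stores the plot in a variable and uses layer_data() to look at what the point layer will draw. The names `p` and `drawn` are illustrative only and do not come from the book.

```r
library(ggplot2)

# Data + aesthetic mappings + one layer
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# layer_data() returns the data a layer works with after the mappings are applied
drawn <- layer_data(p, 1)

nrow(drawn)                 # one row per observation in mpg
head(drawn[, c("x", "y")])  # x holds displ, y holds hwy
```

Looking at the built layer this way can help when a plot does not show what you expect, because it makes the mapping from variables to x and y explicit.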

Add a line layer on top of the points:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_line()

## Exercise

How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?

ggplot(mpg,aes(cty,hwy)) + geom_point(colour = 'red') + geom_line(colour = 'skyblue')
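
One possible concern with this plot is overplotting: cty and hwy are whole numbers, so many cars share the same position. The sketch below is one way to check that, under the assumption that counting distinct (cty, hwy) pairs is a fair measure of it. The variable names are illustrative, and geom_count() is used as one option among several.

```r
library(ggplot2)

# Compare the number of rows with the number of distinct (cty, hwy) positions
n_rows  <- nrow(mpg)
n_pairs <- nrow(unique(mpg[, c("cty", "hwy")]))
c(rows = n_rows, distinct_pairs = n_pairs)

# geom_count() sizes each point by the number of observations at that position
p_count <- ggplot(mpg, aes(cty, hwy)) +
  geom_count()
p_count
```

If the number of distinct pairs is much smaller than the number of rows, a plain scatterplot hides how many cars sit behind each point.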

ggplot(diamonds, aes(carat, price)) + geom_point()

ggplot(economics, aes(date, unemploy)) + geom_line()

ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Colour, size, shape and other aesthetic attributes

ggplot(mpg, aes(displ, hwy, colour = class)) + 
  geom_point()

This gives each point a unique colour corresponding to its class. The legend allows us to read data values from the colour, showing us that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.

If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer outside of aes(). Compare the following two plots:

ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))

ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")+geom_line(colour = 'red')
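
To see why the two plots differ, the following sketch inspects the colour each layer ends up with. It leaves out the extra line layer so that only the points are compared, and the names `p_mapped` and `p_set` are illustrative.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a data value and passed through a colour scale
p_mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))

# Outside aes(): "blue" is used as the literal colour of the points
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

unique(layer_data(p_mapped)$colour)  # a colour picked by the default discrete scale
unique(layer_data(p_set)$colour)     # "blue"
```

The mapped version also gets a legend with a single entry labelled "blue", which is usually a sign that a constant was placed inside aes() by mistake.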

Different types of aesthetic attributes work better with different types of variables. For example, colour and shape work well with categorical variables, while size works well for continuous variables. The amount of data also makes a difference: if there is a lot of data it can be hard to distinguish different groups. An alternative solution is to use faceting, as described next.

When using aesthetics in a plot, less is usually more. It’s difficult to see the simultaneous relationships among colour and shape and size, so exercise restraint when using aesthetics. Instead of trying to make one very complex plot that shows everything at once, see if you can create a series of simple plots that tell a story, leading the reader from ignorance to knowledge.

  1. Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(colour = displ, size = displ))

  2. What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
ggplot(mpg,aes(cty,hwy)) + geom_point(aes(colour = model,shape = trans))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 10 values. Consider specifying shapes manually if you need
##   that many of them.
## Warning: Removed 96 rows containing missing values or values outside the scale range
## (`geom_point()`).
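
A sketch of both halves of this exercise follows. It assumes the usual explanation that shapes have no natural order and so cannot represent a continuous variable, and it introduces a hypothetical column `trans_type` (not part of mpg) that collapses trans into two groups so the shape palette is sufficient.

```r
library(ggplot2)

# Mapping a continuous variable to shape fails when the plot is built
p_bad <- ggplot(mpg, aes(displ, hwy, shape = displ)) +
  geom_point()
msg <- tryCatch(
  {
    ggplot_build(p_bad)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# trans has more distinct values than the default shape palette supports
length(unique(mpg$trans))

# Hypothetical helper column: drop the "(...)" suffix, leaving "auto" or "manual"
mpg2 <- transform(mpg, trans_type = sub("\\(.*$", "", trans))
ggplot(mpg2, aes(cty, hwy, shape = trans_type)) +
  geom_point()
```

With only two shapes in use, no rows are dropped and the legend stays readable.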

# Faceting

Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. There are two types of faceting: grid and wrapped. Wrapped is the most useful, so we’ll discuss it here, and you can learn about grid faceting later. To facet a plot you simply add a faceting specification with facet_wrap(), which takes the name of a variable preceded by ~.

ggplot(mpg, aes(displ,hwy)) + geom_point() + facet_wrap(~class)
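
facet_wrap() also accepts layout arguments such as ncol and nrow. The sketch below, with the arbitrary choice of two columns, shows one variation and checks that each class ends up in its own panel.

```r
library(ggplot2)

p_facets <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 2)
p_facets

# Number of panels drawn compared with the number of distinct classes
length(unique(layer_data(p_facets)$PANEL))
length(unique(mpg$class))
```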

# Plot geoms

You might guess that by substituting geom_point() for a different geom function, you’d get a different type of plot. That’s a great guess! In the following sections, you’ll learn about some of the other important geoms provided in ggplot2. This isn’t an exhaustive list, but should cover the most commonly used plot types.

geom_smooth() fits a smoother to the data and displays the smooth and its standard error.

geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.

geom_histogram() and geom_freqpoly() show the distribution of continuous variables.

geom_bar() shows the distribution of categorical variables.

geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.

## Adding a smoother to a plot

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
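
A minimal sketch of the same plot without the confidence band is shown here. Passing method and formula explicitly is an optional choice made only to avoid the informational message; `p_no_se` is an illustrative name.

```r
library(ggplot2)

p_no_se <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p_no_se
```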

An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve:

method = "loess", the default for small n, uses a smooth local regression (as described in ?loess). The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Add more arguments to geom_smooth():

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue", linetype = "dashed")
## `geom_smooth()` using formula = 'y ~ x'

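# Caution: a mapping passed to geom_smooth() overrides x and y for that layer only,
# so this smoother is fitted to table vs. depth but drawn on the carat/price axes
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(mapping = aes(x = depth, y = table),
              method = "glm",
              formula = y ~ x,
              se = TRUE,
              color = "blue",
              fill = "lightblue",
              linetype = "dashed")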

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  geom_smooth(
    method = "glm",
    formula = y ~ x,
    se = TRUE,
    color = "blue",
    fill = "lightblue",
    linetype = "dashed"
  )

## Boxplots and jittered points

ggplot(mpg, aes(drv, hwy)) + 
  geom_point()

Because there are few unique values of both drv and hwy, there is a lot of overplotting. Many points are plotted in the same location, and it’s difficult to see the distribution. There are three useful techniques that help alleviate the problem:

Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.

Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.

Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.

These are illustrated below:

ggplot(mpg, aes(drv, hwy)) + geom_jitter()

ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

ggplot(mpg, aes(drv, hwy)) + geom_violin()
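
To make the "handful of summary statistics" behind a boxplot concrete, the sketch below pulls them out of the built layer. The column names are those layer_data() reports for a boxplot layer, and `p_box` and `box_stats` are illustrative names.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot()

# One row per drv value: whisker ends, hinges and the middle line
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# The middle line of each box is the group median
tapply(mpg$hwy, mpg$drv, median)
```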

## Histograms and frequency polygons

ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both histograms and frequency polygons work in the same way: they bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines.

You can control the width of the bins with the binwidth argument (if you don’t want evenly spaced bins you can use the breaks argument). It is very important to experiment with the bin width. The default just splits your data into 30 bins, which is unlikely to be the best choice. You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data.

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 2.5)

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 1)
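
The bin-then-count step can also be checked numerically, and the breaks argument can be tried for unevenly spaced bins. In the sketch below the cut points are arbitrary values chosen only to cover the range of hwy, and the variable names are illustrative.

```r
library(ggplot2)

# Evenly spaced bins: every observation lands in exactly one bin
p_even <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2.5)
sum(layer_data(p_even)$count) == nrow(mpg)

# Unevenly spaced bins via breaks (arbitrary cut points for illustration)
p_uneven <- ggplot(mpg, aes(hwy)) +
  geom_histogram(breaks = c(10, 15, 20, 25, 30, 45))
layer_data(p_uneven)[, c("xmin", "xmax", "count")]
```

With uneven bins the bar heights are still raw counts, so wide bins can look taller simply because they cover more of the axis.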

An alternative to the frequency polygon is the density plot, geom_density(). A little care is required if you’re using density plots: compared to frequency polygons they are harder to interpret since the underlying computations are more complex. They also make assumptions that are not true for all data, namely that the underlying distribution is continuous, unbounded, and smooth.
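
For comparison with the frequency polygons above, a minimal density plot might look like the following sketch. The adjust value of 0.5 is an arbitrary choice that makes the curve follow the data more closely than the default of 1.

```r
library(ggplot2)

# Default smoothing
p_dens <- ggplot(mpg, aes(hwy)) +
  geom_density()
p_dens

# Less smoothing: adjust scales the default bandwidth
p_dens_rough <- ggplot(mpg, aes(hwy)) +
  geom_density(adjust = 0.5)
p_dens_rough
```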

To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or colour (for geom_freqpoly()). It’s easier to compare distributions using the frequency polygon because the underlying perceptual task is easier. You can also use faceting: this makes comparisons a little harder, but it’s easier to see the distribution of each group.

ggplot(mpg, aes(displ, colour = drv)) + 
  geom_freqpoly(binwidth = 0.5)

ggplot(mpg, aes(displ, fill = drv)) + 
  geom_histogram(binwidth = 0.5) + 
  facet_wrap(~drv, ncol = 1)

## Bar charts

ggplot(mpg,aes(manufacturer)) + geom_bar(fill = "skyblue")

Bar charts can be confusing because there are two rather different plots that are both commonly called bar charts. The above form expects you to have unsummarised data, and each observation contributes one unit to the height of each bar. The other form of bar chart is used for presummarised data.

To display this sort of data, you need to tell geom_bar() not to run the default stat, which bins and counts the data. However, we think it’s even better to use geom_point() because points take up less space than bars, and don’t require that the y axis includes 0. For example, you might have three drugs with their average effect:

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

ggplot(drugs,aes(drug, effect)) + geom_bar(stat = "identity",fill = "green", colour = "black")

ggplot(drugs,aes(drug, effect)) + geom_point()
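
ggplot2 also provides geom_col(), which draws bars whose heights come straight from the data. The sketch below repeats the drugs example with it; `p_col` is an illustrative name, and the data frame is redefined so the block runs on its own.

```r
library(ggplot2)

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

# geom_col() uses the values in the data as bar heights, with no counting step
p_col <- ggplot(drugs, aes(drug, effect)) +
  geom_col()
p_col

layer_data(p_col)$y  # bar heights taken directly from effect
```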

ggplot(mpg, aes(x = class, color = class)) +
  geom_bar(fill = "white")

# Modifying the axes

ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3)

ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab("city driving (mpg)") + 
  ylab("highway driving (mpg)")

# Remove the axis labels with NULL
ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1 / 3) + 
  xlab(NULL) + 
  ylab(NULL)

xlim() and ylim() modify the limits of axes:

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25)

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25) + 
  xlim("f", "r") + 
  ylim(20, 30)
## Warning: Removed 136 rows containing missing values or values outside the scale range
## (`geom_point()`).

# For continuous scales, use NA to set only one limit
ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25, na.rm = TRUE) + 
  ylim(NA, 30)

geom_jitter() adds random noise (horizontal/vertical) to separate overlapping points and show the true structure of your data.

# Basic scatter plot (points will overlap)
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_point(color = "blue") +
  ggtitle("Without Jitter (geom_point)") +
  theme_minimal()

# With jitter (overlapping points are spread out)
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_jitter(width = 0.2, color = "darkgreen", alpha = 0.6) +
  ggtitle("With Jitter (geom_jitter)") +
  theme_minimal()
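
Because the noise is random, a jittered plot looks slightly different each time it is drawn. One way to make it reproducible is to use position_jitter() with a seed, as in the sketch below. The seed value and the choice of height = 0 (jitter horizontally only) are arbitrary, and the variable names are illustrative.

```r
library(ggplot2)

# position_jitter() offers the same width/height controls plus a seed
p_jit <- ggplot(mpg, aes(drv, hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))
p_jit

# With a seed the jittered positions are the same on every build;
# height = 0 leaves the hwy values untouched
first  <- layer_data(p_jit)
second <- layer_data(p_jit)
identical(first$x, second$x)
all(first$y == mpg$hwy)
```

Keeping height at 0 is often reasonable when the y variable is the one being read off the axis, since only the categorical axis is then perturbed.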