ggplot uses the concept of the Grammar of Graphics (Wilkinson 2005) to build a plot incrementally to your specification. Aesthetic mappings, combined with other capabilities in ggplot, offer a number of powerful and easy-to-use formatting options. In this vignette, we will create scatter plots with a focus on adding visual sophistication.

Libraries

The following libraries will be used in examples.

library(dplyr)
library(ggthemes)
library(ggplot2)

Data Set

data("mpg")
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...
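
Several of the later examples map the cyl column to an aesthetic, so it can help to check how many distinct values it takes. The following is a small sketch using dplyr's count(); the name cyl_counts is only illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Count the number of cars for each cylinder value
cyl_counts <- count(mpg, cyl)
cyl_counts
```

The result has one row per distinct value of cyl, which is also the number of colours or shapes a legend for factor(cyl) will need.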

Scatter Plot Aesthetics

In this example, we will be using the mpg data set that ships with ggplot: fuel economy data from 1999 and 2008 for 38 popular models of cars.

Now let’s use ggplot to create a scatter plot of city miles per gallon (cty) against engine displacement (displ).

# city miles depending on the engine capacity
ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point() 

As you can see, the plot is not very appealing visually. Now let’s take a closer look at the properties of aes. The x and y properties, which give the coordinates of each point, are required; the other properties are optional. They include:

  • Colour
  • Shape
  • Size
  • Fill
  • Stroke
  • Alpha

Colour

In the plot below, we are mapping the colour property to the number of cylinders, so each cylinder count gets its own colour.

# city miles depending on the engine capacity categorised by no. of cylinders
ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl))) 
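
The cyl column is stored as an integer, and wrapping it in factor() makes ggplot treat it as a discrete variable with one colour per level. As a rough check, the sketch below uses layer_data() to look at the colours ggplot computed for the points; the names p and point_colours are only illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = factor(cyl)))

# layer_data() returns the data ggplot2 computed for a layer,
# including the colour assigned to every point
point_colours <- layer_data(p)$colour

# One distinct colour per cylinder count
unique(point_colours)
```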

Shape & Size

There are 26 point shapes available, numbered 0 to 25. Shapes 1 to 25 can be displayed by using the following R code.

# 25 shapes
ggplot(data.frame(x = 1:5 , y = 1:25, z = 1:25), aes(x, y)) +
  geom_point(aes(shape = z), size = 4, colour = "Black", fill = "Green") +
  scale_shape_identity()

To use fill you must use shapes 21-25, as shown in green above. We will use the fill property in the next section.

Now let’s use the shape to distinguish different cylinder types. We will also use the size property to make the data points larger.

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl), shape=factor(cyl)),size = 3) 

Fill

To map a property to a variable, the property must be inside the aes statement, as shown above. Properties set to a fixed value must be outside the aes statement.
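
The difference is easiest to see with a constant. In the sketch below (the names p_set and p_map are only illustrative), the first plot sets the colour to blue, while the second places the same string inside aes, where it is treated as a data value and given a colour from the default scale instead.

```r
library(ggplot2)

# Setting: a fixed value outside aes(); every point is drawn in blue
p_set <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(colour = "blue")

# Mapping: inside aes(), "blue" is treated as a one-level variable,
# so the colour comes from the default colour scale and a legend appears
p_map <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = "blue"))

set_colour <- unique(layer_data(p_set)$colour)
map_colour <- unique(layer_data(p_map)$colour)
```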

Now let’s combine the shape and fill properties. Note that we are using fixed values, so these properties are outside the aes statement. Mapping a variable to shape does not work here without extra effort, because fill is only supported by shapes 21-25, and the shapes that the default shape scale assigns to the 4 cylinder types do not support fill.

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl)), shape=21, size = 3, fill = "Black" ) 
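
If the shape does need to follow a variable while fill is in use, one possible approach is to supply fillable shapes yourself. The sketch below assumes scale_shape_manual() with shapes 21 to 24, one per cylinder type; the choice of shapes, the black outline and the mapping of fill to cyl are our own illustration rather than part of the example above.

```r
library(ggplot2)

p_fill <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  # Both shape and fill follow the number of cylinders
  geom_point(aes(shape = factor(cyl), fill = factor(cyl)),
             colour = "black", size = 3) +
  # Assign fillable shapes (21-24) to the four cylinder levels
  scale_shape_manual(values = c(21, 22, 23, 24))
p_fill

used_shapes <- sort(unique(layer_data(p_fill)$shape))
```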

Stroke

The outline colour of the data points is not very clear, so we can make the outlines thicker with the stroke property.

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl)),size = 3,  shape=21, fill = "Black", stroke = 2) 

Alpha

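Setting transparency can be useful for large data sets, where many points overlap. Here we are mapping the alpha property to vehicle class. Note that ggplot issues a warning that using alpha for a discrete variable is not advised.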

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl), alpha =class),size = 3,   stroke = 1) 
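
Transparency can also be set to a fixed value, which is a common way to reveal overlapping points. The sketch below is one way to do that; the value 0.3 is an arbitrary choice for illustration.

```r
library(ggplot2)

p_alpha <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  # A fixed alpha goes outside aes(); areas where points overlap appear darker
  geom_point(aes(colour = factor(cyl)), alpha = 0.3, size = 3)
p_alpha

alpha_values <- unique(layer_data(p_alpha)$alpha)
```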

Adding a “Best Fit” line

Now that we have our scatter plot formatted, let’s add a best fit line.

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl), alpha =class),size = 3,   stroke = 1) +
  geom_smooth(method = "loess")

You can hide the standard error band by using geom_smooth(method = "loess", se = FALSE).
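
A loess curve is a local smoother. If a straight best fit line is preferred, one option is method = "lm". The sketch below assumes a simple linear model of cty on displ, and also fits the same model with lm() so the coefficients can be inspected; the names p_lm and fit are only illustrative.

```r
library(ggplot2)

p_lm <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point() +
  # Straight-line fit without the standard error band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_lm

# The same model fitted directly, to read off intercept and slope
fit <- lm(cty ~ displ, data = mpg)
coef(fit)
```

A negative slope for displ means city miles per gallon falls as engine displacement grows.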

Themes

Using ggplot themes you can quickly enhance the overall appearance of your plot. Below we are using theme_dark. Let’s also add a title and rename the x and y axes and the legends.

ggplot(data=mpg, aes(x=displ, y=cty)) + 
  geom_point(aes(colour = factor(cyl), alpha =class),size = 3,   stroke = 1) +
  geom_smooth(method= "loess") +
  ggtitle("City Miles per Gallon by Engine Displacement\n") +
  labs(x="Engine displacement (litres)",y="City miles per gallon\n", alpha="Vehicle Class", colour="No. of cylinders") +
  theme_dark()
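
Individual theme elements can be adjusted on top of a complete theme with theme(). The sketch below is one possible tweak, moving the legend to the bottom and making the title bold; these particular settings are our own choice for illustration. The theme() call has to come after theme_dark(), because a complete theme replaces any settings made before it.

```r
library(ggplot2)

p_theme <- ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = factor(cyl)), size = 3) +
  ggtitle("City miles per gallon by engine displacement") +
  labs(colour = "No. of cylinders") +
  theme_dark() +
  # Adjust single elements after the complete theme has been applied
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))
p_theme
```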