1 Introduction

ggplot2 is a powerful data visualization package in R and is part of the tidyverse metapackage. R has several systems and packages for making graphs, but ggplot2 is one of the most elegant and versatile. Plots created with ggplot2 are often considered more visually appealing than those produced by base R functions and other packages. Another important feature is that the plot object it creates can be extensively customised. The structure of ggplot2 code differs substantially from that of base R functions, which may be one reason it can feel intimidating at the beginning. With practice, however, these seemingly complex lines of code produce graphs that are both attractive and informative, which makes the effort rewarding. The first version of the package, ggplot, was developed in 2006, and ggplot2 followed in 2007, by Hadley Wickham and his team.

The “gg” in ggplot2 stands for grammar of graphics. Just as every language follows grammatical rules that combine nouns, verbs, adjectives and so on into sentences, a graph or plot is a language of visualization that also needs grammatical rules; this idea is the main inspiration behind ggplot2. The package is strongly inspired by the book “The Grammar of Graphics” by Leland Wilkinson. A grammar of graphics is a tool that enables a concise description of the components of a graphic.

In the following sections we cover only some basics of ggplot2 and how it implements the grammar of graphics. The grammar of graphics is implemented in ggplot2 using different layers or components. The main components of ggplot2 are:

Data layer (In this layer we select the data that we want to visualise. ggplot2 works primarily with data frames, so we have to check whether the data set is a data frame. If it is not, it can usually be converted into one with suitable R code, for example as.data.frame(). Note that in a data frame each column is a variable and each row is an observation.)
Aesthetic layer (In this layer we define the aesthetics, that is, we map each variable to a visual property such as the x-axis or y-axis, taking the scale of measurement of the variable into account. Different types of data call for different plots: for example, a histogram is not suitable for categorical data, and likewise a bar plot is not suitable for a continuous variable. Matching variables to appropriate visual representations is central to the grammar of graphics.)
Geometry layer (In this layer we define the type of plot or geometry that we want to create, such as a histogram, bar plot, density plot, box plot or regression line.)
Facet layer (In this layer we can divide the plot into subgroups or facets. The grouping variable should be categorical, i.e. a factor.)
Theme layer (In the final layer we can add built-in themes, which control the background and other non-data graphics of the plot.)
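The requirement of the data layer can be checked in base R. The following is a minimal sketch; the matrix m and its values are made up purely for illustration and are not part of the nepali data.

```r
# A small numeric matrix standing in for data that is not yet a data frame
# (the values and column names are made up for illustration).
m <- matrix(c(12.8, 14.9, 91.2, 103.9), nrow = 2,
            dimnames = list(NULL, c("wt", "ht")))

is.data.frame(m)        # FALSE: a matrix is not a data frame

df <- as.data.frame(m)  # convert the matrix to a data frame
is.data.frame(df)       # TRUE
str(df)                 # each column is a variable, each row an observation
```

After the conversion, df can be passed to ggplot() as its data argument.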

The process of creating a plot using ggplot2 follows conventions that are a bit different from most of the R code we have seen so far (although it is somewhat similar to the idea of piping with the pipe operator %>%, which we introduced in an earlier course). The basic steps behind creating a plot with ggplot2 are:

Create an object of the ggplot class, typically specifying the data and some or all of the aesthetics;
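Add on geoms and other elements to create and customize the plot, using +. The plus sign + in ggplot2 plays a role similar to the pipe operator %>% (from the magrittr package, also made available by dplyr) in that both chain steps together, but the two are not interchangeable: + adds components to a plot object, whereas %>% passes a value on to the next function.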
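The two steps above can be sketched as follows. The data frame df is a tiny made-up example used only for illustration; the sketch assumes that ggplot2 is installed.

```r
library(ggplot2)

# A tiny made-up data frame, used only for illustration.
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# Step 1: create the ggplot object with the data and the aesthetics.
p <- ggplot(data = df, mapping = aes(x = x, y = y))
length(p$layers)   # 0: no geom has been added yet

# Step 2: add a geom with `+`; the result is again a ggplot object.
p2 <- p + geom_point()
length(p2$layers)  # 1: one layer has been added
p2                 # printing the object draws the plot
```

Because each step returns a plot object, a plot can be stored in a variable and extended later with further + calls.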

Using ggplot2 we can add on one or many geoms and other elements to create plots that range from very simple to very customized. We’ll focus on simple geoms and added elements first, and then explore more detailed customization later.

To use this package we first have to install either the metapackage tidyverse or the ggplot2 package itself. Remember that ggplot2 is a part of tidyverse. If you have already installed tidyverse, it is not necessary to install ggplot2 separately; it can be loaded in any R session whenever required.
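One possible way to install the package only when it is missing, and then load it, is sketched below. It assumes an internet connection and a configured CRAN mirror for the installation step.

```r
# Install ggplot2 only if it is not already available, then load it.
# (Installing "tidyverse" instead would also bring in ggplot2.)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```

Installation is needed only once per R installation, whereas library(ggplot2) has to be run in every new R session.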

The data set that we are going to use in the following sections is the nepali data set from the package faraway. Let us first install the package faraway.

# install.packages("faraway") # to install the package "faraway".

Before we begin with the visualization using ggplot2 let us explore the data set nepali which we are going to use for the demonstration purpose.

library(faraway) # Remember! we have to load the package before using it.
data(nepali)
str(nepali)

## 'data.frame':    1000 obs. of  9 variables:
##  $ id   : int  120011 120011 120011 120011 120011 120012 120012 120012 120012 120012 ...
##  $ sex  : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ wt   : num  12.8 12.8 13.1 13.8 NA 14.9 15.1 15.8 16.2 NA ...
##  $ ht   : num  91.2 93.9 95.2 96.9 NA ...
##  $ mage : int  35 35 35 35 35 35 35 35 35 35 ...
##  $ lit  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ died : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ alive: int  5 5 5 5 5 5 5 5 5 5 ...
##  $ age  : int  41 45 49 53 57 57 61 65 69 73 ...

Clearly, the nepali data set has the structure of a data frame with 1000 observations on 9 variables. Other details, such as the data types of the variables and the presence of some missing observations (NA), can also be seen.
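The missing observations noticed in the str() output can be counted directly. This is a small sketch using base R functions on the same data frame; it assumes that the faraway package is installed.

```r
library(faraway)
data(nepali)

dim(nepali)             # number of rows and columns
colSums(is.na(nepali))  # number of missing values (NA) in each variable
summary(nepali$wt)      # summary statistics and NA count for wt
```

Knowing which variables contain NA values helps to interpret the warnings that ggplot2 prints when it drops incomplete rows.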

We can view the details of this data frame by typing ?nepali in the R console. The data set comes from a public health study of Nepalese children. (Source: West KP, Jr., LeClerq SC, Shrestha SR, Wu LS, Pradhan EK, Khatry SK, Katz J, Adhikari R, Sommer A. Effects of vitamin A on growth of vitamin A deficient children: field studies in Nepal. J Nutr 1997;10:1957-1965).

Now we shall learn the basics of ggplot2 from scratch by creating some simple graphs such as histograms, bar plots, scatter plots and box plots, one by one. Let us start by creating a histogram of the variable wt (weight) from the nepali data set.

In the following subsections we learn how the grammar of graphics is implemented in ggplot2 by adding different layers, from the data layer to the theme layer. Just as we combine nouns, verbs and adjectives to create a new sentence, we add these layers to create a meaningful graph or plot; this is the key concept of the grammar of graphics.

1.1 Data Layer

The first step in creating a plot using ggplot2 is to select the data frame by passing the argument data = <data frame> to the function ggplot(). Our data frame is nepali, so our data layer should look like this:

library(ggplot2)
ggplot(data = nepali)

Figure showing the blank canvas, as only the data layer has been created.

When we run this code we should see a blank canvas, because the aesthetic and geometry layers still have to be added before any graph is drawn.

1.2 Aesthetic Layer

The next layer is the aesthetic layer, in which we map the variables that we want to visualise. As we are going to create a histogram of the variable wt, we have to map wt to the x-axis. This layer is created using the function aes() with arguments such as x = variable1 and y = variable2; in this case we use x = wt.

ggplot(data = nepali) + # "+" is used to add layers.
  aes(x = wt) # The argument `x = ` maps the variable to the x-axis.

figure showing mapping of the variable ‘wt’ to the x -axis.

Which aesthetics are required for a plot depends on which geoms (more on those in a second) you’re adding to the plot. You can find out the aesthetics you can use for a geom in the “Aesthetics” section of the geom’s help file (e.g., ?geom_point). Required aesthetics are in bold in this section of the help file and optional ones are not. Common plot aesthetics you might want to specify include:

Code	Description
`x`	Position on x-axis
`y`	Position on y-axis
`shape`	Shape
`color`	Color of border of elements
`fill`	Color of inside of elements
`size`	Size
`alpha`	Transparency (1: opaque; 0: transparent)
`linetype`	Type of line (e.g., solid, dashed)
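Several of the aesthetics in the table can be combined in one plot. The sketch below is illustrative only: it assumes the nepali data frame from faraway, maps color to the variable sex inside aes(), and sets shape, size and alpha to fixed values chosen arbitrarily.

```r
library(ggplot2)
library(faraway)
data(nepali)

p_aes <- ggplot(data = nepali) +
  # Mapped aesthetics go inside aes(): color varies with the value of sex.
  aes(x = ht, y = wt, color = factor(sex)) +
  # Fixed aesthetics go outside aes(): the same value for every point.
  geom_point(shape = 17, size = 2, alpha = 0.4, na.rm = TRUE)
p_aes
```

An aesthetic placed inside aes() is mapped to a variable and gets a legend, while one given as a plain argument to the geom is a constant applied to all elements.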

1.3 Geometry Layer

This is the layer in which we define the type of graph or geometry used to visualize the data. As we are going to create a histogram of the variable wt, we add this layer with the function geom_histogram().

geom_boxplot(), geom_bar() and geom_point() are functions that can be used to create a box plot, bar plot and scatter plot respectively. Other geometry functions can easily be viewed in RStudio, for example through its autocompletion after typing geom_.

ggplot(data = nepali) + aes(x = wt) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 123 rows containing non-finite values (stat_bin).

figure showing histogram of the ‘wt’ variable of ‘nepali’ data frame.

By default ggplot2 uses 30 bins, as the message shows; this can be adjusted with the argument bins = or, as the message suggests, with binwidth =. We can also see the warning message about the NA values. We do not discuss these in detail in this subsection; we return to them in the customisation section.

We can suppress the warning message by using the argument na.rm = TRUE, which silently removes the missing values (which R reads as NA) before the histogram is drawn.

ggplot(data = nepali) + aes(x = wt) + geom_histogram(na.rm = TRUE)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The argument na.rm = TRUE removes the missing observations, if any, without displaying a warning message.

If we want to color the border of the histogram bars we can use the argument color = inside the function geom_histogram(). Likewise, we can change the bin width (also known as class width or class size) with the argument binwidth =, or change the number of bins (number of classes)¹ with the argument bins =, inside the geom_histogram() function.

Suppose we want to color the border of the histogram black and set the number of bins to 25; the code will look like this:

ggplot(data = nepali) + aes(x = wt) + geom_histogram(na.rm = TRUE, bins = 25, color = "black")

Histogram of ‘wt’ variable with 25 bins and border colored with black.
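The bin width can be fixed instead of the number of bins. The following sketch assumes the same nepali data; the width of 1 (one unit of wt) and the fill color are arbitrary choices made for illustration.

```r
library(ggplot2)
library(faraway)
data(nepali)

# binwidth = 1 makes every bar cover one unit of wt.
# Use either `binwidth` or `bins`, not both.
p_bw <- ggplot(data = nepali) +
  aes(x = wt) +
  geom_histogram(na.rm = TRUE, binwidth = 1,
                 color = "black", fill = "grey80")
p_bw
```

Fixing binwidth keeps the bars comparable across plots of the same variable, whereas bins lets ggplot2 derive the width from the range of the data.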

Now, let us talk about labelling the title of the plot, the x-axis and the y-axis. We use the function ggtitle() for the title and the functions xlab() and ylab() for the x and y axes respectively.

ggplot(data = nepali) +
  aes(x = wt) + 
  geom_histogram(na.rm = TRUE, bins = 25, color = "black") +
  ggtitle("Histogram of the 'wt' variable") + 
  xlab("Weight of Children in Kilogram") + 
  ylab("Number of Children")

With these lines of code the histogram is labelled with a title and with x-axis and y-axis labels.
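As an alternative sketch, the same three labels can be set in a single call to labs(), a ggplot2 function that accepts title, x and y arguments. The label texts are the ones used above.

```r
library(ggplot2)
library(faraway)
data(nepali)

p_lab <- ggplot(data = nepali) +
  aes(x = wt) +
  geom_histogram(na.rm = TRUE, bins = 25, color = "black") +
  # labs() sets the title and both axis labels in one place.
  labs(title = "Histogram of the 'wt' variable",
       x = "Weight of Children in Kilogram",
       y = "Number of Children")
p_lab
```

Whether to use ggtitle(), xlab() and ylab() or a single labs() call is a matter of style; both approaches produce the same labels.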

Before we talk about the facet layer and the theme layer, let us practise once more how different layers are added to create a plot using ggplot2. This time let us create a box plot of the variable age from the same data frame nepali.

For this we first need to define the data layer using the argument data = nepali, which creates a blank canvas as in the previous case. Readers can easily try this by themselves. The code should look like this:

ggplot(data = nepali)

The next layer we want to add is the aesthetic layer. The code to add this layer is aes(x = age), which can be easily guessed.

When we add this layer the code will look like this:

ggplot(data = nepali) + aes(x = age)

Finally, we add the geom layer using the function geom_boxplot(), as we want to create a box plot of the variable age.

If the line of code you have in mind is:

ggplot(data = nepali) + aes(x = age) + geom_boxplot()

Figure showing the box plot of the variable ‘age’, from which we can see that the median age is about 38 months (age in this data set is recorded in months).

then you are absolutely correct, and you are getting the idea of the grammar of graphics through ggplot2.
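The values read from a box plot can be checked numerically. This is a minimal sketch with base R functions on the same variable; the object names med_age and q_age are introduced here only for illustration.

```r
library(faraway)
data(nepali)

# The thick line inside the box corresponds to the median.
med_age <- median(nepali$age, na.rm = TRUE)
med_age

# The edges of the box correspond to the first and third quartiles.
q_age <- quantile(nepali$age, probs = c(0.25, 0.75), na.rm = TRUE)
q_age
```

Comparing these numbers with the plot is a quick way to confirm that the box plot has been read correctly.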

Now, let us talk about creating a scatter plot using ggplot2. A scatter plot is used to visualize the relationship between two continuous (scale) variables, so we have to map two variables to the x and y axes. In this example let us create the scatter plot of height (ht) and weight (wt); the code is as below:

ggplot(data = nepali) + #  to define the data layer
  aes(x = ht, y = wt) + # `ht` is mapped to x-axis and `wt` to y-axis.
  geom_point(na.rm = TRUE) # the function `geom_point()` is used to create scatter plot.

scatter plot between height and weight.

The two variables are positively correlated: weight increases as height increases.

As there is a clear positive correlation between the variables ht and wt of the children, we may be interested in fitting a linear regression line of wt on ht, that is, with wt as the dependent variable. For the linear regression model readers are advised to refer to “Simple Regression Analysis”. In this section we plot a regression line over this scatter plot using ggplot2.
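The strength of this relationship can be quantified with the Pearson correlation coefficient. The sketch below uses the base R function cor(); the argument use = "complete.obs" drops the rows in which either variable is missing.

```r
library(faraway)
data(nepali)

# Pearson correlation between height and weight,
# computed only on rows where both values are present.
r <- cor(nepali$ht, nepali$wt, use = "complete.obs")
r
```

A value close to +1 indicates a strong positive linear relationship, in line with the upward pattern of the scatter plot.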

ggplot(data = nepali) +
  aes(x = ht, y = wt) +
  geom_point(na.rm = TRUE ) +
  geom_smooth(method = "lm", na.rm = TRUE)  # "lm" stands for linear model.

## `geom_smooth()` using formula 'y ~ x'

Regression line embedded over the scatter plot

By the way, if we want to find the coefficients, i.e. the intercept and slope of the regression line, we can use the following code.

coef(lm(wt ~ ht, data = nepali))

## (Intercept)          ht 
##  -8.7581466   0.2341969
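These coefficients can be used for prediction. As a hedged illustration, for a hypothetical child with ht = 90 the fitted line gives -8.758 + 0.2342 × 90, which is about 12.3 in the units of wt. The sketch below performs the same calculation with predict(); the value 90 and the object names are assumptions chosen for the example.

```r
library(faraway)
data(nepali)

# Fit the simple linear regression of wt on ht
# (rows with missing values are dropped by lm() by default).
fit <- lm(wt ~ ht, data = nepali)

# Hypothetical child with a height of 90 (same units as ht).
new_child <- data.frame(ht = 90)
pred <- predict(fit, newdata = new_child)
pred

# The same prediction computed by hand: intercept + slope * height.
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 90)
by_hand
```

Predictions are most trustworthy for heights within the range observed in the data.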

By using the geometry functions geom_bar(), geom_density(), geom_qq() and geom_freqpoly() we can draw a bar plot, density plot, quantile-quantile plot and frequency polygon respectively. RStudio lists the available geometry functions through autocompletion.
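Two of these geoms can be sketched with the same data as follows. The choice of variables is illustrative: wt for the density plot, and sex, converted to a factor, for the bar plot.

```r
library(ggplot2)
library(faraway)
data(nepali)

# Density plot of a continuous variable.
p_den <- ggplot(data = nepali) +
  aes(x = wt) +
  geom_density(na.rm = TRUE)
p_den

# Bar plot of a categorical variable: sex is stored as an integer,
# so it is converted to a factor before being mapped to the x-axis.
p_bar <- ggplot(data = nepali) +
  aes(x = factor(sex)) +
  geom_bar()
p_bar
```

Only the geom and the mapped variable change between the two plots; the data layer and the overall structure of the code stay the same.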

1.4 Facet layer

The facet layer is added whenever we need to divide the graph into subgroups or facets; the grouping variable should be categorical. For example, the above scatter plot can be faceted by the sex variable. The function we use to facet the plot is facet_wrap().

ggplot(data = nepali) + 
  aes(x = ht, y = wt) + 
  geom_point(na.rm = TRUE) +
  facet_wrap(~sex)

Scatter plot of height and weight faceted by sex; 1 = male, 2 = female.

Similarly, we can draw separate regression lines for males and females, each in its own facet, using this code.

ggplot(data = nepali) + 
  aes(x = ht, y = wt) + 
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  facet_wrap(~sex)

## `geom_smooth()` using formula 'y ~ x'

Regression line embedded over scatter plot of height and weight faceted by sex; 1 = male, 2 = female.
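The facet titles can be made self-explanatory by converting sex into a labelled factor before plotting. This is a sketch that uses the coding stated in the captions (1 = male, 2 = female); the column name sex_label is introduced here only for illustration.

```r
library(ggplot2)
library(faraway)
data(nepali)

# New labelled factor; the original integer column is left unchanged.
nepali$sex_label <- factor(nepali$sex,
                           levels = c(1, 2),
                           labels = c("Male", "Female"))

p_facet <- ggplot(data = nepali) +
  aes(x = ht, y = wt) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ sex_label)   # facet strips now read "Male" and "Female"
p_facet
```

With labelled facets the coding of the variable no longer has to be explained in the caption.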

1.5 Theme Layer

Themes are a powerful way to customize the non-data components of your plots: i.e. titles, labels, fonts, background, gridlines, and legends. Themes can be used to give plots a consistent customized look.

Commonly used themes include theme_minimal(), theme_classic() and theme_linedraw().

ggplot(data = nepali) +
  aes(x = ht, y = wt) +
  geom_point(na.rm = TRUE ) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  theme_linedraw()

## `geom_smooth()` using formula 'y ~ x'

The theme_linedraw() theme added to figure 7 (the scatter plot with the regression line).
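Individual theme elements can also be adjusted with the function theme(). The sketch below is one possible customisation: the title text and the particular elements changed are arbitrary choices made for illustration.

```r
library(ggplot2)
library(faraway)
data(nepali)

p_theme <- ggplot(data = nepali) +
  aes(x = ht, y = wt) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE) +
  ggtitle("Weight against height") +
  # Apply the complete built-in theme first ...
  theme_minimal() +
  # ... then override single elements; the reverse order would undo them.
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        panel.grid.minor = element_blank())
p_theme
```

Here hjust = 0.5 centres the title and element_blank() removes the minor grid lines, while everything else is inherited from theme_minimal().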

¹ A suitable number of classes (K) can be obtained using Sturges' formula, K = 1 + 3.322 log₁₀ N, where N is the number of observations.
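The formula in this footnote can be evaluated in R. The sketch below applies it to the non-missing values of wt and compares the result with the base R function nclass.Sturges(), which uses the equivalent form 1 + log₂ N rounded up; the object names are introduced only for illustration.

```r
library(faraway)
data(nepali)

wt_obs <- nepali$wt[!is.na(nepali$wt)]  # keep the observed weights only
n <- length(wt_obs)                     # N, the number of observations

# Sturges' formula with the base-10 logarithm, rounded up to a whole number.
k <- ceiling(1 + 3.322 * log10(n))
k

# Built-in equivalent from grDevices, based on log2(N) + 1.
k_builtin <- nclass.Sturges(wt_obs)
k_builtin
```

The resulting value can be passed to geom_histogram() through the bins argument.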

The `ggplot2` package

Pramo Udaya, Butwal Multiple Campus

1/4/2021
