ggplot2: the basics

Packages Loaded

A list of packages required.

Use install.packages("packagename") if you have not previously installed these packages in R.
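If you are not sure which of these packages are already installed, a small check like the one below can help. This is only a sketch: the object names (pkgs, missing_pkgs) and the choice of CRAN mirror are my own assumptions, not part of the original tutorial.

```r
# Packages used in this tutorial ("datasets" ships with R, so it is left out)
pkgs <- c("ggplot2", "viridis", "gridExtra")

# Keep only the packages that cannot be found in the current library
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install whatever is missing; the mirror URL is an assumption, use any CRAN mirror
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```

Packages only need to be installed once, but they must be loaded with library() in every new R session.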

#Access practice datasets
library(datasets)

#Make awesome plots
library(ggplot2)

#Use color palettes
library(viridis)

#Create multiple figure plots
library(gridExtra)

Load the data needed

“swiss” is a dataset describing several socioeconomic factors within French-speaking regions of Switzerland in about 1888:

Table 1, List of variables in swiss

Variable Name	Variable Type	Variable Description
Fertility	Numeric	Ig, ‘common standardized fertility measure’
Agriculture	Numeric	% of males involved in agriculture as occupation
Examination	Numeric	% of draftees receiving highest mark on army examination
Education	Numeric	% of draftees with education beyond primary school
Catholic	Numeric	% ‘catholic’ (as opposed to ‘protestant’)
Infant.Mortality	Numeric	Live births who live less than 1 year

#Save the dataset in the environment
swiss <- as.data.frame(swiss)
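Before plotting, it can help to take a quick look at what the dataset contains. The lines below are a minimal sketch using base R functions only, and they assume nothing beyond the built-in swiss data.

```r
library(datasets)

# Number of rows (regions) and columns (variables)
dim(swiss)

# Variable names and types
str(swiss)

# First few regions
head(swiss)

# Quick summary of the two variables plotted in this tutorial
summary(swiss[, c("Education", "Fertility")])
```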

The Most Basic of Plots

I’m interested in finding out whether, in 1888, education levels in different regions were related to fertility in those regions.

Here, we try the first line of ggplot2 code. We assign our plot the name “basicscatter”, we choose the dataframe we want to use (swiss) and we map the x and y variables we intend to plot to what are called the plot “aesthetics”.

When we want to see what we have created with this plot, we simply call the assigned name “basicscatter”.

#make the plot
basicscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility))

#call the plot
basicscatter

As you can see, this doesn’t work super well. The graph we have been supplied with shows everything but the data. This is because ggplot2 needs to be told what type of plot to draw in a second line of code. This is different from the base function “plot”, which automatically selects an appropriate plot to display your variables.

#base plot function
plot(swiss$Education, swiss$Fertility)

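A similar shortcut exists in ggplot2, where qplot can help you quickly visualise your data. Note that qplot has been deprecated since ggplot2 3.4.0, so recent versions may show a warning when you use it.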

#make a quick plot with qplot
qplot(Education, Fertility, data = swiss)

However, although this is sufficient to show us the trends in the data, the graph itself isn’t very aesthetically pleasing. Part of the strength of ggplot2 lies in the ability of the user to create multi-faceted and fully customisable layers.

Each layer within a ggplot can refer to a number of aspects of the plot, namely: aesthetic mappings, geometric objects (geoms), statistical transformations (stats) and position adjustments. A comprehensive list of layer types is provided here: https://ggplot2.tidyverse.org/reference/

The first of these we will focus on are the geoms. The easiest way I find to think of geoms is as the shape you want your data to take. For example, in the plots we have made so far we have two continuous variables and we want to create a scatter plot, so we use the function “geom_point”.
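To illustrate how the geom decides the shape of the plot, here is a small sketch that draws the swiss data with two different geoms and places them side by side with gridExtra. The object names (p_points, p_hist) and the number of bins are my own choices for illustration.

```r
library(ggplot2)
library(gridExtra)

# Two continuous variables: draw the data as points
p_points <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
  geom_point()

# One continuous variable: draw the data as a histogram
# (bins = 10 is an arbitrary choice)
p_hist <- ggplot(data = swiss, aes(x = Fertility)) +
  geom_histogram(bins = 10)

# Show both plots next to each other
grid.arrange(p_points, p_hist, ncol = 2)
```

The data and the ggplot() call have the same structure in both plots. Only the aesthetics and the geom differ.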

To add new layers to a ggplot we simply put a “+” at the end of each line and write the next layer on the following line, building up a list of code.

#make the plot
basicscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                geom_point()
#call the plot
basicscatter
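Because a ggplot is stored as an object, a layer can also be added to a plot that was saved earlier. The sketch below shows this with two hypothetical object names (emptyplot, withpoints) so that it does not overwrite anything created above.

```r
library(ggplot2)

# Save the plot with its data and aesthetics, but without any layer
emptyplot <- ggplot(data = swiss, aes(x = Education, y = Fertility))

# Add a layer to the saved object afterwards
withpoints <- emptyplot + geom_point()

# call the plot
withpoints
```

This can be handy when you want to try several geoms on the same base plot without retyping the first line.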

The resulting plot is much better. Now we can see the data, start to examine the relationship we are interested in, and begin to customise the design!

Begin the Customisation!

Ok, so we have a basic scatter plot but what can we do to improve the visualisation in order to make it publishable or use it to illustrate a point in a presentation?

The first thing I notice in this plot is that my gridlines don’t really add any extra detail, and my axis line is not very clear. For the purposes of my report on Swiss socioeconomic factors, I might want to create a clean, scientific style background for my plot. To do this, I want to begin to change aspects of the theme by adding another line of code with another “+”. There are some premade themes (again see https://ggplot2.tidyverse.org/reference/#section-themes) and I might try theme_bw() to see how this changes the look of my figure.

#make the plot
betterscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                 geom_point() +
                 theme_bw()
#call the plot
betterscatter

To me, this is better, but not what I wanted. The grey background has gone, but I still didn’t want gridlines in the background. Perhaps it is better if I customise my own theme instead.

#make the plot
betterscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                 geom_point() +
                 theme(panel.grid.major = element_blank(), 
                       panel.grid.minor = element_blank(),
                       panel.background = element_rect(fill = "white", colour = "black"), 
                       axis.title = element_text(size = 12),
                       axis.text = element_text(size = 12))

#call the plot
betterscatter

Much better! Here, I set the panel grid lines to be blank with “= element_blank()”. I also made sure the panel border would be black and the background white by creating an “element_rect”. I changed the size of the font on the axis titles and scales with “axis.title” and “axis.text”. Note that the arguments for the theme are all included within one main set of parentheses and separated from each other by commas, not “+”. Again, there are many more options available with which you can customise your figure theme. Check this link for a full breakdown: https://ggplot2.tidyverse.org/reference/theme.html.
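If you plan to use the same custom theme for several figures, one option is to save the theme settings as their own object and add that object to each plot. This is a sketch of that idea. The names cleantheme and themedscatter are hypothetical and do not appear elsewhere in this tutorial.

```r
library(ggplot2)

# Save the custom theme settings once
cleantheme <- theme(panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_rect(fill = "white", colour = "black"),
                    axis.title = element_text(size = 12),
                    axis.text = element_text(size = 12))

# Add the saved theme to a plot with "+", like any other layer
themedscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
  geom_point() +
  cleantheme

# call the plot
themedscatter
```

The code in the rest of this tutorial writes the theme out in full each time, which gives the same result.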

There are other aspects of the figure that could do with improving. Some of my data points fall beyond the outermost values labelled on my axes, so I might want to set the axis limits myself. Finally, I might want to change the titles on my axes to be more specific about my variables.

#make the plot
betterscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                 geom_point() +
                 theme(panel.grid.major = element_blank(), 
                       panel.grid.minor = element_blank(),
                       panel.background = element_rect(fill = "white", colour = "black"), 
                       axis.title = element_text(size = 12),
                       axis.text = element_text(size = 12)) +
                 xlab("Percent Educated Beyond Primary Level") +
                 ylab("Fertility Index") +
                 xlim(0, 60) +
                 ylim(0, 100)

#call the plot
betterscatter
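One thing to keep in mind is that xlim() and ylim() remove any observations that fall outside the limits you set (with a warning), whereas coord_cartesian() only zooms the view and keeps all of the data. The sketch below checks the range of the two variables first and then shows the zooming alternative. The name zoomedscatter is my own.

```r
library(ggplot2)

# Check that the chosen limits cover all of the data
range(swiss$Education)
range(swiss$Fertility)

# Alternative: zoom the axes without dropping any observations
zoomedscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 60), ylim = c(0, 100))

# call the plot
zoomedscatter
```

For this dataset all values lie inside the limits used above, so both approaches show the same points.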

Add to your data

Our last section provided us with a very scientific looking plot. However, what if we want to add more detail based on our dataset?

The first thing I might be interested in is whether there is any trend in my data. I can check this superficially by adding a “geom_smooth” with method = "lm". This automatically fits a linear regression line using the relationship y ~ x and draws a confidence interval around it.

#make the plot
bestscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                 geom_point() +
                 geom_smooth(method = "lm", color = "black", linewidth = 0.5) +
                 theme(panel.grid.major = element_blank(), 
                       panel.grid.minor = element_blank(),
                       panel.background = element_rect(fill = "white", colour = "black"), 
                       axis.title = element_text(size = 12),
                       axis.text = element_text(size = 12)) +
                 xlab("Percent Educated Beyond Primary Level") +
                 ylab("Fertility Index") +
                 xlim(0, 60) +
                 ylim(0, 100)

#call the plot
bestscatter

The data show the expected negative relationship between the fertility of a region and the degree to which individuals there were being educated. According to Mosteller and Tukey (1977), Switzerland was entering a period of demographic transition in 1888 (the year of the study), during which fertility began to fall from the higher levels characteristic of less developed countries.
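The trend line drawn by geom_smooth with method = "lm" corresponds to an ordinary linear model of y on x. As a rough check of the direction of the relationship, the same model can be fitted directly with base R. This is only a sketch, and the object name fit is my own.

```r
# Fit the same linear model that the trend line is based on
fit <- lm(Fertility ~ Education, data = swiss)

# Intercept and slope; a negative slope means fertility falls as education rises
coef(fit)

# Confidence intervals for the coefficients
confint(fit)

# Correlation between the two variables
cor(swiss$Education, swiss$Fertility)
```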

Perhaps we might also be interested in adding a third variable to our plot. The swiss dataset includes information on infant mortality, and we might predict that decreases in fertility would be more likely where infant survival is higher, or that higher levels of education would increase the medical care available in a region.

To understand whether this relationship is at all likely we might want to highlight the points based on infant mortality. We can do this by changing the colour, the shape or the size of the points based on our third variable.

#make the plot
bestscatter <- ggplot(data = swiss, aes(x = Education, y = Fertility)) +
                 geom_point(aes(fill = Infant.Mortality), alpha = 0.6, shape = 21, size = 3, color = "grey") +
                 geom_smooth(method = "lm", color = "black", linewidth = 0.7, linetype = "dashed") +
                 scale_fill_viridis(option = "viridis") +
                 theme(panel.grid.major = element_blank(), 
                       panel.grid.minor = element_blank(),
                       panel.background = element_rect(fill = "white", colour = "black"), 
                       axis.title = element_text(size = 12),
                       axis.text = element_text(size = 12)) +
                 xlab("Percent Educated Beyond Primary Level") +
                 ylab("Fertility Index") +
                 xlim(0, 60) +
                 ylim(0, 100)

#call the plot
bestscatter

In this figure, I added infant mortality as a third variable to the aesthetics using “fill = Infant.Mortality”. I changed the shape of the data points within my “geom_point” to a point shape that can be filled with colour. I changed the size by specifying the “size” argument and the opacity by changing the “alpha” argument. I also changed the linetype in my geom_smooth call to dashed so that I would be able to see my points more clearly when they overlap my trendline. See below for which shapes and line types are available.

By using scale_fill_viridis(option = "viridis"), I chose to scale the fill of the data points by their value for infant mortality, using the colour palette “viridis”, available in the package “viridis” (more details are given in Figure 1 below). I could also have changed the size of my points by mapping size to a variable and adding scale_size() (for more details see https://www.r-graph-gallery.com/320-the-basis-of-bubble-plot.html).
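As a rough illustration of the size option, the sketch below maps infant mortality to point size instead of fill. The object name bubblescatter, the size range and the legend title are my own choices, so adjust them to suit your figure.

```r
library(ggplot2)

# Map the third variable to point size instead of fill colour
bubblescatter <- ggplot(data = swiss,
                        aes(x = Education, y = Fertility, size = Infant.Mortality)) +
  geom_point(alpha = 0.6, color = "grey40") +
  # range sets the smallest and largest point sizes (arbitrary values here)
  scale_size(range = c(1, 6), name = "Infant mortality") +
  xlab("Percent Educated Beyond Primary Level") +
  ylab("Fertility Index")

# call the plot
bubblescatter
```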

Figure 1, Guide to linetypes and point shapes available for ggplot2, as well as the colour-blind-friendly palettes available in the package viridis.

Even More Fun!

Now that you have been taken through the basics of how code is organised in ggplot2, it’s time to try your own! Obviously, we have only discussed one type of scatterplot here, applied to a very specific dataset. In reality the possibilities are endless! Try working with one of the other geoms on your own data (use reference 1 below for a comprehensive guide), move on to one of our other tutorials or try something wacky from the R Graph Gallery!