Getting started – from base R to the tidyverse

We’re going to begin by making some very basic plots and doing some simple data processing with base R tools, and then show how to do the same things better, faster, and more attractively with tidyverse tools.

Set Up Your Project and Load Libraries

As always, we begin by loading the libraries we will be using. If we do not load them, R will not be able to find the functions they contain (unless we use the “::” format, as in package::function). Right now we’re just using a few packages.
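The setup chunk itself is not shown here. A minimal sketch, assuming the packages this section relies on are ggplot2 (for plotting) and gapminder (for the example data), might look like this:

```r
# Assumption: these are the packages used in this section.
# Install them once with install.packages(c("ggplot2", "gapminder")) if needed.
library(ggplot2)    # ggplot(), geom_point(), ggsave(), ...
library(gapminder)  # the gapminder data set

# The "::" form calls something from a package without attaching the package
n_rows <- nrow(gapminder::gapminder)
n_rows
```

The last two lines would work even without the library(gapminder) call, because “::” names the package explicitly.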

We also set up some defaults here. We want to see the commands we are issuing in our output, because this is for learning purposes. If we were writing a paper, we would not want to see the commands, so we would set echo = FALSE.
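One common way to set such a default is through knitr’s chunk options in the setup chunk. The sketch below assumes that is how it is done here; the exact options used in the original document are not shown.

```r
# Assumed setup-chunk option: show the code of every chunk in the knitted output
knitr::opts_chunk$set(echo = TRUE)

# For a paper, we would hide the code instead:
# knitr::opts_chunk$set(echo = FALSE)
```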

Comparing Base R to ggplot2

Base R Graphics

The quickest way to start, though not necessarily the most efficient, is to use the plot function from base R.

Before plotting, take a look at the data. Notice how R Markdown manages the output when you print an entire dataset. A plain data frame of this size, printed in the console, would show all 1,704 rows, which is usually undesirable; gapminder is a tibble, so by default only the first 10 rows appear. Either way, the head() function lets you choose how many rows to display.

head(gapminder, n = 20)
## # A tibble: 20 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
## 13 Albania     Europe     1952    55.2  1282697     1601.
## 14 Albania     Europe     1957    59.3  1476505     1942.
## 15 Albania     Europe     1962    64.8  1728137     2313.
## 16 Albania     Europe     1967    66.2  1984060     2760.
## 17 Albania     Europe     1972    67.7  2263554     3313.
## 18 Albania     Europe     1977    68.9  2509048     3533.
## 19 Albania     Europe     1982    70.4  2780097     3631.
## 20 Albania     Europe     1987    72    3075321     3739.

Let’s use base R plots now. We access individual variables from the main data frame.

plot(gapminder$gdpPercap, gapminder$lifeExp)

It works, but there are several drawbacks: 1) the default settings are often odd (like using scientific notation), 2) the axis labels and titles are unusual, 3) the appearance is quite unattractive, and 4) customizing it requires remembering a complex list of commands.
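To give a sense of the fourth point, here is a rough sketch of the same scatterplot with a few common base R customizations. The label text and styling choices are illustrative assumptions, not part of the original example.

```r
library(gapminder)

# Base R: every refinement is another argument to remember
plot(gapminder$gdpPercap, gapminder$lifeExp,
     xlab = "GDP per capita",                      # replace the default "$" labels
     ylab = "Life expectancy (years)",
     main = "Life expectancy vs. GDP per capita",
     pch  = 16,                                    # solid circles
     col  = rgb(0, 0, 0, 0.3),                     # semi-transparent black
     log  = "x")                                   # logarithmic x-axis
```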

Now, let’s try creating a histogram.

hist(iris$Sepal.Length, breaks = 10)

It’s adequate, but far from perfect. The default settings are subpar, and the appearance leaves much to be desired. The default x-axis and title even include strange $ signs.

Switching to ggplot2

Let’s quickly switch to ggplot2. We’ll begin with qplot, which allows us to create a much more visually appealing plot compared to base R graphics.

qplot(data=gapminder,x=gdpPercap,y=lifeExp)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

qplot(data=iris,x=Sepal.Length,geom="histogram",bins=10)

That’s the last time we’ll use qplot. It’s much prettier, but 1) it is now officially deprecated, and 2) it doesn’t let us behold the power of this fully operational battlestation, I mean ggplot2.

Let’s try to make 2 simple plots.

p.scatter <- ggplot(data=gapminder, 
              mapping=aes(x=gdpPercap,y=lifeExp)) + 
              geom_point()

p.hist <- ggplot(data=iris, 
        mapping=aes(x=Sepal.Length)) + 
        geom_histogram(bins=10)
p.scatter

p.hist

You’ve just created your first plot with the full capabilities of ggplot2!

You might wonder why this is preferable to qplot, which seemed to do everything more straightforwardly.

In essence, plot and qplot are single-line imperative commands. Modifying a plot means altering complex options within that line, and creating a different plot type requires an entirely new command. Even then, certain customizations are nearly impossible.

ggplot provides a comprehensive solution by enabling you to construct visualizations logically, layer by layer. These layers can be combined like Lego pieces, allowing you to create and customize plots with ease.

In this simple example, the ggplot call does the following:

  1. Specifies the dataset to be plotted.

  2. Defines the aesthetic mapping, linking variables to visual elements. Here, GDP is mapped to the X axis and life expectancy to the Y axis.

  3. Adds a single layer or geom, which in this case is a scatterplot.

  4. Relies on sensible and appropriate defaults for all unspecified settings.
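The four steps above can be made visible by building the scatterplot in two stages. This is only a sketch; the object names p and p_points are made up for illustration.

```r
library(ggplot2)
library(gapminder)

# Steps 1 and 2: the dataset and the aesthetic mapping (no layers yet)
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Step 3: add a single layer (a geom)
# Step 4: everything not specified is left at its default
p_points <- p + geom_point()

p_points
```

Printing p on its own draws only the empty axes, because no geom has been added yet.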

Defaults

ggplot2 has a set of defaults that are usually sensible. The creators of this package and many of its complementary packages have very good taste!

But while these are often a good starting point, they can (and should) be tweaked to fit specific needs and preferences.
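As a small illustration of tweaking the defaults, the sketch below changes the axis labels, the title, the point transparency, and the theme. The specific choices (label text, alpha value, theme_minimal) are assumptions made for the example, and p_tweaked is a made-up name.

```r
library(ggplot2)
library(gapminder)

p_tweaked <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +                       # semi-transparent points
  labs(x = "GDP per capita",
       y = "Life expectancy (years)",
       title = "Life expectancy vs. GDP per capita") +
  theme_minimal()                                 # replace the default grey theme

p_tweaked
```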

Concepts

A grammar of graphics

ggplot2 is built on the principles of the “grammar of graphics,” which defines a structured approach to creating visualizations. In this system, data are mapped to geometric objects that have aesthetic attributes such as color, position, and size. Data can be transformed for more convenient visualization, scales can be adjusted, and the results are projected onto a coordinate system—typically Cartesian—forming a cohesive “graphics sentence.”
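One possible “graphics sentence” for the gapminder data is sketched below, with each part of the grammar marked in a comment. The choice of color, size, and a log scale is an assumption for illustration, and p_sentence is a made-up name.

```r
library(ggplot2)
library(gapminder)

p_sentence <- ggplot(data = gapminder,                     # data
                     mapping = aes(x = gdpPercap,          # position
                                   y = lifeExp,
                                   color = continent,      # color
                                   size = pop)) +          # size
  geom_point(alpha = 0.5) +                                # geometric object
  scale_x_log10() +                                        # adjusted scale
  coord_cartesian()                                        # coordinate system

p_sentence
```

coord_cartesian() is already the default, so it is written out here only to make the coordinate system explicit.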

Mapping

In ggplot2, mappings are set up using ggplot() and aes(). These functions establish the relationships between your data and the visual aspects of your plot, known as aesthetic mappings.
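A useful distinction is between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The sketch below shows both; the object names and the color "steelblue" are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(gapminder,
                   aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

# Setting: one fixed color for every point, so it goes outside aes()
p_fixed <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```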

Geoms

Geometric objects, or geoms, are specified in code with geom_xxx(), where xxx represents the type of plot, such as point, line, or histogram (geom_point(), geom_line(), geom_histogram()). Each geom represents a layer of the plot.
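To see how the geom alone changes the type of plot, the sketch below draws the same data with three different geoms. It uses the Albania rows of gapminder purely as a convenient small example; albania and p_base are made-up names.

```r
library(ggplot2)
library(gapminder)

# A small subset: one row per survey year for a single country
albania <- subset(gapminder, country == "Albania")

# Same data and mapping, different geoms
p_base <- ggplot(albania, aes(x = year, y = lifeExp))
p_base + geom_point()   # scatterplot
p_base + geom_line()    # line plot

# A histogram needs only an x aesthetic
ggplot(albania, aes(x = lifeExp)) + geom_histogram(bins = 5)
```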

Putting things together

Layers in a ggplot2 plot are combined using the + operator, and they are drawn in the order you specify. The simplest plot includes two components: the ggplot() function and a geom_xxx() function. For example, a scatterplot with var1 on the x-axis and var2 on the y-axis:

#ggplot(data = data, aes(x=var1, y=var2)) + geom_point()
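A concrete, runnable version of this template, using the built-in iris data and adding a second layer, might look like the sketch below. The choice of variables and of a linear fit is an assumption for illustration, and p_two is a made-up name.

```r
library(ggplot2)

p_two <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +               # first layer: the raw points
  geom_smooth(method = "lm")   # second layer: a fitted line drawn on top

p_two
```

Because layers are drawn in order, swapping the two geom lines would draw the points on top of the fitted line instead.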

Save

Let’s save this plot in the Plots subfolder, with a set width and height. Notice we have to issue a save command for each file type we want.

We’ll want to save both PNG and PDF files. The PNG file is useful for display and sharing. The PDF file is a vector format that can be resized without losing quality, which makes it the better choice for papers and presentations.

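# With no plot argument, ggsave() saves the most recently displayed plot
ggsave(filename = "Plots/ch2_1.png", width = 8, height = 5)
ggsave(filename = "Plots/ch2_1.pdf", width = 8, height = 5)
# Name the plot object explicitly to save a specific plot
ggsave(filename = "Plots/ch2_scatter.pdf", plot = p.scatter, width = 8, height = 5)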
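The ggsave() calls above assume that the Plots subfolder already exists. A minimal sketch that creates the folder first and then checks that the file was written is shown below; the plot object p_example and the file name example_hist.png are hypothetical.

```r
library(ggplot2)

# Create the Plots subfolder if it is not there yet
dir.create("Plots", showWarnings = FALSE)

# A small plot to save (same histogram as earlier, rebuilt so this runs on its own)
p_example <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)

# Hypothetical file name; width and height are in inches by default
ggsave(filename = "Plots/example_hist.png", plot = p_example,
       width = 8, height = 5)

file.exists("Plots/example_hist.png")
```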

But do we really need to save plots if we commit to R Markdown? No, but saving is useful if you want to use them somewhere else: in Word, PowerPoint, or LaTeX, on social media, and so on.

Knit

Click the Knit button to create a knitted document fusing code, output, and notes together.