Section 2 Overview

In Section 2, you will learn how to create data visualizations in R using the ggplot2 package.

After completing Section 2, you will:

be able to use ggplot2 to create data visualizations in R.
be able to explain what the data component of a graph is.
be able to identify the geometry component of a graph and know when to use which type of geometry.
be able to explain what the aesthetic mapping component of a graph is.
be able to understand the scale component of a graph and select an appropriate scale component to use.

There is 1 assignment that uses the DataCamp platform for you to practice your coding skills.

Note that it can be hard to memorize all of the functions and arguments used by ggplot2, so we recommend that you have a cheat sheet handy to help you remember the necessary commands.

We encourage you to use R to interactively test out your answers and further your learning.

ggplot2

Key points and notes

Throughout the series, we will create plots with the ggplot2 package. ggplot2 is part of the tidyverse suite of packages, which you can load with library(tidyverse).
Note that you can also load ggplot2 alone using the command library(ggplot2), instead of loading the entire tidyverse.
ggplot2 uses a grammar of graphics to break plots into building blocks that have intuitive syntax, making it easy to create relatively complex and aesthetically pleasing plots with relatively simple and readable code.
ggplot2 is designed to work exclusively with tidy data (rows are observations and columns are variables).

Graph Components

Transcript

The first step in learning ggplot2 is to be able to break a graph into components. So let’s break down this plot while we introduce some of the ggplot terminology.

There are three main components that we have to be aware of. First, the US murders data table is being summarized. We refer to this as the data component.

Second, the plot is a scatter plot. This is referred to as the geometry component. Other possible geometries are bar plots, histograms, smooth densities, q-q plots, and box plots.

Third are what we call the mappings. The x-axis values are used to display population size. The y-axis values are used to display the total number of murders. Text is used to identify the states. And colors are used to denote the four different regions. This is referred to as the aesthetic mapping component. How we define the mapping depends on what geometry we use.

Other components of the plot are worth mentioning. The range of the x-axis and the y-axis appears to be defined by the range of the data, and both axes are on log scales. We refer to this as the scale component.

We also see that there are labels, a title, and a legend.

Key points

Plots in ggplot2 consist of 3 main components:

Data: The dataset being summarized
Geometry: The type of plot (scatterplot, boxplot, barplot, histogram, qqplot, smooth density, etc.)
Aesthetic mapping: Variables mapped to visual cues, such as x-axis and y-axis values and color

There are additional components:

Scale
Labels, Title, Legend
Theme/Style

Code

library(dslabs)
data(murders)
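
As a quick check of the data component, the following sketch inspects the murders table before any plotting. It uses only base R functions on the dslabs data; nothing here is specific to ggplot2.

```r
library(dslabs)
data(murders)

# each row is one state (an observation), each column is a variable
head(murders)
str(murders)

# the variables used later in the plot: population, total, abb, and region
names(murders)
```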

Creating a New Plot

Transcript

The first step in creating a ggplot graph is to define a ggplot object. We do this with a function ggplot, which initializes the graph.

If we read the help file for this function, we see that the first argument is used to specify what data is associated with the object. This is the data component.

So to initiate an object, we can type ggplot data equals murders. This associates the data set with the plotting object. We can also pipe the data.

So this line of code is equivalent to the one we just saw. We say murders piped into ggplot. Note that this code renders a plot, in this case a blank slate, since no geometry has been defined. The only style we see is a gray background. We have created an object and, because it was not assigned, it was automatically evaluated. This is why we saw that gray slab.

But note that we can define an object. For example, like this. Here we’re assigning the graph object to the object p. If we look at the class of p, we see that it’s a ggplot object. To render the plot associated with this object, we simply print the object p. The two following lines of code will do this. We can either write print p or much more efficiently just type p, hit enter, and we’ll see the plot.

Key points

You can associate a dataset x with a ggplot object with any of the 3 commands:
- ggplot(data = x)
- ggplot(x)
- x %>% ggplot()
You can assign a ggplot object to a variable. If the object is not assigned to a variable, it will automatically be displayed.
You can display a ggplot object assigned to a variable by printing that variable.

Code

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dslabs)
data(murders)

ggplot(data = murders)

murders %>% ggplot()

p <- ggplot(data = murders)
class(p)

## [1] "gg"     "ggplot"

print(p)    # this is equivalent to simply typing p
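
The class output shown above comes from ggplot2 3.2.1, and the exact class vector may differ in other versions of the package. As a hedged alternative, this sketch checks the class with inherits() instead of comparing against a fixed vector.

```r
library(ggplot2)
library(dslabs)
data(murders)

# assigning the object suppresses the automatic display
p <- ggplot(data = murders)

# TRUE if p is a ggplot object, regardless of what other classes it carries
is_plot <- inherits(p, "ggplot")
is_plot
```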

Layers

Transcript

In ggplot, we create graphs by adding layers. We add them component by component. Layers can define geometries, compute summary statistics, define what scales to use, and even change styles.

To add layers, we use the symbol plus. In general, a line of code in ggplot will look like this. We’ll have data, we pipe it into the ggplot command, and then we add layers. Here’s one, two, three, up to Layer N.

Usually the first added layer defines the geometry. We want to make a scatterplot. So what geometry do we use? Taking a quick look at the documentation, we see that the function used to create plots with this geometry is geom underscore point. We will see that geometry function names follow this pattern. Geom, underscore, and then a name that reminds you of the geometry you’re about to add.
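
To see the naming pattern for yourself, a small sketch like the following lists the functions in ggplot2 whose names start with geom_. It assumes ggplot2 is attached with library(ggplot2); the exact list depends on your installed version.

```r
library(ggplot2)

# all objects exported by ggplot2 whose names start with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
head(geoms, 20)

# the geometries mentioned in this section follow the same pattern
c("geom_point", "geom_histogram", "geom_density", "geom_qq") %in% geoms
```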

For geom point to know what to do, we need to provide data and a mapping. We have already connected an object with the murders data table. And if we add, as a layer, geom point, we will default to using this data. To find out what mappings are expected, we read the aesthetics section of the help file. Here we see that as expected, we have two required arguments, and they are x and y. This will be the x and y of the plot.

Aes will be one of the functions that you will most use. The function connects data with what we see on the graph. We refer to this connection as the aesthetic mappings. There’s where the name comes from, aes. The outcome of this function is often used as the argument of a geometry function.

The following example produces a scatterplot of total murders versus population in millions. We pipe the murders data set into ggplot, and then we add a layer with a geom point function. We can see that we’re assigning population to x and total to y. Note that we can drop the x and the y if we wanted to, as these are the first and second expected arguments, as we saw in the help page.

Also note that in ggplot we can add layers to previously defined objects. For example, if we define p with a ggplot command, we can add a layer to p by simply adding to p. This will render the object, because we are not assigning it to a new object.

Note also that the scales and labels are defined by default when adding this layer.

Finally, notice that we use the variable names from the data component, population and total, to label the axes. Note that aes recognizes the variable names from the data object. Keep in mind that this behavior is quite specific to aes. With most functions, if you try to access the values of population or total, for example, outside of aes, you will receive an error.

So, we have created a scatterplot. A second layer in the plot we wish to make involves adding a label to each point. This will help us identify which point goes with which state. The geom label and geom text functions permit us to add text to the plot. Geom label adds a label with a little rectangle, and geom text simply adds the text.

Because each state, each point, has a label, we need an aesthetic mapping to make this connection. By reading the help file, we learn that we supply the mapping between point and label through the label argument of aes. So the code looks like this. We’ve already defined p. We add the points with geom point, and now we’re going to add the text with geom text. The aesthetic mapping is the same, but now we add the label, and we’re going to add the state abbreviations, which is stored in the abb variable. Now the plot looks like this.

You can see we have points and we have text, the labels for each state. Let’s pause the creation of our graph for a second to give an example of the behavior of aes previously mentioned. Note that this call is fine. No error is produced. But if we move the label outside of aes, we get an error. This is because abb is not found outside of aes. It’s not a globally defined variable. Geom text does not know where to find it.

Key points

In ggplot2, graphs are created by adding layers to the ggplot object:

DATA %>% ggplot() + LAYER_1 + LAYER_2 + … + LAYER_N

The geometry layer defines the plot type and takes the format geom_X where X is the plot type.
Aesthetic mappings describe how properties of the data connect with features of the graph (axis position, color, size, etc.) Define aesthetic mappings with the aes() function.
aes() uses variable names from the data component (for example, total rather than murders$total).
geom_point() creates a scatterplot and requires x and y aesthetic mappings.
geom_text() and geom_label() add text to a scatterplot and require x, y, and label aesthetic mappings.
To determine which aesthetic mappings are required for a geometry, read the help file for that geometry.
You can add layers with different aesthetic mappings to the same graph.
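
Besides reading the help file, the required aesthetics can also be looked up programmatically. This is a sketch that relies on the required_aes field of the geometry objects ggplot2 exports (GeomPoint, GeomText, GeomLabel); treat it as a convenience and keep the help file as the authoritative source.

```r
library(ggplot2)

# required aesthetics for the geometry behind geom_point()
GeomPoint$required_aes

# required aesthetics for the geometries behind geom_text() and geom_label()
GeomText$required_aes
GeomLabel$required_aes
```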

Code: Adding layers to a plot

library(tidyverse)
library(dslabs)
data(murders)

murders %>% ggplot() +
    geom_point(aes(x = population/10^6, y = total))

# add points layer to predefined ggplot object
p <- ggplot(data = murders)
p + geom_point(aes(population/10^6, total))

# add text layer to scatterplot
p + geom_point(aes(population/10^6, total)) +
    geom_text(aes(population/10^6, total, label = abb))

Code: Example of aes behavior

# no error from this call
# p_test <- p + geom_text(aes(population/10^6, total, label = abb))

# error - "abb" is not a globally defined variable and cannot be found outside of aes
# p_test <- p + geom_text(aes(population/10^6, total), label = abb)
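
The two calls above are commented out so that the script does not stop at the error. One way to run both safely is to wrap the failing call in tryCatch() and keep the error message. This is a sketch; it assumes there is no object named abb in your workspace.

```r
library(ggplot2)
library(dslabs)
data(murders)

p <- ggplot(data = murders)

# label inside aes(): abb is looked up in the murders data, so this works
p_ok <- p + geom_text(aes(population/10^6, total, label = abb))

# label outside aes(): abb is looked up as an ordinary R object and is not found
msg <- tryCatch(
    {
        p + geom_text(aes(population/10^6, total), label = abb)
        "no error"
    },
    error = function(e) conditionMessage(e)
)
msg
```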

Tinkering

Transcript

If we read the documentation, we know that each geometry function has many arguments other than aes and data. They tend to be specific to the function. For example, in the plot we wish to make, the points are larger than the default ones. In the help file, we see that size is an aesthetic and we can change it using this code. We simply put size equals 3 outside the call to aes. It produces this plot. We note that all the points are larger. Note the size is not a mapping. It affects all the points the same. It was outside of aes.

Unfortunately, now that the points are larger, we can’t read the labels anymore. If we read the help file for geom text however, we see that there’s an argument nudge underscore x, which lets us move the label just a little bit. So we can add that argument to the call to geom text. Notice that now we’re adding an argument nudge x equals 1. This moves all the labels slightly to the right. And now, we can actually read them again. We can actually make the code we’ve just shown more efficient.

Note that in the previous lines of code, we have been mapping population and total to the points twice, once for each geometry. We can avoid this by adding what is called a global aesthetic mapping.

We can do this when we define the blank slate using the ggplot function. Remember that the function ggplot contains an argument that permits us to define the aesthetic mappings. We can see it by typing args ggplot. If we define a mapping in ggplot, then all the geometries that are added as layers will default to this mapping.
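
The check mentioned above can be sketched as follows. The second line is an addition that extracts just the argument names, which makes it easy to confirm that mapping is one of them.

```r
library(ggplot2)

# print the arguments of ggplot()
args(ggplot)

# the same information as a character vector of argument names
arg_names <- names(formals(ggplot))
arg_names
```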

So we’re going to redefine p, this time defining a mapping inside the ggplot function. And this simplifies the code greatly. When we add the layers, we no longer have to define the mappings anymore.

So the code simplifies to this. Note that it produces the same plot. Also note that we kept size and nudge x in the geom point and geom text functions respectively. This is because those arguments are specific to those two geometries. Also note that the geom point function doesn’t need the label argument, which was defined globally, so it simply ignores it.

Now, if we need to override the global mappings, we can do this. The local mappings override the global ones. Here’s an example. If we type this code and we redefine a mapping inside the call to geom text, then in the plot that is produced the state labels are no longer there. We see only the label assigned by the new local aesthetic mapping. Instead of the state names, we see the phrase, Hello there. This flexibility of being able to redefine mappings in each layer will prove to be very useful later.

Key points

You can modify arguments to geometry functions other than aes() and the data. Additional arguments can be found in the documentation for each geometry.
These arguments are not aesthetic mappings: they affect all data points the same way.
Global aesthetic mappings apply to all geometries and can be defined when you initially call ggplot(). All the geometries added as layers will default to this mapping. Local aesthetic mappings add additional information or override the default mappings.
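
To make the difference between a fixed argument and an aesthetic mapping concrete, here is a minimal sketch that sets size both ways. Mapping size to total is only an illustration and is not part of the plot built in this section.

```r
library(ggplot2)
library(dslabs)
data(murders)

p <- ggplot(murders, aes(population/10^6, total))

# size outside aes(): a fixed setting, every point is drawn at size 3
p_fixed <- p + geom_point(size = 3)

# size inside aes(): a mapping, point size now varies with the data
p_mapped <- p + geom_point(aes(size = total))

p_fixed
p_mapped
```

In the second plot ggplot2 also adds a legend for size, because size now carries information from the data.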

Code

# change the size of the points
p + geom_point(aes(population/10^6, total), size = 3) +
    geom_text(aes(population/10^6, total, label = abb))

# move text labels slightly to the right
p + geom_point(aes(population/10^6, total), size = 3) +
    geom_text(aes(population/10^6, total, label = abb), nudge_x = 1)

# simplify code by adding global aesthetic
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) +
    geom_text(nudge_x = 1.5)

# local aesthetics override global aesthetics
p + geom_point(size = 3) +
    geom_text(aes(x = 10, y = 800, label = "Hello there!"))

Scales, Labels, and Colors

Transcript

In this video, we will show you how to adjust the scales, labels, add some color, and add a line.

First, our desired scales are in the log scale. This is not the default, so this change needs to be added through a scales layer. A quick look at the documentation reveals the scale_x_continuous function. We can use this to edit the behavior of scales. We simply tell ggplot to use the log10 transformation for both the x and y-axis. This produces this plot. We can see that the scales are now in the log scale. Note that because we’re in the log scale now, the nudge must be made smaller. The log transformation is so common, that ggplot provides specialized functions. So we can make the previous code slightly more efficient by using the scale_x_log10 layer and the scale_y_log10 layer to make the scales be in the log scale.

Now that we have a scatterplot with the appropriate scales, we’re ready to add some labels and titles. We can read the documentation, which reveals that to change labels and add a title, we use the following functions. xlab adds a label to the x-axis. ylab adds one to the y-axis. And ggtitle adds a title. So we simply add these layers to the previous code, and we get the following plot. We’re almost there.

We still have to add color, a legend, and some optional changes to the style. Note that we can change the color of the points using the col argument in the geom point function. To facilitate the exposition of how we do this, we will redefine the object p to be everything except the points layer. Then we’ll add the points layer in different ways to see what effect each has.

So we’re going to use this code to redefine p. Now we can make all the points blue by simply adding a color argument. Here’s an example. We have p and we add the layer geom point, making the size of the points 3, and the color blue. The plot looks like this. But this is not what we want. We want the colors to be associated with their geographical region. A nice default behavior of ggplot lets us do this.

If we assign a categorical variable to the color argument, it automatically assigns a different color to each category. And it also adds a legend. Now because the color of each point will depend on the category and the region from which each state is, we have to use a mapping. To map each point to a color, we need to use aes, since this is a mapping.

So we use the following code. We put the col argument inside an aes, but the size argument, which applies to all points, goes outside. So by typing this code, we get the desired behavior. Here’s the plot. Now remember the x and y mappings are inherited from those already defined when we defined p. This is how geom point already knows what x and y go with each point. So we don’t need to redefine it. Note that we also moved aes to the first argument, since this is where the mappings are expected in this call.

We also see another useful default behavior of ggplot. It has automatically added a legend that maps colors to region. Look at it again in the plot.

We want to add a line that represents the average murder rate for the entire country. Note that once we determine the per million rate to be r for the entire country, this line is defined by the formula y equals rx. Why? Because if a state has a population x and it has the same murder rate as the US, which is r, we simply multiply r by x to get the total number of murders. And remember here y and x are the total murders and the population in millions, respectively. To compute the average rate for the entire country, we can use some of the dplyr skills we’ve learned. We have to add up all the totals, add up all the populations, and then take the ratio of these two. We use the summarize function to do this.

To add a line, we use the geom abline function. ggplot uses ab in the name to remind us that we’re supplying the intercept a and the slope b. The default line for geom abline has slope 1 and intercept 0. Because both axes are on the log scale, the line y equals rx becomes log y equals log r plus log x, which has slope 1. So we only have to define the intercept. So the following code adds the desired line. We have a line with intercept log10 of r. And now the plot looks like this. You can see the line going through the points.
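
The reason the intercept is log10 of r can be checked numerically. The sketch below computes r with base R (rather than dplyr, purely to keep the example short) and confirms that y = r x on the original scale is the same line as slope 1 and intercept log10(r) on the log10 scale.

```r
library(dslabs)
data(murders)

# average murder rate per million people for the whole country
r <- sum(murders$total) / sum(murders$population) * 10^6

# on the original scale the line is y = r * x, with x in millions
x <- murders$population / 10^6
y_line <- r * x

# on the log10 scale the same line has slope 1 and intercept log10(r)
same_line <- all.equal(log10(y_line), log10(r) + 1 * log10(x))
same_line
```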

To recreate the original plot we want to make, we have to change the line type from solid to dashed, change the color from black to grey, and also, we need to draw the line before the points. Because otherwise, the line goes over the points and the state abbreviations. So for this, we’re going to redefine p using some arguments in abline. lty equals 2 changes the line type. And color equals dark grey makes it grey. We also notice that we put abline before the call to geom point. Once we do this, the plot looks like this.

We’re now quite close to our goal of making that plot we showed originally. We have learned that the default behavior of ggplot is quite useful. But often we need to make some minor tweaks to the default behavior. Although it’s not always obvious how to make these tweaks even after reading the documentation, keep in mind that ggplot is very flexible, and there’s almost always a way to achieve what you want.

For example, there’s a small change we need to make for our plot to match our original goal. And it’s to capitalize the word region in the legend. We discover that the function scale_color_discrete lets us do this. So we can change it by simply adding this layer. All right, we’re very close to finishing. The plot looks a lot like our goal now. But for the final touches to make it look exactly the same, we’re going to need some add-on packages, that is, some functionality coming from outside of ggplot.

Key points

Convert the x-axis to log scale with scale_x_continuous(trans = "log10") or scale_x_log10(). Similar functions exist for the y-axis.
Add axis titles with xlab() and ylab() functions. Add a plot title with the ggtitle() function.
Add a color mapping that colors points by a variable by defining the col argument within aes(). To color all points the same way, define col outside of aes().
Add a line with the geom_abline() geometry. geom_abline() takes arguments slope (default = 1) and intercept (default = 0). Change the color with col or color and line type with lty.
Placing the line layer after the point layer will overlay the line on top of the points. To overlay points on the line, place the line layer before the point layer.
There are many additional ways to tweak your graph that can be found in the ggplot2 documentation, cheat sheet, or on the internet. For example, you can change the legend title with scale_color_discrete().

Code: Log-scale the x- and y-axis

# define p
library(tidyverse)
library(dslabs)
data(murders)
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))

# log base 10 scale the x-axis and y-axis
p + geom_point(size = 3) +
    geom_text(nudge_x = 0.05) +
    scale_x_continuous(trans = "log10") +
    scale_y_continuous(trans = "log10")

# efficient log scaling of the axes
p + geom_point(size = 3) +
    geom_text(nudge_x = 0.075) +
    scale_x_log10() +
    scale_y_log10()

Code: Add labels and title

p + geom_point(size = 3) +
    geom_text(nudge_x = 0.075) +
    scale_x_log10() +
    scale_y_log10() +
    xlab("Population in millions (log scale)") +
    ylab("Total number of murders (log scale)") +
    ggtitle("US Gun Murders in 2010")

Code: Change color of the points

# redefine p to be everything except the points layer
p <- murders %>%
    ggplot(aes(population/10^6, total, label = abb)) +
    geom_text(nudge_x = 0.075) +
    scale_x_log10() +
    scale_y_log10() +
    xlab("Population in millions (log scale)") +
    ylab("Total number of murders (log scale)") +
    ggtitle("US Gun Murders in 2010")
    
# make all points blue
p + geom_point(size = 3, color = "blue")

# color points by region
p + geom_point(aes(col = region), size = 3)

Code: Add a line with average murder rate

# define average murder rate
r <- murders %>%
    summarize(rate = sum(total) / sum(population) * 10^6) %>%
    pull(rate)
    
# basic line with average murder rate for the country
p + geom_point(aes(col = region), size = 3) +
    geom_abline(intercept = log10(r))    # slope is default of 1

# change line to dashed and dark grey, line under points
p + 
    geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
    geom_point(aes(col = region), size = 3)

Code: Change legend title

p <- p + scale_color_discrete(name = "Region")    # capitalize legend title

Add-on Packages

Transcript

The power of ggplot2 is further augmented thanks to the availability of add-on packages. The remaining changes required to put the finishing touches on our plot require the ggthemes and ggrepel packages.

Let’s start by changing the themes. The style of a ggplot graph can be changed using the theme function. Several themes are included as part of the ggplot2 package. In fact, for most of the plots in this series, we’re using a theme that we define, and it’s included in the dslabs package, and you can get it by typing ds_theme_set.

Many other themes can be added using the package ggthemes. Among those are the theme_economist that we use to make our original plot. After installing the package, you can change the style by adding a layer. We’ve already saved the plot in the p object, so now all we need to do is load the ggthemes library and then add a layer that is defined by the theme_economist.

And now the plot looks like this, a lot like our original. You can see how some of the other themes look by simply changing the function. For example, you might try the theme fivethirtyeight function to get a theme that looks like the fivethirtyeight web page. It looks like this, and we simply do it by changing that last layer.

The final difference between the plot we have now and our final goal has to do with the positions of the labels. Note that in our plot some of the labels fall on top of each other, making them hard to read. The add-on package ggrepel includes a geometry that adds labels ensuring that they don’t fall on top of each other. So all we need to do is replace the geom_text layer with a geom_text_repel layer after loading the ggrepel package.

So now, let’s put it all together from scratch. We’re going to make that plot that we started with step by step. The first thing to do is load the ggthemes and ggrepel packages. Then we can define the value r, the average murder rate that determines the line, using the summarize function in dplyr. On the log scale it enters the plot as the intercept, log10 of r. And now by defining a ggplot object and adding a series of layers, which we have gone through step by step, we get the final plot as we wanted.

Key points

The style of a ggplot graph can be changed using the theme() function.
The ggthemes package adds additional themes.
The ggrepel package includes a geometry that repels text labels, ensuring they do not overlap with each other: geom_text_repel().

Code: Adding themes

# theme used for graphs in the textbook and course
library(dslabs)
ds_theme_set()

# themes from ggthemes
library(ggthemes)
p + theme_economist()    # style of the Economist magazine

p + theme_fivethirtyeight()    # style of the FiveThirtyEight website

Code: Putting it all together to assemble the plot

# load libraries
library(tidyverse)
library(ggrepel)
library(ggthemes)
library(dslabs)
data(murders)

# define the intercept
r <- murders %>%
    summarize(rate = sum(total) / sum(population) * 10^6) %>%
    .$rate
    
# make the plot, combining all elements
murders %>%
    ggplot(aes(population/10^6, total, label = abb)) +
    geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
    geom_point(aes(col = region), size = 3) +
    geom_text_repel() +
    scale_x_log10() +
    scale_y_log10() +
    xlab("Population in millions (log scale)") +
    ylab("Total number of murders (log scale)") +
    ggtitle("US Gun Murders in 2010") +
    scale_color_discrete(name = "Region") +
    theme_economist()

Other Examples

Transcript

Now that we’ve learned the basics of ggplot we can try to make some of the summary plots we have previously described.

Let’s start with the histogram. Let’s try to make the histogram for the male heights. So in a first step, we need to filter the heights to only include the males. We do this using the filter function.

Now that we have a data set, the next step is deciding what geometry we need. If you guessed geom_histogram you guessed right. That’s the correct geometry for making histograms in ggplot. Looking at the help file for this function, we learn that the only required argument is x, the variable for which we will construct a histogram.

So the code could look something like this. We’re going to do it in two steps. First, we’re going to define a graph object p, that has the data piped into the ggplot function and defines the aesthetic mapping that tells us that heights is what we’re going to make a histogram of. And then we add the geom_histogram layer, which will create the plot.

Here it is. Now when we do this, we get a message saying that no bin width was specified, so one was picked for us. So what we can do now is add the bin width that we want. In our videos we use a bin width of one. We can define this through the arguments. So the code will instead look something like this. We have geom_histogram and we define the bin width to be one. And this makes a plot like the ones we saw in previous videos.

Finally, as an example of the flexibility of ggplot, we’ll change the color just for aesthetic reasons and add a title. We do this using the following code. Note that we’ve defined the fill equals “blue” to make the bars blue, color equals “black” to make the outside of the bars black, and we’ve also added a more informative x label and a title. And the plot looks like this.

Now we can create smooth densities using another geometry. The geometry in this case is called geom_density. We’ve already defined a plot object p, so instead of adding a histogram layer, we add a geom_density layer. And we get the following plot. We can add color by using the fill argument.

For a Q-Q plot we use the geom_qq geometry. From the help file we learn that we need to specify the sample argument. We will learn about samples later. For now, just know that’s where you put the data for which you want a Q-Q plot. So now we have to redefine p because it needs a different argument. Instead of x it’s now sample. So we say sample equals height. And now we’re ready to add the geometry. So we add to p, we add geom_qq and we get a Q-Q plot. It looks like this.

By default, the Q-Q plot is compared to the normal distribution with average zero and standard deviation one. To change this, again, from the help file we learn that we need to use the dparams argument. So now what we do is we define an object params that will have the mean and standard deviation of our data. We use some dplyr functions to do this, and now we add the geometry by assigning this new object that we created to the dparams argument. And now we see that the Q-Q plot is plotted against a normal distribution with the same mean and standard deviation as our data. It looks like this.

We can then add identity lines to see how well the normal approximation works. And in this case, we simply add the layer geom_abline, which adds an identity line. And we see that the points fall roughly on the line. This is because this data is approximately normal.

Another option here is to first scale the data so that we have them in standard units and plot it against the standard normal distribution. This saves us the step of having to compute the mean and standard deviation. The code is a little cleaner and it looks like this.
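
As a sanity check on what scaling does, the following sketch converts the male heights to standard units and confirms that the result has mean 0 and standard deviation 1. It uses base R subsetting instead of dplyr only to keep the example self-contained.

```r
library(dslabs)
data(heights)

male_heights <- heights$height[heights$sex == "Male"]

# scale() subtracts the mean and divides by the standard deviation
z <- scale(male_heights)

# the same computation written out by hand
manual <- (male_heights - mean(male_heights)) / sd(male_heights)
matches_manual <- all.equal(as.numeric(z), manual)

mean(z)    # numerically zero
sd(z)      # one
matches_manual
```

This is why the scaled data can be compared directly with the default standard normal distribution in geom_qq.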

Finally, let’s learn how to make grids of plots. In previous videos, you’ve seen how we put plots next to each other. How do we do that? One way to do that is to use the gridExtra package, which has a function called grid.arrange that lets us show different plot objects next to each other. So to do that, what we do is we first define plots and we assign them to objects. We save them into objects. So in this case, we’re creating three different histograms. And we’re saving them to the objects p1, p2, p3. Then to show them next to each other we use the grid.arrange function like this. And this will produce a plot that has these plots next to each other.

Key points

geom_histogram() creates a histogram. Use the binwidth argument to change the width of bins, the fill argument to change the bar fill color, and the col argument to change bar outline color.
geom_density() creates smooth density plots. Change the fill color of the plot with the fill argument.
geom_qq() creates a quantile-quantile plot. This geometry requires the sample argument. By default, the data are compared to a standard normal distribution with a mean of 0 and standard deviation of 1. This can be changed with the dparams argument, or the sample data can be scaled.
Plots can be arranged adjacent to each other using the grid.arrange() function from the gridExtra package. First, create the plots and save them to objects (p1, p2, …). Then pass the plot objects to grid.arrange().

Code: Histograms in ggplot2

# load heights data
library(tidyverse)
library(dslabs)
data(heights)

# define p
p <- heights %>%
    filter(sex == "Male") %>%
    ggplot(aes(x = height))
    
# basic histograms
p + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram(binwidth = 1)

# histogram with blue fill, black outline, labels and title
p + geom_histogram(binwidth = 1, fill = "blue", col = "black") +
    xlab("Male heights in inches") +
    ggtitle("Histogram")

Code: Smooth density plots in ggplot2

p + geom_density()

p + geom_density(fill = "blue")

Code: Quantile-quantile plots in ggplot2

# basic QQ-plot
p <- heights %>% filter(sex == "Male") %>%
    ggplot(aes(sample = height))
p + geom_qq()

# QQ-plot against a normal distribution with same mean/sd as data
params <- heights %>%
    filter(sex == "Male") %>%
    summarize(mean = mean(height), sd = sd(height))
p + geom_qq(dparams = params) +
    geom_abline()

# QQ-plot of scaled data against the standard normal distribution
heights %>%
    filter(sex == "Male") %>%    # keep only the male heights, as in the plots above
    ggplot(aes(sample = scale(height))) +
    geom_qq() +
    geom_abline()

Code: Grids of plots with the gridExtra package

# define plots p1, p2, p3
p <- heights %>% filter(sex == "Male") %>% ggplot(aes(x = height))
p1 <- p + geom_histogram(binwidth = 1, fill = "blue", col = "black")
p2 <- p + geom_histogram(binwidth = 2, fill = "blue", col = "black")
p3 <- p + geom_histogram(binwidth = 3, fill = "blue", col = "black")

# arrange plots next to each other in 1 row, 3 columns
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(p1, p2, p3, ncol = 3)

Data Science: Visualization - Section 2 - ggplot2

Feb 26, 2020
