
This is the primary documentation for ggplot2, a package in the tidyverse:

https://ggplot2.tidyverse.org/index.html


This cheat sheet covers basic ggplot2 syntax: https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf

This website is a good walkthrough: https://www.datanovia.com/en/blog/ggplot-examples-best-reference/


I like this website for coloring: http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually


Plus these online tools for specific colors:

https://www.appypie.com/design/image-color-picker/

https://www.google.com/search?q=rgb+to+hex

https://cssgradient.io/


Packages used in this walkthrough:

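# library() stops with an error if a package is missing; require() only warns
library(cowplot)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(reshape2)
library(stringr)
library(tidyr)
library(tidyverse) # Also attaches dplyr, ggplot2, stringr, and tidyr
library(viridis)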


Let’s start by making some play data:

type1_df <- data.frame(matrix(NA, nrow = 5, ncol = 4)) # empty dataframe for data type 1
colnames(type1_df) <- c("Type", "var1", "var2", "var3") # Columns for data
type1_df$Type <- "Type1" 
type1_df$var1 <- rnorm(n = 5, mean = 100, sd = 10) # Normal distribution for made up variables
type1_df$var2 <- rnorm(n = 5, mean = 20, sd = 12)
type1_df$var3 <- rnorm(n = 5, mean = 56, sd = 2)

type2_df <- data.frame(matrix(NA, nrow = 5, ncol = 4))
colnames(type2_df) <- c("Type", "var1", "var2", "var3")
type2_df$Type <- "Type2"
type2_df$var1 <- rnorm(n = 5, mean = 120, sd = 20)
type2_df$var2 <- rnorm(n = 5, mean = 20, sd = 5)
type2_df$var3 <- rnorm(n = 5, mean = 75, sd = 10)

type3_df <- data.frame(matrix(NA, nrow = 5, ncol = 4))
colnames(type3_df) <- c("Type", "var1", "var2", "var3")
type3_df$Type <- "Type3"
type3_df$var1 <- rnorm(n = 5, mean = 140, sd = 10)
type3_df$var2 <- rnorm(n = 5, mean = 36, sd = 9)
type3_df$var3 <- rnorm(n = 5, mean = 50, sd = 20)


play_df <- dplyr::bind_rows(type1_df, type2_df, type3_df) # Combining the 3 data types

play_df$Sample <- paste0("Sample", sprintf("%02d", seq_len(nrow(play_df)))) # Adding zero-padded sample names

play_df <- play_df[, c("Sample", "Type", "var1", "var2", "var3")] # Rearranging the columns

head(play_df)

Our play data is 15 samples from three different types, with data for 3 variables.
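
Because rnorm draws new random numbers on every run, the play data (and every plot made from it) will differ between sessions. One way to make it repeatable is to fix the random seed before generating the data. A minimal sketch, where the seed value 42 is an arbitrary choice for illustration and not part of the original walkthrough:

```r
# Fix the random number generator so rnorm() returns the same draws each run
set.seed(42) # arbitrary seed, chosen only for illustration
draw_a <- rnorm(n = 5, mean = 100, sd = 10)

set.seed(42) # resetting the seed reproduces the same values
draw_b <- rnorm(n = 5, mean = 100, sd = 10)

identical(draw_a, draw_b)
```

Placing a single set.seed() call before the first rnorm() line would be enough to make the whole play dataset reproducible.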

Let’s take a look at it with a simple graph.


Making a scatter plot:

ggplot(play_df, aes(x = var1, y = var2)) + 
  geom_point()

Let’s dissect that syntax a bit:

ggplot is the main function of the package; it sets up the plot
play_df is the dataframe we’re using. This object usually needs to be a data.frame (check this with str() if you’re unsure, and/or convert it with as.data.frame()), although some graph types may need something different, like a matrix
aes specifies the aesthetic mappings, including the x and y axes
The + sign adds the next layer or setting to the plot
geom_point is the function that makes it a scatter plot

Note that you can run the plot command with the cursor anywhere inside it; it doesn’t have to be at the beginning. If you’ve used the plus signs correctly, it will run in one pass. ggplot graph elements are built on top of each other sequentially, so the order matters. Some elements are additive (such as geom_point, where every layer is drawn), while for others only the last one is used.
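
To make the ordering rule concrete, here is a small sketch using a stand-in data frame (demo_df and its values are hypothetical, not the play data). It stacks two point layers, which are both kept, and sets the x label twice, where only the last one survives:

```r
library(ggplot2)

# Small stand-in data frame (hypothetical values, not the play data)
demo_df <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))

layered_plot <- ggplot(demo_df, aes(x = x, y = y)) +
  geom_point(size = 4, color = "black") + # drawn first (underneath)
  geom_point(size = 2, color = "red") +   # drawn second (on top): both layers are kept
  xlab("first label") +
  xlab("second label")                    # only this last x label is used

length(layered_plot$layers) # number of geom layers stored in the plot
layered_plot$labels$x       # the x-axis label that will be drawn
```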


The block above produces a graph but doesn’t save it as an object. Let’s fix that and add some more details:

first_plot <- ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  xlab("variable 1") + 
  ylab("variable 2") + 
  ggtitle("Scatter plot", subtitle = "Colored by discrete variable")
first_plot

Syntax again:

first_plot is the new plot object; you need to call it by name to display it
color in the aes() can be mapped to categorical or numeric data. Be aware that this can be tricky: ggplot2 accepts both the color and colour spellings, and color (points, lines, and outlines) is a separate aesthetic from fill (the interior of bars, boxes, and filled shapes)
xlab and ylab are the labels for each axis
ggtitle sets the title. You can also include subtitle inside the () as a separate argument. If the title is not written first, name it explicitly with label. Write \n to create a new line in your text if needed.

Alternatively, we can color the points with a continuous variable:

ggplot(play_df, aes(x = var1, y = var2, color = var3)) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Colored by continuous variable")


Plot basics:

Now that we have a basic plot saved as an object, we can generate new plots from it without overwriting the original by adding to the plot object with +:

# Setting different axis scales:
first_plot + 
  xlim(0, 200) +
  ylim(0, 50) + 
  ggtitle("Scatter plot", subtitle = "Manual axis ranges")

# Changing the axis intervals:
  # Note that you'll use `continuous` or `discrete` depending on your data
first_plot + 
  scale_x_continuous(breaks = seq(0, 200, by = 10)) + 
  scale_y_continuous(breaks = seq(0, 50, by = 5)) + 
  ggtitle("Scatter plot", subtitle = "Manual axis intervals")

# Sub-setting your data without modifying the dataframe:
  # You can use other logical conditions here too, like `>` and `grepl`
second_plot <- ggplot(subset(play_df, Type %in% "Type1"), aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Subsetting data")
second_plot
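
As a sketch of the other subsetting conditions mentioned above, the block below filters a stand-in data frame (demo_df and its values are hypothetical) with a numeric comparison and with a pattern match. Either result could be passed as the data argument of ggplot() in place of the full dataframe:

```r
# Stand-in data frame (hypothetical values) with columns like the play data
demo_df <- data.frame(
  Type = rep(c("Type1", "Type2", "Type3"), each = 2),
  var1 = c(95, 104, 118, 131, 139, 146),
  var2 = c(12, 25, 19, 22, 30, 41)
)

# Numeric comparison: keep rows where var2 is above 20
high_var2 <- subset(demo_df, var2 > 20)

# Pattern match: grepl() returns TRUE/FALSE per row, which is what subset() expects
type_1_or_3 <- subset(demo_df, grepl("1$|3$", Type))

nrow(high_var2)
nrow(type_1_or_3)
```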

# Changing other `aes` arguments: `shape`, `size`, and `alpha` (the opacity)
ggplot(play_df, aes(x = var1, y = var2, color = Type, shape = Type, size = var3, alpha = var3)) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Data shape/size/opacity")

Manually setting the colors and adding outlines:

Use scale_color_manual or scale_fill_manual
You can use: named colors (e.g. ‘black’), hexadecimal, RGB, etc.
Setting the colour (here to black) outside of the aes() changes the outline

ggplot(play_df, aes(x = var1, y = var2)) + 
  geom_point(aes(fill = Type), colour = "black", size = 2, shape = 21) + 
  scale_fill_manual(values = c("Type1" = "#FF99CC", 
                                "Type2" = "#00FFFF", 
                                "Type3" = "#00FF00")) + 
  ggtitle("Scatter plot", subtitle = "Manual coloring by a discrete variable")

# Or for a manual gradient along a continuous variable:
# Note that `values` are positions on a 0-1 rescaled range (one per colour), not actual data values
ggplot(play_df, aes(x = var1, y = var2)) + 
  geom_point(aes(fill = var3), colour = "black", size = 2, shape = 21) + 
  scale_fill_gradientn(colours = c("blue", "cyan", "green", "yellow", "red"),
                       values = c(0, 0.25, 0.5, 0.75, 1)) + 
  ggtitle("Scatter plot", subtitle = "Manual coloring by a continuous variable")
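
If you would rather think in actual data values than in fractions of the range, the positions can be computed instead of typed by hand. A minimal sketch using rescale from the scales package; the range and the stop values below are made-up numbers for illustration, not taken from the play data:

```r
library(scales)

# Hypothetical range for the colored variable and the data values where
# each color should sit (illustrative numbers, not taken from the play data)
var3_range <- c(40, 80)
color_stops <- c(40, 50, 60, 80)

# rescale() maps data values onto 0-1 positions within the given range
stop_positions <- rescale(color_stops, to = c(0, 1), from = var3_range)
stop_positions
```

The resulting vector could then be passed as the values argument of scale_fill_gradientn, alongside a colours vector of the same length.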


Exporting your files:

Now that we have some plots, we should save them. If you’re using RStudio, you can click the export button in the plot pane, but in my experience this is cumbersome to do for many files and isn’t easily reproducible for size, etc. Plus, since the graphs are dynamic, the text and point sizes will change for different file sizes.

As a safe alternative, we can hard code lines that export the file. I would recommend generating the plots and checking the saved output a couple times until it’s the size/etc you like.

Some syntax in ggsave:

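filename: the name of the output file. Note that the file extension (.png, .jpeg, .pdf, etc.) needs to be included, since it sets the file type
plot: specifies which plot object you’re saving
width, height: the size of the output, in the units given below
units: can be inches ("in"), centimeters ("cm"), etc
dpi: resolution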

# From the plot object:
ggsave(filename = "filename.png", plot = first_plot, width = 14, height = 8, 
       units = "in", dpi = 250)

# Or save whichever plot was displayed most recently (the default for `plot`):
ggsave(filename = "filename.png", plot = last_plot(), width = 14, height = 8, 
       units = "in", dpi = 250)

For saving many plots in a single .pdf file, use the base pdf function.

You’ll need to have created each of the plot objects first, and the file will contain the most recent version of each. Note that you need to include dev.off(), or else the device stays open and keeps adding pages to the file. The file is saved to your working directory by default, but you can include a path in the file argument to change that. Fun tip: building the file name with paste0 lets you save one file per variable if you generate plots iteratively through a file directory, or reuse the name of your raw data file when you import it.

pdf(file = "ggplot_plots.pdf", width = 6, height = 5)
print(first_plot) # print() makes sure the plot is drawn even inside a loop or function
print(second_plot)
invisible(dev.off())
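
The paste0 tip can be sketched as a loop that builds one file name per variable. This is only an illustration: demo_df, the file name pattern, and the temporary output folder are assumptions, not part of the original workflow.

```r
library(ggplot2)

# Stand-in data frame (hypothetical values) and a temporary output folder
demo_df <- data.frame(var1 = c(98, 110, 125),
                      var2 = c(15, 22, 31),
                      var3 = c(55, 60, 72))
out_dir <- tempdir()

saved_files <- character(0)
for (v in c("var1", "var2")) {
  # .data[[v]] looks up the column whose name is stored in v
  p <- ggplot(demo_df, aes(x = .data[[v]], y = var3)) +
    geom_point() +
    xlab(v)
  # paste0() builds a different file name on each pass through the loop
  out_file <- file.path(out_dir, paste0("scatter_", v, "_vs_var3.pdf"))
  ggsave(filename = out_file, plot = p, width = 6, height = 5, units = "in")
  saved_files <- c(saved_files, out_file)
}

basename(saved_files)
```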


Customizing your plot:

Now that we’ve covered the basics, we can add in some additional customization:

# You can manipulate data while plotting, instead of calculating it outside as an independent variable:
ggplot(play_df, aes(x = var1+var3, y = log(var2), color = Type)) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Calculations in ggplot code")


Labeling points:

Note the label argument in the aes() on the first line
You may want to adjust the axis limits so the labels fit
hjust and vjust are the horizontal and vertical adjustments
Other options for keeping labels from overlapping include check_overlap in geom_text, jittering the points with geom_jitter or position_jitter, and the repelling label geoms from the ggrepel package

ggplot(play_df, aes(x = var1, y = var2, color = Type, label = Sample)) + 
  geom_point() + 
  geom_text(hjust = 0, vjust = -2) + 
  ggtitle("Scatter plot", subtitle = "Basic labels")

ggplot(play_df, aes(x = var1, y = var2, color = Type, label = Sample)) + 
  geom_point() + ylim(0, 40) + 
  geom_label_repel(aes(label = Sample, fill = factor(Type)), color = 'white') + 
  ggtitle("Scatter plot", subtitle = "Pretty labels")

# Just like in the data manipulations above, there are many places where you can add new arguments:
ggplot(play_df, aes(x = var1, y = var2, color = Type, label = Sample)) + 
  geom_point() + ylim(0, 40) + 
  geom_label_repel(data = play_df, aes(label = ifelse(var2 > 20, as.character(Sample), ''))) + 
  ggtitle("Scatter plot", subtitle = "Labels based on a logical condition")
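
The check_overlap option is not shown in the examples above, so here is a minimal sketch of it. The stand-in data (demo_df, hypothetical values) places two samples almost on top of each other so that one of the two labels is likely to be skipped:

```r
library(ggplot2)

# Stand-in data frame (hypothetical values); the first two points nearly coincide
demo_df <- data.frame(
  Sample = paste0("Sample", sprintf("%02d", 1:4)),
  var1 = c(100, 100.5, 120, 140),
  var2 = c(20, 20.2, 25, 35)
)

overlap_plot <- ggplot(demo_df, aes(x = var1, y = var2, label = Sample)) +
  geom_point() +
  # check_overlap = TRUE skips any label that would overlap one already drawn
  geom_text(vjust = -1, check_overlap = TRUE)
overlap_plot
```

Unlike ggrepel, this drops labels rather than moving them, so it suits quick looks more than final figures.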


Some aesthetic changes to your plots:

Adding colored blocks to the plot using geom_rect:

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
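  # inherit.aes is an argument of geom_rect(), not an aesthetic, so it goes outside aes()
  # One rectangle is drawn per data row here, which is why alpha is set so low
  geom_rect(aes(xmin = 100, xmax = 120, ymin = 0, ymax = Inf), inherit.aes = FALSE, 
            color = "transparent", fill = "#9c7ba8", alpha = 0.01) + 
  geom_rect(aes(xmin = 115, xmax = 130, ymin = 25, ymax = 50), inherit.aes = FALSE, 
            color = "transparent", fill = "#0ea113", alpha = 0.01) + 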
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Highlighting regions of the graph")

Adding lines using geom_vline (vertical) or geom_hline (horizontal). These can be used to make nice zero lines too:

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  # linewidth replaced size for line thickness in ggplot2 3.4.0
  geom_vline(xintercept = c(12, 90, 120), color = "#e380ff", linewidth = 1) + 
  geom_vline(xintercept = 0, color = 'black', linewidth = 0.4) + 
  geom_hline(yintercept = 0, color = 'black', linewidth = 0.4) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Vertical and horizontal lines")

# Changing font of the plot:
ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  theme(text = element_text(family = "Times New Roman", size = 15)) + 
  ggtitle("Scatter plot", subtitle = "Changing the font")

# Changing the angle of the axis labels
  # This is more helpful for long data names
ggplot(play_df, aes(x = Type, y = var3, color = Type)) + 
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Angled axis labels")

Editing the legend using guides:

This is also how you can change what the legend says with title
You can also change the position, height/width, direction, number of rows/columns
This is nice if you’ve added text labels and want to remove them from the legend

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  guides(color = "none") + 
  ggtitle("Scatter plot", subtitle = "Removing the legend")

# You can use pre-made palettes for coloring: 
  # These are usually outside packages, such as `RColorBrewer` and `viridis`

ggplot(play_df, aes(x = var1, y = var3, color = var1)) + 
  geom_point() + 
  scale_color_viridis(option = "D") + 
  ggtitle("Scatter plot", subtitle = "Using pre-made color palettes")

# Themes are a wide suite of options on how the graph layout looks

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  theme_classic() + 
  ggtitle("Scatter plot", subtitle = "Classic graph design theme")

theme_void and the other theme() settings are nice if you need to add a plot to another figure: everything here is transparent except the scatter points. theme_void() comes first because a complete theme overrides any theme() settings added before it.

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  theme_void() + # Complete theme first, so the theme() tweaks below are not overwritten
  theme(panel.background = element_rect(fill = 'transparent', color = NA), 
    plot.background = element_rect(fill = 'transparent', color = NA), 
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    legend.background = element_rect(fill = 'transparent', color = NA), 
    legend.box.background = element_rect(fill = 'transparent', color = NA)) + 
  guides(color = "none")


Breaking up your data into separate graphs using facet_wrap:

The ~ shows which variables you’re splitting by. facet_wrap(~ Type) makes one panel per value of Type and wraps the panels into rows and columns. In the related facet_grid, the side of the ~ matters: variables on the left define the rows and variables on the right define the columns, so X ~ Y lays the panels out as a grid of two variables.
Set scales to free to let the axes fit the data in each panel, or set it to fixed to keep them constant. You can also free just one axis with free_x or free_y.

ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  facet_wrap(~ Type, scales = "free") + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Breaking up data")
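
For the two-variable case, a sketch with facet_grid may help. The play data only has one grouping column, so the Batch column below is a hypothetical addition to a stand-in data frame:

```r
library(ggplot2)

# Stand-in data with a second grouping column, `Batch` (hypothetical)
demo_df <- data.frame(
  Type  = rep(c("Type1", "Type2"), each = 4),
  Batch = rep(c("A", "B"), times = 4),
  var1  = c(95, 104, 99, 110, 118, 131, 125, 140),
  var2  = c(12, 25, 18, 21, 19, 22, 30, 27)
)

grid_plot <- ggplot(demo_df, aes(x = var1, y = var2, color = Type)) +
  # Rows are split by Batch, columns by Type; only the x axis is freed
  facet_grid(Batch ~ Type, scales = "free_x") +
  geom_point()
grid_plot
```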


More code for manipulating/adding to data:

# Plotting an inset inside of a main graph

big.plot <- ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  ylim(0, 75) + xlim(25, 175) + 
  ggtitle("Main plot: var1 vs var2", subtitle = "Adding an inset plot") + 
  geom_point()

small.plot <- ggplot(play_df, aes(x = var1/var3, y = var2/var3, color = Type)) + 
  geom_point() + 
  theme(legend.position = "none")

plot.with.inset <- ggdraw() +
  draw_plot(big.plot) +
  draw_plot(small.plot, x = .1, y = 0.5, width = .3, height = .3)
plot.with.inset

# Changing the axis scales:
ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point() + 
  coord_trans(x = "log2", y = "log2") + 
  ggtitle("Scatter plot", subtitle = "Changing the axis scales")

# Highlighting samples
ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  geom_point(color = "black", size = 2) + 
  geom_point(data = play_df[c(2:5, 10), ], aes(x = var1, y = var2), colour = "red", size = 3) + 
  ggtitle("Scatter plot", subtitle = "Highlighting specific samples")


Other common plots:

Great, now that we’ve covered the basics for customizing a scatter plot, let’s explore some other common graph types:

Advanced scatterplots

# More scatter plot options, such as ellipses (`stat_ellipse`) and regressions (`geom_smooth`):
ggplot(play_df, aes(x = var1, y = var2, color = Type)) + 
  stat_ellipse(aes(x = var1, y = var2, color = Type), type = "norm") + 
  geom_smooth(method = 'lm', formula = y ~ x) + 
  geom_point() + 
  ggtitle("Scatter plot", subtitle = "Ellipses and regressions")

Barplots

Whether the bars are normal or stacked depends on which variable you plot on the x axis and which one you color by. In ggplot2, pie charts are just stacked barplots drawn on a circular axis.

# Normal bar, plotting sample and coloring by type
ggplot(play_df, aes(x = Sample, y = var2, fill = Type, color = Type)) + 
  geom_bar(stat = "identity") + 
  scale_x_discrete(guide = guide_axis(angle = 45)) + 
  ggtitle("Bar plot", subtitle = "Normal")

# Stacked bar, plotting type and coloring by sample
ggplot(play_df, aes(x = Type, y = var2, fill = Sample, color = Sample)) + 
  geom_bar(stat = "identity") + 
  scale_x_discrete(guide = guide_axis(angle = 45)) + 
  ggtitle("Bar plot", subtitle = "Stacked")

# Pie chart:
ggplot(play_df, aes(x = "", y = var2, fill = Type)) + 
  geom_bar(stat = "identity", width = 1) + 
  coord_polar("y", start = 0) + 
  theme_void() + 
  ggtitle("Pie chart")

Box plot and violin plot

ggplot(play_df, aes(x = Type, y = var2, color = Type)) + 
  geom_boxplot(outlier.shape = NA) + 
  geom_point(aes(group = Type), alpha = 0.75, position = position_jitterdodge()) + 
  ggtitle("Box plot")

ggplot(play_df, aes(x = Type, y = var2, color = Type)) + 
  geom_violin() + 
  geom_point(aes(group = Type), alpha = 0.75, position = position_jitterdodge()) + 
  ggtitle("Violin plot")

Heat map

# Definitely check out the `reshape2` package's `melt` and `dcast` functions
# Keeping Sample and Type as id columns means only the numeric variables are melted
melted.data <- melt(play_df, id.vars = c("Sample", "Type"))

head(melted.data)

ggplot(melted.data, aes(x = variable, y = Sample, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "#00FFFF", high = "#9c7ba8") + 
  ggtitle("Heatmap")
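
Since tidyr is already in the package list, the same wide-to-long reshaping can also be sketched with pivot_longer. This is an alternative to melt, not the method used above, and demo_df is a hypothetical two-row stand-in for the play data:

```r
library(tidyr)

# Stand-in data frame (hypothetical values) with the same columns as the play data
demo_df <- data.frame(
  Sample = c("Sample01", "Sample02"),
  Type = c("Type1", "Type2"),
  var1 = c(101.2, 118.7),
  var2 = c(18.4, 22.9),
  var3 = c(55.1, 71.3)
)

# One row per Sample/variable pair; Sample and Type are carried along unchanged
long_df <- pivot_longer(demo_df, cols = c(var1, var2, var3),
                        names_to = "variable", values_to = "value")
long_df
```

The variable and value column names are chosen here to match what melt produces by default, so the heatmap code would work on either result.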


Ternary diagram

The ternary diagram packages interfere with ggplot2. In my experience, it’s best practice to load them immediately before the code and unload them immediately afterwards:

require(ggtern)
require(ggalt) 
ggtern(data = play_df, aes(var1, var2, var3, color = Type, fill = Type)) + 
  geom_encircle(alpha = 0.3, size = 1) +
  scale_shape_manual(values = 1:8) + 
  geom_point(size = 2.2, color = "black") + 
  geom_point(size = 2) + 
  theme_rgbw() +
  guides(color = "none") + 
  ggtitle("Ternary plot")

unloadNamespace("ggtern")
unloadNamespace("ggalt")


PCA plot

Here’s the code I use to run a PCA plot, with all of the info included for both the stats and graphs:

PCA_df <- play_df
rownames(PCA_df) <- 1:nrow(PCA_df) 
PCA_df[, 3:5] <- sapply(PCA_df[, 3:5], as.numeric)

forPCA <- PCA_df[, (3:5)] # Numeric columns only: var1, var2, var3

# Use all three variables (the original [, -1] silently dropped var1)
myPCA <- prcomp(forPCA, center = TRUE, retx = TRUE)

# Percent of variance explained by each component (variance = sdev squared)
percentage <- round(myPCA$sdev^2 / sum(myPCA$sdev^2) * 100, 2)
percentage <- paste0(" (", percentage, "%)")

PCA_df$Sample <- rownames(PCA_df) # Row index used as the join key

PCA.df <- as.data.frame(myPCA$x, row.names = FALSE)
PCA.df$Sample <- rownames(PCA_df)
PCA_df_total <- dplyr::left_join(PCA.df, PCA_df, by = "Sample")


ggplot(PCA_df_total, aes(x = PC1, y = PC2, fill = Type, group = Type)) + 
  geom_vline(colour = "#000000", xintercept = 0) + 
  geom_hline(colour = "#000000", yintercept = 0) + 
  stat_ellipse(aes(fill = Type), geom = "polygon", type = "t", level = 0.95, alpha = 0.1) + 
  geom_point(colour = "black", shape = 21) + 
  xlab(paste0("PC1", percentage[1])) + ylab(paste0("PC2", percentage[2])) + 
  guides(size = "none", fill = guide_legend(override.aes = list(size = 3))) + 
  ggtitle("Principal Component Analysis")
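
The percentages on the PCA axes are each component's share of the total variance, and a component's variance is its standard deviation squared. A minimal sketch that computes this by hand and cross-checks it against summary(); the data values are hypothetical stand-ins, not the play data:

```r
# Stand-in numeric data (hypothetical values), one row per sample
demo_matrix <- data.frame(
  var1 = c(98, 110, 125, 131, 142, 105),
  var2 = c(15, 22, 31, 18, 36, 20),
  var3 = c(55, 60, 72, 49, 66, 58)
)

demo_pca <- prcomp(demo_matrix, center = TRUE)

# Variance of each component is its standard deviation squared
pct_manual <- demo_pca$sdev^2 / sum(demo_pca$sdev^2) * 100

# summary() reports the same proportions (rounded), as fractions of 1
pct_summary <- summary(demo_pca)$importance["Proportion of Variance", ] * 100

round(pct_manual, 2)
round(unname(pct_summary), 2)
```

If the variables are on very different scales, adding scale. = TRUE to prcomp is a common choice, though whether it is appropriate depends on the data.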


Happy plotting!


Introduction to ggplot2

Chris Mulligan

Nov 17th, 2023
