Introduction

The ggplot2 package

What is a grammar of graphics?

Elements of grammar of graphics

Read CSV Data

library(ggplot2)

# read the survey data; adjust the path to wherever your copy of the file is stored
ncdUG <- read.csv("~/Documents/EASI/Workshop/2022/SSP/ncdUG2014.csv")

# recode the categorical variables as factors with short labels,
# referring to columns as ncdUG$... instead of attach()-ing the data frame
ncdUG$residence <- factor(ncdUG$residence, levels = c("Urban", "Rural"))
ncdUG$hypertension <- factor(ncdUG$hypertension, levels = c("Normal Blood Pressure", "Hypertension or Taking Medication for Hypertension"), labels = c("Normal", "Hypertensive"))
ncdUG$cvd <- factor(ncdUG$cvd, levels = c("NO CVD", "CVD"), labels = c("Normal", "CVD"))
ncdUG$mstatus <- factor(ncdUG$mstatus, levels = c("Single", "Married", "Separated/Divorced", "Widowed"), labels = c("Single", "Married", "Divorced", "Widowed"))
ncdUG$heduc <- factor(ncdUG$heduc, levels = c("none", "primary", "secondary", "university+"), labels = c("None", "Primary", "Secondary", "University"))
# the level strings below must match the raw data exactly, including its spelling
ncdUG$diabetes <- factor(ncdUG$diabetes, levels = c("blood glocuse < 6.1", "blood glocuse >=6.1 AND < 7.1", "blood glocuse >=7.1 or took meds today"), labels = c("Normal", "Prediabetic", "Diabetic"))
ncdUG$smoke <- factor(ncdUG$smoke, levels = c(0, 1), labels = c("None", "Smoker"))
ncdUG$hregion <- factor(ncdUG$hregion, levels = c("Northern", "Eastern", "Central", "Western"))

The ggplot() function and aesthetics

ggplot(ncdUG, aes(x=diastolic, y=systolic)) +
  geom_point()
## Warning: Removed 81 rows containing missing values (geom_point).

Layers and overriding aesthetics

ggplot(ncdUG, aes(x=diastolic, y=systolic)) 

# scatter plot of diastolic vs systolic
# with rug plot coloured by age
ggplot(ncdUG, aes(x=diastolic, y=systolic)) +     # x=diastolic and y=systolic inherited by all layers
  geom_point() +
  geom_rug(aes(color=age))   # color will only apply to the rug plot because not specified in ggplot()
## Warning: Removed 81 rows containing missing values (geom_point).

Aesthetics

Mapping vs setting

Map aesthetics to variables inside the aes() function. By mapping, we mean the aesthetic will vary as the variable varies. For example, mapping x=time causes the position of the plotted data to vary with values of variable “time”. Similarly, mapping color=group causes the color of objects to vary with values of variable “group”.

# mapping color to age inside of aes()
ggplot(ncdUG, aes(x=diastolic, y=systolic)) +
  geom_point(aes(color=age))
## Warning: Removed 81 rows containing missing values (geom_point).

  • Set aesthetics to a constant outside the aes() function.
# setting color to green outside of aes()
ggplot(ncdUG, aes(x=diastolic, y=systolic)) +
  geom_point(color="green")
## Warning: Removed 81 rows containing missing values (geom_point).
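
A common slip is to place a constant inside aes(). The sketch below contrasts the two cases on a small made-up data frame (bp is a hypothetical stand-in for ncdUG, and its values are illustrative only). Inside aes(), the string "green" is treated as a one-level variable, so ggplot2 assigns its own default color and draws a legend.

```r
library(ggplot2)

# made-up stand-in for ncdUG (illustrative values only)
bp <- data.frame(diastolic = c(70, 80, 90, 100),
                 systolic  = c(110, 120, 140, 160))

# setting: a constant color, given outside aes()
p_set <- ggplot(bp, aes(x = diastolic, y = systolic)) +
  geom_point(color = "green")

# mapping by mistake: "green" inside aes() becomes a one-level variable,
# so the points get a default scale color and a legend appears
p_map <- ggplot(bp, aes(x = diastolic, y = systolic)) +
  geom_point(aes(color = "green"))

p_set
p_map
```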

Geoms

  • Geom functions differ in the geometric shapes produced for the plot.
  • Some example geoms include:

    geom_bar(): bars with bases on the x-axis
    geom_boxplot(): boxes-and-whiskers
    geom_errorbar(): T-shaped error bars
    geom_density(): density plots
    geom_histogram(): histogram
    geom_line(): lines
    geom_point(): points (scatterplot)
    geom_ribbon(): bands spanning y-values across a range of x-values
    geom_smooth(): smoothed conditional means (e.g. loess smooth)
    geom_text(): text

Geoms and aesthetics

  • Each geom is defined by aesthetics required for it to be rendered. For example, geom_point() requires both x and y, the minimal specification for a scatterplot.
  • Geoms differ in which aesthetics they accept as arguments. For example, geom_point() accepts the aesthetic shape, which defines the shapes of points on the graph, while geom_bar() does not accept shape.
  • Check the geom function help files for required and understood aesthetics. In the Aesthetics section of the geom’s help file, required aesthetics are bolded.
  • Let’s demonstrate some commonly used geoms.
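
Besides the help files, the aesthetics of a geom can also be inspected from the console. The sketch below queries the exported ggproto objects GeomPoint and GeomBar. These are part of ggplot2's extension interface, so treat the exact output as version dependent.

```r
library(ggplot2)

# aesthetics that must be supplied for geom_point()
GeomPoint$required_aes

# every aesthetic geom_point() understands
GeomPoint$aesthetics()

# shape is understood by points but not by bars
"shape" %in% GeomPoint$aesthetics()
"shape" %in% GeomBar$aesthetics()
```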

Histogram

  • Histograms are popular choices to depict the distribution of a continuous variable.
  • geom_histogram() cuts the continuous variable mapped to x into bins, and counts the number of values within each bin.
  • Create a histogram of age from data set ncdUG.
  • ggplot2 issues a message urging you to pick a number of bins for the histogram (it defaults to 30), using the bins argument.
  • Specify bins=20 inside of geom_histogram(). Note: bins is not an aesthetic, so should not be specified within aes().
ggplot(ncdUG, aes(x=age)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
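
The call above leaves bins at its default, which is why the message appears. A minimal sketch of the bins=20 version follows, using simulated ages (the ages data frame is made up for illustration and stands in for ncdUG):

```r
library(ggplot2)

set.seed(1)
# simulated ages standing in for ncdUG$age
ages <- data.frame(age = round(runif(200, min = 18, max = 70)))

# bins is an argument of geom_histogram(), not an aesthetic,
# so it goes outside aes()
p_hist <- ggplot(ages, aes(x = age)) +
  geom_histogram(bins = 20)

p_hist
```

With bins supplied, the message about picking a better value should no longer be printed.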

Density Plots

  • Density plots are basically smoothed histograms.
  • Density curves for several groups stay readable when overlaid, which is harder to achieve with histograms, so they can be plotted separately by group by mapping a grouping variable to color.
ggplot(ncdUG, aes(x=age)) + 
  geom_density() 

ggplot(ncdUG, aes(x=age, color=mstatus)) + 
  geom_density() 

Boxplots

  • Boxplots compactly visualize particular statistics of a distribution:

    lower and upper hinges of box: first and third quartiles
    middle line: median
    lower and upper whiskers: extend from each hinge to the most extreme data point within 1.5 x IQR of that hinge, where IQR is the interquartile range (distance between hinges)
    dots: outliers (points beyond the whiskers)

  • geom_boxplot() will create boxplots of the variable mapped to y for each group defined by the values of the x variable.

ggplot(ncdUG, aes(x=mstatus, y=age)) + 
  geom_boxplot()
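
To make the 1.5 x IQR rule concrete, here is a small worked example in base R on made-up numbers. It uses fivenum() for the hinges, as base R's boxplot.stats() does. geom_boxplot() computes the quartiles with quantile() instead, so the two can differ slightly on some samples.

```r
# made-up values with one obvious outlier
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 30)

five   <- fivenum(x)        # min, lower hinge, median, upper hinge, max
hinges <- five[c(2, 4)]     # 5 and 9
iqr    <- diff(hinges)      # 4

# fences sit 1.5 x IQR beyond the hinges: -1 and 15
fences <- c(hinges[1] - 1.5 * iqr, hinges[2] + 1.5 * iqr)

# whiskers stop at the most extreme points still inside the fences
inside   <- x[x >= fences[1] & x <= fences[2]]
whiskers <- range(inside)   # 2 and 10
outliers <- x[x < fences[1] | x > fences[2]]   # 30

whiskers
outliers
```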

Bar Plots

  • Bar plots are often used to display frequencies of factor (categorical) variables.
  • geom_bar() by default produces a bar plot where the height of the bar represents counts of each x-value.
  • The color that fills the bars is not controlled by aesthetic color, but instead by fill, which for bar plots is typically mapped to a factor (categorical) variable. We can visualize a crosstabulation of variables by mapping one of them to fill in geom_bar():
ggplot(ncdUG, aes(x=heduc)) + 
  geom_bar() 

ggplot(ncdUG, aes(x=heduc, fill=hregion)) + 
  geom_bar()

Scatter Plot

  • Scatter plots depict the covariation between pairs of variables (typically both continuous).
  • geom_point() depicts covariation between variables mapped to x and y.
  • Scatter plots are among the most flexible graphs, as variables can be mapped to many aesthetics such as color, shape, size, and alpha.
# scatter of diastolic vs systolic
ggplot(ncdUG, aes(x=diastolic, y=systolic)) + 
  geom_point() 
## Warning: Removed 81 rows containing missing values (geom_point).

ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age, alpha=fvservings, size=bmi)) + 
  geom_point()   
## Warning: Removed 298 rows containing missing values (geom_point).

Line graphs

  • Line graphs depict covariation between variables mapped to x and y with lines instead of points.

  • geom_line() will treat all data as belonging to one line unless a variable is mapped to one of the following aesthetics to group the data into separate lines:

    group: lines will look the same
    color: line colors will vary with mapped variable
    linetype: line patterns will vary with mapped variable

ggplot(ncdUG, aes(x=fvservings, y=diastolic, group=diabetes)) + 
  geom_line() 
## Warning: Removed 13 row(s) containing missing values (geom_path).

ggplot(ncdUG, aes(x=fvservings, y=diastolic)) + 
  geom_line() 
## Warning: Removed 12 row(s) containing missing values (geom_path).

ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=diabetes)) + 
  geom_line()   
## Warning: Removed 13 row(s) containing missing values (geom_path).
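
The examples above cover group and color. A sketch of the third option, linetype, is shown below on a made-up data frame (servings is hypothetical and only mimics the column names of ncdUG):

```r
library(ggplot2)

# made-up data: two groups measured at the same x values
servings <- data.frame(
  fvservings = rep(1:4, times = 2),
  diastolic  = c(80, 82, 79, 78, 90, 88, 91, 87),
  diabetes   = rep(c("Normal", "Diabetic"), each = 4)
)

# one line per diabetes group, distinguished by line pattern
p_line <- ggplot(servings, aes(x = fvservings, y = diastolic, linetype = diabetes)) +
  geom_line()

p_line
```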

Statistics

  • The stat functions statistically transform data, usually as some form of summary, such as the mean, or standard deviation, or a confidence interval.
  • Each stat function is associated with a default geom, so no geom is required for shapes to be rendered.
  • stat_summary(), perhaps the most useful of all stat functions, applies a summary function to the variable mapped to y for each value of the x variable. The default summary function is mean_se(), with associated geom geom_pointrange(), which will produce a plot of the mean (dot) and standard error (lines) of the variable mapped to y for each value of the x variable.
  • What makes stat_summary() so powerful is that you can use any function that accepts a vector as the summary function (e.g. mean(), var(), max(), etc.) and the geom can also be changed to adjust the shapes plotted.
# summarize diastolic (y) for each value of fvservings (x)
ggplot(ncdUG, aes(x=fvservings, y=diastolic)) + 
  stat_summary()
## Warning: Removed 91 rows containing non-finite values (stat_summary).
## No summary function supplied, defaulting to `mean_se()`
## Warning: Removed 23 rows containing missing values (geom_segment).
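
As a sketch of swapping in a different summary function and geom, the example below plots the median of y at each x as a single point. It assumes ggplot2 3.3.0 or later, where the argument is called fun (older versions used fun.y), and it runs on made-up data.

```r
library(ggplot2)

# made-up data: three diastolic readings at each of three fvservings values
servings <- data.frame(
  fvservings = rep(1:3, each = 3),
  diastolic  = c(70, 75, 95, 72, 80, 82, 78, 85, 110)
)

# median of diastolic for each fvservings value, drawn as points
p_med <- ggplot(servings, aes(x = fvservings, y = diastolic)) +
  stat_summary(fun = median, geom = "point")

p_med
```

The plotted points sit at 75, 80 and 85, the medians of the three groups.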

Scales

  • Scales define which aesthetic values are mapped to the data values.
  • The scale functions allow the user to control the scales for each aesthetic. These scale functions have names with structure scale_aesthetic_suffix, where aesthetic is the name of an aesthetic like color or shape or x, and suffix is some descriptive word that defines the functionality of the scale.
  • Then, to specify the aesthetic values to be used by the scale, supply a vector of values to the values argument (usually) of the scale function.
  • Some example scales functions:

    scale_color_manual(): define an arbitrary color scale by specifying each color manually
    scale_color_hue(): define an evenly-spaced color scale by specifying a range of hues and the number of colors on the scale
    scale_shape_manual(): define an arbitrary shape scale by specifying each shape manually
# ggplot2 color choice
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point()
## Warning: Removed 91 rows containing missing values (geom_point).

# use scale_colour_manual() to specify which colors we want to use
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue")) 
## Warning: Removed 91 rows containing missing values (geom_point).
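
An unnamed values vector is matched to the factor levels by position. One way to make the pairing explicit is a named vector, sketched below on a made-up data frame (educ is hypothetical). The names must match the factor levels exactly.

```r
library(ggplot2)

# made-up data with one point per education level
educ <- data.frame(
  fvservings = c(1, 2, 3, 4),
  diastolic  = c(80, 85, 78, 90),
  heduc      = factor(c("None", "Primary", "Secondary", "University"))
)

# named values: the order no longer matters, each level gets its own color
p_col <- ggplot(educ, aes(x = fvservings, y = diastolic, color = heduc)) +
  geom_point() +
  scale_color_manual(values = c(University = "blue", None = "red",
                                Secondary = "green", Primary = "orange"))

p_col
```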

Scale functions for the axes

  • Remember that x and y are aesthetics, and the two axes visualize the scale for these aesthetics.
  • Thus, we use scale functions to control the scaling of these axes.
  • When y is mapped to a continuous variable, we will typically use scale_y_continuous() to control its scaling (use scale_y_discrete() if y is mapped to a factor). Similar functions exist for the x aesthetic.
  • A description of some of the important arguments to scale_y_continuous():

    breaks: at what data values along the range of the axis to place tick marks and labels
    labels: what to label the tick marks
    name: what to title the axis
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  scale_color_manual(values=c("red", "orange", "green", "blue")) 
## Warning: Removed 91 rows containing missing values (geom_point).

# put tick marks at all grid lines along the y-axis using the breaks argument of scale_y_continuous  
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  scale_color_manual(values=c("red", "orange", "green", "blue")) + 
  scale_y_continuous(breaks=c(40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200))
## Warning: Removed 91 rows containing missing values (geom_point).

# relabel the tick marks to reflect units of tens using labels
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  scale_color_manual(values=c("red", "orange", "green", "blue")) + 
  scale_y_continuous(breaks=c(40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200),
                     labels=c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
## Warning: Removed 91 rows containing missing values (geom_point).

# retitle the y-axis using the name argument to reflect the units
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  scale_color_manual(values=c("red", "orange", "green", "blue")) + 
  scale_y_continuous(breaks=c(40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200),
                     labels=c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                     name="Diastolic(Tens of Units)")
## Warning: Removed 91 rows containing missing values (geom_point).
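
Typing out long breaks and labels vectors by hand is error-prone. A more compact sketch of the same axis uses seq() for the breaks and a function for the labels, so the labels are always derived from the breaks. The data frame bp below is made up for illustration.

```r
library(ggplot2)

# made-up stand-in for ncdUG
bp <- data.frame(fvservings = c(1, 2, 3, 4),
                 diastolic  = c(62, 85, 110, 145))

y_breaks <- seq(40, 200, by = 10)   # 40, 50, ..., 200

p_axis <- ggplot(bp, aes(x = fvservings, y = diastolic)) +
  geom_point() +
  scale_y_continuous(breaks = y_breaks,
                     labels = function(b) b / 10,   # label each break in tens
                     name   = "Diastolic (Tens of Units)")

p_axis
```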

Modifying axis limits and titles

  • Although we can use scale functions like scale_x_continuous() to control the limits and titles of the x-axis, we can also use the following shortcut functions:

    lims(), xlim(), ylim(): set axis limits
    xlab(), ylab(), ggtitle(), labs(): give labels (titles) to x-axis, y-axis, or graph; labs can set labels for all aesthetics and title

# To set axis limits, supply a vector of 2 numbers (inside c(), for example) to one of the limits functions
ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  xlim(c(0,30)) # show only fvservings values between 0 and 30
## Warning: Removed 91 rows containing missing values (geom_point).

# use labs() to specify titles for the overall graph, the axes, and legends (guides).

ggplot(ncdUG, aes(x=fvservings, y=diastolic, color=heduc)) + 
  geom_point() +
  labs(x="FV Servings", y="Diastolic", color="Education", title="Diastolic vs FV servings by Education")
## Warning: Removed 91 rows containing missing values (geom_point).
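
lims() sets the limits of several aesthetics in one call. The sketch below, on made-up data, restricts both axes at once. Points outside the limits (here the one with fvservings of 40) are dropped from the plot with a warning, just as with xlim().

```r
library(ggplot2)

# made-up stand-in for ncdUG; the last point lies outside the x limits
bp <- data.frame(fvservings = c(1, 5, 12, 40),
                 diastolic  = c(70, 85, 95, 120))

p_lims <- ggplot(bp, aes(x = fvservings, y = diastolic)) +
  geom_point() +
  lims(x = c(0, 30), y = c(60, 130))

p_lims
```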

Guides visualize scales

  • Guides (axes and legends) visualize a scale, displaying data values and their matching aesthetic values. The x-axis, a guide, visualizes the mapping of data values to position along the x-axis. A color scale guide (legend) displays which colors map to which data values.
  • Most guides are displayed by default. The guides() function sets and removes guides for each scale.
# use guides() to remove the color scale legend:
# notice no legend on the right anymore
ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age)) + 
  geom_point() +
  guides(color="none")
## Warning: Removed 81 rows containing missing values (geom_point).

Coordinate systems

Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, ggplot2 provides multiple coordinate systems, including polar, flipped Cartesian and map projections.
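
As a brief illustration of a non-default coordinate system, the sketch below flips a bar plot with coord_flip() so that the bars run horizontally. The data frame educ is made up for the example.

```r
library(ggplot2)

# made-up education levels
educ <- data.frame(
  heduc = c("None", "Primary", "Primary", "Secondary", "Secondary", "Secondary")
)

# same bar plot as usual, drawn on flipped Cartesian coordinates
p_flip <- ggplot(educ, aes(x = heduc)) +
  geom_bar() +
  coord_flip()

p_flip
```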

Faceting (paneling)

  • Split plots into small multiples (panels) with the faceting functions, facet_wrap() and facet_grid(). The resulting graph shows how each plot varies along the faceting variable(s).
  • facet_wrap() wraps a ribbon of plots into a multirow panel of plots. Inside facet_wrap(), specify ~, then a list of splitting variables, separated by +. The number of rows and columns can be specified with arguments nrow and ncol.
  • facet_grid() allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before ~, and the column-splitting variable after. The character . specifies no faceting along that dimension.
ggplot(ncdUG, aes(x=fvservings, y=diastolic)) + 
  geom_point() + 
  facet_wrap(~heduc) # create a ribbon of plots using heduc
## Warning: Removed 91 rows containing missing values (geom_point).
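
facet_grid() is described above but not demonstrated. A minimal sketch on made-up data follows (panel_data is hypothetical and only borrows column names from ncdUG), with residence splitting the rows and heduc splitting the columns:

```r
library(ggplot2)

# made-up data covering every residence x heduc combination once
panel_data <- expand.grid(residence = c("Urban", "Rural"),
                          heduc     = c("None", "Primary", "Secondary"))
panel_data$fvservings <- c(1, 2, 3, 4, 5, 6)
panel_data$diastolic  <- c(80, 85, 78, 90, 82, 88)

# rows split by residence, columns split by heduc
p_grid <- ggplot(panel_data, aes(x = fvservings, y = diastolic)) +
  geom_point() +
  facet_grid(residence ~ heduc)

p_grid
```

Writing facet_grid(. ~ heduc) instead would split along columns only.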

Themes

Themes control elements of the graph not related to the data. For example:

background color
size of fonts
gridlines
color of labels

To modify these, we use the theme() function, which has a large number of arguments called theme elements, which control various non-data elements of the graph.

Some example theme() arguments and what aspect of the graph they control:

axis.line: lines forming x-axis and y-axis
axis.line.x: just the line for x-axis
legend.position: positioning of the legend on the graph
panel.background: the background of the graph
panel.border: the border around the graph
title: all titles on the graph

Specifying theme() Elements

  • Most non-data elements of the graph can be categorized as either a line (e.g. axes, tick marks), a rectangle (e.g. the background), or text (e.g. axes titles, tick labels). Each of these categories has an associated element_ function to specify the parameters controlling its appearance:

    element_line() - can specify color, size, linetype, etc.
    element_rect() - can specify fill, color, size, etc.
    element_text() - can specify family, face, size, color, angle, etc.
    element_blank() - removes theme elements from graph

  • Note: in ggplot2 3.4.0 and later, the thickness of lines and rectangle borders is set with linewidth rather than size; size is still accepted there but gives a deprecation warning.

  • Inside theme() we control the properties of a theme element using the proper element_ function. For example, the x- and y-axes are lines and are both controlled by theme() argument axis.line, so their visual properties, such as color and size (thickness), are specified as arguments to element_line():

# thick black axis lines, specified with element_line()

ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2)) # size in mm
## Warning: Removed 81 rows containing missing values (geom_point).

# the background of the graph, controlled by theme() argument panel.background, is a rectangle, so parameters like fill color and border color can be specified in element_rect().

ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray")) # color is the border color
## Warning: Removed 81 rows containing missing values (geom_point).

# With element_text() we can control properties such as the font family or face ("bold", "italic", "bold.italic") of text elements like title, which controls all titles on the graph (plot, axes, and legend).
ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold")) 
## Warning: Removed 81 rows containing missing values (geom_point).

# some theme() arguments do not use element_ functions to control their properties, like legend.position, which simply accepts values "none", "left", "right", "bottom", and "top".
ggplot(ncdUG, aes(x=diastolic, y=systolic, color=age)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold"),
        legend.position="bottom") 
## Warning: Removed 81 rows containing missing values (geom_point).
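
element_blank() is listed above but not used in the examples. The sketch below, on made-up data, removes the minor grid lines and hides the legend in one theme() call:

```r
library(ggplot2)

# made-up stand-in for ncdUG
bp <- data.frame(diastolic = c(70, 80, 90, 100),
                 systolic  = c(110, 120, 140, 160),
                 age       = c(25, 40, 55, 70))

p_clean <- ggplot(bp, aes(x = diastolic, y = systolic, color = age)) +
  geom_point() +
  theme(panel.grid.minor = element_blank(),   # drop the minor grid lines
        legend.position  = "none")            # hide the color legend

p_clean
```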

Changing the overall look with complete themes

The ggplot2 package provides a few complete themes which make several changes to the overall background look of the graphic (see the ggplot2 documentation on complete themes for a full description). Examples:

theme_bw()
theme_light()
theme_dark()
theme_classic()

The themes usually adjust the color of the background and most of the lines that make up the non-data portion of the graph.

# theme_classic() mimics the look of base R graphics
# (txhousing is an example data set that ships with ggplot2):

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_classic()
## Warning: Removed 568 rows containing missing values (geom_point).

# theme_dark() makes a dramatic change to the look:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_dark()
## Warning: Removed 568 rows containing missing values (geom_point).

Saving plots to files

  • ggsave() makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.

  • ggsave attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf.

  • Other important arguments to ggsave():

    width, height: dimensions of the plot file
    units: units of width and height of plot file (“in”, “cm” or “mm”)
    dpi: plot resolution in dots per inch
    plot: name of object with stored plot

# save last displayed plot as pdf
ggsave("plot.pdf")
## Saving 7 x 5 in image
## Warning: Removed 568 rows containing missing values (geom_point).
# if you're working with lots of graphs, you can store them in R objects
data(Sitka, package="MASS")   # the Sitka data set ships with the MASS package
p <- ggplot(Sitka, aes(x=Time, y=size)) + 
  geom_point()
# You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)
## Saving 7 x 5 in image
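
The size arguments listed above can be combined as in the following sketch, which writes a 6 x 4 inch PDF to R's temporary directory. The data frame bp and the file name are made up for illustration. dpi is left out because it only matters for raster formats such as png.

```r
library(ggplot2)

# made-up stand-in for ncdUG
bp <- data.frame(diastolic = c(70, 80, 90, 100),
                 systolic  = c(110, 120, 140, 160))

p <- ggplot(bp, aes(x = diastolic, y = systolic)) +
  geom_point()

# write to a temporary location so the example leaves no files behind
out_file <- file.path(tempdir(), "bp_scatter.pdf")
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```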