Plotting with `R`

Open R Sessions 2025

Lucia Ximena Alva Caballero

Violeta Caballero López

Laura Hildesheim

Simon Jacobsen Ellerstrand

Iain Moodie

Pedro Rosero

Goals for this session

Understand what a dataframe is, and how to interact with it
Understand why we plot data, and know when a certain style of plot is appropriate
Learn how to make the most common styles of plots in R using graphics
Learn to find R helpfiles using ?
Learn how to efficiently search to get the answer you want

`data.frame`

penguins <- read.csv("palmerpenguins.csv", stringsAsFactors = TRUE)

str(penguins)

'data.frame':   344 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

`data.frame`

head(penguins, 10)

   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1   Adelie Torgersen           39.1          18.7               181        3750
2   Adelie Torgersen           39.5          17.4               186        3800
3   Adelie Torgersen           40.3          18.0               195        3250
4   Adelie Torgersen             NA            NA                NA          NA
5   Adelie Torgersen           36.7          19.3               193        3450
6   Adelie Torgersen           39.3          20.6               190        3650
7   Adelie Torgersen           38.9          17.8               181        3625
8   Adelie Torgersen           39.2          19.6               195        4675
9   Adelie Torgersen           34.1          18.1               193        3475
10  Adelie Torgersen           42.0          20.2               190        4250
      sex year
1    male 2007
2  female 2007
3  female 2007
4    <NA> 2007
5  female 2007
6    male 2007
7  female 2007
8    male 2007
9    <NA> 2007
10   <NA> 2007

`data.frame`

class(penguins)

[1] "data.frame"

typeof(penguins)

[1] "list"
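
The two results above fit together: a data frame is stored as a list whose elements are column vectors of equal length. A minimal sketch, assuming a small hand-made data frame (`toy` is an illustrative name; its values are copied from the first rows shown earlier, not read from the CSV):

```r
# Hypothetical three-row data frame; values copied from the first penguin rows
toy <- data.frame(
  species        = factor(c("Adelie", "Adelie", "Adelie")),
  bill_length_mm = c(39.1, 39.5, 40.3),
  body_mass_g    = c(3750L, 3800L, 3250L)
)

is.list(toy)        # TRUE: a data frame is a list under the hood
length(toy)         # 3: one list element per column
names(toy)          # the column names
sapply(toy, class)  # each column keeps its own class
dim(toy)            # number of rows, number of columns

# Three equivalent ways to pull out one column as a vector
identical(toy$bill_length_mm, toy[["bill_length_mm"]])
identical(toy$bill_length_mm, toy[, "bill_length_mm"])
```

This is why `$` works on a data frame the same way it works on a list.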

`data.frame`

penguins$bill_length_mm

  [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
 [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
 [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
 [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
 [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
 [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
 [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
[106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
[121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
[136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
[151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
[166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
[181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
[196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
[211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
[226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
[241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
[256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
[271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
[286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
[301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
[316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
[331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2

class(penguins$bill_length_mm)

[1] "numeric"

`data.frame`

penguins$is_cool <- TRUE

str(penguins)

'data.frame':   344 obs. of  9 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 $ is_cool          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

`data.frame`

penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm

head(penguins)

  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year is_cool bill_ratio
1   male 2007    TRUE   2.090909
2 female 2007    TRUE   2.270115
3 female 2007    TRUE   2.238889
4   <NA> 2007    TRUE         NA
5 female 2007    TRUE   1.901554
6   male 2007    TRUE   1.907767
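
Row 4 shows that a missing value in either input gives a missing ratio. A short sketch of how `NA` propagates through column arithmetic and how rows can be filtered afterwards, assuming a hand-made data frame (`toy` and `complete` are illustrative names; the numbers are copied from the rows above):

```r
# Hypothetical four-row data frame, including the row with missing measurements
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, NA),
  bill_depth_mm  = c(18.7, 17.4, 18.0, NA)
)

# Arithmetic on columns is element-wise; NA in, NA out
toy$bill_ratio <- toy$bill_length_mm / toy$bill_depth_mm
toy$bill_ratio

mean(toy$bill_ratio)                # NA, because one value is missing
mean(toy$bill_ratio, na.rm = TRUE)  # mean of the non-missing values

# Keep only rows where the ratio could be computed
complete <- toy[!is.na(toy$bill_ratio), ]
nrow(complete)
```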

Why do we plot data?

Communicate results
Explore data
Identify outliers
Tables are boring

Principles for plotting

A plot should:
- be clear
- be readable
- serve a purpose

How do we achieve this?

Use the right plot for the data
Appropriate plot size
Descriptive titles and captions
Informative axis labels and axis demarcations
Clear legend (if needed)
Strategic use of colour

Histograms

`hist()`

?hist

Histograms

Description:

     The generic function 'hist' computes a histogram of the given data
     values.  If 'plot = TRUE', the resulting object of class
     '"histogram"' is plotted by 'plot.histogram', before it is
     returned.

Usage:

     hist(x, ...)
     
     ## Default S3 method:
     hist(x, breaks = "Sturges",
          freq = NULL, probability = !freq,
          include.lowest = TRUE, right = TRUE, fuzz = 1e-7,
          density = NULL, angle = 45, col = "lightgray", border = NULL,
          main = paste("Histogram of" , xname),
          xlim = range(breaks), ylim = NULL,
          xlab = xname, ylab,
          axes = TRUE, plot = TRUE, labels = FALSE,
          nclass = NULL, warn.unused = TRUE, ...)
     
Arguments:

       x: a vector of values for which the histogram is desired.

  breaks: one of:

            • a vector giving the breakpoints between histogram cells,

            • a function to compute the vector of breakpoints,

            • a single number giving the number of cells for the
              histogram,

            • a character string naming an algorithm to compute the
              number of cells (see 'Details'),

            • a function to compute the number of cells.

          In the last three cases the number is a suggestion only; as
          the breakpoints will be set to 'pretty' values, the number is
          limited to '1e6' (with a warning if it was larger).  If
          'breaks' is a function, the 'x' vector is supplied to it as
          the only argument (and the number of breaks is only limited
          by the amount of available memory).

    freq: logical; if 'TRUE', the histogram graphic is a representation
          of frequencies, the 'counts' component of the result; if
          'FALSE', probability densities, component 'density', are
          plotted (so that the histogram has a total area of one).
          Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant
          (and 'probability' is not specified).

probability: an _alias_ for '!freq', for S compatibility.

include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'
          value will be included in the first (or last, for 'right =
          FALSE') bar.  This will be ignored (with a warning) unless
          'breaks' is a vector.

   right: logical; if 'TRUE', the histogram cells are right-closed
          (left open) intervals.

    fuzz: non-negative number, for the case when the data is "pretty"
          and some observations 'x[.]' are close but not exactly on a
          'break'.  For counting fuzzy breaks proportional to 'fuzz'
          are used.  The default is occasionally suboptimal.

 density: the density of shading lines, in lines per inch.  The default
          value of 'NULL' means that no shading lines are drawn.
          Non-positive values of 'density' also inhibit the drawing of
          shading lines.

   angle: the slope of shading lines, given as an angle in degrees
          (counter-clockwise).

     col: a colour to be used to fill the bars.

  border: the color of the border around the bars.  The default is to
          use the standard foreground color.

main, xlab, ylab: main title and axis labels: these arguments to
          'title()' get "smart" defaults here, e.g., the default 'ylab'
          is '"Frequency"' iff 'freq' is true.

xlim, ylim: the range of x and y values with sensible defaults.  Note
          that 'xlim' is _not_ used to define the histogram (breaks),
          but only for plotting (when 'plot = TRUE').

    axes: logical.  If 'TRUE' (default), axes are draw if the plot is
          drawn.

    plot: logical.  If 'TRUE' (default), a histogram is plotted,
          otherwise a list of breaks and counts is returned.  In the
          latter case, a warning is used if (typically graphical)
          arguments are specified that only apply to the 'plot = TRUE'
          case.

  labels: logical or character string.  Additionally draw labels on top
          of bars, if not 'FALSE'; see 'plot.histogram'.

  nclass: numeric (integer).  For S(-PLUS) compatibility only, 'nclass'
          is equivalent to 'breaks' for a scalar or character argument.

warn.unused: logical.  If 'plot = FALSE' and 'warn.unused = TRUE', a
          warning will be issued when graphical parameters are passed
          to 'hist.default()'.

     ...: further arguments and graphical parameters passed to
          'plot.histogram' and thence to 'title' and 'axis' (if 'plot =
          TRUE').

Details:

     The definition of _histogram_ differs by source (with
     country-specific biases).  R's default with equispaced breaks
     (also the default) is to plot the counts in the cells defined by
     'breaks'.  Thus the height of a rectangle is proportional to the
     number of points falling into the cell, as is the area _provided_
     the breaks are equally-spaced.

     The default with non-equispaced breaks is to give a plot of area
     one, in which the _area_ of the rectangles is the fraction of the
     data points falling in the cells.

     If 'right = TRUE' (default), the histogram cells are intervals of
     the form (a, b], i.e., they include their right-hand endpoint, but
     not their left one, with the exception of the first cell when
     'include.lowest' is 'TRUE'.

     For 'right = FALSE', the intervals are of the form [a, b), and
     'include.lowest' means '_include highest_'.

     A numerical tolerance of 1e-7 times the median bin size (for more
     than four bins, otherwise the median is substituted) is applied
     when counting entries on the edges of bins.  This is not included
     in the reported 'breaks' nor in the calculation of 'density'.

     The default for 'breaks' is '"Sturges"': see 'nclass.Sturges'.
     Other names for which algorithms are supplied are '"Scott"' and
     '"FD"' / '"Freedman-Diaconis"' (with corresponding functions
     'nclass.scott' and 'nclass.FD').  Case is ignored and partial
     matching is used.  Alternatively, a function can be supplied which
     will compute the intended number of breaks or the actual
     breakpoints as a function of 'x'.

Value:

     an object of class '"histogram"' which is a list with components:

  breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).
          These are the nominal breaks, not with the boundary fuzz.

  counts: n integers; for each cell, the number of 'x[]' inside.

 density: values f^(x[i]), as estimated density values. If
          'all(diff(breaks) == 1)', they are the relative frequencies
          'counts/n' and in general satisfy sum[i; f^(x[i])
          (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.

    mids: the n cell midpoints.

   xname: a character string with the actual 'x' argument name.

equidist: logical, indicating if the distances between 'breaks' are all
          the same.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

     Venables, W. N. and Ripley. B. D. (2002) _Modern Applied
     Statistics with S_.  Springer.

See Also:

     'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.

     Typical plots with vertical bars are _not_ histograms.  Consider
     'barplot' or 'plot(*, type = "h")' for such bar plots.

Examples:

     op <- par(mfrow = c(2, 2))
     hist(islands)
     utils::str(hist(islands, col = "gray", labels = TRUE))
     
     hist(sqrt(islands), breaks = 12, col = "lightblue", border = "pink")
     ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:
     r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),
               col = "blue1")
     text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = "blue3")
     sapply(r[2:3], sum)
     sum(r$density * diff(r$breaks)) # == 1
     lines(r, lty = 3, border = "purple") # -> lines.histogram(*)
     par(op)
     
     require(utils) # for str
     str(hist(islands, breaks = 12, plot =  FALSE)) #-> 10 (~= 12) breaks
     str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))
     
     hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,
          main = "WRONG histogram") # and warning
     
     ## Extreme outliers; the "FD" rule would take very large number of 'breaks':
     XXL <- c(1:9, c(-1,1)*1e300)
     hh <- hist(XXL, "FD") # did not work in R <= 3.4.1; now gives warning
     ## pretty() determines how many counts are used (platform dependently!):
     length(hh$breaks) ## typically 1 million -- though 1e6 was "a suggestion only"
     
     ## R >= 4.2.0: no "*.5" labels on y-axis:
     hist(c(2,3,3,5,5,6,6,6,7))
     
     require(stats)
     set.seed(14)
     x <- rchisq(100, df = 4)
     
     ## Histogram with custom x-axis:
     hist(x, xaxt = "n")
     axis(1, at = 0:17)
     
     
     ## Comparing data with a model distribution should be done with qqplot()!
     qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)
     
     ## if you really insist on using hist() ... :
     hist(x, freq = FALSE, ylim = c(0, 0.2))
     curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)

`hist()`

hist(x = penguins$bill_length_mm)

`hist()`

hist(penguins$bill_depth_mm)
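
Several of the arguments listed in the help file can be combined in one call. A minimal sketch, assuming a short vector of bill lengths (`bill_length` holds the first 15 values printed earlier; the colour, title and label are arbitrary choices):

```r
# First 15 bill lengths from the output above, including one NA
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0,
                 37.8, 37.8, 41.1, 38.6, 34.6)

h <- hist(
  bill_length,
  breaks = 5,                  # a suggestion only; R picks 'pretty' breakpoints
  col    = "lightblue",
  main   = "Bill length",
  xlab   = "Bill length (mm)"
)

# hist() also returns the numbers behind the plot
h$breaks       # cell boundaries actually used
h$counts       # observations per cell
sum(h$counts)  # missing values are not counted
```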

`hist()`

hist(penguins$species)

Error in hist.default(penguins$species): 'x' must be numeric

str(penguins$species)

 Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...

Barplots

`barplot()`

barplot(penguins$species)

Error in barplot.default(penguins$species): 'height' must be a vector or a matrix

?barplot

Bar Plots

Description:

     Creates a bar plot with vertical or horizontal bars.

Usage:

     barplot(height, ...)
     
     ## Default S3 method:
     barplot(height, width = 1, space = NULL,
             names.arg = NULL, legend.text = NULL, beside = FALSE,
             horiz = FALSE, density = NULL, angle = 45,
             col = NULL, border = par("fg"),
             main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
             xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
             axes = TRUE, axisnames = TRUE,
             cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
             inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
             add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)
     
     ## S3 method for class 'formula'
     barplot(formula, data, subset, na.action,
             horiz = FALSE, xlab = NULL, ylab = NULL, ...)
     
Arguments:

  height: either a vector or matrix of values describing the bars which
          make up the plot.  If 'height' is a vector, the plot consists
          of a sequence of rectangular bars with heights given by the
          values in the vector.  If 'height' is a matrix and 'beside'
          is 'FALSE' then each bar of the plot corresponds to a column
          of 'height', with the values in the column giving the heights
          of stacked sub-bars making up the bar.  If 'height' is a
          matrix and 'beside' is 'TRUE', then the values in each column
          are juxtaposed rather than stacked.

   width: optional vector of bar widths. Re-cycled to length the number
          of bars drawn.  Specifying a single value will have no
          visible effect unless 'xlim' is specified.

   space: the amount of space (as a fraction of the average bar width)
          left before each bar.  May be given as a single number or one
          number per bar.  If 'height' is a matrix and 'beside' is
          'TRUE', 'space' may be specified by two numbers, where the
          first is the space between bars in the same group, and the
          second the space between the groups.  If not given
          explicitly, it defaults to 'c(0,1)' if 'height' is a matrix
          and 'beside' is 'TRUE', and to 0.2 otherwise.

names.arg: a vector of names to be plotted below each bar or group of
          bars.  If this argument is omitted, then the names are taken
          from the 'names' attribute of 'height' if this is a vector,
          or the column names if it is a matrix.

legend.text: a vector of text used to construct a legend for the plot,
          or a logical indicating whether a legend should be included.
          This is only useful when 'height' is a matrix.  In that case
          given legend labels should correspond to the rows of
          'height'; if 'legend.text' is true, the row names of 'height'
          will be used as labels if they are non-null.

  beside: a logical value.  If 'FALSE', the columns of 'height' are
          portrayed as stacked bars, and if 'TRUE' the columns are
          portrayed as juxtaposed bars.

   horiz: a logical value.  If 'FALSE', the bars are drawn vertically
          with the first bar to the left.  If 'TRUE', the bars are
          drawn horizontally with the first at the bottom.

 density: a vector giving the density of shading lines, in lines per
          inch, for the bars or bar components.  The default value of
          'NULL' means that no shading lines are drawn. Non-positive
          values of 'density' also inhibit the drawing of shading
          lines.

   angle: the slope of shading lines, given as an angle in degrees
          (counter-clockwise), for the bars or bar components.

     col: a vector of colors for the bars or bar components.  By
          default, '"grey"' is used if 'height' is a vector, and a
          gamma-corrected grey palette if 'height' is a matrix; see
          'grey.colors'.

  border: the color to be used for the border of the bars.  Use 'border
          = NA' to omit borders.  If there are shading lines, 'border =
          TRUE' means use the same colour for the border as for the
          shading lines.

main, sub: main title and subtitle for the plot.

    xlab: a label for the x axis.

    ylab: a label for the y axis.

    xlim: limits for the x axis.

    ylim: limits for the y axis.

     xpd: logical. Should bars be allowed to go outside region?

     log: string specifying if axis scales should be logarithmic; see
          'plot.default'.

    axes: logical.  If 'TRUE', a vertical (or horizontal, if 'horiz' is
          true) axis is drawn.

axisnames: logical.  If 'TRUE', and if there are 'names.arg' (see
          above), the other axis is drawn (with 'lty = 0') and labeled.

cex.axis: expansion factor for numeric axis labels (see 'par('cex')').

cex.names: expansion factor for axis names (bar labels).

  inside: logical.  If 'TRUE', the lines which divide adjacent
          (non-stacked!) bars will be drawn.  Only applies when 'space
          = 0' (which it partly is when 'beside = TRUE').

    plot: logical.  If 'FALSE', nothing is plotted.

axis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to
          the axis and tick marks of the categorical (default
          horizontal) axis.  Note that by default the axis is
          suppressed.

  offset: a vector indicating how much the bars should be shifted
          relative to the x axis.

     add: logical specifying if bars should be added to an already
          existing plot; defaults to 'FALSE'.

     ann: logical specifying if the default annotation ('main', 'sub',
          'xlab', 'ylab') should appear on the plot, see 'title'.

args.legend: list of additional arguments to pass to 'legend()'; names
          of the list are used as argument names.  Only used if
          'legend.text' is supplied.

 formula: a formula where the 'y' variables are numeric data to plot
          against the categorical 'x' variables.  The formula can have
          one of three forms:

                y ~ x
                y ~ x1 + x2
                cbind(y1, y2) ~ x
          
          (see the examples).

    data: a data frame (or list) from which the variables in formula
          should be taken.

  subset: an optional vector specifying a subset of observations to be
          used.

na.action: a function which indicates what should happen when the data
          contain 'NA' values.  The default is to ignore missing values
          in the given variables.

     ...: arguments to be passed to/from other methods.  For the
          default method these can include further arguments (such as
          'axes', 'asp' and 'main') and graphical parameters (see
          'par') which are passed to 'plot.window()', 'title()' and
          'axis'.

Value:

     A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',
     giving the coordinates of _all_ the bar midpoints drawn, useful
     for adding to the graph.

     If 'beside' is true, use 'colMeans(mp)' for the midpoints of each
     _group_ of bars, see example.

Author(s):

     R Core, with a contribution by Arni Magnusson.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

     Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.

See Also:

     'plot(..., type = "h")', 'dotchart'; 'hist' for bars of a
     _continuous_ variable.  'mosaicplot()', more sophisticated to
     visualize _several_ categorical variables.

Examples:

     # Formula method
     barplot(GNP ~ Year, data = longley)
     barplot(cbind(Employed, Unemployed) ~ Year, data = longley)
     
     ## 3rd form of formula - 2 categories :
     op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))
     summary(d.Titanic <- as.data.frame(Titanic))
     barplot(Freq ~ Class + Survived, data = d.Titanic,
             subset = Age == "Adult" & Sex == "Male",
             main = "barplot(Freq ~ Class + Survived, *)", ylab = "# {passengers}", legend.text = TRUE)
     # Corresponding table :
     (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age=="Adult"))
     # Alternatively, a mosaic plot :
     mosaicplot(xt[,,"Male"], main = "mosaicplot(Freq ~ Class + Survived, *)", color=TRUE)
     par(op)
     
     
     # Default method
     require(grDevices) # for colours
     tN <- table(Ni <- stats::rpois(100, lambda = 5))
     r <- barplot(tN, col = rainbow(20))
     #- type = "h" plotting *is* 'bar'plot
     lines(r, tN, type = "h", col = "red", lwd = 2)
     
     barplot(tN, space = 1.5, axisnames = FALSE,
             sub = "barplot(..., space= 1.5, axisnames = FALSE)")
     
     barplot(VADeaths, plot = FALSE)
     barplot(VADeaths, plot = FALSE, beside = TRUE)
     
     mp <- barplot(VADeaths) # default
     tot <- colMeans(VADeaths)
     text(mp, tot + 3, format(tot), xpd = TRUE, col = "blue")
     barplot(VADeaths, beside = TRUE,
             col = c("lightblue", "mistyrose", "lightcyan",
                     "lavender", "cornsilk"),
             legend.text = rownames(VADeaths), ylim = c(0, 100))
     title(main = "Death Rates in Virginia", font.main = 4)
     
     hh <- t(VADeaths)[, 5:1]
     mybarcol <- "gray20"
     mp <- barplot(hh, beside = TRUE,
             col = c("lightblue", "mistyrose",
                     "lightcyan", "lavender"),
             legend.text = colnames(VADeaths), ylim = c(0,100),
             main = "Death Rates in Virginia", font.main = 4,
             sub = "Faked upper 2*sigma error bars", col.sub = mybarcol,
             cex.names = 1.5)
     segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)
     stopifnot(dim(mp) == dim(hh))  # corresponding matrices
     mtext(side = 1, at = colMeans(mp), line = -2,
           text = paste("Mean", formatC(colMeans(hh))), col = "red")
     
     # Bar shading example
     barplot(VADeaths, angle = 15+10*1:5, density = 20, col = "black",
             legend.text = rownames(VADeaths))
     title(main = list("Death Rates in Virginia", font = 4))
     
     # Border color
     barplot(VADeaths, border = "dark blue") 
     
     # Log scales (not much sense here)
     barplot(tN, col = heat.colors(12), log = "y")
     barplot(tN, col = gray.colors(20), log = "xy")
     
     # Legend location
     barplot(height = cbind(x = c(465, 91) / 465 * 100,
                            y = c(840, 200) / 840 * 100,
                            z = c(37, 17) / 37 * 100),
             beside = FALSE,
             width = c(465, 840, 37),
             col = c(1, 2),
             legend.text = c("A", "B"),
             args.legend = list(x = "topleft"))

`barplot()`

species_count <- table(penguins$species)

species_count


   Adelie Chinstrap    Gentoo 
      152        68       124

`barplot()`

barplot(species_count)
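
The same idea extends to two categorical variables: `table()` with two arguments returns a matrix of counts, which `barplot()` draws as stacked or grouped bars (see `beside` in the help file). A sketch with made-up data (`toy` and its six rows are illustrative, not taken from the penguins file):

```r
# Hypothetical data: six individuals with species and sex
toy <- data.frame(
  species = factor(c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap")),
  sex     = factor(c("male", "female", "female", "male", "female", "male"))
)

# Rows = sex, columns = species
counts <- table(toy$sex, toy$species)
counts

# One group of bars per species, one bar per sex within each group
mp <- barplot(
  counts,
  beside      = TRUE,
  legend.text = TRUE,  # use the row names (sex) as legend labels
  xlab        = "Species",
  ylab        = "Count"
)
```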

Boxplots

The `formula` class

Very commonly used in R, and the default way to specify statistical analyses in most packages
Takes the generic form of y ~ x
If you provide boxplot() with a formula, it will handle the underlying calculations (median, quartiles, etc.) for you
As always, help can be found using ?boxplot
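
A formula is an ordinary R object, and the numbers that `boxplot()` computes from it can be inspected without drawing anything. A minimal sketch, assuming a small made-up data frame (`toy` is illustrative; the Adelie masses are copied from the rows shown earlier, the Gentoo masses are invented):

```r
# Hypothetical data: five body masses per species
toy <- data.frame(
  species     = factor(rep(c("Adelie", "Gentoo"), each = 5)),
  body_mass_g = c(3750, 3800, 3250, 3450, 3650,
                  4500, 5700, 4450, 5200, 5400)
)

f <- body_mass_g ~ species  # read as "body mass explained by species"
class(f)                    # "formula"
all.vars(f)                 # the variables the formula refers to

# plot = FALSE returns the summary statistics instead of drawing
b <- boxplot(f, data = toy, plot = FALSE)
b$names  # one box per level of species
b$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```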

`boxplot()`

boxplot(body_mass_g ~ species, data = penguins)

`boxplot()`

boxplot(body_mass_g ~ species, data = penguins, col = "white")

Adding colour with `col`

col controls the colour of the lines, points or areas
Use ? to figure out what it does for each plot type
col can be a
- number (1-8)
- colour name ("black")
- HEX value ("#ff8301")
- RGB value (rgb(0, 0.8, 1))
col can also accept a vector of colours
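
The four ways of writing a colour all end up as the same kind of value. A short sketch using `col2rgb()` and `rgb()` from grDevices to inspect them (the specific colours are the ones listed above):

```r
palette()           # the colours that the numbers refer to

col2rgb(2)          # colour number 2 in the current palette
col2rgb("black")    # a colour name
col2rgb("#ff8301")  # a HEX value: red 255, green 131, blue 1

rgb(0, 0.8, 1)      # RGB on a 0-1 scale, returned as the HEX string "#00CCFF"

# A vector of colours is recycled across the bars
barplot(c(a = 1, b = 2, c = 3), col = c("#ff8301", "#bf5ccb", "#057076"))
```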

`boxplot()`

colours <- c('#ff8301', '#bf5ccb', '#057076')
boxplot(body_mass_g ~ species, data = penguins, col = colours)

Scatterplots

head(penguins)

  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year is_cool bill_ratio
1   male 2007    TRUE   2.090909
2 female 2007    TRUE   2.270115
3 female 2007    TRUE   2.238889
4   <NA> 2007    TRUE         NA
5 female 2007    TRUE   1.901554
6   male 2007    TRUE   1.907767

`plot()`

A high-level ‘generic’ function
Depending on the class of the object passed to plot(), a different method is called

methods(plot)

 [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
 [4] plot.default        plot.dendrogram*    plot.density*      
 [7] plot.ecdf           plot.factor*        plot.formula*      
[10] plot.function       plot.hclust*        plot.histogram*    
[13] plot.HoltWinters*   plot.isoreg*        plot.lm*           
[16] plot.medpolish*     plot.mlm*           plot.ppr*          
[19] plot.prcomp*        plot.princomp*      plot.profile*      
[22] plot.profile.nls*   plot.raster*        plot.spec*         
[25] plot.stepfun        plot.stl*           plot.table*        
[28] plot.ts             plot.tskernel*      plot.TukeyHSD*     
see '?methods' for accessing help and source code
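
Method dispatch can be seen by passing objects of different classes to the same `plot()` call. A minimal sketch with two made-up vectors (`x` and `f` are illustrative):

```r
x <- c(39.1, 39.5, 40.3, 36.7)                              # numeric vector
f <- factor(c("Adelie", "Gentoo", "Adelie", "Chinstrap"))   # factor

class(x)  # "numeric": no plot.numeric exists, so plot.default is used
class(f)  # "factor":  plot.factor is used

plot(x)   # values against their index
plot(f)   # a bar plot of counts per level

# Look up the method that will be called for a given class
getS3method("plot", "factor")
```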

`plot()` for scatterplots

plot(x = penguins$bill_length_mm, y = penguins$bill_depth_mm)

Adding colour

plot(x = penguins$bill_length_mm, y = penguins$bill_depth_mm, col = penguins$species)

Adding custom colours

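colours <- c('#057076', '#ff8301', '#bf5ccb')
names(colours) <- c('Gentoo', 'Adelie', 'Chinstrap')
# as.character() makes R look colours up by species name;
# indexing with the factor itself would use its integer codes instead
plot(x = penguins$bill_length_mm, y = penguins$bill_depth_mm, col = colours[as.character(penguins$species)])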
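
One detail worth checking when a named vector is indexed with a factor: R uses the factor's integer codes, not its labels (see `?Extract`). A small sketch of the difference, using the colour vector above and a three-element factor (`species` is an illustrative stand-in for `penguins$species`):

```r
colours <- c(Gentoo = "#057076", Adelie = "#ff8301", Chinstrap = "#bf5ccb")
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

levels(species)      # alphabetical: "Adelie" "Chinstrap" "Gentoo"
as.integer(species)  # the underlying codes: 1 2 3

# Indexing by the factor picks positions 1, 2, 3 of `colours`,
# so Adelie receives the colour that was named "Gentoo"
colours[species]

# Converting to character looks each species up by name
colours[as.character(species)]
```

With the character version, the colours in the plot follow the names, so a legend built from `names(colours)` and `colours` stays in agreement with the points.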

Point shapes with `pch`

plot(
  x = penguins$bill_length_mm, 
  y = penguins$bill_depth_mm, 
  col = colours[as.character(penguins$species)],
  pch = 19
  )
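
The available point shapes can be drawn as a reference chart. A minimal sketch, assuming the standard symbol numbers 0 to 25 documented in `?points` (`pch_values` is an illustrative name; the layout choices are arbitrary):

```r
# The standard plotting symbols are numbered 0 to 25
pch_values <- 0:25

# Draw each symbol in a row; bg fills symbols 21 to 25
plot(
  x = pch_values, y = rep(1, length(pch_values)),
  pch = pch_values, cex = 2, bg = "grey",
  ylim = c(0.5, 1.5), axes = FALSE,
  xlab = "pch value", ylab = ""
)

# Write the pch number under each symbol
text(x = pch_values, y = 0.8, labels = pch_values, cex = 0.7)
```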

Point size with `cex`

plot(
  x = penguins$bill_length_mm, 
  y = penguins$bill_depth_mm, 
  col = colours[as.character(penguins$species)],
  pch = 19,
  cex = 3
  )

Point size with `cex`

plot(
  x = penguins$bill_length_mm, 
  y = penguins$bill_depth_mm, 
  col = colours[as.character(penguins$species)],
  pch = 19,
  cex = 0.5
  )

Adding titles and labels

plot(
  x = penguins$bill_length_mm, 
  y = penguins$bill_depth_mm, 
  col = colours[as.character(penguins$species)],
  pch = 19,
  xlab = "Bill length (mm)",
  ylab = "Bill depth (mm)",
  main = "Penguin bill dimensions"
  )

Adding a legend

plot(
  x = penguins$bill_length_mm, 
  y = penguins$bill_depth_mm, 
  col = colours[as.character(penguins$species)],
  pch = 19,
  xlab = "Bill length (mm)",
  ylab = "Bill depth (mm)",
  main = "Penguin bill dimensions"
  )

legend("topright", legend = c("Gentoo", "Adelie", "Chinstrap"), col = colours, pch = 19)

Adding a legend

legend(
  "topright", 
  legend = c("Gentoo", "Adelie", "Chinstrap"), 
  col = colours, 
  pch = 19
  )

The location of the legend can be controlled by
- providing x and y
- providing a character string that gives the location like "topright" etc
It’s up to you to make sure your legend matches your figure!
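
One way to keep legend and figure in agreement is to build both from the same named vector. A minimal sketch that also places the legend with x and y coordinates, assuming three made-up points (the measurements are invented for illustration; only the colour vector comes from the slides):

```r
colours <- c(Gentoo = "#057076", Adelie = "#ff8301", Chinstrap = "#bf5ccb")

# Hypothetical data: one point per species
species     <- c("Adelie", "Gentoo", "Chinstrap")
bill_length <- c(39.1, 46.1, 46.5)
bill_depth  <- c(18.7, 13.2, 17.9)

plot(
  x = bill_length, y = bill_depth,
  col = colours[species],  # character index: lookup by name
  pch = 19,
  xlab = "Bill length (mm)", ylab = "Bill depth (mm)"
)

# x and y give the top-left corner of the legend box in data coordinates;
# labels and colours both come from `colours`, so they cannot drift apart
legend(
  x = 39, y = 15,
  legend = names(colours),
  col = colours,
  pch = 19,
  title = "Species"
)
```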

A few last things

There are two main ways used to produce graphics in R (in 2025)

graphics
- included with ‘base’ R
ggplot2
- a package by Hadley Wickham

Why use `ggplot2`?

Allows you to compose graphs by combining individual components

ggplot2 is built on the Grammar of Graphics, meaning you build plots layer by layer:

Data → Aesthetics → Geoms → Scales → Themes
‘Themes’ make it much easier to produce publication-quality graphics.

ggplot2 provides visually appealing defaults for color, spacing, legends, and axis labels.

Why use `ggplot2`?

Easier Complex Visuals

Multi-panel (facet) layouts, grouped color/shape aesthetics, and multi-layered elements (e.g., adding regression lines, custom annotations) are straightforward in ggplot2
Integration with Data Frames & tidyverse

ggplot2 works seamlessly with dplyr/tidyr pipelines. We’ll talk more about packages in the Functions session!

Why start with base graphics?

For many simple plots, graphics is quick and simple
Understanding graphics helps you better understand how R works
graphics is extremely flexible (anything you can think of, you can make)
graphics is used in many packages you will encounter, some of which do not have easy ggplot2 equivalents

`ggplot2`

penguins_sex <- subset(penguins, !is.na(sex))

library(ggplot2)
ggplot(data = penguins_sex, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_boxplot()

`graphics`

boxplot(
  body_mass_g ~ sex + species, 
  data = penguins_sex, 
  col = rep(c("#F17770", "#3ABFC3"), times = 3)
  )

The exercises
