Statistics

Smoothing

The mtcars dataset contains information for 32 cars from Motor Trends magazine from 1974. This dataset is small, intuitive, and contains a variety of continuous and categorical (both nominal and ordinal) variables.

library(ggplot2)
package 㤼㸱ggplot2㤼㸲 was built under R version 3.6.3
library(dplyr )

Attaching package: 㤼㸱dplyr㤼㸲

The following objects are masked from 㤼㸱package:stats㤼㸲:

    filter, lag

The following objects are masked from 㤼㸱package:base㤼㸲:

    intersect, setdiff, setequal, union
# View the structure of mtcars
str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
# Using mtcars, draw a scatter plot of mpg vs. wt
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth()

# Amend the plot. Use lin. reg. smoothing; turn off std err ribbon
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Amend the plot. Swap geom_smooth() for stat_smooth().
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

You can use either stat_smooth() or geom_smooth() to apply a linear model.

Grouping variables

We’ll continue by considering the situation of looking at sub-groups in our dataset. For this we’ll encounter the invisible group aesthetic.

#create factor cyl and am
mtcars <- mtcars %>% 
  mutate(fcyl = as.factor(cyl),
         fam = as.factor(am))
# Using mtcars, plot mpg vs. wt, colored by fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add a point layer
  geom_point() +
  # Add a smooth lin reg stat, no ribbon
  stat_smooth(method = "lm", se = FALSE)

# Amend the plot to add another smooth layer with dummy grouping
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  stat_smooth(mapping = aes(group = 1), method = "lm", se = FALSE)

Notice that the color aesthetic defined an invisible group aesthetic. Defining the group aesthetic for a specific geom means we can overwrite that. Here, we use a dummy variable to calculate the smoothing model for all values.

Modifying stat_smooth

We’ll consider another argument, span, used in LOESS smoothing, and we’ll take a look at a nice scenario of properly mapping different models.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add 3 smooth LOESS stats, varying span & color
  stat_smooth(color = "red", se = FALSE, span = 0.9) +
  stat_smooth(color = "green", se = FALSE, span = 0.6) +
  stat_smooth(color = "blue", se = FALSE, span = 0.3)

# Amend the plot to color by fcyl
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add a smooth LOESS stat, no ribbon
  stat_smooth(se = FALSE) +
  # Add a smooth lin. reg. stat, no ribbon
  stat_smooth(method = "lm", se = FALSE)

# Amend the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  # Map color to dummy variable "All"
  stat_smooth(mapping = aes(color = "All"), se = FALSE) +
  stat_smooth(method = "lm", se = FALSE)

The default span for LOESS is 0.9. A lower span will result in a better fit with more detail; but don’t overdo it or you’ll end up over-fitting!

We’ll take a look at the standard error ribbons, which show the 95% confidence interval of smoothing models.

library(carData)

Attaching package: 㤼㸱carData㤼㸲

The following object is masked _by_ 㤼㸱.GlobalEnv㤼㸲:

    Vocab

Vocab has been given an extra column, year_group, splitting the dates into before and after 1995.

Vocab <- Vocab %>% 
  mutate(year_group = as.factor(ifelse(year<1995, "[1974,1995]", "[1995,2016]") ))
# Using Vocab, plot vocabulary vs. education, colored by year group
ggplot(Vocab, aes(education, vocabulary, color = year_group)) +
  # Add jittered points with transparency 0.25
  geom_jitter(alpha = 0.25) +
  # Add a smooth lin. reg. line (with ribbon)
  stat_smooth(method = "lm", se = FALSE)

It’s easier to read the plot if the standard error ribbons match the lines, and the lines have more emphasis.

# Amend the plot
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +
  # Map the fill color to year_group, set the line size to 2
  stat_smooth(method = "lm",size = 2, mapping= aes(fill = year_group))

Notice that since 1995, education has relatively smaller effect on increasing vocabulary.

Quantiles

we’ll use stat_quantile() to apply a quantile regression.

Linear regression predicts the mean response from the explanatory variables, quantile regression predicts a quantile response (e.g. the median) from the explanatory variables. Specific quantiles can be specified with the quantiles argument.

Specifying many quantiles and color your models according to year can make plots too busy.

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.25) +
  # Add a quantile stat, at 0.05, 0.5, and 0.95
  stat_quantile(quantiles = c(0.05, 0.5, 0.95))

# Amend the plot to color by year_group
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +
  stat_quantile(quantiles = c(0.05, 0.5, 0.95))

Quantile regression is a great tool for getting a more detailed overview of a large dataset.

Using stat_sum

1 solution for overplotting is jittering with transparency. Another solution is to use stat_sum(), which calculates the total number of overlapping observations and maps that onto the size aesthetic.

stat_sum() allows a special variable, ..prop.., to show the proportion of values within the dataset.

# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  # Replace this with a sum stat
  geom_jitter(alpha = 0.25)

# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  # Replace this with a sum stat
  stat_sum()

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum() +
  # Add a size scale, from 1 to 10
  scale_size(range = c(1,10))

# Amend the stat to use proportion sizes
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum(aes(size =..prop..))

# Amend the plot to group by education
ggplot(Vocab, aes(x = education, y = vocabulary, group = education)) +
  stat_sum(aes(size = ..prop..))

If a few data points overlap, jittering is great. When you have lots of overlaps (particularly where continuous data has been rounded), using stat_sum() to count the overlaps is more useful.

Stats outside geom

Preparations

We’ll aim to make the below plot:

stats

stats

We’ll establish our positions and base layer of the plot.

Establishing these items as independent objects will allow us to recycle them easily in many layers, or plots.

  • position_jitter() adds jittering (e.g. for points).
  • position_dodge() dodges geoms, (e.g. bar, col, boxplot, violin, errorbar, pointrange).
  • position_jitterdodge() jitters and dodges geoms, (e.g. points).
# Define position objects
# 1. Jitter with width 0.2
posn_j <- position_jitter(width = 0.2)
# 2. Dodge with width 0.1
posn_d <- position_dodge(width = 0.1)
# 3. Jitter-dodge with jitter.width 0.2 and dodge.width 0.1
posn_jd <- position_jitterdodge(jitter.width = 0.2, dodge.width = 0.1)
# Create the plot base: wt vs. fcyl, colored by fam
p_wt_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, wt, color = fam))
# Add a point layer
p_wt_vs_fcyl_by_fam +
  geom_point()

The default positioning of the points is highly susceptible to overplotting.

Using position objects

Now that the position objects have been created, you can apply them to the base plot to see their effects. You do this by adding a point geom and setting the position argument to the position object.

# Add jittering only
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_j)

# Add dodging only
p_wt_vs_fcyl_by_fam_jit <- p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_d)
p_wt_vs_fcyl_by_fam_jit

# Add jittering and dodging
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_jd)

Although you can set position by setting the position argument to a string (for example position = “dodge”), defining objects promotes consistency between layers.

Plotting variations

The preparation is done; now let’s explore stat_summary().

Summary statistics refers to a combination of location (mean or median) and spread (standard deviation or confidence interval).

These metrics are calculated in stat_summary() by passing a function to the fun.data argument. mean_sdl(), calculates multiples of the standard deviation and mean_cl_normal() calculates the t-corrected 95% CI.

Arguments to the data function are passed to stat_summary()’s fun.args argument as a list.

p_wt_vs_fcyl_by_fam_jit +
  # Change the geom to be an errorbar
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d)

p_wt_vs_fcyl_by_fam_jit +
  # Change the geom to be an errorbar
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d,
  geom  ="errorbar")

p_wt_vs_fcyl_by_fam_jit +
  # Add a summary stat of normal confidence limits
  stat_summary(fun.data = mean_cl_normal, position = posn_d)

You can always assign your own function to the fun.data argument as long as the result is a data frame and the variable names match the aesthetics that you will need for the geom layer.

Coordinates

Zooming In

We’ll compare zooming by changing scales and by changing coordinates.

The big difference is that the scale functions change the underlying dataset, which affects calculations made by computed geoms (like histograms or smooth trend lines), whereas coordinate functions make no changes to the dataset.

# Run the code, view the plot, then update it
ggplot(mtcars, aes(x = wt, y = hp, color = fam)) +
  geom_point() +
  geom_smooth() +
  # Add a continuous x scale from 3 to 6
  scale_x_continuous(limits = c(3, 6))

ggplot(mtcars, aes(x = wt, y = hp, color = fam)) +
  geom_point() +
  geom_smooth() +
  # Add Cartesian coordinates with x limits from 3 to 6
  coord_cartesian(xlim = c(3, 6))

Using the scale function to zoom in meant that there wasn’t enough data to calculate the trend line, and geom_smooth() failed. When coord_cartesian() was applied, the full dataset was used for the trend calculation.

Aspect ratio: 1:1 ratios

We can set the aspect ratio of a plot with coord_fixed(), which uses ratio = 1 as a default. A 1:1 aspect ratio is most appropriate when two continuous variables are on the same scale, as with the iris dataset.

All variables are measured in centimeters, so it only makes sense that one unit on the plot should be the same physical distance on each axis. This gives a more truthful depiction of the relationship between the two variables since the aspect ratio can change the angle of our smoothing line. This would give an erroneous impression of the data. Of course the underlying linear models don’t change, but our perception can be influenced by the angle drawn.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  # Fix the coordinate ratio
  coord_fixed(ratio = 1)

A 1:1 aspect ratio is helpful when your axes show the same scales.

Setting rations

When values are not on the same scale it can be a bit tricky to set an appropriate aspect ratio. A classic William Cleveland (inventor of dot plots) example is the sunspots data set. We have 3200 observations from 1750 to 2016.

sun_plot is a plot without any set aspect ratio. It fills up the graphics device.

library(zoo)
package 㤼㸱zoo㤼㸲 was built under R version 3.6.3
Attaching package: 㤼㸱zoo㤼㸲

The following objects are masked from 㤼㸱package:base㤼㸲:

    as.Date, as.Date.numeric
sunspots.m <- data.frame(
year = index(sunspot.month),
value = reshape2::melt(sunspot.month)$value)
sun_plot <- ggplot(sunspots.m, aes(x = year, y = value)) +
  geom_line() +
  coord_fixed() # default to 1:1 aspect ratio

sun_plot

# Fix the aspect ratio to 1:1
sun_plot +
  coord_fixed(ratio = 1)
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

# Change the aspect ratio to 20:1
sun_plot +
  coord_fixed(ratio = 0.0555)
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Making a wide plot by calling coord_fixed() with a high ratio is often useful for long time series.

Expand and clip

The coord_*() layer functions offer two useful arguments that work well together: expand and clip.

  • expand sets a buffer margin around the plot, so data and axes don’t overlap. Setting expand to 0 draws the axes to the limits of the data.
  • clip decides whether plot elements that would lie outside the plot panel are displayed or ignored (“clipped”).

When done properly this can make a great visual effect! We’ll use theme_classic() and modify the axis lines in this example.

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # Add Cartesian coordinates with zero expansion
  coord_cartesian(expand = 0) +
  theme_classic()

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # Turn clipping off
  coord_cartesian(expand = 0, clip = "off") +
  theme_classic() +
  # Remove axis lines
  theme(axis.line = element_blank())

These arguments make clean and accurate plots by not cutting off data.

Log-transforming scales

Using scale_y_log10() and scale_x_log10() is equivalent to transforming our actual dataset before getting to ggplot2.

Using coord_trans(), setting x = “log10” and/or y = “log10” arguments, transforms the data after statistics have been calculated. The plot will look the same as with using scale_*_log10(), but the scales will be different, meaning that we’ll see the original values on our log10 transformed axes. This can be useful since log scales can be somewhat

# Produce a scatter plot of brainwt vs. bodywt
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  ggtitle("Raw Values")

# Add scale_*_*() functions
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Scale_ functions")

# Perform a log10 coordinate system transformation
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10")

Each transformation method has implications for the plot’s interpretability. Think about your audience when choosing a method for applying transformations.

Adding stats to transformed scales

Remember that statistics are calculated on the untransformed data. A linear model may end up looking not-so-linear after an axis transformation. Let’s revisit the two plots from the previous exercise and compare their linear models.

# Plot with a scale_*_*() function:
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add a log10 x scale
  scale_x_log10() +
  # Add a log10 y scale
  scale_y_log10() +
  ggtitle("Scale functions")

# Plot with transformed coordinates
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add a log10 coordinate transformation for x and y axes
  coord_trans(x = "log10", y = "log10")

The smooth trend line is calculated after scale transformations but not coordinate transformations, so the second plot doesn’t make sense.

Useful double axes

library(lubridate)

Attaching package: 㤼㸱lubridate㤼㸲

The following object is masked from 㤼㸱package:base㤼㸲:

    date
airquality <- airquality %>% 
  mutate(Date = make_date(1974, Month, Day))

Double x and y-axes are a contentious topic in data visualization. Here, we will review a great use case where double axes actually do add value to a plot.

# Using airquality, plot Temp vs. Date
ggplot(airquality, aes(Date, Temp)) +
  # Add a line layer
  geom_line() +
  labs(x = "Date (1973)", y = "Fahrenheit")

# Define breaks (Fahrenheit)
y_breaks <- c(59, 68, 77, 86, 95, 104)

# Convert y_breaks from Fahrenheit to Celsius
y_labels <- (y_breaks - 32)*5/9

# Create a secondary x-axis
secondary_y_axis <- sec_axis(
  # Use identity transformation
  trans = identity,
  name = "Celsius",
  # Define breaks and labels as above
  breaks = y_breaks,
  labels = y_labels
)

# Examine the object
secondary_y_axis
<ggproto object: Class AxisSecondary, gg>
    axis: NULL
    break_info: function
    breaks: 59 68 77 86 95 104
    create_scale: function
    detail: 1000
    empty: function
    guide: waiver
    init: function
    labels: 15 20 25 30 35 40
    make_title: function
    mono_test: function
    name: Celsius
    trans: function
    transform_range: function
    super:  <ggproto object: Class AxisSecondary, gg>
# Update the plot
ggplot(airquality, aes(Date, Temp)) +
  geom_line() +
  # Add the secondary y-axis 
  scale_y_continuous(sec.axis = secondary_y_axis) +
  labs(x = "Date (1973)", y = "Fahrenheit")

Double axes are most useful when you want to display the same value in two differnt units.

### Flipping axes

Flipping axes means to reverse the variables mapped onto the x and y aesthetics. We can just change the mappings in aes(), but we can also use the coord_flip() layer function.

There are two reasons to use this function:

  • We want a vertical geom to be horizontal, or
  • We’ve completed a long series of plotting functions and want to flip it without having to rewrite all our commands.
# Plot fcyl bars, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Place bars side by side
  geom_bar(position = "dodge")

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = "dodge") +
  # Flip the x and y coordinates
  coord_flip()

ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Set a dodge width of 0.5 for partially overlapping bars
  geom_bar(position = position_dodge(width = 0.5)) +
  coord_flip()

Horizontal bars are especially useful when the axis labels are long.

Another example:

mtcars <- read.csv("mtcars.csv", stringsAsFactors=FALSE)
mtcars <- mtcars %>% 
  mutate(fam = as.factor(am), fcyl = as.factor(cyl), car = model, fvs = as.factor(vs))
# Plot of wt vs. car
ggplot(mtcars, aes(car, wt)) +
  # Add a point layer
  geom_point() +
  labs(x = "car", y = "weight")

# Flip the axes to set car to the y axis
ggplot(mtcars, aes(car, wt)) +
  geom_point() +
  labs(x = "car", y = "weight") +
  coord_flip()

Pie charts

The coord_polar() function converts a planar x-y Cartesian plot to polar coordinates. This can be useful if you are producing pie charts.

We can imagine two forms for pie charts - the typical filled circle, or a colored ring.

Typical pie charts omit all of the non-data ink. Pie charts are not really better than stacked bar charts.

# Run the code, view the plot, then update it
ggplot(mtcars, aes(x = 1, fill = fcyl)) +
  geom_bar() +
  # Add a polar coordinate system
  coord_polar(theta = "y")

ggplot(mtcars, aes(x = 1, fill = fcyl)) +
  # Reduce the bar width to 0.1
  geom_bar(width = 0.1) +
  coord_polar(theta = "y") +
  # Add a continuous x scale from 0.5 to 1.5
  scale_x_continuous(limits = c(0.5, 1.5))

Polar coordinates are particularly useful if you are dealing with a cycle, like yearly data, that you would like to see represented as such.

Wind rose plots

Polar coordinate plots are well-suited to scales like compass direction or time of day. A popular example is the “wind rose”.

library(openair)
package 㤼㸱openair㤼㸲 was built under R version 3.6.3

Converting degrees into factor:

library(rvest)
Loading required package: xml2
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v tibble  2.1.3     v purrr   0.3.3
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x lubridate::as.difftime() masks base::as.difftime()
x lubridate::date()        masks base::date()
x dplyr::filter()          masks stats::filter()
x readr::guess_encoding()  masks rvest::guess_encoding()
x lubridate::intersect()   masks base::intersect()
x dplyr::lag()             masks stats::lag()
x purrr::pluck()           masks rvest::pluck()
x lubridate::setdiff()     masks base::setdiff()
x lubridate::union()       masks base::union()
# grab a table of values from the web with a little scraping
url <- 'http://snowfence.umn.edu/Components/winddirectionanddegreeswithouttable3.htm'
page <- read_html(url)
directions_raw <- page %>% html_node('td table') %>% html_table(header = TRUE)
directions <- directions_raw %>% 
    set_names(~tolower(sub(' Direction', '', .x))) %>% 
    slice(-1) %>% 
    separate(degree, c('degree_min', 'degree_max'), sep = '\\s+-\\s+', convert = TRUE)
directions
wind <- mydata %>% 
    mutate(wd_cardinal = cut(
        wd, 
        breaks = c(0, directions$degree_max, 360), 
        labels = c(directions$cardinal, 'N')
    ), ws_interval = cut(ws,
                         breaks = c(0,2,4,6,8,10,12,14),
                         include.lowest=T)
    )
# Using wind, plot wd filled by ws
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  # Add a bar layer with width 1
  geom_bar(width = 1)

# Convert to polar coordinates:
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  geom_bar(width = 1) +
  coord_polar()

# Convert to polar coordinates:
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  geom_bar(width = 1) +
  coord_polar(start = -pi/16)

They are not common, but polar coordinate plots are really useful.

Facet

Facets layer

Faceting splits the data up into groups, according to a categorical variable, then plots each group in its own panel. For splitting the data by one or two categorical variables, facet_grid() is best.

Given categorical variables A and B, the code pattern is

plot +
  facet_grid(rows = vars(A), cols = vars(B))
  

This draws a panel for each pairwise combination of the values of A and B.

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am
  facet_grid(rows = vars(am))

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet columns by cyl
  facet_grid(cols=vars(cyl))

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am and columns by cyl
  facet_grid(rows = vars(am), cols = vars(cyl))

Many variables

In addition to aesthetics, facets are another way of encoding factor (i.e. categorical) variables. They can be used to reduce the complexity of plots with many variables.

Two variables are mapped onto the color aesthetic, using hue and lightness. To achieve this we combined fcyl and fam into a single interaction variable, fcyl_fam. This will allow us to take advantage of Color Brewer’s Paired color palette.

mtcars <- mtcars %>% 
  mutate(fcyl_fam = interaction(fcyl, fam, sep=":"))
# See the interaction column
mtcars$fcyl_fam
 [1] 6:1 6:1 4:1 6:0 8:0 6:0 8:0 4:0 4:0 6:0 6:0 8:0 8:0 8:0 8:0 8:0 8:0 4:1 4:1 4:1 4:0 8:0 8:0 8:0 8:0 4:1
[27] 4:1 4:1 8:1 6:1 8:1 4:1
Levels: 4:0 6:0 8:0 4:1 6:1 8:1
# Color the points by fcyl_fam
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam)) +
  geom_point() +
  # Use a paired color palette
  scale_color_brewer(palette = "Paired")

# Update the plot to map disp to size
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam, size = disp)) +
  geom_point() +
  scale_color_brewer(palette = "Paired")

# Update the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam, size = disp)) +
  geom_point() +
  scale_color_brewer(palette = "Paired") +
  # Grid facet on gear and vs
  facet_grid(rows = vars(gear), cols = vars(vs)) 

The last plot you’ve created contains 7 variables (4 categorical, 3 continuous). Useful combinations of aesthetics and facets help to achieve this.

Formula notation

As well as the vars() notation for specifying which variables should be used to split the dataset into facets, there is also a traditional formula notation. The three cases are shown in the table.

Modern notation Formula notation
facet_grid(rows = vars(A)) facet_grid(A ~ .)
facet_grid(cols = vars(B)) facet_grid(. ~ B)
facet_grid(rows = vars(A), cols = vars(B)) facet_grid(A ~ B)
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am using formula notation
  facet_grid(am ~ .)

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet columns by cyl using formula notation
  facet_grid(. ~ cyl)

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am and columns by cyl using formula notation
  facet_grid(am ~ cyl)

While many ggplots still use the traditional formula notation, using vars() is now preferred.

Facet labels and order

Labelling facets

If our factor levels are not clear, your facet labels may be confusing. We can assign proper labels in your original data before plotting or we can use the labeller argument in the facet layer.

The default value is

  • label_value: Default, displays only the value

Common alternatives are:

  • label_both: Displays both the value and the variable name
  • label_context: Displays only the values or both the values and variables depending on whether multiple factors are faceted
# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # The default is label_value
  facet_grid(cols = vars(cyl))

O

# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Displaying both the values and the variables
  facet_grid(cols = vars(cyl), labeller = label_both)

# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Label context
  facet_grid(cols = vars(cyl), labeller = label_context)

# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Two variables
  facet_grid(cols = vars(vs, cyl), labeller = label_context)

Make sure there is no ambiguity in interpreting plots by using proper labels.

Setting order

If you want to change the order of your facets, it’s best to properly define your factor variables before plotting.

Let’s see this in action with the mtcars transmission variable am. In this case, 0 = “automatic” and 1 = “manual”.

Here, we’ll make am a factor variable and relabel the numbers to proper names. The default order is alphabetical. To rearrange them we’ll call fct_rev() from the forcats package to reverse the order.

library(forcats)

# Make factor, set proper labels explictly
mtcars$fam <- factor(mtcars$am, labels = c(`0` = "automatic",
                                           `1` = "manual"))

# Default order is alphabetical
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cols = vars(fam))

# Make factor, set proper labels explictly, and
# manually set the label order
mtcars$fam <- factor(mtcars$am,
                     levels = c(1, 0),
                     labels = c("manual", "automatic"))
mtcars$fvs <- factor(mtcars$vs,
                     levels = c(1, 0),
                     labels = c("straight", "V-shaped"))

# View again
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cols = vars(fam))

Arrange your facets in an intuitive order for your data.

Facet plotting spaces

Continuous variables

By default every facet of a plot has the same axes. If the data ranges vary wildly between facets, it can be clearer if each facet has its own scale. This is achieved with the scales argument to facet_grid().

  • “fixed” (default): axes are shared between facets.
  • free: each facet has its own axes.
  • free_x: each facet has its own x-axis, but the y-axis is shared.
  • free_y: each facet has its own y-axis, but the x-axis is shared.

When faceting by columns, “free_y” has no effect, but we can adjust the x-axis. In contrast, when faceting by rows, “free_x” has no effect, but we can adjust the y-axis.

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Facet columns by cyl 
  facet_grid(cols = vars(cyl))

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Facet columns by cyl 
  facet_grid(cols = vars(cyl),
             scales = "free_x")

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Swap cols for rows; free the y-axis scales
  facet_grid(rows = vars(cyl), scales = "free_y")

Shared scales make it easy to compare between facets, but can be confusing if the data ranges are very different. In that case, used free scales.

Categorical variables

When you have a categorical variable with many levels which are not all present in each sub-group of another variable, it’s usually desirable to drop the unused levels.

By default, each facet of a plot is the same size. This behavior can be changed with the spaces argument, which works in the same way as scales: “free_x” allows different sized facets on the x-axis, “free_y”, allows different sized facets on the y-axis, “free” allows different sizes in both directions.

ggplot(mtcars, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Facet rows by gear
  facet_grid(rows = vars(gear))

ggplot(mtcars, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Free the y scales and space
  facet_grid(rows = vars(gear),
  scales = "free_y",
  space = "free_y")

mtcars2 <- mtcars %>% 
  # arrange from lo to hi
  arrange(-mpg) %>% 
  # redefine in factor leveles
  mutate(car = as_factor(car))
ggplot(mtcars2, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Free the y scales and space
  facet_grid(rows = vars(gear),
  scales = "free_y",
  space = "free_y")

Freeing the y-scale to remove blank lines helps focus attention on the actual data present.

Facet wrap and margins

Wrapping for many levels

facet_grid() is fantastic for categorical variables with a small number of levels. Although it is possible to facet variables with many levels, the resulting plot will be very wide or very tall, which can make it difficult to view.

The solution is to use facet_wrap() which separates levels along one axis but wraps all the subsets across a given number of rows or columns.

library(carData)
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Create facets, wrapping by year, using vars()
  facet_wrap(vars(year))

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Create facets, wrapping by year, using a formula
  facet_wrap(~ year)

ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Update the facet layout, using 11 columns
  facet_wrap(~ year,
  ncol = 11)

Start experimenting with facets in your own plots.

Margin plots

Facets are great for seeing subsets in a variable, but sometimes you want to see both those subsets and all values in a variable.

Here, the margins argument to facet_grid() is your friend.

  • FALSE (default): no margins.
  • TRUE: add margins to every variable being faceted by.
  • c(“variable1”, “variable2”): only add margins to the variables listed.
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Facet rows by fvs and cols by fam
  facet_grid(cols = vars(gear), rows = vars(fvs, fam))

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to add margins
  facet_grid(rows = vars(fvs, fam), cols = vars(gear),
  margins = TRUE)

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to only show margins on fam
  facet_grid(rows = vars(fvs, fam), cols = vars(gear), margins = "fam")

ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to only show margins on gear and fvs
  facet_grid(rows = vars(fvs, fam), cols = vars(gear), margins = c("gear", "fvs"))

It can be really helpful to show the full margin plots!

Best Practices

Bar plots

Dynamite plots

“Dynamite plots” (bar plots with error bars) are not well suited for their intended purpose of depicting distributions. If we really want error bars on bar plots, we can of course get them, but we’ll need to set the positions manually. A point geom will typically serve you much better.

# Plot wt vs. fcyl
ggplot(mtcars, aes(x = fcyl, y = wt)) +
  # Add a bar summary stat of means, colored skyblue
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  # Add an errorbar summary stat std deviation limits
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
`fun.y` is deprecated. Use `fun` instead.

Remember, we can specify any function in fun.data or fun.y and we can also specify any geom, as long as it’s appropriate to the data type.

Position dodging

we will add a distinction between transmission type, fam, for the dynamite plots and explore position dodging (where bars are side-by-side).

# Update the aesthetics to color and fill by fam
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
`fun.y` is deprecated. Use `fun` instead.

# Set alpha for the first and set position for each stat summary function
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar", alpha = 0.5, position = "dodge") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", position = "dodge", width = 0.1)
`fun.y` is deprecated. Use `fun` instead.

# Define a dodge position object with width 0.9
posn_d <- position_dodge(width = 0.9)

# For each summary stat, update the position to posn_d
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar", position = posn_d, alpha = 0.5) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), width = 0.1, position = posn_d, geom = "errorbar")
`fun.y` is deprecated. Use `fun` instead.

Slightly overlapping bar plots are common in the popular press and add a bit of style to your data viz.

Using aggregated data

If it is appropriate to use bar plots, then it nice to give an impression of the number of values in each group.

stat_summary() doesn’t keep track of the count. stat_sum() does (that’s the whole point), but it’s difficult to access. It’s more straightforward to calculate exactly what we want to plot ourselves.

mtcars_by_cyl <- mtcars %>%
  group_by(cyl) %>% 
  summarize(mean_wt = mean(wt),
          sd_wt = sd(wt),
          n_wt = n()) %>% 
          mutate( prop = n_wt/sum(n_wt))
mtcars_by_cyl
# Using mtcars_cyl, plot mean_wt vs. cyl
ggplot(mtcars_by_cyl, aes(cyl, mean_wt)) +
  # Add a bar layer with identity stat, filled skyblue
  geom_bar(stat = "identity", fill = "skyblue")

ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  # Swap geom_bar() for geom_col()
  geom_col(fill = "skyblue")

ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  # Set the width aesthetic to prop
  geom_col(fill = "skyblue", aes(width = prop))
Ignoring unknown aesthetics: width

ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  geom_col(aes(width = prop), fill = "skyblue") +
  # Add an errorbar layer
  geom_errorbar(
    # ... at mean weight plus or minus 1 std dev
    aes(ymin = mean_wt - sd_wt,
    ymax = mean_wt + sd_wt),
    # with width 0.1
    width = 0.1
  )
Ignoring unknown aesthetics: width

This is a good start, but it’s difficult to adjust the spacing between the bars.

Heatmaps

Since heat maps encode color on a continuous scale, they are difficult to accurately decode. Hence, heat maps are most useful if you have a small number of boxes and/or a clear pattern that allows you to overcome decoding difficulties.

To produce them, map two categorical variables onto the x and y aesthetics, along with a continuous variable onto fill. The geom_tile() layer adds the boxes.

library(lattice)
str(barley)
'data.frame':   120 obs. of  4 variables:
 $ yield  : num  27 48.9 27.4 39.9 33 ...
 $ variety: Factor w/ 10 levels "Svansota","No. 462",..: 3 3 3 3 3 3 7 7 7 7 ...
 $ year   : Factor w/ 2 levels "1932","1931": 2 2 2 2 2 2 2 2 2 2 ...
 $ site   : Factor w/ 6 levels "Grand Rapids",..: 3 6 4 5 1 2 3 6 4 5 ...
# Using barley, plot variety vs. year, filled by yield
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  # Add a tile geom
  geom_tile()

# Previously defined
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() + 
  # Facet, wrapping by site, with 1 column
  facet_wrap(facets = vars(site), ncol = 1) +
  # Add a fill scale using an 2-color gradient
  scale_fill_gradient(low = "white", high = "red")

library(RColorBrewer)
# A palette of 9 reds
red_brewer_palette <- brewer.pal(9, "Reds")

# Update the plot
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() + 
  facet_wrap(facets = vars(site), ncol = 1) +
  # Update scale to use n-colors from red_brewer_palette
  scale_fill_gradientn(colors = red_brewer_palette)

We can continue by using breaks, limits and labels to modify the fill scale and update the theme, but this is a pretty good start.

Heat map alternatives

There are several alternatives to heat maps. The best choice really depends on the data and the story you want to tell with this data. If there is a time component, the most obvious choice is a line plot.

# Using barley, plot yield vs. year, colored and grouped by variety
ggplot(barley, aes(x = year, y = yield, color = variety, group = variety)) +
  # Add a line layer
  geom_line() +
  # Facet, wrapping by site, with 1 row
  facet_wrap( ~ site, nrow = 1)

# Using barely, plot yield vs. year, colored, grouped, and filled by site
ggplot(barley, aes(x = year, y = yield, color = site, group = site, fill = site)) +
  # Add a line summary stat aggregated by mean
  stat_summary(fun.y = mean, geom = "line") +
  # Add a ribbon summary stat with 10% opacity, no color
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "ribbon", alpha = 0.1, color = NA)
`fun.y` is deprecated. Use `fun` instead.

Whenever you see a heat map, ask yourself it it’s really necessary. Many people use them because they look fancy and complicated - signs of poor communication skills.

When good data makes bad plots

When you first encounter a data visualization, either from yourself or a colleague, you always want to critically ask if it’s obscuring the data in any way.

Let’s take a look at the steps we could take to produce and improve the plot in the view.

The data comes from an experiment where the effect of two different types of vitamin C sources, orange juice or ascorbic acid, were tested on the growth of the odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.

The data is stored in the TG data frame, which contains three variables: dose, len, and supp.

library(datasets)
TG = (ToothGrowth)
str(TG)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Initial plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.1)) +
  theme_gray(3)

# View plot
growth_by_dose

# Initial plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.1)) +
  theme_classic()

# View plot
growth_by_dose

# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  theme_classic()

# View plot
growth_by_dose

# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  # Use the right geometry
  stat_summary(fun.y = mean,
               geom = "line",
               position = position_dodge(0.1)) +
  theme_classic()
`fun.y` is deprecated. Use `fun` instead.
# View plot
growth_by_dose

# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  stat_summary(fun.y = mean,
               geom = "line",
               position = position_dodge(0.1)) +
  theme_classic() +
  # Adjust labels and colors:
  labs(x = "Dose (mg/day)", y = "Odontoblasts length (mean, standard deviation)", color = "Supplement") +
  scale_color_brewer(palette = "Set1", labels = c("Orange juice", "Ascorbic acid")) +
  scale_y_continuous(limits = c(0,35), breaks = seq(0, 35, 5), expand = c(0,0))
`fun.y` is deprecated. Use `fun` instead.
# View plot
growth_by_dose

---
title: "Intermediate Data Visualization with ggplot2"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    
toc_depth: 3
---
# Statistics

## Smoothing

The mtcars dataset contains information for 32 cars from Motor Trends magazine from 1974. This dataset is small, intuitive, and contains a variety of continuous and categorical (both nominal and ordinal) variables.
```{r}
library(ggplot2)
library(dplyr )
```

```{r}
# View the structure of mtcars
str(cars)
```
```{r}
# Using mtcars, draw a scatter plot of mpg vs. wt
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth()
```
```{r}
# Amend the plot. Use lin. reg. smoothing; turn off std err ribbon
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
```{r}
# Amend the plot. Swap geom_smooth() for stat_smooth().
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
```
You can use either stat_smooth() or geom_smooth() to apply a linear model.

## Grouping variables

We'll continue by considering the situation of looking at sub-groups in our dataset. For this we'll encounter the invisible group aesthetic.
```{r}
#create factor cyl and am
mtcars <- mtcars %>% 
  mutate(fcyl = as.factor(cyl),
         fam = as.factor(am))
```
```{r}
# Using mtcars, plot mpg vs. wt, colored by fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  # Add a point layer
  geom_point() +
  # Add a smooth lin reg stat, no ribbon
  stat_smooth(method = "lm", se = FALSE)
```
```{r}
# Amend the plot to add another smooth layer with dummy grouping
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  stat_smooth(mapping = aes(group = 1), method = "lm", se = FALSE)
```
Notice that the color aesthetic defined an invisible group aesthetic. Defining the group aesthetic for a specific geom means we can overwrite that. Here, we use a dummy variable to calculate the smoothing model for all values.

## Modifying stat_smooth

We'll consider another argument, span, used in LOESS smoothing, and we'll take a look at a nice scenario of properly mapping different models.
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add 3 smooth LOESS stats, varying span & color
  stat_smooth(color = "red", se = FALSE, span = 0.9) +
  stat_smooth(color = "green", se = FALSE, span = 0.6) +
  stat_smooth(color = "blue", se = FALSE, span = 0.3)
```
```{r}
# Amend the plot to color by fcyl
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Add a smooth LOESS stat, no ribbon
  stat_smooth(se = FALSE) +
  # Add a smooth lin. reg. stat, no ribbon
  stat_smooth(method = "lm", se = FALSE)
```
```{r}
# Amend the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl)) +
  geom_point() +
  # Map color to dummy variable "All"
  stat_smooth(mapping = aes(color = "All"), se = FALSE) +
  stat_smooth(method = "lm", se = FALSE)
```
The default span for LOESS is 0.9. A lower span will result in a better fit with more detail; but don't overdo it or you'll end up over-fitting!

We'll take a look at the standard error ribbons, which show the 95% confidence interval of smoothing models. 
```{r}
library(carData)
```
Vocab has been given an extra column, year_group, splitting the dates into before and after 1995.
```{r}
Vocab <- Vocab %>% 
  mutate(year_group = as.factor(ifelse(year<1995, "[1974,1995]", "[1995,2016]") ))
```


```{r}
# Using Vocab, plot vocabulary vs. education, colored by year group
ggplot(Vocab, aes(education, vocabulary, color = year_group)) +
  # Add jittered points with transparency 0.25
  geom_jitter(alpha = 0.25) +
  # Add a smooth lin. reg. line (with ribbon)
  stat_smooth(method = "lm", se = FALSE)
```
It's easier to read the plot if the standard error ribbons match the lines, and the lines have more emphasis.

```{r}
# Amend the plot
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +
  # Map the fill color to year_group, set the line size to 2
  stat_smooth(method = "lm",size = 2, mapping= aes(fill = year_group))
```
Notice that since 1995, education has relatively smaller effect on increasing vocabulary.

## Quantiles

we'll use stat_quantile() to apply a quantile regression.

Linear regression predicts the mean response from the explanatory variables, quantile regression predicts a quantile response (e.g. the median) from the explanatory variables. Specific quantiles can be specified with the quantiles argument.

Specifying many quantiles and color your models according to year can make plots too busy.
```{r}
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(alpha = 0.25) +
  # Add a quantile stat, at 0.05, 0.5, and 0.95
  stat_quantile(quantiles = c(0.05, 0.5, 0.95))
```
```{r}
# Amend the plot to color by year_group
ggplot(Vocab, aes(x = education, y = vocabulary, color = year_group)) +
  geom_jitter(alpha = 0.25) +
  stat_quantile(quantiles = c(0.05, 0.5, 0.95))
```
Quantile regression is a great tool for getting a more detailed overview of a large dataset.

## Using stat_sum

1 solution for overplotting is jittering with transparency. Another solution is to use stat_sum(), which calculates the total number of overlapping observations and maps that onto the size aesthetic.

stat_sum() allows a special variable, ..prop.., to show the proportion of values within the dataset.
```{r}
# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  # Replace this with a sum stat
  geom_jitter(alpha = 0.25)
```
```{r}
# Run this, look at the plot, then update it
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  # Replace this with a sum stat
  stat_sum()
```
```{r}
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum() +
  # Add a size scale, from 1 to 10
  scale_size(range = c(1,10))
```
```{r}
# Amend the stat to use proportion sizes
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_sum(aes(size =..prop..))
```
```{r}
# Amend the plot to group by education
ggplot(Vocab, aes(x = education, y = vocabulary, group = education)) +
  stat_sum(aes(size = ..prop..))
```
 If a few data points overlap, jittering is great. When you have lots of overlaps (particularly where continuous data has been rounded), using stat_sum() to count the overlaps is more useful.

## Stats outside geom

### Preparations

We'll aim to make the below plot:

![stats](images/stats.png)

We'll establish our positions and base layer of the plot.

Establishing these items as independent objects will allow us to recycle them easily in many layers, or plots.

- position_jitter() adds jittering (e.g. for points).
- position_dodge() dodges geoms, (e.g. bar, col, boxplot, violin, errorbar, pointrange).
- position_jitterdodge() jitters and dodges geoms, (e.g. points).

```{r}
# Define position objects
# 1. Jitter with width 0.2
posn_j <- position_jitter(width = 0.2)
# 2. Dodge with width 0.1
posn_d <- position_dodge(width = 0.1)
# 3. Jitter-dodge with jitter.width 0.2 and dodge.width 0.1
posn_jd <- position_jitterdodge(jitter.width = 0.2, dodge.width = 0.1)
```
```{r}
# Create the plot base: wt vs. fcyl, colored by fam
p_wt_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, wt, color = fam))
```
```{r}
# Add a point layer
p_wt_vs_fcyl_by_fam +
  geom_point()
```
The default positioning of the points is highly susceptible to overplotting.

### Using position objects

Now that the position objects have been created, you can apply them to the base plot to see their effects. You do this by adding a point geom and setting the position argument to the position object.

```{r}
# Add jittering only
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_j)
```
```{r}
# Add dodging only
p_wt_vs_fcyl_by_fam_jit <- p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_d)
p_wt_vs_fcyl_by_fam_jit
```
```{r}
# Add jittering and dodging
p_wt_vs_fcyl_by_fam +
  geom_point(position = posn_jd)
```
Although you can set position by setting the position argument to a string (for example position = "dodge"), defining objects promotes consistency between layers.

### Plotting variations

The preparation is done; now let's explore stat_summary().

Summary statistics refers to a combination of location (mean or median) and spread (standard deviation or confidence interval).

These metrics are calculated in stat_summary() by passing a function to the fun.data argument. mean_sdl(), calculates multiples of the standard deviation and mean_cl_normal() calculates the t-corrected 95% CI.

Arguments to the data function are passed to stat_summary()'s fun.args argument as a list.
```{r}
p_wt_vs_fcyl_by_fam_jit +
  # Change the geom to be an errorbar
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d)
```
```{r}
p_wt_vs_fcyl_by_fam_jit +
  # Change the geom to be an errorbar
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), position = posn_d,
  geom  ="errorbar")
```
```{r}
p_wt_vs_fcyl_by_fam_jit +
  # Add a summary stat of normal confidence limits
  stat_summary(fun.data = mean_cl_normal, position = posn_d)
```
You can always assign your own function to the fun.data argument as long as the result is a data frame and the variable names match the aesthetics that you will need for the geom layer.

# Coordinates

## Zooming In

We'll compare zooming by changing scales and by changing coordinates.

The big difference is that the scale functions change the underlying dataset, which affects calculations made by computed geoms (like histograms or smooth trend lines), whereas coordinate functions make no changes to the dataset.
```{r}
# Run the code, view the plot, then update it
ggplot(mtcars, aes(x = wt, y = hp, color = fam)) +
  geom_point() +
  geom_smooth() +
  # Add a continuous x scale from 3 to 6
  scale_x_continuous(limits = c(3, 6))
```
```{r}
ggplot(mtcars, aes(x = wt, y = hp, color = fam)) +
  geom_point() +
  geom_smooth() +
  # Add Cartesian coordinates with x limits from 3 to 6
  coord_cartesian(xlim = c(3, 6))
```
Using the scale function to zoom in meant that there wasn't enough data to calculate the trend line, and geom_smooth() failed. When coord_cartesian() was applied, the full dataset was used for the trend calculation.

## Aspect ratio: 1:1 ratios

We can set the aspect ratio of a plot with coord_fixed(), which uses ratio = 1 as a default. A 1:1 aspect ratio is most appropriate when two continuous variables are on the same scale, as with the iris dataset.

All variables are measured in centimeters, so it only makes sense that one unit on the plot should be the same physical distance on each axis. This gives a more truthful depiction of the relationship between the two variables since the aspect ratio can change the angle of our smoothing line. This would give an erroneous impression of the data. Of course the underlying linear models don't change, but our perception can be influenced by the angle drawn.
```{r}
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  # Fix the coordinate ratio
  coord_fixed(ratio = 1)
```
A 1:1 aspect ratio is helpful when your axes show the same scales.

### Setting rations

When values are not on the same scale it can be a bit tricky to set an appropriate aspect ratio. A classic William Cleveland (inventor of dot plots) example is the sunspots data set. We have 3200 observations from 1750 to 2016.

sun_plot is a plot without any set aspect ratio. It fills up the graphics device.
```{r}
library(zoo)
sunspots.m <- data.frame(
year = index(sunspot.month),
value = reshape2::melt(sunspot.month)$value)
```
```{r}
sun_plot <- ggplot(sunspots.m, aes(x = year, y = value)) +
  geom_line() +
  coord_fixed() # default to 1:1 aspect ratio

sun_plot
```
```{r}
# Fix the aspect ratio to 1:1
sun_plot +
  coord_fixed(ratio = 1)
```
```{r}
# Change the aspect ratio to 20:1
sun_plot +
  coord_fixed(ratio = 0.0555)
```
 Making a wide plot by calling coord_fixed() with a high ratio is often useful for long time series.

### Expand and clip

The coord_*() layer functions offer two useful arguments that work well together: expand and clip.

- expand sets a buffer margin around the plot, so data and axes don't overlap. Setting expand to 0 draws the axes to the limits of the data.
- clip decides whether plot elements that would lie outside the plot panel are displayed or ignored ("clipped").

When done properly this can make a great visual effect! We'll use theme_classic() and modify the axis lines in this example.
```{r}
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # Add Cartesian coordinates with zero expansion
  coord_cartesian(expand = 0) +
  theme_classic()
```
```{r}
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 2) +
  # Turn clipping off
  coord_cartesian(expand = 0, clip = "off") +
  theme_classic() +
  # Remove axis lines
  theme(axis.line = element_blank())
```
These arguments make clean and accurate plots by not cutting off data.

### Log-transforming scales
Using scale_y_log10() and scale_x_log10() is equivalent to transforming our actual dataset before getting to ggplot2.

Using coord_trans(), setting x = "log10" and/or y = "log10" arguments, transforms the data after statistics have been calculated. The plot will look the same as with using scale_*_log10(), but the scales will be different, meaning that we'll see the original values on our log10 transformed axes. This can be useful since log scales can be somewhat
```{r}
# Produce a scatter plot of brainwt vs. bodywt
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  ggtitle("Raw Values")
```
```{r}
# Add scale_*_*() functions
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Scale_ functions")
```
```{r}
# Perform a log10 coordinate system transformation
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10")
```
Each transformation method has implications for the plot's interpretability. Think about your audience when choosing a method for applying transformations.

### Adding stats to transformed scales

Remember that statistics are calculated on the untransformed data. A linear model may end up looking not-so-linear after an axis transformation. Let's revisit the two plots from the previous exercise and compare their linear models.
```{r}
# Plot with a scale_*_*() function:
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add a log10 x scale
  scale_x_log10() +
  # Add a log10 y scale
  scale_y_log10() +
  ggtitle("Scale functions")
```
```{r}
# Plot with transformed coordinates
ggplot(msleep, aes(bodywt, brainwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add a log10 coordinate transformation for x and y axes
  coord_trans(x = "log10", y = "log10")
```
The smooth trend line is calculated after scale transformations but not coordinate transformations, so the second plot doesn't make sense.

### Useful double axes
```{r}
library(lubridate)
airquality <- airquality %>% 
  mutate(Date = make_date(1974, Month, Day))
```


Double x and y-axes are a contentious topic in data visualization. Here, we will review a great use case where double axes actually do add value to a plot.
```{r}
# Using airquality, plot Temp vs. Date
ggplot(airquality, aes(Date, Temp)) +
  # Add a line layer
  geom_line() +
  labs(x = "Date (1973)", y = "Fahrenheit")
```
```{r}
# Define breaks (Fahrenheit)
y_breaks <- c(59, 68, 77, 86, 95, 104)

# Convert y_breaks from Fahrenheit to Celsius
y_labels <- (y_breaks - 32)*5/9

# Create a secondary x-axis
secondary_y_axis <- sec_axis(
  # Use identity transformation
  trans = identity,
  name = "Celsius",
  # Define breaks and labels as above
  breaks = y_breaks,
  labels = y_labels
)

# Examine the object
secondary_y_axis
```
```{r}
# Update the plot
ggplot(airquality, aes(Date, Temp)) +
  geom_line() +
  # Add the secondary y-axis 
  scale_y_continuous(sec.axis = secondary_y_axis) +
  labs(x = "Date (1973)", y = "Fahrenheit")
```
 Double axes are most useful when you want to display the same value in two differnt units.
 
 ### Flipping axes
 
 Flipping axes means to reverse the variables mapped onto the x and y aesthetics. We can just change the mappings in aes(), but we can also use the coord_flip() layer function.

There are two reasons to use this function:

- We want a vertical geom to be horizontal, or
- We've completed a long series of plotting functions and want to flip it without having to rewrite all our commands.
```{r}
# Plot fcyl bars, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Place bars side by side
  geom_bar(position = "dodge")
```
```{r}
ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar(position = "dodge") +
  # Flip the x and y coordinates
  coord_flip()
```
```{r}
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Set a dodge width of 0.5 for partially overlapping bars
  geom_bar(position = position_dodge(width = 0.5)) +
  coord_flip()
```
Horizontal bars are especially useful when the axis labels are long.

Another example:
```{r}
mtcars <- read.csv("mtcars.csv", stringsAsFactors=FALSE)
mtcars <- mtcars %>% 
  mutate(fam = as.factor(am), fcyl = as.factor(cyl), car = model, fvs = as.factor(vs))
```

```{r}
# Plot of wt vs. car
ggplot(mtcars, aes(car, wt)) +
  # Add a point layer
  geom_point() +
  labs(x = "car", y = "weight")
```
```{r}
# Flip the axes to set car to the y axis
ggplot(mtcars, aes(car, wt)) +
  geom_point() +
  labs(x = "car", y = "weight") +
  coord_flip()
```
### Pie charts

The coord_polar() function converts a planar x-y Cartesian plot to polar coordinates. This can be useful if you are producing pie charts.

We can imagine two forms for pie charts - the typical filled circle, or a colored ring.

Typical pie charts omit all of the non-data ink. Pie charts are not really better than stacked bar charts.
```{r}
# Run the code, view the plot, then update it
ggplot(mtcars, aes(x = 1, fill = fcyl)) +
  geom_bar() +
  # Add a polar coordinate system
  coord_polar(theta = "y")
```
```{r}
ggplot(mtcars, aes(x = 1, fill = fcyl)) +
  # Reduce the bar width to 0.1
  geom_bar(width = 0.1) +
  coord_polar(theta = "y") +
  # Add a continuous x scale from 0.5 to 1.5
  scale_x_continuous(limits = c(0.5, 1.5))
```

Polar coordinates are particularly useful if you are dealing with a cycle, like yearly data, that you would like to see represented as such.

### Wind rose plots

Polar coordinate plots are well-suited to scales like compass direction or time of day. A popular example is the "wind rose".
```{r}
library(openair)
```
Converting [degrees into factor](https://community.rstudio.com/t/convert-wind-direction-degrees-into-factors-in-a-data-frame/14636V):
```{r}
library(rvest)
library(tidyverse)

# grab a table of values from the web with a little scraping
url <- 'http://snowfence.umn.edu/Components/winddirectionanddegreeswithouttable3.htm'
page <- read_html(url)
directions_raw <- page %>% html_node('td table') %>% html_table(header = TRUE)
directions <- directions_raw %>% 
    set_names(~tolower(sub(' Direction', '', .x))) %>% 
    slice(-1) %>% 
    separate(degree, c('degree_min', 'degree_max'), sep = '\\s+-\\s+', convert = TRUE)
directions
```
```{r}
wind <- mydata %>% 
    mutate(wd_cardinal = cut(
        wd, 
        breaks = c(0, directions$degree_max, 360), 
        labels = c(directions$cardinal, 'N')
    ), ws_interval = cut(ws,
                         breaks = c(0,2,4,6,8,10,12,14),
                         include.lowest=T)
    )
```


```{r}
# Using wind, plot wd filled by ws
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  # Add a bar layer with width 1
  geom_bar(width = 1)
```
```{r}
# Convert to polar coordinates:
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  geom_bar(width = 1) +
  coord_polar()
```
```{r}
# Convert to polar coordinates:
ggplot(wind, aes(wd_cardinal, fill = ws_interval)) +
  geom_bar(width = 1) +
  coord_polar(start = -pi/16)
```
They are not common, but polar coordinate plots are really useful.

# Facet

## Facets layer

Faceting splits the data up into groups, according to a categorical variable, then plots each group in its own panel. For splitting the data by one or two categorical variables, facet_grid() is best.

Given categorical variables A and B, the code pattern is

    plot +
      facet_grid(rows = vars(A), cols = vars(B))
      
This draws a panel for each pairwise combination of the values of A and B.
```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am
  facet_grid(rows = vars(am))
```
```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet columns by cyl
  facet_grid(cols=vars(cyl))
```
```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am and columns by cyl
  facet_grid(rows = vars(am), cols = vars(cyl))
```
### Many variables

In addition to aesthetics, facets are another way of encoding factor (i.e. categorical) variables. They can be used to reduce the complexity of plots with many variables.

Two variables are mapped onto the color aesthetic, using hue and lightness. To achieve this we combined fcyl and fam into a single interaction variable, fcyl_fam. This will allow us to take advantage of Color Brewer's Paired color palette.
```{r}
mtcars <- mtcars %>% 
  mutate(fcyl_fam = interaction(fcyl, fam, sep=":"))
# See the interaction column
mtcars$fcyl_fam
```
```{r}
# Color the points by fcyl_fam
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam)) +
  geom_point() +
  # Use a paired color palette
  scale_color_brewer(palette = "Paired")
```
```{r}
# Update the plot to map disp to size
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam, size = disp)) +
  geom_point() +
  scale_color_brewer(palette = "Paired")
```
```{r}
# Update the plot
ggplot(mtcars, aes(x = wt, y = mpg, color = fcyl_fam, size = disp)) +
  geom_point() +
  scale_color_brewer(palette = "Paired") +
  # Grid facet on gear and vs
  facet_grid(rows = vars(gear), cols = vars(vs)) 
```
The last plot you've created contains 7 variables (4 categorical, 3 continuous). Useful combinations of aesthetics and facets help to achieve this.

## Formula notation

As well as the vars() notation for specifying which variables should be used to split the dataset into facets, there is also a traditional formula notation. The three cases are shown in the table.

Modern notation |	Formula notation
--------------- | ----------------
facet_grid(rows = vars(A)) |	facet_grid(A ~ .)
facet_grid(cols = vars(B)) |	facet_grid(. ~ B)
facet_grid(rows = vars(A), cols = vars(B)) |	facet_grid(A ~ B)

```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am using formula notation
  facet_grid(am ~ .)
```
```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet columns by cyl using formula notation
  facet_grid(. ~ cyl)
```
```{r}
ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  # Facet rows by am and columns by cyl using formula notation
  facet_grid(am ~ cyl)
```
While many ggplots still use the traditional formula notation, using vars() is now preferred.

## Facet labels and order

### Labelling facets

If our factor levels are not clear, your facet labels may be confusing. We can assign proper labels in your original data before plotting or we can use the labeller argument in the facet layer.

The default value is

- label_value: Default, displays only the value

Common alternatives are:

- label_both: Displays both the value and the variable name
- label_context: Displays only the values or both the values and variables depending on whether multiple factors are faceted

```{r}
# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # The default is label_value
  facet_grid(cols = vars(cyl))
```
O
```{r}
# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Displaying both the values and the variables
  facet_grid(cols = vars(cyl), labeller = label_both)
```
```{r}
# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Label context
  facet_grid(cols = vars(cyl), labeller = label_context)
```
```{r}
# Plot wt by mpg
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # Two variables
  facet_grid(cols = vars(vs, cyl), labeller = label_context)
```
Make sure there is no ambiguity in interpreting plots by using proper labels.

### Setting order

If you want to change the order of your facets, it's best to properly define your factor variables before plotting.

Let's see this in action with the mtcars transmission variable am. In this case, 0 = "automatic" and 1 = "manual".

Here, we'll make am a factor variable and relabel the numbers to proper names. The default order is alphabetical. To rearrange them we'll call fct_rev() from the forcats package to reverse the order.
```{r}
library(forcats)

# Make factor, set proper labels explictly
mtcars$fam <- factor(mtcars$am, labels = c(`0` = "automatic",
                                           `1` = "manual"))

# Default order is alphabetical
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cols = vars(fam))
```
```{r}
# Make factor, set proper labels explictly, and
# manually set the label order
mtcars$fam <- factor(mtcars$am,
                     levels = c(1, 0),
                     labels = c("manual", "automatic"))
mtcars$fvs <- factor(mtcars$vs,
                     levels = c(1, 0),
                     labels = c("straight", "V-shaped"))

# View again
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cols = vars(fam))
```
Arrange your facets in an intuitive order for your data.

## Facet plotting spaces

### Continuous variables

By default every facet of a plot has the same axes. If the data ranges vary wildly between facets, it can be clearer if each facet has its own scale. This is achieved with the scales argument to facet_grid().

- "fixed" (default): axes are shared between facets.
- free: each facet has its own axes.
- free_x: each facet has its own x-axis, but the y-axis is shared.
- free_y: each facet has its own y-axis, but the x-axis is shared.

When faceting by columns, "free_y" has no effect, but we can adjust the x-axis. In contrast, when faceting by rows, "free_x" has no effect, but we can adjust the y-axis.
```{r}
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Facet columns by cyl 
  facet_grid(cols = vars(cyl))
```
```{r}
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Facet columns by cyl 
  facet_grid(cols = vars(cyl),
             scales = "free_x")
```
```{r}
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() + 
  # Swap cols for rows; free the y-axis scales
  facet_grid(rows = vars(cyl), scales = "free_y")
```
Shared scales make it easy to compare between facets, but can be confusing if the data ranges are very different. In that case, used free scales.

### Categorical variables

When you have a categorical variable with many levels which are not all present in each sub-group of another variable, it's usually desirable to drop the unused levels.

By default, each facet of a plot is the same size. This behavior can be changed with the spaces argument, which works in the same way as scales: "free_x" allows different sized facets on the x-axis, "free_y", allows different sized facets on the y-axis, "free" allows different sizes in both directions.
```{r}
ggplot(mtcars, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Facet rows by gear
  facet_grid(rows = vars(gear))
```
```{r}
ggplot(mtcars, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Free the y scales and space
  facet_grid(rows = vars(gear),
  scales = "free_y",
  space = "free_y")
```
```{r}
mtcars2 <- mtcars %>% 
  # arrange from lo to hi
  arrange(-mpg) %>% 
  # redefine in factor leveles
  mutate(car = as_factor(car))
```
```{r}
ggplot(mtcars2, aes(x = mpg, y = car, color = fam)) +
  geom_point() +
  # Free the y scales and space
  facet_grid(rows = vars(gear),
  scales = "free_y",
  space = "free_y")
```
Freeing the y-scale to remove blank lines helps focus attention on the actual data present.

## Facet wrap and margins

### Wrapping for many levels

facet_grid() is fantastic for categorical variables with a small number of levels. Although it is possible to facet variables with many levels, the resulting plot will be very wide or very tall, which can make it difficult to view.

The solution is to use facet_wrap() which separates levels along one axis but wraps all the subsets across a given number of rows or columns.
```{r}
library(carData)
```
```{r}
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Create facets, wrapping by year, using vars()
  facet_wrap(vars(year))
```
```{r}
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Create facets, wrapping by year, using a formula
  facet_wrap(~ year)
```
```{r}
ggplot(Vocab, aes(x = education, y = vocabulary)) +
  stat_smooth(method = "lm", se = FALSE) +
  # Update the facet layout, using 11 columns
  facet_wrap(~ year,
  ncol = 11)
```
Start experimenting with facets in your own plots.

### Margin plots

Facets are great for seeing subsets in a variable, but sometimes you want to see both those subsets and all values in a variable.

Here, the margins argument to facet_grid() is your friend.

- FALSE (default): no margins.
- TRUE: add margins to every variable being faceted by.
- c("variable1", "variable2"): only add margins to the variables listed.


```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Facet rows by fvs and cols by fam
  facet_grid(cols = vars(gear), rows = vars(fvs, fam))
```
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to add margins
  facet_grid(rows = vars(fvs, fam), cols = vars(gear),
  margins = TRUE)
```
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to only show margins on fam
  facet_grid(rows = vars(fvs, fam), cols = vars(gear), margins = "fam")
```
```{r}
ggplot(mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  # Update the facets to only show margins on gear and fvs
  facet_grid(rows = vars(fvs, fam), cols = vars(gear), margins = c("gear", "fvs"))
```
It can be really helpful to show the full margin plots!

# Best Practices

## Bar plots

### Dynamite plots

"Dynamite plots" (bar plots with error bars) are not well suited for their intended purpose of depicting distributions. If we really want error bars on bar plots, we can of course get them, but we'll need to set the positions manually. A point geom will typically serve you much better.

```{r}
# Plot wt vs. fcyl
ggplot(mtcars, aes(x = fcyl, y = wt)) +
  # Add a bar summary stat of means, colored skyblue
  stat_summary(fun.y = mean, geom = "bar", fill = "skyblue") +
  # Add an errorbar summary stat std deviation limits
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
```
Remember, we can specify any function in fun.data or fun.y and we can also specify any geom, as long as it's appropriate to the data type.

### Position dodging

we will add a distinction between transmission type, fam, for the dynamite plots and explore position dodging (where bars are side-by-side).
```{r}
# Update the aesthetics to color and fill by fam
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.1)
```
```{r}
# Set alpha for the first and set position for each stat summary function
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar", alpha = 0.5, position = "dodge") +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", position = "dodge", width = 0.1)
```
```{r}
# Define a dodge position object with width 0.9
posn_d <- position_dodge(width = 0.9)

# For each summary stat, update the position to posn_d
ggplot(mtcars, aes(x = fcyl, y = wt, color = fam, fill = fam)) +
  stat_summary(fun.y = mean, geom = "bar", position = posn_d, alpha = 0.5) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), width = 0.1, position = posn_d, geom = "errorbar")
```
Slightly overlapping bar plots are common in the popular press and add a bit of style to your data viz.

### Using aggregated data

If it is appropriate to use bar plots, then it nice to give an impression of the number of values in each group.

stat_summary() doesn't keep track of the count. stat_sum() does (that's the whole point), but it's difficult to access. It's more straightforward to calculate exactly what we want to plot ourselves.

```{r}
mtcars_by_cyl <- mtcars %>%
  group_by(cyl) %>% 
  summarize(mean_wt = mean(wt),
          sd_wt = sd(wt),
          n_wt = n()) %>% 
          mutate( prop = n_wt/sum(n_wt))
mtcars_by_cyl
```
```{r}
# Using mtcars_cyl, plot mean_wt vs. cyl
ggplot(mtcars_by_cyl, aes(cyl, mean_wt)) +
  # Add a bar layer with identity stat, filled skyblue
  geom_bar(stat = "identity", fill = "skyblue")
```
```{r}
ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  # Swap geom_bar() for geom_col()
  geom_col(fill = "skyblue")
```
```{r}
ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  # Set the width aesthetic to prop
  geom_col(fill = "skyblue", aes(width = prop))
```
```{r}
ggplot(mtcars_by_cyl, aes(x = cyl, y = mean_wt)) +
  geom_col(aes(width = prop), fill = "skyblue") +
  # Add an errorbar layer
  geom_errorbar(
    # ... at mean weight plus or minus 1 std dev
    aes(ymin = mean_wt - sd_wt,
    ymax = mean_wt + sd_wt),
    # with width 0.1
    width = 0.1
  )
```
This is a good start, but it's difficult to adjust the spacing between the bars.

## Heatmaps

Since heat maps encode color on a continuous scale, they are difficult to accurately decode. Hence, heat maps are most useful if you have a small number of boxes and/or a clear pattern that allows you to overcome decoding difficulties.

To produce them, map two categorical variables onto the x and y aesthetics, along with a continuous variable onto fill. The geom_tile() layer adds the boxes.
```{r}
library(lattice)
```
```{r}
str(barley)
```
```{r}
# Using barley, plot variety vs. year, filled by yield
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  # Add a tile geom
  geom_tile()
```
```{r}
# Previously defined
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() + 
  # Facet, wrapping by site, with 1 column
  facet_wrap(facets = vars(site), ncol = 1) +
  # Add a fill scale using an 2-color gradient
  scale_fill_gradient(low = "white", high = "red")
```
```{r}
library(RColorBrewer)
# A palette of 9 reds
red_brewer_palette <- brewer.pal(9, "Reds")

# Update the plot
ggplot(barley, aes(x = year, y = variety, fill = yield)) +
  geom_tile() + 
  facet_wrap(facets = vars(site), ncol = 1) +
  # Update scale to use n-colors from red_brewer_palette
  scale_fill_gradientn(colors = red_brewer_palette)
```
We can continue by using breaks, limits and labels to modify the fill scale and update the theme, but this is a pretty good start.

### Heat map alternatives

There are several alternatives to heat maps. The best choice really depends on the data and the story you want to tell with this data. If there is a time component, the most obvious choice is a line plot.

```{r}
# Using barley, plot yield vs. year, colored and grouped by variety
ggplot(barley, aes(x = year, y = yield, color = variety, group = variety)) +
  # Add a line layer
  geom_line() +
  # Facet, wrapping by site, with 1 row
  facet_wrap( ~ site, nrow = 1)
```
```{r}
# Using barely, plot yield vs. year, colored, grouped, and filled by site
ggplot(barley, aes(x = year, y = yield, color = site, group = site, fill = site)) +
  # Add a line summary stat aggregated by mean
  stat_summary(fun.y = mean, geom = "line") +
  # Add a ribbon summary stat with 10% opacity, no color
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "ribbon", alpha = 0.1, color = NA)
```
 Whenever you see a heat map, ask yourself it it's really necessary. Many people use them because they look fancy and complicated - signs of poor communication skills.

## When good data makes bad plots
When you first encounter a data visualization, either from yourself or a colleague, you always want to critically ask if it's obscuring the data in any way.

Let's take a look at the steps we could take to produce and improve the plot in the view.

The data comes from an experiment where the effect of two different types of vitamin C sources, orange juice or ascorbic acid, were tested on the growth of the odontoblasts (cells responsible for tooth growth) in 60 guinea pigs.

The data is stored in the TG data frame, which contains three variables: dose, len, and supp.
```{r}
library(datasets)
TG = (ToothGrowth)
str(TG)
```
```{r}
# Initial plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.1)) +
  theme_gray(3)

# View plot
growth_by_dose
```
```{r}
# cjhange theme_gray(3) to theme_classic()
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.1)) +
  theme_classic()

# View plot
growth_by_dose
```
```{r}
# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  theme_classic()

# View plot
growth_by_dose
```
```{r}
# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  # Use the right geometry
  stat_summary(fun.y = mean,
               geom = "line",
               position = position_dodge(0.1)) +
  theme_classic()

# View plot
growth_by_dose
```
```{r}
# Change type
TG$dose <- as.numeric(as.character(TG$dose))

# Plot
growth_by_dose <- ggplot(TG, aes(dose, len, color = supp)) +
  stat_summary(fun.data = mean_sdl,
               fun.args = list(mult = 1),
               position = position_dodge(0.2)) +
  stat_summary(fun.y = mean,
               geom = "line",
               position = position_dodge(0.1)) +
  theme_classic() +
  # Adjust labels and colors:
  labs(x = "Dose (mg/day)", y = "Odontoblasts length (mean, standard deviation)", color = "Supplement") +
  scale_color_brewer(palette = "Set1", labels = c("Orange juice", "Ascorbic acid")) +
  scale_y_continuous(limits = c(0,35), breaks = seq(0, 35, 5), expand = c(0,0))

# View plot
growth_by_dose
```

