ggplot2 Tutorial

Introduction

The qplot() function is the quick-plot interface to ggplot2. Its general syntax is qplot(x, y, data, color, shape, size, facets, geom, stat), where:
- x, y are the variables to plot
- data is the dataset containing the variables
- color, shape, and size are aesthetic arguments mapped to additional variables
- facets defines the optional faceting of the plot based on a variable
- geom determines how the data are drawn, which essentially defines the type of plot generated
- stat defines the statistics to be used for the data

library(ggplot2)
library(gridExtra)
data(iris)
plot1 <- qplot(Petal.Length, data=iris, geom="histogram")
plot2 <- qplot(Petal.Length, data=iris, geom="density")
grid.arrange(plot1, plot2, nrow=1, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The iris dataset contains measurements for three different species of iris flowers. Suppose we want to generate a histogram and a density plot broken down by species. We do so as follows.

plot3 <- qplot(Petal.Length, data=iris, geom="histogram", 
               color=Species, fill=Species, alpha=I(0.5))
plot4 <- qplot(Petal.Length, data=iris, geom="density", 
               color=Species, fill=Species, alpha=I(0.5))
grid.arrange(plot3, plot4, nrow=1, ncol=2)

Next is an example of a scatterplot made with the qplot function, using the ToothGrowth dataset. It holds 60 observations of tooth length in guinea pigs for three different doses of vitamin C, delivered in two different ways: as orange juice or as ascorbic acid.

data("ToothGrowth")
qplot(dose, len, data=ToothGrowth, geom="point")

Notice that the length increases as the intake of vitamin C increases. However, we want to explore the influence of the method of intake on length. We might have the data plotted in two different colors.

qplot(dose, len, data=ToothGrowth, geom="point", col=supp)

The resulting plot suggests that the subgroup for which orange juice was administered has longer teeth than its counterpart. We can clarify this inference using facets.

qplot(dose, len, data=ToothGrowth, geom="point", facets=.~supp)

# we may want to highlight the general tendency of the data for further interpretability
qplot(dose, len, data=ToothGrowth, geom=c("point","smooth"), facets=.~supp)

We use qplot to depict time-series data, with the economics dataset.

data("economics")
head(economics)

##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

qplot(date, unemploy, data=economics, geom="line")

I. The Layers and Grammar of Graphics

We create a plot layer by layer using the ggplot() function.

# first, create basic plot object containing the data and aesthetic mapping
p <- ggplot(data=ToothGrowth, aes(x=dose, y=len, col=supp))
# entering "p" into the console results in an empty plot, since no geoms have been added
newPlot <- p + geom_point()
summary(newPlot)

## data: len, supp, dose [60x3]
## mapping:  x = dose, y = len, colour = supp
## faceting: facet_null() 
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

newPlot + geom_point(aes(col=NULL))

newPlot + geom_point(aes(col=NULL)) + theme(legend.position="none")

The example code below exhibits the coordinate system(s) available in ggplot2.

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) + 
  geom_point() + 
  coord_flip() # flipping the axes

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) + 
  geom_point() + 
  coord_fixed(ratio=0.1) # each unit on x-axis translates to 10 units on y-axis

There are two ways to create facet plots in ggplot2: grid and wrap faceting.

A. Grid Faceting (GF)

This is the more common method of faceting in ggplot2. GF splits the data into subgroups according to one or two variables in the dataset and arranges the resulting panels in rows, columns, or both.

data("mtcars")
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(data=mtcars, aes(hp)) + 
  geom_density() + 
  facet_grid(cyl~.) + 
  scale_x_log10()

ggplot(data=mtcars, aes(hp)) + 
  geom_density() + 
  facet_grid(.~cyl) + 
  scale_x_log10() # looks better

# faceting relative to two variables
mtcars$roundedHp <- round(mtcars$hp, digits=-1)
ggplot(data=mtcars,aes(mpg)) + 
  geom_density() + 
  facet_grid(roundedHp~cyl) + 
  scale_x_log10()

ggplot(data=mtcars,aes(mpg)) + 
  geom_density() + 
  facet_grid(roundedHp~cyl, margins=TRUE) + 
  scale_x_log10()
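
Wrap faceting, the second method mentioned earlier, is not demonstrated in the examples above. The following is a minimal sketch of how it might look with facet_wrap(), reusing the horsepower density plot; the choice of two columns is arbitrary and only for illustration.

```r
library(ggplot2)

# facet_wrap() creates one panel per level of a single variable and
# wraps the panels into the requested number of columns
wrapPlot <- ggplot(data=mtcars, aes(hp)) +
  geom_density() +
  facet_wrap(~cyl, ncol=2) +
  scale_x_log10()
wrapPlot
```

Unlike facet_grid(), the panels here form a ribbon that wraps onto a new row when a row is full, which tends to be convenient when one variable has many levels.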

B. Components of a Layer in ggplot2

1. Data

The ggplot2 package can only work with data in the form of R data frames. Thus, vectors must be combined into a data frame to be visualized by ggplot2’s functions.
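
As a small sketch of this point, the two vectors below (hypothetical values made up purely for illustration) are combined into a data frame before being handed to ggplot().

```r
library(ggplot2)

# two hypothetical vectors, invented for this illustration
height <- c(150, 160, 170, 180, 190)
weight <- c(52, 58, 69, 77, 88)

# ggplot() expects a data frame, so combine the vectors first
people <- data.frame(height=height, weight=weight)
vecPlot <- ggplot(data=people, aes(x=height, y=weight)) + geom_point()
vecPlot
```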

2. Aesthetic Mapping

The aes() function can be used for mapping variables to x-y plots, colors, sizes, and shapes. The aesthetic attributes that can be mapped depend on the geom function chosen.

3. Geometry

The geometry attributes determine the type of plot that will be applied to the underlying data provided with the ggplot() function.

4. Position Adjustment

Position adjustments are specified through the position argument of the geom_X() function to which the adjustment applies.

4.a. Position Adjustment of Categorical Data

These adjustments are commonly used to adjust the positioning of bars in a barplot. Different types of adjustments include:
- Dodge: using the position_dodge() function, bars in a barplot are placed next to each other within each category.
- Fill: using the position_fill() function, objects are stacked on top of one another and standardized to the same total height. In a barplot, bars of the same category are stacked upon one another and represent proportions rather than absolute values.
- Stack: using the position_stack() function, objects are stacked as with fill but without the standardization, so the plot keeps showing absolute values.

library(MASS)
g1 <- ggplot(birthwt, aes(x=race,fill=factor(ftv))) +
  geom_bar(position="stack")
g2 <- ggplot(birthwt, aes(x=race,fill=factor(ftv))) +
  geom_bar(position="dodge")
g3 <- ggplot(birthwt, aes(x=race,fill=factor(ftv))) +
  geom_bar(position="fill")
g4 <- ggplot(birthwt, aes(x=race,fill=factor(ftv))) +
  geom_bar(position=position_dodge(width=0.5))
grid.arrange(g1,g2,g3,g4, nrow=2, ncol=2)

4.b. Position Adjustment of Continuous Data

The main position adjustment for continuous data is jittering, which may be specified by the position_jitter() function.

g <- ggplot(ChickWeight, aes(x=Time, y=weight)) 
g1 <- g + geom_point()
g2 <- g + geom_point(position="jitter")
g3 <- g + geom_point(position=position_jitter(width=.5,height=0))
grid.arrange(g1,g2,g3, nrow=1, ncol=3)

C. Summary: Types of Plots

1. Histograms and Density Plots

h <- ggplot(data=iris, aes(x=Petal.Length, color=Species, fill=Species)) +
  geom_histogram(alpha=I(0.5))
d <- ggplot(data=iris, aes(x=Petal.Length, color=Species, fill=Species)) +
  geom_density(alpha=I(0.5))
grid.arrange(h,d, nrow=1, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2. Boxplots

ggplot(data=mtcars,aes(factor(cyl),hp)) +
  geom_jitter() +
  geom_boxplot(alpha=I(0.6)) +
  scale_y_log10()

3. Scatterplots

ggplot(data=ToothGrowth, aes(x=dose,y=len)) +
  geom_point() +
  stat_smooth() +
  facet_grid(.~supp)

II. Some Advanced Techniques

A. Adding Statistics

1. Smooth Lines

A smooth line in ggplot2 is, by default, a local regression that follows the data and shows the fluctuation of the data points. There are two ways to create smooth lines.

1.a. stat_smooth()

This stat function allows greater statistical control over the computation of the smooth line, i.e., how the line responds to the actual data points. You can control the type of line to be generated by providing a method argument, which can be lm, glm, gam, loess, or rlm. For datasets with fewer than 1000 observations, the default is loess; for 1000 or greater, the default is gam.
The loess method generates a line that fits a polynomial surface determined by one or more numerical predictors using local fits, whereas gam fits a generalized additive model. lm generates the familiar linear model based on least squares, while rlm fits a linear model that’s more robust than lm to outliers.

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + stat_smooth() + facet_grid(.~supp)

# suppose we want to see the linear regression
ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + stat_smooth(method="lm") + facet_grid(.~supp)

The gray bands in the plots represent the confidence interval, which is set to 95% by default. The confidence level can be changed through the level argument of the stat_smooth() function. To omit the CI band altogether, set se=FALSE.
The smooth line is computed point by point, using the observations in a neighborhood around each point. For the loess method, the span parameter controls how localized the smoothing is.
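
The effect of span, level, and se can be seen in a short sketch. It uses the mtcars dataset rather than ToothGrowth, on the assumption that a predictor with many distinct values shows the difference more clearly; the span and level values are arbitrary choices for illustration.

```r
library(ggplot2)

base <- ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point()

# smaller span: each local fit uses fewer neighbors, so the line is wigglier;
# level=0.90 narrows the band to a 90% confidence interval
wiggly <- base + stat_smooth(method="loess", formula=y~x, span=0.5, level=0.90)

# larger span: smoother line; se=FALSE drops the confidence band
smoother <- base + stat_smooth(method="loess", formula=y~x, span=1, se=FALSE)

wiggly
smoother
```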

1.b. geom_smooth()

The geom_smooth() function is nothing more than the stat method discussed directly above with all its defaults. Thus the plot generated by the code below is identical to the loess faceted plot above:

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_smooth() + geom_point() + facet_grid(.~supp)

2. Linear Regression

Recall the code from above:

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + stat_smooth(method="lm") # no faceting here

# to omit the CI band
ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + stat_smooth(method="lm", se=FALSE)

2.a. Faceting Statistics

A straightforward way to facet plots by a categorical variable was shown above. A somewhat more complicated process is required to depict statistical information relevant to each subgroup of data. We need to add margins to the facets and apply the statistical analysis desired.

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + stat_smooth() + facet_grid(.~supp,margins=TRUE)

The above facet plot reserves a margin for summary data on both subgroups. We can apply the statistics only to one facet; or apply separate statistics to each facet. The example below applies a smooth line to the first facet and a linear regression to the second.

ggplot(data=ToothGrowth, aes(x=dose,y=len,col=supp)) +
  geom_point() + 
  stat_smooth(data=subset(ToothGrowth,supp=="OJ")) +
  stat_smooth(data=subset(ToothGrowth,supp=="VC"),method="lm") +
  facet_grid(.~supp)

B. Advanced Aesthetic Mapping

As was discussed above, aesthetic mapping makes possible the application of sophisticated and personalized schemes to represent data or calculate statistical transformations based on the value of a variable used as a flagging factor. The examples below will employ datasets created by simulating random variables.

1. Typical Aesthetic Mappings

We’ve already covered the mapping of x and y variables above via the aes() function. The focus here will be on other mapping options, including color, line type, and symbol attributes, all of which can be mapped to different variables and combined in the resulting plot to achieve the desired visual effect.

The following example uses a dataset with three series, obtained by raising the integers 1 to 20 to the powers 1, 1.5, and 2. We include a flag (group) that identifies which of the three series each row belongs to.

cont <- data.frame(y=c(1:20, (1:20)^1.5, (1:20)^2), 
                   x=1:20, group=rep(c(1,2,3), each=20))
cont

##             y  x group
## 1    1.000000  1     1
## 2    2.000000  2     1
## 3    3.000000  3     1
## 4    4.000000  4     1
## 5    5.000000  5     1
## 6    6.000000  6     1
## 7    7.000000  7     1
## 8    8.000000  8     1
## 9    9.000000  9     1
## 10  10.000000 10     1
## 11  11.000000 11     1
## 12  12.000000 12     1
## 13  13.000000 13     1
## 14  14.000000 14     1
## 15  15.000000 15     1
## 16  16.000000 16     1
## 17  17.000000 17     1
## 18  18.000000 18     1
## 19  19.000000 19     1
## 20  20.000000 20     1
## 21   1.000000  1     2
## 22   2.828427  2     2
## 23   5.196152  3     2
## 24   8.000000  4     2
## 25  11.180340  5     2
## 26  14.696938  6     2
## 27  18.520259  7     2
## 28  22.627417  8     2
## 29  27.000000  9     2
## 30  31.622777 10     2
## 31  36.482873 11     2
## 32  41.569219 12     2
## 33  46.872167 13     2
## 34  52.383203 14     2
## 35  58.094750 15     2
## 36  64.000000 16     2
## 37  70.092796 17     2
## 38  76.367532 18     2
## 39  82.819080 19     2
## 40  89.442719 20     2
## 41   1.000000  1     3
## 42   4.000000  2     3
## 43   9.000000  3     3
## 44  16.000000  4     3
## 45  25.000000  5     3
## 46  36.000000  6     3
## 47  49.000000  7     3
## 48  64.000000  8     3
## 49  81.000000  9     3
## 50 100.000000 10     3
## 51 121.000000 11     3
## 52 144.000000 12     3
## 53 169.000000 13     3
## 54 196.000000 14     3
## 55 225.000000 15     3
## 56 256.000000 16     3
## 57 289.000000 17     3
## 58 324.000000 18     3
## 59 361.000000 19     3
## 60 400.000000 20     3

## represent the different sequences of data as points/lines using geom_point()/geom_line() functions  

## data represented as points
ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point()

ggplot(data=cont, aes(x=x, y=y, col=factor(group), size=factor(group))) + geom_point()

## Warning: Using size for a discrete variable is not advised.

ggplot(data=cont, aes(x=x, y=y, col=factor(group), shape=factor(group))) + geom_point()

## data represented by lines
ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_line()

ggplot(data=cont, aes(x=x, y=y, col=factor(group), size=factor(group))) + geom_line()

## Warning: Using size for a discrete variable is not advised.

ggplot(data=cont, aes(x=x, y=y, col=factor(group), linetype=factor(group))) + geom_line()

2. Mapping the Aesthetic to New Stat Variables

The stat_type functions discussed above generate new variables during calculations, which can be used in plots on top of the original dataset. An example:

## create a simple normal distribution with default mean (0) and SD (1)
set.seed(1234)
x <- data.frame(x=rnorm(1000))
ggplot(data=x, aes(x=x, fill=..count..)) + geom_histogram() ## newly created variable is 'count,' which must be surrounded by .. to avoid errors

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=x, aes(x=x)) + geom_histogram(aes(y=..density.., fill=..density..))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## combine multiple geoms on the same plot and add kernel density func on top of histogram
ggplot(data=x, aes(x=x)) +
  geom_histogram(aes(y=..density.., fill=..density..)) +
  geom_density()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## above example shows graphical effect that has data bins shaded in proportion to amount of data in each bin
## can use alpha variable to achieve similar effect
ggplot(data=x, aes(x=x)) + geom_histogram(aes(alpha=..count..))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## to change the colorscale from gray to blue
ggplot(data=x, aes(x=x)) + geom_histogram(aes(alpha=..count.., fill=..count..))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## applied to real data
ggplot(data=iris,
      aes(x=Petal.Length, col=Species, fill=Species, alpha=..count..)) + 
      geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
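
The ..count.. notation used above is the older spelling for stat-computed variables. Since ggplot2 3.3.0 the same variables can also be referenced with after_stat(). The sketch below assumes such a version is installed; the data frame name sim and the explicit bins=30 are choices made only for this illustration.

```r
library(ggplot2)

set.seed(1234)
sim <- data.frame(x=rnorm(1000))

# after_stat(count) refers to the 'count' variable computed by stat_bin(),
# the same variable that ..count.. points to
countPlot <- ggplot(data=sim, aes(x=x)) +
  geom_histogram(aes(fill=after_stat(count)), bins=30)
countPlot
```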

3. Mapping Continuous vs. Categorical Variables

An example.

# create dataset with four different distributions of r.v.'s
dist <- data.frame(value=rnorm(10000, 1:4), group=1:4)
# all distributions just created are normal w. SD=1 and mean=value of group
# plot data as jittered points using different color per group
ggplot(dist, aes(x=group, y=value, color=group)) +
      geom_jitter(alpha=0.5)

# group is numeric, so it is treated as continuous (note the gradient color scale), although it is supposed to be categorical. We fix the coloring by converting group to a factor.
ggplot(dist, aes(x=group, y=value,
                 color=as.factor(group))) +
      geom_jitter(alpha=0.5)

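# similar example on real data; cyl was converted to a factor earlier, so convert it back to numeric to see the continuous color scale
ggplot(mtcars, aes(mpg, wt)) + geom_point(aes(color=as.numeric(as.character(cyl))))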

ggplot(mtcars, aes(mpg, wt)) + geom_point(aes(color=factor(cyl)))

4. Adding Text and Reference Lines

An example.

x <- data.frame(x=rnorm(1000))
ggplot(x, aes(x=x)) +
  geom_histogram(alpha=0.5) +
  geom_vline(aes(xintercept=median(x)),
             color="red", linetype="dashed", size=1) +
  geom_hline(aes(yintercept=50), col="black", linetype="solid")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We add text annotations.

ggplot(x, aes(x=x)) +
  geom_histogram(alpha=0.5) +
  geom_vline(aes(xintercept=median(x)),
             color="red", linetype="dashed", size=1) +
  geom_hline(aes(yintercept=50), col="black", linetype="solid") +
  geom_text(aes(x=median(x), y=80), label="Median", hjust=1) +
  geom_text(aes(x=median(x), y=80, label=round(median(x), digit=3)), hjust=-0.5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## annotation using annotate() functions instead of geom_text
## example drawing a shaded area on the histogram covering the interquartile range (the middle 50%)
ggplot(x, aes(x=x)) +
  geom_histogram(alpha=0.5) +
  geom_vline(aes(xintercept=median(x)),
             color="red", linetype="dashed", size=1) +
  geom_hline(aes(yintercept=50), col="black", linetype="solid") +
  geom_text(aes(x=median(x), y=80), label="Median", hjust=1) +
  geom_text(aes(x=median(x), y=80, label=round(median(x), digit=3)), hjust=-0.5) +
  annotate("rect", xmin=quantile(x$x, probs=0.25), xmax=quantile(x$x, probs=.75),
           ymin=0, ymax=100,
           alpha=.2, fill="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4.a. Adding Text and Reference Lines with Facets

We can apply the techniques above to plots divided into facets. Using the dist dataset used above, consisting of four different normal distributions, we apply a reference line and annotation on the first facet as follows.

dist <- data.frame(value=rnorm(10000, 1:4), group=1:4)
ggplot(dist, aes(x=value, fill=as.factor(group))) +
  geom_histogram(alpha=0.5) +
  geom_vline(data=subset(dist, group=="1"),
             aes(xintercept=median(value)), color="black",
             linetype="dashed", size=1)+
  geom_text(data=subset(dist, group=="1"),
            aes(x=median(value), y=350, 
            label=round(median(value), digit=3),
            hjust=-0.2)) +
  facet_grid(.~group)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We apply the same techniques as well as a method to maintain consistent coloring as follows.

myColors <- scales::hue_pal()(4)
ggplot(dist, aes(x=value, fill=as.factor(group))) +
  geom_histogram(alpha=0.5) +
  ### facet 1
  geom_vline(data=subset(dist, group==1),
             aes(xintercept=median(value)), 
             color=myColors[1], linetype="dashed", size=1.5) +
  geom_text(data=subset(dist, group==1),
            aes(x=median(value), y=350, label=round(median(value), digit=3)),
            hjust=-0.2) +
  ### facet 2
  geom_vline(data=subset(dist, group==2),
             aes(xintercept=median(value)), 
             color=myColors[2], linetype="dashed", size=1.5) +
  geom_text(data=subset(dist, group==2),
            aes(x=median(value), y=350, label=round(median(value), digit=3)),
            hjust=-0.2) +
  ### facet 3
  geom_vline(data=subset(dist, group==3),
             aes(xintercept=median(value)), 
             color=myColors[3], linetype="dashed", size=1.5) +
  geom_text(data=subset(dist, group==3),
            aes(x=median(value), y=350, label=round(median(value), digit=3)),
            hjust=-0.2) +
  ### facet 4
  geom_vline(data=subset(dist, group==4),
             aes(xintercept=median(value)), 
             color=myColors[4], linetype="dashed", size=1.5) +
  geom_text(data=subset(dist, group==4),
            aes(x=median(value), y=350, label=round(median(value), digit=3)),
            hjust=-0.2) +
  facet_grid(.~group)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
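
The per-facet layers above repeat the same code for each group. One possible alternative, sketched here under the assumption that a small summary data frame is acceptable, is to compute the medians once and let the faceting variable route each row to its panel. The names distDemo and meds and the seed are hypothetical and used only for this illustration.

```r
library(ggplot2)

set.seed(1234)  # arbitrary seed, only to make the sketch reproducible
distDemo <- data.frame(value=rnorm(10000, 1:4), group=1:4)

# one row per group holding that group's median
meds <- aggregate(value ~ group, data=distDemo, FUN=median)

medPlot <- ggplot(distDemo, aes(x=value, fill=as.factor(group))) +
  geom_histogram(alpha=0.5, bins=30) +
  # because meds contains the faceting variable, each line lands in its own panel
  geom_vline(data=meds, aes(xintercept=value, color=as.factor(group)),
             linetype="dashed") +
  geom_text(data=meds, aes(x=value, y=350, label=round(value, digits=3)),
            hjust=-0.2, inherit.aes=FALSE) +
  facet_grid(.~group)
medPlot
```

Mapping color to the same factor as fill keeps each line's color in step with its histogram without indexing a palette by hand.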

C. Polar Coordinate-Based Plots

We use the mtcars dataset.

1. Pie Charts

A pie chart in ggplot2 is essentially a stacked bar chart in polar coordinates. We thus create a stacked bar chart then change the coordinate system.

plot <- ggplot(data=mtcars,
               aes(x=factor(1), 
               fill=factor(cyl))) +
        geom_bar(width=1)
plot

plot + coord_polar(theta="y")

ggplot(data=mtcars,
       aes(x=factor(1),
       fill=factor(cyl))) +
geom_bar(width=0.5) +
coord_polar(theta="y")

2. Bullseye Charts

Bullseye charts represent variables in a circular way such that the area plotted is proportional to the variable value. They differ from pie charts in that the areas are represented as concentric circles, instead of slices.

plot <- ggplot(data=mtcars,
               aes(x=factor(1), 
               fill=factor(cyl))) +
        geom_bar(width=1)
plot + coord_polar()

3. Coxcomb Diagrams

A coxcomb diagram resembles a pie chart, except that the areas representing the data aren’t normalized to fill the entire circle or pie.

ggplot(data=mtcars, aes(x=cyl, fill=factor(cyl))) +
  geom_bar(width=1) +
  coord_polar(theta="x")

III. Controlling Plot Details

We now discuss methods to change default details of plots created with the ggplot2 package.

A. Plot Title and Axis Labels

By default, a plot in ggplot2 has no title, and its axis labels are simply the names of the variables represented. To personalize titles/axis labels, the methods below may be useful.

set.seed(1234)
x <- data.frame(x=rnorm(1000))
plot <- ggplot(data=x, aes(x=x, fill=..count..)) + geom_histogram()
plot2 <- plot + labs(title="Customized Title for Histogram", x="Random Variable X", y="Number of Occurrences")
grid.arrange(plot, plot2, nrow=1, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

B. Axis Scales

In general, scales are created by the aesthetic mapping (the ‘aes()’ command). They tie the visual objects in the plot back to the underlying data, which is why each scale produces either an axis or a legend.

1. Discrete Axis

One may have a plot with discrete scales, for instance, when representing data grouped in categories along one of the axes. The example below uses the dataset with the four different normal distributions created above. We directly define the grouping variable as a factor which obviates the need to convert numbers to factors later on.

dist <- data.frame(value=rnorm(10000, 1:4), group=factor(1:4))
## we visualize each distinct normal distribution as a boxplot, with each defined by the variable group along the x-axis
myBoxplot <- ggplot(dist, aes(x=group, y=value, fill=group)) +
  geom_boxplot()
myBoxplot

## if we want to manually set the order of the discrete variables
myBoxplot + scale_x_discrete(limits=c("1","3","2","4"))

## if we want simply to invert the order of the discrete variables
myBoxplot + scale_x_discrete(limits=rev(levels(dist$group)))

2. Continuous Axis

Two common adjustments to continuous scales are modifying the default data range represented and reversing the direction of the data. We use the dataset dist from directly above, since the variable on the y-axis (“value”) is continuous.

myBoxplot2 <- myBoxplot + scale_y_continuous(limits=c(-10,10))
grid.arrange(myBoxplot, myBoxplot2, nrow=1, ncol=2)

## if we need to make sure that a value in the range is included in the plot
grid.arrange(myBoxplot, myBoxplot + expand_limits(y=-10), nrow=1, ncol=2)

## to remove axis tick marks
myBoxplot3 <- ggplot(subset(dist, group=="1"),
                     aes(x=group, y=value, fill=group)) + geom_boxplot()

grid.arrange(myBoxplot3, myBoxplot3 + scale_x_discrete(breaks=NULL) + xlab("Distribution of Variable 1"), nrow=1, ncol=2)

3. Axis Transforms

Plot scales are linear by default. This can be overridden by a transformed scale via several methods. The two main options are to transform the axis by changing the scale or by changing the coordinate system. These are shown below.

cont <- data.frame(y=c(1:20, (1:20)^1.5, (1:20)^2), x=1:20,
                   group=rep(c(1,2,3), each=20))
## create scatterplot first; then transform y-axis into log10 values
myScatter <- ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point()
myScatterLog <- myScatter + scale_y_log10()
## alternatively, transform the coordinate system; the transformation is then applied after the scale has been defined, so the axis still shows the original values but spaces them on a log axis
grid.arrange(myScatter, myScatterLog, myScatter + coord_trans(y="log10"), nrow=1, ncol=3)
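
The difference between the two approaches can also be checked numerically. The sketch below rebuilds the same data under the hypothetical name contDemo and uses layer_data() to inspect what each plot hands to the geom; both comparisons are expected to hold.

```r
library(ggplot2)

contDemo <- data.frame(y=c(1:20, (1:20)^1.5, (1:20)^2), x=1:20,
                       group=rep(c(1,2,3), each=20))
sc <- ggplot(data=contDemo, aes(x=x, y=y, col=factor(group))) + geom_point()

# layer_data() returns the data as prepared for the geom
yScale <- layer_data(sc + scale_y_log10())$y         # values are already log10-transformed
yCoord <- layer_data(sc + coord_trans(y="log10"))$y  # values are still on the original scale

# sort before comparing so the check does not depend on row order
all.equal(sort(yScale), sort(log10(contDemo$y)))
all.equal(sort(yCoord), sort(contDemo$y))
```

This matters once statistics are involved: with a transformed scale, a smoother or summary is computed on the log values, whereas with a transformed coordinate system it is computed on the original values and only drawn on a log axis.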

C. Legends

We explore methods to change four main aspects of legends: title, labels, legend box, position. Sometimes, a legend is unnecessary, in which case we can remove it as follows:

myBoxplot <- ggplot(dist, aes(x=group, y=value, fill=group)) +
  geom_boxplot()
myBoxplotNoLegend <- myBoxplot + guides(fill="none")
myBoxplotNoLegend2 <- myBoxplot + scale_fill_discrete(guide="none") ## same as above
grid.arrange(myBoxplot, myBoxplotNoLegend, myBoxplotNoLegend2, nrow=1, ncol=3)

1. Legend Title

The following shows how to modify the axis name and legend title by altering the name argument of the scale functions corresponding to each aesthetic.

myBoxplot + 
  scale_x_discrete(name="Name of my x-axis HERE") +
  scale_fill_discrete(name="Name of my legend HERE")

# to remove name of the legend
myBoxplot + scale_fill_discrete(name="")

2. Legend Keys and Key Labels

Keys refer to the symbols that relate the legend to the plot, whereas key labels describe what the keys represent. Two common modifications are changing the order of the elements and modifying the text of the key labels.

myBoxplot + scale_fill_discrete(breaks=c("1","3","2","4"))

## to reverse the legend order
myBoxplot + scale_fill_discrete(guide=guide_legend(reverse=TRUE))

## to change key labels
myBoxplot + scale_fill_discrete(breaks=c("1","2","3","4"),
                                labels=c("Dist 1", "Dist 2", "Dist 3", "Dist 4"),
                                name="LEGENDARY NAME")

D. Themes

ggplot2 offers a series of functions that enable detailed control of plot appearance. These functions do not affect the representation of the data in terms of the geometry employed or how the data is transformed via scales. A simple example:

grid.arrange(myBoxplot + theme_grey(), myBoxplot + theme_bw(), nrow=1, ncol=2)

To adjust individual settings of a theme, use the theme() function and specify the elements to be customized. The theme() function requires two things: the element of the plot to be modified, and the theme element (essentially a function that allows custom formatting of that element). We use the theme() function to customize the legend, axes, and plot background.

1. Legend Themes

We can modify a legend’s background, position, and margins.

## add a rectangular box around the legend
myBoxplot + theme(legend.background=element_rect(color="black", fill="gray90"))

## control key background
myBoxplot + theme(legend.key=element_rect(color="black", fill="yellow"))

## control the space around the legend
require(grid)

## Loading required package: grid

myBoxplot + theme(legend.margin=margin(3, 3, 3, 3, "cm"))

## control legend text (size, position, color, font)
myBoxplot + theme(legend.text=element_text(size=20,color="red",angle=45,face="italic"))

## control legend position
grid.arrange(myBoxplot + theme(legend.position="bottom"), myBoxplot + theme(legend.position=c(0.5,0.5)), nrow=1,ncol=2)

2. Axis and Title Themes

We can specify the details of how axis text is presented, including type of characters, size, and position. A simple example.

p1 <- myBoxplot + theme(axis.text = element_text(color="blue",face="italic")) # tick text on axes changed
p2 <- myBoxplot + theme(axis.title.y = element_text(size=rel(1.5), angle=0)) # axis title text modified
grid.arrange(p1, p2, nrow=1, ncol=2)

We can use the title argument to control all titles in a plot, i.e., for the axes, the legend, and the plot as a whole.

p1 <- myBoxplot + labs(title="This is the greatest boxplot") +
  theme(title=element_text(size=rel(1.5), color="blue"))
## to modify only some title elements
p2 <- myBoxplot + labs(title="This is a so-so boxplot") +
  theme(plot.title=element_text(size=rel(1.5), color="blue"))
grid.arrange(p1, p2, nrow=1, ncol=2)

We can modify other aspects of axes, including the appearance of the lines representing the axes and the tick marks.

p1 <- myBoxplot + theme(axis.line=element_line(size=3, color="red", linetype="solid")) ## modifying axes lines
require(grid)
p2 <- myBoxplot + theme(axis.ticks.length=unit(.85,"cm"), axis.text.x=element_text(margin=margin(t=.85, unit="cm")), axis.text.y=element_text(margin=margin(r=.85, unit="cm")))
grid.arrange(p1, p2, nrow=1, ncol=2)

3. Plot Background Themes

The most common modifications to a plot’s background are to its panel background and panel grid.

## example with boxplot
myBoxplot + theme(panel.background=element_rect(fill="gray80"), panel.grid.major=element_line(color="blue"), panel.grid.minor=element_line(color="white", linetype="dotted", size=1))

## example with scatterplot
myScatter + theme(panel.background=element_rect(fill="gray80"), panel.grid.major=element_line(color="blue"), panel.grid.minor=element_line(color="white", linetype="dotted", size=1))

We can also specify the actual plot background and border (as opposed to the specs on the grid modified above).

myScatter + theme(plot.background=element_rect(fill="green", color="red", size=2, linetype="dotted" ))

4. Facet Themes

The same formatting styles can be applied to facet plots. An example:

myScatter + facet_grid(.~group) + theme(panel.background=element_rect(fill="lightblue"))

Other customizations are shown.

## customize top strip area
myScatter + facet_grid(.~group) + theme(strip.background=element_rect(color="lightblue", fill="pink", size=3, linetype="dashed"))

## when there is a longer title, it might be better to change the text size and orientation
myScatter + facet_grid(.~group, labeller=label_both) + theme(strip.text.x=element_text(color="red", angle=45, size=15, hjust=0.5, vjust=0.5))

## customize margin between panels
myScatter + facet_grid(.~group) + theme(panel.spacing=unit(2,"cm"))

IV. Plot Output

1. Multiple Plots on One Page

One can combine multiple plots in two ways.

1.a. Arranging Plots by Row and Column

The example below uses viewports from the grid package to place plots. First, we define a set of variables to hold some plots.

library(ggplot2)
library(grid)
data(Orange)
x1 <- ggplot(Orange, aes(age, circumference)) +
  geom_point(aes(colour=factor(Tree)))
## removing the legend
x2 <- x1 + theme(legend.position="none")
## remove aesthetic
x3 <- ggplot(Orange, aes(age, circumference)) + geom_point()
## plot without data
x4 <- x3 + theme(panel.border=element_rect(linetype="solid", colour="black"))
x5 <- x3 + theme(axis.ticks=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(), panel.grid.major=element_blank(), panel.grid.minor=element_blank(), panel.background=element_blank()) + ylab(" ") + xlab(" ")

We now render the plots on one page.

## start from a blank page, then define a 2 x 2 layout
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))
print(x5, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(x4, vp=viewport(layout.pos.row = 1, layout.pos.col = 2))
print(x3, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))
print(x2, vp=viewport(layout.pos.row = 2, layout.pos.col = 2))
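
The print() calls above repeat the same viewport() pattern, so a small helper can shorten them. The following is a minimal sketch, assuming only the built-in Orange dataset; vp_at(), p1, p2, and plots are hypothetical names introduced here for illustration and are not part of grid or ggplot2.

```r
library(ggplot2)
library(grid)

## hypothetical helper: viewport for one cell of the current layout
vp_at <- function(row, col) {
  viewport(layout.pos.row=row, layout.pos.col=col)
}

## two simple plots from the built-in Orange dataset
p1 <- ggplot(Orange, aes(age, circumference)) + geom_point()
p2 <- ggplot(Orange, aes(age, circumference)) + geom_line(aes(group=Tree))
plots <- list(p1, p2)

## blank page with a 1 x 2 layout
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow=1, ncol=2)))
for (i in seq_along(plots)) {
  ## explicit print() is required inside a loop
  print(plots[[i]], vp=vp_at(1, i))
}
## leave the layout viewport again
popViewport()
```

Keeping the plots in a list makes it easy to change the number of rows and columns without rewriting each print() call.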

1.b. Specifying Plot Position

Whereas the method discussed directly above is ideal for rendering multiple plots that are clearly separate from one another, it isn’t suited for more precise control of plot position, e.g., if we want to partially superimpose two plots. In this case, we still utilize the viewport() function, but specify the x, y, width, and height arguments.

We demonstrate using the cont dataset created above. In particular, we are interested in the behavior of the data on both the normal and logarithmic scales.

myScatter <- ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point()
myScatterLog <- myScatter + scale_y_log10() + theme(legend.position="none")

print(myScatter, vp=viewport(width=1, height=1, x=0.5, y=0.5))
print(myScatterLog, vp=viewport(width=0.4, height=0.4, x=0.315, y=0.76))

2. Saving Plots to Files

There are three ways to save created plots. They are discussed below.

2.a. Manual Save

This is done by using the menus in the graph window that appears when the plot is rendered.

2.b. Saving a Plot Without Rendering

This method is useful when running scripts that produce multiple plots which one wants to save. There is a separate function for each file format in which a plot can be saved, the most common of which are the png(), pdf(), and jpeg() functions. The examples below use the pdf() function.

## saving a single plot
pdf("myFile.pdf")
## explicit print() is needed when this code runs inside source(), a loop, or a function
print(ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point())
dev.off()

## png 
##   2

## saving multiple plots to the same pdf file; the code below creates a two-page pdf file
pdf("myFile2.pdf")
## explicit print() is needed when this code runs inside source(), a loop, or a function
print(ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point())
print(ggplot(Orange, aes(age, circumference)) + geom_point(aes(colour=factor(Tree))))
dev.off()

## png 
##   2
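
The same open, draw, close pattern applies to the other device functions. The following is a minimal sketch with png(), assuming the built-in Orange dataset rather than cont; the file name, the pngFile and p variables, and the pixel dimensions are illustrative choices, not values from this tutorial.

```r
library(ggplot2)

## write to a temporary directory so the working directory stays clean
pngFile <- file.path(tempdir(), "orangePlot.png")

p <- ggplot(Orange, aes(age, circumference)) +
  geom_point(aes(colour=factor(Tree)))

## for png(), width and height are in pixels by default
png(pngFile, width=800, height=600)
## explicit print() so the plot is drawn even inside source() or a function
print(p)
## close the device; the file is only complete after this call
dev.off()

file.exists(pngFile)
```

Storing the plot in a variable first keeps the device open for as short a time as possible and makes it easy to write the same plot to several formats.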

2.c. Saving a Plot After Rendering

Sometimes we want to render a plot to make sure it’s what we want before saving it. This can be done with either the dev.copy() or ggsave() function. Examples:

## dev.copy()
ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point()

dev.copy(pdf, file="newFile.pdf")

## pdf 
##   3

dev.off()

## png 
##   2

## ggsave()
ggplot(data=cont, aes(x=x, y=y, col=factor(group))) + geom_point()

ggsave(filename="newFile2.pdf")

## Saving 7 x 5 in image
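
By default, ggsave() saves the last plot displayed and takes its size from the current graphics device, which is why a size message is printed. The sketch below passes the plot and the dimensions explicitly, so the result does not depend on what was rendered last; it assumes the built-in Orange dataset, and p, outFile, and the 6 x 4 inch size are illustrative choices.

```r
library(ggplot2)

## store the plot in a variable instead of relying on the last plot shown
p <- ggplot(Orange, aes(age, circumference)) + geom_point()

## write to a temporary directory so the working directory stays clean
outFile <- file.path(tempdir(), "orangePlot.pdf")

## the file format is inferred from the .pdf extension
ggsave(filename=outFile, plot=p, width=6, height=4, units="in")

file.exists(outFile)
```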

V. Special Applications of ggplot2

1. Plotting Maps with ggplot2 and ggmap

1.a. Mapping Representations with ggplot2 and maps

The maps package contains map data for the world and for selected countries, along with example datasets that can be combined with those maps. Its main limitation is the small number of maps available. A simple example:

require(maps)

## Loading required package: maps

## 
##  # maps v3.1: updated 'world': all lakes moved to separate new #
##  # 'lakes' database. Type '?world' or 'news(package="maps")'.  #

data(us.cities)
big_cities <- subset(us.cities, pop > 500000)
qplot(long, lat, data=big_cities) + borders("state", size=0.5)

Since the us.cities database includes state information, we can select only cities in a given state.

ca_cities <- subset(us.cities, country.etc=="CA")
ggplot(ca_cities, aes(long, lat)) +
  ## borders() names this argument colour; the spelling color is passed on separately and triggers a duplicate-aesthetic warning
  borders(database="county", regions="california", colour="grey70") +
  geom_point()

We can also map an aesthetic to a variable in a map plot. In the example below, point size represents population.

data(world.cities)
capitals <- subset(world.cities, capital == 1)
ggplot(capitals, aes(long, lat)) +
  borders("world", fill="lightblue", col="cornflowerblue") +
  geom_point(aes(size=pop), col="darkgreen")

## add text to the map
city.Italy <- world.cities[world.cities$country.etc=="Italy",]
city.Italy.big <- subset(city.Italy, pop > 500000)

ggplot(city.Italy.big, aes(long, lat)) + borders("italy") +
  geom_point(aes(size=pop)) +
  geom_text(aes(long, lat, label=name), hjust=-0.2)

## similar example for the world
capitals.big <- subset(capitals, pop > 5000000)

ggplot(capitals.big, aes(long, lat)) + borders("world") +
  geom_point(aes(size=pop)) +
  geom_text(aes(long, lat, label=country.etc), hjust=-0.2, size=4)

We can also change the coordinate system of a map; the most important choice is the type of projection. Note that coord_map() requires the mapproj package to be installed. An example:

ggplot(capitals.big, aes(long, lat)) + borders("world") +
  geom_point(aes(size=pop)) +
  geom_text(aes(long, lat, label=country.etc), hjust=-0.2, size=4) +
  coord_map(projection="ortho", orientation=c(41,20,0))

To display data from another dataset on a map, we need both the map data and the other dataset as R data frames, and then match them on a shared key. An illustration:

data(votes.repub)
states <- map_data("state")
class(votes.repub) # data is represented as a matrix, which needs to be converted into df

## [1] "matrix"

## some data cleansing to ensure consistency
repubVotes <- as.data.frame(votes.repub)
names(repubVotes) <- paste("Year", names(repubVotes), sep="")
repubVotes$region <- tolower(rownames(repubVotes))

## combine datasets by matching state names
finalData <- merge(states, repubVotes, by="region")
finalData <- finalData[order(finalData$order),]

## plot
ggplot(finalData, aes(long, lat)) +
  borders("state") + geom_polygon(aes(group=group, fill=Year1976))
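
Because merge() keeps only rows whose keys appear in both data frames, it can be worth checking which region names fail to match before plotting. The following is a minimal sketch under that assumption; mapRegions and voteRegions are hypothetical names introduced here, and the result depends on the installed version of the maps data.

```r
library(ggplot2)
library(maps)

data(votes.repub)
states <- map_data("state")

## region names on each side of the merge, in the same lower-case form
mapRegions  <- unique(states$region)
voteRegions <- tolower(rownames(votes.repub))

## regions drawn on the map that have no row in the votes data
setdiff(mapRegions, voteRegions)

## states in the votes data that do not appear on the map
setdiff(voteRegions, mapRegions)
```

Any region listed by the first setdiff() call is dropped by the merge and therefore gets no fill in the final plot, although borders("state") still draws its outline.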