1. Introduction

In this session we will extend the simple visualizations you have done in earlier sessions using core (base) R plotting functions to the ggplot2 package - a dedicated visualization package based on the Grammar of Graphics (Wilkinson, 2005) (hence the gg in the name of the package). This conceptualizes graphics (and plots) in terms of their theoretical components. The approach is to handle each element of the graphic separately in a series of layers and in so doing to control each part of the plot. This is different to the basic plot functions used above which applies specific plotting functions based on the type or class of data that were passed to them.

The ggplot2package is included as part of the tidyverse () which was introduced in earlier sessions.

The ggplot2 package can be loaded implicitly by loading the tidyverse as you have done in earlier sessions

library(tidyverse)

Or having installed thetidyverse, the ggplot2 package can be loaded on its own:

library(ggplot2)

In this practical we will mainly work with the Georgia data that you have encountered in early sessions. This data comes with a number of R packages bu there we will use the data in the GISTools package. You should load this and the following packages into your workspace. You should have all of these except the hexbin package.

# load into the R session
library(tidyverse)
library(GISTools)
library(kableExtra)
library(hexbin)

Now load the Georgia data and examine what is loaded

data(georgia)
ls()
## [1] "georgia"       "georgia.polys" "georgia2"

The objects georga and georgia2 are sp objects (SpatialPolygonsDataFrame) and georgia.polys object is list where each element contains the polygon coordinates for each county of the georgia2 object. In this session we will work a tibble of the data.frame of georgia. But before we do that assign the data.framea todf`:

df <- data.frame(georgia)

This is like the attribute table of a shapefile. You could examine the data in the manner described in the first parts of Session 2. But here we will use the head() function to examine the first six records (rows) in the data table.

head(df)
##   Latitude  Longitud TotPop90 PctRural PctBach PctEld PctFB PctPov
## 0 31.75339 -82.28558    15744     75.6     8.2  11.43  0.64   19.9
## 1 31.29486 -82.87474     6213    100.0     6.4  11.77  1.58   26.0
## 2 31.55678 -82.45115     9566     61.7     6.6  11.11  0.27   24.1
## 3 31.33084 -84.45401     3615    100.0     9.4  13.17  0.11   24.8
## 4 33.07193 -83.25085    39530     42.7    13.3   8.64  1.43   17.5
## 5 34.35270 -83.50054    10308    100.0     6.4  11.37  0.34   15.1
##   PctBlack        X       Y    ID     Name MedInc
## 0    20.76 941396.6 3521764 13001  Appling  32152
## 1    26.86 895553.0 3471916 13003 Atkinson  27657
## 2    15.42 930946.4 3502787 13005    Bacon  29342
## 3    51.67 745398.6 3474765 13007    Baker  29610
## 4    42.39 849431.3 3665553 13009  Baldwin  36414
## 5     3.49 819317.3 3807616 13011    Banks  41783

You can see that this has attributes for the counties of Georgia and a number of variables are included.

Now we will convert this to a tibble and assign to an object call tb and again examine it:

tb <- as.tibble(df)
tb
## # A tibble: 159 x 14
##    Latit… Longi… TotPo… PctRu… PctB… PctE… PctFB PctP… PctB…      X      Y
##  *  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1   31.8  -82.3  15744   75.6  8.20 11.4  0.640  19.9 20.8  941397 3.52e⁶
##  2   31.3  -82.9   6213  100    6.40 11.8  1.58   26.0 26.9  895553 3.47e⁶
##  3   31.6  -82.5   9566   61.7  6.60 11.1  0.270  24.1 15.4  930946 3.50e⁶
##  4   31.3  -84.5   3615  100    9.40 13.2  0.110  24.8 51.7  745399 3.47e⁶
##  5   33.1  -83.3  39530   42.7 13.3   8.64 1.43   17.5 42.4  849431 3.67e⁶
##  6   34.4  -83.5  10308  100    6.40 11.4  0.340  15.1  3.49 819317 3.81e⁶
##  7   34.0  -83.7  29721   64.6  9.20 10.6  0.920  14.7 11.4  803747 3.77e⁶
##  8   34.2  -84.8  55911   75.2  9.00  9.66 0.820  10.7  9.21 699012 3.79e⁶
##  9   31.8  -83.2  16245   47.0  7.60 12.8  0.330  22.0 31.3  863021 3.52e⁶
## 10   31.3  -83.2  14153   66.2  7.50 12.0  1.19   19.3 11.6  859916 3.47e⁶
## # ... with 149 more rows, and 3 more variables: ID <int>, Name <chr>,
## #   MedInc <dbl>

Notice the difference. Calling the tibble object automatically just prints out the first 10 rows and only the columns that fit in the window. Also note that we also told what type of data are held in the each column (field or attribute). Enter

class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"

You will notice that 3 classes are listed. The tibble is, you guessed it, the tidverse’s favourite data table format. When we come to look at spatial data with sf package in later on you will see that the attributes are held in tibble-like structures with the polygon data.

A tibble is a re-imagining of the data frame. Earlier on a data.frame object called cen.dat was created when you executed the code below:

cen.dat <- read.csv("census_data.csv")

The tibble format seeks to keep what been found to be good and effective about data.frame objects, and getting rid of what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

Some of the features of a tibble included

df
tb

You should explore the tibble vignette in the in your own time after this session. This is displayed by entering

vignette("tibble")

2. Quick plots

Plots can be created using either the qplot or ggplot functions in the ggplot2 package. The function qplot is used to produce quick simple plots in a similar way to the plot function. It takes x and y as and a data argument for a data.frame containing x and y. The figure below is created by defining a vector of the sequence from 0 to 2 \(pi\) (x), its \(sin\) (y) and a y vector with a small random error term (y2r).

x <- seq(0,2*pi,len=100)
y = sin(x)
y2 <- y + rnorm(100,0,0.1)

These can then be plotted using qplot:

qplot(x,y2,col=I('darkred')) + 
  geom_line(aes(x,y), col="darkblue", size = 1.5) + 
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20,face="bold"))

Notice how the plot type is first specified (in this case qplot()) and then subsequent lines include instructions for what to plot an how to plot them. Here geom_lines() was specified followed by some style instructions.

Try adding

  theme_bw()

or

  theme_dark()

to the above. Remember that you need to include a + for each additional element in ggplot.

qplot(x,y2,col=I('darkred')) + 
  geom_line(aes(x,y), col="darkgreen", size = 1.5) + 
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20,face="bold")) +
  theme_bw()

3. Different ggplot options

In subsequent sections, different flavours and types of ggplot will be introduced. But this is a vast package and involves a bit of a learning curve at first. To fully understand all that it can do is beyond the scope of this workshop but there is plenty of help and advice on the internet. For example the following sites may be useful:

In the next sections we will focus on developing different kinds kinds of plots for different kinds of variables in the tb object, including

The concept of faceting for groups of comparative plots (for example under different treatments) will also be explored.

before that we need to create some categorical variables to play with later in the plots. First an indicator of rural/ not-rural, which we set to values using the levels function.

tb$rural <- as.factor((tb$PctRural > 50) +  0)
levels(tb$rural) <- list("Non-Rural" = 0, "Rural"=1)

(note the use of the + 0 to convert the TRUE and FALSE values to 1s and 0s).

And then a an income category variable around the IQR of the MedInc (median county income) variable. There are fancier ways to do it but the code below is tractable:

tb$IncClass <- rep("Medium", nrow(tb)) 
tb$IncClass[tb$MedInc >=  41204] = "High"
tb$IncClass[tb$MedInc <=  29773] = "Low"

This can be turned into a factor as below:

tb$IncClass <- factor(tb$IncClass,
                      levels=c("Low","Medium","High"), 
                      ordered=TRUE)

the distributions can be checked

table(tb$IncClass)
## 
##    Low Medium   High 
##     40     79     40

Before really getting into ggplot we will briefly show bar plots. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of counties in the tb dataset, grouped by IncClass. The chart shows that there more counties with average income as you would expect as the data was derived the IQR .

ggplot(data = tb) + 
  geom_bar(mapping = aes(x = IncClass))

On the x-axis, the chart displays the IncClass variable and the y-axis automatically displays the count. So unlike scatterplots which the raw values of your dataset. Others like bar charts, calculate new values to plot:

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = tb) + 
  stat_count(mapping = aes(x = IncClass))

3.1 Scatterplots

Scatterplots show 2 variables together and we can examine data pairs in the census R object. For example consider PctBach and PctEld, representing the % of the county population with bachelors degrees and that are elderly (whatever that means). The ggplot call has a basic syntax of

ggplot(data = <a>, aes(x,y,colour))

followed by the type of plot:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point()

The plot can be enhanced by passing a grouping variable to aes:

ggplot(data = tb, mapping =  aes(x = PctBach, y = PctEld,colour=rural)) + 
  geom_point()

Now try using the IncClass variable created earlier as the group. What happens? What do you see? Does this make sense? Are there any trends. I think tentatively we could say that the Poor areas are more elderly and have fewer people with bachelors degrees.

This might be confirmed by adding a trend line:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point() +
  geom_smooth()

By default the type of trend line is automatically selected. Try adjusting the code above and specifying the trend line as below:

geom_smooth(method = "lm")

Also note that style templates can be added and colours changed:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point() +
  geom_smooth(method = "lm", col = "red", fill = "lightsalmon") + 
  theme_dark()

You can explore other styles by trying the ones listed under the help for theme_dark.

And try reversing the order of the elements. For example in the plot above, the geom_smooth() element comes after the geom_points() element. What do you notice if you reverse their order in the code?

As with all plots the axis labels and title can be specified (note the use of the shape parameter and its effects:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point(shape = 23, fill = "red") +
  geom_smooth(method = "lm") + 
  theme_bw() +
  xlab("% Bachelors") +
  ylab("% Elderly") +
  ggtitle("My title", subtitle = "my subtitle")

Try adding the colour=IncClass parameter to the mapping aesthetics in the code above. What happens? Why do you think that is?

Finally you can specify different types of points using the shape parameter as alluded to above. R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the colour and fill aesthetics. The hollow shapes (0–14) have a border determined by colour; the solid shapes (15–18) are filled with colour; the filled shapes (21–24) have a border of colour and are filled with fill.

3.2 Histograms

We can use histograms to examine the distributions of the income across the 159 counties

ggplot(tb, aes(x=MedInc)) + 
  geom_histogram(, binwidth = 5000, colour = "red", fill = "grey")

Of course the axes can be labelled, the theme set and title included as with the above examples, by including additional elements to the plot.

As in Session 2 we can also plot probability densities with `ggplot"

ggplot(tb, aes(x=MedInc)) + 
  geom_histogram(aes(y=..density..),
    binwidth=5000,colour="white") +
   geom_density(alpha=.4, fill="darksalmon") +
   geom_vline(aes(xintercept=median(MedInc, na.rm=T)),   # Ignore NA values for mean
               color="orangered1", linetype="dashed", size=1)

And we can also plot histograms of multiple groups of data values together on the same plot to perhaps compare distributions in a single graphic using the fill option. Obviously we have to be careful about which variables we pot together for this to make sense.

ggplot(data=tb,aes(x=MedInc,fill=IncClass)) + 
  geom_histogram(binwidth = 2000) 

And again the probabilities densities can be plotted.

ggplot(data = tb, aes(x = MedInc, fill = IncClass)) +
  geom_histogram(aes(y = ..density..), binwidth = 2000, col = "black") +
  scale_fill_manual("Income Class", 
    values = c("orange", "palegoldenrod","firebrick3")) +
  geom_line(stat = "density")

A final kind of multiple histogram plots can be generated using the facet() options in ggplot. These create separate plots for each group. Here the PctBach variable is plotted and median incomes compared

ggplot(tb, aes(x=PctBach, fill=IncClass)) +
  geom_histogram(color="grey30",
    binwidth = 1) +
    scale_fill_manual("Income Class", 
    values = c("orange", "palegoldenrod","firebrick3")) +
  facet_grid(IncClass~.) +
  xlab("% Bachelor degrees") +
  ggtitle("The distribution of Bachelors degree % in different income classes")

Another way of doing this is as below using a scatter plot example

ggplot(tb, aes(x = PctBach, y = PctEld, fill=IncClass)) +
  geom_point(alpha = 0.5, shape = 16) +
  facet_wrap( ~ IncClass, ncol = 3) +
  scale_y_continuous(limits = quantile(tb$PctEld, c(0.1, 0.9))) +
  geom_smooth(method='lm', lwd = 0.5) +
  # label the axes and the plot
  xlab("% BachelorDegree") +
  ylab("% of Elderly")
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

The quantile function results in just the 10th to 90th percentiles of the distribution being plotted.

3.3 Boxplots

Perhaps a better way of examining distributions is through boxplots. And here we can start to use some of the piping functions that come with tidyverse.

Boxplots display the distribution of a continuous variable and can be broken down by a categorical variables.

Each boxplot consists of:

  • A box that stretches from the 25th to 75th percentile of the distribution (IQR).
  • A line that displays the median, i.e. 50th percentile of the distribution
  • Lines tell you about distribution and skew
  • Points of observations > 1.5 times the IQR - outliers
  • A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.

First lets create some boxplots, using similar syntax to the other plot types above. The code below generates a single boxplot of the PctBach variable:

ggplot(tb, aes(x = "",PctBach)) + 
  geom_boxplot() 

And we can extend boxplots with some grouping as before:

ggplot(tb, aes(x = "",PctBach, fill = rural)) + 
  geom_boxplot() +
  scale_fill_manual("Rural", 
    values = c("orange", "firebrick3")) +
  xlab("") +
  ylab("% Bachelor degrees")

It also possible to extend the grouping to compare more than 1 treatment:

ggplot(tb, aes(IncClass, PctBach, fill = factor(rural))) + 
  geom_boxplot() +
  scale_fill_manual(name = "Rural", 
                    values = c("orange", "firebrick3"),
                    labels = c("Non-Rural" = "Not Rural", "Rural" = "Rural")) +
  xlab("Income Class") +
  ylab("% Bachelors")

And of course the same information could be grouped and displayed using a different organisation:

ggplot(tb, aes(rural, PctBach, fill = factor(IncClass))) + 
  geom_boxplot() +
  scale_fill_manual(name = "Income Class", 
                    values = c("orange", "palegoldenrod","firebrick3")) +
  xlab("Non-Rural and Rural") +
  ylab("% Bachelors")

Finally outliers can be removed from the plot by specifying the outlier shape as NA as below:

ggplot(tb, aes(rural, PctBach, fill = factor(IncClass))) + 
  geom_boxplot(outlier.shape = NA) +
  scale_fill_manual(name = "Income Class", 
                    values = c("orange", "palegoldenrod","firebrick3")) +
  xlab("Non-Rural and Rural") +
  ylab("% Bachelors")

Finally, finally, sometimes if we have many categories, it can be useful to order the boxplots. The code below use the mpg dataset that comes with ggplot2.

For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes and to order the boxplots:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, 
    hwy, FUN = median), y = hwy))

3.4 Piping

The pipe operator is %>%. The dplyr package imports this operator from another package (magrittr). This allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of of piping is to read the functions from left to right.

Here’s an example that determines the average median income

tb %>% 
    summarise(avg_MedIbc = mean(MedInc))
## # A tibble: 1 x 1
##   avg_MedIbc
##        <dbl>
## 1      37147

We can add a grouping to this using the group_by function:

tb %>%
  group_by(IncClass) %>%
  summarise(avg_MedIbc = mean(MedInc))
## # A tibble: 3 x 2
##   IncClass avg_MedIbc
##   <ord>         <dbl>
## 1 Low           27410
## 2 Medium        34533
## 3 High          52047

In this way we can use some of the verbs in dplyr to manipulate the data. Other verbs and a summary of what the do are listed below

kable(df)
verbs Description
select() select columns
filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns
summarise() summarise values
group_by() allows for group operations in the split-apply-combine concept

In your own time you should explore the operation of these in the dplyr vignette:

vignette("dplyr")

The reason for this diversion is because we would like to use some of the dplyr verbs and the piping syntax to manipulate and prepare the data prior to constructing boxplots in ggplot.

tb %>% ungroup %>% 
  mutate(black_high = ifelse(PctBlack > median(PctBlack),
    "High % Black Area","Low % Black Area")) %>% 
  ggplot(aes(y=PctBach,x=IncClass, fill = IncClass)) + 
  facet_wrap(~black_high) + 
  coord_flip() + 
  geom_boxplot() + 
  labs(y='Percentage with Bachelors',x='Income class') +
  scale_fill_manual(name = "Income Class", 
                    values = c("orange", "palegoldenrod","firebrick3"))

It is important to note that the shading / colour assignments relates to the fill parameter. So if you want to change the shading to, for example, shade by the type of black area, then this can be changed as below. Remember that you have to change the name and the number of colours in values:

tb %>% ungroup %>% 
  mutate(black_high = ifelse(PctBlack > median(PctBlack),
    "High % Black Area","Low % Black Area")) %>% 
  ggplot(aes(y=PctBach,x=IncClass, fill = black_high)) + 
  facet_wrap(~black_high) + 
  coord_flip() + 
  geom_boxplot() + 
  labs(y='Percentage with Bachelors',x='Income class') +
  scale_fill_manual(name = "Area type", 
                    values = c("orange", "firebrick3"))

Just a final remark about boxplots: you can explore the distribution of two categorical variables as in the last sets of box plots, using the geom_count and geom_tile functions

These visualize the covariation between categorical variables and count the number of observations for each combination.

ggplot(data = tb) +
  geom_count(mapping = aes(x = rural, y = IncClass))

tb %>% 
  count(rural, IncClass) %>%  
  ggplot(mapping = aes(x = rural, y = IncClass)) +
    geom_tile(mapping = aes(fill = n))

Obviously these are perhaps more useful when a larger number of categories are present in each categorical variable than is the case here.

So I hope that the above shows that the use of piping allows you to dynamically create variables and summaries that you can then pass to ggplot. This is contrast to the definition of the rural and IncClass variables that was done at the start of this session. This allows for fast and informative exploratory data analysis: critical aspect of data science.

If you are interested there is an alternative set of exercises for Exploratory Data Analysis by Hadley Wickham’s (all bow down!) at http://r4ds.had.co.nz/exploratory-data-analysis.html.

3.5 Other plot types

In this section we will briefly show the a number of other plots including hex plots and lollipop charts. These are given a briefer treatment that the above but you should note that principles that you have learnt with scatter plots, histograms and box plots - for example to group the data in different ways, to manipulate shading colours and general plot formats - can be applied to these other ggplot plot types in the same way.

Hex plots

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black - it gets harder to see any trends. This can be solved using the alpha aesthetic to add transparency.

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point(alpha = 0.5, size = 3)

This is almost 2d kernel density mapping by stealth. But using transparency can be challenging for very large datasets -f not least because of the computational overhead. Another solution is to use binning. Previously you used geom_histogram() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to have installed the hexbin package to use geom_hex(). Here we will use the diamonds dataset that comes with ggplot2 as it has ~54,000 records

ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

and

ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

Incidentally, as we are using the diamonds data, we can revisit the bar plot function to show an ordered bar plot:

diamonds %>% 
count(cut) %>%
  ggplot(aes(x = reorder(cut, n), y = n)) + 
  geom_col() + 
  labs(x = "Cut Type",
       y = "Number",
       title = "Diamond Cuts") +
  coord_flip() +
  theme_minimal()

Finally - the use of transparencies was close to being a 2D probability density plot. In fact an actual density plot is possible using geom_density2d:

ggplot(data = diamonds) +
  geom_density2d(mapping = aes(x = carat, y = price))

Lollipop charts

Finally we can compare the different income classes and the relative relationship with percentage bachelor degrees. This requires some data manipulation including the computing the mean percentate of bachelor’s degree’s for each income class, and their respective populations (mean percentages should always be weighted by denominator)

by_tmp <- group_by(tb, IncClass)
by_tb <- summarise(by_tmp, 
                   Bachelors = weighted.mean(PctBach,TotPop90),
                       Population = sum(TotPop90))

Now a diverging lollipop chart can be produced to show how the proportion of the population with a batchelor’s degree varies with income category. Arguably there is more information in displaying this in other formats (say using boxplots and grouping) but this is a useful headline infographic. The code here is longer, but essentially it still sticks to the rules of the grammar of graphics. The only slight deviance is the annotate layer, as this does not adress the data frame directly, and so the calculation relating to placing the text on the dotted ‘mean level’ line requires the full notation (ie by_tb$Population) to refer to relevent data columns.

ggplot(by_tb, aes(x=IncClass, y=Bachelors, label=paste0(round(Bachelors,0),'%'))) + 
  geom_segment(aes(y = weighted.mean(Bachelors,Population), 
                   x = IncClass, 
                   yend = Bachelors, 
                   xend = IncClass), 
                   color = "darkred") +
  geom_point(stat='identity', color="darkred", size=14)  +
  geom_text(color="white", size=4) +
  labs(title="Diverging Lollipop Chart") +  
  ylab('% Bachelors Degree') + xlab('Income Class') +
  geom_hline(aes(yintercept=weighted.mean(Bachelors,Population)),lty=2) + 
  annotate("text",
           y=weighted.mean(by_tb$Bachelors,by_tb$Population)+0.25,
           x='Medium',
           label='Population Mean Level',angle=270, size=3) +
  coord_flip() +
  theme_minimal()

4. Finally…

Finally, finally ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets. To learn more about any single geom, use help - for example: ?geom_smooth.

A summary of these is provided here: