1. Introduction

In this session we will extend the simple visualizations you produced in earlier sessions with core (base) R plotting functions by using the ggplot2 package, a dedicated visualization package based on the Grammar of Graphics (Wilkinson, 2005), hence the gg in the name of the package. This conceptualizes graphics (and plots) in terms of their theoretical components. The approach is to handle each element of the graphic separately in a series of layers, and in so doing to control each part of the plot. This is different from the base plot functions used in earlier sessions, which apply specific plotting methods based on the type or class of the data passed to them.

The ggplot2 package is included as part of the tidyverse, which was introduced in earlier sessions.

The ggplot2 package can be loaded implicitly by loading the tidyverse, as you have done in earlier sessions:

library(tidyverse)

Or, having installed the tidyverse, the ggplot2 package can be loaded on its own:

library(ggplot2)

In this practical we will mainly work with the Georgia data that you have encountered in earlier sessions. This data comes with a number of R packages, but here we will use the version in the GISTools package. You should load this and the following packages into your workspace. You should already have all of these installed except the hexbin package.

# load into the R session
library(tidyverse)
library(GISTools)
library(kableExtra)
library(hexbin)
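If you are unsure which of these packages are installed, a quick check such as the sketch below may help before calling library(). It uses only base R; the object names pkgs, installed and missing_pkgs and the message text are illustrative and not part of the original practical.

```r
# Packages used in this practical
pkgs <- c("tidyverse", "GISTools", "kableExtra", "hexbin")

# requireNamespace() returns FALSE (rather than stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to attempt installation
} else {
  message("All packages are available")
}
```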

Now load the Georgia data and examine what is loaded:

data(georgia)
ls()
## [1] "georgia"       "georgia.polys" "georgia2"

The objects georgia and georgia2 are sp objects (SpatialPolygonsDataFrame) and the georgia.polys object is a list where each element contains the polygon coordinates for each county of the georgia2 object. In this session we will work with a tibble created from the data.frame of georgia. But before we do that, assign the data.frame to df:

df <- data.frame(georgia)

This is like the attribute table of a shapefile. You could examine the data in the manner described in the first parts of Session 2. But here we will use the head() function to examine the first six records (rows) in the data table.

head(df)
##   Latitude  Longitud TotPop90 PctRural PctBach PctEld PctFB PctPov
## 0 31.75339 -82.28558    15744     75.6     8.2  11.43  0.64   19.9
## 1 31.29486 -82.87474     6213    100.0     6.4  11.77  1.58   26.0
## 2 31.55678 -82.45115     9566     61.7     6.6  11.11  0.27   24.1
## 3 31.33084 -84.45401     3615    100.0     9.4  13.17  0.11   24.8
## 4 33.07193 -83.25085    39530     42.7    13.3   8.64  1.43   17.5
## 5 34.35270 -83.50054    10308    100.0     6.4  11.37  0.34   15.1
##   PctBlack        X       Y    ID     Name MedInc
## 0    20.76 941396.6 3521764 13001  Appling  32152
## 1    26.86 895553.0 3471916 13003 Atkinson  27657
## 2    15.42 930946.4 3502787 13005    Bacon  29342
## 3    51.67 745398.6 3474765 13007    Baker  29610
## 4    42.39 849431.3 3665553 13009  Baldwin  36414
## 5     3.49 819317.3 3807616 13011    Banks  41783

You can see that this has attributes for the counties of Georgia and a number of variables are included.

Now we will convert this to a tibble, assign it to an object called tb, and again examine it:

tb <- as_tibble(df)  # as_tibble() replaces the deprecated as.tibble()
tb
## # A tibble: 159 x 14
##    Latit… Longi… TotPo… PctRu… PctB… PctE… PctFB PctP… PctB…      X      Y
##  *  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1   31.8  -82.3  15744   75.6  8.20 11.4  0.640  19.9 20.8  941397 3.52e⁶
##  2   31.3  -82.9   6213  100    6.40 11.8  1.58   26.0 26.9  895553 3.47e⁶
##  3   31.6  -82.5   9566   61.7  6.60 11.1  0.270  24.1 15.4  930946 3.50e⁶
##  4   31.3  -84.5   3615  100    9.40 13.2  0.110  24.8 51.7  745399 3.47e⁶
##  5   33.1  -83.3  39530   42.7 13.3   8.64 1.43   17.5 42.4  849431 3.67e⁶
##  6   34.4  -83.5  10308  100    6.40 11.4  0.340  15.1  3.49 819317 3.81e⁶
##  7   34.0  -83.7  29721   64.6  9.20 10.6  0.920  14.7 11.4  803747 3.77e⁶
##  8   34.2  -84.8  55911   75.2  9.00  9.66 0.820  10.7  9.21 699012 3.79e⁶
##  9   31.8  -83.2  16245   47.0  7.60 12.8  0.330  22.0 31.3  863021 3.52e⁶
## 10   31.3  -83.2  14153   66.2  7.50 12.0  1.19   19.3 11.6  859916 3.47e⁶
## # ... with 149 more rows, and 3 more variables: ID <int>, Name <chr>,
## #   MedInc <dbl>

Notice the difference. Printing the tibble object automatically shows just the first 10 rows and only the columns that fit in the window. Also note that we are told what type of data is held in each column (field or attribute). Enter:

class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"

You will notice that 3 classes are listed. The tibble is, you guessed it, the tidyverse’s favourite data table format. When we come to look at spatial data with the sf package later on, you will see that the attributes are held in tibble-like structures alongside the polygon data.
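Because "data.frame" is one of the listed classes, functions written for data frames should generally accept a tibble too. A minimal sketch with a small made-up tibble (the object toy and its values are illustrative only, not the Georgia data):

```r
library(tibble)

# A small illustrative tibble
toy <- tibble(Name = c("Appling", "Atkinson"), PctRural = c(75.6, 100))

class(toy)                     # the tibble classes followed by "data.frame"
inherits(toy, "data.frame")    # TRUE: a tibble is still a data.frame
is_tibble(toy)                 # TRUE

# Base R functions written for data frames still work
nrow(toy)
summary(toy$PctRural)

# as.data.frame() drops the tibble classes again
class(as.data.frame(toy))
```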

A tibble is a re-imagining of the data frame. Earlier on a data.frame object called cen.dat was created when you executed the code below:

cen.dat <- read.csv("census_data.csv")

The tibble format seeks to keep what has been found to be good and effective about data.frame objects, and to get rid of what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
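The "lazy and surly" behaviour can be seen in a small sketch. The objects df_toy and tb_toy and their contents are made up for illustration; the comments describe the behaviour expected from current versions of R and tibble.

```r
library(tibble)

# data.frame() repairs names by default (check.names = TRUE); tibble() leaves them alone
df_toy <- data.frame(`county name` = c("Appling", "Bacon"))
tb_toy <- tibble(`county name` = c("Appling", "Bacon"))
names(df_toy)   # "county.name"
names(tb_toy)   # "county name"

# $ on a data.frame partially matches column names; on a tibble it does not
df_toy$county   # returns the county.name column
tb_toy$county   # NULL, with a warning about an unknown column
```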

Some of the features of a tibble can be seen by printing both objects and comparing the output:

df
tb

You should explore the tibble vignette in your own time after this session. This is displayed by entering:

vignette("tibble")

2. Quick plots

Plots can be created using either the qplot or ggplot functions in the ggplot2 package. The function qplot is used to produce quick simple plots in a similar way to the plot function. It takes x and y arguments, and a data argument for a data.frame containing x and y. The figure below is created by defining a vector of the sequence from 0 to \(2\pi\) (x), its sine (y), and a version of y with a small random error term added (y2).

x <- seq(0, 2*pi, length.out = 100)
y <- sin(x)
y2 <- y + rnorm(100, 0, 0.1)

These can then be plotted using qplot:

qplot(x,y2,col=I('darkred')) + 
  geom_line(aes(x,y), col="darkblue", size = 1.5) + 
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20,face="bold"))

Notice how the plot type is first specified (in this case qplot()) and then subsequent lines include instructions for what to plot and how to plot it. Here geom_line() was specified, followed by some style instructions.

Try adding

  theme_bw()

or

  theme_dark()

to the above. Remember that you need to include a + for each additional element in ggplot.

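# a complete theme such as theme_bw() resets earlier theme() settings,
# so add it before the theme() tweaks
qplot(x,y2,col=I('darkred')) + 
  geom_line(aes(x,y), col="darkgreen", size = 1.5) + 
  theme_bw() +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20,face="bold"))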
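For reference, the same figure can be written with ggplot() rather than qplot(), which may be the safer habit since qplot() is deprecated in recent ggplot2 releases (3.4.0 onwards). The sketch below assumes ggplot2 3.4.0 or later for the linewidth argument (older versions use size for line width); the data frame name sine_df and the seed are illustrative.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the random noise is reproducible
x <- seq(0, 2 * pi, length.out = 100)
sine_df <- data.frame(x = x, y = sin(x))
sine_df$y2 <- sine_df$y + rnorm(100, 0, 0.1)

# points for the noisy values, a line for the true sine curve
p <- ggplot(sine_df, aes(x = x)) +
  geom_point(aes(y = y2), colour = "darkred") +
  geom_line(aes(y = y), colour = "darkblue", linewidth = 1.5) +
  theme_bw() +
  theme(axis.text = element_text(size = 20),
        axis.title = element_text(size = 20, face = "bold"))
p
```

Here the data are gathered into one data frame first, and each layer states which column it maps to y.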

3. Different ggplot options

In subsequent sections, different flavours and types of ggplot will be introduced. But this is a vast package and involves a bit of a learning curve at first. To fully understand all that it can do is beyond the scope of this workshop, but there is plenty of help and advice on the internet.

In the next sections we will focus on developing different kinds of plots for different kinds of variables in the tb object.

The concept of faceting for groups of comparative plots (for example under different treatments) will also be explored.

Before that we need to create some categorical variables to play with later in the plots. First, an indicator of rural / non-rural, whose labels we set using the levels function.

tb$rural <- as.factor((tb$PctRural > 50) +  0)
levels(tb$rural) <- list("Non-Rural" = 0, "Rural"=1)

(Note the use of + 0 to convert the TRUE and FALSE values to 1s and 0s.)
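To see what the + 0 step does, and one alternative that skips the numeric step, consider the short sketch below. The vector pct_rural holds a few made-up values, and the factor() call is an alternative suggested here rather than the approach used in this practical.

```r
pct_rural <- c(75.6, 100, 42.7, 50)   # illustrative values only

pct_rural > 50           # logical: TRUE TRUE FALSE FALSE
(pct_rural > 50) + 0     # numeric: 1 1 0 0

# Alternative: label the logical values directly
rural_alt <- factor(pct_rural > 50,
                    levels = c(FALSE, TRUE),
                    labels = c("Non-Rural", "Rural"))
table(rural_alt)
```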

And then an income category variable based on the interquartile range (IQR) of the MedInc (median county income) variable. There are fancier ways to do it but the code below is tractable:

tb$IncClass <- rep("Medium", nrow(tb))
tb$IncClass[tb$MedInc >= 41204] <- "High"
tb$IncClass[tb$MedInc <= 29773] <- "Low"

This can be turned into a factor as below:

tb$IncClass <- factor(tb$IncClass,
                      levels=c("Low","Medium","High"), 
                      ordered=TRUE)

The distribution can be checked:

table(tb$IncClass)
## 
##    Low Medium   High 
##     40     79     40
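One of the "fancier ways" is to compute the quartiles with quantile() and classify with cut() instead of typing the thresholds by hand. The sketch below uses simulated incomes (med_inc) rather than the Georgia data, so the numbers are illustrative only. Note also that cut() with its default right = TRUE places a value exactly equal to the upper quartile in Medium, whereas the threshold code above places it in High.

```r
set.seed(42)
med_inc <- round(runif(159, min = 20000, max = 80000))  # simulated, not real data

# lower and upper quartiles of the simulated incomes
q <- unname(quantile(med_inc, probs = c(0.25, 0.75)))

# three ordered classes: up to Q1, between Q1 and Q3, above Q3
inc_class <- cut(med_inc,
                 breaks = c(-Inf, q[1], q[2], Inf),
                 labels = c("Low", "Medium", "High"),
                 ordered_result = TRUE)
table(inc_class)
```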

Before really getting into ggplot we will briefly show bar plots. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of counties in the tb dataset, grouped by IncClass. The chart shows that there are more counties in the Medium income class, as you would expect given that the classes were derived from the IQR.

ggplot(data = tb) + 
  geom_bar(mapping = aes(x = IncClass))

On the x-axis, the chart displays the IncClass variable and the y-axis automatically displays the count. Many plots, like scatterplots, plot the raw values of your dataset. Others, like bar charts, calculate new values to plot.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down you can find a section called “Computed variables”. That describes how it computes two new variables: count and prop.

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = tb) + 
  stat_count(mapping = aes(x = IncClass))
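The computed variables can be inspected and used directly. The sketch below uses a tiny made-up data frame (toy) in place of tb, and assumes ggplot2 3.3.0 or later for after_stat(). layer_data() returns the values that stat_count() calculated, and mapping y to after_stat(prop) plots proportions instead of counts.

```r
library(ggplot2)

# Illustrative data: one Low, two Medium and one High county
toy <- data.frame(
  IncClass = factor(c("Low", "Medium", "Medium", "High"),
                    levels = c("Low", "Medium", "High"))
)

p_count <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = IncClass))

# The data frame computed by stat_count(), including the count column
computed <- layer_data(p_count)
computed$count

# Plot the computed proportion rather than the count;
# group = 1 makes the proportions relative to the whole dataset
p_prop <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = IncClass, y = after_stat(prop), group = 1))
p_prop
```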

3.1 Scatterplots

Scatterplots show two variables together, and we can examine pairs of variables in the tb object. For example, consider PctBach and PctEld, representing the percentage of the county population with bachelor's degrees and the percentage that are elderly (whatever that means). The ggplot call has a basic syntax of

ggplot(data = <a>, aes(x,y,colour))

followed by the type of plot:

ggplot(data = tb, mapping = aes(x = PctBach, y = PctEld)) + 
  geom_point()
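The colour slot in the syntax template above can be filled by mapping a categorical variable to colour inside aes(). A minimal sketch with simulated data standing in for tb (the column names follow the text, but the values are made up):

```r
library(ggplot2)

set.seed(123)
# Simulated stand-in for tb; not the Georgia data
toy <- data.frame(
  PctBach = runif(50, 5, 35),
  PctEld  = runif(50, 5, 20),
  rural   = factor(sample(c("Non-Rural", "Rural"), 50, replace = TRUE))
)

# Points are coloured by the levels of rural and a legend is added automatically
p <- ggplot(data = toy, mapping = aes(x = PctBach, y = PctEld, colour = rural)) +
  geom_point()
p
```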