Setup

Laptop users: You should have R installed; if not:

  1. Open a web browser and go to http://cran.r-project.org and download and install it.

  2. It’s also helpful to install RStudio after installing R (download from http://rstudio.com).

  3. In R, type install.packages("tidyverse") to install a suite of useful packages including ggplot2.

Everyone: Download workshop materials:

  1. Download the data sets from your email or from Google Drive.

  2. Extract the files containing the data and save them to a folder called “dataSets” in your working directory. You can set your working directory on Windows with File -> Change directory or at the command line with setwd("filepath"), for example setwd("/Users/kami") if the data are in /Users/kami/dataSets. This step is important: otherwise R won’t be able to find the data files when you attempt to read them.
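Before reading any data, it can help to confirm that R actually sees the folder. The following is a minimal sketch using only base R; the variable name data_dir is made up for illustration, and the folder name is the one used in this workshop.

```r
# Check that the "dataSets" folder is visible from the working directory.
getwd()                          # show the current working directory

data_dir <- "dataSets"           # hypothetical variable; folder name from this workshop
if (dir.exists(data_dir)) {
  print(list.files(data_dir))    # list the files R can see in the folder
} else {
  message("No '", data_dir, "' folder found in ", getwd())
}
```

If the message appears, adjust the working directory with setwd() and run the check again.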

The Grammar Of Graphics in ggplot2

The name ggplot2 comes from “Grammar of Graphics.”

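The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you like. Building blocks of a graph include:

  • data
  • aesthetic mapping
  • geometric object
  • statistical transformations
  • scales
  • coordinate system
  • position adjustments
  • faceting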

Setup: install the tidyverse package

The ggplot2 package is included in a popular collection of packages called “the tidyverse”. Take a moment to ensure that the tidyverse is installed and that the ggplot2 package is loaded. (You likely installed it in Workshop 1.)

# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

While you are at it, install the “ggrepel” package with install.packages("ggrepel"); you will use it later.

Example Data: Housing prices

Let’s look at housing prices.

housing <- read_csv("dataSets/landdata-states.csv") # read the csv file in the data sets and set it to a local variable called housing.
## Parsed with column specification:
## cols(
##   State = col_character(),
##   region = col_character(),
##   Date = col_double(),
##   Home.Value = col_integer(),
##   Structure.Cost = col_integer(),
##   Land.Value = col_integer(),
##   Land.Share..Pct. = col_double(),
##   Home.Price.Index = col_double(),
##   Land.Price.Index = col_double(),
##   Year = col_integer(),
##   Qrtr = col_integer()
## )
head(housing[1:5]) # look at the beginning of this housing data frame, columns 1 through 5.
## # A tibble: 6 x 5
##   State region  Date Home.Value Structure.Cost
##   <chr> <chr>  <dbl>      <int>          <int>
## 1 AK    West   2010.     224952         160599
## 2 AK    West   2010.     225511         160252
## 3 AK    West   2010.     225820         163791
## 4 AK    West   2010      224994         161787
## 5 AK    West   2008      234590         155400
## 6 AK    West   2008.     233714         157458

Let’s compare R base graphics with ggplot2 for creating a histogram.

Base graphics histogram example:

# 'hist()' is a function which creates a histogram of the data set housing using column/variable Home.Value.
hist(housing$Home.Value)

ggplot2 histogram example:

library(ggplot2) # load package
ggplot(housing, aes(x = Home.Value)) + # data frame is housing, set x to Home.Value
  geom_histogram() # make a histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
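The message above is a hint that the default of 30 bins is rarely the best choice. As a minimal sketch of how to set the bin width explicitly, the example below uses the built-in faithful data set as a stand-in for the housing data (an assumption made so the code runs without the workshop files); the name p_hist is made up for illustration.

```r
library(ggplot2)

# 'faithful' ships with R and stands in for the housing data here.
# binwidth is given in the units of the x variable (minutes of waiting time).
p_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)   # each bar spans 5 minutes
p_hist
```

The same argument can be passed to geom_histogram() for Home.Value, where the bin width would be expressed in dollars.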

Base graphics vs. ggplot2 for more complex graphs:

Base graphics colored scatter plot example:

plot(Home.Value ~ Date,
     col = factor(State),
     data = filter(housing, State %in% c("MA", "TX"))) # filter only for states which are MA or TX
legend("topleft",
       legend = c("MA", "TX"),
       col = c("black", "red"), # color these black and red
       pch = 1)

ggplot2 colored scatter plot example:

ggplot(filter(housing, State %in% c("MA", "TX")),
       aes(x=Date,
           y=Home.Value,
           color=State))+ # plot information about date, home.value and State.
  geom_point() # command for scatterplot

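ggplot2 is more compelling here: the legend is generated automatically, and the colors are matched to State without any extra code.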

Geometric Objects And Aesthetics

Aesthetic Mapping

In ggplot2 aesthetic means “something you can see.” Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • line type
  • size

Each type of geom accepts only a subset of all aesthetics. More information can be found on help pages. Aesthetic mappings are set with the aes() function.
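One distinction that often causes confusion is mapping an aesthetic to a variable versus setting it to a fixed value. The sketch below illustrates the difference with the built-in mtcars data, which stands in for the workshop data; the names p_mapped and p_fixed are made up for illustration.

```r
library(ggplot2)

# Mapping: color varies with a variable, so it goes inside aes().
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Setting: every point gets the same fixed color, so it goes outside aes().
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")
```

A mapped aesthetic produces a legend; a fixed value does not, because there is nothing to decode.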

Geometric Objects (geom)

Geometric objects are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

A plot must have at least one geom; there is no upper bound. You can add a geom to a plot using the + operator.
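As a small illustration of adding geoms with +, the sketch below starts from a plot with no layers and adds two. It uses the built-in mtcars data rather than the workshop data, and the object names are made up for illustration.

```r
library(ggplot2)

# A ggplot() call on its own defines data and mappings but draws no marks.
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each geom added with + becomes one layer of the plot.
p_two <- p_base + geom_point() + geom_line()

length(p_base$layers)   # number of layers before adding geoms
length(p_two$layers)    # number of layers after adding two geoms
```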

Points (Scatterplot)

Now that we know about geometric objects and aesthetic mappings, we can make a ggplot. geom_point requires mappings for x and y; all others are optional. Note: because the aesthetics are named, the order in which x and y are listed inside aes() doesn’t matter.

hp2001Q1 <- filter(housing, Date == 2001.25) # filter housing to just the first quarter of 2001 and store this to a new data frame called hp2001Q1
ggplot(hp2001Q1,
       aes(y = Structure.Cost, x = Land.Value)) +
  geom_point()
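To check the claim about argument order, the following sketch builds the same scatter plot twice with x and y listed in different orders and compares the plotted positions. It uses the built-in mtcars data as a stand-in, and the object names are made up for illustration.

```r
library(ggplot2)

# Same named aesthetics, listed in a different order.
p_xy <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p_yx <- ggplot(mtcars, aes(y = mpg, x = wt)) + geom_point()

# layer_data() returns the values ggplot2 will actually draw.
pos_xy <- layer_data(p_xy, 1)[, c("x", "y")]
pos_yx <- layer_data(p_yx, 1)[, c("x", "y")]
all.equal(pos_xy, pos_yx)

# The aesthetics that geom_point cannot do without:
GeomPoint$required_aes
```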

We could also plot Land.Value on a log scale by log-transforming it inside aes():

ggplot(hp2001Q1,
       aes(y = Structure.Cost, x = log(Land.Value))) +
  geom_point()
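An alternative worth knowing is to transform the axis rather than the variable, so that the tick labels stay in the original units. The sketch below compares the two approaches on the built-in mtcars data (a stand-in for the workshop data); note that scale_x_log10() uses base 10, whereas log() above is the natural log.

```r
library(ggplot2)

# Approach 1: transform the variable; the axis shows log10 values.
p_log_var <- ggplot(mtcars, aes(x = log10(disp), y = mpg)) +
  geom_point()

# Approach 2: transform the scale; the axis shows the original units.
p_log_axis <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  scale_x_log10()
```

Both plots place the points in the same positions; only the axis labeling differs.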

Lines (Prediction Line)

A plot constructed with ggplot can have more than one geom. In this case the mappings established in the ggplot() call are plot defaults that can be added to or overridden. Our plot could use a regression line (we will learn more about this in our third workshop).

hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1)) # add a column called pred.SC to the hp2001Q1

# assign plot to variable so we can add to the plot
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost)) # plot with Land.Value on a log scale, and Structure.Cost on the y-axis

p1 + geom_point(aes(color = Home.Value)) +
  geom_line(aes(y = pred.SC)) # map point color to Home.Value, and add the prediction line

Note that above we assign a graphic to the variable “p1”. This is very convenient because in ggplot2 we can keep adding layers and settings to a stored graphic, which lets us build our plots iteratively.

Text (Label Points)

Each geom accepts a particular set of mappings. For example, geom_text() accepts a label mapping.

p1 + 
  geom_text(aes(label=State), size = 3)

# install.packages("ggrepel") # this package helps to avoid overlapping text labels
library("ggrepel")
p1 + 
  geom_point() + 
  geom_text_repel(aes(label=State), size = 3) # label dots with states

Scales

Scales: Controlling Aesthetic Mapping

Aesthetic mapping only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2 scales include

  • position
  • color and fill
  • size
  • shape
  • line type

Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> at the command line to see a list of scale modification functions.
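To make the naming scheme concrete, here is a minimal sketch using scale_color_manual(), that is, aesthetic "color" and type "manual". It uses the built-in mtcars data as a stand-in for the workshop data; the colors and the name p_manual are arbitrary choices made for illustration.

```r
library(ggplot2)

# aes() says that color depends on transmission type (am);
# the scale says which colors and legend labels to use.
p_manual <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  scale_color_manual(name = "Transmission",
                     values = c("0" = "darkorange", "1" = "steelblue"),
                     labels = c("automatic", "manual"))
```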

Common Scale Arguments

The following arguments are common to most scales in ggplot2:

  • name: the first argument gives the axis or legend title
  • limits: the minimum and maximum of the scale
  • breaks: the points along the scale where labels should appear
  • labels: the labels that appear at each break
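The four arguments above can be seen together in one call. This is only a sketch on the built-in mtcars data (a stand-in for the workshop data), with an arbitrary axis title, limits, breaks, and labels chosen for illustration.

```r
library(ggplot2)

p_scaled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(name = "Weight (1000 lbs)",          # axis title
                     limits = c(1, 6),                    # range of the axis
                     breaks = c(2, 4, 6),                 # where ticks appear
                     labels = c("two", "four", "six"))    # text shown at each tick
```

The labels vector must have the same length as breaks, since each label is attached to one break.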

Scale Modification Examples

Start by constructing a dotplot showing the distribution of home values by Date and State.

p3 <- ggplot(housing,
             aes(x = State,
                 y = Home.Price.Index)) + # indicates which variable is on which axis
        theme(legend.position="top", # where the legend should go
              axis.text=element_text(size = 6)) # text size
(p4 <- p3 + geom_point(aes(color = Date), # use a dot plot
                       alpha = 0.5,
                       size = 1.5,
                       position = position_jitter(width = 0.25, height = 0)))

Now modify the name of the x axis and the breaks and labels of the color scale:

p4 + scale_x_discrete(name="State Abbreviation") +
  scale_color_continuous(name="",
                         breaks = c(1976, 1994, 2013), # where to break on color scale
                         labels = c("'76", "'94", "'13")) # label in legend

Next change the low and high values to blue and red:

p4 +
  scale_x_discrete(name="State Abbreviation") +
  scale_color_continuous(name="",
                         breaks = c(1976, 1994, 2013),
                         labels = c("'76", "'94", "'13"),
                         low = "blue", high = "red") # changed to blue-red scale

Faceting

  • Faceting lets us create separate graphs for subsets of data
  • ggplot2 offers two functions for creating small multiples:
    1. facet_wrap(): define subsets as the levels of a single grouping variable
    2. facet_grid(): define subsets as the crossing of two grouping variables
  • This is useful for comparing plots.
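The difference between the two faceting functions can be sketched as follows, using the built-in mtcars data as a stand-in for the workshop data (the object names are made up for illustration). facet_wrap() makes one panel per level of cyl, while facet_grid() makes one panel per combination of am (rows) and cyl (columns).

```r
library(ggplot2)

p_cars <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One grouping variable: panels are wrapped into rows as needed.
p_wrap <- p_cars + facet_wrap(~ cyl)

# Two grouping variables: rows by am, columns by cyl.
p_grid <- p_cars + facet_grid(am ~ cyl)
```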

What is the trend in housing prices in each state?

Start by using a technique we already know and map State to color:

p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
p5 + geom_line(aes(color = State))

There are two problems with this graphic: there are too many states to distinguish each one by color, and the lines cover each other.

Faceting to the rescue

We can remedy the deficiencies of the previous plot by faceting by state rather than mapping state to color.

(p5 <- p5 + geom_line() +
   facet_wrap(~State, ncol = 10)) # facet_wrap defines subsets at the level of State

Themes

The ggplot2 theme system handles non-data plot elements like

  • Axis labels
  • Plot background
  • Facet label background
  • Legend appearance

Built-in themes include:

  • theme_gray() (default)
  • theme_bw()
  • theme_classic()

p5 + theme_linedraw()

p5 + theme_light()
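Individual elements can also be adjusted on top of a built-in theme with theme(). The sketch below uses the built-in mtcars data as a stand-in for the workshop data; the specific settings and the name p_theme are arbitrary choices for illustration.

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_bw() +                                      # start from a built-in theme
  theme(legend.position = "bottom",                 # then override single elements
        axis.text = element_text(size = 8))
```

The order matters: a complete theme such as theme_bw() added after theme() would overwrite the individual settings.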

Exercises (and Solutions) with the Economist Data Set

These data consist of Human Development Index and Corruption Perception Index scores for several countries.

The data for the exercises are available in the dataSets/EconomistData.csv file. Read the file in with:

dat <- read_csv("dataSets/EconomistData.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   HDI.Rank = col_integer(),
##   HDI = col_double(),
##   CPI = col_double(),
##   Region = col_character()
## )

Let’s do the following tasks:

  1. Create a scatter plot with CPI on the x axis and HDI on the y axis.
  2. Color the points blue.
  3. Map the color of the points to Region.
  4. Make the points bigger by setting size to 2.
  5. Map the size of the points to HDI.Rank.

# create scatterplot with CPI on the x axis and HDI on the y axis
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point()

# color points blue
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point(color = "blue")

# color by region
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region))

# make the points bigger by setting the size to 2
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region), size = 2)

# map the size of the points to the HDI.Rank
ggplot(dat, aes(x = CPI, y = HDI)) +
  geom_point(aes(color = Region, size = HDI.Rank))

Adding the trend line

We can add a trend line with the following code. Let’s recreate a scatter plot and assign it to the object pc1.

pc1 <- ggplot(dat, aes(x = CPI, y = HDI, color = Region))
pc1 + geom_point()

pc2 <- pc1 +
  geom_smooth(mapping = aes(linetype = "r2"), 
              method = "lm", # this means linear model
              formula = y ~ x, se = FALSE, 
              color = "red")
pc3 <- pc2 + geom_point() 

Notice that we put the geom_smooth layer first so that it will be plotted underneath the points.
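Layers are drawn in the order in which they were added, so the first layer ends up at the bottom. As a minimal sketch of how to inspect that order, the example below uses the built-in mtcars data as a stand-in for the Economist data; the names p_order and layer_geoms are made up for illustration.

```r
library(ggplot2)

# The smooth layer is added first, so the points are drawn on top of it.
p_order <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_point()

# List the geom of each layer, bottom layer first.
layer_geoms <- sapply(p_order$layers, function(l) class(l$geom)[1])
layer_geoms
```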

Add title and format axes

The last step is to add the title and format the axes. We do that using the scales system in ggplot2.

library(grid)
(pc4 <- pc3 +
  scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
                     limits = c(.9, 10.5),
                     breaks = 1:10) +
  scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
                     limits = c(0.2, 1.0),
                     breaks = seq(0.2, 1.0, by = 0.1)) +
  ggtitle("Corruption and Human development"))

Challenge

Challenge 1

Last year’s Mathematical Contest in Modeling had a problem about climate and a nation’s fragility.

Here is the description:

A fragile state is one where the state government is not able to, or chooses not to, provide the basic essentials to its people. For the purpose of this problem “state” refers to a sovereign state or country. Being a fragile state increases the vulnerability of a country’s population to the impact of such climate shocks as natural disasters, decreasing arable land, unpredictable weather, and increasing temperatures. Non-sustainable environmental practices, migration, and resource shortages, which are common in developing states, may further aggravate states with weak governance (Schwartz and Randall, 2003; Theisen, Gleditsch, and Buhaug, 2013). Arguably, drought in both Syria and Yemen further exacerbated already fragile states. Environmental stress alone does not necessarily trigger violent conflict, but evidence suggests that it enables violent conflict when it combines with weak governance and social fragmentation. This confluence can enhance a spiral of violence, typically along latent ethnic and political divisions (Krakowka, Heimel, and Galgano 2012).

Download the data set from http://fundforpeace.org/fsi/excel/ and make some exploratory graphics.

Challenge 2

Go to https://www.kaggle.com and register. Once registered, sign in and navigate to the Titanic dataset. Download it.

This tutorial (https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic) generates a few ggplot2 graphics from this set. Try to recreate them.

Acknowledgements

This tutorial includes exercises and data from Harvard’s workshop on ggplot2. The original tutorial can be found at https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html