Open a web browser, go to http://cran.r-project.org, and download and install R.
It’s also helpful to install RStudio after installing R (download from http://rstudio.com).
In R, type install.packages("tidyverse") to install a suite of useful packages, including ggplot2.
Download the data sets from your email or from Google Drive.
Extract the files containing the data and save them to a folder called “dataSets” in your working directory. You can set your working directory on Windows with File -> Change directory or at the command line with setwd("filepath"); for example, setwd("/Users/kami") if the folder is /Users/kami/dataSets. It is important to do this; otherwise R won’t be able to find the data files when you attempt to read them.
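Before reading any data, it can help to confirm that R actually sees the folder. The following is a minimal sketch using only base R. The file name is the one read later in this tutorial; the check itself is an illustration and not part of the original workshop.

```r
# Hypothetical sanity check: does the data file exist relative to the working directory?
data_file <- file.path("dataSets", "landdata-states.csv")

cat("Working directory:", getwd(), "\n")
if (file.exists(data_file)) {
  cat("Found", data_file, "\n")
} else {
  cat("Could not find", data_file, "; check setwd() and the folder name\n")
}
```

If the file is not found, the working directory is probably not the folder that contains dataSets.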
The name ggplot2 comes from “Grammar of Graphics.”
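The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you like. Building blocks of a graph include the data, aesthetic mappings, geometric objects, scales, facets, and themes, each of which is covered in this tutorial.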
The ggplot2 package is included in a popular collection of packages called “the tidyverse”. Take a moment to ensure that it is installed and that the ggplot2 package is loaded. (You likely installed this in Workshop 1.)
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
While you are at it, install the package “ggrepel”, which you will use later.
Housing prices
Let’s look at housing prices.
housing <- read_csv("dataSets/landdata-states.csv") # read the csv file in the data sets and set it to a local variable called housing.
## Parsed with column specification:
## cols(
## State = col_character(),
## region = col_character(),
## Date = col_double(),
## Home.Value = col_integer(),
## Structure.Cost = col_integer(),
## Land.Value = col_integer(),
## Land.Share..Pct. = col_double(),
## Home.Price.Index = col_double(),
## Land.Price.Index = col_double(),
## Year = col_integer(),
## Qrtr = col_integer()
## )
head(housing[1:5]) # look at the beginning of this housing data frame, columns 1 through 5.
## # A tibble: 6 x 5
## State region Date Home.Value Structure.Cost
## <chr> <chr> <dbl> <int> <int>
## 1 AK West 2010. 224952 160599
## 2 AK West 2010. 225511 160252
## 3 AK West 2010. 225820 163791
## 4 AK West 2010 224994 161787
## 5 AK West 2008 234590 155400
## 6 AK West 2008. 233714 157458
Let’s compare R base graphics with ggplot2 for creating a histogram.
Base graphics histogram example:
# 'hist()' is a function which creates a histogram of the data set housing using column/variable Home.Value.
hist(housing$Home.Value)
ggplot2 histogram example:
library(ggplot2) # load package
ggplot(housing, aes(x = Home.Value)) + # data frame is housing, map x to Home.Value
geom_histogram() # make a histogram
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
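The message above suggests choosing a bin width explicitly instead of relying on the default of 30 bins. Here is a minimal sketch of doing so. The data frame toy_homes and its values are invented for illustration and are not the workshop's housing data.

```r
library(ggplot2)

# Hypothetical home values (in dollars), invented for illustration only
toy_homes <- data.frame(
  value = c(95000, 120000, 135000, 150000, 152000, 180000,
            210000, 240000, 310000, 450000)
)

# binwidth is given in the units of the x variable: each bar spans 50,000 dollars
p_hist <- ggplot(toy_homes, aes(x = value)) +
  geom_histogram(binwidth = 50000)
p_hist
```

Setting binwidth silences the message and makes the bar widths meaningful in the units of the data.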
Base graphics vs. ggplot2 for more complex graphs:
Base graphics colored scatter plot example:
plot(Home.Value ~ Date,
col = factor(State),
data = filter(housing, State %in% c("MA", "TX"))) # filter only for states which are MA or TX
legend("topleft",
legend = c("MA", "TX"),
col = c("black", "red"), # color these black and red
pch = 1)
ggplot2 colored scatter plot example:
ggplot(filter(housing, State %in% c("MA", "TX")),
aes(x=Date,
y=Home.Value,
color=State))+ # plot information about date, home.value and State.
geom_point() # command for scatterplot
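ggplot2 is more compelling: the colors and the legend are generated automatically from the State variable instead of being specified by hand.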
In ggplot2, aesthetic means “something you can see.” Examples include position (on the x and y axes), color, shape, size, and line type.
Each type of geom accepts only a subset of all aesthetics. More information can be found on the help page of each geom. Aesthetic mappings are set with the aes() function.
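The difference between mapping an aesthetic to a variable and setting it to a fixed value is easy to miss. Here is a minimal sketch using R's built-in mtcars data, which is an assumption made for illustration and not one of the workshop's data sets.

```r
library(ggplot2)

# Mapping: color varies with a variable, so it goes inside aes()
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does color carry information about the data.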
Geometric objects (geom)
Geometric objects are the actual marks we put on a plot. Examples include:
points (geom_point, for scatter plots, dot plots, etc.)
lines (geom_line, for time series, trend lines, etc.)
boxplots (geom_boxplot, for, well, boxplots!)
A plot must have at least one geom; there is no upper bound. You can add a geom to a plot using the + operator.
Now that we know about geometric objects and aesthetic mappings, we can make a ggplot. geom_point requires mappings for x and y; all others are optional. Note: the order in which x and y are listed inside aes() doesn’t matter, because they are matched by name.
hp2001Q1 <- filter(housing, Date == 2001.25) # filter housing to just the first quarter of 2001 and store this to a new data frame called hp2001Q1
ggplot(hp2001Q1,
aes(y = Structure.Cost, x = Land.Value)) +
geom_point()
We could also change this to a log scale:
ggplot(hp2001Q1,
aes(y = Structure.Cost, x = log(Land.Value))) +
geom_point()
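The example above applies log() (the natural logarithm) to the data inside aes(), so the axis shows logged values. An alternative, not used in the original workshop, is to leave the data alone and transform the scale with scale_x_log10(), which uses base 10 and keeps the axis labels in the original units. The data frame toy_land below is invented for illustration.

```r
library(ggplot2)

# Hypothetical land values and structure costs, invented for illustration only
toy_land <- data.frame(
  land_value     = c(5000, 20000, 80000, 320000, 1280000),
  structure_cost = c(90000, 110000, 125000, 150000, 170000)
)

# The scale, not the data, carries the log transformation
p_log <- ggplot(toy_land, aes(x = land_value, y = structure_cost)) +
  geom_point() +
  scale_x_log10()
p_log
```

The points land in the same relative positions either way; only the axis labeling differs.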
A plot constructed with ggplot() can have more than one geom. In this case the mappings established in the ggplot() call are plot defaults that can be added to or overridden. Our plot could use a regression line (we will learn more about this in our third workshop).
hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1)) # add a column called pred.SC to the hp2001Q1
# assign plot to variable so we can add to the plot
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost)) # plot with Land.Value on a log scale, and Structure.Cost on the y-axis
p1 + geom_point(aes(color = Home.Value)) + # color the points by Home.Value
geom_line(aes(y = pred.SC)) # override y for this layer to draw the prediction line
Note that above we assigned the plot to the variable “p1.” This is very convenient because in ggplot2 we can keep adding layers and settings to a stored plot to customize it further. This lets us build our plots iteratively.
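The regression line above was computed by hand with lm() and predict(). As a sketch of an alternative, geom_smooth(method = "lm") fits the same kind of straight line to whatever is mapped to x and y. The data frame toy_fit is invented for illustration and is not the workshop's housing data.

```r
library(ggplot2)

# Hypothetical data, invented for illustration only
toy_fit <- data.frame(
  land_value     = c(5000, 20000, 80000, 320000, 1280000, 40000, 160000),
  structure_cost = c(92000, 108000, 127000, 149000, 171000, 118000, 140000)
)

# geom_smooth() fits lm(y ~ x) on the mapped values, here x = log(land_value)
p_fit <- ggplot(toy_fit, aes(x = log(land_value), y = structure_cost)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_fit
```

Computing the predictions yourself, as in the workshop code, is still useful when you want to keep the fitted values as a column in the data frame.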
Each geom accepts a particular set of mappings. For example, geom_text() accepts a label mapping.
p1 +
geom_text(aes(label=State), size = 3)
# install.packages("ggrepel") # this package helps to avoid overlapping text labels
library("ggrepel")
p1 +
geom_point() +
geom_text_repel(aes(label=State), size = 3) # label dots with states
Aesthetic mapping only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2, scales include position, color and fill, size, shape, and line type.
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> at the command line to see a list of scale modification functions.
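Tab completion only works in an interactive session. As a sketch of a scripted alternative, the exported function names of the ggplot2 namespace can be searched with base R. The variable name scale_funs is just an illustration.

```r
# List exported ggplot2 functions that follow the scale_<aesthetic>_<type> scheme
scale_funs <- grep("^scale_", getNamespaceExports("ggplot2"), value = TRUE)

# Narrow the list to the scales for one aesthetic, here color
sort(grep("^scale_color_", scale_funs, value = TRUE))
```

The same pattern with "^scale_x_" or "^scale_size_" lists the functions available for other aesthetics.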
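The following arguments are common to most scales in ggplot2: name (the axis or legend title), limits (the minimum and maximum of the scale), breaks (the points along the scale where labels should appear), and labels (the text shown at each break).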
Start by constructing a dot plot showing the distribution of the home price index by State, with Date mapped to color.
p3 <- ggplot(housing,
aes(x = State,
y = Home.Price.Index)) + # indicates which variable is on which axis
theme(legend.position="top", # where the legend should go
axis.text=element_text(size = 6)) # text size
(p4 <- p3 + geom_point(aes(color = Date), # use a dot plot
alpha = 0.5,
size = 1.5,
position = position_jitter(width = 0.25, height = 0)))
Now modify the name of the x axis and the breaks and labels of the color scale:
p4 + scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013), # where to break on color scale
labels = c("'76", "'94", "'13")) # label in legend
Next change the low and high values to blue and red:
p4 +
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = "blue", high = "red") # changed to blue-red scale
ggplot2 offers two functions for creating small multiples:
facet_wrap(): define subsets as the levels of a single grouping variable
facet_grid(): define subsets as the crossing of two grouping variables
p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
p5 + geom_line(aes(color = State))
There are two problems with this graphic: there are too many states to distinguish each one by color, and the lines cover each other.
We can remedy the deficiencies of the previous plot by faceting by state rather than mapping state to color.
(p5 <- p5 + geom_line() +
facet_wrap(~State, ncol = 10)) # facet_wrap defines subsets at the level of State
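facet_grid() is listed above but not demonstrated. Here is a minimal sketch using R's built-in mtcars data, which is an assumption made for illustration and not one of the workshop's data sets.

```r
library(ggplot2)

# Rows are defined by transmission type (am), columns by number of cylinders (cyl)
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)
p_grid
```

The formula reads rows ~ columns, so each panel shows one combination of the two grouping variables.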
The ggplot2 theme system handles non-data plot elements like axis labels, the plot background, facet label backgrounds, and legend appearance.
Built-in themes include:
theme_gray() (default)
theme_bw()
theme_classic()
p5 + theme_linedraw()
p5 + theme_light()
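A built-in theme can be adjusted further by adding a theme() call after it. This minimal sketch reuses the two theme arguments that appeared earlier in this tutorial (legend.position and axis.text) on R's built-in mtcars data, which is an assumption made for illustration.

```r
library(ggplot2)

# Start from a complete built-in theme, then override individual elements
p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom",           # move the legend below the plot
        axis.text = element_text(size = 8))   # shrink the axis tick labels
p_theme
```

The order matters: a complete theme such as theme_bw() added after theme() would reset the overrides.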
These data consist of Human Development Index and Corruption Perception Index scores for several countries.
The data for the exercises is available in the dataSets/EconomistData.csv file. Read it in with
dat <- read_csv("dataSets/EconomistData.csv")
## Parsed with column specification:
## cols(
## Country = col_character(),
## HDI.Rank = col_integer(),
## HDI = col_double(),
## CPI = col_double(),
## Region = col_character()
## )
Let’s do the following tasks:
1. Create a scatter plot with CPI on the x axis and HDI on the y axis.
2. Color the points blue.
3. Map the color of the points to Region.
4. Make the points bigger by setting size to 2.
5. Map the size of the points to HDI.Rank.
# create scatterplot with CPI on the x axis and HDI on the y axis
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point()
# color points blue
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(color = "blue")
# color by region
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(aes(color = Region))
# make the points bigger by setting the size to 2
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(aes(color = Region), size = 2)
# map the size of the points to the HDI.Rank
ggplot(dat, aes(x = CPI, y = HDI)) +
geom_point(aes(color = Region, size = HDI.Rank))
We can add a trend line with the following code. Let’s recreate the scatter plot and assign it to the object pc1.
pc1 <- ggplot(dat, aes(x = CPI, y = HDI, color = Region))
pc1 + geom_point()
pc2 <- pc1 +
geom_smooth(mapping = aes(linetype = "r2"),
method = "lm", # this means linear model
formula = y ~ x, se = FALSE,
color = "red")
pc3 <- pc2 + geom_point()
Notice that we put the geom_smooth layer first so that it will be plotted underneath the points.
The last step is to add the title and format the axes. We do that using the scales system in ggplot2.
library(grid)
(pc4 <- pc3 +
scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
limits = c(.9, 10.5),
breaks = 1:10) +
scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
limits = c(0.2, 1.0),
breaks = seq(0.2, 1.0, by = 0.1)) +
ggtitle("Corruption and Human development"))
Last year’s Mathematical Contest in Modeling had a problem about climate and a nation’s fragility.
Here is the description:
A fragile state is one where the state government is not able to, or chooses not to, provide the basic essentials to its people. For the purpose of this problem “state” refers to a sovereign state or country. Being a fragile state increases the vulnerability of a country’s population to the impact of such climate shocks as natural disasters, decreasing arable land, unpredictable weather, and increasing temperatures. Non-sustainable environmental practices, migration, and resource shortages, which are common in developing states, may further aggravate states with weak governance (Schwartz and Randall, 2003; Theisen, Gleditsch, and Buhaug, 2013). Arguably, drought in both Syria and Yemen further exacerbated already fragile states. Environmental stress alone does not necessarily trigger violent conflict, but evidence suggests that it enables violent conflict when it combines with weak governance and social fragmentation. This confluence can enhance a spiral of violence, typically along latent ethnic and political divisions (Krakowka, Heimel, and Galgano 2012).
Download the data set from http://fundforpeace.org/fsi/excel/ and make some exploratory graphics.
Go to https://www.kaggle.com and register. Once registered, sign in and navigate to the Titanic dataset. Download it.
This tutorial (https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic) generates a few ggplot2 graphics from this set. Try to recreate them.
This tutorial includes exercises and data from Harvard’s workshop on ggplot2. The original tutorial can be found here: https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html