ggplot2: An introductory tutorial to graphing

author: Chris Inkpen
date: March 2014

R ggplot2 tutorial - Penn State R User Group

R script file location https://drive.google.com/file/d/0B73N8essZOakUDJlM3o0aEhmWlU/edit?usp=sharing

This is a brief walkthrough of the R ggplot2 graphics package developed by Hadley Wickham. Wickham's “ggplot2: Elegant Graphics for Data Analysis” is available for download on the Penn State library website. There are also a number of more specific tutorials and walkthroughs (listed at the bottom). Much of the source code for this tutorial is adapted from the R Graphics Cookbook by Winston Chang (O'Reilly), which I highly recommend. A link to purchase the book can be found at the bottom of the page.

Installing the necessary packages

First, we're going to install the ggplot2 package along with a few others (just in case we need them).

install.packages("ggplot2")
install.packages("plyr")
install.packages("hexbin")
install.packages("gcookbook")
install.packages("lattice")

Next, we'll load in all of our packages that we'll need in one quick shot as opposed to loading them up one at a time.

libs <- c("ggplot2", "plyr", "hexbin", "gcookbook")
lapply(libs, require, character.only = TRUE)

## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: hexbin
## Loading required package: grid
## Loading required package: lattice
## Loading required package: gcookbook

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
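
If some of these packages are already installed, you may prefer to install only the ones that are missing. Here is a minimal sketch, assuming the same package list as in the install step; the names wanted and to_install are made up for illustration, and the actual install call is left commented out because it needs a CRAN mirror.

```r
# Packages this tutorial relies on (same list as the install step)
wanted <- c("ggplot2", "plyr", "hexbin", "gcookbook", "lattice")

# Compare against the packages already present in the library paths
to_install <- setdiff(wanted, rownames(installed.packages()))

# character(0) means everything is already installed
print(to_install)

# Uncomment to install whatever is missing (needs a CRAN mirror):
# if (length(to_install) > 0) install.packages(to_install)
```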

For this tutorial, we will be using the “diamonds” dataset that ships with ggplot2.

data(diamonds)
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Here we can see that the diamonds dataset has 53940 observations (rows) and 10 variables: carat, cut, color, clarity, depth, table, price, and the dimensions x, y, and z (length, width, and depth in mm).

The str() command shows us that the first variable, “carat”, is numeric, followed by “cut”, an ordered factor with 5 levels (Fair, Good, etc.), and so on.
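
To make the idea of an ordered factor more concrete, here is a small sketch that inspects the cut column directly. It only uses base R functions on the diamonds data and assumes ggplot2 is installed so that the dataset is available.

```r
library(ggplot2)  # the diamonds dataset ships with ggplot2

data(diamonds)

# An ordered factor stores its categories in a fixed order
is.ordered(diamonds$cut)
levels(diamonds$cut)

# Count how many diamonds fall into each cut category
table(diamonds$cut)

# Ordered factors support comparisons, e.g. "better than Good"
sum(diamonds$cut > "Good")
```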

In ggplot2, there are two ways to build a graph: qplot() and ggplot(). qplot() is similar to the base plot() function in R, but a little more extensible. ggplot() allows for more sophisticated graphics by adding layers to “plot objects”, in the parlance of object-oriented languages.
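
A short sketch of what “adding layers to a plot object” looks like in practice. The object names p and p_points are made up for illustration; the point is only that + returns a new plot object with one more layer, and nothing is drawn until the object is printed.

```r
library(ggplot2)

# Start with a plot object: data plus aesthetic mappings, but no layers yet
p <- ggplot(diamonds, aes(x = carat, y = price))
length(p$layers)

# Adding a geom returns a new plot object with one more layer
p_points <- p + geom_point(alpha = 0.1)
length(p_points$layers)

# Printing the object is what actually draws the plot
print(p_points)
```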

Creating histograms

Commands for this portion come from the following source:
http://docs.ggplot2.org/0.9.3.1/geom_histogram.html

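To create a histogram of the count of diamonds by price, first create a diamonds_small object (a random sample of 1000 rows):

set.seed(6298)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]
# geom_histogram() bins the continuous price variable before counting
ggplot(diamonds_small, aes(x = price)) + geom_histogram()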

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
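
The message above suggests choosing the bin width yourself. Here is a sketch using geom_histogram(), which always bins a continuous variable. The width of 500 price units is an arbitrary choice for illustration, and p_hist is a made-up name.

```r
library(ggplot2)

# Same 1000-row random sample as in the tutorial
set.seed(6298)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# Bins that are 500 price units wide instead of the range/30 default
p_hist <- ggplot(diamonds_small, aes(x = price)) +
  geom_histogram(binwidth = 500)
p_hist
```

Smaller bin widths show more detail but also more noise, so it is worth trying a few values.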

Use geom_freqpoly() to look at the frequency distribution as lines, one per cut:

ggplot(diamonds_small, aes(price, ..density.., colour = cut)) + geom_freqpoly(binwidth = 1000)

By saving the plot as an object (hist_cut), we can reuse it. Here the fill aesthetic colors the histogram by cut, showing the count of diamonds for each cut:

hist_cut <- ggplot(diamonds_small, aes(x = price, fill = cut))
hist_cut + geom_histogram()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

In this example the categories are stacked on top of each other. The same plotting grammar can also be used to create a density plot, with a semi-transparent curve for each cut:

ggplot(diamonds_small, aes(price, fill = cut)) + geom_density(alpha = 0.2)

Box Plot:

In the event that you want to use ggplot for nice univariate or bivariate box-plots, it can do that too.

p <- ggplot(diamonds, aes(cut, price))
p + geom_boxplot()

Creating a scatterplot

Here we want to display the relationship between two continuous variables. As with the “plot” function in base R, to plot two variables (for example, carat and price) you supply the x values and the y values. Since the diamonds dataset has about 54 thousand observations, it also gives an idea of how to deal with displaying “Big Data”. Although 54 thousand observations isn't that many, it is fairly hard to graph in an appealing way.

R basic graphing way

plot(diamonds$carat, diamonds$price)

qplot way

qplot(diamonds$carat, diamonds$price)

Written another way (if the columns are in the same data frame)

qplot(carat, price, data = diamonds)

So, this is not a very pretty plot (though much better than the first), due to the massive number of observations. But carat and price certainly do appear to have a positive relationship.

ggplot2 way

ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

Here we're specifying our data, our aes (x value, y value), and the representation: the geom_point() layer draws the points with the default shape.

We can specify the shape and the size we want to use as arguments to geom_point().

ggplot(diamonds, aes(x = carat, y = price)) + geom_point(shape = 25, size = 1)

Shape 25 is a small downward-pointing triangle. For actual little diamonds, which would seem appropriate, use shape = 5 (open) or shape = 18 (solid).

What happens if you want to group points by a variable of interest, using shape and/or color? Here we'll take a look at the same graph but grouped by cut (a five-category ordinal variable).

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point(size = 1.5)

Pretty cool, but it's still quite crowded in there. This lets us tell the groups apart by color. Realistically, this is a problem of overplotted data (there are just too many points), so we should do something to make this less cumbersome to understand. One tactic is to make the points semitransparent using the alpha argument of geom_point().

Here we'll write our plot into an object and then add the geom_point() layer to it.

easy <- ggplot(diamonds, aes(x = carat, y = price))
easy + geom_point(alpha = 0.1)

easy + geom_point(alpha = 0.05)

Another way to get around this is to “bin” the points into rectangles. We can then map the density of the points in the rectangles using a fill color for the rectangles.

bin <- ggplot(diamonds, aes(x = carat, y = price))
bin + stat_bin2d()

This doesn't look too good on its own, so we'll switch to hexagonal bins, which need the hexbin package, and fiddle with the colors of the “counts” for the bins.

bin + stat_binhex() + scale_fill_gradient(low = "lightblue", high = "red", breaks = c(0, 
    500, 1000, 2000, 4000, 6000, 8000), limits = c(0, 8000)) + stat_smooth(method = lm) + 
    ylim(0, 20000)

## Warning: Removed 38 rows containing missing values (geom_path).
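
The warning most likely comes from ylim(), which discards anything outside the limits before drawing; here that would be the part of the fitted line that leaves the 0 to 20000 range. If you only want to zoom in, a sketch of an alternative is coord_cartesian(), which changes the visible window without dropping any values. This example uses stat_bin2d() so that it runs without the hexbin package, and zoomed is a made-up name.

```r
library(ggplot2)

zoomed <- ggplot(diamonds, aes(x = carat, y = price)) +
  stat_bin2d() +
  stat_smooth(method = "lm") +
  # Zoom the y axis without removing any data or fitted values
  coord_cartesian(ylim = c(0, 20000))
zoomed
```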

To show scatterplots colored by a grouping variable, we'll use an example looking at cars and their fuel economy in miles per gallon.

data(mpg)
head(mpg)

##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact

Using the str() command, you can take a look at the structure of the data. Here, we can utilize color to distinguish by a particular feature.

qplot(data = mpg, x = displ, y = hwy, color = manufacturer)

qplot(data = mpg, x = displ, y = hwy, color = class)

You can also do the same thing to subsets of the data - Faceting

It's common to want to recreate the same type of plot for different classes or subsets of our data (i.e. based on the values of some factor, aka a categorical variable). qplot() makes this amazingly easy by providing a facets= argument; with ggplot() you add a faceting layer such as facet_grid().

qplot(data = mpg, x = displ, y = hwy, color = manufacturer, facets = ~class)

Now we'll specify the facets layer in ggplot

myGG <- ggplot(mpg, aes(x = displ, y = hwy))

We'll go into this in more detail in the next section, but here is how to set a trend line over the groups.

myGG + geom_point(aes(color = manufacturer)) + stat_smooth(method = lm, se = FALSE)

With facets

myGG + geom_point(aes(color = manufacturer)) + stat_smooth(method = lm, se = FALSE) + 
    facet_grid(class ~ .)
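
facet_grid(class ~ .) stacks all the panels in a single column, which can get cramped with seven classes. As a sketch of an alternative, facet_wrap() wraps the panels onto several rows. The choice of three columns is arbitrary, and wrapped is a made-up name.

```r
library(ggplot2)

# facet_wrap() lays the panels out in a grid that wraps onto several rows
wrapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = manufacturer)) +
  facet_wrap(~ class, ncol = 3)
wrapped
```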

Fitting Regression Model Lines to Data

data(cars)
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

Here is our ggplot baseplot

sp <- ggplot(cars, aes(x = speed, y = dist))
sp + geom_point()

Now we will add the line of best fit with the stat_smooth(method=lm) command.

sp + geom_point() + stat_smooth(method = lm)

This adds a default 95% confidence region for the regression fit. To get a 99% confidence region, you just need to specify it as a level.

sp + geom_point() + stat_smooth(method = lm, level = 0.99)
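
To see the numbers behind the line and its band, you can fit the same model directly with lm(). This is a sketch that assumes stat_smooth() uses its default formula, y ~ x, which here means dist ~ speed. The names fit and new_speeds are made up for illustration.

```r
# Fit the same simple linear regression that the plot draws
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope of the plotted line

# 99% confidence interval for the mean response at a few speeds
new_speeds <- data.frame(speed = c(5, 15, 25))
predict(fit, newdata = new_speeds, interval = "confidence", level = 0.99)
```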

To turn it off, just set se (the standard error band) to FALSE.

sp + geom_point() + stat_smooth(method = lm, se = FALSE)

# To change up the color to a fast car theme.
sp + geom_point(color = "red") + stat_smooth(method = lm, se = FALSE, color = "black")

Finally, we can look at a Loess curve instead of a regular linear model line.

# Loess
sp + geom_point() + stat_smooth(method = loess, se = FALSE)

# This uses a locally weighted polynomial curve.
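
How closely the loess curve follows the points is controlled by the span argument of stat_smooth(). Here is a sketch with a span of 0.5, an arbitrary value chosen for illustration; sp_loess is a made-up name.

```r
library(ggplot2)

sp_loess <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  # Smaller span = wigglier curve, larger span = smoother curve
  stat_smooth(method = "loess", span = 0.5, se = FALSE)
sp_loess
```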

Labeling Points in a Scatter Plot

data(countries)
head(countries)

##          Name Code Year   GDP laborrate healthexp infmortality
## 1 Afghanistan  AFG 1960 55.61        NA        NA           NA
## 2 Afghanistan  AFG 1961 55.67        NA        NA           NA
## 3 Afghanistan  AFG 1962 54.36        NA        NA           NA
## 4 Afghanistan  AFG 1963 73.20        NA        NA           NA
## 5 Afghanistan  AFG 1964 76.37        NA        NA           NA
## 6 Afghanistan  AFG 1965 94.10        NA        NA           NA

twok <- subset(countries, Year == 2009 & healthexp > 2000)

head(twok)
str(twok)
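
One detail of subset() matters for data with many NA values, like countries: rows where the condition evaluates to NA are dropped. Here is a sketch on a toy data frame (made up for illustration) that contrasts this with plain bracket indexing.

```r
# Toy data frame (made up for illustration) with a missing value
toy <- data.frame(
  Name      = c("A", "B", "C", "D"),
  Year      = c(2009, 2009, 2009, 2008),
  healthexp = c(2500, NA, 1500, 3000)
)

# subset() keeps only rows where the condition is TRUE; NA is dropped
kept <- subset(toy, Year == 2009 & healthexp > 2000)
kept$Name

# Plain bracket indexing returns an NA entry for row "B" instead
toy[toy$Year == 2009 & toy$healthexp > 2000, "Name"]
```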

Taking an example from the R Graphics Cookbook, we'll use annotate() and geom_text() to label a few points. We can use the countries data to look at the relationship between health expenditures and infant mortality per 1,000 live births. We'll only look at the subset of countries that spent more than $2000 USD per capita in 2009.

sp <- ggplot(twok, aes(x = healthexp, y = infmortality)) + geom_point()
sp

This will label only Canada and the USA, because we know where they are. We tell ggplot to put a layer on top of the regular plot that writes “USA” at the point (7400, 6.8) and “Canada” at the point (4350, 5.4).

sp + annotate("text", x = 4350, y = 5.4, label = "Canada") + annotate("text", 
    x = 7400, y = 6.8, label = "USA")

Using the geom_text() layer, we can label all of the points, using the Name variable as the label.

sp + geom_text(aes(label = Name), size = 4)

The default setting centers the names right on top of the points. Setting vjust = 0 puts the baseline of the text level with the point, and vjust = 1 puts the top of the text level with the point. Values below 0 push the label further above the point, which is what vjust = -1 does here.

sp + geom_text(aes(label = Name), size = 4, vjust = -1)

# hjust configures text alignment (left or right)
sp + geom_text(aes(label = Name), size = 4, hjust = -0.1)

This is still pretty busy, and a number of names overlap. If we're only interested in seeing the names for a couple of points, we can take a few steps that seem a little complicated as a whole but are rather simple when you break them down into pieces.

We'll first add a copy of the Name variable under a new name, Name1.

twok$Name1 <- twok$Name
head(twok)

##           Name Code Year   GDP laborrate healthexp infmortality     Name1
## 254    Andorra  AND 2009    NA        NA      3090          3.1   Andorra
## 560  Australia  AUS 2009 42131      65.2      3867          4.2 Australia
## 611    Austria  AUT 2009 45555      60.4      5037          3.6   Austria
## 968    Belgium  BEL 2009 43640      53.5      5104          3.6   Belgium
## 1733    Canada  CAN 2009 39599      67.8      4380          5.2    Canada
## 2702   Denmark  DNK 2009 55933      65.4      6273          3.4   Denmark

Now use the %in% operator to see where each name we want to display is located. This gives a logical vector indicating which entries in the first vector, twok$Name1, are present in the second vector, where we specify the names of the countries we want shown.

idx <- twok$Name1 %in% c("Canada", "Ireland", "United Kingdom", "United States", 
    "New Zealand", "Iceland", "Japan", "Netherlands")

# Then we use this Boolean vector to overwrite all the other entries in
# Name1 with NA values

twok$Name1[!idx] <- NA

We're basically saying: take the Name1 column, and wherever idx is FALSE (the country is not in our list), set the value to NA.
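
The same two steps on a toy vector, which makes it easier to see what %in% and the negated index do. The names name1, keep, and idx_toy are made up for illustration.

```r
# Toy vector of names (made up for illustration)
name1 <- c("Canada", "Denmark", "Japan", "Austria")
keep  <- c("Canada", "Japan")

# TRUE where the element of name1 appears in keep
idx_toy <- name1 %in% keep
idx_toy

# Blank out everything that is not in keep
name1[!idx_toy] <- NA
name1
```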

sp <- ggplot(twok, aes(x = healthexp, y = infmortality)) + geom_point()
sp + geom_text(aes(label = Name1), size = 4, hjust = -0.1) + xlim(2000, 10000)

## Warning: Removed 19 rows containing missing values (geom_text).

Creating a Balloon Plot

We'll use this same data to create a balloon plot, where the points are sized to represent a third variable: in this example (drawn from the R Graphics Cookbook), GDP.

head(countries)

##          Name Code Year   GDP laborrate healthexp infmortality
## 1 Afghanistan  AFG 1960 55.61        NA        NA           NA
## 2 Afghanistan  AFG 1961 55.67        NA        NA           NA
## 3 Afghanistan  AFG 1962 54.36        NA        NA           NA
## 4 Afghanistan  AFG 1963 73.20        NA        NA           NA
## 5 Afghanistan  AFG 1964 76.37        NA        NA           NA
## 6 Afghanistan  AFG 1965 94.10        NA        NA           NA

balloons <- subset(countries, Year == 2009 & healthexp > 2000 & Name %in% c("Canada", 
    "Ireland", "United Kingdom", "United States", "Luxembourg", "Switzerland", 
    "New Zealand", "Iceland", "Japan", "Netherlands"))

Here we'll map GDP to size. Left as is, size = GDP would express the value of GDP in the radius of the points, so doubling the value of GDP would quadruple the area, which distorts our view (and interpretation) of the data. We want to map GDP to area instead by using scale_size_area().

p <- ggplot(balloons, aes(x = healthexp, y = infmortality, size = GDP)) + geom_point(shape = 21, 
    color = "black", fill = "cornsilk") + xlim(0, 11000)
p

Not that visually interesting, so now we'll try it scaled to area and with names.

p + scale_size_area(max_size = 15) + geom_text(aes(label = Name), size = 4, 
    hjust = -0.3)
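
The arithmetic behind the radius versus area distinction can be checked in a few lines. This is a sketch with two hypothetical GDP values, one twice the other.

```r
# Two hypothetical GDP values, one twice the other
gdp <- c(20000, 40000)

# If the value drives the radius, area grows with the square of the value
area_from_radius <- pi * gdp^2
area_from_radius[2] / area_from_radius[1]   # ratio of areas is 4

# If the value drives the area directly, the ratio matches the data
area_direct <- gdp
area_direct[2] / area_direct[1]             # ratio of areas is 2
```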

If you were so inclined to save this, in RStudio you could click on the Export button in the lower right quadrant viewer and save it as an image or PDF, or copy it to the clipboard. If you're using plain R or want to specify how to save, use the following commands. The default unit for width and height is inches, and you can specify units if desired.

getwd()

## [1] "/Users/Chris/Desktop/Big Data stuff/R Resources/ggplot2"

setwd("/Users/Whatever your working directory is")

ggsave("myplot.pdf", width = 8, height = 8)
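
By default ggsave() saves the last plot that was displayed. Here is a sketch that passes the plot explicitly and sets the units. The temporary output path and the names p_save and out_file are assumptions for illustration; replace the path with your own.

```r
library(ggplot2)

# A small example plot to save
p_save <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Write to a temporary folder here; replace with your own path
out_file <- file.path(tempdir(), "myplot.pdf")

# Passing the plot explicitly avoids relying on the last plot displayed
ggsave(out_file, plot = p_save, width = 20, height = 20, units = "cm")

file.exists(out_file)
```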

References and Helpful Tutorials You Should Check Out

The awesome O'Reilly R Graphics Cookbook
qplot
ceb-institute
r bloggers
r for public health (three separate links)
teachpress