author: Chris Inkpen
date: March 2014
R script file location https://drive.google.com/file/d/0B73N8essZOakUDJlM3o0aEhmWlU/edit?usp=sharing
This is a brief walkthrough tutorial of the R ggplot2 graphics package developed by Hadley Wickham. Wickham's “ggplot2: Elegant Graphics for Data Analysis” is available for download on the Penn State library website. There are also a number of specific tutorials and walkthroughs (listed at the bottom). Much of the source code for this tutorial comes from the R Graphics Cookbook by Winston Chang (O'Reilly), which I highly recommend. A link to purchase the book can be found at the bottom of the page.
First, we're going to want to install the ggplot2 package along with a couple of others (just in case we need them):
install.packages('ggplot2')
install.packages('plyr')
install.packages("hexbin")
install.packages("gcookbook")
install.packages("lattice")
Next, we'll load all of the packages we need in one quick shot, as opposed to loading them one at a time.
libs <- c("ggplot2", "plyr", "hexbin", "gcookbook")
lapply(libs, require, character.only = TRUE)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: hexbin
## Loading required package: grid
## Loading required package: lattice
## Loading required package: gcookbook
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
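Note that require() returns FALSE instead of stopping when a package is missing, so a failed load is easy to overlook in the output above. The following is a minimal sketch of one way to report failures; the helper name `load_or_report` is hypothetical and not part of any package, and the usage line deliberately uses two base packages so that it runs on any R installation.

```r
# Hypothetical helper (not from any package): try to load each package
# and report the ones that could not be loaded.
load_or_report <- function(pkgs) {
  ok <- vapply(pkgs, function(p) {
    suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
  }, logical(1))
  if (any(!ok)) {
    message("Could not load: ", paste(pkgs[!ok], collapse = ", "))
  }
  invisible(ok)
}

# "stats" and "utils" ship with R, so both entries should be TRUE.
status <- load_or_report(c("stats", "utils"))
print(status)
```

In this tutorial you would pass the libs vector instead, for example load_or_report(libs).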
For this tutorial, we will be using the “diamonds” dataset that ships with ggplot2:
data(diamonds)
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Here we can see the diamonds dataset has 53940 observations (rows) with 10 variables of interest: carat, cut, color, clarity, depth, table, price, and the x, y, and z dimensions (length, width, and depth in mm).
The str() command shows us that the first variable, “carat”, is a numerical variable, followed by an ordered factor variable with 5 levels (cut = Fair, Good, etc.), and so on.
In ggplot2, there are two types of graphing methods: qplot and ggplot. qplot is similar to the base plot graphics package in R, but a little more extensible. ggplot allows for more sophisticated graphics by adding layers to “plot objects”, in the parlance of object-oriented languages.
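To look more closely at one of the ordered factors that str() reports, you can query it directly. This is a small sketch using base R functions on the cut column; the variable names `cut_is_ordered`, `cut_levels`, and `cut_counts` are just illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
data(diamonds)

# Is cut an ordered factor, and what are its levels (lowest to highest)?
cut_is_ordered <- is.ordered(diamonds$cut)
cut_levels <- levels(diamonds$cut)

# How many diamonds fall into each cut category?
cut_counts <- table(diamonds$cut)

print(cut_is_ordered)
print(cut_levels)
print(cut_counts)
```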
Commands for this portion come from the following source:
http://docs.ggplot2.org/0.9.3.1/geom_histogram.html
To create a histogram of the count of diamonds by price, first create a diamonds_small object (a random sample of 1,000 rows):
set.seed(6298)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]
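# geom_histogram() bins a continuous x; in recent ggplot2 versions geom_bar() counts discrete values instead
ggplot(diamonds_small, aes(x = price)) + geom_histogram()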
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
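The message above is ggplot2 asking you to choose a bin width yourself. As a sketch, the same histogram with an explicit width might look like this; the value 500 (dollars per bin) is an arbitrary choice for illustration, not a recommendation from the source.

```r
library(ggplot2)

# Same 1,000-row sample as in the tutorial
set.seed(6298)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# An explicit binwidth silences the "binwidth defaulted to range/30" message.
# 500 is an illustrative value; try a few to see how the shape changes.
p_hist <- ggplot(diamonds_small, aes(x = price)) +
  geom_histogram(binwidth = 500)
p_hist
```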
Use ggplot to look at the frequency distribution of price for each cut:
ggplot(diamonds_small, aes(price, ..density.., colour = cut)) + geom_freqpoly(binwidth = 1000)
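The `..density..` notation comes from the ggplot2 version this tutorial was written with. Recent releases (3.4.0 and later) deprecate it in favor of after_stat(), so on a current installation the equivalent call can be sketched as:

```r
library(ggplot2)

set.seed(6298)
diamonds_small <- diamonds[sample(nrow(diamonds), 1000), ]

# after_stat(density) refers to the density computed by the binning stat,
# which is what ..density.. meant in older ggplot2 versions.
p_freq <- ggplot(diamonds_small, aes(price, after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 1000)
p_freq
```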
If we write a plot object (hist_cut), ggplot can fill in the histogram bars by color to show the count of diamonds by cut:
hist_cut <- ggplot(diamonds_small, aes(x = price, fill = cut))
hist_cut + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
In this example the categories are stacked on top of each other. The same plotting grammar can also be used to create a density plot:
ggplot(diamonds_small, aes(price, fill = cut)) + geom_density(alpha = 0.2)
In the event that you want to use ggplot for nice univariate or bivariate box-plots, it can do that too.
p <- ggplot(diamonds, aes(cut, price))
p + geom_boxplot()
Here we want to display the relationship between two continuous variables. As with the base “plot” function in R, to plot two variables (for example, carat and price) you supply the x values and the y values. Since the diamonds dataset has about 54,000 observations, it also gives us an idea of how to deal with displaying “Big Data”: 54 thousand observations isn't that many, but it is already fairly hard to graph in an appealing way.
R basic graphing way
plot(diamonds$carat, diamonds$price)
qplot way
qplot(diamonds$carat, diamonds$price)
written another way (if the columns are in the same data frame)
qplot(carat, price, data=diamonds)
This is still not a very pretty plot (though much better than the first) because of the massive number of observations. But carat and price certainly do appear to have a positive relationship.
ggplot2 way
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
Here we're specifying our data, our aes (x value, y value), and the representation, using the geom_point command with the default point.
We can specify the shape and the size we want to use through the shape and size arguments to geom_point().
ggplot(diamonds, aes(x = carat, y = price)) + geom_point(shape = 25, size = 1)
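Shape 25 draws small downward-pointing triangles, which look a bit like cut diamonds and seem appropriate (shapes 5, 18, and 23 are the actual diamond symbols).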
What happens if you want to group points by a variable of interest, using shape and/or color? Here we'll take a look at the same graph but grouped by cut type (a 5 category ordinal variable)
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point(size = 1.5)
Pretty cool, but it's still quite crowded. Coloring lets us tell the groups apart, but realistically this is a problem of overplotted data (there are just too many points), so we should do something to make the plot less cumbersome to read. One tactic is to make the points semitransparent using the alpha argument to geom_point().
Here we'll write our plot into an object and then add the geom_point() layer to it.
easy <- ggplot(diamonds, aes(x = carat, y = price))
easy + geom_point(alpha = 0.1)
easy + geom_point(alpha = 0.05)
Another way to get around this is to “bin” the points into rectangles. We can then map the density of the points in the rectangles using a fill color for the rectangles.
bin <- ggplot(diamonds, aes(x = carat, y = price))
bin + stat_bin2d()
This doesn't look too good on its own, so we'll switch to hexagonal bins (which rely on the hexbin package) and fiddle with the colors of the “counts” for the bins.
bin + stat_binhex() + scale_fill_gradient(low = "lightblue", high = "red", breaks = c(0,
500, 1000, 2000, 4000, 6000, 8000), limits = c(0, 8000)) + stat_smooth(method = lm) +
ylim(0, 20000)
## Warning: Removed 38 rows containing missing values (geom_path).
To show scatterplots grouped by individual features, we'll use the mpg dataset, which describes cars and their fuel economy in miles per gallon.
data(mpg)
head(mpg)
## manufacturer model displ year cyl trans drv cty hwy fl class
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
## 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
Using the str() command, you can take a look at the structure of the data. Here, we can use color to distinguish points by a particular feature.
qplot(data = mpg, x = displ, y = hwy, color = manufacturer)
qplot(data = mpg, x = displ, y = hwy, color = class)
You can also do the same thing to subsets of the data - Faceting
It's common to want to recreate the same type of plot for different classes or subsets of our data (i.e. based on the values of some factor, aka a categorical variable).
qplot() makes this amazingly easy by providing a facets= argument, and ggplot() does the same through facet layers such as facet_grid().
qplot(data = mpg, x = displ, y = hwy, color = manufacturer, facets = ~class)
Now we'll do the same with ggplot, starting from a base plot object:
myGG <- ggplot(mpg, aes(x = displ, y = hwy))
We'll go into this in more detail in the next section, but here is how to set a trend line over the groups.
myGG + geom_point(aes(color = manufacturer)) + stat_smooth(method = lm, se = FALSE)
With facets
myGG + geom_point(aes(color = manufacturer)) + stat_smooth(method = lm, se = FALSE) +
facet_grid(class ~ .)
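facet_grid(class ~ .) stacks all the panels in a single column, which can get cramped with seven classes. As an alternative sketch, facet_wrap() lays the same panels out in a grid; the choice of three columns here is arbitrary.

```r
library(ggplot2)

myGG <- ggplot(mpg, aes(x = displ, y = hwy))

# facet_wrap() wraps one panel per class into rows and columns;
# ncol = 3 is an illustrative choice.
p_wrap <- myGG +
  geom_point(aes(color = manufacturer)) +
  facet_wrap(~ class, ncol = 3)
p_wrap
```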
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
Here is our ggplot base plot:
sp <- ggplot(cars, aes(x = speed, y = dist))
sp + geom_point()
Now we will add the line of best fit with the stat_smooth(method=lm) command.
sp + geom_point() + stat_smooth(method = lm)
This adds a default 95% confidence region for the regression fit. To get a 99% confidence region, you just need to specify it as a level.
sp + geom_point() + stat_smooth(method = lm, level = 0.99)
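To see what the shaded region represents, you can fit the same model directly. This is a rough sketch that assumes stat_smooth(method = lm) fits the simple regression dist ~ speed; the two speeds in `new_speeds` are arbitrary illustration points.

```r
# Fit the same straight line that stat_smooth(method = lm) draws
fit <- lm(dist ~ speed, data = cars)
slope <- coef(fit)[["speed"]]

# Confidence intervals for the fitted mean at two example speeds
new_speeds <- data.frame(speed = c(10, 20))
band_95 <- predict(fit, newdata = new_speeds, interval = "confidence", level = 0.95)
band_99 <- predict(fit, newdata = new_speeds, interval = "confidence", level = 0.99)

# The 99% band is wider than the 95% band at every point
width_95 <- band_95[, "upr"] - band_95[, "lwr"]
width_99 <- band_99[, "upr"] - band_99[, "lwr"]
print(slope)
print(cbind(width_95, width_99))
```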
To turn the confidence region off, set se (short for standard errors) to FALSE.
sp + geom_point() + stat_smooth(method = lm, se = FALSE)
# To change up the color to a fast car theme.
sp + geom_point(color = "red") + stat_smooth(method = lm, se = FALSE, color = "black")
Finally, we can look at a Loess curve instead of a regular linear model line.
# Loess
sp + geom_point() + stat_smooth(method = loess, se = FALSE)
# This uses a locally weighted polynomial curve.
data(countries)
head(countries)
## Name Code Year GDP laborrate healthexp infmortality
## 1 Afghanistan AFG 1960 55.61 NA NA NA
## 2 Afghanistan AFG 1961 55.67 NA NA NA
## 3 Afghanistan AFG 1962 54.36 NA NA NA
## 4 Afghanistan AFG 1963 73.20 NA NA NA
## 5 Afghanistan AFG 1964 76.37 NA NA NA
## 6 Afghanistan AFG 1965 94.10 NA NA NA
twok <- subset(countries, Year == 2009 & healthexp > 2000)
head(twok)
str(twok)
Taking an example from the R Graphics Cookbook, we'll use annotate() and geom_text() to label a few points. We can take the countries data to look at the relationship between health expenditures and infant mortality per 1,000 live births. We'll only look at the subset that spent more than $2000 USD per capita.
sp <- ggplot(twok, aes(x = healthexp, y = infmortality)) + geom_point()
sp
This will label only Canada and the USA. Because we know roughly where their points are, we tell ggplot to put a layer on top of the regular plot that writes “Canada” at point (4350, 5.4) and “USA” at point (7400, 6.8).
sp + annotate("text", x = 4350, y = 5.4, label = "Canada") + annotate("text",
x = 7400, y = 6.8, label = "USA")
Using the geom_text() layer, we can put labels on all the points, using the Name variable as the label.
sp + geom_text(aes(label = Name), size = 4)
The automatic setting places the names right on top of the points and centers them. vjust = 0 puts the baseline of the text on the same level as the point, and vjust = 1 puts the top of the text level with the point. Values below 0, such as the vjust = -1 used here, push the text above the point.
sp + geom_text(aes(label = Name), size = 4, vjust = -1)
# hjust configures text alignment (left or right)
sp + geom_text(aes(label = Name), size = 4, hjust = -0.1)
This is still pretty busy and there are a number of names that overlap. If we're only interested in seeing the plotted names for a couple of points, we can take a few steps that seem a little complicated as a whole but will pay off and are rather simple when you break them down into pieces.
We'll first add a copy of the name variable with a new name
twok$Name1 <- twok$Name
head(twok)
## Name Code Year GDP laborrate healthexp infmortality Name1
## 254 Andorra AND 2009 NA NA 3090 3.1 Andorra
## 560 Australia AUS 2009 42131 65.2 3867 4.2 Australia
## 611 Austria AUT 2009 45555 60.4 5037 3.6 Austria
## 968 Belgium BEL 2009 43640 53.5 5104 3.6 Belgium
## 1733 Canada CAN 2009 39599 67.8 4380 5.2 Canada
## 2702 Denmark DNK 2009 55933 65.4 6273 3.4 Denmark
Now use the %in% operator to see where each name we want to display is located. This gives a logical vector indicating which entries in the first vector, twok$Name1, are present in the second vector, where we specify the names of the countries we want shown.
idx <- twok$Name1 %in% c("Canada", "Ireland", "United Kingdom", "United States",
"New Zealand", "Iceland", "Japan", "Netherlands")
# Then we use this Boolean vector to overwrite all the other entries in
# Name1 with NA values
twok$Name1[!idx] <- NA
We're basically saying: take the Name1 column, and wherever idx is FALSE, set the entry to NA.
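The %in% and NA-overwrite steps are easier to follow on a tiny vector. This sketch uses four made-up entries in place of the real twok$Name1 column; the names `names_all`, `keep`, and `labels` are illustrative only.

```r
# Toy stand-in for twok$Name1
names_all <- c("Canada", "Austria", "Japan", "Belgium")

# TRUE where the name is one we want to display
keep <- names_all %in% c("Canada", "Japan")
print(keep)

# Copy the names, then blank out everything we do not want labelled
labels <- names_all
labels[!keep] <- NA
print(labels)
```

geom_text() skips the NA labels, which is why it later warns about removed rows.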
sp <- ggplot(twok, aes(x = healthexp, y = infmortality)) + geom_point()
sp + geom_text(aes(label = Name1), size = 4, hjust = -0.1) + xlim(2000, 10000)
## Warning: Removed 19 rows containing missing values (geom_text).
We'll use this same data to create a balloon plot, where the points are sized to represent a third value; in this example (drawn from the R Graphics Cookbook), that value is GDP.
head(countries)
## Name Code Year GDP laborrate healthexp infmortality
## 1 Afghanistan AFG 1960 55.61 NA NA NA
## 2 Afghanistan AFG 1961 55.67 NA NA NA
## 3 Afghanistan AFG 1962 54.36 NA NA NA
## 4 Afghanistan AFG 1963 73.20 NA NA NA
## 5 Afghanistan AFG 1964 76.37 NA NA NA
## 6 Afghanistan AFG 1965 94.10 NA NA NA
balloons <- subset(countries, Year == 2009 & healthexp > 2000 & Name %in% c("Canada",
"Ireland", "United Kingdom", "United States", "Luxembourg", "Switzerland",
"New Zealand", "Iceland", "Japan", "Netherlands"))
Here we'll map GDP to size. In the ggplot2 version used for this tutorial, size = GDP on its own maps the value to the radius of each point, so doubling GDP would quadruple the area, which distorts our view (and interpretation) of the data. We therefore map GDP to area instead by using scale_size_area(). (Newer ggplot2 releases scale area by default, but scale_size_area() additionally ensures that a value of 0 maps to a size of 0.)
p <- ggplot(balloons, aes(x = healthexp, y = infmortality, size = GDP)) + geom_point(shape = 21,
color = "black", fill = "cornsilk") + xlim(0, 11000)
p
Not that visually interesting, so now we'll try it scaled to area and with names.
p + scale_size_area(max_size = 15) + geom_text(aes(label = Name), size = 4,
hjust = -0.3)
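The radius-versus-area distortion mentioned above is simple arithmetic, and it can be checked in a few lines of base R. This is only an illustration of the geometry, not of ggplot2's internal scaling code.

```r
# A circle's area grows with the square of its radius
radius <- c(1, 2)
area <- pi * radius^2
area_ratio <- area[2] / area[1]   # doubling the radius quadruples the area
print(area_ratio)

# For the AREA to double when the value doubles, the radius must grow
# by sqrt(2), which is the idea behind area-proportional scaling.
radius_for_double_area <- sqrt(2) * radius[1]
area_check <- (pi * radius_for_double_area^2) / area[1]
print(area_check)
```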
If you want to save this plot in RStudio, you can click the Export button in the viewer pane (lower right by default) and save it as an image or PDF, or copy it to the clipboard. If you're using plain R, or want to control how the plot is saved, use the following commands. The default unit is inches, and you can specify a different unit if desired.
getwd()
## [1] "/Users/Chris/Desktop/Big Data stuff/R Resources/ggplot2"
setwd("/Users/Whatever your working directory is")
ggsave("myplot.pdf", width = 8, height = 8)
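By default ggsave() saves the last plot displayed. A slightly more explicit sketch passes the plot object and the units yourself; the temporary directory and the name `p_save` are only used here so the example runs anywhere without touching your working directory.

```r
library(ggplot2)

# Build a plot object to save explicitly (instead of relying on the last plot)
p_save <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Write to a temporary folder for this illustration
out_file <- file.path(tempdir(), "myplot.pdf")

# units can be "in" (the default), "cm", or "mm"
ggsave(out_file, plot = p_save, width = 20, height = 20, units = "cm")
print(file.exists(out_file))
```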
Resources: the awesome O'Reilly R Graphics Cookbook, qplot, ceb-institute, R-bloggers, R for Public Health, teachpress