The ggplot2 package in R is an implementation of The Grammar of Graphics as described by Leland Wilkinson in his book. The package was originally written by Hadley Wickham while he was a graduate student at Iowa State University (he still actively maintains the packgae). The package implements what might be considered a third graphics system for R (along with base graphics and lattice).
You might ask yourself “Why do we need a grammar of graphics?” Well, for much the same reasons that having a grammar is useful for spoken languages. The grammer allows for a more compact summary of the base components of a language, and it allows us to extend the language and to handle situations that we have not before seen.
For example, consider the following plot made using base graphics.
with(airquality, {
plot(Temp, Ozone)
lines(loess.smooth(Temp, Ozone))
})
The base plotting system is convenient and it often mirrors how we think of building plots and analyzing data. But a key drawback is that you can’t go back once plot has started (e.g. to adjust margins), so there is in fact a need to plan in advance. Furthermore, it is difficult to “translate” a plot to others because there’s no formal graphical language; each plot is just a series of R commands.
Here’s the same plot made using ggplot2.
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE)
## Warning: Removed 37 rows containing non-finite values (stat_smooth).
## Warning: Removed 37 rows containing missing values (geom_point).
Note that the output is roughly equivalent, and the amount of code is similar, but ggplot2 allows for a more elegant way of expressing the components of the plot. In this case, the plot is a dataset (airquality) with aesthetic mappings derived from the Temp and Ozone variables, a set of points, and a smoother. In a sense, the ggplot2 system takes many of the cues from the base plotting system and formalizes them a bit.
The ggplot2 system also takes some cues from lattice. With the lattice system, plots are created with a single function call (xyplot, bwplot, etc.). Things like margins and spacing are set automatically because the entire plot is specified at once. The lattice system is most useful for conditioning types of plots and is good for putting many many plots on a screen. That said, it is sometimes awkward to specify an entire plot in a single function call because many different options have to be specified at once. Furthermore, annotation in plots is not intuitive and the use of panel functions and subscripts is difficult to wield and requires intense preparation.
The ggplot2 system essentially takes the good parts of both the base graphics and lattice graphics system. It automatically handles things like margins and spacing, and also has the concept of “themes” which provide a default set of plotting symbols and colors. While ggplot2 bears a superficial similarity to lattice, ggplot2 is generally easier and more intuitive to use. The default thems makes many choices for you, but you can customize the presentation if you want.
The qplot() function in ggplot2 is meant to get you going quickly. It works much like the plot() function in base graphics system. It looks for variables to plot within a data frame, similar to lattice, or in the parent environment. In general, it’s good to get used to putting your data in a data frame and then passing it to qplot().
Plots are made up of aesthetics (size, shape, color) and geoms (points, lines). Factors play an important role for indicating subsets of the data (if they are to have different properties) so they should be labeled properly. The qplot() hides much of what goes on underneath, which is okay for most operations, ggplot() is the core function and is very flexible for doing things qplot() cannot do.
One thing that is always true, but is particularly useful when using ggplot2, is that you should always use informative and descriptive labels on your data. More generally, your data should have appropriate metadata so that you can quickly look at a dataset and know:
This means that each column of a data frame should have a meaningful (but concise) variable name that accurately reflects the data stored in that column. Also, non-numeric or categorical variables should be coded as factor variables and have meaningful labels for each level of the factor.
For example, it’s common to code a binary variable as a “0” or a “1”, but the problem is that from quickly looking at the data, it’s impossible to know whether which level of that variable is represented by a “0” or a “1”. Much better to simply label each observation as what they are. If a variable represents temperature categories, it might be better to use “cold”, “mild”, and “hot” rather than “1”, “2”, and “3”.
While it’s sometimes a pain to make sure all of your data are properly labelled, this investment in time can pay dividends down the road when you’re trying to figure out what you were plotting. In other words, including the proper metadata can make your exploratory plots essentially self-documenting.
library(ggplot2)
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
You can see from the str() output that all of the factor variables are appropriately coded with meaningful labels. This will come in handy when qplot() has to label different aspects of a plot. Also note that all of the columns/variables have meaningful (if sometimes abbreviated) names, rather than names like “X1”, and “X2”, etc.
We can make a quick scatterplot of the engine displacement (displ) and the highway miles per gallon (hwy).
qplot(displ, hwy, data = mpg)
qplot(displ, hwy, data = mpg, color = drv)
** Sometimes it’s nice to add a smoother to a scatterplot ot highlight any trends. Trends can be difficult to see if the data are very noisy or there are many data points obscuring the view. A smooth is a “geom” that you can add along with your data points.
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
qplot(displ, hwy, data = mpg, color = drv, geom = c("point", "smooth"))
Note that previously, we didn’t have to specify geom = “point” because that was done automatically. But if you want the smoother overlayed with the points, then you need to specify both explicitly.
Here it seems that engine displacement and highway mileage have a nonlinear U-shaped relationship, but from the previous plot we know that this is largely due to confounding by the drive class of the car.
NOTE - In statistics, a confounding variable (also confounding factor, a confound, or confounder) is an extraneous variable in a statistical model that correlates (directly or inversely) with both the dependent variable and the independent variable.
qplot(hwy, data = mpg, fill = drv, binwidth = 2)
qplot(drv, hwy, data = mpg, geom = "boxplot")
Facets are a way to create multiple panels of plots based on the levels of categorical variable. Here, we want to see a histogram of the highway mileages and the categorical variable is the drive class variable. We can do that using the facets argument to qplot().
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
qplot(displ, hwy, data = mpg, facets = . ~ drv)
qplot(displ, hwy, data = mpg, facets = . ~ drv) + geom_smooth()
# Load the DF for the MAACS data
load("maacs.rda")
# Spot Check
head(maacs)
## id eno duBedMusM pm25 mopos
## 1 1 141 2423 15.560 yes
## 2 2 124 2793 34.370 yes
## 3 3 126 3055 38.953 yes
## 4 4 164 775 33.249 yes
## 5 5 99 1634 27.060 yes
## 6 6 68 939 18.890 yes
# Check the structure of the data
str(maacs)
## 'data.frame': 750 obs. of 5 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ eno : num 141 124 126 164 99 68 41 50 12 30 ...
## $ duBedMusM: num 2423 2793 3055 775 1634 ...
## $ pm25 : num 15.6 34.4 39 33.2 27.1 ...
## $ mopos : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
The key variables are:
eno: exhaled nitric oxide
The outcome of interest for this analysis will be exhaled nitric oxide (eNO), which is a measure of pulmonary inflamation. We can get a sense of how eNO is distributed in this population by making a quick histogram of the variable. Here, we take the log of eNO because some right-skew in the data.
qplot(log(eno), data = maacs)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 108 rows containing non-finite values (stat_bin).
qplot(log(eno), data = maacs, fill = mopos)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 108 rows containing non-finite values (stat_bin).
We can see from this plot that the non-allergic subjects are shifted slightly to the left, indicating a lower eNO and less pulmonary inflammation. That said, there is significant overlap between the two groups.
An alternative to histograms is a density smoother, which sometimes can be easier to visualize when there are multiple groups. Here is a density smooth of the entire study population.
qplot(log(eno), data = maacs, geom = "density")
## Warning: Removed 108 rows containing non-finite values (stat_density).
qplot(log(eno), data = maacs, geom = "density", color = mopos)
## Warning: Removed 108 rows containing non-finite values (stat_density).
These tell the same story as the stratified histograms, which sould come as no surprise.
Now we can examine the indoor environment and its relationship to eNO. Here, we use the level of indoor PM2.5 as a measure of indoor environment air quality. We can make a simple scatterplot of PM2.5 and eNO.
qplot(log(pm25), log(eno), data = maacs, geom = c("point", "smooth"))
## Warning: Removed 184 rows containing non-finite values (stat_smooth).
## Warning: Removed 184 rows containing missing values (geom_point).
The relationship appears modest at best, as there is substantial noise in the data. However, one question that we might be interested in is whether allergic individuals are prehaps more sensitive to PM2.5 inhalation than non-allergic individuals. To examine that question we can stratify the data into two groups.
This first plot uses different plot symbols for the two groups and overlays them on a single canvas. We can do this by mapping the mopos variable to the shape aesthetic.
qplot(log(pm25), log(eno), data = maacs, shape = mopos)
## Warning: Removed 184 rows containing missing values (geom_point).
Because there is substantial overlap in the data it is a bit challenging to discern the circles from the triangles. Part of the reason might be that all of the symbols are the same color (black).
We can plot each group a different color to see if that helps.
qplot(log(pm25), log(eno), data = maacs, color = mopos)
## Warning: Removed 184 rows containing missing values (geom_point).
qplot(log(pm25), log(eno), data = maacs, color = mopos) + geom_smooth(method = "lm")
## Warning: Removed 184 rows containing non-finite values (stat_smooth).
## Warning: Removed 184 rows containing missing values (geom_point).
Here we see quite clearly that the red group and the green group exhibit rather different relationships between PM2.5 and eNO. For the non-allergic individuals, there appears to be a slightly negative relationship between PM2.5 and eNO and for the allergic individuals, there is a positive relationship.
This suggests a strong interaction between PM2.5 and allergic status, an hypothesis perhaps worth following up on in greater detail than this brief exploratory analysis.
Another, and perhaps more clear, way to visualize this interaction is to use separate panels for the non-allergic and allergic individuals using the facets argument to qplot().
qplot(log(pm25), log(eno), data = maacs, facets = . ~ mopos) + geom_smooth(method = "lm")
## Warning: Removed 184 rows containing non-finite values (stat_smooth).
## Warning: Removed 184 rows containing missing values (geom_point).