This is a short tutorial for learning ggplot2.
If you have any questions, please contact me at:
chutang[AT]clarkson[Dot]edu
The Iris flower data set, or Fisher’s Iris data set (type head(iris) in the R console to see it), is a multivariate data set introduced by Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis.
We will use the iris dataset for the ggplot2 examples in this tutorial.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
ggplot(data = , aes(x =, y =, ...)) + geom_xxx()
ggplot(): starts a plot object and specifies the data
geom_xxx(): sets the geometry of your plot. There are many types of geometry (point, boxplot, histogram…); check the cheatsheet for more details
aes(): specifies the “aesthetic” mappings; a legend is created automatically
- Recall the grammar of graphics: each part of the code stands for a layer. You can simply use + to combine them all together.
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point()
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) + geom_point()
Factor variable
In the above example, Species is a factor variable, which is used to identify different categories. We call it the Group Identifier (GI). The GI is normally a numeric variable with a finite set of values (a string variable is also possible). You need to make sure the factor variable is defined properly in your plot, so that R knows which variable you are using as the GI.
Here, using str(iris), you can see that Species has already been defined as a factor variable. If a potential factor variable is not defined as a factor, you need to use factor(variable name) to specify it explicitly.
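As a minimal sketch of that check, the lines below look at a numeric column and convert it with factor(). The column cyl from the built-in mtcars data and the name cyl_groups are chosen here only for illustration.

```r
# mtcars stores the number of cylinders as plain numbers, not as a factor
class(mtcars$cyl)
is.factor(mtcars$cyl)

# factor() turns the distinct values into category levels
cyl_groups <- factor(mtcars$cyl)
levels(cyl_groups)
is.factor(cyl_groups)
```

The same conversion can be written directly inside aes(), which is what the exercise below asks for.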
Use the mtcars dataset to draw a scatter plot showing the relationship between mpg (X axis) and wt (Y axis). Please make sure you identify the different groups (gear) properly.
- hint: make sure you use factor(gear) to specify the GI. You can choose to colour or to shape the different groups
Your graph should look like the one below:
ggplot(mtcars, aes(x = mpg, y = wt, colour = factor(gear))) + geom_point()
If we forget to define the factor variable, what will happen?
ggplot(mtcars, aes(x = mpg, y = wt, colour = gear)) + geom_point()
Now, let us get back to the iris dataset. We want to graph the distribution of Sepal.Length across all Species.
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()
What if we want to show the distribution for each Species group?
ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
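The message above suggests choosing a bin width yourself. A minimal sketch follows; the value 0.2 and the name p_hist are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# binwidth is given in the units of the x variable (Sepal.Length is in cm);
# 0.2 is an arbitrary illustrative value
p_hist <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2)
p_hist
```

Trying a few different values is usually worthwhile, since the shape of a histogram can change noticeably with the bin width.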
You can also show a density graph by simply using a geom_density() layer instead:
ggplot(iris, aes(x = Sepal.Length, colour = Species)) + geom_density()
A boxplot is a standardized way of illustrating the distribution of data based on the five-number summary (minimum, first quartile, median, third quartile, and maximum).
In our example, we want to see the distribution of Sepal.Length for all three Species.
Teaser: boxplot is very useful for illustrating ANOVA results
ggplot(iris,
aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
ggplot(iris,
aes(x = Species, y = Sepal.Length)) +
geom_boxplot() + coord_flip()
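To connect the boxes to the five-number summary, the numbers can also be computed directly. This is only a sketch using base R's fivenum(); the names sepal_by_species and five_num are made up for this example. Note that geom_boxplot() draws points lying more than 1.5 times the interquartile range beyond the box as separate outlier points, so a whisker does not always reach the minimum or maximum.

```r
# Split Sepal.Length by Species and compute the five-number summary of each group
sepal_by_species <- split(iris$Sepal.Length, iris$Species)
five_num <- sapply(sepal_by_species, fivenum)

# Label the rows; fivenum() returns its values in this order
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")
five_num
```

Each column of the result corresponds to one box in the plot above.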
What does the following code do? Does it work? Does it make sense? Why/Why not? (hint: think about the concept of grammar of graphics again!)
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + geom_point(position = "jitter")
How about this one:
ggplot(iris, aes(Species, Sepal.Length)) + geom_point(position = "jitter") + geom_boxplot()
A trend line can aid the eye in seeing patterns in the presence of overplotting. Use geom_smooth() to add a trend line to your graph:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + geom_smooth()
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + geom_smooth(aes(colour = Species))
Remember: a trend line does not necessarily describe the regression results of your data. It may be very DIFFERENT from the regression line of your model.
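One way to see the difference is to draw both kinds of line on the same data. The sketch below compares the default smoother with method = "lm"; the object names are made up for this example. Even the "lm" line is only a simple regression of y on x, so it may still differ from a model that includes other predictors.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

# Default smoother: for a data set this small, geom_smooth() fits a local (loess) curve
p_loess <- base_plot + geom_smooth()

# Straight line from a simple linear model of Sepal.Width on Sepal.Length
p_lm <- base_plot + geom_smooth(method = "lm")

p_loess
p_lm
```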
Aesthetics (“aes()”) refer to the attributes of the data you want to display. As an example, the aesthetics available for geom_point() include: x, y, alpha, colour, fill, shape, and size.
You can easily find out which aesthetics you can use for a given geom layer by typing ?geom_xxx
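Besides the help page, the aesthetics can also be listed from the console. The sketch below assumes the exported GeomPoint object that ggplot2 uses behind geom_point(); the help page remains the documented reference, and the exact list may vary between ggplot2 versions.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot work without
GeomPoint$required_aes

# All aesthetics that geom_point() understands
point_aes <- GeomPoint$aesthetics()
point_aes
```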
Observe the following code; each line specifies the same aesthetics in a different place:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point()
ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(colour = Species))
ggplot(iris, aes(x = Sepal.Length)) + geom_point(aes(y = Sepal.Width, colour = Species))
This flexibility in where aesthetics are specified becomes important when you have more than one layer.
Now, try the following code in your console and observe the differences:
- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point() + geom_smooth()
- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(colour = Species)) + geom_smooth()
- ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_smooth()
- ggplot(iris) + geom_point() + geom_smooth(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
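When comparing these, it helps to remember that aesthetics given in ggplot() are inherited by every layer, while aesthetics given inside a geom apply to that layer only. The sketch below checks two of the variants without drawing them. try_build() is a hypothetical helper written for this example; it returns TRUE if the plot can be built and the error message otherwise.

```r
library(ggplot2)

# Hypothetical helper: TRUE if the plot can be built, otherwise the error message
try_build <- function(p) {
  tryCatch({
    ggplot_build(p)
    TRUE
  }, error = function(e) conditionMessage(e))
}

# x and y set globally: both layers can see them
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(colour = Species)) + geom_smooth()

# x and y set only inside geom_point(): geom_smooth() does not receive them
p_local <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_smooth()

try_build(p_global)
try_build(p_local)
```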
Faceting generates small multiples each showing a different subset of the data. With small multiples, you can rapidly compare patterns in different parts of the data. It is very powerful and useful for exploratory data analysis. There are three types of faceting:
- facet_null(): a single plot, the default.
- facet_wrap(): “wraps” a 1-dimensional ribbon of panels into 2 dimensions,
facet_wrap(~ GI)
- facet_grid(): produces a 2d grid of panels defined by variables which form the rows and columns.
facet_grid(row ~ col)
base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
base + facet_wrap(~Species)
base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
base + facet_grid(Species ~ .)
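The layout of the panels can be changed with the same two functions. A short sketch: putting the variable on the right-hand side of facet_grid() arranges the panels in columns, and the ncol argument of facet_wrap() sets the number of columns explicitly. The object names are made up for this example.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

# Species on the right of ~ : one column of the grid per species
p_cols <- base + facet_grid(. ~ Species)

# One column only, so the three panels are stacked vertically
p_stack <- base + facet_wrap(~Species, ncol = 1)

p_cols
p_stack
```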
Faceting is an alternative approach to using aesthetics (like colour, shape or size) to differentiate groups. It is good when groups overlap a lot, but it makes small differences harder to observe. Remember the graph we drew before:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + geom_point()
library(dplyr)
iris_new <- select(iris, -Species)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(data = iris_new,
colour = "grey70") + geom_point(aes(colour = Species)) + facet_wrap(~Species)
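The plot above works because iris_new has no Species column. A layer whose data lacks the faceting variable is repeated in every panel, so the grey points show all 150 flowers as a background, while the coloured layer shows only the species belonging to that panel. The sketch below draws the same plot without dplyr; the name iris_bg is made up for this example.

```r
library(ggplot2)

# Base-R equivalent of select(iris, -Species): keep every column except Species
iris_bg <- iris[, names(iris) != "Species"]

p_bg <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(data = iris_bg, colour = "grey70") +  # no Species column: drawn in every panel
  geom_point(aes(colour = Species)) +               # has Species: drawn only in its own panel
  facet_wrap(~Species)
p_bg
```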
The theming system in ggplot2 enables a user to control the non-data elements of a ggplot object. It makes ggplot2 a flexible and powerful graphing tool for data visualization. The ggthemes package adds extra ready-made themes.
Installation: > install.packages("ggthemes")
Usage: > library(ggthemes)
… + theme_xxx()
You can find more details about ggthemes at:
[https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html]
library(ggthemes)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(data = iris_new,
colour = "grey70") + geom_point(aes(colour = Species)) + facet_wrap(~Species) +
theme_solarized()
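Themes are also available without installing anything extra. As a small sketch, theme_bw() ships with ggplot2 itself, and theme() can then adjust individual non-data elements; moving the legend to the bottom is just one arbitrary example.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# theme_bw() is a complete built-in theme; theme() then changes a single element
p_themed <- p + theme_bw() + theme(legend.position = "bottom")
p_themed
```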
ggplot2
- Scales, axes and legends
- Positioning
- Coordinate systems