Base R Visualization. For an introduction to data visualization in R, see Visualization in Base R.
Introduction to ggplot2. To view the first part of this series on visualization in ggplot2, see Visualization with ggplot2: Part I.
Installing Packages. Packages, or libraries, extend R’s functionality.
- Install a package with install.packages() and the package name in quotes ("")
- If you already have ggplot2 installed, you do not need to reinstall it, just load it
- The following installs ggplot2:
install.packages("ggplot2")
Additional Packages. Because we’ll focus on graphical themes in this session:
- Install ggthemes and GGally to enhance the ggplot2 package
- As with ggplot2, skip this if ggthemes is already installed
- ggthemes adds a number of new premade themes to ggplot2
install.packages("ggthemes")
install.packages("GGally")
Loading Packages. Once installed, packages must be loaded into your session:
- Load a package with function library()
- Quotes ("") are optional for package names
library(ggplot2)
library(ggthemes)
library(GGally)
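Because an installed package does not need to be reinstalled, one possible pattern is to install only what is missing. The following is a minimal sketch, not part of the original session; the helper name `install_if_missing` is hypothetical.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)  # names of the packages that had to be installed
}
```

Calling install_if_missing(c("ggplot2", "ggthemes", "GGally")) before the library() calls would then skip anything already present.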
Visualization Shorthand. Graphics have a grammar, much like human language has parts of speech, e.g. verbs, nouns, adjectives, etc.:
- This "grammar of graphics" is the gg in ggplot2
- ggplot2 was designed by Hadley Wickham using this framework
Layers. “Layers” are the equivalent of parts of speech in human language; there are 7 in total:
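- Data (ggplot())
- Aesthetics (aes())
- Geometries (geom_*())
- Facets (facet_*())
- Statistics (stat_*())
- Coordinates (coord_*())
- Themes (theme_*())
Essential Layers. Three layers are essential to any ggplot2 visualization:
- Data
- Aesthetics
- Geometries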
Observe. Here we use the practice dataset diamonds, which we call with function data():
- Chain ggplot2 functions with the addition operator, or +
- The diamonds dataset is built into ggplot2
- Load diamonds with function data()
data(diamonds)
ggplot(diamonds) + # Data
aes(x = carat, y = price) + # Aesthetics, here x- and y-axes
geom_point() # Geometries, here a scatterplot
“Proper” Syntax. Though the above is easier, function aes() should be nested in ggplot()
ggplot(diamonds, # Separate arguments with commas
aes(x = carat,
y = price)) + # Double closing parentheses end both ggplot() and aes()
geom_point()
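The examples so far use only the three essential layers. As a rough sketch of how all seven layer functions listed earlier can combine in one plot, the specific choices here (faceting by cut, a linear smoother, the y-axis limits, the theme) are illustrative assumptions rather than recommendations from the session:

```r
library(ggplot2)

p_layers <- ggplot(diamonds,                    # 1. Data
                   aes(x = carat, y = price)) + # 2. Aesthetics
  geom_point(alpha = 0.1) +                     # 3. Geometries
  facet_wrap(~ cut) +                           # 4. Facets: one panel per cut
  stat_smooth(method = "lm", se = FALSE) +      # 5. Statistics: linear fit
  coord_cartesian(ylim = c(0, 20000)) +         # 6. Coordinates: zoom the y-axis
  theme_minimal()                               # 7. Themes

p_layers  # Printing the object draws the plot
```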
Data Ink. The aesthetics layer is only used to show data ink, in function aes()
- Map variables with arguments x =, y =, color =, fill =, size =, etc.
ggplot(diamonds,
aes(x = carat,
y = price,
color = clarity)) + # Argument "color =" shows variable "clarity"
geom_point()
Non-Data Ink. The geometries layer also shows non-data ink in geom_*() functions
- Apart from x = and y =, all data ink arguments are the same
ggplot(diamonds,
aes(x = carat,
y = price)) +
geom_point(alpha = 0.075, # Argument "alpha =" sets opacity (7.5% opaque)
color = "tomato") # Argument "color =" makes your data pop
Conclusion. Combine both data and non-data ink for optimal effect.
ggplot(diamonds,
aes(x = carat,
y = price,
color = clarity)) +
geom_point(alpha = 0.35) # Argument "alpha =" sets opacity (35% opaque)
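The difference between data ink and non-data ink comes down to whether an argument sits inside or outside aes(). A small sketch of the three cases, including a common pitfall; the object names `p_data`, `p_fixed`, and `p_pitfall` are made up for illustration:

```r
library(ggplot2)

# Data ink: color is mapped to a variable inside aes(), so a legend appears
p_data <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()

# Non-data ink: a fixed color is set outside aes(), inside the geometry
p_fixed <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = "tomato")

# Pitfall: a constant inside aes() is treated as data, so ggplot2 builds
# a legend for the value "tomato" and picks the color itself
p_pitfall <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = "tomato"))
```

Printing each object in turn (for example, by typing p_pitfall in the console) shows the difference.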
There are 60+ geometry functions in package ggplot2. They all begin with geom_.
- Type geom_ in the RStudio console to browse them
What’s Good? Over time, which geometries to use becomes intuitive.
Scatter Plots measure two continuous (numeric) variables, like length and width:
- Use geom_point() for a typical scatter plot
- Add random noise with geom_jitter()
data(iris)
ggplot(iris,
aes(x = Petal.Length, # Length in centimeters
y = Petal.Width, # Width in centimeters
color = Species)) +
geom_point()
ggplot(iris,
aes(x = Petal.Length, # Length in centimeters
y = Petal.Width, # Width in centimeters
color = Species)) +
geom_jitter()
Bar Plots measure one continuous (numeric) and one categorical variable:
- geom_bar() needs argument stat = to tell R how to plot your geometry
- options(scipen = 999) disables scientific notation (orders of magnitude)
options(scipen = 999) # Disable scientific notation
ggplot(diamonds,
aes(x = cut, # Categorical (Quality of Cut)
y = price)) + # Continuous (US Dollars)
geom_bar(stat = "identity", # Argument "stat =" tells R how to handle stats
color = "skyblue")
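With stat = "identity", every row is stacked, so each bar's height is the total price of all diamonds with that cut. If the mean price per cut is what you want, one alternative is sketched below. It assumes ggplot2 3.3.0 or later, where the summary statistic accepts argument fun =; the object name `p_mean` is made up.

```r
library(ggplot2)

# stat = "summary" with fun = "mean" plots one bar per cut at the mean price
p_mean <- ggplot(diamonds,
                 aes(x = cut,       # Categorical (Quality of Cut)
                     y = price)) +  # Continuous (US Dollars)
  geom_bar(stat = "summary",
           fun = "mean",
           fill = "skyblue")        # "fill =" colors the bar areas

p_mean
```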
Histograms visualize distributions of a single continuous variable, like mpg:
- geom_histogram() automatically chooses bin widths (30 bins by default; override with bins = or binwidth =)
- fill = is needed for areas and color = is for lines and points
data(diamonds)
ggplot(diamonds,
aes(x = carat)) +
geom_histogram(fill = "dodgerblue3") # Argument "fill =" needed for areas
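By default, geom_histogram() uses 30 bins and prints a message suggesting you pick a better value. A brief sketch of setting the bin width yourself; the width of 0.1 carat and the white outline are arbitrary choices for illustration:

```r
library(ggplot2)

p_hist <- ggplot(diamonds,
                 aes(x = carat)) +
  geom_histogram(binwidth = 0.1,         # Each bin spans 0.1 carat
                 fill = "dodgerblue3",   # "fill =" colors the bar areas
                 color = "white")        # "color =" outlines each bar

p_hist
```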
Line Graphs generally visualize changes in data over time.
- There are several line geometries in ggplot2
- geom_line() connects observations in order of the x variable
- geom_step() draws a stairstep, with horizontal lines between values
- geom_path() connects observations in the order they appear in the data
- economics already has dates and number of unemployed, in thousands
data(economics)
ggplot(economics,
aes(x = date,
y = unemploy)) +
geom_path()
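Because economics is already sorted by date, the line geometries can be hard to tell apart on it. A sketch on deliberately unsorted, made-up data shows how each one connects the points; the data frame `toy` is hypothetical:

```r
library(ggplot2)

# Made-up data whose rows are NOT sorted by x
toy <- data.frame(x = c(1, 3, 2, 4),
                  y = c(1, 3, 2, 1))

p_line <- ggplot(toy, aes(x = x, y = y)) +
  geom_line()   # Connects points in order of x: 1, 2, 3, 4

p_path <- ggplot(toy, aes(x = x, y = y)) +
  geom_path()   # Connects points in row order: 1, 3, 2, 4

p_step <- ggplot(toy, aes(x = x, y = y)) +
  geom_step()   # Stairstep: horizontal, then vertical, between points
```

Print each object to compare them side by side.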
Conclusion. Choose the right geometries for the variables you want to show.
Then make it pretty. But never so pretty that it obscures your ideas.
“His clothes were all new and had been cut by a good Moscow tailor. But there was something wrong even with his clothes. They were rather too fashionable, as clothes always are from conscientious but not very talented tailors.”
Fyodor Dostoevsky, “The Idiot”, Part II, Chapter 2
Factors in R are the same as categorical variables, a.k.a. “nominal”, “discrete”, etc.
- Use as.factor() on a variable to tell R that it is categorical
data(mtcars)
ggplot(mtcars, aes(x = cyl, # Number of cylinders
y = mpg)) + # Miles per gallon
geom_point()
What the heck? Where are all the 5- and 7-cylinder cars?
- Cars in mtcars only have 4-, 6-, and 8-cylinder engines
- Variable cyl (“cylinders”) is categorical
- Coerce it to a factor with as.factor()
data(mtcars)
ggplot(mtcars, aes(x = as.factor(cyl), # Cylinders "coerced" (changed) to factor
y = mpg)) +
geom_point()
Better. But we can add some random noise by using the geometry geom_jitter():
- Arguments width = and height = control the degree of random noise
data(mtcars)
ggplot(mtcars, aes(x = as.factor(cyl), # Still a factor
y = mpg)) +
geom_jitter(width = 0.05) # Jittered slightly, but "mpg" still precise
Right.
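To check what as.factor() actually changes, you can inspect the variable before and after coercion. A minimal sketch in base R; the object name `cyl_factor` is made up:

```r
data(mtcars)

class(mtcars$cyl)                    # "numeric": treated as continuous
cyl_factor <- as.factor(mtcars$cyl)  # Coerce to a factor
class(cyl_factor)                    # "factor": treated as categorical
levels(cyl_factor)                   # "4" "6" "8"
```

Only the three values present in the data become levels, which is why the 5 and 7 disappear from the axis.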
Because we’re cutting it close on time, let’s cover premade themes quickly.
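- Premade theme functions begin with theme_
- theme() allows for total customization
- theme_set() allows for global themes and will apply to all plots
- Add a theme to a plot with the + operator
- Package ggthemes adds more premade themes
- Type theme_ in the console and scroll through the autocomplete dropdown
Try theme_light():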
library(ggthemes)
ggplot(diamonds,
aes(x = cut,
y = price)) +
geom_jitter(width = 0.25,
alpha = 0.05,
color = "tomato") +
theme_light() # Just add the theme layer onto the graph
Or theme_fivethirtyeight():
library(ggthemes)
ggplot(diamonds,
aes(x = cut,
y = price)) +
geom_jitter(width = 0.25,
alpha = 0.1,
color = "tomato") +
theme_fivethirtyeight()
Or theme_classic():
library(ggthemes)
ggplot(diamonds,
aes(x = cut,
y = price)) +
geom_jitter(width = 0.25,
alpha = 0.05,
color = "tomato") +
theme_classic()
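The premade themes above are added one plot at a time. As a rough sketch of the other two options mentioned earlier, theme_set() for a global theme and theme() for individual tweaks, consider the following; the particular tweaks (bold axis titles, no minor grid lines) are arbitrary examples:

```r
library(ggplot2)

# theme_set() changes the global theme and returns the previous one
old_theme <- theme_set(theme_light())

p_theme <- ggplot(diamonds,
                  aes(x = cut,
                      y = price)) +
  geom_jitter(width = 0.25,
              alpha = 0.05,
              color = "tomato") +
  theme(axis.title = element_text(face = "bold"),  # Bold axis titles
        panel.grid.minor = element_blank())        # Drop minor grid lines

print(p_theme)        # Drawn with the global theme plus the tweaks

theme_set(old_theme)  # Restore the previous global theme
```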
Instructions. Take any of the above plots and use the + operator to add function labs().
With labs(), you can set graph labels using the following arguments:
- title =
- subtitle =
- caption =
- x =
- y =
- Choose a dataset: mtcars, diamonds, economics, or iris
- Use names() with the dataset name inside () to get the variable names
- Use help() with the dataset for definitions
- Include aes(), a geometry, and non-data ink inside geom_*()
- Add a theme_*() function
Thanks, everyone.