1 Review

For an introduction to data visualization in R, see Visualization in Base R.


2 The Grammar of Graphics

Before introducing package ggplot2, it’s important to understand the theory on which it’s built.


2.1 Human Grammar

Consider human language. Native speakers can rarely state the rules of grammar, but they apply them intuitively.

For example, we may not know why the indefinite article is sometimes “a” and at other times “an”. Observe:

  • “There’s a snake in my boot.”
  • “There’s an elephant in the room.”

We may also intuit the definite article, “the”, in the case of superlatives. Observe:

  • “Be better.”
  • “Be the best.”


Every Grammatical Element is Significant. We use parts of speech like comparatives, superlatives, prepositions, definite and indefinite articles, verbs, nouns, adjectives, adverbs, phrasal verbs, and every other variety of grammatical element to create novel sentences. Observe a classic example:

“The quick brown fox jumps over the lazy dog.”

Note that if we change a single word, we change the significance of the sentence.

“The slow brown fox jumps over the lazy dog.”

By using an antonym for “quick”, the sentence is even more absurd. Let’s make two more changes.

“The slow brown fox runs over the ethereal dog.”

Now we have a fox that can presumably operate a motor vehicle, and we delve into metaphysics a bit.


We intuit the grammar of visualization. But we often don’t know the “rules” - if there are any.

  • We know what looks good
  • We know what looks bad

Like human sentences, machine visualization can be elegant and efficient in conveying meaning.

Or it can be a salad of visual elements that disrespects your audience’s time.


2.2 A Grammar of Graphics

In 1999, Leland Wilkinson published The Grammar of Graphics, which laid out a theoretical framework for such a grammar.


Layers. Graphics are composed of distinct layers of grammatical elements.

  • Layers are like parts of speech (e.g. nouns, verbs, etc.)
  • Like complete sentences, there are essential layers to make complete graphical expressions
  • These essential layers include the:
    • Data Layer or what data you use
    • Aesthetics Layer or which variables you wish to graph
    • Geometry Layer or the shape or form by which to graph your variables


Typical Conversation in Grammatical Terms. Note what layers are really being discussed:


Data:

“Are you pulling occupations from O*NET or BLS? We only need SOC-level.”


Aesthetics:

“Could you color code the data points by ethnicity?”


Geometry:

“I’m trying to emphasize the increase in cases of EBLL levels over time.”


Coordinates:

“Can we zoom in on just the household incomes that are less than $45,000?”


Statistics:

“Let’s add one of those squiggly lines to make that trend really stand out.”


Facets:

“Can we show multiple graphs that are organized by country of origin?”


Themes:

“We can only use colors that are in the company logo. Wait, what is that?”

“My two weeks’ notice.”


A Unified Framework. Individually, each element is a building block. As a whole, the Grammar of Graphics provides a common language for visualization experts.


Mapping or Aesthetic Mapping is simply depicting a variable by using these elements.


3 ggplot2

Package ggplot2 is a popular, flexible, and powerful visualization extension for R.

  • Authored by Hadley Wickham of RStudio
  • Built as a “wrapper library” around R package grid
  • A core Tidyverse package that interfaces seamlessly within the Tidyverse ecosystem
  • Implements many best practices by default, like the color decodability research of Cynthia Brewer
  • Works best with Tidy Data

Further resources:


3.1 Installing & Loading

Installing and loading ggplot2 is easy.


Use function install.packages() to install ggplot2.

  • Make sure to use quotes around ggplot2
  • You only have to install it once
install.packages("ggplot2") 


Use function library() to load ggplot2.

  • You don’t need quotes, but you can use them (both are shown)
  • You must load ggplot2 every time you start a new session
library(ggplot2)
library("ggplot2")
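
The install-once, load-every-session pattern above can be combined into a single snippet. This is only a sketch, assuming you want a script that is safe to re-run; requireNamespace() is a base R function that checks whether a package is installed without loading it.

```r
# Install ggplot2 only if it is not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load ggplot2 for the current session
library(ggplot2)
```

Placing this at the top of a script avoids reinstalling the package every time the script runs.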


3.2 Loading Practice Data

We’ll use the practice dataset, diamonds, which comes with the ggplot2 package.

  • The go-to practice dataset for ggplot2
  • Contains ~54,000 observations, each a diamond
  • Variables include color, price, clarity, etc.
  • Learn more by running ?diamonds or help(diamonds)
  • Load the dataset with function data()
data(diamonds)


Some other functions for exploring the diamonds dataset include:

str(diamonds)          # The structure of the data
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
names(diamonds)        # Variable names, or use...
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
colnames(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
dim(diamonds)          # Tabular dimensions 
## [1] 53940    10
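
A few more base R functions may also help when getting to know the data. This is a small sketch rather than an exhaustive list; the choice of price and cut is arbitrary.

```r
library(ggplot2)

first_rows <- head(diamonds, n = 5)   # The first five observations
first_rows

summary(diamonds$price)               # Five-number summary and mean of price
table(diamonds$cut)                   # Count of diamonds at each quality of cut
levels(diamonds$cut)                  # Ordered categories of cut, worst to best
```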


3.3 Functions & Additions

Package ggplot2 has an absurd number of functions, but if you grasp the theory and practice, you’ll get it.

Function names correspond to the layers identified above.

  • Function ggplot() corresponds to the Data Layer
  • Function aes() corresponds to the Aesthetics Layer
  • Functions beginning with geom_ correspond to the Geometry Layer
  • Functions beginning with theme_ correspond to the Themes Layer
  • Functions beginning with facet_ correspond to the Facets Layer
  • Etc.
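
To make the naming pattern concrete, here is a rough sketch that uses one function per layer on the diamonds data. The particular choices (a linear trend line, a zoom to 0-3 carats, faceting by cut) are illustrative assumptions, not recommendations.

```r
library(ggplot2)

layered_plot <- ggplot(data = diamonds) +            # Data Layer
  aes(x = carat, y = price) +                        # Aesthetics Layer
  geom_point(alpha = 0.1) +                          # Geometry Layer
  stat_smooth(method = "lm", formula = y ~ x) +      # Statistics Layer
  coord_cartesian(xlim = c(0, 3)) +                  # Coordinates Layer (zooms, keeps all data)
  facet_wrap(~ cut) +                                # Facets Layer
  theme_light()                                      # Themes Layer

layered_plot
```

Each line maps onto one of the conversation snippets from the previous section, and the Addition Operator discussed next is what joins them.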


The Addition Operator, or +, connects all of these functions into a single visualization.

For example, I’ve saved a plot in an object named my_grob to demonstrate. (A grob is a Graphical Object, the grid building block from which ggplot2 draws plots.)

  • You don’t really need to know what a grob is yet, so don’t worry about that now
  • Observe the use of the Addition Operator to modify the saved plot by adding new functions
my_grob
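
The code that created my_grob is not shown here. To follow along, a hypothetical stand-in can be built from the diamonds data; the variables chosen below are an assumption and may differ from the original object.

```r
library(ggplot2)

# Hypothetical stand-in for my_grob; the original definition is not shown
my_grob <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

my_grob   # Printing the saved object draws the plot
```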


Note that we can add, for example, a new theme with a theme_*() function.

  • Here, the * represents anything that may follow theme_
my_grob +
  theme_classic()


Let’s try a different premade theme - there are a ton!

my_grob +
  theme_minimal()

my_grob +
  theme_light()


Package Extensions. You can even add new themes with package ggthemes.

  • Make sure to install with install.packages() and load with library()
library(ggthemes)
my_grob +
  theme_fivethirtyeight()

Pretty neat. There’s even a color scheme for each of Wes Anderson’s movies.

We’ll discuss extensions at a later session.


3.4 Initializing a Plot

The anatomy of a plot is simple. As mentioned, three layers are essential:

  • The Data Layer with function ggplot() and the name of the dataset
  • The Aesthetics Layer with function aes() and the variables to map
  • The Geometry Layer with functions starting with geom_ for the shape of the plot


You can chain all of these together with the Addition Operator.

  • Here, we map variables to the x-axis and y-axis using arguments x = and y = in aes()
  • Note that you do not need to reference the dataset again, once called in function ggplot()
  • What’s more, you don’t have to put variable names in quotes, like carat and price
    • This is called a Quoting Function
ggplot(data = diamonds) +
  aes(x = carat, y = price) +
  geom_point()


The “Right” Way. It’s much easier and cleaner to keep these functions separate. However:

  • There is a preferred syntax that combines functions ggplot() and aes()
  • This isn’t immediately useful but is paramount for advanced visualizations
  • Why? You can use different datasets in the same graphic!
  • If you’re new to coding, do whatever is easier for now
ggplot(data = diamonds, aes(x = carat, 
                            y = price)) +
  geom_point()
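
As a rough illustration of the “different datasets” point, the sketch below overrides the data for a single layer. The subset big_diamonds is a hypothetical second dataset made up for this example.

```r
library(ggplot2)

# Hypothetical second dataset: only diamonds heavier than 3 carats
big_diamonds <- subset(diamonds, carat > 3)

two_data_plot <- ggplot(data = diamonds, aes(x = carat,
                                             y = price)) +
  geom_point(alpha = 0.1) +               # Inherits diamonds from ggplot()
  geom_point(data = big_diamonds,         # Uses big_diamonds for this layer only
             color = "tomato")

two_data_plot
```

Because the mappings sit inside ggplot(), both layers inherit x and y, and only the data = argument changes for the second layer.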


3.5 Aesthetics v. Attributes

The difference between Attributes and Aesthetics is extremely important but very simple.

Aesthetic Mappings depict a variable from your data

  • That is, x =, y =, color =, fill =, alpha =, etc. all represent data
  • This helps with depicting multiple variables in a single graphic
  • Aesthetics belong inside function aes()
    • That is, the Aesthetics Layer


Attributes do not depict data - this is called Non-Data Ink

  • Almost the same arguments apply, like color = and fill =
  • This helps with decoding a visualization and facilitating understanding
  • Attributes belong inside any function starting with geom_
    • That is, the Geometry Layer
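
A common slip shows why the location matters. In this sketch, the same argument, color = "tomato", is placed first in geom_point() and then, mistakenly, in aes().

```r
library(ggplot2)

# Attribute: a fixed color inside geom_point() paints every point tomato
fixed_color <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "tomato")

# Slip: inside aes(), "tomato" is treated as data, so ggplot2 maps it
# like a one-value variable, adds a legend, and picks its own color
slipped_color <- ggplot(data = diamonds, aes(x = carat, y = price,
                                             color = "tomato")) +
  geom_point()

fixed_color
slipped_color
```

If a plot shows an unexpected legend with a single entry, a fixed value inside aes() is a likely cause.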


3.5.1 Examples

Let’s check out the same plot with an extra Aesthetic Mapping. Here, we’ll add a new variable, each diamond’s quality of cut, and map it to color = in function aes():

ggplot(data = diamonds, aes(x = carat, 
                            y = price,
                            color = cut)) +
  geom_point()

We can already gain new insights by mapping the variable cut.

Question. What can you tell about the relationship between price and cut?


Now, observe an example of an Attribute or Non-Data Ink. Because we’re not depicting a new variable, we don’t put these arguments in function aes(). Instead, we put them in function geom_point(), and we can choose the Attributes that help us to interpret the plot. Specifically:

  • alpha = sets the transparency of data points; helpful because of heavy overlap
  • color = helps us discern transparent points from the default grey background of ggplot2
  • We can use theme_light() or other functions for a non-grey background, if needed
ggplot(data = diamonds, aes(x = carat, 
                            y = price)) +
  geom_point(alpha = 0.1,
             color = "tomato") +
  theme_light()

We haven’t added any new data to the plot, but now we can more easily interpret it.

Question. What insights are made more clear by the parsimonious use of Non-Data Ink?


4 Practice

I went over my hours cap for session prep like three hours ago, and I have some rare birds to hunt in Red Dead Redemption 2, but this practice may still be useful.


Instructions. Using the same plot with which we’ve practiced, experiment with:

  • Different Aesthetic Mappings for each variable
  • Remember to put them inside function aes() because they are Data Ink
  • Run the following to get their names:
names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"


Also, try:

  • Different Attributes to make the plot more appealing and decodable
  • Remember to put them inside function geom_point() because they are Non-Data Ink
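
If you’re unsure where to start, here is one possible attempt, not a model answer. The choice of clarity, the alpha and size values, and the theme are all arbitrary assumptions meant to be changed.

```r
library(ggplot2)

practice_plot <- ggplot(data = diamonds, aes(x = carat,
                                             y = price,
                                             color = clarity)) +  # Data Ink
  geom_point(alpha = 0.3,                                         # Non-Data Ink
             size = 1) +
  theme_minimal()

practice_plot
```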


It’s dangerous to go alone. Take this:


Thanks for reading!