What's so great about Ggplot2?

Sarah Hosking for #R-Ladies Paris
9 November 2017

Preface

This file was generated from an RMD file.

I have hidden most of the data munging code so that you can focus on the code for Ggplot2.

To see ALL code used in this presentation, see the source files on github:
https://github.com/limegimlet/rladies_nov2017_ggplot.git.

It's all about layers

You can build plots like you would for this cake.

What's so great about layers?

If Ggplot2 is so easy, why this presentation?

Once you get over the initial learning curve, it's all too easy during demos to go gaga showing off great-looking faceted plots with fabulous colour palettes and themes.

But after a few presentations I noticed what people talked to me about wasn't the fancier plotting techniques. It was a few fundamentals that stumped them.

Barriers to true Ggplot2 ❤

  1. Opaque terminology
  2. Untidy dataframes
  3. Unclear how to iterate

But before we dive into those ...

What data are we exploring?

Paris air quality!

(Photo from Le Monde. http://www.lemonde.fr/planete/article/2014/11/24/a-paris-la-pollution-est-aussi-nocive-que-le-tabagisme-passif_4528203_3244.html)

Why not data that's, um, less of a downer?

My audience is often Parisian. The air quality data is available for public download, and the internet already has enough plots about cars, irises and diamonds.

Moreoever, when I was really getting into Ggplot2 in late 2016, European headlines were dominated by stories of extreme pollution spikes from fine particulate matter.

And since I live between 2 major arterial roads in city center, my question was "Just how bad is it?" Also, being a runner and cyclist, I wanted to see if there was clearly a best time of day to exercise outdoors.

About the data set

nrow(airparif)
## [1] 55080
head(airparif)
##         date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011    07  21    1    7   NA  12 NA NA
## 3 2011-07-21 2011    07  21    2    6   NA  10 NA NA
## 4 2011-07-21 2011    07  21    3    7   NA  10 NA NA
## 5 2011-07-21 2011    07  21    4    7   NA  14 NA NA
## 6 2011-07-21 2011    07  21    5   10   NA  30 NA NA
## 7 2011-07-21 2011    07  21    6   10   NA  36 NA NA

The 5 pollutants being monitored are:

Create the first layer

Let's build a simple histogram of PM10.

# create data layer
ggplot(data = airparif, aes(x = PM10))

Hmmmmmm.....

Why is this blank?

Add a geom

# add plot-type layer
ggplot(data = airparif, aes(x = PM10)) + 
  geom_histogram()

Voilà. But before we go further, let's clarify a few terms.

Opaque term 1: aes()

Mental barrier #1 for new(er) Ggplot2 users is aes(), called the aesthetic.

This term spooks people.

In plain English: aes() is where you map data frame variables to visual elements.

The most basic visual elements to any plot are the x and y axes. Other examples are linetype, shape, and fill.

Map visual elements to variables

This tells Ggplot2 to change this visual element for each group.

Notice how this automatically generates a nice-looking legend.

# create data layer that maps
# a variable to fill colour
p <- ggplot(data = airparif, aes(x = PM10, fill = year))

p + geom_histogram()

Before those with design backgrounds start hyperventilating, I've created a rainbow histogram only to illustrate a point. As a method to communicate data, it totally sucks.

Set a visual element

By contrast, to change the look of the entire plot, you set a visual element to a value.

# create data layer, *without* fill=
p <- ggplot(data = airparif, aes(x = PM10))

# add plot layer & set fill color 
p + geom_histogram(fill = 'blue')

Opaque term 2: geom_xxx()

This refers to the geometric projection of your data.

In plain English, it means the plot type.

So we can easily swap another distribution plot layer (in this case, density plot) that will work with our existing data layer.

# create data layer *with* fill mapped to year
p <- ggplot(data = airparif, aes(x = PM10, fill = year))

# create density plot this time
p + geom_density()

# make easier on eyes!
p + geom_density(alpha = 1/4, linetype = 0)

Tip 2: Use tidy data + facets

Why? Let's plot the distrubtion of another - NO2.

# plot another var distribution
p <- ggplot(data = airparif, aes(x = NO2))

p + geom_histogram()

Now, imagine we had 20 pollutants in our data frame.

Instead, use tidy data.

Untidy vs Tidy

Not tidy

##         date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011    07  21    1    7   NA  12 NA NA
## 3 2011-07-21 2011    07  21    2    6   NA  10 NA NA
## 4 2011-07-21 2011    07  21    3    7   NA  10 NA NA
## 5 2011-07-21 2011    07  21    4    7   NA  14 NA NA
## 6 2011-07-21 2011    07  21    5   10   NA  30 NA NA
## 7 2011-07-21 2011    07  21    6   10   NA  36 NA NA

Tidy

##         date year month day hour pollutant value
## 1 2011-07-21 2011    07  21    1      PM10     7
## 2 2011-07-21 2011    07  21    2      PM10     6
## 3 2011-07-21 2011    07  21    3      PM10     7
## 4 2011-07-21 2011    07  21    4      PM10     7
## 5 2011-07-21 2011    07  21    5      PM10    10
## 6 2011-07-21 2011    07  21    6      PM10    10

Plot tidy data & facet

The advantage of gathering your data into a tidy format is it generates categorical variables. In our case, pollutant.

Categorical variables are easy to group and facet in Ggplot2.

# create plot of data layer only
p.tidy <- ggplot(data = airparif.tidy, aes(x = value))
p.tidy

# add geom layer (a.k.a. the plot type)
hg <- p.tidy + geom_histogram()
hg

# add facet layer
hg + facet_wrap(~pollutant)

# assign facet wrap to a variable
# & fix ranges for x-axes
fw <- facet_wrap(~pollutant, scales = 'free_x')
hg + fw

Tip 3: Learn to iterate quickly

# display earlier histogram of NO2 only
p + geom_histogram()

# update plot var to use new tidy df & x var
p <- p %+% airparif.tidy + aes(x = value)
p + geom_histogram()

# add them together
p + 
  geom_histogram() +
  fw

# map fill to pollutant
p <- p %+% aes(fill = pollutant)
p +
  geom_histogram() +
  fw

The original plot for a single histogram required 2 lines of code.

Here we've created 5 histograms using 3 lines of code.

Another example: switch to boxplots

How do distributions change from month to month?

# update data layer to plot values by *month*
# and add a y variable.
p <- p %+% aes(x = month, y = value)

# add boxplot & facet layers
bp <- p + geom_boxplot()
bp + fw

# update fw var to free y-axis!
fw <- facet_wrap(~pollutant, scales = 'free_y')

# try again
bp + fw

# change to *hourly* boxplot
bp <- bp %+% aes(x = as.factor(hour))

bp + fw

Quite frankly, these outliers look scary, particularly for carbon monoxide & nitrogen dioxide. But the values alone don't mean anything to me. I need context.

Add context by adding data

So just how bad are these outliers, the pollution spikes?

To answer that question, I'd like to add horizontal lines showing the low-med-high-very high thresholds. These are different for each pollutant.

# df of 1-hr threshold values for each pollutant
head(levels_h.long, 10)
##    pollutant     level value
## 1       PM10       low    25
## 2       PM10    medium    50
## 3       PM10      high    90
## 4       PM10 very high   180
## 5       PM25       low    15
## 6       PM25    medium    30
## 7       PM25      high    55
## 8       PM25 very high   110
## 9        NO2       low    50
## 10       NO2    medium   100

This involves adding a geom_hline() layer to the plot. By default, the data specified in data layer is global. However, in subsequent layers you can overwrite this by specifying different data.

# add horizontal lines based on threshold levels df
hl <- geom_hline(data = levels_h.long,
                  aes(yintercept = value, group = pollutant), 
                 linetype = 2)

# without hlines
bp + fw

# with hlines
bp + fw + hl

This reassures me for CO and NO2. PM10 & PM25, on the other hand...

Bonus: Grouping

As I recall from news headlines, Dec 2016 was a bad month for particulate matter.

How did pollution fluctuate throughout the day?

To answer this, let's make a faceted spaghetti plot, with a "strand" for each day of the Dec 2016.

Spaghetti line plot

# update p with new df, x & y
p.dec16 <- p %+% dec2016 + aes(hour, value)

p.dec16 + 
  geom_line() +
  fw

This is NOT what I was after.

Why is this happening?

Remember we are plotting values along an hourly x-axis. There are 31 days in December, so for each hour on the x-axis, there 31 data points.

By default, geom_line() connects all of them with a single line. We can change that.


# change mapping to plot is grouped by date
p.dec16 <- p.dec16 %+% aes(group = date)

# add geom layer to plot var
p.dec16 <- p.dec16 +
  geom_line(colour = 'darkgrey')

p.dec16 + fw

That's more like it!

For all pollutants except O3, I see big spikes during the middle of the day. Did these all occur on the same day?

I know PM10 was extremely high that month, so if I can identify the day with the highest PM10 spike, is that the same day as the big spikes for PM25, NO2 and CO?

Highlight a single line

When did PM10 pollution spike?

We need to create a new data frame for the day of with the highest PM10 value.

# Dec 2016 particulate data
dec2016_pm <- dec2016 %>%
  filter(pollutant %in% c('PM10', 'PM25'))

# find index of max value in Dec
max_val <- which.max(dec2016_pm[,'value'])

# find date
max_date <- dec2016_pm[max_val, 'date']

# filter on this date
dec2016_pm_max <- dec2016 %>%
  filter(date == max_date)

Then we will create another geom_line() layer using the data for this PM10 spike day.

# create geom layer for day of PM10 spike
p.spike_day <- geom_line(data = dec2016_pm_max, aes(hour, value,
                                                     colour = pollutant))

# add on top of the Dec 2016 spaghetti plot
p.dec16 +
  p.spike_day +
  fw

# add context: bring back the threshold lines
p.dec16 +
  p.spike_day +
  fw +
  hl

Analysis conclusion & follow up

The plots show that on the day PM10 reached its maximum value for December 2016, PM25, NO2, and CO also reached their peak values for the month. This suggests that those pollutants are positively correlated, especially PM25.

In terms of severity:

Ideas to follow up:

Ggplot2 recap

To visualize groups:

To iterate quickly:

Learn more

Much of the data munging I did for this presention used the tidyr and dplyr from the tidyverse package. This slide show by Garret Grolemund is one of the best slide tutorials I've ever seen.

I referred to these quite frequently:

I have not looked at this book yet, but it is often recommended by Ggplot2 gurus.