Sarah Hosking for #R-Ladies Paris
9 November 2017
Preface
This file was generated from an RMD file.
I have hidden most of the data munging code so that you can focus on the code for Ggplot2
.
To see ALL code used in this presentation, see the source files on github:
https://github.com/limegimlet/rladies_nov2017_ggplot.git.
You can build plots like you would for this cake.
Once you get over the initial learning curve, it's all too easy during demos to go gaga showing off great-looking faceted plots with fabulous colour palettes and themes.
But after a few presentations I noticed what people talked to me about wasn't the fancier plotting techniques. It was a few fundamentals that stumped them.
But before we dive into those ...
Paris air quality!
(Photo from Le Monde. http://www.lemonde.fr/planete/article/2014/11/24/a-paris-la-pollution-est-aussi-nocive-que-le-tabagisme-passif_4528203_3244.html)
My audience is often Parisian. The air quality data is available for public download, and the internet already has enough plots about cars, irises and diamonds.
Moreoever, when I was really getting into Ggplot2 in late 2016, European headlines were dominated by stories of extreme pollution spikes from fine particulate matter.
And since I live between 2 major arterial roads in city center, my question was "Just how bad is it?" Also, being a runner and cyclist, I wanted to see if there was clearly a best time of day to exercise outdoors.
nrow(airparif)
## [1] 55080
head(airparif)
## date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011 07 21 1 7 NA 12 NA NA
## 3 2011-07-21 2011 07 21 2 6 NA 10 NA NA
## 4 2011-07-21 2011 07 21 3 7 NA 10 NA NA
## 5 2011-07-21 2011 07 21 4 7 NA 14 NA NA
## 6 2011-07-21 2011 07 21 5 10 NA 30 NA NA
## 7 2011-07-21 2011 07 21 6 10 NA 36 NA NA
The 5 pollutants being monitored are:
PM25
= Particulate matter < 2.5 µmPM10
= Particulate matter < 10 µmNO2
= Nitrogen Dioxide (Azote)CO
= Carbon monoxideO3
= OzoneLet's build a simple histogram of PM10
.
# create data layer
ggplot(data = airparif, aes(x = PM10))
Hmmmmmm.....
geom
layergeom
# add plot-type layer
ggplot(data = airparif, aes(x = PM10)) +
geom_histogram()
Voilà. But before we go further, let's clarify a few terms.
aes()
Mental barrier #1 for new(er) Ggplot2 users is aes()
, called the aesthetic.
This term spooks people.
In plain English: aes()
is where you map data frame variables to visual elements.
The most basic visual elements to any plot are the x and y axes. Other examples are linetype
, shape
, and fill
.
This tells Ggplot2 to change this visual element for each group.
Notice how this automatically generates a nice-looking legend.
# create data layer that maps
# a variable to fill colour
p <- ggplot(data = airparif, aes(x = PM10, fill = year))
p + geom_histogram()
Before those with design backgrounds start hyperventilating, I've created a rainbow histogram only to illustrate a point. As a method to communicate data, it totally sucks.
By contrast, to change the look of the entire plot, you set a visual element to a value.
# create data layer, *without* fill=
p <- ggplot(data = airparif, aes(x = PM10))
# add plot layer & set fill color
p + geom_histogram(fill = 'blue')
geom_xxx()
This refers to the geometric projection of your data.
In plain English, it means the plot type.
So we can easily swap another distribution plot layer (in this case, density plot) that will work with our existing data layer.
# create data layer *with* fill mapped to year
p <- ggplot(data = airparif, aes(x = PM10, fill = year))
# create density plot this time
p + geom_density()
# make easier on eyes!
p + geom_density(alpha = 1/4, linetype = 0)
Why? Let's plot the distrubtion of another - NO2.
# plot another var distribution
p <- ggplot(data = airparif, aes(x = NO2))
p + geom_histogram()
Now, imagine we had 20 pollutants in our data frame.
Instead, use tidy data.
Not tidy
## date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011 07 21 1 7 NA 12 NA NA
## 3 2011-07-21 2011 07 21 2 6 NA 10 NA NA
## 4 2011-07-21 2011 07 21 3 7 NA 10 NA NA
## 5 2011-07-21 2011 07 21 4 7 NA 14 NA NA
## 6 2011-07-21 2011 07 21 5 10 NA 30 NA NA
## 7 2011-07-21 2011 07 21 6 10 NA 36 NA NA
Tidy
## date year month day hour pollutant value
## 1 2011-07-21 2011 07 21 1 PM10 7
## 2 2011-07-21 2011 07 21 2 PM10 6
## 3 2011-07-21 2011 07 21 3 PM10 7
## 4 2011-07-21 2011 07 21 4 PM10 7
## 5 2011-07-21 2011 07 21 5 PM10 10
## 6 2011-07-21 2011 07 21 6 PM10 10
The advantage of gathering your data into a tidy format is it generates categorical variables. In our case, pollutant
.
Categorical variables are easy to group and facet in Ggplot2.
# create plot of data layer only
p.tidy <- ggplot(data = airparif.tidy, aes(x = value))
p.tidy
# add geom layer (a.k.a. the plot type)
hg <- p.tidy + geom_histogram()
hg
# add facet layer
hg + facet_wrap(~pollutant)
# assign facet wrap to a variable
# & fix ranges for x-axes
fw <- facet_wrap(~pollutant, scales = 'free_x')
hg + fw
%+%
# display earlier histogram of NO2 only
p + geom_histogram()
# update plot var to use new tidy df & x var
p <- p %+% airparif.tidy + aes(x = value)
p + geom_histogram()
# add them together
p +
geom_histogram() +
fw
# map fill to pollutant
p <- p %+% aes(fill = pollutant)
p +
geom_histogram() +
fw
The original plot for a single histogram required 2 lines of code.
Here we've created 5 histograms using 3 lines of code.
How do distributions change from month to month?
# update data layer to plot values by *month*
# and add a y variable.
p <- p %+% aes(x = month, y = value)
# add boxplot & facet layers
bp <- p + geom_boxplot()
bp + fw
# update fw var to free y-axis!
fw <- facet_wrap(~pollutant, scales = 'free_y')
# try again
bp + fw
# change to *hourly* boxplot
bp <- bp %+% aes(x = as.factor(hour))
bp + fw
Quite frankly, these outliers look scary, particularly for carbon monoxide & nitrogen dioxide. But the values alone don't mean anything to me. I need context.
So just how bad are these outliers, the pollution spikes?
To answer that question, I'd like to add horizontal lines showing the low-med-high-very high thresholds. These are different for each pollutant.
# df of 1-hr threshold values for each pollutant
head(levels_h.long, 10)
## pollutant level value
## 1 PM10 low 25
## 2 PM10 medium 50
## 3 PM10 high 90
## 4 PM10 very high 180
## 5 PM25 low 15
## 6 PM25 medium 30
## 7 PM25 high 55
## 8 PM25 very high 110
## 9 NO2 low 50
## 10 NO2 medium 100
This involves adding a geom_hline()
layer to the plot. By default, the data specified in data layer is global. However, in subsequent layers you can overwrite this by specifying different data.
# add horizontal lines based on threshold levels df
hl <- geom_hline(data = levels_h.long,
aes(yintercept = value, group = pollutant),
linetype = 2)
# without hlines
bp + fw
# with hlines
bp + fw + hl
This reassures me for CO and NO2. PM10 & PM25, on the other hand...
As I recall from news headlines, Dec 2016 was a bad month for particulate matter.
How did pollution fluctuate throughout the day?
To answer this, let's make a faceted spaghetti plot, with a "strand" for each day of the Dec 2016.
# update p with new df, x & y
p.dec16 <- p %+% dec2016 + aes(hour, value)
p.dec16 +
geom_line() +
fw
This is NOT what I was after.
Why is this happening?
Remember we are plotting values along an hourly x-axis. There are 31 days in December, so for each hour on the x-axis, there 31 data points.
By default, geom_line()
connects all of them with a single line. We can change that.
# change mapping to plot is grouped by date
p.dec16 <- p.dec16 %+% aes(group = date)
# add geom layer to plot var
p.dec16 <- p.dec16 +
geom_line(colour = 'darkgrey')
p.dec16 + fw
That's more like it!
For all pollutants except O3, I see big spikes during the middle of the day. Did these all occur on the same day?
I know PM10 was extremely high that month, so if I can identify the day with the highest PM10 spike, is that the same day as the big spikes for PM25, NO2 and CO?
When did PM10 pollution spike?
We need to create a new data frame for the day of with the highest PM10 value.
# Dec 2016 particulate data
dec2016_pm <- dec2016 %>%
filter(pollutant %in% c('PM10', 'PM25'))
# find index of max value in Dec
max_val <- which.max(dec2016_pm[,'value'])
# find date
max_date <- dec2016_pm[max_val, 'date']
# filter on this date
dec2016_pm_max <- dec2016 %>%
filter(date == max_date)
Then we will create another geom_line()
layer using the data for this PM10 spike day.
# create geom layer for day of PM10 spike
p.spike_day <- geom_line(data = dec2016_pm_max, aes(hour, value,
colour = pollutant))
# add on top of the Dec 2016 spaghetti plot
p.dec16 +
p.spike_day +
fw
# add context: bring back the threshold lines
p.dec16 +
p.spike_day +
fw +
hl
The plots show that on the day PM10 reached its maximum value for December 2016, PM25, NO2, and CO also reached their peak values for the month. This suggests that those pollutants are positively correlated, especially PM25.
In terms of severity:
Ideas to follow up:
ggpairs()
geom_tile()
To visualize groups:
aes()
facet_wrap
& facet_grid
To iterate quickly:
%+%
Much of the data munging I did for this presention used the tidyr and dplyr from the tidyverse
package. This slide show by Garret Grolemund is one of the best slide tutorials I've ever seen.
I referred to these quite frequently:
Ggplot2 cheatsheet: http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Online excerpts from ggplot2: Elegant Graphics for Data Analysis (Use R!) by Hadley Wickham is how, among other things, I learned about the magical %+%
operator: http://ggplot2.org/book/
I have not looked at this book yet, but it is often recommended by Ggplot2 gurus.