Plots

This is an R Markdown document of plotting examples that I put together for my own reference, based on Coursera’s Exploratory Data Analysis course.

Loading the data

After loading the data, it is often a good idea to check (and, if needed, set) the class of each column, for example:

sapply(mtcars, class)
mtcars$mpg = as.numeric(mtcars$mpg)
airquality <- transform(airquality, Month = factor(Month)) # the Month column of the airquality data frame becomes a factor (the result must be assigned back)

You can also set the class of each column while loading the data:

pollution <- read.csv("data/avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric"))
head(pollution)

##     pm25 fips region longitude latitude
## 1  9.771 1003   east    -87.75    30.59
## 2  9.994 1027   east    -85.84    33.27
## 3 10.689 1033   east    -87.73    34.73
## 4 11.337 1049   east    -85.80    34.46
## 5 12.120 1055   east    -86.03    34.02
## 6 10.828 1069   east    -85.35    31.19
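
One reason to set colClasses at load time is that identifier columns such as fips keep their leading zeros when read as character. The sketch below illustrates this with a small made-up file (the rows and the temporary file are hypothetical, not the course data):

```r
# Toy CSV written to a temporary file; the values are invented for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("pm25,fips,region",
             "9.8,01003,east",
             "11.2,06037,west"), tmp)

# Declare one class per column, in column order
toy <- read.csv(tmp, colClasses = c("numeric", "character", "factor"))

# Check what each column ended up as
print(sapply(toy, class))

# The leading zero survives because fips was read as character
print(toy$fips)
```

Without colClasses, read.csv would guess that fips is an integer and drop the leading zero.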

Simple Summaries of Data

One dimension

Five-number summary
Boxplots
Histograms
Density plot
Barplot

Five Number Summary

summary(pollution$pm25)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.38    8.55   10.00    9.84   11.40   18.40
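
Note that summary() also reports the mean, so it prints six numbers. As a small sketch on a made-up vector (x is hypothetical, not the pollution data), base R's fivenum() returns only the five-number summary:

```r
# Hypothetical values with one large outlier
x <- c(2, 4, 4, 5, 7, 9, 30)

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
six  <- summary(x)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

print(five)
print(six)
```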

Boxplot

boxplot(pollution$pm25, col = "blue")


Histogram

hist(pollution$pm25, col = "green")


Density

plot(density(pollution$pm25), col="green")


Histogram with a rug

hist(pollution$pm25, col = "green")
rug(pollution$pm25)


Histogram with more breaks

hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)


Overlaying Features

boxplot(pollution$pm25, col = "blue")
abline(h = 12)


Overlaying Features

hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)


Barplot

barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")


Simple Summaries of Data

Two dimensions

Multiple/overlaid 1-D plots (Lattice/ggplot2)
Scatterplots
Smooth scatterplots

More than 2 dimensions

Overlaid/multiple 2-D plots; coplots
Use color, size, shape to add dimensions
Spinning plots
Actual 3-D plots (not that useful)

Multiple Boxplots

boxplot(pm25 ~ region, data = pollution, col = "red")


even more boxplots!

library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")


Multiple Histograms

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")


Scatterplot

with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)


Scatterplot - Using Color

with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)


Multiple Scatterplots

par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))


Scatter plot with loads of data

x=rnorm(10000)
y=rnorm(10000)
smoothScatter(x, y)

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009


Plot systems in R

Base plotting system: the artist's canvas model. You start with a blank canvas and add things to it one by one; each element you add to the plot is a separate instruction. Example:

library(datasets)
data(cars)
with(cars, plot(speed, dist))


Lattice plotting system: instead of adding things one by one, the whole plot is specified in a single call, so you have to give a lot of information up front. With a panel plot (for instance, plotting a as a function of b conditioned on c, with one panel of a against b for each level of c) you can show a lot of data on a single page. Example:

library(lattice)

## Warning: package 'lattice' was built under R version 3.0.3

state=data.frame(state.x77, region=state.region)
xyplot(Life.Exp~Income | region, data=state, layout=c(4, 1))


ggplot2 plotting system: combines elements of the two previous systems. Example:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.0.3

data(mpg)
qplot(displ, hwy, data=mpg)


Some Important Base Graphics Parameters

Many base plotting functions share a set of parameters. Here are a few key ones:

pch: the plotting symbol (default is open circle)
lty: the line type (default is solid line), can be dashed, dotted, etc.
lwd: the line width, specified as a positive number that multiplies the default width (default is 1)
col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
xlab: character string for the x-axis label
ylab: character string for the y-axis label

Some Important Base Graphics Parameters

The par() function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.

las: the orientation of the axis labels on the plot
bg: the background color
mar: the margin size
oma: the outer margin size (default is 0 for all sides)
mfrow: the plot layout as c(number of rows, number of columns); plots are filled row-wise
mfcol: the plot layout as c(number of rows, number of columns); plots are filled column-wise
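
Because par() settings persist on the device after they are set, it is common to save the old values and restore them afterwards. A minimal sketch, relying on the fact that par() invisibly returns the previous values of whatever it changes (the variable names old and restored are just for illustration):

```r
library(datasets)

# par() returns the previous values of the parameters it modifies
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(cars$speed, cars$dist, main = "Left panel")
hist(cars$dist, main = "Right panel")

# Restore the earlier settings so later plots are not affected
par(old)
restored <- par("mfrow")
print(restored)
```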

Some Important Base Graphics Parameters

Default values for global graphics parameters

par("lty")

## [1] "solid"

par("col")

## [1] "black"

par("pch")

## [1] 1

Base Plotting Functions

plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: add axis ticks/labels
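
As a rough sketch of how these functions combine, the following builds one plot a call at a time, starting from an empty canvas. The data (x, y) are made up for illustration:

```r
# Hypothetical data
x <- 1:10
y <- x^2

# Set up the coordinate system without drawing points or an x axis
plot(x, y, type = "n", xaxt = "n", xlab = "x", ylab = "y")

points(x, y, pch = 19)                       # add the points
lines(x, y, lty = 2)                         # connect them with a dashed line
text(x[10], y[10], labels = "max", pos = 2)  # label the last point, to its left
axis(1, at = c(1, 5, 10))                    # custom x-axis ticks
mtext("added with mtext()", side = 3, line = 0.5)  # text in the top margin
title(main = "Built up one call at a time")
```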

Base Plot with Annotation

library(datasets)
with(airquality, plot(Wind, Ozone))
title(main = "Ozone and Wind in New York City") ## Add a title


Base Plot with Annotation

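## Suppress the default x axis (xaxt = "n") and all annotation (ann = FALSE, which also hides 'main')
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", xaxt = "n", ann = FALSE))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
## Draw a custom x axis: ticks at Wind = 1, ..., 5 labelled A, ..., E
axis(1, at = 1:5, labels = LETTERS[1:5])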


Base Plot with Annotation

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))


Base Plot with Regression Line

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)
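
It can help to look at the fitted model itself, not only the line. The sketch below refits the same model and checks how many rows were actually used; lm() drops rows with missing values by default, and Ozone contains NAs. The names n_missing, n_used and slope are only illustrative:

```r
library(datasets)

model <- lm(Ozone ~ Wind, data = airquality)

n_missing <- sum(is.na(airquality$Ozone))  # rows with no Ozone measurement
n_used    <- nobs(model)                   # rows that entered the fit
slope     <- coef(model)[["Wind"]]         # slope of the line drawn by abline(model)

print(c(missing = n_missing, used = n_used))
print(coef(model))
```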


Multiple Base Plots

par(mfrow = c(1, 2))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
})


Multiple Base Plots

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
plot(Temp, Ozone, main = "Ozone and Temperature")
mtext("Ozone and Weather in New York City", outer = TRUE)
})


explicitly launch graphics device

pdf(file = "myplot.pdf") ## Open PDF device; create 'myplot.pdf' in my working directory
# Create plot and send to a file (no plot appears on screen)
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data") ## Annotate plot; still nothing on screen
dev.off() ## Close the PDF file device

## pdf 
##   2

# Now you can view the file 'myplot.pdf' on your computer
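
A quick way to confirm that the file was really written is to check for it after dev.off(). This sketch writes to a temporary file (the path in out is an assumption for illustration, not the 'myplot.pdf' used above):

```r
library(datasets)

out <- tempfile(fileext = ".pdf")   # hypothetical output path

pdf(file = out)                     # open the PDF device
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")
dev.off()                           # the file is only complete once the device is closed

print(file.exists(out))
print(file.info(out)$size)          # size in bytes
```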

copying a plot

library(datasets)
with(faithful, plot(eruptions, waiting)) ## Create plot on screen device
title(main = "Old Faithful Geyser data") ## Add a main title


dev.copy(png, file = "geyserplot.png") ## Copy my plot to a PNG file

## png 
##   3

dev.off() ## Don't forget to close the PNG device!

## pdf 
##   2

manipulate function

library(UsingR)
data(galton)
library(manipulate)
myHist <- function(mu){
  hist(galton$child,col="blue",breaks=100)
  lines(c(mu, mu), c(0, 150),col="red",lwd=5)
  mse <- mean((galton$child - mu)^2)
  text(63, 150, paste("mu = ", mu))
  text(63, 140, paste("MSE = ", round(mse, 2)))
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))

points of different sizes

library(UsingR)

## Warning: package 'UsingR' was built under R version 3.0.3

## Loading required package: MASS
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies

data(galton)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)),
     as.numeric(as.vector(freqData$child)),
     pch = 21, col = "black", bg = "lightblue",
     cex = .15 * freqData$freq,
     xlab = "parent", ylab = "child")


The Lattice Plotting System

The lattice plotting system is implemented using the following packages:

lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid
We seldom call functions from the grid package directly
The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting
All plotting/annotation is done at once with a single function call

Lattice Functions

xyplot: this is the main function for creating scatterplots
bwplot: box-and-whiskers plots (“boxplots”)
histogram: histograms
stripplot: like a boxplot but with actual points
dotplot: plot dots on “violin strings”
splom: scatterplot matrix; like pairs in base plotting system
levelplot, contourplot: for plotting “image” data

Lattice Functions

Lattice functions generally take a formula for their first argument, usually of the form

xyplot(y ~ x | f * g, data)

We use the formula notation here, hence the ~.
On the left of the ~ is the y-axis variable, on the right is the x-axis variable
f and g are conditioning variables - they are optional
the * indicates an interaction between two variables
The second argument is the data frame or list from which the variables in the formula should be looked up
If no data frame or list is passed, then the parent frame is used.
If no other arguments are passed, there are defaults that can be used.
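
As a sketch of the y ~ x | f * g form, the following conditions on two factors at once using the built-in mtcars data (this dataset and the axis labels are my choice for illustration, not part of the course example). Lattice functions return an object, which is drawn when printed; at the console this happens automatically, while in a script or function print() is needed:

```r
library(datasets)
library(lattice)

# One panel per combination of cylinder count and transmission type
p <- xyplot(mpg ~ wt | factor(cyl) * factor(am), data = mtcars,
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

print(p)   # draws the plot
```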

Simple Lattice Plot

library(datasets)
library(lattice)
## Convert 'Month' to a factor variable
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))


Lattice Panel Functions

Lattice functions have a panel function which controls what happens inside each panel of the plot.
The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel
Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)

Lattice Panel Functions

set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))
xyplot(y ~ x | f, layout = c(2, 1)) ## Plot with 2 panels


Lattice Panel Functions

## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
       panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
       panel.abline(h = median(y), lty = 2) ## Add a horizontal line at the median
})


Lattice Panel Functions: Regression line

## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
               panel.xyplot(x, y, ...) ## First call default panel function
               panel.lmline(x, y, col = 2) ## Overlay a simple linear regression line
       })


What is ggplot2?

Grammar of graphics represents an abstraction of graphics ideas/objects
Think “verb”, “noun”, “adjective” for graphics
Allows for a “theory” of graphics on which to build new graphics and graphics objects
“Shorten the distance from mind to page”

The Basics: `qplot()`

Works much like the plot function in base graphics system
Looks for data in a data frame, similar to lattice, or in the parent environment
Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled
qplot() hides what goes on underneath, which is okay for most operations
ggplot() is the core function and very flexible for doing things qplot() cannot do

ggplot2 “Hello, world!”

library(ggplot2)
qplot(displ, hwy, data = mpg)


Modifying aesthetics

qplot(displ, hwy, data = mpg, color = drv)


qplot(displ, hwy, data = mpg, shape = drv)


Adding a geom

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.


qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm") # linear regression model


qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm", facets=.~cyl) # linear regression fit, one panel per value of cyl


Histograms and density

qplot(hwy, data = mpg, fill = drv)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


qplot(hwy, data = mpg, geom="density", fill = drv)


qplot(hwy, data = mpg, geom="density", color = drv)


Facets

qplot(displ, hwy, data = mpg, facets = . ~ drv)   # facets = what should determine rows ~ what should determine columns


qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)


qplot(hwy, data = mpg, facets = drv ~ cyl, binwidth = 2)


Summary of qplot()

The qplot() function is the analog to plot() but with many built-in features
Syntax somewhere in between base/lattice
Produces very nice graphics, essentially publication ready (if you like the design)
Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)

Basic Components of a ggplot2 Plot

A data frame
aesthetic mappings: how data are mapped to color, size
geoms: geometric objects like points, lines, shapes.
facets: for conditional plots.
stats: statistical transformations like binning, quantiles, smoothing.
scales: what scale an aesthetic map uses (example: male = red, female = blue).
coordinate system
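
The components above can be sketched with the full ggplot() interface, one layer per component. This is only an illustration on the mpg data used earlier; the colour choices are arbitrary. It is also roughly what the qplot() calls with method = "lm" and facets do, and as far as I know newer ggplot2 releases no longer pass method through qplot(), so the explicit form may be the safer one to rely on:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +                 # data frame + aesthetic mappings
  geom_point(aes(colour = drv)) +                           # geom: points, coloured by drive type
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # stat: linear fit within each facet
  facet_grid(. ~ cyl) +                                     # facets: one column per cyl value
  scale_colour_manual(values = c("4" = "red", f = "blue", r = "darkgreen")) +  # scale
  coord_cartesian()                                         # coordinate system (the default)

print(p)   # a ggplot object is drawn when printed
```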