Plots

This R Markdown document is a personal collection of plotting examples, based on Coursera’s Exploratory Data Analysis course.

Loading the data

After loading the data, it is often a good idea to check, and if necessary set, the class of each column:

sapply(mtcars, class) # class of each column
mtcars$mpg = as.numeric(mtcars$mpg)
airquality <- transform(airquality, Month = factor(Month)) # the Month column of the airquality data frame becomes a factor

You can also set the class of each column while loading the data:

pollution <- read.csv("data/avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric"))
head(pollution)
##     pm25 fips region longitude latitude
## 1  9.771 1003   east    -87.75    30.59
## 2  9.994 1027   east    -85.84    33.27
## 3 10.689 1033   east    -87.73    34.73
## 4 11.337 1049   east    -85.80    34.46
## 5 12.120 1055   east    -86.03    34.02
## 6 10.828 1069   east    -85.35    31.19
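
One reason to set colClasses explicitly is that identifier columns such as fips are codes rather than numbers. The sketch below is a toy illustration, not the course data: the inline CSV text and the name toy are made up, and it relies on the text argument of read.csv to avoid needing a file on disk.

```r
# Hypothetical three-column CSV, supplied inline instead of read from a file
csv_text <- "pm25,fips,region\n9.77,01003,east\n11.20,06037,west"

toy <- read.csv(text = csv_text,
                colClasses = c("numeric", "character", "factor"))

sapply(toy, class)  # one class per column, as requested
toy$fips            # read as character, so the leading zeros are kept
```

Had fips been left to the default type detection, it would have been read as a number and the leading zeros would have been lost.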

Simple Summaries of Data

One dimension

  • Five-number summary
  • Boxplots
  • Histograms
  • Density plot
  • Barplot

Five Number Summary

summary(pollution$pm25)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.38    8.55   10.00    9.84   11.40   18.40

Boxplot

boxplot(pollution$pm25, col = "blue")

Histogram

hist(pollution$pm25, col = "green")

Density

plot(density(pollution$pm25), col="green")

Histogram with a rug

hist(pollution$pm25, col = "green")
rug(pollution$pm25)

Histogram with more breaks

hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)

Overlaying Features

boxplot(pollution$pm25, col = "blue")
abline(h = 12)

Overlaying Features

hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)

Barplot

barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")

Simple Summaries of Data

Two dimensions

  • Multiple/overlaid 1-D plots (Lattice/ggplot2)
  • Scatterplots
  • Smooth scatterplots

More than two dimensions

  • Overlaid/multiple 2-D plots; coplots
  • Use color, size, shape to add dimensions
  • Spinning plots
  • Actual 3-D plots (not that useful)

Multiple Boxplots

boxplot(pm25 ~ region, data = pollution, col = "red")

even more boxplots!

library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")

Multiple Histograms

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")

Scatterplot

with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)

Scatterplot - Using Color

with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)

Multiple Scatterplots

par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))

Scatter plot with loads of data

x=rnorm(10000)
y=rnorm(10000)
smoothScatter(x, y)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

Plot systems in R

  • Base plotting system: the “artist’s canvas” model. You start with a blank canvas and add things to it one by one; each element you add to the plot is a separate instruction. Example:
library(datasets)
data(cars)
with(cars, plot(speed, dist))

  • Lattice plotting system: instead of adding things one by one, the whole plot is created by a single instruction, so you have to specify a lot of information in the call. With a panel plot (for instance, plotting a as a function of b conditioned on c: for each level of c you get one panel plotting a as a function of b) you can show a lot of data on a single page. Example:
library(lattice)
## Warning: package 'lattice' was built under R version 3.0.3
state=data.frame(state.x77, region=state.region)
xyplot(Life.Exp~Income | region, data=state, layout=c(4, 1))

  • ggplot2 plotting system: combines elements of the two previous systems. Example:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
data(mpg)
qplot(displ, hwy, data=mpg)

Some Important Base Graphics Parameters

Many base plotting functions share a set of parameters. Here are a few key ones:

  • pch: the plotting symbol (default is open circle)

  • lty: the line type (default is solid line), can be dashed, dotted, etc.

  • lwd: the line width, specified as a positive multiple of the default width

  • col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name

  • xlab: character string for the x-axis label

  • ylab: character string for the y-axis label
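
As a rough illustration of the parameters listed above, the sketch below sets each of them once on the cars data set. The particular symbol, colours, and label texts are arbitrary choices made for this example.

```r
library(datasets)

with(cars, plot(speed, dist,
                pch = 19,                          # filled circle instead of the default open circle
                col = "steelblue",                 # colour given by name
                xlab = "Speed (mph)",              # x-axis label
                ylab = "Stopping distance (ft)"))  # y-axis label

abline(lm(dist ~ speed, data = cars),
       lty = 2,          # dashed line
       lwd = 2,          # twice the default line width
       col = "#FF0000")  # colour given as a hex code

head(colors())  # first few of the colour names R recognises
```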


Some Important Base Graphics Parameters

The par() function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.

  • las: the orientation of the axis labels on the plot
  • bg: the background color
  • mar: the margin size
  • oma: the outer margin size (default is 0 for all sides)
  • mfrow: number of rows and columns of plots, given as c(rows, columns) (plots are filled row-wise)
  • mfcol: number of rows and columns of plots, given as c(rows, columns) (plots are filled column-wise)
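
Because par() settings persist for later plots, a common pattern is to save the previous values and restore them afterwards. A minimal sketch, assuming the airquality data set; the names old_par and layout_in_use are illustrative only.

```r
library(datasets)

# par() returns the previous values of the parameters it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

hist(airquality$Ozone, main = "Ozone")  # drawn in the left panel
hist(airquality$Wind, main = "Wind")    # drawn in the right panel

layout_in_use <- par("mfrow")  # 1 row, 2 columns while the change is active

par(old_par)  # restore the settings that were in place before
```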

Some Important Base Graphics Parameters

Default values for global graphics parameters

par("lty")
## [1] "solid"
par("col")
## [1] "black"
par("pch")
## [1] 1

Base Plotting Functions

  • plot: make a scatterplot, or other type of plot depending on the class of the object being plotted

  • lines: add lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots

  • points: add points to a plot
  • text: add text labels to a plot using specified x, y coordinates
  • title: add annotations to x, y axis labels, title, subtitle, outer margin
  • mtext: add arbitrary text to the margins (inner or outer) of the plot
  • axis: adding axis ticks/labels
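
The functions above can be combined on an initially empty plot. The following is only a sketch on the cars data set; the label positions and texts are arbitrary choices for the illustration.

```r
library(datasets)

# Set up the plotting region without drawing points, axes or labels
with(cars, plot(speed, dist, type = "n", axes = FALSE, ann = FALSE))

with(cars, points(speed, dist, pch = 20))          # points: add the observations
smoothed <- lowess(cars$speed, cars$dist)          # smoothed (x, y) pairs
lines(smoothed, col = "red")                       # lines: connect the smoothed pairs
text(x = 5, y = 110, labels = "n = 50", adj = 0)   # text: label at given x, y coordinates
title(main = "Stopping distance vs speed",
      xlab = "Speed (mph)", ylab = "Distance (ft)")  # title: main title and axis labels
mtext("cars data set", side = 3, line = 0.3)       # mtext: text in the top margin
axis(1)                                            # axis: x axis with default ticks
axis(2)                                            # axis: y axis with default ticks
box()                                              # frame around the plotting region
```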


Base Plot with Annotation

library(datasets)
with(airquality, plot(Wind, Ozone))
title(main = "Ozone and Wind in New York City") ## Add a title

Base Plot with Annotation

## xaxt = "n" suppresses the default x axis; ann = FALSE suppresses the title and axis labels
with(airquality, plot(Wind, Ozone, xaxt = "n", ann = FALSE))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
axis(1, at = 1:5, labels = LETTERS[1:5]) ## Draw a custom x axis with letters as tick labels

Base Plot with Annotation

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))

Base Plot with Regression Line

with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)

Multiple Base Plots

par(mfrow = c(1, 2))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
})

Multiple Base Plots

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
plot(Temp, Ozone, main = "Ozone and Temperature")
mtext("Ozone and Weather in New York City", outer = TRUE)
})

explicitly launch graphics device

pdf(file = "myplot.pdf") ## Open PDF device; create 'myplot.pdf' in my working directory
# Create plot and send to a file (no plot appears on screen)
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data") ## Annotate plot; still nothing on screen
dev.off() ## Close the PDF file device
## pdf 
##   2
# Now you can view the file 'myplot.pdf' on your computer

copying a plot

library(datasets)
with(faithful, plot(eruptions, waiting)) ## Create plot on screen device
title(main = "Old Faithful Geyser data") ## Add a main title

dev.copy(png, file = "geyserplot.png") ## Copy my plot to a PNG file
## png 
##   3
dev.off() ## Don't forget to close the PNG device!
## pdf 
##   2

manipulate function

library(UsingR)
data(galton)
library(manipulate)
myHist <- function(mu){
  hist(galton$child,col="blue",breaks=100)
  lines(c(mu, mu), c(0, 150),col="red",lwd=5)
  mse <- mean((galton$child - mu)^2)
  text(63, 150, paste("mu = ", mu))
  text(63, 140, paste("MSE = ", round(mse, 2)))
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))

points of different sizes

library(UsingR)
## Warning: package 'UsingR' was built under R version 3.0.3
## Loading required package: MASS
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies
data(galton)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)),
     as.numeric(as.vector(freqData$child)),
     pch = 21, col = "black", bg = "lightblue",
     cex = .15 * freqData$freq,
     xlab = "parent", ylab = "child")

The Lattice Plotting System

The lattice plotting system is implemented using the following packages:

  • lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot

  • grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid
  • We seldom call functions from the grid package directly

  • The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting

  • All plotting/annotation is done at once with a single function call


Lattice Functions

  • xyplot: this is the main function for creating scatterplots
  • bwplot: box-and-whiskers plots (“boxplots”)
  • histogram: histograms
  • stripplot: like a boxplot but with actual points
  • dotplot: plot dots on “violin strings”
  • splom: scatterplot matrix; like pairs in base plotting system
  • levelplot, contourplot: for plotting “image” data

Lattice Functions

Lattice functions generally take a formula for their first argument, usually of the form

xyplot(y ~ x | f * g, data)
  • We use the formula notation here, hence the ~.

  • On the left of the ~ is the y-axis variable, on the right is the x-axis variable

  • f and g are conditioning variables - they are optional
  • the * indicates an interaction between two variables

  • The second argument is the data frame or list from which the variables in the formula should be looked up

  • If no data frame or list is passed, then the parent frame is used.

  • If no other arguments are passed, there are defaults that can be used.
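
To make the f * g notation concrete, here is a small sketch with two conditioning variables. It assumes the mtcars data set, with cyl and am converted to factors; the name cars_df is made up for this example.

```r
library(lattice)
library(datasets)

# Conditioning variables should be factors; am is coded 0 = automatic, 1 = manual
cars_df <- transform(mtcars,
                     cyl = factor(cyl),
                     am = factor(am, labels = c("automatic", "manual")))

# One panel per combination of cyl and am
p <- xyplot(mpg ~ wt | cyl * am, data = cars_df)
print(p)
```

A lattice call returns an object rather than drawing directly, which is why the plot is printed explicitly here.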

Simple Lattice Plot

library(datasets)
library(lattice)
## Convert 'Month' to a factor variable
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))

Lattice Panel Functions

  • Lattice functions have a panel function which controls what happens inside each panel of the plot.

  • The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel

  • Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)


Lattice Panel Functions

set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))
xyplot(y ~ x | f, layout = c(2, 1)) ## Plot with 2 panels

Lattice Panel Functions

## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
       panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
       panel.abline(h = median(y), lty = 2) ## Add a horizontal line at the median
})

Lattice Panel Functions: Regression line

## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
               panel.xyplot(x, y, ...) ## First call default panel function
               panel.lmline(x, y, col = 2) ## Overlay a simple linear regression line
       })

What is ggplot2?

  • Grammar of graphics represents an abstraction of graphics ideas/objects
  • Think “verb”, “noun”, “adjective” for graphics
  • Allows for a “theory” of graphics on which to build new graphics and graphics objects
  • “Shorten the distance from mind to page”

The Basics: qplot()

  • Works much like the plot function in base graphics system
  • Looks for data in a data frame, similar to lattice, or in the parent environment
  • Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)
  • Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled
  • qplot() hides what goes on underneath, which is okay for most operations
  • ggplot() is the core function and very flexible for doing things qplot() cannot do

ggplot2 “Hello, world!”

library(ggplot2)
qplot(displ, hwy, data = mpg)

Modifying aesthetics

qplot(displ, hwy, data = mpg, color = drv)

qplot(displ, hwy, data = mpg, shape = drv)

Adding a geom

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm") # linear regression model

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm", facets=.~cyl) # linear regression fit, one panel per number of cylinders

Histograms and density

qplot(hwy, data = mpg, fill = drv)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(hwy, data = mpg, geom="density", fill = drv)

qplot(hwy, data = mpg, geom="density", color = drv)

Facets

qplot(displ, hwy, data = mpg, facets = . ~ drv)   # facets = what should determine rows ~ what should determine columns

qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)

qplot(hwy, data = mpg, facets = drv ~ cyl, binwidth = 2)

Summary of qplot()

  • The qplot() function is the analog to plot() but with many built-in features
  • Syntax somewhere in between base/lattice
  • Produces very nice graphics, essentially publication ready (if you like the design)
  • Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)

Basic Components of a ggplot2 Plot

  • A data frame
  • aesthetic mappings: how data are mapped to color, size
  • geoms: geometric objects like points, lines, shapes.
  • facets: for conditional plots.
  • stats: statistical transformations like binning, quantiles, smoothing.
  • scales: what scale an aesthetic map uses (example: male = red, female = blue).
  • coordinate system
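
The components listed above can be seen side by side in a single call. This is only a sketch on the mpg data set; the colours, the faceting variable, and the axis window are arbitrary choices for the illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +  # data frame + aesthetic mappings
  geom_point() +                                          # geom: points
  stat_smooth(method = "lm", se = FALSE) +                # stat: a linear fit per drv group
  facet_grid(. ~ year) +                                  # facets: one panel per model year
  scale_color_manual(values = c("4" = "red",
                                "f" = "blue",
                                "r" = "darkgreen")) +     # scale: which colour each drv value gets
  coord_cartesian(ylim = c(10, 45))                       # coordinate system: visible y window

print(p)
```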

Building Plots with ggplot2

  • When building plots in ggplot2 (rather than using qplot) the “artist’s palette” model may be the closest analogy
  • Plots are built up in layers
  • Plot the data
  • Overlay a summary
  • Metadata and annotation

ggplot2 and qplot

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm", facets=.~cyl) # linear regression fit, one panel per number of cylinders

p=ggplot(mpg, aes(displ, hwy))
g=p+geom_point() # before geom_point() is added, the plot doesn't know whether to draw points, lines or anything else
print(g)

g+geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

g+geom_smooth(method="lm")

g+geom_smooth(method="lm")+facet_grid(.~manufacturer)

p+geom_point(color="steelblue", size=4)

p+geom_point(aes(color=fl), size=4, alpha=0.5)

h=p+geom_point(aes(color=fl), size=4, alpha=0.5)+labs(x="blah", y="bleh", title="blih")
print(h)

g+geom_smooth(size=4, linetype=3, method="lm", se=FALSE)

h+theme_bw(base_family="Times", base_size=20)

Annotation

  • Labels: xlab(), ylab(), labs(), ggtitle()
  • Each of the “geom” functions has options to modify
  • For things that only make sense globally, use theme()
  • Example: theme(legend.position = "none")
  • Two standard appearance themes are included
  • theme_gray(): The default theme (gray background)
  • theme_bw(): More stark/plain
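
A short sketch of these annotation tools used together, on the mpg data set. The label texts are made up for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point(alpha = 0.6) +                   # a geom-level option
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       title = "Fuel economy") +              # labels
  theme_bw() +                                # a complete appearance theme
  theme(legend.position = "none")             # a global tweak: hide the legend

print(p)
```

The theme() call is placed after theme_bw() on purpose: a complete theme replaces earlier theme settings, so a tweak added before it would be lost.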


A Note about Axis Limits

testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50,2] <- 100 ## Outlier!
plot(testdat$x, testdat$y, type = "l", ylim = c(-3,3))

g <- ggplot(testdat, aes(x = x, y = y))
g + geom_line()

Axis Limits

g + geom_line() + ylim(-3, 3)

g + geom_line() + coord_cartesian(ylim = c(-3, 3))
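
The two calls above behave differently. As the ggplot2 documentation describes it, ylim() sets limits on the scale, so values outside them are treated as missing before anything is drawn, whereas coord_cartesian() only zooms the view and leaves the data untouched. The sketch below uses a deterministic sine curve in place of the random data so that it is reproducible; the names toy, g_toy, p_scale and p_zoom are made up for the illustration.

```r
library(ggplot2)

# Deterministic stand-in for testdat: a sine curve with a single outlier
toy <- data.frame(x = 1:100, y = sin((1:100) / 5))
toy[50, "y"] <- 100

g_toy <- ggplot(toy, aes(x = x, y = y))

# Scale limits: the outlier lies outside [-3, 3] and is dropped from the line
p_scale <- g_toy + geom_line() + ylim(-3, 3)

# Coordinate limits: same window, but the line still heads up towards the outlier
p_zoom <- g_toy + geom_line() + coord_cartesian(ylim = c(-3, 3))

print(p_scale)
print(p_zoom)
```

The second plot is the one that matches the base graphics version with ylim = c(-3, 3).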

Using the cut function

mpg2=mpg
cutpoints=quantile(mpg2$cty, seq(0, 1, length=4), na.rm=TRUE)
mpg2$cty2=cut(mpg2$cty, cutpoints)
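
One detail to watch for: cut() builds intervals that are open on the left by default, so observations equal to the smallest cutpoint fall outside every interval and become NA. A minimal sketch of this, using mtcars from the datasets package instead of mpg2 so that it runs without ggplot2 (the variable names are illustrative):

```r
library(datasets)

x <- mtcars$mpg
cutpoints <- quantile(x, seq(0, 1, length = 4), na.rm = TRUE)  # minimum, terciles, maximum

default_groups <- cut(x, cutpoints)                         # intervals of the form (a, b]
closed_groups  <- cut(x, cutpoints, include.lowest = TRUE)  # first interval becomes [a, b]

sum(is.na(default_groups))  # the observations equal to the minimum have no group
table(closed_groups)        # with include.lowest = TRUE every observation has a group
```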

plot colours

  • The default color schemes for most plots in R are ugly
  • There are functions to help deal with colours, notably (in the grDevices package):
  • colorRamp
  • colorRampPalette
  • These functions take palettes of colors and help to interpolate between the colors
  • The function colors() lists the names of colors you can use in any plotting function

  • colorRamp: Take a palette of colors and return a function that takes values between 0 and 1, where 0 and 1 correspond to the extremes of the color palette (e.g. see the ‘gray’ function, which interpolates between black and white)
  • colorRampPalette: Take a palette of colors and return a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)

Examples of usage:

gray() behaves like a function returned by colorRamp, and heat.colors() like one returned by colorRampPalette:

gray(c(0.1, 0.2, 0.3))
## [1] "#1A1A1A" "#333333" "#4D4D4D"
heat.colors(5)
## [1] "#FF0000FF" "#FF5500FF" "#FFAA00FF" "#FFFF00FF" "#FFFF80FF"

We can create our own palette pal, interpolating between red and blue; pal(0.5) then returns the colour halfway between red and blue (as RGB values):

pal = colorRamp(c("red", "blue"))
pal(0.5)
##       [,1] [,2]  [,3]
## [1,] 127.5    0 127.5

As for colorRampPalette, the number we pass in is the number of colours we want back, evenly spaced along the palette:

pal = colorRampPalette(c("red", "blue"))
pal(4)
## [1] "#FF0000" "#AA0055" "#5500AA" "#0000FF"

The colours are returned as hexadecimal codes: after the #, the first two characters give the red component, the next two green, and the last two blue (each ranging from 00 to FF). We can also interpolate between more than two colours.
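
Both points can be checked directly. The sketch below decodes one of the hex codes printed above with col2rgb and builds a three-colour palette; the name pal3 and the choice of colours are made up for the example.

```r
# Decode a hex code into its red, green and blue components (0 to 255)
col2rgb("#AA0055")  # red = 170, green = 0, blue = 85

# A palette that interpolates between three colours instead of two
pal3 <- colorRampPalette(c("red", "yellow", "blue"))
pal3(3)  # asking for three colours returns the three anchor colours
pal3(7)  # asking for more fills in intermediate shades
```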


How do you convert from RGB to hexadecimal? Use the rgb function (transparency can be added through its fourth argument, alpha). Example of how to use it:

x=rnorm(10000)
y=rnorm(10000)
plot(x, y, col=rgb(0, 0, 0, 0.2), pch=19)
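
For reference, a few direct conversions with rgb(). This is a small sketch: the inputs are arbitrary, and components are on a 0 to 1 scale unless maxColorValue says otherwise.

```r
red_255   <- rgb(255, 0, 0, maxColorValue = 255)  # components on a 0-255 scale
red_unit  <- rgb(1, 0, 0)                         # same colour on the default 0-1 scale
faint_blk <- rgb(0, 0, 0, 0.2)                    # black at 20% opacity, as in the plot above

red_255    # "#FF0000"
red_unit   # "#FF0000"
faint_blk  # "#00000033": the last two hex digits encode the transparency
```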

You can create your own palette of colours, but there are some interesting palettes in R, notably in the RColorBrewer Package (check the brewer.pal help page and google to find out the names of the palettes)

  • There are 3 types of palettes
  • Sequential: typically used for numeric data
  • Diverging: used to plot how much something differs from a given quantity
  • Qualitative: used to represent categorical data
  • Palette information can be used in conjunction with colorRamp() and colorRampPalette()

Example of how to use:

library(RColorBrewer)
cols = brewer.pal(3, "BuGn")
pal = colorRampPalette(cols)
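
The palette function built above is not used yet. One possible way to apply it, sketched here with the volcano matrix from the datasets package, is to ask for more shades than the three anchor colours and pass them to image(); the choice of 20 shades is arbitrary.

```r
library(RColorBrewer)
library(datasets)

cols <- brewer.pal(3, "BuGn")    # three colours from a sequential palette
pal <- colorRampPalette(cols)    # function that interpolates between them

shades <- pal(20)                # 20 shades along the palette
image(volcano, col = shades)     # heat-map style plot of the volcano heights
```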

others:

library(datasets); data(swiss); require(stats); require(graphics)
pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))
