This is an R Markdown document I created for myself with examples of plots, based on Coursera’s Exploratory Data Analysis course.
After loading the data, it is often a good idea to check, and if necessary set, the class of each column, for example:
sapply(mtcars, class)
mtcars$mpg = as.numeric(mtcars$mpg)
airquality <- transform(airquality, Month = factor(Month)) # the Month column of the airquality data frame becomes a factor (the result must be assigned back to take effect)
Alternatively, you can set the class of each column while loading the data:
pollution <- read.csv("data/avgpm25.csv", colClasses = c("numeric", "character", "factor", "numeric", "numeric"))
head(pollution)
## pm25 fips region longitude latitude
## 1 9.771 1003 east -87.75 30.59
## 2 9.994 1027 east -85.84 33.27
## 3 10.689 1033 east -87.73 34.73
## 4 11.337 1049 east -85.80 34.46
## 5 12.120 1055 east -86.03 34.02
## 6 10.828 1069 east -85.35 31.19
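To see what colClasses changes, here is a minimal sketch that reads the same file twice. The three-row file below is a made-up stand-in for data/avgpm25.csv (its values and the leading zeros in fips are assumptions for illustration only).

```r
# Hypothetical miniature file standing in for data/avgpm25.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("pm25,fips,region",
             "9.771,01003,east",
             "11.2,06037,west"), tmp)

# Let read.csv guess the classes, then state them explicitly
default_read <- read.csv(tmp)
typed_read <- read.csv(tmp, colClasses = c("numeric", "character", "factor"))

sapply(default_read, class)  # fips is guessed to be integer
sapply(typed_read, class)    # fips is character, region is a factor
typed_read$fips              # identifier codes keep their leading zeros

unlink(tmp)                  # remove the temporary file
```

Reading identifier-like columns as "character" is usually the safer choice, because a numeric guess silently drops leading zeros.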
One dimension
summary(pollution$pm25)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.38 8.55 10.00 9.84 11.40 18.40
boxplot(pollution$pm25, col = "blue")
hist(pollution$pm25, col = "green")
plot(density(pollution$pm25), col="green")
hist(pollution$pm25, col = "green")
rug(pollution$pm25)
hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)
boxplot(pollution$pm25, col = "blue")
abline(h = 12)
hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
barplot(table(pollution$region), col = "wheat", main = "Number of Counties in Each Region")
Two dimensions
\(> 2\) dimensions
boxplot(pm25 ~ region, data = pollution, col = "red")
library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone (ppb)")
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
x=rnorm(10000)
y=rnorm(10000)
smoothScatter(x, y)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
library(datasets)
data(cars)
with(cars, plot(speed, dist))
library(lattice)
## Warning: package 'lattice' was built under R version 3.0.3
state=data.frame(state.x77, region=state.region)
xyplot(Life.Exp~Income | region, data=state, layout=c(4, 1))
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
data(mpg)
qplot(displ, hwy, data=mpg)
Many base plotting functions share a set of parameters. Here are a few key ones:
pch: the plotting symbol (default is open circle)
lty: the line type (default is solid line), can be dashed, dotted, etc.
lwd: the line width, specified as an integer multiple
col: the plotting color, specified as a number, string, or hex code; the colors() function gives you a vector of colors by name
xlab: character string for the x-axis label
ylab: character string for the y-axis label
The par() function is used to specify global graphics parameters that affect all plots in an R session. These parameters can be overridden when specified as arguments to specific plotting functions.
las: the orientation of the axis labels on the plot
bg: the background color
mar: the margin size
oma: the outer margin size (default is 0 for all sides)
mfrow: number of plots per row, column (plots are filled row-wise)
mfcol: number of plots per row, column (plots are filled column-wise)
Default values for global graphics parameters
par("lty")
## [1] "solid"
par("col")
## [1] "black"
par("pch")
## [1] 1
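The difference between global settings and per-call overrides can be sketched as follows. This is an illustration rather than part of the course material; it draws on a null PDF device (pdf(NULL)) only so that it runs without opening a window.

```r
pdf(NULL)  # null PDF device: nothing is written to disk

# par() returns the previous values of whatever it changes, so they can be restored
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(1:10, col = "red", pch = 19)  # col and pch apply to this call only
col_after_plot <- par("col")       # the global default is unchanged
hist(rnorm(100))                   # second panel of the 1 x 2 layout

par(old)                           # restore the saved settings
mfrow_restored <- par("mfrow")

invisible(dev.off())               # close the device
```

Saving the return value of par() and passing it back afterwards is a common way to keep layout changes from leaking into later plots.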
plot: make a scatterplot, or other type of plot depending on the class of the object being plotted
lines: add lines to a plot, given a vector x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
points: add points to a plot
text: add text labels to a plot using specified x, y coordinates
title: add annotations to x, y axis labels, title, subtitle, outer margin
mtext: add arbitrary text to the margins (inner or outer) of the plot
axis: adding axis ticks/labels
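As a rough sketch of how these functions combine, the block below starts from an empty plot and adds each piece in turn. The sine curve and all label texts are made up for illustration, and the null PDF device is used only so the code runs without a screen.

```r
pdf(NULL)  # null PDF device: nothing is written to disk

x <- seq(0, 2 * pi, length.out = 50)

# type = "n" sets up the axes without drawing data; xaxt = "n" suppresses the x axis
plot(x, sin(x), type = "n", xaxt = "n", xlab = "x", ylab = "sin(x)")

lines(x, sin(x), lwd = 2)                      # connect the dots
points(pi / 2, 1, pch = 19, col = "red")       # mark a single point
text(pi / 2, 1, labels = "maximum", pos = 4)   # label to the right of that point
axis(1, at = c(0, pi, 2 * pi), labels = c("0", "pi", "2*pi"))  # custom x axis
mtext("added with mtext()", side = 3, line = 0.5)              # text in the top margin
title(main = "Annotating a base plot")

invisible(dev.off())
```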
library(datasets)
with(airquality, plot(Wind, Ozone))
title(main = "Ozone and Wind in New York City") ## Add a title
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", xaxt='n', ann=FALSE))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue", xaxt='n', ann=FALSE))
axis(1, 1:5, LETTERS[1:5])
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", pch = 20))
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)
par(mfrow = c(1, 2))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
})
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
with(airquality, {
plot(Wind, Ozone, main = "Ozone and Wind")
plot(Solar.R, Ozone, main = "Ozone and Solar Radiation")
plot(Temp, Ozone, main = "Ozone and Temperature")
mtext("Ozone and Weather in New York City", outer = TRUE)
})
pdf(file = "myplot.pdf") ## Open PDF device; create 'myplot.pdf' in my working directory
# Create plot and send to a file (no plot appears on screen)
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data") ## Annotate plot; still nothing on screen
dev.off() ## Close the PDF file device
## pdf
## 2
# Now you can view the file 'myplot.pdf' on your computer
library(datasets)
with(faithful, plot(eruptions, waiting)) ## Create plot on screen device
title(main = "Old Faithful Geyser data") ## Add a main title
dev.copy(png, file = "geyserplot.png") ## Copy my plot to a PNG file
## png
## 3
dev.off() ## Don't forget to close the PNG device!
## pdf
## 2
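The open, plot, close cycle of a file device can be checked programmatically. The sketch below writes to a temporary file (the path from tempfile() is an assumption made to avoid touching the working directory) and uses dev.list() to count open devices at each step.

```r
out <- tempfile(fileext = ".pdf")        # hypothetical output path

n_before <- length(dev.list())           # devices open before we start
pdf(file = out)                          # open a PDF file device
n_open <- length(dev.list())             # one more device is now open

with(faithful, plot(eruptions, waiting)) # drawn into the file, not on screen
title(main = "Old Faithful Geyser data")

invisible(dev.off())                     # close the device so the file is finalised
n_after <- length(dev.list())

file.exists(out)                         # the file is there once the device is closed
```

A file device left open keeps capturing every later plot, which is why each pdf() or png() call should be paired with dev.off().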
manipulate function
library(UsingR)
data(galton)
library(manipulate)
myHist <- function(mu){
hist(galton$child,col="blue",breaks=100)
lines(c(mu, mu), c(0, 150),col="red",lwd=5)
mse <- mean((galton$child - mu)^2)
text(63, 150, paste("mu = ", mu))
text(63, 140, paste("MSE = ", round(mse, 2)))
}
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
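The slider makes it possible to see by eye which mu gives the smallest MSE. The same search can be sketched without manipulate by evaluating the MSE on a grid. To keep the example self-contained it uses faithful$waiting from the datasets package as a stand-in for galton$child, which is an assumption made only to avoid the UsingR dependency.

```r
heights <- faithful$waiting  # stand-in for galton$child

# Mean squared error of the data around a candidate centre mu
mse <- function(mu, x) mean((x - mu)^2)

# Try every value a slider with step 0.5 could take
grid <- seq(min(heights), max(heights), by = 0.5)
errs <- sapply(grid, mse, x = heights)

best <- grid[which.min(errs)]
c(grid_minimum = best, sample_mean = mean(heights))
```

The grid minimum lands on the grid point closest to the sample mean, since the MSE is a quadratic in mu that is minimised exactly at the mean.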
library(UsingR)
## Warning: package 'UsingR' was built under R version 3.0.3
## Loading required package: MASS
##
## Attaching package: 'UsingR'
##
## The following object is masked from 'package:ggplot2':
##
## movies
data(galton)
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
plot(as.numeric(as.vector(freqData$parent)),
as.numeric(as.vector(freqData$child)),
pch = 21, col = "black", bg = "lightblue",
cex = .15 * freqData$freq,
xlab = "parent", ylab = "child")
The lattice plotting system is implemented using the following packages:
lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
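grid: implements a separate graphics system, independent of the “base” system; the lattice package builds on top of grid
We seldom call functions from the grid package directly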
The lattice plotting system does not have a “two-phase” aspect with separate plotting and annotation like in base plotting
All plotting/annotation is done at once with a single function call
xyplot: this is the main function for creating scatterplots
bwplot: box-and-whiskers plots (“boxplots”)
histogram: histograms
stripplot: like a boxplot but with actual points
dotplot: plot dots on “violin strings”
splom: scatterplot matrix; like pairs in base plotting system
levelplot, contourplot: for plotting “image” data
Lattice functions generally take a formula for their first argument, usually of the form
xyplot(y ~ x | f * g, data)
We use the formula notation here, hence the ~.
On the left of the ~ is the y-axis variable, on the right is the x-axis variable
f and g are conditioning variables (they are optional); the * indicates an interaction between two variables
The second argument is the data frame or list from which the variables in the formula should be looked up
If no data frame or list is passed, then the parent frame is used.
If no other arguments are passed, there are defaults that can be used.
library(datasets)
library(lattice)
## Convert 'Month' to a factor variable
airquality <- transform(airquality, Month = factor(Month))
xyplot(Ozone ~ Wind | Month, data = airquality, layout = c(5, 1))
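The y ~ x | f * g form with two conditioning variables can be sketched with mtcars. The choice of dataset and the labels given to am are assumptions for illustration. The example also shows that a lattice call returns an object of class "trellis" instead of drawing directly; the plot appears when that object is printed.

```r
library(lattice)

# Conditioning variables should be factors; the labels for 'am' are our own naming
cars2 <- transform(mtcars,
                   cyl = factor(cyl),
                   am = factor(am, labels = c("automatic", "manual")))

# One panel per combination of cyl and am
p <- xyplot(mpg ~ wt | cyl * am, data = cars2)

class(p)  # a "trellis" object; nothing has been drawn yet
print(p)  # drawing happens on print (automatic at the console)
```

Inside a function or a loop the object is not auto-printed, so an explicit print() call is needed there.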
Lattice functions have a panel function which controls what happens inside each panel of the plot.
The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel
Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x+ rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))
xyplot(y ~ x | f, layout = c(2, 1)) ## Plot with 2 panels
## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = median(y), lty = 2) ## Add a horizontal line at the median
})
## Custom panel function
xyplot(y ~ x | f, panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call default panel function
panel.lmline(x, y, col = 2) ## Overlay a simple linear regression line
})
qplot() is the analog of the plot function in the base graphics system
qplot() hides what goes on underneath, which is okay for most operations
ggplot() is the core function and very flexible for doing things qplot() cannot do
library(ggplot2)
qplot(displ, hwy, data = mpg)
qplot(displ, hwy, data = mpg, color = drv)
qplot(displ, hwy, data = mpg, shape = drv)
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm") # linear regression model
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm", facets=.~cyl) # linear regression model
qplot(hwy, data = mpg, fill = drv)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(hwy, data = mpg, geom="density", fill = drv)
qplot(hwy, data = mpg, geom="density", color = drv)
qplot(displ, hwy, data = mpg, facets = . ~ drv) # facets = what should determine rows ~ what should determine columns
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
qplot(hwy, data = mpg, facets = drv ~ cyl, binwidth = 2)
qplot() function is the analog to plot() but with many built-in features
qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), method="lm", facets=.~cyl) # linear regression model
p=ggplot(mpg, aes(displ, hwy))
g=p+geom_point() # until a geom such as geom_point() is added, ggplot doesn't know whether you want to draw points, lines or something else
print(g)
g+geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
g+geom_smooth(method="lm")
g+geom_smooth(method="lm")+facet_grid(.~manufacturer)
p+geom_point(color="steelblue", size=4)
p+geom_point(aes(color=fl), size=4, alpha=0.5)
h=p+geom_point(aes(color=fl), size=4, alpha=0.5)+labs(x="blah", y="bleh", title="blih")
print(h)
g+geom_smooth(size=4, linetype=3, method="lm", se=FALSE)
h+theme_bw(base_family="Times", base_size=20)
xlab(), ylab(), labs(), ggtitle()
theme()
theme(legend.position = "none")
theme_gray(): The default theme (gray background)
theme_bw(): More stark/plain
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50,2] <- 100 ## Outlier!
plot(testdat$x, testdat$y, type = "l", ylim = c(-3,3))
g <- ggplot(testdat, aes(x = x, y = y))
g + geom_line()
g + geom_line() + ylim(-3, 3)
g + geom_line() + coord_cartesian(ylim = c(-3, 3))
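The last two calls look alike but treat the outlier differently: ylim() sets limits on the scale, so values outside the range are treated as missing before the line is drawn, whereas coord_cartesian() only zooms the view and keeps every point. A small sketch of the comparison follows; set.seed(1) is added here for reproducibility and is not in the original notes.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the example is reproducible
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, 2] <- 100  # outlier

g <- ggplot(testdat, aes(x = x, y = y)) + geom_line()

# Scale limits: the outlier is dropped, so the line joins point 49 straight to point 51
p_drop <- g + ylim(-3, 3)

# Coordinate limits: the outlier stays in the data, so the line runs off the top of the panel
p_zoom <- g + coord_cartesian(ylim = c(-3, 3))

print(p_drop)  # ggplot2 reports that a row with a missing value was removed
print(p_zoom)
```

coord_cartesian() is the closer match to what ylim does in base plot(), which also clips the view without removing data.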
mpg2=mpg
cutpoints=quantile(mpg2$cty, seq(0, 1, length=4), na.rm=TRUE)
mpg2$cty2=cut(mpg2$cty, cutpoints)
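One detail worth checking when cutting at quantiles: by default cut() builds intervals that are open on the left, so the smallest value sits exactly on the first cutpoint and becomes NA. The sketch below uses a made-up vector (not the mpg data) to show the effect and the include.lowest = TRUE remedy.

```r
x <- c(9, 12, 14, 15, 18, 21, 24, 26, 35)  # made-up values for illustration

# Tertile cutpoints: minimum, 1/3 quantile, 2/3 quantile, maximum
cutpoints <- quantile(x, seq(0, 1, length.out = 4), na.rm = TRUE)

open_low <- cut(x, cutpoints)                           # intervals of the form (a, b]
closed_low <- cut(x, cutpoints, include.lowest = TRUE)  # first interval becomes [a, b]

sum(is.na(open_low))    # the minimum falls outside every interval
sum(is.na(closed_low))  # no value is lost
table(closed_low)       # roughly equal-sized groups
```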
Functions for interpolating colours (in the grDevices package):
colorRamp
colorRampPalette
colorRamp: Take a palette of colors and return a function that takes values between 0 and 1, indicating the extremes of the color palette (e.g. see the ‘gray’ function, which interpolates between black and white)
colorRampPalette: Take a palette of colors and return a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)
Examples of usage:
gray behaves like a function returned by colorRamp, and heat.colors like a function returned by colorRampPalette:
gray(c(0.1, 0.2, 0.3))
## [1] "#1A1A1A" "#333333" "#4D4D4D"
heat.colors(5)
## [1] "#FF0000FF" "#FF5500FF" "#FFAA00FF" "#FFFF00FF" "#FFFF80FF"
We can create our own palette pal, interpolating between red and blue; pal(0.5) then returns the colour halfway between red and blue (as RGB values on a 0 to 255 scale):
pal = colorRamp(c("red", "blue"))
pal(0.5)
## [,1] [,2] [,3]
## [1,] 127.5 0 127.5
As for colorRampPalette, the number we pass in is the number of colours we want back, evenly spaced along the palette (endpoints included):
pal = colorRampPalette(c("red", "blue"))
pal(4)
## [1] "#FF0000" "#AA0055" "#5500AA" "#0000FF"
The colours are returned in hexadecimal: each pair of characters ranges from 00 to FF, with the first pair giving red, the second green, and the last blue. We can also interpolate between more than two colours.
How do we convert from RGB to hexadecimal? With the rgb function, which also accepts an optional transparency (alpha) value. Example of how to use:
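Both points can be sketched together: turning the RGB matrix from a colorRamp function into hexadecimal strings, and interpolating through more than two colours. The three-colour palette red, yellow, blue is an arbitrary choice for illustration.

```r
# colorRamp: positions in [0, 1] map to rows of RGB values on a 0-255 scale
ramp <- colorRamp(c("red", "blue"))
rgb_matrix <- ramp(c(0, 0.5, 1))

# rgb() accepts a three-column matrix; maxColorValue tells it the scale
hex <- rgb(rgb_matrix, maxColorValue = 255)
hex

# colorRampPalette through three anchor colours
pal3 <- colorRampPalette(c("red", "yellow", "blue"))
five <- pal3(5)  # the middle of five evenly spaced colours sits on the yellow anchor
five
```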
x=rnorm(10000)
y=rnorm(10000)
plot(x, y, col=rgb(0, 0, 0, 0.2), pch=19)
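For reference, a short sketch of what rgb() returns on its own and how col2rgb() goes the other way. The colour name "steelblue" is just an example taken from the list that colors() returns.

```r
opaque_red <- rgb(1, 0, 0)                           # components on a 0-1 scale by default
same_red <- rgb(255, 0, 0, maxColorValue = 255)      # same colour on a 0-255 scale
faint_black <- rgb(0, 0, 0, 0.2)                     # fourth argument is alpha (opacity)

opaque_red
same_red
faint_black   # eight hex digits: the last pair encodes the transparency

col2rgb("steelblue")                 # named colour back to a 3 x 1 RGB matrix
col2rgb(faint_black, alpha = TRUE)   # alpha = TRUE also returns the alpha row
```

With many overlapping points, a low alpha makes dense regions appear darker, which is what the plot above relies on.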
You can create your own palette of colours, but R also has some interesting ready-made palettes, notably in the RColorBrewer package (check the brewer.pal help page and search online for the palette names).
These palettes can be passed to colorRamp() and colorRampPalette(). Example of how to use:
library(RColorBrewer)
cols = brewer.pal(3, "BuGn")
pal = colorRampPalette(cols)
library(datasets); data(swiss); require(stats); require(graphics)
pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))