Graphics Overview

Some Background

R Coding Best Practices

Open up a new script file. At the top, use comments for:

  • The date
  • The intended purpose of the file
  • Any library imports you might need

We’ll touch on some other things to keep your code tidy.

Motor Trend Cars Plot

data(mtcars)
?mtcars
str(mtcars)

First, Data Manipulation and Cleaning

Look at mtcars$am

  • R treats values differently depending on how they’re labelled.
  • “factor” variables are labels - NOT numbers
  • In our data, 0=auto and 1=manual.
  • It is NOT true that manual = auto + 1, or 0*manual = auto

Using factor(mtcars$am) tells R not to use the values for math.

Computer Science: For Loops

A for loop runs a chunk of code for a specified number of times, with an index variable that changes each time.

for(i in 1:3){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3

(Note the syntax - especially the locations of the squirelly brackets and the spacing.)

For Loops Can use any index variable

for(i in c('cyl', "vs", 'am', 'gear', "carb")){
    print(i)
}
## [1] "cyl"
## [1] "vs"
## [1] "am"
## [1] "gear"
## [1] "carb"

Make a lot of factors at once

We could have written out a bunch of lines, but this is faster.

for(i in c('cyl', "vs", 'am', 'gear', "carb")){
    mtcars[, i] <- factor(mtcars[, i])
}
str(mtcars)

A Pretty Plot

A first scatterplot

attach(mtcars) # This function is pure evil, but useful
plot(mpg~disp)

Subsets of Data

It’s well known that manual cars are more efficient, so we should account for this.

mpg
mpg[am == 0]
mpg[am == 1]

Recall: square brackets are used for subsetting.

Plotting one variable at a time

par(mfrow=c(1,2)) # side by side plots
plot(mpg~disp, data=mtcars[am == 0, ])
plot(mpg~disp, data=mtcars[am == 1, ])

Colouring by variables

par(mfrow=c(1,1)) # Back to one plot
plot(mpg~disp, col=am)

Arguments to the plot function

plot(mpg~disp, main="Fuel Efficiency vs. Engine Size",
     ylab="Miles per Gallon", xlab="Displacement",
     pch=16, las=1, col=am)

Add a legend

legend('topright', legend=c('Auto', 'Manu'),
       col=c(2,1), pch=16)

Plots with ggplot2

A Newer Plotting Method

install.packages('ggplot2')
library(ggplot2)

The ggplot2 library adds some simple but powerful plotting functions.

qplot(x=disp, y=mpg, colour=am)

More Things to add!

qplot(x=disp, y=mpg, colour=am, shape=cyl, size=2)

The Grammar of Graphics

The ‘gg’ in ‘ggplot2’ stands for grammar of graphics.

This means you literally add layers to a plot.

This makes the plotting code more readable and more powerful.

Grammar Example

p <- ggplot(mapping = aes(x=disp, y=mpg, colour=am),
            data=mtcars) +
    geom_point(size=2) + theme_bw() + 
    xlab("Displacement") + ylab("MPG") +
    ggtitle("Fuel Efficiency by Size and Transmission")
p

Some things are super simple

p + geom_smooth()

Other Plots in Base Graphics

barplot(table(am))

boxplot(mpg)
boxplot(mpg~am) # Formula Notation
boxplot(mpg~am*cyl)

# Histogram
hist(mpg)

# Density Function
plot(density(mpg)) # Kernel Density Estimation

# Pie Charts: Avoid at all cost.

Other Scatter Plots

# Make some toy data
x <- 1:20
y <- 5+2*sin(x/2) + rnorm(n=20, mean = 0, sd = 1)

plot(x,y, type="l")
plot(x,y, type="l", lty=2)

plot(y~x, type="l") # Formula Notation

# points and lines functions
plot(y~x, type="s")
lines(y~x, type='h')
points(y~x, pch=16)

Graphical Parameters

Changing par()

Setting a Default

par() will let you change the default plotting values.

R doesn’t have an easy way to revert changes, so we have a workaround.

defaultpar <- par()

Remember this?

par(mfrow=c(1,2))
  • mfrow: Multiple Figures ROW-wise
  • 1 means one row of columns
  • 2 means two columns

Two Plots

par(mfrow=c(1,2))
plot(mpg~disp)
boxplot(mpg~am)

Changing Margins

# Default margins
par(mar=c(5.1,4.1,4.1,2.1))

R adds a lot of white space. Try changing the numbers below and running the code again!

par(mar=c(4,4,4,4))
plot(mpg~disp)

But when would you use this?

par(mar=c(5,4,4,0), mfrow=c(1,2))
plot(mpg~wt)
par(mar=c(5,0,4,2))
plot(mpg~disp, yaxt='n')

Adding a plot to a plot

par(mfrow=c(1,1), mar=c(5,4,4,2)) # revert changes
plot(mpg~disp)
par(new=TRUE) # Makes R forget about current plot
plot(mpg~wt, col="red")

Controlling the axes

plot(mpg~disp)
par(new=TRUE)
plot(mpg~wt, col=2, xaxt="n", xlab="")

axis(side=3, col=2, col.axis=2)
mtext(text='wt', side=3, col=2)

# If the y limits were different
axis(side=4, col=2, col.axis=2)

Note that this is usually a terrible idea. (Not as bad as pie charts, but still bad.)

Default Plot Parameters

You can also change all of the defaults.

  • cex, cex.main, cex.axis… are Character EXpansion - the relative size of text and points (2 means twice as big as normal)
  • col, col.main, col.axis, … let you change the colours on all subsequent plots
  • bg changes the background colour
  • pch, lty, type will change the line type or plotting character for your next plots

If you like ugly things…

par(col='hotpink', cex=0.6, bg='lightblue', pch="?", 
    col.axis='darkorange') # don't do this
plot(mpg~disp)

Some more colour fun

You can define your own rgb values

x <- 1:40/pi
y <- sin(x)

par(mfrow=c(1,3))
plot(x, y, col=rgb(1,0,0), type='b')
plot(x, y, col=rgb(1,0,1), type='b')
plot(x, y, col="#ff0000", type='b')

Remember for loops?

red <- seq(0,1, by=0.1)
x <- seq(0, 4*pi, length.out = 200)
y <- sin(x)

par(mfrow=c(1,1))
plot(x, y, col=rgb(0,1,0), type='l', lwd=2)
for(i in 1:length(red)){
    lines(x, y*(0.9^i), col=rgb(red[i], 1, 0), lwd=2)
}

This is just a silly example to remind you of some plot functions.

Transparency

# Start with an empty plot
plot(1,1, type='n', ylim=c(0,0.75), xlim=c(-3,3))
for(i in 1:200) lines(density(rnorm(37)),
                      col=rgb(0,0,0,0.2))

Colour Palletes

A colour pallette is a continuous change in colours.

R has several built in and you can make your own.

?terrain.colors
?colorRampPalette

Variance-Covariance Matrix

mtcars_small <- mtcars[,c('mpg','disp','hp','wt','qsec')]
# try heat.colors, cm.colors, etc, also try changing the '30'
heatmap(cov(mtcars_small), col=terrain.colors(30))

Custom Pallette

mycolours <- colorRampPalette(c('yellow', 'black', 'purple'))
heatmap(var(mtcars_small), col=mycolours(30))

Saving Plots

?png

Anything between png() and dev.off() will be saved.

If you overwrite something, it’s overwritten in the file.

Saving Example

Once you’re happy with your plot:

getwd() # The plot will save to this folder
png(filename = "BeautifulPlot.png", height=600, 
    width=1000, units=px)

plot(mpg~disp, main="Save Gas by Driving Manual",
     col=am)
# Nothing shows up, but there's a file being created

dev.off()
# File is finished being created.

Further Reading for Plots

For 3D plotting, the rgl package is great!

install.packages('rgl')
library('rgl')
plot3d(x=disp, y=wt, z=mpg, type = 's') # 's': Spheres
plot3d(x=disp, y=wt, z=mpg, type = 'h', add=TRUE)

Other useful plotting packages

  • plot3D for static 3D plots
  • ggplot2, ggmap, ggvis, etc for pretty plots
  • shiny for interactive plots (very advanced)
  • RGoogleMaps, OpenStreetMap, maps for mapping

Making Basic Models

t-tests

Let’s do some Statistics!

From before, mpg changes with transmission type.

How can we quantify this?

With statistics, of course! Let’s do a t-test.

Hypothesis and Assumptions

Assumptions: Random sample, normal(ish) population, independent observations, equal variance.

Hypotheses:

\[H_0: \mu_1 = \mu_2 \hbox{, }\quad H_A: \mu_1 \ne \mu_2\]

(Sorry about the math.)

“t-tests” are not just one thing

?t.test

Have a look through the arguments. You can do a lot of different tests, make sure you’re doing the right one!

  • alternative: one-sided or two sided
  • mu: The hypothesized value (e.g. \(\mu_1 - \mu_2 > 5\))
  • paired: Are your data in pairs? (e.g. twins, left ey/right eye)
  • conf.level: the function returns a confidence interval (this does not affect the actual t-test)
  • var.equal: Check the variances first (don’t have to be exactly equal)

Other Plots

# Always plot things first
boxplot(mpg~am)
par(mfrow=c(1,2))
hist(mpg[am == 0])
hist(mpg[am == 1]) # close enough

var(mpg[am == 0])
var(mpg[am == 1]) # not equal at all!

# The usual way
t.test(mpg[am==0], mpg[am==1])
# Formula notation (not always possible)
t.test(mpg~am)

Multiple Comparisons

p-value correction

\(\alpha=0.05\) means your accepting a 5% chance of a false positive.

If you do two tests at \(\alpha=0.05\), you still THINK that you have a 5% chance, but you’re wrong (it’s closer to 10%).

If you do 20 tests, you have a 74% chance of a false positive!!!

This is how we get results like “vaccines cause autism” and “red wine makes you immune to cancer”.

We need to correct the p-values.

Some possible corrections

There are 3 levels for cyl, meaning 3 possible pairwise t-tests.

# Fancy code (very useful)
tapply(mpg, cyl, mean) # mean of mpg accross cyl levels

?pairwise.t.test

pairwise.t.test(mpg, cyl, p.adj="bonf")
pairwise.t.test(mpg, cyl, p.adj="holm")
# see also: "holm", "hochberg", "hommel", "bonferroni", 
#   "BH", "BY", "fdr", "none"

More commonly: ANOVA

ANOVA does all of the comparisons at once, rather than adjusting the final p-values.

As the name implies, it looks closely at the variance to do this.

?anova
?aov

Cylinders again

mtcyl <- aov(mpg~cyl)
mtcyl
anova(mtcyl) # Slightly more readable
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## cyl        2 824.78  412.39  39.697 4.979e-09 ***
## Residuals 29 301.26   10.39                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So there is at least one different group

Now we can look at those t-tests again to see which group differs.

pairwise.t.test(mpg, cyl, p.adj="bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mpg and cyl 
## 
##   4       6      
## 6 0.00036 -      
## 8 2.6e-09 0.01246
## 
## P value adjustment method: bonferroni

Two Way ANOVA

For when you have multiple different groupings, i.e. cylinders and transmission.

# No interaction
anova(aov(mpg ~ cyl + am))

# Interaction
anova(aov(mpg ~ cyl * am)) # Note the *

Linear Models

The Usual Model

mylm <- lm(mpg ~ disp + cyl + am)
summary(mylm)

We’re going to look at some extensions of this.

Formula Notation

Model Notation
\(y = \beta_0+\beta_1x_1+\beta_2x_2\) y ~ x1 + x2
\(y = \beta_1x_1 +\beta_2x_2\) y ~ x1 + x2 - 1
\(y = \beta_0 + \beta_1x_1x_2\) y ~ x1 : x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) y ~ x1 + x2 + x1:x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) y ~ x1 * x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) y ~ (x1 + x2)^2
\(y = \beta_0 + \beta_1x_1^2\) y ~ I(x1^2)
\(y = \beta_0 + \beta_1x_1 + \beta_2x_1^2\) y ~ poly(x1, 2)
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...\) y ~ .

Formulae in Action

Recall the small mtcars dataset.

head(mtcars_small)
str(mtcars_small)

Let’s do some regression.

All of the variables

It’s very important that we specify the dataframe

# ALWAYS PLOT!!!
pairs(mtcars_small)

# Create the model
lm_all <- lm(mpg ~ ., data = mtcars_small)
summary(lm_all)

Model Selection

We don’t need to keep remaking models!

The weird \(\sim\;.\) thing works here, too.

lm_small <- update(lm_all, ~ . - disp)
summary(lm_small)
lm_small <- update(lm_small, ~ . - qsec)
summary(lm_small)

ALWAYS PLOT THINGS

par(mfrow=c(1,3))
plot(mpg~hp); abline(lm(mpg~hp))
plot(mpg~disp); abline(lm(mpg~disp))
plot(hp~disp); abline(lm(hp~disp)) # multicollinearity = bad

Random Effects Models

Sleep Study Example

install.packages("lme4")
library(lme4)
?sleepstudy

They tested reaction times in sleep deprived subjects.

There were multiple observations per subject.

This means that there’s variance for each subject as well as between subjects!

Random effects models account for this.

Multiple Obsevations per Subject

plot(Reaction ~ Subject, data = sleepstudy)

Looking at the data

str(sleepstudy)
plot(Reaction ~ Days, data = sleepstudy, col = Subject)

Spaghetti Plot

subjects <- levels(sleepstudy$Subject)
plot(Reaction ~ Days, data = sleepstudy, type = 'n')
for(i in 1:length(subjects)){
    lines(Reaction ~ Days, col=i,
          data = subset(sleepstudy, Subject == subjects[i]))
}

Plotting the Mean and SD

plot(Reaction ~ Days, data = sleepstudy)
mean_day <- with(sleepstudy, tapply(Reaction, Days, mean))
sd_day <- with(sleepstudy, tapply(Reaction, Days, sd))
lines(0:9, mean_day, lwd=2)

Even More Formula Notation

\(y \sim x + (x|z)\)

The bar means “within” or “given”. This is modelling the mean of y, but the value changes with x and the variance of x depends on z.

Usually, z is subject ID or something similar.

Modelling

The variance within each day is partially explained by the subject - some people might not be affected as much by sleep desprivation.

sleepy <- lmer(Reaction ~ Days + (Days | Subject),
               data = sleepstudy)
summary(sleepy)

RMarkdown

RMarkdown is a quick way to make reproducible research.

  • Simple syntax
  • Easily integrate R code and explanations
  • Include LaTeX
  • This presentation was made in RMarkdown

To make one, go to File -> New -> RMarkdown

Making Functions

Reducing Redundancy

Why make a custom function?

  • You’re writing the same code over and over
  • You’re doing the same thing to many different objects
  • You don’t like the built in way of doing things
  • You’re a rebel

General Syntax

fun_name <- function(x){
    value <- something
    return(value)
}

The names function and return are reserved in R.

Example - New Mean

mean_na <- function(x){
    newmean <- mean(x, na.rm=TRUE)
    return(newmean)
}

x <- c(0,2,3,1,NA)
mean(x)
## [1] NA
mean_na(x)
## [1] 1.5

Less Syntax

A function will return the last thing it sees.

It’s like typing something into the console.

mean_na <- function(x){
    mean(x, na.rm=TRUE)
}

Minimal Syntax

If a function is only one line, you don’t need the curly brackets.

mean_na <- function(x) mean(x, na.rm=TRUE)

If-Else statements

is_it_1 <- function(a){
    if(a == 1){
        return("Yes")
    } else {
        return("No")
    }
}

is_it_1(1)
## [1] "Yes"
is_it_1("a")
## [1] "No"

A Useful Function

plot_mpg <- function(xvar){
    plot(mpg~xvar, main="Fuel Efficiency", data=mtcars,
         ylab="Miles per Gallon", pch=16, col=am)
}
plot_mpg(disp)

The elipses argument

plot_mpg <- function(xvar, ...){
    plot(mpg~xvar, main="Fuel Efficiency", data=mtcars, 
         ylab="Miles per Gallon",  pch=16, col=am, ...)
}
plot_mpg(disp, xlab="Displacement")

Linear model for any two variables

lm2 <- function(x, y, plot_it = FALSE, ...){
    newlm <- lm(y~x)
    
    if(plot_it){
        plot(y~x, ...)
        abline(newlm)
    }
    
    summary(newlm)
}
  • Uses a default value
  • Uses an if statement
  • Can add things to the plot function
  • Returns the last thing it sees

Some Pairwise Models

par(mfrow=c(1,2))
lm_disp <- lm2(disp, mpg)
lm_wt <- lm2(wt, mpg, plot_it = TRUE, xlab = "Weight")
lm_hp <- lm2(hp, mpg, plot_it = TRUE, xlab = "Horsepower")

Generalized Linear Models

A Small Extension

What is a GLM?

Linear model: mean of \(y\) is a linear function of \(x\).

GLM: mean of \(g(y)\) is a linear function of \(x\)

Choosing \(g()\) lets us do a lot of different things.

Examples: Poisson regression, Logistic regression, Multinomial regression

Logistic Example

Logistic regression uses continuous or categorical variables to predict a binary outcome.

Recall that am is binary.

It doesn’t make sense, but we’ll do a regression anyway.

Familiar Formulas

The only thing that changes is the family argument.

am_model <- glm(am ~ hp, data=mtcars, family=binomial)
summary(am_model)

# For model comparison
AIC(am_model)

Plotting is hard

Before we can plot, we need estimates at each \(x\) value.

range(disp)
## [1]  71.1 472.0
newx <- seq(50,500, by = 2)
newy <- predict(am_model, newdata = list(disp=newx), 
                type='response')

Plotting the Predictions

plot(disp, am)
lines(newx, newy)

Other Models

Family Default Link Function \(g()\)
binomial (link = logit)
gaussian (link = identity)
Gamma (link = inverse)
inverse.gaussian (link = 1/mu^2)
poisson (link = log)
quasi (link = identity, variance = constant)
quasibinomial (link = logit)
quasipoisson (link = log)