Open up a new script file. At the top, use comments to say what the script is for.
We’ll touch on some other things to keep your code tidy.
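For example, a header might look something like this (just a sketch; the details are placeholders):
# mtcars: plotting and basic statistics
# Author: Your Name
# Date: 2024-01-01
# Purpose: workshop notes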
data(mtcars)
?mtcars
str(mtcars)
Look at mtcars$am
- R treats values differently depending on how they’re labelled.
- “factor” variables are labels - NOT numbers
- In our data, 0=auto and 1=manual.
- It is NOT true that manual = auto + 1, or 0*manual = auto
Using factor(mtcars$am) tells R not to use the values for math.
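For example (just an illustration; we convert the columns for real a few lines below, keeping the 0/1 coding that the rest of the code relies on):
factor(mtcars$am) # a factor with levels "0" and "1", treated as labels rather than numbers
# Optional readable labels:
# factor(mtcars$am, levels = c(0, 1), labels = c('auto', 'manual'))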
A for loop runs a chunk of code a specified number of times, with an index variable that changes on each pass.
for(i in 1:3){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
(Note the syntax - especially the locations of the curly brackets and the spacing.)
for(i in c('cyl', "vs", 'am', 'gear', "carb")){
print(i)
}
## [1] "cyl"
## [1] "vs"
## [1] "am"
## [1] "gear"
## [1] "carb"
We could have written out a bunch of lines, but this is faster.
for(i in c('cyl', "vs", 'am', 'gear', "carb")){
mtcars[, i] <- factor(mtcars[, i])
}
str(mtcars)
attach(mtcars) # This function is pure evil, but useful
plot(mpg~disp)
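Since attach() can cause headaches, a safer habit for one-off commands is with(), which gives temporary access to the columns (a small sketch, same plot as above):
with(mtcars, plot(mpg ~ disp)) # nothing is left attached afterwards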
It’s well known that manual cars are more efficient, so we should account for this.
mpg
mpg[am == 0]
mpg[am == 1]
Recall: square brackets are used for subsetting.
par(mfrow=c(1,2)) # side by side plots
plot(mpg~disp, data=mtcars[am == 0, ])
plot(mpg~disp, data=mtcars[am == 1, ])
par(mfrow=c(1,1)) # Back to one plot
plot(mpg~disp, col=am)
plot(mpg~disp, main="Fuel Efficiency vs. Engine Size",
ylab="Miles per Gallon", xlab="Displacement",
pch=16, las=1, col=am)
legend('topright', legend=c('Auto', 'Manual'),
col=c(1,2), pch=16)
install.packages('ggplot2')
library(ggplot2)
The ggplot2 library adds some simple but powerful plotting functions.
qplot(x=disp, y=mpg, colour=am)
qplot(x=disp, y=mpg, colour=am, shape=cyl, size=2)
The ‘gg’ in ‘ggplot2’ stands for grammar of graphics.
This means you literally add layers to a plot.
This makes the plotting code more readable and more powerful.
p <- ggplot(mapping = aes(x=disp, y=mpg, colour=am),
data=mtcars) +
geom_point(size=2) + theme_bw() +
xlab("Displacement") + ylab("MPG") +
ggtitle("Fuel Efficiency by Size and Transmission")
p
p + geom_smooth()
barplot(table(am))
boxplot(mpg)
boxplot(mpg~am) # Formula Notation
boxplot(mpg~am*cyl)
# Histogram
hist(mpg)
# Density Function
plot(density(mpg)) # Kernel Density Estimation
# Pie Charts: Avoid at all cost.
# Make some toy data
x <- 1:20
y <- 5+2*sin(x/2) + rnorm(n=20, mean = 0, sd = 1)
plot(x,y, type="l")
plot(x,y, type="l", lty=2)
plot(y~x, type="l") # Formula Notation
# points and lines functions
plot(y~x, type="s")
lines(y~x, type='h')
points(y~x, pch=16)
par() will let you change the default plotting values.
R doesn’t have an easy way to revert changes, so we have a workaround.
defaultpar <- par()
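A slightly cleaner version of this workaround (an alternative suggestion, not from the original workflow) saves only the settable parameters, so restoring them later doesn't complain about read-only ones:
defaultpar <- par(no.readonly = TRUE) # save the settable graphics parameters
# ... change par() however you like ...
par(defaultpar) # restore the saved settings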
mfrow: Multiple Figures, drawn ROW-wise.
par(mfrow=c(1,2))
plot(mpg~disp)
boxplot(mpg~am)
# Default margins
par(mar=c(5.1,4.1,4.1,2.1))
R adds a lot of white space. Try changing the numbers below and running the code again!
par(mar=c(4,4,4,4))
plot(mpg~disp)
par(mar=c(5,4,4,0), mfrow=c(1,2))
plot(mpg~wt)
par(mar=c(5,0,4,2))
plot(mpg~disp, yaxt='n')
par(mfrow=c(1,1), mar=c(5,4,4,2)) # revert changes
plot(mpg~disp)
par(new=TRUE) # Draw the next plot on top, without clearing the current one
plot(mpg~wt, col="red")
plot(mpg~disp)
par(new=TRUE)
plot(mpg~wt, col=2, xaxt="n", xlab="")
axis(side=3, col=2, col.axis=2)
mtext(text='wt', side=3, col=2)
# If the y limits were different
axis(side=4, col=2, col.axis=2)
Note that this is usually a terrible idea. (Not as bad as pie charts, but still bad.)
You can also change all of the defaults.
- cex, cex.main, cex.axis, … are Character EXpansion: the relative size of text and points (2 means twice as big as normal)
- col, col.main, col.axis, … let you change the colours on all subsequent plots
- bg changes the background colour
- pch, lty, type will change the line type or plotting character for your next plots

par(col='hotpink', cex=0.6, bg='lightblue', pch="?",
col.axis='darkorange') # don't do this
plot(mpg~disp)
rgb() builds colours from red, green, and blue values.
x <- 1:40/pi
y <- sin(x)
par(mfrow=c(1,3))
plot(x, y, col=rgb(1,0,0), type='b')
plot(x, y, col=rgb(1,0,1), type='b')
plot(x, y, col="#ff0000", type='b')
We can also step through colour values with a for loop.
red <- seq(0, 1, by=0.1)
x <- seq(0, 4*pi, length.out = 200)
y <- sin(x)
par(mfrow=c(1,1))
plot(x, y, col=rgb(0,1,0), type='l', lwd=2)
for(i in 1:length(red)){
lines(x, y*(0.9^i), col=rgb(red[i], 1, 0), lwd=2)
}
This is just a silly example to remind you of some plot functions.
# Start with an empty plot
plot(1,1, type='n', ylim=c(0,0.75), xlim=c(-3,3))
for(i in 1:200) lines(density(rnorm(37)),
col=rgb(0,0,0,0.2))
A colour palette is a set of colours, often shading continuously from one to another.
R has several built in, and you can make your own.
?terrain.colors
?colorRampPalette
mtcars_small <- mtcars[,c('mpg','disp','hp','wt','qsec')]
# try heat.colors, cm.colors, etc, also try changing the '30'
heatmap(cov(mtcars_small), col=terrain.colors(30))
mycolours <- colorRampPalette(c('yellow', 'black', 'purple'))
heatmap(var(mtcars_small), col=mycolours(30))
?png
Anything between png() and dev.off() will be saved.
If you overwrite something, it’s overwritten in the file.
Once you’re happy with your plot:
getwd() # The plot will save to this folder
png(filename = "BeautifulPlot.png", height=600,
width=1000, units="px")
plot(mpg~disp, main="Save Gas by Driving Manual",
col=am)
# Nothing shows up, but there's a file being created
dev.off()
# File is finished being created.
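Other graphics devices work the same way; for a vector-format version you could use pdf() instead (a sketch; the file name is made up):
pdf("BeautifulPlot.pdf", width = 8, height = 5) # width and height in inches
plot(mpg~disp, main="Save Gas by Driving Manual", col=am)
dev.off()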
For 3D plotting, the rgl package is great!
install.packages('rgl')
library('rgl')
plot3d(x=disp, y=wt, z=mpg, type = 's') # 's': Spheres
plot3d(x=disp, y=wt, z=mpg, type = 'h', add=TRUE)
- plot3D for static 3D plots
- ggplot2, ggmap, ggvis, etc. for pretty plots
- shiny for interactive plots (very advanced)
- RGoogleMaps, OpenStreetMap, maps for mapping

From before, mpg changes with transmission type.
How can we quantify this?
With statistics, of course! Let’s do a t-test.
Assumptions: Random sample, normal(ish) population, independent observations, equal variance.
Hypotheses:
\[H_0: \mu_1 = \mu_2 \hbox{, }\quad H_A: \mu_1 \ne \mu_2\]
(Sorry about the math.)
?t.test
Have a look through the arguments. You can do a lot of different tests, make sure you’re doing the right one!
- alternative: one-sided or two-sided
- mu: the hypothesized value (e.g. \(\mu_1 - \mu_2 > 5\))
- paired: are your data in pairs? (e.g. twins, left eye/right eye)
- conf.level: the function returns a confidence interval (this does not affect the actual t-test)
- var.equal: check the variances first (they don't have to be exactly equal)

# Always plot things first
boxplot(mpg~am)
par(mfrow=c(1,2))
hist(mpg[am == 0])
hist(mpg[am == 1]) # close enough
var(mpg[am == 0])
var(mpg[am == 1]) # not equal at all!
# The usual way
t.test(mpg[am==0], mpg[am==1])
# Formula notation (not always possible)
t.test(mpg~am)
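Since the variances were so different, the default Welch test (var.equal = FALSE) is the right choice here; if they had been similar, you could pool them (a sketch):
t.test(mpg~am, var.equal = TRUE) # classical pooled-variance two-sample t-test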
\(\alpha=0.05\) means you're accepting a 5% chance of a false positive.
If you do two tests at \(\alpha=0.05\), you still THINK that you have a 5% chance, but you’re wrong (it’s closer to 10%).
If you do 20 tests, the chance of at least one false positive is \(1 - 0.95^{20} \approx 64\%\)!!!
This is how we get results like “vaccines cause autism” and “red wine makes you immune to cancer”.
We need to correct the p-values.
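The corrections used below can also be applied directly to a vector of p-values with p.adjust() (a sketch with made-up p-values):
p <- c(0.01, 0.04, 0.30) # hypothetical raw p-values
p.adjust(p, method = "bonferroni") # multiplies by the number of tests (capped at 1)
p.adjust(p, method = "holm")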
There are 3 levels for cyl, meaning 3 possible pairwise t-tests.
# Fancy code (very useful)
tapply(mpg, cyl, mean) # mean of mpg across cyl levels
?pairwise.t.test
pairwise.t.test(mpg, cyl, p.adj="bonf")
pairwise.t.test(mpg, cyl, p.adj="holm")
# see also: "holm", "hochberg", "hommel", "bonferroni",
# "BH", "BY", "fdr", "none"
ANOVA compares all of the groups at once, rather than doing many pairwise tests and adjusting the p-values.
As the name implies, it does this by comparing the variance between groups to the variance within groups.
?anova
?aov
mtcyl <- aov(mpg~cyl)
mtcyl
anova(mtcyl) # Slightly more readable
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 824.78 412.39 39.697 4.979e-09 ***
## Residuals 29 301.26 10.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we can look at those t-tests again to see which group differs.
pairwise.t.test(mpg, cyl, p.adj="bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: mpg and cyl
##
## 4 6
## 6 0.00036 -
## 8 2.6e-09 0.01246
##
## P value adjustment method: bonferroni
For when you have multiple different groupings, e.g. cylinders and transmission.
# No interaction
anova(aov(mpg ~ cyl + am))
# Interaction
anova(aov(mpg ~ cyl * am)) # Note the *
mylm <- lm(mpg ~ disp + cyl + am)
summary(mylm)
We’re going to look at some extensions of this.
| Model | Notation |
|---|---|
| \(y = \beta_0+\beta_1x_1+\beta_2x_2\) | y ~ x1 + x2 |
| \(y = \beta_1x_1 +\beta_2x_2\) | y ~ x1 + x2 - 1 |
| \(y = \beta_0 + \beta_1x_1x_2\) | y ~ x1 : x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 + x2 + x1:x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 * x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ (x1 + x2)^2 |
| \(y = \beta_0 + \beta_1x_1^2\) | y ~ I(x1^2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_1^2\) | y ~ poly(x1, 2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...\) | y ~ . |
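As a quick check of the notation (a sketch using numeric columns of mtcars), the * form and the fully written-out form fit the same interaction model:
coef(lm(mpg ~ wt * hp, data = mtcars))
coef(lm(mpg ~ wt + hp + wt:hp, data = mtcars)) # identical coefficients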
Recall the small mtcars dataset.
head(mtcars_small)
str(mtcars_small)
Let’s do some regression.
It's very important that we specify the data frame, since the ~ . notation needs to know which columns to include.
# ALWAYS PLOT!!!
pairs(mtcars_small)
# Create the model
lm_all <- lm(mpg ~ ., data = mtcars_small)
summary(lm_all)
We don’t need to keep remaking models!
The weird \(\sim\;.\) thing works here, too.
lm_small <- update(lm_all, ~ . - disp)
summary(lm_small)
lm_small <- update(lm_small, ~ . - qsec)
summary(lm_small)
par(mfrow=c(1,3))
plot(mpg~hp); abline(lm(mpg~hp))
plot(mpg~disp); abline(lm(mpg~disp))
plot(hp~disp); abline(lm(hp~disp)) # multicollinearity = bad
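To quantify the multicollinearity, one common check is the variance inflation factor from the car package (a sketch; car isn't used elsewhere in these notes):
install.packages('car')
library(car)
vif(lm_all) # values much larger than about 5-10 suggest problematic collinearity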
install.packages("lme4")
library(lme4)
?sleepstudy
They tested reaction times in sleep deprived subjects.
There were multiple observations per subject.
This means that there’s variance for each subject as well as between subjects!
Random effects models account for this.
plot(Reaction ~ Subject, data = sleepstudy)
str(sleepstudy)
plot(Reaction ~ Days, data = sleepstudy, col = Subject)
subjects <- levels(sleepstudy$Subject)
plot(Reaction ~ Days, data = sleepstudy, type = 'n')
for(i in 1:length(subjects)){
lines(Reaction ~ Days, col=i,
data = subset(sleepstudy, Subject == subjects[i]))
}
plot(Reaction ~ Days, data = sleepstudy)
mean_day <- with(sleepstudy, tapply(Reaction, Days, mean))
sd_day <- with(sleepstudy, tapply(Reaction, Days, sd))
lines(0:9, mean_day, lwd=2)
\(y \sim x + (x|z)\)
The bar means "within" or "given". This models the mean of y as a function of x, but the intercept and the effect of x are allowed to vary between levels of z.
Usually, z is subject ID or something similar.
The variance within each day is partially explained by the subject - some people might not be affected as much by sleep deprivation.
sleepy <- lmer(Reaction ~ Days + (Days | Subject),
data = sleepstudy)
summary(sleepy)
RMarkdown is a quick way to make reproducible research.
To make one, go to File -> New File -> R Markdown...
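Inside the .Rmd file, R code goes in chunks that look something like this (a minimal sketch):
```{r}
summary(mtcars$mpg)
```
Knitting the file runs the chunks and weaves the output into the document.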
fun_name <- function(x){
value <- something
return(value)
}
function is a reserved word in R, and return() is a built-in function - don't use either as a name for your own objects.
mean_na <- function(x){
newmean <- mean(x, na.rm=TRUE)
return(newmean)
}
x <- c(0,2,3,1,NA)
mean(x)
## [1] NA
mean_na(x)
## [1] 1.5
A function returns the value of the last expression it evaluates.
It’s like typing something into the console.
mean_na <- function(x){
mean(x, na.rm=TRUE)
}
If a function is only one line, you don’t need the curly brackets.
mean_na <- function(x) mean(x, na.rm=TRUE)
is_it_1 <- function(a){
if(a == 1){
return("Yes")
} else {
return("No")
}
}
is_it_1(1)
## [1] "Yes"
is_it_1("a")
## [1] "No"
plot_mpg <- function(xvar){
plot(mpg~xvar, main="Fuel Efficiency", data=mtcars,
ylab="Miles per Gallon", pch=16, col=am)
}
plot_mpg(disp)
plot_mpg <- function(xvar, ...){
plot(mpg~xvar, main="Fuel Efficiency", data=mtcars,
ylab="Miles per Gallon", pch=16, col=am, ...)
}
plot_mpg(disp, xlab="Displacement")
lm2 <- function(x, y, plot_it = FALSE, ...){
newlm <- lm(y~x)
if(plot_it){
plot(y~x, ...)
abline(newlm)
}
summary(newlm)
}
Note the if statement: the plot is only drawn when plot_it is TRUE.
par(mfrow=c(1,2))
lm_disp <- lm2(disp, mpg)
lm_wt <- lm2(wt, mpg, plot_it = TRUE, xlab = "Weight")
lm_hp <- lm2(hp, mpg, plot_it = TRUE, xlab = "Horsepower")
Linear model: mean of \(y\) is a linear function of \(x\).
GLM: a function \(g()\) of the mean of \(y\) is a linear function of \(x\)
Choosing \(g()\) lets us do a lot of different things.
Examples: Poisson regression, Logistic regression, Multinomial regression
Logistic regression uses continuous or categorical variables to predict a binary outcome.
Recall that am is binary.
It doesn’t make sense, but we’ll do a regression anyway.
The only thing that changes is the family argument.
am_model <- glm(am ~ disp, data=mtcars, family=binomial)
summary(am_model)
# For model comparison
AIC(am_model)
Before we can plot, we need estimates at each \(x\) value.
range(disp)
## [1] 71.1 472.0
newx <- seq(50,500, by = 2)
newy <- predict(am_model, newdata = list(disp=newx),
type='response')
plot(disp, am)
lines(newx, newy)
| Family | Default Link Function \(g()\) |
|---|---|
| binomial | (link = logit) |
| gaussian | (link = identity) |
| Gamma | (link = inverse) |
| inverse.gaussian | (link = 1/mu^2) |
| poisson | (link = log) |
| quasi | (link = identity, variance = constant) |
| quasibinomial | (link = logit) |
| quasipoisson | (link = log) |
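If you want something other than the default, the link can be given explicitly (a sketch reusing the transmission model from above):
am_probit <- glm(am ~ disp, data=mtcars, family=binomial(link="probit"))
summary(am_probit)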