Intermediate R Workshop

Introduction
Script Files
Vectors
dplyr functions
The Pipe Operator
Plotting with ggplot2
Modelling

Introduction

Thank you for attending the intermediate R wokshops! This is where I am putting all of the code that we went over. I tried to add comments to explain everything, but feel free to email me if you want more information.

THis page only shows the code. You can copy and paste into R to see the output. As a warning, you’ll probably get a lot of “object not found” errors because you haven’t created an object yet.

Script Files

Most work in R is done in script files. This is a way to save your code, but it doesn’t save your output. Make sure your script file works on it’s own; you should be able to open it, run it line by line, and get the same output everytime. It should also be well commented in case you forgot why you wrote certain code, and you don’t need to save every line of code that you wrote (e.g. you don’t need to save ?mean).

Vectors

# Vector Operations

# Create a dummy vector with c()
x <- c(1,2,4,2,3,2,1)
x*2
x^2

y <- 2*x - 3
y
y ^ 4
y * x

# Logical values
TRUE
FALSE
TRUE == FALSE
FALSE == FALSE
FALSE < TRUE
FALSE == 0
TRUE == 1
FALSE * TRUE
TRUE ^ FALSE
FALSE ^ TRUE
FALSE ^ FALSE

# Logical vectors
x
x > 2
x >= 2
x == 2
x > y

# Subsetting
# Square brackets are for subsets
x[1] # first element of x
x[3]

x[x > 2] # logical subset
x[x >= 2]
y[x >= 2]

# Functions ----
mean(x)
mean(y)
mean(2*x - 3)
#?mean
# mean has arguments: x, trim, and na.rm
# just x, but trim = 0 and na.rm = FALSE
#   These are called defaults

# NA is a special value
NA
NA ^ FALSE
z <- c(1,2,4,12,NA)
z
mean(z, na.rm = TRUE)
# explicit arguements
mean(x = z, trim = 0, na.rm = TRUE)
# implied arguements
mean(z, 0, TRUE)

# aside: dealing with NA
is.na(z)
!is.na(z)
z[!is.na(z)]
z[c(1,2,3,4)]

z[is.na(z)] <- 0
z

# Data Frames ---- 
data(mtcars) # built-in dataset
#?mtcars

# Looking at a dataframe
names(mtcars) # column names
head(mtcars) # first 6 observations
str(mtcars) # structure (works for anything)

# Getting columns
mtcars$mpg # outputs a vector, not a data frame

# Dataframe subsets
mtcars$am
mtcars$mpg[mtcars$am == 1]
mtcars$mpg[mtcars$am == 0]

# before comma: rows, after: columns
mtcars[mtcars$am == 0, ]

dplyr functions

The dplyr package is one of the most useful things to happen to R in a long time. This is hands down the best way to deal with dataframes (which is how most data is imported into R).

This is just a quick introduction to the functions, we’ll see just how powerful this can be in the next section (the pipe operator).

# dplyr: dealing with dataframes ----
#install.packages("dplyr")
library(dplyr)

# dataframe subsets
filter(mtcars, am == 0)
filter(mtcars, cyl == 4)
filter(mtcars, am == 0, cyl != 4)

# sorting
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))

# Getting/removing columns from a dataframe
select(mtcars, mpg) # mtcars$mpg
select(mtcars, -mpg) # remove a column
select(mtcars, starts_with("d"))

# Modify/add columns to a dataframe
mutate(mtcars, mpg2 = mpg^2) # new column
mutate(mtcars, vs = factor(vs))
mtcars <- mutate(mtcars, vs = factor(vs))
str(mtcars)

The Pipe Operator

The package magrittr is named after René Magritte, who painted The Treachery of Images (pictured above).

The pipe is %>%. In the expression mtcars$mpg %>% mean(), the value on the left hand side is put as the first argument of the function on the right hand side (recall that R looks at each argument in order unless they’re labelled). The expression mtcars$mpg %>% mean() is the same as mean(mtcars$mpg). However, as we will see, we can use multiple pipes in a row and do some powerful transformations of our data.

# The Pipe operator ----
mean(z, na.rm = TRUE)
# The new way
z %>% mean(na.rm = TRUE)
z %>% mean()
z %>% mean

# A pipline
mtcars %>%
    filter(am == 0) %>% # Just automatic cars
    select(-am) %>% # Get rid of the am column
    mutate(cyl = factor(cyl)) %>% # change cyl to a factor
    str() # structure of the dataframe



# Don't do this:
mutate(select(filter(mtcars, am == 0), -am), cyl = factor(cyl))

# a few new functions
group_by(mtcars, am)
summarise(mtcars, mean.mpg = mean(mpg), sd.mpg = sd(mpg))

# but why? Because:
mtcars %>%
    group_by(am) %>%
    summarise(mean.mpg = mean(mpg), sd.mpg = sd(mpg))

# Using n()
mtcars %>%
    group_by(am, cyl) %>%
    summarise(mean.mpg = mean(mpg), 
              sdmpg = sd(mpg),
              count = n())
# This is by far the best way to get summaries 
# of your data

Plotting with ggplot2

There are a lot of ways to make plots in R. The basic graphics function plot() is very flexible and works for a lot of different things, but it’s usually pretty ugly and has a lot of quirks. It also gets very difficult to do more advanced things.

ggplot2 offers a tradeoff. It doesn’t work for everything, but it gives some very simple functions for creating information dense plots!

Base R graphics work a lot like qplot(). ggplot() adds a lot of options, and I believe that if you can understand this then base graphics will be easier to learn. I give a few examples of qplot, then quickly change to ggplot.

There’s a certain strategy to making ggplot graphs. You need to start with your dataset, know what you want to plot, then slowly add the fancy features. For instance, if I wanted to create a scatterplot of mpg versus disp with colours based on cyl and shapes based on am, I would start with:

ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
    geom_point()

From there, I would add the colours and shapes, then maybe change sizes of points, and add labeles

ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg,
                     colour = am)) +
    geom_point()

When I do this, I don’t get what I expect. This is because am isn’t a factor. Building this plot up slowly allows me to see exactly which part gave me an error.

# Plotting Data ----
#install.packges("ggplot2")
library(ggplot2)

# quick plots
qplot(x = mpg, data = mtcars, geom = "histogram")
qplot(x = mpg, data = mtcars, geom = "density")
qplot(x = disp, y = mpg, data = mtcars,
      geom = c("point", "smooth"))
qplot(x = factor(am), y = mpg, data = mtcars,
      geom = "boxplot")

# More options for plots
# The grammar of graphics
ggplot(data = mtcars, 
       mapping = aes(x = disp, y = mpg)) +
    geom_point()

ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point(colour = "red", size = 3) +
    geom_smooth(colour = "hotpink", size = 0.5, 
                se = FALSE) +
    theme_classic() +
    labs(x = "Displacement", y = "Miles/Gallon",
         title = "Fuel Efficiency by Size",
         subtitle = "Big engines burn more gas")

mtcars %>% 
    mutate(am = factor(am)) %>%
    filter(cyl > 4) %>%
    ggplot(aes(x = am, y = mpg)) +
        geom_boxplot() + theme_bw()

ggplot(mtcars, aes(y = mpg, x = disp, 
                   colour = factor(am),
                   shape = factor(cyl))) +
    geom_point() + theme_bw() +
    labs(x = "Displacement", y = "MPG",
         title = "Bivariate Plot",
         colour = "Transmission",
         shape = "Cylinders") +
    scale_colour_manual(
        values = c("blue", "black"))

ggplot(mtcars, aes(y = mpg, x = disp,
                   colour = factor(am))) +
    geom_point() + geom_smooth(se = FALSE) +
    theme_bw()

# Misc plots
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("Auto", "Man")

ggplot(mtcars, aes(x = factor(cyl),
                   fill = am)) +
    geom_bar(position = "fill") +
    theme(legend.position = "bottom")
# position = "stack"
# position = "dodge"

ggplot(mtcars, aes(x = mpg,
                   fill = factor(cyl))) +
    geom_density(alpha = 0.2) # translucency

# Facets 
ggplot(mtcars, aes(x = mpg)) +
    geom_histogram() +
    facet_wrap(~cyl, ncol = 1)

ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(fill = "grey", 
                   alpha = 0.2,
                   colour = "black") +
    facet_wrap(~cyl + am) + theme_bw()

ggplot(mtcars, aes(x = am, y = mpg,
                   fill = factor(cyl))) +
    geom_boxplot()


# Pie charts
# If you want to make a pie chart, DON'T.
# Every single pie chart is misleading.
# Humans judge distance, not area.
# Just make a bar chart. Pie charts are always inferior to bar charts.

Modelling

In general, R uses a consistent modelling framework. The model formula is always something like y ~ x1 + x2. This sort of notation is used in linear models, anova models, t-tests, logistic regression, Poisson regression, non-linear least squares, and random effects models.

In the workshops, I talked a little bit about modelling strategies. I’m not going to describe them here, but understand that there are some quirks, especially with categorical (factor) variables.

# Modelling ----
# ALWAYS PLOT YOUR DATA FIRST
ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() + 
    geom_smooth(method = "lm")

# fit a simple linear regression
mpg1 <- lm(mpg ~ disp, data = mtcars)
summary(mpg1) # summary gives you more information

# base R plotting functions
# without the "which" argument, it tries to plot
# four plots at once.
plot(mpg1, which = 1)
plot(mpg1, which = 2)
plot(mpg1, which = 3)
plot(mpg1, which = 5)

# Should we add am to the model?
ggplot(mtcars, aes(x = disp, y = mpg, 
                   colour = factor(am))) + 
    geom_point() + 
    geom_smooth(method = "lm")

# The plot looked good, so let's try it
mpg2 <- lm(mpg ~ disp + am, data = mtcars)
summary(mpg2) # not significant???
# The way we added am to the model, it only changes the intercept
# to change the slope as well (as in the ggplot above),
# we need to make it an interaction

# The colon multiplies two variables
mpg3 <- lm(mpg ~ disp + am + disp:am,
           data = mtcars)
# The star multiplies the variables and keeps the individual terms
# am*disp is the same as am+disp+am:disp
mpg3 <- lm(mpg ~ disp * am, data = mtcars)
summary(mpg3)

# Try adding the car's weight
mpg4 <- lm(mpg ~ disp*am + wt, data = mtcars)
summary(mpg4) # now disp isn't significant?

ggplot(mtcars, aes(x = wt, y = disp)) +
    geom_point()
# the variables are correlated, so we probably only need one in the model

mpg5 <- lm(mpg ~ wt*am, data = mtcars)
summary(mpg5)

# t tests
t.test(mpg ~ am, data = mtcars)
mtcars %>% group_by(am) %>%
    summarise(mean.mpg = mean(mpg),
              sd.mpg = sd(mpg))

#?t.test
t.test(formula = mpg ~ am, data = mtcars,
       alternative = "less", mu = 5,
       var.equal = TRUE)

# Anova 
anova(aov(mpg ~ factor(cyl), data = mtcars))

anova(aov(mpg ~ factor(cyl) + factor(am),
          data = mtcars))
?manova

# GLM 
myglm <- glm(factor(am) ~ mpg + cyl,
             data = mtcars, 
             family = binomial)
summary(myglm)

myglm2 <- glm(factor(am) ~ mpg, 
              data = mtcars,
              family = binomial)
summary(myglm2)

# Since GLMs aren't lines, we need more code to plot them
# We set up all of the x values, then calculate y at each x
mpgseq <- seq(10.4, 33.9, by = 0.1)
# the predict function calculates y
amseq <- predict(myglm2, 
                 newdata = list(mpg = mpgseq),
                 type = "response")
qplot(x = mpgseq, y = amseq, geom = "line")