1 Introduction

Welcome to the intermediate R workshop tutorial! The table of contents above should let you know what you’re in for, but I’ll give a brief description here. This file is meant to be a supplement to the workshop. The code appears in its final form here, but we will build it up slowly in the workshop so that it doesn’t seem so daunting. Think of this as the solutions file to some textbook problems.

In Section 2, we learn an important lesson about datasets. The datasaurus dozen dataset comes in a form that you can probably expect in your own research. It has x and y measurements for a dozen or so different categories. We’ll go through some data manipulation techniques and explain the logic behind each one. We’ll also get a nice overview of the apply family of functions. We might also talk about some plotting techniques…

Section 3 will focus on models in R. We can’t cover every model that R has to offer, but most of the built-in models have more-or-less standard input and output formats. We’ll learn to write general models and extract the relevant information. We demonstrate our knowledge with a simulation study where we investigate the proper use of p-values.

Section 4 is our chance to do an entire analysis from start to finish. We will import a dataset, explore some graphical techniques, fit some models, and produce a final report. For simple reports, we can use RMarkdown to mix words and code. This makes sharing code easy and it keeps all of your analysis and results in a standalone file. Section 5 works through a second analysis, using the mpg dataset.

The last section will focus on some of the more advanced topics. We’ll see some non-standard uses of the apply family, build our own functions, learn about some special syntax, and save our work in a way that makes it easy to pick up again later.

We hope you enjoy this workshop! Remember that the only way to get better at something is to keep trying.

2 Data Manipulation and Exploration

2.1 Basic exploration

For this section, we use the datasaurus dozen dataset. First, we must install the package and load it. Note that installing a package doesn’t automatically load it into the system. You only have to install it once, but you have to load it every time you want to use it.

install.packages("datasauRus") # run once per computer
library("datasauRus") # run once per R session

Now that it’s loaded, we need to tell R that we want to look at the data. Below are some very helpful functions for looking at a dataset. They’re not all useful here, but they might be useful to you later.

data("datasaurus_dozen")
head(datasaurus_dozen, 10)
tail(datasaurus_dozen, 10)
str(datasaurus_dozen) # The structure of a dataframe
names(datasaurus_dozen) # column names

# For a categorical variable
unique(datasaurus_dozen$dataset) # unique names in the dataset
table(datasaurus_dozen$dataset) # how many of each value?

# For numbers
range(datasaurus_dozen$x)
range(datasaurus_dozen$y)

We also can look at subsets of the data. The following code is here as a reminder of R’s syntax. There are many different ways to do the same task; which way you use depends on the situation.

Make sure you understand each line. Pay attention to the use of the dollar signs, commas, double equal signs, and quotation marks. If you do something wrong with any of these, you will get an error.

# Getting columns - these all do the exact same thing!
mean(datasaurus_dozen$y)
mean(datasaurus_dozen$'y')
mean(datasaurus_dozen[, 'y']) # before the comma selects rows; after it, columns
# since we want all of the rows, we leave a blank before the comma
mean(datasaurus_dozen[, 3]) # y is the third column
with(datasaurus_dozen, mean(y)) # the with function can be very helpful

# Subsetting by a categorical value in the rows.
# again, these all do the exact same thing.
mean(datasaurus_dozen$y[datasaurus_dozen$dataset == "dino"])
mean(datasaurus_dozen[datasaurus_dozen$dataset == "dino", "y"])
mean(datasaurus_dozen[datasaurus_dozen$dataset == "dino", 3])
mean(subset(datasaurus_dozen, dataset == "dino")$y)
mean(subset(datasaurus_dozen, dataset == "dino")$"y")
mean(subset(datasaurus_dozen, dataset == "dino")[, "y"])
mean(subset(datasaurus_dozen, dataset == "dino")[, 3])
with(datasaurus_dozen, mean(y[dataset == "dino"])) # my favourite
with(datasaurus_dozen[datasaurus_dozen$dataset == "dino", ], mean(y))
with(subset(datasaurus_dozen, dataset == "dino"), mean(y))

Some notes:

  • The with and subset functions allow you to drop the dollar signs. A dollar sign is fine when you’re just accessing a column, but I try to avoid it when subsetting.
  • dino is one of the values in the column labelled dataset. dino absolutely must be in quotes.
  • When using the dollar sign, quotes are optional. However, they’re necessary if the column name has special characters (like “-”) - see the sketch after this list.
  • All of the code in the second part includes dataset == "dino", but there are different ways to get y.
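
Here’s a minimal sketch of the special-character point, using a hypothetical data frame with a hyphen in one column name:

# A hypothetical data frame whose column name contains "-"
df <- data.frame("a-b" = 1:3, check.names = FALSE)
df$"a-b"    # quotes make the dollar sign work
df[, "a-b"] # bracket indexing with quotes also works
# df$a-b    # without quotes, this parses as (df$a) - b and throws an error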

2.2 Writing Functions

Suppose we want the mean and standard deviation of each dataset. We could do something like:

with(datasaurus_dozen, mean(y[dataset == "away"]))
## [1] 47.83472
with(datasaurus_dozen, mean(y[dataset == "dino"]))
## [1] 47.83225
with(datasaurus_dozen, mean(y[dataset == "star"]))
## [1] 47.83955
# and so on

but that’s a lot of typing. Instead, we can make our own function!

meanasaurus <- function(dataname){
    meanofdata <- with(datasaurus_dozen, mean(y[dataset == dataname]))
    return(meanofdata)
}
meanasaurus("away")
## [1] 47.83472
meanasaurus("dino")
## [1] 47.83225
meanasaurus("star")
## [1] 47.83955

The following are equivalent:

meanasaurus <- function(dataname){
    meanofdata <- with(datasaurus_dozen, mean(y[dataset == dataname]))
    return(meanofdata)
}

meanasaurus <- function(dataname) {
    with(datasaurus_dozen, mean(y[dataset == dataname]))
}

meanasaurus <- function(dataname) with(datasaurus_dozen, mean(y[dataset == dataname]))

NOTE: This is terrible form for a function. You should never reference a variable that was not passed in as an argument. If we hadn’t loaded the datasauRus package, the function above wouldn’t work. Usually, you want your functions to be applicable to many situations.

Again, you should only reference variables that were passed as arguments. We will talk about this later. For now, we rewrite the function so that it at least fails informatively.

meanasaurus <- function(dataname){
    # Check if the dataset exists
    if(!exists("datasaurus_dozen")) stop("datasauRus has not been loaded")
    # return the mean
    with(datasaurus_dozen, mean(y[dataset == dataname]))
}
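
As a preview of better style, here’s a sketch of a version that takes the data frame as an argument (the column names y and dataset are still hard-coded, and groupmean is just an illustrative name):

groupmean <- function(data, groupname){
    # the data now arrives as an argument instead of from the global environment
    with(data, mean(y[dataset == groupname]))
}
groupmean(datasaurus_dozen, "dino")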

2.3 Applying across subsets

In this data, we have several different datasets. Each one can be extracted using something like subset(datasetname, columnname == "value"). But what if we want the mean of each of the different datasets? Again, there are several ways to do this.

# Get a vector with the names of the datasets
columns <- unique(datasaurus_dozen$dataset)

# from before: with(datasaurus_dozen, mean(y[dataset == "dino"]))
for(i in columns) print(with(datasaurus_dozen, mean(y[dataset == i])))

# similarly,
for(i in columns) print(meanasaurus(i))

It gets the job done, but it’s not pretty, and it’s hard to tell where each mean came from. The apply family of functions will make a lot of things nicer. Here, we use sapply - the s means that it simplifies its results. In this case, it works the same as the for loop.

columns <- unique(datasaurus_dozen$dataset)
sapply(X=columns, FUN = function(column) with(datasaurus_dozen, mean(y[dataset == column])))
##       dino       away    h_lines    v_lines    x_shape       star 
##   47.83225   47.83472   47.83025   47.83699   47.83972   47.83955 
## high_lines       dots     circle   bullseye   slant_up slant_down 
##   47.83545   47.83983   47.83772   47.83082   47.83150   47.83590 
## wide_lines 
##   47.83160
sapply(X=columns, FUN = meanasaurus)
##       dino       away    h_lines    v_lines    x_shape       star 
##   47.83225   47.83472   47.83025   47.83699   47.83972   47.83955 
## high_lines       dots     circle   bullseye   slant_up slant_down 
##   47.83545   47.83983   47.83772   47.83082   47.83150   47.83590 
## wide_lines 
##   47.83160
sapply(columns, meanasaurus)
##       dino       away    h_lines    v_lines    x_shape       star 
##   47.83225   47.83472   47.83025   47.83699   47.83972   47.83955 
## high_lines       dots     circle   bullseye   slant_up slant_down 
##   47.83545   47.83983   47.83772   47.83082   47.83150   47.83590 
## wide_lines 
##   47.83160

See how it keeps all of the names? The code isn’t very nice to look at, though. In fact, this is one of the least readable ways to do this. Fortunately, there are special functions that we can use.

tapply(X = datasaurus_dozen$y, INDEX = datasaurus_dozen$dataset, FUN = mean)
##       away   bullseye     circle       dino       dots    h_lines 
##   47.83472   47.83082   47.83772   47.83225   47.83983   47.83025 
## high_lines slant_down   slant_up       star    v_lines wide_lines 
##   47.83545   47.83590   47.83150   47.83955   47.83699   47.83160 
##    x_shape 
##   47.83972
# A better way:
with(datasaurus_dozen, tapply(y, dataset, mean))
##       away   bullseye     circle       dino       dots    h_lines 
##   47.83472   47.83082   47.83772   47.83225   47.83983   47.83025 
## high_lines slant_down   slant_up       star    v_lines wide_lines 
##   47.83545   47.83590   47.83150   47.83955   47.83699   47.83160 
##    x_shape 
##   47.83972
with(datasaurus_dozen, by(y, dataset, mean))
## dataset: away
## [1] 47.83472
## -------------------------------------------------------- 
## dataset: bullseye
## [1] 47.83082
## -------------------------------------------------------- 
## dataset: circle
## [1] 47.83772
## -------------------------------------------------------- 
## dataset: dino
## [1] 47.83225
## -------------------------------------------------------- 
## dataset: dots
## [1] 47.83983
## -------------------------------------------------------- 
## dataset: h_lines
## [1] 47.83025
## -------------------------------------------------------- 
## dataset: high_lines
## [1] 47.83545
## -------------------------------------------------------- 
## dataset: slant_down
## [1] 47.8359
## -------------------------------------------------------- 
## dataset: slant_up
## [1] 47.8315
## -------------------------------------------------------- 
## dataset: star
## [1] 47.83955
## -------------------------------------------------------- 
## dataset: v_lines
## [1] 47.83699
## -------------------------------------------------------- 
## dataset: wide_lines
## [1] 47.8316
## -------------------------------------------------------- 
## dataset: x_shape
## [1] 47.83972

The by function produces a lot of output, so you normally wouldn’t use it here. However, it works well if your output is a vector. For instance,

by(datasaurus_dozen[, c('x', 'y')], 
   datasaurus_dozen$dataset, 
   function(x) apply(x, 2, mean))
## datasaurus_dozen$dataset: away
##        x        y 
## 54.26610 47.83472 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: bullseye
##        x        y 
## 54.26873 47.83082 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: circle
##        x        y 
## 54.26732 47.83772 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: dino
##        x        y 
## 54.26327 47.83225 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: dots
##        x        y 
## 54.26030 47.83983 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: h_lines
##        x        y 
## 54.26144 47.83025 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: high_lines
##        x        y 
## 54.26881 47.83545 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: slant_down
##        x        y 
## 54.26785 47.83590 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: slant_up
##        x        y 
## 54.26588 47.83150 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: star
##        x        y 
## 54.26734 47.83955 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: v_lines
##        x        y 
## 54.26993 47.83699 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: wide_lines
##        x        y 
## 54.26692 47.83160 
## -------------------------------------------------------- 
## datasaurus_dozen$dataset: x_shape
##        x        y 
## 54.26015 47.83972

If we also want the means and standard deviations of x and y, as well as the correlation between them, we can go back to the sapply function.

# allosaurus because it gives all the data. I'm clever.
allosaurus <- function(dataname){ 
    # begin by subsetting - saves us typing later
    thisdata <- datasaurus_dozen[datasaurus_dozen$dataset == dataname,]
    # means
    mx <- mean(thisdata$x)
    my <- mean(thisdata$y)
    # standard deviations
    sx <- sd(thisdata$x)
    sy <- sd(thisdata$y)
    # correlation between x and y
    corxy <- cor(thisdata$x, thisdata$y)
    
    # Note that we can name things inside of vectors.
    # This is not always useful, and sometimes it's a terrible idea.
    return(c("X Mean" = mx,"Y Mean" = my, "X SD" = sx, "Y SD" = sy, "Cor" = corxy))
}

# Testing
allosaurus("dino")
##      X Mean      Y Mean        X SD        Y SD         Cor 
## 54.26327324 47.83225282 16.76514204 26.93540349 -0.06447185
sapply(columns, allosaurus) # notice how nicely this is formatted!
##               dino        away     h_lines     v_lines     x_shape
## X Mean 54.26327324 54.26609978 54.26144178 54.26992723 54.26015033
## Y Mean 47.83225282 47.83472062 47.83025191 47.83698799 47.83971728
## X SD   16.76514204 16.76982495 16.76589790 16.76995861 16.76995770
## Y SD   26.93540349 26.93974342 26.93987622 26.93768381 26.93000169
## Cor    -0.06447185 -0.06412835 -0.06171484 -0.06944557 -0.06558334
##              star  high_lines        dots      circle    bullseye
## X Mean 54.2673411 54.26880528 54.26030345 54.26731971 54.26873002
## Y Mean 47.8395452 47.83545020 47.83982921 47.83771727 47.83082316
## X SD   16.7689592 16.76670402 16.76773549 16.76001266 16.76923949
## Y SD   26.9302747 26.93999796 26.93019152 26.93003609 26.93572669
## Cor    -0.0629611 -0.06850422 -0.06034144 -0.06834336 -0.06858639
##           slant_up  slant_down  wide_lines
## X Mean 54.26588179 54.26784882 54.26691630
## Y Mean 47.83149565 47.83589633 47.83160199
## X SD   16.76885267 16.76675895 16.76999962
## Y SD   26.93860807 26.93610493 26.93790193
## Cor    -0.06860921 -0.06897974 -0.06657523

If you have data that has columns for group, explanatory variable, and response, you can just change the names above!

Try the above analysis with the iris dataset!

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
unique(iris$Species)
## [1] setosa     versicolor virginica 
## Levels: setosa versicolor virginica
levels(iris$Species) # since it's a factor
## [1] "setosa"     "versicolor" "virginica"
mean(iris$Sepal.Length)
## [1] 5.843333
mean(iris$Sepal.Width)
## [1] 3.057333

Note that there are a lot of things that need to be changed! However, the basics are the same.
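
For example, the group-means step needs only the column names changed. A one-line sketch, using Sepal.Length as the response:

with(iris, tapply(Sepal.Length, Species, mean))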

Notice that all of the datasets have the same means and standard deviations for both x and y. Furthermore, every dataset has the same correlation between x and y. Surely that means they are the same!!!

2.4 Rule Number 1: Always always always plot your data

You probably wondered why this is called the datasaurus dozen. Here’s why:

plot(y ~ x, data = subset(datasaurus_dozen, dataset == "dino"),
     main = "The Datasaurus", xlab = "x", ylab = "y",
     pch = 19, # adds filled in points rather than empty circles
     las=1) # rotates the y-axis values

So do all of the datasets look like this?

par(mar=c(3.5,3.5,2,2)) # You probably won't need this.
columns <- unique(datasaurus_dozen$dataset)
par(mfrow=c(4,4))
for(i in columns){
    plot(y ~ x, data = subset(datasaurus_dozen, dataset == i))
}

ALL of these plots have the same means, standard deviations, and correlations! Always always always plot your data!!!

Side note: this can also be done with the apply function.

par(mfrow = c(4,4))
invisible(sapply(columns, 
                 function(column) {
                     plot(y ~ x, 
                          data = subset(datasaurus_dozen, 
                                        dataset == column))
                 }
))

3 Extracting Information From Models

A lot of what you do in R will be based on models. You may be using t-tests, linear models, logistic regression, anova models, etc. In R, these models all have similar input and output.

For this section, we’re going to start by making some simple linear models. From there, we’re going to look into the linear model output. Next, we run a simulation study to see what happens with p-values.

3.1 Fitting a linear model

The mtcars dataset was collected by Motor Trend magazine in the 1970s. It’s a very simple dataset, but there’s a lot that can be done with it. In this example, we start out by fitting a linear model for miles per gallon against weight, while also taking into account the transmission type (automatic = 0, manual = 1). First things first, we plot our data!

data(mtcars)
#?mtcars # built-in datasets have help files
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
plot(mpg ~ wt, data = mtcars, col = am + 1)

# note that am is either 0 or 1. am + 1 is 1 or 2
# R interprets 1 as black and 2 as red

There are a lot of different linear models that we could make here! Below is a brief overview of the syntax.

Model                                                   R notation

\(y = \beta_0+\beta_1x_1+\beta_2x_2\)                   y ~ x1 + x2
\(y = \beta_1x_1+\beta_2x_2\)                           y ~ x1 + x2 - 1
\(y = \beta_0+\beta_1x_1x_2\)                           y ~ x1:x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)     y ~ x1 + x2 + x1:x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)     y ~ x1 * x2
\(y = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\)     y ~ (x1 + x2)^2
\(y = \beta_0+\beta_1x_1^2\)                            y ~ I(x1^2)
\(y = \beta_0+\beta_1x_1+\beta_2x_1^2\)                 y ~ poly(x1, 2)
\(y = \beta_0+\beta_1x_1+\beta_2x_2+\cdots\)            y ~ .
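
As a quick sanity check, y ~ x1 * x2 and y ~ x1 + x2 + x1:x2 really do fit identical models. A short sketch using mtcars (m1 and m2 are just illustrative names):

m1 <- lm(mpg ~ wt * am, data = mtcars)
m2 <- lm(mpg ~ wt + am + wt:am, data = mtcars)
all.equal(coef(m1), coef(m2)) # TRUE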

We’ll just fit a simple linear model with interaction.

mtlm <- lm(mpg ~ wt*am, data = mtcars)
summary(mtlm)
## 
## Call:
## lm(formula = mpg ~ wt * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.4161     3.0201  10.402 4.00e-11 ***
## wt           -3.7859     0.7856  -4.819 4.55e-05 ***
## am           14.8784     4.2640   3.489  0.00162 ** 
## wt:am        -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

Everything is significant and it’s all fine and dandy. However, I’m more interested in what mtlm actually looks like.

mtlm
## 
## Call:
## lm(formula = mpg ~ wt * am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt           am        wt:am  
##      31.416       -3.786       14.878       -5.298
str(mtlm) # that's a lot of things!
## List of 12
##  $ coefficients : Named num [1:4] 31.42 -3.79 14.88 -5.3
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##  $ residuals    : Named num [1:32] -1.494 0.823 -2.419 2.156 0.307 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ effects      : Named num [1:32] -113.6497 -29.1157 0.0473 9.5033 0.1274 ...
##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "am" "wt:am" ...
##  $ rank         : int 4
##  $ fitted.values: Named num [1:32] 22.5 20.2 25.2 19.2 18.4 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ assign       : int [1:4] 0 1 2 3
##  $ qr           :List of 5
##   ..$ qr   : num [1:32, 1:4] -5.657 0.177 0.177 0.177 0.177 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##   .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
##   ..$ qraux: num [1:4] 1.18 1.05 1.08 1.07
##   ..$ pivot: int [1:4] 1 2 3 4
##   ..$ tol  : num 1e-07
##   ..$ rank : int 4
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 28
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = mpg ~ wt * am, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ wt * am
##   .. ..- attr(*, "variables")= language list(mpg, wt, am)
##   .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
##   .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
##   .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
##   .. ..- attr(*, "order")= int [1:3] 1 1 2
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(mpg, wt, am)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
##  $ model        :'data.frame':   32 obs. of  3 variables:
##   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt * am
##   .. .. ..- attr(*, "variables")= language list(mpg, wt, am)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
##   .. .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
##   .. .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
##   .. .. ..- attr(*, "order")= int [1:3] 1 1 2
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(mpg, wt, am)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
##  - attr(*, "class")= chr "lm"

There’s a lot of information there, and it’s not easy to get at. Most of the information is about the model fitting (i.e., the QR decomposition). Instead, let’s look at the summary again.

# R is an object-oriented programming language
# Everything is an object
str(summary(mtlm))
## List of 11
##  $ call         : language lm(formula = mpg ~ wt * am, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ wt * am
##   .. ..- attr(*, "variables")= language list(mpg, wt, am)
##   .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
##   .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
##   .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
##   .. ..- attr(*, "order")= int [1:3] 1 1 2
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(mpg, wt, am)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
##  $ residuals    : Named num [1:32] -1.494 0.823 -2.419 2.156 0.307 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ coefficients : num [1:4, 1:4] 31.42 -3.79 14.88 -5.3 3.02 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:4] FALSE FALSE FALSE FALSE
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##  $ sigma        : num 2.59
##  $ df           : int [1:3] 4 28 4
##  $ r.squared    : num 0.833
##  $ adj.r.squared: num 0.815
##  $ fstatistic   : Named num [1:3] 46.6 3 28
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:4, 1:4] 1.358 -0.346 -1.358 0.346 -0.346 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##   .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
##  - attr(*, "class")= chr "summary.lm"

Again, we see a lot of things. However, these are more useful to us. For instance,

mtsum <- summary(mtlm)
mtsum$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 31.416055  3.0201093 10.402291 4.001043e-11
## wt          -3.785908  0.7856478 -4.818836 4.551182e-05
## am          14.878423  4.2640422  3.489277 1.621034e-03
## wt:am       -5.298360  1.4446993 -3.667449 1.017148e-03
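
Aside: R also provides extractor functions for the common pieces of a model, which are often safer than digging into the list yourself. A short sketch (these are all standard generics):

coef(mtlm)         # the coefficients
confint(mtlm)      # confidence intervals for the coefficients
head(resid(mtlm))  # the residuals
head(fitted(mtlm)) # the fitted values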

3.2 Fitting a lot of linear models

Recall the definition of a significance level: the probability of rejecting the null hypothesis given that the null hypothesis is true.

Everyone always wants to look at a p-value. In this section, we’ll look at one of the major shortcomings of p-values. Specifically, we’re going to look at what happens when you make a linear model then remove terms based on their p-values.

We’re going to do a simulation study. We’re going to generate random uncorrelated data, then we’ll test to see if it’s correlated. This seems like a strange thing to do, but recall that a significance level of 0.05 means that there’s a 5% chance that we’ll reject the null hypothesis when the null hypothesis is true.

In this study, our null hypothesis is that there is no correlation between x and y. This null hypothesis is true because that’s how we’re making it.

x <- rnorm(40)
y <- rnorm(40)

plot(y~x)

xylm <- lm(y~x)
summary(xylm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.37982 -0.60947 -0.01794  0.76261  2.39463 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.1712     0.1609  -1.064    0.294
## x            -0.1176     0.1433  -0.821    0.417
## 
## Residual standard error: 1.013 on 38 degrees of freedom
## Multiple R-squared:  0.01742,    Adjusted R-squared:  -0.008438 
## F-statistic: 0.6737 on 1 and 38 DF,  p-value: 0.4169
abline(xylm)

Clearly, we don’t have any relationship between x and y. To quantify this, we find the p-value for the slope.

summary(xylm)$coef
##               Estimate Std. Error    t value  Pr(>|t|)
## (Intercept) -0.1711767  0.1608776 -1.0640182 0.2940353
## x           -0.1176460  0.1433363 -0.8207691 0.4168972
# the p-value is the 2nd row, 4th column
summary(xylm)$coef[2,4]
## [1] 0.4168972

Let’s do this 10000 times.

pvalues <- c() # empty vector to start

for(i in 1:10000){ # do this 10000 times
    x <- rnorm(40)
    y <- rnorm(40)
    xylm <- lm(y~x)
    sig <- summary(xylm)$coef[2,4]
    pvalues[i] <- sig
}

hist(pvalues, 
     breaks = 20, # 20 breaks means that each box is 5%
     col=c(2, rep(0,19))) # make the first box red

The distribution looks like a box. This is good - the p-values should follow a uniform distribution. This means that significant p-values (the red box) happen 5% of the time.
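
We can also check this numerically. A one-line sketch, assuming the pvalues vector from above (the exact value varies from run to run):

mean(pvalues < 0.05) # should be close to 0.05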

Now, watch what happens when we only take the most significant variable.

x1 <- rnorm(40)
x2 <- rnorm(40)
y <- rnorm(40)

xxylm <- lm(y ~ x1 + x2)
summary(xxylm)$coef
##                Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -0.21636188  0.1685676 -1.2835322 0.20728641
## x1          -0.04423465  0.1733378 -0.2551933 0.79998708
## x2          -0.29231321  0.1704306 -1.7151455 0.09468355
# the p-values are the 2nd and 3rd row, 4th column
pvals <- summary(xxylm)$coef[c(2,3), 4]

# which is the smallest p-value?
minx <- which.min(pvals)

# make a new dataframe so we can remove one of the x's
xxdf <- data.frame(x1, x2)
# fit a new linear model with only the most significant x
xylm <- lm(y ~ xxdf[,minx])

# what has happened?
summary(xylm)$coef
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)  -0.2146407  0.1663478 -1.290313 0.20473859
## xxdf[, minx] -0.2898493  0.1680507 -1.724773 0.09269475
# The p-value has changed slightly, but not much
# is this really such a bad thing?

This is something that happens a lot - you fit a linear model and then remove the covariates that aren’t significant. This is not a good strategy, though - your p-values aren’t valid anymore!

pvalues2 <- c() # empty vector

t0 <- Sys.time() # Just to see how long this takes
for(i in 1:10000){
    x1 <- rnorm(40)
    x2 <- rnorm(40)
    y <- rnorm(40)
    
    xxylm <- lm(y ~ x1 + x2)
    summary(xxylm)$coef
    
    pvals <- summary(xxylm)$coef[c(2,3), 4]
    
    minx <- which.min(pvals)
    
    xxdf <- data.frame(x1, x2)
    xylm <- lm(y ~ xxdf[,minx])
    
    pvalues2[i] <- summary(xylm)$coef[2,4]
}
Sys.time() - t0 # it took this long:
## Time difference of 18.70499 secs
hist(pvalues2, breaks = 20, col = c(2, rep(0,19)))

With this method, around 10% of your p-values are significant - double the claimed value of 5%! Be very careful when fitting linear models - it’s easy to accidentally ruin your analysis.
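
Again, a one-line check of the false-positive rate (run-dependent, but it should land near 0.10 rather than 0.05):

mean(pvalues2 < 0.05)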

3.3 Other Models

The same tricks can be used with other types of models. Below are some common model forms for you to play around with. Note that summary isn’t as useful for other models.

mtttest <- t.test(mpg~am, data=mtcars)
names(mtttest)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
mtttest$p.value
## [1] 0.001373638
# Using anova to do an F-test for linear models
mtlm1 <- lm(mpg ~ wt, data = mtcars)
mtlm2 <- lm(mpg ~ wt + am, data = mtcars)
mtanova <- anova(mtlm1, mtlm2)
names(mtanova)
## [1] "Res.Df"    "RSS"       "Df"        "Sum of Sq" "F"         "Pr(>F)"
mtanova$"Pr(>F)"
## [1]        NA 0.9879146
# logistic model for transmission against mpg and weight
mtlogistic <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
mtlogsum <- summary(mtlogistic)
names(mtlogsum)
##  [1] "call"           "terms"          "family"         "deviance"      
##  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
##  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
## [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
## [17] "cov.scaled"
mtlogsum$coefficients
##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) 25.8865540  12.193529  2.122975 0.03375597
## mpg         -0.3241639   0.239499 -1.353509 0.17589322
## wt          -6.4161721   2.546606 -2.519500 0.01175217

These will be useful in the next section.

4 A Full Analysis

Data from: http://r-blog.salvaggio.net/?p=521

Go to the bottom of the page and find the link to the data. From this, we’ll make an RMarkdown document. To create this document in RStudio, click on the icon at the top left (a paper with a green plus) and select RMarkdown.

For reference, here are some of the plots we made. Notice that the R code shows up above the plot, so your collaborators can see exactly what you did. Much of this was not covered, but is included so you can see some extra code. Hopefully you will find something useful here.

4.1 Some Data Manipulation

This is a really short bit of code, but it took me a long time to figure out what code to write. This sort of thing happens a lot with R.

ram <- read.csv("Ramones.csv", stringsAsFactors = FALSE)

# Song lengths are stored like "120S", meaning 120 seconds
# I use gsub to remove the S, which still leaves the numbers as strings
# as.numeric changes the strings to numbers
ram$lengthS <- as.numeric(gsub("S", "", ram$lengthS))
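
To see what that line does, here is the same gsub-then-as.numeric chain on a toy vector of made-up values:

as.numeric(gsub("S", "", c("120S", "95S", "210S")))
## [1] 120  95 210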

4.2 Checking Distributional Assumptions

# Chords as a percent of total chords
# This is categorical, so I did not use a histogram
chords_dens <- table(ram$chords)/length(ram$chords)

plot(chords_dens, main = "Checking Distributional Assumptions",
     ylab = "Density", xlab="Number of Chords")

mch <- mean(ram$chords, na.rm=TRUE)
sch <- sd(ram$chords, na.rm=TRUE)

lines(2:12, dpois(2:12, lambda=mch), col="blue", lwd=2, type='b')

xn <- seq(2,12, 0.1)
# I used 5 in the normal curve because that's the mode
# If it were truly normal, the mean and mode would be equal
yn <- dnorm(xn, mean = 5, sd = sch)
lines(xn,yn, col = "red", lwd=2)

legend('topright', legend = c("True", "Poisson(5.3)", "Normal(5, 2.25)"), 
       lty=1, lwd=2, col=c(1,4,2))

hist(ram$lengthS, main="Song Lengths", xlab="Length (Seconds)", freq=FALSE)
xn <- seq(70, 300, 1)
yn <- dnorm(xn, mean = mean(ram$lengthS), sd = sd(ram$lengthS))
lines(yn ~ xn)

# The data is clearly right-skewed
library(ggplot2)
# ggplot2 version - I made the bars look a little bit better than the default
ggplot(ram, aes(lengthS)) + geom_histogram(col='black', fill='grey') + 
    xlab("Song Length") + ylab("Frequency") +
    ggtitle("Histogram of Song Lengths")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4.3 A Violin Plot to Compare Distributions

library(ggplot2)
chords <- ram$chords
key <- factor(ram$key)
ggplot(mapping=aes(y=chords, x=factor(key), colour=factor(key)), data=ram) + 
    geom_violin() + geom_jitter(position = position_jitter(0.01)) + 
    xlab("Key") + ylab("Number of Chords") + 
    ggtitle("Number of Chords in Each Key")
## Warning: Removed 12 rows containing non-finite values (stat_ydensity).
## Warning: Removed 12 rows containing missing values (geom_point).

4.4 Boxplots Over Time

par(mfrow=c(2,1))
boxplot(chords~album, data=ram)
boxplot(lengthS~album, data=ram)

4.5 Album Lengths Over Time

years <- unique(ram$year)
years <- sort(c(years, 1977)) # Two albums in one year
# the data was already sorted, so adding a number and sorting is fine

albumLen <- sapply(unique(ram$album), 
                   function(x) 
                       sum(ram$lengthS[ram$album == x])
                   )/60

plot(years, albumLen, main="Album Length by Year",
     xlab="Year", ylab="Album Length (Minutes)")
abline(lm(albumLen ~ years))

4.6 More Data Manipulation; Fitting a Linear Model

years <- unique(ram$year)
years <- sort(c(years, 1977)) # Two albums in one year
# the data was already sorted, so adding a number and sorting is fine

albumLen <- sapply(
    # Find all of the album names
    unique(ram$album),  
    function(x) {
        # for each album name, add up all of the song lengths
        sum(ram$lengthS[ram$album == x])
    })
albumLen <- albumLen/60 # Convert to minutes

posUS <- unique(ram$albumPeakPositionUS)
posUS <- c(posUS, 148) # two albums had the same peak position
# fortunately, the repeated album was the last year, so we don't
# have to sort too much

ggplot(mapping=aes(x=years, y=albumLen)) + geom_point() + 
    geom_smooth(method="lm") + theme_light() +
    geom_hline(yintercept = mean(albumLen), colour='red') + 
    xlab("Release Year") + ylab("Album Lengths (Minutes)") +
    ggtitle("Album Length Over Time (the slope is NOT significant)") + 
    scale_x_continuous(breaks=seq(1977, 1995, 2))

summary(lm(albumLen ~ years))
## 
## Call:
## lm(formula = albumLen ~ years)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3859 -1.4747  0.0619  1.7214  3.2451 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -325.3208   201.3051  -1.616    0.132
## years          0.1805     0.1015   1.779    0.101
## 
## Residual standard error: 2.286 on 12 degrees of freedom
## Multiple R-squared:  0.2087, Adjusted R-squared:  0.1427 
## F-statistic: 3.164 on 1 and 12 DF,  p-value: 0.1006

4.7 Non-Parametric Smoothing - Is the Linear Model Appropriate?

ggplot(mapping=aes(x=years, y=albumLen)) + geom_point() + geom_smooth() +
    xlab("Release Year") + ylab("Album Lengths (Minutes)") +
    ggtitle("Album Length Over Time - Nonparametric Smoother") +
    theme_bw()
## `geom_smooth()` using method = 'loess'

4.8 Another Linear Model - Album Position Over Time

ggplot(mapping=aes(x=years, y=posUS)) + geom_point() + 
    geom_smooth(method="lm") + theme_light() + 
    xlab("Release Year") + ylab("Peak Billboard Position (US)") +
    ggtitle("The Ramones got less popular over time (slope IS significant)") + 
    scale_x_continuous(breaks=seq(1977, 1995, 2))

4.9 Poisson Regression

rampois <- glm(chords ~ lengthS, data=ram, family=poisson)
summary(rampois)
## 
## Call:
## glm(formula = chords ~ lengthS, family = poisson, data = ram)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1257  -0.6170  -0.1608   0.2689   2.1246  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.674e+00  1.365e-01  12.265   <2e-16 ***
## lengthS     4.604e-05  8.485e-04   0.054    0.957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 65.713  on 163  degrees of freedom
## Residual deviance: 65.710  on 162  degrees of freedom
##   (12 observations deleted due to missingness)
## AIC: 646.2
## 
## Number of Fisher Scoring iterations: 4
newdata <- data.frame(lengthS = 1:400)
rampred <- predict(rampois, newdata = newdata, type = "response")

plot(chords ~ lengthS, data= ram)
lines(rampred ~ newdata[,1], col=4, lwd=2)

# Just for fun
abline(lm(chords~lengthS, data=ram), col=2)
legend('topright', legend=c("Linear Model", "GLM"), col=c(2,4), lwd=2, lty=1)

4.10 Logistic Regression

ramlog <- glm(factor(side) ~ lengthS, data=ram, family = binomial)
summary(ramlog)
## 
## Call:
## glm(formula = factor(side) ~ lengthS, family = binomial, data = ram)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.22578  -1.17517  -0.01097   1.17932   1.20595  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16259    0.62350  -0.261    0.794
## lengthS      0.00104    0.00387   0.269    0.788
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 243.99  on 175  degrees of freedom
## Residual deviance: 243.92  on 174  degrees of freedom
## AIC: 247.92
## 
## Number of Fisher Scoring iterations: 3
repred2 <- predict(ramlog, newdata, type="response")

plot(as.numeric(factor(side))-1 ~ lengthS, data=ram)
lines(repred2 ~ newdata[,1])

4.11 Manipulating Character Strings

This next part gets a little tricky. Working with character strings in R is not the easiest thing to do. The following code has to split the writer names into all of the writers of a track, then we find out who wrote which track.
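
To see why strsplit output is awkward, here is what it returns for a toy two-song vector (the strings are made up):

strsplit(c("Joey Ramone, Dee Dee Ramone", "Johnny Ramone"), split = ", ")
## [[1]]
## [1] "Joey Ramone"    "Dee Dee Ramone"
## 
## [[2]]
## [1] "Johnny Ramone"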

# Writers are separated by commas, so split by commas
writerList <- strsplit(ram$writers, split = ", ")
# strsplit returns a list that is difficult to work with
writerNames <- unique(unlist(writerList))
# Only look at writers who have the word "Ramone" in their names
Ramones <- writerNames[grep("Ramone", writerNames)]

# Look at who wrote the most songs
table(unlist(writerList)[unlist(writerList) %in% Ramones])
## 
##    C.J. Ramone Dee Dee Ramone    Joey Ramone  Johnny Ramone   Marky Ramone 
##              2             67             43             18              5 
##        Ramones  Richie Ramone   Tommy Ramone 
##             36              4              3
Ramones <- Ramones[c(2,3,4,5)] # Only the most prolific writers

The writerList object is a list with one element per song; each element has as many entries as there were writers of that song. The inList vector will be an index with TRUE for each row that includes that writer. I create a dataframe that includes information on every song that writer wrote.

# Initiate an empty data frame
writerdf <- data.frame(Album=NA, Year=NA, Side=NA, LengthS=NA, 
                       NumWriters=NA, Writer=NA, AlbumPos=NA)
for(i in Ramones){
    inList <- sapply(writerList, function(x) i %in% x)
    writerdf <- rbind(writerdf, 
                      data.frame(Album=ram$album[inList], 
                                 Year=ram$year[inList], 
                                 Side=ram$side[inList], 
                                 LengthS=ram$lengthS[inList], 
                                 NumWriters=ram$nbreWriters[inList], 
                                 Writer=rep(i, sum(inList)),
                                 AlbumPos=ram$albumPeakPositionUS[inList]))
}
# the first row is all NAs
writerdf <- writerdf[-1,]
# strsplit returns a list with one element, e.g. "Joey Ramone"
# "Joey Ramone" will become "Joey", "Ramone", then I just take "Joey"
writerdf$shortName <- sapply(writerdf$Writer, function(x) strsplit(x, " ")[[1]][1])
# Change "Dee" to "Dee Dee" BEFORE making it a factor. Once it's a factor,
# it's much harder to deal with
writerdf$shortName[writerdf$shortName == "Dee"] <- "Dee Dee"
# Make the names into factors (R treats factors differently, that's what we want)
writerdf$shortName <- factor(writerdf$shortName)
par(mfrow=c(2,2))
stripchart(Year~shortName, data=writerdf, method='jitter', pch=19, las=1,
        xlab="Year", main="Years Active", 
        col=rgb(0,0,0,0.2))
stripchart(I(LengthS/60)~shortName, data=writerdf, method='jitter', pch=19, las=1,
        xlab="Length of Song (Minutes)", main="Who writes longer songs?", 
        col=rgb(0,0,0,0.2))
stripchart(NumWriters~shortName, data=writerdf, method='jitter', pch=19, las=1,
        xlab="Number of Writers", main="Who Collaborates?", 
        col=rgb(0,0,0,0.2))
stripchart(AlbumPos~shortName, data=writerdf, method='jitter', pch=19, las=1,
        xlab="Peak Album Position", main="Who Writes the Best?", 
        col=rgb(0,0,0,0.2))

library(cowplot) # allows multiple ggplot2 plots on one page
## Warning: package 'cowplot' was built under R version 3.3.3
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
# Violin plot for active years.
# There are a lot of unnecessarily repeated values here.
g1 <- ggplot(mapping=aes(y=Year, x=shortName, 
                         colour=shortName), data=writerdf) + 
    geom_violin(draw_quantiles = 0.5) + 
    geom_jitter(position = position_jitter(0.1)) + 
    xlab("Ramone") + ylab("Years Active") + 
    ggtitle("Years Active") + scale_color_brewer(palette = "Dark2")

# Violin plot for song length
g2 <- ggplot(mapping=aes(y=LengthS/60, x=shortName, 
                         colour=shortName), data=writerdf) + 
    geom_violin(draw_quantiles = 0.5) + 
    geom_jitter(position = position_jitter(0.1)) + 
    xlab("Ramone") + ylab("Song Lengths") + 
    ggtitle("Who writes longer songs?") + scale_color_brewer(palette = "Dark2")

# Barplot - "Ramones" is just the collaboration, so I remove it
# Also, I used fill instead of color to make the bars match the colours
g3 <- ggplot(mapping=aes(NumWriters), data=subset(writerdf, shortName != "Ramones")) + 
    geom_bar(aes(fill=shortName), position="dodge") +
    xlab("Number of Writers") + ylab("Number of Songs") + 
    ggtitle("Writer Collaboration") + scale_fill_brewer(palette = "Dark2")

# Violin plot for album position. Lower is better.
# Again, there are unnecessarily repeated values.
# This is likely not the best visualization of this data.
g4 <- ggplot(mapping=aes(y=AlbumPos, x=shortName, 
                         colour=shortName), data=writerdf) + 
    geom_violin(draw_quantiles = 0.5) + 
    geom_jitter(position = position_jitter(0.1)) + 
    xlab("Ramone") + ylab("Peak Album Position") + 
    ggtitle("Who writes better songs?") + scale_color_brewer(palette = "Dark2")

plot_grid(g1,g2,g3,g4, ncol=2)

We see that Dee Dee writes shorter songs than Joey and they tend to have slightly lower album positions (recall that 1 is the highest possible position). Joey wrote more of the songs alone. Note that Johnny is never the sole writer of a song.

For some of the songs, the writer is simply credited as “The Ramones”. The plot of Peak Album Positions shows that they tended to write better albums when they worked together!

4.12 Things I had to look up/Google while making this

I don’t want you to think that I can just sit down and write all of this. That’s not what you should be aiming for. Most of my R skills come from knowing what to Google and what help files to check.

While doing the above analysis, I kept track of what I needed help with.

  • r remove parts of a string
  • ?grep (I remembered that there was a function called grep, I forgot what it did. I ended up using gsub, which the help file for grep told me about.)
  • ?strsplit
  • r boxplot alternatives
  • r violin plots
  • ggplot2 violin plots (ggplot2 makes it a little bit easier)
  • ?stripchart
  • ggplot2 jitter
  • Rcolorbrewer color palettes (for the colours in the violin charts)
  • ggplot2 add gridlines (this is how I got scale_x_continuous to work)
  • ?grep (I forgot how to use this again)
  • ggplot2 multiple plots
  • geom_barplot scale_color_brewer (had to change “color” to “fill” for the barplot)

5 Another Data Analysis

The mpg dataset is built into the ggplot2 package. Let’s go through some models with it!

data(mpg)
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

5.1 Univariate Plots

As we learned before, we start by plotting. First, some univariate statistics.

par(mfrow=c(2,3))
hist(mpg$displ, main="Displacement")
hist(mpg$cty, main = "City")
hist(mpg$hwy, main = "Highway")
barplot(table(mpg$cyl), main = "Number of Cylinders")
barplot(table(mpg$drv), main = "Drivetrain")
barplot(table(mpg$class), las=2, main = "Class")

5.2 Bivariate Plots

par(mfrow=c(3,3))
plot(displ ~ cty, data=mpg, main="Displacement by City")
abline(lm(displ ~ cty, data=mpg))
plot(displ ~ hwy, data=mpg, main = "Displacement by Highway")
abline(lm(displ ~ hwy, data=mpg))
plot(cty ~ hwy, data=mpg, main = "City by Highway")
abline(lm(cty ~ hwy, data=mpg))

boxplot(displ ~ cyl, data = mpg, main = "Displacement by Cylinder",
        xlab="Cylinder")
boxplot(displ ~ drv, data = mpg, main = "Displacement by Drivetrain",
        xlab="Drivetrain")
boxplot(displ ~ class, data = mpg, main = "Displacement by Class",
        xlab="Class", las=2)

boxplot(cty ~ cyl, data = mpg, main = "City by Cylinder",
        xlab = "Cylinder")
boxplot(cty ~ drv, data = mpg, main = "City by Drivetrain",
        xlab = "Drivetrain")
boxplot(cty ~ class, data = mpg, main = "City by Class",
        xlab = "Class", las=2)

5.3 Trivariate Plots

This is probably the most information you want to include in one plot.

par(mfrow=c(2,2))
plot(displ ~ cty, col=factor(cyl), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$cyl)), 
       col=1:4, pch=16)
plot(displ ~ cty, col=factor(drv), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$drv)), 
       col=1:3, pch=16)
plot(displ ~ cty, col=factor(class), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$class)), 
       col=1:7, pch=16)

boxplot(displ ~ drv*cyl, las=2, data=mpg,
        col=rep(1:4+1, each = 3),
        main="Interaction between Drivetrain and Cylinders")

5.4 ANOVA

Some one-way and two-way anova tables

anova(aov(displ ~ drv, data=mpg))
## Analysis of Variance Table
## 
## Response: displ
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## drv         2 189.55  94.776  109.82 < 2.2e-16 ***
## Residuals 231 199.36   0.863                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(displ ~ cyl, data=mpg))
## Analysis of Variance Table
## 
## Response: displ
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## cyl         1 336.54  336.54  1490.6 < 2.2e-16 ***
## Residuals 232  52.38    0.23                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(displ ~ class, data=mpg))
## Analysis of Variance Table
## 
## Response: displ
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## class       6 223.09  37.181  50.898 < 2.2e-16 ***
## Residuals 227 165.83   0.731                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ drv, data=mpg))
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## drv         2 1878.8  939.41  92.676 < 2.2e-16 ***
## Residuals 231 2341.5   10.14                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ cyl, data=mpg))
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## cyl         1 2740.1 2740.13  429.47 < 2.2e-16 ***
## Residuals 232 1480.2    6.38                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ class, data=mpg))
## Analysis of Variance Table
## 
## Response: cty
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## class       6 2295.0  382.51  45.099 < 2.2e-16 ***
## Residuals 227 1925.3    8.48                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Determining the best model
aov1 <- aov(cty ~ drv*cyl*class, data=mpg)
anova(aov1)
## Analysis of Variance Table
## 
## Response: cty
##                Df  Sum Sq Mean Sq  F value    Pr(>F)    
## drv             2 1878.81  939.41 238.9940 < 2.2e-16 ***
## cyl             1 1201.14 1201.14 305.5806 < 2.2e-16 ***
## class           6  209.15   34.86   8.8682 1.219e-08 ***
## drv:cyl         2   26.23   13.12   3.3370  0.037413 *  
## drv:class       3    4.81    1.60   0.4076  0.747663    
## cyl:class       5   62.73   12.55   3.1918  0.008438 ** 
## drv:cyl:class   1    0.25    0.25   0.0623  0.803086    
## Residuals     213  837.23    3.93                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the update function to change a model that you've created
aov2 <- update(aov1, ~.-drv:class - drv:cyl:class)

anova(aov1, aov2) # not a significant difference => use aov2
## Analysis of Variance Table
## 
## Model 1: cty ~ drv * cyl * class
## Model 2: cty ~ drv + cyl + class + drv:cyl + cyl:class
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    213 837.23                           
## 2    217 850.43 -4   -13.194 0.8391 0.5017
aov2$call
## aov(formula = cty ~ drv + cyl + class + drv:cyl + cyl:class, 
##     data = mpg)

I used a special function in the code above. The update function is very useful for model building. The first argument is the original model; the second argument is the new formula. The ~ . means “use everything that was in the original model”, and then I subtract the terms that I want to remove.
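
As another quick sketch, the same syntax can drop further terms (aov3 is just an illustrative name):

aov3 <- update(aov2, ~ . - drv:cyl) # drop the drv:cyl interaction as well
aov3$call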

5.5 Poisson counts for the number of cylinders

Here, we fit a model to try to predict the number of cylinders based on the car’s displacement. We could have added other covariates, but that would make the results more difficult to plot.

mpgpois <- glm(cyl ~ displ, data = mpg, family=poisson)
summary(mpgpois)
## 
## Call:
## glm(formula = cyl ~ displ, family = poisson, data = mpg)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.98488  -0.15890  -0.07696   0.28678   0.64538  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.08716    0.08111  13.403   <2e-16 ***
## displ        0.18877    0.02014   9.374   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 103.685  on 233  degrees of freedom
## Residual deviance:  17.451  on 232  degrees of freedom
## AIC: 864.5
## 
## Number of Fisher Scoring iterations: 4
newdata <- list(displ = seq(1,8,0.5))
mpgpoisPred <- predict(mpgpois, newdata, "response")

plot(cyl ~ displ, data=mpg)
lines(mpgpoisPred ~ newdata$displ, col='blue', lwd=2)
abline(lm(cyl~displ, data=mpg), col='red', lwd=2)

legend('bottomright', legend=c("Poisson", "Linear"), 
       col=c('blue','red'), lty=1, lwd=2)

5.6 Predicting auto/manual

mpg$am <- mpg$trans
# Get first letter, then convert to 0 and 1
mpg$am <- as.numeric(factor(substr(mpg$am, 1,1))) - 1
# now 1 is manual

mpgbin <- glm(am ~ displ, data=mpg, family=binomial)
summary(mpgbin)
## 
## Call:
## glm(formula = am ~ displ, family = binomial, data = mpg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2470  -0.9353  -0.6600   1.1949   2.3044  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9754     0.4222   2.310   0.0209 *  
## displ        -0.5082     0.1251  -4.063 4.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 296.48  on 233  degrees of freedom
## Residual deviance: 277.67  on 232  degrees of freedom
## AIC: 281.67
## 
## Number of Fisher Scoring iterations: 3
newdata <- list(displ=seq(1,8,0.1))
mpgbinPred <- predict(mpgbin, newdata, "response")

par(mfrow=c(1,2))
plot(am ~ displ, data=mpg)
lines(mpgbinPred ~ newdata$displ, lwd=2)

# Alternatively...
plot(displ ~ factor(am), data=mpg, horizontal=TRUE)
lines(I(mpgbinPred+1) ~ newdata$displ, lwd=2)

5.7 Poisson Regression with Multiple Predictors

Just for demonstration, I include the transmission type in the model. When doing predictions, we can set the value of each variable. This is easier when one of the variables is binary (0 or 1).

Here, I find the predicted number of cylinders versus displacement for automatic cars and the same thing for manual cars. It turns out that the transmission type doesn’t affect the curve, so this is not a meaningful plot.

This sort of prediction will likely come up whenever you have a categorical explanatory variable.

mpgpois <- glm(cyl ~ displ * am, data = mpg, family=poisson)
summary(mpgpois)
## 
## Call:
## glm(formula = cyl ~ displ * am, family = poisson, data = mpg)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.92013  -0.18500  -0.05115   0.26414   0.62685  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.10546    0.10647  10.383  < 2e-16 ***
## displ        0.18596    0.02525   7.364 1.78e-13 ***
## am          -0.03164    0.16854  -0.188    0.851    
## displ:am     0.00178    0.04479   0.040    0.968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 103.685  on 233  degrees of freedom
## Residual deviance:  17.277  on 230  degrees of freedom
## AIC: 868.33
## 
## Number of Fisher Scoring iterations: 4
newdata0 <- list(displ = seq(1,8,0.5), am = rep(0, 15))
mpgpoisPred0 <- predict(mpgpois, newdata0, "response")
newdata1 <- list(displ = seq(1,8,0.5), am = rep(1, 15))
mpgpoisPred1 <- predict(mpgpois, newdata1, "response")

plot(cyl ~ displ, data=mpg)
lines(mpgpoisPred0 ~ newdata0$displ, col='blue', lwd=2)
lines(mpgpoisPred1 ~ newdata1$displ, col='red', lwd=2)

legend('bottomright', legend=c("Automatic", "Manual"), 
       col=c('blue','red'), lty=1, lwd=2)

6 Miscellaneous Topics

6.1 Functions (The Proper Way)

Variables defined inside a function only exist inside that function.

myfun <- function(x, y, z){
    xtemp <- x+y # I define xtemp here...
    ytemp <- z-x
    return(xtemp*ytemp)
}

myfun(3,2,1)
xtemp # ... but it does not exist here

It is possible to reference values outside of a function, but it is a very bad idea.

xtemp <- 2

myfun <- function(x,y)
    return(x + y + xtemp)

myfun(2,3)

xtemp <- "hello"

myfun(2,3) # the function didn't change, but global values did

The return command is useful in more complicated functions.

myfun <- function(x,y){
    if(x > y){
        return(y^2)
    } else if(x < y) {
        return(x^2)
    } else {
        return(x^2)
    }
}

myfun(2,5)
myfun(3,1)
myfun(3,3)

There is a special argument, the ellipsis (written ...), that can appear in function definitions. It can be very useful.

myfun <- function(x, y, ...){
    plot(y~x, ...)
}

x1 <- 1:30
y1 <- rnorm(30, mean = x1, sd = sqrt(x1))

myfun(x1, y1, main="I'm adding a new argument!", 
      xlab="I never told myfun what to do with xlab, but here it is!",
      ylab = "The ellipses are magic")

6.2 Saving Objects

The save function lets you save multiple R objects. This way, you can load your datasets, vectors, plots, and functions later. You must tell it a filename (with file=), but other than that you can save whatever you want.

save(mydata, myfunction, thisggplot, anotherThing, 
     file="2017-06-20 Data Dump.Rdata")

When you start your next session, you can load the objects back into the environment.

load("2017-06-20 Data Dump.Rdata")