Welcome to the intermediate R workshop tutorial! The table of contents above should let you know what you’re in for, but I’ll give a brief description here. This file is meant to be a supplement to the workshop. The code appears in its final form here, but we will build it up slowly in the workshop so that it doesn’t seem so daunting. Think of this as the solutions file to some textbook problems.
In Section 1, we learn an important lesson about datasets. The datasaurus dozen dataset comes in a form that you can probably expect in your own research. It has x and y measurements for a dozen or so different categories. We’ll go through some data manipulation techniques and explain the logic behind each one. We’ll also get a nice overview of the apply family of functions. We might also talk about some plotting techniques…
Section 2 will focus on models in R. We can’t cover every model that R has to offer, but most of the built-in models have more-or-less standard input and output formats. We’ll learn to fit models in a general way and extract the relevant information. We demonstrate our knowledge with a simulation study investigating the proper use of p-values.
Section 3 is our chance to do an entire analysis from start to finish. We will import a dataset, explore some graphical techniques, fit some models, and produce a final report. For simple reports, we can use RMarkdown to mix words and code. This makes sharing code easy and it keeps all of your analysis and results in a standalone file.
The last section will focus on some of the more advanced topics. We’ll see some non-standard uses of the apply family, build our own functions, learn about some special syntax, and save our work in a way that makes it easy to pick up again later.
We hope you enjoyed this workshop! Remember that the only way to get better at something is to keep trying.
For this section, we use the datasaurus dozen dataset. First, we must install the package and load it. Note that installing a package doesn’t automatically load it into the system. You only have to install it once, but you have to load it every time you want to use it.
install.packages("datasauRus")
library("datasauRus")
Now that it’s loaded, we need to tell R that we want to look at the data. Below are some very helpful functions for looking at a dataset. They’re not all useful here, but they might be useful to you later.
data("datasaurus_dozen")
head(datasaurus_dozen, 10)
tail(datasaurus_dozen, 10)
str(datasaurus_dozen) # The structure of a dataframe
names(datasaurus_dozen) # column names
# For a categorical variable
unique(datasaurus_dozen$dataset) # unique names in the dataset
table(datasaurus_dozen$dataset) # how many of each value?
# For numbers
range(datasaurus_dozen$x)
range(datasaurus_dozen$y)
We can also look at subsets of the data. The following code is here as a reminder of R’s syntax. There are many different ways to do the same task; which way you use depends on the situation.
Make sure you understand each line. Pay attention to the use of the dollar signs, commas, double equal signs, and quotation marks. If you do something wrong with any of these, you will get an error.
# Getting columns - these all do the exact same thing!
mean(datasaurus_dozen$y)
mean(datasaurus_dozen$'y')
mean(datasaurus_dozen[, 'y']) # after the comma is the columns, before is rows
# since we want all of the rows, we leave a blank before the comma
mean(datasaurus_dozen[, 3]) # y is the third column
with(datasaurus_dozen, mean(y)) # the with function can be very helpful
# Subsetting by a categorical value in the rows.
# again, these all do the exact same thing.
mean(datasaurus_dozen$y[datasaurus_dozen$dataset == "dino"])
mean(datasaurus_dozen[datasaurus_dozen$dataset == "dino", "y"])
mean(datasaurus_dozen[datasaurus_dozen$dataset == "dino", 3])
mean(subset(datasaurus_dozen, dataset == "dino")$y)
mean(subset(datasaurus_dozen, dataset == "dino")$"y")
mean(subset(datasaurus_dozen, dataset == "dino")[, "y"])
mean(subset(datasaurus_dozen, dataset == "dino")[, 3])
with(datasaurus_dozen, mean(y[dataset == "dino"])) # my favourite
with(datasaurus_dozen[datasaurus_dozen$dataset == "dino", ], mean(y))
with(subset(datasaurus_dozen, dataset == "dino"), mean(y))
Some notes:
- The with and subset functions allow you to drop the dollar signs. If you’re just accessing a column it’s fine to use dollar signs, but when you’re doing subsets I try to avoid them.
- dino is one of the values in the column labelled dataset. dino absolutely must be in quotes.
- Every version above selects rows with dataset == "dino", but there are different ways to get y.

Suppose we want the mean and standard deviation of each dataset. We could do something like:
with(datasaurus_dozen, mean(y[dataset == "away"]))
## [1] 47.83472
with(datasaurus_dozen, mean(y[dataset == "dino"]))
## [1] 47.83225
with(datasaurus_dozen, mean(y[dataset == "star"]))
## [1] 47.83955
# and so on
but that’s a lot of typing. Instead, we can make our own function!
meanasaurus <- function(dataname){
meanofdata <- with(datasaurus_dozen, mean(y[dataset == dataname]))
return(meanofdata)
}
meanasaurus("away")
## [1] 47.83472
meanasaurus("dino")
## [1] 47.83225
meanasaurus("star")
## [1] 47.83955
The following are equivalent:
meanasaurus <- function(dataname){
meanofdata <- with(datasaurus_dozen, mean(y[dataset == dataname]))
return(meanofdata)
}
meanasaurus <- function(dataname) {
with(datasaurus_dozen, mean(y[dataset == dataname]))
}
meanasaurus <- function(dataname) with(datasaurus_dozen, mean(y[dataset == dataname]))
NOTE: This is terrible form for a function. You never want to reference a variable that was not passed as an argument. If we hadn’t loaded the datasauRus package, the above function wouldn’t work. Usually, you want your functions to be applicable to many situations.
Again, you should only reference variables that were passed as arguments. We will talk about this later. For now, we rewrite the function so that it at least fails with a clear message.
meanasaurus <- function(dataname){
# Check if the dataset exists
if(!exists("datasaurus_dozen")) stop("datasauRus has not been loaded")
# return the mean
with(datasaurus_dozen, mean(y[dataset == dataname]))
}
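A more robust fix (just a sketch; the name meanasaurus2 is my own, not part of the workshop code) is to pass the data frame in as an argument, so the function no longer depends on anything in the global environment:
meanasaurus2 <- function(dat, dataname){
# dat must be a data frame with columns named y and dataset
with(dat, mean(y[dataset == dataname]))
}
meanasaurus2(datasaurus_dozen, "dino")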
In this data, we have several different datasets. Each one can be found using something like subset(datasetname, columnname == "value"). But what if we want the mean of all of the different datasets? Again, there are several ways to do this.
# Get a vector with the names of the datasets
columns <- unique(datasaurus_dozen$dataset)
# from before: with(datasaurus_dozen, mean(y[dataset == "dino"]))
for(i in columns) print(with(datasaurus_dozen, mean(y[dataset == i])))
# similarly,
for(i in columns) print(meanasaurus(i))
It gets the job done, but it’s not pretty and it’s hard to tell where each mean came from. The apply family of functions will make a lot of things nicer. Here, we use sapply - the s means that it’s simplifying the results. In this case, it works the same as the for loop.
columns <- unique(datasaurus_dozen$dataset)
sapply(X=columns, FUN = function(column) with(datasaurus_dozen, mean(y[dataset == column])))
## dino away h_lines v_lines x_shape star
## 47.83225 47.83472 47.83025 47.83699 47.83972 47.83955
## high_lines dots circle bullseye slant_up slant_down
## 47.83545 47.83983 47.83772 47.83082 47.83150 47.83590
## wide_lines
## 47.83160
sapply(X=columns, FUN = meanasaurus)
## dino away h_lines v_lines x_shape star
## 47.83225 47.83472 47.83025 47.83699 47.83972 47.83955
## high_lines dots circle bullseye slant_up slant_down
## 47.83545 47.83983 47.83772 47.83082 47.83150 47.83590
## wide_lines
## 47.83160
sapply(columns, meanasaurus)
## dino away h_lines v_lines x_shape star
## 47.83225 47.83472 47.83025 47.83699 47.83972 47.83955
## high_lines dots circle bullseye slant_up slant_down
## 47.83545 47.83983 47.83772 47.83082 47.83150 47.83590
## wide_lines
## 47.83160
See how it keeps all of the names? The code isn’t very nice to look at, though. In fact, this is one of the least readable ways to do this. Fortunately, there are special functions that we can use.
tapply(X = datasaurus_dozen$y, INDEX = datasaurus_dozen$dataset, FUN = mean)
## away bullseye circle dino dots h_lines
## 47.83472 47.83082 47.83772 47.83225 47.83983 47.83025
## high_lines slant_down slant_up star v_lines wide_lines
## 47.83545 47.83590 47.83150 47.83955 47.83699 47.83160
## x_shape
## 47.83972
# A better way:
with(datasaurus_dozen, tapply(y, dataset, mean))
## away bullseye circle dino dots h_lines
## 47.83472 47.83082 47.83772 47.83225 47.83983 47.83025
## high_lines slant_down slant_up star v_lines wide_lines
## 47.83545 47.83590 47.83150 47.83955 47.83699 47.83160
## x_shape
## 47.83972
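tapply accepts any summary function, so the standard deviations we wanted earlier come from the same one-liner (a quick sketch):
with(datasaurus_dozen, tapply(y, dataset, sd))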
with(datasaurus_dozen, by(y, dataset, mean))
## dataset: away
## [1] 47.83472
## --------------------------------------------------------
## dataset: bullseye
## [1] 47.83082
## --------------------------------------------------------
## dataset: circle
## [1] 47.83772
## --------------------------------------------------------
## dataset: dino
## [1] 47.83225
## --------------------------------------------------------
## dataset: dots
## [1] 47.83983
## --------------------------------------------------------
## dataset: h_lines
## [1] 47.83025
## --------------------------------------------------------
## dataset: high_lines
## [1] 47.83545
## --------------------------------------------------------
## dataset: slant_down
## [1] 47.8359
## --------------------------------------------------------
## dataset: slant_up
## [1] 47.8315
## --------------------------------------------------------
## dataset: star
## [1] 47.83955
## --------------------------------------------------------
## dataset: v_lines
## [1] 47.83699
## --------------------------------------------------------
## dataset: wide_lines
## [1] 47.8316
## --------------------------------------------------------
## dataset: x_shape
## [1] 47.83972
The by function produces a lot of output, so you normally wouldn’t use it here. However, it works well if your output is a vector. For instance,
by(datasaurus_dozen[, c('x', 'y')],
datasaurus_dozen$dataset,
function(x) apply(x, 2, mean))
## datasaurus_dozen$dataset: away
## x y
## 54.26610 47.83472
## --------------------------------------------------------
## datasaurus_dozen$dataset: bullseye
## x y
## 54.26873 47.83082
## --------------------------------------------------------
## datasaurus_dozen$dataset: circle
## x y
## 54.26732 47.83772
## --------------------------------------------------------
## datasaurus_dozen$dataset: dino
## x y
## 54.26327 47.83225
## --------------------------------------------------------
## datasaurus_dozen$dataset: dots
## x y
## 54.26030 47.83983
## --------------------------------------------------------
## datasaurus_dozen$dataset: h_lines
## x y
## 54.26144 47.83025
## --------------------------------------------------------
## datasaurus_dozen$dataset: high_lines
## x y
## 54.26881 47.83545
## --------------------------------------------------------
## datasaurus_dozen$dataset: slant_down
## x y
## 54.26785 47.83590
## --------------------------------------------------------
## datasaurus_dozen$dataset: slant_up
## x y
## 54.26588 47.83150
## --------------------------------------------------------
## datasaurus_dozen$dataset: star
## x y
## 54.26734 47.83955
## --------------------------------------------------------
## datasaurus_dozen$dataset: v_lines
## x y
## 54.26993 47.83699
## --------------------------------------------------------
## datasaurus_dozen$dataset: wide_lines
## x y
## 54.26692 47.83160
## --------------------------------------------------------
## datasaurus_dozen$dataset: x_shape
## x y
## 54.26015 47.83972
If we want the means and standard deviations as well as the correlation between x and y, we can go back to the sapply function.
# allosaurus because it gives all the data. I'm clever.
allosaurus <- function(dataname){
# begin by subsetting - saves us typing later
thisdata <- datasaurus_dozen[datasaurus_dozen$dataset == dataname,]
# means
mx <- mean(thisdata$x)
my <- mean(thisdata$y)
# standard deviations
sx <- sd(thisdata$x)
sy <- sd(thisdata$y)
# correlation between x and y
corxy <- cor(thisdata$x, thisdata$y)
# Note that we can name things inside of vectors.
# This is not always useful, and sometimes it's a terrible idea.
return(c("X Mean" = mx,"Y Mean" = my, "X SD" = sx, "Y SD" = sy, "Cor" = corxy))
}
# Testing
allosaurus("dino")
## X Mean Y Mean X SD Y SD Cor
## 54.26327324 47.83225282 16.76514204 26.93540349 -0.06447185
sapply(columns, allosaurus) # notice how nicely this is formatted!
## dino away h_lines v_lines x_shape
## X Mean 54.26327324 54.26609978 54.26144178 54.26992723 54.26015033
## Y Mean 47.83225282 47.83472062 47.83025191 47.83698799 47.83971728
## X SD 16.76514204 16.76982495 16.76589790 16.76995861 16.76995770
## Y SD 26.93540349 26.93974342 26.93987622 26.93768381 26.93000169
## Cor -0.06447185 -0.06412835 -0.06171484 -0.06944557 -0.06558334
## star high_lines dots circle bullseye
## X Mean 54.2673411 54.26880528 54.26030345 54.26731971 54.26873002
## Y Mean 47.8395452 47.83545020 47.83982921 47.83771727 47.83082316
## X SD 16.7689592 16.76670402 16.76773549 16.76001266 16.76923949
## Y SD 26.9302747 26.93999796 26.93019152 26.93003609 26.93572669
## Cor -0.0629611 -0.06850422 -0.06034144 -0.06834336 -0.06858639
## slant_up slant_down wide_lines
## X Mean 54.26588179 54.26784882 54.26691630
## Y Mean 47.83149565 47.83589633 47.83160199
## X SD 16.76885267 16.76675895 16.76999962
## Y SD 26.93860807 26.93610493 26.93790193
## Cor -0.06860921 -0.06897974 -0.06657523
If you have data that has columns for group, explanatory variable, and response, you can just change the names above!
Try the above analysis with the iris dataset!
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
levels(iris$Species) # since it's a factor
## [1] "setosa" "versicolor" "virginica"
mean(iris$Sepal.Length)
## [1] 5.843333
mean(iris$Sepal.Width)
## [1] 3.057333
Note that there are a lot of things that need to be changed! However, the basics are the same.
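For example, the grouped summaries carry over directly; here is a quick sketch using the same tapply pattern on the Sepal.Length column:
# mean and standard deviation of sepal length for each species
with(iris, tapply(Sepal.Length, Species, mean))
with(iris, tapply(Sepal.Length, Species, sd))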
Notice that all of the datasets had the same mean and standard deviations for both x and y. Furthermore, every dataset has the same correlation between x and y. Surely that means they are the same!!!
You probably wondered why this is called the datasaurus dozen. Here’s why:
plot(y ~ x, data = subset(datasaurus_dozen, dataset == "dino"),
main = "The Datasaurus", xlab = "x", ylab = "y",
pch = 19, # adds filled in points rather than empty circles
las=1) # rotates the y-axis values
So do all of the datasets look like this?
par(mar=c(3.5,3.5,2,2)) # You probably won't need this.
columns <- unique(datasaurus_dozen$dataset)
par(mfrow=c(4,4))
for(i in columns){
plot(y ~ x, data = subset(datasaurus_dozen, dataset == i))
}
ALL of these plots have the same means, standard deviations, and correlations! Always always always plot your data!!!
Side note: this can also be done with the apply function.
par(mfrow = c(4,4))
invisible(sapply(columns,
function(column) {
plot(y ~ x,
data = subset(datasaurus_dozen,
dataset == column))
}
))
A lot of what you do in R will be based on models. You may be using t-tests, linear models, logistic regression, anova models, etc. In R, these models all have similar input and output.
For this section, we’re going to start by making some simple linear models. From there, we’re going to look into the linear model output. Next, we run a simulation study to see what happens with p-values.
The mtcars dataset was collected by Motor Trend magazine in the 1970s. It’s a very simple dataset, but there’s a lot that can be done with it. In this example, we start out by fitting a linear model for miles per gallon against weight, while also taking into account the transmission type (automatic = 0, manual = 1). First things first, we plot our data!
data(mtcars)
#?mtcars # built-in datasets have help files
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
plot(mpg ~ wt, data = mtcars, col = am + 1)
# note that am is either 0 or 1. am + 1 is 1 or 2
# R interprets 1 as black and 2 as red
There are a lot of different linear models that we could make here! Below is a brief overview of the syntax.
| Model | Notation |
|---|---|
| \(y = \beta_0+\beta_1x_1+\beta_2x_2\) | y ~ x1 + x2 |
| \(y = \beta_1x_1 +\beta_2x_2\) | y ~ x1 + x2 - 1 |
| \(y = \beta_0 + \beta_1x_1x_2\) | y ~ x1 : x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 + x2 + x1:x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ x1 * x2 |
| \(y = \beta_0+\beta_1x_1+\beta_2x_2 +\beta_3x_1x_2\) | y ~ (x1 + x2)^2 |
| \(y = \beta_0 + \beta_1x_1^2\) | y ~ I(x1^2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_1^2\) | y ~ poly(x1, 2) |
| \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ...\) | y ~ . |
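Several rows of this table are just different notations for the same model. As a quick sanity check (a sketch using the mtcars data loaded above), the three interaction notations give identical coefficients:
# all three formulas expand to mpg ~ wt + am + wt:am
coef(lm(mpg ~ wt + am + wt:am, data = mtcars))
coef(lm(mpg ~ wt * am, data = mtcars))
coef(lm(mpg ~ (wt + am)^2, data = mtcars))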
We’ll just fit a simple linear model with interaction.
mtlm <- lm(mpg ~ wt*am, data = mtcars)
summary(mtlm)
##
## Call:
## lm(formula = mpg ~ wt * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6004 -1.5446 -0.5325 0.9012 6.0909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.4161 3.0201 10.402 4.00e-11 ***
## wt -3.7859 0.7856 -4.819 4.55e-05 ***
## am 14.8784 4.2640 3.489 0.00162 **
## wt:am -5.2984 1.4447 -3.667 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
## F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
Everything is significant and it’s all fine and dandy. However, I’m more interested in what mtlm actually looks like.
mtlm
##
## Call:
## lm(formula = mpg ~ wt * am, data = mtcars)
##
## Coefficients:
## (Intercept) wt am wt:am
## 31.416 -3.786 14.878 -5.298
str(mtlm) # that's a lot of things!
## List of 12
## $ coefficients : Named num [1:4] 31.42 -3.79 14.88 -5.3
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## $ residuals : Named num [1:32] -1.494 0.823 -2.419 2.156 0.307 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ effects : Named num [1:32] -113.6497 -29.1157 0.0473 9.5033 0.1274 ...
## ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "am" "wt:am" ...
## $ rank : int 4
## $ fitted.values: Named num [1:32] 22.5 20.2 25.2 19.2 18.4 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ assign : int [1:4] 0 1 2 3
## $ qr :List of 5
## ..$ qr : num [1:32, 1:4] -5.657 0.177 0.177 0.177 0.177 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
## ..$ qraux: num [1:4] 1.18 1.05 1.08 1.07
## ..$ pivot: int [1:4] 1 2 3 4
## ..$ tol : num 1e-07
## ..$ rank : int 4
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 28
## $ xlevels : Named list()
## $ call : language lm(formula = mpg ~ wt * am, data = mtcars)
## $ terms :Classes 'terms', 'formula' language mpg ~ wt * am
## .. ..- attr(*, "variables")= language list(mpg, wt, am)
## .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
## .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
## .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
## .. ..- attr(*, "order")= int [1:3] 1 1 2
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(mpg, wt, am)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
## $ model :'data.frame': 32 obs. of 3 variables:
## ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language mpg ~ wt * am
## .. .. ..- attr(*, "variables")= language list(mpg, wt, am)
## .. .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
## .. .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
## .. .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
## .. .. ..- attr(*, "order")= int [1:3] 1 1 2
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(mpg, wt, am)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
## - attr(*, "class")= chr "lm"
There’s a lot of information there, and it’s not easy to get at. Most of the information is about the model fitting (e.g., the QR decomposition). Instead, let’s look at the summary again.
# R is an object-oriented programming language
# Everything is an object
str(summary(mtlm))
## List of 11
## $ call : language lm(formula = mpg ~ wt * am, data = mtcars)
## $ terms :Classes 'terms', 'formula' language mpg ~ wt * am
## .. ..- attr(*, "variables")= language list(mpg, wt, am)
## .. ..- attr(*, "factors")= int [1:3, 1:3] 0 1 0 0 0 1 0 1 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "mpg" "wt" "am"
## .. .. .. ..$ : chr [1:3] "wt" "am" "wt:am"
## .. ..- attr(*, "term.labels")= chr [1:3] "wt" "am" "wt:am"
## .. ..- attr(*, "order")= int [1:3] 1 1 2
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(mpg, wt, am)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "mpg" "wt" "am"
## $ residuals : Named num [1:32] -1.494 0.823 -2.419 2.156 0.307 ...
## ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ coefficients : num [1:4, 1:4] 31.42 -3.79 14.88 -5.3 3.02 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:4] FALSE FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## $ sigma : num 2.59
## $ df : int [1:3] 4 28 4
## $ r.squared : num 0.833
## $ adj.r.squared: num 0.815
## $ fstatistic : Named num [1:3] 46.6 3 28
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:4, 1:4] 1.358 -0.346 -1.358 0.346 -0.346 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## .. ..$ : chr [1:4] "(Intercept)" "wt" "am" "wt:am"
## - attr(*, "class")= chr "summary.lm"
Again, we see a lot of things. However, these are more useful to us. For instance,
mtsum <- summary(mtlm)
mtsum$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.416055 3.0201093 10.402291 4.001043e-11
## wt -3.785908 0.7856478 -4.818836 4.551182e-05
## am 14.878423 4.2640422 3.489277 1.621034e-03
## wt:am -5.298360 1.4446993 -3.667449 1.017148e-03
Recall the definition of a significance level: the probability of rejecting the null hypothesis given that the null hypothesis is true.
Everyone always wants to look at a p-value. In this section, we’ll look at one of the major shortcomings of p-values. Specifically, we’re going to look at what happens when you make a linear model then remove terms based on their p-values.
We’re going to do a simulation study. We’re going to generate random, uncorrelated data, then we’ll test to see if it’s correlated. This seems like a strange thing to do, but recall that a significance level of 0.05 means that there’s a 5% chance that we’ll reject the null hypothesis when the null hypothesis is true.
In this study, our null hypothesis is that there is no correlation between x and y. This null hypothesis is true because that’s how we’re making it.
x <- rnorm(40)
y <- rnorm(40)
plot(y~x)
xylm <- lm(y~x)
summary(xylm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37982 -0.60947 -0.01794 0.76261 2.39463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1712 0.1609 -1.064 0.294
## x -0.1176 0.1433 -0.821 0.417
##
## Residual standard error: 1.013 on 38 degrees of freedom
## Multiple R-squared: 0.01742, Adjusted R-squared: -0.008438
## F-statistic: 0.6737 on 1 and 38 DF, p-value: 0.4169
abline(xylm)
Clearly, we don’t have any relationship between x and y. To quantify this, we find the p-value for the slope.
summary(xylm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1711767 0.1608776 -1.0640182 0.2940353
## x -0.1176460 0.1433363 -0.8207691 0.4168972
# the p-value is the 2nd row, 4th column
summary(xylm)$coef[2,4]
## [1] 0.4168972
Let’s do this 10000 times.
pvalues <- c() # empty vector to start
for(i in 1:10000){ # do this 10000 times
x <- rnorm(40)
y <- rnorm(40)
xylm <- lm(y~x)
sig <- summary(xylm)$coef[2,4]
pvalues[i] <- sig
}
hist(pvalues,
breaks = 20, # 20 breaks means that each box is 5%
col=c(2, rep(0,19))) # make the first box red
The distribution looks like a box. This is good - the p-values should follow a uniform distribution. This means that significant p-values (the red box) happen 5% of the time.
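We can check this directly with a one-liner (assuming the pvalues vector from the loop above):
mean(pvalues < 0.05) # proportion of significant results; should be close to 0.05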
Now, watch what happens when we only take the most significant variable.
x1 <- rnorm(40)
x2 <- rnorm(40)
y <- rnorm(40)
xxylm <- lm(y ~ x1 + x2)
summary(xxylm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.21636188 0.1685676 -1.2835322 0.20728641
## x1 -0.04423465 0.1733378 -0.2551933 0.79998708
## x2 -0.29231321 0.1704306 -1.7151455 0.09468355
# the p-values are the 2nd and 3rd row, 4th column
pvals <- summary(xxylm)$coef[c(2,3), 4]
# which is the smallest p-value?
minx <- which.min(pvals)
# make a new dataframe so we can remove one of the x's
xxdf <- data.frame(x1, x2)
# fit a new linear model with only the most significant x
xylm <- lm(y ~ xxdf[,minx])
# what has happened?
summary(xylm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2146407 0.1663478 -1.290313 0.20473859
## xxdf[, minx] -0.2898493 0.1680507 -1.724773 0.09269475
# The p-value has changed slightly, but not much
# is this really such a bad thing?
This is something that happens a lot - you fit a linear model then remove the covariates that aren’t significant. This is not a good strategy, though - your p-values aren’t valid anymore!
pvalues2 <- c() # empty vector
t0 <- Sys.time() # Just to see how long this takes
for(i in 1:10000){
x1 <- rnorm(40)
x2 <- rnorm(40)
y <- rnorm(40)
xxylm <- lm(y ~ x1 + x2)
summary(xxylm)$coef
pvals <- summary(xxylm)$coef[c(2,3), 4]
minx <- which.min(pvals)
xxdf <- data.frame(x1, x2)
xylm <- lm(y ~ xxdf[,minx])
pvalues2[i] <- summary(xylm)$coef[2,4]
}
Sys.time() - t0 # it took this long:
## Time difference of 18.70499 secs
hist(pvalues2, breaks = 20, col = c(2, rep(0,19)))
With this method, around 10% of your p-values are significant, double the claimed value of 5%! Be very careful when fitting linear models - it’s easy to accidentally ruin your analysis.
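The same one-line check makes the inflation explicit (assuming the pvalues2 vector from above):
mean(pvalues2 < 0.05) # roughly 0.10 rather than the nominal 0.05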
The same tricks can be used with other types of models. Below are some common model forms for you to play around with. Note that summary isn’t as useful for other models.
mtttest <- t.test(mpg~am, data=mtcars)
names(mtttest)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
mtttest$p.value
## [1] 0.001373638
# Using anova to do an F-test for linear models
mtlm1 <- lm(mpg ~ wt, data = mtcars)
mtlm2 <- lm(mpg ~ wt + am, data = mtcars)
mtanova <- anova(mtlm1, mtlm2)
names(mtanova)
## [1] "Res.Df" "RSS" "Df" "Sum of Sq" "F" "Pr(>F)"
mtanova$"Pr(>F)"
## [1] NA 0.9879146
# logistic model for transmission against mpg and weight
mtlogistic <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
mtlogsum <- summary(mtlogistic)
names(mtlogsum)
## [1] "call" "terms" "family" "deviance"
## [5] "aic" "contrasts" "df.residual" "null.deviance"
## [9] "df.null" "iter" "deviance.resid" "coefficients"
## [13] "aliased" "dispersion" "df" "cov.unscaled"
## [17] "cov.scaled"
mtlogsum$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.8865540 12.193529 2.122975 0.03375597
## mpg -0.3241639 0.239499 -1.353509 0.17589322
## wt -6.4161721 2.546606 -2.519500 0.01175217
These will be useful in the next section.
Data from: http://r-blog.salvaggio.net/?p=521
Go to the bottom of the page and find the link to the data. From this, we’ll make an RMarkdown document. To create this document in RStudio, click on the icon at the top left (a paper with a green plus) and select RMarkdown.
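If you haven’t seen an RMarkdown file before, the basic skeleton looks roughly like this (a sketch only; the title and chunk contents are placeholders, not the exact document we’ll build):
---
title: "Ramones Analysis"
output: html_document
---

Plain text (like this sentence) becomes the report's prose.

```{r}
# code chunks run when you click Knit, and their output appears in the report
ram <- read.csv("Ramones.csv", stringsAsFactors = FALSE)
head(ram)
```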
For reference, here are some of the plots we made. Notice that the R code shows up above the plot, so your collaborators can see exactly what you did. Much of this was not covered, but is included so you can see some extra code. Hopefully you will find something useful here.
This is a really short bit of code, but it took me a long time to figure out what code to write. This sort of thing happens a lot with R.
ram <- read.csv("Ramones.csv", stringsAsFactors = FALSE)
# The lengths of songs are 120S, as in 120 seconds
# I use gsub to remove the S, which still leaves the numbers as strings
# as.numeric changes the strings to numbers
ram$lengthS <- as.numeric(gsub("S", "", ram$lengthS))
# Chords as a percent of total chords
# This is categorical, so I did not use a histogram
chords_dens <- table(ram$chords)/length(ram$chords)
plot(chords_dens, main = "Checking Distributional Assumptions",
ylab = "Density", xlab="Number of Chords")
mch <- mean(ram$chords, na.rm=TRUE)
sch <- sd(ram$chords, na.rm=TRUE)
lines(2:12, dpois(2:12, lambda=mch), col="blue", lwd=2, type='b')
xn <- seq(2,12, 0.1)
# I used 5 in the normal curve because that's the mode
# If it were truly normal, the mean and mode would be equal
yn <- dnorm(xn, mean = 5, sd = sch)
lines(xn,yn, col = "red", lwd=2)
legend('topright', legend = c("True", "Poisson(5.3)", "Normal(5, 2.25)"),
lty=1, lwd=2, col=c(1,4,2))
hist(ram$lengthS, main="Song Lengths", xlab="Length (Seconds)", freq=FALSE)
xn <- seq(70, 300, 1)
yn <- dnorm(xn, mean = mean(ram$lengthS), sd = sd(ram$lengthS))
lines(yn ~ xn)
# The data is clearly right-skewed
library(ggplot2)
# ggplot2 version - I made the bars look a little bit better than the default
ggplot(ram, aes(lengthS)) + geom_histogram(col='black', fill='grey') +
xlab("Song Length") + ylab("Frequency") +
ggtitle("Histogram of Song Lengths")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ggplot2)
chords <- ram$chords
key <- factor(ram$key)
ggplot(mapping=aes(y=chords, x=factor(key), colour=factor(key)), data=ram) +
geom_violin() + geom_jitter(position = position_jitter(0.01)) +
xlab("Key") + ylab("Number of Chords") +
ggtitle("Number of Chords in Each Key")
## Warning: Removed 12 rows containing non-finite values (stat_ydensity).
## Warning: Removed 12 rows containing missing values (geom_point).
par(mfrow=c(2,1))
boxplot(chords~album, data=ram)
boxplot(lengthS~album, data=ram)
years <- unique(ram$year)
years <- sort(c(years, 1977)) # Two albums in one year
# the data was already sorted, so adding a number and sorting is fine
albumLen <- sapply(unique(ram$album),
function(x)
sum(ram$lengthS[ram$album == x])
)/60
plot(years, albumLen, main="Album Length by Year",
xlab="Year", ylab="Album Length (Minutes)")
abline(lm(albumLen ~ years))
years <- unique(ram$year)
years <- sort(c(years, 1977)) # Two albums in one year
# the data was already sorted, so adding a number and sorting is fine
albumLen <- sapply(
# Find all of the album names
unique(ram$album),
function(x) {
# for each album name, add up all of the song lengths
sum(ram$lengthS[ram$album == x])
})
albumLen <- albumLen/60 # Convert to minutes
posUS <- unique(ram$albumPeakPositionUS)
posUS <- c(posUS, 148) # two albums had the same peak position
# fortunately, the repeated album was the last year, so we don't
# have to sort too much
ggplot(mapping=aes(x=years, y=albumLen)) + geom_point() +
geom_smooth(method="lm") + theme_light() +
geom_hline(yintercept = mean(albumLen), colour='red') +
xlab("Release Year") + ylab("Album Lengths (Minutes)") +
ggtitle("Album Length Over Time (the slope is NOT significant)") +
scale_x_continuous(breaks=seq(1977, 1995, 2))
summary(lm(albumLen ~ years))
##
## Call:
## lm(formula = albumLen ~ years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3859 -1.4747 0.0619 1.7214 3.2451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -325.3208 201.3051 -1.616 0.132
## years 0.1805 0.1015 1.779 0.101
##
## Residual standard error: 2.286 on 12 degrees of freedom
## Multiple R-squared: 0.2087, Adjusted R-squared: 0.1427
## F-statistic: 3.164 on 1 and 12 DF, p-value: 0.1006
ggplot(mapping=aes(x=years, y=albumLen)) + geom_point() + geom_smooth() +
xlab("Release Year") + ylab("Album Lengths (Minutes)") +
ggtitle("Album Length Over Time - Nonparametric Smoother") +
theme_bw()
## `geom_smooth()` using method = 'loess'
ggplot(mapping=aes(x=years, y=posUS)) + geom_point() +
geom_smooth(method="lm") + theme_light() +
xlab("Release Year") + ylab("Peak Billboard Position (US)") +
ggtitle("The Ramones got less popular over time (slope IS significant)") +
scale_x_continuous(breaks=seq(1977, 1995, 2))
rampois <- glm(chords ~ lengthS, data=ram, family=poisson)
summary(rampois)
##
## Call:
## glm(formula = chords ~ lengthS, family = poisson, data = ram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1257 -0.6170 -0.1608 0.2689 2.1246
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.674e+00 1.365e-01 12.265 <2e-16 ***
## lengthS 4.604e-05 8.485e-04 0.054 0.957
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 65.713 on 163 degrees of freedom
## Residual deviance: 65.710 on 162 degrees of freedom
## (12 observations deleted due to missingness)
## AIC: 646.2
##
## Number of Fisher Scoring iterations: 4
newdata <- data.frame(lengthS = 1:400)
rampred <- predict(rampois, newdata = newdata, type = "response")
plot(chords ~ lengthS, data= ram)
lines(rampred ~ newdata[,1], col=4, lwd=2)
# Just for fun
abline(lm(chords~lengthS, data=ram), col=2)
legend('topright', legend=c("Linear Model", "GLM"), col=c(2,4), lwd=2, lty=1)
ramlog <- glm(factor(side) ~ lengthS, data=ram, family = binomial)
summary(ramlog)
##
## Call:
## glm(formula = factor(side) ~ lengthS, family = binomial, data = ram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.22578 -1.17517 -0.01097 1.17932 1.20595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16259 0.62350 -0.261 0.794
## lengthS 0.00104 0.00387 0.269 0.788
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 243.99 on 175 degrees of freedom
## Residual deviance: 243.92 on 174 degrees of freedom
## AIC: 247.92
##
## Number of Fisher Scoring iterations: 3
repred2 <- predict(ramlog, newdata, type="response")
plot(as.numeric(factor(side))-1 ~ lengthS, data=ram)
lines(repred2 ~ newdata[,1])
This next part gets a little tricky. Working with character strings in R is not the easiest thing to do. The following code has to split the writer names into all of the writers of a track, then we find out who wrote which track.
# Writers are separated by commas, so split by commas
writerList <- strsplit(ram$writers, split = ", ")
# strsplit returns a list that is difficult to work with
writerNames <- unique(unlist(writerList))
# Only look at writers who have the word "Ramone" in their names
Ramones <- writerNames[grep("Ramone", writerNames)]
# Look at who wrote the most songs
table(unlist(writerList)[unlist(writerList) %in% Ramones])
##
## C.J. Ramone Dee Dee Ramone Joey Ramone Johnny Ramone Marky Ramone
## 2 67 43 18 5
## Ramones Richie Ramone Tommy Ramone
## 36 4 3
Ramones <- Ramones[c(2,3,4,5)] # Only the most prolific writers
The writerList object is a list. The first item in the list has as many elements as there were writers of the first song. The inList vector will be a logical index with TRUE for each row that includes that writer. I create a dataframe that includes information on every song that writer wrote.
# Initiate an empty data frame
writerdf <- data.frame(Album=NA, Year=NA, Side=NA, LengthS=NA,
NumWriters=NA, Writer=NA, AlbumPos=NA)
for(i in Ramones){
inList <- sapply(writerList, function(x) i %in% x)
writerdf <- rbind(writerdf,
data.frame(Album=ram$album[inList],
Year=ram$year[inList],
Side=ram$side[inList],
LengthS=ram$lengthS[inList],
NumWriters=ram$nbreWriters[inList],
Writer=rep(i, sum(inList)),
AlbumPos=ram$albumPeakPositionUS[inList]))
}
# the first rows are NAs
writerdf <- writerdf[-1,]
# strsplit returns a list with one element, e.g. "Joey Ramone"
# "Joey Ramone" will become "Joey", "Ramone", then I just take "Joey"
writerdf$shortName <- sapply(writerdf$Writer, function(x) strsplit(x, " ")[[1]][1])
# Change "Dee" to "Dee Dee" BEFORE making it a factor. Once it's a factor,
# it's much harder to deal with
writerdf$shortName[writerdf$shortName == "Dee"] <- "Dee Dee"
# Make the names into factors (R treats factors differently, that's what we want)
writerdf$shortName <- factor(writerdf$shortName)
par(mfrow=c(2,2))
stripchart(Year~shortName, data=writerdf, method='jitter', pch=19, las=1,
xlab="Year", main="Years Active",
col=rgb(0,0,0,0.2))
stripchart(I(LengthS/60)~shortName, data=writerdf, method='jitter', pch=19, las=1,
xlab="Length of Song (Minutes)", main="Who writes longer songs?",
col=rgb(0,0,0,0.2))
stripchart(NumWriters~shortName, data=writerdf, method='jitter', pch=19, las=1,
xlab="Number of Writers", main="Who Collaborates?",
col=rgb(0,0,0,0.2))
stripchart(AlbumPos~shortName, data=writerdf, method='jitter', pch=19, las=1,
xlab="Peak Album Position", main="Who Writes the Best?",
col=rgb(0,0,0,0.2))
library(cowplot) # allows multiple ggplot2 plots on one page
## Warning: package 'cowplot' was built under R version 3.3.3
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
# Violin plot for active years.
# There are a lot of unnecessarily repeated values here.
g1 <- ggplot(mapping=aes(y=Year, x=shortName,
colour=shortName), data=writerdf) +
geom_violin(draw_quantiles = 0.5) +
geom_jitter(position = position_jitter(0.1)) +
xlab("Ramone") + ylab("Years Active") +
ggtitle("Years Active") + scale_color_brewer(palette = "Dark2")
# Violin plot for song length
g2 <- ggplot(mapping=aes(y=LengthS/60, x=shortName,
colour=shortName), data=writerdf) +
geom_violin(draw_quantiles = 0.5) +
geom_jitter(position = position_jitter(0.1)) +
xlab("Ramone") + ylab("Song Lengths") +
ggtitle("Who writes longer songs?") + scale_color_brewer(palette = "Dark2")
# Barplot - "Ramones" is just the collaboration, so I remove it
# Also, I used fill instead of color to make the bars match the colours
g3 <- ggplot(mapping=aes(NumWriters), data=subset(writerdf, shortName != "Ramones")) +
geom_bar(aes(fill=shortName), position="dodge") +
xlab("Number of Writers") + ylab("Number of Songs") +
ggtitle("Writer Collaboration") + scale_fill_brewer(palette = "Dark2")
# Violin plot for album position. Lower is better.
# Again, there are unnecessarily repeated values.
# This is likely not the best visualization of this data.
g4 <- ggplot(mapping=aes(y=AlbumPos, x=shortName,
colour=shortName), data=writerdf) +
geom_violin(draw_quantiles = 0.5) +
geom_jitter(position = position_jitter(0.1)) +
xlab("Ramone") + ylab("Peak Album Position") +
ggtitle("Who writes better songs?") + scale_color_brewer(palette = "Dark2")
plot_grid(g1,g2,g3,g4, ncol=2)
We see that Dee Dee writes shorter songs than Joey and they tend to have slightly lower album positions (recall that 1 is the highest possible position). Joey wrote more of the songs alone. Note that Johnny is never the sole writer of a song.
For some of the songs, the writer is simply credited as “The Ramones”. The plot of Peak Album Positions shows that they tended to write better albums when they worked together!
I don’t want you to think that I can just sit down and write all of this. That’s not what you should be aiming for. Most of my R skills come from knowing what to Google and what help files to check.
While doing the above analysis, I kept track of what I needed help with.
- ?grep (I remembered that there was a function called grep, but I forgot what it did. I ended up using gsub, which the help file for grep told me about.)
- ?strsplit
- ?stripchart
- the ggplot2 documentation (to get scale_x_continuous to work)
- ?grep (I forgot how to use this again)

The mpg dataset is built into the ggplot2 package. Let’s go through some models with it!
data(mpg)
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
As we learned before, we start by plotting. First, some univariate statistics.
par(mfrow=c(2,3))
hist(mpg$displ, main="Displacement")
hist(mpg$cty, main = "City")
hist(mpg$hwy, main = "Highway")
barplot(table(mpg$cyl), main = "Number of Cylinders")
barplot(table(mpg$drv), main = "Drivetrain")
barplot(table(mpg$class), las=2, main = "Class")
par(mfrow=c(3,3))
plot(displ ~ cty, data=mpg, main="Displacement by City")
abline(lm(displ ~ cty, data=mpg))
plot(displ ~ hwy, data=mpg, main = "Displacement by Highway")
abline(lm(displ ~ hwy, data=mpg))
plot(cty ~ hwy, data=mpg, main = "City by Highway")
abline(lm(cty ~ hwy, data=mpg))
boxplot(displ ~ cyl, data = mpg, main = "Displacement by Cylinder",
xlab="Cylinder")
boxplot(displ ~ drv, data = mpg, main = "Displacement by Drivetrain",
xlab="Drivetrain")
boxplot(displ ~ class, data = mpg, main = "Displacement by Class",
xlab="Class", las=2)
boxplot(cty ~ cyl, data = mpg, main = "City by Cylinder",
xlab = "Cylinder")
boxplot(cty ~ drv, data = mpg, main = "City by Drivetrain",
xlab = "Drivetrain")
boxplot(cty ~ class, data = mpg, main = "City by Class",
xlab = "Class", las=2)
This is probably the most information you want to include in one plot.
par(mfrow=c(2,2))
plot(displ ~ cty, col=factor(cyl), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$cyl)),
col=1:4, pch=16)
plot(displ ~ cty, col=factor(drv), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$drv)),
col=1:3, pch=16)
plot(displ ~ cty, col=factor(class), data = mpg, pch=16)
legend('topright', legend = levels(factor(mpg$class)),
col=1:7, pch=16)
boxplot(displ ~ drv*cyl, las=2, data=mpg,
col=rep(1:4+1, each = 3),
main="Interaction between Drivetrain and Cylinders")
Some one-way and multi-factor ANOVA tables:
anova(aov(displ ~ drv, data=mpg))
## Analysis of Variance Table
##
## Response: displ
## Df Sum Sq Mean Sq F value Pr(>F)
## drv 2 189.55 94.776 109.82 < 2.2e-16 ***
## Residuals 231 199.36 0.863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(displ ~ cyl, data=mpg))
## Analysis of Variance Table
##
## Response: displ
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 336.54 336.54 1490.6 < 2.2e-16 ***
## Residuals 232 52.38 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(displ ~ class, data=mpg))
## Analysis of Variance Table
##
## Response: displ
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 223.09 37.181 50.898 < 2.2e-16 ***
## Residuals 227 165.83 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ drv, data=mpg))
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drv 2 1878.8 939.41 92.676 < 2.2e-16 ***
## Residuals 231 2341.5 10.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ cyl, data=mpg))
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 2740.1 2740.13 429.47 < 2.2e-16 ***
## Residuals 232 1480.2 6.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(cty ~ class, data=mpg))
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## class 6 2295.0 382.51 45.099 < 2.2e-16 ***
## Residuals 227 1925.3 8.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Determining the best model
aov1 <- aov(cty ~ drv*cyl*class, data=mpg)
anova(aov1)
## Analysis of Variance Table
##
## Response: cty
## Df Sum Sq Mean Sq F value Pr(>F)
## drv 2 1878.81 939.41 238.9940 < 2.2e-16 ***
## cyl 1 1201.14 1201.14 305.5806 < 2.2e-16 ***
## class 6 209.15 34.86 8.8682 1.219e-08 ***
## drv:cyl 2 26.23 13.12 3.3370 0.037413 *
## drv:class 3 4.81 1.60 0.4076 0.747663
## cyl:class 5 62.73 12.55 3.1918 0.008438 **
## drv:cyl:class 1 0.25 0.25 0.0623 0.803086
## Residuals 213 837.23 3.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the update function to change a model that you've created
aov2 <- update(aov1, ~.-drv:class - drv:cyl:class)
anova(aov1, aov2) # not a significant difference => use aov2
## Analysis of Variance Table
##
## Model 1: cty ~ drv * cyl * class
## Model 2: cty ~ drv + cyl + class + drv:cyl + cyl:class
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 213 837.23
## 2 217 850.43 -4 -13.194 0.8391 0.5017
aov2$call
## aov(formula = cty ~ drv + cyl + class + drv:cyl + cyl:class,
## data = mpg)
I used a special function in the code above. The update function is very useful for model building. The first argument is the original model, the second argument is the new formula. ~. means use everything that was in the original model, then I start subtracting things that I want to remove.
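As a small sketch of the same idea, update can also add a term back in, which makes it easy to compare nested models without retyping the whole formula:
aov3 <- update(aov2, ~ . + drv:class) # re-add the drv:class interaction
anova(aov2, aov3) # compare the nested models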
Here, we fit a model to try and predict the number of cylinders based on the car’s displacement. We could have added other covariates, but it would be more difficult to plot.
mpgpois <- glm(cyl ~ displ, data = mpg, family=poisson)
summary(mpgpois)
##
## Call:
## glm(formula = cyl ~ displ, family = poisson, data = mpg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.98488 -0.15890 -0.07696 0.28678 0.64538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.08716 0.08111 13.403 <2e-16 ***
## displ 0.18877 0.02014 9.374 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.451 on 232 degrees of freedom
## AIC: 864.5
##
## Number of Fisher Scoring iterations: 4
newdata <- list(displ = seq(1,8,0.5))
mpgpoisPred <- predict(mpgpois, newdata, "response")
plot(cyl ~ displ, data=mpg)
lines(mpgpoisPred ~ newdata$displ, col='blue', lwd=2)
abline(lm(cyl~displ, data=mpg), col='red', lwd=2)
legend('bottomright', legend=c("Poisson", "Linear"),
col=c('blue','red'), lty=1, lwd=2)
mpg$am <- mpg$trans
# Get first letter, then convert to 0 and 1
mpg$am <- as.numeric(factor(substr(mpg$am, 1,1))) - 1
# now 1 is manual
mpgbin <- glm(am ~ displ, data=mpg, family=binomial)
summary(mpgbin)
##
## Call:
## glm(formula = am ~ displ, family = binomial, data = mpg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2470 -0.9353 -0.6600 1.1949 2.3044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9754 0.4222 2.310 0.0209 *
## displ -0.5082 0.1251 -4.063 4.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 296.48 on 233 degrees of freedom
## Residual deviance: 277.67 on 232 degrees of freedom
## AIC: 281.67
##
## Number of Fisher Scoring iterations: 3
newdata <- list(displ=seq(1,8,0.1))
mpgbinPred <- predict(mpgbin, newdata, "response")
par(mfrow=c(1,2))
plot(am ~ displ, data=mpg)
lines(mpgbinPred ~ newdata$displ, lwd=2)
# Alternatively...
plot(displ ~ factor(am), data=mpg, horizontal=TRUE)
lines(I(mpgbinPred+1) ~ newdata$displ, lwd=2)
Just for demonstration, I include the transmission type in the model. When doing predictions, we can set the value of each variable. This is easier when one of the variables is binary (0 or 1).
Here, I find the predicted number of cylinders versus displacement for automatic cars and the same thing for manual cars. It turns out that the transmission type doesn’t affect the curve, so this is not a meaningful plot.
This sort of prediction will likely come up whenever you have a categorical explanatory variable.
mpgpois <- glm(cyl ~ displ * am, data = mpg, family=poisson)
summary(mpgpois)
##
## Call:
## glm(formula = cyl ~ displ * am, family = poisson, data = mpg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.92013 -0.18500 -0.05115 0.26414 0.62685
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10546 0.10647 10.383 < 2e-16 ***
## displ 0.18596 0.02525 7.364 1.78e-13 ***
## am -0.03164 0.16854 -0.188 0.851
## displ:am 0.00178 0.04479 0.040 0.968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.277 on 230 degrees of freedom
## AIC: 868.33
##
## Number of Fisher Scoring iterations: 4
newdata0 <- list(displ = seq(1,8,0.5), am = rep(0, 15))
mpgpoisPred0 <- predict(mpgpois, newdata0, "response")
newdata1 <- list(displ = seq(1,8,0.5), am = rep(1, 15))
mpgpoisPred1 <- predict(mpgpois, newdata1, "response")
plot(cyl ~ displ, data=mpg)
lines(mpgpoisPred0 ~ newdata0$displ, col='blue', lwd=2)
lines(mpgpoisPred1 ~ newdata1$displ, col='red', lwd=2)
legend('bottomright', legend=c("Automatic", "Manual"),
col=c('blue','red'), lty=1, lwd=2)
Variables defined inside a function exist only inside that function.
myfun <- function(x, y, z){
xtemp <- x+y # I define xtemp here...
ytemp <- z-x
return(xtemp*ytemp)
}
myfun(3,2,1)
xtemp # ... but it does not exist here
It is possible to reference values outside of a function, but it is a very bad idea.
xtemp <- 2
myfun <- function(x,y)
return(x + y + xtemp)
myfun(2,3)
xtemp <- "hello"
myfun(2,3) # the function didn't change, but global values did
The return command is useful in more complicated functions.
myfun <- function(x,y){
if(x > y){
return(y^2)
} else if(x < y) {
return(x^2)
} else {
return(x^2)
}
}
myfun(2,5)
myfun(3,1)
myfun(3,3)
There is a special argument, the ellipsis (...), that can appear in function definitions. It can be very, very useful.
myfun <- function(x, y, ...){
plot(y~x, ...)
}
x1 <- 1:30
y1 <- rnorm(30, mean = x1, sd = sqrt(x1))
myfun(x1, y1, main="I'm adding a new argument!",
xlab="I never told myfun what to do with xlab, but here it is!",
ylab = "The ellipses are magic")
The save function lets you save multiple R objects. This way, you can load your datasets, vectors, plots, and functions later. You must tell it a filename (with file=), but other than that you can save whatever you want.
save(mydata, myfunction, thisggplot, anotherThing,
file="2017-06-20 Data Dump.Rdata")
When you start your next section, you can load the objects back into the environment.
load("2017-06-20 Data Dump.Rdata")