Levi Waldron
Systematic part of model:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Random part of model:
\(y_i = E[y_i|x_i] + \epsilon_i\)
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
* STATA and R code dummy variables automatically, behind-the-scenes
Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]
Interaction between coffee and time of day on performance
Image credit: http://personal.stevens.edu/~ysakamot/
Source of Variation | Sum Sq | Deg Fr | Mean Sq | F |
---|---|---|---|---|
Model | MSS | k | MSS/k | (MSS/k)/MSE |
Residual | RSS | n-(k-1) | RSS/(n-k-1) | |
Total | TSS | n-1 |
aov()
, lm()
, glm()
, and coxph()
use a “model formula” interface.response variable ~ explanatory variables
Model formula for simple linear regression:
y ~ x
Additional explanatory variables would be added as follows:
y ~ x + z
Note that “+” does not have its usual meaning, which would be achieved by:
y ~ I(x + z)
lm( y ~ u + v)
u
and v
factors: ANOVA
u
and v
numeric: multiple regression
one factor, one numeric: ANCOVA
symbol | example | meaning |
---|---|---|
+ | + x | include this variable |
- | - x | delete this variable |
: | x : z | include the interaction |
* | x * z | include these variables and their interactions |
/ | x / z | nesting: include z nested within x |
| | x | z | conditioning: include x given z |
^ | (u + v + w)^3 | include these variables and |
all interactions up to three way | ||
1 | -1 | intercept: delete the intercept |
How to interpret the following model formulae?
y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2
How to interpret the following model formulae?
y ~ u + v + w + u:v + u:w + v:w + u:v:w
y ~ u * v * w
y ~ (u + v + w)^3
5 + 2 #addition
5 - 2 #subtraction
5 * 2 #multiplication
5 / 2 #division
5 ^ 2 #exponentiation
5 ** 2 #exponentiation
5 %% 2 #modulus (a.k.a. remainder)
5 < x #less than
5 <= x #less than or equal to
5 > x #greater than
5 >= x #greater than or equal to
5 == x #equal to
5 != x #not equal to
!x #logical NOT
True || False #stepwise logical OR
True && False #stepwise logical AND
x <- 5
x * 2
## [1] 10
x <- x + 1
y <- 4
x * y
## [1] 24
set.seed(1)
rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
1:5
## [1] 1 2 3 4 5
sample( 1:5 )
## [1] 2 1 3 4 5
c("yes", "no")
## [1] "yes" "no"
factor(c("yes", "no"))
## [1] yes no
## Levels: no yes
factor(c("good", "very good", "poor"),
levels=c("poor", "good", "very good"),
ordered=TRUE)
## [1] good very good poor
## Levels: poor < good < very good
1:5 %in% 4:5
## [1] FALSE FALSE FALSE TRUE TRUE
c(NA, NaN, -Inf, Inf)
## [1] NA NaN -Inf Inf
class()
to find the class of a variable.
c( 1, "2", FALSE)
## [1] "1" "2" "FALSE"
c( 1, FALSE )
## [1] 1 0
x <- 1:4
x[ 2 ]
## [1] 2
x <- 1:10
x[ 4:7 ]
## [1] 4 5 6 7
x <- c( "a", "b", "c", "d", "e", "f" )
x[ c(5,3,1) ]
## [1] "e" "c" "a"
x[ -1 ]
## [1] "b" "c" "d" "e" "f"
x[-1:-2]
## [1] "c" "d" "e" "f"
x <- 1:10
y <- x%%2 == 0
x[y]
## [1] 2 4 6 8 10
matrix( 1:20, nrow = 5, ncol = 4 )
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
boring.matrix <- matrix( 1:20, nrow = 5, ncol = 4 )
dim( boring.matrix )
## [1] 5 4
boring.matrix[ ,1 ]
## [1] 1 2 3 4 5
boring.matrix[ 2, 1 ]
## [1] 2
boring.matrix[ 2, ]
## [1] 2 7 12 17
boring.matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
boring.matrix[ boring.matrix[ ,1 ] ==3,]
## [1] 3 8 13 18
boring.matrix <- matrix(1:9, nrow = 3)
boring.matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
t(boring.matrix)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
boring.matrix + 1
## [,1] [,2] [,3]
## [1,] 2 5 8
## [2,] 3 6 9
## [3,] 4 7 10
boring.matrix + 1:3
## [,1] [,2] [,3]
## [1,] 2 5 8
## [2,] 4 7 10
## [3,] 6 9 12
boring.matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
boring.matrix + boring.matrix
## [,1] [,2] [,3]
## [1,] 2 8 14
## [2,] 4 10 16
## [3,] 6 12 18
boring.matrix * boring.matrix
## [,1] [,2] [,3]
## [1,] 1 16 49
## [2,] 4 25 64
## [3,] 9 36 81
boring.matrix %*% boring.matrix
## [,1] [,2] [,3]
## [1,] 30 66 102
## [2,] 36 81 126
## [3,] 42 96 150
colnames(boring.matrix) <- c("col.1", "col.2", "col.3")
rownames(boring.matrix) <- c("row.1", "row.2", "row.3")
boring.matrix
## col.1 col.2 col.3
## row.1 1 4 7
## row.2 2 5 8
## row.3 3 6 9
boring.matrix["row.1", ]
## col.1 col.2 col.3
## 1 4 7
my.person <- list( measurements, self.reporting,
sex, parents)
my.person
## [[1]]
## [1] 1.3 1.6 3.2 9.8 10.2
##
## [[2]]
## [1] 13 6 4 7 6 5 8 9 7 4
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [1] "Parent1.name" "Parent2.name"
my.person[1:2]
## [[1]]
## [1] 1.3 1.6 3.2 9.8 10.2
##
## [[2]]
## [1] 13 6 4 7 6 5 8 9 7 4
my.person[[1]]
## [1] 1.3 1.6 3.2 9.8 10.2
my.person <- list( measure = measurements,
self.measure = self.reporting,
s = sex,
parents = parents )
my.person
## $measure
## [1] 1.3 1.6 3.2 9.8 10.2
##
## $self.measure
## [1] 13 6 4 7 6 5 8 9 7 4
##
## $s
## [1] FALSE
##
## $parents
## [1] "Parent1.name" "Parent2.name"
my.person$parents
## [1] "Parent1.name" "Parent2.name"
list
with vector elements of equal lengthx <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
test.dataframe <- data.frame(x,y,z,a)
test.dataframe[[4]]
## [1] one two three four five six
## Levels: five four one six three two
test.dataframe$parents
## NULL
class( test.dataframe[[1]] )
## [1] "integer"
class( test.dataframe[[2]] )
## [1] "numeric"
class( test.dataframe[[3]] )
## [1] "factor"
class( test.dataframe[[4]] )
## [1] "factor"
mini.frame.one <- data.frame( "one" = 1:5 )
mini.frame.two <- data.frame( "two" = 6:10 )
cbind( mini.frame.one, mini.frame.two )
## one two
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
Alternatively: c( mini.frame.one, mini.frame.two )
test.dataframe
## x y z a
## 1 11 0.0 one one
## 2 12 0.2 two two
## 3 13 0.4 three three
## 4 14 0.6 four four
## 5 15 0.8 five five
## 6 16 1.0 six six
test.dataframe[[1]] = 21:26
test.dataframe
## x y z a
## 1 21 0.0 one one
## 2 22 0.2 two two
## 3 23 0.4 three three
## 4 24 0.6 four four
## 5 25 0.8 five five
## 6 26 1.0 six six
read.table
read.csv
read.delim
chol <- read.delim("http://www.statlab.uni-heidelberg.de/data/ancova/cholesterol.data")
summary(chol)
## cholesterol age state
## Min. :112.0 Min. :18.00 Iowa :11
## 1st Qu.:181.2 1st Qu.:39.50 Nebraska:19
## Median :199.0 Median :48.00
## Mean :213.7 Mean :48.57
## 3rd Qu.:247.0 3rd Qu.:58.00
## Max. :356.0 Max. :78.00
fit <- lm(cholesterol ~ age * state, data=chol)
summary(fit)
##
## Call:
## lm(formula = cholesterol ~ age * state, data = chol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.480 -31.907 -4.303 22.829 85.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.8112 55.1166 0.650 0.52156
## age 3.2381 1.0088 3.210 0.00352 **
## stateNebraska 65.4866 61.9834 1.057 0.30045
## age:stateNebraska -0.7177 1.1628 -0.617 0.54247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.14 on 26 degrees of freedom
## Multiple R-squared: 0.5326, Adjusted R-squared: 0.4786
## F-statistic: 9.875 on 3 and 26 DF, p-value: 0.00016
anova(fit)
## Analysis of Variance Table
##
## Response: cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 48976 48976 26.3124 2.388e-05 ***
## state 1 5456 5456 2.9315 0.09877 .
## age:state 1 709 709 0.3809 0.54247
## Residuals 26 48395 1861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2, 2))
plot(fit)
fit1 <- lm(cholesterol ~ state, data=chol)
fit2 <- lm(cholesterol ~ state + age, data=chol)
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: cholesterol ~ state
## Model 2: cholesterol ~ state + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 102924
## 2 27 49104 1 53820 29.593 9.361e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(MASS)
fit <- lm(cholesterol ~ age * state, data=chol)
step <- stepAIC(fit, direction="backward")
## Start: AIC=229.58
## cholesterol ~ age * state
##
## Df Sum of Sq RSS AIC
## - age:state 1 709.05 49104 228.01
## <none> 48395 229.58
##
## Step: AIC=228.01
## cholesterol ~ age + state
##
## Df Sum of Sq RSS AIC
## <none> 49104 228.01
## - state 1 5456 54560 229.18
## - age 1 53820 102924 248.22
AIC = Akaike’s Information Criterion