BIOS 621/821 Session 1

Levi Waldron

Welcome and outline - session 1

A bit about me - research interests

Some of my activities that may interest you

Multiple Linear Regression Model (sec. 4.2)

Systematic part of model:

\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]

Multiple Linear Regression Model (cont’d)

Random part of model:

\(y_i = E[y_i|x_i] + \epsilon_i\)

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)

Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)

Continuous predictors

Binary predictors (2 levels)

Multilevel Categorical Predictors (Ordinal or Nominal)

* STATA and R code dummy variables automatically, behind-the-scenes

Inference from multiple linear regression

Interaction (effect modification)

Interaction is modeled as the product of two covariates: \[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1*x_2 \]

Interaction (effect modification)

Interaction between coffee and time of day on performance

Interaction between coffee and time of day on performance

Image credit: http://personal.stevens.edu/~ysakamot/

ANOVA table

Source of Variation Sum Sq Deg Fr Mean Sq F
Model MSS k MSS/k (MSS/k)/MSE
Residual RSS n-(k-1) RSS/(n-k-1)
Total TSS n-1

Regression in R: model formulae

Model formulae tutorial

response variable ~ explanatory variables

Model formulae (cont’d)

Model formula for simple linear regression:

y ~ x

Model formulae (cont’d)

Additional explanatory variables would be added as follows:

y ~ x + z

Note that “+” does not have its usual meaning, which would be achieved by:

y ~ I(x + z)

Types of standard linear models

lm( y ~ u + v)

u and v factors: ANOVA
u and v numeric: multiple regression
one factor, one numeric: ANCOVA

Model formulae (cont’d)

symbol example meaning
+ + x include this variable
- - x delete this variable
: x : z include the interaction
* x * z include these variables and their interactions
/ x / z nesting: include z nested within x
| x | z conditioning: include x given z
^ (u + v + w)^3 include these variables and
    all interactions up to three way
1 -1 intercept: delete the intercept

Model formulae (cont’d)

How to interpret the following model formulae?

y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2

Model formulae (cont’d)

How to interpret the following model formulae?

y ~ u + v + w + u:v + u:w + v:w + u:v:w
y ~ u * v * w
y ~ (u + v + w)^3

Introduction to the R language

Logic

Storing Data: The Rules

x <- 5
x * 2
## [1] 10
x <- x + 1
y <- 4
x * y
## [1] 24

Basic Data Types

set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
1:5
## [1] 1 2 3 4 5
sample( 1:5 )
## [1] 2 1 3 4 5

Basic Data Types

c("yes", "no")
## [1] "yes" "no"
factor(c("yes", "no"))
## [1] yes no 
## Levels: no yes

Basic Data Types

factor(c("good", "very good", "poor"), 
       levels=c("poor", "good", "very good"), 
       ordered=TRUE)
## [1] good      very good poor     
## Levels: poor < good < very good
1:5 %in% 4:5
## [1] FALSE FALSE FALSE  TRUE  TRUE
c(NA, NaN, -Inf, Inf)
## [1]   NA  NaN -Inf  Inf

class() to find the class of a variable.

Vectors Must Be of One Data Mode

c( 1, "2", FALSE)
## [1] "1"     "2"     "FALSE"
c( 1, FALSE )
## [1] 1 0

Selecting Vector Elements

x <- 1:4
x[ 2 ]
## [1] 2
x <- 1:10
x[ 4:7 ]
## [1] 4 5 6 7

Selecting Vector Elements

x <- c( "a", "b", "c", "d", "e", "f" )
x[ c(5,3,1) ]
## [1] "e" "c" "a"
x[ -1 ]
## [1] "b" "c" "d" "e" "f"
x[-1:-2]
## [1] "c" "d" "e" "f"

Selecting Vector Elements

x <- 1:10
y <- x%%2 == 0
x[y]
## [1]  2  4  6  8 10

2-Dimensional Vectors are Matrices

matrix( 1:20, nrow = 5, ncol = 4 )
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

Indexing Matrices

boring.matrix <- matrix( 1:20, nrow = 5, ncol = 4 )
dim( boring.matrix )
## [1] 5 4
boring.matrix[ ,1 ]
## [1] 1 2 3 4 5
boring.matrix[ 2, 1 ]
## [1] 2
boring.matrix[ 2, ]
## [1]  2  7 12 17

Indexing Matrices

boring.matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
boring.matrix[ boring.matrix[ ,1 ] ==3,]
## [1]  3  8 13 18

Matrix Operations

boring.matrix <- matrix(1:9, nrow = 3)
boring.matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
t(boring.matrix)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Matrix Operations (cont’d)

boring.matrix + 1
##      [,1] [,2] [,3]
## [1,]    2    5    8
## [2,]    3    6    9
## [3,]    4    7   10
boring.matrix + 1:3
##      [,1] [,2] [,3]
## [1,]    2    5    8
## [2,]    4    7   10
## [3,]    6    9   12

Matrix Operations (cont’d)

boring.matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
boring.matrix + boring.matrix
##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18

Matrix Operations (cont’d)

boring.matrix * boring.matrix
##      [,1] [,2] [,3]
## [1,]    1   16   49
## [2,]    4   25   64
## [3,]    9   36   81
boring.matrix %*% boring.matrix
##      [,1] [,2] [,3]
## [1,]   30   66  102
## [2,]   36   81  126
## [3,]   42   96  150

Naming rows and columns

colnames(boring.matrix) <- c("col.1", "col.2", "col.3")
rownames(boring.matrix) <- c("row.1", "row.2", "row.3")
boring.matrix
##       col.1 col.2 col.3
## row.1     1     4     7
## row.2     2     5     8
## row.3     3     6     9
boring.matrix["row.1", ]
## col.1 col.2 col.3 
##     1     4     7

Lists are Filing Cabinets

my.person <- list( measurements, self.reporting, 
                   sex, parents)
my.person
## [[1]]
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## [[2]]
##  [1] 13  6  4  7  6  5  8  9  7  4
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] "Parent1.name" "Parent2.name"

Lists are Filing Cabinets

my.person[1:2]
## [[1]]
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## [[2]]
##  [1] 13  6  4  7  6  5  8  9  7  4
my.person[[1]]
## [1]  1.3  1.6  3.2  9.8 10.2

Lists are Filing Cabinets

my.person <- list( measure = measurements, 
                   self.measure = self.reporting, 
                   s = sex, 
                   parents = parents )
my.person
## $measure
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## $self.measure
##  [1] 13  6  4  7  6  5  8  9  7  4
## 
## $s
## [1] FALSE
## 
## $parents
## [1] "Parent1.name" "Parent2.name"
my.person$parents
## [1] "Parent1.name" "Parent2.name"

The data.frame object

x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
test.dataframe <- data.frame(x,y,z,a)

Accessing data.frame elements

test.dataframe[[4]]
## [1] one   two   three four  five  six  
## Levels: five four one six three two
test.dataframe$parents
## NULL

Columns of a data.frame May Contain Different Data Modes

class( test.dataframe[[1]] )
## [1] "integer"
class( test.dataframe[[2]] )
## [1] "numeric"
class( test.dataframe[[3]] )
## [1] "factor"
class( test.dataframe[[4]] )
## [1] "factor"

Combining Data Frames

mini.frame.one <- data.frame( "one" = 1:5 )
mini.frame.two <- data.frame( "two" = 6:10 )
cbind( mini.frame.one, mini.frame.two )
##   one two
## 1   1   6
## 2   2   7
## 3   3   8
## 4   4   9
## 5   5  10

Alternatively: c( mini.frame.one, mini.frame.two )

Updating Data Frames

test.dataframe
##    x   y     z     a
## 1 11 0.0   one   one
## 2 12 0.2   two   two
## 3 13 0.4 three three
## 4 14 0.6  four  four
## 5 15 0.8  five  five
## 6 16 1.0   six   six
test.dataframe[[1]] = 21:26
test.dataframe
##    x   y     z     a
## 1 21 0.0   one   one
## 2 22 0.2   two   two
## 3 23 0.4 three three
## 4 24 0.6  four  four
## 5 25 0.8  five  five
## 6 26 1.0   six   six

Reading Tables

Example: Cholesterol vs. Age

chol <- read.delim("http://www.statlab.uni-heidelberg.de/data/ancova/cholesterol.data")
summary(chol)
##   cholesterol         age             state   
##  Min.   :112.0   Min.   :18.00   Iowa    :11  
##  1st Qu.:181.2   1st Qu.:39.50   Nebraska:19  
##  Median :199.0   Median :48.00                
##  Mean   :213.7   Mean   :48.57                
##  3rd Qu.:247.0   3rd Qu.:58.00                
##  Max.   :356.0   Max.   :78.00

Example: Cholesterol vs. Age linear model

fit <- lm(cholesterol ~ age * state, data=chol)
summary(fit)
## 
## Call:
## lm(formula = cholesterol ~ age * state, data = chol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.480 -31.907  -4.303  22.829  85.833 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        35.8112    55.1166   0.650  0.52156   
## age                 3.2381     1.0088   3.210  0.00352 **
## stateNebraska      65.4866    61.9834   1.057  0.30045   
## age:stateNebraska  -0.7177     1.1628  -0.617  0.54247   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.14 on 26 degrees of freedom
## Multiple R-squared:  0.5326, Adjusted R-squared:  0.4786 
## F-statistic: 9.875 on 3 and 26 DF,  p-value: 0.00016

Example: Cholesterol vs. Age ANOVA table

anova(fit)
## Analysis of Variance Table
## 
## Response: cholesterol
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1  48976   48976 26.3124 2.388e-05 ***
## state      1   5456    5456  2.9315   0.09877 .  
## age:state  1    709     709  0.3809   0.54247    
## Residuals 26  48395    1861                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Cholesterol vs. Age diagnostic plots

par(mfrow=c(2, 2))
plot(fit)

Example: Cholesterol vs. Age partial F-test

fit1 <- lm(cholesterol ~ state, data=chol)
fit2 <- lm(cholesterol ~ state + age, data=chol)
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: cholesterol ~ state
## Model 2: cholesterol ~ state + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     28 102924                                  
## 2     27  49104  1     53820 29.593 9.361e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example: Cholesterol vs. Age backwards selection

library(MASS)
fit <- lm(cholesterol ~ age * state, data=chol)
step <- stepAIC(fit, direction="backward")
## Start:  AIC=229.58
## cholesterol ~ age * state
## 
##             Df Sum of Sq   RSS    AIC
## - age:state  1    709.05 49104 228.01
## <none>                   48395 229.58
## 
## Step:  AIC=228.01
## cholesterol ~ age + state
## 
##         Df Sum of Sq    RSS    AIC
## <none>                49104 228.01
## - state  1      5456  54560 229.18
## - age    1     53820 102924 248.22

AIC = Akaike’s Information Criterion