R for Psychologists 3: Linear models, contrast codes & plotting

Alejandro de la Vega
10/8/2013

Review

Last time we covered:

How to join data frames using common identifier
- join(d1, d2, by= c(“id”))
How to reshape your data
- melt & cast
Split-Apply-Combine problems
- ddply & dlply

Basic statistics with R

Usually, if you can, best to use a linear model
But often you have to report “standard” correlation or t-test
Usage of t.tests, cor.test
Correlation matrix

T-test

If you have wide format data
T.test expects 1 or 2 vectors

t.test(french_fries$potato, french_fries$grassy)


    Welch Two Sample t-test

data:  french_fries$potato and french_fries$grassy 
t = 43.4, df = 879, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 6.004 6.573 
sample estimates:
mean of x mean of y 
   6.9525    0.6642

Formula notation for long format

t.test(extra~group, data=sleep)


    Welch Two Sample t-test

data:  extra by group 
t = -1.861, df = 17.78, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.3655  0.2055 
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33

Paired t-test

t.test(extra~group, data=sleep, paired=TRUE)


    Paired t-test

data:  extra by group 
t = -4.062, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.4599 -0.7001 
sample estimates:
mean of the differences 
                  -1.58

Correlation test is very similar

cor.test(french_fries$potato, french_fries$grassy)


    Pearson's product-moment correlation

data:  french_fries$potato and french_fries$grassy 
t = 3.434, df = 693, p-value = 0.0006295
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.05553 0.20179 
sample estimates:
   cor 
0.1294

Formula notation

cor.test(~potato + grassy, data=french_fries)


    Pearson's product-moment correlation

data:  potato and grassy 
t = 3.434, df = 693, p-value = 0.0006295
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.05553 0.20179 
sample estimates:
   cor 
0.1294

These are all objects!

x <- cor.test(~potato + grassy, data=french_fries)
str(x)

List of 9
 $ statistic  : Named num 3.43
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 693
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.000629
 $ estimate   : Named num 0.129
  ..- attr(*, "names")= chr "cor"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "correlation"
 $ alternative: chr "two.sided"
 $ method     : chr "Pearson's product-moment correlation"
 $ data.name  : chr "potato and grassy"
 $ conf.int   : atomic [1:2] 0.0555 0.2018
  ..- attr(*, "conf.level")= num 0.95
 - attr(*, "class")= chr "htest"

Easy to pull out information

x$statistic

    t 
3.434

x$p.value

[1] 0.0006295

x$estimate

   cor 
0.1294

Correlation matrix (Hmisc package)

rcorr(as.matrix(fries))

        grassy potato buttery
grassy    1.00   0.13    0.10
potato    0.13   1.00    0.41
buttery   0.10   0.41    1.00

n
        grassy potato buttery
grassy     696    695     692
potato     695    696     692
buttery    692    692     696

P
        grassy potato buttery
grassy         0.0006 0.0092 
potato  0.0006        0.0000 
buttery 0.0092 0.0000

Linear modeling

Can express almost anything in a linear model

m <- lm(choice~condition, data=trustData)
m


Call:
lm(formula = choice ~ condition, data = trustData)

Coefficients:
(Intercept)   conditionT   conditionU  
     0.7061      -0.0249      -0.3137

Remember: factor order matters!

levels(trustData$condition)

[1] "N" "T" "U"

Summary

Summary function gives you additional (necessary) information

summary(m)


Call:
lm(formula = choice ~ condition, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.706 -0.681  0.294  0.319  0.608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7061     0.0298   23.72  < 2e-16 ***
conditionT   -0.0249     0.0474   -0.53      0.6    
conditionU   -0.3137     0.0603   -5.20  2.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.466 on 481 degrees of freedom
Multiple R-squared: 0.0557, Adjusted R-squared: 0.0518 
F-statistic: 14.2 on 2 and 481 DF,  p-value: 1.02e-06

Changing family type - GLM

summary(glm(choice~condition, family=binomial, data=trustData))


Call:
glm(formula = choice ~ condition, family = binomial, data = trustData)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.565  -1.512   0.834   0.876   1.368  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.877      0.140    6.25  4.1e-10 ***
conditionT    -0.117      0.220   -0.53     0.59    
conditionU    -1.314      0.270   -4.87  1.1e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 628.69  on 483  degrees of freedom
Residual deviance: 602.86  on 481  degrees of freedom
AIC: 608.9

Number of Fisher Scoring iterations: 4

Remember, it's an object

type name and press “tab” or use str() to see contents

sum_m$coefficients

            Estimate Std. Error z value  Pr(>|z|)
(Intercept)   0.8766     0.1402   6.251 4.089e-10
conditionT   -0.1171     0.2201  -0.532 5.947e-01
conditionU   -1.3138     0.2697  -4.871 1.112e-06

sum_m$aic

[1] 608.9

Contrast coding

Using factors is fine, but order matters!
Better to set up deliberate models using contrast codes
Instead of using ad-hoc methods like Tukey's
recode in car package

Recode (car package)

Syntax is pretty clunky unfortunately
Allows you to specify value in new column for each level
Sort of like transform
Problem: Want to code linearly Untrust > Neutral > Trust

    sub condition value delay choice
415   1         U    30    14      1
114   3         N    22    21      1
363   9         T    14    14      1
475  10         U    18    14      0
448   6         U    14    90      0
405  10         T    22    42      1

Recode usage

Similar to transform but returns a vector

trustData$linear <- recode(trustData$condition, 
  "'N' = 0; 'U' = -1; 'T' = 1", as.factor.result=F)

    sub condition value delay choice linear
415   1         U    30    14      1     -1
114   3         N    22    21      1      0
363   9         T    14    14      1      1
475  10         U    18    14      0     -1
448   6         U    14    90      0     -1
405  10         T    22    42      1      1

Recode + lm

Test linear effect of condition on behavior

summary(lm(choice~linear, data=trustData))


Call:
lm(formula = choice ~ linear, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.741 -0.628  0.259  0.372  0.486 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6277     0.0221   28.38  < 2e-16 ***
linear        0.1136     0.0315    3.61  0.00034 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.473 on 482 degrees of freedom
Multiple R-squared: 0.0263, Adjusted R-squared: 0.0243 
F-statistic:   13 on 1 and 482 DF,  p-value: 0.000338

Reminder: ddply for summary stats

ddply(trustData,.(linear), summarize, mean = mean(choice))

  linear   mean
1     -1 0.3924
2      0 0.7061
3      1 0.6813

Untrustworthy vs other conditions

trustData$UvO <- recode(trustData$condition,
  "'N' = 1; 'U' = -2; 'T' = 1", as.factor.result=F)
summary(lm(choice~UvO, data=trustData))


Call:
lm(formula = choice ~ UvO, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.696 -0.696  0.304  0.304  0.608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5950     0.0233   25.54  < 2e-16 ***
UvO           0.1013     0.0191    5.31  1.7e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.466 on 482 degrees of freedom
Multiple R-squared: 0.0552, Adjusted R-squared: 0.0532 
F-statistic: 28.2 on 1 and 482 DF,  p-value: 1.7e-07

Repeated measures - ANOVA

“Easy” way to do it is with ANOVA function
Error - treatments nested within subjects

summary(aov(choice ~ linear + Error(sub/linear), data=trustData))


Error: sub
       Df Sum Sq Mean Sq
linear  1   2.27    2.27

Error: sub:linear
       Df Sum Sq Mean Sq
linear  1   6.49    6.49

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)  
linear      1    0.9   0.919    4.37  0.037 *
Residuals 480  100.9   0.210                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Issues with ANOVA

Too many for me to go into
Inflexible in specification of model
- Can't treat stimuli as random
Can't take “family”
Read: “Treating stimuli as a random factor in social psychology: a new and comprehensive solution to a pervasive but largely ignored problem.” by Judd, Westfall & Kenny in J Pers. Soc Psych. PubMed

How to use lme (nlme package)

summary(lme(choice ~ linear, random = ~ 1 | sub, trustData))

Linear mixed-effects model fit by REML
 Data: trustData 
    AIC   BIC logLik
  648.8 665.5 -320.4

Random effects:
 Formula: ~1 | sub
        (Intercept) Residual
StdDev:      0.1339   0.4579

Fixed effects: choice ~ linear 
             Value Std.Error  DF t-value p-value
(Intercept) 0.6168   0.04757 473  12.968       0
linear      0.1830   0.03609 473   5.071       0
 Correlation: 
       (Intr)
linear -0.125

Standardized Within-Group Residuals:
    Min      Q1     Med      Q3     Max 
-1.8012 -1.1688  0.5713  0.7159  1.5473 

Number of Observations: 484
Number of Groups: 10

lmer (lme4)

library(lme4)
summary(lmer(choice ~ linear + (1 | sub), family = binomial, trustData))

Generalized linear mixed model fit by the Laplace approximation 
Formula: choice ~ linear + (1 | sub) 
   Data: trustData 
 AIC BIC logLik deviance
 609 621   -301      603
Random effects:
 Groups Name        Variance Std.Dev.
 sub    (Intercept) 0.312    0.559   
Number of obs: 484, groups: sub, 10

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.520      0.204    2.55    0.011 *  
linear         0.781      0.166    4.70  2.6e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Correlation of Fixed Effects:
       (Intr)
linear -0.113

Plotting!

Plotting should be part of your analysis pipeline
Should be interleaved with analysis, not just left to the end
Plots should update themselves!
Make sure to write plots into analysis script
Professional and high res

Base R plotting

plot(mtcars$mpg, mtcars$qsec)

plot of chunk unnamed-chunk-27

Unfortunately, not the most aesthetic
Adding “layers” (e.g. lines) can be unintuitive
Some disagree and its probably worth learning some of the basics
But ggplots2 makes advanced plotting much easier

ggplot2 philosophy

“Grammar of graphics”
First, map data
Then add layers to express the data
- geometry
- statistical transformations
- scales
- coordinate systems
- (positioning, adjustments)

Mapping data

Map one part of data to X, and one part to Y

library(ggplot2)
plot1 <- ggplot(mtcars, aes(mpg, qsec))

As of yet incomplete because no geometry specified

Adding geometry

plot1 + geom_point()

plot of chunk unnamed-chunk-29

Scatterplot
- Geometry is points
Defaults:
- Statistic: Identity (i.e. value)
- Scale: linear
- Coordinate system: Cartesian

Line plot

plot1 + geom_line()

plot of chunk unnamed-chunk-30

Hmm… connecting each point with line
Because line plot is not just a geometry but also…
Statistic: linear fit

Line of best fit

ggplot(mtcars, aes(mpg, qsec)) + geom_smooth(method=lm)

plot of chunk unnamed-chunk-31

stat_smooth is actually a combination of linear model statistic with a line geomtry
specified lm but could also use another function (such as rlm or loess smooth fit)

Putting layers together

ggplot(mtcars, aes(mpg, qsec)) +
  geom_smooth(method = lm, se = F, size = 2) +
  geom_point(size = 5)

plot of chunk unnamed-chunk-32

Mapping color & group

ggplot(mtcars, aes(mpg, qsec, color=factor(cyl))) +
  geom_point(size = 5)

plot of chunk unnamed-chunk-33

Color mapped to variable

Bar plot data mapping

What is different about the mapping between a correlation plot and a bar plot?

Bar plot - summary

ggplot(mtcars, aes(cyl, qsec)) + stat_summary(fun.y = mean, geom = "bar")

plot of chunk unnamed-chunk-35

Mapping height to value
- Split by cateogrical variable
- Take mean like ddply
Geometry: geom_bar
Statistic: “mean” + identity

Bar plot - fill & position = another variable

ggplot(mtcars, aes(cyl, qsec, fill = factor(gear))) + stat_summary(fun.y = mean, geom = "bar", position = "dodge")

plot of chunk unnamed-chunk-36

Bar plot with error bars (SE)

ggplot(mtcars, aes(cyl, qsec, fill=factor(cyl))) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3)

plot of chunk unnamed-chunk-37

Dot plot with error bars (SE)

ggplot(mtcars, aes(cyl, qsec, color = factor(cyl))) +
  stat_summary(fun.y = mean, geom = "point", size = 10) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange", size = 3)

plot of chunk unnamed-chunk-38

Interaction line plot

ggplot(ToothGrowth, aes(factor(dose), len, color = supp, group=supp)) + 
  stat_summary(fun.y = mean, geom="point", size=5) +
  stat_summary(fun.y=mean, geom="line", size=3)

plot of chunk unnamed-chunk-39

Violin plot

ggplot(mtcars, aes(cyl, qsec,  color = factor(cyl))) +
  stat_summary(fun.y = mean, geom = "point", size = 10) +
  geom_violin(alpha=0.2, fill = "black")

plot of chunk unnamed-chunk-40

Histograms

ggplot(mtcars, aes(qsec)) + geom_histogram()

plot of chunk unnamed-chunk-41

Density plot by cylinders

ggplot(mtcars, aes(qsec, fill = factor(cyl))) +
  geom_density(alpha = 0.5)

plot of chunk unnamed-chunk-42

Adjusting theme

p2 + theme_bw() +  xlab("Quarter Second Time") + theme(legend.position="none",panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.x = element_text(size=15,vjust=.3))

Faceting

p2 = ggplot(mtcars, aes(mpg, qsec, color=factor(cyl))) + geom_point(size =4) + geom_smooth(method=lm, se=F, size=2)
p2

plot of chunk unnamed-chunk-45

This is kind of a crazy plot

Faceting

plot of chunk unnamed-chunk-46

Faceting

p2 + facet_grid(. ~ cyl)

plot of chunk unnamed-chunk-47

You can do lots of cool stuff if you think about your mappings

ggplot(mtcars, aes(mpg, qsec, size=wt, color=wt)) + geom_point()

plot of chunk unnamed-chunk-48

More information

http://docs.ggplot2.org/current/