R for Psychologists 3: Linear models, contrast codes & plotting

Alejandro de la Vega
10/8/2013

Review

Last time we covered:

  • How to join data frames using common identifier
    • join(d1, d2, by= c(“id”))
  • How to reshape your data
    • melt & cast
  • Split-Apply-Combine problems
    • ddply & dlply

Basic statistics with R

  • Usually, if you can, best to use a linear model
  • But often you have to report “standard” correlation or t-test
  • Usage of t.tests, cor.test
  • Correlation matrix

T-test

  • If you have wide format data
  • T.test expects 1 or 2 vectors
t.test(french_fries$potato, french_fries$grassy)

    Welch Two Sample t-test

data:  french_fries$potato and french_fries$grassy 
t = 43.4, df = 879, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 6.004 6.573 
sample estimates:
mean of x mean of y 
   6.9525    0.6642 

Formula notation for long format

t.test(extra~group, data=sleep)

    Welch Two Sample t-test

data:  extra by group 
t = -1.861, df = 17.78, p-value = 0.07939
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.3655  0.2055 
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 

Paired t-test

t.test(extra~group, data=sleep, paired=TRUE)

    Paired t-test

data:  extra by group 
t = -4.062, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -2.4599 -0.7001 
sample estimates:
mean of the differences 
                  -1.58 

Correlation test is very similar

cor.test(french_fries$potato, french_fries$grassy)

    Pearson's product-moment correlation

data:  french_fries$potato and french_fries$grassy 
t = 3.434, df = 693, p-value = 0.0006295
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.05553 0.20179 
sample estimates:
   cor 
0.1294 

Formula notation

cor.test(~potato + grassy, data=french_fries)

    Pearson's product-moment correlation

data:  potato and grassy 
t = 3.434, df = 693, p-value = 0.0006295
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.05553 0.20179 
sample estimates:
   cor 
0.1294 

These are all objects!

x <- cor.test(~potato + grassy, data=french_fries)
str(x)
List of 9
 $ statistic  : Named num 3.43
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 693
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.000629
 $ estimate   : Named num 0.129
  ..- attr(*, "names")= chr "cor"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "correlation"
 $ alternative: chr "two.sided"
 $ method     : chr "Pearson's product-moment correlation"
 $ data.name  : chr "potato and grassy"
 $ conf.int   : atomic [1:2] 0.0555 0.2018
  ..- attr(*, "conf.level")= num 0.95
 - attr(*, "class")= chr "htest"

Easy to pull out information

x$statistic
    t 
3.434 
x$p.value
[1] 0.0006295
x$estimate
   cor 
0.1294 

Correlation matrix (Hmisc package)

rcorr(as.matrix(fries))
        grassy potato buttery
grassy    1.00   0.13    0.10
potato    0.13   1.00    0.41
buttery   0.10   0.41    1.00

n
        grassy potato buttery
grassy     696    695     692
potato     695    696     692
buttery    692    692     696

P
        grassy potato buttery
grassy         0.0006 0.0092 
potato  0.0006        0.0000 
buttery 0.0092 0.0000        

Linear modeling

  • Can express almost anything in a linear model
m <- lm(choice~condition, data=trustData)
m

Call:
lm(formula = choice ~ condition, data = trustData)

Coefficients:
(Intercept)   conditionT   conditionU  
     0.7061      -0.0249      -0.3137  
  • Remember: factor order matters!
levels(trustData$condition)
[1] "N" "T" "U"

Summary

  • Summary function gives you additional (necessary) information
summary(m)

Call:
lm(formula = choice ~ condition, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.706 -0.681  0.294  0.319  0.608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7061     0.0298   23.72  < 2e-16 ***
conditionT   -0.0249     0.0474   -0.53      0.6    
conditionU   -0.3137     0.0603   -5.20  2.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.466 on 481 degrees of freedom
Multiple R-squared: 0.0557, Adjusted R-squared: 0.0518 
F-statistic: 14.2 on 2 and 481 DF,  p-value: 1.02e-06 

Changing family type - GLM

summary(glm(choice~condition, family=binomial, data=trustData))

Call:
glm(formula = choice ~ condition, family = binomial, data = trustData)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.565  -1.512   0.834   0.876   1.368  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.877      0.140    6.25  4.1e-10 ***
conditionT    -0.117      0.220   -0.53     0.59    
conditionU    -1.314      0.270   -4.87  1.1e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 628.69  on 483  degrees of freedom
Residual deviance: 602.86  on 481  degrees of freedom
AIC: 608.9

Number of Fisher Scoring iterations: 4

Remember, it's an object

  • type name and press “tab” or use str() to see contents
sum_m$coefficients
            Estimate Std. Error z value  Pr(>|z|)
(Intercept)   0.8766     0.1402   6.251 4.089e-10
conditionT   -0.1171     0.2201  -0.532 5.947e-01
conditionU   -1.3138     0.2697  -4.871 1.112e-06
sum_m$aic
[1] 608.9

Contrast coding

  • Using factors is fine, but order matters!
  • Better to set up deliberate models using contrast codes
  • Instead of using ad-hoc methods like Tukey's
  • recode in car package

Recode (car package)

  • Syntax is pretty clunky unfortunately
  • Allows you to specify value in new column for each level
  • Sort of like transform

  • Problem: Want to code linearly Untrust > Neutral > Trust

    sub condition value delay choice
415   1         U    30    14      1
114   3         N    22    21      1
363   9         T    14    14      1
475  10         U    18    14      0
448   6         U    14    90      0
405  10         T    22    42      1

Recode usage

  • Similar to transform but returns a vector
trustData$linear <- recode(trustData$condition, 
  "'N' = 0; 'U' = -1; 'T' = 1", as.factor.result=F)
    sub condition value delay choice linear
415   1         U    30    14      1     -1
114   3         N    22    21      1      0
363   9         T    14    14      1      1
475  10         U    18    14      0     -1
448   6         U    14    90      0     -1
405  10         T    22    42      1      1

Recode + lm

  • Test linear effect of condition on behavior
summary(lm(choice~linear, data=trustData))

Call:
lm(formula = choice ~ linear, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.741 -0.628  0.259  0.372  0.486 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6277     0.0221   28.38  < 2e-16 ***
linear        0.1136     0.0315    3.61  0.00034 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.473 on 482 degrees of freedom
Multiple R-squared: 0.0263, Adjusted R-squared: 0.0243 
F-statistic:   13 on 1 and 482 DF,  p-value: 0.000338 

Reminder: ddply for summary stats

ddply(trustData,.(linear), summarize, mean = mean(choice))
  linear   mean
1     -1 0.3924
2      0 0.7061
3      1 0.6813

Untrustworthy vs other conditions

trustData$UvO <- recode(trustData$condition,
  "'N' = 1; 'U' = -2; 'T' = 1", as.factor.result=F)
summary(lm(choice~UvO, data=trustData))

Call:
lm(formula = choice ~ UvO, data = trustData)

Residuals:
   Min     1Q Median     3Q    Max 
-0.696 -0.696  0.304  0.304  0.608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5950     0.0233   25.54  < 2e-16 ***
UvO           0.1013     0.0191    5.31  1.7e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.466 on 482 degrees of freedom
Multiple R-squared: 0.0552, Adjusted R-squared: 0.0532 
F-statistic: 28.2 on 1 and 482 DF,  p-value: 1.7e-07 

Repeated measures - ANOVA

  • “Easy” way to do it is with ANOVA function
  • Error - treatments nested within subjects
summary(aov(choice ~ linear + Error(sub/linear), data=trustData))

Error: sub
       Df Sum Sq Mean Sq
linear  1   2.27    2.27

Error: sub:linear
       Df Sum Sq Mean Sq
linear  1   6.49    6.49

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)  
linear      1    0.9   0.919    4.37  0.037 *
Residuals 480  100.9   0.210                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Issues with ANOVA

  • Too many for me to go into
  • Inflexible in specification of model
    • Can't treat stimuli as random
  • Can't take “family”
  • Read: “Treating stimuli as a random factor in social psychology: a new and comprehensive solution to a pervasive but largely ignored problem.” by Judd, Westfall & Kenny in J Pers. Soc Psych. PubMed

How to use lme (nlme package)

summary(lme(choice ~ linear, random = ~ 1 | sub, trustData))
Linear mixed-effects model fit by REML
 Data: trustData 
    AIC   BIC logLik
  648.8 665.5 -320.4

Random effects:
 Formula: ~1 | sub
        (Intercept) Residual
StdDev:      0.1339   0.4579

Fixed effects: choice ~ linear 
             Value Std.Error  DF t-value p-value
(Intercept) 0.6168   0.04757 473  12.968       0
linear      0.1830   0.03609 473   5.071       0
 Correlation: 
       (Intr)
linear -0.125

Standardized Within-Group Residuals:
    Min      Q1     Med      Q3     Max 
-1.8012 -1.1688  0.5713  0.7159  1.5473 

Number of Observations: 484
Number of Groups: 10 

lmer (lme4)

library(lme4)
summary(lmer(choice ~ linear + (1 | sub), family = binomial, trustData))
Generalized linear mixed model fit by the Laplace approximation 
Formula: choice ~ linear + (1 | sub) 
   Data: trustData 
 AIC BIC logLik deviance
 609 621   -301      603
Random effects:
 Groups Name        Variance Std.Dev.
 sub    (Intercept) 0.312    0.559   
Number of obs: 484, groups: sub, 10

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.520      0.204    2.55    0.011 *  
linear         0.781      0.166    4.70  2.6e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Correlation of Fixed Effects:
       (Intr)
linear -0.113

Plotting!

  • Plotting should be part of your analysis pipeline
  • Should be interleaved with analysis, not just left to the end
  • Plots should update themselves!
  • Make sure to write plots into analysis script
  • Professional and high res

Base R plotting

plot(mtcars$mpg, mtcars$qsec)

plot of chunk unnamed-chunk-27

  • Unfortunately, not the most aesthetic
  • Adding “layers” (e.g. lines) can be unintuitive
  • Some disagree and its probably worth learning some of the basics
  • But ggplots2 makes advanced plotting much easier

ggplot2 philosophy

  • “Grammar of graphics”
  • First, map data
  • Then add layers to express the data
    • geometry
    • statistical transformations
    • scales
    • coordinate systems
    • (positioning, adjustments)

Mapping data

  • Map one part of data to X, and one part to Y
library(ggplot2)
plot1 <- ggplot(mtcars, aes(mpg, qsec))
  • As of yet incomplete because no geometry specified

Adding geometry

plot1 + geom_point()

plot of chunk unnamed-chunk-29

  • Scatterplot

    • Geometry is points

    Defaults:

    • Statistic: Identity (i.e. value)
    • Scale: linear
    • Coordinate system: Cartesian

Line plot

plot1 + geom_line()

plot of chunk unnamed-chunk-30

  • Hmm… connecting each point with line
  • Because line plot is not just a geometry but also…
  • Statistic: linear fit

Line of best fit

ggplot(mtcars, aes(mpg, qsec)) + geom_smooth(method=lm)

plot of chunk unnamed-chunk-31

  • stat_smooth is actually a combination of linear model statistic with a line geomtry
  • specified lm but could also use another function (such as rlm or loess smooth fit)

Putting layers together

ggplot(mtcars, aes(mpg, qsec)) +
  geom_smooth(method = lm, se = F, size = 2) +
  geom_point(size = 5)

plot of chunk unnamed-chunk-32

Mapping color & group

ggplot(mtcars, aes(mpg, qsec, color=factor(cyl))) +
  geom_point(size = 5)

plot of chunk unnamed-chunk-33

  • Color mapped to variable

Bar plot data mapping

  • What is different about the mapping between a correlation plot and a bar plot? plot of chunk unnamed-chunk-34

Bar plot - summary

ggplot(mtcars, aes(cyl, qsec)) + stat_summary(fun.y = mean, geom = "bar")

plot of chunk unnamed-chunk-35

  • Mapping height to value
    • Split by cateogrical variable
    • Take mean like ddply
  • Geometry: geom_bar
  • Statistic: “mean” + identity

Bar plot - fill & position = another variable

ggplot(mtcars, aes(cyl, qsec, fill = factor(gear))) + stat_summary(fun.y = mean, geom = "bar", position = "dodge")

plot of chunk unnamed-chunk-36

Bar plot with error bars (SE)

ggplot(mtcars, aes(cyl, qsec, fill=factor(cyl))) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3)

plot of chunk unnamed-chunk-37

Dot plot with error bars (SE)

ggplot(mtcars, aes(cyl, qsec, color = factor(cyl))) +
  stat_summary(fun.y = mean, geom = "point", size = 10) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange", size = 3)

plot of chunk unnamed-chunk-38

Interaction line plot

ggplot(ToothGrowth, aes(factor(dose), len, color = supp, group=supp)) + 
  stat_summary(fun.y = mean, geom="point", size=5) +
  stat_summary(fun.y=mean, geom="line", size=3)

plot of chunk unnamed-chunk-39

Violin plot

ggplot(mtcars, aes(cyl, qsec,  color = factor(cyl))) +
  stat_summary(fun.y = mean, geom = "point", size = 10) +
  geom_violin(alpha=0.2, fill = "black")

plot of chunk unnamed-chunk-40

Histograms

ggplot(mtcars, aes(qsec)) + geom_histogram()

plot of chunk unnamed-chunk-41

Density plot by cylinders

ggplot(mtcars, aes(qsec, fill = factor(cyl))) +
  geom_density(alpha = 0.5)

plot of chunk unnamed-chunk-42

Adjusting theme

p2 + theme_bw() +  xlab("Quarter Second Time") + theme(legend.position="none",panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.x = element_text(size=15,vjust=.3))

plot of chunk unnamed-chunk-44

Faceting

p2 = ggplot(mtcars, aes(mpg, qsec, color=factor(cyl))) + geom_point(size =4) + geom_smooth(method=lm, se=F, size=2)
p2

plot of chunk unnamed-chunk-45

  • This is kind of a crazy plot

Faceting

plot of chunk unnamed-chunk-46

Faceting

p2 + facet_grid(. ~ cyl)

plot of chunk unnamed-chunk-47

You can do lots of cool stuff if you think about your mappings

ggplot(mtcars, aes(mpg, qsec, size=wt, color=wt)) + geom_point()

plot of chunk unnamed-chunk-48

More information