R for Behavioural Scientists: Mid-Winter Workshop

author: John A. E. Anderson date: February 5 2014

Not meant to be scary, just fun!

Why R?

Well…

  • It's free!
  • It's incredibly extensible (i.e. you can do things poor SPSS can only dream of)
  • It encourages reproducible research
  • It really underscores the idea that everything (really) is just a variant of the GLM (very apparent when you can run an anova using the lm command)

The RStudio Environment:

alt text (Note to self; demo this)

There are generally 4 panes:

  • The Script Editor (top left)
  • The Console (bottom left)
  • The Environment (containing your active variables) and
  • Your history (both top-right)
  • Finally the Files/plots/packages window in the bottom right (very useful!)

Some incredibly useful books... (well, two incredibly useful books)

alt text

  • This book is a great introduction to the R environment & is a more general resource (nice pictures)

alt text

  • This book is written by the beloved Andy Field (the man is a genius!) He's a psychologist & writes for psychologists, so everything is relevent

Installing Packages (For ME!!??)

  • Go packages and then install packages

alt text

Let's try installing ggplot2 (we'll get back to this package later)

Notice that the R code is also generated; it would be equally valid to type this

install.packages("ggplot2")

to load that package, you can either click the checkbox next to the package name, or you can enter the following:

library("ggplot2")

Loading & saving data

  • Data should be in .csv format (comma separated variable)
  • Variable names should not contain spaces, and should not start with irregular characters or numbers
#replace with your dataset
TimeOfDayPriming <- read.csv("/Users/John/Dropbox/R_Course 2/TimeOfDayPriming.csv")
  • You now have an object which you can manipulate, subset, run analyses on etc.
  • Your original data (i.e. TimeOfDayPriming.csv) will not be touched - UNLESS you deliberately overwrite it

Loading & saving data

Just to prove I've actually loaded something ;) plot of chunk unnamed-chunk-4

Age is plotted by group - the beanplot reveals that the young group is bimodal (grad students vs. undergrads methinks?)

Loading & saving data

alt text the easy way

Loading & saving data

  • To save a dataframe,
write.table(TimeOfDayPriming, file = "test.csv",col.names = TRUE, sep = ",", row.names = TRUE)

The very Basics: R as an overpowered calculator

Try the following (in your console)

A <- 5690 + 69462

This should produce the following…

A <- 5690 + 69462
A
[1] 75152

Subtraction works just as you think it would…

5690 - 5452

This should produce the following…

[1] 238

The very Basics: R as an overpowered calculator

Try the following (in your console)

5690 * 69462

This should produce the following…

[1] 395238780

And division …

5690 / 5452
[1] 1.043654

The Basics:

Try the following (in your console)

summary(cars)

This should produce the following…

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

summary is a base function built into R - you can use it on dataframes & other objects

The Basics: A dataframe (like Excel actually)

To see the “cars” dataframe simply call the dataframe in the console

cars

This should produce the following…

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

(I used the function head(cars) to just display the first couple of rows, you can do the same thing with tail to display the last few rows)

The Basics: selecting variables...

Ok I see there are two variables called speed and dist, how do I select one of them? (very easily actually)

cars$speed

This should produce the following…

 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25

Voila - the speed variable notice that this vector is plotted horizontally, this doesn't matter, the content is the same

The Basics: descriptive stats...

So, how to get the mean & standard deviation value of dist? it's very simple…

mean(cars$dist)
[1] 42.98
sd(cars$dist)
[1] 25.76938

The Basics: Adding & Deleting a Column

Ocassionally you'll want to add a new column of variables: Let's take the cars - what if we want to add a new column that takes the square-root transform of distance?

carsNew <- cars
carsNew$SRDist <- sqrt(cars$dist)
head(carsNew)
  speed dist   SRDist
1     4    2 1.414214
2     4   10 3.162278
3     7    4 2.000000
4     7   22 4.690416
5     8   16 4.000000
6     9   10 3.162278

The Basics: Adding & Deleting a Column

You'll note that applying functions to vectors in R is ridculously easy - as we just did with the square-root transform, you can do addition, subtraction, mutiplication or any other mathematics to each incidence in a vector.

The Basics: Adding & Deleting a Column II

To remove a column, simply use the null command

carsNew$dist <- NULL
head(carsNew)
  speed   SRDist
1     4 1.414214
2     4 3.162278
3     7 2.000000
4     7 4.690416
5     8 4.000000
6     9 3.162278

The Basics: Adding & Deleting a Column III

We could also create an entirely new dataframe by selecting columns

NewDataFrame <- cbind(carsNew$SRDist, cars$dist)
NewDataFrame <- na.omit(NewDataFrame)
colnames(NewDataFrame) <- c('SRDist', 'originalDist')
 NewDataFrame=as.data.frame(NewDataFrame)
#this should give you some hint that we could just as easily say as.matrix - you'd be right!
head(NewDataFrame)
    SRDist originalDist
1 1.414214            2
2 3.162278           10
3 2.000000            4
4 4.690416           22
5 4.000000           16
6 3.162278           10

The Basics: Logicals

R is built so that things work better if you use logical indices rather than loops. Technically you could build a loop - but generally it will run more slowly and take up more memory.

install.packages("plyr", repos='http://probability.ca/cran/')
library(plyr)
SpeedGreaterThan20 <-subset(cars, cars$speed > 20)
mean(cars$speed)
[1] 15.4
mean(SpeedGreaterThan20$speed)
[1] 23.71429

The Basics: Missing Values

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, NA, 8, 3, NA, 3)

We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values

mean(z)
[1] NA
mean(z,  na.rm=TRUE)
[1] 3.4

The Basics: Extreme Values & Winsorization

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, 100, 8, 3, -50, 3)

Let's get rid of these extreme values:

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
mean(z)
[1] 9.571429
mean(remove_outliers(z), na.rm=TRUE)
[1] 3.4

The Basics: Extreme Values & Winsorization

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, 100, 8, 3, -50, 3)

We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values

library(psych)
mean(z)
[1] 9.571429
mean(winsor(z))
[1] 3.485714

The Basics: Extreme Values & Winsorization II

beanplot(z, winsor(z))

plot of chunk unnamed-chunk-33

You can also apply winsorization to individuals within groups using the plyr package

EC <- ddply(DC, .(Subject, Condition), function(Acc_RT) {
     transform(Acc_RT, RTwin = winsor(Acc_RT, trim = .1))
})

Slide With Plot(s)

The basic plot function in R is… well… very basic.

plot(cars)

plot of chunk unnamed-chunk-35

#boxplot(cars)
library(beanplot)
beanplot(cars)

plot of chunk unnamed-chunk-36

Slide With Plot(s) II

hist(cars$dist)

plot of chunk unnamed-chunk-37

d <- density(cars$dist) # returns the density data 
plot(d)

plot of chunk unnamed-chunk-38

GGPLOT (It's kinda amazing!)

plot of chunk unnamed-chunk-39

plot of chunk unnamed-chunk-40

Anatomy of a GGPLOT

library(ggthemes)
#faceted bar graph (festival)
bar <- ggplot(diamonds, aes(color, price));
bar + stat_summary(fun.y = mean, geom = "bar")+ stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2)+ theme_economist() 

plot of chunk unnamed-chunk-42

Anatomy of a GGPLOT II

setwd("/Users/John/Dropbox/R_Course 2/Field Data Files")
examData <- read.delim("Exam Anxiety.dat", header = TRUE)
#grouped Scatter Graph
scatter <- ggplot(examData, aes(Anxiety, Exam, colour = Gender));
scatter + geom_point() + geom_smooth(method = "rlm", alpha = .1)+ labs(x = "Exam Anxiety", y = "Exam Performance %") + theme_wsj()

plot of chunk unnamed-chunk-44

Two groups; independent t-tests

Let's take a subset of two factors from a dataset called diamonds and run a t-test on price over the factors.

library(ggplot2)
TwoColDiamonds <- subset(diamonds, color =="E" | color =="I" )
ColorTtest <- t.test(price ~ color, data = TwoColDiamonds)

    Welch Two Sample t-test

data:  price by color
t = -27.799, df = 8485.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2157.217 -1873.028
sample estimates:
mean in group E mean in group I 
       3076.752        5091.875 

comparing 2 means t-tests; effect size

to calculate an effect size, simply access the t value and df value we have stored in ColorTtest

t <- ColorTtest$statistic[[1]]
df <- ColorTtest$parameter[[1]]
r <- sqrt(t^2/(t^2+df))
round(r,3)
[1] 0.289

Repeated Measures t-test

From Andy Field's book; Let's assume we have one group of arachnophobes who on two separate ocasions are exposed to either a picture or real spider.

picture <- c(30, 35, 45,40,50,35,55,25,30,45,40,50)
real <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50,60, 39)
Spider <- data.frame(picture,real)
dep.t.test <- t.test(Spider$real, Spider$picture, paired=TRUE)
dep.t.test

    Paired t-test

data:  Spider$real and Spider$picture
t = 2.4725, df = 11, p-value = 0.03098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.7687815 13.2312185
sample estimates:
mean of the differences 
                      7 

basic linear models (regression)

linearModel <- lm(cars$dist ~ cars$speed)
summary(linearModel)

Call:
lm(formula = cars$dist ~ cars$speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

basic linear models (regression)

we can also view the residuals from the model and assign that to a new object if we want

head(resid(linearModel))
        1         2         3         4         5         6 
 3.849460 11.849460 -5.947766 12.052234  2.119825 -7.812584 

basic linear models (regression) II

RegModel <- lm(cars$dist ~ cars$speed)
plot(cars)

plot of chunk unnamed-chunk-52

plot(cars$dist ~ cars$speed)
abline(RegModel)

plot of chunk unnamed-chunk-53

basic linear models (regression) III

library(MASS)
RegModel1 <- rlm(cars$dist ~ cars$speed)
plot(cars$dist ~ cars$speed)
abline(RegModel, col = "red")
abline(RegModel1, col = "green")

plot of chunk unnamed-chunk-54 here the robust model rlm is very similar to the regular lm model

basic linear models (as ANOVA)

here the assumption would be that the second variable would be a factor with more than 2 levels… but it still works as a continuous!

Of course, R uses type I sums of squares - the output won't look exactly like SPSS…

print(anova(linearModel))
Analysis of Variance Table

Response: cars$dist
           Df Sum Sq Mean Sq F value   Pr(>F)    
cars$speed  1  21186 21185.5  89.567 1.49e-12 ***
Residuals  48  11354   236.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

basic linear models (as ANOVA)

To correct the sums of squares, we need the car package, load it using install.packages(car)

library(car)
model <- aov(dist ~ speed, data=cars)
print(Anova(model, type="III"))
Anova Table (Type III tests)

Response: dist
             Sum Sq Df F value   Pr(>F)    
(Intercept)  1600.3  1  6.7655  0.01232 *  
speed       21185.5  1 89.5671 1.49e-12 ***
Residuals   11353.5 48                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let's take a look at a dataset with actual categorical variables...

load the ggplot2 package if you haven't already; library(ggplot2)

library(ggplot2)
head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

you can also use View(diamonds) to view this in your RStudio browser ,,, ,

diamonds continued...

let's suppose we're interested in the interaction of cut and color on price…

plot of chunk unnamed-chunk-58

We can define the model as follows

DiamondModel <- lm (price ~ cut * color, data = diamonds)

diamond, continued...

DiamondModel <- aov(price ~ cut * color, data=diamonds)
print(Anova(DiamondModel, type="III"))
Anova Table (Type III tests)

Response: price
                Sum Sq    Df    F value    Pr(>F)    
(Intercept) 3.7606e+11     1 24713.0009 < 2.2e-16 ***
cut         8.7891e+09     4   144.3969 < 2.2e-16 ***
color       9.4602e+09     6   103.6142 < 2.2e-16 ***
cut:color   1.6535e+09    24     4.5274 1.001e-12 ***
Residuals   8.2027e+11 53905                         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Second ANOVA Example (By Andy Field);

Viagra <- read.delim("/Users/John/Dropbox/R_Course 2/Field Data Files/Viagra.dat")
Viagra$dose<-factor(Viagra$dose,levels =c(1:3),labels = c("Placebo", "Low Dose", "High Dose"))
head(Viagra)
  person     dose libido
1      1  Placebo      3
2      2  Placebo      2
3      3  Placebo      1
4      4  Placebo      1
5      5  Placebo      4
6      6 Low Dose      5

A Second ANOVA Example (By Andy Field); II

#line graph (hiccups)
line <- ggplot(Viagra, aes(dose, libido))
line + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.y = mean, geom = "line", aes(group=1),colour = "Red", linetype = "dashed", size = 2)+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, size =2) + labs(x = "Dose of Viagra", y = "Libido") + theme_economist() +theme(text = element_text(size = 20))

plot of chunk unnamed-chunk-63

A Second ANOVA Example (By Andy Field); III

Descriptive statistics…

library(pastecs)
by(Viagra, Viagra$dose, stat.desc)
Viagra$dose: Placebo
                 person dose     libido
nbr.val       5.0000000   NA  5.0000000
nbr.null      0.0000000   NA  0.0000000
nbr.na        0.0000000   NA  0.0000000
min           1.0000000   NA  1.0000000
max           5.0000000   NA  4.0000000
range         4.0000000   NA  3.0000000
sum          15.0000000   NA 11.0000000
median        3.0000000   NA  2.0000000
mean          3.0000000   NA  2.2000000
SE.mean       0.7071068   NA  0.5830952
CI.mean.0.95  1.9632432   NA  1.6189318
var           2.5000000   NA  1.7000000
std.dev       1.5811388   NA  1.3038405
coef.var      0.5270463   NA  0.5926548
-------------------------------------------------------- 
Viagra$dose: Low Dose
                 person dose     libido
nbr.val       5.0000000   NA  5.0000000
nbr.null      0.0000000   NA  0.0000000
nbr.na        0.0000000   NA  0.0000000
min           6.0000000   NA  2.0000000
max          10.0000000   NA  5.0000000
range         4.0000000   NA  3.0000000
sum          40.0000000   NA 16.0000000
median        8.0000000   NA  3.0000000
mean          8.0000000   NA  3.2000000
SE.mean       0.7071068   NA  0.5830952
CI.mean.0.95  1.9632432   NA  1.6189318
var           2.5000000   NA  1.7000000
std.dev       1.5811388   NA  1.3038405
coef.var      0.1976424   NA  0.4074502
-------------------------------------------------------- 
Viagra$dose: High Dose
                 person dose     libido
nbr.val       5.0000000   NA  5.0000000
nbr.null      0.0000000   NA  0.0000000
nbr.na        0.0000000   NA  0.0000000
min          11.0000000   NA  3.0000000
max          15.0000000   NA  7.0000000
range         4.0000000   NA  4.0000000
sum          65.0000000   NA 25.0000000
median       13.0000000   NA  5.0000000
mean         13.0000000   NA  5.0000000
SE.mean       0.7071068   NA  0.7071068
CI.mean.0.95  1.9632432   NA  1.9632432
var           2.5000000   NA  2.5000000
std.dev       1.5811388   NA  1.5811388
coef.var      0.1216261   NA  0.3162278

A Second ANOVA Example (By Andy Field); IV

Levene's test for homogeneity of variances

leveneTest(Viagra$libido,Viagra$dose, center = median)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  0.1176   0.89
      12               

Given the non-significance of this result, we can proceed confident in the knowlege that the variances are similar across all the levels of the factors

A Second ANOVA Example (By Andy Field); V

The ANOVA…

ViagraModel <- aov(libido ~ dose, data = Viagra)
summary(ViagraModel)
            Df Sum Sq Mean Sq F value Pr(>F)  
dose         2  20.13  10.067   5.119 0.0247 *
Residuals   12  23.60   1.967                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Second ANOVA Example (By Andy Field); VI

Planned Contrasts

Viagra$dose <- as.factor(Viagra$dose)
contrast1 <- c(-2,1,1)
contrast2<-c(0,-1,1)
contrasts(Viagra$dose) <- cbind(contrast1,contrast2)
Viagra$dose
 [1] Placebo   Placebo   Placebo   Placebo   Placebo   Low Dose  Low Dose 
 [8] Low Dose  Low Dose  Low Dose  High Dose High Dose High Dose High Dose
[15] High Dose
attr(,"contrasts")
          contrast1 contrast2
Placebo          -2         0
Low Dose          1        -1
High Dose         1         1
Levels: Placebo Low Dose High Dose

A Second ANOVA Example (By Andy Field); VII

Planned Contrasts II

ViagraPlanned <-aov(libido ~dose, data =Viagra)
summary.lm(ViagraPlanned)

Call:
aov(formula = libido ~ dose, data = Viagra)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.4667     0.3621   9.574 5.72e-07 ***
dosecontrast1   0.6333     0.2560   2.474   0.0293 *  
dosecontrast2   0.9000     0.4435   2.029   0.0652 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared:  0.4604,    Adjusted R-squared:  0.3704 
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

A Second ANOVA Example (By Andy Field); VIII

Trend Analyses

contrasts(Viagra$dose)<-contr.poly(3)
ViagraTrend <-aov(libido ~ dose, data = Viagra)
summary.lm(ViagraTrend)

Call:
aov(formula = libido ~ dose, data = Viagra)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.4667     0.3621   9.574 5.72e-07 ***
dose.L        1.9799     0.6272   3.157  0.00827 ** 
dose.Q        0.3266     0.6272   0.521  0.61201    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared:  0.4604,    Adjusted R-squared:  0.3704 
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

A Second ANOVA Example (By Andy Field); IX

Post-Hoc Tests

pairwise.t.test(Viagra$libido, Viagra$dose, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  Viagra$libido and Viagra$dose 

          Placebo Low Dose
Low Dose  0.845   -       
High Dose 0.025   0.196   

P value adjustment method: bonferroni 

Other cool things you can do in R:

  • Repeated measures ANOVAs (using the ez package)
  • Cluster analyses
  • PCA
  • Graph Analyses
  • Hierarchical linear modeling
  • Bayesian Statistics

The trick is just to find the right package & spend some time getting to know what it can do

Reproducibility: Scripts in R

setwd("/Users/John/Dropbox/R_Course 2/")
r -f Flanker_example.R
#Rscript Flanker_example.R

These commands allow you to call upon scripts you've already written & execute them. You can even call scripts from scripts, or scripts from a linux shell & pass the results to a different program…

Reproducibility II: Scripts in R

Reproducibility III: Version Control with Git