R for Behavioural Scientists: Mid-Winter Workshop

author: John A. E. Anderson date: February 5 2014

Not meant to be scary, just fun!

Why R?

Well…

  • It's free!
  • It's incredibly extensible (i.e. you can do things poor SPSS can only dream of)
  • It encourages reproducible research
  • It really underscores the idea that everything (really) is just a variant of the GLM (very apparent when you can run an anova using the lm command)

The RStudio Environment:

alt text (Note to self; demo this)

There are generally 4 panes:

  • The Script Editor (top left)
  • The Console (bottom left)
  • The Environment (containing your active variables) and
  • Your history (both top-right)
  • Finally the Files/plots/packages window in the bottom right (very useful!)

Some incredibly useful books... (well, two incredibly useful books)

alt text

  • This book is a great introduction to the R environment & is a more general resource (nice pictures)

alt text

  • This book is written by the beloved Andy Field (the man is a genius!) He's a psychologist & writes for psychologists, so everything is relevent

Installing Packages (For ME!!??)

  • Go packages and then install packages

alt text

Let's try installing ggplot2 (we'll get back to this package later)

Notice that the R code is also generated; it would be equally valid to type this

install.packages("ggplot2")

to load that package, you can either click the checkbox next to the package name, or you can enter the following:

library("ggplot2")

Loading & saving data

  • Data should be in .csv format (comma separated variable)
  • Variable names should not contain spaces, and should not start with irregular characters or numbers
#replace with your dataset
TimeOfDayPriming <- read.csv("C:/Users/JA/Desktop/TimeOfDayPriming.csv")
  • You now have an object which you can manipulate, subset, run analyses on etc.
  • Your original data (i.e. TimeOfDayPriming.csv) will not be touched - UNLESS you deliberately overwrite it

Loading & saving data

Just to prove I've actually loaded something ;) plot of chunk unnamed-chunk-4

Age is plotted by group - the beanplot reveals that the young group is bimodal (grad students vs. undergrads methinks?)

Loading & saving data

alt text the easy way

Loading & saving data

  • To save a dataframe,
write.table(TimeOfDayPriming, file = "test.csv",col.names = TRUE, sep = ",", row.names = TRUE)

The very Basics: R as an overpowered calculator

Try the following (in your console)

A <- 5690 + 69462

This should produce the following…

A <- 5690 + 69462
A
[1] 75152

Subtraction works just as you think it would…

5690 - 5452

This should produce the following…

[1] 238

The very Basics: R as an overpowered calculator

Try the following (in your console)

5690 * 69462

This should produce the following…

[1] 395238780

And division …

5690 / 5452
[1] 1.044

The Basics:

Try the following (in your console)

summary(cars)

This should produce the following…

     speed           dist    
 Min.   : 4.0   Min.   :  2  
 1st Qu.:12.0   1st Qu.: 26  
 Median :15.0   Median : 36  
 Mean   :15.4   Mean   : 43  
 3rd Qu.:19.0   3rd Qu.: 56  
 Max.   :25.0   Max.   :120  

summary is a base function built into R - you can use it on dataframes & other objects

The Basics: A dataframe (like Excel actually)

To see the “cars” dataframe simply call the dataframe in the console

cars

This should produce the following…

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

(I used the function head(cars) to just display the first couple of rows, you can do the same thing with tail to display the last few rows)

The Basics: selecting variables...

Ok I see there are two variables called speed and dist, how do I select one of them? (very easily actually)

cars$speed

This should produce the following…

 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25

Voila - the speed variable notice that this vector is plotted horizontally, this doesn't matter, the content is the same

The Basics: descriptive stats...

So, how to get the mean & standard deviation value of dist? it's very simple…

mean(cars$dist)
[1] 42.98
sd(cars$dist)
[1] 25.77

The Basics: Adding & Deleting a Column

Ocassionally you'll want to add a new column of variables: Let's take the cars - what if we want to add a new column that takes the square-root transform of distance?

carsNew <- cars
carsNew$SRDist <- sqrt(cars$dist)
head(carsNew)
  speed dist SRDist
1     4    2  1.414
2     4   10  3.162
3     7    4  2.000
4     7   22  4.690
5     8   16  4.000
6     9   10  3.162

The Basics: Adding & Deleting a Column

You'll note that applying functions to vectors in R is ridculously easy - as we just did with the square-root transform, you can do addition, subtraction, mutiplication or any other mathematics to each incidence in a vector.

The Basics: Adding & Deleting a Column II

To remove a column, simply use the null command

carsNew$dist <- NULL
head(carsNew)
  speed SRDist
1     4  1.414
2     4  3.162
3     7  2.000
4     7  4.690
5     8  4.000
6     9  3.162

The Basics: Adding & Deleting a Column III

We could also create an entirely new dataframe by selecting columns

NewDataFrame <- cbind(carsNew$SRDist, cars$dist)
NewDataFrame <- na.omit(NewDataFrame)
colnames(NewDataFrame) <- c('SRDist', 'originalDist')
 NewDataFrame=as.data.frame(NewDataFrame)
#this should give you some hint that we could just as easily say as.matrix - you'd be right!
head(NewDataFrame)
  SRDist originalDist
1  1.414            2
2  3.162           10
3  2.000            4
4  4.690           22
5  4.000           16
6  3.162           10

The Basics: Logicals

R is built so that things work better if you use logical indices rather than loops. Technically you could build a loop - but generally it will run more slowly and take up more memory.

install.packages("plyr", repos='http://probability.ca/cran/')
library(plyr)
SpeedGreaterThan20 <-subset(cars, cars$speed > 20)
mean(cars$speed)
[1] 15.4
mean(SpeedGreaterThan20$speed)
[1] 23.71

The Basics: Missing Values

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, NA, 8, 3, NA, 3)

We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values

mean(z)
[1] NA
mean(z,  na.rm=TRUE)
[1] 3.4

The Basics: Extreme Values & Winsorization

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, 100, 8, 3, -50, 3)

Let's get rid of these extreme values:

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
mean(z)
[1] 9.571
mean(remove_outliers(z), na.rm=TRUE)
[1] 3.4

The Basics: Extreme Values & Winsorization

Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error

z <- c(1, 2, 100, 8, 3, -50, 3)

We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values

library(psych)
mean(z)
[1] 9.571
mean(winsor(z))
[1] 3.486

The Basics: Extreme Values & Winsorization II

beanplot(z, winsor(z))

plot of chunk unnamed-chunk-33

You can also apply winsorization to individuals within groups using the plyr package

EC <- ddply(DC, .(Subject, Condition), function(Acc_RT) {
     transform(Acc_RT, RTwin = winsor(Acc_RT, trim = .1))
})

Slide With Plot(s)

The basic plot function in R is… well… very basic.

plot(cars)

plot of chunk unnamed-chunk-35

#boxplot(cars)
library(beanplot)
beanplot(cars)

plot of chunk unnamed-chunk-36

Slide With Plot(s) II

hist(cars$dist)

plot of chunk unnamed-chunk-37

d <- density(cars$dist) # returns the density data 
plot(d)

plot of chunk unnamed-chunk-38

GGPLOT (It's kinda amazing!)

plot of chunk unnamed-chunk-39

plot of chunk unnamed-chunk-40

Anatomy of a GGPLOT

library(ggthemes)
#faceted bar graph (festival)
bar <- ggplot(diamonds, aes(color, price));
bar + stat_summary(fun.y = mean, geom = "bar")+ stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2)+ theme_economist() 

plot of chunk unnamed-chunk-42

Anatomy of a GGPLOT II

setwd("C:/Users/JA/Desktop/Field Data Files")
examData <- read.delim("Exam Anxiety.dat", header = TRUE)
#grouped Scatter Graph
scatter <- ggplot(examData, aes(Anxiety, Exam, colour = Gender));
scatter + geom_point() + geom_smooth(method = "rlm", alpha = .1)+ labs(x = "Exam Anxiety", y = "Exam Performance %") + theme_wsj()

plot of chunk unnamed-chunk-44

Two groups; independent t-tests

Let's take a subset of two factors from a dataset called diamonds and run a t-test on price over the factors.

library(ggplot2)
TwoColDiamonds <- subset(diamonds, color =="E" | color =="I" )
ColorTtest <- t.test(price ~ color, data = TwoColDiamonds)

    Welch Two Sample t-test

data:  price by color
t = -27.8, df = 8486, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2157 -1873
sample estimates:
mean in group E mean in group I 
           3077            5092 

comparing 2 means t-tests; effect size

to calculate an effect size, simply access the t value and df value we have stored in ColorTtest

t <- ColorTtest$statistic[[1]]
df <- ColorTtest$parameter[[1]]
r <- sqrt(t^2/(t^2+df))
round(r,3)
[1] 0.289

Repeated Measures t-test

From Andy Field's book; Let's assume we have one group of arachnophobes who on two separate ocasions are exposed to either a picture or real spider.

picture <- c(30, 35, 45,40,50,35,55,25,30,45,40,50)
real <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50,60, 39)
Spider <- data.frame(picture,real)
dep.t.test <- t.test(Spider$real, Spider$picture, paired=TRUE)
dep.t.test

    Paired t-test

data:  Spider$real and Spider$picture
t = 2.473, df = 11, p-value = 0.03098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  0.7688 13.2312
sample estimates:
mean of the differences 
                      7 

basic linear models (regression)

linearModel <- lm(cars$dist ~ cars$speed)
summary(linearModel)

Call:
lm(formula = cars$dist ~ cars$speed)

Residuals:
   Min     1Q Median     3Q    Max 
-29.07  -9.53  -2.27   9.21  43.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.579      6.758   -2.60    0.012 *  
cars$speed     3.932      0.416    9.46  1.5e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.4 on 48 degrees of freedom
Multiple R-squared:  0.651, Adjusted R-squared:  0.644 
F-statistic: 89.6 on 1 and 48 DF,  p-value: 1.49e-12

basic linear models (regression)

we can also view the residuals from the model and assign that to a new object if we want

head(resid(linearModel))
     1      2      3      4      5      6 
 3.849 11.849 -5.948 12.052  2.120 -7.813 

basic linear models (regression) II

RegModel <- lm(cars$dist ~ cars$speed)
plot(cars)

plot of chunk unnamed-chunk-52


plot(cars$dist ~ cars$speed)
abline(RegModel)

plot of chunk unnamed-chunk-53

basic linear models (regression) III

library(MASS)
RegModel1 <- rlm(cars$dist ~ cars$speed)
plot(cars$dist ~ cars$speed)
abline(RegModel, col = "red")
abline(RegModel1, col = "green")

plot of chunk unnamed-chunk-54

here the robust model rlm is very similar to the regular lm model

basic linear models (as ANOVA)

here the assumption would be that the second variable would be a factor with more than 2 levels… but it still works as a continuous!

Of course, R uses type I sums of squares - the output won't look exactly like SPSS…

print(anova(linearModel))
Analysis of Variance Table

Response: cars$dist
           Df Sum Sq Mean Sq F value  Pr(>F)    
cars$speed  1  21185   21185    89.6 1.5e-12 ***
Residuals  48  11354     237                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

basic linear models (as ANOVA)

To correct the sums of squares, we need the car package, load it using install.packages(car)

library(car)
model <- aov(dist ~ speed, data=cars)
print(Anova(model, type="III"))
Anova Table (Type III tests)

Response: dist
            Sum Sq Df F value  Pr(>F)    
(Intercept)   1600  1    6.77   0.012 *  
speed        21185  1   89.57 1.5e-12 ***
Residuals    11354 48                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let's take a look at a dataset with actual categorical variables...

load the ggplot2 package if you haven't already; library(ggplot2)

library(ggplot2)
head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

you can also use View(diamonds) to view this in your RStudio browser ,,, ,

diamonds continued...

let's suppose we're interested in the interaction of cut and color on price…

plot of chunk unnamed-chunk-58

We can define the model as follows

DiamondModel <- lm (price ~ cut * color, data = diamonds)

diamond, continued...

DiamondModel <- aov(price ~ cut * color, data=diamonds)
print(Anova(DiamondModel, type="III"))
Anova Table (Type III tests)

Response: price
              Sum Sq    Df  F value Pr(>F)    
(Intercept) 3.76e+11     1 24713.00 <2e-16 ***
cut         8.79e+09     4   144.40 <2e-16 ***
color       9.46e+09     6   103.61 <2e-16 ***
cut:color   1.65e+09    24     4.53  1e-12 ***
Residuals   8.20e+11 53905                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Second ANOVA Example (By Andy Field);

Viagra <- read.delim("C:/Users/JA/Desktop/Field Data Files/Viagra.dat")
Viagra$dose<-factor(Viagra$dose,levels =c(1:3),labels = c("Placebo", "Low Dose", "High Dose"))
head(Viagra)
  person     dose libido
1      1  Placebo      3
2      2  Placebo      2
3      3  Placebo      1
4      4  Placebo      1
5      5  Placebo      4
6      6 Low Dose      5

A Second ANOVA Example (By Andy Field); II

#line graph (hiccups)
line <- ggplot(Viagra, aes(dose, libido))
line + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.y = mean, geom = "line", aes(group=1),colour = "Red", linetype = "dashed", size = 2)+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, size =2) + labs(x = "Dose of Viagra", y = "Libido") + theme_economist() +theme(text = element_text(size = 20))

plot of chunk unnamed-chunk-63

A Second ANOVA Example (By Andy Field); III

Descriptive statistics…

library(pastecs)
by(Viagra, Viagra$dose, stat.desc)
Viagra$dose: Placebo
              person dose  libido
nbr.val       5.0000   NA  5.0000
nbr.null      0.0000   NA  0.0000
nbr.na        0.0000   NA  0.0000
min           1.0000   NA  1.0000
max           5.0000   NA  4.0000
range         4.0000   NA  3.0000
sum          15.0000   NA 11.0000
median        3.0000   NA  2.0000
mean          3.0000   NA  2.2000
SE.mean       0.7071   NA  0.5831
CI.mean.0.95  1.9632   NA  1.6189
var           2.5000   NA  1.7000
std.dev       1.5811   NA  1.3038
coef.var      0.5270   NA  0.5927
-------------------------------------------------------- 
Viagra$dose: Low Dose
              person dose  libido
nbr.val       5.0000   NA  5.0000
nbr.null      0.0000   NA  0.0000
nbr.na        0.0000   NA  0.0000
min           6.0000   NA  2.0000
max          10.0000   NA  5.0000
range         4.0000   NA  3.0000
sum          40.0000   NA 16.0000
median        8.0000   NA  3.0000
mean          8.0000   NA  3.2000
SE.mean       0.7071   NA  0.5831
CI.mean.0.95  1.9632   NA  1.6189
var           2.5000   NA  1.7000
std.dev       1.5811   NA  1.3038
coef.var      0.1976   NA  0.4075
-------------------------------------------------------- 
Viagra$dose: High Dose
              person dose  libido
nbr.val       5.0000   NA  5.0000
nbr.null      0.0000   NA  0.0000
nbr.na        0.0000   NA  0.0000
min          11.0000   NA  3.0000
max          15.0000   NA  7.0000
range         4.0000   NA  4.0000
sum          65.0000   NA 25.0000
median       13.0000   NA  5.0000
mean         13.0000   NA  5.0000
SE.mean       0.7071   NA  0.7071
CI.mean.0.95  1.9632   NA  1.9632
var           2.5000   NA  2.5000
std.dev       1.5811   NA  1.5811
coef.var      0.1216   NA  0.3162

A Second ANOVA Example (By Andy Field); IV

Levene's test for homogeneity of variances

leveneTest(Viagra$libido,Viagra$dose, center = median)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2    0.12   0.89
      12               

Given the non-significance of this result, we can proceed confident in the knowlege that the variances are similar across all the levels of the factors

A Second ANOVA Example (By Andy Field); V

The ANOVA…

ViagraModel <- aov(libido ~ dose, data = Viagra)
summary(ViagraModel)
            Df Sum Sq Mean Sq F value Pr(>F)  
dose         2   20.1   10.07    5.12  0.025 *
Residuals   12   23.6    1.97                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Second ANOVA Example (By Andy Field); VI

Planned Contrasts

Viagra$dose <- as.factor(Viagra$dose)
contrast1 <- c(-2,1,1)
contrast2<-c(0,-1,1)
contrasts(Viagra$dose) <- cbind(contrast1,contrast2)
Viagra$dose
 [1] Placebo   Placebo   Placebo   Placebo   Placebo   Low Dose  Low Dose 
 [8] Low Dose  Low Dose  Low Dose  High Dose High Dose High Dose High Dose
[15] High Dose
attr(,"contrasts")
          contrast1 contrast2
Placebo          -2         0
Low Dose          1        -1
High Dose         1         1
Levels: Placebo Low Dose High Dose

A Second ANOVA Example (By Andy Field); VII

Planned Contrasts II

ViagraPlanned <-aov(libido ~dose, data =Viagra)
summary.lm(ViagraPlanned)

Call:
aov(formula = libido ~ dose, data = Viagra)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.467      0.362    9.57  5.7e-07 ***
dosecontrast1    0.633      0.256    2.47    0.029 *  
dosecontrast2    0.900      0.443    2.03    0.065 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 12 degrees of freedom
Multiple R-squared:  0.46,  Adjusted R-squared:  0.37 
F-statistic: 5.12 on 2 and 12 DF,  p-value: 0.0247

A Second ANOVA Example (By Andy Field); VIII

Trend Analyses

contrasts(Viagra$dose)<-contr.poly(3)
ViagraTrend <-aov(libido ~ dose, data = Viagra)
summary.lm(ViagraTrend)

Call:
aov(formula = libido ~ dose, data = Viagra)

Residuals:
   Min     1Q Median     3Q    Max 
  -2.0   -1.2   -0.2    0.9    2.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.467      0.362    9.57  5.7e-07 ***
dose.L         1.980      0.627    3.16   0.0083 ** 
dose.Q         0.327      0.627    0.52   0.6120    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 12 degrees of freedom
Multiple R-squared:  0.46,  Adjusted R-squared:  0.37 
F-statistic: 5.12 on 2 and 12 DF,  p-value: 0.0247

A Second ANOVA Example (By Andy Field); IX

Post-Hoc Tests

pairwise.t.test(Viagra$libido, Viagra$dose, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  Viagra$libido and Viagra$dose 

          Placebo Low Dose
Low Dose  0.845   -       
High Dose 0.025   0.196   

P value adjustment method: bonferroni 

Other cool things you can do in R:

  • Repeated measures ANOVAs (using the ez package)
  • Cluster analyses
  • PCA
  • Graph Analyses
  • Hierarchical linear modeling
  • Bayesian Statistics

The trick is just to find the right package & spend some time getting to know what it can do

Reproducibility: Scripts in R

Error in parse(text = x, srcfile = src) : <text>:1:9: unexpected symbol
1: Rscript Flanker_example.R
            ^