author: John A. E. Anderson
date: February 5, 2014
Not meant to be scary, just fun!
Well…
(Note to self; demo this)
There are generally 4 panes in RStudio: the script editor, the console, the environment/history pane, and the files/plots/packages/help pane.
Let's try installing ggplot2 (we'll get back to this package later)
Notice that the R code is also generated; it would be equally valid to type this
install.packages("ggplot2")
To load that package, you can either click the checkbox next to the package name, or enter the following:
library("ggplot2")
#replace with your dataset
TimeOfDayPriming <- read.csv("/Users/John/Dropbox/R_Course 2/TimeOfDayPriming.csv")
Just to prove I've actually loaded something ;)
Age is plotted by group - the beanplot reveals that the young group is bimodal (grad students vs. undergrads methinks?)
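If you'd like to reproduce that kind of plot yourself, here is a minimal sketch - Age and Group are assumed (hypothetical) column names in TimeOfDayPriming:
library(beanplot)
beanplot(Age ~ Group, data = TimeOfDayPriming, xlab = "Group", ylab = "Age")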
the easy way
write.table(TimeOfDayPriming, file = "test.csv", col.names = TRUE, sep = ",", row.names = TRUE)
Try the following (in your console)
A <- 5690 + 69462
This should produce the following…
A <- 5690 + 69462
A
[1] 75152
Subtraction works just as you think it would…
5690 - 5452
This should produce the following…
[1] 238
Try the following (in your console)
5690 * 69462
This should produce the following…
[1] 395238780
And division …
5690 / 5452
[1] 1.043654
Try the following (in your console)
summary(cars)
This should produce the following…
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
summary is a base function built into R - you can use it on dataframes & other objects
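For example, it works just as happily on a single column or on a factor:
summary(cars$speed)               #numeric vector: min, quartiles, mean, max
summary(factor(c("a", "a", "b"))) #factor: a count for each level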
To see the “cars” dataframe, simply call it in the console:
cars
This should produce the following…
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
(I used the function head(cars) to display just the first few rows; you can do the same thing with tail(cars) to display the last few rows.)
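Try it yourself:
head(cars)     #first 6 rows
tail(cars)     #last 6 rows
head(cars, 3)  #or ask for a specific number of rows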
OK, I see there are two variables called speed and dist - how do I select one of them? (Very easily, actually.)
cars$speed
This should produce the following…
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25
Voila - the speed variable. Notice that this vector is printed horizontally; this doesn't matter, the content is the same.
So, how do we get the mean & standard deviation of dist? It's very simple…
mean(cars$dist)
[1] 42.98
sd(cars$dist)
[1] 25.76938
Occasionally you'll want to add a new column of values. Let's take the cars data - what if we want to add a new column containing the square-root transform of distance?
carsNew <- cars
carsNew$SRDist <- sqrt(cars$dist)
head(carsNew)
speed dist SRDist
1 4 2 1.414214
2 4 10 3.162278
3 7 4 2.000000
4 7 22 4.690416
5 8 16 4.000000
6 9 10 3.162278
You'll note that applying functions to vectors in R is ridiculously easy - just as we did with the square-root transform, you can apply addition, subtraction, multiplication or any other mathematics to every element of a vector.
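For example, each of these lines operates on every element of the dist vector at once:
cars$dist + 10      #add 10 to every distance
cars$dist * 0.3048  #convert every distance from feet to metres
log(cars$dist)      #take the natural log of every element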
To remove a column, simply assign NULL to it
carsNew$dist <- NULL
head(carsNew)
speed SRDist
1 4 1.414214
2 4 3.162278
3 7 2.000000
4 7 4.690416
5 8 4.000000
6 9 3.162278
We could also create an entirely new dataframe by selecting columns
NewDataFrame <- cbind(carsNew$SRDist, cars$dist)
NewDataFrame <- na.omit(NewDataFrame)
colnames(NewDataFrame) <- c('SRDist', 'originalDist')
NewDataFrame <- as.data.frame(NewDataFrame)
#this should give you a hint that we could just as easily have said as.matrix - and you'd be right!
head(NewDataFrame)
SRDist originalDist
1 1.414214 2
2 3.162278 10
3 2.000000 4
4 4.690416 22
5 4.000000 16
6 3.162278 10
R is built so that things work better if you use logical indices rather than loops. Technically you could build a loop - but generally it will run more slowly and take up more memory.
install.packages("plyr", repos='http://probability.ca/cran/')
library(plyr)
SpeedGreaterThan20 <- subset(cars, cars$speed > 20)
mean(cars$speed)
[1] 15.4
mean(SpeedGreaterThan20$speed)
[1] 23.71429
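To see the contrast, here is a quick sketch of the same subsetting done with a loop and with a logical index - both give the rows where speed is greater than 20:
#the loop way (works, but slower and more verbose)
fast <- cars[0, ]  #an empty copy of cars
for (i in 1:nrow(cars)) {
  if (cars$speed[i] > 20) fast <- rbind(fast, cars[i, ])
}
#the vectorised way - one line, no loop
fast <- cars[cars$speed > 20, ]
mean(fast$speed)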
Missing values are coded as NA in R - generally, trying to compute a statistic over a vector containing an NA will just return NA
z <- c(1, 2, NA, 8, 3, NA, 3)
We can either remove the NA values by subsetting, or simply compute the statistic and tell R what to do with the missing values
mean(z)
[1] NA
mean(z, na.rm=TRUE)
[1] 3.4
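The subsetting route mentioned above looks like this:
z[!is.na(z)]      #keep only the non-missing values
mean(na.omit(z))  #or drop the NAs first, then compute the mean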
A related problem is extreme values (outliers) - suppose our vector contains a couple of wild observations
z <- c(1, 2, 100, 8, 3, -50, 3)
Let's get rid of these extreme values:
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)  #1st & 3rd quartiles
  H <- 1.5 * IQR(x, na.rm = na.rm)                             #1.5 times the interquartile range
  y <- x
  y[x < (qnt[1] - H)] <- NA  #set low outliers to NA
  y[x > (qnt[2] + H)] <- NA  #set high outliers to NA
  y
}
mean(z)
[1] 9.571429
mean(remove_outliers(z), na.rm=TRUE)
[1] 3.4
Another option is winsorization - rather than throwing extreme values away, we pull them in towards the rest of the distribution
z <- c(1, 2, 100, 8, 3, -50, 3)
The winsor function in the psych package does this for us
library(psych)
mean(z)
[1] 9.571429
mean(winsor(z))
[1] 3.485714
beanplot(z, winsor(z))
You can also apply winsorization to individuals within groups using the plyr package
EC <- ddply(DC, .(Subject, Condition), function(Acc_RT) {
  transform(Acc_RT, RTwin = winsor(Acc_RT$RT, trim = .1))  #RT is the assumed name of the reaction-time column in DC
})
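Since DC isn't a dataset we've loaded here, a self-contained sketch of the same idea with made-up data (hypothetical columns Subject, Condition, RT) looks like this:
library(plyr)
library(psych)
fake <- data.frame(Subject   = rep(1:2, each = 6),
                   Condition = rep(c("A", "B"), times = 6),
                   RT        = c(400, 420, 410, 2000, 430, 415,
                                 500, 510, 505, 495, 100, 520))
fakeWin <- ddply(fake, .(Subject, Condition), function(d) {
  transform(d, RTwin = winsor(d$RT, trim = 0.2))  #winsorize the RTs within each Subject x Condition cell
})
head(fakeWin)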
The basic plot function in R is… well… very basic.
plot(cars)
#boxplot(cars)
library(beanplot)
beanplot(cars)
hist(cars$dist)
d <- density(cars$dist) # returns the density data
plot(d)
library(ggthemes)
#bar graph of mean price by color, with error bars (diamonds data)
bar <- ggplot(diamonds, aes(color, price));
bar + stat_summary(fun.y = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  theme_economist()
setwd("/Users/John/Dropbox/R_Course 2/Field Data Files")
examData <- read.delim("Exam Anxiety.dat", header = TRUE)
#grouped Scatter Graph
scatter <- ggplot(examData, aes(Anxiety, Exam, colour = Gender));
scatter + geom_point() +
  geom_smooth(method = "rlm", alpha = .1) +
  labs(x = "Exam Anxiety", y = "Exam Performance %") +
  theme_wsj()
Let's take a subset of two factors from a dataset called diamonds and run a t-test on price over the factors.
library(ggplot2)
TwoColDiamonds <- subset(diamonds, color == "E" | color == "I")
ColorTtest <- t.test(price ~ color, data = TwoColDiamonds)
Welch Two Sample t-test
data: price by color
t = -27.799, df = 8485.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2157.217 -1873.028
sample estimates:
mean in group E mean in group I
3076.752 5091.875
To calculate an effect size, simply access the t value and df we have stored in ColorTtest
t <- ColorTtest$statistic[[1]]
df <- ColorTtest$parameter[[1]]
r <- sqrt(t^2/(t^2+df))
round(r,3)
[1] 0.289
From Andy Field's book: let's assume we have one group of arachnophobes who, on two separate occasions, are exposed to either a picture of a spider or a real spider.
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
Spider <- data.frame(picture,real)
dep.t.test <- t.test(Spider$real, Spider$picture, paired=TRUE)
dep.t.test
Paired t-test
data: Spider$real and Spider$picture
t = 2.4725, df = 11, p-value = 0.03098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7687815 13.2312185
sample estimates:
mean of the differences
7
linearModel <- lm(cars$dist ~ cars$speed)
summary(linearModel)
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
We can also view the residuals from the model and assign them to a new object if we want
head(resid(linearModel))
1 2 3 4 5 6
3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584
RegModel <- lm(cars$dist ~ cars$speed)
plot(cars)
plot(cars$dist ~ cars$speed)
abline(RegModel)
library(MASS)
RegModel1 <- rlm(cars$dist ~ cars$speed)
plot(cars$dist ~ cars$speed)
abline(RegModel, col = "red")
abline(RegModel1, col = "green")
Here the robust model (rlm) is very similar to the regular lm model
Here the assumption would be that the predictor is a factor with more than 2 levels… but it still works with a continuous variable!
Of course, R uses Type I sums of squares by default - the output won't look exactly like SPSS…
print(anova(linearModel))
Analysis of Variance Table
Response: cars$dist
Df Sum Sq Mean Sq F value Pr(>F)
cars$speed 1 21186 21185.5 89.567 1.49e-12 ***
Residuals 48 11354 236.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To correct the sums of squares, we need the car package - install it with install.packages("car") and load it with library(car)
library(car)
model <- aov(dist ~ speed, data=cars)
print(Anova(model, type="III"))
Anova Table (Type III tests)
Response: dist
Sum Sq Df F value Pr(>F)
(Intercept) 1600.3 1 6.7655 0.01232 *
speed 21185.5 1 89.5671 1.49e-12 ***
Residuals 11353.5 48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Load the ggplot2 package if you haven't already: library(ggplot2)
library(ggplot2)
head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
You can also use View(diamonds) to view this in your RStudio data viewer
Let's suppose we're interested in the interaction of cut and color on price…
We can define the model as follows (aov is just a wrapper around lm, so either call fits the same model)
DiamondModel <- lm(price ~ cut * color, data = diamonds)
DiamondModel <- aov(price ~ cut * color, data = diamonds)
print(Anova(DiamondModel, type="III"))
Anova Table (Type III tests)
Response: price
Sum Sq Df F value Pr(>F)
(Intercept) 3.7606e+11 1 24713.0009 < 2.2e-16 ***
cut 8.7891e+09 4 144.3969 < 2.2e-16 ***
color 9.4602e+09 6 103.6142 < 2.2e-16 ***
cut:color 1.6535e+09 24 4.5274 1.001e-12 ***
Residuals 8.2027e+11 53905
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Viagra <- read.delim("/Users/John/Dropbox/R_Course 2/Field Data Files/Viagra.dat")
Viagra$dose <- factor(Viagra$dose, levels = c(1:3), labels = c("Placebo", "Low Dose", "High Dose"))
head(Viagra)
person dose libido
1 1 Placebo 3
2 2 Placebo 2
3 3 Placebo 1
4 4 Placebo 1
5 5 Placebo 4
6 6 Low Dose 5
#line graph of mean libido by dose, with error bars (Viagra data)
line <- ggplot(Viagra, aes(dose, libido))
line + stat_summary(fun.y = mean, geom = "point") +
  stat_summary(fun.y = mean, geom = "line", aes(group = 1), colour = "Red", linetype = "dashed", size = 2) +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, size = 2) +
  labs(x = "Dose of Viagra", y = "Libido") +
  theme_economist() +
  theme(text = element_text(size = 20))
Descriptive statistics…
library(pastecs)
by(Viagra, Viagra$dose, stat.desc)
Viagra$dose: Placebo
person dose libido
nbr.val 5.0000000 NA 5.0000000
nbr.null 0.0000000 NA 0.0000000
nbr.na 0.0000000 NA 0.0000000
min 1.0000000 NA 1.0000000
max 5.0000000 NA 4.0000000
range 4.0000000 NA 3.0000000
sum 15.0000000 NA 11.0000000
median 3.0000000 NA 2.0000000
mean 3.0000000 NA 2.2000000
SE.mean 0.7071068 NA 0.5830952
CI.mean.0.95 1.9632432 NA 1.6189318
var 2.5000000 NA 1.7000000
std.dev 1.5811388 NA 1.3038405
coef.var 0.5270463 NA 0.5926548
--------------------------------------------------------
Viagra$dose: Low Dose
person dose libido
nbr.val 5.0000000 NA 5.0000000
nbr.null 0.0000000 NA 0.0000000
nbr.na 0.0000000 NA 0.0000000
min 6.0000000 NA 2.0000000
max 10.0000000 NA 5.0000000
range 4.0000000 NA 3.0000000
sum 40.0000000 NA 16.0000000
median 8.0000000 NA 3.0000000
mean 8.0000000 NA 3.2000000
SE.mean 0.7071068 NA 0.5830952
CI.mean.0.95 1.9632432 NA 1.6189318
var 2.5000000 NA 1.7000000
std.dev 1.5811388 NA 1.3038405
coef.var 0.1976424 NA 0.4074502
--------------------------------------------------------
Viagra$dose: High Dose
person dose libido
nbr.val 5.0000000 NA 5.0000000
nbr.null 0.0000000 NA 0.0000000
nbr.na 0.0000000 NA 0.0000000
min 11.0000000 NA 3.0000000
max 15.0000000 NA 7.0000000
range 4.0000000 NA 4.0000000
sum 65.0000000 NA 25.0000000
median 13.0000000 NA 5.0000000
mean 13.0000000 NA 5.0000000
SE.mean 0.7071068 NA 0.7071068
CI.mean.0.95 1.9632432 NA 1.9632432
var 2.5000000 NA 2.5000000
std.dev 1.5811388 NA 1.5811388
coef.var 0.1216261 NA 0.3162278
Levene's test for homogeneity of variances
leveneTest(Viagra$libido,Viagra$dose, center = median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.1176 0.89
12
Given the non-significance of this result, we can proceed confident in the knowledge that the variances are similar across all levels of the factor
The ANOVA…
ViagraModel <- aov(libido ~ dose, data = Viagra)
summary(ViagraModel)
Df Sum Sq Mean Sq F value Pr(>F)
dose 2 20.13 10.067 5.119 0.0247 *
Residuals 12 23.60 1.967
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Planned Contrasts
Viagra$dose <- as.factor(Viagra$dose)
contrast1 <- c(-2, 1, 1)
contrast2 <- c(0, -1, 1)
contrasts(Viagra$dose) <- cbind(contrast1, contrast2)
Viagra$dose
[1] Placebo Placebo Placebo Placebo Placebo Low Dose Low Dose
[8] Low Dose Low Dose Low Dose High Dose High Dose High Dose High Dose
[15] High Dose
attr(,"contrasts")
contrast1 contrast2
Placebo -2 0
Low Dose 1 -1
High Dose 1 1
Levels: Placebo Low Dose High Dose
Planned Contrasts II
ViagraPlanned <- aov(libido ~ dose, data = Viagra)
summary.lm(ViagraPlanned)
Call:
aov(formula = libido ~ dose, data = Viagra)
Residuals:
Min 1Q Median 3Q Max
-2.0 -1.2 -0.2 0.9 2.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4667 0.3621 9.574 5.72e-07 ***
dosecontrast1 0.6333 0.2560 2.474 0.0293 *
dosecontrast2 0.9000 0.4435 2.029 0.0652 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF, p-value: 0.02469
Trend Analyses
contrasts(Viagra$dose) <- contr.poly(3)
ViagraTrend <- aov(libido ~ dose, data = Viagra)
summary.lm(ViagraTrend)
Call:
aov(formula = libido ~ dose, data = Viagra)
Residuals:
Min 1Q Median 3Q Max
-2.0 -1.2 -0.2 0.9 2.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4667 0.3621 9.574 5.72e-07 ***
dose.L 1.9799 0.6272 3.157 0.00827 **
dose.Q 0.3266 0.6272 0.521 0.61201
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF, p-value: 0.02469
Post-Hoc Tests
pairwise.t.test(Viagra$libido, Viagra$dose, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: Viagra$libido and Viagra$dose
Placebo Low Dose
Low Dose 0.845 -
High Dose 0.025 0.196
P value adjustment method: bonferroni
The trick is just to find the right package & spend some time getting to know what it can do
setwd("/Users/John/Dropbox/R_Course 2/")
R -f Flanker_example.R
#Rscript Flanker_example.R
These commands allow you to call scripts you've already written & execute them. You can even call scripts from other scripts, or run scripts from a Linux shell & pass the results to a different program…
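For example (a sketch - Flanker_example.R stands in for any script you've saved):
source("Flanker_example.R")                #from inside R: run another script in the current session
#from a Linux/Mac shell:
#Rscript Flanker_example.R                 #run the script non-interactively
#Rscript Flanker_example.R > results.txt   #and capture its output in a text file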