author: John A. E. Anderson date: February 5 2014
Not meant to be scary, just fun!
Well…
(Note to self; demo this)
There are generally 4 panes:
Let's try installing ggplot2 (we'll get back to this package later)
Notice that the R code is also generated; it would be equally valid to type this
install.packages("ggplot2")
to load that package, you can either click the checkbox next to the package name, or you can enter the following:
library("ggplot2")
#replace with your dataset
TimeOfDayPriming <- read.csv("C:/Users/JA/Desktop/TimeOfDayPriming.csv")
Just to prove I've actually loaded something ;)
Age is plotted by group - the beanplot reveals that the young group is bimodal (grad students vs. undergrads methinks?)
the easy way
write.table(TimeOfDayPriming, file = "test.csv",col.names = TRUE, sep = ",", row.names = TRUE)
Try the following (in your console)
A <- 5690 + 69462
This should produce the following…
A <- 5690 + 69462
A
[1] 75152
Subtraction works just as you think it would…
5690 - 5452
This should produce the following…
[1] 238
Try the following (in your console)
5690 * 69462
This should produce the following…
[1] 395238780
And division …
5690 / 5452
[1] 1.044
Try the following (in your console)
summary(cars)
This should produce the following…
speed dist
Min. : 4.0 Min. : 2
1st Qu.:12.0 1st Qu.: 26
Median :15.0 Median : 36
Mean :15.4 Mean : 43
3rd Qu.:19.0 3rd Qu.: 56
Max. :25.0 Max. :120
summary is a base function built into R - you can use it on dataframes & other objects
To see the “cars” dataframe simply call the dataframe in the console
cars
This should produce the following…
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
(I used the function head(cars) to just display the first couple of rows, you can do the same thing with tail to display the last few rows)
Ok I see there are two variables called speed and dist, how do I select one of them? (very easily actually)
cars$speed
This should produce the following…
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25
Voila - the speed variable notice that this vector is plotted horizontally, this doesn't matter, the content is the same
So, how to get the mean & standard deviation value of dist? it's very simple…
mean(cars$dist)
[1] 42.98
sd(cars$dist)
[1] 25.77
Ocassionally you'll want to add a new column of variables: Let's take the cars - what if we want to add a new column that takes the square-root transform of distance?
carsNew <- cars
carsNew$SRDist <- sqrt(cars$dist)
head(carsNew)
speed dist SRDist
1 4 2 1.414
2 4 10 3.162
3 7 4 2.000
4 7 22 4.690
5 8 16 4.000
6 9 10 3.162
You'll note that applying functions to vectors in R is ridculously easy - as we just did with the square-root transform, you can do addition, subtraction, mutiplication or any other mathematics to each incidence in a vector.
To remove a column, simply use the null command
carsNew$dist <- NULL
head(carsNew)
speed SRDist
1 4 1.414
2 4 3.162
3 7 2.000
4 7 4.690
5 8 4.000
6 9 3.162
We could also create an entirely new dataframe by selecting columns
NewDataFrame <- cbind(carsNew$SRDist, cars$dist)
NewDataFrame <- na.omit(NewDataFrame)
colnames(NewDataFrame) <- c('SRDist', 'originalDist')
NewDataFrame=as.data.frame(NewDataFrame)
#this should give you some hint that we could just as easily say as.matrix - you'd be right!
head(NewDataFrame)
SRDist originalDist
1 1.414 2
2 3.162 10
3 2.000 4
4 4.690 22
5 4.000 16
6 3.162 10
R is built so that things work better if you use logical indices rather than loops. Technically you could build a loop - but generally it will run more slowly and take up more memory.
install.packages("plyr", repos='http://probability.ca/cran/')
library(plyr)
SpeedGreaterThan20 <-subset(cars, cars$speed > 20)
mean(cars$speed)
[1] 15.4
mean(SpeedGreaterThan20$speed)
[1] 23.71
Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error
z <- c(1, 2, NA, 8, 3, NA, 3)
We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values
mean(z)
[1] NA
mean(z, na.rm=TRUE)
[1] 3.4
Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error
z <- c(1, 2, 100, 8, 3, -50, 3)
Let's get rid of these extreme values:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
mean(z)
[1] 9.571
mean(remove_outliers(z), na.rm=TRUE)
[1] 3.4
Missing values are set to NA in R - generally trying to compute a statistic over a vector containing an NA will result in an error
z <- c(1, 2, 100, 8, 3, -50, 3)
We can either remove the NA values subsetting using plyr (better for dataframes), or simply compute the statistic, but tell R what to do with the missing values
library(psych)
mean(z)
[1] 9.571
mean(winsor(z))
[1] 3.486
beanplot(z, winsor(z))
You can also apply winsorization to individuals within groups using the plyr package
EC <- ddply(DC, .(Subject, Condition), function(Acc_RT) {
transform(Acc_RT, RTwin = winsor(Acc_RT, trim = .1))
})
The basic plot function in R is… well… very basic.
plot(cars)
#boxplot(cars)
library(beanplot)
beanplot(cars)
hist(cars$dist)
d <- density(cars$dist) # returns the density data
plot(d)
library(ggthemes)
#faceted bar graph (festival)
bar <- ggplot(diamonds, aes(color, price));
bar + stat_summary(fun.y = mean, geom = "bar")+ stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2)+ theme_economist()
setwd("C:/Users/JA/Desktop/Field Data Files")
examData <- read.delim("Exam Anxiety.dat", header = TRUE)
#grouped Scatter Graph
scatter <- ggplot(examData, aes(Anxiety, Exam, colour = Gender));
scatter + geom_point() + geom_smooth(method = "rlm", alpha = .1)+ labs(x = "Exam Anxiety", y = "Exam Performance %") + theme_wsj()
Let's take a subset of two factors from a dataset called diamonds and run a t-test on price over the factors.
library(ggplot2)
TwoColDiamonds <- subset(diamonds, color =="E" | color =="I" )
ColorTtest <- t.test(price ~ color, data = TwoColDiamonds)
Welch Two Sample t-test
data: price by color
t = -27.8, df = 8486, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2157 -1873
sample estimates:
mean in group E mean in group I
3077 5092
to calculate an effect size, simply access the t value and df value we have stored in ColorTtest
t <- ColorTtest$statistic[[1]]
df <- ColorTtest$parameter[[1]]
r <- sqrt(t^2/(t^2+df))
round(r,3)
[1] 0.289
From Andy Field's book; Let's assume we have one group of arachnophobes who on two separate ocasions are exposed to either a picture or real spider.
picture <- c(30, 35, 45,40,50,35,55,25,30,45,40,50)
real <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50,60, 39)
Spider <- data.frame(picture,real)
dep.t.test <- t.test(Spider$real, Spider$picture, paired=TRUE)
dep.t.test
Paired t-test
data: Spider$real and Spider$picture
t = 2.473, df = 11, p-value = 0.03098
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7688 13.2312
sample estimates:
mean of the differences
7
linearModel <- lm(cars$dist ~ cars$speed)
summary(linearModel)
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.07 -9.53 -2.27 9.21 43.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.579 6.758 -2.60 0.012 *
cars$speed 3.932 0.416 9.46 1.5e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.4 on 48 degrees of freedom
Multiple R-squared: 0.651, Adjusted R-squared: 0.644
F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
we can also view the residuals from the model and assign that to a new object if we want
head(resid(linearModel))
1 2 3 4 5 6
3.849 11.849 -5.948 12.052 2.120 -7.813
RegModel <- lm(cars$dist ~ cars$speed)
plot(cars)
plot(cars$dist ~ cars$speed)
abline(RegModel)
library(MASS)
RegModel1 <- rlm(cars$dist ~ cars$speed)
plot(cars$dist ~ cars$speed)
abline(RegModel, col = "red")
abline(RegModel1, col = "green")
here the robust model rlm is very similar to the regular lm model
here the assumption would be that the second variable would be a factor with more than 2 levels… but it still works as a continuous!
Of course, R uses type I sums of squares - the output won't look exactly like SPSS…
print(anova(linearModel))
Analysis of Variance Table
Response: cars$dist
Df Sum Sq Mean Sq F value Pr(>F)
cars$speed 1 21185 21185 89.6 1.5e-12 ***
Residuals 48 11354 237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To correct the sums of squares, we need the car package, load it using install.packages(car)
library(car)
model <- aov(dist ~ speed, data=cars)
print(Anova(model, type="III"))
Anova Table (Type III tests)
Response: dist
Sum Sq Df F value Pr(>F)
(Intercept) 1600 1 6.77 0.012 *
speed 21185 1 89.57 1.5e-12 ***
Residuals 11354 48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
load the ggplot2 package if you haven't already; library(ggplot2)
library(ggplot2)
head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
you can also use View(diamonds) to view this in your RStudio browser ,,, ,
let's suppose we're interested in the interaction of cut and color on price…
We can define the model as follows
DiamondModel <- lm (price ~ cut * color, data = diamonds)
DiamondModel <- aov(price ~ cut * color, data=diamonds)
print(Anova(DiamondModel, type="III"))
Anova Table (Type III tests)
Response: price
Sum Sq Df F value Pr(>F)
(Intercept) 3.76e+11 1 24713.00 <2e-16 ***
cut 8.79e+09 4 144.40 <2e-16 ***
color 9.46e+09 6 103.61 <2e-16 ***
cut:color 1.65e+09 24 4.53 1e-12 ***
Residuals 8.20e+11 53905
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Viagra <- read.delim("C:/Users/JA/Desktop/Field Data Files/Viagra.dat")
Viagra$dose<-factor(Viagra$dose,levels =c(1:3),labels = c("Placebo", "Low Dose", "High Dose"))
head(Viagra)
person dose libido
1 1 Placebo 3
2 2 Placebo 2
3 3 Placebo 1
4 4 Placebo 1
5 5 Placebo 4
6 6 Low Dose 5
#line graph (hiccups)
line <- ggplot(Viagra, aes(dose, libido))
line + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.y = mean, geom = "line", aes(group=1),colour = "Red", linetype = "dashed", size = 2)+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2, size =2) + labs(x = "Dose of Viagra", y = "Libido") + theme_economist() +theme(text = element_text(size = 20))
Descriptive statistics…
library(pastecs)
by(Viagra, Viagra$dose, stat.desc)
Viagra$dose: Placebo
person dose libido
nbr.val 5.0000 NA 5.0000
nbr.null 0.0000 NA 0.0000
nbr.na 0.0000 NA 0.0000
min 1.0000 NA 1.0000
max 5.0000 NA 4.0000
range 4.0000 NA 3.0000
sum 15.0000 NA 11.0000
median 3.0000 NA 2.0000
mean 3.0000 NA 2.2000
SE.mean 0.7071 NA 0.5831
CI.mean.0.95 1.9632 NA 1.6189
var 2.5000 NA 1.7000
std.dev 1.5811 NA 1.3038
coef.var 0.5270 NA 0.5927
--------------------------------------------------------
Viagra$dose: Low Dose
person dose libido
nbr.val 5.0000 NA 5.0000
nbr.null 0.0000 NA 0.0000
nbr.na 0.0000 NA 0.0000
min 6.0000 NA 2.0000
max 10.0000 NA 5.0000
range 4.0000 NA 3.0000
sum 40.0000 NA 16.0000
median 8.0000 NA 3.0000
mean 8.0000 NA 3.2000
SE.mean 0.7071 NA 0.5831
CI.mean.0.95 1.9632 NA 1.6189
var 2.5000 NA 1.7000
std.dev 1.5811 NA 1.3038
coef.var 0.1976 NA 0.4075
--------------------------------------------------------
Viagra$dose: High Dose
person dose libido
nbr.val 5.0000 NA 5.0000
nbr.null 0.0000 NA 0.0000
nbr.na 0.0000 NA 0.0000
min 11.0000 NA 3.0000
max 15.0000 NA 7.0000
range 4.0000 NA 4.0000
sum 65.0000 NA 25.0000
median 13.0000 NA 5.0000
mean 13.0000 NA 5.0000
SE.mean 0.7071 NA 0.7071
CI.mean.0.95 1.9632 NA 1.9632
var 2.5000 NA 2.5000
std.dev 1.5811 NA 1.5811
coef.var 0.1216 NA 0.3162
Levene's test for homogeneity of variances
leveneTest(Viagra$libido,Viagra$dose, center = median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 0.12 0.89
12
Given the non-significance of this result, we can proceed confident in the knowlege that the variances are similar across all the levels of the factors
The ANOVA…
ViagraModel <- aov(libido ~ dose, data = Viagra)
summary(ViagraModel)
Df Sum Sq Mean Sq F value Pr(>F)
dose 2 20.1 10.07 5.12 0.025 *
Residuals 12 23.6 1.97
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Planned Contrasts
Viagra$dose <- as.factor(Viagra$dose)
contrast1 <- c(-2,1,1)
contrast2<-c(0,-1,1)
contrasts(Viagra$dose) <- cbind(contrast1,contrast2)
Viagra$dose
[1] Placebo Placebo Placebo Placebo Placebo Low Dose Low Dose
[8] Low Dose Low Dose Low Dose High Dose High Dose High Dose High Dose
[15] High Dose
attr(,"contrasts")
contrast1 contrast2
Placebo -2 0
Low Dose 1 -1
High Dose 1 1
Levels: Placebo Low Dose High Dose
Planned Contrasts II
ViagraPlanned <-aov(libido ~dose, data =Viagra)
summary.lm(ViagraPlanned)
Call:
aov(formula = libido ~ dose, data = Viagra)
Residuals:
Min 1Q Median 3Q Max
-2.0 -1.2 -0.2 0.9 2.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.467 0.362 9.57 5.7e-07 ***
dosecontrast1 0.633 0.256 2.47 0.029 *
dosecontrast2 0.900 0.443 2.03 0.065 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.4 on 12 degrees of freedom
Multiple R-squared: 0.46, Adjusted R-squared: 0.37
F-statistic: 5.12 on 2 and 12 DF, p-value: 0.0247
Trend Analyses
contrasts(Viagra$dose)<-contr.poly(3)
ViagraTrend <-aov(libido ~ dose, data = Viagra)
summary.lm(ViagraTrend)
Call:
aov(formula = libido ~ dose, data = Viagra)
Residuals:
Min 1Q Median 3Q Max
-2.0 -1.2 -0.2 0.9 2.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.467 0.362 9.57 5.7e-07 ***
dose.L 1.980 0.627 3.16 0.0083 **
dose.Q 0.327 0.627 0.52 0.6120
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.4 on 12 degrees of freedom
Multiple R-squared: 0.46, Adjusted R-squared: 0.37
F-statistic: 5.12 on 2 and 12 DF, p-value: 0.0247
Post-Hoc Tests
pairwise.t.test(Viagra$libido, Viagra$dose, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: Viagra$libido and Viagra$dose
Placebo Low Dose
Low Dose 0.845 -
High Dose 0.025 0.196
P value adjustment method: bonferroni
The trick is just to find the right package & spend some time getting to know what it can do
Error in parse(text = x, srcfile = src) : <text>:1:9: unexpected symbol
1: Rscript Flanker_example.R
^