This guide is a companion to the Handbook: it is for those who want to refresh their working knowledge of R or get a better understanding of the code provided in the Handbook. Unlike the Handbook, which has code for analysing the actual practical data, this guide uses made-up data from a fictive experiment to explain in more general terms how the code in the Handbook works. The guide starts with the very basics in case your R is rusty, and takes you all the way to plotting and t-tests. You can jump to the different sections by clicking on the navigation menu on the left.
If you have forgotten everything about R, this section is for you. Open RStudio. Recall that you can interact with R by typing straight into the console. This is handy for trying out simple things, but not what you want for anything that is more complicated! For your report, you want to type your code into a code file and save it.
For now, have a play in the console to re-familiarise yourself with R.
(2+3)*2 # comments start with "#"## [1] 10
5^2 # exponentiation## [1] 25
sqrt(25) # square root## [1] 5
We store values (numbers) in variables like so:
a <- 2
b <- 3
my.cool.result <- (a + b)*2
my.cool.result## [1] 10
my.cool.Result # throws an error...## Error in eval(expr, envir, enclos): object 'my.cool.Result' not found
When we have a variable that takes a bunch of values (say, a set of
ten measurements of body weight), we use a vector variable, or vector
for short, to hold the values. To define a vector, you put the
comma-separated values inside c(). In case you’re
wondering, the c stands for ‘combine’ (values into a
vector, that is).
a <- c(10, 20, 30, 40, 50, 60, 70)
a## [1] 10 20 30 40 50 60 70
a*2 # multiplies all values in `a` by 2 !!## [1] 20 40 60 80 100 120 140
By the way, when we stored the vector in a, this
overwrote the value ‘2’ that we had stored in a
earlier.
Now that we have a variable with multiple values (our vector
a), we can do means and standard deviations:
mean(a)## [1] 40
sd(a)## [1] 21.60247
We will use a fictive experiment as an example. Imagine we present an animal with trains of stimuli in quick succession. We are interested in how some particular aspect \(Z\) of the animal’s response depends on the number of stimuli.
We then calculate \(Z\) (our ‘dependent variable’) from our measurements of \(a\) and \(b\), according to the following equations:
\[X = a+b\]
\[Z = \frac{X}{b+1}\pi\]
so that we can relate \(Z\) to the number of stimuli (which is our ‘independent variable’).
R/RStudio needs to know where your data files are. You could do this the old-fashioned way, by setting the ‘working directory’ (RStudio menu Session > Set Working Directory > Choose Directory…); but it really pays off if you adopt working with RStudio Projects. This way, next time you work on the same project, you’ll just pick up where you left off (except that you’ll need to re-load any packages that you may be using).
To make an RStudio Project, simply
This will create a project file inside your project directory. Next time you work on your project, you can open it just by double-clicking the project file (this will launch RStudio).
You should see your data files listed in the ‘Files’ tab of the lower right pane in RStudio.
The data from our experiment are available as a CSV file (the
de-facto standard for any tabular data). So the first thing we need to
do is get the data into R, into what is called a data frame —
you can think of it as a spreadsheet inside R. We’ll call our data frame
MyData (note that the name is up to you, and not
necessarily the same as that of your CSV file!):
MyData <- read.csv(file = "demo-data.csv")It is a good idea to first check that the import went OK, and that
the data structure looks as expected. head() will print the
first few rows of a data frame to the console:
head(MyData)## a b stim.nr
## 1 1.439524 3.305293 1
## 2 1.769823 3.792083 1
## 3 3.558708 2.734604 1
## 4 2.070508 6.168956 1
## 5 2.129288 5.207962 1
## 6 3.715065 2.876891 1
So this data frame has three variables (columns): a,
b and stim.nr. At this point, you could also
do
summary(MyData)## a b stim.nr
## Min. :0.6195 Min. :-1.309 Min. :1.00
## 1st Qu.:1.5542 1st Qu.: 2.000 1st Qu.:1.75
## Median :2.4941 Median : 3.414 Median :3.00
## Mean :3.0452 Mean : 3.243 Mean :3.75
## 3rd Qu.:4.4317 3rd Qu.: 4.961 3rd Qu.:5.00
## Max. :7.2538 Max. : 6.516 Max. :8.00
to get some first idea what the data look like.
To access variables (columns) in a data frame, we use this syntax:
MyData$a # simply dumps the values of `a` on the console## [1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162
## [8] 0.7349388 1.3131471 1.5543380 4.2240818 3.3598138 3.4007715 3.1106827
## [15] 2.4441589 4.7869131 3.4978505 1.0333828 3.7013559 2.5272086 4.9321763
## [22] 5.7820251 4.9739956 5.2711088 5.3749607 4.3133067 6.8377870 6.1533731
## [29] 4.8618631 7.2538149 1.4264642 0.7049285 1.8951257 1.8781335 1.8215811
## [36] 1.6886403 1.5539177 0.9380883 0.6940373 0.6195290
In our example, we first need to calculate \(X\) as the sum of \(a\) and \(b\): \[X = a+b\]
You can calculate with variables that are inside a data frame just as you would with any other variables, but the result won’t automatically go into the data frame.
So the next bit of R will just spit out the result of \(a + b\) on the console — this is not what we usually want:
MyData$a + MyData$b## [1] 4.7448174 5.5619052 6.2933120 8.2394644 7.3372497 6.5919564
## [7] 6.0580314 4.2682834 6.0931123 5.4709690 9.4774003 8.3312671
## [13] 8.3579010 9.4792850 7.2183879 11.3033837 6.9490977 6.6179966
## [19] 8.8252101 7.7431502 8.3118158 8.2797016 7.6407882 7.2525334
## [25] 7.3031695 7.6168353 10.2859968 9.2063773 8.7841305 12.3038996
## [31] 1.9354331 -0.6042404 3.9008642 2.1689327 2.1335725 3.7142116
## [37] 2.2691446 0.7173706 1.8753408 1.4806376
Better, store the result in a new variable, X:
X <- MyData$a + MyData$bBut this creates X outside our data frame, and
when we have lots of calculations and intermediate results, things can
get very confusing (imagine you need to calculate X for two
different data sets…).
So, when calculating a variable from existing variables in a data frame, what you usually want to do is store the new variable as part of the same data frame:
MyData$X <- MyData$a + MyData$bLet’s check whether the mean of \(X\) is ballpark sensible:
mean(MyData$X)## [1] 6.288467
Finally, a very useful tip: if you have your
variables neatly stored in data frames, referencing multiple variables
can get really tedious using the Dataframe$variable syntax…
In our example, the next step is to calculate \[Z = \frac{X}{b+1}\pi\]
Instead of the barely readable
MyData$Z <- MyData$X/(MyData$b+1)*piuse with(MyData, ) like this:
MyData$Z <- with(MyData, X/(b+1)*pi )NOTE: In the plots below, the axis labels are way to small – you will lose marks if you have plots like these in your report! Why didn’t I fix them? Because on this web page, you can click on the plots to expand them. But the plots in your report must have axis labels that are readable without zooming in on the page.
R does simple box plots right out of the box (lame pun intended).
We can do a box plot of Z on the \(y\)-axis against stim.nr as a
categorical variable on the \(x\)-axis, like so:
boxplot(Z ~ stim.nr, MyData)Better / proper axis labels:
boxplot(Z ~ stim.nr, MyData,
xlab ="Number of stimuli",
ylab = "Response amplitude (Units)")Alternatively, we may want to consider the variable on the \(x\)-axis continuous, that is, allow for the
fact that (even though the number of stimuli can only assume positive
integers) the different values of stim.nr are not equally
spaced (since we have 1, 2, 4 and 8).
In this case, it is best to plot means as points, standard deviations as bars, and maybe connect the means by straight lines.
This is very tricky in plain R, so we use the ggplot2
package. If it isn’t installed already, you need to install it before
you can load it. You need to install just once on your computer, but you
need to load it at the beginning of each session.
library(ggplot2) # load package; only needed once in a session.
ggplot(data = MyData, mapping = aes(x=stim.nr, y=Z)) +
xlab("Number of stimuli") +
ylab("Response amplitude (Units)") +
stat_summary(fun = mean, geom = "line") + # plot lines through means
stat_summary(fun = mean, # add points and SD as bars
fun.min = function(x) mean(x) - sd(x), # mean - SD
fun.max = function(x) mean(x) + sd(x), # mean + SD
geom = "pointrange", size=1.5)Figure 3. The response amplitude depends on the number of stimuli (black filled circles, means; bars, standard deviations).
NOTE: ggplot2 is extremely powerful,
but the price is that the code can get quite complicated (I find myself
googling a lot…). Feel free to reuse and modify my code examples for
your plots!
In the above code,
data = ... is your data frame that holds the variables
you want to plot;aes(x=..., y=...) is the “aesthetic” (don’t ask…): this
is where you define what variables of your data frame
go on the \(x\) and \(y\) axis of your plot.xlab and ylab are where you set the axis
labels. If you omit these, ggplot() will use the variable
names as axis labels.If you are curious about the rest of the code,
stats_summary(fun = mean) bit,
fun = tells R what function to use to summarise the data:
here, we are using the mean. The fun.min and
fun.max functions define the end points of the bars, which
are the mean +/- the standard deviation, hence the
mean(x) + sd(x) and mean(x) - sd(x) (the
x is just a place-holder here, it’s not referring to what
is on the \(x\)-axis)If you wanted to, you could use fun = median instead of
the mean but then you would not want to use standard
deviations for the ‘error bars’. You would want first and third quartile
instead, so the
fun.min = function(x) mean(x) - sd(x),
fun.max = function(x) mean(x) + sd(x), would become
fun.min = function(x) quantile(x, 0.25), # 1st quartile = 25%
fun.max = function(x) quantile(x, 0.75), # 3rd quartile = 75%Finally, we could also plot the raw data with the summary, and jitter them a bit along the \(x\)-axis so that they don’t fall on top of one another:
ggplot(data = MyData, mapping = aes(x=stim.nr, y=Z)) +
xlab("Number of stimuli") +
ylab("Response amplitude (Units)") +
stat_summary(fun = median, geom = "line") + # plot lines through medians
stat_summary(fun = median, # add points and quartiles as bars
fun.min = function(x) quantile(x, 0.25),
fun.max = function(x) quantile(x, 0.75),
geom = "pointrange", size=1) +
geom_jitter(width = 0.2, alpha=0.3, size=2.5)Figure 4. The response amplitude depends on the number of stimuli (grey filled circles, raw data points; black filled circles, medians; bars, interquartile range).
In our example, the response is clearly highest with four stimuli. If you want to numerically report the mean (and SD) response, how do you get these numbers?
What won’t work for this is the code below, because it will give the total mean across all groups (across all ‘stimulus numbers’):
mean(MyData$Z)## [1] 4.74601
Instead, to get the mean of only the ‘four stimuli’ group, you need to subset the data like this:
with(subset(MyData, stim.nr == 4), mean(Z))## [1] 6.737928
Note R uses ==, not =, for testing
equality.
There are different ways of organising data — not just in R, but in general. The data above were in “long format” where
The data in the practical that compare jump distances before and after blocking tibial flexion are in a different format — they look something like this:
TestData## best block
## 1 4.761333 3.462316
## 2 4.383095 3.646273
## 3 4.407680 5.294008
## 4 4.041479 3.610713
## 5 3.927318 3.713078
## 6 4.724375 5.341713
## 7 4.904521 4.139959
## 8 3.157520 2.957906
## 9 4.527413 3.311798
## 10 3.913458 3.495799
This is called “wide format”. Unlike in the previous example, this format doesn’t actually say what was measured (jump distance, in the Practical data). Also,
It is up to you whether you keep the data in wide format, or whether you change them into long format. Long format is more logical, and required for advanced analyses (final year projects!), but simple plots and tests can also be done in wide format.
To boxplot the data, you could simply say
boxplot(TestData$best, TestData$block)(The axis labels would need sorting out, though!).
Note how this syntax is fundamentally different from the syntax we
usded earlier, where we had boxplot(y ~ x) with
x the grouping variable and y the measured
response. The syntax has to be different because the data are in a
different format.
The logically sensible way to think about the data is:
free and blocked.But because of the format the data are in, we cannot say
boxplot(jump.dist ~ tibia.cond).
We can re-organise the data into a new data frame, in “long format”.
The rep(...) bit stands for ‘repeat’ and fills the column
tibia.cond with as many repeats of “free” and “blocked” as
there are rows in the wide-format version of the data:
TestData.long <- data.frame(
jump.dist = c(TestData$best, TestData$block), # puts all jump distances in one column
tibia.cond = rep(c("free", "blocked"), each = nrow(TestData)) # second column indicating condition
)
TestData.long## jump.dist tibia.cond
## 1 4.761333 free
## 2 4.383095 free
## 3 4.407680 free
## 4 4.041479 free
## 5 3.927318 free
## 6 4.724375 free
## 7 4.904521 free
## 8 3.157520 free
## 9 4.527413 free
## 10 3.913458 free
## 11 3.462316 blocked
## 12 3.646273 blocked
## 13 5.294008 blocked
## 14 3.610713 blocked
## 15 3.713078 blocked
## 16 5.341713 blocked
## 17 4.139959 blocked
## 18 2.957906 blocked
## 19 3.311798 blocked
## 20 3.495799 blocked
Now we can say,
boxplot(jump.dist ~ tibia.cond, TestData.long)and we can also use ggplot2 if we so fancy:
ggplot(data = TestData.long, mapping = aes(x=tibia.cond, y=jump.dist)) +
geom_boxplot()The syntax for a simple t-test follows that of a boxplot — except that you need to decide whether the data are paired or not!!
Unpaired t-test:
t.test(TestData$best, TestData$block)##
## Welch Two Sample t-test
##
## data: TestData$best and TestData$block
## t = 1.2405, df = 15.472, p-value = 0.2333
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2693993 1.0243249
## sample estimates:
## mean of x mean of y
## 4.274819 3.897356
Paired t-test:
t.test(TestData$best, TestData$block, paired = TRUE)##
## Paired t-test
##
## data: TestData$best and TestData$block
## t = 1.6918, df = 9, p-value = 0.1249
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.1272609 0.8821865
## sample estimates:
## mean difference
## 0.3774628
Unpaired t-test:
t.test(jump.dist ~ tibia.cond, TestData.long)##
## Welch Two Sample t-test
##
## data: jump.dist by tibia.cond
## t = -1.2405, df = 15.472, p-value = 0.2333
## alternative hypothesis: true difference in means between group blocked and group free is not equal to 0
## 95 percent confidence interval:
## -1.0243249 0.2693993
## sample estimates:
## mean in group blocked mean in group free
## 3.897356 4.274819
Paired t-test:
t.test(jump.dist ~ tibia.cond, TestData.long, paired = TRUE)##
## Paired t-test
##
## data: jump.dist by tibia.cond
## t = -1.6918, df = 9, p-value = 0.1249
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.8821865 0.1272609
## sample estimates:
## mean difference
## -0.3774628