About this Guide

This guide is a companion to the Handbook: it is for those who want to refresh their working knowledge of R or get a better understanding of the code provided in the Handbook. Unlike the Handbook, which has code for analysing the actual practical data, this guide uses made-up data from a fictive experiment to explain in more general terms how the code in the Handbook works. The guide starts with the very basics in case your R is rusty, and takes you all the way to plotting and t-tests. You can jump to the different sections by clicking on the navigation menu on the left.

Refresher: The Basics

If you have forgotten everything about R, this section is for you. Open RStudio. Recall that you can interact with R by typing straight into the console. This is handy for trying out simple things, but not what you want for anything that is more complicated! For your report, you want to type your code into a code file and save it.

R as a calculator…

For now, have a play in the console to re-familiarise yourself with R.

(2+3)*2   # comments start with "#"

## [1] 10

5^2       # exponentiation

## [1] 25

sqrt(25)  # square root

## [1] 5

Variables

We store values (numbers) in variables like so:

a <- 2
b <- 3
my.cool.result <- (a + b)*2
my.cool.result

## [1] 10

Remember: R is case sensitive!

my.cool.Result  # throws an error...

## Error in eval(expr, envir, enclos): object 'my.cool.Result' not found

Vectors

When we have a variable that takes a bunch of values (say, a set of ten measurements of body weight), we use a vector variable, or vector for short, to hold the values. To define a vector, you put the comma-separated values inside c(). In case you’re wondering, the c stands for ‘combine’ (values into a vector, that is).

a <- c(10, 20, 30, 40, 50, 60, 70)
a

## [1] 10 20 30 40 50 60 70

a*2  # multiplies all values in `a` by 2 !!

## [1]  20  40  60  80 100 120 140

By the way, when we stored the vector in a, this overwrote the value ‘2’ that we had stored in a earlier.

Summary statistics

Now that we have a variable with multiple values (our vector a), we can do means and standard deviations:

mean(a)

## [1] 40

sd(a)

## [1] 21.60247

Working with data frames and CSV files

We will use a fictive experiment as an example. Imagine we present an animal with trains of stimuli in quick succession. We are interested in how some particular aspect $Z$ of the animal’s response depends on the number of stimuli.

We vary the number of stimuli in the trains (1, 2, 4 or 8).
We cannot directly measure $Z$. Instead we characterise the response by measuring two variables, $a$ and $b$.

We then calculate $Z$ (our ‘dependent variable’) from our measurements of $a$ and $b$, according to the following equations:

\[X = a+b\]

\[Z = \frac{X}{b+1}\pi\]

so that we can relate $Z$ to the number of stimuli (which is our ‘independent variable’).

Managing your files with RStudio Projects

R/RStudio needs to know where your data files are. You could do this the old-fashioned way, by setting the ‘working directory’ (RStudio menu Session > Set Working Directory > Choose Directory…); but it really pays off if you adopt working with RStudio Projects. This way, next time you work on the same project, you’ll just pick up where you left off (except that you’ll need to re-load any packages that you may be using).

To make an RStudio Project, simply

Create a new folder anywhere on your disk.
Put your data files inside the folder.
In RStudio, go to menu File > New Project…
Choose ‘Existing Directory’ and use ‘Browse’ to find your folder. Don’t drill into the folder, just select (highlight) it and then click ‘Open’.
Click ‘Create Project’.

This will create a project file inside your project directory. Next time you work on your project, you can open it just by double-clicking the project file (this will launch RStudio).

You should see your data files listed in the ‘Files’ tab of the lower right pane in RStudio.

Reading in CSV files

The data from our experiment are available as a CSV file (the de-facto standard for any tabular data). So the first thing we need to do is get the data into R, into what is called a data frame — you can think of it as a spreadsheet inside R. We’ll call our data frame MyData (note that the name is up to you, and not necessarily the same as that of your CSV file!):

MyData <- read.csv(file = "demo-data.csv")

Working with data frames

It is a good idea to first check that the import went OK, and that the data structure looks as expected. head() will print the first few rows of a data frame to the console:

head(MyData)

##          a        b stim.nr
## 1 1.439524 3.305293       1
## 2 1.769823 3.792083       1
## 3 3.558708 2.734604       1
## 4 2.070508 6.168956       1
## 5 2.129288 5.207962       1
## 6 3.715065 2.876891       1

So this data frame has three variables (columns): a, b and stim.nr. At this point, you could also do

summary(MyData)

##        a                b             stim.nr    
##  Min.   :0.6195   Min.   :-1.309   Min.   :1.00  
##  1st Qu.:1.5542   1st Qu.: 2.000   1st Qu.:1.75  
##  Median :2.4941   Median : 3.414   Median :3.00  
##  Mean   :3.0452   Mean   : 3.243   Mean   :3.75  
##  3rd Qu.:4.4317   3rd Qu.: 4.961   3rd Qu.:5.00  
##  Max.   :7.2538   Max.   : 6.516   Max.   :8.00

to get some first idea what the data look like.

To access variables (columns) in a data frame, we use this syntax:

MyData$a  # simply dumps the values of `a` on the console

##  [1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162
##  [8] 0.7349388 1.3131471 1.5543380 4.2240818 3.3598138 3.4007715 3.1106827
## [15] 2.4441589 4.7869131 3.4978505 1.0333828 3.7013559 2.5272086 4.9321763
## [22] 5.7820251 4.9739956 5.2711088 5.3749607 4.3133067 6.8377870 6.1533731
## [29] 4.8618631 7.2538149 1.4264642 0.7049285 1.8951257 1.8781335 1.8215811
## [36] 1.6886403 1.5539177 0.9380883 0.6940373 0.6195290

In our example, we first need to calculate $X$ as the sum of $a$ and $b$: \[X = a+b\]

You can calculate with variables that are inside a data frame just as you would with any other variables, but the result won’t automatically go into the data frame.

So the next bit of R will just spit out the result of $a + b$ on the console — this is not what we usually want:

MyData$a + MyData$b

##  [1]  4.7448174  5.5619052  6.2933120  8.2394644  7.3372497  6.5919564
##  [7]  6.0580314  4.2682834  6.0931123  5.4709690  9.4774003  8.3312671
## [13]  8.3579010  9.4792850  7.2183879 11.3033837  6.9490977  6.6179966
## [19]  8.8252101  7.7431502  8.3118158  8.2797016  7.6407882  7.2525334
## [25]  7.3031695  7.6168353 10.2859968  9.2063773  8.7841305 12.3038996
## [31]  1.9354331 -0.6042404  3.9008642  2.1689327  2.1335725  3.7142116
## [37]  2.2691446  0.7173706  1.8753408  1.4806376

Better, store the result in a new variable, X:

X <- MyData$a + MyData$b

But this creates X outside our data frame, and when we have lots of calculations and intermediate results, things can get very confusing (imagine you need to calculate X for two different data sets…).

So, when calculating a variable from existing variables in a data frame, what you usually want to do is store the new variable as part of the same data frame:

MyData$X <- MyData$a + MyData$b

Let’s check whether the mean of $X$ is ballpark sensible:

mean(MyData$X)

## [1] 6.288467

Finally, a very useful tip: if you have your variables neatly stored in data frames, referencing multiple variables can get really tedious using the Dataframe$variable syntax… In our example, the next step is to calculate \[Z = \frac{X}{b+1}\pi\]

Instead of the barely readable

MyData$Z <- MyData$X/(MyData$b+1)*pi

use with(MyData, ) like this:

MyData$Z <- with(MyData, X/(b+1)*pi )

Plotting

NOTE: In the plots below, the axis labels are way to small – you will lose marks if you have plots like these in your report! Why didn’t I fix them? Because on this web page, you can click on the plots to expand them. But the plots in your report must have axis labels that are readable without zooming in on the page.

Box plots: categorical $x$-axis

R does simple box plots right out of the box (lame pun intended).

We can do a box plot of Z on the $y$-axis against stim.nr as a categorical variable on the $x$-axis, like so:

boxplot(Z ~ stim.nr, MyData)

Better / proper axis labels:

boxplot(Z ~ stim.nr, MyData,
        xlab  ="Number of stimuli",
        ylab = "Response amplitude (Units)")

Treating $x$ axis as continuous

Alternatively, we may want to consider the variable on the $x$-axis continuous, that is, allow for the fact that (even though the number of stimuli can only assume positive integers) the different values of stim.nr are not equally spaced (since we have 1, 2, 4 and 8).

In this case, it is best to plot means as points, standard deviations as bars, and maybe connect the means by straight lines.

This is very tricky in plain R, so we use the ggplot2 package. If it isn’t installed already, you need to install it before you can load it. You need to install just once on your computer, but you need to load it at the beginning of each session.

library(ggplot2)  # load package; only needed once in a session.

ggplot(data = MyData, mapping = aes(x=stim.nr, y=Z)) +
  xlab("Number of stimuli") +
  ylab("Response amplitude (Units)") +
  stat_summary(fun = mean, geom = "line") +            # plot lines through means
  stat_summary(fun = mean,                             # add points and SD as bars
               fun.min = function(x) mean(x) - sd(x),  # mean - SD
               fun.max = function(x) mean(x) + sd(x),  # mean + SD
               geom = "pointrange", size=1.5)

Figure 3. The response amplitude depends on the number of stimuli (black filled circles, means; bars, standard deviations).

NOTE: ggplot2 is extremely powerful, but the price is that the code can get quite complicated (I find myself googling a lot…). Feel free to reuse and modify my code examples for your plots!

In the above code,

data = ... is your data frame that holds the variables you want to plot;
aes(x=..., y=...) is the “aesthetic” (don’t ask…): this is where you define what variables of your data frame go on the $x$ and $y$ axis of your plot.
xlab and ylab are where you set the axis labels. If you omit these, ggplot() will use the variable names as axis labels.

If you are curious about the rest of the code,

in the stats_summary(fun = mean) bit, fun = tells R what function to use to summarise the data: here, we are using the mean. The fun.min and fun.max functions define the end points of the bars, which are the mean +/- the standard deviation, hence the mean(x) + sd(x) and mean(x) - sd(x) (the x is just a place-holder here, it’s not referring to what is on the $x$-axis)

If you wanted to, you could use fun = median instead of the mean but then you would not want to use standard deviations for the ‘error bars’. You would want first and third quartile instead, so the

fun.min = function(x) mean(x) - sd(x), 
fun.max = function(x) mean(x) + sd(x),

would become

fun.min = function(x) quantile(x, 0.25),  # 1st quartile = 25%
fun.max = function(x) quantile(x, 0.75),  # 3rd quartile = 75%

Finally, we could also plot the raw data with the summary, and jitter them a bit along the $x$-axis so that they don’t fall on top of one another:

ggplot(data = MyData, mapping = aes(x=stim.nr, y=Z)) +
  xlab("Number of stimuli") +
  ylab("Response amplitude (Units)") +
  stat_summary(fun = median, geom = "line") +            # plot lines through medians
  stat_summary(fun = median,                             # add points and quartiles as bars
               fun.min = function(x) quantile(x, 0.25), 
               fun.max = function(x) quantile(x, 0.75), 
               geom = "pointrange", size=1) +
  geom_jitter(width = 0.2, alpha=0.3, size=2.5)

Figure 4. The response amplitude depends on the number of stimuli (grey filled circles, raw data points; black filled circles, medians; bars, interquartile range).

Group means: subsetting data

In our example, the response is clearly highest with four stimuli. If you want to numerically report the mean (and SD) response, how do you get these numbers?

What won’t work for this is the code below, because it will give the total mean across all groups (across all ‘stimulus numbers’):

mean(MyData$Z)

## [1] 4.74601

Instead, to get the mean of only the ‘four stimuli’ group, you need to subset the data like this:

with(subset(MyData, stim.nr == 4), mean(Z))

## [1] 6.737928

Note R uses ==, not =, for testing equality.

“Long” versus “wide” data formats

There are different ways of organising data — not just in R, but in general. The data above were in “long format” where

each column represents exactly one variable (e.g., jump distance or tibial extension);
each row represents exactly one observation.

The data in the practical that compare jump distances before and after blocking tibial flexion are in a different format — they look something like this:

TestData

##        best    block
## 1  4.761333 3.462316
## 2  4.383095 3.646273
## 3  4.407680 5.294008
## 4  4.041479 3.610713
## 5  3.927318 3.713078
## 6  4.724375 5.341713
## 7  4.904521 4.139959
## 8  3.157520 2.957906
## 9  4.527413 3.311798
## 10 3.913458 3.495799

This is called “wide format”. Unlike in the previous example, this format doesn’t actually say what was measured (jump distance, in the Practical data). Also,

the jump distances are ‘spread out’ over two columns (hence “wide”), and
there are more than one observations per row (‘best’ and ‘blocked’ observations for each locust).

It is up to you whether you keep the data in wide format, or whether you change them into long format. Long format is more logical, and required for advanced analyses (final year projects!), but simple plots and tests can also be done in wide format.

Keeping “wide format”

To boxplot the data, you could simply say

boxplot(TestData$best, TestData$block)

(The axis labels would need sorting out, though!).

Note how this syntax is fundamentally different from the syntax we usded earlier, where we had boxplot(y ~ x) with x the grouping variable and y the measured response. The syntax has to be different because the data are in a different format.

Switching to “long format”

The logically sensible way to think about the data is:

the variable on the $x$-axis is the ‘tibia condition’ and it is categorical, taking two values: free and blocked.
and the response on $y$ is jump distance.

But because of the format the data are in, we cannot say boxplot(jump.dist ~ tibia.cond).

We can re-organise the data into a new data frame, in “long format”. The rep(...) bit stands for ‘repeat’ and fills the column tibia.cond with as many repeats of “free” and “blocked” as there are rows in the wide-format version of the data:

TestData.long <- data.frame(
  jump.dist = c(TestData$best, TestData$block),  # puts all jump distances in one column
  tibia.cond = rep(c("free", "blocked"), each = nrow(TestData))  # second column indicating condition
)

TestData.long

##    jump.dist tibia.cond
## 1   4.761333       free
## 2   4.383095       free
## 3   4.407680       free
## 4   4.041479       free
## 5   3.927318       free
## 6   4.724375       free
## 7   4.904521       free
## 8   3.157520       free
## 9   4.527413       free
## 10  3.913458       free
## 11  3.462316    blocked
## 12  3.646273    blocked
## 13  5.294008    blocked
## 14  3.610713    blocked
## 15  3.713078    blocked
## 16  5.341713    blocked
## 17  4.139959    blocked
## 18  2.957906    blocked
## 19  3.311798    blocked
## 20  3.495799    blocked

Now we can say,

boxplot(jump.dist ~ tibia.cond, TestData.long)

and we can also use ggplot2 if we so fancy:

ggplot(data = TestData.long, mapping = aes(x=tibia.cond, y=jump.dist)) +
  geom_boxplot()

Two-sample t-tests

The syntax for a simple t-test follows that of a boxplot — except that you need to decide whether the data are paired or not!!

In ‘wide’ format

Unpaired t-test:

t.test(TestData$best, TestData$block)

## 
##  Welch Two Sample t-test
## 
## data:  TestData$best and TestData$block
## t = 1.2405, df = 15.472, p-value = 0.2333
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2693993  1.0243249
## sample estimates:
## mean of x mean of y 
##  4.274819  3.897356

Paired t-test:

t.test(TestData$best, TestData$block, paired = TRUE)

## 
##  Paired t-test
## 
## data:  TestData$best and TestData$block
## t = 1.6918, df = 9, p-value = 0.1249
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.1272609  0.8821865
## sample estimates:
## mean difference 
##       0.3774628

In ‘long’ format