1 Introduction

1.1 Setting up your session

At the start of each session, make sure that you:

  1. place both your script (.r or .rmd) and your data in the same folder
  2. set the working directory to the same folder: Session –> Set working directory –> to source file location
  3. import your data – make sure the file name matches the one you have in the folder.

1.2 Simple maths

R can function like a calculator. You can do basic maths with the following operators:

+    Addition               -        Subtraction
*    Multiplication         /        Division
^    Exponent (e.g. 2^3)    sqrt()   Square root

Some logical operators (return ‘TRUE’ or ‘FALSE’ as output):

>    Greater than        <    Smaller than
>=   Greater or equal    <=   Smaller or equal
==   Equal to            !=   Not equal to
&    “and”               |    “or”
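
As a quick illustration of the operators above, the lines below can be typed straight into the console. The numbers are arbitrary examples chosen for this sketch:

```r
# Arithmetic
7 + 2      # addition: 9
7 - 2      # subtraction: 5
7 * 2      # multiplication: 14
7 / 2      # division: 3.5
2^3        # exponent (2 to the power of 3): 8
sqrt(16)   # square root: 4

# Logical comparisons return TRUE or FALSE
5 > 3                # TRUE
5 == 3               # FALSE
5 != 3               # TRUE
(5 > 3) & (2 > 4)    # FALSE: "and" needs both sides to be TRUE
(5 > 3) | (2 > 4)    # TRUE: "or" needs at least one side to be TRUE
```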

1.3 Functions

A function in R is a reusable block of code designed to perform a specific task. Functions take inputs, called arguments or parameters, process them, and return an output or result. They are an essential part of programming in R, helping to automate repetitive tasks and break down complex operations into smaller, manageable parts.

Each function has a name, followed by brackets, for example: mean(), sum(), round(), etc.

Arguments are the information you put into the function. For example, mean(x) calculates the mean value of x (which must be a vector of numbers). round(x, digits = 1) contains two arguments: x (the number to be rounded) and digits = ..., which is the number of decimal places to round to. Note that round() rounds to the nearest value, not always upwards.
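
A minimal sketch of calling functions with one and two arguments. The vector `scores` is a made-up example, not data from this guide:

```r
scores <- c(4.25, 6.5, 7.75)       # hypothetical example values

mean(scores)                       # one argument: the vector to average
round(mean(scores), digits = 1)    # two arguments: 6.2
round(3.14159, digits = 2)         # 3.14
```

Notice that the output of one function (mean()) can be used directly as an argument of another (round()).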


1.4 Objects

You can save individual numbers, words, vectors or whole spreadsheets as objects for later use. You do this with the format name <- value.

# Saving individual numbers:
participant_age <- 45

# Saving individual character strings (words)
participant_name <- "Alex"

To save multiple items in a vector, use the c() function. If the items are character strings, put them in "" quotation marks.

# Saving a vector of numbers
ages <- c(19, 22, 24, 19, 29, 26, 21)

# Saving a vector of strings 
names  <- c("Kathy", "David", "Joe", "Jenny", "Alex")

1.5 Object classes

Objects in R can have different classes:

  • numeric = a number
  • character = a string (a word or a series of words)
  • factor = a categorical variable

You can’t always tell what class an object is by looking at it. To check, you can pass an object through the class() function:

participant_age <- 45   # save the number 45 as `participant_age`
class(participant_age)  # check the class of the new object - it should be "numeric"
## [1] "numeric"

Compare with this case, where the number “45” is saved as a string, rather than a numeric, by mistake:

participant_age <- "45"   # because of the quote marks, 45 is a character string, not a number
class(participant_age)    # check the class of `participant_age` - the result should be "character"
## [1] "character"

To correct an object’s class, you can coerce it to the desired class with as.numeric() for numbers, as.character() for strings/words, as.factor() for categorical.

participant_age <- as.numeric(participant_age) # overwrite `participant_age` as a numeric object
class(participant_age) # check its class - it should be "numeric" again
## [1] "numeric"
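
The same pattern works for the other coercion functions. The sketch below uses a made-up vector of condition labels to show as.factor(), and also shows what happens when a value cannot be converted:

```r
condition <- c("control", "reward", "control", "punishment")  # hypothetical labels
class(condition)                     # "character"

condition <- as.factor(condition)    # coerce to a categorical variable
class(condition)                     # "factor"
levels(condition)                    # "control" "punishment" "reward"

as.numeric("forty-five")             # NA, with a warning: words cannot be read as numbers
```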

1.6 Packages

Packages contain additional functions created by the R community. The tidyverse package combines multiple valuable packages, such as dplyr, which makes data wrangling easier, and ggplot2, which is used for data visualisation.

You need to install a package on your computer before you can use it. You need to do this only once. After that, you just need to load that package into R with library() whenever you want to use it.

# Install a package for the first time: 
install.packages("tidyverse")   # notice the quote marks around the package name

# Load a package for use: 
library(tidyverse)

# Update an old package: 
update.packages(oldPkgs = "tidyverse")   # without `oldPkgs =`, the name would be read as a library folder

# Remove a package:
remove.packages("tidyverse")

# Install an older version of a package:
install.packages(
    "https://cran.r-project.org/src/contrib/Archive/tidyverse/tidyverse_1.3.1.tar.gz", 
    repos=NULL, 
    type="source")
# NOTE: Replace the URL with a link to the version of the package you want to install.  
# Links to older versions of all packages can be found here: 
# https://cran.r-project.org/src/contrib/Archive/

1.7 Dataframes

A dataframe is a two-dimensional (consisting of rows and columns) data structure used to store data in a table format, where each column can contain different types of data (numeric, character, factor).

One of the most commonly used types of dataframe is called a tibble. You can create one from scratch with the tibble() function from the tidyverse package. Inside the brackets you must define each column’s contents in the format column name = contents:

data <- tibble(
  id = 1:5, 
  gender = c("Female", "Male", "Male", "Female", "Female"), 
  age    = c(22, 23, 28, 24, 22),
  score  = c(54, 49, 51, 62, 53)
  )
print(data)
## # A tibble: 5 × 4
##      id gender   age score
##   <int> <chr>  <dbl> <dbl>
## 1     1 Female    22    54
## 2     2 Male      23    49
## 3     3 Male      28    51
## 4     4 Female    24    62
## 5     5 Female    22    53

1.8 Importing data from Excel

You can import data into R with the read.csv() function. Notice that the spreadsheet you’re importing must be in .csv format (Comma-Separated Values). This format retains only the values in the spreadsheet, omitting the formatting of richer Excel documents (like .xlsx files). If your data is in a different format (such as .xlsx or .numbers), you must open it in Excel first and save it as .csv.

# Import data from a specific .csv on your computer: 
data <- read.csv("name_of_spreadsheet.csv")

# Manually select the file to import from your file browser: 
data <- read.csv(file.choose())

# Import data from an online file:
data <- read.csv("https://... link to the file goes here") 

1.9 Pipes %>%

The pipe operator %>% is part of the tidyverse and allows you to chain multiple functions together. The pipe sends the output of one function as input into another, and its output into another. For example:

data %>%                          # start with the dataframe 'data'
  filter(gender == "Female") %>%  # filter the female participants
  pull(score) %>%                 # extract the variable 'score'
  mean() %>%                      # calculate the mean of women's scores
  round(1)                        # round that mean to 1 decimal place
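
To see what the pipe saves you, here is the same calculation written twice: once as nested function calls and once with pipes. This is only a sketch; it rebuilds the small example dataframe from section 1.7 so that it can run on its own:

```r
library(tidyverse)

# Example data from section 1.7
data <- tibble(
  id     = 1:5,
  gender = c("Female", "Male", "Male", "Female", "Female"),
  age    = c(22, 23, 28, 24, 22),
  score  = c(54, 49, 51, 62, 53)
)

# Without pipes: the code has to be read from the inside out
nested <- round(mean(pull(filter(data, gender == "Female"), score)), 1)

# With pipes: the code reads from top to bottom
piped <- data %>%
  filter(gender == "Female") %>%
  pull(score) %>%
  mean() %>%
  round(1)

identical(nested, piped)   # TRUE: both versions give 56.3
```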

1.10 Extracting a variable from a dataframe

This can be done either with the $ operator or with the pull() function:

# This extracts the variable 'var' from the dataset 'data':
data$var

# This achieves the same but with pull():
pull(data, var)

# Pipes (%>%) don't handle the $ operator well. When coding with pipes, use pull() instead of $:
data %>% pull(var)

1.11 Previewing data

To see the top rows of a dataset (rather than the full dataframe), use head(). This is useful for getting a quick preview of the structure of a dataframe without printing it out in its entirety:

crime_data <- USArrests # Save `USArrests` data as the object `crime_data` 
head(crime_data)        # Preview the top rows of `crime_data`
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

2 Data wrangling

2.1 Combine dataframes: inner_join()

We can use inner_join() to merge two dataframes. The format is inner_join(dataframe1, dataframe2, by = ...). The by=... argument tells R which variable to match the two datasets by. Usually you will want to match by any columns that are the same across both dataframes.

# Match demographic data and scores by participant ID number:
data <- inner_join(demographics, scores, by = "id")

# Notice that c() is needed when matching multiple variables:
data <- inner_join(responses, pinfo, by = c("id", "intervention"))
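
A small runnable sketch of what inner_join() keeps. The two dataframes below are made up for illustration: participant 4 has no score and participant 5 has no demographic data, so neither appears in the result.

```r
library(tidyverse)

# Hypothetical example data
demographics <- tibble(id = c(1, 2, 3, 4), age   = c(22, 23, 28, 24))
scores       <- tibble(id = c(1, 2, 3, 5), score = c(54, 49, 51, 62))

joined <- inner_join(demographics, scores, by = "id")
joined   # only ids 1, 2 and 3 remain, because they appear in both dataframes
```

If rows seem to go missing after a join, comparing nrow() before and after is a quick way to check how many participants were matched.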

2.2 Include or exclude variables/columns: select()

Use select() to include some columns/variables of interest while omitting others.

data <- data %>% select(id, gender, condition, score)

# To exclude one specific variable and keep all others in, use the `-` sign:
data <- data %>% select(-score) # omit 'score', keep everything else

select() is useful for rearranging columns in a dataframe. For example, I can move the column ‘condition’ to the front, followed by everything else (use everything() to quickly reference all variables in the dataset):

data <- data %>% select(condition, everything() )

2.3 Include or exclude rows: filter()

While select() deals with specific columns, filter() deals with specific rows or observations. The format is filter(dataset, variable == condition):

data %>% filter(gender == "Female")     # keep only the female participants

data %>% filter(score  > 50)            # keep only scores above 50

data %>% filter(condition != "control") # keep participants who are NOT in the control condition

You can link multiple conditions with & (“and”) or | (“or”):

# Keep participants who scored below 20 or above 70
data %>% filter(score < 20 | score > 70)  

# Keep participants in the experimental condition who passed the attention check: 
data %>% filter(condition == "experimental" & attention == "Pass") 

2.4 Change the order of rows: arrange()

# Arrange the dataset by participant name in alphabetical order
data %>% arrange(name)

# Arrange the dataset from lowest to highest scorers
data %>% arrange(score)

# The same arrangements but from highest to lowest - descending order
data %>% arrange(desc(score))

2.5 Create new variables (columns): mutate()

mutate() is a powerful function that can be used to create new columns based on values in existing ones. The syntax is data %>% mutate(new_variable = values), where ‘new_variable’ is the name of the new column and ‘values’ is the values that the new column should contain.

For example, let’s say you have participants’ year of birth in the column ‘year_of_birth’ and you want to calculate their ages based on it:

ID   year_of_birth
1    1997
2    1997
3    2004
4    1996
5    2000

You could run data %>% mutate(age = 2025 - year_of_birth). This will produce a new dataframe identical to the old one, but with an added column called ‘age’ that equals the current year (2025) minus the year of birth:

ID   year_of_birth   age
1    1997            28
2    1997            28
3    2004            21
4    1996            29
5    2000            25

2.6 Recoding variables: recode()

recode() is used for recoding the values of a categorical variable. For example, if gender is coded in your data as 1 (for male) and 0 (for female), you can recode it as “Male” and “Female” instead. The formula is:

recode(old_variable, 'old value 1' = 'new value 1', 'old value 2' = 'new value 2', ...)

For example:

recode(gender, `1` = "Male", `0` = "Female")

This function is usually combined with mutate() if you want to save the recoded variable in the old dataset. For example:

data <- data %>%       # Create a new dataset 'data' overwriting the existing 'data'
  mutate(              # mutate() creates a new column 'gender' overwriting the old 'gender'
    gender = recode(   # the new 'gender' will consist of the recoded 'gender' values
      gender,
      `1` = "Male",    # The old value '1' is recoded as 'Male' (numeric old values need backticks)
      `0` = "Female"   # The old value '0' is recoded as 'Female'
      )
    )

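Note: Keep in mind that numeric old values on the left-hand side must be wrapped in backticks (` `), because a bare number cannot be used as an argument name. On the right-hand side, new values that should remain numbers should not be enclosed in quote marks (" "), otherwise they will be stored as strings rather than numeric values. See the section on object classes for more details.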
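
A self-contained sketch of the recoding step, using a made-up dataframe in which gender is stored as the numbers 1 and 0. Because a bare number cannot be used as an argument name in R, the old values are written in backticks here:

```r
library(tidyverse)

# Hypothetical example data: gender coded as 1 (male) and 0 (female)
data <- tibble(id = 1:4, gender = c(1, 0, 0, 1))

data <- data %>%
  mutate(gender = recode(gender, `1` = "Male", `0` = "Female"))

data$gender          # "Male" "Female" "Female" "Male"
class(data$gender)   # "character"
```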


2.7 Calculating a mean score from multiple questions

Many psychological questionnaires contain multiple questions designed to reflect the same underlying construct. For example, personality questionnaires will ask multiple questions measuring extraversion; your responses to each question are converted to scores, and a mean extraversion score is calculated out of them. In this example, we have a dataset with 5 questions and we need to calculate a mean questionnaire score across all 5 questions for each participant:

Individual questions
ID   Q1     Q2     Q3     Q4     Q5     Mean_score
1    25.5   18.8   23.7   18.1   21.5   ?
2    31.9   27.1   37.7   15.7   17.3   ?
3    26.8   22.8   30.0   19.1   27.4   ?
4    19.9   23.0   15.2   15.9   24.6   ?
5    22.3   21.6   31.2   17.1   19.4   ?

There are different ways of calculating such mean scores. Note that you cannot use the mean() function on its own for this. mean() requires a single numeric vector - i.e., a single column in your dataframe, whereas here you are trying to find the mean across multiple columns, row by row.

Option 1: Calculate it manually: add up the 5 questions and divide by the number of questions:

quest_scores %>% 
  mutate(Mean_score = (Q1 + Q2 + Q3 + Q4 + Q5)/5) 

Individual questions and mean questionnaire score
ID   Q1     Q2     Q3     Q4     Q5     Mean_score
1    25.5   18.8   23.7   18.1   21.5   21.52
2    31.9   27.1   37.7   15.7   17.3   25.94
3    26.8   22.8   30.0   19.1   27.4   25.22
4    19.9   23.0   15.2   15.9   24.6   19.72
5    22.3   21.6   31.2   17.1   19.4   22.32

Option 2: This is a variation on the first solution. Instead of adding up the columns in the format Q1 + Q2 + ..., we can use sum(). The problem is this function only sums the items in a single vector, not multiple columns. To address this, we need to tell R to sum the columns row by row. This is done with the function rowwise():

quest_scores %>%
  rowwise() %>% 
  mutate(Mean_score = sum(Q1, Q2, Q3, Q4, Q5, na.rm = TRUE) / 5) %>%
  ungroup()

Note that we added ungroup() at the end of the operation above. This is because rowwise() groups the data by row – all functions after that will be performed on a row-by-row basis, rather than on the whole dataset. To revert the dataset back to normal, we have to ‘ungroup’ it.

Alternatively, if you have too many columns and can’t list all of them one by one, you can specify the range of columns – from Q1 to Q5. Inside rowwise(), use c_across(Q1:Q5), which combines the selected columns of the current row into a single vector that sum() can add up:

quest_scores %>%
  rowwise() %>% 
  mutate(Mean_score = sum(c_across(Q1:Q5), na.rm = TRUE) / 5) %>%
  ungroup()
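
One thing to watch with na.rm = TRUE: if a participant skipped a question, the sum ignores the missing answer but the code above still divides by 5, which pulls the mean down. The sketch below compares that with mean(), which divides by the number of answers actually given. It uses c_across(), dplyr's helper for combining several columns within a row, and a made-up two-row dataset in which participant 2 skipped Q3:

```r
library(tidyverse)

# Hypothetical data: participant 2 has a missing answer on Q3
quest_scores <- tibble(
  ID = 1:2,
  Q1 = c(25.5, 31.9), Q2 = c(18.8, 27.1), Q3 = c(23.7, NA),
  Q4 = c(18.1, 15.7), Q5 = c(21.5, 17.3)
)

result <- quest_scores %>%
  rowwise() %>%
  mutate(
    sum_over_5 = sum(c_across(Q1:Q5), na.rm = TRUE) / 5,  # always divides by 5
    mean_avail = mean(c_across(Q1:Q5), na.rm = TRUE)      # divides by the answers given
  ) %>%
  ungroup()

result %>% select(ID, sum_over_5, mean_avail)
# Participant 1: 21.52 either way; participant 2: 18.4 versus 23
```

Which version is appropriate depends on how your questionnaire says missing answers should be handled.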

3 Summary statistics with summarise()

The function summarise() takes an existing dataframe and produces a table with summary statistics for it, such as counts, means, medians, standard deviations, etc. It uses the formula:

data %>% summarise(
  column1 = values, 
  column2 = values,
  column3 = values,
  ...
)

For example, let’s say we’re working with a dataset containing participant test scores collected after participants were given either a reward (reward condition), a punishment (punishment condition), or neither (control condition). We also have age and gender data:

id   age   gender   condition    score
1    23    male     control      51.4
2    26    female   punishment   51.2
3    29    female   control      43.5
4    33    female   punishment   31.0
5    34    female   reward       50.1
6    39    female   reward       40.1

We can use summarise() to calculate the number of people in the sample, their mean age, and its standard deviation:

data %>%                   # start with the original data
  summarise(               # start building the summary table
    N = n(),               # n() prints the number of observations - no further arguments needed
    Age_mean = mean(age),  # 'Age_mean' is just what we call this column, and 'mean(age)' calculates the values
    Age_sd   = sd(age)
  )
## # A tibble: 1 × 3
##       N Age_mean Age_sd
##   <int>    <dbl>  <dbl>
## 1    50     33.9   6.88

We can also combine this function with group_by() to group our data by some categorical variable and generate summary statistics for each group. For example, the following code will produce the same summary table as above, but the summary statistics will be grouped by gender:

data %>%                 # start with the dataframe
  group_by(gender) %>%   # group it by 'gender'
  summarise(             # the resulting summaries will be generated separately for the two genders
    N = n(),               
    Age_mean = mean(age),  
    Age_sd   = sd(age)
  )
## # A tibble: 2 × 4
##   gender     N Age_mean Age_sd
##   <chr>  <int>    <dbl>  <dbl>
## 1 female    30     34.6   6.55
## 2 male      20     32.8   7.40

This is also useful for calculating mean scores for the different conditions in a study:

data %>%
  group_by(condition) %>%      # group data by condition
  summarise(                   
    N = n(),                   # get the number of participants in each condition
    Score_mean = mean(score),  # get the mean score in each condition
    Score_sd   = sd(score)     # get the SD of scores in each condition
  )
## # A tibble: 3 × 4
##   condition      N Score_mean Score_sd
##   <chr>      <int>      <dbl>    <dbl>
## 1 control       21       43.4     9.47
## 2 punishment    13       42.7    10.7 
## 3 reward        16       42.6     9.92

You can get creative with the summary table. The code below adds an extra column counting the number of participants per condition that failed the test (scoring under 40%). score < 40 creates a logical vector in which participants who meet the condition (scoring under 40) are marked as ‘TRUE’ and those that do not - ‘FALSE’. sum(score < 40) adds up the number of TRUE items in the vector, producing the total number of participants with scores under 40:

data %>%
  group_by(condition) %>%      
  summarise(                   
    N = n(),                   
    Score_mean = mean(score), 
    Score_sd   = sd(score),     
    Fails = sum(score < 40)  
  )
## # A tibble: 3 × 5
##   condition      N Score_mean Score_sd Fails
##   <chr>      <int>      <dbl>    <dbl> <int>
## 1 control       21       43.4     9.47     7
## 2 punishment    13       42.7    10.7      5
## 3 reward        16       42.6     9.92     6
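
The same trick works for proportions: because TRUE counts as 1 and FALSE as 0, mean(score < 40) gives the share of participants under 40. The sketch below uses a small made-up dataset (not the study data shown above) so that it can run on its own:

```r
library(tidyverse)

# Hypothetical scores for illustration
data <- tibble(
  condition = rep(c("control", "reward"), each = 4),
  score     = c(35, 45, 55, 38,  30, 60, 70, 80)
)

fail_summary <- data %>%
  group_by(condition) %>%
  summarise(
    N         = n(),
    Fails     = sum(score < 40),    # number of scores under 40
    Fail_prop = mean(score < 40)    # proportion of scores under 40
  )

fail_summary
# control: 2 fails out of 4 (0.5); reward: 1 fail out of 4 (0.25)
```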


4 Data visualisation

Base R has functions for making simple graphs, like plot() and hist(). For example:

hist(data$score)

However, the package ggplot2 (part of the tidyverse) gives us many more options for customising our graphs. It works by building a basic plot with the ggplot() function and adding graphic layers (called ‘geoms’) on top of it.

We can start making a simple plot by adding the following arguments in the brackets of ggplot():

  1. the name of the dataset (‘data’) and

  2. an aes() function, which will contain the X- and Y-axes of the plot:

ggplot(data, aes(...))
# or:
data %>% ggplot(aes(...))

We then map variables to the X- and Y-axes:

ggplot(data, aes(x = condition, # X-axis will represent condition
                 y = score))    # Y-axis will represent scores

If you run this code on its own, you’ll see an empty plot with nothing but the two axes - this will be the foundation of our graph.

We now have to add an extra graphical layer containing the type of graph we want to make. These layers are called ‘geoms’. To make a boxplot, add a geom_boxplot():

ggplot(data, aes(x = condition, 
                 y = score)) +
  geom_boxplot()    # A new layer containing the boxplot geom

Additional layers can be added to change the…

  • axis labels and title: + labs(x = "...", y = "...", title = "...") - by default all labels are taken from the variable names and contents, which cannot contain spaces and might not be capitalised. You can edit them to add capital letters and spaces (e.g., rename ‘mean_score’ to ‘Mean test scores’) or to make them more detailed.
  • ticks on the X-axis: + scale_x_discrete(labels = c("Control", "Punishment", "Reward"))
  • theme: + theme_classic(), theme_minimal(), etc. - look online for a list of themes available

ggplot(data, aes(x = condition, 
                 y = score)) +
  geom_boxplot() +
  labs(x = "Experimental condition",     # Change the label of the X-axis
       y = "Test scores",                # Change the label of the Y-axis
       title = "Differences in scores between conditions")        +  # Change the graph's title
  scale_x_discrete(labels = c("Control", "Punishment", "Reward")) +  # Change the group labels
  theme_minimal()                        # Change the visual theme

In addition to mapping the X- and Y-axes in aes(), you can map a third variable to the ‘fill’ parameter. ‘fill’ defines the colour of the boxes, but alternatively, you can map a variable to ‘colour’ instead, which controls the outlines of the boxes rather than the fill. You may need additional arguments to change the labels and ticks of the new grouping variable as well.

ggplot(data, aes(x = condition,
                 y = score,
                 fill = gender)) +    # Adding "gender" as a new grouping variable
  geom_boxplot() +
  labs(x = "Experimental condition", 
       y = "Test scores", 
       fill = "Gender",              # Changing the label of the newly added variable
       title = "Differences in scores between conditions") +  
  scale_x_discrete(labels = c("Control", "Punishment", "Reward")) + 
  scale_fill_discrete(labels = c("Female", "Male")) + # Changing the ticks of the newly added variable
  theme_minimal()  



5 Probabilities

Binomial distributions   Normal distributions   Explanation
(discrete events)        (continuous values)
dbinom()                 dnorm()                density: probability of observing a given value
pbinom()                 pnorm()                cumulative: probability of observing values up to a given threshold
qbinom()                 qnorm()                quantile: what value is in the bottom x% of the distribution?
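
The normal-distribution functions follow the same d/p/q pattern as the binomial ones. As a brief sketch, here they are applied to the standard normal distribution (mean = 0 and sd = 1 are R's defaults) and to a made-up test with a mean of 100 and a standard deviation of 15:

```r
# Standard normal distribution
dnorm(0)        # height of the density curve at 0: about 0.399
pnorm(1.96)     # probability of a value up to 1.96: about 0.975
qnorm(0.975)    # value below which 97.5% of the distribution falls: about 1.96

# Hypothetical example: scores with mean 100 and sd 15
pnorm(115, mean = 100, sd = 15)   # probability of a score up to 115: about 0.841
```

Notice that pnorm() and qnorm() undo each other: one goes from a value to a cumulative probability, the other from a cumulative probability back to a value.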

Practical examples:


5.0.1 dbinom() - the density function

dbinom(x, size, prob) gives you the probability of an outcome occurring exactly x times, given the total number of trials (size) and the probability (prob) of that outcome on a single trial.

Example: what is the chance that exactly one person has the same birthday as you in a class of 30 (that is, among your 29 classmates)?

dbinom(x = 1, size = 29, prob = 1/365) = 0.074, or 7.4%.
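
To see where this number comes from, the sketch below works out the same probability by hand with the binomial formula and compares it with dbinom(). The variable names are just for illustration:

```r
p_same <- 1 / 365   # chance that one classmate shares your birthday
n      <- 29        # number of classmates (the class of 30 minus you)

# Binomial formula for exactly 1 match: choose(n, 1) * p^1 * (1 - p)^(n - 1)
by_hand <- choose(n, 1) * p_same^1 * (1 - p_same)^(n - 1)
with_r  <- dbinom(x = 1, size = n, prob = p_same)

round(by_hand, 3)            # 0.074
all.equal(by_hand, with_r)   # TRUE
```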

To visualise this in ggplot(), we can create a tibble that contains the probabilities of different numbers of people having the same birthday as you. You can then plot this tibble in a bar chart, or explore the tibble manually in R to check the probabilities of any given number of students having the same birthday as you:

#calculate probability of each number of people with the same birthday as you and save in a data frame
bdays <- tibble(nstudents = 0:29, p = dbinom(x = 0:29, size = 29, prob = 1 / 365))

#plot in a bar graph
ggplot(bdays, aes(x = nstudents, y = p)) +
  geom_bar(stat = "identity", fill = "#4E84C4") +
  theme_classic() +
  labs(x = "Number of students", y = "Probability (P)") +
  scale_x_continuous(breaks = seq(0, 29, 1))



5.0.2 pbinom() - the distribution function

pbinom(q, size, prob) gives you the cumulative probability of the outcome occurring up to and including q times, given the total number of trials (size) and the probability (prob) of that outcome on a single trial (e.g. what is the chance of getting 3 or fewer heads in 10 coin flips?). This is known as the cumulative distribution function.

Example: what is the probability that no more than 3 out of 10 people taking their driving test today pass? The cut-off is 3, so q=3, and the question is about the number of passes out of 10 tests so, size = 10. The probability of passing a driving test in the UK is 0.463 (46.3% of tests in 2019-2020 were passes), so prob = 0.463.

pbinom(q = 3, size = 10, prob = 0.463) = 0.24, or 24%.
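
A quick way to convince yourself of what "cumulative" means is to add up the individual probabilities from dbinom() and compare the total with pbinom(), as in this short sketch:

```r
# P(0 passes) + P(1 pass) + P(2 passes) + P(3 passes)
summed <- sum(dbinom(x = 0:3, size = 10, prob = 0.463))
cumul  <- pbinom(q = 3, size = 10, prob = 0.463)

round(summed, 2)           # 0.24
all.equal(summed, cumul)   # TRUE
```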

In this graph, you can see the distribution of probabilities visualised. 3 passes or fewer are highlighted in orange. The cumulative probability (24%) is the added probability of 0, 1, 2, and 3 passes out of 10:

tibble(num_pass = 0:10, P = dbinom(x = 0:10, size = 10, prob = 0.463)) %>% 
  mutate(highlight = num_pass > 3) %>%   # TRUE for the bars above the cut-off of 3
  ggplot(aes(x = num_pass, y = P)) +
  geom_bar(stat = "identity", aes(fill = highlight)) +
  theme_classic() +
  labs(x = "Number of passes out of 10 tests", y = "Probability (p)") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  theme(legend.position = "none")



But what if we wanted to know the probability of outcomes above a certain value rather than below? Say we want to know the probability of there being more than 5 passes in 10 tests. To do this we need to add the argument lower.tail = FALSE. The “lower tail” is the values up to and including the cut-off point (the left side of the graph, highlighted in orange). lower.tail = FALSE cuts out the lower tail, leaving the upper tail (i.e., values strictly above the cut-off point, or the right side of the graph). The code for more than 5 passes (that is, 6 or more) would be:

pbinom(q = 5, size = 10, prob = 0.463, lower.tail = FALSE) = 0.29, or 29%. Visualised, it would look like this:

tibble(num_pass = 0:10, P = dbinom(x = 0:10, size = 10, prob = 0.463)) %>% 
  mutate(highlight = num_pass < 6) %>%   # TRUE for the bars at or below the cut-off of 5
  ggplot(aes(x = num_pass, y = P)) +
  geom_bar(stat = "identity", aes(fill = highlight)) +
  theme_classic() +
  labs(x = "Number of passes out of 10 tests", y = "Probability (p)") +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  theme(legend.position = "none")
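
Because the cut-off itself belongs to the lower tail, it is easy to be off by one with lower.tail = FALSE. The sketch below checks the upper tail against 1 minus the lower tail, and shows how to ask for "5 or more" instead of "more than 5":

```r
upper      <- pbinom(q = 5, size = 10, prob = 0.463, lower.tail = FALSE)  # P(more than 5 passes)
complement <- 1 - pbinom(q = 5, size = 10, prob = 0.463)                  # 1 - P(5 or fewer passes)
at_least_5 <- pbinom(q = 4, size = 10, prob = 0.463, lower.tail = FALSE)  # P(5 or more passes)

round(upper, 2)                # 0.29
all.equal(upper, complement)   # TRUE
round(at_least_5, 2)           # 0.53
```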




5.0.3 qbinom() - the quantile function

qbinom(p, size, prob) is the inverse of pbinom(): given a cumulative probability (p), the total number of trials (size) and the probability of the outcome on a single trial (prob), it returns the smallest number of occurrences whose cumulative probability is at least p.
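
As a short sketch, here is qbinom() applied to the driving-test numbers from the pbinom() section (10 tests, pass rate 0.463), asking for the median number of passes. The choice of p = 0.5 is just an example:

```r
q50 <- qbinom(p = 0.5, size = 10, prob = 0.463)
q50                                            # 5

# Check against pbinom(): 5 is the smallest count whose cumulative probability reaches 0.5
pbinom(q = q50,     size = 10, prob = 0.463)   # about 0.71, so at least 0.5
pbinom(q = q50 - 1, size = 10, prob = 0.463)   # about 0.47, so still below 0.5
```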


dnorm()

pnorm()

qnorm()



6 Tests

6.1 Comparing two groups

To compare two groups on some continuous measure, we typically use a t-test (or a non-parametric equivalent if the assumptions are not met).

6.1.1 Independent-samples t-test

Used when the two groups are independent (between-subjects design - each participant takes part in only one condition).

Assumptions:

  • IV has only 2 groups/conditions
  • DV is continuous (interval or ratio - not ordinal)
  • DV scores for both groups are normally distributed - check with a Shapiro-Wilk test (run shapiro.test(data$scores) on each group separately), a Q-Q plot (qqPlot(data$scores) from the car package, or qqnorm(data$scores) in base R), or eyeball it with a histogram (hist(data$scores)).

How to run the test:

t.test(vector of group A scores, 
       vector of group B scores)

# Note that scores for each group need to be in a separate vector, so you'll have to adapt your code depending on whether the data is in long or wide format. 

# WIDE format: each group's scores are in a separate column (e.g., `groupA` and `groupB`):
t.test(data$groupA,
       data$groupB)

# LONG format: all scores are in the same column (`scores`), and group membership is marked in a separate column (`group`), so you have to filter the dataset per condition and extract the scores separately: 
t.test(data %>% filter(group == "A") %>% pull(scores), 
       data %>% filter(group == "B") %>% pull(scores)
       )
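
For long-format data, t.test() also accepts a formula in the form outcome ~ group, which avoids the filtering step. The sketch below runs the test both ways on a small made-up dataset and checks that the results agree:

```r
# Hypothetical long-format data for illustration
data <- data.frame(
  group  = rep(c("A", "B"), each = 5),
  scores = c(12, 15, 11, 14, 13, 18, 17, 20, 16, 19)
)

# Formula interface: outcome ~ grouping variable
res_formula <- t.test(scores ~ group, data = data)

# The same test with two separate vectors
res_vectors <- t.test(data$scores[data$group == "A"],
                      data$scores[data$group == "B"])

all.equal(res_formula$p.value, res_vectors$p.value)   # TRUE
```

The formula version only works when the grouping column contains exactly two groups.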

6.1.1.1 Effect size (Cohen’s d) for an independent-samples test

library(lsr)   # cohensD() comes from the lsr package

cohensD(data$groupA,
        data$groupB,
        method = "unequal") # in cases where the groups’ sample sizes and variances are unequal

6.1.1.2 Non-parametric equivalent: Mann-Whitney U-test, a.k.a. Wilcoxon rank-sum test

Can be run when the IV has two independent groups and the DV is ordinal or non-normally distributed.

# For WIDE data (groups A and B have scores in separate columns: 'groupA' and 'groupB')
wilcox.test(data$groupA, data$groupB)

# For LONG data (all scores are in column 'scores', group labels are in column 'group')
wilcox.test(data %>% filter(group == "A") %>% pull(scores),
            data %>% filter(group == "B") %>% pull(scores),
            correct = FALSE   # this is optional - it turns off the continuity correction
            )

6.1.2 Paired-samples t-test

Used when all participants take part in both conditions (within-subjects / repeated measures design), e.g. participants are tested before and after an intervention. In the example below, we have two conditions: ‘before’ and ‘after’.

Assumptions:

  • IV has only 2 conditions
  • DV is interval or ratio (not ordinal)
  • The difference between the two conditions is normally distributed

Before running the test, check for…

  1. Missing data: data %>% filter(is.na(before) | is.na(after)) %>% count()

  2. The difference between the ‘before’ and ‘after’ conditions is normally distributed:

# Step 1: Calculate the difference between the two conditions: 
data <- data %>% mutate(difference = after - before)
# Step 2: Check for normality: 
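qqnorm(data$difference)   # Q-Q plot of the differences against a normal distribution
qqline(data$difference)   # adds a reference line to the Q-Q plot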
shapiro.test(data$difference)
# Step 3: Based on the outcome, decide whether to run a t-test or a non-parametric equivalent. 

The code for running the paired-samples t-test in R is the same as the independent-samples t-test, but with an added paired = TRUE at the end:

# Example: scores for the two conditions are saved in the columns as 'before' and 'after': 
t.test(data$before, 
       data$after, 
       paired = TRUE)

# Example: all scores are in the column 'scores', and condition labels are in the column 'condition':
t.test(data %>% filter(condition == "before") %>% pull(scores),
       data %>% filter(condition == "after") %>% pull(scores),
       paired = TRUE)
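
This also explains why the normality assumption is about the differences: a paired-samples t-test is equivalent to a one-sample t-test on the difference scores against 0. A minimal sketch with made-up scores for 6 participants:

```r
# Hypothetical before/after scores
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(12, 15, 9, 17, 12, 16)

paired_res <- t.test(after, before, paired = TRUE)
diff_res   <- t.test(after - before, mu = 0)   # one-sample test on the differences

all.equal(paired_res$p.value, diff_res$p.value)   # TRUE
```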

6.1.2.1 Effect size (Cohen’s d) for a paired-samples test

cohensD(data$before,
        data$after,
        method = "paired") 

6.1.2.2 Non-parametric equivalent: Wilcoxon signed-rank test

Same participants take part in both conditions, but the data is either ordinal or the difference between the two conditions is not normally distributed.

wilcox.test(data$before, 
            data$after, 
            paired = TRUE,
            correct = FALSE  # this is optional - it turns off the continuity correction
            )

6.1.3 One-sample t-test

Used for comparing a single group to a pre-established mean score.

t.test(data$scores,
       mu = ...,                 # the pre-established mean that we are comparing the sample to
       alternative = "greater"   # the alternative hypothesis is that our sample mean is 'greater' than mu
       )

6.1.3.1 Effect size (Cohen’s d) for a one-sample test

cohensD(data %>% pull(scores), mu=...)


6.2 Comparing 3 or more independent groups

6.2.1 ANOVA (one-way)

Use a one-way ANOVA to compare 3 or more groups/conditions on the same continuous DV measure.

Assumptions:

  1. The DV is interval or ratio (not ordinal).
  2. The groups are independent.
  3. The variance of the residuals is homogeneous. This can be tested with a Levene’s test (levene.test()) on the residuals. ANOVA is robust to violations of this assumption if the groups are of equal size.
  4. The residuals are normally distributed. This can be tested with a QQplot and a Shapiro-Wilk test. ANOVA is robust to violations of this assumption if the groups are of equal size.

Running the ANOVA

You will need the afex package.

Running the ANOVA takes two steps: 1. building a model and 2. running an ANOVA on the model.

library(afex)

model <- aov_ez(
  id = "id",       # name of column with participant IDs
  dv = "...",      # name of column with scores on the outcome measure
  between = "...", # name of column with group/condition labels
  data = data      # name of dataframe
)

anova(model)

6.2.2 Kruskal-Wallis (non-parametric version)

kruskal.test(dv ~ group, data = df) - replace with:

  • dv = your Dependent (outcome) variable
  • group = your Independent (grouping) variable
  • df = the name of the dataframe
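
For a runnable illustration, the sketch below uses PlantGrowth, a dataset that ships with R (it is not part of this guide's data). It contains plant weights (`weight`) for three groups (`group`: ctrl, trt1, trt2):

```r
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

kw             # prints the test statistic, degrees of freedom and p-value
kw$parameter   # degrees of freedom: number of groups minus 1
kw$p.value     # the p-value on its own
```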

6.3 Associations between continuous measures

6.3.1 Correlation (Pearson’s r)

cor.test(df$x, df$y, method = "pearson"), where:

  • df = name of dataframe
  • x and y = the two variables to test for correlation

6.3.2 Non-parametric correlation (Spearman’s rho)

cor.test(df$x, df$y, method = "spearman")
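
As an illustration, both correlations can be tried on mtcars, a dataset that ships with R (it is not part of this guide's data), using car weight (`wt`) and fuel efficiency (`mpg`). The argument exact = FALSE is an addition in this sketch: it avoids a warning about exact p-values when the data contain tied ranks.

```r
pearson  <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
spearman <- cor.test(mtcars$wt, mtcars$mpg, method = "spearman", exact = FALSE)

pearson$estimate    # Pearson's r
spearman$estimate   # Spearman's rho
pearson$p.value     # p-value for the Pearson correlation
```

Both estimates are negative here: heavier cars tend to travel fewer miles per gallon.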

6.3.3 Linear Regression

model <- lm(y ~ x, data = df) - to create the linear model

summary(model) - to print out a summary of the model
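
A runnable sketch of the two steps, again using the built-in mtcars dataset as a stand-in for your own data, predicting fuel efficiency (`mpg`) from car weight (`wt`):

```r
model <- lm(mpg ~ wt, data = mtcars)   # outcome ~ predictor

summary(model)   # coefficients, R-squared and p-values
coef(model)      # the intercept and slope on their own
```

The slope for `wt` is negative, meaning that predicted fuel efficiency drops as weight increases.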


6.4 Chi Squared

chisq.test(table(df$var1, df$var2))
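
To see what goes into the test, it helps to look at the table of counts first. The sketch below builds a made-up dataset of 40 participants with two categorical variables (the names and values are hypothetical):

```r
# Hypothetical example data
df <- data.frame(
  var1 = rep(c("yes", "no"), times = c(20, 20)),
  var2 = c(rep(c("pass", "fail"), times = c(14, 6)),
           rep(c("pass", "fail"), times = c(8, 12)))
)

counts <- table(df$var1, df$var2)   # cross-tabulate the two variables
counts

res <- chisq.test(counts)
res            # test statistic, degrees of freedom and p-value
res$expected   # expected counts if the two variables were unrelated
```

For a 2 x 2 table like this one, chisq.test() applies Yates' continuity correction by default; add correct = FALSE to turn it off.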