At the start of each session, make sure that you have your script (`.r` or `.rmd`) and your data in the same folder.
R can function like a calculator. You can do basic maths with the following operators:
Operator | Meaning | Operator | Meaning |
---|---|---|---|
+ | Addition | - | Subtraction |
* | Multiplication | / | Division |
^ | Exponents (e.g., ^3) | sqrt() | Square root |
Some logical operators (return ‘TRUE’ or ‘FALSE’ as output):
Operator | Meaning | Operator | Meaning |
---|---|---|---|
> | Greater than | < | Less than |
>= | Greater than or equal to | <= | Less than or equal to |
== | Equal to | != | Not equal to |
& | “and” | \| | “or” |
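As a quick illustration of the operators above, here is a minimal sketch you can paste into the console. The numbers and object names are arbitrary and chosen only for this example.

```r
# Arithmetic operators
total      <- 7 + 2     # addition
difference <- 7 - 2     # subtraction
product    <- 7 * 2     # multiplication
quotient   <- 7 / 2     # division
cubed      <- 2^3       # exponent: 2 to the power of 3
root       <- sqrt(16)  # square root

# Logical operators return TRUE or FALSE
is_greater <- 5 > 3                 # TRUE
is_equal   <- 5 == 3                # FALSE
not_equal  <- 5 != 3                # TRUE
both_true  <- (5 > 3) & (2 > 4)     # FALSE: only one side is TRUE
either     <- (5 > 3) | (2 > 4)     # TRUE: at least one side is TRUE
```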
A function in R is a reusable block of code designed to perform a specific task. Functions take inputs, called arguments or parameters, process them, and return an output or result. They are an essential part of programming in R, helping to automate repetitive tasks and break down complex operations into smaller, manageable parts.
Each function has a name, followed by brackets, for example: `mean()`, `sum()`, `round()`, etc.
Arguments are the information you put into the function. For example, `mean(x)` calculates the mean value of `x` (which must be a vector of numbers). `round(x, digits = 1)` contains two arguments: `x` (the number to be rounded) and `digits = ...`, which is the number of decimal places to round to.
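A small sketch of calling functions with arguments; the vector `x` and the number being rounded are made up for illustration.

```r
x <- c(2.25, 3.5, 4.75)     # a numeric vector

mean_x <- mean(x)           # one argument: the vector to average (3.5)

# Arguments can be named explicitly...
rounded_named <- round(3.14159, digits = 1)   # 3.1
# ...or matched by position (the second argument of round() is `digits`)
rounded_positional <- round(3.14159, 1)       # 3.1
```

Naming arguments is slightly more typing, but it makes the code easier to read back later.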
You can save individual numbers, words, vectors or whole spreadsheets as objects for later use. You do this with the format `name <- value`.
# Saving individual numbers:
participant_age <- 45
# Saving individual character strings (words)
participant_name <- "Alex"
To save multiple items in a vector, use the `c()` function. If the items are character strings, put them in `""` quotation marks.
# Saving a vector of numbers
ages <- c(19, 22, 24, 19, 29, 26, 21)
# Saving a vector of strings
names <- c("Kathy", "David", "Joe", "Jenny", "Alex")
Objects in R can have different classes, such as numeric (numbers), character (strings/words) and factor (categorical variables).
You can’t always tell what class an object is by looking at it. To check, you can pass an object through the `class()` function:
participant_age <- 45 # save the number 45 as `participant_age`
class(participant_age) # check the class of the new object - it should be "numeric"
## [1] "numeric"
Compare with this case, where the number “45” is saved as a string, rather than a numeric, by mistake:
participant_age <- "45" # because of the quote marks, 45 is a character string, not a number
class(participant_age) # check the class of `participant_age` - the result should be "character"
## [1] "character"
To correct an object’s class, you can coerce it to the desired class with `as.numeric()` for numbers, `as.character()` for strings/words, or `as.factor()` for categorical variables.
participant_age <- as.numeric(participant_age) # overwrite `participant_age` as a numeric object
class(participant_age) # check its class - it should be "numeric" again
## [1] "numeric"
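The same idea applies to the other coercion functions. The sketch below uses a made-up `condition` vector to show `as.factor()`, and also what happens when a string cannot be read as a number.

```r
# A character vector of (hypothetical) condition labels
condition <- c("control", "reward", "control", "punishment")
class_before <- class(condition)        # "character"

# Coerce it to a factor (categorical variable)
condition <- as.factor(condition)
class_after <- class(condition)         # "factor"
condition_levels <- levels(condition)   # the categories, in alphabetical order

# Coercion only works when the content makes sense as the new class:
# text that is not a number becomes NA (R also prints a warning)
not_a_number <- suppressWarnings(as.numeric("forty-five"))
```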
Packages contain additional functions created by the R community. The `tidyverse` package combines multiple valuable packages, such as `dplyr`, which makes data wrangling easier, and `ggplot2`, which is used for data visualisation.
You need to install a package on your computer before you can use it. You need to do this only once. After that, you just need to load that package into R whenever you want to use it.
# Install a package for the first time:
install.packages("tidyverse") # notice the quote marks around the package name
# Load a package for use:
library(tidyverse)
# Update an old package (name it via `oldPkgs`; the first argument of update.packages() is the library location):
update.packages(oldPkgs = "tidyverse")
# Remove a package:
remove.packages("tidyverse")
# Install an older version of a package:
install.packages(
"https://cran.r-project.org/src/contrib/Archive/tidyverse/tidyverse_1.3.1.tar.gz",
repos=NULL,
type="source")
# NOTE: Replace the URL with a link to the version of the package you want to install.
# Links to older versions of all packages can be found here:
# https://cran.r-project.org/src/contrib/Archive/
A dataframe is a two-dimensional (consisting of rows and columns) data structure used to store data in a table format, where each column can contain different types of data (numeric, character, factor).
One of the most commonly used types of dataframe is called a tibble. You can create one from scratch with the `tibble()` function from the `tidyverse` package.
Inside the brackets you must define each column’s contents in the format `column name = contents`:
data <- tibble(
id = 1:5,
gender = c("Female", "Male", "Male", "Female", "Female"),
age = c(22, 23, 28, 24, 22),
score = c(54, 49, 51, 62, 53)
)
print(data)
## # A tibble: 5 × 4
## id gender age score
## <int> <chr> <dbl> <dbl>
## 1 1 Female 22 54
## 2 2 Male 23 49
## 3 3 Male 28 51
## 4 4 Female 24 62
## 5 5 Female 22 53
You can import data into R with the `read.csv()` function.
Notice that the spreadsheet you’re importing must be in .csv format (Comma-Separated Values). This format retains only the values in the spreadsheet, omitting the formatting of richer Excel documents (like .xlsx files). If your data is in a different format (such as .xlsx or .numbers), you must first open it in a spreadsheet program (such as Excel) and save it as .csv.
# Import data from a specific .csv on your computer:
data <- read.csv("name_of_spreadsheet.csv")
# Manually select the file to import from your file browser:
data <- read.csv(file.choose())
# Import data from an online file:
data <- read.csv("https://... link to the file goes here")
%>%
The pipe operator `%>%` is part of the `tidyverse` and allows you to chain multiple functions together. The pipe sends the output of one function as the input to the next function, whose output is in turn passed to the one after that. For example:
data %>% # start with the dataframe 'data'
filter(gender == "Female") %>% # filter the female participants
pull(score) %>% # extract the variable 'score'
mean() %>% # calculate the mean of women's scores
round(1) # round that mean to 1 decimal place
You can extract a single variable (column) from a dataframe either with the `$` operator or with the `pull()` function:
# This extracts the variable 'var' from the dataset 'data':
data$var
# This achieves the same but with pull():
pull(data, var)
# Pipes (%>%) don't handle the $ operator well. When coding with pipes, use pull() instead of $:
data %>% pull(var)
To see the top rows of a dataset (rather than the full dataframe), use `head()`. This is useful for getting a quick preview of the structure of a dataframe without printing it out in its entirety:
crime_data <- USArrests # Save `USArrests` data as the object `crime_data`
head(crime_data) # Preview the top rows of `crime_data`
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
inner_join()
We can use `inner_join()` to merge two dataframes. The format is `inner_join(dataframe1, dataframe2, by = ...)`. The `by = ...` argument tells R which variable to match the two datasets by. Usually you will want to match by any columns that are the same across both dataframes. Only rows that have a match in both dataframes are kept.
# Match demographic data and scores by participant ID number:
data <- inner_join(demographics, scores, by = "id")
# Notice that c() is needed when matching multiple variables:
data <- inner_join(responses, pinfo, by = c("id", "intervention"))
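To make the matching behaviour concrete, here is a minimal runnable sketch. The two small tibbles (`demographics` and `scores`) and their values are invented for illustration and only stand in for real data.

```r
library(dplyr)

# Hypothetical demographic data for participants 1-4
demographics <- tibble(
  id  = c(1, 2, 3, 4),
  age = c(22, 23, 28, 24)
)

# Hypothetical scores: participant 4 has no score, participant 5 has no demographics
scores <- tibble(
  id    = c(1, 2, 3, 5),
  score = c(54, 49, 51, 62)
)

# Only ids present in BOTH dataframes (1, 2 and 3) survive the join
joined <- inner_join(demographics, scores, by = "id")
print(joined)
```

If you need to keep unmatched rows as well, `dplyr` also provides `left_join()` and `full_join()`.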
select()
Use `select()` to include some columns/variables of interest while omitting others.
data <- data %>% select(id, gender, condition, score)
# To exclude one specific variable and keep all others in, use the `-` sign:
data <- data %>% select(-score) # omit 'score', keep everything else
`select()` is also useful for rearranging columns in a dataframe. For example, you can move the column ‘condition’ to the front, followed by everything else (use `everything()` to quickly reference all remaining variables in the dataset):
data <- data %>% select(condition, everything())
filter()
While `select()` deals with specific columns, `filter()` deals with specific rows or observations. The format is `filter(dataset, variable == condition)`:
data %>% filter(gender == "Female") # keep only the female participants
data %>% filter(score > 50) # keep only scores above 50
data %>% filter(condition != "control") # keep participants who are NOT in the control condition
You can link multiple conditions with `&` (“and”) or `|` (“or”):
# Keep participants who scored below 20 or above 70
data %>% filter(score < 20 | score > 70)
# Keep participants in the experimental condition who passed the attention check:
data %>% filter(condition == "experimental" & attention == "Pass")
arrange()
# Arrange the dataset by participant name in alphabetical order
data %>% arrange(name)
# Arrange the dataset from lowest to highest scorers
data %>% arrange(score)
# The same arrangement, but from highest to lowest (descending order)
data %>% arrange(desc(score))
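`filter()`, `arrange()` and `select()` are often chained in a single pipeline. The sketch below rebuilds the small example tibble from the dataframe section so that it runs on its own; the object name `top_female` is just a label for this example.

```r
library(dplyr)

data <- tibble(
  id     = 1:5,
  gender = c("Female", "Male", "Male", "Female", "Female"),
  age    = c(22, 23, 28, 24, 22),
  score  = c(54, 49, 51, 62, 53)
)

top_female <- data %>%
  filter(gender == "Female" & score > 50) %>%  # keep rows: female participants scoring above 50
  arrange(desc(score)) %>%                     # sort rows: highest score first
  select(id, score)                            # keep columns: only id and score

print(top_female)
```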
mutate()
`mutate()` is a powerful function that can be used to create new columns based on values in existing ones. The syntax is `data %>% mutate(new_variable = values)`, where ‘new_variable’ is the name of the new column and ‘values’ is the values that the new column should contain.
For example, let’s say you have participants’ year of birth in the column ‘year_of_birth’ and you want to calculate their ages based on it:
ID | year_of_birth |
---|---|
1 | 1997 |
2 | 1997 |
3 | 2004 |
4 | 1996 |
5 | 2000 |
You could run `data %>% mutate(age = 2025 - year_of_birth)`. This will produce a new dataframe identical to the old one, but with an added column called ‘age’ that equals the current year (2025) minus the year of birth:
ID | year_of_birth | age |
---|---|---|
1 | 1997 | 28 |
2 | 1997 | 28 |
3 | 2004 | 21 |
4 | 1996 | 29 |
5 | 2000 | 25 |
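The age calculation can be reproduced with a self-contained sketch. The tibble name `births` and the helper object `current_year` are introduced here only for illustration.

```r
library(dplyr)

births <- tibble(
  ID            = 1:5,
  year_of_birth = c(1997, 1997, 2004, 1996, 2000)
)

current_year <- 2025   # assumed reference year, as in the text

births_with_age <- births %>%
  mutate(age = current_year - year_of_birth)   # new column computed row by row

births_with_age$age    # 28 28 21 29 25
```

Note that `mutate()` does not change `births` itself; the result has to be saved (here as `births_with_age`) if you want to keep it.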
recode()
`recode()` is used for recoding the values of a categorical variable. For example, if gender is coded in your data as 1 (for male) and 0 (for female), you can recode it as “Male” and “Female” instead. The formula is:
recode(old_variable, 'old value 1' = 'new value 1', 'old value 2' = 'new value 2', ...)
For example (numeric old values are wrapped in backticks, because a bare number cannot be used as an argument name):
recode(gender, `1` = "Male", `0` = "Female")
This function is usually combined with `mutate()` if you want to save the recoded variable in the old dataset. For example:
data <- data %>% # Create a new dataset 'data' overwriting the existing 'data'
mutate( # mutate() creates a new column 'gender' overwriting the old 'gender'
gender = recode( # the new 'gender' will consist of the recoded 'gender' values
gender,
`1` = "Male", # The old value '1' is recoded as 'Male'
`0` = "Female" # The old value '0' is recoded as 'Female'
)
)
Note: Keep in mind that numbers should not be enclosed in quote marks (`" "`), otherwise they will be treated as strings rather than numeric values. See the section on object classes for more details.
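A runnable sketch of the recoding step, using a tiny made-up dataset (`gender_data`) in which gender is stored as 1 and 0:

```r
library(dplyr)

gender_data <- tibble(
  id     = 1:4,
  gender = c(1, 0, 0, 1)    # hypothetical coding: 1 = male, 0 = female
)

gender_data <- gender_data %>%
  mutate(
    gender = recode(
      gender,
      `1` = "Male",     # backticks let a number be used as an argument name
      `0` = "Female"
    )
  )

gender_data$gender    # "Male" "Female" "Female" "Male"
```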
Many psychological questionnaires contain multiple questions designed to reflect the same underlying construct. For example, personality questionnaires will ask multiple questions measuring extraversion; your responses to each question are converted to scores, and a mean extraversion score is calculated out of them. In this example, we have a dataset with 5 questions and we need to calculate a mean questionnaire score across all 5 questions for each participant:
ID | Q1 | Q2 | Q3 | Q4 | Q5 | Mean_score |
---|---|---|---|---|---|---|
1 | 25.5 | 18.8 | 23.7 | 18.1 | 21.5 | ? |
2 | 31.9 | 27.1 | 37.7 | 15.7 | 17.3 | ? |
3 | 26.8 | 22.8 | 30.0 | 19.1 | 27.4 | ? |
4 | 19.9 | 23.0 | 15.2 | 15.9 | 24.6 | ? |
5 | 22.3 | 21.6 | 31.2 | 17.1 | 19.4 | ? |
There are different ways of calculating such mean scores. Note that you cannot use the `mean()` function on its own for this. `mean()` requires a single numeric vector (i.e., a single column in your dataframe), whereas here you are trying to find the mean across multiple columns, row by row.
Option 1: Calculate it manually: add up the 5 questions and divide by the number of questions:
quest_scores %>%
mutate(Mean_score = (Q1 + Q2 + Q3 + Q4 + Q5)/5)
ID | Q1 | Q2 | Q3 | Q4 | Q5 | Mean_score |
---|---|---|---|---|---|---|
1 | 25.5 | 18.8 | 23.7 | 18.1 | 21.5 | 21.52 |
2 | 31.9 | 27.1 | 37.7 | 15.7 | 17.3 | 25.94 |
3 | 26.8 | 22.8 | 30.0 | 19.1 | 27.4 | 25.22 |
4 | 19.9 | 23.0 | 15.2 | 15.9 | 24.6 | 19.72 |
5 | 22.3 | 21.6 | 31.2 | 17.1 | 19.4 | 22.32 |
Option 2: This is a variation on the first solution. Instead of adding up the columns in the format `Q1 + Q2 + ...`, we can use `sum()`. The problem is that this function only sums the items in a single vector, not multiple columns. To address this, we need to tell R to sum the columns row by row. This is done with the function `rowwise()`:
quest_scores %>%
rowwise() %>%
mutate(Mean_score = sum(Q1, Q2, Q3, Q4, Q5, na.rm = TRUE) / 5) %>%
ungroup()
Note that we added `ungroup()` at the end of the operation above. This is because `rowwise()` groups the data by row: all functions after that will be performed on a row-by-row basis, rather than on the whole dataset. To revert the dataset back to normal, we have to ‘ungroup’ it.
Alternatively, if you have too many columns and can’t list all of them one by one, you can specify the range of columns, from Q1 to Q5, with `sum(across(Q1:Q5))`:
quest_scores %>%
rowwise() %>%
mutate(Mean_score = sum(across(Q1:Q5), na.rm = TRUE) / 5) %>%
ungroup()
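Another possible approach, not covered in the options above, is base R's `rowMeans()`, which averages across columns without needing `rowwise()`. The sketch below recreates the questionnaire table from this section so that it runs on its own.

```r
library(dplyr)

quest_scores <- tibble(
  ID = 1:5,
  Q1 = c(25.5, 31.9, 26.8, 19.9, 22.3),
  Q2 = c(18.8, 27.1, 22.8, 23.0, 21.6),
  Q3 = c(23.7, 37.7, 30.0, 15.2, 31.2),
  Q4 = c(18.1, 15.7, 19.1, 15.9, 17.1),
  Q5 = c(21.5, 17.3, 27.4, 24.6, 19.4)
)

with_means <- quest_scores %>%
  # across(Q1:Q5) picks the five question columns; rowMeans() averages each row
  mutate(Mean_score = rowMeans(across(Q1:Q5), na.rm = TRUE))

with_means$Mean_score   # 21.52 25.94 25.22 19.72 22.32
```

One difference worth knowing: with `na.rm = TRUE`, `rowMeans()` divides by the number of non-missing answers in each row, whereas `sum(...) / 5` always divides by 5, even when a participant skipped a question.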
summarise()
The function `summarise()` takes an existing dataframe and produces a table with summary statistics for it, such as counts, means, medians, standard deviations, etc. It uses the formula:
data %>% summarise(
column1 = values,
column2 = values,
column3 = values,
...
)
For example, let’s say we’re working with a dataset containing participant test scores collected after participants were given either a reward (reward condition), a punishment (punishment condition), or neither (control condition). We also have age and gender data:
id | age | gender | condition | score |
---|---|---|---|---|
1 | 23 | male | control | 51.4 |
2 | 26 | female | punishment | 51.2 |
3 | 29 | female | control | 43.5 |
4 | 33 | female | punishment | 31.0 |
5 | 34 | female | reward | 50.1 |
6 | 39 | female | reward | 40.1 |
We can use `summarise()` to calculate the number of people in the sample, their mean age, and its standard deviation:
data %>% # start with the original data
summarise( # start building the summary table
N = n(), # n() prints the number of observations - no further arguments needed
Age_mean = mean(age), # 'Age_mean' is just what we call this column, and 'mean(age)' calculates the values
Age_sd = sd(age)
)
## # A tibble: 1 × 3
## N Age_mean Age_sd
## <int> <dbl> <dbl>
## 1 50 33.9 6.88
We can also combine this function with `group_by()` to group our data by some categorical variable and generate summary statistics for each group. For example, the following code will produce the same summary table as above, but the summary statistics will be grouped by gender:
data %>% # start with the dataframe
group_by(gender) %>% # group it by 'gender'
summarise( # the resulting summaries will be generated separately for the two genders
N = n(),
Age_mean = mean(age),
Age_sd = sd(age)
)
## # A tibble: 2 × 4
## gender N Age_mean Age_sd
## <chr> <int> <dbl> <dbl>
## 1 female 30 34.6 6.55
## 2 male 20 32.8 7.40
This is also useful for calculating mean scores for the different conditions in a study:
data %>%
group_by(condition) %>% # group data by condition
summarise(
N = n(), # get the number of participants in each condition
Score_mean = mean(score), # get the mean score in each condition
Score_sd = sd(score) # get the SD of scores in each condition
)
## # A tibble: 3 × 4
## condition N Score_mean Score_sd
## <chr> <int> <dbl> <dbl>
## 1 control 21 43.4 9.47
## 2 punishment 13 42.7 10.7
## 3 reward 16 42.6 9.92
You can get creative with the summary table. The code below adds an extra column counting the number of participants per condition that failed the test (scoring under 40). `score < 40` creates a logical vector in which participants who meet the condition (scoring under 40) are marked as ‘TRUE’ and those who do not are marked as ‘FALSE’. `sum(score < 40)` adds up the number of TRUE items in the vector, producing the total number of participants with scores under 40:
data %>%
group_by(condition) %>%
summarise(
N = n(),
Score_mean = mean(score),
Score_sd = sd(score),
Fails = sum(score < 40)
)
## # A tibble: 3 × 5
## condition N Score_mean Score_sd Fails
## <chr> <int> <dbl> <dbl> <int>
## 1 control 21 43.4 9.47 7
## 2 punishment 13 42.7 10.7 5
## 3 reward 16 42.6 9.92 6
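To try the grouped summary yourself, the sketch below uses only the six example rows shown in the table earlier in this section (stored here under the made-up name `study`), so its numbers will differ from the outputs above, which are based on the full sample of 50.

```r
library(dplyr)

study <- tibble(
  id        = 1:6,
  age       = c(23, 26, 29, 33, 34, 39),
  gender    = c("male", "female", "female", "female", "female", "female"),
  condition = c("control", "punishment", "control", "punishment", "reward", "reward"),
  score     = c(51.4, 51.2, 43.5, 31.0, 50.1, 40.1)
)

by_condition <- study %>%
  group_by(condition) %>%
  summarise(
    N          = n(),              # participants per condition
    Score_mean = mean(score),      # mean score per condition
    Fails      = sum(score < 40)   # TRUE counts as 1, FALSE as 0
  )

print(by_condition)
```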
Base R has functions for making simple graphs, like `plot()` and `hist()`. For example:
hist(data$score)
However, the package `ggplot2` (part of the `tidyverse`) gives us many more options for customising our graphs. It works by building a basic plot with the `ggplot()` function and adding graphic layers (called ‘geoms’) on top of it.
We can start making a simple plot by adding the following arguments in the brackets of `ggplot()`: the name of the dataset (‘data’) and an `aes()` function, which will contain the X- and Y-axes of the plot:
ggplot(data, aes(...))
# or:
data %>% ggplot(aes(...))
We then map variables to the X- and Y-axes:
ggplot(data, aes(x = condition, # X-axis will represent condition
y = score)) # Y-axis will represent scores
If you run this code on its own, you’ll see an empty plot with nothing but the two axes. This will be the foundation of our graph.
We now have to add an extra graphical layer (a ‘geom’) containing the type of graph we want to make. To make a boxplot, add a `geom_boxplot()`:
ggplot(data, aes(x = condition,
y = score)) +
geom_boxplot() # A new layer containing the boxplot geom
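If you don't have a dataset at hand, the same two steps (an empty `ggplot()` base plus a boxplot geom) can be tried on `PlantGrowth`, a dataset that ships with R. Using it here is purely an illustrative choice; its `group` and `weight` columns stand in for ‘condition’ and ‘score’.

```r
library(ggplot2)

# Step 1: the base plot - just the axes, no data drawn yet
base_plot <- ggplot(PlantGrowth, aes(x = group,    # X-axis: the three plant groups
                                     y = weight))  # Y-axis: dried plant weight

# Step 2: add a layer that draws one box per group
box_plot <- base_plot + geom_boxplot()

print(box_plot)
```

Saving the base plot as an object is optional, but it makes it easy to try different geoms on the same axes.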
Additional layers can be added to change the…
- axis labels and title: `+ labs(x = "...", y = "...", title = "...")`. By default all labels are taken from the variable names and contents, which cannot contain spaces and might not be capitalised. You can edit them to add capital letters and spaces (e.g., rename ‘mean_score’ to ‘Mean test scores’) or to make them more detailed.
- group labels: `+ scale_x_discrete(labels = c("Control", "Punishment", "Reward"))`
- visual theme: `+ theme_classic()`, `theme_minimal()`, etc. Look online for a list of available themes.
ggplot(data, aes(x = condition,
y = score)) +
geom_boxplot() +
labs(x = "Experimental condition", # Change the label of the X-axis
y = "Test scores", # Change the label of the Y-axis
title = "Differences in scores between conditions") + # Change the graph's title
scale_x_discrete(labels = c("Control", "Punishment", "Reward")) + # Change the group labels
theme_minimal() # Change the visual theme
In addition to mapping the X- and Y-axes in `aes()`, you can map a third variable to the ‘fill’ parameter. ‘fill’ defines the colour of the boxes, but alternatively, you can map a variable to ‘colour’ instead, which controls the outlines of the boxes rather than the fill. You may need additional arguments to change the labels and ticks of the new grouping variable as well.
ggplot(data, aes(x = condition,
y = score,
fill = gender)) + # Adding "gender" as a new grouping variable
geom_boxplot() +
labs(x = "Experimental condition",
y = "Test scores",
fill = "Gender", # Changing the label of the newly added variable
title = "Differences in scores between conditions") +
scale_x_discrete(labels = c("Control", "Punishment", "Reward")) +
scale_fill_discrete(labels = c("Female", "Male")) + # Changing the ticks of the newly added variable
theme_minimal()
Binomial distributions (discrete events) | Normal distributions (continuous values) | Explanation |
---|---|---|
dbinom() | dnorm() | density: probability of observing a given value |
pbinom() | pnorm() | cumulative: probability of observing values up to a given threshold |
qbinom() | qnorm() | quantile: what value is in the bottom x% of the distribution? |
Practical examples:
dbinom() - the density function
`dbinom(x, size, prob)` gives you the probability of an outcome occurring in a specific number (`x`) of trials, given the total number of trials (`size`) and the probability (`prob`) of that outcome on a single trial.
Example: what is the chance that exactly one person has the same birthday as you in a class of 30? There are 29 classmates besides you, so `size = 29`:
`dbinom(x = 1, size = 29, prob = 1/365)` = 0.074, or 7.4%.
To visualise this in `ggplot()`, we can create a tibble that contains the probabilities of different numbers of people having the same birthday as you. You can then plot this tibble in a bar chart, or explore the tibble manually in R to check the probabilities of any given number of students having the same birthday as you:
#calculate probability of each number of people with the same birthday as you and save in a data frame
bdays <- tibble(nstudents = 0:29, p = dbinom(x = 0:29, size = 29, prob = 1 / 365))
#plot in a bar graph
ggplot(bdays, aes(x = nstudents, y = p)) +
geom_bar(stat = "identity", fill = "#4E84C4") +
theme_classic() +
labs(x = "Number of students", y = "Probability (P)") +
scale_x_continuous(breaks = seq(0, 29, 1))
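To see what `dbinom()` is doing under the hood, the birthday probability can be reproduced by hand from the binomial formula, P(X = k) = choose(n, k) × p^k × (1 − p)^(n − k). The object names below (`n`, `k`, `p`, `by_hand`, `built_in`) are just labels for this sketch.

```r
n <- 29        # number of classmates (trials)
k <- 1         # number of classmates sharing your birthday (successes)
p <- 1 / 365   # chance that any one classmate shares your birthday

# Binomial formula written out manually
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probability from the built-in density function
built_in <- dbinom(x = k, size = n, prob = p)

all.equal(by_hand, built_in)   # TRUE
round(built_in, 3)             # 0.074
```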
pbinom() - the distribution function
`pbinom(q, size, prob)` gives you the cumulative probability of the outcome occurring up to and including `q` times, given the total number of trials (`size`) and the probability (`prob`) of that outcome on a single trial (e.g., what is the chance of getting 3 or fewer heads in 10 coin flips?). This is known as the cumulative distribution function.
Example: what is the probability that no more than 3 out of 10 people taking their driving test today pass? The cut-off is 3, so `q = 3`, and the question is about the number of passes out of 10 tests, so `size = 10`. The probability of passing a driving test in the UK is 0.463 (46.3% of tests in 2019-2020 were passes), so `prob = 0.463`.
`pbinom(q = 3, size = 10, prob = 0.463)` = 0.24, or 24%.
In this graph, you can see the distribution of probabilities visualised. 3 passes or fewer are highlighted in orange. The cumulative probability (24%) is the added probability of 0, 1, 2, and 3 passes out of 10:
tibble(num_pass = 0:10, P = dbinom(x = 0:10, size = 10, prob = 0.463)) %>%
mutate(highlight = ifelse(num_pass > 3, T, F)) %>%
ggplot(aes(x = num_pass, y = P)) +
geom_bar(stat = "identity", aes(fill = highlight)) +
theme_classic() +
labs(x = "Number of passes out of 10 tests", y = "Probability (p)") +
scale_x_continuous(breaks = seq(0, 10, 1)) +
theme(legend.position = "none")
But what if we wanted to know the probability of outcomes above a certain value rather than below? Say we want to know the probability of there being more than 5 passes in 10 tests. To do this we need to add the argument `lower.tail = FALSE`. The “lower tail” is the values up to and including the cut-off point (the left side of the graph, highlighted in orange). `lower.tail = FALSE` cuts out the lower tail, leaving the upper tail (i.e., values above the cut-off point, or the right side of the graph). The code for more than 5 passes (i.e., 6 or more) would be:
`pbinom(q = 5, size = 10, prob = 0.463, lower.tail = FALSE)` = 0.29, or 29%. Visualised, it would look like this:
tibble(num_pass = 0:10, P = dbinom(x = 0:10, size = 10, prob = 0.463)) %>%
mutate(highlight = ifelse(num_pass < 6, T, F)) %>%
ggplot(aes(x = num_pass, y = P)) +
geom_bar(stat = "identity", aes(fill = highlight)) +
theme_classic() +
labs(x = "Number of passes out of 10 tests", y = "Probability (p)") +
scale_x_continuous(breaks = seq(0, 10, 1)) +
theme(legend.position = "none")
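The relationship between the density and cumulative functions can be checked directly. This sketch reuses the driving-test numbers from the examples above; the object names are only labels for the comparison.

```r
p_pass <- 0.463   # probability of passing a single test (from the example above)

# Lower tail: 3 passes or fewer
lower     <- pbinom(q = 3, size = 10, prob = p_pass)
lower_sum <- sum(dbinom(x = 0:3, size = 10, prob = p_pass))   # P(0) + P(1) + P(2) + P(3)

# Upper tail: more than 5 passes (6, 7, 8, 9 or 10)
upper            <- pbinom(q = 5, size = 10, prob = p_pass, lower.tail = FALSE)
upper_complement <- 1 - pbinom(q = 5, size = 10, prob = p_pass)
upper_sum        <- sum(dbinom(x = 6:10, size = 10, prob = p_pass))

round(lower, 2)   # 0.24
round(upper, 2)   # 0.29
```

All three ways of getting the upper tail agree, which also shows that the cut-off itself (5 passes) belongs to the lower tail.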
qbinom() - the quantile function
`qbinom(p, size, prob)` is the inverse of `pbinom()`: it gives you the smallest number of occurrences at which the cumulative probability reaches or exceeds a given probability (`p`), given the total number of trials (`size`) and the probability (`prob`) of the outcome on a single trial.
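As a sketch of how the quantile function relates to the cumulative one, the code below asks for the median number of passes in the driving-test example (10 tests, pass probability 0.463). The choice of `p = 0.5` is only for illustration.

```r
# Smallest number of passes whose cumulative probability is at least 50%
median_passes <- qbinom(p = 0.5, size = 10, prob = 0.463)
median_passes   # 5

# Check against pbinom(): the cumulative probability first reaches 0.5 at this value...
at_quantile <- pbinom(q = median_passes, size = 10, prob = 0.463)
# ...and is still below 0.5 one step earlier
just_below  <- pbinom(q = median_passes - 1, size = 10, prob = 0.463)
```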
dnorm()
pnorm()
qnorm()
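The three normal-distribution functions follow the same pattern as their binomial counterparts, but take a `mean` and `sd` instead of `size` and `prob` (defaulting to the standard normal, mean 0 and SD 1). A brief sketch; the test scale with mean 100 and SD 15 is a hypothetical example.

```r
# dnorm(): height of the density curve at a given value
density_at_mean <- dnorm(0)               # about 0.399 for the standard normal

# pnorm(): probability of observing a value up to a threshold
below_196 <- pnorm(1.96)                  # about 0.975

# qnorm(): the value below which a given proportion of the distribution falls
cutoff_975 <- qnorm(0.975)                # about 1.96

# With a custom mean and SD (hypothetical scale: mean = 100, sd = 15),
# a score of 115 is exactly 1 SD above the mean
below_115 <- pnorm(115, mean = 100, sd = 15)   # same as pnorm(1), about 0.841
```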
To compare two groups on some continuous measure, we typically use a t-test (or a non-parametric equivalent if the assumptions are not met).
Used when the two groups are independent (between-subjects design - each participant takes part in only one condition).
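Assumptions:
The scores in each group are normally distributed. You can check this with a Shapiro-Wilk test (`shapiro.test(data$scores)` on each group separately), a Q-Q plot (`qqPlot(data$scores)`), or eyeball it with a histogram (`hist(data$scores)`).
How to run the test: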
# General format: t.test(<vector of group A scores>, <vector of group B scores>)
# Note that scores for each group need to be in a separate vector, so you'll have to adapt your code depending on whether the data is in long or wide format.
# WIDE format: each group's scores are in a separate column (e.g., `groupA` and `groupB`):
t.test(data$groupA,
data$groupB)
# LONG format: all scores are in the same column (`scores`), and group membership is marked in a separate column (`group`), so you have to filter the dataset per condition and extract the scores separately:
t.test(data %>% filter(group == "A") %>% pull(scores),
data %>% filter(group == "B") %>% pull(scores)
)
# Effect size (cohensD() comes from the `lsr` package):
cohensD(data$groupA,
  data$groupB,
  method = "unequal") # in cases where the groups' sample sizes and variances are unequal
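For long-format data, base R's `t.test()` also accepts a formula of the form `outcome ~ group`, which avoids filtering each group by hand. The sketch below demonstrates this on `mtcars`, a dataset built into R, used here only as a stand-in: `mpg` plays the role of ‘scores’ and `am` (0 or 1) the role of ‘group’.

```r
# Two-vector version: extract each group's scores separately
group0 <- mtcars$mpg[mtcars$am == 0]
group1 <- mtcars$mpg[mtcars$am == 1]
two_vectors <- t.test(group0, group1)

# Formula version: outcome ~ grouping variable, straight from the long-format data
formula_version <- t.test(mpg ~ am, data = mtcars)

# Both calls run the same Welch two-sample t-test
two_vectors$p.value
formula_version$p.value
```

By default `t.test()` does not assume equal variances (Welch's test); add `var.equal = TRUE` if you want the classic Student's t-test.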
Can be run when the IV has two independent groups and the DV is ordinal or non-normally distributed.
# For WIDE data (groups A and B have scores in separate columns: 'groupA' and 'groupB')
wilcox.test(data$groupA, data$groupB)
# For LONG data (all scores are in column 'scores', group labels are in column 'group')
wilcox.test(data %>% filter(group == "A") %>% pull(scores),
data %>% filter(group == "B") %>% pull(scores),
correct = FALSE # this is optional - it turns off the continuity correction
)
Used when all participants take part in both conditions (within-subjects / repeated measures design), e.g. participants are tested before and after an intervention. In the example below, we have two conditions: ‘before’ and ‘after’.
Assumptions:
Before running the test, check for…
Missing data:
data %>% filter(is.na(before) | is.na(after)) %>% count()
The difference between the ‘before’ and ‘after’ conditions is normally distributed:
# Step 1: Calculate the difference between the two conditions:
data <- data %>% mutate(difference = after - before)
# Step 2: Check for normality:
qqnorm(data$difference) # Q-Q plot against the normal distribution (qqplot() needs two samples)
shapiro.test(data$difference)
# Step 3: Based on the outcome, decide whether to run a t-test or a non-parametric equivalent.
The code for running the paired-samples t-test in R is the same as the independent-samples t-test, but with an added `paired = TRUE` at the end:
# Example: scores for the two conditions are saved in the columns as 'before' and 'after':
t.test(data$before,
data$after,
paired = TRUE)
# Example: all scores are in the column 'scores', and condition labels are in the column 'condition':
t.test(data %>% filter(condition == "before") %>% pull(scores),
data %>% filter(condition == "after") %>% pull(scores),
paired = TRUE)
# Effect size (cohensD() comes from the `lsr` package):
cohensD(data$before,
  data$after,
  method = "paired")
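The reason the normality check is run on the difference scores is that a paired-samples t-test is equivalent to a one-sample t-test on those differences. The sketch below shows this with `sleep`, a dataset built into R in which the same 10 people were measured under two conditions; labelling them ‘before’ and ‘after’ is an assumption made only to match the example in the text.

```r
# Each person appears once per condition, in the same order
before <- sleep$extra[sleep$group == 1]
after  <- sleep$extra[sleep$group == 2]

# Paired-samples t-test on the two sets of scores
paired_test <- t.test(after, before, paired = TRUE)

# One-sample t-test on the difference scores, compared against 0
difference      <- after - before
one_sample_test <- t.test(difference, mu = 0)

paired_test$p.value
one_sample_test$p.value   # identical to the paired test
```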
Same participants take part in both conditions, but the data is either ordinal or the difference between the two conditions is not normally distributed.
wilcox.test(data$before,
data$after,
paired = TRUE,
correct = FALSE # this is optional - it turns off the continuity correction
)
Used for comparing a single group to a pre-established mean score.
t.test(data$scores,
mu = ..., # the pre-established mean that we are comparing the sample to
alternative = "greater" # the alternative hypothesis is that our sample mean is 'greater' than mu
)
cohensD(data %>% pull(scores), mu=...)
Use a one-way ANOVA to compare 3 or more groups/conditions on the same continuous DV measure.
Assumptions:
Homogeneity of variance (can be checked with `levene.test()` on the residuals). ANOVA is robust to violations of this assumption if the groups are of equal size.
Running the ANOVA
You will need the `afex` package.
Running the ANOVA takes two steps: (1) building a model and (2) running an ANOVA on the model.
library(afex)
model <- aov_ez(
id = "id", # name of column with participant IDs
dv = "...", # name of column with scores on the outcome measure
between = "...", # name of column with group/condition labels
data = data # name of dataframe
)
anova(model)
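If `afex` is not installed, a one-way ANOVA can also be sketched with base R's `aov()`. This is a different function from the `aov_ez()` approach above, shown here only as an illustration on `PlantGrowth`, a built-in dataset with one continuous outcome (`weight`) and three groups (`group`).

```r
# Step 1: build the model - outcome ~ grouping variable
model_aov <- aov(weight ~ group, data = PlantGrowth)

# Step 2: print the ANOVA table (F statistic, degrees of freedom, p-value)
anova_table <- summary(model_aov)
print(anova_table)

# Degrees of freedom: 3 groups - 1 = 2, and 30 observations - 3 groups = 27
anova_df <- anova_table[[1]]$Df
```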
kruskal.test(dv ~ group, data = df)
Replace with:
`dv` = your Dependent (outcome) variable
`group` = your Independent (grouping) variable
`df` = the name of the dataframe
cor.test(df$x, df$y, method = "pearson")
where:
`df` = name of dataframe
`x` and `y` = the two variables to test for correlation
cor.test(df$x, df$y, method = "spearman")
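A runnable sketch of both correlation tests, using the built-in `USArrests` data that appeared earlier in the `head()` example. Correlating `Murder` with `Assault` is an arbitrary choice for illustration.

```r
# Pearson correlation: assumes a linear relationship between two continuous variables
pearson <- cor.test(USArrests$Murder, USArrests$Assault, method = "pearson")
pearson$estimate   # the correlation coefficient r
pearson$p.value    # the p-value of the test

# Spearman correlation: rank-based alternative
# exact = FALSE requests the approximate p-value, which avoids a warning when there are tied ranks
spearman <- cor.test(USArrests$Murder, USArrests$Assault,
                     method = "spearman", exact = FALSE)
spearman$estimate  # the rank correlation coefficient rho
```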
`model <- lm(y ~ x, data = df)` - to create the linear model
`summary(model)` - to print out a summary of the model
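As a minimal worked example of the two steps, the sketch below fits a simple linear regression on the built-in `USArrests` data, predicting `Murder` from `Assault`. The choice of variables is only illustrative.

```r
# Step 1: create the linear model (outcome ~ predictor)
model <- lm(Murder ~ Assault, data = USArrests)

# Step 2: print the summary (coefficients, standard errors, p-values, R-squared)
summary(model)

# The intercept and slope can also be extracted directly
coefficients <- coef(model)
coefficients
```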
chisq.test(table(df$var1, df$var2))