How to Use This Guide

This document lists every R function taught across the R for PSYCH swirl course (Modules 1-11 plus Bonus Modules). Functions are organized by workflow stage and analysis type. Early sections cover importing data, data exploration, cleaning and outlier detection, descriptive statistics, and data manipulation. Each analysis section is self-contained: it shows you how to run the test, calculate effect sizes, interpret the output, and write up the results in APA format, and it tells you which assumptions to check and which visuals to use.

Power analyses, assumption-checking functions, and a visualization gallery (with runnable example plots) appear at the end after all analysis sections.

Code chunks run automatically when you knit this document. To run a single chunk, click the green “Run” arrow.

How to use code examples from this guide

Every code example in this guide is designed for you to copy, paste, and modify for your own data. Here is what to replace:

  • Data frame names (like mtcars, ToothGrowth, PlantGrowth) – replace with the name of YOUR data frame (whatever you called it when you loaded your data with read.csv()).
  • Variable/column names (like mpg, len, supp, weight) – replace with the names of YOUR variables. Remember: spelling and capitalization must match exactly what appears in your data.
  • Text in quotation marks (like "My Title", "Group", "steelblue") – replace with your own titles, labels, or color choices.

For example, if the guide shows:

t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

And your data frame is called my_data, your outcome variable is anxiety, and your grouping variable is condition, you would type:

t.test(anxiety ~ condition, data = my_data, var.equal = TRUE)

What NOT to change: Function names (like t.test, ggplot, aov), operators (like ~, +, <-), and argument names (like data =, x =, fill =). These are part of R’s syntax and must stay exactly as shown.


1. Getting Familiar with R

R vs. RStudio

R is the programming language – it does the actual computing. RStudio is the application you work in – it provides a user-friendly interface for writing, running, and organizing your R code. Think of R as the engine and RStudio as the dashboard. You need both installed, but you will always open RStudio (not R directly).

The four panes in RStudio

When you open RStudio, you see four panes (panels):

Top-left – Source/Script pane: Where you write and edit your code files (R scripts or R Markdown documents). Code here does not run until you tell it to.

Bottom-left – Console: Where code actually runs and output appears. You can type commands here directly, but it is better to write code in the Source pane so you can save and reuse it.

Top-right – Environment: Shows all the objects (data frames, variables, models) currently stored in your R session. Also has a History tab showing previous commands.

Bottom-right – Files / Plots / Packages / Help / Viewer: A multi-purpose pane. The Files tab shows your working directory. The Plots tab displays your graphs. The Help tab shows function documentation.

R scripts vs. R Markdown

An R script (.R file) is a plain text file containing only R code and comments. Every line is code unless it starts with #.

An R Markdown (.Rmd file) combines text, formatting, and code chunks in one document. When you “knit” an R Markdown file, it produces a formatted report (HTML, PDF, or Word) with your text, code, and output all together. Code lives inside code chunks (boxes that start and end with three backticks).

Using # for headers and the outline

In R scripts, # marks a comment – R ignores everything after it on that line. You can also use # followed by a section name and four dashes (----) to create sections that appear in the document outline:

# Section 1: Load Data ----
# Section 2: Clean Data ----

In R Markdown, # at the start of a line creates a header. More # symbols mean smaller headers:

# Big Header (Level 1)
## Smaller Header (Level 2)
### Even Smaller (Level 3)

To view the outline in RStudio, click the small document-outline icon in the top-right corner of the Source pane (it looks like a bulleted list). This lets you jump between sections quickly.

Running code

In an R script:

  • Place your cursor on a line and press Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run that single line.
  • Highlight multiple lines and press Ctrl + Enter to run the entire selection.
  • Click the “Run” button at the top-right of the Source pane (or use the dropdown arrow next to it for more options).

In R Markdown code chunks:

  • Click the green “play” arrow at the top-right corner of a code chunk to run just that chunk.
  • Click the gray downward arrow (to the left of the play button) to run all chunks above the current one.
  • To knit the entire document (run all chunks and produce the formatted output), click the “Knit” button at the top of the Source pane.

Important things to know about R

R is case-sensitive. Mean() is not the same as mean() – the first will produce an error, the second will work. This applies to everything: function names, variable names, dataset names, and factor levels. If something isn’t working, check your capitalization first.

Spacing generally does not matter. x <- 5 and x<-5 do the same thing, but adding spaces makes your code easier to read.

R uses <- for assignment, not =. While = sometimes works, <- is the standard in R and is what you should use.

Quotation marks matter. Text strings need to be wrapped in quotes ("OJ" or 'OJ'), but variable and object names do not. For example, filter(data, group == "control") puts quotes around the word “control” because it is a text value, but group has no quotes because it is a column name.

The $ operator accesses columns. To refer to a specific variable inside a data frame, use dataframe$variable – for example, mtcars$mpg.

Assignment operator <-

Stores a value in a named object so you can use it later.

my_value <- 42
my_name <- "psychology"

Important: If you save a new value to the same object name, it replaces the old object completely. The old value is gone.

x <- 10
x        # returns 10

x <- 25
x        # now returns 25  --  the 10 is gone forever

This also applies to data frames. If you write mtcars <- filter(mtcars, mpg > 15), the original mtcars is overwritten with the filtered version. If you want to keep the original, save the result to a new name (e.g., mtcars_filtered <- filter(mtcars, mpg > 15)).

Combine values c()

Creates a vector (a list of values).

scores <- c(85, 90, 78, 92, 88)
groups <- c("A", "B", "A", "B", "A")

Basic arithmetic

10 + 3     # addition
10 - 3     # subtraction
10 * 3     # multiplication
10 / 3     # division

Logical comparisons

5 == 5     # equals (TRUE)
5 != 3     # not equal (TRUE)
5 > 3      # greater than (TRUE)
5 < 3      # less than (FALSE)

Useful keyboard shortcuts in RStudio

Shortcut What it does
Ctrl + Enter Run the current line or selected code
Ctrl + Shift + 1 Expand/collapse the Source (script) pane
Ctrl + Shift + 2 Expand/collapse the Console pane
Ctrl + Shift + 9 Expand/collapse the Viewer pane
Alt + Ctrl + Shift + 0 Show all four panes
Ctrl + Shift + C Comment/uncomment selected lines (adds or removes #)
Ctrl + Shift + M Insert the pipe operator %>%
Ctrl + Alt + I Insert a new R code chunk (in R Markdown)

Tip about commenting out code (Ctrl + Shift + C): Instead of deleting code you don’t want to run right now, highlight it and press Ctrl + Shift + C to add # in front of each line. This “comments it out” – R will skip it, but the code is still there if you want to use it later. Press the same shortcut again to uncomment.

Cautions and common mistakes

Backticks are not single quotes. The backtick character ` is usually found on the top-left key of your keyboard, below Escape and to the left of the number 1, sharing the key with the tilde ~. It looks similar to a single quote ' but they are different characters. In R Markdown, code chunks open and close with three backticks (```).

Do NOT delete the closing backticks of a code chunk. Every code chunk in R Markdown must have exactly three backticks at the start (```{r}) and three backticks at the end (```). If you accidentally delete the closing backticks, R will think the code chunk never ends and everything below will break.

Hiding or viewing code in knitted output

Code chunk options (the part inside {r}) control what appears when you knit your R Markdown document. Add options after the r, separated by commas:

  • {r, echo = FALSE} – Runs the code but hides the code itself in the knitted output (only the result/plot shows).
  • {r, eval = FALSE} – Shows the code in the output but does not run it.
  • {r, results = 'hide'} – Runs the code and shows it, but hides the text output (useful when you only want to see the plot, not the printed numbers).
  • {r, include = FALSE} – Runs the code silently (nothing appears in the output – useful for setup chunks like loading packages).
  • {r, message = FALSE, warning = FALSE} – Suppresses package loading messages and warnings in the output.

Setting defaults for the whole document. Instead of typing the same options on every single chunk, you can set defaults once at the top of your R Markdown document using a setup chunk. This is typically the very first chunk in the file:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

This tells R: “For every chunk in this document, show the code (echo = TRUE), suppress warnings, and suppress messages – unless I override these in a specific chunk.” Any option you set in an individual chunk will override the default for that chunk only.

Formatting text in R Markdown

Outside of code chunks, you can format text using Markdown syntax. This is how you add emphasis, lists, and other formatting to the written portions of your knitted document.

Format Syntax Result
Bold **bold text** bold text
Italic *italic text* italic text
Bold + italic ***bold italic*** bold italic
Superscript text^superscript^ textsuperscript
Subscript text~subscript~ textsubscript
Strikethrough ~~crossed out~~ crossed out
Inline code `code here` code here

For bullet points and numbered lists, start each line with - or 1. (leave a blank line before the list):

- First bullet
- Second bullet
  - Indented sub-bullet

1. First item
2. Second item

To add a link: [link text](https://example.com). To insert an image: ![caption](image_path.png).


2. Importing Data

Before you can analyze data, you need to tell R where your files are and load them into R.

Setting the working directory: setwd() and getwd()

The working directory is the folder R looks in when you read or save files. Use getwd() to see your current working directory and setwd() to change it.

getwd()                            # print the current working directory
setwd("/Users/yourname/Documents") # set it to a specific folder path

In RStudio, you can also set the working directory by going to Session > Set Working Directory > Choose Directory or by using the Files pane (click “More” > “Set As Working Directory”).

Tip: Keep your data files and R scripts in the same folder. That way, you can set the working directory once and everything will be easy to find.

Reading a CSV file: read.csv()

Most data files you will work with in this course are CSV (comma-separated values) files. Use read.csv() to load a CSV file into R as a data frame. Assign the result to a named object so you can use it.

# Read a CSV file from your working directory
data <- read.csv("my_data.csv")

# Read a CSV file from a specific file path
data <- read.csv("/Users/yourname/Documents/my_data.csv")

After loading, always check the data with View(), head(), or str() (see Section 4) to make sure it loaded correctly.

Saving data to a CSV file: write.csv()

After cleaning or transforming your data, you can save it as a new CSV file.

write.csv(data, "cleaned_data.csv", row.names = FALSE)

The row.names = FALSE argument prevents R from adding an extra column of row numbers to the file.


3. Loading Packages

library()

Loads an installed package into your current R session. Packages must be loaded each time you start R.

library(tidyverse)    # ggplot2, dplyr, and more
library(car)          # leveneTest()
library(effsize)      # cohen.d()
library(effectsize)   # eta_squared()
library(pwr)          # pwr.anova.test()
library(corrplot)     # corrplot()
library(afex)         # aov_ez() for repeated-measures and mixed ANOVA
library(emmeans)      # post-hoc for afex models
library(lavaan)       # sem() for mediation / SEM
library(patchwork)    # combine multiple ggplots side by side

4. Exploring Data

View()

Opens the dataset in a spreadsheet-style viewer in RStudio.

View(mtcars)

? (help)

Opens the help page for a function or dataset.

?mtcars
?mean

str()

Shows the structure of a dataset – variable names, types, and sample values. Useful for checking whether variables are numeric, factor, etc.

str(ToothGrowth)

dim() and nrow()

dim() returns rows and columns; nrow() returns just the number of rows.

dim(mtcars)    # 32 rows, 11 columns
nrow(mtcars)   # 32

names()

Returns the column (variable) names.

names(mtcars)

summary()

Returns key descriptive statistics (min, max, quartiles, mean) for each variable.

summary(mtcars)

5. Cleaning Data

Before running any analysis, it is good practice to inspect your data for missing values and outliers. Outliers can distort means, inflate standard deviations, and violate assumptions of normality and equal variances.

Checking for missing values: is.na() and sum()

Use is.na() to test each value and sum() to count how many are missing.

sum(is.na(mtcars$mpg))          # how many missing values in mpg?
colSums(is.na(mtcars))          # count missing values for every column at once

Identifying outliers with boxplot.stats()

The function boxplot.stats() uses the IQR method to identify outliers. By default it flags values beyond 1.5 × IQR from Q1/Q3. In this course, we use a stricter 3 × IQR rule (to avoid flagging too many values), which you set with coef = 3. The $out element returns the actual outlier values.

# Get outlier values using the 3×IQR rule
boxplot.stats(mtcars$mpg, coef = 3)$out

If this returns numeric(0), there are no outliers. If it returns values, those are the specific data points flagged as outliers.

To find which rows contain outliers (so you know which participants/observations to investigate):

# Store the outlier values
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out

# Find which rows contain those values
which(mtcars$mpg %in% outlier_values)

What does coef do? It sets how strict the outlier rule is. coef = 1.5 (the default) is the standard boxplot rule and flags more points. coef = 3 is stricter and only flags extreme outliers. In this course, we use coef = 3.

Spotting outliers visually: boxplots and histograms

Boxplots display outliers as individual points beyond the whiskers. Histograms reveal extreme values sitting far from the rest of the distribution. See Section 17 (Visualization Gallery) for runnable examples of both plots.

boxplot(mtcars$mpg, main = "MPG Boxplot")                  # base R boxplot
hist(mtcars$mpg, main = "MPG Histogram", breaks = 10)      # base R histogram

Removing or filtering outliers

Once you have identified outliers, you can use filter() (see Section 7) to create a clean dataset without them.

library(tidyverse)
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out
clean_data <- filter(mtcars, !(mpg %in% outlier_values))

When should you remove outliers? There is no single rule. Remove data points only when you have a substantive reason (e.g., data entry error, participant who did not follow instructions). Always report whether outliers were removed and how many.


6. Descriptive Statistics

mean(), sd(), var(), median()

Core descriptive statistics. Use na.rm = TRUE if there are missing values.

mean(mtcars$mpg)
sd(mtcars$mpg)
var(mtcars$mpg)
median(mtcars$mpg)

range(), quantile(), IQR()

Distribution measures useful for outlier detection and data description.

range(mtcars$mpg)
quantile(mtcars$mpg)                     # 0%, 25%, 50%, 75%, 100%
quantile(mtcars$mpg, probs = 0.90)       # 90th percentile
IQR(mtcars$mpg)                          # interquartile range (Q3 - Q1)

table()

Counts the number of observations per group. Works with one or two grouping variables.

table(ToothGrowth$supp)                         # one factor
table(ToothGrowth$supp, ToothGrowth$dose)        # two factors (contingency table)

tapply()

Applies a function (like mean or sd) separately to each group. Use list() for two grouping variables. The syntax is: tapply(outcome_variable, grouping_variable, function).

# One grouping variable
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
tapply(ToothGrowth$len, ToothGrowth$supp, sd)

# Two grouping variables (for factorial designs)
tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)

7. Data Manipulation

filter() (dplyr)

Keeps only the rows that meet a condition.

library(tidyverse)
filter(mtcars, mpg > 25)
filter(ToothGrowth, supp == "OJ")

mutate() and ifelse()

mutate() creates a new variable (column) or modifies an existing one. You can use it with arithmetic, functions, or ifelse() to categorize values based on a condition.

library(tidyverse)

# Create a new column that categorizes cars as heavy or light
mtcars <- mutate(mtcars, heavy = ifelse(wt > 3, "Yes", "No"))

# Create a new column by doing arithmetic on existing columns
mtcars <- mutate(mtcars, hp_per_cyl = hp / cyl)

# Create a new column using a function
mtcars <- mutate(mtcars, mpg_z = as.numeric(scale(mpg)))

# Create a composite score as the mean of several columns
# (see also rowMeans() below for an alternative approach)
survey <- data.frame(q1 = c(4, 5, 3), q2 = c(3, 4, 5), q3 = c(5, 5, 4))
survey <- mutate(survey, composite = (q1 + q2 + q3) / 3)

# Recode a variable into multiple categories with nested ifelse()
mtcars <- mutate(mtcars, mpg_group = ifelse(mpg > 25, "High",
                                     ifelse(mpg > 15, "Medium", "Low")))

Pipe operator %>%

The pipe operator %>% comes from the dplyr package (loaded with tidyverse). It chains operations left-to-right so you can read code as “take this, then do this, then do this.” Keyboard shortcut: Ctrl + Shift + M (Cmd + Shift + M on Mac).

mtcars %>%
  filter(cyl == 4) %>%
  summarize(avg_mpg = mean(mpg))

scale()

Standardizes a variable to z-scores (mean = 0, SD = 1).

mtcars$mpg_z <- as.numeric(scale(mtcars$mpg))   # saves z-scores as a new column

factor()

Converts a numeric variable to a categorical factor. Required before running ANOVA on numeric grouping variables.

ToothGrowth$dose <- factor(ToothGrowth$dose)

which(), is.na(), !

which() returns row numbers where a condition is true. is.na() tests for missing values. ! negates.

which(mtcars$mpg > 30)   # which rows have mpg over 30?
is.na(df$score)           # TRUE where values are missing
!is.na(df$score)          # TRUE where values are NOT missing

Computing composite scores: rowMeans()

In psychology research, you often need to average several questionnaire items into a single composite score for each participant. rowMeans() calculates the mean across columns for each row. Use na.rm = TRUE to handle missing values.

# Suppose your data frame has columns q1 through q5 for a 5-item scale
# and you want to create a composite score for each participant

# Method 1: select columns by position (columns 2 through 6)
df$composite <- rowMeans(df[, 2:6], na.rm = TRUE)

# Method 2: select columns by name
df$composite <- rowMeans(df[, c("q1", "q2", "q3", "q4", "q5")], na.rm = TRUE)

# Method 3: use paste0() shorthand when columns are numbered sequentially
df$composite <- rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE)

# Method 4: use mutate() with rowMeans() for a tidyverse approach
df <- mutate(df, composite = rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE))

The resulting composite column contains each participant’s average across those items. You can then use this composite variable as an outcome or predictor in your analyses.

Handling missing values: na.rm = TRUE

Use na.rm = TRUE if there are missing values. Many R functions (e.g., mean(), sd(), sum(), rowMeans()) will return NA if any input value is missing – unless you tell them to ignore missing values with na.rm = TRUE.

mean(df$score, na.rm = TRUE)         # mean ignoring NAs
sd(df$score, na.rm = TRUE)           # standard deviation ignoring NAs
rowMeans(df[, 2:6], na.rm = TRUE)    # row means ignoring NAs

Converting scientific notation to regular numbers: format() and options()

R sometimes displays very small p-values or very large numbers in scientific notation (e.g., 2.5e-08 instead of 0.000000025). There are two ways to see the full number.

# Convert a single number to regular notation
format(2.5e-08, scientific = FALSE)

# Turn off scientific notation globally for the rest of your session
options(scipen = 999)

# To turn it back on later
options(scipen = 0)

scipen = 999 tells R to strongly prefer regular notation. This is especially helpful when reading p-values from statistical output.

Wide to long format: pivot_longer() (tidyr)

Within-subjects (repeated-measures) ANOVA requires data in long format, where each row is one observation. If your data is in wide format (one row per participant, multiple columns for each condition), convert it with pivot_longer().

library(tidyverse)

# Wide format: columns time1, time2, time3 for each participant
wide_data <- data.frame(
  id = 1:5,
  time1 = c(30, 28, 35, 32, 29),
  time2 = c(25, 22, 28, 27, 24),
  time3 = c(20, 18, 22, 21, 19)
)

# Convert to long format
long_data <- pivot_longer(wide_data,
                          cols = time1:time3,
                          names_to = "time",
                          values_to = "score")
long_data$id <- factor(long_data$id)

In long format, the dataset goes from 5 rows (one per person) to 15 rows (one per person per time point), with columns for id, time, and score.


8. T-Tests

T-tests compare means between two conditions. Before running a t-test, check the assumptions of normality (Shapiro-Wilk test or histogram) and equal variances (Levene’s test for independent samples). Recommended visuals: bar graph of group means with error bars, or boxplot.

One-sample t-test

Tests whether a sample mean differs from a known value.

t.test(faithful$waiting, mu = 60)

Independent samples t-test

Compares means of two independent groups. Use var.equal = TRUE for the standard t-test or var.equal = FALSE for Welch’s t-test (when variances are unequal).

# Standard t-test (assumes equal variances)
t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

# Welch's t-test (does not assume equal variances)
t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

Paired samples t-test

Compares means from the same participants measured twice. Must use the two-vector method (the formula interface does not support paired = TRUE).

t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2],
       paired = TRUE)

Effect size: Cohen’s d

Use the effsize package. Thresholds: 0.2 = small, 0.5 = medium, 0.8 = large. Report d numerically always; only describe the magnitude verbally (e.g., “a medium effect”) when the t-test is significant.

library(effsize)

# Independent samples
cohen.d(len ~ supp, data = ToothGrowth)

# Paired samples
cohen.d(sleep$extra[sleep$group == 1],
        sleep$extra[sleep$group == 2],
        paired = TRUE)

APA write-up template

[Group 1] (M = ##.##, SD = ##.##) [was/were] [significantly/not significantly] [higher/lower/different] compared to [Group 2] (M = ##.##, SD = ##.##), t(df) = ##.##, p = .###, d = ##.##.

Note: The APA template requires group means and standard deviations. Use tapply() from Section 6 to compute them by group – for example, tapply(ToothGrowth$len, ToothGrowth$supp, mean) and tapply(ToothGrowth$len, ToothGrowth$supp, sd).


9. One-Way ANOVA

One-way ANOVA tests whether three or more group means differ. Before running, check normality of residuals (Shapiro-Wilk) and equality of variances (Levene’s test). If variances are unequal, use Welch’s ANOVA. Recommended visuals: boxplot by group, or bar graph with error bars.

Important: If your grouping variable is numeric (e.g., dose levels like 0.5, 1, 2), you must convert it to a factor before running the ANOVA. See factor() in Section 7 (Data Manipulation).

Between-subjects one-way ANOVA

aov() + summary()

Save the model for use with summary(), TukeyHSD(), and residuals().

plant_model <- aov(weight ~ group, data = PlantGrowth)
summary(plant_model)

Welch’s ANOVA: oneway.test()

Robust alternative when the equal variances assumption is violated.

oneway.test(weight ~ group, data = PlantGrowth, var.equal = FALSE)

Post-hoc: TukeyHSD() and Bonferroni

Only run post-hoc tests when the overall ANOVA is significant. TukeyHSD is the standard; Bonferroni is a more conservative alternative.

TukeyHSD(plant_model)
plot(TukeyHSD(plant_model))   # CI plot

# Bonferroni alternative
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "bonferroni")

Effect size: eta-squared

Use the effectsize package. Thresholds: .01 = small, .06 = medium, .14 = large. For one-way ANOVA, eta-squared and partial eta-squared are identical.

library(effectsize)
eta_squared(plant_model)                  # eta-squared
eta_squared(plant_model, partial = TRUE)  # partial eta-squared

Pairwise Cohen’s d

For significant pairwise comparisons, compute Cohen’s d to report effect sizes.

library(effsize)
cohen.d(PlantGrowth$weight[PlantGrowth$group == "trt1"],
        PlantGrowth$weight[PlantGrowth$group == "trt2"])

APA write-up template

A one-way ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, eta-squared = .##. Tukey-corrected post-hoc comparisons revealed that [Group A] was significantly [higher/lower] than [Group B] (p = .###, d = ##.##), while [other comparisons].

Note: The APA template requires group means and standard deviations. Use tapply() from Section 6 to compute them by group – for example, tapply(PlantGrowth$weight, PlantGrowth$group, mean) and tapply(PlantGrowth$weight, PlantGrowth$group, sd).

Within-subjects (repeated-measures) one-way ANOVA

When the same participants are measured under all conditions, use a repeated-measures ANOVA. Data must be in long format (one row per observation; see Section 7 for pivot_longer()). The afex package provides aov_ez() which handles within-subjects designs and automatically applies the Greenhouse-Geisser sphericity correction when needed.

aov_ez() (afex package)

Arguments: id = participant ID column (as a string), dv = outcome variable (as a string), data = data frame, within = within-subjects factor(s) (as a string vector).

library(afex)
anxiety_model <- aov_ez(id = "id", dv = "anxiety",
                        data = anxiety_data, within = c("time"))
anxiety_model

Post-hoc: emmeans()

The emmeans package provides pairwise comparisons for afex models. Use adjust = "tukey" (default) or adjust = "bonferroni".

library(emmeans)
emmeans(anxiety_model, pairwise ~ time)                        # Tukey (default)
emmeans(anxiety_model, pairwise ~ time, adjust = "bonferroni") # Bonferroni

Effect size: generalized eta-squared

aov_ez() reports generalized eta-squared (ges) in its output. Interpretation thresholds are the same: .01 = small, .06 = medium, .14 = large.

APA write-up template

A repeated-measures ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, generalized eta-squared = .##. Tukey-corrected pairwise comparisons revealed that [condition A] was significantly [higher/lower] than [condition B] (p = .###).

Note: The APA template requires condition means and standard deviations. Use tapply() from Section 6 to compute them – for example, tapply(anxiety_data$anxiety, anxiety_data$time, mean) and tapply(anxiety_data$anxiety, anxiety_data$time, sd).


10. Two-Way / Factorial ANOVA

Factorial ANOVA tests the effects of two (or more) factors and their interaction. Before running, check normality of residuals and equality of variances. Recommended visuals: interaction plot (line graph) and grouped bar graph with error bars.

Important: If any grouping variable is numeric (e.g., dose levels like 0.5, 1, 2), you must convert it to a factor before running the ANOVA. See factor() in Section 7 (Data Manipulation).

Between-subjects factorial ANOVA

aov() with * for interaction

The * operator includes both main effects and the interaction. Produces three F-tests.

# Convert dose to factor first (required for numeric grouping variables)
ToothGrowth$dose <- factor(ToothGrowth$dose)

tooth_model <- aov(len ~ supp * dose, data = ToothGrowth)
summary(tooth_model)

Post-hoc: TukeyHSD() on the full model

Returns three sections: $supp (main effect of supp), $dose (main effect of dose), and $supp:dose (all cell-to-cell comparisons for the interaction).

TukeyHSD(tooth_model)

Effect size: partial eta-squared

In a factorial design, always use partial eta-squared because it accounts for the other factors.

library(effectsize)
eta_squared(tooth_model, partial = TRUE)

APA write-up template

A 2 x 3 between-subjects ANOVA examined the effects of [factor A] and [factor B] on [outcome]. There was a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. There was a significant main effect of [factor B], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. The [factor A] x [factor B] interaction was [significant/non-significant], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##.

When a significant interaction is present, focus the write-up on the interaction pattern (e.g., the effect of dose depended on supplement type). Main effects are less informative when qualified by a significant interaction.

Within-subjects factorial ANOVA

When the same participants experience all combinations of two factors, use aov_ez() with both factors listed in within.

library(afex)
# 2 (task) x 3 (difficulty) within-subjects design
cog_model <- aov_ez(id = "id", dv = "rt",
                    data = cognitive_data,
                    within = c("task", "difficulty"))
cog_model

Post-hoc: simple effects with emmeans()

Use | to run pairwise comparisons of one factor separately at each level of the other (simple effects). This is how you decompose a significant interaction.

library(emmeans)
emmeans(cog_model, pairwise ~ difficulty | task)                        # Tukey
emmeans(cog_model, pairwise ~ difficulty | task, adjust = "bonferroni") # Bonferroni

Mixed ANOVA (between + within)

A mixed design has at least one between-subjects factor and at least one within-subjects factor. Use aov_ez() with both the within and between arguments.

# group is between-subjects, time is within-subjects
therapy_model <- aov_ez(id = "id", dv = "depression",
                        data = therapy_data,
                        within = c("time"), between = c("group"))
therapy_model

Post-hoc for mixed designs

emmeans(therapy_model, pairwise ~ time | group)                        # Tukey
emmeans(therapy_model, pairwise ~ time | group, adjust = "bonferroni") # Bonferroni

APA write-up templates (within-subjects and mixed)

Within-subjects factorial: A 2 x 3 within-subjects ANOVA revealed a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, ges = .##. The [factor A] x [factor B] interaction was significant, F(df1, df2) = ##.##, p = .###, ges = .##. Tukey-corrected simple effects showed that [describe pattern].

Mixed ANOVA: A mixed ANOVA with [between factor] as the between-subjects factor and [within factor] as the within-subjects factor revealed a significant interaction, F(df1, df2) = ##.##, p = .###, ges = .##. Bonferroni-corrected post-hoc comparisons showed that [the therapy group improved significantly across time points while the control group did not].

Note: APA write-ups for factorial and mixed ANOVAs require cell means and standard deviations for each condition combination. Use tapply() with list() from Section 6 to compute them – for example, tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean).


11. Correlation

Correlation measures the strength and direction of the linear relationship between two continuous variables. Before interpreting, check for linearity (scatterplot) and outliers. Recommended visual: scatterplot with regression line.

cor.test() – Pearson and Spearman

Pearson r is the default; use Spearman rho for non-normal or ordinal data. Correlation strength thresholds: .10 = weak, .30 = moderate, .50 = strong.

# Pearson correlation (default)
cor.test(mtcars$wt, mtcars$mpg)

# Spearman rank correlation
cor.test(mtcars$wt, mtcars$mpg, method = "spearman")

Correlation matrix: cor()

Computes correlations for multiple variables at once. Subset the data frame to include only the variables you want.

cor(mtcars[, c("mpg", "wt", "hp", "disp")])

Correlation heatmap: corrplot()

Creates a color-coded heatmap of a correlation matrix. Save the matrix as an object first.

library(corrplot)
car_cors <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
corrplot(car_cors, method = "color", addCoef.col = "black")

APA write-up template

There was a [strong/moderate/weak] [positive/negative] correlation between [variable 1] and [variable 2], r(df) = .##, p = .###.

The effect size for correlation is the r value itself – no separate calculation needed.


12. Regression

Regression predicts an outcome from one or more predictors. Before interpreting, check residuals for normality (Shapiro-Wilk on residuals) and homoscedasticity (residuals vs. fitted plot). Recommended visuals: scatterplot with regression line for simple regression, residuals vs. fitted plot for assumption checking.

Simple linear regression: lm()

Fits a model predicting one outcome from one predictor. The equation is Y = b0 + b1*X.

model <- lm(mpg ~ wt, data = mtcars)
summary(model)

Key output: Intercept (b0) = predicted Y when X = 0. Slope (b1) = change in Y for each one-unit increase in X. Multiple R-squared = proportion of variance explained. F-statistic and p-value = overall model significance.

Multiple regression: lm() with +

Predicts an outcome from two or more predictors entered simultaneously. Each coefficient represents the effect of that predictor while controlling for the others.

full_model <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(full_model)

Hierarchical regression: anova() to compare models

Tests whether adding predictors significantly improves the model. A significant p-value means the fuller model explains significantly more variance.

model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(model1, model2)

Effect size: R-squared

R-squared is reported directly in the summary() output. Thresholds: .02 = small, .13 = medium, .26 = large.

Assumption checking for regression

model <- lm(mpg ~ wt, data = mtcars)
plot(model, which = 1)   # residuals vs fitted  --  look for random scatter
shapiro.test(residuals(model))   # normality of residuals

APA write-up templates

Simple regression: A simple linear regression showed that [predictor] significantly predicted [outcome], F(1, df) = ##.##, p = .###, R-squared = .##. For every one-unit increase in [predictor], [outcome] [increased/decreased] by ##.## units (b = ##.##, t = ##.##, p = .###).

Multiple regression: A multiple regression model significantly predicted [outcome], F(df1, df2) = ##.##, p = .###, R-squared = .##. [Predictor 1] (b = ##.##, p = .###) and [Predictor 2] (b = ##.##, p = .###) were significant predictors, while [Predictor 3] (b = ##.##, p = .###) was not significant.

Hierarchical regression: Adding [predictors] to the model significantly improved prediction, F-change(df1, df2) = ##.##, p = .###, R-squared change = .##.


13. Mediation Analysis

Mediation tests whether an IV affects a DV through an intermediate variable (the mediator). The indirect effect (a * b) represents the portion of the IV-DV relationship that goes through the mediator. Use the lavaan package and bootstrap confidence intervals for the indirect effect. No specific assumption checks beyond those for regression (normality, linearity); recommended visuals: scatterplots of each path.

Defining the lavaan model

The model is a text string using ~ for regressions, * for labeling paths, and := for defining computed parameters.

library(lavaan)

med_model <- '
  self_esteem ~ a*social_support
  depression ~ b*self_esteem + cp*social_support
  indirect := a*b
  total := cp + a*b
'

Line by line: Line 1 = a path (IV → mediator). Line 2 = b path (mediator → DV) and c’ path (direct effect, IV → DV controlling for mediator). Line 3 = indirect effect defined as a * b. Line 4 = total effect defined as c’ + indirect.

Fitting the model with sem()

med_fit <- sem(med_model, data = mediation_data)
summary(med_fit)

The Regressions section shows a, b, and c’ paths. The Defined Parameters section shows the indirect and total effects.

Bootstrap confidence intervals

Bootstrap CIs are the gold-standard test for the indirect effect because the sampling distribution of a * b is often not normal.

med_boot <- sem(med_model, data = mediation_data,
                se = "bootstrap", bootstrap = 100)
parameterEstimates(med_boot)

If the bootstrap CI for indirect does not include zero, mediation is supported.

Full vs. partial mediation

Full mediation: the indirect effect is significant but the direct effect (c’) is not – the entire IV-DV relationship goes through the mediator. Partial mediation: both the indirect effect and the direct effect are significant.

APA write-up template

A mediation analysis using lavaan tested whether [mediator] mediated the effect of [IV] on [DV]. The indirect effect was significant (b = ##.##, 95% bootstrap CI [##.##, ##.##]), supporting [full/partial] mediation. The direct effect of [IV] on [DV] was [significant/not significant] (b = ##.##, p = .###).

Note: When writing up mediation results, you may want to report means and standard deviations for each variable. Use mean() and sd() from Section 6 – for example, mean(mediation_data$social_support) and sd(mediation_data$social_support).


14. Chi-Square Tests

Chi-square tests analyze categorical (nominal) data – variables whose values represent categories rather than quantities. No distributional assumptions are needed, but expected cell frequencies should be at least 5 for the chi-square approximation to be valid (use Fisher’s exact test if they are not). Recommended visual: bar chart of observed frequencies.

Goodness-of-fit test

Tests whether the distribution of one categorical variable matches expected proportions. By default, chisq.test() assumes equal expected proportions.

# Count frequencies
table(survey_data$preference)

# Test against equal distribution
chisq.test(table(survey_data$preference))

# Test against specific expected proportions
chisq.test(table(survey_data$preference), p = c(0.40, 0.30, 0.20, 0.10))

Degrees of freedom = number of categories minus 1.

Test of independence

Tests whether two categorical variables are related. First create a contingency table with table(), then run chisq.test().

am_cyl_table <- table(mtcars$am, mtcars$cyl)
am_cyl_table
chisq.test(am_cyl_table)

If R warns that “Chi-squared approximation may be incorrect,” some expected cell frequencies are below 5. Use Fisher’s exact test instead.

Fisher’s exact test

Use when expected cell frequencies are too low for the chi-square approximation.

fisher.test(am_cyl_table)

Effect size: Cramer’s V

Computed manually from the chi-square result. Thresholds: .10 = small, .30 = medium, .50 = large.

chi_result <- chisq.test(am_cyl_table)

# V = sqrt(X2 / (N * (k - 1))), where k = smaller table dimension
sqrt(chi_result$statistic / (sum(am_cyl_table) * (min(dim(am_cyl_table)) - 1)))

APA write-up templates

Goodness-of-fit: A chi-square goodness-of-fit test indicated that therapy preferences were not equally distributed, X2(df) = ##.##, p = .###.

Independence: A chi-square test of independence showed a significant association between [variable 1] and [variable 2], X2(df, N = ##) = ##.##, p = .###, Cramer’s V = .##. [If applicable: Fisher’s exact test confirmed the result, p = .###.]


15. Power Analysis

Power analysis helps you determine the sample size needed to detect an effect, or the power of an existing study. Run these before data collection (a priori) whenever possible.

power.t.test()

For t-tests. Leave one argument as NULL to solve for it.

# Calculate power given n, effect size (delta), and sd
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05, type = "two.sample")

# Calculate required n for 80% power
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80, type = "two.sample")

pwr.anova.test() (pwr package)

For one-way ANOVA. Uses Cohen’s f for effect size (0.10 = small, 0.25 = medium, 0.40 = large).

library(pwr)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)

16. Assumption Checking

shapiro.test() – Normality

Shapiro-Wilk test. If p > .05, normality is not violated. For ANOVA and regression, test the residuals rather than the raw data.

# On a variable directly
shapiro.test(mtcars$mpg)

# On residuals from a model
model <- aov(len ~ supp, data = ToothGrowth)
shapiro.test(residuals(model))

If normality is violated, check sample size: with n >= 30 per group, the test is robust due to the Central Limit Theorem.

leveneTest() (car package) – Equal Variances

If p > .05, the equal variances assumption is met.

library(car)
leveneTest(len ~ supp, data = ToothGrowth)
leveneTest(len ~ supp * dose, data = ToothGrowth)   # factorial design

If variances are unequal, use Welch’s t-test (var.equal = FALSE) or Welch’s ANOVA (oneway.test()).

Residual diagnostic plot – Linearity and Homoscedasticity

For regression models. Look for random scatter; a funnel or curve indicates a violation.

model <- lm(mpg ~ wt, data = mtcars)
plot(model, which = 1)   # residuals vs fitted values

Quick reference: which assumptions for which test

Test Normality Equal Variances Other
One-sample t-test Yes (or n >= 30) N/A
Independent t-test Yes (or n >= 30) Yes (Levene’s)
Paired t-test Yes (of differences) N/A
One-way ANOVA Yes (residuals) Yes (Levene’s)
Two-way ANOVA Yes (residuals) Yes (Levene’s)
Repeated-measures ANOVA Sphericity (auto-corrected by afex)
Correlation Normality for Pearson Linearity (scatterplot)
Regression Yes (residuals) Homoscedasticity (residual plot) Linearity
Chi-square N/A N/A Expected freq >= 5

Appendix: Complete Function Reference

An alphabetical listing of every function mentioned in this guide, with a brief description and the section where it is covered.

Function Description Section
? Open help page for a function or dataset 4
annotate() Add text, segments, or labels to a ggplot 17
anova() Compare two nested regression models (hierarchical regression) 12
aov() Fit a between-subjects ANOVA model 9, 10
aov_ez() Fit repeated-measures or mixed ANOVA (afex package) 9, 10
as.numeric() Convert an object to numeric (e.g., remove scale attributes) 7
boxplot() Base R boxplot (quick outlier check) 5
boxplot.stats() Identify outlier values using IQR method; use $out to extract them 5
c() Combine values into a vector 1
cat() Print text to the console (used for lavaan model strings) 13
chisq.test() Chi-square goodness-of-fit or test of independence 14
cohen.d() Compute Cohen’s d effect size (effsize package) 8, 9
colSums() Sum each column (used with is.na() to count missing values) 5
colors() List all 657 built-in R color names 17
cor() Compute a correlation matrix 11
cor.test() Test a single correlation (Pearson or Spearman) 11
coord_cartesian() Zoom into a plot without removing data 17
corrplot() Draw a correlation heatmap (corrplot package) 11, 17
data.frame() Create a data frame from vectors 1
dim() Return the number of rows and columns 4
emmeans() Pairwise post-hoc comparisons for afex models (emmeans package) 9, 10
eta_squared() Compute eta-squared or partial eta-squared (effectsize package) 9, 10
factor() Convert a variable to a categorical factor 7
filter() Keep rows that meet a condition (dplyr) 7
fisher.test() Fisher’s exact test (when expected cell frequencies < 5) 14
format() Convert scientific notation to regular numbers 7
geom_bar() Bar chart geometry (counts observations per category) 17
geom_boxplot() Boxplot geometry 17
geom_histogram() Histogram geometry 17
geom_line() Line geometry 17
geom_point() Scatterplot point geometry 17
geom_smooth() Add a fitted line (e.g., regression) with CI band 17
getwd() Print the current working directory 2
ggplot() Initialize a ggplot with data and aesthetic mappings 17
head() Display the first 6 rows of a data frame 4
hist() Base R histogram (quick distribution check) 5
ifelse() Return different values depending on a condition 7
IQR() Compute the interquartile range (Q3 - Q1) 6
is.na() Test for missing values (returns TRUE/FALSE) 5, 7
labs() Set plot title, axis labels, and legend title in ggplot 17
leveneTest() Levene’s test for equal variances (car package) 16
library() Load an installed package 3
lm() Fit a linear regression model 12
mean() Calculate the arithmetic mean 6
median() Calculate the median 6
mutate() Create or modify a variable in a data frame (dplyr) 7
names() Return column names of a data frame 4
nrow() Return the number of rows 4
oneway.test() Welch’s ANOVA (robust to unequal variances) 9
options() Set global R options (e.g., scipen for notation) 7
pairwise.t.test() Pairwise t-tests with p-value correction (e.g., Bonferroni) 9
parameterEstimates() Extract bootstrap CIs from a lavaan model 13
pivot_longer() Convert wide-format data to long format (tidyr) 7
plot() Base R plotting; also used for TukeyHSD CI plots and residual diagnostics 9, 12, 16
power.t.test() Power analysis for t-tests 15
pwr.anova.test() Power analysis for one-way ANOVA (pwr package) 15
quantile() Compute percentiles 6
range() Return the minimum and maximum 6
read.csv() Read a CSV file into a data frame 2
residuals() Extract residuals from a fitted model 12, 16
rowMeans() Compute the mean across columns for each row (composite scores) 7
scale() Standardize a variable to z-scores 7
scale_color_manual() Set custom colors for lines and points in ggplot 17
scale_fill_manual() Set custom colors for bars, boxes, and filled areas in ggplot 17
scale_x_continuous() Control x-axis limits, breaks, and labels 17
scale_y_continuous() Control y-axis limits, breaks, and labels 17
sd() Calculate the standard deviation 6
sem() Fit a structural equation / mediation model (lavaan package) 13
seq() Generate a sequence of numbers (used for axis tick marks) 17
setwd() Set the working directory 2
shapiro.test() Shapiro-Wilk test for normality 16
stat_summary() Compute and plot summary statistics in ggplot 17
str() Show the structure of a data frame 4
sum() Sum values (used with is.na() to count missing data) 5
summary() Descriptive summary of a data frame or model 4, 12, 13
t.test() Run a t-test (one-sample, independent, or paired) 8
table() Count observations per group; create contingency tables 6, 14
tapply() Apply a function by group (e.g., group means) 6
theme() Fine-tune individual ggplot elements (fonts, legend, etc.) 17
TukeyHSD() Tukey post-hoc pairwise comparisons for aov models 9, 10
var() Calculate the variance 6
View() Open a data frame in the RStudio viewer 4
which() Return indices where a condition is TRUE 5, 7
write.csv() Save a data frame to a CSV file 2