How to Use This Guide

This document lists every R function taught across the R for PSYCH swirl course (Modules 1-11 plus Bonus Modules). Functions are organized by workflow stage and analysis type. Early sections cover importing data, data exploration, cleaning and outlier detection, descriptive statistics, and data manipulation. Each analysis section is self-contained: it shows you how to run the test, calculate effect sizes, interpret the output, and write up the results in APA format, and it tells you which assumptions to check and which visuals to use.

Power analyses, assumption-checking functions, and a visualization gallery (with runnable example plots) appear at the end after all analysis sections.

Code chunks run automatically when you knit this document. To run a single chunk, click the green “Run” arrow.

How to use code examples from this guide

Every code example in this guide is designed for you to copy, paste, and modify for your own data. Here is what to replace:

Data frame names (like mtcars, ToothGrowth, PlantGrowth) – replace with the name of YOUR data frame (whatever you called it when you loaded your data with read.csv()).
Variable/column names (like mpg, len, supp, weight) – replace with the names of YOUR variables. Remember: spelling and capitalization must match exactly what appears in your data.
Text in quotation marks (like "My Title", "Group", "steelblue") – replace with your own titles, labels, or color choices.

For example, if the guide shows:

t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

And your data frame is called my_data, your outcome variable is anxiety, and your grouping variable is condition, you would type:

t.test(anxiety ~ condition, data = my_data, var.equal = TRUE)

What NOT to change: Function names (like t.test, ggplot, aov), operators (like ~, +, <-), and argument names (like data =, x =, fill =). These are part of R’s syntax and must stay exactly as shown.

1. Getting Familiar with R

R vs. RStudio

R is the programming language – it does the actual computing. RStudio is the application you work in – it provides a user-friendly interface for writing, running, and organizing your R code. Think of R as the engine and RStudio as the dashboard. You need both installed, but you will always open RStudio (not R directly).

The four panes in RStudio

When you open RStudio, you see four panes (panels):

Top-left – Source/Script pane: Where you write and edit your code files (R scripts or R Markdown documents). Code here does not run until you tell it to.

Bottom-left – Console: Where code actually runs and output appears. You can type commands here directly, but it is better to write code in the Source pane so you can save and reuse it.

Top-right – Environment: Shows all the objects (data frames, variables, models) currently stored in your R session. Also has a History tab showing previous commands.

Bottom-right – Files / Plots / Packages / Help / Viewer: A multi-purpose pane. The Files tab shows your working directory. The Plots tab displays your graphs. The Help tab shows function documentation.

R scripts vs. R Markdown

An R script (.R file) is a plain text file containing only R code and comments. Every line is code unless it starts with #.

An R Markdown (.Rmd file) combines text, formatting, and code chunks in one document. When you “knit” an R Markdown file, it produces a formatted report (HTML, PDF, or Word) with your text, code, and output all together. Code lives inside code chunks (boxes that start and end with three backticks).

Using `#` for headers and the outline

In R scripts, # marks a comment – R ignores everything after it on that line. You can also use # followed by a section name and four dashes (----) to create sections that appear in the document outline:

# Section 1: Load Data ----
# Section 2: Clean Data ----

In R Markdown, # at the start of a line creates a header. More # symbols mean smaller headers:

# Big Header (Level 1)
## Smaller Header (Level 2)
### Even Smaller (Level 3)

To view the outline in RStudio, click the small document-outline icon in the top-right corner of the Source pane (it looks like a bulleted list). This lets you jump between sections quickly.

Running code

In an R script:

Place your cursor on a line and press Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run that single line.
Highlight multiple lines and press Ctrl + Enter to run the entire selection.
Click the “Run” button at the top-right of the Source pane (or use the dropdown arrow next to it for more options).

In R Markdown code chunks:

Click the green “play” arrow at the top-right corner of a code chunk to run just that chunk.
Click the gray downward arrow (to the left of the play button) to run all chunks above the current one.
To knit the entire document (run all chunks and produce the formatted output), click the “Knit” button at the top of the Source pane.

Important things to know about R

R is case-sensitive. Mean() is not the same as mean() – the first will produce an error, the second will work. This applies to everything: function names, variable names, dataset names, and factor levels. If something isn’t working, check your capitalization first.

Spacing generally does not matter. x <- 5 and x<-5 do the same thing, but adding spaces makes your code easier to read.

R uses <- for assignment, not =. While = sometimes works, <- is the standard in R and is what you should use.

Quotation marks matter. Text strings need to be wrapped in quotes ("OJ" or 'OJ'), but variable and object names do not. For example, filter(data, group == "control") puts quotes around the word “control” because it is a text value, but group has no quotes because it is a column name.

The $ operator accesses columns. To refer to a specific variable inside a data frame, use dataframe$variable – for example, mtcars$mpg.

Assignment operator `<-`

Stores a value in a named object so you can use it later.

my_value <- 42
my_name <- "psychology"

Important: If you save a new value to the same object name, it replaces the old object completely. The old value is gone.

x <- 10
x        # returns 10

x <- 25
x        # now returns 25  --  the 10 is gone forever

This also applies to data frames. If you write mtcars <- filter(mtcars, mpg > 15), the original mtcars is overwritten with the filtered version. If you want to keep the original, save the result to a new name (e.g., mtcars_filtered <- filter(mtcars, mpg > 15)).

Combine values `c()`

Creates a vector (a list of values).

scores <- c(85, 90, 78, 92, 88)
groups <- c("A", "B", "A", "B", "A")

Basic arithmetic

10 + 3     # addition
10 - 3     # subtraction
10 * 3     # multiplication
10 / 3     # division

Logical comparisons

5 == 5     # equals (TRUE)
5 != 3     # not equal (TRUE)
5 > 3      # greater than (TRUE)
5 < 3      # less than (FALSE)

Useful keyboard shortcuts in RStudio

Shortcut	What it does
Ctrl + Enter	Run the current line or selected code
Ctrl + Shift + 1	Expand/collapse the Source (script) pane
Ctrl + Shift + 2	Expand/collapse the Console pane
Ctrl + Shift + 9	Expand/collapse the Viewer pane
Alt + Ctrl + Shift + 0	Show all four panes
Ctrl + Shift + C	Comment/uncomment selected lines (adds or removes `#`)
Ctrl + Shift + M	Insert the pipe operator `%>%`
Ctrl + Alt + I	Insert a new R code chunk (in R Markdown)

Tip about commenting out code (Ctrl + Shift + C): Instead of deleting code you don’t want to run right now, highlight it and press Ctrl + Shift + C to add # in front of each line. This “comments it out” – R will skip it, but the code is still there if you want to use it later. Press the same shortcut again to uncomment.

Cautions and common mistakes

Backticks are not single quotes. The backtick character ` is usually found on the top-left key of your keyboard, below Escape and to the left of the number 1, sharing the key with the tilde ~. It looks similar to a single quote ' but they are different characters. In R Markdown, code chunks open and close with three backticks (```).

Do NOT delete the closing backticks of a code chunk. Every code chunk in R Markdown must have exactly three backticks at the start (```{r}) and three backticks at the end (```). If you accidentally delete the closing backticks, R will think the code chunk never ends and everything below will break.

Hiding or viewing code in knitted output

Code chunk options (the part inside {r}) control what appears when you knit your R Markdown document. Add options after the r, separated by commas:

{r, echo = FALSE} – Runs the code but hides the code itself in the knitted output (only the result/plot shows).
{r, eval = FALSE} – Shows the code in the output but does not run it.
{r, results = 'hide'} – Runs the code and shows it, but hides the text output (useful when you only want to see the plot, not the printed numbers).
{r, include = FALSE} – Runs the code silently (nothing appears in the output – useful for setup chunks like loading packages).
{r, message = FALSE, warning = FALSE} – Suppresses package loading messages and warnings in the output.

Setting defaults for the whole document. Instead of typing the same options on every single chunk, you can set defaults once at the top of your R Markdown document using a setup chunk. This is typically the very first chunk in the file:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

This tells R: “For every chunk in this document, show the code (echo = TRUE), suppress warnings, and suppress messages – unless I override these in a specific chunk.” Any option you set in an individual chunk will override the default for that chunk only.

Formatting text in R Markdown

Outside of code chunks, you can format text using Markdown syntax. This is how you add emphasis, lists, and other formatting to the written portions of your knitted document.

Format	Syntax	Result
Bold	`bold text`	bold text
Italic	`italic text`	italic text
Bold + italic	`*bold italic*`	*bold italic*
Superscript	`text^superscript^`	text^superscript
Subscript	`text~subscript~`	text_subscript
Strikethrough	`~~crossed out~~`	~~crossed out~~
Inline code	`code here`	`code here`

For bullet points and numbered lists, start each line with - or 1. (leave a blank line before the list):

- First bullet
- Second bullet
  - Indented sub-bullet

1. First item
2. Second item

To add a link: [link text](https://example.com). To insert an image: ![caption](image_path.png).

2. Importing Data

Before you can analyze data, you need to tell R where your files are and load them into R.

Setting the working directory: `setwd()` and `getwd()`

The working directory is the folder R looks in when you read or save files. Use getwd() to see your current working directory and setwd() to change it.

getwd()                            # print the current working directory
setwd("/Users/yourname/Documents") # set it to a specific folder path

In RStudio, you can also set the working directory by going to Session > Set Working Directory > Choose Directory or by using the Files pane (click “More” > “Set As Working Directory”).

Tip: Keep your data files and R scripts in the same folder. That way, you can set the working directory once and everything will be easy to find.

Reading a CSV file: `read.csv()`

Most data files you will work with in this course are CSV (comma-separated values) files. Use read.csv() to load a CSV file into R as a data frame. Assign the result to a named object so you can use it.

# Read a CSV file from your working directory
data <- read.csv("my_data.csv")

# Read a CSV file from a specific file path
data <- read.csv("/Users/yourname/Documents/my_data.csv")

After loading, always check the data with View(), head(), or str() (see Section 4) to make sure it loaded correctly.

Saving data to a CSV file: `write.csv()`

After cleaning or transforming your data, you can save it as a new CSV file.

write.csv(data, "cleaned_data.csv", row.names = FALSE)

The row.names = FALSE argument prevents R from adding an extra column of row numbers to the file.

3. Loading Packages

`library()`

Loads an installed package into your current R session. Packages must be loaded each time you start R.

library(tidyverse)    # ggplot2, dplyr, and more
library(car)          # leveneTest()
library(effsize)      # cohen.d()
library(effectsize)   # eta_squared()
library(pwr)          # pwr.anova.test()
library(corrplot)     # corrplot()
library(afex)         # aov_ez() for repeated-measures and mixed ANOVA
library(emmeans)      # post-hoc for afex models
library(lavaan)       # sem() for mediation / SEM
library(patchwork)    # combine multiple ggplots side by side

4. Exploring Data

`View()`

Opens the dataset in a spreadsheet-style viewer in RStudio.

View(mtcars)

`?` (help)

Opens the help page for a function or dataset.

?mtcars
?mean

`head()`

Displays the first 6 rows of a dataset.

head(mtcars)

`str()`

Shows the structure of a dataset – variable names, types, and sample values. Useful for checking whether variables are numeric, factor, etc.

str(ToothGrowth)

`dim()` and `nrow()`

dim() returns rows and columns; nrow() returns just the number of rows.

dim(mtcars)    # 32 rows, 11 columns
nrow(mtcars)   # 32

`names()`

Returns the column (variable) names.

names(mtcars)

`summary()`

Returns key descriptive statistics (min, max, quartiles, mean) for each variable.

summary(mtcars)

5. Cleaning Data

Before running any analysis, it is good practice to inspect your data for missing values and outliers. Outliers can distort means, inflate standard deviations, and violate assumptions of normality and equal variances.

Checking for missing values: `is.na()` and `sum()`

Use is.na() to test each value and sum() to count how many are missing.

sum(is.na(mtcars$mpg))          # how many missing values in mpg?
colSums(is.na(mtcars))          # count missing values for every column at once

Identifying outliers with `boxplot.stats()`

The function boxplot.stats() uses the IQR method to identify outliers. By default it flags values beyond 1.5 × IQR from Q1/Q3. In this course, we use a stricter 3 × IQR rule (to avoid flagging too many values), which you set with coef = 3. The $out element returns the actual outlier values.

# Get outlier values using the 3×IQR rule
boxplot.stats(mtcars$mpg, coef = 3)$out

If this returns numeric(0), there are no outliers. If it returns values, those are the specific data points flagged as outliers.

To find which rows contain outliers (so you know which participants/observations to investigate):

# Store the outlier values
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out

# Find which rows contain those values
which(mtcars$mpg %in% outlier_values)

What does coef do? It sets how strict the outlier rule is. coef = 1.5 (the default) is the standard boxplot rule and flags more points. coef = 3 is stricter and only flags extreme outliers. In this course, we use coef = 3.

Spotting outliers visually: boxplots and histograms

Boxplots display outliers as individual points beyond the whiskers. Histograms reveal extreme values sitting far from the rest of the distribution. See Section 17 (Visualization Gallery) for runnable examples of both plots.

boxplot(mtcars$mpg, main = "MPG Boxplot")                  # base R boxplot
hist(mtcars$mpg, main = "MPG Histogram", breaks = 10)      # base R histogram

Removing or filtering outliers

Once you have identified outliers, you can use filter() (see Section 7) to create a clean dataset without them.

library(tidyverse)
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out
clean_data <- filter(mtcars, !(mpg %in% outlier_values))

When should you remove outliers? There is no single rule. Remove data points only when you have a substantive reason (e.g., data entry error, participant who did not follow instructions). Always report whether outliers were removed and how many.

6. Descriptive Statistics

`mean()`, `sd()`, `var()`, `median()`

Core descriptive statistics. Use na.rm = TRUE if there are missing values.

mean(mtcars$mpg)
sd(mtcars$mpg)
var(mtcars$mpg)
median(mtcars$mpg)

`range()`, `quantile()`, `IQR()`

Distribution measures useful for outlier detection and data description.

range(mtcars$mpg)
quantile(mtcars$mpg)                     # 0%, 25%, 50%, 75%, 100%
quantile(mtcars$mpg, probs = 0.90)       # 90th percentile
IQR(mtcars$mpg)                          # interquartile range (Q3 - Q1)

`table()`

Counts the number of observations per group. Works with one or two grouping variables.

table(ToothGrowth$supp)                         # one factor
table(ToothGrowth$supp, ToothGrowth$dose)        # two factors (contingency table)

`tapply()`

Applies a function (like mean or sd) separately to each group. Use list() for two grouping variables. The syntax is: tapply(outcome_variable, grouping_variable, function).

# One grouping variable
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
tapply(ToothGrowth$len, ToothGrowth$supp, sd)

# Two grouping variables (for factorial designs)
tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)

7. Data Manipulation

`filter()` (dplyr)

Keeps only the rows that meet a condition.

library(tidyverse)
filter(mtcars, mpg > 25)
filter(ToothGrowth, supp == "OJ")

`mutate()` and `ifelse()`

mutate() creates a new variable (column) or modifies an existing one. You can use it with arithmetic, functions, or ifelse() to categorize values based on a condition.

library(tidyverse)

# Create a new column that categorizes cars as heavy or light
mtcars <- mutate(mtcars, heavy = ifelse(wt > 3, "Yes", "No"))

# Create a new column by doing arithmetic on existing columns
mtcars <- mutate(mtcars, hp_per_cyl = hp / cyl)

# Create a new column using a function
mtcars <- mutate(mtcars, mpg_z = as.numeric(scale(mpg)))

# Create a composite score as the mean of several columns
# (see also rowMeans() below for an alternative approach)
survey <- data.frame(q1 = c(4, 5, 3), q2 = c(3, 4, 5), q3 = c(5, 5, 4))
survey <- mutate(survey, composite = (q1 + q2 + q3) / 3)

# Recode a variable into multiple categories with nested ifelse()
mtcars <- mutate(mtcars, mpg_group = ifelse(mpg > 25, "High",
                                     ifelse(mpg > 15, "Medium", "Low")))

Pipe operator `%>%`

The pipe operator %>% comes from the dplyr package (loaded with tidyverse). It chains operations left-to-right so you can read code as “take this, then do this, then do this.” Keyboard shortcut: Ctrl + Shift + M (Cmd + Shift + M on Mac).

mtcars %>%
  filter(cyl == 4) %>%
  summarize(avg_mpg = mean(mpg))

`scale()`

Standardizes a variable to z-scores (mean = 0, SD = 1).

mtcars$mpg_z <- as.numeric(scale(mtcars$mpg))   # saves z-scores as a new column

`factor()`

Converts a numeric variable to a categorical factor. Required before running ANOVA on numeric grouping variables.

ToothGrowth$dose <- factor(ToothGrowth$dose)

`which()`, `is.na()`, `!`

which() returns row numbers where a condition is true. is.na() tests for missing values. ! negates.

which(mtcars$mpg > 30)   # which rows have mpg over 30?
is.na(df$score)           # TRUE where values are missing
!is.na(df$score)          # TRUE where values are NOT missing

Computing composite scores: `rowMeans()`

In psychology research, you often need to average several questionnaire items into a single composite score for each participant. rowMeans() calculates the mean across columns for each row. Use na.rm = TRUE to handle missing values.

# Suppose your data frame has columns q1 through q5 for a 5-item scale
# and you want to create a composite score for each participant

# Method 1: select columns by position (columns 2 through 6)
df$composite <- rowMeans(df[, 2:6], na.rm = TRUE)

# Method 2: select columns by name
df$composite <- rowMeans(df[, c("q1", "q2", "q3", "q4", "q5")], na.rm = TRUE)

# Method 3: use paste0() shorthand when columns are numbered sequentially
df$composite <- rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE)

# Method 4: use mutate() with rowMeans() for a tidyverse approach
df <- mutate(df, composite = rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE))

The resulting composite column contains each participant’s average across those items. You can then use this composite variable as an outcome or predictor in your analyses.

Handling missing values: `na.rm = TRUE`

Use na.rm = TRUE if there are missing values. Many R functions (e.g., mean(), sd(), sum(), rowMeans()) will return NA if any input value is missing – unless you tell them to ignore missing values with na.rm = TRUE.

mean(df$score, na.rm = TRUE)         # mean ignoring NAs
sd(df$score, na.rm = TRUE)           # standard deviation ignoring NAs
rowMeans(df[, 2:6], na.rm = TRUE)    # row means ignoring NAs

Converting scientific notation to regular numbers: `format()` and `options()`

R sometimes displays very small p-values or very large numbers in scientific notation (e.g., 2.5e-08 instead of 0.000000025). There are two ways to see the full number.

# Convert a single number to regular notation
format(2.5e-08, scientific = FALSE)

# Turn off scientific notation globally for the rest of your session
options(scipen = 999)

# To turn it back on later
options(scipen = 0)

scipen = 999 tells R to strongly prefer regular notation. This is especially helpful when reading p-values from statistical output.

Wide to long format: `pivot_longer()` (tidyr)

Within-subjects (repeated-measures) ANOVA requires data in long format, where each row is one observation. If your data is in wide format (one row per participant, multiple columns for each condition), convert it with pivot_longer().

library(tidyverse)

# Wide format: columns time1, time2, time3 for each participant
wide_data <- data.frame(
  id = 1:5,
  time1 = c(30, 28, 35, 32, 29),
  time2 = c(25, 22, 28, 27, 24),
  time3 = c(20, 18, 22, 21, 19)
)

# Convert to long format
long_data <- pivot_longer(wide_data,
                          cols = time1:time3,
                          names_to = "time",
                          values_to = "score")
long_data$id <- factor(long_data$id)

In long format, the dataset goes from 5 rows (one per person) to 15 rows (one per person per time point), with columns for id, time, and score.

8. T-Tests

T-tests compare means between two conditions. Before running a t-test, check the assumptions of normality (Shapiro-Wilk test or histogram) and equal variances (Levene’s test for independent samples). Recommended visuals: bar graph of group means with error bars, or boxplot.

One-sample t-test

Tests whether a sample mean differs from a known value.

t.test(faithful$waiting, mu = 60)

Independent samples t-test

Compares means of two independent groups. Use var.equal = TRUE for the standard t-test or var.equal = FALSE for Welch’s t-test (when variances are unequal).

# Standard t-test (assumes equal variances)
t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

# Welch's t-test (does not assume equal variances)
t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

Paired samples t-test

Compares means from the same participants measured twice. Must use the two-vector method (the formula interface does not support paired = TRUE).

t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2],
       paired = TRUE)

Effect size: Cohen’s d

Use the effsize package. Thresholds: 0.2 = small, 0.5 = medium, 0.8 = large. Report d numerically always; only describe the magnitude verbally (e.g., “a medium effect”) when the t-test is significant.

library(effsize)

# Independent samples
cohen.d(len ~ supp, data = ToothGrowth)

# Paired samples
cohen.d(sleep$extra[sleep$group == 1],
        sleep$extra[sleep$group == 2],
        paired = TRUE)

APA write-up template

[Group 1] (M = ##.##, SD = ##.##) [was/were] [significantly/not significantly] [higher/lower/different] compared to [Group 2] (M = ##.##, SD = ##.##), t(df) = ##.##, p = .###, d = ##.##.

Note: The APA template requires group means and standard deviations. Use tapply() from Section 6 to compute them by group – for example, tapply(ToothGrowth$len, ToothGrowth$supp, mean) and tapply(ToothGrowth$len, ToothGrowth$supp, sd).

9. One-Way ANOVA

One-way ANOVA tests whether three or more group means differ. Before running, check normality of residuals (Shapiro-Wilk) and equality of variances (Levene’s test). If variances are unequal, use Welch’s ANOVA. Recommended visuals: boxplot by group, or bar graph with error bars.

Important: If your grouping variable is numeric (e.g., dose levels like 0.5, 1, 2), you must convert it to a factor before running the ANOVA. See factor() in Section 7 (Data Manipulation).

Between-subjects one-way ANOVA

`aov()` + `summary()`

Save the model for use with summary(), TukeyHSD(), and residuals().

plant_model <- aov(weight ~ group, data = PlantGrowth)
summary(plant_model)

Welch’s ANOVA: `oneway.test()`

Robust alternative when the equal variances assumption is violated.

oneway.test(weight ~ group, data = PlantGrowth, var.equal = FALSE)

Post-hoc: `TukeyHSD()` and Bonferroni

Only run post-hoc tests when the overall ANOVA is significant. TukeyHSD is the standard; Bonferroni is a more conservative alternative.

TukeyHSD(plant_model)
plot(TukeyHSD(plant_model))   # CI plot

# Bonferroni alternative
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "bonferroni")

Effect size: eta-squared

Use the effectsize package. Thresholds: .01 = small, .06 = medium, .14 = large. For one-way ANOVA, eta-squared and partial eta-squared are identical.

library(effectsize)
eta_squared(plant_model)                  # eta-squared
eta_squared(plant_model, partial = TRUE)  # partial eta-squared

Pairwise Cohen’s d

For significant pairwise comparisons, compute Cohen’s d to report effect sizes.

library(effsize)
cohen.d(PlantGrowth$weight[PlantGrowth$group == "trt1"],
        PlantGrowth$weight[PlantGrowth$group == "trt2"])

APA write-up template

A one-way ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, eta-squared = .##. Tukey-corrected post-hoc comparisons revealed that [Group A] was significantly [higher/lower] than [Group B] (p = .###, d = ##.##), while [other comparisons].

Note: The APA template requires group means and standard deviations. Use tapply() from Section 6 to compute them by group – for example, tapply(PlantGrowth$weight, PlantGrowth$group, mean) and tapply(PlantGrowth$weight, PlantGrowth$group, sd).

Within-subjects (repeated-measures) one-way ANOVA

When the same participants are measured under all conditions, use a repeated-measures ANOVA. Data must be in long format (one row per observation; see Section 7 for pivot_longer()). The afex package provides aov_ez() which handles within-subjects designs and automatically applies the Greenhouse-Geisser sphericity correction when needed.

`aov_ez()` (afex package)

Arguments: id = participant ID column (as a string), dv = outcome variable (as a string), data = data frame, within = within-subjects factor(s) (as a string vector).

library(afex)
anxiety_model <- aov_ez(id = "id", dv = "anxiety",
                        data = anxiety_data, within = c("time"))
anxiety_model

Post-hoc: `emmeans()`

The emmeans package provides pairwise comparisons for afex models. Use adjust = "tukey" (default) or adjust = "bonferroni".

library(emmeans)
emmeans(anxiety_model, pairwise ~ time)                        # Tukey (default)
emmeans(anxiety_model, pairwise ~ time, adjust = "bonferroni") # Bonferroni

Effect size: generalized eta-squared

aov_ez() reports generalized eta-squared (ges) in its output. Interpretation thresholds are the same: .01 = small, .06 = medium, .14 = large.

APA write-up template

A repeated-measures ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, generalized eta-squared = .##. Tukey-corrected pairwise comparisons revealed that [condition A] was significantly [higher/lower] than [condition B] (p = .###).

Note: The APA template requires condition means and standard deviations. Use tapply() from Section 6 to compute them – for example, tapply(anxiety_data$anxiety, anxiety_data$time, mean) and tapply(anxiety_data$anxiety, anxiety_data$time, sd).

10. Two-Way / Factorial ANOVA

Factorial ANOVA tests the effects of two (or more) factors and their interaction. Before running, check normality of residuals and equality of variances. Recommended visuals: interaction plot (line graph) and grouped bar graph with error bars.

Important: If any grouping variable is numeric (e.g., dose levels like 0.5, 1, 2), you must convert it to a factor before running the ANOVA. See factor() in Section 7 (Data Manipulation).

Between-subjects factorial ANOVA

`aov()` with `*` for interaction

The * operator includes both main effects and the interaction. Produces three F-tests.

# Convert dose to factor first (required for numeric grouping variables)
ToothGrowth$dose <- factor(ToothGrowth$dose)

tooth_model <- aov(len ~ supp * dose, data = ToothGrowth)
summary(tooth_model)

Post-hoc: `TukeyHSD()` on the full model

Returns three sections: $supp (main effect of supp), $dose (main effect of dose), and $supp:dose (all cell-to-cell comparisons for the interaction).

TukeyHSD(tooth_model)

Effect size: partial eta-squared

In a factorial design, always use partial eta-squared because it accounts for the other factors.

library(effectsize)
eta_squared(tooth_model, partial = TRUE)

APA write-up template

A 2 x 3 between-subjects ANOVA examined the effects of [factor A] and [factor B] on [outcome]. There was a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. There was a significant main effect of [factor B], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. The [factor A] x [factor B] interaction was [significant/non-significant], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##.

When a significant interaction is present, focus the write-up on the interaction pattern (e.g., the effect of dose depended on supplement type). Main effects are less informative when qualified by a significant interaction.

Within-subjects factorial ANOVA

When the same participants experience all combinations of two factors, use aov_ez() with both factors listed in within.

library(afex)
# 2 (task) x 3 (difficulty) within-subjects design
cog_model <- aov_ez(id = "id", dv = "rt",
                    data = cognitive_data,
                    within = c("task", "difficulty"))
cog_model

Post-hoc: simple effects with `emmeans()`

Use | to run pairwise comparisons of one factor separately at each level of the other (simple effects). This is how you decompose a significant interaction.

library(emmeans)
emmeans(cog_model, pairwise ~ difficulty | task)                        # Tukey
emmeans(cog_model, pairwise ~ difficulty | task, adjust = "bonferroni") # Bonferroni

Mixed ANOVA (between + within)

A mixed design has at least one between-subjects factor and at least one within-subjects factor. Use aov_ez() with both the within and between arguments.

# group is between-subjects, time is within-subjects
therapy_model <- aov_ez(id = "id", dv = "depression",
                        data = therapy_data,
                        within = c("time"), between = c("group"))
therapy_model

Post-hoc for mixed designs

emmeans(therapy_model, pairwise ~ time | group)                        # Tukey
emmeans(therapy_model, pairwise ~ time | group, adjust = "bonferroni") # Bonferroni

APA write-up templates (within-subjects and mixed)

Within-subjects factorial: A 2 x 3 within-subjects ANOVA revealed a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, ges = .##. The [factor A] x [factor B] interaction was significant, F(df1, df2) = ##.##, p = .###, ges = .##. Tukey-corrected simple effects showed that [describe pattern].

Mixed ANOVA: A mixed ANOVA with [between factor] as the between-subjects factor and [within factor] as the within-subjects factor revealed a significant interaction, F(df1, df2) = ##.##, p = .###, ges = .##. Bonferroni-corrected post-hoc comparisons showed that [the therapy group improved significantly across time points while the control group did not].

Note: APA write-ups for factorial and mixed ANOVAs require cell means and standard deviations for each condition combination. Use tapply() with list() from Section 6 to compute them – for example, tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean).

11. Correlation

Correlation measures the strength and direction of the linear relationship between two continuous variables. Before interpreting, check for linearity (scatterplot) and outliers. Recommended visual: scatterplot with regression line.

`cor.test()` – Pearson and Spearman

Pearson r is the default; use Spearman rho for non-normal or ordinal data. Correlation strength thresholds: .10 = weak, .30 = moderate, .50 = strong.

# Pearson correlation (default)
cor.test(mtcars$wt, mtcars$mpg)

# Spearman rank correlation
cor.test(mtcars$wt, mtcars$mpg, method = "spearman")

Correlation matrix: `cor()`

Computes correlations for multiple variables at once. Subset the data frame to include only the variables you want.

cor(mtcars[, c("mpg", "wt", "hp", "disp")])

Correlation heatmap: `corrplot()`

Creates a color-coded heatmap of a correlation matrix. Save the matrix as an object first.

library(corrplot)
car_cors <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
corrplot(car_cors, method = "color", addCoef.col = "black")

APA write-up template

There was a [strong/moderate/weak] [positive/negative] correlation between [variable 1] and [variable 2], r(df) = .##, p = .###.

The effect size for correlation is the r value itself – no separate calculation needed.

12. Regression

Regression predicts an outcome from one or more predictors. Before interpreting, check residuals for normality (Shapiro-Wilk on residuals) and homoscedasticity (residuals vs. fitted plot). Recommended visuals: scatterplot with regression line for simple regression, residuals vs. fitted plot for assumption checking.

Simple linear regression: `lm()`

Fits a model predicting one outcome from one predictor. The equation is Y = b0 + b1*X.

model <- lm(mpg ~ wt, data = mtcars)
summary(model)

Key output: Intercept (b0) = predicted Y when X = 0. Slope (b1) = change in Y for each one-unit increase in X. Multiple R-squared = proportion of variance explained. F-statistic and p-value = overall model significance.

Multiple regression: `lm()` with `+`

Predicts an outcome from two or more predictors entered simultaneously. Each coefficient represents the effect of that predictor while controlling for the others.

full_model <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(full_model)

Hierarchical regression: `anova()` to compare models

Tests whether adding predictors significantly improves the model. A significant p-value means the fuller model explains significantly more variance.

model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(model1, model2)

Effect size: R-squared

R-squared is reported directly in the summary() output. Thresholds: .02 = small, .13 = medium, .26 = large.

Assumption checking for regression

model <- lm(mpg ~ wt, data = mtcars)
plot(model, which = 1)   # residuals vs fitted  --  look for random scatter
shapiro.test(residuals(model))   # normality of residuals

APA write-up templates

Simple regression: A simple linear regression showed that [predictor] significantly predicted [outcome], F(1, df) = ##.##, p = .###, R-squared = .##. For every one-unit increase in [predictor], [outcome] [increased/decreased] by ##.## units (b = ##.##, t = ##.##, p = .###).

Multiple regression: A multiple regression model significantly predicted [outcome], F(df1, df2) = ##.##, p = .###, R-squared = .##. [Predictor 1] (b = ##.##, p = .###) and [Predictor 2] (b = ##.##, p = .###) were significant predictors, while [Predictor 3] (b = ##.##, p = .###) was not significant.

Hierarchical regression: Adding [predictors] to the model significantly improved prediction, F-change(df1, df2) = ##.##, p = .###, R-squared change = .##.

13. Mediation Analysis

Mediation tests whether an IV affects a DV through an intermediate variable (the mediator). The indirect effect (a * b) represents the portion of the IV-DV relationship that goes through the mediator. Use the lavaan package and bootstrap confidence intervals for the indirect effect. No specific assumption checks beyond those for regression (normality, linearity); recommended visuals: scatterplots of each path.

Defining the lavaan model

The model is a text string using ~ for regressions, * for labeling paths, and := for defining computed parameters.

library(lavaan)

med_model <- '
  self_esteem ~ a*social_support
  depression ~ b*self_esteem + cp*social_support
  indirect := a*b
  total := cp + a*b
'

Line by line: Line 1 = a path (IV → mediator). Line 2 = b path (mediator → DV) and c’ path (direct effect, IV → DV controlling for mediator). Line 3 = indirect effect defined as a * b. Line 4 = total effect defined as c’ + indirect.

Fitting the model with `sem()`

med_fit <- sem(med_model, data = mediation_data)
summary(med_fit)

The Regressions section shows a, b, and c’ paths. The Defined Parameters section shows the indirect and total effects.

Bootstrap confidence intervals

Bootstrap CIs are the gold-standard test for the indirect effect because the sampling distribution of a * b is often not normal.

med_boot <- sem(med_model, data = mediation_data,
                se = "bootstrap", bootstrap = 100)
parameterEstimates(med_boot)

If the bootstrap CI for indirect does not include zero, mediation is supported.

Full vs. partial mediation

Full mediation: the indirect effect is significant but the direct effect (c’) is not – the entire IV-DV relationship goes through the mediator. Partial mediation: both the indirect effect and the direct effect are significant.

APA write-up template

A mediation analysis using lavaan tested whether [mediator] mediated the effect of [IV] on [DV]. The indirect effect was significant (b = ##.##, 95% bootstrap CI [##.##, ##.##]), supporting [full/partial] mediation. The direct effect of [IV] on [DV] was [significant/not significant] (b = ##.##, p = .###).

Note: When writing up mediation results, you may want to report means and standard deviations for each variable. Use mean() and sd() from Section 6 – for example, mean(mediation_data$social_support) and sd(mediation_data$social_support).

14. Chi-Square Tests

Chi-square tests analyze categorical (nominal) data – variables whose values represent categories rather than quantities. No distributional assumptions are needed, but expected cell frequencies should be at least 5 for the chi-square approximation to be valid (use Fisher’s exact test if they are not). Recommended visual: bar chart of observed frequencies.

Goodness-of-fit test

Tests whether the distribution of one categorical variable matches expected proportions. By default, chisq.test() assumes equal expected proportions.

# Count frequencies
table(survey_data$preference)

# Test against equal distribution
chisq.test(table(survey_data$preference))

# Test against specific expected proportions
chisq.test(table(survey_data$preference), p = c(0.40, 0.30, 0.20, 0.10))

Degrees of freedom = number of categories minus 1.

Test of independence

Tests whether two categorical variables are related. First create a contingency table with table(), then run chisq.test().

am_cyl_table <- table(mtcars$am, mtcars$cyl)
am_cyl_table
chisq.test(am_cyl_table)

If R warns that “Chi-squared approximation may be incorrect,” some expected cell frequencies are below 5. Use Fisher’s exact test instead.

Fisher’s exact test

Use when expected cell frequencies are too low for the chi-square approximation.

fisher.test(am_cyl_table)

Effect size: Cramer’s V

Computed manually from the chi-square result. Thresholds: .10 = small, .30 = medium, .50 = large.

chi_result <- chisq.test(am_cyl_table)

# V = sqrt(X2 / (N * (k - 1))), where k = smaller table dimension
sqrt(chi_result$statistic / (sum(am_cyl_table) * (min(dim(am_cyl_table)) - 1)))

APA write-up templates

Goodness-of-fit: A chi-square goodness-of-fit test indicated that therapy preferences were not equally distributed, X²(df) = ##.##, p = .###.

Independence: A chi-square test of independence showed a significant association between [variable 1] and [variable 2], X²(df, N = ##) = ##.##, p = .###, Cramer’s V = .##. [If applicable: Fisher’s exact test confirmed the result, p = .###.]

15. Power Analysis

Power analysis helps you determine the sample size needed to detect an effect, or the power of an existing study. Run these before data collection (a priori) whenever possible.

`power.t.test()`

For t-tests. Leave one argument as NULL to solve for it.

# Calculate power given n, effect size (delta), and sd
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05, type = "two.sample")

# Calculate required n for 80% power
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80, type = "two.sample")

`pwr.anova.test()` (pwr package)

For one-way ANOVA. Uses Cohen’s f for effect size (0.10 = small, 0.25 = medium, 0.40 = large).

library(pwr)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)

16. Assumption Checking

`shapiro.test()` – Normality

Shapiro-Wilk test. If p > .05, normality is not violated. For ANOVA and regression, test the residuals rather than the raw data.

# On a variable directly
shapiro.test(mtcars$mpg)

# On residuals from a model
model <- aov(len ~ supp, data = ToothGrowth)
shapiro.test(residuals(model))

If normality is violated, check sample size: with n >= 30 per group, the test is robust due to the Central Limit Theorem.

`leveneTest()` (car package) – Equal Variances

If p > .05, the equal variances assumption is met.

library(car)
leveneTest(len ~ supp, data = ToothGrowth)
leveneTest(len ~ supp * dose, data = ToothGrowth)   # factorial design

If variances are unequal, use Welch’s t-test (var.equal = FALSE) or Welch’s ANOVA (oneway.test()).

Residual diagnostic plot – Linearity and Homoscedasticity

For regression models. Look for random scatter; a funnel or curve indicates a violation.

model <- lm(mpg ~ wt, data = mtcars)
plot(model, which = 1)   # residuals vs fitted values

Quick reference: which assumptions for which test

Test	Normality	Equal Variances	Other
One-sample t-test	Yes (or n >= 30)	N/A	–
Independent t-test	Yes (or n >= 30)	Yes (Levene’s)	–
Paired t-test	Yes (of differences)	N/A	–
One-way ANOVA	Yes (residuals)	Yes (Levene’s)	–
Two-way ANOVA	Yes (residuals)	Yes (Levene’s)	–
Repeated-measures ANOVA	–	–	Sphericity (auto-corrected by afex)
Correlation	Normality for Pearson	–	Linearity (scatterplot)
Regression	Yes (residuals)	Homoscedasticity (residual plot)	Linearity
Chi-square	N/A	N/A	Expected freq >= 5

17. Data Visualization Gallery

This section shows runnable example plots for common analysis types. Each plot will render when you knit this document so you can see what they look like.

A note on colors: All examples below use named R colors (like "steelblue", "coral", "darkgreen") instead of hex codes. R recognizes 657 built-in color names. To see the full list, run colors() in the console. For a visual reference with color swatches, visit: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf

Reminder – copy, paste, modify: To use any plot below with your own data, copy the code and replace the data frame name, variable names, and text in quotes with your own. Keep the function names, + signs, and argument names (like x =, fill =) exactly as shown. See “How to use code examples from this guide” at the top of this document for details.

How ggplot works: the layered grammar of graphics

Every ggplot is built by adding layers together with +. The basic structure is:

ggplot(data, aes(x = variable1, y = variable2)) +
  geom_something() +
  labs() +
  theme_something()

Here is what each piece does:

ggplot(data, aes(...)) – This is the foundation layer. It tells R which data frame to use and which variables go on the x-axis and y-axis. The aes() function (short for “aesthetics”) maps variables to visual properties. Common aesthetics include x, y, fill (color inside bars/boxes), color (color of lines/points/outlines), and group.

geom_...() – The geometry layer controls what type of plot you see. Examples: geom_bar() for bar charts, geom_boxplot() for boxplots, geom_point() for scatterplots, geom_histogram() for histograms, geom_line() for line graphs. You can add multiple geom layers to the same plot.

stat_summary() – An alternative to geom layers that computes a summary statistic (like the mean) before plotting. Use fun = mean with geom = "bar" to make a bar graph of means, or fun.data = mean_cl_normal with geom = "errorbar" for 95% CI error bars.

labs() – Sets the plot title, subtitle, axis labels, and legend title. Example: labs(title = "My Plot", x = "Group", y = "Score").

scale_...() – Controls how aesthetics are mapped to visual values. Use scale_fill_manual(values = c(...)) to set bar/box colors, scale_color_manual(values = c(...)) to set line/point colors, or scale_y_continuous() / scale_x_continuous() to adjust axis limits and tick marks (see below).

theme_...() – Sets the overall visual style. theme_minimal() is clean with light gridlines; theme_classic() has only axis lines (like a textbook figure); theme_bw() is black-and-white with a border.

theme() – Fine-tune individual elements like font sizes, legend position, and text alignment (see the Customization section at the end of this gallery).

Quick reference: common customization layers

What you want to change	Code
Bar/box colors	`scale_fill_manual(values = c(...))`
Line/point colors	`scale_color_manual(values = c(...))`
Axis labels and title	`labs(title = "...", x = "...", y = "...")`
Overall look	`theme_minimal()`, `theme_classic()`, etc.
Font size (title)	`theme(plot.title = element_text(size = 16))`
Font size (axes)	`theme(axis.title = element_text(size = 13))`
Bold title	`theme(plot.title = element_text(face = "bold"))`
Center title	`theme(plot.title = element_text(hjust = 0.5))`
Remove legend	`theme(legend.position = "none")`
Move legend to bottom	`theme(legend.position = "bottom")`

Histogram

Quick check of distribution shape. Use before t-tests and ANOVA to eyeball normality.

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(title = "Distribution of MPG", x = "Miles per Gallon", y = "Count") +
  theme_minimal()

Code breakdown: aes(x = mpg) puts mpg on the x-axis (histograms only need an x variable). geom_histogram() creates the bars; binwidth = 2 means each bar spans 2 units on the x-axis. fill sets the inside color of the bars and color sets the border color. labs() adds a title and axis labels. theme_minimal() gives a clean look.

Boxplot by group

Shows median, quartiles, and outliers. Good for comparing groups before t-tests or ANOVA.

ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  geom_boxplot() +
  labs(title = "Tooth Length by Supplement", x = "Supplement", y = "Tooth Length") +
  theme_minimal()

Code breakdown: aes(x = supp, y = len, fill = supp) maps the grouping variable (supp) to both the x-axis and the fill color. geom_boxplot() draws the boxes. The thick line is the median, the box edges are Q1 and Q3, whiskers extend to the most extreme points within 1.5 * IQR, and dots beyond the whiskers are outliers.

Bar graph with error bars

Standard visual for reporting group means. Use mean_cl_normal for 95% CI error bars or mean_se for standard error bars.

ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  labs(title = "Mean Tooth Length by Supplement (95% CI)",
       x = "Supplement", y = "Mean Tooth Length") +
  theme_minimal()

Code breakdown: Instead of geom_bar(), we use stat_summary(fun = mean, geom = "bar") to compute the mean of len for each level of supp and draw it as a bar. The second stat_summary() layer computes 95% confidence intervals (mean_cl_normal) and draws them as error bars. width = 0.2 controls how wide the error bar caps are.

Grouped bar graph (factorial designs)

Side-by-side bars for two-way ANOVA designs. Use position_dodge() to place bars next to each other rather than stacked.

ToothGrowth$dose <- factor(ToothGrowth$dose)

ggplot(ToothGrowth, aes(x = dose, y = len, fill = supp)) +
  stat_summary(fun = mean, geom = "bar", position = position_dodge(0.9)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               position = position_dodge(0.9), width = 0.2) +
  labs(title = "Tooth Growth by Supplement and Dose",
       x = "Dose (mg/day)", y = "Mean Tooth Length", fill = "Supplement") +
  theme_minimal()

Code breakdown: aes(x = dose, ..., fill = supp) puts dose on the x-axis and uses fill color to distinguish supplement types. position_dodge(0.9) places the bars side by side (both the bars and error bars need the same dodge value to line up). fill = "Supplement" inside labs() changes the legend title.

Interaction plot (line graph)

Shows whether the effect of one factor depends on another. Non-parallel lines suggest an interaction. Use for factorial and mixed ANOVA designs.

ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1) +
  labs(title = "Interaction Plot - Supplement by Dose",
       x = "Dose", y = "Mean Tooth Length", color = "Supplement") +
  theme_minimal()

Code breakdown: color = supp maps supplement type to the line/point color (ggplot picks colors automatically). group = supp tells ggplot to draw separate lines for each supplement. We add three stat_summary() layers: points for the group means, lines connecting them, and error bars. The color = "Supplement" inside labs() renames the color legend.

Line plot for repeated measures (single factor)

For within-subjects designs with one factor, use group = 1 to connect all points with a single line (since there is only one condition trajectory to show).

ggplot(rep_data, aes(x = time, y = anxiety, group = 1)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1) +
  labs(title = "Anxiety Across Time Points",
       x = "Time Point", y = "Mean Anxiety Score") +
  theme_minimal()

Code breakdown: group = 1 tells ggplot that all time points belong to a single group and should be connected by one line. Without this, ggplot would treat each time point as its own group and would not draw a connecting line. The rest follows the same pattern: point, line, and error bar layers.

Scatterplot with regression line

Standard visual for correlation and simple regression. geom_smooth(method = "lm") adds a linear best-fit line with a 95% CI band.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", color = "steelblue") +
  labs(title = "MPG by Vehicle Weight",
       x = "Weight (1000 lbs)", y = "Miles per Gallon") +
  theme_minimal()

Code breakdown: geom_point() adds the individual data points. geom_smooth(method = "lm") fits a linear regression line and draws it with a shaded 95% confidence band around it. color = "steelblue" changes the line color. To remove the confidence band, add se = FALSE inside geom_smooth().

Correlation heatmap

Color-coded matrix showing correlations among multiple variables. Numbers on the cells show the r values. This uses the corrplot package rather than ggplot.

library(corrplot)
car_cors <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
corrplot(car_cors, method = "color", addCoef.col = "black")

Code breakdown: First, cor() computes the correlation matrix for the selected variables. Then corrplot() visualizes it. method = "color" fills each cell with a color (blue for positive, red for negative). addCoef.col = "black" prints the r values on each cell. You can further customize with arguments like tl.col (label color) and tl.srt (label rotation angle).

Bar chart for categorical data (chi-square)

Use for visualizing frequency distributions before running chi-square tests.

ggplot(data.frame(cyl = factor(mtcars$cyl)), aes(x = cyl, fill = cyl)) +
  geom_bar() +
  labs(title = "Count of Cars by Cylinder", x = "Cylinders", y = "Count") +
  scale_fill_manual(values = c("4" = "skyblue", "6" = "coral", "8" = "darkgreen")) +
  theme_minimal() +
  theme(legend.position = "none")

Code breakdown: geom_bar() counts the number of observations in each category and draws a bar for each (no need to compute counts yourself). fill = cyl inside aes() colors each bar by group. theme(legend.position = "none") removes the legend since the x-axis labels already show the group names. scale_fill_manual() sets specific named colors for each bar.

Significance bracket with `annotate()`

Adds a bracket and asterisk to mark significant pairwise comparisons on a bar graph. You need to choose y-axis values for the bracket that sit above the error bars without overlapping them.

ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  annotate("segment", x = 2, xend = 3, y = 6.2, yend = 6.2) +
  annotate("text", x = 2.5, y = 6.4, label = "*", size = 8) +
  labs(title = "Plant Weight by Treatment Group",
       x = "Group", y = "Mean Weight") +
  theme_minimal() +
  theme(legend.position = "none")

Code breakdown: Two annotate() layers create the bracket. The first draws a horizontal line segment from x = 2 to xend = 3 (the second and third bars) at y = 6.2 – high enough to clear the error bars. The second places an asterisk centered between the two bars at a slightly higher y value (y = 6.4). Adjust these y values based on your data. If the bracket overlaps your error bars, increase the y values.

Customization: colors, fonts, themes

You can customize nearly every visual element of a ggplot. Below are the most common adjustments.

Manual fill and color: `scale_fill_manual()` / `scale_color_manual()`

Override default colors with named R colors. Use fill for bars/boxes and color for lines/points. To see all available color names, run colors() in the console, or visit the visual reference at https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf.

ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  scale_fill_manual(values = c("OJ" = "coral", "VC" = "steelblue")) +
  labs(title = "Custom Named Colors", x = "Supplement", y = "Mean Length") +
  theme_minimal()

Themes

Themes control the overall look of the plot (background, gridlines, axis lines). Add a theme as the last layer of your plot. Here are the most common options:

theme_minimal() – Clean and modern. Light gray gridlines, no background box, no axis lines. Recommended for most assignments – it keeps the focus on the data without looking cluttered.

theme_classic() – Only x and y axis lines, no gridlines, no background. Looks like a textbook or journal figure. Recommended for APA-style figures since APA discourages gridlines.

theme_bw() – Black border around the plot with white background and gridlines. A good middle ground if you want the structure of gridlines but a more formal look.

theme_light() – Similar to theme_bw() but with lighter gray gridlines and border. Slightly softer look.

theme_gray() – The ggplot default (gray background with white gridlines). Generally not recommended for reports since it prints poorly in black-and-white.

library(patchwork)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkred", size = 2) +
  labs(x = "Weight", y = "MPG")

p1 <- base + labs(title = "theme_minimal()") + theme_minimal()
p2 <- base + labs(title = "theme_classic()") + theme_classic()
p3 <- base + labs(title = "theme_bw()") + theme_bw()
p4 <- base + labs(title = "theme_light()") + theme_light()

(p1 | p2) / (p3 | p4)

Adjusting text: axis labels, title, and font size

Use theme() to control individual text elements. Common arguments: size, face (“bold”, “italic”), hjust (horizontal justification, 0 = left, 0.5 = center, 1 = right).

ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  labs(title = "Customized Text Elements",
       x = "Treatment Group", y = "Mean Weight") +
  theme_minimal() +
  theme(
    plot.title   = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title   = element_text(size = 13),
    axis.text    = element_text(size = 11),
    legend.position = "none"
  )

Appendix: Complete Function Reference

An alphabetical listing of every function mentioned in this guide, with a brief description and the section where it is covered.

Function	Description	Section
`?`	Open help page for a function or dataset	4
`annotate()`	Add text, segments, or labels to a ggplot	17
`anova()`	Compare two nested regression models (hierarchical regression)	12
`aov()`	Fit a between-subjects ANOVA model	9, 10
`aov_ez()`	Fit repeated-measures or mixed ANOVA (afex package)	9, 10
`as.numeric()`	Convert an object to numeric (e.g., remove scale attributes)	7
`boxplot()`	Base R boxplot (quick outlier check)	5
`boxplot.stats()`	Identify outlier values using IQR method; use `$out` to extract them	5
`c()`	Combine values into a vector	1
`cat()`	Print text to the console (used for lavaan model strings)	13
`chisq.test()`	Chi-square goodness-of-fit or test of independence	14
`cohen.d()`	Compute Cohen’s d effect size (effsize package)	8, 9
`colSums()`	Sum each column (used with `is.na()` to count missing values)	5
`colors()`	List all 657 built-in R color names	17
`cor()`	Compute a correlation matrix	11
`cor.test()`	Test a single correlation (Pearson or Spearman)	11
`coord_cartesian()`	Zoom into a plot without removing data	17
`corrplot()`	Draw a correlation heatmap (corrplot package)	11, 17
`data.frame()`	Create a data frame from vectors	1
`dim()`	Return the number of rows and columns	4
`emmeans()`	Pairwise post-hoc comparisons for afex models (emmeans package)	9, 10
`eta_squared()`	Compute eta-squared or partial eta-squared (effectsize package)	9, 10
`factor()`	Convert a variable to a categorical factor	7
`filter()`	Keep rows that meet a condition (dplyr)	7
`fisher.test()`	Fisher’s exact test (when expected cell frequencies < 5)	14
`format()`	Convert scientific notation to regular numbers	7
`geom_bar()`	Bar chart geometry (counts observations per category)	17
`geom_boxplot()`	Boxplot geometry	17
`geom_histogram()`	Histogram geometry	17
`geom_line()`	Line geometry	17
`geom_point()`	Scatterplot point geometry	17
`geom_smooth()`	Add a fitted line (e.g., regression) with CI band	17
`getwd()`	Print the current working directory	2
`ggplot()`	Initialize a ggplot with data and aesthetic mappings	17
`head()`	Display the first 6 rows of a data frame	4
`hist()`	Base R histogram (quick distribution check)	5
`ifelse()`	Return different values depending on a condition	7
`IQR()`	Compute the interquartile range (Q3 - Q1)	6
`is.na()`	Test for missing values (returns TRUE/FALSE)	5, 7
`labs()`	Set plot title, axis labels, and legend title in ggplot	17
`leveneTest()`	Levene’s test for equal variances (car package)	16
`library()`	Load an installed package	3
`lm()`	Fit a linear regression model	12
`mean()`	Calculate the arithmetic mean	6
`median()`	Calculate the median	6
`mutate()`	Create or modify a variable in a data frame (dplyr)	7
`names()`	Return column names of a data frame	4
`nrow()`	Return the number of rows	4
`oneway.test()`	Welch’s ANOVA (robust to unequal variances)	9
`options()`	Set global R options (e.g., `scipen` for notation)	7
`pairwise.t.test()`	Pairwise t-tests with p-value correction (e.g., Bonferroni)	9
`parameterEstimates()`	Extract bootstrap CIs from a lavaan model	13
`pivot_longer()`	Convert wide-format data to long format (tidyr)	7
`plot()`	Base R plotting; also used for TukeyHSD CI plots and residual diagnostics	9, 12, 16
`power.t.test()`	Power analysis for t-tests	15
`pwr.anova.test()`	Power analysis for one-way ANOVA (pwr package)	15
`quantile()`	Compute percentiles	6
`range()`	Return the minimum and maximum	6
`read.csv()`	Read a CSV file into a data frame	2
`residuals()`	Extract residuals from a fitted model	12, 16
`rowMeans()`	Compute the mean across columns for each row (composite scores)	7
`scale()`	Standardize a variable to z-scores	7
`scale_color_manual()`	Set custom colors for lines and points in ggplot	17
`scale_fill_manual()`	Set custom colors for bars, boxes, and filled areas in ggplot	17
`scale_x_continuous()`	Control x-axis limits, breaks, and labels	17
`scale_y_continuous()`	Control y-axis limits, breaks, and labels	17
`sd()`	Calculate the standard deviation	6
`sem()`	Fit a structural equation / mediation model (lavaan package)	13
`seq()`	Generate a sequence of numbers (used for axis tick marks)	17
`setwd()`	Set the working directory	2
`shapiro.test()`	Shapiro-Wilk test for normality	16
`stat_summary()`	Compute and plot summary statistics in ggplot	17
`str()`	Show the structure of a data frame	4
`sum()`	Sum values (used with `is.na()` to count missing data)	5
`summary()`	Descriptive summary of a data frame or model	4, 12, 13
`t.test()`	Run a t-test (one-sample, independent, or paired)	8
`table()`	Count observations per group; create contingency tables	6, 14
`tapply()`	Apply a function by group (e.g., group means)	6
`theme()`	Fine-tune individual ggplot elements (fonts, legend, etc.)	17
`TukeyHSD()`	Tukey post-hoc pairwise comparisons for aov models	9, 10
`var()`	Calculate the variance	6
`View()`	Open a data frame in the RStudio viewer	4
`which()`	Return indices where a condition is TRUE	5, 7
`write.csv()`	Save a data frame to a CSV file	2

R Guide

Emily J. Hangen, SUNY Brockport

Spring 2026

How to Use This Guide

How to use code examples from this guide

1. Getting Familiar with R

R vs. RStudio

The four panes in RStudio

R scripts vs. R Markdown

Using # for headers and the outline

Running code

Important things to know about R

Assignment operator <-

Combine values c()

Basic arithmetic

Logical comparisons

Useful keyboard shortcuts in RStudio

Cautions and common mistakes

Hiding or viewing code in knitted output

Formatting text in R Markdown

2. Importing Data

Setting the working directory: setwd() and getwd()

Reading a CSV file: read.csv()

Saving data to a CSV file: write.csv()

3. Loading Packages

library()

4. Exploring Data

View()

? (help)

head()

str()

dim() and nrow()

names()

summary()

5. Cleaning Data

Checking for missing values: is.na() and sum()

Identifying outliers with boxplot.stats()

Spotting outliers visually: boxplots and histograms

Removing or filtering outliers

6. Descriptive Statistics

mean(), sd(), var(), median()

range(), quantile(), IQR()

table()

tapply()

7. Data Manipulation

filter() (dplyr)

mutate() and ifelse()

Pipe operator %>%

scale()

factor()

which(), is.na(), !

Computing composite scores: rowMeans()

Handling missing values: na.rm = TRUE

Converting scientific notation to regular numbers: format() and options()

Wide to long format: pivot_longer() (tidyr)

8. T-Tests

One-sample t-test

Independent samples t-test

Paired samples t-test

Effect size: Cohen’s d

APA write-up template

9. One-Way ANOVA

Between-subjects one-way ANOVA

aov() + summary()

Welch’s ANOVA: oneway.test()

Post-hoc: TukeyHSD() and Bonferroni

Effect size: eta-squared

Pairwise Cohen’s d

APA write-up template

Within-subjects (repeated-measures) one-way ANOVA

aov_ez() (afex package)

Post-hoc: emmeans()

Effect size: generalized eta-squared

APA write-up template

10. Two-Way / Factorial ANOVA

Between-subjects factorial ANOVA

aov() with * for interaction

Post-hoc: TukeyHSD() on the full model

Effect size: partial eta-squared

APA write-up template

Using `#` for headers and the outline

Assignment operator `<-`

Combine values `c()`

Setting the working directory: `setwd()` and `getwd()`

Reading a CSV file: `read.csv()`

Saving data to a CSV file: `write.csv()`

`library()`

`View()`

`?` (help)

`head()`

`str()`

`dim()` and `nrow()`

`names()`

`summary()`

Checking for missing values: `is.na()` and `sum()`

Identifying outliers with `boxplot.stats()`

`mean()`, `sd()`, `var()`, `median()`

`range()`, `quantile()`, `IQR()`

`table()`

`tapply()`

`filter()` (dplyr)

`mutate()` and `ifelse()`

Pipe operator `%>%`

`scale()`

`factor()`

`which()`, `is.na()`, `!`

Computing composite scores: `rowMeans()`

Handling missing values: `na.rm = TRUE`

Converting scientific notation to regular numbers: `format()` and `options()`

Wide to long format: `pivot_longer()` (tidyr)

`aov()` + `summary()`

Welch’s ANOVA: `oneway.test()`

Post-hoc: `TukeyHSD()` and Bonferroni

`aov_ez()` (afex package)

Post-hoc: `emmeans()`

`aov()` with `*` for interaction

Post-hoc: `TukeyHSD()` on the full model

Post-hoc: simple effects with `emmeans()`

`cor.test()` – Pearson and Spearman

Correlation matrix: `cor()`

Correlation heatmap: `corrplot()`

Simple linear regression: `lm()`

Multiple regression: `lm()` with `+`

Hierarchical regression: `anova()` to compare models

Fitting the model with `sem()`

`power.t.test()`

`pwr.anova.test()` (pwr package)

`shapiro.test()` – Normality

`leveneTest()` (car package) – Equal Variances

Significance bracket with `annotate()`

Manual fill and color: `scale_fill_manual()` / `scale_color_manual()`