This document lists every R function taught across the R for PSYCH swirl course (Modules 1-11 plus Bonus Modules). Functions are organized by workflow stage and analysis type. Early sections cover importing data, data exploration, cleaning and outlier detection, descriptive statistics, and data manipulation. Each analysis section is self-contained: it shows you how to run the test, calculate effect sizes, interpret the output, and write up the results in APA format, and it tells you which assumptions to check and which visuals to use.
Power analyses, assumption-checking functions, and a visualization gallery (with runnable example plots) appear at the end after all analysis sections.
Code chunks run automatically when you knit this document. To run a single chunk, click the green “Run” arrow.
Every code example in this guide is designed for you to copy, paste, and modify for your own data. Here is what to replace:
mtcars,
ToothGrowth, PlantGrowth) – replace with the
name of YOUR data frame (whatever you called it when you loaded your
data with read.csv()).mpg,
len, supp, weight) – replace with
the names of YOUR variables. Remember: spelling and capitalization must
match exactly what appears in your data."My Title", "Group", "steelblue")
– replace with your own titles, labels, or color choices.For example, if the guide shows:
And your data frame is called my_data, your outcome
variable is anxiety, and your grouping variable is
condition, you would type:
What NOT to change: Function names (like
t.test, ggplot, aov), operators
(like ~, +, <-), and argument
names (like data =, x =, fill =).
These are part of R’s syntax and must stay exactly as shown.
R is the programming language – it does the actual computing. RStudio is the application you work in – it provides a user-friendly interface for writing, running, and organizing your R code. Think of R as the engine and RStudio as the dashboard. You need both installed, but you will always open RStudio (not R directly).
When you open RStudio, you see four panes (panels):
Top-left – Source/Script pane: Where you write and edit your code files (R scripts or R Markdown documents). Code here does not run until you tell it to.
Bottom-left – Console: Where code actually runs and output appears. You can type commands here directly, but it is better to write code in the Source pane so you can save and reuse it.
Top-right – Environment: Shows all the objects (data frames, variables, models) currently stored in your R session. Also has a History tab showing previous commands.
Bottom-right – Files / Plots / Packages / Help / Viewer: A multi-purpose pane. The Files tab shows your working directory. The Plots tab displays your graphs. The Help tab shows function documentation.
An R script (.R file) is a plain text
file containing only R code and comments. Every line is code unless it
starts with #.
An R Markdown (.Rmd file) combines
text, formatting, and code chunks in one document. When you “knit” an R
Markdown file, it produces a formatted report (HTML, PDF, or Word) with
your text, code, and output all together. Code lives inside code chunks
(boxes that start and end with three backticks).
# for headers and the outlineIn R scripts, # marks a comment – R
ignores everything after it on that line. You can also use
# followed by a section name and four dashes
(----) to create sections that appear in the document
outline:
In R Markdown, # at the start of a line
creates a header. More # symbols mean smaller headers:
To view the outline in RStudio, click the small document-outline icon in the top-right corner of the Source pane (it looks like a bulleted list). This lets you jump between sections quickly.
In an R script:
In R Markdown code chunks:
R is case-sensitive. Mean() is not the
same as mean() – the first will produce an error, the
second will work. This applies to everything: function names, variable
names, dataset names, and factor levels. If something isn’t working,
check your capitalization first.
Spacing generally does not matter.
x <- 5 and x<-5 do the same thing, but
adding spaces makes your code easier to read.
R uses <- for assignment, not
=. While = sometimes works,
<- is the standard in R and is what you should use.
Quotation marks matter. Text strings need to be
wrapped in quotes ("OJ" or 'OJ'), but variable
and object names do not. For example,
filter(data, group == "control") puts quotes around the
word “control” because it is a text value, but group has no
quotes because it is a column name.
The $ operator accesses columns. To
refer to a specific variable inside a data frame, use
dataframe$variable – for example,
mtcars$mpg.
<-Stores a value in a named object so you can use it later.
Important: If you save a new value to the same object name, it replaces the old object completely. The old value is gone.
This also applies to data frames. If you write
mtcars <- filter(mtcars, mpg > 15), the original
mtcars is overwritten with the filtered version. If you
want to keep the original, save the result to a new name (e.g.,
mtcars_filtered <- filter(mtcars, mpg > 15)).
c()Creates a vector (a list of values).
| Shortcut | What it does |
|---|---|
| Ctrl + Enter | Run the current line or selected code |
| Ctrl + Shift + 1 | Expand/collapse the Source (script) pane |
| Ctrl + Shift + 2 | Expand/collapse the Console pane |
| Ctrl + Shift + 9 | Expand/collapse the Viewer pane |
| Alt + Ctrl + Shift + 0 | Show all four panes |
| Ctrl + Shift + C | Comment/uncomment selected lines (adds or removes
#) |
| Ctrl + Shift + M | Insert the pipe operator %>% |
| Ctrl + Alt + I | Insert a new R code chunk (in R Markdown) |
Tip about commenting out code (Ctrl + Shift + C):
Instead of deleting code you don’t want to run right now, highlight it
and press Ctrl + Shift + C to add # in front of each line.
This “comments it out” – R will skip it, but the code is still there if
you want to use it later. Press the same shortcut again to
uncomment.
Backticks are not single quotes. The backtick
character ` is usually found on the top-left key of your
keyboard, below Escape and to the left of the number 1, sharing the key
with the tilde ~. It looks similar to a single quote
' but they are different characters. In R Markdown, code
chunks open and close with three backticks (```).
Do NOT delete the closing backticks of a code chunk.
Every code chunk in R Markdown must have exactly three backticks at the
start (```{r}) and three backticks at the end
(```). If you accidentally delete the closing backticks, R
will think the code chunk never ends and everything below will
break.
Code chunk options (the part inside {r}) control what
appears when you knit your R Markdown document. Add options after the
r, separated by commas:
{r, echo = FALSE} – Runs the code but hides the code
itself in the knitted output (only the result/plot shows).{r, eval = FALSE} – Shows the code in the output but
does not run it.{r, results = 'hide'} – Runs the code and shows it, but
hides the text output (useful when you only want to see the plot, not
the printed numbers).{r, include = FALSE} – Runs the code silently (nothing
appears in the output – useful for setup chunks like loading
packages).{r, message = FALSE, warning = FALSE} – Suppresses
package loading messages and warnings in the output.Setting defaults for the whole document. Instead of typing the same options on every single chunk, you can set defaults once at the top of your R Markdown document using a setup chunk. This is typically the very first chunk in the file:
This tells R: “For every chunk in this document, show the code
(echo = TRUE), suppress warnings, and suppress messages –
unless I override these in a specific chunk.” Any option you set in an
individual chunk will override the default for that chunk only.
Outside of code chunks, you can format text using Markdown syntax. This is how you add emphasis, lists, and other formatting to the written portions of your knitted document.
| Format | Syntax | Result |
|---|---|---|
| Bold | **bold text** |
bold text |
| Italic | *italic text* |
italic text |
| Bold + italic | ***bold italic*** |
bold italic |
| Superscript | text^superscript^ |
textsuperscript |
| Subscript | text~subscript~ |
textsubscript |
| Strikethrough | ~~crossed out~~ |
|
| Inline code | `code here` |
code here |
For bullet points and numbered lists, start each line with
- or 1. (leave a blank line before the
list):
- First bullet
- Second bullet
- Indented sub-bullet
1. First item
2. Second item
To add a link: [link text](https://example.com). To
insert an image: .
Before you can analyze data, you need to tell R where your files are and load them into R.
setwd() and
getwd()The working directory is the folder R looks in when you read or save
files. Use getwd() to see your current working directory
and setwd() to change it.
getwd() # print the current working directory
setwd("/Users/yourname/Documents") # set it to a specific folder pathIn RStudio, you can also set the working directory by going to Session > Set Working Directory > Choose Directory or by using the Files pane (click “More” > “Set As Working Directory”).
Tip: Keep your data files and R scripts in the same folder. That way, you can set the working directory once and everything will be easy to find.
read.csv()Most data files you will work with in this course are CSV
(comma-separated values) files. Use read.csv() to load a
CSV file into R as a data frame. Assign the result to a named object so
you can use it.
# Read a CSV file from your working directory
data <- read.csv("my_data.csv")
# Read a CSV file from a specific file path
data <- read.csv("/Users/yourname/Documents/my_data.csv")After loading, always check the data with View(),
head(), or str() (see Section 4) to make sure
it loaded correctly.
library()Loads an installed package into your current R session. Packages must be loaded each time you start R.
library(tidyverse) # ggplot2, dplyr, and more
library(car) # leveneTest()
library(effsize) # cohen.d()
library(effectsize) # eta_squared()
library(pwr) # pwr.anova.test()
library(corrplot) # corrplot()
library(afex) # aov_ez() for repeated-measures and mixed ANOVA
library(emmeans) # post-hoc for afex models
library(lavaan) # sem() for mediation / SEM
library(patchwork) # combine multiple ggplots side by sidestr()Shows the structure of a dataset – variable names, types, and sample values. Useful for checking whether variables are numeric, factor, etc.
dim() and nrow()dim() returns rows and columns; nrow()
returns just the number of rows.
Before running any analysis, it is good practice to inspect your data for missing values and outliers. Outliers can distort means, inflate standard deviations, and violate assumptions of normality and equal variances.
is.na() and
sum()Use is.na() to test each value and sum() to
count how many are missing.
boxplot.stats()The function boxplot.stats() uses the IQR method to
identify outliers. By default it flags values beyond 1.5 × IQR from
Q1/Q3. In this course, we use a stricter 3 × IQR rule
(to avoid flagging too many values), which you set with
coef = 3. The $out element returns the actual
outlier values.
If this returns numeric(0), there are no outliers. If it
returns values, those are the specific data points flagged as
outliers.
To find which rows contain outliers (so you know which participants/observations to investigate):
# Store the outlier values
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out
# Find which rows contain those values
which(mtcars$mpg %in% outlier_values)What does coef do? It sets how strict
the outlier rule is. coef = 1.5 (the default) is the
standard boxplot rule and flags more points. coef = 3 is
stricter and only flags extreme outliers. In this course, we use
coef = 3.
Boxplots display outliers as individual points beyond the whiskers. Histograms reveal extreme values sitting far from the rest of the distribution. See Section 17 (Visualization Gallery) for runnable examples of both plots.
Once you have identified outliers, you can use filter()
(see Section 7) to create a clean dataset without them.
library(tidyverse)
outlier_values <- boxplot.stats(mtcars$mpg, coef = 3)$out
clean_data <- filter(mtcars, !(mpg %in% outlier_values))When should you remove outliers? There is no single rule. Remove data points only when you have a substantive reason (e.g., data entry error, participant who did not follow instructions). Always report whether outliers were removed and how many.
mean(), sd(), var(),
median()Core descriptive statistics. Use na.rm = TRUE if there
are missing values.
range(), quantile(),
IQR()Distribution measures useful for outlier detection and data description.
table()Counts the number of observations per group. Works with one or two grouping variables.
tapply()Applies a function (like mean or sd)
separately to each group. Use list() for two grouping
variables. The syntax is:
tapply(outcome_variable, grouping_variable, function).
# One grouping variable
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
tapply(ToothGrowth$len, ToothGrowth$supp, sd)
# Two grouping variables (for factorial designs)
tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)filter() (dplyr)Keeps only the rows that meet a condition.
mutate() and ifelse()mutate() creates a new variable (column) or modifies an
existing one. You can use it with arithmetic, functions, or
ifelse() to categorize values based on a condition.
library(tidyverse)
# Create a new column that categorizes cars as heavy or light
mtcars <- mutate(mtcars, heavy = ifelse(wt > 3, "Yes", "No"))
# Create a new column by doing arithmetic on existing columns
mtcars <- mutate(mtcars, hp_per_cyl = hp / cyl)
# Create a new column using a function
mtcars <- mutate(mtcars, mpg_z = as.numeric(scale(mpg)))
# Create a composite score as the mean of several columns
# (see also rowMeans() below for an alternative approach)
survey <- data.frame(q1 = c(4, 5, 3), q2 = c(3, 4, 5), q3 = c(5, 5, 4))
survey <- mutate(survey, composite = (q1 + q2 + q3) / 3)
# Recode a variable into multiple categories with nested ifelse()
mtcars <- mutate(mtcars, mpg_group = ifelse(mpg > 25, "High",
ifelse(mpg > 15, "Medium", "Low")))%>%The pipe operator %>% comes from the
dplyr package (loaded with tidyverse). It chains operations
left-to-right so you can read code as “take this, then do this,
then do this.” Keyboard shortcut: Ctrl + Shift +
M (Cmd + Shift + M on Mac).
scale()Standardizes a variable to z-scores (mean = 0, SD = 1).
factor()Converts a numeric variable to a categorical factor. Required before running ANOVA on numeric grouping variables.
which(), is.na(), !which() returns row numbers where a condition is true.
is.na() tests for missing values. !
negates.
rowMeans()In psychology research, you often need to average several
questionnaire items into a single composite score for each participant.
rowMeans() calculates the mean across columns for each row.
Use na.rm = TRUE to handle missing values.
# Suppose your data frame has columns q1 through q5 for a 5-item scale
# and you want to create a composite score for each participant
# Method 1: select columns by position (columns 2 through 6)
df$composite <- rowMeans(df[, 2:6], na.rm = TRUE)
# Method 2: select columns by name
df$composite <- rowMeans(df[, c("q1", "q2", "q3", "q4", "q5")], na.rm = TRUE)
# Method 3: use paste0() shorthand when columns are numbered sequentially
df$composite <- rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE)
# Method 4: use mutate() with rowMeans() for a tidyverse approach
df <- mutate(df, composite = rowMeans(df[, paste0("q", 1:5)], na.rm = TRUE))The resulting composite column contains each
participant’s average across those items. You can then use this
composite variable as an outcome or predictor in your analyses.
na.rm = TRUEUse na.rm = TRUE if there are missing values. Many R
functions (e.g., mean(), sd(),
sum(), rowMeans()) will return NA
if any input value is missing – unless you tell them to ignore missing
values with na.rm = TRUE.
format() and options()R sometimes displays very small p-values or very large numbers in
scientific notation (e.g., 2.5e-08 instead of
0.000000025). There are two ways to see the full
number.
# Convert a single number to regular notation
format(2.5e-08, scientific = FALSE)
# Turn off scientific notation globally for the rest of your session
options(scipen = 999)
# To turn it back on later
options(scipen = 0)scipen = 999 tells R to strongly prefer regular
notation. This is especially helpful when reading p-values from
statistical output.
pivot_longer() (tidyr)Within-subjects (repeated-measures) ANOVA requires data in long
format, where each row is one observation. If your data is in wide
format (one row per participant, multiple columns for each condition),
convert it with pivot_longer().
library(tidyverse)
# Wide format: columns time1, time2, time3 for each participant
wide_data <- data.frame(
id = 1:5,
time1 = c(30, 28, 35, 32, 29),
time2 = c(25, 22, 28, 27, 24),
time3 = c(20, 18, 22, 21, 19)
)
# Convert to long format
long_data <- pivot_longer(wide_data,
cols = time1:time3,
names_to = "time",
values_to = "score")
long_data$id <- factor(long_data$id)In long format, the dataset goes from 5 rows (one per person) to 15
rows (one per person per time point), with columns for id,
time, and score.
T-tests compare means between two conditions. Before running a t-test, check the assumptions of normality (Shapiro-Wilk test or histogram) and equal variances (Levene’s test for independent samples). Recommended visuals: bar graph of group means with error bars, or boxplot.
Tests whether a sample mean differs from a known value.
Compares means of two independent groups. Use
var.equal = TRUE for the standard t-test or
var.equal = FALSE for Welch’s t-test (when variances are
unequal).
Compares means from the same participants measured twice. Must use
the two-vector method (the formula interface does not support
paired = TRUE).
Use the effsize package. Thresholds: 0.2 = small, 0.5 =
medium, 0.8 = large. Report d numerically always; only describe the
magnitude verbally (e.g., “a medium effect”) when the t-test is
significant.
[Group 1] (M = ##.##, SD = ##.##) [was/were] [significantly/not significantly] [higher/lower/different] compared to [Group 2] (M = ##.##, SD = ##.##), t(df) = ##.##, p = .###, d = ##.##.
Note: The APA template requires group means and
standard deviations. Use tapply() from Section 6 to compute
them by group – for example,
tapply(ToothGrowth$len, ToothGrowth$supp, mean) and
tapply(ToothGrowth$len, ToothGrowth$supp, sd).
One-way ANOVA tests whether three or more group means differ. Before running, check normality of residuals (Shapiro-Wilk) and equality of variances (Levene’s test). If variances are unequal, use Welch’s ANOVA. Recommended visuals: boxplot by group, or bar graph with error bars.
Important: If your grouping variable is numeric
(e.g., dose levels like 0.5, 1, 2), you must convert it to a factor
before running the ANOVA. See factor() in Section 7 (Data
Manipulation).
aov() + summary()Save the model for use with summary(),
TukeyHSD(), and residuals().
oneway.test()Robust alternative when the equal variances assumption is violated.
TukeyHSD() and BonferroniOnly run post-hoc tests when the overall ANOVA is significant. TukeyHSD is the standard; Bonferroni is a more conservative alternative.
Use the effectsize package. Thresholds: .01 = small, .06
= medium, .14 = large. For one-way ANOVA, eta-squared and partial
eta-squared are identical.
For significant pairwise comparisons, compute Cohen’s d to report effect sizes.
A one-way ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, eta-squared = .##. Tukey-corrected post-hoc comparisons revealed that [Group A] was significantly [higher/lower] than [Group B] (p = .###, d = ##.##), while [other comparisons].
Note: The APA template requires group means and
standard deviations. Use tapply() from Section 6 to compute
them by group – for example,
tapply(PlantGrowth$weight, PlantGrowth$group, mean) and
tapply(PlantGrowth$weight, PlantGrowth$group, sd).
When the same participants are measured under all conditions, use a
repeated-measures ANOVA. Data must be in long format
(one row per observation; see Section 7 for
pivot_longer()). The afex package provides
aov_ez() which handles within-subjects designs and
automatically applies the Greenhouse-Geisser sphericity correction when
needed.
aov_ez() (afex package)Arguments: id = participant ID column (as a string),
dv = outcome variable (as a string), data =
data frame, within = within-subjects factor(s) (as a string
vector).
emmeans()The emmeans package provides pairwise comparisons for
afex models. Use adjust = "tukey" (default) or
adjust = "bonferroni".
aov_ez() reports generalized eta-squared (ges) in its
output. Interpretation thresholds are the same: .01 = small, .06 =
medium, .14 = large.
A repeated-measures ANOVA showed a significant effect of [factor] on [outcome], F(df1, df2) = ##.##, p = .###, generalized eta-squared = .##. Tukey-corrected pairwise comparisons revealed that [condition A] was significantly [higher/lower] than [condition B] (p = .###).
Note: The APA template requires condition means and
standard deviations. Use tapply() from Section 6 to compute
them – for example,
tapply(anxiety_data$anxiety, anxiety_data$time, mean) and
tapply(anxiety_data$anxiety, anxiety_data$time, sd).
Factorial ANOVA tests the effects of two (or more) factors and their interaction. Before running, check normality of residuals and equality of variances. Recommended visuals: interaction plot (line graph) and grouped bar graph with error bars.
Important: If any grouping variable is numeric
(e.g., dose levels like 0.5, 1, 2), you must convert it to a factor
before running the ANOVA. See factor() in Section 7 (Data
Manipulation).
aov() with * for interactionThe * operator includes both main effects and the
interaction. Produces three F-tests.
TukeyHSD() on the full modelReturns three sections: $supp (main effect of supp),
$dose (main effect of dose), and $supp:dose
(all cell-to-cell comparisons for the interaction).
In a factorial design, always use partial eta-squared because it accounts for the other factors.
A 2 x 3 between-subjects ANOVA examined the effects of [factor A] and [factor B] on [outcome]. There was a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. There was a significant main effect of [factor B], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##. The [factor A] x [factor B] interaction was [significant/non-significant], F(df1, df2) = ##.##, p = .###, partial eta-squared = .##.
When a significant interaction is present, focus the write-up on the interaction pattern (e.g., the effect of dose depended on supplement type). Main effects are less informative when qualified by a significant interaction.
When the same participants experience all combinations of two
factors, use aov_ez() with both factors listed in
within.
library(afex)
# 2 (task) x 3 (difficulty) within-subjects design
cog_model <- aov_ez(id = "id", dv = "rt",
data = cognitive_data,
within = c("task", "difficulty"))
cog_modelemmeans()Use | to run pairwise comparisons of one factor
separately at each level of the other (simple effects). This is how you
decompose a significant interaction.
A mixed design has at least one between-subjects factor and at least
one within-subjects factor. Use aov_ez() with both the
within and between arguments.
# group is between-subjects, time is within-subjects
therapy_model <- aov_ez(id = "id", dv = "depression",
data = therapy_data,
within = c("time"), between = c("group"))
therapy_modelWithin-subjects factorial: A 2 x 3 within-subjects ANOVA revealed a significant main effect of [factor A], F(df1, df2) = ##.##, p = .###, ges = .##. The [factor A] x [factor B] interaction was significant, F(df1, df2) = ##.##, p = .###, ges = .##. Tukey-corrected simple effects showed that [describe pattern].
Mixed ANOVA: A mixed ANOVA with [between factor] as the between-subjects factor and [within factor] as the within-subjects factor revealed a significant interaction, F(df1, df2) = ##.##, p = .###, ges = .##. Bonferroni-corrected post-hoc comparisons showed that [the therapy group improved significantly across time points while the control group did not].
Note: APA write-ups for factorial and mixed ANOVAs
require cell means and standard deviations for each condition
combination. Use tapply() with list() from
Section 6 to compute them – for example,
tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean).
Correlation measures the strength and direction of the linear relationship between two continuous variables. Before interpreting, check for linearity (scatterplot) and outliers. Recommended visual: scatterplot with regression line.
cor.test() – Pearson and SpearmanPearson r is the default; use Spearman rho for non-normal or ordinal data. Correlation strength thresholds: .10 = weak, .30 = moderate, .50 = strong.
cor()Computes correlations for multiple variables at once. Subset the data frame to include only the variables you want.
corrplot()Creates a color-coded heatmap of a correlation matrix. Save the matrix as an object first.
There was a [strong/moderate/weak] [positive/negative] correlation between [variable 1] and [variable 2], r(df) = .##, p = .###.
The effect size for correlation is the r value itself – no separate calculation needed.
Regression predicts an outcome from one or more predictors. Before interpreting, check residuals for normality (Shapiro-Wilk on residuals) and homoscedasticity (residuals vs. fitted plot). Recommended visuals: scatterplot with regression line for simple regression, residuals vs. fitted plot for assumption checking.
lm()Fits a model predicting one outcome from one predictor. The equation is Y = b0 + b1*X.
Key output: Intercept (b0) = predicted Y when X = 0. Slope (b1) = change in Y for each one-unit increase in X. Multiple R-squared = proportion of variance explained. F-statistic and p-value = overall model significance.
lm() with +Predicts an outcome from two or more predictors entered simultaneously. Each coefficient represents the effect of that predictor while controlling for the others.
anova() to compare modelsTests whether adding predictors significantly improves the model. A significant p-value means the fuller model explains significantly more variance.
R-squared is reported directly in the summary() output.
Thresholds: .02 = small, .13 = medium, .26 = large.
Simple regression: A simple linear regression showed that [predictor] significantly predicted [outcome], F(1, df) = ##.##, p = .###, R-squared = .##. For every one-unit increase in [predictor], [outcome] [increased/decreased] by ##.## units (b = ##.##, t = ##.##, p = .###).
Multiple regression: A multiple regression model significantly predicted [outcome], F(df1, df2) = ##.##, p = .###, R-squared = .##. [Predictor 1] (b = ##.##, p = .###) and [Predictor 2] (b = ##.##, p = .###) were significant predictors, while [Predictor 3] (b = ##.##, p = .###) was not significant.
Hierarchical regression: Adding [predictors] to the model significantly improved prediction, F-change(df1, df2) = ##.##, p = .###, R-squared change = .##.
Mediation tests whether an IV affects a DV through an intermediate
variable (the mediator). The indirect effect (a * b) represents the
portion of the IV-DV relationship that goes through the mediator. Use
the lavaan package and bootstrap confidence intervals for
the indirect effect. No specific assumption checks beyond those for
regression (normality, linearity); recommended visuals: scatterplots of
each path.
The model is a text string using ~ for regressions,
* for labeling paths, and := for defining
computed parameters.
library(lavaan)
med_model <- '
self_esteem ~ a*social_support
depression ~ b*self_esteem + cp*social_support
indirect := a*b
total := cp + a*b
'Line by line: Line 1 = a path (IV → mediator). Line 2 = b path (mediator → DV) and c’ path (direct effect, IV → DV controlling for mediator). Line 3 = indirect effect defined as a * b. Line 4 = total effect defined as c’ + indirect.
sem()The Regressions section shows a, b, and c’ paths. The Defined Parameters section shows the indirect and total effects.
Bootstrap CIs are the gold-standard test for the indirect effect because the sampling distribution of a * b is often not normal.
med_boot <- sem(med_model, data = mediation_data,
se = "bootstrap", bootstrap = 100)
parameterEstimates(med_boot)If the bootstrap CI for indirect does not include zero,
mediation is supported.
Full mediation: the indirect effect is significant but the direct effect (c’) is not – the entire IV-DV relationship goes through the mediator. Partial mediation: both the indirect effect and the direct effect are significant.
A mediation analysis using lavaan tested whether [mediator] mediated the effect of [IV] on [DV]. The indirect effect was significant (b = ##.##, 95% bootstrap CI [##.##, ##.##]), supporting [full/partial] mediation. The direct effect of [IV] on [DV] was [significant/not significant] (b = ##.##, p = .###).
Note: When writing up mediation results, you may
want to report means and standard deviations for each variable. Use
mean() and sd() from Section 6 – for example,
mean(mediation_data$social_support) and
sd(mediation_data$social_support).
Chi-square tests analyze categorical (nominal) data – variables whose values represent categories rather than quantities. No distributional assumptions are needed, but expected cell frequencies should be at least 5 for the chi-square approximation to be valid (use Fisher’s exact test if they are not). Recommended visual: bar chart of observed frequencies.
Tests whether the distribution of one categorical variable matches
expected proportions. By default, chisq.test() assumes
equal expected proportions.
# Count frequencies
table(survey_data$preference)
# Test against equal distribution
chisq.test(table(survey_data$preference))
# Test against specific expected proportions
chisq.test(table(survey_data$preference), p = c(0.40, 0.30, 0.20, 0.10))Degrees of freedom = number of categories minus 1.
Tests whether two categorical variables are related. First create a
contingency table with table(), then run
chisq.test().
If R warns that “Chi-squared approximation may be incorrect,” some expected cell frequencies are below 5. Use Fisher’s exact test instead.
Use when expected cell frequencies are too low for the chi-square approximation.
Computed manually from the chi-square result. Thresholds: .10 = small, .30 = medium, .50 = large.
Goodness-of-fit: A chi-square goodness-of-fit test indicated that therapy preferences were not equally distributed, X2(df) = ##.##, p = .###.
Independence: A chi-square test of independence showed a significant association between [variable 1] and [variable 2], X2(df, N = ##) = ##.##, p = .###, Cramer’s V = .##. [If applicable: Fisher’s exact test confirmed the result, p = .###.]
Power analysis helps you determine the sample size needed to detect an effect, or the power of an existing study. Run these before data collection (a priori) whenever possible.
power.t.test()For t-tests. Leave one argument as NULL to solve for
it.
shapiro.test() – NormalityShapiro-Wilk test. If p > .05, normality is not violated. For ANOVA and regression, test the residuals rather than the raw data.
# On a variable directly
shapiro.test(mtcars$mpg)
# On residuals from a model
model <- aov(len ~ supp, data = ToothGrowth)
shapiro.test(residuals(model))If normality is violated, check sample size: with n >= 30 per group, the test is robust due to the Central Limit Theorem.
leveneTest() (car package) – Equal VariancesIf p > .05, the equal variances assumption is met.
library(car)
leveneTest(len ~ supp, data = ToothGrowth)
leveneTest(len ~ supp * dose, data = ToothGrowth) # factorial designIf variances are unequal, use Welch’s t-test
(var.equal = FALSE) or Welch’s ANOVA
(oneway.test()).
For regression models. Look for random scatter; a funnel or curve indicates a violation.
| Test | Normality | Equal Variances | Other |
|---|---|---|---|
| One-sample t-test | Yes (or n >= 30) | N/A | – |
| Independent t-test | Yes (or n >= 30) | Yes (Levene’s) | – |
| Paired t-test | Yes (of differences) | N/A | – |
| One-way ANOVA | Yes (residuals) | Yes (Levene’s) | – |
| Two-way ANOVA | Yes (residuals) | Yes (Levene’s) | – |
| Repeated-measures ANOVA | – | – | Sphericity (auto-corrected by afex) |
| Correlation | Normality for Pearson | – | Linearity (scatterplot) |
| Regression | Yes (residuals) | Homoscedasticity (residual plot) | Linearity |
| Chi-square | N/A | N/A | Expected freq >= 5 |
This section shows runnable example plots for common analysis types. Each plot will render when you knit this document so you can see what they look like.
A note on colors: All examples below use named R
colors (like "steelblue", "coral",
"darkgreen") instead of hex codes. R recognizes 657
built-in color names. To see the full list, run colors() in
the console. For a visual reference with color swatches, visit: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
Reminder – copy, paste, modify: To use any plot
below with your own data, copy the code and replace the data frame name,
variable names, and text in quotes with your own. Keep the function
names, + signs, and argument names (like x =,
fill =) exactly as shown. See “How to use code examples
from this guide” at the top of this document for details.
Every ggplot is built by adding layers together with +.
The basic structure is:
Here is what each piece does:
ggplot(data, aes(...)) – This is the
foundation layer. It tells R which data frame to use and which variables
go on the x-axis and y-axis. The aes() function (short for
“aesthetics”) maps variables to visual properties. Common aesthetics
include x, y, fill (color inside
bars/boxes), color (color of lines/points/outlines), and
group.
geom_...() – The geometry layer
controls what type of plot you see. Examples: geom_bar()
for bar charts, geom_boxplot() for boxplots,
geom_point() for scatterplots,
geom_histogram() for histograms, geom_line()
for line graphs. You can add multiple geom layers to the same plot.
stat_summary() – An alternative to geom
layers that computes a summary statistic (like the mean) before
plotting. Use fun = mean with geom = "bar" to
make a bar graph of means, or fun.data = mean_cl_normal
with geom = "errorbar" for 95% CI error bars.
labs() – Sets the plot title, subtitle,
axis labels, and legend title. Example:
labs(title = "My Plot", x = "Group", y = "Score").
scale_...() – Controls how aesthetics
are mapped to visual values. Use
scale_fill_manual(values = c(...)) to set bar/box colors,
scale_color_manual(values = c(...)) to set line/point
colors, or scale_y_continuous() /
scale_x_continuous() to adjust axis limits and tick marks
(see below).
theme_...() – Sets the overall visual
style. theme_minimal() is clean with light gridlines;
theme_classic() has only axis lines (like a textbook
figure); theme_bw() is black-and-white with a border.
theme() – Fine-tune individual elements
like font sizes, legend position, and text alignment (see the
Customization section at the end of this gallery).
| What you want to change | Code |
|---|---|
| Bar/box colors | scale_fill_manual(values = c(...)) |
| Line/point colors | scale_color_manual(values = c(...)) |
| Axis labels and title | labs(title = "...", x = "...", y = "...") |
| Overall look | theme_minimal(), theme_classic(),
etc. |
| Font size (title) | theme(plot.title = element_text(size = 16)) |
| Font size (axes) | theme(axis.title = element_text(size = 13)) |
| Bold title | theme(plot.title = element_text(face = "bold")) |
| Center title | theme(plot.title = element_text(hjust = 0.5)) |
| Remove legend | theme(legend.position = "none") |
| Move legend to bottom | theme(legend.position = "bottom") |
Quick check of distribution shape. Use before t-tests and ANOVA to eyeball normality.
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
labs(title = "Distribution of MPG", x = "Miles per Gallon", y = "Count") +
theme_minimal()Code breakdown: aes(x = mpg) puts mpg
on the x-axis (histograms only need an x variable).
geom_histogram() creates the bars;
binwidth = 2 means each bar spans 2 units on the x-axis.
fill sets the inside color of the bars and
color sets the border color. labs() adds a
title and axis labels. theme_minimal() gives a clean
look.
Shows median, quartiles, and outliers. Good for comparing groups before t-tests or ANOVA.
ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
geom_boxplot() +
labs(title = "Tooth Length by Supplement", x = "Supplement", y = "Tooth Length") +
theme_minimal()Code breakdown:
aes(x = supp, y = len, fill = supp) maps the grouping
variable (supp) to both the x-axis and the fill color.
geom_boxplot() draws the boxes. The thick line is the
median, the box edges are Q1 and Q3, whiskers extend to the most extreme
points within 1.5 * IQR, and dots beyond the whiskers are outliers.
Standard visual for reporting group means. Use
mean_cl_normal for 95% CI error bars or
mean_se for standard error bars.
ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
labs(title = "Mean Tooth Length by Supplement (95% CI)",
x = "Supplement", y = "Mean Tooth Length") +
theme_minimal()Code breakdown: Instead of geom_bar(),
we use stat_summary(fun = mean, geom = "bar") to compute
the mean of len for each level of supp and
draw it as a bar. The second stat_summary() layer computes
95% confidence intervals (mean_cl_normal) and draws them as
error bars. width = 0.2 controls how wide the error bar
caps are.
Side-by-side bars for two-way ANOVA designs. Use
position_dodge() to place bars next to each other rather
than stacked.
ToothGrowth$dose <- factor(ToothGrowth$dose)
ggplot(ToothGrowth, aes(x = dose, y = len, fill = supp)) +
stat_summary(fun = mean, geom = "bar", position = position_dodge(0.9)) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
position = position_dodge(0.9), width = 0.2) +
labs(title = "Tooth Growth by Supplement and Dose",
x = "Dose (mg/day)", y = "Mean Tooth Length", fill = "Supplement") +
theme_minimal()Code breakdown:
aes(x = dose, ..., fill = supp) puts dose on the x-axis and
uses fill color to distinguish supplement types.
position_dodge(0.9) places the bars side by side (both the
bars and error bars need the same dodge value to line up).
fill = "Supplement" inside labs() changes the
legend title.
Shows whether the effect of one factor depends on another. Non-parallel lines suggest an interaction. Use for factorial and mixed ANOVA designs.
ggplot(ToothGrowth, aes(x = dose, y = len, color = supp, group = supp)) +
stat_summary(fun = mean, geom = "point", size = 3) +
stat_summary(fun = mean, geom = "line", linewidth = 1) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1) +
labs(title = "Interaction Plot - Supplement by Dose",
x = "Dose", y = "Mean Tooth Length", color = "Supplement") +
theme_minimal()Code breakdown: color = supp maps
supplement type to the line/point color (ggplot picks colors
automatically). group = supp tells ggplot to draw separate
lines for each supplement. We add three stat_summary()
layers: points for the group means, lines connecting them, and error
bars. The color = "Supplement" inside labs()
renames the color legend.
For within-subjects designs with one factor, use
group = 1 to connect all points with a single line (since
there is only one condition trajectory to show).
ggplot(rep_data, aes(x = time, y = anxiety, group = 1)) +
stat_summary(fun = mean, geom = "point", size = 3) +
stat_summary(fun = mean, geom = "line", linewidth = 1) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1) +
labs(title = "Anxiety Across Time Points",
x = "Time Point", y = "Mean Anxiety Score") +
theme_minimal()Code breakdown: group = 1 tells ggplot
that all time points belong to a single group and should be connected by
one line. Without this, ggplot would treat each time point as its own
group and would not draw a connecting line. The rest follows the same
pattern: point, line, and error bar layers.
Standard visual for correlation and simple regression.
geom_smooth(method = "lm") adds a linear best-fit line with
a 95% CI band.
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", color = "steelblue") +
labs(title = "MPG by Vehicle Weight",
x = "Weight (1000 lbs)", y = "Miles per Gallon") +
theme_minimal()Code breakdown: geom_point() adds the
individual data points. geom_smooth(method = "lm") fits a
linear regression line and draws it with a shaded 95% confidence band
around it. color = "steelblue" changes the line color. To
remove the confidence band, add se = FALSE inside
geom_smooth().
Color-coded matrix showing correlations among multiple variables.
Numbers on the cells show the r values. This uses the
corrplot package rather than ggplot.
library(corrplot)
car_cors <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
corrplot(car_cors, method = "color", addCoef.col = "black")Code breakdown: First, cor() computes
the correlation matrix for the selected variables. Then
corrplot() visualizes it. method = "color"
fills each cell with a color (blue for positive, red for negative).
addCoef.col = "black" prints the r values on each cell. You
can further customize with arguments like tl.col (label
color) and tl.srt (label rotation angle).
Use for visualizing frequency distributions before running chi-square tests.
ggplot(data.frame(cyl = factor(mtcars$cyl)), aes(x = cyl, fill = cyl)) +
geom_bar() +
labs(title = "Count of Cars by Cylinder", x = "Cylinders", y = "Count") +
scale_fill_manual(values = c("4" = "skyblue", "6" = "coral", "8" = "darkgreen")) +
theme_minimal() +
theme(legend.position = "none")Code breakdown: geom_bar() counts the
number of observations in each category and draws a bar for each (no
need to compute counts yourself). fill = cyl inside
aes() colors each bar by group.
theme(legend.position = "none") removes the legend since
the x-axis labels already show the group names.
scale_fill_manual() sets specific named colors for each
bar.
annotate()Adds a bracket and asterisk to mark significant pairwise comparisons on a bar graph. You need to choose y-axis values for the bracket that sit above the error bars without overlapping them.
ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
annotate("segment", x = 2, xend = 3, y = 6.2, yend = 6.2) +
annotate("text", x = 2.5, y = 6.4, label = "*", size = 8) +
labs(title = "Plant Weight by Treatment Group",
x = "Group", y = "Mean Weight") +
theme_minimal() +
theme(legend.position = "none")Code breakdown: Two annotate() layers
create the bracket. The first draws a horizontal line segment from
x = 2 to xend = 3 (the second and third bars)
at y = 6.2 – high enough to clear the error bars. The
second places an asterisk centered between the two bars at a slightly
higher y value (y = 6.4). Adjust these y values based on
your data. If the bracket overlaps your error bars, increase the y
values.
You can customize nearly every visual element of a ggplot. Below are the most common adjustments.
scale_fill_manual() /
scale_color_manual()Override default colors with named R colors. Use fill
for bars/boxes and color for lines/points. To see all
available color names, run colors() in the console, or
visit the visual reference at https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf.
ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
scale_fill_manual(values = c("OJ" = "coral", "VC" = "steelblue")) +
labs(title = "Custom Named Colors", x = "Supplement", y = "Mean Length") +
theme_minimal()Themes control the overall look of the plot (background, gridlines, axis lines). Add a theme as the last layer of your plot. Here are the most common options:
theme_minimal() – Clean and modern.
Light gray gridlines, no background box, no axis lines.
Recommended for most assignments – it keeps the focus
on the data without looking cluttered.
theme_classic() – Only x and y axis
lines, no gridlines, no background. Looks like a textbook or journal
figure. Recommended for APA-style figures since APA
discourages gridlines.
theme_bw() – Black border around the
plot with white background and gridlines. A good middle ground if you
want the structure of gridlines but a more formal look.
theme_light() – Similar to
theme_bw() but with lighter gray gridlines and border.
Slightly softer look.
theme_gray() – The ggplot default (gray
background with white gridlines). Generally not recommended for reports
since it prints poorly in black-and-white.
library(patchwork)
base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "darkred", size = 2) +
labs(x = "Weight", y = "MPG")
p1 <- base + labs(title = "theme_minimal()") + theme_minimal()
p2 <- base + labs(title = "theme_classic()") + theme_classic()
p3 <- base + labs(title = "theme_bw()") + theme_bw()
p4 <- base + labs(title = "theme_light()") + theme_light()
(p1 | p2) / (p3 | p4)Use theme() to control individual text elements. Common
arguments: size, face (“bold”, “italic”),
hjust (horizontal justification, 0 = left, 0.5 = center, 1
= right).
ggplot(PlantGrowth, aes(x = group, y = weight, fill = group)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
labs(title = "Customized Text Elements",
x = "Treatment Group", y = "Mean Weight") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 13),
axis.text = element_text(size = 11),
legend.position = "none"
)An alphabetical listing of every function mentioned in this guide, with a brief description and the section where it is covered.
| Function | Description | Section |
|---|---|---|
? |
Open help page for a function or dataset | 4 |
annotate() |
Add text, segments, or labels to a ggplot | 17 |
anova() |
Compare two nested regression models (hierarchical regression) | 12 |
aov() |
Fit a between-subjects ANOVA model | 9, 10 |
aov_ez() |
Fit repeated-measures or mixed ANOVA (afex package) | 9, 10 |
as.numeric() |
Convert an object to numeric (e.g., remove scale attributes) | 7 |
boxplot() |
Base R boxplot (quick outlier check) | 5 |
boxplot.stats() |
Identify outlier values using IQR method; use $out to
extract them |
5 |
c() |
Combine values into a vector | 1 |
cat() |
Print text to the console (used for lavaan model strings) | 13 |
chisq.test() |
Chi-square goodness-of-fit or test of independence | 14 |
cohen.d() |
Compute Cohen’s d effect size (effsize package) | 8, 9 |
colSums() |
Sum each column (used with is.na() to count missing
values) |
5 |
colors() |
List all 657 built-in R color names | 17 |
cor() |
Compute a correlation matrix | 11 |
cor.test() |
Test a single correlation (Pearson or Spearman) | 11 |
coord_cartesian() |
Zoom into a plot without removing data | 17 |
corrplot() |
Draw a correlation heatmap (corrplot package) | 11, 17 |
data.frame() |
Create a data frame from vectors | 1 |
dim() |
Return the number of rows and columns | 4 |
emmeans() |
Pairwise post-hoc comparisons for afex models (emmeans package) | 9, 10 |
eta_squared() |
Compute eta-squared or partial eta-squared (effectsize package) | 9, 10 |
factor() |
Convert a variable to a categorical factor | 7 |
filter() |
Keep rows that meet a condition (dplyr) | 7 |
fisher.test() |
Fisher’s exact test (when expected cell frequencies < 5) | 14 |
format() |
Convert scientific notation to regular numbers | 7 |
geom_bar() |
Bar chart geometry (counts observations per category) | 17 |
geom_boxplot() |
Boxplot geometry | 17 |
geom_histogram() |
Histogram geometry | 17 |
geom_line() |
Line geometry | 17 |
geom_point() |
Scatterplot point geometry | 17 |
geom_smooth() |
Add a fitted line (e.g., regression) with CI band | 17 |
getwd() |
Print the current working directory | 2 |
ggplot() |
Initialize a ggplot with data and aesthetic mappings | 17 |
head() |
Display the first 6 rows of a data frame | 4 |
hist() |
Base R histogram (quick distribution check) | 5 |
ifelse() |
Return different values depending on a condition | 7 |
IQR() |
Compute the interquartile range (Q3 - Q1) | 6 |
is.na() |
Test for missing values (returns TRUE/FALSE) | 5, 7 |
labs() |
Set plot title, axis labels, and legend title in ggplot | 17 |
leveneTest() |
Levene’s test for equal variances (car package) | 16 |
library() |
Load an installed package | 3 |
lm() |
Fit a linear regression model | 12 |
mean() |
Calculate the arithmetic mean | 6 |
median() |
Calculate the median | 6 |
mutate() |
Create or modify a variable in a data frame (dplyr) | 7 |
names() |
Return column names of a data frame | 4 |
nrow() |
Return the number of rows | 4 |
oneway.test() |
Welch’s ANOVA (robust to unequal variances) | 9 |
options() |
Set global R options (e.g., scipen for notation) |
7 |
pairwise.t.test() |
Pairwise t-tests with p-value correction (e.g., Bonferroni) | 9 |
parameterEstimates() |
Extract bootstrap CIs from a lavaan model | 13 |
pivot_longer() |
Convert wide-format data to long format (tidyr) | 7 |
plot() |
Base R plotting; also used for TukeyHSD CI plots and residual diagnostics | 9, 12, 16 |
power.t.test() |
Power analysis for t-tests | 15 |
pwr.anova.test() |
Power analysis for one-way ANOVA (pwr package) | 15 |
quantile() |
Compute percentiles | 6 |
range() |
Return the minimum and maximum | 6 |
read.csv() |
Read a CSV file into a data frame | 2 |
residuals() |
Extract residuals from a fitted model | 12, 16 |
rowMeans() |
Compute the mean across columns for each row (composite scores) | 7 |
scale() |
Standardize a variable to z-scores | 7 |
scale_color_manual() |
Set custom colors for lines and points in ggplot | 17 |
scale_fill_manual() |
Set custom colors for bars, boxes, and filled areas in ggplot | 17 |
scale_x_continuous() |
Control x-axis limits, breaks, and labels | 17 |
scale_y_continuous() |
Control y-axis limits, breaks, and labels | 17 |
sd() |
Calculate the standard deviation | 6 |
sem() |
Fit a structural equation / mediation model (lavaan package) | 13 |
seq() |
Generate a sequence of numbers (used for axis tick marks) | 17 |
setwd() |
Set the working directory | 2 |
shapiro.test() |
Shapiro-Wilk test for normality | 16 |
stat_summary() |
Compute and plot summary statistics in ggplot | 17 |
str() |
Show the structure of a data frame | 4 |
sum() |
Sum values (used with is.na() to count missing
data) |
5 |
summary() |
Descriptive summary of a data frame or model | 4, 12, 13 |
t.test() |
Run a t-test (one-sample, independent, or paired) | 8 |
table() |
Count observations per group; create contingency tables | 6, 14 |
tapply() |
Apply a function by group (e.g., group means) | 6 |
theme() |
Fine-tune individual ggplot elements (fonts, legend, etc.) | 17 |
TukeyHSD() |
Tukey post-hoc pairwise comparisons for aov models | 9, 10 |
var() |
Calculate the variance | 6 |
View() |
Open a data frame in the RStudio viewer | 4 |
which() |
Return indices where a condition is TRUE | 5, 7 |
write.csv() |
Save a data frame to a CSV file | 2 |