Purpose: A living notebook of short, reliable R snippets I actually use. Keep it short. Keep it runnable. Update as I learn.

Chapter 1️⃣

1. Quick Setup

# Install-once (uncomment as needed)
# install.packages(c("tidyverse", "janitor", "lubridate", "readr", "readxl", "openxlsx",
                    # "skimr", "here", "fs", "glue", "knitr", "rmarkdown", "ggthemes",
                    # "patchwork", "gt", "gtExtras", "stringr", "forcats"))

# Load every session
suppressPackageStartupMessages({
  library(tidyverse)
  library(janitor)
  library(lubridate)
  library(here)
  library(glue)
  library(skimr)
  library(gt)
})

# Notebook default: show chunks without running them
knitr::opts_chunk$set(eval = FALSE)

# Reproducibility
set.seed(42)

Project tip: Use an RStudio Project and here::here() for paths. Never hard‑code "C:/Users/...".

2. Reading & Writing Data (fast + safe)

# CSV (robust defaults)
df_csv <- readr::read_csv(here("data", "my_data.csv"))

# Excel (first sheet)
df_xlsx <- readxl::read_excel(here("data", "my_data.xlsx"), sheet = 1)

# Write outputs with timestamps
out_path <- here("output", glue("cleaned_{format(Sys.Date(), '%Y%m%d')}.csv"))
# readr::write_csv(df_csv, out_path)

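Gotcha: read_csv() assumes UTF-8 by default. If you see garbled characters, the file is probably in another encoding: check with readr::guess_encoding() and pass it explicitly, e.g. locale = locale(encoding = "latin1").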
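A minimal sketch of the encoding fix, assuming a Latin-1 file. The temp file and its contents are made up for illustration; a real file would live under data/.

```r
library(readr)

# Write a tiny Latin-1 encoded CSV to a temp file (hypothetical sample data).
# 0xE9 is "é" in Latin-1, so the single data row reads "café".
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("name\ncaf"), as.raw(0xE9), charToRaw("\n")), con)
close(con)

# Tell readr the source encoding so the text is converted to UTF-8 on import
df_latin1 <- read_csv(tmp, locale = locale(encoding = "latin1"), show_col_types = FALSE)
df_latin1$name
```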

3. Inspecting Data (what is this?)

# High‑level skim
skimr::skim(df_csv)

# Structure & types
str(df_csv)

# Column names (clean + check)
names(df_csv)
janitor::compare_df_cols(df_csv)

Rule of thumb: If a column should be a date, convert it immediately with lubridate.

4. Cleaning Columns & Rows

# Consistent names
clean <- df_csv %>%
  janitor::clean_names() %>%                 # snake_case column names
  mutate(across(where(is.character), trimws)) # trim leading/trailing spaces

# Remove complete duplicate rows
clean <- distinct(clean)

# Handle blanks as NA (character columns only; na_if() errors on a type mismatch in dplyr >= 1.1)
clean <- mutate(clean, across(where(is.character), ~ na_if(.x, "")))

Tip: To de‑dupe on a subset of columns, name them and keep the rest with .keep_all = TRUE, e.g. distinct(clean, id, .keep_all = TRUE).
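A small sketch of de‑duping by key, using a made-up `orders_toy` table (the column names and values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data: id 1 appears twice with different values
orders_toy <- tibble(id = c(1, 1, 2), value = c(10, 20, 30))

# Keep the first row seen for each id, along with all other columns
deduped <- distinct(orders_toy, id, .keep_all = TRUE)
deduped
```

distinct() keeps the first occurrence, so arrange() beforehand if a specific row (e.g., the latest) should win.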

5. dplyr Cheatsheet (minimal set)

result <- clean %>%
  filter(!is.na(id)) %>%
  mutate(
    date = lubridate::ymd(date),
    category = forcats::fct_lump_n(as.factor(category), n = 5)
  ) %>%
  group_by(category) %>%
  summarize(
    n = n(),
    mean_val = mean(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n))

Mnemonic: Select–Filter–Mutate–Summarize–Arrange covers 80% of wrangling.

6. Joins (I always mix these up)

# left_join: keep all rows from x, bring matches from y
joined <- df_csv %>% left_join(df_xlsx, by = "id")

# anti_join: rows in x with no match in y (great for QA)
missing_keys <- df_csv %>% anti_join(df_xlsx, by = "id")

QA trick: anti_join() first to see what won’t match before any heavy processing.
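To make the difference concrete, here is a sketch with two tiny made-up tables (`customers_toy` and `orders_toy` are hypothetical names):

```r
library(dplyr)

customers_toy <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders_toy    <- tibble(id = c(1, 3), total = c(50, 75))

# left_join keeps all 3 customers; total is NA where there is no order
joined_toy <- left_join(customers_toy, orders_toy, by = "id")

# anti_join returns only the customers with no matching order
unmatched_toy <- anti_join(customers_toy, orders_toy, by = "id")

joined_toy
unmatched_toy
```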

7. Dates & Times (lubridate)

# Parse and standardize
clean_dates <- clean %>%
  mutate(
    date = ymd(date),
    year = year(date),
    month = month(date, label = TRUE, abbr = TRUE),
    wk = isoweek(date)
  )

Tip: If parsing fails, inspect with parse_date_time(x, orders = c("ymd", "mdy", "dmy")).
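A quick sketch of that tip, assuming a column that mixes three unambiguous formats (the strings are made up). Ambiguous values such as "01/02/2024" are resolved by whichever order matches first, so inspect those by hand.

```r
library(lubridate)

# Hypothetical messy input: the same day written three ways
raw_dates <- c("2024-03-15", "03/15/2024", "15.03.2024")

# Try each order in turn until one parses
parsed <- parse_date_time(raw_dates, orders = c("ymd", "mdy", "dmy"))
as.Date(parsed)
```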

8. Strings (stringr)

text_clean <- clean %>%
  mutate(
    email = str_to_lower(email),
    domain = str_extract(email, "(?<=@).+$")  # text after the @, without the @ itself
  )

Regex sanity: Test patterns at https://regex101.com/ before committing.

9. Factors (forcats)

fac <- clean %>%
  mutate(
    status = fct_relevel(as.factor(status), c("new", "active", "inactive")),
    top_cat = fct_lump_n(as.factor(category), n = 6)
  )

Plotting tip: Relevel factors to control ggplot ordering.
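A minimal illustration of why releveling matters: factor levels default to alphabetical order, and ggplot follows the level order. The status values below are made up.

```r
library(forcats)

status_toy <- factor(c("inactive", "new", "active", "new"))
levels(status_toy)          # default order is alphabetical

# Put the levels in the order they should appear on the plot axis
status_ordered <- fct_relevel(status_toy, "new", "active", "inactive")
levels(status_ordered)
```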

Chapter 2️⃣

10. Plotting (ggplot2: minimal patterns)

# Bar (counts)
clean %>%
  ggplot(aes(x = category)) +
  geom_bar(fill = "#2E86AB") +
  theme_minimal(base_size = 12) +
  labs(title = "Counts by Category", x = NULL, y = "Count")

# Line (time series)
clean_dates %>%
  group_by(date) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(date, n)) +
  geom_line(color = "#7D3C98", linewidth = 0.9) +
  theme_minimal(base_size = 12) +
  labs(title = "Daily Counts", x = NULL, y = NULL)

Small multiples: Use + facet_wrap(~group) when categories are many.
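A sketch of the small-multiples pattern on made-up data (`facet_toy` and its columns are assumptions):

```r
library(ggplot2)

# Hypothetical data: three groups, four points each
facet_toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  x     = rep(1:4, times = 3),
  y     = c(1, 3, 2, 4, 2, 2, 3, 5, 4, 3, 2, 1)
)

# One panel per group instead of three overlapping lines
p_facets <- ggplot(facet_toy, aes(x, y)) +
  geom_line() +
  facet_wrap(~group) +
  theme_minimal(base_size = 12)

p_facets
```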

11. Tables (gt quick pattern)

result %>%
  gt::gt() %>%
  gt::fmt_number(columns = where(is.numeric), decimals = 2) %>%
  gt::tab_header(title = gt::md("**Summary by Category**"))

Export: pipe the finished table into gt::gtsave("table.png"); the file extension picks the format (PNG, PDF, or HTML).
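A minimal export sketch, assuming a small made-up summary table. It saves to HTML in a temp folder so it runs anywhere; in a project the file would more likely go under output/.

```r
library(gt)

# Hypothetical summary data standing in for `result`
summary_toy <- data.frame(category = c("a", "b"), mean_val = c(1.234, 5.678))

tbl <- gt(summary_toy)

# The extension of `filename` decides the output format
gtsave(tbl, filename = "summary_table.html", path = tempdir())

out_file <- file.path(tempdir(), "summary_table.html")
file.exists(out_file)
```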

12. Modeling (tidymodels tiny starter)

# install.packages("tidymodels")  # once
# library(tidymodels)
# set.seed(42)
# split <- initial_split(clean, prop = 0.8)
# train <- training(split); test <- testing(split)
# rec <- recipe(target ~ ., data = train) %>% step_dummy(all_nominal_predictors())
# mod <- linear_reg() %>% set_engine("lm")
# wf  <- workflow() %>% add_model(mod) %>% add_recipe(rec)
# wf_fit <- fit(wf, data = train)  # don't name the result `fit`; it masks the function
# test_metrics <- predict(wf_fit, test) %>% bind_cols(test) %>% metrics(truth = target, estimate = .pred)

Reality check: Always baseline with a simple model (e.g., lm) before anything fancy.

13. Debugging & Safety Nets

Common errors: missing packages, wrong column names, bad joins, factor levels not set.
Tactics:
- rlang::last_error() to see context
- dplyr::glimpse() before/after key steps
- stopifnot() for assumptions (e.g., unique keys)
- Use tryCatch() around fragile I/O

stopifnot(!anyDuplicated(clean$id))  # ids should be unique
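The tryCatch() tactic can be sketched like this. `safe_read_csv` is a hypothetical helper name, and base read.csv() is used only to keep the example dependency-free.

```r
# Return NULL (with a message) instead of stopping when a read fails
safe_read_csv <- function(path) {
  on_problem <- function(cond) {
    message("Could not read '", path, "': ", conditionMessage(cond))
    NULL
  }
  # Note: the warning handler also aborts on benign warnings; drop it if that is too strict
  tryCatch(utils::read.csv(path), error = on_problem, warning = on_problem)
}

missing_file <- file.path(tempdir(), "does_not_exist.csv")
result <- safe_read_csv(missing_file)
is.null(result)
```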

14. Reproducible Paths & Projects

Use RStudio Projects; root paths with here::here().
Keep folders: data/, R/, output/, figs/, docs/.
Save session info with outputs.

sessionInfo()
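One way to save session info next to outputs, as a sketch. The file name is an assumption, and tempdir() stands in for the project's output/ folder.

```r
# Capture what sessionInfo() prints and write it to a text file
info_path <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_path)

# Quick look at the first lines that were written
head(readLines(info_path), 3)
```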

15. Handy Snippets I Reuse

# Percent of total
percent_of_total <- function(x) round(100 * x / sum(x, na.rm = TRUE), 1)

# Not-in operator
`%nin%` <- function(x, y) !(x %in% y)

# Wrap a function so its output, messages, and warnings are captured instead of printed
quietly <- purrr::quietly
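Usage sketch for the three helpers, repeated here so the block runs on its own (the inputs are made up):

```r
percent_of_total <- function(x) round(100 * x / sum(x, na.rm = TRUE), 1)
`%nin%` <- function(x, y) !(x %in% y)
quietly <- purrr::quietly

# Shares that add up to 100
shares <- percent_of_total(c(1, 1, 2))
shares

# TRUE where the left-hand value is NOT in the right-hand set
not_in <- c("a", "z") %nin% c("a", "b")
not_in

# quietly() returns a wrapped function; calling it gives a list with
# result, output, warnings, and messages
quiet_log <- quietly(log)
res <- quiet_log(-1)
is.nan(res$result)
length(res$warnings)
```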

16. Checklist Before You Ship

Column names are clean & consistent
Dates parsed and in correct timezone/format
Joins audited with anti_join()
NAs handled intentionally
Figures have titles, labels, units
Code chunks are deterministic (set seeds)
Save artifacts with versioned filenames

17. Appendix: swirl (learn by doing)

Install once: install.packages("swirl")
Each session: library(swirl); swirl()
Navigate with: skip(), play() → nxt(), main(), info(), bye()

18. Appendix: Keyboard Macros (RStudio)

Run line/selection: Ctrl/Cmd + Enter
Run all chunks above: Ctrl + Alt + P (Windows/Linux) or Cmd + Option + P (macOS)
Insert chunk: Ctrl + Alt + I / Cmd + Option + I

19. To Do / Parking Lot

Add a targets or renv section when projects grow
Add unit tests with testthat for key helpers
Add a style guide decision (lintr/styler)

Chapter 3️⃣

20. Intro to the Tidyverse

Purpose: An introduction to data wrangling with the tidyverse.

20.1 What Is the Tidyverse

The tidyverse is a collection of R packages designed to make data analysis easier. Common uses:

filtering data
summarizing data
grouping and arranging data
creating new tables
exploring patterns

Functions covered:

select()
filter()
summarise() / summarize()
group_by()
arrange()
count()

20.2 Pipes (%>%)

Pipes connect multiple data transformations into a readable workflow.

data %>%
  step_one() %>%
  step_two() %>%
  step_three()
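A tiny runnable comparison of nested calls and the same steps written as a pipe (the numbers are arbitrary):

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped: read top to bottom, one step per line
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

identical(nested, piped)
```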

20.3 filter()

Filtering answers questions like:

When was NC affected by hurricanes?
How many storms happened per year?
What year had the strongest winds?

Key points:

== tests equality
character strings must be in quotes
filter() returns rows where the condition is TRUE
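A short sketch of these points on a made-up table. `storms_toy` and its values are hypothetical; the column names mirror the hurricane examples used in this chapter.

```r
library(dplyr)

storms_toy <- tibble(
  Name           = c("Alpha", "Beta", "Gamma"),
  Year           = c(1999, 2004, 2004),
  AffectedStates = c("NC", "FL", "NC")
)

# == tests equality; the state code is a string, so it goes in quotes
nc_storms <- storms_toy %>%
  filter(AffectedStates == "NC")

# Several conditions separated by commas must all be TRUE
nc_recent <- storms_toy %>%
  filter(AffectedStates == "NC", Year > 2000)

nc_storms
nc_recent
```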

20.4 count()

count() is used for “How many…?” questions.

What it does:

groups data
counts rows
creates a new column n

Examples:

dataset %>%
  count(Year)

Sort from greatest to least:

dataset %>%
  count(Year) %>%
  arrange(desc(n))

Count hurricanes by name:

hurricNamed %>%
  count(Name)

Top names first:

hurricNamed %>%
  count(Name) %>%
  arrange(desc(n))
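To see that count() really is group-and-count in one step, here is a sketch comparing it with the longer group_by() + summarise() form on made-up data (`years_toy` is hypothetical):

```r
library(dplyr)

years_toy <- tibble(Year = c(2001, 2001, 2002, 2003, 2003, 2003))

# Short form
by_count <- years_toy %>%
  count(Year)

# Long form that count() stands in for
by_hand <- years_toy %>%
  group_by(Year) %>%
  summarise(n = n())

by_count
identical(by_count$n, by_hand$n)
```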

20.5 summarise()

Used for:

averages
minimums and maximums
sums
medians
counts

Example:

hurricNamed %>%
  summarise(
    average = mean(deaths),
    min = min(deaths),
    max = max(deaths)
  )

Atmospheric pressure:

hurricNamed %>%
  summarise(
    average = mean(LF.PressureMB, na.rm = TRUE),
    min = min(LF.PressureMB, na.rm = TRUE),
    max = max(LF.PressureMB, na.rm = TRUE)
  )

20.6 group_by()

group_by() defines categories for grouped summaries.

Rule: group_by() must come before summarise().

Example:

hurricNamed %>%
  group_by(AffectedStates) %>%
  summarise(
    average = mean(deaths),
    min = min(deaths),
    max = max(deaths)
  )

Damage cost by year:

hurricNamed %>%
  group_by(Year) %>%
  summarise(
    average = mean(BaseDam2014, na.rm = TRUE),
    min = min(BaseDam2014, na.rm = TRUE),
    max = max(BaseDam2014, na.rm = TRUE)
  )

20.7 Layering Functions (multi‑step pipelines)

Typical pattern:

filter()
group_by()
summarise() or count()
arrange()

How many hurricanes impacted FL each year:

hurricNamed %>%
  filter(AffectedStates == "FL") %>%
  group_by(Year) %>%
  summarise(n = n())

For multi‑state strings:

hurricNamed %>%
  filter(str_detect(AffectedStates, "FL"))

Average 2014 damage cost and number of storms per wind speed:

hurricNamed %>%
  group_by(LF.WindsMPH) %>%
  summarise(
    average_damage = mean(NDAM2014, na.rm = TRUE),
    n = n()
  )

21. Common Mistakes & Debugging Tidyverse Pipelines

Purpose: Quick reference for common tidyverse errors and how to fix them.

21.1 Missing group_by() Before summarise()

Symptom: summarise() returns a single row for the whole data frame instead of one row per group.

Fix:

df %>%
  group_by(category) %>%
  summarise(avg = mean(value))

21.2 Using = Instead of == in filter()

Correct:

filter(state == "FL")
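What the mistake looks like in practice, sketched on a made-up table: dplyr treats `state = "FL"` as a named argument and stops with an error, which tryCatch() captures here so the script keeps running.

```r
library(dplyr)

states_toy <- tibble(state = c("FL", "NC", "FL"))

# Wrong: a single = is not a comparison, so filter() raises an error
wrong <- tryCatch(
  filter(states_toy, state = "FL"),
  error = function(e) e
)
inherits(wrong, "error")

# Right: == compares each row's value
right <- filter(states_toy, state == "FL")
nrow(right)
```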

21.3 Filtering Multi‑State Strings Incorrectly

Wrong:

filter(AffectedStates == "FL")

Right:

filter(str_detect(AffectedStates, "FL"))

21.4 Missing na.rm = TRUE

Fix:

summarise(avg = mean(value, na.rm = TRUE))
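Why this matters, in a two-line sketch with a made-up vector: a single NA turns the whole mean into NA unless it is dropped.

```r
value_toy <- c(10, NA, 30)

with_na    <- mean(value_toy)                # NA: the missing value propagates
without_na <- mean(value_toy, na.rm = TRUE)  # 20: computed from the 2 known values

with_na
without_na
```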

21.5 Using summarise() When mutate() Was Intended

Rule:

mutate() adds columns
summarise() collapses rows

Example:

df %>%
  mutate(ratio = x / y)
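A side-by-side sketch of the rule on made-up data (`xy_toy` is hypothetical): mutate() keeps every row, summarise() returns one row.

```r
library(dplyr)

xy_toy <- tibble(x = c(1, 2, 3), y = c(2, 4, 6))

# mutate(): same 3 rows, one extra column
with_ratio <- xy_toy %>%
  mutate(ratio = x / y)

# summarise(): rows collapse into a single summary row
collapsed <- xy_toy %>%
  summarise(mean_ratio = mean(x / y))

dim(with_ratio)
dim(collapsed)
```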

21.6 Forgetting to Ungroup

Fix:

df %>%
  group_by(category) %>%
  summarise(n = n()) %>%
  ungroup()  # summarise() drops only the last grouping level; ungroup() clears any that remain
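Leftover grouping is easiest to see with two grouping columns. A sketch on made-up data, using group_vars() to check what is still grouped:

```r
library(dplyr)

sales_toy <- tibble(
  region   = c("east", "east", "west", "west"),
  category = c("a", "b", "a", "a"),
  value    = 1:4
)

# After summarise(), the result is still grouped by region
partly_grouped <- sales_toy %>%
  group_by(region, category) %>%
  summarise(n = n(), .groups = "drop_last")
group_vars(partly_grouped)

# ungroup() removes the remaining grouping
flat <- ungroup(partly_grouped)
group_vars(flat)
```

Passing .groups = "drop" to summarise() reaches the same ungrouped result in one step.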

21.7 object not found Errors Inside Pipes

Usually caused by:

typos
removed columns
referencing columns created later

Debug:

df %>% glimpse()

Or break the pipe:

step1 <- df %>% filter(year > 2000)
step2 <- step1 %>% mutate(rate = x / y)

21.8 Using = Instead of ~ in case_when()

Correct:

case_when(
  x == 1 ~ "one",
  TRUE ~ "other"
)

21.9 Debugging with print() or View()

df %>%
  filter(year > 2000) %>%
  { print(head(.)); . } %>%
  summarise(avg = mean(value))

21.10 Check Column Types Early

skimr::skim(df)

df %>% glimpse()

21.11 Pipeline Debug Checklist

Does every column exist?
Is == used (not =) inside filter()?
Is group_by() missing before summarise()?
Is na.rm = TRUE missing?
Are multi‑value strings filtered with str_detect()?
Was a column dropped earlier in the pipe?
Was summarise() used where mutate() was intended?

21.12 Safe Debug Wrapper

safe_pipe <- function(df) {
  df %>%
    filter(!is.na(value)) %>%
    mutate(rate = x / y) %>%
    group_by(category) %>%
    summarise(avg = mean(rate, na.rm = TRUE)) %>%
    ungroup()
}

23. Quick Reference

1. What is a tibble? A tibble is a modern version of a data frame that prints cleanly, avoids converting strings to factors, and behaves more…

2. What is the tidyverse? A collection of R packages for data science (e.g., dplyr, ggplot2, tidyr). They share consistent syntax and follow tidy data principles.

3. select(), filter(), arrange()

select(): chooses columns.

filter(): chooses rows based on conditions.

arrange(): sorts rows.

4. Pipe Operator (%>%): Passes output from the left into the function on the right, making code more readable.

5. summarise(), group_by(), mutate()

group_by(): creates groups in data.

summarise(): calculates summary statistics per group.

mutate(): creates or modifies variables.

Chapter 4️⃣

24. Data Importing & Tidy Data

Data Frames

A data frame is a rectangular dataset where rows represent observations and columns represent variables. Most earlier examples in R used data already in tidy format.

Importing Data

Common file formats:
- CSV: plain text, comma‑separated, 1 observation per row.
- Excel (.xlsx): spreadsheet format with metadata.

Useful packages: readr, readxl, dplyr, tidyr, ggplot2.

Definition of Tidy Data

According to the tidyverse principles:
1. Each variable is a column
2. Each observation is a row
3. Each value is a single cell

Tidy data is essential because it works directly with tools like dplyr, ggplot2, and tidyr.

Messy vs. Tidy Examples

Different formats can represent the same values, but tidy formats (e.g., table1 in examples) are easier to manipulate than messy formats like table2 or table3.

Tidying Tools

Key tidyr functions:
- pivot_longer(): wide → long
- pivot_wider(): long → wide

These reshape data without changing values, helping convert messy datasets into tidy structure.
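A round-trip sketch of both functions on a made-up wide table. `cases_wide` and its values are hypothetical; the year columns play the role of the "messy" layout.

```r
library(tidyr)
library(tibble)

# Wide: one column per year, so the variable "year" is hidden in the headers
cases_wide <- tibble(
  country = c("A", "B"),
  `1999`  = c(1, 2),
  `2000`  = c(3, 4)
)

# wide -> long: year becomes its own column, one row per country-year
cases_long <- pivot_longer(
  cases_wide,
  cols      = c(`1999`, `2000`),
  names_to  = "year",
  values_to = "cases"
)

# long -> wide: back to the original layout, values unchanged
cases_back <- pivot_wider(cases_long, names_from = year, values_from = cases)

cases_long
cases_back
```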

Melanie’s R Grab‑Bag: Useful Code & Tips for Future Me

Melanie Holden

February 11, 2026
