Purpose: A living notebook of short, reliable R snippets I actually use. Keep it short. Keep it runnable. Update as I learn.

Chapter 1️⃣

1. Quick Setup

# Install-once (uncomment as needed)
# install.packages(c("tidyverse", "janitor", "lubridate", "readr", "readxl", "openxlsx",
                    # "skimr", "here", "fs", "glue", "knitr", "rmarkdown", "ggthemes",
                    # "patchwork", "gt", "gtExtras", "stringr", "forcats"))

# Load every session
suppressPackageStartupMessages({
  library(tidyverse)
  library(janitor)
  library(lubridate)
  library(here)
  library(glue)
  library(skimr)
  library(gt)
})

# Notebook default: show chunks without running them when knitting
knitr::opts_chunk$set(eval = FALSE)

# Reproducibility
set.seed(42)

Project tip: Use an RStudio Project and here::here() for paths. Never hard‑code "C:/Users/...".

2. Reading & Writing Data (fast + safe)

# CSV (robust defaults)
df_csv <- readr::read_csv(here("data", "my_data.csv"))

# Excel (first sheet)
df_xlsx <- readxl::read_excel(here("data", "my_data.xlsx"), sheet = 1)

# Write outputs with a date stamp in the file name
out_path <- here("output", glue("cleaned_{format(Sys.Date(), '%Y%m%d')}.csv"))
# readr::write_csv(df_csv, out_path)

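Gotcha: read_csv() already assumes UTF-8, so garbled characters usually mean the file is in another encoding. Check it with readr::guess_encoding() and pass the result, e.g. locale = locale(encoding = "latin1").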
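
A minimal sketch of tracking down an encoding problem, assuming the file turns out to be Latin-1. The temporary file and its contents are made up for illustration, and guess_encoding() only returns candidates, so treat its answer as a hint.

```r
library(readr)

# Build a small Latin-1 file to stand in for a problem CSV (hypothetical data)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("id,name\n1,caf"), as.raw(0xE9), charToRaw("\n")), con)  # 0xE9 is e-acute in Latin-1
close(con)

# Ask readr for likely encodings (a guess, not a guarantee)
guess_encoding(tmp)

# Re-read with the encoding stated explicitly
df_latin1 <- read_csv(tmp, locale = locale(encoding = "latin1"), show_col_types = FALSE)
df_latin1$name
```

If the text still looks wrong, try the next candidate from guess_encoding() before editing the file by hand.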

3. Inspecting Data (what is this?)

# High‑level skim
skimr::skim(df_csv)

# Structure & types
str(df_csv)

# Column names and types (check before cleaning)
names(df_csv)
janitor::compare_df_cols(df_csv)

Rule of thumb: If a column should be a date, convert it immediately with lubridate.

4. Cleaning Columns & Rows

# Consistent names
clean <- df_csv %>%
  janitor::clean_names() %>%                 # snake_case column names
  mutate(across(where(is.character), trimws)) # trim leading/trailing spaces

# Remove complete duplicate rows
clean <- distinct(clean)

# Handle blanks as NA (character columns only; na_if() needs matching types)
clean <- mutate(clean, across(where(is.character), ~ na_if(.x, "")))

Tip: Use distinct(id, .keep_all = TRUE) to de‑dupe on a subset of columns while keeping the rest; the first row for each key is the one that survives.
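
A small sketch of the difference between full-row and key-based de-duplication. The visits table and its values are hypothetical.

```r
library(dplyr)

# Toy data: id 1 appears twice with different notes (hypothetical values)
visits <- tibble(
  id   = c(1, 1, 2),
  note = c("first", "second", "only")
)

# Full-row de-dupe keeps all three rows, because no two rows are identical
distinct(visits)

# De-dupe on id alone; .keep_all = TRUE keeps the other columns,
# taking the first row seen for each id
by_id <- distinct(visits, id, .keep_all = TRUE)
by_id
```

Because the first row wins, sort with arrange() beforehand if a particular row per key should be kept.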

5. dplyr Cheatsheet (minimal set)

result <- clean %>%
  filter(!is.na(id)) %>%
  mutate(
    date = lubridate::ymd(date),
    category = forcats::fct_lump_n(as.factor(category), n = 5)
  ) %>%
  group_by(category) %>%
  summarize(
    n = n(),
    mean_val = mean(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n))

Mnemonic: Select–Filter–Mutate–Summarize–Arrange covers 80% of wrangling.

6. Joins (I always mix these up)

# left_join: keep all rows from x, bring matches from y
joined <- df_csv %>% left_join(df_xlsx, by = "id")

# anti_join: rows in x with no match in y (great for QA)
missing_keys <- df_csv %>% anti_join(df_xlsx, by = "id")

QA trick: anti_join() first to see what won’t match before any heavy processing.

7. Dates & Times (lubridate)

# Parse and standardize
clean_dates <- clean %>%
  mutate(
    date = ymd(date),
    year = year(date),
    month = month(date, label = TRUE, abbr = TRUE),
    wk = isoweek(date)
  )

Tip: If parsing fails, inspect with parse_date_time(x, orders = c("ymd", "mdy", "dmy")).
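
A short sketch of that inspection step, assuming a column that mixes three layouts. The sample strings are invented; real data may need different orders.

```r
library(lubridate)

# Hypothetical column with three different date layouts
x <- c("2024-01-15", "01/15/2024", "15.01.2024")

# parse_date_time() accepts several candidate orders at once
parsed <- parse_date_time(x, orders = c("ymd", "mdy", "dmy"))

# Compare raw and parsed values side by side
data.frame(raw = x, parsed = as_date(parsed))

# Anything still NA needs a closer look
x[is.na(parsed)]
```

Ambiguous values such as "01/02/2024" fit more than one order, so check a few known rows by eye before trusting the result.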

8. Strings (stringr)

text_clean <- clean %>%
  mutate(
    email = str_to_lower(email),
    domain = str_extract(email, "(?<=@).+$")  # text after the @, without the @ itself
  )

Regex sanity: Test patterns at https://regex101.com/ before committing.
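
Another quick check, without leaving R, is to run a pattern on a few hand-made strings first. This sketch uses invented sample values and compares two patterns for pulling the domain out of an email address.

```r
library(stringr)

# Hypothetical sample values, including one that is not an email at all
emails <- c("Ana@Example.com", "bob@mail.example.org", "not-an-email")

lowered <- str_to_lower(emails)

# "@.+$" keeps the @ sign in the match
str_extract(lowered, "@.+$")

# A lookbehind, (?<=@), matches only the text after the @
domains <- str_extract(lowered, "(?<=@).+$")
domains
```

Strings with no match come back as NA, which makes bad rows easy to find with is.na().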

9. Factors (forcats)

fac <- clean %>%
  mutate(
    status = fct_relevel(as.factor(status), c("new", "active", "inactive")),
    top_cat = fct_lump_n(as.factor(category), n = 6)
  )

Plotting tip: Relevel factors to control ggplot ordering.
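
A minimal sketch of how level order changes, using made-up status values. ggplot2 draws discrete axes and legends in level order, so this is the order a plot would follow.

```r
library(forcats)

status <- factor(c("active", "new", "inactive", "active"))  # hypothetical values
levels(status)  # alphabetical by default

# Put the levels in lifecycle order
status_ordered <- fct_relevel(status, "new", "active", "inactive")
levels(status_ordered)

# Or order by frequency, most common first
levels(fct_infreq(status))
```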

Chapter 2️⃣

10. Plotting (ggplot2: minimal patterns)

# Bar (counts)
clean %>%
  ggplot(aes(x = category)) +
  geom_bar(fill = "#2E86AB") +
  theme_minimal(base_size = 12) +
  labs(title = "Counts by Category", x = NULL, y = "Count")

# Line (time series)
clean_dates %>%
  group_by(date) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(date, n)) +
  geom_line(color = "#7D3C98", linewidth = 0.9) +
  theme_minimal(base_size = 12) +
  labs(title = "Daily Counts", x = NULL, y = NULL)

Small multiples: Use + facet_wrap(~group) when categories are many.
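
A sketch of the small-multiples pattern on simulated data. The daily table, its group labels, and the counts are all invented for illustration.

```r
library(ggplot2)

# Toy data: daily counts for three hypothetical groups
set.seed(42)
daily <- data.frame(
  date  = rep(seq(as.Date("2024-01-01"), by = "day", length.out = 30), times = 3),
  group = rep(c("A", "B", "C"), each = 30),
  n     = rpois(90, lambda = 10)
)

# One panel per group instead of three overlapping lines
p <- ggplot(daily, aes(date, n)) +
  geom_line(linewidth = 0.9) +
  facet_wrap(~group, ncol = 1) +
  theme_minimal(base_size = 12) +
  labs(title = "Daily Counts by Group", x = NULL, y = NULL)
p
```

Stacking panels in one column keeps the date axis aligned, which makes timing easier to compare across groups.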

11. Tables (gt quick pattern)

result %>%
  gt::gt() %>%
  gt::fmt_number(columns = where(is.numeric), decimals = 2) %>%
  gt::tab_header(title = gt::md("**Summary by Category**"))

Export: pipe the table into gt::gtsave("table.png"); the file extension picks the format (.png, .pdf, .html). PNG and PDF output need the webshot2 package.
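
A minimal export sketch, assuming HTML output so no extra rendering package is needed. The summary table and the temp-folder path are stand-ins.

```r
library(gt)

# Small stand-in for the summary table (hypothetical numbers)
summary_tbl <- data.frame(
  category = c("a", "b"),
  n        = c(10L, 4L),
  mean_val = c(3.14159, 2.71828)
)

tbl <- gt(summary_tbl)
tbl <- fmt_number(tbl, columns = mean_val, decimals = 2)

# gtsave() takes the gt object first and picks the format from the extension
out_file <- file.path(tempdir(), "summary_table.html")
gtsave(tbl, filename = out_file)
file.exists(out_file)
```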

12. Modeling (tidymodels tiny starter)

# install.packages("tidymodels")  # once
# library(tidymodels)
# set.seed(42)
# split <- initial_split(clean, prop = 0.8)
# train <- training(split); test <- testing(split)
# rec <- recipe(target ~ ., data = train) %>% step_dummy(all_nominal_predictors())
# mod <- linear_reg() %>% set_engine("lm")
# wf  <- workflow() %>% add_model(mod) %>% add_recipe(rec)
# wf_fit <- fit(wf, data = train)  # don't reuse the function names fit/metrics for results
# perf <- predict(wf_fit, test) %>% bind_cols(test) %>% metrics(truth = target, estimate = .pred)

Reality check: Always baseline with a simple model (e.g., lm) before anything fancy.

13. Debugging & Safety Nets

  • Common errors: missing packages, wrong column names, bad joins, factor levels not set.
  • Tactics:
    • rlang::last_error() to see context
    • dplyr::glimpse() before/after key steps
    • stopifnot() for assumptions (e.g., unique keys)
    • Use tryCatch() around fragile I/O

stopifnot(anyDuplicated(clean$id) == 0)  # ids should be unique
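
The tryCatch() tactic could look like the sketch below. safe_read_csv() is a hypothetical helper, not part of any package, and the missing file path is made up to trigger the error branch.

```r
# Hypothetical helper: read a CSV, but return NULL with a message instead of stopping
safe_read_csv <- function(path) {
  tryCatch(
    readr::read_csv(path, show_col_types = FALSE),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

missing_file <- file.path(tempdir(), "does_not_exist.csv")
dat <- safe_read_csv(missing_file)

# Downstream code can branch on the result instead of crashing
if (is.null(dat)) message("Skipping this file")
```

Returning NULL is one design choice; re-throwing with a clearer message via stop() is the alternative when a missing file should halt the run.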

14. Reproducible Paths & Projects

  • Use RStudio Projects; root paths with here::here().
  • Keep folders: data/, R/, output/, figs/, docs/.
  • Save session info with outputs.

sessionInfo()
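
One way to save the session info next to the outputs is sketched below. The temp-folder path is an assumption; in a project it would more likely sit under output/.

```r
# Assumed location; swap in a path under output/ for real use
info_path <- file.path(tempdir(), "session_info.txt")

# capture.output() turns the printed session info into lines of text
writeLines(capture.output(sessionInfo()), info_path)

head(readLines(info_path), 3)
```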

15. Handy Snippets I Reuse

# Percent of total
percent_of_total <- function(x) round(100 * x / sum(x, na.rm = TRUE), 1)

# Not-in operator
`%nin%` <- function(x, y) !(x %in% y)

# Wrap a function so its output, messages, and warnings are captured, not printed
quietly <- purrr::quietly
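
A few usage examples for these helpers. The two definitions are repeated so the block runs on its own, and the input values are arbitrary.

```r
# Definitions repeated here so the example is self-contained
percent_of_total <- function(x) round(100 * x / sum(x, na.rm = TRUE), 1)
`%nin%` <- function(x, y) !(x %in% y)

percent_of_total(c(1, 1, 2))          # shares of 4: 25, 25, 50
c("a", "b", "z") %nin% c("a", "b")    # FALSE FALSE TRUE

# purrr::quietly() wraps a function; the wrapper returns a list with
# result, output, warnings, and messages
quiet_log <- purrr::quietly(log)
res <- quiet_log(-1)   # log(-1) normally warns "NaNs produced"
res$result
res$warnings
```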

16. Checklist Before You Ship

17. Appendix: swirl (learn by doing)

  • Install once: install.packages("swirl")
  • Each session: library(swirl); swirl()
  • Navigate with: skip(), play(), nxt(), main(), info(), bye()

18. Appendix: Keyboard Shortcuts (RStudio)

  • Run line/selection: Ctrl/Cmd + Enter
  • Run all chunks above: Ctrl + Shift + P (Windows/Linux) or Cmd + Option + P (macOS)
  • Insert chunk: Ctrl + Alt + I / Cmd + Option + I

19. To Do / Parking Lot

Chapter 3️⃣

20. Intro to the Tidyverse

Purpose: Notes on data wrangling with the tidyverse.

20.1 What Is the Tidyverse

The tidyverse is a collection of R packages designed to make data analysis easier. Used for:

  • filtering data

  • summarizing data

  • grouping and arranging data

  • creating new tables

  • exploring patterns

Functions covered:

  • select()

  • filter()

  • summarise() / summarize()

  • group_by()

  • arrange()

  • count()

20.2 Pipes (%>%)

Pipes connect multiple data transformations into a readable workflow.

data %>%
  step_one() %>%
  step_two() %>%
  step_three()

20.3 filter()

Filtering answers questions like:

  • When was NC affected by hurricanes?

  • How many storms happened per year?

  • What year had the strongest winds?

Key points:

  • == tests equality

  • character strings must be in quotes

  • filter() returns rows where the condition is TRUE
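
A small sketch of these points on a toy table. storms_toy and every value in it are invented; the column names only loosely mirror a hurricane dataset.

```r
library(dplyr)

# Toy stand-in for a storms table; names and values are made up
storms_toy <- tibble(
  Name           = c("Alpha", "Beta", "Gamma", "Delta"),
  Year           = c(1999, 2004, 2004, 2011),
  AffectedStates = c("NC", "FL", "NC", "TX"),
  WindsMPH       = c(105, 120, 90, 130)
)

# Rows where the state is exactly "NC" (string in quotes, == for comparison)
nc_storms <- storms_toy %>% filter(AffectedStates == "NC")
nc_storms$Year

# Conditions separated by commas are combined with AND
storms_toy %>% filter(Year == 2004, WindsMPH > 100)

# The year with the strongest winds
strongest_year <- storms_toy %>%
  filter(WindsMPH == max(WindsMPH)) %>%
  pull(Year)
strongest_year
```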

20.4 count()

count() is used for “How many…?” questions.

What it does:

  • groups data

  • counts rows

  • creates a new column n

Examples:

dataset %>%
  count(Year)

Sort from greatest to least:

dataset %>%
  count(Year) %>%
  arrange(desc(n))

Count hurricanes by name:

hurricNamed %>%
  count(Name)

Top names first:

hurricNamed %>%
  count(Name) %>%
  arrange(desc(n))

20.5 summarise()

Used for:

  • averages

  • minimums and maximums

  • sums

  • medians

  • counts

Example:

hurricNamed %>%
  summarise(
    average = mean(deaths),
    min = min(deaths),
    max = max(deaths)
  )

Atmospheric pressure:

hurricNamed %>%
  summarise(
    average = mean(LF.PressureMB, na.rm = TRUE),
    min = min(LF.PressureMB, na.rm = TRUE),
    max = max(LF.PressureMB, na.rm = TRUE)
  )

20.6 group_by()

group_by() defines categories for grouped summaries.

Rule: group_by() must come before summarise().

Example:

hurricNamed %>%
  group_by(AffectedStates) %>%
  summarise(
    average = mean(deaths),
    min = min(deaths),
    max = max(deaths)
  )

Damage cost by year:

hurricNamed %>%
  group_by(Year) %>%
  summarise(
    average = mean(BaseDam2014, na.rm = TRUE),
    min = min(BaseDam2014, na.rm = TRUE),
    max = max(BaseDam2014, na.rm = TRUE)
  )

20.7 Layering Functions (multi‑step pipelines)

Typical pattern:

  1. filter()

  2. group_by()

  3. summarise() or count()

  4. arrange()

How many hurricanes impacted FL each year:

hurricNamed %>%
  filter(AffectedStates == "FL") %>%
  group_by(Year) %>%
  summarise(n = n())

When AffectedStates lists several states, match with str_detect() instead:

hurricNamed %>%
  filter(str_detect(AffectedStates, "FL"))

Average 2014 damage cost and number of storms per wind speed:

hurricNamed %>%
  group_by(LF.WindsMPH) %>%
  summarise(
    average_damage = mean(NDAM2014, na.rm = TRUE),
    n = n()
  )

21. Common Mistakes & Debugging Tidyverse Pipelines

Purpose: Quick reference for common tidyverse errors and how to fix them.

21.1 Missing group_by() Before summarise()

Symptom: summarise() returns a single row for the whole table instead of one row per group.

Fix:

df %>%
  group_by(category) %>%
  summarise(avg = mean(value))

21.2 Using = Instead of == in filter()

Correct:

filter(state == "FL") 

21.3 Filtering Multi‑State Strings Incorrectly

Wrong:

filter(AffectedStates == "FL") 

Right:

filter(str_detect(AffectedStates, "FL")) 

21.4 Missing na.rm = TRUE

Fix:

summarise(avg = mean(value, na.rm = TRUE)) 

21.5 Using summarise() When mutate() Was Intended

Rule:

  • mutate() adds columns

  • summarise() collapses rows

Example:

df %>%
  mutate(ratio = x / y)
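
A side-by-side sketch of the rule, using a made-up three-row table. Comparing dim() before and after is a quick way to confirm which verb was actually needed.

```r
library(dplyr)

df_toy <- tibble(   # hypothetical values
  category = c("a", "a", "b"),
  x = c(10, 20, 30),
  y = c(2, 4, 5)
)

# mutate(): same number of rows, one extra column
with_ratio <- df_toy %>% mutate(ratio = x / y)
dim(with_ratio)

# summarise(): one row per group, only the grouping and summary columns remain
per_group <- df_toy %>%
  group_by(category) %>%
  summarise(mean_ratio = mean(x / y), .groups = "drop")
dim(per_group)
```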

21.6 Forgetting to Ungroup

Fix:

df %>%
  group_by(category) %>%
  summarise(n = n()) %>%
  ungroup()
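
After summarise() on a single grouping column the result is already ungrouped, so the trap shows up more often with a grouped mutate(). The sketch below uses invented values to show the silent difference.

```r
library(dplyr)

df_toy <- tibble(   # hypothetical values
  category = c("a", "a", "b"),
  value = c(1, 3, 10)
)

# Still grouped: the "overall" mean is silently computed within each category
grouped <- df_toy %>%
  group_by(category) %>%
  mutate(group_mean = mean(value)) %>%
  mutate(overall_mean = mean(value))
grouped$overall_mean

# Ungrouped first: the mean now uses every row
fixed <- df_toy %>%
  group_by(category) %>%
  mutate(group_mean = mean(value)) %>%
  ungroup() %>%
  mutate(overall_mean = mean(value))
fixed$overall_mean
```

dplyr::is_grouped_df() is a cheap check when it is unclear whether a table still carries groups.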

21.7 object not found Errors Inside Pipes

Usually caused by:

  • typos

  • removed columns

  • referencing columns created later

Debug:

df %>% glimpse() 

Or break the pipe:

step1 <- df %>% filter(year > 2000)
step2 <- step1 %>% mutate(rate = x / y)

21.8 Using = Instead of ~ in case_when()

Correct:

case_when(
  x == 1 ~ "one",
  TRUE ~ "other"
)

21.9 Debugging with print() or View()

df %>%
  filter(year > 2000) %>%
  { print(head(.)); . } %>%
  summarise(avg = mean(value))

21.10 Check Column Types Early

skimr::skim(df) 

or

df %>% glimpse() 

21.11 Pipeline Debug Checklist

  1. Does every column exist

  2. Using == not =

  3. Missing group_by()

  4. Missing na.rm = TRUE

  5. Filtering multi‑value strings incorrectly

  6. Column dropped earlier

  7. summarise() used instead of mutate()

21.12 Safe Debug Wrapper

safe_pipe <- function(df) {
  df %>%
    filter(!is.na(value)) %>%
    mutate(rate = x / y) %>%
    group_by(category) %>%
    summarise(avg = mean(rate, na.rm = TRUE)) %>%
    ungroup()
}

23. Quick Reference

  1. What is a tibble? A tibble is a modern version of a data frame that prints cleanly, avoids converting strings to factors, and behaves more predictably.
  2. What is the tidyverse? A collection of R packages for data science (e.g., dplyr, ggplot2, tidyr). They share consistent syntax and follow tidy data principles.
  3. select(), filter(), arrange()
    • select(): chooses columns.
    • filter(): chooses rows based on conditions.
    • arrange(): sorts rows.
  4. Pipe Operator (%>%): Passes output from the left into the function on the right, making code more readable.
  5. summarise(), group_by(), mutate()
    • group_by(): creates groups in data.
    • summarise(): calculates summary statistics per group.
    • mutate(): creates or modifies variables.

Chapter 4️⃣

24. Data Importing & Tidy Data

Data Frames

A data frame is a rectangular dataset where rows represent observations and columns represent variables. Most earlier examples in R used data already in tidy format.

Importing Data

Common file formats:

  • CSV: plain text, comma‑separated, 1 observation per row.
  • Excel (.xlsx): spreadsheet format with metadata.

Useful packages: readr, readxl, dplyr, tidyr, ggplot2.

Definition of Tidy Data

According to the tidyverse principles:

  1. Each variable is a column
  2. Each observation is a row
  3. Each value is a single cell

Tidy data is essential because it works directly with tools like dplyr, ggplot2, and tidyr.

Messy vs. Tidy Examples

Different formats can represent the same values, but tidy formats (e.g., table1 in examples) are easier to manipulate than messy formats like table2 or table3.

Tidying Tools

Key tidyr functions:

  • pivot_longer(): wide → long
  • pivot_wider(): long → wide

These reshape data without changing values, which helps convert messy datasets into a tidy structure.
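
A round-trip sketch of both functions. The wide table, its country labels, and the counts are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- tibble::tibble(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25)
)

# wide -> long: the year columns become a year/cases pair
long <- pivot_longer(wide, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
long

# long -> wide: back to one column per year
wide_again <- pivot_wider(long, names_from = year, values_from = cases)
wide_again
```

The long form is the tidy one here, since year is a variable and belongs in its own column.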