Purpose: A living notebook of short, reliable R snippets I actually use. Keep it short. Keep it runnable. Update as I learn.
# Install-once (uncomment as needed)
# install.packages(c("tidyverse", "janitor", "lubridate", "readr", "readxl", "openxlsx",
# "skimr", "here", "fs", "glue", "knitr", "rmarkdown", "ggthemes",
# "patchwork", "gt", "gtExtras", "stringr", "forcats"))
# Load every session
suppressPackageStartupMessages({
library(tidyverse)
library(janitor)
library(lubridate)
library(here)
library(glue)
library(skimr)
library(gt)
knitr::opts_chunk$set(eval = FALSE)
})
# Reproducibility
set.seed(42)Project tip: Use an RStudio Project and
here::here() for paths. Never hard‑code
"C:/Users/...".
# CSV (robust defaults)
df_csv <- readr::read_csv(here("data", "my_data.csv"))
# Excel (first sheet)
df_xlsx <- readxl::read_excel(here("data", "my_data.xlsx"), sheet = 1)
# Write outputs with timestamps
out_path <- here("output", glue("cleaned_{format(Sys.Date(), '%Y%m%d')}.csv"))
# readr::write_csv(df_csv, out_path)Gotcha: If you see encoding issues, try
locale = locale(encoding = "UTF-8") in
read_csv().
# High‑level skim
skimr::skim(df_csv)
# Structure & types
str(df_csv)
# Column names (clean + check)
names(df_csv)
janitor::compare_df_cols(df_csv)Rule of thumb: If a column should be a date, convert
it immediately with lubridate.
# Consistent names
clean <- df_csv %>%
janitor::clean_names() %>% # snake_case column names
mutate(across(where(is.character), trimws)) # trim leading/trailing spaces
# Remove complete duplicate rows
clean <- distinct(clean)
# Handle blanks as NA
clean <- mutate(clean, across(everything(), ~na_if(.x, "")))Tip: Use distinct(.keep_all = TRUE) to
de‑dupe by subset of columns.
result <- clean %>%
filter(!is.na(id)) %>%
mutate(
date = lubridate::ymd(date),
category = forcats::fct_lump_n(as.factor(category), n = 5)
) %>%
group_by(category) %>%
summarize(
n = n(),
mean_val = mean(value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(n))Mnemonic: Select–Filter–Mutate–Summarize–Arrange covers 80% of wrangling.
# left_join: keep all rows from x, bring matches from y
joined <- df_csv %>% left_join(df_xlsx, by = "id")
# anti_join: rows in x with no match in y (great for QA)
missing_keys <- df_csv %>% anti_join(df_xlsx, by = "id")QA trick: anti_join() first to see what
won’t match before any heavy processing.
# Parse and standardize
clean_dates <- clean %>%
mutate(
date = ymd(date),
year = year(date),
month = month(date, label = TRUE, abbr = TRUE),
wk = isoweek(date)
)Tip: If parsing fails, inspect with
parse_date_time(x, orders = c("ymd", "mdy", "dmy")).
Regex sanity: Test patterns at https://regex101.com/ before committing.
fac <- clean %>%
mutate(
status = fct_relevel(as.factor(status), c("new", "active", "inactive")),
top_cat = fct_lump_n(as.factor(category), n = 6)
)Plotting tip: Relevel factors to control ggplot ordering.
# Bar (counts)
clean %>%
ggplot(aes(x = category)) +
geom_bar(fill = "#2E86AB") +
theme_minimal(base_size = 12) +
labs(title = "Counts by Category", x = NULL, y = "Count")
# Line (time series)
clean_dates %>%
group_by(date) %>% summarize(n = n(), .groups = "drop") %>%
ggplot(aes(date, n)) +
geom_line(color = "#7D3C98", linewidth = 0.9) +
theme_minimal(base_size = 12) +
labs(title = "Daily Counts", x = NULL, y = NULL)Small multiples: Use
+ facet_wrap(~group) when categories are many.
result %>%
gt::gt() %>%
gt::fmt_number(columns = where(is.numeric), decimals = 2) %>%
gt::tab_header(title = md("**Summary by Category**"))Export: gtsave("table.png") or
gt::gtsave() to PNG/PDF/HTML.
# install.packages("tidymodels") # once
# library(tidymodels)
# set.seed(42)
# split <- initial_split(clean, prop = 0.8)
# train <- training(split); test <- testing(split)
# rec <- recipe(target ~ ., data = train) %>% step_dummy(all_nominal(), -all_outcomes())
# mod <- linear_reg() %>% set_engine("lm")
# wf <- workflow() %>% add_model(mod) %>% add_recipe(rec)
# fit <- fit(wf, data = train)
# metrics <- predict(fit, test) %>% bind_cols(test) %>% metrics(truth = target, estimate = .pred)Reality check: Always baseline with a simple model
(e.g., lm) before anything fancy.
rlang::last_error() to see contextdplyr::glimpse() before/after key stepsstopifnot() for assumptions (e.g., unique keys)tryCatch() around fragile I/Ohere::here().data/, R/,
output/, figs/, docs/.install.packages("swirl")library(swirl); swirl()skip(), play() →
nxt(), main(), info(),
bye()Purpose: For data wrangling using the tidyverse.
The tidyverse is a collection of R packages designed to make data analysis easier. Used for:
filtering data
summarizing data
grouping and arranging data
creating new tables
exploring patterns
Functions covered:
select()
filter()
summarise() / summarize()
group_by()
arrange()
count()
Pipes connect multiple data transformations into a readable workflow.
Filtering answers questions like:
When was NC affected by hurricanes
How many storms happened per year
What year had the strongest winds
Key points:
== tests equality
character strings must be in quotes
filter() returns rows where the condition is
TRUE
count() is used for “How many…?” questions.
What it does:
groups data
counts rows
creates a new column n
Examples:
Sort from greatest to least:
Count hurricanes by name:
Top names first:
Used for:
averages
minimums and maximums
sums
medians
counts
Example:
Atmospheric pressure:
group_by() defines categories for grouped summaries.
Rule: group_by() must come before
summarise().
Example:
hurricNamed %>% group_by(AffectedStates) %>% summarise( average = mean(deaths), min = min(deaths), max = max(deaths) ) Damage cost by year:
Typical pattern:
filter()
group_by()
summarise() or count()
arrange()
How many hurricanes impacted FL each year:
For multi‑state strings:
Average 2014 damage cost and number of storms per wind speed:
Purpose: Quick reference for common tidyverse errors and how to fix them.
Symptom: summarise() returns one row.
Fix:
Wrong:
Right:
Rule:
mutate() adds columns
summarise() collapses rows
Example:
Usually caused by:
typos
removed columns
referencing columns created later
Debug:
Or break the pipe:
Does every column exist
Using == not =
Missing group_by()
Missing na.rm = TRUE
Filtering multi‑value strings incorrectly
Column dropped earlier
summarise() used instead of mutate()
- What is a tibble? A tibble is a modern version of a data frame that prints cleanly, avoids converting strings to factors, and behaves more2. What is the tidyverse? A collection of R packages for data science (e.g., dplyr, ggplot2, tidyr). They share consistent syntax and follow tidy da3. select(), filter(), arrange()
- select(): chooses columns.
- filter(): chooses rows based on conditions.
- arrange(): sorts rows.
- Pipe Operator (%>%): Passes output from the left into the function on the right, making code more readable.
- summarise(), group_by(), mutate()
- group_by(): creates groups in data.
- summarise(): calculates summary statistics per group.
- mutate(): creates or modifies variables.
A data frame is a rectangular dataset where rows represent observations and columns represent variables. Most earlier examples in R used data already in tidy format.
Common file formats: - CSV: plain text, comma‑separated, 1 observation per row. - Excel (.xlsx): spreadsheet format with metadata.
Useful packages: readr, readxl,
dplyr, tidyr, ggplot2.
According to the tidyverse principles: 1. Each variable is a
column
2. Each observation is a row
3. Each value is a single cell
Tidy data is essential because it works directly with tools like
dplyr, ggplot2, and tidyr.
Different formats can represent the same values, but tidy formats (e.g., table1 in examples) are easier to manipulate than messy formats like table2 or table3.
Key tidyr functions: - pivot_longer(): wide → long -
pivot_wider(): long → wide
These reshape data without changing values, helping convert messy datasets into tidy structure.