Data Wrangling with R

Author

Allen Cosme

Published

March 16, 2026

Note

Note: this codebook follows the structure and content of the published codebook M05-1-Data wrangling with Tidyverse in R.

1 Overview

1.1 Learning outcomes

By the end of this module, you should be able to:

Describe the concept of Data Wrangling.
Describe how tibbles differ from data frames.
Explain how to convert wide or long data to “tidy” data.
Explain how to merge relational data sets using join functions.
Explain how to use grouped mutates and filter together.
Be familiar with major dplyr functions for transforming data.
Create a new variable with mutate() and case_when().
Use the pipe operator to shape data in preparation for analysis and visualization.

1.2 Textbook chapters covered

Ch3: Data Transformation
Ch5: Data Tidying
Ch13: Numbers

2 Introduction to data wrangling

2.1 Loading packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
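
The conflict messages above mean that, once the tidyverse is loaded, a bare filter() or lag() call resolves to the dplyr version rather than the stats version. A minimal sketch of calling each one explicitly with the package::function prefix follows; the toy tibble is a made-up example and is not part of the course data.

```r
library(tidyverse)

# Hypothetical toy data, for illustration only
toy <- tibble(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy, x > 3)

# stats::filter() applies a linear filter to a series;
# here the weights c(0.5, 0.5) give a 2-point moving average
smoothed <- stats::filter(toy$x, rep(1 / 2, 2))

kept
smoothed
```

Writing the prefix is optional for the dplyr functions, but it removes any doubt about which function a script is calling.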

3 What is the tidyverse

The tidyverse is a collection of R packages that share a common design philosophy and are designed to work together seamlessly. The tidyverse includes packages for data manipulation, visualization, and modeling, among other tasks.

Some of the core packages in the tidyverse include:

ggplot2: for data visualization
dplyr: for data manipulation
tidyr: for data tidying
readr: for data import
purrr: for functional programming
tibble: for tibbles, a modern take on data frames
stringr: for string manipulation
forcats: for working with categorical variables
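
As a rough illustration of how these packages fit together, the sketch below passes one small tibble through stringr, forcats, and dplyr functions in a single pipeline. The scores data and its column names are hypothetical and chosen only for this example.

```r
library(tidyverse)

# Hypothetical toy data, not from the course materials
scores <- tibble(
  name  = c("ana", "ben", "cy"),
  group = c("a", "b", "a"),
  score = c(90, 75, 82)
)

# Clean the columns: stringr for text, forcats for categories
cleaned <- scores |>
  mutate(
    name  = str_to_title(name),   # stringr: capitalise each name
    group = as_factor(group)      # forcats: convert to a factor
  )

# Summarise with dplyr: mean score per group
summary_tbl <- cleaned |>
  group_by(group) |>
  summarise(mean_score = mean(score))

summary_tbl
```

Each function takes a data frame or vector as its first argument and returns the same kind of object, which is what lets the steps chain with the pipe.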

3.1 Initial data prep and exploration

#install.packages("nycflights13")
library(nycflights13)
# List the data sets that ship with nycflights13
data(package = "nycflights13")

4 Data wrangling with dplyr

Note

On macOS, use Shift + Option + I to activate multiple cursors.

Data sets in package ‘nycflights13’:

airlines: Airline names
airports: Airport metadata
flights: Flights data
planes: Plane metadata
weather: Hourly weather data

# Copy flights into the global environment so it appears in the workspace
flights <- flights
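
Before wrangling, it helps to take a quick look at the size and column types of the data. The sketch below is one possible first look, assuming nycflights13 is installed; flight_dims is a hypothetical name introduced only for this example.

```r
library(tidyverse)
library(nycflights13)

# Number of rows and columns in the flights tibble
flight_dims <- dim(flights)
flight_dims

# One line per column: name, type, and the first few values
glimpse(flights)
```

glimpse() is often easier to read than print() for wide tables, because the columns run down the page instead of across it.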

flights |>
  # Count the flights in each month of 2013
  count(year, month) |>
  # Order the months by flight count so the bars appear sorted
  mutate(month = fct_reorder(as_factor(month), n)) |>
  ggplot(aes(month, n)) +
  geom_col(fill = "#FF007F", show.legend = FALSE) +
  coord_flip() +
  labs(
    x = "Month",
    y = "Number of flights",
    title = "Number of flights by month from New York City airports in 2013"
  )
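
The count() step in the pipeline above can be read as shorthand for grouping and then summarising. A small sketch comparing the two forms, assuming nycflights13 is installed; by_count and by_summarise are hypothetical names used only here.

```r
library(tidyverse)
library(nycflights13)

# Short form: one row per year-month combination, with the count in n
by_count <- flights |>
  count(year, month)

# Long form: group, count the rows in each group, then drop the grouping
by_summarise <- flights |>
  group_by(year, month) |>
  summarise(n = n(), .groups = "drop")

# Check whether the two results match
all.equal(by_count, by_summarise)
```

The long form becomes useful when more than one summary per group is needed, for example a count alongside a mean delay.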