Lab 05: Data Engineering & Lifecycle

Author

Jeff Ronay

Published

September 24, 2025

🧠 Objective

Turn raw transactional data into a reproducible, versioned pipeline by applying data lifecycle principles.

📥 Load Data

library(readr)
transactions <- read_csv("data/pipeline_input.csv")
Rows: 8 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): transaction_id, user_id, currency, status, region
dbl  (1): amount
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
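
The message above suggests specifying the column types. A minimal sketch of that, assuming the seven columns readr reported: the temporary CSV, its column order, and its two rows are invented here so the example runs on its own, and `demo_path`, `transaction_spec`, and `demo_transactions` are hypothetical names.

```r
library(readr)

# Hypothetical stand-in for data/pipeline_input.csv so the sketch is self-contained;
# the rows and the column order below are invented for illustration only.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "transaction_id,user_id,currency,amount,status,region,date",
    "T001,U01,USD,120.50,Complete,East,2025-01-05",
    "T002,U02,USD,,Pending,West,2025-01-06"
  ),
  demo_path
)

# Column names and types mirror the specification readr printed above.
transaction_spec <- cols(
  transaction_id = col_character(),
  user_id        = col_character(),
  currency       = col_character(),
  amount         = col_double(),
  status         = col_character(),
  region         = col_character(),
  date           = col_date(format = "%Y-%m-%d")
)

# With col_types supplied, readr no longer guesses types or prints the message.
demo_transactions <- read_csv(demo_path, col_types = transaction_spec)
```

Pinning the types this way means a later change in the raw file (for example, text appearing in `amount`) produces parsing problems instead of a silently different column type.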

🧼 Stage 1: Cleaning

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
cleaned <- transactions %>%
  filter(!is.na(amount), status == "Complete")
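
One way to make the cleaning stage auditable is to record how many rows it removes. The sketch below applies the same filter to a small invented tibble (the rows and the names `demo`, `demo_cleaned`, and `cleaning_audit` are hypothetical). Note that `filter()` also drops rows where the condition evaluates to `NA`, so a missing `status` is removed along with non-"Complete" values.

```r
library(dplyr)

# Hypothetical rows, invented for illustration; only the column names come from the lab.
demo <- tibble(
  transaction_id = c("T001", "T002", "T003", "T004"),
  amount         = c(120.5, NA, 75, 40),
  status         = c("Complete", "Complete", "Pending", NA)
)

# Same conditions as the cleaning stage above.
demo_cleaned <- demo %>%
  filter(!is.na(amount), status == "Complete")

# Record row counts before and after so the drop is visible in the rendered report.
cleaning_audit <- tibble(
  rows_in      = nrow(demo),
  rows_out     = nrow(demo_cleaned),
  rows_dropped = nrow(demo) - nrow(demo_cleaned)
)
```

Here only `T001` survives: `T002` has a missing amount, `T003` is pending, and `T004` has a missing status.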

📊 Stage 2: Aggregation

regional_totals <- cleaned %>%
  group_by(region) %>%
  summarise(total = sum(amount))

Stage 3: Lifecycle Reflection

Describe each stage: input → transformation → output

Versioning options using Quarto or targets
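
As a rough illustration of the targets option, the lab's stages could be declared as a pipeline in which each target is one input, transformation, or output. This is a sketch under assumptions, not the lab's prescribed solution: it presumes the targets package is installed and that the file is saved as `_targets.R` at the project root, and the target names simply reuse the object names from the stages above.

```r
# _targets.R (assumed location: project root)
library(targets)

# Packages made available to every target when the pipeline runs.
tar_option_set(packages = c("readr", "dplyr"))

list(
  # Input: track the raw file itself so edits to it invalidate downstream targets.
  tar_target(raw_file, "data/pipeline_input.csv", format = "file"),
  tar_target(transactions, read_csv(raw_file, show_col_types = FALSE)),
  # Transformation: the same filter used in Stage 1.
  tar_target(
    cleaned,
    filter(transactions, !is.na(amount), status == "Complete")
  ),
  # Output: one total per region, as in Stage 2.
  tar_target(
    regional_totals,
    summarise(group_by(cleaned, region), total = sum(amount))
  )
)
```

With this file in place, `targets::tar_make()` builds the pipeline and skips targets whose code and upstream data are unchanged, `targets::tar_outdated()` lists what would be rebuilt, and `targets::tar_read(regional_totals)` retrieves the stored result.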

📈 Optional Visualization

library(ggplot2)
ggplot(regional_totals, aes(x = region, y = total)) +
  geom_col(fill = "darkorange") +
  theme_minimal()