For this assignment, I worked with a time series dataset that has data recorded over time for more than one item.
The reason I needed this type of data is because window functions only make sense when time is involved. Each item needs to have values changing over days so that averages over time can be calculated.
Dataset Selection
I chose a dataset that includes: • a date column • an item name • a numeric value that changes over time
There are multiple items in the same dataset, so each item has its own time series. This is important because the calculations should be done separately for each item, not all together.
Why Window Functions Are Needed
Window functions let me calculate things like averages without collapsing the data.
Instead of grouping and losing rows, window functions: • keep every original row • look at nearby rows related to the current row • calculate values based on time order
This is especially useful for moving averages and year-to-date calculations.
Analytical Approach
For each item in the dataset, I calculated two things: 1. Year-to-date average This shows the average value for an item from the beginning of the year up to the current date. 2. Six-day moving average This shows the average of the current day and the previous five days. Both calculations are done within each item, so one item’s values do not affect another item.
How I approached the problem
First, I loaded and cleaned the data so the date and values were usable. Then, I sorted the data by item and date. This step is important because window functions depend on the correct time order. After that, I used dplyr window functions to calculate the year-to-date average and the six-day moving average while keeping all original rows.
Expected Outcome
The final dataset will include: • Original time series data • A YTD average column for each item • A six-day moving average column for each item
This output demonstrates how window functions can be used to analyze trends over time without aggregating away the original data.
Reproducibility
The analysis will be: • Fully reproducible • Implemented using R and dplyr • Structured so that rendering the Quarto file reproduces the same results
Code Base
Step 1: Load libraries
First, I loaded the libraries I needed. I used tidyverse for data manipulation and slider for moving averages.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Step 2: Create a time series dataset (reproducible)
For this assignment, I created a small time series dataset directly in R. This makes the analysis fully reproducible without relying on external files. The dataset contains: • a date column • an item name • a numeric value
Before using window functions, the data must be sorted by item and date.
data <- data %>%arrange(item, date)
Step 4: Calculate Year-to-Date (YTD) average
Next, I calculated the year-to-date average for each item. This shows the average value from the beginning of the year up to the current date, calculated separately for each item.
data <- data %>%group_by(item) %>%mutate(ytd_average =cummean(value) ) %>%ungroup()
Step 5: Calculate six-day moving average
After that, I calculated the six-day moving average. This average includes the current day and the previous five days. The calculation resets for each item.
data <- data %>%group_by(item) %>%mutate(six_day_moving_avg = ( value +lag(value, 1) +lag(value, 2) +lag(value, 3) +lag(value, 4) +lag(value, 5) ) /6 ) %>%ungroup()
Step 6: Final output
The final dataset includes: • the original data • the year-to-date average • the six-day moving average
data
# A tibble: 60 × 5
date item value ytd_average six_day_moving_avg
<date> <chr> <dbl> <dbl> <dbl>
1 2023-01-01 Item_A 12.9 12.9 NA
2 2023-01-02 Item_A 17.9 15.4 NA
3 2023-01-03 Item_A 14.1 15.0 NA
4 2023-01-04 Item_A 18.8 15.9 NA
5 2023-01-05 Item_A 19.4 16.6 NA
6 2023-01-06 Item_A 10.5 15.6 15.6
7 2023-01-07 Item_A 15.3 15.6 16
8 2023-01-08 Item_A 18.9 16.0 16.2
9 2023-01-09 Item_A 15.5 15.9 16.4
10 2023-01-10 Item_A 14.6 15.8 15.7
# ℹ 50 more rows
Step 7:View results for one item
This step makes it easier to see how the window functions work for a single item.
data %>%filter(item =="Item_A")
# A tibble: 30 × 5
date item value ytd_average six_day_moving_avg
<date> <chr> <dbl> <dbl> <dbl>
1 2023-01-01 Item_A 12.9 12.9 NA
2 2023-01-02 Item_A 17.9 15.4 NA
3 2023-01-03 Item_A 14.1 15.0 NA
4 2023-01-04 Item_A 18.8 15.9 NA
5 2023-01-05 Item_A 19.4 16.6 NA
6 2023-01-06 Item_A 10.5 15.6 15.6
7 2023-01-07 Item_A 15.3 15.6 16
8 2023-01-08 Item_A 18.9 16.0 16.2
9 2023-01-09 Item_A 15.5 15.9 16.4
10 2023-01-10 Item_A 14.6 15.8 15.7
# ℹ 20 more rows