In this assignment, I am using the data set of Coco-Cola (KO) and PepsiCo(PEP) price trends since January 1, 2022. I find this data set interesting as they are both each others main competitors. I will be using the dpylr and zoo R packages to better understand how to use window function for time series for 2 or more separate items , which is a variation of an aggregation function to THE 2 types of window function that will be required for this analysis will be a rolling window for the 6 day moving average and an expanding window for the Year to Data (YTD) average.
Data Preparation and Sorting: First, i will sort the data by Ticker and Date which ensure calculations will follow the correct chronological order so that there is no blunder to accurate time-series analysis
Rolling Window –> 6 Days Moving Average: I will group by Ticket only, use rollmean() from zoo package, k=6 for window size, align = “right”, and setting any incomplete parts with NA (fill = NA) using values from date and the first 5 days is used for the average.
Expanding Window –> YTD Average:
Group by Ticker and Year, use cummean () - cumulative mean in dyplr
package to calculate YTD average.
Source: (1) Gemini Pro (2) R for Data Science 2e (3)https://dplyr.tidyverse.org/articles/window-functions.html (4)https://www.r-bloggers.com/2017/07/tidy-time-series-analysis-part-2-rolling-functions/
library (tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
I am interested in the two time series values of Coca Cola vs Pepsi stocks and want to see how they compare to each other price wise overtime in terms of 6 days and YTD overall. I had Gemini Pro LLM generate the data set representing the daily closing prices for Coca-Cola and PepsiCo starting January 2022
#Dataset genereate by LLM for Coco-Cola and PepsiCo
stock_data <- data.frame(
Date = as.Date(c(
# Coca-Cola (KO) - Jan 2022
"2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06",
"2022-01-07", "2022-01-10", "2022-01-11",
# PepsiCo (PEP) - Jan 2022
"2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06",
"2022-01-07", "2022-01-10", "2022-01-11"
)),
Ticker = c(rep("KO", 7), rep("PEP", 7)),
Price = c(
59.0, 59.5, 60.0, 59.8, 60.2, 60.5, 60.1, # KO Prices
173.0, 174.5, 175.0, 174.0, 175.5, 176.0, 175.2 # PEP Prices
)
)
print(stock_data)
## Date Ticker Price
## 1 2022-01-03 KO 59.0
## 2 2022-01-04 KO 59.5
## 3 2022-01-05 KO 60.0
## 4 2022-01-06 KO 59.8
## 5 2022-01-07 KO 60.2
## 6 2022-01-10 KO 60.5
## 7 2022-01-11 KO 60.1
## 8 2022-01-03 PEP 173.0
## 9 2022-01-04 PEP 174.5
## 10 2022-01-05 PEP 175.0
## 11 2022-01-06 PEP 174.0
## 12 2022-01-07 PEP 175.5
## 13 2022-01-10 PEP 176.0
## 14 2022-01-11 PEP 175.2
Using information found on the dplyr.tidyverse and R-blogger website, I combine that with other previously used function to manipulate the data set table.
#————————————————————- # 1. Calculate 6 Day Moving Average
calculated_data <- stock_data |>
group_by(Ticker) |> #group only by ticker so it slides
mutate(Moving_Avg_6D = rollmean(Price, k = 6, fill= NA, align = "right")) # window function (rollmean() for looking at the current 5 row preceding metrics)
print(calculated_data)
## # A tibble: 14 × 4
## # Groups: Ticker [2]
## Date Ticker Price Moving_Avg_6D
## <date> <chr> <dbl> <dbl>
## 1 2022-01-03 KO 59 NA
## 2 2022-01-04 KO 59.5 NA
## 3 2022-01-05 KO 60 NA
## 4 2022-01-06 KO 59.8 NA
## 5 2022-01-07 KO 60.2 NA
## 6 2022-01-10 KO 60.5 59.8
## 7 2022-01-11 KO 60.1 60.0
## 8 2022-01-03 PEP 173 NA
## 9 2022-01-04 PEP 174. NA
## 10 2022-01-05 PEP 175 NA
## 11 2022-01-06 PEP 174 NA
## 12 2022-01-07 PEP 176. NA
## 13 2022-01-10 PEP 176 175.
## 14 2022-01-11 PEP 175. 175.
#—————————————————————– # 2. Year to Date Average
calculated_YTD_data <- stock_data |>
mutate(Year = year(Date)) |> # Create a Year column
group_by(Ticker, Year) |> #Group by Ticker and Year to annual reset
mutate(YTD_avg = cummean(Price)) # Window Function (cummean()) to gather average from Jan 1st
print(calculated_YTD_data) #view completed result
## # A tibble: 14 × 5
## # Groups: Ticker, Year [2]
## Date Ticker Price Year YTD_avg
## <date> <chr> <dbl> <dbl> <dbl>
## 1 2022-01-03 KO 59 2022 59
## 2 2022-01-04 KO 59.5 2022 59.2
## 3 2022-01-05 KO 60 2022 59.5
## 4 2022-01-06 KO 59.8 2022 59.6
## 5 2022-01-07 KO 60.2 2022 59.7
## 6 2022-01-10 KO 60.5 2022 59.8
## 7 2022-01-11 KO 60.1 2022 59.9
## 8 2022-01-03 PEP 173 2022 173
## 9 2022-01-04 PEP 174. 2022 174.
## 10 2022-01-05 PEP 175 2022 174.
## 11 2022-01-06 PEP 174 2022 174.
## 12 2022-01-07 PEP 176. 2022 174.
## 13 2022-01-10 PEP 176 2022 175.
## 14 2022-01-11 PEP 175. 2022 175.
This assignment I used the LLM generated KO and PEP stock result to demonstrate how to apply window functions to analyze time-series financial data. I was able to track the daily closing price of both stock since January 1,2022 and run both rolling and expanding window calculation with dplyr and zoo package in R. This analysis relied on the window functions rollmean() and cummean() to satisfied this and allows me to observe the both the short-term trend and expected price of the stocks for the year which helps with seeing if the stocks are overperforming/underperforming for that year.
Challenge: In-completed Data: The first few days of the data was not calculable with the 6 day moving average due to it not being the full 6 day. Hence, filling that space with fill=NA in the window function prevents R from returning partial averages and focus on calculating what we need. Another issue I found while trying data sort/organize, was trying to use solely just the year but with lubricate inside the tidyverse package, I was able to do that and add it to group_by() to correct the calculation. Finally, a unexpected error also occurred with naming a column, initially “6_Day_Moving_Avg” but R syntax rules made it so that variable names cannot start with a number so I adjusted to a Moving_Avg_6D name column.