WK Milk Assignment #1

Author

William Kegg

Section 1

My personal time series for this course is the Milk Price Dataset. It shows the variation of milk (both organic & inorganic) prices per gallon over the course of a few decades in the United States. This Dataset was provided by the St. Louis FED and publicaly published on their website. Here is the link to their website: (https://fred.stlouisfed.org/series/APU0000709112).

I am expecting the Milk Price dataset to follow demand increases, to change based on economic stability (and the time of the year) and to increase gradually overall as time progresses due to inflation. Those previously listed factors will cause variation in the Milk Price Variable. I think this will be a hard variable to forecast due to the implications from politics and the strength of the economy. A lot of factors influence milk.

Section 2

library("vctrs")
library("gapminder")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::data_frame() masks tibble::data_frame(), vctrs::data_frame()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("zoo")

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library("ggplot2")
library("knitr")
library("kableExtra")

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
library("seasonal")

Attaching package: 'seasonal'

The following object is masked from 'package:tibble':

    view
library("forecast")
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
milk <- read.csv("milk_price.csv")
milk$date <- as.Date(milk$date)
milk$year <- format(milk$date, "%Y")

attach(milk)
### Line Graph 1
ggplot(milk, aes(x = date)) +
  geom_line(aes(y = milk_price), color = 'blue', linetype = 'solid', size = 1, alpha = 0.8) +
  labs(title = 'Time Series',
       x = 'Date',
       y = 'Milk Price') +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

The visualization above is a Line Graph that shows the progression of milk prices over the decades. This graph shows that milk prices are increasing (despite the fluctuations).

### Boxplot
ggplot(milk, aes(x = factor(1), y = milk_price)) +
  geom_boxplot() +
  labs(title = 'Boxplot of Milk Prices', x = '', y = 'Milk Price') +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

The visualization above is a Boxplot that shows the distribution of milk prices within the milk dataset. It shows the Median, Interquartile Range, Whiskers (Maximum and Minimum values) and Outliers. It is shown that there is a Median of around 3.2, no major Outliers, a Maximum value of around 4.2 and a Minimum value of around 2.5.

### Histogram
hist(milk_price, breaks = 10, col = "skyblue", border = "black", main = "Histogram of Milk Prices", xlab = "Price", ylab = "Frequency")

The visualization above is a Histogram that shows the frequency of each milk price. This graph shows that there is a huge concentration of Milk Prices between the $2.6 and $3.6.

### Density Plot
ggplot(milk, aes(x = milk_price)) +
  geom_density(fill = "skyblue", color = "blue", alpha = 0.7) +
  labs(title = 'Density Plot of Milk Prices', x = 'Milk Prices') +
  theme_minimal()

The visualization above is a Density Plot that shows where the milk data is more concentrated, and the skewness/symmetry of it as well. This graph shows that there is a huge concentration of Milk Prices between the $2.6 and $3.6. It also shows that the graph is skewed to the right.

### Milk Table


#### Summary Table

##### Milk Price
milk_price_summary <- summary(milk$milk_price)

##### Function to Calculate Mode
calculate_mode <- function(x) {
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}

##### Create the Table
summary_table <- data.frame(
  Mean = mean(milk$milk_price),
  Median = median(milk$milk_price),
  Mode = calculate_mode(milk$milk_price),
  Standard_Deviation = sd(milk$milk_price),
  Range = diff(range(milk$milk_price)),
  Minimum = min(milk$milk_price),
  Maximum = max(milk$milk_price)
)

##### Print the nicely formatted summary table with a title
kable(summary_table, caption = "Milk Price Summary") %>%
  kable_styling(full_width = FALSE)
Milk Price Summary
Mean Median Mode Standard_Deviation Range Minimum Maximum
3.200443 3.178 2.666 0.4241547 1.759 2.459 4.218

After a complete initial analysis of the milk data, there are not any clear outliers. Everything is in the proper range of values and makes sense with the ebbs & flows of the milk prices currently. The only somewhat questionable increase in milk prices occurred in 2005, but I wouldn’t exactly classify it as a clear outlier. I have also noticed milk prices are overall increasing as the years go on, despite the fluctuations in price. Milk prices are the highest they have ever been and I would predict them to continue to increase overall as time goes on.

Section 3

### Moving Average and Remainder Visualizations

#### Date Conversion
milk$date <- as.Date(milk$date)

#### Sorting of Data
milk <- milk[order(milk$date), ]

#### Moving Average
window_size <- 7
milk$moving_average <- rollmean(milk$milk_price, k = window_size, fill = NA)

#### Visualization of Moving Average
ggplot(milk, aes(x = date)) +
  geom_line(aes(y = milk_price), color = 'blue', linetype = 'solid', size = 1, alpha = 0.8) +
  geom_line(aes(y = moving_average), color = 'green', linetype = 'solid', size = 1) +
  labs(title = 'Time Series and Moving Average Visualization',
       x = 'Date',
       y = 'Milk Price') + theme_minimal()
Warning: Removed 6 rows containing missing values (`geom_line()`).

#### Remainder
milk$remainder <- milk$milk_price - milk$moving_average

#### Visualize of Remainder
ggplot(milk, aes(x = date, y = remainder)) +
  geom_line(color = 'red', linetype = 'solid', size = 1) +
  labs(title = 'Remainder Series Visualization',
       x = 'Date',
       y = 'Remainder') + theme_minimal()
Warning: Removed 6 rows containing missing values (`geom_line()`).

After comparing the remainder to the moving average visualizations, I have noticed that there is a clear is a larger than normal difference between the original time and the moving average around 2005 that wasn’t as clear in the moving average visualization.

### Determining Seasonality Using Time Series Decomposition
ts_data <- ts(milk$milk_price, frequency = 12)
decomposition <- decompose(ts_data)
autoplot(decomposition)

Since the pattern is prominent within the seasonal section across time, it is evident that there is strong seasonality. This does not match my expectations because I did not expect there to be seasonality with milk sales. In retrospect, it makes sense however. Certain periods of time have heavier usages of milk than others.

Section 4

### Naive Forecast

#### Manually Create the Naive Forecast
last_price <- tail(milk$milk_price, 1)
naive_forecast <- data.frame(date = seq(tail(milk$date, 1), by = "months", length.out = 6), milk_price = rep(last_price, 6))

#### Plot Original Time Series
plot(milk$date, milk$milk_price, type = "l", main = "Naive Forecast of Milk Prices", ylab = "Milk Price", xlab = "Date")

#### Add the Naive Forecast Line
lines(naive_forecast$date, naive_forecast$milk_price, col = "red")

#### Add Legend
legend("topleft", legend = c("Original", "Naive Forecast"), col = c("black", "red"), lty = 1)

### Naive Forecast with Drift

#### Manually Create a Drift Model
drift_lm <- lm(data = milk, milk_price ~ as.numeric(milk$date))

#### Make Predictions Using the Drift Model
milk$pred <- predict(drift_lm, newdata = milk)

#### Naive Forecast with Drift
naive_forecast_drift <- rep(tail(milk$pred, 1), 6)

#### Sequence of Dates
forecast_dates <- seq(tail(milk$date, 1) + 1, by = "months", length.out = 6)

#### Plot
plot(milk$date, milk$milk_price, type = "l", main = "Naive Forecast with Drift for Milk Prices", ylab = "Milk Price", xlab = "Date")

#### Add the Naive Forecast with Drift Line
lines(c(tail(milk$date, 1), forecast_dates), c(tail(milk$milk_price, 1), naive_forecast_drift), col = "red")

#### Add Legend
legend("topleft", legend = c("Original", "Naive Forecast with Drift"), col = c("black", "red"), lty = 1)

I believe that based off the results shown above, that a naive forecast would do better than a naive forecast with drift. Accounting for the dynamic nature of milk prices, I think the naive forecast does a decent job at representing the behavior of the data. Due to the many factors that influence the price of milk however, I do not believe that naive forecasting does a good job. There are too many factors and variables not included to make the naive forecast a great viable option. I feel as if there is something majorly missing and would recommend further investigation into the matter. Milk Prices are simply very volatile as previously stated at the start of this assignment.