This report analyzes monthly national grocery store sales (in millions of dollars) in the United States from January 1992 to December 2022. I will build a time series from the data, forecast future values with four simple forecasting methods, and compare the accuracy of the four methods.
Grocery <- read.csv("STA321_GroceryStoreData.csv")[,-1]
Since I am interested in forecasting performance, I will split the data into a training set and a testing set. The last ten of the 372 monthly observations (March to December 2022) will form the testing set, and the first 362 will form the training set.
training = Grocery[1:362]
testing = Grocery[363:372]
Grocery.ts = ts(training, frequency = 12, start = c(1992, 1))
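The split above hard-codes the indices 362 and 363. As a minimal sketch of an alternative, assuming a stand-in vector `sales` in place of the real data and the hypothetical names `full.ts`, `train.ts`, and `test.ts`, the same split can be derived from the test size `h` so that both pieces keep their calendar dates:

```r
# Hypothetical stand-in for the sales vector: 372 monthly values.
sales <- seq_len(372)
h <- 10                      # number of observations held out for testing
n <- length(sales)

# Build one monthly series starting in January 1992.
full.ts <- ts(sales, frequency = 12, start = c(1992, 1))

# window() keeps the time index, so each piece knows its own dates.
train.ts <- window(full.ts, end = time(full.ts)[n - h])
test.ts  <- window(full.ts, start = time(full.ts)[n - h + 1])

start(test.ts)   # first month of the testing set
end(train.ts)    # last month of the training set
```

Keeping the testing set as a `ts` object is a design choice, not a requirement; the plain vector used in this report works equally well for the accuracy calculations.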
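The four forecasting methods I will use to build a forecasting table and to test prediction accuracy are the average method (meanf(), called "moving average" in this report, which forecasts the mean of the whole training series), the naive method (a random walk forecast equal to the last observed value), the seasonal naive method (the value from the same month of the previous year), and the random walk with drift (the last value plus the average historical monthly change). Each method is fit on the training set and used to forecast the next 10 months of sales, which are then compared with the testing set.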
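To make the four methods concrete, here is a minimal base R sketch of the point forecasts each one produces. The toy series `y` and the names `f_mean`, `f_naive`, `f_snaive`, and `f_drift` are assumptions for illustration only; the report itself relies on the forecast package functions rather than this code.

```r
# Hypothetical toy series: 3 years of monthly values (not the grocery data).
y <- ts(c(10, 12, 14, 13, 15, 17, 16, 18, 20, 19, 21, 23,
          22, 24, 26, 25, 27, 29, 28, 30, 32, 31, 33, 35,
          34, 36, 38, 37, 39, 41, 40, 42, 44, 43, 45, 47),
        frequency = 12, start = c(2000, 1))
h <- 3                # forecast horizon
n <- length(y)        # number of training observations
m <- frequency(y)     # seasonal period (12 for monthly data)

# Average method: every forecast equals the mean of the whole series.
f_mean <- rep(mean(y), h)

# Naive method: every forecast equals the last observed value.
f_naive <- rep(y[n], h)

# Seasonal naive: each forecast repeats the value from the same month
# of the last observed year.
f_snaive <- y[n - m + ((seq_len(h) - 1) %% m) + 1]

# Drift: last value plus k times the average change per period, k = 1..h.
slope <- (y[n] - y[1]) / (n - 1)
f_drift <- y[n] + seq_len(h) * slope
```

On a series with a steady upward trend such as this toy one, the average method sits far below the most recent values, while the drift method extends the trend.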
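library(forecast)  # meanf(), naive(), snaive(), rwf()
library(knitr)     # kable()
## 10-step-ahead point forecasts from each method
pred.mv = meanf(Grocery.ts, h = 10)$mean                 # mean of the training series
pred.naive = naive(Grocery.ts, h = 10)$mean              # last observed value
pred.snaive = snaive(Grocery.ts, h = 10)$mean            # same month, previous year
pred.rwf = rwf(Grocery.ts, h = 10, drift = TRUE)$mean    # random walk with drift
pred.table = cbind(pred.mv = pred.mv,
                   pred.naive = pred.naive,
                   pred.snaive = pred.snaive,
                   pred.rwf = pred.rwf)
kable(pred.table, caption = "Forecasting Table")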
| pred.mv | pred.naive | pred.snaive | pred.rwf |
|---|---|---|---|
| 42007.74 | 69619 | 63837 | 69734.75 |
| 42007.74 | 69619 | 64366 | 69850.49 |
| 42007.74 | 69619 | 65018 | 69966.24 |
| 42007.74 | 69619 | 65595 | 70081.98 |
| 42007.74 | 69619 | 65656 | 70197.73 |
| 42007.74 | 69619 | 67330 | 70313.47 |
| 42007.74 | 69619 | 67712 | 70429.22 |
| 42007.74 | 69619 | 68341 | 70544.96 |
| 42007.74 | 69619 | 68398 | 70660.71 |
| 42007.74 | 69619 | 68944 | 70776.45 |
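The constant columns and the steadily increasing drift column in the table follow from the forecast formulas. As a short worked reading, with notation introduced here rather than taken from the report ($y_t$ for the training observations, $T = 362$ for the training length, $h$ for the forecast step), the drift forecast is

```latex
\hat{y}_{T+h} \;=\; y_T + h \cdot \frac{y_T - y_1}{T - 1},
\qquad
\frac{y_T - y_1}{T - 1} \;\approx\; \frac{70776.45 - 69619}{10} \;\approx\; 115.75 .
```

Here $y_T = 69619$ is the last training value, which is also the naive forecast at every step, and the slope of about 115.75 per month is read off the table as the average step between consecutive pred.rwf values. The pred.mv column is constant at 42007.74 because the average method returns the training mean for every horizon.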
Next, the most recent observations of the series will be plotted together with the forecasts from the four methods mentioned previously.
plot(340:372, Grocery[340:372], type="l", xlim=c(340,372), ylim=c(27000, 90000),
xlab = "observation sequence",
ylab = "Grocery Store Sales (Millions of Dollars)",
main = "Monthly Grocery Store Sales and forecasting")
points(363:372, Grocery[363:372],pch=20)
##
points(363:372, pred.mv, pch=15, col = "red")
points(363:372, pred.naive, pch=16, col = "blue")
points(363:372, pred.rwf, pch=18, col = "navy")
points(363:372, pred.snaive, pch=17, col = "purple")
##
lines(363:372, pred.mv, lty=2, col = "red")
lines(363:372, pred.snaive, lty=2, col = "purple")
lines(363:372, pred.naive, lty=2, col = "blue")
lines(363:372, pred.rwf, lty=2, col = "navy")
##
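## pch follows the legend order so that drift (18) and seasonal naive (17) match the plotted symbols
legend("topright", c("moving average", "naive", "drift", "seasonal naive"),
       col=c("red", "blue", "navy", "purple"), pch=c(15, 16, 18, 17), lty=rep(2,4),
       bty="n", cex = 0.8)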
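## Accuracy checks

To measure and compare the accuracy of the four forecasting methods, I will use the mean absolute percentage error (MAPE), the mean absolute deviation (MAD), and the mean squared error (MSE) of the forecasts on the testing set.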
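For reference, the standard definitions of these measures can be written as follows. The notation is introduced here and is not from the report: $y_i$ is the $i$-th value in the testing set, $\hat{y}_i$ is the corresponding forecast, and $h = 10$ is the number of test observations.

```latex
\mathrm{MAPE} = \frac{100}{h} \sum_{i=1}^{h} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,
\qquad
\mathrm{MAD} = \frac{1}{h} \sum_{i=1}^{h} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{h} \sum_{i=1}^{h} \left( y_i - \hat{y}_i \right)^2 .
```

MAPE is unit-free, while MAD is in millions of dollars and MSE in squared millions of dollars. Reporting the sum of absolute errors instead of their mean rescales MAD by the constant $h$ and does not change how the methods rank.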
true.value = Grocery[363:372]
PE.mv = 100*(true.value - pred.mv)/true.value
PE.naive = 100*(true.value - pred.naive)/true.value
PE.snaive = 100*(true.value - pred.snaive)/true.value
PE.rwf = 100*(true.value - pred.rwf)/true.value
##
MAPE.mv = mean(abs(PE.mv))
MAPE.naive = mean(abs(PE.naive))
MAPE.snaive = mean(abs(PE.snaive))
MAPE.rwf = mean(abs(PE.rwf))
##
MAPE = c(MAPE.mv, MAPE.naive, MAPE.snaive, MAPE.rwf)
## residual-based Error
e.mv = true.value - pred.mv
e.naive = true.value - pred.naive
e.snaive = true.value - pred.snaive
e.rwf = true.value - pred.rwf
## MAD (mean absolute deviation: mean(), not sum(), so the value is a per-month average)
MAD.mv = mean(abs(e.mv))
MAD.naive = mean(abs(e.naive))
MAD.snaive = mean(abs(e.snaive))
MAD.rwf = mean(abs(e.rwf))
MAD = c(MAD.mv, MAD.naive, MAD.snaive, MAD.rwf)
## MSE
MSE.mv = mean((e.mv)^2)
MSE.naive = mean((e.naive)^2)
MSE.snaive = mean((e.snaive)^2)
MSE.rwf = mean((e.rwf)^2)
MSE = c(MSE.mv, MSE.naive, MSE.snaive, MSE.rwf)
##
accuracy.table = cbind(MAPE = MAPE, MAD = MAD, MSE = MSE)
row.names(accuracy.table) = c("Moving Average", "Naive", "Seasonal Naive", "Drift")
kable(accuracy.table, caption ="Overall performance of the four forecasting methods")
| | MAPE | MAD | MSE |
|---|---|---|---|
| Moving Average | 42.025209 | 30474.06 | 930341432 |
| Naive | 3.918975 | 2862.80 | 9868912 |
| Seasonal Naive | 8.237912 | 5962.10 | 35913534 |
| Drift | 3.048405 | 2226.20 | 5901977 |
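I analyzed monthly national grocery store sales from January 1992 to December 2022. The data were split into a training set (the first 362 months) and a testing set (the last 10 months). Four simple methods were fit to the training set, and their 10-month forecasts were compared with the testing set. The drift method worked best, with the lowest MAPE (about 3.05%), MAD, and MSE of the four methods, followed by the naive method (MAPE about 3.92%) and the seasonal naive method (MAPE about 8.24%). The moving average method was by far the least accurate (MAPE about 42.03%), because the mean of the whole training series (42007.74) lies far below recent sales (69619 in the last training month).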
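The accuracy calculations in this report repeat the same three formulas for each method. As a sketch of how that comparison could be written more compactly, the helper below computes all three measures at once and picks the method with the lowest MAPE. The function name `forecast_errors`, the toy vectors, and the `preds` list are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: returns MAPE, MAD and MSE for one set of forecasts.
forecast_errors <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  e <- actual - predicted                 # forecast errors
  c(MAPE = mean(abs(100 * e / actual)),   # mean absolute percentage error
    MAD  = mean(abs(e)),                  # mean absolute deviation
    MSE  = mean(e^2))                     # mean squared error
}

# Toy values for illustration only (not the grocery data).
actual <- c(100, 110, 120)
preds <- list(naive = c(100, 100, 100),
              drift = c(105, 110, 115))

# One row per method, one column per accuracy measure.
acc <- t(sapply(preds, forecast_errors, actual = actual))

# Name of the method with the smallest MAPE.
best <- rownames(acc)[which.min(acc[, "MAPE"])]
```

With the report's objects, the same helper could be applied to a list holding pred.mv, pred.naive, pred.snaive, and pred.rwf, with true.value as `actual`.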