This dataset is the Monthly Ice Cream Production in the United States from 1918 to 1940. Here is the link to the original dataset. The dataset is measured in millions of gallons and is not seasonally adjusted.
The CSV file is read into R and prepared for analysis. The observation_date column is converted into a yearmonth format, so the data can be treated as a proper time series. Irrelevant columns are removed, and the dataset is converted to a tsibble, which makes it compatible with forecasting tools in the fpp3 framework. Renaming variables helps keep the workflow clear and reproducible.
# Load in data
setwd('/Users/emilybroderick/Desktop/R/PAF')
ice_cream <- read.csv("old_ice_cream.csv")
# Clean data and convert to tsibble
ice_cream <- ice_cream %>% mutate(
Months = yearmonth(observation_date) # Indicates time as months
) %>%
select(-observation_date) %>% # Removes old time column
as_tsibble(index = Months) %>% # Converts to a tsibble
rename("production" = "M0174BUSM411NNBR") # Rename for reproducability
Training and tests sets are created. The training set are composed of ~80% of the data and is used to train the models. The test set is composed of ~20% of the data and is used to test the models.
# Filter ~80% of observations into a training set (1918-1936)
train <- ice_cream %>%
filter(year(Months) <= 1936)
# Filter ~20% of observations into a test set (1937-1940)
test <- ice_cream %>%
filter(year(Months) > 1936)
# Create a time plot to get an idea of what the data look like
time_plot <- autoplot(train, production) +
labs(title = "Monthly Ice Cream Production in the US from 1918 to 1936",
y = "Ice Cream Production (Millions of Gallons)")
time_plot
This plot shows the strong seasonality of these data. There is a positive trend over the first 10 years, with a sharp decrease in production in the early 1930s, likely due to the Great Depression. Around 1934, a positive trend can be seen again.
# Create a seasonal plot of the years by month
seasonal_plot_yr <- train %>%
gg_season(production, period = "year") +
labs(y="Ice Cream Production (Millions of Gallons)",
title="Seasonal Plot: Monthly US Ice Cream Production: 1918-1936")
seasonal_plot_yr
This plot reinforces the consistent seasonality of the data.
seasonal_sub <- train %>%
gg_subseries(production) +
labs(
y = "Ice Cream Production (Millions of Gallons)",
title = "Seasonal Subseries Plot: Monthly US Ice Cream Production: 1918-1936"
)
seasonal_sub
This plot shows the slight variability in production year to year, but also the strong seasonality of the data.
ic_acf_feat <- train %>%
features(production, feat_acf)
ic_acf_feat %>%
gt() %>%
tab_header(title = "Time Series ACF Features")
| Time Series ACF Features | ||||||
| acf1 | acf10 | diff1_acf1 | diff1_acf10 | diff2_acf1 | diff2_acf10 | season_acf1 |
|---|---|---|---|---|---|---|
| 0.8372187 | 2.354666 | 0.6068245 | 1.631205 | -0.05561128 | 0.2439343 | 0.8976849 |
ic_acf <- acf(train$production, main="ACF of Ice Cream Sales")
The ACF graph shows that ice cream production is strongly correlated with its past values, especially at lags 1 through 4, and then again around lag 12. That pattern means there’s persistence in the data (what happens this month is related to what happened in recent months). The big spike at lag 12 suggests a yearly seasonal cycle where production in one month is very similar to production in the same month the previous year.
ic_pacf_feat <- train %>%
features(production, feat_pacf)
ic_pacf_feat %>%
gt() %>%
tab_header(title = "Time Series PACF Features")
| Time Series PACF Features | |||
| pacf5 | diff1_pacf5 | diff2_pacf5 | season_pacf |
|---|---|---|---|
| 1.312455 | 1.371892 | 1.085238 | -0.009753806 |
ic_pacf <- pacf(train$production, main = "PACF of Ice Cream Sales")
The PACF graph shows a big spike at the first lag, which means the current month is especially influenced by the immediately previous month. There are also some smaller spikes around lag 12, which line up with the seasonal effect we saw in the ACF.
Together, the graphs show that the ice cream production data has both short-term relationships (month to month) and longer-term seasonal patterns (year to year).
cl_add_decomp <- train %>%
model(classical_decomposition(production, type = "additive"))
components(cl_add_decomp) %>%
autoplot() + # Plot the decomp
labs(title = "Additive Classical Decomposition of US Ice Cream Production")
ml_add_decomp <- train %>%
model(classical_decomposition(production, type = "multiplicative"))
components(ml_add_decomp) %>%
autoplot() + # Plot the decomp
labs(title = "Multiplicative Classical Decomposition of US Ice Cream Production")
# Additive decomp
add_dcmp <- train %>%
model(stl = STL(production)) # Additive decomp
components(add_dcmp) %>%
autoplot() + # Plot the decomp
labs(title = "Additive STL Decomposition of US Ice Cream Production")
To perform a multiplicative decomposition using STL, I took the Log transform of the data. Taking the Log transform of the data converts a multiplicative relationship to an additive one since STL assumes an additive relationship.
# Mult decomp
mult_dcmp <- train %>%
model(stl = STL(log(production))) # Log transform
components(mult_dcmp) %>%
autoplot() + # Plot the decomp
labs(title = "Multiplicative STL Decomposition of US Ice Cream Production")
I used accuracy() to compare the two STL decompositions because it provides error metrics (like RMSE, MAE, and MAPE) that quantify how well the reconstructed series matches the original data, allowing me to objectively determine which decomposition fit better. STL decomposition is more sophisticated than its classical counterpart, which is why only the STL decompositions were compared.
m1 <- train %>% model(add = STL(production))
m2 <- train %>% model(mult = STL(log(production)))
decomp <- rbind(accuracy(m1),accuracy(m2))
decomp %>%
gt() %>%
tab_header(title = "Additive vs. Multiplicative STL Decomposition")
| Additive vs. Multiplicative STL Decomposition | |||||||||
| .model | .type | ME | RMSE | MAE | MPE | MAPE | MASE | RMSSE | ACF1 |
|---|---|---|---|---|---|---|---|---|---|
| add | Training | -0.002892767 | 1.421927 | 1.0619461 | 0.8435918 | 7.696093 | 0.5074113 | 0.5064117 | 0.2230786 |
| mult | Training | 0.064605476 | 1.130203 | 0.7606183 | -0.1529892 | 4.829916 | 0.3634331 | 0.4025158 | -0.1185394 |
The multiplicative model performed better, with lower RMSE, MAE, and MAPE, as well as smaller autocorrelation in the residuals, showing that it captured the proportional seasonal variation in the data more effectively. This means that the residuals are closer to white noise and that the data is fit more accurately. I could tell it was better both from the improved fit metrics and from inspecting the decomposed seasonal component, which looked more stable under the multiplicative approach. In forecasting, I would use the multiplicative (log-transformed) decomposition since it accounts for seasonality that grows with the trend, making predictions more accurate and realistic for this dataset.
Due to the high seasonality of the data, I decided to fit a seasonal naive model and a linear regression model. Since the data is multiplicative, a log transform was performed for each model.
# Fit the seasonal mean model
s_naive_fit <- train %>%
model(Seasonal = SNAIVE(log(production))) # Seasonal Naive Model
# Forecast 4 years
s_naive_fc <- s_naive_fit %>% forecast(h = 48)
# Fit the linear regression model
reg_fit <- train %>%
model(Regression = TSLM(log(production) ~ trend() + season())) # Linear Regression Model
report(reg_fit)
## Series: production
## Model: TSLM
## Transformation: log(production)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67145 -0.13995 0.01561 0.16944 0.43458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7447859 0.0560915 31.106 < 2e-16 ***
## trend() 0.0016660 0.0002222 7.498 1.68e-12 ***
## season()year2 0.0614677 0.0715493 0.859 0.3912
## season()year3 0.4175650 0.0715503 5.836 1.95e-08 ***
## season()year4 0.7043024 0.0715521 9.843 < 2e-16 ***
## season()year5 1.1285231 0.0715545 15.772 < 2e-16 ***
## season()year6 1.3849374 0.0715576 19.354 < 2e-16 ***
## season()year7 1.5139482 0.0715614 21.156 < 2e-16 ***
## season()year8 1.3676187 0.0715659 19.110 < 2e-16 ***
## season()year9 0.9876197 0.0715710 13.799 < 2e-16 ***
## season()year10 0.5071412 0.0715769 7.085 1.96e-11 ***
## season()year11 0.1714205 0.0715835 2.395 0.0175 *
## season()year12 0.0275058 0.0715907 0.384 0.7012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2205 on 215 degrees of freedom
## Multiple R-squared: 0.8717, Adjusted R-squared: 0.8646
## F-statistic: 121.8 on 12 and 215 DF, p-value: < 2.22e-16
# Forecast 4 years
reg_fc <- reg_fit %>% forecast(h = 48)
fits <- train %>%
model(
SNaive = SNAIVE(log(production)),
Regression = TSLM(log(production) ~ trend() + season())
)
fc <- forecast(fits, new_data = test)
fc
## # A fable: 96 x 4 [1M]
## # Key: .model [2]
## .model Months production .mean
## <chr> <mth> <dist> <dbl>
## 1 SNaive 1937 Jan t(N(2.1, 0.026)) 8.47
## 2 SNaive 1937 Feb t(N(2.1, 0.026)) 8.56
## 3 SNaive 1937 Mar t(N(2.7, 0.026)) 14.8
## 4 SNaive 1937 Apr t(N(2.9, 0.026)) 17.5
## 5 SNaive 1937 May t(N(3.4, 0.026)) 29.2
## 6 SNaive 1937 Jun t(N(3.5, 0.026)) 32.7
## 7 SNaive 1937 Jul t(N(3.7, 0.026)) 40.1
## 8 SNaive 1937 Aug t(N(3.5, 0.026)) 34.5
## 9 SNaive 1937 Sep t(N(3.2, 0.026)) 23.7
## 10 SNaive 1937 Oct t(N(2.7, 0.026)) 14.7
## # ℹ 86 more rows
fc_plot <- autoplot(ice_cream, production) +
autolayer(fc, level = 95) +
labs(title = "Forecasts of Ice Cream Production for 1937-1940",
y = "Ice Cream Production (Millions of Gallons)")
fc_plot
s_naive_fc_plot <- s_naive_fc %>%
autoplot(ice_cream %>%
filter(!is.na(production)), level = 80) +
geom_line(aes(y = .mean), data = s_naive_fc, linetype = 1, col = "blue") +
labs(title = "US Monthly Ice Cream Production: 1918-1940 with 4-year Forecast",
y = "Production (Millions of Gallons) ")
s_naive_fc_plot
s_naive_fit %>% gg_tsresiduals()
Overall, this model seems to forecast the data with some accuracy. The residuals have a little pattern to them around the early 1930s and has some autocorrelation indicating that the model does not account for all predictors.
reg_fc_plot <- reg_fc %>%
autoplot(ice_cream %>%
filter(!is.na(production)), level = 80) +
geom_line(aes(y = .mean), data = reg_fc, linetype = 1, col = "blue") +
labs(title = "US Monthly Ice Cream Production: 1918-1940 with 4-year Forecast",
y = "Production (Millions of Gallons) ")
reg_fc_plot
reg_fit %>% gg_tsresiduals()
Based on the residuals, this model does not appear to perform as well as the seasonal naive model on first glance. The residuals have a much stronger trend and autocorrelation meaning that some predictors are not accounted for.
# Use accuracy() to compare the two models
compare <- accuracy(fc, test)
compare %>%
gt() %>%
tab_header(title = "Seasonal Naive vs. Linear Regression Models")
| Seasonal Naive vs. Linear Regression Models | |||||||||
| .model | .type | ME | RMSE | MAE | MPE | MAPE | MASE | RMSSE | ACF1 |
|---|---|---|---|---|---|---|---|---|---|
| Regression | Test | 2.337174 | 3.028517 | 2.755360 | 13.93354 | 15.22064 | NaN | NaN | 0.39385881 |
| SNaive | Test | 2.058289 | 2.743555 | 2.466136 | 11.91838 | 13.23647 | NaN | NaN | 0.03458455 |
The Seasonal Naive (SNaive) model outperformed the Linear Regression model across multiple accuracy metrics, indicating it provided more reliable forecasts for the test data. Specifically, the SNaive model achieved lower RMSE, MAE, and MAPE values, meaning its predictions were consistently closer to the actual observed values. Additionally, the residuals from the SNaive model showed minimal autocorrelation (ACF ≈ 0), suggesting that it captured the underlying seasonal structure of the data effectively, leaving little unexplained pattern. In contrast, the Regression model exhibited higher errors and substantial residual autocorrelation, indicating that it failed to fully model important seasonal patterns. Overall, the SNaive model is better suited for this dataset because it aligns with the inherent seasonal nature of the data, yielding more accurate and unbiased forecasts.
Neither model is able to adequately account for impactful external events, such as the Great Depression, policy changes, or other economic shocks, which can cause sudden deviations from seasonal or historical trends. This limitation is reflected in the residuals, where there is a strong pattern around the time of the Great Depression. Incorporating domain knowledge could improve model performance, such as adding indicator variables for known disruptive events or adjusting forecasts based on anticipated policy shifts. In particular, the Regression model could be enhanced by including lagged terms or dummy variables representing such events, which would allow it to adapt more flexibly to non-seasonal influences. In practice, judgmental adjustments play an important role in forecasting, especially when quantitative models fail to anticipate unusual circumstances. Combining statistical forecasts with expert insight ensures that predictions remain realistic and relevant in the face of sudden deviations from historical trends.