# install.packages(c("ggplot2", "forecast"))
source("code_Time_Series.R")
The data set for this example contains annual sales and advertising dollars (in 1960 dollars) for the Lydia Pinkham tonic from 1907 to 1960. These series are stored in two objects called tonic_sales and tonic_advert, respectively.
Different advertising strategies were adopted from 1907-1914 (strategy A), 1915-1925 (B), 1926-1940 (C), and 1941-1960 (D). The advertising strategy is included using three dummy variables, tonic_strat_B, tonic_strat_C, and tonic_strat_D, with strategy A being the base case.
Also included are the first lags of sales and advertising, in the series tonic_lag_sales and tonic_lag_advert. In all seven of these series, the observation for 1907 has been removed so that every series has the same length, since the lags introduce NAs for that year.
load("data_Time_Series.Rdata")
# tonic_sales
# tonic_advert
# tonic_strat_B
# tonic_strat_C
# tonic_strat_D
# tonic_lag_sales
# tonic_lag_advert
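These lagged and dummy series come ready-made in the .Rdata file. Purely for reference, a minimal sketch of how such series could be built looks like the following; sales_raw and advert_raw are made-up placeholders for hypothetical full-length raw series, not objects in the data file.
# For reference only -- the loaded objects above already contain these series.
# sales_raw and advert_raw are hypothetical annual ts objects spanning
# 1907-1960, filled here with placeholder data.
sales_raw  <- ts(runif(54, 500, 3000), start = 1907)
advert_raw <- ts(runif(54, 300, 2000), start = 1907)

# First lags: the value in year t is the raw series value in year t - 1.
lag_sales  <- stats::lag(sales_raw,  -1)   # spans 1908-1961
lag_advert <- stats::lag(advert_raw, -1)

# Trim everything to the common 1908-1960 window so all series align.
sales      <- window(sales_raw,  start = 1908)
advert     <- window(advert_raw, start = 1908)
lag_sales  <- window(lag_sales,  end   = 1960)
lag_advert <- window(lag_advert, end   = 1960)

# Strategy dummies, with strategy A (1907-1914) as the base case.
yrs     <- 1908:1960
strat_B <- ts(as.numeric(yrs %in% 1915:1925), start = 1908)
strat_C <- ts(as.numeric(yrs %in% 1926:1940), start = 1908)
strat_D <- ts(as.numeric(yrs %in% 1941:1960), start = 1908)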
Use autoplot(...) to plot each of the series one by one. Look at whether advertising and sales seem correlated, and whether advertising follows sales or sales follow advertising.
# autoplot(tonic_sales)
# autoplot(tonic_advert)
# autoplot(tonic_strat_B)
# autoplot(tonic_strat_C)
# autoplot(tonic_strat_D)
# autoplot(tonic_lag_sales)
# autoplot(tonic_lag_advert)
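Another way to eyeball the relationship (an optional aside, assuming the forecast package's autoplot() method for multivariate series and its ggCcf() function are available) is to plot the two series on a common time axis and look at their cross-correlations:
# Plot sales and advertising together to see whether they move in step.
autoplot(cbind(Sales = tonic_sales, Advertising = tonic_advert), facets = TRUE)

# Cross-correlations: large spikes away from lag 0 hint at which series
# tends to lead the other.
ggCcf(tonic_advert, tonic_sales)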
Let’s start by regressing tonic_sales on tonic_advert and examining the fit.
fit <- tslm(tonic_sales ~ tonic_advert)
aa_plot_fitted(fit)
The fit doesn’t look so great. Let’s also check the diagnostics.
accuracy(fit)
## ME RMSE MAE MPE MAPE MASE
## Training set 8.573396e-15 336.7783 258.7706 -3.439857 14.21672 1.45032
## ACF1
## Training set 0.7056093
summary(fit)
##
## Call:
## tslm(formula = tonic_sales ~ tonic_advert)
##
## Residuals:
## Min 1Q Median 3Q Max
## -874.01 -229.13 -21.44 223.68 600.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 510.2978 129.1037 3.953 0.000239 ***
## tonic_advert 1.4187 0.1278 11.104 3.22e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 343.3 on 51 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.7017
## F-statistic: 123.3 on 1 and 51 DF, p-value: 3.218e-15
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 34.362, df = 10, p-value = 0.0001603
The Breusch-Godfrey test above shows that we are having trouble with autocorrelation in the residuals. Maybe adding the first lag of advertising will help.
fit <- tslm(tonic_sales ~ tonic_advert + tonic_lag_advert)
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 40.979, df = 10, p-value = 1.138e-05
That didn’t seem to help the autocorrelation at all. Let’s now add the first lag of sales and see if that improves the model.
fit <- tslm(tonic_sales ~ tonic_advert + tonic_lag_advert + tonic_lag_sales)
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 9.1205, df = 10, p-value = 0.5207
Clearly that is much better! Let’s verify the fit visually:
aa_plot_fitted(fit)
Looks pretty good. Finally, let’s look at accuracy and verify variable significance:
accuracy(fit)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.286654e-14 181.7368 129.2448 -1.048323 7.118487 0.7243728
## ACF1
## Training set 0.162151
summary(fit)
##
## Call:
## tslm(formula = tonic_sales ~ tonic_advert + tonic_lag_advert +
## tonic_lag_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -477.94 -97.66 -25.39 73.64 690.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154.06533 80.49015 1.914 0.061458 .
## tonic_advert 0.58944 0.14232 4.142 0.000136 ***
## tonic_lag_advert -0.66006 0.14156 -4.663 2.43e-05 ***
## tonic_lag_sales 0.95546 0.08764 10.902 1.06e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189 on 49 degrees of freedom
## Multiple R-squared: 0.9148, Adjusted R-squared: 0.9096
## F-statistic: 175.4 on 3 and 49 DF, p-value: < 2.2e-16
The MAPE is about 7%, and all variables are significant. This seems like a good model.
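As a quick sanity check (not part of the original output), the MAPE reported by accuracy() can be reproduced by hand from the residuals:
# Mean absolute percentage error, computed directly from the residuals.
mean(abs(residuals(fit) / tonic_sales)) * 100  # roughly 7.1, matching accuracy()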
Now adding in the advertising strategy (without trying to remove insignificant variables):
fit <- tslm(tonic_sales ~ tonic_advert + tonic_lag_advert + tonic_lag_sales +
tonic_strat_B + tonic_strat_C + tonic_strat_D)
aa_plot_fitted(fit)
accuracy(fit)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.286445e-14 155.6047 118.0198 -0.818896 6.804955 0.6614604
## ACF1
## Training set 0.1851649
summary(fit)
##
## Call:
## tslm(formula = tonic_sales ~ tonic_advert + tonic_lag_advert +
## tonic_lag_sales + tonic_strat_B + tonic_strat_C + tonic_strat_D)
##
## Residuals:
## Min 1Q Median 3Q Max
## -319.81 -93.22 -26.86 79.42 452.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.87047 76.59058 1.683 0.09923 .
## tonic_advert 0.60279 0.13183 4.572 3.63e-05 ***
## tonic_lag_advert -0.37913 0.15127 -2.506 0.01580 *
## tonic_lag_sales 0.76540 0.09968 7.678 8.85e-10 ***
## tonic_strat_B 296.46048 95.51591 3.104 0.00326 **
## tonic_strat_C -11.86017 90.80719 -0.131 0.89665
## tonic_strat_D 104.99736 85.09806 1.234 0.22353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 167 on 46 degrees of freedom
## Multiple R-squared: 0.9375, Adjusted R-squared: 0.9294
## F-statistic: 115.1 on 6 and 46 DF, p-value: < 2.2e-16
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 6.9855, df = 10, p-value = 0.7268
All in all, this model looks quite good, and its error measures are better than the previous one without the strategies.
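If we wanted a more formal comparison (a hedged aside; CV() is the forecast package's summary of cross-validation and information criteria for linear models), we could refit both models and compare their CV/AICc values, where smaller is better:
# Compare the models with and without the strategy dummies.
fit_no_strat <- tslm(tonic_sales ~ tonic_advert + tonic_lag_advert + tonic_lag_sales)
fit_strat    <- tslm(tonic_sales ~ tonic_advert + tonic_lag_advert + tonic_lag_sales +
                       tonic_strat_B + tonic_strat_C + tonic_strat_D)
CV(fit_no_strat)
CV(fit_strat)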
Since the model for forecasting tonic_sales is based on other time series, we have to forecast those time series first.
Let’s pretend it’s 1960 (the last year in the data), and we want to forecast tonic_sales for 1961.
Here’s the sales amount from 1960:
window(tonic_sales, start = 1960, end = 1960)
## Time Series:
## Start = 1960
## End = 1960
## Frequency = 1
## [1] 1289
And here’s the advertising amount from 1960:
window(tonic_advert, start = 1960, end = 1960)
## Time Series:
## Start = 1960
## End = 1960
## Frequency = 1
## [1] 564
Let’s suppose that, in 1961, we will stick with strategy D and plan to spend 500 on advertising. Then we can predict tonic_sales in 1961 by plugging these values into the fitted coefficients above:
128.87047 + # intercept
0.60279 * 500 + # contribution from advertising
-0.37913 * 564 + # uses lagged advertising from 1960!
0.76540 * 1289 + # uses lagged sales from 1960!
104.99736 # contribution from strategy D
## [1] 1308.034
So, if we spend 500 on advertising and continue with strategy D, the model predicts 1961 tonic sales of about 1308 (in 1960 dollars).
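The same prediction can be computed programmatically from the fitted coefficients (a sketch using the fit object from above; the 1960 values come from the window() calls):
# 1961 regressor values: planned advertising of 500, strategy D, and the
# 1960 sales and advertising figures as the lagged terms.
new_x <- c("(Intercept)"    = 1,
           tonic_advert     = 500,
           tonic_lag_advert = 564,
           tonic_lag_sales  = 1289,
           tonic_strat_B    = 0,
           tonic_strat_C    = 0,
           tonic_strat_D    = 1)
sum(coef(fit) * new_x[names(coef(fit))])  # about 1308, matching the manual calculation
Calling forecast(fit, newdata = ...) should give the same point forecast along with prediction intervals, provided the regressor names in newdata match those used in the tslm call.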