Read in the .csv files for problems 2, 4 & 5
CAWorkHours <- read.csv("CanadianWorkHours.csv", stringsAsFactors = FALSE)
StoreSales <- read.csv("DeptStoreSales.csv", stringsAsFactors = FALSE)
SouvSales <- read.csv("SouvenirSales.csv", stringsAsFactors = FALSE)
str(CAWorkHours)
## 'data.frame': 35 obs. of 2 variables:
## $ Year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ HoursPerWeek: num 37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(CAWorkHours)
## Year HoursPerWeek
## 1 1966 37.2
## 2 1967 37.0
## 3 1968 37.4
## 4 1969 37.5
## 5 1970 37.7
## 6 1971 37.7
tail(CAWorkHours)
## Year HoursPerWeek
## 30 1995 35.7
## 31 1996 35.7
## 32 1997 35.5
## 33 1998 35.6
## 34 1999 36.3
## 35 2000 36.5
# Create a time series object out of Canadian Work Hours
CAhoursTS <- ts(CAWorkHours$HoursPerWeek, start = c(1966), frequency = 1)
#Set up the plot
par(oma=c(0,0,2,0))
plot(c(1966,2000), ylim = c(34,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Hours per Week", bty = "l", xaxt = "n", yaxt = "n")
# Add the time series CAhours
lines(CAhoursTS, bty = "l")
# Add the x and y axes
axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)))
axis(2, at = seq(34,38,0.5), labels = format(seq(34,38,0.5)), las = 2, cex.axis = 0.6)
Which one model of the following regression-based models would fit the series best?
Set up the models we want to fit to the data set
# Linear Trend
CAhoursLinear <- tslm(CAhoursTS ~ trend)
summary(CAhoursLinear)
##
## Call:
## tslm(formula = CAhoursTS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
# Quadratic trend
CAhoursQuad <- tslm(CAhoursTS ~ trend + I(trend^2))
summary(CAhoursQuad)
##
## Call:
## tslm(formula = CAhoursTS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
Visualization of fitted models
#Set up the plot
plot(c(1966,2000), ylim = c(34,38), xlim = c(1966,2000), type = "n", xlab = "Year", ylab = "Hours per Week", bty = "l", xaxt = "n", yaxt = "n")
# Add the time series CAhours
lines(CAhoursTS, bty = "l")
# Add the x and y axes
axis(1, at = seq(1966,2000,1), labels = format(seq(1966,2000,1)))
axis(2, at = seq(34,38,0.5), labels = format(seq(34,38,0.5)), las = 2, cex.axis = 0.6)
# Add the linear and quadratic trend models and trend models with seasonality
lines(CAhoursLinear$fitted, col = "blue")
lines(CAhoursQuad$fitted, col = "red")
# Add a legend
legend(1966, 36, c("Hours per Week", "Linear", "Quadratic"), lty = c(1,1,1), col = c("black", "blue", "red"), bty = "n")
The adjusted R^2 of the linear trend model and quadratic trend model are 0.5996 and 0.7465 respectively. The linear trend and quadratic trend models with seasonality cannot be fitted because non-seasonal data cannot be modelled using a seasonal factor.
• Linear trend model - The low Adj. R-sq. does not give a good explanation of variation in Canadian hours worked per week. • Linear trend model with seasonality - There is no seasonal component to the time series, so the model cannot be fitted to the data set. • Quadratic trend model - The higher Adj. R-Sq value gives a better explanation of the variation in Candian hours worked per week. This model would fit the series the best out of the 4 options. • Quadratic trend model with seasonality - There is no seasonal component to the time series, so the model cannot be fitted to the data set.
An exponential trend implies a multiplicative increase or decrease of the series over time. To fit an exponential trend, the output variable (Sales) is replaced with log(Sales) and then fit a linear regression.
Plot the department store sales time series
# Create a time series object out of Department Store Sales
StoresalesTS <- ts(StoreSales$Sales, start = c(1), frequency = 4)
# Set up the plot
plot(c(1,7), ylim = c(40000, 120000), xlim = c(1,7), type = "n", xlab = "Year", ylab = "Sales (in thousands)", bty = "l", xaxt = "n", yaxt = "n")
box()
box(which = "plot")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)
lines(StoresalesTS, bty = "l")
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))
axis(2, at = seq(40000, 120000, 10000), labels = format(seq(40, 120, 10)), las = 2)
Partition the data, so that the last 4 quarters are in the validation period and all previous quarters are in the training period.
valid.Length <- 4
train.Length <- length(StoresalesTS) - valid.Length
StoresalesTS.Train <- window(StoresalesTS, end = c(1, train.Length))
StoresalesTS.Valid <- window(StoresalesTS, start = c(1, train.Length+1))
Fit the regression model with exponential trend with seasonality. Use the tslm function with lambda = 0 to do this.
StoresalesExp <- tslm(StoresalesTS.Train ~ trend + season, lambda = 0)
summary(StoresalesExp)
##
## Call:
## tslm(formula = StoresalesTS.Train ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
# Plot the regression
plot(c(1,7), ylim = c(40000, 120000), xlim = c(1,7), type = "n", xlab = "Year", ylab = "Sales (in thousands)", bty = "l", xaxt = "n", yaxt = "n")
box()
box(which = "plot")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)
# Add the time series object
lines(StoresalesTS, bty = "l")
# Add the exponential fit
lines(StoresalesExp$fitted, col = "red")
# Add the x-axis and y-axis
axis(1, at = seq(1,7,1), labels = format(seq(1,7,1)))
axis(2, at = seq(40000, 120000, 10000), labels = format(seq(40, 120, 10)), las = 2)
From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales? Looking at the p-values, every variable is statistically significant from the base season, quarter 1, except season2, which has a p-value of 0.248. season4 has the largest coefficient, which means it is the largest of the 4 quarters. season2 has the smallest coefficient and is slightly higher than Q1 average sales, but as stated above this is the only season that is not statistically significant and as a result it could be higher or lower.
Use this model to forecast sales in quarters 21 and 22.
validPeriodForecasts <- forecast(StoresalesExp, h = valid.Length - 2)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
The forecasts for Q21 and Q22 are $58,793.71 and $60,951.51 respectively.
# Plot fit of regression model for department store sales
plot(c(1, 6), ylim = c(40000,100000), xlim = c(1,6), type="o", xlab="Year", ylab="Sales", bty="l", xaxt="n", yaxt="n")
axis(1, at=seq(1,6,1), labels=format(seq(1,6,1)))
axis(2, at=seq(40000, 100000, 10000), labels=format(seq(40000, 100000, 10000)), las=0)
lines(StoresalesExp$fitted, col="red")
lines(StoresalesTS.Train)
# Plot the residuals in log-transformed units
plot(StoresalesExp$residuals, type = "o", bty = "l")
# Plot the residuals in the orginal scale
plot(StoresalesTS.Train - StoresalesExp$fitted, ylim = c(-4000, 4000), xlim = c(1,6), type = "o", bty = "l", xlab = "Quarter", ylab = "Residuals")
minor.tick(nx = 3, ny = 2, tick.ratio = 0.5)
From the residual plot we can see that roughly the first year and last year is under-forecasted, and in between is over-forecasted. If the trend continues we can assume that Q21 and Q22 will be under-forecasted. We can verify this by plotting the residuals of the validation period.
# Residuals of the validation period
StoresalesTS.Valid - validPeriodForecasts$mean
## Qtr1 Qtr2
## 6 2006.29 3948.49
# Plot the residuals of the validation period
valid.PeriodForecasts <- forecast(StoresalesExp, h = valid.Length)
plot(StoresalesTS.Valid - valid.PeriodForecasts$mean, type = "o", bty = "l", ylab = "Residuals")
The residuals for the validation period are all positive, meaning they are all under-forecasted.
To see how well the regression model fits the data, generate a plot showing the training period fit and validation period forecasts.
# Set up the plot
plot(c(1, 7), c(40000,120000), type="n", xlab="Year", ylab="Sales", bty="l", xaxt="n", yaxt="n")
#Add the time series deptTS
lines(StoresalesTS, bty="l")
# Add the x-axis and y-axis
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
axis(2, at=seq(40000, 120000, 10000), labels=format(seq(40000, 120000, 10000)), las=0)
abline(v=6)
arrows(1, 110000, 6, 110000, code=3, length=0.1)
text(3.5, 115000, "Training", cex = 0.75)
abline(v=7)
arrows(6, 110000, 7, 110000, code=3, length=0.1)
text(6.5, 115000, "Validation", cex = 0.75)
lines(valid.PeriodForecasts$fitted, col="red")
lines(valid.PeriodForecasts$mean, col="blue", lty=2)
# Add the legend
legend(1,105000, c("Sales", "Exponential Trend + Season", "Forecasts"), lty=c(1,1,2), col=c("black", "red", "blue"), bty="n", cex = 0.7)
summary(valid.PeriodForecasts)
##
## Forecast method: Linear regression model
##
## Model Information:
##
## Call:
## tslm(formula = StoresalesTS.Train ~ trend + season, lambda = 0)
##
## Coefficients:
## (Intercept) trend season2 season3 season4
## 10.74894 0.01109 0.02496 0.16534 0.43375
##
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## ACF1
## Training set 0.5540818
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
Looking at the residual plot, which of the following statements appear to be true? • Seasonality is not captured well - The model seems to capture seasonality in the data. • The regression model fits the data well - The regression model gives a MAPE of 2.215 indicating that the model fits the data well. • The trend in the data is not captured well by the model - True. The dip in the middle of the series indicates that the trend is not captured. The residuals should not display a U-shaped trend if it is captured well.
Which of the following solutions is adequate and a parsimonious solution for improving model fit? • Fit a quadratic trend model to the residuals (with Quarter and Quarter.) • Fit a quadratic trend model to Sales (with Quarter and Quarter.)
To improve the model fit we would want to fit a quadratic trend model to Sales.
Both time plots show seasonality and trend. The first plot looks like it might have an exponential trend, but looking at the log time plot it looks like the trend may be more linear. In this case the seasons are months, therefore there would be 12 predictors, 1 for trend and 11 for seasonality.
# Create a time series object
SouvsalesTS <- ts(SouvSales$Sales, start = c(1995,1), frequency = 12)
# Set the validation period length to the last 12 months and the previous months as the training set
valid.Length <- 12
train.Length <- length(SouvsalesTS) - valid.Length
# Partition the data into the training and validation periods set above
Souvsales.Train <- window(SouvsalesTS, start=c(1995,1), end=c(1995, train.Length))
Souvsales.Valid <- window(SouvsalesTS, start=c(1995,train.Length+1), end=c(1995,train.Length+valid.Length))
# Model A - Regression model with sales as output variable and with linear trend and monthly predictors
modelA <- tslm(Souvsales.Train ~ trend + season)
summary(modelA)
##
## Call:
## tslm(formula = Souvsales.Train ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
modelAForecast <- forecast(modelA, h = valid.Length)
accuracy(modelAForecast, Souvsales.Valid)
## ME RMSE MAE MPE MAPE MASE
## Training set -5.684342e-14 5365.199 3205.089 6.967778 36.75088 0.855877
## Test set 8.251513e+03 17451.547 10055.276 10.533974 26.66568 2.685130
## ACF1 Theil's U
## Training set 0.4048039 NA
## Test set 0.3206228 0.9075924
Based on the coefficients months 11 and 12 have the highest average sales during the year. This makes sense because in Australia the summer months are December through February. Months 11 and 12 would be around the beginning of summer.
The MAPE for modelA is not very good.
Souvenir sales increase by $245.36 on average for each successive time period.
# Run regression model with log(Sales) with linear trend and monthly predictors
modelB <-tslm(log(Souvsales.Train) ~ trend + season)
summary(modelB)
##
## Call:
## tslm(formula = log(Souvsales.Train) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
Fitting a model to log(Sales) with linear trend is equivalent to fitting a model to Sales with exponential trend.
# Generate forecasts using this model
modelBForecast <- forecast(modelB, h = valid.Length)
summary(modelBForecast)
##
## Forecast method: Linear regression model
##
## Model Information:
##
## Call:
## tslm(formula = log(Souvsales.Train) ~ trend + season)
##
## Coefficients:
## (Intercept) trend season2 season3 season4
## 7.64636 0.02112 0.28201 0.69500 0.37387
## season5 season6 season7 season8 season9
## 0.42171 0.44705 0.58338 0.54690 0.63557
## season10 season11 season12
## 0.72949 1.20095 1.95220
##
##
## Error measures:
## ME RMSE MAE MPE MAPE
## Training set 4.935341e-17 0.1709374 0.1382015 -0.03661472 1.534989
## MASE ACF1
## Training set 0.444076 0.4593573
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2001 9.188097 8.917220 9.458974 8.769890 9.606304
## Feb 2001 9.491232 9.220354 9.762109 9.073024 9.909439
## Mar 2001 9.925335 9.654457 10.196212 9.507127 10.343542
## Apr 2001 9.625329 9.354452 9.896207 9.207122 10.043537
## May 2001 9.694286 9.423408 9.965163 9.276078 10.112493
## Jun 2001 9.740741 9.469864 10.011619 9.322534 10.158949
## Jul 2001 9.898195 9.627318 10.169072 9.479988 10.316402
## Aug 2001 9.882831 9.611954 10.153708 9.464624 10.301038
## Sep 2001 9.992619 9.721742 10.263496 9.574412 10.410826
## Oct 2001 10.107664 9.836787 10.378542 9.689457 10.525872
## Nov 2001 10.600248 10.329370 10.871125 10.182040 11.018455
## Dec 2001 11.372615 11.101738 11.643493 10.954408 11.790823
# Compute accuracy metrics by transforming the forecasts back to the original scale
accuracy(exp(modelBForecast$mean), Souvsales.Valid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
Use this model to forecast sales in February 2002.
# Forecast beyond the validation period to period 86, which corresponds to Feb 2002
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86
# Plot the actuals and the forecasts for the validation period
plot(Souvsales.Valid, bty = "l", xaxt = "n", yaxt = "n", xlab = "Year 2001", ylab = "Souvenir Sales")
# Add the axes
axis(1, at = seq(2001, 2001 + 11/12, 1/12), labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
axis(2, at = seq(0, 100000, 10000), labels = format(seq(0,100,10)))
# Add the actual, linear+season and log(linear+season) lines
lines(SouvsalesTS, bty = "l")
lines(modelAForecast$mean, col = "blue")
lines(modelBForecast$mean, col = "red")
# Add a legend
legend(2001, 110000, c("Actuals", "Linear+Season", "log(y) Linear+Season"), lty = c(1,1,1), col = c("black", "blue", "red"), bty = "n")
Model B is preferable for forecasting. It has a lower MAPE than Model A. ModelB residuals also trend closer to the actuals.
No forecasting is required if the goal is understanding the different components of sales in the souvenir shop between 1995 to 2001. Instead, linear regression would be used for modeling cross-sectional data to infer the affect of different components. I would use the sales sample to estimate a model using different components like climate, category of souvenir etc.The sales would be the outcome of the variable of interest, the independent variables would be the components. The standard error would be the variation due to taking the sample and the coefficient would be the beta estimated by the software. The p-values would be used for inferring the effects of each component to the sample of sales.