To answer which model would best fit the Canadian Manufacturing Workers Work-Hours data, I began by pulling in the .csv and creating some plots.
WorkHours <- read.csv("~/Personal/CanadianWorkHours.csv", stringsAsFactors = FALSE)
workTS <- ts(WorkHours$Hours, start=c(1966, 1), frequency=1)
yrange = range(workTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(workTS, bty="l")
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34.5,38,0.5), las=2)
After I had re-created the plot in the book, I ran some regression-based models. I wanted to do each of the four options, beginning with linear trend and linear trend with seasonality.
library(forecast)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
worklinear <- tslm(workTS ~ trend)
summary(worklinear)
##
## Call:
## tslm(formula = workTS ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20457 -0.28761 0.04779 0.30210 1.23190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.416134 0.175818 212.811 < 2e-16 ***
## trend -0.061373 0.008518 -7.205 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.509 on 33 degrees of freedom
## Multiple R-squared: 0.6113, Adjusted R-squared: 0.5996
## F-statistic: 51.91 on 1 and 33 DF, p-value: 2.928e-08
When running the linear model with seasonality I got the following error: Error in tslm(workTS ~ trend + season) : Non-seasonal data cannot be modelled using a seasonal factor," which indicates to me that this data set cannot be modelled with either a linear model with seasonality or a quadratic model with seasonality. Because there is only one data point annually, I think the seasonality would be hard to determine - we don’t have enough information to compare groups of years as seasons.
That meant I wanted to run the quadratic trend model next.
workquadratic <- tslm(workTS ~ trend + I(trend^2))
summary(workquadratic)
##
## Call:
## tslm(formula = workTS ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94503 -0.20964 -0.01652 0.31862 0.60160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1644156 0.2176599 175.340 < 2e-16 ***
## trend -0.1827154 0.0278786 -6.554 2.21e-07 ***
## I(trend^2) 0.0033706 0.0007512 4.487 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4049 on 32 degrees of freedom
## Multiple R-squared: 0.7614, Adjusted R-squared: 0.7465
## F-statistic: 51.07 on 2 and 32 DF, p-value: 1.1e-10
To compare the linear trend and quadratic trend models, I plotted both.
yrange = range(workTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(workTS, bty="l")
axis(1, at=seq(1966,2000,1), labels=format(seq(1966,2000,1)))
axis(2, at=seq(34.5,38,0.5), las=2)
lines(worklinear$fitted, col="red")
lines(workquadratic$fitted, col="blue")
legend(1966,36, c("Hours per week", "Linear", "Quadratic"), lty=c(1,1,1), col=c("black", "red", "blue"), bty="n")
Based on the results of the linear trend model and the quadratic trend model, as well as the visual, I think the quadratic trend model fits the series best.
To fit the Department Store sales data with an exponetial trend, you must take a logarithm of sales, as this is the y or output variable in the function.
I began by plotting the data and then partitioned it into training and validation periods.
DeptSales <- read.csv("~/Personal/DeptStoreSales (1).csv", stringsAsFactors = FALSE)
salesTS <- ts(DeptSales$Sales, start=c(1), frequency=4)
yrange = range(salesTS)
plot(c(1,7), yrange, type="n", xlab="Year", ylab="Sales", bty="l", xaxt="n", yaxt="n")
lines(salesTS, bty="l")
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
axis(2, at=seq(45000,105000, 10000), las=(2))
validLength <- 4
trainLength <- length(salesTS) - validLength
salesTrain <- window(salesTS, end=c(1, trainLength))
salesValid <- window(salesTS, start=c(1,trainLength+1))
Next I ran the regression model with an exponential trend and seasonality. I used lambda =0 to create the log needed for an exponential trend.
salesexponent <- tslm(salesTrain ~ trend + season, lambda=0)
summary(salesexponent)
##
## Call:
## tslm(formula = salesTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
Based on the output from our exponential trend model with seasonality, we can see the p value for season2 (or the second quarter) is not statistically significantly different than Q1. We can interpret this as meaning Q2 average sales are approximately equal to Q1.
See below:
validLength = 2
validationforecast <- forecast(salesexponent, h=validLength)
validationforecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
salesexponent <- tslm(salesTS ~ trend + season, lambda=0)
yrange = range(salesTS)
plot(c(1, 7), yrange, type="n", xlab="Year", ylab="Sales", bty="l", xaxt="n", yaxt="n")
lines(salesTS, bty="l")
axis(1, at=seq(1,7,1), labels=format(seq(1,7,1)))
axis(2, at=seq(45000,105000,10000), las=2)
lines(salesexponent$fitted, col="red")
legend(1,105000, c("Sales", "Exponential with Seasonality"), lty=c(1,1), col=c("black", "red"), bty="n")
plot(salesexponent$residuals, type="o", bty="l")
Seasonality is captured well in the plot - there is no obvious pattern to the residuals. I think that the residual plot indicates the regression model fits the data well as the residuals are low and centered mainly around zero. *The trend in the data is not captured well by the model. There is a slight U-shape to the residual plot, which makes me feel a quadratic model may have yielded better results.
As I mentioned in (f), there is a U-shape to the residuals, so I would suggest fitting a quadratic trend model to Sales.
Based on the plots, it seems there is an upward linear trend, as well as seasonality present in the Souvenier sales data. There should be 11 seasonal dummy variables (as there are 12 “seasons” - months) and the trend, meaning 12 predictors total.
I read in the Souvenir Sales data and partitioned it into training and validation periods. Once I had done that I ran a regression model with trend and seasonality.
souvensales <- read.csv("~/Personal/SouvenirSales (1).csv",stringsAsFactors = FALSE)
salesTS <- ts(souvensales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(salesTS) - validLength
souvensalesTrain <- window(salesTS,end=c(1995,trainLength))
souvensalesValid <- window(salesTS, start=c(1995,trainLength+1))
ModelA <- tslm(souvensalesTrain ~ trend + season)
summary(ModelA)
##
## Call:
## tslm(formula = souvensalesTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
Based on the coefficients, it is clear December has the highest average sales. This is reasonable for a few reasons: December is summer in Australia, meaning peak season for beach resorts and Christmas and holiday shopping are likely highest in December.
The trend coeffiecent’s p-value is statisitically significant, showing there is in fact a linear trend present in the data. The 245.36 is the slope of that trend line that for each month, there is an average increase of $245.36 in sales.
log(souvensalesTrain to generate log(Sales) as the output.ModelB <- tslm(log(souvensalesTrain) ~ trend + season)
summary(ModelB)
##
## Call:
## tslm(formula = log(souvensalesTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
The model with log(Sales) output is the same as with an exponential trend.
The trend coeffecient of Model B represents the increase month over month as a percentage, so 0.021 shows a 2.1% increase each month.
This data has 84 points until the end of 2001, so to calculate the forecast for February 2002, we will need to use the 86th.
febforecast <- ModelB$coefficients["(Intercept)"] + ModelB$coefficients["trend"]*86 + ModelB$coefficients["season2"]
exp(febforecast)
## (Intercept)
## 17062.99
AForecast <- forecast(ModelA,h=validLength)
accuracy(AForecast$mean,souvensalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
BForecast <- forecast(ModelB,h=validLength)
accuracy((exp(BForecast$mean)),souvensalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
plot(souvensalesValid)
lines(AForecast$mean, col="red")
lines(exp(BForecast$mean), col="green")
Based on the accuracy measures alone, it seems Model B is a better fit for the data. The plot confirms this as well.