Assignment5: Problems[2,4,5]
P.2 \(\textit{Analysis of Canadian Manufacturing Workers Work-Hours: }\)The time series plot in the given figure describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which of the following regression-based models would fit the series best?
ans:
Quadratic trend moded: After looking into time series, this series is yearly series and hence no seasonality but may be business cyclic in nature e.g. changing over every 4-5 years. At this time we don’t know how to handle business cyclic based time series and moreover this option is not avialble in the given options. We will consider Quadratic trend model since in different times of series it has increasing as well as decreasing trends e.g. between 1974 to 1988 it is decreasing and then it is increasing. RMSE/MAPE is also better for Quadratic trend model as seen below
library(forecast)
## Warning: package 'forecast' was built under R version 3.2.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.2.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: timeDate
## This is forecast 7.3
library(zoo)
setwd("/Users/yusufsultan/rWork")
WorkHours <- read.csv("CanadianWorkHours.csv",stringsAsFactors = FALSE)
str(WorkHours)
## 'data.frame': 35 obs. of 2 variables:
## $ Year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ HoursPerWeek: num 37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...
head(WorkHours)
## Year HoursPerWeek
## 1 1966 37.2
## 2 1967 37.0
## 3 1968 37.4
## 4 1969 37.5
## 5 1970 37.7
## 6 1971 37.7
tail(WorkHours)
## Year HoursPerWeek
## 30 1995 35.7
## 31 1996 35.7
## 32 1997 35.5
## 33 1998 35.6
## 34 1999 36.3
## 35 2000 36.5
WorkHoursTS <- ts(WorkHours$HoursPerWeek,start=c(1966,1),frequency=1)
yrange <- range(WorkHoursTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year", ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(WorkHoursTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
#Data Partioning
totalRecords <- length(WorkHoursTS)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHoursTS, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHoursTS, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))
#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)
##
## Call:
## tslm(formula = train.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95181 -0.31426 0.02328 0.28599 0.74808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.668046 0.153308 245.702 < 2e-16 ***
## trend -0.083315 0.008636 -9.648 2.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared: 0.7687, Adjusted R-squared: 0.7605
## F-statistic: 93.08 on 1 and 28 DF, p-value: 2.111e-10
accuracy(train.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set 1.001342e+00 1.1216734 1.0013422 2.77249855 2.7724986
## MASE ACF1 Theil's U
## Training set 1.585977 0.7728777 NA
## Test set 4.921852 0.4331874 3.166203
#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91631 -0.22458 0.04514 0.26864 0.54400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.982414 0.231481 164.084 < 2e-16 ***
## trend -0.142259 0.034422 -4.133 0.000311 ***
## I(trend^2) 0.001901 0.001077 1.765 0.088905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.7773
## F-statistic: 51.61 on 2 and 27 DF, p-value: 5.958e-10
accuracy(train.poly.lm.pred,valid.ts)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set 0.557678 0.6987179 0.5576780 1.53970072 1.539701 2.741129
## ACF1 Theil's U
## Training set 0.7384102 NA
## Test set 0.4135712 1.993851
yrange = range(WorkHoursTS)
plot(c(1966,2000),yrange,type="n",xlab="Year",ylab="Hours Per Week",bty="l",xaxt="n",yaxt="n")
lines(WorkHoursTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
# The linear model
lines(train.lm$fitted,col="Green")
# The quadratic model
lines(train.poly.lm$fitted,col="Red")
# The legend
legend(1990,37.5,c("Hours PerWeek","Linear","Quadratic"),lty=c(1,1,1),col=c("black","red","blue"),bty="n")
P.4 \(\textit{Forecasting Department Store Sales}\)
b. Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).
setwd("/Users/yusufsultan/rWork")
StoreSales <- read.csv("DeptStoreSales.csv")
StoreSalesTS <- ts(StoreSales$Sales, start=c(1,1), freq=4)
#Partitioned data
validLength <- 4
trainLength <- length(StoreSalesTS) - validLength
trainTS <- window(StoreSalesTS, start= c(1,1), end= c(1, trainLength))
validTS <- window(StoreSalesTS, start= c(1, trainLength+1), end= c(1, trainLength+validLength))
#Fitted exponential trend with seasonality mode
salesExpo <- tslm(trainTS ~ trend + season, lambda=0)
summary(salesExpo)
##
## Call:
## tslm(formula = trainTS ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.053524 -0.013199 -0.004527 0.014387 0.062681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
## trend 0.011088 0.001295 8.561 3.70e-07 ***
## season2 0.024956 0.020764 1.202 0.248
## season3 0.165343 0.020884 7.917 9.79e-07 ***
## season4 0.433746 0.021084 20.572 2.10e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.63e+11 on 4 and 15 DF, p-value: < 2.2e-16
C.A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales?
ans: average sales higher because the coefficient for season2 is positive make the sales up every month
D.Use this model to forecast sales in quarters 21 and 22.
validPeriodForecasts <- forecast(salesExpo, h=validLength)
validPeriodForecasts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 6 Q1 58793.71 55790.19 61958.92 54090.84 63905.46
## 6 Q2 60951.51 57837.76 64232.89 56076.04 66250.87
## 6 Q3 70920.09 67297.09 74738.13 65247.24 77086.15
## 6 Q4 93788.66 88997.41 98837.86 86286.57 101943.01
The forecasts for Quarters 21 and 22 are 58793.71 and 60951.51.
yrange = range(StoreSalesTS)
plot(c(1,7),yrange,type="n",xlab="Year",ylab="Sales / thousands",bty="l",xaxt="n",yaxt="n")
lines(StoreSalesTS,bty="l")
axis(1,at=seq(1,24,1),labels=format(seq(1,24,1)))
axis(2,at=seq(45000,105000,10000),labels=format(seq(45,105,10)),las=1)
lines(salesExpo$fitted,col="red" ,lwd=2)
lines(validPeriodForecasts$mean,col="Blue",lty=2 ,lwd=2)
legend(1,105000,c("Sales","Exponential Trend+Season","Forecasts"),lty=c(1,1,2),col=c("black","red","blue"),bty="n")
accuracy(validPeriodForecasts, validTS)
## ME RMSE MAE MPE MAPE MASE
## Training set 30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set 5395.00854 6076.912 5395.009 6.62904618 6.629046 1.7311114
## ACF1 Theil's U
## Training set 0.5540818 NA
## Test set 0.2156416 0.4259668
MAPE = 6.63%
plot(salesExpo$residuals, type="o", bty="l" ,lwd=2)
abline(0,0,col="red",lwd=2)
Seasonality is not captured well and does not apper true there is no patterns in the residial plot. The regression model fits the data well. appears to be true, residuals are approximately centered at 0 The trend in the data is not captured well by the model. appears to be true,it show U-shaped in hte training period .
∙ Fit a quadratic trend model to the residuals (with QuarterQuarter and Quarter2Quarter2.)
∙ Fit a quadratic trend model to Sales (with QuarterQuarter and Quarter2Quarter2.) ans: Fit the quadratic trend model to Sales will be the best fit since we have U-shaped pattern in the training period residual plot.
A. Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?
the plots on Figure 6.14 page 140 from the textbook shows two components linear trend and monthly seasonality so 12 - 11 =11 dummy variables for monthly seasonality and 1 variable for the trend so it is a quadratic trend
B.Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.
setwd("/Users/yusufsultan/rWork")
SSales <- read.csv("SouvenirSales.csv",stringsAsFactors = FALSE)
str(SSales)
## 'data.frame': 84 obs. of 2 variables:
## $ Date : chr "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
## $ Sales: num 1665 2398 2841 3547 3753 ...
head(SSales)
## Date Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74
tail(SSales)
## Date Sales
## 79 1-Jul 26155.15
## 80 1-Aug 28586.52
## 81 1-Sep 30505.41
## 82 1-Oct 30821.33
## 83 1-Nov 46634.38
## 84 1-Dec 104660.67
SSalesTS <- ts(SSales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(SSalesTS) - validLength
SSalesTrain <- window(SSalesTS,end=c(1995,trainLength))
SSalesValid <- window(SSalesTS, start=c(1995,trainLength+1))
# fit linear trend with seasonality model to training set
modelA <- tslm(SSalesTrain ~ trend + season)
summary(modelA)
##
## Call:
## tslm(formula = SSalesTrain ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12592 -2359 -411 1940 33651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3065.55 2640.26 -1.161 0.25029
## trend 245.36 34.08 7.199 1.24e-09 ***
## season2 1119.38 3422.06 0.327 0.74474
## season3 4408.84 3422.56 1.288 0.20272
## season4 1462.57 3423.41 0.427 0.67077
## season5 1446.19 3424.60 0.422 0.67434
## season6 1867.98 3426.13 0.545 0.58766
## season7 2988.56 3427.99 0.872 0.38684
## season8 3227.58 3430.19 0.941 0.35058
## season9 3955.56 3432.73 1.152 0.25384
## season10 4821.66 3435.61 1.403 0.16573
## season11 11524.64 3438.82 3.351 0.00141 **
## season12 32469.55 3442.36 9.432 2.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared: 0.7903, Adjusted R-squared: 0.7476
## F-statistic: 18.53 on 12 and 59 DF, p-value: 9.435e-16
Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? December has the highest average sales which make senses based on the location(Australia) and season(Summer)
What does the trend coefficient of model A mean? both the reand and coefficient positive that make the sales go up monthly
C.Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.
#regression model with log(Sales)
modelB <- tslm(log(SSalesTrain) ~ trend + season)
summary(modelB)
##
## Call:
## tslm(formula = log(SSalesTrain) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9306
## F-statistic: 80.4 on 12 and 59 DF, p-value: < 2.2e-16
#exponential trend model with Sales
SSalesExpo <- tslm(SSalesTrain ~ trend + season,lambda=0)
summary(SSalesExpo)
##
## Call:
## tslm(formula = SSalesTrain ~ trend + season, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4529 -0.1163 0.0001 0.1005 0.3438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.646363 0.084120 90.898 < 2e-16 ***
## trend 0.021120 0.001086 19.449 < 2e-16 ***
## season2 0.282015 0.109028 2.587 0.012178 *
## season3 0.694998 0.109044 6.374 3.08e-08 ***
## season4 0.373873 0.109071 3.428 0.001115 **
## season5 0.421710 0.109109 3.865 0.000279 ***
## season6 0.447046 0.109158 4.095 0.000130 ***
## season7 0.583380 0.109217 5.341 1.55e-06 ***
## season8 0.546897 0.109287 5.004 5.37e-06 ***
## season9 0.635565 0.109368 5.811 2.65e-07 ***
## season10 0.729490 0.109460 6.664 9.98e-09 ***
## season11 1.200954 0.109562 10.961 7.38e-16 ***
## season12 1.952202 0.109675 17.800 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2e+10 on 12 and 59 DF, p-value: < 2.2e-16
i. Fitting a model only to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?
exponential trend
The estimated trend coefficient of model B = 0.021120 on the log-scale.It is means sales go up 2.1% per month
str(SSalesTS)
## Time-Series [1:84] from 1995 to 2002: 1665 2398 2841 3547 3753 ...
febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)
## (Intercept)
## 17062.99
the forecast for February 2002 is 17063 (Australian dollars).
# Model (A) Accuracy
modelAForecast <- forecast(modelA,h=validLength)
accuracy(modelAForecast$mean,SSalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924
# Model (B) Accuracy
modelBForecast <- forecast(modelB,h=validLength)
accuracy(exp(modelBForecast$mean),SSalesValid)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253
E.How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.
In descriptive modeling or time series analysis, trends, difference time series is modeled to determine its components in terms of seasonal trends, relation to external factors and like that, descriptive these can then used for decision making and policy formula different descriptive and predictive goals leads to differences in types of methods used and the modeling process itself.finally a predictive model is judged by its predictive accuracy rather than by its ability to provide correct casual explanation so the goal is predictive