Assignment5

Assignment5: Problems[2,4,5]

P.2 \(\textit{Analysis of Canadian Manufacturing Workers Work-Hours: }\)

The time series plot in the given figure describes the average annual number of weekly hours spent by Canadian manufacturing workers. Which of the following regression-based models would fit the series best?

Linear trend model
Linear trend model with seasonality
Quadratic trend model
Quadratic trend model with seasonality

ans:

Quadratic trend moded: After looking into time series, this series is yearly series and hence no seasonality but may be business cyclic in nature e.g. changing over every 4-5 years. At this time we don’t know how to handle business cyclic based time series and moreover this option is not avialble in the given options. We will consider Quadratic trend model since in different times of series it has increasing as well as decreasing trends e.g. between 1974 to 1988 it is decreasing and then it is increasing. RMSE/MAPE is also better for Quadratic trend model as seen below

library(forecast)

## Warning: package 'forecast' was built under R version 3.2.5

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.2.5

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: timeDate

## This is forecast 7.3

library(zoo)
setwd("/Users/yusufsultan/rWork")
WorkHours <- read.csv("CanadianWorkHours.csv",stringsAsFactors = FALSE)
str(WorkHours)

## 'data.frame':    35 obs. of  2 variables:
##  $ Year        : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
##  $ HoursPerWeek: num  37.2 37 37.4 37.5 37.7 37.7 37.4 37.2 37.3 37.2 ...

head(WorkHours)

##   Year HoursPerWeek
## 1 1966         37.2
## 2 1967         37.0
## 3 1968         37.4
## 4 1969         37.5
## 5 1970         37.7
## 6 1971         37.7

tail(WorkHours)

##    Year HoursPerWeek
## 30 1995         35.7
## 31 1996         35.7
## 32 1997         35.5
## 33 1998         35.6
## 34 1999         36.3
## 35 2000         36.5

WorkHoursTS <- ts(WorkHours$HoursPerWeek,start=c(1966,1),frequency=1)
yrange <- range(WorkHoursTS)
plot(c(1966, 2000), yrange, type="n", xlab="Year",  ylab="Hours Per Week", bty="l", xaxt="n", yaxt="n")
lines(WorkHoursTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)

#Data Partioning
totalRecords <- length(WorkHoursTS)
nValidationRecords <- 5
nTrainigRecords <- totalRecords - nValidationRecords
train.ts <- window(WorkHoursTS, start = c(1966,1), end = c(1966,nTrainigRecords))
valid.ts <- window(WorkHoursTS, start = c (1966,nTrainigRecords+1), end = c(1966,totalRecords))

#linear trend model
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValidationRecords, level = 0)
summary(train.lm)

## 
## Call:
## tslm(formula = train.ts ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95181 -0.31426  0.02328  0.28599  0.74808 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.668046   0.153308 245.702  < 2e-16 ***
## trend       -0.083315   0.008636  -9.648 2.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4094 on 28 degrees of freedom
## Multiple R-squared:  0.7687, Adjusted R-squared:  0.7605 
## F-statistic: 93.08 on 1 and 28 DF,  p-value: 2.111e-10

accuracy(train.lm.pred,valid.ts)

##                        ME      RMSE       MAE         MPE      MAPE
## Training set 9.473975e-16 0.3955153 0.3226642 -0.01191583 0.8913215
## Test set     1.001342e+00 1.1216734 1.0013422  2.77249855 2.7724986
##                  MASE      ACF1 Theil's U
## Training set 1.585977 0.7728777        NA
## Test set     4.921852 0.4331874  3.166203

#Quadratic trend model
train.poly.lm <- tslm(train.ts ~ trend + I(trend^2))
train.poly.lm.pred <- forecast(train.poly.lm, h = nValidationRecords, level = 0)
summary(train.poly.lm)

## 
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91631 -0.22458  0.04514  0.26864  0.54400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.982414   0.231481 164.084  < 2e-16 ***
## trend       -0.142259   0.034422  -4.133 0.000311 ***
## I(trend^2)   0.001901   0.001077   1.765 0.088905 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3948 on 27 degrees of freedom
## Multiple R-squared:  0.7927, Adjusted R-squared:  0.7773 
## F-statistic: 51.61 on 2 and 27 DF,  p-value: 5.958e-10

accuracy(train.poly.lm.pred,valid.ts)

##                    ME      RMSE       MAE         MPE     MAPE     MASE
## Training set 0.000000 0.3745042 0.3042046 -0.01072127 0.836695 1.495243
## Test set     0.557678 0.6987179 0.5576780  1.53970072 1.539701 2.741129
##                   ACF1 Theil's U
## Training set 0.7384102        NA
## Test set     0.4135712  1.993851

yrange = range(WorkHoursTS)
plot(c(1966,2000),yrange,type="n",xlab="Year",ylab="Hours Per Week",bty="l",xaxt="n",yaxt="n")
lines(WorkHoursTS,bty="l")
axis(1, at=seq(1965,2000,5), labels=format(seq(1965,2000,5)))
axis(2, at=seq(34.5,38.0,0.5), labels=format(seq(34.5,38.0,0.5)), las=2)
# The linear model
lines(train.lm$fitted,col="Green")
# The quadratic model
lines(train.poly.lm$fitted,col="Red")
# The legend
legend(1990,37.5,c("Hours PerWeek","Linear","Quadratic"),lty=c(1,1,1),col=c("black","red","blue"),bty="n")

P.4 \(\textit{Forecasting Department Store Sales}\)

The forecaster decided that there is an exponential trend in the series. In order to fit a regression-based model that accounts for this trend, which of the following operations must be performed (either manually or by a function in R)?

Take a logarithm of the Quarter index
Take a logarithm of sales
Take an exponent of sales
Take an exponent of Quarter index

ans: Take a logarithm of sales:To fit a regression-based model that accounts for exponential trend

b. Fit a regression model with an exponential trend and seasonality, using only the first 20 quarters as the training period (remember to first partition the series into training and validation periods).

setwd("/Users/yusufsultan/rWork")
StoreSales <- read.csv("DeptStoreSales.csv")
StoreSalesTS <- ts(StoreSales$Sales, start=c(1,1), freq=4)

#Partitioned data
validLength <- 4
trainLength <- length(StoreSalesTS) - validLength
trainTS <- window(StoreSalesTS, start= c(1,1), end= c(1, trainLength))
validTS <- window(StoreSalesTS, start= c(1, trainLength+1), end= c(1, trainLength+validLength))

#Fitted exponential trend with seasonality mode 
salesExpo <- tslm(trainTS ~ trend + season, lambda=0)
summary(salesExpo)

## 
## Call:
## tslm(formula = trainTS ~ trend + season, lambda = 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.053524 -0.013199 -0.004527  0.014387  0.062681 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.748945   0.018725 574.057  < 2e-16 ***
## trend        0.011088   0.001295   8.561 3.70e-07 ***
## season2      0.024956   0.020764   1.202    0.248    
## season3      0.165343   0.020884   7.917 9.79e-07 ***
## season4      0.433746   0.021084  20.572 2.10e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03277 on 15 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.63e+11 on 4 and 15 DF,  p-value: < 2.2e-16

C.A partial output is shown in Table 6.7. From the output, after adjusting for trend, are Q2 average sales higher, lower or approximately equal to the average Q1 sales?

ans: average sales higher because the coefficient for season2 is positive make the sales up every month

D.Use this model to forecast sales in quarters 21 and 22.

validPeriodForecasts <- forecast(salesExpo, h=validLength)
validPeriodForecasts

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 6 Q1       58793.71 55790.19 61958.92 54090.84  63905.46
## 6 Q2       60951.51 57837.76 64232.89 56076.04  66250.87
## 6 Q3       70920.09 67297.09 74738.13 65247.24  77086.15
## 6 Q4       93788.66 88997.41 98837.86 86286.57 101943.01

The forecasts for Quarters 21 and 22 are 58793.71 and 60951.51.

E.The plot shown in Figure 6.13 describe the fit (top) and forecast errors (bottom) from the regression model.

Recreate these plots.

yrange = range(StoreSalesTS)
plot(c(1,7),yrange,type="n",xlab="Year",ylab="Sales / thousands",bty="l",xaxt="n",yaxt="n")
lines(StoreSalesTS,bty="l")
axis(1,at=seq(1,24,1),labels=format(seq(1,24,1)))
axis(2,at=seq(45000,105000,10000),labels=format(seq(45,105,10)),las=1)
lines(salesExpo$fitted,col="red" ,lwd=2)
lines(validPeriodForecasts$mean,col="Blue",lty=2 ,lwd=2)
legend(1,105000,c("Sales","Exponential Trend+Season","Forecasts"),lty=c(1,1,2),col=c("black","red","blue"),bty="n")

accuracy(validPeriodForecasts, validTS)

##                      ME     RMSE      MAE         MPE     MAPE      MASE
## Training set   30.63141 1759.359 1388.794 -0.04013518 2.215018 0.4456262
## Test set     5395.00854 6076.912 5395.009  6.62904618 6.629046 1.7311114
##                   ACF1 Theil's U
## Training set 0.5540818        NA
## Test set     0.2156416 0.4259668

MAPE = 6.63%

plot(salesExpo$residuals, type="o", bty="l" ,lwd=2)
abline(0,0,col="red",lwd=2)

Based on these plots, what can you say your forecasts for quarters 21 and 22? Are they likely to over-forecast, under-forecast or be reasonably close to the real sales values?

ans : based on the plot above the residual plot shows that the model is under predicting and the forcasts for Q1 and Q2 under-forecastes

F.Looking at the residual plot, which of the following statements appear true?

Seasonality is not captured well and does not apper true there is no patterns in the residial plot. The regression model fits the data well. appears to be true, residuals are approximately centered at 0 The trend in the data is not captured well by the model. appears to be true,it show U-shaped in hte training period .

G.Which of the following solutions is adequate and a parsimonious solution for improving model fit?

∙ Fit a quadratic trend model to the residuals (with QuarterQuarter and Quarter2Quarter2.)

∙ Fit a quadratic trend model to Sales (with QuarterQuarter and Quarter2Quarter2.) ans: Fit the quadratic trend model to Sales will be the best fit since we have U-shaped pattern in the training period residual plot.

\(\textit{Souvenir Sales}\)

In 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation periods, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training period.

A. Based on the two time plots in Figure 6.14, which predictors should be included in the regression model? What is the total number of predictors in the model?

the plots on Figure 6.14 page 140 from the textbook shows two components linear trend and monthly seasonality so 12 - 11 =11 dummy variables for monthly seasonality and 1 variable for the trend so it is a quadratic trend

B.Run a regression model with Sales (in Australian dollars) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model A.

setwd("/Users/yusufsultan/rWork")
SSales <- read.csv("SouvenirSales.csv",stringsAsFactors = FALSE)
str(SSales)

## 'data.frame':    84 obs. of  2 variables:
##  $ Date : chr  "Jan-95" "Feb-95" "Mar-95" "Apr-95" ...
##  $ Sales: num  1665 2398 2841 3547 3753 ...

head(SSales)

##     Date   Sales
## 1 Jan-95 1664.81
## 2 Feb-95 2397.53
## 3 Mar-95 2840.71
## 4 Apr-95 3547.29
## 5 May-95 3752.96
## 6 Jun-95 3714.74

tail(SSales)

##     Date     Sales
## 79 1-Jul  26155.15
## 80 1-Aug  28586.52
## 81 1-Sep  30505.41
## 82 1-Oct  30821.33
## 83 1-Nov  46634.38
## 84 1-Dec 104660.67

SSalesTS <- ts(SSales$Sales,start=c(1995,1),frequency=12)
validLength <- 12
trainLength <- length(SSalesTS) - validLength
SSalesTrain <- window(SSalesTS,end=c(1995,trainLength))
SSalesValid <- window(SSalesTS, start=c(1995,trainLength+1))
# fit linear trend with seasonality model to training set
modelA <- tslm(SSalesTrain ~ trend + season)
summary(modelA)

## 
## Call:
## tslm(formula = SSalesTrain ~ trend + season)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12592  -2359   -411   1940  33651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3065.55    2640.26  -1.161  0.25029    
## trend         245.36      34.08   7.199 1.24e-09 ***
## season2      1119.38    3422.06   0.327  0.74474    
## season3      4408.84    3422.56   1.288  0.20272    
## season4      1462.57    3423.41   0.427  0.67077    
## season5      1446.19    3424.60   0.422  0.67434    
## season6      1867.98    3426.13   0.545  0.58766    
## season7      2988.56    3427.99   0.872  0.38684    
## season8      3227.58    3430.19   0.941  0.35058    
## season9      3955.56    3432.73   1.152  0.25384    
## season10     4821.66    3435.61   1.403  0.16573    
## season11    11524.64    3438.82   3.351  0.00141 ** 
## season12    32469.55    3442.36   9.432 2.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 59 degrees of freedom
## Multiple R-squared:  0.7903, Adjusted R-squared:  0.7476 
## F-statistic: 18.53 on 12 and 59 DF,  p-value: 9.435e-16

Examine the coefficients: Which month tends to have the highest average sales during the year? Why is this reasonable? December has the highest average sales which make senses based on the location(Australia) and season(Summer)
What does the trend coefficient of model A mean? both the reand and coefficient positive that make the sales go up monthly

C.Run a regression model with log(Sales) as the output variable and with a linear trend and monthly predictors. Remember to fit only the training period. Call this model B.

#regression model with log(Sales)
modelB <- tslm(log(SSalesTrain) ~ trend + season)
summary(modelB)

## 
## Call:
## tslm(formula = log(SSalesTrain) ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9306 
## F-statistic:  80.4 on 12 and 59 DF,  p-value: < 2.2e-16

#exponential trend model with Sales
SSalesExpo <- tslm(SSalesTrain ~ trend + season,lambda=0)
summary(SSalesExpo)

## 
## Call:
## tslm(formula = SSalesTrain ~ trend + season, lambda = 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4529 -0.1163  0.0001  0.1005  0.3438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.646363   0.084120  90.898  < 2e-16 ***
## trend       0.021120   0.001086  19.449  < 2e-16 ***
## season2     0.282015   0.109028   2.587 0.012178 *  
## season3     0.694998   0.109044   6.374 3.08e-08 ***
## season4     0.373873   0.109071   3.428 0.001115 ** 
## season5     0.421710   0.109109   3.865 0.000279 ***
## season6     0.447046   0.109158   4.095 0.000130 ***
## season7     0.583380   0.109217   5.341 1.55e-06 ***
## season8     0.546897   0.109287   5.004 5.37e-06 ***
## season9     0.635565   0.109368   5.811 2.65e-07 ***
## season10    0.729490   0.109460   6.664 9.98e-09 ***
## season11    1.200954   0.109562  10.961 7.38e-16 ***
## season12    1.952202   0.109675  17.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 59 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2e+10 on 12 and 59 DF,  p-value: < 2.2e-16

i. Fitting a model only to log(Sales) with a linear trend is equivalent to fitting a model to Sales (in dollars) with what type of trend?

exponential trend

ii. What does the estimated trend coefficient of model B mean?

The estimated trend coefficient of model B = 0.021120 on the log-scale.It is means sales go up 2.1% per month

iii. Use this model to forecast the sales in February 2002.

str(SSalesTS)

##  Time-Series [1:84] from 1995 to 2002: 1665 2398 2841 3547 3753 ...

febForecast <- modelB$coefficients["(Intercept)"] + modelB$coefficients["trend"]*86 + modelB$coefficients["season2"]
exp(febForecast)

## (Intercept) 
##    17062.99

the forecast for February 2002 is 17063 (Australian dollars).

D.Compare the two regression models (A and B) in terms of forecast performance. Which model is preferable for forecasting? Mention at least two reasons based on the information in the outputs.

# Model (A) Accuracy 
modelAForecast <- forecast(modelA,h=validLength)
accuracy(modelAForecast$mean,SSalesValid)

##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 8251.513 17451.55 10055.28 10.53397 26.66568 0.3206228 0.9075924

# Model (B) Accuracy 
modelBForecast <- forecast(modelB,h=validLength)
accuracy(exp(modelBForecast$mean),SSalesValid)

##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 4824.494 7101.444 5191.669 12.35943 15.5191 0.4245018 0.4610253

E.How would you model this data differently if the goal was understanding the different components of sales in the souvenir shop between 1995 and 2001? Mention two differences.

In descriptive modeling or time series analysis, trends, difference time series is modeled to determine its components in terms of seasonal trends, relation to external factors and like that, descriptive these can then used for decision making and policy formula different descriptive and predictive goals leads to differences in types of methods used and the modeling process itself.finally a predictive model is judged by its predictive accuracy rather than by its ability to provide correct casual explanation so the goal is predictive